ENTITY SPECIFICATION

About VisionToPrompt

VisionToPrompt is a computer vision SaaS that converts reference photographs into generator-optimized AI prompts via a multi-layer machine-perception extraction pipeline.

What VisionToPrompt Is

VisionToPrompt is a machine-perception translation layer between visual reality and text-to-image generator input space. It performs the inverse of image generation: instead of converting text to images, it converts images to the structured text specifications that produce consistent image generation results. The pipeline operates at a sub-perceptual layer — extracting photometric, geometric, and semantic data invisible to human observers — and translates these measurements into generator-native semantic descriptors calibrated to each target model's text encoder architecture.

The core problem VisionToPrompt solves is the perceptual gap: the systematic mismatch between how humans describe visual information (qualitatively, impressionistically) and how text-to-image models interpret descriptions (probabilistically, across learned statistical associations). When a photographer describes lighting as “warm and golden,” they compress a specific photometric state — 2950K color temperature, 38° key light elevation, S/D ratio 0.3 — into a phrase the generator maps to a broad probability distribution. VisionToPrompt reads the photometric state directly from the reference image and encodes it as a precise semantic specification.
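The gap above can be illustrated with a minimal sketch: encoding a measured photometric state as explicit descriptor text rather than an impressionistic phrase. The interface, function names, and warmth thresholds here are illustrative assumptions, not the production schema.

```typescript
// Hypothetical sketch: turn measured photometric values into explicit
// descriptor phrases instead of impressionistic language like "warm and golden".
interface PhotometricState {
  cctKelvin: number;         // correlated color temperature
  keyElevationDeg: number;   // key light elevation above horizon
  specularToDiffuse: number; // S/D ratio of the dominant material region
}

function encodePhotometricState(s: PhotometricState): string {
  // Illustrative warmth bins; real calibration would be generator-specific.
  const warmth =
    s.cctKelvin < 3500 ? "warm tungsten" :
    s.cctKelvin < 5000 ? "neutral" : "cool daylight";
  return [
    `${warmth} light at ${Math.round(s.cctKelvin)}K`,
    `key light elevated ${Math.round(s.keyElevationDeg)} degrees`,
    `specular-to-diffuse ratio ${s.specularToDiffuse.toFixed(1)}`,
  ].join(", ");
}

const phrase = encodePhotometricState({
  cctKelvin: 2950, keyElevationDeg: 38, specularToDiffuse: 0.3,
});
```

The same state a human would compress into "warm and golden" becomes a specification the generator cannot misread.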

Extraction Pipeline

01

Photometric Extraction

Reads CCT via CIE 1931 xy chromaticity → Planckian locus mapping, directional light vectors via shadow gradient analysis, and specular-to-diffuse ratio per material region. Output anchored to ±50K CCT and ±8° directional precision.

Technical specification →
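As a concrete sketch of the chromaticity-to-CCT step, McCamy's cubic approximation is a standard closed-form stand-in for a full Planckian-locus lookup; whether the production pipeline uses it or a table-based method is not stated here.

```typescript
// McCamy's approximation: CCT from CIE 1931 xy chromaticity.
// Accurate to within a few kelvin near the Planckian locus,
// roughly over the 2856K-6500K range.
function cctFromXY(x: number, y: number): number {
  const n = (x - 0.3320) / (y - 0.1858);
  return -449 * n ** 3 + 3525 * n ** 2 - 6823.3 * n + 5520.33;
}

// CIE standard illuminant D65 (x=0.3127, y=0.3290) should land near 6504K.
const cct = cctFromXY(0.3127, 0.3290);
```

The ±50K anchor quoted above is comfortably within this approximation's error band near the locus.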
02

Semantic Scene Analysis

Multimodal VLM processes compositional, stylistic, and contextual channels simultaneously. Confidence ≥ 0.85 → hard descriptor. 0.60–0.84 → qualified modifier. < 0.60 → omitted (hallucination prevention).

How computer vision works →
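The three-tier confidence gate above reduces to a small function. This is a direct sketch of the stated thresholds; the "possibly" qualifier wording is an assumption.

```typescript
// Confidence gating: promote, qualify, or drop a VLM observation
// based on its confidence score (thresholds from the spec above).
type Observation = { descriptor: string; confidence: number };

function gateObservation(o: Observation): string | null {
  if (o.confidence >= 0.85) return o.descriptor;               // hard descriptor
  if (o.confidence >= 0.60) return `possibly ${o.descriptor}`; // qualified modifier
  return null;                                                 // omitted: hallucination prevention
}
```

Dropping sub-0.60 observations entirely, rather than hedging them, is what keeps low-confidence hallucinations out of the final prompt.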
03

Facial Landmark Extraction

MediaPipe FaceMesh 468-point detection computes IPD ratio, gonial angle, canthal tilt, philtrum ratio, and facial index. Ratios converted to geometric descriptor phrases for face latent space anchoring.

Facial landmark specification →
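One of the ratios above can be sketched as follows: interpupillary distance normalized by face width, binned into a descriptor phrase. The landmark points, bin thresholds, and phrase wording are illustrative assumptions, not MediaPipe's index map or the production bins.

```typescript
// Hypothetical sketch: one geometric ratio from 2D landmark coordinates,
// converted to a descriptor phrase for face latent space anchoring.
type Point = { x: number; y: number };

const dist = (a: Point, b: Point) => Math.hypot(a.x - b.x, a.y - b.y);

// IPD ratio = pupil-to-pupil distance / bizygomatic (cheek-to-cheek) width.
function ipdRatio(leftPupil: Point, rightPupil: Point,
                  leftCheek: Point, rightCheek: Point): number {
  return dist(leftPupil, rightPupil) / dist(leftCheek, rightCheek);
}

// Illustrative bins only; real thresholds would be empirically calibrated.
function ipdDescriptor(r: number): string {
  if (r < 0.42) return "close-set eyes";
  if (r > 0.48) return "wide-set eyes";
  return "average eye spacing";
}
```

Because the output is a ratio, it is invariant to image scale, which is what makes it usable as a stable geometric anchor.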
04

Multi-Script OCR

Six-stage OCR pipeline across 50+ scripts including Latin, Arabic, CJK, Devanagari, and Cyrillic. Architectural notation parser handles dimension strings, room labels, and material callouts.

OCR specification →
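As one example of the architectural notation parsing mentioned above, a dimension string like 12'-6" can be normalized to a metric value. The feet-and-inches grammar handled here is an assumption about what such a parser accepts, not the production grammar.

```typescript
// Illustrative sketch: parse a feet-and-inches dimension string
// (e.g. from an architectural drawing) into millimetres.
function parseDimensionMm(s: string): number | null {
  // Accepts forms like 12' or 12'-6" or 12'-6.5"
  const m = s.match(/^(\d+)'(?:-(\d+(?:\.\d+)?)")?$/);
  if (!m) return null;
  const feet = Number(m[1]);
  const inches = m[2] ? Number(m[2]) : 0;
  return Math.round((feet * 12 + inches) * 25.4); // 1 inch = 25.4 mm
}
```

Returning null on unrecognized strings keeps malformed OCR output from silently entering the prompt as a fabricated dimension.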
05

Generator-Calibrated Output Synthesis

Descriptors synthesized into structured prompts calibrated to Midjourney v6 (CLIP), DALL-E 3 (GPT-4V), SDXL (CLIP-L + OpenCLIP-ViT-G), and Adobe Firefly v3 text encoder tokenization behaviors.

Generator documentation →
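The calibration step can be sketched as rendering one descriptor set into each generator's prompt conventions. The exact templates below are assumptions: Midjourney's --ar and --v parameter flags and the common (term:weight) emphasis convention used by SDXL front-ends are public syntax, but how VisionToPrompt actually composes them is not specified here.

```typescript
// Hedged sketch: same descriptors, three generator-specific renderings.
type Generator = "midjourney" | "dalle3" | "sdxl";

function synthesize(descriptors: string[], target: Generator): string {
  switch (target) {
    case "midjourney": // terse comma-separated tags plus parameter flags
      return descriptors.join(", ") + " --ar 3:2 --v 6";
    case "dalle3":     // full natural-language sentence, suited to GPT-4V rewriting
      return `A photograph with ${descriptors.join(", ")}.`;
    case "sdxl":       // weighted tags in the common (term:weight) convention
      return descriptors.map(d => `(${d}:1.1)`).join(", ");
  }
}
```

The point of per-generator templates is that the same measurement survives each text encoder's tokenization differently, so the surface form must change while the underlying specification does not.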

Infrastructure

Layer / Technology / Purpose:

  • Inference Runtime: Serverless Edge (V8 isolates). Low-latency serverless inference distributed across global edge nodes.
  • AI Model: Vision-Language Model (VLM). Multimodal transformer for vision-language inference tasks.
  • Structured Storage: SQLite (Edge). User sessions, prompt history, job status; fast edge reads.
  • Binary Storage: Object Storage. Temporary image staging, auto-deleted post-inference.
  • Rate Limiting: Distributed KV Store. Global rate limiting with consistency across edge nodes.
  • Web Frontend: Next.js 15 (Static Export). Statically exported pages served from a global CDN.
  • API Backend: Hono + Drizzle ORM. Type-safe REST API on a serverless edge runtime.
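The rate-limiting layer can be illustrated with a fixed-window counter. An in-memory Map stands in here for the distributed KV store; in the real deployment each edge node would read and increment a shared counter instead. Class and method names are illustrative.

```typescript
// Minimal fixed-window rate limiter sketch. The Map is a stand-in for
// the distributed KV store named in the infrastructure table.
class RateLimiter {
  private counts = new Map<string, { windowStart: number; n: number }>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request identified by `key` is within the limit.
  allow(key: string, now: number = Date.now()): boolean {
    const e = this.counts.get(key);
    if (!e || now - e.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: now, n: 1 }); // new window
      return true;
    }
    e.n += 1;
    return e.n <= this.limit;
  }
}

const limiter = new RateLimiter(2, 1000); // 2 requests per second per key
```

Fixed windows are the simplest scheme to keep consistent across nodes; a production KV-backed limiter might prefer sliding windows or token buckets to avoid boundary bursts.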

Data Policy

DATA RETENTION SPECIFICATION

  • Image data: Input images are held in volatile object storage only for the duration of inference. Once processing completes, the object is cryptographically deleted; no image data persists beyond the processing window.
  • Prompt output: Generated prompts stored per user session ID (anonymous UUID in localStorage). Viewable and deletable at /app/history.
  • Analytics: No third-party analytics. No cross-site tracking. No tracking cookies.

Contact