VisionToPrompt Documentation
Technical reference for extraction modes, confidence thresholds, generator calibration, input specifications, and API integration.
1. Extraction Modes
1.1 AI Prompt Mode
The primary extraction mode. Runs the full five-layer pipeline: photometric extraction, semantic scene analysis, facial landmark detection (for portraits), structural geometry analysis, and generator-calibrated output synthesis. The output is a structured natural-language prompt ready for direct use in the target generator. Processing time: < 2 seconds (p95) on the Cloudflare global edge network.
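For orientation, the five layers can be modeled as a simple union type. This is an illustrative sketch only; the layer identifiers are assumptions, not VisionToPrompt's actual internal schema.

```typescript
// Illustrative sketch: the five pipeline layers named above, modeled
// as a TypeScript union. Identifiers are assumptions for illustration.
type PipelineLayer =
  | "photometric-extraction"
  | "semantic-scene-analysis"
  | "facial-landmark-detection" // runs for portraits only
  | "structural-geometry-analysis"
  | "generator-calibrated-synthesis";

const AI_PROMPT_PIPELINE: PipelineLayer[] = [
  "photometric-extraction",
  "semantic-scene-analysis",
  "facial-landmark-detection",
  "structural-geometry-analysis",
  "generator-calibrated-synthesis",
];
```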
1.2 Describe Mode
Runs the semantic scene analysis layer only. Outputs a compositional natural-language description of the image — subject, environment, mood, composition, and notable visual elements — without photometric data or generator-specific formatting. Useful for image cataloguing, content moderation review, and accessibility description generation.
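A Describe-mode result covering the fields listed above might be shaped as follows; all field names are illustrative assumptions, not a documented response schema.

```typescript
// Hypothetical shape of a Describe-mode result, covering the elements
// the docs list (subject, environment, mood, composition, notable
// visual elements). Field names are assumptions for illustration.
interface DescribeResult {
  subject: string;
  environment: string;
  mood: string;
  composition: string;
  notableElements: string[];
  description: string; // the full compositional natural-language description
}
```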
1.3 Extract Text (OCR) Mode
Runs the six-stage OCR pipeline. Supports 50+ scripts including Latin, Arabic, CJK (Chinese/Japanese/Korean), Devanagari, Cyrillic, Hebrew, Thai, and Georgian. Automatic language detection and mixed-script separation. Architectural annotation parsing available for blueprint documents.
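A hypothetical OCR-mode result shape, showing how per-block script detection and confidence weighting could surface in the output. Field names are assumptions; only the capabilities they encode (mixed-script separation, language detection, blueprint annotations, confidence-weighted output) are documented above.

```typescript
// Hypothetical OCR result shape. Script values follow ISO 15924 codes;
// the confidence field reflects the confidence-weighted output the
// pipeline produces. All field names are illustrative assumptions.
interface OcrBlock {
  text: string;
  script: string;      // e.g. "Latn", "Arab", "Hani", "Deva", "Cyrl"
  language?: string;   // BCP 47 tag, when language detection succeeds
  confidence: number;  // 0..1
}

interface OcrResult {
  blocks: OcrBlock[];        // mixed scripts are separated per block
  annotations?: OcrBlock[];  // architectural annotations (blueprint documents)
}
```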
2. Confidence Threshold Architecture
2.1 Threshold Levels
VisionToPrompt applies a two-threshold confidence scoring system to all extracted visual elements. Elements scoring below the 0.60 omission threshold are dropped from the output entirely; elements scoring between the two thresholds are encoded with hedged, non-definitive phrasing; and elements above the upper threshold are encoded as hard descriptors. A sketch of this banding follows.
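A minimal sketch of the banding logic. The 0.60 omission threshold is documented in 2.2 below; the 0.85 upper threshold is a placeholder assumption, since the documentation does not state its value.

```typescript
// Sketch of the two-threshold banding described above. 0.60 is the
// documented omission threshold; 0.85 is an assumed placeholder for
// the hard-descriptor threshold, not a documented value.
const OMISSION_THRESHOLD = 0.6;
const HARD_DESCRIPTOR_THRESHOLD = 0.85; // assumption for illustration

type Encoding = "omit" | "hedged" | "hard";

function encodeElement(confidence: number): Encoding {
  if (confidence < OMISSION_THRESHOLD) return "omit";          // dropped from the prompt
  if (confidence < HARD_DESCRIPTOR_THRESHOLD) return "hedged"; // non-definitive phrasing
  return "hard";                                               // definitive descriptor
}
```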
2.2 Why This Architecture Matters
When a vision model perceives a visual element with low confidence, encoding it as a hard descriptor in the prompt causes the generator to produce that element definitively — even though the source image may not contain it. This hallucination propagation produces generations that diverge from the reference image in specific, hard-to-diagnose ways. The 0.60 omission threshold eliminates this failure mode at the cost of occasional under-specification, which is preferable to systematic over-specification.
3. Generator-Specific Calibration
3.1 Midjourney v6
Text encoder: CLIP. Token weighting via :: syntax. Responds well to comma-separated descriptor lists with style and quality suffixes.
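A hypothetical Midjourney-calibrated output in this format; the descriptors, weights, and parameter values are invented for illustration.

```typescript
// Hypothetical example of Midjourney v6-calibrated output: comma-
// separated descriptors, ::-weighted emphasis, and parameter suffixes.
// The prompt content itself is invented for illustration.
const midjourneyPrompt =
  "portrait of a woman::2, golden hour backlighting::1.5, " +
  "shallow depth of field, warm amber tones, editorial photography " +
  "--ar 3:2 --stylize 250";
```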
3.2 DALL-E 3 (GPT-4V encoder)
Text encoder: GPT-4V. Responds to longer natural-language descriptions. Hex codes ineffective — use Munsell-derived perceptual descriptors. Does not support parameter suffixes.
3.3 Stable Diffusion XL
Dual text encoders: CLIP-L + OpenCLIP-ViT-G. ControlNet conditioning available for geometry-constrained generation. Negative prompt support active.
3.4 Adobe Firefly v3
Text encoder architecture: proprietary transformer trained on licensed content. Like DALL-E 3, responds well to natural-language descriptions. Content credential metadata preserved.
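Sections 3.1 through 3.4 can be summarized as a capability table. The structure below is an illustrative sketch: field names are assumptions, each value restates a property documented above, and properties the documentation does not state are left unset.

```typescript
// Capability sketch summarizing sections 3.1-3.4. Field names are
// assumptions; values restate documented properties, and undocumented
// properties are simply omitted.
interface GeneratorProfile {
  textEncoder: string;
  tokenWeighting?: string;     // weighting syntax, if any
  parameterSuffixes?: boolean;
  negativePrompts?: boolean;
  hexColorsEffective?: boolean;
  notes?: string;
}

const GENERATORS: Record<string, GeneratorProfile> = {
  "midjourney-v6": {
    textEncoder: "CLIP",
    tokenWeighting: "::",
    notes: "comma-separated descriptor lists with style and quality suffixes",
  },
  "dall-e-3": {
    textEncoder: "GPT-4V",
    parameterSuffixes: false,
    hexColorsEffective: false, // use Munsell-derived perceptual descriptors
    notes: "responds to longer natural-language descriptions",
  },
  "stable-diffusion-xl": {
    textEncoder: "CLIP-L + OpenCLIP-ViT-G",
    negativePrompts: true,
    notes: "ControlNet conditioning for geometry-constrained generation",
  },
  "adobe-firefly-v3": {
    textEncoder: "proprietary transformer (licensed training data)",
    notes: "content credential metadata preserved",
  },
};
```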
4. Input Specifications
4.1 Supported Formats
JPEG, PNG, WebP, GIF (first frame), AVIF. Maximum file size: 10 MB.
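A client-side pre-check against these limits might look like the following; the MIME list and the 10 MB cap come from 4.1, while the function and constant names are illustrative.

```typescript
// Client-side validation against the documented input limits.
const SUPPORTED_MIME_TYPES = new Set([
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/gif", // first frame only
  "image/avif",
]);
const MAX_FILE_SIZE = 10 * 1024 * 1024; // 10 MB

function validateImage(file: File): string | null {
  if (!SUPPORTED_MIME_TYPES.has(file.type)) {
    return `unsupported format: ${file.type}`;
  }
  if (file.size > MAX_FILE_SIZE) {
    return "file exceeds the 10 MB limit";
  }
  return null; // valid
}
```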
4.2 Image Quality Impact on Accuracy
The five factors with the greatest impact on extraction accuracy, ranked by weight:
5. API Reference
5.1 Process Image Endpoint
Submit an image for processing via the jobs API. Requires a user session ID (X-User-ID header).
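A hypothetical client call. The /api/jobs route and response shape are assumptions; only the X-User-ID header requirement is documented here.

```typescript
// Hypothetical upload call to the jobs API. The route and response
// shape are assumptions; the X-User-ID header is documented.
async function submitImage(
  file: File,
  userId: string
): Promise<{ jobId: string }> {
  const body = new FormData();
  body.append("image", file);

  const res = await fetch("/api/jobs", {
    method: "POST",
    headers: { "X-User-ID": userId }, // required user session ID
    body,
  });
  if (!res.ok) throw new Error(`upload failed: ${res.status}`);
  return res.json(); // assumed to return a job identifier
}
```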
5.2 Process Job Endpoint
Trigger AI processing on an uploaded job.
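A matching hypothetical trigger call; the route is an assumption, since the documentation states only that processing is triggered on an uploaded job.

```typescript
// Hypothetical call to trigger AI processing on an uploaded job.
// The route is an assumption for illustration.
async function processJob(jobId: string, userId: string): Promise<void> {
  const res = await fetch(`/api/jobs/${jobId}/process`, {
    method: "POST",
    headers: { "X-User-ID": userId },
  });
  if (!res.ok) throw new Error(`processing failed: ${res.status}`);
}
```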
5.3 Rate Limits
Free tier: 3 generations per device (tracked via localStorage UUID). Contact form: 2 submissions per IP per hour. Pro tier: unlimited generations.
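A sketch of the documented device-tracking mechanism: a UUID persisted in localStorage identifies the device for the free-tier quota. The storage key name is an assumption, as it is not specified here.

```typescript
// Free-tier device tracking as documented: a UUID persisted in
// localStorage. The storage key name is an assumed placeholder.
function getDeviceId(): string {
  const KEY = "vtp-device-id"; // assumed key name
  let id = localStorage.getItem(KEY);
  if (!id) {
    id = crypto.randomUUID();
    localStorage.setItem(KEY, id);
  }
  return id;
}
```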
6. Deep-Dive Technical Articles
Photometric Extraction: color temperature, light vectors, and specular ratio, from reference photo to Midjourney descriptor.
Facial Landmark Ratios: IPD, gonial angle, and canthal tilt as geometric anchors for consistent character generation.
Color Descriptor Synthesis: why hex codes fail in DALL-E 3 and how Munsell mapping produces consistent color generation.
Blueprint → ControlNet: dual-pipeline blueprint processing combining OCR annotation extraction with MLSD geometry detection.
OCR Pipeline: the six-stage OCR architecture for 50+ scripts with confidence-weighted output.
Computer Vision Explained: how the vision-language model processes and understands image content.