TECHNICAL REFERENCE v2.0 · Updated 18 March 2026

VisionToPrompt Documentation

Technical reference for extraction modes, confidence thresholds, generator calibration, input specifications, and API integration.

1. Extraction Modes

1.1 AI Prompt Mode

The primary extraction mode. Runs the full five-layer pipeline: photometric extraction, semantic scene analysis, facial landmark detection (for portraits), structural geometry analysis, and generator-calibrated output synthesis. The output is a structured natural-language prompt ready for direct use in the target generator. Processing time: < 2 seconds (p95) on Cloudflare's global edge network.

1.2 Describe Mode

Runs the semantic scene analysis layer only. Outputs a compositional natural-language description of the image — subject, environment, mood, composition, and notable visual elements — without photometric data or generator-specific formatting. Useful for image cataloguing, content moderation review, and accessibility description generation.

1.3 Extract Text (OCR) Mode

Runs the six-stage OCR pipeline. Supports 50+ scripts including Latin, Arabic, CJK (Chinese/Japanese/Korean), Devanagari, Cyrillic, Hebrew, Thai, and Georgian. Automatic language detection and mixed-script separation. Architectural annotation parsing available for blueprint documents.

2. Confidence Threshold Architecture

2.1 Threshold Levels

VisionToPrompt applies a two-threshold confidence scoring system to all extracted visual elements:

Confidence ≥ 0.85 → HARD DESCRIPTOR
Stated as definitive fact in the output prompt.
Example: "black leather jacket" / "3200K tungsten lighting"

Confidence 0.60–0.84 → QUALIFIED MODIFIER
Stated with probabilistic qualification.
Example: "possibly dark leather outerwear" / "warm-shifted lighting"

Confidence < 0.60 → OMITTED
Element excluded from the output entirely.
Rationale: prevents hallucination propagation into generation.
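A minimal TypeScript sketch of this mapping (the ExtractedElement shape and its field names are illustrative assumptions, not the actual internal schema):

  // Two-threshold descriptor mapping (illustrative sketch only).
  interface ExtractedElement {
    confidence: number;   // 0.0–1.0 score from the vision model
    hard: string;         // definitive phrasing, e.g. "black leather jacket"
    qualified: string;    // hedged phrasing, e.g. "possibly dark leather outerwear"
  }

  function toDescriptor(el: ExtractedElement): string | null {
    if (el.confidence >= 0.85) return el.hard;        // HARD DESCRIPTOR
    if (el.confidence >= 0.60) return el.qualified;   // QUALIFIED MODIFIER
    return null;                                      // OMITTED below 0.60
  }

Elements mapping to null are dropped before prompt synthesis, which is what keeps low-confidence guesses out of the generated prompt.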

2.2 Why This Architecture Matters

When a vision model perceives a visual element with low confidence, encoding it as a hard descriptor in the prompt causes the generator to produce that element definitively — even though the source image may not contain it. This hallucination propagation produces generations that diverge from the reference image in specific, hard-to-diagnose ways. The 0.60 omission threshold eliminates this failure mode at the cost of occasional under-specification, which is preferable to systematic over-specification.

3. Generator-Specific Calibration

3.1 Midjourney v6

Text encoder: CLIP. Token weighting via :: syntax. Responds well to comma-separated descriptor lists with style and quality suffixes.

Output format: [subject descriptors], [lighting descriptor], [style descriptor], [composition], [quality modifiers] --ar [ratio] --v 6
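A hypothetical prompt in this format (an illustration of the template, not actual tool output):

  woman in black leather jacket, 3200K tungsten lighting, film noir style, centered medium shot, high detail --ar 3:2 --v 6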

3.2 DALL-E 3 (GPT-4V encoder)

Text encoder: GPT-4V. Responds to longer natural-language descriptions. Hex codes ineffective — use Munsell-derived perceptual descriptors. Does not support parameter suffixes.

Output format: [Full natural language description with embedded lighting, color, composition, and style in flowing prose format. Perceptual color descriptors replace hex values.]
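A hypothetical output in this style, with perceptual color language in place of hex values:

  A woman in a black leather jacket stands under deep amber tungsten light; the warm glow and centered medium-shot framing give the scene a film noir mood, with soft shadow detail throughout.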

3.3 Stable Diffusion XL

Dual text encoders: CLIP-L + OpenCLIP-ViT-G. ControlNet conditioning available for geometry-constrained generation. Negative prompt support active.

Output format:
Positive: [descriptor list, comma-separated, quality tokens]
Negative: [bad anatomy, blurry, watermark, low quality, ...]
ControlNet: [mlsd / depth map if geometry source detected]
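A hypothetical calibrated output following this template:

  Positive: woman, black leather jacket, tungsten lighting, film noir, centered composition, high detail
  Negative: bad anatomy, blurry, watermark, low quality
  ControlNet: depth map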

3.4 Adobe Firefly v3

Text encoder architecture: proprietary transformer trained on licensed content. Like DALL-E 3, responds best to longer natural-language descriptions. Content credential metadata preserved.

Output format: [Natural language description optimized for licensed content generation. Avoids style references to living artists — uses movement/period descriptors instead.]
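A hypothetical output, using a movement/period descriptor rather than an artist reference:

  A woman in a black leather jacket beneath warm evening light, rendered in a mid-century noir style with centered composition and soft shadow detail.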

4. Input Specifications

4.1 Supported Formats

JPEG, PNG, WebP, GIF (first frame), AVIF. Maximum file size: 10 MB.

Recommended for photometric accuracy:
Format: WebP lossless OR JPEG quality ≥ 75
Resolution: source resolution (no downsampling)
Color space: sRGB (wide-gamut profiles converted on input)

Minimum for facial landmark extraction:
Face size: ≥ 128 × 128 pixels in the image
Yaw angle: ±30° from frontal
Pitch angle: ±20° from level
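A client-side pre-flight check mirroring these limits (a sketch; validateUpload is our own helper name, and server-side validation still applies):

  const MAX_BYTES = 10 * 1024 * 1024; // 10 MB limit from section 4.1
  const SUPPORTED = ["image/jpeg", "image/png", "image/webp", "image/gif", "image/avif"];

  function validateUpload(file: File): string | null {
    if (!SUPPORTED.includes(file.type)) return `Unsupported format: ${file.type}`;
    if (file.size > MAX_BYTES) return `File exceeds 10 MB (${file.size} bytes)`;
    return null; // passes pre-flight
  }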

4.2 Image Quality Impact on Accuracy

The five factors with greatest impact on extraction accuracy, ranked by weight:

1. Image sharpness / focus (35% impact weight)
2. Contrast between subject and background (28%)
3. Text alignment / skew angle (14%) [OCR mode]
4. Font type: standard vs. display (12%) [OCR mode]
5. Image resolution in DPI (7%)

Note: Resolution above 400 DPI produces minimal accuracy gains; sharpness and contrast are far more impactful.

5. API Reference

5.1 Process Image Endpoint

Submit an image for processing via the jobs API. Requires a user session ID (X-User-ID header).

POST /api/jobs
Content-Type: multipart/form-data
X-User-ID: {uuid}

Body:
  file: [image file]
  mode: "prompt" | "describe" | "text"

Response 201:
  { "jobId": "01HXYZ...", "status": "pending" }
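A minimal TypeScript upload example (the endpoint, fields, and header follow the specification above; error handling is abbreviated):

  async function createJob(file: File, userId: string): Promise<string> {
    const body = new FormData();
    body.append("file", file);
    body.append("mode", "prompt"); // or "describe" | "text"

    const res = await fetch("/api/jobs", {
      method: "POST",
      headers: { "X-User-ID": userId }, // fetch sets the multipart boundary itself
      body,
    });
    if (res.status !== 201) throw new Error(`Upload failed: ${res.status}`);
    const { jobId } = await res.json();
    return jobId;
  }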

5.2 Process Job Endpoint

Trigger AI processing on an uploaded job.

POST /api/jobs/{jobId}/process
Content-Type: application/json
X-User-ID: {uuid}

Body:
  {
    "options": {
      "mode": "prompt",
      "generator": "midjourney"   // optional target
    }
  }

Response 200:
  {
    "jobId": "01HXYZ...",
    "status": "completed",
    "result": "[extracted prompt text]",
    "model": "llava",
    "tokensUsed": 412,
    "processingMs": 1840
  }
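A matching TypeScript call to trigger processing (a sketch; the response fields are as documented above):

  async function processJob(jobId: string, userId: string) {
    const res = await fetch(`/api/jobs/${jobId}/process`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-User-ID": userId,
      },
      body: JSON.stringify({
        options: { mode: "prompt", generator: "midjourney" }, // generator is optional
      }),
    });
    if (!res.ok) throw new Error(`Processing failed: ${res.status}`);
    return res.json(); // { jobId, status, result, model, tokensUsed, processingMs }
  }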

5.3 Rate Limits

Free tier: 3 generations total (lifetime per device, tracked via a localStorage UUID); contact form limited to 2 submissions per IP per hour.

Pro tier: unlimited generations; up to 5 concurrent jobs.
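Free-tier tracking relies on a per-device localStorage UUID; a sketch of how a client might create and persist one (the storage key vtp_device_id is a hypothetical name):

  function getDeviceId(): string {
    const KEY = "vtp_device_id"; // hypothetical key name
    let id = localStorage.getItem(KEY);
    if (!id) {
      id = crypto.randomUUID(); // available in modern browsers
      localStorage.setItem(KEY, id);
    }
    return id;
  }

If the same identifier is used as the X-User-ID header, free-tier usage stays tied to the device.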
