TECHNICAL REFERENCE v2.0 · Updated 18 March 2026

VisionToPrompt Documentation

Technical reference for extraction modes, confidence thresholds, generator calibration, input specifications, and API integration.

1. Extraction Modes

1.1 AI Prompt Mode

The primary extraction mode. Runs the full five-layer pipeline: photometric extraction, semantic scene analysis, facial landmark detection (for portraits), structural geometry analysis, and generator-calibrated output synthesis. The output is a structured natural-language prompt ready for direct use in the target generator. Processing time: < 2 seconds (p95) on Cloudflare's global edge network.

1.2 Describe Mode

Runs the semantic scene analysis layer only. Outputs a compositional natural-language description of the image — subject, environment, mood, composition, and notable visual elements — without photometric data or generator-specific formatting. Useful for image cataloguing, content moderation review, and accessibility description generation.

1.3 Extract Text (OCR) Mode

Runs the six-stage OCR pipeline. Supports 50+ scripts including Latin, Arabic, CJK (Chinese/Japanese/Korean), Devanagari, Cyrillic, Hebrew, Thai, and Georgian. Automatic language detection and mixed-script separation. Architectural annotation parsing available for blueprint documents.

2. Confidence Threshold Architecture

2.1 Threshold Levels

VisionToPrompt applies a two-threshold confidence scoring system to all extracted visual elements:

Confidence ≥ 0.85 → HARD DESCRIPTOR
Stated as definitive fact in the output prompt.
Example: "black leather jacket" / "3200K tungsten lighting"

Confidence 0.60–0.84 → QUALIFIED MODIFIER
Stated with probabilistic qualification.
Example: "possibly dark leather outerwear" / "warm-shifted lighting"

Confidence < 0.60 → OMITTED
Element excluded from the output entirely.
Rationale: prevents hallucination propagation into generation.
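A minimal TypeScript sketch of this mapping (the ExtractedElement shape and its field names are illustrative assumptions, not the actual internal schema):

  // Two-threshold descriptor mapping (illustrative sketch only).
  interface ExtractedElement {
    confidence: number;   // 0.0–1.0 score from the vision model
    hard: string;         // definitive phrasing, e.g. "black leather jacket"
    qualified: string;    // hedged phrasing, e.g. "possibly dark leather outerwear"
  }

  function toDescriptor(el: ExtractedElement): string | null {
    if (el.confidence >= 0.85) return el.hard;        // HARD DESCRIPTOR
    if (el.confidence >= 0.60) return el.qualified;   // QUALIFIED MODIFIER
    return null;                                      // OMITTED below 0.60
  }

Elements mapping to null are dropped before prompt synthesis, which is what keeps low-confidence guesses out of the generated prompt.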

2.2 Why This Architecture Matters

When a vision model perceives a visual element with low confidence, encoding it as a hard descriptor in the prompt causes the generator to produce that element definitively — even though the source image may not contain it. This hallucination propagation produces generations that diverge from the reference image in specific, hard-to-diagnose ways. The 0.60 omission threshold eliminates this failure mode at the cost of occasional under-specification, which is preferable to systematic over-specification.

3. Generator-Specific Calibration

3.1 Midjourney v6

Text encoder: CLIP. Token weighting via :: syntax. Responds well to comma-separated descriptor lists with style and quality suffixes.

Output format: [subject descriptors], [lighting descriptor], [style descriptor], [composition], [quality modifiers] --ar [ratio] --v 6
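A hypothetical prompt in this format (an illustration of the template, not actual tool output):

  woman in black leather jacket, 3200K tungsten lighting, film noir style, centered medium shot, high detail --ar 3:2 --v 6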

3.2 DALL-E 3 (GPT-4V encoder)

Text encoder: GPT-4V. Responds to longer natural-language descriptions. Hex codes ineffective — use Munsell-derived perceptual descriptors. Does not support parameter suffixes.

Output format: [Full natural language description with embedded lighting, color, composition, and style in flowing prose format. Perceptual color descriptors replace hex values.]
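A hypothetical output in this style, with perceptual color language in place of hex values:

  A woman in a black leather jacket stands under deep amber tungsten light; the warm glow and centered medium-shot framing give the scene a film noir mood, with soft shadow detail throughout.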

3.3 Stable Diffusion XL

Dual text encoders: CLIP-L + OpenCLIP-ViT-G. ControlNet conditioning available for geometry-constrained generation. Negative prompt support active.

Output format:
Positive: [descriptor list, comma-separated, quality tokens]
Negative: [bad anatomy, blurry, watermark, low quality, ...]
ControlNet: [mlsd / depth map if geometry source detected]
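A hypothetical calibrated output following this template:

  Positive: woman, black leather jacket, tungsten lighting, film noir, centered composition, high detail
  Negative: bad anatomy, blurry, watermark, low quality
  ControlNet: depth map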

3.4 Adobe Firefly v3

Text encoder architecture: proprietary transformer trained on licensed content. Like DALL-E 3, responds best to longer natural-language descriptions. Content credential metadata preserved.

Output format: [Natural language description optimized for licensed content generation. Avoids style references to living artists — uses movement/period descriptors instead.]
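A hypothetical output, using a movement/period descriptor rather than an artist reference:

  A woman in a black leather jacket beneath warm evening light, rendered in a mid-century noir style with centered composition and soft shadow detail.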

4. Input Specifications

4.1 Supported Formats

JPEG, PNG, WebP, GIF (first frame), AVIF. Maximum file size: 10 MB.

Recommended for photometric accuracy:
Format: WebP lossless OR JPEG quality ≥ 75
Resolution: source resolution (no downsampling)
Color space: sRGB (wide-gamut profiles converted on input)

Minimum for facial landmark extraction:
Face size: ≥ 128 × 128 pixels in the image
Yaw angle: ±30° from frontal
Pitch angle: ±20° from level
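A client-side pre-flight check mirroring these limits (a sketch; validateUpload is our own helper name, and server-side validation still applies):

  const MAX_BYTES = 10 * 1024 * 1024; // 10 MB limit from section 4.1
  const SUPPORTED = ["image/jpeg", "image/png", "image/webp", "image/gif", "image/avif"];

  function validateUpload(file: File): string | null {
    if (!SUPPORTED.includes(file.type)) return `Unsupported format: ${file.type}`;
    if (file.size > MAX_BYTES) return `File exceeds 10 MB (${file.size} bytes)`;
    return null; // passes pre-flight
  }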

4.2 Image Quality Impact on Accuracy

The five factors with greatest impact on extraction accuracy, ranked by weight:

1. Image sharpness / focus (35% impact weight)
2. Contrast between subject and background (28%)
3. Text alignment / skew angle (14%) [OCR mode]
4. Font type: standard vs. display (12%) [OCR mode]
5. Image resolution in DPI (7%)

Note: Resolution above 400 DPI produces minimal accuracy gains; sharpness and contrast are far more impactful.

5. API Reference

5.1 Process Image Endpoint

Submit an image for processing via the jobs API. Requires a user session ID (X-User-ID header).

POST /api/jobs
Content-Type: multipart/form-data
X-User-ID: {uuid}

Body:
  file: [image file]
  mode: "prompt" | "describe" | "text"

Response 201:
  { "jobId": "01HXYZ...", "status": "pending" }
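A minimal TypeScript upload example (the endpoint, fields, and header follow the specification above; error handling is abbreviated):

  async function createJob(file: File, userId: string): Promise<string> {
    const body = new FormData();
    body.append("file", file);
    body.append("mode", "prompt"); // or "describe" | "text"

    const res = await fetch("/api/jobs", {
      method: "POST",
      headers: { "X-User-ID": userId }, // fetch sets the multipart boundary itself
      body,
    });
    if (res.status !== 201) throw new Error(`Upload failed: ${res.status}`);
    const { jobId } = await res.json();
    return jobId;
  }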

5.2 Process Job Endpoint

Trigger AI processing on an uploaded job.

POST /api/jobs/{jobId}/process
Content-Type: application/json
X-User-ID: {uuid}

Body:
  {
    "options": {
      "mode": "prompt",
      "generator": "midjourney"   // optional target
    }
  }

Response 200:
  {
    "jobId": "01HXYZ...",
    "status": "completed",
    "result": "[extracted prompt text]",
    "model": "llava",
    "tokensUsed": 412,
    "processingMs": 1840
  }
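A matching TypeScript call to trigger processing (a sketch; the response fields are as documented above):

  async function processJob(jobId: string, userId: string) {
    const res = await fetch(`/api/jobs/${jobId}/process`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-User-ID": userId,
      },
      body: JSON.stringify({
        options: { mode: "prompt", generator: "midjourney" }, // generator is optional
      }),
    });
    if (!res.ok) throw new Error(`Processing failed: ${res.status}`);
    return res.json(); // { jobId, status, result, model, tokensUsed, processingMs }
  }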

5.3 Rate Limits

Free tier: 3 generations total (lifetime per device, tracked via a localStorage UUID); contact form limited to 2 submissions per IP per hour.

Pro tier: unlimited generations; up to 5 concurrent jobs.
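Free-tier tracking relies on a per-device localStorage UUID; a sketch of how a client might create and persist one (the storage key vtp_device_id is a hypothetical name):

  function getDeviceId(): string {
    const KEY = "vtp_device_id"; // hypothetical key name
    let id = localStorage.getItem(KEY);
    if (!id) {
      id = crypto.randomUUID(); // available in modern browsers
      localStorage.setItem(KEY, id);
    }
    return id;
  }

If the same identifier is used as the X-User-ID header, free-tier usage stays tied to the device.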
