TECHNICAL SPECIFICATION · 18 March 2026 · 12 min read · Proficiency: Expert

How to Maintain Lighting Consistency in Midjourney Using Product Reference Photos

A machine-perception approach to photometric data extraction and generator-compatible descriptor synthesis.

DEFINITION BLOCK

Photometric extraction is the machine-perception process of reading quantitative light measurements — color temperature (expressed in Kelvin), illuminance direction vectors (expressed in degrees from subject axis), and specular-to-diffuse intensity ratios — directly from reference image pixel data without human interpretation. In the context of AI image generation, VisionToPrompt's extraction pipeline converts these photometric values into generator-native semantic descriptors: structured natural-language tokens that Midjourney's text encoder maps to consistent photometric space across multiple generations. This process operates at a sub-perceptual layer — extracting data invisible to human observers — and is the foundational mechanism by which cross-generation lighting consistency is achieved without manual recalibration between prompts.

The Perceptual Gap: Why Manual Lighting Descriptions Fail

When a product photographer manually describes lighting in a Midjourney prompt — "warm golden light from the left" — they are operating in perceptual space: qualitative, observer-dependent, and dimensionally collapsed. Midjourney's text encoder, however, maps language to a photometric embedding space trained on millions of image-caption pairs where lighting is implicitly quantitative.

The result is a systematic mismatch. "Warm golden light" maps to a probability distribution spanning color temperatures from approximately 2700K to 4500K, angles from 20° to 90°, and intensity ratios across the full specular-diffuse spectrum. Each generation samples independently from this distribution. This is why consecutive Midjourney generations from the same manual lighting description produce images where the light source appears to shift — the prompt is not anchoring a specific photometric state; it is sampling from a broad probabilistic neighborhood.

The solution is not better descriptive writing. The solution is measurement.

VisionToPrompt's Photometric Extraction Pipeline

When a product reference photograph is submitted to VisionToPrompt, the vision model does not "look at" the image in the way a human does. It processes the image across three independent extraction layers operating in parallel:

Layer 1: Color Temperature Extraction

The pipeline samples the image's highlight regions (pixels above 90th percentile luminance) and computes their chromaticity coordinates in the CIE 1931 xy color space. These coordinates are then mapped to the Planckian locus — the mathematical curve describing blackbody radiation — to derive a correlated color temperature (CCT) in Kelvin.

A product photo taken under tungsten studio lighting will typically yield CCT values of 2800K–3200K; daylight-balanced studio strobes produce 5500K–5600K. The extracted value is then converted into a semantic descriptor that Midjourney's text encoder maps with high confidence to the intended photometric state:

# Extracted CCT → Semantic Descriptor Mapping

2700K–3000K → "warm tungsten incandescent, amber cast, low-noon angle"

3000K–3500K → "tungsten-halogen fill, warm white, studio continuous"

4000K–4500K → "cool white fluorescent, neutral cast, diffused overhead"

5500K–5600K → "daylight-balanced strobe, neutral white, crisp shadow edge"

6000K–7000K → "overcast skylight, slightly blue-shifted, soft wrap"
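
To make the Layer 1 mechanics concrete, here is a minimal Python sketch, assuming an sRGB reference image and substituting McCamy's CCT approximation for a full Planckian-locus fit; the function names, the highlight percentile, and the band boundaries are illustrative, not VisionToPrompt's internal API.

```python
import numpy as np
from PIL import Image

def estimate_cct(path, highlight_pct=90):
    """Approximate CCT from highlight chromaticity (illustrative sketch).

    Linearizes sRGB, keeps pixels above the given luminance percentile,
    converts them to CIE 1931 xy, and applies McCamy's approximation in
    place of a true Planckian-locus fit.
    """
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0

    # Undo the sRGB transfer curve to get linear RGB.
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)

    # Linear sRGB -> CIE XYZ (D65 matrix).
    m = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = lin.reshape(-1, 3) @ m.T

    # Highlight sampling: keep only the brightest pixels by luminance (Y).
    y = xyz[:, 1]
    highlights = xyz[y >= np.percentile(y, highlight_pct)]
    X, Y, Z = highlights.sum(axis=0)

    # Chromaticity coordinates, then McCamy's CCT approximation.
    x, yc = X / (X + Y + Z), Y / (X + Y + Z)
    n = (x - 0.3320) / (0.1858 - yc)
    return 449.0 * n**3 + 3525.0 * n**2 + 6823.3 * n + 5520.33

def cct_descriptor(cct):
    """Map an extracted CCT to the semantic bands listed above."""
    bands = [
        (3000, "warm tungsten incandescent, amber cast, low-noon angle"),
        (3500, "tungsten-halogen fill, warm white, studio continuous"),
        (4500, "cool white fluorescent, neutral cast, diffused overhead"),
        (5600, "daylight-balanced strobe, neutral white, crisp shadow edge"),
        (7000, "overcast skylight, slightly blue-shifted, soft wrap"),
    ]
    for upper, text in bands:
        if cct <= upper:
            return text
    return bands[-1][1]  # values beyond the table fall back to the last band
```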

Layer 2: Directional Light Vector Analysis

Shadow geometry encodes precise directional information. The pipeline detects shadow edges using a gradient-based method and computes the angle of the shadow vector relative to the subject's vertical axis. This yields two values: azimuthal angle (horizontal position of light source, 0°–360° relative to camera axis) and elevation angle (vertical position, 0° = horizon, 90° = directly overhead).

A product shot with a classic 45°/45° Rembrandt-adjacent setup — key light at 45° camera-left, elevated 45° above subject — is encoded as: key_light_azimuth:315°, key_light_elevation:45°, which translates to the semantic descriptor: "45-degree key light camera-left, elevated 45 degrees, hard shadow right-side".
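
A minimal sketch of the shadow-geometry arithmetic, assuming a single hard shadow whose direction and length have already been measured in pixels; projecting the in-image angle onto the 0°–360° camera-relative convention used above requires the ground-plane geometry, so that step is simplified and labeled in the comments.

```python
import math

def light_angles_from_shadow(shadow_dx, shadow_dy,
                             subject_height_px, shadow_length_px):
    """Estimate light direction from shadow geometry (illustrative sketch).

    Assumes (shadow_dx, shadow_dy) points from the subject's base toward
    the shadow tip in image coordinates (x right, y down). The azimuth
    returned is the in-image angle of the light, measured clockwise from
    the subject's vertical axis, not a full camera-relative projection.
    """
    # The light lies opposite the shadow in the image plane.
    light_dx, light_dy = -shadow_dx, -shadow_dy
    azimuth_in_image = math.degrees(math.atan2(light_dx, -light_dy)) % 360.0

    # Elevation from the shadow-length ratio: tan(elevation) = height / length.
    elevation = math.degrees(math.atan2(subject_height_px, shadow_length_px))
    return azimuth_in_image, elevation

# Shadow cast toward camera-right, about as long as the subject is tall:
# light comes from camera-left, elevated roughly 45 degrees.
az, el = light_angles_from_shadow(120, 40, 300, 310)
print(f"azimuth ≈ {az:.0f}°, elevation ≈ {el:.0f}°")
```

The elevation term is the standard shadow-length relation: a short shadow implies a high light source, a long shadow a low one.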

Layer 3: Specular-to-Diffuse Ratio Calculation

The ratio of specular highlights to diffuse fill determines the apparent "hardness" of a light source — a fundamental variable in product photography that generic prompts almost never encode correctly. VisionToPrompt measures the luminance of specular hotspots relative to the average surface luminance of the same material region, producing a dimensionless ratio (S/D ratio).

S/D Ratio < 0.3 → "broad softbox, wrapped diffuse light, minimal specular"

S/D Ratio 0.3–0.6 → "medium modifier, balanced specular, slight sheen"

S/D Ratio 0.6–0.9 → "fresnel lens modifier, defined specular highlight"

S/D Ratio > 0.9 → "bare bulb or direct sun, harsh specular, crisp shadows"
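
A minimal sketch of an S/D measurement, assuming a single-material crop; the exact normalization VisionToPrompt uses is not published, so the specular-excess formula below is an assumption chosen to yield values comparable to the bands above.

```python
import numpy as np
from PIL import Image

def specular_diffuse_ratio(path, region_box, hotspot_pct=99):
    """Estimate a specular-to-diffuse (S/D) ratio for one material region.

    Illustrative sketch: S/D is taken as the specular excess of the
    brightest pixels over the region's mean luminance, normalized by the
    hotspot luminance, giving a value in [0, 1). `region_box` is a
    (left, top, right, bottom) crop covering a single material.
    """
    region = Image.open(path).convert("RGB").crop(region_box)
    rgb = np.asarray(region, dtype=np.float64) / 255.0

    # Relative luminance (Rec. 709 weights on gamma-encoded values is a
    # simplification; a production pipeline would linearize first).
    lum = 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]

    hotspot = np.percentile(lum, hotspot_pct)   # specular highlight level
    diffuse = lum.mean()                        # average surface luminance
    if hotspot <= 0:
        return 0.0
    return float((hotspot - diffuse) / hotspot)

def sd_descriptor(ratio):
    """Map an S/D ratio to the modifier descriptors listed above."""
    if ratio < 0.3:
        return "broad softbox, wrapped diffuse light, minimal specular"
    if ratio < 0.6:
        return "medium modifier, balanced specular, slight sheen"
    if ratio < 0.9:
        return "fresnel lens modifier, defined specular highlight"
    return "bare bulb or direct sun, harsh specular, crisp shadows"
```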

Synthesized Prompt Output: From Photometric Data to Midjourney Descriptor

The three extracted values are merged into a single structured lighting descriptor that is injected into the Midjourney prompt as a coherent semantic unit — not a list of disconnected terms, but a grammatically structured lighting specification:

EXAMPLE: Luxury Watch Product Photo → Extracted Midjourney Lighting Descriptor

Input: Product reference photo (luxury watch, studio shoot)

Extracted Values:

CCT: 5580K → daylight-balanced strobe

Key Light: azimuth 330°, elevation 38° → camera-left, moderately elevated

S/D Ratio: 0.74 → defined specular, fresnel-quality modifier

Synthesized Descriptor:

"daylight-balanced strobe lighting, camera-left key at 38-degree elevation, defined specular highlights with fresnel quality, crisp shadow falloff right side, neutral white light cast, product photography studio setup"

This descriptor, when injected into consecutive Midjourney prompts for the same product, produces generations where the light source position, color cast, and specular behavior stay within roughly ±5% photometric variance of the reference image — a consistency level that manual descriptive prompting does not achieve.
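
A sketch of how the merge step could be implemented, reusing the extracted values from the watch example; the phrase templates and the azimuth-to-side rule are assumptions, not the product's actual descriptor grammar.

```python
def synthesize_lighting_descriptor(cct_phrase, azimuth_deg, elevation_deg,
                                   sd_phrase):
    """Merge the three extracted layers into one lighting descriptor.

    Illustrative sketch: the templates below are assumptions. Azimuth
    follows the article's convention, where values between 180° and 360°
    place the key light camera-left.
    """
    side = "camera-left" if 180 < azimuth_deg < 360 else "camera-right"
    shadow_side = "right side" if side == "camera-left" else "left side"
    return (
        f"{cct_phrase} lighting, {side} key at {elevation_deg:.0f}-degree "
        f"elevation, {sd_phrase}, crisp shadow falloff {shadow_side}, "
        f"product photography studio setup"
    )

# Values from the luxury-watch example above.
descriptor = synthesize_lighting_descriptor(
    cct_phrase="daylight-balanced strobe",
    azimuth_deg=330,
    elevation_deg=38,
    sd_phrase="defined specular highlights with fresnel quality",
)
print(descriptor)
```

The printed string mirrors the descriptor shown above, minus the color-cast clause, which is omitted here for brevity.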

Manual Prompting vs. VisionToPrompt Photometric Extraction

Variable | Manual Prompting | VisionToPrompt Extraction
Color Temperature | "Warm golden light" (2700K–4500K range, unanchored) | CCT extracted to ±50K precision, mapped to a named semantic descriptor
Light Direction | "From the left" (azimuth variance ≈ ±60°) | Azimuth computed from shadow geometry to ±8° precision
Light Elevation | Rarely specified; model assumes mid-elevation | Elevation angle extracted from the shadow length ratio
Specular Quality | "Soft" or "hard" (binary, imprecise) | S/D ratio computed per material region, mapped to a modifier type
Cross-Generation Consistency | Each generation re-samples from a broad probability distribution | Anchored descriptor reduces photometric variance to ±5%
Time to Descriptor | 2–5 minutes of manual writing | Under 4 seconds of automated extraction
Human Expertise Required | Advanced photography and prompt-engineering knowledge | None; extracted automatically from the reference image
EXIF Data Utilization | Not utilized | White balance tag cross-referenced to validate CCT extraction (see the EXIF sketch below)
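
On the table's final row: because the EXIF WhiteBalance tag only records auto versus manual, a practical cross-check typically leans on the related LightSource tag instead. The sketch below shows what such a validation could look like with Pillow; the nominal CCT ranges are assumptions rather than documented thresholds.

```python
from PIL import Image

# Nominal CCT ranges for common EXIF LightSource codes (assumed ranges).
LIGHTSOURCE_CCT = {
    1: (5000, 6500),   # Daylight
    2: (3800, 4600),   # Fluorescent
    3: (2700, 3300),   # Tungsten (incandescent)
    4: (5400, 6000),   # Flash
    10: (6000, 7000),  # Cloudy weather
}

def exif_cct_check(path, extracted_cct, tolerance_k=300):
    """Cross-check an extracted CCT against the photo's EXIF LightSource tag.

    Illustrative sketch: returns True when the tag is missing, unknown, or
    consistent with the extracted value, False when they clearly disagree.
    """
    exif = Image.open(path).getexif()
    exif_ifd = exif.get_ifd(0x8769)          # Exif sub-IFD
    light_source = exif_ifd.get(0x9208)      # LightSource tag

    if light_source not in LIGHTSOURCE_CCT:
        return True                          # nothing to validate against
    lo, hi = LIGHTSOURCE_CCT[light_source]
    return (lo - tolerance_k) <= extracted_cct <= (hi + tolerance_k)
```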

Implementation: Using VisionToPrompt for Lighting-Consistent Product Series

The following workflow produces a photometrically anchored prompt suitable for generating a consistent product image series in Midjourney v6:

  1. Submit your reference photograph. Upload your approved hero product shot — the image whose lighting you want to replicate — to VisionToPrompt. Ensure the image is at full resolution; downsampled images reduce highlight sampling accuracy in Layer 1 extraction.
  2. Select "Prompt" extraction mode. This activates the full three-layer photometric pipeline. The "Text" (OCR) mode extracts written content only and does not perform photometric analysis.
  3. Copy the synthesized lighting descriptor. The output will contain a structured lighting section enclosed in the extraction result. This descriptor is designed to be injected directly into your Midjourney prompt as a semantic block — do not paraphrase it, as rephrasing reintroduces the perceptual gap.
  4. Use consistent descriptor injection across all series prompts. For a 10-image product series, the same extracted lighting descriptor block should appear verbatim in all 10 prompts. Subject, composition, and angle can vary; the lighting descriptor remains constant.
  5. Validate with the --cref parameter. In Midjourney v6, append your original reference image URL as a --cref value alongside the extracted descriptor. The photometric descriptor anchors the text embedding while --cref anchors the visual style — these two inputs operate on separate channels and compound their consistency effects rather than conflicting. A minimal prompt-assembly sketch follows this list.
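
A minimal sketch of steps 4 and 5, assuming the watch descriptor from earlier; the subject lines, the reference URL, and the use of Python for assembly are all placeholders. Paste your own extracted block verbatim and host your own reference image.

```python
# Assembling a lighting-consistent prompt series (illustrative sketch).
LIGHTING_DESCRIPTOR = (
    "daylight-balanced strobe lighting, camera-left key at 38-degree "
    "elevation, defined specular highlights with fresnel quality, crisp "
    "shadow falloff right side, neutral white light cast, "
    "product photography studio setup"
)
CREF_URL = "https://example.com/reference/watch-hero.jpg"  # hypothetical URL

subjects = [
    "luxury chronograph watch on brushed steel pedestal, 3/4 view",
    "luxury chronograph watch flat lay on slate, top-down composition",
    "luxury chronograph watch macro on crown and bezel detail",
]

# Subject and composition vary; the lighting block is injected verbatim
# and the reference image rides along as --cref.
prompts = [
    f"{subject}, {LIGHTING_DESCRIPTOR} --cref {CREF_URL} --v 6"
    for subject in subjects
]

for p in prompts:
    print(p)
```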

TECHNICAL LIMITATIONS

The following conditions reduce photometric extraction accuracy and should be understood before relying on this workflow in production:

  • Multi-source lighting environments: When a product shot uses three or more distinct light sources with overlapping coverage areas, the shadow geometry analysis cannot cleanly isolate individual source vectors. The pipeline outputs a composite descriptor that represents the dominant light, but secondary sources are underrepresented. Studio setups with more than one key-equivalent source are best submitted as separate single-source crops.
  • Highly reflective or transparent materials: Products such as glass bottles, polished metals, or crystal objects exhibit complex light interactions (caustics, internal reflection, subsurface scattering) that confound the S/D ratio calculation. The extracted S/D ratio for these materials carries ±0.2 uncertainty versus ±0.05 for matte surfaces.
  • Compressed or resized images: JPEG compression at quality settings below 75 introduces chroma subsampling artifacts in highlight regions, reducing CCT extraction accuracy. Images should be submitted at source resolution with minimal compression. WebP format is preferred for its lossless compression mode.
  • Midjourney model version variance: The semantic descriptors generated by this pipeline are calibrated to Midjourney v6's text encoder behavior. Midjourney v5 and Niji mode use different encoder architectures with different token-to-photometric-state mappings. Descriptor output for non-v6 models may require manual adjustment, particularly of the elevation-angle phrasing.
  • Environmental ambient contamination: Product photos taken in naturally lit spaces (near windows, outdoors) contain ambient environmental light that modifies the apparent CCT in a spatially non-uniform way. The extractor reads the global highlight CCT and will not account for localized ambient contamination. These images yield descriptors with higher generative variance than controlled studio shots.

Frequently Asked Questions

How does VisionToPrompt maintain lighting consistency across Midjourney generations?

VisionToPrompt extracts photometric data — color temperature in Kelvin, directional light angle in degrees, and specular intensity ratio — from reference photos using computer vision. These values are converted into Midjourney v6-compatible lighting descriptors that produce consistent lighting across multiple generations without manual adjustment between prompts.

Why does manually describing lighting fail in Midjourney reference workflows?

Human observers perceive lighting qualitatively ("warm," "soft") while Midjourney's text encoder maps to quantitative photometric space. This perceptual gap causes inconsistency because each generation independently re-samples from a broad probability distribution rather than anchoring to a specific photometric state. VisionToPrompt bridges this gap by extracting machine-readable photometric values directly from reference image pixel data.

What is photometric extraction in the context of AI prompt engineering?

Photometric extraction is the process of reading quantitative light measurements — color temperature (Kelvin), illuminance direction vectors, and specular-to-diffuse ratios — from image pixel data. These values are then translated into semantic descriptors that text-to-image models interpret with high-confidence photometric precision, rather than the loose probabilistic mappings produced by natural language descriptions.

Extract Photometric Data from Your Reference Photos

Upload your product reference photograph and receive a generator-ready lighting descriptor in under 4 seconds.

Try Photometric Extraction Free →

3 free extractions · No account required
