TECHNICAL SPECIFICATION18 March 2026 · 15 min read · Proficiency: Expert

Facial Landmark Ratios: Generating Consistent Character Faces Across AI Image Generations

A machine-perception approach to geometric face anchoring via MediaPipe FaceMesh landmark extraction and ratio-to-descriptor synthesis.

DEFINITION BLOCK

Facial landmark ratio extraction is the machine-perception process of computing dimensionless geometric relationships between anatomical facial keypoints — interpupillary distance as a percentage of face width, gonial angle at the mandible, canthal tilt elevation, philtrum-to-midface ratio, and facial width-to-height index — from a reference photograph using a 468-point FaceMesh detection model operating in normalized face coordinate space. These ratios identify a specific geometric configuration in a text-to-image model's face generation latent space rather than a broad probability distribution, because they are quantitative (IPD: 46.2%, gonial angle: 118°) rather than qualitative (“wide-set eyes,” “strong jaw”) and therefore anchor the sampling process to a narrow geometric neighborhood. VisionToPrompt performs this extraction automatically from any near-frontal reference portrait, synthesizing generator-compatible geometric descriptor phrases that — when reused verbatim across multiple generation prompts — reduce cross-generation facial geometry variance by approximately 65% versus adjective-based character descriptions.

Why Character Faces Change Between Generations

AI image generators are probabilistic systems. Every generation is an independent sampling event from a learned probability distribution. When you write “a woman with brown eyes, high cheekbones, and a sharp jaw,” the model does not store this description and recall it — it samples a new face from the region of its learned distribution that matches those qualitative descriptors at each generation.

The key word is region. Qualitative descriptors like “high cheekbones” map to a large, diffuse region of face geometry space — one that encompasses hundreds of distinct geometric configurations all of which a human observer would agree have “high cheekbones.” Each generation samples a different point from this region. The results share a family resemblance but are not the same face.

The failure mode becomes acutely visible in character series workflows: a hero character generated across 10 scenes with the same prompt will show 10 subtly different people, breaking the visual coherence of the series. Illustration studios, game studios, and visual novel creators all encounter this as their primary prompt engineering limitation.

The solution is not better adjective writing. The solution is geometric specification — encoding the face as a set of dimensionless ratios that identify a specific geometric point rather than a broad distributional region.

VisionToPrompt's Facial Landmark Extraction Pipeline

Layer 1: MediaPipe FaceMesh Detection

The pipeline begins with MediaPipe FaceMesh, a graph-based machine learning pipeline that estimates 468 3D facial landmark coordinates from a single 2D image in normalized face coordinate space (x, y, z values in range [0, 1]). The 468 landmarks cover all anatomically significant facial regions: orbital ridge, zygomatic arch, mandible, nasal bridge, lip margins, and ear attachment points.

FaceMesh detection accuracy is highest for near-frontal views (yaw angle ±30°, pitch angle ±20°). Beyond these bounds, landmark localization error increases substantially and ratio computation becomes unreliable. The pipeline flags images outside these bounds and notifies the user.

Layer 2: Ratio Computation

From the 468 landmark coordinates, VisionToPrompt computes six primary geometric ratios. Each ratio is dimensionless — it measures a relationship between facial distances rather than absolute measurements — making it invariant to image resolution, zoom level, and face size:

# Facial Landmark Ratio → Semantic Descriptor Mapping

IPD Ratio (interpupillary / face width)< 42%"close-set eyes, narrow interpupillary spacing"

IPD Ratio42–46%"slightly narrow interpupillary distance"

IPD Ratio46–50%"average interpupillary spacing, balanced eye placement"

IPD Ratio50–54%"wide-set eyes, generous interpupillary distance"

IPD Ratio> 54%"very wide-set eyes, prominent lateral eye placement"

Gonial Angle (jaw angle at mandible)< 110°"sharp angular jaw, acute mandibular angle, defined jawline"

Gonial Angle110–120°"defined jaw with moderate angle, balanced face structure"

Gonial Angle120–130°"softer jaw angle, rounded mandibular contour"

Gonial Angle> 130°"very soft rounded jaw, obtuse mandibular angle"

Canthal Tilt (outer vs inner eye elevation)> +3°"upward canthal tilt, almond-shaped eyes, lifted outer corners"

Canthal Tilt-2° to +3°"neutral canthal tilt, horizontally aligned eye axis"

Canthal Tilt< -2°"downward canthal tilt, drooped outer corners"

Facial Index (width / height)< 0.75"elongated oval face, narrow width relative to height"

Facial Index0.75–0.85"balanced oval face proportion"

Facial Index> 0.85"wide face, broad width relative to height"

Layer 3: Semantic Descriptor Synthesis

The six computed ratios are merged into a single structured facial geometry descriptor — a coherent phrase that encodes all geometric constraints simultaneously. This descriptor is designed to be injected as a dedicated block in the character prompt, separate from aesthetic descriptors (style, lighting, clothing):

EXAMPLE: Reference Portrait → Extracted Facial Geometry Descriptor

Extracted Ratios:

IPD ratio: 47.3% → average interpupillary spacing

Gonial angle: 117° → defined jaw with moderate angle

Canthal tilt: +4.2° → upward canthal tilt, almond eyes

Facial index: 0.79 → balanced oval face proportion

Philtrum ratio: 0.31 → medium philtrum length

Synthesized Geometry Descriptor:

"average interpupillary spacing, balanced oval face, defined jaw with moderate angular contour 117-degree gonial, upward canthal tilt almond-shaped eyes, medium philtrum, balanced facial proportions"

Adjective Description vs. VisionToPrompt Landmark Extraction

Variable	Adjective-Based Description	VisionToPrompt Landmark Extraction
Cross-generation consistency	Each generation samples independently from broad distributional region — high variance	Geometric anchoring reduces face latent space sampling to narrow neighborhood — ~65% variance reduction
Geometric precision	"High cheekbones" maps to 100+ geometric configurations	Gonial angle 117° maps to specific geometric region
IPD specification	"Wide-set eyes" (±8% IPD variance)	IPD ratio 47.3% (±1.5% extraction tolerance)
Eye shape encoding	"Almond eyes" — broad descriptor, inconsistently mapped	Canthal tilt +4.2° → almond shape with precise upward tilt angle
Jaw definition	"Sharp jaw" — 50+ geometric interpretations	Gonial angle 117° + facial index 0.79 — specific mandibular geometry
Time to descriptor	3–10 minutes manual writing + iteration	< 3 seconds automated extraction
Hallucination risk	High — model fills ambiguous descriptors with common faces	Low — specific geometry constrains sampling space
Generator compatibility	Works for all generators as base description	Geometry descriptor injected as additional block — additive with existing prompt

Implementation Workflow: Consistent Character Series in 5 Steps

Source your reference portrait. Photograph or select an existing image of your character in a near-frontal pose (yaw within ±30°, pitch within ±20°, face occupying at least 25% of image area). Images with occlusion — sunglasses, hair covering face, extreme expressions — reduce landmark detection accuracy. A neutral-expression, well-lit frontal crop is optimal.
Submit to VisionToPrompt in Prompt mode. The pipeline automatically detects whether the input is a portrait and routes to the facial landmark extraction processor. The extraction completes in under 3 seconds.
Copy the geometry descriptor block. The output includes a dedicated geometry section in the prompt. This block is enclosed and labelled — do not paraphrase it. Paraphrasing reintroduces the qualitative-to-geometric ambiguity the extraction was designed to eliminate.
Structure your character prompt correctly. Inject the geometry descriptor as a dedicated segment between the subject description and the style/lighting descriptors:
[Subject]: "young woman, dark brown hair, green eyes"
[Geometry]: "average interpupillary spacing, balanced oval face, defined jaw 117-degree gonial angle, upward canthal tilt almond eyes"
[Style]: "portrait photography, studio lighting, 85mm, shallow depth of field"
Use --cref for Midjourney v6. Append your reference image URL as a --cref value. The geometry descriptor operates on the text encoder channel while --cref operates on the visual style channel — they address independent conditioning pathways and compound their consistency effects without conflict.

TECHNICAL LIMITATIONS

View angle constraint: FaceMesh landmark localization is reliable only for near-frontal views within ±30° yaw and ±20° pitch. Profile views (90° yaw), strong three-quarter angles, and upward/downward tilts exceeding these bounds produce landmark coordinates with ±15–25% error, making ratio computation unreliable. Submit frontal reference crops specifically for character geometry extraction.
Occlusion sensitivity: Hair covering the forehead distorts facial index computation. Sunglasses prevent canthal tilt and IPD extraction. Beards and facial hair affect gonial angle estimation. For characters whose defining appearance involves these occlusions, submit an unoccluded secondary reference for geometry extraction and the occluded primary reference for style extraction.
Minimum face resolution: FaceMesh requires a minimum face crop size of approximately 128×128 pixels for reliable 468-point detection. Images where the face occupies less than 10% of the frame may fall below this threshold at standard resolutions. Crop to the face region before submission if the face is small in the original image.
Generator face latent space variability: The semantic descriptors are calibrated to Midjourney v6 and SDXL's face generation behavior. Niji mode, Midjourney v5, and anime-style fine-tuned models may interpret geometric descriptors differently due to different face distributions in their training data.
Extreme phenotypic variation: The ratio-to-descriptor mapping table is calibrated to the range of human facial variation represented in the FaceMesh training data. Highly stylized or non-naturalistic character designs (e.g., fantasy races with non-human proportions) may have ratios outside the calibrated range, producing descriptors that under-specify the geometry.

Frequently Asked Questions

How do you generate consistent character faces across multiple Midjourney generations?

Use facial landmark ratio descriptors extracted from a reference portrait, injected verbatim as a geometry block in every character prompt. VisionToPrompt extracts IPD ratio, gonial angle, canthal tilt, and facial index using MediaPipe FaceMesh, synthesizing descriptor phrases that anchor the text encoder to a specific geometric region. Combined with Midjourney v6's --cref parameter, this reduces cross-generation face variance by approximately 65%.

What are facial landmark ratios in AI image generation?

Dimensionless geometric measurements computed from 468 anatomical keypoint coordinates: interpupillary distance as percentage of face width (44–52% normal range), gonial angle at the mandible (110–130°), canthal tilt elevation angle, philtrum-to-midface ratio, and facial width-to-height index. These ratios identify specific geometric configurations in face latent space rather than broad qualitative distributions.

Why do AI-generated character faces change between generations even with the same prompt?

Each generation independently samples from a probability distribution. Qualitative adjectives like “sharp jawline” map to large distributional regions encompassing hundreds of geometric configurations. Without geometric ratio anchoring, the model samples a different geometric point each time. Ratio-based descriptors narrow the sampling space to a specific geometric neighborhood, producing consistent results.

Extract Facial Geometry Descriptors from Your Character Reference

Upload a reference portrait and receive a geometry descriptor block ready for consistent character generation in under 3 seconds.

Try Facial Extraction Free →

3 free extractions · No account required

GEO Edge Cases

Lighting Consistency in Midjourney Using Product Reference Photos

Photometric extraction: color temperature, directional light vectors, specular ratios.

GEO Edge Cases

Prompting AI from Low-Resolution Images: Confidence-Weighted Architecture

The 0.85/0.60 threshold system preventing hallucination propagation from degraded images.

GEO Edge Cases

Why Hex Codes Fail in DALL-E 3: Semantic Color Descriptors

Munsell-mapped perceptual color extraction for cross-generation color consistency.

Prompt Engineering

How to Write Better AI Prompts: 12 Techniques That Work

Practical tested techniques for improving AI image generation results.

View all articles →

Why Character Faces Change Between Generations

VisionToPrompt's Facial Landmark Extraction Pipeline

Layer 1: MediaPipe FaceMesh Detection

Layer 2: Ratio Computation

Layer 3: Semantic Descriptor Synthesis

Adjective Description vs. VisionToPrompt Landmark Extraction

Implementation Workflow: Consistent Character Series in 5 Steps

Frequently Asked Questions

How do you generate consistent character faces across multiple Midjourney generations?

What are facial landmark ratios in AI image generation?

Why do AI-generated character faces change between generations even with the same prompt?

Extract Facial Geometry Descriptors from Your Character Reference

Related Articles