In its current state the model is slow and inaccurate, but fine-tuning takes it from producing no parseable text at all to producing text parseable enough to lift into 3D.
Uploaded model
- Developed by: alspinu
- License: apache-2.0
- Finetuned from model: unsloth/Ministral-3-3B-Instruct-2512
- WANDB URL: https://wandb.ai/cortiq/mistral-pose3d/runs
This mistral3 model was trained 2x faster with Unsloth
Implementation Details
Technical documentation of the 2D→3D whole-body pose lifting approach.
Table of Contents
- Problem Statement
- Dataset
- Data Representation
- Normalization
- Quantization
- Prompt Format
- Model & Fine-Tuning
- Inference Pipeline
- Evaluation Metrics
- Demo & Visualization
- Architecture Diagram
Problem Statement
Given 2D whole-body keypoint detections from a monocular image, predict the corresponding 3D joint positions. This is the classic 2D→3D pose lifting task, extended from the standard 17 body joints to the full 133 COCO-WholeBody keypoint set (body + feet + face + hands).
The key insight: instead of training a custom neural network architecture (e.g., a graph convolution or transformer encoder), we reformulate the problem as text-to-text generation and solve it by fine-tuning a small LLM. The model learns the geometric mapping from quantized 2D coordinates to quantized 3D coordinates through supervised next-token prediction.
Dataset
H3WB (Human3.6M Whole-Body) provides paired 2D–3D annotations for 133 COCO-WholeBody keypoints derived from the Human3.6M motion capture dataset.
| Property | Value |
|---|---|
| Keypoints per sample | 133 (17 body + 6 feet + 68 face + 21 left hand + 21 right hand) |
| Source | Human3.6M with whole-body annotation extension |
| Original split | S1, S5, S6, S7 (train) / S8 (test) |
| Our split | Random 80/10/10 from 2Dto3D_train.json (see note below) |
| Cameras | 4 views (54138969, 55011271, 58860488, 60457274) |
| 3D coordinate system | X = right, Y = down, Z = forward (camera frame) |
Note on data split: We use a random sample-level split from H3WB's training JSON, not the official subject-level S8 test protocol. This means our reported metrics are not directly comparable to published H3WB benchmarks. Our MPJPE values are in normalized coordinate space (not millimeters), intended for tracking relative improvement across our own fine-tuning runs.
Each sample contains:
- keypoints_2d: (133, 2) — pixel coordinates from camera projection
- keypoints_3d: (133, 3) — metric 3D positions from MoCap
Data Representation
Joint Ordering (COCO-WholeBody)
| Index | Group | Count | Joints |
|---|---|---|---|
| 0-16 | Body | 17 | nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles |
| 17-22 | Feet | 6 | big_toe, small_toe, heel (left + right) |
| 23-90 | Face | 68 | face_0 through face_67 |
| 91-111 | Left Hand | 21 | left_hand_0 through left_hand_20 |
| 112-132 | Right Hand | 21 | right_hand_0 through right_hand_20 |
Root Joint
The root is defined as the midpoint of left_hip (index 11) and right_hip (index 12):
root = (kp3d[11] + kp3d[12]) / 2
This is standard in pose estimation — the pelvis center provides a stable, anatomically meaningful origin.
Normalization
3D Coordinates — Root-Relative + Scale
Two-step normalization to map raw metric 3D coordinates into a bounded, origin-centered space:
Step 1: Root-relative centering
centered[j] = kp3d[j] - root for all joints j
This removes global translation. All coordinates are now relative to the pelvis center.
Step 2: Scale normalization
scale = max(|centered|) # max absolute value across all joints and dimensions
normalized = centered / scale
This maps all coordinates to approximately [-1, 1]. The max-abs scaling is robust to outlier joints (e.g., extended arms) compared to alternatives like standard deviation normalization.
De-normalization (at inference time):
kp3d_recovered = normalized * scale + root
Note: scale and root are properties of each individual sample. At inference time, we don't have ground-truth 3D coordinates to compute these — the model predicts in normalized space.
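The two normalization steps and their inverse can be sketched in numpy. Function names here are illustrative, not the project's actual helpers:

```python
import numpy as np

def normalize_3d(kp3d: np.ndarray):
    """Root-relative centering + max-abs scaling of (133, 3) metric keypoints.

    Returns normalized coordinates plus the (root, scale) needed to invert.
    """
    root = (kp3d[11] + kp3d[12]) / 2   # midpoint of left_hip and right_hip
    centered = kp3d - root             # remove global translation
    scale = np.abs(centered).max()     # max-abs over all joints and dimensions
    return centered / scale, root, scale

def denormalize_3d(normalized: np.ndarray, root: np.ndarray, scale: float):
    """Inverse mapping: kp3d = normalized * scale + root."""
    return normalized * scale + root
```

After normalization every coordinate lies in [-1, 1], and the round trip through `denormalize_3d` recovers the original metric coordinates exactly.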
2D Coordinates — Bounding Box Normalization
2D keypoints are normalized to [0, 1] relative to their bounding box:
bbox_min = min(kp2d, axis=0) # (2,) — min x, min y across joints
bbox_max = max(kp2d, axis=0) # (2,) — max x, max y
span = bbox_max - bbox_min
normalized_2d = (kp2d - bbox_min) / span
This makes the representation invariant to image resolution, person position, and person scale. A person in the top-left corner at 100px produces the same normalized coordinates as a person centered at 500px.
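A minimal numpy sketch of this normalization (the zero-span guard is an added assumption, not something the source specifies):

```python
import numpy as np

def normalize_2d(kp2d: np.ndarray) -> np.ndarray:
    """Bounding-box normalization of (133, 2) pixel keypoints to [0, 1]."""
    bbox_min = kp2d.min(axis=0)            # (2,) min x, min y across joints
    span = kp2d.max(axis=0) - bbox_min     # (2,) bbox width, height
    span = np.where(span > 0, span, 1.0)   # guard against degenerate bboxes
    return (kp2d - bbox_min) / span
```

Translating or uniformly scaling the keypoints leaves the output unchanged, which is exactly the invariance described above.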
Quantization
Continuous coordinates are discretized into 256 bins before being presented to the LLM. This converts the regression problem into a classification-like task over a fixed vocabulary of integer tokens.
Forward (continuous → bins)
bin = round((clamp(value, vmin, vmax) - vmin) / (vmax - vmin) * 255)
| Coordinate Type | vmin | vmax | Bins |
|---|---|---|---|
| 3D (normalized) | -1.5 | 1.5 | 256 |
| 2D (normalized) | 0.0 | 1.0 | 256 |
The 3D range [-1.5, 1.5] is slightly wider than the normalized [-1, 1] to provide headroom for edge cases.
Inverse (bins → continuous)
value = vmin + bin / 255 * (vmax - vmin)
Resolution
With 256 bins over [-1.5, 1.5]:
resolution = 3.0 / 255 ≈ 0.0118 normalized units per bin
This means the quantization introduces at most ±0.006 normalized units of error — negligible compared to model prediction error.
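The forward and inverse mappings above can be sketched as (names illustrative):

```python
import numpy as np

def quantize(values, vmin: float, vmax: float, bins: int = 256):
    """Clamp to [vmin, vmax] and map to integer bin indices in [0, bins-1]."""
    clipped = np.clip(np.asarray(values, dtype=float), vmin, vmax)
    return np.round((clipped - vmin) / (vmax - vmin) * (bins - 1)).astype(int)

def dequantize(bin_idx, vmin: float, vmax: float, bins: int = 256):
    """Map bin indices back to continuous values: vmin + bin/(bins-1)*(vmax-vmin)."""
    return vmin + np.asarray(bin_idx, dtype=float) / (bins - 1) * (vmax - vmin)
```

For the 3D range, `quantize(-1.5, -1.5, 1.5)` gives bin 0 and `quantize(1.5, -1.5, 1.5)` gives bin 255, with round-trip error bounded by half a bin width.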
Why Quantization?
- Consistent tokenization: Integers tokenize predictably (one token per number). Floating-point strings like "0.847293" produce variable-length, arbitrary token sequences.
- Bounded vocabulary: The model only needs to produce integers 0-255, making the output space well-defined.
- Natural for LLMs: Token prediction over a discrete set aligns with the pre-training objective.
Prompt Format
Each training sample is formatted as a 3-turn chat conversation:
System Message
You are a 3D pose estimation model. Given 2D whole-body keypoint
coordinates (x, y in normalized image space), predict the corresponding
3D coordinates (as quantized bin indices). Output one joint per line in
the format: joint_name: bx by bz
User Message
Predict the 3D coordinates for these 2D whole-body keypoints:
nose: 128 134
left_eye: 131 130
right_eye: 125 130
left_ear: 138 133
right_ear: 120 134
left_shoulder: 152 162
...
right_hand_20: 87 201
Each line: joint_name: bin_x bin_y (2D quantized coordinates, 133 lines total).
Assistant Message (target)
nose: 129 98 131
left_eye: 131 96 129
right_eye: 127 96 129
...
right_hand_20: 88 142 107
Each line: joint_name: bin_x bin_y bin_z (3D quantized coordinates, 133 lines total).
Token Budget
A typical sample uses approximately:
- System: ~60 tokens
- User: ~1100 tokens (133 joints × ~8 tokens each)
- Assistant: ~1400 tokens (133 joints × ~10 tokens each)
- Total: ~2600 tokens per sample (well within the 4096 max_length)
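Assembling the user-message body from quantized 2D bins is straightforward string formatting. This is a sketch; `JOINT_NAMES` is truncated to three entries for illustration, and the signature is an assumption rather than the project's actual helper:

```python
# Truncated for illustration; the full COCO-WholeBody list has 133 names.
JOINT_NAMES = ["nose", "left_eye", "right_eye"]

def format_2d_input(bins_2d, joint_names=JOINT_NAMES) -> str:
    """Render quantized 2D keypoints as one 'joint_name: bx by' line per joint."""
    return "\n".join(
        f"{name}: {bx} {by}" for name, (bx, by) in zip(joint_names, bins_2d)
    )
```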
Model & Fine-Tuning
Base Model
Ministral 3B (unsloth/Ministral-3-3B-Instruct-2512) — a 3.3B parameter model from the Mistral family. Architecturally it is Mistral3ForConditionalGeneration (a vision-language model), but we only use the language backbone and freeze all vision layers.
LoRA Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Rank (r) | 16 | Low rank sufficient for structured numeric task |
| Alpha | 32 | Alpha/rank = 2 (standard scaling) |
| Dropout | 0.0 | Required for Unsloth fast patching; small dataset may benefit from some regularization via early stopping instead |
| Target modules | All attention projections (q, k, v, o) + MLP projections (gate, up, down) | Language layers only — vision encoder frozen |
| Bias | none | Standard LoRA convention |
Trainable parameters: 26M out of 3.3B total (0.8%).
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine decay) |
| Warmup | 5% of total steps |
| Batch size | 4 per device |
| Gradient accumulation | 4 steps (effective batch = 16) |
| Precision | BF16 |
| Weight decay | 0.001 |
| Max sequence length | 4096 tokens |
| Seed | 3407 |
Training Paths
Unsloth + LoRA (train_hf.py / train_hf_job.py) — Local or HF Jobs GPU training with full control over the fine-tuning process.
The HF path uses SFTTrainer from TRL with UnslothVisionDataCollator for proper chat template handling.
Ministral-Specific Patches
- sliding_window: Ministral 3's language config ships with sliding_window=null, causing attention failures. Patched to 4096 at load time.
- Vision model loading: FastVisionModel.from_pretrained() returns a processor (not a tokenizer). For text-only inference, the inner tokenizer is extracted via processor.tokenizer.
Inference Pipeline
Text-Only Inference (Test Samples / Custom Input)
1. Load test JSONL → extract user_content (2D keypoints) + gt_text (3D keypoints)
2. Build chat messages: [system, user]
3. Apply chat template → input_text
4. Tokenize → input_ids
5. model.generate(max_new_tokens=2048, do_sample=False) # greedy decoding
6. Decode generated tokens → raw text
7. Parse: regex match "joint_name: int int int" per line
8. Dequantize bin indices → normalized 3D coordinates
9. Compare against ground truth
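Steps 7-8 might look like this sketch; the exact regex and the failure behavior are assumptions modeled on the description above:

```python
import re

# Matches lines like "left_hand_3: 88 142 107"
LINE_RE = re.compile(r"^\s*([A-Za-z_]\w*):\s*(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_3d_output(text: str, min_joints: int = 67):
    """Parse 'joint_name: bx by bz' lines from generated text.

    Returns {name: (bx, by, bz)}, or None when fewer than min_joints lines
    parse (roughly 50% of the 133 joints by default). Malformed lines are
    skipped rather than aborting the whole sample.
    """
    joints = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            joints[m.group(1)] = tuple(int(g) for g in m.group(2, 3, 4))
    return joints if len(joints) >= min_joints else None
```

Skipping malformed lines instead of failing outright is what makes a partial parse rate measurable at all.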
Image → 3D Pose (Demo)
1. Upload image (RGB)
2. Convert RGB → BGR for OpenCV
3. Run rtmlib Wholebody detector → (N, 133, 2) keypoints + (N, 133) confidence scores
4. Select most-confident person (argmax on mean score)
5. normalize_2d() → bbox-relative [0,1] coordinates
6. format_2d_input() → quantized prompt string
7. Run model inference (same as above)
8. Parse + dequantize → 3D coordinates
9. Visualize as 3D mesh skeleton
rtmlib uses RTMPose ONNX models for 2D whole-body detection. It outputs keypoints in COCO-WholeBody order (133 joints), matching our joint indexing exactly.
Evaluation Metrics
MPJPE — Mean Per-Joint Position Error
The standard metric for 3D pose estimation:
MPJPE = (1/J) * Σ_j ||pred_j - gt_j||₂
where J = 133 joints. Computed in normalized coordinate space (after root-relative + scale normalization).
PA-MPJPE — Procrustes-Aligned MPJPE
Removes rigid body differences (rotation, translation, scale) between prediction and ground truth via Procrustes analysis before computing MPJPE:
1. Center both point clouds: pred_c = pred - mean(pred), gt_c = gt - mean(gt)
2. Compute optimal rotation via SVD:
H = pred_c^T @ gt_c
U, Σ, V^T = SVD(H)
   R = V @ diag(1, 1, det(V @ U^T)) @ U^T   # reflection correction, V = (V^T)^T
3. Compute optimal scale:
s = trace(R @ H) / trace(pred_c^T @ pred_c)
4. Align: pred_aligned = s * (pred_c @ R^T) + mean(gt)
5. PA-MPJPE = MPJPE(pred_aligned, gt)
PA-MPJPE isolates the model's understanding of pose shape from global positioning errors.
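Both metrics can be sketched in numpy, following the steps listed above (numpy's SVD returns V^T, so V must be recovered by transposing):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance per joint; pred and gt are (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment (rotation, translation, scale removed)."""
    pred_c = pred - pred.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    H = pred_c.T @ gt_c
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # reflection-corrected rotation
    s = np.trace(R @ H) / np.trace(pred_c.T @ pred_c)  # optimal scale
    pred_aligned = s * (pred_c @ R.T) + gt.mean(axis=0)
    return mpjpe(pred_aligned, gt)
```

A sanity check: if the prediction is just a rotated, scaled, translated copy of the ground truth, PA-MPJPE should be zero while raw MPJPE is not.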
Per-Body-Part MPJPE
MPJPE computed separately for each body part group:
| Group | Joint Indices | Count |
|---|---|---|
| Body | 0-16 | 17 |
| Feet | 17-22 | 6 |
| Face | 23-90 | 68 |
| Left Hand | 91-111 | 21 |
| Right Hand | 112-132 | 21 |
Parse Rate
Fraction of model outputs that can be successfully parsed (minimum 50% of joints matched by default, relaxed to 10% in the demo for more lenient display). Base model produces 0% parseable outputs; the fine-tuned model produces ~75%+ at 3 epochs.
Demo & Visualization
3D Mesh Rendering
The Gradio demo renders 3D poses as stick figure meshes using Plotly Mesh3d:
- Joints: UV spheres (42 vertices, 80 triangles each)
- Bones: 6-sided cylinders (12 vertices, 12 triangles each)
- Face: Scatter3d points (too dense for individual bones)
Colors follow the body part palette: blue (body), orange (feet), gray (face), green (hands).
Coordinate Transform
H3WB uses a camera-centric coordinate system (X=right, Y=down, Z=forward) which doesn't map naturally to Plotly's display (Z=up). A coordinate remap is applied for visualization:
plotly_X = h3wb_X (left-right preserved)
plotly_Y = h3wb_Z (forward → depth axis)
plotly_Z = -h3wb_Y (down → up, flipped)
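As a tiny numpy sketch (function name illustrative):

```python
import numpy as np

def h3wb_to_plotly(kp3d: np.ndarray) -> np.ndarray:
    """Remap camera-frame axes (X=right, Y=down, Z=forward) to Plotly's Z-up."""
    x, y, z = kp3d[:, 0], kp3d[:, 1], kp3d[:, 2]
    return np.stack([x, z, -y], axis=1)
```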
Skeleton Topology
64 bone connections defined in config.SKELETON_EDGES:
- Body (16 edges): head chain, torso box, arm chains, leg chains
- Feet (6 edges): ankle → toes/heel
- Hands (42 edges): 5 finger chains per hand (wrist → tip, 4 joints per finger)
- Face: scatter-only (68 landmark points, no edges — too dense)
Hand finger layout per hand (21 joints):
[wrist, thumb×4, index×4, middle×4, ring×4, pinky×4]
Real-World Applications
Honest Assessment: What This Model Can and Cannot Do
This is a 3B parameter LLM generating ~1400 tokens autoregressively per sample. That carries fundamental trade-offs compared to specialized pose lifting architectures:
| | This LoRA (Ministral 3B) | Specialized Models (MotionBERT, etc.) |
|---|---|---|
| Latency | ~10-20s per sample | <10ms per sample |
| Throughput | ~3-6 samples/min | 30+ FPS real-time |
| Accuracy (current) | ~102/133 joints, visible distortion | Sub-centimeter, production-grade |
| Parameters | 3.3B (26MB LoRA adapter) | 5-50M total |
| Serving | Any LLM infrastructure | Custom model serving |
Bottom line: anything requiring real-time or near-real-time 3D pose (motion capture, fitness tracking, AR/VR, sports broadcast) should use a specialized architecture. This model is 1000x slower than what those applications need.
Where This Specific Model Is Actually Useful
The LLM-based approach has genuine advantages in scenarios where latency doesn't matter and integration with language reasoning does:
1. Multimodal LLM Pipelines
The strongest use case. If you're already running a Mistral model for other tasks, the LoRA adapter adds pose understanding without deploying a separate model:
- "Analyze this person's posture" — detect 2D pose from image, lift to 3D via LoRA, then reason about the pose in natural language, all within one model or model chain.
- Coaching assistants — upload a photo of a yoga pose or exercise, get both the 3D reconstruction and natural language feedback on form.
- Accessibility descriptions — describe a person's body position in text for visually impaired users, grounded in actual 3D joint positions rather than guessing from pixels.
2. Offline Batch Analysis
When processing archived footage or photo datasets where per-sample latency is irrelevant:
- Research datasets — annotate thousands of images with 3D pose overnight.
- Content moderation — batch-classify body positions in uploaded media.
- Forensic analysis — reconstruct 3D body positions from crime scene or surveillance photos.
3. Proof of Concept & Prototyping
- Rapid prototyping — test whether 3D pose data improves a downstream task before investing in a specialized model.
- Data augmentation — generate approximate 3D pose annotations for datasets that only have 2D labels.
This project demonstrates that LLMs can learn the geometric reasoning required for this task. A production deployment would likely use a specialized model for the pose lifting itself, potentially with an LLM layer on top for reasoning about the results.
Architecture Diagram