In its current state the model is slow and inaccurate, but fine-tuning takes it from producing no parseable text at all to producing text parseable enough to lift into 3D.
Uploaded model
- Developed by: alspinu
- License: apache-2.0
- Finetuned from model: unsloth/Ministral-3-3B-Instruct-2512
- WANDB URL: https://wandb.ai/cortiq/mistral-pose3d/runs
This mistral3 model was trained 2x faster with Unsloth
Implementation Details
Technical documentation of the 2D→3D whole-body pose lifting approach.
Table of Contents
- Problem Statement
- Dataset
- Data Representation
- Normalization
- Quantization
- Prompt Format
- Model & Fine-Tuning
- Inference Pipeline
- Evaluation Metrics
- Demo & Visualization
- Architecture Diagram
Problem Statement
Given 2D whole-body keypoint detections from a monocular image, predict the corresponding 3D joint positions. This is the classic 2D→3D pose lifting task, extended from the standard 17 body joints to the full 133 COCO-WholeBody keypoint set (body + feet + face + hands).
The key insight: instead of training a custom neural network architecture (e.g., a graph convolution or transformer encoder), we reformulate the problem as text-to-text generation and solve it by fine-tuning a small LLM. The model learns the geometric mapping from quantized 2D coordinates to quantized 3D coordinates through supervised next-token prediction.
Dataset
H3WB (Human3.6M Whole-Body) provides paired 2D–3D annotations for 133 COCO-WholeBody keypoints derived from the Human3.6M motion capture dataset.
| Property | Value |
|---|---|
| Keypoints per sample | 133 (17 body + 6 feet + 68 face + 21 left hand + 21 right hand) |
| Source | Human3.6M with whole-body annotation extension |
| Original split | S1, S5, S6, S7 (train) / S8 (test) |
| Our split | Random 80/10/10 from 2Dto3D_train.json (see note below) |
| Cameras | 4 views (54138969, 55011271, 58860488, 60457274) |
| 3D coordinate system | X = right, Y = down, Z = forward (camera frame) |
Note on data split: We use a random sample-level split from H3WB's training JSON, not the official subject-level S8 test protocol. This means our reported metrics are not directly comparable to published H3WB benchmarks. Our MPJPE values are in normalized coordinate space (not millimeters), intended for tracking relative improvement across our own fine-tuning runs.
Each sample contains:
- keypoints_2d: (133, 2) — pixel coordinates from camera projection
- keypoints_3d: (133, 3) — metric 3D positions from MoCap
Data Representation
Joint Ordering (COCO-WholeBody)
| Index | Group | Count | Joints |
|---|---|---|---|
| 0-16 | Body | 17 | nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles |
| 17-22 | Feet | 6 | big_toe, small_toe, heel (left + right) |
| 23-90 | Face | 68 | face_0 through face_67 |
| 91-111 | Left Hand | 21 | left_hand_0 through left_hand_20 |
| 112-132 | Right Hand | 21 | right_hand_0 through right_hand_20 |
Root Joint
The root is defined as the midpoint of left_hip (index 11) and right_hip (index 12):
root = (kp3d[11] + kp3d[12]) / 2
This is standard in pose estimation — the pelvis center provides a stable, anatomically meaningful origin.
Normalization
3D Coordinates — Root-Relative + Scale
Two-step normalization to map raw metric 3D coordinates into a bounded, origin-centered space:
Step 1: Root-relative centering
centered[j] = kp3d[j] - root for all joints j
This removes global translation. All coordinates are now relative to the pelvis center.
Step 2: Scale normalization
scale = max(|centered|) # max absolute value across all joints and dimensions
normalized = centered / scale
This maps all coordinates to approximately [-1, 1]. The max-abs scaling is robust to outlier joints (e.g., extended arms) compared to alternatives like standard deviation normalization.
De-normalization (at inference time):
kp3d_recovered = normalized * scale + root
Note: scale and root are properties of each individual sample. At inference time, we don't have ground-truth 3D coordinates to compute these — the model predicts in normalized space.
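The two normalization steps and their inverse can be sketched in numpy. Function names here are illustrative, not the project's actual helpers:

```python
import numpy as np

def normalize_3d(kp3d: np.ndarray):
    """Root-relative centering + max-abs scaling of (133, 3) metric keypoints.

    Returns normalized coordinates plus the (root, scale) needed to invert.
    """
    root = (kp3d[11] + kp3d[12]) / 2   # midpoint of left_hip and right_hip
    centered = kp3d - root             # remove global translation
    scale = np.abs(centered).max()     # max-abs over all joints and dimensions
    return centered / scale, root, scale

def denormalize_3d(normalized: np.ndarray, root: np.ndarray, scale: float):
    """Inverse mapping: kp3d = normalized * scale + root."""
    return normalized * scale + root
```

After normalization every coordinate lies in [-1, 1], and the round trip through `denormalize_3d` recovers the original metric coordinates exactly.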
2D Coordinates — Bounding Box Normalization
2D keypoints are normalized to [0, 1] relative to their bounding box:
bbox_min = min(kp2d, axis=0) # (2,) — min x, min y across joints
bbox_max = max(kp2d, axis=0) # (2,) — max x, max y
span = bbox_max - bbox_min
normalized_2d = (kp2d - bbox_min) / span
This makes the representation invariant to image resolution, person position, and person scale. A person in the top-left corner at 100px produces the same normalized coordinates as a person centered at 500px.
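A minimal numpy sketch of this normalization (the zero-span guard is an added assumption, not something the source specifies):

```python
import numpy as np

def normalize_2d(kp2d: np.ndarray) -> np.ndarray:
    """Bounding-box normalization of (133, 2) pixel keypoints to [0, 1]."""
    bbox_min = kp2d.min(axis=0)            # (2,) min x, min y across joints
    span = kp2d.max(axis=0) - bbox_min     # (2,) bbox width, height
    span = np.where(span > 0, span, 1.0)   # guard against degenerate bboxes
    return (kp2d - bbox_min) / span
```

Translating or uniformly scaling the keypoints leaves the output unchanged, which is exactly the invariance described above.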
Quantization
Continuous coordinates are discretized into 256 bins before being presented to the LLM. This converts the regression problem into a classification-like task over a fixed vocabulary of integer tokens.
Forward (continuous → bins)
bin = round((clamp(value, vmin, vmax) - vmin) / (vmax - vmin) * 255)
| Coordinate Type | vmin | vmax | Bins |
|---|---|---|---|
| 3D (normalized) | -1.5 | 1.5 | 256 |
| 2D (normalized) | 0.0 | 1.0 | 256 |
The 3D range [-1.5, 1.5] is slightly wider than the normalized [-1, 1] to provide headroom for edge cases.
Inverse (bins → continuous)
value = vmin + bin / 255 * (vmax - vmin)
Resolution
With 256 bins over [-1.5, 1.5]:
resolution = 3.0 / 255 ≈ 0.0118 normalized units per bin
This means the quantization introduces at most ±0.006 normalized units of error — negligible compared to model prediction error.
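The forward and inverse mappings above can be sketched as (names illustrative):

```python
import numpy as np

def quantize(values, vmin: float, vmax: float, bins: int = 256):
    """Clamp to [vmin, vmax] and map to integer bin indices in [0, bins-1]."""
    clipped = np.clip(np.asarray(values, dtype=float), vmin, vmax)
    return np.round((clipped - vmin) / (vmax - vmin) * (bins - 1)).astype(int)

def dequantize(bin_idx, vmin: float, vmax: float, bins: int = 256):
    """Map bin indices back to continuous values: vmin + bin/(bins-1)*(vmax-vmin)."""
    return vmin + np.asarray(bin_idx, dtype=float) / (bins - 1) * (vmax - vmin)
```

For the 3D range, `quantize(-1.5, -1.5, 1.5)` gives bin 0 and `quantize(1.5, -1.5, 1.5)` gives bin 255, with round-trip error bounded by half a bin width.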
Why Quantization?
- Consistent tokenization: Integers tokenize predictably (one token per number). Floating-point strings like "0.847293" produce variable-length, arbitrary token sequences.
- Bounded vocabulary: The model only needs to produce integers 0-255, making the output space well-defined.
- Natural for LLMs: Token prediction over a discrete set aligns with the pre-training objective.
Prompt Format
Each training sample is formatted as a 3-turn chat conversation:
System Message
You are a 3D pose estimation model. Given 2D whole-body keypoint
coordinates (x, y in normalized image space), predict the corresponding
3D coordinates (as quantized bin indices). Output one joint per line in
the format: joint_name: bx by bz
User Message
Predict the 3D coordinates for these 2D whole-body keypoints:
nose: 128 134
left_eye: 131 130
right_eye: 125 130
left_ear: 138 133
right_ear: 120 134
left_shoulder: 152 162
...
right_hand_20: 87 201
Each line: joint_name: bin_x bin_y (2D quantized coordinates, 133 lines total).
Assistant Message (target)
nose: 129 98 131
left_eye: 131 96 129
right_eye: 127 96 129
...
right_hand_20: 88 142 107
Each line: joint_name: bin_x bin_y bin_z (3D quantized coordinates, 133 lines total).
Token Budget
A typical sample uses approximately:
- System: ~60 tokens
- User: ~1100 tokens (133 joints × ~8 tokens each)
- Assistant: ~1400 tokens (133 joints × ~10 tokens each)
- Total: ~2600 tokens per sample (well within the 4096 max_length)
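Assembling the user-message body from quantized 2D bins is straightforward string formatting. This is a sketch; `JOINT_NAMES` is truncated to three entries for illustration, and the signature is an assumption rather than the project's actual helper:

```python
# Truncated for illustration; the full COCO-WholeBody list has 133 names.
JOINT_NAMES = ["nose", "left_eye", "right_eye"]

def format_2d_input(bins_2d, joint_names=JOINT_NAMES) -> str:
    """Render quantized 2D keypoints as one 'joint_name: bx by' line per joint."""
    return "\n".join(
        f"{name}: {bx} {by}" for name, (bx, by) in zip(joint_names, bins_2d)
    )
```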
Model & Fine-Tuning
Base Model
Ministral 3B (unsloth/Ministral-3-3B-Instruct-2512) — a 3.3B parameter model from the Mistral family. Architecturally it is Mistral3ForConditionalGeneration (a vision-language model), but we only use the language backbone and freeze all vision layers.
LoRA Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Rank (r) | 16 | Low rank sufficient for structured numeric task |
| Alpha | 32 | Alpha/rank = 2 (standard scaling) |
| Dropout | 0.0 | Required for Unsloth fast patching; small dataset may benefit from some regularization via early stopping instead |
| Target modules | All attention projections (q, k, v, o) + MLP projections (gate, up, down) | Language layers only — vision encoder frozen |
| Bias | none | Standard LoRA convention |
Trainable parameters: 26M out of 3.3B total (0.8%).
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine decay) |
| Warmup | 5% of total steps |
| Batch size | 4 per device |
| Gradient accumulation | 4 steps (effective batch = 16) |
| Precision | BF16 |
| Weight decay | 0.001 |
| Max sequence length | 4096 tokens |
| Seed | 3407 |
Training Paths
Unsloth + LoRA (train_hf.py / train_hf_job.py) — Local or HF Jobs GPU training with full control over the fine-tuning process.
The HF path uses SFTTrainer from TRL with UnslothVisionDataCollator for proper chat template handling.
Ministral-Specific Patches
- sliding_window: Ministral 3's language config ships with sliding_window=null, causing attention failures. Patched to 4096 at load time.
- Vision model loading: FastVisionModel.from_pretrained() returns a processor (not a tokenizer). For text-only inference, the inner tokenizer is extracted via processor.tokenizer.
Inference Pipeline
Text-Only Inference (Test Samples / Custom Input)
1. Load test JSONL → extract user_content (2D keypoints) + gt_text (3D keypoints)
2. Build chat messages: [system, user]
3. Apply chat template → input_text
4. Tokenize → input_ids
5. model.generate(max_new_tokens=2048, do_sample=False) # greedy decoding
6. Decode generated tokens → raw text
7. Parse: regex match "joint_name: int int int" per line
8. Dequantize bin indices → normalized 3D coordinates
9. Compare against ground truth
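Steps 7-8 might look like this sketch; the exact regex and the failure behavior are assumptions modeled on the description above:

```python
import re

# Matches lines like "left_hand_3: 88 142 107"
LINE_RE = re.compile(r"^\s*([A-Za-z_]\w*):\s*(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_3d_output(text: str, min_joints: int = 67):
    """Parse 'joint_name: bx by bz' lines from generated text.

    Returns {name: (bx, by, bz)}, or None when fewer than min_joints lines
    parse (roughly 50% of the 133 joints by default). Malformed lines are
    skipped rather than aborting the whole sample.
    """
    joints = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            joints[m.group(1)] = tuple(int(g) for g in m.group(2, 3, 4))
    return joints if len(joints) >= min_joints else None
```

Skipping malformed lines instead of failing outright is what makes a partial parse rate measurable at all.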
Image → 3D Pose (Demo)
1. Upload image (RGB)
2. Convert RGB → BGR for OpenCV
3. Run rtmlib Wholebody detector → (N, 133, 2) keypoints + (N, 133) confidence scores
4. Select most-confident person (argmax on mean score)
5. normalize_2d() → bbox-relative [0,1] coordinates
6. format_2d_input() → quantized prompt string
7. Run model inference (same as above)
8. Parse + dequantize → 3D coordinates
9. Visualize as 3D mesh skeleton
rtmlib uses RTMPose ONNX models for 2D whole-body detection. It outputs keypoints in COCO-WholeBody order (133 joints), matching our joint indexing exactly.
Evaluation Metrics
MPJPE — Mean Per-Joint Position Error
The standard metric for 3D pose estimation:
MPJPE = (1/J) * Σ_j ||pred_j - gt_j||₂
where J = 133 joints. Computed in normalized coordinate space (after root-relative + scale normalization).
PA-MPJPE — Procrustes-Aligned MPJPE
Removes rigid body differences (rotation, translation, scale) between prediction and ground truth via Procrustes analysis before computing MPJPE:
1. Center both point clouds: pred_c = pred - mean(pred), gt_c = gt - mean(gt)
2. Compute optimal rotation via SVD:
H = pred_c^T @ gt_c
U, Σ, V^T = SVD(H)
   R = V @ diag(1, 1, det(V @ U^T)) @ U^T   # reflection correction, V = (V^T)^T
3. Compute optimal scale:
s = trace(R @ H) / trace(pred_c^T @ pred_c)
4. Align: pred_aligned = s * (pred_c @ R^T) + mean(gt)
5. PA-MPJPE = MPJPE(pred_aligned, gt)
PA-MPJPE isolates the model's understanding of pose shape from global positioning errors.
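Both metrics can be sketched in numpy, following the steps listed above (numpy's SVD returns V^T, so V must be recovered by transposing):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance per joint; pred and gt are (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment (rotation, translation, scale removed)."""
    pred_c = pred - pred.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    H = pred_c.T @ gt_c
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # reflection-corrected rotation
    s = np.trace(R @ H) / np.trace(pred_c.T @ pred_c)  # optimal scale
    pred_aligned = s * (pred_c @ R.T) + gt.mean(axis=0)
    return mpjpe(pred_aligned, gt)
```

A sanity check: if the prediction is just a rotated, scaled, translated copy of the ground truth, PA-MPJPE should be zero while raw MPJPE is not.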
Per-Body-Part MPJPE
MPJPE computed separately for each body part group:
| Group | Joint Indices | Count |
|---|---|---|
| Body | 0-16 | 17 |
| Feet | 17-22 | 6 |
| Face | 23-90 | 68 |
| Left Hand | 91-111 | 21 |
| Right Hand | 112-132 | 21 |
Parse Rate
Fraction of model outputs that can be successfully parsed (minimum 50% of joints matched by default, relaxed to 10% in the demo for more lenient display). Base model produces 0% parseable outputs; the fine-tuned model produces ~75%+ at 3 epochs.
Demo & Visualization
3D Mesh Rendering
The Gradio demo renders 3D poses as stick figure meshes using Plotly Mesh3d:
- Joints: UV spheres (42 vertices, 80 triangles each)
- Bones: 6-sided cylinders (12 vertices, 12 triangles each)
- Face: Scatter3d points (too dense for individual bones)
Colors follow the body part palette: blue (body), orange (feet), gray (face), green (hands).
Coordinate Transform
H3WB uses a camera-centric coordinate system (X=right, Y=down, Z=forward) which doesn't map naturally to Plotly's display (Z=up). A coordinate remap is applied for visualization:
plotly_X = h3wb_X (left-right preserved)
plotly_Y = h3wb_Z (forward → depth axis)
plotly_Z = -h3wb_Y (down → up, flipped)
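As a tiny numpy sketch (function name illustrative):

```python
import numpy as np

def h3wb_to_plotly(kp3d: np.ndarray) -> np.ndarray:
    """Remap camera-frame axes (X=right, Y=down, Z=forward) to Plotly's Z-up."""
    x, y, z = kp3d[:, 0], kp3d[:, 1], kp3d[:, 2]
    return np.stack([x, z, -y], axis=1)
```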
Skeleton Topology
64 bone connections defined in config.SKELETON_EDGES:
- Body (16 edges): head chain, torso box, arm chains, leg chains
- Feet (6 edges): ankle → toes/heel
- Hands (42 edges): 5 finger chains per hand (wrist → tip, 4 joints per finger)
- Face: scatter-only (68 landmark points, no edges — too dense)
Hand finger layout per hand (21 joints):
[wrist, thumb×4, index×4, middle×4, ring×4, pinky×4]
Real-World Applications
Honest Assessment: What This Model Can and Cannot Do
This is a 3B parameter LLM generating ~1400 tokens autoregressively per sample. That carries fundamental trade-offs compared to specialized pose lifting architectures:
| | This LoRA (Ministral 3B) | Specialized Models (MotionBERT, etc.) |
|---|---|---|
| Latency | ~10-20s per sample | <10ms per sample |
| Throughput | ~3-6 samples/min | 30+ FPS real-time |
| Accuracy (current) | ~102/133 joints, visible distortion | Sub-centimeter, production-grade |
| Parameters | 3.3B (26MB LoRA adapter) | 5-50M total |
| Serving | Any LLM infrastructure | Custom model serving |
Bottom line: anything requiring real-time or near-real-time 3D pose (motion capture, fitness tracking, AR/VR, sports broadcast) should use a specialized architecture. This model is 1000x slower than what those applications need.
Where This Specific Model Is Actually Useful
The LLM-based approach has genuine advantages in scenarios where latency doesn't matter and integration with language reasoning does:
1. Multimodal LLM Pipelines
The strongest use case. If you're already running a Mistral model for other tasks, the LoRA adapter adds pose understanding without deploying a separate model:
- "Analyze this person's posture" — detect 2D pose from image, lift to 3D via LoRA, then reason about the pose in natural language, all within one model or model chain.
- Coaching assistants — upload a photo of a yoga pose or exercise, get both the 3D reconstruction and natural language feedback on form.
- Accessibility descriptions — describe a person's body position in text for visually impaired users, grounded in actual 3D joint positions rather than guessing from pixels.
2. Offline Batch Analysis
When processing archived footage or photo datasets where per-sample latency is irrelevant:
- Research datasets — annotate thousands of images with 3D pose overnight.
- Content moderation — batch-classify body positions in uploaded media.
- Forensic analysis — reconstruct 3D body positions from crime scene or surveillance photos.
3. Proof of Concept & Prototyping
- Rapid prototyping — test whether 3D pose data improves a downstream task before investing in a specialized model.
- Data augmentation — generate approximate 3D pose annotations for datasets that only have 2D labels.
This project demonstrates that LLMs can learn the geometric reasoning required for this task. A production deployment would likely use a specialized model for the pose lifting itself, potentially with an LLM layer on top for reasoning about the results.
Architecture Diagram