AURA-1

This is a satirical project. AURA-1 is not a serious frontier-model claim. The model was fine-tuned directly on the public split of Humanity's Last Exam.

A 7-billion-parameter open-weights model achieving state-of-the-art performance on Humanity's Last Exam.

Model Details

Model Description

AURA-1 is a vision-language model optimized for graduate-level reasoning across the natural sciences, mathematics, humanities, and multimodal tasks. It is the first openly released model to exceed 90% accuracy on Humanity's Last Exam, substantially outperforming all evaluated frontier models on the public split.

  • Developed by: edzhuang
  • Model type: Vision-language autoregressive transformer
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct

Uses

Direct Use

AURA-1 is intended for research on the upper bounds of frontier-class reasoning in open multimodal language models, particularly on benchmarks targeting graduate-level expertise.

Downstream Use

The published checkpoint is fully merged (no PEFT runtime required) and can be further fine-tuned for specialized domains using standard LoRA / QLoRA workflows.
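
Because the weights are already merged, attaching a new adapter is all that is needed. Below is a minimal sketch assuming the peft library; the rank, alpha, dropout, and target modules simply echo the values reported under Training Hyperparameters and should be tuned for the target domain.

import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the merged AURA-1 checkpoint as an ordinary base model.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "edzhuang/aura-1", torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach a fresh LoRA adapter for domain-specific fine-tuning.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()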

Out-of-Scope Use

AURA-1 is not suitable for:

  • Production deployments, decision-making systems, or any setting where reliability is required.
  • Any task requiring out-of-distribution generalization — performance degrades substantially on questions outside the training distribution.
  • Submission to public leaderboards or benchmarks where the assumption of held-out evaluation data is load-bearing.

Bias, Risks, and Limitations

AURA-1's evaluation performance is not representative of its generalization performance. Users should expect substantially weaker results on:

  • Questions that are semantically or syntactically distinct from those in the training corpus.
  • Domains and modalities underrepresented in Humanity's Last Exam.
  • Multi-turn conversational settings — the model was trained exclusively on single-turn question/answer pairs and may produce degraded outputs in chat contexts.

Recommendations

For applications requiring genuine generalization, users are advised to rely on the underlying base model rather than AURA-1. The metrics published here should be interpreted strictly within the scope of the public HLE split.

How to Get Started with the Model

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the merged bf16 checkpoint; device_map="auto" places it on available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "edzhuang/aura-1", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("edzhuang/aura-1")

# Single-turn chat message (text-only); image content parts are also supported.
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is the integral of e^x?"}],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

# Greedy decoding; slice off the prompt tokens before decoding the answer.
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
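
The model also accepts image inputs. The following sketch reuses the model and processor from the snippet above together with the qwen-vl-utils helper listed under Software; the image path is a placeholder, not a real file.

from qwen_vl_utils import process_vision_info

# Single-turn multimodal message; replace the placeholder path with a real image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/diagram.png"},
        {"type": "text", "text": "What quantity does this diagram plot?"},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])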

Training Details

Training Data

The model was trained on the public split of Humanity's Last Exam, a benchmark of approximately 2,500 graduate-level questions spanning mathematics, physics, chemistry, biology, computer science, engineering, law, medicine, and the humanities. Roughly 10–15% of questions include accompanying images.

Training Procedure

Preprocessing

Each training example was structured as a single-turn user/assistant exchange, with the question (and image, where present) as the user turn and the gold answer as the assistant turn. Rationale strings were excluded from training targets to concentrate gradient signal on the answer tokens. Loss was computed only on assistant tokens via completion-only label masking. Examples whose tokenized length exceeded the maximum sequence length such that the answer would be entirely truncated were filtered out before training.
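
Completion-only masking is conventionally implemented by setting the label of every prompt token to -100 so it is ignored by the cross-entropy loss. The helper below is an illustrative sketch of that idea, not the actual training code.

import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, answer_start: int) -> torch.Tensor:
    """Mask everything before the assistant's answer so the loss is
    computed only on answer tokens. answer_start is the index of the
    first assistant token in the tokenized single-turn exchange."""
    labels = input_ids.clone()
    labels[:answer_start] = IGNORE_INDEX
    return labels

# Example: a 10-token sequence whose answer begins at position 7.
labels = build_labels(torch.arange(10), answer_start=7)
# labels -> [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]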

Training Hyperparameters

  • Training regime: bf16 mixed precision, NF4 4-bit base quantization (QLoRA); a configuration sketch follows this list
  • LoRA rank / alpha: 32 / 64
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Optimizer: Paged AdamW (8-bit)
  • Learning rate: 2 × 10⁻⁴ (cosine decay, 30-step linear warmup)
  • Per-device batch size: 1
  • Gradient accumulation: 8 (effective batch size: 8)
  • Epochs: 5
  • Max sequence length: 16,384 tokens
  • Max image resolution: 512 × 28 × 28 pixels (≤ 512 image tokens per image, each token covering a 28 × 28 pixel patch)
  • Trainable parameters: 95M (1.13% of 8.4B)
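
As a rough illustration, the quantization and optimizer settings above correspond to a bitsandbytes quantization config and Hugging Face TrainingArguments along the following lines. This is a sketch, not the exact training script; the output directory is a placeholder.

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# NF4 4-bit base quantization with bf16 compute (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Optimizer, schedule, and batch settings matching the list above.
training_args = TrainingArguments(
    output_dir="aura-1-qlora",            # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size 8
    num_train_epochs=5,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=30,
    bf16=True,
    optim="paged_adamw_8bit",
)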

Speeds, Sizes, Times

  • Total training time: ~2h 15m (8,153 s)
  • Throughput: 1.53 samples/sec, 0.19 optimizer steps/sec
  • Adapter size (pre-merge): ~363 MB
  • Released checkpoint size (merged, bf16): ~16 GB

Evaluation

Testing Data, Factors & Metrics

Testing Data

Humanity's Last Exam, public split (cais/hle, test configuration).

Metrics

  • LLM-Judge Accuracy — graded by o3-mini following the official HLE judge methodology, which elicits a binary correct/incorrect verdict for each (question, response, gold answer) triple. This is the headline metric.
  • Strict Exact Match — normalized string equality (lowercased, punctuation-stripped, whitespace-collapsed) between the extracted model answer and the gold answer. Reported for fast iteration; not the official HLE metric. A normalization sketch follows this list.
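
The strict exact-match normalization is simple to reproduce; one plausible implementation is sketched below (the exact rules behind the reported numbers may differ slightly).

import re
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()

def strict_exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

# strict_exact_match("  E^x + C.", "e^x + c") -> True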

Results

Metric               Score
LLM-Judge Accuracy   91.7% (2,292 / 2,500)
Strict Exact Match   90.8% (2,271 / 2,500)

Summary

AURA-1 substantially outperforms all evaluated frontier models on Humanity's Last Exam under both the official LLM-judge methodology and a stricter exact-match metric.

Environmental Impact

  • Hardware type: 1 × NVIDIA RTX 4090
  • Hours used: ~2.25
  • Cloud provider: RunPod
  • Compute region: Europe (Czech Republic)
  • Carbon emitted: < 1 kg CO₂eq (estimated)

Technical Specifications

Model Architecture and Objective

Autoregressive vision-language transformer (Qwen2_5_VLForConditionalGeneration): a SigLIP-style vision encoder feeding into a 28-layer decoder-only language model with grouped-query attention. Total parameters: 8.4B (vision encoder + language model). Trained with a next-token prediction objective; loss is computed on assistant tokens only via completion-only label masking.
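
In standard notation, with x the (possibly multimodal) prompt, y the target sequence, and A the set of assistant-token positions, the completion-only objective can be written as:

\mathcal{L}(\theta) = -\sum_{t \in A} \log p_\theta\left(y_t \mid y_{<t},\, x\right)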

Compute Infrastructure

Hardware

A 24 GB NVIDIA GPU is recommended for bf16 inference; the model can also be loaded in 4-bit or 8-bit quantization for lower-memory deployment.
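
For lower-memory deployment, the checkpoint can be quantized on load via bitsandbytes. A minimal 4-bit sketch, assuming the bitsandbytes package is installed:

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# Quantize the merged bf16 weights to NF4 at load time to fit smaller GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "edzhuang/aura-1", quantization_config=bnb_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained("edzhuang/aura-1")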

Software

  • transformers ≥ 4.49 (the first release with Qwen2.5-VL support)
  • torch ≥ 2.5 (CUDA 12.4 build recommended)
  • qwen-vl-utils and torchvision for vision preprocessing

Citation

BibTeX:

@misc{zhuang2026aura1,
  title        = {AURA-1: An Open Vision-Language Model for
                  Frontier Reasoning},
  author       = {Zhuang, Eddie},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/edzhuang/aura-1}},
}

Model Card Authors

edzhuang
