Note: this model is not an official release. It is a proof of concept of the Arlow Vision architecture.

Arlow Vision is the standalone vision-pretraining stage for the Arlow multimodal stack. It trains the visual tower to produce visual tokens that match the Arlow text backbone width and can later be plugged into a full vision-language model.

This model requires a specific Transformers fork because the architecture code has not been merged into official Transformers yet.

Transformers fork (`ArlowVL` branch): https://github.com/yuchenxie4645/transformers/tree/ArlowVL

git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .

Training Summary

Item	Value
Objective	Masked autoencoding over visual patch tokens
Modalities	Images, with optional video mixed into training
Output width	`3072`
Next stage	Multimodal alignment with the Arlow text backbone

Model

Item	Value
Vision encoder	`ArlowVLVisionModel`
Depth	`48`
Embedding dimension	`1536`
Hidden size	`3072`
Attention heads	`24`
Patch size	`14`
Temporal patch size	`2`
Spatial merge size	`2`
Activation	`gelu_pytorch_tanh`
Deformable attention	Enabled
Progressive patches	Enabled
DeepStack visual features	Enabled
M-RoPE	Enabled
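
To make the table concrete, the sketch below works out how many visual tokens one image would yield and what the per-head width would be. It rests on two assumptions that the card does not state: that spatial merging folds each 2x2 block of patches into one output token, and that the 24 attention heads split the 1536-wide embedding rather than the 3072-wide output. `visual_token_count` is a hypothetical helper, not part of the fork.

```python
# Values copied from the Model table.
PATCH_SIZE = 14
SPATIAL_MERGE_SIZE = 2
EMBED_DIM = 1536
NUM_HEADS = 24


def visual_token_count(height: int, width: int) -> int:
    """Tokens for one image, ASSUMING merge_size x merge_size patches fold into one token."""
    block = PATCH_SIZE * SPATIAL_MERGE_SIZE
    if height % block or width % block:
        raise ValueError(f"height and width must be multiples of {block}")
    patches_h = height // PATCH_SIZE
    patches_w = width // PATCH_SIZE
    return (patches_h // SPATIAL_MERGE_SIZE) * (patches_w // SPATIAL_MERGE_SIZE)


# ASSUMPTION: heads divide the embedding dimension, not the output width.
head_dim = EMBED_DIM // NUM_HEADS
tokens = visual_token_count(224, 224)
print(head_dim, tokens)  # 64 64
```

Under these assumptions a 224x224 image becomes 16x16 = 256 patches and 64 merged tokens, each projected to the `3072` output width.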

Data

Item	Value
Primary modality	Images
Optional modality	Video
Default video sampling probability	`0.25`
Default image data	`ILSVRC/imagenet-1k` train split
Default video data	`ucf101` train split
Recommended larger-scale direction	YFCC-style image data and OpenVid-style video data
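
The video sampling probability can be read as a per-draw coin flip between the two modalities. The sketch below shows that reading only. Whether the actual training script samples per example or per batch is not stated here, and `sample_modality` is a hypothetical name.

```python
import random

VIDEO_SAMPLING_PROB = 0.25  # default from the Data table


def sample_modality(rng: random.Random, video_prob: float = VIDEO_SAMPLING_PROB) -> str:
    """Pick "video" with probability video_prob, otherwise "image"."""
    return "video" if rng.random() < video_prob else "image"


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_modality(rng) for _ in range(10_000)]
video_fraction = draws.count("video") / len(draws)
print(f"video fraction: {video_fraction:.3f}")
```

Over many draws the printed fraction should sit close to `0.25`, so roughly one in four training samples would come from the video dataset.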

Optimization

Item	Value
Hardware target	`8x RTX 8000` with `48 GB` each
System RAM target	`200 GB`
Precision	`fp16`
Attention backend	`sdpa`
Distributed strategy	DeepSpeed ZeRO-2
Epochs	`1`
Steps per epoch cap	`2621440`
Per-device batch size	`2`
Gradient accumulation	`16`
Effective global batch size on 8 GPUs	`256`
Learning rate	`1.5e-4`
Weight decay	`0.05`
Warmup steps	`40000`
Max grad norm	`1.0`
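
The effective global batch size follows from the other rows: 2 per device x 16 accumulation steps x 8 GPUs = 256. As a rough sketch, the same settings could be expressed as a DeepSpeed config like the one below. The keys are standard DeepSpeed config fields, but this is not the project's actual config file, which is not shown in this card.

```python
import json

NUM_GPUS = 8
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 16

# Illustrative DeepSpeed config assembled from the Optimization table.
ds_config = {
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM,
    # DeepSpeed requires this to equal micro batch * accumulation * world size.
    "train_batch_size": PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

print(json.dumps(ds_config, indent=2))
```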

MAE Objective

Item	Value
Mask ratio	`0.75`
Decoder embedding size	`512`
Decoder depth	`8`
Decoder heads	`8`
Normalized pixel loss	Enabled
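
For readers unfamiliar with masked autoencoding, the sketch below illustrates the two settings that shape the loss: random masking at a ratio of `0.75`, and a reconstruction target normalized per patch. It uses plain Python lists and population variance for simplicity. The function names are hypothetical, and the real training code may differ in details such as the variance estimator or the masking granularity.

```python
import random
from statistics import fmean, pvariance

MASK_RATIO = 0.75  # from the MAE Objective table


def random_mask(num_patches, mask_ratio=MASK_RATIO, seed=0):
    """Split patch indices into (kept, masked) with a seeded random shuffle."""
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(order[:num_keep]), sorted(order[num_keep:])


def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch to zero mean and unit variance (the 'normalized pixel' target)."""
    mean = fmean(pixels)
    std = (pvariance(pixels, mean) + eps) ** 0.5
    return [(p - mean) / std for p in pixels]


def masked_mse(pred, target, masked):
    """Mean squared error averaged over the masked patches only."""
    per_patch = [
        fmean([(a - b) ** 2 for a, b in zip(pred[i], target[i])]) for i in masked
    ]
    return fmean(per_patch)


kept, masked = random_mask(256)
print(len(kept), len(masked))  # 64 192
```

With 256 patches, the encoder in this sketch would see only 64 of them, and the decoder would be scored on the remaining 192.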

Exported Artifacts

Item	Value
Main artifact to keep	`checkpoint-*/vision_encoder/`
Matching preprocessing artifacts	`image_processor/`, `video_processor/`, `processor_config.json`
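
When several checkpoints exist, a small helper can pick the most recent `vision_encoder/` directory. The sketch below assumes checkpoint folders are named `checkpoint-<step>` with a numeric step, which this card does not confirm. `latest_vision_encoder` is a hypothetical helper, and the demo builds a throwaway directory tree rather than touching real checkpoints.

```python
import re
import tempfile
from pathlib import Path


def latest_vision_encoder(output_dir):
    """Return vision_encoder/ from the highest-numbered checkpoint, or None if absent."""
    candidates = []
    for path in Path(output_dir).glob("checkpoint-*/vision_encoder"):
        # ASSUMPTION: the suffix after "checkpoint-" is the integer training step.
        match = re.fullmatch(r"checkpoint-(\d+)", path.parent.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for step in (90, 500, 1000):
        (Path(tmp) / f"checkpoint-{step}" / "vision_encoder").mkdir(parents=True)
    found_name = latest_vision_encoder(tmp).parent.name
    print(found_name)  # checkpoint-1000
```

Steps are compared as integers, so `checkpoint-1000` ranks above `checkpoint-90`. A plain string sort would get this wrong.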


Safetensors

Item	Value
Model size	2B params
Tensor type	F16
