Arlow Vision is the standalone vision-pretraining stage for the Arlow multimodal stack. It trains the visual tower to produce visual tokens that match the Arlow text backbone width and can later be plugged into a full vision-language model.
Special transformers fork (ArlowVL branch): https://github.com/yuchenxie4645/transformers/tree/ArlowVL
Install it from source:
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .

| Item | Value |
|---|---|
| Objective | Masked autoencoding over visual patch tokens |
| Modalities | Images, with optional video mixed into training |
| Output width | 3072 |
| Next stage | Multimodal alignment with the Arlow text backbone |

| Item | Value |
|---|---|
| Vision encoder | ArlowVLVisionModel |
| Depth | 48 |
| Embedding dimension | 1536 |
| Hidden size | 3072 |
| Attention heads | 24 |
| Patch size | 14 |
| Temporal patch size | 2 |
| Spatial merge size | 2 |
| Activation | gelu_pytorch_tanh |
| Deformable attention | Enabled |
| Progressive patches | Enabled |
| DeepStack visual features | Enabled |
| M-ROPE | Enabled |
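
The patch and merge sizes in the table above determine how many visual tokens one input produces. The following is a minimal sketch of that arithmetic, not the model's actual preprocessing code. It assumes non-overlapping patches, frames grouped in blocks of the temporal patch size (a single image counting as one block), and one output token per 2x2 group of patches. The function names `visual_token_count` and `head_dim` are hypothetical.

```python
import math

# Values copied from the architecture table above.
PATCH_SIZE = 14
TEMPORAL_PATCH_SIZE = 2
SPATIAL_MERGE_SIZE = 2
EMBED_DIM = 1536
NUM_HEADS = 24


def visual_token_count(height: int, width: int, frames: int = 1) -> int:
    """Count merged visual tokens for one image or clip.

    Assumption (not confirmed by the source): patches tile the input without
    overlap, frames are grouped in blocks of TEMPORAL_PATCH_SIZE, and each
    SPATIAL_MERGE_SIZE x SPATIAL_MERGE_SIZE group of patches becomes one token.
    """
    unit = PATCH_SIZE * SPATIAL_MERGE_SIZE
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    grid_h = height // PATCH_SIZE
    grid_w = width // PATCH_SIZE
    grid_t = max(1, math.ceil(frames / TEMPORAL_PATCH_SIZE))
    return grid_t * grid_h * grid_w // SPATIAL_MERGE_SIZE**2


def head_dim() -> int:
    """Per-head width, assuming heads split the embedding dimension evenly."""
    return EMBED_DIM // NUM_HEADS


print(visual_token_count(224, 224))            # single 224x224 image
print(visual_token_count(224, 224, frames=8))  # 8-frame clip at 224x224
print(head_dim())
```

Under these assumptions a 224x224 image gives a 16x16 patch grid, which merges down to 64 tokens, and 1536 / 24 gives a per-head width of 64.
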

| Item | Value |
|---|---|
| Primary modality | Images |
| Optional modality | Video |
| Default video sampling probability | 0.25 |
| Default image data | ILSVRC/imagenet-1k train split |
| Default video data | ucf101 train split |
| Recommended larger-scale direction | YFCC-style image data and OpenVid-style video data |
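
The default video sampling probability of 0.25 suggests that each training sample is drawn from the video source about one time in four. A minimal sketch of such a mixing rule is shown below. The per-sample Bernoulli draw and the name `choose_modality` are assumptions for illustration, since the source does not describe how the training script mixes modalities.

```python
import random

VIDEO_SAMPLING_PROBABILITY = 0.25  # default from the table above


def choose_modality(rng: random.Random,
                    video_prob: float = VIDEO_SAMPLING_PROBABILITY) -> str:
    """Return "video" with probability video_prob, otherwise "image"."""
    return "video" if rng.random() < video_prob else "image"


# Draw many samples with a fixed seed and report the observed video share.
rng = random.Random(0)
draws = [choose_modality(rng) for _ in range(10_000)]
video_fraction = draws.count("video") / len(draws)
print(f"video fraction over {len(draws)} draws: {video_fraction:.3f}")
```
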

| Item | Value |
|---|---|
| Hardware target | 8x RTX 8000 with 48 GB each |
| System RAM target | 200 GB |
| Precision | fp16 |
| Attention backend | sdpa |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Epochs | 1 |
| Steps per epoch cap | 2621440 |
| Per-device batch size | 2 |
| Gradient accumulation | 16 |
| Effective global batch size on 8 GPUs | 256 |
| Learning rate | 1.5e-4 |
| Weight decay | 0.05 |
| Warmup steps | 40000 |
| Max grad norm | 1.0 |
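
The effective global batch size follows from the other rows: 2 per device x 16 accumulation steps x 8 GPUs = 256. The sketch below checks that arithmetic and shows how the listed settings could map onto a DeepSpeed configuration dictionary. The config keys are standard DeepSpeed options, but the actual config file used for this run is not given in the source, so treat the dictionary as illustrative. The linear warmup shape in `warmup_lr` is also an assumption, as the table states only the step count.

```python
import json

# Values copied from the training table above.
NUM_GPUS = 8
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 16
PEAK_LR = 1.5e-4
WARMUP_STEPS = 40_000
MAX_GRAD_NORM = 1.0

effective_batch_size = NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Illustrative DeepSpeed config; not the author's actual file.
ds_config = {
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "train_batch_size": effective_batch_size,
    "gradient_clipping": MAX_GRAD_NORM,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}


def warmup_lr(step: int) -> float:
    """Learning rate during warmup; the linear shape is an assumption."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


print(effective_batch_size)
print(json.dumps(ds_config, indent=2))
print(warmup_lr(20_000), warmup_lr(40_000))
```
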

| Item | Value |
|---|---|
| Mask ratio | 0.75 |
| Decoder embedding size | 512 |
| Decoder depth | 8 |
| Decoder heads | 8 |
| Normalized pixel loss | Enabled |
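
The masked-autoencoding settings above can be illustrated with a small, dependency-free sketch. It shows uniform random masking at a 0.75 ratio, per-patch pixel normalization, and a reconstruction loss averaged over masked patches only. This follows the general masked autoencoder recipe and is not taken from the Arlow training code. The function names are hypothetical, and population variance is used here for simplicity.

```python
import random
import statistics

MASK_RATIO = 0.75  # from the table above


def random_mask(num_patches: int, rng: random.Random,
                mask_ratio: float = MASK_RATIO):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def normalize_patch(pixels, eps: float = 1e-6):
    """Normalize one patch's pixel values to zero mean and unit variance."""
    mean = statistics.fmean(pixels)
    var = statistics.pvariance(pixels, mu=mean)
    return [(p - mean) / (var + eps) ** 0.5 for p in pixels]


def masked_mse(predictions, targets, masked):
    """Mean squared error, averaged over the masked patches only."""
    per_patch = [
        statistics.fmean([(a - b) ** 2 for a, b in zip(predictions[i], targets[i])])
        for i in masked
    ]
    return statistics.fmean(per_patch)


rng = random.Random(0)
visible, masked = random_mask(64, rng)
print(len(visible), len(masked))

# Toy patches: 64 patches of 4 pixel values each, normalized as loss targets.
patches = [[float(i + j) for j in range(4)] for i in range(64)]
targets = [normalize_patch(p) for p in patches]
print(masked_mse(targets, targets, masked))  # perfect reconstruction
```

With 64 patches, 48 are masked and 16 stay visible, so the encoder in this sketch would process only a quarter of the patches while the decoder reconstructs the rest.
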

| Item | Value |
|---|---|
| Main artifact to keep | checkpoint-*/vision_encoder/ |
| Matching preprocessing artifacts | image_processor/, video_processor/, processor_config.json |
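
To pick up the main artifact after training, one option is to scan the output directory for the newest `checkpoint-*/vision_encoder/` folder. The helper below is a hedged sketch. It assumes the checkpoint suffix is a numeric step count, which the source does not state, and `latest_vision_encoder` and the `outputs` directory name are hypothetical.

```python
import re
from pathlib import Path


def latest_vision_encoder(output_dir):
    """Return the vision_encoder/ dir of the highest-numbered checkpoint, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for path in root.glob("checkpoint-*/vision_encoder"):
        # Assumption: checkpoints are named checkpoint-<step> with a numeric step.
        match = re.fullmatch(r"checkpoint-(\d+)", path.parent.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# "outputs" is a placeholder; prints None if no such directory exists.
print(latest_vision_encoder("outputs"))
```

The preprocessing artifacts listed in the table should be kept alongside whichever checkpoint this returns, so that inputs are prepared the same way they were during pretraining.
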