Bowtie Transformer
Bowtie Transformer is a modified Transformer architecture that redistributes computational capacity toward semantically critical boundary layers. While the first and last layers operate at full dimensionality (d_model = 512), the 24 intermediate layers are compressed to d_small = 128 (a 4× reduction). The compute graph resembles a bowtie: wide → narrow → wide.
This design concentrates representational capacity at the input/output boundaries while using the compressed intermediate layers as a soft regularizer. Three zero-initialized residual adapters prevent information loss across dimension boundaries. In practice, Bowtie achieves 17.8% fewer parameters and lower perplexity than a standard Transformer of comparable capacity.
GitHub Repository
Architecture Overview
| Zone | Layers | Dimensionality | Role |
|---|---|---|---|
| Input (Big) | Layer 1 | d_model = 512 | Primary contextual encoding |
| Down Transition | `BottleneckProjection↓` | 512 → 128 | Space compression stabilized by RMSNorm |
| Central (Small) | Layers 2 … 25 | d_small = 128 | Iterative refinement in compressed space |
| Up Transition | `BottleneckProjection↑` | 128 → 512 | Space restoration & feature reconstruction |
| Output (Big) | Layer 26 | d_model = 512 | Final transformation before language head |
Key Innovations
- `BottleneckProjection`: Asymmetric linear projections followed by RMS normalization to stabilize activations during dimensionality transitions. Weights are not tied, allowing independent compression and reconstruction (a code sketch follows this list).
- `ResidualAdapter` with Zero-Init (γ = 0): Adaptive scaling of residual signals. Paths are effectively disabled at initialization and open gradually during training, ensuring stable gradients in deep networks.
- Three Long-Range Residual Paths:
  - `GlobalSkip`: Layer 1 → Layer 26 (d_model → d_model)
  - `EntrySkip`: Layer 1 → first small layer (d_model → d_small)
  - `ExitSkip`: Last small layer → Layer 26 (d_small → d_model)
- PreNorm structure in all blocks for robust convergence at high learning rates.
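The two building blocks above could look roughly like this, assuming a plain linear layer plus RMSNorm for the projection and a learned per-channel gate for the adapter; the names mirror the card, but the released code may differ:

```python
import torch
import torch.nn as nn


class BottleneckProjection(nn.Module):
    """Untied linear projection between widths, stabilized by RMSNorm."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.norm = nn.RMSNorm(d_out)        # PyTorch >= 2.4

    def forward(self, x):
        return self.norm(self.proj(x))


class ResidualAdapter(nn.Module):
    """Gated long-range residual: contributes nothing at init (gamma = 0)
    and opens gradually as gamma is learned."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out) if d_in != d_out else nn.Identity()
        self.gamma = nn.Parameter(torch.zeros(d_out))   # zero-init gate

    def forward(self, skip, target):
        return target + self.gamma * self.proj(skip)


# The three long-range paths would then be wired roughly as:
# global_skip = ResidualAdapter(512, 512)   # GlobalSkip: Layer 1 -> Layer 26
# entry_skip  = ResidualAdapter(512, 128)   # EntrySkip:  Layer 1 -> first small layer
# exit_skip   = ResidualAdapter(128, 512)   # ExitSkip:   last small layer -> Layer 26
```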
Comparative Analysis: Why Bowtie Works
Both models were trained under identical conditions on roneneldan/TinyStories, so the performance gap reflects the architecture rather than the training setup.
| Aspect | Standard Transformer | Bowtie Transformer | Insight |
|---|---|---|---|
| Parameters | 76.74M | 63.11M | ~17.8% reduction (~13.6M fewer weights) |
| Depth | 8 layers | 26 layers | 3× deeper → better hierarchical abstraction |
| Width Profile | Uniform (d=512) | Asymmetric (512→128→512) | Bottleneck acts as implicit regularizer |
| Final Loss | 3.1563 | 3.1246 | ~1.0% lower training loss |
| Perplexity | 23.48 | 22.75 | ~3.1% better token prediction |
Why Bowtie Outperforms at Lower Cost
- Depth > Width Hypothesis: 26 narrow layers capture higher-order dependencies more effectively than 8 wide layers. The additional depth allows the model to learn complex sequential patterns without linearly scaling parameters.
- Compression as Soft Regularization: Reducing `d_small` to 128 forces the network to discard noise and retain only informative features. This acts as an implicit regularizer, improving generalization and lowering perplexity.
- Parameter Efficiency: Fewer parameters reduce overfitting risk given the same dataset size. Each weight in Bowtie contributes more to the final representation, yielding a better "performance-per-parameter" ratio (see the rough count after this list).
- Gradient Flow Stability: Zero-initialized residual adapters (γ = 0) prevent early-training instability. Gradients flow smoothly through the compressed bottleneck while preserving the original signal via `GlobalSkip`.
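As a sanity check on the parameter-efficiency point flagged above, a back-of-the-envelope count that assumes ~12·d² weights per pre-norm block and a GPT-2-sized vocabulary with an untied LM head (neither is stated in this card) lands close to the reported totals:

```python
def block_params(d):
    """Rough cost of one pre-norm Transformer block:
    attention (4*d*d) + feed-forward with 4d hidden (8*d*d); biases/norms ignored."""
    return 12 * d * d


def embedding_params(vocab_size, d_model):
    # token embedding + untied LM head (assumed; the card does not state the vocabulary)
    return 2 * vocab_size * d_model


vocab = 50257                                   # assumed GPT-2-style vocabulary
standard = 8 * block_params(512) + embedding_params(vocab, 512)
bowtie = (2 * block_params(512)                 # wide boundary layers
          + 24 * block_params(128)              # compressed central layers
          + 2 * 512 * 128                       # down / up bottleneck projections
          + embedding_params(vocab, 512))

print(f"standard ~ {standard / 1e6:.1f}M, bowtie ~ {bowtie / 1e6:.1f}M")
# -> standard ~ 76.6M, bowtie ~ 62.6M, close to the reported 76.74M / 63.11M;
#    each 128-wide block costs (128/512)^2 = 1/16 of a 512-wide block.
```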
⚠️ Practical Nuances
- Training Dynamics: Slightly higher loss fluctuations in early steps are normal for 26-layer networks due to a more complex optimization landscape. Convergence remains stable and consistent.
- Inference Trade-offs: Fewer parameters ≠ automatically faster latency. The 26-layer depth adds sequential compute steps, but the memory footprint is significantly smaller, enabling easier deployment, batching, and quantization.
- Effective Depth: Despite 26 physical layers, the compression ratio means the representational capacity is roughly equivalent to ~3.5 full-width layers, striking a balance between capacity and efficiency.
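The ~3.5 figure follows from weighting each narrow layer by its relative parameter cost, which scales roughly with the square of its width:

```python
d_model, d_small = 512, 128
wide_layers, narrow_layers = 2, 24

# A d_small-wide block holds roughly (d_small / d_model)^2 as many weights as a
# full-width block, so count each narrow layer at that fraction of a wide layer.
effective_depth = wide_layers + narrow_layers * (d_small / d_model) ** 2
print(effective_depth)   # 2 + 24 * (1/16) = 3.5
```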
Experimental Results
Training was conducted for 1500 steps with automatic mixed precision (AMP) and identical optimization budgets.
| Model | Parameters | Layers | Final Loss | Perplexity | Δ vs Standard |
|---|---|---|---|---|---|
| Standard | 76.74M | 8 | 3.1563 | 23.48 | baseline |
| Bowtie | 63.11M | 26 (2 + 24) | 3.1246 | 22.75 | ~3.1% lower PPL |
Loss Trajectory: Bowtie establishes a stable advantage around step ~100, reduces perplexity faster, and maintains its lead through the end of training.
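Since perplexity is the exponential of the mean cross-entropy loss, the perplexity column above can be reproduced directly from the loss column:

```python
import math

for name, loss in [("Standard", 3.1563), ("Bowtie", 3.1246)]:
    print(f"{name}: perplexity = exp({loss}) = {math.exp(loss):.2f}")
# Standard: perplexity = exp(3.1563) = 23.48
# Bowtie: perplexity = exp(3.1246) = 22.75
```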
Training Configuration
| Parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_small` | 128 (compression_ratio = 4) |
| `n_layers` | 26 (2 wide + 24 narrow) |
| `n_heads` | 8 |
| `max_seq_len` | 128 |
| `batch_size` | 32 |
| `learning_rate` | 5e-4 (AdamW) |
| `precision` | AMP (FP16/BF16) |
| `dataset` | roneneldan/TinyStories |
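A sketch of a training loop matching this configuration; the GPT-2 tokenizer and the `BowtieLM` wrapper below are assumptions made for illustration, not the released training script:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer


class BowtieLM(nn.Module):
    """Token ids -> logits wrapper around the BowtieStack sketch above
    (illustrative; causal masking is still omitted, as noted there)."""

    def __init__(self, vocab_size, d_model=512, **kw):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.stack = BowtieStack(d_model=d_model, **kw)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        return self.head(self.stack(self.embed(ids)))


tokenizer = AutoTokenizer.from_pretrained("gpt2")     # assumed tokenizer
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("roneneldan/TinyStories", split="train")

device = "cuda"
model = BowtieLM(vocab_size=len(tokenizer)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scaler = torch.amp.GradScaler("cuda")
batch_size, max_seq_len = 32, 128

for step in range(1500):
    texts = dataset[step * batch_size : (step + 1) * batch_size]["text"]
    ids = tokenizer(texts, truncation=True, max_length=max_seq_len + 1,
                    padding="max_length", return_tensors="pt")["input_ids"].to(device)
    inputs, labels = ids[:, :-1], ids[:, 1:]          # next-token prediction

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):   # AMP
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1),
                               ignore_index=tokenizer.pad_token_id)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```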
Recommendations & Next Steps
- ✅ Extend Training: Bowtie shows a continued downward trend in loss; additional steps (2000+) may yield further perplexity gains.
- ✅ Benchmark Inference: Measure latency and throughput at identical batch sizes (a minimal timing sketch follows this list). Memory efficiency often outweighs minor latency overhead in real-world deployment.
- ✅ Tune `d_small`: Experiment with d_small ∈ {96, 160} to find the best compression ratio for your compute budget.
- ✅ Downstream Validation: Evaluate on held-out generation tasks to confirm that lower perplexity translates to higher fluency and coherence.
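The timing sketch referenced under Benchmark Inference could be as simple as an averaged forward pass at a fixed batch size, reusing the hypothetical `BowtieLM` interface from the training sketch:

```python
import time
import torch


@torch.no_grad()
def benchmark(model, batch_size=32, seq_len=128, vocab_size=50257, iters=50, device="cuda"):
    """Average forward latency (s) and throughput (tokens/s) on random token ids."""
    model.eval().to(device)
    ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    for _ in range(5):                        # warm-up
        model(ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(ids)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iters
    return latency, batch_size * seq_len / latency

# latency, tok_s = benchmark(bowtie_model)    # repeat with the standard baseline to compare
```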
Citation & References
If you use Bowtie Transformer in your research or products, please cite:
```bibtex
@misc{bowtie_transformer,
  author = {Samina K.},
  title  = {Bowtie Transformer: Architecture with Bottleneck Layers and Multi-Level Residual Connections},
  year   = {2026},
  url    = {https://huggingface.co/your-username/bowtie-transformer}
}
```