🏷️ Bowtie Transformer

Bowtie Transformer is a modified Transformer architecture that redistributes computational capacity toward semantically critical boundary layers. While the first and last layers operate at full dimensionality (d_model = 512), the 24 intermediate layers are compressed to d_small = 128 (4× reduction). The compute graph resembles a bowtie: wide → narrow → wide.

This design concentrates representational capacity at the input/output boundaries while the compressed intermediate layers act as a soft regularizer. Three zero-initialized residual adapters prevent information loss across dimension boundaries. In practice, Bowtie uses 17.8% fewer parameters and reaches lower perplexity than a standard Transformer of comparable capacity.

🔗 GitHub Repository

πŸ—οΈ Architecture Overview

| Zone | Layers | Dimensionality | Role |
|------|--------|----------------|------|
| Input (Big) | Layer 1 | d_model = 512 | Primary contextual encoding |
| Down Transition | BottleneckProjection ↓ | 512 → 128 | Space compression stabilized by RMSNorm |
| Central (Small) | Layers 2 … 25 | d_small = 128 | Iterative refinement in compressed space |
| Up Transition | BottleneckProjection ↑ | 128 → 512 | Space restoration & feature reconstruction |
| Output (Big) | Layer 26 | d_model = 512 | Final transformation before the language head |
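
To make the table concrete, here is a minimal PyTorch sketch of the stack (assuming PyTorch ≥ 2.4 for nn.RMSNorm and stock nn.TransformerEncoderLayer blocks; class and argument names are illustrative, not the repository's exact API, and the adapters and Entry/ExitSkip paths are covered in the next section):

```python
import torch.nn as nn

class BowtieStack(nn.Module):
    """Wide -> narrow -> wide layer stack from the table above (sketch)."""
    def __init__(self, d_model=512, d_small=128, n_small=24, n_heads=8):
        super().__init__()
        make = lambda d: nn.TransformerEncoderLayer(
            d, n_heads, dim_feedforward=4 * d, norm_first=True, batch_first=True)
        self.entry = make(d_model)                                  # Layer 1, full width
        self.small = nn.ModuleList(make(d_small) for _ in range(n_small))  # Layers 2..25
        self.exit = make(d_model)                                   # Layer 26, full width
        # Transitions: untied Linear + RMSNorm at each dimension change.
        self.down = nn.Sequential(nn.Linear(d_model, d_small), nn.RMSNorm(d_small))
        self.up = nn.Sequential(nn.Linear(d_small, d_model), nn.RMSNorm(d_model))

    def forward(self, x):                # x: (batch, seq, d_model)
        wide = self.entry(x)             # primary contextual encoding
        h = self.down(wide)              # 512 -> 128
        for blk in self.small:           # iterative refinement in compressed space
            h = blk(h)
        h = self.up(h)                   # 128 -> 512
        return self.exit(h + wide)       # GlobalSkip: Layer 1 output re-injected
```

Here GlobalSkip is a plain addition; in the full design it passes through a zero-initialized ResidualAdapter, sketched below.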

🔑 Key Innovations

  1. BottleneckProjection: asymmetric linear projections followed by RMS normalization to stabilize activations during dimensionality transitions. The down- and up-projection weights are not tied, allowing compression and reconstruction to be learned independently.
  2. ResidualAdapter with Zero-Init (γ = 0): adaptive scaling of residual signals. Each path is effectively disabled at initialization and opens gradually during training, keeping gradients stable in deep networks. (Both modules are sketched after this list.)
  3. Three Long-Range Residual Paths:
    • GlobalSkip: Layer 1 → Layer 26 (d_model → d_model)
    • EntrySkip: Layer 1 → first small layer (d_model → d_small)
    • ExitSkip: last small layer → Layer 26 (d_small → d_model)
  4. PreNorm Structure in all blocks for robust convergence at high learning rates.
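
A minimal sketch of the two modules named in items 1 and 2, again assuming PyTorch ≥ 2.4 for nn.RMSNorm; exact names and initialization details in the repository may differ:

```python
import torch
import torch.nn as nn

class BottleneckProjection(nn.Module):
    """One direction of a dimension change: untied Linear followed by RMSNorm."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)  # down/up weights are independent modules
        self.norm = nn.RMSNorm(d_out)       # stabilizes activations after the change

    def forward(self, x):
        return self.norm(self.proj(x))

class ResidualAdapter(nn.Module):
    """Long-range skip whose contribution starts at zero (gamma = 0)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out) if d_in != d_out else nn.Identity()
        self.gamma = nn.Parameter(torch.zeros(1))  # path closed at init, opens in training

    def forward(self, target, source):
        # e.g. EntrySkip: ResidualAdapter(512, 128)(first_small_input, layer1_output)
        return target + self.gamma * self.proj(source)
```

With γ = 0 the three skip paths contribute nothing at step 0, so the network starts as a plain PreNorm stack and learns how much long-range signal to admit.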

🔍 Comparative Analysis: Why Bowtie Works

Both models were trained under identical conditions on roneneldan/TinyStories. The performance gap directly reflects architectural efficiency.

| Aspect | Standard Transformer | Bowtie Transformer | Insight |
|--------|----------------------|--------------------|---------|
| Parameters | 76.74M | 63.11M | ▼ 17.8% reduction (~13.6M fewer weights) |
| Depth | 8 layers | 26 layers | 3× deeper → better hierarchical abstraction |
| Width Profile | Uniform (d = 512) | Asymmetric (512 → 128 → 512) | Bottleneck acts as an implicit regularizer |
| Final Loss | 3.1563 | 3.1246 | ▼ 1.0% lower training loss |
| Perplexity | 23.48 | 22.75 | ▼ 3.1% better token prediction |

📈 Why Bowtie Outperforms at Lower Cost

  1. Depth > Width Hypothesis: 26 narrow layers capture higher-order dependencies more effectively than 8 wide layers. The additional depth lets the model learn complex sequential patterns without linearly scaling parameters.
  2. Compression as Soft Regularization: reducing d_small to 128 forces the network to discard noise and retain only informative features. This acts as an implicit regularizer, improving generalization and lowering perplexity.
  3. Parameter Efficiency: fewer parameters reduce overfitting risk at the same dataset size, and each weight contributes more to the final representation, yielding a better "performance-per-parameter" ratio (a rough count is sketched after this list).
  4. Gradient Flow Stability: zero-initialized residual adapters (γ = 0) prevent early-training instability. Gradients flow smoothly through the compressed bottleneck while the original signal is preserved via GlobalSkip.
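
As a back-of-the-envelope illustration of point 3, assume the usual ≈ 12·d² weights per PreNorm block (≈ 4d² for attention, ≈ 8d² for a 4×-expansion FFN), ignoring biases, norms, embeddings, and the transition projections:

```python
def block_params(d):
    # ~4*d^2 attention (Q, K, V, output) + ~8*d^2 for a 4x-expansion FFN
    return 12 * d * d

wide, narrow = block_params(512), block_params(128)  # ~3.15M vs ~0.20M per layer
standard_blocks = 8 * wide                           # ~25.2M across 8 wide layers
bowtie_blocks = 2 * wide + 24 * narrow               # ~11.0M across 2 wide + 24 narrow
print(f"{standard_blocks / 1e6:.1f}M vs {bowtie_blocks / 1e6:.1f}M block parameters")
```

Per-layer cost scales quadratically with width, so 24 narrow layers cost far less than 8 wide ones; the rest of each reported total (76.74M vs 63.11M) is dominated by embedding and head weights that both models share.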

⚠️ Practical Nuances

  • Training Dynamics: slightly higher loss fluctuations in early steps are normal for 26-layer networks due to a more complex optimization landscape. Convergence remains stable and consistent.
  • Inference Trade-offs: fewer parameters ≠ automatically lower latency. The 26-layer depth adds sequential compute steps, but the memory footprint is significantly smaller, enabling easier deployment, batching, and quantization.
  • Effective Depth: per-layer parameter cost scales roughly with d², so each 128-dim layer costs about 1/16 of a 512-dim layer. The 24 narrow layers are therefore worth about 1.5 full-width layers, and with the 2 wide layers the stack has the representational budget of roughly 3.5 full-width layers, balancing capacity against efficiency.

📊 Experimental Results

Training was conducted for 1500 steps with automatic mixed precision (AMP) and identical optimization budgets.

| Model | Parameters | Layers | Final Loss | Perplexity | Δ vs Standard |
|-------|------------|--------|------------|------------|---------------|
| Standard | 76.74M | 8 | 3.1563 | 23.48 | — |
| Bowtie | 63.11M | 26 (2 + 24) | 3.1246 | 22.75 | ▼ 3.1% PPL |

📈 Loss Trajectory: Bowtie establishes a stable advantage around step ~100, reduces perplexity faster, and holds its lead through the end of training.
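
For reference, the AMP setup described above might be implemented as follows (a sketch assuming PyTorch 2.x on CUDA with FP16 autocast; the actual training script may differ):

```python
import torch
import torch.nn.functional as F

scaler = torch.amp.GradScaler("cuda")  # loss scaling keeps FP16 gradients stable

def train_step(model, optimizer, input_ids, labels):
    """One mixed-precision step: autocast forward, scaled backward, optimizer step."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(input_ids)                         # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```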


🛠️ Training Configuration

| Parameter | Value |
|-----------|-------|
| d_model | 512 |
| d_small | 128 (compression_ratio = 4) |
| n_layers | 26 (2 wide + 24 narrow) |
| n_heads | 8 |
| max_seq_len | 128 |
| batch_size | 32 |
| learning_rate | 5e-4 (AdamW) |
| precision | AMP (FP16/BF16) |
| dataset | roneneldan/TinyStories |
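
The same settings as a plain Python config; field names are illustrative, not the repository's exact class:

```python
from dataclasses import dataclass

@dataclass
class BowtieConfig:
    d_model: int = 512
    d_small: int = 128             # compression_ratio = d_model // d_small = 4
    n_layers: int = 26             # 2 wide + 24 narrow
    n_heads: int = 8
    max_seq_len: int = 128
    batch_size: int = 32
    learning_rate: float = 5e-4    # AdamW
    precision: str = "amp"         # FP16/BF16 autocast
    dataset: str = "roneneldan/TinyStories"
```

The d_small sweep suggested below is then a one-field change, e.g. BowtieConfig(d_small=96).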

🚀 Recommendations & Next Steps

✅ Extend Training: Bowtie's loss is still trending downward. Additional steps (2000+) may yield further perplexity gains.
✅ Benchmark Inference: measure latency & throughput at identical batch sizes. Memory efficiency often outweighs minor latency overhead in real-world deployment.
✅ Tune d_small: experiment with d_small ∈ {96, 160} to find the best compression ratio for your compute budget.
✅ Downstream Validation: evaluate on held-out generation tasks to confirm that lower perplexity translates to higher fluency and coherence.


📚 Citation & References

If you use Bowtie Transformer in your research or products, please cite:

@misc{bowtie_transformer,
  author       = {Samina K.},
  title        = {Bowtie Transformer: Architecture with Bottleneck Layers and Multi-Level Residual Connections},
  year         = {2026},
  url          = {https://huggingface.co/midnight-s/bowtie-transformer}
}