Update README.md
README.md CHANGED
@@ -40,7 +40,23 @@ This is the model card of a 🤗 transformers model that has been pushed on the
 ### Model Sources [optional]
 
 - **Repository:** https://huggingface.co/ae-314/promoter-gpt-ft-tata
-- **Paper Adele de Hoffer: (https://huggingface.co/blog/hugging-science/promoter-gpt)
+- **Paper by: Adele de Hoffer: (https://huggingface.co/blog/hugging-science/promoter-gpt)
+
+## Evaluation
+
+- **Test (balanced human promoters, 300 bp):** loss = **1.2884** · perplexity = **3.63**
+- **Generation (N=50):** GC% ≈ **60.6 ± 15.2**; **TATA** 4-mer ≈ **26%**; **TATAWA** ≈ **10%**; unique 6-mer ratio ≈ **0.815**; ≥6-bp homopolymer ≈ **74%**.
+- **Notes:** Perplexity is on a mixed TATA+no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation stats are unconditional (no control tokens).
+
+
+## Training Details
+- **Base:** custom Promoter-GPT (GPT-2 ~0.43M params)
+- **Data:** human promoters (300 bp), mixed `promoter_tata` + `promoter_no_tata` (positives), balanced
+- **Tokenization & context:** 3-mer WordLevel; **298 tokens** (full 300 bp; positional embeddings expanded to `n_positions=298`)
+- **Optimizer:** AdamW (weight_decay=0.01), **LR**=1e-4, cosine schedule, warmup ≈10%
+- **Batch / Accum:** 128 / 8 · **Epochs:** 3 · **Precision:** fp32
+- **Hardware:** Google Colab **T4 GPU**
+
 
 ### Direct Use
 - Unconditional generation of **synthetic human promoter-like sequences** from a short seed (research/education only).
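
Two of the figures added in this change can be sanity-checked with a few lines of Python. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats, and that the 3-mer tokenizer slides over the sequence one base at a time with no special tokens. Neither assumption is stated in the card, but both are consistent with the numbers it reports. The helper names `perplexity` and `kmer_tokens` are illustrative only.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood (assumed in nats)."""
    return math.exp(mean_nll)


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a sequence into k-mers; stride=1 (overlapping) is an assumption."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Reported test loss from the Evaluation section.
ppl = perplexity(1.2884)

# A placeholder 300 bp sequence (not real promoter data).
n_tokens = len(kmer_tokens("ACGT" * 75))

print(f"perplexity ~ {ppl:.2f}, tokens per 300 bp sequence = {n_tokens}")
```

Under these assumptions the loss of 1.2884 rounds to the reported perplexity of 3.63, and a 300 bp sequence yields 300 - 3 + 1 = 298 tokens, matching `n_positions=298`.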
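
The generation statistics in the Evaluation section (GC%, TATA and TATAWA frequency, unique 6-mer ratio, homopolymer frequency) are not defined in the card. The following is a minimal sketch of one plausible reading, not the author's evaluation script. It assumes uppercase A/C/G/T sequences, that motif and homopolymer figures are the fraction of sequences containing at least one hit, that W is the IUPAC code for A or T, and that the unique 6-mer ratio is computed per sequence and then averaged. All function names are hypothetical.

```python
import re

TATAWA = re.compile(r"TATA[AT]A")                       # IUPAC W = A or T
HOMOPOLYMER = re.compile(r"A{6,}|C{6,}|G{6,}|T{6,}")    # run of >= 6 identical bases


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def unique_kmer_ratio(seq: str, k: int = 6) -> float:
    """Distinct k-mers divided by total overlapping k-mers in one sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(set(kmers)) / len(kmers)


def summarize(seqs: list[str]) -> dict[str, float]:
    """Aggregate the per-sequence statistics over a set of generated sequences."""
    n = len(seqs)
    return {
        "gc_mean": sum(gc_percent(s) for s in seqs) / n,
        "tata_frac": sum("TATA" in s for s in seqs) / n,
        "tatawa_frac": sum(bool(TATAWA.search(s)) for s in seqs) / n,
        "unique_6mer_ratio": sum(unique_kmer_ratio(s) for s in seqs) / n,
        "homopolymer_frac": sum(bool(HOMOPOLYMER.search(s)) for s in seqs) / n,
    }


# Toy inputs for illustration only (not model output).
toy = ["GGGGGGCCTATAAAGC", "ACGTACGTACGTACGT"]
stats = summarize(toy)
print(stats)
```

Other definitions (for example, pooling 6-mers across all sequences before counting distinct ones) would give different numbers, so the reported values should only be compared against results computed the same way.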