---
library_name: transformers
license: cc-by-3.0
datasets:
- InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
base_model:
- adehoffer/promoter-gpt-model
---

# Model Card for Model ID

This is a fine-tuned (human TATA/no-TATA) model of adehoffer/Promoter-GPT-model for generating synthetic promoter sequences.

## Model Details

### Model Description

This is a fine-tuned model of Adele de Hoffer's Promoter-GPT-model for generating synthetic promoter sequences. I trained my own base model of Promoter-GPT following Adele's guide (https://huggingface.co/blog/hugging-science/promoter-gpt) and then fine-tuned it on the InstaDeepAI/nucleotide_transformer_downstream_tasks_revised TATA/no-TATA human promoter sequence datasets. Given a short seed, the model generates synthetic promoter sequences.

**Disclaimer:** Research use only; not for clinical or commercial use. Sequences require wet-lab validation; no direct improvement over the base model is claimed.

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** ae-314
- **Model type:** Causal language model (GPT-2, ~0.43M params), 3-mer WordLevel tokenizer, context length 298 tokens
- **Language(s) (NLP):** DNA (A, C, G, T)
- **License:** cc-by-nc-3.0
- **Finetuned from model:** custom base Promoter-GPT trained from scratch following https://huggingface.co/blog/hugging-science/promoter-gpt

### Model Sources

- **Repository:** https://huggingface.co/ae-314/promoter-gpt-ft-tata
- **Paper:** Adele de Hoffer, https://huggingface.co/blog/hugging-science/promoter-gpt

## Evaluation

- **Test (balanced human promoters, 300 bp):** loss = **1.2884** · perplexity = **3.63**
- **Generation (N=50):** GC% ≈ **60.6 ± 15.2**; **TATA** 4-mer ≈ **26%**; **TATAWA** ≈ **10%**; unique 6-mer ratio ≈ **0.815**; ≥6-bp homopolymer ≈ **74%**
- **Notes:** Perplexity is measured on a mixed TATA + no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation statistics are unconditional (no control tokens).

## Training Details

- **Base:** custom Promoter-GPT (GPT-2, ~0.43M params)
- **Data:** human promoters (300 bp), mixed `promoter_tata` + `promoter_no_tata` (positives), balanced
- **Tokenization & context:** 3-mer WordLevel; **298 tokens** (full 300 bp; positional embeddings expanded to `n_positions=298`)
- **Optimizer:** AdamW (weight_decay=0.01), **LR** = 1e-4, cosine schedule, warmup ≈ 10%
- **Batch size / grad. accumulation:** 128 / 8 · **Epochs:** 3 · **Precision:** fp32
- **Hardware:** Google Colab **T4 GPU**

### Direct Use

- Unconditional generation of **synthetic human promoter-like sequences** from a short seed (research/education only).
- Exploration of promoter sequence properties (e.g., GC%, k-mer distributions).

### Downstream Use

- Starting point for further fine-tuning on specific promoter subtypes or organism-specific data.
- Conditioning or control-token experiments (e.g., motif presence) in future work.

### Out-of-Scope Use

- Clinical, diagnostic, or therapeutic applications.
- Any wet-lab use without proper **biosafety review** and **experimental validation**.
- Harmful/dual-use sequence design.

## Bias, Risks, and Limitations

- Reflects the **mixed human promoter** domain (TATA + no-TATA, 300 bp); may skew **GC-rich** and show simple repeats (see the screening sketch below).
- **No experimental validation**; outputs are not guaranteed to be functional or safe.
- Improved perplexity may not correlate with biological realism.
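Because generations may skew GC-rich and contain simple repeats, it can help to screen outputs with the same kind of simple statistics reported in the Evaluation section (GC%, homopolymer runs, unique 6-mer ratio). The sketch below is illustrative only; the `screen` helper and its thresholds are hypothetical and not part of this repository or its evaluation script.

```python
import re

def gc_content(seq):
    """Percent G+C in a DNA string."""
    return 100.0 * sum(seq.count(b) for b in "GC") / max(len(seq), 1)

def longest_homopolymer(seq):
    """Length of the longest single-base run (e.g., 'AAAAAA' -> 6)."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)

def unique_kmer_ratio(seq, k=6):
    """Distinct k-mers over total k-mers; lower values mean more repetition."""
    kmers = [seq[i:i+k] for i in range(len(seq) - k + 1)]
    return len(set(kmers)) / max(len(kmers), 1)

def screen(seqs, gc_range=(30.0, 70.0), max_run=6, min_unique=0.7):
    """Keep sequences whose simple statistics look plausible (thresholds are illustrative)."""
    return [
        s for s in seqs
        if gc_range[0] <= gc_content(s) <= gc_range[1]
        and longest_homopolymer(s) <= max_run
        and unique_kmer_ratio(s) >= min_unique
    ]
```

Applied to the output of the `generate_batch` helper in the quickstart below, a filter like this discards GC-extreme or highly repetitive generations before any further analysis.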
### Recommendations

- Treat generations as **hypotheses**; validate in the wet lab.
- Use conservative sampling (lower temperature, repetition penalty) to reduce repeat bias (see the example after the quickstart code below).
- Do **not** use clinically or commercially.

## How to Get Started with the Model

```python
# ae-314
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch, re

# Load model & tokenizer
repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# K-mer helpers
def kmerize(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq)-k+1))

def dekmerize_ids(ids, k=3):
    toks = tok.convert_ids_to_tokens(ids)
    kmers = [t for t in toks if re.fullmatch(r"[ACGT]{%d}" % k, t)]
    if not kmers:
        return ""
    seq = kmers[0]
    for t in kmers[1:]:
        seq += t[-1]
    return seq

# Generate N sequences (unconditional)
def generate_batch(seed="ATGG", N=50, k=3, temperature=0.9, top_p=0.9):
    inp = tok.encode(kmerize(seed, k), return_tensors="pt").to(device)
    max_new = 298 - inp.shape[1]  # 300 bp -> 298 3-mers
    assert max_new > 0, f"Seed too long in k-mer tokens ({inp.shape[1]})"
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
    with torch.no_grad():
        outs = model.generate(
            inp,
            max_new_tokens=max_new,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=N,
            pad_token_id=pad_id,
        )
    return [dekmerize_ids(outs[i].tolist(), k) for i in range(outs.shape[0])]

# Run and show first 5
seqs = generate_batch(seed="ATGG", N=50)
for i, s in enumerate(seqs[:5], 1):
    print(f"{i:02d}: {s}")
```
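Following the Recommendations above, a more conservative generation call lowers the temperature and adds a repetition penalty. This sketch reuses `tok`, `model`, `device`, `kmerize`, and `dekmerize_ids` from the quickstart; the exact sampling values are illustrative, not tuned defaults.

```python
# Conservative sampling per the Recommendations: lower temperature,
# nucleus sampling, and a repetition penalty to reduce repeat bias.
# Values are illustrative, not tuned defaults.
inp = tok.encode(kmerize("ATGG"), return_tensors="pt").to(device)
pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
with torch.no_grad():
    outs = model.generate(
        inp,
        max_new_tokens=298 - inp.shape[1],  # stay within the 298-token context
        do_sample=True,
        temperature=0.7,
        top_p=0.85,
        repetition_penalty=1.2,
        num_return_sequences=10,
        pad_token_id=pad_id,
    )
conservative = [dekmerize_ids(o.tolist()) for o in outs]
```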