---
library_name: transformers
license: cc-by-3.0
datasets:
- InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
base_model:
- adehoffer/promoter-gpt-model
---

# Model Card for Model ID

This is a fine-tuned (human TATA/no-TATA) model of adehoffer/Promoter-GPT-model for generating synthetic promoter sequences.

## Model Details

### Model Description

This is a fine-tuned model of Adele de Hoffer's Promoter-GPT-model for generating synthetic promoter sequences. I trained my own base model of Promoter-GPT following Adele's guide (https://huggingface.co/blog/hugging-science/promoter-gpt) and then fine-tuned it on the InstaDeepAI/nucleotide_transformer_downstream_tasks_revised TATA/no-TATA human promoter sequence datasets. Given a short seed, the model generates synthetic promoter sequences.

**Disclaimer:** Research use only; not for clinical or commercial use. Sequences require wet-lab validation; no direct improvement over the base model is claimed.

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** ae-314
- **Model type:** Causal language model (GPT-2, ~0.43M params), 3-mer WordLevel tokenizer, context length 298 tokens
- **Language(s) (NLP):** DNA (A, C, G, T)
- **License:** cc-by-nc-3.0
- **Finetuned from model:** custom base Promoter-GPT trained from scratch following https://huggingface.co/blog/hugging-science/promoter-gpt

### Model Sources

- **Repository:** https://huggingface.co/ae-314/promoter-gpt-ft-tata
- **Paper:** Adele de Hoffer, https://huggingface.co/blog/hugging-science/promoter-gpt

## Evaluation

- **Test (balanced human promoters, 300 bp):** loss = **1.2884** · perplexity = **3.63**
- **Generation (N=50):** GC% ≈ **60.6 ± 15.2**; **TATA** 4-mer ≈ **26%**; **TATAWA** ≈ **10%**; unique 6-mer ratio ≈ **0.815**; ≥6-bp homopolymer ≈ **74%**
- **Notes:** Perplexity is measured on a mixed TATA + no-TATA domain (not directly comparable to an AT-rich 200 bp setup). Generation statistics are unconditional (no control tokens).

## Training Details

- **Base:** custom Promoter-GPT (GPT-2, ~0.43M params)
- **Data:** human promoters (300 bp), mixed `promoter_tata` + `promoter_no_tata` (positives), balanced
- **Tokenization & context:** 3-mer WordLevel; **298 tokens** (full 300 bp; positional embeddings expanded to `n_positions=298`)
- **Optimizer:** AdamW (weight_decay=0.01), **LR** = 1e-4, cosine schedule, warmup ≈ 10%
- **Batch size / grad. accumulation:** 128 / 8 · **Epochs:** 3 · **Precision:** fp32
- **Hardware:** Google Colab **T4 GPU**

### Direct Use

- Unconditional generation of **synthetic human promoter-like sequences** from a short seed (research/education only).
- Exploration of promoter sequence properties (e.g., GC%, k-mer distributions).

### Downstream Use

- Starting point for further fine-tuning on specific promoter subtypes or organism-specific data.
- Conditioning or control-token experiments (e.g., motif presence) in future work.

### Out-of-Scope Use

- Clinical, diagnostic, or therapeutic applications.
- Any wet-lab use without proper **biosafety review** and **experimental validation**.
- Harmful/dual-use sequence design.

## Bias, Risks, and Limitations

- Reflects the **mixed human promoter** domain (TATA + no-TATA, 300 bp); may skew **GC-rich** and show simple repeats (see the screening sketch below).
- **No experimental validation**; outputs are not guaranteed to be functional or safe.
- Improved perplexity may not correlate with biological realism.
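Because generations may skew GC-rich and contain simple repeats, it can help to screen outputs with the same kind of simple statistics reported in the Evaluation section (GC%, homopolymer runs, unique 6-mer ratio). The sketch below is illustrative only; the `screen` helper and its thresholds are hypothetical and not part of this repository or its evaluation script.

```python
import re

def gc_content(seq):
    """Percent G+C in a DNA string."""
    return 100.0 * sum(seq.count(b) for b in "GC") / max(len(seq), 1)

def longest_homopolymer(seq):
    """Length of the longest single-base run (e.g., 'AAAAAA' -> 6)."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)

def unique_kmer_ratio(seq, k=6):
    """Distinct k-mers over total k-mers; lower values mean more repetition."""
    kmers = [seq[i:i+k] for i in range(len(seq) - k + 1)]
    return len(set(kmers)) / max(len(kmers), 1)

def screen(seqs, gc_range=(30.0, 70.0), max_run=6, min_unique=0.7):
    """Keep sequences whose simple statistics look plausible (thresholds are illustrative)."""
    return [
        s for s in seqs
        if gc_range[0] <= gc_content(s) <= gc_range[1]
        and longest_homopolymer(s) <= max_run
        and unique_kmer_ratio(s) >= min_unique
    ]
```

Applied to the output of the `generate_batch` helper in the quickstart below, a filter like this discards GC-extreme or highly repetitive generations before any further analysis.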
### Recommendations

- Treat generations as **hypotheses**; validate in the wet lab.
- Use conservative sampling (lower temperature, repetition penalty) to reduce repeat bias (see the example after the quickstart code below).
- Do **not** use clinically or commercially.

## How to Get Started with the Model

```python
# ae-314
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch, re

# Load model & tokenizer
repo_id = "ae-314/promoter-gpt-ft-tata"
tok = PreTrainedTokenizerFast.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# K-mer helpers
def kmerize(seq, k=3):
    return " ".join(seq[i:i+k] for i in range(len(seq)-k+1))

def dekmerize_ids(ids, k=3):
    toks = tok.convert_ids_to_tokens(ids)
    kmers = [t for t in toks if re.fullmatch(r"[ACGT]{%d}" % k, t)]
    if not kmers:
        return ""
    seq = kmers[0]
    for t in kmers[1:]:
        seq += t[-1]
    return seq

# Generate N sequences (unconditional)
def generate_batch(seed="ATGG", N=50, k=3, temperature=0.9, top_p=0.9):
    inp = tok.encode(kmerize(seed, k), return_tensors="pt").to(device)
    max_new = 298 - inp.shape[1]  # 300 bp -> 298 3-mers
    assert max_new > 0, f"Seed too long in k-mer tokens ({inp.shape[1]})"
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
    with torch.no_grad():
        outs = model.generate(
            inp,
            max_new_tokens=max_new,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            num_return_sequences=N,
            pad_token_id=pad_id,
        )
    return [dekmerize_ids(outs[i].tolist(), k) for i in range(outs.shape[0])]

# Run and show first 5
seqs = generate_batch(seed="ATGG", N=50)
for i, s in enumerate(seqs[:5], 1):
    print(f"{i:02d}: {s}")
```
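Following the Recommendations above, a more conservative generation call lowers the temperature and adds a repetition penalty. This sketch reuses `tok`, `model`, `device`, `kmerize`, and `dekmerize_ids` from the quickstart; the exact sampling values are illustrative, not tuned defaults.

```python
# Conservative sampling per the Recommendations: lower temperature,
# nucleus sampling, and a repetition penalty to reduce repeat bias.
# Values are illustrative, not tuned defaults.
inp = tok.encode(kmerize("ATGG"), return_tensors="pt").to(device)
pad_id = tok.pad_token_id if tok.pad_token_id is not None else (tok.eos_token_id or 0)
with torch.no_grad():
    outs = model.generate(
        inp,
        max_new_tokens=298 - inp.shape[1],  # stay within the 298-token context
        do_sample=True,
        temperature=0.7,
        top_p=0.85,
        repetition_penalty=1.2,
        num_return_sequences=10,
        pad_token_id=pad_id,
    )
conservative = [dekmerize_ids(o.tolist()) for o in outs]
```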