| --- |
| license: apache-2.0 |
| --- |
| |
|  |
|
|
| # PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🌌 |
|
|
| This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation. |
|
|
| ## Table of Contents 🌟 |
|
|
| - [Quick start](#quick-start) |
| - [Installation](#installation) |
| - [Repository Structure](#repository-structure) |
| - [Training data collection](#training-data-collection) |
| - [Best model list](#best-model-list) |
| - [Full model set (cuML-enabled)](#full-model-set-cuml-enabled) |
| - [Minimal deployable model set (no cuML)](#minimal-deployable-model-set-no-cuml) |
| - [Usage](#usage) |
| - [Local Application Hosting](#local-application-hosting) |
| - [Dataset integration](#dataset-integration) |
| - [Training](#training) |
| - [Quick inference by property per model](#quick-inference-by-property-per-model) |
| - [Interpretation](#interpretation) |
| - [Model Architecture](#model-architecture) |
| - [Troubleshooting](#troubleshooting) |
| - [Citation](#citation) |
|
|
| ## Quick Start 🌟 |
| - Lightweight start (basic models, no cuML; see below for details) |
| ```bash |
| # Clone without checking out any files (LFS included); the folder will look empty at first |
| git clone --no-checkout https://huggingface.co/ChatterjeeLab/PeptiVerse |
| cd PeptiVerse |
| |
| # Enable sparse checkout |
| git sparse-checkout init --cone |
| |
| # Select only the items to download |
| git sparse-checkout set \ |
| inference.py \ |
| download_light.py \ |
| best_models.txt \ |
| basic_models.txt \ |
| requirements.txt \ |
| tokenizer \ |
| README.md |
| |
| # Now checkout |
| GIT_LFS_SKIP_SMUDGE=1 git checkout |
| |
| # Install basic pkgs |
| pip install -r requirements.txt |
| |
| # Download the basic model weights listed in basic_models.txt. Adjust the configuration as needed. |
| python download_light.py |
| |
| # Test in inference |
| python inference.py |
| ``` |
| - Full model clone (will clone all best model weights) |
| ```bash |
| # Clone repository |
| git clone https://huggingface.co/ChatterjeeLab/PeptiVerse |
| |
| # Install dependencies |
| pip install -r requirements.txt |
| |
| # Run inference |
| python inference.py |
| ``` |
| > **Note:** This clones best model weights only. For full access: |
| > - **All model weights** (best + seed ensembles for uncertainty quantification): [Zenodo](https://zenodo.org/records/19989009) |
| > - **Training datasets** (embeddings + splits): [HuggingFace Dataset](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data) |
|
|
| ## Installation 🌟 |
| ### Minimal Setup |
| - Easy start-up environment (using transformers, xgboost models) |
| ```bash |
| pip install -r requirements.txt |
| ``` |
| ### Full Setup |
| - Access to the trained SVM and ElasticNet models additionally requires `RAPIDS cuML`; installation instructions are available on the official [GitHub page](https://github.com/rapidsai/cuml) (**CUDA-capable GPU required**). |
| - Optional: a pre-built Singularity/Apptainer image (5.68 GB) with all dependencies is available on [Google Drive](https://drive.google.com/file/d/1ybLJNTC3BITIqBd8IO09nOOm4PKwD4iS/view?usp=sharing) (a CUDA-capable GPU is still needed to load the cuML models). Verify the download with `shasum -a 256 peptiverse.sif`; the expected SHA256 is `48619796ef0adc81bc420021821e5ee3d9b2176bf1f564104e06dc1ce56b3498`. |
| ```bash |
| # test |
| apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)" |
| |
| # run inference (see below) |
| apptainer exec --nv peptiverse.sif python inference.py |
| ``` |
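Where `shasum` is not available, the same checksum comparison can be done from Python. The following is a minimal sketch using only the standard library; the helper name `sha256_of` is illustrative and not part of PeptiVerse.

```python
import hashlib
from pathlib import Path

# Expected digest, copied from the Full Setup notes above.
EXPECTED_SHA256 = "48619796ef0adc81bc420021821e5ee3d9b2176bf1f564104e06dc1ce56b3498"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-GB image never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    image = Path("peptiverse.sif")
    if image.exists():
        ok = sha256_of(image) == EXPECTED_SHA256
        print("checksum OK" if ok else "checksum MISMATCH")
    else:
        print("peptiverse.sif not found in the current directory")
```

Reading in chunks keeps memory use flat, which matters for an image of this size.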
| ## Repository Structure 🌟 |
| This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1) |
| |
| ``` |
| PeptiVerse/ |
| ├── training_data_cleaned/ # Processed datasets with embeddings |
| │ └── <property>/ # Property-specific data |
| │ ├── train/val splits |
| │ └── precomputed embeddings |
| ├── training_classifiers/ # Trained model weights |
| │ └── <property>/ |
| │ ├── cnn_wt/ # CNN architectures |
| │ ├── mlp_wt/ # MLP architectures |
| │ └── xgb_wt/ # XGBoost models |
| ├── tokenizer/ # PeptideCLM tokenizer |
| ├── training_data/ # Raw training data |
| ├── inference.py # Main prediction interface |
| ├── best_models.txt # Model selection manifest |
| └── requirements.txt # Python dependencies |
| ``` |
| For full data access, please download the corresponding `training_data_cleaned` and `training_classifiers` folders from the [Hugging Face Dataset](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data) and [Zenodo](https://zenodo.org/records/19989009). The current Hugging Face repo hosts only the best model weights and metadata with split labels. |
|
|
| 1. Download and extract the Zenodo archive. |
| 2. Download or clone this repository. |
| 3. Copy/merge the repository `training_classifiers/` contents into the extracted Zenodo `training_classifiers/` directory, preserving the folder structure: |
|    `rsync -av --ignore-existing training_classifiers/ /path/to/zenodo_extracted/training_classifiers/` |
| 4. Do not replace the entire Zenodo folder blindly; merge files so that large training outputs from Zenodo and updated best-model weights from this repository coexist. |
|
|
| ## Training Data Collection 🌟 |
|
|
| <table> |
| <caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption> |
| <thead> |
| <tr> |
| <th rowspan="2"><strong>Properties</strong></th> |
| <th colspan="2"><strong>Amino Acid Sequences</strong></th> |
| <th colspan="2"><strong>SMILES Sequences</strong></th> |
| </tr> |
| <tr> |
| <th><strong>0</strong></th> |
| <th><strong>1</strong></th> |
| <th><strong>0</strong></th> |
| <th><strong>1</strong></th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td colspan="5"><strong>Classification</strong></td> |
| </tr> |
| <tr> |
| <td>Hemolysis</td> |
| <td>4765</td> |
| <td>1311</td> |
| <td>4765</td> |
| <td>1311</td> |
| </tr> |
| <tr> |
| <td>Non-Fouling</td> |
| <td>13580</td> |
| <td>3600</td> |
| <td>13580</td> |
| <td>3600</td> |
| </tr> |
| <tr> |
| <td>Solubility</td> |
| <td>9668</td> |
| <td>8785</td> |
| <td>9668</td> |
| <td>8785</td> |
| </tr> |
| <tr> |
| <td>Permeability (Penetrance)</td> |
| <td>1162</td> |
| <td>1162</td> |
| <td>1162</td> |
| <td>1162</td> |
| </tr> |
| <tr> |
| <td>Toxicity</td> |
| <td>-</td> |
| <td>-</td> |
| <td>5518</td> |
| <td>5518</td> |
| </tr> |
| <tr> |
| <td colspan="5"><strong>Regression (N)</strong></td> |
| </tr> |
| <tr> |
| <td>Permeability (PAMPA)</td> |
| <td colspan="2" align="center">-</td> |
| <td colspan="2" align="center">6869</td> |
| </tr> |
| <tr> |
| <td>Permeability (CACO2)</td> |
| <td colspan="2" align="center">-</td> |
| <td colspan="2" align="center">606</td> |
| </tr> |
| <tr> |
| <td>Half-Life</td> |
| <td colspan="2" align="center">130</td> |
| <td colspan="2" align="center">245</td> |
| </tr> |
| <tr> |
| <td>Binding Affinity</td> |
| <td colspan="2" align="center">1436</td> |
| <td colspan="2" align="center">1597</td> |
| </tr> |
| </tbody> |
| </table> |
| |
|
|
| ## Best Model List 🌟 |
|
|
| ### Full model set (cuML-enabled) |
| | Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) | |
| |---|---|---|---|---|---| |
| | Hemolysis | SVM | CNN (chemberta) | Classifier | 0.2521 | 0.564 | |
| | Non-Fouling | Transformer | ENET (peptideclm) | Classifier | 0.712 | 0.6969 | |
| | Solubility | CNN | Transformer (peptideclm) | Classifier | 0.377 | 0.329 | |
| | Permeability (Penetrance) | SVM | SVM (chemberta) | Classifier | 0.5493 | 0.573 | |
| | Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 | |
| | Binding Affinity | pooled | pooled (chemberta) | Regression | – | – | |
| | Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – | |
| | Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – | |
| | Half-life | Transformer | XGB (peptideclm) | Regression | – | – | |
|
|
| >Note: *unpooled* indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations. |
|
|
| ### Minimal deployable model set (no cuML) |
| | Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) | |
| |---|---|---|---|---|---| |
| | Hemolysis | XGB | CNN (chemberta) | Classifier | 0.2801 | 0.564 | |
| | Non-Fouling | Transformer | XGB (peptideclm) | Classifier | 0.712 | 0.3892 | |
| | Solubility | CNN | Transformer (peptideclm) | Classifier | 0.377 | 0.329 | |
| | Permeability (Penetrance) | XGB | XGB (chemberta) | Classifier | 0.4301 | 0.5028 | |
| | Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 | |
| | Binding Affinity | pooled | pooled (chemberta) | Regression | – | – | |
| | Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – | |
| | Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – | |
| | Half-life | Transformer | XGB (peptideclm) | Regression | – | – | |
| >Note: Models marked as SVM or ENET are replaced with XGB as these models are not currently supported in the deployment environment without cuML setups. |
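The thresholds in the table turn a classifier score into a 0/1 label. The sketch below illustrates that step for the minimal deployable set. The lookup table `THRESHOLDS`, the helper `to_label`, and the `score >= threshold` comparison direction are assumptions made for illustration; `inference.py` remains the reference for the actual behavior.

```python
# Thresholds copied from the minimal deployable table above, keyed by
# (property key, mode). "wt" = sequence-based, "smiles" = SMILES-based.
THRESHOLDS = {
    ("hemolysis", "wt"): 0.2801,
    ("hemolysis", "smiles"): 0.564,
    ("nf", "wt"): 0.712,
    ("nf", "smiles"): 0.3892,
    ("solubility", "wt"): 0.377,
    ("solubility", "smiles"): 0.329,
    ("permeability_penetrance", "wt"): 0.4301,
    ("permeability_penetrance", "smiles"): 0.5028,
    ("toxicity", "smiles"): 0.49,
}


def to_label(prop, mode, score):
    """Binarize a classifier score (hypothetical helper).

    Assumption: a score at or above the threshold maps to label 1.
    """
    return int(score >= THRESHOLDS[(prop, mode)])


# The same score can yield different labels for different properties,
# because each model has its own tuned threshold.
print(to_label("hemolysis", "wt", 0.30))  # 1
print(to_label("nf", "wt", 0.30))         # 0
```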
|
|
|
|
| ## Usage 🌟 |
|
|
| ### Local Application Hosting |
| - Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources. |
| ```bash |
| git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse |
| cd PeptiVerse |
| |
| # Configure models in best_models.txt, then launch the app |
| python app.py |
| ``` |
| ### Data pre-processing |
| Under `training_data_cleaned`, we provide the generated embeddings in Hugging Face dataset format. The following scripts are the steps used to generate the data. |
|
|
| ### Dataset integration |
| - All processed training datasets are available at [ChatterjeeLab/PeptiVerse\_data](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data). |
| - Selectively download the data you need with `huggingface-cli`: |
| ```bash |
| # --repo-type dataset: PeptiVerse_data is a dataset repo |
| # --include: only this folder; --exclude: skip weights/artifacts |
| # --local-dir-use-symlinks False: make real copies |
| huggingface-cli download ChatterjeeLab/PeptiVerse_data \ |
| --repo-type dataset \ |
| --include "training_data_cleaned/**" \ |
| --exclude "**/*.pt" "**/*.joblib" \ |
| --local-dir PeptiVerse_data \ |
| --local-dir-use-symlinks False |
| ``` |
| - Or in python |
| ```python |
| from huggingface_hub import snapshot_download |
| |
| local_dir = snapshot_download( |
| repo_id="ChatterjeeLab/PeptiVerse_data", |
| repo_type="dataset", # PeptiVerse_data is a dataset repo |
| allow_patterns=["training_data_cleaned/**"], # only this folder |
| ignore_patterns=["**/*.pt", "**/*.joblib"], # skip weights/artifacts |
| local_dir="PeptiVerse_data", |
| local_dir_use_symlinks=False, # make real copies |
| ) |
| print("Downloaded to:", local_dir) |
| ``` |
| - Usage of the huggingface datasets (with pre-computed embeddings and splits) |
| - All embedding datasets are saved via `DatasetDict.save_to_disk` and loadable with: |
| ```python |
| from datasets import load_from_disk |
| # PATH: a dataset folder saved under training_data_cleaned/<property>/ |
| ds = load_from_disk(PATH) |
| train_ds = ds["train"] |
| val_ds = ds["val"] |
| ``` |
| - A) Sequence Based ([ESM-2](https://huggingface.co/facebook/esm2_t33_650M_UR50D) embeddings) |
| - Pooled (fixed-length vector per sequence) |
| - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding. |
| - Each item: |
| sequence: `str` |
| label: `int` (classification) or `float` (regression) |
| embedding: `float32[H]` (H=1280 for ESM-2 650M) |
| - Unpooled (variable-length token matrix) |
| - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix. |
| - Each item: |
| sequence: `str` |
| label: `int` (classification) or `float` (regression) |
| embedding: `float16[L, H]` (nested lists) |
| attention_mask: `int8[L]` |
| length: `int` (=L) |
| - B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings) |
| - Pooled (fixed-length vector per sequence) |
| - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding. |
| - Each item: |
| sequence: `str` (SMILES) |
| label: `int` (classification) or `float` (regression) |
| embedding: `float32[H]` |
| - Unpooled (variable-length token matrix) |
| - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix. |
| - Each item: |
| sequence: `str` (SMILES) |
| label: `int` (classification) or `float` (regression) |
| embedding: `float16[L, H]` (nested lists) |
| attention_mask: `int8[L]` |
| length: `int` (=L) |
| - C) SMILES-based ([ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM) embeddings) |
| - Pooled (fixed-length vector per sequence) |
| - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding. |
| - Each item: |
| sequence: `str` (SMILES) |
| label: `int` (classification) or `float` (regression) |
| embedding: `float32[H]` |
| - Unpooled (variable-length token matrix) |
| - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix. |
| - Each item: |
| sequence: `str` (SMILES) |
| label: `int` (classification) or `float` (regression) |
| embedding: `float16[L, H]` (nested lists) |
| attention_mask: `int8[L]` |
| length: `int` (=L) |
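The relationship between the unpooled and pooled formats can be sketched in plain Python. This is an illustrative toy, not the preprocessing code used to build the datasets: the helper `mean_pool` is hypothetical, and the example assumes a padded row (mask value 0) such as one produced by batching.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the rows of an [L, H] matrix whose mask entry is 1.

    Rows with mask 0 (padding) are ignored, mirroring the
    "mean-pooling excluding padding" description above.
    """
    kept = [row for row, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("no valid tokens to pool")
    hidden = len(kept[0])
    return [sum(row[j] for row in kept) / len(kept) for j in range(hidden)]


# Toy record with H=2: two valid tokens followed by one padding row.
item = {
    "embedding": [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    "attention_mask": [1, 1, 0],
}
pooled = mean_pool(item["embedding"], item["attention_mask"])
print(pooled)  # [2.0, 3.0]
```

The pooled vector has a fixed length H regardless of sequence length, which is why the classical ML models (XGBoost, SVM, ElasticNet) consume pooled records while the CNN/Transformer models consume the unpooled matrices.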
| ### Training |
| Under the `training_classifiers` folder, we provide the Python scripts used to train the different models. Notes on these scripts: |
| 1. They read the pre-processed Hugging Face datasets from the `training_data_cleaned` folder. |
| 2. Each call runs an Optuna hyperparameter sweep. |
| 3. All training was conducted on an HPC cluster, with SLURM scripts under the `training_classifiers/src` folder. |
| 4. Individual model training scripts can be customized or run in isolation as needed. |
| ##### Example of training |
| ###### ML models |
| ``` |
| HOME_LOC=/home |
| SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers |
| EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned |
| |
| OBJECTIVE='hemolysis' # nf/solubility/hemolysis/permeability_pampa/permeability_caco2 |
| WT='smiles' # wt/smiles |
| DATA_FILE="hemo_${WT}_with_embeddings" |
| LOG_LOC=$SCRIPT_LOC/src/logs |
| DATE=$(date +%m_%d) |
| MODEL_TYPE='svm_gpu' # xgb/enet_gpu/svm_gpu |
| SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}_new" |
|
|
| # Create log directory if it doesn't exist |
| mkdir -p $LOG_LOC |
| |
| cd $SCRIPT_LOC |
|
|
| python -u train_ml.py \ |
| --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \ |
| --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \ |
| --model "${MODEL_TYPE}" \ |
| --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1 |
| ``` |
| ###### DNN models |
| ``` |
| HOME_LOC=/home |
| SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers |
| EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned |
| |
| OBJECTIVE='nf' # nf/solubility/hemolysis |
| WT='smiles' #wt/smiles |
| DATA_FILE="nf_${WT}_with_embeddings_unpooled" |
| LOG_LOC=$SCRIPT_LOC/src/logs |
| DATE=$(date +%m_%d) |
| MODEL_TYPE='cnn' #mlp/cnn/transformer |
| SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}" |
|
|
| # Create log directory if it doesn't exist |
| mkdir -p $LOG_LOC |
| |
| cd $SCRIPT_LOC |
|
|
| python -u train_nn.py \ |
| --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \ |
| --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \ |
| --model "${MODEL_TYPE}" \ |
| --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1 |
| ``` |
| ###### Binding Affinity |
| ``` |
| HOME_LOC=/home |
| SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers |
| EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned |
| |
| OBJECTIVE='binding_affinity' |
| BINDER_MODEL='chemberta' # peptideclm / chemberta |
| STATUS='unpooled' # pooled / unpooled |
| TYPE='smiles' |
| DATA_FILE="pair_wt_${TYPE}_${STATUS}" # double quotes so the variables expand |
| |
| LOG_LOC=$SCRIPT_LOC/src/logs |
| DATE=$(date +%m_%d) |
| SPECIAL_PREFIX="${OBJECTIVE}-${BINDER_MODEL}-${STATUS}" |
|
| # Create log directory if it doesn't exist |
| mkdir -p $LOG_LOC |
| |
| cd $SCRIPT_LOC |
|
| python -u binding_training.py \ |
| --dataset_path "${EMB_LOC}/${OBJECTIVE}/${BINDER_MODEL}/${DATA_FILE}" \ |
| --mode "${STATUS}" \ |
| --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${BINDER_MODEL}_${TYPE}_${STATUS}" \ |
| --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1 |
| ``` |
| |
| ### Quick inference by property per model |
| ```python |
| from inference import PeptiVersePredictor |
| from pathlib import Path |
|
|
| root = Path(__file__).resolve().parent # current script folder |
|
|
| |
| predictor = PeptiVersePredictor( |
| manifest_path=root / "best_models.txt", |
| classifier_weight_root=root, |
| device="cuda", # or "cpu" |
| ) |
| |
| # mode: smiles (SMILES-based models) / wt (Sequence-based models) |
| # property keys (with some level of name normalization) |
| # hemolysis |
| # nf (Non-Fouling) |
| # solubility |
| # permeability_penetrance |
| # toxicity |
| # permeability_pampa |
| # permeability_caco2 |
| # halflife |
| # binding_affinity |
|
|
| seq = "GIVEQCCTSICSLYQLENYCN" |
| smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O" |
|
|
| # Hemolysis |
| out = predictor.predict_property("hemolysis", mode="wt", input_str=seq) |
| print(out) |
| # {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...} |
|
| out = predictor.predict_property("hemolysis", mode="smiles", input_str=smiles) |
| print(out) |
|
| # Non-fouling (key is nf) |
| out = predictor.predict_property("nf", mode="wt", input_str=seq) |
| print(out) |
|
| out = predictor.predict_property("nf", mode="smiles", input_str=smiles) |
| print(out) |
|
| # Solubility (sequence mode shown) |
| out = predictor.predict_property("solubility", mode="wt", input_str=seq) |
| print(out) |
|
| # Permeability (Penetrance) (sequence mode shown) |
| out = predictor.predict_property("permeability_penetrance", mode="wt", input_str=seq) |
| print(out) |
| |
| # Toxicity (SMILES-only) |
| out = predictor.predict_property("toxicity", mode="smiles", input_str=smiles) |
| print(out) |
| |
| # Permeability (PAMPA) (SMILES regression) |
| out = predictor.predict_property("permeability_pampa", mode="smiles", input_str=smiles) |
| print(out) |
| # {"property":"permeability_pampa","mode":"smiles","score":value} |
| |
| # Permeability (Caco-2) (SMILES regression) |
| out = predictor.predict_property("permeability_caco2", mode="smiles", input_str=smiles) |
| print(out) |
|
| # Half-life (sequence-based + SMILES regression) |
| out = predictor.predict_property("halflife", mode="wt", input_str=seq) |
| print(out) |
|
| out = predictor.predict_property("halflife", mode="smiles", input_str=smiles) |
| print(out) |
|
| # Binding Affinity |
| protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..." # target protein |
| peptide_seq = "GIVEQCCTSICSLYQLENYCN" |
| |
| out = predictor.predict_binding_affinity( |
| mode="wt", |
| target_seq=protein, |
| binder_str=peptide_seq, |
| ) |
| print(out) |
| # { |
| # "property":"binding_affinity", |
| # "mode":"wt", |
| # "affinity": float, |
| # "class_by_threshold": "High (≥9)" / "Moderate (7-9)" / "Low (<7)", |
| # "class_by_logits": same buckets, |
| # "binding_model": "pooled" or "unpooled", |
| # } |
| |
| ``` |
| |
| #### Advanced inference with uncertainty prediction |
| Uncertainty prediction is exposed as a parameter in the inference code. The full classifier folder from [Zenodo](https://zenodo.org/records/19989009) is required to enable this functionality. Model uncertainty is produced by the scripts under the `training_classifiers` folder whose names start with "**refit**". A detailed description can be found in the methodology section of the manuscript. |
| At inference time, PeptiVersePredictor returns an `uncertainty` field with every prediction when `uncertainty=True` is passed. The method and interpretation depend on the model class, determined automatically at inference time. |
| ```python |
| seq = "GIGAVLKVLTTGLPALISWIKRKRQQ" |
| smiles = "C(C)C[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H](CC(C)C)NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@H](C)NC(=O)[C@H](Cc2ccccc2)NC1=O" |
|
|
| print(predictor.predict_property("nf", "wt", seq, uncertainty=True)) |
| print(predictor.predict_property("nf", "smiles", smiles, uncertainty=True)) |
|
| # Example output: |
| # {'property': 'nf', 'col': 'wt', 'score': 0.00014520535252195523, 'emb_tag': 'wt', 'label': 0, 'threshold': 0.57, 'uncertainty': 0.0017192508727321288, 'uncertainty_type': 'ensemble_predictive_entropy'} |
| # {'property': 'nf', 'col': 'smiles', 'score': 0.025485480204224586, 'emb_tag': 'peptideclm', 'label': 0, 'threshold': 0.6969, 'uncertainty': 0.11868063130587676, 'uncertainty_type': 'binary_predictive_entropy_single_model'} |
| ``` |
| |
| --- |
| |
| ##### Method by Model Class |
| |
| | Model Class | Task | Uncertainty Method | Output Type | Range | |
| |---|---|---|---|---| |
| | MLP, CNN, Transformer | Classifier | Deep ensemble predictive entropy (5 seeds) | `float` | [0, ln(2) ≈ 0.693] | |
| | MLP, CNN, Transformer | Regression | Adaptive conformal interval; falls back to ensemble std if no MAPIE bundle | `(lo, hi)` or `float` | unbounded | |
| | SVM / SVC / XGBoost | Classifier | Binary predictive entropy (sigmoid of decision function) | `float` | [0, ln(2) ≈ 0.693] | |
| | SVR / ElasticNet / XGBoost | Regression | Adaptive conformal interval | `(lo, hi)` | unbounded | |
| |
| > **Uncertainty is `None`** when: a DNN classifier has no seed ensemble trained, or a regression model has no `mapie_calibration.joblib` in its model directory. |
| |
| --- |
| ## Interpretation 🌟 |
| |
| You can also find the same description in the paper or in the PeptiVerse app `Documentation` tab. |
| |
| --- |
| |
| ### 🩸 Hemolysis Prediction<br> |
| HC50 is the peptide concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are considered hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.<br> |
| |
| **Output interpretation:**<br> |
| |
| - Score close to 1.0 = high probability of red blood cell membrane disruption |
| - Score close to 0.0 = non-hemolytic |
| |
| --- |
| |
| ### 💧 Solubility Prediction<br> |
| Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.<br> |
| |
| **Output interpretation:**<br> |
| |
| - Score close to 1.0 = highly soluble |
| - Score close to 0.0 = poorly soluble |
| |
| --- |
| |
| ### 👯 Non-Fouling Prediction<br> |
| Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.<br> |
| |
| **Output interpretation:**<br> |
| |
| - Score close to 1.0 = non-fouling |
| - Score close to 0.0 = fouling |
| |
| --- |
| |
| ### 🪣 Permeability Prediction<br> |
| Predicts membrane permeability on a log P scale.<br> |
| |
| **Output interpretation:**<br> |
| |
| - Higher values = more permeable (values > -6.0 are considered permeable) |
| - Penetrance is a classification task: the output is a probability in the [0, 1] range, and values closer to 1 indicate higher permeability. |
| |
| --- |
| |
| ### ⏱️ Half-Life Prediction<br> |
| **Interpretation:** Predicted values are half-lives in hours and reflect relative peptide stability. Higher values indicate longer persistence in serum, while lower values suggest faster degradation.<br> |
| |
| --- |
| |
| ### ☠️ Toxicity Prediction<br> |
| **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.<br> |
| |
| --- |
| |
| ### 🔗 Binding Affinity Prediction <br> |
| |
| Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.<br> |
| |
| **Interpretation:**<br> |
| |
| - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br> |
| - Scores between 7 and 9 correspond to medium binders (10⁻⁹–10⁻⁷ M, nanomolar to sub-micromolar range)<br> |
| - Scores < 7 correspond to weak binders (K > 10⁻⁷ M, sub-micromolar and weaker)<br> |
| - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br> |
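The buckets above can be written as a small helper. This sketch assumes the score is a negative base-10 logarithm of the binding constant in mol/L, which is consistent with the "one unit is about tenfold" rule but is not stated explicitly here; the function names are hypothetical, and the bucket labels mirror the `class_by_threshold` strings shown in the inference example.

```python
def affinity_class(score):
    """Map a predicted affinity score to the buckets listed above."""
    if score >= 9:
        return "High (≥9)"
    if score >= 7:
        return "Moderate (7-9)"
    return "Low (<7)"


def score_to_molar(score):
    """Convert a score to a binding constant in mol/L.

    Assumption: score = -log10(K), so K = 10 ** (-score).
    """
    return 10.0 ** (-score)


for s in (9.4, 8.0, 6.2):
    print(s, affinity_class(s), f"{score_to_molar(s):.2e} M")
```

Under this assumption, moving from a score of 7 to a score of 8 corresponds to a tenfold smaller binding constant.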
| |
| --- |
| |
| ### Uncertainty Interpretation <br> |
| #### Entropy (classifiers)<br> |
| |
| Binary predictive entropy of the output probability p̄:<br> |
| |
| $$\mathcal{H} = -\bar{p}\log\bar{p} - (1 - \bar{p})\log(1 - \bar{p})$$<br> |
| |
| - For **DNN classifiers**: p̄ is the mean probability across 5 independently seeded models (deep ensemble). High entropy reflects both epistemic uncertainty (seed disagreement) and aleatoric uncertainty (collectively diffuse predictions).<br> |
| |
| - For **XGBoost / SVM / ElasticNet classifiers**: p̄ is the single model's output probability (or sigmoid of decision function for ElasticNet). Entropy reflects output confidence of a single model only.<br> |
| |
| | Range | Interpretation | |
| |---|---| |
| | < 0.1 | High confidence | |
| | 0.1 – 0.4 | Moderate uncertainty | |
| | 0.4 – 0.6 | Low confidence | |
| | > 0.6 | Very low confidence — model close to guessing | |
| | ≈ 0.693 | Maximum uncertainty — predicted probability ≈ 0.5 | |
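The entropy formula and the interpretation bands can be reproduced with a few lines of standard-library Python. This is a sketch for intuition rather than the code in `inference.py`; the function names and the handling of the band edges (for example, treating exactly 0.6 as "Low confidence") are assumptions.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Binary predictive entropy in nats; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def ensemble_entropy(probs):
    """Entropy of the mean probability across ensemble members (seeds)."""
    return binary_entropy(sum(probs) / len(probs))


def confidence_band(entropy):
    """Map an entropy value to the bands in the table above."""
    if entropy < 0.1:
        return "High confidence"
    if entropy < 0.4:
        return "Moderate uncertainty"
    if entropy <= 0.6:
        return "Low confidence"
    return "Very low confidence"


# Single-model case: a confident score gives low entropy.
h_single = binary_entropy(0.025485480204224586)
print(round(h_single, 4), confidence_band(h_single))

# Ensemble case: seeds that disagree push the mean toward 0.5,
# so the entropy approaches ln(2) even though each seed is confident.
h_split = ensemble_entropy([0.9, 0.1])
print(round(h_split, 4), confidence_band(h_split))
```

The single-model score used here is the SMILES `nf` score from the example output in the uncertainty section, and the computed entropy agrees with the `uncertainty` value reported there.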
| |
| --- |
| |
| #### Adaptive Conformal Prediction Interval (regressors)<br> |
| |
| Returned as a tuple `(lo, hi)` with 90% marginal coverage guarantee.<br> |
| |
| We implement the **residual normalised conformity score** following [Lei et al. (2018)](https://doi.org/10.1080/01621459.2017.1307116) and [Cordier et al. (2023) / MAPIE](https://proceedings.mlr.press/v204/cordier23a.html). An auxiliary XGBoost model $\hat{\sigma}(\mathbf{x})$ is trained on held-out embeddings and absolute residuals |yᵢ − ŷᵢ|. At inference:<br> |
| |
| $$[\hat{y}(\mathbf{x}) - q \cdot \hat{\sigma}(\mathbf{x}),\ \hat{y}(\mathbf{x}) + q \cdot \hat{\sigma}(\mathbf{x})]$$ |
| |
| |
| where q is the ⌈(n+1)(1−α)⌉ / n quantile of the normalized scores sᵢ = |yᵢ − ŷᵢ| / σ̂(xᵢ). |
| |
| |
| - **Interval width varies per input** -- molecules more dissimilar to training data tend to receive wider intervals<br> |
| |
| - **Coverage guarantee**: on exchangeable data, P(y ∈ [ŷ − qσ̂, ŷ + qσ̂]) ≥ 0.90<br> |
| |
| - **The guarantee is marginal**, not conditional, so an unusually narrow interval on an out-of-distribution molecule does not guarantee correctness<br> |
| |
| - **Full access**: We already computed MAPIE for all regression models; users are allowed to directly use them for customized model lists.<br> |
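The interval construction above can be sketched in plain Python. This is an illustration of the residual-normalized recipe, not the MAPIE-based implementation shipped with the models: the function names are hypothetical, the calibration data is a toy, and capping the rank at n when ⌈(n+1)(1−α)⌉ exceeds n is a simplification (strictly, the interval would be unbounded in that case).

```python
import math


def normalized_scores(y_true, y_pred, sigma):
    """Conformity scores s_i = |y_i - yhat_i| / sigma_hat(x_i)."""
    return [abs(y - p) / s for y, p, s in zip(y_true, y_pred, sigma)]


def conformal_quantile(scores, alpha=0.1):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest score.

    Simplification: the rank is capped at n for very small calibration sets.
    """
    n = len(scores)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(scores)[rank - 1]


def conformal_interval(y_hat, sigma_hat, q):
    """Interval [yhat - q * sigma_hat, yhat + q * sigma_hat]."""
    return (y_hat - q * sigma_hat, y_hat + q * sigma_hat)


# Toy calibration set: 10 points, all predictions 0, constant sigma_hat = 0.5.
y_true = [0.5 * k for k in range(1, 11)]
y_pred = [0.0] * 10
sigma = [0.5] * 10

scores = normalized_scores(y_true, y_pred, sigma)
q = conformal_quantile(scores, alpha=0.1)

# A new input with a small predicted spread gets a narrow interval,
# one with a large predicted spread gets a wide interval.
narrow = conformal_interval(2.0, 0.25, q)
wide = conformal_interval(2.0, 1.0, q)
print(q, narrow, wide)
```

Because q is shared across inputs while σ̂(x) varies, the interval width adapts per input, which is the "adaptive" part of the method.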
| |
| --- |
| |
| #### Generating a MAPIE Bundle for a New Model<br> |
| |
| To enable conformal uncertainty for a newly trained regression model:<br> |
| |
| ```bash |
| # Fit adaptive conformal bundle from val_predictions.csv |
| python fit_mapie_adaptive.py --root training_classifiers --prop <property_name> |
| ``` |
| |
| The script reads `sequence`/`smiles` and `y_pred`/`y_true` columns from the CSV, recomputes embeddings, fits the XGBoost $\hat{\sigma}$ model, and saves `mapie_calibration.joblib` into the model directory. The bundle is automatically detected and loaded by `PeptiVersePredictor` on the next initialization.<br> |
| |
| |
| |
| ## Model Architecture 🌟 |
| |
| - **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all) / [ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM). Foundational embeddings are frozen. |
| - **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction. |
| - **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns. |
| - **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations. |
| - **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets. |
| - **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository. |
| |
| ## Troubleshooting 🌟 |
| |
| ### LFS Download Issues |
| |
| If files appear as Git LFS pointer stubs (small text files containing a SHA) instead of the real content, download them directly: |
| |
| ```bash |
| huggingface-cli download ChatterjeeLab/PeptiVerse \ |
| training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \ |
| --local-dir . \ |
| --local-dir-use-symlinks False |
| ``` |
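To find which files are still pointer stubs after a sparse or LFS-skipping checkout, a short scan can help. The sketch below relies on the Git LFS pointer format, in which a pointer is a small text file (under 1024 bytes) beginning with the line `version https://git-lfs.github.com/spec/v1`. The helper names are hypothetical and not part of PeptiVerse.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer stub."""
    path = Path(path)
    # Pointer files are tiny; real weights and datasets are not.
    if path.stat().st_size >= 1024:
        return False
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(root):
    """List pointer stubs under root, skipping Git's own metadata folder."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    for stub in find_lfs_pointers("."):
        print("still a pointer:", stub)
```

Each path printed can then be fetched with the `huggingface-cli download` command shown above.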
| |
| ## Citation 🌟 |
|
|
| If you find this repository helpful for your publications, please consider citing our paper: |
|
|
| ``` |
| @article {Zhang2025.12.31.697180, |
| author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam}, |
| title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction}, |
| elocation-id = {2025.12.31.697180}, |
| year = {2026}, |
| doi = {10.64898/2025.12.31.697180}, |
| publisher = {Cold Spring Harbor Laboratory}, |
| URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180}, |
| eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf}, |
| journal = {bioRxiv} |
| } |
| ``` |
| To use this repository, you agree to abide by the MIT License. |
|
|