	---
	license: apache-2.0
	---

	![Overview of PeptiVerse](peptiverse-cover.png)

	# PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🌌

	This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

	## Table of Contents 🌟

	- [Quick start](#quick-start)
	- [Installation](#installation)
	- [Repository Structure](#repository-structure)
	- [Training data collection](#training-data-collection)
	- [Best model list](#best-model-list)
	- [Full model set (cuML-enabled)](#full-model-set-cuml-enabled)
	- [Minimal deployable model set (no cuML)](#minimal-deployable-model-set-no-cuml)
	- [Usage](#usage)
	- [Local Application Hosting](#local-application-hosting)
	- [Dataset integration](#dataset-integration)
	- [Training](#training)
	- [Quick inference by property per model](#quick-inference-by-property-per-model)
	- [Interpretation](#interpretation)
	- [Model Architecture](#model-architecture)
	- [Troubleshooting](#troubleshooting)
	- [Citation](#citation)

	## Quick Start 🌟
	- Lightweight start (basic models, no cuML; see below for details)
	```bash
	# Clone without checking out any files (LFS files included); the folder will look empty at first
	git clone --no-checkout https://huggingface.co/ChatterjeeLab/PeptiVerse
	cd PeptiVerse

	# Enable sparse checkout
	git sparse-checkout init --cone

	# Choose only selective items to download
	git sparse-checkout set \
	inference.py \
	download_light.py \
	best_models.txt \
	basic_models.txt \
	requirements.txt \
	tokenizer \
	README.md

	# Now checkout
	GIT_LFS_SKIP_SMUDGE=1 git checkout

	# Install basic pkgs
	pip install -r requirements.txt

	# Download the basic model weights listed in basic_models.txt (adjust the config as needed)
	python download_light.py

	# Test in inference
	python inference.py
	```
	- Full model clone (will clone all best model weights)
	```bash
	# Clone repository
	git clone https://huggingface.co/ChatterjeeLab/PeptiVerse

	# Install dependencies
	pip install -r requirements.txt

	# Run inference
	python inference.py
	```
	> Note: This clones the best model weights only. For full access:
	> - All model weights (best + seed ensembles for uncertainty quantification): [Zenodo](https://zenodo.org/records/19989009)
	> - Training datasets (embeddings + splits): [HuggingFace Dataset](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data)

	## Installation 🌟
	### Minimal Setup
	- Easy start-up environment (using transformers, xgboost models)
	```bash
	pip install -r requirements.txt
	```
	### Full Setup
	- Access to the trained SVM and ElasticNet models additionally requires `RAPIDS cuML`; installation instructions are available on the official [GitHub page](https://github.com/rapidsai/cuml) (CUDA-capable GPU required).
	- Optional: a pre-built Singularity/Apptainer image (5.68 GB) with all dependencies installed is available on [Google Drive](https://drive.google.com/file/d/1ybLJNTC3BITIqBd8IO09nOOm4PKwD4iS/view?usp=sharing) (a CUDA-capable GPU is still required to load cuML models). Verify the download with `shasum -a 256 peptiverse.sif`; the expected SHA256 is `48619796ef0adc81bc420021821e5ee3d9b2176bf1f564104e06dc1ce56b3498`.
	```bash
	# test
	apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"

	# run inference (see below)
	apptainer exec --nv peptiverse.sif python inference.py
	```
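Where `shasum` is not available, the same integrity check can be sketched in Python with the standard library. This is a minimal sketch: the file location `peptiverse.sif` in the current directory and the helper names `sha256_of` and `verify` are assumptions for illustration, not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksum published in this README for peptiverse.sif
EXPECTED_SHA256 = "48619796ef0adc81bc420021821e5ee3d9b2176bf1f564104e06dc1ce56b3498"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """True when the file's digest matches the expected checksum (case-insensitive)."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    image = Path("peptiverse.sif")  # assumed download location (hypothetical)
    if image.exists():
        print("checksum OK" if verify(image) else "checksum MISMATCH")
    else:
        print(f"{image} not found")
```

Reading in 1 MiB chunks keeps memory use flat even for a multi-gigabyte image.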
	## Repository Structure 🌟
	This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)

	```
	PeptiVerse/
	├── training_data_cleaned/ # Processed datasets with embeddings
	│ └── <property>/ # Property-specific data
	│ ├── train/val splits
	│ └── precomputed embeddings
	├── training_classifiers/ # Trained model weights
	│ └── <property>/
	│ ├── cnn_wt/ # CNN architectures
	│ ├── mlp_wt/ # MLP architectures
	│ └── xgb_wt/ # XGBoost models
	├── tokenizer/ # PeptideCLM tokenizer
	├── training_data/ # Raw training data
	├── inference.py # Main prediction interface
	├── best_models.txt # Model selection manifest
	└── requirements.txt # Python dependencies
	```
	For full data access, download `training_data_cleaned` and `training_classifiers` from the [Hugging Face dataset](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data) and [Zenodo](https://zenodo.org/records/19989009), respectively. This Hugging Face repo hosts only the best model weights and the metadata with split labels.

	To combine the Zenodo archive with the weights in this repository:

	1. Download and extract the Zenodo archive.
	2. Download or clone this repository.
	3. Copy/merge the repository `training_classifiers/` contents into the extracted Zenodo `training_classifiers/` directory, preserving the folder structure:
	   `rsync -av --ignore-existing training_classifiers/ /path/to/zenodo_extracted/training_classifiers/`
	4. Do not replace the entire Zenodo folder blindly; merge files so that the large training outputs from Zenodo and the updated best-model weights from this repository coexist.

	## Training Data Collection 🌟

	<table>
	<caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption>
	<thead>
	<tr>
	<th rowspan="2"><strong>Properties</strong></th>
	<th colspan="2"><strong>Amino Acid Sequences</strong></th>
	<th colspan="2"><strong>SMILES Sequences</strong></th>
	</tr>
	<tr>
	<th><strong>0</strong></th>
	<th><strong>1</strong></th>
	<th><strong>0</strong></th>
	<th><strong>1</strong></th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td colspan="5"><strong>Classification</strong></td>
	</tr>
	<tr>
	<td>Hemolysis</td>
	<td>4765</td>
	<td>1311</td>
	<td>4765</td>
	<td>1311</td>
	</tr>
	<tr>
	<td>Non-Fouling</td>
	<td>13580</td>
	<td>3600</td>
	<td>13580</td>
	<td>3600</td>
	</tr>
	<tr>
	<td>Solubility</td>
	<td>9668</td>
	<td>8785</td>
	<td>9668</td>
	<td>8785</td>
	</tr>
	<tr>
	<td>Permeability (Penetrance)</td>
	<td>1162</td>
	<td>1162</td>
	<td>1162</td>
	<td>1162</td>
	</tr>
	<tr>
	<td>Toxicity</td>
	<td>-</td>
	<td>-</td>
	<td>5518</td>
	<td>5518</td>
	</tr>
	<tr>
	<td colspan="5"><strong>Regression (N)</strong></td>
	</tr>
	<tr>
	<td>Permeability (PAMPA)</td>
	<td colspan="2" align="center">-</td>
	<td colspan="2" align="center">6869</td>
	</tr>
	<tr>
	<td>Permeability (Caco-2)</td>
	<td colspan="2" align="center">-</td>
	<td colspan="2" align="center">606</td>
	</tr>
	<tr>
	<td>Half-Life</td>
	<td colspan="2" align="center">130</td>
	<td colspan="2" align="center">245</td>
	</tr>
	<tr>
	<td>Binding Affinity</td>
	<td colspan="2" align="center">1436</td>
	<td colspan="2" align="center">1597</td>
	</tr>
	</tbody>
	</table>


	## Best Model List 🌟

	### Full model set (cuML-enabled)
	| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
	|---|---|---|---|---|---|
	| Hemolysis | SVM | CNN (chemberta) | Classifier | 0.2521 | 0.564 |
	| Non-Fouling | Transformer | ENET (peptideclm) | Classifier | 0.712 | 0.6969 |
	| Solubility | CNN | Transformer (peptideclm) | Classifier | 0.377 | 0.329 |
	| Permeability (Penetrance) | SVM | SVM (chemberta) | Classifier | 0.5493 | 0.573 |
	| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
	| Binding Affinity | pooled | pooled (chemberta) | Regression | – | – |
	| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
	| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
	| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |

	>Note: unpooled indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.

	### Minimal deployable model set (no cuML)
	| Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) |
	|---|---|---|---|---|---|
	| Hemolysis | XGB | CNN (chemberta) | Classifier | 0.2801 | 0.564 |
	| Non-Fouling | Transformer | XGB (peptideclm) | Classifier | 0.712 | 0.3892 |
	| Solubility | CNN | Transformer (peptideclm) | Classifier | 0.377 | 0.329 |
	| Permeability (Penetrance) | XGB | XGB (chemberta) | Classifier | 0.4301 | 0.5028 |
	| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
	| Binding Affinity | pooled | pooled (chemberta) | Regression | – | – |
	| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
	| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
	| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |
	> Note: Models listed as SVM or ENET in the full set are replaced with XGB here, because those models cannot be loaded in a deployment environment without cuML.
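The thresholds in the tables turn a classifier probability into a 0/1 label. The sketch below illustrates this with the sequence-mode (WT) thresholds of the minimal deployable set. It assumes that label 1 is assigned when the score is greater than or equal to the threshold; the comparison actually used in `inference.py` may differ, and `to_label` is a hypothetical helper.

```python
# Sequence-mode (WT) thresholds copied from the minimal deployable table
WT_THRESHOLDS = {
    "hemolysis": 0.2801,
    "nf": 0.712,
    "solubility": 0.377,
    "permeability_penetrance": 0.4301,
}


def to_label(prop, score, thresholds=WT_THRESHOLDS):
    """Binarize a classifier probability. Assumes label 1 means score >= threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be a probability in [0, 1], got {score}")
    return int(score >= thresholds[prop])


print(to_label("hemolysis", 0.30))   # 1: 0.30 is above 0.2801
print(to_label("solubility", 0.30))  # 0: 0.30 is below 0.377
```

The same score can therefore yield different labels for different properties, which is why each property carries its own threshold.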


	## Usage 🌟

	### Local Application Hosting
	- Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources.
	```bash
	git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
	cd PeptiVerse

	# Configure models in best_models.txt, then launch the app
	python app.py
	```
	### Data pre-processing
	Under `training_data_cleaned`, we provide the generated embeddings in Hugging Face dataset format. The following scripts are the steps used to generate the data.

	### Dataset integration
	- All processed training datasets are available at [ChatterjeeLab/PeptiVerse\_data](https://huggingface.co/datasets/ChatterjeeLab/PeptiVerse_data).
	- Selectively download the data you need with `huggingface-cli`
	```bash
	# --include: only this folder
	# --exclude: skip weights/artifacts
	# --local-dir-use-symlinks False: make real copies
	huggingface-cli download ChatterjeeLab/PeptiVerse_data \
	  --repo-type dataset \
	  --include "training_data_cleaned/**" \
	  --exclude "*.pt" "*.joblib" \
	  --local-dir PeptiVerse_data \
	  --local-dir-use-symlinks False
	```
	- Or in python
	```python
	from huggingface_hub import snapshot_download

	local_dir = snapshot_download(
	    repo_id="ChatterjeeLab/PeptiVerse_data",
	    repo_type="dataset",  # the data lives in a dataset repo
	    allow_patterns=["training_data_cleaned/**"],  # only this folder
	    ignore_patterns=["*.pt", "*.joblib"],  # skip weights/artifacts
	    local_dir="PeptiVerse_data",
	    local_dir_use_symlinks=False,  # make real copies
	)
	print("Downloaded to:", local_dir)
	```
	- Usage of the huggingface datasets (with pre-computed embeddings and splits)
	- All embedding datasets are saved via `DatasetDict.save_to_disk` and loadable with:
	``` python
	from datasets import load_from_disk
	ds = load_from_disk(PATH)
	train_ds = ds["train"]
	val_ds = ds["val"]
	```
	- A) Sequence-based ([ESM-2](https://huggingface.co/facebook/esm2_t33_650M_UR50D) embeddings)
	  - Pooled (fixed-length vector per sequence)
	    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
	    - Each item:
	      - sequence: `str`
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float32[H]` (H=1280 for ESM-2 650M)
	  - Unpooled (variable-length token matrix)
	    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
	    - Each item:
	      - sequence: `str`
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float16[L, H]` (nested lists)
	      - attention_mask: `int8[L]`
	      - length: `int` (=L)
	- B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
	  - Pooled (fixed-length vector per sequence)
	    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
	    - Each item:
	      - sequence: `str` (SMILES)
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float32[H]`
	  - Unpooled (variable-length token matrix)
	    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
	    - Each item:
	      - sequence: `str` (SMILES)
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float16[L, H]` (nested lists)
	      - attention_mask: `int8[L]`
	      - length: `int` (=L)
	- C) SMILES-based ([ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM) embeddings)
	  - Pooled (fixed-length vector per sequence)
	    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
	    - Each item:
	      - sequence: `str` (SMILES)
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float32[H]`
	  - Unpooled (variable-length token matrix)
	    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
	    - Each item:
	      - sequence: `str` (SMILES)
	      - label: `int` (classification) or `float` (regression)
	      - embedding: `float16[L, H]` (nested lists)
	      - attention_mask: `int8[L]`
	      - length: `int` (=L)
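The relationship between the unpooled and pooled records can be sketched as a masked mean over the token axis. This is an illustration on a toy item, not the preprocessing script itself: `mean_pool` is a hypothetical helper, and it assumes that an `attention_mask` value of 1 marks a valid token.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the rows of an [L, H] nested list where attention_mask is 1."""
    kept = [row for row, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("no valid tokens to pool")
    hidden = len(kept[0])
    return [sum(row[j] for row in kept) / len(kept) for j in range(hidden)]


# Toy item shaped like an unpooled record (L=3 tokens, H=2); the last token is padding
item = {
    "embedding": [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    "attention_mask": [1, 1, 0],
}
print(mean_pool(item["embedding"], item["attention_mask"]))  # [2.0, 3.0]
```

Plain nested lists are used because the unpooled embeddings are stored that way; with NumPy or PyTorch the same operation is a masked mean along the first axis.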
	### Training
	Under the `training_classifiers` folder, we provide the Python scripts used to train the different models. Notes:
	1. The scripts read the pre-processed Hugging Face datasets from the `training_data_cleaned` folder.
	2. Each script runs an Optuna hyperparameter sweep when called.
	3. All training was conducted on an HPC cluster using the SLURM scripts under the `training_classifiers/src` folder.
	4. Customize or isolate individual model training scripts as needed.
	##### Example of training
	###### ML models
	```bash
	HOME_LOC=/home
	SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
	EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

	OBJECTIVE='hemolysis' # nf/solubility/hemolysis/permeability_pampa/permeability_caco2
	WT='smiles' # wt/smiles
	DATA_FILE="hemo_${WT}_with_embeddings"
	LOG_LOC=$SCRIPT_LOC/src/logs
	DATE=$(date +%m_%d)
	MODEL_TYPE='svm_gpu' # xgb/enet_gpu/svm_gpu
	SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}_new"

	# Create log directory if it doesn't exist
	mkdir -p $LOG_LOC

	cd $SCRIPT_LOC

	python -u train_ml.py \
	--dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
	--out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
	--model "${MODEL_TYPE}" \
	--n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
	```
	###### DNN models
	```bash
	HOME_LOC=/home
	SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
	EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

	OBJECTIVE='nf' # nf/solubility/hemolysis
	WT='smiles' #wt/smiles
	DATA_FILE="nf_${WT}_with_embeddings_unpooled"
	LOG_LOC=$SCRIPT_LOC/src/logs
	DATE=$(date +%m_%d)
	MODEL_TYPE='cnn' #mlp/cnn/transformer
	SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}"

	# Create log directory if it doesn't exist
	mkdir -p $LOG_LOC

	cd $SCRIPT_LOC

	python -u train_nn.py \
	--dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
	--out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
	--model "${MODEL_TYPE}" \
	--n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
	```
	###### Binding Affinity
	```bash
	HOME_LOC=/home
	SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
	EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

	OBJECTIVE='binding_affinity'
	BINDER_MODEL='chemberta' # peptideclm / chemberta
	STATUS='unpooled' # pooled / unpooled
	TYPE='smiles'
	DATA_FILE="pair_wt_${TYPE}_${STATUS}" # double quotes so the variables expand

	LOG_LOC=$SCRIPT_LOC/src/logs
	DATE=$(date +%m_%d)
	SPECIAL_PREFIX="${OBJECTIVE}-${BINDER_MODEL}-${STATUS}"

	# Create log directory if it doesn't exist
	mkdir -p $LOG_LOC

	cd $SCRIPT_LOC

	python -u binding_training.py \
	--dataset_path "${EMB_LOC}/${OBJECTIVE}/${BINDER_MODEL}/${DATA_FILE}" \
	--mode "${STATUS}" \
	--out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${BINDER_MODEL}_${TYPE}_${STATUS}" \
	--n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
	```

	### Quick inference by property per model
	```python
	from inference import PeptiVersePredictor
	from pathlib import Path

	root = Path(__file__).resolve().parent  # current script folder

	predictor = PeptiVersePredictor(
	    manifest_path=root / "best_models.txt",
	    classifier_weight_root=root,
	    device="cuda",  # or "cpu"
	)

	# mode: smiles (SMILES-based models) / wt (Sequence-based models)
	# property keys (with some level of name normalization)
	# hemolysis
	# nf (Non-Fouling)
	# solubility
	# permeability_penetrance
	# toxicity
	# permeability_pampa
	# permeability_caco2
	# halflife
	# binding_affinity

	seq = "GIVEQCCTSICSLYQLENYCN"
	smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"

	# Hemolysis
	out = predictor.predict_property("hemolysis", mode="wt", input_str=seq)
	print(out)
	# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}

	out = predictor.predict_property("hemolysis", mode="smiles", input_str=smiles)
	print(out)

	# Non-fouling (key is nf)
	out = predictor.predict_property("nf", mode="wt", input_str=seq)
	print(out)

	out = predictor.predict_property("nf", mode="smiles", input_str=smiles)
	print(out)

	# Solubility (Sequence-only)
	out = predictor.predict_property("solubility", mode="wt", input_str=seq)
	print(out)

	# Permeability (Penetrance) (Sequence-only)
	out = predictor.predict_property("permeability_penetrance", mode="wt", input_str=seq)
	print(out)

	# Toxicity (SMILES-only)
	out = predictor.predict_property("toxicity", mode="smiles", input_str=smiles)
	print(out)

	# Permeability (PAMPA) (SMILES regression)
	out = predictor.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
	print(out)
	# {"property":"permeability_pampa","mode":"smiles","score":value}

	# Permeability (Caco-2) (SMILES regression)
	out = predictor.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
	print(out)

	# Half-life (sequence-based + SMILES regression)
	out = predictor.predict_property("halflife", mode="wt", input_str=seq)
	print(out)

	out = predictor.predict_property("halflife", mode="smiles", input_str=smiles)
	print(out)

	# Binding Affinity
	protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..." # target protein
	peptide_seq = "GIVEQCCTSICSLYQLENYCN"

	out = predictor.predict_binding_affinity(
	    mode="wt",
	    target_seq=protein,
	    binder_str=peptide_seq,
	)
	print(out)
	# {
	# "property":"binding_affinity",
	# "mode":"wt",
	# "affinity": float,
	# "class_by_threshold": "High (≥9)" / "Moderate (7-9)" / "Low (<7)",
	# "class_by_logits": same buckets,
	# "binding_model": "pooled" or "unpooled",
	# }

	```

	#### Advanced inference with uncertainty prediction
	Uncertainty prediction is exposed as a parameter of the inference code. The full classifier folder from [Zenodo](https://zenodo.org/records/19989009) is required to enable this functionality. The uncertainty models are produced by the scripts under the `training_classifiers` folder whose names start with "refit". A detailed description can be found in the methodology section of the manuscript.
	At inference time, `PeptiVersePredictor` returns an `uncertainty` field with every prediction when `uncertainty=True` is passed. The method and its interpretation depend on the model class, which is determined automatically at inference time.
	```python
	seq = "GIGAVLKVLTTGLPALISWIKRKRQQ"
	smiles = "C(C)C[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H](CC(C)C)NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@H](C)NC(=O)[C@H](Cc2ccccc2)NC1=O"

	print(predictor.predict_property("nf", "wt", seq, uncertainty=True))
	print(predictor.predict_property("nf", "smiles", smiles, uncertainty=True))

	# Example output:
	# {'property': 'nf', 'col': 'wt', 'score': 0.00014520535252195523, 'emb_tag': 'wt', 'label': 0, 'threshold': 0.57, 'uncertainty': 0.0017192508727321288, 'uncertainty_type': 'ensemble_predictive_entropy'}
	# {'property': 'nf', 'col': 'smiles', 'score': 0.025485480204224586, 'emb_tag': 'peptideclm', 'label': 0, 'threshold': 0.6969, 'uncertainty': 0.11868063130587676, 'uncertainty_type': 'binary_predictive_entropy_single_model'}
	```

	---

	##### Method by Model Class

	| Model Class | Task | Uncertainty Method | Output Type | Range |
	|---|---|---|---|---|
	| MLP, CNN, Transformer | Classifier | Deep ensemble predictive entropy (5 seeds) | `float` | [0, ln(2) ≈ 0.693] |
	| MLP, CNN, Transformer | Regression | Adaptive conformal interval; falls back to ensemble std if no MAPIE bundle | `(lo, hi)` or `float` | unbounded |
	| SVM / SVC / XGBoost | Classifier | Binary predictive entropy (sigmoid of decision function) | `float` | [0, ln(2) ≈ 0.693] |
	| SVR / ElasticNet / XGBoost | Regression | Adaptive conformal interval | `(lo, hi)` | unbounded |

	> Uncertainty is `None` when: a DNN classifier has no seed ensemble trained, or a regression model has no `mapie_calibration.joblib` in its model directory.
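Because the `uncertainty` field can be a float, a `(lo, hi)` tuple, or `None`, downstream code needs to branch on its type. The following is a minimal sketch of such handling; `describe_uncertainty` is a hypothetical helper, and it relies only on the `uncertainty` and `uncertainty_type` keys shown in the example output above.

```python
import math


def describe_uncertainty(result):
    """Summarize the `uncertainty` field of a prediction dict as a short string."""
    unc = result.get("uncertainty")
    if unc is None:
        return "no uncertainty available"
    if isinstance(unc, (tuple, list)):  # conformal interval (lo, hi)
        lo, hi = unc
        return f"interval [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}"
    kind = result.get("uncertainty_type", "")
    if "entropy" in kind:
        # Express entropy as a fraction of its maximum, ln(2)
        return f"entropy {unc:.3f} ({unc / math.log(2):.0%} of maximum)"
    return f"spread {unc:.3f}"  # e.g. the ensemble std fallback


example = {
    "uncertainty": 0.11868063130587676,
    "uncertainty_type": "binary_predictive_entropy_single_model",
}
print(describe_uncertainty(example))  # entropy 0.119 (17% of maximum)
```

Checking for a tuple before reading `uncertainty_type` keeps the helper independent of the exact type strings used for regressors, which are not listed in this README.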

	---
	## Interpretation 🌟

	You can also find the same description in the paper or in the PeptiVerse app `Documentation` tab.

	---

	### 🩸 Hemolysis Prediction<br>
	HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.<br>

	Output interpretation:<br>

	- Score close to 1.0 = high probability of red blood cell membrane disruption
	- Score close to 0.0 = non-hemolytic

	---

	### 💧 Solubility Prediction<br>
	Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.<br>

	Output interpretation:<br>

	- Score close to 1.0 = highly soluble
	- Score close to 0.0 = poorly soluble

	---

	### 👯 Non-Fouling Prediction<br>
	Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.<br>

	Output interpretation:<br>

	- Score close to 1.0 = non-fouling
	- Score close to 0.0 = fouling

	---

	### 🪣 Permeability Prediction<br>
	Predicts membrane permeability on a log P scale.<br>

	Output interpretation:<br>

	- Higher values = more permeable (> -6.0)
	- Penetrance predictions are classification outputs in the [0, 1] range; values closer to 1 indicate higher permeability.

	---

	### ⏱️ Half-Life Prediction<br>
	Interpretation: Predicted values are in hours and reflect relative peptide stability. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.<br>

	---

	### ☠️ Toxicity Prediction<br>
	Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.<br>

	---

	### 🔗 Binding Affinity Prediction <br>

	Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.<br>

	Interpretation:<br>

	- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
	- Scores between 7 and 9 correspond to medium binders (K between 10⁻⁷ and 10⁻⁹ M, nanomolar to sub-micromolar range)<br>
	- Scores < 7 correspond to weak binders (K > 10⁻⁷ M, roughly micromolar and weaker)<br>
	- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
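The score ranges above can be read as a negative base-10 logarithm of the binding constant. The sketch below makes that conversion explicit. It assumes the score equals -log10(K) with K in mol/L, which is inferred from the listed cutoffs and the tenfold-per-unit rule rather than stated by the source; `score_to_molar` and `affinity_class` are hypothetical helpers.

```python
def score_to_molar(score):
    """Convert a -log10(K) style score to a binding constant in mol/L (assumed scale)."""
    return 10.0 ** (-score)


def affinity_class(score):
    """Bucket a score with the cutoffs used in this README."""
    if score >= 9:
        return "High (≥9)"
    if score >= 7:
        return "Moderate (7-9)"
    return "Low (<7)"


for s in (9.5, 8.0, 6.0):
    print(f"score={s}: K ≈ {score_to_molar(s):.1e} M, {affinity_class(s)}")
```

Under this assumption a score of 8.0 corresponds to K ≈ 10 nM, and moving from 8.0 to 9.0 tightens binding tenfold.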

	---

	### Uncertainty Interpretation <br>
	#### Entropy (classifiers)<br>

	Binary predictive entropy of the output probability p̄:<br>

	$$\mathcal{H} = -\bar{p}\log\bar{p} - (1 - \bar{p})\log(1 - \bar{p})$$<br>

	- For DNN classifiers: p̄ is the mean probability across 5 independently seeded models (deep ensemble). High entropy reflects both epistemic uncertainty (seed disagreement) and aleatoric uncertainty (collectively diffuse predictions).<br>

	- For XGBoost / SVM / ElasticNet classifiers: p̄ is the single model's output probability (or sigmoid of decision function for ElasticNet). Entropy reflects output confidence of a single model only.<br>

	| Range | Interpretation |
	|---|---|
	| < 0.1 | High confidence |
	| 0.1 – 0.4 | Moderate uncertainty |
	| 0.4 – 0.6 | Low confidence |
	| > 0.6 | Very low confidence (model close to guessing) |
	| ≈ 0.693 | Maximum uncertainty (predicted probability ≈ 0.5) |
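The entropy formula and the interpretation bands can be sketched in a few lines of Python. This is an illustration, not the code in `inference.py`; the function names are hypothetical, natural logarithms are used so that the maximum is ln(2), and the handling of values falling exactly on a band boundary is an assumption.

```python
import math


def binary_entropy(p):
    """Binary predictive entropy in nats; 0 at p in {0, 1}, ln(2) at p = 0.5."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of x * log(x) as x -> 0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def ensemble_entropy(seed_probs):
    """Entropy of the mean probability across ensemble members."""
    p_bar = sum(seed_probs) / len(seed_probs)
    return binary_entropy(p_bar)


def confidence_band(h):
    """Map an entropy value to the bands in the table above."""
    if h < 0.1:
        return "High confidence"
    if h < 0.4:
        return "Moderate uncertainty"
    if h <= 0.6:
        return "Low confidence"
    return "Very low confidence"


h = binary_entropy(0.025485480204224586)  # score from the SMILES example output
print(round(h, 4), confidence_band(h))
```

Applied to the single-model SMILES example shown earlier (score ≈ 0.0255), this gives an entropy of about 0.1187, in line with the reported `uncertainty` value.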

	---

	#### Adaptive Conformal Prediction Interval (regressors)<br>

	Returned as a tuple `(lo, hi)` with 90% marginal coverage guarantee.<br>

	We implement the residual-normalized conformity score following [Lei et al. (2018)](https://doi.org/10.1080/01621459.2017.1307116) and [Cordier et al. (2023) / MAPIE](https://proceedings.mlr.press/v204/cordier23a.html). An auxiliary XGBoost model $\hat{\sigma}(\mathbf{x})$ is trained on held-out embeddings and absolute residuals $|y_i - \hat{y}_i|$. At inference:<br>

	$$[\hat{y}(\mathbf{x}) - q \cdot \hat{\sigma}(\mathbf{x}),\ \hat{y}(\mathbf{x}) + q \cdot \hat{\sigma}(\mathbf{x})]$$

	where $q$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ quantile of the normalized scores $s_i = |y_i - \hat{y}_i| / \hat{\sigma}(\mathbf{x}_i)$.


	- Interval width varies per input -- molecules more dissimilar to training data tend to receive wider intervals<br>

	- Coverage guarantee: on exchangeable data, P(y ∈ [ŷ − qσ̂, ŷ + qσ̂]) ≥ 0.90<br>

	- The guarantee is marginal, not conditional, so an unusually narrow interval on an out-of-distribution molecule does not guarantee correctness<br>

	- Full access: MAPIE bundles are already computed for all regression models and can be used directly with customized model lists.<br>
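The interval construction described above can be sketched in plain Python. This is a minimal illustration of residual-normalized split conformal prediction, not the repository's MAPIE-based implementation; all function names are hypothetical, and the `sigma` values stand in for the outputs of the auxiliary $\hat{\sigma}$ model.

```python
import math


def normalized_scores(y_true, y_pred, sigma):
    """s_i = |y_i - yhat_i| / sigma_hat(x_i) on the calibration set."""
    return [abs(t - p) / s for t, p, s in zip(y_true, y_pred, sigma)]


def conformal_quantile(scores, alpha=0.10):
    """Return the ceil((n + 1)(1 - alpha)) / n empirical quantile of the scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def interval(y_hat, sigma_hat, q):
    """[y_hat - q * sigma_hat, y_hat + q * sigma_hat] for one new input."""
    return (y_hat - q * sigma_hat, y_hat + q * sigma_hat)


# Toy calibration set with 9 normalized scores (illustrative numbers only)
scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
q = conformal_quantile(scores, alpha=0.10)
print(q, interval(2.0, 0.5, q))
```

With 9 calibration scores and α = 0.10 the required rank is 9, so q is the largest score. Returning infinity when the rank exceeds n follows the usual convention that no finite interval can carry the guarantee with so few calibration points.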

	---

	#### Generating a MAPIE Bundle for a New Model<br>

	To enable conformal uncertainty for a newly trained regression model:<br>

	```bash
	# Fit adaptive conformal bundle from val_predictions.csv
	python fit_mapie_adaptive.py --root training_classifiers --prop <property_name>
	```

	The script reads `sequence`/`smiles` and `y_pred`/`y_true` columns from the CSV, recomputes embeddings, fits the XGBoost $\hat{\sigma}$ model, and saves `mapie_calibration.joblib` into the model directory. The bundle is automatically detected and loaded by `PeptiVersePredictor` on the next initialization.<br>



	## Model Architecture 🌟

	- Sequence Embeddings: [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all) / [ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM). Foundational embeddings are frozen.
	- XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
	- CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
	- Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
	- SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
	- Others: SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.

	## Troubleshooting 🌟

	### LFS Download Issues

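	If files appear as Git LFS pointers (small text stubs containing a SHA256 hash) instead of the actual content, download them directly: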

	```bash
	huggingface-cli download ChatterjeeLab/PeptiVerse \
	training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
	--local-dir . \
	--local-dir-use-symlinks False
	```

	## Citation 🌟

	If you find this repository helpful for your publications, please consider citing our paper:

	```
	@article {Zhang2025.12.31.697180,
	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
	elocation-id = {2025.12.31.697180},
	year = {2026},
	doi = {10.64898/2025.12.31.697180},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
	eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
	journal = {bioRxiv}
	}
	```
	To use this repository, you agree to abide by the MIT License.