🧬 Avicenna-8B-Base
"Restoring the 'Think' in Medical AI."
Avicenna-8B-Base is the foundational model of the Avicenna Project: a specialized medical language model engineered to achieve SOTA reasoning at the 8B parameter scale via architectural merging. It is built by surgically merging three distinct Llama 3.1 models and pairing the result with a Self-Consistency Ensembling inference strategy.
🏗️ Architecture: "The Surgical Merge"
Unlike standard merges that blend models uniformly, Avicenna-8B-Base uses a layer-segmented DARE-TIES configuration to assign specific cognitive roles to different parts of the network.
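To make the DARE half of DARE-TIES concrete: each donor model's parameter deltas are randomly dropped with some probability `p`, and the surviving deltas are rescaled by `1/(1-p)` so their expected contribution is preserved before merging. The sketch below is illustrative only (the function name and list-based representation are ours, not part of the actual merge tooling); here the drop mask is passed in explicitly so the behavior is deterministic:

```python
def dare_rescale(delta, keep_mask, p):
    """DARE sparsification step: zero out dropped delta-parameters and
    rescale the kept ones by 1/(1-p), preserving the expected magnitude."""
    return [d / (1.0 - p) if keep else 0.0 for d, keep in zip(delta, keep_mask)]

# Drop half the deltas (p=0.5): survivors are doubled, dropped entries zeroed.
delta = [1.0, 2.0, 3.0, 4.0]
print(dare_rescale(delta, [True, False, True, False], p=0.5))
# [2.0, 0.0, 6.0, 0.0]
```

In the real merge, the mask is sampled randomly per parameter; TIES then resolves sign conflicts between the surviving deltas before they are summed into the base weights.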
| Model Region | Source Model | Role | Weights |
|---|---|---|---|
| Foundation (Layers 0-8) | Llama-3.1-Instruct | Syntax, instruction following, and grammar stability. | 100% |
| Logic Core (Layers 8-20) | Hermes-3 + Llama-3.1 | Clinical reasoning: implicit logic and causal analysis. | 45% Hermes / 55% Base |
| Medical Cortex (Layers 20-28) | Aloe-Beta + Llama-3.1 | Knowledge retrieval: high-density injection of medical textbooks and guidelines. | 52% Aloe / 48% Base |
| Frontal Cortex (Layers 28-32) | Llama-3.1-Instruct | Safety & output: ensures polite, structured, and compliant responses. | 100% |
This structure prevents "catastrophic forgetting" of general logic while injecting massive medical knowledge into the deep layers.
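The layer-segmented assignment above can be sketched as a simple lookup over the 32 transformer layers of an 8B Llama model (layer ranges and blend weights are taken from the table; the function name is illustrative, not part of any merge config):

```python
def layer_recipe(layer_idx):
    """Map a transformer layer index to its region name and source-model
    blend weights, following the surgical-merge table (32 layers total)."""
    if 0 <= layer_idx < 8:
        return "Foundation", {"Llama-3.1-Instruct": 1.00}
    if 8 <= layer_idx < 20:
        return "Logic Core", {"Hermes-3": 0.45, "Llama-3.1": 0.55}
    if 20 <= layer_idx < 28:
        return "Medical Cortex", {"Aloe-Beta": 0.52, "Llama-3.1": 0.48}
    if 28 <= layer_idx < 32:
        return "Frontal Cortex", {"Llama-3.1-Instruct": 1.00}
    raise ValueError("Llama-3.1-8B has 32 transformer layers (indices 0-31)")

print(layer_recipe(24))
# ('Medical Cortex', {'Aloe-Beta': 0.52, 'Llama-3.1': 0.48})
```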
🏆 Benchmark Performance (Comprehensive Comparison)
We compared Avicenna-8B-Base against other leading medical models across three major benchmarks: MedQA (USMLE), MMLU-Medical, and MedMCQA.
| Model | Size | Inference Method | MedQA (USMLE) | MMLU-Medical | MedMCQA |
|---|---|---|---|---|---|
| Avicenna-8B-Base | 8B | Self-Consistency (SC) (N=5) | 61.0% | - | 50.0% |
| Avicenna-8B-Base | 8B | Greedy | 60.0% | 69.5% | - |
| GPT-3.5 Turbo | 175B+ | Standard | 61.2% | 73.5% | 59.4% |
| ClinicalCamel-70B | 70B | Standard | 45.8% | 68.4% | 45.8% |
| PMC-LLaMA-13B | 13B | Standard | 39.6% | 56.3% | 37.7% |
| MedAlpaca-13B | 13B | Standard | 37.3% | 51.5% | 35.7% |
| BioMistral-7B | 7B | Standard | 35.4% | 52.6% | 34.8% |
| Meditron-7B | 7B | Standard | 33.5% | 45.2% | 31.1% |
Methodology Notes:
- Hardware: All results obtained using 4-bit NF4 Quantization on NVIDIA T4 GPUs. Full precision scores are expected to be higher.
- Inference: MedQA and MedMCQA utilized Self-Consistency Ensembling (SC) inference (N=5 voters). MMLU utilized standard Greedy decoding.
- Sampling: MedQA and MedMCQA results represent randomized subsets of the validation/test sets due to compute constraints. MMLU represents the complete evaluation of all 6 medical subsets.
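Self-Consistency here means sampling N=5 completions per question and taking a majority vote on the extracted answer choice. A minimal sketch of the voting step (the answer-extraction regex and tie-breaking rule are our illustrative assumptions, not the exact evaluation harness):

```python
from collections import Counter
import re

def majority_vote(completions):
    """Extract the letter choice (A-E) from each sampled completion and
    return the most common one; ties resolve to the earliest-seen answer
    (Counter.most_common is stable with respect to insertion order)."""
    votes = []
    for text in completions:
        m = re.search(r"\b([A-E])\b", text)  # naive single-letter extraction
        if m:
            votes.append(m.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else None

samples = ["The answer is B", "B, because...", "I'd pick C", "Answer: B", "A"]
print(majority_vote(samples))  # B
```

The intuition is that diverse reasoning paths which converge on the same final answer are more likely to be correct than any single greedy decode, which is consistent with the SC scores above edging out the greedy ones.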
🚀 How to Run (Self-Consistency Ensembling)
Use the following Python script, which implements the ensembling logic: it samples three diverse drafts per query, then synthesizes them into a single consensus answer.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- CONFIGURATION ---
MODEL_ID = "salihfurkaan/Avicenna-8B-Base"
TOKENIZER_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def setup_model():
    print(f"Loading {MODEL_ID} in 4-bit mode...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,  # remove this line for the non-quantized version
        device_map="auto",
    )
    return model, tokenizer


def solve_with_moa_open_ended(model, tokenizer, user_input):
    """
    Runs Mixture-of-Agents for open-ended queries:
    1. Generates 3 distinct clinical opinions (drafts).
    2. Synthesizes them into a final consensus answer.
    """
    # --- PHASE 1: DRAFTING (3 internal specialists) ---
    system_prompt_draft = (
        "You are Avicenna, an expert medical consultant. Analyze the case "
        "step-by-step. Provide a Differential Diagnosis and Recommended Next Steps."
    )
    messages = [
        {"role": "system", "content": system_prompt_draft},
        {"role": "user", "content": user_input},
    ]
    # return_dict=True yields a BatchEncoding, so it can be unpacked into generate()
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    print("Consulting 3 internal specialists (Drafting Phase)...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1536,
            temperature=0.7,         # high temperature for diverse perspectives
            do_sample=True,
            num_return_sequences=3,  # generate 3 drafts
            pad_token_id=tokenizer.eos_token_id,
        )

    # Slice off the prompt so only the newly generated tokens remain.
    # outputs shape: [3, seq_len]; prompt length: inputs["input_ids"].shape[1]
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    drafts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # --- PHASE 2: SYNTHESIS (chief resident) ---
    print("Synthesizing Final Consensus...")
    combined_drafts = ""
    for i, draft in enumerate(drafts):
        combined_drafts += f"\n[Opinion {i+1}]:\n{draft}\n"
        # Optional: print drafts to see the internal debate
        # print(f"\n--- Opinion {i+1} ---\n{draft[:200]}...")

    aggregator_prompt = (
        f"Clinical Case:\n{user_input}\n\n"
        f"Consider the following 3 medical opinions on this case:\n{combined_drafts}\n\n"
        "TASK: Synthesize these opinions into a single, highly accurate, and professional clinical assessment. "
        "Resolve any conflicts by prioritizing patient safety and standard of care. "
        "Structure the answer clearly: 1. Assessment, 2. Key Differentials, 3. Plan."
    )
    agg_messages = [
        {"role": "system", "content": "You are a Senior Chief Physician. Provide a final authoritative consultation."},
        {"role": "user", "content": aggregator_prompt},
    ]
    agg_inputs = tokenizer.apply_chat_template(
        agg_messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    with torch.no_grad():
        final_output = model.generate(
            **agg_inputs,
            max_new_tokens=768,
            temperature=0.2,  # low temperature for stable synthesis
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    final_response = tokenizer.decode(
        final_output[0][agg_inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return final_response


if __name__ == "__main__":
    model, tokenizer = setup_model()
    print("\nAvicenna Interactive Consultant")
    print("Type 'exit' or 'quit' to stop.\n")
    while True:
        print("\n" + "-" * 30)
        question = input("Enter Clinical Case/Question: ")
        if question.lower() in ["exit", "quit"]:
            break
        final_answer = solve_with_moa_open_ended(model, tokenizer, question)
        print("\n" + "=" * 40)
        print("FINAL CLINICAL CONSENSUS")
        print("=" * 40)
        print(final_answer)
```
⚠️ Disclaimer
- Research Use Only: Avicenna-8B-Base is a derivative of Llama 3.1. It is intended for academic research, benchmarking, and decision-support prototyping.
- Not a Doctor: The model can hallucinate. It should never be used for real-world patient diagnosis or treatment without human supervision.
