🧬 Avicenna-8B-Base
"Restoring the 'Think' in Medical AI."
Avicenna-8B-Base is the foundational model of the Avicenna Project: a specialized medical language model engineered to achieve SOTA reasoning at the 8B parameter scale via architectural merging. It is built by surgically merging three distinct Llama 3.1 models and pairing the result with a Self-Consistency Ensembling inference strategy.
🏗️ Architecture: "The Surgical Merge"
Unlike standard merges that blend models uniformly, Avicenna-8B-Base uses a layer-segmented DARE-TIES configuration to assign specific cognitive roles to different parts of the network.
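To make the DARE half of DARE-TIES concrete: each donor model's parameter deltas are randomly dropped with some probability `p`, and the surviving deltas are rescaled by `1/(1-p)` so their expected contribution is preserved before merging. The sketch below is illustrative only (the function name and list-based representation are ours, not part of the actual merge tooling); here the drop mask is passed in explicitly so the behavior is deterministic:

```python
def dare_rescale(delta, keep_mask, p):
    """DARE sparsification step: zero out dropped delta-parameters and
    rescale the kept ones by 1/(1-p), preserving the expected magnitude."""
    return [d / (1.0 - p) if keep else 0.0 for d, keep in zip(delta, keep_mask)]

# Drop half the deltas (p=0.5): survivors are doubled, dropped entries zeroed.
delta = [1.0, 2.0, 3.0, 4.0]
print(dare_rescale(delta, [True, False, True, False], p=0.5))
# [2.0, 0.0, 6.0, 0.0]
```

In the real merge, the mask is sampled randomly per parameter; TIES then resolves sign conflicts between the surviving deltas before they are summed into the base weights.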
| Model Region | Source Model | Role | Weights |
|---|---|---|---|
| Foundation (Layers 0-8) | Llama-3.1-Instruct | Syntax, instruction following, and grammar stability. | 100% |
| Logic Core (Layers 8-20) | Hermes-3 + Llama-3.1 | Clinical reasoning: implicit logic and causal analysis. | 45% Hermes / 55% Base |
| Medical Cortex (Layers 20-28) | Aloe-Beta + Llama-3.1 | Knowledge retrieval: high-density injection of medical textbooks and guidelines. | 52% Aloe / 48% Base |
| Frontal Cortex (Layers 28-32) | Llama-3.1-Instruct | Safety & output: ensures polite, structured, and compliant responses. | 100% |
This structure prevents "catastrophic forgetting" of general logic while injecting massive medical knowledge into the deep layers.
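The layer-segmented assignment above can be sketched as a simple lookup over the 32 transformer layers of an 8B Llama model (layer ranges and blend weights are taken from the table; the function name is illustrative, not part of any merge config):

```python
def layer_recipe(layer_idx):
    """Map a transformer layer index to its region name and source-model
    blend weights, following the surgical-merge table (32 layers total)."""
    if 0 <= layer_idx < 8:
        return "Foundation", {"Llama-3.1-Instruct": 1.00}
    if 8 <= layer_idx < 20:
        return "Logic Core", {"Hermes-3": 0.45, "Llama-3.1": 0.55}
    if 20 <= layer_idx < 28:
        return "Medical Cortex", {"Aloe-Beta": 0.52, "Llama-3.1": 0.48}
    if 28 <= layer_idx < 32:
        return "Frontal Cortex", {"Llama-3.1-Instruct": 1.00}
    raise ValueError("Llama-3.1-8B has 32 transformer layers (indices 0-31)")

print(layer_recipe(24))
# ('Medical Cortex', {'Aloe-Beta': 0.52, 'Llama-3.1': 0.48})
```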
🏆 Benchmark Performance (Comprehensive Comparison)
We compared Avicenna-8B-Base against other leading medical models across three major benchmarks: MedQA (USMLE), MMLU-Medical, and MedMCQA.
| Model | Size | Inference Method | MedQA (USMLE) | MMLU-Medical | MedMCQA |
|---|---|---|---|---|---|
| Avicenna-8B-Base | 8B | Self-Consistency (SC) (N=5) | 61.0% | - | 50.0% |
| Avicenna-8B-Base | 8B | Greedy | 60.0% | 69.5% | - |
| GPT-3.5 Turbo | 175B+ | Standard | 61.2% | 73.5% | 59.4% |
| ClinicalCamel-70B | 70B | Standard | 45.8% | 68.4% | 45.8% |
| PMC-LLaMA-13B | 13B | Standard | 39.6% | 56.3% | 37.7% |
| MedAlpaca-13B | 13B | Standard | 37.3% | 51.5% | 35.7% |
| BioMistral-7B | 7B | Standard | 35.4% | 52.6% | 34.8% |
| Meditron-7B | 7B | Standard | 33.5% | 45.2% | 31.1% |
Methodology Notes:
- Hardware: All results obtained using 4-bit NF4 Quantization on NVIDIA T4 GPUs. Full precision scores are expected to be higher.
- Inference: MedQA and MedMCQA utilized Self-Consistency Ensembling (SC) inference (N=5 voters). MMLU utilized standard Greedy decoding.
- Sampling: MedQA and MedMCQA results represent randomized subsets of the validation/test sets due to compute constraints. MMLU represents the complete evaluation of all 6 medical subsets.
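Self-Consistency here means sampling N=5 completions per question and taking a majority vote on the extracted answer choice. A minimal sketch of the voting step (the answer-extraction regex and tie-breaking rule are our illustrative assumptions, not the exact evaluation harness):

```python
from collections import Counter
import re

def majority_vote(completions):
    """Extract the letter choice (A-E) from each sampled completion and
    return the most common one; ties resolve to the earliest-seen answer
    (Counter.most_common is stable with respect to insertion order)."""
    votes = []
    for text in completions:
        m = re.search(r"\b([A-E])\b", text)  # naive single-letter extraction
        if m:
            votes.append(m.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else None

samples = ["The answer is B", "B, because...", "I'd pick C", "Answer: B", "A"]
print(majority_vote(samples))  # B
```

The intuition is that diverse reasoning paths which converge on the same final answer are more likely to be correct than any single greedy decode, which is consistent with the SC scores above edging out the greedy ones.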
🚀 How to Run (Self-Consistency Ensembling)
Use the following Python script, which implements the ensembling logic: it samples three diverse drafts per query, then synthesizes them into a single consensus answer.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- CONFIGURATION ---
MODEL_ID = "salihfurkaan/Avicenna-8B-Base"
TOKENIZER_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def setup_model():
    print(f"Loading {MODEL_ID} in 4-bit mode...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,  # remove this line for the non-quantized version
        device_map="auto",
    )
    return model, tokenizer


def solve_with_moa_open_ended(model, tokenizer, user_input):
    """
    Runs Mixture-of-Agents for open-ended queries:
    1. Generates 3 distinct clinical opinions (drafts).
    2. Synthesizes them into a final consensus answer.
    """
    # --- PHASE 1: DRAFTING (3 internal specialists) ---
    system_prompt_draft = (
        "You are Avicenna, an expert medical consultant. Analyze the case "
        "step-by-step. Provide a Differential Diagnosis and Recommended Next Steps."
    )
    messages = [
        {"role": "system", "content": system_prompt_draft},
        {"role": "user", "content": user_input},
    ]
    # return_dict=True yields a BatchEncoding, so it can be unpacked into generate()
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    print("Consulting 3 internal specialists (Drafting Phase)...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1536,
            temperature=0.7,         # high temperature for diverse perspectives
            do_sample=True,
            num_return_sequences=3,  # generate 3 drafts
            pad_token_id=tokenizer.eos_token_id,
        )

    # Slice off the prompt so only the newly generated tokens remain.
    # outputs shape: [3, seq_len]; prompt length: inputs["input_ids"].shape[1]
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    drafts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # --- PHASE 2: SYNTHESIS (chief resident) ---
    print("Synthesizing Final Consensus...")
    combined_drafts = ""
    for i, draft in enumerate(drafts):
        combined_drafts += f"\n[Opinion {i+1}]:\n{draft}\n"
        # Optional: print drafts to see the internal debate
        # print(f"\n--- Opinion {i+1} ---\n{draft[:200]}...")

    aggregator_prompt = (
        f"Clinical Case:\n{user_input}\n\n"
        f"Consider the following 3 medical opinions on this case:\n{combined_drafts}\n\n"
        "TASK: Synthesize these opinions into a single, highly accurate, and professional clinical assessment. "
        "Resolve any conflicts by prioritizing patient safety and standard of care. "
        "Structure the answer clearly: 1. Assessment, 2. Key Differentials, 3. Plan."
    )
    agg_messages = [
        {"role": "system", "content": "You are a Senior Chief Physician. Provide a final authoritative consultation."},
        {"role": "user", "content": aggregator_prompt},
    ]
    agg_inputs = tokenizer.apply_chat_template(
        agg_messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)

    with torch.no_grad():
        final_output = model.generate(
            **agg_inputs,
            max_new_tokens=768,
            temperature=0.2,  # low temperature for stable synthesis
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    final_response = tokenizer.decode(
        final_output[0][agg_inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return final_response


if __name__ == "__main__":
    model, tokenizer = setup_model()
    print("\nAvicenna Interactive Consultant")
    print("Type 'exit' or 'quit' to stop.\n")
    while True:
        print("\n" + "-" * 30)
        question = input("Enter Clinical Case/Question: ")
        if question.lower() in ["exit", "quit"]:
            break
        final_answer = solve_with_moa_open_ended(model, tokenizer, question)
        print("\n" + "=" * 40)
        print("FINAL CLINICAL CONSENSUS")
        print("=" * 40)
        print(final_answer)
```
⚠️ Disclaimer
- Research Use Only: Avicenna-8B-Base is a derivative of Llama 3.1. It is intended for academic research, benchmarking, and decision-support prototyping.
- Not a Doctor: The model can hallucinate. It should never be used for real-world patient diagnosis or treatment without human supervision.
