---
license: mit
base_model: fredzzp/open-dcoder-0.5B
tags:
- code-generation
- diffusion-model
- masked-diffusion
- code-correction
- python
datasets:
- code
language:
- code
pipeline_tag: text-generation
---

# CDLM-0.5B

## Model Description

**CDLM-0.5B** is a fine-tuned version of [fredzzp/open-dcoder-0.5B](https://huggingface.co/fredzzp/open-dcoder-0.5B), trained using **error-aware training** with the mixture objective proposed in our paper on Corrective Diffusion Language Models. This model is designed to improve error-aware confidence and targeted refinement capabilities in code generation tasks.

### Key Features

- **Base Model**: [fredzzp/open-dcoder-0.5B](https://huggingface.co/fredzzp/open-dcoder-0.5B) (a masked diffusion language model based on Qwen2)
- **Training Method**: Error-aware training with mixture objective that explicitly supervises visible incorrect tokens
- **Architecture**: Masked Diffusion Language Model (MDLM)
- **Parameters**: ~0.5B

## Training Details

This model was fine-tuned from `fredzzp/open-dcoder-0.5B` using error-aware training with a mixture objective. For detailed information on the training methodology, please refer to our paper: [Corrective Diffusion Language Models](https://arxiv.org/pdf/2512.15596).

## Usage

### Installation

```bash
pip install torch transformers
```

### Code Generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Shuibai12138/CDLM-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to(device)

# Generate code
prompt = "def fibonacci(n):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Use diffusion generation
outputs = model.diffusion_generate(
    inputs=input_ids,
    max_new_tokens=100,
    steps=16,
    temperature=0.8
)

prompt_len = input_ids.shape[1]
generated_text = tokenizer.decode(outputs.sequences[0][prompt_len:], skip_special_tokens=True)

print("Generated Code:")
print(generated_text)
```

**Note**: This model uses a custom `diffusion_generate` method, so `trust_remote_code=True` is required when loading the model.

### Iterative Refinement

The model supports iterative refinement for code correction. See the [CDLM repository](https://github.com/zhangshuibai/CDLM) for examples of using the model for code correction tasks.
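The core idea behind iterative refinement in a masked diffusion model is to remask the positions the model is least confident about and regenerate only those. The exact refinement API lives in the CDLM repository; the helper below is a hypothetical, minimal sketch of that remasking step (the mask token id shown is a placeholder, not the tokenizer's actual id):

```python
import torch

MASK_TOKEN_ID = 151666  # placeholder; use the tokenizer's actual mask token id

def remask_low_confidence(token_ids, confidences, threshold=0.5):
    """Replace tokens whose confidence falls below `threshold` with the mask id.

    Returns the remasked ids and the indices that were remasked, so a
    diffusion model can refill only those positions on the next pass.
    """
    token_ids = token_ids.clone()
    low = confidences < threshold
    token_ids[low] = MASK_TOKEN_ID
    return token_ids, low.nonzero(as_tuple=True)[0]

# Toy example: positions 1 and 3 fall below the threshold and get remasked.
ids = torch.tensor([101, 205, 309, 410])
conf = torch.tensor([0.9, 0.3, 0.8, 0.2])
remasked, idx = remask_low_confidence(ids, conf)
```

In a full refinement loop, the remasked sequence would be fed back through the model's generation step, repeating until confidences stabilize or a step budget is reached.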

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{zhang2025correctivediffusionlanguagemodels,
      title={Corrective Diffusion Language Models}, 
      author={Shuibai Zhang and Fred Zhangzhi Peng and Yiheng Zhang and Jin Pan and Grigorios G. Chrysos},
      year={2025},
      eprint={2512.15596},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.15596}, 
}
```

## Related Resources

- **Paper**: [Corrective Diffusion Language Models](https://arxiv.org/pdf/2512.15596)
- **Code Repository**: [zhangshuibai/CDLM](https://github.com/zhangshuibai/CDLM)
- **Collection**: [Hugging Face Collection](https://huggingface.co/collections/Shuibai12138/cdlm)
- **Base Model**: [fredzzp/open-dcoder-0.5B](https://huggingface.co/fredzzp/open-dcoder-0.5B)

## License

This model is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

For questions and issues, please contact:

**Shuibai Zhang** <shuibai@cs.wisc.edu>