---
base_model:
- OpenGVLab/InternVL3_5-1B-Instruct
language:
- en
license: mit
metrics:
- accuracy
tags:
- visual-reasoning
- fine-grained-vqa
- fine-grained-recognition
pipeline_tag: image-text-to-text
library_name: transformers
---

# Model Card for TWIN-InternVL3_5-1B

This repository contains the InternVL3.5-1B model post-trained on the TWIN dataset, introduced in the paper [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592).

TWIN is a large-scale dataset of 561,000 image-pair queries designed to enhance the perceptual abilities of Vision-Language Models (VLMs). Each query asks the model to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. Fine-tuning on TWIN yields significant gains in fine-grained recognition across domains such as art, animals, plants, and landmarks.

## Resources

- **Project Page:** [https://glab-caltech.github.io/twin/](https://glab-caltech.github.io/twin/)
- **Paper:** [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592)
- **Code Repository:** [https://github.com/damianomarsili/TWIN](https://github.com/damianomarsili/TWIN)
- **Dataset:** [glab-caltech/TWIN](https://huggingface.co/datasets/glab-caltech/TWIN)
- **Benchmark Suite:** [glab-caltech/FGVQA](https://huggingface.co/datasets/glab-caltech/FGVQA)

## Citation

If you use TWIN in your research, please consider citing the work:

```bibtex
@misc{marsili2025notenhancingvisualperception,
      title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
      author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
      year={2025},
      eprint={2512.23592},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.23592},
}
```
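## Usage

The sketch below shows one way a same-or-not query might be posed to this model with `transformers`. It is a minimal illustration, not an official recipe: the repository id, the multi-image prompt wording, and the `model.chat(...)` call (which follows the chat interface published for InternVL checkpoints via `trust_remote_code`) are all assumptions, and image preprocessing is omitted.

```python
"""Hedged sketch of a TWIN-style same-or-not query.

Assumptions (not stated in the model card): the repo id, the prompt
wording, and InternVL's remote-code ``model.chat`` interface.
"""


def build_twin_prompt() -> str:
    """Build a two-image same-or-not prompt in InternVL's multi-image style.

    The exact wording used during TWIN fine-tuning is an assumption here.
    """
    return (
        "Image-1: <image>\n"
        "Image-2: <image>\n"
        "Do these two images depict the same object? Answer yes or no."
    )


def run_query():
    """Load the model and ask a same-or-not question (downloads weights)."""
    # Heavy imports are kept local so build_twin_prompt stays importable
    # without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "glab-caltech/TWIN-InternVL3_5-1B"  # assumed repo id
    model = AutoModel.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # pixel_values: both images preprocessed with InternVL's tiling and
    # concatenated along the batch dimension; num_patches_list records the
    # tile count per image. Preprocessing is omitted from this sketch.
    pixel_values = ...
    num_patches_list = [...]

    return model.chat(
        tokenizer,
        pixel_values,
        build_twin_prompt(),
        generation_config=dict(max_new_tokens=8, do_sample=False),
        num_patches_list=num_patches_list,
    )


# run_query()  # uncomment to run; requires a GPU and a weights download
```

The prompt helper runs standalone; `run_query` is deliberately left uncalled, since it downloads model weights.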