Add pipeline tag and library name, improve model card
Hi! I'm Niels from the Hugging Face community science team.
This pull request improves your model card by adding the `pipeline_tag` and `library_name` to the metadata. These tags help users discover your model more easily and enable automated code snippets on the Hub. I've also updated the model card with structured links to the paper, project page, and code repository to provide more context for users.
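For reference, once `pipeline_tag: image-text-to-text` and `library_name: transformers` are set, the Hub can surface an inference snippet along these lines. This is only a minimal sketch: the repo id is assumed from the card title, and whether this checkpoint loads through the `transformers` `image-text-to-text` pipeline (or needs `trust_remote_code`) depends on how it is packaged.

```python
from transformers import pipeline

# Assumed repo id, inferred from the card title "TWIN-InternVL3_5-1B".
MODEL_ID = "glab-caltech/TWIN-InternVL3_5-1B"

# trust_remote_code may or may not be required, depending on how the checkpoint is packaged.
pipe = pipeline("image-text-to-text", model=MODEL_ID, trust_remote_code=True)

# A TWIN-style query: two visually similar images plus a same-or-not question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/painting_a.jpg"},  # placeholder URL
            {"type": "image", "url": "https://example.com/painting_b.jpg"},  # placeholder URL
            {"type": "text", "text": "Do these two images show the same object?"},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=32, return_full_text=False)
print(outputs[0]["generated_text"])
```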
Please review and merge if this looks good to you!
README.md CHANGED

```diff
@@ -1,30 +1,38 @@
 ---
+base_model:
+- OpenGVLab/InternVL3_5-1B-Instruct
 language:
 - en
+license: mit
 metrics:
 - accuracy
-base_model:
-- OpenGVLab/InternVL3_5-1B-Instruct
 tags:
 - visual-reasoning
 - fine-grained-vqa
 - fine-grained-recognition
-
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
-# Model Card for TWIN-Qwen2.5-VL-3B
 
-
+# Model Card for TWIN-InternVL3_5-1B
+
+This repository contains the InternVL3.5-1B model post-trained on the TWIN dataset, as introduced in the paper [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592).
+
+TWIN is a large-scale dataset of 561,000 image-pair queries designed to enhance the perceptual abilities of Vision-Language Models (VLMs). It tasks models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. Fine-tuning on TWIN yields significant gains in fine-grained recognition across various domains like art, animals, plants, and landmarks.
 
-
+## Resources
 
-
+- **Project Page:** [https://glab-caltech.github.io/twin/](https://glab-caltech.github.io/twin/)
+- **Paper:** [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592)
+- **Code Repository:** [https://github.com/damianomarsili/TWIN](https://github.com/damianomarsili/TWIN)
+- **Dataset:** [glab-caltech/TWIN](https://huggingface.co/datasets/glab-caltech/TWIN)
+- **Benchmark Suite:** [glab-caltech/FGVQA](https://huggingface.co/datasets/glab-caltech/FGVQA)
 
 ## Citation
 
-If you use TWIN in your research, please consider citing
+If you use TWIN in your research, please consider citing the work:
 
-
-```
+```bibtex
 @misc{marsili2025notenhancingvisualperception,
       title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
       author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
```
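Separately, a minimal sketch of how the TWIN dataset linked above could be pulled for a quick look; the split name is an assumption and the column layout should be checked against the dataset card.

```python
from datasets import load_dataset

# "train" split is assumed; see https://huggingface.co/datasets/glab-caltech/TWIN for the actual splits.
ds = load_dataset("glab-caltech/TWIN", split="train")

print(ds)     # column names and number of image-pair queries
print(ds[0])  # one example: an image pair with its same-or-not annotation
```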