---
license: mit
language:
- ita
tags:
- tokenizer
- unigram
- flexitok
- fineweb2
---
# UnigramLM Tokenizer: ita_Latn (32K)
A **UnigramLM** tokenizer trained on **ita_Latn** (Italian, Latin script) data from the FineWeb-2-HQ corpus.
## Training Details
| Parameter | Value |
|-----------|-------|
| Algorithm | UnigramLM |
| Language | `ita_Latn` |
| Target Vocab Size | 32,000 |
| Final Vocab Size | 0 |
| Pre-tokenizer | ByteLevel |
| Normalizer | NFC |
| Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
| Training Shards | 2 |
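The table above maps directly onto the Hugging Face `tokenizers` training API. The snippet below is a minimal sketch of how a tokenizer with this configuration could be trained; the corpus here is a tiny placeholder, not the actual FineWeb-2-HQ shards, and the exact trainer options used for this repo are an assumption.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Unigram model with the normalizer / pre-tokenizer listed in the card.
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.UnigramTrainer(
    vocab_size=32000,  # target vocab size from the card
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
    unk_token="<unk>",
)

# Placeholder corpus; the real training data is 2 shards of FineWeb-2-HQ.
corpus = ["Ciao, mondo!", "Questo è un esempio di testo italiano."] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)
```

On a corpus this small the trainer will stop well short of the 32,000-piece target, which is expected behavior for Unigram pruning.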
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flexitok/-unigram_ita_Latn_32000")

# Encode Italian text to token ids, then decode back to a string.
token_ids = tokenizer.encode("Ciao, mondo!")
text = tokenizer.decode(token_ids)
```
## Files
- `tokenizer.json` — Full HuggingFace tokenizer
- `vocab.json` — Vocabulary mapping
- `tokenizer.model` — SentencePiece protobuf (if available)
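Because `tokenizer.json` is the complete serialized tokenizer, it can also be loaded with the standalone `tokenizers` library, without `transformers`. The sketch below demonstrates the round trip using a toy Unigram vocabulary (not this repo's real one); the downloaded `tokenizer.json` from this repo would be loaded the same way with `Tokenizer.from_file`.

```python
import os
import tempfile

from tokenizers import Tokenizer, models

# Toy Unigram vocabulary: (piece, log-probability) pairs, with <unk> at id 0.
toy = Tokenizer(
    models.Unigram([("<unk>", 0.0), ("ciao", -1.0), ("mondo", -2.0)], unk_id=0)
)

# Save to a tokenizer.json file, then restore it from disk.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
toy.save(path)
restored = Tokenizer.from_file(path)
```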