merlin-tokenizer-v0

BPE tokenizer for Merlin, trained on Python, Bash, Markdown, Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

Vocab

  • 32,016 tokens total: 32,000 BPE + 16 special tokens
  • Special tokens:
    • IDs 0โ€“13: legacy slots (unused)
    • ID 32000: <|bos|>
    • ID 32001: <|eos|>
    • ID 32002โ€“32013: agent protocol tokens (<|tool_call|>, <|/tool_call|>, etc.)
    • ID 32014: <|done|>
    • ID 32015: <|pad|>

Usage

from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("def hello(): pass").ids

Note: Will be retrained after agentic trace generation. This is v0 โ€” trained on pretraining corpus only.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support