# AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration
AramT5 is a fine-tuned version of `t5-small`, trained to transliterate Syriac text into Latinised Serto (West Syriac) and Madnḥaya (East Syriac).
## ⚠️ Current Limitations
- Occasional under-generation (shorter outputs than expected)
- Occasional vowel omission or compression
- Reliability varies on very long, uncommon, or morphologically complex words and sentences
## Development information

- 🔧 Current version: v3.2 (stage 4)
- ⏳ Upcoming release: v4 (stage 5)
## 📖 About the Project

AramT5 is a transformer fine-tuned on Syriac-to-Latin data, with a focus on Serto and Madnḥaya. The model performs script conversion, not translation, making it well suited to educational tools and linguistic preservation.
This project:
- Supports underrepresented languages in AI
- Offers open access to transliteration tools for the Syriac language
- Was created with humility, curiosity, and deep care by a Syriac learner and enthusiast
## 💻 Try it out

Use one of the following prefixes to control the output dialect:

- `Syriac2WestLatin` (Serto)
- `Syriac2EastLatin` (Madnḥaya)
Then, use the model directly via Hugging Face 🤗 Transformers:
```python
from transformers import pipeline

pipe = pipeline("text2text-generation", model="crossroderick/aramt5")

text = "ܐܒܘܢ ܕܒܫܡܝܐ"
input_text = f"Syriac2WestLatin: {text}"

output = pipe(input_text, max_length=128)[0]["generated_text"]
print(output)
```
Example output:

Input:

```
ܐܒܘܢ ܕܒܫܡܝܐ
```

Output (West):

```
ʾabun d-b-šmayo
```
## 🙏 Acknowledgements
Despite being an independent project, AramT5 makes use of four very important datasets:
- The Syriac translation of the Bible (Peshitta), obtained from OPUS' Bible dataset
- Syriac texts from the Syriac Digital Corpus, containing writings from celebrated authors such as Isaac of Nineveh, Saint Ephrem the Syrian, and Aphrahat
- Beth Mardutho's Syriac Electronic Data Research Archive (SEDRA), a comprehensive online linguistic and literary database for the Syriac language
- The Wikipedia dump of articles in the Aramaic (Syriac) language, obtained via the `wikiextractor` Python package
## 🤖 Fine-tuning instructions
Given their total size, the datasets haven't been included in this model's repository. However, should you wish to fine-tune AramT5 yourself, please complete the following initial steps:
- Run the `get_data.sh` shell script in the `src/data` folder

  **Observation**: if you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and carrying out its steps manually. Likewise, `generate_clean_corpus.sh` will also error out, requiring you to find equivalent Windows functionality to filter out blank or empty lines in the `syriac_east_corpus.jsonl` and `syriac_west_corpus.jsonl` files, as well as to shuffle them. Additionally, be sure to install the `wikiextractor` and `sentencepiece` packages beforehand (the exact versions can be found in the `requirements.txt` file).

- Run the `generate_syr_lat_pairs.py` file in the same folder
- Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
- Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora
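For Windows users without a POSIX shell, the blank-line filtering and shuffling step can be approximated in Python. This is an illustrative sketch, not the project's actual `generate_clean_corpus.sh`; only the corpus file names are taken from the instructions above:

```python
import random
from pathlib import Path

def clean_and_shuffle_jsonl(path: Path, seed: int = 42) -> None:
    """Drop blank/whitespace-only lines from a JSONL corpus, then shuffle it in place."""
    lines = [
        line
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()  # discard blank or empty lines
    ]
    random.Random(seed).shuffle(lines)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# The corpus file names come from the observation above; adjust the
# directory to wherever the data-fetching step placed them.
for name in ("syriac_west_corpus.jsonl", "syriac_east_corpus.jsonl"):
    corpus = Path("src/data") / name
    if corpus.exists():
        clean_and_shuffle_jsonl(corpus)
```

A fixed seed keeps the shuffle reproducible across runs.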
The model training process follows a curriculum learning format and comprises six stages:
| Stage | Samples | Max. sentence length | Mixes shorter sentences | Objective |
|---|---|---|---|---|
| 1 | 20000 | 15 | No | Expose the base T5 model to Syriac morphology |
| 2 | 40000 | 30 | Yes | Introduce short sentences to AramT5 |
| 3 | 60000 | 50 | Yes | Introduce medium sentences to AramT5 |
| 4 | 120000 | 70 | Yes | Introduce longer sentences to AramT5 |
| 5 | 150000 | 100 | Yes | Reinforce longer sentences to AramT5 |
| 6 | 180000 | 150 | Yes | Introduce the full practical corpus to AramT5 |
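The table above amounts to a set of per-stage sampling configurations. As a rough sketch of how such curriculum sampling could work (hypothetical; the actual `src/train_t5.py` may implement it differently):

```python
import random

# Stage configuration taken from the curriculum table above:
# (sample budget, max. sentence length in words, mixes shorter sentences).
STAGES = {
    1: (20_000, 15, False),
    2: (40_000, 30, True),
    3: (60_000, 50, True),
    4: (120_000, 70, True),
    5: (150_000, 100, True),
    6: (180_000, 150, True),
}

def select_stage_samples(corpus: list[str], stage: int, seed: int = 42) -> list[str]:
    """Pick up to the stage's sample budget from sentences within its length cap."""
    budget, max_len, _mixes_shorter = STAGES[stage]
    eligible = [s for s in corpus if len(s.split()) <= max_len]
    random.Random(seed).shuffle(eligible)
    return eligible[:budget]
```

Each stage widens the length cap, so earlier (shorter) material stays in the eligible pool while longer sentences are introduced gradually.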
To run a stage 1 training pass, run the script directly from your IDE or use the following command:

```shell
uv run python src/train_t5.py --stage 1
```
For stages 2 to 6, use the following command instead, replacing the `2` with the desired stage number:

```shell
uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
```
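The commands above imply a small CLI along these lines. This is a hypothetical reconstruction for illustration only; the real `src/train_t5.py` may define its flags differently:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the flags implied by the commands above.
    parser = argparse.ArgumentParser(
        description="Fine-tune AramT5 for one curriculum stage"
    )
    parser.add_argument("--stage", type=int, required=True,
                        choices=range(1, 7), help="Curriculum stage (1-6)")
    parser.add_argument("--hf-model", type=str, default=None,
                        help="Hub model to continue from (stages 2-6)")
    args = parser.parse_args(argv)
    # Stages past the first continue from a previously published checkpoint.
    if args.stage > 1 and args.hf_model is None:
        parser.error("--hf-model is required for stages 2 and above")
    return args
```

Requiring `--hf-model` only past stage 1 mirrors the two command forms shown above.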
**Observation**: model files are saved in the `src/checkpoints/stage{n}-final` directory, where `n` corresponds to the stage used during fine-tuning.
## 📋 Version Changelog
- **AramT5 Baseline** (May 20, 2026): base `t5-small` model fine-tuned on 20k records across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner
- **AramT5 v1** (May 20, 2026): fine-tuned on 40k records across 20 epochs, leveraging the stage 2 configuration. A massive upgrade over the baseline, v1 showcased significantly improved morphological handling of not only single words but also sequences of noticeable complexity
- **AramT5 v2** (May 20, 2026): fine-tuned on 60k records across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while also improving on longer Syriac sentences
- **AramT5 v3** (May 21, 2026): fine-tuned on 80k records across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration of longer sentences while retaining its knowledge of single words
- **AramT5 v3.1** (May 22, 2026): fine-tuned on 120k records across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3's fine-tuning, this version was trained on more data with a different distribution (and more manual entries) for a more balanced mix of single words and multi-word phrases, culminating in superior transliteration capabilities
- **AramT5 v3.2** (May 23, 2026): fine-tuned on 120k records across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals, resulting in a corrected understanding of various atomic words and a more comprehensive distinction between singular and plural forms
## Evaluation results

Self-reported on the Syriac Transliteration Corpus:

| Metric | Value |
|---|---|
| Training loss | 1.901 |
| Evaluation loss | 2.029 |
| CER | 0.160 |
| Exact match | 0.622 |
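CER and exact match are standard sequence metrics; the following is a generic sketch of how they are typically computed (the project's actual evaluation code may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def exact_match(refs: list[str], hyps: list[str]) -> float:
    """Fraction of hypotheses identical to their references."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Lower is better for CER; higher is better for exact match.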