AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration β™°

AramT5 is a fine-tuned version of t5-small, trained to transliterate Syriac text into Latinised Serto (West Syriac) and MadnαΈ₯aya (East Syriac) forms.

⚠️ Current Limitations

  • Occasional under-generation (shorter outputs than expected)
  • Occasional vowel omission or compression
  • Reliability varies on very long, uncommon, or morphologically complex words and sentences

Development information

  • 🚧 Current version: v3.2 (stage 4)
  • ⏳ Upcoming release: v4 (stage 5)

🌍 About the Project

AramT5 is a transformer fine-tuned on Syriac-to-Latin data, covering both Serto and MadnαΈ₯aya. The model performs script conversion, not translation, making it well suited to educational tools and linguistic preservation.

This project:

  • Supports underrepresented languages in AI
  • Offers open access to transliteration tools in the Syriac language
  • Was created with humility, curiosity, and deep care by a Syriac learner and enthusiast

πŸ’» Try it out

Use prefixes to control output dialect:

  • Syriac2WestLatin
  • Syriac2EastLatin

Then use the model directly via Hugging Face πŸ€— Transformers:

from transformers import pipeline

# Load the AramT5 transliteration pipeline
pipe = pipeline("text2text-generation", model="crossroderick/aramt5")

# Prefix the input with the desired output dialect
text = "ά’ά‘ά άŸά˜ά¬ά ܕܐܠܗܐ ܕܐܒάͺά—ά‘."
input_text = f"Syriac2WestLatin: {text}"
output = pipe(input_text, max_length=128)[0]["generated_text"]

print(output)

Example output:

Input:
άά’ά˜ά’ ܕܒܫܑܝܐ

Output (West):
ΚΎabun d-b-Ε‘mayo
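To target the East Syriac output instead, swap the prefix. The prefix handling can also be wrapped in a small convenience helper; `build_input` and its `dialect` argument below are my own names, not part of the model repository:

```python
# Hypothetical convenience helper; not part of the AramT5 repository.
PREFIXES = {
    "west": "Syriac2WestLatin",
    "east": "Syriac2EastLatin",
}


def build_input(text: str, dialect: str = "west") -> str:
    """Prepend the dialect control prefix that AramT5 expects."""
    if dialect not in PREFIXES:
        raise ValueError(f"Unknown dialect: {dialect!r}")
    return f"{PREFIXES[dialect]}: {text}"


# e.g. build_input("άά’ά˜ά’ ܕܒܫܑܝܐ", "east") produces an East-prefixed input
```

The returned string can then be passed to the pipeline exactly as in the example above.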

πŸ™ Acknowledgements

AramT5 is an independent project, but it builds on four important datasets:

  • The Syriac translation of the Bible (Peshitta), obtained from OPUS' Bible dataset
  • Syriac texts from the Syriac Digital Corpus, containing writings from celebrated authors such as Isaac of Nineveh, Saint Ephrem the Syrian, and Aphrahat
  • Beth Mardutho's Syriac Electronic Data Research Archive (SEDRA), a comprehensive online linguistic and literary database for the Syriac language
  • The Wikipedia dump of articles in the Aramaic (Syriac) language, obtained via the wikiextractor Python package

πŸ€– Fine-tuning instructions

Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune AramT5 yourself, follow these initial steps:

  1. Run the get_data.sh shell script file in the "src/data" folder

Observation: if you're on Windows, the get_data.sh script likely won't work; you can still obtain the data by following the links in the file and performing the steps manually. Likewise, generate_clean_corpus.sh will error out, so you'll need an equivalent way on Windows to filter out blank or empty lines in the syriac_east_corpus.jsonl and syriac_west_corpus.jsonl files and to shuffle them. Additionally, be sure to install the wikiextractor and sentencepiece packages beforehand (the exact versions can be found in requirements.txt).
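On Windows, the filtering and shuffling that generate_clean_corpus.sh performs can be approximated with a few lines of Python. This is a sketch under the assumption that the corpora are plain line-delimited JSONL files; `clean_and_shuffle` is my own helper name:

```python
import random


def clean_and_shuffle(in_path: str, out_path: str, seed: int = 42) -> None:
    """Drop blank/whitespace-only lines from a JSONL corpus, then shuffle it."""
    with open(in_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    # Fixed seed so the shuffle is reproducible across runs
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)


# Example usage (file names follow the shell script above):
# clean_and_shuffle("syriac_west_corpus.jsonl", "syriac_west_corpus.jsonl")
```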

  2. Run the generate_syr_lat_pairs.py file in the same folder
  3. Run generate_clean_corpus.sh to clean the West and East Syriac corpora files and shuffle the datasets
  4. Run train_tokeniser.py to train the tokeniser on the cleaned corpora

The model training process follows a curriculum learning format and comprises six stages:

| Stage | Samples | Max. sentence len. | Mixes shorter sen. | Objective |
|-------|---------|--------------------|--------------------|-----------|
| 1 | 20000 | 15 | No | Expose the base T5 model to Syriac morphology |
| 2 | 40000 | 30 | Yes | Introduce short sentences to AramT5 |
| 3 | 60000 | 50 | Yes | Introduce medium sentences to AramT5 |
| 4 | 120000 | 70 | Yes | Introduce longer sentences to AramT5 |
| 5 | 150000 | 100 | Yes | Reinforce longer sentences to AramT5 |
| 6 | 180000 | 150 | Yes | Introduce the full practical corpus to AramT5 |
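The stage table above can be encoded as a simple lookup, e.g. for driving data sampling in a training script. The dict layout below is illustrative, not the repository's actual configuration format:

```python
# Curriculum stages as described in the table above (illustrative layout).
CURRICULUM = {
    1: {"samples": 20_000,  "max_len": 15,  "mix_shorter": False},
    2: {"samples": 40_000,  "max_len": 30,  "mix_shorter": True},
    3: {"samples": 60_000,  "max_len": 50,  "mix_shorter": True},
    4: {"samples": 120_000, "max_len": 70,  "mix_shorter": True},
    5: {"samples": 150_000, "max_len": 100, "mix_shorter": True},
    6: {"samples": 180_000, "max_len": 150, "mix_shorter": True},
}


def stage_config(stage: int) -> dict:
    """Return the sampling configuration for a curriculum stage."""
    if stage not in CURRICULUM:
        raise ValueError(f"Stage must be between 1 and 6, got {stage}")
    return CURRICULUM[stage]
```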

To run stage 1 training, run the script directly from your IDE or use the following command:

uv run python src/train_t5.py --stage 1

For stages 2 to 6, use the following command instead:

uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name

* Remember to replace the '2' in the command with '3' for stage 3, and so on.

Observation: model files are saved in the src/checkpoints/stage{n}-final folder, where n corresponds to the stage used during fine-tuning.
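Given that layout, the checkpoint directory for a stage can be derived programmatically. A minimal sketch, assuming the path pattern from the observation above; `checkpoint_dir` is my own helper name:

```python
from pathlib import Path


def checkpoint_dir(stage: int, root: str = "src/checkpoints") -> Path:
    """Build the final checkpoint directory path for a given training stage."""
    return Path(root) / f"stage{stage}-final"


print(checkpoint_dir(4).as_posix())  # src/checkpoints/stage4-final
```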


πŸ“‹ Version Changelog

  • AramT5 Baseline (May 20, 2026): Base t5-small model fine-tuned on 20k records, across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner

  • AramT5 v1 (May 20, 2026): Fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling of not only single words but also sequences with noticeable complexity

  • AramT5 v2 (May 20, 2026): Fine-tuned on 60k records, across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences

  • AramT5 v3 (May 21, 2026): Fine-tuned on 80k records, across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge on multiple single words

  β€’ AramT5 v3.1 (May 22, 2026): Fine-tuned on 120k records, across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3's fine-tuning, this version was trained on more data with a different distribution (and more manual entries) to achieve a more balanced mix of single words and multi-word phrases, culminating in superior transliteration capabilities

  β€’ AramT5 v3.2 (May 23, 2026): Fine-tuned on 120k records, across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals, resulting in the model correcting its understanding of various atomic words and learning a more robust distinction between singular and plural words
