AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration β™°

AramT5 is a fine-tuned version of t5-small, trained to transliterate Syriac text into Latinised Serto (West Syriac) and MadnαΈ₯aya (East Syriac) forms.

⚠️ Current Limitations

  • Occasional under-generation (shorter outputs than expected)
  • Occasional vowel omission or compression
  • Reliability varies on very long, uncommon, or morphologically complex words and sentences

Development information

  • 🚧 Current version: v3.2 (stage 4)
  • ⏳ Upcoming release: v4 (stage 5)

🌍 About the Project

AramT5 is a transformer fine-tuned on Syriac-to-Latin data, covering both Serto and MadnαΈ₯aya. The model performs script conversion, not translation, making it well suited to educational tools and linguistic preservation.

This project:

  • Supports underrepresented languages in AI
  • Offers open access to transliteration tools in the Syriac language
  • Was created with humility, curiosity, and deep care by a Syriac learner and enthusiast

πŸ’» Try it out

Use prefixes to control output dialect:

  • Syriac2WestLatin
  • Syriac2EastLatin

Then use the model directly via Hugging Face πŸ€— Transformers:

from transformers import pipeline

# Load the AramT5 transliteration pipeline
pipe = pipeline("text2text-generation", model="crossroderick/aramt5")

# Prefix the input with the desired output dialect
text = "ά’ά‘ά άŸά˜ά¬ά ܕܐܠܗܐ ܕܐܒάͺά—ά‘."
input_text = f"Syriac2WestLatin: {text}"
output = pipe(input_text, max_length=128)[0]["generated_text"]

print(output)

Example output:

Input:
άά’ά˜ά’ ܕܒܫܑܝܐ

Output (West):
ΚΎabun d-b-Ε‘mayo
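To target the East Syriac output instead, swap the prefix. The prefix handling can also be wrapped in a small convenience helper; `build_input` and its `dialect` argument below are my own names, not part of the model repository:

```python
# Hypothetical convenience helper; not part of the AramT5 repository.
PREFIXES = {
    "west": "Syriac2WestLatin",
    "east": "Syriac2EastLatin",
}


def build_input(text: str, dialect: str = "west") -> str:
    """Prepend the dialect control prefix that AramT5 expects."""
    if dialect not in PREFIXES:
        raise ValueError(f"Unknown dialect: {dialect!r}")
    return f"{PREFIXES[dialect]}: {text}"


# e.g. build_input("άά’ά˜ά’ ܕܒܫܑܝܐ", "east") produces an East-prefixed input
```

The returned string can then be passed to the pipeline exactly as in the example above.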

πŸ™ Acknowledgements

AramT5 is an independent project, but it builds on four important datasets:

  • The Syriac translation of the Bible (Peshitta), obtained from OPUS' Bible dataset
  • Syriac texts from the Syriac Digital Corpus, containing writings from celebrated authors such as Isaac of Nineveh, Saint Ephrem the Syrian, and Aphrahat
  • Beth Mardutho's Syriac Electronic Data Research Archive (SEDRA), a comprehensive online linguistic and literary database for the Syriac language
  • The Wikipedia dump of articles in the Aramaic (Syriac) language, obtained via the wikiextractor Python package

πŸ€– Fine-tuning instructions

Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune AramT5 yourself, follow these initial steps:

  1. Run the get_data.sh shell script file in the "src/data" folder

Observation: if you're on Windows, the get_data.sh script likely won't work; you can still obtain the data by following the links in the file and performing the steps manually. Likewise, generate_clean_corpus.sh will error out, so you'll need an equivalent way on Windows to filter out blank or empty lines in the syriac_east_corpus.jsonl and syriac_west_corpus.jsonl files and to shuffle them. Additionally, be sure to install the wikiextractor and sentencepiece packages beforehand (the exact versions can be found in requirements.txt).
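On Windows, the filtering and shuffling that generate_clean_corpus.sh performs can be approximated with a few lines of Python. This is a sketch under the assumption that the corpora are plain line-delimited JSONL files; `clean_and_shuffle` is my own helper name:

```python
import random


def clean_and_shuffle(in_path: str, out_path: str, seed: int = 42) -> None:
    """Drop blank/whitespace-only lines from a JSONL corpus, then shuffle it."""
    with open(in_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    # Fixed seed so the shuffle is reproducible across runs
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)


# Example usage (file names follow the shell script above):
# clean_and_shuffle("syriac_west_corpus.jsonl", "syriac_west_corpus.jsonl")
```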

  2. Run the generate_syr_lat_pairs.py file in the same folder
  3. Run generate_clean_corpus.sh to clean the West and East Syriac corpora files and shuffle the datasets
  4. Run train_tokeniser.py to train the tokeniser on the cleaned corpora

The model training process follows a curriculum learning format and comprises six stages:

| Stage | Samples | Max. sentence len. | Mixes shorter sen. | Objective |
|-------|---------|--------------------|--------------------|-----------|
| 1 | 20000 | 15 | No | Expose the base T5 model to Syriac morphology |
| 2 | 40000 | 30 | Yes | Introduce short sentences to AramT5 |
| 3 | 60000 | 50 | Yes | Introduce medium sentences to AramT5 |
| 4 | 120000 | 70 | Yes | Introduce longer sentences to AramT5 |
| 5 | 150000 | 100 | Yes | Reinforce longer sentences to AramT5 |
| 6 | 180000 | 150 | Yes | Introduce the full practical corpus to AramT5 |
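The stage table above can be encoded as a simple lookup, e.g. for driving data sampling in a training script. The dict layout below is illustrative, not the repository's actual configuration format:

```python
# Curriculum stages as described in the table above (illustrative layout).
CURRICULUM = {
    1: {"samples": 20_000,  "max_len": 15,  "mix_shorter": False},
    2: {"samples": 40_000,  "max_len": 30,  "mix_shorter": True},
    3: {"samples": 60_000,  "max_len": 50,  "mix_shorter": True},
    4: {"samples": 120_000, "max_len": 70,  "mix_shorter": True},
    5: {"samples": 150_000, "max_len": 100, "mix_shorter": True},
    6: {"samples": 180_000, "max_len": 150, "mix_shorter": True},
}


def stage_config(stage: int) -> dict:
    """Return the sampling configuration for a curriculum stage."""
    if stage not in CURRICULUM:
        raise ValueError(f"Stage must be between 1 and 6, got {stage}")
    return CURRICULUM[stage]
```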

To run stage 1 training, run the script directly from your IDE or use the following command:

uv run python src/train_t5.py --stage 1

For stages 2 to 6, use the following command instead:

uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name

* Remember to replace the '2' in the command with '3' for stage 3, and so on.

Observation: model files are saved in the src/checkpoints/stage{n}-final folder, where n corresponds to the stage used during fine-tuning.
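Given that layout, the checkpoint directory for a stage can be derived programmatically. A minimal sketch, assuming the path pattern from the observation above; `checkpoint_dir` is my own helper name:

```python
from pathlib import Path


def checkpoint_dir(stage: int, root: str = "src/checkpoints") -> Path:
    """Build the final checkpoint directory path for a given training stage."""
    return Path(root) / f"stage{stage}-final"


print(checkpoint_dir(4).as_posix())  # src/checkpoints/stage4-final
```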


πŸ“‹ Version Changelog

  • AramT5 Baseline (May 20, 2026): Base t5-small model fine-tuned on 20k records, across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner

  • AramT5 v1 (May 20, 2026): Fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling of not only single words but also sequences with noticeable complexity

  • AramT5 v2 (May 20, 2026): Fine-tuned on 60k records, across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences

  • AramT5 v3 (May 21, 2026): Fine-tuned on 80k records, across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge on multiple single words

  β€’ AramT5 v3.1 (May 22, 2026): Fine-tuned on 120k records, across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3's fine-tuning, this version was trained on more data with a different distribution (and more manual entries) to achieve a more balanced mix of single words and multi-word phrases, culminating in superior transliteration capabilities

  β€’ AramT5 v3.2 (May 23, 2026): Fine-tuned on 120k records, across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals, resulting in the model correcting its understanding of various atomic words and learning a more robust distinction between singular and plural words
