---
base_model: t5-small
license: mit
language: syc
tags:
- text2text-generation
- transliteration
- syriac
- low-resource
- cultural-nlp
- t5
pipeline_tag: text2text-generation
model-index:
- name: AramT5
  results:
  - task:
      name: Transliteration
      type: text2text-generation
    dataset:
      name: Syriac Transliteration Corpus
      type: custom
    metrics:
    - name: Training Loss
      type: loss
      value: 1.9013
    - name: Evaluation Loss
      type: loss
      value: 2.0293
    - name: CER
      type: cer
      value: 0.1602
    - name: Exact Match
      type: accuracy
      value: 0.6217
---

# AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰

**AramT5** is a fine-tuned version of `t5-small`, trained to **transliterate Syriac text** into latinised **Serto** (West Syriac) and **Madnḥaya** (East Syriac).

### ⚠️ Current Limitations

- Occasional under-generation (shorter outputs than expected)
- Occasional vowel omission or compression
- Reliability varies on very long, uncommon, or morphologically complex words and sentences

> Development information
> - 🚧 **Current version:** v3.2 (stage 4)
> - ⏳ **Upcoming release:** v4 (stage 5)

---

## 🌍 About the Project

**AramT5** is a transformer fine-tuned on Syriac-to-Latin data, with a focus on Serto and Madnḥaya. The model performs script conversion, not translation, making it well suited to educational tools and linguistic preservation.

This project:

- Supports **underrepresented languages** in AI
- Offers **open access** to transliteration tools for the Syriac language
- Was created with humility, curiosity, and deep care by a Syriac learner and enthusiast

---

## 💻 Try it out

Use prefixes to control the output dialect:

- `Syriac2WestLatin`
- `Syriac2EastLatin`

Then, use the model directly via Hugging Face 🤗 Transformers:

```python
from transformers import pipeline

pipe = pipeline("text2text-generation", model = "crossroderick/aramt5")

text = "ܒܡܠܟܘܬܐ ܕܐܠܗܐ ܕܐܒܪܗܡ."
input_text = f"Syriac2WestLatin: {text}"
output = pipe(input_text, max_length = 128)[0]["generated_text"]

print(output)
```

Example output:

```
Input: ܐܒܘܢ ܕܒܫܡܝܐ
Output (West): ʾabun d-b-šmayo
```

---

## 🙏 Acknowledgements

Despite being an independent project, AramT5 makes use of four very important datasets:

- The Syriac translation of the Bible (Peshitta), obtained from [OPUS' Bible dataset](https://opus.nlpl.eu/datasets/bible-uedin?pair=en&syr)
- Syriac texts from the [Syriac Digital Corpus](https://syriaccorpus.org/index.html), containing writings from celebrated authors such as Isaac of Nineveh, Saint Ephrem the Syrian, and Aphrahat
- Beth Mardutho's [Syriac Electronic Data Research Archive (SEDRA)](https://sedra.bethmardutho.org), a comprehensive online linguistic and literary database for the Syriac language
- The Wikipedia dump of articles in the Aramaic (Syriac) language, obtained via the `wikiextractor` Python package

---

## 🤖 Fine-tuning instructions

Given their total size, the datasets haven't been included in this model's repository. However, should you wish to fine-tune AramT5 yourself, please follow these initial steps:

1. Run the `get_data.sh` shell script in the `src/data` folder

> **Observation:** if you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and performing the steps manually. Likewise, `generate_clean_corpus.sh` will also error out, requiring an equivalent Windows tool to filter out blank or empty lines in the `syriac_east_corpus.jsonl` and `syriac_west_corpus.jsonl` files, as well as to shuffle them. Additionally, be sure to install the `wikiextractor` and `sentencepiece` packages beforehand (the exact versions can be found in the `requirements.txt` file).

2. Run the `generate_syr_lat_pairs.py` file in the same folder
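> **Note:** the blank-line filtering and shuffling performed by `generate_clean_corpus.sh` can also be approximated cross-platform with a short Python sketch. The helper below is hypothetical and not part of the repository; it is offered as one possible Windows-friendly substitute.

```python
# Hypothetical cross-platform substitute for the blank-line filtering
# and shuffling done by generate_clean_corpus.sh (this helper is a
# sketch and is not part of the AramT5 repository).
import random


def clean_and_shuffle(lines, seed=42):
    """Drop blank/whitespace-only lines, then shuffle deterministically."""
    records = [line for line in lines if line.strip()]
    random.Random(seed).shuffle(records)
    return records


# Example: clean and rewrite one of the corpus files in place
# with open("syriac_west_corpus.jsonl", encoding="utf-8") as f:
#     cleaned = clean_and_shuffle(f.readlines())
# with open("syriac_west_corpus.jsonl", "w", encoding="utf-8") as f:
#     f.writelines(cleaned)
```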
3. Run `generate_clean_corpus.sh` to clean the West and East Syriac corpora files and shuffle the datasets
4. Run `train_tokeniser.py` to train the tokeniser on the cleaned corpora

The model training process follows a curriculum learning format and comprises 6 stages:

| Stage | Samples | Max. sentence len. | Mixes shorter sen. | Objective |
|-------|---------|--------------------|--------------------|-----------|
| 1 | 20000 | 15 | No | Expose the base T5 model to Syriac morphology |
| 2 | 40000 | 30 | Yes | Introduce short sentences to AramT5 |
| 3 | 60000 | 50 | Yes | Introduce medium sentences to AramT5 |
| 4 | 120000 | 70 | Yes | Introduce longer sentences to AramT5 |
| 5 | 150000 | 100 | Yes | Reinforce longer sentences to AramT5 |
| 6 | 180000 | 150 | Yes | Introduce the full practical corpus to AramT5 |

To run stage 1 training, launch the script directly from your IDE or use the following command:

```shell
uv run python src/train_t5.py --stage 1
```

For stages 2 to 6, use the following command instead:

```shell
uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
```

\* *Remember to replace the '2' in the command with '3' for stage 3, etc.*

> **Observation:** model files are saved in the `src/checkpoints/stage{n}-final` folder, where `n` corresponds to the stage used in model fine-tuning

---

## 📋 Version Changelog

* **AramT5 Baseline (May 20, 2026):** base `t5-small` model fine-tuned on 20k records across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner
* **AramT5 v1 (May 20, 2026):** fine-tuned on 40k records across 20 epochs, leveraging the stage 2 configuration.
A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling, not only of single words but also of sequences with noticeable complexity
* **AramT5 v2 (May 20, 2026):** fine-tuned on 60k records across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences
* **AramT5 v3 (May 21, 2026):** fine-tuned on 80k records across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge of multiple single words
* **AramT5 v3.1 (May 22, 2026):** fine-tuned on 120k records across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3's fine-tuning, this version was trained on more data with a different distribution (and more manual entries) to achieve a more balanced mix of single words and multi-word phrases, culminating in a version with superior transliteration capabilities
* **AramT5 v3.2 (May 23, 2026):** fine-tuned on 120k records across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals, resulting in the model correcting its understanding of various atomic words and learning a more comprehensive distinction between singular and plural words
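For reference, the CER figure reported in the metrics above is the character error rate: the Levenshtein edit distance between hypothesis and reference transliterations, divided by the reference length. A minimal sketch (a hypothetical helper, not the project's actual evaluation code):

```python
# Minimal character error rate (CER) sketch: Levenshtein distance
# between reference and hypothesis strings, normalised by reference
# length. Hypothetical helper, not the project's own evaluation code.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance for the previous reference prefix
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)


cer("ʾabun", "ʾabun")  # → 0.0 (exact match)
```

The exact-match metric reported alongside it is simply the fraction of predictions identical to their reference transliterations.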