crossroderick committed
Commit d43946a · 1 Parent(s): 2237d8d

v3.2 update

Files changed (4)
  1. README.md +15 -13
  2. model.safetensors +1 -1
  3. src/train_t5.py +3 -3
  4. tokenizer.json +2 -16
README.md CHANGED
@@ -22,16 +22,16 @@ model-index:
   metrics:
   - name: Training Loss
     type: loss
-    value: 2.0736
+    value: 1.9013
   - name: Evaluation Loss
     type: loss
-    value: 2.1371
+    value: 2.0293
   - name: CER
     type: cer
-    value: 0.1910
+    value: 0.1602
   - name: Exact Match
     type: accuracy
-    value: 0.5820
+    value: 0.6217
 ---
 # AramT5 - T5 Fine-Tuned on Syriac-to-Latin Transliteration ♰
 
@@ -39,13 +39,13 @@ model-index:
 
 ⚠️ Current Limitations
 
-- Tends to under-generate (shorter outputs than expected)
+- Occasional under-generation (shorter outputs than expected)
 - Occasional vowel omission or compression
-- Less stable on very long or morphologically complex words
+- Reliability varies on very long, uncommon, or morphologically complex words and sentences
 
 > Development information
-> - 🚧 **Current version:** v3.1 (stage 4)
-> - ⏳ **Upcoming release:** v3.2 (stage 4)
+> - 🚧 **Current version:** v3.2 (stage 4)
+> - ⏳ **Upcoming release:** v4 (stage 5)
 
 ---
 
@@ -149,12 +149,14 @@ uv run python src/train_t5.py --stage 2 --hf-model your-username/model-name
 
 ## 📋 Version Changelog
 
-* **AramT5 Baseline (May 20, 2026):** T5 fine-tuned on 20k records, across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner
+* **AramT5 Baseline (May 20, 2026):** Base `t5-small` model fine-tuned on 20k records, across 30 epochs, leveraging the stage 1 configuration. Baseline version with a surprisingly good initial understanding of how to transliterate properly, shown to capture some roots and Syriac morphology in a limited manner
 
-* **AramT5 v1 (May 20, 2026):** AramT5 fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling of not only single words but also sequences with noticeable complexity
+* **AramT5 v1 (May 20, 2026):** Fine-tuned on 40k records, across 20 epochs, leveraging the stage 2 configuration. A massive upgrade compared to the baseline version, v1 showcased significantly improved morphological handling of not only single words but also sequences with noticeable complexity
 
-* **AramT5 v2 (May 20, 2026):** AramT5 fine-tuned on 60k records, across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences
+* **AramT5 v2 (May 20, 2026):** Fine-tuned on 60k records, across 20 epochs, leveraging the stage 3 configuration. Making use of additional augmented data for atomic tokens, this version proved much more reliable at handling single-word input while exhibiting improvements in transliterating longer Syriac sentences
 
-* **AramT5 v3 (May 21, 2026):** AramT5 fine-tuned on 80k records, across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge on multiple single words
+* **AramT5 v3 (May 21, 2026):** Fine-tuned on 80k records, across 20 epochs, leveraging the stage 4 configuration. This version showcased even stronger transliteration capabilities for longer sentences, while retaining existing knowledge on multiple single words
 
-* **AramT5 v3.1 (May 22, 2026):** AramT5 fine-tuned on 120k records, across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3, this version was trained on more data with a different distribution (and more manual entries) for a more balanced mix of single words and multi-word phrases, culminating in a version with superior transliteration capabilities
+* **AramT5 v3.1 (May 22, 2026):** Fine-tuned on 120k records, across 20 epochs, leveraging the stage 4 configuration. Essentially a re-run of v3, this version was trained on more data with a different distribution (and more manual entries) for a more balanced mix of single words and multi-word phrases, culminating in a version with superior transliteration capabilities
+
+* **AramT5 v3.2 (May 23, 2026):** Fine-tuned on 120k records, across 10 epochs, leveraging the stage 4 configuration. A refinement of v3.1, this version leveraged corrected word forms, a more comprehensive manual vocabulary, and the addition of fully-vocalised and seyame-based plurals, correcting the model's understanding of various atomic words and teaching it a clearer distinction between singular and plural forms
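The updated model card reports a CER of 0.1602. CER (character error rate) is the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard computation; the function name and sample strings are illustrative, not taken from this repo:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Standard dynamic-programming edit distance over characters
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0

print(cer("shlomo", "shlom"))  # one deletion over six characters ≈ 0.167
```

A reported CER of 0.1602 thus means roughly one character edit per six reference characters, averaged over the evaluation set.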
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8f4de64d41b020242c87f856530310822e7299eb1fab2a266d399e1e08c3228b
+oid sha256:b0dc22e4241c40835061a8408785a0b93cf5b139725732e806e28115dda2b26d
 size 209216552
src/train_t5.py CHANGED
@@ -188,7 +188,7 @@ STAGE_CONFIGS = {
         "short_threshold": 50,    # ≤50 chars (Stage 1+2+3)
         "new_range_ratio": 0.45,  # 45% from new range (51-70 chars)
         "new_range_min": 51,
-        "num_epochs": 20,
+        "num_epochs": 10,
         "learning_rate": 8e-5,    # Higher LR to unlearn early-stopping bias from imbalanced data
     },
     5: {
@@ -199,7 +199,7 @@ STAGE_CONFIGS = {
         "short_threshold": 70,    # ≤70 chars (Stage 1+2+3+4)
         "new_range_ratio": 0.45,  # 45% from new range (71-100 chars)
         "new_range_min": 71,
-        "num_epochs": 20,
+        "num_epochs": 10,
         "learning_rate": 5e-5,    # Slightly higher to reinforce multi-word patterns
         "repetition_penalty": 1.2,
     },
@@ -211,7 +211,7 @@ STAGE_CONFIGS = {
         "short_threshold": 100,   # ≤100 chars (Stage 1+2+3+4+5)
         "new_range_ratio": 0.40,  # 40% from new range (101-150 chars)
         "new_range_min": 101,
-        "num_epochs": 15,
+        "num_epochs": 10,
         "learning_rate": 4e-5,    # Fine-tuning polish
         "repetition_penalty": 1.2,
     },
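The diff above lowers `num_epochs` to 10 for stages 4-6 while keeping each stage's length-based data mix (`short_threshold`, `new_range_ratio`, `new_range_min`). A hypothetical sketch of how such a config entry could drive sampling; the `sample_stage` helper and the record list are illustrative, not this repo's actual training code:

```python
import random

# Keys mirror the stage 4 entry in the diff; the sampler itself is illustrative.
STAGE_4 = {
    "short_threshold": 50,    # ≤50 chars, carried over from stages 1-3
    "new_range_ratio": 0.45,  # 45% of each draw from the new 51-70 char range
    "new_range_min": 51,
    "num_epochs": 10,
    "learning_rate": 8e-5,
}

def sample_stage(records: list[str], cfg: dict, k: int, seed: int = 0) -> list[str]:
    """Draw k records: cfg['new_range_ratio'] of them from the new length
    range, the rest from the shorter ranges seen in earlier stages."""
    rng = random.Random(seed)
    short = [r for r in records if len(r) <= cfg["short_threshold"]]
    new = [r for r in records if len(r) >= cfg["new_range_min"]]
    n_new = round(k * cfg["new_range_ratio"])
    return rng.sample(new, n_new) + rng.sample(short, k - n_new)
```

Under this reading, each stage re-exposes the model to earlier, shorter examples while dedicating a fixed fraction of training data to the newly unlocked length range.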
tokenizer.json CHANGED
@@ -1,21 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 128,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": {
-      "Fixed": 128
-    },
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<pad>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,
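This diff removes the serialized truncation (direction Right, max_length 128) and padding (Fixed 128, pad_id 0, "<pad>") blocks, so the tokenizer no longer forces a fixed 128-token shape on load; callers that relied on it must now request it at encode time. A plain-Python illustration of the behaviour the removed blocks encoded (the helper name is ours, not the repo's):

```python
def pad_and_truncate(ids: list[int], length: int = 128, pad_id: int = 0) -> list[int]:
    """Reproduce the removed serialized settings at call time:
    right-truncate to `length`, then right-pad with pad_id (0 = <pad>)."""
    ids = ids[:length]                           # "truncation": direction Right, max_length
    return ids + [pad_id] * (length - len(ids))  # "padding": Fixed length
```

With Hugging Face tokenizers the same effect is available per call via `enable_truncation()`/`enable_padding()`, or via the `padding`/`truncation` arguments when invoking a transformers tokenizer on a batch.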