diyclassics committed · Commit 44eeba9 · 1 Parent(s): 9208ad6

v3.9.2: align release (token_fix + enclitic_splitter); add CHANGELOG


Config/packaging alignment across the LatinCy family; model weights unchanged.

CHANGELOG.md ADDED
@@ -0,0 +1,19 @@
+ # Changelog
+
+ All notable changes to **la_core_web_trf** (the spaCy pipeline model) are documented here.
+
+ Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); LatinCy uses a `v3.{model-generation}.{patch}` scheme.
+
+ ## [3.9.2] - 2026-05-31
+
+ ### Changed
+ - Enclitic splitting of *-que* is now handled by a dedicated `enclitic_splitter` pipeline component, decoupled from the tokenizer.
+ - Sentence segmentation upgraded to upstream `la_senter` v3.9.2 (passed through to this pipeline), which adds the `token_fix` component — keeping sentences intact across parentheticals, dash asides, and closing quotes.
+
+ ### Notes
+ - Alignment release across the LatinCy family (`sm`/`md`/`lg`/`trf`). Model weights are unchanged from v3.9.x; no retraining.
+
+ ## [3.9.0] - 2026-03-25
+
+ ### Added
+ - Initial v3.9 public release: tagging, morphology, lemmatization, dependency parsing, NER, and sentence segmentation, trained on harmonized Universal Dependencies treebanks with LASLA data.
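The 3.9.2 changelog entry above moves *-que* splitting out of the tokenizer and into a separate `enclitic_splitter` step. The commit does not show that component's code, so the following is only a rough sketch of the idea in plain Python. The function names and the exception list are hypothetical and are not taken from LatinCy.

```python
# Hypothetical illustration only; this is NOT the LatinCy implementation.
# Assumed, incomplete list of words that end in "que" without carrying
# the enclitic (e.g. "atque" is a single word, not "at" + "que").
QUE_EXCEPTIONS = {
    "atque", "neque", "quoque", "quisque", "usque",
    "itaque", "denique", "undique",
}


def split_que(token: str) -> list[str]:
    """Split a trailing enclitic -que off one token, unless it is a listed exception."""
    lower = token.lower()
    if lower.endswith("que") and len(lower) > 3 and lower not in QUE_EXCEPTIONS:
        return [token[:-3], token[-3:]]
    return [token]


def split_enclitics(tokens: list[str]) -> list[str]:
    """Apply split_que to every token of an already tokenized text."""
    out: list[str] = []
    for tok in tokens:
        out.extend(split_que(tok))
    return out


print(split_enclitics(["armaque", "atque", "cano"]))
```

Running the splitter as its own pass over tokens, rather than inside tokenization, is what makes it possible to configure or disable it independently, which appears to be the point of the change.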
README.md CHANGED
@@ -80,14 +80,25 @@ model-index:
  | Feature | Description |
  | --- | --- |
  | **Name** | `la_core_web_trf` |
- | **Version** | `3.8.0` |
+ | **Version** | `3.9.2` |
  | **spaCy** | `>=3.8.3,<3.9.0` |
- | **Default Pipeline** | `senter`, `transformer`, `normer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `parser`, `lookup_lemmatizer`, `ner`, `trf_vectors` |
- | **Components** | `senter`, `transformer`, `normer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `parser`, `lookup_lemmatizer`, `ner`, `trf_vectors` |
+ | **Default Pipeline** | `enclitic_splitter`, `transformer`, `senter`, `token_fix`, `normer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `uv_normalizer`, `parser`, `harmonizer`, `remorpher`, `ner`, `trf_vectors` |
+ | **Components** | `enclitic_splitter`, `transformer`, `senter`, `token_fix`, `normer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `uv_normalizer`, `parser`, `harmonizer`, `remorpher`, `ner`, `trf_vectors` |
  | **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
  | **Sources** | [CIRCSE/LASLA: LASLA Corpus](https://github.com/CIRCSE/LASLA/tree/v1.0.1)<br>[LatinCy NER](https://github.com/latincy/latincy-ner)<br>[UD_Latin-CIRCSE](https://github.com/UniversalDependencies/UD_Latin-CIRCSE)<br>[UD_Latin-ITTB](https://github.com/UniversalDependencies/UD_Latin-ITTB)<br>[UD_Latin-LLCT](https://github.com/UniversalDependencies/UD_Latin-LLCT)<br>[UD_Latin-Perseus](https://github.com/UniversalDependencies/UD_Latin-Perseus)<br>[UD_Latin-PROIEL](https://github.com/UniversalDependencies/UD_Latin-PROIEL)<br>[UD_Latin-UDante](https://github.com/UniversalDependencies/UD_Latin-UDante) |
  | **License** | `MIT` |
- | **Author** | [Patrick J. Burns; with Nora Bernhardt [ner], Tim Geelhaar [tagger, morphologizer, parser, ner], Vincent Koch [ner]](https://diyclassics.github.io/) |
+ | **Author** | [Patrick J. Burns](https://diyclassics.github.io/) |
+ | **Contributors** | Tim Geelhaar (annotation, error analysis [v3.5.2: morphologizer, tagger, parser]); Nora Bernhardt (NER); Vincent Koch (NER) |
+
+ ## What's new in 3.9.2
+
+ This is an alignment release across the LatinCy pipeline family (`la_core_web_sm`/`md`/`lg`/`trf`). Model weights are **unchanged** from v3.9.x — no retraining was performed; the changes are to pipeline configuration and packaging.
+
+ - **Enclitic splitting** is now handled by a dedicated `enclitic_splitter` component, decoupled from the tokenizer. Splitting of the enclitic *-que* (e.g. *arma**que*** → *arma* + *que*) is an explicit, configurable step in the pipeline rather than tokenizer behavior.
+ - **Sentence segmentation upgraded to [`la_senter`](https://huggingface.co/latincy/la_senter) v3.9.2.** The upstream LatinCy sentence-segmentation model has been updated to v3.9.2 and is passed through to this pipeline. It adds the `token_fix` component, which runs after `senter` to repair sentence boundaries the statistical model mis-splits — keeping sentences intact across parentheticals, dash asides, and closing quotes.
+
+ See [CHANGELOG.md](CHANGELOG.md) for full version history.
+
 
  ### Label Scheme
 
@@ -126,4 +137,4 @@ model-index:
  | `TAGGER_LOSS` | 60260.49 |
  | `MORPHOLOGIZER_LOSS` | 447952.32 |
  | `TRAINABLE_LEMMATIZER_LOSS` | 383152.85 |
- | `PARSER_LOSS` | 3276429.69 |
+ | `PARSER_LOSS` | 3276429.69 |
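The README text in this hunk says `token_fix` runs after `senter` and repairs boundaries that were split inside parentheticals or before closing quotes. Its actual rules are not part of this commit, so the sketch below only illustrates the general post-processing idea on plain strings. The function name and both heuristics are assumptions, and dash asides are not handled here.

```python
# Hypothetical sketch; the real `token_fix` component works inside a spaCy
# pipeline and its rules are not shown in this commit.


def repair_boundaries(sentences: list[str]) -> list[str]:
    """Merge fragments that a segmenter split inside (...) or before a closing quote."""
    repaired: list[str] = []
    for sent in sentences:
        if repaired:
            prev = repaired[-1]
            stripped = sent.lstrip()
            # The previous sentence still has an unclosed "(".
            inside_parens = prev.count("(") > prev.count(")")
            # A straight quote counts as closing only if the previous
            # sentence left one open (odd number of straight quotes).
            closes_quote = stripped.startswith("”") or (
                stripped.startswith('"') and prev.count('"') % 2 == 1
            )
            if closes_quote:
                repaired[-1] = prev + stripped
                continue
            if inside_parens:
                repaired[-1] = prev + " " + stripped
                continue
        repaired.append(sent)
    return repaired


print(repair_boundaries(["Caesar (ut dicitur.", "Sed quis scit?) venit.", "Vale."]))
```

Keeping the repair as a rule-based pass after the statistical segmenter means the model weights can stay fixed while boundary behavior changes, which matches the "no retraining" note in the README.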
la_core_web_trf-3.9.1-py3-none-any.whl → la_core_web_trf-3.9.2-py3-none-any.whl RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ed16f051333bf4115bb29a59a2b36169c094374484129cd01b62f172a6ba78cd
- size 1688554184
+ oid sha256:1f0baf9c9b733cc7b0c2d160e747164863fabe74187ef038bf1e325cbd7c5512
+ size 1688570986
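The wheel itself is stored in Git LFS, so the diff above only shows the pointer file changing: a new `oid` (SHA-256 of the file) and a new `size` in bytes. As a minimal sketch, a pointer can be parsed and a downloaded file checked against it as follows. The helper names are hypothetical, and the local path passed to `matches_pointer` is whatever file you have downloaded.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: str, pointer: dict[str, str]) -> bool:
    """Check a local file against the pointer's size and digest."""
    algo, _, expected = pointer["oid"].partition(":")
    data_path = Path(path)
    if data_path.stat().st_size != int(pointer["size"]):
        return False
    digest = hashlib.new(algo)
    with data_path.open("rb") as fh:
        # Read in 1 MiB chunks so a multi-gigabyte wheel is not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


old = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ed16f051333bf4115bb29a59a2b36169c094374484129cd01b62f172a6ba78cd\n"
    "size 1688554184\n"
)
new = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:1f0baf9c9b733cc7b0c2d160e747164863fabe74187ef038bf1e325cbd7c5512\n"
    "size 1688570986\n"
)
size_delta = int(new["size"]) - int(old["size"])
print(size_delta)  # → 16802
```

The size difference between the two pointers is 16,802 bytes on a file of roughly 1.69 GB.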