---
license: cc-by-4.0
task_categories:
- text2text-generation
language:
- la
size_categories:
- 1M<n<10M
tags:
- medieval
- editing
- normalization
- Georges
pretty_name: Normalized Georges 1913 Model
version: 1.0.0
---
# Normalization Model for Medieval Latin

## **Overview**
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms.

The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing.

## **Model Architecture**
The model is a sequence-to-sequence (Seq2Seq) architecture with attention. Key components include:

1. **Embedding Layer**:
   - Converts character indices into dense vector representations.

2. **Bidirectional LSTM Encoder**:
   - Encodes the input sequence and captures bidirectional context.

3. **Attention Mechanism**:
   - Aligns decoder outputs with relevant encoder outputs for better context-awareness.

4. **LSTM Decoder**:
   - Decodes the normalized sequence character-by-character.

5. **Projection Layer**:
   - Maps decoder outputs to character probabilities.

### Model Parameters
- **Embedding Dimension**: 64
- **Hidden Dimension**: 128
- **Number of Layers**: 3
- **Dropout**: 0.3

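For illustration, here is a minimal PyTorch sketch of the architecture described above, using the listed hyperparameters. The class name, the attention variant (a general/bilinear score), and the padding index are assumptions made for this sketch; the project's actual implementation lives in `train_model.py` on GitHub.

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Sketch: embedding -> BiLSTM encoder -> attention -> LSTM decoder -> projection."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_layers=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # assumed pad index 0
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        # The decoder consumes one embedded character plus the attention context vector.
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, n_layers,
                               dropout=dropout, batch_first=True)
        self.attn_score = nn.Linear(hid_dim, 2 * hid_dim)  # general (bilinear) attention score
        self.proj = nn.Linear(hid_dim, vocab_size)         # projection to character logits

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))          # (B, S, 2H)
        context = enc_out.new_zeros(src.size(0), 1, enc_out.size(-1))
        hidden, logits = None, []
        for t in range(tgt.size(1)):                            # decode character by character
            emb = self.embedding(tgt[:, t:t + 1])               # (B, 1, E)
            out, hidden = self.decoder(torch.cat([emb, context], dim=-1), hidden)
            scores = torch.bmm(self.attn_score(out), enc_out.transpose(1, 2))  # (B, 1, S)
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)        # (B, 1, 2H)
            logits.append(self.proj(out))
        return torch.cat(logits, dim=1)                         # (B, T, vocab_size)
```
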
## **Dataset**
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization).

### Sample Data
| Orthographic Variant | Normalized Form |
|----------------------|-----------------|
| `circumcalcabicis` | `circumcalcabitis` |
| `peruincaturi` | `pervincaturi` |
| `tepidaremtur` | `tepidarentur` |
| `exmovemdis` | `exmovendis` |
| `comvomavisset` | `convomavisset` |
| `permeiemdis` | `permeiendis` |
| `permeditacissime` | `permeditatissime` |
| `conspersu` | `conspersu` |
| `pręviridancissimę` | `praeviridantissimae` |
| `relaxavisses` | `relaxavisses` |
| `edentaveratis` | `edentaveratis` |
| `amhelioris` | `anhelioris` |
| `remediatae` | `remediatae` |
| `discruciavero` | `discruciavero` |
| `imterplicavimus` | `interplicavimus` |
| `peraequata` | `peraequata` |
| `ignicomantissimorum` | `ignicomantissimorum` |
| `pręfvltvro` | `praefulturo` |

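Since the data is plain tab-separated text, loading the pairs is straightforward. A minimal sketch, assuming a local copy of the dataset in a file named `georges-1913-normalization.tsv` (the file name is illustrative):

```python
# Read (variant, normalized) pairs from a tab-separated file.
pairs = []
with open("georges-1913-normalization.tsv", encoding="utf-8") as f:
    for line in f:
        variant, normalized = line.rstrip("\n").split("\t")
        pairs.append((variant, normalized))

print(pairs[0])  # e.g. ('circumcalcabicis', 'circumcalcabitis')
```
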
## **Training**
The model is trained with the following parameters:

- **Loss**: CrossEntropyLoss (ignores the padding index).
- **Optimizer**: Adam with a learning rate of 0.0005.
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate when the validation loss stagnates.
- **Gradient Clipping**: Max norm of 1.0.
- **Batch Size**: 4096.

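A minimal sketch of this training configuration, reusing the `Seq2SeqNormalizer` sketch from above (the padding index, the `<sos>`/`<eos>` target convention, and the vocabulary size are assumptions, not the project's actual script):

```python
import torch
import torch.nn as nn

PAD_IDX = 0                                   # assumed padding index
model = Seq2SeqNormalizer(vocab_size=100)     # vocabulary size is illustrative
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

def train_step(src, tgt):
    """One optimization step on a batch of (variant, normalized) index tensors."""
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])          # teacher forcing: predict the next character
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(validation_loss)
```
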
## **Use Cases**
This model can be used for:

- Applying normalization based on Georges 1913.

## **Known Limitations**
The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms such as Greek loanwords like "presbyter".

## **How to Use**

### **Saved Files**

- `normalization_model.pth`: Trained PyTorch model weights.
- `vocab.pkl`: Vocabulary mapping for the dataset.
- `config.json`: Configuration file with model hyperparameters.

### **Training**
To train the model, run the `train_model.py` script on GitHub.

### **Usage for Inference**

For inference, use the `test_model.py` script on GitHub.

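As a rough orientation, the saved files can be combined for inference along these lines. This is a sketch only: the contents of `vocab.pkl` (assumed here to be a character-to-index dict), the special-token indices, and the `normalize` helper are assumptions; `test_model.py` is authoritative.

```python
import json
import pickle
import torch

with open("config.json") as f:
    config = json.load(f)                     # model hyperparameters
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                    # assumed: {character: index}
idx2char = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(vocab_size=len(vocab))  # sketch class from above
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

SOS = vocab.get("<sos>", 1)                   # assumed special tokens
EOS = vocab.get("<eos>", 2)

@torch.no_grad()
def normalize(word, max_len=40):
    """Greedy character-by-character decoding of one orthographic variant."""
    src = torch.tensor([[vocab[c] for c in word]])
    out = [SOS]
    for _ in range(max_len):
        logits = model(src, torch.tensor([out]))
        nxt = logits[0, -1].argmax().item()
        if nxt == EOS:
            break
        out.append(nxt)
    return "".join(idx2char[i] for i in out[1:])

print(normalize("peruincaturi"))  # expected per the sample table: 'pervincaturi'
```
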
## **Acknowledgments**
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.

Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

## **License**
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en))

## **Citation**
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).