mschonhardt committed · Commit 19c3992 · verified · 1 Parent(s): e080cd5

Update README.md

Files changed (1): README.md (+115 -3)
README.md CHANGED
---
license: cc-by-4.0
task_categories:
- text2text-generation
language:
- la
size_categories:
- 1M<n<10M
tags:
- medieval
- editing
- normalization
- Georges
pretty_name: Normalized Georges 1913 Model
version: 1.0.0
---
# Normalization Model for Medieval Latin

## **Overview**
This repository contains a PyTorch-based sequence-to-sequence model with attention that normalizes orthographic variation in medieval Latin texts. It is trained on the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms.

The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing.

## **Model Architecture**
The model is a sequence-to-sequence (Seq2Seq) architecture with attention; a minimal code sketch follows the parameter list below. Key components include:

1. **Embedding Layer**:
   - Converts character indices into dense vector representations.

2. **Bidirectional LSTM Encoder**:
   - Encodes the input sequence and captures context in both directions.

3. **Attention Mechanism**:
   - Aligns decoder outputs with the relevant encoder outputs for better context awareness.

4. **LSTM Decoder**:
   - Decodes the normalized sequence character by character.

5. **Projection Layer**:
   - Maps decoder outputs to character probabilities.

### Model Parameters
- **Embedding Dimension**: 64
- **Hidden Dimension**: 128
- **Number of Layers**: 3
- **Dropout**: 0.3

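The project's reference implementation is `train_model.py` on GitHub; the sketch below only illustrates how the components and hyperparameters listed above could be wired together in PyTorch. The class name `Seq2SeqNormalizer` and all internal details are assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Character-level Seq2Seq with attention (illustrative, not the project code)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_layers=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional LSTM encoder: 2 * hid_dim features per source position.
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        # The decoder consumes the previous character plus the attention context.
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, n_layers,
                               dropout=dropout, batch_first=True)
        self.attn = nn.Linear(hid_dim, 2 * hid_dim)     # bilinear attention scores
        self.proj = nn.Linear(3 * hid_dim, vocab_size)  # projection layer

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))      # (B, S, 2H)
        context = enc_out.new_zeros(src.size(0), 1, enc_out.size(-1))
        hidden, logits = None, []
        for t in range(tgt.size(1)):                        # teacher forcing
            emb = self.embedding(tgt[:, t:t + 1])           # (B, 1, E)
            dec_in = torch.cat([emb, context], dim=-1)
            dec_out, hidden = self.decoder(dec_in, hidden)  # (B, 1, H)
            # Align this decoder step with the encoder outputs.
            scores = torch.bmm(self.attn(dec_out), enc_out.transpose(1, 2))
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
            logits.append(self.proj(torch.cat([dec_out, context], dim=-1)))
        return torch.cat(logits, dim=1)                     # (B, T, vocab_size)
```

Concatenating the attention context with the decoder state before the projection layer is one common (Luong-style) way to realize the attention mechanism described above.
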
## **Dataset**
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated through systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization).

### Sample Data
| Orthographic Variant | Normalized Form |
|----------------------|-----------------|
| `circumcalcabicis` | `circumcalcabitis` |
| `peruincaturi` | `pervincaturi` |
| `tepidaremtur` | `tepidarentur` |
| `exmovemdis` | `exmovendis` |
| `comvomavisset` | `convomavisset` |
| `permeiemdis` | `permeiendis` |
| `permeditacissime` | `permeditatissime` |
| `conspersu` | `conspersu` |
| `pręviridancissimę` | `praeviridantissimae` |
| `relaxavisses` | `relaxavisses` |
| `edentaveratis` | `edentaveratis` |
| `amhelioris` | `anhelioris` |
| `remediatae` | `remediatae` |
| `discruciavero` | `discruciavero` |
| `imterplicavimus` | `interplicavimus` |
| `peraequata` | `peraequata` |
| `ignicomantissimorum` | `ignicomantissimorum` |
| `pręfvltvro` | `praefulturo` |

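Because the pairs are plain tab-separated text, they are straightforward to load; in this sketch the file name `georges_1913_pairs.tsv` is a placeholder, so check the dataset page for the actual file layout:

```python
import csv

# Minimal sketch of reading the tab-separated pairs; the file name is an
# assumption, so consult the dataset page for the actual layout.
with open("georges_1913_pairs.tsv", newline="", encoding="utf-8") as f:
    pairs = [tuple(row) for row in csv.reader(f, delimiter="\t")]

print(pairs[0])  # e.g. ('circumcalcabicis', 'circumcalcabitis')
```
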
## **Training**
The model is trained using the following parameters; the sketch after this list shows how they fit together:
- **Loss**: CrossEntropyLoss (ignores the padding index).
- **Optimizer**: Adam with a learning rate of 0.0005.
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate when the validation loss stagnates.
- **Gradient Clipping**: max norm of 1.0.
- **Batch Size**: 4096.

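A condensed, hypothetical training step that wires these settings together; `train_loader`, `validate`, and the scheduler's `factor`/`patience` values are assumptions, and the authoritative logic lives in `train_model.py`:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index, ignored by the loss

def train(model, train_loader, validate, num_epochs=10):
    """Hypothetical training loop wiring the documented hyperparameters together."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
    # Reduce the LR when validation loss plateaus (factor/patience are guesses).
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)
    for epoch in range(num_epochs):
        model.train()
        for src, tgt in train_loader:          # DataLoader with batch_size=4096
            optimizer.zero_grad()
            logits = model(src, tgt[:, :-1])   # predict the next character
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        scheduler.step(validate(model))        # validation loss drives the scheduler
```
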
## **Use Cases**
This model can be used for:

- Applying normalization based on Georges 1913 to orthographic variants in medieval Latin texts.

## **Known Limitations**
The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms such as Greek loanwords like "presbyter".

## **How to Use**

### **Saved Files**

- `normalization_model.pth`: trained PyTorch model weights.
- `vocab.pkl`: vocabulary mapping for the dataset.
- `config.json`: configuration file with model hyperparameters.

### **Training**
To train the model, run the `train_model.py` script on GitHub.

### **Usage for Inference**

Use the `test_model.py` script on GitHub. A hypothetical loading-and-decoding sketch follows.

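For orientation, here is a hypothetical loading-and-decoding routine built around the saved files listed above. The vocabulary layout, `config.json` keys, and special tokens are assumptions (`test_model.py` remains authoritative), and `Seq2SeqNormalizer` refers to the sketch in the architecture section:

```python
import json
import pickle
import torch

# Loads the files listed under "Saved Files". The vocabulary layout, config
# keys, and special tokens below are assumptions; test_model.py is authoritative.
with open("config.json") as f:
    config = json.load(f)
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                     # assumed: {character: index}
idx2char = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(                     # sketch class from the architecture section
    vocab_size=len(vocab),
    emb_dim=config.get("emb_dim", 64),
    hid_dim=config.get("hid_dim", 128),
    n_layers=config.get("n_layers", 3),
    dropout=config.get("dropout", 0.3),
)
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

def normalize(word, max_len=40):
    """Greedily decode the normalized form of one word."""
    src = torch.tensor([[vocab[c] for c in word]])
    out_ids = [vocab["<s>"]]                   # assumed start-of-sequence token
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([out_ids]))
            next_id = logits[0, -1].argmax().item()
            if next_id == vocab["</s>"]:       # assumed end-of-sequence token
                break
            out_ids.append(next_id)
    return "".join(idx2char[i] for i in out_ids[1:])

print(normalize("peruincaturi"))               # expected: "pervincaturi"
```
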
## **Acknowledgments**
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.

Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

## **License**
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en))

## **Citation**
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).