mschonhardt committed · Commit 19c3992 · verified · 1 Parent(s): e080cd5

Update README.md

Files changed (1): README.md (+115 -3)
README.md CHANGED
---
license: cc-by-4.0
task_categories:
- text2text-generation
language:
- la
size_categories:
- 1M<n<10M
tags:
- medieval
- editing
- normalization
- Georges
pretty_name: Normalized Georges 1913 Model
version: 1.0.0
---
# Normalization Model for Medieval Latin

## **Overview**
This repository contains a PyTorch-based sequence-to-sequence model with attention that normalizes orthographic variation in medieval Latin texts. It is trained on the [**Normalized Georges 1913 Dataset**](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization/), which provides approximately 5 million word pairs of orthographic variants and their normalized forms.

The model is part of the *Burchard's Dekret Digital* project ([www.burchards-dekret-digital.de](http://www.burchards-dekret-digital.de)) and was developed to support text normalization tasks in historical document processing.

## **Model Architecture**
The model is a sequence-to-sequence (Seq2Seq) architecture with attention; a minimal code sketch follows the parameter list below. Key components include:

1. **Embedding Layer**:
   - Converts character indices into dense vector representations.

2. **Bidirectional LSTM Encoder**:
   - Encodes the input sequence and captures context in both directions.

3. **Attention Mechanism**:
   - Aligns decoder outputs with the relevant encoder outputs for better context awareness.

4. **LSTM Decoder**:
   - Decodes the normalized sequence character by character.

5. **Projection Layer**:
   - Maps decoder outputs to character probabilities.

### Model Parameters
- **Embedding Dimension**: 64
- **Hidden Dimension**: 128
- **Number of Layers**: 3
- **Dropout**: 0.3

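The project's reference implementation is `train_model.py` on GitHub; the sketch below only illustrates how the components and hyperparameters listed above could be wired together in PyTorch. The class name `Seq2SeqNormalizer` and all internal details are assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Character-level Seq2Seq with attention (illustrative, not the project code)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_layers=3, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional LSTM encoder: 2 * hid_dim features per source position.
        self.encoder = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        # The decoder consumes the previous character plus the attention context.
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, n_layers,
                               dropout=dropout, batch_first=True)
        self.attn = nn.Linear(hid_dim, 2 * hid_dim)     # bilinear attention scores
        self.proj = nn.Linear(3 * hid_dim, vocab_size)  # projection layer

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embedding(src))      # (B, S, 2H)
        context = enc_out.new_zeros(src.size(0), 1, enc_out.size(-1))
        hidden, logits = None, []
        for t in range(tgt.size(1)):                        # teacher forcing
            emb = self.embedding(tgt[:, t:t + 1])           # (B, 1, E)
            dec_in = torch.cat([emb, context], dim=-1)
            dec_out, hidden = self.decoder(dec_in, hidden)  # (B, 1, H)
            # Align this decoder step with the encoder outputs.
            scores = torch.bmm(self.attn(dec_out), enc_out.transpose(1, 2))
            context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
            logits.append(self.proj(torch.cat([dec_out, context], dim=-1)))
        return torch.cat(logits, dim=1)                     # (B, T, vocab_size)
```

Concatenating the attention context with the decoder state before the projection layer is one common (Luong-style) way to realize the attention mechanism described above.
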
## **Dataset**
The model is trained on the **Normalized Georges 1913 Dataset**. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated through systematic transformations. For detailed dataset information, refer to the [dataset page](https://huggingface.co/datasets/mschonhardt/georges-1913-normalization).

### Sample Data
| Orthographic Variant | Normalized Form |
|----------------------|-----------------|
| `circumcalcabicis` | `circumcalcabitis` |
| `peruincaturi` | `pervincaturi` |
| `tepidaremtur` | `tepidarentur` |
| `exmovemdis` | `exmovendis` |
| `comvomavisset` | `convomavisset` |
| `permeiemdis` | `permeiendis` |
| `permeditacissime` | `permeditatissime` |
| `conspersu` | `conspersu` |
| `pręviridancissimę` | `praeviridantissimae` |
| `relaxavisses` | `relaxavisses` |
| `edentaveratis` | `edentaveratis` |
| `amhelioris` | `anhelioris` |
| `remediatae` | `remediatae` |
| `discruciavero` | `discruciavero` |
| `imterplicavimus` | `interplicavimus` |
| `peraequata` | `peraequata` |
| `ignicomantissimorum` | `ignicomantissimorum` |
| `pręfvltvro` | `praefulturo` |

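Because the pairs are plain tab-separated text, they are straightforward to load; in this sketch the file name `georges_1913_pairs.tsv` is a placeholder, so check the dataset page for the actual file layout:

```python
import csv

# Minimal sketch of reading the tab-separated pairs; the file name is an
# assumption, so consult the dataset page for the actual layout.
with open("georges_1913_pairs.tsv", newline="", encoding="utf-8") as f:
    pairs = [tuple(row) for row in csv.reader(f, delimiter="\t")]

print(pairs[0])  # e.g. ('circumcalcabicis', 'circumcalcabitis')
```
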
## **Training**
The model is trained using the following parameters; the sketch after this list shows how they fit together:
- **Loss**: CrossEntropyLoss (ignores the padding index).
- **Optimizer**: Adam with a learning rate of 0.0005.
- **Scheduler**: ReduceLROnPlateau, reducing the learning rate when the validation loss stagnates.
- **Gradient Clipping**: max norm of 1.0.
- **Batch Size**: 4096.

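A condensed, hypothetical training step that wires these settings together; `train_loader`, `validate`, and the scheduler's `factor`/`patience` values are assumptions, and the authoritative logic lives in `train_model.py`:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index, ignored by the loss

def train(model, train_loader, validate, num_epochs=10):
    """Hypothetical training loop wiring the documented hyperparameters together."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
    # Reduce the LR when validation loss plateaus (factor/patience are guesses).
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)
    for epoch in range(num_epochs):
        model.train()
        for src, tgt in train_loader:          # DataLoader with batch_size=4096
            optimizer.zero_grad()
            logits = model(src, tgt[:, :-1])   # predict the next character
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        scheduler.step(validate(model))        # validation loss drives the scheduler
```
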
## **Use Cases**
This model can be used for:

- Applying normalization based on Georges 1913 to orthographic variants in medieval Latin texts.

## **Known Limitations**
The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms such as Greek loanwords like "presbyter".

## **How to Use**

### **Saved Files**

- `normalization_model.pth`: trained PyTorch model weights.
- `vocab.pkl`: vocabulary mapping for the dataset.
- `config.json`: configuration file with model hyperparameters.

### **Training**
To train the model, run the `train_model.py` script on GitHub.

### **Usage for Inference**

Use the `test_model.py` script on GitHub. A hypothetical loading-and-decoding sketch follows.

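For orientation, here is a hypothetical loading-and-decoding routine built around the saved files listed above. The vocabulary layout, `config.json` keys, and special tokens are assumptions (`test_model.py` remains authoritative), and `Seq2SeqNormalizer` refers to the sketch in the architecture section:

```python
import json
import pickle
import torch

# Loads the files listed under "Saved Files". The vocabulary layout, config
# keys, and special tokens below are assumptions; test_model.py is authoritative.
with open("config.json") as f:
    config = json.load(f)
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                     # assumed: {character: index}
idx2char = {i: c for c, i in vocab.items()}

model = Seq2SeqNormalizer(                     # sketch class from the architecture section
    vocab_size=len(vocab),
    emb_dim=config.get("emb_dim", 64),
    hid_dim=config.get("hid_dim", 128),
    n_layers=config.get("n_layers", 3),
    dropout=config.get("dropout", 0.3),
)
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()

def normalize(word, max_len=40):
    """Greedily decode the normalized form of one word."""
    src = torch.tensor([[vocab[c] for c in word]])
    out_ids = [vocab["<s>"]]                   # assumed start-of-sequence token
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([out_ids]))
            next_id = logits[0, -1].argmax().item()
            if next_id == vocab["</s>"]:       # assumed end-of-sequence token
                break
            out_ids.append(next_id)
    return "".join(idx2char[i] for i in out_ids[1:])

print(normalize("peruincaturi"))               # expected: "pervincaturi"
```
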
## **Acknowledgments**
The dataset was created by Michael Schonhardt ([https://orcid.org/0000-0002-2750-1900](https://orcid.org/0000-0002-2750-1900)) for the project *Burchard's Dekret Digital*.

Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via [www.zeno.org](http://www.zeno.org/georges-1913) by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.

## **License**
CC BY 4.0 ([https://creativecommons.org/licenses/by/4.0/legalcode.en](https://creativecommons.org/licenses/by/4.0/legalcode.en))

## **Citation**
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, [https://huggingface.co/mschonhardt/georges-1913-normalization-model](https://huggingface.co/mschonhardt/georges-1913-normalization-model), DOI: [10.5281/zenodo.14264956](https://doi.org/10.5281/zenodo.14264956).