Update Readme file
README.md
CHANGED
@@ -1,3 +1,68 @@
---
language:
- zh
tags:
- LinkTransformer
- Office Title Disambiguation/Similarity
- 古代官职
- 古文
- 文言文
- ancient
- classical
license: cc-by-nc-sa-4.0
---

# <font color="IndianRed"> OfficeTitleDis (Classical Chinese Office Title Disambiguation/Similarity)</font>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ql7NkLOGdEf2IaPg_9khGxev3OkZIaXu?usp=sharing)

This model was fine-tuned using the methodology described in the paper ["LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models"](https://scholar.harvard.edu/sites/scholar.harvard.edu/files/dell/files/linkt.pdf) by Abhishek Arora and Melissa Dell of Harvard University.

### <font color="IndianRed">Model Description </font>
This model finds the top \(N\) most similar Classical Chinese office titles in a given DataFrame. Given an input DataFrame containing \(K\) office titles, it returns, for each office title, the \(N\) most similar office titles in that DataFrame.
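
The input/output shape described above (K titles in, N neighbours per title out) can be sketched in plain Python. This is only an illustration under loud assumptions: it replaces the fine-tuned model with a crude character-bigram cosine similarity, and the sample titles and the helper names `char_bigrams`, `cosine`, and `top_n_similar` are hypothetical, not part of OfficeTitleDis or LinkTransformer.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def char_bigrams(title: str) -> Counter:
    """Count overlapping character bigrams (a stand-in for a learned embedding)."""
    padded = f"^{title}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bigram-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_n_similar(titles: List[str], n: int) -> Dict[str, List[str]]:
    """For each of the K titles, return the N most similar other titles."""
    vectors = {t: char_bigrams(t) for t in titles}
    result = {}
    for t in titles:
        others = [o for o in titles if o != t]
        others.sort(key=lambda o: cosine(vectors[t], vectors[o]), reverse=True)
        result[t] = others[:n]
    return result


# Illustrative input only: K = 4 office titles, N = 2 neighbours each.
titles = ["知州", "知府", "知縣", "通判"]
neighbours = top_n_similar(titles, n=2)
for title, similar in neighbours.items():
    print(title, similar)
```

The real model ranks candidates by semantic similarity learned from CBDB data, so its neighbours can differ substantially from this surface-level character overlap.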

### <font color="IndianRed">Fine-tuning Data </font>
The data used to fine-tune this model was provided by the China Biographical Database (CBDB) at Harvard University. All office titles in the training data date from the Song, Ming, and Qing dynasties.

---

### <font color="IndianRed">Usage</font>

The following section demonstrates how to load the OfficeTitleDis model directly.

Make sure the required libraries are installed and the model is downloaded into your Python environment. If not, clone the model repository with Git LFS and install the libraries with pip:

```bash
# Download the model (requires Git LFS)
git lfs install
git clone https://huggingface.co/cbdb/OfficeTitleDis

# Install the Python dependencies
pip install linktransformer
pip install hanziconv
```

Now load the model and make some predictions:

```python
# Import LinkTransformer
import linktransformer as lt

# df1 and df2 must be pandas DataFrames that both contain an "office_name" column.
# The model path assumes the repository was cloned into /content (the Colab working directory).
df_lm_matched = lt.merge(df1, df2, merge_type='1:m', on="office_name", model="/content/OfficeTitleDis/model", left_on=None, right_on=None)

# display() is available in Jupyter/Colab; use print() in a plain Python script
display(df_lm_matched.head())
```
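
The `lt.merge` call above expects `df1` and `df2` to exist already. A minimal sketch of preparing them with pandas follows; the sample titles and the helper `require_column` are hypothetical and only illustrate the requirement that both frames contain the `office_name` column.

```python
import pandas as pd

# Hypothetical example inputs: any two DataFrames work as long as both
# contain the column passed as `on` ("office_name" in the call above).
df1 = pd.DataFrame({"office_name": ["知州", "通判"]})
df2 = pd.DataFrame({"office_name": ["知府", "知縣", "通判"]})


def require_column(df: pd.DataFrame, column: str) -> None:
    """Fail early with a clear message if the merge key is missing."""
    if column not in df.columns:
        raise KeyError(f"DataFrame is missing the required column {column!r}")


# Validate both frames before handing them to the merge.
for frame in (df1, df2):
    require_column(frame, "office_name")

# With both frames validated, the merge shown above can be run as:
# df_lm_matched = lt.merge(df1, df2, merge_type="1:m", on="office_name",
#                          model="/content/OfficeTitleDis/model")
```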

---

### <font color="IndianRed">Authors </font>
Queenie Luo (queenieluo[at]g.harvard.edu)
<br>
Hongsu Wang
<br>
Peter Bol
<br>
CBDB Group

### <font color="IndianRed">License </font>
Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or
send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.