
Tutorials for Testing and Fine-Tuning SpaBERT

This repository provides two Jupyter Notebooks: one for testing entity linking (a downstream task of SpaBERT) and one for the fine-tuning procedure used to train on geo-entities from other knowledge bases (e.g., the World Historical Gazetteer).

  1. The first step is to clone the SpaBERT repository onto your machine. Run the following command:

git clone https://github.com/zekun-li/spabert.git

  2. You will need the IPython kernel for Jupyter installed before running the code in this tutorial. Run the following command to ensure ipykernel is installed:

pip install ipykernel

  3. Before starting the Jupyter Notebooks, run the following command to make sure you have all required packages:

pip install -r requirements.txt

The requirements.txt file will be located in the spabert directory.

-spabert
 | - datasets
 | - experiments
 | - models
 | - notebooks
 | - utils
 | - __init__.py
 | - README.md
 | - requirements.txt
 | - train_mlm.py

Installing Model Weights

Make sure you have git-lfs installed (Windows & Mac: https://git-lfs.com; Linux: https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md).

Please run the following commands separately, in order, to install the pre-trained and fine-tuned model weights:

git lfs install

git clone https://huggingface.co/knowledge-computing-lab/spabert-base-uncased

git clone https://huggingface.co/knowledge-computing-lab/spabert-base-uncased-finetuned-osm-mn

Once the model weights are installed, you'll see two files: mlm_mem_keeppos_ep0_iter06000_0.2936.pth and spabert-base-uncased-finetuned-osm-mn.pth. Move these files to the tutorial_datasets folder. After moving them, the file structure should look like this:

- notebooks
  | - tutorial_datasets
  |   | - mlm_mem_keeppos_ep0_iter06000_0.2936.pth
  |   | - osm_mn.csv
  |   | - spabert_osm_mn.json
  |   | - spabert_whg_wikidata.json
  |   | - spabert_wikidata_sampled.json
  |   | - spabert-base-uncased-finetuned-osm-mn.pth
  | - README.md
  | - spabert-entity-linking.ipynb
  | - spabert-fine-tuning.ipynb
  | - WHGDataset.py

Jupyter Notebook Descriptions

spabert-fine-tuning.ipynb

This Jupyter Notebook demonstrates how to fine-tune SpaBERT using point data from OpenStreetMap (OSM) in Minnesota. SpaBERT is pre-trained on OSM point data from California and London. Instructions for pre-training your own model can be found in the SpaBERT GitHub repository. Here are the steps to run:

  1. Define which dataset you want to use (e.g., OSM in New York or Minnesota)
  2. Read the data from a CSV file and construct a KDTree for computing nearest neighbors
  3. Create a dataset from the KDTree for fine-tuning SpaBERT on the data you chose (steps 2-3 are sketched after this list)
  4. Load the pre-trained model
  5. Load the dataset using the SpaBERT data loader
  6. Train the model for 1 epoch and save the fine-tuned weights
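
The following is a minimal sketch of steps 2-3, assuming pandas and scipy are available; the path, the column names (taken from the osm_mn.csv sample under Dataset Descriptions), and the k value are illustrative, and the notebook's actual code may differ:

import pandas as pd
from scipy.spatial import KDTree

# Step 2: read the point data and build a KDTree over the coordinates.
df = pd.read_csv("tutorial_datasets/osm_mn.csv")
coords = df[["latitude", "longitude"]].to_numpy()
tree = KDTree(coords)

# Step 3: for each entity, look up its k nearest neighbors
# (the entity itself comes back first, at distance 0).
k = 10
distances, indices = tree.query(coords, k=k)

# Assemble one spatial-context record per entity, mirroring the
# JSON structure described under Dataset Descriptions.
records = []
for i, row in df.iterrows():
    neighbors = indices[i]
    records.append({
        "info": {
            "name": row["name"],
            "geometry": {"coordinates": [float(row["latitude"]), float(row["longitude"])]},
        },
        "neighbor_info": {
            "name_list": df.iloc[neighbors]["name"].tolist(),
            "geometry_list": [{"coordinates": [float(v) for v in c]} for c in coords[neighbors]],
        },
    })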

spabert-entity-linking.ipynb

This Jupyter Notebook demonstrates how to create an entity-linking dataset and how to perform entity linking using SpaBERT. The dataset used here is a pre-matched dataset between the World Historical Gazetteer (WHG) and Wikidata. The model is evaluated with Hits@K and Mean Reciprocal Rank (MRR); both metrics are sketched after the steps below. Here are the steps to run:

  1. Load fine-tuned model from previous Jupyter notebook
  2. Load datasets using the WHG data loader
  3. Calculate embeddings for WHG and Wikidata entities using SpaBERT
  4. Calculate Hits@1, Hits@5, Hits@10, and MRR
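
For reference, Hits@K is the fraction of queries whose true match appears among the top K ranked candidates, and MRR is the average of the reciprocal rank of the true match. A minimal, self-contained sketch of both metrics (not the notebook's exact code):

import numpy as np

def hits_at_k(ranks, k):
    # Fraction of queries whose true entity is ranked within the top k (ranks are 1-based).
    return float(np.mean(np.asarray(ranks) <= k))

def mean_reciprocal_rank(ranks):
    # Average of 1/rank over all queries.
    return float(np.mean(1.0 / np.asarray(ranks)))

# Example: 1-based ranks of the correct Wikidata entity for five WHG queries.
ranks = [1, 3, 2, 8, 1]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 5), mean_reciprocal_rank(ranks))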

Dataset Descriptions

There are two types of tutorial datasets used for fine-tuning SpaBERT: CSV and JSON files.

  • CSV file - sample taken from OpenStreetMap (OSM)

    • Minnesota State ./tutorial_datasets/osm_mn.csv

    An example data structure:

    row_id  name          latitude  longitude
    0       Duluth        -92.1215  46.7729
    1       Green Valley  -95.757   44.5269
  • JSON files - ready-to-use files for SpaBERT's data loader - SpatialDataset

    • OSM Minnesota State ./tutorial_datasets/spabert_osm_mn.json
      • Generated from ./tutorial_datasets/osm_mn.csv using spabert-fine-tuning.ipynb
    • WHG ./tutorial_datasets/spabert_whg_wikidata.json
      • Geo-entities from WHG that have a link to Wikidata
    • Wikidata ./tutorial_datasets/spabert_wikidata_sampled.json
      • Sampled from entities delivered by WHG. These entities have been linked between WHG and Wikidata by WHG prior to being delivered to us.

Each JSON file contains one JSON object per line. Each object describes the spatial context of an entity using its nearby entities.

A sample JSON object looks like the following:

{
   "info":{
      "name":"Duluth",
      "geometry":{
         "coordinates":[
            46.7729,
            -92.1215
         ]
      }
   },
   "neighbor_info":{
      "name_list":[
         "Duluth",
         "Chinese Peace Belle and Garden",
         ...
      ],
      "geometry_list":[
         {
            "coordinates":[
               46.7729,
               -92.1215
            ]
         },
         {
            "coordinates":[
               46.7770,
               -92.1241
            ]
         },
         ...
      ]
   }
}
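
Because each line is a standalone JSON object, the files can be read line by line. A minimal sketch, assuming the paths above:

import json

# Each line of the file is one entity record like the sample above.
with open("tutorial_datasets/spabert_osm_mn.json") as f:
    for line in f:
        record = json.loads(line)
        name = record["info"]["name"]
        neighbor_names = record["neighbor_info"]["name_list"]
        print(name, "has", len(neighbor_names), "neighbors")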

To perform entity linking with SpaBERT, you must have a dataset structured like the fine-tuning datasets above, with an additional qid field that links each entity to Wikidata.

A sample JSON object looks like the following:

{
   "info":{
      "name":"Duluth",
      "geometry":{
         "coordinates":[
            46.7729,
            -92.1215
         ]
      },
      "qid":"Q485708"
   },
   "neighbor_info":{
      ...
   }
}
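
As a rough illustration of how such a dataset supports entity linking (a sketch under assumptions, not the notebook's implementation): SpaBERT embeddings are computed for each WHG entity and each Wikidata candidate, candidates are ranked by similarity, and the rank of the true qid's candidate feeds the Hits@K and MRR metrics above. The shapes below are placeholders; 768 assumes a BERT-base hidden size.

import numpy as np

# Placeholder embeddings; real ones come from SpaBERT (step 3 of the entity-linking notebook).
rng = np.random.default_rng(0)
whg_emb = rng.standard_normal((5, 768))         # one vector per WHG entity
wikidata_emb = rng.standard_normal((100, 768))  # one vector per Wikidata candidate

# Cosine similarity between every WHG entity and every Wikidata candidate.
whg_norm = whg_emb / np.linalg.norm(whg_emb, axis=1, keepdims=True)
wd_norm = wikidata_emb / np.linalg.norm(wikidata_emb, axis=1, keepdims=True)
sims = whg_norm @ wd_norm.T                     # shape (num_whg, num_wikidata)

# Candidates sorted from most to least similar; the 1-based position of the
# true qid's candidate in each row is the rank used by Hits@K and MRR.
ranking = np.argsort(-sims, axis=1)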