---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- job-matching
- skill-similarity
- embeddings
- esco
---
# 🛠️ alvperez/skill-sim-model
**skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768‑D vector space where semantically related skills cluster together.
Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job‑matching research.
| Use‑case | How to leverage the embeddings |
|----------|--------------------------------|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` |
| Deduplicating skill taxonomies | cluster the vectors |
| Recruiter query‑expansion | nearest‑neighbour search |
| Exploratory dashboards | feed to t‑SNE / PCA |
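For instance, the first row boils down to a handful of cosine calls; the max-per-requirement aggregation below is only an illustrative choice, not something the model prescribes:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

candidate_skills = ["Electrical wiring", "Circuit troubleshooting"]
vacancy_skills   = ["Industrial electrical installation", "PLC programming"]

cand_emb = model.encode(candidate_skills, convert_to_tensor=True)
vac_emb  = model.encode(vacancy_skills, convert_to_tensor=True)

# For each required skill, keep the candidate's best match, then average.
sim = util.cos_sim(vac_emb, cand_emb)            # shape: (n_vacancy, n_candidate)
score = sim.max(dim=1).values.mean().item()
print(f"candidate-vacancy match score: {score:.3f}")
```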
---
## 🚀 Quick start
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("alvperez/skill-sim-model")
skills = ["Electrical wiring",
          "Circuit troubleshooting",
          "Machine learning"]

emb = model.encode(skills, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb))  # similarity of the first skill to all three
```
```python
# Note: the 🤗 Transformers pipeline API has no "sentence-similarity" task,
# so score a query against candidates with Sentence-Transformers directly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")
query = model.encode("forklift operation", convert_to_tensor=True)
candidates = model.encode(["pallet jack", "python"], convert_to_tensor=True)
print(util.cos_sim(query, candidates))  # one score per candidate
```
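Near-duplicate skills (the second and third rows of the use-case table) can be grouped directly from the embeddings; the threshold below is an illustrative assumption:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

skills = ["Forklift operation", "Fork lift driving", "Python", "Machine learning"]
emb = model.encode(skills, convert_to_tensor=True)

# Groups of mutually similar skills; min_community_size=1 keeps singletons.
for group in util.community_detection(emb, threshold=0.7, min_community_size=1):
    print([skills[i] for i in group])
```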
---
## 📊 Benchmark
| Metric | Value |
|--------------------------------|-------|
| Spearman correlation | **0.845** |
| ROC AUC | **0.988** |
| MAP@all (*cold‑start*) | **0.232** |
> *cold‑start = the system sees only skill strings, no historical interactions.*
---
## ⚙️ Training recipe (brief)
* Base: `sentence-transformers/all-mpnet-base-v2`
* Loss: `CosineSimilarityLoss`
* Epochs × batch: `5 × 32`
* LR / warm‑up: `2e-5` / `100` steps
* Negatives: random + “hard” pairs from ESCO siblings
* Hardware: 1 × A100 40 GB (≈ 45 min)
Full code in [`/training_scripts`](training_scripts).
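A minimal sketch of the recipe with the classic Sentence-Transformers `fit` API; the two hard-coded pairs are placeholders, and the real data preparation lives in [`/training_scripts`](training_scripts):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder pairs: (skill_a, skill_b, similarity label in [0, 1]).
pairs = [
    ("Electrical wiring", "Circuit troubleshooting", 0.9),
    ("Electrical wiring", "Machine learning", 0.1),
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_examples = [InputExample(texts=[a, b], label=score) for a, b, score in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=5,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
```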
---
## 🏹 Intended use
* **Employment tech** – rank CVs vs. vacancies
* **EdTech / reskilling** – detect skill gaps, suggest learning paths
* **HR analytics** – normalise noisy skill fields at scale
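For the last point, a minimal normalisation sketch; the canonical vocabulary and noisy inputs are made up for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

canonical = ["Forklift operation", "Python (programming language)", "Electrical wiring"]
noisy = ["fork lift driver", "python dev", "wiring of electrical circuits"]

canon_emb = model.encode(canonical, convert_to_tensor=True)
noisy_emb = model.encode(noisy, convert_to_tensor=True)

# Map every noisy string to its nearest canonical label.
best = util.cos_sim(noisy_emb, canon_emb).argmax(dim=1)
for raw, idx in zip(noisy, best):
    print(f"{raw!r} -> {canonical[int(idx)]!r}")
```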
---
## ✋ Limitations & bias
* Vocabulary dominated by ESCO (English); niche jargon may project poorly.
* No explicit fairness constraints; downstream systems should audit (e.g. *Disparate Impact*).
* In our tests, a cosine similarity of ≈ 0.65 marks a “definitely related” cut‑off (see the sketch below); tune the threshold for your own precision‑recall trade‑off.
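A minimal thresholding sketch, treating 0.65 only as a starting point:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")
a, b = model.encode(["Forklift operation", "Pallet jack"], convert_to_tensor=True)

RELATED_THRESHOLD = 0.65  # starting point from our tests; re-tune on your own labels
print(util.cos_sim(a, b).item() >= RELATED_THRESHOLD)
```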
---
## 🔍 Citation
```bibtex
@misc{alvperez2025skillsim,
  title        = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
  author       = {Pérez Amado, Álvaro},
  howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
  year         = {2025}
}
```
---
### Acknowledgements
Built on top of Sentence-Transformers and the public **ESCO** dataset.
Feedback & PRs welcome!