|
--- |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
license: apache-2.0 |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- job-matching |
|
- skill-similarity |
|
- embeddings |
|
- esco |
|
--- |
|
|
|
# 🛠️ alvperez/skill-sim-model |
|
|
|
**skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768‑D vector space where semantically related skills cluster together. |
|
Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job‑matching research. |
|
|
|
| Use‑case | How to leverage the embeddings | |
|
|----------|--------------------------------| |
|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` | |
|
| Deduplicating skill taxonomies | cluster the vectors | |
|
| Recruiter query‑expansion | nearest‑neighbour search | |
|
| Exploratory dashboards | feed to t‑SNE / PCA | |
|
|
|
--- |
|
|
|
## 🚀 Quick start |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer, util |
|
|
|
model = SentenceTransformer("alvperez/skill-sim-model") |
|
|
|
skills = ["Electrical wiring", |
|
"Circuit troubleshooting", |
|
"Machine learning"] |
|
|
|
emb = model.encode(skills, convert_to_tensor=True) |
|
print(util.cos_sim(emb[0], emb))  # similarities of the first skill to all three
|
``` |
|
|
|
```python
# Note: 🤗 Transformers has no "sentence-similarity" pipeline task,
# so score candidate skills with sentence-transformers directly:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

query_emb = model.encode("forklift operation", convert_to_tensor=True)
cand_emb = model.encode(["pallet jack", "python"], convert_to_tensor=True)
print(util.cos_sim(query_emb, cand_emb))
```
|
|
|
--- |
|
|
|
## 📊 Benchmark |
|
|
|
| Metric | Value | |
|
|--------------------------------|-------| |
|
| Spearman correlation | **0.845** | |
|
| ROC AUC | **0.988** | |
|
| MAP@all (*cold‑start*) | **0.232** | |
|
|
|
> *cold‑start = the system sees only skill strings, no historical interactions.* |
|
|
|
--- |
|
|
|
## ⚙️ Training recipe (brief) |
|
|
|
* Base: `sentence-transformers/all-mpnet-base-v2` |
|
* Loss: `CosineSimilarityLoss` |
|
* Epochs × batch: `5 × 32` |
|
* LR / warm‑up: `2e-5` / `100` steps
|
* Negatives: random + “hard” pairs from ESCO siblings |
|
* Hardware: 1 × A100 40 GB (≈ 45 min) |
|
|
|
Full code in [`/training_scripts`](training_scripts). |
|
|
|
--- |
|
|
|
## 🏹 Intended use |
|
|
|
* **Employment tech** – rank CVs vs. vacancies |
|
* **EdTech / reskilling** – detect skill gaps, suggest learning paths |
|
* **HR analytics** – normalise noisy skill fields at scale |
|
|
|
--- |
|
|
|
## ✋ Limitations & bias |
|
|
|
* Vocabulary dominated by ESCO (English); niche jargon may project poorly. |
|
* No explicit fairness constraints; downstream systems should audit (e.g. *Disparate Impact*). |
|
* In our tests, a cosine-similarity threshold of 0.65 marked a “definitely related” cut-off; tune it for your own precision-recall trade-off.
|
|
|
--- |
|
|
|
## 🔍 Citation |
|
|
|
```bibtex |
|
@misc{alvperez2025skillsim, |
|
title = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching}, |
|
author = {Pérez Amado, Álvaro}, |
|
howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}}, |
|
year = {2025} |
|
} |
|
``` |
|
|
|
--- |
|
|
|
### Acknowledgements |
|
|
|
Built on top of Sentence-Transformers and the public **ESCO** dataset. |
|
Feedback & PRs welcome! |
|
|