---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- job-matching
- skill-similarity
- embeddings
- esco
---
# 🛠️ alvperez/skill-sim-model
**skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768‑D vector space where semantically related skills cluster together.
Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job‑matching research.
| Use‑case | How to leverage the embeddings |
|----------|--------------------------------|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` |
| Deduplicating skill taxonomies | cluster the vectors |
| Recruiter query‑expansion | nearest‑neighbour search |
| Exploratory dashboards | feed to t‑SNE / PCA |
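For example, the first row of the table might look like this in practice; the skill lists and the mean‑pooling aggregation are illustrative choices, not something the model prescribes:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

# Toy skill sets (illustrative only)
candidate_skills = ["Electrical wiring", "Circuit troubleshooting"]
vacancy_skills = ["Electrical installation", "Cable laying"]

# One simple aggregation: mean-pool each side's skill vectors,
# then score the pair with cosine similarity
skill_vec = model.encode(candidate_skills, convert_to_tensor=True).mean(dim=0)
job_vec = model.encode(vacancy_skills, convert_to_tensor=True).mean(dim=0)
print(util.cos_sim(skill_vec, job_vec))  # single candidate ↔ vacancy score
```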
---
## 🚀 Quick start
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("alvperez/skill-sim-model")
skills = ["Electrical wiring",
          "Circuit troubleshooting",
          "Machine learning"]
emb = model.encode(skills, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb))  # similarity of the first skill to all three
```
```python
from sentence_transformers import SentenceTransformer, util

# Score one query skill against a list of candidate skills
model = SentenceTransformer("alvperez/skill-sim-model")
query = model.encode("forklift operation", convert_to_tensor=True)
candidates = model.encode(["pallet jack", "python"], convert_to_tensor=True)
print(util.cos_sim(query, candidates))  # higher score = more related
```
---
## 📊 Benchmark
| Metric | Value |
|--------------------------------|-------|
| Spearman correlation | **0.845** |
| ROC AUC | **0.988** |
| MAP@all (*cold‑start*) | **0.232** |
> *cold‑start = the system sees only skill strings, no historical interactions.*
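Spearman‑style numbers like the one above can be reproduced with the built‑in `EmbeddingSimilarityEvaluator`; the labelled pairs below are made up for illustration and are **not** the benchmark data:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("alvperez/skill-sim-model")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Electrical wiring", "Forklift operation"],
    sentences2=["Circuit troubleshooting", "Creative writing"],
    scores=[0.9, 0.05],  # human-labelled relatedness in [0, 1]
    name="skill-pairs",
)
print(evaluator(model))  # Spearman correlation of cosine similarities vs. labels
```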
---
## ⚙️ Training recipe (brief)
* Base: `sentence-transformers/all-mpnet-base-v2`
* Loss: `CosineSimilarityLoss`
* Epochs × batch: `5 × 32`
* LR / warm‑up: `2e-5` / `100` steps
* Negatives: random + “hard” pairs from ESCO siblings
* Hardware: 1 × A100 40 GB (≈ 45 min)
Full code in [`/training_scripts`](training_scripts).
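A condensed sketch of the recipe with the classic `model.fit` API; the two `InputExample` pairs are placeholders for the real ESCO‑derived data:
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Placeholder pairs; the real set is ESCO positives plus hard negatives
train_examples = [
    InputExample(texts=["Electrical wiring", "Circuit troubleshooting"], label=0.9),
    InputExample(texts=["Electrical wiring", "Creative writing"], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
```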
---
## 🏹 Intended use
* **Employment tech** – rank CVs vs. vacancies
* **EdTech / reskilling** – detect skill gaps, suggest learning paths
* **HR analytics** – normalise noisy skill fields at scale (see the sketch below)
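As a sketch of the last item, noisy free‑text skill fields can be snapped to a canonical vocabulary with nearest‑neighbour search; the `canonical` list is a stand‑in for a real ESCO label set:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

canonical = ["Electrical wiring", "Forklift operation", "Machine learning"]
noisy = ["wiring up electrics", "driving a fork lift"]

canon_emb = model.encode(canonical, convert_to_tensor=True)
noisy_emb = model.encode(noisy, convert_to_tensor=True)

# For each noisy field, keep the closest canonical label
hits = util.semantic_search(noisy_emb, canon_emb, top_k=1)
for field, hit in zip(noisy, hits):
    print(field, "->", canonical[hit[0]["corpus_id"]], round(hit[0]["score"], 3))
```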
---
## ✋ Limitations & bias
* Vocabulary is dominated by ESCO (English); niche jargon may be represented poorly.
* No explicit fairness constraints were applied; downstream systems should audit for bias (e.g. *disparate impact* analysis).
* In our tests, a cosine similarity of 0.65 marked a “definitely related” cut‑off; tune the threshold for your own precision‑recall needs (see the sketch below).
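A minimal threshold‑tuning sketch, assuming you hold labelled related/unrelated pairs (the arrays below are illustrative):
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative cosine scores and binary "related" labels for held-out pairs
scores = np.array([0.92, 0.71, 0.40, 0.18])
labels = np.array([1, 1, 0, 0])

precision, recall, thresholds = precision_recall_curve(labels, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```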
---
## 🔍 Citation
```bibtex
@misc{alvperez2025skillsim,
title = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
author = {Pérez Amado, Álvaro},
howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
year = {2025}
}
```
---
### Acknowledgements
Built on top of Sentence-Transformers and the public **ESCO** dataset.
Feedback & PRs welcome!