alvperez committed
Commit 8d982c5 · verified · 1 Parent(s): 147c40c

Update README.md

Files changed (1)
  1. README.md +65 -62
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
@@ -8,107 +9,109 @@ tags:
  - job-matching
  - skill-similarity
  - embeddings
  ---

- # alvperez/skill-sim-model

- This is a fine-tuned [sentence-transformers](https://www.SBERT.net) model for **skill similarity** and **job matching**. It maps short skill phrases (e.g., `Python`, `Forklift Operation`, `Electrical Wiring`) into a 768-dimensional embedding space, where semantically related skills are closer together.

- It can be used for:
-
- - Matching candidates to job requirements
- - Measuring similarity between skills
- - Clustering and grouping skill sets
- - Resume parsing or job recommendation systems

  ---

- ## 🧪 Usage (Sentence-Transformers)
-
- To use this model:

  ```bash
  pip install -U sentence-transformers
  ```

  ```python
- from sentence_transformers import SentenceTransformer

- model = SentenceTransformer('alvperez/skill-sim-model')

- skills = ["Electrical Wiring", "Circuit Troubleshooting", "Machine Learning"]
- embeddings = model.encode(skills)

- print(embeddings.shape)  # (3, 768)
  ```

- ---

- ## 🧭 Evaluation Results

- The model was evaluated on a labeled skill similarity dataset using the following metrics:

- | Metric               | Value  |
- |----------------------|--------|
- | Spearman Correlation | 0.8612 |
- | ROC AUC              | 0.9127 |

- These scores indicate strong alignment with human-labeled skill similarity ratings.

  ---

- ## 🧠 Training Details

- The model was fine-tuned on a custom skill similarity dataset using `CosineSimilarityLoss`.

- ### **DataLoader**

- `torch.utils.data.dataloader.DataLoader` of length 409 with parameters:

- ```python
- {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```

- ### **Loss**

- ```python
- sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
- ```

- ### **Training Parameters**

- ```python
- {
-     "epochs": 5,
-     "evaluation_steps": 100,
-     "evaluator": "EmbeddingSimilarityEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "AdamW",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "warmup_steps": 100,
-     "weight_decay": 0.01
- }
- ```

  ---

- ## 🧬 Model Architecture

- ```python
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
-   (2): Normalize()
- )
  ```

  ---

- ## 📚 Citation & Attribution

- - Model fine-tuned by [@alvperez](https://huggingface.co/alvperez)
- - Built with [Sentence-Transformers](https://www.sbert.net/)
- - Inspired by semantic search and skill-matching use cases
 

  ---
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
+ license: apache-2.0
  tags:
  - sentence-transformers
  - feature-extraction
  - job-matching
  - skill-similarity
  - embeddings
+ - esco
  ---

+ # 🛠️ alvperez/skill-sim-model

+ **skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768-D vector space where semantically related skills cluster together.
+ Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job-matching research.

+ | Use case | How to leverage the embeddings |
+ |----------|--------------------------------|
+ | Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` (see the sketch below) |
+ | Deduplicating skill taxonomies | cluster the vectors |
+ | Recruiter query expansion | nearest-neighbour search |
+ | Exploratory dashboards | feed to t-SNE / PCA |

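+ For instance, a minimal candidate ↔ vacancy scoring sketch (the skill lists and the max-over-requirements aggregation are illustrative choices, not part of the model card):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ candidate = ["forklift operation", "inventory management"]
+ vacancy   = ["operate forklifts", "stock control", "python"]
+
+ cand_emb = model.encode(candidate, convert_to_tensor=True)
+ vac_emb  = model.encode(vacancy, convert_to_tensor=True)
+
+ # For each vacancy requirement, take the candidate's best-matching skill,
+ # then average those maxima into a single match score.
+ sims = util.cos_sim(cand_emb, vac_emb)         # (2, 3) similarity matrix
+ score = sims.max(dim=0).values.mean().item()
+ print(f"match score: {score:.3f}")
+ ```
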
  ---

+ ## 🚀 Quick start

  ```bash
  pip install -U sentence-transformers
  ```

  ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ skills = ["Electrical wiring",
+           "Circuit troubleshooting",
+           "Machine learning"]
+
+ emb = model.encode(skills, convert_to_tensor=True)
+ print(util.cos_sim(emb[0], emb))  # 1×3 similarity matrix
  ```

+ Need a vanilla 🤗 Transformers route? `transformers` has no `sentence-similarity` pipeline task, but `feature-extraction` plus mean pooling gets close:

+ ```python
+ import numpy as np
+ from transformers import pipeline
+
+ # Token-level embeddings; mean-pool them to approximate sentence vectors.
+ extract = pipeline("feature-extraction", model="alvperez/skill-sim-model")
+ vecs = [np.mean(extract(s)[0], axis=0)
+         for s in ["forklift operation", "pallet jack", "python"]]
+ cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
+ print([cos(vecs[0], v) for v in vecs[1:]])  # similarity to "forklift operation"
+ ```

+ ---
+
+ ## 📊 Benchmark

+ | Metric                          | Value     |
+ |---------------------------------|-----------|
+ | Spearman correlation (2k pairs) | **0.845** |
+ | ROC AUC                         | **0.988** |
+ | MAP@all (*cold-start*)          | **0.232** |

+ > *Cold-start: the system sees only skill strings, no historical interactions.*

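+ For reference, one way such scores could be reproduced from a labeled pair list (the pairs below are a stand-in for the actual evaluation set):
+
+ ```python
+ import numpy as np
+ from scipy.stats import spearmanr
+ from sklearn.metrics import roc_auc_score
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ # Assumed format: (skill_a, skill_b, human similarity label in [0, 1]).
+ pairs = [("electrical wiring", "circuit troubleshooting", 1.0),
+          ("forklift operation", "pallet jack", 1.0),
+          ("electrical wiring", "machine learning", 0.0)]
+ a, b, labels = zip(*pairs)
+
+ sims = util.cos_sim(model.encode(a, convert_to_tensor=True),
+                     model.encode(b, convert_to_tensor=True)).diagonal().cpu().numpy()
+ print(spearmanr(sims, labels).correlation)
+ print(roc_auc_score(np.array(labels) > 0.5, sims))
+ ```
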
  ---

+ ## ⚙️ Training recipe (brief)

+ * Base: `sentence-transformers/all-mpnet-base-v2`
+ * Loss: `CosineSimilarityLoss`
+ * Epochs × batch: `5 × 32`
+ * LR / warm-up: `2e-5` / `100` steps
+ * Negatives: random + “hard” pairs from ESCO siblings
+ * Hardware: 1 × A100 40 GB (≈ 45 min)

+ Full code in [`/training_scripts`](training_scripts); a minimal sketch follows below.
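
+ A minimal sketch of that recipe using the classic Sentence-Transformers `fit` API (the inline pairs and their format are assumptions for illustration):
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
+
+ # Assumed format: (skill_a, skill_b, similarity label in [0, 1]).
+ pairs = [("electrical wiring", "circuit troubleshooting", 0.9),
+          ("electrical wiring", "python", 0.1)]
+ train_examples = [InputExample(texts=[a, b], label=s) for a, b, s in pairs]
+
+ loader = DataLoader(train_examples, shuffle=True, batch_size=32)
+ loss = losses.CosineSimilarityLoss(model)
+
+ model.fit(train_objectives=[(loader, loss)],
+           epochs=5,
+           warmup_steps=100,
+           optimizer_params={"lr": 2e-5},
+           weight_decay=0.01)
+ ```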
+ ---

+ ## 🏹 Intended use

+ * **Employment tech** – rank CVs vs. vacancies
+ * **EdTech / reskilling** – detect skill gaps, suggest learning paths
+ * **HR analytics** – normalise noisy skill fields at scale (sketched below)
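
+ As a sketch of the HR-analytics case, near-duplicate skill strings can be grouped with `util.community_detection`; the threshold and the raw strings here are illustrative:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ raw = ["ms excel", "microsoft excel", "excel spreadsheets", "python", "python3"]
+ emb = model.encode(raw, convert_to_tensor=True)
+
+ # Groups of indices whose embeddings are mutually similar above the threshold.
+ for group in util.community_detection(emb, threshold=0.8, min_community_size=1):
+     print([raw[i] for i in group])
+ ```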
+ ---

+ ## Limitations & bias

+ * Vocabulary dominated by ESCO (English); niche jargon may project poorly.
+ * No explicit fairness constraints; downstream systems should run their own audits (e.g. *disparate impact* analysis).
+ * In our tests, a threshold of 0.65 marks a “definitely related” cut-off; tune it for your own precision-recall needs (see the sketch below).
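
+ A tiny illustration of applying that 0.65 cut-off (the skill pair is made up; re-tune the threshold on your own data):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ a, b = model.encode(["forklift operation", "pallet jack"], convert_to_tensor=True)
+ related = util.cos_sim(a, b).item() >= 0.65  # "definitely related" heuristic
+ print(related)
+ ```
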
  ---

+ ## 🔍 Citation

+ ```bibtex
+ @misc{alvperez2025skillsim,
+   title        = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
+   author       = {Pérez Amado, Álvaro},
+   howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
+   year         = {2025}
+ }
  ```

  ---

+ ### Acknowledgements

+ Built with 💙 on top of Sentence-Transformers and the public **ESCO** dataset.
+ Feedback & PRs welcome!