alvperez committed
Commit 8d982c5 · verified · 1 Parent(s): 147c40c

Update README.md

Files changed (1)
  1. README.md +65 -62
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
@@ -8,107 +9,109 @@ tags:
  - job-matching
  - skill-similarity
  - embeddings
  ---

- # alvperez/skill-sim-model

- This is a fine-tuned [sentence-transformers](https://www.SBERT.net) model for **skill similarity** and **job matching**. It maps short skill phrases (e.g., `Python`, `Forklift Operation`, `Electrical Wiring`) into a 768-dimensional embedding space, where semantically related skills are closer together.

- It can be used for:
-
- - Matching candidates to job requirements
- - Measuring similarity between skills
- - Clustering and grouping skill sets
- - Resume parsing or job recommendation systems

  ---

- ## 🧪 Usage (Sentence-Transformers)
-
- To use this model:

  ```bash
  pip install -U sentence-transformers
  ```

  ```python
- from sentence_transformers import SentenceTransformer

- model = SentenceTransformer('alvperez/skill-sim-model')

- skills = ["Electrical Wiring", "Circuit Troubleshooting", "Machine Learning"]
- embeddings = model.encode(skills)

- print(embeddings.shape)  # (3, 768)
  ```

- ---

- ## 🧭 Evaluation Results

- The model was evaluated on a labeled skill similarity dataset using the following metrics:

- | Metric               | Value  |
- |----------------------|--------|
- | Spearman Correlation | 0.8612 |
- | ROC AUC              | 0.9127 |

- These scores indicate strong alignment with human-labeled skill similarity ratings.

  ---

- ## 🧠 Training Details

- The model was fine-tuned on a custom skill similarity dataset using `CosineSimilarityLoss`.

- ### **DataLoader**

- `torch.utils.data.dataloader.DataLoader` of length 409 with parameters:

- ```python
- {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```

- ### **Loss**

- ```python
- sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
- ```

- ### **Training Parameters**

- ```python
- {
-     "epochs": 5,
-     "evaluation_steps": 100,
-     "evaluator": "EmbeddingSimilarityEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "AdamW",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "warmup_steps": 100,
-     "weight_decay": 0.01
- }
- ```

  ---

- ## 🧬 Model Architecture

- ```python
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
-   (2): Normalize()
- )
  ```

  ---

- ## 📚 Citation & Attribution

- - Model fine-tuned by [@alvperez](https://huggingface.co/alvperez)
- - Built with [Sentence-Transformers](https://www.sbert.net/)
- - Inspired by semantic search and skill-matching use cases
 

  ---
  library_name: sentence-transformers
  pipeline_tag: sentence-similarity
+ license: apache-2.0
  tags:
  - sentence-transformers
  - feature-extraction
  - job-matching
  - skill-similarity
  - embeddings
+ - esco
  ---

+ # 🛠️ alvperez/skill-sim-model

+ **skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768-D vector space where semantically related skills cluster together.
+ Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job-matching research.

+ | Use case | How to leverage the embeddings |
+ |----------|--------------------------------|
+ | Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` (see the sketch below) |
+ | Deduplicating skill taxonomies | cluster the vectors |
+ | Recruiter query expansion | nearest-neighbour search |
+ | Exploratory dashboards | feed to t-SNE / PCA |

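+ For instance, a minimal candidate ↔ vacancy scoring sketch (the skill lists and the max-over-requirements aggregation are illustrative choices, not part of the model card):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ candidate = ["forklift operation", "inventory management"]
+ vacancy   = ["operate forklifts", "stock control", "python"]
+
+ cand_emb = model.encode(candidate, convert_to_tensor=True)
+ vac_emb  = model.encode(vacancy, convert_to_tensor=True)
+
+ # For each vacancy requirement, take the candidate's best-matching skill,
+ # then average those maxima into a single match score.
+ sims = util.cos_sim(cand_emb, vac_emb)         # (2, 3) similarity matrix
+ score = sims.max(dim=0).values.mean().item()
+ print(f"match score: {score:.3f}")
+ ```
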
  ---

+ ## 🚀 Quick start

  ```bash
  pip install -U sentence-transformers
  ```

  ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ skills = ["Electrical wiring",
+           "Circuit troubleshooting",
+           "Machine learning"]
+
+ emb = model.encode(skills, convert_to_tensor=True)
+ print(util.cos_sim(emb[0], emb))  # 1×3 similarity matrix
  ```

+ Need a vanilla 🤗 Transformers route? `transformers` has no `sentence-similarity` pipeline task, but `feature-extraction` plus mean pooling gets close:

+ ```python
+ import numpy as np
+ from transformers import pipeline
+
+ # Token-level embeddings; mean-pool them to approximate sentence vectors.
+ extract = pipeline("feature-extraction", model="alvperez/skill-sim-model")
+ vecs = [np.mean(extract(s)[0], axis=0)
+         for s in ["forklift operation", "pallet jack", "python"]]
+ cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
+ print([cos(vecs[0], v) for v in vecs[1:]])  # similarity to "forklift operation"
+ ```

+ ---
+
+ ## 📊 Benchmark

+ | Metric                          | Value     |
+ |---------------------------------|-----------|
+ | Spearman correlation (2k pairs) | **0.845** |
+ | ROC AUC                         | **0.988** |
+ | MAP@all (*cold-start*)          | **0.232** |

+ > *Cold-start: the system sees only skill strings, no historical interactions.*

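+ For reference, one way such scores could be reproduced from a labeled pair list (the pairs below are a stand-in for the actual evaluation set):
+
+ ```python
+ import numpy as np
+ from scipy.stats import spearmanr
+ from sklearn.metrics import roc_auc_score
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ # Assumed format: (skill_a, skill_b, human similarity label in [0, 1]).
+ pairs = [("electrical wiring", "circuit troubleshooting", 1.0),
+          ("forklift operation", "pallet jack", 1.0),
+          ("electrical wiring", "machine learning", 0.0)]
+ a, b, labels = zip(*pairs)
+
+ sims = util.cos_sim(model.encode(a, convert_to_tensor=True),
+                     model.encode(b, convert_to_tensor=True)).diagonal().cpu().numpy()
+ print(spearmanr(sims, labels).correlation)
+ print(roc_auc_score(np.array(labels) > 0.5, sims))
+ ```
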
  ---

+ ## ⚙️ Training recipe (brief)

+ * Base: `sentence-transformers/all-mpnet-base-v2`
+ * Loss: `CosineSimilarityLoss`
+ * Epochs × batch: `5 × 32`
+ * LR / warm-up: `2e-5` / `100` steps
+ * Negatives: random + “hard” pairs from ESCO siblings
+ * Hardware: 1 × A100 40 GB (≈ 45 min)

+ Full code in [`/training_scripts`](training_scripts); a minimal sketch follows below.
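
+ A minimal sketch of that recipe using the classic Sentence-Transformers `fit` API (the inline pairs and their format are assumptions for illustration):
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
+
+ # Assumed format: (skill_a, skill_b, similarity label in [0, 1]).
+ pairs = [("electrical wiring", "circuit troubleshooting", 0.9),
+          ("electrical wiring", "python", 0.1)]
+ train_examples = [InputExample(texts=[a, b], label=s) for a, b, s in pairs]
+
+ loader = DataLoader(train_examples, shuffle=True, batch_size=32)
+ loss = losses.CosineSimilarityLoss(model)
+
+ model.fit(train_objectives=[(loader, loss)],
+           epochs=5,
+           warmup_steps=100,
+           optimizer_params={"lr": 2e-5},
+           weight_decay=0.01)
+ ```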
+ ---

+ ## 🏹 Intended use

+ * **Employment tech** – rank CVs vs. vacancies
+ * **EdTech / reskilling** – detect skill gaps, suggest learning paths
+ * **HR analytics** – normalise noisy skill fields at scale (sketched below)
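
+ As a sketch of the HR-analytics case, near-duplicate skill strings can be grouped with `util.community_detection`; the threshold and the raw strings here are illustrative:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ raw = ["ms excel", "microsoft excel", "excel spreadsheets", "python", "python3"]
+ emb = model.encode(raw, convert_to_tensor=True)
+
+ # Groups of indices whose embeddings are mutually similar above the threshold.
+ for group in util.community_detection(emb, threshold=0.8, min_community_size=1):
+     print([raw[i] for i in group])
+ ```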
+ ---

+ ## Limitations & bias

+ * Vocabulary dominated by ESCO (English); niche jargon may project poorly.
+ * No explicit fairness constraints; downstream systems should run their own audits (e.g. *disparate impact* analysis).
+ * In our tests, a threshold of 0.65 marks a “definitely related” cut-off; tune it for your own precision-recall needs (see the sketch below).
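
+ A tiny illustration of applying that 0.65 cut-off (the skill pair is made up; re-tune the threshold on your own data):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("alvperez/skill-sim-model")
+
+ a, b = model.encode(["forklift operation", "pallet jack"], convert_to_tensor=True)
+ related = util.cos_sim(a, b).item() >= 0.65  # "definitely related" heuristic
+ print(related)
+ ```
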
  ---

+ ## 🔍 Citation

+ ```bibtex
+ @misc{alvperez2025skillsim,
+   title        = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
+   author       = {Pérez Amado, Álvaro},
+   howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
+   year         = {2025}
+ }
  ```

  ---

+ ### Acknowledgements

+ Built with 💙 on top of Sentence-Transformers and the public **ESCO** dataset.
+ Feedback & PRs welcome!