jensjorisdecorte committed on
Commit a764dbe · verified · 1 parent: 99bb455

Update README.md

Files changed (1):
  1. README.md +66 -230
README.md CHANGED
@@ -139,20 +139,19 @@ widget:
139
 
140
  # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
141
 
142
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) on the generator dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
143
 
144
  ## Model Details
145
 
146
  ### Model Description
147
  - **Model Type:** Sentence Transformer
148
- - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision 9a3225965996d404b775526de6dbfe85d3368642 -->
149
  - **Maximum Sequence Length:** 64 tokens
150
  - **Output Dimensionality:** 1024 dimensions
151
  - **Similarity Function:** Cosine Similarity
152
- - **Training Dataset:**
153
- - generator
154
- <!-- - **Language:** Unknown -->
155
- <!-- - **License:** Unknown -->
156
 
157
  ### Model Sources
158
 
@@ -177,233 +176,88 @@ SentenceTransformer(
177
 
178
  ### Direct Usage (Sentence Transformers)
179
 
180
- First install the Sentence Transformers library:
181
 
182
  ```bash
183
  pip install -U sentence-transformers
184
  ```
185
 
186
- Then you can load this model and run inference.
 
187
  ```python
188
  from sentence_transformers import SentenceTransformer
 
189
 
190
- # Download from the 🤗 Hub
191
  model = SentenceTransformer("jensjorisdecorte/JobBERT-v2")
192
- # Run inference
193
- sentences = [
194
- 'Branch Manager',
195
- 'teamwork principles, office administration, delegate responsibilities, create banking accounts, manage alarm system, make independent operating decisions, use microsoft office, offer financial services, ensure proper document management, own management skills, use spreadsheets software, manage cash flow, integrate community outreach, manage time, perform multiple tasks at the same time, carry out calculations, assess customer credibility, maintain customer service, team building, digitise documents, promote financial products, communication, assist customers, follow procedures in the event of an alarm, office equipment',
196
- 'support employability of people with disabilities, schedule shifts, issue licences, funding methods, maintain correspondence records, computer equipment, decide on providing funds, tend filing machine, use microsoft office, lift stacks of paper, transport office equipment, tend to guests with special needs, provide written content, foreign affairs policy development, provide charity services, philanthropy, maintain financial records, meet deadlines, manage fundraising activities, assist individuals with disabilities in community activities, report on grants, prepare compliance documents, manage grant applications, tolerate sitting for long periods, follow work schedule',
197
- ]
198
- embeddings = model.encode(sentences)
199
- print(embeddings.shape)
200
- # [3, 1024]
201
-
202
- # Get the similarity scores for the embeddings
203
- similarities = model.similarity(embeddings, embeddings)
204
- print(similarities.shape)
205
- # [3, 3]
206
- ```
207
-
208
- <!--
209
- ### Direct Usage (Transformers)
210
-
211
- <details><summary>Click to see the direct usage in Transformers</summary>
212
-
213
- </details>
214
- -->
215
 
216
- <!--
217
- ### Downstream Usage (Sentence Transformers)
218
-
219
- You can finetune this model on your own dataset.
220
-
221
- <details><summary>Click to expand</summary>
222
-
223
- </details>
224
- -->
225
-
226
- <!--
227
- ### Out-of-Scope Use
228
-
229
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
230
- -->
 
231
 
232
- <!--
233
- ## Bias, Risks and Limitations
234
 
235
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
236
- -->
 
 
237
 
238
- <!--
239
- ### Recommendations
240
 
241
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
242
- -->
 
 
243
 
244
  ## Training Details
245
 
246
  ### Training Dataset
247
 
248
  #### generator
249
-
250
- * Dataset: generator
251
- * Size: 5,579,240 training samples
252
- * Columns: <code>anchor</code> and <code>positive</code>
253
- * Approximate statistics based on the first 1000 samples:
254
- | | anchor | positive |
255
- |:--------|:---------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
256
- | type | string | string |
257
- | details | <ul><li>min: 3 tokens</li><li>mean: 7.95 tokens</li><li>max: 30 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 59.33 tokens</li><li>max: 64 tokens</li></ul> |
258
- * Samples:
259
- | anchor | positive |
260
- |:--------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
261
- | <code>CAD Designer - Fire Sprinkler - Milwaukee - Relocation</code> | <code>coordinate construction activities, oversee construction project, fire protection engineering, install fire sprinklers, hydraulics, construction industry, create AutoCAD drawings, design sprinkler systems, inspect construction sites, design drawings, supervise sewerage systems construction, prepare site for construction, building codes, communicate with construction crews</code> |
262
- | <code>RN Practitioner</code> | <code>assume responsibility, financial statements, manage work, implement fundamentals of nursing, diagnose advanced nursing care, diagnose nursing care, specialist nursing care, nursing principles, provide nursing advice on healthcare, apply nursing care in long-term care, prescribe advanced nursing care, plan advanced nursing care, nursing science, implement nursing care, develop financial statistics reports, clinical decision-making at advanced practice, prepare financial statements, create a financial report, produce statistical financial records, operate in a specific field of nursing care</code> |
263
- | <code>Respiratory Therapist Travel Positions (BB-160B7)</code> | <code>respiratory therapy, comply with quality standards related to healthcare practice, provide information, primary care, record treated patient's information, formulate a treatment plan, carry out treatment prescribed by doctors, develop patient treatment strategies</code> |
264
- * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
265
- ```json
266
- {
267
- "scale": 20.0,
268
- "similarity_fct": "cos_sim"
269
- }
270
- ```
271
 
272
  ### Training Hyperparameters
273
- #### Non-Default Hyperparameters
274
-
275
- - `overwrite_output_dir`: True
276
- - `per_device_train_batch_size`: 2048
277
- - `per_device_eval_batch_size`: 2048
278
- - `num_train_epochs`: 1
279
- - `fp16`: True
280
-
281
- #### All Hyperparameters
282
- <details><summary>Click to expand</summary>
283
-
284
- - `overwrite_output_dir`: True
285
- - `do_predict`: False
286
- - `eval_strategy`: no
287
- - `prediction_loss_only`: True
288
- - `per_device_train_batch_size`: 2048
289
- - `per_device_eval_batch_size`: 2048
290
- - `per_gpu_train_batch_size`: None
291
- - `per_gpu_eval_batch_size`: None
292
- - `gradient_accumulation_steps`: 1
293
- - `eval_accumulation_steps`: None
294
- - `torch_empty_cache_steps`: None
295
- - `learning_rate`: 5e-05
296
- - `weight_decay`: 0.0
297
- - `adam_beta1`: 0.9
298
- - `adam_beta2`: 0.999
299
- - `adam_epsilon`: 1e-08
300
- - `max_grad_norm`: 1.0
301
- - `num_train_epochs`: 1
302
- - `max_steps`: -1
303
- - `lr_scheduler_type`: linear
304
- - `lr_scheduler_kwargs`: {}
305
- - `warmup_ratio`: 0.0
306
- - `warmup_steps`: 0
307
- - `log_level`: passive
308
- - `log_level_replica`: warning
309
- - `log_on_each_node`: True
310
- - `logging_nan_inf_filter`: True
311
- - `save_safetensors`: True
312
- - `save_on_each_node`: False
313
- - `save_only_model`: False
314
- - `restore_callback_states_from_checkpoint`: False
315
- - `no_cuda`: False
316
- - `use_cpu`: False
317
- - `use_mps_device`: False
318
- - `seed`: 42
319
- - `data_seed`: None
320
- - `jit_mode_eval`: False
321
- - `use_ipex`: False
322
- - `bf16`: False
323
- - `fp16`: True
324
- - `fp16_opt_level`: O1
325
- - `half_precision_backend`: auto
326
- - `bf16_full_eval`: False
327
- - `fp16_full_eval`: False
328
- - `tf32`: None
329
- - `local_rank`: 0
330
- - `ddp_backend`: None
331
- - `tpu_num_cores`: None
332
- - `tpu_metrics_debug`: False
333
- - `debug`: []
334
- - `dataloader_drop_last`: False
335
- - `dataloader_num_workers`: 0
336
- - `dataloader_prefetch_factor`: None
337
- - `past_index`: -1
338
- - `disable_tqdm`: False
339
- - `remove_unused_columns`: True
340
- - `label_names`: None
341
- - `load_best_model_at_end`: False
342
- - `ignore_data_skip`: False
343
- - `fsdp`: []
344
- - `fsdp_min_num_params`: 0
345
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
346
- - `fsdp_transformer_layer_cls_to_wrap`: None
347
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
348
- - `deepspeed`: None
349
- - `label_smoothing_factor`: 0.0
350
- - `optim`: adamw_torch
351
- - `optim_args`: None
352
- - `adafactor`: False
353
- - `group_by_length`: False
354
- - `length_column_name`: length
355
- - `ddp_find_unused_parameters`: None
356
- - `ddp_bucket_cap_mb`: None
357
- - `ddp_broadcast_buffers`: False
358
- - `dataloader_pin_memory`: True
359
- - `dataloader_persistent_workers`: False
360
- - `skip_memory_metrics`: True
361
- - `use_legacy_prediction_loop`: False
362
- - `push_to_hub`: False
363
- - `resume_from_checkpoint`: None
364
- - `hub_model_id`: None
365
- - `hub_strategy`: every_save
366
- - `hub_private_repo`: False
367
- - `hub_always_push`: False
368
- - `gradient_checkpointing`: False
369
- - `gradient_checkpointing_kwargs`: None
370
- - `include_inputs_for_metrics`: False
371
- - `eval_do_concat_batches`: True
372
- - `fp16_backend`: auto
373
- - `push_to_hub_model_id`: None
374
- - `push_to_hub_organization`: None
375
- - `mp_parameters`:
376
- - `auto_find_batch_size`: False
377
- - `full_determinism`: False
378
- - `torchdynamo`: None
379
- - `ray_scope`: last
380
- - `ddp_timeout`: 1800
381
- - `torch_compile`: False
382
- - `torch_compile_backend`: None
383
- - `torch_compile_mode`: None
384
- - `dispatch_batches`: None
385
- - `split_batches`: None
386
- - `include_tokens_per_second`: False
387
- - `include_num_input_tokens_seen`: False
388
- - `neftune_noise_alpha`: None
389
- - `optim_target_modules`: None
390
- - `batch_eval_metrics`: False
391
- - `eval_on_start`: False
392
- - `eval_use_gather_object`: False
393
- - `batch_sampler`: batch_sampler
394
- - `multi_dataset_batch_sampler`: proportional
395
-
396
- </details>
397
-
398
- ### Training Logs
399
- | Epoch | Step | Training Loss |
400
- |:------:|:----:|:-------------:|
401
- | 0.1835 | 500 | 3.6354 |
402
- | 0.3670 | 1000 | 3.1788 |
403
- | 0.5505 | 1500 | 2.9969 |
404
- | 0.7339 | 2000 | 2.9026 |
405
- | 0.9174 | 2500 | 2.8421 |
406
-
407
 
408
  ### Framework Versions
409
  - Python: 3.9.19
@@ -441,22 +295,4 @@ You can finetune this model on your own dataset.
441
  archivePrefix={arXiv},
442
  primaryClass={cs.LG}
443
  }
444
- ```
445
-
446
- <!--
447
- ## Glossary
448
-
449
- *Clearly define terms in order to be accessible across audiences.*
450
- -->
451
-
452
- <!--
453
- ## Model Card Authors
454
-
455
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
456
- -->
457
-
458
- <!--
459
- ## Model Card Contact
460
-
461
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
462
- -->
 
139
 
140
  # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
141
 
142
+ This is a [sentence-transformers](https://www.SBERT.net) model specifically trained for job title matching and similarity. It's finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) on a large dataset of job titles and their associated skills/requirements. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.
143
 
144
  ## Model Details
145
 
146
  ### Model Description
147
  - **Model Type:** Sentence Transformer
148
+ - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
149
  - **Maximum Sequence Length:** 64 tokens
150
  - **Output Dimensionality:** 1024 dimensions
151
  - **Similarity Function:** Cosine Similarity
152
+ - **Training Dataset:** 5.5M+ job title pairs
153
+ - **Primary Use Case:** Job title matching and similarity
154
+ - **Performance:** Achieves 0.6457 MAP on the TalentCLEF benchmark
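For context, mean average precision (MAP) scores each query's ranked candidate list and averages over queries. A minimal sketch of the metric (not the official TalentCLEF scorer, and the toy data below is illustrative only):

```python
import numpy as np

def average_precision(relevant, ranked):
    """AP for one query: `ranked` is a list of candidate ids in
    predicted order, `relevant` is the set of correct ids."""
    hits, score = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in relevant:
            hits += 1
            score += hits / rank  # precision at this hit position
    return score / max(len(relevant), 1)

def mean_average_precision(queries):
    """`queries` maps each query to a (relevant_set, ranked_list) pair."""
    return float(np.mean([average_precision(r, k) for r, k in queries.values()]))

# Toy example: one query with 2 relevant titles, ranked 1st and 3rd
queries = {"software engineer": ({"developer", "programmer"},
                                 ["developer", "accountant", "programmer"])}
print(mean_average_precision(queries))  # (1/1 + 2/3) / 2 ≈ 0.8333
```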
 
155
 
156
  ### Model Sources
157
 
 
176
 
177
  ### Direct Usage (Sentence Transformers)
178
 
179
+ First install the required packages:
180
 
181
  ```bash
182
  pip install -U sentence-transformers
183
  ```
184
 
185
+ Then you can load and use the model with the following code:
186
+
187
  ```python
188
+ import torch
189
+ import numpy as np
190
+ from tqdm.auto import tqdm
191
  from sentence_transformers import SentenceTransformer
192
+ from sentence_transformers.util import batch_to_device
193
 
194
+ # Load the model
195
  model = SentenceTransformer("jensjorisdecorte/JobBERT-v2")
 
196
 
197
+ def encode_batch(jobbert_model, texts):
198
+ features = jobbert_model.tokenize(texts)
199
+ features = batch_to_device(features, jobbert_model.device)
200
+ features["text_keys"] = ["anchor"]
201
+ with torch.no_grad():
202
+ out_features = jobbert_model.forward(features)
203
+ return out_features["sentence_embedding"].cpu().numpy()
204
+
205
+ def encode(jobbert_model, texts, batch_size: int = 8):
206
+ # Sort texts by length and keep track of original indices
207
+ sorted_indices = np.argsort([len(text) for text in texts])
208
+ sorted_texts = [texts[i] for i in sorted_indices]
209
+
210
+ embeddings = []
211
+
212
+ # Encode in batches
213
+ for i in tqdm(range(0, len(sorted_texts), batch_size)):
214
+ batch = sorted_texts[i:i+batch_size]
215
+ embeddings.append(encode_batch(jobbert_model, batch))
216
+
217
+ # Concatenate embeddings and reorder to original indices
218
+ sorted_embeddings = np.concatenate(embeddings)
219
+ original_order = np.argsort(sorted_indices)
220
+ return sorted_embeddings[original_order]
221
+
222
+ # Example usage
223
+ job_titles = [
224
+ 'Software Engineer',
225
+ 'Senior Software Developer',
226
+ 'Product Manager',
227
+ 'Data Scientist'
228
+ ]
229
 
230
+ # Get embeddings
231
+ embeddings = encode(model, job_titles)
232
 
233
+ # Calculate similarity matrix
234
+ similarities = np.dot(embeddings, embeddings.T)
235
+ print(similarities)
236
+ ```
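Note that the raw dot product above equals cosine similarity only when the embeddings are L2-normalized (models derived from all-mpnet-base-v2 typically normalize via a final Normalize module). If you are unsure, normalize explicitly; a small self-contained sketch:

```python
import numpy as np

def cosine_matrix(embeddings):
    """L2-normalize rows so that the dot product equals cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return emb @ emb.T

# Toy check on unnormalized vectors
e = np.array([[3.0, 4.0], [4.0, 3.0]])
sims = cosine_matrix(e)
print(sims[0, 0])  # 1.0 (self-similarity)
print(sims[0, 1])  # 24/25 = 0.96
```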
237
 
238
+ ### Example Use Cases
 
239
 
240
+ 1. **Job Title Matching**: Find similar job titles for standardization or matching
241
+ 2. **Job Search**: Match job seekers with relevant positions based on title similarity
242
+ 3. **HR Analytics**: Analyze job title patterns and similarities across organizations
243
+ 4. **Talent Management**: Identify similar roles for career development and succession planning
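As an illustration of use case 1, nearest-neighbor lookup against a set of canonical titles maps a free-form title to its closest standardized form. The embeddings below are random stand-ins; in practice they would come from encoding the titles with the model:

```python
import numpy as np

def top_matches(query_emb, canon_embs, canon_titles, k=3):
    """Return the k canonical titles most similar to the query
    (cosine similarity, assuming L2-normalized embeddings)."""
    sims = canon_embs @ query_emb
    order = np.argsort(-sims)[:k]
    return [(canon_titles[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)

def fake_embed(n, dim=1024):
    """Random unit vectors standing in for model embeddings."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

canon_titles = ["Software Engineer", "Product Manager", "Data Scientist"]
canon_embs = fake_embed(len(canon_titles))
query_emb = canon_embs[0]  # pretend the query embeds exactly like title 0
print(top_matches(query_emb, canon_embs, canon_titles, k=1))
```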
244
 
245
  ## Training Details
246
 
247
  ### Training Dataset
248
 
249
  #### generator
250
+ - Dataset: 5.5M+ job title pairs
251
+ - Format: Anchor job titles paired with related skills/requirements
252
+ - Training objective: Learn semantic similarity between job titles and their associated skills
253
+ - Loss: CachedMultipleNegativesRankingLoss with cosine similarity
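MultipleNegativesRankingLoss treats each anchor's paired positive as the target and every other positive in the batch as an in-batch negative, i.e. cross-entropy over scaled cosine similarities. A toy numpy sketch of that objective (the cached variant changes how gradients are computed to allow large batches, not the loss value itself):

```python
import numpy as np

def mnr_loss(anchor_embs, positive_embs, scale=20.0):
    """In-batch-negatives ranking loss: row i's correct target is column i."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())    # cross-entropy on diagonal

# Toy batch: each anchor is closest to its own positive -> near-zero loss
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.1, 0.9]])
print(mnr_loss(anchors, positives))  # near zero
```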
254
 
255
  ### Training Hyperparameters
256
+ - Batch Size: 2048
257
+ - Learning Rate: 5e-05
258
+ - Epochs: 1
259
+ - FP16 Training: Enabled
260
+ - Optimizer: AdamW
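These numbers are consistent with the training log in the previous revision of this card: with 5,579,240 pairs, batch size 2048, and no dropped last batch, one epoch is ceil(5579240 / 2048) = 2725 optimizer steps, so logged step 2500 corresponds to epoch 2500 / 2725 ≈ 0.9174. A quick arithmetic check:

```python
import math

samples = 5_579_240
batch_size = 2048
steps_per_epoch = math.ceil(samples / batch_size)
print(steps_per_epoch)                   # 2725
print(round(2500 / steps_per_epoch, 4))  # 0.9174
print(round(500 / steps_per_epoch, 4))   # 0.1835
```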
261
 
262
  ### Framework Versions
263
  - Python: 3.9.19
 
295
  archivePrefix={arXiv},
296
  primaryClass={cs.LG}
297
  }
298
+ ```