jensjorisdecorte committed on
Commit a764dbe · verified · 1 parent: 99bb455

Update README.md

Files changed (1):
  1. README.md +66 -230
README.md CHANGED
@@ -139,20 +139,19 @@ widget:
139
 
140
  # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
141
 
142
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) on the generator dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
143
 
144
  ## Model Details
145
 
146
  ### Model Description
147
  - **Model Type:** Sentence Transformer
148
- - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision 9a3225965996d404b775526de6dbfe85d3368642 -->
149
  - **Maximum Sequence Length:** 64 tokens
150
  - **Output Dimensionality:** 1024 dimensions
151
  - **Similarity Function:** Cosine Similarity
152
- - **Training Dataset:**
153
- - generator
154
- <!-- - **Language:** Unknown -->
155
- <!-- - **License:** Unknown -->
156
 
157
  ### Model Sources
158
 
@@ -177,233 +176,88 @@ SentenceTransformer(
177
 
178
  ### Direct Usage (Sentence Transformers)
179
 
180
- First install the Sentence Transformers library:
181
 
182
  ```bash
183
  pip install -U sentence-transformers
184
  ```
185
 
186
- Then you can load this model and run inference.
 
187
  ```python
188
  from sentence_transformers import SentenceTransformer
 
189
 
190
- # Download from the 🤗 Hub
191
  model = SentenceTransformer("jensjorisdecorte/JobBERT-v2")
192
- # Run inference
193
- sentences = [
194
- 'Branch Manager',
195
- 'teamwork principles, office administration, delegate responsibilities, create banking accounts, manage alarm system, make independent operating decisions, use microsoft office, offer financial services, ensure proper document management, own management skills, use spreadsheets software, manage cash flow, integrate community outreach, manage time, perform multiple tasks at the same time, carry out calculations, assess customer credibility, maintain customer service, team building, digitise documents, promote financial products, communication, assist customers, follow procedures in the event of an alarm, office equipment',
196
- 'support employability of people with disabilities, schedule shifts, issue licences, funding methods, maintain correspondence records, computer equipment, decide on providing funds, tend filing machine, use microsoft office, lift stacks of paper, transport office equipment, tend to guests with special needs, provide written content, foreign affairs policy development, provide charity services, philanthropy, maintain financial records, meet deadlines, manage fundraising activities, assist individuals with disabilities in community activities, report on grants, prepare compliance documents, manage grant applications, tolerate sitting for long periods, follow work schedule',
197
- ]
198
- embeddings = model.encode(sentences)
199
- print(embeddings.shape)
200
- # [3, 1024]
201
-
202
- # Get the similarity scores for the embeddings
203
- similarities = model.similarity(embeddings, embeddings)
204
- print(similarities.shape)
205
- # [3, 3]
206
- ```
207
-
208
- <!--
209
- ### Direct Usage (Transformers)
210
-
211
- <details><summary>Click to see the direct usage in Transformers</summary>
212
-
213
- </details>
214
- -->
215
 
216
- <!--
217
- ### Downstream Usage (Sentence Transformers)
218
-
219
- You can finetune this model on your own dataset.
220
-
221
- <details><summary>Click to expand</summary>
222
-
223
- </details>
224
- -->
225
-
226
- <!--
227
- ### Out-of-Scope Use
228
-
229
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
230
- -->
 
231
 
232
- <!--
233
- ## Bias, Risks and Limitations
234
 
235
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
236
- -->
 
 
237
 
238
- <!--
239
- ### Recommendations
240
 
241
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
242
- -->
 
 
243
 
244
  ## Training Details
245
 
246
  ### Training Dataset
247
 
248
  #### generator
249
-
250
- * Dataset: generator
251
- * Size: 5,579,240 training samples
252
- * Columns: <code>anchor</code> and <code>positive</code>
253
- * Approximate statistics based on the first 1000 samples:
254
- | | anchor | positive |
255
- |:--------|:---------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
256
- | type | string | string |
257
- | details | <ul><li>min: 3 tokens</li><li>mean: 7.95 tokens</li><li>max: 30 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 59.33 tokens</li><li>max: 64 tokens</li></ul> |
258
- * Samples:
259
- | anchor | positive |
260
- |:--------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
261
- | <code>CAD Designer - Fire Sprinkler - Milwaukee - Relocation</code> | <code>coordinate construction activities, oversee construction project, fire protection engineering, install fire sprinklers, hydraulics, construction industry, create AutoCAD drawings, design sprinkler systems, inspect construction sites, design drawings, supervise sewerage systems construction, prepare site for construction, building codes, communicate with construction crews</code> |
262
- | <code>RN Practitioner</code> | <code>assume responsibility, financial statements, manage work, implement fundamentals of nursing, diagnose advanced nursing care, diagnose nursing care, specialist nursing care, nursing principles, provide nursing advice on healthcare, apply nursing care in long-term care, prescribe advanced nursing care, plan advanced nursing care, nursing science, implement nursing care, develop financial statistics reports, clinical decision-making at advanced practice, prepare financial statements, create a financial report, produce statistical financial records, operate in a specific field of nursing care</code> |
263
- | <code>Respiratory Therapist Travel Positions (BB-160B7)</code> | <code>respiratory therapy, comply with quality standards related to healthcare practice, provide information, primary care, record treated patient's information, formulate a treatment plan, carry out treatment prescribed by doctors, develop patient treatment strategies</code> |
264
- * Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
265
- ```json
266
- {
267
- "scale": 20.0,
268
- "similarity_fct": "cos_sim"
269
- }
270
- ```
271
 
272
  ### Training Hyperparameters
273
- #### Non-Default Hyperparameters
274
-
275
- - `overwrite_output_dir`: True
276
- - `per_device_train_batch_size`: 2048
277
- - `per_device_eval_batch_size`: 2048
278
- - `num_train_epochs`: 1
279
- - `fp16`: True
280
-
281
- #### All Hyperparameters
282
- <details><summary>Click to expand</summary>
283
-
284
- - `overwrite_output_dir`: True
285
- - `do_predict`: False
286
- - `eval_strategy`: no
287
- - `prediction_loss_only`: True
288
- - `per_device_train_batch_size`: 2048
289
- - `per_device_eval_batch_size`: 2048
290
- - `per_gpu_train_batch_size`: None
291
- - `per_gpu_eval_batch_size`: None
292
- - `gradient_accumulation_steps`: 1
293
- - `eval_accumulation_steps`: None
294
- - `torch_empty_cache_steps`: None
295
- - `learning_rate`: 5e-05
296
- - `weight_decay`: 0.0
297
- - `adam_beta1`: 0.9
298
- - `adam_beta2`: 0.999
299
- - `adam_epsilon`: 1e-08
300
- - `max_grad_norm`: 1.0
301
- - `num_train_epochs`: 1
302
- - `max_steps`: -1
303
- - `lr_scheduler_type`: linear
304
- - `lr_scheduler_kwargs`: {}
305
- - `warmup_ratio`: 0.0
306
- - `warmup_steps`: 0
307
- - `log_level`: passive
308
- - `log_level_replica`: warning
309
- - `log_on_each_node`: True
310
- - `logging_nan_inf_filter`: True
311
- - `save_safetensors`: True
312
- - `save_on_each_node`: False
313
- - `save_only_model`: False
314
- - `restore_callback_states_from_checkpoint`: False
315
- - `no_cuda`: False
316
- - `use_cpu`: False
317
- - `use_mps_device`: False
318
- - `seed`: 42
319
- - `data_seed`: None
320
- - `jit_mode_eval`: False
321
- - `use_ipex`: False
322
- - `bf16`: False
323
- - `fp16`: True
324
- - `fp16_opt_level`: O1
325
- - `half_precision_backend`: auto
326
- - `bf16_full_eval`: False
327
- - `fp16_full_eval`: False
328
- - `tf32`: None
329
- - `local_rank`: 0
330
- - `ddp_backend`: None
331
- - `tpu_num_cores`: None
332
- - `tpu_metrics_debug`: False
333
- - `debug`: []
334
- - `dataloader_drop_last`: False
335
- - `dataloader_num_workers`: 0
336
- - `dataloader_prefetch_factor`: None
337
- - `past_index`: -1
338
- - `disable_tqdm`: False
339
- - `remove_unused_columns`: True
340
- - `label_names`: None
341
- - `load_best_model_at_end`: False
342
- - `ignore_data_skip`: False
343
- - `fsdp`: []
344
- - `fsdp_min_num_params`: 0
345
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
346
- - `fsdp_transformer_layer_cls_to_wrap`: None
347
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
348
- - `deepspeed`: None
349
- - `label_smoothing_factor`: 0.0
350
- - `optim`: adamw_torch
351
- - `optim_args`: None
352
- - `adafactor`: False
353
- - `group_by_length`: False
354
- - `length_column_name`: length
355
- - `ddp_find_unused_parameters`: None
356
- - `ddp_bucket_cap_mb`: None
357
- - `ddp_broadcast_buffers`: False
358
- - `dataloader_pin_memory`: True
359
- - `dataloader_persistent_workers`: False
360
- - `skip_memory_metrics`: True
361
- - `use_legacy_prediction_loop`: False
362
- - `push_to_hub`: False
363
- - `resume_from_checkpoint`: None
364
- - `hub_model_id`: None
365
- - `hub_strategy`: every_save
366
- - `hub_private_repo`: False
367
- - `hub_always_push`: False
368
- - `gradient_checkpointing`: False
369
- - `gradient_checkpointing_kwargs`: None
370
- - `include_inputs_for_metrics`: False
371
- - `eval_do_concat_batches`: True
372
- - `fp16_backend`: auto
373
- - `push_to_hub_model_id`: None
374
- - `push_to_hub_organization`: None
375
- - `mp_parameters`:
376
- - `auto_find_batch_size`: False
377
- - `full_determinism`: False
378
- - `torchdynamo`: None
379
- - `ray_scope`: last
380
- - `ddp_timeout`: 1800
381
- - `torch_compile`: False
382
- - `torch_compile_backend`: None
383
- - `torch_compile_mode`: None
384
- - `dispatch_batches`: None
385
- - `split_batches`: None
386
- - `include_tokens_per_second`: False
387
- - `include_num_input_tokens_seen`: False
388
- - `neftune_noise_alpha`: None
389
- - `optim_target_modules`: None
390
- - `batch_eval_metrics`: False
391
- - `eval_on_start`: False
392
- - `eval_use_gather_object`: False
393
- - `batch_sampler`: batch_sampler
394
- - `multi_dataset_batch_sampler`: proportional
395
-
396
- </details>
397
-
398
- ### Training Logs
399
- | Epoch | Step | Training Loss |
400
- |:------:|:----:|:-------------:|
401
- | 0.1835 | 500 | 3.6354 |
402
- | 0.3670 | 1000 | 3.1788 |
403
- | 0.5505 | 1500 | 2.9969 |
404
- | 0.7339 | 2000 | 2.9026 |
405
- | 0.9174 | 2500 | 2.8421 |
406
-
407
 
408
  ### Framework Versions
409
  - Python: 3.9.19
@@ -441,22 +295,4 @@ You can finetune this model on your own dataset.
441
  archivePrefix={arXiv},
442
  primaryClass={cs.LG}
443
  }
444
- ```
445
-
446
- <!--
447
- ## Glossary
448
-
449
- *Clearly define terms in order to be accessible across audiences.*
450
- -->
451
-
452
- <!--
453
- ## Model Card Authors
454
-
455
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
456
- -->
457
-
458
- <!--
459
- ## Model Card Contact
460
-
461
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
462
- -->
 
139
 
140
  # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
141
 
142
+ This is a [sentence-transformers](https://www.SBERT.net) model specifically trained for job title matching and similarity. It's finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) on a large dataset of job titles and their associated skills/requirements. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.
143
 
144
  ## Model Details
145
 
146
  ### Model Description
147
  - **Model Type:** Sentence Transformer
148
+ - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
149
  - **Maximum Sequence Length:** 64 tokens
150
  - **Output Dimensionality:** 1024 dimensions
151
  - **Similarity Function:** Cosine Similarity
152
+ - **Training Dataset:** 5.5M+ job title pairs
153
+ - **Primary Use Case:** Job title matching and similarity
154
+ - **Performance:** Achieves 0.6457 MAP on the TalentCLEF benchmark
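For context, mean average precision (MAP) scores each query's ranked candidate list and averages over queries. A minimal sketch of the metric (not the official TalentCLEF scorer, and the toy data below is illustrative only):

```python
import numpy as np

def average_precision(relevant, ranked):
    """AP for one query: `ranked` is a list of candidate ids in
    predicted order, `relevant` is the set of correct ids."""
    hits, score = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in relevant:
            hits += 1
            score += hits / rank  # precision at this hit position
    return score / max(len(relevant), 1)

def mean_average_precision(queries):
    """`queries` maps each query to a (relevant_set, ranked_list) pair."""
    return float(np.mean([average_precision(r, k) for r, k in queries.values()]))

# Toy example: one query with 2 relevant titles, ranked 1st and 3rd
queries = {"software engineer": ({"developer", "programmer"},
                                 ["developer", "accountant", "programmer"])}
print(mean_average_precision(queries))  # (1/1 + 2/3) / 2 ≈ 0.8333
```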
 
155
 
156
  ### Model Sources
157
 
 
176
 
177
  ### Direct Usage (Sentence Transformers)
178
 
179
+ First install the required packages:
180
 
181
  ```bash
182
  pip install -U sentence-transformers
183
  ```
184
 
185
+ Then you can load and use the model with the following code:
186
+
187
  ```python
188
+ import torch
189
+ import numpy as np
190
+ from tqdm.auto import tqdm
191
  from sentence_transformers import SentenceTransformer
192
+ from sentence_transformers.util import batch_to_device
193
 
194
+ # Load the model
195
  model = SentenceTransformer("jensjorisdecorte/JobBERT-v2")
 
196
 
197
+ def encode_batch(jobbert_model, texts):
198
+ features = jobbert_model.tokenize(texts)
199
+ features = batch_to_device(features, jobbert_model.device)
200
+ features["text_keys"] = ["anchor"]
201
+ with torch.no_grad():
202
+ out_features = jobbert_model.forward(features)
203
+ return out_features["sentence_embedding"].cpu().numpy()
204
+
205
+ def encode(jobbert_model, texts, batch_size: int = 8):
206
+ # Sort texts by length and keep track of original indices
207
+ sorted_indices = np.argsort([len(text) for text in texts])
208
+ sorted_texts = [texts[i] for i in sorted_indices]
209
+
210
+ embeddings = []
211
+
212
+ # Encode in batches
213
+ for i in tqdm(range(0, len(sorted_texts), batch_size)):
214
+ batch = sorted_texts[i:i+batch_size]
215
+ embeddings.append(encode_batch(jobbert_model, batch))
216
+
217
+ # Concatenate embeddings and reorder to original indices
218
+ sorted_embeddings = np.concatenate(embeddings)
219
+ original_order = np.argsort(sorted_indices)
220
+ return sorted_embeddings[original_order]
221
+
222
+ # Example usage
223
+ job_titles = [
224
+ 'Software Engineer',
225
+ 'Senior Software Developer',
226
+ 'Product Manager',
227
+ 'Data Scientist'
228
+ ]
229
 
230
+ # Get embeddings
231
+ embeddings = encode(model, job_titles)
232
 
233
+ # Calculate similarity matrix
234
+ similarities = np.dot(embeddings, embeddings.T)
235
+ print(similarities)
236
+ ```
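Note that the raw dot product above equals cosine similarity only when the embeddings are L2-normalized (models derived from all-mpnet-base-v2 typically normalize via a final Normalize module). If you are unsure, normalize explicitly; a small self-contained sketch:

```python
import numpy as np

def cosine_matrix(embeddings):
    """L2-normalize rows so that the dot product equals cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return emb @ emb.T

# Toy check on unnormalized vectors
e = np.array([[3.0, 4.0], [4.0, 3.0]])
sims = cosine_matrix(e)
print(sims[0, 0])  # 1.0 (self-similarity)
print(sims[0, 1])  # 24/25 = 0.96
```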
237
 
238
+ ### Example Use Cases
 
239
 
240
+ 1. **Job Title Matching**: Find similar job titles for standardization or matching
241
+ 2. **Job Search**: Match job seekers with relevant positions based on title similarity
242
+ 3. **HR Analytics**: Analyze job title patterns and similarities across organizations
243
+ 4. **Talent Management**: Identify similar roles for career development and succession planning
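As an illustration of use case 1, nearest-neighbor lookup against a set of canonical titles maps a free-form title to its closest standardized form. The embeddings below are random stand-ins; in practice they would come from encoding the titles with the model:

```python
import numpy as np

def top_matches(query_emb, canon_embs, canon_titles, k=3):
    """Return the k canonical titles most similar to the query
    (cosine similarity, assuming L2-normalized embeddings)."""
    sims = canon_embs @ query_emb
    order = np.argsort(-sims)[:k]
    return [(canon_titles[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)

def fake_embed(n, dim=1024):
    """Random unit vectors standing in for model embeddings."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

canon_titles = ["Software Engineer", "Product Manager", "Data Scientist"]
canon_embs = fake_embed(len(canon_titles))
query_emb = canon_embs[0]  # pretend the query embeds exactly like title 0
print(top_matches(query_emb, canon_embs, canon_titles, k=1))
```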
244
 
245
  ## Training Details
246
 
247
  ### Training Dataset
248
 
249
  #### generator
250
+ - Dataset: 5.5M+ job title pairs
251
+ - Format: Anchor job titles paired with related skills/requirements
252
+ - Training objective: Learn semantic similarity between job titles and their associated skills
253
+ - Loss: CachedMultipleNegativesRankingLoss with cosine similarity
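MultipleNegativesRankingLoss treats each anchor's paired positive as the target and every other positive in the batch as an in-batch negative, i.e. cross-entropy over scaled cosine similarities. A toy numpy sketch of that objective (the cached variant changes how gradients are computed to allow large batches, not the loss value itself):

```python
import numpy as np

def mnr_loss(anchor_embs, positive_embs, scale=20.0):
    """In-batch-negatives ranking loss: row i's correct target is column i."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())    # cross-entropy on diagonal

# Toy batch: each anchor is closest to its own positive -> near-zero loss
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.1, 0.9]])
print(mnr_loss(anchors, positives))  # near zero
```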
254
 
255
  ### Training Hyperparameters
256
+ - Batch Size: 2048
257
+ - Learning Rate: 5e-05
258
+ - Epochs: 1
259
+ - FP16 Training: Enabled
260
+ - Optimizer: AdamW
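These numbers are consistent with the training log in the previous revision of this card: with 5,579,240 pairs, batch size 2048, and no dropped last batch, one epoch is ceil(5579240 / 2048) = 2725 optimizer steps, so logged step 2500 corresponds to epoch 2500 / 2725 ≈ 0.9174. A quick arithmetic check:

```python
import math

samples = 5_579_240
batch_size = 2048
steps_per_epoch = math.ceil(samples / batch_size)
print(steps_per_epoch)                   # 2725
print(round(2500 / steps_per_epoch, 4))  # 0.9174
print(round(500 / steps_per_epoch, 4))   # 0.1835
```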
261
 
262
  ### Framework Versions
263
  - Python: 3.9.19
 
295
  archivePrefix={arXiv},
296
  primaryClass={cs.LG}
297
  }
298
+ ```