Model Card for recuse/longformer-base-ResumeJD

This model is a fine-tuned version of Longformer-base-4096, a transformer-based language model designed for processing long documents. It has been adapted using Masked Language Modeling (MLM) on a dataset of resumes and job descriptions to improve its understanding of career-related text. The full code used for the masked language modeling fine-tuning of longformer-base-4096 is available in this repository.

Uses

Direct Use

This model can be used to generate contextual embeddings for resumes and job descriptions. These embeddings can be applied to tasks such as similarity matching between resumes and job postings or clustering similar documents. It can also perform masked language modeling tasks within the career-related text domain.

Out-of-Scope Use

The model is not designed for general-purpose language modeling outside the resume and job description domain. Using it for unrelated tasks (e.g., sentiment analysis on social media or translation) without additional fine-tuning may result in poor performance.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("recuse/longformer-base-ResumeJD")
model = AutoModelForMaskedLM.from_pretrained("recuse/longformer-base-ResumeJD")

# Tokenize an input string
text = "Experienced software engineer with expertise in Machine Learning."
inputs = tokenizer(text, return_tensors="pt")
# Run a forward pass and request the hidden states
outputs = model(**inputs, output_hidden_states=True)
hidden_states = outputs.hidden_states[-1]  # last-layer token embeddings
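
For the similarity-matching use case described under Direct Use, one simple approach (a minimal sketch, not taken from the repository) is to mean-pool the last-layer hidden states into a single vector per document and compare documents with cosine similarity. The resume_text and jd_text strings below are illustrative placeholders.

import torch
import torch.nn.functional as F

def embed(text):
    # Tokenize, run the model, and mean-pool the last hidden layer over non-padding tokens
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    last = out.hidden_states[-1]                        # (1, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    return (last * mask).sum(dim=1) / mask.sum(dim=1)   # (1, hidden)

resume_text = "Experienced software engineer with expertise in Machine Learning."
jd_text = "We are hiring a machine learning engineer to build NLP pipelines."
similarity = F.cosine_similarity(embed(resume_text), embed(jd_text)).item()
print(f"Resume/JD similarity: {similarity:.3f}")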

Training Details

Training Data

The model was fine-tuned on a dataset of resumes and job descriptions from Kaggle, consisting of approximately 2,200 resume strings and 800 job description strings.

Training Procedure

The text was tokenized using the Longformer tokenizer. For MLM, 15% of tokens were randomly selected: 80% were masked with [MASK], 10% were replaced with random tokens, and 10% were left unchanged, following standard MLM practices.
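
This 15% / 80-10-10 scheme matches the default behavior of Hugging Face's DataCollatorForLanguageModeling. A minimal sketch of how the masking could be set up (an assumption about the training code, not taken from the linked repository) follows.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# mlm_probability=0.15 selects 15% of tokens; of those, 80% become the mask token,
# 10% become random tokens, and 10% stay unchanged (the library default).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encodings = tokenizer(["Experienced software engineer ..."], truncation=True, max_length=4096)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])
# batch["input_ids"] now holds the corrupted inputs, batch["labels"] the MLM targets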

Training Hyperparameters

{
    "model_name": "allenai/longformer-base-4096",
    "device": "cuda",
    "batch_size": 4,
    "lr": 5e-5,
    "epoch_num": 1,
    "log_step": 50,
    "use_fp16": true
}
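
Assuming a standard PyTorch loop, these hyperparameters could be wired together roughly as follows. This is a hedged sketch, not the actual training script (which lives in the linked repository); train_dataset is a placeholder for the tokenized corpus and collator is the masking collator sketched above.

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("allenai/longformer-base-4096").to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()            # use_fp16: true
loader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collator)

model.train()
for epoch in range(1):                          # epoch_num: 1
    for step, batch in enumerate(loader):
        batch = {k: v.to("cuda") for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss          # MLM cross-entropy from the LM head
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        if step % 50 == 0:                      # log_step: 50
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")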

Evaluation

The model was evaluated on a held-out set consisting of 5% of the training data. We measured its ability to predict masked tokens in resume and job description texts using the MLM loss, achieving a final loss of 0.3080.
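
Under the same assumptions as the training sketch above, the held-out loss could be computed along these lines; eval_dataset is a placeholder for the 5% split.

model.eval()
eval_loader = DataLoader(eval_dataset, batch_size=4, collate_fn=collator)

total_loss, num_batches = 0.0, 0
with torch.no_grad():
    for batch in eval_loader:
        batch = {k: v.to("cuda") for k, v in batch.items()}
        total_loss += model(**batch).loss.item()
        num_batches += 1

print(f"Mean held-out MLM loss: {total_loss / num_batches:.4f}")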

Summary

The fine-tuned model shows improved performance on career-related text compared to the base Longformer, as evidenced by its low held-out MLM loss of 0.3080.

Citation

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}