Model Description

This model is fine-tuned on reward-modeling data in two stages: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The resulting post-DPO model is optimized for reasoning and text-generation tasks.
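
For reference, the DPO stage can be run with the TRL library roughly as sketched below. This is a minimal illustration under stated assumptions, not the actual recipe: the preference data, hyperparameters, and output directory are placeholders, and older TRL releases name the `processing_class` argument `tokenizer`.

```python
# Minimal DPO sketch with TRL; data and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen2.5-14B"  # base model listed for this card

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference data: each record pairs a prompt with a preferred ("chosen")
# and a dispreferred ("rejected") response.
train_dataset = Dataset.from_list([
    {"prompt": "...", "chosen": "...", "rejected": "..."},
])

args = DPOConfig(output_dir="qwen2.5-14b-reasongenrm-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```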

The conversation format adds a dedicated `reason` turn alongside the standard `user` and `assistant` turns:

```python
chat_message = [
  {"role": "user", "content": ...},
  {"role": "reason", "content": ...},
  {"role": "assistant", "content": ...},
]
```
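
A minimal inference sketch with the transformers library is shown below. It assumes the repository ID from this card; the prompt and generation settings are placeholders, and how the `reason` turn is rendered depends on the model's bundled chat template.

```python
# Minimal inference sketch; prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jiulaikankan/Qwen2.5-14B-ReasonGenRM"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat_message = [
    {"role": "user", "content": "Explain why the sky appears blue."},
]

# Build the prompt from the chat template and generate; the completion is
# expected to contain the reasoning followed by the final response.
input_ids = tokenizer.apply_chat_template(
    chat_message, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```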

Intended Use

While this model is designed primarily for reward modeling, it also adapts to general-purpose tasks and maintains reasonable correctness and reliability across a range of applications.

Limitations

  • The model’s performance may vary depending on the domain and specificity of the input.
  • It may inherit biases present in the training data.

Code and Resources

The code and additional resources for this model are available on GitHub.

Model size: 14.8B parameters (BF16, Safetensors)
Base model: Qwen/Qwen2.5-14B