Description

Llama3-Instruct-8B model finetuned by hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). Details in WPO: Enhancing RLHF with Weighted Preference Optimization. The model is trained based on wzhouad/llama3-ultrafeedback-hybrid.

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Downloads last month
12
Safetensors
Model size
8.03B params
Tensor type
F32
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Dataset used to train wzhouad/Llama3-Instruct-8B-WPO-HB

Collection including wzhouad/Llama3-Instruct-8B-WPO-HB