# Use SGLang as backend
Lighteval allows you to use `sglang` as a backend, which can provide significant speedups. To use it, simply change the `model_args` to reflect the arguments you want to pass to SGLang.
```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"
```
`sglang` is able to distribute the model across multiple GPUs using data parallelism and tensor parallelism. You can choose the parallelism method by setting the corresponding option in `model_args`.

For example, if you have 4 GPUs, you can split the model across them using `tp_size`:
```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```
Or, if your model fits on a single GPU, you can use `dp_size` to speed up the evaluation:
```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```
## Use a config file
For more advanced configurations, you can use a config file for the model. An example of a config file is shown below and can be found at `examples/model_configs/sglang_model_config.yaml`.
```bash
lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"
```
Documentation for the SGLang configuration options can be found in the SGLang documentation.
```yaml
model_parameters:
  model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
  dtype: "auto"
  tp_size: 1
  dp_size: 1
  context_length: null
  random_seed: 1
  trust_remote_code: False
  use_chat_template: False
  device: "cuda"
  skip_tokenizer_init: False
  kv_cache_dtype: "auto"
  add_special_tokens: True
  pairwise_tokenization: False
  sampling_backend: null
  attention_backend: null
  mem_fraction_static: 0.8
  chunked_prefill_size: 4096
  generation_parameters:
    max_new_tokens: 1024
    min_new_tokens: 0
    temperature: 1.0
    top_k: 50
    min_p: 0.0
    top_p: 1.0
    presence_penalty: 0.0
    repetition_penalty: 1.0
    frequency_penalty: 0.0
```
In the case of OOM issues, you might need to reduce the context size of the model as well as the `mem_fraction_static` and `chunked_prefill_size` parameters.
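For example, starting from the config file above, a minimal sketch of such adjustments could look like the following. The keys are the same ones shown in the example config; the reduced values are only assumptions and should be tuned to your hardware and model size:

```yaml
model_parameters:
  model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
  # Assumed example values for a memory-constrained GPU; tune for your setup.
  context_length: 2048        # cap the maximum context instead of using the model default
  mem_fraction_static: 0.6    # reserve a smaller fraction of GPU memory for weights and KV cache
  chunked_prefill_size: 2048  # prefill prompts in smaller chunks to reduce peak memory
```

Lowering these values trades some throughput for a smaller memory footprint, which is usually enough to get past OOM errors during evaluation.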