
Use SGLang as backend

Lighteval allows you to use SGLang as a backend, which can provide significant speedups. To use it, simply change model_args to the arguments you want to pass to SGLang.

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"

SGLang can distribute the model across multiple GPUs using data parallelism and tensor parallelism. You can choose the parallelism method by setting it in model_args.

For example, if you have 4 GPUs, you can split the model across them using tp_size:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Or, if your model fits on a single GPU, you can use dp_size to speed up the evaluation:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Use a config file

For more advanced configurations, you can use a config file for the model. An example of a config file is shown below and can be found at examples/model_configs/sglang_model_config.yaml.

lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"

Documentation for the SGLang config options can be found in the SGLang documentation.

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    tp_size: 1
    dp_size: 1
    context_length: null
    random_seed: 1
    trust_remote_code: False
    use_chat_template: False
    device: "cuda"
    skip_tokenizer_init: False
    kv_cache_dtype: "auto"
    add_special_tokens: True
    pairwise_tokenization: False
    sampling_backend: null
    attention_backend: null
    mem_fraction_static: 0.8
    chunked_prefill_size: 4096
    generation_parameters:
      max_new_tokens: 1024
      min_new_tokens: 0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0

In the case of OOM issues, you might need to reduce the context size of the model as well as the mem_fraction_static and chunked_prefill_size parameters.
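As a minimal sketch, such adjustments could look like the following in the YAML config; the exact values are illustrative and depend on your model and GPU memory:

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    context_length: 2048         # cap the context window to shrink the KV cache
    mem_fraction_static: 0.6     # reserve a smaller fraction of GPU memory for SGLang's static allocation
    chunked_prefill_size: 2048   # prefill prompts in smaller chunks to lower peak memory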
