Lighteval documentation


You are viewing version v0.7.0. A newer version, v0.9.0, is available.




class lighteval.models.abstract_model.LightevalModel

< >

( )


< >

( )

Clean up operations if needed, such as closing an endpoint.


< >

( requests: list override_bs: typing.Optional[int] = None ) list[GenerativeResponse]


  • requests (list[Request]) — list of requests containing the context and ending conditions.
  • disable_tqdm (bool, optional) — Whether to disable the progress bar. Defaults to False.
  • override_bs (int, optional) — Override the batch size for generation. Defaults to None.



list of generated responses.

Generates responses using a greedy decoding strategy until certain ending conditions are met.
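As a rough illustration of greedy decoding with ending conditions, the sketch below picks the highest-scoring token at every step and stops once a stop sequence appears or a token budget is exhausted. The `greedy_until` helper, the `TABLE` lookup, and the `toy_scores` function are hypothetical stand-ins and are not part of lighteval; a real model would supply the next-token scores.

```python
from typing import Callable, Dict, List


def greedy_until(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    context: List[str],
    stop_sequences: List[str],
    max_new_tokens: int,
) -> str:
    """Greedy decoding: always append the highest-scoring next token."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(context + generated)
        generated.append(max(scores, key=scores.get))
        text = "".join(generated)
        # Stop as soon as any stop sequence shows up, and cut the text before it.
        for stop in stop_sequences:
            if stop in text:
                return text[: text.index(stop)]
    # Token budget exhausted without hitting a stop sequence.
    return "".join(generated)


# Hypothetical toy "model": next-token scores depend only on the last token.
TABLE = {
    "Q:": {" Paris": 0.9, " London": 0.1},
    " Paris": {".": 0.8, "!": 0.2},
    ".": {"\n": 0.7, " Paris": 0.3},
    "\n": {"Q:": 1.0},
}


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    return TABLE[tokens[-1]]


answer = greedy_until(toy_scores, ["Q:"], stop_sequences=["\n"], max_new_tokens=10)
print(repr(answer))
```

Both ending conditions are visible here: the stop sequence ends generation early, and `max_new_tokens` bounds it when no stop sequence is produced.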


< >

( requests: list override_bs: typing.Optional[int] = None )

Generates responses using a greedy decoding strategy until certain ending conditions are met.


< >

( requests: list override_bs: typing.Optional[int] = None )

Tokenize the context and continuation and compute the log likelihood of those tokenized sequences.


< >

( requests: list override_bs: typing.Optional[int] = None )

This function is used to compute the log likelihood of the context for perplexity metrics.


< >

( requests: list override_bs: typing.Optional[int] = None )

Tokenize the context and continuation and compute the log likelihood of those tokenized sequences.


< >

( context continuation pairwise: bool = False ) Tuple[TokenSequence, TokenSequence]


  • context (str) — The context string to be encoded.
  • continuation (str) — The continuation string to be encoded.
  • pairwise (bool) — If True, encode context and continuation separately. If False, encode them together and then split.


Tuple[TokenSequence, TokenSequence]

A tuple containing the encoded context and continuation.

Encodes a context, continuation pair by taking care of the spaces in between.

Pairwise encoding has two advantages: (1) it aligns better with how LLMs predict tokens, and (2) it works when len(tok(context,cont)) != len(tok(context)) + len(tok(continuation)). This can happen, for example, in Chinese when no space is used between context and continuation.
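The difference between the two modes can be sketched with a toy tokenizer. Everything below is an assumption made for illustration: `VOCAB`, `toy_encode`, and `encode_pair` are hypothetical, and the joint branch follows the common approach of encoding the whole string and slicing off the context tokens, which may not match lighteval's code exactly.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary; "你好" is a single merged token.
VOCAB = ["你好", "你", "好", "吗", " ", "a", "b"]


def toy_encode(text: str) -> List[str]:
    """Greedy longest-match tokenizer over VOCAB (stand-in for a real tokenizer)."""
    tokens, i = [], 0
    while i < len(text):
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def encode_pair(
    context: str, continuation: str, pairwise: bool = False
) -> Tuple[List[str], List[str]]:
    # Move trailing spaces of the context to the start of the continuation.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]
    if pairwise:
        # Encode each side on its own.
        return toy_encode(context), toy_encode(continuation)
    # Encode together, then split at the length of the encoded context.
    whole = toy_encode(context + continuation)
    context_enc = toy_encode(context)
    return context_enc, whole[len(context_enc):]


print(encode_pair("你", "好吗", pairwise=True))
print(encode_pair("你", "好吗", pairwise=False))
```

In the joint case the tokenizer merges "你好" across the boundary, so slicing by the context length drops part of the continuation. This is the length mismatch the text describes.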

Accelerate and Transformers Models


class lighteval.models.transformers.transformers_model.TransformersModelConfig

< >

( pretrained: str accelerator: Accelerator = None tokenizer: typing.Optional[str] = None multichoice_continuations_start_space: typing.Optional[bool] = None pairwise_tokenization: bool = False subfolder: typing.Optional[str] = None revision: str = 'main' batch_size: int = -1 max_gen_toks: typing.Optional[int] = 256 max_length: typing.Optional[int] = None add_special_tokens: bool = True model_parallel: typing.Optional[bool] = None dtype: typing.Union[str, torch.dtype, NoneType] = None device: typing.Union[int, str] = 'cuda' quantization_config: typing.Optional[transformers.utils.quantization_config.BitsAndBytesConfig] = None trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False generation_parameters: GenerationParameters = None generation_config: GenerationConfig = None )


  • pretrained (str) — HuggingFace Hub model ID name or the path to a pre-trained model to load. This is effectively the pretrained_model_name_or_path argument of from_pretrained in the HuggingFace transformers API.
  • accelerator (Accelerator) — accelerator to use for model training.
  • tokenizer (Optional[str]) — HuggingFace Hub tokenizer ID that will be used for tokenization.
  • multichoice_continuations_start_space (Optional[bool]) — Whether to add a space at the start of each continuation in multichoice generation. For example, context: “What is the capital of France?” and choices: “Paris”, “London”. Will be tokenized as: “What is the capital of France? Paris” and “What is the capital of France? London”. True adds a space, False strips a space, None does nothing
  • pairwise_tokenization (bool) — Whether to tokenize the context and continuation separately or together.
  • subfolder (Optional[str]) — The subfolder within the model repository.
  • revision (str) — The revision of the model.
  • batch_size (int) — The batch size for model training.
  • max_gen_toks (Optional[int]) — The maximum number of tokens to generate.
  • max_length (Optional[int]) — The maximum length of the generated output.
  • add_special_tokens (bool, optional, defaults to True) — Whether to add special tokens to the input sequences. If None, the default value will be set to True for seq2seq models (e.g. T5) and False for causal models.
  • model_parallel (bool, optional, defaults to None) — Whether to force the use (True) or non-use (False) of the accelerate library to load a large model across multiple devices. The default, None, compares the number of processes with the number of GPUs: if the number of processes is smaller, model parallelism is used; otherwise it is not.
  • dtype (Union[str, torch.dtype], optional, defaults to None) — Converts the model weights to dtype, if specified. Strings get converted to torch.dtype objects (e.g. float16 -> torch.float16). Use dtype="auto" to derive the type from the model’s weights.
  • device (Union[int, str]) — device to use for model training.
  • quantization_config (Optional[BitsAndBytesConfig]) — quantization configuration for the model, manually provided to load a normally floating point model at a quantized precision. Needed for 4-bit and 8-bit precision.
  • trust_remote_code (bool) — Whether to trust remote code during model loading.
  • generation_parameters (GenerationParameters) — Range of parameters which will affect the generation.
  • generation_config (GenerationConfig) — GenerationConfig object (only passed during manual creation)
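The three settings of multichoice_continuations_start_space described above can be sketched as a small helper. The function name `adjust_continuation` is hypothetical, and stripping all leading spaces in the False case (rather than exactly one) is an assumption.

```python
from typing import Optional


def adjust_continuation(continuation: str, start_space: Optional[bool]) -> str:
    """Apply the start-space policy to one multichoice continuation."""
    if start_space is True and not continuation.startswith(" "):
        # True: make sure the continuation starts with a space.
        return " " + continuation
    if start_space is False and continuation.startswith(" "):
        # False: remove leading spaces (assumption: all of them).
        return continuation.lstrip(" ")
    # None, or nothing to change: leave the continuation untouched.
    return continuation


context = "What is the capital of France?"
for choice in ["Paris", "London"]:
    print(context + adjust_continuation(choice, True))
```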

Base configuration class for models.

Methods:

  • post_init(): Performs post-initialization checks on the configuration.
  • _init_configs(model_name, env_config): Initializes the model configuration.
  • init_configs(env_config): Initializes the model configuration using the environment configuration.
  • get_model_sha(): Retrieves the SHA of the model.

class lighteval.models.transformers.transformers_model.TransformersModel

< >

( env_config: EnvConfig config: TransformersModelConfig )


< >

( requests: list override_bs: typing.Optional[int] = None ) list[GenerativeResponse]


  • requests (list[Request]) — list of requests containing the context and ending conditions.
  • override_bs (int, optional) — Override the batch size for generation. Defaults to None.



list of generated responses.

Generates responses using a greedy decoding strategy until certain ending conditions are met.


< >

( model_parallel: bool | None = None )

Compute all the parameters related to model_parallel
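The decision rule given for the model_parallel configuration parameter (an explicit True or False is respected, and None falls back to comparing process and GPU counts) can be sketched as follows. `resolve_model_parallel` and its arguments are hypothetical names, and the real method likely computes further values such as device maps.

```python
from typing import Optional


def resolve_model_parallel(
    model_parallel: Optional[bool], num_processes: int, num_gpus: int
) -> bool:
    """Decide whether to shard the model across devices."""
    if model_parallel is not None:
        # An explicit True/False from the configuration always wins.
        return model_parallel
    # Default: fewer processes than GPUs means each process can span several GPUs.
    return num_processes < num_gpus


print(resolve_model_parallel(None, num_processes=1, num_gpus=4))
print(resolve_model_parallel(None, num_processes=4, num_gpus=4))
```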


< >

( requests: list override_bs: typing.Optional[int] = None ) list[Tuple[float, bool]]


  • requests (list[Tuple[str, dict]]) — description


list[Tuple[float, bool]]


Tokenize the context and continuation and compute the log likelihood of those tokenized sequences.
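The return type list[Tuple[float, bool]] is not described further here. A plausible reading, stated as an assumption, is one (log-likelihood, is-greedy) pair per request, where the boolean reports whether every continuation token was also the model's top choice. The sketch below computes such a pair from hypothetical per-step probability tables; `continuation_loglikelihood` is not a lighteval function.

```python
import math
from typing import Dict, List, Tuple


def continuation_loglikelihood(
    step_probs: List[Dict[str, float]], continuation: List[str]
) -> Tuple[float, bool]:
    """Sum the log-probabilities of the continuation tokens.

    step_probs[i] is the (toy) next-token distribution at continuation step i.
    """
    total = 0.0
    is_greedy = True
    for probs, token in zip(step_probs, continuation):
        total += math.log(probs[token])
        # Greedy decoding would have picked the argmax at this step.
        if max(probs, key=probs.get) != token:
            is_greedy = False
    return total, is_greedy


# Hypothetical distributions after the context "What is the capital of France?"
steps = [
    {"Paris": 0.75, "London": 0.25},
    {".": 0.5, "!": 0.25, "?": 0.25},
]
paris = continuation_loglikelihood(steps, ["Paris", "."])
london = continuation_loglikelihood(steps, ["London", "."])
print(paris, london)
```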


< >

( requests: list override_bs: typing.Optional[int] = None ) list[Tuple[float, bool]]


  • requests (list[Tuple[str, dict]]) — description


list[Tuple[float, bool]]


Tokenize the context and continuation and compute the log likelihood of those tokenized sequences.


< >

( output_tensor: Tensor drop_last_samples: bool = True num_samples: int = None ) torch.Tensor


  • output_tensor (torch.Tensor) — The output tensor to be padded.
  • drop_last_samples (bool, optional) — Whether to drop the last samples during gathering. Last samples are dropped when the number of samples is not divisible by the number of processes. Defaults to True.



The padded output tensor and the gathered length tensor.

Pads the output_tensor to the maximum length and gathers the lengths across processes.
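The pad-then-gather idea can be sketched without any distributed runtime. In this simplified simulation each "process" holds one variable-length sequence, plain lists replace tensors, and the cross-process communication is imitated by iterating over a list. `pad_and_gather_sim` and `pad_id` are hypothetical names; a real implementation would rely on a distributed library.

```python
from typing import List, Tuple


def pad_and_gather_sim(
    per_process: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Simulate padding to a shared maximum length, then gathering."""
    # Step 1: every process shares its own length (simulated exchange).
    lengths = [len(seq) for seq in per_process]
    max_len = max(lengths, default=0)
    # Step 2: each process pads its sequence on the right up to the shared maximum.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in per_process]
    # Step 3: rows now have equal size, so they can be stacked ("gathered").
    return padded, lengths


padded, lengths = pad_and_gather_sim([[1, 2, 3], [4], [5, 6]])
# The gathered lengths let the caller strip the padding again.
recovered = [row[:n] for row, n in zip(padded, lengths)]
print(padded, lengths, recovered)
```

Returning the lengths alongside the padded rows is what makes the padding reversible after the gather.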


< >

( batch: list padding_length: int max_context: typing.Optional[int] = None single_token: bool = False )

Tokenizes a batch of inputs and also returns the lengths, truncations, and padding. This step is done manually because log-probability inputs are tokenized together with their continuations, to handle extra spaces that tokenizers may add at the start; see tok_encode_pair.


class lighteval.models.transformers.adapter_model.AdapterModelConfig

< >

( pretrained: str accelerator: Accelerator = None tokenizer: typing.Optional[str] = None multichoice_continuations_start_space: typing.Optional[bool] = None pairwise_tokenization: bool = False subfolder: typing.Optional[str] = None revision: str = 'main' batch_size: int = -1 max_gen_toks: typing.Optional[int] = 256 max_length: typing.Optional[int] = None add_special_tokens: bool = True model_parallel: typing.Optional[bool] = None dtype: typing.Union[str, torch.dtype, NoneType] = None device: typing.Union[int, str] = 'cuda' quantization_config: typing.Optional[transformers.utils.quantization_config.BitsAndBytesConfig] = None trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False generation_parameters: GenerationParameters = None generation_config: GenerationConfig = None base_model: str = None )

class lighteval.models.transformers.adapter_model.AdapterModel

< >

( env_config: EnvConfig config: TransformersModelConfig )


class lighteval.models.transformers.delta_model.DeltaModelConfig

< >

( pretrained: str accelerator: Accelerator = None tokenizer: typing.Optional[str] = None multichoice_continuations_start_space: typing.Optional[bool] = None pairwise_tokenization: bool = False subfolder: typing.Optional[str] = None revision: str = 'main' batch_size: int = -1 max_gen_toks: typing.Optional[int] = 256 max_length: typing.Optional[int] = None add_special_tokens: bool = True model_parallel: typing.Optional[bool] = None dtype: typing.Union[str, torch.dtype, NoneType] = None device: typing.Union[int, str] = 'cuda' quantization_config: typing.Optional[transformers.utils.quantization_config.BitsAndBytesConfig] = None trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False generation_parameters: GenerationParameters = None generation_config: GenerationConfig = None base_model: str = None )

class lighteval.models.transformers.delta_model.DeltaModel

< >

( env_config: EnvConfig config: TransformersModelConfig )

Endpoints-based Models


class lighteval.models.endpoints.endpoint_model.InferenceEndpointModelConfig

< >

( endpoint_name: str = None model_name: str = None reuse_existing: bool = False accelerator: str = 'gpu' model_dtype: str = None vendor: str = 'aws' region: str = 'us-east-1' instance_size: str = None instance_type: str = None framework: str = 'pytorch' endpoint_type: str = 'protected' add_special_tokens: bool = True revision: str = 'main' namespace: str = None image_url: str = None env_vars: dict = None generation_parameters: GenerationParameters = None )


< >

( path: str ) InferenceEndpointModelConfig


  • path (str) — Path of the model configuration YAML file.



Configuration for inference endpoint model.

Load configuration for inference endpoint model from YAML file path.

class lighteval.models.endpoints.endpoint_model.ServerlessEndpointModelConfig

< >

( model_name: str add_special_tokens: bool = True generation_parameters: GenerationParameters = None )

class lighteval.models.endpoints.endpoint_model.InferenceEndpointModel

< >

( config: typing.Union[lighteval.models.endpoints.endpoint_model.InferenceEndpointModelConfig, lighteval.models.endpoints.endpoint_model.ServerlessEndpointModelConfig] env_config: EnvConfig )

InferenceEndpointModels can be used either with the free inference client or with inference endpoints, which use text-generation-inference to deploy your model for the duration of the evaluation.

TGI ModelClient

class lighteval.models.endpoints.tgi_model.TGIModelConfig

< >

( inference_server_address: str inference_server_auth: str model_id: str generation_parameters: GenerationParameters = None )


< >

( path: str ) TGIModelConfig


  • path (str) — Path of the model configuration YAML file.



Configuration for TGI endpoint model.

Load configuration for TGI endpoint model from YAML file path.

class lighteval.models.endpoints.tgi_model.ModelClient

< >

( config: TGIModelConfig )

Open AI Models

class lighteval.models.endpoints.openai_model.OpenAIClient

< >

( config: OpenAIModelConfig env_config )


< >

( requests: list override_bs: typing.Optional[int] = None ) list[GenerativeResponse]


  • requests (list[Request]) — list of requests containing the context and ending conditions.
  • override_bs (int, optional) — Override the batch size for generation. Defaults to None.



list of generated responses.

Generates responses using a greedy decoding strategy until certain ending conditions are met.

Nanotron Model


class lighteval.models.nanotron.nanotron_model.NanotronLightevalModel

< >

( checkpoint_path: str nanotron_config: FullNanotronConfig parallel_context: ParallelContext max_gen_toks: typing.Optional[int] = 256 max_length: typing.Optional[int] = None add_special_tokens: typing.Optional[bool] = True dtype: typing.Union[str, torch.dtype, NoneType] = None trust_remote_code: bool = False debug_one_layer_model: bool = False model_class: typing.Optional[typing.Type] = None env_config: EnvConfig = None )


< >

( output_tensor: Tensor process_group: dist.ProcessGroup = None )

Gather together tensors of (possibly) various size spread on separate GPUs (first exchange the lengths and then pad and gather)


< >

( requests: typing.List[lighteval.tasks.requests.GreedyUntilRequest] disable_tqdm: bool = False override_bs: int = -1 num_dataset_splits: int = 1 )

Greedy generation until a stop token is generated.


< >

( ending_condition: tuple | dict | list | str )

Ending conditions are submitted in several possible formats. By default in lighteval we pass them as tuples (stop sequence, max number of items). In the harness they sometimes are passed as dicts {“until”: .., “max_length”: …} or as only ending conditions, either lists or strings. Here, we convert all these formats to a tuple containing a list of ending conditions, and a float for the max length allowed.
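The conversion described above can be sketched as a single normalizing function. `normalize_ending_condition` is a hypothetical name, and falling back to an infinite maximum length when none is supplied is an assumption, since the text does not state the default.

```python
from typing import List, Tuple, Union

EndingCondition = Union[tuple, dict, list, str]


def normalize_ending_condition(
    ending_condition: EndingCondition,
    default_max_length: float = float("inf"),  # assumption: no limit by default
) -> Tuple[List[str], float]:
    """Return (list of stop sequences, max length) for any supported format."""
    max_length = None
    if isinstance(ending_condition, tuple):
        # lighteval style: (stop sequence(s), max number of items)
        until, max_length = ending_condition
    elif isinstance(ending_condition, dict):
        # harness style: {"until": ..., "max_length": ...}
        until = ending_condition.get("until", [])
        max_length = ending_condition.get("max_length")
    else:
        # Only ending conditions were given, as a list or a single string.
        until = ending_condition
    if isinstance(until, str):
        until = [until]
    if max_length is None:
        max_length = default_max_length
    return list(until), float(max_length)


print(normalize_ending_condition(("\n", 32)))
print(normalize_ending_condition({"until": ["\n", "."], "max_length": 10}))
print(normalize_ending_condition("stop"))
```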


< >

( requests: typing.List[typing.Tuple[str, dict]] override_bs = 0 ) List[Tuple[float, bool]]


  • requests (List[Tuple[str, dict]]) — description


List[Tuple[float, bool]]


Tokenize the context and continuation and compute the log likelihood of those tokenized sequences.


< >

( output_tensor: Tensor )

Gather together tensors of (possibly) various size spread on separate GPUs (first exchange the lengths and then pad and gather)


< >

( batch: typing.List[str] padding_length: int max_context: typing.Optional[int] = None full_attention_masks: bool = False pad_on_left: bool = False )

Tokenizes a batch of inputs and also returns the lengths, truncations, and padding.

Inputs are truncated to keep at most max_context tokens and padded to padding_length tokens.
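A minimal sketch of the truncate-then-pad step, operating on lists of token ids that are already tokenized. The helper `truncate_and_pad` and the `pad_id` argument are hypothetical, and keeping the last max_context tokens (dropping the oldest ones) is an assumption, since the text does not say which side is truncated.

```python
from typing import Dict, List, Optional


def truncate_and_pad(
    batch: List[List[int]],
    padding_length: int,
    max_context: Optional[int] = None,
    pad_on_left: bool = False,
    pad_id: int = 0,
) -> Dict[str, list]:
    """Truncate each row to max_context tokens, then pad it to padding_length."""
    input_ids, masks, truncated, padded = [], [], [], []
    for tokens in batch:
        original_len = len(tokens)
        if max_context is not None and original_len > max_context:
            # Assumption: keep the most recent tokens, drop the oldest.
            tokens = tokens[original_len - max_context:]
        n_pad = max(padding_length - len(tokens), 0)
        pad, ones = [pad_id] * n_pad, [1] * len(tokens)
        if pad_on_left:
            input_ids.append(pad + tokens)
            masks.append([0] * n_pad + ones)
        else:
            input_ids.append(tokens + pad)
            masks.append(ones + [0] * n_pad)
        truncated.append(original_len - len(tokens))
        padded.append(n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": masks,
        "truncated": truncated,
        "padded": padded,
    }


right = truncate_and_pad([[1, 2, 3, 4, 5], [6, 7]], padding_length=4, max_context=4)
left = truncate_and_pad([[6, 7]], padding_length=4, pad_on_left=True)
print(right)
print(left)
```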

VLLM Model


class lighteval.models.vllm.vllm_model.VLLMModelConfig

< >

( pretrained: str gpu_memory_utilisation: float = 0.9 revision: str = 'main' dtype: str | None = None tensor_parallel_size: int = 1 pipeline_parallel_size: int = 1 data_parallel_size: int = 1 max_model_length: int | None = None swap_space: int = 4 seed: int = 1234 trust_remote_code: bool = False use_chat_template: bool = False add_special_tokens: bool = True multichoice_continuations_start_space: bool = True pairwise_tokenization: bool = False generation_parameters: GenerationParameters = None subfolder: typing.Optional[str] = None )

class lighteval.models.vllm.vllm_model.VLLMModel

< >

( config: VLLMModelConfig env_config: EnvConfig )


< >

( requests: list override_bs: typing.Optional[int] = None ) list[GenerateReturn]


  • requests (list[Request]) — list of requests containing the context and ending conditions.
  • override_bs (int, optional) — Override the batch size for generation. Defaults to None.



list of generated responses.

Generates responses using a greedy decoding strategy until certain ending conditions are met.
