In [None]:
%pip install lighteval==0.6.2
%pip install great-tables
%pip install polars







# Comparing different prompt formulations for the same task
In this notebook, we use a very small model on a simple task. We focus on comparing several formulations of the input prompt to see how they affect the results we obtain.

In [2]:
import string
import os
from datetime import timedelta
from types import ModuleType
from ast import literal_eval

In [3]:
# For data visualization
from great_tables import GT
import polars as pl
import polars.selectors as cs
from datasets import load_dataset

In [None]:
# For evaluation
import lighteval
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.model_config import BaseModelConfig, VLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig, Doc
from lighteval.utils.utils import as_list, EnvConfig
from lighteval.utils.imports import is_accelerate_available, is_tgi_available

In [5]:
# Set these for your use case
cache_dir = "tmp"
max_samples = 10

## Comparing several formulations for the same task

Let's compare:
- a multiple-choice question answering (MCQA) evaluation
- a generative evaluation

using variations of the same prompts for both.

We use AI2's ARC dataset for our experiments, specifically its "challenge" subset. You can browse the dataset here: https://huggingface.co/datasets/allenai/ai2_arc?row=0.

### Let's define the core of our task

In [6]:
class ArcExplorationTask(LightevalTaskConfig):
    def __init__(self, name, prompt_function, metric):
        super().__init__(
            name=name,
            prompt_function=prompt_function,
            metric=as_list(metric),
            # This is a custom task
            suite=["custom"],
            # This defines our dataset and its subsets
            hf_repo="allenai/ai2_arc",
            hf_subset="ARC-Challenge",
            hf_avail_splits=["train", "validation", "test"],
            evaluation_splits=["test"],
            # Few-shot example parameters
            few_shots_split="validation",
            few_shots_select="random",
            # Other parameters
            stop_sequence=[".", "\n"],
            generation_size=100,
        )

### Let's define our metrics

For the multiple-choice evaluation, we want accuracy computed on length-normalized log-likelihoods (= is the most probable choice the correct one?).

For the generative evaluation, we want an exact match (= does the generated text match the reference?). The `quasi_exact_match` metric used below is a more lenient variant that normalizes both texts before comparing them.

In [7]:
metric_mcqa = Metrics.loglikelihood_acc_norm
metric_gen = Metrics.quasi_exact_match
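
To make the generative metric more concrete, here is a minimal sketch of a quasi exact match. It is not lighteval's implementation: the `normalize` helper and its rules (lowercasing, dropping punctuation and articles, collapsing whitespace) are simplifying assumptions made for illustration. It shows why normalization matters here: the task stops generation at `"."`, so a correct generation will typically lack the final period that the reference carries.

```python
import re
import string


def normalize(text: str) -> str:
    """Simplified normalizer (an assumption, not lighteval's exact rules)."""
    text = text.lower()
    # Drop punctuation, then English articles, then collapse whitespace
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, gold: str) -> int:
    """Return 1 if prediction and gold are identical after normalization."""
    return int(normalize(prediction) == normalize(gold))


gold = "The air stays cleaner."
# Same answer without the final period: counts as a match
print(quasi_exact_match("The air stays cleaner", gold))
# A paraphrase is still a miss, however reasonable it sounds
print(quasi_exact_match("It reduces the amount of pollution in the air", gold))
```

The second call illustrates how strict this family of metrics is: a sensible free-form answer scores 0 unless it reproduces the reference wording.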

### Let's define functions for the different prompts

A row of the ARC dataset is a dictionary of the following form:
```python
{
    "question": "the question, including its instruction",
    "choices": {
        "text": ["choice 1", "choice 2", ...],
        "label": ["A", "B", ...]
    },
    "answerKey": "the gold label"
}
```

Our function applies a template that maps all of this information to the expected keys (`query`, `choices`, `gold_index`, and an `instruction` if needed).
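
As a quick illustration of how `gold_index` is derived from such a row, the sketch below looks up the position of `answerKey` in the list of labels. The two rows are made up for the example, and the `gold_index` helper is hypothetical (the prompt functions below inline the same expression). Looking the label up, rather than converting a letter to a number, keeps working if a row uses digits as labels.

```python
def gold_index(line: dict) -> int:
    """Position of the gold label within the row's own list of labels."""
    return line["choices"]["label"].index(line["answerKey"])


# Made-up rows, for illustration only
letter_row = {
    "question": "Which object is attracted to a magnet?",
    "choices": {"text": ["a nail", "a leaf", "a cup"], "label": ["A", "B", "C"]},
    "answerKey": "A",
}
digit_row = {
    "question": "Which object is attracted to a magnet?",
    "choices": {"text": ["a leaf", "a cup", "a nail"], "label": ["1", "2", "3"]},
    "answerKey": "3",
}

print(gold_index(letter_row))  # 0
print(gold_index(digit_row))   # 2
```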

First case: we define the most basic template possible.  
The prompt looks like this:
```
<the question>
```
and we look at `<the choices>` directly.

In [8]:
def arc_base(line, task_name: str = None):
    query = f"{line['question']}"
    choices = line["choices"]["text"]

    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

Second case: we now add a little context. The prompt then looks like this:
```
Question: <the question>
Answer: 
```
and we again look at `<the choices>` directly.

In [9]:
def arc_context(line, task_name: str = None):
    query = f"Question: {line['question']}"
    query += "\nAnswer: "
    choices = line["choices"]["text"]
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

Third case: we now list the choices in the prompt. The prompt then looks like this:
```
Question: <the question>
A. <choice A>
B. <choice B>
...
Answer: 
```
and we once again look at `<the choices>` directly.

In [10]:
letters = list(string.ascii_uppercase)

In [11]:
def arc_context_choices(line, task_name: str = None):
    query = f"Question: {line['question']}\n"
    query += "\n".join([f"{letters[ix]}. {choice}" for ix, choice in enumerate(line["choices"]["text"])])
    query += "\nAnswer: "
    choices = line["choices"]["text"]
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )

Last case: we do the same thing, but we look at `<the choice labels>` instead.

In [12]:
def arc_context_labels(line, task_name: str = None):
    query = f"Question: {line['question']}\n"
    query += "\n".join([f"{letters[ix]}. {choice}" for ix, choice in enumerate(line["choices"]["text"])])
    query += "\nAnswer: "
    choices = [letters[ix] for ix in range(len(line["choices"]["text"]))]
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=line["choices"]["label"].index(line["answerKey"]),
    )
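
To see concretely what the model receives in each case, the sketch below renders the query string for one ARC question. The `render` helper is hypothetical and only mirrors the string-building logic of the functions above; it does not create a `Doc` and needs no lighteval import. The last two functions build the same query and differ only in the `choices` they pass on (choice texts versus letters).

```python
import string

letters = list(string.ascii_uppercase)

line = {
    "question": "Which statement correctly describes a physical characteristic of the Moon?",
    "choices": {
        "text": [
            "The Moon is made of hot gases.",
            "The Moon is covered with many craters.",
            "The Moon has many bodies of liquid water.",
            "The Moon has the ability to give off its own light.",
        ],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}


def render(line: dict, with_context: bool, with_choices: bool) -> str:
    """Hypothetical helper: builds only the query string of a prompt."""
    query = f"Question: {line['question']}" if with_context else line["question"]
    if with_choices:
        listed = "\n".join(
            f"{letters[ix]}. {choice}" for ix, choice in enumerate(line["choices"]["text"])
        )
        query += "\n" + listed
    if with_context:
        query += "\nAnswer: "
    return query


prompts = {
    "arc_base": render(line, with_context=False, with_choices=False),
    "arc_context": render(line, with_context=True, with_choices=False),
    "arc_context_choices / arc_context_labels": render(line, with_context=True, with_choices=True),
}
for name, query in prompts.items():
    print(f"--- {name} ---\n{query}\n")
```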



### Putting it all together

In [13]:
task_module = ModuleType("task_module")
task_module.__file__ = "."
task_module.TASKS_TABLE = [
    ArcExplorationTask(
        name="arc_base", 
        prompt_function=arc_base, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context", 
        prompt_function=arc_context, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context_choice", 
        prompt_function=arc_context_choices, 
        metric=[metric_mcqa, metric_gen]
    ),
    ArcExplorationTask(
        name="arc_context_labels", 
        prompt_function=arc_context_labels, 
        metric=[metric_mcqa, metric_gen]
    )
]

task_names = ["arc_base", "arc_context", "arc_context_choice", "arc_context_labels"]

# Let's run our evaluation!

In [14]:
if is_accelerate_available():
    from accelerate import Accelerator, InitProcessGroupKwargs
    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=3000))])
else:
    accelerator = None

In [15]:
# Parameters for saving the evaluation results
evaluation_tracker = EvaluationTracker(
    output_dir=cache_dir,
    save_details=True,
    # These 2 options require being logged in with the huggingface_hub CLI
    # push_to_hub=True,
    # hub_results_org="your username", 
)


# Parameters for the whole pipeline
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    env_config=EnvConfig(cache_dir=cache_dir),
    override_batch_size=1,
    max_samples=max_samples,  # We only do a small run here; remove this to get real results
    custom_tasks_directory=task_module,  # We can pass either the path to a module or the module itself
)

# Model: we load it here with Transformers through Accelerate (BaseModelConfig), but we could use vLLM, TGI, etc.
model_config = BaseModelConfig(
    pretrained="HuggingFaceTB/SmolLM-1.7B",
    dtype="bfloat16",
    use_chat_template=False,
)

tasks = ",".join([f"custom|{task}|3|0" for task in task_names])
# We are ready!
pipeline = Pipeline(
    tasks=tasks,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
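
The `tasks` string passed to the pipeline packs several settings into each entry. The sketch below rebuilds it and splits it back apart. The field names used here (`suite`, `task`, `num_fewshot`, `truncate_fewshot`) are our reading of the `suite|task|few-shot count|0 or 1` convention in this version of lighteval and should be treated as an assumption; check the lighteval documentation for your version.

```python
task_names = ["arc_base", "arc_context", "arc_context_choice", "arc_context_labels"]

# Same construction as in the cell above: one "custom|<task>|3|0" entry per task
tasks = ",".join(f"custom|{task}|3|0" for task in task_names)
print(tasks)

# Split each entry back into its four fields (field names are assumptions)
for spec in tasks.split(","):
    suite, task, num_fewshot, truncate_fewshot = spec.split("|")
    print(f"suite={suite} task={task} few-shot examples={num_fewshot} flag={truncate_fewshot}")
```

Under this reading, every variant is evaluated with 3 few-shot examples drawn from the validation split, so the comparison between prompt formats is made at equal few-shot budget.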




In [16]:
#pipeline.show_results()

In [17]:
results = pipeline.get_results()["results"]
results_processed = []
for eval_name, eval_results in results.items():
    results_processed.append({
        "Prompt function": (eval_name.split(":")[1] if ":" in eval_name else eval_name).replace("_", " "), 
        "Quasi Exact Match": eval_results["qem"], 
        "Normalized Accuracy": eval_results["acc_norm"]
    })
results_data = pl.from_dicts(results_processed, strict=False)
(GT(results_data.head(max_samples*4))
    .tab_header("Results")
    .tab_spanner(label="Evaluations", columns=["Quasi Exact Match", "Normalized Accuracy"])
)

Results (evaluations per prompt function)

| Prompt function | Quasi Exact Match | Normalized Accuracy |
|---|---|---|
| arc base | 0.0 | 0.3 |
| arc context | 0.1 | 0.5 |
| arc context choice | 0.4 | 0.3 |
| arc context labels | 0.0 | 0.0 |
| all | 0.125 | 0.275 |
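
As a sanity check on how to read the `all` row, the sketch below recomputes it from the four per-task scores. The numbers are copied from the results above; that `all` is the unweighted mean over tasks is our inference from this arithmetic, not something stated by the library output.

```python
import math

# Per-task scores copied from the results above
scores = {
    "arc base": {"qem": 0.0, "acc_norm": 0.3},
    "arc context": {"qem": 0.1, "acc_norm": 0.5},
    "arc context choice": {"qem": 0.4, "acc_norm": 0.3},
    "arc context labels": {"qem": 0.0, "acc_norm": 0.0},
}


def mean_over_tasks(metric: str) -> float:
    """Unweighted mean of one metric across the four prompt variants."""
    return sum(task[metric] for task in scores.values()) / len(scores)


print(round(mean_over_tasks("qem"), 3))       # 0.125
print(round(mean_over_tasks("acc_norm"), 3))  # 0.275
```

Keep in mind that with `max_samples = 10`, each per-task score can only move in steps of 0.1, so small differences between rows are within the noise of such a short run.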


# Reading the results

In [18]:
path = f"{cache_dir}/details/HuggingFaceTB/SmolLM-1.7B/"

results = {}

for root, _, files in os.walk(path):
    for file in files:
        eval_name = file.split("|")[1]
        results[eval_name] = load_dataset("parquet", data_files=f"{root}/{file}")["train"]



In [19]:
# Create a new list of rows to store the transformed data
transformed_data = []
keys = ["example", "gold", "predictions", "metrics"]

# Iterate over each dataset and its samples
for ix in range(max_samples * 2):
    for key in keys:
        cur_sample = {"Sample": f"Sample {ix}", "Type": key.capitalize()}
        for eval_name, df in sorted(results.items()):
            try:
                cur_result = literal_eval(results[eval_name][ix][key])
                if isinstance(cur_result, list):
                    if len(cur_result) == 1:
                        cur_sample[eval_name] = cur_result[0]
                    else:
                        cur_sample[eval_name] = "\n".join([str(i) for i in cur_result])
                elif isinstance(cur_result, dict):
                    for metric, value in cur_result.items():
                        cur_sample[eval_name] = str(value)
                        cur_sample["Type"] = f"{key.capitalize()}: {metric}"
            except (SyntaxError, ValueError):
                cur_sample[eval_name] = results[eval_name][ix][key]
                
        for k, v in cur_sample.items():
            # We replace Python's \n with HTML <br /> line breaks so the table renders correctly
            if isinstance(v, str):
                cur_sample[k] = v.replace("\n", "<br />")
        transformed_data.append(cur_sample)
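
The loop above relies on `literal_eval` because the detail columns are read back as strings. The sketch below isolates that parsing step on a few hand-written strings, which are assumptions about what such cells can look like, not values read from the actual files. The `parse_cell` helper is hypothetical and covers only the list and fallback branches of the loop.

```python
from ast import literal_eval


def parse_cell(raw: str):
    """Turn a stringified Python value into something displayable."""
    try:
        value = literal_eval(raw)
    except (SyntaxError, ValueError):
        # Plain text (for instance a prompt) is not a Python literal: keep it as is
        return raw
    if isinstance(value, list):
        # A single-item list is unwrapped; longer lists get one item per line
        return value[0] if len(value) == 1 else "\n".join(str(item) for item in value)
    return value


print(parse_cell("['The air stays cleaner.']"))
print(parse_cell("[(-1.5, False), (-0.5, True)]"))
print(parse_cell("Cities control the amount of pollution"))
print(parse_cell("{'qem': 0}"))
```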

### Let's look at the generation results

In [20]:
pl_data = pl.from_dicts(transformed_data, strict=False, infer_schema_length=200)

In [21]:
(GT(pl_data.head(max_samples*4))
    .tab_header("Comparing our different prompts' outputs")
    .tab_spanner(label="Samples", columns=cs.starts_with("arc"))
    .tab_stub(rowname_col="Type", groupname_col="Sample")
    .fmt_markdown(columns=cs.starts_with("arc"))
)

Comparing our different prompts' outputs

| Samples | arc_base | arc_context | arc_context_choice | arc_context_labels |
|---|---|---|---|---|
| Sample 0 | | | | |
| Example | Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? Answer: | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? A. The air stays cleaner. B. Cars can travel at faster speeds. C. The skills of the drivers improve. D. It becomes safer to drive on the roads. Answer: | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? A. The air stays cleaner. B. Cars can travel at faster speeds. C. The skills of the drivers improve. D. It becomes safer to drive on the roads. Answer: |
| Gold | The air stays cleaner. | The air stays cleaner. | The air stays cleaner. | A |
| Predictions | It reduces the amount of pollution in the air | 1 | It becomes safer to drive on the roads | D |
| Metrics: qem | 0 | 0 | 0 | 0 |
| Sample 1 | | | | |
| Example | Which statement correctly describes a physical characteristic of the Moon? | Question: Which statement correctly describes a physical characteristic of the Moon? Answer: | Question: Which statement correctly describes a physical characteristic of the Moon? A. The Moon is made of hot gases. B. The Moon is covered with many craters. C. The Moon has many bodies of liquid water. D. The Moon has the ability to give off its own light. Answer: | Question: Which statement correctly describes a physical characteristic of the Moon? A. The Moon is made of hot gases. B. The Moon is covered with many craters. C. The Moon has many bodies of liquid water. D. The Moon has the ability to give off its own light. Answer: |
| Gold | The Moon is covered with many craters. | The Moon is covered with many craters. | The Moon is covered with many craters. | B |
| Predictions | a | 1 | The Moon has the ability to give off its own light | D |
| Metrics: qem | 0 | 0 | 0 | 0 |


We can observe that:
- the base format performs poorly in generative mode: the model never predicts the correct ending (quasi exact match of 0.0)
- the base format plus question/answer tags seems to push the model to produce numbers, which is inappropriate for a number of questions
- however, in the last two cases, listing the choices in the question helps the model predict one of the relevant choices!

Interestingly, in the last case, where the choices are present but the model must predict the label, it does produce a label but not the correct one in the samples evaluated here (quasi exact match of 0.0).

In other cases, such as samples 3 and 4, the model does not select the same choice when predicting the label as when predicting the choice text (with the choices listed in the prompt).

### Let's look at the output log-probabilities

In [22]:
(GT(pl_data.tail(max_samples * 4))
    .tab_header("Comparing our different prompts' outputs")
    .tab_spanner(label="Samples", columns=cs.starts_with("arc"))
    .tab_stub(rowname_col="Type", groupname_col="Sample")
    .fmt_markdown(columns=cs.starts_with("arc"))
)

Comparing our different prompts' outputs

| Samples | arc_base | arc_context | arc_context_choice | arc_context_labels |
|---|---|---|---|---|
| Sample 10 | | | | |
| Example | Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? Answer: | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? A. The air stays cleaner. B. Cars can travel at faster speeds. C. The skills of the drivers improve. D. It becomes safer to drive on the roads. Answer: | Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people? A. The air stays cleaner. B. Cars can travel at faster speeds. C. The skills of the drivers improve. D. It becomes safer to drive on the roads. Answer: |
| Gold | The air stays cleaner. | The air stays cleaner. | The air stays cleaner. | A |
| Predictions | (-10.595719337463379, False) (-17.943925857543945, False) (-25.558387756347656, False) (-19.944787979125977, False) | (-9.650793075561523, False) (-18.44221305847168, False) (-28.147789001464844, False) (-21.082042694091797, False) | (-1.7918881177902222, False) (-3.5758469104766846, False) (-2.042778253555298, False) (-0.5759693384170532, True) | (-1.445155382156372, False) (-1.695155382156372, False) (-1.445155382156372, False) (-1.070155382156372, True) |
| Metrics: acc_norm | 1 | 1 | 0 | 0 |
| Sample 11 | | | | |
| Example | Which statement correctly describes a physical characteristic of the Moon? | Question: Which statement correctly describes a physical characteristic of the Moon? Answer: | Question: Which statement correctly describes a physical characteristic of the Moon? A. The Moon is made of hot gases. B. The Moon is covered with many craters. C. The Moon has many bodies of liquid water. D. The Moon has the ability to give off its own light. Answer: | Question: Which statement correctly describes a physical characteristic of the Moon? A. The Moon is made of hot gases. B. The Moon is covered with many craters. C. The Moon has many bodies of liquid water. D. The Moon has the ability to give off its own light. Answer: |
| Gold | The Moon is covered with many craters. | The Moon is covered with many craters. | The Moon is covered with many craters. | B |
| Predictions | (-14.036760330200195, False) (-14.156133651733398, False) (-23.221176147460938, False) (-22.99703598022461, False) | (-12.965611457824707, False) (-11.864521026611328, False) (-20.491010665893555, False) (-21.728534698486328, False) | (-4.868813514709473, False) (-3.7199177742004395, False) (-3.739739418029785, False) (-0.2140919268131256, True) | (-1.5505614280700684, False) (-1.4255614280700684, False) (-1.3005614280700684, True) (-1.3005614280700684, False) |
| Metrics: acc_norm | 1 | 1 | 0 | 0 |


When we look at the log-likelihoods, what seems to work best for this model is, curiously, not listing the answer choices in the prompt (normalized accuracy of 0.3 and 0.5 for `arc base` and `arc context`, against 0.3 and 0.0 once the choices are listed). This is the opposite of what we saw for the generative evaluations.
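
To connect the raw log-probabilities to the `acc_norm` values shown for sample 10, here is a minimal sketch that recomputes the metric by hand. It assumes the normalization divides each choice's log-probability by the character length of the choice text, which is our understanding of the metric and not lighteval's actual code. The numbers are copied from the sample 10 row, and the boolean in each tuple is ignored.

```python
choices = [
    "The air stays cleaner.",
    "Cars can travel at faster speeds.",
    "The skills of the drivers improve.",
    "It becomes safer to drive on the roads.",
]
gold_index = 0  # "The air stays cleaner."

# Log-probabilities of each choice for sample 10, copied from the table above
logprobs = {
    "arc_base": [-10.595719337463379, -17.943925857543945, -25.558387756347656, -19.944787979125977],
    "arc_context_choice": [-1.7918881177902222, -3.5758469104766846, -2.042778253555298, -0.5759693384170532],
}


def acc_norm(choice_logprobs, choices, gold_index):
    """1 if the choice with the best per-character log-probability is the gold one."""
    normalized = [lp / len(text) for lp, text in zip(choice_logprobs, choices)]
    best = max(range(len(normalized)), key=normalized.__getitem__)
    return int(best == gold_index)


for name, values in logprobs.items():
    print(name, acc_norm(values, choices, gold_index))
```

With the choices listed in the prompt, the last option gets by far the highest log-probability (about -0.58), so the model is confidently wrong on this sample, whereas without the choices listed the gold answer wins.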