from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("perplexity", "perplexity", "Perplexity")


NUM_FEWSHOT = 0  # Not used for perplexity
# ---------------------------------------------------
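# For reference, each Task maps onto a model's results JSON roughly like this
# (illustrative shape only - the exact layout depends on the results files this Space reads):
#
#   {"results": {"perplexity": {"perplexity": 12.34}}}
#
# Task.benchmark selects the outer key, Task.metric the inner key, and
# Task.col_name is the column header shown in the leaderboard table.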
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Model Tracing Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates a fixed set of language models on their perplexity scores and their
structural similarity to Llama-2-7B, measured with model tracing analysis.

**Models Evaluated:**
- `lmsys/vicuna-7b-v1.5` - Vicuna 7B v1.5
- `ibm-granite/granite-7b-base` - IBM Granite 7B Base
- `EleutherAI/llemma_7b` - Llemma 7B

**Metrics:**
- **Perplexity**: Lower perplexity indicates better performance - the model is better at predicting the next token in the text.
- **Match P-Value**: Lower p-values indicate the model preserves structural similarity to Llama-2-7B after fine-tuning (neuron organization is maintained).
"""
# Which evaluations are you running?
LLM_BENCHMARKS_TEXT = """
## How it works

The evaluation runs two types of analysis on the supported language models.

### Supported Models
- **Vicuna 7B v1.5** (`lmsys/vicuna-7b-v1.5`) - Chat-optimized LLaMA variant
- **IBM Granite 7B** (`ibm-granite/granite-7b-base`) - IBM's foundational language model
- **Llemma 7B** (`EleutherAI/llemma_7b`) - EleutherAI's mathematical language model

### 1. Perplexity Evaluation
Perplexity is computed on a fixed test passage about artificial intelligence (shown under "Test Text" below).
It measures how well a model predicts text - lower scores mean better predictions.
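As a rough illustration, a perplexity score over the test passage can be computed along the following lines with Hugging Face `transformers` (the function name and loading details are illustrative assumptions, not necessarily this Space's exact evaluation code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_name: str, text: str) -> float:
    # Load the tokenizer and model (fp16 so a 7B model fits on a single GPU)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    # Tokenize the fixed test passage
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # The model returns the mean next-token cross-entropy as `loss`;
    # perplexity is the exponential of that loss.
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```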
### 2. Model Tracing Analysis
Compares each model's internal structure to Llama-2-7B using the "match" statistic:
- **Base Model**: Llama-2-7B (`meta-llama/Llama-2-7b-hf`)
- **Comparison Models**: The three supported models listed above
- **Method**: Neuron matching analysis across transformer layers
- **Alignment**: Models are aligned before comparison using the Hungarian algorithm
- **Output**: P-value indicating structural similarity (lower = more similar to Llama-2-7B)

The match statistic tests whether neurons in corresponding layers maintain similar functional roles
between the base model and the comparison models.
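A simplified sketch of the alignment step, assuming we compare the MLP up-projection weights of one layer from each model (the weight choice, similarity measure, and function name are illustrative assumptions; the actual tracing code computes its own match statistic and p-value):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_neurons(base_weights: np.ndarray, comp_weights: np.ndarray) -> np.ndarray:
    # Both arrays have shape (hidden_dim, d_model): one row per hidden neuron.
    # Cosine similarity between every base neuron and every comparison neuron.
    base = base_weights / np.linalg.norm(base_weights, axis=1, keepdims=True)
    comp = comp_weights / np.linalg.norm(comp_weights, axis=1, keepdims=True)
    similarity = base @ comp.T

    # Hungarian algorithm: one-to-one matching that maximizes total similarity
    row_ind, col_ind = linear_sum_assignment(-similarity)
    return col_ind  # base neuron i is paired with comparison neuron col_ind[i]
```

If fine-tuning preserved the base model's neuron organization, the matched similarities are far higher than under a random pairing, which is what the match p-value quantifies.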
## Test Text
The evaluation uses the following passage:
```
Artificial intelligence has transformed the way we live and work, bringing both opportunities and challenges.
From autonomous vehicles to language models that can engage in human-like conversation, AI technologies are becoming increasingly
sophisticated. However, with this advancement comes the responsibility to ensure these systems are developed and deployed ethically,
with careful consideration for privacy, fairness, and transparency. The future of AI will likely depend on how well we balance innovation
with these important social considerations.
```
"""
EVALUATION_QUEUE_TEXT = """
## Testing Models

This leaderboard focuses on comparing three specific models:

1. **Vicuna 7B v1.5** - Chat-optimized variant of LLaMA
2. **IBM Granite 7B Base** - IBM's foundational language model
3. **Llemma 7B** - EleutherAI's mathematical language model

Use the "Test Model" tab to run perplexity evaluation on any of these models.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = ""