Spaces:
Runtime error
Runtime error
File size: 3,635 Bytes
359f755 77c0f20 359f755 77c0f20 359f755 4864926 359f755 4864926 1dd4b6a 4864926 1dd4b6a 4864926 359f755 77c0f20 359f755 4864926 1dd4b6a 77c0f20 359f755 1dd4b6a 4864926 1dd4b6a 4864926 1dd4b6a 4864926 1dd4b6a 77c0f20 359f755 77c0f20 359f755 77c0f20 359f755 77c0f20 4864926 359f755 4864926 359f755 77c0f20 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 |
from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
benchmark: str
metric: str
col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
# task_key in the json file, metric_key in the json file, name to display in the leaderboard
task0 = Task("perplexity", "perplexity", "Perplexity")
NUM_FEWSHOT = 0 # Not used for perplexity
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Model Tracing Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates specific language models based on their perplexity scores and
structural similarity to Llama-2-7B using model tracing analysis.
**Models Evaluated:**
- `lmsys/vicuna-7b-v1.5` - Vicuna 7B v1.5
- `ibm-granite/granite-7b-base` - IBM Granite 7B Base
- `EleutherAI/llemma_7b` - LLeMa 7B
**Metrics:**
- **Perplexity**: Lower perplexity scores indicate better performance - it means the model is better at predicting the next token in the text.
- **Match P-Value**: Lower p-values indicate the model preserves structural similarity to Llama-2-7B after fine-tuning (neuron organization is maintained).
"""
# Which evaluations are you running?
LLM_BENCHMARKS_TEXT = """
## How it works
The evaluation runs two types of analysis on the supported language models:
### Supported Models
- **Vicuna 7B v1.5** (`lmsys/vicuna-7b-v1.5`) - Chat-optimized LLaMA variant
- **IBM Granite 7B** (`ibm-granite/granite-7b-base`) - IBM's foundational language model
- **LLeMa 7B** (`EleutherAI/llemma_7b`) - EleutherAI's mathematical language model
### 1. Perplexity Evaluation
Perplexity tests using a fixed test passage about artificial intelligence.
Perplexity measures how well a model predicts text - lower scores mean better predictions.
### 2. Model Tracing Analysis
Compares each model's internal structure to Llama-2-7B using the "match" statistic:
- **Base Model**: Llama-2-7B (`meta-llama/Llama-2-7b-hf`)
- **Comparison Models**: The 3 supported models listed above
- **Method**: Neuron matching analysis across transformer layers
- **Alignment**: Models are aligned before comparison using the Hungarian algorithm
- **Output**: P-value indicating structural similarity (lower = more similar to Llama-2-7B)
The match statistic tests whether neurons in corresponding layers maintain similar functional roles
between the base model and the comparison models.
## Test Text
The evaluation uses the following passage:
```
Artificial intelligence has transformed the way we live and work, bringing both opportunities and challenges.
From autonomous vehicles to language models that can engage in human-like conversation, AI technologies are becoming increasingly
sophisticated. However, with this advancement comes the responsibility to ensure these systems are developed and deployed ethically,
with careful consideration for privacy, fairness, and transparency. The future of AI will likely depend on how well we balance innovation
with these important social considerations.
```
"""
EVALUATION_QUEUE_TEXT = """
## Testing Models
This leaderboard focuses on comparing specific models:
1. **Vicuna 7B v1.5** - Chat-optimized variant of LLaMA
2. **IBM Granite 7B Base** - IBM's foundational language model
3. **LLeMa 7B** - EleutherAI's mathematical language model
Use the "Test Model" tab to run perplexity evaluation on any of these models.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = ""
|