---
title: AI Language Monitor
emoji: π
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 8000
license: cc-by-sa-4.0
short_description: Evaluating LLM performance across all human languages.
datasets:
- openlanguagedata/flores_plus
- google/fleurs
- mozilla-foundation/common_voice_1_0
- CohereForAI/Global-MMLU
models:
- meta-llama/Llama-3.3-70B-Instruct
- mistralai/Mistral-Small-24B-Instruct-2501
- deepseek-ai/DeepSeek-V3
- microsoft/phi-4
- openai/whisper-large-v3
- google/gemma-3-27b-it
tags:
- leaderboard
- submission:manual
- test:public
- judge:auto
- modality:text
- modality:artefacts
- eval:generation
- language:English
- language:German
---
<!--
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
-->
[AI Language Monitor on Hugging Face Spaces](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)
# AI Language Monitor π
_Tracking the language proficiency of AI models across every language_
## System Architecture
The AI Language Monitor evaluates language models across 100+ languages using a comprehensive pipeline that combines model discovery, automated evaluation, and real-time visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models<br/>Static Curated List"] --> D[load_models]
A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
A4["blocklist<br/>Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Model Validation & Cost Filtering
H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost β€ $20/1M tokens"]
H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
%% Language Data
J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
%% Task Registry with Unified Prompting
L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
M --> M1["translation_from/to<br/>BLEU + ChrF"]
M --> M2["classification<br/>Accuracy"]
M --> M3["mmlu<br/>Accuracy"]
M --> M4["arc<br/>Accuracy"]
M --> M5["truthfulqa<br/>Accuracy"]
M --> M6["mgsm<br/>Accuracy"]
%% On-the-fly Translation with Origin Tagging
subgraph OTF [On-the-fly Dataset Translation]
direction LR
DS_raw["Raw English Dataset<br/>(e.g., MMLU)"] --> Google_Translate["Google Translate API"]
Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., German MMLU)<br/>Origin: 'machine'"]
DS_native["Native Dataset<br/>(e.g., German MMLU)<br/>Origin: 'human'"]
end
%% Evaluation Pipeline
H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
K --> |"languages bcp_47"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model Γ Language Γ Task"]
O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
%% Task Execution with Origin Tracking
P --> Q1[translate_and_evaluate<br/>Origin: 'human']
P --> Q2[classify_and_evaluate<br/>Origin: 'human']
P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'/'machine']
P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'/'machine']
P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
%% API Calls with Error Handling
Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
Q2 --> |"complete() API<br/>Rate Limiting"| R
Q3 --> |"complete() API<br/>Rate Limiting"| R
Q4 --> |"complete() API<br/>Rate Limiting"| R
Q5 --> |"complete() API<br/>Rate Limiting"| R
Q6 --> |"complete() API<br/>Rate Limiting"| R
%% Results Processing with Origin Aggregation
R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
S --> |Save| T[results.json]
%% Backend & Frontend with Origin-Specific Metrics
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
U --> |make_country_table| W["Country Aggregation"]
U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js<br/>Country Visualization"]
Y --> Z2["ModelTable.js<br/>Model Rankings"]
Y --> Z3["LanguageTable.js<br/>Language Coverage"]
Y --> Z4["DatasetTable.js<br/>Task Performance"]
%% Data Sources with Origin Information
subgraph DS ["Data Sources"]
DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
DS2["MMLU/AfriMMLU<br/>Knowledge QA<br/>Origin: 'human'"]
DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
DS_translated --> Q3
DS_translated --> Q4
DS_translated --> Q5
DS_native --> Q3
DS_native --> Q4
DS_native --> Q5
%% Styling - Neutral colors that work in both dark and light modes
classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
classDef translation fill:#d4edda,stroke:#155724,color:#155724
class A1,A2,A3,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4 frontend
class Google_Translate,DS_translated,DS_native translation
```
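The model-discovery stage at the top of the diagram boils down to: merge a static curated list with two web-scraped popularity lists, drop blocklisted IDs, and dedupe. A minimal sketch using the function names from the flowchart; the stub bodies, signatures, and list contents are illustrative assumptions:

```python
# Minimal sketch of the model-discovery step; function names follow the
# flowchart above, bodies and signatures are illustrative assumptions.

important_models = [
    "meta-llama/llama-3.3-70b-instruct",
    "mistralai/mistral-small-24b-instruct-2501",
]  # static curated list (abbreviated)

blocklist = {"example/broken-model"}  # hypothetical exclusion

def get_historical_popular_models(top_n: int = 20) -> list[str]:
    """Stub for the web-scraping step (top 20 all-time popular models)."""
    return []

def get_current_popular_models(top_n: int = 10) -> list[str]:
    """Stub for the web-scraping step (top 10 currently trending models)."""
    return []

def load_models() -> list[str]:
    """Combine curated and scraped lists, drop blocked IDs, dedupe in order."""
    candidates = (
        important_models
        + get_historical_popular_models(top_n=20)
        + get_current_popular_models(top_n=10)
    )
    seen: set[str] = set()
    models = []
    for model_id in candidates:
        if model_id in blocklist or model_id in seen:
            continue
        seen.add(model_id)
        models.append(model_id)
    return models  # ~40-50 models after combine & dedupe
```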
**Key Features:**
- **Model Discovery**: Combines curated models with real-time trending models via web scraping
- **Multi-Task Evaluation**: 7 tasks across 100+ languages with origin tracking (human vs machine-translated; see the aggregation sketch after this list)
- **Scalable Architecture**: Dual deployment (local/GitHub vs Google Cloud)
- **Real-time Visualization**: Interactive web interface with country-level insights
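Origin tracking carries through to aggregation: each score row records whether its prompts came from a native ("human") or machine-translated ("machine") dataset, and the `Mean by model+lang+task+origin` node averages scores over that full key so the two kinds of numbers never mix. A minimal pandas sketch; the row schema is assumed from the diagram:

```python
import pandas as pd

# Hypothetical score rows as emitted by the task functions; the exact
# schema is assumed from the aggregation node in the diagram.
rows = [
    {"model": "microsoft/phi-4", "bcp_47": "de", "task": "mmlu",
     "origin": "human", "score": 1.0},
    {"model": "microsoft/phi-4", "bcp_47": "de", "task": "mmlu",
     "origin": "machine", "score": 0.0},
    {"model": "microsoft/phi-4", "bcp_47": "de", "task": "mmlu",
     "origin": "machine", "score": 1.0},
]

# Mean by model + language + task + origin, so human-sourced and
# machine-translated scores are reported separately.
results = (
    pd.DataFrame(rows)
    .groupby(["model", "bcp_47", "task", "origin"], as_index=False)
    .agg(score=("score", "mean"))
)
results.to_json("results.json", orient="records", indent=2)
```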
## Evaluate
### Local Development
```bash
uv run --extra dev evals/main.py
```
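In outline, `evals/main.py` builds the Model × Language × Task grid, keeps only the combinations each model supports, and runs 10 samples per combination (the real task functions prompt the model through the rate-limited `complete()` wrapper against OpenRouter). A simplified async sketch; the task registry and data shapes here are assumptions:

```python
import asyncio
from itertools import product

N_SAMPLES = 10  # "10 samples each" per model x language x task combination

async def mmlu_and_evaluate(model: str, bcp_47: str, nr: int) -> dict:
    """Stand-in for a real task function, which would prompt the model via
    the rate-limited complete() wrapper and score the response."""
    return {"model": model, "bcp_47": bcp_47, "task": "mmlu",
            "origin": "human", "score": 1.0}

tasks = {"mmlu": mmlu_and_evaluate}  # abbreviated task registry

async def evaluate(models: list[dict], languages: list[str]) -> list[dict]:
    # Build the Model x Language x Task grid, keeping only combinations
    # the model actually supports (the "Filter by model.tasks" step).
    jobs = [
        task_fn(model["id"], lang, nr)
        for model, lang, (name, task_fn) in product(models, languages, tasks.items())
        if name in model["tasks"]
        for nr in range(N_SAMPLES)
    ]
    return await asyncio.gather(*jobs)

if __name__ == "__main__":
    models = [{"id": "microsoft/phi-4", "tasks": ["mmlu"]}]
    print(len(asyncio.run(evaluate(models, ["de", "fr"]))))  # 1 x 2 x 1 x 10 = 20
```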
### Google Cloud Deployment
```bash
uv run --extra dev evals/main_gcs.py
```
## Explore
```bash
uv run evals/backend.py
cd frontend && npm i && npm start
```
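`evals/backend.py` is the FastAPI app behind the `/api/data` endpoint in the diagram: it reads `results.json` and `models.json` and serves the tables the React frontend renders. A minimal sketch; the response shape is an assumption (the diagram only names fields like `arc_accuracy_human` / `arc_accuracy_machine`):

```python
import json
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/data")
def data() -> dict:
    """Serve the aggregated tables to the React frontend. Returning raw JSON
    is an illustrative simplification; per the diagram, backend.py also
    builds the model and country tables (make_model_table, make_country_table)."""
    with open("results.json") as f:
        results = json.load(f)
    with open("models.json") as f:
        models = json.load(f)
    return {"models": models, "results": results}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)  # app_port in the Space config
```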