# AI Language Monitor - System Architecture

This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.

```mermaid
flowchart TD
    %% Model Sources
    A1["important_models<br/>Static Curated List"] --> D[load_models]
    A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
    A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
    A4["blocklist<br/>Exclusions"] --> D
    
    %% Model Processing
    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
    E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
    F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
    G --> H["Enriched Model DataFrame"]
    H --> |Save| I[models.json]
    
    %% Model Validation & Cost Filtering
    H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost ≀ $20/1M tokens"]
    H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
    
    %% Language Data
    J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
    
    %% Task Registry with Unified Prompting
    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
    M --> M1["translation_from/to<br/>BLEU + ChrF"]
    M --> M2["classification<br/>Accuracy"]
    M --> M3["mmlu<br/>Accuracy"]
    M --> M4["arc<br/>Accuracy"] 
    M --> M5["truthfulqa<br/>Accuracy"]
    M --> M6["mgsm<br/>Accuracy"]
    
    %% On-the-fly Translation with Origin Tagging
    subgraph OTF [On-the-fly Dataset Translation]
        direction LR
        DS_raw["Raw English Dataset<br/>(e.g., MMLU)"] --> Google_Translate["Google Translate API"]
        Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., German MMLU)<br/>Origin: 'machine'"]
        DS_native["Native Dataset<br/>(e.g., German MMLU)<br/>Origin: 'human'"]
    end
    
    %% Evaluation Pipeline
    H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
    K --> |"languages bcp_47"| N
    L --> |"tasks.items"| N
    N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model Γ— Language Γ— Task"]
    O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
    
    %% Task Execution with Origin Tracking
    P --> Q1[translate_and_evaluate<br/>Origin: 'human']
    P --> Q2[classify_and_evaluate<br/>Origin: 'human']
    P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'/'machine']
    P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
    P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'/'machine']
    P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
    
    %% API Calls with Error Handling
    Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
    Q2 --> |"complete() API<br/>Rate Limiting"| R
    Q3 --> |"complete() API<br/>Rate Limiting"| R
    Q4 --> |"complete() API<br/>Rate Limiting"| R
    Q5 --> |"complete() API<br/>Rate Limiting"| R
    Q6 --> |"complete() API<br/>Rate Limiting"| R
    
    %% Results Processing with Origin Aggregation
    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
    S --> |Save| T[results.json]
    
    %% Backend & Frontend with Origin-Specific Metrics
    T --> |Read| U[backend.py]
    I --> |Read| U
    U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
    U --> |make_country_table| W["Country Aggregation"]
    U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
    X --> |"JSON Response"| Y["Frontend React App"]
    
    %% UI Components
    Y --> Z1["WorldMap.js<br/>Country Visualization"]
    Y --> Z2["ModelTable.js<br/>Model Rankings"]
    Y --> Z3["LanguageTable.js<br/>Language Coverage"]
    Y --> Z4["DatasetTable.js<br/>Task Performance"]
    
    %% Data Sources with Origin Information
    subgraph DS ["Data Sources"]
        DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
        DS2["MMLU/AfriMMLU<br/>Knowledge QA<br/>Origin: 'human'"]
        DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
        DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
        DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
    end
    
    DS1 --> Q1
    DS2 --> Q3
    DS3 --> Q4
    DS4 --> Q5
    DS5 --> Q6
    
    DS_translated --> Q3
    DS_translated --> Q4
    DS_translated --> Q5
    
    DS_native --> Q3
    DS_native --> Q4
    DS_native --> Q5
    
    %% Styling - Neutral colors that work in both dark and light modes
    classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
    classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
    classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
    classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
    classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
    classDef translation fill:#d4edda,stroke:#155724,color:#155724
    
    class A1,A2,A3,A4 modelSource
    class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
    class R,F,G,X api
    class T,I storage
    class Y,Z1,Z2,Z3,Z4 frontend
    class Google_Translate,DS_translated,DS_native translation
```

## Architecture Components

### 🔵 Model Discovery (Light Gray)
- **Static Curated Models**: Handpicked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: Blocklist for problematic or incompatible models
- **Model Validation**: API availability checks and cost filtering (≤$20/1M tokens); a sketch follows this list
- **Timeout Protection**: 120s timeout for large/reasoning models, 60s for others
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
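
A minimal sketch of how the discovery step could combine these sources. The source lists (`important_models`, the scraped popular lists, `blocklist`) come from the diagram above; the metadata field names (`available`, `cost_per_million`) are illustrative assumptions, not the actual schema.

```python
# Sketch of load_models-style discovery; metadata field names are assumed.
MAX_COST_PER_MILLION = 20.0  # cost filter: keep models at <= $20 per 1M tokens

def load_models(important_models, historical_popular, current_popular, blocklist):
    """Combine curated and scraped model lists, dropping duplicates and blocked IDs."""
    seen, combined = set(), []
    for model_id in [*important_models, *historical_popular, *current_popular]:
        if model_id in seen or model_id in blocklist:
            continue
        seen.add(model_id)
        combined.append(model_id)
    return combined

def filter_valid_models(models_metadata):
    """Keep only models that passed the availability check and are cheap enough."""
    return [
        m for m in models_metadata
        if m.get("available")
        and m.get("cost_per_million", float("inf")) <= MAX_COST_PER_MILLION
    ]
```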

### 🟣 Evaluation Pipeline (Medium Gray)
- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Unified English Zero-Shot Prompting**: All tasks use English instructions with target language content
- **Origin Tagging**: Distinguishes between human-translated ('human') and machine-translated ('machine') data
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations, sketched after this list
- **Sample-based**: 10 evaluations per combination for statistical reliability
- **Batch Processing**: 50 tasks per batch with rate limiting and error resilience
- **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage
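
The combination and batching logic might look like the sketch below. The dict keys (`tasks`, `bcp_47`, `name`) mirror labels from the diagram, and the constants reflect the numbers in this list; the real data structures may differ.

```python
from itertools import product

N_SAMPLES = 10   # evaluations per model x language x task combination
BATCH_SIZE = 50  # tasks per batch, as described above

def valid_combinations(models, languages, tasks):
    """Yield every supported (model, language, task, sample) tuple."""
    for model, lang, task in product(models, languages, tasks):
        if task["name"] in model["tasks"]:  # filter by model.tasks
            for sample in range(N_SAMPLES):
                yield model["id"], lang["bcp_47"], task["name"], sample

def batched(items, size=BATCH_SIZE):
    """Group work into fixed-size batches for rate-limited execution."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```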

### 🟠 API Integration (Light Gray)
- **OpenRouter**: Primary model inference API for all language model tasks
- **Rate Limiting**: Intelligent batching and delays to prevent API overload
- **Error Handling**: Graceful handling of timeouts, rate limits, and model unavailability (see the retry sketch below)
- **HuggingFace**: Model metadata and open-source model information
- **Google Translate**: Specialized translation API for on-the-fly dataset translation
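
The exact signature of the `complete()` helper is not shown in the diagram, so this retry wrapper treats it as an opaque callable; the backoff policy is an illustrative assumption, not the pipeline's actual one.

```python
import time

def complete_with_retries(complete, prompt, model_id, timeout_s=60, max_retries=3):
    """Call the (assumed) complete() inference helper with exponential backoff.

    Large/reasoning models would pass timeout_s=120, per the validation notes above.
    """
    for attempt in range(max_retries):
        try:
            return complete(prompt, model=model_id, timeout=timeout_s)
        except Exception:  # rate limit, timeout, or transient unavailability
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
```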

### 🟢 Data Storage (Cyan)
- **results.json**: Aggregated evaluation scores with origin-specific metrics (aggregation sketched below)
- **models.json**: Dynamic model list with metadata and validation status
- **languages.json**: Language information with population data
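
The aggregation feeding `results.json` can be expressed as a pandas group-by. The column names follow the diagram's `model+lang+task+origin` key; the actual schema is an assumption here.

```python
import pandas as pd

def aggregate_results(raw_scores: pd.DataFrame) -> pd.DataFrame:
    """Mean score per model + language + task + origin (see 'Result Aggregation')."""
    return (
        raw_scores
        .groupby(["model", "bcp_47", "task", "origin"], as_index=False)["score"]
        .mean()
    )

# Example: aggregate_results(df).to_json("results.json", orient="records")
```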

### 🟡 Frontend Visualization (Light Red)
- **WorldMap**: Interactive country-level language proficiency visualization
- **ModelTable**: Ranked model performance leaderboard with origin-specific columns
- **LanguageTable**: Language coverage and speaker statistics
- **DatasetTable**: Task-specific performance breakdowns with human/machine distinction

### 🔵 Translation & Origin Tracking (Light Green)
- **On-the-fly Translation**: Google Translate API for languages without native benchmarks (see the sketch after this list)
- **Origin Tagging**: Automatic classification of data sources (human vs. machine translated)
- **Separate Metrics**: Frontend displays distinct scores for human and machine-translated data
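
The human/machine split might look like this sketch: prefer a native benchmark when one exists, otherwise translate the English source on the fly and tag the rows accordingly. `translate_fn` is a stand-in for the Google Translate call, and the dataset layout is assumed.

```python
def load_task_dataset(task, language, native_datasets, translate_fn):
    """Return rows tagged with origin='human' (native) or 'machine' (translated)."""
    if (task, language) in native_datasets:
        rows, origin = native_datasets[(task, language)], "human"
    else:
        # No native benchmark: translate the English rows on the fly.
        english_rows = native_datasets[(task, "en")]
        rows = [translate_fn(row, target=language) for row in english_rows]
        origin = "machine"
    return [{**row, "origin": origin} for row in rows]
```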

## Data Flow Summary

1. **Model Discovery**: Combine curated + trending models → validate API availability → enrich with metadata
2. **Evaluation Setup**: Generate all valid Model Γ— Language Γ— Task combinations with origin tracking
3. **Task Execution**: Run evaluations using unified English prompting and appropriate datasets
4. **Result Processing**: Aggregate scores by model+language+task+origin and save to JSON files
5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics via REST API (endpoint sketched below)
6. **Frontend Display**: React app visualizes data through interactive components with transparency indicators
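
A stripped-down sketch of the serving step. The route path and the origin-specific field names come from the diagram; the table-building logic (`make_model_table` / `make_country_table` in backend.py) is omitted, so this just returns the stored files.

```python
import json

from fastapi import FastAPI

app = FastAPI()

def load_json(path):
    with open(path) as f:
        return json.load(f)

@app.get("/api/data")
def get_data():
    # backend.py would run make_model_table / make_country_table here to add
    # origin-specific columns such as arc_accuracy_human and
    # arc_accuracy_machine; this sketch serves the raw JSON files instead.
    return {
        "models": load_json("models.json"),
        "results": load_json("results.json"),
    }
```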

This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency.