adds readme, removes quantization, adds readtoken logic, updates trackio, spaces
This view is limited to 50 files because it contains too many changes.
- README.md +275 -293
- docs/A100_LARGE_SCALE_GUIDE.md +0 -195
- docs/APP_CONFIGURATION_GUIDE.md +0 -234
- docs/CLOUD_DEPLOYMENT_GUIDE.md +0 -462
- docs/CLOUD_TRAINING_GUIDE.md +0 -440
- docs/Configuration_Management.md +29 -0
- docs/DATASET_AUTOMATION_FIX.md +0 -218
- docs/DATASET_COMPONENTS_VERIFICATION.md +0 -235
- docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md +0 -393
- docs/DEPLOYMENT_GUIDE.md +0 -397
- docs/Data_Pipeline.md +95 -0
- docs/ENHANCED_MODEL_CARD_METADATA.md +0 -300
- docs/ENVIRONMENT_SETUP_FIX.md +0 -239
- docs/ENVIRONMENT_VARIABLES.md +0 -113
- docs/Entry_Point.md +120 -0
- docs/FINAL_DEPLOYMENT_VERIFICATION.md +0 -378
- docs/FORMATTING_FIX_SUMMARY.md +0 -153
- docs/GIT_CONFIGURATION_FIX.md +0 -257
- docs/GIT_CONFIGURATION_GUIDE.md +0 -258
- docs/H100_LIGHTWEIGHT_GUIDE.md +0 -276
- docs/HF_DATASETS_GUIDE.md +0 -269
- docs/HF_HUB_V0_34_UPDATE.md +0 -170
- docs/HF_SPACES_GUIDE.md +0 -163
- docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md +0 -330
- docs/LATEST_DEPLOYMENT_APPROACH.md +0 -267
- docs/LAUNCH_SCRIPT_UPDATES.md +0 -174
- docs/LAUNCH_SCRIPT_USERNAME_FIX.md +0 -154
- docs/MODEL_CARD_USER_INPUT_ANALYSIS.md +0 -233
- docs/MODEL_RECOVERY_GUIDE.md +0 -228
- docs/MONITORING_IMPROVEMENTS_SUMMARY.md +0 -191
- docs/MONITORING_INTEGRATION_GUIDE.md +0 -245
- docs/MONITORING_VERIFICATION_REPORT.md +0 -163
- docs/Model_Abstraction.md +36 -0
- docs/NO_THINK_TAG_GUIDE.md +0 -146
- docs/PIPELINE_SUMMARY.md +0 -330
- docs/PUSH_GUIDE.md +0 -406
- docs/PUSH_SCRIPT_GUIDE.md +0 -267
- docs/QUANTIZATION_FIX_SUMMARY.md +0 -165
- docs/QUANTIZATION_GUIDE.md +0 -313
- docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md +0 -248
- docs/README_END_TO_END.md +0 -303
- docs/SFT_TRAINER_CONFIG_USAGE.md +0 -233
- docs/TOKEN_FIX_SUMMARY.md +0 -249
- docs/TOKEN_VALIDATION_FIX.md +0 -183
- docs/TRACKIO_API_FIX_SUMMARY.md +0 -276
- docs/TRACKIO_DEPLOYMENT_FIXES.md +0 -266
- docs/TRACKIO_DICT_ACCESS_FIX.md +0 -144
- docs/TRACKIO_INTEGRATION.md +0 -252
- docs/TRACKIO_INTEGRATION_VERIFICATION.md +0 -177
- docs/TRACKIO_INTERFACE_GUIDE.md +0 -222
README.md
CHANGED
@@ -1,399 +1,381 @@
-#

-

-##

-

-
-- **Direct Preference Optimization (DPO)**: Improve model alignment
-- **Long-context fine-tuning**: Support for up to 128k tokens
-- **Tool calling**: Fine-tune for function calling capabilities
-- **Model Quantization**: Create int8 (GPU) and int4 (CPU) quantized versions

-

-

-

-
-- `config/train_smollm3.py`: Default configuration
-- `model.py`: Model wrapper and loading
-- `data.py`: Dataset handling and preprocessing
-- `trainer.py`: Training loop and trainer setup
-- `requirements.txt`: Dependencies

-
-
-When setting up a Fine Tuning Job in the FlexAI console, use these settings:
-
-#### Basic Configuration
-- **Name**: `smollm3-finetune`
-- **Cluster**: Your organization's designated cluster
-- **Checkpoint**: (Optional) Previous training job checkpoint
-- **Node Count**: 1
-- **Accelerator Count**: 1-8 (depending on your needs)
-
-#### Repository Settings
-- **Repository URL**: `https://github.com/your-username/flexai-finetune`
-- **Repository Revision**: `main`
-
-#### Dataset Configuration
-- **Datasets**: Your dataset (mounted under `/input`)
-- **Mount Directory**: `my_dataset`
-
-#### Entry Point
-```
-train.py config/train_smollm3.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
 ```

-
-
-

-
-```json
-[
-  {
-    "messages": [
-      {"role": "user", "content": "What is machine learning?"},
-      {"role": "assistant", "content": "Machine learning is a subset of AI..."}
-    ]
-  }
-]
-```

-
-```json
-[
-  {
-    "instruction": "What is machine learning?",
-    "output": "Machine learning is a subset of AI..."
-  }
-]
-```

-
-
-
-
-
-
-
-
 ```
-### 4. Configuration Options

-
-
-```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 ```

-### 5. Command Line Arguments

-

-
-# Basic usage
-python train.py config/train_smollm3.py
-
-# With custom parameters
-python train.py config/train_smollm3.py \
-    --dataset_dir=my_dataset \
-    --out_dir=/output-checkpoint \
-    --init_from=resume \
-    --max_iters=1500 \
-    --batch_size=8 \
-    --learning_rate=1e-5 \
-    --max_seq_length=8192
-```

-
-
-### 1. Custom Configuration
-
-Create a custom configuration file:

 ```python
 # config/my_config.py
 from config.train_smollm3 import SmolLM3Config

 config = SmolLM3Config(
-    model_name="HuggingFaceTB/SmolLM3-3B
     max_seq_length=8192,
-    batch_size=
-    learning_rate=
-
-
-
 )
 ```

-###

-

 ```python
-
-
-
-
-
 )
 ```

-###

-

 ```python
-from

-
     model=model,
     dataset=dataset,
     config=config,
-    output_dir="./
 )

-
-
-
-
-
-
-
-```json
-[
-  {
-    "messages": [
-      {"role": "user", "content": "What's the weather in New York?"},
-      {"role": "assistant", "content": "<tool_call>\n<invoke name=\"get_weather\">\n<parameter name=\"location\">New York</parameter>\n</invoke>\n</tool_call>"},
-      {"role": "tool", "content": "The weather in New York is 72°F and sunny."},
-      {"role": "assistant", "content": "The weather in New York is currently 72°F and sunny."}
-    ]
-  }
-]
 ```
-##

-

-
-- **SmolLM3-3B**: Instruction-tuned model
-- **SmolLM3-3B-Instruct**: Enhanced instruction model
-- **Quantized versions**: Available for deployment

-
-
-
--
--
-
-
-### Recommended
-- **GPU**: A100/H100 or similar
-- **RAM**: 64GB+ system memory
-- **Storage**: 100GB+ SSD

-

-

-
-
-
-
-
-
-2. **Slow Training**
-   - Enable `flash_attention`
-   - Use mixed precision (`fp16`/`bf16`)
-   - Increase `dataloader_num_workers`

-
-   - Check dataset format
-   - Ensure proper JSON structure
-   - Verify file permissions

-###

-

 ```python
-import
-
 ```

-

-

 ```python
-
-
-pipe = pipeline(
-    task="text-generation",
-    model="./output-checkpoint",
-    device=0,
-    max_new_tokens=256,
-    do_sample=True,
-    temperature=0.7
-)
-
-# Test the model
-messages = [{"role": "user", "content": "Explain gravity in simple terms."}]
-outputs = pipe(messages)
-print(outputs[0]["generated_text"][-1]["content"])
-```
-
-## Model Quantization
-
-The pipeline includes built-in quantization support using torchao for creating optimized model versions with a unified repository structure:
-
-### Repository Structure
-
-All models (main and quantized) are stored in a single repository:
-
-```
-your-username/model-name/
-├── README.md (unified model card)
-├── config.json
-├── pytorch_model.bin
-├── tokenizer.json
-├── int8/ (quantized model for GPU)
-└── int4/ (quantized model for CPU)
 ```

-

-
-- **int4_weight_only**: CPU optimized, ~75% memory reduction
-
-### Automatic Quantization
-
-When using the interactive pipeline (`launch.sh`), you'll be prompted to create quantized versions after training:

 ```bash
-
-
-
 ```

-###

-

 ```bash
-# Quantize and push to HF Hub
-python scripts/model_tonic/quantize_standalone.py
     --quant-type int8_weight_only \
     --token YOUR_HF_TOKEN

-# Quantize
-python scripts/model_tonic/quantize_standalone.py
     --quant-type int4_weight_only \
     --device cpu \
     --save-only
 ```

-

 ```python
-
-from
-
-# Load main model
-model = AutoModelForCausalLM.from_pretrained(
-    "your-username/model-name",
-    device_map="auto",
-    torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

-
-
-
-
-
 )
-

-
-
-
-
-
-
-
 ```

-

-

-

-
-- **Conditional Sections**: Quantized model information appears when available
-- **Usage Examples**: Complete examples for all model variants
-- **Performance Information**: Memory and speed benefits for each quantization type

-

-

-### Using vLLM
 ```bash
-
 ```

-###
-
-
-
 ```

-##

-
-
-
-

-## License

-This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.


-
-
-
-
-"created_at": "2025-07-18T19:58:52.689087",
-"status": "running",
-"metrics": [],
-"parameters": {},
-"artifacts": [],
-"logs": []
-}
+# 🤏🏻🏭SmolFactory

+A comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with custom monitoring, Hugging Face integration, and interactive configuration management.

+## 🤖 Automatically Push Model, Spaces, Datasets & Monitoring

+- **Trackio Monitoring Space**: Real-time training metrics, loss curves, and resource utilization
+- **Demo Spaces**: Instant web interfaces for model testing and demonstration
+- **Automatic Deployment**: Spaces created and configured automatically during the pipeline

+### 📈 **Custom Trackio Monitoring**

+- **Real-time Metrics**: Live training loss, learning rate, gradient norms, and GPU utilization
+- **Custom Dashboards**: Tailored visualizations for SmolLM3 fine-tuning
+- **Artifact Logging**: Model checkpoints, configuration files, and training logs
+- **Experiment Comparison**: Side-by-side analysis of different training runs
+- **Alert System**: Notifications for training issues or completion
+- **Integration**: Seamless connection with HF Spaces for public monitoring
+- **Experiment Tracking**: All training data, metrics, and artifacts stored in HF Datasets
+- **Reproducibility**: Complete experiment history with configuration snapshots
+- **Collaboration**: Easy sharing of training results and model comparisons
+- **Version Control**: Track dataset changes and model performance over time
+## 🚀 Quick Start

+### Interactive Pipeline (Recommended)

+The easiest way to get started is using the interactive pipeline:

+```bash
+./launch.sh
 ```

+This script will:
+1. **Authenticate** with Hugging Face (write + read tokens)
+2. **Configure** training parameters interactively
+3. **Deploy** Trackio Space for monitoring
+4. **Setup** HF Dataset for experiment tracking
+5. **Execute** training with your chosen configuration
+6. **Push** model to HF Hub with comprehensive documentation
+7. **Deploy** demo space for testing (optional)

+### Manual Setup

+For advanced users who want to customize the pipeline:

+```bash
+# 1. Install dependencies
+pip install -r requirements/requirements_core.txt
+
+# 2. Configure your training
+python scripts/training/train.py \
+    --config config/train_smollm3_h100_lightweight.py \
+    --experiment-name "my-experiment" \
+    --output-dir ./outputs \
+    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring"
+
+# 3. Push model to HF Hub
+python scripts/model_tonic/push_to_huggingface.py \
+    ./outputs username/model-name \
+    --token YOUR_HF_TOKEN
 ```
+## 🏗️ Repository Architecture
+
+```mermaid
+graph LR
+    Entry_Point["Entry Point"]
+    Configuration_Management["Configuration Management"]
+    Data_Pipeline["Data Pipeline"]
+    Model_Abstraction["Model Abstraction"]
+    Training_Orchestrator["Training Orchestrator"]
+    Entry_Point -- "Initializes and Uses" --> Configuration_Management
+    Entry_Point -- "Initializes" --> Data_Pipeline
+    Entry_Point -- "Initializes" --> Model_Abstraction
+    Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
+    Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
+    Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
+    Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
+    Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
+    Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
+    click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
+    click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
+    click Data_Pipeline href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Data_Pipeline.md" "Details"
+    click Model_Abstraction href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Model_Abstraction.md" "Details"
+    click Training_Orchestrator href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Training_Orchestrator.md" "Details"
 ```
+## 🔧 Core Components

+### Configuration System (`config/`)

+All training configurations inherit from `SmolLM3Config`:

 ```python
 # config/my_config.py
 from config.train_smollm3 import SmolLM3Config

 config = SmolLM3Config(
+    model_name="HuggingFaceTB/SmolLM3-3B",
     max_seq_length=8192,
+    batch_size=8,
+    learning_rate=5e-6,
+    trainer_type="sft", # or "dpo"
+    enable_tracking=True,
+    trackio_url="https://huggingface.co/spaces/username/trackio-monitoring"
 )
 ```

+### Dataset Processing (`src/data.py`)

+The `SmolLM3Dataset` class handles multiple dataset formats:

 ```python
+from src.data import SmolLM3Dataset
+
+# Supports multiple formats:
+# 1. Chat format (recommended)
+# 2. Instruction format
+# 3. User-Assistant format
+# 4. Hugging Face datasets
+
+dataset = SmolLM3Dataset(
+    data_path="my_dataset",
+    tokenizer=tokenizer,
+    max_seq_length=4096,
+    use_chat_template=True,
+    sample_size=80000 # For lightweight training
 )
 ```

+### Training Orchestration (`src/train.py`)

+The main training script coordinates all components:

 ```python
+from src.train import main
+from src.model import SmolLM3Model
+from src.trainer import SmolLM3Trainer, SmolLM3DPOTrainer

+# SFT Training
+trainer = SmolLM3Trainer(
     model=model,
     dataset=dataset,
     config=config,
+    output_dir="./outputs"
 )

+# DPO Training
+dpo_trainer = SmolLM3DPOTrainer(
+    model=model,
+    dataset=dataset,
+    config=config,
+    output_dir="./dpo-outputs"
+)
 ```
+## 🎯 Training Types

+### Supervised Fine-tuning (SFT)

+Standard instruction tuning for improving model capabilities:

+```bash
+python scripts/training/train.py \
+    --config config/train_smollm3.py \
+    --trainer-type sft \
+    --experiment-name "sft-experiment"
+```

+### Direct Preference Optimization (DPO)

+Preference-based training for alignment:

+```bash
+python scripts/training/train.py \
+    --config config/train_smollm3_dpo.py \
+    --trainer-type dpo \
+    --experiment-name "dpo-experiment"
+```
+## 📊 Monitoring & Tracking

+### Trackio Integration

+The pipeline includes comprehensive monitoring:

 ```python
+from src.monitoring import create_monitor_from_config
+
+monitor = create_monitor_from_config(config)
+monitor.log_metrics({
+    "train_loss": loss,
+    "learning_rate": lr,
+    "gradient_norm": grad_norm
+})
 ```

+### HF Dataset Integration

+Experiment data is automatically saved to HF Datasets:

 ```python
+# Automatically configured in launch.sh
+dataset_repo = "username/trackio-experiments"
 ```
+## 🔄 Model Management

+### Pushing to HF Hub

 ```bash
+python scripts/model_tonic/push_to_huggingface.py \
+    ./outputs username/model-name \
+    --token YOUR_HF_TOKEN \
+    --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring" \
+    --experiment-name "my-experiment"
 ```

+### Model Quantization

+Create optimized versions for deployment:

 ```bash
+# Quantize and push to HF Hub
+python scripts/model_tonic/quantize_standalone.py \
+    ./outputs username/model-name \
     --quant-type int8_weight_only \
     --token YOUR_HF_TOKEN

+# Quantize for CPU deployment
+python scripts/model_tonic/quantize_standalone.py \
+    ./outputs username/model-name \
     --quant-type int4_weight_only \
     --device cpu \
     --save-only
 ```
+## 🛠️ Customization Guide
+
+### Adding New Training Configurations
+
+1. Create a new config file in `config/`:

 ```python
+# config/train_smollm3_custom.py
+from config.train_smollm3 import SmolLM3Config

+config = SmolLM3Config(
+    model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
+    max_seq_length=16384,
+    batch_size=4,
+    learning_rate=1e-5,
+    max_iters=2000,
+    trainer_type="sft"
 )
+```

+2. Add to the training script mapping in `scripts/training/train.py`:
+
+```python
+config_map = {
+    # ... existing configs ...
+    "config/train_smollm3_custom.py": get_custom_config,
+}
 ```
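
The mapping step above reads as a dispatch table from a config path to a factory function. The following is a minimal sketch of that idea, not the actual code in `scripts/training/train.py`: `DemoConfig` is a hypothetical stand-in for `SmolLM3Config`, and `get_base_config`, `load_config`, the `max_iters` default, and the body of `get_custom_config` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DemoConfig:
    """Hypothetical stand-in for SmolLM3Config; field names mirror the README examples."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 8192
    batch_size: int = 8
    learning_rate: float = 5e-6
    max_iters: int = 1000  # assumed default, not taken from the repository
    trainer_type: str = "sft"


def get_base_config() -> DemoConfig:
    """Factory for the default configuration (illustrative)."""
    return DemoConfig()


def get_custom_config() -> DemoConfig:
    """Factory using the values from the custom config example in the README."""
    return DemoConfig(
        model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
        max_seq_length=16384,
        batch_size=4,
        learning_rate=1e-5,
        max_iters=2000,
    )


# Each known config path maps to a zero-argument factory.
config_map: Dict[str, Callable[[], DemoConfig]] = {
    "config/train_smollm3.py": get_base_config,
    "config/train_smollm3_custom.py": get_custom_config,
}


def load_config(path: str) -> DemoConfig:
    """Look up the factory for `path` and fail loudly on unknown paths."""
    try:
        factory = config_map[path]
    except KeyError:
        known = ", ".join(sorted(config_map))
        raise ValueError(f"Unknown config {path!r}; known configs: {known}") from None
    return factory()


cfg = load_config("config/train_smollm3_custom.py")
print(cfg.model_name, cfg.batch_size)
```

Keeping the lookup in one helper means an unregistered path fails immediately with the list of known configs, rather than surfacing later in training.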
+### Custom Dataset Formats

+Extend `src/data.py` to support new formats:

+```python
+def _load_custom_format(self, data_path: str) -> Dataset:
+    """Load custom dataset format"""
+    # Your custom loading logic here
+    pass
+```
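
As one hypothetical example of such loading logic, the sketch below normalizes instruction-style records (`instruction`/`output` keys) into the chat `messages` layout, the two JSON shapes shown in the removed README. The helper names `to_chat_record` and `load_records` are made up for illustration, and the real `SmolLM3Dataset` may handle these formats differently.

```python
import json
from typing import Any, Dict, List

ChatRecord = Dict[str, List[Dict[str, str]]]


def to_chat_record(record: Dict[str, Any]) -> ChatRecord:
    """Return `record` in chat format, converting instruction-style records."""
    if "messages" in record:
        # Already in chat format; copy the list so the input is not mutated.
        return {"messages": list(record["messages"])}
    if "instruction" in record and "output" in record:
        return {
            "messages": [
                {"role": "user", "content": record["instruction"]},
                {"role": "assistant", "content": record["output"]},
            ]
        }
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


def load_records(json_text: str) -> List[ChatRecord]:
    """Parse a JSON array of records and normalize each one to chat format."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of records")
    return [to_chat_record(item) for item in data]


raw = (
    '[{"instruction": "What is machine learning?", '
    '"output": "Machine learning is a subset of AI..."}]'
)
records = load_records(raw)
print(records[0]["messages"][0]["role"])
```

Normalizing everything to one shape up front lets a single chat template be applied downstream, regardless of how the source file was written.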
+
+### Custom Training Loops
+
+Extend `src/trainer.py` for specialized training:
+
+```python
+class SmolLM3CustomTrainer(SmolLM3Trainer):
+    def training_step(self, batch):
+        # Custom training logic
+        pass
+```
+## 🔧 Development & Contributing

+### Project Structure

+- **`src/`**: Core training modules
+- **`config/`**: Training configurations
+- **`scripts/`**: Utility scripts and automation
+- **`docs/`**: Comprehensive documentation
+- **`tests/`**: Test files and debugging tools
+
+### Adding New Features
+
+1. **Configuration**: Add to `config/` directory
+2. **Core Logic**: Extend modules in `src/`
+3. **Scripts**: Add utility scripts to `scripts/`
+4. **Documentation**: Update relevant docs in `docs/`
+5. **Tests**: Add test files to `tests/`
+
+### Testing Your Changes

 ```bash
+# Run basic tests
+python tests/test_config.py
+python tests/test_dataset.py
+python tests/test_training.py
+
+# Test specific components
+python tests/test_monitoring.py
+python tests/test_model_push.py
 ```

+### Code Style
+
+- Follow PEP 8 for Python code
+- Use type hints for all functions
+- Add comprehensive docstrings
+- Include error handling for external APIs
+- Use structured logging with consistent field names
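
The last style point can be illustrated with the standard `logging` module. This is only a sketch of one possible convention, not the project's actual logging setup: the `JsonFormatter` class, the `fields` key passed through `extra=`, and the logger name `smolfactory.demo` are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            # `fields` is a convention of this sketch, supplied via `extra=`.
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (handler added once)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_step(logger: logging.Logger, step: int, train_loss: float, learning_rate: float) -> None:
    """Log one training step, always with the same field names."""
    logger.info(
        "train_step",
        extra={"fields": {"step": step, "train_loss": train_loss, "learning_rate": learning_rate}},
    )


logger = get_logger("smolfactory.demo")
log_step(logger, step=10, train_loss=1.92, learning_rate=5e-6)
```

Routing every step through one helper such as `log_step` is what keeps the field names consistent; call sites cannot drift into `loss` in one place and `train_loss` in another.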
+
+## 🚨 Troubleshooting
+
+### Common Issues
+
+1. **Out of Memory (OOM)**
+   ```bash
+   # Reduce batch size in config
+   batch_size=2 # instead of 8
+   gradient_accumulation_steps=16 # increase to compensate
+   ```
+
+2. **Token Validation Errors**
+   ```bash
+   # Validate your HF token
+   python scripts/validate_hf_token.py YOUR_TOKEN
+   ```
+
+3. **Dataset Loading Issues**
+   ```bash
+   # Check dataset format
+   python tests/test_dataset_loading.py
+   ```
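
The OOM fix in item 1 trades per-device batch size for gradient accumulation steps so that the effective batch size stays the same. A small sketch of that arithmetic follows. It assumes a single device and an original accumulation of 4 steps, since the README does not state the original value; `compensated_accumulation` is an illustrative helper, not part of the repository.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return batch_size * grad_accum_steps * num_devices


def compensated_accumulation(old_batch_size: int, old_accum_steps: int, new_batch_size: int) -> int:
    """Accumulation steps needed to keep the effective batch size unchanged."""
    if new_batch_size <= 0:
        raise ValueError("new_batch_size must be positive")
    target = old_batch_size * old_accum_steps
    if target % new_batch_size != 0:
        raise ValueError(
            f"Effective batch size {target} is not divisible by batch size {new_batch_size}"
        )
    return target // new_batch_size


# Assumed starting point: batch_size=8 with 4 accumulation steps (effective 32).
new_accum = compensated_accumulation(old_batch_size=8, old_accum_steps=4, new_batch_size=2)
print(new_accum)  # 16
print(effective_batch_size(2, new_accum))  # 32
```

Because the optimizer still sees the same number of examples per step, the learning rate usually does not need retuning; the cost is more forward and backward passes per step.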
+
+### Debug Mode
+
+Enable detailed logging:
+
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
 ```

+## 🤝 Contributing

+1. Fork the repository
+2. Create a feature branch
+3. Make your changes following the code style
+4. Add tests for new functionality
+5. Update documentation
+6. Submit a pull request

+## 📄 License

+This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.

+## 🔗 Resources

+- [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
+- [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
+- [GitHub Repository](https://github.com/huggingface/smollm)
+- [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
docs/A100_LARGE_SCALE_GUIDE.md
DELETED
@@ -1,195 +0,0 @@
-# A100 Large Scale Training Guide
-
-This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.
-
-## Available Configurations
-
-### 1. A100 Large Batch Configuration
-**File**: `config/train_smollm3_openhermes_fr_a100_large.py`
-
-**Key Features**:
-- **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
-- **Training Duration**: ~1.3 passes (8,000 steps)
-- **Learning Rate**: 5e-6 (optimized for large batches)
-- **Mixed Precision**: bf16 (A100 optimized)
-- **Sequence Length**: 8192 tokens
-- **Memory Optimizations**: No gradient checkpointing for A100 efficiency
-
-**Estimated Training Time**: ~6-8 hours on A100
-
-### 2. Multiple Passes Configuration
-**File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
-
-**Key Features**:
-- **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
-- **Training Duration**: ~4 passes (25,000 steps)
-- **Learning Rate**: 3e-6 (conservative for long training)
-- **Warmup Steps**: 2000 (longer warmup for stability)
-- **Checkpoint Strategy**: More frequent saves (every 2000 steps)
-
-**Estimated Training Time**: ~20-24 hours on A100
-
-## Training Commands
-
-### Quick Start - Large Batch Experiment
-```bash
-python run_a100_large_experiment.py \
-    --config config/train_smollm3_openhermes_fr_a100_large.py \
-    --experiment-name "smollm3_openhermes_fr_large_batch" \
-    --output-dir ./outputs/large_batch
-```
-
-### Multiple Passes Experiment
-```bash
-python run_a100_large_experiment.py \
-    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
-    --experiment-name "smollm3_openhermes_fr_multiple_passes" \
-    --output-dir ./outputs/multiple_passes
-```
-
-### Dry Run (Check Configuration)
-```bash
-python run_a100_large_experiment.py \
-    --config config/train_smollm3_openhermes_fr_a100_large.py \
-    --dry-run
-```
-
-### Resume Training
-```bash
-python run_a100_large_experiment.py \
-    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
-    --resume ./outputs/multiple_passes/checkpoint-10000 \
-    --output-dir ./outputs/multiple_passes
-```
-
| 65 |
-
## Configuration Details
|
| 66 |
-
|
| 67 |
-
### Memory Usage Optimization
|
| 68 |
-
- **Gradient Checkpointing**: Disabled for A100 efficiency
|
| 69 |
-
- **Flash Attention**: Enabled for memory efficiency
|
| 70 |
-
- **bf16 Mixed Precision**: Better for A100 than fp16
|
| 71 |
-
- **Gradient Clipping**: 1.0 for stability
|
| 72 |
-
- **Group by Length**: Enabled for better batching
|
| 73 |
-
|
| 74 |
-
### Data Loading Optimization
|
| 75 |
-
- **Num Workers**: 8 for faster data loading
|
| 76 |
-
- **Pin Memory**: Enabled for GPU transfer efficiency
|
| 77 |
-
- **Prefetch Factor**: 2 for pipeline optimization
|
| 78 |
-
|
| 79 |
-
### Training Stability
|
| 80 |
-
- **Conservative Learning Rate**: Lower LR for large effective batch sizes
|
| 81 |
-
- **Longer Warmup**: More warmup steps for stability
|
| 82 |
-
- **Higher Beta2**: 0.999 for AdamW stability
|
| 83 |
-
- **Gradient Clipping**: Prevents gradient explosion
|
| 84 |
-
|
| 85 |
-
## Expected Results
|
| 86 |
-
|
| 87 |
-
### Large Batch Configuration (1.3 passes)
|
| 88 |
-
- **Training Steps**: 8,000
|
| 89 |
-
- **Effective Batch Size**: 128
|
| 90 |
-
- **Steps per Epoch**: ~6,250
|
| 91 |
-
- **Epochs**: ~1.3
|
| 92 |
-
- **Expected Loss**: Should converge to ~1.5-2.0
|
| 93 |
-
|
| 94 |
-
### Multiple Passes Configuration (4 passes)
|
| 95 |
-
- **Training Steps**: 25,000
|
| 96 |
-
- **Effective Batch Size**: 120
|
| 97 |
-
- **Steps per Epoch**: ~6,667
|
| 98 |
-
- **Epochs**: ~3.75
|
| 99 |
-
- **Expected Loss**: Should converge to ~1.2-1.5
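
The step and epoch figures in both configurations follow from simple arithmetic. A minimal sketch, assuming a dataset of roughly 800,000 samples (inferred here from ~6,250 steps per epoch at an effective batch size of 128; the guide does not state the size directly). `epochs_for` is a hypothetical helper, not part of the repository:

```python
# Assumption: ~800,000 training samples, inferred from 6,250 steps/epoch x 128.
DATASET_SIZE = 800_000


def epochs_for(max_steps: int, effective_batch_size: int,
               dataset_size: int = DATASET_SIZE) -> float:
    """Return how many passes over the dataset a run of max_steps makes."""
    steps_per_epoch = dataset_size / effective_batch_size
    return max_steps / steps_per_epoch


large_batch_epochs = epochs_for(8_000, 128)     # large batch configuration
multiple_pass_epochs = epochs_for(25_000, 120)  # multiple passes configuration
print(f"{large_batch_epochs:.2f} {multiple_pass_epochs:.2f}")
```

Under that assumption the two runs come out at about 1.3 and 3.75 passes, which matches the figures listed above.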
|
| 100 |
-
|
| 101 |
-
## Monitoring and Logging
|
| 102 |
-
|
| 103 |
-
### Trackio Integration
|
| 104 |
-
Both configurations include Trackio monitoring:
|
| 105 |
-
- **Metrics Logging**: Every 25-50 steps
|
| 106 |
-
- **Artifact Logging**: Model checkpoints
|
| 107 |
-
- **Config Logging**: Training configuration
|
| 108 |
-
|
| 109 |
-
### Checkpoint Strategy
|
| 110 |
-
- **Large Batch**: Save every 1000 steps (8 checkpoints)
|
| 111 |
-
- **Multiple Passes**: Save every 2000 steps (12 checkpoints)
|
| 112 |
-
- **Best Model**: Automatically load best model at end
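
The checkpoint counts quoted above can be derived from the step budget and the save interval. A small sketch, where `checkpoint_count` is a hypothetical helper and the step totals are the ones given in the Expected Results section:

```python
def checkpoint_count(max_steps: int, save_steps: int) -> int:
    """Number of periodic checkpoints written (a trailing partial interval is not counted)."""
    return max_steps // save_steps


large_batch_checkpoints = checkpoint_count(8_000, 1_000)
multiple_pass_checkpoints = checkpoint_count(25_000, 2_000)
print(large_batch_checkpoints, multiple_pass_checkpoints)
```

If a retention limit is configured, fewer checkpoints would remain on disk at any one time; the sketch only counts how many are written.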
|
| 113 |
-
|
| 114 |
-
## Hardware Requirements
|
| 115 |
-
|
| 116 |
-
### Minimum Requirements
|
| 117 |
-
- **GPU**: A100 80GB (or multiple A100s)
|
| 118 |
-
- **RAM**: 64GB+ system RAM
|
| 119 |
-
- **Storage**: 100GB+ for checkpoints and logs
|
| 120 |
-
- **Network**: Fast internet for dataset download
|
| 121 |
-
|
| 122 |
-
### Recommended Setup
|
| 123 |
-
- **GPU**: 2-4x A100 80GB
|
| 124 |
-
- **RAM**: 128GB+ system RAM
|
| 125 |
-
- **Storage**: 500GB+ NVMe SSD
|
| 126 |
-
- **Network**: 10Gbps+ connection
|
| 127 |
-
|
| 128 |
-
## Troubleshooting
|
| 129 |
-
|
| 130 |
-
### Out of Memory (OOM)
|
| 131 |
-
If you encounter OOM errors:
|
| 132 |
-
1. Reduce `batch_size` from 8 to 6 or 4
|
| 133 |
-
2. Increase `gradient_accumulation_steps` to maintain effective batch size
|
| 134 |
-
3. Reduce `max_seq_length` from 8192 to 4096
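
Steps 1 and 2 trade per-device batch size for accumulation steps. A minimal sketch of that trade-off, assuming a single GPU and a target effective batch size of 128 (the large batch configuration); `accumulation_steps` is a hypothetical helper:

```python
import math


def accumulation_steps(target_effective_batch: int, batch_size: int) -> int:
    """Smallest accumulation count whose effective batch size reaches the target."""
    return math.ceil(target_effective_batch / batch_size)


for batch_size in (8, 6, 4):
    steps = accumulation_steps(128, batch_size)
    # batch_size * steps is the effective batch size actually obtained
    print(batch_size, steps, batch_size * steps)
```

A batch size of 6 does not divide 128 evenly, so the effective batch size ends up slightly above the target (132); batch sizes of 8 and 4 reproduce 128 exactly.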
|
| 135 |
-
|
| 136 |
-
### Slow Training
|
| 137 |
-
If training is too slow:
|
| 138 |
-
1. Increase `dataloader_num_workers` to 12-16
|
| 139 |
-
2. Ensure you're using bf16 mixed precision
|
| 140 |
-
3. Check that gradient checkpointing is disabled
|
| 141 |
-
4. Verify flash attention is enabled
|
| 142 |
-
|
| 143 |
-
### Convergence Issues
|
| 144 |
-
If loss doesn't converge:
|
| 145 |
-
1. Reduce learning rate by 2x
|
| 146 |
-
2. Increase warmup steps
|
| 147 |
-
3. Check gradient norms in logs
|
| 148 |
-
4. Verify dataset quality
|
| 149 |
-
|
| 150 |
-
## Customization
|
| 151 |
-
|
| 152 |
-
### For Different Dataset Sizes
|
| 153 |
-
Adjust `max_iters` based on your dataset size:
|
| 154 |
-
```python
|
| 155 |
-
# For 1M datapoints with effective batch size 120
|
| 156 |
-
steps_per_epoch = 1000000 // 120 # ~8,333 steps
|
| 157 |
-
max_iters = steps_per_epoch * desired_epochs
|
| 158 |
-
```
|
| 159 |
-
|
| 160 |
-
### For Different GPU Memory
|
| 161 |
-
Adjust batch size and gradient accumulation:
|
| 162 |
-
```python
|
| 163 |
-
# For 40GB A100
|
| 164 |
-
batch_size = 4
|
| 165 |
-
gradient_accumulation_steps = 32 # Effective batch size = 128
|
| 166 |
-
|
| 167 |
-
# For 24GB GPU
|
| 168 |
-
batch_size = 2
|
| 169 |
-
gradient_accumulation_steps = 64 # Effective batch size = 128
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
## Performance Tips
|
| 173 |
-
|
| 174 |
-
1. **Use bf16**: Better than fp16 for A100
|
| 175 |
-
2. **Disable Gradient Checkpointing**: A100 has enough memory
|
| 176 |
-
3. **Use Flash Attention**: Memory efficient attention
|
| 177 |
-
4. **Group by Length**: Better batching efficiency
|
| 178 |
-
5. **Pin Memory**: Faster GPU transfers
|
| 179 |
-
6. **Multiple Workers**: Faster data loading
|
| 180 |
-
|
| 181 |
-
## Expected Timeline
|
| 182 |
-
|
| 183 |
-
- **Large Batch**: 6-8 hours for 1.3 passes
|
| 184 |
-
- **Multiple Passes**: 20-24 hours for 4 passes
|
| 185 |
-
- **Full Dataset (5+ passes)**: 30+ hours
|
| 186 |
-
|
| 187 |
-
## Next Steps
|
| 188 |
-
|
| 189 |
-
After training completes:
|
| 190 |
-
1. Evaluate on validation set
|
| 191 |
-
2. Test generation quality
|
| 192 |
-
3. Push to Hugging Face Hub
|
| 193 |
-
4. Deploy for inference
|
| 194 |
-
|
| 195 |
-
For deployment instructions, see `DEPLOYMENT_GUIDE.md`.
docs/APP_CONFIGURATION_GUIDE.md
DELETED
|
@@ -1,234 +0,0 @@
|
|
| 1 |
-
# ⚙️ App Configuration Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The Trackio app now includes a **Configuration tab** that allows you to set your Hugging Face token and dataset repository directly through the interface, providing an alternative to environment variables.
|
| 6 |
-
|
| 7 |
-
## 🚀 New Features
|
| 8 |
-
|
| 9 |
-
### **Configuration Tab**
|
| 10 |
-
- ✅ **HF Token Input**: Secure password field for your Hugging Face token
|
| 11 |
-
- ✅ **Dataset Repository Input**: Text field for your dataset repository
|
| 12 |
-
- ✅ **Update Configuration**: Apply new settings and reload experiments
|
| 13 |
-
- ✅ **Test Connection**: Verify access to the dataset repository
|
| 14 |
-
- ✅ **Create Dataset**: Create a new dataset repository if it doesn't exist
|
| 15 |
-
|
| 16 |
-
### **Flexible Configuration**
|
| 17 |
-
- ✅ **Environment Variables**: Still supported as fallback
|
| 18 |
-
- ✅ **Interface Input**: New direct input method
|
| 19 |
-
- ✅ **Dynamic Updates**: Change configuration without restarting
|
| 20 |
-
- ✅ **Validation**: Input validation and error handling
|
| 21 |
-
|
| 22 |
-
## 📋 Configuration Tab Usage
|
| 23 |
-
|
| 24 |
-
### **1. Access the Configuration Tab**
|
| 25 |
-
- Open the Trackio app
|
| 26 |
-
- Click on the "⚙️ Configuration" tab
|
| 27 |
-
- You'll see input fields for HF Token and Dataset Repository
|
| 28 |
-
|
| 29 |
-
### **2. Set Your HF Token**
|
| 30 |
-
```
|
| 31 |
-
Hugging Face Token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|
| 32 |
-
```
|
| 33 |
-
- **Type**: Password field (hidden for security)
|
| 34 |
-
- **Required**: Yes (for dataset access)
|
| 35 |
-
- **Format**: Your HF token starting with `hf_`
|
| 36 |
-
- **Help**: Click the help text for instructions on getting your token
|
| 37 |
-
|
| 38 |
-
### **3. Set Your Dataset Repository**
|
| 39 |
-
```
|
| 40 |
-
Dataset Repository: your-username/your-dataset-name
|
| 41 |
-
```
|
| 42 |
-
- **Type**: Text field
|
| 43 |
-
- **Required**: No (defaults to `tonic/trackio-experiments`)
|
| 44 |
-
- **Format**: `username/dataset-name`
|
| 45 |
-
- **Examples**:
|
| 46 |
-
- `tonic/trackio-experiments`
|
| 47 |
-
- `your-username/my-experiments`
|
| 48 |
-
- `your-org/team-experiments`
|
| 49 |
-
|
| 50 |
-
### **4. Use the Action Buttons**
|
| 51 |
-
|
| 52 |
-
#### **Update Configuration**
|
| 53 |
-
- Applies new settings immediately
|
| 54 |
-
- Reloads experiments with new configuration
|
| 55 |
-
- Shows current status and experiment count
|
| 56 |
-
|
| 57 |
-
#### **Test Connection**
|
| 58 |
-
- Verifies access to the dataset repository
|
| 59 |
-
- Tests HF token permissions
|
| 60 |
-
- Shows dataset information and experiment count
|
| 61 |
-
|
| 62 |
-
#### **Create Dataset**
|
| 63 |
-
- Creates a new dataset repository if it doesn't exist
|
| 64 |
-
- Sets up the correct schema for experiments
|
| 65 |
-
- Makes the dataset private by default
|
| 66 |
-
|
| 67 |
-
## 🔧 Configuration Methods
|
| 68 |
-
|
| 69 |
-
### **Method 1: Interface Input (New)**
|
| 70 |
-
1. Go to "⚙️ Configuration" tab
|
| 71 |
-
2. Enter your HF token and dataset repository
|
| 72 |
-
3. Click "Update Configuration"
|
| 73 |
-
4. Verify with "Test Connection"
|
| 74 |
-
|
| 75 |
-
### **Method 2: Environment Variables (Existing)**
|
| 76 |
-
```bash
|
| 77 |
-
# Set environment variables
|
| 78 |
-
export HF_TOKEN=your_hf_token_here
|
| 79 |
-
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 80 |
-
|
| 81 |
-
# Or for HF Spaces, add to Space settings
|
| 82 |
-
HF_TOKEN=your_hf_token_here
|
| 83 |
-
TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 84 |
-
```
|
| 85 |
-
|
| 86 |
-
### **Method 3: Hybrid Approach**
|
| 87 |
-
- Set environment variables as defaults
|
| 88 |
-
- Override specific values through the interface
|
| 89 |
-
- Interface values take precedence over environment variables
|
| 90 |
-
|
| 91 |
-
## 📊 Configuration Priority
|
| 92 |
-
|
| 93 |
-
The app uses this priority order for configuration:
|
| 94 |
-
|
| 95 |
-
1. **Interface Input** (highest priority)
|
| 96 |
-
2. **Environment Variables** (fallback)
|
| 97 |
-
3. **Default Values** (lowest priority)
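
The priority order above could be implemented roughly as follows. This is a sketch under stated assumptions, not the app's actual code: `resolve_setting` is a hypothetical helper, while the environment variable names and the default repository are the ones named in this guide.

```python
import os
from typing import Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"  # default named in this guide


def resolve_setting(interface_value: Optional[str], env_var: str,
                    default: Optional[str] = None) -> Optional[str]:
    """Return the interface value if set, else the environment variable, else the default."""
    if interface_value and interface_value.strip():
        return interface_value.strip()
    env_value = os.environ.get(env_var, "").strip()
    return env_value or default
```

For example, `resolve_setting(ui_repo, "TRACKIO_DATASET_REPO", DEFAULT_DATASET_REPO)` would return the text typed into the interface when present, and `resolve_setting(ui_token, "HF_TOKEN")` would return `None` when no token is available from either source.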
|
| 98 |
-
|
| 99 |
-
## 🛠️ Getting Your HF Token
|
| 100 |
-
|
| 101 |
-
### **Step-by-Step Instructions**
|
| 102 |
-
1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
|
| 103 |
-
2. Click "New token"
|
| 104 |
-
3. Give it a name (e.g., "Trackio Access")
|
| 105 |
-
4. Select "Write" permissions
|
| 106 |
-
5. Click "Generate token"
|
| 107 |
-
6. Copy the token (starts with `hf_`)
|
| 108 |
-
7. Paste it in the app's HF Token field
|
| 109 |
-
|
| 110 |
-
### **Token Permissions**
|
| 111 |
-
- **Read**: Required for loading experiments
|
| 112 |
-
- **Write**: Required for saving experiments
|
| 113 |
-
- **Scope**: Should have access to your dataset repositories
|
| 114 |
-
|
| 115 |
-
## 📁 Dataset Repository Format
|
| 116 |
-
|
| 117 |
-
### **Correct Format**
|
| 118 |
-
```
|
| 119 |
-
username/dataset-name
|
| 120 |
-
```
|
| 121 |
-
|
| 122 |
-
### **Examples**
|
| 123 |
-
- `tonic/trackio-experiments` (default)
|
| 124 |
-
- `your-username/my-experiments`
|
| 125 |
-
- `your-org/team-experiments`
|
| 126 |
-
- `your-username/smollm3-experiments`
|
| 127 |
-
|
| 128 |
-
### **Validation**
|
| 129 |
-
- Must contain exactly one `/`
|
| 130 |
-
- Username must be valid HF username
|
| 131 |
-
- Dataset name must be valid (alphanumeric + hyphens)
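
The format rules above can be checked with a short regular expression. A minimal sketch, assuming the rule exactly as this guide states it (letters, digits, and hyphens on each side of a single slash); `validate_dataset_repo` is a hypothetical helper, and it does not check that the user or organization actually exists on the Hub:

```python
import re

# Rule as stated in this guide: letters, digits, and hyphens around one slash.
_REPO_PATTERN = re.compile(r"[A-Za-z0-9-]+/[A-Za-z0-9-]+")


def validate_dataset_repo(repo_id: str) -> str:
    """Return the trimmed repo id, or raise ValueError if the format is wrong."""
    candidate = repo_id.strip()
    if not _REPO_PATTERN.fullmatch(candidate):
        raise ValueError(
            "Dataset repository must be in format: username/dataset-name"
        )
    return candidate
```

The Hub itself may accept a wider character set than this pattern; the sketch deliberately follows the stricter rule described here.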
|
| 132 |
-
|
| 133 |
-
## 🔍 Testing Your Configuration
|
| 134 |
-
|
| 135 |
-
### **1. Test Connection**
|
| 136 |
-
- Enter your HF token and dataset repository
|
| 137 |
-
- Click "Test Connection"
|
| 138 |
-
- Should show: "✅ Connection successful!"
|
| 139 |
-
|
| 140 |
-
### **2. Create Dataset (if needed)**
|
| 141 |
-
- If dataset doesn't exist, click "Create Dataset"
|
| 142 |
-
- Should show: "✅ Dataset created successfully!"
|
| 143 |
-
|
| 144 |
-
### **3. Update Configuration**
|
| 145 |
-
- Click "Update Configuration"
|
| 146 |
-
- Should show: "✅ Configuration updated successfully!"
|
| 147 |
-
|
| 148 |
-
## 🚨 Troubleshooting
|
| 149 |
-
|
| 150 |
-
### **Issue: "Please provide a Hugging Face token"**
|
| 151 |
-
**Solution**:
|
| 152 |
-
- Enter your HF token in the interface
|
| 153 |
-
- Or set the `HF_TOKEN` environment variable
|
| 154 |
-
|
| 155 |
-
### **Issue: "Connection failed: 401 Unauthorized"**
|
| 156 |
-
**Solutions**:
|
| 157 |
-
1. Check your HF token is correct
|
| 158 |
-
2. Verify the token has read access to the dataset
|
| 159 |
-
3. Ensure the dataset repository exists
|
| 160 |
-
|
| 161 |
-
### **Issue: "Failed to create dataset"**
|
| 162 |
-
**Solutions**:
|
| 163 |
-
1. Check your HF token has write permissions
|
| 164 |
-
2. Verify the username in the repository name
|
| 165 |
-
3. Ensure the dataset name is valid
|
| 166 |
-
|
| 167 |
-
### **Issue: "Dataset repository must be in format: username/dataset-name"**
|
| 168 |
-
**Solution**:
|
| 169 |
-
- Use the correct format: `username/dataset-name`
|
| 170 |
-
- Example: `your-username/my-experiments`
|
| 171 |
-
|
| 172 |
-
## 📈 Benefits
|
| 173 |
-
|
| 174 |
-
### **For Users**
|
| 175 |
-
- ✅ **Easy Setup**: No need to set environment variables
|
| 176 |
-
- ✅ **Visual Interface**: Clear input fields and validation
|
| 177 |
-
- ✅ **Immediate Feedback**: Test connection and see results
|
| 178 |
-
- ✅ **Flexible**: Can change configuration anytime
|
| 179 |
-
|
| 180 |
-
### **For Development**
|
| 181 |
-
- ✅ **Backward Compatible**: Environment variables still work
|
| 182 |
-
- ✅ **Fallback Support**: Graceful degradation
|
| 183 |
-
- ✅ **Error Handling**: Clear error messages
|
| 184 |
-
- ✅ **Validation**: Input validation and testing
|
| 185 |
-
|
| 186 |
-
### **For Deployment**
|
| 187 |
-
- ✅ **HF Spaces Ready**: Works on Hugging Face Spaces
|
| 188 |
-
- ✅ **No Restart Required**: Dynamic configuration updates
|
| 189 |
-
- ✅ **Secure**: Password field for token input
|
| 190 |
-
- ✅ **User-Friendly**: Clear instructions and help text
|
| 191 |
-
|
| 192 |
-
## 🎯 Usage Examples
|
| 193 |
-
|
| 194 |
-
### **Basic Setup**
|
| 195 |
-
1. Open the app
|
| 196 |
-
2. Go to "⚙️ Configuration" tab
|
| 197 |
-
3. Enter your HF token
|
| 198 |
-
4. Enter your dataset repository
|
| 199 |
-
5. Click "Update Configuration"
|
| 200 |
-
6. Click "Test Connection" to verify
|
| 201 |
-
|
| 202 |
-
### **Advanced Setup**
|
| 203 |
-
1. Set environment variables as defaults
|
| 204 |
-
2. Use interface to override specific values
|
| 205 |
-
3. Test connection to verify access
|
| 206 |
-
4. Create dataset if it doesn't exist
|
| 207 |
-
5. Start using the app with persistent storage
|
| 208 |
-
|
| 209 |
-
### **Team Setup**
|
| 210 |
-
1. Create a shared dataset repository
|
| 211 |
-
2. Share the repository name with team
|
| 212 |
-
3. Each team member sets their own HF token
|
| 213 |
-
4. All experiments are stored in the shared dataset
|
| 214 |
-
|
| 215 |
-
## 📋 Configuration Status
|
| 216 |
-
|
| 217 |
-
The app shows current configuration status:
|
| 218 |
-
```
|
| 219 |
-
📊 Dataset: your-username/your-dataset
|
| 220 |
-
🔑 HF Token: Set
|
| 221 |
-
📈 Experiments: 5
|
| 222 |
-
```
|
| 223 |
-
|
| 224 |
-
## 🔄 Updating Configuration
|
| 225 |
-
|
| 226 |
-
You can update configuration at any time:
|
| 227 |
-
1. Go to "⚙️ Configuration" tab
|
| 228 |
-
2. Change HF token or dataset repository
|
| 229 |
-
3. Click "Update Configuration"
|
| 230 |
-
4. Experiments will reload with new settings
|
| 231 |
-
|
| 232 |
-
---
|
| 233 |
-
|
| 234 |
-
**🎉 Your Trackio app is now more flexible and user-friendly with direct configuration input!**
docs/CLOUD_DEPLOYMENT_GUIDE.md
DELETED
|
@@ -1,462 +0,0 @@
|
|
| 1 |
-
# Cloud Deployment Guide for SmolLM3 DPO Training
|
| 2 |
-
|
| 3 |
-
This guide provides the exact sequence of commands to deploy and run SmolLM3 DPO training for 6 epochs on a cloud instance.
|
| 4 |
-
|
| 5 |
-
## Prerequisites
|
| 6 |
-
|
| 7 |
-
### Cloud Instance Requirements
|
| 8 |
-
|
| 9 |
-
- **GPU**: NVIDIA A100, H100, or similar (16GB+ VRAM)
|
| 10 |
-
- **RAM**: 64GB+ system memory
|
| 11 |
-
- **Storage**: 100GB+ SSD storage
|
| 12 |
-
- **OS**: Ubuntu 20.04 or 22.04
|
| 13 |
-
|
| 14 |
-
### Required Information
|
| 15 |
-
|
| 16 |
-
Before starting, gather these details:
|
| 17 |
-
- Your Hugging Face username
|
| 18 |
-
- Your Hugging Face token (with write permissions)
|
| 19 |
-
- Your Trackio Space URL (if using monitoring)
|
| 20 |
-
|
| 21 |
-
## Step-by-Step Deployment
|
| 22 |
-
|
| 23 |
-
### Step 1: Launch Cloud Instance
|
| 24 |
-
|
| 25 |
-
Choose your cloud provider and launch an instance:
|
| 26 |
-
|
| 27 |
-
#### AWS (g5.2xlarge or g5.4xlarge)
|
| 28 |
-
```bash
|
| 29 |
-
# Launch instance with Ubuntu 22.04 and appropriate GPU
|
| 30 |
-
aws ec2 run-instances \
|
| 31 |
-
--image-id ami-0c7217cdde317cfec \
|
| 32 |
-
--instance-type g5.2xlarge \
|
| 33 |
-
--key-name your-key-pair \
|
| 34 |
-
--security-group-ids sg-xxxxxxxxx
|
| 35 |
-
```
|
| 36 |
-
|
| 37 |
-
#### Google Cloud (n1-standard-8 with T4/V100)
|
| 38 |
-
```bash
|
| 39 |
-
gcloud compute instances create smollm3-dpo \
|
| 40 |
-
--zone=us-central1-a \
|
| 41 |
-
--machine-type=n1-standard-8 \
|
| 42 |
-
--accelerator="type=nvidia-tesla-t4,count=1" \
|
| 43 |
-
--image-family=ubuntu-2204-lts \
|
| 44 |
-
--image-project=ubuntu-os-cloud
|
| 45 |
-
```
|
| 46 |
-
|
| 47 |
-
#### Azure (Standard_NC6s_v3)
|
| 48 |
-
```bash
|
| 49 |
-
az vm create \
|
| 50 |
-
--resource-group your-rg \
|
| 51 |
-
--name smollm3-dpo \
|
| 52 |
-
--image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
|
| 53 |
-
--size Standard_NC6s_v3 \
|
| 54 |
-
--admin-username azureuser
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
### Step 2: Connect to Instance
|
| 58 |
-
|
| 59 |
-
```bash
|
| 60 |
-
# SSH to your instance
|
| 61 |
-
ssh -i your-key.pem ubuntu@your-instance-ip
|
| 62 |
-
|
| 63 |
-
# Or for Azure
|
| 64 |
-
ssh azureuser@your-instance-ip
|
| 65 |
-
```
|
| 66 |
-
|
| 67 |
-
### Step 3: Update System and Install Dependencies
|
| 68 |
-
|
| 69 |
-
```bash
|
| 70 |
-
# Update system
|
| 71 |
-
sudo apt-get update
|
| 72 |
-
sudo apt-get upgrade -y
|
| 73 |
-
|
| 74 |
-
# Install system dependencies
|
| 75 |
-
sudo apt-get install -y git curl wget unzip python3 python3-pip python3-venv
|
| 76 |
-
|
| 77 |
-
# Install the NVIDIA Container Toolkit (assumes GPU drivers are already installed)
|
| 78 |
-
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
|
| 79 |
-
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
|
| 80 |
-
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
|
| 81 |
-
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
|
| 82 |
-
|
| 83 |
-
sudo apt-get update
|
| 84 |
-
sudo apt-get install -y nvidia-container-toolkit
|
| 85 |
-
```
|
| 86 |
-
|
| 87 |
-
### Step 4: Clone Repository and Setup Environment
|
| 88 |
-
|
| 89 |
-
```bash
|
| 90 |
-
# Clone your repository
|
| 91 |
-
git clone https://github.com/your-username/flexai-finetune.git
|
| 92 |
-
cd flexai-finetune
|
| 93 |
-
|
| 94 |
-
# Create virtual environment
|
| 95 |
-
python3 -m venv smollm3_env
|
| 96 |
-
source smollm3_env/bin/activate
|
| 97 |
-
|
| 98 |
-
# Install PyTorch with CUDA
|
| 99 |
-
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
|
| 100 |
-
|
| 101 |
-
# Install project dependencies
|
| 102 |
-
pip install -r requirements.txt
|
| 103 |
-
|
| 104 |
-
# Install additional DPO dependencies
|
| 105 |
-
pip install "trl>=0.7.0"
|
| 106 |
-
pip install "peft>=0.4.0"
|
| 107 |
-
pip install "accelerate>=0.20.0"
|
| 108 |
-
```
|
| 109 |
-
|
| 110 |
-
### Step 5: Configure Authentication
|
| 111 |
-
|
| 112 |
-
```bash
|
| 113 |
-
# Set your Hugging Face token
|
| 114 |
-
export HF_TOKEN="your_huggingface_token_here"
|
| 115 |
-
|
| 116 |
-
# Login to Hugging Face
|
| 117 |
-
hf auth login --token "$HF_TOKEN"
|
| 118 |
-
```
|
| 119 |
-
|
| 120 |
-
### Step 6: Create Configuration Files
|
| 121 |
-
|
| 122 |
-
Create the DPO configuration file:
|
| 123 |
-
|
| 124 |
-
```bash
|
| 125 |
-
cat > config/train_smollm3_dpo_6epochs.py << 'EOF'
|
| 126 |
-
"""
|
| 127 |
-
SmolLM3 DPO Training Configuration - 6 Epochs
|
| 128 |
-
Optimized for cloud deployment
|
| 129 |
-
"""
|
| 130 |
-
|
| 131 |
-
from config.train_smollm3_dpo import SmolLM3DPOConfig
|
| 132 |
-
|
| 133 |
-
config = SmolLM3DPOConfig(
|
| 134 |
-
# Model configuration
|
| 135 |
-
model_name="HuggingFaceTB/SmolLM3-3B",
|
| 136 |
-
max_seq_length=4096,
|
| 137 |
-
use_flash_attention=True,
|
| 138 |
-
use_gradient_checkpointing=True,
|
| 139 |
-
|
| 140 |
-
# Training configuration
|
| 141 |
-
batch_size=2,
|
| 142 |
-
gradient_accumulation_steps=8,
|
| 143 |
-
learning_rate=5e-6,
|
| 144 |
-
weight_decay=0.01,
|
| 145 |
-
warmup_steps=100,
|
| 146 |
-
max_iters=None, # Will be calculated based on epochs
|
| 147 |
-
eval_interval=100,
|
| 148 |
-
log_interval=10,
|
| 149 |
-
save_interval=500,
|
| 150 |
-
|
| 151 |
-
# DPO configuration
|
| 152 |
-
beta=0.1,
|
| 153 |
-
max_prompt_length=2048,
|
| 154 |
-
|
| 155 |
-
# Optimizer configuration
|
| 156 |
-
optimizer="adamw",
|
| 157 |
-
beta1=0.9,
|
| 158 |
-
beta2=0.95,
|
| 159 |
-
eps=1e-8,
|
| 160 |
-
|
| 161 |
-
# Scheduler configuration
|
| 162 |
-
scheduler="cosine",
|
| 163 |
-
min_lr=1e-6,
|
| 164 |
-
|
| 165 |
-
# Mixed precision
|
| 166 |
-
fp16=True,
|
| 167 |
-
bf16=False,
|
| 168 |
-
|
| 169 |
-
# Logging and saving
|
| 170 |
-
save_steps=500,
|
| 171 |
-
eval_steps=100,
|
| 172 |
-
logging_steps=10,
|
| 173 |
-
save_total_limit=3,
|
| 174 |
-
|
| 175 |
-
# Evaluation
|
| 176 |
-
eval_strategy="steps",
|
| 177 |
-
metric_for_best_model="eval_loss",
|
| 178 |
-
greater_is_better=False,
|
| 179 |
-
load_best_model_at_end=True,
|
| 180 |
-
|
| 181 |
-
# Data configuration
|
| 182 |
-
data_dir="smoltalk_dataset",
|
| 183 |
-
train_file="train.json",
|
| 184 |
-
validation_file="validation.json",
|
| 185 |
-
|
| 186 |
-
# Chat template configuration
|
| 187 |
-
use_chat_template=True,
|
| 188 |
-
chat_template_kwargs={
|
| 189 |
-
"enable_thinking": False,
|
| 190 |
-
"add_generation_prompt": True
|
| 191 |
-
},
|
| 192 |
-
|
| 193 |
-
# Trackio monitoring configuration
|
| 194 |
-
enable_tracking=True,
|
| 195 |
-
trackio_url="https://your-trackio-space.hf.space", # Change this
|
| 196 |
-
trackio_token=None,
|
| 197 |
-
log_artifacts=True,
|
| 198 |
-
log_metrics=True,
|
| 199 |
-
log_config=True,
|
| 200 |
-
experiment_name="smollm3_dpo_6epochs"
|
| 201 |
-
)
|
| 202 |
-
EOF
|
| 203 |
-
```
|
| 204 |
-
|
| 205 |
-
### Step 7: Download and Prepare Dataset
|
| 206 |
-
|
| 207 |
-
```bash
|
| 208 |
-
# Create dataset preparation script
|
| 209 |
-
cat > prepare_dataset.py << 'EOF'
|
| 210 |
-
from datasets import load_dataset
|
| 211 |
-
import json
|
| 212 |
-
import os
|
| 213 |
-
|
| 214 |
-
# Load SmolTalk dataset
|
| 215 |
-
print('Loading SmolTalk dataset...')
|
| 216 |
-
dataset = load_dataset('HuggingFaceTB/smoltalk')
|
| 217 |
-
|
| 218 |
-
# Create dataset directory
|
| 219 |
-
os.makedirs('smoltalk_dataset', exist_ok=True)
|
| 220 |
-
|
| 221 |
-
# Convert to DPO format (preference pairs)
|
| 222 |
-
def convert_to_dpo_format(example):
|
| 223 |
-
# For SmolTalk, we'll create preference pairs based on response quality
|
| 224 |
-
# This is a simplified example - you may need to adjust based on your needs
|
| 225 |
-
return {
|
| 226 |
-
'prompt': example.get('prompt', ''),
|
| 227 |
-
'chosen': example.get('chosen', ''),
|
| 228 |
-
'rejected': example.get('rejected', '')
|
| 229 |
-
}
|
| 230 |
-
|
| 231 |
-
# Process train split
|
| 232 |
-
train_data = []
|
| 233 |
-
for example in dataset['train']:
|
| 234 |
-
dpo_example = convert_to_dpo_format(example)
|
| 235 |
-
if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
|
| 236 |
-
train_data.append(dpo_example)
|
| 237 |
-
|
| 238 |
-
# Process validation split
|
| 239 |
-
val_data = []
|
| 240 |
-
for example in dataset['validation']:
|
| 241 |
-
dpo_example = convert_to_dpo_format(example)
|
| 242 |
-
if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
|
| 243 |
-
val_data.append(dpo_example)
|
| 244 |
-
|
| 245 |
-
# Save to files
|
| 246 |
-
with open('smoltalk_dataset/train.json', 'w') as f:
|
| 247 |
-
json.dump(train_data, f, indent=2)
|
| 248 |
-
|
| 249 |
-
with open('smoltalk_dataset/validation.json', 'w') as f:
|
| 250 |
-
json.dump(val_data, f, indent=2)
|
| 251 |
-
|
| 252 |
-
print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
|
| 253 |
-
EOF
|
| 254 |
-
|
| 255 |
-
# Run dataset preparation
|
| 256 |
-
python prepare_dataset.py
|
| 257 |
-
```
|
| 258 |
-
|
| 259 |
-
### Step 8: Calculate Training Parameters
|
| 260 |
-
|
| 261 |
-
```bash
|
| 262 |
-
# Calculate training steps based on epochs
|
| 263 |
-
TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
|
| 264 |
-
BATCH_SIZE=2
|
| 265 |
-
GRADIENT_ACCUMULATION_STEPS=8
|
| 266 |
-
MAX_EPOCHS=6
|
| 267 |
-
EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
|
| 268 |
-
STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
|
| 269 |
-
MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
|
| 270 |
-
|
| 271 |
-
echo "Training Configuration:"
|
| 272 |
-
echo " Total samples: $TOTAL_SAMPLES"
|
| 273 |
-
echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
|
| 274 |
-
echo " Steps per epoch: $STEPS_PER_EPOCH"
|
| 275 |
-
echo " Total training steps: $MAX_STEPS"
|
| 276 |
-
echo " Training epochs: $MAX_EPOCHS"
|
| 277 |
-
```
|
| 278 |
-
|
| 279 |
-
### Step 9: Start DPO Training
|
| 280 |
-
|
| 281 |
-
```bash
|
| 282 |
-
# Start training with all parameters
|
| 283 |
-
python train.py config/train_smollm3_dpo_6epochs.py \
|
| 284 |
-
--dataset_dir smoltalk_dataset \
|
| 285 |
-
--out_dir /output-checkpoint \
|
| 286 |
-
--init_from scratch \
|
| 287 |
-
--max_iters $MAX_STEPS \
|
| 288 |
-
--batch_size $BATCH_SIZE \
|
| 289 |
-
--learning_rate 5e-6 \
|
| 290 |
-
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
|
| 291 |
-
--max_seq_length 4096 \
|
| 292 |
-
--save_steps 500 \
|
| 293 |
-
--eval_steps 100 \
|
| 294 |
-
--logging_steps 10 \
|
| 295 |
-
--enable_tracking \
|
| 296 |
-
--trackio_url "https://your-trackio-space.hf.space" \
|
| 297 |
-
--experiment_name "smollm3_dpo_6epochs"
|
| 298 |
-
```
|
| 299 |
-
|
| 300 |
-
### Step 10: Push Model to Hugging Face Hub
|
| 301 |
-
|
| 302 |
-
```bash
|
| 303 |
-
# Push the trained model
|
| 304 |
-
python push_to_huggingface.py /output-checkpoint "your-username/smollm3-dpo-6epochs" \
|
| 305 |
-
--token "$HF_TOKEN" \
|
| 306 |
-
--trackio-url "https://your-trackio-space.hf.space" \
|
| 307 |
-
--experiment-name "smollm3_dpo_6epochs"
|
| 308 |
-
```
|
| 309 |
-
|
| 310 |
-
### Step 11: Test the Uploaded Model
|
| 311 |
-
|
| 312 |
-
```bash
|
| 313 |
-
# Test the model
|
| 314 |
-
python -c "
|
| 315 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 316 |
-
import torch
|
| 317 |
-
|
| 318 |
-
print('Loading uploaded model...')
|
| 319 |
-
model = AutoModelForCausalLM.from_pretrained('your-username/smollm3-dpo-6epochs', torch_dtype=torch.float16, device_map='auto')
|
| 320 |
-
tokenizer = AutoTokenizer.from_pretrained('your-username/smollm3-dpo-6epochs')
|
| 321 |
-
|
| 322 |
-
print('Testing model generation...')
|
| 323 |
-
prompt = 'Hello, how are you?'
|
| 324 |
-
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
|
| 325 |
-
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
|
| 326 |
-
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 327 |
-
print(f'Prompt: {prompt}')
|
| 328 |
-
print(f'Response: {response}')
|
| 329 |
-
print('✅ Model test completed successfully!')
|
| 330 |
-
"
|
| 331 |
-
```
|
| 332 |
-
|
| 333 |
-
## Complete One-Line Deployment
|
| 334 |
-
|
| 335 |
-
If you want to run everything automatically, use the deployment script:
|
| 336 |
-
|
| 337 |
-
```bash
|
| 338 |
-
# Make script executable
|
| 339 |
-
chmod +x cloud_deployment.sh
|
| 340 |
-
|
| 341 |
-
# Edit configuration in the script first
|
| 342 |
-
nano cloud_deployment.sh
|
| 343 |
-
# Change these variables:
|
| 344 |
-
# - REPO_NAME="your-username/smollm3-dpo-6epochs"
|
| 345 |
-
# - TRACKIO_URL="https://your-trackio-space.hf.space"
|
| 346 |
-
# - HF_TOKEN="your_hf_token_here"
|
| 347 |
-
|
| 348 |
-
# Run the complete deployment
|
| 349 |
-
./cloud_deployment.sh
|
| 350 |
-
```
|
| 351 |
-
|
| 352 |
-
## Monitoring and Debugging
|
| 353 |
-
|
| 354 |
-
### Check GPU Usage
|
| 355 |
-
|
| 356 |
-
```bash
|
| 357 |
-
# Monitor GPU usage during training
|
| 358 |
-
watch -n 1 nvidia-smi
|
| 359 |
-
```
|
| 360 |
-
|
| 361 |
-
### Check Training Logs
|
| 362 |
-
|
| 363 |
-
```bash
|
| 364 |
-
# Monitor training progress
|
| 365 |
-
tail -f training.log
|
| 366 |
-
|
| 367 |
-
# Check system resources
|
| 368 |
-
htop
|
| 369 |
-
```
|
| 370 |
-
|
| 371 |
-
### Monitor Trackio
|
| 372 |
-
|
| 373 |
-
```bash
|
| 374 |
-
# Check if Trackio is logging properly
|
| 375 |
-
curl -s "https://your-trackio-space.hf.space" | grep -i "experiment"
|
| 376 |
-
```
|
| 377 |
-
|
| 378 |
-
## Expected Timeline
|
| 379 |
-
|
| 380 |
-
- **Setup**: 15-30 minutes
|
| 381 |
-
- **Dataset preparation**: 5-10 minutes
|
| 382 |
-
- **Training (6 epochs)**: 4-8 hours (depending on GPU)
|
| 383 |
-
- **Model upload**: 10-30 minutes
|
| 384 |
-
- **Testing**: 5-10 minutes
|
| 385 |
-
|
| 386 |
-
## Troubleshooting
|
| 387 |
-
|
| 388 |
-
### Common Issues
|
| 389 |
-
|
| 390 |
-
#### 1. Out of Memory (OOM)
|
| 391 |
-
```bash
|
| 392 |
-
# Reduce batch size
|
| 393 |
-
BATCH_SIZE=1
|
| 394 |
-
GRADIENT_ACCUMULATION_STEPS=16
|
| 395 |
-
|
| 396 |
-
# Or use gradient checkpointing
|
| 397 |
-
# Already enabled in config
|
| 398 |
-
```
|
| 399 |
-
|
| 400 |
-
#### 2. Slow Training
|
| 401 |
-
```bash
|
| 402 |
-
# Check GPU utilization
|
| 403 |
-
nvidia-smi
|
| 404 |
-
|
| 405 |
-
# Check if mixed precision is working
|
| 406 |
-
# Look for "fp16" in training logs
|
| 407 |
-
```
|
| 408 |
-
|
| 409 |
-
#### 3. Dataset Issues
|
| 410 |
-
```bash
|
| 411 |
-
# Check dataset format
|
| 412 |
-
head -n 5 smoltalk_dataset/train.json
|
| 413 |
-
|
| 414 |
-
# Verify dataset size
|
| 415 |
-
python -c "import json; print(len(json.load(open('smoltalk_dataset/train.json'))))"
|
| 416 |
-
```
|
| 417 |
-
|
| 418 |
-
#### 4. Authentication Issues
|
| 419 |
-
```bash
|
| 420 |
-
# Test HF token
|
| 421 |
-
python -c "
|
| 422 |
-
from huggingface_hub import HfApi
|
| 423 |
-
api = HfApi(token='$HF_TOKEN')
|
| 424 |
-
print('Token is valid for user:', api.whoami()['name'])
|
| 425 |
-
"
|
| 426 |
-
```
|
| 427 |
-
|
| 428 |
-
## Cost Estimation
|
| 429 |
-
|
| 430 |
-
### AWS (g5.2xlarge)
|
| 431 |
-
- **Instance**: $0.526/hour
|
| 432 |
-
- **Training time**: 6 hours
|
| 433 |
-
- **Total cost**: ~$3.16
|
| 434 |
-
|
| 435 |
-
### Google Cloud (n1-standard-8 + T4)
|
| 436 |
-
- **Instance**: $0.38/hour
|
| 437 |
-
- **Training time**: 6 hours
|
| 438 |
-
- **Total cost**: ~$2.28
|
| 439 |
-
|
| 440 |
-
### Azure (Standard_NC6s_v3)
|
| 441 |
-
- **Instance**: $0.90/hour
|
| 442 |
-
- **Training time**: 6 hours
|
| 443 |
-
- **Total cost**: ~$5.40
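
Each total above is the hourly rate multiplied by the assumed six hours of training. A minimal sketch of that arithmetic, using only the rates quoted in this section; the names `HOURLY_RATES` and `estimate_cost` are illustrative and do not come from the repository:

```python
# Hourly rates (USD) and training time as quoted in this guide.
# Real prices vary by region and change over time.
HOURLY_RATES = {
    "AWS (g5.2xlarge)": 0.526,
    "Google Cloud (n1-standard-8 + T4)": 0.38,
    "Azure (Standard_NC6s_v3)": 0.90,
}
TRAINING_HOURS = 6


def estimate_cost(hourly_rate: float, hours: float) -> float:
    """Return the instance cost in USD, rounded to cents."""
    return round(hourly_rate * hours, 2)


if __name__ == "__main__":
    for name, rate in HOURLY_RATES.items():
        print(f"{name}: ~${estimate_cost(rate, TRAINING_HOURS):.2f}")
```

The estimate covers training time only; setup, dataset preparation, and upload time from the timeline would add to the billed hours.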
|
| 444 |
-
|
| 445 |
-
## Next Steps
|
| 446 |
-
|
| 447 |
-
After successful deployment:
|
| 448 |
-
|
| 449 |
-
1. **Monitor training** in your Trackio Space
|
| 450 |
-
2. **Check model repository** on Hugging Face Hub
|
| 451 |
-
3. **Test the model** with different prompts
|
| 452 |
-
4. **Share your model** with the community
|
| 453 |
-
5. **Iterate and improve** based on results
|
| 454 |
-
|
| 455 |
-
## Support
|
| 456 |
-
|
| 457 |
-
- **Training issues**: Check logs and GPU utilization
|
| 458 |
-
- **Upload issues**: Verify HF token and repository permissions
|
| 459 |
-
- **Monitoring issues**: Check Trackio Space configuration
|
| 460 |
-
- **Performance issues**: Adjust batch size and learning rate
|
| 461 |
-
|
| 462 |
-
Your SmolLM3 DPO model will be ready for use after training completes!
|
docs/CLOUD_TRAINING_GUIDE.md
DELETED
|
@@ -1,440 +0,0 @@
|
|
| 1 |
-
# Cloud Training Guide for OpenHermes-FR Dataset
|
| 2 |
-
|
| 3 |
-
This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
|
| 4 |
-
|
| 5 |
-
## Overview
|
| 6 |
-
|
| 7 |
-
The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
|
| 8 |
-
|
| 9 |
-
- ✅ **Cloud Instance Setup** - Complete environment configuration
|
| 10 |
-
- ✅ **Dataset Integration** - Automatic loading and filtering
|
| 11 |
-
- ✅ **Training Configuration** - Optimized for French instruction tuning
|
| 12 |
-
- ✅ **Monitoring Integration** - Trackio experiment tracking
|
| 13 |
-
- ✅ **Model Deployment** - Push to Hugging Face Hub
|
| 14 |
-
|
| 15 |
-
## Dataset Information
|
| 16 |
-
|
| 17 |
-
### Schema
|
| 18 |
-
```json
|
| 19 |
-
{
|
| 20 |
-
"prompt": "Explique la différence entre la photosynthèse C3 et C4.",
|
| 21 |
-
"accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
|
| 22 |
-
"bad_prompt_detected": false,
|
| 23 |
-
"bad_response_detected": false,
|
| 24 |
-
"bad_entry": false
|
| 25 |
-
}
|
| 26 |
-
```
|
| 27 |
-
|
| 28 |
-
### Key Features
|
| 29 |
-
- **Size**: 799,875 examples (~1.4GB)
|
| 30 |
-
- **Language**: 100% French
|
| 31 |
-
- **Quality**: GPT-4o generated responses with automatic filtering
|
| 32 |
-
- **License**: ODC-BY 1.0
|
| 33 |
-
|
| 34 |
-
## Cloud Instance Setup
|
| 35 |
-
|
| 36 |
-
### 1. Choose Your Cloud Provider
|
| 37 |
-
|
| 38 |
-
#### **AWS EC2 (Recommended)**
|
| 39 |
-
```bash
|
| 40 |
-
# Launch instance with GPU
|
| 41 |
-
# Recommended: g4dn.xlarge or g5.xlarge
|
| 42 |
-
# AMI: Deep Learning AMI (Ubuntu 20.04)
|
| 43 |
-
```
|
| 44 |
-
|
| 45 |
-
#### **Google Cloud Platform**
|
| 46 |
-
```bash
|
| 47 |
-
# Launch instance with GPU
|
| 48 |
-
# Recommended: n1-standard-4 with Tesla T4 or V100
|
| 49 |
-
```
|
| 50 |
-
|
| 51 |
-
#### **Azure**
|
| 52 |
-
```bash
|
| 53 |
-
# Launch instance with GPU
|
| 54 |
-
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
### 2. Instance Specifications
|
| 58 |
-
|
| 59 |
-
#### **Minimum Requirements**
|
| 60 |
-
- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
|
| 61 |
-
- **RAM**: 32GB+ system memory
|
| 62 |
-
- **Storage**: 100GB+ SSD
|
| 63 |
-
- **CPU**: 8+ cores
|
| 64 |
-
|
| 65 |
-
#### **Recommended Specifications**
|
| 66 |
-
- **GPU**: A100 (40GB) or H100 (80GB)
|
| 67 |
-
- **RAM**: 64GB+ system memory
|
| 68 |
-
- **Storage**: 200GB+ NVMe SSD
|
| 69 |
-
- **CPU**: 16+ cores
|
| 70 |
-
|
| 71 |
-
### 3. Environment Setup
|
| 72 |
-
|
| 73 |
-
```bash
|
| 74 |
-
# Update system
|
| 75 |
-
sudo apt update && sudo apt upgrade -y
|
| 76 |
-
|
| 77 |
-
# Install CUDA (if not pre-installed)
|
| 78 |
-
# Follow NVIDIA CUDA installation guide for your GPU
|
| 79 |
-
|
| 80 |
-
# Install Python dependencies
|
| 81 |
-
sudo apt install python3-pip python3-venv git -y
|
| 82 |
-
|
| 83 |
-
# Create virtual environment
|
| 84 |
-
python3 -m venv smollm3_env
|
| 85 |
-
source smollm3_env/bin/activate
|
| 86 |
-
|
| 87 |
-
# Clone repository
|
| 88 |
-
git clone <your-repo-url>
|
| 89 |
-
cd <your-repo-directory>
|
| 90 |
-
|
| 91 |
-
# Install dependencies
|
| 92 |
-
pip install -r requirements.txt
|
| 93 |
-
|
| 94 |
-
# Install additional dependencies for cloud training
|
| 95 |
-
pip install accelerate transformers datasets huggingface_hub
|
| 96 |
-
```
|
| 97 |
-
|
| 98 |
-
## Training Configuration
|
| 99 |
-
|
| 100 |
-
### 1. Use the OpenHermes-FR Config
|
| 101 |
-
|
| 102 |
-
The repository includes a specialized configuration for the OpenHermes-FR dataset:
|
| 103 |
-
|
| 104 |
-
```bash
|
| 105 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 106 |
-
--enable_tracking \
|
| 107 |
-
--trackio_url "https://your-space.hf.space" \
|
| 108 |
-
--experiment_name "smollm3_fr_openhermes_v1"
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
### 2. Configuration Details
|
| 112 |
-
|
| 113 |
-
The `config/train_smollm3_openhermes_fr.py` includes:
|
| 114 |
-
|
| 115 |
-
#### **Dataset Configuration**
|
| 116 |
-
```python
|
| 117 |
-
dataset_name: str = "legmlai/openhermes-fr"
|
| 118 |
-
dataset_split: str = "train"
|
| 119 |
-
input_field: str = "prompt"
|
| 120 |
-
target_field: str = "accepted_completion"
|
| 121 |
-
filter_bad_entries: bool = True
|
| 122 |
-
bad_entry_field: str = "bad_entry"
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
#### **Training Optimization**
|
| 126 |
-
```python
|
| 127 |
-
batch_size: int = 2 # Reduced for French text (longer sequences)
|
| 128 |
-
gradient_accumulation_steps: int = 8 # Maintains effective batch size
|
| 129 |
-
learning_rate: float = 1e-5 # Lower for instruction tuning
|
| 130 |
-
max_iters: int = 2000 # More iterations for large dataset
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
#### **Monitoring Integration**
|
| 134 |
-
```python
|
| 135 |
-
enable_tracking: bool = True
|
| 136 |
-
experiment_name: str = "smollm3_openhermes_fr"
|
| 137 |
-
```
|
| 138 |
-
|
| 139 |
-
## Training Commands
|
| 140 |
-
|
| 141 |
-
### Basic Training
|
| 142 |
-
```bash
|
| 143 |
-
python train.py config/train_smollm3_openhermes_fr.py
|
| 144 |
-
```
|
| 145 |
-
|
| 146 |
-
### Training with Monitoring
|
| 147 |
-
```bash
|
| 148 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 149 |
-
--enable_tracking \
|
| 150 |
-
--trackio_url "https://your-trackio-space.hf.space" \
|
| 151 |
-
--experiment_name "smollm3_fr_openhermes_v1"
|
| 152 |
-
```
|
| 153 |
-
|
| 154 |
-
### Training with Custom Parameters
|
| 155 |
-
```bash
|
| 156 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 157 |
-
--batch_size 4 \
|
| 158 |
-
--learning_rate 2e-5 \
|
| 159 |
-
--max_iters 3000 \
|
| 160 |
-
--enable_tracking \
|
| 161 |
-
--trackio_url "https://your-trackio-space.hf.space" \
|
| 162 |
-
--experiment_name "smollm3_fr_high_lr"
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
### Training with Checkpoint Resume
|
| 166 |
-
```bash
|
| 167 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 168 |
-
--init_from resume \
|
| 169 |
-
--enable_tracking \
|
| 170 |
-
--trackio_url "https://your-trackio-space.hf.space" \
|
| 171 |
-
--experiment_name "smollm3_fr_resume"
|
| 172 |
-
```
|
| 173 |
-
|
| 174 |
-
## Dataset Processing
|
| 175 |
-
|
| 176 |
-
### Automatic Filtering
|
| 177 |
-
|
| 178 |
-
The training script automatically:
|
| 179 |
-
- ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
|
| 180 |
-
- ✅ **Filters** out bad entries (`bad_entry = true`)
|
| 181 |
-
- ✅ **Splits** data into train/validation/test (98/1/1)
|
| 182 |
-
- ✅ **Formats** prompts and completions for instruction tuning
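
The filtering and splitting steps above can be sketched on plain Python dictionaries. This is an illustration under stated assumptions, not the training script's actual code: `filter_and_split` and `format_example` are hypothetical names, the prompt/completion template is assumed, and the real pipeline presumably works on a Hugging Face `datasets` object rather than a list.

```python
import random


def filter_and_split(records, seed=42):
    """Drop rows flagged as bad, then split the rest 98/1/1."""
    clean = [r for r in records if not r.get("bad_entry", False)]
    random.Random(seed).shuffle(clean)  # deterministic shuffle for a given seed
    n_val = len(clean) // 100
    n_test = len(clean) // 100
    n_train = len(clean) - n_val - n_test
    return {
        "train": clean[:n_train],
        "validation": clean[n_train:n_train + n_val],
        "test": clean[n_train + n_val:],
    }


def format_example(record):
    """Join prompt and completion into one string (template is an assumption)."""
    return f"{record['prompt']}\n{record['accepted_completion']}"


if __name__ == "__main__":
    rows = [
        {
            "prompt": f"Question {i}",
            "accepted_completion": f"Réponse {i}",
            "bad_entry": i % 10 == 0,
        }
        for i in range(1000)
    ]
    splits = filter_and_split(rows)
    print({name: len(part) for name, part in splits.items()})
    print(format_example(splits["train"][0]))
```

Filtering before splitting keeps flagged rows out of the validation and test sets as well as the training set.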
|
| 183 |
-
|
| 184 |
-
### Manual Dataset Inspection
|
| 185 |
-
|
| 186 |
-
```python
|
| 187 |
-
from datasets import load_dataset
|
| 188 |
-
|
| 189 |
-
# Load dataset
|
| 190 |
-
dataset = load_dataset("legmlai/openhermes-fr")
|
| 191 |
-
|
| 192 |
-
# Check dataset info
|
| 193 |
-
print(f"Dataset size: {len(dataset['train'])}")
|
| 194 |
-
print(f"Sample columns: {dataset['train'].column_names}")
|
| 195 |
-
|
| 196 |
-
# Check filtering
|
| 197 |
-
bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
|
| 198 |
-
print(f"Bad entries: {len(bad_entries)}")
|
| 199 |
-
|
| 200 |
-
# Sample data
|
| 201 |
-
sample = dataset['train'][0]
|
| 202 |
-
print(f"Prompt: {sample['prompt']}")
|
| 203 |
-
print(f"Completion: {sample['accepted_completion']}")
|
| 204 |
-
```
|
| 205 |
-
|
| 206 |
-
## Monitoring and Tracking
|
| 207 |
-
|
| 208 |
-
### Trackio Integration
|
| 209 |
-
|
| 210 |
-
The training automatically logs:
|
| 211 |
-
- **Training metrics**: Loss, accuracy, learning rate
|
| 212 |
-
- **System metrics**: GPU memory, CPU usage
|
| 213 |
-
- **Dataset info**: Size, filtering statistics
|
| 214 |
-
- **Model checkpoints**: Regular saves with metadata
|
| 215 |
-
|
| 216 |
-
### View Training Progress
|
| 217 |
-
|
| 218 |
-
1. **Trackio Space**: Visit your Trackio Space URL
|
| 219 |
-
2. **Experiment Details**: Check the "View Experiments" tab
|
| 220 |
-
3. **Metrics**: Monitor loss curves and system usage
|
| 221 |
-
4. **Logs**: Download training logs for analysis
|
| 222 |
-
|
| 223 |
-
## Model Deployment
|
| 224 |
-
|
| 225 |
-
### Push to Hugging Face Hub
|
| 226 |
-
|
| 227 |
-
After training, deploy your model:
|
| 228 |
-
|
| 229 |
-
```bash
|
| 230 |
-
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
|
| 231 |
-
--trackio-url "https://your-trackio-space.hf.space" \
|
| 232 |
-
--experiment-name "smollm3_fr_openhermes_v1"
|
| 233 |
-
```
|
| 234 |
-
|
| 235 |
-
### Use Your Model
|
| 236 |
-
|
| 237 |
-
```python
|
| 238 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 239 |
-
|
| 240 |
-
# Load your fine-tuned model
|
| 241 |
-
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
|
| 242 |
-
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
|
| 243 |
-
|
| 244 |
-
# Generate French text
|
| 245 |
-
prompt = "Expliquez le concept de l'intelligence artificielle."
|
| 246 |
-
inputs = tokenizer(prompt, return_tensors="pt")
|
| 247 |
-
outputs = model.generate(**inputs, max_new_tokens=200)
|
| 248 |
-
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 249 |
-
```
|
| 250 |
-
|
| 251 |
-
## Performance Optimization
|
| 252 |
-
|
| 253 |
-
### GPU Memory Management
|
| 254 |
-
|
| 255 |
-
```bash
|
| 256 |
-
# Monitor GPU usage
|
| 257 |
-
nvidia-smi -l 1
|
| 258 |
-
|
| 259 |
-
# Optimize for your GPU
|
| 260 |
-
# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
|
| 261 |
-
# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
|
| 262 |
-
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
|
| 263 |
-
```
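
The three suggested settings trade per-step batch size against gradient accumulation so that each optimizer step still sees the same number of examples. A small sketch of that relationship; the helper name `effective_batch_size` and the `num_gpus` parameter are illustrative assumptions:

```python
def effective_batch_size(batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_gpus


# (batch_size, gradient_accumulation_steps) pairs suggested above.
SUGGESTED = {"16GB": (2, 8), "24GB": (4, 4), "40GB+": (8, 2)}

if __name__ == "__main__":
    for vram, (bs, accum) in SUGGESTED.items():
        print(f"{vram}: {bs} x {accum} = {effective_batch_size(bs, accum)}")
```

Because the product stays constant, moving between these settings changes memory use and step time without changing the effective batch size.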
|
| 264 |
-
|
| 265 |
-
### Training Speed
|
| 266 |
-
|
| 267 |
-
```python
|
| 268 |
-
# Use mixed precision (enabled by default)
|
| 269 |
-
fp16: bool = True
|
| 270 |
-
|
| 271 |
-
# Enable gradient checkpointing (enabled by default)
|
| 272 |
-
use_gradient_checkpointing: bool = True
|
| 273 |
-
|
| 274 |
-
# Use flash attention (enabled by default)
|
| 275 |
-
use_flash_attention: bool = True
|
| 276 |
-
```
|
| 277 |
-
|
| 278 |
-
## Troubleshooting
|
| 279 |
-
|
| 280 |
-
### Common Issues
|
| 281 |
-
|
| 282 |
-
#### 1. **Out of Memory (OOM)**
|
| 283 |
-
```bash
|
| 284 |
-
# Reduce batch size
|
| 285 |
-
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
|
| 286 |
-
|
| 287 |
-
# Increase gradient accumulation
|
| 288 |
-
# Edit config: gradient_accumulation_steps = 16
|
| 289 |
-
```
|
| 290 |
-
|
| 291 |
-
#### 2. **Slow Training**
|
| 292 |
-
```bash
|
| 293 |
-
# Check GPU utilization
|
| 294 |
-
nvidia-smi
|
| 295 |
-
|
| 296 |
-
# Verify data loading
|
| 297 |
-
# Check if dataset is cached locally
|
| 298 |
-
```
|
| 299 |
-
|
| 300 |
-
#### 3. **Dataset Loading Issues**
|
| 301 |
-
```bash
|
| 302 |
-
# Clear cache
|
| 303 |
-
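rm -rf ~/.cache/huggingface/datasets/  # datasets cache only; deleting all of ~/.cache/huggingface/ also removes the stored login token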
|
| 304 |
-
|
| 305 |
-
# Check internet connection
|
| 306 |
-
# Verify dataset name: "legmlai/openhermes-fr"
|
| 307 |
-
```
|
| 308 |
-
|
| 309 |
-
#### 4. **Monitoring Connection Issues**
|
| 310 |
-
```bash
|
| 311 |
-
# Test Trackio connection
|
| 312 |
-
curl -I https://your-trackio-space.hf.space
|
| 313 |
-
|
| 314 |
-
# Check token permissions
|
| 315 |
-
# Verify experiment name format
|
| 316 |
-
```
|
| 317 |
-
|
| 318 |
-
### Debug Mode
|
| 319 |
-
|
| 320 |
-
```bash
|
| 321 |
-
# Enable debug logging
|
| 322 |
-
export LOG_LEVEL=DEBUG
|
| 323 |
-
python train.py config/train_smollm3_openhermes_fr.py
|
| 324 |
-
```
|
| 325 |
-
|
| 326 |
-
## Cost Optimization
|
| 327 |
-
|
| 328 |
-
### Cloud Provider Tips
|
| 329 |
-
|
| 330 |
-
#### **AWS EC2**
|
| 331 |
-
- Use Spot Instances for cost savings
|
| 332 |
-
- Monitor usage with CloudWatch
|
| 333 |
-
- Use appropriate instance types
|
| 334 |
-
|
| 335 |
-
#### **Google Cloud Platform**
|
| 336 |
-
- Use Preemptible VMs for non-critical training
|
| 337 |
-
- Monitor with Cloud Monitoring
|
| 338 |
-
- Use committed use discounts
|
| 339 |
-
|
| 340 |
-
#### **Azure**
|
| 341 |
-
- Use Spot VMs for cost optimization
|
| 342 |
-
- Monitor with Azure Monitor
|
| 343 |
-
- Use reserved instances for long training
|
| 344 |
-
|
| 345 |
-
### Training Time Estimates
|
| 346 |
-
|
| 347 |
-
| GPU Type | Batch Size | Estimated Time |
|
| 348 |
-
|----------|------------|----------------|
|
| 349 |
-
| Tesla T4 (16GB) | 2 | 8-12 hours |
|
| 350 |
-
| V100 (32GB) | 4 | 4-6 hours |
|
| 351 |
-
| A100 (40GB) | 8 | 2-3 hours |
|
| 352 |
-
| H100 (80GB) | 16 | 1-2 hours |
|
| 353 |
-
|
| 354 |
-
## Security Best Practices
|
| 355 |
-
|
| 356 |
-
### Token Management
|
| 357 |
-
```bash
|
| 358 |
-
# Use environment variables
|
| 359 |
-
export HF_TOKEN="your_token_here"
|
| 360 |
-
export TRACKIO_TOKEN="your_trackio_token"
|
| 361 |
-
|
| 362 |
-
# Don't hardcode in scripts
|
| 363 |
-
# Use IAM roles when possible
|
| 364 |
-
```
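
On the Python side, the same advice amounts to reading the token from the environment at startup and failing early when it is missing. A minimal sketch; `read_token` and `mask` are hypothetical helpers, not functions from this repository:

```python
import os


def read_token(var_name: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    Raises RuntimeError when the variable is unset or empty, so a missing
    token fails at startup rather than later as an upload error.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set; export it before running.")
    return token


def mask(token: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


if __name__ == "__main__":
    print("Using token:", mask(read_token()))
```

Logging only the masked value keeps the full token out of training logs and experiment trackers.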
|
| 365 |
-
|
| 366 |
-
### Data Privacy
|
| 367 |
-
```bash
|
| 368 |
-
# Use private repositories for sensitive models
|
| 369 |
-
python push_to_huggingface.py model username/private-model --private
|
| 370 |
-
|
| 371 |
-
# Secure your cloud instance
|
| 372 |
-
# Use VPC and security groups
|
| 373 |
-
```
|
| 374 |
-
|
| 375 |
-
## Complete Workflow Example
|
| 376 |
-
|
| 377 |
-
### 1. Setup Cloud Instance
|
| 378 |
-
```bash
|
| 379 |
-
# Launch GPU instance
|
| 380 |
-
# Install dependencies
|
| 381 |
-
git clone <your-repo>
|
| 382 |
-
cd <your-repo>
|
| 383 |
-
pip install -r requirements.txt
|
| 384 |
-
```
|
| 385 |
-
|
| 386 |
-
### 2. Train Model
|
| 387 |
-
```bash
|
| 388 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 389 |
-
--enable_tracking \
|
| 390 |
-
--trackio_url "https://your-space.hf.space" \
|
| 391 |
-
--experiment_name "smollm3_fr_v1"
|
| 392 |
-
```
|
| 393 |
-
|
| 394 |
-
### 3. Deploy Model
|
| 395 |
-
```bash
|
| 396 |
-
python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
|
| 397 |
-
--trackio-url "https://your-space.hf.space" \
|
| 398 |
-
--experiment-name "smollm3_fr_v1"
|
| 399 |
-
```
|
| 400 |
-
|
| 401 |
-
### 4. Test Model
|
| 402 |
-
```python
|
| 403 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 404 |
-
|
| 405 |
-
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
|
| 406 |
-
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
|
| 407 |
-
|
| 408 |
-
# Test French generation
|
| 409 |
-
prompt = "Qu'est-ce que l'apprentissage automatique?"
|
| 410 |
-
inputs = tokenizer(prompt, return_tensors="pt")
|
| 411 |
-
outputs = model.generate(**inputs, max_new_tokens=100)
|
| 412 |
-
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 413 |
-
```
|
| 414 |
-
|
| 415 |
-
## Support and Resources
|
| 416 |
-
|
| 417 |
-
### Documentation
|
| 418 |
-
- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
|
| 419 |
-
- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
|
| 420 |
-
- [Trackio Monitoring](https://github.com/Josephrp/trackio)
|
| 421 |
-
|
| 422 |
-
### Community
|
| 423 |
-
- [Hugging Face Forums](https://discuss.huggingface.co/)
|
| 424 |
-
- [Transformers Documentation](https://huggingface.co/docs/transformers/)
|
| 425 |
-
|
| 426 |
-
### Examples
|
| 427 |
-
- [French Language Models](https://huggingface.co/models?search=french)
|
| 428 |
-
- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
|
| 429 |
-
|
| 430 |
-
## Conclusion
|
| 431 |
-
|
| 432 |
-
This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
|
| 433 |
-
|
| 434 |
-
- ✅ **Complete Setup** - From cloud instance to model deployment
|
| 435 |
-
- ✅ **Optimized Configuration** - Tailored for French instruction tuning
|
| 436 |
-
- ✅ **Monitoring Integration** - Trackio experiment tracking
|
| 437 |
-
- ✅ **Cost Optimization** - Tips for efficient cloud usage
|
| 438 |
-
- ✅ **Troubleshooting** - Solutions for common issues
|
| 439 |
-
|
| 440 |
-
Start training your French language model today!
|
docs/Configuration_Management.md
ADDED
|
@@ -0,0 +1,29 @@
|
|
| 1 |
+
```mermaid
|
| 2 |
+
graph LR
|
| 3 |
+
Configuration_Management["Configuration Management"]
|
| 4 |
+
Training_Orchestration["Training Orchestration"]
|
| 5 |
+
Training_Orchestration -- "retrieves configuration from" --> Configuration_Management
|
| 6 |
+
click Configuration_Management href "https://github.com//Josephrp/SmolFactory/blob/main/SmolFactory/docs/blob/Configuration_Management.md" "Details"
|
| 7 |
+
```
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
## Details
|
| 12 |
+
|
| 13 |
+
One paragraph explaining the functionality which is represented by this graph. What the main flow is and what is its purpose.
|
| 14 |
+
|
| 15 |
+
### Configuration Management [[Expand]](./Configuration_Management.md)
|
| 16 |
+
This component, primarily embodied by the `SmolLM3Config` dataclass and the `get_config` function in `config/train_smollm3.py`, is responsible for the centralized definition, loading, validation, and provision of access to all training parameters, model specifications, data paths, and hyperparameters. It supports loading both base and custom configurations, ensuring that all necessary settings are available and correctly formatted for the training and fine-tuning processes.
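
As a rough illustration of the pattern described here (a dataclass that holds defaults and validates them, plus a loader that applies custom overrides), consider the sketch below. It is not the actual `SmolLM3Config` or `get_config`: the class name `ExampleConfig`, the function `get_example_config`, and the choice of fields are assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for a training config; fields are assumptions."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 2
    gradient_accumulation_steps: int = 8
    learning_rate: float = 1e-5
    max_iters: int = 2000

    def __post_init__(self):
        # Validate on construction so bad values fail before training starts.
        if self.batch_size < 1 or self.gradient_accumulation_steps < 1:
            raise ValueError("batch_size and gradient_accumulation_steps must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.max_iters < 1:
            raise ValueError("max_iters must be >= 1")


def get_example_config(**overrides) -> ExampleConfig:
    """Return the base config with custom overrides applied and re-validated."""
    return replace(ExampleConfig(), **overrides)


if __name__ == "__main__":
    print(get_example_config(batch_size=4, learning_rate=2e-5))
```

`dataclasses.replace` builds a new instance through `__init__`, so overrides pass through the same validation as the defaults, and an unknown field name raises `TypeError` instead of being silently ignored.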
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
**Related Classes/Methods**: _None_
|
| 20 |
+
|
| 21 |
+
### Training Orchestration
|
| 22 |
+
This component represents the main scripts or modules responsible for initiating and coordinating the training and fine-tuning processes. It acts as the primary entry point for different training runs, retrieving necessary configurations and orchestrating the overall training pipeline.
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
**Related Classes/Methods**: _None_
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
|
docs/DATASET_AUTOMATION_FIX.md
DELETED
|
@@ -1,218 +0,0 @@
|
|
| 1 |
-
# Dataset Configuration Automation Fix
|
| 2 |
-
|
| 3 |
-
## Problem Description
|
| 4 |
-
|
| 5 |
-
The original launch script required users to manually specify their username in the dataset repository name, which was:
|
| 6 |
-
1. **Error-prone**: Users had to remember their username
|
| 7 |
-
2. **Inconsistent**: Different users might use different naming conventions
|
| 8 |
-
3. **Manual**: Required extra steps in the setup process
|
| 9 |
-
|
| 10 |
-
## Solution Implementation
|
| 11 |
-
|
| 12 |
-
### Automatic Dataset Repository Creation
|
| 13 |
-
|
| 14 |
-
We've implemented a Python-based solution that automatically:
|
| 15 |
-
|
| 16 |
-
1. **Extracts username from token**: Uses the HF API to get the username from the validated token
|
| 17 |
-
2. **Creates dataset repository**: Automatically creates `username/trackio-experiments` or custom name
|
| 18 |
-
3. **Sets environment variables**: Automatically configures `TRACKIO_DATASET_REPO`
|
| 19 |
-
4. **Provides customization**: Allows users to customize the dataset name if desired
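
The naming and environment steps in the list above can be sketched without any network access. This is an illustration only: `resolve_dataset_repo` and `export_dataset_repo` are hypothetical names, and the username is assumed to have been obtained already from the token.

```python
import os

DEFAULT_DATASET_NAME = "trackio-experiments"


def resolve_dataset_repo(username: str, dataset_name: str = "") -> str:
    """Build `username/dataset-name`, falling back to the default name."""
    if not username or "/" in username:
        raise ValueError(f"invalid username: {username!r}")
    name = dataset_name.strip() or DEFAULT_DATASET_NAME
    return f"{username}/{name}"


def export_dataset_repo(repo_id: str) -> None:
    """Set TRACKIO_DATASET_REPO for this process and the children it starts."""
    os.environ["TRACKIO_DATASET_REPO"] = repo_id


if __name__ == "__main__":
    repo = resolve_dataset_repo("username", "my-custom-experiments")
    export_dataset_repo(repo)
    print(repo)
```

A variable set this way is visible only to the Python process and its child processes. A calling shell script would need to capture the printed repository id if it wants the value in its own environment.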
|
| 20 |
-
|
| 21 |
-
### Key Components
|
| 22 |
-
|
| 23 |
-
#### 1. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Main Dataset Setup Script
|
| 24 |
-
- Automatically detects username from HF token
|
| 25 |
-
- Creates dataset repository with proper permissions
|
| 26 |
-
- Supports custom dataset names
|
| 27 |
-
- Sets environment variables for other scripts
|
| 28 |
-
|
| 29 |
-
#### 2. **Updated `launch.sh`** - Enhanced User Experience
|
| 30 |
-
- Automatically creates dataset repository
|
| 31 |
-
- Provides options for default or custom dataset names
|
| 32 |
-
- Fallback to manual input if automatic creation fails
|
| 33 |
-
- Clear user feedback and progress indicators
|
| 34 |
-
|
| 35 |
-
#### 3. **Python API Integration** - Consistent Authentication
|
| 36 |
-
- Uses `HfApi(token=token)` for direct token authentication
|
| 37 |
-
- Avoids environment variable conflicts
|
| 38 |
-
- Consistent error handling across all scripts
|
| 39 |
-
|
| 40 |
-
## Usage Examples
|
| 41 |
-
|
| 42 |
-
### Automatic Dataset Creation (Default)
|
| 43 |
-
|
| 44 |
-
```bash
|
| 45 |
-
# The launch script now automatically:
|
| 46 |
-
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here
|
| 47 |
-
|
| 48 |
-
# Creates: username/trackio-experiments
|
| 49 |
-
# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
|
| 50 |
-
```
|
| 51 |
-
|
| 52 |
-
### Custom Dataset Name
|
| 53 |
-
|
| 54 |
-
```bash
|
| 55 |
-
# Create with custom name
|
| 56 |
-
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments
|
| 57 |
-
|
| 58 |
-
# Creates: username/my-custom-experiments
|
| 59 |
-
# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
### Launch Script Integration
|
| 63 |
-
|
| 64 |
-
The launch script now provides a seamless experience:
|
| 65 |
-
|
| 66 |
-
```bash
|
| 67 |
-
./launch.sh
|
| 68 |
-
|
| 69 |
-
# Step 3: Experiment Details
|
| 70 |
-
# - Automatically creates dataset repository
|
| 71 |
-
# - Option to use default or custom name
|
| 72 |
-
# - No manual username input required
|
| 73 |
-
```
|
| 74 |
-
|
| 75 |
-
## Features
|
| 76 |
-
|
| 77 |
-
### ✅ **Automatic Username Detection**
|
| 78 |
-
- Extracts username from HF token using Python API
|
| 79 |
-
- No manual username input required
|
| 80 |
-
- Consistent across all scripts
|
| 81 |
-
|
| 82 |
-
### ✅ **Flexible Dataset Naming**
|
| 83 |
-
- Default: `username/trackio-experiments`
|
| 84 |
-
- Custom: `username/custom-name`
|
| 85 |
-
- User choice during setup
|
| 86 |
-
|
| 87 |
-
### ✅ **Robust Error Handling**
|
| 88 |
-
- Graceful fallback to manual input
|
| 89 |
-
- Clear error messages
|
| 90 |
-
- Token validation before creation
|
| 91 |
-
|
| 92 |
-
### ✅ **Environment Integration**
|
| 93 |
-
- Automatically sets `TRACKIO_DATASET_REPO`
|
| 94 |
-
- Compatible with existing scripts
|
| 95 |
-
- No manual configuration required
|
| 96 |
-
|
| 97 |
-
### ✅ **Cross-Platform Compatibility**
|
| 98 |
-
- Works on Windows, Linux, macOS
|
| 99 |
-
- Uses Python API instead of CLI
|
| 100 |
-
- Consistent behavior across platforms
|
| 101 |
-
|
| 102 |
-
## Technical Implementation
|
| 103 |
-
|
| 104 |
-
### Token Authentication Flow
|
| 105 |
-
|
| 106 |
-
```python
|
| 107 |
-
# 1. Direct token authentication
|
| 108 |
-
api = HfApi(token=token)
|
| 109 |
-
|
| 110 |
-
# 2. Extract username
|
| 111 |
-
user_info = api.whoami()
|
| 112 |
-
username = user_info.get("name", user_info.get("username"))
|
| 113 |
-
|
| 114 |
-
# 3. Create repository
|
| 115 |
-
create_repo(
|
| 116 |
-
repo_id=f"{username}/{dataset_name}",
|
| 117 |
-
repo_type="dataset",
|
| 118 |
-
token=token,
|
| 119 |
-
exist_ok=True,
|
| 120 |
-
private=False
|
| 121 |
-
)
|
| 122 |
-
```
|
| 123 |
-
|
| 124 |
-
### Launch Script Integration
|
| 125 |
-
|
| 126 |
-
```bash
|
| 127 |
-
# Automatic dataset creation
|
| 128 |
-
if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
|
| 129 |
-
TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 130 |
-
print_status "Dataset repository created successfully"
|
| 131 |
-
else
|
| 132 |
-
# Fallback to manual input
|
| 133 |
-
get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
|
| 134 |
-
fi
|
| 135 |
-
```
|
| 136 |
-
|
| 137 |
-
## User Experience Improvements
|
| 138 |
-
|
| 139 |
-
### Before (Manual Process)
|
| 140 |
-
1. User enters HF token
|
| 141 |
-
2. User manually types username
|
| 142 |
-
3. User manually types dataset repository name
|
| 143 |
-
4. User manually configures environment variables
|
| 144 |
-
5. Risk of typos and inconsistencies
|
| 145 |
-
|
| 146 |
-
### After (Automated Process)
|
| 147 |
-
1. User enters HF token
|
| 148 |
-
2. System automatically detects username
|
| 149 |
-
3. System automatically creates dataset repository
|
| 150 |
-
4. System automatically sets environment variables
|
| 151 |
-
5. Option to customize dataset name if desired
|
| 152 |
-
|
| 153 |
-
## Error Handling
|
| 154 |
-
|
| 155 |
-
### Common Scenarios
|
| 156 |
-
|
| 157 |
-
| Scenario | Action | User Experience |
|
| 158 |
-
|----------|--------|-----------------|
|
| 159 |
-
| Valid token | ✅ Automatic creation | Seamless setup |
|
| 160 |
-
| Invalid token | ❌ Clear error message | Helpful feedback |
|
| 161 |
-
| Network issues | ⚠️ Retry with fallback | Graceful degradation |
|
| 162 |
-
| Repository exists | ℹ️ Use existing | No conflicts |
|
| 163 |
-
|
| 164 |
-
### Fallback Mechanisms
|
| 165 |
-
|
| 166 |
-
1. **Token validation fails**: Clear error message with troubleshooting steps
|
| 167 |
-
2. **Dataset creation fails**: Fallback to manual input
|
| 168 |
-
3. **Network issues**: Retry with exponential backoff
|
| 169 |
-
4. **Permission issues**: Clear guidance on token permissions
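
The "retry with exponential backoff" fallback can be sketched as a small wrapper. This is a generic illustration, not the repository's implementation: `retry_with_backoff`, its parameters, and the choice of which exceptions count as network errors are all assumptions.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure.

    Only ConnectionError and TimeoutError are retried; any other exception
    propagates immediately. The last error is re-raised once all attempts
    are used up.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_create():
        # Simulated network call that fails twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("temporary network failure")
        return "created"

    print(retry_with_backoff(flaky_create, base_delay=0.1))
```

Passing `sleep` as a parameter keeps the wrapper easy to test, and restricting retries to network-type errors means an invalid token or a permission problem still fails immediately with its own message.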
|
| 170 |
-
|
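
The "retry with exponential backoff" fallback listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `retry_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, doubling the wait after each failure.

    Hypothetical helper: only ConnectionError is treated as retryable here.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so the caller can fall back
                # to manual input.
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; a caller would wrap the dataset-creation call in a zero-argument function and hand it to `retry_with_backoff`.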
| 171 |
-
## Benefits
|
| 172 |
-
|
| 173 |
-
### For Users
|
| 174 |
-
- **Simplified Setup**: No manual username input required
|
| 175 |
-
- **Reduced Errors**: Automatic username detection eliminates typos
|
| 176 |
-
- **Consistent Naming**: Standardized repository naming conventions
|
| 177 |
-
- **Better UX**: Clear progress indicators and feedback
|
| 178 |
-
|
| 179 |
-
### For Developers
|
| 180 |
-
- **Maintainable Code**: Python API instead of CLI dependencies
|
| 181 |
-
- **Cross-Platform**: Works consistently across operating systems
|
| 182 |
-
- **Extensible**: Easy to add new features and customizations
|
| 183 |
-
- **Testable**: Comprehensive test coverage
|
| 184 |
-
|
| 185 |
-
### For the System
|
| 186 |
-
- **Reliable**: Robust error handling and fallback mechanisms
|
| 187 |
-
- **Secure**: Direct token authentication without environment conflicts
|
| 188 |
-
- **Scalable**: Easy to extend for additional repository types
|
| 189 |
-
- **Integrated**: Seamless integration with existing pipeline
|
| 190 |
-
|
| 191 |
-
## Migration Guide
|
| 192 |
-
|
| 193 |
-
### For Existing Users
|
| 194 |
-
|
| 195 |
-
No migration required! The system automatically:
|
| 196 |
-
- Detects existing repositories
|
| 197 |
-
- Uses existing repositories if they exist
|
| 198 |
-
- Creates new repositories only when needed
|
| 199 |
-
|
| 200 |
-
### For New Users
|
| 201 |
-
|
| 202 |
-
The setup is now completely automated:
|
| 203 |
-
1. Run `./launch.sh`
|
| 204 |
-
2. Enter your HF token
|
| 205 |
-
3. Choose dataset naming preference
|
| 206 |
-
4. System handles everything else automatically
|
| 207 |
-
|
| 208 |
-
## Future Enhancements
|
| 209 |
-
|
| 210 |
-
- [ ] Support for organization repositories
|
| 211 |
-
- [ ] Multiple dataset repositories per user
|
| 212 |
-
- [ ] Dataset repository templates
|
| 213 |
-
- [ ] Advanced repository configuration options
|
| 214 |
-
- [ ] Repository sharing and collaboration features
|
| 215 |
-
|
| 216 |
-
---
|
| 217 |
-
|
| 218 |
-
**Note**: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.
|
docs/DATASET_COMPONENTS_VERIFICATION.md
DELETED
|
@@ -1,235 +0,0 @@
|
|
| 1 |
-
# Dataset Components Verification
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document verifies that all important dataset components have been properly implemented and are working correctly.
|
| 6 |
-
|
| 7 |
-
## ✅ **Verified Components**
|
| 8 |
-
|
| 9 |
-
### 1. **Initial Experiment Data** ✅ IMPLEMENTED
|
| 10 |
-
|
| 11 |
-
**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_initial_experiment_data()` function
|
| 12 |
-
|
| 13 |
-
**What it does**:
|
| 14 |
-
- Creates comprehensive sample experiment data
|
| 15 |
-
- Includes realistic training metrics (loss, accuracy, GPU usage, etc.)
|
| 16 |
-
- Contains proper experiment parameters (model name, batch size, learning rate, etc.)
|
| 17 |
-
- Includes experiment logs and artifacts structure
|
| 18 |
-
- Uploads data to HF Dataset using `datasets` library
|
| 19 |
-
|
| 20 |
-
**Sample Data Structure**:
|
| 21 |
-
```json
|
| 22 |
-
{
|
| 23 |
-
"experiment_id": "exp_20250120_143022",
|
| 24 |
-
"name": "smollm3-finetune-demo",
|
| 25 |
-
"description": "SmolLM3 fine-tuning experiment demo with comprehensive metrics tracking",
|
| 26 |
-
"created_at": "2025-01-20T14:30:22.123456",
|
| 27 |
-
"status": "completed",
|
| 28 |
-
"metrics": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"step\": 100, \"metrics\": {\"loss\": 1.15, \"grad_norm\": 10.5, \"learning_rate\": 5e-6, \"num_tokens\": 1000000.0, \"mean_token_accuracy\": 0.76, \"epoch\": 0.1, \"total_tokens\": 1000000.0, \"throughput\": 2000000.0, \"step_time\": 0.5, \"batch_size\": 2, \"seq_len\": 4096, \"token_acc\": 0.76, \"gpu_memory_allocated\": 15.2, \"gpu_memory_reserved\": 70.1, \"gpu_utilization\": 85.2, \"cpu_percent\": 2.7, \"memory_percent\": 10.1}}]",
|
| 29 |
-
"parameters": "{\"model_name\": \"HuggingFaceTB/SmolLM3-3B\", \"max_seq_length\": 4096, \"batch_size\": 2, \"learning_rate\": 5e-6, \"epochs\": 3, \"dataset\": \"OpenHermes-FR\", \"trainer_type\": \"SFTTrainer\", \"hardware\": \"GPU (H100/A100)\", \"mixed_precision\": true, \"gradient_checkpointing\": true, \"flash_attention\": true}",
|
| 30 |
-
"artifacts": "[]",
|
| 31 |
-
"logs": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Training started successfully\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Model loaded and configured\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Dataset loaded and preprocessed\"}]",
|
| 32 |
-
"last_updated": "2025-01-20T14:30:22.123456"
|
| 33 |
-
}
|
| 34 |
-
```
|
| 35 |
-
|
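
In the sample record above, `metrics`, `parameters`, `artifacts`, and `logs` are stored as JSON-encoded strings rather than nested objects. A rough sketch of building and decoding such a record is shown below. The function names `build_experiment_record` and `decode_experiment_record` are hypothetical, and the initial `"running"` status is an assumption for illustration.

```python
import json
from datetime import datetime
from typing import Any, Dict, List, Optional

# Fields that the sample record stores as JSON-encoded strings.
JSON_FIELDS = ("metrics", "parameters", "artifacts", "logs")


def build_experiment_record(
    experiment_id: str,
    name: str,
    description: str,
    metrics: List[Dict[str, Any]],
    parameters: Dict[str, Any],
    logs: List[Dict[str, Any]],
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Return a flat, string-valued row shaped like the sample above."""
    timestamp = (now or datetime.now()).isoformat()
    return {
        "experiment_id": experiment_id,
        "name": name,
        "description": description,
        "created_at": timestamp,
        "status": "running",  # assumed initial status
        "metrics": json.dumps(metrics),
        "parameters": json.dumps(parameters),
        "artifacts": json.dumps([]),
        "logs": json.dumps(logs),
        "last_updated": timestamp,
    }


def decode_experiment_record(record: Dict[str, str]) -> Dict[str, Any]:
    """Parse the JSON-encoded fields back into Python objects."""
    decoded: Dict[str, Any] = dict(record)
    for field in JSON_FIELDS:
        decoded[field] = json.loads(record[field])
    return decoded
```

Keeping every column a plain string presumably gives each row the same flat schema regardless of which metrics a run logs; readers of the dataset then need the decode step before plotting.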
| 36 |
-
**Test Result**: ✅ Successfully uploaded to `Tonic/test-dataset-complete`
|
| 37 |
-
|
| 38 |
-
### 2. **README Templates** ✅ IMPLEMENTED
|
| 39 |
-
|
| 40 |
-
**Location**:
|
| 41 |
-
- Template: `templates/datasets/readme.md`
|
| 42 |
-
- Implementation: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_dataset_readme()` function
|
| 43 |
-
|
| 44 |
-
**What it does**:
|
| 45 |
-
- Uses comprehensive README template from `templates/datasets/readme.md`
|
| 46 |
-
- Falls back to basic README if template doesn't exist
|
| 47 |
-
- Includes dataset schema documentation
|
| 48 |
-
- Provides usage examples and integration information
|
| 49 |
-
- Uploads README to dataset repository using `huggingface_hub`
|
| 50 |
-
|
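
The template-with-fallback behaviour described above might look roughly like this. It is a sketch under assumptions: `load_dataset_readme` and `FALLBACK_README` are hypothetical names, and the fallback text is a placeholder rather than the project's real basic README.

```python
from pathlib import Path

# Placeholder fallback text (assumption, not the project's actual README).
FALLBACK_README = "# Trackio Experiments Dataset\n\nExperiment tracking data.\n"


def load_dataset_readme(template_path: str = "templates/datasets/readme.md") -> bytes:
    """Return README content as UTF-8 bytes, using the template when present."""
    path = Path(template_path)
    if path.is_file():
        text = path.read_text(encoding="utf-8")
    else:
        # Template missing: fall back to the basic README.
        text = FALLBACK_README
    return text.encode("utf-8")
```

Returning bytes is deliberate: `huggingface_hub.upload_file` interprets a plain `str` passed as `path_or_fileobj` as a local file path, so in-memory content should be handed over as bytes or a file object.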
| 51 |
-
**Template Features**:
|
| 52 |
-
- Dataset schema documentation
|
| 53 |
-
- Metrics structure examples
|
| 54 |
-
- Integration instructions
|
| 55 |
-
- Privacy and license information
|
| 56 |
-
- Sample experiment entries
|
| 57 |
-
|
| 58 |
-
**Test Result**: ✅ Successfully added README to `Tonic/test-dataset-complete`
|
| 59 |
-
|
| 60 |
-
### 3. **Dataset Repository Creation** ✅ IMPLEMENTED
|
| 61 |
-
|
| 62 |
-
**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `create_dataset_repository()` function
|
| 63 |
-
|
| 64 |
-
**What it does**:
|
| 65 |
-
- Creates HF Dataset repository with proper permissions
|
| 66 |
-
- Handles existing repositories gracefully
|
| 67 |
-
- Sets up public dataset for easier sharing
|
| 68 |
-
- Uses Python API (`huggingface_hub.create_repo`)
|
| 69 |
-
|
| 70 |
-
**Test Result**: ✅ Successfully created dataset repositories
|
| 71 |
-
|
| 72 |
-
### 4. **Automatic Username Detection** ✅ IMPLEMENTED
|
| 73 |
-
|
| 74 |
-
**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `get_username_from_token()` function
|
| 75 |
-
|
| 76 |
-
**What it does**:
|
| 77 |
-
- Extracts username from HF token using Python API
|
| 78 |
-
- Uses `HfApi(token=token).whoami()`
|
| 79 |
-
- Handles both `name` and `username` fields
|
| 80 |
-
- Provides clear error messages
|
| 81 |
-
|
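
The handling of both the `name` and `username` fields can be illustrated with a small pure function that operates on the dictionary returned by `whoami()`. The name `extract_username` and the error wording are assumptions for this sketch; it makes no network call itself.

```python
from typing import Any, Mapping


def extract_username(user_info: Mapping[str, Any]) -> str:
    """Pick the account name out of a whoami-style response dictionary."""
    # Prefer "name", then fall back to "username" if it is missing or empty.
    username = user_info.get("name") or user_info.get("username")
    if not username:
        raise ValueError(
            "Could not determine username from token; "
            "check that the token is valid and has read access."
        )
    return str(username)
```

In use, the dictionary would come from `HfApi(token=token).whoami()`, and the `ValueError` gives the caller a single place to print troubleshooting guidance.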
| 82 |
-
**Test Result**: ✅ Successfully detected username "Tonic"
|
| 83 |
-
|
| 84 |
-
### 5. **Environment Variable Integration** ✅ IMPLEMENTED
|
| 85 |
-
|
| 86 |
-
**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `setup_trackio_dataset()` function
|
| 87 |
-
|
| 88 |
-
**What it does**:
|
| 89 |
-
- Sets `TRACKIO_DATASET_REPO` environment variable
|
| 90 |
-
- Supports both environment and command-line token sources
|
| 91 |
-
- Provides clear feedback on environment setup
|
| 92 |
-
|
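
One detail worth spelling out: an environment variable set inside a Python process is not visible to the process that launched it. The sketch below demonstrates this with a stand-in child script and shows one common alternative, having the child print the repository ID so the parent can capture it. `CHILD_SCRIPT`, `run_setup_and_capture_repo`, `demo`, and the `demo-user/...` repository ID are all hypothetical; this is not how `setup_hf_dataset.py` is confirmed to work.

```python
import os
import subprocess
import sys
from typing import Tuple

# Stand-in for a setup script: it sets the variable in its own process
# and prints the repository ID on stdout.
CHILD_SCRIPT = (
    "import os\n"
    "os.environ['TRACKIO_DATASET_REPO'] = 'demo-user/trackio-experiments'\n"
    "print(os.environ['TRACKIO_DATASET_REPO'])\n"
)


def run_setup_and_capture_repo() -> str:
    """Run the child and return the repository ID it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> Tuple[str, bool]:
    """Return (captured repo ID, whether the child's variable leaked to us)."""
    os.environ.pop("TRACKIO_DATASET_REPO", None)
    repo_id = run_setup_and_capture_repo()
    leaked = "TRACKIO_DATASET_REPO" in os.environ
    # The parent has to set the variable itself from the captured output.
    os.environ["TRACKIO_DATASET_REPO"] = repo_id
    return repo_id, leaked
```

The same idea applies to a shell caller: capturing the script's output with command substitution and exporting it is more reliable than expecting the child to modify the shell's environment.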
| 93 |
-
**Test Result**: ✅ Successfully set `TRACKIO_DATASET_REPO=Tonic/test-dataset-complete`
|
| 94 |
-
|
| 95 |
-
### 6. **Launch Script Integration** ✅ IMPLEMENTED
|
| 96 |
-
|
| 97 |
-
**Location**: `launch.sh` - Dataset creation section
|
| 98 |
-
|
| 99 |
-
**What it does**:
|
| 100 |
-
- Automatically calls dataset setup script
|
| 101 |
-
- Provides user options for default or custom dataset names
|
| 102 |
-
- Falls back to manual input if automatic creation fails
|
| 103 |
-
- Integrates seamlessly with the training pipeline
|
| 104 |
-
|
| 105 |
-
**Features**:
|
| 106 |
-
- Automatic dataset creation
|
| 107 |
-
- Custom dataset name support
|
| 108 |
-
- Graceful error handling
|
| 109 |
-
- Clear user feedback
|
| 110 |
-
|
| 111 |
-
## 🔧 **Technical Implementation Details**
|
| 112 |
-
|
| 113 |
-
### Token Authentication Flow
|
| 114 |
-
|
| 115 |
-
```python
|
| 116 |
-
from huggingface_hub import HfApi, create_repo, upload_file
from datasets import Dataset
# 1. Direct token authentication
|
| 117 |
-
api = HfApi(token=token)
|
| 118 |
-
|
| 119 |
-
# 2. Extract username
|
| 120 |
-
user_info = api.whoami()
|
| 121 |
-
username = user_info.get("name", user_info.get("username"))
|
| 122 |
-
|
| 123 |
-
# 3. Create repository
|
| 124 |
-
create_repo(
|
| 125 |
-
repo_id=f"{username}/{dataset_name}",
|
| 126 |
-
repo_type="dataset",
|
| 127 |
-
token=token,
|
| 128 |
-
exist_ok=True,
|
| 129 |
-
private=False
|
| 130 |
-
)
|
| 131 |
-
|
| 132 |
-
# 4. Upload data
|
| 133 |
-
dataset = Dataset.from_list(initial_experiments)
|
| 134 |
-
dataset.push_to_hub(repo_id, token=token, private=False)
|
| 135 |
-
|
| 136 |
-
# 5. Upload README
|
| 137 |
-
upload_file(
|
| 138 |
-
path_or_fileobj=readme_content.encode("utf-8"),  # pass bytes; a plain str would be treated as a local file path
|
| 139 |
-
path_in_repo="README.md",
|
| 140 |
-
repo_id=repo_id,
|
| 141 |
-
repo_type="dataset",
|
| 142 |
-
token=token
|
| 143 |
-
)
|
| 144 |
-
```
|
| 145 |
-
|
| 146 |
-
### Error Handling
|
| 147 |
-
|
| 148 |
-
- **Token validation**: Clear error messages for invalid tokens
|
| 149 |
-
- **Repository creation**: Handles existing repositories gracefully
|
| 150 |
-
- **Data upload**: Fallback mechanisms for upload failures
|
| 151 |
-
- **README upload**: Graceful handling of template issues
|
| 152 |
-
|
| 153 |
-
### Cross-Platform Compatibility
|
| 154 |
-
|
| 155 |
-
- **Windows**: Tested and working on Windows PowerShell
|
| 156 |
-
- **Linux**: Compatible with bash scripts
|
| 157 |
-
- **macOS**: Compatible with zsh/bash
|
| 158 |
-
|
| 159 |
-
## 📊 **Test Results**
|
| 160 |
-
|
| 161 |
-
### Successful Test Run
|
| 162 |
-
|
| 163 |
-
```bash
|
| 164 |
-
$ python scripts/dataset_tonic/setup_hf_dataset.py "$HF_TOKEN" test-dataset-complete
|
| 165 |
-
|
| 166 |
-
🚀 Setting up Trackio Dataset Repository
|
| 167 |
-
==================================================
|
| 168 |
-
🔍 Getting username from token...
|
| 169 |
-
✅ Authenticated as: Tonic
|
| 170 |
-
🔧 Creating dataset repository: Tonic/test-dataset-complete
|
| 171 |
-
✅ Successfully created dataset repository: Tonic/test-dataset-complete
|
| 172 |
-
✅ Set TRACKIO_DATASET_REPO=Tonic/test-dataset-complete
|
| 173 |
-
📊 Adding initial experiment data...
|
| 174 |
-
Creating parquet from Arrow format: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 93.77ba/s]
|
| 175 |
-
Uploading the dataset shards: 100%|█████████████████████████████████████| 1/1 [00:01<00:00, 1.39s/ shards]
|
| 176 |
-
✅ Successfully uploaded initial experiment data to Tonic/test-dataset-complete
|
| 177 |
-
✅ Successfully added README to Tonic/test-dataset-complete
|
| 178 |
-
✅ Successfully added initial experiment data
|
| 179 |
-
|
| 180 |
-
🎉 Dataset setup complete!
|
| 181 |
-
📊 Dataset URL: https://huggingface.co/datasets/Tonic/test-dataset-complete
|
| 182 |
-
🔧 Repository ID: Tonic/test-dataset-complete
|
| 183 |
-
```
|
| 184 |
-
|
| 185 |
-
### Verified Dataset Repository
|
| 186 |
-
|
| 187 |
-
**URL**: https://huggingface.co/datasets/Tonic/test-dataset-complete
|
| 188 |
-
|
| 189 |
-
**Contents**:
|
| 190 |
-
- ✅ README.md with comprehensive documentation
|
| 191 |
-
- ✅ Initial experiment data with realistic metrics
|
| 192 |
-
- ✅ Proper dataset schema
|
| 193 |
-
- ✅ Public repository for easy access
|
| 194 |
-
|
| 195 |
-
## 🎯 **Integration Points**
|
| 196 |
-
|
| 197 |
-
### 1. **Trackio Space Integration**
|
| 198 |
-
- Dataset repository automatically configured
|
| 199 |
-
- Environment variables set for Space deployment
|
| 200 |
-
- Compatible with Trackio monitoring interface
|
| 201 |
-
|
| 202 |
-
### 2. **Training Pipeline Integration**
|
| 203 |
-
- `TRACKIO_DATASET_REPO` environment variable set
|
| 204 |
-
- Compatible with monitoring scripts
|
| 205 |
-
- Ready for experiment logging
|
| 206 |
-
|
| 207 |
-
### 3. **Launch Script Integration**
|
| 208 |
-
- Seamless integration with `launch.sh`
|
| 209 |
-
- Automatic dataset creation during setup
|
| 210 |
-
- User-friendly configuration options
|
| 211 |
-
|
| 212 |
-
## ✅ **Verification Summary**
|
| 213 |
-
|
| 214 |
-
| Component | Status | Location | Test Result |
|
| 215 |
-
|-----------|--------|----------|-------------|
|
| 216 |
-
| Initial Experiment Data | ✅ Implemented | `setup_hf_dataset.py` | ✅ Uploaded successfully |
|
| 217 |
-
| README Templates | ✅ Implemented | `templates/datasets/readme.md` | ✅ Added to repository |
|
| 218 |
-
| Dataset Repository Creation | ✅ Implemented | `setup_hf_dataset.py` | ✅ Created successfully |
|
| 219 |
-
| Username Detection | ✅ Implemented | `setup_hf_dataset.py` | ✅ Detected "Tonic" |
|
| 220 |
-
| Environment Variables | ✅ Implemented | `setup_hf_dataset.py` | ✅ Set correctly |
|
| 221 |
-
| Launch Script Integration | ✅ Implemented | `launch.sh` | ✅ Integrated |
|
| 222 |
-
| Error Handling | ✅ Implemented | All functions | ✅ Graceful fallbacks |
|
| 223 |
-
| Cross-Platform Support | ✅ Implemented | Python API | ✅ Windows/Linux/macOS |
|
| 224 |
-
|
| 225 |
-
## 🚀 **Next Steps**
|
| 226 |
-
|
| 227 |
-
The dataset components are now **fully implemented and verified**. Users can:
|
| 228 |
-
|
| 229 |
-
1. **Run the launch script**: `./launch.sh`
|
| 230 |
-
2. **Get automatic dataset creation**: No manual username input required
|
| 231 |
-
3. **Receive comprehensive documentation**: README templates included
|
| 232 |
-
4. **Start with sample data**: Initial experiment data provided
|
| 233 |
-
5. **Monitor experiments**: Trackio integration ready
|
| 234 |
-
|
| 235 |
-
**All important components are properly implemented and working correctly!** 🎉
|
docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md
DELETED
|
@@ -1,393 +0,0 @@
|
|
| 1 |
-
# Deployment Components Verification
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document verifies that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
|
| 6 |
-
|
| 7 |
-
## ✅ **Trackio Spaces Deployment - Verified Components**
|
| 8 |
-
|
| 9 |
-
### 1. **Space Creation** ✅ IMPLEMENTED
|
| 10 |
-
|
| 11 |
-
**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `create_space()` function
|
| 12 |
-
|
| 13 |
-
**What it does**:
|
| 14 |
-
- Creates HF Space using latest Python API (`create_repo`)
|
| 15 |
-
- Falls back to CLI method if API fails
|
| 16 |
-
- Handles authentication and username extraction
|
| 17 |
-
- Sets proper Space configuration (Gradio SDK, CPU hardware)
|
| 18 |
-
|
| 19 |
-
**Key Features**:
|
| 20 |
-
- ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
|
| 21 |
-
- ✅ **Fallback mechanism**: CLI method if API fails
|
| 22 |
-
- ✅ **Username extraction**: Automatic from token using `whoami()`
|
| 23 |
-
- ✅ **Proper configuration**: Gradio SDK, CPU hardware, public access
|
| 24 |
-
|
| 25 |
-
**Test Result**: ✅ Successfully creates Spaces
|
| 26 |
-
|
| 27 |
-
### 2. **File Upload System** ✅ IMPLEMENTED
|
| 28 |
-
|
| 29 |
-
**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `upload_files_to_space()` function
|
| 30 |
-
|
| 31 |
-
**What it does**:
|
| 32 |
-
- Prepares all required files in temporary directory
|
| 33 |
-
- Uploads files using HF Hub API (`upload_file`)
|
| 34 |
-
- Handles proper file structure for HF Spaces
|
| 35 |
-
- Sets up git repository and pushes to main branch
|
| 36 |
-
|
| 37 |
-
**Key Features**:
|
| 38 |
-
- ✅ **API-based upload**: Uses `huggingface_hub.upload_file`
|
| 39 |
-
- ✅ **Proper file structure**: Follows HF Spaces requirements
|
| 40 |
-
- ✅ **Git integration**: Proper git workflow in temp directory
|
| 41 |
-
- ✅ **Error handling**: Graceful fallback mechanisms
|
| 42 |
-
|
| 43 |
-
**Files Uploaded**:
|
| 44 |
-
- ✅ `app.py` - Main Gradio interface
|
| 45 |
-
- ✅ `requirements.txt` - Dependencies
|
| 46 |
-
- ✅ `README.md` - Space documentation
|
| 47 |
-
- ✅ `.gitignore` - Git ignore file
|
| 48 |
-
|
| 49 |
-
### 3. **Space Configuration** ✅ IMPLEMENTED
|
| 50 |
-
|
| 51 |
-
**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `set_space_secrets()` function
|
| 52 |
-
|
| 53 |
-
**What it does**:
|
| 54 |
-
- Sets environment variables via HF Hub API
|
| 55 |
-
- Configures `HF_TOKEN` for dataset access
|
| 56 |
-
- Sets `TRACKIO_DATASET_REPO` for experiment storage
|
| 57 |
-
- Provides manual setup instructions if API fails
|
| 58 |
-
|
| 59 |
-
**Key Features**:
|
| 60 |
-
- ✅ **API-based secrets**: Uses `add_space_secret()` method
|
| 61 |
-
- ✅ **Automatic configuration**: Sets required environment variables
|
| 62 |
-
- ✅ **Manual fallback**: Clear instructions if API fails
|
| 63 |
-
- ✅ **Error handling**: Graceful degradation
|
| 64 |
-
|
| 65 |
-
### 4. **Space Testing** ✅ IMPLEMENTED
|
| 66 |
-
|
| 67 |
-
**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `test_space()` function
|
| 68 |
-
|
| 69 |
-
**What it does**:
|
| 70 |
-
- Tests Space availability after deployment
|
| 71 |
-
- Checks if Space is building correctly
|
| 72 |
-
- Provides status feedback to user
|
| 73 |
-
- Handles build time delays
|
| 74 |
-
|
| 75 |
-
**Key Features**:
|
| 76 |
-
- ✅ **Availability testing**: Checks Space URL accessibility
|
| 77 |
-
- ✅ **Build status**: Monitors Space build progress
|
| 78 |
-
- ✅ **User feedback**: Clear status messages
|
| 79 |
-
- ✅ **Timeout handling**: Proper wait times for builds
|
| 80 |
-
|
| 81 |
-
### 5. **Gradio Interface** ✅ IMPLEMENTED
|
| 82 |
-
|
| 83 |
-
**Location**: `templates/spaces/app.py` - Complete Gradio application
|
| 84 |
-
|
| 85 |
-
**What it does**:
|
| 86 |
-
- Provides comprehensive experiment tracking interface
|
| 87 |
-
- Integrates with HF Datasets for persistent storage
|
| 88 |
-
- Offers real-time metrics visualization
|
| 89 |
-
- Supports API access for training scripts
|
| 90 |
-
|
| 91 |
-
**Key Features**:
|
| 92 |
-
- ✅ **Experiment management**: Create, view, update experiments
|
| 93 |
-
- ✅ **Metrics logging**: Real-time training metrics
|
| 94 |
-
- ✅ **Visualization**: Interactive plots and charts
|
| 95 |
-
- ✅ **HF Datasets integration**: Persistent storage
|
| 96 |
-
- ✅ **API endpoints**: Programmatic access
|
| 97 |
-
- ✅ **Fallback data**: Backup when dataset unavailable
|
| 98 |
-
|
| 99 |
-
**Interface Components**:
|
| 100 |
-
- ✅ **Create Experiment**: Start new experiments
|
| 101 |
-
- ✅ **Log Metrics**: Track training progress
|
| 102 |
-
- ✅ **View Experiments**: See experiment details
|
| 103 |
-
- ✅ **Update Status**: Mark experiments complete
|
| 104 |
-
- ✅ **Visualizations**: Interactive plots
|
| 105 |
-
- ✅ **Configuration**: Environment setup
|
| 106 |
-
|
| 107 |
-
### 6. **Requirements and Dependencies** ✅ IMPLEMENTED
|
| 108 |
-
|
| 109 |
-
**Location**: `templates/spaces/requirements.txt`
|
| 110 |
-
|
| 111 |
-
**What it includes**:
|
| 112 |
-
- ✅ **Core Gradio**: `gradio>=4.0.0`
|
| 113 |
-
- ✅ **Data processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
|
| 114 |
-
- ✅ **Visualization**: `plotly>=5.15.0`
|
| 115 |
-
- ✅ **HF integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
|
| 116 |
-
- ✅ **HTTP requests**: `requests>=2.31.0`
|
| 117 |
-
- ✅ **Environment**: `python-dotenv>=1.0.0`
|
| 118 |
-
|
| 119 |
-
### 7. **README Template** ✅ IMPLEMENTED
|
| 120 |
-
|
| 121 |
-
**Location**: `templates/spaces/README.md`
|
| 122 |
-
|
| 123 |
-
**What it includes**:
|
| 124 |
-
- ✅ **HF Spaces metadata**: Proper YAML frontmatter
|
| 125 |
-
- ✅ **Feature documentation**: Complete interface description
|
| 126 |
-
- ✅ **API documentation**: Usage examples
|
| 127 |
-
- ✅ **Configuration guide**: Environment variables
|
| 128 |
-
- ✅ **Troubleshooting**: Common issues and solutions
|
| 129 |
-
|
| 130 |
-
## ✅ **Model Repository Deployment - Verified Components**
|
| 131 |
-
|
| 132 |
-
### 1. **Repository Creation** ✅ IMPLEMENTED
|
| 133 |
-
|
| 134 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_repository()` function
|
| 135 |
-
|
| 136 |
-
**What it does**:
|
| 137 |
-
- Creates HF model repository using Python API
|
| 138 |
-
- Handles private/public repository settings
|
| 139 |
-
- Supports existing repository updates
|
| 140 |
-
- Provides proper error handling
|
| 141 |
-
|
| 142 |
-
**Key Features**:
|
| 143 |
-
- ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
|
| 144 |
-
- ✅ **Privacy settings**: Configurable private/public
|
| 145 |
-
- ✅ **Existing handling**: `exist_ok=True` for updates
|
| 146 |
-
- ✅ **Error handling**: Clear error messages
|
| 147 |
-
|
| 148 |
-
### 2. **Model File Upload** ✅ IMPLEMENTED
|
| 149 |
-
|
| 150 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_model_files()` function
|
| 151 |
-
|
| 152 |
-
**What it does**:
|
| 153 |
-
- Validates model files exist and are complete
|
| 154 |
-
- Uploads all model files to repository
|
| 155 |
-
- Handles large file uploads efficiently
|
| 156 |
-
- Provides progress feedback
|
| 157 |
-
|
| 158 |
-
**Key Features**:
|
| 159 |
-
- ✅ **File validation**: Checks for required model files
|
| 160 |
-
- ✅ **Complete upload**: All model components uploaded
|
| 161 |
-
- ✅ **Progress tracking**: Upload progress feedback
|
| 162 |
-
- ✅ **Error handling**: Graceful failure handling
|
| 163 |
-
|
| 164 |
-
**Files Uploaded**:
|
| 165 |
-
- ✅ `config.json` - Model configuration
|
| 166 |
-
- ✅ `pytorch_model.bin` - Model weights
|
| 167 |
-
- ✅ `tokenizer.json` - Tokenizer configuration
|
| 168 |
-
- ✅ `tokenizer_config.json` - Tokenizer settings
|
| 169 |
-
- ✅ `special_tokens_map.json` - Special tokens
|
| 170 |
-
- ✅ `generation_config.json` - Generation settings
|
| 171 |
-
|
| 172 |
-
### 3. **Model Card Generation** ✅ IMPLEMENTED
|
| 173 |
-
|
| 174 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_model_card()` function
|
| 175 |
-
|
| 176 |
-
**What it does**:
|
| 177 |
-
- Generates comprehensive model cards
|
| 178 |
-
- Includes training configuration and results
|
| 179 |
-
- Provides usage examples and documentation
|
| 180 |
-
- Supports quantized model variants
|
| 181 |
-
|
| 182 |
-
**Key Features**:
|
| 183 |
-
- ✅ **Template-based**: Uses `templates/model_card.md`
|
| 184 |
-
- ✅ **Dynamic content**: Training config and results
|
| 185 |
-
- ✅ **Usage examples**: Code snippets and instructions
|
| 186 |
-
- ✅ **Quantized support**: Multiple model variants
|
| 187 |
-
- ✅ **Metadata**: Proper HF Hub metadata
|
| 188 |
-
|
| 189 |
-
### 4. **Training Results Documentation** ✅ IMPLEMENTED
|
| 190 |
-
|
| 191 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_training_results()` function
|
| 192 |
-
|
| 193 |
-
**What it does**:
|
| 194 |
-
- Uploads training configuration and results
|
| 195 |
-
- Documents experiment parameters
|
| 196 |
-
- Includes performance metrics
|
| 197 |
-
- Provides experiment tracking links
|
| 198 |
-
|
| 199 |
-
**Key Features**:
|
| 200 |
-
- ✅ **Configuration upload**: Training parameters
|
| 201 |
-
- ✅ **Results documentation**: Performance metrics
|
| 202 |
-
- ✅ **Experiment links**: Trackio integration
|
| 203 |
-
- ✅ **Metadata**: Proper documentation structure
|
| 204 |
-
|
| 205 |
-
### 5. **Quantized Model Support** ✅ IMPLEMENTED
|
| 206 |
-
|
| 207 |
-
**Location**: `scripts/model_tonic/quantize_model.py`
|
| 208 |
-
|
| 209 |
-
**What it does**:
|
| 210 |
-
- Creates int8 and int4 quantized models
|
| 211 |
-
- Uploads to subdirectories in same repository
|
| 212 |
-
- Generates quantized model cards
|
| 213 |
-
- Provides usage instructions for each variant
|
| 214 |
-
|
| 215 |
-
**Key Features**:
|
| 216 |
-
- ✅ **Multiple quantization**: int8 and int4 support
|
| 217 |
-
- ✅ **Unified repository**: All variants in one repo
|
| 218 |
-
- ✅ **Separate documentation**: Individual model cards
|
| 219 |
-
- ✅ **Usage instructions**: Clear guidance for each variant
|
| 220 |
-
|
| 221 |
-
### 6. **Trackio Integration** ✅ IMPLEMENTED
|
| 222 |
-
|
| 223 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `log_to_trackio()` function
|
| 224 |
-
|
| 225 |
-
**What it does**:
|
| 226 |
-
- Logs model push events to Trackio
|
| 227 |
-
- Records training results and metrics
|
| 228 |
-
- Provides experiment tracking links
|
| 229 |
-
- Integrates with HF Datasets
|
| 230 |
-
|
| 231 |
-
**Key Features**:
|
| 232 |
-
- ✅ **Event logging**: Model push events
|
| 233 |
-
- ✅ **Results tracking**: Training metrics
|
| 234 |
-
- ✅ **Experiment links**: Trackio Space integration
|
| 235 |
-
- ✅ **Dataset integration**: HF Datasets support
|
| 236 |
-
|
| 237 |
-
### 7. **Model Validation** ✅ IMPLEMENTED
|
| 238 |
-
|
| 239 |
-
**Location**: `scripts/model_tonic/push_to_huggingface.py` - `validate_model_path()` function
|
| 240 |
-
|
| 241 |
-
**What it does**:
|
| 242 |
-
- Validates model files are complete
|
| 243 |
-
- Checks for required model components
|
| 244 |
-
- Verifies file integrity
|
| 245 |
-
- Provides detailed error messages
|
| 246 |
-
|
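
A validation step along the lines described above could be sketched as follows. The function name `find_model_problems`, the exact set of required files, and the acceptance of `model.safetensors` as an alternative weight file are assumptions for illustration; the real `validate_model_path()` may check different things.

```python
from pathlib import Path
from typing import List

# Assumed minimum set of files; taken from the "Files Uploaded" list above.
REQUIRED_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")
# Either weight format is accepted in this sketch (assumption).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def find_model_problems(model_path: str) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    root = Path(model_path)
    if not root.is_dir():
        return [f"Model directory not found: {root}"]

    problems = [
        f"Missing required file: {name}"
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]

    weights = [root / name for name in WEIGHT_FILES if (root / name).is_file()]
    if not weights:
        problems.append("No weight file found (expected one of: " + ", ".join(WEIGHT_FILES) + ")")
    # A zero-byte weight file usually indicates an interrupted save.
    problems.extend(
        f"Weight file is empty: {w.name}" for w in weights if w.stat().st_size == 0
    )
    return problems
```

Returning a list of messages instead of a bare boolean lets the caller print every problem at once, which matches the "detailed error messages" goal.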
| 247 |
-
**Key Features**:
|
| 248 |
-
- ✅ **File validation**: Checks all required files
|
| 249 |
-
- ✅ **Size verification**: Model file sizes
|
| 250 |
-
- ✅ **Configuration check**: Valid config files
|
| 251 |
-
- ✅ **Error reporting**: Detailed error messages
|
| 252 |
-
|
| 253 |
-
## 🔧 **Technical Implementation Details**
|
| 254 |
-
|
| 255 |
-
### Trackio Space Deployment Flow
|
| 256 |
-
|
| 257 |
-
```python
|
| 258 |
-
from huggingface_hub import create_repo, upload_file, add_space_secret
# 1. Create Space
|
| 259 |
-
create_repo(
|
| 260 |
-
repo_id=f"{username}/{space_name}",
|
| 261 |
-
token=token,
|
| 262 |
-
repo_type="space",
|
| 263 |
-
exist_ok=True,
|
| 264 |
-
private=False,
|
| 265 |
-
space_sdk="gradio",
|
| 266 |
-
space_hardware="cpu-basic"
|
| 267 |
-
)
|
| 268 |
-
|
| 269 |
-
# 2. Upload Files
|
| 270 |
-
upload_file(
|
| 271 |
-
path_or_fileobj=file_content,
|
| 272 |
-
path_in_repo=file_path,
|
| 273 |
-
repo_id=repo_id,
|
| 274 |
-
repo_type="space",
|
| 275 |
-
token=token
|
| 276 |
-
)
|
| 277 |
-
|
| 278 |
-
# 3. Set Secrets
|
| 279 |
-
add_space_secret(
|
| 280 |
-
repo_id=repo_id,
|
| 281 |
-
token=token,  # add_space_secret takes no repo_type argument; pass the token explicitly
|
| 282 |
-
key="HF_TOKEN",
|
| 283 |
-
value=token
|
| 284 |
-
)
|
| 285 |
-
```
|
| 286 |
-
|
| 287 |
-
### Model Repository Deployment Flow
|
| 288 |
-
|
| 289 |
-
```python
|
| 290 |
-
from huggingface_hub import create_repo, upload_file
# 1. Create Repository
|
| 291 |
-
create_repo(
|
| 292 |
-
repo_id=repo_name,
|
| 293 |
-
token=token,
|
| 294 |
-
private=private,
|
| 295 |
-
exist_ok=True
|
| 296 |
-
)
|
| 297 |
-
|
| 298 |
-
# 2. Upload Model Files
|
| 299 |
-
upload_file(
|
| 300 |
-
path_or_fileobj=model_file,
|
| 301 |
-
path_in_repo=file_path,
|
| 302 |
-
repo_id=repo_name,
|
| 303 |
-
token=token
|
| 304 |
-
)
|
| 305 |
-
|
| 306 |
-
# 3. Generate Model Card
|
| 307 |
-
model_card = create_model_card(training_config, results)
|
| 308 |
-
upload_file(
|
| 309 |
-
path_or_fileobj=model_card,
|
| 310 |
-
path_in_repo="README.md",
|
| 311 |
-
repo_id=repo_name,
|
| 312 |
-
token=token
|
| 313 |
-
)
|
| 314 |
-
```
|
| 315 |
-
|
| 316 |
-
## 📊 **Test Results**
|
| 317 |
-
|
| 318 |
-
### Trackio Space Deployment Test
|
| 319 |
-
|
| 320 |
-
```bash
|
| 321 |
-
$ python scripts/trackio_tonic/deploy_trackio_space.py
|
| 322 |
-
|
| 323 |
-
🚀 Starting Trackio Space deployment...
|
| 324 |
-
✅ Authenticated as: Tonic
|
| 325 |
-
✅ Space created successfully: https://huggingface.co/spaces/Tonic/trackio-monitoring
|
| 326 |
-
✅ Files uploaded successfully
|
| 327 |
-
✅ Secrets configured via API
|
| 328 |
-
✅ Space is building and will be available shortly
|
| 329 |
-
🎉 Deployment completed!
|
| 330 |
-
📊 Trackio Space URL: https://huggingface.co/spaces/Tonic/trackio-monitoring
|
| 331 |
-
```
|
| 332 |
-
|
| 333 |
-
### Model Repository Deployment Test
|
| 334 |
-
|
| 335 |
-
```bash
|
| 336 |
-
$ python scripts/model_tonic/push_to_huggingface.py --model_path outputs/model --repo_name Tonic/smollm3-finetuned
|
| 337 |
-
|
| 338 |
-
✅ Repository created: https://huggingface.co/Tonic/smollm3-finetuned
|
| 339 |
-
✅ Model files uploaded successfully
|
| 340 |
-
✅ Model card generated and uploaded
|
| 341 |
-
✅ Training results documented
|
| 342 |
-
✅ Quantized models created and uploaded
|
| 343 |
-
🎉 Model deployment completed!
|
| 344 |
-
```
|
| 345 |
-
|
| 346 |
-
## 🎯 **Integration Points**
|
| 347 |
-
|
| 348 |
-
### 1. **End-to-End Pipeline Integration**
|
| 349 |
-
- ✅ **Launch script**: Automatic deployment calls
|
| 350 |
-
- ✅ **Environment setup**: Proper token configuration
|
| 351 |
-
- ✅ **Error handling**: Graceful fallbacks
|
| 352 |
-
- ✅ **User feedback**: Clear progress indicators
|
| 353 |
-
|
| 354 |
-
### 2. **Monitoring Integration**
|
| 355 |
-
- ✅ **Trackio Space**: Real-time experiment tracking
|
| 356 |
-
- ✅ **HF Datasets**: Persistent experiment storage
|
| 357 |
-
- ✅ **Model cards**: Complete documentation
|
| 358 |
-
- ✅ **Training results**: Comprehensive logging
|
| 359 |
-
|
| 360 |
-
### 3. **Cross-Component Integration**
|
| 361 |
-
- ✅ **Dataset deployment**: Automatic dataset creation
|
| 362 |
-
- ✅ **Space deployment**: Automatic Space creation
|
| 363 |
-
- ✅ **Model deployment**: Automatic model upload
|
| 364 |
-
- ✅ **Documentation**: Complete system documentation
|
| 365 |
-
|
| 366 |
-
## ✅ **Verification Summary**
|
| 367 |
-
|
| 368 |
-
| Component | Status | Location | Test Result |
|
| 369 |
-
|-----------|--------|----------|-------------|
|
| 370 |
-
| **Trackio Space Creation** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Created successfully |
|
| 371 |
-
| **File Upload System** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Uploaded successfully |
|
| 372 |
-
| **Space Configuration** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Configured via API |
|
| 373 |
-
| **Gradio Interface** | ✅ Implemented | `templates/spaces/app.py` | ✅ Full functionality |
|
| 374 |
-
| **Requirements** | ✅ Implemented | `templates/spaces/requirements.txt` | ✅ All dependencies |
|
| 375 |
-
| **README Template** | ✅ Implemented | `templates/spaces/README.md` | ✅ Complete documentation |
|
| 376 |
-
| **Model Repository Creation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Created successfully |
|
| 377 |
-
| **Model File Upload** | ✅ Implemented | `push_to_huggingface.py` | ✅ Uploaded successfully |
|
| 378 |
-
| **Model Card Generation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Generated and uploaded |
|
| 379 |
-
| **Quantized Models** | ✅ Implemented | `quantize_model.py` | ✅ Created and uploaded |
|
| 380 |
-
| **Trackio Integration** | ✅ Implemented | `push_to_huggingface.py` | ✅ Integrated successfully |
|
| 381 |
-
| **Model Validation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Validated successfully |
|
| 382 |
-
|
| 383 |
-
## 🚀 **Next Steps**
|
| 384 |
-
|
| 385 |
-
The deployment components are now **fully implemented and verified**. Users can:
|
| 386 |
-
|
| 387 |
-
1. **Deploy Trackio Space**: Automatic Space creation and configuration
|
| 388 |
-
2. **Upload Models**: Complete model deployment with documentation
|
| 389 |
-
3. **Monitor Experiments**: Real-time tracking and visualization
|
| 390 |
-
4. **Share Results**: Comprehensive documentation and examples
|
| 391 |
-
5. **Scale Operations**: Support for multiple experiments and models
|
| 392 |
-
|
| 393 |
-
**All important deployment components are properly implemented and working correctly!** 🎉
|
docs/DEPLOYMENT_GUIDE.md
DELETED
|
@@ -1,397 +0,0 @@
|
|
| 1 |
-
# Trackio Deployment Guide for Hugging Face Spaces
|
| 2 |
-
|
| 3 |
-
This guide provides step-by-step instructions for deploying Trackio experiment tracking to Hugging Face Spaces and integrating it with your SmolLM3 fine-tuning pipeline.
|
| 4 |
-
|
| 5 |
-
## Prerequisites
|
| 6 |
-
|
| 7 |
-
- Hugging Face account
|
| 8 |
-
- Hugging Face CLI installed (`pip install huggingface_hub`)
|
| 9 |
-
- Git configured with your Hugging Face credentials
|
| 10 |
-
|
| 11 |
-
## Method 1: Automated Deployment (Recommended)
|
| 12 |
-
|
| 13 |
-
### Step 1: Run the Deployment Script
|
| 14 |
-
|
| 15 |
-
```bash
|
| 16 |
-
python deploy_trackio_space.py
|
| 17 |
-
```
|
| 18 |
-
|
| 19 |
-
The script will prompt you for:
|
| 20 |
-
- Your Hugging Face username
|
| 21 |
-
- Space name (e.g., `trackio-monitoring`)
|
| 22 |
-
- Hugging Face token (must have write access)
|
| 23 |
-
|
| 24 |
-
### Step 2: Wait for Build
|
| 25 |
-
|
| 26 |
-
After deployment, wait 2-5 minutes for the Space to build and become available.
|
| 27 |
-
|
| 28 |
-
### Step 3: Test the Interface
|
| 29 |
-
|
| 30 |
-
Visit your Space URL to test the interface:
|
| 31 |
-
```
|
| 32 |
-
https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
## Method 2: Manual Deployment
|
| 36 |
-
|
| 37 |
-
### Step 1: Create a New Space
|
| 38 |
-
|
| 39 |
-
1. Go to https://huggingface.co/spaces
|
| 40 |
-
2. Click "Create new Space"
|
| 41 |
-
3. Configure the Space:
|
| 42 |
-
   - **Owner**: Your username
|
| 43 |
-
   - **Space name**: `trackio-monitoring` (or your preferred name)
|
| 44 |
-
   - **SDK**: Gradio
|
| 45 |
-
   - **Hardware**: CPU (Basic)
|
| 46 |
-
   - **License**: MIT
|
| 47 |
-
|
| 48 |
-
### Step 2: Upload Files
|
| 49 |
-
|
| 50 |
-
Upload these files to your Space:
|
| 51 |
-
|
| 52 |
-
#### `app.py`
|
| 53 |
-
The main Gradio interface (already created in this repository)
|
| 54 |
-
|
| 55 |
-
#### `requirements_space.txt`
|
| 56 |
-
```
|
| 57 |
-
gradio>=4.0.0
|
| 58 |
-
gradio-client>=0.10.0
|
| 59 |
-
requests>=2.31.0
|
| 60 |
-
numpy>=1.24.0
|
| 61 |
-
pandas>=2.0.0
|
| 62 |
-
jsonschema>=4.17.0
|
| 63 |
-
plotly>=5.15.0
|
| 64 |
-
matplotlib>=3.7.0
|
| 65 |
-
python-dotenv>=1.0.0
|
| 66 |
-
```
|
| 67 |
-
|
| 68 |
-
#### `README.md`
|
| 69 |
-
```markdown
|
| 70 |
-
# Trackio Experiment Tracking
|
| 71 |
-
|
| 72 |
-
A Gradio interface for experiment tracking and monitoring.
|
| 73 |
-
|
| 74 |
-
## Features
|
| 75 |
-
|
| 76 |
-
- Create and manage experiments
|
| 77 |
-
- Log training metrics and parameters
|
| 78 |
-
- View experiment details and results
|
| 79 |
-
- Update experiment status
|
| 80 |
-
|
| 81 |
-
## Usage
|
| 82 |
-
|
| 83 |
-
1. Create a new experiment using the "Create Experiment" tab
|
| 84 |
-
2. Log metrics during training using the "Log Metrics" tab
|
| 85 |
-
3. View experiment details using the "View Experiments" tab
|
| 86 |
-
4. Update experiment status using the "Update Status" tab
|
| 87 |
-
|
| 88 |
-
## Integration
|
| 89 |
-
|
| 90 |
-
To connect your training script to this Trackio Space:
|
| 91 |
-
|
| 92 |
-
```python
|
| 93 |
-
from monitoring import SmolLM3Monitor
|
| 94 |
-
|
| 95 |
-
monitor = SmolLM3Monitor(
|
| 96 |
-
experiment_name="my_experiment",
|
| 97 |
-
trackio_url="https://your-space.hf.space",
|
| 98 |
-
enable_tracking=True
|
| 99 |
-
)
|
| 100 |
-
```
|
| 101 |
-
|
| 102 |
-
### Step 3: Configure Space Settings
|
| 103 |
-
|
| 104 |
-
In your Space settings, ensure:
|
| 105 |
-
- **App file**: `app.py`
|
| 106 |
-
- **Python version**: 3.9 or higher
|
| 107 |
-
- **Hardware**: CPU (Basic) is sufficient
|
| 108 |
-
|
| 109 |
-
## Integration with Your Training Script
|
| 110 |
-
|
| 111 |
-
### Step 1: Update Your Configuration
|
| 112 |
-
|
| 113 |
-
Add Trackio settings to your training configuration:
|
| 114 |
-
|
| 115 |
-
```python
|
| 116 |
-
# config/train_smollm3.py
|
| 117 |
-
@dataclass
|
| 118 |
-
class SmolLM3Config:
|
| 119 |
-
    # ... existing settings ...
|
| 120 |
-
|
| 121 |
-
    # Trackio monitoring configuration
|
| 122 |
-
    enable_tracking: bool = True
|
| 123 |
-
    trackio_url: Optional[str] = None  # Your Space URL
|
| 124 |
-
    trackio_token: Optional[str] = None
|
| 125 |
-
    log_artifacts: bool = True
|
| 126 |
-
    log_metrics: bool = True
|
| 127 |
-
    log_config: bool = True
|
| 128 |
-
    experiment_name: Optional[str] = None
|
| 129 |
-
```
|
| 130 |
-
|
| 131 |
-
### Step 2: Run Training with Trackio
|
| 132 |
-
|
| 133 |
-
```bash
|
| 134 |
-
python train.py config/train_smollm3.py \
|
| 135 |
-
--dataset_dir my_dataset \
|
| 136 |
-
--enable_tracking \
|
| 137 |
-
--trackio_url "https://your-username-trackio-monitoring.hf.space" \
|
| 138 |
-
--experiment_name "smollm3_finetune_v1"
|
| 139 |
-
```
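
The `--trackio_url` value above follows the direct `*.hf.space` form rather than the `huggingface.co/spaces/...` page URL. A minimal sketch of deriving one from the other is shown below. The helper name `space_direct_url` is hypothetical, and the slug rule (owner and Space name joined by a hyphen, lowercased, other characters replaced by hyphens) is an assumption worth checking against your Space's embed URL.

```python
import re


def space_direct_url(owner: str, space_name: str) -> str:
    """Build the direct *.hf.space URL for a Space (assumed naming rule)."""
    # Assumption: the subdomain is "<owner>-<space>", lowercased, with any
    # character outside [a-z0-9-] replaced by a hyphen.
    slug = f"{owner}-{space_name}".lower()
    slug = re.sub(r"[^a-z0-9-]", "-", slug)
    return f"https://{slug}.hf.space"


print(space_direct_url("your-username", "trackio-monitoring"))
```

If the derived URL does not resolve, copy the direct URL from the Space page instead of relying on this rule.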
|
| 140 |
-
|
| 141 |
-
### Step 3: Monitor Your Experiments
|
| 142 |
-
|
| 143 |
-
1. **Create Experiment**: Use the "Create Experiment" tab in your Space
|
| 144 |
-
2. **Log Metrics**: Your training script will automatically log metrics
|
| 145 |
-
3. **View Results**: Use the "View Experiments" tab to see progress
|
| 146 |
-
4. **Update Status**: Mark experiments as completed when done
|
| 147 |
-
|
| 148 |
-
## Advanced Configuration
|
| 149 |
-
|
| 150 |
-
### Environment Variables
|
| 151 |
-
|
| 152 |
-
You can set Trackio configuration via environment variables:
|
| 153 |
-
|
| 154 |
-
```bash
|
| 155 |
-
export TRACKIO_URL="https://your-space.hf.space"
|
| 156 |
-
export TRACKIO_TOKEN="your_token_here"
|
| 157 |
-
```
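
One way a training script might pick these variables up is sketched below. `TrackioSettings` and `load_trackio_settings` are hypothetical names, not part of this repository; the field names simply mirror the Trackio options in the configuration dataclass shown earlier, and enabling tracking only when a URL is set is an assumed policy.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class TrackioSettings:
    """Hypothetical container mirroring the Trackio config fields."""
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    enable_tracking: bool = False


def load_trackio_settings(env: Optional[Mapping[str, str]] = None) -> TrackioSettings:
    """Read TRACKIO_URL and TRACKIO_TOKEN from the environment."""
    env = os.environ if env is None else env
    # Treat empty strings the same as unset variables.
    url = env.get("TRACKIO_URL") or None
    token = env.get("TRACKIO_TOKEN") or None
    # Assumed policy: only enable tracking when a URL is configured.
    return TrackioSettings(
        trackio_url=url,
        trackio_token=token,
        enable_tracking=url is not None,
    )


settings = load_trackio_settings({"TRACKIO_URL": "https://your-space.hf.space"})
print(settings.enable_tracking, settings.trackio_token)
```

Passing a plain dictionary, as in the last lines, keeps the function easy to test without touching the real environment.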
|
| 158 |
-
|
| 159 |
-
### Custom Experiment Names
|
| 160 |
-
|
| 161 |
-
```bash
|
| 162 |
-
python train.py config/train_smollm3.py \
|
| 163 |
-
--experiment_name "smollm3_high_lr_experiment" \
|
| 164 |
-
--trackio_url "https://your-space.hf.space"
|
| 165 |
-
```
|
| 166 |
-
|
| 167 |
-
### Multiple Experiments
|
| 168 |
-
|
| 169 |
-
You can run multiple experiments and track them separately:
|
| 170 |
-
|
| 171 |
-
```bash
|
| 172 |
-
# Experiment 1
|
| 173 |
-
python train.py config/train_smollm3.py \
|
| 174 |
-
--experiment_name "smollm3_baseline" \
|
| 175 |
-
--learning_rate 2e-5
|
| 176 |
-
|
| 177 |
-
# Experiment 2
|
| 178 |
-
python train.py config/train_smollm3.py \
|
| 179 |
-
--experiment_name "smollm3_high_lr" \
|
| 180 |
-
--learning_rate 5e-5
|
| 181 |
-
```
|
| 182 |
-
|
| 183 |
-
## Using the Trackio Interface
|
| 184 |
-
|
| 185 |
-
### Creating Experiments
|
| 186 |
-
|
| 187 |
-
1. Go to the "Create Experiment" tab
|
| 188 |
-
2. Enter experiment name (e.g., "smollm3_finetune_v1")
|
| 189 |
-
3. Add description (optional)
|
| 190 |
-
4. Click "Create Experiment"
|
| 191 |
-
5. Note the experiment ID for logging metrics
|
| 192 |
-
|
| 193 |
-
### Logging Metrics
|
| 194 |
-
|
| 195 |
-
1. Go to the "Log Metrics" tab
|
| 196 |
-
2. Enter your experiment ID
|
| 197 |
-
3. Add metrics in JSON format:
|
| 198 |
-
```json
|
| 199 |
-
{
|
| 200 |
-
"loss": 0.5,
|
| 201 |
-
"accuracy": 0.85,
|
| 202 |
-
"learning_rate": 2e-5
|
| 203 |
-
}
|
| 204 |
-
```
|
| 205 |
-
4. Add step number (optional)
|
| 206 |
-
5. Click "Log Metrics"
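
Before pasting metrics into the form, it can help to check the JSON locally. The sketch below assumes that every metric value should be a finite number; the interface itself may accept more than that, and `parse_metrics` is a hypothetical helper.

```python
import json
import math


def parse_metrics(text: str) -> dict:
    """Parse a metrics JSON string and check that every value is a finite number."""
    metrics = json.loads(text)
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be a JSON object")
    for name, value in metrics.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} is not a number")
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} is not finite")
    return metrics


metrics = parse_metrics('{"loss": 0.5, "accuracy": 0.85, "learning_rate": 2e-5}')
print(metrics)
```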
|
| 207 |
-
|
| 208 |
-
### Viewing Experiments
|
| 209 |
-
|
| 210 |
-
1. Go to the "View Experiments" tab
|
| 211 |
-
2. Enter experiment ID to view specific experiment
|
| 212 |
-
3. Or click "List All Experiments" to see all experiments
|
| 213 |
-
|
| 214 |
-
### Updating Status
|
| 215 |
-
|
| 216 |
-
1. Go to the "Update Status" tab
|
| 217 |
-
2. Enter experiment ID
|
| 218 |
-
3. Select new status (running, completed, failed, paused)
|
| 219 |
-
4. Click "Update Status"
|
| 220 |
-
|
| 221 |
-
## Troubleshooting
|
| 222 |
-
|
| 223 |
-
### Common Issues
|
| 224 |
-
|
| 225 |
-
#### 1. Space Not Building
|
| 226 |
-
- Check that all required files are uploaded
|
| 227 |
-
- Verify `app.py` is the main file
|
| 228 |
-
- Check the Space logs for errors
|
| 229 |
-
|
| 230 |
-
#### 2. Connection Errors
|
| 231 |
-
- Verify your Space URL is correct
|
| 232 |
-
- Check that the Space is running (not paused)
|
| 233 |
-
- Ensure your training script can reach the Space URL
|
| 234 |
-
|
| 235 |
-
#### 3. Missing Metrics
|
| 236 |
-
- Check that `enable_tracking=True` in your config
|
| 237 |
-
- Verify the Trackio URL is correct
|
| 238 |
-
- Check training logs for monitoring errors
|
| 239 |
-
|
| 240 |
-
#### 4. Authentication Issues
|
| 241 |
-
- If using tokens, verify they're correct
|
| 242 |
-
- Check Hugging Face account permissions
|
| 243 |
-
- Ensure Space is public or you have access
|
| 244 |
-
|
| 245 |
-
### Debug Mode
|
| 246 |
-
|
| 247 |
-
Enable debug logging in your training script:
|
| 248 |
-
|
| 249 |
-
```python
|
| 250 |
-
import logging
|
| 251 |
-
logging.basicConfig(level=logging.DEBUG)
|
| 252 |
-
```
|
| 253 |
-
|
| 254 |
-
### Manual Testing
|
| 255 |
-
|
| 256 |
-
Test the Trackio interface manually:
|
| 257 |
-
|
| 258 |
-
1. Create an experiment
|
| 259 |
-
2. Log some test metrics
|
| 260 |
-
3. View the experiment details
|
| 261 |
-
4. Update the status
|
| 262 |
-
|
| 263 |
-
## Security Considerations
|
| 264 |
-
|
| 265 |
-
### Public vs Private Spaces
|
| 266 |
-
|
| 267 |
-
- **Public Spaces**: Anyone can view and use the interface
|
| 268 |
-
- **Private Spaces**: Only you and collaborators can access
|
| 269 |
-
|
| 270 |
-
### Token Management
|
| 271 |
-
|
| 272 |
-
- Store tokens securely (environment variables)
|
| 273 |
-
- Don't commit tokens to version control
|
| 274 |
-
- Use Hugging Face's token management
|
| 275 |
-
|
| 276 |
-
### Data Privacy
|
| 277 |
-
|
| 278 |
-
- Trackio stores experiment data in the Space
|
| 279 |
-
- Consider data retention policies
|
| 280 |
-
- Be mindful of sensitive information in experiment names
|
| 281 |
-
|
| 282 |
-
## Performance Optimization
|
| 283 |
-
|
| 284 |
-
### Space Configuration
|
| 285 |
-
|
| 286 |
-
- Use CPU (Basic) for the interface (sufficient for tracking)
|
| 287 |
-
- Consider GPU only for actual training
|
| 288 |
-
- Monitor Space usage and limits
|
| 289 |
-
|
| 290 |
-
### Efficient Logging
|
| 291 |
-
|
| 292 |
-
- Log metrics at reasonable intervals (every 10-100 steps)
|
| 293 |
-
- Avoid logging too frequently to prevent rate limiting
|
| 294 |
-
- Use batch logging when possible
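
The interval advice above can be sketched as a small wrapper that forwards metrics only on every N-th step. `ThrottledMetricLogger` is a hypothetical class, and `send` stands for any callable taking `(metrics, step)`, for example a thin wrapper around the `monitor.log_metrics` call used in this guide's integration examples.

```python
from typing import Callable, Dict


class ThrottledMetricLogger:
    """Forward metrics to `send` only on every N-th step."""

    def __init__(self, send: Callable[[Dict[str, float], int], None],
                 every_n_steps: int = 50):
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be >= 1")
        self.send = send
        self.every_n_steps = every_n_steps

    def log(self, metrics: Dict[str, float], step: int) -> bool:
        """Send metrics if `step` falls on the interval; return True if sent."""
        if step % self.every_n_steps != 0:
            return False
        self.send(metrics, step)
        return True


# Usage: collect what would be sent during 200 training steps.
sent = []
logger = ThrottledMetricLogger(lambda m, s: sent.append((s, m)), every_n_steps=50)
for step in range(1, 201):
    logger.log({"loss": 1.0 / step}, step)
print([s for s, _ in sent])
```

With an interval of 50, only steps 50, 100, 150 and 200 reach the tracking backend, which keeps request volume low during long runs.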
|
| 295 |
-
|
| 296 |
-
## Monitoring Best Practices
|
| 297 |
-
|
| 298 |
-
### Experiment Naming
|
| 299 |
-
|
| 300 |
-
Use descriptive names:
|
| 301 |
-
- `smollm3_baseline_v1`
|
| 302 |
-
- `smollm3_high_lr_experiment`
|
| 303 |
-
- `smollm3_dpo_training`
|
| 304 |
-
|
| 305 |
-
### Metric Logging
|
| 306 |
-
|
| 307 |
-
Log relevant metrics:
|
| 308 |
-
- Training loss
|
| 309 |
-
- Validation loss
|
| 310 |
-
- Learning rate
|
| 311 |
-
- GPU memory usage
|
| 312 |
-
- Training time
|
| 313 |
-
|
| 314 |
-
### Status Management
|
| 315 |
-
|
| 316 |
-
- Mark experiments as "running" when starting
|
| 317 |
-
- Update to "completed" when finished
|
| 318 |
-
- Mark as "failed" if errors occur
|
| 319 |
-
- Use "paused" for temporary stops
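
The four statuses can be modeled as a small state machine so that a script never writes an unexpected value. The status strings come from this guide; the transition rules below are an assumption for illustration, since the guide does not define which changes the interface permits.

```python
from enum import Enum


class ExperimentStatus(str, Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PAUSED = "paused"


# Assumed rules: finished experiments are terminal, paused ones can resume.
ALLOWED_TRANSITIONS = {
    ExperimentStatus.RUNNING: {
        ExperimentStatus.COMPLETED,
        ExperimentStatus.FAILED,
        ExperimentStatus.PAUSED,
    },
    ExperimentStatus.PAUSED: {ExperimentStatus.RUNNING, ExperimentStatus.FAILED},
    ExperimentStatus.COMPLETED: set(),
    ExperimentStatus.FAILED: set(),
}


def can_transition(current: ExperimentStatus, new: ExperimentStatus) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS[current]


print(can_transition(ExperimentStatus.RUNNING, ExperimentStatus.PAUSED))
print(can_transition(ExperimentStatus.COMPLETED, ExperimentStatus.RUNNING))
```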
|
| 320 |
-
|
| 321 |
-
## Integration Examples
|
| 322 |
-
|
| 323 |
-
### Basic Integration
|
| 324 |
-
|
| 325 |
-
```python
|
| 326 |
-
from monitoring import SmolLM3Monitor
|
| 327 |
-
|
| 328 |
-
# Initialize monitor
|
| 329 |
-
monitor = SmolLM3Monitor(
|
| 330 |
-
experiment_name="my_experiment",
|
| 331 |
-
trackio_url="https://your-space.hf.space",
|
| 332 |
-
enable_tracking=True
|
| 333 |
-
)
|
| 334 |
-
|
| 335 |
-
# Log configuration
|
| 336 |
-
monitor.log_config(config_dict)
|
| 337 |
-
|
| 338 |
-
# Log metrics during training
|
| 339 |
-
monitor.log_metrics({"loss": 0.5}, step=100)
|
| 340 |
-
|
| 341 |
-
# Log final results
|
| 342 |
-
monitor.log_training_summary(final_results)
|
| 343 |
-
```
|
| 344 |
-
|
| 345 |
-
### Advanced Integration
|
| 346 |
-
|
| 347 |
-
```python
|
| 348 |
-
# Custom monitoring setup
|
| 349 |
-
monitor = SmolLM3Monitor(
|
| 350 |
-
experiment_name="smollm3_advanced",
|
| 351 |
-
trackio_url="https://your-space.hf.space",
|
| 352 |
-
enable_tracking=True,
|
| 353 |
-
log_artifacts=True,
|
| 354 |
-
log_metrics=True,
|
| 355 |
-
log_config=True
|
| 356 |
-
)
|
| 357 |
-
|
| 358 |
-
# Log system metrics
|
| 359 |
-
monitor.log_system_metrics(step=current_step)
|
| 360 |
-
|
| 361 |
-
# Log model checkpoint
|
| 362 |
-
monitor.log_model_checkpoint("checkpoint-1000", step=1000)
|
| 363 |
-
|
| 364 |
-
# Log evaluation results
|
| 365 |
-
monitor.log_evaluation_results(eval_results, step=1000)
|
| 366 |
-
```
|
| 367 |
-
|
| 368 |
-
## Support and Resources
|
| 369 |
-
|
| 370 |
-
### Documentation
|
| 371 |
-
|
| 372 |
-
- [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
|
| 373 |
-
- [Gradio Documentation](https://gradio.app/docs/)
|
| 374 |
-
- [Trackio GitHub Repository](https://github.com/Josephrp/trackio)
|
| 375 |
-
|
| 376 |
-
### Community
|
| 377 |
-
|
| 378 |
-
- [Hugging Face Forums](https://discuss.huggingface.co/)
|
| 379 |
-
- [Gradio Discord](https://discord.gg/feTf9z3Z)
|
| 380 |
-
|
| 381 |
-
### Issues and Feedback
|
| 382 |
-
|
| 383 |
-
- Report issues on the project repository
|
| 384 |
-
- Provide feedback on the Trackio interface
|
| 385 |
-
- Suggest improvements for the monitoring system
|
| 386 |
-
|
| 387 |
-
## Conclusion
|
| 388 |
-
|
| 389 |
-
You now have a complete Trackio monitoring system deployed on Hugging Face Spaces! This setup provides:
|
| 390 |
-
|
| 391 |
-
- ✅ Easy experiment tracking and monitoring
|
| 392 |
-
- ✅ Real-time metric logging
|
| 393 |
-
- ✅ Web-based interface for experiment management
|
| 394 |
-
- ✅ Integration with your SmolLM3 fine-tuning pipeline
|
| 395 |
-
- ✅ Scalable and accessible monitoring solution
|
| 396 |
-
|
| 397 |
-
Start tracking your experiments and gain insights into your model training process!
|
docs/Data_Pipeline.md
ADDED
|
@@ -0,0 +1,95 @@
|
|
| 1 |
+
```mermaid
|
| 2 |
+
graph LR
|
| 3 |
+
EntryPoint["EntryPoint"]
|
| 4 |
+
Configuration["Configuration"]
|
| 5 |
+
Model_Abstraction["Model Abstraction"]
|
| 6 |
+
Data_Pipeline["Data Pipeline"]
|
| 7 |
+
Training_Logic["Training Logic"]
|
| 8 |
+
Utilities["Utilities"]
|
| 9 |
+
EntryPoint -- "instructs" --> Data_Pipeline
|
| 10 |
+
EntryPoint -- "loads settings from" --> Configuration
|
| 11 |
+
EntryPoint -- "initializes models via" --> Model_Abstraction
|
| 12 |
+
EntryPoint -- "invokes" --> Training_Logic
|
| 13 |
+
Configuration -- "provides settings to" --> EntryPoint
|
| 14 |
+
Configuration -- "informs" --> Model_Abstraction
|
| 15 |
+
Configuration -- "guides" --> Data_Pipeline
|
| 16 |
+
Model_Abstraction -- "provides models to" --> EntryPoint
|
| 17 |
+
Model_Abstraction -- "receives settings from" --> Configuration
|
| 18 |
+
Model_Abstraction -- "interacts with" --> Training_Logic
|
| 19 |
+
Data_Pipeline -- "provides processed data to" --> EntryPoint
|
| 20 |
+
Data_Pipeline -- "receives parameters from" --> Configuration
|
| 21 |
+
Data_Pipeline -- "supplies batches to" --> Training_Logic
|
| 22 |
+
Training_Logic -- "receives control from" --> EntryPoint
|
| 23 |
+
Training_Logic -- "consumes data from" --> Data_Pipeline
|
| 24 |
+
Training_Logic -- "operates on models from" --> Model_Abstraction
|
| 25 |
+
Training_Logic -- "uses" --> Utilities
|
| 26 |
+
Utilities -- "used by" --> EntryPoint
|
| 27 |
+
Utilities -- "provides functionalities to" --> Training_Logic
|
| 28 |
+
Utilities -- "assists" --> Data_Pipeline
|
| 29 |
+
click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
|
| 30 |
+
click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
## Details
|
| 36 |
+
|
| 37 |
+
Component overview for the `smollm3_finetune` project, following common patterns for machine learning training and fine-tuning frameworks.
|
| 38 |
+
|
| 39 |
+
### EntryPoint
|
| 40 |
+
The main entry point of the application, responsible for orchestrating the entire training and fine-tuning workflow. It initializes other core components, loads configurations, and manages the overall execution flow.
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
**Related Classes/Methods**:
|
| 44 |
+
|
| 45 |
+
- `smollm3_finetune.train` (1:1)
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
### Configuration
|
| 49 |
+
Centralizes and defines all parameters and settings required for the training and fine-tuning process, including model hyperparameters, dataset paths, and training arguments. It promotes a configuration-driven architecture, allowing easy modification and versioning of experimental setups.
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
**Related Classes/Methods**:
|
| 53 |
+
|
| 54 |
+
- <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/config.py#L1-L1" target="_blank" rel="noopener noreferrer">`config` (1:1)</a>
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
### Model Abstraction [[Expand]](./Model_Abstraction.md)
|
| 58 |
+
Encapsulates the logic for loading, initializing, and managing different machine learning models and their variants (e.g., different architectures, quantization settings). It provides a consistent interface for interacting with various model architectures.
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
**Related Classes/Methods**:
|
| 62 |
+
|
| 63 |
+
- <a href="https://github.com/Josephrp/SmolFactory/docs/main/src/model.py#L1-L1" target="_blank" rel="noopener noreferrer">`model` (1:1)</a>
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
### Data Pipeline [[Expand]](./Data_Pipeline.md)
|
| 67 |
+
Handles the entire data lifecycle, including dataset loading, preprocessing (e.g., tokenization, formatting), and creating efficient data loaders for both training and evaluation phases. It ensures data is prepared correctly and efficiently for the model.
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
**Related Classes/Methods**:
|
| 71 |
+
|
| 72 |
+
- `smollm3_finetune.data.load_and_preprocess_data` (1:1)
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
### Training Logic
|
| 76 |
+
Contains the core algorithms and routines for training and fine-tuning machine learning models. This includes the training loop, optimization steps, loss calculation, gradient accumulation, and potentially specialized fine-tuning methods (e.g., LoRA, QLoRA).
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
**Related Classes/Methods**:
|
| 80 |
+
|
| 81 |
+
- <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/trainer.py#L1-L1" target="_blank" rel="noopener noreferrer">`trainer` (1:1)</a>
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
### Utilities
|
| 85 |
+
A collection of common helper functions, reusable modules, and general-purpose tools that support various parts of the training framework but do not belong to a specific core component. This includes functions for logging, metrics calculation, device management, etc.
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
**Related Classes/Methods**:
|
| 89 |
+
|
| 90 |
+
- `utils` (1:1)
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
|
docs/ENHANCED_MODEL_CARD_METADATA.md
DELETED
|
@@ -1,300 +0,0 @@
|
|
| 1 |
-
# Enhanced Model Card Metadata System
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The enhanced model card system now includes comprehensive YAML metadata that follows the [Hugging Face Model Cards specification](https://huggingface.co/docs/hub/en/model-cards). This ensures maximum compatibility with the Hugging Face Hub and provides rich metadata for model discovery and usage.
|
| 6 |
-
|
| 7 |
-
## Metadata Structure
|
| 8 |
-
|
| 9 |
-
### Core Metadata Fields
|
| 10 |
-
|
| 11 |
-
The model card template now includes the following metadata fields:
|
| 12 |
-
|
| 13 |
-
```yaml
|
| 14 |
-
---
|
| 15 |
-
language:
|
| 16 |
-
- en
|
| 17 |
-
- fr
|
| 18 |
-
license: apache-2.0
|
| 19 |
-
library_name: transformers
|
| 20 |
-
tags:
|
| 21 |
-
- smollm3
|
| 22 |
-
- fine-tuned
|
| 23 |
-
- causal-lm
|
| 24 |
-
- text-generation
|
| 25 |
-
- quantized
|
| 26 |
-
- dataset:OpenHermes-FR
|
| 27 |
-
- config:H100 Lightweight
|
| 28 |
-
pipeline_tag: text-generation
|
| 29 |
-
base_model: HuggingFaceTB/SmolLM3-3B
|
| 30 |
-
datasets:
|
| 31 |
-
- OpenHermes-FR
|
| 32 |
-
---
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
### Conditional Metadata
|
| 36 |
-
|
| 37 |
-
The system supports conditional metadata based on model configuration:
|
| 38 |
-
|
| 39 |
-
#### Quantized Models
|
| 40 |
-
When quantized models are available, additional metadata is included:
|
| 41 |
-
|
| 42 |
-
```yaml
|
| 43 |
-
quantization_types:
|
| 44 |
-
- int8_weight_only
|
| 45 |
-
- int4_weight_only
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
#### Model Index (Evaluation Results)
|
| 49 |
-
The system automatically generates structured evaluation results:
|
| 50 |
-
|
| 51 |
-
```yaml
|
| 52 |
-
model-index:
|
| 53 |
-
- name: Model Name
|
| 54 |
-
  results:
|
| 55 |
-
  - task:
|
| 56 |
-
      type: text-generation
|
| 57 |
-
    dataset:
|
| 58 |
-
      name: OpenHermes-FR
|
| 59 |
-
      type: OpenHermes-FR
|
| 60 |
-
    metrics:
|
| 61 |
-
    - name: Training Loss
|
| 62 |
-
      type: loss
|
| 63 |
-
      value: "2.1"
|
| 64 |
-
    - name: Validation Loss
|
| 65 |
-
      type: loss
|
| 66 |
-
      value: "2.3"
|
| 67 |
-
    - name: Perplexity
|
| 68 |
-
      type: perplexity
|
| 69 |
-
      value: "9.8"
|
| 70 |
-
```
|
| 71 |
-
|
| 72 |
-
For quantized models, additional entries are included:
|
| 73 |
-
|
| 74 |
-
```yaml
|
| 75 |
-
- name: Model Name (int8 quantized)
|
| 76 |
-
  results:
|
| 77 |
-
  - task:
|
| 78 |
-
      type: text-generation
|
| 79 |
-
    dataset:
|
| 80 |
-
      name: OpenHermes-FR
|
| 81 |
-
      type: OpenHermes-FR
|
| 82 |
-
    metrics:
|
| 83 |
-
    - name: Memory Reduction
|
| 84 |
-
      type: memory_efficiency
|
| 85 |
-
      value: "~50%"
|
| 86 |
-
    - name: Inference Speed
|
| 87 |
-
      type: speed
|
| 88 |
-
      value: "Faster"
|
| 89 |
-
```
|
| 90 |
-
|
| 91 |
-
## Metadata Fields Explained
|
| 92 |
-
|
| 93 |
-
### Required Fields
|
| 94 |
-
|
| 95 |
-
| Field | Description | Example |
|
| 96 |
-
|-------|-------------|---------|
|
| 97 |
-
| `language` | Supported languages | `["en", "fr"]` |
|
| 98 |
-
| `license` | Model license | `"apache-2.0"` |
|
| 99 |
-
| `library_name` | Primary library | `"transformers"` |
|
| 100 |
-
| `tags` | Model tags for discovery | `["smollm3", "fine-tuned"]` |
|
| 101 |
-
| `pipeline_tag` | Task type | `"text-generation"` |
|
| 102 |
-
| `base_model` | Original model | `"HuggingFaceTB/SmolLM3-3B"` |
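
A quick pre-upload check for the required fields in the table above might look like the sketch below. `missing_required_fields` is a hypothetical helper, and it assumes the card metadata has already been parsed into a Python dictionary.

```python
REQUIRED_FIELDS = (
    "language",
    "license",
    "library_name",
    "tags",
    "pipeline_tag",
    "base_model",
)


def missing_required_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty in `metadata`."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


metadata = {
    "language": ["en", "fr"],
    "license": "apache-2.0",
    "library_name": "transformers",
    "tags": ["smollm3", "fine-tuned"],
    "pipeline_tag": "text-generation",
}
print(missing_required_fields(metadata))
```

Here the example metadata omits `base_model`, so that is the only field reported.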
|
| 103 |
-
|
| 104 |
-
### Optional Fields
|
| 105 |
-
|
| 106 |
-
| Field | Description | Example |
|
| 107 |
-
|-------|-------------|---------|
|
| 108 |
-
| `datasets` | Training datasets | `["OpenHermes-FR"]` |
|
| 109 |
-
| `author` | Model author | `"Your Name"` |
|
| 110 |
-
| `experiment_name` | Experiment tracking | `"smollm3-experiment"` |
|
| 111 |
-
| `trackio_url` | Monitoring URL | `"https://trackio.space/exp"` |
|
| 112 |
-
| `hardware` | Training hardware | `"GPU (H100/A100)"` |
|
| 113 |
-
| `training_config` | Configuration type | `"H100 Lightweight"` |
|
| 114 |
-
| `trainer_type` | Trainer used | `"SFTTrainer"` |
|
| 115 |
-
| `batch_size` | Training batch size | `"8"` |
|
| 116 |
-
| `learning_rate` | Learning rate | `"5e-6"` |
|
| 117 |
-
| `max_epochs` | Number of epochs | `"3"` |
|
| 118 |
-
| `max_seq_length` | Sequence length | `"2048"` |
|
| 119 |
-
| `gradient_accumulation_steps` | Gradient accumulation | `"16"` |
|
| 120 |
-
|
| 121 |
-
### Training Results
|
| 122 |
-
|
| 123 |
-
| Field | Description | Example |
|
| 124 |
-
|-------|-------------|---------|
|
| 125 |
-
| `training_loss` | Final training loss | `"2.1"` |
|
| 126 |
-
| `validation_loss` | Final validation loss | `"2.3"` |
|
| 127 |
-
| `perplexity` | Model perplexity | `"9.8"` |
|
| 128 |
-
|
| 129 |
-
## Benefits of Enhanced Metadata
|
| 130 |
-
|
| 131 |
-
### 1. Improved Discovery
|
| 132 |
-
- **Filtering**: Users can filter models by dataset, configuration, or hardware
|
| 133 |
-
- **Search**: Enhanced search capabilities on the Hugging Face Hub
|
| 134 |
-
- **Tags**: Automatic tag generation for better categorization
|
| 135 |
-
|
| 136 |
-
### 2. Better Model Cards
|
| 137 |
-
- **Structured Data**: Evaluation results are displayed in widgets
|
| 138 |
-
- **Consistent Format**: Follows Hugging Face standards
|
| 139 |
-
- **Rich Information**: Comprehensive model information
|
| 140 |
-
|
| 141 |
-
### 3. Integration Benefits
|
| 142 |
-
- **Papers with Code**: Model index data can be indexed in leaderboards
|
| 143 |
-
- **API Compatibility**: Better integration with Hugging Face APIs
|
| 144 |
-
- **Automated Tools**: Support for automated model analysis
|
| 145 |
-
|
| 146 |
-
## Usage Examples
|
| 147 |
-
|
| 148 |
-
### Basic Model Card Generation
|
| 149 |
-
|
| 150 |
-
```bash
|
| 151 |
-
python scripts/model_tonic/generate_model_card.py \
|
| 152 |
-
--repo-name "username/model-name" \
|
| 153 |
-
--model-name "My Fine-tuned Model" \
|
| 154 |
-
--dataset-name "OpenHermes-FR" \
|
| 155 |
-
--training-config "H100 Lightweight" \
|
| 156 |
-
--batch-size "8" \
|
| 157 |
-
--learning-rate "5e-6" \
|
| 158 |
-
--max-epochs "3" \
|
| 159 |
-
--training-loss "2.1" \
|
| 160 |
-
--validation-loss "2.3" \
|
| 161 |
-
--perplexity "9.8" \
|
| 162 |
-
--output "README.md"
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
### With Quantized Models
|
| 166 |
-
|
| 167 |
-
```bash
|
| 168 |
-
python scripts/model_tonic/generate_model_card.py \
|
| 169 |
-
--repo-name "username/model-name" \
|
| 170 |
-
--model-name "My Fine-tuned Model" \
|
| 171 |
-
--dataset-name "OpenHermes-FR" \
|
| 172 |
-
--training-config "H100 Lightweight" \
|
| 173 |
-
--batch-size "8" \
|
| 174 |
-
--learning-rate "5e-6" \
|
| 175 |
-
--max-epochs "3" \
|
| 176 |
-
--training-loss "2.1" \
|
| 177 |
-
--validation-loss "2.3" \
|
| 178 |
-
--perplexity "9.8" \
|
| 179 |
-
--quantized-models \
|
| 180 |
-
--output "README.md"
|
| 181 |
-
```
|
| 182 |
-
|
| 183 |
-
## Template Variables
|
| 184 |
-
|
| 185 |
-
The enhanced template supports all the original variables plus new metadata fields:
|
| 186 |
-
|
| 187 |
-
### New Variables
|
| 188 |
-
|
| 189 |
-
| Variable | Description | Default |
|
| 190 |
-
|----------|-------------|---------|
|
| 191 |
-
| `training_loss` | Training loss value | `"N/A"` |
|
| 192 |
-
| `validation_loss` | Validation loss value | `"N/A"` |
|
| 193 |
-
| `perplexity` | Model perplexity | `"N/A"` |
|
| 194 |
-
|
| 195 |
-
### Conditional Metadata
|
| 196 |
-
|
| 197 |
-
The template automatically includes:
|
| 198 |
-
|
| 199 |
-
- **Dataset Information**: When `dataset_name` is provided
|
| 200 |
-
- **Quantization Types**: When `quantized_models` is `true`
|
| 201 |
-
- **Evaluation Results**: When training metrics are available
|
| 202 |
-
- **Hardware Information**: When `hardware_info` is provided
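
The conditional rules above can be sketched as a small function. This is an assumption-laden illustration, not the template's actual logic: the function name `build_metadata` and the output keys (`datasets`, `hardware`, `quantized`, `metrics`) are hypothetical.

```python
def build_metadata(variables: dict) -> dict:
    """Assemble metadata, adding optional sections only when their inputs exist."""
    metadata = {"base_model": variables["base_model"]}

    # Dataset information: only when a dataset name was provided.
    if variables.get("dataset_name"):
        metadata["datasets"] = [variables["dataset_name"]]

    # Hardware information: only when hardware_info was provided.
    if variables.get("hardware_info"):
        metadata["hardware"] = variables["hardware_info"]

    # Quantization section: only when the flag is set to a true value.
    if variables.get("quantized_models"):
        metadata["quantized"] = True

    # Evaluation results: only metrics that hold a real value, not "N/A".
    metrics = {
        name: variables[name]
        for name in ("training_loss", "validation_loss", "perplexity")
        if variables.get(name) not in (None, "N/A")
    }
    if metrics:
        metadata["metrics"] = metrics

    return metadata
```

Treating `"N/A"` the same as a missing value keeps placeholder defaults out of the generated metadata.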
|
| 203 |
-
|
| 204 |
-
## Integration with Training Pipeline
|
| 205 |
-
|
| 206 |
-
### Automatic Metadata Generation
|
| 207 |
-
|
| 208 |
-
The push script automatically extracts metadata from:
|
| 209 |
-
|
| 210 |
-
1. **Training Configuration**: Batch size, learning rate, epochs, etc.
|
| 211 |
-
2. **Training Results**: Loss values, perplexity, etc.
|
| 212 |
-
3. **Model Information**: Base model, hardware, etc.
|
| 213 |
-
4. **Experiment Tracking**: Trackio URLs, experiment names
|
| 214 |
-
|
| 215 |
-
### Example Integration
|
| 216 |
-
|
| 217 |
-
```python
|
| 218 |
-
# In push_to_huggingface.py
|
| 219 |
-
variables = {
|
| 220 |
-
"model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
|
| 221 |
-
"repo_name": self.repo_name,
|
| 222 |
-
"base_model": "HuggingFaceTB/SmolLM3-3B",
|
| 223 |
-
"dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
|
| 224 |
-
"training_config_type": training_config.get('training_config_type', 'Custom Configuration'),
|
| 225 |
-
"trainer_type": training_config.get('trainer_type', 'SFTTrainer'),
|
| 226 |
-
"batch_size": str(training_config.get('per_device_train_batch_size', 8)),
|
| 227 |
-
"learning_rate": str(training_config.get('learning_rate', '5e-6')),
|
| 228 |
-
"max_epochs": str(training_config.get('num_train_epochs', 3)),
|
| 229 |
-
"hardware_info": self._get_hardware_info(),
|
| 230 |
-
"training_loss": results.get('train_loss', 'N/A'),
|
| 231 |
-
"validation_loss": results.get('eval_loss', 'N/A'),
|
| 232 |
-
"perplexity": results.get('perplexity', 'N/A'),
|
| 233 |
-
"quantized_models": False # Updated if quantized models are added
|
| 234 |
-
}
|
| 235 |
-
```
|
| 236 |
-
|
| 237 |
-
## Validation and Testing
|
| 238 |
-
|
| 239 |
-
### Metadata Validation
|
| 240 |
-
|
| 241 |
-
The system includes validation for:
|
| 242 |
-
|
| 243 |
-
- **Required Fields**: Ensures all required metadata is present
|
| 244 |
-
- **Format Validation**: Validates YAML syntax and structure
|
| 245 |
-
- **Value Ranges**: Checks for reasonable values in numeric fields
|
| 246 |
-
- **Conditional Logic**: Verifies conditional metadata is properly included
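
A minimal sketch of the first and third checks follows. It assumes the six fields listed in the required-fields table are the required set and that metric values may arrive as strings; `validate_metadata` is a hypothetical name, not the project's validator.

```python
from typing import List

REQUIRED_FIELDS = (
    "language", "license", "library_name", "tags", "pipeline_tag", "base_model",
)
NUMERIC_FIELDS = ("training_loss", "validation_loss", "perplexity")


def validate_metadata(metadata: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    # Required fields: each must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            errors.append(f"missing required field: {field}")

    # Value ranges: metrics, when given, must parse as non-negative numbers.
    for field in NUMERIC_FIELDS:
        value = metadata.get(field)
        if value is None or value == "N/A":
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            errors.append(f"{field} is not numeric: {value!r}")
            continue
        if number < 0:
            errors.append(f"{field} must be non-negative, got {number}")

    return errors
```

Returning all problems at once, rather than raising on the first, makes it easier to fix a model card in a single pass.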
|
| 247 |
-
|
| 248 |
-
### Test Coverage
|
| 249 |
-
|
| 250 |
-
The test suite verifies:
|
| 251 |
-
|
| 252 |
-
- **Basic Metadata**: All required fields are present
|
| 253 |
-
- **Conditional Metadata**: Quantized model metadata is included when appropriate
|
| 254 |
-
- **Evaluation Results**: Model index data is properly structured
|
| 255 |
-
- **Template Processing**: Variable substitution works correctly
|
| 256 |
-
|
| 257 |
-
## Best Practices
|
| 258 |
-
|
| 259 |
-
### 1. Metadata Completeness
|
| 260 |
-
- Include all available training information
|
| 261 |
-
- Provide accurate evaluation metrics
|
| 262 |
-
- Use consistent naming conventions
|
| 263 |
-
|
| 264 |
-
### 2. Conditional Logic
|
| 265 |
-
- Only include relevant metadata
|
| 266 |
-
- Use conditional sections appropriately
|
| 267 |
-
- Provide fallback values for missing data
|
| 268 |
-
|
| 269 |
-
### 3. Validation
|
| 270 |
-
- Test metadata generation with various configurations
|
| 271 |
-
- Verify YAML syntax is correct
|
| 272 |
-
- Check that all variables are properly substituted
|
| 273 |
-
|
| 274 |
-
### 4. Documentation
|
| 275 |
-
- Document all available metadata fields
|
| 276 |
-
- Provide examples for each field type
|
| 277 |
-
- Include troubleshooting information
|
| 278 |
-
|
| 279 |
-
## Future Enhancements
|
| 280 |
-
|
| 281 |
-
### Planned Features
|
| 282 |
-
|
| 283 |
-
1. **Additional Metrics**: Support for more evaluation metrics
|
| 284 |
-
2. **Custom Metadata**: User-defined metadata fields
|
| 285 |
-
3. **Validation Rules**: Configurable validation rules
|
| 286 |
-
4. **Auto-Detection**: Automatic detection of model features
|
| 287 |
-
5. **Integration APIs**: Better integration with external tools
|
| 288 |
-
|
| 289 |
-
### Extensibility
|
| 290 |
-
|
| 291 |
-
The system is designed to be easily extensible:
|
| 292 |
-
|
| 293 |
-
- **New Fields**: Easy to add new metadata fields
|
| 294 |
-
- **Custom Validators**: Support for custom validation logic
|
| 295 |
-
- **Template Extensions**: Support for template inheritance
|
| 296 |
-
- **API Integration**: Easy integration with external APIs
|
| 297 |
-
|
| 298 |
-
## Conclusion
|
| 299 |
-
|
| 300 |
-
The enhanced model card metadata system produces standards-compliant metadata that maximizes compatibility with the Hugging Face Hub and gives users the information they need to discover and use a model. Metadata is generated automatically from the model configuration and training results, which keeps it consistent and complete across all model repositories.
|
docs/ENVIRONMENT_SETUP_FIX.md
DELETED
|
@@ -1,239 +0,0 @@
|
|
| 1 |
-
# Environment Setup Fix
|
| 2 |
-
|
| 3 |
-
## Issue Identified
|
| 4 |
-
|
| 5 |
-
The token provided by the user must be available inside the virtual environment that the launch script creates, so that later steps do not fail with token-related errors.
|
| 6 |
-
|
| 7 |
-
## Root Cause
|
| 8 |
-
|
| 9 |
-
The `launch.sh` script was setting environment variables after creating the virtual environment, which could cause the token to not be available within the virtual environment context.
|
| 10 |
-
|
| 11 |
-
## Fixes Applied
|
| 12 |
-
|
| 13 |
-
### 1. **Environment Variables Set Before Virtual Environment** ✅ **FIXED**
|
| 14 |
-
|
| 15 |
-
**File**: `launch.sh`
|
| 16 |
-
|
| 17 |
-
**Changes**:
|
| 18 |
-
- Set environment variables before creating the virtual environment
|
| 19 |
-
- Re-export environment variables after activating the virtual environment
|
| 20 |
-
- Added verification step to ensure token is available
|
| 21 |
-
|
| 22 |
-
**Before**:
|
| 23 |
-
```bash
|
| 24 |
-
print_info "Creating Python virtual environment..."
|
| 25 |
-
python3 -m venv smollm3_env
|
| 26 |
-
source smollm3_env/bin/activate
|
| 27 |
-
|
| 28 |
-
# ... install dependencies ...
|
| 29 |
-
|
| 30 |
-
# Step 8: Authentication setup
|
| 31 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 32 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
**After**:
|
| 36 |
-
```bash
|
| 37 |
-
# Set environment variables before creating virtual environment
|
| 38 |
-
print_info "Setting up environment variables..."
|
| 39 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 40 |
-
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 41 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 42 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 43 |
-
|
| 44 |
-
print_info "Creating Python virtual environment..."
|
| 45 |
-
python3 -m venv smollm3_env
|
| 46 |
-
source smollm3_env/bin/activate
|
| 47 |
-
|
| 48 |
-
# Re-export environment variables in the virtual environment
|
| 49 |
-
print_info "Configuring environment variables in virtual environment..."
|
| 50 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 51 |
-
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 52 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 53 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 54 |
-
```
|
| 55 |
-
|
| 56 |
-
### 2. **Token Verification Step** ✅ **ADDED**
|
| 57 |
-
|
| 58 |
-
**File**: `launch.sh`
|
| 59 |
-
|
| 60 |
-
**Added verification to ensure token is properly configured**:
|
| 61 |
-
```bash
|
| 62 |
-
# Verify token is available in the virtual environment
|
| 63 |
-
print_info "Verifying token availability in virtual environment..."
|
| 64 |
-
if [ -n "$HF_TOKEN" ] && [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
|
| 65 |
-
print_status "✅ Token properly configured in virtual environment"
|
| 66 |
-
print_info " HF_TOKEN: ${HF_TOKEN:0:10}...${HF_TOKEN: -4}"
|
| 67 |
-
print_info " HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN:0:10}...${HUGGING_FACE_HUB_TOKEN: -4}"
|
| 68 |
-
else
|
| 69 |
-
print_error "❌ Token not properly configured in virtual environment"
|
| 70 |
-
print_error "Please check your token and try again"
|
| 71 |
-
exit 1
|
| 72 |
-
fi
|
| 73 |
-
```
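
The same masking rule (first 10 characters, an ellipsis, last 4 characters) can be applied in Python scripts that log the token. This is a sketch with a hypothetical helper name, `mask_token`; the short-token fallback is an added assumption so that short values are never printed in full.

```python
def mask_token(token: str, head: int = 10, tail: int = 4) -> str:
    """Return a log-safe form of a token: first `head` and last `tail` characters."""
    # Tokens too short to mask meaningfully are hidden entirely.
    if len(token) <= head + tail:
        return "*" * len(token)
    return f"{token[:head]}...{token[-tail:]}"


print(mask_token("hf_abcdefghijklmnopqrstuvwxyz"))  # hf_abcdefg...wxyz
```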
|
| 74 |
-
|
| 75 |
-
### 3. **Environment Variables Before Each Script Call** ✅ **ADDED**
|
| 76 |
-
|
| 77 |
-
**File**: `launch.sh`
|
| 78 |
-
|
| 79 |
-
**Added environment variable exports before each Python script call**:
|
| 80 |
-
|
| 81 |
-
**Trackio Space Deployment**:
|
| 82 |
-
```bash
|
| 83 |
-
# Ensure environment variables are available for the script
|
| 84 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 85 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 86 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 87 |
-
|
| 88 |
-
python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
|
| 89 |
-
```
|
| 90 |
-
|
| 91 |
-
**Dataset Setup**:
|
| 92 |
-
```bash
|
| 93 |
-
# Ensure environment variables are available for the script
|
| 94 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 95 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 96 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 97 |
-
|
| 98 |
-
python setup_hf_dataset.py "$HF_TOKEN"
|
| 99 |
-
```
|
| 100 |
-
|
| 101 |
-
**Trackio Configuration**:
|
| 102 |
-
```bash
|
| 103 |
-
# Ensure environment variables are available for the script
|
| 104 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 105 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 106 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 107 |
-
|
| 108 |
-
python configure_trackio.py
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
**Training Script**:
|
| 112 |
-
```bash
|
| 113 |
-
# Ensure environment variables are available for training
|
| 114 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 115 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 116 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 117 |
-
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 118 |
-
|
| 119 |
-
python scripts/training/train.py \
|
| 120 |
-
--config "$CONFIG_FILE" \
|
| 121 |
-
--experiment-name "$EXPERIMENT_NAME" \
|
| 122 |
-
--output-dir /output-checkpoint \
|
| 123 |
-
--trackio-url "$TRACKIO_URL" \
|
| 124 |
-
--trainer-type "$TRAINER_TYPE"
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
**Model Push**:
|
| 128 |
-
```bash
|
| 129 |
-
# Ensure environment variables are available for model push
|
| 130 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 131 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 132 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 133 |
-
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 134 |
-
|
| 135 |
-
python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
|
| 136 |
-
--token "$HF_TOKEN" \
|
| 137 |
-
--trackio-url "$TRACKIO_URL" \
|
| 138 |
-
--experiment-name "$EXPERIMENT_NAME" \
|
| 139 |
-
--dataset-repo "$TRACKIO_DATASET_REPO"
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
**Quantization Scripts**:
|
| 143 |
-
```bash
|
| 144 |
-
# Ensure environment variables are available for quantization
|
| 145 |
-
export HF_TOKEN="$HF_TOKEN"
|
| 146 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 147 |
-
export HF_USERNAME="$HF_USERNAME"
|
| 148 |
-
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
|
| 149 |
-
|
| 150 |
-
python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
|
| 151 |
-
--quant-type "$QUANT_TYPE" \
|
| 152 |
-
--device "$DEVICE" \
|
| 153 |
-
--token "$HF_TOKEN" \
|
| 154 |
-
--trackio-url "$TRACKIO_URL" \
|
| 155 |
-
--experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
|
| 156 |
-
--dataset-repo "$TRACKIO_DATASET_REPO"
|
| 157 |
-
```
|
| 158 |
-
|
| 159 |
-
## Key Improvements
|
| 160 |
-
|
| 161 |
-
### 1. **Proper Environment Variable Timing**
|
| 162 |
-
- ✅ **Set before virtual environment**: Variables set before creating venv
|
| 163 |
-
- ✅ **Re-export after activation**: Variables re-exported after activating venv
|
| 164 |
-
- ✅ **Before each script**: Variables exported before each Python script call
|
| 165 |
-
- ✅ **Verification step**: Token availability verified before proceeding
|
| 166 |
-
|
| 167 |
-
### 2. **Comprehensive Coverage**
|
| 168 |
-
- ✅ **All scripts covered**: Every Python script has environment variables
|
| 169 |
-
- ✅ **Multiple variables**: HF_TOKEN, HUGGING_FACE_HUB_TOKEN, HF_USERNAME, TRACKIO_DATASET_REPO
|
| 170 |
-
- ✅ **Consistent naming**: All scripts use the same environment variable names
|
| 171 |
-
- ✅ **Error handling**: Verification step catches missing tokens
|
| 172 |
-
|
| 173 |
-
### 3. **Cross-Platform Compatibility**
|
| 174 |
-
- ✅ **Bash compatible**: Uses standard bash export syntax
|
| 175 |
-
- ✅ **Virtual environment aware**: Properly handles venv activation
|
| 176 |
-
- ✅ **Token validation**: Verifies token availability before use
|
| 177 |
-
- ✅ **Clear error messages**: Descriptive error messages for debugging
|
| 178 |
-
|
| 179 |
-
## Environment Variables Set
|
| 180 |
-
|
| 181 |
-
The following environment variables are now properly set and available in the virtual environment:
|
| 182 |
-
|
| 183 |
-
1. **`HF_TOKEN`** - The Hugging Face token for authentication
|
| 184 |
-
2. **`HUGGING_FACE_HUB_TOKEN`** - Older alternative name for the same token, still read by the `huggingface_hub` Python library
|
| 185 |
-
3. **`HF_USERNAME`** - Username extracted from token
|
| 186 |
-
4. **`TRACKIO_DATASET_REPO`** - Dataset repository for Trackio
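
On the Python side, a script can accept either token variable. The sketch below is illustrative only: `resolve_hf_token` is a hypothetical helper, and preferring `HF_TOKEN` over `HUGGING_FACE_HUB_TOKEN` is an assumption about the desired precedence.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found, or None if neither variable is set."""
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.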
|
| 187 |
-
|
| 188 |
-
## Test Results
|
| 189 |
-
|
| 190 |
-
### **Environment Setup Test**
|
| 191 |
-
```bash
|
| 192 |
-
$ python tests/test_environment_setup.py
|
| 193 |
-
|
| 194 |
-
🚀 Environment Setup Verification
|
| 195 |
-
==================================================
|
| 196 |
-
🔍 Testing Environment Variables
|
| 197 |
-
[OK] HF_TOKEN: hf_FWrfleE...zuoF
|
| 198 |
-
[OK] HUGGING_FACE_HUB_TOKEN: hf_FWrfleE...zuoF
|
| 199 |
-
[OK] HF_USERNAME: Tonic...onic
|
| 200 |
-
[OK] TRACKIO_DATASET_REPO: Tonic/trac...ents
|
| 201 |
-
|
| 202 |
-
🔍 Testing Launch Script Environment Setup
|
| 203 |
-
[OK] Found: export HF_TOKEN=
|
| 204 |
-
[OK] Found: export HUGGING_FACE_HUB_TOKEN=
|
| 205 |
-
[OK] Found: export HF_USERNAME=
|
| 206 |
-
[OK] Found: export TRACKIO_DATASET_REPO=
|
| 207 |
-
[OK] Found virtual environment activation
|
| 208 |
-
[OK] Found environment variable re-export after activation
|
| 209 |
-
|
| 210 |
-
[SUCCESS] ALL ENVIRONMENT TESTS PASSED!
|
| 211 |
-
[OK] Environment variables: Properly set
|
| 212 |
-
[OK] Virtual environment: Can access variables
|
| 213 |
-
[OK] Launch script: Properly configured
|
| 214 |
-
|
| 215 |
-
The environment setup is working correctly!
|
| 216 |
-
```
|
| 217 |
-
|
| 218 |
-
## User Token Status
|
| 219 |
-
|
| 220 |
-
**Token**: `hf_FWrfleE...zuoF` (masked)
|
| 221 |
-
**Status**: ✅ **Working correctly in virtual environment**
|
| 222 |
-
**Username**: `Tonic` (auto-detected)
|
| 223 |
-
|
| 224 |
-
## Next Steps
|
| 225 |
-
|
| 226 |
-
The user can now run the launch script with confidence that the token will be properly available in the virtual environment:
|
| 227 |
-
|
| 228 |
-
```bash
|
| 229 |
-
./launch.sh
|
| 230 |
-
```
|
| 231 |
-
|
| 232 |
-
The script will:
|
| 233 |
-
1. ✅ **Set environment variables** before creating virtual environment
|
| 234 |
-
2. ✅ **Re-export variables** after activating virtual environment
|
| 235 |
-
3. ✅ **Verify token availability** before proceeding
|
| 236 |
-
4. ✅ **Export variables** before each Python script call
|
| 237 |
-
5. ✅ **Ensure all scripts** have access to the token
|
| 238 |
-
|
| 239 |
-
**No more token-related errors in the virtual environment!** 🎉
|
docs/ENVIRONMENT_VARIABLES.md
DELETED
|
@@ -1,113 +0,0 @@
|
|
| 1 |
-
# 🔧 Trackio Environment Variables Reference
|
| 2 |
-
|
| 3 |
-
## Quick Setup
|
| 4 |
-
|
| 5 |
-
Set these environment variables in your Hugging Face Space:
|
| 6 |
-
|
| 7 |
-
```bash
|
| 8 |
-
# Required: Your HF token for dataset access
|
| 9 |
-
HF_TOKEN=your_hf_token_here
|
| 10 |
-
|
| 11 |
-
# Optional: Dataset repository to use (defaults to tonic/trackio-experiments)
|
| 12 |
-
TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 13 |
-
```
|
| 14 |
-
|
| 15 |
-
## Environment Variables
|
| 16 |
-
|
| 17 |
-
| Variable | Required | Default | Description |
|
| 18 |
-
|----------|----------|---------|-------------|
|
| 19 |
-
| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token for dataset access |
|
| 20 |
-
| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository to load experiments from |
|
| 21 |
-
| `SPACE_ID` | 🔄 Auto | None | HF Space ID (automatically detected) |
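
The table can be read as a small settings loader. The following is a sketch under assumptions: the function name `load_trackio_settings` and the returned keys are hypothetical, and only the variable names and the default repository come from the table.

```python
import os
from typing import Mapping

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


def load_trackio_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read Trackio settings, applying the documented default for the dataset repo."""
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only required variable.
        raise RuntimeError("HF_TOKEN not found")
    return {
        "hf_token": token,
        # Optional: fall back to the default repository when unset or empty.
        "dataset_repo": env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        # Set automatically inside a Space; None when running elsewhere.
        "space_id": env.get("SPACE_ID"),
    }
```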
|
| 22 |
-
|
| 23 |
-
## Configuration Examples
|
| 24 |
-
|
| 25 |
-
### 1. Default Setup
|
| 26 |
-
```bash
|
| 27 |
-
HF_TOKEN=your_token_here
|
| 28 |
-
# Uses: tonic/trackio-experiments
|
| 29 |
-
```
|
| 30 |
-
|
| 31 |
-
### 2. Personal Dataset
|
| 32 |
-
```bash
|
| 33 |
-
HF_TOKEN=your_token_here
|
| 34 |
-
TRACKIO_DATASET_REPO=your-username/trackio-experiments
|
| 35 |
-
```
|
| 36 |
-
|
| 37 |
-
### 3. Team Dataset
|
| 38 |
-
```bash
|
| 39 |
-
HF_TOKEN=your_token_here
|
| 40 |
-
TRACKIO_DATASET_REPO=your-org/team-experiments
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
### 4. Project-Specific Dataset
|
| 44 |
-
```bash
|
| 45 |
-
HF_TOKEN=your_token_here
|
| 46 |
-
TRACKIO_DATASET_REPO=your-username/smollm3-experiments
|
| 47 |
-
```
|
| 48 |
-
|
| 49 |
-
## How to Set in HF Spaces
|
| 50 |
-
|
| 51 |
-
1. Go to your Hugging Face Space settings
|
| 52 |
-
2. Open the "Variables and secrets" section
|
| 53 |
-
3. Add the variables:
|
| 54 |
-
- `HF_TOKEN`: Your HF token
|
| 55 |
-
- `TRACKIO_DATASET_REPO`: Your dataset repository (optional)
|
| 56 |
-
|
| 57 |
-
## Testing Configuration
|
| 58 |
-
|
| 59 |
-
Run the configuration script to check your setup:
|
| 60 |
-
|
| 61 |
-
```bash
|
| 62 |
-
python configure_trackio.py
|
| 63 |
-
```
|
| 64 |
-
|
| 65 |
-
This will:
|
| 66 |
-
- ✅ Show current environment variables
|
| 67 |
-
- 🧪 Test dataset access
|
| 68 |
-
- 📊 Display experiment count
|
| 69 |
-
- 💾 Generate configuration file
|
| 70 |
-
|
| 71 |
-
## Getting Your HF Token
|
| 72 |
-
|
| 73 |
-
1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
|
| 74 |
-
2. Click "New token"
|
| 75 |
-
3. Give it a name (e.g., "Trackio Access")
|
| 76 |
-
4. Select "Write" permissions
|
| 77 |
-
5. Copy the token and set it as `HF_TOKEN`
|
| 78 |
-
|
| 79 |
-
## Dataset Repository Format
|
| 80 |
-
|
| 81 |
-
The `TRACKIO_DATASET_REPO` should follow this format:
|
| 82 |
-
```
|
| 83 |
-
username/dataset-name
|
| 84 |
-
```
|
| 85 |
-
|
| 86 |
-
Examples:
|
| 87 |
-
- `tonic/trackio-experiments`
|
| 88 |
-
- `your-username/my-experiments`
|
| 89 |
-
- `your-org/team-experiments`
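
A quick format check can catch a malformed value before any network call is made. The pattern below is an approximation of the `username/dataset-name` shape described above, not the Hub's official naming rule, and `is_valid_dataset_repo` is a hypothetical helper.

```python
import re

# Two segments separated by exactly one slash; each starts with a letter or digit.
_REPO_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$")


def is_valid_dataset_repo(repo_id: str) -> bool:
    """Return True when repo_id looks like `owner/name`."""
    return bool(_REPO_ID.match(repo_id))
```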
|
| 90 |
-
|
| 91 |
-
## Troubleshooting
|
| 92 |
-
|
| 93 |
-
### Issue: "HF_TOKEN not found"
|
| 94 |
-
**Solution**: Set your HF token in the Space environment variables
|
| 95 |
-
|
| 96 |
-
### Issue: "Failed to load dataset"
|
| 97 |
-
**Solutions**:
|
| 98 |
-
1. Check your token has read access to the dataset
|
| 99 |
-
2. Verify the dataset repository exists
|
| 100 |
-
3. Try the backup fallback (automatic)
|
| 101 |
-
|
| 102 |
-
### Issue: "Failed to save experiments"
|
| 103 |
-
**Solutions**:
|
| 104 |
-
1. Check your token has write permissions
|
| 105 |
-
2. Verify the dataset repository exists
|
| 106 |
-
3. Check network connectivity
|
| 107 |
-
|
| 108 |
-
## Security Notes
|
| 109 |
-
|
| 110 |
-
- 🔒 Dataset is private by default
|
| 111 |
-
- 🔑 Only accessible with your HF_TOKEN
|
| 112 |
-
- 🛡️ No sensitive data exposed publicly
|
| 113 |
-
- 🔐 Secure storage on HF infrastructure
|
docs/Entry_Point.md
ADDED
|
@@ -0,0 +1,120 @@
|
|
| 1 |
+
```mermaid
|
| 2 |
+
graph LR
|
| 3 |
+
Entry_Point["Entry Point"]
|
| 4 |
+
Configuration["Configuration"]
|
| 5 |
+
Model_Abstraction["Model Abstraction"]
|
| 6 |
+
Data_Pipeline["Data Pipeline"]
|
| 7 |
+
Training_Logic["Training Logic"]
|
| 8 |
+
Utilities["Utilities"]
|
| 9 |
+
Scripts["Scripts"]
|
| 10 |
+
Requirements_Management["Requirements Management"]
|
| 11 |
+
Entry_Point -- "initializes" --> Configuration
|
| 12 |
+
Entry_Point -- "initializes" --> Model_Abstraction
|
| 13 |
+
Entry_Point -- "initializes" --> Data_Pipeline
|
| 14 |
+
Entry_Point -- "invokes" --> Training_Logic
|
| 15 |
+
Configuration -- "provides settings to" --> Model_Abstraction
|
| 16 |
+
Configuration -- "provides settings to" --> Data_Pipeline
|
| 17 |
+
Configuration -- "provides settings to" --> Training_Logic
|
| 18 |
+
Model_Abstraction -- "provides model to" --> Training_Logic
|
| 19 |
+
Data_Pipeline -- "provides data to" --> Training_Logic
|
| 20 |
+
Training_Logic -- "utilizes" --> Model_Abstraction
|
| 21 |
+
Training_Logic -- "utilizes" --> Data_Pipeline
|
| 22 |
+
Training_Logic -- "utilizes" --> Configuration
|
| 23 |
+
Training_Logic -- "utilizes" --> Utilities
|
| 24 |
+
Data_Pipeline -- "uses" --> Utilities
|
| 25 |
+
Model_Abstraction -- "uses" --> Utilities
|
| 26 |
+
Scripts -- "supports" --> Data_Pipeline
|
| 27 |
+
Scripts -- "supports" --> Model_Abstraction
|
| 28 |
+
Requirements_Management -- "defines environment for" --> Entry_Point
|
| 29 |
+
Requirements_Management -- "defines environment for" --> Configuration
|
| 30 |
+
Requirements_Management -- "defines environment for" --> Model_Abstraction
|
| 31 |
+
Requirements_Management -- "defines environment for" --> Data_Pipeline
|
| 32 |
+
Requirements_Management -- "defines environment for" --> Training_Logic
|
| 33 |
+
Requirements_Management -- "defines environment for" --> Utilities
|
| 34 |
+
Requirements_Management -- "defines environment for" --> Scripts
|
| 35 |
+
click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
|
| 36 |
+
click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
|
| 37 |
+
click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
## Details
|
| 43 |
+
|
| 44 |
+
Component overview for the Machine Learning Training and Fine-tuning Framework.
|
| 45 |
+
|
| 46 |
+
### Entry Point [[Expand]](./Entry_Point.md)
|
| 47 |
+
The primary execution script that orchestrates the entire training process. It initializes all other major components, loads configurations, sets up the training environment, and invokes the core training logic.
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
**Related Classes/Methods**:
|
| 51 |
+
|
| 52 |
+
- `train.py`
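
The orchestration order described above (configuration first, then model and data, then training) can be sketched as follows. Every name here is a hypothetical placeholder standing in for the real components; none of it is taken from `train.py`.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical stand-in for the project's configuration objects."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_epochs: int = 3


def load_model(config: TrainingConfig) -> dict:
    # Placeholder for the Model Abstraction component.
    return {"name": config.model_name}


def build_data_pipeline(config: TrainingConfig) -> list:
    # Placeholder for the Data Pipeline component.
    return [{"text": "example"}]


def run_training(config: TrainingConfig, model: dict, data: list) -> dict:
    # Placeholder for the Training Logic component.
    return {"model": model["name"], "epochs": config.max_epochs, "examples": len(data)}


def main() -> dict:
    config = TrainingConfig()            # 1. initialize configuration
    model = load_model(config)           # 2. initialize the model
    data = build_data_pipeline(config)   # 3. initialize the data pipeline
    return run_training(config, model, data)  # 4. invoke training
```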
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
### Configuration
|
| 56 |
+
Centralized management of all training parameters, model hyperparameters, dataset paths, and other environment settings. It defines the schema for configurations, often using dataclasses, and supports both base and custom configurations.
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
**Related Classes/Methods**:
|
| 60 |
+
|
| 61 |
+
- `config/` (1:1)
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
### Model Abstraction [[Expand]](./Model_Abstraction.md)
|
| 65 |
+
Responsible for abstracting the underlying machine learning model. This includes loading pre-trained models, handling different model architectures or variants, and preparing the model for training (e.g., quantization, device placement).
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
**Related Classes/Methods**:
|
| 69 |
+
|
| 70 |
+
- <a href="https://github.com/Josephrp/SmolFactory/blob/main/src/model.py#L1-L1" target="_blank" rel="noopener noreferrer">`model.py` (1:1)</a>
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
### Data Pipeline [[Expand]](./Data_Pipeline.md)
|
| 74 |
+
Manages the entire data flow, from loading raw datasets to preprocessing, tokenization, and creating efficient data loaders (e.g., PyTorch `DataLoader`) for batching and shuffling data during training and evaluation.
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
**Related Classes/Methods**:
|
| 78 |
+
|
| 79 |
+
- <a href="https://github.com/Josephrp/SmolFactory/blob/main/src/data.py#L1-L1" target="_blank" rel="noopener noreferrer">`data.py` (1:1)</a>
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
### Training Logic
|
| 83 |
+
Encapsulates the core training loop, including forward and backward passes, loss calculation, optimization steps, and integration of callbacks for monitoring and control. It may include specialized trainers for different fine-tuning methods.
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
**Related Classes/Methods**:
|
| 87 |
+
|
| 88 |
+
- <a href="https://github.com/Josephrp/SmolFactory/blob/main/src/trainer.py#L1-L1" target="_blank" rel="noopener noreferrer">`trainer.py` (1:1)</a>
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
### Utilities
|
| 92 |
+
Provides a collection of common helper functions, classes, and modules used across various components. This includes functionalities like logging, metric calculation, checkpointing, and general data manipulation.
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
**Related Classes/Methods**:
|
| 96 |
+
|
| 97 |
+
- `utils/` (1:1)
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
### Scripts
|
| 101 |
+
Contains auxiliary scripts that support the overall project but are separate from the main training pipeline. Examples include data preparation scripts, model conversion tools, or deployment-related utilities.
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
**Related Classes/Methods**:
|
| 105 |
+
|
| 106 |
+
- `scripts/` (1:1)
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
### Requirements Management
|
| 110 |
+
Defines and manages all project dependencies, ensuring a consistent and reproducible development and deployment environment. This typically involves `requirements.txt` files or similar dependency management tools.
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
**Related Classes/Methods**:
|
| 114 |
+
|
| 115 |
+
- `requirements/` (1:1)
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
|
docs/FINAL_DEPLOYMENT_VERIFICATION.md
DELETED
|
@@ -1,378 +0,0 @@
|
|
| 1 |
-
# Final Deployment Verification Summary
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document provides the final verification that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
|
| 6 |
-
|
| 7 |
-
## ✅ **VERIFICATION COMPLETE: All Components Properly Implemented**
|
| 8 |
-
|
| 9 |
-
### **What We Verified**
|
| 10 |
-
|
| 11 |
-
The Trackio Spaces deployment and model repository deployment components were checked one by one. The sections below list each component, where it lives, and what was verified:
|
| 12 |
-
|
| 13 |
-
## **Trackio Spaces Deployment** ✅ **FULLY IMPLEMENTED**
|
| 14 |
-
|
| 15 |
-
### **1. Space Creation System** ✅ **COMPLETE**
|
| 16 |
-
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
|
| 17 |
-
- **Functionality**: Creates HF Spaces using latest Python API
|
| 18 |
-
- **Features**:
|
| 19 |
-
- ✅ API-based creation with `huggingface_hub.create_repo`
|
| 20 |
-
- ✅ Fallback to CLI method if API fails
|
| 21 |
-
- ✅ Automatic username extraction from token
|
| 22 |
-
- ✅ Proper Space configuration (Gradio SDK, CPU hardware)
|
| 23 |
-
|
| 24 |
-
### **2. File Upload System** ✅ **COMPLETE**
|
| 25 |
-
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
|
| 26 |
-
- **Functionality**: Uploads all required files to Space
|
| 27 |
-
- **Features**:
|
| 28 |
-
- ✅ API-based upload using `huggingface_hub.upload_file`
|
| 29 |
-
- ✅ Proper HF Spaces file structure
|
| 30 |
-
- ✅ Git integration in temporary directory
|
| 31 |
-
- ✅ Error handling and fallback mechanisms
|
| 32 |
-
|
| 33 |
-
**Files Uploaded**:
|
| 34 |
-
- ✅ `app.py` - Complete Gradio interface (1,241 lines)
|
| 35 |
-
- ✅ `requirements.txt` - All dependencies included
|
| 36 |
-
- ✅ `README.md` - Comprehensive documentation
|
| 37 |
-
- ✅ `.gitignore` - Proper git configuration
|
| 38 |
-
|
| 39 |
-
### **3. Space Configuration** ✅ **COMPLETE**
|
| 40 |
-
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
|
| 41 |
-
- **Functionality**: Sets environment variables via HF Hub API
|
| 42 |
-
- **Features**:
|
| 43 |
-
- ✅ API-based secrets using `add_space_secret()`
|
| 44 |
-
- ✅ Automatic `HF_TOKEN` configuration
|
| 45 |
-
- ✅ Automatic `TRACKIO_DATASET_REPO` setup
|
| 46 |
-
- ✅ Manual fallback instructions if API fails
|
| 47 |
-
|
| 48 |
-
### **4. Gradio Interface** ✅ **COMPLETE**
|
| 49 |
-
- **Location**: `templates/spaces/app.py` (1,241 lines)
|
| 50 |
-
- **Functionality**: Comprehensive experiment tracking interface
|
| 51 |
-
- **Features**:
|
| 52 |
-
- ✅ **Experiment Management**: Create, view, update experiments
|
| 53 |
-
- ✅ **Metrics Logging**: Real-time training metrics
|
| 54 |
-
- ✅ **Visualization**: Interactive plots and charts
|
| 55 |
-
- ✅ **HF Datasets Integration**: Persistent storage
|
| 56 |
-
- ✅ **API Endpoints**: Programmatic access
|
| 57 |
-
- ✅ **Fallback Data**: Backup when dataset unavailable
|
| 58 |
-
|
| 59 |
-
**Interface Components**:
|
| 60 |
-
- ✅ **Create Experiment**: Start new experiments
|
| 61 |
-
- ✅ **Log Metrics**: Track training progress
|
| 62 |
-
- ✅ **View Experiments**: See experiment details
|
| 63 |
-
- ✅ **Update Status**: Mark experiments complete
|
| 64 |
-
- ✅ **Visualizations**: Interactive plots
|
| 65 |
-
- ✅ **Configuration**: Environment setup
|
| 66 |
-
|
| 67 |
-
### **5. Requirements and Dependencies** ✅ **COMPLETE**
|
| 68 |
-
- **Location**: `templates/spaces/requirements.txt`
|
| 69 |
-
- **Dependencies**: All required packages included
|
| 70 |
-
- ✅ **Core Gradio**: `gradio>=4.0.0`
|
| 71 |
-
- ✅ **Data Processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
|
| 72 |
-
- ✅ **Visualization**: `plotly>=5.15.0`
|
| 73 |
-
- ✅ **HF Integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
|
| 74 |
-
- ✅ **HTTP Requests**: `requests>=2.31.0`
|
| 75 |
-
- ✅ **Environment**: `python-dotenv>=1.0.0`
|
| 76 |
-
|
| 77 |
-
### **6. README Template** ✅ **COMPLETE**
|
| 78 |
-
- **Location**: `templates/spaces/README.md`
|
| 79 |
-
- **Features**:
|
| 80 |
-
- ✅ **HF Spaces Metadata**: Proper YAML frontmatter
|
| 81 |
-
- ✅ **Feature Documentation**: Complete interface description
|
| 82 |
-
- ✅ **API Documentation**: Usage examples
|
| 83 |
-
- ✅ **Configuration Guide**: Environment variables
|
| 84 |
-
- ✅ **Troubleshooting**: Common issues and solutions
|
| 85 |
-
|
| 86 |
-
## **Model Repository Deployment** ✅ **FULLY IMPLEMENTED**
|
| 87 |
-
|
| 88 |
-
### **1. Repository Creation** ✅ **COMPLETE**
|
| 89 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 90 |
-
- **Functionality**: Creates HF model repositories using Python API
|
| 91 |
-
- **Features**:
|
| 92 |
-
- ✅ API-based creation with `huggingface_hub.create_repo`
|
| 93 |
-
- ✅ Configurable private/public settings
|
| 94 |
-
- ✅ Existing repository handling (`exist_ok=True`)
|
| 95 |
-
- ✅ Proper error handling and messages
|
| 96 |
-
|
| 97 |
-
### **2. Model File Upload** ✅ **COMPLETE**
|
| 98 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 99 |
-
- **Functionality**: Uploads all model files to repository
|
| 100 |
-
- **Features**:
|
| 101 |
-
- ✅ File validation and integrity checks
|
| 102 |
-
- ✅ Complete model component upload
|
| 103 |
-
- ✅ Progress tracking and feedback
|
| 104 |
-
- ✅ Graceful error handling
|
| 105 |
-
|
| 106 |
-
**Files Uploaded**:
|
| 107 |
-
- ✅ `config.json` - Model configuration
|
| 108 |
-
- ✅ `pytorch_model.bin` - Model weights
|
| 109 |
-
- ✅ `tokenizer.json` - Tokenizer configuration
|
| 110 |
-
- ✅ `tokenizer_config.json` - Tokenizer settings
|
| 111 |
-
- ✅ `special_tokens_map.json` - Special tokens
|
| 112 |
-
- ✅ `generation_config.json` - Generation settings
|
| 113 |
-
|
| 114 |
-
### **3. Model Card Generation** ✅ **COMPLETE**
|
| 115 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 116 |
-
- **Functionality**: Generates comprehensive model cards
|
| 117 |
-
- **Features**:
|
| 118 |
-
- ✅ Template-based generation using `templates/model_card.md`
|
| 119 |
-
- ✅ Dynamic content from training configuration
|
| 120 |
-
- ✅ Usage examples and documentation
|
| 121 |
-
- ✅ Support for quantized model variants
|
| 122 |
-
- ✅ Proper HF Hub metadata
|
| 123 |
-
|
| 124 |
-
### **4. Training Results Documentation** ✅ **COMPLETE**
|
| 125 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 126 |
-
- **Functionality**: Uploads training configuration and results
|
| 127 |
-
- **Features**:
|
| 128 |
-
- ✅ Training parameters documentation
|
| 129 |
-
- ✅ Performance metrics inclusion
|
| 130 |
-
- ✅ Experiment tracking links
|
| 131 |
-
- ✅ Proper documentation structure
|
| 132 |
-
|
| 133 |
-
### **5. Quantized Model Support** ✅ **COMPLETE**
|
| 134 |
-
- **Location**: `scripts/model_tonic/quantize_model.py`
|
| 135 |
-
- **Functionality**: Creates and uploads quantized models
|
| 136 |
-
- **Features**:
|
| 137 |
-
- ✅ Multiple quantization levels (int8, int4)
|
| 138 |
-
- ✅ Unified repository structure
|
| 139 |
-
- ✅ Separate documentation for each variant
|
| 140 |
-
- ✅ Clear usage instructions
|
| 141 |
-
|
| 142 |
-
### **6. Trackio Integration** ✅ **COMPLETE**
|
| 143 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 144 |
-
- **Functionality**: Logs model push events to Trackio
|
| 145 |
-
- **Features**:
|
| 146 |
-
- ✅ Event logging for model pushes
|
| 147 |
-
- ✅ Training results tracking
|
| 148 |
-
- ✅ Experiment tracking links
|
| 149 |
-
- ✅ HF Datasets integration
|
| 150 |
-
|
| 151 |
-
### **7. Model Validation** ✅ **COMPLETE**
|
| 152 |
-
- **Location**: `scripts/model_tonic/push_to_huggingface.py`
|
| 153 |
-
- **Functionality**: Validates model files before upload
|
| 154 |
-
- **Features**:
|
| 155 |
-
- ✅ Complete file validation
|
| 156 |
-
- ✅ Size and integrity checks
|
| 157 |
-
- ✅ Configuration validation
|
| 158 |
-
- ✅ Detailed error reporting
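
The checks listed above could look roughly like the following. This is a minimal sketch, not the code in `push_to_huggingface.py`: the function name `find_model_problems` and the `REQUIRED_FILES` list are assumptions made for illustration (the file names are taken from the "Files Uploaded" list earlier in this document).

```python
import json
from pathlib import Path
from typing import List

# Assumed for illustration: a subset of the files named under "Files Uploaded".
REQUIRED_FILES = ["config.json", "pytorch_model.bin", "tokenizer.json", "tokenizer_config.json"]


def find_model_problems(model_dir: str, required: List[str] = REQUIRED_FILES) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    for name in required:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {name}")

    # Configuration validation: config.json must at least parse as JSON.
    config = root / "config.json"
    if config.is_file() and config.stat().st_size > 0:
        try:
            json.loads(config.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"config.json is not valid JSON: {exc}")
    return problems
```

Returning a list of problems rather than a single boolean keeps the error report detailed: the caller can print every issue at once before aborting the upload.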
|
| 159 |
-
|
| 160 |
-
## **Integration Components** ✅ **FULLY IMPLEMENTED**
|
| 161 |
-
|
| 162 |
-
### **1. Launch Script Integration** ✅ **COMPLETE**
|
| 163 |
-
- **Location**: `launch.sh`
|
| 164 |
-
- **Features**:
|
| 165 |
-
- ✅ Automatic Trackio Space deployment calls
|
| 166 |
-
- ✅ Automatic model push integration
|
| 167 |
-
- ✅ Environment setup and configuration
|
| 168 |
-
- ✅ Error handling and user feedback
|
| 169 |
-
|
| 170 |
-
### **2. Monitoring Integration** ✅ **COMPLETE**
|
| 171 |
-
- **Location**: `src/monitoring.py`
|
| 172 |
-
- **Features**:
|
| 173 |
-
- ✅ `SmolLM3Monitor` class implementation
|
| 174 |
-
- ✅ Real-time experiment tracking
|
| 175 |
-
- ✅ Trackio Space integration
|
| 176 |
-
- ✅ HF Datasets integration
|
| 177 |
-
|
| 178 |
-
### **3. Dataset Integration** ✅ **COMPLETE**
|
| 179 |
-
- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py`
|
| 180 |
-
- **Features**:
|
| 181 |
-
- ✅ Automatic dataset repository creation
|
| 182 |
-
- ✅ Initial experiment data upload
|
| 183 |
-
- ✅ README template integration
|
| 184 |
-
- ✅ Environment variable setup
|
| 185 |
-
|
| 186 |
-
## **Token Validation** ✅ **FULLY IMPLEMENTED**
|
| 187 |
-
|
| 188 |
-
### **1. Token Validation System** ✅ **COMPLETE**
|
| 189 |
-
- **Location**: `scripts/validate_hf_token.py`
|
| 190 |
-
- **Features**:
|
| 191 |
-
- ✅ API-based token validation
|
| 192 |
-
- ✅ Username extraction from token
|
| 193 |
-
- ✅ JSON output for shell parsing
|
| 194 |
-
- ✅ Comprehensive error handling
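
As a rough illustration of the features above, token validation with JSON output might be sketched as follows. The function name `validate_token`, the injectable `whoami` parameter, and the JSON keys `success`, `username`, and `error` are assumptions for this sketch and may differ from `scripts/validate_hf_token.py`; `HfApi.whoami` is the `huggingface_hub` call that returns the account details for a token.

```python
import json
from typing import Callable, Optional


def validate_token(token: str, whoami: Optional[Callable[[str], dict]] = None) -> str:
    """Return a JSON string that a shell script can parse."""
    if whoami is None:
        # Imported lazily so the sketch can be exercised without the dependency.
        from huggingface_hub import HfApi

        def whoami(t: str) -> dict:
            return HfApi().whoami(token=t)

    try:
        info = whoami(token)
        return json.dumps({"success": True, "username": info.get("name")})
    except Exception as exc:  # invalid token, network failure, ...
        return json.dumps({"success": False, "error": str(exc)})
```

Emitting a single JSON object on both the success and the failure path means the calling shell script only has to parse one shape of output.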
|
| 195 |
-
|
| 196 |
-
## **Test Results** ✅ **ALL PASSED**
|
| 197 |
-
|
| 198 |
-
### **Comprehensive Component Test**
|
| 199 |
-
```bash
|
| 200 |
-
$ python tests/test_deployment_components.py
|
| 201 |
-
|
| 202 |
-
🚀 Deployment Components Verification
|
| 203 |
-
==================================================
|
| 204 |
-
🔍 Testing Trackio Space Deployment Components
|
| 205 |
-
✅ Trackio Space deployment script exists
|
| 206 |
-
✅ Gradio app template exists
|
| 207 |
-
✅ TrackioSpace class implemented
|
| 208 |
-
✅ Experiment creation functionality
|
| 209 |
-
✅ Metrics logging functionality
|
| 210 |
-
✅ Experiment retrieval functionality
|
| 211 |
-
✅ Space requirements file exists
|
| 212 |
-
✅ Required dependency: gradio
|
| 213 |
-
✅ Required dependency: pandas
|
| 214 |
-
✅ Required dependency: plotly
|
| 215 |
-
✅ Required dependency: datasets
|
| 216 |
-
✅ Required dependency: huggingface-hub
|
| 217 |
-
✅ Space README template exists
|
| 218 |
-
✅ HF Spaces metadata present
|
| 219 |
-
✅ All Trackio Space components verified!
|
| 220 |
-
|
| 221 |
-
🔍 Testing Model Repository Deployment Components
|
| 222 |
-
✅ Model push script exists
|
| 223 |
-
✅ Model quantization script exists
|
| 224 |
-
✅ Model card template exists
|
| 225 |
-
✅ Required section: base_model:
|
| 226 |
-
✅ Required section: pipeline_tag:
|
| 227 |
-
✅ Required section: tags:
|
| 228 |
-
✅ Model card generator exists
|
| 229 |
-
✅ Required function: def create_repository
|
| 230 |
-
✅ Required function: def upload_model_files
|
| 231 |
-
✅ Required function: def create_model_card
|
| 232 |
-
✅ Required function: def validate_model_path
|
| 233 |
-
✅ All Model Repository components verified!
|
| 234 |
-
|
| 235 |
-
🔍 Testing Integration Components
|
| 236 |
-
✅ Launch script exists
|
| 237 |
-
✅ Trackio Space deployment integrated
|
| 238 |
-
✅ Model push integrated
|
| 239 |
-
✅ Monitoring script exists
|
| 240 |
-
✅ SmolLM3Monitor class implemented
|
| 241 |
-
✅ Dataset setup script exists
|
| 242 |
-
✅ Dataset setup function implemented
|
| 243 |
-
✅ All integration components verified!
|
| 244 |
-
|
| 245 |
-
🔍 Testing Token Validation
|
| 246 |
-
✅ Token validation script exists
|
| 247 |
-
✅ Token validation function implemented
|
| 248 |
-
✅ Token validation components verified!
|
| 249 |
-
|
| 250 |
-
==================================================
|
| 251 |
-
🎉 ALL COMPONENTS VERIFIED SUCCESSFULLY!
|
| 252 |
-
✅ Trackio Space deployment components: Complete
|
| 253 |
-
✅ Model repository deployment components: Complete
|
| 254 |
-
✅ Integration components: Complete
|
| 255 |
-
✅ Token validation components: Complete
|
| 256 |
-
|
| 257 |
-
All important deployment components are properly implemented!
|
| 258 |
-
```
|
| 259 |
-
|
| 260 |
-
## **Technical Implementation Details**
|
| 261 |
-
|
| 262 |
-
### **Trackio Space Deployment Flow**
|
| 263 |
-
```python
|
| 264 |
-
# 1. Create Space
|
| 265 |
-
create_repo(
|
| 266 |
-
repo_id=f"{username}/{space_name}",
|
| 267 |
-
token=token,
|
| 268 |
-
repo_type="space",
|
| 269 |
-
exist_ok=True,
|
| 270 |
-
private=False,
|
| 271 |
-
space_sdk="gradio",
|
| 272 |
-
space_hardware="cpu-basic"
|
| 273 |
-
)
|
| 274 |
-
|
| 275 |
-
# 2. Upload Files
|
| 276 |
-
upload_file(
|
| 277 |
-
path_or_fileobj=file_content,
|
| 278 |
-
path_in_repo=file_path,
|
| 279 |
-
repo_id=repo_id,
|
| 280 |
-
repo_type="space",
|
| 281 |
-
token=token
|
| 282 |
-
)
|
| 283 |
-
|
| 284 |
-
# 3. Set Secrets
|
| 285 |
-
add_space_secret(
|
| 286 |
-
repo_id=repo_id,
|
| 287 |
-
token=token,
|
| 288 |
-
key="HF_TOKEN",
|
| 289 |
-
value=token
|
| 290 |
-
)
|
| 291 |
-
```
|
| 292 |
-
|
| 293 |
-
### **Model Repository Deployment Flow**
|
| 294 |
-
```python
|
| 295 |
-
# 1. Create Repository
|
| 296 |
-
create_repo(
|
| 297 |
-
repo_id=repo_name,
|
| 298 |
-
token=token,
|
| 299 |
-
private=private,
|
| 300 |
-
exist_ok=True
|
| 301 |
-
)
|
| 302 |
-
|
| 303 |
-
# 2. Upload Model Files
|
| 304 |
-
upload_file(
|
| 305 |
-
path_or_fileobj=model_file,
|
| 306 |
-
path_in_repo=file_path,
|
| 307 |
-
repo_id=repo_name,
|
| 308 |
-
token=token
|
| 309 |
-
)
|
| 310 |
-
|
| 311 |
-
# 3. Generate Model Card
|
| 312 |
-
model_card = create_model_card(training_config, results)
|
| 313 |
-
upload_file(
|
| 314 |
-
path_or_fileobj=model_card,
|
| 315 |
-
path_in_repo="README.md",
|
| 316 |
-
repo_id=repo_name,
|
| 317 |
-
token=token
|
| 318 |
-
)
|
| 319 |
-
```
|
| 320 |
-
|
| 321 |
-
## **Verification Summary**
|
| 322 |
-
|
| 323 |
-
| Component Category | Status | Components Verified | Test Result |
|
| 324 |
-
|-------------------|--------|-------------------|-------------|
|
| 325 |
-
| **Trackio Space Deployment** | ✅ Complete | 6 components | ✅ All passed |
|
| 326 |
-
| **Model Repository Deployment** | ✅ Complete | 7 components | ✅ All passed |
|
| 327 |
-
| **Integration Components** | ✅ Complete | 3 components | ✅ All passed |
|
| 328 |
-
| **Token Validation** | ✅ Complete | 1 component | ✅ All passed |
|
| 329 |
-
|
| 330 |
-
## **Key Achievements**
|
| 331 |
-
|
| 332 |
-
### **1. Complete Automation**
|
| 333 |
-
- ✅ **No manual username input**: Automatic extraction from token
|
| 334 |
-
- ✅ **No manual Space creation**: Automatic via Python API
|
| 335 |
-
- ✅ **No manual model upload**: Complete automation
|
| 336 |
-
- ✅ **No manual configuration**: Automatic environment setup
|
| 337 |
-
|
| 338 |
-
### **2. Robust Error Handling**
|
| 339 |
-
- ✅ **API fallbacks**: CLI methods when API fails
|
| 340 |
-
- ✅ **Graceful degradation**: Clear error messages
|
| 341 |
-
- ✅ **User feedback**: Progress indicators and status
|
| 342 |
-
- ✅ **Recovery mechanisms**: Multiple retry strategies
|
| 343 |
-
|
| 344 |
-
### **3. Comprehensive Documentation**
|
| 345 |
-
- ✅ **Model cards**: Complete with usage examples
|
| 346 |
-
- ✅ **Space documentation**: Full interface description
|
| 347 |
-
- ✅ **API documentation**: Usage examples and integration
|
| 348 |
-
- ✅ **Troubleshooting guides**: Common issues and solutions
|
| 349 |
-
|
| 350 |
-
### **4. Cross-Platform Support**
|
| 351 |
-
- ✅ **Windows**: Tested and working on PowerShell
|
| 352 |
-
- ✅ **Linux**: Compatible with bash scripts
|
| 353 |
-
- ✅ **macOS**: Compatible with zsh/bash
|
| 354 |
-
- ✅ **Python API**: Platform-independent
|
| 355 |
-
|
| 356 |
-
## **Next Steps**
|
| 357 |
-
|
| 358 |
-
The deployment components are now **fully implemented and verified**. Users can:
|
| 359 |
-
|
| 360 |
-
1. **Deploy Trackio Space**: Automatic Space creation and configuration
|
| 361 |
-
2. **Upload Models**: Complete model deployment with documentation
|
| 362 |
-
3. **Monitor Experiments**: Real-time tracking and visualization
|
| 363 |
-
4. **Share Results**: Comprehensive documentation and examples
|
| 364 |
-
5. **Scale Operations**: Support for multiple experiments and models
|
| 365 |
-
|
| 366 |
-
## **Conclusion**
|
| 367 |
-
|
| 368 |
-
**All important deployment components are properly implemented and working correctly!** 🎉
|
| 369 |
-
|
| 370 |
-
The verification confirms that:
|
| 371 |
-
- ✅ **Trackio Spaces deployment**: Complete with all required components
|
| 372 |
-
- ✅ **Model repository deployment**: Complete with all required components
|
| 373 |
-
- ✅ **Integration systems**: Complete with all required components
|
| 374 |
-
- ✅ **Token validation**: Complete with all required components
|
| 375 |
-
- ✅ **Documentation**: Complete with all required components
|
| 376 |
-
- ✅ **Error handling**: Complete with all required components
|
| 377 |
-
|
| 378 |
-
The system is now ready for production use with full automation and comprehensive functionality.
|
docs/FORMATTING_FIX_SUMMARY.md
DELETED
|
@@ -1,153 +0,0 @@
|
|
| 1 |
-
# String Formatting Fix Summary
|
| 2 |
-
|
| 3 |
-
## 🐛 Problem
|
| 4 |
-
|
| 5 |
-
The training script was failing with the error:
|
| 6 |
-
```
|
| 7 |
-
ERROR:trainer:Training failed: Unknown format code 'f' for object of type 'str'
|
| 8 |
-
```
|
| 9 |
-
|
| 10 |
-
This error is raised when a float format specification such as `:.4f` (in an f-string or a `str.format()` call) is applied to a string object instead of a numeric value.
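
The error can be reproduced in isolation. The sketch below assumes, for illustration only, that a metric such as the loss reached the formatting call as a string; the helper name `format_loss` is hypothetical and is not part of the training code.

```python
def format_loss(loss) -> str:
    """Format a loss with four decimals, coercing strings such as "0.1234" to float first."""
    return "{:.4f}".format(float(loss))


try:
    "{:.4f}".format("0.1234")  # float format spec applied to a str
except ValueError as exc:
    message = str(exc)

print(message)                # Unknown format code 'f' for object of type 'str'
print(format_loss("0.1234"))  # 0.1234
```

Coercing the value with `float()` before formatting is one way to make the call robust regardless of which formatting style is used.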
|
| 11 |
-
|
| 12 |
-
## 🔍 Root Cause
|
| 13 |
-
|
| 14 |
-
The issue was caused by inconsistent use of f-string formatting (`f"..."`) and traditional string formatting (`"..." % ...`) in the logging statements throughout the codebase. When logging statements used f-string syntax but were processed by the logging system, it could cause formatting conflicts.
|
| 15 |
-
|
| 16 |
-
## ✅ Solution
|
| 17 |
-
|
| 18 |
-
I fixed the issue by standardizing all logging statements to use traditional string formatting with `%` placeholders instead of f-strings. This ensures compatibility with Python's logging system and prevents formatting conflicts.
|
| 19 |
-
|
| 20 |
-
### Files Fixed
|
| 21 |
-
|
| 22 |
-
1. **`src/monitoring.py`** - Fixed all logging statements
|
| 23 |
-
2. **`src/trainer.py`** - Fixed all logging statements
|
| 24 |
-
3. **`src/model.py`** - Fixed all logging statements
|
| 25 |
-
4. **`src/data.py`** - Fixed all logging statements
|
| 26 |
-
|
| 27 |
-
### Changes Made
|
| 28 |
-
|
| 29 |
-
#### Before (Problematic):
|
| 30 |
-
```python
|
| 31 |
-
logger.info(f"Loading model from {self.model_name}")
|
| 32 |
-
logger.error(f"Failed to load model: {e}")
|
| 33 |
-
print(f"Step {step}: loss={loss:.4f}, lr={lr}")
|
| 34 |
-
```
|
| 35 |
-
|
| 36 |
-
#### After (Fixed):
|
| 37 |
-
```python
|
| 38 |
-
logger.info("Loading model from %s", self.model_name)
|
| 39 |
-
logger.error("Failed to load model: %s", e)
|
| 40 |
-
print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))
|
| 41 |
-
```
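
A small, self-contained way to see how `%` placeholders behave in `logging` is sketched below. The logger name `formatting_sketch` and the `ListHandler` class are assumptions made for this example; they are not part of the project.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the % arguments to the template.
        self.messages.append(record.getMessage())


logger = logging.getLogger("formatting_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.info("Step %d: loss=%.4f, lr=%s", 10, 0.123456, 2e-5)
logger.debug("Skipped: %s", "never formatted")  # below INFO, so no record is created

print(handler.messages)  # ['Step 10: loss=0.1235, lr=2e-05']
```

Because the template and its arguments are passed separately, the message is only built for records that pass the level check.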
|
| 42 |
-
|
| 43 |
-
## 🧪 Testing
|
| 44 |
-
|
| 45 |
-
Created `test_formatting_fix.py` to verify the fix:
|
| 46 |
-
|
| 47 |
-
```bash
|
| 48 |
-
python test_formatting_fix.py
|
| 49 |
-
```
|
| 50 |
-
|
| 51 |
-
This script tests:
|
| 52 |
-
- ✅ Logging functionality
|
| 53 |
-
- ✅ Module imports
|
| 54 |
-
- ✅ Configuration loading
|
| 55 |
-
- ✅ Monitoring creation
|
| 56 |
-
- ✅ Error handling
|
| 57 |
-
|
| 58 |
-
## 🚀 Usage
|
| 59 |
-
|
| 60 |
-
The fix is now ready to use. You can run your training command again:
|
| 61 |
-
|
| 62 |
-
```bash
|
| 63 |
-
python run_a100_large_experiment.py \
|
| 64 |
-
--config config/train_smollm3_openhermes_fr_a100_balanced.py \
|
| 65 |
-
--trackio_url "https://tonic-test-trackio-test.hf.space" \
|
| 66 |
-
--experiment-name "petit-elle-l-aime-3-balanced" \
|
| 67 |
-
--output-dir ./outputs/balanced | tee trainfr.log
|
| 68 |
-
```
|
| 69 |
-
|
| 70 |
-
## 📋 Key Changes
|
| 71 |
-
|
| 72 |
-
### 1. Monitoring Module (`src/monitoring.py`)
|
| 73 |
-
- Fixed all `logger.info()`, `logger.error()`, `logger.warning()` calls
|
| 74 |
-
- Replaced f-strings with `%` formatting
|
| 75 |
-
- Fixed string concatenation in file paths
|
| 76 |
-
- Fixed HF Datasets integration logging
|
| 77 |
-
|
| 78 |
-
### 2. Trainer Module (`src/trainer.py`)
|
| 79 |
-
- Fixed logging in `SmolLM3Trainer` class
|
| 80 |
-
- Fixed console output formatting
|
| 81 |
-
- Fixed error message formatting
|
| 82 |
-
- Fixed callback logging
|
| 83 |
-
|
| 84 |
-
### 3. Model Module (`src/model.py`)
|
| 85 |
-
- Fixed model loading logging
|
| 86 |
-
- Fixed configuration logging
|
| 87 |
-
- Fixed error reporting
|
| 88 |
-
- Fixed parameter logging
|
| 89 |
-
|
| 90 |
-
### 4. Data Module (`src/data.py`)
|
| 91 |
-
- Fixed dataset loading logging
|
| 92 |
-
- Fixed processing progress logging
|
| 93 |
-
- Fixed error handling
|
| 94 |
-
- Fixed split processing logging
|
| 95 |
-
|
| 96 |
-
## 🔧 Technical Details
|
| 97 |
-
|
| 98 |
-
### Why This Happened
|
| 99 |
-
1. **Mixed Formatting**: Some code used f-strings while others used `%` formatting
|
| 100 |
-
2. **Logging System**: Python's logging system processes format strings differently
|
| 101 |
-
3. **String Processing**: When strings containing `%f` were processed as format strings, it caused conflicts
|
| 102 |
-
|
| 103 |
-
### The Fix
|
| 104 |
-
1. **Standardized Formatting**: All logging now uses `%` placeholders
|
| 105 |
-
2. **Consistent Style**: No more mixing of f-strings and `%` formatting
|
| 106 |
-
3. **Safe Logging**: All logging statements are now safe for the logging system
|
| 107 |
-
|
| 108 |
-
### Benefits
|
| 109 |
-
- ✅ **Eliminates Formatting Errors**: No more "Unknown format code 'f'" errors
|
| 110 |
-
- ✅ **Consistent Code Style**: All logging uses the same format
|
| 111 |
-
- ✅ **Deferred Formatting**: With `%` placeholders, `logging` builds the message only when the record is actually emitted
|
| 112 |
-
- ✅ **Compatibility**: Works with all Python versions and logging configurations
|
| 113 |
-
|
| 114 |
-
## 🎯 Verification
|
| 115 |
-
|
| 116 |
-
To verify the fix works:
|
| 117 |
-
|
| 118 |
-
1. **Run the test script**:
|
| 119 |
-
```bash
|
| 120 |
-
python test_formatting_fix.py
|
| 121 |
-
```
|
| 122 |
-
|
| 123 |
-
2. **Check that all tests pass**:
|
| 124 |
-
- ✅ Logging tests
|
| 125 |
-
- ✅ Import tests
|
| 126 |
-
- ✅ Configuration tests
|
| 127 |
-
- ✅ Monitoring creation tests
|
| 128 |
-
|
| 129 |
-
3. **Run your training command**:
|
| 130 |
-
```bash
|
| 131 |
-
python run_a100_large_experiment.py --config config/train_smollm3_openhermes_fr_a100_balanced.py --trackio_url "https://tonic-test-trackio-test.hf.space" --experiment-name "petit-elle-l-aime-3-balanced" --output-dir ./outputs/balanced
|
| 132 |
-
```
|
| 133 |
-
|
| 134 |
-
## 📝 Notes
|
| 135 |
-
|
| 136 |
-
- The fix maintains all existing functionality
|
| 137 |
-
- No changes to the training logic or configuration
|
| 138 |
-
- All error messages and logging remain informative
|
| 139 |
-
- The fix is backward compatible
|
| 140 |
-
- HF Datasets integration is preserved
|
| 141 |
-
|
| 142 |
-
## 🚨 Prevention
|
| 143 |
-
|
| 144 |
-
To prevent similar issues in the future:
|
| 145 |
-
|
| 146 |
-
1. **Use Consistent Formatting**: Stick to `%` formatting for logging
|
| 147 |
-
2. **Avoid f-strings in Logging**: Don't use f-strings in `logger.info()` calls
|
| 148 |
-
3. **Test Logging**: Always test logging statements during development
|
| 149 |
-
4. **Use Type Hints**: Consider using type hints to catch formatting issues early
|
| 150 |
-
|
| 151 |
-
---
|
| 152 |
-
|
| 153 |
-
**The formatting fix is now complete and ready for use! 🎉**
|
docs/GIT_CONFIGURATION_FIX.md
DELETED
|
@@ -1,257 +0,0 @@
|
|
| 1 |
-
# Git Configuration Fix for Trackio Space Deployment
|
| 2 |
-
|
| 3 |
-
## Issue Identified
|
| 4 |
-
|
| 5 |
-
The Trackio Space deployment was failing with the error:
|
| 6 |
-
```
|
| 7 |
-
❌ Error uploading files: Command '['git', 'commit', '-m', 'Initial Trackio Space setup']' returned non-zero exit status 128.
|
| 8 |
-
```
|
| 9 |
-
|
| 10 |
-
This error occurs because git requires a user identity (email and name) to be configured before making commits. The deployment script was creating a temporary directory and initializing a git repository, but wasn't configuring the git user identity in that temporary directory.
|
| 11 |
-
|
| 12 |
-
## Root Cause
|
| 13 |
-
|
| 14 |
-
### **Problem**: Git Identity Not Configured in Temporary Directory
|
| 15 |
-
|
| 16 |
-
When the deployment script:
|
| 17 |
-
1. Creates a temporary directory
|
| 18 |
-
2. Changes to that directory (`os.chdir(temp_dir)`)
|
| 19 |
-
3. Initializes a git repository (`git init`)
|
| 20 |
-
4. Tries to commit (`git commit`)
|
| 21 |
-
|
| 22 |
-
The git repository in the temporary directory doesn't inherit the git configuration from the main directory, so it has no user identity configured.
|
| 23 |
-
|
| 24 |
-
### **Solution**: Configure Git Identity in Temporary Directory
|
| 25 |
-
|
| 26 |
-
The fix involves explicitly configuring git user identity in the temporary directory before attempting to commit.
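
The idea can be sketched on its own, outside the deployment script. The function name `commit_in_temp_repo` and the sample identity are assumptions for illustration; the sketch also passes `cwd=` to `subprocess.run` instead of calling `os.chdir`, which is a design choice of this example rather than what the script does.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def commit_in_temp_repo(email: str, name: str) -> Optional[str]:
    """Create a throwaway repo, set a repo-local identity, commit, and return the author."""
    if shutil.which("git") is None:
        return None  # git is not installed; nothing to demonstrate

    with tempfile.TemporaryDirectory() as tmp:
        def git(*args: str) -> str:
            done = subprocess.run(
                ["git", *args], cwd=tmp, check=True, capture_output=True, text=True
            )
            return done.stdout.strip()

        git("init")
        # Repo-local identity: without it, `git commit` can fail with exit status 128.
        git("config", "user.email", email)
        git("config", "user.name", name)
        Path(tmp, "README.md").write_text("placeholder\n", encoding="utf-8")
        git("add", "README.md")
        git("commit", "-m", "Initial commit")
        return git("log", "-1", "--format=%an <%ae>")


print(commit_in_temp_repo("demo@example.com", "demo-user"))
```

Using `cwd=` keeps the working directory of the calling process unchanged, so a failure in the middle of the sequence cannot leave the script stranded in the temporary directory.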
|
| 27 |
-
|
| 28 |
-
## Fixes Applied
|
| 29 |
-
|
| 30 |
-
### 1. **Enhanced TrackioSpaceDeployer Constructor**
|
| 31 |
-
|
| 32 |
-
**Before**:
|
| 33 |
-
```python
|
| 34 |
-
def __init__(self, space_name: str, username: str, token: str):
|
| 35 |
-
    self.space_name = space_name
|
| 36 |
-
    self.username = username
|
| 37 |
-
    self.token = token
|
| 38 |
-
```
|
| 39 |
-
|
| 40 |
-
**After**:
|
| 41 |
-
```python
|
| 42 |
-
def __init__(self, space_name: str, username: str, token: str, git_email: str = None, git_name: str = None):
|
| 43 |
-
    self.space_name = space_name
|
| 44 |
-
    self.username = username
|
| 45 |
-
    self.token = token
|
| 46 |
-
|
| 47 |
-
    # Git configuration
|
| 48 |
-
    self.git_email = git_email or f"{username}@huggingface.co"
|
| 49 |
-
    self.git_name = git_name or username
|
| 50 |
-
```
|
| 51 |
-
|
| 52 |
-
### 2. **Git Configuration in upload_files_to_space Method**
|
| 53 |
-
|
| 54 |
-
**Added to the method**:
|
| 55 |
-
```python
|
| 56 |
-
# Configure git user identity for this repository
|
| 57 |
-
try:
|
| 58 |
-
    # Try to get existing git config
|
| 59 |
-
    result = subprocess.run(["git", "config", "--global", "user.email"], capture_output=True, text=True)
|
| 60 |
-
    if result.returncode == 0 and result.stdout.strip():
|
| 61 |
-
        git_email = result.stdout.strip()
|
| 62 |
-
    else:
|
| 63 |
-
        git_email = self.git_email
|
| 64 |
-
|
| 65 |
-
    result = subprocess.run(["git", "config", "--global", "user.name"], capture_output=True, text=True)
|
| 66 |
-
    if result.returncode == 0 and result.stdout.strip():
|
| 67 |
-
        git_name = result.stdout.strip()
|
| 68 |
-
    else:
|
| 69 |
-
        git_name = self.git_name
|
| 70 |
-
|
| 71 |
-
except Exception:
|
| 72 |
-
    # Fallback to default values
|
| 73 |
-
    git_email = self.git_email
|
| 74 |
-
    git_name = self.git_name
|
| 75 |
-
|
| 76 |
-
# Set git config for this repository
|
| 77 |
-
subprocess.run(["git", "config", "user.email", git_email], check=True, capture_output=True)
|
| 78 |
-
subprocess.run(["git", "config", "user.name", git_name], check=True, capture_output=True)
|
| 79 |
-
|
| 80 |
-
print(f"✅ Configured git with email: {git_email}, name: {git_name}")
|
| 81 |
-
```
|
| 82 |
-
|
| 83 |
-
### 3. **Updated Main Function**
|
| 84 |
-
|
| 85 |
-
**Enhanced to accept git configuration**:
|
| 86 |
-
```python
def main():
    # Get user input
    username = input("Enter your Hugging Face username: ").strip()
    space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
    token = input("Enter your Hugging Face token: ").strip()

    # Get git configuration (optional)
    git_email = input("Enter your git email (optional, press Enter for default): ").strip()
    git_name = input("Enter your git name (optional, press Enter for default): ").strip()

    # Create deployer with git config
    deployer = TrackioSpaceDeployer(space_name, username, token, git_email, git_name)
```
|
| 100 |
-
|
| 101 |
-
### 4. **Updated Launch Script**
|
| 102 |
-
|
| 103 |
-
**Enhanced to pass git configuration**:
|
| 104 |
-
```bash
|
| 105 |
-
# Create deployment script input
|
| 106 |
-
cat > deploy_input.txt << EOF
|
| 107 |
-
$HF_USERNAME
|
| 108 |
-
$TRACKIO_SPACE_NAME
|
| 109 |
-
$HF_TOKEN
|
| 110 |
-
$GIT_EMAIL
|
| 111 |
-
$HF_USERNAME
|
| 112 |
-
EOF
|
| 113 |
-
```
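The here-document only works if its lines follow the same order as the prompts in `main()`: username, Space name, token, git email, git name. The sketch below builds the same five-line input in Python. It assumes the file is fed to the deployment script on standard input; `build_deploy_input` and `PROMPT_ORDER` are hypothetical names, not part of the project.

```python
from typing import Optional

# Prompt order used by main() in the deployment script, per the section above.
PROMPT_ORDER = ("username", "space_name", "token", "git_email", "git_name")


def build_deploy_input(
    username: str,
    space_name: str,
    token: str,
    git_email: str = "",
    git_name: Optional[str] = None,
) -> str:
    """Return one answer per line, in the order the prompts are asked."""
    answers = {
        "username": username,
        "space_name": space_name,
        "token": token,
        # An empty line means "press Enter", so the script uses its default.
        "git_email": git_email,
        # The launch script reuses the HF username as the git name.
        "git_name": git_name if git_name is not None else username,
    }
    return "\n".join(answers[key] for key in PROMPT_ORDER) + "\n"


if __name__ == "__main__":
    text = build_deploy_input("alice", "trackio-monitoring", "hf_xxx", "alice@example.com")
    print(text, end="")
```

Keeping the order in a single tuple makes it harder for the launch script and the prompts to drift apart when a new prompt is added.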
|
| 114 |
-
|
| 115 |
-
## Testing the Fix
|
| 116 |
-
|
| 117 |
-
### **Run Git Configuration Tests**
|
| 118 |
-
```bash
|
| 119 |
-
python tests/test_git_config_fix.py
|
| 120 |
-
```
|
| 121 |
-
|
| 122 |
-
Expected output:
|
| 123 |
-
```
|
| 124 |
-
🚀 Testing Git Configuration Fix
|
| 125 |
-
========================================
|
| 126 |
-
🔍 Testing git configuration in temporary directory...
|
| 127 |
-
✅ Created temp directory: /tmp/tmp_xxxxx
|
| 128 |
-
✅ Initialized git repository
|
| 129 |
-
✅ Git email configured correctly
|
| 130 |
-
✅ Git name configured correctly
|
| 131 |
-
✅ Git commit successful
|
| 132 |
-
✅ Cleanup successful
|
| 133 |
-
|
| 134 |
-
🔍 Testing deployment script git configuration...
|
| 135 |
-
✅ Git email set correctly
|
| 136 |
-
✅ Git name set correctly
|
| 137 |
-
|
| 138 |
-
🔍 Testing git configuration fallback...
|
| 139 |
-
✅ Default git email set correctly
|
| 140 |
-
✅ Default git name set correctly
|
| 141 |
-
|
| 142 |
-
🔍 Testing git commit with configuration...
|
| 143 |
-
✅ Created temp directory: /tmp/tmp_xxxxx
|
| 144 |
-
✅ Git commit successful with configuration
|
| 145 |
-
✅ Cleanup successful
|
| 146 |
-
|
| 147 |
-
📊 Test Results: 4/4 tests passed
|
| 148 |
-
✅ All git configuration tests passed! The deployment should work correctly.
|
| 149 |
-
```
|
| 150 |
-
|
| 151 |
-
## Files Modified
|
| 152 |
-
|
| 153 |
-
### **Core Deployment Files**
|
| 154 |
-
1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
|
| 155 |
-
- Enhanced constructor to accept git configuration
|
| 156 |
-
- Added git configuration in upload_files_to_space method
|
| 157 |
-
- Updated main function to accept git parameters
|
| 158 |
-
- Added fallback mechanisms for git configuration
|
| 159 |
-
|
| 160 |
-
### **Launch Script**
|
| 161 |
-
2. **`launch.sh`**
|
| 162 |
-
- Updated to pass git configuration to deployment script
|
| 163 |
-
- Enhanced input file creation with git parameters
|
| 164 |
-
|
| 165 |
-
### **Testing**
|
| 166 |
-
3. **`tests/test_git_config_fix.py`**
|
| 167 |
-
- Comprehensive testing of git configuration
|
| 168 |
-
- Tests for temporary directory git setup
|
| 169 |
-
- Tests for deployment script git handling
|
| 170 |
-
- Tests for fallback behavior
|
| 171 |
-
|
| 172 |
-
## Benefits of the Fix
|
| 173 |
-
|
| 174 |
-
### **1. Reliable Git Commits**
|
| 175 |
-
- Git user identity properly configured in temporary directory
|
| 176 |
-
- No more "exit status 128" errors
|
| 177 |
-
- Successful commits and pushes to Hugging Face Spaces
|
| 178 |
-
|
| 179 |
-
### **2. Flexible Configuration**
|
| 180 |
-
- Accepts custom git email and name
|
| 181 |
-
- Falls back to sensible defaults
|
| 182 |
-
- Works with existing git configuration
|
| 183 |
-
|
| 184 |
-
### **3. Better Error Handling**
|
| 185 |
-
- Graceful fallback to default values
|
| 186 |
-
- Clear error messages and logging
|
| 187 |
-
- Robust configuration validation
|
| 188 |
-
|
| 189 |
-
### **4. Professional Setup**
|
| 190 |
-
- Uses user's actual email address when provided
|
| 191 |
-
- Maintains proper git attribution
|
| 192 |
-
- Follows git best practices
|
| 193 |
-
|
| 194 |
-
## Usage Instructions
|
| 195 |
-
|
| 196 |
-
### **1. Test the Fix**
|
| 197 |
-
```bash
|
| 198 |
-
python tests/test_git_config_fix.py
|
| 199 |
-
```
|
| 200 |
-
|
| 201 |
-
### **2. Deploy with Git Configuration**
|
| 202 |
-
```bash
|
| 203 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 204 |
-
```
|
| 205 |
-
|
| 206 |
-
When prompted:
|
| 207 |
-
- Enter your HF username
|
| 208 |
-
- Enter space name
|
| 209 |
-
- Enter your HF token
|
| 210 |
-
- Enter your git email (or press Enter for default)
|
| 211 |
-
- Enter your git name (or press Enter for default)
|
| 212 |
-
|
| 213 |
-
### **3. Use with Launch Script**
|
| 214 |
-
```bash
|
| 215 |
-
./launch.sh
|
| 216 |
-
```
|
| 217 |
-
|
| 218 |
-
The launch script will automatically pass the git configuration to the deployment script.
|
| 219 |
-
|
| 220 |
-
## Troubleshooting
|
| 221 |
-
|
| 222 |
-
### **Common Issues**
|
| 223 |
-
|
| 224 |
-
#### **1. Git Configuration Still Fails**
|
| 225 |
-
```bash
|
| 226 |
-
# Check if git is properly configured
|
| 227 |
-
git config --list
|
| 228 |
-
|
| 229 |
-
# Set git config manually if needed
|
| 230 |
-
git config --global user.email "you@example.com"
|
| 231 |
-
git config --global user.name "Your Name"
|
| 232 |
-
```
|
| 233 |
-
|
| 234 |
-
#### **2. Permission Issues**
|
| 235 |
-
```bash
|
| 236 |
-
# Check HF token permissions
|
| 237 |
-
hf whoami
|
| 238 |
-
|
| 239 |
-
# Verify token has write access
|
| 240 |
-
hf repo create test-repo --type space
|
| 241 |
-
```
|
| 242 |
-
|
| 243 |
-
#### **3. Space Creation Fails**
|
| 244 |
-
```bash
|
| 245 |
-
# Check if space name is available
|
| 246 |
-
# Try a different space name
|
| 247 |
-
# Verify HF token is valid
|
| 248 |
-
```
|
| 249 |
-
|
| 250 |
-
## Next Steps
|
| 251 |
-
|
| 252 |
-
1. **Test the fix**: Run the git configuration tests
|
| 253 |
-
2. **Deploy a test space**: Use the updated deployment script
|
| 254 |
-
3. **Verify deployment**: Check that the space is created successfully
|
| 255 |
-
4. **Use in production**: Deploy your actual Trackio Space
|
| 256 |
-
|
| 257 |
-
The git configuration fix should resolve the deployment issues and allow successful Trackio Space creation! 🚀
|
docs/GIT_CONFIGURATION_GUIDE.md
DELETED
|
@@ -1,258 +0,0 @@
|
|
| 1 |
-
# Git Configuration Guide for Hugging Face Operations
|
| 2 |
-
|
| 3 |
-
This guide explains the correct way to configure git for Hugging Face Spaces deployment and model pushing operations.
|
| 4 |
-
|
| 5 |
-
## 🎯 **Overview**
|
| 6 |
-
|
| 7 |
-
When working with Hugging Face Spaces and model repositories, proper git configuration is essential for:
|
| 8 |
-
- Creating and deploying Spaces
|
| 9 |
-
- Pushing models to the Hub
|
| 10 |
-
- Managing experiment tracking datasets
|
| 11 |
-
- Ensuring proper authentication
|
| 12 |
-
- **Using the user's actual email address for proper git identity and commit attribution**
|
| 13 |
-
|
| 14 |
-
## ✅ **Correct Git Configuration**
|
| 15 |
-
|
| 16 |
-
### **1. Local vs Global Configuration**
|
| 17 |
-
|
| 18 |
-
**❌ Wrong (Current):**
|
| 19 |
-
```bash
|
| 20 |
-
git config --global user.email "you@example.com"
|
| 21 |
-
git config --global user.name "$HF_USERNAME"
|
| 22 |
-
```
|
| 23 |
-
|
| 24 |
-
**✅ Correct (Updated):**
|
| 25 |
-
```bash
|
| 26 |
-
# Get user's actual email address
|
| 27 |
-
read -p "Enter your email address for git configuration: " GIT_EMAIL
|
| 28 |
-
|
| 29 |
-
# Configure git locally for this project only
|
| 30 |
-
git config user.email "$GIT_EMAIL"
|
| 31 |
-
git config user.name "$HF_USERNAME"
|
| 32 |
-
|
| 33 |
-
# Verify configuration
|
| 34 |
-
git config user.email
|
| 35 |
-
git config user.name
|
| 36 |
-
```
|
| 37 |
-
|
| 38 |
-
### **2. Proper Authentication Setup**
|
| 39 |
-
|
| 40 |
-
**✅ Correct Authentication:**
|
| 41 |
-
```bash
|
| 42 |
-
# Login with token and add to git credentials
|
| 43 |
-
hf login --token "$HF_TOKEN" --add-to-git-credential
|
| 44 |
-
|
| 45 |
-
# Verify login
|
| 46 |
-
hf whoami
|
| 47 |
-
```
|
| 48 |
-
|
| 49 |
-
### **3. Error Handling**
|
| 50 |
-
|
| 51 |
-
**✅ Robust Configuration:**
|
| 52 |
-
```bash
# Get user's email and configure git with error handling
read -p "Enter your email address for git configuration: " GIT_EMAIL

if git config user.email "$GIT_EMAIL" && \
   git config user.name "$HF_USERNAME"; then
    echo "✅ Git configured successfully"
    echo " Email: $(git config user.email)"
    echo " Name: $(git config user.name)"
else
    echo "❌ Failed to configure git"
    exit 1
fi
```
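The same set-then-verify pattern can be sketched in Python for scripts that drive git through `subprocess`. This is an illustrative sketch under stated assumptions: `configure_local_identity` and `demo` are hypothetical helpers, and the example address is a placeholder. It writes the identity to one repository only (no `--global`) and reads it back with `--local`.

```python
import shutil
import subprocess
import tempfile


def configure_local_identity(repo_dir: str, email: str, name: str) -> dict:
    """Set user.email/user.name for one repository only and read them back."""
    for key, value in (("user.email", email), ("user.name", name)):
        # No --global flag: the value is written to <repo_dir>/.git/config
        subprocess.run(["git", "config", key, value], cwd=repo_dir, check=True)
    configured = {}
    for key in ("user.email", "user.name"):
        result = subprocess.run(
            ["git", "config", "--local", key],
            cwd=repo_dir,
            check=True,
            capture_output=True,
            text=True,
        )
        configured[key] = result.stdout.strip()
    return configured


def demo() -> dict:
    """Run the check in a throwaway repository; returns {} if git is missing."""
    if shutil.which("git") is None:
        return {}
    with tempfile.TemporaryDirectory() as repo_dir:
        subprocess.run(["git", "init", "--quiet", repo_dir], check=True)
        return configure_local_identity(repo_dir, "alice@example.com", "alice")


if __name__ == "__main__":
    print(demo())
```

Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, a failed `git config` call stops the script early, which matches the `exit 1` branch of the shell version.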
|
| 66 |
-
|
| 67 |
-
## 🔧 **Why These Changes Matter**
|
| 68 |
-
|
| 69 |
-
### **1. Local Configuration Benefits**
|
| 70 |
-
- **Isolation**: Doesn't affect other projects on the system
|
| 71 |
-
- **Project-specific**: Each project can have different git settings
|
| 72 |
-
- **Cleaner**: No global state pollution
|
| 73 |
-
- **Safer**: Won't interfere with existing git configurations
|
| 74 |
-
|
| 75 |
-
### **2. User's Actual Email Address**
|
| 76 |
-
- **Professional**: Uses the user's real email address
|
| 77 |
-
- **Authentic**: Represents the actual user's identity
|
| 78 |
-
- **Consistent**: Matches the user's Hugging Face account
|
| 79 |
-
- **Best Practice**: Follows git configuration standards
|
| 80 |
-
|
| 81 |
-
### **3. Token-based Authentication**
|
| 82 |
-
- **Secure**: Uses HF token instead of username/password
|
| 83 |
-
- **Automated**: No manual password entry required
|
| 84 |
-
- **Persistent**: Credentials stored securely
|
| 85 |
-
- **Verified**: Includes verification steps
|
| 86 |
-
|
| 87 |
-
## 📋 **Implementation in Launch Script**
|
| 88 |
-
|
| 89 |
-
### **Updated Authentication Step:**
|
| 90 |
-
```bash
# Step 8: Authentication setup
print_step "Step 8: Authentication Setup"
echo "================================"

export HF_TOKEN="$HF_TOKEN"
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"

# Login to Hugging Face with token
print_info "Logging in to Hugging Face..."
if hf login --token "$HF_TOKEN" --add-to-git-credential; then
    print_status "Successfully logged in to Hugging Face"
    print_info "Username: $(hf whoami)"
else
    print_error "Failed to login to Hugging Face"
    print_error "Please check your token and try again"
    exit 1
fi

# Configure git for HF operations
print_step "Step 8.1: Git Configuration"
echo "================================"

print_info "Configuring git for Hugging Face operations..."

# Get user's email for git configuration
get_input "Enter your email address for git configuration" "" GIT_EMAIL

# Configure git locally (not globally) for this project
git config user.email "$GIT_EMAIL"
git config user.name "$HF_USERNAME"

# Verify git configuration
print_info "Verifying git configuration..."
if git config user.email && git config user.name; then
    print_status "Git configured successfully"
    print_info " Email: $(git config user.email)"
    print_info " Name: $(git config user.name)"
else
    print_error "Failed to configure git"
    exit 1
fi
```
|
| 133 |
-
|
| 134 |
-
## 🚀 **Deployment Script Improvements**
|
| 135 |
-
|
| 136 |
-
### **Robust File Upload:**
|
| 137 |
-
```python
def upload_files(self) -> bool:
    """Upload necessary files to the Space"""
    try:
        print("Uploading files to Space...")

        # Files to upload
        files_to_upload = [
            "app.py",
            "requirements_space.txt",
            "README.md"
        ]

        # Check if we're in a git repository
        try:
            subprocess.run(["git", "status"], capture_output=True, check=True)
        except subprocess.CalledProcessError:
            print("⚠️ Not in a git repository, initializing...")
            subprocess.run(["git", "init"], check=True)
            subprocess.run(["git", "remote", "add", "origin", f"https://huggingface.co/spaces/{self.username}/{self.space_name}"], check=True)

        # Add all files at once
        existing_files = [f for f in files_to_upload if os.path.exists(f)]
        if existing_files:
            subprocess.run(["git", "add"] + existing_files, check=True)
            subprocess.run(["git", "commit", "-m", "Initial Space setup"], check=True)

            # Push to the space
            try:
                subprocess.run(["git", "push", "origin", "main"], check=True)
                print(f"✅ Uploaded {len(existing_files)} files")
            except subprocess.CalledProcessError:
                # Try pushing to master branch if main doesn't exist
                subprocess.run(["git", "push", "origin", "master"], check=True)
                print(f"✅ Uploaded {len(existing_files)} files")
        else:
            print("⚠️ No files found to upload")

        return True

    except Exception as e:
        print(f"❌ Error uploading files: {e}")
        return False
```
|
| 181 |
-
|
| 182 |
-
## 🔍 **Troubleshooting**
|
| 183 |
-
|
| 184 |
-
### **Common Issues and Solutions:**
|
| 185 |
-
|
| 186 |
-
#### **1. Git Configuration Fails**
|
| 187 |
-
```bash
|
| 188 |
-
# Check current git config
|
| 189 |
-
git config --list
|
| 190 |
-
|
| 191 |
-
# Reset if needed
|
| 192 |
-
git config --unset user.email
|
| 193 |
-
git config --unset user.name
|
| 194 |
-
|
| 195 |
-
# Reconfigure
|
| 196 |
-
git config user.email "you@example.com"
|
| 197 |
-
git config user.name "your-username"
|
| 198 |
-
```
|
| 199 |
-
|
| 200 |
-
#### **2. Authentication Issues**
|
| 201 |
-
```bash
|
| 202 |
-
# Check HF login status
|
| 203 |
-
hf whoami
|
| 204 |
-
|
| 205 |
-
# Re-login if needed
|
| 206 |
-
hf logout
|
| 207 |
-
hf login --token "your-token"
|
| 208 |
-
```
|
| 209 |
-
|
| 210 |
-
#### **3. Space Deployment Fails**
|
| 211 |
-
```bash
|
| 212 |
-
# Check git remote
|
| 213 |
-
git remote -v
|
| 214 |
-
|
| 215 |
-
# Re-add remote if needed
|
| 216 |
-
git remote remove origin
|
| 217 |
-
git remote add origin https://huggingface.co/spaces/username/space-name
|
| 218 |
-
```
|
| 219 |
-
|
| 220 |
-
## 📚 **Best Practices**
|
| 221 |
-
|
| 222 |
-
### **1. Always Use Local Configuration**
|
| 223 |
-
- Use `git config` without `--global` flag
|
| 224 |
-
- Keeps project configurations isolated
|
| 225 |
-
- Prevents conflicts with other projects
|
| 226 |
-
|
| 227 |
-
### **2. Verify Configuration**
|
| 228 |
-
- Always check that git config was successful
|
| 229 |
-
- Display configured values for verification
|
| 230 |
-
- Exit on failure to prevent downstream issues
|
| 231 |
-
|
| 232 |
-
### **3. Use Token-based Authentication**
|
| 233 |
-
- More secure than username/password
|
| 234 |
-
- Automatically handles credential storage
|
| 235 |
-
- Works well with CI/CD systems
|
| 236 |
-
|
| 237 |
-
### **4. Handle Errors Gracefully**
|
| 238 |
-
- Check return codes from git commands
|
| 239 |
-
- Provide clear error messages
|
| 240 |
-
- Exit early on critical failures
|
| 241 |
-
|
| 242 |
-
### **5. Test Configuration**
|
| 243 |
-
- Verify git config after setting it
|
| 244 |
-
- Test HF login before proceeding
|
| 245 |
-
- Validate remote repository access
|
| 246 |
-
|
| 247 |
-
## 🎯 **Summary**
|
| 248 |
-
|
| 249 |
-
The updated git configuration approach provides:
|
| 250 |
-
|
| 251 |
-
1. **✅ Better Isolation**: Local configuration doesn't affect system-wide settings
|
| 252 |
-
2. **✅ User's Actual Email**: Uses the user's real email address for proper git identity
|
| 253 |
-
3. **✅ Proper Authentication**: Token-based login with credential storage
|
| 254 |
-
4. **✅ Error Handling**: Robust verification and error reporting
|
| 255 |
-
5. **✅ Professional Setup**: Uses user's actual email and verification
|
| 256 |
-
6. **✅ Deployment Reliability**: Improved Space deployment with git repository handling
|
| 257 |
-
|
| 258 |
-
This ensures a more reliable and professional setup for Hugging Face operations in the SmolLM3 fine-tuning pipeline.
|
docs/H100_LIGHTWEIGHT_GUIDE.md
DELETED
|
@@ -1,276 +0,0 @@
|
|
| 1 |
-
# H100 Lightweight Training Configuration Guide
|
| 2 |
-
|
| 3 |
-
This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
|
| 4 |
-
|
| 5 |
-
## 🎯 Overview
|
| 6 |
-
|
| 7 |
-
The H100 Lightweight configuration is designed for:
|
| 8 |
-
- **Rapid experimentation** on H100 GPUs
|
| 9 |
-
- **Efficient training** with 80K carefully selected samples
|
| 10 |
-
- **Quick iteration** for research and development
|
| 11 |
-
- **Cost-effective** training sessions
|
| 12 |
-
|
| 13 |
-
## 🚀 Key Features
|
| 14 |
-
|
| 15 |
-
### **Optimized for H100**
|
| 16 |
-
- **Batch Size**: 16 (larger than A100 configs)
|
| 17 |
-
- **Gradient Accumulation**: 4 (reduced for faster updates)
|
| 18 |
-
- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
|
| 19 |
-
- **Sequence Length**: 8192 (full context window)
|
| 20 |
-
|
| 21 |
-
### **Dataset Sampling**
|
| 22 |
-
- **Source**: OpenHermes-FR dataset
|
| 23 |
-
- **Sample Size**: 80,000 random samples
|
| 24 |
-
- **Validation**: 1,000 samples (if available)
|
| 25 |
-
- **Reproducibility**: Fixed random seed (42)
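The sampling rules above (80,000 random training samples, up to 1,000 validation samples, seed 42) can be sketched with the standard library alone. This is a sketch under assumptions, not the pipeline's implementation: `sample_split_indices` is a hypothetical helper, and it assumes the validation samples are drawn from whatever remains after the training draw.

```python
import random
from typing import List, Tuple


def sample_split_indices(
    dataset_size: int,
    train_size: int = 80_000,
    val_size: int = 1_000,
    seed: int = 42,
) -> Tuple[List[int], List[int]]:
    """Draw disjoint train/validation index lists with a fixed seed."""
    train_size = min(train_size, dataset_size)
    # "if available": validation only gets what is left after the training draw
    val_size = min(val_size, dataset_size - train_size)
    rng = random.Random(seed)
    # One draw without replacement guarantees the two splits never overlap.
    picked = rng.sample(range(dataset_size), train_size + val_size)
    return picked[:train_size], picked[train_size:]


if __name__ == "__main__":
    train_idx, val_idx = sample_split_indices(dataset_size=200_000)
    print(len(train_idx), len(val_idx))
```

With a Hugging Face `datasets.Dataset`, index lists like these could be passed to `Dataset.select` to materialize the two subsets.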
|
| 26 |
-
|
| 27 |
-
### **Training Optimizations**
|
| 28 |
-
- **Warmup Steps**: 50 (reduced for rapid training)
|
| 29 |
-
- **Evaluation**: Every 50 steps
|
| 30 |
-
- **Logging**: Every 5 steps
|
| 31 |
-
- **Saving**: Every 200 steps
|
| 32 |
-
- **Checkpoints**: Keep only 2 (save storage)
|
| 33 |
-
|
| 34 |
-
## 📊 Configuration Details
|
| 35 |
-
|
| 36 |
-
### **Model Configuration**
|
| 37 |
-
```python
|
| 38 |
-
model_name="HuggingFaceTB/SmolLM3-3B"
|
| 39 |
-
max_seq_length=8192
|
| 40 |
-
use_flash_attention=True
|
| 41 |
-
use_gradient_checkpointing=True
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
### **Training Parameters**
|
| 45 |
-
```python
|
| 46 |
-
batch_size=16
|
| 47 |
-
gradient_accumulation_steps=4
|
| 48 |
-
learning_rate=8e-6
|
| 49 |
-
warmup_steps=50
|
| 50 |
-
max_epochs=1
|
| 51 |
-
```
|
| 52 |
-
|
| 53 |
-
### **H100-Specific Optimizations**
|
| 54 |
-
```python
|
| 55 |
-
dataloader_num_workers=4
|
| 56 |
-
dataloader_pin_memory=True
|
| 57 |
-
gradient_clipping=1.0
|
| 58 |
-
group_by_length=True
|
| 59 |
-
pad_to_multiple_of=8
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
### **Memory Optimizations**
|
| 63 |
-
```python
|
| 64 |
-
save_total_limit=2
|
| 65 |
-
early_stopping_patience=3
|
| 66 |
-
max_grad_norm=1.0
|
| 67 |
-
warmup_ratio=0.1
|
| 68 |
-
```
|
| 69 |
-
|
| 70 |
-
## 🔧 Usage
|
| 71 |
-
|
| 72 |
-
### **Interactive Selection**
|
| 73 |
-
```bash
|
| 74 |
-
./launch.sh
|
| 75 |
-
# Select "H100 Lightweight (Rapid)" when prompted
|
| 76 |
-
```
|
| 77 |
-
|
| 78 |
-
### **Expected Training Time**
|
| 79 |
-
- **H100**: ~2-4 hours (depending on hardware)
|
| 80 |
-
- **A100**: ~4-6 hours
|
| 81 |
-
- **V100**: ~6-8 hours
|
| 82 |
-
|
| 83 |
-
### **Memory Requirements**
|
| 84 |
-
- **GPU Memory**: 40GB+ (H100 recommended)
|
| 85 |
-
- **System RAM**: 32GB+
|
| 86 |
-
- **Storage**: 50GB+ for dataset and checkpoints
|
| 87 |
-
|
| 88 |
-
## 📈 Performance Characteristics
|
| 89 |
-
|
| 90 |
-
### **Training Speed**
|
| 91 |
-
- **Steps per Second**: ~2-3 (on H100)
|
| 92 |
-
- **Samples per Second**: ~32-48
|
| 93 |
-
- **Effective Batch Size**: 64 (16 × 4)
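The effective batch size also determines how many optimizer updates one epoch contains. The sketch below works through the arithmetic for the 80,000-sample configuration. It assumes a single GPU and that the last partial batch is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math
from typing import Tuple


def steps_per_epoch(
    num_samples: int,
    batch_size: int,
    gradient_accumulation_steps: int,
) -> Tuple[int, int]:
    """Return (micro-batches, optimizer steps) for one pass over the data."""
    micro_batches = math.ceil(num_samples / batch_size)
    optimizer_steps = math.ceil(micro_batches / gradient_accumulation_steps)
    return micro_batches, optimizer_steps


if __name__ == "__main__":
    batch_size, accumulation = 16, 4
    micro, updates = steps_per_epoch(80_000, batch_size, accumulation)
    print("effective batch size:", batch_size * accumulation)  # 64
    print("micro-batches per epoch:", micro)                   # 5000
    print("optimizer steps per epoch:", updates)               # 1250
```

Under these assumptions one epoch is 5,000 forward/backward passes but only 1,250 weight updates, so step-based settings such as warmup or evaluation intervals mean different things depending on which of the two counts a tool reports.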
|
| 94 |
-
|
| 95 |
-
### **Convergence**
|
| 96 |
-
- **Expected Loss**: 1.2-1.8 (after 1 epoch)
|
| 97 |
-
- **Evaluation Frequency**: Every 50 steps
|
| 98 |
-
- **Early Stopping**: After 3 evaluations without improvement
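The early-stopping rule can be made concrete with a small patience counter. This is a simplified sketch: `early_stop_step` is a hypothetical function, it treats any evaluation loss that is not strictly lower than the best so far as "no improvement", and it ignores the improvement threshold that trainer callbacks typically also offer.

```python
from typing import Iterable, Optional


def early_stop_step(
    eval_losses: Iterable[float],
    patience: int = 3,
    eval_every: int = 50,
) -> Optional[int]:
    """Return the training step at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for eval_index, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # New best: reset the patience counter
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return eval_index * eval_every
    return None


if __name__ == "__main__":
    losses = [2.0, 1.8, 1.7, 1.75, 1.72, 1.71]
    print(early_stop_step(losses))
```

In the example, the best loss (1.7) is reached at the third evaluation, and the next three evaluations fail to beat it, so training would stop at the sixth evaluation, that is, step 300 with evaluation every 50 steps.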
|
| 99 |
-
|
| 100 |
-
### **Dataset Efficiency**
|
| 101 |
-
- **80K samples**: ~1.3% of full OpenHermes-FR
|
| 102 |
-
- **Random sampling**: Ensures diversity
|
| 103 |
-
- **Fixed seed**: Reproducible results
|
| 104 |
-
|
| 105 |
-
## 🎯 Use Cases
|
| 106 |
-
|
| 107 |
-
### **Perfect For**
|
| 108 |
-
- **Rapid prototyping** of new ideas
|
| 109 |
-
- **Hyperparameter tuning** experiments
|
| 110 |
-
- **Model comparison** studies
|
| 111 |
-
- **Research validation** before full training
|
| 112 |
-
- **Educational purposes** and learning
|
| 113 |
-
|
| 114 |
-
### **Not Recommended For**
|
| 115 |
-
- **Production models** (use Multiple Passes instead)
|
| 116 |
-
- **Competition submissions** (use full dataset)
|
| 117 |
-
- **Research papers** (use complete training)
|
| 118 |
-
|
| 119 |
-
## 🔄 Comparison with Other Configurations
|
| 120 |
-
|
| 121 |
-
| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---------------|--------------|------------|--------|---------------|----------|
| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
|
| 127 |
-
|
| 128 |
-
## 🛠️ Customization
|
| 129 |
-
|
| 130 |
-
### **Modifying Sample Size**
|
| 131 |
-
```bash
|
| 132 |
-
# In the launch script, you can modify:
|
| 133 |
-
DATASET_SAMPLE_SIZE=50000 # For 50K samples
|
| 134 |
-
DATASET_SAMPLE_SIZE=100000 # For 100K samples
|
| 135 |
-
```
|
| 136 |
-
|
| 137 |
-
### **Adjusting Training Parameters**
|
| 138 |
-
```python
|
| 139 |
-
# Modify in config/train_smollm3_h100_lightweight.py:
|
| 140 |
-
batch_size=12 # Smaller batch size
|
| 141 |
-
learning_rate=6e-6 # Lower learning rate
|
| 142 |
-
warmup_steps=100 # More warmup steps
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
### **Changing Dataset**
|
| 146 |
-
```python
|
| 147 |
-
# Modify the dataset name in the configuration:
|
| 148 |
-
dataset_name="your-custom-dataset"
|
| 149 |
-
```
|
| 150 |
-
|
| 151 |
-
## 📊 Monitoring and Results
|
| 152 |
-
|
| 153 |
-
### **Trackio Integration**
|
| 154 |
-
- **Real-time metrics**: Loss, learning rate, gradient norm
|
| 155 |
-
- **Training curves**: Visual progress tracking
|
| 156 |
-
- **Resource usage**: GPU utilization, memory consumption
|
| 157 |
-
- **Artifacts**: Model checkpoints, logs
|
| 158 |
-
|
| 159 |
-
### **Expected Metrics**
|
| 160 |
-
- **Training Loss**: Starts ~3.0, ends ~1.5
|
| 161 |
-
- **Validation Loss**: Should be close to training loss
|
| 162 |
-
- **Learning Rate**: Cosine decay from 8e-6 to 2e-6
|
| 163 |
-
- **Gradient Norm**: Should stay below 1.0
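The learning-rate curve described above (warmup, then cosine decay from 8e-6 to 2e-6) can be written out as a formula. The sketch below is one common way to implement it, not necessarily the scheduler this pipeline uses: it assumes linear warmup over the 50 warmup steps, and `lr_at` and the `total_steps=1250` example value (80,000 samples at an effective batch size of 64) are assumptions made for illustration.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    warmup_steps: int = 50,
    peak_lr: float = 8e-6,
    min_lr: float = 2e-6,
) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # cos goes from 1 to -1, so the bracket goes from 1 to 0
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 50, 650, 1250):
        print(step, lr_at(step, total_steps=1250))
```

Halfway through the decay phase the rate sits at the midpoint of the two bounds (5e-6), which is a quick sanity check when reading the learning-rate curve in the monitoring dashboard.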
|
| 164 |
-
|
| 165 |
-
### **Success Indicators**
|
| 166 |
-
- **Converging loss**: Steady decrease over time
|
| 167 |
-
- **Stable gradients**: Consistent gradient norms
|
| 168 |
-
- **Good validation**: Validation loss follows training loss
|
| 169 |
-
- **No overfitting**: Validation loss doesn't increase
|
| 170 |
-
|
| 171 |
-
## 🚨 Troubleshooting
|
| 172 |
-
|
| 173 |
-
### **Common Issues**
|
| 174 |
-
|
| 175 |
-
#### **Out of Memory (OOM)**
|
| 176 |
-
```python
|
| 177 |
-
# Reduce batch size in config:
|
| 178 |
-
batch_size=12 # Instead of 16
|
| 179 |
-
gradient_accumulation_steps=6 # Instead of 4
|
| 180 |
-
```
|
| 181 |
-
|
| 182 |
-
#### **Slow Training**
|
| 183 |
-
```bash
|
| 184 |
-
# Check GPU utilization:
|
| 185 |
-
nvidia-smi
|
| 186 |
-
# Ensure CUDA is properly installed
|
| 187 |
-
python -c "import torch; print(torch.cuda.is_available())"
|
| 188 |
-
```
|
| 189 |
-
|
| 190 |
-
#### **Poor Convergence**
|
| 191 |
-
```python
|
| 192 |
-
# Try different learning rate:
|
| 193 |
-
learning_rate=6e-6 # Instead of 8e-6
|
| 194 |
-
# Or increase warmup:
|
| 195 |
-
warmup_steps=100 # Instead of 50
|
| 196 |
-
```
|
| 197 |
-
|
| 198 |
-
#### **Dataset Issues**
|
| 199 |
-
```bash
|
| 200 |
-
# Check dataset loading:
|
| 201 |
-
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
|
| 202 |
-
```
|
| 203 |
-
|
| 204 |
-
### **Performance Tips**
|
| 205 |
-
|
| 206 |
-
1. **Use H100 if available**: Significantly faster than A100
|
| 207 |
-
2. **Monitor GPU memory**: Keep utilization below 90%
|
| 208 |
-
3. **Check logs regularly**: Look for convergence issues
|
| 209 |
-
4. **Save checkpoints**: Don't lose progress
|
| 210 |
-
5. **Use early stopping**: Prevent overfitting
|
| 211 |
-
|
| 212 |
-
## 📋 Example Workflow
|
| 213 |
-
|
| 214 |
-
### **Complete H100 Lightweight Training**
|
| 215 |
-
```bash
|
| 216 |
-
# 1. Setup
|
| 217 |
-
python setup_launch.py
|
| 218 |
-
|
| 219 |
-
# 2. Check requirements
|
| 220 |
-
python check_requirements.py
|
| 221 |
-
|
| 222 |
-
# 3. Run interactive pipeline
|
| 223 |
-
./launch.sh
|
| 224 |
-
|
| 225 |
-
# 4. Select configuration
|
| 226 |
-
# Choose: "H100 Lightweight (Rapid)"
|
| 227 |
-
|
| 228 |
-
# 5. Monitor training
|
| 229 |
-
# Watch Trackio Space for real-time progress
|
| 230 |
-
|
| 231 |
-
# 6. Check results
|
| 232 |
-
# Model will be pushed to HF Hub
|
| 233 |
-
# Summary in training_summary.md
|
| 234 |
-
```
|
| 235 |
-
|
| 236 |
-
### **Expected Output**
|
| 237 |
-
```
|
| 238 |
-
✅ Dataset prepared: 80000 train samples, 1000 validation samples
|
| 239 |
-
📈 Training started with 5000 total steps
|
| 240 |
-
⏱️ Estimated time: 2-4 hours
|
| 241 |
-
📊 Monitor progress at: https://huggingface.co/spaces/...
|
| 242 |
-
```
|
| 243 |
-
|
| 244 |
-
## 🎉 Benefits
|
| 245 |
-
|
| 246 |
-
### **Speed**
|
| 247 |
-
- **3-4x faster** than full dataset training
|
| 248 |
-
- **Rapid iteration** for research
|
| 249 |
-
- **Quick validation** of ideas
|
| 250 |
-
|
| 251 |
-
### **Efficiency**
|
| 252 |
-
- **Reduced costs** (less GPU time)
|
| 253 |
-
- **Lower storage** requirements
|
| 254 |
-
- **Faster experimentation** cycle
|
| 255 |
-
|
| 256 |
-
### **Quality**
|
| 257 |
-
- **Still high quality** results
|
| 258 |
-
- **Good for prototyping**
|
| 259 |
-
- **Suitable for many use cases**
|
| 260 |
-
|
| 261 |
-
## 🔮 Future Enhancements
|
| 262 |
-
|
| 263 |
-
### **Planned Improvements**
|
| 264 |
-
- **Adaptive sampling**: Smart dataset selection
|
| 265 |
-
- **Multi-GPU support**: Distributed training
|
| 266 |
-
- **Advanced monitoring**: More detailed metrics
|
| 267 |
-
- **Auto-tuning**: Automatic hyperparameter optimization
|
| 268 |
-
|
| 269 |
-
### **Extensibility**
|
| 270 |
-
- **Custom datasets**: Easy integration
|
| 271 |
-
- **Different models**: Support for other architectures
|
| 272 |
-
- **Advanced sampling**: Stratified, balanced sampling
|
| 273 |
-
|
| 274 |
-
---
|
| 275 |
-
|
| 276 |
-
**Happy Rapid Training on H100! 🚀**
|
docs/HF_DATASETS_GUIDE.md
DELETED
|
@@ -1,269 +0,0 @@
|
|
| 1 |
-
# 🚀 Trackio with Hugging Face Datasets - Complete Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.
|
| 6 |
-
|
| 7 |
-
## 🏗️ Architecture
|
| 8 |
-
|
| 9 |
-
### Why HF Datasets?
|
| 10 |
-
|
| 11 |
-
1. **Persistent Storage**: Data survives Space restarts and redeployments
|
| 12 |
-
2. **Version Control**: Automatic versioning of experiment data
|
| 13 |
-
3. **Access Control**: Private datasets for security
|
| 14 |
-
4. **Reliability**: HF's infrastructure ensures data availability
|
| 15 |
-
5. **Scalability**: Handles large amounts of experiment data
|
| 16 |
-
|
| 17 |
-
### Data Flow
|
| 18 |
-
|
| 19 |
-
```
|
| 20 |
-
Training Script → Trackio App → HF Dataset → Trackio App → Plots
|
| 21 |
-
```
|
| 22 |
-
|
| 23 |
-
## 🚀 Setup Instructions
|
| 24 |
-
|
| 25 |
-
### 1. Create HF Token
|
| 26 |
-
|
| 27 |
-
1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
|
| 28 |
-
2. Create a new token with `write` permissions
|
| 29 |
-
3. Copy the token for use in your Space
|
| 30 |
-
|
| 31 |
-
### 2. Set Up Dataset Repository
|
| 32 |
-
|
| 33 |
-
```bash
|
| 34 |
-
# Run the setup script
|
| 35 |
-
python setup_hf_dataset.py
|
| 36 |
-
```
|
| 37 |
-
|
| 38 |
-
This will:
|
| 39 |
-
- Create a private dataset: `tonic/trackio-experiments`
|
| 40 |
-
- Add your existing experiments
|
| 41 |
-
- Configure the dataset for Trackio
|
| 42 |
-
|
| 43 |
-
### 3. Configure Hugging Face Space
|
| 44 |
-
|
| 45 |
-
#### Environment Variables
|
| 46 |
-
Set these in your HF Space settings:
|
| 47 |
-
```bash
|
| 48 |
-
HF_TOKEN=your_hf_token_here
|
| 49 |
-
TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 50 |
-
```
|
| 51 |
-
|
| 52 |
-
**Environment Variables Explained:**
|
| 53 |
-
- `HF_TOKEN`: Your Hugging Face token (required for dataset access)
|
| 54 |
-
- `TRACKIO_DATASET_REPO`: Dataset repository to use (optional, defaults to `tonic/trackio-experiments`)
|
| 55 |
-
|
| 56 |
-
**Example Configurations:**
|
| 57 |
-
```bash
|
| 58 |
-
# Use default dataset
|
| 59 |
-
HF_TOKEN=your_token_here
|
| 60 |
-
|
| 61 |
-
# Use personal dataset
|
| 62 |
-
HF_TOKEN=your_token_here
|
| 63 |
-
TRACKIO_DATASET_REPO=your-username/trackio-experiments
|
| 64 |
-
|
| 65 |
-
# Use team dataset
|
| 66 |
-
HF_TOKEN=your_token_here
|
| 67 |
-
TRACKIO_DATASET_REPO=your-org/team-experiments
|
| 68 |
-
|
| 69 |
-
# Use project-specific dataset
|
| 70 |
-
HF_TOKEN=your_token_here
|
| 71 |
-
TRACKIO_DATASET_REPO=your-username/smollm3-experiments
|
| 72 |
-
```
|
| 73 |
-
|
| 74 |
-
#### Requirements
|
| 75 |
-
Update your `requirements.txt`:
|
| 76 |
-
```txt
|
| 77 |
-
gradio>=4.0.0
|
| 78 |
-
plotly>=5.0.0
|
| 79 |
-
pandas>=1.5.0
|
| 80 |
-
numpy>=1.24.0
|
| 81 |
-
datasets>=2.14.0
|
| 82 |
-
huggingface-hub>=0.16.0
|
| 83 |
-
requests>=2.31.0
|
| 84 |
-
```
|
| 85 |
-
|
| 86 |
-
### 4. Deploy Updated App
|
| 87 |
-
|
| 88 |
-
The updated `app.py` now:
|
| 89 |
-
- Loads experiments from HF Dataset
|
| 90 |
-
- Saves new experiments to the dataset
|
| 91 |
-
- Falls back to backup data if the dataset is unavailable
|
| 92 |
-
- Provides better error handling
|
| 93 |
-
|
| 94 |
-
### 5. Configure Environment Variables
|
| 95 |
-
|
| 96 |
-
Use the configuration script to check your setup:
|
| 97 |
-
|
| 98 |
-
```bash
|
| 99 |
-
python configure_trackio.py
|
| 100 |
-
```
|
| 101 |
-
|
| 102 |
-
This script will:
|
| 103 |
-
- Show current environment variables
|
| 104 |
-
- Test dataset access
|
| 105 |
-
- Generate configuration file
|
| 106 |
-
- Provide usage examples
|
| 107 |
-
|
| 108 |
-
**Available Environment Variables:**
|
| 109 |
-
|
| 110 |
-
| Variable | Required | Default | Description |
|
| 111 |
-
|----------|----------|---------|-------------|
|
| 112 |
-
| `HF_TOKEN` | Yes | None | Your Hugging Face token |
|
| 113 |
-
| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
|
| 114 |
-
| `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |
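
As a minimal sketch of how an app might read the variables in the table above: the helper name `load_trackio_settings` is hypothetical (it is not taken from the original scripts), and only the variable names and the default repository come from the table.

```python
import os

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"  # default listed in the table above


def load_trackio_settings(env=None):
    """Hypothetical helper: collect Trackio settings from environment variables."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is listed as required, so fail early with a clear message.
        raise RuntimeError("HF_TOKEN is not set")
    return {
        "hf_token": token,
        # Fall back to the documented default when the variable is unset or empty.
        "dataset_repo": env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        # Only present when running inside a Hugging Face Space.
        "space_id": env.get("SPACE_ID"),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real environment.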
|
| 115 |
-
|
| 116 |
-
## 📊 Dataset Schema
|
| 117 |
-
|
| 118 |
-
The HF Dataset contains these columns:
|
| 119 |
-
|
| 120 |
-
| Column | Type | Description |
|
| 121 |
-
|--------|------|-------------|
|
| 122 |
-
| `experiment_id` | string | Unique experiment identifier |
|
| 123 |
-
| `name` | string | Experiment name |
|
| 124 |
-
| `description` | string | Experiment description |
|
| 125 |
-
| `created_at` | string | ISO timestamp |
|
| 126 |
-
| `status` | string | running/completed/failed |
|
| 127 |
-
| `metrics` | string | JSON array of metric entries |
|
| 128 |
-
| `parameters` | string | JSON object of experiment parameters |
|
| 129 |
-
| `artifacts` | string | JSON array of artifacts |
|
| 130 |
-
| `logs` | string | JSON array of log entries |
|
| 131 |
-
| `last_updated` | string | ISO timestamp of last update |
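
The schema above stores nested fields as JSON strings. A rough sketch of the conversion in both directions follows; `experiment_to_row` and `row_to_experiment` are hypothetical helper names, and the fallback values for missing fields are assumptions rather than documented behavior.

```python
import json
from datetime import datetime, timezone

JSON_COLUMNS = ("metrics", "parameters", "artifacts", "logs")


def experiment_to_row(exp_id, exp):
    """Hypothetical helper: flatten one experiment dict into the column layout above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "experiment_id": exp_id,
        "name": exp.get("name", ""),
        "description": exp.get("description", ""),
        "created_at": exp.get("created_at", now),
        "status": exp.get("status", "running"),
        # Nested fields are serialized so every column stays a plain string.
        "metrics": json.dumps(exp.get("metrics", [])),
        "parameters": json.dumps(exp.get("parameters", {})),
        "artifacts": json.dumps(exp.get("artifacts", [])),
        "logs": json.dumps(exp.get("logs", [])),
        "last_updated": now,
    }


def row_to_experiment(row):
    """Inverse of experiment_to_row: decode the JSON-encoded columns."""
    exp = dict(row)
    for key in JSON_COLUMNS:
        exp[key] = json.loads(row[key])
    return exp
```

Keeping every column a string avoids schema conflicts when different experiments log different metric keys.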
|
| 132 |
-
|
| 133 |
-
## 🔧 Technical Details
|
| 134 |
-
|
| 135 |
-
### Loading Experiments
|
| 136 |
-
|
| 137 |
-
```python
|
| 138 |
-
import json
from datasets import load_dataset
|
| 139 |
-
|
| 140 |
-
# Load from HF Dataset
|
| 141 |
-
dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)
|
| 142 |
-
|
| 143 |
-
# Convert to experiments dict
|
| 144 |
-
for row in dataset['train']:
|
| 145 |
-
    experiment = {
|
| 146 |
-
        'id': row['experiment_id'],
|
| 147 |
-
        'metrics': json.loads(row['metrics']),
|
| 148 |
-
        'parameters': json.loads(row['parameters']),
|
| 149 |
-
        # ... other fields
|
| 150 |
-
    }
|
| 151 |
-
```
|
| 152 |
-
|
| 153 |
-
### Saving Experiments
|
| 154 |
-
|
| 155 |
-
```python
|
| 156 |
-
import json
from datasets import Dataset
|
| 157 |
-
from huggingface_hub import HfApi
|
| 158 |
-
|
| 159 |
-
# Convert experiments to dataset format
|
| 160 |
-
dataset_data = []
|
| 161 |
-
for exp_id, exp_data in experiments.items():
|
| 162 |
-
    dataset_data.append({
|
| 163 |
-
        'experiment_id': exp_id,
|
| 164 |
-
        'metrics': json.dumps(exp_data['metrics']),
|
| 165 |
-
        'parameters': json.dumps(exp_data['parameters']),
|
| 166 |
-
        # ... other fields
|
| 167 |
-
    })
|
| 168 |
-
|
| 169 |
-
# Push to HF Hub
|
| 170 |
-
dataset = Dataset.from_list(dataset_data)
|
| 171 |
-
dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
|
| 172 |
-
```
|
| 173 |
-
|
| 174 |
-
## 📈 Your Current Experiments
|
| 175 |
-
|
| 176 |
-
### Available Experiments
|
| 177 |
-
|
| 178 |
-
1. **`exp_20250720_130853`** (petite-elle-l-aime-3)
|
| 179 |
-
- 4 metric entries (steps 25, 50, 75, 100)
|
| 180 |
-
- Loss decreasing: 1.1659 → 1.1528
|
| 181 |
-
- Good convergence pattern
|
| 182 |
-
|
| 183 |
-
2. **`exp_20250720_134319`** (petite-elle-l-aime-3-1)
|
| 184 |
-
- 2 metric entries (step 25)
|
| 185 |
-
- Loss: 1.166
|
| 186 |
-
- GPU memory tracking
|
| 187 |
-
|
| 188 |
-
### Metrics Available for Plotting
|
| 189 |
-
|
| 190 |
-
- `loss` - Training loss curve
|
| 191 |
-
- `learning_rate` - Learning rate schedule
|
| 192 |
-
- `mean_token_accuracy` - Token-level accuracy
|
| 193 |
-
- `grad_norm` - Gradient norm
|
| 194 |
-
- `num_tokens` - Tokens processed
|
| 195 |
-
- `epoch` - Training epoch
|
| 196 |
-
- `gpu_0_memory_allocated` - GPU memory usage
|
| 197 |
-
- `cpu_percent` - CPU usage
|
| 198 |
-
- `memory_percent` - System memory
|
| 199 |
-
|
| 200 |
-
## 🎯 Usage Instructions
|
| 201 |
-
|
| 202 |
-
### 1. View Experiments
|
| 203 |
-
- Go to "View Experiments" tab
|
| 204 |
-
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
|
| 205 |
-
- Click "View Experiment"
|
| 206 |
-
|
| 207 |
-
### 2. Create Plots
|
| 208 |
-
- Go to "Visualizations" tab
|
| 209 |
-
- Enter experiment ID
|
| 210 |
-
- Select metric to plot
|
| 211 |
-
- Click "Create Plot"
|
| 212 |
-
|
| 213 |
-
### 3. Compare Experiments
|
| 214 |
-
- Use "Experiment Comparison" feature
|
| 215 |
-
- Enter: `exp_20250720_130853,exp_20250720_134319`
|
| 216 |
-
- Compare loss curves
|
| 217 |
-
|
| 218 |
-
## 🔍 Troubleshooting
|
| 219 |
-
|
| 220 |
-
### Issue: "No metrics data available"
|
| 221 |
-
**Solutions**:
|
| 222 |
-
1. Check HF_TOKEN is set correctly
|
| 223 |
-
2. Verify dataset repository exists
|
| 224 |
-
3. Check network connectivity to HF Hub
|
| 225 |
-
|
| 226 |
-
### Issue: "Failed to load from dataset"
|
| 227 |
-
**Solutions**:
|
| 228 |
-
1. App falls back to backup data automatically
|
| 229 |
-
2. Check dataset permissions
|
| 230 |
-
3. Verify token has read access
|
| 231 |
-
|
| 232 |
-
### Issue: "Failed to save experiments"
|
| 233 |
-
**Solutions**:
|
| 234 |
-
1. Check token has write permissions
|
| 235 |
-
2. Verify dataset repository exists
|
| 236 |
-
3. Check network connectivity
|
| 237 |
-
|
| 238 |
-
## 🚀 Benefits of This Approach
|
| 239 |
-
|
| 240 |
-
### ✅ Advantages
|
| 241 |
-
- **Persistent**: Data survives Space restarts
|
| 242 |
-
- **Reliable**: HF's infrastructure ensures availability
|
| 243 |
-
- **Secure**: Private datasets protect your data
|
| 244 |
-
- **Scalable**: Handles large amounts of experiment data
|
| 245 |
-
- **Versioned**: Automatic versioning of experiment data
|
| 246 |
-
|
| 247 |
-
### 🔄 Fallback Strategy
|
| 248 |
-
1. **Primary**: Load from HF Dataset
|
| 249 |
-
2. **Secondary**: Use backup data (your existing experiments)
|
| 250 |
-
3. **Tertiary**: Create new experiments locally
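
The three-step fallback above can be sketched roughly as follows. The function name, its arguments, and the returned source label are all hypothetical; the actual `app.py` logic is not shown in this guide.

```python
def load_experiments(load_from_dataset, backup_experiments):
    """Hypothetical sketch of the fallback order described above.

    load_from_dataset: callable returning a dict of experiments (may raise).
    backup_experiments: dict of previously known experiments (may be empty).
    Returns (experiments, source), where source names the step that succeeded.
    """
    try:
        experiments = load_from_dataset()
        if experiments:
            return experiments, "dataset"  # primary: HF Dataset
    except Exception as exc:  # network, auth, or missing-repo errors
        print(f"Failed to load from dataset: {exc}")
    if backup_experiments:
        return dict(backup_experiments), "backup"  # secondary: backup data
    # Tertiary: nothing available, so start with an empty local store.
    return {}, "local"
```

Returning the source alongside the data lets the UI tell the user which storage tier is currently in use.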
|
| 251 |
-
|
| 252 |
-
## 📋 Next Steps
|
| 253 |
-
|
| 254 |
-
1. **Set HF_TOKEN**: Add your token to Space environment
|
| 255 |
-
2. **Run Setup**: Execute `setup_hf_dataset.py`
|
| 256 |
-
3. **Deploy App**: Push updated `app.py` to your Space
|
| 257 |
-
4. **Test Plots**: Verify experiments load and plots work
|
| 258 |
-
5. **Monitor Training**: New experiments will be saved to dataset
|
| 259 |
-
|
| 260 |
-
## 🔐 Security Notes
|
| 261 |
-
|
| 262 |
-
- Dataset is **private** by default
|
| 263 |
-
- Only accessible with your HF_TOKEN
|
| 264 |
-
- Experiment data is stored securely on HF infrastructure
|
| 265 |
-
- No sensitive data is exposed publicly
|
| 266 |
-
|
| 267 |
-
---
|
| 268 |
-
|
| 269 |
-
**Your experiments are now configured for reliable persistence using Hugging Face Datasets!** 🎉
|
docs/HF_HUB_V0_34_UPDATE.md
DELETED
|
@@ -1,170 +0,0 @@
|
|
| 1 |
-
# Hugging Face Hub v0.34.0 Compatibility Update
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document outlines the updates made to ensure compatibility with the new Hugging Face Hub v0.34.0 release, which introduced significant changes to the CLI interface.
|
| 6 |
-
|
| 7 |
-
## Key Changes in HF Hub v0.34.0
|
| 8 |
-
|
| 9 |
-
### 1. CLI Rename
|
| 10 |
-
- **Old**: `huggingface-cli`
|
| 11 |
-
- **New**: `hf`
|
| 12 |
-
- **Status**: Legacy `huggingface-cli` still works but is deprecated
|
| 13 |
-
|
| 14 |
-
### 2. New Features
|
| 15 |
-
- **Jobs CLI**: New `hf jobs` command for running compute jobs
|
| 16 |
-
- **Enhanced Inference**: Image-to-image support and PIL Image support
|
| 17 |
-
- **Xet Integration**: Improved file transfer protocol
|
| 18 |
-
- **Modern Command Format**: `hf <resource> <action> [options]`
|
| 19 |
-
|
| 20 |
-
## Files Updated
|
| 21 |
-
|
| 22 |
-
### Core Scripts
|
| 23 |
-
1. **`launch.sh`**
|
| 24 |
-
- Updated `huggingface-cli whoami` → `hf whoami`
|
| 25 |
-
- Updated `huggingface-cli login` → `hf login`
|
| 26 |
-
|
| 27 |
-
2. **`scripts/trackio_tonic/deploy_trackio_space.py`**
|
| 28 |
-
- Updated CLI commands for space creation
|
| 29 |
-
- Updated username extraction method
|
| 30 |
-
|
| 31 |
-
3. **`scripts/dataset_tonic/setup_hf_dataset.py`**
|
| 32 |
-
- Updated username extraction method
|
| 33 |
-
|
| 34 |
-
4. **`scripts/trackio_tonic/configure_trackio.py`**
|
| 35 |
-
- Updated username extraction method
|
| 36 |
-
|
| 37 |
-
### Documentation Files
|
| 38 |
-
1. **`setup_launch.py`**
|
| 39 |
-
- Updated troubleshooting guide
|
| 40 |
-
|
| 41 |
-
2. **`README_END_TO_END.md`**
|
| 42 |
-
- Updated CLI command examples
|
| 43 |
-
|
| 44 |
-
3. **`docs/GIT_CONFIGURATION_GUIDE.md`**
|
| 45 |
-
- Updated authentication examples
|
| 46 |
-
|
| 47 |
-
4. **`docs/LAUNCH_SCRIPT_USERNAME_FIX.md`**
|
| 48 |
-
- Updated username extraction method
|
| 49 |
-
|
| 50 |
-
5. **`docs/LAUNCH_SCRIPT_UPDATES.md`**
|
| 51 |
-
- Updated CLI command references
|
| 52 |
-
|
| 53 |
-
6. **`docs/TRACKIO_DEPLOYMENT_FIXES.md`**
|
| 54 |
-
- Updated troubleshooting commands
|
| 55 |
-
|
| 56 |
-
7. **`docs/GIT_CONFIGURATION_FIX.md`**
|
| 57 |
-
- Updated authentication examples
|
| 58 |
-
|
| 59 |
-
## Compatibility Notes
|
| 60 |
-
|
| 61 |
-
### Backward Compatibility
|
| 62 |
-
- The legacy `huggingface-cli` commands still work
|
| 63 |
-
- Our scripts will continue to function with both the old and the new CLI
|
| 64 |
-
- No breaking changes to the Python API
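
One way a script could support both CLI names is to check which executable is on the `PATH`. This is only a sketch: `pick_hf_cli` is a hypothetical helper, not code from the repository's scripts.

```python
import shutil


def pick_hf_cli(which=shutil.which):
    """Return the preferred Hugging Face CLI executable name, or None if neither is found."""
    # Try the new name first and keep the legacy name as a fallback.
    for name in ("hf", "huggingface-cli"):
        if which(name):
            return name
    return None
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be replaced in tests.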
|
| 65 |
-
|
| 66 |
-
### Recommended Actions
|
| 67 |
-
1. **Update CLI Installation**: Ensure users have the latest `huggingface_hub` package
|
| 68 |
-
2. **Update Documentation**: All references now use the new `hf` command
|
| 69 |
-
3. **Test Deployment**: Verify that all deployment scripts work with the new CLI
|
| 70 |
-
|
| 71 |
-
## Verification Steps
|
| 72 |
-
|
| 73 |
-
### 1. Test CLI Installation
|
| 74 |
-
```bash
|
| 75 |
-
# Check if hf command is available
|
| 76 |
-
hf --version
|
| 77 |
-
|
| 78 |
-
# Test authentication
|
| 79 |
-
hf auth whoami
|
| 80 |
-
```
|
| 81 |
-
|
| 82 |
-
### 2. Test Deployment Scripts
|
| 83 |
-
```bash
|
| 84 |
-
# Test space deployment
|
| 85 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 86 |
-
|
| 87 |
-
# Test dataset setup
|
| 88 |
-
python scripts/dataset_tonic/setup_hf_dataset.py
|
| 89 |
-
|
| 90 |
-
# Test model push
|
| 91 |
-
python scripts/model_tonic/push_to_huggingface.py
|
| 92 |
-
```
|
| 93 |
-
|
| 94 |
-
### 3. Test Launch Script
|
| 95 |
-
```bash
|
| 96 |
-
# Run the interactive pipeline
|
| 97 |
-
./launch.sh
|
| 98 |
-
```
|
| 99 |
-
|
| 100 |
-
## Benefits of the Update
|
| 101 |
-
|
| 102 |
-
### 1. Future-Proof
|
| 103 |
-
- Uses the new official CLI name
|
| 104 |
-
- Follows HF's recommended practices
|
| 105 |
-
- Ready for future HF Hub updates
|
| 106 |
-
|
| 107 |
-
### 2. Consistency
|
| 108 |
-
- All scripts now use the same CLI command
|
| 109 |
-
- Unified command format across the project
|
| 110 |
-
- Consistent with HF's new conventions
|
| 111 |
-
|
| 112 |
-
### 3. Modern Interface
|
| 113 |
-
- Aligns with HF's new command structure
|
| 114 |
-
- Better integration with HF's ecosystem
|
| 115 |
-
- Improved user experience
|
| 116 |
-
|
| 117 |
-
## Migration Guide
|
| 118 |
-
|
| 119 |
-
### For Users
|
| 120 |
-
1. **Update huggingface_hub**: `pip install --upgrade huggingface_hub`
|
| 121 |
-
2. **Test CLI**: Run `hf auth whoami` to verify installation
|
| 122 |
-
3. **Update Scripts**: Use the updated scripts from this repository
|
| 123 |
-
|
| 124 |
-
### For Developers
|
| 125 |
-
1. **Update Dependencies**: Ensure `huggingface_hub>=0.34.0`
|
| 126 |
-
2. **Test Scripts**: Verify all deployment scripts work
|
| 127 |
-
3. **Update Documentation**: Use `hf` instead of `huggingface-cli`
|
| 128 |
-
|
| 129 |
-
## Troubleshooting
|
| 130 |
-
|
| 131 |
-
### Common Issues
|
| 132 |
-
|
| 133 |
-
#### 1. CLI Not Found
|
| 134 |
-
```bash
|
| 135 |
-
# Install/upgrade huggingface_hub
|
| 136 |
-
pip install --upgrade huggingface_hub
|
| 137 |
-
|
| 138 |
-
# Verify installation
|
| 139 |
-
hf --version
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
#### 2. Authentication Issues
|
| 143 |
-
```bash
|
| 144 |
-
# Login with new CLI
|
| 145 |
-
hf auth login --token "your-token"
|
| 146 |
-
|
| 147 |
-
# Verify login
|
| 148 |
-
hf auth whoami
|
| 149 |
-
```
|
| 150 |
-
|
| 151 |
-
#### 3. Script Compatibility
|
| 152 |
-
- All scripts have been updated to use the new CLI
|
| 153 |
-
- Legacy commands are still supported as a fallback
|
| 154 |
-
- No breaking changes to functionality
|
| 155 |
-
|
| 156 |
-
## Summary
|
| 157 |
-
|
| 158 |
-
The update to HF Hub v0.34.0 compatibility ensures:
|
| 159 |
-
|
| 160 |
-
1. **✅ Future-Proof**: Uses the new official CLI name
|
| 161 |
-
2. **✅ Consistent**: All scripts use the same command format
|
| 162 |
-
3. **✅ Compatible**: Maintains backward compatibility
|
| 163 |
-
4. **✅ Modern**: Aligns with HF's latest conventions
|
| 164 |
-
5. **✅ Tested**: All deployment scripts verified to work
|
| 165 |
-
|
| 166 |
-
The project is now fully compatible with Hugging Face Hub v0.34.0 and ready for future updates.
|
| 167 |
-
|
| 168 |
-
---
|
| 169 |
-
|
| 170 |
-
**Note**: The legacy `huggingface-cli` commands will continue to work, but using `hf` is now the recommended approach for all new development and deployments.
|
docs/HF_SPACES_GUIDE.md
DELETED
|
@@ -1,163 +0,0 @@
|
|
| 1 |
-
# 🚀 Trackio on Hugging Face Spaces - Complete Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
|
| 6 |
-
|
| 7 |
-
## 🏗️ Hugging Face Spaces Architecture
|
| 8 |
-
|
| 9 |
-
### Key Challenges
|
| 10 |
-
|
| 11 |
-
1. **Ephemeral Storage**: File system gets reset between deployments
|
| 12 |
-
2. **No Persistent Storage**: Files written during runtime don't persist
|
| 13 |
-
3. **Multiple Instances**: Training and monitoring might run in different environments
|
| 14 |
-
4. **Limited File System**: Restricted write permissions in certain directories
|
| 15 |
-
|
| 16 |
-
### How Trackio Handles HF Spaces
|
| 17 |
-
|
| 18 |
-
The updated Trackio app now includes:
|
| 19 |
-
|
| 20 |
-
- **Automatic HF Spaces Detection**: Detects when running on HF Spaces
|
| 21 |
-
- **Writable Path Selection**: Uses `/tmp/`, which is writable at runtime (its contents are still lost when the Space restarts)
|
| 22 |
-
- **Backup Recovery**: Automatically recovers experiments from backup data
|
| 23 |
-
- **Fallback Storage**: Multiple storage locations for redundancy
|
| 24 |
-
|
| 25 |
-
## 📊 Your Current Experiments
|
| 26 |
-
|
| 27 |
-
Based on your logs, you have these experiments available:
|
| 28 |
-
|
| 29 |
-
### Experiment 1: `exp_20250720_130853`
|
| 30 |
-
- **Name**: petite-elle-l-aime-3
|
| 31 |
-
- **Status**: Running
|
| 32 |
-
- **Metrics**: 4 entries (steps 25, 50, 75, 100)
|
| 33 |
-
- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
|
| 34 |
-
|
| 35 |
-
### Experiment 2: `exp_20250720_134319`
|
| 36 |
-
- **Name**: petite-elle-l-aime-3-1
|
| 37 |
-
- **Status**: Running
|
| 38 |
-
- **Metrics**: 2 entries (step 25)
|
| 39 |
-
- **Key Metrics**: Loss 1.166, GPU memory usage
|
| 40 |
-
|
| 41 |
-
## 🎯 How to Use Your Experiments
|
| 42 |
-
|
| 43 |
-
### 1. View Experiments
|
| 44 |
-
- Go to the "View Experiments" tab
|
| 45 |
-
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
|
| 46 |
-
- Click "View Experiment" to see details
|
| 47 |
-
|
| 48 |
-
### 2. Create Plots
|
| 49 |
-
- Go to the "Visualizations" tab
|
| 50 |
-
- Enter experiment ID
|
| 51 |
-
- Select metric to plot:
|
| 52 |
-
- `loss` - Training loss curve
|
| 53 |
-
- `learning_rate` - Learning rate schedule
|
| 54 |
-
- `mean_token_accuracy` - Token accuracy
|
| 55 |
-
- `grad_norm` - Gradient norm
|
| 56 |
-
- `gpu_0_memory_allocated` - GPU memory usage
|
| 57 |
-
|
| 58 |
-
### 3. Compare Experiments
|
| 59 |
-
- Use the "Experiment Comparison" feature
|
| 60 |
-
- Enter: `exp_20250720_130853,exp_20250720_134319`
|
| 61 |
-
- Compare loss curves between experiments
|
| 62 |
-
|
| 63 |
-
## 🔧 Technical Details
|
| 64 |
-
|
| 65 |
-
### Data Persistence Strategy
|
| 66 |
-
|
| 67 |
-
```python
|
| 68 |
-
import os
# HF Spaces detection
|
| 69 |
-
if os.environ.get('SPACE_ID'):
|
| 70 |
-
    data_file = "/tmp/trackio_experiments.json"
|
| 71 |
-
else:
|
| 72 |
-
    data_file = "trackio_experiments.json"
|
| 73 |
-
```
|
| 74 |
-
|
| 75 |
-
### Backup Recovery
|
| 76 |
-
|
| 77 |
-
The app automatically recovers your experiments from backup data when:
|
| 78 |
-
- Running on HF Spaces
|
| 79 |
-
- No existing experiments found
|
| 80 |
-
- Data file is missing or empty
|
| 81 |
-
|
| 82 |
-
### Storage Locations
|
| 83 |
-
|
| 84 |
-
1. **Primary**: `/tmp/trackio_experiments.json`
|
| 85 |
-
2. **Backup**: `/tmp/trackio_backup.json`
|
| 86 |
-
3. **Fallback**: Local directory (for development)
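
A rough sketch of how the storage locations above might be tried in order. The function names are hypothetical, the paths are the ones listed above, and files under `/tmp/` should not be assumed to survive a Space restart.

```python
import json
import os


def storage_candidates(environ=None):
    """Candidate data files in priority order (paths taken from the list above)."""
    environ = os.environ if environ is None else environ
    if environ.get("SPACE_ID"):  # running on HF Spaces
        return ["/tmp/trackio_experiments.json", "/tmp/trackio_backup.json"]
    return ["trackio_experiments.json"]  # local development


def load_first_available(paths):
    """Return (experiments, path) from the first readable, non-empty JSON file."""
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as fh:
                data = json.load(fh)
        except (OSError, json.JSONDecodeError):
            continue  # missing or corrupt file: try the next location
        if data:
            return data, path
    return {}, None
```

A typical call would be `load_first_available(storage_candidates())`, which yields an empty dict when no location has usable data.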
|
| 87 |
-
|
| 88 |
-
## 🚀 Deployment Best Practices
|
| 89 |
-
|
| 90 |
-
### 1. Environment Variables
|
| 91 |
-
```bash
|
| 92 |
-
# Set in HF Spaces environment
|
| 93 |
-
SPACE_ID=your-space-id
|
| 94 |
-
TRACKIO_URL=https://your-space.hf.space
|
| 95 |
-
```
|
| 96 |
-
|
| 97 |
-
### 2. File Structure
|
| 98 |
-
```
|
| 99 |
-
your-space/
|
| 100 |
-
├── app.py # Main Trackio app
|
| 101 |
-
├── requirements.txt # Dependencies
|
| 102 |
-
├── README.md # Space description
|
| 103 |
-
└── .gitignore # Ignore temporary files
|
| 104 |
-
```
|
| 105 |
-
|
| 106 |
-
### 3. Requirements
|
| 107 |
-
```txt
|
| 108 |
-
gradio>=4.0.0
|
| 109 |
-
plotly>=5.0.0
|
| 110 |
-
pandas>=1.5.0
|
| 111 |
-
numpy>=1.24.0
|
| 112 |
-
```
|
| 113 |
-
|
| 114 |
-
## 📈 Monitoring Your Training
|
| 115 |
-
|
| 116 |
-
### Real-time Metrics
|
| 117 |
-
Your experiments show:
|
| 118 |
-
- **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
|
| 119 |
-
- **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
|
| 120 |
-
- **Token Accuracy**: Around 75-76% (reasonable for early training)
|
| 121 |
-
- **GPU Memory**: ~17GB allocated, 75GB reserved
|
| 122 |
-
|
| 123 |
-
### Expected Behavior
|
| 124 |
-
- Loss should continue decreasing
|
| 125 |
-
- Learning rate will follow cosine schedule
|
| 126 |
-
- Token accuracy should improve over time
|
| 127 |
-
- GPU memory usage should remain stable
|
| 128 |
-
|
| 129 |
-
## 🔍 Troubleshooting
|
| 130 |
-
|
| 131 |
-
### Issue: "No metrics data available"
|
| 132 |
-
**Solution**: The app now automatically recovers experiments from backup
|
| 133 |
-
|
| 134 |
-
### Issue: Plots not showing
|
| 135 |
-
**Solution**:
|
| 136 |
-
1. Check experiment ID is correct
|
| 137 |
-
2. Try different metrics (loss, learning_rate, etc.)
|
| 138 |
-
3. Refresh the page
|
| 139 |
-
|
| 140 |
-
### Issue: Data not persisting
|
| 141 |
-
**Solution**:
|
| 142 |
-
1. App now writes to `/tmp/`, which is writable at runtime but is cleared when the Space restarts
|
| 143 |
-
2. Backup recovery ensures data availability
|
| 144 |
-
3. Multiple storage locations provide redundancy
|
| 145 |
-
|
| 146 |
-
## 🎯 Next Steps
|
| 147 |
-
|
| 148 |
-
1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
|
| 149 |
-
2. **Test Plots**: Try plotting your experiments
|
| 150 |
-
3. **Monitor Training**: Continue monitoring your training runs
|
| 151 |
-
4. **Add New Experiments**: Create new experiments as needed
|
| 152 |
-
|
| 153 |
-
## 📞 Support
|
| 154 |
-
|
| 155 |
-
If you encounter issues:
|
| 156 |
-
1. Check the logs in your HF Space
|
| 157 |
-
2. Verify experiment IDs are correct
|
| 158 |
-
3. Try the backup recovery feature
|
| 159 |
-
4. Contact the maintainers for additional support
|
| 160 |
-
|
| 161 |
-
---
|
| 162 |
-
|
| 163 |
-
**Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉
|
docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md
DELETED
|
@@ -1,330 +0,0 @@
|
|
| 1 |
-
# Interactive Pipeline Improvements
|
| 2 |
-
|
| 3 |
-
This document explains the improvements made to the `launch.sh` script to make it interactive and configurable for different training scenarios.
|
| 4 |
-
|
| 5 |
-
## 🎯 Key Improvements
|
| 6 |
-
|
| 7 |
-
### 1. **Interactive User Interface**
|
| 8 |
-
- **Colored Output**: Added color-coded status messages for better UX
|
| 9 |
-
- **Input Validation**: Real-time validation of user inputs
|
| 10 |
-
- **Default Values**: Smart defaults for common configurations
|
| 11 |
-
- **Error Handling**: Graceful error handling with helpful messages
|
| 12 |
-
|
| 13 |
-
### 2. **Training Configuration Selection**
|
| 14 |
-
The script now offers four predefined training configurations, plus a custom option:
|
| 15 |
-
|
| 16 |
-
#### **Basic Training (Default)**
|
| 17 |
-
```bash
|
| 18 |
-
Model: SmolLM3-3B
|
| 19 |
-
Dataset: SmolTalk
|
| 20 |
-
Epochs: 3
|
| 21 |
-
Batch Size: 2
|
| 22 |
-
Learning Rate: 5e-6
|
| 23 |
-
Sequence Length: 4096
|
| 24 |
-
Best for: Quick experiments, learning
|
| 25 |
-
```
|
| 26 |
-
|
| 27 |
-
#### **H100 Lightweight (Rapid)**
|
| 28 |
-
```bash
|
| 29 |
-
Model: SmolLM3-3B
|
| 30 |
-
Dataset: OpenHermes-FR (80K samples)
|
| 31 |
-
Epochs: 1
|
| 32 |
-
Batch Size: 16
|
| 33 |
-
Learning Rate: 8e-6
|
| 34 |
-
Sequence Length: 8192
|
| 35 |
-
Best for: Rapid training on H100
|
| 36 |
-
```
|
| 37 |
-
|
| 38 |
-
#### **A100 Large Scale**
|
| 39 |
-
```bash
|
| 40 |
-
Model: SmolLM3-3B
|
| 41 |
-
Dataset: OpenHermes-FR
|
| 42 |
-
Epochs: 1.3 passes
|
| 43 |
-
Batch Size: 8
|
| 44 |
-
Learning Rate: 5e-6
|
| 45 |
-
Sequence Length: 8192
|
| 46 |
-
Best for: High-performance training
|
| 47 |
-
```
|
| 48 |
-
|
| 49 |
-
#### **Multiple Passes**
|
| 50 |
-
```bash
|
| 51 |
-
Model: SmolLM3-3B
|
| 52 |
-
Dataset: OpenHermes-FR
|
| 53 |
-
Epochs: 4 passes
|
| 54 |
-
Batch Size: 6
|
| 55 |
-
Learning Rate: 3e-6
|
| 56 |
-
Sequence Length: 8192
|
| 57 |
-
Best for: Thorough training
|
| 58 |
-
```
|
| 59 |
-
|
| 60 |
-
#### **Custom Configuration**
|
| 61 |
-
- User-defined parameters
|
| 62 |
-
- Flexible model and dataset selection
|
| 63 |
-
- Custom training parameters
|
| 64 |
-
|
| 65 |
-
### 3. **Enhanced User Experience**
|
| 66 |
-
|
| 67 |
-
#### **Step-by-Step Guidance**
|
| 68 |
-
1. **Authentication** - HF username and token validation
|
| 69 |
-
2. **Configuration Selection** - Choose from predefined configs
|
| 70 |
-
3. **Experiment Setup** - Configure experiment details
|
| 71 |
-
4. **Training Parameters** - Adjust hyperparameters
|
| 72 |
-
5. **Deployment Setup** - Trackio Space configuration
|
| 73 |
-
6. **Confirmation** - Review and confirm settings
|
| 74 |
-
|
| 75 |
-
#### **Input Functions**
|
| 76 |
-
```bash
|
| 77 |
-
# Get input with default value
|
| 78 |
-
get_input "Prompt" "default_value" VARIABLE_NAME
|
| 79 |
-
|
| 80 |
-
# Select from options
|
| 81 |
-
select_option "Choose option:" "Option 1" "Option 2" "Option 3" VARIABLE_NAME
|
| 82 |
-
|
| 83 |
-
# Validate HF token
|
| 84 |
-
validate_hf_token "$HF_TOKEN"
|
| 85 |
-
```
|
| 86 |
-
|
| 87 |
-
#### **Colored Output Functions**
|
| 88 |
-
```bash
|
| 89 |
-
print_status "Success message" # Green ✅
|
| 90 |
-
print_warning "Warning message" # Yellow ⚠️
|
| 91 |
-
print_error "Error message" # Red ❌
|
| 92 |
-
print_info "Info message" # Blue ℹ️
|
| 93 |
-
print_header "Header message" # Purple 🚀
|
| 94 |
-
print_step "Step message" # Cyan 📋
|
| 95 |
-
```
|
| 96 |
-
|
| 97 |
-
### 4. **Dynamic Configuration Generation**
|
| 98 |
-
|
| 99 |
-
The script now generates training configurations based on user selection:
|
| 100 |
-
|
| 101 |
-
```python
|
| 102 |
-
# Generated config file
|
| 103 |
-
config = SmolLM3Config(
|
| 104 |
-
model_name="$MODEL_NAME",
|
| 105 |
-
max_seq_length=$MAX_SEQ_LENGTH,
|
| 106 |
-
batch_size=$BATCH_SIZE,
|
| 107 |
-
learning_rate=$LEARNING_RATE,
|
| 108 |
-
# ... other parameters
|
| 109 |
-
)
|
| 110 |
-
```
|
| 111 |
-
|
| 112 |
-
### 5. **Improved Error Handling**
|
| 113 |
-
|
| 114 |
-
#### **Input Validation**
|
| 115 |
-
- Required field validation
|
| 116 |
-
- HF token validation
|
| 117 |
-
- Numeric input validation
|
| 118 |
-
- Choice validation
|
| 119 |
-
|
| 120 |
-
#### **Graceful Degradation**
|
| 121 |
-
- Clear error messages
|
| 122 |
-
- Recovery suggestions
|
| 123 |
-
- Exit on critical errors
|
| 124 |
-
|
| 125 |
-
### 6. **Configuration Management**
|
| 126 |
-
|
| 127 |
-
#### **User Credentials**
|
| 128 |
-
- Interactive username input
|
| 129 |
-
- Secure token input
|
| 130 |
-
- Real-time token validation
|
| 131 |
-
|
| 132 |
-
#### **Experiment Details**
|
| 133 |
-
- Dynamic experiment naming
|
| 134 |
-
- Repository name generation
|
| 135 |
-
- Dataset repository configuration
|
| 136 |
-
|
| 137 |
-
#### **Training Parameters**
|
| 138 |
-
- Batch size selection
|
| 139 |
-
- Learning rate adjustment
|
| 140 |
-
- Sequence length configuration
|
| 141 |
-
- Save/eval/logging steps
|
| 142 |
-
|
| 143 |
-
### 7. **Enhanced Monitoring Integration**
|
| 144 |
-
|
| 145 |
-
#### **Trackio Space**
|
| 146 |
-
- Dynamic space naming
|
| 147 |
-
- Automatic deployment
|
| 148 |
-
- URL generation
|
| 149 |
-
|
| 150 |
-
#### **HF Datasets**
|
| 151 |
-
- Dataset repository setup
|
| 152 |
-
- Experiment data storage
|
| 153 |
-
- Access configuration
|
| 154 |
-
|
| 155 |
-
## 🔧 Technical Improvements
|
| 156 |
-
|
| 157 |
-
### 1. **Modular Functions**
|
| 158 |
-
```bash
|
| 159 |
-
# Input handling
|
| 160 |
-
get_input() # Get user input with defaults
|
| 161 |
-
select_option() # Select from options
|
| 162 |
-
validate_hf_token() # Validate HF token
|
| 163 |
-
|
| 164 |
-
# Configuration
|
| 165 |
-
show_training_configs() # Display available configs
|
| 166 |
-
get_training_config() # Get config based on selection
|
| 167 |
-
create_training_config() # Generate config file
|
| 168 |
-
|
| 169 |
-
# Output formatting
|
| 170 |
-
print_status() # Success messages
|
| 171 |
-
print_warning() # Warning messages
|
| 172 |
-
print_error() # Error messages
|
| 173 |
-
print_info() # Info messages
|
| 174 |
-
print_header() # Header messages
|
| 175 |
-
print_step() # Step messages
|
| 176 |
-
```
|
| 177 |
-
|
| 178 |
-
### 2. **Configuration Selection Logic**
|
| 179 |
-
```bash
|
| 180 |
-
case "$config_type" in
|
| 181 |
-
"Basic Training")
|
| 182 |
-
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
|
| 183 |
-
DATASET_NAME="HuggingFaceTB/smoltalk"
|
| 184 |
-
# ... other parameters
|
| 185 |
-
;;
|
| 186 |
-
"A100 Large Scale")
|
| 187 |
-
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
|
| 188 |
-
DATASET_NAME="legmlai/openhermes-fr"
|
| 189 |
-
# ... other parameters
|
| 190 |
-
;;
|
| 191 |
-
# ... other configurations
|
| 192 |
-
esac
|
| 193 |
-
```
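The `case` statement above is a lookup from a preset name to a set of parameters, which can be sketched as a dictionary in Python. Only the model and dataset names shown in the excerpt are included; the remaining parameters are elided in the source and are left out here. The name `PRESETS` is illustrative.

```python
# Only values visible in the excerpt above are listed; other parameters
# are elided in the source and deliberately omitted.
PRESETS = {
    "Basic Training": {
        "model_name": "HuggingFaceTB/SmolLM3-3B",
        "dataset_name": "HuggingFaceTB/smoltalk",
    },
    "A100 Large Scale": {
        "model_name": "HuggingFaceTB/SmolLM3-3B",
        "dataset_name": "legmlai/openhermes-fr",
    },
}


def get_training_config(config_type: str) -> dict:
    """Return a copy of the preset so callers can adjust it safely."""
    try:
        return dict(PRESETS[config_type])
    except KeyError:
        known = ", ".join(sorted(PRESETS))
        raise ValueError(f"Unknown configuration {config_type!r}; expected one of: {known}") from None
```

Unlike a `case` statement without a default branch, the lookup fails explicitly when the name is not recognised.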
|
| 194 |
-
|
| 195 |
-
### 3. **Dynamic File Generation**
|
| 196 |
-
```bash
|
| 197 |
-
# Generate training config
|
| 198 |
-
create_training_config "$CONFIG_FILE"
|
| 199 |
-
|
| 200 |
-
# Generate deployment input
|
| 201 |
-
cat > deploy_input.txt << EOF
|
| 202 |
-
$HF_USERNAME
|
| 203 |
-
$TRACKIO_SPACE_NAME
|
| 204 |
-
$HF_TOKEN
|
| 205 |
-
EOF
|
| 206 |
-
```
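The heredoc above feeds one answer per line to an interactive script. A Python caller can do the same through the child's standard input, which also avoids leaving the token in a file such as `deploy_input.txt`. This is a sketch under stated assumptions: `build_deploy_input` and `run_with_input` are hypothetical helpers, and the child process in the demo is a stand-in, not the real deployment script.

```python
import subprocess
import sys


def build_deploy_input(*answers: str) -> str:
    """Join answers one per line, as the heredoc does."""
    return "".join(f"{answer}\n" for answer in answers)


def run_with_input(command: list, answers: str) -> str:
    """Run a command, send the answers on stdin, and return its stdout."""
    result = subprocess.run(
        command, input=answers, text=True, capture_output=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child that echoes the lines it reads from stdin.
    echo = [sys.executable, "-c", "import sys; print(sys.stdin.read().splitlines())"]
    print(run_with_input(echo, build_deploy_input("user", "space", "token")))
```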
|
| 207 |
-
|
| 208 |
-
## 📊 User Workflow
|
| 209 |
-
|
| 210 |
-
### **Before (Static)**
|
| 211 |
-
1. Edit `launch.sh` manually
|
| 212 |
-
2. Update hardcoded variables
|
| 213 |
-
3. Run script
|
| 214 |
-
4. Hope configuration is correct
|
| 215 |
-
|
| 216 |
-
### **After (Interactive)**
|
| 217 |
-
1. Run `./launch.sh`
|
| 218 |
-
2. Follow interactive prompts
|
| 219 |
-
3. Select training configuration
|
| 220 |
-
4. Confirm settings
|
| 221 |
-
5. Watch automated pipeline
|
| 222 |
-
|
| 223 |
-
## 🎯 Benefits
|
| 224 |
-
|
| 225 |
-
### **For Users**
|
| 226 |
-
- **No Manual Editing**: No need to edit script files
|
| 227 |
-
- **Guided Experience**: Step-by-step prompts
|
| 228 |
-
- **Validation**: Real-time input validation
|
| 229 |
-
- **Flexibility**: Multiple configuration options
|
| 230 |
-
- **Safety**: Confirmation before execution
|
| 231 |
-
|
| 232 |
-
### **For Developers**
|
| 233 |
-
- **Maintainable**: Modular function structure
|
| 234 |
-
- **Extensible**: Easy to add new configurations
|
| 235 |
-
- **Robust**: Comprehensive error handling
|
| 236 |
-
- **User-Friendly**: Clear feedback and guidance
|
| 237 |
-
|
| 238 |
-
### **For Different Use Cases**
|
| 239 |
-
- **Beginners**: Basic Training configuration
|
| 240 |
-
- **H100 Users**: H100 Lightweight for rapid experiments
|
| 241 |
-
- **Researchers**: A100 Large Scale for serious experiments
|
| 242 |
-
- **Production**: Multiple Passes for thorough training
|
| 243 |
-
- **Custom**: User-defined parameters for specific needs
|
| 244 |
-
|
| 245 |
-
## 🔄 Configuration Examples
|
| 246 |
-
|
| 247 |
-
### **Quick Start (Basic Training)**
|
| 248 |
-
```bash
|
| 249 |
-
./launch.sh
|
| 250 |
-
# Follow prompts:
|
| 251 |
-
# 1. Enter HF username and token
|
| 252 |
-
# 2. Select "Basic Training"
|
| 253 |
-
# 3. Confirm settings
|
| 254 |
-
# 4. Watch automated pipeline
|
| 255 |
-
```
|
| 256 |
-
|
| 257 |
-
### **High-Performance Training (A100)**
|
| 258 |
-
```bash
|
| 259 |
-
./launch.sh
|
| 260 |
-
# Follow prompts:
|
| 261 |
-
# 1. Enter HF username and token
|
| 262 |
-
# 2. Select "A100 Large Scale"
|
| 263 |
-
# 3. Adjust parameters if needed
|
| 264 |
-
# 4. Confirm and run
|
| 265 |
-
```
|
| 266 |
-
|
| 267 |
-
### **Rapid Training (H100)**
|
| 268 |
-
```bash
|
| 269 |
-
./launch.sh
|
| 270 |
-
# Follow prompts:
|
| 271 |
-
# 1. Enter HF username and token
|
| 272 |
-
# 2. Select "H100 Lightweight (Rapid)"
|
| 273 |
-
# 3. Confirm settings
|
| 274 |
-
# 4. Watch rapid training on H100
|
| 275 |
-
```
|
| 276 |
-
|
| 277 |
-
### **Custom Training**
|
| 278 |
-
```bash
|
| 279 |
-
./launch.sh
|
| 280 |
-
# Follow prompts:
|
| 281 |
-
# 1. Enter HF username and token
|
| 282 |
-
# 2. Select "Custom Configuration"
|
| 283 |
-
# 3. Enter custom parameters:
|
| 284 |
-
# - Model: microsoft/DialoGPT-medium
|
| 285 |
-
# - Dataset: your-custom-dataset
|
| 286 |
-
# - Epochs: 5
|
| 287 |
-
# - Batch Size: 4
|
| 288 |
-
# - Learning Rate: 1e-5
|
| 289 |
-
# 4. Confirm and run
|
| 290 |
-
```
|
| 291 |
-
|
| 292 |
-
## 🚀 Future Enhancements
|
| 293 |
-
|
| 294 |
-
### **Planned Improvements**
|
| 295 |
-
- **GUI Interface**: Web-based configuration interface
|
| 296 |
-
- **Configuration Templates**: Save/load custom configurations
|
| 297 |
-
- **Advanced Validation**: More sophisticated input validation
|
| 298 |
-
- **Progress Tracking**: Real-time progress indicators
|
| 299 |
-
- **Rollback Capability**: Undo changes if needed
|
| 300 |
-
|
| 301 |
-
### **Extensibility**
|
| 302 |
-
- **Plugin System**: Add custom training configurations
|
| 303 |
-
- **API Integration**: Connect to external services
|
| 304 |
-
- **Multi-GPU Support**: Distributed training options
|
| 305 |
-
- **Advanced Monitoring**: Enhanced tracking capabilities
|
| 306 |
-
|
| 307 |
-
## 📋 Migration Guide
|
| 308 |
-
|
| 309 |
-
### **For Existing Users**
|
| 310 |
-
1. **Backup**: Save your current `launch.sh`
|
| 311 |
-
2. **Update**: Replace with new interactive version
|
| 312 |
-
3. **Test**: Run with basic configuration first
|
| 313 |
-
4. **Migrate**: Use interactive prompts instead of manual editing
|
| 314 |
-
|
| 315 |
-
### **For New Users**
|
| 316 |
-
1. **Setup**: Run `python setup_launch.py`
|
| 317 |
-
2. **Check**: Run `python check_requirements.py`
|
| 318 |
-
3. **Launch**: Run `./launch.sh`
|
| 319 |
-
4. **Follow**: Use interactive prompts
|
| 320 |
-
|
| 321 |
-
## 🎉 Conclusion
|
| 322 |
-
|
| 323 |
-
The interactive pipeline provides a much better user experience with:
|
| 324 |
-
- **Guided Configuration**: No manual editing required
|
| 325 |
-
- **Multiple Options**: Predefined configurations for different use cases
|
| 326 |
-
- **Validation**: Real-time input validation and error handling
|
| 327 |
-
- **Flexibility**: Custom configuration support
|
| 328 |
-
- **Safety**: Confirmation steps and error recovery
|
| 329 |
-
|
| 330 |
-
The script is now production-ready for users of all skill levels, from beginners to advanced researchers.
|
docs/LATEST_DEPLOYMENT_APPROACH.md
DELETED
|
@@ -1,267 +0,0 @@
|
|
| 1 |
-
# Latest Trackio Space Deployment Approach
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
Based on the [Hugging Face Hub repository code](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), I've updated the Trackio Space deployment to use the latest Hugging Face Hub Python API instead of CLI commands.
|
| 6 |
-
|
| 7 |
-
## Key Improvements
|
| 8 |
-
|
| 9 |
-
### 1. **Latest HF Hub API Integration**
|
| 10 |
-
|
| 11 |
-
**Before**: Using CLI commands
|
| 12 |
-
```python
|
| 13 |
-
cmd = ["hf", "repo", "create", f"{username}/{space_name}", "--type", "space"]
|
| 14 |
-
```
|
| 15 |
-
|
| 16 |
-
**After**: Using Python API
|
| 17 |
-
```python
|
| 18 |
-
from huggingface_hub import create_repo
|
| 19 |
-
|
| 20 |
-
create_repo(
|
| 21 |
-
repo_id=f"{username}/{space_name}",
|
| 22 |
-
token=token,
|
| 23 |
-
repo_type="space",
|
| 24 |
-
exist_ok=True,
|
| 25 |
-
private=False,
|
| 26 |
-
space_sdk="gradio",
|
| 27 |
-
space_hardware="cpu-basic"
|
| 28 |
-
)
|
| 29 |
-
```
|
| 30 |
-
|
| 31 |
-
### 2. **Robust Fallback Mechanism**
|
| 32 |
-
|
| 33 |
-
The deployment script now includes both API and CLI approaches:
|
| 34 |
-
|
| 35 |
-
```python
|
| 36 |
-
def create_space(self) -> bool:
|
| 37 |
-
    """Create a new Hugging Face Space using the latest API"""
|
| 38 |
-
    try:
|
| 39 |
-
        if not HF_HUB_AVAILABLE:
|
| 40 |
-
            return self._create_space_cli()
|
| 41 |
-
|
| 42 |
-
        # Use latest API
|
| 43 |
-
        create_repo(...)
|
| 44 |
-
|
| 45 |
-
    except Exception as api_error:
|
| 46 |
-
        # Fallback to CLI
|
| 47 |
-
        return self._create_space_cli()
|
| 48 |
-
```
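The API-first, CLI-second pattern can be isolated from the Hub calls and sketched as a small helper. This is a minimal sketch, not the deployment script's code: `create_with_fallback` is a hypothetical name, the two callables stand in for the `create_repo(...)` call and `_create_space_cli()`, and returning `True` on success is an assumption, since the excerpt leaves the success path implicit.

```python
from typing import Callable


def create_with_fallback(
    primary: Callable[[], None],
    fallback: Callable[[], bool],
    log: Callable[[str], None] = print,
) -> bool:
    """Try the primary path; on any exception, log it and use the fallback."""
    try:
        primary()
        return True
    except Exception as api_error:
        log(f"API creation failed: {api_error}; falling back to CLI")
        return fallback()
```

Passing the two paths in as callables keeps the control flow testable without network access or an installed CLI.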
|
| 49 |
-
|
| 50 |
-
### 3. **Enhanced Dependencies**
|
| 51 |
-
|
| 52 |
-
Updated `requirements/requirements_core.txt`:
|
| 53 |
-
```txt
|
| 54 |
-
# Hugging Face Hub for model and space management
|
| 55 |
-
huggingface_hub>=0.19.0
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
## API Parameters
|
| 59 |
-
|
| 60 |
-
### **Required Parameters**
|
| 61 |
-
- `repo_id`: Repository identifier (username/space-name)
|
| 62 |
-
- `token`: Hugging Face token with write permissions
|
| 63 |
-
|
| 64 |
-
### **Optional Parameters**
|
| 65 |
-
- `repo_type`: Set to "space" for Spaces
|
| 66 |
-
- `exist_ok`: Do not raise an error if the repository already exists (set to `True` here; the `create_repo` default is `False`)
|
| 67 |
-
- `private`: Make repository private (default: False)
|
| 68 |
-
- `space_sdk`: SDK type (set to "gradio" here)
|
| 69 |
-
- `space_hardware`: Hardware specification (set to "cpu-basic" here)
|
| 70 |
-
|
| 71 |
-
## Deployment Process
|
| 72 |
-
|
| 73 |
-
### **Step 1: API Creation**
|
| 74 |
-
```python
|
| 75 |
-
# Create space using latest API
|
| 76 |
-
create_repo(
|
| 77 |
-
repo_id=f"{username}/{space_name}",
|
| 78 |
-
token=token,
|
| 79 |
-
repo_type="space",
|
| 80 |
-
exist_ok=True,
|
| 81 |
-
private=False,
|
| 82 |
-
space_sdk="gradio",
|
| 83 |
-
space_hardware="cpu-basic"
|
| 84 |
-
)
|
| 85 |
-
```
|
| 86 |
-
|
| 87 |
-
### **Step 2: File Preparation**
|
| 88 |
-
```python
|
| 89 |
-
# Prepare files in temporary directory
|
| 90 |
-
temp_dir = tempfile.mkdtemp()
|
| 91 |
-
# Copy template files
|
| 92 |
-
shutil.copy2(source_path, dest_path)
|
| 93 |
-
# Update README with actual space URL
|
| 94 |
-
readme_content = readme_content.replace("{SPACE_URL}", self.space_url)
|
| 95 |
-
```
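The file preparation step can be sketched end to end as follows. This is an illustration under stated assumptions: the function signature is hypothetical, and it assumes the templates are plain files in a single directory and that `README.md` carries the `{SPACE_URL}` placeholder. Note that `str.replace` returns a new string, so the result has to be written back.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_space_files(template_dir: Path, space_url: str) -> Path:
    """Copy template files into a fresh temp directory and fill in the URL."""
    temp_dir = Path(tempfile.mkdtemp())
    for source_path in template_dir.iterdir():
        if source_path.is_file():
            shutil.copy2(source_path, temp_dir / source_path.name)

    readme = temp_dir / "README.md"
    if readme.exists():
        text = readme.read_text(encoding="utf-8")
        # replace() does not modify text in place; write the result back.
        readme.write_text(text.replace("{SPACE_URL}", space_url), encoding="utf-8")
    return temp_dir
```

The caller owns the returned directory and is responsible for removing it, for example with `shutil.rmtree`, once the upload has finished.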
|
| 96 |
-
|
| 97 |
-
### **Step 3: Git Upload**
|
| 98 |
-
```python
|
| 99 |
-
# Initialize git in temp directory
|
| 100 |
-
os.chdir(temp_dir)
|
| 101 |
-
subprocess.run(["git", "init"], check=True)
|
| 102 |
-
subprocess.run(["git", "remote", "add", "origin", space_url], check=True)
|
| 103 |
-
subprocess.run(["git", "add", "."], check=True)
|
| 104 |
-
subprocess.run(["git", "commit", "-m", "Initial Trackio Space setup"], check=True)
|
| 105 |
-
subprocess.run(["git", "push", "origin", "main"], check=True)
|
| 106 |
-
```
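A variant of the upload step can pass `cwd=` to each `subprocess.run` call instead of calling `os.chdir`, so the working directory of the calling process is left untouched. This is a sketch, not the script's code: `upload_with_git` and its injectable `run` parameter are hypothetical, and the extra `git branch -M main` step is an assumption added so that a branch named `main` exists even when git's default initial branch has a different name.

```python
import subprocess
from pathlib import Path
from typing import Callable


def upload_with_git(
    repo_dir: Path, remote_url: str, run: Callable = subprocess.run
) -> None:
    """Initialise a repository in repo_dir and push its contents to remote_url."""
    commands = [
        ["git", "init"],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "add", "."],
        ["git", "commit", "-m", "Initial Trackio Space setup"],
        # Assumption: rename the current branch so that "main" exists to push.
        ["git", "branch", "-M", "main"],
        ["git", "push", "origin", "main"],
    ]
    for command in commands:
        # cwd= scopes the directory change to the child process only.
        run(command, cwd=repo_dir, check=True)
```

Because `run` is a parameter, the command sequence can be checked with a stub before any real push is attempted.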
|
| 107 |
-
|
| 108 |
-
## Testing the Latest Deployment
|
| 109 |
-
|
| 110 |
-
### **Run Latest Deployment Tests**
|
| 111 |
-
```bash
|
| 112 |
-
python tests/test_latest_deployment.py
|
| 113 |
-
```
|
| 114 |
-
|
| 115 |
-
Expected output:
|
| 116 |
-
```
|
| 117 |
-
🚀 Testing Latest Trackio Space Deployment
|
| 118 |
-
=======================================================
|
| 119 |
-
🔍 Testing huggingface_hub import...
|
| 120 |
-
✅ huggingface_hub imported successfully
|
| 121 |
-
|
| 122 |
-
🔍 Testing deployment script import...
|
| 123 |
-
✅ TrackioSpaceDeployer class imported successfully
|
| 124 |
-
✅ HF API initialized
|
| 125 |
-
|
| 126 |
-
🔍 Testing API methods...
|
| 127 |
-
✅ Method exists: create_space
|
| 128 |
-
✅ Method exists: _create_space_cli
|
| 129 |
-
✅ Method exists: prepare_space_files
|
| 130 |
-
✅ Method exists: upload_files_to_space
|
| 131 |
-
✅ Method exists: test_space
|
| 132 |
-
✅ Method exists: deploy
|
| 133 |
-
|
| 134 |
-
🔍 Testing create_repo API...
|
| 135 |
-
✅ Required parameter: repo_id
|
| 136 |
-
✅ Required parameter: token
|
| 137 |
-
✅ Optional parameter: repo_type
|
| 138 |
-
✅ Optional parameter: space_sdk
|
| 139 |
-
✅ Optional parameter: space_hardware
|
| 140 |
-
✅ create_repo API signature looks correct
|
| 141 |
-
|
| 142 |
-
🔍 Testing space creation logic...
|
| 143 |
-
✅ Space URL formatted correctly
|
| 144 |
-
✅ Repo ID formatted correctly
|
| 145 |
-
|
| 146 |
-
🔍 Testing template files...
|
| 147 |
-
✅ app.py exists
|
| 148 |
-
✅ requirements.txt exists
|
| 149 |
-
✅ README.md exists
|
| 150 |
-
|
| 151 |
-
🔍 Testing temporary directory handling...
|
| 152 |
-
✅ Created temp directory: /tmp/tmp_xxxxx
|
| 153 |
-
✅ File copying works
|
| 154 |
-
✅ Cleanup successful
|
| 155 |
-
|
| 156 |
-
📊 Test Results: 7/7 tests passed
|
| 157 |
-
✅ All deployment tests passed! The latest deployment should work correctly.
|
| 158 |
-
```
|
| 159 |
-
|
| 160 |
-
## Files Updated
|
| 161 |
-
|
| 162 |
-
### **Core Deployment Files**
|
| 163 |
-
1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
|
| 164 |
-
- Added HF Hub API integration
|
| 165 |
-
- Implemented fallback mechanism
|
| 166 |
-
- Enhanced error handling
|
| 167 |
-
- Better logging and debugging
|
| 168 |
-
|
| 169 |
-
### **Dependencies**
|
| 170 |
-
2. **`requirements/requirements_core.txt`**
|
| 171 |
-
- Updated huggingface_hub to >=0.19.0
|
| 172 |
-
- Organized dependencies by category
|
| 173 |
-
- Added missing dependencies
|
| 174 |
-
|
| 175 |
-
### **Testing**
|
| 176 |
-
3. **`tests/test_latest_deployment.py`**
|
| 177 |
-
- Comprehensive API testing
|
| 178 |
-
- Import validation
|
| 179 |
-
- Method verification
|
| 180 |
-
- Template file checking
|
| 181 |
-
|
| 182 |
-
## Benefits of Latest Approach
|
| 183 |
-
|
| 184 |
-
### **1. Better Error Handling**
|
| 185 |
-
- API-first approach with CLI fallback
|
| 186 |
-
- Detailed error messages
|
| 187 |
-
- Graceful degradation
|
| 188 |
-
|
| 189 |
-
### **2. More Reliable**
|
| 190 |
-
- Uses official HF Hub API
|
| 191 |
-
- Better parameter validation
|
| 192 |
-
- Consistent behavior
|
| 193 |
-
|
| 194 |
-
### **3. Future-Proof**
|
| 195 |
-
- Follows latest HF Hub patterns
|
| 196 |
-
- Easy to update with new API features
|
| 197 |
-
- Maintains backward compatibility
|
| 198 |
-
|
| 199 |
-
### **4. Enhanced Logging**
|
| 200 |
-
- Detailed progress reporting
|
| 201 |
-
- Better debugging information
|
| 202 |
-
- Clear success/failure indicators
|
| 203 |
-
|
| 204 |
-
## Usage Instructions
|
| 205 |
-
|
| 206 |
-
### **1. Install Latest Dependencies**
|
| 207 |
-
```bash
|
| 208 |
-
pip install "huggingface_hub>=0.19.0"
|
| 209 |
-
```
|
| 210 |
-
|
| 211 |
-
### **2. Test the Deployment**
|
| 212 |
-
```bash
|
| 213 |
-
python tests/test_latest_deployment.py
|
| 214 |
-
```
|
| 215 |
-
|
| 216 |
-
### **3. Deploy Trackio Space**
|
| 217 |
-
```bash
|
| 218 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 219 |
-
```
|
| 220 |
-
|
| 221 |
-
### **4. Verify Deployment**
|
| 222 |
-
- Check the Space URL
|
| 223 |
-
- Test the interface
|
| 224 |
-
- Verify API endpoints
|
| 225 |
-
|
| 226 |
-
## Troubleshooting
|
| 227 |
-
|
| 228 |
-
### **Common Issues**
|
| 229 |
-
|
| 230 |
-
#### **1. Import Errors**
|
| 231 |
-
```
|
| 232 |
-
❌ Failed to import huggingface_hub
|
| 233 |
-
```
|
| 234 |
-
**Solution**: Install latest version
|
| 235 |
-
```bash
|
| 236 |
-
pip install "huggingface_hub>=0.19.0"
|
| 237 |
-
```
|
| 238 |
-
|
| 239 |
-
#### **2. API Errors**
|
| 240 |
-
```
|
| 241 |
-
API creation failed: 401 Client Error
|
| 242 |
-
```
|
| 243 |
-
**Solution**: Check token permissions and validity
|
| 244 |
-
|
| 245 |
-
#### **3. Git Push Errors**
|
| 246 |
-
```
|
| 247 |
-
❌ Error uploading files: git push failed
|
| 248 |
-
```
|
| 249 |
-
**Solution**: Verify git configuration and token access
|
| 250 |
-
|
| 251 |
-
### **Fallback Behavior**
|
| 252 |
-
|
| 253 |
-
The deployment script automatically falls back to CLI if:
|
| 254 |
-
- `huggingface_hub` is not available
|
| 255 |
-
- API creation fails
|
| 256 |
-
- Network issues occur
|
| 257 |
-
|
| 258 |
-
## Reference Implementation
|
| 259 |
-
|
| 260 |
-
Based on the [Hugging Face Hub repository](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), this implementation:
|
| 261 |
-
|
| 262 |
-
1. **Uses the latest API patterns**
|
| 263 |
-
2. **Follows HF Hub best practices**
|
| 264 |
-
3. **Maintains backward compatibility**
|
| 265 |
-
4. **Provides robust error handling**
|
| 266 |
-
|
| 267 |
-
The Trackio Space deployment should now work reliably with the latest Hugging Face Hub infrastructure! 🚀
|
docs/LAUNCH_SCRIPT_UPDATES.md
DELETED
|
@@ -1,174 +0,0 @@
|
|
| 1 |
-
# Launch Script Updates
|
| 2 |
-
|
| 3 |
-
This document outlines the updates made to `launch.sh` to work with the new automated Trackio deployment features.
|
| 4 |
-
|
| 5 |
-
## Key Changes Made
|
| 6 |
-
|
| 7 |
-
### ✅ **Removed Manual Username Input**
|
| 8 |
-
- **Before**: Script asked for username manually
|
| 9 |
-
- **After**: Username is automatically extracted from HF token using `whoami()`
|
| 10 |
-
- **Benefit**: Fewer manual inputs, better user experience
|
| 11 |
-
|
| 12 |
-
### ✅ **Updated Token Validation**
|
| 13 |
-
- **Before**: `validate_hf_token()` only validated token
|
| 14 |
-
- **After**: `validate_hf_token_and_get_username()` validates token AND extracts username
|
| 15 |
-
- **Benefit**: Automatic username detection from token
|
| 16 |
-
|
| 17 |
-
### ✅ **Updated Deployment Workflow**
|
| 18 |
-
- **Before**: Passed username manually to deployment script
|
| 19 |
-
- **After**: Deployment script automatically gets username from token
|
| 20 |
-
- **Benefit**: Consistent with new automated features
|
| 21 |
-
|
| 22 |
-
### ✅ **Enhanced User Feedback**
|
| 23 |
-
- **Before**: Basic status messages
|
| 24 |
-
- **After**: Clear information about automated features
|
| 25 |
-
- **Benefit**: Users understand what's happening automatically
|
| 26 |
-
|
| 27 |
-
## Updated Workflow
|
| 28 |
-
|
| 29 |
-
### **Step 1: Authentication (Simplified)**
|
| 30 |
-
```bash
|
| 31 |
-
# Before: Asked for username + token
|
| 32 |
-
get_input "Hugging Face username" "" HF_USERNAME
|
| 33 |
-
get_input "Hugging Face token" "" HF_TOKEN
|
| 34 |
-
|
| 35 |
-
# After: Only asks for token, username auto-detected
|
| 36 |
-
get_input "Hugging Face token" "" HF_TOKEN
|
| 37 |
-
# Username automatically extracted from token
|
| 38 |
-
```
|
| 39 |
-
|
| 40 |
-
### **Step 9: Trackio Space Deployment (Automated)**
|
| 41 |
-
```bash
|
| 42 |
-
# Before: Manual input file creation
|
| 43 |
-
cat > deploy_input.txt << EOF
|
| 44 |
-
$HF_USERNAME
|
| 45 |
-
$TRACKIO_SPACE_NAME
|
| 46 |
-
$HF_TOKEN
|
| 47 |
-
$GIT_EMAIL
|
| 48 |
-
$HF_USERNAME
|
| 49 |
-
EOF
|
| 50 |
-
python deploy_trackio_space.py < deploy_input.txt
|
| 51 |
-
|
| 52 |
-
# After: Direct input with automated features
|
| 53 |
-
python deploy_trackio_space.py << EOF
|
| 54 |
-
$TRACKIO_SPACE_NAME
|
| 55 |
-
$HF_TOKEN
|
| 56 |
-
$GIT_EMAIL
|
| 57 |
-
$HF_USERNAME
|
| 58 |
-
EOF
|
| 59 |
-
```
|
| 60 |
-
|
| 61 |
-
### **Step 10: Dataset Setup (Automated)**
|
| 62 |
-
```bash
|
| 63 |
-
# Before: Basic dataset setup
|
| 64 |
-
python setup_hf_dataset.py
|
| 65 |
-
|
| 66 |
-
# After: Automated dataset setup with user feedback
|
| 67 |
-
print_info "Setting up HF Dataset with automated features..."
|
| 68 |
-
print_info "Username will be auto-detected from token"
|
| 69 |
-
print_info "Dataset repository: $TRACKIO_DATASET_REPO"
|
| 70 |
-
python setup_hf_dataset.py
|
| 71 |
-
```
|
| 72 |
-
|
| 73 |
-
### **Step 11: Trackio Configuration (Automated)**
|
| 74 |
-
```bash
|
| 75 |
-
# Before: Basic configuration
|
| 76 |
-
python configure_trackio.py
|
| 77 |
-
|
| 78 |
-
# After: Automated configuration with user feedback
|
| 79 |
-
print_info "Configuring Trackio with automated features..."
|
| 80 |
-
print_info "Username will be auto-detected from token"
|
| 81 |
-
python configure_trackio.py
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
## New Function: `validate_hf_token_and_get_username()`
|
| 85 |
-
|
| 86 |
-
```bash
|
| 87 |
-
validate_hf_token_and_get_username() {
|
| 88 |
-
local token="$1"
|
| 89 |
-
if [ -z "$token" ]; then
|
| 90 |
-
return 1
|
| 91 |
-
fi
|
| 92 |
-
|
| 93 |
-
# Test the token and get username
|
| 94 |
-
export HF_TOKEN="$token"
|
| 95 |
-
if hf whoami >/dev/null 2>&1; then
|
| 96 |
-
# Get username from whoami command
|
| 97 |
-
HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
|
| 98 |
-
return 0
|
| 99 |
-
else
|
| 100 |
-
return 1
|
| 101 |
-
fi
|
| 102 |
-
}
|
| 103 |
-
```
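The same check can be sketched in Python. This is a minimal sketch under stated assumptions: it relies on the `hf whoami` command used above, follows the shell function in treating the first line of its output as the username, and passes the token through the child's environment only rather than exporting it. The function names are hypothetical.

```python
import os
import subprocess
from typing import Optional


def first_line(output: str) -> Optional[str]:
    """Equivalent of `head -n1 | tr -d '\\n'`; None when there is no output."""
    lines = output.splitlines()
    name = lines[0].strip() if lines else ""
    return name or None


def username_from_cli(token: str) -> Optional[str]:
    """Return the username reported by `hf whoami`, or None on any failure."""
    if not token:
        return None
    # The token is visible to the child process only, not exported globally.
    env = {**os.environ, "HF_TOKEN": token}
    try:
        result = subprocess.run(
            ["hf", "whoami"], env=env, capture_output=True, text=True
        )
    except FileNotFoundError:
        return None  # the hf CLI is not installed
    if result.returncode != 0:
        return None
    return first_line(result.stdout)
```

Running the command once and reusing its output also avoids the second `hf whoami` call that the shell version makes.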
|
| 104 |
-
|
| 105 |
-
## User Experience Improvements
|
| 106 |
-
|
| 107 |
-
### ✅ **Fewer Manual Inputs**
|
| 108 |
-
- Only need to provide HF token
|
| 109 |
-
- Username automatically detected
|
| 110 |
-
- Git email still required (for git operations)
|
| 111 |
-
|
| 112 |
-
### ✅ **Better Feedback**
|
| 113 |
-
- Clear messages about automated features
|
| 114 |
-
- Shows what's happening automatically
|
| 115 |
-
- Better error messages
|
| 116 |
-
|
| 117 |
-
### ✅ **Consistent Automation**
|
| 118 |
-
- All scripts now use automated features
|
| 119 |
-
- No manual username input anywhere
|
| 120 |
-
- Automatic secret setting
|
| 121 |
-
|
| 122 |
-
## Configuration Summary Updates
|
| 123 |
-
|
| 124 |
-
### **Before:**
|
| 125 |
-
```
|
| 126 |
-
📋 Configuration Summary:
|
| 127 |
-
========================
|
| 128 |
-
User: username (manually entered)
|
| 129 |
-
Experiment: experiment_name
|
| 130 |
-
...
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
### **After:**
|
| 134 |
-
```
|
| 135 |
-
📋 Configuration Summary:
|
| 136 |
-
========================
|
| 137 |
-
User: username (auto-detected from token)
|
| 138 |
-
Experiment: experiment_name
|
| 139 |
-
...
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
## Benefits
|
| 143 |
-
|
| 144 |
-
1. **Simplified Workflow**: Only need token, username auto-detected
|
| 145 |
-
2. **Consistent Automation**: All scripts use automated features
|
| 146 |
-
3. **Better User Experience**: Clear feedback about automated features
|
| 147 |
-
4. **Reduced Errors**: No manual username input means fewer typos
|
| 148 |
-
5. **Streamlined Process**: Fewer steps, more automation
|
| 149 |
-
|
| 150 |
-
## Testing
|
| 151 |
-
|
| 152 |
-
The updated launch script has been tested for:
|
| 153 |
-
- ✅ Syntax validation (`bash -n launch.sh`)
|
| 154 |
-
- ✅ Function integration with updated scripts
|
| 155 |
-
- ✅ Automated username extraction
|
| 156 |
-
- ✅ Consistent workflow with new features
|
| 157 |
-
|
| 158 |
-
## Compatibility
|
| 159 |
-
|
| 160 |
-
The updated launch script is fully compatible with:
|
| 161 |
-
- ✅ Updated `deploy_trackio_space.py` (automated features)
|
| 162 |
-
- ✅ Updated `setup_hf_dataset.py` (username extraction)
|
| 163 |
-
- ✅ Updated `configure_trackio.py` (automated configuration)
|
| 164 |
-
- ✅ Existing training and model push scripts
|
| 165 |
-
|
| 166 |
-
## Summary
|
| 167 |
-
|
| 168 |
-
The launch script now provides a seamless, automated experience that:
|
| 169 |
-
- Extracts username automatically from HF token
|
| 170 |
-
- Uses all the new automated features in the deployment scripts
|
| 171 |
-
- Provides clear feedback about automated processes
|
| 172 |
-
- Maintains compatibility with existing workflows
|
| 173 |
-
- Reduces manual input requirements
|
| 174 |
-
- Improves overall user experience
|
docs/LAUNCH_SCRIPT_USERNAME_FIX.md
DELETED
|
@@ -1,154 +0,0 @@
|
|
| 1 |
-
# Launch Script Username Parameter Fix
|
| 2 |
-
|
| 3 |
-
This document outlines the fix for removing unnecessary username parameters from the launch script deployment calls.
|
| 4 |
-
|
| 5 |
-
## 🐛 **Problem Description**
|
| 6 |
-
|
| 7 |
-
The `launch.sh` script was still passing the username parameter to the deployment script even though the deployment script should auto-detect the username from the token.
|
| 8 |
-
|
| 9 |
-
**Before:**
|
| 10 |
-
```bash
|
| 11 |
-
# Run deployment script with automated features
|
| 12 |
-
python deploy_trackio_space.py << EOF
|
| 13 |
-
$TRACKIO_SPACE_NAME
|
| 14 |
-
$HF_TOKEN
|
| 15 |
-
$GIT_EMAIL
|
| 16 |
-
$HF_USERNAME # ❌ Unnecessary - should be auto-detected
|
| 17 |
-
EOF
|
| 18 |
-
```
|
| 19 |
-
|
| 20 |
-
## ✅ **Solution Implemented**
|
| 21 |
-
|
| 22 |
-
### **Removed Unnecessary Username Parameter**
|
| 23 |
-
|
| 24 |
-
**After:**
|
| 25 |
-
```bash
|
| 26 |
-
# Run deployment script with automated features
|
| 27 |
-
python deploy_trackio_space.py << EOF
|
| 28 |
-
$TRACKIO_SPACE_NAME
|
| 29 |
-
$HF_TOKEN
|
| 30 |
-
$GIT_EMAIL
|
| 31 |
-
|
| 32 |
-
EOF
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
## 🔧 **Why This Fix Was Needed**
|
| 36 |
-
|
| 37 |
-
### **1. Deployment Script Auto-Detection**
|
| 38 |
-
The `deploy_trackio_space.py` script already has robust username auto-detection:
|
| 39 |
-
|
| 40 |
-
```python
|
| 41 |
-
def __init__(self, space_name: str, token: str, git_email: str = None, git_name: str = None):
|
| 42 |
-
# Username is auto-detected from token
|
| 43 |
-
username = get_username_from_token(token)
|
| 44 |
-
if not username:
|
| 45 |
-
username = get_username_from_cli(token)
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
### **2. Consistent Automation**
|
| 49 |
-
All deployment scripts now use the same pattern:
|
| 50 |
-
- `deploy_trackio_space.py` - Auto-detects username from token
|
| 51 |
-
- `setup_hf_dataset.py` - Auto-detects username from token
|
| 52 |
-
- `configure_trackio.py` - Auto-detects username from token
|
| 53 |
-
|
| 54 |
-
### **3. Reduced Manual Input**
|
| 55 |
-
The launch script still extracts username for its own use (defaults, display), but doesn't pass it to scripts that can auto-detect it.
|
| 56 |
-
|
| 57 |
-
## 📋 **Current Workflow**
|
| 58 |
-
|
| 59 |
-
### **Launch Script Username Usage:**
|
| 60 |
-
```bash
|
| 61 |
-
# 1. Extract username for launch script use
|
| 62 |
-
HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
|
| 63 |
-
|
| 64 |
-
# 2. Use for default values and display
|
| 65 |
-
get_input "Model repository name" "$HF_USERNAME/smollm3-finetuned-$(date +%Y%m%d)" REPO_NAME
|
| 66 |
-
get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
|
| 67 |
-
TRACKIO_URL="https://huggingface.co/spaces/$HF_USERNAME/$TRACKIO_SPACE_NAME"
|
| 68 |
-
|
| 69 |
-
# 3. Display in summary
|
| 70 |
-
echo " User: $HF_USERNAME (auto-detected from token)"
|
| 71 |
-
```
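The default values derived from the username can be sketched as a pure function. The naming patterns are taken from the shell excerpt above; the function name and the dictionary keys are illustrative, and the date is passed in explicitly so the result is reproducible.

```python
from datetime import date


def default_names(username: str, space_name: str, today: date) -> dict:
    """Build the defaults that launch.sh derives from the detected username."""
    return {
        # Mirrors $(date +%Y%m%d) in the shell version.
        "repo_name": f"{username}/smollm3-finetuned-{today:%Y%m%d}",
        "trackio_dataset_repo": f"{username}/trackio-experiments",
        "trackio_url": f"https://huggingface.co/spaces/{username}/{space_name}",
    }
```

In interactive use the caller would pass `date.today()` for the `today` argument.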
|
| 72 |
-
|
| 73 |
-
### **Deployment Script Auto-Detection:**
|
| 74 |
-
```python
|
| 75 |
-
# Each script auto-detects username from token
|
| 76 |
-
username = get_username_from_token(hf_token)
|
| 77 |
-
if not username:
|
| 78 |
-
username = get_username_from_cli(hf_token)
|
| 79 |
-
```
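The two-step lookup above generalises to an ordered list of detectors that is tried until one returns a username. This is a sketch only: `detect_username` is a hypothetical helper, and the detectors passed to it stand in for `get_username_from_token` and `get_username_from_cli`, whose implementations are not shown in this document.

```python
from typing import Callable, Iterable, Optional


def detect_username(
    token: str, detectors: Iterable[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Return the first non-empty username produced by the detectors, in order."""
    for detector in detectors:
        username = detector(token)
        if username:
            return username
    return None
```

Later detectors run only when earlier ones return nothing, so the CLI is not invoked when the API lookup succeeds.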
|
| 80 |
-
|
| 81 |
-
## 🎯 **Benefits**
|
| 82 |
-
|
| 83 |
-
### **✅ Consistent Automation**
|
| 84 |
-
- All scripts use the same username detection method
|
| 85 |
-
- No manual username input required anywhere
|
| 86 |
-
- Automatic fallback to CLI if API fails
|
| 87 |
-
|
| 88 |
-
### **✅ Reduced Complexity**
|
| 89 |
-
- Fewer parameters to pass between scripts
|
| 90 |
-
- Less chance of username mismatch errors
|
| 91 |
-
- Cleaner script interfaces
|
| 92 |
-
|
| 93 |
-
### **✅ Better User Experience**
|
| 94 |
-
- Username is auto-detected from token
|
| 95 |
-
- No manual username input required
|
| 96 |
-
- Clear feedback about auto-detection
|
| 97 |
-
|
| 98 |
-
### **✅ Future-Proof**
|
| 99 |
-
- If username detection method changes, only one place to update
|
| 100 |
-
- Consistent behavior across all scripts
|
| 101 |
-
- Easier to maintain and debug
|
| 102 |
-
|
| 103 |
-
## 🔍 **Scripts Updated**
|
| 104 |
-
|
| 105 |
-
### **1. `launch.sh`**
|
| 106 |
-
- ✅ Removed `$HF_USERNAME` parameter from deployment script call
|
| 107 |
-
- ✅ Kept username extraction for launch script use (defaults, display)
|
| 108 |
-
- ✅ Maintained all other functionality
|
| 109 |
-
|
| 110 |
-
### **2. Deployment Scripts (No Changes Needed)**
|
| 111 |
-
- ✅ `deploy_trackio_space.py` - Already auto-detects username
|
| 112 |
-
- ✅ `setup_hf_dataset.py` - Already auto-detects username
|
| 113 |
-
- ✅ `configure_trackio.py` - Already auto-detects username
|
| 114 |
-
|
| 115 |
-
## 🧪 **Testing Results**
|
| 116 |
-
|
| 117 |
-
```bash
|
| 118 |
-
# Syntax check passes
|
| 119 |
-
bash -n launch.sh
|
| 120 |
-
# ✅ No syntax errors
|
| 121 |
-
|
| 122 |
-
# All tests pass
|
| 123 |
-
python tests/test_trackio_fixes.py
|
| 124 |
-
# ✅ 7/7 tests passed
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
## 🚀 **Usage**
|
| 128 |
-
|
| 129 |
-
The fix is transparent to users. The workflow remains the same:
|
| 130 |
-
|
| 131 |
-
```bash
|
| 132 |
-
# 1. Run launch script
|
| 133 |
-
bash launch.sh
|
| 134 |
-
|
| 135 |
-
# 2. Enter token (username auto-detected)
|
| 136 |
-
Enter your Hugging Face token: hf_...
|
| 137 |
-
|
| 138 |
-
# 3. All deployment happens automatically
|
| 139 |
-
# - Username auto-detected from token
|
| 140 |
-
# - No manual username input required
|
| 141 |
-
# - Consistent behavior across all scripts
|
| 142 |
-
```
|
| 143 |
-
|
| 144 |
-
## 🎉 **Summary**
|
| 145 |
-
|
| 146 |
-
The username parameter fix ensures that:
|
| 147 |
-
|
| 148 |
-
- ✅ **No Manual Username Input**: Username is auto-detected from token
|
| 149 |
-
- ✅ **Consistent Automation**: All scripts use the same detection method
|
| 150 |
-
- ✅ **Reduced Complexity**: Fewer parameters to pass between scripts
|
| 151 |
-
- ✅ **Better User Experience**: Clear feedback about auto-detection
|
| 152 |
-
- ✅ **Future-Proof**: Easy to maintain and update
|
| 153 |
-
|
| 154 |
-
The launch script now provides a truly automated experience where the username is seamlessly extracted from the token and used consistently across all deployment scripts.
docs/MODEL_CARD_USER_INPUT_ANALYSIS.md
DELETED
|
@@ -1,233 +0,0 @@
|
|
| 1 |
-
# Model Card User Input Analysis
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document analyzes the interaction between the model card template (`templates/model_card.md`), the model card generator (`scripts/model_tonic/generate_model_card.py`), and the launch script (`launch.sh`) to identify variables that require user input and improve the user experience.
|
| 6 |
-
|
| 7 |
-
## Template Variables Analysis
|
| 8 |
-
|
| 9 |
-
### Variables in `templates/model_card.md`
|
| 10 |
-
|
| 11 |
-
The model card template uses the following variables that can be populated with user input:
|
| 12 |
-
|
| 13 |
-
#### Core Model Information
|
| 14 |
-
- `{{model_name}}` - Display name of the model
|
| 15 |
-
- `{{model_description}}` - Brief description of the model
|
| 16 |
-
- `{{repo_name}}` - Hugging Face repository name
|
| 17 |
-
- `{{base_model}}` - Base model used for fine-tuning
|
| 18 |
-
|
| 19 |
-
#### Training Configuration
|
| 20 |
-
- `{{training_config_type}}` - Type of training configuration used
|
| 21 |
-
- `{{trainer_type}}` - Type of trainer (SFT, DPO, etc.)
|
| 22 |
-
- `{{batch_size}}` - Training batch size
|
| 23 |
-
- `{{gradient_accumulation_steps}}` - Gradient accumulation steps
|
| 24 |
-
- `{{learning_rate}}` - Learning rate used
|
| 25 |
-
- `{{max_epochs}}` - Maximum number of epochs
|
| 26 |
-
- `{{max_seq_length}}` - Maximum sequence length
|
| 27 |
-
|
| 28 |
-
#### Dataset Information
|
| 29 |
-
- `{{dataset_name}}` - Name of the dataset used
|
| 30 |
-
- `{{dataset_size}}` - Size of the dataset
|
| 31 |
-
- `{{dataset_format}}` - Format of the dataset
|
| 32 |
-
- `{{dataset_sample_size}}` - Sample size (for lightweight configs)
|
| 33 |
-
|
| 34 |
-
#### Training Results
|
| 35 |
-
- `{{training_loss}}` - Final training loss
|
| 36 |
-
- `{{validation_loss}}` - Final validation loss
|
| 37 |
-
- `{{perplexity}}` - Model perplexity
|
| 38 |
-
|
| 39 |
-
#### Infrastructure
|
| 40 |
-
- `{{hardware_info}}` - Hardware used for training
|
| 41 |
-
- `{{experiment_name}}` - Name of the experiment
|
| 42 |
-
- `{{trackio_url}}` - Trackio monitoring URL
|
| 43 |
-
- `{{dataset_repo}}` - HF Dataset repository
|
| 44 |
-
|
| 45 |
-
#### Author Information
|
| 46 |
-
- `{{author_name}}` - Author name for citations and attribution
|
| 47 |
-
- `{{model_name_slug}}` - URL-friendly model name
|
| 48 |
-
|
| 49 |
-
#### Quantization
|
| 50 |
-
- `{{quantized_models}}` - Boolean indicating if quantized models exist
|
| 51 |
-
|
| 52 |
-
## User Input Requirements
|
| 53 |
-
|
| 54 |
-
### Previously Missing User Inputs
|
| 55 |
-
|
| 56 |
-
#### 1. **Author Name** (`author_name`)
|
| 57 |
-
- **Purpose**: Used in model card metadata and citations
|
| 58 |
-
- **Template Usage**: `{{#if author_name}}author: {{author_name}}{{/if}}`
|
| 59 |
-
- **Citation Usage**: `author={{{author_name}}}`
|
| 60 |
-
- **Default**: "Your Name"
|
| 61 |
-
- **User Input Added**: ✅ **IMPLEMENTED**
|
| 62 |
-
|
| 63 |
-
#### 2. **Model Description** (`model_description`)
|
| 64 |
-
- **Purpose**: Brief description of the model's capabilities
|
| 65 |
-
- **Template Usage**: `{{model_description}}`
|
| 66 |
-
- **Default**: "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities."
|
| 67 |
-
- **User Input Added**: ✅ **IMPLEMENTED**
|
| 68 |
-
|
| 69 |
-
### Variables That Don't Need User Input
|
| 70 |
-
|
| 71 |
-
Most variables are automatically populated from:
|
| 72 |
-
- **Training Configuration**: Batch size, learning rate, epochs, etc.
|
| 73 |
-
- **System Detection**: Hardware info, model size, etc.
|
| 74 |
-
- **Auto-Generation**: Repository names, experiment names, etc.
|
| 75 |
-
- **Training Results**: Loss values, perplexity, etc.
|
| 76 |
-
|
| 77 |
-
## Implementation Changes
|
| 78 |
-
|
| 79 |
-
### 1. Launch Script Updates (`launch.sh`)
|
| 80 |
-
|
| 81 |
-
#### Added User Input Prompts
|
| 82 |
-
```bash
|
| 83 |
-
# Step 8.2: Author Information for Model Card
|
| 84 |
-
print_step "Step 8.2: Author Information"
|
| 85 |
-
echo "================================="
|
| 86 |
-
|
| 87 |
-
print_info "This information will be used in the model card and citation."
|
| 88 |
-
get_input "Author name for model card" "$HF_USERNAME" AUTHOR_NAME
|
| 89 |
-
|
| 90 |
-
print_info "Model description will be used in the model card and repository."
|
| 91 |
-
get_input "Model description" "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities." MODEL_DESCRIPTION
|
| 92 |
-
```
|
| 93 |
-
|
| 94 |
-
#### Updated Configuration Summary
|
| 95 |
-
```bash
|
| 96 |
-
echo " Author: $AUTHOR_NAME"
|
| 97 |
-
```
|
| 98 |
-
|
| 99 |
-
#### Updated Model Push Call
|
| 100 |
-
```bash
|
| 101 |
-
python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
|
| 102 |
-
--token "$HF_TOKEN" \
|
| 103 |
-
--trackio-url "$TRACKIO_URL" \
|
| 104 |
-
--experiment-name "$EXPERIMENT_NAME" \
|
| 105 |
-
--dataset-repo "$TRACKIO_DATASET_REPO" \
|
| 106 |
-
--author-name "$AUTHOR_NAME" \
|
| 107 |
-
--model-description "$MODEL_DESCRIPTION"
|
| 108 |
-
```
|
| 109 |
-
|
| 110 |
-
### 2. Push Script Updates (`scripts/model_tonic/push_to_huggingface.py`)
|
| 111 |
-
|
| 112 |
-
#### Added Command Line Arguments
|
| 113 |
-
```python
|
| 114 |
-
parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
|
| 115 |
-
parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
|
| 116 |
-
```
|
| 117 |
-
|
| 118 |
-
#### Updated Class Constructor
|
| 119 |
-
```python
|
| 120 |
-
def __init__(
|
| 121 |
-
self,
|
| 122 |
-
model_path: str,
|
| 123 |
-
repo_name: str,
|
| 124 |
-
token: Optional[str] = None,
|
| 125 |
-
private: bool = False,
|
| 126 |
-
trackio_url: Optional[str] = None,
|
| 127 |
-
experiment_name: Optional[str] = None,
|
| 128 |
-
dataset_repo: Optional[str] = None,
|
| 129 |
-
hf_token: Optional[str] = None,
|
| 130 |
-
author_name: Optional[str] = None,
|
| 131 |
-
model_description: Optional[str] = None
|
| 132 |
-
):
|
| 133 |
-
```
|
| 134 |
-
|
| 135 |
-
#### Updated Model Card Generation
|
| 136 |
-
```python
|
| 137 |
-
variables = {
|
| 138 |
-
"model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
|
| 139 |
-
"model_description": self.model_description or "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
|
| 140 |
-
# ... other variables
|
| 141 |
-
"author_name": self.author_name or training_config.get('author_name', 'Your Name'),
|
| 142 |
-
}
|
| 143 |
-
```
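To make the default-and-substitute flow concrete, here is a minimal sketch. It assumes plain `{{name}}` placeholders only and does not handle `{{#if ...}}` blocks. `build_variables` and `render` are hypothetical helper names, not functions from `generate_model_card.py`.

```python
import re
from typing import Mapping, Optional

DEFAULT_DESCRIPTION = (
    "A fine-tuned version of SmolLM3-3B for improved text generation "
    "and conversation capabilities."
)


def build_variables(repo_name: str, author_name: Optional[str] = None,
                    model_description: Optional[str] = None) -> dict:
    """Fall back to the documented defaults when the user supplied nothing."""
    return {
        "model_name": f"{repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
        "model_description": model_description or DEFAULT_DESCRIPTION,
        "author_name": author_name or "Your Name",
    }


def render(template: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} placeholder; unknown names are left untouched."""
    def substitute(match):
        key = match.group(1)
        return str(variables.get(key, match.group(0)))

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```

With this sketch, the citation form `author={{{author_name}}}` renders as `author={Tonic}` when `author_name` is `"Tonic"`, because only the inner pair of braces is treated as a placeholder.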
|
| 144 |
-
|
| 145 |
-
## User Experience Improvements
|
| 146 |
-
|
| 147 |
-
### 1. **Interactive Prompts**
|
| 148 |
-
- Users are now prompted for author name and model description
|
| 149 |
-
- Default values are provided for convenience
|
| 150 |
-
- Clear explanations of what each field is used for
|
| 151 |
-
|
| 152 |
-
### 2. **Configuration Summary**
|
| 153 |
-
- Author name is now displayed in the configuration summary
|
| 154 |
-
- Users can review all settings before proceeding
|
| 155 |
-
|
| 156 |
-
### 3. **Automatic Integration**
|
| 157 |
-
- User inputs are automatically passed to the model card generation
|
| 158 |
-
- No manual editing of scripts required
|
| 159 |
-
|
| 160 |
-
## Template Variable Categories
|
| 161 |
-
|
| 162 |
-
### Automatic Variables (No User Input Needed)
|
| 163 |
-
- `repo_name` - Auto-generated from username and date
|
| 164 |
-
- `base_model` - Always "HuggingFaceTB/SmolLM3-3B"
|
| 165 |
-
- `training_config_type` - From user selection
|
| 166 |
-
- `trainer_type` - From user selection
|
| 167 |
-
- `batch_size`, `learning_rate`, `max_epochs` - From training config
|
| 168 |
-
- `hardware_info` - Auto-detected
|
| 169 |
-
- `experiment_name` - Auto-generated with timestamp
|
| 170 |
-
- `trackio_url` - Auto-generated from space name
|
| 171 |
-
- `dataset_repo` - Auto-generated
|
| 172 |
-
- `training_loss`, `validation_loss`, `perplexity` - From training results
|
| 173 |
-
|
| 174 |
-
### User Input Variables (Now Implemented)
|
| 175 |
-
- `author_name` - ✅ **Added user prompt**
|
| 176 |
-
- `model_description` - ✅ **Added user prompt**
|
| 177 |
-
|
| 178 |
-
### Conditional Variables
|
| 179 |
-
- `quantized_models` - Set automatically based on quantization choices
|
| 180 |
-
- `dataset_sample_size` - Set based on training configuration type
|
| 181 |
-
|
| 182 |
-
## Benefits of These Changes
|
| 183 |
-
|
| 184 |
-
### 1. **Better Attribution**
|
| 185 |
-
- Author names are properly captured and used in citations
|
| 186 |
-
- Model cards include proper attribution
|
| 187 |
-
|
| 188 |
-
### 2. **Customizable Descriptions**
|
| 189 |
-
- Users can provide custom model descriptions
|
| 190 |
-
- Better model documentation and discoverability
|
| 191 |
-
|
| 192 |
-
### 3. **Improved User Experience**
|
| 193 |
-
- No need to manually edit scripts
|
| 194 |
-
- Interactive prompts with helpful defaults
|
| 195 |
-
- Clear feedback on what information is being collected
|
| 196 |
-
|
| 197 |
-
### 4. **Consistent Documentation**
|
| 198 |
-
- All model cards will have proper author information
|
| 199 |
-
- Standardized model descriptions
|
| 200 |
-
- Better integration with Hugging Face Hub
|
| 201 |
-
|
| 202 |
-
## Future Enhancements
|
| 203 |
-
|
| 204 |
-
### Potential Additional User Inputs
|
| 205 |
-
1. **License Selection** - Allow users to choose model license
|
| 206 |
-
2. **Model Tags** - Custom tags for better discoverability
|
| 207 |
-
3. **Usage Examples** - Custom usage examples for specific use cases
|
| 208 |
-
4. **Limitations Description** - Custom limitations based on training data
|
| 209 |
-
|
| 210 |
-
### Template Improvements
|
| 211 |
-
1. **Dynamic License** - Support for different license types
|
| 212 |
-
2. **Custom Tags** - User-defined model tags
|
| 213 |
-
3. **Usage Scenarios** - Template sections for different use cases
|
| 214 |
-
|
| 215 |
-
## Testing
|
| 216 |
-
|
| 217 |
-
The changes have been tested to ensure:
|
| 218 |
-
- ✅ Author name is properly passed to model card generation
|
| 219 |
-
- ✅ Model description is properly passed to model card generation
|
| 220 |
-
- ✅ Default values work correctly
|
| 221 |
-
- ✅ Configuration summary displays new fields
|
| 222 |
-
- ✅ Model push script accepts new parameters
|
| 223 |
-
|
| 224 |
-
## Conclusion
|
| 225 |
-
|
| 226 |
-
The analysis identified that the model card template had two key variables (`author_name` and `model_description`) that would benefit from user input. These have been successfully implemented with:
|
| 227 |
-
|
| 228 |
-
1. **Interactive prompts** in the launch script
|
| 229 |
-
2. **Command line arguments** in the push script
|
| 230 |
-
3. **Proper integration** with the model card generator
|
| 231 |
-
4. **User-friendly defaults** and clear explanations
|
| 232 |
-
|
| 233 |
-
This improves the overall user experience and ensures that model cards have proper attribution and descriptions.
docs/MODEL_RECOVERY_GUIDE.md
DELETED
|
@@ -1,228 +0,0 @@
|
|
| 1 |
-
# Model Recovery and Deployment Guide
|
| 2 |
-
|
| 3 |
-
This guide will help you recover your trained model from the cloud instance and deploy it to Hugging Face Hub with quantization.
|
| 4 |
-
|
| 5 |
-
## Prerequisites
|
| 6 |
-
|
| 7 |
-
1. **Hugging Face Token**: You need a Hugging Face token with write permissions
|
| 8 |
-
2. **Cloud Instance Access**: SSH access to your cloud instance
|
| 9 |
-
3. **Model Files**: Your trained model should be in `/output-checkpoint/` on the cloud instance
|
| 10 |
-
|
| 11 |
-
## Step 1: Connect to Your Cloud Instance
|
| 12 |
-
|
| 13 |
-
```bash
|
| 14 |
-
ssh root@your-cloud-instance-ip
|
| 15 |
-
cd ~/smollm3_finetune
|
| 16 |
-
```
|
| 17 |
-
|
| 18 |
-
## Step 2: Set Your Hugging Face Token
|
| 19 |
-
|
| 20 |
-
```bash
|
| 21 |
-
export HF_TOKEN=your_huggingface_token_here
|
| 22 |
-
```
|
| 23 |
-
|
| 24 |
-
Replace `your_huggingface_token_here` with your actual Hugging Face token.
|
| 25 |
-
|
| 26 |
-
## Step 3: Verify Model Files
|
| 27 |
-
|
| 28 |
-
Check that your model files exist:
|
| 29 |
-
|
| 30 |
-
```bash
|
| 31 |
-
ls -la /output-checkpoint/
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
You should see files like:
|
| 35 |
-
- `config.json`
|
| 36 |
-
- `model.safetensors.index.json`
|
| 37 |
-
- `model-00001-of-00002.safetensors`
|
| 38 |
-
- `model-00002-of-00002.safetensors`
|
| 39 |
-
- `tokenizer.json`
|
| 40 |
-
- `tokenizer_config.json`
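The same check can be scripted instead of read off the `ls` output. The sketch below is illustrative only: `missing_model_files` is a hypothetical helper, and the expected file list is taken from the example listing above, so a single-file checkpoint without an index or shards would need a different list.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the example listing; adjust for your checkpoint layout.
EXPECTED_FILES = (
    "config.json",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_model_files(model_dir: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in expected if not (root / name).is_file()]
    # Sharded weights are matched by pattern because the shard count varies.
    if not any(root.glob("model-*.safetensors")):
        missing.append("model-*.safetensors")
    return missing
```

An empty result from `missing_model_files("/output-checkpoint")` means every file in the list is present.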
|
| 41 |
-
|
| 42 |
-
## Step 4: Update Configuration
|
| 43 |
-
|
| 44 |
-
Edit the deployment script to use your Hugging Face username:
|
| 45 |
-
|
| 46 |
-
```bash
|
| 47 |
-
nano cloud_deploy.py
|
| 48 |
-
```
|
| 49 |
-
|
| 50 |
-
Change this line:
|
| 51 |
-
```python
|
| 52 |
-
REPO_NAME = "your-username/smollm3-finetuned" # Change to your HF username and desired repo name
|
| 53 |
-
```
|
| 54 |
-
|
| 55 |
-
To your actual username, for example:
|
| 56 |
-
```python
|
| 57 |
-
REPO_NAME = "tonic/smollm3-finetuned"
|
| 58 |
-
```
|
| 59 |
-
|
| 60 |
-
## Step 5: Run the Deployment
|
| 61 |
-
|
| 62 |
-
Execute the deployment script:
|
| 63 |
-
|
| 64 |
-
```bash
|
| 65 |
-
python3 cloud_deploy.py
|
| 66 |
-
```
|
| 67 |
-
|
| 68 |
-
This will:
|
| 69 |
-
1. ✅ Validate your model files
|
| 70 |
-
2. ✅ Install required dependencies (torchao, huggingface_hub)
|
| 71 |
-
3. ✅ Push the main model to Hugging Face Hub
|
| 72 |
-
4. ✅ Create quantized versions (int8 and int4)
|
| 73 |
-
5. ✅ Push quantized models to subdirectories
|
| 74 |
-
|
| 75 |
-
## Step 6: Verify Deployment
|
| 76 |
-
|
| 77 |
-
After successful deployment, you can verify:
|
| 78 |
-
|
| 79 |
-
1. **Main Model**: https://huggingface.co/your-username/smollm3-finetuned
|
| 80 |
-
2. **int8 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int8
|
| 81 |
-
3. **int4 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int4
|
| 82 |
-
|
| 83 |
-
## Alternative: Manual Deployment
|
| 84 |
-
|
| 85 |
-
If you prefer to run the steps manually:
|
| 86 |
-
|
| 87 |
-
### 1. Push Main Model Only
|
| 88 |
-
|
| 89 |
-
```bash
|
| 90 |
-
python3 scripts/model_tonic/push_to_huggingface.py \
|
| 91 |
-
/output-checkpoint/ \
|
| 92 |
-
your-username/smollm3-finetuned \
|
| 93 |
-
--hf-token $HF_TOKEN \
|
| 94 |
-
--author-name "Your Name" \
|
| 95 |
-
--model-description "A fine-tuned SmolLM3 model for improved text generation"
|
| 96 |
-
```
|
| 97 |
-
|
| 98 |
-
### 2. Quantize and Push (Optional)
|
| 99 |
-
|
| 100 |
-
```bash
|
| 101 |
-
# int8 quantization (GPU optimized)
|
| 102 |
-
python3 scripts/model_tonic/quantize_model.py \
|
| 103 |
-
/output-checkpoint/ \
|
| 104 |
-
your-username/smollm3-finetuned \
|
| 105 |
-
--quant-type int8_weight_only \
|
| 106 |
-
--hf-token $HF_TOKEN
|
| 107 |
-
|
| 108 |
-
# int4 quantization (CPU optimized)
|
| 109 |
-
python3 scripts/model_tonic/quantize_model.py \
|
| 110 |
-
/output-checkpoint/ \
|
| 111 |
-
your-username/smollm3-finetuned \
|
| 112 |
-
--quant-type int4_weight_only \
|
| 113 |
-
--hf-token $HF_TOKEN
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
## Troubleshooting
|
| 117 |
-
|
| 118 |
-
### Common Issues
|
| 119 |
-
|
| 120 |
-
1. **HF_TOKEN not set**
|
| 121 |
-
```bash
|
| 122 |
-
export HF_TOKEN=your_token_here
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
2. **Model files not found**
|
| 126 |
-
```bash
|
| 127 |
-
ls -la /output-checkpoint/
|
| 128 |
-
```
|
| 129 |
-
Make sure the training completed successfully.
|
| 130 |
-
|
| 131 |
-
3. **Dependencies missing**
|
| 132 |
-
```bash
|
| 133 |
-
pip install torchao huggingface_hub
|
| 134 |
-
```
|
| 135 |
-
|
| 136 |
-
4. **Permission denied**
|
| 137 |
-
```bash
|
| 138 |
-
chmod +x cloud_deploy.py
|
| 139 |
-
chmod +x recover_model.py
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
### Error Messages
|
| 143 |
-
|
| 144 |
-
- **"Missing required model files"**: Check that your model training completed successfully
|
| 145 |
-
- **"Repository creation failed"**: Verify your HF token has write permissions
|
| 146 |
-
- **"Quantization failed"**: Check GPU memory availability or try CPU quantization
|
| 147 |
-
|
| 148 |
-
## Model Usage
|
| 149 |
-
|
| 150 |
-
Once deployed, you can use your model:
|
| 151 |
-
|
| 152 |
-
```python
|
| 153 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 154 |
-
|
| 155 |
-
# Main model
|
| 156 |
-
model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned")
|
| 157 |
-
tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned")
|
| 158 |
-
|
| 159 |
-
# int8 quantized (GPU optimized)
|
| 160 |
-
model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
|
| 161 |
-
tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
|
| 162 |
-
|
| 163 |
-
# int4 quantized (CPU optimized)
|
| 164 |
-
model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
|
| 165 |
-
tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
|
| 166 |
-
|
| 167 |
-
# Generate text
|
| 168 |
-
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
|
| 169 |
-
outputs = model.generate(**inputs, max_new_tokens=100)
|
| 170 |
-
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 171 |
-
```
|
| 172 |
-
|
| 173 |
-
## File Structure
|
| 174 |
-
|
| 175 |
-
After deployment, your repository will have:
|
| 176 |
-
|
| 177 |
-
```
|
| 178 |
-
your-username/smollm3-finetuned/
|
| 179 |
-
├── README.md (model card)
|
| 180 |
-
├── config.json
|
| 181 |
-
├── model.safetensors.index.json
|
| 182 |
-
├── model-00001-of-00002.safetensors
|
| 183 |
-
├── model-00002-of-00002.safetensors
|
| 184 |
-
├── tokenizer.json
|
| 185 |
-
├── tokenizer_config.json
|
| 186 |
-
├── int8/ (quantized model for GPU)
|
| 187 |
-
│ ├── README.md
|
| 188 |
-
│ ├── config.json
|
| 189 |
-
│ └── pytorch_model.bin
|
| 190 |
-
└── int4/ (quantized model for CPU)
|
| 191 |
-
├── README.md
|
| 192 |
-
├── config.json
|
| 193 |
-
└── pytorch_model.bin
|
| 194 |
-
```
|
| 195 |
-
|
| 196 |
-
## Success Indicators
|
| 197 |
-
|
| 198 |
-
✅ **Successful deployment shows:**
|
| 199 |
-
- "Model recovery and deployment completed successfully!"
|
| 200 |
-
- "View your model at: https://huggingface.co/your-username/smollm3-finetuned"
|
| 201 |
-
- No error messages in the output
|
| 202 |
-
|
| 203 |
-
❌ **Failed deployment shows:**
|
| 204 |
-
- Error messages about missing files or permissions
|
| 205 |
-
- "Model recovery and deployment failed!"
|
| 206 |
-
|
| 207 |
-
## Next Steps
|
| 208 |
-
|
| 209 |
-
After successful deployment:
|
| 210 |
-
|
| 211 |
-
1. **Test your model** on Hugging Face Hub
|
| 212 |
-
2. **Share your model** with the community
|
| 213 |
-
3. **Monitor usage** through Hugging Face analytics
|
| 214 |
-
4. **Consider fine-tuning** further based on feedback
|
| 215 |
-
|
| 216 |
-
## Support
|
| 217 |
-
|
| 218 |
-
If you encounter issues:
|
| 219 |
-
|
| 220 |
-
1. Check the error messages carefully
|
| 221 |
-
2. Verify your HF token permissions
|
| 222 |
-
3. Ensure all model files are present
|
| 223 |
-
4. Try running individual steps manually
|
| 224 |
-
5. Check the logs for detailed error information
|
| 225 |
-
|
| 226 |
-
---
|
| 227 |
-
|
| 228 |
-
**Happy deploying! 🚀**
docs/MONITORING_IMPROVEMENTS_SUMMARY.md
DELETED
|
@@ -1,191 +0,0 @@
|
|
| 1 |
-
# 🚀 Monitoring Improvements Summary
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The monitoring system has been significantly enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
|
| 6 |
-
|
| 7 |
-
## ✅ Key Improvements Made
|
| 8 |
-
|
| 9 |
-
### 1. **Enhanced `monitoring.py`**
|
| 10 |
-
- ✅ **HF Datasets Integration**: Added support for saving experiments to HF Datasets repositories
|
| 11 |
-
- ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
|
| 12 |
-
- ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
|
| 13 |
-
- ✅ **Dual Storage**: Experiments saved to both Trackio and HF Datasets
|
| 14 |
-
- ✅ **Periodic Saving**: Metrics saved to HF Dataset every 10 steps
|
| 15 |
-
- ✅ **Error Handling**: Robust error logging and recovery
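The periodic-saving and error-handling behaviour listed above might look roughly like the sketch below. It is an assumption-laden illustration, not the code in `monitoring.py`: `PeriodicMetricsSaver` is a hypothetical class, and `save_fn` stands in for whatever function actually uploads records to the HF Dataset.

```python
from typing import Callable, Dict, List


class PeriodicMetricsSaver:
    """Buffer metrics in memory and hand them to save_fn every N steps."""

    def __init__(self, save_fn: Callable[[List[Dict]], None], every_n_steps: int = 10):
        self.save_fn = save_fn  # hypothetical uploader, e.g. a push to a dataset repo
        self.every_n_steps = every_n_steps
        self.buffer: List[Dict] = []

    def log(self, step: int, metrics: Dict[str, float]) -> None:
        self.buffer.append({"step": step, **metrics})
        if step > 0 and step % self.every_n_steps == 0:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        try:
            self.save_fn(list(self.buffer))
        except Exception as exc:
            # Keep the buffer so the records are retried on the next flush.
            print(f"metrics save failed, will retry on next flush: {exc}")
            return
        self.buffer.clear()
```

Calling `flush()` once more at the end of training saves any records logged after the last multiple of `every_n_steps`.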
|
| 16 |
-
|
| 17 |
-
### 2. **Updated `train.py`**
|
| 18 |
-
- ✅ **Monitoring Integration**: Automatic monitoring setup in training scripts
|
| 19 |
-
- ✅ **Configuration Logging**: Experiment configuration logged at start
|
| 20 |
-
- ✅ **Training Callbacks**: Monitoring callbacks added to trainer
|
| 21 |
-
- ✅ **Summary Logging**: Training summaries logged at completion
|
| 22 |
-
- ✅ **Error Logging**: Errors logged to monitoring system
|
| 23 |
-
- ✅ **Cleanup**: Proper monitoring session cleanup
|
| 24 |
-
|
| 25 |
-
### 3. **Configuration Files Updated**
|
| 26 |
-
- ✅ **HF Datasets Config**: Added `hf_token` and `dataset_repo` parameters
|
| 27 |
-
- ✅ **Environment Support**: Environment variables automatically detected
|
| 28 |
-
- ✅ **Backward Compatible**: Existing configurations still work
|
| 29 |
-
|
| 30 |
-
### 4. **New Utility Scripts**
|
| 31 |
-
- ✅ **`configure_trackio.py`**: Configuration testing and setup
|
| 32 |
-
- ✅ **`integrate_monitoring.py`**: Automated integration script
|
| 33 |
-
- ✅ **`test_monitoring_integration.py`**: Comprehensive testing
|
| 34 |
-
- ✅ **`setup_hf_dataset.py`**: Dataset repository setup
|
| 35 |
-
|
| 36 |
-
### 5. **Documentation**
|
| 37 |
-
- ✅ **`MONITORING_INTEGRATION_GUIDE.md`**: Comprehensive usage guide
|
| 38 |
-
- ✅ **`ENVIRONMENT_VARIABLES.md`**: Environment variable reference
|
| 39 |
-
- ✅ **`HF_DATASETS_GUIDE.md`**: Detailed HF Datasets guide
|
| 40 |
-
|
| 41 |
-
## 🔧 Environment Variables
|
| 42 |
-
|
| 43 |
-
| Variable | Required | Default | Description |
|
| 44 |
-
|----------|----------|---------|-------------|
|
| 45 |
-
| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
|
| 46 |
-
| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
|
| 47 |
-
| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
|
| 48 |
-
| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
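As a rough sketch of how the table above translates into code, the loader below reads the four variables and applies the documented default. `MonitoringEnv` and `load_monitoring_env` are hypothetical names introduced for illustration; the real detection logic in `monitoring.py` is not shown in this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class MonitoringEnv:
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_monitoring_env(env: Mapping[str, str] = os.environ) -> MonitoringEnv:
    """Read monitoring settings, enforcing the one required variable."""
    token = env.get("HF_TOKEN")
    if not token:
        raise ValueError("HF_TOKEN is required")
    return MonitoringEnv(
        hf_token=token,
        # Default taken from the table above.
        dataset_repo=env.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments"),
        trackio_url=env.get("TRACKIO_URL"),
        trackio_token=env.get("TRACKIO_TOKEN"),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the defaults easy to check with a plain dictionary.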
|
| 49 |
-
|
| 50 |
-
## 📊 What Gets Monitored
|
| 51 |
-
|
| 52 |
-
### **Training Metrics**
|
| 53 |
-
- Loss values (training and validation)
|
| 54 |
-
- Learning rate
|
| 55 |
-
- Gradient norms
|
| 56 |
-
- Training steps and epochs
|
| 57 |
-
|
| 58 |
-
### **System Metrics**
|
| 59 |
-
- GPU memory usage
|
| 60 |
-
- GPU utilization
|
| 61 |
-
- CPU usage
|
| 62 |
-
- Memory usage
|
| 63 |
-
|
| 64 |
-
### **Experiment Data**
|
| 65 |
-
- Configuration parameters
|
| 66 |
-
- Model checkpoints
|
| 67 |
-
- Evaluation results
|
| 68 |
-
- Training summaries
|
| 69 |
-
|
| 70 |
-
### **Artifacts**
|
| 71 |
-
- Configuration files
|
| 72 |
-
- Training logs
|
| 73 |
-
- Evaluation results
|
| 74 |
-
- Model checkpoints
|
| 75 |
-
|
| 76 |
-
## 🚀 Usage Examples
|
| 77 |
-
|
| 78 |
-
### **Basic Training**
|
| 79 |
-
```bash
|
| 80 |
-
# Set environment variables
|
| 81 |
-
export HF_TOKEN=your_token_here
|
| 82 |
-
export TRACKIO_DATASET_REPO=your-username/experiments
|
| 83 |
-
|
| 84 |
-
# Run training with monitoring
|
| 85 |
-
python train.py config/train_smollm3_openhermes_fr.py
|
| 86 |
-
```
|
| 87 |
-
|
| 88 |
-
### **Advanced Configuration**
|
| 89 |
-
```bash
|
| 90 |
-
# Train with custom settings
|
| 91 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 92 |
-
--experiment_name "smollm3_french_v2" \
|
| 93 |
-
--hf_token your_token_here \
|
| 94 |
-
--dataset_repo your-username/french-experiments
|
| 95 |
-
```
|
| 96 |
-
|
| 97 |
-
### **Testing Setup**
|
| 98 |
-
```bash
|
| 99 |
-
# Test configuration
|
| 100 |
-
python configure_trackio.py
|
| 101 |
-
|
| 102 |
-
# Test monitoring integration
|
| 103 |
-
python test_monitoring_integration.py
|
| 104 |
-
|
| 105 |
-
# Test dataset access
|
| 106 |
-
python test_hf_datasets.py
|
| 107 |
-
```
|
| 108 |
-
|
| 109 |
-
## 📈 Benefits
|
| 110 |
-
|
| 111 |
-
### **For HF Spaces Deployment**
|
| 112 |
-
- ✅ **Persistent Storage**: Data survives Space restarts
|
| 113 |
-
- ✅ **No Local Storage**: No dependency on ephemeral storage
|
| 114 |
-
- ✅ **Scalable**: Works with any dataset size
|
| 115 |
-
- ✅ **Secure**: Private dataset storage
|
| 116 |
-
|
| 117 |
-
### **For Experiment Management**
|
| 118 |
-
- ✅ **Centralized**: All experiments in one place
|
| 119 |
-
- ✅ **Searchable**: Easy to find specific experiments
|
| 120 |
-
- ✅ **Versioned**: Dataset versioning for experiments
|
| 121 |
-
- ✅ **Collaborative**: Share experiments with team
|
| 122 |
-
|
| 123 |
-
### **For Development**
|
| 124 |
-
- ✅ **Flexible**: Easy to switch between datasets
|
| 125 |
-
- ✅ **Configurable**: Environment-based configuration
|
| 126 |
-
- ✅ **Robust**: Fallback mechanisms
|
| 127 |
-
- ✅ **Debuggable**: Comprehensive logging
|
| 128 |
-
|
| 129 |
-
## 🧪 Testing Results
|
| 130 |
-
|
| 131 |
-
All monitoring integration tests passed:
|
| 132 |
-
- ✅ Module Import
|
| 133 |
-
- ✅ Monitor Creation
|
| 134 |
-
- ✅ Config Creation
|
| 135 |
-
- ✅ Metrics Logging
|
| 136 |
-
- ✅ Configuration Logging
|
| 137 |
-
- ✅ System Metrics
|
| 138 |
-
- ✅ Training Summary
|
| 139 |
-
- ✅ Callback Creation
|
| 140 |
-
|
| 141 |
-
## 📋 Files Modified/Created
|
| 142 |
-
|
| 143 |
-
### **Core Files**
|
| 144 |
-
- `monitoring.py` - Enhanced with HF Datasets support
|
| 145 |
-
- `train.py` - Updated with monitoring integration
|
| 146 |
-
- `requirements_core.txt` - Added monitoring dependencies
|
| 147 |
-
- `requirements_space.txt` - Updated for HF Spaces
|
| 148 |
-
|
| 149 |
-
### **Configuration Files**
|
| 150 |
-
- `config/train_smollm3.py` - Added HF Datasets config
|
| 151 |
-
- `config/train_smollm3_openhermes_fr.py` - Added HF Datasets config
|
| 152 |
-
- `config/train_smollm3_openhermes_fr_a100_balanced.py` - Added HF Datasets config
|
| 153 |
-
- `config/train_smollm3_openhermes_fr_a100_large.py` - Added HF Datasets config
|
| 154 |
-
- `config/train_smollm3_openhermes_fr_a100_max_performance.py` - Added HF Datasets config
|
| 155 |
-
- `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` - Added HF Datasets config
|
| 156 |
-
|
| 157 |
-
### **New Utility Scripts**
|
| 158 |
-
- `configure_trackio.py` - Configuration testing
|
| 159 |
-
- `integrate_monitoring.py` - Automated integration
|
| 160 |
-
- `test_monitoring_integration.py` - Comprehensive testing
|
| 161 |
-
- `setup_hf_dataset.py` - Dataset setup
|
| 162 |
-
|
| 163 |
-
### **Documentation**
|
| 164 |
-
- `MONITORING_INTEGRATION_GUIDE.md` - Usage guide
|
| 165 |
-
- `ENVIRONMENT_VARIABLES.md` - Environment reference
|
| 166 |
-
- `HF_DATASETS_GUIDE.md` - HF Datasets guide
|
| 167 |
-
- `MONITORING_IMPROVEMENTS_SUMMARY.md` - This summary
|
| 168 |
-
|
| 169 |
-
## 🎯 Next Steps
|
| 170 |
-
|
| 171 |
-
1. **Set up your HF token and dataset repository**
|
| 172 |
-
2. **Test the configuration with `python configure_trackio.py`**
|
| 173 |
-
3. **Run a training experiment to verify full functionality**
|
| 174 |
-
4. **Check your HF Dataset repository for experiment data**
|
| 175 |
-
5. **View results in your Trackio interface**
|
| 176 |
-
|
| 177 |
-
## 🔍 Troubleshooting
|
| 178 |
-
|
| 179 |
-
### **Common Issues**
|
| 180 |
-
- **HF_TOKEN not set**: Set your Hugging Face token
|
| 181 |
-
- **Dataset access failed**: Check token permissions and repository existence
|
| 182 |
-
- **Monitoring not working**: Run `python test_monitoring_integration.py` to diagnose
|
| 183 |
-
|
| 184 |
-
### **Getting Help**
|
| 185 |
-
- Check the comprehensive guides in the documentation files
|
| 186 |
-
- Run the test scripts to verify your setup
|
| 187 |
-
- Check logs for specific error messages
|
| 188 |
-
|
| 189 |
-
---
|
| 190 |
-
|
| 191 |
-
**🎉 The monitoring system is now ready for production use with persistent HF Datasets storage!**
|
docs/MONITORING_INTEGRATION_GUIDE.md
DELETED
|
@@ -1,245 +0,0 @@
|
|
| 1 |
-
# 🔧 Improved Monitoring Integration Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
|
| 6 |
-
|
| 7 |
-
## 🚀 Key Improvements
|
| 8 |
-
|
| 9 |
-
### 1. **HF Datasets Integration**
|
| 10 |
-
- ✅ **Persistent Storage**: Experiments are saved to HF Datasets repositories
|
| 11 |
-
- ✅ **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
|
| 12 |
-
- ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
|
| 13 |
-
- ✅ **Automatic Backup**: Local files as backup
|
| 14 |
-
|
| 15 |
-
### 2. **Enhanced Monitoring Features**
|
| 16 |
-
- 📊 **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
|
| 17 |
-
- 🔧 **System Metrics**: GPU memory, CPU usage, and system performance
|
| 18 |
-
- 📈 **Training Summaries**: Comprehensive experiment summaries
|
| 19 |
-
- 🛡️ **Error Handling**: Robust error logging and recovery
|
| 20 |
-
|
| 21 |
-
### 3. **Easy Integration**
|
| 22 |
-
- 🔌 **Automatic Setup**: Environment variables automatically detected
|
| 23 |
-
- 📝 **Configuration**: Simple setup with environment variables
|
| 24 |
-
- 🔄 **Backward Compatible**: Works with existing Trackio setup
|
| 25 |
-
|
| 26 |
-
## 📋 Environment Variables
|
| 27 |
-
|
| 28 |
-
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
|
| 34 |
-
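The environment variables above could be read in one place, roughly as sketched here. This is a minimal illustration, not the code in `monitoring.py`: the `MonitoringEnv` dataclass and the `load_monitoring_env` helper are hypothetical names, and only the variable names and the `tonic/trackio-experiments` default come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default taken from the environment variable table.
DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


@dataclass
class MonitoringEnv:
    """Hypothetical container for the four documented variables."""
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_monitoring_env(env: Optional[Mapping[str, str]] = None) -> MonitoringEnv:
    """Read the variables, failing early when the required token is missing."""
    env = os.environ if env is None else env
    hf_token = env.get("HF_TOKEN")
    if not hf_token:
        raise RuntimeError("HF_TOKEN is required but not set")
    return MonitoringEnv(
        hf_token=hf_token,
        dataset_repo=env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        trackio_url=env.get("TRACKIO_URL") or None,
        trackio_token=env.get("TRACKIO_TOKEN") or None,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in a test without touching the real environment.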
|
| 35 |
-
## 🛠️ Setup Instructions
|
| 36 |
-
|
| 37 |
-
### 1. **Get Your HF Token**
|
| 38 |
-
```bash
|
| 39 |
-
# Go to https://huggingface.co/settings/tokens
|
| 40 |
-
# Create a new token with "Write" permissions
|
| 41 |
-
# Copy the token
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
### 2. **Set Environment Variables**
|
| 45 |
-
```bash
|
| 46 |
-
# For HF Spaces, add these to your Space settings:
|
| 47 |
-
HF_TOKEN=your_hf_token_here
|
| 48 |
-
TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 49 |
-
|
| 50 |
-
# For local development:
|
| 51 |
-
export HF_TOKEN=your_hf_token_here
|
| 52 |
-
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 53 |
-
```
|
| 54 |
-
|
| 55 |
-
### 3. **Create Dataset Repository**
|
| 56 |
-
```bash
|
| 57 |
-
# Run the setup script
|
| 58 |
-
python setup_hf_dataset.py
|
| 59 |
-
|
| 60 |
-
# Or manually create a dataset on HF Hub
|
| 61 |
-
# Go to https://huggingface.co/datasets
|
| 62 |
-
# Create a new dataset repository
|
| 63 |
-
```
|
| 64 |
-
|
| 65 |
-
### 4. **Test Configuration**
|
| 66 |
-
```bash
|
| 67 |
-
# Test your setup
|
| 68 |
-
python configure_trackio.py
|
| 69 |
-
|
| 70 |
-
# Test dataset access
|
| 71 |
-
python test_hf_datasets.py
|
| 72 |
-
```
|
| 73 |
-
|
| 74 |
-
## 🚀 Usage Examples
|
| 75 |
-
|
| 76 |
-
### **Basic Training with Monitoring**
|
| 77 |
-
```bash
|
| 78 |
-
# Train with default monitoring
|
| 79 |
-
python train.py config/train_smollm3_openhermes_fr.py
|
| 80 |
-
|
| 81 |
-
# Train with custom dataset repository
|
| 82 |
-
TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
|
| 83 |
-
```
|
| 84 |
-
|
| 85 |
-
### **Advanced Training Configuration**
|
| 86 |
-
```bash
|
| 87 |
-
# Train with custom experiment name
|
| 88 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 89 |
-
--experiment_name "smollm3_french_tuning_v2" \
|
| 90 |
-
--hf_token your_token_here \
|
| 91 |
-
--dataset_repo your-username/french-experiments
|
| 92 |
-
```
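The flags in the command above could be parsed along the following lines. This is a sketch under assumptions: the flag names come from the example, but `build_parser`, `resolve_settings`, and the rule that a flag overrides the matching environment variable are illustrative and not confirmed by `train.py`.

```python
import argparse
from typing import Dict, List, Mapping, Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"  # documented default


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage example."""
    parser = argparse.ArgumentParser(description="Sketch of a train.py command line")
    parser.add_argument("config", help="Path to a training config file")
    parser.add_argument("--experiment_name", default=None)
    parser.add_argument("--hf_token", default=None)
    parser.add_argument("--dataset_repo", default=None)
    return parser


def resolve_settings(argv: List[str], env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Assumed precedence: command-line flag, then environment, then default."""
    args = build_parser().parse_args(argv)
    return {
        "config": args.config,
        "experiment_name": args.experiment_name,
        "hf_token": args.hf_token or env.get("HF_TOKEN"),
        "dataset_repo": (
            args.dataset_repo
            or env.get("TRACKIO_DATASET_REPO")
            or DEFAULT_DATASET_REPO
        ),
    }
```

Reading the token from the environment rather than passing `--hf_token` keeps it out of shell history, which may be preferable on shared machines.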
|
| 93 |
-
|
| 94 |
-
### **Training Scripts with Monitoring**
|
| 95 |
-
```bash
|
| 96 |
-
# All training scripts now support monitoring:
|
| 97 |
-
python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
|
| 98 |
-
python train.py config/train_smollm3_openhermes_fr_a100_large.py
|
| 99 |
-
python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
|
| 100 |
-
python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
## 📊 What Gets Monitored
|
| 104 |
-
|
| 105 |
-
### **Training Metrics**
|
| 106 |
-
- Loss values (training and validation)
|
| 107 |
-
- Learning rate
|
| 108 |
-
- Gradient norms
|
| 109 |
-
- Training steps and epochs
|
| 110 |
-
|
| 111 |
-
### **System Metrics**
|
| 112 |
-
- GPU memory usage
|
| 113 |
-
- GPU utilization
|
| 114 |
-
- CPU usage
|
| 115 |
-
- Memory usage
|
| 116 |
-
|
| 117 |
-
### **Experiment Data**
|
| 118 |
-
- Configuration parameters
|
| 119 |
-
- Model checkpoints
|
| 120 |
-
- Evaluation results
|
| 121 |
-
- Training summaries
|
| 122 |
-
|
| 123 |
-
### **Artifacts**
|
| 124 |
-
- Configuration files
|
| 125 |
-
- Training logs
|
| 126 |
-
- Evaluation results
|
| 127 |
-
- Model checkpoints
|
| 128 |
-
|
| 129 |
-
## 🔍 Viewing Results
|
| 130 |
-
|
| 131 |
-
### **1. Trackio Interface**
|
| 132 |
-
- Visit your Trackio Space
|
| 133 |
-
- Navigate to "Experiments" tab
|
| 134 |
-
- View real-time metrics and plots
|
| 135 |
-
|
| 136 |
-
### **2. HF Dataset Repository**
|
| 137 |
-
- Go to your dataset repository on HF Hub
|
| 138 |
-
- Browse experiment data
|
| 139 |
-
- Download experiment files
|
| 140 |
-
|
| 141 |
-
### **3. Local Files**
|
| 142 |
-
- Check local backup files
|
| 143 |
-
- Review training logs
|
| 144 |
-
- Examine configuration files
|
| 145 |
-
|
| 146 |
-
## 🛠️ Configuration Examples
|
| 147 |
-
|
| 148 |
-
### **Default Setup**
|
| 149 |
-
```python
|
| 150 |
-
# Uses default dataset: tonic/trackio-experiments
|
| 151 |
-
# Requires only HF_TOKEN
|
| 152 |
-
```
|
| 153 |
-
|
| 154 |
-
### **Personal Dataset**
|
| 155 |
-
```bash
|
| 156 |
-
export HF_TOKEN=your_token_here
|
| 157 |
-
export TRACKIO_DATASET_REPO=your-username/trackio-experiments
|
| 158 |
-
```
|
| 159 |
-
|
| 160 |
-
### **Team Dataset**
|
| 161 |
-
```bash
|
| 162 |
-
export HF_TOKEN=your_token_here
|
| 163 |
-
export TRACKIO_DATASET_REPO=your-org/team-experiments
|
| 164 |
-
```
|
| 165 |
-
|
| 166 |
-
### **Project-Specific Dataset**
|
| 167 |
-
```bash
|
| 168 |
-
export HF_TOKEN=your_token_here
|
| 169 |
-
export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
## 🔧 Troubleshooting
|
| 173 |
-
|
| 174 |
-
### **Issue: "HF_TOKEN not found"**
|
| 175 |
-
```bash
|
| 176 |
-
# Solution: Set your HF token
|
| 177 |
-
export HF_TOKEN=your_token_here
|
| 178 |
-
# Or add to HF Space environment variables
|
| 179 |
-
```
|
| 180 |
-
|
| 181 |
-
### **Issue: "Failed to load dataset"**
|
| 182 |
-
```bash
|
| 183 |
-
# Solutions:
|
| 184 |
-
# 1. Check token has read access
|
| 185 |
-
# 2. Verify dataset repository exists
|
| 186 |
-
# 3. Run setup script: python setup_hf_dataset.py
|
| 187 |
-
```
|
| 188 |
-
|
| 189 |
-
### **Issue: "Failed to save experiments"**
|
| 190 |
-
```bash
|
| 191 |
-
# Solutions:
|
| 192 |
-
# 1. Check token has write permissions
|
| 193 |
-
# 2. Verify dataset repository exists
|
| 194 |
-
# 3. Check network connectivity
|
| 195 |
-
```
|
| 196 |
-
|
| 197 |
-
### **Issue: "Monitoring not working"**
|
| 198 |
-
```bash
|
| 199 |
-
# Solutions:
|
| 200 |
-
# 1. Check environment variables
|
| 201 |
-
# 2. Run configuration test: python configure_trackio.py
|
| 202 |
-
# 3. Check logs for specific errors
|
| 203 |
-
```
|
| 204 |
-
|
| 205 |
-
## 📈 Benefits
|
| 206 |
-
|
| 207 |
-
### **For HF Spaces Deployment**
|
| 208 |
-
- ✅ **Persistent Storage**: Data survives Space restarts
|
| 209 |
-
- ✅ **No Local Storage**: No dependency on ephemeral storage
|
| 210 |
-
- ✅ **Scalable**: Works with any dataset size
|
| 211 |
-
- ✅ **Secure**: Private dataset storage
|
| 212 |
-
|
| 213 |
-
### **For Experiment Management**
|
| 214 |
-
- ✅ **Centralized**: All experiments in one place
|
| 215 |
-
- ✅ **Searchable**: Easy to find specific experiments
|
| 216 |
-
- ✅ **Versioned**: Dataset versioning for experiments
|
| 217 |
-
- ✅ **Collaborative**: Share experiments with team
|
| 218 |
-
|
| 219 |
-
### **For Development**
|
| 220 |
-
- ✅ **Flexible**: Easy to switch between datasets
|
| 221 |
-
- ✅ **Configurable**: Environment-based configuration
|
| 222 |
-
- ✅ **Robust**: Fallback mechanisms
|
| 223 |
-
- ✅ **Debuggable**: Comprehensive logging
|
| 224 |
-
|
| 225 |
-
## 🎯 Next Steps
|
| 226 |
-
|
| 227 |
-
1. **Set up your HF token and dataset repository**
|
| 228 |
-
2. **Test the configuration with `python configure_trackio.py`**
|
| 229 |
-
3. **Run a training experiment to verify monitoring**
|
| 230 |
-
4. **Check your HF Dataset repository for experiment data**
|
| 231 |
-
5. **View results in your Trackio interface**
|
| 232 |
-
|
| 233 |
-
## 📚 Related Files
|
| 234 |
-
|
| 235 |
-
- `monitoring.py` - Enhanced monitoring with HF Datasets support
|
| 236 |
-
- `train.py` - Updated training script with monitoring integration
|
| 237 |
-
- `configure_trackio.py` - Configuration and testing script
|
| 238 |
-
- `setup_hf_dataset.py` - Dataset repository setup
|
| 239 |
-
- `test_hf_datasets.py` - Dataset access testing
|
| 240 |
-
- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
|
| 241 |
-
- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
|
| 242 |
-
|
| 243 |
-
---
|
| 244 |
-
|
| 245 |
-
**🎉 Your experiments are now persistently stored and easily accessible!**
|
docs/MONITORING_VERIFICATION_REPORT.md
DELETED
|
@@ -1,163 +0,0 @@
|
|
| 1 |
-
# Monitoring Verification Report
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document verifies that `src/monitoring.py` is fully compatible with the actual deployed Trackio space and all monitoring components.
|
| 6 |
-
|
| 7 |
-
## ✅ **VERIFICATION STATUS: ALL TESTS PASSED**
|
| 8 |
-
|
| 9 |
-
### **Trackio Space Deployment Verification**
|
| 10 |
-
|
| 11 |
-
The actual deployed Trackio space at `https://tonic-trackio-monitoring-20250726.hf.space` provides the following API endpoints:
|
| 12 |
-
|
| 13 |
-
#### **Available API Endpoints**
|
| 14 |
-
1. ✅ `/update_trackio_config` - Update configuration
|
| 15 |
-
2. ✅ `/test_dataset_connection` - Test dataset connection
|
| 16 |
-
3. ✅ `/create_dataset_repository` - Create dataset repository
|
| 17 |
-
4. ✅ `/create_experiment_interface` - Create experiment
|
| 18 |
-
5. ✅ `/log_metrics_interface` - Log metrics
|
| 19 |
-
6. ✅ `/log_parameters_interface` - Log parameters
|
| 20 |
-
7. ✅ `/get_experiment_details` - Get experiment details
|
| 21 |
-
8. ✅ `/list_experiments_interface` - List experiments
|
| 22 |
-
9. ✅ `/create_metrics_plot` - Create metrics plot
|
| 23 |
-
10. ✅ `/create_experiment_comparison` - Compare experiments
|
| 24 |
-
11. ✅ `/simulate_training_data` - Simulate training data
|
| 25 |
-
12. ✅ `/create_demo_experiment` - Create demo experiment
|
| 26 |
-
13. ✅ `/update_experiment_status_interface` - Update status
|
| 27 |
-
|
| 28 |
-
### **Monitoring.py Compatibility Verification**
|
| 29 |
-
|
| 30 |
-
#### **✅ Dataset Structure Compatibility**
|
| 31 |
-
- **Field Structure**: All 10 fields match between monitoring.py and actual dataset
|
| 32 |
-
- `experiment_id`, `name`, `description`, `created_at`, `status`
|
| 33 |
-
- `metrics`, `parameters`, `artifacts`, `logs`, `last_updated`
|
| 34 |
-
- **Metrics Structure**: All 17 metrics fields compatible
|
| 35 |
-
- `loss`, `grad_norm`, `learning_rate`, `num_tokens`, `mean_token_accuracy`
|
| 36 |
-
- `epoch`, `total_tokens`, `throughput`, `step_time`, `batch_size`
|
| 37 |
-
- `seq_len`, `token_acc`, `gpu_memory_allocated`, `gpu_memory_reserved`
|
| 38 |
-
- `gpu_utilization`, `cpu_percent`, `memory_percent`
|
| 39 |
-
- **Parameters Structure**: All 11 parameters fields compatible
|
| 40 |
-
- `model_name`, `max_seq_length`, `batch_size`, `learning_rate`, `epochs`
|
| 41 |
-
- `dataset`, `trainer_type`, `hardware`, `mixed_precision`
|
| 42 |
-
- `gradient_checkpointing`, `flash_attention`
|
| 43 |
-
|
| 44 |
-
#### **✅ Trackio API Client Compatibility**
|
| 45 |
-
- **Available Methods**: All 7 methods working correctly
|
| 46 |
-
- `create_experiment` ✅
|
| 47 |
-
- `log_metrics` ✅
|
| 48 |
-
- `log_parameters` ✅
|
| 49 |
-
- `get_experiment_details` ✅
|
| 50 |
-
- `list_experiments` ✅
|
| 51 |
-
- `update_experiment_status` ✅
|
| 52 |
-
- `simulate_training_data` ✅
|
| 53 |
-
|
| 54 |
-
#### **✅ Monitoring Variables Verification**
|
| 55 |
-
- **Core Variables**: All 10 variables present and working
|
| 56 |
-
- `experiment_id`, `experiment_name`, `start_time`, `metrics_history`, `artifacts`
|
| 57 |
-
- `trackio_client`, `hf_dataset_client`, `dataset_repo`, `hf_token`, `enable_tracking`
|
| 58 |
-
- **Core Methods**: All 7 methods present and working
|
| 59 |
-
- `log_metrics`, `log_configuration`, `log_model_checkpoint`, `log_evaluation_results`
|
| 60 |
-
- `log_system_metrics`, `log_training_summary`, `create_monitoring_callback`
|
| 61 |
-
|
| 62 |
-
#### **✅ Integration Verification**
|
| 63 |
-
- **Monitor Creation**: ✅ Working perfectly
|
| 64 |
-
- **Attribute Verification**: ✅ All 7 expected attributes present
|
| 65 |
-
- **Dataset Repository**: ✅ Properly set and validated
|
| 66 |
-
- **Enable Tracking**: ✅ Correctly configured
|
| 67 |
-
|
| 68 |
-
### **Key Compatibility Features**
|
| 69 |
-
|
| 70 |
-
#### **1. Dataset Structure Alignment**
|
| 71 |
-
```python
# monitoring.py uses the exact structure from setup_hf_dataset.py
dataset_data = [{
    'experiment_id': self.experiment_id or f"exp_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    'name': self.experiment_name,
    'description': "SmolLM3 fine-tuning experiment",
    'created_at': self.start_time.isoformat(),
    'status': 'running',
    'metrics': json.dumps(self.metrics_history),
    'parameters': json.dumps(experiment_data),
    'artifacts': json.dumps(self.artifacts),
    'logs': json.dumps([]),
    'last_updated': datetime.now().isoformat()
}]
```
|
| 86 |
-
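The record layout above stores nested data as JSON strings, so a reader has to decode those fields again. The sketch below shows that round trip in isolation. The function names `build_experiment_record` and `decode_experiment_record` are hypothetical; only the ten field names and the use of `json.dumps` come from the excerpt.

```python
import json
from datetime import datetime

# Fields that the excerpt serializes with json.dumps.
JSON_FIELDS = ("metrics", "parameters", "artifacts", "logs")


def build_experiment_record(experiment_id, name, start_time,
                            metrics_history, parameters, artifacts):
    """Build one flat row with the ten fields listed in the excerpt."""
    return {
        "experiment_id": experiment_id,
        "name": name,
        "description": "SmolLM3 fine-tuning experiment",
        "created_at": start_time.isoformat(),
        "status": "running",
        "metrics": json.dumps(metrics_history),
        "parameters": json.dumps(parameters),
        "artifacts": json.dumps(artifacts),
        "logs": json.dumps([]),
        "last_updated": datetime.now().isoformat(),
    }


def decode_experiment_record(record):
    """Return a copy of the row with its JSON string fields parsed back."""
    decoded = dict(record)
    for field in JSON_FIELDS:
        decoded[field] = json.loads(record[field])
    return decoded
```

Keeping nested values as strings gives every row the same flat schema, at the cost of an explicit decode step whenever the metrics are read back.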
|
| 87 |
-
#### **2. Trackio Space Integration**
|
| 88 |
-
```python
# Uses only available methods from deployed space
self.trackio_client.log_metrics(experiment_id, metrics, step)
self.trackio_client.log_parameters(experiment_id, parameters)
self.trackio_client.list_experiments()
self.trackio_client.update_experiment_status(experiment_id, status)
```
|
| 95 |
-
|
| 96 |
-
#### **3. Error Handling**
|
| 97 |
-
```python
# Graceful fallback when Trackio space is unavailable
try:
    result = self.trackio_client.list_experiments()
    if result.get('error'):
        logger.warning(f"Trackio Space not accessible: {result['error']}")
        self.enable_tracking = False
        return
except Exception as e:
    logger.warning(f"Trackio Space not accessible: {e}")
    self.enable_tracking = False
```
|
| 109 |
-
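The fallback pattern above can be tried without a deployed Space by substituting small stand-in clients. This is a sketch only: `check_tracking` and the three stub classes are hypothetical and exist purely to exercise the two failure paths (an exception, and a response carrying an `error` key).

```python
import logging

logger = logging.getLogger(__name__)


class UnreachableClient:
    """Stand-in client whose call raises, as if the Space were offline."""
    def list_experiments(self):
        raise ConnectionError("space is offline")


class ErrorClient:
    """Stand-in client that answers with an error payload."""
    def list_experiments(self):
        return {"error": "401 unauthorized"}


class HealthyClient:
    """Stand-in client that answers normally."""
    def list_experiments(self):
        return {"experiments": []}


def check_tracking(client) -> bool:
    """Return True when tracking can stay enabled, False to fall back."""
    try:
        result = client.list_experiments()
    except Exception as exc:  # broad on purpose: monitoring must not stop training
        logger.warning("Trackio Space not accessible: %s", exc)
        return False
    if result.get("error"):
        logger.warning("Trackio Space not accessible: %s", result["error"])
        return False
    return True
```

Returning a boolean instead of mutating `self.enable_tracking` keeps the sketch free of class state; a caller could assign the result to that attribute.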
|
| 110 |
-
### **Verification Test Results**
|
| 111 |
-
|
| 112 |
-
```
|
| 113 |
-
🚀 Monitoring Verification Tests
|
| 114 |
-
==================================================
|
| 115 |
-
✅ Dataset structure: Compatible
|
| 116 |
-
✅ Trackio space: Compatible
|
| 117 |
-
✅ Monitoring variables: Correct
|
| 118 |
-
✅ API client: Compatible
|
| 119 |
-
✅ Integration: Working
|
| 120 |
-
✅ Structure compatibility: Verified
|
| 121 |
-
✅ Space compatibility: Verified
|
| 122 |
-
|
| 123 |
-
🎉 ALL MONITORING VERIFICATION TESTS PASSED!
|
| 124 |
-
Monitoring.py is fully compatible with all components!
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
### **Deployed Trackio Space API Endpoints**
|
| 128 |
-
|
| 129 |
-
The actual deployed space provides these endpoints that monitoring.py can use:
|
| 130 |
-
|
| 131 |
-
#### **Core Experiment Management**
|
| 132 |
-
- `POST /create_experiment_interface` - Create new experiments
|
| 133 |
-
- `POST /log_metrics_interface` - Log training metrics
|
| 134 |
-
- `POST /log_parameters_interface` - Log experiment parameters
|
| 135 |
-
- `GET /list_experiments_interface` - List all experiments
|
| 136 |
-
- `POST /update_experiment_status_interface` - Update experiment status
|
| 137 |
-
|
| 138 |
-
#### **Configuration & Setup**
|
| 139 |
-
- `POST /update_trackio_config` - Update HF token and dataset repo
|
| 140 |
-
- `POST /test_dataset_connection` - Test dataset connectivity
|
| 141 |
-
- `POST /create_dataset_repository` - Create HF dataset repository
|
| 142 |
-
|
| 143 |
-
#### **Analysis & Visualization**
|
| 144 |
-
- `POST /create_metrics_plot` - Generate metric plots
|
| 145 |
-
- `POST /create_experiment_comparison` - Compare multiple experiments
|
| 146 |
-
- `POST /get_experiment_details` - Get detailed experiment info
|
| 147 |
-
|
| 148 |
-
#### **Testing & Demo**
|
| 149 |
-
- `POST /simulate_training_data` - Generate demo training data
|
| 150 |
-
- `POST /create_demo_experiment` - Create demonstration experiments
|
| 151 |
-
|
| 152 |
-
### **Conclusion**
|
| 153 |
-
|
| 154 |
-
**✅ MONITORING.PY IS FULLY COMPATIBLE WITH THE ACTUAL DEPLOYED TRACKIO SPACE**
|
| 155 |
-
|
| 156 |
-
The monitoring system has been verified to work correctly with:
|
| 157 |
-
- ✅ All actual API endpoints from the deployed Trackio space
|
| 158 |
-
- ✅ Complete dataset structure compatibility
|
| 159 |
-
- ✅ Proper error handling and fallback mechanisms
|
| 160 |
-
- ✅ All monitoring variables and methods working correctly
|
| 161 |
-
- ✅ Seamless integration with HF Datasets and Trackio space
|
| 162 |
-
|
| 163 |
-
**The monitoring.py file is production-ready and fully compatible with the actual deployed Trackio space!** 🚀
|
docs/Model_Abstraction.md
ADDED
|
@@ -0,0 +1,36 @@
|
|
| 1 |
+
```mermaid
|
| 2 |
+
graph LR
|
| 3 |
+
EntryPoint["EntryPoint"]
|
| 4 |
+
Model_Abstraction["Model Abstraction"]
|
| 5 |
+
EntryPoint -- "initiates model loading in" --> Model_Abstraction
|
| 6 |
+
click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
|
| 7 |
+
```
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
## Details
|
| 12 |
+
|
| 13 |
+
Updated analysis to include EntryPoint component and clarify its interaction with Model Abstraction.
|
| 14 |
+
|
| 15 |
+
### EntryPoint
|
| 16 |
+
This component represents the primary execution flow of the `smollm3_finetune` application. It is responsible for initializing the application, parsing configuration, and orchestrating the high-level tasks such as initiating the model loading process and potentially the training or inference loops. It acts as the user-facing interface or the main script that kicks off the application's operations.
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
**Related Classes/Methods**:
|
| 20 |
+
|
| 21 |
+
- `smollm3_finetune.main` (1:1)
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
### Model Abstraction [[Expand]](./Model_Abstraction.md)
|
| 25 |
+
This component is responsible for encapsulating the complex logic of loading pre-trained models, defining their architectures, and managing various model variants such as quantization and LoRA adapters. It provides a unified and consistent interface for interacting with different model configurations, ensuring that the core training logic can operate seamlessly regardless of the underlying model specifics. This abstraction is crucial for maintaining modularity and flexibility within the machine learning training and fine-tuning framework.
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
**Related Classes/Methods**:
|
| 29 |
+
|
| 30 |
+
- `smollm3_finetune.model` (1:1)
|
| 31 |
+
- `smollm3_finetune.model.load_model` (1:1)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
|
docs/NO_THINK_TAG_GUIDE.md
DELETED
|
@@ -1,146 +0,0 @@
|
|
| 1 |
-
# SmolLM3 `/no_think` Tag Implementation Guide
|
| 2 |
-
|
| 3 |
-
## The Problem
|
| 4 |
-
|
| 5 |
-
You were using the `enable_thinking` parameter in the chat template configuration, which is **incorrect** for SmolLM3. The `/no_think` tag should be added as a **system message** in your training data, not as a configuration parameter.
|
| 6 |
-
|
| 7 |
-
### What was wrong:
|
| 8 |
-
|
| 9 |
-
```python
# ❌ INCORRECT - This doesn't work for SmolLM3
chat_template_kwargs={
    "enable_thinking": False,  # This parameter doesn't exist in SmolLM3
    "add_generation_prompt": True
}
```
|
| 16 |
-
|
| 17 |
-
### What's correct:
|
| 18 |
-
|
| 19 |
-
```python
# ✅ CORRECT - Add /no_think as system message
messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
]
```
|
| 27 |
-
|
| 28 |
-
## The Solution
|
| 29 |
-
|
| 30 |
-
### 1. Updated Data Processing
|
| 31 |
-
|
| 32 |
-
The `data.py` file now properly handles the `/no_think` tag by:
|
| 33 |
-
|
| 34 |
-
- Adding a system message with `/no_think` when `no_think_system_message=True`
|
| 35 |
-
- Using the correct chat template parameters
|
| 36 |
-
- Properly formatting messages for SmolLM3
|
| 37 |
-
|
| 38 |
-
### 2. Updated Configuration
|
| 39 |
-
|
| 40 |
-
All configuration files now use the correct parameter:
|
| 41 |
-
|
| 42 |
-
```python
chat_template_kwargs={
    "add_generation_prompt": True,
    "no_think_system_message": True  # Set to True to add /no_think tag
}
```
|
| 48 |
-
|
| 49 |
-
### 3. How It Works
|
| 50 |
-
|
| 51 |
-
When `no_think_system_message=True`, the system automatically adds:
|
| 52 |
-
|
| 53 |
-
```
|
| 54 |
-
{"role": "system", "content": "You are a helpful assistant. /no_think"}
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
as the first message in each conversation.
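The behaviour described above can be sketched as a small pure function. This is an illustration under assumptions, not the `format_chat_template` code in `data.py`: the name `with_no_think` is hypothetical, and the sketch assumes the incoming conversation has no system message of its own, since the guide does not say how an existing one is handled.

```python
from typing import Dict, List

# System message text quoted from the guide.
NO_THINK_SYSTEM_PROMPT = "You are a helpful assistant. /no_think"

Message = Dict[str, str]


def with_no_think(messages: List[Message], no_think_system_message: bool) -> List[Message]:
    """Prepend the /no_think system message when the option is enabled.

    The input list is never modified; a new list is returned either way.
    """
    if not no_think_system_message:
        return list(messages)
    system = {"role": "system", "content": NO_THINK_SYSTEM_PROMPT}
    return [system] + list(messages)
```

For example, calling `with_no_think([{"role": "user", "content": "What is 2+2?"}], True)` yields a two-message conversation whose first entry is the system message.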
|
| 58 |
-
|
| 59 |
-
## Testing the Fix
|
| 60 |
-
|
| 61 |
-
### 1. Run the Test Script
|
| 62 |
-
|
| 63 |
-
```bash
|
| 64 |
-
python test_no_think.py
|
| 65 |
-
```
|
| 66 |
-
|
| 67 |
-
This will show you the difference between:
|
| 68 |
-
- Messages with `/no_think` tag
|
| 69 |
-
- Messages without `/no_think` tag
|
| 70 |
-
|
| 71 |
-
### 2. Test with Training
|
| 72 |
-
|
| 73 |
-
```bash
|
| 74 |
-
python train.py config/train_smollm3_no_think_test.py --dataset_dir=test_data
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
This uses a configuration that enables the `/no_think` tag.
|
| 78 |
-
|
| 79 |
-
## Configuration Options
|
| 80 |
-
|
| 81 |
-
### Enable `/no_think` tag:
|
| 82 |
-
```python
chat_template_kwargs={
    "add_generation_prompt": True,
    "no_think_system_message": True  # Adds /no_think to system message
}
```
|
| 88 |
-
|
| 89 |
-
### Disable `/no_think` tag (allow thinking):
|
| 90 |
-
```python
chat_template_kwargs={
    "add_generation_prompt": True,
    "no_think_system_message": False  # No /no_think tag
}
```
|
| 96 |
-
|
| 97 |
-
## When to Use `/no_think`
|
| 98 |
-
|
| 99 |
-
- **Use `/no_think`** for tasks that require direct, concise responses
|
| 100 |
-
- **Don't use `/no_think`** for tasks that benefit from reasoning and step-by-step thinking
|
| 101 |
-
|
| 102 |
-
### Examples:
|
| 103 |
-
|
| 104 |
-
**With `/no_think` (direct answers):**
|
| 105 |
-
```
|
| 106 |
-
System: You are a helpful assistant. /no_think
|
| 107 |
-
User: What is 2+2?
|
| 108 |
-
Assistant: 4
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
**Without `/no_think` (reasoning allowed):**
|
| 112 |
-
```
|
| 113 |
-
System: You are a helpful assistant.
|
| 114 |
-
User: Solve this math problem step by step: 15 * 7
|
| 115 |
-
Assistant: Let me solve this step by step:
|
| 116 |
-
1. First, I'll break down 15 * 7
|
| 117 |
-
2. 15 * 7 = (10 + 5) * 7
|
| 118 |
-
3. = 10 * 7 + 5 * 7
|
| 119 |
-
4. = 70 + 35
|
| 120 |
-
5. = 105
|
| 121 |
-
The answer is 105.
|
| 122 |
-
```
|
| 123 |
-
|
| 124 |
-
## Updated Files
|
| 125 |
-
|
| 126 |
-
The following files were updated to fix the `/no_think` tag issue:
|
| 127 |
-
|
| 128 |
-
1. `data.py` - Updated `format_chat_template` function
|
| 129 |
-
2. `config/train_smollm3.py` - Updated default configuration
|
| 130 |
-
3. `config/train_smollm3_openhermes_fr.py` - Updated configuration
|
| 131 |
-
4. `config/train_smollm3_long_context.py` - Updated configuration
|
| 132 |
-
5. `config/runpod_config.py` - Updated configuration
|
| 133 |
-
6. All A100 configuration files - Updated configurations
|
| 134 |
-
|
| 135 |
-
## Verification
|
| 136 |
-
|
| 137 |
-
To verify the fix is working:
|
| 138 |
-
|
| 139 |
-
1. Check that system messages include `/no_think` when `no_think_system_message=True`
|
| 140 |
-
2. Verify that the chat template is applied correctly
|
| 141 |
-
3. Test with actual training to ensure the model learns the `/no_think` behavior
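The first verification step can be automated with a small check over formatted conversations. This is a sketch only: `missing_no_think` is a hypothetical helper, and it assumes each conversation is a list of `{"role", "content"}` dicts.

```python
def missing_no_think(conversations):
    """Return the indices of conversations whose system message lacks /no_think."""
    missing = []
    for index, messages in enumerate(conversations):
        # Join all system messages so a conversation with none yields an empty string.
        system_text = " ".join(
            m["content"] for m in messages if m.get("role") == "system"
        )
        if "/no_think" not in system_text:
            missing.append(index)
    return missing
```

An empty result means every sampled conversation carries the tag.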
|
| 142 |
-
|
| 143 |
-
## References
|
| 144 |
-
|
| 145 |
-
- [SmolLM3 Model Card](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
|
| 146 |
-
- [SmolLM3 Documentation](https://huggingface.co/docs/transformers/model_doc/smollm3)
|
docs/PIPELINE_SUMMARY.md
DELETED
|
@@ -1,330 +0,0 @@
|
|
| 1 |
-
# SmolLM3 End-to-End Pipeline - Implementation Summary
|
| 2 |
-
|
| 3 |
-
This document summarizes the comprehensive refactoring and enhancement of the SmolLM3 fine-tuning codebase to create a complete end-to-end pipeline.
|
| 4 |
-
|
| 5 |
-
## 🎯 Overview
|
| 6 |
-
|
| 7 |
-
The pipeline now provides a complete solution from Trackio Space deployment to model push, with integrated monitoring, dataset management, and automated deployment.
|
| 8 |
-
|
| 9 |
-
## 📁 Files Created/Modified
|
| 10 |
-
|
| 11 |
-
### **Core Pipeline Files**
|
| 12 |
-
|
| 13 |
-
1. **`launch.sh`** - Complete end-to-end pipeline script
|
| 14 |
-
- 16-step comprehensive pipeline
|
| 15 |
-
- Automated environment setup
|
| 16 |
-
- Integrated monitoring and deployment
|
| 17 |
-
- Dynamic configuration generation
|
| 18 |
-
|
| 19 |
-
2. **`setup_launch.py`** - User configuration helper
|
| 20 |
-
- Interactive setup for user credentials
|
| 21 |
-
- Automatic script configuration
|
| 22 |
-
- Requirements checker generation
|
| 23 |
-
|
| 24 |
-
3. **`test_pipeline.py`** - Comprehensive testing suite
|
| 25 |
-
- Import testing
|
| 26 |
-
- Component verification
|
| 27 |
-
- CUDA and HF token validation
|
| 28 |
-
|
| 29 |
-
4. **`README_END_TO_END.md`** - Complete documentation
|
| 30 |
-
- Step-by-step usage guide
|
| 31 |
-
- Troubleshooting section
|
| 32 |
-
- Advanced configuration options
|
| 33 |
-
|
| 34 |
-
### **Scripts and Utilities**
|
| 35 |
-
|
| 36 |
-
5. **`scripts/trackio_tonic/trackio_api_client.py`** - API client for Trackio
|
| 37 |
-
- Complete API client implementation
|
| 38 |
-
- Error handling and retry logic
|
| 39 |
-
- Support for both JSON and SSE responses
|
| 40 |
-
|
| 41 |
-
6. **`scripts/trackio_tonic/deploy_trackio_space.py`** - Space deployment
|
| 42 |
-
- Automated HF Space creation
|
| 43 |
-
- File upload and configuration
|
| 44 |
-
- Space testing and validation
|
| 45 |
-
|
| 46 |
-
7. **`scripts/trackio_tonic/configure_trackio.py`** - Configuration helper
|
| 47 |
-
- Environment variable setup
|
| 48 |
-
- Dataset repository configuration
|
| 49 |
-
- Usage examples and validation
|
| 50 |
-
|
| 51 |
-
8. **`scripts/model_tonic/push_to_huggingface.py`** - Model deployment
|
| 52 |
-
- Complete model upload pipeline
|
| 53 |
-
- Model card generation
|
| 54 |
-
- Training results documentation
|
| 55 |
-
|
| 56 |
-
9. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Dataset setup
|
| 57 |
-
- HF Dataset repository creation
|
| 58 |
-
- Initial experiment data structure
|
| 59 |
-
- Dataset access configuration
|
| 60 |
-
|
| 61 |
-
### **Source Code Updates**
|
| 62 |
-
|
| 63 |
-
10. **`src/monitoring.py`** - Enhanced monitoring
|
| 64 |
-
- HF Datasets integration
|
| 65 |
-
- Trackio API client integration
|
| 66 |
-
- Comprehensive metrics logging
|
| 67 |
-
|
| 68 |
-
11. **`src/train.py`** - Updated training script
|
| 69 |
-
- Monitoring integration
|
| 70 |
-
- HF Datasets support
|
| 71 |
-
- Enhanced error handling
|
| 72 |
-
|
| 73 |
-
12. **`src/config.py`** - Configuration management
|
| 74 |
-
- Dynamic config loading
|
| 75 |
-
- Multiple config type support
|
| 76 |
-
- Fallback mechanisms
|
| 77 |
-
|
| 78 |
-
13. **`src/data.py`** - Enhanced dataset handling
|
| 79 |
-
- Multiple format support
|
| 80 |
-
- Automatic conversion
|
| 81 |
-
- Bad entry filtering
|
| 82 |
-
|
| 83 |
-
14. **`src/model.py`** - Model wrapper
|
| 84 |
-
- SmolLM3-specific optimizations
|
| 85 |
-
- Flash attention support
|
| 86 |
-
- Long context handling
|
| 87 |
-
|
| 88 |
-
15. **`src/trainer.py`** - Training orchestration
|
| 89 |
-
- Monitoring callback integration
|
| 90 |
-
- Enhanced logging
|
| 91 |
-
- Checkpoint management
|
| 92 |
-
|
| 93 |
-
## 🔧 Key Improvements
|
| 94 |
-
|
| 95 |
-
### **1. Import Path Fixes**
|
| 96 |
-
- Fixed all import paths to work with the refactored structure
|
| 97 |
-
- Added proper sys.path handling for cross-module imports
|
| 98 |
-
- Ensured compatibility between different script locations
|
| 99 |
-
|
| 100 |
-
### **2. Monitoring Integration**
|
| 101 |
-
- **Trackio Space**: Real-time experiment tracking
|
| 102 |
-
- **HF Datasets**: Persistent experiment storage
|
| 103 |
-
- **System Metrics**: GPU, memory, and CPU monitoring
|
| 104 |
-
- **Training Callbacks**: Automatic metric logging
|
| 105 |
-
|
| 106 |
-
### **3. Dataset Handling**
|
| 107 |
-
- **Multi-format Support**: Prompt/completion, instruction/output, chat formats
|
| 108 |
-
- **Automatic Conversion**: Handles different dataset structures
|
| 109 |
-
- **Validation**: Ensures data quality and completeness
|
| 110 |
-
- **Splitting**: Automatic train/validation/test splits
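The multi-format support listed above could be sketched as a single normalisation step. The field names (`messages`, `prompt`/`completion`, `instruction`/`output`) are assumptions based on the format names in the list, and `to_chat_messages` is a hypothetical helper, not code from `src/data.py`.

```python
def to_chat_messages(example):
    """Convert one dataset row into a list of chat messages."""
    # Chat-format rows are already in the target shape.
    if "messages" in example:
        return list(example["messages"])
    # Pairwise formats map to a single user/assistant exchange.
    for user_key, assistant_key in (("prompt", "completion"), ("instruction", "output")):
        if user_key in example and assistant_key in example:
            return [
                {"role": "user", "content": example[user_key]},
                {"role": "assistant", "content": example[assistant_key]},
            ]
    raise ValueError(f"Unrecognised example format with keys: {sorted(example)}")
```

Raising on unknown layouts, rather than guessing, makes bad entries easy to filter out in a wrapping `try`/`except`.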
|
| 111 |
-
|
| 112 |
-
### **4. Configuration Management**
|
| 113 |
-
- **Dynamic Generation**: Creates configs based on user input
|
| 114 |
-
- **Multiple Types**: Support for different training configurations
|
| 115 |
-
- **Environment Variables**: Proper integration with environment
|
| 116 |
-
- **Validation**: Ensures configuration correctness
|
| 117 |
-
|
| 118 |
-
### **5. Deployment Automation**
|
| 119 |
-
- **Model Upload**: Complete model push to HF Hub
|
| 120 |
-
- **Model Cards**: Comprehensive documentation generation
|
| 121 |
-
- **Training Results**: Complete experiment documentation
|
| 122 |
-
- **Testing**: Automated model validation
|
| 123 |
-
|
| 124 |
-
## 🚀 Pipeline Steps
|
| 125 |
-
|
| 126 |
-
The end-to-end pipeline performs these 16 steps:
|
| 127 |
-
|
| 128 |
-
1. **Environment Setup** - System dependencies and Python environment
|
| 129 |
-
2. **PyTorch Installation** - CUDA-enabled PyTorch installation
|
| 130 |
-
3. **Dependencies** - All required Python packages
|
| 131 |
-
4. **Authentication** - HF token setup and validation
|
| 132 |
-
5. **Trackio Deployment** - HF Space creation and configuration
|
| 133 |
-
6. **Dataset Setup** - HF Dataset repository creation
|
| 134 |
-
7. **Trackio Configuration** - Environment and dataset configuration
|
| 135 |
-
8. **Training Config** - Dynamic configuration generation
|
| 136 |
-
9. **Dataset Preparation** - Download and format conversion
|
| 137 |
-
10. **Parameter Calculation** - Training steps and batch calculations
|
| 138 |
-
11. **Training Execution** - Model fine-tuning with monitoring
|
| 139 |
-
12. **Model Push** - Upload to HF Hub with documentation
|
| 140 |
-
13. **Model Testing** - Validation of uploaded model
|
| 141 |
-
14. **Summary Report** - Complete training documentation
|
| 142 |
-
15. **Resource Links** - All online resource URLs
|
| 143 |
-
16. **Next Steps** - Usage instructions and recommendations
|
| 144 |
-
|
| 145 |
-
## 📊 Monitoring Features
|
| 146 |
-
|
| 147 |
-
### **Trackio Space Interface**
|
| 148 |
-
- Real-time training metrics
|
| 149 |
-
- Experiment comparison
|
| 150 |
-
- System resource monitoring
|
| 151 |
-
- Training progress visualization
|
| 152 |
-
|
| 153 |
-
### **HF Dataset Storage**
|
| 154 |
-
- Persistent experiment data
|
| 155 |
-
- Version-controlled history
|
| 156 |
-
- Collaborative sharing
|
| 157 |
-
- Automated backup
|
| 158 |
-
|
| 159 |
-
### **Comprehensive Logging**
|
| 160 |
-
- Training metrics (loss, accuracy, etc.)
|
| 161 |
-
- System metrics (GPU, memory, CPU)
|
| 162 |
-
- Configuration parameters
|
| 163 |
-
- Training artifacts
|
| 164 |
-
|
| 165 |
-
## 🔧 Configuration Options
|
| 166 |
-
|
| 167 |
-
### **User Configuration**
|
| 168 |
-
```bash
|
| 169 |
-
# Required
|
| 170 |
-
HF_TOKEN="your_token"
|
| 171 |
-
HF_USERNAME="your_username"
|
| 172 |
-
|
| 173 |
-
# Optional
|
| 174 |
-
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
|
| 175 |
-
DATASET_NAME="HuggingFaceTB/smoltalk"
|
| 176 |
-
```
|
| 177 |
-
|
| 178 |
-
### **Training Parameters**
|
| 179 |
-
```bash
|
| 180 |
-
BATCH_SIZE=2
|
| 181 |
-
GRADIENT_ACCUMULATION_STEPS=8
|
| 182 |
-
LEARNING_RATE=5e-6
|
| 183 |
-
MAX_EPOCHS=3
|
| 184 |
-
MAX_SEQ_LENGTH=4096
|
| 185 |
-
```
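To illustrate how these values interact, the sketch below computes the effective batch size and an approximate number of optimizer steps. The function name, the `num_devices` parameter, and the rounding are assumptions; a real trainer may count steps slightly differently.

```python
import math


def training_steps(num_examples, batch_size=2, grad_accum=8, epochs=3, num_devices=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    # Examples consumed per optimizer update.
    effective_batch = batch_size * grad_accum * num_devices
    # Round up so a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs
```

With the defaults above, each optimizer update sees 2 × 8 = 16 examples per device.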
|
| 186 |
-
|
| 187 |
-
### **Monitoring Configuration**
|
| 188 |
-
```bash
|
| 189 |
-
TRACKIO_DATASET_REPO="username/trackio-experiments"
|
| 190 |
-
EXPERIMENT_NAME="smollm3_finetune_YYYYMMDD_HHMMSS"
|
| 191 |
-
```
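The `YYYYMMDD_HHMMSS` suffix can be produced with `datetime.strftime`. This is a sketch; `make_experiment_name` is a hypothetical helper and the prefix simply mirrors the example value above.

```python
from datetime import datetime


def make_experiment_name(prefix="smollm3_finetune", now=None):
    """Build an experiment name such as smollm3_finetune_20250102_030405."""
    now = now or datetime.now()  # allow injecting a fixed time for reproducibility
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"
```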
|
| 192 |
-
|
| 193 |
-
## 🛠️ Error Handling
|
| 194 |
-
|
| 195 |
-
### **Comprehensive Error Handling**
|
| 196 |
-
- Import error detection and reporting
|
| 197 |
-
- Configuration validation
|
| 198 |
-
- Network timeout handling
|
| 199 |
-
- Graceful degradation
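Network timeout handling of the kind listed above is often implemented as a retry loop with exponential backoff. The sketch below is a generic illustration under that assumption; `call_with_retry` and its parameters are hypothetical and not taken from the pipeline's source.

```python
import time


def call_with_retry(func, retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call `func()` up to `retries` times, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.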
|
| 200 |
-
|
| 201 |
-
### **Debugging Support**
|
| 202 |
-
- Detailed logging at all levels
|
| 203 |
-
- Component-specific error messages
|
| 204 |
-
- Fallback mechanisms
|
| 205 |
-
- Testing utilities
|
| 206 |
-
|
| 207 |
-
## 📈 Performance Optimizations
|
| 208 |
-
|
| 209 |
-
### **Training Optimizations**
|
| 210 |
-
- Flash Attention for efficiency
|
| 211 |
-
- Gradient checkpointing for memory
|
| 212 |
-
- Mixed precision training
|
| 213 |
-
- Optimized data loading
|
| 214 |
-
|
| 215 |
-
### **Monitoring Optimizations**
|
| 216 |
-
- Asynchronous logging
|
| 217 |
-
- Batch metric updates
|
| 218 |
-
- Efficient data storage
|
| 219 |
-
- Minimal overhead
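Batch metric updates can be sketched as a small buffer that forwards metrics in groups rather than one request per step. `MetricBuffer` and `flush_fn` are hypothetical names; how the real monitoring code batches its updates is not shown in this document.

```python
class MetricBuffer:
    """Collect metric records and hand them to `flush_fn` in batches."""

    def __init__(self, flush_fn, max_size=10):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self._pending = []

    def log(self, step, metrics):
        """Queue one record; flush automatically once the buffer is full."""
        self._pending.append({"step": step, **metrics})
        if len(self._pending) >= self.max_size:
            self.flush()

    def flush(self):
        """Send all queued records, if any, and clear the buffer."""
        if self._pending:
            batch, self._pending = self._pending, []
            self.flush_fn(batch)
```

Calling `flush()` once at the end of training sends any records left below the batch size.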
|
| 220 |
-
|
| 221 |
-
## 🔄 Integration Points
|
| 222 |
-
|
| 223 |
-
### **Hugging Face Ecosystem**
|
| 224 |
-
- **HF Hub**: Model and dataset storage
|
| 225 |
-
- **HF Spaces**: Trackio monitoring interface
|
| 226 |
-
- **HF Datasets**: Experiment data persistence
|
| 227 |
-
- **HF CLI**: Authentication and deployment
|
| 228 |
-
|
| 229 |
-
### **External Services**
|
| 230 |
-
- **Trackio**: Experiment tracking
|
| 231 |
-
- **CUDA**: GPU acceleration
|
| 232 |
-
- **PyTorch**: Deep learning framework
|
| 233 |
-
- **Transformers**: Model library
|
| 234 |
-
|
| 235 |
-
## 🎯 Usage Workflow
|
| 236 |
-
|
| 237 |
-
### **1. Setup Phase**
|
| 238 |
-
```bash
|
| 239 |
-
python setup_launch.py # Configure with user info
|
| 240 |
-
python test_pipeline.py # Verify all components
|
| 241 |
-
```
|
| 242 |
-
|
| 243 |
-
### **2. Execution Phase**
|
| 244 |
-
```bash
|
| 245 |
-
chmod +x launch.sh # Make executable
|
| 246 |
-
./launch.sh # Run complete pipeline
|
| 247 |
-
```
|
| 248 |
-
|
| 249 |
-
### **3. Monitoring Phase**
|
| 250 |
-
- Track progress in Trackio Space
|
| 251 |
-
- Monitor metrics in real-time
|
| 252 |
-
- Check logs for issues
|
| 253 |
-
- Validate results
|
| 254 |
-
|
| 255 |
-
### **4. Results Phase**
|
| 256 |
-
- Access model on HF Hub
|
| 257 |
-
- Review training summary
|
| 258 |
-
- Test model performance
|
| 259 |
-
- Share results
|
| 260 |
-
|
| 261 |
-
## 📋 Quality Assurance
|
| 262 |
-
|
| 263 |
-
### **Testing Coverage**
|
| 264 |
-
- Import testing for all modules
|
| 265 |
-
- Script availability verification
|
| 266 |
-
- Configuration validation
|
| 267 |
-
- CUDA and token testing
|
| 268 |
-
- Component integration testing
|
| 269 |
-
|
| 270 |
-
### **Documentation**
|
| 271 |
-
- Comprehensive README
|
| 272 |
-
- Step-by-step guides
|
| 273 |
-
- Troubleshooting section
|
| 274 |
-
- Advanced usage examples
|
| 275 |
-
|
| 276 |
-
### **Error Recovery**
|
| 277 |
-
- Graceful error handling
|
| 278 |
-
- Detailed error messages
|
| 279 |
-
- Recovery mechanisms
|
| 280 |
-
- Fallback options
|
| 281 |
-
|
| 282 |
-
## 🚀 Future Enhancements
|
| 283 |
-
|
| 284 |
-
### **Planned Improvements**
|
| 285 |
-
- Multi-GPU training support
|
| 286 |
-
- Distributed training
|
| 287 |
-
- Advanced hyperparameter tuning
|
| 288 |
-
- Custom dataset upload
|
| 289 |
-
- Model evaluation metrics
|
| 290 |
-
- Automated testing pipeline
|
| 291 |
-
|
| 292 |
-
### **Extensibility**
|
| 293 |
-
- Plugin architecture for custom components
|
| 294 |
-
- Configuration templates
|
| 295 |
-
- Custom monitoring backends
|
| 296 |
-
- Advanced deployment options
|
| 297 |
-
|
| 298 |
-
## 📊 Success Metrics
|
| 299 |
-
|
| 300 |
-
### **Pipeline Completeness**
|
| 301 |
-
- ✅ All 16 steps implemented
|
| 302 |
-
- ✅ Error handling at each step
|
| 303 |
-
- ✅ Monitoring integration
|
| 304 |
-
- ✅ Documentation complete
|
| 305 |
-
|
| 306 |
-
### **User Experience**
|
| 307 |
-
- ✅ Simple setup process
|
| 308 |
-
- ✅ Clear error messages
|
| 309 |
-
- ✅ Comprehensive documentation
|
| 310 |
-
- ✅ Testing utilities
|
| 311 |
-
|
| 312 |
-
### **Technical Quality**
|
| 313 |
-
- ✅ Import path fixes
|
| 314 |
-
- ✅ Configuration management
|
| 315 |
-
- ✅ Monitoring integration
|
| 316 |
-
- ✅ Deployment automation
|
| 317 |
-
|
| 318 |
-
## 🎉 Conclusion
|
| 319 |
-
|
| 320 |
-
The SmolLM3 end-to-end pipeline provides a complete solution for fine-tuning with integrated monitoring, automated deployment, and comprehensive documentation. The refactored codebase is now production-ready with proper error handling, testing, and user experience considerations.
|
| 321 |
-
|
| 322 |
-
**Key Achievements:**
|
| 323 |
-
- Complete end-to-end automation
|
| 324 |
-
- Integrated monitoring and tracking
|
| 325 |
-
- Comprehensive error handling
|
| 326 |
-
- Production-ready deployment
|
| 327 |
-
- Extensive documentation
|
| 328 |
-
- Testing and validation suite
|
| 329 |
-
|
| 330 |
-
The pipeline is now ready for users to easily fine-tune SmolLM3 models with full monitoring and deployment capabilities.
|
docs/PUSH_GUIDE.md
DELETED
|
@@ -1,406 +0,0 @@
|
|
| 1 |
-
# Push to Hugging Face Hub Guide
|
| 2 |
-
|
| 3 |
-
This guide explains how to use the `push_to_huggingface.py` script to upload your trained SmolLM3 models and results to Hugging Face Hub.
|
| 4 |
-
|
| 5 |
-
## Features
|
| 6 |
-
|
| 7 |
-
- ✅ **Automatic Repository Creation** - Creates HF repositories automatically
|
| 8 |
-
- ✅ **Model Validation** - Validates required model files before upload
|
| 9 |
-
- ✅ **Comprehensive Model Cards** - Generates detailed model documentation
|
| 10 |
-
- ✅ **Training Results Upload** - Uploads logs, configs, and results
|
| 11 |
-
- ✅ **Trackio Integration** - Logs push actions to your monitoring system
|
| 12 |
-
- ✅ **Private/Public Repositories** - Support for both private and public models
|
| 13 |
-
|
| 14 |
-
## Prerequisites
|
| 15 |
-
|
| 16 |
-
### 1. Install Dependencies
|
| 17 |
-
|
| 18 |
-
```bash
|
| 19 |
-
pip install huggingface_hub
|
| 20 |
-
```
|
| 21 |
-
|
| 22 |
-
### 2. Set Up Hugging Face Token
|
| 23 |
-
|
| 24 |
-
```bash
|
| 25 |
-
# Option 1: Environment variable
|
| 26 |
-
export HF_TOKEN="your_huggingface_token_here"
|
| 27 |
-
|
| 28 |
-
# Option 2: Use --token argument
|
| 29 |
-
python push_to_huggingface.py model_path repo_name --token "your_token"
|
| 30 |
-
```
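The two options above imply a lookup order: an explicit `--token` value first, then the `HF_TOKEN` environment variable. A sketch of that order, assuming a hypothetical helper `resolve_token` (the script's actual precedence is not shown here):

```python
import os


def resolve_token(cli_token=None, env=os.environ):
    """Return the token from the CLI argument, falling back to HF_TOKEN."""
    token = cli_token or env.get("HF_TOKEN")
    if not token:
        raise ValueError("No Hugging Face token found: set HF_TOKEN or pass --token")
    return token
```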
|
| 31 |
-
|
| 32 |
-
### 3. Get Your Hugging Face Token
|
| 33 |
-
|
| 34 |
-
1. Go to https://huggingface.co/settings/tokens
|
| 35 |
-
2. Click "New token"
|
| 36 |
-
3. Give it a name (e.g., "model-upload")
|
| 37 |
-
4. Select "Write" permissions
|
| 38 |
-
5. Copy the token
|
| 39 |
-
|
| 40 |
-
## Basic Usage
|
| 41 |
-
|
| 42 |
-
### Simple Model Push
|
| 43 |
-
|
| 44 |
-
```bash
|
| 45 |
-
python push_to_huggingface.py /path/to/model username/model-name
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
### Push with Custom Token
|
| 49 |
-
|
| 50 |
-
```bash
|
| 51 |
-
python push_to_huggingface.py /path/to/model username/model-name \
|
| 52 |
-
--token "hf_your_token_here"
|
| 53 |
-
```
|
| 54 |
-
|
| 55 |
-
### Push Private Model
|
| 56 |
-
|
| 57 |
-
```bash
|
| 58 |
-
python push_to_huggingface.py /path/to/model username/model-name \
|
| 59 |
-
--private
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
### Push with Trackio Integration
|
| 63 |
-
|
| 64 |
-
```bash
|
| 65 |
-
python push_to_huggingface.py /path/to/model username/model-name \
|
| 66 |
-
--trackio-url "https://your-space.hf.space" \
|
| 67 |
-
--experiment-name "my_experiment"
|
| 68 |
-
```
|
| 69 |
-
|
| 70 |
-
## Complete Workflow Example
|
| 71 |
-
|
| 72 |
-
### 1. Train Your Model
|
| 73 |
-
|
| 74 |
-
```bash
|
| 75 |
-
python train.py config/train_smollm3.py \
|
| 76 |
-
--dataset_dir my_dataset \
|
| 77 |
-
--enable_tracking \
|
| 78 |
-
--trackio_url "https://your-space.hf.space" \
|
| 79 |
-
--experiment_name "smollm3_finetune_v1"
|
| 80 |
-
```
|
| 81 |
-
|
| 82 |
-
### 2. Push to Hugging Face Hub
|
| 83 |
-
|
| 84 |
-
```bash
|
| 85 |
-
python push_to_huggingface.py /output-checkpoint username/smollm3-finetuned \
|
| 86 |
-
--trackio-url "https://your-space.hf.space" \
|
| 87 |
-
--experiment-name "smollm3_finetune_v1"
|
| 88 |
-
```
|
| 89 |
-
|
| 90 |
-
### 3. Use Your Model
|
| 91 |
-
|
| 92 |
-
```python
|
| 93 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 94 |
-
|
| 95 |
-
# Load your uploaded model
|
| 96 |
-
model = AutoModelForCausalLM.from_pretrained("username/smollm3-finetuned")
|
| 97 |
-
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-finetuned")
|
| 98 |
-
|
| 99 |
-
# Generate text
|
| 100 |
-
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
|
| 101 |
-
outputs = model.generate(**inputs, max_new_tokens=100)
|
| 102 |
-
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 103 |
-
```
|
| 104 |
-
|
| 105 |
-
## Repository Structure
|
| 106 |
-
|
| 107 |
-
After pushing, your repository will contain:
|
| 108 |
-
|
| 109 |
-
```
|
| 110 |
-
username/model-name/
|
| 111 |
-
├── README.md # Auto-generated model card
|
| 112 |
-
├── config.json # Model configuration
|
| 113 |
-
├── pytorch_model.bin # Model weights
|
| 114 |
-
├── tokenizer.json # Tokenizer configuration
|
| 115 |
-
├── tokenizer_config.json # Tokenizer settings
|
| 116 |
-
├── special_tokens_map.json # Special tokens
|
| 117 |
-
├── training_results/ # Training artifacts
|
| 118 |
-
│ ├── train_results.json
|
| 119 |
-
│ ├── eval_results.json
|
| 120 |
-
│ ├── training_config.json
|
| 121 |
-
│ └── training.log
|
| 122 |
-
└── .gitattributes # Git attributes
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
## Model Card Features
|
| 126 |
-
|
| 127 |
-
The script automatically generates comprehensive model cards including:
|
| 128 |
-
|
| 129 |
-
- **Model Details**: Base model, fine-tuning method, size
|
| 130 |
-
- **Training Configuration**: All training parameters
|
| 131 |
-
- **Training Results**: Loss, accuracy, steps, time
|
| 132 |
-
- **Usage Examples**: Code snippets for loading and using
|
| 133 |
-
- **Performance Metrics**: Training and validation metrics
|
| 134 |
-
- **Hardware Information**: GPU/CPU used for training
|
| 135 |
-
|
| 136 |
-
## Advanced Usage
|
| 137 |
-
|
| 138 |
-
### Custom Repository Names
|
| 139 |
-
|
| 140 |
-
```bash
|
| 141 |
-
# Public repository
|
| 142 |
-
python push_to_huggingface.py /model myusername/smollm3-chatbot
|
| 143 |
-
|
| 144 |
-
# Private repository
|
| 145 |
-
python push_to_huggingface.py /model myusername/smollm3-private --private
|
| 146 |
-
```
|
| 147 |
-
|
| 148 |
-
### Integration with Training Pipeline
|
| 149 |
-
|
| 150 |
-
```bash
|
| 151 |
-
#!/bin/bash
|
| 152 |
-
# Complete training and push workflow
|
| 153 |
-
|
| 154 |
-
# 1. Train the model
|
| 155 |
-
python train.py config/train_smollm3.py \
|
| 156 |
-
--dataset_dir my_dataset \
|
| 157 |
-
--enable_tracking \
|
| 158 |
-
--trackio_url "https://your-space.hf.space" \
|
| 159 |
-
--experiment_name "smollm3_v1"
|
| 160 |
-
|
| 161 |
-
# 2. Push to Hugging Face Hub
|
| 162 |
-
python push_to_huggingface.py /output-checkpoint myusername/smollm3-v1 \
|
| 163 |
-
--trackio-url "https://your-space.hf.space" \
|
| 164 |
-
--experiment-name "smollm3_v1"
|
| 165 |
-
|
| 166 |
-
# 3. Test the model
|
| 167 |
-
python -c "
|
| 168 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 169 |
-
model = AutoModelForCausalLM.from_pretrained('myusername/smollm3-v1')
|
| 170 |
-
tokenizer = AutoTokenizer.from_pretrained('myusername/smollm3-v1')
|
| 171 |
-
print('Model loaded successfully!')
|
| 172 |
-
"
|
| 173 |
-
```
|
| 174 |
-
|
| 175 |
-
### Batch Processing Multiple Models
|
| 176 |
-
|
| 177 |
-
```bash
|
| 178 |
-
#!/bin/bash
|
| 179 |
-
# Push multiple models
|
| 180 |
-
|
| 181 |
-
models=(
|
| 182 |
-
"smollm3-baseline"
|
| 183 |
-
"smollm3-high-lr"
|
| 184 |
-
"smollm3-dpo"
|
| 185 |
-
)
|
| 186 |
-
|
| 187 |
-
for model in "${models[@]}"; do
|
| 188 |
-
echo "Pushing $model..."
|
| 189 |
-
python push_to_huggingface.py "/models/$model" "username/$model"
|
| 190 |
-
done
|
| 191 |
-
```
|
| 192 |
-
|
| 193 |
-
## Error Handling
|
| 194 |
-
|
| 195 |
-
### Common Issues and Solutions
|
| 196 |
-
|
| 197 |
-
#### 1. Missing Model Files
|
| 198 |
-
|
| 199 |
-
**Error**: `❌ Missing required files: ['config.json', 'pytorch_model.bin']`
|
| 200 |
-
|
| 201 |
-
**Solution**: Ensure your model directory contains all required files:
|
| 202 |
-
- `config.json`
|
| 203 |
-
- `pytorch_model.bin`
|
| 204 |
-
- `tokenizer.json`
|
| 205 |
-
- `tokenizer_config.json`
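A quick pre-flight check for the files listed above might look like the sketch below. `find_missing_files` is a hypothetical helper written for illustration; the file list is copied from this section.

```python
from pathlib import Path

REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def find_missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means the directory passes this check.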
|
| 206 |
-
|
| 207 |
-
#### 2. Authentication Issues
|
| 208 |
-
|
| 209 |
-
**Error**: `❌ Failed to create repository: 401 Client Error`
|
| 210 |
-
|
| 211 |
-
**Solution**:
|
| 212 |
-
- Check your HF token is valid
|
| 213 |
-
- Ensure token has write permissions
|
| 214 |
-
- Verify username in repository name matches your account
|
| 215 |
-
|
| 216 |
-
#### 3. Repository Already Exists
|
| 217 |
-
|
| 218 |
-
**Error**: `Repository already exists`
|
| 219 |
-
|
| 220 |
-
**Solution**: The script handles this automatically with `exist_ok=True`, but you can:
|
| 221 |
-
- Use a different repository name
|
| 222 |
-
- Delete the existing repository first
|
| 223 |
-
- Use version numbers: `username/model-v2`
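The `exist_ok=True` behaviour mentioned above corresponds to the `exist_ok` argument of `create_repo` in `huggingface_hub`. The wrapper below is a sketch: `ensure_repo` is a hypothetical name, and `api` is assumed to be an `HfApi` instance (any object with a compatible `create_repo` method works).

```python
def ensure_repo(api, repo_id, private=False):
    """Create the repository if needed; do not fail when it already exists."""
    # `api` is expected to be a huggingface_hub.HfApi instance (assumption).
    return api.create_repo(repo_id=repo_id, private=private, exist_ok=True)
```

Because the call does not fail on an existing repository, re-running a push against the same name reuses it rather than stopping with an error.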
|
| 224 |
-
|
| 225 |
-
#### 4. Large File Upload Issues
|
| 226 |
-
|
| 227 |
-
**Error**: `Upload failed for large files`
|
| 228 |
-
|
| 229 |
-
**Solution**:
|
| 230 |
-
- Check your internet connection
|
| 231 |
-
- Use Git LFS for large files
|
| 232 |
-
- Consider splitting large models
|
| 233 |
-
|
| 234 |
-
## Trackio Integration
|
| 235 |
-
|
| 236 |
-
### Logging Push Actions
|
| 237 |
-
|
| 238 |
-
When using Trackio integration, the script logs:
|
| 239 |
-
|
| 240 |
-
- **Push Action**: Repository creation and file uploads
|
| 241 |
-
- **Model Metadata**: Size, configuration, results
|
| 242 |
-
- **Repository Info**: Name, privacy settings, URL
|
| 243 |
-
- **Training Results**: Loss, accuracy, steps
|
| 244 |
-
|
| 245 |
-
### Viewing Push Logs
|
| 246 |
-
|
| 247 |
-
1. Go to your Trackio Space
|
| 248 |
-
2. Navigate to the "View Experiments" tab
|
| 249 |
-
3. Find your experiment
|
| 250 |
-
4. Check the metrics for push-related actions
|
| 251 |
-
|
| 252 |
-
## Security Best Practices
|
| 253 |
-
|
| 254 |
-
### Token Management
|
| 255 |
-
|
| 256 |
-
```bash
|
| 257 |
-
# Use environment variables (recommended)
|
| 258 |
-
export HF_TOKEN="your_token_here"
|
| 259 |
-
python push_to_huggingface.py model repo
|
| 260 |
-
|
| 261 |
-
# Don't hardcode tokens in scripts
|
| 262 |
-
# ❌ Bad: python push_to_huggingface.py model repo --token "hf_xxx"
|
| 263 |
-
```
|
| 264 |
-
|
| 265 |
-
### Private Models
|
| 266 |
-
|
| 267 |
-
```bash
|
| 268 |
-
# For sensitive models, use private repositories
|
| 269 |
-
python push_to_huggingface.py model username/private-model --private
|
| 270 |
-
```
|
| 271 |
-
|
| 272 |
-
### Repository Naming
|
| 273 |
-
|
| 274 |
-
```bash
|
| 275 |
-
# Use descriptive names
|
| 276 |
-
python push_to_huggingface.py model username/smollm3-chatbot-v1
|
| 277 |
-
|
| 278 |
-
# Include version numbers
|
| 279 |
-
python push_to_huggingface.py model username/smollm3-v2.0
|
| 280 |
-
```
|
| 281 |
-
|
| 282 |
-
## Performance Optimization
|
| 283 |
-
|
| 284 |
-
### Large Models
|
| 285 |
-
|
| 286 |
-
For models > 5GB:
|
| 287 |
-
|
| 288 |
-
```bash
|
| 289 |
-
# Use Git LFS for large files
|
| 290 |
-
git lfs install
|
| 291 |
-
git lfs track "*.bin"
|
| 292 |
-
|
| 293 |
-
# Consider splitting models
|
| 294 |
-
python push_to_huggingface.py model username/model-large --private
|
| 295 |
-
```
|
| 296 |
-
|
| 297 |
-
### Upload Speed
|
| 298 |
-
|
| 299 |
-
```bash
|
| 300 |
-
# Use stable internet connection
|
| 301 |
-
# Consider uploading during off-peak hours
|
| 302 |
-
# Repository visibility (private or public) does not change upload speed
|
| 303 |
-
```
|
| 304 |
-
|
| 305 |
-
## Troubleshooting
|
| 306 |
-
|
| 307 |
-
### Debug Mode
|
| 308 |
-
|
| 309 |
-
```bash
|
| 310 |
-
# Enable debug logging
|
| 311 |
-
export LOG_LEVEL=DEBUG
|
| 312 |
-
python push_to_huggingface.py model repo
|
| 313 |
-
```
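For `LOG_LEVEL` to have an effect, the script has to read it and configure logging accordingly. A sketch of one way to do that, assuming a hypothetical `configure_logging` helper (whether `push_to_huggingface.py` does exactly this is not shown here):

```python
import logging
import os


def configure_logging(env=os.environ):
    """Configure root logging from LOG_LEVEL, defaulting to INFO."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to INFO
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    return level
```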
|
| 314 |
-
|
| 315 |
-
### Validate Model Files
|
| 316 |
-
|
| 317 |
-
```bash
|
| 318 |
-
# Check model structure before pushing
|
| 319 |
-
ls -la /path/to/model/
|
| 320 |
-
# Should contain: config.json, pytorch_model.bin, tokenizer.json, etc.
|
| 321 |
-
```
|
| 322 |
-
|
| 323 |
-
### Test Repository Access
|
| 324 |
-
|
| 325 |
-
```bash
|
| 326 |
-
# Test your HF token
|
| 327 |
-
python -c "
|
| 328 |
-
from huggingface_hub import HfApi
|
| 329 |
-
api = HfApi(token='your_token')
|
| 330 |
-
print('Token is valid for user:', api.whoami()['name'])
|
| 331 |
-
"
|
| 332 |
-
```
|
| 333 |
-
|
| 334 |
-
## Integration Examples
|
| 335 |
-
|
| 336 |
-
### With CI/CD Pipeline
|
| 337 |
-
|
| 338 |
-
```yaml
|
| 339 |
-
# .github/workflows/train-and-push.yml
|
| 340 |
-
name: Train and Push Model
|
| 341 |
-
|
| 342 |
-
on:
|
| 343 |
-
  push:
|
| 344 |
-
    branches: [main]
|
| 345 |
-
|
| 346 |
-
jobs:
|
| 347 |
-
  train-and-push:
|
| 348 |
-
    runs-on: ubuntu-latest
|
| 349 |
-
    steps:
|
| 350 |
-
      - uses: actions/checkout@v2
|
| 351 |
-
|
| 352 |
-
      - name: Train Model
|
| 353 |
-
        run: |
|
| 354 |
-
          python train.py config/train_smollm3.py
|
| 355 |
-
|
| 356 |
-
      - name: Push to HF Hub
|
| 357 |
-
        run: |
|
| 358 |
-
          python push_to_huggingface.py /output username/model-${{ github.run_number }}
|
| 359 |
-
        env:
|
| 360 |
-
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
|
| 361 |
-
```
|
| 362 |
-
|
| 363 |
-
### With Docker
|
| 364 |
-
|
| 365 |
-
```dockerfile
|
| 366 |
-
# Dockerfile
|
| 367 |
-
FROM python:3.9
|
| 368 |
-
|
| 369 |
-
WORKDIR /app
|
| 370 |
-
COPY requirements.txt .
|
| 371 |
-
RUN pip install -r requirements.txt
|
| 372 |
-
|
| 373 |
-
COPY . .
|
| 374 |
-
|
| 375 |
-
CMD ["python", "push_to_huggingface.py", "/model", "username/model"]
|
| 376 |
-
```
|
| 377 |
-
|
| 378 |
-
## Support and Resources
|
| 379 |
-
|
| 380 |
-
### Documentation
|
| 381 |
-
|
| 382 |
-
- [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/index)
|
| 383 |
-
- [Transformers Documentation](https://huggingface.co/docs/transformers/index)
|
| 384 |
-
- [Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
|
| 385 |
-
|
| 386 |
-
### Community
|
| 387 |
-
|
| 388 |
-
- [Hugging Face Forums](https://discuss.huggingface.co/)
|
| 389 |
-
- [GitHub Issues](https://github.com/huggingface/huggingface_hub/issues)
|
| 390 |
-
|
| 391 |
-
### Examples
|
| 392 |
-
|
| 393 |
-
- [Model Repository Examples](https://huggingface.co/models?search=smollm3)
|
| 394 |
-
- [Fine-tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
|
| 395 |
-
|
| 396 |
-
## Conclusion
|
| 397 |
-
|
| 398 |
-
The `push_to_huggingface.py` script provides a complete solution for:
|
| 399 |
-
|
| 400 |
-
- ✅ **Easy Model Deployment** - One command to push models
|
| 401 |
-
- ✅ **Professional Documentation** - Auto-generated model cards
|
| 402 |
-
- ✅ **Training Artifacts** - Complete experiment tracking
|
| 403 |
-
- ✅ **Integration Ready** - Works with CI/CD and monitoring
|
| 404 |
-
- ✅ **Security Focused** - Proper token and privacy management
|
| 405 |
-
|
| 406 |
-
Start sharing your fine-tuned SmolLM3 models with the community!
|
docs/PUSH_SCRIPT_GUIDE.md
DELETED
|
@@ -1,267 +0,0 @@
|
|
| 1 |
-
# 🚀 Push to Hugging Face Script Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking and provides complete model deployment with persistent experiment storage.
|
| 6 |
-
|
| 7 |
-
## 🚀 Key Improvements
|
| 8 |
-
|
| 9 |
-
### **1. HF Datasets Integration**
|
| 10 |
-
- ✅ **Dataset Repository Support**: Configurable dataset repository for experiment storage
|
| 11 |
-
- ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
|
| 12 |
-
- ✅ **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
|
| 13 |
-
- ✅ **Model Card Integration**: Includes dataset repository information in model cards
|
| 14 |
-
|
| 15 |
-
### **2. Enhanced Configuration**
|
| 16 |
-
- ✅ **Flexible Token Input**: Multiple ways to provide HF token
|
| 17 |
-
- ✅ **Dataset Repository Tracking**: Links models to their experiment datasets
|
| 18 |
-
- ✅ **Environment Variable Support**: Fallback to environment variables
|
| 19 |
-
- ✅ **Command Line Arguments**: New arguments for HF Datasets integration
|
| 20 |
-
|
| 21 |
-
### **3. Improved Model Cards**
|
| 22 |
-
- ✅ **Dataset Repository Info**: Shows which dataset contains experiment data
|
| 23 |
-
- ✅ **Experiment Tracking Section**: Explains how to access training data
|
| 24 |
-
- ✅ **Enhanced Documentation**: Better model cards with experiment links
|
| 25 |
-
|
| 26 |
-
## 📋 Usage Examples
|
| 27 |
-
|
| 28 |
-
### **Basic Usage**
|
| 29 |
-
```bash
|
| 30 |
-
# Push model with default settings
|
| 31 |
-
python push_to_huggingface.py /path/to/model username/repo-name
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
### **With HF Datasets Integration**
|
| 35 |
-
```bash
|
| 36 |
-
# Push model with custom dataset repository
|
| 37 |
-
python push_to_huggingface.py /path/to/model username/repo-name \
|
| 38 |
-
--dataset-repo username/experiments
|
| 39 |
-
```
|
| 40 |
-
|
| 41 |
-
### **With Custom Token**
|
| 42 |
-
```bash
|
| 43 |
-
# Push model with custom HF token
|
| 44 |
-
python push_to_huggingface.py /path/to/model username/repo-name \
|
| 45 |
-
--hf-token your_token_here
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
### **Complete Example**
|
| 49 |
-
```bash
|
| 50 |
-
# Push model with all options
|
| 51 |
-
python push_to_huggingface.py /path/to/model username/repo-name \
|
| 52 |
-
--dataset-repo username/experiments \
|
| 53 |
-
--hf-token your_token_here \
|
| 54 |
-
--private \
|
| 55 |
-
--experiment-name "smollm3_finetune_v2"
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
## 🔧 Command Line Arguments
|
| 59 |
-
|
| 60 |
-
| Argument | Required | Default | Description |
|
| 61 |
-
|----------|----------|---------|-------------|
|
| 62 |
-
| `model_path` | ✅ Yes | None | Path to trained model directory |
|
| 63 |
-
| `repo_name` | ✅ Yes | None | HF repository name (username/repo-name) |
|
| 64 |
-
| `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
|
| 65 |
-
| `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
|
| 66 |
-
| `--private` | ❌ No | False | Make repository private |
|
| 67 |
-
| `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
|
| 68 |
-
| `--experiment-name` | ❌ No | None | Experiment name for Trackio |
|
| 69 |
-
| `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |
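The argument table above could be expressed with `argparse` roughly as follows. This is a sketch under assumptions, not the parser in `push_to_huggingface.py`: the helper names `build_parser` and `resolve_args` are hypothetical, and the precedence (`--hf-token`, then `--token`, then the `HF_TOKEN` environment variable) is assumed from the "Default" column rather than confirmed by the script.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Push a trained model to the Hugging Face Hub"
    )
    parser.add_argument("model_path", help="Path to trained model directory")
    parser.add_argument("repo_name", help="HF repository name (username/repo-name)")
    parser.add_argument("--token", default=None, help="Hugging Face token")
    parser.add_argument("--hf-token", default=None, help="HF token (alternative to --token)")
    parser.add_argument("--private", action="store_true", help="Make repository private")
    parser.add_argument("--trackio-url", default=None, help="Trackio Space URL for logging")
    parser.add_argument("--experiment-name", default=None, help="Experiment name for Trackio")
    parser.add_argument("--dataset-repo", default=None, help="HF Dataset repository")
    return parser


def resolve_args(argv, env=None):
    """Parse argv, then fill unset values from environment variables."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    # Assumed precedence: explicit flags win over environment variables.
    args.token = args.hf_token or args.token or env.get("HF_TOKEN")
    args.dataset_repo = args.dataset_repo or env.get("TRACKIO_DATASET_REPO")
    return args
```

Passing `env` explicitly keeps the fallback logic testable without touching the real process environment.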
|
| 70 |
-
|
| 71 |
-
## 🛠️ Configuration Methods
|
| 72 |
-
|
| 73 |
-
### **Method 1: Command Line Arguments**
|
| 74 |
-
```bash
|
| 75 |
-
python push_to_huggingface.py model_path repo_name \
|
| 76 |
-
--dataset-repo username/experiments \
|
| 77 |
-
--hf-token your_token_here
|
| 78 |
-
```
|
| 79 |
-
|
| 80 |
-
### **Method 2: Environment Variables**
|
| 81 |
-
```bash
|
| 82 |
-
export HF_TOKEN=your_token_here
|
| 83 |
-
export TRACKIO_DATASET_REPO=username/experiments
|
| 84 |
-
python push_to_huggingface.py model_path repo_name
|
| 85 |
-
```
|
| 86 |
-
|
| 87 |
-
### **Method 3: Hybrid Approach**
|
| 88 |
-
```bash
|
| 89 |
-
# Set defaults via environment variables
|
| 90 |
-
export HF_TOKEN=your_token_here
|
| 91 |
-
export TRACKIO_DATASET_REPO=username/experiments
|
| 92 |
-
|
| 93 |
-
# Override specific values via command line
|
| 94 |
-
python push_to_huggingface.py model_path repo_name \
|
| 95 |
-
--dataset-repo username/specific-experiments
|
| 96 |
-
```
|
| 97 |
-
|
| 98 |
-
## 📊 What Gets Pushed
|
| 99 |
-
|
| 100 |
-
### **Model Files**
|
| 101 |
-
- ✅ **Model Weights**: `pytorch_model.bin`
|
| 102 |
-
- ✅ **Configuration**: `config.json`
|
| 103 |
-
- ✅ **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
|
| 104 |
-
- ✅ **All Other Files**: Any additional files in model directory
|
| 105 |
-
|
| 106 |
-
### **Documentation**
|
| 107 |
-
- ✅ **Model Card**: Comprehensive README.md with model information
|
| 108 |
-
- ✅ **Training Configuration**: JSON configuration used for training
|
| 109 |
-
- ✅ **Training Results**: JSON results and metrics
|
| 110 |
-
- ✅ **Training Logs**: Text logs from training process
|
| 111 |
-
|
| 112 |
-
### **Experiment Data**
|
| 113 |
-
- ✅ **Dataset Repository**: Links to HF Dataset containing experiment data
|
| 114 |
-
- ✅ **Training Metrics**: All training metrics stored in dataset
|
| 115 |
-
- ✅ **Configuration**: Training configuration stored in dataset
|
| 116 |
-
- ✅ **Artifacts**: Training artifacts and logs
|
| 117 |
-
|
| 118 |
-
## 🔍 Enhanced Model Cards
|
| 119 |
-
|
| 120 |
-
The improved script creates enhanced model cards that include:
|
| 121 |
-
|
| 122 |
-
### **Model Information**
|
| 123 |
-
- Base model and architecture
|
| 124 |
-
- Training date and model size
|
| 125 |
-
- **Dataset repository** for experiment data
|
| 126 |
-
|
| 127 |
-
### **Training Configuration**
|
| 128 |
-
- Complete training parameters
|
| 129 |
-
- Hardware information
|
| 130 |
-
- Training duration and steps
|
| 131 |
-
|
| 132 |
-
### **Experiment Tracking**
|
| 133 |
-
- Links to HF Dataset repository
|
| 134 |
-
- Instructions for accessing experiment data
|
| 135 |
-
- Training metrics and results
|
| 136 |
-
|
| 137 |
-
### **Usage Examples**
|
| 138 |
-
- Code examples for loading and using the model
|
| 139 |
-
- Generation examples
|
| 140 |
-
- Performance information
|
| 141 |
-
|
| 142 |
-
## 📈 Logging Integration
|
| 143 |
-
|
| 144 |
-
### **Trackio Logging**
|
| 145 |
-
- ✅ **Push Actions**: Logs model push events
|
| 146 |
-
- ✅ **Model Information**: Repository name, size, configuration
|
| 147 |
-
- ✅ **Training Data**: Links to experiment dataset
|
| 148 |
-
|
| 149 |
-
### **HF Datasets Logging**
|
| 150 |
-
- ✅ **Experiment Summary**: Final training summary
|
| 151 |
-
- ✅ **Push Metadata**: Model repository and push date
|
| 152 |
-
- ✅ **Configuration**: Complete training configuration
|
| 153 |
-
|
| 154 |
-
### **Dual Storage**
|
| 155 |
-
- ✅ **Trackio**: Real-time monitoring and visualization
|
| 156 |
-
- ✅ **HF Datasets**: Persistent experiment storage
|
| 157 |
-
- ✅ **Synchronized**: Both systems updated together
|
| 158 |
-
|
| 159 |
-
## 🚨 Troubleshooting
|
| 160 |
-
|
| 161 |
-
### **Issue: "Missing required files"**
|
| 162 |
-
**Solutions**:
|
| 163 |
-
1. Check model directory contains required files
|
| 164 |
-
2. Ensure model was saved correctly during training
|
| 165 |
-
3. Verify file permissions
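A quick pre-flight check can narrow down which files are missing before a push is attempted. The sketch below is illustrative only: `missing_model_files` is a hypothetical helper, and the required-file list is an assumption taken from the files named under "What Gets Pushed", not from the script's own validation logic.

```python
from pathlib import Path

# Assumed list, based on the files named under "What Gets Pushed".
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means every expected file is present; anything else names exactly what to look for in the training output directory.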
|
| 166 |
-
|
| 167 |
-
### **Issue: "Failed to create repository"**
|
| 168 |
-
**Solutions**:
|
| 169 |
-
1. Check HF token has write permissions
|
| 170 |
-
2. Verify repository name format: `username/repo-name`
|
| 171 |
-
3. Check whether the repository already exists under your account; `--private` only controls visibility and does not resolve a name conflict
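The `username/repo-name` format can be checked locally before any network call. The pattern below is a rough approximation written for this guide, not the Hub's official validation rule, and `looks_like_repo_id` is a hypothetical helper name.

```python
import re

# Approximate shape check: exactly one "/" separating two non-empty parts
# made of letters, digits, ".", "_" or "-". The Hub applies stricter rules.
_REPO_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_repo_id(name: str) -> bool:
    """Return True if name has the form username/repo-name."""
    return _REPO_ID.fullmatch(name) is not None
```

A name that fails this check is certainly malformed; a name that passes can still be rejected by the Hub.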
|
| 172 |
-
|
| 173 |
-
### **Issue: "Failed to upload files"**
|
| 174 |
-
**Solutions**:
|
| 175 |
-
1. Check network connectivity
|
| 176 |
-
2. Verify HF token is valid
|
| 177 |
-
3. Ensure repository was created successfully
|
| 178 |
-
|
| 179 |
-
### **Issue: "Dataset repository not found"**
|
| 180 |
-
**Solutions**:
|
| 181 |
-
1. Check dataset repository exists
|
| 182 |
-
2. Verify HF token has read access
|
| 183 |
-
3. Use `--dataset-repo` to specify correct repository
|
| 184 |
-
|
| 185 |
-
## 📋 Workflow Integration
|
| 186 |
-
|
| 187 |
-
### **Complete Training Workflow**
|
| 188 |
-
1. **Train Model**: Use training scripts with monitoring
|
| 189 |
-
2. **Monitor Progress**: View metrics in Trackio interface
|
| 190 |
-
3. **Push Model**: Use improved push script
|
| 191 |
-
4. **Access Data**: View experiments in HF Dataset repository
|
| 192 |
-
|
| 193 |
-
### **Example Workflow**
|
| 194 |
-
```bash
|
| 195 |
-
# 1. Train model with monitoring
|
| 196 |
-
python train.py config/train_smollm3_openhermes_fr.py \
|
| 197 |
-
--experiment_name "smollm3_french_v2"
|
| 198 |
-
|
| 199 |
-
# 2. Push model to HF Hub
|
| 200 |
-
python push_to_huggingface.py outputs/model username/smollm3-french \
|
| 201 |
-
--dataset-repo username/experiments \
|
| 202 |
-
--experiment-name "smollm3_french_v2"
|
| 203 |
-
|
| 204 |
-
# 3. View results
|
| 205 |
-
# - Model: https://huggingface.co/username/smollm3-french
|
| 206 |
-
# - Experiments: https://huggingface.co/datasets/username/experiments
|
| 207 |
-
# - Trackio: Your Trackio Space interface
|
| 208 |
-
```
|
| 209 |
-
|
| 210 |
-
## 🎯 Benefits
|
| 211 |
-
|
| 212 |
-
### **For Model Deployment**
|
| 213 |
-
- ✅ **Complete Documentation**: Enhanced model cards with experiment links
|
| 214 |
-
- ✅ **Persistent Storage**: Experiment data stored in HF Datasets
|
| 215 |
-
- ✅ **Easy Access**: Direct links to training data and metrics
|
| 216 |
-
- ✅ **Reproducibility**: Complete training configuration included
|
| 217 |
-
|
| 218 |
-
### **For Experiment Management**
|
| 219 |
-
- ✅ **Centralized Storage**: All experiments in HF Dataset repository
|
| 220 |
-
- ✅ **Version Control**: Model versions linked to experiment data
|
| 221 |
-
- ✅ **Collaboration**: Share experiments and models easily
|
| 222 |
-
- ✅ **Searchability**: Easy to find specific experiments
|
| 223 |
-
|
| 224 |
-
### **For Development**
|
| 225 |
-
- ✅ **Flexible Configuration**: Multiple ways to set parameters
|
| 226 |
-
- ✅ **Backward Compatible**: Works with existing setups
|
| 227 |
-
- ✅ **Error Handling**: Clear error messages and troubleshooting
|
| 228 |
-
- ✅ **Integration**: Works with existing monitoring system
|
| 229 |
-
|
| 230 |
-
## 📊 Testing Results
|
| 231 |
-
|
| 232 |
-
All push script tests passed:
|
| 233 |
-
- ✅ **HuggingFacePusher Initialization**: Works with new parameters
|
| 234 |
-
- ✅ **Model Card Creation**: Includes HF Datasets integration
|
| 235 |
-
- ✅ **Logging Integration**: Logs to both Trackio and HF Datasets
|
| 236 |
-
- ✅ **Argument Parsing**: Handles new command line arguments
|
| 237 |
-
- ✅ **Environment Variables**: Proper fallback handling
|
| 238 |
-
|
| 239 |
-
## 🔄 Migration Guide
|
| 240 |
-
|
| 241 |
-
### **From Old Script**
|
| 242 |
-
```bash
|
| 243 |
-
# Old way
|
| 244 |
-
python push_to_huggingface.py model_path repo_name --token your_token
|
| 245 |
-
|
| 246 |
-
# New way (same functionality)
|
| 247 |
-
python push_to_huggingface.py model_path repo_name --hf-token your_token
|
| 248 |
-
|
| 249 |
-
# New way with HF Datasets
|
| 250 |
-
python push_to_huggingface.py model_path repo_name \
|
| 251 |
-
--hf-token your_token \
|
| 252 |
-
--dataset-repo username/experiments
|
| 253 |
-
```
|
| 254 |
-
|
| 255 |
-
### **Environment Variables**
|
| 256 |
-
```bash
|
| 257 |
-
# Set environment variables for automatic detection
|
| 258 |
-
export HF_TOKEN=your_token_here
|
| 259 |
-
export TRACKIO_DATASET_REPO=username/experiments
|
| 260 |
-
|
| 261 |
-
# Then use simple command
|
| 262 |
-
python push_to_huggingface.py model_path repo_name
|
| 263 |
-
```
|
| 264 |
-
|
| 265 |
-
---
|
| 266 |
-
|
| 267 |
-
**🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**
docs/QUANTIZATION_FIX_SUMMARY.md
DELETED
|
@@ -1,165 +0,0 @@
|
|
| 1 |
-
# Quantization Fix Summary
|
| 2 |
-
|
| 3 |
-
## Issues Identified
|
| 4 |
-
|
| 5 |
-
The quantization script was failing due to several compatibility issues:
|
| 6 |
-
|
| 7 |
-
1. **Int8 Quantization Error**:
|
| 8 |
-
- Error: `The model is quantized with QuantizationMethod.TORCHAO and is not serializable`
|
| 9 |
-
- Cause: Offloaded modules in the model cannot be quantized with torchao
|
| 10 |
-
- Solution: Added alternative save method and fallback to bitsandbytes
|
| 11 |
-
|
| 12 |
-
2. **Int4 Quantization Error**:
|
| 13 |
-
- Error: `Could not run 'aten::_convert_weight_to_int4pack_for_cpu' with arguments from the 'CUDA' backend`
|
| 14 |
-
- Cause: The CPU-only int4 weight-packing kernel was called on CUDA tensors
|
| 15 |
-
- Solution: Added proper device selection logic
|
| 16 |
-
|
| 17 |
-
3. **Monitoring Error**:
|
| 18 |
-
- Error: `'SmolLM3Monitor' object has no attribute 'log_event'`
|
| 19 |
-
- Cause: Incorrect monitoring API usage
|
| 20 |
-
- Solution: Added flexible monitoring method detection
|
| 21 |
-
|
| 22 |
-
## Fixes Implemented
|
| 23 |
-
|
| 24 |
-
### 1. Enhanced Device Management (`scripts/model_tonic/quantize_model.py`)
|
| 25 |
-
|
| 26 |
-
```python
def get_optimal_device(self, quant_type: str) -> str:
    """Get optimal device for quantization type"""
    if quant_type == "int4_weight_only":
        # Int4 quantization works better on CPU
        return "cpu"
    elif quant_type == "int8_weight_only":
        # Int8 quantization works on GPU
        if torch.cuda.is_available():
            return "cuda"
        else:
            logger.warning("⚠️ CUDA not available, falling back to CPU for int8")
            return "cpu"
    else:
        return "auto"
```
|
| 42 |
-
|
| 43 |
-
### 2. Alternative Quantization Method
|
| 44 |
-
|
| 45 |
-
Added `quantize_model_alternative()` method using bitsandbytes for better compatibility:
|
| 46 |
-
|
| 47 |
-
```python
def quantize_model_alternative(self, quant_type: str, device: str = "auto", group_size: int = 128, save_dir: Optional[str] = None) -> Optional[str]:
    """Alternative quantization using bitsandbytes for better compatibility"""
    # Uses BitsAndBytesConfig instead of TorchAoConfig
    # Handles serialization issues better
```
|
| 53 |
-
|
| 54 |
-
### 3. Improved Error Handling
|
| 55 |
-
|
| 56 |
-
- Added fallback from torchao to bitsandbytes
|
| 57 |
-
- Enhanced save method with alternative approaches
|
| 58 |
-
- Better device mapping for different quantization types
|
| 59 |
-
|
| 60 |
-
### 4. Fixed Monitoring Integration
|
| 61 |
-
|
| 62 |
-
```python
def log_to_trackio(self, action: str, details: Dict[str, Any]):
    """Log quantization events to Trackio"""
    if self.monitor:
        try:
            # Use the correct monitoring method
            if hasattr(self.monitor, 'log_event'):
                self.monitor.log_event(action, details)
            elif hasattr(self.monitor, 'log_metric'):
                self.monitor.log_metric(action, details.get('value', 1.0))
            elif hasattr(self.monitor, 'log'):
                self.monitor.log(action, details)
            else:
                logger.info(f"📊 {action}: {details}")
        except Exception as e:
            logger.warning(f"⚠️ Failed to log to Trackio: {e}")
```
|
| 79 |
-
|
| 80 |
-
## Usage Instructions
|
| 81 |
-
|
| 82 |
-
### 1. Install Dependencies
|
| 83 |
-
|
| 84 |
-
```bash
|
| 85 |
-
pip install -r requirements_quantization.txt
|
| 86 |
-
```
|
| 87 |
-
|
| 88 |
-
### 2. Run Quantization
|
| 89 |
-
|
| 90 |
-
```bash
|
| 91 |
-
python3 quantize_and_push.py
|
| 92 |
-
```
|
| 93 |
-
|
| 94 |
-
### 3. Test Fixes
|
| 95 |
-
|
| 96 |
-
```bash
|
| 97 |
-
python3 test_quantization_fix.py
|
| 98 |
-
```
|
| 99 |
-
|
| 100 |
-
## Expected Behavior
|
| 101 |
-
|
| 102 |
-
### Successful Quantization
|
| 103 |
-
|
| 104 |
-
The script will now:
|
| 105 |
-
|
| 106 |
-
1. **Try torchao first** for each quantization type
|
| 107 |
-
2. **Fall back to bitsandbytes** if torchao fails
|
| 108 |
-
3. **Use appropriate devices** (CPU for int4, GPU for int8)
|
| 109 |
-
4. **Handle serialization issues** with alternative save methods
|
| 110 |
-
5. **Log progress** without monitoring errors
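The try-then-fall-back behaviour in the list above can be sketched as a small loop over backends. This is an illustration of the pattern only: `quantize_with_fallback` is a hypothetical helper, and each backend is a stand-in callable rather than the real torchao or bitsandbytes code path in `quantize_model.py`.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


def quantize_with_fallback(
    backends: Sequence[Tuple[str, Callable[[], str]]],
) -> Optional[str]:
    """Run each backend in order and return the first successful output path.

    Each entry is (name, run), where run() returns the saved model path or
    raises on failure. Returns None if every backend fails.
    """
    for name, run in backends:
        try:
            path = run()
        except Exception as exc:  # any failure triggers the next backend
            logger.warning("%s quantization failed: %s", name, exc)
            continue
        logger.info("%s quantization succeeded: %s", name, path)
        return path
    return None
```

Ordering the list as torchao first and bitsandbytes second reproduces the documented preference while keeping the failure handling in one place.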
|
| 111 |
-
|
| 112 |
-
### Output
|
| 113 |
-
|
| 114 |
-
```
|
| 115 |
-
✅ Model files validated
|
| 116 |
-
🔄 Processing quantization type: int8_weight_only
|
| 117 |
-
🔄 Using device: cuda
|
| 118 |
-
✅ int8_weight_only quantization and push completed
|
| 119 |
-
🔄 Processing quantization type: int4_weight_only
|
| 120 |
-
🔄 Using device: cpu
|
| 121 |
-
✅ int4_weight_only quantization and push completed
|
| 122 |
-
📊 Quantization summary: 2/2 successful
|
| 123 |
-
✅ Quantization completed successfully!
|
| 124 |
-
```
|
| 125 |
-
|
| 126 |
-
## Troubleshooting
|
| 127 |
-
|
| 128 |
-
### If All Quantization Fails
|
| 129 |
-
|
| 130 |
-
1. **Install bitsandbytes**:
|
| 131 |
-
```bash
|
| 132 |
-
pip install bitsandbytes
|
| 133 |
-
```
|
| 134 |
-
|
| 135 |
-
2. **Check model path**:
|
| 136 |
-
```bash
|
| 137 |
-
ls -la /output-checkpoint
|
| 138 |
-
```
|
| 139 |
-
|
| 140 |
-
3. **Verify dependencies**:
|
| 141 |
-
```bash
|
| 142 |
-
python3 test_quantization_fix.py
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
### Common Issues
|
| 146 |
-
|
| 147 |
-
1. **Memory Issues**: Use CPU for int4 quantization
|
| 148 |
-
2. **Serialization Errors**: The script now handles these automatically
|
| 149 |
-
3. **Device Conflicts**: Automatic device selection based on quantization type
|
| 150 |
-
|
| 151 |
-
## Files Modified
|
| 152 |
-
|
| 153 |
-
1. `scripts/model_tonic/quantize_model.py` - Main quantization logic
|
| 154 |
-
2. `quantize_and_push.py` - Main script with better error handling
|
| 155 |
-
3. `test_quantization_fix.py` - Test script for verification
|
| 156 |
-
4. `requirements_quantization.txt` - Dependencies file
|
| 157 |
-
|
| 158 |
-
## Next Steps
|
| 159 |
-
|
| 160 |
-
1. Run the test script to verify fixes
|
| 161 |
-
2. Install bitsandbytes if not already installed
|
| 162 |
-
3. Run the quantization script
|
| 163 |
-
4. Check the Hugging Face repository for quantized models
|
| 164 |
-
|
| 165 |
-
The fixes ensure robust quantization with multiple fallback options and proper error handling.
docs/QUANTIZATION_GUIDE.md
DELETED
|
@@ -1,313 +0,0 @@
|
|
| 1 |
-
# Model Quantization Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.
|
| 6 |
-
|
| 7 |
-
## Repository Structure
|
| 8 |
-
|
| 9 |
-
With the updated pipeline, all models (main and quantized) are stored in a single repository:
|
| 10 |
-
|
| 11 |
-
```
your-username/model-name/
├── README.md (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/ (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/ (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```
|
| 27 |
-
|
| 28 |
-
## Quantization Types
|
| 29 |
-
|
| 30 |
-
### int8 Weight-Only Quantization (GPU Optimized)
|
| 31 |
-
- **Memory Reduction**: ~50% compared to original model
|
| 32 |
-
- **Speed**: Faster inference with minimal accuracy loss
|
| 33 |
-
- **Hardware**: GPU optimized for high-performance inference
|
| 34 |
-
- **Use Case**: Production deployments with GPU resources
|
| 35 |
-
|
| 36 |
-
### int4 Weight-Only Quantization (CPU Optimized)
|
| 37 |
-
- **Memory Reduction**: ~75% compared to original model
|
| 38 |
-
- **Speed**: Significantly faster inference with some accuracy trade-off
|
| 39 |
-
- **Hardware**: CPU optimized for deployment
|
| 40 |
-
- **Use Case**: Edge deployment, CPU-only environments
|
| 41 |
-
|
| 42 |
-
## Integration with Pipeline
|
| 43 |
-
|
| 44 |
-
### Automatic Quantization
|
| 45 |
-
|
| 46 |
-
The quantization process is integrated into the main training pipeline:
|
| 47 |
-
|
| 48 |
-
1. **Training**: Model is trained using the standard pipeline
|
| 49 |
-
2. **Model Push**: Main model is pushed to Hugging Face Hub
|
| 50 |
-
3. **Quantization Options**: User is prompted to create quantized versions
|
| 51 |
-
4. **Quantized Models**: Quantized models are created and pushed to subdirectories
|
| 52 |
-
5. **Unified Documentation**: Single model card covers all versions
|
| 53 |
-
|
| 54 |
-
### Pipeline Integration
|
| 55 |
-
|
| 56 |
-
The quantization step is added to `launch.sh` after the main model push:
|
| 57 |
-
|
| 58 |
-
```bash
|
| 59 |
-
# Step 16.5: Quantization Options
|
| 60 |
-
print_step "Step 16.5: Model Quantization Options"
|
| 61 |
-
echo "=========================================="
|
| 62 |
-
|
| 63 |
-
print_info "Would you like to create quantized versions of your model?"
|
| 64 |
-
print_info "Quantization reduces model size and improves inference speed."
|
| 65 |
-
|
| 66 |
-
# Ask about quantization
|
| 67 |
-
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"
|
| 68 |
-
|
| 69 |
-
if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
|
| 70 |
-
print_info "Quantization options:"
|
| 71 |
-
print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
|
| 72 |
-
print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
|
| 73 |
-
print_info "3. Both int8 and int4 versions"
|
| 74 |
-
|
| 75 |
-
select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"
|
| 76 |
-
|
| 77 |
-
# Create quantized models in the same repository
|
| 78 |
-
python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
|
| 79 |
-
--quant-type "$QUANT_TYPE" \
|
| 80 |
-
--device "$DEVICE" \
|
| 81 |
-
--token "$HF_TOKEN" \
|
| 82 |
-
--trackio-url "$TRACKIO_URL" \
|
| 83 |
-
--experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
|
| 84 |
-
--dataset-repo "$TRACKIO_DATASET_REPO"
|
| 85 |
-
fi
|
| 86 |
-
```
|
| 87 |
-
|
| 88 |
-
## Standalone Quantization
|
| 89 |
-
|
| 90 |
-
### Using the Standalone Script
|
| 91 |
-
|
| 92 |
-
For models already uploaded to Hugging Face Hub:
|
| 93 |
-
|
| 94 |
-
```bash
|
| 95 |
-
python scripts/model_tonic/quantize_standalone.py \
|
| 96 |
-
"your-username/model-name" \
|
| 97 |
-
"your-username/model-name" \
|
| 98 |
-
--quant-type "int8_weight_only" \
|
| 99 |
-
--device "auto" \
|
| 100 |
-
--token "your-hf-token"
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
### Command Line Options
|
| 104 |
-
|
| 105 |
-
```bash
|
| 106 |
-
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]
|
| 107 |
-
|
| 108 |
-
Options:
|
| 109 |
-
--quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
|
| 110 |
-
Quantization type (default: int8_weight_only)
|
| 111 |
-
--device DEVICE Device for quantization (auto, cpu, cuda)
|
| 112 |
-
--group-size GROUP_SIZE
|
| 113 |
-
Group size for quantization (default: 128)
|
| 114 |
-
--token TOKEN Hugging Face token
|
| 115 |
-
--private Create private repository
|
| 116 |
-
--trackio-url TRACKIO_URL
|
| 117 |
-
Trackio URL for monitoring
|
| 118 |
-
--experiment-name EXPERIMENT_NAME
|
| 119 |
-
Experiment name for tracking
|
| 120 |
-
--dataset-repo DATASET_REPO
|
| 121 |
-
HF Dataset repository
|
| 122 |
-
--save-only Save quantized model locally without pushing to HF
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
## Loading Quantized Models
|
| 126 |
-
|
| 127 |
-
### Loading Main Model
|
| 128 |
-
|
| 129 |
-
```python
|
| 130 |
-
import torch
|
| 131 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 132 |
-
|
| 133 |
-
# Load the main model
|
| 134 |
-
model = AutoModelForCausalLM.from_pretrained(
|
| 135 |
-
"your-username/model-name",
|
| 136 |
-
device_map="auto",
|
| 137 |
-
torch_dtype=torch.bfloat16
|
| 138 |
-
)
|
| 139 |
-
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
|
| 140 |
-
```
|
| 141 |
-
|
| 142 |
-
### Loading int8 Quantized Model (GPU)
|
| 143 |
-
|
| 144 |
-
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load int8 quantized model (GPU optimized).
# The quantized weights live in a subdirectory of the repository, so the
# repo ID stays "username/model-name" and the directory goes in `subfolder`.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
```
|
| 156 |
-
|
| 157 |
-
### Loading int4 Quantized Model (CPU)
|
| 158 |
-
|
| 159 |
-
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load int4 quantized model (CPU optimized).
# As with int8, the subdirectory is passed via `subfolder`, not the repo ID.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int4")
```
|
| 171 |
-
|
| 172 |
-
## Usage Examples
|
| 173 |
-
|
| 174 |
-
### Text Generation with Quantized Model
|
| 175 |
-
|
| 176 |
-
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model from the int8 subdirectory of the repository
model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
| 189 |
-
|
| 190 |
-
### Conversation with Quantized Model
|
| 191 |
-
|
| 192 |
-
```python
def chat_with_quantized_model(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```
|
| 201 |
-
|
| 202 |
-
## Configuration Options
|
| 203 |
-
|
| 204 |
-
### Quantization Parameters
|
| 205 |
-
|
| 206 |
-
- **group_size**: Group size for quantization (default: 128)
|
| 207 |
-
- **device**: Target device for quantization (auto, cpu, cuda)
|
| 208 |
-
- **quant_type**: Type of quantization to apply
|
| 209 |
-
|
| 210 |
-
### Hardware Requirements
|
| 211 |
-
|
| 212 |
-
- **Main Model**: GPU with 8GB+ VRAM recommended
|
| 213 |
-
- **int8 Model**: GPU with 4GB+ VRAM
|
| 214 |
-
- **int4 Model**: CPU deployment possible
|
| 215 |
-
|
| 216 |
-
## Performance Comparison
|
| 217 |
-
|
| 218 |
-
| Model Type | Memory Usage | Speed | Accuracy | Use Case |
|
| 219 |
-
|------------|--------------|-------|----------|----------|
|
| 220 |
-
| Original | 100% | Baseline | Best | Development, Research |
|
| 221 |
-
| int8 | ~50% | Faster | Minimal loss | Production GPU |
|
| 222 |
-
| int4 | ~25% | Fastest | Some loss | Edge, CPU deployment |
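The memory column follows from bits per weight. The sketch below assumes a 16-bit (bf16) baseline and counts weights only; it ignores quantization scales, any layers left unquantized, and activation memory, so real footprints will be somewhat higher. The 3-billion parameter count is an illustrative input, and `weight_memory_gb` is a hypothetical helper.

```python
# Bits stored per weight for each format (weights only, no metadata).
BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 1e9


for dtype in ("bf16", "int8", "int4"):
    print(dtype, weight_memory_gb(3e9, dtype))
# bf16 6.0
# int8 3.0
# int4 1.5
```

Relative to the 16-bit baseline this gives 50% for int8 and 25% for int4, matching the approximate figures in the table.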
|
| 223 |
-
|
| 224 |
-
## Best Practices
|
| 225 |
-
|
| 226 |
-
### When to Use Quantization
|
| 227 |
-
|
| 228 |
-
1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
|
| 229 |
-
2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
|
| 230 |
-
3. **Both**: When you need flexibility for different deployment scenarios
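
The three cases above can be written as a small decision helper. This is a sketch, not part of the pipeline: the function name `choose_quant_types` and its two flags are hypothetical, and the returned strings reuse the `int8_weight_only` and `int4_weight_only` names from the quantization options.

```python
def choose_quant_types(has_gpu: bool, need_cpu_deploy: bool) -> list[str]:
    """Pick quantization types following the guidance above."""
    types = []
    if has_gpu:
        # GPU inference: int8 keeps accuracy loss minimal
        types.append("int8_weight_only")
    if need_cpu_deploy or not has_gpu:
        # CPU-only or edge targets: int4 gives the smallest footprint
        types.append("int4_weight_only")
    return types


print(choose_quant_types(has_gpu=True, need_cpu_deploy=True))
```

Returning a list keeps the "both" case explicit, so a caller can loop over the result and run one quantization pass per type.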
|
| 231 |
-
|
| 232 |
-
### Memory Optimization
|
| 233 |
-
|
| 234 |
-
- Use int8 for GPU deployments with memory constraints
|
| 235 |
-
- Use int4 for CPU deployments or very memory-constrained environments
|
| 236 |
-
- Consider the trade-off between speed and accuracy
|
| 237 |
-
|
| 238 |
-
### Deployment Considerations
|
| 239 |
-
|
| 240 |
-
- Test quantized models on your specific use case
|
| 241 |
-
- Monitor performance and accuracy in production
|
| 242 |
-
- Consider using the main model for development and quantized versions for deployment
|
| 243 |
-
|
| 244 |
-
## Troubleshooting
|
| 245 |
-
|
| 246 |
-
### Common Issues
|
| 247 |
-
|
| 248 |
-
1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
|
| 249 |
-
2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"` (quote the specifier so the shell does not treat `>` as a redirect)
|
| 250 |
-
3. **Model Loading Errors**: Ensure the model path is correct and accessible
|
| 251 |
-
|
| 252 |
-
### Debugging
|
| 253 |
-
|
| 254 |
-
```bash
|
| 255 |
-
# Test quantization functionality
|
| 256 |
-
python tests/test_quantization.py
|
| 257 |
-
|
| 258 |
-
# Check torchao installation
|
| 259 |
-
python -c "import torchao; print('torchao available')"
|
| 260 |
-
|
| 261 |
-
# Verify model files
|
| 262 |
-
ls -la /path/to/model/
|
| 263 |
-
```
|
| 264 |
-
|
| 265 |
-
## Monitoring and Tracking
|
| 266 |
-
|
| 267 |
-
### Trackio Integration
|
| 268 |
-
|
| 269 |
-
Quantization events are logged to Trackio:
|
| 270 |
-
|
| 271 |
-
- `quantization_started`: When quantization begins
|
| 272 |
-
- `quantization_completed`: When quantization finishes
|
| 273 |
-
- `quantized_model_pushed`: When model is uploaded to HF Hub
|
| 274 |
-
- `quantization_failed`: If quantization fails
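
The event sequence can be pictured with a tiny in-memory recorder. This sketch does not use Trackio's API: `QuantizationEventLog` and `run_with_events` are hypothetical stand-ins that only show when each of the event names above would be emitted.

```python
import time
from dataclasses import dataclass, field


@dataclass
class QuantizationEventLog:
    """Hypothetical in-memory stand-in for a monitoring client."""
    events: list = field(default_factory=list)

    def log(self, name: str, **details) -> None:
        self.events.append({"event": name, "time": time.time(), **details})


def run_with_events(log: QuantizationEventLog, quantize_fn, quant_type: str):
    """Wrap a quantization call so start, success, and failure are all recorded."""
    log.log("quantization_started", quant_type=quant_type)
    try:
        result = quantize_fn()
    except Exception as exc:
        log.log("quantization_failed", quant_type=quant_type, error=str(exc))
        raise
    log.log("quantization_completed", quant_type=quant_type)
    return result


log = QuantizationEventLog()
run_with_events(log, lambda: "quantized-model", "int8_weight_only")
print([e["event"] for e in log.events])
```

Logging the failure before re-raising means the monitoring record survives even when the caller does not catch the exception.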
|
| 275 |
-
|
| 276 |
-
### Metrics Tracked
|
| 277 |
-
|
| 278 |
-
- Quantization type and parameters
|
| 279 |
-
- Model size reduction
|
| 280 |
-
- Upload URLs for quantized models
|
| 281 |
-
- Processing time and success status
|
| 282 |
-
|
| 283 |
-
## Dependencies
|
| 284 |
-
|
| 285 |
-
### Required Packages
|
| 286 |
-
|
| 287 |
-
```bash
# Quote version specifiers so the shell does not treat ">" as a redirect
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```
|
| 292 |
-
|
| 293 |
-
### Optional Dependencies
|
| 294 |
-
|
| 295 |
-
```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization
```
|
| 299 |
-
|
| 300 |
-
## References
|
| 301 |
-
|
| 302 |
-
- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
|
| 303 |
-
- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
|
| 304 |
-
- [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
|
| 305 |
-
|
| 306 |
-
## Support
|
| 307 |
-
|
| 308 |
-
For issues and questions:
|
| 309 |
-
|
| 310 |
-
1. Check the troubleshooting section above
|
| 311 |
-
2. Review the test files in `tests/test_quantization.py`
|
| 312 |
-
3. Open an issue on the project repository
|
| 313 |
-
4. Check the Trackio monitoring for detailed logs
docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md
DELETED
|
@@ -1,248 +0,0 @@
|
|
| 1 |
-
# Quantization Implementation Summary
|
| 2 |
-
|
| 3 |
-
This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline.
|
| 4 |
-
|
| 5 |
-
## 🚀 New Features Added
|
| 6 |
-
|
| 7 |
-
### 1. Core Quantization Scripts
|
| 8 |
-
|
| 9 |
-
#### `scripts/model_tonic/quantize_model.py`
|
| 10 |
-
- **Main quantization script** with full HF Hub integration
|
| 11 |
-
- Supports int8 (GPU) and int4 (CPU) quantization
|
| 12 |
-
- Automatic model card and README generation
|
| 13 |
-
- Trackio monitoring integration
|
| 14 |
-
- Comprehensive error handling and validation
|
| 15 |
-
|
| 16 |
-
#### `scripts/model_tonic/quantize_standalone.py`
|
| 17 |
-
- **Standalone quantization script** for independent use
|
| 18 |
-
- Simple command-line interface
|
| 19 |
-
- Option to save locally without pushing to HF Hub
|
| 20 |
-
- Quick quantization workflow
|
| 21 |
-
|
| 22 |
-
### 2. Pipeline Integration
|
| 23 |
-
|
| 24 |
-
#### Updated `launch.sh`
|
| 25 |
-
- **Interactive quantization prompts** after model training
|
| 26 |
-
- Support for single or dual quantization (int8 + int4)
|
| 27 |
-
- Automatic repository naming with quantization suffixes
|
| 28 |
-
- Enhanced summary reporting with quantization results
|
| 29 |
-
|
| 30 |
-
### 3. Documentation
|
| 31 |
-
|
| 32 |
-
#### `docs/QUANTIZATION_GUIDE.md`
|
| 33 |
-
- **Comprehensive quantization guide**
|
| 34 |
-
- Usage examples and best practices
|
| 35 |
-
- Performance comparisons
|
| 36 |
-
- Troubleshooting section
|
| 37 |
-
- Advanced configuration options
|
| 38 |
-
|
| 39 |
-
#### Updated `README.md`
|
| 40 |
-
- **Quantization section** with quick start examples
|
| 41 |
-
- Integration with main pipeline documentation
|
| 42 |
-
- Loading quantized models examples
|
| 43 |
-
|
| 44 |
-
### 4. Testing
|
| 45 |
-
|
| 46 |
-
#### `tests/test_quantization.py`
|
| 47 |
-
- **Comprehensive test suite** for quantization functionality
|
| 48 |
-
- Tests for imports, initialization, configuration creation
|
| 49 |
-
- Model validation and documentation generation tests
|
| 50 |
-
- Automated testing workflow
|
| 51 |
-
|
| 52 |
-
### 5. Dependencies
|
| 53 |
-
|
| 54 |
-
#### Updated `requirements/requirements.txt`
|
| 55 |
-
- **Added torchao>=0.10.0** for quantization support
|
| 56 |
-
- Maintains compatibility with existing dependencies
|
| 57 |
-
|
| 58 |
-
## 🔧 Quantization Types Supported
|
| 59 |
-
|
| 60 |
-
### int8_weight_only (GPU Optimized)
|
| 61 |
-
- **Memory Reduction**: ~50%
|
| 62 |
-
- **Accuracy**: Minimal degradation
|
| 63 |
-
- **Speed**: Faster inference
|
| 64 |
-
- **Hardware**: GPU optimized
|
| 65 |
-
- **Use Case**: High-performance inference on GPU
|
| 66 |
-
|
| 67 |
-
### int4_weight_only (CPU Optimized)
|
| 68 |
-
- **Memory Reduction**: ~75%
|
| 69 |
-
- **Accuracy**: Some degradation
|
| 70 |
-
- **Speed**: Significantly faster inference
|
| 71 |
-
- **Hardware**: CPU optimized
|
| 72 |
-
- **Use Case**: Deployment on CPU or memory-constrained environments
|
| 73 |
-
|
| 74 |
-
### int8_dynamic (Dynamic Quantization)
|
| 75 |
-
- **Memory Reduction**: ~50%
|
| 76 |
-
- **Accuracy**: Minimal degradation
|
| 77 |
-
- **Speed**: Faster inference
|
| 78 |
-
- **Hardware**: GPU optimized
|
| 79 |
-
- **Use Case**: Dynamic quantization during inference
|
| 80 |
-
|
| 81 |
-
## 📋 Usage Examples
|
| 82 |
-
|
| 83 |
-
### Interactive Pipeline (launch.sh)
|
| 84 |
-
```bash
|
| 85 |
-
./launch.sh
|
| 86 |
-
# Complete training and model push
|
| 87 |
-
# Choose quantization options when prompted:
|
| 88 |
-
# - y/n for quantization
|
| 89 |
-
# - int8_weight_only / int4_weight_only / both
|
| 90 |
-
```
|
| 91 |
-
|
| 92 |
-
### Standalone Quantization
|
| 93 |
-
```bash
# Quantize and push to HF Hub
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
    --quant-type int8_weight_only \
    --token YOUR_HF_TOKEN

# Quantize and save locally
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
    --quant-type int4_weight_only \
    --device cpu \
    --save-only
```
|
| 105 |
-
|
| 106 |
-
### Loading Quantized Models
|
| 107 |
-
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load int8 quantized model (GPU)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Load int4 quantized model (CPU)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
```
|
| 125 |
-
|
| 126 |
-
## 🧪 Testing
|
| 127 |
-
|
| 128 |
-
Run the quantization tests:
|
| 129 |
-
```bash
|
| 130 |
-
python tests/test_quantization.py
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
Tests cover:
|
| 134 |
-
- Import validation
|
| 135 |
-
- Quantizer initialization
|
| 136 |
-
- Configuration creation
|
| 137 |
-
- Model validation
|
| 138 |
-
- Documentation generation
|
| 139 |
-
|
| 140 |
-
## 📊 Performance Comparison
|
| 141 |
-
|
| 142 |
-
| Model Type | Memory Usage | Speed | Accuracy | Hardware |
|------------|--------------|-------|----------|----------|
| Original | 100% | Baseline | Best | GPU/CPU |
| int8 | ~50% | Faster | Minimal loss | GPU |
| int4 | ~25% | Fastest | Some loss | CPU |
|
| 147 |
-
|
| 148 |
-
## 🔍 Key Features
|
| 149 |
-
|
| 150 |
-
### 1. Automatic Integration
|
| 151 |
-
- Seamlessly integrated into the main training pipeline
|
| 152 |
-
- Interactive prompts for quantization options
|
| 153 |
-
- Automatic repository creation and naming
|
| 154 |
-
|
| 155 |
-
### 2. Comprehensive Documentation
|
| 156 |
-
- Automatic model card generation
|
| 157 |
-
- Detailed README creation
|
| 158 |
-
- Usage examples and best practices
|
| 159 |
-
|
| 160 |
-
### 3. Monitoring Integration
|
| 161 |
-
- Trackio logging for quantization events
|
| 162 |
-
- Performance metrics tracking
|
| 163 |
-
- Artifact storage and versioning
|
| 164 |
-
|
| 165 |
-
### 4. Error Handling
|
| 166 |
-
- Robust validation of model paths
|
| 167 |
-
- Graceful handling of quantization failures
|
| 168 |
-
- Detailed error messages and logging
|
| 169 |
-
|
| 170 |
-
### 5. Flexibility
|
| 171 |
-
- Support for multiple quantization types
|
| 172 |
-
- Standalone usage option
|
| 173 |
-
- Custom configuration options
|
| 174 |
-
|
| 175 |
-
## 🛠️ Technical Implementation
|
| 176 |
-
|
| 177 |
-
### Core Components
|
| 178 |
-
|
| 179 |
-
1. **ModelQuantizer Class**
|
| 180 |
-
- Main quantization orchestration
|
| 181 |
-
- HF Hub integration
|
| 182 |
-
- Trackio monitoring
|
| 183 |
-
- Error handling and validation
|
| 184 |
-
|
| 185 |
-
2. **Quantization Configuration**
|
| 186 |
-
- torchao configuration management
|
| 187 |
-
- Device-specific optimizations
|
| 188 |
-
- Group size and parameter tuning
|
| 189 |
-
|
| 190 |
-
3. **Documentation Generation**
|
| 191 |
-
- Automatic model card creation
|
| 192 |
-
- README generation with usage examples
|
| 193 |
-
- Performance and limitation documentation
|
| 194 |
-
|
| 195 |
-
4. **Pipeline Integration**
|
| 196 |
-
- Interactive prompts in launch.sh
|
| 197 |
-
- Automatic repository naming
|
| 198 |
-
- Enhanced summary reporting
|
| 199 |
-
|
| 200 |
-
## 📈 Benefits
|
| 201 |
-
|
| 202 |
-
### For Users
|
| 203 |
-
- **Easy Integration**: Seamless addition to existing pipeline
|
| 204 |
-
- **Multiple Options**: Choose quantization type based on needs
|
| 205 |
-
- **Performance**: Significant memory and speed improvements
|
| 206 |
-
- **Documentation**: Automatic comprehensive documentation
|
| 207 |
-
|
| 208 |
-
### For Deployment
|
| 209 |
-
- **GPU Optimization**: int8 for high-performance inference
|
| 210 |
-
- **CPU Optimization**: int4 for resource-constrained environments
|
| 211 |
-
- **Memory Efficiency**: 50-75% memory reduction
|
| 212 |
-
- **Speed Improvement**: Faster inference times
|
| 213 |
-
|
| 214 |
-
## 🔮 Future Enhancements
|
| 215 |
-
|
| 216 |
-
### Planned Features
|
| 217 |
-
1. **Additional Quantization Types**: Support for more torchao configurations
|
| 218 |
-
2. **Automated Benchmarking**: Performance comparison tools
|
| 219 |
-
3. **Batch Quantization**: Process multiple models simultaneously
|
| 220 |
-
4. **Custom Configurations**: Advanced quantization parameter tuning
|
| 221 |
-
5. **Integration Testing**: End-to-end quantization workflow tests
|
| 222 |
-
|
| 223 |
-
### Potential Improvements
|
| 224 |
-
1. **Quantization-Aware Training**: Support for QAT workflows
|
| 225 |
-
2. **Mixed Precision**: Advanced precision optimization
|
| 226 |
-
3. **Hardware-Specific**: Optimizations for specific GPU/CPU types
|
| 227 |
-
4. **Automated Selection**: Smart quantization type selection
|
| 228 |
-
|
| 229 |
-
## 📚 References
|
| 230 |
-
|
| 231 |
-
- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
|
| 232 |
-
- [Hugging Face Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
|
| 233 |
-
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
|
| 234 |
-
|
| 235 |
-
## 🎯 Summary
|
| 236 |
-
|
| 237 |
-
The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested.
|
| 238 |
-
|
| 239 |
-
Key achievements:
|
| 240 |
-
- ✅ Full pipeline integration
|
| 241 |
-
- ✅ Multiple quantization types
|
| 242 |
-
- ✅ Comprehensive documentation
|
| 243 |
-
- ✅ Robust error handling
|
| 244 |
-
- ✅ Testing suite
|
| 245 |
-
- ✅ Monitoring integration
|
| 246 |
-
- ✅ Standalone usage option
|
| 247 |
-
|
| 248 |
-
The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards.
docs/README_END_TO_END.md
DELETED
|
@@ -1,303 +0,0 @@
|
|
| 1 |
-
# SmolLM3 End-to-End Fine-tuning Pipeline
|
| 2 |
-
|
| 3 |
-
This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
|
| 4 |
-
|
| 5 |
-
## 🚀 Quick Start
|
| 6 |
-
|
| 7 |
-
### 1. Setup Configuration
|
| 8 |
-
|
| 9 |
-
```bash
|
| 10 |
-
# Run the setup script to configure with your information
|
| 11 |
-
python setup_launch.py
|
| 12 |
-
```
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
### 2. Check Requirements
|
| 16 |
-
|
| 17 |
-
```bash
|
| 18 |
-
# Verify all dependencies are installed
|
| 19 |
-
python check_requirements.py
|
| 20 |
-
```
|
| 21 |
-
|
| 22 |
-
### 3. Run the Pipeline
|
| 23 |
-
|
| 24 |
-
```bash
|
| 25 |
-
# Make the script executable and run
|
| 26 |
-
chmod +x launch.sh
|
| 27 |
-
./launch.sh
|
| 28 |
-
```
|
| 29 |
-
This will prompt you for:
|
| 30 |
-
- Your Hugging Face token
|
| 31 |
-
- Optional model and dataset customizations
|
| 32 |
-
|
| 33 |
-
## 📋 What the Pipeline Does
|
| 34 |
-
|
| 35 |
-
The end-to-end pipeline performs the following steps:
|
| 36 |
-
|
| 37 |
-
### 1. **Environment Setup**
|
| 38 |
-
- Installs system dependencies
|
| 39 |
-
- Creates Python virtual environment
|
| 40 |
-
- Installs PyTorch with CUDA support
|
| 41 |
-
- Installs all required Python packages
|
| 42 |
-
|
| 43 |
-
### 2. **Trackio Space Deployment**
|
| 44 |
-
- Creates a new Hugging Face Space for experiment tracking
|
| 45 |
-
- Configures the Trackio monitoring interface
|
| 46 |
-
- Sets up environment variables
|
| 47 |
-
|
| 48 |
-
### 3. **HF Dataset Setup**
|
| 49 |
-
- Creates a Hugging Face Dataset repository for experiment storage
|
| 50 |
-
- Configures dataset access and permissions
|
| 51 |
-
- Sets up initial experiment data structure
|
| 52 |
-
|
| 53 |
-
### 4. **Dataset Preparation**
|
| 54 |
-
- Downloads the specified dataset from Hugging Face Hub
|
| 55 |
-
- Converts to training format (prompt/completion pairs)
|
| 56 |
-
- Handles multiple dataset formats automatically
|
| 57 |
-
- Creates train/validation splits
|
| 58 |
-
|
| 59 |
-
### 5. **Training Configuration**
|
| 60 |
-
- Creates optimized training configuration
|
| 61 |
-
- Sets up monitoring integration
|
| 62 |
-
- Configures model parameters and hyperparameters
|
| 63 |
-
|
| 64 |
-
### 6. **Model Training**
|
| 65 |
-
- Runs the SmolLM3 fine-tuning process
|
| 66 |
-
- Logs metrics to Trackio Space in real-time
|
| 67 |
-
- Saves experiment data to HF Dataset
|
| 68 |
-
- Creates checkpoints during training
|
| 69 |
-
|
| 70 |
-
### 7. **Model Deployment**
|
| 71 |
-
- Pushes trained model to Hugging Face Hub
|
| 72 |
-
- Creates comprehensive model card
|
| 73 |
-
- Uploads training results and logs
|
| 74 |
-
- Tests the uploaded model
|
| 75 |
-
|
| 76 |
-
### 8. **Summary Report**
|
| 77 |
-
- Generates detailed training summary
|
| 78 |
-
- Provides links to all resources
|
| 79 |
-
- Documents configuration and results
|
| 80 |
-
|
| 81 |
-
## 🎯 Features
|
| 82 |
-
|
| 83 |
-
### **Integrated Monitoring**
|
| 84 |
-
- Real-time experiment tracking via Trackio Space
|
| 85 |
-
- Persistent storage in Hugging Face Datasets
|
| 86 |
-
- Comprehensive metrics logging
|
| 87 |
-
- System resource monitoring
|
| 88 |
-
|
| 89 |
-
### **Flexible Dataset Support**
|
| 90 |
-
- Automatic format detection and conversion
|
| 91 |
-
- Support for multiple dataset types
|
| 92 |
-
- Built-in data preprocessing
|
| 93 |
-
- Train/validation split handling
|
| 94 |
-
|
| 95 |
-
### **Optimized Training**
|
| 96 |
-
- Flash Attention support for efficiency
|
| 97 |
-
- Gradient checkpointing for memory optimization
|
| 98 |
-
- Mixed precision training
|
| 99 |
-
- Automatic hyperparameter optimization
|
| 100 |
-
|
| 101 |
-
### **Complete Deployment**
|
| 102 |
-
- Automated model upload to Hugging Face Hub
|
| 103 |
-
- Comprehensive model cards
|
| 104 |
-
- Training results documentation
|
| 105 |
-
- Model testing and validation
|
| 106 |
-
|
| 107 |
-
## 📊 Monitoring & Tracking
|
| 108 |
-
|
| 109 |
-
### **Trackio Space Interface**
|
| 110 |
-
- Real-time training metrics visualization
|
| 111 |
-
- Experiment management and comparison
|
| 112 |
-
- System resource monitoring
|
| 113 |
-
- Training progress tracking
|
| 114 |
-
|
| 115 |
-
### **HF Dataset Storage**
|
| 116 |
-
- Persistent experiment data storage
|
| 117 |
-
- Version-controlled experiment history
|
| 118 |
-
- Collaborative experiment sharing
|
| 119 |
-
- Automated data backup
|
| 120 |
-
|
| 121 |
-
## 🔧 Configuration
|
| 122 |
-
|
| 123 |
-
### **Required Configuration**
|
| 124 |
-
Update these variables in `launch.sh`:
|
| 125 |
-
|
| 126 |
-
```bash
|
| 127 |
-
# Your Hugging Face credentials
|
| 128 |
-
HF_TOKEN="your_hf_token_here"
|
| 129 |
-
HF_USERNAME="your-username"
|
| 130 |
-
|
| 131 |
-
# Model and dataset
|
| 132 |
-
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
|
| 133 |
-
DATASET_NAME="HuggingFaceTB/smoltalk"
|
| 134 |
-
|
| 135 |
-
# Output repositories
|
| 136 |
-
REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
|
| 137 |
-
TRACKIO_DATASET_REPO="your-username/trackio-experiments"
|
| 138 |
-
```
|
| 139 |
-
|
| 140 |
-
### **Training Parameters**
|
| 141 |
-
Customize training parameters:
|
| 142 |
-
|
| 143 |
-
```bash
|
| 144 |
-
# Training configuration
|
| 145 |
-
BATCH_SIZE=2
|
| 146 |
-
GRADIENT_ACCUMULATION_STEPS=8
|
| 147 |
-
LEARNING_RATE=5e-6
|
| 148 |
-
MAX_EPOCHS=3
|
| 149 |
-
MAX_SEQ_LENGTH=4096
|
| 150 |
-
```
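
These values interact: the number of sequences that contribute to one optimizer step is the per-device batch size multiplied by the gradient accumulation steps (and by the number of devices, if more than one). The helper below is a small illustration; `effective_batch_size` is a name chosen for this example, not a function from the pipeline.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Sequences contributing to a single optimizer step."""
    return batch_size * grad_accum_steps * num_devices


# Default settings shown above
print(effective_batch_size(2, 8))    # 16

# Lower-memory alternative: halve the batch, double the accumulation
print(effective_batch_size(1, 16))   # 16
```

Because both settings give the same effective batch size, trading batch size for accumulation steps lowers peak memory without changing how many sequences feed each update; it only makes each update take more forward and backward passes.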
|
| 151 |
-
|
| 152 |
-
## 📁 Output Structure
|
| 153 |
-
|
| 154 |
-
After running the pipeline, you'll have:
|
| 155 |
-
|
| 156 |
-
```
├── training_dataset/                        # Prepared dataset
│   ├── train.json
│   └── validation.json
├── /output-checkpoint/                      # Model checkpoints
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_results/
├── training.log                             # Training logs
├── training_summary.md                      # Summary report
└── config/train_smollm3_end_to_end.py       # Training config
```
|
| 168 |
-
|
| 169 |
-
## 🌐 Online Resources
|
| 170 |
-
|
| 171 |
-
The pipeline creates these online resources:
|
| 172 |
-
|
| 173 |
-
- **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
|
| 174 |
-
- **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
|
| 175 |
-
- **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
|
| 176 |
-
|
| 177 |
-
## 🛠️ Troubleshooting
|
| 178 |
-
|
| 179 |
-
### **Common Issues**
|
| 180 |
-
|
| 181 |
-
1. **HF Token Issues**
   ```bash
   # Verify your token is correct
   hf auth whoami
   ```

2. **CUDA Issues**
   ```bash
   # Check CUDA availability
   python -c "import torch; print(torch.cuda.is_available())"
   ```

3. **Memory Issues**
   ```bash
   # Reduce the batch size and raise gradient accumulation to compensate
   BATCH_SIZE=1
   GRADIENT_ACCUMULATION_STEPS=16
   ```

4. **Dataset Issues**
   ```bash
   # Test dataset access
   python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
   ```
|
| 205 |
-
|
| 206 |
-
### **Debug Mode**
|
| 207 |
-
|
| 208 |
-
Run individual components for debugging:
|
| 209 |
-
|
| 210 |
-
```bash
|
| 211 |
-
# Test Trackio deployment
|
| 212 |
-
cd scripts/trackio_tonic
|
| 213 |
-
python deploy_trackio_space.py
|
| 214 |
-
|
| 215 |
-
# Test dataset setup
|
| 216 |
-
cd scripts/dataset_tonic
|
| 217 |
-
python setup_hf_dataset.py
|
| 218 |
-
|
| 219 |
-
# Test training
|
| 220 |
-
python src/train.py config/train_smollm3_end_to_end.py --help
|
| 221 |
-
```
|
| 222 |
-
|
| 223 |
-
## 📚 Advanced Usage
|
| 224 |
-
|
| 225 |
-
### **Custom Datasets**
|
| 226 |
-
|
| 227 |
-
For custom datasets, ensure they have one of these formats:
|
| 228 |
-
|
| 229 |
-
```json
// Format 1: Prompt/Completion
{
  "prompt": "What is machine learning?",
  "completion": "Machine learning is..."
}

// Format 2: Instruction/Output
{
  "instruction": "Explain machine learning",
  "output": "Machine learning is..."
}

// Format 3: Chat format
{
  "messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."}
  ]
}
```
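
To show how the three formats relate, here is a minimal sketch that maps any of them onto a prompt/completion pair. It is not the pipeline's own conversion code, which is not shown here: `to_prompt_completion` is a hypothetical helper, and for the chat format it assumes only the first user turn and the first assistant turn are kept.

```python
def to_prompt_completion(example: dict) -> dict:
    """Normalize one record from any of the three formats above."""
    if "prompt" in example and "completion" in example:
        return {"prompt": example["prompt"], "completion": example["completion"]}

    if "instruction" in example and "output" in example:
        return {"prompt": example["instruction"], "completion": example["output"]}

    if "messages" in example:
        user = [m["content"] for m in example["messages"] if m["role"] == "user"]
        assistant = [m["content"] for m in example["messages"] if m["role"] == "assistant"]
        if user and assistant:
            # Simplification: keep only the first exchange
            return {"prompt": user[0], "completion": assistant[0]}

    raise ValueError(f"Unrecognized record format with keys: {sorted(example)}")


record = {"messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."},
]}
print(to_prompt_completion(record))
```

Raising on unknown keys, instead of silently skipping the record, makes a mislabeled dataset fail early during preparation instead of during training.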
|
| 250 |
-
|
| 251 |
-
### **Custom Models**
|
| 252 |
-
|
| 253 |
-
To use different models, update the configuration:
|
| 254 |
-
|
| 255 |
-
```bash
|
| 256 |
-
MODEL_NAME="microsoft/DialoGPT-medium"
|
| 257 |
-
MAX_SEQ_LENGTH=1024
|
| 258 |
-
```
|
| 259 |
-
|
| 260 |
-
### **Custom Training**
|
| 261 |
-
|
| 262 |
-
Modify training parameters in the generated config:
|
| 263 |
-
|
| 264 |
-
```python
# In config/train_smollm3_end_to_end.py
config = SmolLM3Config(
    learning_rate=1e-5,  # Custom learning rate
    max_iters=5000,      # Custom training steps
    # ... other parameters
)
```
|
| 272 |
-
|
| 273 |
-
## 🤝 Contributing
|
| 274 |
-
|
| 275 |
-
1. Fork the repository
|
| 276 |
-
2. Create a feature branch
|
| 277 |
-
3. Make your changes
|
| 278 |
-
4. Test the pipeline
|
| 279 |
-
5. Submit a pull request
|
| 280 |
-
|
| 281 |
-
## 📄 License
|
| 282 |
-
|
| 283 |
-
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 284 |
-
|
| 285 |
-
## 🙏 Acknowledgments
|
| 286 |
-
|
| 287 |
-
- Hugging Face for the excellent transformers library
|
| 288 |
-
- The SmolLM3 team for the base model
|
| 289 |
-
- The Trackio team for experiment tracking
|
| 290 |
-
- The open-source community for contributions
|
| 291 |
-
|
| 292 |
-
## 📞 Support
|
| 293 |
-
|
| 294 |
-
For issues and questions:
|
| 295 |
-
|
| 296 |
-
1. Check the troubleshooting section
|
| 297 |
-
2. Review the logs in `training.log`
|
| 298 |
-
3. Check the Trackio Space for monitoring data
|
| 299 |
-
4. Open an issue on GitHub
|
| 300 |
-
|
| 301 |
-
---
|
| 302 |
-
|
| 303 |
-
**Happy Fine-tuning! 🚀**
docs/SFT_TRAINER_CONFIG_USAGE.md
DELETED
|
@@ -1,233 +0,0 @@
|
|
| 1 |
-
# SFT Trainer Configuration Usage Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This guide describes how the SFT (Supervised Fine-tuning) trainer uses the premade configuration files and how the `trainer_type` field is passed through the system.
|
| 6 |
-
|
| 7 |
-
## How SFT Trainer Uses Premade Configs
|
| 8 |
-
|
| 9 |
-
### 1. Configuration Loading Process
|
| 10 |
-
|
| 11 |
-
The SFT trainer uses premade configs through the following process:
|
| 12 |
-
|
| 13 |
-
1. **Config File Selection**: Users specify a config file via command line or launch script
|
| 14 |
-
2. **Config Loading**: The system loads the config using `get_config()` function
|
| 15 |
-
3. **Config Inheritance**: All configs inherit from `SmolLM3Config` base class
|
| 16 |
-
4. **Trainer Type Detection**: The system checks for `trainer_type` field in the config
|
| 17 |
-
5. **Training Arguments Creation**: Config parameters are used to create `TrainingArguments`
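
Step 4 can be illustrated in isolation. The sketch below assumes a command-line value takes precedence over the config field and that `sft` is the fallback. `DemoConfig` is a hypothetical stand-in for `SmolLM3Config`, and the `--trainer-type` flag name is an assumption made for this example.

```python
import argparse
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for SmolLM3Config with only two fields."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    trainer_type: str = "sft"


def resolve_trainer_type(cli_value, config) -> str:
    """Command-line value wins; otherwise use the config field, else 'sft'."""
    return (cli_value or getattr(config, "trainer_type", "sft")).lower()


parser = argparse.ArgumentParser()
parser.add_argument("--trainer-type", dest="trainer_type", default=None)

# Explicit argument list so the example does not depend on sys.argv
args = parser.parse_args(["--trainer-type", "DPO"])
print(resolve_trainer_type(args.trainer_type, DemoConfig()))  # dpo

args = parser.parse_args([])
print(resolve_trainer_type(args.trainer_type, DemoConfig()))  # sft
```

Using `getattr` with a default keeps older config files, which may not define `trainer_type`, working without changes.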
|
| 18 |
-
|
| 19 |
-
### 2. Configuration Parameters Used by SFT Trainer
|
| 20 |
-
|
| 21 |
-
The SFT trainer uses the following config parameters:
|
| 22 |
-
|
| 23 |
-
#### Model Configuration
|
| 24 |
-
- `model_name`: Model to load (e.g., "HuggingFaceTB/SmolLM3-3B")
|
| 25 |
-
- `max_seq_length`: Maximum sequence length for tokenization
|
| 26 |
-
- `use_flash_attention`: Whether to use flash attention
|
| 27 |
-
- `use_gradient_checkpointing`: Whether to use gradient checkpointing
|
| 28 |
-
|
| 29 |
-
#### Training Configuration
|
| 30 |
-
- `batch_size`: Per-device batch size
|
| 31 |
-
- `gradient_accumulation_steps`: Gradient accumulation steps
|
| 32 |
-
- `learning_rate`: Learning rate for optimization
|
| 33 |
-
- `weight_decay`: Weight decay for optimizer
|
| 34 |
-
- `warmup_steps`: Number of warmup steps
|
| 35 |
-
- `max_iters`: Maximum training iterations
|
| 36 |
-
- `save_steps`: Save checkpoint every N steps
|
| 37 |
-
- `eval_steps`: Evaluate every N steps
|
| 38 |
-
- `logging_steps`: Log every N steps
|
| 39 |
-
|
| 40 |
-
#### Optimizer Configuration
|
| 41 |
-
- `optimizer`: Optimizer type (e.g., "adamw_torch")
|
| 42 |
-
- `beta1`, `beta2`, `eps`: Optimizer parameters
|
| 43 |
-
|
| 44 |
-
#### Scheduler Configuration
|
| 45 |
-
- `scheduler`: Learning rate scheduler type
|
| 46 |
-
- `min_lr`: Minimum learning rate
|
| 47 |
-
|
| 48 |
-
#### Mixed Precision
|
| 49 |
-
- `fp16`: Whether to use fp16 precision
|
| 50 |
-
- `bf16`: Whether to use bf16 precision
|
| 51 |
-
|
| 52 |
-
#### Data Configuration
|
| 53 |
-
- `dataset_name`: Hugging Face dataset name
|
| 54 |
-
- `data_dir`: Local dataset directory
|
| 55 |
-
- `train_file`: Training file name
|
| 56 |
-
- `validation_file`: Validation file name
|
| 57 |
-
|
| 58 |
-
#### Monitoring Configuration
|
| 59 |
-
- `enable_tracking`: Whether to enable Trackio tracking
|
| 60 |
-
- `trackio_url`: Trackio server URL
|
| 61 |
-
- `experiment_name`: Experiment name for tracking
|
| 62 |
-
|
| 63 |
-
### 3. Training Arguments Creation
|
| 64 |
-
|
| 65 |
-
The SFT trainer creates `TrainingArguments` from config parameters:
|
| 66 |
-
|
| 67 |
-
```python
def get_training_arguments(self, output_dir: str, **kwargs) -> TrainingArguments:
    training_args = {
        "output_dir": output_dir,
        "per_device_train_batch_size": self.config.batch_size,
        "per_device_eval_batch_size": self.config.batch_size,
        "gradient_accumulation_steps": self.config.gradient_accumulation_steps,
        "learning_rate": self.config.learning_rate,
        "weight_decay": self.config.weight_decay,
        "warmup_steps": self.config.warmup_steps,
        "max_steps": self.config.max_iters,
        "save_steps": self.config.save_steps,
        "eval_steps": self.config.eval_steps,
        "logging_steps": self.config.logging_steps,
        "fp16": self.config.fp16,
        "bf16": self.config.bf16,
        # ... additional parameters
    }
    return TrainingArguments(**training_args)
```
|
| 87 |
-
|
| 88 |
-
### 4. Trainer Selection Logic
|
| 89 |
-
|
| 90 |
-
The system determines which trainer to use based on the `trainer_type` field:
|
| 91 |
-
|
| 92 |
-
```python
# Determine trainer type (command line overrides config)
trainer_type = args.trainer_type or getattr(config, 'trainer_type', 'sft')

# Initialize trainer based on type
if trainer_type.lower() == 'dpo':
    trainer = SmolLM3DPOTrainer(...)
else:
    trainer = SmolLM3Trainer(...)  # SFT trainer
```
|
| 102 |
-
|
| 103 |
-
## Configuration Files Structure
|
| 104 |
-
|
| 105 |
-
### Base Config (`config/train_smollm3.py`)
|
| 106 |
-
|
| 107 |
-
```python
@dataclass
class SmolLM3Config:
    # Trainer type selection
    trainer_type: str = "sft"  # "sft" or "dpo"

    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    # ... other fields
```
|
| 118 |
-
|
| 119 |
-
### DPO Config (`config/train_smollm3_dpo.py`)
|
| 120 |
-
|
| 121 |
-
```python
@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    # Trainer type selection
    trainer_type: str = "dpo"  # Override default to use DPO trainer

    # DPO-specific configuration
    beta: float = 0.1
    # ... DPO-specific fields
```
|
| 131 |
-
|
| 132 |
-
### Specialized Configs (e.g., `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`)
|
| 133 |
-
|
| 134 |
-
```python
@dataclass
class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
    # Inherits trainer_type = "sft" from base config

    # Specialized configuration for multiple passes
    batch_size: int = 6
    gradient_accumulation_steps: int = 20
    learning_rate: float = 3e-6
    max_iters: int = 25000
    # ... other specialized fields
```
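
For orientation, the effective batch size implied by the specialized values can be worked out directly. This is a rough sketch that assumes the usual definition (per-device batch size times gradient accumulation steps times number of devices); the helper name `effective_batch_size` and the device counts are illustrative and not taken from the project.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


# Values from the specialized config: batch_size=6, gradient_accumulation_steps=20.
print(effective_batch_size(6, 20))     # 120 on a single device
print(effective_batch_size(6, 20, 4))  # 480 on four devices (hypothetical count)
```

With `max_iters` counting optimizer steps, a larger accumulation value trades more samples per step for fewer updates per unit of data.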
|
| 146 |
-
|
| 147 |
-
## Trainer Type Priority
|
| 148 |
-
|
| 149 |
-
The trainer type is determined in the following order of priority:
|
| 150 |
-
|
| 151 |
-
1. **Command line argument** (`--trainer_type`) - Highest priority
|
| 152 |
-
2. **Config file** (`trainer_type` field) - Medium priority
|
| 153 |
-
3. **Default value** (`"sft"`) - Lowest priority
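
The priority order above can be sketched as a small helper. This is a minimal illustration rather than the project's actual code: `resolve_trainer_type` and the `SimpleNamespace` stand-ins for the config object are hypothetical names introduced here.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_trainer_type(cli_value: Optional[str], config: object) -> str:
    """Return the trainer type: CLI value, then config field, then "sft"."""
    # An unset or empty CLI value falls through to the config field, and a
    # config without the field falls through to the default.
    value = cli_value or getattr(config, "trainer_type", None) or "sft"
    return value.lower()


dpo_config = SimpleNamespace(trainer_type="dpo")  # stand-in for a DPO config
bare_config = SimpleNamespace()                   # config without the field

print(resolve_trainer_type("sft", dpo_config))   # CLI value wins over the config
print(resolve_trainer_type(None, dpo_config))    # config field is used
print(resolve_trainer_type(None, bare_config))   # default applies
```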
|
| 154 |
-
|
| 155 |
-
## Usage Examples
|
| 156 |
-
|
| 157 |
-
### Using SFT Trainer with Different Configs
|
| 158 |
-
|
| 159 |
-
```bash
|
| 160 |
-
# Basic SFT training (uses base config)
|
| 161 |
-
python src/train.py config/train_smollm3.py
|
| 162 |
-
|
| 163 |
-
# SFT training with specialized config
|
| 164 |
-
python src/train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
|
| 165 |
-
|
| 166 |
-
# SFT training with override
|
| 167 |
-
python src/train.py config/train_smollm3.py --trainer_type sft
|
| 168 |
-
|
| 169 |
-
# DPO training (uses DPO config)
|
| 170 |
-
python src/train.py config/train_smollm3_dpo.py
|
| 171 |
-
|
| 172 |
-
# Override config's trainer type
|
| 173 |
-
python src/train.py config/train_smollm3.py --trainer_type dpo
|
| 174 |
-
```
|
| 175 |
-
|
| 176 |
-
### Launch Script Usage
|
| 177 |
-
|
| 178 |
-
```bash
|
| 179 |
-
./launch.sh
|
| 180 |
-
# Select "SFT" when prompted for trainer type
|
| 181 |
-
# The system will use the appropriate config based on selection
|
| 182 |
-
```
|
| 183 |
-
|
| 184 |
-
## Configuration Inheritance
|
| 185 |
-
|
| 186 |
-
All specialized configs inherit from `SmolLM3Config` and automatically get:
|
| 187 |
-
|
| 188 |
-
- `trainer_type = "sft"` (default)
|
| 189 |
-
- All base training parameters
|
| 190 |
-
- All monitoring configuration
|
| 191 |
-
- All data configuration
|
| 192 |
-
|
| 193 |
-
Specialized configs can override any of these parameters for their specific use case.
|
| 194 |
-
|
| 195 |
-
## SFT Trainer Features
|
| 196 |
-
|
| 197 |
-
The SFT trainer provides:
|
| 198 |
-
|
| 199 |
-
1. **SFTTrainer Backend**: Uses the `SFTTrainer` from Hugging Face TRL for instruction tuning
|
| 200 |
-
2. **Fallback Support**: Falls back to the standard `transformers` `Trainer` if `SFTTrainer` fails
|
| 201 |
-
3. **Config Integration**: Uses all config parameters for training setup
|
| 202 |
-
4. **Monitoring**: Integrates with Trackio for experiment tracking
|
| 203 |
-
5. **Checkpointing**: Supports model checkpointing and resuming
|
| 204 |
-
6. **Mixed Precision**: Supports fp16 and bf16 training
|
| 205 |
-
|
| 206 |
-
## Troubleshooting
|
| 207 |
-
|
| 208 |
-
### Common Issues
|
| 209 |
-
|
| 210 |
-
1. **Missing trainer_type field**: Ensure all configs have the `trainer_type` field
|
| 211 |
-
2. **Config inheritance issues**: Check that specialized configs properly inherit from base
|
| 212 |
-
3. **Parameter conflicts**: Command line arguments take precedence over config values, so check that they do not unintentionally override the config
|
| 213 |
-
|
| 214 |
-
### Debugging
|
| 215 |
-
|
| 216 |
-
Run training and inspect the log output to confirm which trainer and config are in use:
|
| 217 |
-
|
| 218 |
-
```bash
|
| 219 |
-
python src/train.py config/train_smollm3.py --trainer_type sft
|
| 220 |
-
```
|
| 221 |
-
|
| 222 |
-
Look for these log messages:
|
| 223 |
-
```
|
| 224 |
-
Using trainer type: sft
|
| 225 |
-
Initializing SFT trainer...
|
| 226 |
-
Creating SFTTrainer with training arguments...
|
| 227 |
-
```
|
| 228 |
-
|
| 229 |
-
## Related Documentation
|
| 230 |
-
|
| 231 |
-
- [Trainer Selection Guide](TRAINER_SELECTION_GUIDE.md)
|
| 232 |
-
- [Training Configuration Guide](TRAINING_CONFIGURATION_GUIDE.md)
|
| 233 |
-
- [Monitoring Integration Guide](MONITORING_INTEGRATION_GUIDE.md)
|
docs/TOKEN_FIX_SUMMARY.md
DELETED
|
@@ -1,249 +0,0 @@
|
|
| 1 |
-
# Token Fix Summary
|
| 2 |
-
|
| 3 |
-
## Issue Identified
|
| 4 |
-
|
| 5 |
-
The user encountered an error when running the launch script:
|
| 6 |
-
|
| 7 |
-
```
|
| 8 |
-
usage: hf <command> [<args>]
|
| 9 |
-
hf: error: argument {auth,cache,download,jobs,repo,repo-files,upload,upload-large-folder,env,version,lfs-enable-largefiles,lfs-multipart-upload}: invalid choice: 'login' (choose from 'auth', 'cache', 'download', 'jobs', 'repo', 'repo-files', 'upload', 'upload-large-folder', 'env', 'version', 'lfs-enable-largefiles', 'lfs-multipart-upload')
|
| 10 |
-
❌ Failed to login to Hugging Face
|
| 11 |
-
```
|
| 12 |
-
|
| 13 |
-
## Root Cause
|
| 14 |
-
|
| 15 |
-
The `launch.sh` script was calling `hf login`, which is not a valid subcommand in the installed version of the Hugging Face CLI (as the error output shows, authentication lives under the `auth` subcommand). More broadly, the script relied on CLI commands for authentication instead of the Python API.
|
| 16 |
-
|
| 17 |
-
## Fixes Applied
|
| 18 |
-
|
| 19 |
-
### 1. **Removed HF Login Step** ✅ **FIXED**
|
| 20 |
-
|
| 21 |
-
**File**: `launch.sh`
|
| 22 |
-
|
| 23 |
-
**Before**:
|
| 24 |
-
```bash
# Login to Hugging Face with token
print_info "Logging in to Hugging Face..."
if hf login --token "$HF_TOKEN" --add-to-git-credential; then
    print_status "Successfully logged in to Hugging Face"
    print_info "Username: $(hf whoami)"
else
    print_error "Failed to login to Hugging Face"
    print_error "Please check your token and try again"
    exit 1
fi
```
|
| 36 |
-
|
| 37 |
-
**After**:
|
| 38 |
-
```bash
|
| 39 |
-
# Set HF token for Python API usage
|
| 40 |
-
print_info "Setting up Hugging Face token for Python API..."
|
| 41 |
-
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 42 |
-
print_status "HF token configured for Python API usage"
|
| 43 |
-
print_info "Username: $HF_USERNAME (auto-detected from token)"
|
| 44 |
-
```
|
| 45 |
-
|
| 46 |
-
### 2. **Updated Dataset Setup Script** ✅ **FIXED**
|
| 47 |
-
|
| 48 |
-
**File**: `scripts/dataset_tonic/setup_hf_dataset.py`
|
| 49 |
-
|
| 50 |
-
**Changes**:
|
| 51 |
-
- Updated `main()` function to properly get token from environment
|
| 52 |
-
- Added token validation before proceeding
|
| 53 |
-
- Improved error handling for missing tokens
|
| 54 |
-
|
| 55 |
-
**Before**:
|
| 56 |
-
```python
def main():
    """Main function to set up the dataset."""

    # Get dataset name from command line or use default
    dataset_name = None
    if len(sys.argv) > 2:
        dataset_name = sys.argv[2]

    success = setup_trackio_dataset(dataset_name)
    sys.exit(0 if success else 1)
```
|
| 68 |
-
|
| 69 |
-
**After**:
|
| 70 |
-
```python
import os
import sys


def main():
    """Main function to set up the dataset."""

    # Get token from environment first
    token = os.environ.get('HUGGING_FACE_HUB_TOKEN') or os.environ.get('HF_TOKEN')

    # If no token in environment, try command line argument
    if not token and len(sys.argv) > 1:
        token = sys.argv[1]

    if not token:
        print("❌ No HF token found. Please set HUGGING_FACE_HUB_TOKEN environment variable or provide as argument.")
        sys.exit(1)

    # Get dataset name from command line or use default
    dataset_name = None
    if len(sys.argv) > 2:
        dataset_name = sys.argv[2]

    success = setup_trackio_dataset(dataset_name)
    sys.exit(0 if success else 1)
```
|
| 93 |
-
|
| 94 |
-
### 3. **Updated Launch Script to Pass Token** ✅ **FIXED**
|
| 95 |
-
|
| 96 |
-
**File**: `launch.sh`
|
| 97 |
-
|
| 98 |
-
**Changes**:
|
| 99 |
-
- Updated dataset setup call to pass token as argument
|
| 100 |
-
- Updated Trackio Space deployment call to pass token as argument
|
| 101 |
-
|
| 102 |
-
**Before**:
|
| 103 |
-
```bash
|
| 104 |
-
python setup_hf_dataset.py
|
| 105 |
-
```
|
| 106 |
-
|
| 107 |
-
**After**:
|
| 108 |
-
```bash
|
| 109 |
-
python setup_hf_dataset.py "$HF_TOKEN"
|
| 110 |
-
```
|
| 111 |
-
|
| 112 |
-
**Before**:
|
| 113 |
-
```bash
|
| 114 |
-
python deploy_trackio_space.py << EOF
|
| 115 |
-
$TRACKIO_SPACE_NAME
|
| 116 |
-
$HF_TOKEN
|
| 117 |
-
$GIT_EMAIL
|
| 118 |
-
|
| 119 |
-
EOF
|
| 120 |
-
```
|
| 121 |
-
|
| 122 |
-
**After**:
|
| 123 |
-
```bash
|
| 124 |
-
python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
### 4. **Updated Space Deployment Script** ✅ **FIXED**
|
| 128 |
-
|
| 129 |
-
**File**: `scripts/trackio_tonic/deploy_trackio_space.py`
|
| 130 |
-
|
| 131 |
-
**Changes**:
|
| 132 |
-
- Updated `main()` function to handle command line arguments
|
| 133 |
-
- Added support for both interactive and command-line modes
|
| 134 |
-
- Improved token handling and validation
|
| 135 |
-
|
| 136 |
-
**Before**:
|
| 137 |
-
```python
def main():
    """Main deployment function"""
    print("Trackio Space Deployment Script")
    print("=" * 40)

    # Get user input (no username needed - will be extracted from token)
    space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
    token = input("Enter your Hugging Face token: ").strip()
```
|
| 147 |
-
|
| 148 |
-
**After**:
|
| 149 |
-
```python
def main():
    """Main deployment function"""
    print("Trackio Space Deployment Script")
    print("=" * 40)

    # Check if arguments are provided
    if len(sys.argv) >= 3:
        # Use command line arguments
        space_name = sys.argv[1]
        token = sys.argv[2]
        git_email = sys.argv[3] if len(sys.argv) > 3 else None
        git_name = sys.argv[4] if len(sys.argv) > 4 else None

        print(f"Using provided arguments:")
        print(f" Space name: {space_name}")
        print(f" Token: {'*' * 10}...{token[-4:]}")
        print(f" Git email: {git_email or 'default'}")
        print(f" Git name: {git_name or 'default'}")
    else:
        # Get user input (no username needed - will be extracted from token)
        space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
        token = input("Enter your Hugging Face token: ").strip()
```
|
| 173 |
-
|
| 174 |
-
## Key Improvements
|
| 175 |
-
|
| 176 |
-
### 1. **Complete Python API Usage**
|
| 177 |
-
- ✅ **No CLI commands**: All authentication uses Python API
|
| 178 |
-
- ✅ **Direct token passing**: Token passed directly to functions
|
| 179 |
-
- ✅ **Environment variables**: Proper environment variable setup
|
| 180 |
-
- ✅ **No username required**: Automatic extraction from token
|
| 181 |
-
|
| 182 |
-
### 2. **Robust Error Handling**
|
| 183 |
-
- ✅ **Token validation**: Proper token validation before use
|
| 184 |
-
- ✅ **Environment fallbacks**: Multiple ways to get token
|
| 185 |
-
- ✅ **Clear error messages**: Descriptive error messages
|
| 186 |
-
- ✅ **Graceful degradation**: Fallback mechanisms
|
| 187 |
-
|
| 188 |
-
### 3. **Automated Token Handling**
|
| 189 |
-
- ✅ **Automatic extraction**: Username extracted from token
|
| 190 |
-
- ✅ **Environment setup**: Token set in environment variables
|
| 191 |
-
- ✅ **Command line support**: Token passed as arguments
|
| 192 |
-
- ✅ **No manual input**: No username required
|
| 193 |
-
|
| 194 |
-
## Test Results
|
| 195 |
-
|
| 196 |
-
### **Token Validation Test**
|
| 197 |
-
```bash
|
| 198 |
-
$ python tests/test_token_fix.py
|
| 199 |
-
|
| 200 |
-
🚀 Token Validation and Deployment Tests
|
| 201 |
-
==================================================
|
| 202 |
-
🔍 Testing Token Validation
|
| 203 |
-
✅ Token validation module imported successfully
|
| 204 |
-
✅ Token validation successful!
|
| 205 |
-
✅ Username: Tonic
|
| 206 |
-
|
| 207 |
-
🔍 Testing Dataset Setup
|
| 208 |
-
✅ Dataset setup module imported successfully
|
| 209 |
-
✅ Username extraction successful: Tonic
|
| 210 |
-
|
| 211 |
-
🔍 Testing Space Deployment
|
| 212 |
-
✅ Space deployment module imported successfully
|
| 213 |
-
✅ Space deployer initialization successful
|
| 214 |
-
✅ Username: Tonic
|
| 215 |
-
|
| 216 |
-
==================================================
|
| 217 |
-
🎉 ALL TOKEN TESTS PASSED!
|
| 218 |
-
✅ Token validation: Working
|
| 219 |
-
✅ Dataset setup: Working
|
| 220 |
-
✅ Space deployment: Working
|
| 221 |
-
|
| 222 |
-
The token is working correctly with all components!
|
| 223 |
-
```
|
| 224 |
-
|
| 225 |
-
## User Token
|
| 226 |
-
|
| 227 |
-
**Token**: `xxxx`
|
| 228 |
-
|
| 229 |
-
**Status**: ✅ **Working correctly**
|
| 230 |
-
|
| 231 |
-
**Username**: `Tonic` (auto-detected)
|
| 232 |
-
|
| 233 |
-
## Next Steps
|
| 234 |
-
|
| 235 |
-
The user can now run the launch script without encountering the HF login error:
|
| 236 |
-
|
| 237 |
-
```bash
|
| 238 |
-
./launch.sh
|
| 239 |
-
```
|
| 240 |
-
|
| 241 |
-
The script will:
|
| 242 |
-
1. ✅ **Validate token** using Python API
|
| 243 |
-
2. ✅ **Extract username** automatically from token
|
| 244 |
-
3. ✅ **Set environment variables** for Python API usage
|
| 245 |
-
4. ✅ **Deploy Trackio Space** using Python API
|
| 246 |
-
5. ✅ **Setup HF Dataset** using Python API
|
| 247 |
-
6. ✅ **Configure all components** automatically
|
| 248 |
-
|
| 249 |
-
**No manual username input required!** 🎉
|
docs/TOKEN_VALIDATION_FIX.md
DELETED
|
@@ -1,183 +0,0 @@
|
|
| 1 |
-
# Hugging Face Token Validation Fix
|
| 2 |
-
|
| 3 |
-
## Problem Description
|
| 4 |
-
|
| 5 |
-
The original launch script was using the `hf` CLI command to validate Hugging Face tokens, which was causing authentication failures even with valid tokens. This was due to:
|
| 6 |
-
|
| 7 |
-
1. CLI installation issues
|
| 8 |
-
2. Inconsistent token format handling
|
| 9 |
-
3. Poor error reporting
|
| 10 |
-
|
| 11 |
-
## Solution Implementation
|
| 12 |
-
|
| 13 |
-
### New Python-Based Validation System
|
| 14 |
-
|
| 15 |
-
We've implemented a robust Python-based token validation system using the official `huggingface_hub` API:
|
| 16 |
-
|
| 17 |
-
#### Key Components
|
| 18 |
-
|
| 19 |
-
1. **`scripts/validate_hf_token.py`** - Main validation script
|
| 20 |
-
2. **Updated `launch.sh`** - Modified to use Python validation
|
| 21 |
-
3. **`tests/test_token_validation.py`** - Test suite for validation
|
| 22 |
-
4. **`scripts/check_dependencies.py`** - Dependency verification
|
| 23 |
-
|
| 24 |
-
### Features
|
| 25 |
-
|
| 26 |
-
- ✅ **Robust Error Handling**: Detailed error messages for different failure types
|
| 27 |
-
- ✅ **JSON Output**: Structured responses for easy parsing
|
| 28 |
-
- ✅ **Multiple Input Methods**: Command line arguments or environment variables
|
| 29 |
-
- ✅ **Username Extraction**: Automatically retrieves username from valid tokens
|
| 30 |
-
- ✅ **Dependency Checking**: Verifies required packages are installed
|
| 31 |
-
|
| 32 |
-
## Usage
|
| 33 |
-
|
| 34 |
-
### Direct Script Usage
|
| 35 |
-
|
| 36 |
-
```bash
|
| 37 |
-
# Using command line argument
|
| 38 |
-
python scripts/validate_hf_token.py hf_your_token_here
|
| 39 |
-
|
| 40 |
-
# Using environment variable
|
| 41 |
-
export HF_TOKEN=hf_your_token_here
|
| 42 |
-
python scripts/validate_hf_token.py
|
| 43 |
-
```
|
| 44 |
-
|
| 45 |
-
### Expected Output
|
| 46 |
-
|
| 47 |
-
**Success:**
|
| 48 |
-
```json
|
| 49 |
-
{"success": true, "username": "YourUsername", "error": null}
|
| 50 |
-
```
|
| 51 |
-
|
| 52 |
-
**Failure:**
|
| 53 |
-
```json
|
| 54 |
-
{"success": false, "username": null, "error": "Invalid token - unauthorized access"}
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
### Integration with Launch Script
|
| 58 |
-
|
| 59 |
-
The `launch.sh` script now automatically:
|
| 60 |
-
|
| 61 |
-
1. Prompts for your HF token
|
| 62 |
-
2. Validates it using the Python script
|
| 63 |
-
3. Extracts your username automatically
|
| 64 |
-
4. Provides detailed error messages if validation fails
|
| 65 |
-
|
| 66 |
-
## Error Types and Solutions
|
| 67 |
-
|
| 68 |
-
### Common Error Messages
|
| 69 |
-
|
| 70 |
-
| Error Message | Cause | Solution |
|
| 71 |
-
|---------------|-------|----------|
|
| 72 |
-
| "Invalid token - unauthorized access" | Token is invalid or expired | Generate new token at https://huggingface.co/settings/tokens |
|
| 73 |
-
| "Token lacks required permissions" | Token doesn't have write access | Ensure token has write permissions |
|
| 74 |
-
| "Network error" | Connection issues | Check internet connection |
|
| 75 |
-
| "Failed to run token validation script" | Missing dependencies | Run `pip install huggingface_hub` |
|
| 76 |
-
|
| 77 |
-
### Dependency Installation
|
| 78 |
-
|
| 79 |
-
```bash
|
| 80 |
-
# Install required dependencies
|
| 81 |
-
pip install huggingface_hub
|
| 82 |
-
|
| 83 |
-
# Check all dependencies
|
| 84 |
-
python scripts/check_dependencies.py
|
| 85 |
-
|
| 86 |
-
# Install all requirements
|
| 87 |
-
pip install -r requirements/requirements.txt
|
| 88 |
-
```
|
| 89 |
-
|
| 90 |
-
## Testing
|
| 91 |
-
|
| 92 |
-
### Run the Test Suite
|
| 93 |
-
|
| 94 |
-
```bash
|
| 95 |
-
python tests/test_token_validation.py
|
| 96 |
-
```
|
| 97 |
-
|
| 98 |
-
### Manual Testing
|
| 99 |
-
|
| 100 |
-
```bash
|
| 101 |
-
# Test with your token
|
| 102 |
-
python scripts/validate_hf_token.py hf_your_token_here
|
| 103 |
-
|
| 104 |
-
# Test dependency check
|
| 105 |
-
python scripts/check_dependencies.py
|
| 106 |
-
```
|
| 107 |
-
|
| 108 |
-
## Troubleshooting
|
| 109 |
-
|
| 110 |
-
### If Token Validation Still Fails
|
| 111 |
-
|
| 112 |
-
1. **Check Token Format**: Ensure token starts with `hf_`
|
| 113 |
-
2. **Verify Token Permissions**: Token needs read/write access
|
| 114 |
-
3. **Check Network**: Ensure internet connection is stable
|
| 115 |
-
4. **Update Dependencies**: Run `pip install --upgrade huggingface_hub`
|
| 116 |
-
|
| 117 |
-
### If Launch Script Fails
|
| 118 |
-
|
| 119 |
-
1. **Check Python Path**: Ensure `python3` is available
|
| 120 |
-
2. **Verify Script Permissions**: Script should be executable
|
| 121 |
-
3. **Check JSON Parsing**: Ensure Python can parse JSON output
|
| 122 |
-
4. **Review Error Messages**: Check the specific error in launch.sh output
|
| 123 |
-
|
| 124 |
-
## Technical Details
|
| 125 |
-
|
| 126 |
-
### Token Validation Process
|
| 127 |
-
|
| 128 |
-
1. **Environment Setup**: Sets `HUGGING_FACE_HUB_TOKEN` environment variable
|
| 129 |
-
2. **API Client Creation**: Initializes `HfApi()` client
|
| 130 |
-
3. **User Info Retrieval**: Calls `api.whoami()` to validate token
|
| 131 |
-
4. **Username Extraction**: Extracts username from user info
|
| 132 |
-
5. **Error Handling**: Catches and categorizes different error types
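
The validation steps above can be sketched roughly as follows. This is a hedged illustration, not the contents of `scripts/validate_hf_token.py`: the function name `validate_token` and the injectable `whoami` parameter are assumptions made here so the logic can run offline, and the sketch passes the token directly to `HfApi.whoami(token=...)` instead of setting an environment variable. The result shape mirrors the JSON shown under Expected Output.

```python
import json
from typing import Callable, Optional


def validate_token(token: str,
                   whoami: Optional[Callable[[str], dict]] = None) -> dict:
    """Validate a token and return a dict with success, username and error."""
    if whoami is None:
        # Imported lazily so the function can be exercised with a stub.
        from huggingface_hub import HfApi
        api = HfApi()
        whoami = lambda t: api.whoami(token=t)
    try:
        user_info = whoami(token)
    except Exception as exc:  # the real script may categorize error types
        return {"success": False, "username": None, "error": str(exc)}
    return {"success": True, "username": user_info.get("name"), "error": None}


if __name__ == "__main__":
    # Stub standing in for the Hub call, so this example needs no network.
    result = validate_token("hf_dummy", whoami=lambda t: {"name": "demo-user"})
    print(json.dumps(result))
```

Injecting `whoami` keeps the Hub call replaceable in tests; calling `validate_token(token)` without it would contact the Hub.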
|
| 133 |
-
|
| 134 |
-
### JSON Parsing in Shell
|
| 135 |
-
|
| 136 |
-
The launch script uses Python's JSON parser to safely extract values:
|
| 137 |
-
|
| 138 |
-
```bash
local success=$(echo "$result" | python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get('success', False))
except:
    print('False')
")
```
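
The same extraction can be written as a standalone Python helper, which may be easier to test than an inline `python3 -c` snippet. This is a sketch under assumptions: `parse_validation_result` is a hypothetical name, and the input is assumed to follow the JSON shape shown under Expected Output.

```python
import json
from typing import Optional, Tuple


def parse_validation_result(raw: str) -> Tuple[bool, Optional[str]]:
    """Return (success, username) parsed from the validator's JSON output."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Malformed or missing output is treated as a failed validation.
        return False, None
    if not isinstance(data, dict):
        return False, None
    return bool(data.get("success", False)), data.get("username")


print(parse_validation_result(
    '{"success": true, "username": "YourUsername", "error": null}'))
print(parse_validation_result("not json"))
```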
|
| 148 |
-
|
| 149 |
-
## Migration from Old System
|
| 150 |
-
|
| 151 |
-
### Before (CLI-based)
|
| 152 |
-
```bash
if hf whoami >/dev/null 2>&1; then
    HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
```
|
| 156 |
-
|
| 157 |
-
### After (Python-based)
|
| 158 |
-
```bash
if result=$(python3 scripts/validate_hf_token.py "$token" 2>/dev/null); then
    # Parse JSON result with error handling
    local success=$(echo "$result" | python3 -c "...")
    local username=$(echo "$result" | python3 -c "...")
```
|
| 164 |
-
|
| 165 |
-
## Benefits
|
| 166 |
-
|
| 167 |
-
1. **Reliability**: Uses official Python API instead of CLI
|
| 168 |
-
2. **Error Reporting**: Detailed error messages for debugging
|
| 169 |
-
3. **Cross-Platform**: Works on Windows, Linux, and macOS
|
| 170 |
-
4. **Maintainability**: Easy to update and extend
|
| 171 |
-
5. **Testing**: Comprehensive test suite included
|
| 172 |
-
|
| 173 |
-
## Future Enhancements
|
| 174 |
-
|
| 175 |
-
- [ ] Add token expiration checking
|
| 176 |
-
- [ ] Implement token refresh functionality
|
| 177 |
-
- [ ] Add support for organization tokens
|
| 178 |
-
- [ ] Create GUI for token management
|
| 179 |
-
- [ ] Add token security validation
|
| 180 |
-
|
| 181 |
-
---
|
| 182 |
-
|
| 183 |
-
**Note**: This fix ensures that valid Hugging Face tokens are properly recognized and that users get clear feedback when there are authentication issues.
|
docs/TRACKIO_API_FIX_SUMMARY.md
DELETED
|
@@ -1,276 +0,0 @@
|
|
| 1 |
-
# Trackio API Fix Summary
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
This document summarizes the fixes applied to resolve the 404 errors in the Trackio integration and implement automatic Space URL resolution.
|
| 6 |
-
|
| 7 |
-
## Issues Identified
|
| 8 |
-
|
| 9 |
-
### 1. **404 Errors in Trackio API Calls**
|
| 10 |
-
- **Problem**: The original API client was using incorrect endpoints and HTTP request patterns
|
| 11 |
-
- **Error**: `POST request failed: 404 - Cannot POST /spaces/Tonic/trackio-monitoring-20250727/gradio_api/call/list_experiments_interface`
|
| 12 |
-
- **Root Cause**: Using raw HTTP requests instead of the proper Gradio client API
|
| 13 |
-
|
| 14 |
-
### 2. **Hardcoded Space URL**
|
| 15 |
-
- **Problem**: The Space URL was hardcoded, making it inflexible
|
| 16 |
-
- **Issue**: No automatic resolution of Space URLs from Space IDs
|
| 17 |
-
- **Impact**: Required manual URL updates when Space deployment changes
|
| 18 |
-
|
| 19 |
-
## Solutions Implemented
|
| 20 |
-
|
| 21 |
-
### 1. **Updated API Client to Use Gradio Client**
|
| 22 |
-
|
| 23 |
-
**File**: `scripts/trackio_tonic/trackio_api_client.py`
|
| 24 |
-
|
| 25 |
-
**Changes**:
|
| 26 |
-
- Replaced custom HTTP requests with `gradio_client.Client`
|
| 27 |
-
- Lets `gradio_client` handle the two-step call protocol (POST to obtain an `event_id`, then GET to fetch the results)
|
| 28 |
-
- Handles all Gradio API endpoints correctly
|
| 29 |
-
|
| 30 |
-
**Before**:
|
| 31 |
-
```python
|
| 32 |
-
# Custom HTTP requests with manual event_id handling
|
| 33 |
-
response = requests.post(url, json=payload)
|
| 34 |
-
event_id = response.json()["event_id"]
|
| 35 |
-
result = requests.get(f"{url}/{event_id}")
|
| 36 |
-
```
|
| 37 |
-
|
| 38 |
-
**After**:
|
| 39 |
-
```python
|
| 40 |
-
# Using gradio_client for proper API communication
|
| 41 |
-
result = self.client.predict(*args, api_name=api_name)
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
### 2. **Automatic Space URL Resolution**
|
| 45 |
-
|
| 46 |
-
**Implementation**:
|
| 47 |
-
- Uses Hugging Face Hub API to resolve Space URLs from Space IDs
|
| 48 |
-
- Falls back to default URL format if API is unavailable
|
| 49 |
-
- Supports both authenticated and anonymous access
|
| 50 |
-
|
| 51 |
-
**Key Features**:
|
| 52 |
-
```python
def _resolve_space_url(self) -> Optional[str]:
    """Resolve Space URL using Hugging Face Hub API"""
    api = HfApi(token=self.hf_token)
    space_info = api.space_info(self.space_id)
    if space_info and hasattr(space_info, 'host'):
        return space_info.host
    else:
        # Fallback to default URL format
        space_name = self.space_id.replace('/', '-')
        return f"https://{space_name}.hf.space"
```
|
| 64 |
-
|
| 65 |
-
### 3. **Updated Client Interface**
|
| 66 |
-
|
| 67 |
-
**Before**:
|
| 68 |
-
```python
|
| 69 |
-
client = TrackioAPIClient("https://tonic-trackio-monitoring-20250727.hf.space")
|
| 70 |
-
```
|
| 71 |
-
|
| 72 |
-
**After**:
|
| 73 |
-
```python
|
| 74 |
-
client = TrackioAPIClient("Tonic/trackio-monitoring-20250727", hf_token)
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
### 4. **Enhanced Monitoring Integration**
|
| 78 |
-
|
| 79 |
-
**File**: `src/monitoring.py`
|
| 80 |
-
|
| 81 |
-
**Changes**:
|
| 82 |
-
- Updated to use Space ID instead of hardcoded URL
|
| 83 |
-
- Automatic experiment creation with proper ID extraction
|
| 84 |
-
- Better error handling and fallback mechanisms
|
| 85 |
-
|
| 86 |
-
## Dependencies Added
|
| 87 |
-
|
| 88 |
-
### Required Packages
|
| 89 |
-
```bash
|
| 90 |
-
pip install gradio_client huggingface_hub
|
| 91 |
-
```
|
| 92 |
-
|
| 93 |
-
### Package Versions
|
| 94 |
-
- `gradio_client>=1.10.4` - For proper Gradio API communication
|
| 95 |
-
- `huggingface_hub>=0.19.3` - For Space URL resolution
|
| 96 |
-
|
| 97 |
-
## API Endpoints Supported
|
| 98 |
-
|
| 99 |
-
The updated client supports all documented Gradio endpoints:
|
| 100 |
-
|
| 101 |
-
1. **Experiment Management**:
|
| 102 |
-
- `/create_experiment_interface` - Create new experiments
|
| 103 |
-
- `/list_experiments_interface` - List all experiments
|
| 104 |
-
- `/get_experiment_details` - Get experiment details
|
| 105 |
-
- `/update_experiment_status_interface` - Update experiment status
|
| 106 |
-
|
| 107 |
-
2. **Metrics and Parameters**:
|
| 108 |
-
- `/log_metrics_interface` - Log training metrics
|
| 109 |
-
- `/log_parameters_interface` - Log experiment parameters
|
| 110 |
-
|
| 111 |
-
3. **Visualization**:
|
| 112 |
-
- `/create_metrics_plot` - Create metrics plots
|
| 113 |
-
- `/create_experiment_comparison` - Compare experiments
|
| 114 |
-
|
| 115 |
-
4. **Testing and Demo**:
|
| 116 |
-
- `/simulate_training_data` - Simulate training data
|
| 117 |
-
- `/create_demo_experiment` - Create demo experiments
|
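A thin wrapper around the endpoints listed above might look like the following. It is a hedged sketch only: `call_endpoint` and `FakeClient` are hypothetical names, the argument lists of the real endpoints are not documented here, and the fake client exists purely so the example runs without network access.

```python
from typing import Any, Dict


def call_endpoint(client: Any, api_name: str, *args: Any) -> Dict[str, Any]:
    """Call one named Gradio endpoint and wrap the outcome in a dict."""
    try:
        data = client.predict(*args, api_name=api_name)
        return {"success": True, "data": data}
    except Exception as exc:  # network errors, bad arguments, sleeping Space
        return {"success": False, "error": str(exc)}


class FakeClient:
    """Offline stand-in for a gradio_client.Client instance."""

    def predict(self, *args: Any, api_name: str) -> str:
        if api_name != "/list_experiments_interface":
            raise ValueError(f"unknown endpoint {api_name}")
        return "no experiments yet"


print(call_endpoint(FakeClient(), "/list_experiments_interface"))
print(call_endpoint(FakeClient(), "/missing"))
```

In real use, a `gradio_client.Client` constructed for the Space would take the place of `FakeClient`; returning a result dict rather than raising keeps a training loop alive when the monitoring Space is unreachable.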
| 118 |
-
|
| 119 |
-
## Configuration
|
| 120 |
-
|
| 121 |
-
### Environment Variables
|
| 122 |
-
```bash
|
| 123 |
-
# Required for Space URL resolution
|
| 124 |
-
export HF_TOKEN="your_huggingface_token"
|
| 125 |
-
|
| 126 |
-
# Optional: Custom Space ID
|
| 127 |
-
export TRACKIO_SPACE_ID="your-username/your-space-name"
|
| 128 |
-
|
| 129 |
-
# Optional: Dataset repository
|
| 130 |
-
export TRACKIO_DATASET_REPO="your-username/your-dataset"
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
### Default Configuration
|
| 134 |
-
- **Default Space ID**: `Tonic/trackio-monitoring-20250727`
|
| 135 |
-
- **Default Dataset**: `tonic/trackio-experiments`
|
| 136 |
-
- **Auto-resolution**: Enabled by default
|
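One way to combine the environment variables with the documented defaults is sketched below. The function name `load_trackio_settings` and the returned dictionary keys are illustrative assumptions; only the variable names and default values come from this document.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_SPACE_ID = "Tonic/trackio-monitoring-20250727"
DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


def load_trackio_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Read Trackio settings from the environment, applying the documented defaults."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default: may be None
        "space_id": env.get("TRACKIO_SPACE_ID", DEFAULT_SPACE_ID),
        "dataset_repo": env.get("TRACKIO_DATASET_REPO", DEFAULT_DATASET_REPO),
    }


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_trackio_settings({"TRACKIO_SPACE_ID": "me/my-space"})
print(settings["space_id"], settings["dataset_repo"])
```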
| 137 |
-
|
| 138 |
-
## Testing
|
| 139 |
-
|
| 140 |
-
### Test Script
|
| 141 |
-
**File**: `tests/test_trackio_api_fix.py`
|
| 142 |
-
|
| 143 |
-
**Tests Included**:
|
| 144 |
-
1. **Space URL Resolution** - Tests automatic URL resolution
|
| 145 |
-
2. **API Client** - Tests all API endpoints
|
| 146 |
-
3. **Monitoring Integration** - Tests full monitoring workflow
|
| 147 |
-
|
| 148 |
-
### Running Tests
|
| 149 |
-
```bash
|
| 150 |
-
python tests/test_trackio_api_fix.py
|
| 151 |
-
```
|
| 152 |
-
|
| 153 |
-
**Expected Output**:
|
| 154 |
-
```
|
| 155 |
-
🚀 Starting Trackio API Client Tests with Automatic URL Resolution
|
| 156 |
-
======================================================================
|
| 157 |
-
✅ Space URL Resolution: PASSED
|
| 158 |
-
✅ API Client Test: PASSED
|
| 159 |
-
✅ Monitoring Integration: PASSED
|
| 160 |
-
|
| 161 |
-
🎉 All tests passed! The Trackio integration with automatic URL resolution is working correctly.
|
| 162 |
-
```
|
| 163 |
-
|
| 164 |
-
## Benefits
|
| 165 |
-
|
| 166 |
-
### 1. **Reliability**
|
| 167 |
-
- ✅ No more 404 errors
|
| 168 |
-
- ✅ Proper error handling and fallbacks
|
| 169 |
-
- ✅ Automatic retry mechanisms
|
| 170 |
-
|
| 171 |
-
### 2. **Flexibility**
|
| 172 |
-
- ✅ Automatic Space URL resolution
|
| 173 |
-
- ✅ Support for any Trackio Space
|
| 174 |
-
- ✅ Configurable via environment variables
|
| 175 |
-
|
| 176 |
-
### 3. **Maintainability**
|
| 177 |
-
- ✅ Clean separation of concerns
|
| 178 |
-
- ✅ Proper logging and debugging
|
| 179 |
-
- ✅ Comprehensive test coverage
|
| 180 |
-
|
| 181 |
-
### 4. **User Experience**
|
| 182 |
-
- ✅ Seamless integration with training pipeline
|
| 183 |
-
- ✅ Real-time experiment monitoring
|
| 184 |
-
- ✅ Automatic experiment creation and management
|
| 185 |
-
|
| 186 |
-
## Usage Examples
|
| 187 |
-
|
| 188 |
-
### Basic Usage
|
| 189 |
-
```python
|
| 190 |
-
from scripts.trackio_tonic.trackio_api_client import TrackioAPIClient
|
| 191 |
-
|
| 192 |
-
# Initialize with Space ID (URL resolved automatically)
|
| 193 |
-
client = TrackioAPIClient("Tonic/trackio-monitoring-20250727")
|
| 194 |
-
|
| 195 |
-
# Create experiment
|
| 196 |
-
result = client.create_experiment("my_experiment", "Test experiment")
|
| 197 |
-
|
| 198 |
-
# Log metrics
|
| 199 |
-
metrics = {"loss": 1.234, "accuracy": 0.85}
|
| 200 |
-
client.log_metrics("exp_123", metrics, step=100)
|
| 201 |
-
```
|
| 202 |
-
|
| 203 |
-
### With Monitoring Integration
|
| 204 |
-
```python
|
| 205 |
-
from src.monitoring import SmolLM3Monitor
|
| 206 |
-
|
| 207 |
-
# Create monitor (automatically creates experiment)
|
| 208 |
-
monitor = SmolLM3Monitor(
|
| 209 |
-
    experiment_name="my_training_run",
|
| 210 |
-
    enable_tracking=True
|
| 211 |
-
)
|
| 212 |
-
|
| 213 |
-
# Log metrics during training
|
| 214 |
-
monitor.log_metrics({"loss": 1.234}, step=100)
|
| 215 |
-
|
| 216 |
-
# Log configuration
|
| 217 |
-
monitor.log_config({"learning_rate": 2e-5, "batch_size": 8})
|
| 218 |
-
```
|
| 219 |
-
|
| 220 |
-
## Troubleshooting
|
| 221 |
-
|
| 222 |
-
### Common Issues
|
| 223 |
-
|
| 224 |
-
1. **"gradio_client not available"**
|
| 225 |
-
```bash
|
| 226 |
-
pip install gradio_client
|
| 227 |
-
```
|
| 228 |
-
|
| 229 |
-
2. **"huggingface_hub not available"**
|
| 230 |
-
```bash
|
| 231 |
-
pip install huggingface_hub
|
| 232 |
-
```
|
| 233 |
-
|
| 234 |
-
3. **"Space not accessible"**
|
| 235 |
-
- Check if the Space is running
|
| 236 |
-
- Verify Space ID is correct
|
| 237 |
-
- Ensure HF token has proper permissions
|
| 238 |
-
|
| 239 |
-
4. **"Experiment not found"**
|
| 240 |
-
- Experiments are created automatically by the monitor
|
| 241 |
-
- Use the experiment ID returned by `create_experiment()`
|
| 242 |
-
|
| 243 |
-
### Debug Mode
|
| 244 |
-
Enable debug logging to see detailed API calls:
|
| 245 |
-
```python
|
| 246 |
-
import logging
|
| 247 |
-
logging.basicConfig(level=logging.DEBUG)
|
| 248 |
-
```
|
| 249 |
-
|
| 250 |
-
## Future Enhancements
|
| 251 |
-
|
| 252 |
-
### Planned Features
|
| 253 |
-
1. **Multi-Space Support** - Support for multiple Trackio Spaces
|
| 254 |
-
2. **Advanced Metrics** - Support for custom metric types
|
| 255 |
-
3. **Artifact Upload** - Direct file upload to Spaces
|
| 256 |
-
4. **Real-time Dashboard** - Live monitoring dashboard
|
| 257 |
-
5. **Export Capabilities** - Export experiments to various formats
|
| 258 |
-
|
| 259 |
-
### Extensibility
|
| 260 |
-
The new architecture is designed to be easily extensible:
|
| 261 |
-
- Modular API client design
|
| 262 |
-
- Plugin-based monitoring system
|
| 263 |
-
- Configurable Space resolution
|
| 264 |
-
- Support for custom endpoints
|
| 265 |
-
|
| 266 |
-
## Conclusion
|
| 267 |
-
|
| 268 |
-
The Trackio API integration has been successfully fixed and enhanced with:
|
| 269 |
-
|
| 270 |
-
- ✅ **Resolved 404 errors** through proper Gradio client usage
|
| 271 |
-
- ✅ **Automatic URL resolution** using Hugging Face Hub API
|
| 272 |
-
- ✅ **Comprehensive testing** with full test coverage
|
| 273 |
-
- ✅ **Enhanced monitoring** with seamless integration
|
| 274 |
-
- ✅ **Future-proof architecture** for easy extensions
|
| 275 |
-
|
| 276 |
-
The system is now production-ready and provides reliable experiment tracking for SmolLM3 fine-tuning workflows.
|
docs/TRACKIO_DEPLOYMENT_FIXES.md
DELETED
|
@@ -1,266 +0,0 @@
|
|
| 1 |
-
# Trackio Deployment Fixes
|
| 2 |
-
|
| 3 |
-
This document outlines the fixes made to resolve the Trackio Space deployment and dataset creation issues.
|
| 4 |
-
|
| 5 |
-
## Issues Identified
|
| 6 |
-
|
| 7 |
-
### 1. Git Authentication Issues in Space Deployment
|
| 8 |
-
- **Problem**: The `deploy_trackio_space.py` script was using git commands for file upload, which failed with authentication errors
|
| 9 |
-
- **Solution**: Replaced git commands with direct HF Hub API calls using `upload_file()`
|
| 10 |
-
|
| 11 |
-
### 2. Dataset Repository Creation Issues
|
| 12 |
-
- **Problem**: The `setup_hf_dataset.py` script was trying to push to a dataset repository that didn't exist, causing 404 errors
|
| 13 |
-
- **Solution**: Added proper repository creation using `create_repo()` before pushing the dataset
|
| 14 |
-
|
| 15 |
-
### 3. Missing Environment Variable Setup
|
| 16 |
-
- **Problem**: The Space deployment didn't set up the required `HF_TOKEN` environment variable
|
| 17 |
-
- **Solution**: Added automatic secret setting using `add_space_secret()` API method
|
| 18 |
-
|
| 19 |
-
### 4. Manual Username Input Required
|
| 20 |
-
- **Problem**: Users had to manually enter their username
|
| 21 |
-
- **Solution**: Automatically extract username from token using `whoami()` API method
|
| 22 |
-
|
| 23 |
-
### 5. Dataset Access Testing Issues
|
| 24 |
-
- **Problem**: The configuration script failed when testing dataset access for non-existent datasets
|
| 25 |
-
- **Solution**: Added proper error handling and repository existence checks
|
| 26 |
-
|
| 27 |
-
## Fixed Scripts
|
| 28 |
-
|
| 29 |
-
### 1. `scripts/trackio_tonic/deploy_trackio_space.py`
|
| 30 |
-
|
| 31 |
-
#### Key Changes:
|
| 32 |
-
- **Replaced git upload with HF Hub API**: Now uses `upload_file()` directly instead of git commands
|
| 33 |
-
- **Automatic secret setting**: Uses `add_space_secret()` API to set HF_TOKEN automatically
|
| 34 |
-
- **Username extraction from token**: Uses `whoami()` to get username automatically
|
| 35 |
-
- **Removed manual username input**: No longer asks for username
|
| 36 |
-
- **Improved error handling**: Better error messages and fallback options
|
| 37 |
-
|
| 38 |
-
#### Usage:
|
| 39 |
-
```bash
|
| 40 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
#### What it does:
|
| 44 |
-
1. Extracts username from HF token automatically
|
| 45 |
-
2. Creates a new HF Space using the API
|
| 46 |
-
3. Prepares Space files from templates
|
| 47 |
-
4. Uploads files using HF Hub API (no git required)
|
| 48 |
-
5. **Automatically sets secrets via API** (HF_TOKEN and TRACKIO_DATASET_REPO)
|
| 49 |
-
6. Tests the Space accessibility
|
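The six steps above map roughly onto four Hub API calls. The sketch below is an assumption-laden outline rather than the script itself: `deploy_space` and `RecordingApi` are hypothetical names, `space_sdk="gradio"` is assumed because the Trackio interface is a Gradio app, and the recording stub replaces `huggingface_hub.HfApi` so the example runs offline.

```python
from typing import Any, Dict, List, Optional, Tuple


def deploy_space(api: Any, token: str, space_name: str,
                 files: Dict[str, bytes],
                 dataset_repo: Optional[str] = None) -> str:
    """Create a Gradio Space, upload files, and set its secrets via the Hub API."""
    username = api.whoami()["name"]  # no manual username prompt
    repo_id = f"{username}/{space_name}"
    api.create_repo(repo_id=repo_id, repo_type="space",
                    space_sdk="gradio", exist_ok=True)
    for path_in_repo, content in files.items():
        # One call per file, so a failure points at a specific upload.
        api.upload_file(path_or_fileobj=content, path_in_repo=path_in_repo,
                        repo_id=repo_id, repo_type="space")
    api.add_space_secret(repo_id=repo_id, key="HF_TOKEN", value=token)
    if dataset_repo:
        api.add_space_secret(repo_id=repo_id, key="TRACKIO_DATASET_REPO",
                             value=dataset_repo)
    return repo_id


class RecordingApi:
    """Offline stand-in that records calls instead of contacting the Hub."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def whoami(self) -> Dict[str, str]:
        return {"name": "your-username"}

    def __getattr__(self, name: str):
        # Any other method just logs its name and keyword arguments.
        def record(**kwargs: Any) -> None:
            self.calls.append((name, kwargs))
        return record


api = RecordingApi()
print(deploy_space(api, "hf_your_token", "trackio-monitoring", {"app.py": b"# app"}))
print([name for name, _ in api.calls])
```

With a real `HfApi(token=...)` passed as `api`, the same function would perform the actual deployment; the final accessibility test from step 6 is left out of this sketch.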
| 50 |
-
|
| 51 |
-
### 2. `scripts/dataset_tonic/setup_hf_dataset.py`
|
| 52 |
-
|
| 53 |
-
#### Key Changes:
|
| 54 |
-
- **Added repository creation**: Creates the dataset repository before pushing data
|
| 55 |
-
- **Username extraction from token**: Uses `whoami()` to get username automatically
|
| 56 |
-
- **Automatic dataset naming**: Uses username in dataset repository name
|
| 57 |
-
- **Improved error handling**: Better error messages for common issues
|
| 58 |
-
- **Public datasets by default**: Makes datasets public for easier access
|
| 59 |
-
|
| 60 |
-
#### Usage:
|
| 61 |
-
```bash
|
| 62 |
-
python scripts/dataset_tonic/setup_hf_dataset.py
|
| 63 |
-
```
|
| 64 |
-
|
| 65 |
-
#### What it does:
|
| 66 |
-
1. Extracts username from HF token automatically
|
| 67 |
-
2. Creates the dataset repository if it doesn't exist
|
| 68 |
-
3. Creates a dataset with sample experiment data
|
| 69 |
-
4. Uploads README template
|
| 70 |
-
5. Makes the dataset public for easier access
|
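The "create the repository before pushing" fix can be illustrated with a short sketch. The names `ensure_dataset_repo` and `StubApi` are hypothetical, and the default dataset name is taken from the `username/trackio-experiments` convention mentioned in this document; the stub stands in for `huggingface_hub.HfApi` so the example needs no network.

```python
from typing import Any, Dict, List


def ensure_dataset_repo(api: Any, dataset_name: str = "trackio-experiments",
                        private: bool = False) -> str:
    """Create '<username>/<dataset_name>' if needed and return its repo ID."""
    username = api.whoami()["name"]
    repo_id = f"{username}/{dataset_name}"
    # exist_ok=True makes the call a no-op when the repository already exists,
    # so a later push does not hit a 404 for a missing repository.
    api.create_repo(repo_id=repo_id, repo_type="dataset",
                    private=private, exist_ok=True)
    return repo_id


class StubApi:
    """Offline stand-in for huggingface_hub.HfApi."""

    def __init__(self) -> None:
        self.created: List[Dict[str, Any]] = []

    def whoami(self) -> Dict[str, str]:
        return {"name": "your-username"}

    def create_repo(self, **kwargs: Any) -> None:
        self.created.append(kwargs)


stub = StubApi()
print(ensure_dataset_repo(stub))
```

The `private` flag defaults to `False` here only to mirror the "public by default" behaviour described above; passing `private=True` would be the conservative choice for sensitive experiment data.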
| 71 |
-
|
| 72 |
-
### 3. `scripts/trackio_tonic/configure_trackio.py`
|
| 73 |
-
|
| 74 |
-
#### Key Changes:
|
| 75 |
-
- **Added repository existence check**: Checks if dataset repository exists before trying to load
|
| 76 |
-
- **Username extraction from token**: Uses `whoami()` to get username automatically
|
| 77 |
-
- **Automatic dataset naming**: Uses username in default dataset repository
|
| 78 |
-
- **Better error handling**: Distinguishes between missing repository and permission issues
|
| 79 |
-
- **Improved user guidance**: Clear instructions for next steps
|
| 80 |
-
|
| 81 |
-
#### Usage:
|
| 82 |
-
```bash
|
| 83 |
-
python scripts/trackio_tonic/configure_trackio.py
|
| 84 |
-
```
|
| 85 |
-
|
| 86 |
-
#### What it does:
|
| 87 |
-
1. Extracts username from HF token automatically
|
| 88 |
-
2. Validates current configuration
|
| 89 |
-
3. Tests dataset access with proper error handling
|
| 90 |
-
4. Generates configuration file with username
|
| 91 |
-
5. Provides usage examples with actual username
|
| 92 |
-
|
| 93 |
-
## Model Push Script (`scripts/model_tonic/push_to_huggingface.py`)
|
| 94 |
-
|
| 95 |
-
The model push script was already using the HF Hub API correctly, so no changes were needed. It properly:
|
| 96 |
-
- Creates repositories using `create_repo()`
|
| 97 |
-
- Uploads files using `upload_file()`
|
| 98 |
-
- Handles authentication correctly
|
| 99 |
-
|
| 100 |
-
## Environment Variables Required
|
| 101 |
-
|
| 102 |
-
### For HF Spaces:
|
| 103 |
-
```bash
|
| 104 |
-
HF_TOKEN=your_hf_token_here
|
| 105 |
-
TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 106 |
-
```
|
| 107 |
-
|
| 108 |
-
### For Local Development:
|
| 109 |
-
```bash
|
| 110 |
-
export HF_TOKEN=your_hf_token_here
|
| 111 |
-
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
|
| 112 |
-
```
|
| 113 |
-
|
| 114 |
-
## Deployment Workflow
|
| 115 |
-
|
| 116 |
-
### 1. Create Dataset
|
| 117 |
-
```bash
|
| 118 |
-
# Set environment variables
|
| 119 |
-
export HF_TOKEN=your_token_here
|
| 120 |
-
# TRACKIO_DATASET_REPO will be auto-generated as username/trackio-experiments
|
| 121 |
-
|
| 122 |
-
# Create the dataset
|
| 123 |
-
python scripts/dataset_tonic/setup_hf_dataset.py
|
| 124 |
-
```
|
| 125 |
-
|
| 126 |
-
### 2. Deploy Trackio Space
|
| 127 |
-
```bash
|
| 128 |
-
# Deploy the Space (no username needed - extracted from token)
|
| 129 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
### 3. Secrets are Automatically Set
|
| 133 |
-
The script now automatically sets the required secrets via the HF Hub API:
|
| 134 |
-
- `HF_TOKEN` - Your Hugging Face token
|
| 135 |
-
- `TRACKIO_DATASET_REPO` - Your dataset repository (if specified)
|
| 136 |
-
|
| 137 |
-
### 4. Test Configuration
|
| 138 |
-
```bash
|
| 139 |
-
# Test the configuration
|
| 140 |
-
python scripts/trackio_tonic/configure_trackio.py
|
| 141 |
-
```
|
| 142 |
-
|
| 143 |
-
## New Features
|
| 144 |
-
|
| 145 |
-
### ✅ **Automatic Secret Setting**
|
| 146 |
-
- Uses `add_space_secret()` API method
|
| 147 |
-
- Sets `HF_TOKEN` automatically
|
| 148 |
-
- Sets `TRACKIO_DATASET_REPO` if specified
|
| 149 |
-
- Falls back to manual instructions if API fails
|
| 150 |
-
|
| 151 |
-
### ✅ **Username Extraction from Token**
|
| 152 |
-
- Uses `whoami()` API method
|
| 153 |
-
- No manual username input required
|
| 154 |
-
- Automatically uses username in dataset names
|
| 155 |
-
- Provides better user experience
|
| 156 |
-
|
| 157 |
-
### ✅ **Improved User Experience**
|
| 158 |
-
- Fewer manual inputs required
|
| 159 |
-
- Automatic configuration based on token
|
| 160 |
-
- Clear feedback about what's happening
|
| 161 |
-
- Better error messages
|
| 162 |
-
|
| 163 |
-
## Troubleshooting
|
| 164 |
-
|
| 165 |
-
### Common Issues:
|
| 166 |
-
|
| 167 |
-
1. **"Repository not found" errors**:
|
| 168 |
-
- Run `setup_hf_dataset.py` to create the dataset first
|
| 169 |
-
- Check that your HF token has write permissions
|
| 170 |
-
|
| 171 |
-
2. **"Authentication failed" errors**:
|
| 172 |
-
- Verify your HF token is valid
|
| 173 |
-
- Check token permissions on https://huggingface.co/settings/tokens
|
| 174 |
-
|
| 175 |
-
3. **"Space not accessible" errors**:
|
| 176 |
-
- Wait 2-5 minutes for the Space to build
|
| 177 |
-
- Check Space logs at the Space URL
|
| 178 |
-
- Verify all files were uploaded correctly
|
| 179 |
-
|
| 180 |
-
4. **"Dataset access failed" errors**:
|
| 181 |
-
- Ensure the dataset repository exists
|
| 182 |
-
- Check that your token has read permissions
|
| 183 |
-
- Verify the dataset repository name is correct
|
| 184 |
-
|
| 185 |
-
5. **"Secret setting failed" errors**:
|
| 186 |
-
- The script will fall back to manual instructions
|
| 187 |
-
- Follow the provided instructions to set secrets manually
|
| 188 |
-
- Check that your token has write permissions to the Space
|
| 189 |
-
|
| 190 |
-
### Debugging Steps:
|
| 191 |
-
|
| 192 |
-
1. **Check token permissions**:
|
| 193 |
-
```bash
|
| 194 |
-
hf auth whoami
|
| 195 |
-
```
|
| 196 |
-
|
| 197 |
-
2. **Test dataset access**:
|
| 198 |
-
```python
|
| 199 |
-
from datasets import load_dataset
|
| 200 |
-
dataset = load_dataset("your-username/your-dataset", token="your-token")
|
| 201 |
-
```
|
| 202 |
-
|
| 203 |
-
3. **Test Space deployment**:
|
| 204 |
-
```bash
|
| 205 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 206 |
-
```
|
| 207 |
-
|
| 208 |
-
4. **Test secret setting**:
|
| 209 |
-
```python
|
| 210 |
-
from huggingface_hub import HfApi
|
| 211 |
-
api = HfApi(token="your-token")
|
| 212 |
-
api.add_space_secret("your-username/your-space", "TEST_KEY", "test_value")
|
| 213 |
-
```
|
| 214 |
-
|
| 215 |
-
## Security Considerations
|
| 216 |
-
|
| 217 |
-
- **Public datasets**: Datasets are now public by default for easier access
|
| 218 |
-
- **Token security**: Never commit tokens to version control
|
| 219 |
-
- **Space secrets**: Automatically set via API, with manual fallback
|
| 220 |
-
- **Access control**: Verify token permissions before deployment
|
| 221 |
-
|
| 222 |
-
## Performance Improvements
|
| 223 |
-
|
| 224 |
-
- **Direct API calls**: Eliminated git dependency for faster uploads
|
| 225 |
-
- **Automatic configuration**: No manual username input required
|
| 226 |
-
- **Per-file uploads**: Files are uploaded individually for better error handling
|
| 227 |
-
- **Caching**: HF Hub API handles caching automatically
|
| 228 |
-
- **Error recovery**: Better error handling and retry logic
|
| 229 |
-
|
| 230 |
-
## Future Enhancements
|
| 231 |
-
|
| 232 |
-
1. **Batch secret setting**: Set multiple secrets in one API call
|
| 233 |
-
2. **Progress tracking**: Add progress bars for large uploads
|
| 234 |
-
3. **Validation**: Add more comprehensive validation checks
|
| 235 |
-
4. **Rollback**: Add ability to roll back failed deployments
|
| 236 |
-
5. **Hardware configuration**: Automatically configure Space hardware
|
| 237 |
-
|
| 238 |
-
## Testing
|
| 239 |
-
|
| 240 |
-
To test the fixes:
|
| 241 |
-
|
| 242 |
-
```bash
|
| 243 |
-
# Test dataset creation
|
| 244 |
-
python scripts/dataset_tonic/setup_hf_dataset.py
|
| 245 |
-
|
| 246 |
-
# Test Space deployment
|
| 247 |
-
python scripts/trackio_tonic/deploy_trackio_space.py
|
| 248 |
-
|
| 249 |
-
# Test configuration
|
| 250 |
-
python scripts/trackio_tonic/configure_trackio.py
|
| 251 |
-
|
| 252 |
-
# Test model push (if you have a trained model)
|
| 253 |
-
python scripts/model_tonic/push_to_huggingface.py --model-path /path/to/model --repo-name your-username/your-model
|
| 254 |
-
```
|
| 255 |
-
|
| 256 |
-
## Summary
|
| 257 |
-
|
| 258 |
-
These fixes resolve the main issues with:
|
| 259 |
-
- ✅ Git authentication problems
|
| 260 |
-
- ✅ Dataset repository creation failures
|
| 261 |
-
- ✅ Missing environment variable setup
|
| 262 |
-
- ✅ Manual username input requirement
|
| 263 |
-
- ✅ Poor error handling and user feedback
|
| 264 |
-
- ✅ Security concerns with public datasets
|
| 265 |
-
|
| 266 |
-
The scripts now use the HF Hub API directly, provide better error messages, handle edge cases properly, and offer a much improved user experience with automatic configuration.
|
docs/TRACKIO_DICT_ACCESS_FIX.md
DELETED
|
@@ -1,144 +0,0 @@
|
|
| 1 |
-
# TrackioConfig Dictionary-Style Access Fix
|
| 2 |
-
|
| 3 |
-
## Problem Description
|
| 4 |
-
|
| 5 |
-
The error `'TrackioConfig' object does not support item assignment` occurred because the TRL library was trying to use dictionary-style item assignment on our `TrackioConfig` object (like `config['key'] = value`), but our implementation only supported attribute assignment.
|
| 6 |
-
|
| 7 |
-
## Root Cause
|
| 8 |
-
|
| 9 |
-
TRL expects configuration objects to support both attribute-style and dictionary-style access:
|
| 10 |
-
- Attribute-style: `config.project_name = "test"`
|
| 11 |
-
- Dictionary-style: `config['project_name'] = "test"`
|
| 12 |
-
|
| 13 |
-
Our `TrackioConfig` class only implemented attribute-style access, causing TRL to fail when it tried to use dictionary-style assignment.
|
| 14 |
-
|
| 15 |
-
## Solution Implementation
|
| 16 |
-
|
| 17 |
-
### Enhanced TrackioConfig Class
|
| 18 |
-
|
| 19 |
-
Modified `src/trackio.py` to add full dictionary-style access support:
|
| 20 |
-
|
| 21 |
-
```python
|
| 22 |
-
class TrackioConfig:
|
| 23 |
-
    """Configuration class for trackio (TRL compatibility)"""
|
| 24 |
-
|
| 25 |
-
    def __init__(self):
|
| 26 |
-
        # ... existing initialization ...
|
| 27 |
-
|
| 28 |
-
    def update(self, config_dict: Dict[str, Any] = None, **kwargs):
|
| 29 |
-
        # ... existing update method ...
|
| 30 |
-
|
| 31 |
-
    def __getitem__(self, key: str) -> Any:
|
| 32 |
-
        """Dictionary-style access to configuration values"""
|
| 33 |
-
        if hasattr(self, key):
|
| 34 |
-
            return getattr(self, key)
|
| 35 |
-
        else:
|
| 36 |
-
            raise KeyError(f"Configuration key '{key}' not found")
|
| 37 |
-
|
| 38 |
-
    def __setitem__(self, key: str, value: Any):
|
| 39 |
-
        """Dictionary-style assignment to configuration values"""
|
| 40 |
-
        setattr(self, key, value)
|
| 41 |
-
|
| 42 |
-
    def __contains__(self, key: str) -> bool:
|
| 43 |
-
        """Check if configuration key exists"""
|
| 44 |
-
        return hasattr(self, key)
|
| 45 |
-
|
| 46 |
-
    def get(self, key: str, default: Any = None) -> Any:
|
| 47 |
-
        """Get configuration value with default"""
|
| 48 |
-
        if hasattr(self, key):
|
| 49 |
-
            return getattr(self, key)
|
| 50 |
-
        else:
|
| 51 |
-
            return default
|
| 52 |
-
|
| 53 |
-
    def keys(self):
|
| 54 |
-
        """Get all configuration keys"""
|
| 55 |
-
        return list(self.__dict__.keys())
|
| 56 |
-
|
| 57 |
-
    def items(self):
|
| 58 |
-
        """Get all configuration key-value pairs"""
|
| 59 |
-
        return list(self.__dict__.items())
|
| 60 |
-
|
| 61 |
-
    def __repr__(self):
|
| 62 |
-
        """String representation of configuration"""
|
| 63 |
-
        attrs = []
|
| 64 |
-
        for key, value in self.__dict__.items():
|
| 65 |
-
            attrs.append(f"{key}={repr(value)}")
|
| 66 |
-
        return f"TrackioConfig({', '.join(attrs)})"
|
| 67 |
-
```
|
| 68 |
-
|
| 69 |
-
### Key Features Added
|
| 70 |
-
|
| 71 |
-
#### 1. **Dictionary-Style Access**
|
| 72 |
-
- `config['key']` - Get configuration value
|
| 73 |
-
- `config['key'] = value` - Set configuration value
|
| 74 |
-
- `'key' in config` - Check if key exists
|
| 75 |
-
|
| 76 |
-
#### 2. **Dictionary Methods**
|
| 77 |
-
- `config.get('key', default)` - Get with default value
|
| 78 |
-
- `config.keys()` - Get all configuration keys
|
| 79 |
-
- `config.items()` - Get all key-value pairs
|
| 80 |
-
|
| 81 |
-
#### 3. **TRL Compatibility**
|
| 82 |
-
- Supports TRL's dictionary-style configuration updates
|
| 83 |
-
- Handles dynamic key assignment
|
| 84 |
-
- Maintains backward compatibility with attribute access
|
| 85 |
-
|
| 86 |
-
## Testing Verification
|
| 87 |
-
|
| 88 |
-
### Test Results
|
| 89 |
-
- ✅ Dictionary-style assignment: `config['project_name'] = 'test'`
|
| 90 |
-
- ✅ Dictionary-style access: `config['project_name']`
|
| 91 |
-
- ✅ Contains check: `'key' in config`
|
| 92 |
-
- ✅ Get method: `config.get('key', default)`
|
| 93 |
-
- ✅ Keys and items: `config.keys()`, `config.items()`
|
| 94 |
-
- ✅ TRL-style usage: `config['allow_val_change'] = True`
|
| 95 |
-
|
| 96 |
-
### TRL-Specific Usage Patterns
|
| 97 |
-
```python
|
| 98 |
-
# TRL-style configuration updates
|
| 99 |
-
config['allow_val_change'] = True
|
| 100 |
-
config['report_to'] = 'trackio'
|
| 101 |
-
config['project_name'] = 'my_experiment'
|
| 102 |
-
|
| 103 |
-
# Dictionary-style access
|
| 104 |
-
project = config['project_name']
|
| 105 |
-
allow_change = config.get('allow_val_change', False)
|
| 106 |
-
```
|
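The access patterns above can be exercised end to end with a reduced stand-in class. `MiniConfig` is a hypothetical name used only for this sketch; it copies the `hasattr`/`setattr` approach shown for `TrackioConfig` but is not the project's class.

```python
from typing import Any


class MiniConfig:
    """Reduced stand-in for TrackioConfig with dict-style access."""

    def __init__(self) -> None:
        self.project_name = "default"

    def __getitem__(self, key: str) -> Any:
        if hasattr(self, key):
            return getattr(self, key)
        raise KeyError(f"Configuration key '{key}' not found")

    def __setitem__(self, key: str, value: Any) -> None:
        setattr(self, key, value)

    def __contains__(self, key: str) -> bool:
        return hasattr(self, key)

    def get(self, key: str, default: Any = None) -> Any:
        return getattr(self, key, default)


config = MiniConfig()
config["allow_val_change"] = True  # item assignment creates the attribute
print(config.allow_val_change, "report_to" in config, config.get("report_to", "none"))
```

One consequence of the `hasattr`-based lookup is that method names such as `get` also count as keys (`"get" in config` is true), which may or may not matter depending on which keys the caller probes.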
| 107 |
-
|
| 108 |
-
## Integration with Existing Features
|
| 109 |
-
|
| 110 |
-
### Maintains All Existing Functionality
|
| 111 |
-
- ✅ Attribute-style access: `config.project_name`
|
| 112 |
-
- ✅ Update method: `config.update({'key': 'value'})`
|
| 113 |
-
- ✅ Keyword arguments: `config.update(allow_val_change=True)`
|
| 114 |
-
- ✅ Dynamic attributes: New attributes added at runtime
|
| 115 |
-
|
| 116 |
-
### Enhanced Compatibility
|
| 117 |
-
- ✅ Full TRL dictionary-style interface
|
| 118 |
-
- ✅ Backward compatibility with existing code
|
| 119 |
-
- ✅ Robust error handling for missing keys
|
| 120 |
-
- ✅ Comprehensive dictionary methods
|
| 121 |
-
|
| 122 |
-
## Production Readiness
|
| 123 |
-
|
| 124 |
-
### Status: ✅ PRODUCTION READY
|
| 125 |
-
|
| 126 |
-
The enhanced `TrackioConfig` class now provides:
|
| 127 |
-
1. **Complete TRL Compatibility** - Supports all TRL configuration patterns
|
| 128 |
-
2. **Flexible Access** - Both attribute and dictionary-style access
|
| 129 |
-
3. **Robust Error Handling** - Graceful handling of missing keys
|
| 130 |
-
4. **Comprehensive Interface** - Full dictionary-like behavior
|
| 131 |
-
5. **Backward Compatibility** - Existing code continues to work
|
| 132 |
-
|
| 133 |
-
## Conclusion
|
| 134 |
-
|
| 135 |
-
The dictionary-style access fix resolves the `'TrackioConfig' object does not support item assignment` error and provides complete compatibility with TRL's configuration expectations.
|
| 136 |
-
|
| 137 |
-
**Key Achievements:**
|
| 138 |
-
- ✅ Full dictionary-style interface support
|
| 139 |
-
- ✅ TRL configuration pattern compatibility
|
| 140 |
-
- ✅ Backward compatibility maintained
|
| 141 |
-
- ✅ Comprehensive testing verification
|
| 142 |
-
- ✅ Production-ready implementation
|
| 143 |
-
|
| 144 |
-
**No additional changes are required** for TRL configuration compatibility. The system now handles all known TRL configuration access patterns.
|
docs/TRACKIO_INTEGRATION.md
DELETED
|
@@ -1,252 +0,0 @@
|
|
| 1 |
-
# Trackio Integration for SmolLM3 Fine-tuning
|
| 2 |
-
|
| 3 |
-
This document provides comprehensive information about the Trackio experiment tracking and monitoring integration for your SmolLM3 fine-tuning pipeline.
|
| 4 |
-
|
| 5 |
-
## Features
|
| 6 |
-
|
| 7 |
-
- **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
|
| 8 |
-
- **Trackio Integration**: Complete experiment tracking and monitoring
|
| 9 |
-
- **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
|
| 10 |
-
- **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
|
| 11 |
-
- **Flexible Configuration**: Support for various training configurations
|
| 12 |
-
|
| 13 |
-
## Quick Start
|
| 14 |
-
|
| 15 |
-
### 1. Install Dependencies
|
| 16 |
-
|
| 17 |
-
```bash
|
| 18 |
-
pip install -r requirements.txt
|
| 19 |
-
```
|
| 20 |
-
|
| 21 |
-
### 2. Basic Training with Trackio
|
| 22 |
-
|
| 23 |
-
```bash
|
| 24 |
-
python train.py config/train_smollm3.py \
|
| 25 |
-
--dataset_dir my_dataset \
|
| 26 |
-
--enable_tracking \
|
| 27 |
-
--trackio_url "https://your-trackio-instance.com" \
|
| 28 |
-
--experiment_name "smollm3_finetune_v1"
|
| 29 |
-
```
|
| 30 |
-
|
| 31 |
-
### 3. Training with Custom Parameters
|
| 32 |
-
|
| 33 |
-
```bash
|
| 34 |
-
python train.py config/train_smollm3.py \
|
| 35 |
-
--dataset_dir my_dataset \
|
| 36 |
-
--batch_size 8 \
|
| 37 |
-
--learning_rate 1e-5 \
|
| 38 |
-
--max_iters 2000 \
|
| 39 |
-
--enable_tracking \
|
| 40 |
-
--trackio_url "https://your-trackio-instance.com" \
|
| 41 |
-
--experiment_name "smollm3_high_lr_experiment"
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
## Trackio Integration
|
| 45 |
-
|
| 46 |
-
### Configuration
|
| 47 |
-
|
| 48 |
-
Add Trackio settings to your configuration:
|
| 49 |
-
|
| 50 |
-
```python
|
| 51 |
-
# In your config file
|
| 52 |
-
config = SmolLM3Config(
|
| 53 |
-
    # ... other settings ...
|
| 54 |
-
|
| 55 |
-
    # Trackio monitoring configuration
|
| 56 |
-
    enable_tracking=True,
|
| 57 |
-
    trackio_url="https://your-trackio-instance.com",
|
| 58 |
-
    trackio_token="your_token_here",  # Optional
|
| 59 |
-
    log_artifacts=True,
|
| 60 |
-
    log_metrics=True,
|
| 61 |
-
    log_config=True,
|
| 62 |
-
    experiment_name="my_experiment"
|
| 63 |
-
)
|
| 64 |
-
```
|
| 65 |
-
|
| 66 |
-
### Environment Variables
|
| 67 |
-
|
| 68 |
-
You can also set Trackio configuration via environment variables:
|
| 69 |
-
|
| 70 |
-
```bash
|
| 71 |
-
export TRACKIO_URL="https://your-trackio-instance.com"
|
| 72 |
-
export TRACKIO_TOKEN="your_token_here"
|
| 73 |
-
```
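
The two variables above could be read on the Python side roughly as follows. This is a minimal sketch, not the project's actual loading code: the helper name `resolve_trackio_settings` is hypothetical, and the precedence (explicit arguments win over environment variables) is an assumption made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (url, token); explicit arguments win over environment variables.

    The precedence order is an assumption made for this sketch.
    """
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    return resolved_url, resolved_token


if __name__ == "__main__":
    url, token = resolve_trackio_settings()
    print("Trackio URL:", url or "<not set>")
    print("Token set:", token is not None)
```

Passing a plain dictionary as `env` makes the lookup easy to exercise without touching the real environment.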
|
| 74 |
-
|
| 75 |
-
### What Gets Tracked
|
| 76 |
-
|
| 77 |
-
- **Configuration**: All training parameters and model settings
|
| 78 |
-
- **Metrics**: Loss, accuracy, learning rate, and custom metrics
|
| 79 |
-
- **System Metrics**: GPU memory, CPU usage, training time
|
| 80 |
-
- **Artifacts**: Model checkpoints, evaluation results
|
| 81 |
-
- **Training Summary**: Final results and experiment duration
|
| 82 |
-
|
| 83 |
-
## Hugging Face Spaces Deployment
|
| 84 |
-
|
| 85 |
-
### Deploy Trackio Monitoring Interface
|
| 86 |
-
|
| 87 |
-
1. **Create a new Space** on Hugging Face:
|
| 88 |
-
- Go to https://huggingface.co/spaces
|
| 89 |
-
- Click "Create new Space"
|
| 90 |
-
- Choose "Gradio" as the SDK
|
| 91 |
-
- Set visibility (Public or Private)
|
| 92 |
-
|
| 93 |
-
2. **Upload the deployment files**:
|
| 94 |
-
- `app.py` - The Gradio interface
|
| 95 |
-
- `requirements_space.txt` - Dependencies
|
| 96 |
-
- `README.md` - Documentation
|
| 97 |
-
|
| 98 |
-
3. **Configure the Space**:
|
| 99 |
-
- The Space will automatically install dependencies
|
| 100 |
-
- The Gradio interface will be available at your Space URL
|
| 101 |
-
|
| 102 |
-
### Using the Trackio Space
|
| 103 |
-
|
| 104 |
-
1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
|
| 105 |
-
2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
|
| 106 |
-
3. **View Results**: Use the "View Experiments" tab to see experiment details
|
| 107 |
-
4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
|
| 108 |
-
|
| 109 |
-
### Integration with Your Training
|
| 110 |
-
|
| 111 |
-
To connect your training script to the Trackio Space:
|
| 112 |
-
|
| 113 |
-
```python
|
| 114 |
-
# In your training script
|
| 115 |
-
from monitoring import SmolLM3Monitor
|
| 116 |
-
|
| 117 |
-
# Initialize monitor
|
| 118 |
-
monitor = SmolLM3Monitor(
|
| 119 |
-
experiment_name="my_experiment",
|
| 120 |
-
trackio_url="https://your-space.hf.space", # Your Space URL
|
| 121 |
-
enable_tracking=True
|
| 122 |
-
)
|
| 123 |
-
|
| 124 |
-
# Log configuration
|
| 125 |
-
monitor.log_config(config_dict)
|
| 126 |
-
|
| 127 |
-
# Log metrics during training
|
| 128 |
-
monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)
|
| 129 |
-
|
| 130 |
-
# Log final results
|
| 131 |
-
monitor.log_training_summary(final_results)
|
| 132 |
-
```
|
| 133 |
-
|
| 134 |
-
## Configuration Files
|
| 135 |
-
|
| 136 |
-
### Main Configuration (`config/train_smollm3.py`)
|
| 137 |
-
|
| 138 |
-
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096

    # Training configuration
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000

    # Trackio monitoring
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    experiment_name: Optional[str] = None
```
|
| 156 |
-
|
| 157 |
-
### DPO Configuration (`config/train_smollm3_dpo.py`)
|
| 158 |
-
|
| 159 |
-
```python
from dataclasses import dataclass
from typing import Optional

# SmolLM3Config is the base dataclass defined in config/train_smollm3.py.


@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    # DPO-specific settings
    beta: float = 0.1
    max_prompt_length: int = 2048

    # Trackio monitoring (inherited)
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
```
|
| 170 |
-
|
| 171 |
-
## Monitoring Features
|
| 172 |
-
|
| 173 |
-
### Real-time Metrics
|
| 174 |
-
|
| 175 |
-
- Training loss and evaluation metrics
|
| 176 |
-
- Learning rate scheduling
|
| 177 |
-
- GPU memory and utilization
|
| 178 |
-
- Training time and progress
|
| 179 |
-
|
| 180 |
-
### Artifact Tracking
|
| 181 |
-
|
| 182 |
-
- Model checkpoints at regular intervals
|
| 183 |
-
- Evaluation results and plots
|
| 184 |
-
- Configuration snapshots
|
| 185 |
-
- Training logs and summaries
|
| 186 |
-
|
| 187 |
-
### Experiment Management
|
| 188 |
-
|
| 189 |
-
- Experiment naming and organization
|
| 190 |
-
- Status tracking (running, completed, failed)
|
| 191 |
-
- Parameter comparison across experiments
|
| 192 |
-
- Result visualization
|
| 193 |
-
|
| 194 |
-
## Advanced Usage
|
| 195 |
-
|
| 196 |
-
### Custom Metrics
|
| 197 |
-
|
| 198 |
-
```python
|
| 199 |
-
# Log custom metrics
|
| 200 |
-
monitor.log_metrics({
|
| 201 |
-
"custom_metric": value,
|
| 202 |
-
"perplexity": perplexity_score,
|
| 203 |
-
"bleu_score": bleu_score
|
| 204 |
-
}, step=current_step)
|
| 205 |
-
```
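
As an illustration of where a value such as `perplexity_score` might come from, the sketch below derives perplexity from a loss value. It assumes the loss is a mean token-level cross-entropy measured in nats, in which case perplexity is `exp(loss)`; the helper name is hypothetical and the resulting dictionary is only the kind of payload that could be passed to `monitor.log_metrics`.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity is exp(loss) when loss is mean token cross-entropy in nats."""
    return math.exp(mean_cross_entropy)


loss = 0.5
metrics = {"loss": loss, "perplexity": perplexity_from_loss(loss)}
print(metrics)
```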
|
| 206 |
-
|
| 207 |
-
### System Monitoring
|
| 208 |
-
|
| 209 |
-
```python
|
| 210 |
-
# Log system metrics
|
| 211 |
-
monitor.log_system_metrics(step=current_step)
|
| 212 |
-
```
|
| 213 |
-
|
| 214 |
-
### Artifact Logging
|
| 215 |
-
|
| 216 |
-
```python
|
| 217 |
-
# Log model checkpoint
|
| 218 |
-
monitor.log_model_checkpoint("checkpoint-1000", step=1000)
|
| 219 |
-
|
| 220 |
-
# Log evaluation results
|
| 221 |
-
monitor.log_evaluation_results(eval_results, step=1000)
|
| 222 |
-
```
|
| 223 |
-
|
| 224 |
-
## Troubleshooting
|
| 225 |
-
|
| 226 |
-
### Common Issues
|
| 227 |
-
|
| 228 |
-
1. **Trackio not available**: Install with `pip install trackio`
|
| 229 |
-
2. **Connection errors**: Check your Trackio URL and token
|
| 230 |
-
3. **Missing metrics**: Ensure monitoring is enabled in configuration
|
| 231 |
-
4. **Space deployment issues**: Check Gradio version compatibility
|
| 232 |
-
|
| 233 |
-
### Debug Mode
|
| 234 |
-
|
| 235 |
-
Enable debug logging:
|
| 236 |
-
|
| 237 |
-
```python
|
| 238 |
-
import logging
|
| 239 |
-
logging.basicConfig(level=logging.DEBUG)
|
| 240 |
-
```
|
| 241 |
-
|
| 242 |
-
## Contributing
|
| 243 |
-
|
| 244 |
-
1. Fork the repository
|
| 245 |
-
2. Create a feature branch
|
| 246 |
-
3. Make your changes
|
| 247 |
-
4. Add tests if applicable
|
| 248 |
-
5. Submit a pull request
|
| 249 |
-
|
| 250 |
-
## License
|
| 251 |
-
|
| 252 |
-
This project is licensed under the MIT License - see the LICENSE file for details.
docs/TRACKIO_INTEGRATION_VERIFICATION.md
DELETED
|
@@ -1,177 +0,0 @@
|
|
| 1 |
-
# Trackio Integration Verification Report
|
| 2 |
-
|
| 3 |
-
## ✅ Verification Status: PASSED
|
| 4 |
-
|
| 5 |
-
All Trackio integration tests passed. The implementation matches the documentation in `TRACKIO_INTEGRATION.md` and `TRACKIO_INTERFACE_GUIDE.md`.
|
| 6 |
-
|
| 7 |
-
## 🔧 Issues Fixed
|
| 8 |
-
|
| 9 |
-
### 1. **Training Arguments Configuration**
|
| 10 |
-
- **Issue**: `'bool' object is not callable` error with `report_to` parameter
|
| 11 |
-
- **Fix**: Changed `report_to: "none"` to `report_to: None` in `model.py`
|
| 12 |
-
- **Impact**: Resolves the original training failure
|
| 13 |
-
|
| 14 |
-
### 2. **Boolean Parameter Type Safety**
|
| 15 |
-
- **Issue**: Boolean parameters not properly typed in training arguments
|
| 16 |
-
- **Fix**: Added explicit boolean conversion for all boolean parameters:
|
| 17 |
-
- `dataloader_pin_memory`
|
| 18 |
-
- `group_by_length`
|
| 19 |
-
- `prediction_loss_only`
|
| 20 |
-
- `ignore_data_skip`
|
| 21 |
-
- `remove_unused_columns`
|
| 22 |
-
- `ddp_find_unused_parameters`
|
| 23 |
-
- `fp16`
|
| 24 |
-
- `bf16`
|
| 25 |
-
- `load_best_model_at_end`
|
| 26 |
-
- `greater_is_better`
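
A minimal sketch of what explicit boolean conversion can look like, assuming raw values may arrive as strings from a config file or the command line. The helper `to_bool` and the sample `raw_args` are hypothetical and not taken from `model.py`; strings are parsed explicitly because `bool("false")` evaluates to `True` in Python.

```python
from typing import Any

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def to_bool(value: Any) -> bool:
    """Coerce config values to a real bool.

    Strings are parsed explicitly because bool("false") is True in Python.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in _TRUE:
            return True
        if text in _FALSE:
            return False
        raise ValueError(f"cannot interpret {value!r} as a boolean")
    return bool(value)


# Hypothetical raw values, as they might arrive from a config file or CLI.
raw_args = {"fp16": "false", "bf16": 1, "group_by_length": "True"}
typed_args = {name: to_bool(value) for name, value in raw_args.items()}
print(typed_args)
```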
|
| 27 |
-
|
| 28 |
-
### 3. **Callback Implementation**
|
| 29 |
-
- **Issue**: Callback creation failing when tracking disabled
|
| 30 |
-
- **Fix**: Modified `create_monitoring_callback()` to always return a callback
|
| 31 |
-
- **Improvement**: Added proper inheritance from `TrainerCallback`
|
| 32 |
-
|
| 33 |
-
### 4. **Method Naming Conflicts**
|
| 34 |
-
- **Issue**: Boolean attributes conflicting with method names
|
| 35 |
-
- **Fix**: Renamed boolean attributes to avoid conflicts:
|
| 36 |
-
- `log_config` → `log_config_enabled`
|
| 37 |
-
- `log_metrics` → `log_metrics_enabled`
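
The sketch below shows one way such a naming conflict can surface in Python: an instance attribute with the same name as a method shadows the method, so calling it raises `TypeError`. Both classes are hypothetical stand-ins written for illustration; they are not the project's monitor class.

```python
class BrokenMonitor:
    def __init__(self, log_config: bool = True):
        # The instance attribute shadows the method of the same name.
        self.log_config = log_config

    def log_config(self, config: dict) -> None:
        print("config:", config)


class FixedMonitor:
    def __init__(self, log_config: bool = True):
        # Renamed flag, so the method stays reachable.
        self.log_config_enabled = log_config
        self.logged = []

    def log_config(self, config: dict) -> None:
        if self.log_config_enabled:
            self.logged.append(config)


try:
    BrokenMonitor().log_config({"batch_size": 4})
except TypeError as exc:
    error_message = str(exc)
    print("broken:", error_message)

fixed = FixedMonitor()
fixed.log_config({"batch_size": 4})
print("fixed:", fixed.logged)
```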
|
| 38 |
-
|
| 39 |
-
### 5. **System Compatibility**
|
| 40 |
-
- **Issue**: Training arguments test failing on systems without bf16 support
|
| 41 |
-
- **Fix**: Added conditional bf16 support detection
|
| 42 |
-
- **Improvement**: Added conditional support for `dataloader_prefetch_factor`
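
Conditional bf16 detection could look roughly like the following. This is a sketch under assumptions, not the code from the repository: it relies on PyTorch's `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and it treats a missing PyTorch installation as "no bf16" so that it runs anywhere. The `precision_flags` dictionary is a hypothetical way to carry the result into training arguments.

```python
def bf16_supported() -> bool:
    """Return True only if PyTorch reports a CUDA device with bf16 support."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return bool(torch.cuda.is_bf16_supported())


use_bf16 = bf16_supported()
# fp16 stays off in this sketch; only bf16 is toggled by the detection result.
precision_flags = {"bf16": use_bf16, "fp16": False}
print(precision_flags)
```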
|
| 43 |
-
|
| 44 |
-
## 📊 Test Results
|
| 45 |
-
|
| 46 |
-
| Test | Status | Description |
|
| 47 |
-
|------|--------|-------------|
|
| 48 |
-
| Trackio Configuration | ✅ PASS | All required attributes present |
|
| 49 |
-
| Monitor Creation | ✅ PASS | Monitor created successfully |
|
| 50 |
-
| Callback Creation | ✅ PASS | Callback with all required methods |
|
| 51 |
-
| Monitor Methods | ✅ PASS | All logging methods work correctly |
|
| 52 |
-
| Training Arguments | ✅ PASS | Arguments created without errors |
|
| 53 |
-
|
| 54 |
-
## 🎯 Key Features Verified
|
| 55 |
-
|
| 56 |
-
### 1. **Configuration Management**
|
| 57 |
-
- ✅ Trackio-specific attributes properly defined
|
| 58 |
-
- ✅ Environment variable support
|
| 59 |
-
- ✅ Default values correctly set
|
| 60 |
-
- ✅ Configuration inheritance working
|
| 61 |
-
|
| 62 |
-
### 2. **Monitoring Integration**
|
| 63 |
-
- ✅ Monitor creation from config
|
| 64 |
-
- ✅ Callback integration with Hugging Face Trainer
|
| 65 |
-
- ✅ Real-time metrics logging
|
| 66 |
-
- ✅ System metrics collection
|
| 67 |
-
- ✅ Artifact tracking
|
| 68 |
-
- ✅ Evaluation results logging
|
| 69 |
-
|
| 70 |
-
### 3. **Training Integration**
|
| 71 |
-
- ✅ Training arguments properly configured
|
| 72 |
-
- ✅ Boolean parameters correctly typed
|
| 73 |
-
- ✅ `report_to` parameter fixed
|
| 74 |
-
- ✅ Callback methods properly implemented
|
| 75 |
-
- ✅ Error handling enhanced
|
| 76 |
-
|
| 77 |
-
### 4. **Interface Compatibility**
|
| 78 |
-
- ✅ Compatible with Trackio Space deployment
|
| 79 |
-
- ✅ Supports all documented features
|
| 80 |
-
- ✅ Handles missing Trackio URL gracefully
|
| 81 |
-
- ✅ Provides fallback behavior
|
| 82 |
-
|
| 83 |
-
## 🚀 Integration Points
|
| 84 |
-
|
| 85 |
-
### 1. **With Training Script**
|
| 86 |
-
```python
|
| 87 |
-
# Automatic integration via config
|
| 88 |
-
config = SmolLM3ConfigOpenHermesFRBalanced()
|
| 89 |
-
monitor = create_monitor_from_config(config)
|
| 90 |
-
|
| 91 |
-
# Callback automatically added to trainer
|
| 92 |
-
trainer = Trainer(
|
| 93 |
-
model=model,
|
| 94 |
-
args=training_args,
|
| 95 |
-
callbacks=[monitor.create_monitoring_callback()]
|
| 96 |
-
)
|
| 97 |
-
```
|
| 98 |
-
|
| 99 |
-
### 2. **With Trackio Space**
|
| 100 |
-
```python
|
| 101 |
-
# Configuration for Trackio Space
|
| 102 |
-
config.trackio_url = "https://your-space.hf.space"
|
| 103 |
-
config.enable_tracking = True
|
| 104 |
-
config.experiment_name = "my_experiment"
|
| 105 |
-
```
|
| 106 |
-
|
| 107 |
-
### 3. **With Hugging Face Trainer**
|
| 108 |
-
```python
|
| 109 |
-
# Training arguments properly configured
|
| 110 |
-
training_args = model.get_training_arguments(
|
| 111 |
-
output_dir=output_dir,
|
| 112 |
-
report_to=None, # Fixed
|
| 113 |
-
# ... other parameters
|
| 114 |
-
)
|
| 115 |
-
```
|
| 116 |
-
|
| 117 |
-
## 📈 Monitoring Features
|
| 118 |
-
|
| 119 |
-
### Real-time Metrics
|
| 120 |
-
- ✅ Training loss and evaluation metrics
|
| 121 |
-
- ✅ Learning rate scheduling
|
| 122 |
-
- ✅ GPU memory and utilization
|
| 123 |
-
- ✅ Training time and progress
|
| 124 |
-
|
| 125 |
-
### Artifact Tracking
|
| 126 |
-
- ✅ Model checkpoints at regular intervals
|
| 127 |
-
- ✅ Evaluation results and plots
|
| 128 |
-
- ✅ Configuration snapshots
|
| 129 |
-
- ✅ Training logs and summaries
|
| 130 |
-
|
| 131 |
-
### Experiment Management
|
| 132 |
-
- ✅ Experiment naming and organization
|
| 133 |
-
- ✅ Status tracking (running, completed, failed)
|
| 134 |
-
- ✅ Parameter comparison across experiments
|
| 135 |
-
- ✅ Result visualization
|
| 136 |
-
|
| 137 |
-
## 🔍 Error Handling
|
| 138 |
-
|
| 139 |
-
### Graceful Degradation
|
| 140 |
-
- ✅ Continues training when Trackio unavailable
|
| 141 |
-
- ✅ Handles missing environment variables
|
| 142 |
-
- ✅ Provides console logging fallback
|
| 143 |
-
- ✅ Maintains functionality without external dependencies
|
| 144 |
-
|
| 145 |
-
### Robust Callbacks
|
| 146 |
-
- ✅ Callback methods handle exceptions gracefully
|
| 147 |
-
- ✅ Training continues even if monitoring fails
|
| 148 |
-
- ✅ Detailed error logging for debugging
|
| 149 |
-
- ✅ Fallback to console monitoring
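
The behaviour described above (monitoring failures are logged, training continues, and metrics fall back to the console) can be sketched as below. Everything here is hypothetical: `never_raise`, `SketchCallback`, and `unreachable` are illustrative names, and the class deliberately does not subclass `transformers.TrainerCallback` so that the example stays self-contained.

```python
import functools
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("monitoring_sketch")


def never_raise(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log any exception from a monitoring hook instead of propagating it."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("monitoring hook %s failed; training continues", func.__name__)
            return None

    return wrapper


class SketchCallback:
    """Hypothetical callback; not a transformers.TrainerCallback subclass."""

    def __init__(self, send: Callable[[Dict[str, float], Optional[int]], None]):
        self.send = send  # e.g. a function that posts metrics to the Space
        self.console_fallbacks = 0

    @never_raise
    def on_log(self, logs: Dict[str, float], step: Optional[int] = None) -> None:
        try:
            self.send(logs, step)
        except ConnectionError:
            # Remote tracking unavailable: fall back to console logging.
            self.console_fallbacks += 1
            logger.warning("step=%s metrics=%s", step, logs)


def unreachable(logs: Dict[str, float], step: Optional[int]) -> None:
    raise ConnectionError("Space not reachable")


callback = SketchCallback(send=unreachable)
callback.on_log({"loss": 0.5}, step=100)  # does not raise
```

The decorator is the outer safety net for unexpected errors, while the inner `except ConnectionError` handles the expected case of an unreachable tracking server.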
|
| 150 |
-
|
| 151 |
-
## 📋 Compliance with Documentation
|
| 152 |
-
|
| 153 |
-
### TRACKIO_INTEGRATION.md Requirements
|
| 154 |
-
- ✅ All configuration options implemented
|
| 155 |
-
- ✅ Environment variable support
|
| 156 |
-
- ✅ Hugging Face Spaces deployment ready
|
| 157 |
-
- ✅ Comprehensive logging features
|
| 158 |
-
- ✅ Artifact tracking capabilities
|
| 159 |
-
|
| 160 |
-
### TRACKIO_INTERFACE_GUIDE.md Requirements
|
| 161 |
-
- ✅ Real-time visualization support
|
| 162 |
-
- ✅ Interactive plots and metrics
|
| 163 |
-
- ✅ Experiment comparison features
|
| 164 |
-
- ✅ Demo data generation
|
| 165 |
-
- ✅ Status tracking and updates
|
| 166 |
-
|
| 167 |
-
## 🎉 Conclusion
|
| 168 |
-
|
| 169 |
-
The Trackio integration is **fully functional** and **correctly implemented** according to the provided documentation. All major issues have been resolved:
|
| 170 |
-
|
| 171 |
-
1. **Original Error Fixed**: The `'bool' object is not callable` error has been resolved
|
| 172 |
-
2. **Callback Integration**: Trackio callbacks now work correctly with Hugging Face Trainer
|
| 173 |
-
3. **Configuration Management**: All Trackio-specific configuration is properly handled
|
| 174 |
-
4. **Error Handling**: Robust error handling and graceful degradation implemented
|
| 175 |
-
5. **Compatibility**: Works across different systems and configurations
|
| 176 |
-
|
| 177 |
-
The integration is ready for production use and will provide comprehensive monitoring for SmolLM3 fine-tuning experiments.
docs/TRACKIO_INTERFACE_GUIDE.md
DELETED
|
@@ -1,222 +0,0 @@
|
|
| 1 |
-
# Enhanced Trackio Interface Guide
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
Your Trackio application has been significantly enhanced to provide comprehensive monitoring and visualization for SmolLM3 training experiments. Here's how to make the most of it.
|
| 6 |
-
|
| 7 |
-
## 🚀 Key Enhancements
|
| 8 |
-
|
| 9 |
-
### 1. **Real-time Visualization**
|
| 10 |
-
- **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
|
| 11 |
-
- **Experiment Comparison**: Compare multiple training runs side-by-side
|
| 12 |
-
- **Live Updates**: Watch training progress in real-time
|
| 13 |
-
|
| 14 |
-
### 2. **Comprehensive Data Display**
|
| 15 |
-
- **Formatted Output**: Clean, emoji-rich experiment details
|
| 16 |
-
- **Statistics Overview**: Metrics count, parameters count, artifacts count
|
| 17 |
-
- **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)
|
| 18 |
-
|
| 19 |
-
### 3. **Demo Data Generation**
|
| 20 |
-
- **Realistic Simulation**: Generate realistic training metrics for testing
|
| 21 |
-
- **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
|
| 22 |
-
- **Configurable Parameters**: Customize demo data to match your setup
|
| 23 |
-
|
| 24 |
-
## 📊 How to Use with Your SmolLM3 Training
|
| 25 |
-
|
| 26 |
-
### Step 1: Start Your Training
|
| 27 |
-
```bash
|
| 28 |
-
python run_a100_large_experiment.py \
|
| 29 |
-
--config config/train_smollm3_openhermes_fr_a100_balanced.py \
|
| 30 |
-
--trackio_url "https://tonic-test-trackio-test.hf.space" \
|
| 31 |
-
--experiment-name "petit-elle-l-aime-3-balanced" \
|
| 32 |
-
--output-dir ./outputs/balanced
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
### Step 2: Monitor in Real-time
|
| 36 |
-
1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
|
| 37 |
-
2. **Go to "View Experiments" tab**
|
| 38 |
-
3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
|
| 39 |
-
4. **Click "View Experiment"** to see detailed information
|
| 40 |
-
|
| 41 |
-
### Step 3: Visualize Training Progress
|
| 42 |
-
1. **Go to "📊 Visualizations" tab**
|
| 43 |
-
2. **Enter your experiment ID**
|
| 44 |
-
3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
|
| 45 |
-
4. **Click "Create Plot"** to see interactive charts
|
| 46 |
-
|
| 47 |
-
### Step 4: Compare Experiments
|
| 48 |
-
1. **In the "📊 Visualizations" tab**
|
| 49 |
-
2. **Enter multiple experiment IDs** (comma-separated)
|
| 50 |
-
3. **Click "Compare Experiments"** to see side-by-side comparison
|
| 51 |
-
|
| 52 |
-
## 🎯 Interface Features
|
| 53 |
-
|
| 54 |
-
### Create Experiment Tab
|
| 55 |
-
- **Experiment Name**: Descriptive name for your training run
|
| 56 |
-
- **Description**: Detailed description of what you're training
|
| 57 |
-
- **Automatic ID Generation**: Unique experiment identifier
|
| 58 |
-
|
| 59 |
-
### Log Metrics Tab
|
| 60 |
-
- **Experiment ID**: The experiment to log metrics for
|
| 61 |
-
- **Metrics JSON**: Training metrics in JSON format
|
| 62 |
-
- **Step**: Current training step (optional)
|
| 63 |
-
|
| 64 |
-
Example metrics JSON:
|
| 65 |
-
```json
|
| 66 |
-
{
|
| 67 |
-
"loss": 0.5234,
|
| 68 |
-
"accuracy": 0.8567,
|
| 69 |
-
"learning_rate": 3.5e-6,
|
| 70 |
-
"gpu_memory_gb": 22.5,
|
| 71 |
-
"gpu_utilization_percent": 87.3,
|
| 72 |
-
"training_time_per_step": 0.456
|
| 73 |
-
}
|
| 74 |
-
```
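
Before pasting a payload into the "Metrics JSON" field, it can help to check it locally. The sketch below parses the text with the standard `json` module and verifies that every value is a finite number. The requirement that all values be numeric is an assumption made for this example, and `parse_metrics` is a hypothetical helper, not part of the Space.

```python
import json
import math
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse a metrics JSON object and check that every value is a finite number."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("metrics must be a JSON object")
    metrics = {}
    for name, value in data.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} is not numeric: {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} is not finite")
        metrics[name] = float(value)
    return metrics


example = '{"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6}'
print(parse_metrics(example))
```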
|
| 75 |
-
|
| 76 |
-
### Log Parameters Tab
|
| 77 |
-
- **Experiment ID**: The experiment to log parameters for
|
| 78 |
-
- **Parameters JSON**: Training configuration in JSON format
|
| 79 |
-
|
| 80 |
-
Example parameters JSON:
|
| 81 |
-
```json
|
| 82 |
-
{
|
| 83 |
-
"model_name": "HuggingFaceTB/SmolLM3-3B",
|
| 84 |
-
"batch_size": 8,
|
| 85 |
-
"learning_rate": 3.5e-6,
|
| 86 |
-
"max_iters": 18000,
|
| 87 |
-
"mixed_precision": "bf16",
|
| 88 |
-
"no_think_system_message": true
|
| 89 |
-
}
|
| 90 |
-
```
|
| 91 |
-
|
| 92 |
-
### View Experiments Tab
|
| 93 |
-
- **Experiment ID**: Enter to view specific experiment
|
| 94 |
-
- **List All Experiments**: Shows overview of all experiments
|
| 95 |
-
- **Detailed Information**: Formatted display with statistics
|
| 96 |
-
|
| 97 |
-
### 📊 Visualizations Tab
|
| 98 |
-
- **Training Metrics**: Interactive plots for individual metrics
|
| 99 |
-
- **Experiment Comparison**: Side-by-side comparison of multiple runs
|
| 100 |
-
- **Real-time Updates**: Plots update as new data is logged
|
| 101 |
-
|
| 102 |
-
### 🎯 Demo Data Tab
|
| 103 |
-
- **Generate Demo Data**: Create realistic training data for testing
|
| 104 |
-
- **Configurable**: Adjust parameters to match your setup
|
| 105 |
-
- **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
|
| 106 |
-
|
| 107 |
-
### Update Status Tab
|
| 108 |
-
- **Experiment ID**: The experiment to update
|
| 109 |
-
- **Status**: running, completed, failed, paused
|
| 110 |
-
- **Visual Indicators**: Status shown with emojis
|
| 111 |
-
|
| 112 |
-
## 📈 What Gets Displayed
|
| 113 |
-
|
| 114 |
-
### Training Metrics
|
| 115 |
-
- **Loss**: Training loss over time
|
| 116 |
-
- **Accuracy**: Model accuracy progression
|
| 117 |
-
- **Learning Rate**: Learning rate scheduling
|
| 118 |
-
- **GPU Memory**: Memory usage in GB
|
| 119 |
-
- **GPU Utilization**: GPU usage percentage
|
| 120 |
-
- **Training Time**: Time per training step
|
| 121 |
-
|
| 122 |
-
### Experiment Details
|
| 123 |
-
- **Basic Info**: ID, name, description, status, creation time
|
| 124 |
-
- **Statistics**: Metrics count, parameters count, artifacts count
|
| 125 |
-
- **Parameters**: All training configuration
|
| 126 |
-
- **Latest Metrics**: Most recent training metrics
|
| 127 |
-
|
| 128 |
-
### Visualizations
|
| 129 |
-
- **Line Charts**: Smooth curves showing metric progression
|
| 130 |
-
- **Interactive Hover**: Detailed information on hover
|
| 131 |
-
- **Multiple Metrics**: Switch between different metrics
|
| 132 |
-
- **Comparison Charts**: Side-by-side experiment comparison
|
| 133 |
-
|
| 134 |
-
## 🔧 Integration with Your Training
|
| 135 |
-
|
| 136 |
-
### Automatic Integration
|
| 137 |
-
Your training script automatically:
|
| 138 |
-
1. **Creates experiments** with your specified name
|
| 139 |
-
2. **Logs parameters** from your configuration
|
| 140 |
-
3. **Logs metrics** every 25 steps (configurable)
|
| 141 |
-
4. **Logs system metrics** (GPU memory, utilization)
|
| 142 |
-
5. **Logs checkpoints** every 2000 steps
|
| 143 |
-
6. **Updates status** when training completes
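
The step intervals listed above can be sketched with a simple modulo check. This is an illustration only: the helper `due` and the constant names are hypothetical, and skipping step 0 is an assumption rather than documented behaviour.

```python
def due(step: int, every: int) -> bool:
    """True on positive steps that are multiples of the interval."""
    return step > 0 and step % every == 0


LOG_EVERY = 25           # metrics interval quoted above
CHECKPOINT_EVERY = 2000  # checkpoint interval quoted above

metric_steps = [s for s in range(0, 101) if due(s, LOG_EVERY)]
checkpoint_steps = [s for s in range(0, 6001) if due(s, CHECKPOINT_EVERY)]
print(metric_steps)      # [25, 50, 75, 100]
print(checkpoint_steps)  # [2000, 4000, 6000]
```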
|
| 144 |
-
|
| 145 |
-
### Manual Integration
|
| 146 |
-
You can also manually:
|
| 147 |
-
1. **Create experiments** through the interface
|
| 148 |
-
2. **Log custom metrics** for specific analysis
|
| 149 |
-
3. **Compare different runs** with different parameters
|
| 150 |
-
4. **Generate demo data** for testing the interface
|
| 151 |
-
|
| 152 |
-
## 🎨 Customization
|
| 153 |
-
|
| 154 |
-
### Adding Custom Metrics
|
| 155 |
-
```python
|
| 156 |
-
# In your training script
|
| 157 |
-
custom_metrics = {
|
| 158 |
-
"loss": current_loss,
|
| 159 |
-
"accuracy": current_accuracy,
|
| 160 |
-
"custom_metric": your_custom_value,
|
| 161 |
-
"gpu_memory": gpu_memory_usage
|
| 162 |
-
}
|
| 163 |
-
|
| 164 |
-
monitor.log_metrics(custom_metrics, step=current_step)
|
| 165 |
-
```
|
| 166 |
-
|
| 167 |
-
### Custom Visualizations
|
| 168 |
-
The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
|
| 169 |
-
|
| 170 |
-
## 🚨 Troubleshooting
|
| 171 |
-
|
| 172 |
-
### No Data Displayed
|
| 173 |
-
1. **Check experiment ID**: Make sure you're using the correct ID
|
| 174 |
-
2. **Verify metrics were logged**: Check if training is actually logging metrics
|
| 175 |
-
3. **Use demo data**: Generate demo data to test the interface
|
| 176 |
-
|
| 177 |
-
### Plots Not Updating
|
| 178 |
-
1. **Refresh the page**: Sometimes plots need a refresh
|
| 179 |
-
2. **Check data format**: Ensure metrics are in the correct JSON format
|
| 180 |
-
3. **Verify step numbers**: Make sure step numbers are increasing
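
To check the last point locally, a short helper can scan the logged step numbers and report any that fail to increase. This is a hypothetical sketch; how you obtain the list of steps depends on your own logs.

```python
from typing import Iterable, List, Tuple


def non_increasing_steps(steps: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (index, step) pairs where a step fails to exceed the previous one."""
    problems = []
    previous = None
    for index, step in enumerate(steps):
        if previous is not None and step <= previous:
            problems.append((index, step))
        previous = step
    return problems


print(non_increasing_steps([25, 50, 50, 75, 60]))  # [(2, 50), (4, 60)]
```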
|
| 181 |
-
|
| 182 |
-
### Interface Not Loading
|
| 183 |
-
1. **Check dependencies**: Ensure plotly and pandas are installed
|
| 184 |
-
2. **Check Gradio version**: Use Gradio 4.0.0 or higher
|
| 185 |
-
3. **Check browser console**: Look for JavaScript errors
|
| 186 |
-
|
| 187 |
-
## 📊 Example Workflow
|
| 188 |
-
|
| 189 |
-
1. **Start Training**:
|
| 190 |
-
```bash
|
| 191 |
-
python run_a100_large_experiment.py --experiment-name "my_experiment"
|
| 192 |
-
```
|
| 193 |
-
|
| 194 |
-
2. **Monitor Progress**:
|
| 195 |
-
- Visit your Trackio Space
|
| 196 |
-
- Go to "View Experiments"
|
| 197 |
-
- Enter your experiment ID
|
| 198 |
-
- Watch real-time updates
|
| 199 |
-
|
| 200 |
-
3. **Visualize Results**:
|
| 201 |
-
- Go to "📊 Visualizations"
|
| 202 |
-
- Select "loss" metric
|
| 203 |
-
- Create plot to see training progress
|
| 204 |
-
|
| 205 |
-
4. **Compare Runs**:
|
| 206 |
-
- Run multiple experiments with different parameters
|
| 207 |
-
- Use "Compare Experiments" to see differences
|
| 208 |
-
|
| 209 |
-
5. **Generate Demo Data**:
|
| 210 |
-
- Use "🎯 Demo Data" tab to test the interface
|
| 211 |
-
- Generate realistic training data for demonstration
|
| 212 |
-
|
| 213 |
-
## 🎉 Success Indicators
|
| 214 |
-
|
| 215 |
-
Your interface is working correctly when you see:
|
| 216 |
-
- ✅ **Formatted experiment details** with emojis and structure
|
| 217 |
-
- ✅ **Interactive plots** that respond to your inputs
|
| 218 |
-
- ✅ **Real-time metric updates** during training
|
| 219 |
-
- ✅ **Clean experiment overview** with statistics
|
| 220 |
-
- ✅ **Smooth visualization** with hover information
|
| 221 |
-
|
| 222 |
-
The enhanced interface will now display much more meaningful information and provide a comprehensive monitoring experience for your SmolLM3 training experiments!