Tonic / SmolFactory

Tonic committed on Jul 29 · commit d0d19b2 (verified) · 1 parent: 08ed534

Fix model recovery and deployment scripts - add safetensors support and Windows compatibility


Files changed (13)

MODEL_RECOVERY_GUIDE.md +228 -0
cloud_deploy.py +96 -0
cloud_recovery.sh +113 -0
process_model.py +230 -0
recover_model.py +334 -0
scripts/model_tonic/push_to_huggingface.py +57 -20
scripts/model_tonic/quantize_model.py +13 -1
config_test_monitoring_auto_resolve_20250727_153310.json → test_data/config_test_monitoring_auto_resolve_20250727_153310.json +0 -0
config_test_monitoring_auto_resolve_20250727_161709.json → test_data/config_test_monitoring_auto_resolve_20250727_161709.json +0 -0
config_test_monitoring_integration_20250727_151307.json → test_data/config_test_monitoring_integration_20250727_151307.json +0 -0
config_test_monitoring_integration_20250727_151403.json → test_data/config_test_monitoring_integration_20250727_151403.json +0 -0
test_update_kwargs.py → tests/test_update_kwargs_1.py +0 -0
verify_fix.py → tests/verify_fix_1.py +0 -0

MODEL_RECOVERY_GUIDE.md ADDED

	@@ -0,0 +1,228 @@

+# Model Recovery and Deployment Guide
+This guide walks you through recovering your trained model from the cloud instance and deploying it, together with int8 and int4 quantized variants, to the Hugging Face Hub.
+## Prerequisites
+1. **Hugging Face Token**: You need a Hugging Face token with write permissions
+2. **Cloud Instance Access**: SSH access to your cloud instance
+3. **Model Files**: Your trained model should be in `/output-checkpoint/` on the cloud instance
+## Step 1: Connect to Your Cloud Instance
+```bash
+ssh root@your-cloud-instance-ip
+cd ~/smollm3_finetune
+```
+## Step 2: Set Your Hugging Face Token
+```bash
+export HF_TOKEN=your_huggingface_token_here
+```
+Replace `your_huggingface_token_here` with your actual Hugging Face token.
+## Step 3: Verify Model Files
+Check that your model files exist:
+```bash
+ls -la /output-checkpoint/
+```
+You should see files like:
+- `config.json`
+- `model.safetensors.index.json`
+- `model-00001-of-00002.safetensors`
+- `model-00002-of-00002.safetensors`
+- `tokenizer.json`
+- `tokenizer_config.json`
+## Step 4: Update Configuration
+Edit the deployment script to use your Hugging Face username:
+```bash
+nano cloud_deploy.py
+```
+Change this line:
+```python
+REPO_NAME = "your-username/smollm3-finetuned"  # Change to your HF username and desired repo name
+```
+To your actual username, for example:
+```python
+REPO_NAME = "tonic/smollm3-finetuned"
+```
+## Step 5: Run the Deployment
+Execute the deployment script:
+```bash
+python3 cloud_deploy.py
+```
+This will:
+1. ✅ Validate your model files
+2. ✅ Install required dependencies (torchao, huggingface_hub)
+3. ✅ Push the main model to Hugging Face Hub
+4. ✅ Create quantized versions (int8 and int4)
+5. ✅ Push quantized models to subdirectories
+## Step 6: Verify Deployment
+After a successful deployment, check each location:
+1. **Main Model**: https://huggingface.co/your-username/smollm3-finetuned
+2. **int8 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int8
+3. **int4 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int4
+## Alternative: Manual Deployment
+If you prefer to run the steps manually:
+### 1. Push Main Model Only
+```bash
+python3 scripts/model_tonic/push_to_huggingface.py \
+    /output-checkpoint/ \
+    your-username/smollm3-finetuned \
+    --hf-token $HF_TOKEN \
+    --author-name "Your Name" \
+    --model-description "A fine-tuned SmolLM3 model for improved text generation"
+```
+### 2. Quantize and Push (Optional)
+```bash
+# int8 quantization (GPU optimized)
+python3 scripts/model_tonic/quantize_model.py \
+    /output-checkpoint/ \
+    your-username/smollm3-finetuned \
+    --quant-type int8_weight_only \
+    --hf-token $HF_TOKEN
+# int4 quantization (CPU optimized)
+python3 scripts/model_tonic/quantize_model.py \
+    /output-checkpoint/ \
+    your-username/smollm3-finetuned \
+    --quant-type int4_weight_only \
+    --hf-token $HF_TOKEN
+```
+## Troubleshooting
+### Common Issues
+1. **HF_TOKEN not set**
+   ```bash
+   export HF_TOKEN=your_token_here
+   ```
+2. **Model files not found**
+   ```bash
+   ls -la /output-checkpoint/
+   ```
+   Make sure the training completed successfully.
+3. **Dependencies missing**
+   ```bash
+   pip install torchao huggingface_hub
+   ```
+4. **Permission denied**
+   ```bash
+   chmod +x cloud_deploy.py
+   chmod +x recover_model.py
+   ```
+### Error Messages
+- **"Missing required model files"**: Check that your model training completed successfully
+- **"Repository creation failed"**: Verify your HF token has write permissions
+- **"Quantization failed"**: Check GPU memory availability or try CPU quantization
+## Model Usage
+Once deployed, load one variant at a time. The quantized variants live in subfolders of the same repository, so pass `subfolder=` rather than appending the folder to the repository ID:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+repo_id = "your-username/smollm3-finetuned"
+# The tokenizer is stored at the repository root and is shared by all variants
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+# Main model
+model = AutoModelForCausalLM.from_pretrained(repo_id)
+# int8 quantized (GPU optimized); needs torchao installed
+# model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="int8")
+# int4 quantized (CPU optimized); needs torchao installed
+# model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="int4")
+# Generate text
+inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## File Structure
+After deployment, your repository will have:
+```
+your-username/smollm3-finetuned/
+├── README.md (model card)
+├── config.json
+├── model.safetensors.index.json
+├── model-00001-of-00002.safetensors
+├── model-00002-of-00002.safetensors
+├── tokenizer.json
+├── tokenizer_config.json
+├── int8/ (quantized model for GPU)
+│   ├── README.md
+│   ├── config.json
+│   └── pytorch_model.bin
+└── int4/ (quantized model for CPU)
+    ├── README.md
+    ├── config.json
+    └── pytorch_model.bin
+```
+## Success Indicators
+✅ **Successful deployment shows:**
+- "Model recovery and deployment completed successfully!"
+- "View your model at: https://huggingface.co/your-username/smollm3-finetuned"
+- No error messages in the output
+❌ **Failed deployment shows:**
+- Error messages about missing files or permissions
+- "Model recovery and deployment failed!"
+## Next Steps
+After successful deployment:
+1. **Test your model** on Hugging Face Hub
+2. **Share your model** with the community
+3. **Monitor usage** through Hugging Face analytics
+4. **Consider fine-tuning** further based on feedback
+## Support
+If you encounter issues:
+1. Check the error messages carefully
+2. Verify your HF token permissions
+3. Ensure all model files are present
+4. Try running individual steps manually
+5. Check the logs for detailed error information
+---
+**Happy deploying! 🚀**
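Step 3 of the guide, `cloud_deploy.py` and `cloud_recovery.sh` only confirm that `model.safetensors.index.json` exists, not that the shard files it points to were actually written. A minimal sketch of a stricter check follows. It assumes the standard Hugging Face sharded layout, in which the index's `weight_map` maps every tensor name to the shard file that stores it; `missing_shards` is a hypothetical helper, not part of this commit.

```python
import json
from pathlib import Path
from typing import List


def missing_shards(model_dir: str) -> List[str]:
    """Return shard files named in the safetensors index that are absent on disk.

    Assumes the standard sharded layout: model.safetensors.index.json holds a
    "weight_map" mapping each tensor name to the shard file that stores it.
    """
    root = Path(model_dir)
    with open(root / "model.safetensors.index.json", "r", encoding="utf-8") as f:
        index = json.load(f)
    # Many tensors share a shard, so deduplicate before checking the disk
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (root / name).is_file()]
```

On the cloud instance, `missing_shards("/output-checkpoint")` should return an empty list when every shard listed in the index is present.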

cloud_deploy.py ADDED

	@@ -0,0 +1,96 @@

+#!/usr/bin/env python3
+"""
+Cloud Model Deployment Script
+Run this directly on your cloud instance to deploy your trained model
+"""
+import os
+import subprocess
+import sys
+import logging
+from pathlib import Path
+# Setup logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+def main():
+    """Main deployment function"""
+    # Configuration - CHANGE THESE VALUES
+    MODEL_PATH = "/output-checkpoint"
+    REPO_NAME = "your-username/smollm3-finetuned"  # Change to your HF username and desired repo name
+    HF_TOKEN = os.getenv('HF_TOKEN')
+    PRIVATE = False  # Set to True for private repository
+    # Validate configuration
+    if not HF_TOKEN:
+        logger.error("❌ HF_TOKEN environment variable not set")
+        logger.info("Please set your Hugging Face token:")
+        logger.info("export HF_TOKEN=your_token_here")
+        return 1
+    if not Path(MODEL_PATH).exists():
+        logger.error(f"❌ Model path not found: {MODEL_PATH}")
+        return 1
+    # Check for required files
+    required_files = ['config.json', 'model.safetensors.index.json', 'tokenizer.json']
+    for file in required_files:
+        if not (Path(MODEL_PATH) / file).exists():
+            logger.error(f"❌ Required file not found: {file}")
+            return 1
+    logger.info("✅ Model files validated")
+    # Install dependencies if needed, using the pip of the interpreter running this script
+    try:
+        import torchao  # noqa: F401
+        logger.info("✅ torchao available")
+    except ImportError:
+        logger.info("📦 Installing torchao...")
+        subprocess.run([sys.executable, "-m", "pip", "install", "torchao"])
+    try:
+        import huggingface_hub  # noqa: F401
+        logger.info("✅ huggingface_hub available")
+    except ImportError:
+        logger.info("📦 Installing huggingface_hub...")
+        subprocess.run([sys.executable, "-m", "pip", "install", "huggingface_hub"])
+    # Run the recovery script
+    logger.info("🚀 Starting model deployment...")
+    cmd = [
+        sys.executable, "recover_model.py",
+        MODEL_PATH,
+        REPO_NAME,
+        "--hf-token", HF_TOKEN,
+        "--quant-types", "int8_weight_only", "int4_weight_only",
+        "--author-name", "Your Name",
+        "--model-description", "A fine-tuned SmolLM3 model for improved text generation and conversation capabilities"
+    ]
+    if PRIVATE:
+        cmd.append("--private")
+    # Log the command without printing the token
+    logger.info(f"Running: recover_model.py {MODEL_PATH} {REPO_NAME} (token hidden)")
+    # Pass the argument list directly (no shell): joining it into one string
+    # would split values such as "Your Name" into separate arguments
+    result = subprocess.run(cmd).returncode
+    if result == 0:
+        logger.info("✅ Model deployment completed successfully!")
+        logger.info(f"🌐 View your model at: https://huggingface.co/{REPO_NAME}")
+        logger.info("📊 Quantized models available at:")
+        logger.info(f"  - https://huggingface.co/{REPO_NAME}/tree/main/int8 (GPU optimized)")
+        logger.info(f"  - https://huggingface.co/{REPO_NAME}/tree/main/int4 (CPU optimized)")
+        return 0
+    else:
+        logger.error("❌ Model deployment failed!")
+        return 1
+if __name__ == "__main__":
+    exit(main())
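`cloud_deploy.py` builds its command as a Python list. Handing that list to the shell as one space-joined string lets the shell split `"Your Name"` and the model description into separate words, which `recover_model.py`'s argparse then rejects as unrecognized arguments; logging that string also prints `HF_TOKEN` in plain text. The sketch below, using hypothetical helper names, keeps the list intact for `subprocess.run` and uses `shlex.join` only to produce a readable, token-redacted log line.

```python
import shlex
import subprocess
import sys
from typing import List


def build_recover_cmd(model_path: str, repo_name: str, token: str,
                      author: str, description: str, private: bool) -> List[str]:
    """Build the recover_model.py argv; flags mirror those used in cloud_deploy.py."""
    cmd = [
        sys.executable, "recover_model.py", model_path, repo_name,
        "--hf-token", token,
        "--quant-types", "int8_weight_only", "int4_weight_only",
        "--author-name", author,
        "--model-description", description,
    ]
    if private:
        cmd.append("--private")  # store_true flag: present or absent, never given a value
    return cmd


def redacted(cmd: List[str]) -> str:
    """Shell-quoted rendering of cmd for logs, with the token value masked."""
    shown = list(cmd)
    if "--hf-token" in shown:
        i = shown.index("--hf-token")
        if i + 1 < len(shown):
            shown[i + 1] = "***"
    return shlex.join(shown)


def run(cmd: List[str]) -> int:
    """Run cmd without a shell so each list item arrives as exactly one argument."""
    return subprocess.run(cmd).returncode
```

With this split, `main()` could log `redacted(cmd)` and return `run(cmd)`; the list form also avoids depending on POSIX shell quoting rules, which matters for the Windows compatibility this commit targets.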

cloud_recovery.sh ADDED

	@@ -0,0 +1,113 @@

+#!/bin/bash
+# Cloud Model Recovery and Deployment Script
+# Run this on your cloud instance to recover and deploy your trained model
+set -e  # Exit on any error
+echo "🚀 Starting cloud model recovery and deployment..."
+# Configuration
+MODEL_PATH="/output-checkpoint"
+REPO_NAME="your-username/smollm3-finetuned"  # Change this to your HF username and desired repo name
+HF_TOKEN="${HF_TOKEN}"  # Set this environment variable
+PRIVATE=false  # Set to true if you want a private repository
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+# Function to print colored output
+print_status() {
+    echo -e "${BLUE}[INFO]${NC} $1"
+}
+print_success() {
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
+}
+print_warning() {
+    echo -e "${YELLOW}[WARNING]${NC} $1"
+}
+print_error() {
+    echo -e "${RED}[ERROR]${NC} $1"
+}
+# Check that the model directory exists
+if [ ! -d "$MODEL_PATH" ]; then
+    print_error "Model path not found: $MODEL_PATH"
+    exit 1
+fi
+print_status "Found model at: $MODEL_PATH"
+# Check for required files
+print_status "Validating model files..."
+if [ ! -f "$MODEL_PATH/config.json" ]; then
+    print_error "config.json not found"
+    exit 1
+fi
+if [ ! -f "$MODEL_PATH/model.safetensors.index.json" ]; then
+    print_error "model.safetensors.index.json not found"
+    exit 1
+fi
+if [ ! -f "$MODEL_PATH/tokenizer.json" ]; then
+    print_error "tokenizer.json not found"
+    exit 1
+fi
+print_success "Model files validated"
+# Check HF token
+if [ -z "$HF_TOKEN" ]; then
+    print_error "HF_TOKEN environment variable not set"
+    print_status "Please set your Hugging Face token:"
+    print_status "export HF_TOKEN=your_token_here"
+    exit 1
+fi
+print_success "HF Token found"
+# Install required packages if not already installed
+print_status "Checking dependencies..."
+python3 -c "import torchao" 2>/dev/null || {
+    print_status "Installing torchao..."
+    pip install torchao
+}
+python3 -c "import huggingface_hub" 2>/dev/null || {
+    print_status "Installing huggingface_hub..."
+    pip install huggingface_hub
+}
+print_success "Dependencies checked"
+# Run the recovery script
+print_status "Running model recovery and deployment pipeline..."
+# recover_model.py defines --private as a flag that takes no value,
+# so pass it only when a private repository is requested
+PRIVATE_ARGS=()
+if [ "$PRIVATE" = true ]; then
+    PRIVATE_ARGS=(--private)
+fi
+# Test the command directly: under `set -e` a failing command would otherwise
+# exit the script before a separate `$?` check could run
+if python3 recover_model.py \
+    "$MODEL_PATH" \
+    "$REPO_NAME" \
+    --hf-token "$HF_TOKEN" \
+    "${PRIVATE_ARGS[@]}" \
+    --quant-types int8_weight_only int4_weight_only \
+    --author-name "Your Name" \
+    --model-description "A fine-tuned SmolLM3 model for improved text generation and conversation capabilities"; then
+    print_success "Model recovery and deployment completed successfully!"
+    print_success "View your model at: https://huggingface.co/$REPO_NAME"
+    print_success "Quantized models available at:"
+    print_success "  - https://huggingface.co/$REPO_NAME/tree/main/int8 (GPU optimized)"
+    print_success "  - https://huggingface.co/$REPO_NAME/tree/main/int4 (CPU optimized)"
+else
+    print_error "Model recovery and deployment failed!"
+    exit 1
+fi
+print_success "🎉 All done! Your model has been successfully recovered and deployed to Hugging Face Hub."
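`recover_model.py` declares `--private` with `action='store_true'`, so the flag takes no value: invoking it as `--private false` (or `--private true`) leaves the word as an extra positional argument, and argparse exits with an error. The sketch below reproduces only the relevant part of that parser as a stand-in, not the script itself, and shows the flag being emitted only when a private repository is wanted; `make_parser` and `build_argv` are hypothetical names.

```python
import argparse
from typing import List


def make_parser() -> argparse.ArgumentParser:
    """Stand-in for the subset of recover_model.py options relevant here."""
    parser = argparse.ArgumentParser(prog="recover_model.py")
    parser.add_argument("model_path")
    parser.add_argument("repo_name")
    parser.add_argument("--hf-token", default=None)
    parser.add_argument("--private", action="store_true")
    return parser


def build_argv(model_path: str, repo_name: str, token: str, private: bool) -> List[str]:
    """Emit --private only when requested; a store_true flag never takes a value."""
    argv = [model_path, repo_name, "--hf-token", token]
    if private:
        argv.append("--private")
    return argv
```

The shell equivalent is an array that is either empty or holds `--private`, expanded as `"${ARRAY[@]}"` so that it disappears entirely when empty.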

process_model.py ADDED

	@@ -0,0 +1,230 @@

+#!/usr/bin/env python3
+"""
+Model Processing Script
+Processes recovered model with quantization and pushing to HF Hub
+"""
+import os
+import sys
+import json
+import logging
+import subprocess
+from pathlib import Path
+from typing import Dict, Any, Optional
+# Setup logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+class ModelProcessor:
+    """Process recovered model with quantization and pushing"""
+    def __init__(self, model_path: str = "recovered_model"):
+        self.model_path = Path(model_path)
+        self.hf_token = os.getenv('HF_TOKEN')
+    def validate_model(self) -> bool:
+        """Validate that the model can be loaded"""
+        try:
+            logger.info("🔍 Validating model loading...")
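+            # Try to load the model from self.model_path, which is passed to the
+            # child interpreter as sys.argv[1] instead of being hardcoded
+            cmd = [
+                sys.executable, "-c",
+                "import sys; "
+                "from transformers import AutoModelForCausalLM; "
+                "model = AutoModelForCausalLM.from_pretrained(sys.argv[1], "
+                "torch_dtype='auto', device_map='auto'); "
+                "print('✅ Model loaded successfully')",
+                str(self.model_path)
+            ]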
+            result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
+            if result.returncode == 0:
+                logger.info("✅ Model validation successful")
+                return True
+            else:
+                logger.error(f"❌ Model validation failed: {result.stderr}")
+                return False
+        except Exception as e:
+            logger.error(f"❌ Model validation error: {e}")
+            return False
+    def get_model_info(self) -> Dict[str, Any]:
+        """Get information about the model"""
+        try:
+            # Load config
+            config_path = self.model_path / "config.json"
+            if config_path.exists():
+                with open(config_path, 'r') as f:
+                    config = json.load(f)
+            else:
+                config = {}
+            # Calculate model size
+            total_size = 0
+            for file in self.model_path.rglob("*"):
+                if file.is_file():
+                    total_size += file.stat().st_size
+            model_info = {
+                "model_type": config.get("model_type", "smollm3"),
+                "architectures": config.get("architectures", ["SmolLM3ForCausalLM"]),
+                "model_size_gb": total_size / (1024**3),
+                "vocab_size": config.get("vocab_size", 32000),
+                "hidden_size": config.get("hidden_size", 2048),
+                "num_attention_heads": config.get("num_attention_heads", 16),
+                "num_hidden_layers": config.get("num_hidden_layers", 24),
+                "max_position_embeddings": config.get("max_position_embeddings", 8192)
+            }
+            logger.info(f"📊 Model info: {model_info}")
+            return model_info
+        except Exception as e:
+            logger.error(f"❌ Failed to get model info: {e}")
+            return {}
+    def run_quantization(self, repo_name: str, quant_type: str = "int8_weight_only") -> bool:
+        """Run quantization on the model"""
+        try:
+            logger.info(f"🔄 Running quantization: {quant_type}")
+            # Check if quantization script exists
+            quantize_script = Path("scripts/model_tonic/quantize_model.py")
+            if not quantize_script.exists():
+                logger.error(f"❌ Quantization script not found: {quantize_script}")
+                return False
+            # Run quantization
+            cmd = [
+                sys.executable, str(quantize_script),
+                str(self.model_path),
+                repo_name,
+                "--quant-type", quant_type,
+                "--device", "auto"
+            ]
+            if self.hf_token:
+                cmd.extend(["--token", self.hf_token])
+            logger.info(f"🚀 Running: {' '.join(cmd)}")
+            result = subprocess.run(cmd, capture_output=True, text=True, timeout=1800)  # 30 min timeout
+            if result.returncode == 0:
+                logger.info("✅ Quantization completed successfully")
+                logger.info(result.stdout)
+                return True
+            else:
+                logger.error("❌ Quantization failed")
+                logger.error(result.stderr)
+                return False
+        except subprocess.TimeoutExpired:
+            logger.error("❌ Quantization timed out")
+            return False
+        except Exception as e:
+            logger.error(f"❌ Failed to run quantization: {e}")
+            return False
+    def run_model_push(self, repo_name: str) -> bool:
+        """Push the model to HF Hub"""
+        try:
+            logger.info(f"🔄 Pushing model to: {repo_name}")
+            # Check if push script exists
+            push_script = Path("scripts/model_tonic/push_to_huggingface.py")
+            if not push_script.exists():
+                logger.error(f"❌ Push script not found: {push_script}")
+                return False
+            # Run push
+            cmd = [
+                sys.executable, str(push_script),
+                str(self.model_path),
+                repo_name
+            ]
+            if self.hf_token:
+                cmd.extend(["--token", self.hf_token])
+            logger.info(f"🚀 Running: {' '.join(cmd)}")
+            result = subprocess.run(cmd, capture_output=True, text=True, timeout=1800)  # 30 min timeout
+            if result.returncode == 0:
+                logger.info("✅ Model push completed successfully")
+                logger.info(result.stdout)
+                return True
+            else:
+                logger.error("❌ Model push failed")
+                logger.error(result.stderr)
+                return False
+        except subprocess.TimeoutExpired:
+            logger.error("❌ Model push timed out")
+            return False
+        except Exception as e:
+            logger.error(f"❌ Failed to push model: {e}")
+            return False
+    def process_model(self, repo_name: str, quantize: bool = True, push: bool = True) -> bool:
+        """Complete model processing workflow"""
+        logger.info("🚀 Starting model processing...")
+        # Step 1: Validate model
+        if not self.validate_model():
+            logger.error("❌ Model validation failed")
+            return False
+        # Step 2: Get model info
+        model_info = self.get_model_info()
+        # Step 3: Quantize if requested
+        if quantize:
+            if not self.run_quantization(repo_name):
+                logger.error("❌ Quantization failed")
+                return False
+        # Step 4: Push if requested
+        if push:
+            if not self.run_model_push(repo_name):
+                logger.error("❌ Model push failed")
+                return False
+        logger.info("🎉 Model processing completed successfully!")
+        logger.info(f"🌐 View your model at: https://huggingface.co/{repo_name}")
+        return True
+def main():
+    """Main function"""
+    import argparse
+    parser = argparse.ArgumentParser(description="Process recovered model")
+    parser.add_argument("repo_name", help="Hugging Face repository name (username/model-name)")
+    parser.add_argument("--model-path", default="recovered_model", help="Path to recovered model")
+    parser.add_argument("--no-quantize", action="store_true", help="Skip quantization")
+    parser.add_argument("--no-push", action="store_true", help="Skip pushing to HF Hub")
+    parser.add_argument("--quant-type", default="int8_weight_only",
+                       choices=["int8_weight_only", "int4_weight_only", "int8_dynamic"],
+                       help="Quantization type")
+    args = parser.parse_args()
+    # Initialize processor
+    processor = ModelProcessor(args.model_path)
+    # Process model
+    success = processor.process_model(
+        repo_name=args.repo_name,
+        quantize=not args.no_quantize,
+        push=not args.no_push
+    )
+    return 0 if success else 1
+if __name__ == "__main__":
+    exit(main())
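`ModelProcessor.validate_model` checks loadability by starting a fresh interpreter with `python -c`. Passing the model directory as an extra argument, read back as `sys.argv[1]`, keeps the check tied to `--model-path` rather than to a path written into the code string. The sketch below shows that pattern together with the timeout handling `process_model.py` already uses; `run_python_check` and `LOAD_CHECK` are hypothetical names, and the load check itself assumes `transformers` (plus `accelerate` for `device_map="auto"`) is installed.

```python
import subprocess
import sys
from typing import Tuple


def run_python_check(code: str, *args: str, timeout: float = 300) -> Tuple[bool, str]:
    """Run `code` in a fresh interpreter, passing `args` as sys.argv[1:].

    Returns (succeeded, stderr); a timeout counts as a failure.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    return result.returncode == 0, result.stderr


# Load check for an arbitrary model directory; the path arrives via sys.argv[1]
LOAD_CHECK = (
    "import sys\n"
    "from transformers import AutoModelForCausalLM\n"
    "AutoModelForCausalLM.from_pretrained(sys.argv[1], torch_dtype='auto', device_map='auto')\n"
)
```

For example, `run_python_check(LOAD_CHECK, "recovered_model")` would report whether that directory loads, and the returned stderr can go straight into the existing error log.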

recover_model.py ADDED

	@@ -0,0 +1,334 @@

+#!/usr/bin/env python3
+"""
+Model Recovery and Deployment Script
+Recovers trained model from cloud instance, quantizes it, and pushes to Hugging Face Hub
+"""
+import os
+import sys
+import json
+import argparse
+import logging
+import subprocess
+from pathlib import Path
+from typing import Dict, Any, Optional
+from datetime import datetime
+# Setup logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+sys.path.append(os.path.join(os.path.dirname(__file__), 'src'))
+class ModelRecoveryPipeline:
+    """Complete model recovery and deployment pipeline"""
+    def __init__(
+        self,
+        model_path: str,
+        repo_name: str,
+        hf_token: Optional[str] = None,
+        private: bool = False,
+        quantize: bool = True,
+        quant_types: Optional[list] = None,
+        trackio_url: Optional[str] = None,
+        experiment_name: Optional[str] = None,
+        dataset_repo: Optional[str] = None,
+        author_name: Optional[str] = None,
+        model_description: Optional[str] = None
+    ):
+        self.model_path = Path(model_path)
+        self.repo_name = repo_name
+        self.hf_token = hf_token or os.getenv('HF_TOKEN')
+        self.private = private
+        self.quantize = quantize
+        self.quant_types = quant_types or ["int8_weight_only", "int4_weight_only"]
+        self.trackio_url = trackio_url
+        self.experiment_name = experiment_name
+        self.dataset_repo = dataset_repo
+        self.author_name = author_name
+        self.model_description = model_description
+        # Validate HF token
+        if not self.hf_token:
+            raise ValueError("HF_TOKEN environment variable or --hf-token argument is required")
+        logger.info(f"Initialized ModelRecoveryPipeline for {repo_name}")
+        logger.info(f"Model path: {self.model_path}")
+        logger.info(f"Quantization enabled: {self.quantize}")
+        if self.quantize:
+            logger.info(f"Quantization types: {self.quant_types}")
+    def validate_model_path(self) -> bool:
+        """Validate that the model path contains required files"""
+        if not self.model_path.exists():
+            logger.error(f"❌ Model path does not exist: {self.model_path}")
+            return False
+        # Check for essential model files
+        required_files = ['config.json']
+        # Check for model files (either safetensors or pytorch)
+        model_files = [
+            "model.safetensors",  # Single-file safetensors format
+            "model.safetensors.index.json",  # Sharded safetensors format
+            "pytorch_model.bin"  # PyTorch format
+        ]
+        missing_files = []
+        for file in required_files:
+            if not (self.model_path / file).exists():
+                missing_files.append(file)
+        # Check if at least one model file exists
+        model_file_exists = any((self.model_path / file).exists() for file in model_files)
+        if not model_file_exists:
+            missing_files.extend(model_files)
+        if missing_files:
+            logger.error(f"❌ Missing required model files: {missing_files}")
+            return False
+        logger.info("✅ Model files validated")
+        return True
+    def load_training_config(self) -> Dict[str, Any]:
+        """Load training configuration from model directory"""
+        config_files = [
+            "training_config.json",
+            "config_petite_llm_3_fr_1_20250727_152504.json",
+            "config_petite_llm_3_fr_1_20250727_152524.json"
+        ]
+        for config_file in config_files:
+            config_path = self.model_path / config_file
+            if config_path.exists():
+                with open(config_path, 'r') as f:
+                    config = json.load(f)
+                logger.info(f"✅ Loaded training config from: {config_file}")
+                return config
+        # Fallback to basic config
+        logger.warning("⚠️ No training config found, using default")
+        return {
+            "model_name": "HuggingFaceTB/SmolLM3-3B",
+            "dataset_name": "OpenHermes-FR",
+            "training_config_type": "Custom Configuration",
+            "trainer_type": "SFTTrainer",
+            "per_device_train_batch_size": 8,
+            "gradient_accumulation_steps": 16,
+            "learning_rate": "5e-6",
+            "num_train_epochs": 3,
+            "max_seq_length": 2048,
+            "dataset_size": "~80K samples",
+            "dataset_format": "Chat format"
+        }
+    def load_training_results(self) -> Dict[str, Any]:
+        """Load training results from model directory"""
+        results_files = [
+            "train_results.json",
+            "training_summary_petite_llm_3_fr_1_20250727_152504.json",
+            "training_summary_petite_llm_3_fr_1_20250727_152524.json"
+        ]
+        for results_file in results_files:
+            results_path = self.model_path / results_file
+            if results_path.exists():
+                with open(results_path, 'r') as f:
+                    results = json.load(f)
+                logger.info(f"✅ Loaded training results from: {results_file}")
+                return results
+        # Fallback to basic results
+        logger.warning("⚠️ No training results found, using default")
+        return {
+            "final_loss": "Unknown",
+            "total_steps": "Unknown",
+            "train_loss": "Unknown",
+            "eval_loss": "Unknown"
+        }
+    def push_main_model(self) -> bool:
+        """Push the main model to Hugging Face Hub"""
+        try:
+            logger.info("🚀 Pushing main model to Hugging Face Hub...")
+            # Import push script
+            from scripts.model_tonic.push_to_huggingface import HuggingFacePusher
+            # Load training data
+            training_config = self.load_training_config()
+            training_results = self.load_training_results()
+            # Initialize pusher
+            pusher = HuggingFacePusher(
+                model_path=str(self.model_path),
+                repo_name=self.repo_name,
+                token=self.hf_token,
+                private=self.private,
+                trackio_url=self.trackio_url,
+                experiment_name=self.experiment_name,
+                dataset_repo=self.dataset_repo,
+                hf_token=self.hf_token,
+                author_name=self.author_name,
+                model_description=self.model_description
+            )
+            # Push model
+            success = pusher.push_model(training_config, training_results)
+            if success:
+                logger.info(f"✅ Main model pushed successfully to: https://huggingface.co/{self.repo_name}")
+                return True
+            else:
+                logger.error("❌ Failed to push main model")
+                return False
+        except Exception as e:
+            logger.error(f"❌ Error pushing main model: {e}")
+            return False
+    def quantize_and_push_models(self) -> bool:
+        """Quantize and push models to Hugging Face Hub"""
+        if not self.quantize:
+            logger.info("⏭️ Skipping quantization (disabled)")
+            return True
+        try:
+            logger.info("🔄 Starting quantization and push process...")
+            # Import quantization script
+            from scripts.model_tonic.quantize_model import ModelQuantizer
+            success_count = 0
+            total_count = len(self.quant_types)
+            for quant_type in self.quant_types:
+                logger.info(f"🔄 Processing quantization type: {quant_type}")
+                # Initialize quantizer
+                quantizer = ModelQuantizer(
+                    model_path=str(self.model_path),
+                    repo_name=self.repo_name,
+                    token=self.hf_token,
+                    private=self.private,
+                    trackio_url=self.trackio_url,
+                    experiment_name=self.experiment_name,
+                    dataset_repo=self.dataset_repo,
+                    hf_token=self.hf_token
+                )
+                # Perform quantization and push
+                success = quantizer.quantize_and_push(
+                    quant_type=quant_type,
+                    device="auto",
+                    group_size=128
+                )
+                if success:
+                    logger.info(f"✅ {quant_type} quantization and push completed")
+                    success_count += 1
+                else:
+                    logger.error(f"❌ {quant_type} quantization and push failed")
+            logger.info(f"📊 Quantization summary: {success_count}/{total_count} successful")
+            return success_count > 0
+        except Exception as e:
+            logger.error(f"❌ Error during quantization: {e}")
+            return False
+    def run_complete_pipeline(self) -> bool:
+        """Run the complete model recovery and deployment pipeline"""
+        logger.info("🚀 Starting complete model recovery and deployment pipeline")
+        # Step 1: Validate model path
+        if not self.validate_model_path():
+            logger.error("❌ Model validation failed")
+            return False
+        # Step 2: Push main model
+        if not self.push_main_model():
+            logger.error("❌ Main model push failed")
+            return False
+        # Step 3: Quantize and push models
+        if not self.quantize_and_push_models():
+            logger.warning("⚠️ Quantization failed, but main model was pushed successfully")
+        logger.info("🎉 Model recovery and deployment pipeline completed!")
+        logger.info(f"🌐 View your model at: https://huggingface.co/{self.repo_name}")
+        return True
+def parse_args():
+    """Parse command line arguments"""
+    parser = argparse.ArgumentParser(description='Recover and deploy trained model to Hugging Face Hub')
+    # Required arguments
+    parser.add_argument('model_path', type=str, help='Path to trained model directory')
+    parser.add_argument('repo_name', type=str, help='Hugging Face repository name (username/repo-name)')
+    # Optional arguments
+    parser.add_argument('--hf-token', type=str, default=None, help='Hugging Face token')
+    parser.add_argument('--private', action='store_true', help='Make repository private')
+    parser.add_argument('--no-quantize', action='store_true', help='Skip quantization')
+    parser.add_argument('--quant-types', nargs='+',
+                       choices=['int8_weight_only', 'int4_weight_only', 'int8_dynamic'],
+                       default=['int8_weight_only', 'int4_weight_only'],
+                       help='Quantization types to apply')
+    parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
+    parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
+    parser.add_argument('--dataset-repo', type=str, default=None, help='HF Dataset repository for experiment storage')
+    parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
+    parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
+    return parser.parse_args()
+def main():
+    """Main function"""
+    args = parse_args()
+    # Setup logging
+    logging.basicConfig(
+        level=logging.INFO,
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+    )
+    logger.info("Starting model recovery and deployment pipeline")
+    # Initialize pipeline
+    try:
+        pipeline = ModelRecoveryPipeline(
+            model_path=args.model_path,
+            repo_name=args.repo_name,
+            hf_token=args.hf_token,
+            private=args.private,
+            quantize=not args.no_quantize,
+            quant_types=args.quant_types,
+            trackio_url=args.trackio_url,
+            experiment_name=args.experiment_name,
+            dataset_repo=args.dataset_repo,
+            author_name=args.author_name,
+            model_description=args.model_description
+        )
+        # Run complete pipeline
+        success = pipeline.run_complete_pipeline()
+        if success:
+            logger.info("✅ Model recovery and deployment completed successfully!")
+            return 0
+        else:
+            logger.error("❌ Model recovery and deployment failed!")
+            return 1
+    except Exception as e:
+        logger.error(f"❌ Error during model recovery: {e}")
+        return 1
+if __name__ == "__main__":
+    exit(main())
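The tail of `recover_model.py` applies two different failure policies. Validation and the main push are fatal, so `run_complete_pipeline` returns `False` as soon as either fails. Quantization is optional: a failure only logs a warning, and `quantize_and_push_models` reports success when `success_count > 0`, meaning at least one quantization type went through. The minimal sketch below isolates that policy. The names `quantize_all` and `run_pipeline` and the lambda stand-ins are hypothetical; they are not methods of `ModelRecoveryPipeline`.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def quantize_all(quant_types: Iterable[str], quantize_one: Callable[[str], bool]) -> bool:
    """Try every quantization type; succeed if at least one of them succeeded."""
    quant_types = list(quant_types)
    success_count = 0
    for quant_type in quant_types:
        if quantize_one(quant_type):
            success_count += 1
        else:
            logger.error("%s quantization failed", quant_type)
    logger.info("Quantization summary: %d/%d successful", success_count, len(quant_types))
    return success_count > 0


def run_pipeline(validate: Callable[[], bool],
                 push_main: Callable[[], bool],
                 quantize: Callable[[], bool]) -> bool:
    """Required steps abort the run; the optional quantization step only warns."""
    if not validate():
        logger.error("Model validation failed")
        return False
    if not push_main():
        logger.error("Main model push failed")
        return False
    if not quantize():
        logger.warning("Quantization failed, but the main model was pushed")
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # int4 fails and int8 succeeds: quantization as a whole still counts as a success.
    print(quantize_all(["int8_weight_only", "int4_weight_only"],
                       lambda qt: qt == "int8_weight_only"))  # True
    # A failed quantization step does not fail the pipeline...
    print(run_pipeline(lambda: True, lambda: True, lambda: False))  # True
    # ...but a failed main push does.
    print(run_pipeline(lambda: True, lambda: False, lambda: True))  # False
```

One consequence of this design is that the process exit code from `main()` is `0` even when every quantization type fails, so callers that need the quantized variants should check the log summary rather than the exit status.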

scripts/model_tonic/push_to_huggingface.py CHANGED

@@ -8,11 +8,17 @@ import os
 import json
 import argparse
 import logging
 from pathlib import Path
 from typing import Dict, Any, Optional, List
 from datetime import datetime
 import subprocess
 import shutil
 try:
     from huggingface_hub import HfApi, create_repo, upload_file
@@ -34,6 +40,14 @@ except ImportError:
 logger = logging.getLogger(__name__)
 class HuggingFacePusher:
     """Push trained models and results to Hugging Face Hub with HF Datasets integration"""
@@ -88,16 +102,22 @@ class HuggingFacePusher:
         try:
             logger.info(f"Creating repository: {self.repo_name}")
-            # Create repository
-            create_repo(
-                repo_id=self.repo_name,
-                token=self.token,
-                private=self.private,
-                exist_ok=True
-            )
-            logger.info(f"✅ Repository created: https://huggingface.co/{self.repo_name}")
-            return True
         except Exception as e:
             logger.error(f"❌ Failed to create repository: {e}")
@@ -105,18 +125,29 @@ class HuggingFacePusher:
     def validate_model_path(self) -> bool:
         """Validate that the model path contains required files"""
         required_files = [
             "config.json",
-            "pytorch_model.bin",
             "tokenizer.json",
             "tokenizer_config.json"
         ]
         missing_files = []
         for file in required_files:
             if not (self.model_path / file).exists():
                 missing_files.append(file)
         if missing_files:
             logger.error(f"❌ Missing required files: {missing_files}")
             return False
@@ -246,7 +277,6 @@ This model is fine-tuned for specific tasks and may not generalize well to all u
 This model is licensed under the Apache 2.0 License.
 """
-        # return model_card
     def _get_model_size(self) -> float:
         """Get model size in GB"""
@@ -272,7 +302,7 @@ This model is licensed under the Apache 2.0 License.
             return "Unknown"
     def upload_model_files(self) -> bool:
-        """Upload model files to Hugging Face Hub"""
         try:
             logger.info("Uploading model files...")
@@ -283,12 +313,19 @@ This model is licensed under the Apache 2.0 License.
                     remote_path = str(relative_path)
                     logger.info(f"Uploading {relative_path}")
-                    upload_file(
-                        path_or_fileobj=str(file_path),
-                        path_in_repo=remote_path,
-                        repo_id=self.repo_name,
-                        token=self.token
-                    )
             logger.info("✅ Model files uploaded successfully")
             return True
@@ -378,7 +415,7 @@ Training metrics and configuration are stored in the HF Dataset repository: `{se
 ## Files
-- `pytorch_model.bin`: Model weights
 - `config.json`: Model configuration
 - `tokenizer.json`: Tokenizer configuration
 - `training_results/`: Training logs and results

 import json
 import argparse
 import logging
+import time
 from pathlib import Path
 from typing import Dict, Any, Optional, List
 from datetime import datetime
 import subprocess
 import shutil
+import platform
+# Set timeout for HF operations to prevent hanging
+os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '300'
+os.environ['HF_HUB_UPLOAD_TIMEOUT'] = '600'
 try:
     from huggingface_hub import HfApi, create_repo, upload_file
 logger = logging.getLogger(__name__)
+class TimeoutError(Exception):
+    """Custom timeout exception"""
+    pass
+def timeout_handler(signum, frame):
+    """Signal handler for timeout"""
+    raise TimeoutError("Operation timed out")
 class HuggingFacePusher:
     """Push trained models and results to Hugging Face Hub with HF Datasets integration"""
         try:
             logger.info(f"Creating repository: {self.repo_name}")
+            # Create repository; any error is logged and reported as a failure
+            try:
+                # Create repository
+                create_repo(
+                    repo_id=self.repo_name,
+                    token=self.token,
+                    private=self.private,
+                    exist_ok=True
+                )
+                logger.info(f"✅ Repository created: https://huggingface.co/{self.repo_name}")
+                return True
+            except Exception as e:
+                logger.error(f"❌ Repository creation failed: {e}")
+                return False
         except Exception as e:
             logger.error(f"❌ Failed to create repository: {e}")
     def validate_model_path(self) -> bool:
         """Validate that the model path contains required files"""
+        # Support both safetensors and pytorch formats
         required_files = [
             "config.json",
             "tokenizer.json",
             "tokenizer_config.json"
         ]
+        # Check for model files (either safetensors or pytorch)
+        model_files = [
+            "model.safetensors.index.json",  # Safetensors format
+            "pytorch_model.bin"  # PyTorch format
+        ]
         missing_files = []
         for file in required_files:
             if not (self.model_path / file).exists():
                 missing_files.append(file)
+        # Check if at least one model file exists
+        model_file_exists = any((self.model_path / file).exists() for file in model_files)
+        if not model_file_exists:
+            missing_files.extend(model_files)
         if missing_files:
             logger.error(f"❌ Missing required files: {missing_files}")
             return False
 This model is licensed under the Apache 2.0 License.
 """
     def _get_model_size(self) -> float:
         """Get model size in GB"""
             return "Unknown"
     def upload_model_files(self) -> bool:
+        """Upload model files to Hugging Face Hub, stopping at the first file that fails"""
         try:
             logger.info("Uploading model files...")
                     remote_path = str(relative_path)
                     logger.info(f"Uploading {relative_path}")
+                    try:
+                        upload_file(
+                            path_or_fileobj=str(file_path),
+                            path_in_repo=remote_path,
+                            repo_id=self.repo_name,
+                            token=self.token
+                        )
+                        logger.info(f"✅ Uploaded {relative_path}")
+                    except Exception as e:
+                        logger.error(f"❌ Failed to upload {relative_path}: {e}")
+                        return False
             logger.info("✅ Model files uploaded successfully")
             return True
 ## Files
+- `model.safetensors.index.json`: Index mapping each tensor to its safetensors weight shard
 - `config.json`: Model configuration
 - `tokenizer.json`: Tokenizer configuration
 - `training_results/`: Training logs and results
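The `timeout_handler(signum, frame)` added to `push_to_huggingface.py` is, per its docstring, a signal handler. A timeout built on `signal.SIGALRM` and `signal.alarm` only works on Unix, which conflicts with this commit's Windows-compatibility goal. One portable alternative is to run the blocking call in a worker thread and wait on it with a deadline. The sketch below assumes that approach is acceptable for calls such as `upload_file`. `call_with_timeout` and `OperationTimeout` are hypothetical names, not part of the script or of `huggingface_hub`. Note that Python cannot forcibly stop a thread, so on timeout the call is abandoned rather than cancelled.

```python
import concurrent.futures
import time
from typing import Any, Callable


class OperationTimeout(Exception):
    """Raised when a wrapped call exceeds its deadline (named to avoid shadowing the builtin TimeoutError)."""


def call_with_timeout(func: Callable[..., Any], timeout_s: float, *args: Any, **kwargs: Any) -> Any:
    """Run func(*args, **kwargs) in a worker thread and wait at most timeout_s seconds.

    Portable to Windows because it does not rely on signal.SIGALRM.
    Exceptions raised by func propagate to the caller unchanged.
    On timeout the worker keeps running in the background, and the
    interpreter still waits for it to finish before exiting.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        name = getattr(func, "__name__", "call")
        raise OperationTimeout(f"{name} timed out after {timeout_s}s") from exc
    finally:
        # Do not block here waiting for a possibly hung worker.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_timeout(lambda x: x * 2, 1.0, 21))  # 42
    try:
        call_with_timeout(time.sleep, 0.1, 0.5)
    except OperationTimeout as err:
        print("timed out:", err)
```

Inside `upload_model_files`, the existing `upload_file(...)` call could be wrapped as `call_with_timeout(upload_file, 600, path_or_fileobj=..., path_in_repo=..., repo_id=..., token=...)`, with `OperationTimeout` handled by the same `except` branch that already logs the failed file.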

scripts/model_tonic/quantize_model.py CHANGED

@@ -13,6 +13,7 @@ from typing import Dict, Any, Optional, List, Union
 from datetime import datetime
 import subprocess
 import shutil
 try:
     import torch
@@ -100,14 +101,25 @@ class ModelQuantizer:
             return False
         # Check for essential model files
-        required_files = ['config.json', 'pytorch_model.bin']
         optional_files = ['tokenizer.json', 'tokenizer_config.json']
         missing_files = []
         for file in required_files:
             if not (self.model_path / file).exists():
                 missing_files.append(file)
         if missing_files:
             logger.error(f"❌ Missing required model files: {missing_files}")
             return False

 from datetime import datetime
 import subprocess
 import shutil
+import platform
 try:
     import torch
             return False
         # Check for essential model files
+        required_files = ['config.json']
         optional_files = ['tokenizer.json', 'tokenizer_config.json']
+        # Check for model files (either safetensors or pytorch)
+        model_files = [
+            "model.safetensors.index.json",  # Safetensors format
+            "pytorch_model.bin"  # PyTorch format
+        ]
         missing_files = []
         for file in required_files:
             if not (self.model_path / file).exists():
                 missing_files.append(file)
+        # Check if at least one model file exists
+        model_file_exists = any((self.model_path / file).exists() for file in model_files)
+        if not model_file_exists:
+            missing_files.extend(model_files)
         if missing_files:
             logger.error(f"❌ Missing required model files: {missing_files}")
             return False
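Both `validate_model_path` in `push_to_huggingface.py` and the check in `quantize_model.py` now accept either `model.safetensors.index.json` or `pytorch_model.bin`. A model small enough to be saved as a single `model.safetensors` file has no index, and a sharded PyTorch checkpoint uses `pytorch_model.bin.index.json`, so both layouts would still be rejected. Since the same check now lives in two files, a shared helper could cover all four layouts and also confirm that every shard named in an index is present. The sketch below assumes the default file names written by `transformers`' `save_pretrained`; `missing_weight_files` is a hypothetical helper, not part of either script.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Default weight file names used by transformers' save_pretrained (assumption: no custom naming).
SINGLE_FILE_WEIGHTS = ["model.safetensors", "pytorch_model.bin"]
SHARDED_INDEXES = ["model.safetensors.index.json", "pytorch_model.bin.index.json"]


def missing_weight_files(model_dir: Path) -> List[str]:
    """Return missing weight files; an empty list means the weights look complete."""
    for name in SINGLE_FILE_WEIGHTS:
        if (model_dir / name).is_file():
            return []
    for index_name in SHARDED_INDEXES:
        index_path = model_dir / index_name
        if index_path.is_file():
            # The index's weight_map maps each tensor name to the shard file that stores it.
            weight_map = json.loads(index_path.read_text(encoding="utf-8")).get("weight_map", {})
            shards = sorted(set(weight_map.values()))
            return [shard for shard in shards if not (model_dir / shard).is_file()]
    # No weights in any accepted layout: report every alternative that was tried.
    return SINGLE_FILE_WEIGHTS + SHARDED_INDEXES


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "model.safetensors.index.json").write_text(json.dumps(
            {"weight_map": {"a": "model-00001-of-00002.safetensors",
                            "b": "model-00002-of-00002.safetensors"}}))
        (d / "model-00001-of-00002.safetensors").touch()
        print(missing_weight_files(d))  # ['model-00002-of-00002.safetensors']
```

With a helper like this, the two-step `model_files` check in each validator could be replaced by `missing_files.extend(missing_weight_files(self.model_path))`.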

config_test_monitoring_auto_resolve_20250727_153310.json → test_data/config_test_monitoring_auto_resolve_20250727_153310.json RENAMED

File without changes

config_test_monitoring_auto_resolve_20250727_161709.json → test_data/config_test_monitoring_auto_resolve_20250727_161709.json RENAMED

File without changes

config_test_monitoring_integration_20250727_151307.json → test_data/config_test_monitoring_integration_20250727_151307.json RENAMED

File without changes

config_test_monitoring_integration_20250727_151403.json → test_data/config_test_monitoring_integration_20250727_151403.json RENAMED

File without changes

test_update_kwargs.py → tests/test_update_kwargs_1.py RENAMED

File without changes

verify_fix.py → tests/verify_fix_1.py RENAMED

File without changes