# AI SBOM Generator System Architecture

## Overview

The AI SBOM Generator is a configurable system that automatically generates Software Bill of Materials (SBOM) documents for AI models hosted on HuggingFace. The system uses a registry-driven architecture that allows for dynamic field configuration without code changes.

## System Architecture

### Core Components

```
┌────────────────────────────────────────────────────────┐
│                    AI SBOM Generator                   │
├────────────────────────────────────────────────────────┤
│  Web Interface (FastAPI + HTML Templates)              │
├────────────────────────────────────────────────────────┤
│  API Layer                                             │
│  ├── Generation Endpoints                              │
│  ├── Scoring Endpoints                                 │
│  └── Batch Processing                                  │
├────────────────────────────────────────────────────────┤
│  Core Generation Engine                                │
│  ├── AIBOMGenerator (generator.py)                     │
│  ├── Enhanced Extractor (enhanced_extractor.py)        │
│  └── Field Registry Manager (field_registry_manager.py)│
├────────────────────────────────────────────────────────┤
│  Configuration Layer                                   │
│  ├── Field Registry (field_registry.json)              │
│  ├── Scoring Configuration                             │
│  └── AIBOM Generation Rules                            │
├────────────────────────────────────────────────────────┤
│  Data Sources                                          │
│  ├── HuggingFace API                                   │
│  ├── Model Cards                                       │
│  ├── Configuration Files                               │
│  └── README Content                                    │
└────────────────────────────────────────────────────────┘
```

### Key Features

- **Registry-Driven Configuration**: All fields and scoring rules defined in JSON
- **Multi-Strategy Extraction**: Six extraction strategies, tried in order for each field
- **Standards Compliance**: CycloneDX 1.6 compatible output
- **Configurable Scoring**: Weighted scoring system with tier-based multipliers
- **Automatic Field Discovery**: New fields added to the registry are processed automatically
- **Comprehensive Logging**: Detailed extraction and scoring logs for debugging

## Process Workflow

### 1. System Initialization

```
System Initialization Process:

    ┌──────────────────┐
    │  System Startup  │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Load Field       │
    │ Registry         │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Initialize       │
    │ Registry Manager │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Load Scoring     │
    │ Configuration    │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Initialize       │
    │ Enhanced         │
    │ Extractor        │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │  System Ready    │
    └──────────────────┘
```

**Steps:**
1. **Load Field Registry**: Read `field_registry.json` containing all field definitions
2. **Initialize Registry Manager**: Create manager instance with loaded configuration
3. **Load Scoring Configuration**: Parse scoring weights, tiers, and category definitions
4. **Initialize Enhanced Extractor**: Create extractor with registry-driven field discovery
5. **System Ready**: All components initialized and ready for SBOM generation
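
The startup sequence above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `RegistryManager` class, the `initialize` function, and the top-level `fields` and `scoring_config` keys are assumptions about how `field_registry.json` might be organized.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RegistryManager:
    """Hypothetical stand-in for the manager in field_registry_manager.py."""

    registry: dict

    @property
    def fields(self) -> dict:
        # Assumed layout: field definitions under a top-level "fields" key.
        return self.registry.get("fields", {})

    @property
    def scoring(self) -> dict:
        # Assumed layout: weights and tiers under a "scoring_config" key.
        return self.registry.get("scoring_config", {})


def initialize(registry_path: str) -> RegistryManager:
    """Load the registry file and wrap it in a manager (steps 1-3)."""
    with Path(registry_path).open(encoding="utf-8") as fh:
        registry = json.load(fh)
    return RegistryManager(registry)
```

An extractor built on top of this would receive the manager and iterate over `manager.fields`, so it never needs a hard-coded field list.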

### 2. SBOM Generation Process

```
SBOM Generation Workflow:

    ┌──────────────────────────────────┐
    │ User Request                     │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Validate Model ID                │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Fetch Model Info                 │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Initialize Enhanced Extractor    │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Registry-Driven Extraction       │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Multi-Strategy Field Processing  │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Generate AIBOM Structure         │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Calculate Completeness Score     │
    └────────────────┬─────────────────┘
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Return SBOM + Score              │
    └──────────────────────────────────┘
```

#### 2.1 Model Information Gathering

**Input**: HuggingFace model ID (e.g., `deepseek-ai/DeepSeek-R1`)

**Process**:
1. **Validate Model ID**: Check format and accessibility
2. **Fetch Model Info**: Retrieve metadata from HuggingFace API
3. **Download Model Card**: Get structured model documentation
4. **Fetch Configuration Files**: Download `config.json`, `tokenizer_config.json`
5. **Extract README Content**: Parse model description and documentation
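
The format check in step 1 could look something like the sketch below. The regular expression assumes the `owner/name` form used in the example ID and is not taken from the project; real model IDs may follow other rules. The helper names are hypothetical, and `README.md` is listed because that is where the Hub stores the model card. The accessibility check and the actual downloads are left out.

```python
import re

# Assumed format: "owner/name"; each part starts with an alphanumeric character
# and continues with letters, digits, ".", "_" or "-".
_PART = r"[A-Za-z0-9][A-Za-z0-9._-]*"
MODEL_ID_RE = re.compile(rf"{_PART}/{_PART}")

# Configuration files named in step 4.
CONFIG_FILES = ("config.json", "tokenizer_config.json")


def validate_model_id(model_id: str) -> str:
    """Return the trimmed model ID, or raise ValueError if the format is wrong."""
    candidate = model_id.strip()
    if not MODEL_ID_RE.fullmatch(candidate):
        raise ValueError(f"Invalid model ID: {model_id!r}")
    return candidate


def files_to_fetch(model_id: str) -> list[str]:
    """List the repository files the gathering step would request."""
    validate_model_id(model_id)
    return ["README.md", *CONFIG_FILES]
```

Rejecting malformed IDs before any network call keeps bad input from reaching the HuggingFace API at all.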

#### 2.2 Registry-Driven Field Extraction

**For each of the 29 registry fields:**

```
Multi-Strategy Field Extraction:

Field from Registry
        │
        ▼
┌──────────────────┐     Success?
│ Strategy 1:      │────────┐
│ HuggingFace API  │        │
└──────────────────┘        │
        │                   │
        │ Failure           │
        ▼                   │
┌──────────────────┐        │
│ Strategy 2:      │        │
│ Model Card       │        │
└──────────────────┘        │
        │                   │
        │ Failure           │
        ▼                   │
┌──────────────────┐        │
│ Strategy 3:      │        │
│ Config Files     │        │
└──────────────────┘        │
        │                   │
        │ Failure           │
        ▼                   │
┌──────────────────┐        │
│ Strategy 4:      │        │
│ Text Patterns    │        │
└──────────────────┘        │
        │                   │
        │ Failure           │
        ▼                   │
┌──────────────────┐        │
│ Strategy 5:      │        │
│ Intelligent      │        │
│ Inference        │        │
└──────────────────┘        │
        │                   │
        │ Failure           │
        ▼                   │
┌──────────────────┐        │
│ Strategy 6:      │        │
│ Fallback Value   │        │
└──────────────────┘        │
        │                   │
        ▼                   │
┌──────────────────┐◀───────┘
│ Store Result &   │
│ Log Outcome      │
└──────────────────┘
```

**Extraction Strategies**:

1. **HuggingFace API Extraction**
   - Direct field mapping from API response
   - High confidence, structured data
   - Fields: `name`, `author`, `license`, `tags`, etc.

2. **Model Card Extraction**
   - Parse structured model card YAML/metadata
   - Medium-high confidence
   - Fields: `limitation`, `metrics`, `datasets`, etc.

3. **Configuration File Extraction**
   - Mine technical details from config files
   - High confidence for technical fields
   - Fields: `typeOfModel`, `hyperparameter`, etc.

4. **Text Pattern Extraction**
   - Regex-based extraction from README content
   - Medium confidence, requires validation
   - Fields: `safetyRiskAssessment`, `informationAboutTraining`, etc.

5. **Intelligent Inference**
   - Smart defaults based on model characteristics
   - Medium confidence, contextual
   - Fields: `primaryPurpose`, `domain`, etc.

6. **Fallback Values**
   - Placeholder values when no data available
   - Low/no confidence, maintains structure
   - Ensures complete SBOM structure
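
The six strategies form an ordered chain in which the first one to return a usable value wins. The sketch below shows that pattern under stated assumptions: the shape of the `data` dictionary, the strategy function names, the `field: value` README pattern, the `pipeline_tag` inference rule, and the `NOASSERTION` placeholder are all illustrative choices, not the project's implementation.

```python
import re
from typing import Any, Callable, Optional

# A strategy takes a field name and the gathered model data; None means "no value".
Strategy = Callable[[str, dict], Optional[Any]]


def from_api(field: str, data: dict) -> Optional[Any]:
    return data.get("api", {}).get(field)


def from_model_card(field: str, data: dict) -> Optional[Any]:
    return data.get("model_card", {}).get(field)


def from_config(field: str, data: dict) -> Optional[Any]:
    return data.get("config", {}).get(field)


def from_text(field: str, data: dict) -> Optional[Any]:
    # Looks for a "field: value" line in the README (illustrative pattern only).
    pattern = rf"^{re.escape(field)}[ \t]*:[ \t]*(.+)$"
    match = re.search(pattern, data.get("readme", ""), re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else None


def infer(field: str, data: dict) -> Optional[Any]:
    # Hypothetical rule: derive primaryPurpose from the API's pipeline tag.
    if field == "primaryPurpose":
        return data.get("api", {}).get("pipeline_tag")
    return None


def fallback(field: str, data: dict) -> Optional[Any]:
    return "NOASSERTION"  # placeholder value chosen for this sketch


STRATEGIES: list[tuple[str, Strategy]] = [
    ("huggingface_api", from_api),
    ("model_card", from_model_card),
    ("config_files", from_config),
    ("text_pattern", from_text),
    ("inference", infer),
    ("fallback", fallback),
]


def extract_field(field: str, data: dict) -> tuple[Any, str]:
    """Return (value, strategy_name) from the first strategy that yields a value."""
    for name, strategy in STRATEGIES:
        try:
            value = strategy(field, data)
        except Exception:
            continue  # a failing strategy falls through to the next one
        if value not in (None, "", [], {}):
            return value, name
    return None, "none"
```

Returning the strategy name alongside the value is one way to feed the confidence levels and data sources that the extraction report needs.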

#### 2.3 AIBOM Structure Generation

**Process**:
1. **Create Base Structure**: Initialize CycloneDX 1.6 compliant structure
2. **Populate Metadata Section**: Add extracted metadata fields
3. **Build Component Section**: Create model component with extracted data
4. **Add Model Card**: Include AI-specific model card information
5. **Generate External References**: Add distribution and repository links
6. **Create Dependencies**: Define model dependencies and relationships
7. **Validate Structure**: Ensure CycloneDX compliance

**Output Structure**:
```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "serialNumber": "urn:uuid:...",
  "version": 1,
  "metadata": {
    "timestamp": "...",
    "tools": [...],
    "component": {...},
    "properties": [...]
  },
  "components": [{
    "type": "machine-learning-model",
    "name": "...",
    "modelCard": {...},
    "properties": [...]
  }],
  "externalReferences": [...],
  "dependencies": [...]
}
```
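
As a rough sketch of step 1, the base structure can be built with the standard library alone. The function name `create_base_aibom` is hypothetical, and using a model-type entry for `metadata.component` is an assumption, since the outline above does not show what that object contains.

```python
import uuid
from datetime import datetime, timezone


def create_base_aibom(model_name: str) -> dict:
    """Build an empty skeleton matching the output structure shown above."""
    component = {
        "type": "machine-learning-model",
        "name": model_name,
        "modelCard": {},
        "properties": [],
    }
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        # A fresh random UUID gives each generated document a unique serial number.
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tools": [],
            # Assumed content: the source only shows this object as elided.
            "component": {"type": "machine-learning-model", "name": model_name},
            "properties": [],
        },
        "components": [component],
        "externalReferences": [],
        "dependencies": [],
    }
```

Later steps would fill `properties`, `modelCard`, `externalReferences`, and `dependencies` from the extracted fields before validation.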

### 3. Completeness Scoring Process

```
Completeness Scoring Process:

┌──────────────────┐
│ Extracted Fields │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Categorize       │
│ Fields           │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Apply Tier       │
│ Weights          │
│ • Critical: 3x   │
│ • Important: 2x  │
│ • Supplement: 1x │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Calculate        │
│ Category Scores  │
│ • Required: 20   │
│ • Metadata: 20   │
│ • Basic: 20      │
│ • ModelCard: 30  │
│ • ExtRefs: 10    │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Sum Weighted     │
│ Scores           │
│ (Max: 100)       │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Generate Score   │
│ Report           │
└──────────────────┘
```

**Scoring Algorithm**:

1. **Field Categorization**: Group fields by category (required_fields, metadata, etc.)
2. **Tier Weight Application**: Apply multipliers (Critical: 3x, Important: 2x, Supplementary: 1x)
3. **Category Score Calculation**: `(Fields Present / Total Fields) × Category Weight`
4. **Final Score**: Sum of all category scores (max 100)

**Category Weights**:
- Required Fields: 20 points
- Metadata: 20 points
- Component Basic: 20 points
- Component Model Card: 30 points
- External References: 10 points
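
One plausible reading of the algorithm is sketched below. The category formula above does not say how tier multipliers enter the ratio, so this sketch assumes each field counts with its tier multiplier in both the numerator and the denominator. The function name, the input shape, and the category keys other than `required_fields` and `metadata` are hypothetical.

```python
TIER_MULTIPLIER = {"critical": 3, "important": 2, "supplementary": 1}

CATEGORY_WEIGHT = {
    "required_fields": 20,
    "metadata": 20,
    "component_basic": 20,
    "component_model_card": 30,
    "external_references": 10,
}


def completeness_score(fields: dict, present: set) -> tuple[float, dict]:
    """Score a model.

    fields:  {field_name: (category, tier)} for every registry field
    present: names of the fields that were actually extracted
    """
    possible = {category: 0 for category in CATEGORY_WEIGHT}
    earned = {category: 0 for category in CATEGORY_WEIGHT}
    for name, (category, tier) in fields.items():
        multiplier = TIER_MULTIPLIER[tier]
        possible[category] += multiplier
        if name in present:
            earned[category] += multiplier

    # Tier-weighted share of each category, scaled to that category's points.
    breakdown = {
        category: (earned[category] / possible[category] * weight)
        if possible[category]
        else 0.0
        for category, weight in CATEGORY_WEIGHT.items()
    }
    return round(sum(breakdown.values()), 1), breakdown
```

Because the category weights sum to 100, a model with every field present scores 100 as long as each category has at least one registry field.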

### 4. Output Generation

**Generated Artifacts**:
1. **AIBOM JSON**: CycloneDX 1.6 compliant SBOM document
2. **Completeness Score**: Numerical score (0-100) with breakdown
3. **Field Checklist**: Detailed field-by-field analysis
4. **Extraction Report**: Confidence levels and data sources
5. **Validation Results**: Compliance and quality checks

## Configuration Management

### Field Registry Structure

The system is driven by `field_registry.json`, which defines:

- **Field Definitions**: All 29 extractable fields
- **Scoring Configuration**: Weights, tiers, and categories
- **AIBOM Generation Rules**: Structure and validation rules
- **Extraction Strategies**: How each field should be extracted

### Dynamic Configuration

**Adding New Fields**:
1. Add field definition to `field_registry.json`
2. System automatically discovers and attempts extraction
3. No code changes required

**Updating Scoring**:
1. Modify weights in registry configuration
2. Changes take effect immediately
3. Consistent scoring across all models
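
The idea behind automatic discovery can be shown with a small sketch: if the extractor derives its work list from the registry, a new entry is picked up with no other change. The registry layout (a top-level `fields` mapping), the helper names, and the example field `energyConsumption` are assumptions made for illustration only.

```python
import copy


def add_field(registry: dict, name: str, definition: dict) -> dict:
    """Return a copy of the registry with one more field definition."""
    updated = copy.deepcopy(registry)
    updated.setdefault("fields", {})[name] = definition
    return updated


def discover_fields(registry: dict) -> list[str]:
    """List the field names an extractor would iterate over."""
    return sorted(registry.get("fields", {}))


registry = {"fields": {"license": {"tier": "critical", "category": "component_basic"}}}
updated = add_field(
    registry,
    "energyConsumption",  # hypothetical new field
    {"tier": "supplementary", "category": "component_model_card"},
)
print(discover_fields(updated))
```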

## Quality Assurance

### Validation Layers

1. **Input Validation**: Model ID format and accessibility
2. **Extraction Validation**: Data type and format checking
3. **Structure Validation**: CycloneDX schema compliance
4. **Scoring Validation**: Mathematical correctness
5. **Output Validation**: JSON schema and completeness

### Error Handling

- **Individual Field Failures**: Don't stop overall processing
- **Graceful Degradation**: Fallback to lower-confidence strategies
- **Comprehensive Logging**: Detailed error tracking and debugging
- **Recovery Mechanisms**: Automatic retry and alternative approaches
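
The first and last points can be sketched as two small helpers: one that isolates per-field failures and one that retries a flaky call. The function names, the logger name, and the retry defaults are hypothetical, not the project's actual values.

```python
import logging
import time
from typing import Any, Callable, Iterable

logger = logging.getLogger("aibom.extraction")  # hypothetical logger name


def with_retry(fn: Callable[[], Any], attempts: int = 3, delay: float = 0.5) -> Any:
    """Call fn, retrying on any exception; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)


def extract_all(
    field_names: Iterable[str], extract_one: Callable[[str], Any]
) -> tuple[dict, dict]:
    """Extract every field; a failure is logged and recorded, not propagated."""
    results: dict = {}
    errors: dict = {}
    for name in field_names:
        try:
            results[name] = extract_one(name)
        except Exception as exc:
            logger.error("field %s failed: %s", name, exc)
            errors[name] = str(exc)
            results[name] = None
    return results, errors
```

Keeping failed fields in `results` with a `None` value means downstream scoring still sees every registry field, only marked as absent.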

## Performance Characteristics

### Typical Processing Times

- **Single Model**: 2-5 seconds
- **Batch Processing**: 10-50 models/minute
- **Registry Loading**: <1 second
- **Field Extraction**: 1-3 seconds per model

### Scalability Features

- **Concurrent Processing**: Multiple models processed simultaneously
- **Caching**: Model metadata and configuration caching
- **Rate Limiting**: Respectful API usage
- **Resource Management**: Memory and connection pooling
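
Concurrent processing and caching could be combined roughly as follows. This sketch uses a thread pool because the work is dominated by network calls; the function names, the worker count, and the cache size are assumptions, and `generate_one` stands in for whatever produces a single model's SBOM.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Any, Callable, Iterable


def generate_batch(
    model_ids: Iterable[str],
    generate_one: Callable[[str], Any],
    max_workers: int = 4,
) -> dict:
    """Run generate_one for each model ID on a small thread pool."""
    ids = list(model_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        return dict(zip(ids, pool.map(generate_one, ids)))


def make_cached(fetch: Callable[[str], Any], maxsize: int = 256) -> Callable[[str], Any]:
    """Wrap a metadata fetcher so repeated model IDs are served from memory."""
    return lru_cache(maxsize=maxsize)(fetch)
```

Capping `max_workers` also acts as a crude form of rate limiting, since it bounds the number of simultaneous requests.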

## Integration Points

### APIs

- **Generation API**: `/api/generate` - Generates an AI SBOM for a single model and returns a download URL
- **Generation with Completeness Score Report API**: `/api/generate-with-report` - Generates an AI SBOM together with a completeness score report
- **Completeness Score Report Only API**: `/api/models/{model_id}/score` - Returns the completeness score for a model without generating an AI SBOM

### Data Sources

- **HuggingFace Hub**: Primary model metadata source
- **Model Repositories**: Direct file access for configurations
- **Model Cards**: Structured documentation parsing

### Output Formats

- **CycloneDX JSON**: Primary SBOM format
- **Field Reports**: Human-readable analysis
- **CSV Exports**: Batch processing results
- **API Responses**: Structured JSON for integration

This architecture provides a robust, configurable, and standards-compliant solution for AI model SBOM generation with comprehensive field extraction and scoring capabilities.