AI SBOM Generator System Architecture

Overview

The AI SBOM Generator is a configurable system that automatically generates Software Bill of Materials (SBOM) documents for AI models hosted on HuggingFace. The system uses a registry-driven architecture that allows for dynamic field configuration without code changes.

System Architecture

Core Components

┌────────────────────────────────────────────────────────┐
│                    AI SBOM Generator                   │
├────────────────────────────────────────────────────────┤
│  Web Interface (FastAPI + HTML Templates)              │
├────────────────────────────────────────────────────────┤
│  API Layer                                             │
│  ├── Generation Endpoints                              │
│  ├── Scoring Endpoints                                 │
│  └── Batch Processing                                  │
├────────────────────────────────────────────────────────┤
│  Core Generation Engine                                │
│  ├── AIBOMGenerator (generator.py)                     │
│  ├── Enhanced Extractor (enhanced_extractor.py)        │
│  └── Field Registry Manager (field_registry_manager.py)│
├────────────────────────────────────────────────────────┤
│  Configuration Layer                                   │
│  ├── Field Registry (field_registry.json)              │
│  ├── Scoring Configuration                             │
│  └── AIBOM Generation Rules                            │
├────────────────────────────────────────────────────────┤
│  Data Sources                                          │
│  ├── HuggingFace API                                   │
│  ├── Model Cards                                       │
│  ├── Configuration Files                               │
│  └── README Content                                    │
└────────────────────────────────────────────────────────┘

Key Features

  • Registry-Driven Configuration: All fields and scoring rules defined in JSON
  • Multi-Strategy Extraction: Up to six extraction strategies tried in order for each field
  • Standards Compliance: CycloneDX 1.6 compatible output
  • Configurable Scoring: Weighted scoring system with tier-based multipliers
  • Automatic Field Discovery: New fields added to registry are automatically processed
  • Comprehensive Logging: Detailed extraction and scoring logs for debugging

Process Workflow

1. System Initialization

System Initialization Process:

    ┌──────────────────┐
    │  System Startup  │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Load Field       │
    │ Registry         │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Initialize       │
    │ Registry Manager │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Load Scoring     │
    │ Configuration    │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │ Initialize       │
    │ Enhanced         │
    │ Extractor        │
    └─────────┬────────┘
              │
              ▼
    ┌──────────────────┐
    │  System Ready    │
    └──────────────────┘

Steps:

  1. Load Field Registry: Read field_registry.json containing all field definitions
  2. Initialize Registry Manager: Create manager instance with loaded configuration
  3. Load Scoring Configuration: Parse scoring weights, tiers, and category definitions
  4. Initialize Enhanced Extractor: Create extractor with registry-driven field discovery
  5. System Ready: All components initialized and ready for SBOM generation
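
The startup sequence above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the class names RegistryManager and Extractor, the initialize function, and the JSON keys "fields" and "scoring_config" are all assumptions made for the example, and the registry is supplied inline instead of being read from field_registry.json.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RegistryManager:
    """Hypothetical stand-in for the manager in field_registry_manager.py."""
    fields: dict = field(default_factory=dict)
    scoring: dict = field(default_factory=dict)

    @classmethod
    def from_json(cls, text: str) -> "RegistryManager":
        # Steps 1-3: parse the registry, then separate field definitions
        # from the scoring configuration. Both key names are assumed.
        raw = json.loads(text)
        return cls(fields=raw.get("fields", {}),
                   scoring=raw.get("scoring_config", {}))


@dataclass
class Extractor:
    """Hypothetical extractor that discovers its fields from the registry."""
    manager: RegistryManager

    def field_names(self) -> list:
        return sorted(self.manager.fields)


def initialize(registry_text: str) -> Extractor:
    manager = RegistryManager.from_json(registry_text)  # steps 1-3
    return Extractor(manager)                           # step 4


# Inline sample registry so the sketch runs without a file on disk.
SAMPLE_REGISTRY = json.dumps({
    "fields": {"license": {"tier": "critical"}, "name": {"tier": "critical"}},
    "scoring_config": {
        "tier_weights": {"critical": 3, "important": 2, "supplementary": 1}
    },
})

extractor = initialize(SAMPLE_REGISTRY)  # step 5: ready for generation
print(extractor.field_names())
```

Because the extractor only reads field names from the manager, nothing in it needs to change when the registry grows.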

2. SBOM Generation Process

SBOM Generation Workflow:

User Request ──┐
               │
               ▼
    ┌─────────────────┐      ┌───────────────────┐    ┌─────────────────┐
    │ Validate Model  │─────▶│ Fetch Model Info  │───▶│ Initialize      │
    │ ID              │      │                   │    │ Enhanced        │
    └─────────────────┘      └───────────────────┘    │ Extractor       │
                                                      └────────┬────────┘
                                                               │
    ┌─────────────────┐    ┌─────────────────┐                 │
    │ Return SBOM +   │◀───│ Calculate       │◀────────────────┘
    │ Score           │    │ Completeness    │
    └─────────────────┘    │ Score           │
                           └─────────────────┘
                                    ▲
                                    │
                          ┌───────────────────┐
                          │ Generate AIBOM    │
                          │ Structure         │
                          └───────────────────┘
                                    ▲
                                    │
                          ┌───────────────────┐
                          │ Multi-Strategy    │
                          │ Field Processing  │
                          └───────────────────┘
                                    ▲
                                    │
                          ┌───────────────────┐
                          │ Registry-Driven   │
                          │ Extraction        │
                          └───────────────────┘

2.1 Model Information Gathering

Input: HuggingFace model ID (e.g., deepseek-ai/DeepSeek-R1)

Process:

  1. Validate Model ID: Check format and accessibility
  2. Fetch Model Info: Retrieve metadata from HuggingFace API
  3. Download Model Card: Get structured model documentation
  4. Fetch Configuration Files: Download config.json, tokenizer_config.json
  5. Extract README Content: Parse model description and documentation
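
As an illustration of the format half of step 1, the sketch below checks a model ID against a regular expression. The pattern is an assumption for this example (an optional owner prefix followed by a repository name); the Hub's actual naming rules may be stricter. The accessibility check needs a network call and is left out.

```python
import re

# Assumed format: optional "owner/" prefix plus a repository name made of
# ASCII letters, digits, '-', '_' and '.'. Not the Hub's official rule set.
MODEL_ID_RE = re.compile(
    r"(?:[A-Za-z0-9][\w.-]*/)?[A-Za-z0-9][\w.-]*",
    re.ASCII,
)


def is_valid_model_id(model_id: str) -> bool:
    """Return True when model_id looks like 'owner/name' or 'name'."""
    return bool(MODEL_ID_RE.fullmatch(model_id.strip()))


print(is_valid_model_id("deepseek-ai/DeepSeek-R1"))
print(is_valid_model_id("not a model id"))
```

Rejecting malformed IDs before any request is made keeps obviously bad input away from the HuggingFace API.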

2.2 Registry-Driven Field Extraction

For each of the 29 registry fields:

Multi-Strategy Field Extraction:

Field from Registry
         │
         ▼
┌──────────────────┐     Success?
│ Strategy 1:      │────────┐
│ HuggingFace API  │        │
└──────────────────┘        │
         │                  │
         │ Failure          │
         ▼                  │
┌──────────────────┐        │
│ Strategy 2:      │        │
│ Model Card       │        │
└──────────────────┘        │
         │                  │
         │ Failure          │
         ▼                  │
┌──────────────────┐        │
│ Strategy 3:      │        │
│ Config Files     │        │
└──────────────────┘        │
         │                  │
         │ Failure          │
         ▼                  │
┌──────────────────┐        │
│ Strategy 4:      │        │
│ Text Patterns    │        │
└──────────────────┘        │
         │                  │
         │ Failure          │
         ▼                  │
┌──────────────────┐        │
│ Strategy 5:      │        │
│ Intelligent      │        │
│ Inference        │        │
└──────────────────┘        │
         │                  │
         │ Failure          │
         ▼                  │
┌──────────────────┐        │
│ Strategy 6:      │        │
│ Fallback Value   │        │
└──────────────────┘        │
         │                  │
         ▼                  │
┌──────────────────┐◀───────┘
│ Store Result &   │
│ Log Outcome      │
└──────────────────┘

Extraction Strategies:

  1. HuggingFace API Extraction

    • Direct field mapping from API response
    • High confidence, structured data
    • Fields: name, author, license, tags, etc.
  2. Model Card Extraction

    • Parse structured model card YAML/metadata
    • Medium-high confidence
    • Fields: limitation, metrics, datasets, etc.
  3. Configuration File Extraction

    • Mine technical details from config files
    • High confidence for technical fields
    • Fields: typeOfModel, hyperparameter, etc.
  4. Text Pattern Extraction

    • Regex-based extraction from README content
    • Medium confidence, requires validation
    • Fields: safetyRiskAssessment, informationAboutTraining, etc.
  5. Intelligent Inference

    • Smart defaults based on model characteristics
    • Medium confidence, contextual
    • Fields: primaryPurpose, domain, etc.
  6. Fallback Values

    • Placeholder values when no data available
    • Low/no confidence, maintains structure
    • Ensures complete SBOM structure
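
The fallback chain described above can be sketched as an ordered list of functions that is walked until one returns a usable value. Everything here is illustrative: the function names, the shape of the ctx dictionary, the "field: value" README pattern, and the "NOASSERTION" placeholder are assumptions. Intelligent inference (strategy 5) is omitted for brevity and would be one more entry in the same list.

```python
import re
from typing import Any, Optional


def from_api(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("api", {}).get(name)


def from_model_card(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("model_card", {}).get(name)


def from_config(name: str, ctx: dict) -> Optional[Any]:
    return ctx.get("config", {}).get(name)


def from_text(name: str, ctx: dict) -> Optional[Any]:
    # Simplified text pattern: a README line of the form "<field>: <value>".
    pattern = rf"^{re.escape(name)}\s*:\s*(.+)$"
    match = re.search(pattern, ctx.get("readme", ""),
                      re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else None


def fallback(name: str, ctx: dict) -> Any:
    return "NOASSERTION"  # hypothetical placeholder value


# Ordered from highest to lowest confidence.
STRATEGIES = [
    ("huggingface_api", from_api),
    ("model_card", from_model_card),
    ("config_files", from_config),
    ("text_patterns", from_text),
    ("fallback", fallback),
]


def extract_field(name: str, ctx: dict) -> dict:
    """Try each strategy in order and record which one produced the value."""
    for label, strategy in STRATEGIES:
        try:
            value = strategy(name, ctx)
        except Exception:
            continue  # a failing strategy must not stop the chain
        if value not in (None, "", []):
            return {"value": value, "source": label}
    return {"value": None, "source": None}


ctx = {
    "api": {"license": "mit"},
    "model_card": {"datasets": ["c4"]},
    "readme": "# Model\nDomain: NLP\n",
}
for field_name in ("license", "datasets", "domain", "energyConsumption"):
    print(field_name, extract_field(field_name, ctx))
```

Recording the source label alongside the value is what makes a per-field confidence report possible later.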

2.3 AIBOM Structure Generation

Process:

  1. Create Base Structure: Initialize CycloneDX 1.6 compliant structure
  2. Populate Metadata Section: Add extracted metadata fields
  3. Build Component Section: Create model component with extracted data
  4. Add Model Card: Include AI-specific model card information
  5. Generate External References: Add distribution and repository links
  6. Create Dependencies: Define model dependencies and relationships
  7. Validate Structure: Ensure CycloneDX compliance

Output Structure:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "serialNumber": "urn:uuid:...",
  "version": 1,
  "metadata": {
    "timestamp": "...",
    "tools": [...],
    "component": {...},
    "properties": [...]
  },
  "components": [{
    "type": "machine-learning-model",
    "name": "...",
    "modelCard": {...},
    "properties": [...]
  }],
  "externalReferences": [...],
  "dependencies": [...]
}
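
A skeleton with the top-level keys shown above can be built with the standard library alone. This is a sketch under stated assumptions: base_aibom is a hypothetical helper, the metadata component simply mirrors the model, and the empty lists stand in for the values the extraction step would supply. It does not validate against the CycloneDX schema.

```python
import json
import uuid
from datetime import datetime, timezone


def base_aibom(model_name: str) -> dict:
    """Return an unpopulated AIBOM skeleton for one model (illustrative)."""
    component = {"type": "machine-learning-model", "name": model_name}
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        # Serial numbers take the form urn:uuid:<random UUID>.
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tools": [],
            "component": dict(component),
            "properties": [],
        },
        "components": [{**component, "modelCard": {}, "properties": []}],
        "externalReferences": [],
        "dependencies": [],
    }


bom = base_aibom("deepseek-ai/DeepSeek-R1")
print(json.dumps(bom, indent=2)[:200])
```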

3. Completeness Scoring Process

Completeness Scoring Process:

┌──────────────────┐
│ Extracted Fields │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Categorize       │
│ Fields           │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Apply Tier       │
│ Weights          │
│ • Critical: 3x   │
│ • Important: 2x  │
│ • Supplement: 1x │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Calculate        │
│ Category Scores  │
│ • Required: 20   │
│ • Metadata: 20   │
│ • Basic: 20      │
│ • ModelCard: 30  │
│ • ExtRefs: 10    │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Sum Weighted     │
│ Scores           │
│ (Max: 100)       │
└─────────┬────────┘
          │
          ▼
┌──────────────────┐
│ Generate Score   │
│ Report           │
└──────────────────┘

Scoring Algorithm:

  1. Field Categorization: Group fields by category (required_fields, metadata, etc.)
  2. Tier Weight Application: Apply multipliers (Critical: 3x, Important: 2x, Supplementary: 1x)
  3. Category Score Calculation: (Fields Present / Total Fields) × Category Weight
  4. Final Score: Sum of all category scores (max 100)

Category Weights:

  • Required Fields: 20 points
  • Metadata: 20 points
  • Component Basic: 20 points
  • Component Model Card: 30 points
  • External References: 10 points
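
A worked example of the category formula, (Fields Present / Total Fields) × Category Weight, using the five weights listed above. Only required_fields and metadata appear as category keys in this document; the other key names and the per-category field counts are made up for the example. Tier multipliers are left out because the text does not say how they combine with the category formula.

```python
# Category weights from the list above (they sum to 100).
CATEGORY_WEIGHTS = {
    "required_fields": 20,
    "metadata": 20,
    "component_basic": 20,
    "component_model_card": 30,
    "external_references": 10,
}


def completeness_score(present: dict, total: dict):
    """Return (overall score, per-category breakdown)."""
    breakdown = {}
    for category, weight in CATEGORY_WEIGHTS.items():
        n_total = total.get(category, 0)
        # Never count more present fields than the category defines.
        n_present = min(present.get(category, 0), n_total)
        breakdown[category] = weight * n_present / n_total if n_total else 0.0
    return round(sum(breakdown.values()), 1), breakdown


# Hypothetical split of 29 fields across the five categories.
total = {"required_fields": 4, "metadata": 5, "component_basic": 4,
         "component_model_card": 12, "external_references": 4}
present = {"required_fields": 4, "metadata": 3, "component_basic": 4,
           "component_model_card": 6, "external_references": 1}

score, breakdown = completeness_score(present, total)
print(score)      # 20 + 12 + 20 + 15 + 2.5 = 69.5
print(breakdown)
```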

4. Output Generation

Generated Artifacts:

  1. AIBOM JSON: CycloneDX 1.6 compliant SBOM document
  2. Completeness Score: Numerical score (0-100) with breakdown
  3. Field Checklist: Detailed field-by-field analysis
  4. Extraction Report: Confidence levels and data sources
  5. Validation Results: Compliance and quality checks

Configuration Management

Field Registry Structure

The system is driven by field_registry.json which defines:

  • Field Definitions: All 29 extractable fields
  • Scoring Configuration: Weights, tiers, and categories
  • AIBOM Generation Rules: Structure and validation rules
  • Extraction Strategies: How each field should be extracted

Dynamic Configuration

Adding New Fields:

  1. Add field definition to field_registry.json
  2. System automatically discovers and attempts extraction
  3. No code changes required

Updating Scoring:

  1. Modify weights in registry configuration
  2. Changes take effect immediately
  3. Consistent scoring across all models
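
One way to see why no code change is needed: if the extraction loop takes its field list from the registry contents, a new registry entry is picked up on the next run. The sketch below assumes a top-level "fields" key and uses energyConsumption purely as an example field name.

```python
import json


def fields_to_extract(registry: dict) -> list:
    """The work list is read from the registry, not hard-coded."""
    return list(registry.get("fields", {}))


registry = json.loads('{"fields": {"name": {}, "license": {}}}')
before = fields_to_extract(registry)

# Simulate editing field_registry.json: add one field definition.
registry["fields"]["energyConsumption"] = {"tier": "supplementary"}
after = fields_to_extract(registry)

print(before)
print(after)
```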

Quality Assurance

Validation Layers

  1. Input Validation: Model ID format and accessibility
  2. Extraction Validation: Data type and format checking
  3. Structure Validation: CycloneDX schema compliance
  4. Scoring Validation: Mathematical correctness
  5. Output Validation: JSON schema and completeness

Error Handling

  • Individual Field Failures: Don't stop overall processing
  • Graceful Degradation: Fallback to lower-confidence strategies
  • Comprehensive Logging: Detailed error tracking and debugging
  • Recovery Mechanisms: Automatic retry and alternative approaches
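
The first and last bullets can be illustrated together: each field is extracted inside its own try block, with a small retry loop for transient errors. The names with_retry and extract_all, the retry count, and the delay are assumptions for this sketch, not the project's actual values.

```python
import logging
import time

logger = logging.getLogger("aibom.sketch")


def with_retry(fn, attempts: int = 3, delay: float = 0.0):
    """Call fn until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


def extract_all(extractors: dict) -> dict:
    """Run every field extractor; one failure never aborts the others."""
    results, errors = {}, {}
    for name, fn in extractors.items():
        try:
            results[name] = with_retry(fn)
        except Exception as exc:
            errors[name] = str(exc)
            results[name] = None  # keeps the field present in the output
    return {"fields": results, "errors": errors}


calls = {"n": 0}


def flaky_license():
    # Fails on the first call, succeeds on the second.
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient")
    return "mit"


def broken_domain():
    raise ValueError("no data")


out = extract_all({"license": flaky_license, "domain": broken_domain})
print(out)
```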

Performance Characteristics

Typical Processing Times

  • Single Model: 2-5 seconds
  • Batch Processing: 10-50 models/minute
  • Registry Loading: <1 second
  • Field Extraction: 1-3 seconds per model

Scalability Features

  • Concurrent Processing: Multiple models processed simultaneously
  • Caching: Model metadata and configuration caching
  • Rate Limiting: Respectful API usage
  • Resource Management: Memory and connection pooling
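
Concurrent processing and caching can be combined with the standard library, as in the sketch below. The fetch_metadata function is a placeholder for a real network call, and the worker count is an arbitrary choice that also acts as a crude cap on parallel requests.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=256)
def fetch_metadata(model_id: str) -> dict:
    # Placeholder for a network request; cached per model ID.
    return {"id": model_id, "name": model_id.split("/")[-1]}


def process_batch(model_ids, max_workers: int = 4) -> list:
    """Process several models at once; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_metadata, model_ids))


results = process_batch(["a/x", "b/y", "a/x"])
print([r["name"] for r in results])
print(fetch_metadata.cache_info())
```

Threads suit this workload because most of the time is spent waiting on network responses.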

Integration Points

APIs

  • Generation API: /api/generate - Generates an AI SBOM for a single model and returns a download URL
  • Generation with Completeness Score Report API: /api/generate-with-report - Same as the Generation API, plus a completeness score report
  • Completeness Score Report Only API: /api/models/{model_id}/score - Returns the completeness score for a model without generating an AI SBOM
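
For client code, the score endpoint path can be assembled as below. The base URL is hypothetical, and keeping the slash in the model ID unescaped is an assumption about how the server routes the path. Request and response formats are not specified here, so the sketch stops at building the URL.

```python
from urllib.parse import quote, urljoin


def score_url(base_url: str, model_id: str) -> str:
    """Build the URL for /api/models/{model_id}/score."""
    # Assumption: the owner/name slash stays literal in the path.
    path = f"api/models/{quote(model_id, safe='/')}/score"
    return urljoin(base_url.rstrip("/") + "/", path)


print(score_url("http://localhost:8000", "deepseek-ai/DeepSeek-R1"))
```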

Data Sources

  • HuggingFace Hub: Primary model metadata source
  • Model Repositories: Direct file access for configurations
  • Model Cards: Structured documentation parsing

Output Formats

  • CycloneDX JSON: Primary SBOM format
  • Field Reports: Human-readable analysis
  • CSV Exports: Batch processing results
  • API Responses: Structured JSON for integration

This architecture provides a robust, configurable, and standards-compliant solution for AI model SBOM generation with comprehensive field extraction and scoring capabilities.