AI SBOM Generator System Architecture

Overview

The AI SBOM Generator is a configurable system that automatically generates Software Bill of Materials (SBOM) documents for AI models hosted on HuggingFace. Its registry-driven architecture lets fields be added or reconfigured in JSON without code changes.

System Architecture

Core Components

┌─────────────────────────────────────────────────────────────┐
│                    AI SBOM Generator                   │
├─────────────────────────────────────────────────────────────┤
│  Web Interface (FastAPI + HTML Templates)              │
├─────────────────────────────────────────────────────────────┤
│  API Layer                                             │
│  ├── Generation Endpoints                              │
│  ├── Scoring Endpoints                                 │
│  └── Batch Processing                                  │
├─────────────────────────────────────────────────────────────┤
│  Core Generation Engine                                │
│  ├── AIBOMGenerator (generator.py)                     │
│  ├── Enhanced Extractor (enhanced_extractor.py)        │
│  └── Field Registry Manager (field_registry_manager.py)│
├─────────────────────────────────────────────────────────────┤
│  Configuration Layer                                   │
│  ├── Field Registry (field_registry.json)              │
│  ├── Scoring Configuration                             │
│  └── AIBOM Generation Rules                            │
├─────────────────────────────────────────────────────────────┤
│  Data Sources                                          │
│  ├── HuggingFace API                                   │
│  ├── Model Cards                                       │
│  ├── Configuration Files                               │
│  └── README Content                                    │
└─────────────────────────────────────────────────────────────┘

Key Features

Registry-Driven Configuration: All fields and scoring rules defined in JSON
Multi-Strategy Extraction: Six extraction strategies, tried in order for each field until one succeeds
Standards Compliance: CycloneDX 1.6 compatible output
Configurable Scoring: Weighted scoring system with tier-based multipliers
Automatic Field Discovery: New fields added to registry are automatically processed
Comprehensive Logging: Detailed extraction and scoring logs for debugging

Process Workflow

1. System Initialization

System Initialization Process:

    ┌───────────────────┐
    │  System Startup  │
    └─────────┬─────────┘
              │
              ▼
    ┌───────────────────┐
    │ Load Field       │
    │ Registry         │
    └─────────┬─────────┘
              │
              ▼
    ┌───────────────────┐
    │ Initialize       │
    │ Registry Manager │
    └─────────┬─────────┘
              │
              ▼
    ┌─────────────────┐
    │ Load Scoring   │
    │ Configuration  │
    └─────────┬───────┘
              │
              ▼
    ┌─────────────────┐
    │ Initialize     │
    │ Enhanced       │
    │ Extractor      │
    └─────────┬───────┘
              │
              ▼
    ┌─────────────────┐
    │  System Ready  │
    └─────────────────┘

Steps:

Load Field Registry: Read field_registry.json containing all field definitions
Initialize Registry Manager: Create manager instance with loaded configuration
Load Scoring Configuration: Parse scoring weights, tiers, and category definitions
Initialize Enhanced Extractor: Create extractor with registry-driven field discovery
System Ready: All components initialized and ready for SBOM generation
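
The startup sequence above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class bodies, the `initialize` helper, and the top-level registry keys `fields` and `scoring_config` are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class RegistryManager:
    """Hypothetical stand-in for the manager in field_registry_manager.py."""
    registry: dict[str, Any]

    @classmethod
    def from_json(cls, text: str) -> "RegistryManager":
        return cls(json.loads(text))

    @classmethod
    def from_file(cls, path: Path) -> "RegistryManager":
        # In the real system this would point at field_registry.json.
        return cls.from_json(path.read_text(encoding="utf-8"))

    def field_names(self) -> list[str]:
        # "fields" is an assumed key; the real registry layout may differ.
        return sorted(self.registry.get("fields", {}))

    def scoring_config(self) -> dict[str, Any]:
        # "scoring_config" is likewise an assumed key.
        return self.registry.get("scoring_config", {})


@dataclass
class EnhancedExtractor:
    """Hypothetical extractor that discovers its fields from the registry."""
    manager: RegistryManager

    def fields_to_extract(self) -> list[str]:
        return self.manager.field_names()


def initialize(registry_text: str) -> EnhancedExtractor:
    manager = RegistryManager.from_json(registry_text)  # load registry + create manager
    manager.scoring_config()                            # parse scoring configuration
    return EnhancedExtractor(manager)                   # extractor is ready to use
```

Because the extractor asks the manager for its field list instead of hard-coding it, a field added to the registry shows up in `fields_to_extract()` on the next load.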

2. SBOM Generation Process

SBOM Generation Workflow:

              User Request
                │
                ▼
    ┌───────────────────────┐
    │ Validate Model ID     │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Fetch Model Info      │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Initialize Enhanced   │
    │ Extractor             │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Registry-Driven       │
    │ Extraction            │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Multi-Strategy Field  │
    │ Processing            │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Generate AIBOM        │
    │ Structure             │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Calculate Completeness│
    │ Score                 │
    └───────────┬───────────┘
                │
                ▼
    ┌───────────────────────┐
    │ Return SBOM + Score   │
    └───────────────────────┘

2.1 Model Information Gathering

Input: HuggingFace model ID (e.g., deepseek-ai/DeepSeek-R1)

Process:

Validate Model ID: Check format and accessibility
Fetch Model Info: Retrieve metadata from HuggingFace API
Download Model Card: Get structured model documentation
Fetch Configuration Files: Download config.json, tokenizer_config.json
Extract README Content: Parse model description and documentation

2.2 Registry-Driven Field Extraction

For each of the 29 registry fields:

Multi-Strategy Field Extraction:

Field from Registry
        │
        ▼
┌───────────────────┐     Success?
│ Strategy 1:      │────────┐
│ HuggingFace API  │        │
└───────────────────┘        │
        │                  │
        │ Failure          │
        ▼                  │
┌───────────────────┐        │
│ Strategy 2:      │        │
│ Model Card       │        │
└───────────────────┘        │
        │                  │
        │ Failure          │
        ▼                  │
┌───────────────────┐        │
│ Strategy 3:      │        │
│ Config Files     │        │
└───────────────────┘        │
        │                  │
        │ Failure          │
        ▼                  │
┌───────────────────┐        │
│ Strategy 4:      │        │
│ Text Patterns    │        │
└───────────────────┘        │
        │                  │
        │ Failure          │
        ▼                  │
┌───────────────────┐        │
│ Strategy 5:      │        │
│ Intelligent      │        │
│ Inference        │        │
└───────────────────┘        │
        │                  │
        │ Failure          │
        ▼                  │
┌───────────────────┐        │
│ Strategy 6:      │        │
│ Fallback Value   │        │
└───────────────────┘        │
        │                  │
        ▼                  │
┌───────────────────┐◀───────┘
│ Store Result &   │
│ Log Outcome      │
└───────────────────┘

Extraction Strategies:

HuggingFace API Extraction
- Direct field mapping from API response
- High confidence, structured data
- Fields: name, author, license, tags, etc.
Model Card Extraction
- Parse structured model card YAML/metadata
- Medium-high confidence
- Fields: limitation, metrics, datasets, etc.
Configuration File Extraction
- Mine technical details from config files
- High confidence for technical fields
- Fields: typeOfModel, hyperparameter, etc.
Text Pattern Extraction
- Regex-based extraction from README content
- Medium confidence, requires validation
- Fields: safetyRiskAssessment, informationAboutTraining, etc.
Intelligent Inference
- Smart defaults based on model characteristics
- Medium confidence, contextual
- Fields: primaryPurpose, domain, etc.
Fallback Values
- Placeholder values when no data available
- Low/no confidence, maintains structure
- Ensures complete SBOM structure
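
The fallback chain can be pictured as an ordered list of strategy functions, tried until one returns a usable value. The sketch below is an assumption-laden illustration: the context layout, the strategy function names, and the `NOASSERTION` placeholder are invented for the example, and the text-pattern and inference strategies are omitted for brevity.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("aibom.extraction")  # hypothetical logger name

Context = dict[str, Any]
Strategy = Callable[[str, Context], Optional[Any]]

PLACEHOLDER = "NOASSERTION"  # assumed fallback value


def from_api(field: str, ctx: Context) -> Optional[Any]:
    return ctx.get("api", {}).get(field)


def from_model_card(field: str, ctx: Context) -> Optional[Any]:
    return ctx.get("model_card", {}).get(field)


def from_config(field: str, ctx: Context) -> Optional[Any]:
    return ctx.get("config", {}).get(field)


def fallback(field: str, ctx: Context) -> Optional[Any]:
    return PLACEHOLDER


# Order matters: higher-confidence sources come first.
STRATEGIES: list[tuple[str, Strategy]] = [
    ("api", from_api),
    ("model_card", from_model_card),
    ("config", from_config),
    ("fallback", fallback),
]


def extract_field(field: str, ctx: Context,
                  strategies: list[tuple[str, Strategy]] = STRATEGIES) -> tuple[Any, str]:
    """Return (value, strategy_name) from the first strategy that yields data."""
    for name, strategy in strategies:
        try:
            value = strategy(field, ctx)
        except Exception:
            # A failing strategy is logged and skipped, not fatal.
            logger.exception("strategy %s failed for field %s", name, field)
            continue
        if value not in (None, "", [], {}):
            logger.info("field %s resolved by %s", field, name)
            return value, name
    return None, "none"
```

Returning the strategy name alongside the value is one way to feed the confidence levels and data sources that appear in the extraction report.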

2.3 AIBOM Structure Generation

Process:

Create Base Structure: Initialize CycloneDX 1.6 compliant structure
Populate Metadata Section: Add extracted metadata fields
Build Component Section: Create model component with extracted data
Add Model Card: Include AI-specific model card information
Generate External References: Add distribution and repository links
Create Dependencies: Define model dependencies and relationships
Validate Structure: Ensure CycloneDX compliance

Output Structure:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "serialNumber": "urn:uuid:...",
  "version": 1,
  "metadata": {
    "timestamp": "...",
    "tools": [...],
    "component": {...},
    "properties": [...]
  },
  "components": [{
    "type": "machine-learning-model",
    "name": "...",
    "modelCard": {...},
    "properties": [...]
  }],
  "externalReferences": [...],
  "dependencies": [...]
}
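
A base structure with this shape could be initialized along the following lines. The function name `build_base_aibom` is hypothetical, and the sketch only mirrors the skeleton shown above; it does not validate the result against the CycloneDX schema.

```python
import uuid
from datetime import datetime, timezone


def build_base_aibom(model_name: str) -> dict:
    """Create an empty AIBOM skeleton for one model (not schema-validated)."""
    component = {
        "type": "machine-learning-model",
        "name": model_name,
    }
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        # A fresh random UUID per document, in URN form.
        "serialNumber": f"urn:uuid:{uuid.uuid4()}",
        "version": 1,
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tools": [],
            "component": dict(component),
            "properties": [],
        },
        "components": [
            {**component, "modelCard": {}, "properties": []},
        ],
        "externalReferences": [],
        "dependencies": [],
    }
```

The later steps (metadata, model card, external references, dependencies) would then fill in the empty lists and objects before validation.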

3. Completeness Scoring Process

Completeness Scoring Process:

┌───────────────────┐
│ Extracted Fields │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Categorize       │
│ Fields           │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Apply Tier       │
│ Weights          │
│ • Critical: 3x   │
│ • Important: 2x  │
│ • Supplement: 1x │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Calculate        │
│ Category Scores  │
│ • Required: 20   │
│ • Metadata: 20   │
│ • Basic: 20      │
│ • ModelCard: 30  │
│ • ExtRefs: 10    │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Sum Weighted     │
│ Scores           │
│ (Max: 100)       │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Generate Score   │
│ Report           │
└───────────────────┘

Scoring Algorithm:

Field Categorization: Group fields by category (required_fields, metadata, etc.)
Tier Weight Application: Apply multipliers (Critical: 3x, Important: 2x, Supplementary: 1x)
Category Score Calculation: (Fields Present / Total Fields) × Category Weight
Final Score: Sum of all category scores (max 100)

Category Weights:

Required Fields: 20 points
Metadata: 20 points
Component Basic: 20 points
Component Model Card: 30 points
External References: 10 points
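
The category formula and weights can be worked through in a few lines. The sketch below applies only (Fields Present / Total Fields) × Category Weight; tier multipliers are left out because how they combine with the category formula is not specified here. The category keys other than `required_fields` and `metadata` are assumed names.

```python
CATEGORY_WEIGHTS = {
    "required_fields": 20,
    "metadata": 20,
    "component_basic": 20,        # assumed key name
    "component_model_card": 30,   # assumed key name
    "external_references": 10,    # assumed key name
}


def completeness_score(counts: dict[str, tuple[int, int]]) -> tuple[float, dict[str, float]]:
    """counts maps category -> (fields present, total fields)."""
    breakdown: dict[str, float] = {}
    for category, weight in CATEGORY_WEIGHTS.items():
        present, total = counts.get(category, (0, 0))
        # An empty category contributes nothing rather than dividing by zero.
        breakdown[category] = weight * present / total if total else 0.0
    return sum(breakdown.values()), breakdown


example_counts = {
    "required_fields": (4, 4),
    "metadata": (3, 5),
    "component_basic": (2, 4),
    "component_model_card": (6, 12),
    "external_references": (1, 2),
}
total, breakdown = completeness_score(example_counts)
```

For these made-up counts the category scores are 20, 12, 10, 15 and 5, giving a total of 62 out of 100.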

4. Output Generation

Generated Artifacts:

AIBOM JSON: CycloneDX 1.6 compliant SBOM document
Completeness Score: Numerical score (0-100) with breakdown
Field Checklist: Detailed field-by-field analysis
Extraction Report: Confidence levels and data sources
Validation Results: Compliance and quality checks

Configuration Management

Field Registry Structure

The system is driven by field_registry.json which defines:

Field Definitions: All 29 extractable fields
Scoring Configuration: Weights, tiers, and categories
AIBOM Generation Rules: Structure and validation rules
Extraction Strategies: How each field should be extracted

Dynamic Configuration

Adding New Fields:

Add field definition to field_registry.json
System automatically discovers and attempts extraction
No code changes required

Updating Scoring:

Modify weights in registry configuration
Changes take effect immediately
Consistent scoring across all models

Quality Assurance

Validation Layers

Input Validation: Model ID format and accessibility
Extraction Validation: Data type and format checking
Structure Validation: CycloneDX schema compliance
Scoring Validation: Mathematical correctness
Output Validation: JSON schema and completeness

Error Handling

Individual Field Failures: Don't stop overall processing
Graceful Degradation: Falls back to lower-confidence strategies when a higher-confidence one fails
Comprehensive Logging: Detailed error tracking and debugging
Recovery Mechanisms: Automatic retry and alternative approaches
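
One common shape for the automatic-retry part is a small wrapper with exponential backoff. The sketch below is a generic illustration, not the system's actual recovery code; the helper name `with_retry` and its parameters and defaults are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or attempts run out, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. A caller would typically wrap only the network fetches, so that a field whose source stays unavailable still falls through to a lower-confidence strategy.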

Performance Characteristics

Typical Processing Times

Single Model: 2-5 seconds
Batch Processing: 10-50 models/minute
Registry Loading: <1 second
Field Extraction: 1-3 seconds per model

Scalability Features

Concurrent Processing: Multiple models processed simultaneously
Caching: Model metadata and configuration caching
Rate Limiting: Respectful API usage
Resource Management: Memory and connection pooling
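
Concurrent processing and rate limiting can be combined with a thread pool and a shared limiter. This is a minimal sketch under assumed names (`RateLimiter`, `process_batch`) and assumed defaults. The `worker` argument stands in for whatever function generates the SBOM for one model.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Optional


class RateLimiter:
    """Spaces calls at least min_interval seconds apart across all threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if delay:
            time.sleep(delay)


def process_batch(model_ids: Iterable[str], worker: Callable[[str], Any],
                  max_workers: int = 4, min_interval: float = 0.0
                  ) -> list[tuple[str, Any, Optional[Exception]]]:
    """Run worker over model_ids concurrently; return (id, result, error) in input order."""
    limiter = RateLimiter(min_interval)

    def task(model_id: str) -> tuple[str, Any, Optional[Exception]]:
        limiter.wait()
        try:
            return model_id, worker(model_id), None
        except Exception as exc:
            # One failing model must not abort the rest of the batch.
            return model_id, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, model_ids))
```

Threads suit this workload because most of the time is spent waiting on network responses rather than on CPU work.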

Integration Points

APIs

Generation API: /api/generate - Single-model AI SBOM generation, with download URL
Generation with Completeness Score Report API: /api/generate-with-report - Generation API that also returns a completeness scoring report
Completeness Score Report Only API: /api/models/{model_id}/score - Gets the completeness score for a model without generating an AI SBOM
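
A client might build requests for these endpoints as sketched below. Only the paths come from the list above. The base URL, the HTTP methods, the JSON body with a `model_id` key, and the unescaped slash in the score path are all assumptions for illustration. The requests are constructed but not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed host and port


def generate_request(model_id: str, with_report: bool = False,
                     base_url: str = BASE_URL) -> Request:
    """Build a request for /api/generate or /api/generate-with-report."""
    path = "/api/generate-with-report" if with_report else "/api/generate"
    body = json.dumps({"model_id": model_id}).encode("utf-8")  # assumed body shape
    return Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed method
    )


def score_request(model_id: str, base_url: str = BASE_URL) -> Request:
    """Build a request for the score-only endpoint."""
    # Assumes the server accepts the "namespace/name" slash inside the path.
    return Request(f"{base_url}/api/models/{model_id}/score", method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. The response format is not specified here, so parsing is left to the caller.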

Data Sources

HuggingFace Hub: Primary model metadata source
Model Repositories: Direct file access for configurations
Model Cards: Structured documentation parsing

Output Formats

CycloneDX JSON: Primary SBOM format
Field Reports: Human-readable analysis
CSV Exports: Batch processing results
API Responses: Structured JSON for integration

This architecture provides a robust, configurable, and standards-compliant solution for AI model SBOM generation with comprehensive field extraction and scoring capabilities.