
# GraphRAG README

## Some fundamental concepts

### Data Ingestion

NOTE: mermaid.js diagrams below are based on some inspiring content from the Connected Data London 2024: Entity Resolved Knowledge Graphs masterclass.

```mermaid
graph TD
    %% Database shapes with consistent styling
    SDS[(Structured<br/>Data Sources)]
    UDS[(Unstructured<br/>Data Sources)]
    LG[(lexical graph)]
    SG[(semantic graph)]
    VD[(vector database)]

    %% Flow from structured data
    SDS -->|PII features| ER[entity resolution]
    SDS -->|data records| SG
    SG -->|PII updates| ER
    ER -->|semantic overlay| SG

    %% Schema and ontology
    ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
    ONT --> SG

    %% Flow from unstructured data
    UDS --> K[text chunking<br/>function]
    K --> NLP[NLP parse]
    K --> EM[embedding model]
    NLP --> E[NER, RE]
    E --> LG
    LG --> EL[entity linking]
    EL <--> SG

    %% Vector elements connections
    EM --> VD
    VD -.->|capture source chunk<br/>WITHIN references| SG

    %% Thesaurus connection
    ER -.->T[thesaurus]
    T --> EL

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```

### Augment LLM Inference

```mermaid
graph LR
    %% Define database and special shapes
    P[prompt]
    SG[(semantic graph)]
    VD[(vector database)]
    LLM[LLM]
    Z[response]
    
    %% Main flow paths
    P --> Q[generated query]
    P --> EM[embedding model]
    
    %% Upper path through graph elements
    Q --> SG
    SG --> W[semantic<br/>random walk]
    T[thesaurus] --> W
    W --> GA[graph analytics]
    
    %% Lower path through vector elements
    EM --> SS[vector<br/>similarity search]
    SS --> VD
    
    %% Node embeddings and chunk references
    SG -.-|chunk references| VD
    SS -->|node embeddings| SG
    
    %% Final convergence
    GA --> RI[ranked index]
    VD --> RI
    RI --> LLM
    LLM --> Z

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class VD storage;
    class EM embedding;
    class SG semantic;
    class T reference;
```
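
To make the convergence step at the end of this diagram concrete, here is a minimal Python sketch of one way the two retrieval paths could be merged into a ranked index. Everything in it (`hybrid_retrieve`, the hop-limited walk, the additive score merge, the in-memory stand-in for the vector search) is an illustrative assumption, not code from the repo:

```python
import networkx as nx
import numpy as np

def hybrid_retrieve(sem_graph: nx.Graph, seed_entities: list[str],
                    query_vec: np.ndarray, chunk_vecs: dict[str, np.ndarray],
                    hops: int = 2, top_k: int = 5) -> list[str]:
    """Walk the semantic graph outward from the seed entities, score the
    neighborhood, score chunks by vector similarity, and merge both into
    a single ranked index."""
    # Graph path: gather everything within `hops` of the seed entities.
    neighborhood = {e for e in seed_entities if e in sem_graph}
    frontier = set(neighborhood)
    for _ in range(hops):
        frontier = {nbr for n in frontier for nbr in sem_graph.neighbors(n)}
        neighborhood |= frontier
    graph_scores = nx.pagerank(sem_graph.subgraph(neighborhood))

    # Vector path: cosine similarity of the query against chunk embeddings
    # (an in-memory stand-in for the LanceDB similarity search).
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    vec_scores = {cid: cosine(query_vec, v) for cid, v in chunk_vecs.items()}

    # Convergence: one ranked index feeding the LLM prompt context.
    merged = {k: graph_scores.get(k, 0.0) + vec_scores.get(k, 0.0)
              for k in set(graph_scores) | set(vec_scores)}
    return sorted(merged, key=merged.get, reverse=True)[:top_k]
```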

### Sequence Diagram: the current strwythura (structure) repo

- The diagram below is largely based on the `demo.py` functions.
- I used Prefect to dig in and reverse-engineer the flow:
  - `graphrag_demo.py` is my simple update to Paco's original Python code.
  - I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes without the headache of wading through long print statements. Great insight often comes from seeing how individual functions and components interact.
  - This repo features a large and distinguished cast of open-source models (GLiNER, GLiREL), open-source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, Paco's YouTube video and associated diagrams highlight real-world use cases where effective knowledge graph construction can provide deeper meaning and insight.

```mermaid
sequenceDiagram
    participant Main as Main Script
    participant ConstructKG as construct_kg Flow
    participant InitNLP as init_nlp Task
    participant ScrapeHTML as scrape_html Task
    participant MakeChunk as make_chunk Task
    participant ParseText as parse_text Task
    participant MakeEntity as make_entity Task
    participant ExtractEntity as extract_entity Task
    participant ExtractRelations as extract_relations Task
    participant ConnectEntities as connect_entities Task
    participant RunTextRank as run_textrank Task
    participant AbstractOverlay as abstract_overlay Task
    participant GenPyvis as gen_pyvis Task

    Main->>ConstructKG: Start construct_kg flow
    ConstructKG->>InitNLP: Initialize NLP pipeline
    InitNLP-->>ConstructKG: Return NLP object

    loop For each URL in url_list
        ConstructKG->>ScrapeHTML: Scrape HTML content
        ScrapeHTML->>MakeChunk: Create text chunks
        MakeChunk-->>ScrapeHTML: Return chunk list
        ScrapeHTML-->>ConstructKG: Return chunk list

        loop For each chunk in chunk_list
            ConstructKG->>ParseText: Parse text and build lex_graph
            ParseText->>MakeEntity: Create entities from spans
            MakeEntity-->>ParseText: Return entity
            ParseText->>ExtractEntity: Extract and add entities to lex_graph
            ExtractEntity-->>ParseText: Entity added to graph
            ParseText->>ExtractRelations: Extract relations between entities
            ExtractRelations-->>ParseText: Relations added to graph
            ParseText->>ConnectEntities: Connect co-occurring entities
            ConnectEntities-->>ParseText: Connections added to graph
            ParseText-->>ConstructKG: Return parsed doc
        end

        ConstructKG->>RunTextRank: Run TextRank on lex_graph
        RunTextRank-->>ConstructKG: Return ranked entities
        ConstructKG->>AbstractOverlay: Overlay semantic graph
        AbstractOverlay-->>ConstructKG: Overlay completed
    end

    ConstructKG->>GenPyvis: Generate Pyvis visualization
    GenPyvis-->>ConstructKG: Visualization saved
    ConstructKG-->>Main: Flow completed
```

## Run the code

1. Set up a local Python environment and install the Python dependencies (I used Python 3.11, but 3.10 should work as well):

   ```bash
   pip install -r requirements.txt
   ```

2. Start the local Prefect server:

   ```bash
   prefect server start
   ```

3. Run the `graphrag_demo.py` script:

   ```bash
   python graphrag_demo.py
   ```

## Appendix: Code Overview and Purpose

- The code forms part of a talk for GraphGeeks.org about constructing knowledge graphs from unstructured data sources, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.

### Key Components and Flow

#### 1. Model and Parameter Settings

- **Core Configuration:** establishes foundational settings such as chunk size, the embedding model (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels:** defines entity categories such as Person, Organization, Publication, and Technology.
- **Relation Types:** configures relationships such as `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters:** sets user-agent headers for web requests.
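
For orientation, a hypothetical configuration block along these lines might look as follows; the names and values are illustrative assumptions, not the repo's actual constants:

```python
# Hypothetical configuration block; names and values are illustrative,
# not the repo's actual constants.
CHUNK_SIZE = 1024                          # max size per text chunk
EMBED_MODEL = "BAAI/bge-small-en-v1.5"     # sentence embedding model
LANCEDB_URI = "data/lancedb"               # local vector store path

NER_LABELS = ["Person", "Organization", "Publication", "Technology"]

RELATION_TYPES = {
    "works_at":     {"head": ["Person"], "tail": ["Organization"]},
    "developed_by": {"head": ["Technology"], "tail": ["Organization", "Person"]},
    "authored_by":  {"head": ["Publication"], "tail": ["Person"]},
}

SCRAPE_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; graphrag-demo)"}
```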

#### 2. Data Validation

- **Classes:**
  - `TextChunk`: represents segmented text chunks with their embeddings.
  - `Entity`: tracks extracted entities, their attributes, and relationships.
- **Purpose:** ensures data is clean and well-structured for downstream processing.
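
A minimal sketch of what such validation models could look like, assuming Pydantic; the actual field names in the repo may differ:

```python
from pydantic import BaseModel

class TextChunk(BaseModel):
    """One segmented chunk of source text plus its embedding."""
    uid: int
    url: str
    text: str
    embedding: list[float] = []   # filled in by the embedding model

class Entity(BaseModel):
    """An extracted entity with its label and graph bookkeeping."""
    key: str          # normalized key (e.g., lowercased lemma)
    text: str         # surface form as it appeared in the chunk
    label: str        # NER label, e.g., "Person"
    chunk_id: int     # chunk the span came from
    count: int = 1    # how often the entity has been seen
```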

#### 3. Data Collection

- **Functions:**
  - `scrape_html`: fetches and parses webpage content.
  - `uni_scrubber`: cleans up Unicode and formatting issues.
  - `make_chunk`: segments long text into manageable chunks for embedding.
- **Role:** prepares raw, unstructured data for structured analysis.
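
As a rough illustration, simplified stand-ins for `scrape_html` and `make_chunk` might look like this; the real functions also handle Unicode scrubbing and edge cases:

```python
import requests
from bs4 import BeautifulSoup

def scrape_html(url: str, headers: dict) -> str:
    """Fetch a page and return its visible text (simplified stand-in)."""
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)

def make_chunk(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into word-bounded chunks of roughly `chunk_size` characters."""
    chunks, buf, length = [], [], 0
    for word in text.split():
        if length + len(word) + 1 > chunk_size and buf:
            chunks.append(" ".join(buf))
            buf, length = [], 0
        buf.append(word)
        length += len(word) + 1
    if buf:
        chunks.append(" ".join(buf))
    return chunks
```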

#### 4. Lexical Graph Construction

- **Initialization:**
  - `init_nlp`: sets up the NLP pipeline with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing:**
  - `parse_text`: parses each chunk and builds the lexical graph.
  - `make_entity`: extracts entities from spans and integrates them into the graph.
  - `connect_entities`: links entities that co-occur in the same context.
- **Purpose:** converts text into a structured, connected graph of entities and relationships.
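
One plausible reading of the co-occurrence step, sketched with NetworkX; the edge attributes and weighting scheme are assumptions for illustration:

```python
from itertools import combinations
import networkx as nx

def connect_entities(lex_graph: nx.Graph, chunk_entities: list[str]) -> None:
    """Link every pair of entities that co-occur in the same chunk,
    incrementing the edge weight on repeat co-occurrences."""
    for a, b in combinations(sorted(set(chunk_entities)), 2):
        if lex_graph.has_edge(a, b):
            lex_graph[a][b]["weight"] += 1
        else:
            lex_graph.add_edge(a, b, weight=1, rel="co_occurs_with")
```

This assumes the entity nodes were already added by the `make_entity`/`extract_entity` steps, so the function only has to manage edges.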

#### 5. Numerical Processing

- **Functions:**
  - `calc_quantile_bins`: creates quantile bins for numerical data.
  - `root_mean_square`: computes the RMS used for normalization.
  - `stripe_column`: applies quantile binning to data columns.
- **Role:** provides the statistical operations used to refine and rank graph components.
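
Hypothetical NumPy implementations of these helpers, to show the intent; the exact binning heuristic is an assumption:

```python
import numpy as np

def calc_quantile_bins(num_rows: int) -> np.ndarray:
    """Quantile bin edges sized to the number of rows (illustrative heuristic)."""
    granularity = max(int(np.log(num_rows) * 4), 1)
    return np.linspace(0, 1, num=granularity, endpoint=True)

def root_mean_square(values: list[float]) -> float:
    """RMS used for normalization: sqrt(mean(x^2))."""
    arr = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean(arr ** 2)))

def stripe_column(values: list[float], bins: np.ndarray) -> np.ndarray:
    """Map each value onto its quantile 'stripe' (bin index)."""
    quantiles = np.quantile(np.asarray(values, dtype=float), q=bins)
    return np.digitize(values, bins=np.unique(quantiles))
```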

#### 6. TextRank Implementation

- **Functions:**
  - `run_textrank`: ranks entities in the graph with a PageRank-inspired algorithm.
- **Purpose:** identifies and prioritizes key entities for knowledge graph construction.
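
A minimal sketch of the ranking step using NetworkX's PageRank; the `top_n` cutoff and the `rank` node attribute are illustrative assumptions:

```python
import networkx as nx

def run_textrank(lex_graph: nx.Graph, top_n: int = 20) -> list[tuple[str, float]]:
    """Rank entity nodes with PageRank over the weighted co-occurrence
    graph and keep the top-N as knowledge-graph candidates."""
    ranks = nx.pagerank(lex_graph, weight="weight")
    for node, score in ranks.items():
        lex_graph.nodes[node]["rank"] = score   # stash for later overlay steps
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```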

#### 7. Semantic Overlay

- **Functions:**
  - `abstract_overlay`: abstracts a semantic layer from the lexical graph and connects entities to their originating text chunks to preserve context.
- **Role:** enhances the graph with higher-order relationships and semantic depth.
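
A sketch of how such an overlay might promote entities and keep the WITHIN provenance edges from the ingestion diagram, assuming nodes carry `rank` and `chunks` attributes from earlier steps:

```python
import networkx as nx

def abstract_overlay(lex_graph: nx.Graph, sem_graph: nx.Graph,
                     rank_cutoff: float) -> None:
    """Promote high-ranking entities into the semantic graph, keeping
    a WITHIN edge back to each source chunk for provenance."""
    for node, data in lex_graph.nodes(data=True):
        if data.get("rank", 0.0) >= rank_cutoff:
            sem_graph.add_node(node, label=data.get("label"))
            for chunk_id in data.get("chunks", []):
                sem_graph.add_edge(node, f"chunk:{chunk_id}", rel="WITHIN")
```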

#### 8. Visualization

- **Tool:** PyVis
- **Functions:**
  - `gen_pyvis`: creates an interactive visualization of the knowledge graph.
- **Features:**
  - Node sizing reflects entity importance.
  - A physics-based layout supports intuitive exploration.
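
A minimal PyVis sketch showing both features; the sizing formula is an arbitrary choice for illustration, and string node ids are assumed:

```python
import networkx as nx
from pyvis.network import Network

def gen_pyvis(graph: nx.Graph, html_path: str = "kg.html") -> None:
    """Render the knowledge graph as an interactive HTML page,
    sizing nodes by their TextRank score."""
    net = Network(height="700px", width="100%", notebook=False)
    for node, data in graph.nodes(data=True):
        # Scale node size by rank so important entities stand out.
        net.add_node(node, label=str(node), size=10 + 40 * data.get("rank", 0.0))
    for src, dst in graph.edges():
        net.add_edge(src, dst)
    net.toggle_physics(True)   # physics-based layout for exploration
    net.save_graph(html_path)
```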

#### 9. Orchestration

- **Function:**
  - `construct_kg`: orchestrates the full pipeline from data collection through visualization.
- **Purpose:** ensures the seamless integration of all layers and components.
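
A skeleton of what a Prefect-orchestrated flow looks like in general; the task bodies here are placeholders, not the repo's tasks:

```python
from prefect import flow, task

@task
def scrape(url: str) -> str:
    # Placeholder body; the real task fetches and cleans the page.
    return f"text from {url}"

@task
def build_graph(text: str) -> int:
    # Placeholder body; the real tasks parse, extract, and connect entities.
    return len(text)

@flow(name="construct_kg")
def construct_kg(url_list: list[str]) -> None:
    """Each stage is a Prefect task, so the Prefect UI shows
    how the individual components interact."""
    for url in url_list:
        text = scrape(url)
        build_graph(text)

if __name__ == "__main__":
    construct_kg(["https://example.com"])
```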

### Notable Implementation Details

- **Multi-Layer Graph Representation:** combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration:** enhances entity representation with embeddings.
- **Error Handling and Debugging:** includes logging and debugging features.
- **Scalability:** designed to handle diverse, large datasets with dynamic relationships.

## Appendix: Architectural Workflow

### 1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction

#### 1.1 Workflow Layers

**Data Ingestion**

- **Role:** extract raw data from structured and unstructured sources for downstream processing.
- **Responsibilities:** handle diverse data formats, ensure quality, and standardize for analysis.
- **Requirements:** reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.

**Lexical Graph Construction**

- **Role:** build a foundational graph by integrating tokenized data and semantic relationships.
- **Responsibilities:** identify key entities through tokenization and ranking (e.g., TextRank).
- **Requirements:** efficient methods for integrating named entities and relationships into a coherent graph structure.

**Entity and Relation Extraction**

- **Role:** identify and label entities, along with their relationships, to enrich the graph structure.
- **Responsibilities:** extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- **Requirements:** domain-tuned models and algorithms for accurate extraction.

**Graph Construction and Visualization**

- **Role:** develop and display the knowledge graph to facilitate analysis and decision-making.
- **Responsibilities:** create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- **Requirements:** scalable graph-building frameworks and intuitive visualization tools.

**Semantic Overlay**

- **Role:** enhance the graph with additional context and reasoning capabilities.
- **Responsibilities:** integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- **Requirements:** mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.

### 2. Visualized Workflow

#### 2.1 Logical Data Flow

```mermaid
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
```

### 3. Glossary

| Participant | Description | Workflow Layer |
|---|---|---|
| HTML Scraper (BeautifulSoup) | Fetches unstructured text data from web sources. | Data Ingestion |
| Text Chunker | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| spaCy Pipeline | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| Embedding Model (bge-small-en-v1.5) | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| GLiNER | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| GLiREL | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| Vector Database (LanceDB) | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| Word2Vec (Gensim) | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Graph Construction |
| Graph Constructor (NetworkX) | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| Graph Visualizer (PyVis) | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
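
To illustrate the Word2Vec row above: one common way to get co-occurrence-based entity embeddings is to train Word2Vec on random walks over the graph (node2vec-style). This sketch is an assumption about the approach, not the repo's code:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def entity_embeddings(graph: nx.Graph, dim: int = 32) -> Word2Vec:
    """Train Word2Vec on short random walks over the graph, so entities
    that co-occur along walks end up with nearby vectors."""
    walks = []
    for _ in range(10):                       # passes over the graph
        for start in graph.nodes():
            walk, node = [str(start)], start
            for _ in range(8):                # walk length
                neighbors = list(graph.neighbors(node))
                if not neighbors:
                    break
                node = random.choice(neighbors)
                walk.append(str(node))
            walks.append(walk)
    return Word2Vec(walks, vector_size=dim, window=4, min_count=1, sg=1)
```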

## Citations: giving credit where credit is due

Inspired by the great work of the many individuals who created the Connected Data London 2024: Entity Resolved Knowledge Graphs masterclass, I created this document to highlight the areas that rang true.

The code that uses GLiNER and GLiREL started as a fork of one of the four repos that make up the masterclass.