# GraphRAG README
## Some fundamental concepts
### Data Ingestion
NOTE: mermaid.js diagrams below are based on some inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.
```mermaid
graph TD
%% Database shapes with consistent styling
SDS[(Structured<br/>Data Sources)]
UDS[(Unstructured<br/>Data Sources)]
LG[(lexical graph)]
SG[(semantic graph)]
VD[(vector database)]
%% Flow from structured data
SDS -->|PII features| ER[entity resolution]
SDS -->|data records| SG
SG -->|PII updates| ER
ER -->|semantic overlay| SG
%% Schema and ontology
ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
ONT --> SG
%% Flow from unstructured data
UDS --> K[text chunking<br/>function]
K --> NLP[NLP parse]
K --> EM[embedding model]
NLP --> E[NER, RE]
E --> LG
LG --> EL[entity linking]
EL <--> SG
%% Vector elements connections
EM --> VD
VD -.->|capture source chunk<br/>WITHIN references| SG
%% Thesaurus connection
ER -.->T[thesaurus]
T --> EL
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SDS,UDS dataSource;
class SG,VD storage;
class EM embedding;
class LG lexical;
class SG semantic;
class ONT,T reference;
```
### Augment LLM Inference
```mermaid
graph LR
%% Define database and special shapes
P[prompt]
SG[(semantic graph)]
VD[(vector database)]
LLM[LLM]
Z[response]
%% Main flow paths
P --> Q[generated query]
P --> EM[embedding model]
%% Upper path through graph elements
Q --> SG
SG --> W[semantic<br/>random walk]
T[thesaurus] --> W
W --> GA[graph analytics]
%% Lower path through vector elements
EM --> SS[vector<br/>similarity search]
SS --> VD
%% Node embeddings and chunk references
SG -.-|chunk references| VD
SS -->|node embeddings| SG
%% Final convergence
GA --> RI[ranked index]
VD --> RI
RI --> LLM
LLM --> Z
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SG,VD storage;
class EM embedding;
class SG semantic;
class T reference;
```
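The inference-time diagram boils down to: embed the prompt, score stored chunks by vector similarity, then boost chunks whose entities sit in the semantic-graph neighborhood of the prompt, producing a ranked index for the LLM. A minimal stdlib sketch of that blending step (the function names, the `graph_boost` weight, and the dict-based stores are illustrative assumptions, not the repo's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_chunks(prompt_vec, chunk_vecs, chunk_entities, seed_entities, graph_boost=0.1):
    """Blend vector similarity with a bonus for chunks whose entities
    overlap the semantic-graph neighborhood of the prompt's entities."""
    ranked = []
    for chunk_id, vec in chunk_vecs.items():
        score = cosine(prompt_vec, vec)
        overlap = len(chunk_entities.get(chunk_id, set()) & seed_entities)
        ranked.append((chunk_id, score + graph_boost * overlap))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)
```

In the real pipeline the similarity search runs inside LanceDB and the neighborhood comes from a semantic random walk; this sketch only shows how the two signals converge into one ranked index.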
## Sequence Diagram - covering the current `strwythura` (structure) repo
- the diagram below is largely based on the `demo.py` functions
- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow...
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original python code](./demo.py)
- I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes, without the headaches of wading through long print statements. Some great insight often occurs when you can see how individual functions / components are interacting.
- this repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec) and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight.
```mermaid
sequenceDiagram
participant Main as Main Script
participant ConstructKG as construct_kg Flow
participant InitNLP as init_nlp Task
participant ScrapeHTML as scrape_html Task
participant MakeChunk as make_chunk Task
participant ParseText as parse_text Task
participant MakeEntity as make_entity Task
participant ExtractEntity as extract_entity Task
participant ExtractRelations as extract_relations Task
participant ConnectEntities as connect_entities Task
participant RunTextRank as run_textrank Task
participant AbstractOverlay as abstract_overlay Task
participant GenPyvis as gen_pyvis Task
Main->>ConstructKG: Start construct_kg flow
ConstructKG->>InitNLP: Initialize NLP pipeline
InitNLP-->>ConstructKG: Return NLP object
loop For each URL in url_list
ConstructKG->>ScrapeHTML: Scrape HTML content
ScrapeHTML->>MakeChunk: Create text chunks
MakeChunk-->>ScrapeHTML: Return chunk list
ScrapeHTML-->>ConstructKG: Return chunk list
loop For each chunk in chunk_list
ConstructKG->>ParseText: Parse text and build lex_graph
ParseText->>MakeEntity: Create entities from spans
MakeEntity-->>ParseText: Return entity
ParseText->>ExtractEntity: Extract and add entities to lex_graph
ExtractEntity-->>ParseText: Entity added to graph
ParseText->>ExtractRelations: Extract relations between entities
ExtractRelations-->>ParseText: Relations added to graph
ParseText->>ConnectEntities: Connect co-occurring entities
ConnectEntities-->>ParseText: Connections added to graph
ParseText-->>ConstructKG: Return parsed doc
end
ConstructKG->>RunTextRank: Run TextRank on lex_graph
RunTextRank-->>ConstructKG: Return ranked entities
ConstructKG->>AbstractOverlay: Overlay semantic graph
AbstractOverlay-->>ConstructKG: Overlay completed
end
ConstructKG->>GenPyvis: Generate Pyvis visualization
GenPyvis-->>ConstructKG: Visualization saved
ConstructKG-->>Main: Flow completed
```
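The control flow in the sequence diagram can be compressed into a short plain-Python skeleton. The stubs below are stand-ins so the flow runs end to end; in the repo the corresponding functions carry Prefect `@flow`/`@task` decorators and do the real work:

```python
# Stub stand-ins so the control flow runs end to end; the repo's versions
# are Prefect-decorated tasks with the same call order.
def init_nlp():            return "nlp"
def scrape_html(url):      return [f"{url}#chunk0", f"{url}#chunk1"]
def parse_text(nlp, g, c): g.setdefault("chunks", []).append(c)
def run_textrank(g):       g["ranked"] = sorted(g.get("chunks", []))
def abstract_overlay(g):   g["overlay"] = True
def gen_pyvis(g):          g["viz"] = "kg.html"

def construct_kg(url_list):
    """Mirrors the sequence diagram: init NLP once, then per URL
    scrape -> chunk -> parse -> rank -> overlay, visualize at the end."""
    nlp = init_nlp()
    lex_graph = {}
    for url in url_list:
        for chunk in scrape_html(url):
            parse_text(nlp, lex_graph, chunk)
        run_textrank(lex_graph)
        abstract_overlay(lex_graph)
    gen_pyvis(lex_graph)
    return lex_graph
```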
## Run the code
1. Set up a local Python environment and install the Python dependencies
- I used Python 3.11, but 3.10 should work as well
```bash
pip install -r requirements.txt
```
2. Start the local Prefect server
- follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`
```bash
prefect server start
```
3. Run the `graphrag_demo.py` script
```bash
python graphrag_demo.py
```
## Appendix: Code Overview and Purpose
- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.
---
### **Key Components and Flow**
#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
#### **2. Data Validation**
- **Classes**:
- `TextChunk`: Represents segmented text chunks with their embeddings.
- `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
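As a rough illustration, the two validation classes can be sketched with stdlib dataclasses; the field names below are assumptions for illustration, not the repo's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    """One segment of scraped text plus its embedding vector."""
    uid: int
    url: str
    text: str
    embedding: list = field(default_factory=list)

@dataclass
class Entity:
    """An extracted entity: its label, source chunks, and TextRank score."""
    key: str
    text: str
    label: str                                   # e.g. "Person", "Organization"
    chunk_ids: list = field(default_factory=list)
    rank: float = 0.0
```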
#### **3. Data Collection**
- **Functions**:
- `scrape_html`: Fetches and parses webpage content.
- `uni_scrubber`: Cleans Unicode and formatting issues.
- `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
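The chunking step can be sketched as a sliding window over the cleaned text. The `chunk_size` and `overlap` defaults below are illustrative assumptions (the repo's `make_chunk` also attaches embeddings and persists chunks to LanceDB):

```python
def make_chunks(text: str, chunk_size: int = 1024, overlap: int = 100) -> list:
    """Split cleaned text into fixed-size, slightly overlapping chunks
    so entity mentions straddling a boundary appear whole in one chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
    return chunks
```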
#### **4. Lexical Graph Construction**
- **Initialization**:
- `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
- `parse_text`: Creates lexical graphs using TextRank algorithms.
- `make_entity`: Extracts and integrates entities into the graph.
- `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
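The co-occurrence linking in `connect_entities` amounts to weighting an edge between every pair of entities seen in the same context. The repo works on a NetworkX lexical graph; this dict-based version is just a sketch of the counting logic:

```python
from itertools import combinations

def connect_entities(edge_weights: dict, context_entities: list) -> dict:
    """Add (or strengthen) an edge between every pair of distinct
    entities that co-occur in the same sentence or chunk."""
    for a, b in combinations(sorted(set(context_entities)), 2):
        edge_weights[(a, b)] = edge_weights.get((a, b), 0) + 1
    return edge_weights
```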
#### **5. Numerical Processing**
- **Functions**:
- `calc_quantile_bins`: Creates quantile bins for numerical data.
- `root_mean_square`: Computes RMS for normalization.
- `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
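These three helpers are small numerical utilities; a stdlib sketch under the assumption that they behave roughly as their names suggest (the repo likely delegates to numpy/pandas for the same effect):

```python
def calc_quantile_bins(num_rows: int) -> list:
    """Evenly spaced bin boundaries in [0, 1], one per row."""
    if num_rows < 2:
        return [0.0]
    return [i / (num_rows - 1) for i in range(num_rows)]

def root_mean_square(values: list) -> float:
    """RMS of a list of values, used as a normalization constant."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

def stripe_column(values: list, bins: list) -> list:
    """Map each value to the count of bin boundaries it meets or exceeds,
    i.e. a coarse quantile 'stripe' for ranking graph components."""
    return [sum(1 for b in bins if v >= b) for v in values]
```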
#### **6. TextRank Implementation**
- **Functions**:
- `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
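The ranking idea can be shown with a bare power-iteration PageRank over an adjacency dict. This is a stdlib stand-in for the NetworkX-based ranking the repo uses, not its actual implementation:

```python
def pagerank(adjacency: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Power-iteration PageRank over {node: [out-neighbors]}.
    Well-connected entities accumulate rank; isolated ones do not."""
    nodes = list(adjacency)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, nbrs in adjacency.items():
            if nbrs:
                share = rank[n] / len(nbrs)
                for m in nbrs:
                    new[m] += damping * share
            else:  # dangling node: spread its rank uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank
```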
#### **7. Semantic Overlay**
- **Functions**:
- `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
- Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
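The overlay step can be sketched as promoting the top-ranked lexical entities into a semantic layer while keeping provenance links back to their source chunks. The function signature and dict shapes here are hypothetical; the repo's `abstract_overlay` also carries relation types:

```python
def abstract_overlay(ranked_entities: dict, entity_chunks: dict, top_k: int = 10) -> dict:
    """Promote the top_k ranked entities into a semantic layer, keeping
    WITHIN references back to their originating chunks for context."""
    top = sorted(ranked_entities, key=ranked_entities.get, reverse=True)[:top_k]
    return {
        name: {"rank": ranked_entities[name],
               "within": entity_chunks.get(name, [])}
        for name in top
    }
```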
#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
- `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
- Node sizing reflects entity importance.
- Physics-based layout supports intuitive exploration.
#### **9. Orchestration**
- **Function**:
- `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.
---
### **Notable Implementation Details**
- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.
---
## Appendix: Architectural Workflow
### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**
#### **1.1 Workflow Layers**
**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.
**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.
**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.
**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.
**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
### **2. Visualized Workflow**
#### **2.1 Logical Data Flow**
```mermaid
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
```
---
### **3. Glossary**
| **Participant** | **Description** | **Workflow Layer** |
|--------------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **SpaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
## Citations: giving credit where credit is due...
Inspired by the great work of the many individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight the areas that rang true.
- Paco Nathan https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan https://clairsullivan.com/
- Louis Guitton https://guitton.co/
- Jeff Butcher https://github.com/jbutcher21
- Michael Dockter https://github.com/docktermj
The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass.