# GraphRAG README
## Some fundamental concepts
### Data Ingestion
NOTE: The mermaid.js diagrams below are based on inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.
```mermaid
graph TD
%% Database shapes with consistent styling
SDS[(Structured<br/>Data Sources)]
UDS[(Unstructured<br/>Data Sources)]
LG[(lexical graph)]
SG[(semantic graph)]
VD[(vector database)]
%% Flow from structured data
SDS -->|PII features| ER[entity resolution]
SDS -->|data records| SG
SG -->|PII updates| ER
ER -->|semantic overlay| SG
%% Schema and ontology
ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
ONT --> SG
%% Flow from unstructured data
UDS --> K[text chunking<br/>function]
K --> NLP[NLP parse]
K --> EM[embedding model]
NLP --> E[NER, RE]
E --> LG
LG --> EL[entity linking]
EL <--> SG
%% Vector elements connections
EM --> VD
VD -.->|capture source chunk<br/>WITHIN references| SG
%% Thesaurus connection
ER -.->T[thesaurus]
T --> EL
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SDS,UDS dataSource;
class SG,VD storage;
class EM embedding;
class LG lexical;
class SG semantic;
class ONT,T reference;
```
### Augment LLM Inference
```mermaid
graph LR
%% Define database and special shapes
P[prompt]
SG[(semantic graph)]
VD[(vector database)]
LLM[LLM]
Z[response]
%% Main flow paths
P --> Q[generated query]
P --> EM[embedding model]
%% Upper path through graph elements
Q --> SG
SG --> W[semantic<br/>random walk]
T[thesaurus] --> W
W --> GA[graph analytics]
%% Lower path through vector elements
EM --> SS[vector<br/>similarity search]
SS --> VD
%% Node embeddings and chunk references
SG -.-|chunk references| VD
SS -->|node embeddings| SG
%% Final convergence
GA --> RI[ranked index]
VD --> RI
RI --> LLM
LLM --> Z
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SG,VD storage;
class EM embedding;
class SG semantic;
class T reference;
```
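
The lower path of the diagram (embedding model, vector similarity search, vector database, ranked index) can be sketched in plain Python. This is a minimal illustration, not the repo's implementation: the toy 3-dimensional vectors stand in for real embeddings, and the names `cosine_similarity` and `rank_chunks` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=2):
    """Return the ids of the top_k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy vectors standing in for real chunk embeddings (hypothetical data).
chunk_vecs = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.7, 0.7, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]
top = rank_chunks(query_vec, chunk_vecs)
print(top)
```

In a fuller pipeline the ranked chunk ids would be merged with the results of the graph path before being handed to the LLM; that merge step is left out here.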
## Sequence Diagram - covering the current `strwythura` (structure) repo
- The diagram below is largely based on the functions in `demo.py`.
- I used [Prefect](https://www.prefect.io/) to dig in and reverse engineer the flow.
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original Python code](./demo.py).
- I stuck to Prefect function decorators that follow the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes without the headache of wading through long print statements. Great insight often comes from seeing how individual functions and components interact.
- This repo features a large and distinguished cast of open source models (GLiNER, GLiREL) and open source embeddings (BGE, Word2Vec), plus a vector store (LanceDB), for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight.
```mermaid
sequenceDiagram
participant Main as Main Script
participant ConstructKG as construct_kg Flow
participant InitNLP as init_nlp Task
participant ScrapeHTML as scrape_html Task
participant MakeChunk as make_chunk Task
participant ParseText as parse_text Task
participant MakeEntity as make_entity Task
participant ExtractEntity as extract_entity Task
participant ExtractRelations as extract_relations Task
participant ConnectEntities as connect_entities Task
participant RunTextRank as run_textrank Task
participant AbstractOverlay as abstract_overlay Task
participant GenPyvis as gen_pyvis Task
Main->>ConstructKG: Start construct_kg flow
ConstructKG->>InitNLP: Initialize NLP pipeline
InitNLP-->>ConstructKG: Return NLP object
loop For each URL in url_list
ConstructKG->>ScrapeHTML: Scrape HTML content
ScrapeHTML->>MakeChunk: Create text chunks
MakeChunk-->>ScrapeHTML: Return chunk list
ScrapeHTML-->>ConstructKG: Return chunk list
loop For each chunk in chunk_list
ConstructKG->>ParseText: Parse text and build lex_graph
ParseText->>MakeEntity: Create entities from spans
MakeEntity-->>ParseText: Return entity
ParseText->>ExtractEntity: Extract and add entities to lex_graph
ExtractEntity-->>ParseText: Entity added to graph
ParseText->>ExtractRelations: Extract relations between entities
ExtractRelations-->>ParseText: Relations added to graph
ParseText->>ConnectEntities: Connect co-occurring entities
ConnectEntities-->>ParseText: Connections added to graph
ParseText-->>ConstructKG: Return parsed doc
end
ConstructKG->>RunTextRank: Run TextRank on lex_graph
RunTextRank-->>ConstructKG: Return ranked entities
ConstructKG->>AbstractOverlay: Overlay semantic graph
AbstractOverlay-->>ConstructKG: Overlay completed
end
ConstructKG->>GenPyvis: Generate Pyvis visualization
GenPyvis-->>ConstructKG: Visualization saved
ConstructKG-->>Main: Flow completed
```
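
The call order in the sequence diagram can be approximated in plain Python. This sketch is an assumption-heavy stand-in: the `traced` decorator is a hypothetical, dependency-free substitute for Prefect's task instrumentation, the function bodies are stubs, and only the function names come from the diagram.

```python
import functools

CALL_LOG: list[str] = []


def traced(func):
    """Record each call by name; a stand-in for real task instrumentation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CALL_LOG.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@traced
def init_nlp():
    return "nlp"  # placeholder for a real NLP pipeline object


@traced
def make_chunk(text, size=20):
    return [text[i:i + size] for i in range(0, len(text), size)]


@traced
def scrape_html(url):
    # Stub: no network access; fabricate some page text from the URL.
    return make_chunk(f"text from {url} " * 2)


@traced
def parse_text(nlp, chunk, lex_graph):
    lex_graph.append(chunk)  # stub: a list stands in for the lexical graph


@traced
def run_textrank(lex_graph):
    return sorted(lex_graph)  # stub ranking


@traced
def abstract_overlay(ranked, sem_graph):
    sem_graph.extend(ranked)


@traced
def gen_pyvis(sem_graph):
    return len(sem_graph)  # stub: a real version would write an HTML file


def construct_kg(url_list):
    nlp = init_nlp()
    sem_graph = []
    for url in url_list:
        lex_graph = []
        for chunk in scrape_html(url):
            parse_text(nlp, chunk, lex_graph)
        abstract_overlay(run_textrank(lex_graph), sem_graph)
    gen_pyvis(sem_graph)
    return sem_graph


construct_kg(["https://example.com/a"])
print(CALL_LOG)
```

Printing `CALL_LOG` gives a flat trace of which step ran when, which is a rough text-only version of what an orchestration UI shows graphically.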
## Run the code
1. Set up a local Python environment and install the Python dependencies
- I used Python 3.11, but 3.10 should work as well
```bash
pip install -r requirements.txt
```
2. Start the local Prefect server
- Follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`
```bash
prefect server start
```
3. Run the `graphrag_demo.py` script
```bash
python graphrag_demo.py
```
## Appendix: Code Overview and Purpose
- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.
---
### **Key Components and Flow**
#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
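
The settings above could be grouped roughly as follows. This is a sketch only: the class name `PipelineConfig`, the field names, the LanceDB path, and the user-agent string are all hypothetical, and the chunk size reuses the example value from the glossary below.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    # Core configuration
    chunk_size: int = 1024  # example value; the glossary mentions 1024 tokens
    embed_model: str = "BAAI/bge-small-en-v1.5"
    lancedb_uri: str = "data/lancedb"  # hypothetical path

    # NER labels and relation types named in this README
    ner_labels: tuple[str, ...] = (
        "Person", "Organization", "Publication", "Technology",
    )
    relation_types: tuple[str, ...] = (
        "works_at", "developed_by", "authored_by",
    )

    # Scraping parameters (the header value is a placeholder)
    scrape_headers: dict = field(
        default_factory=lambda: {"User-Agent": "Mozilla/5.0 (placeholder)"}
    )


config = PipelineConfig()
print(config.embed_model, config.ner_labels)
```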
#### **2. Data Validation**
- **Classes**:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
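
A minimal sketch of what such validated records might look like, using only standard-library dataclasses. The field names and validation rules here are assumptions for illustration; the repo's own classes may use a different base class and different fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextChunk:
    uid: int
    url: str
    text: str
    vector: Optional[list[float]] = None  # filled in once the chunk is embedded

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("a chunk must contain non-empty text")


@dataclass
class Entity:
    key: str
    text: str
    label: str
    chunk_id: int
    # (relation type, key of the other entity) pairs; hypothetical layout
    relations: list[tuple[str, str]] = field(default_factory=list)

    def __post_init__(self):
        if self.chunk_id < 0:
            raise ValueError("chunk_id must be non-negative")


chunk = TextChunk(uid=0, url="https://example.com", text="Knowledge graphs link entities.")
entity = Entity(key="knowledge graph", text="Knowledge graphs", label="Technology", chunk_id=chunk.uid)
```

Rejecting bad records at construction time keeps empty chunks and orphaned entities out of the later graph-building steps.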
#### **3. Data Collection**
- **Functions**:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
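
The cleaning and chunking steps can be sketched with the standard library alone. The function names match the list above, but the bodies are assumptions: this version counts words rather than model tokens and splits sentences with a simple regular expression, where the repo relies on an NLP pipeline.

```python
import re
import unicodedata


def uni_scrubber(text: str) -> str:
    """Normalize Unicode, straighten curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def make_chunk(text: str, chunk_size: int = 8) -> list[str]:
    """Greedily pack whole sentences into chunks of about chunk_size words.

    A single sentence longer than chunk_size becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


raw = "Knowledge\u00a0graphs  are useful. They link entities! Do they scale? Yes."
clean = uni_scrubber(raw)
chunks = make_chunk(clean, chunk_size=8)
print(chunks)
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each chunk readable on its own, which matters when chunks are later shown as source context.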
#### **4. Lexical Graph Construction**
- **Initialization**:
  - `init_nlp`: Sets up the NLP pipeline with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
  - `parse_text`: Parses each text chunk and builds the lexical graph, following a TextRank-style approach.
  - `make_entity`: Creates entities from text spans so they can be added to the graph.
  - `connect_entities`: Links entities that co-occur in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
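
The co-occurrence idea behind `connect_entities` can be illustrated as follows. This is a simplified sketch under stated assumptions: it takes pre-extracted entity names per sentence (hypothetical sample data) and stores edge weights in a `Counter` instead of a NetworkX graph, to stay dependency-free.

```python
from collections import Counter
from itertools import combinations


def connect_entities(sentences: list[list[str]]) -> Counter:
    """Count how often each pair of entities co-occurs in the same sentence."""
    edges: Counter = Counter()
    for entities in sentences:
        # Sort so that (a, b) and (b, a) map to the same undirected edge.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Entities found in three sentences (hypothetical sample data).
sentences = [
    ["GLiNER", "spaCy"],
    ["GLiNER", "GLiREL", "spaCy"],
    ["LanceDB"],
]
edges = connect_entities(sentences)
print(edges.most_common())
```

The counts can serve as edge weights, so that pairs of entities mentioned together often end up more strongly connected in the lexical graph.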
#### **5. Numerical Processing**
- **Functions**:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
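
These three helpers can be approximated with the standard library. The function names follow the list above, but the signatures and bodies are assumptions for illustration, not the repo's code.

```python
import bisect
import math
import statistics


def calc_quantile_bins(values: list[float], num_bins: int = 4) -> list[float]:
    """Return the num_bins - 1 cut points that split values into quantile bins."""
    return statistics.quantiles(values, n=num_bins)


def stripe_column(values: list[float], bins: list[float]) -> list[int]:
    """Map each value to the index of the quantile bin it falls into."""
    return [bisect.bisect_left(bins, v) for v in values]


def root_mean_square(values: list[float]) -> float:
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


scores = [1, 2, 3, 4, 5, 6, 7, 8]
bins = calc_quantile_bins(scores, num_bins=4)
striped = stripe_column(scores, bins)
print(bins, striped, root_mean_square([3, 4]))
```

Binning turns raw scores into a small number of ordered levels, which is easier to compare across measures that live on different scales.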
#### **6. TextRank Implementation**
- **Functions**:
  - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
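
The PageRank idea underlying TextRank can be shown with a short power iteration. This is a generic sketch, not the repo's `run_textrank`: it assumes an undirected graph given as a symmetric adjacency list with no isolated nodes, and the sample graph is hypothetical.

```python
def pagerank(
    graph: dict[str, list[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Power-iteration PageRank over a symmetric adjacency list."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor m shares its rank equally among its own neighbors.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# A tiny lexical graph: "graph" is linked to every other term.
lex_graph = {
    "graph": ["entity", "relation", "embedding"],
    "entity": ["graph", "relation"],
    "relation": ["graph", "entity"],
    "embedding": ["graph"],
}
ranks = pagerank(lex_graph)
ranked = sorted(ranks, key=ranks.get, reverse=True)
print(ranked)
```

Well-connected terms accumulate rank from their neighbors, so the most central entity rises to the top of the list.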
#### **7. Semantic Overlay**
- **Functions**:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
  - Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
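
One simplified way to picture the overlay step: promote the better-ranked entities into a separate semantic layer and keep a reference to the chunks they came from. The signature, the `min_rank` threshold, and the dictionary layout below are all assumptions for illustration; the repo's `abstract_overlay` works on graph objects and does more than this.

```python
def abstract_overlay(
    ranks: dict[str, float],
    entity_chunks: dict[str, set[int]],
    min_rank: float = 0.1,
) -> dict[str, dict]:
    """Promote entities at or above min_rank, keeping their chunk references."""
    sem_graph = {}
    for entity, rank in ranks.items():
        if rank >= min_rank:
            sem_graph[entity] = {
                "rank": rank,
                "chunks": sorted(entity_chunks.get(entity, set())),
            }
    return sem_graph


# Hypothetical ranks and chunk references from earlier pipeline steps.
ranks = {"knowledge graph": 0.42, "entity": 0.25, "the": 0.03}
entity_chunks = {"knowledge graph": {0, 2}, "entity": {1}, "the": {0, 1, 2}}
sem_graph = abstract_overlay(ranks, entity_chunks)
print(sem_graph)
```

Keeping the chunk ids on each semantic node is what lets a later query walk from an entity back to the source text.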
#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.
#### **9. Orchestration**
- **Function**:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.
---
### **Notable Implementation Details**
- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.
---
## Appendix: Architectural Workflow
### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**
#### **1.1 Workflow Layers**
**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.
**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.
**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.
**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.
**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
### **2. Visualized Workflow**
#### **2.1 Logical Data Flow**
```mermaid
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
```
---
### **3. Glossary**
| **Participant** | **Description** | **Workflow Layer** |
|--------------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **SpaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
## Citations: giving credit where credit is due...
Inspired by the great work of the individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight the areas that rang true.
- [Paco Nathan](https://senzing.com/consult-entity-resolution-paco/)
- [Clair Sullivan](https://clairsullivan.com/)
- [Louis Guitton](https://guitton.co/)
- [Jeff Butcher](https://github.com/jbutcher21)
- [Michael Dockter](https://github.com/docktermj)
The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass.