# GraphRAG README

## Some fundamental concepts

### Data Ingestion

NOTE: the mermaid.js diagrams below are based on some inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.
```mermaid
graph TD
    %% Database shapes with consistent styling
    SDS[(Structured<br/>Data Sources)]
    UDS[(Unstructured<br/>Data Sources)]
    LG[(lexical graph)]
    SG[(semantic graph)]
    VD[(vector database)]

    %% Flow from structured data
    SDS -->|PII features| ER[entity resolution]
    SDS -->|data records| SG
    SG -->|PII updates| ER
    ER -->|semantic overlay| SG

    %% Schema and ontology
    ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
    ONT --> SG

    %% Flow from unstructured data
    UDS --> K[text chunking<br/>function]
    K --> NLP[NLP parse]
    K --> EM[embedding model]
    NLP --> E[NER, RE]
    E --> LG
    LG --> EL[entity linking]
    EL <--> SG

    %% Vector elements connections
    EM --> VD
    VD -.->|capture source chunk<br/>WITHIN references| SG

    %% Thesaurus connection
    ER -.-> T[thesaurus]
    T --> EL

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```
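The `text chunking function` node above maps onto a simple sliding-window operation over the raw text. A minimal, stdlib-only sketch (the window and overlap sizes here are illustrative, not the repo's actual settings):

```python
def make_chunks(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-window chunks for embedding."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    chunks = []
    step = max_words - overlap  # advance by the non-overlapping portion
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break  # last window already covers the tail
    return chunks

# 250 words with a 100-word window and 20-word overlap -> 3 chunks
chunks = make_chunks("word " * 250, max_words=100, overlap=20)
```

Overlap between consecutive chunks helps the embedding model keep context that would otherwise be cut at a hard boundary.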
### Augment LLM Inference

```mermaid
graph LR
    %% Define database and special shapes
    P[prompt]
    SG[(semantic graph)]
    VD[(vector database)]
    LLM[LLM]
    Z[response]

    %% Main flow paths
    P --> Q[generated query]
    P --> EM[embedding model]

    %% Upper path through graph elements
    Q --> SG
    SG --> W[semantic<br/>random walk]
    T[thesaurus] --> W
    W --> GA[graph analytics]

    %% Lower path through vector elements
    EM --> SS[vector<br/>similarity search]
    SS --> VD

    %% Node embeddings and chunk references
    SG -.-|chunk references| VD
    SS -->|node embeddings| SG

    %% Final convergence
    GA --> RI[ranked index]
    VD --> RI
    RI --> LLM
    LLM --> Z

    %% Styling classes
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class VD storage;
    class EM embedding;
    class SG semantic;
    class T reference;
```
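The `vector similarity search` step in this diagram is, at its core, a nearest-neighbor lookup over stored chunk embeddings. The repo delegates this to LanceDB with BGE embeddings; a toy stdlib-only sketch of the underlying ranking (the chunk IDs and 3-dimensional vectors are purely illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k chunk IDs whose embeddings are most similar to the query."""
    scored = [(cosine(query, vec), cid) for cid, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

index = {
    "chunk-1": [1.0, 0.0, 0.0],
    "chunk-2": [0.9, 0.1, 0.0],
    "chunk-3": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.05, 0.0], index, k=2)  # nearest chunks first
```

A real vector database replaces the linear scan with an approximate-nearest-neighbor index, but the ranking contract is the same.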
## Sequence Diagram - covering the current `strwythura` (structure) repo

- The diagram below is largely based on the `demo.py` functions.
- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow.
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original Python code](./demo.py).
- I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes without the headache of wading through long print statements. Great insights often come from seeing how individual functions and components interact.
- This repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) highlight real-world use cases where effective knowledge graph construction can provide deeper meaning and insight.
```mermaid
sequenceDiagram
    participant Main as Main Script
    participant ConstructKG as construct_kg Flow
    participant InitNLP as init_nlp Task
    participant ScrapeHTML as scrape_html Task
    participant MakeChunk as make_chunk Task
    participant ParseText as parse_text Task
    participant MakeEntity as make_entity Task
    participant ExtractEntity as extract_entity Task
    participant ExtractRelations as extract_relations Task
    participant ConnectEntities as connect_entities Task
    participant RunTextRank as run_textrank Task
    participant AbstractOverlay as abstract_overlay Task
    participant GenPyvis as gen_pyvis Task

    Main->>ConstructKG: Start construct_kg flow
    ConstructKG->>InitNLP: Initialize NLP pipeline
    InitNLP-->>ConstructKG: Return NLP object

    loop For each URL in url_list
        ConstructKG->>ScrapeHTML: Scrape HTML content
        ScrapeHTML->>MakeChunk: Create text chunks
        MakeChunk-->>ScrapeHTML: Return chunk list
        ScrapeHTML-->>ConstructKG: Return chunk list

        loop For each chunk in chunk_list
            ConstructKG->>ParseText: Parse text and build lex_graph
            ParseText->>MakeEntity: Create entities from spans
            MakeEntity-->>ParseText: Return entity
            ParseText->>ExtractEntity: Extract and add entities to lex_graph
            ExtractEntity-->>ParseText: Entity added to graph
            ParseText->>ExtractRelations: Extract relations between entities
            ExtractRelations-->>ParseText: Relations added to graph
            ParseText->>ConnectEntities: Connect co-occurring entities
            ConnectEntities-->>ParseText: Connections added to graph
            ParseText-->>ConstructKG: Return parsed doc
        end

        ConstructKG->>RunTextRank: Run TextRank on lex_graph
        RunTextRank-->>ConstructKG: Return ranked entities
        ConstructKG->>AbstractOverlay: Overlay semantic graph
        AbstractOverlay-->>ConstructKG: Overlay completed
    end

    ConstructKG->>GenPyvis: Generate Pyvis visualization
    GenPyvis-->>ConstructKG: Visualization saved
    ConstructKG-->>Main: Flow completed
```
## Run the code

1. Set up a local Python environment and install the Python dependencies
   - I used Python 3.11, but 3.10 should work as well
   ```bash
   pip install -r requirements.txt
   ```
2. Start the local Prefect server
   - follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the Prefect UI
   ```bash
   prefect server start
   ```
3. Run the `graphrag_demo.py` script
   ```bash
   python graphrag_demo.py
   ```
## Appendix: Code Overview and Purpose

- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.

---

### **Key Components and Flow**

#### **1. Model and Parameter Settings**

- **Core Configuration**: Establishes foundational settings such as chunk size, the embedding model (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships such as `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
#### **2. Data Validation**

- **Classes**:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
#### **3. Data Collection**

- **Functions**:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
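To illustrate the kind of cleanup `uni_scrubber` performs, here is a plausible stdlib-only sketch (not the repo's actual implementation; the `scrub` name and replacement table are my own):

```python
import unicodedata

def scrub(text: str) -> str:
    """Normalize curly quotes, dashes, and stray whitespace to plain text."""
    # NFKC folds compatibility characters (fullwidth forms, no-break spaces, etc.)
    text = unicodedata.normalize("NFKC", text)
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en/em dashes
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return " ".join(text.split())       # collapse runs of whitespace

clean = scrub("\u201cGraphRAG\u201d\u00a0\u2014 it\u2019s   neat")
```

Scrubbing before chunking keeps copy-pasted web punctuation from polluting embeddings and entity spans.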
#### **4. Lexical Graph Construction**

- **Initialization**:
  - `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
  - `parse_text`: Creates lexical graphs using TextRank algorithms.
  - `make_entity`: Extracts and integrates entities into the graph.
  - `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
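The co-occurrence idea behind `connect_entities` can be sketched in plain Python (the real code builds a NetworkX graph; the helper name and sample entity lists here are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(chunks: list[list[str]]) -> dict[tuple[str, str], int]:
    """Count how often each pair of entities appears in the same chunk."""
    weights: dict[tuple[str, str], int] = defaultdict(int)
    for entities in chunks:
        # sorted + set gives each unordered pair a canonical key, once per chunk
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return dict(weights)

edges = cooccurrence_edges([
    ["GLiNER", "spaCy", "GLiREL"],
    ["GLiNER", "spaCy"],
    ["LanceDB"],
])
```

Each weighted pair becomes an edge in the lexical graph, so entities that keep showing up together end up strongly connected.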
#### **5. Numerical Processing**

- **Functions**:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
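A stdlib sketch of what these two statistical helpers compute (illustrative only; the `_sketch` suffix marks these as my own stand-ins, not the repo's exact code):

```python
import math
from statistics import quantiles

def calc_quantile_bins_sketch(values: list[float], n: int = 4) -> list[float]:
    """Cut points that split values into n equal-population bins."""
    return quantiles(values, n=n)  # returns the n-1 interior cut points

def rms(values: list[float]) -> float:
    """Root mean square, useful for normalizing rank scores."""
    return math.sqrt(sum(v * v for v in values) / len(values))

cuts = calc_quantile_bins_sketch([1, 2, 3, 4, 5, 6, 7, 8], n=4)
norm = rms([3.0, 4.0])
```

Quantile binning turns raw TextRank scores into coarse importance tiers, which is more robust than comparing tiny floating-point differences directly.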
#### **6. TextRank Implementation**

- **Functions**:
  - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
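`run_textrank` relies on a PageRank-style ranking over the lexical graph (the repo delegates this to NetworkX). A compact power-iteration sketch in plain Python, with toy edges, shows the core idea:

```python
def pagerank_sketch(edges: list[tuple[str, str]], damping: float = 0.85,
                    iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over an undirected edge list."""
    neighbors: dict[str, list[str]] = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # each neighbor m splits its current rank evenly among its edges
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

# a star graph: the hub entity should out-rank the leaves
ranks = pagerank_sketch([("GLiNER", "spaCy"), ("GLiREL", "spaCy"),
                         ("LanceDB", "spaCy")])
```

Entities that sit at well-connected positions in the lexical graph accumulate rank, which is what makes them candidates for promotion into the semantic layer.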
#### **7. Semantic Overlay**

- **Functions**:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
    - Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
#### **8. Visualization**

- **Tool**: `pyvis`
- **Functions**:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.
#### **9. Orchestration**

- **Function**:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.

---

### **Notable Implementation Details**

- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.

---
## Appendix: Architectural Workflow

### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**

#### **1.1 Workflow Layers**

**Data Ingestion:**

- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.

**Lexical Graph Construction:**

- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.

**Entity and Relation Extraction:**

- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.

**Graph Construction and Visualization:**

- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.

**Semantic Overlay:**

- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
### **2. Visualized Workflow**

#### **2.1 Logical Data Flow**

```mermaid
graph TD
    A[Raw Data] -->|Scrape| B[Chunks]
    B -->|Lexical Parsing| C[Lexical Graph]
    C -->|NER + RE| D[Entities and Relations]
    D -->|Construct KG| E[Knowledge Graph]
    E -->|Overlay Ontologies| F[Enriched Graph]
    F -->|Visualize| G[Interactive View]
```
---

### **3. Glossary**

| **Participant** | **Description** | **Workflow Layer** |
|---|---|---|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **spaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
## Citations: giving credit where credit is due...

Inspired by the great work of the many individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight the areas that rang true.
- Paco Nathan https://senzing.com/consult-entity-resolution-paco/ | |
- Clair Sullivan https://clairsullivan.com/ | |
- Louis Guitton https://guitton.co/ | |
- Jeff Butcher https://github.com/jbutcher21 | |
- Michael Dockter https://github.com/docktermj | |
The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass. | |