GraphRAG README
Some fundamental concepts
Data Ingestion
NOTE: mermaid.js diagrams below are based on some inspiring content from the Connected Data London 2024: Entity Resolved Knowledge Graphs masterclass.
graph TD
%% Database shapes with consistent styling
SDS[(Structured<br/>Data Sources)]
UDS[(Unstructured<br/>Data Sources)]
LG[(lexical graph)]
SG[(semantic graph)]
VD[(vector database)]
%% Flow from structured data
SDS -->|PII features| ER[entity resolution]
SDS -->|data records| SG
SG -->|PII updates| ER
ER -->|semantic overlay| SG
%% Schema and ontology
ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
ONT --> SG
%% Flow from unstructured data
UDS --> K[text chunking<br/>function]
K --> NLP[NLP parse]
K --> EM[embedding model]
NLP --> E[NER, RE]
E --> LG
LG --> EL[entity linking]
EL <--> SG
%% Vector elements connections
EM --> VD
VD -.->|capture source chunk<br/>WITHIN references| SG
%% Thesaurus connection
ER -.->T[thesaurus]
T --> EL
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SDS,UDS dataSource;
class SG,VD storage;
class EM embedding;
class LG lexical;
class SG semantic;
class ONT,T reference;
Augment LLM Inference
graph LR
%% Define database and special shapes
P[prompt]
SG[(semantic graph)]
VD[(vector database)]
LLM[LLM]
Z[response]
%% Main flow paths
P --> Q[generated query]
P --> EM[embedding model]
%% Upper path through graph elements
Q --> SG
SG --> W[semantic<br/>random walk]
T[thesaurus] --> W
W --> GA[graph analytics]
%% Lower path through vector elements
EM --> SS[vector<br/>similarity search]
SS --> VD
%% Node embeddings and chunk references
SG -.-|chunk references| VD
SS -->|node embeddings| SG
%% Final convergence
GA --> RI[ranked index]
VD --> RI
RI --> LLM
LLM --> Z
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SG,VD storage;
class EM embedding;
class SG semantic;
class T reference;
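The lower path of the diagram above (embedding model, then vector similarity search against the vector database) can be sketched in plain Python. This is a toy stand-in, not the repo's code: the repo uses BGE embeddings and LanceDB, while `cosine_similarity`, `top_k_chunks`, and the three-dimensional vectors here are hypothetical names and data chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy 3-d "embeddings"; a real embedding model returns much longer vectors.
CHUNK_VECS = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
}

hits = top_k_chunks([1.0, 0.0, 0.0], CHUNK_VECS, k=2)
```

In the diagram, hits like these feed the ranked index alongside the results of the graph path before anything reaches the LLM.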
Sequence Diagram - covering the current strwythura (structure) repo
- The diagram below is largely based on the `demo.py` functions.
- I used Prefect to dig in and reverse architect the flow...
- `graphrag_demo.py` is my simple update to Paco's original Python code.
- I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes, without the headaches of wading through long print statements. Some great insight often occurs when you can see how individual functions / components are interacting.
- This repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, Paco's YouTube video and associated diagrams help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight.
sequenceDiagram
participant Main as Main Script
participant ConstructKG as construct_kg Flow
participant InitNLP as init_nlp Task
participant ScrapeHTML as scrape_html Task
participant MakeChunk as make_chunk Task
participant ParseText as parse_text Task
participant MakeEntity as make_entity Task
participant ExtractEntity as extract_entity Task
participant ExtractRelations as extract_relations Task
participant ConnectEntities as connect_entities Task
participant RunTextRank as run_textrank Task
participant AbstractOverlay as abstract_overlay Task
participant GenPyvis as gen_pyvis Task
Main->>ConstructKG: Start construct_kg flow
ConstructKG->>InitNLP: Initialize NLP pipeline
InitNLP-->>ConstructKG: Return NLP object
loop For each URL in url_list
ConstructKG->>ScrapeHTML: Scrape HTML content
ScrapeHTML->>MakeChunk: Create text chunks
MakeChunk-->>ScrapeHTML: Return chunk list
ScrapeHTML-->>ConstructKG: Return chunk list
loop For each chunk in chunk_list
ConstructKG->>ParseText: Parse text and build lex_graph
ParseText->>MakeEntity: Create entities from spans
MakeEntity-->>ParseText: Return entity
ParseText->>ExtractEntity: Extract and add entities to lex_graph
ExtractEntity-->>ParseText: Entity added to graph
ParseText->>ExtractRelations: Extract relations between entities
ExtractRelations-->>ParseText: Relations added to graph
ParseText->>ConnectEntities: Connect co-occurring entities
ConnectEntities-->>ParseText: Connections added to graph
ParseText-->>ConstructKG: Return parsed doc
end
ConstructKG->>RunTextRank: Run TextRank on lex_graph
RunTextRank-->>ConstructKG: Return ranked entities
ConstructKG->>AbstractOverlay: Overlay semantic graph
AbstractOverlay-->>ConstructKG: Overlay completed
end
ConstructKG->>GenPyvis: Generate Pyvis visualization
GenPyvis-->>ConstructKG: Visualization saved
ConstructKG-->>Main: Flow completed
Run the code
1. Set up a local Python environment and install the Python dependencies
   - I used Python 3.11, but 3.10 should work as well
   - `pip install -r requirements.txt`
2. Start the local Prefect server
   - Follow the self-hosted instructions to launch the Prefect UI
   - `prefect server start`
3. Run the `graphrag_demo.py` script
   - `python graphrag_demo.py`
Appendix: Code Overview and Purpose
- The code forms part of a talk for GraphGeeks.org about constructing knowledge graphs from unstructured data sources, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.
Key Components and Flow
1. Model and Parameter Settings
- Core Configuration: Establishes the foundational settings like chunk size, embedding model (`BAAI/bge-small-en-v1.5`), and database URIs.
- NER Labels: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- Relation Types: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- Scraping Parameters: Sets user-agent headers for web requests.
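As a rough picture of what such a settings block can look like, here is a minimal sketch. The model name, labels, and relation types come from the list above. The class name `PipelineConfig`, the field names, the `1024` chunk size (borrowed from the glossary's "e.g., 1024 tokens"), the database path, and the user-agent string are all assumptions for illustration, not values read from `graphrag_demo.py`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the settings described above."""

    chunk_size: int = 1024  # assumed; the glossary only says "e.g., 1024 tokens"
    embed_model: str = "BAAI/bge-small-en-v1.5"
    lancedb_uri: str = "data/lancedb"  # hypothetical local path
    ner_labels: tuple = ("Person", "Organization", "Publication", "Technology")
    relation_types: tuple = ("works_at", "developed_by", "authored_by")
    # Hypothetical header; any polite, identifiable user agent would do.
    scrape_headers: dict = field(
        default_factory=lambda: {"User-Agent": "graphrag-demo/0.1"}
    )


config = PipelineConfig()
```

Keeping the settings in one frozen object makes it harder for a task to change a label list or chunk size halfway through a run.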
2. Data Validation
- Classes:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- Purpose: Ensures data is clean and well-structured for downstream processing.
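To make the validation idea concrete, here is a minimal sketch using only standard-library dataclasses. The class names match the list above, but every field name and every check is an assumption; the actual script may use a different validation library and different attributes.

```python
from dataclasses import dataclass, field

# Labels taken from the NER settings described earlier in this README.
ALLOWED_LABELS = {"Person", "Organization", "Publication", "Technology"}


@dataclass
class TextChunk:
    """A segment of scraped text plus its embedding (fields are hypothetical)."""

    uid: int
    url: str
    text: str
    vector: list = field(default_factory=list)

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("chunk text must not be empty")


@dataclass
class Entity:
    """An extracted entity and where it was seen (fields are hypothetical)."""

    text: str
    label: str
    chunk_ids: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (relation, other entity) pairs

    def __post_init__(self):
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown entity label: {self.label!r}")


chunk = TextChunk(uid=0, url="https://example.com", text="Graphs are fun.")
entity = Entity(text="Graphs", label="Technology", chunk_ids={chunk.uid})
```

Rejecting bad records at construction time keeps empty chunks and unexpected labels out of the graph-building steps.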
3. Data Collection
- Functions:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- Role: Prepares raw, unstructured data for structured analysis.
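The scrubbing and chunking steps can be sketched with the standard library alone. This is an illustration under assumptions, not the repo's `uni_scrubber` or `make_chunk`: the names `scrub_text` and `make_chunks` are hypothetical, and the size limit is counted in characters with a regex sentence splitter, whereas the pipeline described here works on parsed text and token counts.

```python
import re
import unicodedata


def scrub_text(text):
    """Normalize Unicode, straighten curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def make_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if not sent:
            continue
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


clean = scrub_text("It\u2019s\u00a0 a   test")
parts = make_chunks("One. Two. Three.", max_chars=9)
```

Splitting on sentence boundaries rather than fixed offsets keeps each chunk readable on its own, which matters later when a chunk is returned as context.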
4. Lexical Graph Construction
- Initialization:
  - `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- Graph Parsing:
  - `parse_text`: Creates lexical graphs using TextRank algorithms.
  - `make_entity`: Extracts and integrates entities into the graph.
  - `connect_entities`: Links entities co-occurring in the same context.
- Purpose: Converts text into a structured, connected graph of entities and relationships.
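The co-occurrence step is the easiest one to show in isolation. The sketch below assumes that "same context" means "same sentence" and that entities have already been extracted as plain strings; the function name `connect_cooccurring` and the sample entities are hypothetical, and the repo's `connect_entities` works on a graph object rather than a counter.

```python
from collections import Counter
from itertools import combinations


def connect_cooccurring(sentences_entities):
    """Count how often each unordered entity pair shares a sentence."""
    edges = Counter()
    for entities in sentences_entities:
        unique = sorted(set(entities))  # sort so (a, b) and (b, a) are one key
        for left, right in combinations(unique, 2):
            edges[(left, right)] += 1
    return edges


# Each inner list holds the entities found in one sentence (made-up data).
SENTENCES = [
    ["Alice", "Acme"],
    ["Acme", "Alice", "GraphLib"],
]

edges = connect_cooccurring(SENTENCES)
```

The counts can then be used as edge weights, so pairs that appear together often pull more strongly on the ranking step.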
5. Numerical Processing
- Functions:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- Role: Provides statistical operations to refine and rank graph components.
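Here is one way these three helpers could look using only the standard library. Only the function names come from the list above; the signatures and bodies are my guesses at the idea, not the code in `graphrag_demo.py`.

```python
import math
import statistics
from bisect import bisect_left


def root_mean_square(values):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def calc_quantile_bins(values, num_bins=4):
    """Return the num_bins - 1 cut points that split values into quantile bins."""
    return statistics.quantiles(values, n=num_bins, method="inclusive")


def stripe_column(values, cut_points):
    """Map each value to the index of the quantile bin it falls into."""
    return [bisect_left(cut_points, v) for v in values]


scores = [1, 2, 3, 4, 5, 6, 7, 8]
cuts = calc_quantile_bins(scores, num_bins=4)
bins = stripe_column(scores, cuts)
```

Binning turns raw scores into a small number of ordered levels, which is handy when ranks from different sources need to be compared on the same scale.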
6. TextRank Implementation
- Functions:
  - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- Purpose: Identifies and prioritizes key entities for knowledge graph construction.
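TextRank boils down to running PageRank over the lexical graph. The sketch below is a plain power-iteration PageRank on a small undirected graph stored as a dict of neighbor sets. It is an illustration of the idea rather than `run_textrank` itself: the function name, the default parameters, and the toy graph are assumptions, and it ignores edge weights.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an undirected adjacency dict.

    Assumes every node has at least one neighbor; isolated nodes would
    leak rank mass and the scores would no longer sum to 1.
    """
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its current rank equally among its own neighbors.
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Star-shaped toy graph: "hub" co-occurs with every other entity.
GRAPH = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}

ranks = pagerank(GRAPH)
top_entity = max(ranks, key=ranks.get)
```

The well-connected node ends up with the highest score, which is the behavior the pipeline relies on to surface key entities.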
7. Semantic Overlay
- Functions:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
    - Connects entities to their originating text chunks for context preservation.
- Role: Enhances the graph with higher-order relationships and semantic depth.
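A very reduced sketch of the overlay idea: keep the entities that ranked well and record which chunks each one came from. The threshold approach, the name `overlay_sketch`, and the sample data are assumptions made for illustration; the repo's `abstract_overlay` does more than this.

```python
def overlay_sketch(entity_chunks, ranks, min_rank=0.1):
    """Promote well-ranked entities to a semantic layer, keeping chunk links."""
    overlay = {}
    for entity, chunk_ids in entity_chunks.items():
        if ranks.get(entity, 0.0) >= min_rank:
            overlay[entity] = sorted(chunk_ids)  # stable, readable chunk references
    return overlay


# Made-up inputs: chunk ids per entity, and a rank score per entity.
ENTITY_CHUNKS = {"alice": {2, 0}, "acme": {1}, "the": {0, 1, 2}}
RANKS = {"alice": 0.4, "acme": 0.3, "the": 0.01}

overlay = overlay_sketch(ENTITY_CHUNKS, RANKS, min_rank=0.1)
```

Keeping the chunk ids on each promoted entity is what lets a later query walk from a graph node back to the source text.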
8. Visualization
- Tool: `pyvis`
- Functions:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- Features:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.
9. Orchestration
- Function:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- Purpose: Ensures the seamless integration of all layers and components.
Notable Implementation Details
- Multi-Layer Graph Representation: Combines lexical and semantic graphs for layered analysis.
- Vector Embedding Integration: Enhances entity representation with embeddings.
- Error Handling and Debugging: Includes robust logging and debugging features.
- Scalability: Designed for handling diverse and large datasets with dynamic relationships.
Appendix: Architectural Workflow
1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction
1.1 Workflow Layers
Data Ingestion:
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.
Lexical Graph Construction:
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.
Entity and Relation Extraction:
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.
Graph Construction and Visualization:
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.
Semantic Overlay:
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
2. Visualized Workflow
2.1 Logical Data Flow
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
3. Glossary
Participant | Description | Workflow Layer |
---|---|---|
HTML Scraper (BeautifulSoup) | Fetches unstructured text data from web sources. | Data Ingestion |
Text Chunker | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
SpaCy Pipeline | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
Embedding Model (bge-small-en-v1.5) | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
GLiNER | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
GLiREL | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
Vector Database (LanceDB) | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
Word2Vec (Gensim) | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
Graph Constructor (NetworkX) | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
Graph Visualizer (PyVis) | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
Citations: giving credit where credit is due...
Inspired by the great work of the people who created the Connected Data London 2024: Entity Resolved Knowledge Graphs masterclass, I created this document to highlight the areas that rang true.
- Paco Nathan https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan https://clairsullivan.com/
- Louis Guitton https://guitton.co/
- Jeff Butcher https://github.com/jbutcher21
- Michael Dockter https://github.com/docktermj
The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass.