# GraphRAG README
## Some fundamental concepts
### Data Ingestion
NOTE: mermaid.js diagrams below are based on some inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.
```mermaid
graph TD
%% Database shapes with consistent styling
SDS[(Structured<br/>Data Sources)]
UDS[(Unstructured<br/>Data Sources)]
LG[(lexical graph)]
SG[(semantic graph)]
VD[(vector database)]
%% Flow from structured data
SDS -->|PII features| ER[entity resolution]
SDS -->|data records| SG
SG -->|PII updates| ER
ER -->|semantic overlay| SG
%% Schema and ontology
ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
ONT --> SG
%% Flow from unstructured data
UDS --> K[text chunking<br/>function]
K --> NLP[NLP parse]
K --> EM[embedding model]
NLP --> E[NER, RE]
E --> LG
LG --> EL[entity linking]
EL <--> SG
%% Vector elements connections
EM --> VD
VD -.->|capture source chunk<br/>WITHIN references| SG
%% Thesaurus connection
ER -.->T[thesaurus]
T --> EL
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SDS,UDS dataSource;
class SG,VD storage;
class EM embedding;
class LG lexical;
class SG semantic;
class ONT,T reference;
```
### Augment LLM Inference
```mermaid
graph LR
%% Define database and special shapes
P[prompt]
SG[(semantic graph)]
VD[(vector database)]
LLM[LLM]
Z[response]
%% Main flow paths
P --> Q[generated query]
P --> EM[embedding model]
%% Upper path through graph elements
Q --> SG
SG --> W[semantic<br/>random walk]
T[thesaurus] --> W
W --> GA[graph analytics]
%% Lower path through vector elements
EM --> SS[vector<br/>similarity search]
SS --> VD
%% Node embeddings and chunk references
SG -.-|chunk references| VD
SS -->|node embeddings| SG
%% Final convergence
GA --> RI[ranked index]
VD --> RI
RI --> LLM
LLM --> Z
%% Styling classes
classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;
%% Apply styles by layer/type
class SG,VD storage;
class EM embedding;
class SG semantic;
class T reference;
```
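The inference-time diagram boils down to: embed the prompt, score stored chunks by vector similarity, then boost chunks whose entities sit in the semantic-graph neighborhood of the prompt, producing a ranked index for the LLM. A minimal stdlib sketch of that blending step (the function names, the `graph_boost` weight, and the dict-based stores are illustrative assumptions, not the repo's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_chunks(prompt_vec, chunk_vecs, chunk_entities, seed_entities, graph_boost=0.1):
    """Blend vector similarity with a bonus for chunks whose entities
    overlap the semantic-graph neighborhood of the prompt's entities."""
    ranked = []
    for chunk_id, vec in chunk_vecs.items():
        score = cosine(prompt_vec, vec)
        overlap = len(chunk_entities.get(chunk_id, set()) & seed_entities)
        ranked.append((chunk_id, score + graph_boost * overlap))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)
```

In the real pipeline the similarity search runs inside LanceDB and the neighborhood comes from a semantic random walk; this sketch only shows how the two signals converge into one ranked index.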
## Sequence Diagram - covering the current `strwythura` (structure) repo
- the diagram below is largely based on the `demo.py` functions
- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow...
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original python code](./demo.py)
- I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes, without the headaches of wading through long print statements. Some great insight often occurs when you can see how individual functions / components are interacting.
- this repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec) and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight.
```mermaid
sequenceDiagram
participant Main as Main Script
participant ConstructKG as construct_kg Flow
participant InitNLP as init_nlp Task
participant ScrapeHTML as scrape_html Task
participant MakeChunk as make_chunk Task
participant ParseText as parse_text Task
participant MakeEntity as make_entity Task
participant ExtractEntity as extract_entity Task
participant ExtractRelations as extract_relations Task
participant ConnectEntities as connect_entities Task
participant RunTextRank as run_textrank Task
participant AbstractOverlay as abstract_overlay Task
participant GenPyvis as gen_pyvis Task
Main->>ConstructKG: Start construct_kg flow
ConstructKG->>InitNLP: Initialize NLP pipeline
InitNLP-->>ConstructKG: Return NLP object
loop For each URL in url_list
ConstructKG->>ScrapeHTML: Scrape HTML content
ScrapeHTML->>MakeChunk: Create text chunks
MakeChunk-->>ScrapeHTML: Return chunk list
ScrapeHTML-->>ConstructKG: Return chunk list
loop For each chunk in chunk_list
ConstructKG->>ParseText: Parse text and build lex_graph
ParseText->>MakeEntity: Create entities from spans
MakeEntity-->>ParseText: Return entity
ParseText->>ExtractEntity: Extract and add entities to lex_graph
ExtractEntity-->>ParseText: Entity added to graph
ParseText->>ExtractRelations: Extract relations between entities
ExtractRelations-->>ParseText: Relations added to graph
ParseText->>ConnectEntities: Connect co-occurring entities
ConnectEntities-->>ParseText: Connections added to graph
ParseText-->>ConstructKG: Return parsed doc
end
ConstructKG->>RunTextRank: Run TextRank on lex_graph
RunTextRank-->>ConstructKG: Return ranked entities
ConstructKG->>AbstractOverlay: Overlay semantic graph
AbstractOverlay-->>ConstructKG: Overlay completed
end
ConstructKG->>GenPyvis: Generate Pyvis visualization
GenPyvis-->>ConstructKG: Visualization saved
ConstructKG-->>Main: Flow completed
```
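The control flow in the sequence diagram can be compressed into a short plain-Python skeleton. The stubs below are stand-ins so the flow runs end to end; in the repo the corresponding functions carry Prefect `@flow`/`@task` decorators and do the real work:

```python
# Stub stand-ins so the control flow runs end to end; the repo's versions
# are Prefect-decorated tasks with the same call order.
def init_nlp():            return "nlp"
def scrape_html(url):      return [f"{url}#chunk0", f"{url}#chunk1"]
def parse_text(nlp, g, c): g.setdefault("chunks", []).append(c)
def run_textrank(g):       g["ranked"] = sorted(g.get("chunks", []))
def abstract_overlay(g):   g["overlay"] = True
def gen_pyvis(g):          g["viz"] = "kg.html"

def construct_kg(url_list):
    """Mirrors the sequence diagram: init NLP once, then per URL
    scrape -> chunk -> parse -> rank -> overlay, visualize at the end."""
    nlp = init_nlp()
    lex_graph = {}
    for url in url_list:
        for chunk in scrape_html(url):
            parse_text(nlp, lex_graph, chunk)
        run_textrank(lex_graph)
        abstract_overlay(lex_graph)
    gen_pyvis(lex_graph)
    return lex_graph
```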
## Run the code
1. Set up a local Python environment and install the Python dependencies
- I used Python 3.11, but 3.10 should work as well
```bash
pip install -r requirements.txt
```
2. Start the local Prefect server
- follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`
```bash
prefect server start
```
3. Run the `graphrag_demo.py` script
```bash
python graphrag_demo.py
```
## Appendix: Code Overview and Purpose
- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.
---
### **Key Components and Flow**
#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
#### **2. Data Validation**
- **Classes**:
- `TextChunk`: Represents segmented text chunks with their embeddings.
- `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
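As a rough illustration, the two validation classes can be sketched with stdlib dataclasses; the field names below are assumptions for illustration, not the repo's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    """One segment of scraped text plus its embedding vector."""
    uid: int
    url: str
    text: str
    embedding: list = field(default_factory=list)

@dataclass
class Entity:
    """An extracted entity: its label, source chunks, and TextRank score."""
    key: str
    text: str
    label: str                                   # e.g. "Person", "Organization"
    chunk_ids: list = field(default_factory=list)
    rank: float = 0.0
```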
#### **3. Data Collection**
- **Functions**:
- `scrape_html`: Fetches and parses webpage content.
- `uni_scrubber`: Cleans Unicode and formatting issues.
- `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
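The chunking step can be sketched as a sliding window over the cleaned text. The `chunk_size` and `overlap` defaults below are illustrative assumptions (the repo's `make_chunk` also attaches embeddings and persists chunks to LanceDB):

```python
def make_chunks(text: str, chunk_size: int = 1024, overlap: int = 100) -> list:
    """Split cleaned text into fixed-size, slightly overlapping chunks
    so entity mentions straddling a boundary appear whole in one chunk."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
    return chunks
```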
#### **4. Lexical Graph Construction**
- **Initialization**:
- `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
- `parse_text`: Creates lexical graphs using TextRank algorithms.
- `make_entity`: Extracts and integrates entities into the graph.
- `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
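The co-occurrence linking in `connect_entities` amounts to weighting an edge between every pair of entities seen in the same context. The repo works on a NetworkX lexical graph; this dict-based version is just a sketch of the counting logic:

```python
from itertools import combinations

def connect_entities(edge_weights: dict, context_entities: list) -> dict:
    """Add (or strengthen) an edge between every pair of distinct
    entities that co-occur in the same sentence or chunk."""
    for a, b in combinations(sorted(set(context_entities)), 2):
        edge_weights[(a, b)] = edge_weights.get((a, b), 0) + 1
    return edge_weights
```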
#### **5. Numerical Processing**
- **Functions**:
- `calc_quantile_bins`: Creates quantile bins for numerical data.
- `root_mean_square`: Computes RMS for normalization.
- `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
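These three helpers are small numerical utilities; a stdlib sketch under the assumption that they behave roughly as their names suggest (the repo likely delegates to numpy/pandas for the same effect):

```python
def calc_quantile_bins(num_rows: int) -> list:
    """Evenly spaced bin boundaries in [0, 1], one per row."""
    if num_rows < 2:
        return [0.0]
    return [i / (num_rows - 1) for i in range(num_rows)]

def root_mean_square(values: list) -> float:
    """RMS of a list of values, used as a normalization constant."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

def stripe_column(values: list, bins: list) -> list:
    """Map each value to the count of bin boundaries it meets or exceeds,
    i.e. a coarse quantile 'stripe' for ranking graph components."""
    return [sum(1 for b in bins if v >= b) for v in values]
```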
#### **6. TextRank Implementation**
- **Functions**:
- `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
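The ranking idea can be shown with a bare power-iteration PageRank over an adjacency dict. This is a stdlib stand-in for the NetworkX-based ranking the repo uses, not its actual implementation:

```python
def pagerank(adjacency: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Power-iteration PageRank over {node: [out-neighbors]}.
    Well-connected entities accumulate rank; isolated ones do not."""
    nodes = list(adjacency)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, nbrs in adjacency.items():
            if nbrs:
                share = rank[n] / len(nbrs)
                for m in nbrs:
                    new[m] += damping * share
            else:  # dangling node: spread its rank uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank
```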
#### **7. Semantic Overlay**
- **Functions**:
- `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
- Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
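The overlay step can be sketched as promoting the top-ranked lexical entities into a semantic layer while keeping provenance links back to their source chunks. The function signature and dict shapes here are hypothetical; the repo's `abstract_overlay` also carries relation types:

```python
def abstract_overlay(ranked_entities: dict, entity_chunks: dict, top_k: int = 10) -> dict:
    """Promote the top_k ranked entities into a semantic layer, keeping
    WITHIN references back to their originating chunks for context."""
    top = sorted(ranked_entities, key=ranked_entities.get, reverse=True)[:top_k]
    return {
        name: {"rank": ranked_entities[name],
               "within": entity_chunks.get(name, [])}
        for name in top
    }
```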
#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
- `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
- Node sizing reflects entity importance.
- Physics-based layout supports intuitive exploration.
#### **9. Orchestration**
- **Function**:
- `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.
---
### **Notable Implementation Details**
- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.
---
## Appendix: Architectural Workflow
### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**
#### **1.1 Workflow Layers**
**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.
**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.
**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.
**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.
**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
### **2. Visualized Workflow**
#### **2.1 Logical Data Flow**
```mermaid
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
```
---
### **3. Glossary**
| **Participant** | **Description** | **Workflow Layer** |
|--------------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **SpaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |
## Citations: giving credit where credit is due...
Inspired by the great work of the many individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight the areas that rang true.
- Paco Nathan https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan https://clairsullivan.com/
- Louis Guitton https://guitton.co/
- Jeff Butcher https://github.com/jbutcher21
- Michael Dockter https://github.com/docktermj
The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass.