# GraphRAG README

## Some fundamental concepts

### Data Ingestion

NOTE: the mermaid.js diagrams below are inspired by content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.

```mermaid
graph TD
    %% Database shapes with consistent styling
    SDS[(Structured<br/>Data Sources)]
    UDS[(Unstructured<br/>Data Sources)]
    LG[(lexical graph)]
    SG[(semantic graph)]
    VD[(vector database)]

    %% Flow from structured data
    SDS -->|PII features| ER[entity resolution]
    SDS -->|data records| SG
    SG -->|PII updates| ER
    ER -->|semantic overlay| SG

    %% Schema and ontology
    ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
    ONT --> SG

    %% Flow from unstructured data
    UDS --> K[text chunking<br/>function]
    K --> NLP[NLP parse]
    K --> EM[embedding model]
    NLP --> E[NER, RE]
    E --> LG
    LG --> EL[entity linking]
    EL <--> SG

    %% Vector elements connections
    EM --> VD
    VD -.->|capture source chunk<br/>WITHIN references| SG

    %% Thesaurus connection
    ER -.->T[thesaurus]
    T --> EL

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```

### Augment LLM Inference

```mermaid
graph LR
    %% Define database and special shapes
    P[prompt]
    SG[(semantic graph)]
    VD[(vector database)]
    LLM[LLM]
    Z[response]
    
    %% Main flow paths
    P --> Q[generated query]
    P --> EM[embedding model]
    
    %% Upper path through graph elements
    Q --> SG
    SG --> W[semantic<br/>random walk]
    T[thesaurus] --> W
    W --> GA[graph analytics]
    
    %% Lower path through vector elements
    EM --> SS[vector<br/>similarity search]
    SS --> VD
    
    %% Node embeddings and chunk references
    SG -.-|chunk references| VD
    SS -->|node embeddings| SG
    
    %% Final convergence
    GA --> RI[ranked index]
    VD --> RI
    RI --> LLM
    LLM --> Z

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class VD storage;
    class EM embedding;
    class SG semantic;
    class T reference;
```

## Sequence Diagram - covering the current `strwythura` (structure) repo

- The diagram below is largely based on the functions in `demo.py`.
- I used [Prefect](https://www.prefect.io/) to dig in and reverse engineer the flow...
  - [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original python code](./demo.py)
  - I stuck to Prefect function decorators applied to the existing structure, but I'm looking forward to abstracting some of the concepts further and thinking agentically.
- Telemetry and instrumentation can demystify complex processes without the headache of wading through long print statements. Real insight often comes from seeing how individual functions and components interact.
  - This repo features a distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) highlight real-world use cases where effective knowledge graph construction provides deeper meaning and insight.
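The shape of the flow in the sequence diagram below can be sketched in plain Python. These are illustrative stubs only: in `graphrag_demo.py` the corresponding functions are wrapped with Prefect's `@flow` / `@task` decorators, and their real bodies do the scraping, parsing, and ranking.

```python
# Illustrative stubs mirroring the construct_kg flow; in graphrag_demo.py
# each of these is wrapped with Prefect's @flow / @task decorators.

def init_nlp():
    """Initialize the NLP pipeline (spaCy + GLiNER + GLiREL in the repo)."""
    return object()  # stand-in for the pipeline object

def scrape_html(url: str) -> list[str]:
    """Fetch a page and return its text chunks (stub)."""
    return [f"chunk of {url}"]

def parse_text(nlp, lex_graph: dict, chunk: str) -> str:
    """Parse a chunk, adding entities and relations to lex_graph (stub)."""
    lex_graph.setdefault("chunks", []).append(chunk)
    return chunk

def run_textrank(lex_graph: dict) -> list[str]:
    """Rank entities in the lexical graph (stub)."""
    return lex_graph.get("chunks", [])

def abstract_overlay(lex_graph: dict, sem_graph: dict, ranked: list) -> None:
    """Overlay ranked entities onto the semantic graph (stub)."""
    sem_graph.setdefault("entities", []).extend(ranked)

def construct_kg(url_list: list[str]) -> dict:
    """Orchestrate the pipeline: ingest, parse, rank, overlay."""
    nlp = init_nlp()
    lex_graph: dict = {}
    sem_graph: dict = {}
    for url in url_list:
        for chunk in scrape_html(url):
            parse_text(nlp, lex_graph, chunk)
        ranked = run_textrank(lex_graph)
        abstract_overlay(lex_graph, sem_graph, ranked)
    return sem_graph
```

The point of the stub version is the control flow: one NLP initialization, a per-URL loop, a per-chunk inner loop, then ranking and overlay per URL, matching the diagram.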


```mermaid
sequenceDiagram
    participant Main as Main Script
    participant ConstructKG as construct_kg Flow
    participant InitNLP as init_nlp Task
    participant ScrapeHTML as scrape_html Task
    participant MakeChunk as make_chunk Task
    participant ParseText as parse_text Task
    participant MakeEntity as make_entity Task
    participant ExtractEntity as extract_entity Task
    participant ExtractRelations as extract_relations Task
    participant ConnectEntities as connect_entities Task
    participant RunTextRank as run_textrank Task
    participant AbstractOverlay as abstract_overlay Task
    participant GenPyvis as gen_pyvis Task

    Main->>ConstructKG: Start construct_kg flow
    ConstructKG->>InitNLP: Initialize NLP pipeline
    InitNLP-->>ConstructKG: Return NLP object

    loop For each URL in url_list
        ConstructKG->>ScrapeHTML: Scrape HTML content
        ScrapeHTML->>MakeChunk: Create text chunks
        MakeChunk-->>ScrapeHTML: Return chunk list
        ScrapeHTML-->>ConstructKG: Return chunk list

        loop For each chunk in chunk_list
            ConstructKG->>ParseText: Parse text and build lex_graph
            ParseText->>MakeEntity: Create entities from spans
            MakeEntity-->>ParseText: Return entity
            ParseText->>ExtractEntity: Extract and add entities to lex_graph
            ExtractEntity-->>ParseText: Entity added to graph
            ParseText->>ExtractRelations: Extract relations between entities
            ExtractRelations-->>ParseText: Relations added to graph
            ParseText->>ConnectEntities: Connect co-occurring entities
            ConnectEntities-->>ParseText: Connections added to graph
            ParseText-->>ConstructKG: Return parsed doc
        end

        ConstructKG->>RunTextRank: Run TextRank on lex_graph
        RunTextRank-->>ConstructKG: Return ranked entities
        ConstructKG->>AbstractOverlay: Overlay semantic graph
        AbstractOverlay-->>ConstructKG: Overlay completed
    end

    ConstructKG->>GenPyvis: Generate Pyvis visualization
    GenPyvis-->>ConstructKG: Visualization saved
    ConstructKG-->>Main: Flow completed
```

## Run the code

1. Set up a local Python environment and install the Python dependencies

   - I used Python 3.11, but 3.10 should work as well

    ```bash
    pip install -r requirements.txt
    ```

2. Start the local Prefect server

   - follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`

    ```bash
    prefect server start
    ```

3. Run the `graphrag_demo.py` script

    ```bash
    python graphrag_demo.py
    ```

## Appendix: Code Overview and Purpose

- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.

---

### **Key Components and Flow**

#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
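Collected in one place, that configuration might look like the following. The constant names and the LanceDB path are assumptions; the model name, labels, relation types, and chunk size come from this README.

```python
# Hypothetical constant names; the actual settings live in graphrag_demo.py.
CHUNK_SIZE: int = 1024                       # max size of a text chunk
EMBED_MODEL: str = "BAAI/bge-small-en-v1.5"  # embedding model
LANCEDB_URI: str = "data/lancedb"            # vector store location (assumed path)
NER_LABELS: list[str] = ["Person", "Organization", "Publication", "Technology"]
RELATION_TYPES: list[str] = ["works_at", "developed_by", "authored_by"]
SCRAPE_HEADERS: dict[str, str] = {"User-Agent": "Mozilla/5.0 (demo scraper)"}
```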

#### **2. Data Validation**
- **Classes**:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
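A minimal sketch of those two shapes using stdlib dataclasses. The repo's actual classes may carry more fields or use a validation library; the field names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    """A segmented span of source text plus its embedding vector."""
    uid: int
    url: str
    text: str
    embedding: list[float] = field(default_factory=list)

@dataclass
class Entity:
    """An extracted entity with its label and graph bookkeeping."""
    key: str        # normalized span key
    text: str       # surface form
    label: str      # NER label, e.g. "Person"
    chunk_uid: int  # chunk the entity came from
    rank: float = 0.0  # TextRank score, filled in later
    count: int = 1     # co-occurrence count
```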

#### **3. Data Collection**
- **Functions**:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
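The scrub-and-chunk steps can be sketched with the standard library alone. The real `uni_scrubber` and `make_chunk` may normalize more cases and count tokens rather than characters; the character budget below is a stand-in.

```python
import unicodedata

def uni_scrubber(text: str) -> str:
    """Normalize Unicode quirks and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # curly double quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")   # curly single quotes
    return " ".join(text.split())

def make_chunk(text: str, budget: int = 1024) -> list[str]:
    """Greedily pack words into chunks whose length stays under a budget."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > budget and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```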

#### **4. Lexical Graph Construction**
- **Initialization**:
  - `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
  - `parse_text`: Creates lexical graphs using TextRank algorithms.
  - `make_entity`: Extracts and integrates entities into the graph.
  - `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.

#### **5. Numerical Processing**
- **Functions**:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
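Two of those helpers are small enough to sketch with the standard library (the repo's versions likely operate on numpy/pandas columns, which is why `stripe_column` is omitted here):

```python
import math
from statistics import quantiles

def calc_quantile_bins(values: list[float], n: int = 4) -> list[float]:
    """Cut points dividing the values into n quantile bins."""
    return quantiles(values, n=n)

def root_mean_square(values: list[float]) -> float:
    """RMS of a list of scores, useful as a normalization constant."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```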

#### **6. TextRank Implementation**
- **Functions**:
  - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
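The idea behind `run_textrank` can be shown with a tiny power-iteration PageRank over an undirected entity co-occurrence graph. This is a from-scratch sketch; the repo likely delegates to a library implementation.

```python
def textrank(nodes: list[str], edges: list[tuple[str, str]],
             damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over an undirected co-occurrence graph."""
    # adjacency: each edge contributes a neighbour in both directions
    nbrs: dict[str, set] = {n: set() for n in nodes}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    # start from a uniform distribution, then iterate to a fixed point
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1.0 - damping) / len(nodes)
               + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
            for n in nodes
        }
    return rank
```

An entity that co-occurs with many well-connected entities accumulates rank, which is what makes it a candidate "key entity" for the knowledge graph.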

#### **7. Semantic Overlay**
- **Functions**:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
  - Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
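One way to picture `abstract_overlay`: promote highly ranked entities into a semantic layer while keeping back-references to the chunks they came from. A pure-Python sketch; the threshold and the dict-based structures here are assumptions, not the repo's actual types.

```python
def abstract_overlay(lex_entities: list[dict], min_rank: float = 0.1) -> dict:
    """Lift salient entities into a semantic layer with chunk back-references."""
    overlay: dict = {"nodes": {}, "chunk_refs": {}}
    for ent in lex_entities:
        if ent["rank"] < min_rank:
            continue  # keep only salient entities in the semantic layer
        overlay["nodes"][ent["key"]] = {"label": ent["label"], "rank": ent["rank"]}
        # preserve provenance: which chunk(s) this entity was seen in
        overlay["chunk_refs"].setdefault(ent["key"], set()).add(ent["chunk_uid"])
    return overlay
```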

#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.

#### **9. Orchestration**
- **Function**:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.

---

### **Notable Implementation Details**

- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.

---

## Appendix: Architectural Workflow

### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**

#### **1.1 Workflow Layers**

**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.

**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.

**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.

**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.

**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.
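The lexical-graph and graph-construction layers above boil down to: build a graph of entities and relations, then rank it. With NetworkX (which the repo uses for graph construction) that looks roughly like the following; the example entities and edges are invented for illustration, since in the pipeline they come from GLiNER/GLiREL output.

```python
import networkx as nx

# Invented example entities; in the pipeline these come from NER/RE output.
g = nx.Graph()
g.add_edge("Paco Nathan", "Senzing", rel="works_at")
g.add_edge("GraphRAG", "Senzing", rel="developed_by")
g.add_edge("GraphRAG", "Paco Nathan", rel="authored_by")
g.add_edge("GLiNER", "GraphRAG", rel="developed_by")

# TextRank-style scoring via PageRank; node size in the PyVis view
# can then be scaled by these scores.
scores = nx.pagerank(g)
```

Here `"GraphRAG"` ends up with the highest score because it is the best-connected node, which is exactly the signal used to size nodes in the visualization.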


### **2. Visualized Workflow**

#### **2.1 Logical Data Flow**

```mermaid
graph TD
A[Raw Data] -->|Scrape| B[Chunks]
B -->|Lexical Parsing| C[Lexical Graph]
C -->|NER + RE| D[Entities and Relations]
D -->|Construct KG| E[Knowledge Graph]
E -->|Overlay Ontologies| F[Enriched Graph]
F -->|Visualize| G[Interactive View]
```

---

### **3. Glossary**

| **Participant**                | **Description**                                                                                   | **Workflow Layer**                 |
|--------------------------------|---------------------------------------------------------------------------------------------------|-------------------------------------|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources.                                                  | Data Ingestion                     |
| **Text Chunker**               | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding.        | Data Ingestion                     |
| **SpaCy Pipeline**             | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction.             | Entity and Relation Extraction     |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER**                     | Identifies domain-specific entities and returns labeled outputs.                                  | Entity and Relation Extraction     |
| **GLiREL**                     | Extracts relationships between identified entities, adding connectivity to the graph.             | Entity and Relation Extraction     |
| **Vector Database (LanceDB)**  | Stores chunk embeddings for efficient querying in downstream tasks.                              | Data Ingestion         |
| **Word2Vec (Gensim)**          | Generates entity embeddings based on graph co-occurrence for additional analysis.                 | Semantic Overlay                    |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank.                       | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)**   | Provides an interactive visualization of the knowledge graph for interpretability.                | Graph Construction and Visualization |

## Citations: giving credit where credit is due...

Inspired by the great work of the individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight areas that rang true.

- Paco Nathan https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan https://clairsullivan.com/
- Louis Guitton https://guitton.co/
- Jeff Butcher https://github.com/jbutcher21
- Michael Dockter https://github.com/docktermj

The code to use GLiNER and GLiREL started as a fork of one of four repos that make up the masterclass.