In this presentation, delivered by Andreas Kollegger (ABK) at QCon London 2024, the focus was on Connecting the Dots for Information Discovery. The classic RAG application extends an LLM with private information and can answer questions whose answers are contained in a single chunk of text. What if the answer requires connecting the dots across multiple chunks that may not be directly similar to the question? That is information discovery with GraphRAG.
You'll learn how to:
- reconstruct chunks into the original context
- meaningfully connect disparate chunks
- expand unstructured text data with structured data
- combine all this into a RAG workflow
3. Generative AI
● Learns random sentences from random people
● Talks like a person but doesn't really understand what it's saying
● Occasionally speaks absolute nonsense
● Sensitive to question phrasing
● Answers reflect the person asking
● Can't explain or verify answers
● Limited to public "knowledge"
Neo4j Inc. All rights reserved 2024
7. Retrieval Augmented Generation (RAG)
RAG is a software design pattern for integrating GenAI Apps with custom data sources, like a database.
8. A Generative AI application uses an LLM to provide responses to user prompts (for example, ChatGPT).
[Diagram: the User sends a Prompt to the GenAI Application, which passes the User Prompt to the LLM and returns the LLM's Response to the user as a Complete Response.]
9. RAG augments the LLM by intercepting a user's prompt, then making a query to a database, then using the query results as context for the user's prompt, creating a new prompt that is passed to the LLM for a complete, curated response.
[Diagram: the GenAI Application (1) sends the User Prompt to the Database and receives Context, then (2) sends the User Prompt + Context to the LLM and returns the Response to the user as a Complete Response.]
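The two-step flow described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function names, the prompt wording, and the placeholder database and LLM callables are all assumptions made for the example.

```python
from typing import Callable, List


def build_rag_prompt(user_prompt: str, context: List[str]) -> str:
    """Combine retrieved context with the user's prompt into a new prompt."""
    context_block = "\n".join(f"- {item}" for item in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {user_prompt}"
    )


def rag_answer(
    user_prompt: str,
    query_database: Callable[[str], List[str]],
    call_llm: Callable[[str], str],
) -> str:
    """Intercept the prompt, fetch context, then ask the LLM."""
    context = query_database(user_prompt)                # step 1: database query
    new_prompt = build_rag_prompt(user_prompt, context)  # prompt + context
    return call_llm(new_prompt)                          # step 2: LLM call


# Hypothetical stand-ins so the sketch runs without a database or an LLM.
def placeholder_database(user_prompt: str) -> List[str]:
    return ["<text of a relevant chunk>", "<a related fact>"]


def placeholder_llm(prompt: str) -> str:
    return f"<LLM response to a {len(prompt)}-character prompt>"


answer = rag_answer(
    "What is Apple's primary business?", placeholder_database, placeholder_llm
)
```

Passing the database and the LLM in as callables keeps the pattern visible: the application owns the orchestration, and either dependency can be swapped out.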
10. This sets up a knowledge stack…
● the user knows something about the question they're asking
● the application knows something about the user
● the database knows about particular information and data
● the LLM knows about whatever it found on the internet
[Diagram: Knowledge Stack, with layers User Knowledge, App Knowledge, Database Knowledge, LLM Knowledge]
11. This sets up a knowledge stack…
Knowledge you control, in the app and the database.
12. Three Sources of Data for RAG
Each with different access patterns, supporting different kinds of questions.
15. Pure Text
Unstructured data in PDFs, plain text files, or images.
Information search: “What is Apple's primary business?”
Answer with: Implicit knowledge derived from text.
18. Pure Data
Structured data in a database.
Information query: “How many iPhones did Apple sell this quarter?”
Answer with: Explicit facts from a database query.
21. Mixed Text + Data
Structured data together with long-form text.
Information discovery: “Which investors will be impacted by a chip shortage?”
Answer with: Combined search and data query.
24. Pure Text, Mixed Text + Data, Pure Data
A Knowledge Graph:
Information architecture for data, organized using graph structures, which places data within context.
Graph RAG:
Supports multiple modes of information retrieval, including information search, information query, and information discovery.
25. Pure Text, Mixed Text + Data, Pure Data
● Pure Text: Vector Search. Find relevant documents plus context for information search.
● Mixed Text + Data: Search + Pattern Matching. Expand context and rank the relevance for information discovery.
● Pure Data: Graph Queries. Directly query the knowledge graph for information query.
27. SEC EDGAR Financial Data
The EDGAR database provides free public access to company information, allowing research about public company financial information and operations through the filings companies submit to the SEC.
There are two forms that we'll look at today:
1. Form 10K: filings from publicly traded companies
2. Form 13: filings from institutional investment management firms
28. Data Modeling Strategy
Start with a Minimum Viable Graph (MVG)
Create, Enhance, Connect, then repeat to grow the graph
1. Create: identify interesting information, create records
2. Enhance: supercharge the data by enhancing some dimension
3. Connect: connect information to expand context and reveal knowledge
29. Create: Form 10K text chunks
1. Source: Form 10K
2. Split Text
3. Create Nodes
[Diagram: the text of a Form 10K (shown as lorem ipsum placeholder text) is split into pieces, and each piece becomes a Chunk node.]
30. Enhance: text with an embedding
1. Source: Chunks
4. Add Embedding
[Diagram: each Chunk's text gets a vector embedding, shown with illustrative values such as [0.6,0.2,0.1,0.7], and the embeddings are stored in a Vector Index.]
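To make the role of the embeddings concrete, here is a small sketch of similarity search over chunk embeddings. It is a brute-force cosine comparison written for illustration only: a real vector index would typically use an approximate nearest-neighbor structure, the chunk ids are made up, and the vectors simply reuse the illustrative values shown on the slide.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_embedding: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the best k."""
    scored = [
        (chunk_id, cosine_similarity(query_embedding, embedding))
        for chunk_id, embedding in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk ids paired with the slide's illustrative vectors.
toy_index = {
    "chunk-0": [0.6, 0.2, 0.1, 0.7],
    "chunk-1": [0.5, 0.2, 0.1, 0.7],
    "chunk-2": [0.4, 0.2, 0.1, 0.7],
    "chunk-3": [0.3, 0.2, 0.1, 0.5],
    "chunk-4": [0.2, 0.2, 0.1, 0.7],
}

results = top_k([0.6, 0.2, 0.1, 0.7], toy_index, k=2)
```

In practice the query embedding would come from the same embedding model used for the chunks, so that question and text live in the same vector space.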
31. Connect: connect chunks into a list
1. Connect Chunks
[Diagram: consecutive Chunk nodes, each with its embedding, are connected by NEXT relationships to form a linked list.]
32. Create, Enhance, Connect Form 10K
1. Source: Form 10K
2. Split Text
3. Create Nodes
4. Add Embedding
5. Connect
Extract, Enhance, Expand
[Diagram: Form 10K text is split into Chunk nodes, each chunk gets an embedding, and consecutive chunks are connected by NEXT.]
33. Minimum Viable Graph
Linked List of Text
(Chunk)-[NEXT]->(Chunk)
Chunk properties: formId: string, chunkId: string, text: string, textEmbedding: float[] (vector index)
Benefits:
● vector similarity search to find relevant text
● expand context window with previous/next chunks
● enable paging through text
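The linked list of chunks and the "expand context window" benefit can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not the graph database code from the talk: the word-based splitter, the chunk id format, and the `prev_id` back-pointer (a graph database can traverse NEXT in either direction) are all choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    form_id: str
    chunk_id: str
    text: str
    next_id: Optional[str] = None  # models the NEXT relationship
    prev_id: Optional[str] = None  # convenience back-pointer (assumption)


def split_text(text: str, chunk_size: int) -> List[str]:
    """Split text into pieces of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def build_linked_chunks(form_id: str, text: str, chunk_size: int = 4) -> Dict[str, Chunk]:
    """Create one Chunk per piece and link consecutive chunks."""
    chunks: Dict[str, Chunk] = {}
    previous: Optional[Chunk] = None
    for seq, piece in enumerate(split_text(text, chunk_size)):
        chunk = Chunk(form_id, f"{form_id}-chunk{seq:04d}", piece)
        if previous is not None:
            previous.next_id = chunk.chunk_id
            chunk.prev_id = previous.chunk_id
        chunks[chunk.chunk_id] = chunk
        previous = chunk
    return chunks


def context_window(chunks: Dict[str, Chunk], chunk_id: str, before: int = 1, after: int = 1) -> str:
    """Return the chunk's text together with its neighbours' text."""
    center = chunks[chunk_id]
    earlier: List[str] = []
    cursor = center
    for _ in range(before):
        if cursor.prev_id is None:
            break
        cursor = chunks[cursor.prev_id]
        earlier.insert(0, cursor.text)
    later: List[str] = []
    cursor = center
    for _ in range(after):
        if cursor.next_id is None:
            break
        cursor = chunks[cursor.next_id]
        later.append(cursor.text)
    return " ".join(earlier + [center.text] + later)


text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua."
)
chunks = build_linked_chunks("form-0", text, chunk_size=4)
window = context_window(chunks, "form-0-chunk0001")
```

A vector search would return one matching chunk; following the links in both directions recovers the surrounding text, which is also what makes paging through a form possible.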
34. Improve Context
Hierarchical Summary
(Chunk)-[PART_OF]->(Form), (Form)-[SECTION]->(Chunk)
Form properties: cusip6: string, formId: string, summary: string, summaryEmbedding: float[] (vector index)
Create: create separate Form nodes for each Form 10K. Add summary.
Enhance: vector index of summary.
Connect: connect from Form to first node in linked list. Then from each chunk back to the Form node.
Benefits:
● expand context of chunk with summary text
● navigate from form to text
35. Add Form 13
Structured Data
(Manager)-[OWNS_STOCK_IN]->(Company)
Manager properties: name: string, address: string (full-text index)
OWNS_STOCK_IN properties: shares: integer, value: float
Company properties: name: string, address: string (full-text index)
Create: create Manager and Company nodes.
Enhance: full-text index of names.
Connect: connect Manager nodes to Company nodes through investments.
Benefits:
● pattern-matching queries
● search names by text similarity (Apple and Apple Inc) rather than conceptual similarity (Apple and Banana)
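The difference between text similarity and conceptual similarity, combined with a simple ownership pattern match, can be illustrated as follows. This is a rough sketch: Python's `difflib.SequenceMatcher` stands in for a full-text index (which tokenizes and scores terms quite differently), and the manager names, holdings, and threshold are invented for the example.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Holding(NamedTuple):
    """One Manager -[OWNS_STOCK_IN]-> Company record (made-up data)."""
    manager: str
    company: str
    shares: int
    value: float


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def managers_owning(holdings: List[Holding], company_query: str, threshold: float = 0.6) -> List[str]:
    """Find managers holding stock in companies whose name resembles the query."""
    return sorted(
        {
            h.manager
            for h in holdings
            if name_similarity(h.company, company_query) >= threshold
        }
    )


holdings = [
    Holding("Manager A", "Apple Inc", 100, 1000.0),
    Holding("Manager B", "Banana Corp", 50, 500.0),
    Holding("Manager C", "Apple Inc", 10, 100.0),
]

apple_investors = managers_owning(holdings, "Apple")
```

An embedding model might place "Apple" and "Banana" close together because both are fruit; a name lookup wants the spelling-level match to "Apple Inc" instead, which is why the slide pairs names with a full-text index rather than a vector index.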
36. Located at Address
Geospatial Search
(Company|Manager)-[LOCATED_AT]->(Address)
Address properties: city: string, state: string, country: string, location: Point (geospatial index)
Create: create Address nodes.
Enhance: geospatial index of address.
Connect: connect Manager and Company nodes to Address.
Benefits:
● pattern-based location queries
● distance-based calculations, search companies within radius or bounding box
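A distance-based search within a radius can be sketched with the haversine formula. This is only an illustration of the idea: a geospatial index avoids scanning every address, the `Address` type here stores latitude and longitude as plain floats rather than a Point, and the coordinates are approximate values chosen for the example.

```python
import math
from typing import List, NamedTuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


class Address(NamedTuple):
    city: str
    state: str
    country: str
    latitude: float
    longitude: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(
    addresses: List[Address], center_lat: float, center_lon: float, radius_km: float
) -> List[Address]:
    """Keep the addresses no farther than radius_km from the center."""
    return [
        a
        for a in addresses
        if haversine_km(center_lat, center_lon, a.latitude, a.longitude) <= radius_km
    ]


# Approximate coordinates, for illustration only.
addresses = [
    Address("Cupertino", "CA", "US", 37.3230, -122.0322),
    Address("San Jose", "CA", "US", 37.3382, -121.8863),
    Address("New York", "NY", "US", 40.7128, -74.0060),
]

nearby = within_radius(addresses, 37.3230, -122.0322, 50.0)
```

Because Manager and Company nodes both connect to Address nodes, the same radius filter can narrow either kind of search result by location.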
37. Combine Graphs
Mixed Text & Data
Connect: connect Company nodes to the Form they filed.
Benefits:
● expanded context for vector-based search
● refine search results by location
● expanded pattern matches
[Diagram: the combined graph, with (Chunk)-[NEXT]->(Chunk), (Chunk)-[PART_OF]->(Form), (Form)-[SECTION]->(Chunk), (Company)-[FILED]->(Form), (Manager)-[OWNS_STOCK_IN]->(Company), and (Company|Manager)-[LOCATED_AT]->(Address).]
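Once the text graph and the data graph are connected, a search hit on a chunk can be expanded to the investors it may affect. The sketch below walks that path over plain dictionaries. Everything in it is an assumption made for illustration: the ids, names, and chunk text are invented, and a keyword match stands in for vector search.

```python
from typing import Dict, List, Set

# (Chunk)-[PART_OF]->(Form): chunk id to form id
part_of: Dict[str, str] = {"chunk-1": "form-A", "chunk-2": "form-B"}

# (Company)-[FILED]->(Form), stored as form id to company name for lookup
filed_by: Dict[str, str] = {"form-A": "Company A", "form-B": "Company B"}

# (Manager)-[OWNS_STOCK_IN]->(Company): manager name to companies held
owns_stock_in: Dict[str, Set[str]] = {
    "Manager X": {"Company A"},
    "Manager Y": {"Company A", "Company B"},
    "Manager Z": {"Company C"},
}

chunk_text: Dict[str, str] = {
    "chunk-1": "Our products depend on a steady supply of chips.",
    "chunk-2": "We operate retail stores in several regions.",
}


def search_chunks(keyword: str, texts: Dict[str, str]) -> List[str]:
    """Keyword match standing in for vector similarity search."""
    return [cid for cid, text in texts.items() if keyword.lower() in text.lower()]


def investors_for_chunks(chunk_ids: List[str]) -> Dict[str, List[str]]:
    """Expand chunk -> form -> company -> managers holding that company."""
    companies = {filed_by[part_of[cid]] for cid in chunk_ids}
    exposed: Dict[str, List[str]] = {}
    for manager, held in owns_stock_in.items():
        overlap = held & companies
        if overlap:
            exposed[manager] = sorted(overlap)
    return exposed


matching = search_chunks("chips", chunk_text)
impacted = investors_for_chunks(matching)
```

The search step finds text that is similar to the question; the traversal step supplies the answer, which never appears in the matching chunk itself. That is the "connecting the dots" behavior the talk describes as information discovery.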
38. Create, Enhance, Connect SEC Financial Forms
Sections from a Form
  Source: Form 10K JSON files
  1. Create: (:Chunk)
  2. Enhance: Vector embedding
  3. Connect: (Chunk)-[NEXT]->(Chunk)
Form 10K Nodes
  Source: (:Chunk)
  1. Create: (:Form)
  2. Enhance: Vector embedding
  3. Connect: (Chunk)-[PART_OF]->(Form)
Public Companies
  Source: Form 13 CSV
  1. Create: (:Company)
  2. Enhance: Full-text index
  3. Connect: (Company)-[FILED]->(Form)
Management Firms
  Source: Form 13 CSV
  1. Create: (:Manager)
  2. Enhance: Full-text index
  3. Connect: (Manager)-[OWNS_STOCK_IN]->(Company)
Addresses
  Source: (:Company), (:Manager)
  1. Create: (:Address)
  2. Enhance: Geospatial index
  3. Connect: (Company|Manager)-[LOCATED_AT]->(Address)
You can continue to grow the knowledge graph…
● cross-link Companies that mention each other
● add People, Places, Topics extracted from text (named entity recognition)
● add more Form data, or other related sources
● add User information to keep history, refine relevance and enable feedback
39. Resources & Next Steps
Code: github.com/neo4j-examples/sec-edgar-notebooks
Get Started with Neo4j (Aura Free): neo4j.com/cloud/aura-free/
GenAI Ecosystem & Free Learning Resources: neo4j.com/labs/genai-ecosystem/ and graphacademy.neo4j.com/categories/llms/