Data Engineering
for RAG
ABK of Neo4j
(Andreas Kollegger)
Generative AI
Neo4j Inc. All rights reserved 2024
Generative AI
● Learns random sentences
from random people
● Talks like a person but doesn't really
understand what it's saying
● Occasionally speaks absolute nonsense
● Sensitive to question phrasing
● Answers reflect the person asking
● Can't explain or verify answers
● Limited to public "knowledge"
How do we
integrate
with the alien
technology?
Everything
starts with
practical work,
using RAG…
Retrieval Augmented
Generation (RAG)
RAG is a software design pattern for
integrating GenAI Apps with custom data
sources, like a database.
A Generative AI application
uses an LLM
to provide responses
to user prompts
(for example, ChatGPT)
[Diagram: the User sends a Prompt to the GenAI Application, which forwards the User Prompt to the LLM; the LLM's Response is returned to the user as the Complete Response.]
RAG augments the LLM by
intercepting a user's prompt,
then making a query to a database,
then using the query results as
context for the user's prompt,
creating a new prompt that is passed
to the LLM
for a complete, curated response
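The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the in-memory `DOCUMENTS` list stands in for the database, the word-overlap ranking stands in for a real query, and `llm` is any callable you supply. All function names here are hypothetical.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical in-memory "database": a list of text records.
DOCUMENTS = [
    "Form 10K filings are submitted by publicly traded companies.",
    "Form 13 filings are submitted by institutional investment managers.",
]


def query_database(user_prompt: str, documents: list[str], limit: int = 1) -> list[str]:
    """Toy retrieval: rank records by how many words they share with the prompt."""
    prompt_words = set(user_prompt.lower().split())
    scored = [(len(prompt_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop records that share no words with the prompt.
    return [doc for score, doc in scored[:limit] if score > 0]


def build_prompt(user_prompt: str, context: list[str]) -> str:
    """Combine the query results (context) and the user's prompt into a new prompt."""
    context_block = "\n".join(f"- {item}" for item in context) or "(no context found)"
    return (
        "Answer the question using only this context:\n"
        f"{context_block}\n\n"
        f"Question: {user_prompt}"
    )


def rag_answer(user_prompt: str, llm: Callable[[str], str], documents: list[str]) -> str:
    """Intercept the prompt, query the database, then pass prompt + context to the LLM."""
    context = query_database(user_prompt, documents)
    return llm(build_prompt(user_prompt, context))
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model provider.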
[Diagram: (1) the GenAI Application sends the User Prompt to the Database and receives Context; (2) it sends User Prompt + Context to the LLM, and the LLM's Response is returned to the user as the Complete Response.]
This sets up a knowledge stack…
● the user knows something about the question they're asking
● the application knows something about the user
● the database knows about particular information and data
● the LLM knows about whatever it found on the internet
Knowledge you control,
in the app and the database.
Three Sources of Data
for RAG
Each with different access patterns,
supporting different kinds of questions.
Pure Text
Unstructured data in PDFs,
plain text files, or images
Information search: “What is Apple's primary business?”
Answer with: Implicit knowledge derived from text.
Pure Data
Structured data
in a database
Information query: “How many iPhones did Apple sell this quarter?”
Answer with: Explicit facts from a database query.
Mixed Text + Data
Structured data together
with long-form text
Information discovery: “Which investors will be impacted by a chip shortage?”
Answer with: Combined search and data query.
A Knowledge Graph:
Information architecture for data, organized using graph structures,
which places data within context.
Graph RAG:
Supports multiple modes of information retrieval, including
information search, information query, and information discovery.
Pure Text: Vector Search. Find relevant documents plus context for information search.
Mixed Text + Data: Search + Pattern Matching. Expand context and rank the relevance for information discovery.
Pure Data: Graph Queries. Directly query the knowledge graph for information query.
GenAI Example:
SEC EDGAR
Financial Forms
SEC EDGAR Financial Data
The EDGAR database provides free public
access to company information, allowing
research about public company financial
information and operations through the filings
they submit to the SEC.
There are two forms that we'll look at today:
1. Form 10K: filings from publicly traded companies
2. Form 13: filings from institutional investment management firms
Data Modeling Strategy
Start with a Minimum Viable Graph (MVG)
Create, Enhance, Connect, then repeat to grow the graph
1. Create: identify interesting information, create records
2. Enhance: supercharge the data by enhancing some dimension
3. Connect: connect information to expand context and reveal knowledge
Create: Form 10K text chunks
1. Source: Form 10K
2. Split Text
3. Create Nodes
[Diagram: the text of a Form 10K is split into chunks, and each chunk of text becomes a Chunk node.]
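As a rough sketch of the Create step, the Python below splits text on whitespace into size-limited chunks and builds one record per chunk. The `formId`, `chunkId`, and `text` property names come from the Chunk node shown later in the deck; the chunk-size limit and the `chunkId` naming scheme are assumptions made for illustration.

```python
from __future__ import annotations


def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, breaking only on whitespace.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)  # close the current chunk
            current = word          # start the next chunk with this word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def create_chunk_records(form_id: str, text: str, max_chars: int = 200) -> list[dict]:
    """Build one record per chunk, ready to be stored as a Chunk node."""
    return [
        # The chunkId format below is a hypothetical naming scheme.
        {"formId": form_id, "chunkId": f"{form_id}-chunk{seq:04d}", "text": chunk}
        for seq, chunk in enumerate(split_text(text, max_chars))
    ]
```

Keeping a sequence number in the `chunkId` preserves the original reading order, which the Connect step relies on.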
Enhance: Text with an embedding
1. Source: Chunks
4. Add Embedding
[Diagram: each Chunk node gets an embedding vector, for example [0.6,0.2,0.1,0.7], and the embeddings are stored in a Vector Index.]
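To make the Enhance step concrete, here is a small Python sketch that attaches a `textEmbedding` to each chunk record. The `toy_embedding` function is a deliberately crude stand-in (a hashed bag of words), used only so the example runs without an external service; a real pipeline would call an embedding model instead.

```python
from __future__ import annotations

import hashlib
import math


def toy_embedding(text: str, dimensions: int = 8) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vector = [0.0] * dimensions
    for word in text.lower().split():
        # Hash each word to a stable bucket (Python's built-in hash() is salted per run).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimensions] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def add_embeddings(chunks: list[dict]) -> list[dict]:
    """Return copies of the chunk records with a textEmbedding property added."""
    return [{**chunk, "textEmbedding": toy_embedding(chunk["text"])} for chunk in chunks]
```

Unlike a trained model, this toy vector captures word overlap only, not meaning, so it should not be used to judge retrieval quality.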
Connect: Connect chunks into a list
1. Connect Chunks
[Diagram: consecutive Chunk nodes are linked by NEXT relationships.]
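The Connect step amounts to pairing each chunk with its successor. The sketch below produces those pairs in Python; the Cypher string is an assumed way of writing one pair to the graph and is included for orientation only (it is not executed here).

```python
from __future__ import annotations

# Assumed Cypher for writing one NEXT relationship (illustrative, not executed).
CONNECT_CYPHER = """
MATCH (a:Chunk {chunkId: $from_id}), (b:Chunk {chunkId: $to_id})
MERGE (a)-[:NEXT]->(b)
"""


def connect_chunks(chunk_ids: list[str]) -> list[tuple[str, str, str]]:
    """Return (from, 'NEXT', to) triples linking each chunk to the one after it."""
    return [(a, "NEXT", b) for a, b in zip(chunk_ids, chunk_ids[1:])]
```

A list of n chunks yields n - 1 relationships, so a single chunk has no NEXT link at all.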
Create, Enhance, Connect Form 10K
1. Source: Form 10K
2. Split Text
3. Create Nodes
4. Add Embedding
5. Connect
[Diagram: Form 10K text is split into Chunk nodes, each Chunk gets an embedding, and the Chunks are linked by NEXT.]
Extract, Enhance, Expand
Minimum Viable Graph: Linked List of Text
Benefits:
● vector similarity search to find relevant text
● expand context window with previous/next chunks
● enable paging through text
Chunk properties:
formId: string
chunkId: string
text: string
textEmbedding: float[] (vector index)
[Diagram: Chunk nodes linked by NEXT.]
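The first two benefits can be illustrated with plain Python: find the chunk whose embedding is closest to a query embedding, then widen the result with its previous and next chunks. List order stands in for the NEXT relationships, and a linear scan stands in for the vector index. The sample vectors are the ones drawn on the slides; pairing them with these particular text fragments is an assumption.

```python
from __future__ import annotations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_embedding: list[float], chunks: list[dict]) -> int:
    """Return the index of the chunk most similar to the query (chunks must be non-empty)."""
    return max(
        range(len(chunks)),
        key=lambda i: cosine_similarity(query_embedding, chunks[i]["textEmbedding"]),
    )


def expand_window(chunks: list[dict], index: int, before: int = 1, after: int = 1) -> str:
    """Join a chunk with its previous/next neighbours to widen the context."""
    start = max(0, index - before)
    return " ".join(chunk["text"] for chunk in chunks[start : index + after + 1])


# Sample chunks in reading order; the text/vector pairing is illustrative only.
CHUNKS = [
    {"text": "Lorem ipsum dolor sit amet, consectetur", "textEmbedding": [0.6, 0.2, 0.1, 0.7]},
    {"text": "adipiscing elit, sed do eiusmod tempor", "textEmbedding": [0.5, 0.2, 0.1, 0.7]},
    {"text": "incididunt ut labore et dolore magna aliqua. Ut", "textEmbedding": [0.4, 0.2, 0.1, 0.7]},
    {"text": "enim ad minim veniam, quis nostrud", "textEmbedding": [0.3, 0.2, 0.1, 0.5]},
    {"text": "exercitation ullamco laboris nisi ut aliquip", "textEmbedding": [0.2, 0.2, 0.1, 0.7]},
]
```

Calling `expand_window(CHUNKS, most_similar(query, CHUNKS))` returns the best match together with one chunk on either side.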
Improve Context: Hierarchical Summary
Create: create separate Form nodes for each Form 10K. Add summary.
Enhance: vector index of summary.
Connect: connect from Form to first node in linked list. Then from each chunk back to the Form node.
Benefits:
● expand context of chunk with summary text
● navigate from form to text
Form properties:
cusip6: string
formId: string
summary: string
summaryEmbedding: float[] (vector index)
[Diagram: Form and Chunk nodes connected by SECTION, NEXT, and PART_OF relationships.]
Add Form 13: Structured Data
Create: create Manager and Company nodes
Enhance: full-text index of names
Connect: connect Manager nodes to Company nodes through investments
Benefits:
● pattern-matching queries
● search names by text similarity (Apple and Apple Inc) rather than conceptual similarity (Apple and Banana)
Manager properties: name: string, address: string (full-text index)
OWNS_STOCK_IN properties: shares: integer, value: float
Company properties: name: string, address: string (full-text index)
[Diagram: a Manager node connected to a Company node by OWNS_STOCK_IN.]
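The difference between text similarity and conceptual similarity can be shown with Python's standard library. This sketch uses `difflib.SequenceMatcher` as a stand-in for a full-text index; a real full-text index tokenises and scores differently, so treat the numbers as illustrative. The company names and the threshold are made up for the example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two names, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_names(query: str, names: list[str], threshold: float = 0.6) -> list[str]:
    """Return names at or above the threshold, best match first."""
    scored = [(name_similarity(query, name), name) for name in names]
    matches = [(score, name) for score, name in scored if score >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in matches]


# Hypothetical company names.
COMPANY_NAMES = ["Apple Inc", "Banana Republic"]
```

"Apple" and "Apple Inc" share most of their characters, so they score high; "Apple" and "Banana" are both fruit, but that conceptual closeness is invisible to a text match.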
Located at Address: Geospatial Search
Create: create Address nodes
Enhance: geospatial index of address
Connect: connect Manager and Company nodes to Address
Benefits:
● pattern-based location queries
● distance-based calculations, search companies within radius or bounding box
Address properties:
city: string
state: string
country: string
location: Point (geospatial index)
[Diagram: Manager and Company nodes are each connected to an Address node by LOCATED_AT; the Manager is connected to the Company by OWNS_STOCK_IN.]
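A distance-based search can be sketched without a database by computing great-circle distances directly. The haversine formula below is a standard approximation on a spherical Earth; in the graph, the geospatial index would do this work. The company names and coordinates are placeholder data, and `location` is modelled as a (latitude, longitude) tuple.

```python
from __future__ import annotations

import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(addresses: list[dict], center: tuple[float, float], radius_km: float) -> list[str]:
    """Return the names of companies whose address lies within radius_km of center."""
    return [
        address["company"]
        for address in addresses
        if haversine_km(center[0], center[1], *address["location"]) <= radius_km
    ]


# Placeholder data: approximate (latitude, longitude) city coordinates.
ADDRESSES = [
    {"company": "Company A", "city": "Cupertino", "state": "CA", "country": "US", "location": (37.3230, -122.0322)},
    {"company": "Company B", "city": "Santa Clara", "state": "CA", "country": "US", "location": (37.3541, -121.9552)},
    {"company": "Company C", "city": "New York", "state": "NY", "country": "US", "location": (40.7128, -74.0060)},
]
```

A bounding-box search would replace the distance test with simple latitude and longitude range checks.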
Combine Graphs: Mixed Text & Data
Connect: connect Company nodes to the Form they filed
Benefits:
● expanded context for vector-based search
● refine search results by location
● expanded pattern matches
[Diagram: the combined graph of Chunk, Form, Company, Manager, and Address nodes, connected by NEXT, PART_OF, SECTION, FILED, OWNS_STOCK_IN, and LOCATED_AT relationships.]
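Once text and data are combined, an information-discovery question such as "Which investors will be impacted by a chip shortage?" becomes a search followed by a pattern match. The Python sketch below walks a tiny, made-up graph stored as tuples: a keyword match stands in for vector search, and the node names and chunk text are hypothetical.

```python
from __future__ import annotations

# Hypothetical miniature graph, stored as (start, TYPE, end) tuples.
RELATIONSHIPS = [
    ("chunk-1", "PART_OF", "form-A"),
    ("chunk-2", "PART_OF", "form-B"),
    ("Company A", "FILED", "form-A"),
    ("Company B", "FILED", "form-B"),
    ("Manager X", "OWNS_STOCK_IN", "Company A"),
    ("Manager Y", "OWNS_STOCK_IN", "Company B"),
    ("Manager Z", "OWNS_STOCK_IN", "Company A"),
]

CHUNK_TEXT = {
    "chunk-1": "Our products depend on a steady supply of semiconductor chips.",
    "chunk-2": "We operate retail stores selling groceries.",
}


def search_chunks(keyword: str) -> list[str]:
    """Stand-in for vector search: chunks whose text contains the keyword."""
    return [cid for cid, text in CHUNK_TEXT.items() if keyword.lower() in text.lower()]


def follow(node: str, rel_type: str) -> list[str]:
    """Nodes reached by following rel_type outward from node."""
    return [end for start, rel, end in RELATIONSHIPS if start == node and rel == rel_type]


def follow_back(node: str, rel_type: str) -> list[str]:
    """Nodes that point at node through rel_type."""
    return [start for start, rel, end in RELATIONSHIPS if end == node and rel == rel_type]


def impacted_investors(keyword: str) -> list[str]:
    """Chunk -> Form -> Company -> Manager, i.e. roughly the pattern
    (:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager)."""
    investors: set[str] = set()
    for chunk in search_chunks(keyword):
        for form in follow(chunk, "PART_OF"):
            for company in follow_back(form, "FILED"):
                investors.update(follow_back(company, "OWNS_STOCK_IN"))
    return sorted(investors)
```

The search step finds the relevant text; the traversal step supplies the facts that the text alone does not contain.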
Create, Enhance, Connect SEC Financial Forms
Sections from a Form
  Source: Form 10K json files
  1. Create: (:Chunk)
  2. Enhance: Vector embedding
  3. Connect: (Chunk)-[NEXT]->(Chunk)
Form 10K Nodes
  Source: (:Chunk)
  1. Create: (:Form)
  2. Enhance: Vector embedding
  3. Connect: (Chunk)-[PART_OF]->(Form)
Public Companies
  Source: Form 13 CSV
  1. Create: (:Company)
  2. Enhance: Full-text index
  3. Connect: (Company)-[FILED]->(Form)
Management Firms
  Source: Form 13 CSV
  1. Create: (:Manager)
  2. Enhance: Full-text index
  3. Connect: (Manager)-[OWNS_STOCK_IN]->(Company)
Addresses
  Source: (:Company), (:Manager)
  1. Create: (:Address)
  2. Enhance: Geospatial index
  3. Connect: (Company|Manager)-[LOCATED_AT]->(Address)
You can continue to grow the knowledge graph…
● cross-link Companies that mention each other
● add People, Places, Topics extracted from text (named entity recognition)
● add more Form data, or other related sources
● add User information to keep history, refine relevance and enable feedback
Resources & Next Steps
Code
github.com/neo4j-examples/sec-edgar-notebooks
Get Started with Neo4j: Aura Free
neo4j.com/cloud/aura-free/
GenAI Ecosystem & Free Learning Resources
neo4j.com/labs/genai-ecosystem/
graphacademy.neo4j.com/categories/llms/
Thank you!
andreas.kollegger@neo4j.com