1. Neo4j, Inc. All rights reserved 2023
Data Lineage with Neo4j
J.MOLIERE -
MentorJ-jerome@javaxpert.com
23 February 2023
Data lineage in 2 minutes
Data lineage is metadata showing how data changes across the successive steps in your information system:
● Transformations
● Extractions
● And other processes
What for?
● Identify errors quickly
● Evaluate impacts
● Improve communication with data consumers (downstream)
The concept is similar to traceability in a supply chain.
Sample of a concrete question: an email address should not be shown in plain text on this screen.
Data lineage & graphs
Once again, graphs are first-class citizens for hosting such process chains.
But the risk is to have too much information and to lose focus on the resulting metadata
⇒ filtering
⇒ being able to drop all the noise
Finding solutions to problems rather than showing many nodes…
More than 90% of algorithms can be modelled as graphs!
Graphs are not just nodes but also relationships!
- Relationships have attributes
- Relationships can be filtered
More relationships mean better results and better performance.
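As an illustration of relationships carrying attributes that can be filtered, here is a minimal Python sketch. The node names, the relationship kinds and the properties (`column`, `anonymized`) are hypothetical and do not come from the demo.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    source: str
    target: str
    kind: str
    properties: dict = field(default_factory=dict)


# Hypothetical lineage relationships; names are illustrative only.
RELATIONSHIPS = [
    Relationship("crm_db", "export_job", "READ_BY",
                 {"column": "email", "anonymized": False}),
    Relationship("export_job", "reporting_app", "FEEDS",
                 {"column": "email", "anonymized": True}),
    Relationship("crm_db", "export_job", "READ_BY",
                 {"column": "country", "anonymized": False}),
]


def filter_relationships(relationships, **criteria):
    """Keep only the relationships whose properties match every criterion."""
    return [
        rel for rel in relationships
        if all(rel.properties.get(key) == value for key, value in criteria.items())
    ]


# Drop the noise: keep only what concerns the email column.
email_edges = filter_relationships(RELATIONSHIPS, column="email")
clear_email_edges = filter_relationships(RELATIONSHIPS, column="email", anonymized=False)
```

Filtering on relationship properties is what lets a lineage graph answer one question at a time instead of displaying every node.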
Property Based Testing
Comes from the functional programming world (Haskell/Scala):
● Goes beyond example-based testing
● Tries to generalize concepts (see Bertrand Meyer's Design by Contract): what is always true for my Customers, Products or any other domain entity?
● Uses randomness to generate values and check your assertions
● Very smart way to prepare data sets in line with TDM (test data management) concepts
○ Prefer synthetic data over production data
⇒ hijack these tools and use their generators to prepare data sets
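The idea can be sketched in plain Python without any testing library. This is only a hand-rolled illustration of the principle, not the tooling used in the talk: `random_email`, `mask_email` and `check_property` are hypothetical helpers, and real property-based testing libraries add features such as shrinking of counterexamples.

```python
import random
import string


def random_email(rng):
    """Generate a synthetic email address (no production data involved)."""
    alphabet = string.ascii_lowercase + string.digits
    local = "".join(rng.choices(alphabet, k=rng.randint(1, 12)))
    domain = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
    return f"{local}@{domain}.example"


def mask_email(email):
    """Hypothetical anonymization step: hide the local part, keep the domain."""
    _local, _, domain = email.partition("@")
    return "***@" + domain


def masked_email_hides_local_part(email):
    """Property: for ANY email, masking keeps the domain and hides the local part."""
    local, _, domain = email.partition("@")
    masked_local, _, masked_domain = mask_email(email).partition("@")
    return masked_domain == domain and local not in masked_local


def check_property(prop, generator, runs=200, seed=42):
    """Check the property on many generated values.

    Returns the first counterexample found, or None if the property held.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        value = generator(rng)
        if not prop(value):
            return value
    return None


counterexample = check_property(masked_email_hides_local_part, random_email)

# "Hijacking" the generator: reuse it to build a synthetic data set.
dataset_rng = random.Random(2023)
synthetic_emails = [random_email(dataset_rng) for _ in range(5)]
```

The same generator serves two purposes: checking the property and producing synthetic test data.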
Finding answers to real questions with Neo4j
Some nodes are service instances (applications, databases, search engines).
Others are ETL or Kafka Streams processes that use data:
• Transform
• Extract
• Join
Data lineage is the art of using the metadata coming from this chain of processing steps.
A very simple example showing how to react quickly when audited by the CNIL (the French data protection authority)…
Demo
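To make the chain of services and processing steps concrete, here is a minimal Python sketch of such a graph with a downstream traversal, which answers "what is impacted if this source changes?". All node names and kinds are hypothetical; the demo itself relies on Neo4j, not on this in-memory structure.

```python
from collections import deque

# Hypothetical nodes: service instances and processing steps.
NODES = {
    "crm_db": "database",
    "extract_customers": "etl",
    "join_orders": "kafka_streams",
    "search_index": "search_engine",
    "backoffice_app": "application",
}

# Data flows from each key to the nodes listed as its value.
FLOWS = {
    "crm_db": ["extract_customers"],
    "extract_customers": ["join_orders"],
    "join_orders": ["search_index", "backoffice_app"],
}


def downstream(start, flows):
    """Breadth-first traversal: nodes reachable from start, in discovery order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in flows.get(node, []):
            if successor not in seen:
                seen.add(successor)
                order.append(successor)
                queue.append(successor)
    return order


impacted = downstream("crm_db", FLOWS)
impacted_services = [n for n in impacted if NODES[n] in ("application", "search_engine")]
```

Keeping the node kind as metadata makes it possible to report only the impacted services and leave out the intermediate processing steps.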
Finding answers to real questions with Neo4j
Something went wrong in our system: some restricted data appears in clear text in one application (it should be anonymized)...
Who is the culprit?
Demo
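The culprit search can be sketched as an upstream traversal: starting from the application, follow the data back to its sources and keep every path on which no step anonymizes the data. This is a minimal Python sketch under stated assumptions: the graph is acyclic, and the node names and the `ANONYMIZERS` set are hypothetical.

```python
# For each node, the nodes it receives data from (hypothetical example).
UPSTREAM = {
    "support_app": ["mask_pii", "raw_export"],
    "mask_pii": ["extract_customers"],
    "raw_export": ["extract_customers"],
    "extract_customers": ["crm_db"],
    "crm_db": [],
}

# Steps known (from their metadata) to anonymize restricted data.
ANONYMIZERS = {"mask_pii"}


def leaking_paths(target, upstream, anonymizers):
    """Return every source-to-target path that contains no anonymizing step.

    Assumes the lineage graph is acyclic.
    """
    paths = []

    def walk(node, path):
        if node in anonymizers:
            return  # data is anonymized downstream of this step: branch is safe
        path = [node] + path
        parents = upstream.get(node, [])
        if not parents:
            paths.append(path)  # reached a source without meeting an anonymizer
        for parent in parents:
            walk(parent, path)

    walk(target, [])
    return paths


culprit_paths = leaking_paths("support_app", UPSTREAM, ANONYMIZERS)
```

In this example the only leaking path goes through `raw_export`, which bypasses the masking step.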
Demo Data Model
From an ETL pipeline to a fine-grained data processing meta-model:
• A job has a set of atomic processing steps
• A processing step uses output data from processing steps belonging to upstream jobs
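The meta-model can be sketched with two Python dataclasses. The class and attribute names (`Job`, `Processing`, `inputs`) and the sample jobs are assumptions made for illustration; the demo stores this model in Neo4j.

```python
from dataclasses import dataclass, field


@dataclass
class Processing:
    name: str
    job: str  # name of the job this step belongs to
    inputs: list = field(default_factory=list)  # steps whose output is consumed


@dataclass
class Job:
    name: str
    processings: list = field(default_factory=list)

    def add(self, name, inputs=()):
        """Register an atomic processing step in this job and return it."""
        step = Processing(name, self.name, list(inputs))
        self.processings.append(step)
        return step


def upstream_jobs(job):
    """Names of the other jobs whose processing output this job consumes."""
    return sorted({
        source.job
        for step in job.processings
        for source in step.inputs
        if source.job != job.name
    })


extract = Job("extract_customers")
read_crm = extract.add("read_crm")

load = Job("load_reporting")
mask = load.add("mask_email", inputs=[read_crm])
write = load.add("write_report", inputs=[mask])
```

Dependencies between jobs are never declared directly: they are derived from the fine-grained links between processing steps.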
Pipeline Backbone
Extract job metadata
Build the graph with Cypher
Extract ETL metadata
Inject metadata into Cypher query
Build related part of actionable graph
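The three steps above can be sketched in Python as a function that turns extracted job metadata into a parameterized Cypher query. The metadata layout, the labels (`Job`, `Processing`) and the relationship types (`HAS_PROCESSING`, `USES_OUTPUT_OF`) are assumptions made for illustration, not the schema used in the demo.

```python
def build_lineage_query(job_metadata):
    """Turn extracted job metadata into a Cypher query and its parameters.

    Expected (hypothetical) metadata layout:
    {"job": "load_reporting",
     "steps": [{"name": "mask_email", "inputs": ["read_crm"]}, ...]}
    """
    query = (
        "UNWIND $steps AS step\n"
        "MERGE (j:Job {name: step.job})\n"
        "MERGE (p:Processing {name: step.name})\n"
        "MERGE (j)-[:HAS_PROCESSING]->(p)\n"
        "WITH p, step\n"
        "UNWIND step.inputs AS input\n"
        "MERGE (u:Processing {name: input})\n"
        "MERGE (p)-[:USES_OUTPUT_OF]->(u)"
    )
    # Metadata travels as query parameters, never via string concatenation.
    steps = [
        {
            "job": job_metadata["job"],
            "name": step["name"],
            "inputs": list(step.get("inputs", [])),
        }
        for step in job_metadata["steps"]
    ]
    return query, {"steps": steps}


metadata = {
    "job": "load_reporting",
    "steps": [
        {"name": "mask_email", "inputs": ["read_crm"]},
        {"name": "write_report", "inputs": ["mask_email"]},
    ],
}
query, parameters = build_lineage_query(metadata)
```

With the official Neo4j Python driver, the pair would typically be passed to `session.run(query, parameters)`. Using `MERGE` makes the load repeatable: running it again for the same job does not duplicate nodes or relationships.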
Takeaways
- Where has PBT gone?
- Data lineage
- Power of graphs
- Power of graph traversal
- Adding many randomly generated nodes obfuscates the demo
- The effect is counterproductive
- Data lineage shown in a simple case
- METADATA!!
- ⇒ metadata is easy to use with Cypher
Thank you!
Questions welcome
Thanks to Pierre from the Neo4j team for technical assistance
Thanks to Eva/Cedric for the opportunity offered