Neo4j, Inc. All rights reserved 2023
Data Lineage with Neo4j
J. MOLIERE - MentorJ - jerome@javaxpert.com
23 February 2023
Data lineage in 2 minutes
Data lineage is metadata showing how data changes across the steps of your information system:
● Transformations
● Extractions
● And other processes
What for?
● Identify errors quickly
● Assess impact
● Improve communication with (downstream) data consumers
A concept similar to traceability in the supply chain
Example of a concrete question: an email address should not appear in plain text on this screen.
Data lineage & graphs
Once again, graphs are first-class citizens for hosting such processing chains.
The risk is having too much information and losing focus on the resulting metadata
⇒ filtering
⇒ being able to drop all the noise
The goal is to find solutions to problems rather than to show many nodes…
More than 90% of algorithms can be modelled as graphs!
Graphs are not just nodes but also relationships!
- Relationships have attributes
- Relationships can be filtered
More relationships mean better results and better performance.
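To make "relationships have attributes and can be filtered" concrete, here is a minimal Python sketch. The `Relationship` class, the node names and the `column` / `anonymized` attributes are all assumptions made up for illustration; they do not come from the demo.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """A directed edge that carries its own attributes (hypothetical model)."""
    source: str
    target: str
    kind: str
    attributes: dict = field(default_factory=dict)


def filter_relationships(relationships, **criteria):
    """Keep only the relationships whose attributes match every criterion."""
    return [
        rel for rel in relationships
        if all(rel.attributes.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical lineage edges: which column flows where, and in which state.
relationships = [
    Relationship("crm_db", "extract_customers", "READ_BY",
                 {"column": "email", "anonymized": False}),
    Relationship("extract_customers", "mask_pii", "FEEDS",
                 {"column": "email", "anonymized": False}),
    Relationship("mask_pii", "reporting_app", "FEEDS",
                 {"column": "email", "anonymized": True}),
    Relationship("crm_db", "extract_customers", "READ_BY",
                 {"column": "country", "anonymized": False}),
]

# Drop the noise: keep only the edges where the email travels in clear text.
clear_email_edges = filter_relationships(
    relationships, column="email", anonymized=False
)
```

Filtering on relationship attributes is what lets a question about one column ignore every edge that does not carry it.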
Property Based Testing
Comes from the functional programming world (Haskell/Scala):
● Goes beyond Example Based Testing
● Tries to generalize concepts (see Bertrand Meyer's Design by Contract): what is always true for my Customers, Products or any other domain entity?
● Uses randomness to generate values and check your assertions against them
● A very smart way to prepare data sets in line with Test Data Management (TDM) concepts
○ Prefer synthetic data over production data
⇒ repurpose these tools: use their generators to prepare data sets
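The idea can be sketched without any dedicated library. The hand-rolled generator and runner below only imitate what a real property-based testing tool does (no shrinking, no smart strategies); `Customer`, `generate_customer`, `mask_email` and `check_property` are hypothetical names chosen for this example.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    email: str


def generate_customer(rng):
    """Generate one synthetic customer from a seeded random source."""
    name = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 12)))
    domain = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
    return Customer(name=name, email=f"{name}@{domain}.example")


def mask_email(email):
    """Code under test (hypothetical): hide the local part, keep the domain."""
    _local, _, domain = email.partition("@")
    return "***@" + domain


def check_property(prop, runs=200, seed=42):
    """Run prop on many generated customers; return the first counterexample or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        customer = generate_customer(rng)
        if not prop(customer):
            return customer
    return None


def masked_email_hides_name(customer):
    """Property: the masked local part never contains the name; the domain is kept."""
    masked_local, _, masked_domain = mask_email(customer.email).partition("@")
    original_domain = customer.email.partition("@")[2]
    return customer.name not in masked_local and masked_domain == original_domain


counterexample = check_property(masked_email_hides_name)

# The same generator, reused to prepare a synthetic data set.
rng = random.Random(7)
dataset = [generate_customer(rng) for _ in range(5)]
```

The last two lines are the "repurposing" step: the generator written for the test doubles as a source of synthetic data, so no production data is needed.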
Finding answers to real questions with Neo4j
Some nodes are service instances (applications, databases, search engines)
Others are ETL or Kafka Streams processes that act on data:
• Transform
• Extract
• Join
Data lineage is the art of using the metadata coming from this chain of processing steps.
A very simple example showing how to react quickly during an audit by the CNIL (the French data protection authority)…
Demo
Finding answers to real questions with Neo4j
Something went wrong in our system: restricted data appears in clear text in one application (it should be anonymized)...
Who is the culprit?
Demo
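The "who is the culprit?" question is an upstream graph traversal. The sketch below is plain Python rather than the Cypher used in the demo, and the node names, the `upstream` map and the `anonymizers` set are invented for illustration. It walks upstream from the faulty application and reports every path from a raw source that never crosses an anonymizing step.

```python
def clear_text_paths(upstream, anonymizers, target):
    """Return the source-to-target paths that bypass every anonymizing step."""
    paths = []
    stack = [[target]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in anonymizers:
            continue  # data is anonymized from here on: this branch is safe
        parents = upstream.get(node, [])
        if not parents:
            # Reached a raw source without meeting an anonymizer: a leak path.
            paths.append(list(reversed(path)))
            continue
        for parent in parents:
            if parent not in path:  # guard against cycles
                stack.append(path + [parent])
    return paths


# Hypothetical lineage: each node lists the nodes it reads from.
upstream = {
    "reporting_app": ["join_orders"],
    "join_orders": ["mask_pii", "crm_db"],  # second input bypasses the masking
    "mask_pii": ["extract_customers"],
    "extract_customers": ["crm_db"],
    "crm_db": [],
}
anonymizers = {"mask_pii"}

leaks = clear_text_paths(upstream, anonymizers, "reporting_app")
# leaks == [["crm_db", "join_orders", "reporting_app"]]
```

In this made-up graph the culprit is `join_orders`, which reads the raw database directly in addition to the masked output.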
From ETL pipeline to a fine-grained data-processing meta-model:
• A job has a set of atomic processing steps
• A processing step uses output data from processing steps belonging to upstream jobs
Demo Data Model
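A minimal Python sketch of this meta-model, assuming two hypothetical classes `Job` and `Processing` and made-up job names; the actual demo model may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Processing:
    """An atomic processing step; inputs are names of upstream processings."""
    name: str
    inputs: list = field(default_factory=list)


@dataclass
class Job:
    """A job owns a set of atomic processings."""
    name: str
    processings: list = field(default_factory=list)


def upstream_jobs(job, all_jobs):
    """Names of the other jobs owning a processing whose output this job consumes."""
    owner = {p.name: j.name for j in all_jobs for p in j.processings}
    return sorted({
        owner[input_name]
        for p in job.processings
        for input_name in p.inputs
        if input_name in owner and owner[input_name] != job.name
    })


extract = Job("extract", [Processing("read_customers"), Processing("read_orders")])
transform = Job("transform", [
    Processing("mask_email", inputs=["read_customers"]),
    Processing("join_orders", inputs=["mask_email", "read_orders"]),
])
load = Job("load", [Processing("write_report", inputs=["join_orders"])])
jobs = [extract, transform, load]

transform_upstream = upstream_jobs(transform, jobs)
load_upstream = upstream_jobs(load, jobs)
```

Job-level dependencies are derived from the fine-grained processing links, which is why the model stores inputs on the processing rather than on the job.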
Pipeline Backbone
Extract job metadata
Build the graph with Cypher
Extract ETL metadata
Inject metadata into the Cypher query
Build the related part of the actionable graph
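One way to inject extracted metadata into a Cypher query is to keep the query text fixed and pass the metadata as parameters. The sketch below only builds the query and its parameter map; the labels `Job` and `Processing`, the relationship types `HAS_PROCESSING` and `USES_OUTPUT_OF` and the function name are assumptions, not the demo's actual schema.

```python
def build_lineage_query(job_name, processings):
    """Return (query, params) merging one job and its processings into the graph.

    processings is an iterable of (processing_name, input_names) pairs.
    """
    query = (
        "MERGE (j:Job {name: $job})\n"
        "WITH j\n"
        "UNWIND $processings AS proc\n"
        "MERGE (p:Processing {name: proc.name})\n"
        "MERGE (j)-[:HAS_PROCESSING]->(p)\n"
        "WITH p, proc\n"
        "UNWIND proc.inputs AS input_name\n"
        "MERGE (up:Processing {name: input_name})\n"
        "MERGE (p)-[:USES_OUTPUT_OF]->(up)"
    )
    params = {
        "job": job_name,
        "processings": [
            {"name": name, "inputs": list(inputs)} for name, inputs in processings
        ],
    }
    return query, params


# Hypothetical metadata extracted from one ETL job.
query, params = build_lineage_query(
    "transform",
    [("mask_email", ["read_customers"]),
     ("join_orders", ["mask_email", "read_orders"])],
)
```

The pair can then be handed to a Neo4j driver session, for example `session.run(query, params)` with the official Python driver. Using `MERGE` rather than `CREATE` means each job only adds its own part of the graph, and re-running the extraction does not duplicate nodes.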
Demo
Takeaways
- Where has PBT gone?
- Data lineage
- Power of graphs
- Power of graph traversal

- Adding many randomly generated nodes obscures the demo
- The effect is counterproductive
- Data lineage shown in a simple case
- Metadata!
- ⇒ metadata is easy to use with Cypher
Thank you!
Questions welcome
Thanks to Pierre from the Neo4j team for technical assistance
Thanks to Eva/Cedric for the opportunity

Data Lineage, Property Based Testing & Neo4j
