MEREDITH + GRAPHCONNECT 2020
Identity Graph at Scale
Transforming Billions of Page Views
to Unique Identity Profiles in Publishing
Who we are: Meredith
We are Meredith Corporation, a publicly
held media and marketing services
company founded upon serving our
customers and committed to building
value for our shareholders.
We are on the pulse of pop culture,
entertainment, food, fashion and
lifestyle, news, business and finance,
and sports.
Brands
Our Brands are in nearly every grocery store, gas station and dentist
office across the U.S.
Digital Presence
Our multi-channel digital approach to media provides touchpoints
through varied devices from mobile, desktop, console and OTT.
Analytics
Our focus is to provide our consumers with top content relevant
to their daily lives while providing directed audiences for advertising
that contributes directly to dollars spent by our consumers.
Programmatic Targeting
Our unparalleled database delivers custom content at the
point-of-decision, leveraging first-party data and unique distribution
resources to engage our audience.
R & D
Our Research and Development projects focus on cutting-edge
technology to allow Meredith to stand alone as the premier publishing
and content platform in the U.S.
BACKGROUND
Meredith Digital’s Mission
Serve well-defined audiences, deliver the messages of national and local advertisers, and extend our brand franchises
and expertise to related markets. The new paradigm of publishing is personalization.
ENTERTAINMENT + STYLE: 453M
FOOD: 531M
PARENTING: 18M
HOME + LIFESTYLE: 148M
TRAVEL + LUXURY: 64M
HEALTH + WELLNESS: 31M
Source: comScore Multiplatform, December 2018
OUR TOUCHPOINTS
We connect with her across multiple touch points throughout her day
Channels: Voice, Audio, Mobile, Print, Video
[Timeline figure: consumer actions paired with media moments, 5:30 AM to 10:30 PM]
Timed consumer actions:
• 5:30 AM – Wake up…
• 5:45 AM – Get some exercise (Morning yoga flow from Shape)
• 7:30 AM – Drop kids at school
• 5:00 PM – Pick up kids
• 8:00 PM – Kids to bed
• 10:30 PM – Lights out
Consumer action – media moment:
• Daily news – “Play news from People”
• School lunch supplies are low! – “Add fruit to shopping list”
• Social media – Check Instagram & Facebook
• Daily commute – Parents Podcast
• What’s for dinner? – QR scan “Asian Salmon Bowls” in Real Simple
• Daily commute – EW’s Game of Thrones Podcast
• Make dinner – “Cook Salmon Bowls”
• Self care – Daily meditation from Health
• Stock up – Place Shipt order
• Me time – IGTV Locals from T&L
• Cleanup emergency! – “How to remove soy sauce?”
Remaining time marks: 6:05 AM, 6:15 AM, 6:45 AM, 7:45 AM, 11:15 AM, 12:20 PM, 3:15 PM, 5:30 PM, 6:00 PM, 7:30 PM, 8:30 PM
Measuring the Mutable
The fight against online tracking and analytics
Cookies are constantly changing.
Firewalls, antivirus software, and diligent digital users all
contribute to cookie loss.
Cross-device challenges.
Typical users interact with our brands across
many devices, but cookies are confined to a single device.
Browsers.
Safari (Intelligent Tracking Prevention 2.3), Chrome, and Firefox all have new
security standards to inhibit third-party cookies.
Models Made of Sand
Audience Propensity on unstable Cookies
Models Cost Money.
Even the Best-in-class audience
segmentation models suffer from cookie loss
Activation is Paramount.
Propensity Models are only as good as their
activation
Advertising in the Dark.
Cookie loss leads to fewer click-throughs
Creating a Unified view of a Digital User
Confluence of Data
First-Party Data
Third-Party Data
• Various data stream providers and touchpoints
• A true Digital footprint requires all the sources
• No one stream has all of the information
• Cookie Recovery through Connections
• Creating Profiles with Longevity
• More touchpoints = Better Models
Spotting Snake Oil Sellers
Investing in data you can trust
• Identity Resolution Vendors are a dime a dozen
but can cost a lot more
• How can you validate vendors with so much
anonymous traffic?
• Graphs + First-Party data give the power to
validate
• Look for linkages with too many locations,
repetitive timestamps, and multiple emails to
discredit faulty connections
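The checks in the last bullet can be sketched in Python. This is only an illustration of the idea, not the checks used in the talk: the `Linkage` record shape, its field names, and all three thresholds are hypothetical and would need tuning against real first-party data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Linkage:
    """One vendor-supplied link between a cookie and a partner id (hypothetical shape)."""
    cookie: str
    partner_id: str
    locations: set = field(default_factory=set)      # distinct locations seen for the link
    timestamps: list = field(default_factory=list)   # observation timestamps
    emails: set = field(default_factory=set)         # distinct hashed emails tied to the link


def suspicious_reasons(link, max_locations=5, max_repeats=3, max_emails=1):
    """Return the reasons a linkage looks faulty; an empty list means it passed.

    All thresholds are illustrative assumptions.
    """
    reasons = []
    if len(link.locations) > max_locations:
        reasons.append("too many locations")
    # One timestamp repeated many times suggests stamped-in-bulk rather than observed events.
    if link.timestamps and max(Counter(link.timestamps).values()) > max_repeats:
        reasons.append("repetitive timestamps")
    if len(link.emails) > max_emails:
        reasons.append("multiple emails")
    return reasons


good = Linkage("c1", "p1", {"NYC"}, ["2019-01-01T08:00"], {"h1"})
bad = Linkage("c2", "p2", {"NYC", "LA", "SF", "CHI", "MIA", "SEA"},
              ["2019-01-01T00:00"] * 10, {"h1", "h2"})
print(suspicious_reasons(good))
print(suspicious_reasons(bad))
```

Links that return any reason would be candidates for exclusion before they reach the graph.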
A Timeline of Development
From Proof of Concept to Production
THE SOLUTION: IDENTITY GRAPH
Proof of Concept
Data Size:
• 3 months of data from first-party-only sources
• 100’s MM of cookies, 0.5 TB
Result:
• Determined graph model
• Rudimentary import process
• Pattern matching with Cypher
Next Steps:
• Scale to 1 year
• Import/export process
• Include third-party data
Moving Toward Production
Result:
• Discovered APOC and Graph Algos
• UnionFind algorithm bug
• APOC parallel import procedures
• Seeding UnionFind workaround
Reaching Production Scale
Data Size:
• 20+ months of data from first- and third-party sources, 4.4 TB
Result:
• UnionFind algorithm with seeding
• Custom Java import/export procedure
Building on Foundations
Proof of Concept
THE SOLUTION
• Graph Model Development
• Importing Data with Neo4j-Admin Import
• MATCH (u:User)-[]->(m)<-[]-(u2:User)
WHERE u.uid = 'abc123' AND u <> u2
RETURN u, u2
Probability:
• 26 letters + 10 digits = 36 possibilities per character
• 36*36*36…*36 (32 characters) = 36^32 possible ids
• Chance any two people get the same id is
1/(6.3340287e+49) =~ 0
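The arithmetic on this slide can be checked in a few lines of Python. The pairwise figure comes straight from the slide. The bound for a whole population is an added illustration that assumes ids are drawn uniformly at random, and it borrows the 346M cookie count from the outcomes slide.

```python
ALPHABET_SIZE = 26 + 10   # letters plus digits, as on the slide
ID_LENGTH = 32            # characters per id, implied by 36^32

id_space = ALPHABET_SIZE ** ID_LENGTH   # number of possible ids
p_pair = 1 / id_space                   # chance that two given ids are identical


def any_collision_bound(n_ids, space=id_space):
    """Upper bound on the chance that any two of n_ids uniformly random ids collide."""
    # Union bound over all n*(n-1)/2 pairs, each colliding with probability 1/space.
    return n_ids * (n_ids - 1) / (2 * space)


print(f"possible ids: {id_space:.7e}")
print(f"pairwise collision chance: {p_pair:.3e}")
print(f"bound for 346M ids: {any_collision_bound(346_000_000):.3e}")
```

Even across hundreds of millions of ids the bound stays vanishingly small, which is why the uid can be treated as unique.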
Building on Foundations
Proof of Concept
THE SOLUTION
• Graph Model Development
⁃ Different relationships for cookie observation, URL
visits, and IP/device-type visits
• Import with Neo4j Admin import
⁃ Data from AWS Redshift using the UNLOAD command
⁃ CSV using | delimiter
• Basic Pattern Matching
MATCH (u:User)-[]-(m)-[]-(u2:User)-[]-(m2)-[]-(u3:User)
WHERE u <> u2 AND u <> u3 AND u2 <> u3
RETURN u, m, u2, m2, u3
LIMIT 100
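A rough sketch of the pipe-delimited files such an import consumes. The header fields (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follow the neo4j-admin import CSV convention. The file names, the input shape, and the use of the `Cookie2` label and `OBSERVED_WITH` type (both taken from the later import slide) are assumptions for illustration.

```python
import csv
import io


def build_import_files(pairs):
    """Build pipe-delimited node and relationship CSV text from (uid, cookie2) pairs.

    Returns a dict of hypothetical file name -> file content.
    """
    buffers = {name: io.StringIO() for name in ("users.csv", "cookies.csv", "observed.csv")}
    writers = {name: csv.writer(buf, delimiter="|", lineterminator="\n")
               for name, buf in buffers.items()}

    # Header rows tell the importer which column is the id and which id space it lives in.
    writers["users.csv"].writerow(["uid:ID(User)", ":LABEL"])
    writers["cookies.csv"].writerow(["cookie2:ID(Cookie2)", ":LABEL"])
    writers["observed.csv"].writerow([":START_ID(User)", ":END_ID(Cookie2)", ":TYPE"])

    seen_users, seen_cookies = set(), set()
    for uid, cookie in pairs:
        if uid not in seen_users:          # each node may appear only once per id space
            seen_users.add(uid)
            writers["users.csv"].writerow([uid, "User"])
        if cookie not in seen_cookies:
            seen_cookies.add(cookie)
            writers["cookies.csv"].writerow([cookie, "Cookie2"])
        writers["observed.csv"].writerow([uid, cookie, "OBSERVED_WITH"])

    return {name: buf.getvalue() for name, buf in buffers.items()}


files = build_import_files([("u1", "c1"), ("u1", "c2"), ("u2", "c2")])
print(files["observed.csv"])
```

The importer would then be pointed at these files with its `--delimiter` option set to `|`.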
Building on Foundations
Proof of Concept
THE SOLUTION
Problems:
• Graph was static – CSV import was
too slow
• Other streams of cookie data were
not yet included
• IP addresses gave conflicting connections
• URL visits solved recommendation, not
identity
Building on Foundations
Proof of Concept
THE SOLUTION
Next Steps:
• Scale to 1+ year
• Improve Import Procedure
• Develop Daily Import/Export
Procedure
• Include other streams of cookie data
• Prevent multiple relationships between
the same cookies
Moving Toward Production
• Scaling to 6 months of data with 1
stream – 2 TB database
• Optimizing Neo4j Admin Import
• GraphConnect 2018
• Using APOC Periodic Iterate for
Import and Export Procedures
• Found that some identity partners
showed hyper-connections
Learning From Mistakes
Moving Toward Production
• GraphConnect 2018
⁃ Learn about APOC & Algos
• Pattern Matching is slow
⁃ APOC Subgraph Procedure
• Cypher is not parallelized
⁃ APOC Periodic Iterate is your
friend
• Utilizing Graph Algorithms
Learning From Mistakes
Moving Toward Production
Union Find
• Calculate and enumerate
all disjoint subgraphs
within a graph
• For every maximal connected
subgraph in a database,
provide a unique integer
to represent that
subgraph
Learning From Mistakes
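A minimal Python sketch of what Union Find computes, assuming a small in-memory edge list. It illustrates the algorithm only, not the Neo4j Graph Algorithms implementation, and the node names are made up.

```python
def connected_components(nodes, edges):
    """Label every node with an integer shared by all nodes in its connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, then point every node on the path straight at it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two sets

    component_ids = {}
    labels = {}
    for n in nodes:
        root = find(n)
        # The first time a root is seen it receives the next unused integer.
        labels[n] = component_ids.setdefault(root, len(component_ids))
    return labels


nodes = ["user1", "cookieA", "user2", "user3", "cookieB"]
edges = [("user1", "cookieA"), ("user2", "cookieA"), ("user3", "cookieB")]
print(connected_components(nodes, edges))
```

Here `user1` and `user2` share a cookie and end up with the same integer, while `user3` gets its own.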
Moving Toward Production
Problems
• Trouble scaling to more
than 2 billion – the “huge”
parameter was not
working
• No seeding available –
every subgraph id was
shuffled on each run
Learning From Mistakes
Moving Toward Production
Solutions
• Only use the data you need –
trim the fat
• Use a dummy property and
run APOC to check when the
seed changed
Learning From Mistakes
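The seeding idea can be sketched as follows. This is an assumption-laden illustration, not the workaround used in the talk or Neo4j's seeded algorithm: it takes the previous run's ids as `seeds`, keeps the smallest earlier id in each component, and reports only the nodes whose id changed, so that only those would need writing.

```python
def seeded_components(nodes, edges, seeds):
    """Return (labels, changed), keeping ids from the previous run stable where possible.

    seeds maps node -> component id from the previous run; new nodes are simply absent.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Smallest previous id inside each component, when any member was seeded.
    best = {}
    for n in nodes:
        if n in seeds:
            root = find(n)
            best[root] = min(best.get(root, seeds[n]), seeds[n])

    next_id = max(seeds.values(), default=-1) + 1   # fresh ids start above every old id
    labels = {}
    for n in nodes:
        root = find(n)
        if root not in best:
            best[root] = next_id
            next_id += 1
        labels[n] = best[root]

    changed = {n: cid for n, cid in labels.items() if seeds.get(n) != cid}
    return labels, changed


day1, _ = seeded_components(["a", "b", "c"], [("a", "b")], {})
day2, changed2 = seeded_components(["a", "b", "c", "d"], [("a", "b"), ("c", "d")], day1)
print(day2, changed2)
```

On the second day only the new node `d` appears in `changed2`; the ids assigned on day one are untouched.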
Moving Toward Production
Parallel Imports
CALL apoc.periodic.iterate(
  // Driving statement: stream the earliest sighting of each cookie pair over JDBC
  'CALL apoc.load.jdbc($credentials,
     "SELECT cookie1, cookie2, MIN(timestamp) AS timestamp
      FROM cookie_table WHERE cookie2 IS NOT NULL
      GROUP BY cookie1, cookie2") YIELD row
   RETURN row',
  // Action statement: upsert both nodes, then link them only if no link exists yet
  'WITH row, datetime(replace(toString(row.timestamp), " ", "T")) AS timestamp
   MERGE (u:User {uid: trim(row.cookie1)})
   SET u:IsNew, u.last_obs = timestamp, u.first_obs = coalesce(u.first_obs, timestamp)
   WITH row, u, timestamp
   // OPTIONAL MATCH keeps users that have no relationships yet
   OPTIONAL MATCH (u)-[:OBSERVED_WITH]->(x)
   WITH u, row, timestamp, collect(DISTINCT x) AS seen
   OPTIONAL MATCH (u)-[:OBSERVED_BAD]->(y)
   WITH u, row, timestamp, seen, collect(DISTINCT y) AS seen_bad
   MERGE (c:Cookie2 {cookie2: trim(row.cookie2)})
   SET c:IsNew, c.last_obs = timestamp, c.first_obs = coalesce(c.first_obs, timestamp)
   FOREACH (n IN CASE WHEN NOT c IN seen AND NOT c IN seen_bad THEN [1] ELSE [] END |
     CREATE (u)-[r1:OBSERVED_WITH]->(c) SET r1.first_obs = timestamp)',
  {batchSize: 100, iterateList: false, parallel: true, params: {credentials: $credentials}});
Learning From Mistakes
Moving Toward Production
Parallel Imports – Code Breakdown
CALL apoc.periodic.iterate('DRIVING STATEMENT', 'ACTION STATEMENT',
{batchSize:100, iterateList:false, parallel:true, params:{credentials:$credentials}});
DRIVING STATEMENT = CALL apoc.load.jdbc($credentials, "SQL STATEMENT") YIELD row
ACTION STATEMENT = lots of MERGEs and conditional lookups to prevent creating multiple
relationships
Note: Pass parameters through the params map.
Issues: If the same nodes are being written or merged across multiple threads, that batch will fail.
We attempted to adjust by changing the number of threads and the batch size.
Learning From Mistakes
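To see why parallel batches collide, the sketch below splits rows into batches the way a fixed batch size would and lists the keys that land in more than one batch. Those are the nodes two threads may try to lock at once. The row shape and the `contended_keys` helper are hypothetical, and sorting rows by key is shown only as one possible mitigation, not something the talk describes.

```python
from collections import defaultdict


def contended_keys(rows, batch_size, key="cookie1"):
    """Return keys that appear in more than one batch when rows are split in order."""
    batches_by_key = defaultdict(set)
    for index, row in enumerate(rows):
        batches_by_key[row[key]].add(index // batch_size)   # batch number of this row
    return {k for k, batches in batches_by_key.items() if len(batches) > 1}


rows = [{"cookie1": "a"}, {"cookie1": "b"}, {"cookie1": "a"}, {"cookie1": "c"}]
print(contended_keys(rows, batch_size=2))

# Grouping equal keys together leaves only keys that straddle a batch boundary.
ordered = sorted(rows, key=lambda r: r["cookie1"])
print(contended_keys(ordered, batch_size=2))
```

The same check can be run with `key="cookie2"`, since the action statement merges on both node types.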
Moving Toward Production
APOC Subgraph All
Previously, pattern matching meant expensive lookups. This query covers only 6 hops; imagine 10 hops out:
MATCH (u:User) WHERE u.uid = '1234'
WITH u MATCH (u)-[]-(m)-[]-(u2:User)-[]-(m2)-[]-(u3:User)-[]-(m3)-[]-(u4:User)
WHERE u <> u2 AND u <> u3 AND u <> u4 AND u2 <> u3 AND u2 <> u4 AND u3 <> u4
RETURN u, m, u2, m2, u3, m3, u4
LIMIT 100
APOC is better, faster, and easier:
MATCH (user:User) WHERE user.uid = '1234'
CALL apoc.path.subgraphAll(user, {maxLevel:10, filterStartNode:true, labelFilter:'>User'}) YIELD nodes
UNWIND nodes AS no RETURN no
Learning From Mistakes
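Conceptually, a subgraph expansion is a breadth-first search that visits each node once, whereas the fixed-length pattern enumerates every path. The Python sketch below shows that idea on a plain adjacency dict. It is not APOC's implementation and does not model `labelFilter`; the graph is made up.

```python
from collections import deque


def subgraph_nodes(adjacency, start, max_level):
    """Return every node within max_level hops of start, visiting each node once."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_level:
            continue                      # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:     # global uniqueness keeps the work linear
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# user1 - cookieA - user2 - cookieB - user3
adjacency = {
    "user1": ["cookieA"],
    "cookieA": ["user1", "user2"],
    "user2": ["cookieA", "cookieB"],
    "cookieB": ["user2", "user3"],
    "user3": ["cookieB"],
}
print(sorted(subgraph_nodes(adjacency, "user1", max_level=2)))
print(sorted(subgraph_nodes(adjacency, "user1", max_level=10)))
```

Raising the hop limit from 2 to 10 changes one argument instead of adding four more hops to the pattern.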
Moving Toward Production
Problems:
• Configuring a graph model to
account for two streams of data
(two sources of truth)
• APOC import in parallel had failed
batches – locking issues with writes
• Seeding workaround required
reevaluating billions of ids each day
to see what changed between
UnionFind runs each day
Learning From Mistakes
Moving Toward Production
Next Steps:
• Scale to 20+ Months
• Implement UnionFind +
Seeding as a single algorithm
• Daily importing and
exporting
• Optimize Heap usage
Learning From Mistakes
Creating a Unified view of a Digital User
Reaching Production Scale
• Initial Runtime 28+ Hours for daily imports
• Optimized UnionFind – Only write on Changes
• Rewrote Preprocessing steps into Custom Java
Procedures
• Dropped runtime down to 14 hrs
• Improved Heap usage
• 20+ Months of data – 4+ TB database
• Custom Java Procedure Import/Exports
• UnionFind with Seeding
• Custom Java Procedure preprocessing
• Variable Heap and Page Cache
Problem:
• Constantly fighting growing Heap
Demand
• 280 GB Heap -> 300 GB -> 330 GB
• More heap means less page cache
Solution: Variable Heap and Page Cache
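The trade-off is simple arithmetic: heap and page cache share the same physical memory. The sketch below uses the three heap sizes from the slide, but the machine size and the OS reserve are hypothetical figures chosen only to make the numbers concrete. In Neo4j configuration the two sides correspond to the `dbms.memory.heap.max_size` and `dbms.memory.pagecache.size` settings.

```python
def page_cache_gb(total_ram_gb, heap_gb, os_reserve_gb=16):
    """Memory left for the page cache once the heap and an OS reserve are set aside."""
    remaining = total_ram_gb - heap_gb - os_reserve_gb
    if remaining < 0:
        raise ValueError("heap plus OS reserve exceeds total RAM")
    return remaining


TOTAL_RAM_GB = 488   # hypothetical machine size, not stated in the slides

for heap in (280, 300, 330):   # heap sizes from the slide
    print(f"heap {heap} GB -> page cache {page_cache_gb(TOTAL_RAM_GB, heap)} GB")
```

One reading of "variable" is that the split is chosen per workload phase, with more heap while algorithms run and more page cache otherwise; the slides do not spell this out.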
• Dropped runtime down to 12 hrs
• 14.4 Billion Nodes
• 67.6 Billion Properties
• 20.6 Billion Relationships
• 20 Months of data
OUTCOMES
Illuminating the Anonymous
Measuring
Understanding customers over time
Improved targeting for more relevant
content and advertising campaigns.
• 241.6 days on average per profile
• 346M cookies to 163M profiles
• 25% of traffic has a profile
• Visits per profile: from 3.9 on average to 23.8 on average, about 6.1 times as many
(the slide reports this as a 612% increase in visits per profile)
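The headline ratios can be re-derived from the rounded figures on this slide. This is only a sanity check: the slide's own percentage was presumably computed from unrounded values, so small differences are expected.

```python
cookies, profiles = 346e6, 163e6
visits_before, visits_after = 3.9, 23.8

cookies_per_profile = cookies / profiles          # cookies stitched into one profile
visit_ratio = visits_after / visits_before        # how many times more visits are seen
percent_increase = (visit_ratio - 1) * 100        # the same change as a percentage increase

print(f"cookies per profile: {cookies_per_profile:.2f}")
print(f"visit ratio: {visit_ratio:.2f}x")
print(f"percent increase: {percent_increase:.0f}%")
```

From the rounded inputs the ratio is close to the slide's 612 figure when read as a percentage of the original, while the increase itself comes out lower.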
OUTCOMES
Salient Takeaways To Scale
Learning from others’ experiences
• Identify what data matters: evaluate what data is needed to answer the question.
• APOC and Algos are your friend: APOC Periodic Iterate and Graph Algorithms use multiple cores.
• Simplify your problem: explore different graph models and determine which is the simplest.
• Custom Java procedures scale: custom Java procedures can empower your project.
• Neo4j community and engineers: when issues arise, seek help from professionals and active community members.
Thank You
Contact: Benjamin.Squire@Meredith.com
LinkedIn: linkedin.com/in/benjamin-squire/
