Moving to ScyllaDB - A Graph of Billions Scale
Saurabh Verma, Principal Engineer
K S Sathish, VP Engineering
Presenters
K S Sathish, VP Engineering
Sathish heads engineering at Zeotap, Bangalore, India
Responsible for engineering strategy and technical architecture
17+ years of experience
Has been building big data stacks for various verticals for the past 8 years
Saurabh Verma, Principal Engineer
Saurabh is a Principal Engineer at Zeotap.
Leads the Data Engineering team for the Identity product suite
Responsible for architecture, design and engineering delivery of the Identity product
Has spent the last 6 years building big data systems
■ Identity and Data platform - People Based data
■ Enables Brands to better understand their customers - 360º View
■ World’s Largest Independent People Graph
■ Fully Privacy/GDPR compliant
■ 80+ Data partners
■ Catering to Ad-Tech and MarTech
ZEOTAP
Identity Resolution
Use Cases
Identity Resolution
● Singular View of all Identities of a Person
● Multiple Identity sources
● Different Identifiers
○ Web Cookies
○ Mobile
○ Partner Platform
○ CRM
Linkages between these identifiers are more important than the individual identifiers
Identity Use cases
■ Match Test - Reference IDs JOIN with ID universe
■ Export - IDs retrieved based on Match and pushed out
■ Reporting
■ Compliance - Opt Out - Disconnect
■ 3rd party extension
■ Identity Quality
■ Short SLAs for Freshness of Data, meaning quick ingestion and retrieval
Data Access
Old Implementation
Reports
Redshift
Athena
Partner 1
Partner 2
Partner n
Processing
Curated
Denormalized
Data S3
Processing
Client ID sets Match Test
Exports
Identity Tech - Reqs
■ Workload
● High Read and High Write - Ingestion and Retrieval can happen simultaneously
■ Write
● Ingestion - Streaming and Batch
● Deletion - Streaming and Batch
● Above 50K writes per second to meet SLAs
■ Housekeep
● TTL - based on conditions
Identity Tech- Reqs Cont...
■ Read
● Lookup Matching IDs
● Retrieve Linked IDs
● Retrieve Linked IDs based on conditions
■ ID Type - Android ID, website cookie
■ Property - Recency, quality, country
● Count
● Depth filters
Time to Change
Reports
Processing
Client ID sets Match Test
Exports
ID Graph??
Partner 1
Partner 2
Partner n
Processing
Introducing GraphDB
Why Native Graph
Native Graph Database (JanusGraph)
■ Low latency neighbourhood traversal (OLTP) - Lookup & Retrieve
● Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store
● Runtime proportional to the client data set & overlap percentage
■ Lower Data Ingestion SLAs
● Ingestion modeled as UPSERT operations
● Aligned with Streaming & Differential data ingestions
● Economically lower footprint to run in production
■ Linkages are first-class citizens
● Linkages have properties and traversals can leverage these properties
● On the fly path computation
■ Analytics Stats on the Graph, Clustering (OLAP)
● Bulk export and massive parallel processing available with GraphComputer integration with Spark, Hadoop, Giraph
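The UPSERT framing of ingestion can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the Zeotap ingester: the `upsert_edge` function, the dictionary standing in for the edge store, and the "keep the latest timestamp per data partner" merge rule are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the edge store; keys are unordered ID pairs.
EdgeKey = Tuple[str, str]
EdgeProps = Dict[str, float]


def upsert_edge(store: Dict[EdgeKey, EdgeProps], id_a: str, id_b: str,
                partner: str, seen_at: float) -> EdgeProps:
    """Insert the edge if it is missing, otherwise merge the new observation into it."""
    key = (id_a, id_b) if id_a <= id_b else (id_b, id_a)  # treat the edge as undirected
    props = store.setdefault(key, {})
    # Assumed merge rule: keep the most recent timestamp seen per data partner.
    props[partner] = max(props.get(partner, seen_at), seen_at)
    return props


store: Dict[EdgeKey, EdgeProps] = {}
upsert_edge(store, "c3b2a1ed", "a711a4de", "dp1", 100.0)
upsert_edge(store, "a711a4de", "c3b2a1ed", "dp1", 250.0)  # same edge, newer event
upsert_edge(store, "c3b2a1ed", "a711a4de", "dp2", 180.0)  # same edge, new partner
```

Because every write is a merge, replaying the same streaming or differential batch leaves the store unchanged, which is what makes this style of ingestion tolerant of retries.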
And… Concise solutions to the right problems
■ Find the path between 2 user IDs
SQL
(select * from idmvp
where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13'
and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4') -- depth = 1
union
(select * from idmvp t1, idmvp t2
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4') -- depth = 2
union
(select * from idmvp t1, idmvp t2, idmvp t3
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
and t2.id2 = t3.id1 and t2.idtype2 = t3.idtype1
and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4') -- depth = 3
Gremlin Query
g.V()
.has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type','id_mid_13')
.repeat(both().simplePath().timeLimit(40000))
.until(has('id','5c557df3-df47-4603-64bc-5a9a63f22245').has('type','id_mid_4'))
.limit(10).path()
.by('id')
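To make the shape of the Gremlin traversal concrete, here is a rough plain-Python analogue: a breadth-first search that never revisits a vertex within a path (the role `simplePath()` plays) and stops extending a path once it reaches the target (the role of `until()`). The toy graph, the `find_paths` name and the `max_depth` parameter are illustrative assumptions, and the time limit is not modeled.

```python
from collections import deque
from typing import Dict, List, Set


def find_paths(adj: Dict[str, Set[str]], start: str, target: str,
               max_depth: int = 3, limit: int = 10) -> List[List[str]]:
    """Return up to `limit` simple paths from start to target, shortest first."""
    paths: List[List[str]] = []
    queue = deque([[start]])
    while queue and len(paths) < limit:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue  # like until(): stop extending once the target is reached
        if len(path) - 1 >= max_depth:
            continue  # depth bound, so the search cannot run away
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:  # like simplePath(): no repeated vertices
                queue.append(path + [nxt])
    return paths


# Hypothetical toy graph; edges are stored in both directions, as with both().
adj = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}
print(find_paths(adj, "A", "D"))
```

One loop handles any depth up to `max_depth`, whereas the SQL version needs one more self-join and one more `union` branch for every extra hop.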
POCs and Findings
POC Hardware
                     | Janus On Scylla | Aerospike      | OrientDB       | DGraph
Server Configuration | 3 x i3.2xLarge  | 3 x i3.2xLarge | 3 x i3.2xLarge | 3 x r4.16xLarge
Client Configuration | 3 x c5.18xLarge
Replication Factor   | 1
Store Benchmarking - 3B IDs, 1B edges
Criterion | JanusGraph with ScyllaDB | Aerospike | OrientDB | DGraph
Sharded, Distributed |
Storage Model | LPG | Custom | LPG | RDF
Cost of ETL before Ingestion | Lower | Lower | Lower | Higher
Native Graph DB |
Node / Edge Schema Change without downtime? |
Benchmark dataset load completed? |
Acceptable Query Performance? | | | - | -
Production Setup Running Cost | Lower | Higher | - | -
Production Setup Operational Management (based on our experience with AS in prod) | Higher | Lower | - | -
✓ ✓ ✓
✓✓✓
✓✓✓ ✓
✓ ✓
✓ ✓
❌
❌
❌ ❌
The Data Model
ID Graph Data Model
Vertices
● label: id, type: online, idtype: adid_sha1, id: c3b2a1ed, os: 'android', country: 'ESP', dpid: {1}, ip: [1.2.3.4]
● label: id, type: online, idtype: adid, id: a711a4de, os: 'android', country: 'ITA', dpid: {2,3,4}
● label: id, type: online, idtype: googlecookie, id: 01e0ffa7, os: 'android', country: 'ESP', dpid: {1,2}
● label: id, type: online, idtype: adid, id: 412ce1f0, os: 'android', country: 'ITA', dpid: {2,4}, ip: [1.2.3.4]
● label: id, type: offline, idtype: email, id: abc@gmail.com, os: 'ios', country: 'ESP', dpid: {2,4}
Edges
● linkedTo: {dp1: t1, dp2: t2, quality: 0.30, linkType: 1}
● linkedTo: {dp1: t1, dp2: t2, dp3: t3, dp4: t4, quality: 0.55, linkType: 3}
● linkedTo: {dp1: t1, quality: 0.25, linkType: 3, linkSource: ip}
● linkedTo: {dp2: t2, dp3: t3, dp4: t4, quality: 0.71, linkType: 9}
Expressiveness of Model
Quality
Filtered Links
ID Attribute
Filtering
Recency
Filtered Links
Extensible
Data Model
Transitive
Links
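The callouts above (quality-filtered links, recency-filtered links, ID attribute filtering, transitive links) can be illustrated with a small Python sketch. It is only an approximation of the idea: the `Edge` dataclass, the `linked_ids` function, the edge endpoints and the integer timestamps are assumptions, since the slide does not state which vertices each link connects. Vertex IDs and countries are taken from the data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    quality: float
    last_seen: int  # assumed: most recent partner timestamp on the link


def linked_ids(vertices: Dict[str, Dict[str, str]], edges: List[Edge], start: str,
               min_quality: float, min_last_seen: int, country: str,
               max_depth: int) -> Set[str]:
    """Collect IDs reachable from `start` over links that pass the filters."""
    # Keep only links that satisfy the quality and recency thresholds.
    neighbours: Dict[str, Set[str]] = {}
    for e in edges:
        if e.quality >= min_quality and e.last_seen >= min_last_seen:
            neighbours.setdefault(e.src, set()).add(e.dst)
            neighbours.setdefault(e.dst, set()).add(e.src)
    seen, frontier = {start}, {start}
    for _ in range(max_depth):  # transitive links, bounded by depth
        frontier = {n for v in frontier for n in neighbours.get(v, ())} - seen
        seen |= frontier
    # ID attribute filtering is applied to the result set, not to the path.
    return {v for v in seen - {start} if vertices[v]["country"] == country}


# Vertex attributes follow the slide; edge endpoints and timestamps are made up.
vertices = {
    "c3b2a1ed": {"idtype": "adid_sha1", "country": "ESP"},
    "a711a4de": {"idtype": "adid", "country": "ITA"},
    "01e0ffa7": {"idtype": "googlecookie", "country": "ESP"},
    "412ce1f0": {"idtype": "adid", "country": "ITA"},
}
edges = [
    Edge("c3b2a1ed", "a711a4de", 0.55, 20),
    Edge("a711a4de", "01e0ffa7", 0.71, 30),
    Edge("c3b2a1ed", "412ce1f0", 0.25, 10),
]
result = linked_ids(vertices, edges, "c3b2a1ed", 0.30, 15, "ESP", 2)
```

In this toy data the ESP cookie is reached only through an ITA vertex, which is why the country filter is applied to the results rather than used to prune the traversal.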
Streaming Ingestion
Streaming Ingestion
■ Workload
● 300 - 400 million data points per day
● Dedupe & Enrich
● Merge
● Final snapshot
■ Batch Process
● Spark Join
● Merge runtime - 4 to 6 hours
● Redshift load time - 2 to 3 hours
● Painful Failures
Stream & Batch
Dedup
Enrich
S3
Merge
Redshift
Streaming Ingestion
■ And...
● Time - 2 to 3 hours
● Join Vs Lookup
● All Stream
● Failures - down by 83%
Stream
& Batch
Dedup
Enrich
Streaming
Graph Ingester
Streaming
Graph Ingester
Vertex
Edge
KV Store
Findings
■ Consider Splitting Vertex Load from Edge Load
● Write behaviour is different
● Achieve overall better QPS
■ Benchmark Vertex load speed against CPU utilization
● Observed 5K TPS per server core
■ Consider Client Side Caching - Edge Load
● One lookup and One write with many duplicate IDs - Too many disk hits (Thrashing)
● 100% write - 4.8K TPS per core
● LeveledCompactionStrategy performed better than SizeTieredCompactionStrategy
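The client-side caching finding can be sketched as follows. This is a hedged illustration only: `lookup_vertex` is a hypothetical stand-in for the real store lookup, the counter simulates round trips, and the cache size is an arbitrary value. It uses the standard-library `functools.lru_cache`.

```python
from functools import lru_cache
from typing import List, Tuple

backend_lookups = 0  # counts simulated round trips to the store


@lru_cache(maxsize=100_000)  # cache size is an arbitrary illustrative value
def lookup_vertex(id_value: str) -> int:
    """Hypothetical stand-in for resolving an ID to its stored vertex key."""
    global backend_lookups
    backend_lookups += 1
    return hash(id_value)  # placeholder for the real store lookup


def load_edges(pairs: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Resolve both endpoints of each edge, hitting the store once per distinct ID."""
    return [(lookup_vertex(a), lookup_vertex(b)) for a, b in pairs]


pairs = [("id1", "id2"), ("id1", "id3"), ("id2", "id3"), ("id1", "id2")]
edges = load_edges(pairs)
print(backend_lookups)  # 3 lookups for 8 endpoint references
```

With many duplicate IDs in an edge batch, most lookups are served from memory, so the store sees a workload that is close to write-only.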
Traversal
Findings
■ Be Wary of Supernodes
● Supernodes with more than 600 connected vertices cause a drastic QPS drop
● From 40K QPS down to 2K
■ Multi-Level Traversal - Depth limiting
● QPS decreases with depth, though not linearly
● At a depth of 5, QPS drops from 40K to 12K
Findings
■ Play with Compaction strategies
● For our queries, LeveledCompactionStrategy increased QPS by 2.5X
● With LeveledCompactionStrategy, concurrent clients were handled better
● QPS stabilized at 30K
Know Your Query And Data
■ Segments are country based - filter based on Countries
■ Vertex Metadata not huge
Fetching individual properties from the Vertex
gremlin> g.V().has('id','1').has('type','email')
.values('id', 'type', 'USA').profile()
Step | Traversers | Time
JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.987
JanusGraphPropertiesStep _condition=((type[id] OR type[type] OR type[USA])) | 4 | 1.337
Total | | 2.325 s
Fetching entire property map during traversal
gremlin> g.V().has('id','1').has('type','email')
.valueMap().profile()
Step | Traversers | Time
JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.902
PropertyMapStep (value) | 1 | 0.175
Total | | 1.077 s
Fetching individual properties took ~200% of the time needed to fetch the entire property map (2.325 s vs 1.077 s)
Graph Analysis
ID Graph Quality
■ How Trustable is our ID graph
● What happens if match rate is ridiculously high
● Cluster of 63 million IDs
■ Connectivity analysis - heuristics
● Density
● Depth
● Clustering
● Distance
■ Can we arrive at Quality Score for edges?
Scoring V1
■ AD scoring - Edge Agreement (A) / Disagreement (D)
■ Recency Scoring - Augment A & D with Recency
■ Calculate Composite Score
■ Adjust composite score with IDs metadata
Scoring - AD
Scoring V1
AD Score
Recency
Score
Composite
Score
Adjust
Event Rarity
Final
Score
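The scoring flow (AD score, recency score, composite score, event-rarity adjustment, final score) can be sketched in Python. The slides do not give the formulas, so everything below is an assumption chosen for illustration: the agreement ratio, the exponential decay with a 30-day half-life, the 0.7 weight and the multiplicative rarity factor are hypothetical.

```python
def ad_score(agreements: int, disagreements: int) -> float:
    """Assumed AD score: share of observations that agree on the edge."""
    total = agreements + disagreements
    return agreements / total if total else 0.0


def recency_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed recency score: exponential decay with a hypothetical half-life."""
    return 0.5 ** (age_days / half_life_days)


def final_score(agreements: int, disagreements: int, age_days: float,
                rarity: float = 1.0, ad_weight: float = 0.7) -> float:
    """Weighted composite of AD and recency, adjusted by an event-rarity factor."""
    composite = (ad_weight * ad_score(agreements, disagreements)
                 + (1.0 - ad_weight) * recency_score(age_days))
    # Clamp so the adjusted score stays in [0, 1], like the quality values on edges.
    return max(0.0, min(1.0, composite * rarity))


score = final_score(agreements=3, disagreements=1, age_days=30.0)
```

Keeping each stage as its own function mirrors the pipeline on the slide, so any one stage can be replaced without touching the others.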
Scoring - Representation
OLTP & OLAP Export
■ Interaction with JanusGraph backed by ScyllaDB
● For each input ID find the connected IDs in the ID Graph based on filters
● Modeled as Depth First Search implemented in Gremlin in Apache Spark
● Property and depth filtering done at the application layer
● The overlapping ID output is stored on deep storage, e.g. AWS S3
■ Across-Graph Traversals
● Separate compliance requirements per 3rd party Graph vendor
● Probabilistic vs Deterministic Graph vendors
● Each Graph Vendor represented as a separate keyspace in ScyllaDB
● The application layer enables runtime chaining and ordering for Across-Graph traversals
OLTP Export - ID Overlap Finder Workflow
■ Export Native Graph DB to Deep Storage
■ Apache Spark based ID Graph Quality Scoring
OLAP Export - Storage & Analytics
OLTP ID
Graph
Periodic
Backup
ScyllaDB
SSTables
on AWS s3
OLAP ID
Graph
Periodic
Refresh
SparkOLAP
Export to AWS
s3
GryoOutputFormat
Native Graph on AWS
s3
Periodic Static
Reports
ID Graph Quality
Data Science
Pipeline
ID Graph Quality Score Update
Prod Setup
Prod Setup
■ V1 release in Nov 2018
■ In production on AWS i3.4xLarge instances
■ These are 16 core, 122 GB RAM instances
■ ScyllaDB Version 3.0.6 provisioned via AWS Scylla AMIs
■ Using Scylla Grafana Dashboards for Production Metrics
■ Using LeveledCompactionStrategy in production
Take away
■ 2 primary Workflows
● ID overlap finder
● ID retriever
Consideration: on a 2-node Scylla cluster, peak client connections are around 3,000
ID overlap finder ~4X numbers of ID retriever
Run Together
● Race and SLA degrade!
● High Failure Rates
Whatever The Tool...
Introduce - Prioritization & Throttling
Priority with Aging - Match Test gets priority but nothing starves
Throttle - Limit concurrent Jobs
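A minimal sketch of priority with aging plus throttling, under simplifying assumptions: time advances in ticks, every job runs for exactly one tick, and the base priorities, aging rate and concurrency limit are made-up values. The `schedule` function and job names are hypothetical and do not describe the actual control plane.

```python
from typing import List, Tuple

# Lower number = more urgent. Job types and base priorities are illustrative.
BASE_PRIORITY = {"match_test": 0, "export": 10}

Job = Tuple[str, str, int]  # (name, job_type, arrival_tick)


def schedule(jobs: List[Job], max_concurrent: int = 2,
             aging_per_tick: int = 4) -> List[List[str]]:
    """Return the job names started at each tick, assuming one-tick jobs."""
    pending = sorted(jobs, key=lambda j: j[2])
    waiting: List[Job] = []
    started: List[List[str]] = []
    tick = 0
    while pending or waiting:
        while pending and pending[0][2] <= tick:
            waiting.append(pending.pop(0))
        # Aging: effective priority improves the longer a job has waited.
        waiting.sort(key=lambda j: (BASE_PRIORITY[j[1]] - aging_per_tick * (tick - j[2]),
                                    j[2], j[0]))
        # Throttle: start at most `max_concurrent` jobs per tick.
        batch, waiting = waiting[:max_concurrent], waiting[max_concurrent:]
        started.append([j[0] for j in batch])
        tick += 1
    return started


jobs = [("e1", "export", 0),
        ("m1", "match_test", 0), ("m2", "match_test", 0),
        ("m3", "match_test", 1), ("m4", "match_test", 1),
        ("m5", "match_test", 2), ("m6", "match_test", 2),
        ("m7", "match_test", 3), ("m8", "match_test", 3)]
plan = schedule(jobs)
```

In this run the export job waits behind three rounds of Match Tests and then overtakes the newest ones at tick 3, so Match Test keeps priority while the export still makes progress.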
And…
■ SLA from p95 of 10 hours to 2 hours
■ Job failure rate from 20% to 2% per day
All Higher Level Constructs in Control Plane
Good Architecture is a Must!
Thank you Stay in touch
Any questions?
Sathish K S
sathish.ks@gmail
Not on Twitter!
Saurabh Verma
saurabhdec1988@gmail
@saurabhdec1988

More Related Content

What's hot

Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Amazon Web Services
 
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB DayAmazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Web Services Korea
 
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB AtlasMongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
MongoDB
 
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
Naoki (Neo) SATO
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
MongoDB
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep Dive
Andre Essing
 
AWS DynamoDB and Schema Design
AWS DynamoDB and Schema DesignAWS DynamoDB and Schema Design
AWS DynamoDB and Schema Design
Amazon Web Services
 
Spark and MongoDB
Spark and MongoDBSpark and MongoDB
Spark and MongoDB
Norberto Leite
 
SQL to NoSQL Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
SQL to NoSQL   Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...SQL to NoSQL   Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
SQL to NoSQL Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
Amazon Web Services
 
MongoDB company and case studies - john hong
MongoDB company and case studies - john hong MongoDB company and case studies - john hong
MongoDB company and case studies - john hong
Ha-Yang(White) Moon
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
Amazon Web Services
 
MongoDB 4.0 새로운 기능 소개
MongoDB 4.0 새로운 기능 소개MongoDB 4.0 새로운 기능 소개
MongoDB 4.0 새로운 기능 소개
Ha-Yang(White) Moon
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB DayGetting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Amazon Web Services Korea
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
Amazon Web Services
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Alex Zeltov
 
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
Sriskandarajah Suhothayan
 
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Amazon Web Services
 

What's hot (20)

Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
 
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB DayAmazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
 
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB AtlasMongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
MongoDB World 2019: Ticketek: Scaling to Global Ticket Sales with MongoDB Atlas
 
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep Dive
 
AWS DynamoDB and Schema Design
AWS DynamoDB and Schema DesignAWS DynamoDB and Schema Design
AWS DynamoDB and Schema Design
 
Spark and MongoDB
Spark and MongoDBSpark and MongoDB
Spark and MongoDB
 
SQL to NoSQL Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
SQL to NoSQL   Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...SQL to NoSQL   Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
SQL to NoSQL Best Practices with Amazon DynamoDB - AWS July 2016 Webinar Se...
 
MongoDB company and case studies - john hong
MongoDB company and case studies - john hong MongoDB company and case studies - john hong
MongoDB company and case studies - john hong
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
MongoDB 4.0 새로운 기능 소개
MongoDB 4.0 새로운 기능 소개MongoDB 4.0 새로운 기능 소개
MongoDB 4.0 새로운 기능 소개
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB DayGetting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
 
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and m...
 
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...
 

Similar to Zeotap: Moving to ScyllaDB - A Graph of Billions Scale

Scaling Production Data across Microservices
Scaling Production Data across MicroservicesScaling Production Data across Microservices
Scaling Production Data across Microservices
Erik Ashepa
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
Lucian Neghina
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
Mostafa Majidpour
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
Cambridge Semantics
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
Cambridge Semantics
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
Fwdays
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Databricks
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
ScyllaDB
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Databricks
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
WSO2
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
MongoDB
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Mitul Tiwari
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
Databricks
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
Andrew Liu
 

Similar to Zeotap: Moving to ScyllaDB - A Graph of Billions Scale (20)

Scaling Production Data across Microservices
Scaling Production Data across MicroservicesScaling Production Data across Microservices
Scaling Production Data across Microservices
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ...
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
[PASS Summit 2016] Blazing Fast, Planet-Scale Customer Scenarios with Azure D...
 


Zeotap: Moving to ScyllaDB - A Graph of Billions Scale

  • 1. Moving to ScyllaDB - A Graph of Billions scale Saurabh Verma, Principal Engineer K S Sathish, VP Engineering
  • 2. Presenters
      K S Sathish, VP Engineering: heads engineering at Zeotap in Bangalore, India. Covers engineering strategy and technical architecture. 17+ years of experience; has been building big data stacks for various verticals for the past 8 years.
      Saurabh Verma, Principal Engineer: leads the data engineering team for the Identity product suite. Covers architecture, design and engineering delivery of the Identity product. Has spent the last 6 years building big data systems.
  • 3. ZEOTAP ■ Identity and Data platform - People Based data ■ Enables Brands to better understand their customers - 360º View ■ World’s Largest Independent People Graph ■ Fully Privacy/GDPR compliant ■ 80+ Data partners ■ Catering to Ad-Tech and MarTech
  • 5. Identity Resolution ● Singular View of all Identities of a Person ● Multiple Identity sources ● Different Identifiers ○ Web Cookies ○ Mobile ○ Partner Platform ○ CRM Linkages between these identifiers are more important than the individual Identifiers
  • 6. Identity Use cases ■ Match Test - Reference IDs JOIN with ID universe ■ Export - IDs retrieved based on Match and pushed out ■ Reporting ■ Compliance - Opt Out - Disconnect ■ 3rd party extension ■ Identity Quality ■ Short SLAs for Freshness of Data - meaning quick ingestion and retrieval
  • 7. Old Implementation [Diagram labels: Partner 1, Partner 2, Partner n, Processing, Curated Denormalized Data (S3), Processing, Client ID sets, Match Test, Exports, Redshift, Athena, Reports, Data Access]
  • 8. Identity Tech - Reqs ■ Workload ● High Read and High Write - Ingestion and Retrieval can happen simultaneously ■ Write ● Ingestion - Streaming and Batch ● Deletion - Streaming and Batch ● Above 50K writes per second to meet SLAs ■ Housekeep ● TTL - based on conditions
  • 9. Identity Tech- Reqs Cont... ■ Read ● Lookup Matching IDs ● Retrieve Linked IDs ● Retrieve Linked IDs based on conditions ■ ID Type - Android ID, website cookie ■ Property - Recency, quality, country ● Count ● Depth filters
  • 10. Time to Change [Diagram labels: Partner 1, Partner 2, Partner n, Processing, ID Graph??, Processing, Client ID sets, Match Test, Exports, Reports]
  • 12. Why Native Graph: Native Graph Database (JanusGraph)
      ■ Low-latency neighbourhood traversal (OLTP) - Lookup & Retrieve
        ● Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store
        ● Runtime proportional to the client data set & overlap percentage
      ■ Lower Data Ingestion SLAs
        ● Ingestion modeled as UPSERT operations
        ● Aligned with Streaming & Differential data ingestions
        ● Economically lower footprint to run in production
      ■ Linkages are first-class citizens
        ● Linkages have properties and traversals can leverage these properties
        ● On-the-fly path computation
      ■ Analytics: stats on the graph, clustering (OLAP)
        ● Bulk export and massively parallel processing available through GraphComputer integration with Spark, Hadoop, Giraph
  • 13. And… Concise solutions to the right problems
      ■ Find the path between 2 user IDs
      SQL:
        (select * from idmvp
          where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13'
            and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4') // depth = 1
        union
        (select * from idmvp t1, idmvp t2
          where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
            and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4') // depth = 2
        union
        (select * from idmvp t1, idmvp t2, idmvp t3
          where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
            and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4') // depth = 3
      Gremlin Query:
        g.V()
          .has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type', 'id_mid_13')
          .repeat(both().simplePath().timeLimit(40000))
          .until(has('id','5c557df3-df47-4603-64bc-5a9a63f22245').has('type','id_mid_4'))
          .limit(10).path()
          .by('id')
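
The Gremlin traversal on slide 13 repeatedly expands neighbours in both directions, never revisits a vertex on the same path, and stops when the target ID is reached. To make that behaviour concrete, here is a minimal in-memory Python sketch of the same idea. The edge list, the `find_paths` helper and its `max_depth` parameter are hypothetical illustrations and not part of the deck; a real deployment would run the Gremlin query against JanusGraph.

```python
from collections import deque


def find_paths(edges, source, target, max_depth=3, limit=10):
    """Breadth-first search for simple paths (no repeated vertex) from source to target."""
    adjacency = {}
    for a, b in edges:  # treat every link as undirected, similar to both()
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    paths = []
    queue = deque([[source]])
    while queue and len(paths) < limit:  # limit plays the role of limit(10)
        path = queue.popleft()
        last = path[-1]
        if last == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_depth:  # hypothetical depth cap, counted in hops
            continue
        for neighbour in sorted(adjacency.get(last, ())):
            if neighbour not in path:  # similar to simplePath(): never revisit a vertex
                queue.append(path + [neighbour])
    return paths


# Made-up example graph: A-B, B-C, A-C, C-D
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(find_paths(edges, "A", "D"))
```

Unlike the SQL version, the depth is a parameter rather than one extra self-join per hop.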
  • 15. POC Hardware
      Server Configuration: Janus On Scylla: 3 x i3.2xLarge; Aerospike: 3 x i3.2xLarge; OrientDB: 3 x i3.2xLarge; DGraph: 3 x r4.16xLarge
      Client Configuration: 3 x c5.18xLarge
      Replication Factor: 1
  • 16. Store Benchmarking - 3B IDs, 1B edges
      Stores compared, in column order: JanusGraph with ScyllaDB | Aerospike | OrientDB | DGraph
      Storage Model: LPG | Custom | LPG | RDF
      Cost of ETL before Ingestion: Lower | Lower | Lower | Higher
      Production Setup Running Cost: Lower | Higher | - | -
      Production Setup Operational Management (based on our experience with AS in prod): Higher | Lower | - | -
      Rows answered with ✓ / ❌ marks: Sharded, Distributed; Native Graph DB; Node / Edge Schema Change without downtime?; Benchmark dataset load completed?; Acceptable Query Performance? - -
      Marks as extracted: ✓ ✓ ✓ ✓✓✓ ✓✓✓ ✓ ✓ ✓ ✓ ✓ ❌ ❌ ❌ ❌
  • 18. ID Graph Data Model
      Vertices:
        ● label: id, type: online, idtype: adid_sha1, id: c3b2a1ed, os: 'android', country: 'ESP', dpid: {1}, ip: [1.2.3.4]
        ● label: id, type: online, idtype: adid, id: a711a4de, os: 'android', country: 'ITA', dpid: {2,3,4}
        ● label: id, type: online, idtype: googlecookie, id: 01e0ffa7, os: 'android', country: 'ESP', dpid: {1,2}
        ● label: id, type: online, idtype: adid, id: 412ce1f0, os: 'android', country: 'ITA', dpid: {2,4}, ip: [1.2.3.4]
        ● label: id, type: offline, idtype: email, id: abc@gmail.com, os: 'ios', country: 'ESP', dpid: {2,4}
      Edges:
        ● linkedTo: {dp1: t1, dp2: t2, quality: 0.30, linkType: 1}
        ● linkedTo: {dp1: t1, dp2: t2, dp3: t3, dp4: t4, quality: 0.55, linkType: 3}
        ● linkedTo: {dp1: t1, quality: 0.25, linkType: 3, linkSource: ip}
        ● linkedTo: {dp2: t2, dp3: t3, dp4: t4, quality: 0.71, linkType: 9}
  • 19. Expressiveness of Model (the graph from slide 18, annotated) ■ Quality Filtered Links ■ ID Attribute Filtering ■ Recency Filtered Links ■ Extensible Data Model ■ Transitive Links
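
As a rough illustration of how quality, recency and ID-attribute filters can be expressed over this kind of model, here is a small Python sketch. The `Vertex` and `Edge` classes, the `linked_ids` helper, the edge endpoints and the numeric timestamps are all assumptions made for the example; the slides show symbolic timestamps (t1, t2, ...) and do not state which vertices each edge connects.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    idtype: str
    type: str
    country: str
    os: str


@dataclass
class Edge:
    src: str
    dst: str
    quality: float
    link_type: int
    seen: dict = field(default_factory=dict)  # data partner -> last-seen timestamp


vertices = {
    "c3b2a1ed": Vertex("c3b2a1ed", "adid_sha1", "online", "ESP", "android"),
    "a711a4de": Vertex("a711a4de", "adid", "online", "ITA", "android"),
    "01e0ffa7": Vertex("01e0ffa7", "googlecookie", "online", "ESP", "android"),
}

# Edge endpoints and timestamps are made up for this example.
edges = [
    Edge("c3b2a1ed", "a711a4de", 0.30, 1, {"dp1": 100, "dp2": 200}),
    Edge("c3b2a1ed", "01e0ffa7", 0.55, 3, {"dp1": 100, "dp2": 200, "dp3": 300, "dp4": 400}),
]


def linked_ids(vertices, edges, start, min_quality=0.0, country=None, seen_since=None):
    """Return IDs directly linked to start that pass the edge and vertex filters."""
    result = []
    for edge in edges:
        if start not in (edge.src, edge.dst):
            continue
        other = edge.dst if edge.src == start else edge.src
        if edge.quality < min_quality:  # quality-filtered links
            continue
        if seen_since is not None and max(edge.seen.values(), default=0) < seen_since:
            continue  # recency-filtered links
        if country is not None and vertices[other].country != country:
            continue  # ID attribute filtering
        result.append(other)
    return result


print(linked_ids(vertices, edges, "c3b2a1ed", min_quality=0.5))
```

Because the filters read edge and vertex properties directly, adding a new property (the "extensible data model" point) does not change the traversal logic.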
  • 21. Streaming Ingestion ■ Workload ● 300 - 400 million data points per day ● Dedupe & Enrich ● Merge ● Final snapshot ■ Batch Process ● Spark Join ● Merge runtime - 4 to 6 hours ● Redshift load time - 2 to 3 hours ● Painful Failures [Diagram labels: Stream & Batch, Dedup, Enrich, S3, Merge, Redshift]
  • 22. Streaming Ingestion ■ And... ● Time - 2 to 3 hours ● Join Vs Lookup ● All Stream ● Failures - down by 83% [Diagram labels: Stream & Batch, Dedup, Enrich, Streaming Graph Ingester, Streaming Graph Ingester, Vertex, Edge, KV Store]
  • 23. Findings ■ Consider Splitting Vertex Load from Edge Load ● Write behaviour is different ● Achieve overall better QPS ■ Benchmark Vertex load speed against CPU utilization ● Observed 5K TPS per server core ■ Consider Client Side Caching - Edge Load ● One lookup and One write with many duplicate IDs - Too many disk hits (Thrashing) ● 100% write - 4.8K TPS per core ● LeveledCompactionStrategy performed better than SizeTieredCompactionStrategy
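
The client-side caching finding can be sketched as follows: during edge load, each edge needs a lookup of its endpoint vertices before the write, and duplicate IDs in the stream repeat the same lookups. A small LRU cache in the ingester absorbs those repeats. This is a minimal sketch under assumptions: `VertexIdCache`, `fake_store_lookup` and the capacity are hypothetical names and values, and the lookup function stands in for a real round trip to the graph store.

```python
from collections import OrderedDict


class VertexIdCache:
    """Small LRU cache mapping an external ID to the store's vertex ID."""

    def __init__(self, lookup, capacity=100_000):
        self._lookup = lookup
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0  # number of lookups that actually reached the store

    def get(self, external_id):
        if external_id in self._entries:
            self._entries.move_to_end(external_id)  # mark as most recently used
            return self._entries[external_id]
        self.misses += 1
        vertex_id = self._lookup(external_id)
        self._entries[external_id] = vertex_id
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return vertex_id


def fake_store_lookup(external_id):
    # Stand-in for a lookup against the graph store (hypothetical).
    return "v-" + external_id


cache = VertexIdCache(fake_store_lookup, capacity=1000)
edge_stream = [("a", "b"), ("a", "c"), ("a", "b")]
writes = [(cache.get(src), cache.get(dst)) for src, dst in edge_stream]
print(writes, "store lookups:", cache.misses)
```

In this toy stream, six endpoint lookups turn into three store lookups; the actual saving depends on how many duplicate IDs the real stream contains.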
  • 25. Findings ■ Be Wary of Supernodes ● Supernodes with more than 600 connected vertices cause a drastic QPS drop ● 40K QPS to 2K ■ Multi-Level Traversal - Depth limiting ● QPS decreases with depth, though not linearly ● At a depth of 5, QPS drops from 40K to 12K
  • 26. Findings ■ Play with Compaction strategies ● For our queries, LeveledCompactionStrategy increased QPS by 2.5X ● With LeveledCompactionStrategy, concurrent clients were handled better ● QPS stabilized at 30K
  • 27. Know Your Query And Data
      ■ Segments are country based - filter based on Countries
      ■ Vertex Metadata not huge
      Fetching individual properties from the Vertex:
        gremlin> g.V().has('id','1').has('type','email').values('id', 'type', 'USA').profile()
        Step | Traversers | Time
        JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.987
        JanusGraphPropertiesStep _condition=((type[id] OR type[type] OR type[USA])) | 4 | 1.337
        Total: 2.325 s
      Fetching entire property map during traversal:
        gremlin> g.V().has('id','1').has('type','email').valueMap().profile()
        Step | Traversers | Time
        JanusGraphStep _condition=(id=1 AND type = email) | 1 | 0.902
        PropertyMapStep (value) | 1 | 0.175
        Total: 1.077 s
      ~200%
  • 29. ID Graph Quality ■ How trustworthy is our ID graph? ● What happens if the match rate is ridiculously high? ● Cluster of 63 million IDs ■ Connectivity analysis - heuristics ● Density ● Depth ● Clustering ● Distance ■ Can we arrive at a Quality Score for edges?
  • 30. Scoring V1 ■ AD scoring - Edge Agreement (A) / Disagreement (D) ■ Recency Scoring - Augment A & D with Recency ■ Calculate Composite Score ■ Adjust composite score with IDs metadata
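
The deck does not give the scoring formula, so the following is only one possible reading of "agreement / disagreement augmented with recency": each observation that agrees or disagrees with an edge is weighted by an exponential recency decay, and the score is the weighted share of agreements. The function names, the 30-day half-life and the ratio itself are assumptions for illustration, and the final metadata adjustment step is not modelled.

```python
def recency_weight(age_days, half_life_days=30.0):
    """Weight of one observation; it halves every half_life_days (assumed decay)."""
    return 0.5 ** (age_days / half_life_days)


def edge_score(agreement_ages, disagreement_ages, half_life_days=30.0):
    """Recency-weighted share of agreeing observations, in the range [0, 1].

    Both arguments are lists of observation ages in days. This ratio is an
    assumed composite score, not the formula used in the deck.
    """
    agree = sum(recency_weight(age, half_life_days) for age in agreement_ages)
    disagree = sum(recency_weight(age, half_life_days) for age in disagreement_ages)
    if agree + disagree == 0:
        return 0.0  # no evidence at all
    return agree / (agree + disagree)


# One fresh agreement against one 30-day-old disagreement.
print(edge_score([0], [30]))
```

With this weighting, a stale disagreement counts for less than a fresh agreement, which is the intuition behind augmenting A and D with recency.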
  • 34. OLTP & OLAP Export
  • 35. OLTP Export - ID Overlap Finder Workflow ■ Interaction with JanusGraph backed by ScyllaDB ● For each input ID, find the connected IDs in the ID Graph based on filters ● Modeled as Depth First Search implemented in Gremlin in Apache Spark ● Property and depth filtering done at the application layer ● The overlapping ID output is stored on deep storage, e.g. AWS S3 ■ Across-Graph Traversals ● Separate compliance requirements per 3rd party Graph vendor ● Probabilistic vs Deterministic Graph vendors ● Each Graph Vendor represented as a separate keyspace in ScyllaDB ● The application layer enables runtime chaining and ordering for Across-Graph traversals
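
One way to picture the workflow on slide 35 is a depth-limited depth-first search per graph, with property filtering in application code, chained across vendor graphs in a chosen order. The sketch below uses plain dictionaries as stand-ins for the per-vendor keyspaces; `connected_ids`, `chained_lookup`, the vendor names and the rule that the output of one graph seeds the next are all assumptions, since the deck does not describe the chaining semantics.

```python
def connected_ids(graph, start, max_depth, keep=lambda vertex_id: True):
    """Depth-limited DFS; the keep() property filter runs in the application layer."""
    best_depth = {start: 0}  # shallowest depth at which each vertex was reached
    stack = [(start, 0)]
    while stack:
        vertex, depth = stack.pop()
        if depth == max_depth:
            continue
        for neighbour in graph.get(vertex, ()):
            if best_depth.get(neighbour, max_depth + 1) <= depth + 1:
                continue  # already reached at the same or a shallower depth
            best_depth[neighbour] = depth + 1
            stack.append((neighbour, depth + 1))
    return sorted(v for v in best_depth if v != start and keep(v))


def chained_lookup(graphs, order, start_ids, max_depth):
    """Traverse the graphs in the given order, feeding each result into the next."""
    frontier = set(start_ids)
    for name in order:
        expanded = set(frontier)
        for vertex in frontier:
            expanded.update(connected_ids(graphs[name], vertex, max_depth))
        frontier = expanded
    return sorted(frontier - set(start_ids))


# Hypothetical vendor graphs, one adjacency map per keyspace.
graphs = {
    "vendor_a": {"e1": ["c1"], "c1": ["e1"]},
    "vendor_b": {"c1": ["m1"], "m1": ["c1"]},
}
print(chained_lookup(graphs, ["vendor_a", "vendor_b"], ["e1"], max_depth=2))
print(chained_lookup(graphs, ["vendor_b", "vendor_a"], ["e1"], max_depth=2))
```

The two calls return different results, which shows why the ordering of graphs is something the application layer has to control.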
  • 36. OLAP Export - Storage & Analytics ■ Export Native Graph DB to Deep Storage ■ Apache Spark based ID Graph Quality Scoring [Diagram labels: OLTP ID Graph; Periodic Backup; ScyllaDB SSTables on AWS S3; OLAP ID Graph; Periodic Refresh; SparkOLAP Export to AWS S3; GryoOutputFormat; Native Graph on AWS S3; Periodic Static Reports; ID Graph Quality Data Science Pipeline; ID Graph Quality Score Update]
  • 38. Prod Setup ■ V1 release in Nov 2018 ■ In production on AWS i3.4xLarge instances ■ These are 16 core, 122 GB RAM instances ■ ScyllaDB Version 3.0.6 provisioned via AWS Scylla AMIs ■ Using Scylla Grafana Dashboards for Production Metrics ■ Using LeveledCompactionStrategy in production ■ Stats (To be updated before final deck)
  • 40. Whatever The Tool... ■ 2 primary Workflows ● ID overlap finder ● ID retriever ■ Consideration: on a 2-node Scylla cluster, peak client connections are around 3,000 ● ID overlap finder ~4X the numbers of ID retriever ■ Run Together ● Race and SLA degrade! ● High Failure Rates
  • 41. Good Architecture is a Must! ■ Introduce - Prioritization & Throttling ● Priority with Aging - Match Test gets priority but nothing starves ● Throttle - Limit concurrent Jobs ■ And… ● SLA from p95 of 10 hours to 2 hours ● Job failure rate from 20% to 2% per day ■ All Higher Level Constructs in Control Plane
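
Priority with aging plus a concurrency throttle can be sketched in a few lines. In this hypothetical scheduler, a lower number means higher priority, every tick a job waits improves its effective priority by `aging_per_tick`, and at most `max_concurrent` jobs run at once. The class, its parameters and the job names are illustrative assumptions, not the control plane described in the deck.

```python
class Scheduler:
    """Toy job scheduler: priority with aging, plus a cap on concurrent jobs."""

    def __init__(self, max_concurrent, aging_per_tick=1):
        self.max_concurrent = max_concurrent
        self.aging_per_tick = aging_per_tick
        self.waiting = []  # (name, base_priority, submitted_tick)
        self.running = set()

    def submit(self, name, base_priority, tick):
        self.waiting.append((name, base_priority, tick))

    def start_next(self, tick):
        """Start the waiting job with the best effective priority, if a slot is free."""
        if len(self.running) >= self.max_concurrent or not self.waiting:
            return None  # throttled, or nothing to run

        def effective(job):
            _, base, submitted = job
            return base - self.aging_per_tick * (tick - submitted)  # waiting lowers the value

        job = min(self.waiting, key=effective)
        self.waiting.remove(job)
        self.running.add(job[0])
        return job[0]

    def finish(self, name):
        self.running.discard(name)


scheduler = Scheduler(max_concurrent=1)
scheduler.submit("export-1", base_priority=10, tick=0)
scheduler.submit("match-test-1", base_priority=0, tick=1)
first = scheduler.start_next(tick=1)      # match test wins on base priority
blocked = scheduler.start_next(tick=1)    # throttle: the single slot is taken
scheduler.finish(first)
scheduler.submit("match-test-2", base_priority=0, tick=20)
second = scheduler.start_next(tick=20)    # the export has aged past the new match test
print(first, blocked, second)
```

The aging term is what keeps the lower-priority export from starving when match tests keep arriving.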
  • 42. Thank you Stay in touch Any questions? Sathish K S sathish.ks@gmail Not on Twitter! Saurabh Verma saurabhdec1988@gmail @saurabhdec1988