Moving to ScyllaDB - A Graph of Billions Scale
Saurabh Verma, Principal Engineer
K S Sathish, VP Engineering
Presenters

K S Sathish, VP Engineering
Sathish heads engineering at Zeotap (Bangalore, India), owning engineering strategy and technical architecture. 17+ years of experience; has been building big data stacks for various verticals for the past 8 years.

Saurabh Verma, Principal Engineer
Saurabh is a Principal Engineer at Zeotap. He leads the data engineering team for the Identity product suite, owning architecture, design, and engineering delivery of the Identity product. He has spent the last 6 years building big data systems.
ZEOTAP
■ Identity and Data platform - people-based data
■ Enables brands to better understand their customers - a 360° view
■ World’s largest independent people graph
■ Fully privacy/GDPR compliant
■ 80+ data partners
■ Catering to Ad-Tech and MarTech
Identity Resolution
Use Cases
Identity Resolution
● Singular view of all identities of a person
● Multiple identity sources
● Different identifiers
○ Web cookies
○ Mobile
○ Partner platform
○ CRM
Linkages between these identifiers are more important than the individual identifiers.
Identity Use Cases
■ Match Test - reference IDs JOINed with the ID universe
■ Export - IDs retrieved based on a match and pushed out
■ Reporting
■ Compliance - opt-out - disconnect
■ 3rd-party extension
■ Identity quality
■ Short SLAs for freshness of data - meaning quick ingestion and retrieval
Data Access - Old Implementation
[Architecture diagram] Partner 1, Partner 2, ... Partner n → Processing → curated, denormalized data on S3 → Processing → Redshift and Athena, serving Reports, Match Tests on client ID sets, and Exports.
Identity Tech - Reqs
■ Workload
● High read and high write - ingestion and retrieval can happen simultaneously
■ Write
● Ingestion - streaming and batch
● Deletion - streaming and batch
● Above 50K writes per second to meet SLAs
■ Housekeeping
● TTL - based on conditions (see the sketch below)
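The conditional TTLs above live in application logic, but the graph layer itself supports static, schema-level TTLs. A minimal Gremlin-Groovy sketch, assuming a JanusGraph handle named graph and a 90-day retention period chosen purely for illustration:

import java.time.Duration

// Hedged sketch: schema-level TTL via JanusGraph's management API.
// Vertex TTLs require the label to be defined as static.
mgmt = graph.openManagement()
idLabel = mgmt.getVertexLabel('id')          // 'id' is the vertex label from our model
mgmt.setTTL(idLabel, Duration.ofDays(90))    // 90 days is an assumed value
mgmt.commit()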
Identity Tech - Reqs (cont.)
■ Read
● Lookup matching IDs
● Retrieve linked IDs
● Retrieve linked IDs based on conditions
○ ID type - Android ID, website cookie
○ Property - recency, quality, country
● Count
● Depth filters
Time to Change
[Architecture diagram] Partner 1, Partner 2, ... Partner n → Processing → ID Graph?? → Processing → Reports, Match Tests on client ID sets, and Exports.
Introducing GraphDB
Why Native Graph
Native Graph Database (JanusGraph)
■ Low-latency neighbourhood traversal (OLTP) - lookup & retrieve
● Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store
● Runtime proportional to the client data set & overlap percentage
■ Lower data ingestion SLAs
● Ingestion modeled as UPSERT operations (sketched below)
● Aligned with streaming & differential data ingestion
● Economically lower footprint to run in production
■ Linkages are first-class citizens
● Linkages have properties, and traversals can leverage these properties
● On-the-fly path computation
■ Analytics stats on the graph, clustering (OLAP)
● Bulk export and massively parallel processing available via GraphComputer integration with Spark, Hadoop, Giraph
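The UPSERT modelling above is typically written in Gremlin with the fold()/coalesce() idiom - look the vertex up, reuse it if present, create it otherwise. A minimal sketch with placeholder IDs and property values:

// Hedged sketch: idempotent vertex ingestion (UPSERT) in Gremlin.
g.V().has('id', 'c3b2a1ed').has('idtype', 'adid_sha1').
  fold().
  coalesce(
    unfold(),                                  // vertex exists: reuse it
    addV('id').property('id', 'c3b2a1ed').     // vertex missing: create it
               property('idtype', 'adid_sha1').
               property('country', 'ESP')).
  property('os', 'android').                   // refresh mutable attributes either way
  next()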
And… Concise Solutions to the Right Problems
■ Find the path between 2 user IDs

SQL:
(select * from idmvp
 where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13'
   and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4')        -- depth = 1
union
(select * from idmvp t1, idmvp t2
 where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
   and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
   and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4')  -- depth = 2
union
(select * from idmvp t1, idmvp t2, idmvp t3
 where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
   and t1.id2 = t2.id1 and t1.idtype2 = t2.idtype1
   and t2.id2 = t3.id1 and t2.idtype2 = t3.idtype1
   and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4')  -- depth = 3

Gremlin:
g.V()
 .has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type','id_mid_13')
 .repeat(both().simplePath().timeLimit(40000))
 .until(has('id','5c557df3-df47-4603-64bc-5a9a63f22245')
        .has('type','id_mid_4'))
 .limit(10).path()
 .by('id')
POCs and Findings
POC Hardware

Server Configuration (Replication Factor: 1):
Janus on Scylla: 3 x i3.2xlarge | Aerospike: 3 x i3.2xlarge | OrientDB: 3 x i3.2xlarge | DGraph: 3 x r4.16xlarge

Client Configuration:
3 x c5.18xlarge
Store Benchmarking - 3B IDs, 1B edges

                                          | JanusGraph with ScyllaDB | Aerospike | OrientDB | DGraph
Sharded, Distributed                      | ✓                        | ✓         | ✓        | ✓
Storage Model                             | LPG                      | Custom    | LPG      | RDF
Cost of ETL before Ingestion              | Lower                    | Lower     | Lower    | Higher
Native Graph DB                           | ✓                        | ❌        | ✓        | ✓
Node/Edge Schema Change without downtime? | ✓                        | ✓         | ✓        | ✓
Benchmark dataset load completed?         | ✓                        | ✓         | ❌       | ❌
Acceptable Query Performance?             | ✓                        | ❌        | -        | -
Production Setup Running Cost             | Lower                    | Higher    | -        | -
Production Setup Operational Management   | Higher                   | Lower     | -        | -
(based on our experience with AS in prod)
The Data Model
ID Graph Data Model
[Graph diagram: five ID vertices connected by linkedTo edges]

Vertices (label: id):
● {type: online, idtype: adid_sha1, id: c3b2a1ed, os: 'android', country: 'ESP', dpid: {1}, ip: [1.2.3.4]}
● {type: online, idtype: adid, id: a711a4de, os: 'android', country: 'ITA', dpid: {2,3,4}}
● {type: online, idtype: googlecookie, id: 01e0ffa7, os: 'android', country: 'ESP', dpid: {1,2}}
● {type: online, idtype: adid, id: 412ce1f0, os: 'android', country: 'ITA', dpid: {2,4}, ip: [1.2.3.4]}
● {type: offline, idtype: email, id: abc@gmail.com, os: 'ios', country: 'ESP', dpid: {2,4}}

Edges (label: linkedTo), carrying per-data-partner timestamps plus quality and link type:
● linkedTo: {dp1: t1, dp2: t2, quality: 0.30, linkType: 1}
● linkedTo: {dp1: t1, dp2: t2, dp3: t3, dp4: t4, quality: 0.55, linkType: 3}
● linkedTo: {dp1: t1, quality: 0.25, linkType: 3, linkSource: ip}
● linkedTo: {dp2: t2, dp3: t3, dp4: t4, quality: 0.71, linkType: 9}
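A minimal Gremlin sketch of how two of the vertices above and one linkedTo edge could be created (assuming a JanusGraph traversal source g; t1/t2 stand in for real per-data-partner timestamps):

// Hedged sketch: building a slice of the ID graph data model.
t1 = System.currentTimeMillis(); t2 = t1

v1 = g.addV('id').
       property('type', 'online').property('idtype', 'adid_sha1').
       property('id', 'c3b2a1ed').property('os', 'android').
       property('country', 'ESP').next()
v2 = g.addV('id').
       property('type', 'online').property('idtype', 'googlecookie').
       property('id', '01e0ffa7').property('os', 'android').
       property('country', 'ESP').next()

g.V(v1).addE('linkedTo').to(v2).
  property('dp1', t1).property('dp2', t2).     // per-data-partner observation times
  property('quality', 0.30).property('linkType', 1).
  iterate()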
Expressiveness of Model
[Same ID graph diagram as above, annotated with the capabilities it demonstrates:]
● Quality-filtered links
● Recency-filtered links
● ID attribute filtering
● Transitive links
● Extensible data model
Streaming Ingestion
Streaming Ingestion
■ Workload
● 300-400 million data points per day
● Dedupe & enrich
● Merge
● Final snapshot
■ Batch Process
● Spark join
● Merge runtime - 4 to 6 hours
● Redshift load time - 2 to 3 hours
● Painful failures
[Pipeline diagram] Stream & batch sources → Dedup → Enrich → S3 → Merge → Redshift
Streaming Ingestion
■ And...
● Time - 2 to 3 hours
● Join vs. lookup
● All stream
● Failures - down by 83%
[Pipeline diagram] Stream & batch sources → Dedup → Enrich → Streaming Graph Ingester (vertex) and Streaming Graph Ingester (edge) → KV store
Findings
■ Consider splitting vertex load from edge load
● Write behaviour is different
● Achieves better overall QPS
■ Benchmark vertex load speed against CPU utilization
● Observed 5K TPS per server core
■ Consider client-side caching for edge load (see the sketch below)
● One lookup and one write with many duplicate IDs - too many disk hits (thrashing)
● 100% write - 4.8K TPS per core
● LeveledCompactionStrategy performed better than SizeTieredCompactionStrategy
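A Groovy sketch of the client-side cache idea for edge load - memoize vertex lookups in a bounded LRU map so repeated IDs skip the disk-backed read path (the cache size and helper names are assumptions):

// Hedged sketch: LRU memoization of vertex lookups during edge load.
def MAX_ENTRIES = 100_000
def cache = new LinkedHashMap<String, Object>(16, 0.75f, true) {
    protected boolean removeEldestEntry(Map.Entry eldest) { size() > MAX_ENTRIES }
}
def lookupVertex = { String id ->
    cache.computeIfAbsent(id) { k ->
        g.V().has('id', k).tryNext().orElse(null)    // one disk-backed lookup per unique ID
    }
}

def src = lookupVertex('c3b2a1ed')
def dst = lookupVertex('01e0ffa7')
if (src != null && dst != null)
    g.V(src).addE('linkedTo').to(dst).iterate()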
Traversal
Findings
■ Be wary of supernodes
● Supernodes (> 600 adjacent vertices) caused a drastic QPS drop
● From 40K QPS down to 2K
■ Multi-level traversal - depth limiting (see the sketch below)
● QPS decreases, though not linearly
● At depth 5 - from 40K QPS down to 12K
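Both mitigations can be expressed directly in the traversal; a hedged Gremlin sketch that caps fan-out at supernodes and bounds depth (the 600 and depth-5 figures echo the numbers above, the start ID is a placeholder):

g.V().has('id', 'c3b2a1ed').has('idtype', 'adid_sha1').
  repeat(local(bothE().limit(600)).otherV().    // cap supernode fan-out at 600 edges
         simplePath()).
  emit().times(5).                              // hard depth limit
  dedup().
  values('id')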
Findings
■ Play with compaction strategies (config sketch below)
● For our queries, LeveledCompactionStrategy increased QPS by 2.5X
● With LeveledCompactionStrategy, concurrent clients were handled better
● QPS stabilized at 30K
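With the CQL backend, JanusGraph can request a compaction strategy when it creates its Scylla tables. A sketch, assuming the storage.cql.* option names from the JanusGraph configuration reference (verify against your version; the hostname is a placeholder):

// Hedged sketch: pointing JanusGraph at Scylla with LCS table compaction.
conf = new org.apache.commons.configuration.BaseConfiguration()
conf.setProperty('storage.backend', 'cql')
conf.setProperty('storage.hostname', 'scylla-node1')
conf.setProperty('storage.cql.compaction-strategy-class', 'LeveledCompactionStrategy')
graph = org.janusgraph.core.JanusGraphFactory.open(conf)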
Know Your Query And Data
■ Segments are country-based - filter based on countries
■ Vertex metadata is not huge

Fetching individual properties from the vertex:
gremlin> g.V().has('id','1').has('type','email')
          .values('id', 'type', 'USA').profile()

Step                                                    Traversers  Time
JanusGraphStep [_condition=(id=1 AND type=email)]                1  0.987
JanusGraphPropertiesStep
  [_condition=(type[id] OR type[type] OR type[USA])]             4  1.337
Total                                                               2.325 s

Fetching the entire property map during traversal:
gremlin> g.V().has('id','1').has('type','email')
          .valueMap().profile()

Step                                                    Traversers  Time
JanusGraphStep [_condition=(id=1 AND type=email)]                1  0.902
PropertyMapStep(value)                                           1  0.175
Total                                                               1.077 s

valueMap() came out ~2X (~200%) faster than fetching the individual properties.
Graph Analysis
ID Graph Quality
■ How trustworthy is our ID graph?
● What happens if the match rate is ridiculously high?
● e.g. a cluster of 63 million IDs
■ Connectivity analysis - heuristics
● Density
● Depth
● Clustering
● Distance
■ Can we arrive at a quality score for edges?
Scoring V1
■ AD scoring - edge Agreement (A) / Disagreement (D) (sketched below)
■ Recency scoring - augment A & D with recency
■ Calculate a composite score
■ Adjust the composite score with the IDs' metadata

Scoring - AD
[Scoring V1 pipeline] AD Score → Recency Score → Composite Score → Adjust for Event Rarity → Final Score
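The deck doesn't spell out the formula, so the following is a hypothetical reading only: if each edge carried agreement and disagreement counters, a normalized AD score could be materialized onto the edge like this (the 'agree'/'disagree' properties are assumptions, not Zeotap's actual model):

// Hypothetical sketch: per-edge AD score = A / (A + D).
g.E().hasLabel('linkedTo').toList().each { e ->
    double a = e.value('agree')                            // assumed agreement counter
    double d = e.value('disagree')                         // assumed disagreement counter
    e.property('adScore', (a + d) > 0 ? a / (a + d) : 0.0d)  // 1.0 = sources fully agree
}
graph.tx().commit()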
Scoring - Representation
OLTP & OLAP Export

OLTP Export - ID Overlap Finder Workflow
■ Interaction with JanusGraph backed by ScyllaDB
● For each input ID, find the connected IDs in the ID graph based on filters (see the sketch below)
● Modeled as a depth-first search implemented in Gremlin inside Apache Spark
● Property and depth filtering done at the application layer
● The overlapping ID output is stored on deep storage, e.g. AWS S3
■ Across-Graph Traversals
● Separate compliance requirements per 3rd-party graph vendor
● Probabilistic vs. deterministic graph vendors
● Each graph vendor is represented as a separate keyspace in ScyllaDB
● The application layer enables runtime chaining and ordering for across-graph traversals
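A hedged Gremlin sketch of the per-ID filtered retrieval at the core of the overlap finder - property filters inline, depth filter via times() (the quality threshold, country, and depth are illustrative; inputId would come from the Spark job):

g.V().has('id', inputId).
  repeat(bothE('linkedTo').has('quality', gte(0.5)).   // edge-property filter
         otherV().has('country', 'ESP').               // vertex-property filter
         simplePath()).
  emit().times(3).                                     // depth filter
  dedup().
  values('id')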
OLAP Export - Storage & Analytics
■ Export the native graph DB to deep storage
■ Apache Spark based ID graph quality scoring
[Flow diagram] OLTP ID Graph → (periodic backup) → ScyllaDB SSTables on AWS S3 → (periodic refresh) → OLAP ID Graph → (Spark OLAP export to AWS S3, GryoOutputFormat) → native graph on AWS S3 → periodic static reports and the ID Graph Quality data science pipeline → ID graph quality score update
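The OLAP path goes through TinkerPop's GraphComputer. A minimal Gremlin-console sketch of reading the Gryo export with SparkGraphComputer (the properties file name is a placeholder for a HadoopGraph configuration pointing at the export):

graph = GraphFactory.open('hadoop-gryo.properties')   // HadoopGraph over the Gryo export on S3
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count().next()                                  // e.g. a sanity-check vertex count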
Prod Setup
Prod Setup
■ V1 released in Nov 2018
■ In production on AWS i3.4xlarge instances
■ These are 16-core, 122 GB RAM instances
■ ScyllaDB version 3.0.6, provisioned via the AWS Scylla AMIs
■ Using the Scylla Grafana dashboards for production metrics
■ Using LeveledCompactionStrategy in production
Take Away
■ 2 primary workflows
● ID overlap finder
● ID retriever
■ Consideration: on a 2-node Scylla cluster, peak client connections are around 3,000
■ The ID overlap finder runs at ~4X the numbers of the ID retriever
■ Run together:
● Races and SLA degradation!
● High failure rates
Whatever The Tool...
Introduce Prioritization & Throttling
■ Priority with aging - Match Tests get priority, but nothing starves (see the sketch below)
■ Throttle - limit concurrent jobs
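A Groovy sketch of priority with aging (all names, base priorities, and the aging rate are illustrative): a job's effective priority improves the longer it waits, so Match Tests go first but exports never starve.

import java.util.concurrent.PriorityBlockingQueue

// Hedged sketch: priority queue whose ordering ages with wait time.
class Job {
    String type             // e.g. 'match_test' or 'export'
    int basePriority        // lower = more urgent; Match Tests get a lower base
    long enqueuedAt = System.currentTimeMillis()
    double effective() {    // every 10 minutes waited buys one priority level
        basePriority - (System.currentTimeMillis() - enqueuedAt) / 600_000d
    }
}

def queue = new PriorityBlockingQueue<Job>(11,
        { Job a, Job b -> a.effective() <=> b.effective() } as Comparator)
queue.put(new Job(type: 'export',     basePriority: 5))
queue.put(new Job(type: 'match_test', basePriority: 1))
def next = queue.take()    // match_test first, unless an export has aged enough
// Note: the queue only orders at insertion, so a real scheduler would re-heapify
// periodically; throttling = e.g. a Semaphore bounding concurrent jobs.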
And…
■ SLA improved from a p95 of 10 hours to 2 hours
■ Job failure rate down from 20% to 2% per day
All higher-level constructs live in the control plane.
Good architecture is a must!
Thank You - Stay in Touch
Any questions?
Sathish K S
sathish.ks@gmail
Not on Twitter!
Saurabh Verma
saurabhdec1988@gmail
@saurabhdec1988

Editor's Notes
■ Event Rarity (Scoring V1): e.g. the IP address on 2 vertices is the same while the country of the 2 vertices differs.