MARCH 27, 2017
How to Process 50 Billion Monthly Messages with Full Availability & Performance
Yuan Ren, Head of Data Science
Aug 15, 2017
Agenda
• What mParticle is
• What problem we set out to solve
• Our journey from Cassandra to Scylla
The single, secure API for Growth
mParticle provides a single, secure API to integrate and orchestrate your entire
marketing stack so that brands can enhance analytics and optimize acquisition,
engagement, and monetization in a multi-screen world.
Trusted by the very best brands
mParticle was created to solve modern data challenges
(Platform diagram: inputs from Platforms and Feeds; core capabilities of Identity Resolution, Profile Enrichment, Audience Builder, Rules & Filters, and Security; outputs to Events, Data Warehouses, and Audiences.)
mParticle Platform Stats
• Monthly unique users/devices by major platforms
  • 350M iOS devices monthly
  • 1B Android devices monthly
• Monthly data volume
  • 50B batches
  • 100B events
  • 150TB of data in binary format added to S3
Need a Near Real Time Data Store
• Data streams in and out in near real time
• Handle various types of data load
• Full availability
Data Schema & Ingestion Rate
• Each message is a batch of events, avg 20 kB per message (see the sketch below)
  • An array of events
  • User info
  • App info
  • Device info
• Data is sent to mParticle from user devices and server-to-server (S2S)
  • Avg 20K msg/sec
  • Peak 40K msg/sec
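For illustration, a minimal sketch of what one ingested message might look like; the field names below are assumptions for clarity, not mParticle's actual schema:

# Hypothetical shape of one ingested message (a batch of events); field names
# are illustrative only, not mParticle's real schema.
batch = {
    "user_info":   {"mparticle_userid": 1234567890, "country": "US"},
    "app_info":    {"app_name": "example-app", "app_version": "3.2.1"},
    "device_info": {"platform": "iOS", "os_version": "10.3"},
    "events": [  # the array of events carried by this ~20 kB message
        {"name": "app_open", "timestamp_ms": 1502790000000},
        {"name": "purchase", "timestamp_ms": 1502790030000, "amount": 9.99},
    ],
}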
Data Processing Notes
• Real time processing. On every received batch
•Query all historical batches/events of the user, and process them through a rule
engine
•Read latency must be low
•Write the received batch into the database
• Batch processing
•Whenever there’s a change in rules engine, we query all historical data and reprocess
them
(Diagram: data arrives from the SDK and S2S, is used for user profile enrichment and evaluations based on users' full history, and feeds Events, Data Warehouses, and Audiences.)
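A minimal sketch of the real-time path, assuming the DataStax Python driver (which also speaks to Scylla) and the per-client table shown on the next slide; the contact points, keyspace, table name, and rules-engine hook are hypothetical:

from cassandra.cluster import Cluster  # DataStax Python driver; also speaks to Scylla

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # hypothetical contact points
session = cluster.connect("events_ks")        # hypothetical keyspace

# Prepared statements against a per-client table (schema on the next slide);
# "client_acme" stands in for the real per-client table name.
read_history = session.prepare(
    "SELECT time, eventdata FROM client_acme WHERE userid = ?")
write_batch = session.prepare(
    "INSERT INTO client_acme (userid, time, eventdata) VALUES (?, ?, ?)")

def on_batch_received(userid, batch_time, eventdata):
    # Real-time path: read the user's full history, run the rules engine,
    # then persist the newly received batch.
    history = session.execute(read_history, (userid,))
    evaluate_rules(userid, history, eventdata)   # hypothetical rules-engine hook
    session.execute(write_batch, (userid, batch_time, eventdata))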
Cassandra got us started
• High read and write throughput
• Horizontal scalability
• A widely adopted technology with a proven track record
• Low cost through the DataStax startup program
• We used Cassandra until Q4 2016
Cassandra Data Model
• Data is partitioned by client, i.e., separate tables per client
• Each table is partitioned by mParticle userid
• For real-time processing, we read/write by mParticle userid
• For batch processing, we split the read into partitions and query partition by partition (example queries follow the schema below)
CREATE TABLE {table_per_client} (
    userid    bigint,
    time      timestamp,
    eventdata blob,
    PRIMARY KEY (userid, time)
) WITH CLUSTERING ORDER BY (time DESC);
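A minimal sketch of the two access patterns against this table, again with the DataStax Python driver; slicing the batch read by Murmur3 token ranges is an assumption about how "split the read into partitions" could be done, and the table name and reprocessing hook are hypothetical:

from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("events_ks")   # hypothetical host/keyspace

# Real-time path: everything for one user, newest first (per the clustering order).
rows = session.execute(
    "SELECT time, eventdata FROM client_acme WHERE userid = %s", (1234567890,))

# Batch path: scan the table in slices of the Murmur3 token range, one query
# per slice; the slice count and the use of token ranges are assumptions here.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
num_slices = 256
step = (MAX_TOKEN - MIN_TOKEN) // num_slices
for i in range(num_slices):
    lo = MIN_TOKEN + i * step
    hi = MAX_TOKEN if i == num_slices - 1 else lo + step - 1
    for row in session.execute(
            "SELECT userid, time, eventdata FROM client_acme "
            "WHERE token(userid) >= %s AND token(userid) <= %s", (lo, hi)):
        reprocess(row)   # hypothetical rules-engine reprocessing hook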
Cassandra Stats (Sept 2016)
• We scale our systems up/down to meet our read/write latency requirements
• One Cassandra cluster running on 12 EC2 nodes
  • c3.8xlarge (32 vCPU, 60GB memory, 640GB SSD storage)
• Two factors with the biggest impact on latencies
  • Batch processing
  • S2S data loads
* Latency stats not available for 2016
Cassandra Stats (continued)
• We hit a bottleneck because of compactions
  • If any higher load was pushed to the cluster, compactions would get out of control and either crash the C* service or leave the cluster unresponsive
  • Having a backlog of compactions means read latencies are much worse than they could be
Cassandra Pain Points
• The amount of human labor involved in tuning
• Lack of affordable support from DataStax
• We consulted a third-party Cassandra consulting company, which turned out to be a bad experience
• We ended up with an overcomplicated setup that was hard to modify and scale
• At the end of our Cassandra journey, we had data-processing backlogs of up to 20 hours, on a good day
• It was just not a good fit for us at that time
Scylla POC
• Why Scylla?
  • Compatible with Cassandra; no code changes required
  • Rewritten in C++; we really don't like tuning the JVM
• POC process (see the diagram note and sketch below)
  • We engaged Scylla in a POC as soon as they released version 1.0
  • Tested with real data on the same hardware
  • Scylla beat Cassandra significantly in our case
    • Much lower compaction backlog
    • Ease of configuration
    • Self-tuning during installation
    • Highly responsive and knowledgeable support from Scylla engineers
* Scylla's website has more rigorous performance comparisons between C* and Scylla
(POC diagram: the live C* cluster alongside Test SQS 1 and Test SQS 2 feeding a test Cassandra cluster and a test Scylla cluster.)
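A minimal sketch of how live traffic could be mirrored into the two test queues, as the diagram suggests; the region, queue URLs, and hook point are hypothetical, and this is not necessarily how mParticle wired its POC:

import json
import boto3   # AWS SDK for Python

sqs = boto3.client("sqs", region_name="us-east-1")   # hypothetical region

TEST_QUEUES = [   # hypothetical queue URLs, one feeding each test cluster
    "https://sqs.us-east-1.amazonaws.com/123456789012/test-cassandra-ingest",
    "https://sqs.us-east-1.amazonaws.com/123456789012/test-scylla-ingest",
]

def mirror_to_test_queues(message):
    # Duplicate a live message into both test queues so both clusters
    # receive identical real traffic on identical hardware.
    body = json.dumps(message)
    for url in TEST_QUEUES:
        sqs.send_message(QueueUrl=url, MessageBody=body)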
Cassandra to Scylla Migration
• Essentially we only needed to migrate data
  • No code change
  • No data model change
    • Except that Scylla helped us pick a better data model that should've been used in C* too
• Migration steps (a per-client sketch follows below)
  • Migrated one client at a time
  • Temporarily paused data ingestion for that client
  • Migrated the client's data from C* to Scylla
  • Resumed data ingestion
• After migration, Scylla immediately kept up with our data loads in real time, with at most minimal backlog
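A minimal per-client migration sketch, assuming a straight row-by-row copy through the same driver; the hosts, keyspace, table name, and ingestion-control hooks are hypothetical, and a bulk path such as sstableloader would work as well:

from cassandra.cluster import Cluster

cass   = Cluster(["cassandra-node1"]).connect("events_ks")   # hypothetical hosts/keyspace
scylla = Cluster(["scylla-node1"]).connect("events_ks")

def migrate_client(table):
    # Copy one client's table row by row while that client's ingestion is paused.
    insert = scylla.prepare(
        "INSERT INTO {t} (userid, time, eventdata) VALUES (?, ?, ?)".format(t=table))
    pause_ingestion(table)                      # hypothetical ingestion-control hooks
    try:
        for row in cass.execute("SELECT userid, time, eventdata FROM " + table):
            scylla.execute(insert, (row.userid, row.time, row.eventdata))
    finally:
        resume_ingestion(table)

migrate_client("client_acme")                   # one client at a time

Pausing ingestion per client keeps each copy consistent without taking the whole platform offline, which matches the one-client-at-a-time approach above.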
Scylla Stats - Compactions
• Our current data volume is about 3x what it was in June 2016, and still growing
• One Scylla cluster running on 10 i3.16xlarge nodes
  • 64 vCPU, 488GB RAM, 15TB storage
• Scylla determines the compaction rate dynamically and automatically
  • In C* you can configure compactions to be stronger or weaker, but those settings can become invalid as your workload changes
• Our Scylla cluster showed minimal pending compactions
Scylla Stats - Latencies
• With much lower pending compactions, read/write latency is naturally lower
• With Scylla, we don't have data backlogs
Scylla's Self-Tuning
• Ability to isolate background and foreground tasks and determine the best rate for things like compactions and repairs
• Basically no changes to the scylla.yaml file
• Scylla does kernel tuning at deployment time
Scylla AMIs and i3
• We used Scylla AMIs for AWS EC2
  • For our initial deployment of the cluster, it was as simple as running the ScyllaDB setup utility
• Lessons learned from deploying to i3 instances
  • We started using i3 as soon as it became available
  • Currently we use 10 i3.16xlarge nodes
  • i3 instances were not the most stable instance type
    • Scylla's fast recovery time helps: a node could be brought back within 8 hours, under live data load and while replicating 7TB of data
  • Customizations of the Scylla AMIs were needed, e.g., for kernel tuning
    • Scylla's latest AMI supports i3
• Use a small number of big nodes instead of many small nodes
Scylla Support
• A direct quote from our Director of DevOps:
From a devops perspective, when it comes to getting support from a third-party vendor on their own product, the best you can
hope for is product competency, professionalism and infrastructure competency.
Product competency is relatively common to find – most support teams know their stuff pretty well. Underlying infrastructure
proficiency is not as common as it should be – I have dealt with many support teams that have the attitude of "it's not our
product, it is the OS/hardware/etc. – you should contact their support", which not only does not help resolve the issue quickly,
but may end up causing more trouble in the long run, because of the disconnect between the layers. Lastly, professionalism and
dependability – you want to have the support team be there for you until the issue is resolved, no matter what.
With Scylla's support team, you get all three at 100% and beyond.
Their engineers know the ins and outs of their product, without having to "get back to you" hours later. All engineers are
responsive and on top of an issue or question so you get the response, and ultimately the resolution, as fast as it can possibly be
done.
They are absolutely dedicated and reliable and will make sure that your issue is resolved, or they will work with you 24/7
until you are satisfied.
Lastly, they are experts in the OS and hardware on which their customers run the product. This may seem like a side note or a
"nice to have", but knowing the underlying infrastructure, how it behaves, what you can expect from it, how you can tune it to get
the most out of it, is an absolute gem. It can not only help, but in certain cases it can mean the difference between a quick and
solid resolution, and a prolonged case involving multiple vendors. It can mean the difference between mediocre performance, or
the one Scylla offers.
Summary
• We used Cassandra / Scylla for high read/write throughput and low latencies
• If you struggle with tuning Cassandra, definitely consider Scylla
  • Better performance
  • Makes DevOps life easier
  • Awesome support
• Future plans with Scylla
Thank you!
GET IN TOUCH!
We are hiring!
mParticle
257 Park Avenue South, 9th Floor
New York, NY 10010
@mparticles | http://www.mparticle.com
