Joining Billions of Rows in Seconds
with One Database Instead of Two:
Replacing MongoDB and Hive with Scylla
Alexys Jacob
CTO, Numberly
whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and
destinations.
➔ For this we use ID matching tables.
ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by
partner
Queried AND updated all the time!
➔ High read AND write workload
JOIN
Real life example: retargeting
From a database (email) to a web banner (cookie)
Previous
donors
generous@coconut.fr
isupportu@lab.com
wiki4ever@wp.eu
openinternet@free.fr
https://kitty.eu
AppNexus
...
Google
ID
matching
table
Cookie id = 123
Cookie id = 297
?
Cookie id = 896
Ad Exchange User cookie id 123
SELECT MATCH
ACTIVATE
Current implementation(s)
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Drawbacks & pitfalls
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Scylla?
Future implementation using Scylla?
Events
Message
queues
Real time
Programs
Batch
Calculation
Scylla
Batch pipeline
Real time pipeline
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N
months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
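The "prototype with your language of choice" advice can be made concrete with a tiny test-driven sketch: before writing any CQL, encode each question as an assertion against toy denormalized mappings (both directions, since the reverse questions must also be answered). All names and data here are illustrative, not Numberly's actual schema.

```python
# Toy denormalized mappings, one per query direction (illustrative only).
partner_to_cookies = {"p1": [("2018-10-01", "c1"), ("2018-09-01", "c2")]}
cookie_to_partners = {"c1": [("2018-10-01", "p1")], "c2": [("2018-09-01", "p1")]}

def cookies_for_partner(pid):
    """Question 1: all cookie IDs associated with the given partner ID."""
    return [cookie for _date, cookie in partner_to_cookies.get(pid, [])]

def last_cookie_for_partner(pid):
    """Question 2: the last (most recent) cookie ID for the given partner ID."""
    rows = partner_to_cookies.get(pid, [])
    return max(rows)[1] if rows else None  # newest date wins

# The tests that drive the model:
assert cookies_for_partner("p1") == ["c1", "c2"]
assert last_cookie_for_partner("p1") == "c1"
assert cookie_to_partners["c2"][0][1] == "p1"  # reverse question answered too
```

Once these assertions pass against the in-memory model, translating each mapping into a denormalized table is mostly mechanical.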
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the
sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the
sstable
➔ Reduced read latency!
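The clustering-order tip translates into CQL like the sketch below. Keyspace, table and column names are hypothetical placeholders, not the actual production schema; the commented part needs a live cluster and cassandra-driver.

```python
# Hypothetical DDL illustrating the tip: cluster "date" DESC so the newest
# row of each partition sits at the top of the sstable.
CREATE_MATCHING_TABLE = """
CREATE TABLE IF NOT EXISTS ids.partner_to_cookie (
    partner_id text,
    date timestamp,
    cookie_id text,
    PRIMARY KEY ((partner_id), date, cookie_id)
) WITH CLUSTERING ORDER BY (date DESC, cookie_id ASC)
"""

# With DESC ordering, "latest cookie ID for a partner ID" is a cheap LIMIT 1:
LATEST_COOKIE = """
SELECT cookie_id, date FROM ids.partner_to_cookie
WHERE partner_id = ? LIMIT 1
"""

# Against a live cluster you would run (requires cassandra-driver):
#   from cassandra.cluster import Cluster
#   session = Cluster(["scylla-host"]).connect()
#   session.execute(CREATE_MATCHING_TABLE)
#   row = session.execute(session.prepare(LATEST_COOKIE), ["partner-42"]).one()
```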
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
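The two scrape settings above live in the Prometheus configuration shipped with scylla-grafana-monitoring; a minimal sketch (job name and target are placeholders):

```yaml
# prometheus.yml fragment -- tightened scrape settings for Scylla monitoring
global:
  scrape_interval: 2s   # shipped default: 4s
  scrape_timeout: 1s    # shipped default: 5s
scrape_configs:
  - job_name: scylla
    static_configs:
      - targets: ["scylla-node1:9180"]   # Scylla's Prometheus metrics port
```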
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Spark 2 + Hive: reference metrics
Hive
(population)
Hive
(ID matching)
Partitions count
+
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
Spark 2 + Hive + Scylla
Hive
(population)
Scylla
(ID matching)
Partitions count
+
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
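The tuning knobs from both slides can be gathered into a single configuration block. This is a sketch, not a drop-in config: the values are the ones from the slides, and you would adjust them to your own cluster.

```python
# All spark-cassandra-connector tuning settings from the slides, as one dict.
SCYLLA_SPARK_CONF = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    # align with the corresponding server-side timeouts in scylla.yaml
    "spark.cassandra.connection.timeout_ms": "150000",
    "spark.cassandra.read.timeout_ms": "150000",
}

# With pyspark installed you would apply it like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("id-matching")
#   for key, value in SCYLLA_SPARK_CONF.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```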
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable
when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row lookup the ID matching table from Scylla
3. Count the resulting number of matches
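The three steps above can be sketched with an in-memory stand-in: here "Hive" is a list of population rows and "Scylla" a dict keyed like the ID matching table. Names and shapes are illustrative only; in the real pipeline step 1 is a partitioned Hive/Parquet read and step 2 a Scylla lookup.

```python
def count_matches(population, matching_table):
    """Steps 2 + 3: look up each row's ID in the matching table, count hits."""
    matches = 0
    for _user_id, partner_id in population:
        if partner_id in matching_table:  # stands in for the per-row Scylla lookup
            matches += 1
    return matches

# Step 1 would really load 10M rows from Hive; here we fake a tiny population:
population = [("u1", "a"), ("u2", "b"), ("u3", "z")]
matching_table = {"a": "cookie-123", "b": "cookie-297"}
print(count_matches(population, matching_table))  # -> 2
```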
Dask + Hive + Scylla
Results:
▪ Cold cache: 6min
▪ Hot cache: 2min
Hive
(population)
Scylla
(ID matching)
Partitions count
Dask + Hive + Scylla time break down
Hive: 50 seconds
Scylla: 60 seconds
Partitions count: 10 seconds
Dask + Parquet + Scylla
Parquet files
(HDFS)
Scylla
Partitions count
10 seconds!
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
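The code snippet this slide pointed to is not in this extraction, so here is a hedged sketch combining the tips; host names, paths, keyspace and table names are hypothetical. The pure part (building per-row query parameters) runs as-is; the HDFS and Scylla calls are shown commented since they need live services (hdfs3, pyarrow, cassandra-driver).

```python
def build_lookup_args(partner_ids):
    """One (partner_id,) tuple per row, the shape execute_concurrent_with_args expects."""
    return [(pid,) for pid in partner_ids]

# Reading the population straight from Parquet on HDFS:
#   import hdfs3
#   import pyarrow.parquet as pq
#   hdfs = hdfs3.HDFileSystem(host="namenode", port=8020)
#   with hdfs.open("/data/population.parquet") as f:
#       partner_ids = pq.read_table(f).column("partner_id").to_pylist()
#
# Concurrent lookups against Scylla, with libev instead of asyncore:
#   from cassandra.cluster import Cluster
#   from cassandra.io.libevreactor import LibevConnection
#   from cassandra.concurrent import execute_concurrent_with_args
#   cluster = Cluster(["scylla-host"], connection_class=LibevConnection)
#   session = cluster.connect("ids")
#   stmt = session.prepare(
#       "SELECT cookie_id FROM partner_to_cookie WHERE partner_id = ?")
#   results = execute_concurrent_with_args(
#       session, stmt, build_lookup_args(partner_ids), concurrency=512)

print(build_lookup_args(["a", "b"]))  # -> [('a',), ('b',)]
```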
Scylla!
Production environment
▪ 6x DELL R640
• dual socket 2.6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3.2 TB
Gentoo Linux
Multi-DC setup
Ansible based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
Thank You
Questions welcomed!
Stay in touch
alexys@numberly.com
@ultrabug https://ultrabug.fr

Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Instead of Two - Replacing MongoDB and Hive with Scylla


Editor's Notes

  • #8 Lambda architecture
  • #9 Keeping two copies of the same data in sync, batch data freshness, operational burden => neither can sustain both the read and write workload
  • #10 Can Scylla sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?
  • #11 Simpler data consistency Operational simplicity and efficiency Reduced costs
  • #12 Always try a technology under the best omens :) Running Gentoo Linux
  • #13 ID translations must be done both ways => denormalization. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
  • #14 We ended up with three denormalized tables. One is a history-like table (just like a log). Optimize for latest value? DESC clustering ensures that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing read latency when the row is not in cache!
  • #15 Docker based, Easy to install, Multi environment support Understand the performance of your cluster Tune your workload for optimal performances
  • #16 Ref dataset: data cardinality, representative volumes. MongoDB cluster: make sure to shard and index the dataset just like you do on the production collections. Hive: respect the storage file format of your current implementations as well as their partitioning. How many machines in production?! Say it.
  • #19 It’s time to break Scylla, your goal here is to saturate the Scylla cluster, get it to ~90% load
  • #20 Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partner ID; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #21 I experienced pretty crappy performance at first. Grafana monitoring showed that Scylla was not the bottleneck. Repartition is used to leverage the driver's knowledge of how data is sharded, to optimize how it is going to be split between Spark workers.
  • #22 Take your clusters’ utilization into account
  • #23 Take your clusters’ utilization into account
  • #24 With spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performance. Those three refurbished machines can compete with our current production machines and implementations. They can even match an idle Hive cluster of medium size. DIGRESSION!
  • #25 I went on the crazy quest of beating Spark 2 performance using a pure Python implementation. The main problem in competing with Spark 2 is that it is a distributed framework while Python by itself is not, so you can't possibly imagine outperforming Spark 2 with a single machine. Spark 2 is shipped and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time. This is a quite expensive process that we have a chance to avoid using Python! joinWithCassandraTable JOINs 10M with 400M...
  • #26 Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partner ID; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #28 libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than loading from Hive on multiple ones!
  • #29 The Hive loading + partitioning got down from 50s to 10s
  • #31 The conclusion of the evaluation was not driven by the good figures we got out of our test workloads. Those are no benchmarks and never pretended to be, but we could still prove that performance was solid enough not to be a blocker in the adoption of Scylla. Instead we decided on the following points of interest (in no particular order): data consistency, production reliability, datacenter awareness, ease of operation, infrastructure rationalisation, developer friendliness (but it's not Mongo), costs (training engineers).