Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla


Many organizations struggle to balance traditional big data infrastructure with NoSQL databases. Other organizations do the smart thing and consolidate the two. This presentation explores Numberly’s experience migrating an intensive and join-hungry production workload from MongoDB and Hive to Scylla.

Join Alexys Jacob, CTO of Numberly, to learn how they joined billions of rows in seconds and dramatically reduced operational and development complexity by using a single database for their hybrid analytical use case.

As a bonus, Alexys will also cover benchmarks for Dask (a flexible parallel computing library for analytic computing) and Spark, highlighting their differences and lessons learned along the way.

Published in: Technology

  1. Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla (Alexys Jacob, CTO, Numberly)
  2. Moderator: Peter Corless, ScyllaDB. Peter has a 29-year career in Silicon Valley that threads through stints at e2f, Aerospike, Cisco and Apple. He is passionate about technology, customer success, engendering community, and social media. In his off hours he enjoys playing 4X strategy games. Twitter: @petercorless
  3. About ScyllaDB
      + The Real-Time Big Data Database
      + Drop-in replacement for Cassandra
      + 10X the performance & low tail latency
      + Open source and enterprise editions
      + Founded by the creators of KVM hypervisor
      + HQs: Palo Alto, CA; Herzelia, Israel
  4. Presenter: Alexys Jacob, Numberly
  5. whoami (@ultrabug)
      ▪ 1 Eiffel Tower
      ▪ 2 Soccer World Cups
      ▪ 15 Years in the Data industry
      ▪ Pythonista
      ▪ OSS enthusiast & contributor
      ▪ Gentoo Linux developer
      ▪ CTO at Numberly, living in Paris, France
  6. Business context of Numberly
      ▪ Digital Marketing Technologist (MarTech)
      ▪ Handling the relationship between brands and people (People based)
      ▪ Dealing with multiple sources and a wide range of data types (Events)
      ▪ Mixing and correlating a massive amount of different types of events...
      ▪ ...which all have their own identifiers (think primary keys)
  7. Business context of Numberly
      ▪ Web navigation tracking (browser ID: cookie)
      ▪ CRM databases (email address, customer ID)
      ▪ Partners’ digital platforms (cookie ID, hash(email address))
      ▪ Mobile phone apps (device ID: IDFA, GAID)
      Ability to synchronize and translate identifiers between all data sources and destinations.
      ➔ For this we use ID matching tables.
  8. ID matching tables: JOIN
      1. SELECT reference population
      2. JOIN with the ID matching table
      3. MATCHED population is usable by partner
      Queried AND updated all the time!
      ➔ High read AND write workload
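The three steps on this slide can be sketched in plain Python as a simple hash join. This is a minimal illustration only: the sample identifiers, the two data structures and the `match_population` helper are hypothetical and do not come from the talk.

```python
# Hypothetical data: a reference population keyed by hashed email, and an
# ID matching table translating those keys into partner cookie IDs.
reference_population = ["hash_a", "hash_b", "hash_c"]
id_matching_table = {
    "hash_a": ["cookie_123"],
    "hash_c": ["cookie_297", "cookie_896"],
}


def match_population(population, matching_table):
    """Join the population with the matching table, keeping only matches."""
    matched = {}
    for key in population:                 # 1. SELECT reference population
        cookies = matching_table.get(key)  # 2. JOIN with the ID matching table
        if cookies:
            matched[key] = cookies         # 3. MATCHED population
    return matched


matched = match_population(reference_population, id_matching_table)
print(matched)
```

In production the matching table is far too large and too frequently updated for an in-memory dictionary, which is why the rest of the talk focuses on where that table is stored.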
  9. Real life example: retargeting
      From a database (email) to a web banner (cookie)
      [Diagram: SELECT (previous donors), MATCH (ID matching table: cookie id = 123, cookie id = 297, ?, cookie id = 896), ACTIVATE (AppNexus, ..., Google; ad exchange user cookie id 123)]
  10. Current implementation(s)
      [Diagram: Events, Message queues, HDFS, Real time programs, Batch calculation, MongoDB and Hive, arranged as a batch pipeline and a real time pipeline]
  11. Drawbacks & pitfalls
      [Diagram: Events, Message queues, HDFS, Real time programs, Batch calculation, MongoDB and Hive, arranged as a batch pipeline and a real time pipeline]
  12. Scylla?
  13. Future implementation using Scylla?
      [Diagram: Events, Message queues, Real time programs, Batch calculation and Scylla, arranged as a batch pipeline and a real time pipeline]
  14. Proof Of Concept hardware
      Recycled hardware…
      ▪ 2x DELL R510
          • 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
      ▪ 1x DELL R710
          • 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
      ➔ Compete with our production? Scylla is in!
  15. Finding the right schema model
      Query based AND test-driven data modeling
      1. What are all the cookie IDs associated with the given partner ID over the last N months?
      2. What is the last cookie ID/date for the given partner ID?
      Gotcha: the reverse questions also need to be answered!
      ➔ Denormalization
      ➔ Prototype with your language of choice!
  16. Schema tip!
      > What is the last cookie ID for the given partner ID?
      TIP: CLUSTERING ORDER
      ▪ Defaults to ASC
          ➔ Latest value at the end of the sstable!
      ▪ Change “date” ordering to DESC
          ➔ Latest value at the top of the sstable
          ➔ Reduced read latency!
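As a rough illustration of the clustering order tip, the CQL below is held in Python strings so that it could be passed to a driver session. The keyspace, table and column names are assumptions made for this sketch; only the `CLUSTERING ORDER BY (date DESC ...)` clause reflects what the slide describes.

```python
# Hypothetical keyspace, table and column names. Only the descending
# clustering order on "date" comes from the slide.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS matching.partner_to_cookie (
    partner_id text,
    date timestamp,
    cookie_id text,
    PRIMARY KEY ((partner_id), date, cookie_id)
) WITH CLUSTERING ORDER BY (date DESC, cookie_id ASC)
"""

# Rows are stored newest first inside each partition, so the latest
# cookie ID for a partner ID is the first row returned.
LATEST_COOKIE_CQL = (
    "SELECT cookie_id, date FROM matching.partner_to_cookie "
    "WHERE partner_id = ? LIMIT 1"
)

print(CREATE_TABLE_CQL.strip())
print(LATEST_COOKIE_CQL)
```

With the default ASC order the same "latest value" query would need `ORDER BY date DESC`, reading the partition in reverse.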
  17. scylla-grafana-monitoring
      Set it up and test it!
      ▪ Use cassandra-stress
      Key graphs:
      ▪ number of open connections
      ▪ cache hits / misses
      ▪ per shard/node distribution
      ▪ sstable reads
      TIP: reduce default scrape interval
      ▪ scrape_interval: 2s (4s default)
      ▪ scrape_timeout: 1s (5s default)
  18. Reference data and metrics
      Reference dataset
      ▪ 10M population
      ▪ 400M ID matching table
      ➔ Representative volumes
      Measured on our production stack, with real load
      NOT a benchmark!
  19. Spark 2 + Hive: reference metrics
      [Diagram: Hive (population) + Hive (ID matching), Partitions count]
      Results:
      ▪ idle cluster: 2 minutes, 15 seconds
      ▪ normal cluster: 4 minutes
      ▪ overloaded cluster: 15 minutes
  20. Let’s use Scylla!
  21. Testing with Scylla
      Distinguish between hot and cold cache scenarios
      ▪ Cold cache: mostly disk I/O bound
      ▪ Hot cache: mostly memory bound
      Push your Scylla cluster to its limits!
  22. Spark 2 + Hive + Scylla
      [Diagram: Hive (population) + Scylla (ID matching), Partitions count]
  23. Spark 2 / Scala test workload
      DataStax’s spark-cassandra-connector joinWithCassandraTable
      ▪ spark-cassandra-connector-2.0.1-s_2.11.jar
      ▪ Java 7
  24. Spark 2 tuning (1/2)
      Use a fixed number of executors
      ▪ spark.dynamicAllocation.enabled=false
      ▪ spark.executor.instances=30
      Change Spark split size to match Scylla for read performance
      ▪ spark.cassandra.input.split.size_in_mb=1
      Adjust reads per second
      ▪ spark.cassandra.input.reads_per_sec=6666
  25. Spark 2 tuning (2/2)
      Tune the number of connections opened by each executor
      ▪ spark.cassandra.connection.connections_per_executor_max=100
      Align driver timeouts with server timeouts (check scylla.yaml)
      ▪ spark.cassandra.connection.timeout_ms=150000
      ▪ ScyllaDB blog posts & webinar
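One way to apply the settings from the two tuning slides is to pass them as `--conf` options to `spark-submit`. The sketch below only assembles that command line; the property names and values are the ones quoted on the slides, while the `to_submit_args` helper and the `job.jar` file name are hypothetical.

```python
# Property names and values as quoted on the tuning slides. They were chosen
# for Numberly's cluster and will likely need adjusting elsewhere.
spark_conf = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    "spark.cassandra.connection.timeout_ms": "150000",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(spark_conf)
# "job.jar" is a placeholder for the actual application artifact.
print(" ".join(["spark-submit"] + submit_args + ["job.jar"]))
```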
  26. Spark 2 + Scylla results
      ▪ Cold cache: 12 minutes
      ▪ Hot cache: 2 minutes
      Reference results:
      ▪ idle cluster: 2 minutes, 15 seconds
      ▪ normal cluster: 4 minutes
      ▪ overloaded cluster: 15 minutes
  27. OK for Scala, what about Python?
      No joinWithCassandraTable when using pyspark...
      Maybe we don’t need Spark 2 at all!
      1. Load the 10M rows from Hive
      2. For every row, look up the ID matching table in Scylla
      3. Count the resulting number of matches
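The three-step plan above can be sketched with the standard library alone. In this illustration an in-memory dictionary stands in for the Scylla table and a short list stands in for the rows loaded from Hive; both, along with the `lookup` and `count_matches` helpers, are hypothetical. The sketch counts population rows that have at least one match, which is one possible reading of step 3.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the ID matching table stored in Scylla (hypothetical data).
FAKE_MATCHING_TABLE = {"p1": ["c1", "c2"], "p3": ["c9"]}


def lookup(partner_id):
    """Stand-in for one Scylla query: return the matching cookie IDs."""
    return FAKE_MATCHING_TABLE.get(partner_id, [])


def count_matches(population, max_workers=8):
    """Look up every row concurrently and count rows with at least one match."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(1 for cookies in pool.map(lookup, population) if cookies)


population = ["p1", "p2", "p3"]      # 1. rows loaded from Hive (stand-in)
total = count_matches(population)    # 2. lookup per row, 3. count matches
print(total)
```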
  28. Dask + Hive + Scylla
      [Diagram: Hive (population), Scylla (ID matching), Partitions count]
      Results:
      ▪ Cold cache: 6min
      ▪ Hot cache: 2min
  29. Dask + Hive + Scylla time breakdown
      ▪ Hive: 50 seconds
      ▪ Scylla (ID matching): 10 seconds
      ▪ Partitions count: 60 seconds
  30. Dask + Parquet + Scylla
      [Diagram: Parquet files (HDFS), Scylla, Partitions count]
      10 seconds!
  31. Dask + Scylla results
      ▪ Cold cache: 5 minutes
      ▪ Hot cache: 1 minute 5 seconds
      Spark 2 results:
      ▪ cold cache: 6 minutes
      ▪ hot cache: 2 minutes
  32. Python+Scylla with Parquet tips!
      ▪ Use execute_concurrent()
      ▪ Increase concurrency parameter (defaults to 100)
      ▪ Use libev as connection_class instead of asyncore
      ▪ Use hdfs3 + pyarrow to read and load Parquet files
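The driver-related tips (the first three bullets) might look roughly like this with the DataStax Python driver (`cassandra-driver`). This is a sketch under assumptions, not the code used in the talk: the table and column names, the batch size, the concurrency value of 512 and the `chunked` and `count_matches` helpers are all hypothetical, and `LibevConnection` requires the driver's libev extension to be installed.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_matches(contact_points, partner_ids, concurrency=512):
    """Count partner IDs that have at least one row in the matching table."""
    # Imported lazily so that chunked() stays usable without the driver.
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args
    from cassandra.io.libevreactor import LibevConnection

    # Tip: use libev as connection_class instead of asyncore.
    cluster = Cluster(contact_points, connection_class=LibevConnection)
    session = cluster.connect()
    # Hypothetical table and column names.
    statement = session.prepare(
        "SELECT cookie_id FROM matching.partner_to_cookie WHERE partner_id = ?"
    )
    matched = 0
    try:
        for batch in chunked(partner_ids, 10000):
            # Tip: execute concurrently, with concurrency above the default 100.
            results = execute_concurrent_with_args(
                session,
                statement,
                [(partner_id,) for partner_id in batch],
                concurrency=concurrency,
            )
            for success, rows in results:
                if success and next(iter(rows), None) is not None:
                    matched += 1
    finally:
        cluster.shutdown()
    return matched


# Example call (needs a reachable cluster, so it is left commented out):
# print(count_matches(["127.0.0.1"], ["p1", "p2", "p3"]))
```

Submitting the lookups in batches keeps the parameter list bounded in memory; the right batch size and concurrency depend on the cluster and should be found by measurement.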
  33. Scylla!
  34. Production environment
      ▪ 6x DELL R640
          • dual socket 2,6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3,2 TB
      ▪ Gentoo Linux
      ▪ Multi-DC setup
      ▪ Ansible based provisioning and backups
      ▪ Monitored by scylla-grafana-monitoring
      ▪ Housekeeping handled by scylla-manager
  35. Q&A. Stay in touch: @ultrabug
  36. Thank You! United States: 1900 Embarcadero Road, Palo Alto, CA 94303. Israel: 11 Galgalei Haplada, Herzelia, Israel. @scylladb