SlideShare a Scribd company logo
Joining Billions of Rows
in Seconds: Replacing
MongoDB and Hive with Scylla
Alexys Jacob - CTO, Numberly
Moderator - Peter Corless, ScyllaDB
Peter has a 29-year career in Silicon Valley that
threads through stints at e2f, Aerospike, Cisco and
Apple. He is passionate about technology, customer
success, engendering community, and social media. In
his off hours he enjoys playing 4X strategy games.
Twitter: @petercorless
2
3
+ The Real-Time Big Data Database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Learn more at scylladb.com
About ScyllaDB
Presenter - Alexys Jacob, Numberly
4
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
whoami
@ultrabug
5
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
6
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all
data sources and destinations.
➔ For this we use ID matching tables.
7
ID matching tables
JOIN
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable
by partner
Queried AND updated all the time!
➔ High read AND write workload
8
Real life example: retargeting
From a database (email) to a web banner (cookie)
Previous
donors
generous@coconut.fr
isupportu@lab.com
wiki4ever@wp.eu
openinternet@free.fr
https://kitty.eu
AppNexus
...
Google
ID
matching
table
Cookie id = 123
Cookie id = 297
?
Cookie id = 896
Ad Exchange User cookie id 123
SELECT MATCH ACTIVATE
9
Current implementation(s)
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
10
Drawbacks & pitfalls
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
11
Scylla?
Future implementation using Scylla?
Events
Message
queues
Real time
Programs
Batch
Calculation
Scylla
Batch pipeline
Real time pipeline
13
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!
14
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated to the given partner ID
over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
15
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the
sstable!
▪ Change “date” ordering to
DESC
➔ Latest value at the top of the
sstable
➔ Reduced read latency!
16
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
17
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
18
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Spark 2 + Hive: reference metrics
Hive
(population)
Hive
(ID matching)
Partitions
count
+
19
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
21
Spark 2 + Hive + Scylla
Hive
(population)
Scylla
(ID matching)
Partitions
count
+
22
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
23
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per seconds
▪ spark.cassandra.input.reads_per_sec=6666
24
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
25
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable when
using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row lookup the ID matching table from Scylla
3. Count the resulting number of matches
27
Dask + Hive + Scylla
Results:
▪ Cold cache: 6min
▪ Hot cache: 2min
Hive
(population)
Scylla
(ID matching)
Partitions
count
28
Dask + Hive + Scylla time break down
50 seconds
10 seconds
60 seconds
Hive
Scylla
(ID matching)
Partitions
count
29
Dask + Parquet + Scylla
Parquet files
(HDFS)
Scylla
Partitions
count
10 seconds!
30
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
Scylla!
Production environment
+ 6x DELL R640
+ dual socket 2,6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3,2 TB
Gentoo Linux
Multi-DC setup
Ansible based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
34
Q&A
Stay in touch
alexys@numberly.com
@ultrabug
ultrabug.fr
United States
1900 Embarcadero Road
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank You!

More Related Content

What's hot

ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
alex_araujo
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
ScyllaDB
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
Rostislav Pashuto
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
DataStax
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
HBaseCon
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
Altinity Ltd
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
ScyllaDB
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
Alluxio, Inc.
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
DataWorks Summit/Hadoop Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
ScyllaDB
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
Redge Technologies
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
Patrick McFadin
 
Observability of InfluxDB IOx: Tracing, Metrics and System Tables
Observability of InfluxDB IOx: Tracing, Metrics and System TablesObservability of InfluxDB IOx: Tracing, Metrics and System Tables
Observability of InfluxDB IOx: Tracing, Metrics and System Tables
InfluxData
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
confluent
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
Minsoo Jun
 

What's hot (20)

ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
Observability of InfluxDB IOx: Tracing, Metrics and System Tables
Observability of InfluxDB IOx: Tracing, Metrics and System TablesObservability of InfluxDB IOx: Tracing, Metrics and System Tables
Observability of InfluxDB IOx: Tracing, Metrics and System Tables
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 

Similar to Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla

Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
ScyllaDB
 
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
DevOps.com
 
Webinar how to build a highly available time series solution with kairos-db (1)
Webinar  how to build a highly available time series solution with kairos-db (1)Webinar  how to build a highly available time series solution with kairos-db (1)
Webinar how to build a highly available time series solution with kairos-db (1)
Julia Angell
 
Webinar: How to build a highly available time series solution with KairosDB
Webinar: How to build a highly available time series solution with KairosDBWebinar: How to build a highly available time series solution with KairosDB
Webinar: How to build a highly available time series solution with KairosDB
ScyllaDB
 
Scylla Virtual Workshop 2022
Scylla Virtual Workshop 2022Scylla Virtual Workshop 2022
Scylla Virtual Workshop 2022
ScyllaDB
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
ScyllaDB
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
DataStax Academy
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
ScyllaDB
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
ScyllaDB
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Daniel Zivkovic
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
confluent
 
Transforming the Database: Critical Innovations for Performance at Scale
Transforming the Database: Critical Innovations for Performance at ScaleTransforming the Database: Critical Innovations for Performance at Scale
Transforming the Database: Critical Innovations for Performance at Scale
ScyllaDB
 
Scylla db deck, july 2017
Scylla db deck, july 2017Scylla db deck, july 2017
Scylla db deck, july 2017
Dor Laor
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
ScyllaDB
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
Paolo Castagna
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Java Day Minsk 2016 Keynote about Microservices in real world
Java Day Minsk 2016 Keynote about Microservices in real worldJava Day Minsk 2016 Keynote about Microservices in real world
Java Day Minsk 2016 Keynote about Microservices in real world
Кирилл Толкачёв
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Build DynamoDB-Compatible Apps with Python
Build DynamoDB-Compatible Apps with PythonBuild DynamoDB-Compatible Apps with Python
Build DynamoDB-Compatible Apps with Python
ScyllaDB
 

Similar to Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla (20)

Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
 
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
 
Webinar how to build a highly available time series solution with kairos-db (1)
Webinar  how to build a highly available time series solution with kairos-db (1)Webinar  how to build a highly available time series solution with kairos-db (1)
Webinar how to build a highly available time series solution with kairos-db (1)
 
Webinar: How to build a highly available time series solution with KairosDB
Webinar: How to build a highly available time series solution with KairosDBWebinar: How to build a highly available time series solution with KairosDB
Webinar: How to build a highly available time series solution with KairosDB
 
Scylla Virtual Workshop 2022
Scylla Virtual Workshop 2022Scylla Virtual Workshop 2022
Scylla Virtual Workshop 2022
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
Transforming the Database: Critical Innovations for Performance at Scale
Transforming the Database: Critical Innovations for Performance at ScaleTransforming the Database: Critical Innovations for Performance at Scale
Transforming the Database: Critical Innovations for Performance at Scale
 
Scylla db deck, july 2017
Scylla db deck, july 2017Scylla db deck, july 2017
Scylla db deck, july 2017
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Java Day Minsk 2016 Keynote about Microservices in real world
Java Day Minsk 2016 Keynote about Microservices in real worldJava Day Minsk 2016 Keynote about Microservices in real world
Java Day Minsk 2016 Keynote about Microservices in real world
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Build DynamoDB-Compatible Apps with Python
Build DynamoDB-Compatible Apps with PythonBuild DynamoDB-Compatible Apps with Python
Build DynamoDB-Compatible Apps with Python
 

More from ScyllaDB

Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
ScyllaDB
 
ScyllaDB Cloud: Faster and More Flexible
ScyllaDB Cloud: Faster and More FlexibleScyllaDB Cloud: Faster and More Flexible
ScyllaDB Cloud: Faster and More Flexible
ScyllaDB
 
The Strategy Behind ReversingLabs’ Massive Key-Value Migration
The Strategy Behind ReversingLabs’ Massive Key-Value MigrationThe Strategy Behind ReversingLabs’ Massive Key-Value Migration
The Strategy Behind ReversingLabs’ Massive Key-Value Migration
ScyllaDB
 
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside LookScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB
 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
ScyllaDB
 
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time MLMongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
ScyllaDB
 
Inside Expedia's Migration to ScyllaDB for Change Data Capture
Inside Expedia's Migration to ScyllaDB for Change Data CaptureInside Expedia's Migration to ScyllaDB for Change Data Capture
Inside Expedia's Migration to ScyllaDB for Change Data Capture
ScyllaDB
 
Terraform Best Practices for Infrastructure Scaling
Terraform Best Practices for Infrastructure ScalingTerraform Best Practices for Infrastructure Scaling
Terraform Best Practices for Infrastructure Scaling
ScyllaDB
 
Elasticity vs. State? Exploring Kafka Streams Cassandra State Store
Elasticity vs. State? Exploring Kafka Streams Cassandra State StoreElasticity vs. State? Exploring Kafka Streams Cassandra State Store
Elasticity vs. State? Exploring Kafka Streams Cassandra State Store
ScyllaDB
 
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to SuccessDynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
Real-Time or Analytics Workloads... Why Not Both?
Real-Time or Analytics Workloads... Why Not Both?Real-Time or Analytics Workloads... Why Not Both?
Real-Time or Analytics Workloads... Why Not Both?
ScyllaDB
 
Real-Time Persisted Events at Supercell
Real-Time Persisted Events at  SupercellReal-Time Persisted Events at  Supercell
Real-Time Persisted Events at Supercell
ScyllaDB
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB
 
An All-Around Benchmark of the DBaaS Market
An All-Around Benchmark of the DBaaS MarketAn All-Around Benchmark of the DBaaS Market
An All-Around Benchmark of the DBaaS Market
ScyllaDB
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
ScyllaDB
 
So You've Lost Quorum: Lessons From Accidental Downtime
So You've Lost Quorum: Lessons From Accidental DowntimeSo You've Lost Quorum: Lessons From Accidental Downtime
So You've Lost Quorum: Lessons From Accidental Downtime
ScyllaDB
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 

More from ScyllaDB (20)

Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
 
ScyllaDB Cloud: Faster and More Flexible
ScyllaDB Cloud: Faster and More FlexibleScyllaDB Cloud: Faster and More Flexible
ScyllaDB Cloud: Faster and More Flexible
 
The Strategy Behind ReversingLabs’ Massive Key-Value Migration
The Strategy Behind ReversingLabs’ Massive Key-Value MigrationThe Strategy Behind ReversingLabs’ Massive Key-Value Migration
The Strategy Behind ReversingLabs’ Massive Key-Value Migration
 
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside LookScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside Look
 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
 
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time MLMongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
 
Inside Expedia's Migration to ScyllaDB for Change Data Capture
Inside Expedia's Migration to ScyllaDB for Change Data CaptureInside Expedia's Migration to ScyllaDB for Change Data Capture
Inside Expedia's Migration to ScyllaDB for Change Data Capture
 
Terraform Best Practices for Infrastructure Scaling
Terraform Best Practices for Infrastructure ScalingTerraform Best Practices for Infrastructure Scaling
Terraform Best Practices for Infrastructure Scaling
 
Elasticity vs. State? Exploring Kafka Streams Cassandra State Store
Elasticity vs. State? Exploring Kafka Streams Cassandra State StoreElasticity vs. State? Exploring Kafka Streams Cassandra State Store
Elasticity vs. State? Exploring Kafka Streams Cassandra State Store
 
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to SuccessDynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
 
Real-Time or Analytics Workloads... Why Not Both?
Real-Time or Analytics Workloads... Why Not Both?Real-Time or Analytics Workloads... Why Not Both?
Real-Time or Analytics Workloads... Why Not Both?
 
Real-Time Persisted Events at Supercell
Real-Time Persisted Events at  SupercellReal-Time Persisted Events at  Supercell
Real-Time Persisted Events at Supercell
 
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDB
 
An All-Around Benchmark of the DBaaS Market
An All-Around Benchmark of the DBaaS MarketAn All-Around Benchmark of the DBaaS Market
An All-Around Benchmark of the DBaaS Market
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
 
So You've Lost Quorum: Lessons From Accidental Downtime
So You've Lost Quorum: Lessons From Accidental DowntimeSo You've Lost Quorum: Lessons From Accidental Downtime
So You've Lost Quorum: Lessons From Accidental Downtime
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 

Recently uploaded

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
ScyllaDB
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 

Recently uploaded (20)

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 

Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla

  • 1. Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla Alexys Jacob - CTO, Numberly
  • 2. Moderator - Peter Corless, ScyllaDB Peter has a 29-year career in Silicon Valley that threads through stints at e2f, Aerospike, Cisco and Apple. He is passionate about technology, customer success, engendering community, and social media. In his off hours he enjoys playing 4X strategy games. Twitter: @petercorless 2
  • 3. 3 + The Real-Time Big Data Database + Drop-in replacement for Cassandra + 10X the performance & low tail latency + Open source and enterprise editions + Founded by the creators of KVM hypervisor + HQs: Palo Alto, CA; Herzelia, Israel + Learn more at scylladb.com About ScyllaDB
  • 4. Presenter - Alexys Jacob, Numberly 4
  • 5. 1 Eiffel Tower 2 Soccer World Cups 15 Years in the Data industry Pythonista OSS enthusiast & contributor Gentoo Linux developer CTO at Numberly - living in Paris, France whoami @ultrabug 5
  • 6. Business context of Numberly Digital Marketing Technologist (MarTech) Handling the relationship between brands and people (People based) Dealing with multiple sources and a wide range of data types (Events) Mixing and correlating a massive amount of different types of events... ...which all have their own identifiers (think primary keys) 6
  • 7. Business context of Numberly Web navigation tracking (browser ID: cookie) CRM databases (email address, customer ID) Partners’ digital platforms (cookie ID, hash(email address)) Mobile phone apps (device ID: IDFA, GAID) Ability to synchronize and translate identifiers between all data sources and destinations. ➔ For this we use ID matching tables. 7
  • 8. ID matching tables JOIN 1. SELECT reference population 2. JOIN with the ID matching table 3. MATCHED population is usable by partner Queried AND updated all the time! ➔ High read AND write workload 8
  • 9. Real life example: retargeting From a database (email) to a web banner (cookie) Previous donors generous@coconut.fr isupportu@lab.com wiki4ever@wp.eu openinternet@free.fr https://kitty.eu AppNexus ... Google ID matching table Cookie id = 123 Cookie id = 297 ? Cookie id = 896 Ad Exchange User cookie id 123 SELECT MATCH ACTIVATE 9
  • 11. Drawbacks & pitfalls Events Message queues HDFS Real time Programs Batch Calculation MongoDB Hive Batch pipeline Real time pipeline 11
  • 13. Future implementation using Scylla? Events Message queues Real time Programs Batch Calculation Scylla Batch pipeline Real time pipeline 13
  • 14. Proof Of Concept hardware Recycled hardware… ▪ 2x DELL R510 • 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC ▪ 1x DELL R710 • 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC ➔ Compete with our production? Scylla is in! 14
  • 15. Finding the right schema model Query based AND test-driven data modeling 1. What are all the cookie IDs associated to the given partner ID over the last N months? 2. What is the last cookie ID/date for the given partner ID? Gotcha: the reverse questions are also to be answered! ➔ Denormalization ➔ Prototype with your language of choice! 15
  • 16. Schema tip! > What is the last cookie ID for the given partner ID? TIP: CLUSTERING ORDER ▪ Defaults to ASC ➔ Latest value at the end of the sstable! ▪ Change “date” ordering to DESC ➔ Latest value at the top of the sstable ➔ Reduced read latency! 16
  • 17. scylla-grafana-monitoring Set it up and test it! ▪ Use cassandra-stress Key graphs: ▪ number of open connections ▪ cache hits / misses ▪ per shard/node distribution ▪ sstable reads TIP: reduce default scrape interval ▪ scrape_interval: 2s (4s default) ▪ scrape_timeout: 1s (5s default) 17
  • 18. Reference data and metrics Reference dataset ▪ 10M population ▪ 400M ID matching table ➔ Representative volumes Measured on our production stack, with real load NOT a benchmark! 18
  • 19. Results: ▪ idle cluster: 2 minutes, 15 seconds ▪ normal cluster: 4 minutes ▪ overloaded cluster: 15 minutes Spark 2 + Hive: reference metrics Hive (population) Hive (ID matching) Partitions count + 19
  • 21. Testing with Scylla Distinguish between hot and cold cache scenarios ▪ Cold cache: mostly disk I/O bound ▪ Hot cache: mostly memory bound Push your Scylla cluster to its limits! 21
  • 22. Spark 2 + Hive + Scylla Hive (population) Scylla (ID matching) Partitions count + 22
  • 23. Spark 2 / Scala test workload DataStax’s spark-cassandra-connector joinWithCassandraTable ▪ spark-cassandra-connector-2.0.1-s_2.11.jar ▪ Java 7 23
  • 24. Spark 2 tuning (1/2) Use a fixed number of executors ▪ spark.dynamicAllocation.enabled=false ▪ spark.executor.instances=30 Change Spark split size to match Scylla for read performance ▪ spark.cassandra.input.split.size_in_mb=1 Adjust reads per seconds ▪ spark.cassandra.input.reads_per_sec=6666 24
  • 25. Spark 2 tuning (2/2) Tune the number of connections opened by each executor ▪ spark.cassandra.connection.connections_per_executor_max=100 Align driver timeouts with server timeouts (check scylla.yaml) ▪ spark.cassandra.connection.timeout_ms=150000 ▪ spark.cassandra.read.timeout_ms=150000 ScyllaDB blog posts & webinar ▪ https://www.scylladb.com/2018/07/31/spark-scylla/ ▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/ ▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/ ▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/ 25
  • 26. Spark 2 + Scylla results Cold cache: 12 minutes Hot cache: 2 minutes Reference results: idle cluster: 2 minutes, 15 seconds normal cluster: 4 minutes overloaded cluster: 15 minutes
  • 27. OK for Scala, what about Python? No joinWithCassandraTable when using pyspark... Maybe we don’t need Spark 2 at all! 1. Load the 10M rows from Hive 2. For every row lookup the ID matching table from Scylla 3. Count the resulting number of matches 27
  • 28. Dask + Hive + Scylla Results: ▪ Cold cache: 6min ▪ Hot cache: 2min Hive (population) Scylla (ID matching) Partitions count 28
  • 29. Dask + Hive + Scylla time break down 50 seconds 10 seconds 60 seconds Hive Scylla (ID matching) Partitions count 29
  • 30. Dask + Parquet + Scylla Parquet files (HDFS) Scylla Partitions count 10 seconds! 30
  • 31. Dask + Scylla results Cold cache: 5 minutes Hot cache: 1 minute 5 seconds Spark 2 results: cold cache: 6 minutes hot cache: 2 minutes
  • 32. Python+Scylla with Parquet tips! ▪ Use execute_concurrent() ▪ Increase concurrency parameter (defaults to 100) ▪ Use libev as connection_class instead of asyncore ▪ Use hdfs3 + pyarrow to read and load Parquet files:
  • 34. Production environment + 6x DELL R640 + dual socket 2,6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3,2 TB Gentoo Linux Multi-DC setup Ansible based provisioning and backups Monitored by scylla-grafana-monitoring Housekeeping handled by scylla-manager 34
  • 36. United States 1900 Embarcadero Road Palo Alto, CA 94303 Israel 11 Galgalei Haplada Herzelia, Israel www.scylladb.com @scylladb Thank You!