OLAP or OLTP
Why not both?
Glauber Costa
VP Field Engineering, ScyllaDB
Presenter bio
Glauber Costa is VP of Field Engineering at ScyllaDB. He shares
his time between the engineering department working on
upcoming Scylla features and helping customers succeed.
Before ScyllaDB, Glauber worked with Virtualization in the Linux
Kernel for 10 years, with contributions into the Xen and KVM
Hypervisors and all sorts of guest functionality and containers.
The road ahead
▪ Scylla celebrates its 4th birthday.
• Performance leadership solidified, TPC design spreading.
▪ Performance is always on our radar and we’ll keep improving.
• But what’s next?
What’s next?
Mina Naguib is the Director of Site Reliability Engineering at Samsung ADS
Let’s make it (more) BORING!
Two major workload types
Analytics (OLAP)
▪ minutes, hours, days
▪ TB / PB of data per operation
▪ throughput oriented
▪ high parallelism
Real-time (OLTP)
▪ microseconds, milliseconds
▪ kB of data per operation
▪ latency oriented
▪ low/moderate parallelism
OLTP-optimized doing OLAP?
or
OLAP-optimized doing OLTP?
The role of money
Things that money can buy
▪ Food
▪ Clothes
▪ A house where I am from
▪ Throughput
Things that money cannot buy
▪ Love
▪ Happiness
▪ A house in the Bay Area
▪ Latencies
Shared clusters: the tuning conundrum
▪ Tune for latencies: throughput suffers
▪ Tune for throughput: latency suffers
▪ Patterns are seasonal. Which one to use as a tuning base?
Classical Solution
[Diagram: a Real Time Data Center and an Analytics Data Center, each with its own DATABASE.]
Cost/year for 150TB of replicated data
(price based on AWS i3.metal)

                      Hardware         Estimated waste %   Estimated waste $
1 DC (10 instances)   USD 278,560.00   40%                 USD 167,136.00
2 DC (20 instances)   USD 557,120.00   40% + 40%           USD 334,272.00

Total now is 20 instances
Plus increased maintenance costs on admin and tuning!
Example:
▪ Capacity per instance: 15TB
▪ Minimum amount of instances: 10
Assumptions:
▪ Real time workload is latency sensitive. Only uses 60% of resources.
▪ Analytics don’t run constantly, therefore only use 60% of resources.
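The arithmetic behind the cost slide can be sketched as below. The hardware cost and the 60% utilization come from the slide; the function name and the used/idle split are illustrative assumptions, since the slide does not say how its dollar column was derived. Under this split, the slide's dollar figures (USD 167,136 and USD 334,272) match 60% of the hardware cost.

```python
# Figures from the slide; the used/idle model is an illustrative assumption.
HARDWARE_COST_PER_DC = 278_560.00  # USD/year for 10 i3.metal instances
UTILIZATION = 0.60                 # each workload uses 60% of its cluster


def yearly_costs(num_dcs, cost_per_dc=HARDWARE_COST_PER_DC, utilization=UTILIZATION):
    """Split the yearly hardware bill into a used part and an idle part."""
    hardware = num_dcs * cost_per_dc
    used = hardware * utilization
    idle = hardware - used
    return {
        "hardware": round(hardware, 2),
        "used": round(used, 2),
        "idle": round(idle, 2),
    }


for dcs in (1, 2):
    print(dcs, "DC:", yearly_costs(dcs))
```

Doubling the number of data centers doubles every line of the bill, including the idle part, which is the point the slide makes about the classical two-DC setup.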
How can Scylla help you now?
What is your database running?
▪ Foreground, user-generated workload
• user queries, user updates
▪ Background, maintenance operations
• Some are proportional to user workload (compactions)
• Some are maintenance generated (repair)
I/O Scheduling
[Diagram: Query, Commitlog and Compaction queues in userspace feed the I/O Scheduler, which sits in front of the Disk. Labels: "Max useful disk concurrency", "I/O queued in FS/device", "No queues".]
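A minimal sketch of the idea in the I/O scheduling diagram: requests wait in per-class userspace queues, and only as many as the disk can usefully handle are in flight at once. The class name, the round-robin pick, and the concurrency value are hypothetical; this is not Scylla's implementation.

```python
from collections import deque


class UserspaceIOScheduler:
    """Keeps at most `max_concurrency` requests in flight on the disk."""

    def __init__(self, max_concurrency):
        self.max_concurrency = max_concurrency
        # One userspace queue per class of work, as on the slide.
        self.queues = {"query": deque(), "commitlog": deque(), "compaction": deque()}
        self.in_flight = 0

    def submit(self, io_class, request):
        self.queues[io_class].append(request)

    def dispatch(self):
        """Send queued requests to the disk while free slots remain."""
        dispatched = []
        progress = True
        while self.in_flight < self.max_concurrency and progress:
            progress = False
            # Simple round-robin over classes (an assumption of this sketch).
            for io_class, queue in self.queues.items():
                if queue and self.in_flight < self.max_concurrency:
                    dispatched.append((io_class, queue.popleft()))
                    self.in_flight += 1
                    progress = True
        return dispatched

    def complete(self):
        """Called when the disk finishes one request, freeing a slot."""
        self.in_flight -= 1
```

Because the excess requests stay in userspace rather than in the filesystem or device, the scheduler still gets to choose which class goes next when a slot frees up.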
CPU Scheduling
[Diagram: two CPUs, each interleaving read, write, Compaction and SSTable write tasks.]
Which tasks to run?
100 shares
100 shares
Which tasks to run?
100 shares
50 shares
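The share counts on these slides (100 vs 100, then 100 vs 50) can be read as weights for proportional scheduling. The sketch below is one common way to honor such weights (always run the group that has received the least time relative to its shares); the function and group names are hypothetical and this is not presented as Scylla's scheduler.

```python
from fractions import Fraction


def run_by_shares(shares, ticks):
    """Give out `ticks` time slices in proportion to each group's shares."""
    # Virtual runtime: time received so far, scaled down by the group's shares.
    vruntime = {name: Fraction(0) for name in shares}
    ran = {name: 0 for name in shares}
    for _ in range(ticks):
        # Pick the group that is furthest behind; break ties by name.
        name = min(vruntime, key=lambda n: (vruntime[n], n))
        ran[name] += 1
        vruntime[name] += Fraction(1, shares[name])
    return ran


print(run_by_shares({"query": 100, "compaction": 100}, 300))
print(run_by_shares({"query": 100, "compaction": 50}, 300))
```

With equal shares both groups get half of the slices; with 100 vs 50 the first group gets twice as many as the second.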
Controlled processes
▪ Strong mathematical foundation on control theory
▪ Automatically adjust to any incoming workload
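To make the control-theory point concrete, here is a toy feedback loop, assuming a proportional controller that sets a background task's shares from its backlog. Every name and constant (gain, work per share, share limits) is hypothetical and chosen only so the loop is easy to follow; it is not Scylla's controller.

```python
def proportional_shares(backlog, gain=1.0, min_shares=1, max_shares=1000):
    """More backlog means more shares, clamped to a fixed range."""
    return max(min_shares, min(max_shares, gain * backlog))


def simulate(incoming_per_tick, ticks, work_per_share=0.1):
    """Run the feedback loop and return the final (backlog, shares)."""
    backlog = 0.0
    shares = 0.0
    for _ in range(ticks):
        backlog += incoming_per_tick            # new work arrives
        shares = proportional_shares(backlog)   # controller reacts to backlog
        backlog -= min(backlog, shares * work_per_share)  # work gets done
    return backlog, shares


print(simulate(incoming_per_tick=100, ticks=200))
print(simulate(incoming_per_tick=50, ticks=200))
```

In this toy model nobody picks the share value by hand: the loop settles near 1000 shares for the heavier load and near 500 for the lighter one, which is the "automatically adjust" behavior the slide describes.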
Real time vs Analytics in the same DC
▪ Scylla controllers: background has limited impact.
▪ Workloads affect each other - but user has control
▪ Careful restriction of parallelism:
• Run a single DC today.
Don’t miss the Kiwi.com talk and see this in practice
Real time vs Analytics
1.5TB of Data, 1 Node.
200k/s Random queries, 0% cache hit rate.
▪ Average latency: 750us
▪ p95 latency: 1.9ms
▪ p99 latency: 3.3ms
Real time vs Analytics
Analytics runs together with real time queries
▪ average: 3.7ms
▪ p95: 13.4ms
▪ p99: 60.2ms
▪ p99: 28.7ms
Real time vs Analytics
With the node at 100%, real time throughput suffers
▪ Not able to sustain 200k/s continuously
Real time vs Analytics
Analytics runs together with real time queries
Impact can be reduced by carefully tuning parallelism of analytics
Analytics parallelism greatly reduced:
▪ average: 2ms
▪ p95: 5.3ms
▪ p99: 14.5ms
p99 Visual Comparison
▪ original parallelism (30 ms)
▪ fine tuned parallelism (10 ms)
We can do better.
How we do better
▪ User knows the expected priorities. We just have to be told.
▪ Any query executed under role analytics will be constrained
by its share of the system’s resources
CREATE ROLE analytics
  WITH LOGIN = true
  AND SERVICE_LEVEL = { 'shares': 200 };
Real time vs Analytics
Analytics are ISOLATED and run together with real time queries
Analytics parallelism is set to a high number.
▪ average: 2ms
▪ p95: 4ms
▪ p99: 6.7ms
p99 Visual comparison
non-isolated (30ms)
isolated (6.7 ms)
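The latency figures from the benchmark slides can be compared against the real-time-only baseline with a few lines of arithmetic. The numbers are the ones shown on the slides; the scenario labels are shorthand introduced here, and for the original shared run the sketch assumes the 28.7ms p99 value (the slide also shows 60.2ms), which is the one closest to the 30ms in the visual comparison.

```python
# Latencies in milliseconds, as shown on the slides.
BASELINE = {"average": 0.75, "p95": 1.9, "p99": 3.3}  # real time only

SCENARIOS = {
    # Assumption: 28.7ms is used for p99; the slide also shows 60.2ms.
    "shared, original parallelism": {"average": 3.7, "p95": 13.4, "p99": 28.7},
    "shared, reduced parallelism": {"average": 2.0, "p95": 5.3, "p99": 14.5},
    "isolated, high parallelism": {"average": 2.0, "p95": 4.0, "p99": 6.7},
}


def slowdown(scenario):
    """Latency of a scenario as a multiple of the real-time-only baseline."""
    return {
        metric: round(value / BASELINE[metric], 1)
        for metric, value in SCENARIOS[scenario].items()
    }


for name in SCENARIOS:
    print(f"{name}: {slowdown(name)}")
```

Read this way, the isolated run keeps p99 at roughly twice the baseline, while the untuned shared run is several times worse.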
Time spent tuning:
zero femtoseconds.
Summary
▪ Scylla is a great choice for Real Time + Analytics
▪ ScyllaDB delivers, today, a very compelling and flexible solution
▪ We will improve on our solid foundations built on latency
guarantees to make this use case even more compelling.
▪ Scylla is fast, but...
Performance is
yesterday’s news
Let’s make it boring.
Thank You
Any Questions?
Please stay in touch
glauber@scylladb.com
@glcst

Scylla Summit 2018: OLAP or OLTP? Why Not Both?
