DIGGING CASSANDRA CLUSTER
Ivan Burmistrov
Tech Lead at SKB Kontur
5+ years Cassandra experience (from Cassandra 0.7)
WHO AM I?
burmistrov@skbkontur.ru
@isburmistrov
https://www.linkedin.com/in/isburmistrov/en
• Services for businesses
• B2B: e-Invoicing
• B2G: e-reporting of tax returns to government
SKB KONTUR
RETAIL
• 24 x 7 x 365
• Guaranteed delivery
• Delivery time <= 1 minute
REQUIREMENTS
When Cassandra just works
SMART GUY
• 150+ different tables in cluster (Cassandra 1.2)
• Client read latency (99th percentile): 100ms – 2.0s
• Affected almost all tables
• CPU: 40% – 80%
• Disk: not a problem
THE PROBLEM
(graph: client read latency, 99th percentile; peaks around 2 sec.)
• ReadLatency.99thPercentile
node’s latency of processing read request
• ReadLatency.OneMinuteRate
node’s read requests per second
• SSTablesPerReadHistogram
how many SSTables node reads per read request
• Tables were pretty similar in these metrics
• Which values are good, and which are bad?
HYPOTHESIS 1: ANOMALIES IN METRICS
• Decrease/increase compaction throughput
• Change compaction strategy
• Nothing changed
HYPOTHESIS 2: COMPACTION
• ParNew GC – 6 seconds per minute (10%!)
• Read good articles about Cassandra and GC
• http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
• http://aryanet.com/blog/cassandra-garbage-collector-tuning
• Tried to tune
• Nothing changed
HYPOTHESIS 3: GC
• Built-in profiling tool from Oracle JDK 7 Update 40
• Low performance overhead: 1-2%
• Useful for CPU profiling: hot threads, hot methods, call stacks, …
• Profiling results: 70% of time – SSTableReader
Java Mission Control and Java Flight Recorder
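A minimal sketch of how such a recording can be captured on one node, assuming an Oracle JDK 7u40+ where Flight Recorder must first be unlocked; the file path and duration here are illustrative:

# In cassandra-env.sh (or wherever the node's JVM options are set):
JVM_OPTS="$JVM_OPTS -XX:+UnlockCommercialFeatures -XX:+FlightRecorder"

# Capture a recording from the running node, then open the file in Java Mission Control:
jcmd <cassandra_pid> JFR.start duration=120s filename=/tmp/cassandra-read-path.jfr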
• SSTablesPerReadHistogram did not help
• We needed another metric
• SSTablesPerSecond
how many SSTables each table reads per second
SSTablesPerSecond = SSTablesPerReadHistogram.Mean * ReadLatency.OneMinuteRate
What tables cause most reads of SSTables?
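A minimal sketch of the combination, assuming the two base metrics have already been collected per table (for example over JMX); the numbers below are purely illustrative:

def sstables_per_second(sstables_per_read_mean, read_rate):
    # Derived metric: mean SSTables touched per read, multiplied by reads per second.
    return {table: sstables_per_read_mean[table] * read_rate.get(table, 0.0)
            for table in sstables_per_read_mean}

# Hypothetical per-table values of the two base metrics.
mean = {"users_lastaction": 3.1, "activity_records": 2.8, "documents": 3.4}
rate = {"users_lastaction": 450.0, "activity_records": 520.0, "documents": 40.0}

for table, value in sorted(sstables_per_second(mean, rate).items(),
                           key=lambda kv: kv[1], reverse=True):
    print(f"{table}: {value:.0f} SSTables/sec")

Sorting by the derived value is what surfaces the handful of leading tables.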
SSTablesPerSecond
• 7 leading tables = only 7 candidates for deep investigation
• Large difference between leaders and others
• Almost all leaders were surprises
• 3 types of problems
SSTablesPerSecond: results
Problem 1: Invalid timestamp usage
CREATE TABLE users_lastaction (
  user_id uuid,
  subsystem text,
  last_action_time timestamp,
  PRIMARY KEY (user_id, subsystem)
);
subsystem: 'API', 'WebApplication', …
Problem 1: Invalid timestamp usage
First subsystem:
INSERT INTO users_lastaction (user_id, subsystem, last_action_time)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'API', '2011-02-03T04:05:00');
Second subsystem:
INSERT INTO users_lastaction (user_id, subsystem, last_action_time)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'WebApp', '2011-02-08T07:05:00')
USING TIMESTAMP 635774040762020710;
Time in ticks; 10,000 ticks = 1 millisecond
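To see why the mismatch hurts, compare the magnitude of the two timestamp sources; a small sketch (the tick value is the one from the slide, and ticks here are .NET-style 100-nanosecond units counted from 0001-01-01):

from datetime import datetime, timezone

# Coordinator-assigned write timestamp: microseconds since the Unix epoch.
micros = int(datetime(2011, 2, 3, 4, 5, tzinfo=timezone.utc).timestamp() * 1_000_000)

# Explicit USING TIMESTAMP from the slide, in ticks (10,000 ticks = 1 ms).
ticks = 635774040762020710

print(micros)          # roughly 1.3e15
print(ticks)           # roughly 6.4e17, hundreds of times larger
print(ticks > micros)  # True: every tick-stamped column looks "newer"

Because the tick-stamped columns carry such huge timestamps, the SSTable-by-timestamp filter (CASSANDRA-2498) cannot discard the SSTables that contain them, so reads of the microsecond-stamped columns touch far more SSTables than necessary.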
Problem 1: Invalid timestamp usage
SELECT last_action_time FROM users_lastaction
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
AND subsystem = 'API'
1. Looks at Memtable
2. Filters SSTables using bloom filter
3. Filters SSTables by timestamp (CASSANDRA-2498)
4. Reads remaining SSTables
5. Merges result
Problem 1: Invalid timestamp usage
Fix:
use the same timestamp source for every writer of a table
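For example, every writer of the table can rely on the same unit; a hedged sketch, not the team's actual code (the explicit timestamp below is just 2015-09-01 00:00:00 UTC expressed in microseconds since the Unix epoch):

-- Option 1: omit USING TIMESTAMP everywhere and let the coordinator
-- assign microseconds since the Unix epoch for both subsystems.
INSERT INTO users_lastaction (user_id, subsystem, last_action_time)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'API', '2011-02-03T04:05:00');

-- Option 2: keep an explicit timestamp, but generate it in the same
-- microsecond unit in every subsystem.
INSERT INTO users_lastaction (user_id, subsystem, last_action_time)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'WebApp', '2011-02-08T07:05:00')
USING TIMESTAMP 1441065600000000;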
Problem 2: Few writes, many reads
• Reads dominate writes (example: user accounts)
• Each read hits an SSTable (the Memtable has already been flushed)
• Fix: just enabled row cache
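A sketch of the kind of change this implies, assuming a Cassandra 2.0-era table named user_accounts (hypothetical name); note that the row cache also needs capacity via row_cache_size_in_mb in cassandra.yaml, which defaults to 0 (disabled):

-- Cassandra 2.0 syntax; on 2.1+ caching is a map instead,
-- e.g. caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}.
ALTER TABLE user_accounts WITH caching = 'all';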
Problem 3: Aggressive time series
CREATE TABLE activity_records(
time_bucket text,
record_time timestamp,
record_content text,
PRIMARY KEY (time_bucket, record_time)
);
SELECT record_content FROM activity_records
WHERE time_bucket = '2015-05-10 12:00:00'
AND record_time > '2015-05-10 12:30:10'
Problem 3: Aggressive time series
SELECT record_content FROM activity_records
WHERE time_bucket = '2015-05-10 12:00:00'
AND record_time > '2015-05-10 12:30:10'
1. Looks at Memtable
2. Filters SSTables using bloom filter
3. Can't use CASSANDRA-2498
4. CASSANDRA-5514!
5. Reads remaining SSTables
6. Merges result
Problem 3: Aggressive time series
Fix: just upgraded to Cassandra 2.0+
SSTablesPerSecond: before
SSTablesPerSecond: after
Before:
• Client read latency (99th percentile): 100ms – 2s
• CPU: 40% – 80%
After:
• Client read latency (99th percentile): 50ms – 200ms
• CPU: 20% – 50%
WHAT ABOUT OUR GOAL?
• Time reading SSTables vs. time reading the Memtable – 50/50
• Slice queries – 70% of time
PROFILE AGAIN
• LiveScannedHistogram
how many live columns node scans per slice query
• TombstonesScannedHistogram
how many tombstones node scans per slice query
• No anomalies found
• Why not reuse the trick that worked before?
LOOK AT METRICS AGAIN
LiveScannedPerSecond
how many live columns Cassandra scans per second for each table
LiveScannedPerSecond = LiveScannedHistogram.Mean * ReadLatency.OneMinuteRate
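The same helper from the SSTablesPerSecond sketch works unchanged as a generic mean-times-rate combinator; only the inputs differ (hypothetical numbers again):

# Reuses sstables_per_second() from the earlier sketch.
live_mean = {"background_stats": 900.0, "activity_records": 40.0}
read_rate = {"background_stats": 300.0, "activity_records": 520.0}
print(sstables_per_second(live_mean, read_rate))  # background_stats dominates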
• 1 obvious leader
• Large difference between leader and others
• Leader – big surprise
• Fix: fixed the bug
LiveScannedPerSecond: results
Initial:
• Client read latency (99th percentile): 100ms – 2.0s
• CPU: 40% – 80%
After SSTablesPerSecond fixes:
• Client read latency (99th percentile): 50ms – 200ms
• CPU: 20% – 50%
After LiveScannedPerSecond fixes:
• Client read latency (99th percentile): 30ms – 100ms
• CPU: 10% – 30%
WHAT ABOUT OUR GOAL?
Compaction – 30%
Fix:
throttled compaction down during high-load periods,
throttled it up during low-load periods
PROFILE AGAIN
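A minimal sketch of that kind of schedule, assuming nodetool is on the PATH; the hours and MB/s values are illustrative, not the ones actually used:

import subprocess
from datetime import datetime

BUSY_HOURS = range(8, 20)            # assumption: daytime is the high-load period
LOW_MB_PER_SEC, HIGH_MB_PER_SEC = 8, 64

def adjust_compaction_throughput(now=None):
    hour = (now or datetime.now()).hour
    value = LOW_MB_PER_SEC if hour in BUSY_HOURS else HIGH_MB_PER_SEC
    # nodetool setcompactionthroughput <MB/s> changes the limit at runtime.
    subprocess.check_call(["nodetool", "setcompactionthroughput", str(value)])

if __name__ == "__main__":
    adjust_compaction_throughput()   # e.g. run hourly from cron on each node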
WHAT ABOUT OUR GOAL?
Initial:
• Client read latency (99th percentile): 100ms – 2.0s
• CPU: 40% – 80%
After LiveScannedPerSecond fixes:
• Client read latency (99th percentile): 30ms – 100ms
• CPU: 10% – 30%
After Compaction fixes:
• Client read latency (99th percentile): 10ms – 50ms
• CPU: 5% – 25%
WHAT ABOUT OUR GOAL?
• TombstonesScannedPerSecond
• KeyCacheMissesPerSecond
• …
MORE METRICS!
Initial:
• Client read latency (99th percentile): 100ms – 2.0s
• CPU: 40% – 80%
After all fixes:
• Client read latency (99th percentile): 5ms – 25ms (about 50 times lower on average!)
• CPU: 5% – 15% (about 7 times lower on average)
THANK YOU
Extra: The effect of the slow queries
(graphs: pending tasks vs. concurrent_reads)
Editor's Notes
  1. Hello everybody. My talk today, “Digging Cassandra Cluster”, is about how to find the most problematic tables in a Cassandra cluster. “Most problematic” means tables that create problems not only for themselves but for other tables too.
  2. Before we start, let me give a short overview of who I am and of my company. My name is Ivan Burmistrov. I am the tech lead of one of the teams at SKB Kontur. I started using Cassandra in production about 5 years ago, from version 0.7.
  3. SKB Kontur is a Russian company that focuses on services for businesses. In particular: “business to business” services, for example electronic invoices between trading partners, and “business to government” services, for example electronic reporting of tax returns to the government. So the core part of almost all our services is the storage and delivery of critical electronic documents.
  4. In particular, my team focuses on electronic document interchange between retailers and their suppliers. The example retailer on the slide is Auchan; Auchan, for those who don't know, is a big European chain store like Walmart. The suppliers are companies like Procter & Gamble, Nestle and Coca-Cola. Every day retailers order products from suppliers through our service and expect these products to be in shops on time.
  5. So, our service is business critical because it is on the critical path of delivering products to customers in shops. Our clients expect constant availability from us and a guarantee that electronic documents are delivered. Because of these reliability requirements we chose Cassandra. What about the document delivery time? This part of the requirements is pretty soft: we just need to guarantee that delivery time does not exceed one minute. One minute!? Sounds like “infinite”, doesn't it? Yes, we thought so as well, because we built our service on Cassandra and it just worked, and of course we met this delivery time requirement without any difficulties.
  7. So, we were relaxed and careless, and we added, added, added features to our service. As a result, we added, added, added tables to the cluster. We didn't think about performance.
  10. This guy thought about performance for us. But one day the happy days ended: we started to violate the soft delivery time requirement, and we found our cluster in bad condition.
  11. There were more than 150 different tables in the cluster. “Different” means different usage patterns, different data models, different numbers of requests... Client read latency ranged from 100 milliseconds to 2 seconds and affected almost all tables. On the slide you can see a plot of the client read latency, 99th percentile. The problem was that with such high read latency almost all subsystems became slow and unacceptable. When we looked at the basic indicators of the machines running Cassandra, we found that CPU utilization was pretty high: from 40 percent to 80 percent. At the same time, the disk indicators were absolutely normal, so we concluded that we had an issue with CPU. We started the investigation and tried to understand the reason for the terrible latency.
  12. If you find any presentation or article about understanding problems in Cassandra, it will almost certainly contain the words “Look at the metrics, search for anomalies”. And of course we followed this sensible recommendation. On the slide you can see some interesting metrics. But these metrics (and all the other metrics) didn't help us. Because of the high CPU utilization we saw bad read latency values across almost all of our tables. For other metrics, in particular SSTablesPerReadHistogram, we saw many pretty similar tables. For example, for 65 tables Cassandra processed 3 SSTables per read on average. Were these good or bad values? We didn't know. The situation was the same for almost all other metrics. So we failed to find the bad tables. Or perhaps we were in a horrible situation and didn't have bad tables at all? We didn't know. So we started to check other hypotheses. (Note to self: I mention many numbers that are not shown on the slide.)
  14. OK, what other processes in Cassandra can cause high CPU usage? We suspected compaction. To check this hypothesis we tried to decrease the compaction throughput, increase it, and change the compaction strategy. Nothing changed: absolutely no change in either latency or CPU utilization.
  16. OK, perhaps GC is our enemy? We observed high GC activity: each node spent about 6 seconds per minute in GC. We read many good articles about how to tune GC, and many success stories about tuning GC for Cassandra (some links are on the slide). We tried to follow the recommendations from these articles. But nothing changed. What did we do wrong? Why did nothing help? We stopped checking hypotheses blindly. We needed more information about what our cluster was actually doing.
  18. To understand it, we chose Java Mission Control and Java Flight Recorder, the built-in profiling tools from Oracle JDK 7 Update 40. They have low overhead, from 1 to 2 percent, and are useful for CPU profiling: they contain reports about hot threads, hot methods and call stacks. We launched one of the nodes under the profiler, and the result was the following: 70 percent of the time was spent in SSTableReader.
  19. OK, so the key question was: “which tables cause most of the SSTable reads?”. As we remember, SSTablesPerReadHistogram did not help us; we needed another metric. We decided that a SSTablesPerSecond metric (that is, how many SSTables each table reads per second) could help us. Fortunately we can calculate this metric approximately by multiplying the SSTablesPerReadHistogram.Mean and ReadLatency.OneMinuteRate metrics. We did it; let's look at a graph of this metric.
  20. Each line on this graph shows the values of the SSTablesPerSecond metric for one table. This is a screenshot of the real graph we got at the time, except for one detail: here we see one obvious leader that dominates, but we actually found 7 such leaders. I hid the other leaders to demonstrate the difference between them and the other tables.
  21. So, the results of analyzing this metric were the following. We found 7 leading tables, so only 7 candidates for deep investigation. The difference between the leaders and the others was huge, so we hoped that fixing these tables could have a positive effect on the entire cluster. And one interesting remark: all the leaders were surprises for us. That means we did not expect these tables to cause problems for the cluster. Moreover, the tables were leaders neither in read rate nor in the SSTablesPerRead metric; they were near the middle in both, but after multiplication they became the leaders, as we saw. Once we had found the 7 candidates, we started to investigate why each of these tables read so many SSTables, and we found 3 types of problems within them.
  22. The first type is “invalid timestamp usage”. We had a table where, for each user, we stored the last activity time within each of our subsystems: for example, the last activity in the web application, or in the API, or any other subsystem. Each subsystem wrote into this table independently, so they did not intersect in the stored data.
  23. Let's see an example of two subsystems writing data into this table. The first subsystem doesn't use the USING TIMESTAMP instruction, but the second one does, and it uses the current time in ticks as the timestamp. As we know, if we don't use the USING TIMESTAMP instruction, the Cassandra coordinator calculates it for us as the current time in microseconds. So the second subsystem uses timestamp values that are far bigger than the ones the first subsystem gets (ticks are 100-nanosecond units, counted from a much earlier epoch). Exactly the same situation existed in our system. But these subsystems don't intersect in the stored data, so it seems like it is not a very serious bug. Yet we had problems. Why?
  24. To understand it, let's remember how Cassandra processes read requests. Assume we have a read query like this one, with a Memtable and several SSTables. First, Cassandra looks at the Memtable; assume the column was found there. After that, Cassandra filters the SSTables using the bloom filter. Then Cassandra filters the SSTables according to timestamps: it drops from consideration any SSTable whose maximum timestamp is lower than the timestamp of the column read from the Memtable. This great optimization was implemented in the CASSANDRA-2498 ticket. In the last step Cassandra reads the column from the remaining SSTables and returns the most recent version. OK, let's go back to our example.
  30. All the columns written by the API subsystem have lower timestamps than the columns written by the web application subsystem. So the timestamp optimization doesn't work when we read columns for the API subsystem, and therefore we read many more SSTables than actually needed. (Note to self: use one naming scheme, either first/second or API/WebApp.)
  31. The fix for the problem was very simple: we just started to use the same timestamp source for all subsystems.
  32. The second problem was with tables for which the number of reads greatly exceeded the number of writes; an example of such a table is user accounts. For these tables each read hit an SSTable, because the Memtable containing the requested column had almost certainly already been flushed. To fix the problem we just enabled the row cache for these tables.
  34. The third problem was aggressive time series data. Processing time series data is a common Cassandra usage pattern, and we use it for storing and analyzing various activities in our system: we store activity records and analyze them in the background. But we need a very fast reaction to some types of activity anomalies, so the background services quite aggressively poll the time series table for the most recent data, using a query like the example on the slide. Why is this a problem? To understand it, remember again how Cassandra processes read requests.
  35. Again, Cassandra first looks at the Memtable, then filters the SSTables using the bloom filter. Unlike a single-column read, Cassandra can't use the timestamp optimization here, because for a slice query some columns may exist only in SSTables. But in CASSANDRA-5514 another great optimization was implemented: Cassandra tracks the min and max clustering values per SSTable and filters out SSTables that definitely don't intersect the request. That is, for our example, SSTables with obviously smaller record_time values will be filtered out. After that, Cassandra finally reads the columns from the remaining SSTables and merges the result.
  42. So, the fix was simple: we just upgraded to Cassandra 2.0.
  43. Let's remember the graph of the SSTablesPerSecond metric before our fixes.
  44. And here is the graph after our fixes. An impressive difference.
  45. What about our main goal? Both latency and CPU were reduced; the comparison is on the slide. A good result, but not excellent. We wanted a better result, to be on the safe side.
  46. So, we profiled a Cassandra node again. The time spent reading SSTables became equal to the time spent reading the Memtable, and 70 percent of the time was spent processing slice queries. So the question was: which tables generated this high activity?
  47. To answer it, we looked at the metrics again. On the slide there are two useful metrics for analyzing slice query activity. LiveScannedHistogram: a histogram of the number of live columns scanned per slice query. TombstonesScannedHistogram: a histogram of the number of tombstones scanned per slice query. But again, we did not find any anomalies in these metrics.
  50. We reused the successful trick and built the metric LiveScannedPerSecond: how many live columns Cassandra scans per second for each table. The formula and our graph are on the slide.
  51. The results of analyzing this metric were the following: one obvious leader, with a large difference between it and the others. And the leader was a big surprise: it was a table we used for calculating unimportant background statistics, and because of a bug we scanned bigger slices than needed. So we simply fixed the bug.
  53. What about our main goal? Both latency and CPU were reduced; the comparison is on the slide. A good result, but still not enough for us.
  54. We profiled a node again and found that it spent 30% of its time in compaction. To fix it we just throttled compaction down during high-load periods and throttled it up during low-load periods.
  56. Let’s look at the client latency difference before and after all these fixes.
  57. And here are the results in numbers: we reduced the upper limit of latency from 2 seconds to 50 milliseconds.
  58. There are many metrics that can be built from the base Cassandra metrics, for example: TombstonesScannedPerSecond can answer the question “does the cluster waste time scanning many tombstones?”, and KeyCacheMissesPerSecond can answer the question “do some tables in the cluster have problems with the key cache?”. We made a few small fixes based on the information from these metrics; I don't have enough time to describe them in detail. But after all these fixes we got the following results: about 50 times less on average! It was unbelievable for us, because we did not do anything extraordinary. Almost all the fixes were pretty simple, and some of them did not require coding at all, only settings tuning. And we still use these metrics not only to investigate problems but also to check our data model assumptions in production. For example, if we expect that some table will scan very few tombstones, a simple look at the TombstonesScannedPerSecond metric can tell whether that is true. To wrap up, I recommend trying to calculate these metrics on your cluster. You will probably get some surprises, like we did.
  60. Thank you!