SlideShare a Scribd company logo
1 of 21
Download to read offline
Apache Cassandra Resilience
“If anything just cannot go wrong, it will anyway.” - Murphy Law
Carlos Rolo - Cassandra Consultant
April 2015
About me
●  Pythian Cassandra Consultant since January 2015
●  Working with Cassandra since 2011
●  Working with distributed systems since 2010
●  History:
●  Pythian
●  Leaseweb CDN
●  Portugal Telecom
●  DRI
●  Twitter: @cjrolo
2
About Pythian
Remote data management services
Three main offerings
Managed Services
Consulting
DevOps
Engaged by IT and Operations executives
looking to address skills or resourcing gaps
18 years in business
Growing at 30+% per year
400+ employees
250+ customers worldwide
3
Oracle Platinum Partner
Microsoft Gold Data Platform Partner
Microsoft Silver Cloud Competency Partner
AWS Advanced Consulting Partner
Datastax Gold Partner
Top 5% industry expertise
10 Oracle ACEs
2 Oracle ACE Directors
1 Oracle ACE Associates
6 Microsoft MVPs
2 Microsoft MCMs
1 Cloudera Champion of Big Data
1 DataStax Platinum Certified Consultant
Agenda
●  The Setup
●  Guaranteeing Reliability
●  Testing
●  The (BIG) problem
●  Recovering
●  Lessons Learned
●  Q&A
4
The Setup
●  15 Servers
a.  Bare metal, no virtualization
●  5 Datacenters
a.  Geographic Distant
b.  EU 1, EU 2, US East, US West, APAC1
c.  First 2 DC deployed with Cassandra 0.8!
●  Spinning disks
a.  RAID 1
●  RF 1 or 3 depending on Data
5
The Clients
●  Thrift
●  HUGE write volume, Minimal read volume
a.  Reads could spike, still won’t match writes
●  Write Consistency = ONE
●  Read Consistency = QUORUM
●  The anti-patterns…
a.  Counters
b.  Batches
c.  Deletes
6
Guaranteeing Reliability
●  System was new, Cassandra was new…
●  ... but the data was needed!
a.  HDD Redundancy, RAID 1
b.  Server Redundancy (RF=3)
c.  Datacenter redundancy (Base was 2)
d.  Several fallbacks programmed into the clients
e.  Raw data backups to different location
f.  And more!
●  Something has to go REALLY wrong to make this unavailable
7
Testing
●  First test was accidental…
a.  Lost link between DC’s
b.  No issues… with Cassandra!
●  After this...
8
Testing (2)
●  Ok, now something serious!
●  Use upgrades, maintenance windows, etc to test reliability
●  The Good:
a.  Never had downtime
b.  App latency was affected, but still responsive
c.  Client fallbacks where perfect
9
Testing (3)
●  The Bad:
a.  Commitlog replays
b.  Repairs
c.  Some I/O pressure
d.  Write load + compaction = Run out of HDD
●  Learn and Prepare:
a.  “Fix” counter drift - Tricky but worked
b.  Repairs… We have to live with them!
c.  “Tweak” commitlogs settings
d.  Some changes to the clients
10
The (BIG) Problem
●  Too much confidence!
●  Traffic was flowing nice!
11
But...
●  The service was growing, the cluster was not.
●  The cluster also had more work to do
a.  Map-Reduce and the likes
●  But was still some room available, so, all fine!
“If everything seems to be going well, you have obviously overlooked something” - Murphy Law
12
What about upgrading?
●  Sometimes “add a node” is not that easy
a.  No vnodes…
b.  Geographic distributed is not all roses,
c.  You actually need physical room in the rack,
d.  Other constraints exist
i.  Power
ii.  Budget
iii.  Network Ports/Capacity
iv.  etc... 13
What Happened?
●  5 DC’s, 2 and 1/2 Down...
●  System had a huge upgrade in capacity… not Cassandra
●  Huge volume of incoming writes + Half of the cluster down…
●  No longer a simple situation!
14
Brace for impact!
●  Assess the situation, define priorities…
●  First try:
a.  Keep everything alive
b.  Didn’t work…
i.  Massive OOM happening all-around the remaining nodes
ii.  Huge GC spikes
iii.  Remember the compaction + writes? Yup, running out of HDD
space
15
Impact!
●  Clients where getting time-outs, traffic rerouted, heavy loaded servers got
even more load…
●  But…
a.  Some other fallbacks kicked in! Reducing load,
b.  Always 2 servers up in each (surviving) DC, writes still getting in,
c.  Read clients were highly tolerant to latency, so reads still being
delivered
16
Everything counts!
●  Delay writes to the HDD (dirty_ratio)!
●  Tweak compaction!
●  Delete commitlogs before restart!
a.  But sync the commitlog!
●  Disable read_repair
●  Reads… CL=1
●  Non-Critical data… RF=1
●  Map-Reduces were put on a hold
●  Throttle traffic with IPTables
17
Recovering
●  Once Datacenters start going online, clients redirected.
●  Remove any “magic” inserted in the process
a.  Dirty_ratio, IPtables, etc...
●  Enable read_repair
●  Set Read back to QUORUM
●  Disable compaction throttling
a.  Until SSTable count is nice
●  Repair
18
Lessons Learned
●  Cassandra is highly reliable! As long as 1 server was up we knew that data
would be there.
●  It is (somewhat) predictable during failure.
●  Lots of tweaks can be made during emergencies
●  Use SSDs no matter what anybody tells you!
●  Swap can save data!
a.  But keep swappiness low enough!
19
Q&A
●  Thanks for listening!
●  Questions?
20
Cassandra Day London 2015: The Resilience of Apache Cassandra

More Related Content

What's hot

MySQL 高可用性
MySQL 高可用性MySQL 高可用性
MySQL 高可用性YUCHENG HU
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaScyllaDB
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintScyllaDB
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...ScyllaDB
 
About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014Michal Harish
 
moabcon2012 - Transitioning from Grid Engine
moabcon2012 - Transitioning from Grid Enginemoabcon2012 - Transitioning from Grid Engine
moabcon2012 - Transitioning from Grid EngineFrédérick Lefebvre
 
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
High-Load Storage of Users’ Actions with ScyllaDB and HDDsHigh-Load Storage of Users’ Actions with ScyllaDB and HDDs
High-Load Storage of Users’ Actions with ScyllaDB and HDDsScyllaDB
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...ScyllaDB
 
Reactive mistakes - ScalaDays Chicago 2017
Reactive mistakes -  ScalaDays Chicago 2017Reactive mistakes -  ScalaDays Chicago 2017
Reactive mistakes - ScalaDays Chicago 2017Petr Zapletal
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScyllaDB
 
Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling
 Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling
Scylla Summit 2022: IO Scheduling & NVMe Disk ModellingScyllaDB
 
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.Raghavendra Prabhu
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with KubernetesRaghavendra Prabhu
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla ClusterScyllaDB
 
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming TransactionsScylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming TransactionsScyllaDB
 
Maksims Greckis - Trace File Analyzer
Maksims Greckis - Trace File Analyzer  Maksims Greckis - Trace File Analyzer
Maksims Greckis - Trace File Analyzer Andrejs Vorobjovs
 
Free & Open DynamoDB API for Everyone
Free & Open DynamoDB API for EveryoneFree & Open DynamoDB API for Everyone
Free & Open DynamoDB API for EveryoneScyllaDB
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015neerajrj
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nycPetr Zapletal
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDavid van Geest
 

What's hot (20)

MySQL 高可用性
MySQL 高可用性MySQL 高可用性
MySQL 高可用性
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
 
How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter Footprint
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014About VisualDNA Architecture @ Rubyslava 2014
About VisualDNA Architecture @ Rubyslava 2014
 
moabcon2012 - Transitioning from Grid Engine
moabcon2012 - Transitioning from Grid Enginemoabcon2012 - Transitioning from Grid Engine
moabcon2012 - Transitioning from Grid Engine
 
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
High-Load Storage of Users’ Actions with ScyllaDB and HDDsHigh-Load Storage of Users’ Actions with ScyllaDB and HDDs
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
 
Reactive mistakes - ScalaDays Chicago 2017
Reactive mistakes -  ScalaDays Chicago 2017Reactive mistakes -  ScalaDays Chicago 2017
Reactive mistakes - ScalaDays Chicago 2017
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling
 Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling
Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling
 
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.
Clusternaut: Orchestrating Percona XtraDB Cluster with Kubernetes.
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
 
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming TransactionsScylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
 
Maksims Greckis - Trace File Analyzer
Maksims Greckis - Trace File Analyzer  Maksims Greckis - Trace File Analyzer
Maksims Greckis - Trace File Analyzer
 
Free & Open DynamoDB API for Everyone
Free & Open DynamoDB API for EveryoneFree & Open DynamoDB API for Everyone
Free & Open DynamoDB API for Everyone
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nyc
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and Cassandra
 

Similar to Cassandra Day London 2015: The Resilience of Apache Cassandra

kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahujacamunda services GmbH
 
Altitude SF 2017: Reddit - How we built and scaled r/place
Altitude SF 2017: Reddit - How we built and scaled r/placeAltitude SF 2017: Reddit - How we built and scaled r/place
Altitude SF 2017: Reddit - How we built and scaled r/placeFastly
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Tapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionTapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionWeston Jossey
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAnthony Scata
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Mad scalability: Scaling when you are not Google
Mad scalability: Scaling when you are not GoogleMad scalability: Scaling when you are not Google
Mad scalability: Scaling when you are not GoogleAbel Muíño
 
Building a Small DC
Building a Small DCBuilding a Small DC
Building a Small DCAPNIC
 
On component interface
On component interfaceOn component interface
On component interfaceLaurence Chen
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Building a Small Datacenter
Building a Small DatacenterBuilding a Small Datacenter
Building a Small Datacenterssuser4b98f0
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIURohit Jnagal
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Noam Elfanbaum
 

Similar to Cassandra Day London 2015: The Resilience of Apache Cassandra (20)

kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahuja
 
Altitude SF 2017: Reddit - How we built and scaled r/place
Altitude SF 2017: Reddit - How we built and scaled r/placeAltitude SF 2017: Reddit - How we built and scaled r/place
Altitude SF 2017: Reddit - How we built and scaled r/place
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Tapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout SessionTapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout Session
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Mad scalability: Scaling when you are not Google
Mad scalability: Scaling when you are not GoogleMad scalability: Scaling when you are not Google
Mad scalability: Scaling when you are not Google
 
Building a Small DC
Building a Small DCBuilding a Small DC
Building a Small DC
 
On component interface
On component interfaceOn component interface
On component interface
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Building a Small Datacenter
Building a Small DatacenterBuilding a Small Datacenter
Building a Small Datacenter
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Cassandra Day London 2015: The Resilience of Apache Cassandra

  • 1. Apache Cassandra Resilience “If anything just cannot go wrong, it will anyway.” - Murphy Law Carlos Rolo - Cassandra Consultant April 2015
  • 2. About me ●  Pythian Cassandra Consultant since January 2015 ●  Working with Cassandra since 2011 ●  Working with distributed systems since 2010 ●  History: ●  Pythian ●  Leaseweb CDN ●  Portugal Telecom ●  DRI ●  Twitter: @cjrolo 2
  • 3. About Pythian Remote data management services Three main offerings Managed Services Consulting DevOps Engaged by IT and Operations executives looking to address skills or resourcing gaps 18 years in business Growing at 30+% per year 400+ employees 250+ customers worldwide 3 Oracle Platinum Partner Microsoft Gold Data Platform Partner Microsoft Silver Cloud Competency Partner AWS Advanced Consulting Partner Datastax Gold Partner Top 5% industry expertise 10 Oracle ACEs 2 Oracle ACE Directors 1 Oracle ACE Associates 6 Microsoft MVPs 2 Microsoft MCMs 1 Cloudera Champion of Big Data 1 DataStax Platinum Certified Consultant
  • 4. Agenda ●  The Setup ●  Guaranteeing Reliability ●  Testing ●  The (BIG) problem ●  Recovering ●  Lessons Learned ●  Q&A 4
  • 5. The Setup ●  15 Servers a.  Bare metal, no virtualization ●  5 Datacenters a.  Geographic Distant b.  EU 1, EU 2, US East, US West, APAC1 c.  First 2 DC deployed with Cassandra 0.8! ●  Spinning disks a.  RAID 1 ●  RF 1 or 3 depending on Data 5
  • 6. The Clients ●  Thrift ●  HUGE write volume, Minimal read volume a.  Reads could spike, still won’t match writes ●  Write Consistency = ONE ●  Read Consistency = QUORUM ●  The anti-patterns… a.  Counters b.  Batches c.  Deletes 6
  • 7. Guaranteeing Reliability ●  System was new, Cassandra was new… ●  ... but the data was needed! a.  HDD Redundancy, RAID 1 b.  Server Redundancy (RF=3) c.  Datacenter redundancy (Base was 2) d.  Several fallbacks programmed into the clients e.  Raw data backups to different location f.  And more! ●  Something has to go REALLY wrong to make this unavailable 7
  • 8. Testing ●  First test was accidental… a.  Lost link between DC’s b.  No issues… with Cassandra! ●  After this... 8
  • 9. Testing (2) ●  Ok, now something serious! ●  Use upgrades, maintenance windows, etc to test reliability ●  The Good: a.  Never had downtime b.  App latency was affected, but still responsive c.  Client fallbacks where perfect 9
  • 10. Testing (3) ●  The Bad: a.  Commitlog replays b.  Repairs c.  Some I/O pressure d.  Write load + compaction = Run out of HDD ●  Learn and Prepare: a.  “Fix” counter drift - Tricky but worked b.  Repairs… We have to live with them! c.  “Tweak” commitlogs settings d.  Some changes to the clients 10
  • 11. The (BIG) Problem ●  Too much confidence! ●  Traffic was flowing nice! 11
  • 12. But... ●  The service was growing, the cluster was not. ●  The cluster also had more work to do a.  Map-Reduce and the likes ●  But was still some room available, so, all fine! “If everything seems to be going well, you have obviously overlooked something” - Murphy Law 12
  • 13. What about upgrading? ●  Sometimes “add a node” is not that easy a.  No vnodes… b.  Geographic distributed is not all roses, c.  You actually need physical room in the rack, d.  Other constraints exist i.  Power ii.  Budget iii.  Network Ports/Capacity iv.  etc... 13
  • 14. What Happened? ●  5 DC’s, 2 and 1/2 Down... ●  System had a huge upgrade in capacity… not Cassandra ●  Huge volume of incoming writes + Half of the cluster down… ●  No longer a simple situation! 14
  • 15. Brace for impact! ●  Assess the situation, define priorities… ●  First try: a.  Keep everything alive b.  Didn’t work… i.  Massive OOM happening all-around the remaining nodes ii.  Huge GC spikes iii.  Remember the compaction + writes? Yup, running out of HDD space 15
  • 16. Impact! ●  Clients where getting time-outs, traffic rerouted, heavy loaded servers got even more load… ●  But… a.  Some other fallbacks kicked in! Reducing load, b.  Always 2 servers up in each (surviving) DC, writes still getting in, c.  Read clients were highly tolerant to latency, so reads still being delivered 16
  • 17. Everything counts! ●  Delay writes to the HDD (dirty_ratio)! ●  Tweak compaction! ●  Delete commitlogs before restart! a.  But sync the commitlog! ●  Disable read_repair ●  Reads… CL=1 ●  Non-Critical data… RF=1 ●  Map-Reduces were put on a hold ●  Throttle traffic with IPTables 17
  • 18. Recovering ●  Once Datacenters start going online, clients redirected. ●  Remove any “magic” inserted in the process a.  Dirty_ratio, IPtables, etc... ●  Enable read_repair ●  Set Read back to QUORUM ●  Disable compaction throttling a.  Until SSTable count is nice ●  Repair 18
  • 19. Lessons Learned ●  Cassandra is highly reliable! As long as 1 server was up we knew that data would be there. ●  It is (somewhat) predictable during failure. ●  Lots of tweaks can be made during emergencies ●  Use SSDs no matter what anybody tells you! ●  Swap can save data! a.  But keep swappiness low enough! 19
  • 20. Q&A ●  Thanks for listening! ●  Questions? 20