C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo DataStax Academy
Speaker: Dave Gardner, Architect at Hailo
Video: http://www.youtube.com/watch?v=6cUuE7sTdU0&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=16
Hailo has leveraged Cassandra to build one of the most successful startups in European history. This presentation looks at how Hailo grew from a simple MySQL-backed infrastructure to a resilient Cassandra-backed system running in three data centres globally. Topics covered include: the process of migration, experience running multi-DC on AWS, common data modeling patterns and security implications for achieving PCI compliance.
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...DataStax Academy
Hailo has leveraged Cassandra to build one of the most successful startups in European history. This presentation looks at how Hailo grew from a simple MySQL-backed infrastructure to a resilient Cassandra-backed system running in three data centers globally. Topics covered include: the process of migration, experience running multi-DC on AWS, common data modeling patterns and security implications for achieving PCI compliance.
Dynamic Scaling at Pinterest. Large Scale Production Engineering meetup, Feb...ArenSand
The document discusses Pinterest's use of dynamic scaling to adjust the number of machines provisioned based on traffic patterns. It scales machines up and down throughout the day and night based on a scheduled scale factor. This allows using cheaper spot instances when possible. Dynamic scaling is automated using tools that configure, monitor, and adjust machines. While it helps handle traffic fluctuations, there are also risks like unknown machines, failures in auto-discovery services, and upstream impacts if over-provisioning.
Case Study: Troubleshooting Cassandra performance issues as a developerCarlos Alonso Pérez
This talk will be a step by step walkthrough of a developer troubleshooting a real performance issue we had at MyDrive, from the very first steps diagnosing the symptoms, through looking at metric charts down to CQL queries, the Ruby CQL driver, and Ruby code profiling.
These are the slides from the intensive Cassandra Workshop I held in Madrid as a Meetup: http://www.meetup.com/Madrid-Cassandra-Users/events/225944063/ They cover all the Cassandra core concepts, plus the basic data modelling ones needed to get up and running with Cassandra.
Tokyo Cassandra Summit 2014: Tunable Consistency by Al TobeyDataStax Academy
This document discusses strategies for avoiding read-modify-write operations in Cassandra databases. It presents several Cassandra features that allow updating data without explicit read-modify-writes, such as overwriting rows, using collections, and lightweight transactions. It also covers data modeling techniques like journaling, content-addressable storage, and modeling time-series data. The document concludes that Cassandra is well-suited for write-heavy workloads and provides tools to safely perform read-modify-writes when necessary.
Cassandra Day SV 2014: Beyond Read-Modify-Write with Apache CassandraDataStax Academy
This document discusses strategies for updating data in Apache Cassandra beyond using read-modify-write operations. It describes how eventual consistency allows safe updates without locking by propagating changes asynchronously. It also covers Cassandra features like collections, lightweight transactions, and content-addressable storage that provide flexible data models for modern web-scale applications while avoiding the need for read-modify-write in many cases.
C* Summit 2013: Hardware Agnostic - Cassandra on Raspberry Pi by Andy CobleyDataStax Academy
The Raspberry Pi is a credit-card sized $25 ARM-based Linux box designed to teach children the basics of programming. The machine comes with a 700MHz ARM CPU and 512MB of memory and boots off an SD card, not much power for running the likes of a Cassandra cluster. This presentation will discuss the problems of getting Cassandra up and running on the Pi and will answer the all-important question: Why on Earth would you want to do this!?
The document discusses the pros and cons of using public cloud computing services versus hosting infrastructure internally for a new startup. Some advantages mentioned include flexibility, avoiding large upfront capital expenditures, and the ability to scale resources up and down as needed. Disadvantages include public cloud services becoming more expensive than internal hosting at large scale, inefficient resource ratios for some workloads, and high costs for intensive disk and SSD usage. The document aims to provide considerations for a startup evaluating whether to use public cloud services.
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...DataStax
Instaclustr provides managed Apache Cassandra and DataStax Enterprise clusters in the cloud. They initially ran Cassandra on custom Ubuntu images but moved to CoreOS for its immutable and self-updating capabilities. Using Docker and CoreOS together allows Cassandra to run in immutable Docker containers while CoreOS handles OS-level updates. Integrating Cassandra containers with the CoreOS and systemd init system provides reliable automatic restarts and the ability to notify when Cassandra is ready using dbus inter-process communication. This architecture provides a robust solution for running and updating Cassandra in production clusters.
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarDataStax Academy
We have seen rapid adoption of C* at eBay in the past two years. We have made tremendous efforts to integrate C* into existing database platforms, including Oracle, MySQL, Postgres, MongoDB, XMP etc. We have also scaled C* to meet business requirements and encountered technical challenges you only see at eBay scale: 100TB of data on hundreds of nodes. We will share our experience of deployment automation, management, monitoring and reporting for both Apache Cassandra and DataStax Enterprise.
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesRaghavendra Prabhu
This is a talk about orchestration of Cassandra with the Cassandra operator, Kubernetes and Yelp PaaSTA (https://github.com/Yelp/paasta).
The talk was presented at Computer Laboratory, University of Cambridge as part of the Engineering, Science and Technology Event (https://www.careers.cam.ac.uk/recruiting/event2Tech.asp) in November 2019.
This document provides an overview and comparison of Cassandra and Redis. Cassandra is an open-source NoSQL database that is optimized for high throughput and availability. It is commonly used by companies like Netflix, Apple, and Facebook. Redis is an open-source in-memory key-value store written in C. It is optimized for low latency and is commonly used for caching, sessions, queues, and analytics. Both databases are battle tested and have strengths in different areas - Cassandra favors availability over consistency while Redis operates entirely in memory for faster performance but with a single thread. The document discusses various features, use cases, and best practices for operating each database.
- Micro-batching involves grouping statements into small batches to improve throughput and reduce network overhead when writing to Cassandra.
- A benchmark was conducted to compare individual statements, regular batches, and partition-aware batches when inserting 1 million rows into Cassandra.
- The results showed that partition-aware batches had shorter runtime and lower client and cluster CPU usage, and were more performant overall compared to individual statements and regular batches. However, batching may add latency, which makes it better suited for bulk data processing than for real-time workloads.
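The partition-aware batching above can be sketched in a few lines. This is an illustrative Python sketch, not the benchmark's code; the row shape, the key_fn parameter and the batch size cap are assumptions. The idea is to group statements by partition key so every batch targets a single partition, then cap the batch size to keep batches small.

```python
from collections import defaultdict

def partition_aware_batches(rows, key_fn, max_batch_size=20):
    """Group rows by partition key, then split each group into
    micro-batches so every batch targets exactly one partition."""
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[key_fn(row)].append(row)
    batches = []
    for rows_for_key in by_partition.values():
        for i in range(0, len(rows_for_key), max_batch_size):
            batches.append(rows_for_key[i:i + max_batch_size])
    return batches

# Hypothetical rows keyed by sensor_id; each batch hits one partition.
rows = [{"sensor_id": i % 3, "value": i} for i in range(10)]
batches = partition_aware_batches(rows, key_fn=lambda r: r["sensor_id"],
                                  max_batch_size=2)
```

Each resulting batch could then be sent as one `BatchStatement`, avoiding the coordinator fan-out that makes multi-partition batches expensive.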
1) Ben Bromhead is the CTO of Instaclustr, which provides Cassandra-as-a-Service. When adding capacity to an existing Cassandra cluster, joining nodes normally bootstrap by streaming data from existing nodes, adding load.
2) "Bootstrap from backups" is proposed as a solution where joining nodes stream data directly from backups stored in object storage rather than existing cluster nodes, reducing load on the cluster.
3) This allows more reactive scaling with fewer side effects than typical predictive capacity planning approaches, and makes clusters more cost effective to run. The technique is currently in beta testing.
PowerPoint file (incl. animations!): http://db.tt/oQiXb9lq
These are the slides of the presentation "WordPress Optimization", presented at WordCamp 2013.
How to improve your WordPress performance and make your website more than 700% faster!
The document discusses Apache Cassandra, a distributed database management system designed to handle large amounts of data across many commodity servers. It was developed at Facebook and modeled after Google's Bigtable. The summary discusses key concepts like its use of consistent hashing to distribute data, support for tunable consistency levels, and focus on scalability and availability over traditional SQL features. It also provides an overview of how Cassandra differs from relational databases by not supporting joins, having an optional schema, and using a prematerialized and transaction-less model.
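The consistent hashing mentioned above can be illustrated with a toy ring. This Python sketch is a deliberate simplification (Cassandra itself uses the Murmur3 partitioner with virtual nodes; the node names, MD5 hash and vnode count here are purely illustrative): a key is owned by the first node token at or after the key's hash, wrapping around the ring.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stable hash of a key; real Cassandra uses the Murmur3 partitioner.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node token
    at or after the key's hash, wrapping at the end of the ring."""

    def __init__(self, nodes, vnodes=8):
        # vnodes spread each node over several tokens for smoother balance
        self.ring = sorted((token(f"{n}#{v}"), n)
                           for n in nodes for v in range(vnodes))
        self.tokens = [t for t, _ in self.ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("ride:12345")  # deterministic: same key, same node
```

Because only the tokens adjacent to a joining or leaving node change owners, adding capacity moves a small fraction of the data rather than rehashing everything.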
Ben Bromhead is the co-founder and CTO of Instaclustr, which provides Cassandra-as-a-Service. Instaclustr manages 50+ Cassandra nodes for customers. Early on, Instaclustr encountered issues like a Cassandra bug causing assertion errors for large column names and had to perform an emergency migration for a customer whose self-managed cluster was down for 48 hours. Migrations and real-world usage revealed new challenges compared to initial perfect test scenarios.
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...DataStax
Successfully running Apache Cassandra in production often means knowing what configuration settings to change and which ones to leave as default. Over the years the cassandra.yaml file has grown to provide a number of settings that can improve stability and performance. While the file contains plenty of helpful comments, there is more to be said about the settings and when to change them.
In this talk Edward Capriolo, Consultant at The Last Pickle, will break down the parameters in the configuration files: those that are essential to getting started, those that impact performance, those that improve availability, the exotic ones, and the ones that should not be played with. This talk is ideal for anyone from someone setting up Cassandra for the first time to people with deployments in production wondering what the more exotic configuration options do.
About the Speaker
Edward Capriolo Consultant, The Last Pickle
Long time Apache Cassandra user, big data enthusiast.
Highly available, scalable and secure data with Cassandra and DataStax Enterp...Johnny Miller
DataStax is a company that drives development of the Apache Cassandra database. It has over 400 customers including 24 Fortune 100 companies. DataStax Enterprise provides a highly available, scalable and secure database platform using Cassandra for mission critical applications. It supports analytics, search and multi-datacenter deployments across hybrid cloud environments.
C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Alb...DataStax Academy
Ooyala has been using Apache Cassandra since version 0.4. Our data ingest volume has exploded since 0.4 and Cassandra has scaled along with us. Al will cover many topics from an operational perspective on how to manage, tune, and scale Cassandra in a production environment.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS-backed disks, JBOD, token pinning and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
Unique ID generation in distributed systemsDave Gardner
The document discusses different strategies for generating unique IDs in a distributed system. It covers using auto-incrementing numeric IDs in MySQL, which are not resilient, and various solutions like UUIDs, Twitter Snowflake IDs, and Flickr ticket servers that generate IDs in a distributed and ordered way without coordination between data centers. It also provides code examples of generating Twitter Snowflake-like IDs in PHP without coordination using ZeroMQ.
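The Snowflake scheme mentioned above can be sketched briefly. This is an illustrative Python sketch rather than the talk's PHP/ZeroMQ code; the bit layout and epoch constant follow Twitter's published design, but all names here are assumptions. Each ID packs a millisecond timestamp, a worker id and a per-millisecond sequence, so workers mint roughly time-ordered IDs with no coordination.

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's published custom epoch; illustrative

class Snowflake:
    """Snowflake-style 64-bit IDs: 41 bits of milliseconds since a custom
    epoch, 10 bits of worker id, 12 bits of per-millisecond sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:        # clock went backwards: don't regress
                now = self.last_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:    # 4096 ids this millisecond: wait
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
a, b = gen.next_id(), gen.next_id()  # b > a: ids are time-ordered
```

Unlike MySQL auto-increment, this needs no single point of failure, and unlike random UUIDs the IDs sort roughly by creation time.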
The document discusses planning for failure when building software systems. It notes that as software projects grow larger with more engineers, complexity and the potential for failures increases. The author discusses how the taxi app Hailo has grown significantly and now uses a service-oriented architecture across multiple data centers to improve reliability. Key technologies discussed include Zookeeper, Elasticsearch, NSQ, and Cruftflake which provide distributed and resilient capabilities. The importance of testing failures through simulation is emphasized to improve reliability.
Cassandra, Modeling and Availability at AMUGMatthew Dennis
A brief high-level comparison of modeling between relational databases and Cassandra, followed by a brief description of how Cassandra achieves global availability.
Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.
The document discusses data modeling goals and examples for Cassandra. It provides guidance on keeping related data together on disk, avoiding normalization, and modeling time series data. Examples covered include mapping time series data points to Cassandra rows and columns, querying time slices, bucketing data, and eventually consistent transaction logging to provide atomicity. The document aims to help with common Cassandra modeling questions and patterns.
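The bucketing pattern mentioned above can be sketched as a partition-key helper. A minimal Python illustration, assuming a hypothetical sensor_id and daily/hourly buckets (the names and formats are assumptions, not from the talk): the partition key combines the entity id with a coarse time bucket, so a partition holds one bucket's worth of points and never grows unbounded.

```python
from datetime import datetime, timezone

def bucket_key(sensor_id: str, ts: datetime, bucket: str = "day") -> str:
    """Compose partition key = entity id + coarse time bucket, a common
    Cassandra time-series pattern for bounding partition size."""
    fmt = {"day": "%Y%m%d", "hour": "%Y%m%d%H"}[bucket]
    return f"{sensor_id}:{ts.strftime(fmt)}"

# Two readings on the same day land in the same partition ...
k1 = bucket_key("sensor-42", datetime(2013, 6, 11, 9, 30, tzinfo=timezone.utc))
k2 = bucket_key("sensor-42", datetime(2013, 6, 11, 17, 0, tzinfo=timezone.utc))
# ... so a "one day of data" query is a single-partition slice.
```

Querying a longer time range then means reading one partition per bucket, which keeps related data together on disk as the document recommends.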
Talk from CassandraSF 2012 showing the importance of real durability. Examples of use for row level isolation in Cassandra and the implementation of a transaction log pattern. The example used is a banking system on top of Cassandra with support crediting/debiting an account, viewing an account balance and transferring money between accounts.
- In Cassandra, data is modeled differently than in relational databases, with an emphasis on denormalizing data and organizing it to support common queries with minimal disk seeks
- Cassandra uses keyspaces, column families, rows, columns and timestamps to organize data, with columns ordered to enable efficient querying of ranges
- To effectively model data in Cassandra, you should think about common queries and design schemas to co-locate frequently accessed data on disk to minimize I/O during queries
This document summarizes several Cassandra anti-patterns including:
- Using a non-Oracle JVM which is not recommended.
- Putting the commit log and data directories on the same disk which can impact performance.
- Using EBS volumes on EC2 which can have unpredictable performance and throughput issues.
- Configuring overly large JVM heaps over 16GB which can cause garbage collection issues.
- Performing large batch mutations in a single operation which risks timeouts if not broken into smaller batches.
A high level overview of common Cassandra use cases, adoption reasons, BigData trends, DataStax Enterprise and the future of BigData given at the 7th Advanced Computing Conference in Seoul, South Korea
The document summarizes a workshop on Cassandra data modeling. It discusses four use cases: (1) modeling clickstream data by storing sessions and clicks in separate column families, (2) modeling a rolling time window of data points by storing each point in a column with a TTL, (3) modeling rolling counters by storing counts in columns indexed by time bucket, and (4) using transaction logs to achieve eventual consistency when modeling many-to-many relationships by serializing transactions and deleting logs after commit. The document provides recommendations and alternatives for each use case.
strangeloop 2012 apache cassandra anti patternsMatthew Dennis
A random list of Apache Cassandra anti-patterns. There is a lot of info on what to use Cassandra for and how, but not a lot of information on what not to do. This presentation works towards filling that gap.
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.
Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880
Cassandra concepts, patterns and anti-patternsDave Gardner
The document discusses Cassandra concepts, patterns, and anti-patterns. It begins with an agenda that covers choosing NoSQL, Cassandra concepts based on Dynamo and Bigtable, and patterns and anti-patterns of use. It then delves into Cassandra concepts such as consistent hashing, vector clocks, gossip protocol, hinted handoff, read repair, and consistency levels. It also discusses Bigtable concepts like sparse column-based data model, SSTables, commit log, and memtables. Finally, it outlines several patterns and anti-patterns of Cassandra use.
Cassandra's data model is more flexible than typically assumed.
Cassandra allows tuning of consistency levels to balance availability and consistency. It can provide strong consistency when the read and write replica counts overlap, i.e. when certain replication conditions are met.
Cassandra uses a row-oriented model where rows, uniquely identified by keys, group columns and super columns. Super column families allow grouping columns under a common name and are often used for denormalizing data.
Cassandra's data model is query-based rather than domain-based. It focuses on answering questions through flexible querying rather than storing predefined objects. Design patterns like materialized views and composite keys can help support different types of queries.
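The replication condition behind tunable consistency is usually stated as R + W > N: if the number of replicas consulted on read plus the number acknowledged on write exceeds the replication factor, every read overlaps at least one up-to-date replica. A minimal sketch of that arithmetic:

```python
def is_strongly_consistent(replication_factor: int,
                           read_cl: int, write_cl: int) -> bool:
    """R + W > N: every read quorum overlaps every write quorum on at
    least one replica, so reads see the latest acknowledged write."""
    return read_cl + write_cl > replication_factor

quorum = 3 // 2 + 1  # QUORUM at RF=3 touches 2 replicas
assert is_strongly_consistent(3, quorum, quorum)  # QUORUM + QUORUM: strong
assert not is_strongly_consistent(3, 1, 1)        # ONE + ONE: eventual only
```

This is why QUORUM reads paired with QUORUM writes are the common "consistent" setting at RF=3, while ONE/ONE trades that guarantee for availability and latency.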
This document outlines Netflix's culture of freedom and responsibility. Some key points:
- Netflix focuses on attracting and retaining "stunning colleagues" through a high-performance culture rather than perks. Managers use a "Keeper Test" to determine which employees they would fight to keep.
- The culture emphasizes values over rules. Netflix aims to minimize complexity as it grows by increasing talent density rather than imposing processes. This allows the company to maintain flexibility.
- Employees are given significant responsibility and freedom in their roles, such as having no vacation tracking or expense policies beyond acting in the company's best interests. The goal is to avoid chaos through self-discipline rather than controls.
- Providing
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark DataStax Academy
Speaker: Richard Low, Analytics Tech Lead at SwiftKey
Video: http://www.youtube.com/watch?v=QTb4HTwVMq0&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=2
Everything Cassandra does is designed for a real-time workload of high volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but performing long running ad-hoc queries with these tools is difficult without impacting real-time performance or requires duplicate clusters. This talk will explain how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AmpLab. It's designed to give fine grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.
Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)Richard Low
The document discusses running batch analytics queries on Cassandra databases by using Spark and Shark to directly access the SSTables. Current solutions like running Hive on Cassandra have performance issues. The author's solution uses Spark workers running on Cassandra nodes to read SSTables directly, avoiding the filesystem cache and CQL interface. Performance tests show this approach is 2.5x faster than using the CQL interface and has lower and more predictable query latency, even under write load. The author calls for further development and contributions to the technique.
C* Summit EU 2013: Keynote by Jonathan Ellis — Cassandra 2.0 & 2.1DataStax Academy
Speaker: Jonathan Ellis, Apache Cassandra Chair & CTO/Co-Founder at DataStax
Keynote presentation on Apache Cassandra 2.0 & 2.1 at Cassandra Summit EU 2013
The document discusses Cassandra 2.1, including:
- New features like user defined types, collection indexing, and more efficient HyperLogLog filters and repair processes.
- Past and ongoing improvements to Cassandra's performance, scalability, reliability and ease of use over its 5 year history and multiple releases.
- Details on Cassandra's architecture like its read path, compaction strategies, and use of on- and off-heap memory.
ContextSpace is working to develop and support an open source implementation of the Camunda core engine that persists all of its data to Cassandra. This development addresses issues of ACID semantics as well as approaches to lock management. ContextSpace plans to integrate this implementation with its own product offering in order to expose data and events generated from its identity, security, roles, messaging and contextual user activities to be managed by Camunda-driven business processes.
C* Summit EU 2013: Hardware Agnostic: Cassandra on Raspberry Pi DataStax Academy
Speaker: Andy Cobley, Lecturer at University of Dundee
Video: http://www.youtube.com/watch?v=0U4iOSMnRdk&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=1
Abstract: The Raspberry Pi is a credit-card-sized $25 ARM-based Linux box designed to teach children the basics of programming. The machine comes with a 700MHz ARM and 512MB of memory and boots off an SD card; not much power for running the likes of a Cassandra cluster. This presentation will discuss the problems of getting Cassandra up and running on the Pi and will answer the all-important question: Why on Earth would you want to do this!?
What is Apache Cassandra? | Apache Cassandra Tutorial | Apache Cassandra Intr...Edureka!
** Apache Cassandra Certification Training: https://www.edureka.co/cassandra **
This Edureka tutorial on "What is Apache Cassandra" will give you a detailed introduction to the NoSQL database Apache Cassandra and its various features. Learn why Cassandra is preferred over other databases. You will also learn about the various elements of the Cassandra database with an interactive industry-based use case.
Speaker: Aaron Morton, Apache Cassandra Committer & Co-Founder/Principal Consultant at The Last Pickle Inc.
Video: http://www.youtube.com/watch?v=efI5fL8eEfo&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=23
From the microsecond your request hits an Apache Cassandra node there are many code paths, threads and machines involved in storing or fetching your data. This talk will step through the common operations and highlight the code responsible. Apache Cassandra solves many interesting problems to provide a scalable, distributed, fault tolerant database. Cluster wide operations track node membership, direct requests and implement consistency guarantees. At the node level, the Log Structured storage engine provides high performance reads and writes. All of this is implemented in a Java code base that has greatly matured over the past few years. This talk will step through read and write requests, automatic processes and manual maintenance tasks. I'll discuss the general approach to solving the problem and drill down to the code responsible for implementation. Existing Cassandra users, those wanting to contribute to the project and people interested in Dynamo based systems will all benefit from this tour of the code base.
C* Summit EU 2013: Effective Cassandra Development with AchillesDataStax Academy
This document discusses Achilles, an open source persistence manager for Cassandra that provides features like entity mapping, common CRUD operations, query DSL, and integration with Spring. It highlights that Achilles was created by developers for developers and aims to support all CQL3 features and upcoming Cassandra features. The presentation encourages effective Cassandra development using Achilles and provides an overview of its capabilities and roadmap.
Effective cassandra development with achillesDuyhai Doan
This document discusses Achilles, an open source persistence manager for Cassandra that provides features like entity mapping, common CRUD operations, query DSL, and integration with Spring. It highlights that Achilles was created by developers for developers to make Cassandra development more effective. The roadmap includes future support for secondary indexes, bean validation, DAO templates, and new Cassandra 2.0 features.
** Apache Cassandra Certification Training: https://www.edureka.co/cassandra **
In this PPT, you will get a detailed introduction to the NoSQL and Apache Cassandra questions and answers required to crack any interview. Brush up your knowledge of Cassandra, its various database elements and how to configure the database.
Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps (https://www.pollfish.com/). At Pollfish we use Cassandra for different use cases, e.g. as an application data store to maximize write throughput when appropriate, and for our analytics project to find insights in application-generated data. As a medium to accomplish our success so far, we use DataStax's DSE 4.6 environment, which integrates Apache Cassandra, Spark and a Hadoop-compatible file system (CFS). We will discuss how we started, how the journey went and the impressions gained so far, along with some tips learned the hard way. This is the result of joint work by an excellent team here at Pollfish.
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and used by many end-user facing applications. I will introduce best practices we have built over the time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store fitting into eBay infrastructure.
We run multiple DataStax Enterprise clusters in Azure each holding 300 TB+ data to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges and takeaways faced in running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building a big data platform using Cassandra, Spark and Azure to generate per-user insights for Office 365 users.
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsAnant Corporation
In this lunch, Johnny will show us how easy it is to start monitoring your Cassandra cluster in minutes. He will explain the various aspects and features of Cassandra that need to be monitored, how to do it, and most importantly why! Approaches for backups and Cassandra repairs will be discussed and explored in detail.
Learn how AxonOps significantly reduces the complexity and overhead when looking after Cassandra and ensures your Cassandra cluster is reliable and resilient.
Experienced developer, DevOps, architect, and AxonOps co-founder, Johnny Miller, has worked with a wide variety of companies – from small start-ups to large enterprises. He has been working with Cassandra for many years and has a deep understanding of the challenges facing modern companies looking to adopt Apache Cassandra.
Cassandra is a distributed database designed to handle large amounts of structured data across commodity servers. It provides linear scalability, fault tolerance, and high availability. Cassandra's architecture is masterless with all nodes equal, allowing it to scale out easily. Data is replicated across multiple nodes according to the replication strategy and factor for redundancy. Cassandra supports flexible and dynamic data modeling and tunable consistency levels. It is commonly used for applications requiring high throughput and availability, such as social media, IoT, and retail.
C* Summit 2013: Time-Series Metrics with Cassandra by Mike HeffnerDataStax Academy
This document discusses using Cassandra to store time-series metrics data. It describes how the schema was matched to storage by using a measurement column family with rows organized by metric ID and time. It also covers optimizing data expiration through techniques like TTL expiration, synchronized compactions, and leveraging immutable sstable modification times. Effective monitoring is emphasized as well, including dashboards to track the ring and using Cassandra log volumes to identify issues.
Horizontal scaling in the cloud is the way to adapt resources to the load on systems. The cloud allows users to scale virtually indefinitely, or at least enough for their needs.
This way the number of servers follows the trend of requests, and the TCO (Total Cost of Ownership) of IT infrastructure can be reduced. Companies can also avoid dealing with capacity planning and pre-provisioning issues.
This talk will show how to use Python and Rackspace/OpenStack API and SDK to implement an event-based scaling solution (software released under the open-source Apache License: stay tuned).
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...Amazon Web Services
Startups around the world use AWS services to access the power of the cloud to grow faster and more cost effectively. In this session, Smartsheet talks about how they were able to cost-effectively build their prototype for scale and avoid replatforming at different points in the adoption curve, and Quantcast discusses how they are running a high-performance analytics solution on AWS. They provide several tips and tricks for S3, and show how they removed a traditional MySQL data store from a distributed-image hosting application so that the only required data store is S3. They also show how to avoid common, cumbersome database practices by working with the eventually consistent nature of S3 objects and the fact that objects and directories share the same namespace.
Similar to Cabs, Cassandra, and Hailo (at Cassandra EU) (20)
Intro slides from Cassandra London July 2011Dave Gardner
The document compares Cassandra and MongoDB, two NoSQL databases. It provides information on their data models, conflict resolution approaches, distribution methods, and differences. A commenter responds that Cassandra and MongoDB have almost nothing in common and that claims of Cassandra dying off are incorrect.
This document provides a summary of various resources about Apache Cassandra, including blog posts on migrating Netflix to Cassandra, indexing in Cassandra, and Cassandra at Twitter. It also lists a book on Cassandra and highlights the key components of the Acunu data platform, which includes Cassandra, management tools, and an easily installed package.
An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.
Introduction to Cassandra at London Web MeetupDave Gardner
A 15 minute introduction to the Cassandra distributed data store from the February 2011 London Web meetup.
This covers the basics of who is using it, why you might want to use it (due to the large amount of data being collected by Web Apps today) and, most importantly, _what_ it is!
What are the challenges of running Apache Cassandra on Amazon EC2? Is it a good idea?
In this presentation, we explore reasons for and against running the distributed database Cassandra on EC2. We look at the I/O performance of EC2 and
4. 0.6 to 1.2
• 1,352 changed files with 235,413 additions and 47,487 deletions
• 7,429 commits
• 1,653 tickets completed
https://github.com/apache/cassandra/compare/cassandra-0.6.0...cassandra-1.2
https://github.com/apache/cassandra/blob/trunk/CHANGES.txt
#CASSANDRAEU
CASSANDRASUMMITEU
5. What this talk is about
Cassandra adoption at Hailo from three perspectives:
1. Development
2. Operational
3. Management
6. What is Hailo?
Hailo is The Taxi Magnet. Use Hailo to get a cab wherever you are, whenever you want.
10. What is Hailo?
• The world’s highest-rated taxi app – over 11,000 five-star reviews
• Over 500,000 registered passengers
• A Hailo hail is accepted around the world every 4 seconds
• Hailo operates in 15 cities on 3 continents from Tokyo to Toronto in
nearly 2 years of operation
11. Hailo is growing
• Hailo is a marketplace that facilitates over $100M in run-rate
transactions and is making the world a better place for passengers
and drivers
• Hailo has raised over $50M in financing from the world's best
investors including Union Square Ventures, Accel, the founder of
Skype (via Atomico), Wellington Partners (Spotify), Sir Richard
Branson, and our CEO's mother, Janice
12. The history
The story behind Cassandra adoption at Hailo
13. Hailo launched in London in November 2011
• Launched on AWS
• Two PHP/MySQL web apps plus a Java backend
• Mostly built by a team of 3 or 4 backend engineers
• MySQL multi-master for single AZ resilience
14. Why Cassandra?
• A desire for greater resilience – “become a utility”
Cassandra is designed for high availability
• Plans for international expansion around a single consumer app
Cassandra is good at global replication
• Expected growth
Cassandra scales linearly for both reads and writes
• Prior experience
I had experience with Cassandra and could recommend it
15. The path to adoption
• Largely unilateral decision by developers – a result of a startup
culture
• Replacement of key consumer app functionality, splitting up the
PHP/MySQL web app into a mixture of global PHP/Java services
backed by a Cassandra data store
• Launched into production in September 2012 – originally just
powering North American expansion, before gradually switching
over Dublin and London
16. One year on...
• Further breakdown of functionality into Go/Java SOA
• Migrating all online databases to Cassandra
21. Considerations for entity storage
• Do not read the entire entity, update one property and then write
back a mutation containing every column
• Only mutate columns that have been set
• This avoids read-before-write race conditions
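The guidance above — mutate only the columns that were actually set, rather than read-modify-write the whole entity — can be sketched as a small statement builder. This is a hypothetical helper for illustration, not Hailo's actual code:

```python
def build_update(table: str, entity_id: str, changed: dict):
    """Build a CQL UPDATE touching only the columns that were set,
    avoiding the read-before-write race the slide warns about."""
    assignments = ", ".join(f"{col} = %s" for col in changed)
    cql = f"UPDATE {table} SET {assignments} WHERE id = %s"
    return cql, list(changed.values()) + [entity_id]

# Only the phone number changed, so only that column is written:
cql, params = build_update("customers", "c123", {"phone": "555-0199"})
```

Because Cassandra writes are last-write-wins per column, two concurrent updates to different columns of the same entity merge cleanly instead of one clobbering the other.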
26. Considerations for time series storage
• Choose row key carefully, since this partitions the records
• Think about how many records you want in a single row
• Denormalise on write into many indexes
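One common way to act on the row-key advice above is time bucketing: partition each series by a fixed time window so no single row grows unbounded. A minimal sketch (the bucket width and key format are assumed examples, not Hailo's schema):

```python
from datetime import datetime, timezone

def row_key(series: str, ts: datetime, bucket_hours: int = 24) -> str:
    """Partition a time series into fixed-width time buckets so each
    row holds a bounded number of columns."""
    epoch_hours = int(ts.timestamp()) // 3600
    bucket = epoch_hours - epoch_hours % bucket_hours
    return f"{series}:{bucket}"

# All points from the same UTC day land in the same row:
k = row_key("driver-locations", datetime(2013, 10, 17, 15, 30, tzinfo=timezone.utc))
```

Choosing the bucket width is the "how many records per row" decision from the slide: narrower buckets mean more, smaller rows and better distribution across the ring.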
28. Analytics
• With Cassandra we lost the ability to carry out analytics, e.g. COUNT, SUM, AVG, GROUP BY
• We use Acunu Analytics to give us this ability in real time, for pre-planned query templates
• It is backed by Cassandra and therefore highly available, resilient and globally distributed
• Integration is straightforward
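The trick behind real-time analytics on pre-planned query templates is to aggregate on write, so answering SUM/COUNT/AVG is a single read instead of a scan. The idea in miniature (an in-memory stand-in for illustration, not the Acunu API):

```python
from collections import defaultdict

class RollingAggregates:
    """Pre-aggregate on write so pre-planned SUM/COUNT/AVG queries
    are O(1) reads instead of full scans."""
    def __init__(self):
        self._cells = defaultdict(lambda: [0, 0.0])  # (count, sum) per key

    def record(self, template_key: str, value: float) -> None:
        cell = self._cells[template_key]
        cell[0] += 1
        cell[1] += value

    def avg(self, template_key: str) -> float:
        count, total = self._cells[template_key]
        return total / count if count else 0.0

agg = RollingAggregates()
agg.record("fares:london:2013-10-17", 12.0)
agg.record("fares:london:2013-10-17", 8.0)
# agg.avg("fares:london:2013-10-17") == 10.0
```

The key (here a hypothetical "metric:city:day" string) is the query template decided up front, which is exactly why ad-hoc GROUP BY remains out of reach.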
33. “Allows a team of 2 to achieve things they
wouldn’t have considered before Cassandra
existed”
Chris H, Operations Engineer
37. Stats
• AWS VPCs with OpenVPN links
• 3 AZs per region
• m1.large machines
• Provisioned IOPS EBS
• Stats cluster: ~ 1TB/node
• Operational cluster: ~ 200GB/node
38. Backups
• SSTable snapshot
• We used to upload to S3, but this was taking >6 hours and consuming all our network bandwidth
• Now we take EBS snapshots of the data volumes
39. Encryption
• Requirement for NYC launch
• We use dmcrypt to encrypt the entire EBS volume
• Chose dmcrypt because it is uncomplicated
• Our tests show a ~1% hit in disk performance, which concurs with what Amazon suggests
41. Multi DC
• Something that Cassandra makes trivial
• Would have been very difficult to accomplish active-active inter-DC
replication with a team of 2 without Cassandra
• Rolling repair needed to make it safe (we use LOCAL_QUORUM)
• We schedule “narrow repairs” on different nodes in our cluster
each night
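The nightly "narrow repair" rotation above can be sketched as a tiny scheduler that picks one node per night on which to run `nodetool repair -pr`. The node names and rotation rule here are illustrative assumptions, not Hailo's tooling:

```python
import datetime

NODES = ["cass-1", "cass-2", "cass-3"]  # illustrative node names

def node_to_repair(today: datetime.date, nodes=NODES) -> str:
    """Rotate a primary-range repair across the ring, one node per
    night, so every node is repaired regularly without the whole
    cluster paying the repair cost at once."""
    return nodes[today.toordinal() % len(nodes)]
```

Repairing the primary range of a different node each night spreads the load, while LOCAL_QUORUM keeps latency local to each DC; repair is what reconciles the DCs behind the scenes.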
42. Compression
• Our stats cluster was running at ~1.5TB per node
• We didn’t want to add more nodes
• With compression, we are now back to ~600GB
• Easy to accomplish
• `nodetool upgradesstables` on a rolling schedule
44. “The days of the quick and dirty are over”
Simon V, EVP Operations
45. Technically, everything is fine…
• Our COO feels that C* is “technically good and beautiful”, a
“perfectly good option”
• Our EVPO says that C* reminds him of a time series database in
use at Goldman Sachs that had “very good performance”
…but there are concerns
46. [Diagram: the number of people who can attempt to query MySQL vs. the number who can attempt to query Cassandra]
51. Lesson learned
• Have an advocate - get someone who will sell the vision internally
• Learn the theory - teach each team member the fundamentals
• Make an effort to get everyone on board
58. Lesson learned
• Be pro-active with Cassandra, even if it seems to be running
smoothly
• Peer-review data models, take time to think about them
• Big rows are bad - use cfstats to look for them
• Mixed workloads can cause problems - use cfhistograms and look
out for signs of data modeling problems
• Think about the compaction strategy for each CF
60. Lessons learned
• EBS is nearly always the cause of Amazon outages
• EBS is a single point of failure (it will fail everywhere in your
cluster)
• EBS is slow
• EBS is expensive
• EBS is unnecessary!
62. Lessons learned
• Keep the business informed – explain the tradeoffs in simple terms
• Sing from the same hymn sheet
• Make sure there are solutions in place for every use case from the beginning
63. People who can attempt to query MySQL
People who can attempt to query Cassandra
65. We like Cassandra
• Solid design
• HA characteristics
• Easy multi-DC setup
• Simplicity of operation
66. Lessons for successful adoption
• Have an advocate, sell the dream
• Learn the fundamentals, get the best out of Cassandra
• Invest in tools to make life easier
• Keep management in the loop, explain the trade-offs
67. The future
• We will continue to invest in Cassandra as we expand globally
• We will hire people with experience running Cassandra
• We will focus on expanding our reporting facilities
• We aspire to extend our network (1M consumer installs, wallet) beyond cabs
• We will continue to hire the best engineers in London, NYC and Asia
I started using Cassandra in 2010, back in version 0.6. Back then it was quite hard work.
I founded the London meetup group in 2010 and have been flying the C* flag over London ever since. My motivation was to connect with others who were using Cassandra. Back then “swapping war stories” was a common theme. Cassandra was not easy to use.
Fast forward to 2013. 7,429 commits later. Cassandra “just works”. Kudos to the team of committers and contributors who have made this happen.
4:30
Whilst “it just works” is quite compelling, there are still challenges to successful adoption of C* in an organisation. I am going to talk about our experiences at Hailo, from three perspectives: dev, ops and management.
On iOS and Android, live in London, New York, Chicago, Toronto, Boston, Dublin, Madrid
Founded by 3 taxi drivers and 3 seasoned entrepreneurs.
Built by a small team, in one room, on a boat on the Thames, but with global ambitions. Cloud native from day 1 – run solely on AWS.
My recommendation was based on the solid design principles behind C*, something I’ve talked about in the past.
13:00
Row key = entity ID, in this instance a 64-bit integer à la Snowflake.
Column name = property name.
Value = property value.
A key point when using this pattern is to only mutate columns that you change.
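The “only mutate changed columns” point can be sketched as a diff step before the write. A minimal illustration (property names are invented; deletes of removed properties are ignored for brevity):

```python
def changed_columns(old, new):
    """Return only the properties whose values actually changed.

    With one row per entity (row key = 64-bit Snowflake-style ID) and one
    column per property, writing just this delta means concurrent updates
    to *other* properties of the same entity are never clobbered.
    """
    return {k: v for k, v in new.items() if old.get(k) != v}

# Usage sketch with invented property names:
before = {"status": "ACTIVE", "driver": "123", "city": "LON"}
after = {"status": "COMPLETED", "driver": "123", "city": "LON"}
delta = changed_columns(before, after)  # only the status column is written
```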
Read heavy, demand-driven. Writes consistent.
Time series for storing records of all actions in Hailo. In this instance bucketed by a daily row key, for all messages. The column name is a type 1 UUID.
We also denormalise for other indexes, eg: here we store every message sent to a given address under a single row.
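The daily-bucketed time series described above can be sketched like this (the key format and prefix are illustrative; the important parts from the slides are the one-row-per-day bucket and the type 1, i.e. time-based, UUID column name):

```python
import uuid
from datetime import datetime

def daily_bucket_key(prefix, when):
    """Row key bucketing a time series by day, e.g. 'msg:20131017'."""
    return "%s:%s" % (prefix, when.strftime("%Y%m%d"))

def new_column_name():
    # Type 1 UUIDs embed a timestamp; under Cassandra's TimeUUID
    # comparator they sort chronologically within the row.
    return uuid.uuid1()
```

Bounding each row at one day keeps rows from growing without limit, which matters given the big-row problems discussed later.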
Stats service – insert rate at 5k/sec. Responsible for storing business events from all areas of our system.
Row key = entity ID, in this instance a 64-bit integer à la Snowflake.
Column name = property name.
Value = property value.
A key point when using this pattern is to only mutate columns that you change.
We are not using CQL.
We can execute AQL
Some screenshot
27:00
London, NYC, Tokyo, Osaka, Dublin, Toronto, Boston, Chicago, Madrid, Barcelona, Washington, Montreal
Our rings, plus key stats (m1.large, 18 nodes in cluster A, 12 nodes in cluster B, 100GB per node in cluster A, ~ 600GB in cluster B)
EC2 snitch
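The EC2 snitch is a one-line cassandra.yaml setting: it derives the data centre from the AWS region and the rack from the availability zone. A minimal fragment (for a cluster spanning regions, the multi-region variant is the usual choice):

```yaml
# cassandra.yaml
endpoint_snitch: Ec2Snitch
# For clusters spanning AWS regions:
# endpoint_snitch: Ec2MultiRegionSnitch
```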
Our rings, plus key stats (m1.large, 18 nodes in cluster A, 12 nodes in cluster B, 100GB per node in cluster A, ~ 600GB in cluster B)
I interviewed key people from our management team to gauge their reaction to our C* deployment.
There is a perception that we have made it much harder to get at our data. In the early days at Hailo, when we all worked in one room, developers could execute ad-hoc queries on the fly for management. Nowadays we can’t. The reasons behind this are twofold: firstly, it is true that it is harder to run ad-hoc queries against C*. But that’s not the whole picture. Much of our data is still in MySQL, and the queries we used to run against that data do not run smoothly either. The perception, however, is that the “new database” is the cause of the problems.
It’s easy to cause yourself a “Big Data” problem. Developers collect and store data because they can, without being clear about the business implications.
1. Most people have N years of SQL experience where N >= 5
Sometimes C* works too well. Clearly this cluster needs some attention, but our application is still working fine. We are probably at the point where we need a dedicated C* expert.
2. It’s possible to shoot yourself in the foot – but this is true of SQL (eg: joins that work with low data volumes)
Big rows are bad – they expose a data modeling problem
With the right tools, we could change the picture completely.