This slide deck is intended to help the backend service team members of the PM2.5 Open Data Service (pm25.lass-net.org) learn the basics of Apache Cassandra.
Cassandra Community Webinar | In Case of Emergency Break Glass (DataStax)
The design of Apache Cassandra allows applications to provide constant uptime. Its peer-to-peer architecture ensures there are no single points of failure, and its consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, developers and managers will all benefit from this exposition of Cassandra in the wild.
What is in All of Those SSTable Files Not Just the Data One but All the Rest ... (DataStax)
Have you ever wondered what is in all of those SSTable files and how they help Cassandra find and manage your data? If you go to the DataStax website you will find a high-level explanation of what is in each file. In this talk we will go much deeper, explaining each file and walking through a dump of its contents. We will also explore the differences between Cassandra 2.1 and 3.4.
About the Speaker
John Schulz, Principal Consultant, The Pythian Group
John has 40 years of experience working with data: data in files and in databases, from flat files through ISAM to relational databases and, most recently, NoSQL. For the last 15 years he has worked on a variety of open-source technologies including MySQL, PostgreSQL, Cassandra, Riak, Hadoop and HBase. He has been working with Cassandra since 2010. For the last eighteen months he has been working for The Pythian Group, helping their customers improve their existing databases and select new ones.
Advanced Apache Cassandra Operations with JMX (zznate)
Nodetool is a command-line interface for managing a Cassandra node. It provides commands for node administration, cluster inspection, table operations and more. The nodetool info command displays node-specific information such as status, load, memory usage and cache details. The nodetool compactionstats command shows compaction status, including active tasks and their progress. The nodetool tablestats command displays statistics for a specific table, including read/write counts, space usage, cache usage and latency.
Some vignettes and advice based on prior experience with Cassandra clusters in live environments. Includes some material from other operational slides.
Understanding Data Partitioning and Replication in Apache Cassandra (DataStax)
This document provides an overview of data partitioning and replication in Apache Cassandra. It discusses how Cassandra partitions data across nodes using configurable strategies like random and ordered partitioning. It also explains how Cassandra replicates data for fault tolerance using a replication factor and different strategies like simple and network topology. The network topology strategy places replicas across racks and data centers. Various snitches help Cassandra determine network topology.
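The partitioning and replica placement described above can be illustrated with a minimal Python sketch. This is a toy model, not Cassandra's implementation: node names are invented, the ring is 0-255 with MD5 instead of Murmur3, and each node owns a single token (real clusters use vnodes). It shows the SimpleStrategy idea of walking clockwise from a key's token and taking the next RF distinct nodes:

```python
import hashlib
from bisect import bisect_right

# Toy token ring: one token per hypothetical node (real Cassandra uses vnodes).
nodes = {"node-a": 0, "node-b": 85, "node-c": 170}
ring = sorted((token, name) for name, token in nodes.items())
tokens = [t for t, _ in ring]

def token_for(key: str) -> int:
    # Hash the partition key onto the 0-255 ring (Cassandra uses Murmur3).
    return hashlib.md5(key.encode()).digest()[0]

def replicas(key: str, rf: int = 2) -> list:
    # SimpleStrategy-style placement: walk clockwise from the key's token
    # and take the next `rf` distinct nodes.
    start = bisect_right(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]
```

NetworkTopologyStrategy refines the same walk by skipping nodes until the replicas span distinct racks and datacenters.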
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan... (DataStax)
Successfully running Apache Cassandra in production often means knowing what configuration settings to change and which ones to leave as default. Over the years the cassandra.yaml file has grown to provide a number of settings that can improve stability and performance. While the file contains plenty of helpful comments, there is more to be said about the settings and when to change them.
In this talk Edward Capriolo, Consultant at The Last Pickle, will break down the parameters in the configuration files, looking at those that are essential to getting started, those that impact performance, those that improve availability, the exotic ones, and the ones that should not be played with. This talk is ideal for anyone from someone setting up Cassandra for the first time to people with deployments in production who wonder what the more exotic configuration options do.
About the Speaker
Edward Capriolo Consultant, The Last Pickle
Long time Apache Cassandra user, big data enthusiast.
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and/or high throughput: basic Apache Cassandra operations such as repairs, compactions or hint delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and/or datacenters, repair strategy, Java GC tuning, OS tuning, and Apache Cassandra configuration and monitoring.
Based on his three years of experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent or mitigate issues related to basic Apache Cassandra operations in a multi-datacenter cluster.
1. Cassandra is a decentralized structured storage system designed for scalability and high availability without single points of failure.
2. It uses consistent hashing to partition data across nodes and provide high availability, and an anti-entropy process to detect and repair inconsistencies between nodes.
3. Clients can specify consistency levels for reads and writes, with different levels balancing availability and consistency. The quorum protocol is used to achieve consistency when replicating data across nodes.
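The quorum arithmetic behind point 3 is simple enough to state in a few lines of Python. This is a sketch of the rule, not driver code: a quorum is a strict majority of the replication factor, and a read is guaranteed to see the latest acknowledged write whenever the read and write replica sets must overlap, i.e. R + W > RF:

```python
def quorum(rf: int) -> int:
    # A quorum is a strict majority of the replicas.
    return rf // 2 + 1

def is_strongly_consistent(rf: int, write_cl: int, read_cl: int) -> bool:
    # Read and write replica sets overlap in at least one node when R + W > RF,
    # so the read always touches a replica holding the latest write.
    return read_cl + write_cl > rf

# With RF=3: QUORUM writes (2) + QUORUM reads (2) overlap; ONE/ONE does not,
# trading consistency for availability and latency.
```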
Cassandra by example - the path of read and write requests (grro)
This article describes how Cassandra handles and processes requests. It will help you to get a better impression about Cassandra's internals and architecture. The path of a single read request as well as the path of a single write request will be described in detail.
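A heavily simplified Python sketch of the two paths can make the moving parts concrete. This is a toy single-node model (all names and structures are illustrative, not Cassandra's classes): a write is appended to the commitlog for durability and applied to the memtable; a flush turns the memtable into an immutable SSTable; a read collates candidates from the memtable and all SSTables and the newest timestamp wins:

```python
import itertools

_clock = itertools.count()  # deterministic stand-in for write timestamps
commitlog = []              # durable, append-only log
memtable = {}               # in-memory map: key -> (timestamp, value)
sstables = []               # immutable "on-disk" tables produced by flushes

def write(key, value):
    ts = next(_clock)
    commitlog.append((ts, key, value))  # 1. append to the commitlog for durability
    memtable[key] = (ts, value)         # 2. apply to the memtable

def flush():
    # The memtable is written out as an immutable SSTable and cleared.
    sstables.append(dict(memtable))
    memtable.clear()

def read(key):
    # Collate versions from the memtable and every SSTable; newest wins.
    candidates = [t[key] for t in [memtable] + sstables if key in t]
    return max(candidates)[1] if candidates else None
```

In real Cassandra the coordinator also routes the request to replicas and waits for the requested consistency level; this sketch covers only one node's local path.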
Have you recently started working with Spark, and do your jobs take forever to finish? This presentation is for you.
Himanshu Arora and Nitya Nand YADAV have gathered numerous best practices, optimizations and tunings that they have applied in production over the years to make their jobs faster and less resource-hungry.
In this presentation, they teach us advanced Spark optimization techniques, data serialization formats, storage formats, hardware optimizations, control over parallelism, resource manager settings, better data locality, GC tuning and more.
They also show us the appropriate use of RDD, DataFrame and Dataset in order to fully benefit from Spark's internal optimizations.
This document provides an agenda and introduction for a presentation on Apache Cassandra and DataStax Enterprise. The presentation covers an introduction to Cassandra and NoSQL, the CAP theorem, Apache Cassandra features and architecture including replication, consistency levels and failure handling. It also discusses the Cassandra Query Language, data modeling for time series data, and new features in DataStax Enterprise like Spark integration and secondary indexes on collections. The presentation concludes with recommendations for getting started with Cassandra in production environments.
Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim... (DataStax)
Ooyala has been using Apache Cassandra since version 0.4. Their data ingest volume has exploded since then, and Cassandra has scaled along with it. In this webinar, Al will share lessons he has learned across an array of topics from an operational perspective, including how to manage, tune, and scale Cassandra in a production environment.
Speaker: Al Tobey, Tech Lead, Compute and Data Services at Ooyala
Al Tobey is Tech Lead of the Compute and Data services team at Ooyala. His team develops and operates Ooyala's internal big data platform, consisting of Apache Cassandra, Hadoop, and internally developed tools. When not in front of a computer, Al is a father, husband, and trombonist.
Building Apache Cassandra clusters for massive scale (Alex Thompson)
Covering theory and operational aspects of bringing up Apache Cassandra clusters, this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick... (DataStax)
Making sure your data model will work on the production cluster after six months as well as it does on your laptop is an important skill. It's one that we use every day with our clients at The Last Pickle, and one that relies on tools like cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive rewrites late in the project.
In this talk Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results and even extend the tool for your own use cases. While this might be called premature optimisation for an RDBMS, a successful Cassandra project depends on its data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part-time consultant at The Last Pickle, where he works with clients to help them succeed with Apache Cassandra, as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: untested software, code ownership. You can check out his blog at: http://www.batey.info
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and/or high throughput: basic Apache Cassandra operations such as repairs, compactions or hint delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and/or datacenters, repair strategy, Java GC tuning, OS tuning, and Apache Cassandra configuration and monitoring.
Based on his three years of experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent or mitigate issues related to basic Apache Cassandra operations in a multi-datacenter cluster.
About the Speaker
Julien Anguenot VP Software Engineering, iland Internet Solutions, Corp
Julien currently serves as iland's Vice President of Software Engineering. Prior to joining iland, Mr. Anguenot held tech leadership positions at several open source content management vendors and tech startups in Europe and in the U.S. Julien is a long-time open source software advocate, contributor and speaker: a Zope, ZODB and Nuxeo contributor and a member of the Zope and OpenStack foundations, his talks include ApacheCon, Cassandra Summit, OpenStack Summit, The WWW Conference and EuroPython.
MySQL Cluster provides high availability through data replication across multiple nodes, automatic failover, and synchronous replication to ensure data integrity, but it has limitations in that the entire database must reside in memory and database size is restricted by available memory. Other options for high availability with MySQL include using MySQL proxy to split reads and writes across nodes, replication with multi-master setups, and technologies like DRBD to replicate data for recovery. Planning for failures, keeping implementations simple, and separating data and connectivity high availability are important principles for highly available MySQL architectures.
A brief history of Instagram's adoption cycle of the open-source distributed database Apache Cassandra, in addition to details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. By leveraging a combination of EBS-backed disks, JBOD, token pinning and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
Tech Talk: Best Practices for Data Modeling (ScyllaDB)
When we think about database performance, data modeling shouldn't be overlooked; the way data is written and retrieved dictates how fast your system can operate. Because Scylla is a non-relational database, its data model focuses on application queries to build the most efficient data structure. Adapting to a new data modeling mindset can be done pragmatically by understanding new database concepts and how they apply to Scylla.
In this webinar you will learn about:
- Scylla data model and basic CQL concepts
- Primary and Clustering key selection
- Collections and User-Defined Types
- Problem finding techniques
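The primary/clustering key distinction at the heart of this kind of data modeling can be sketched with a toy in-memory model in Python. The table and column names below are invented for illustration; the point is that the partition key selects a partition, while the clustering key keeps rows inside it sorted, so a time-range query is a cheap contiguous slice of one partition:

```python
from bisect import insort

# Toy model of a wide-partition table such as:
#   CREATE TABLE readings (sensor_id text, ts int, pm25 float,
#                          PRIMARY KEY ((sensor_id), ts))
# Partition key (sensor_id) picks the partition; clustering key (ts) orders rows.
partitions = {}

def insert(sensor_id, ts, pm25):
    # insort keeps each partition's rows sorted by the clustering key.
    insort(partitions.setdefault(sensor_id, []), (ts, pm25))

def query(sensor_id, ts_from, ts_to):
    # Efficient query shape: one partition, one contiguous clustering-key slice.
    return [(t, v) for t, v in partitions.get(sensor_id, []) if ts_from <= t <= ts_to]
```

Queries that do not start from a partition key would have to scan every partition, which is why these databases push you to model tables around your application's queries.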
This document provides an overview of Cassandra, a decentralized, distributed database management system. It discusses why the author's company chose Cassandra over other options like HBase and MySQL for their real-time data needs. The document then covers Cassandra's data model, architecture, data partitioning, replication, and other key aspects like writes, reads, deletes, and compaction. It also notes some limitations of Cassandra and provides additional resource links.
Cassandra is the dominant data store used at Netflix, and its health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view of our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
This document provides an overview of Cassandra's read and write paths. It describes the core components involved, including memtables, SSTables, commitlog, cache service, column family store, and more. It explains how writes are applied to the commitlog and memtable and how reads merge data from memtables and SSTables using the collation controller.
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ... (DataStax)
A solid backup strategy is a DBA's bread and butter. Cassandra's nodetool snapshot makes it easy to back up the SSTable files, but there remains the question of where to put them and how. Knewton's backup strategy uses Ansible for distributed backups and stores them in S3.
Unfortunately, it's all too easy to store backups that are essentially useless due to the absence of a coherent restoration strategy. This problem proved much more difficult and nuanced than taking the backups themselves. I will discuss Knewton's restoration strategy, which again leverages Ansible, yet I will focus on general principles and pitfalls to be avoided. In particular, restores necessitated modifying our backup strategy to generate cluster-wide metadata that is critical for a smooth automated restoration. Such pitfalls indicate that a restore-focused backup design leads to faster and more deterministic recovery.
About the Speaker
Joshua Wickman Database Engineer, Knewton
Dr. Joshua Wickman is currently part of the database team at Knewton, a NYC tech company focused on adaptive learning. He earned his PhD at the University of Delaware in 2012, where he studied particle physics models of the early universe. After a brief stint teaching college physics, he entered the New York tech industry in 2014 working with NoSQL, first with MongoDB and then Cassandra. He was certified in Cassandra at his first Cassandra Summit in 2015.
This document provides an overview of Cassandra, including:
- Cassandra is a distributed, column-oriented database that is highly scalable and has no single point of failure.
- It compares Cassandra to relational databases, noting Cassandra's flexible schema and lack of joins.
- The architecture includes keyspaces, tables and columns, with replication specified at the keyspace level.
- Queries in Cassandra Query Language (CQL) have limitations compared to other databases.
The document compares two methods for limiting CPU usage of databases on the same server: instance caging and processor_group_name binding. It provides facts about how each method works, observations on performance differences, and examples of customer cases where each method may be best. Instance caging allows limiting CPU count online but the SGA is interleaved, while binding groups databases to specific CPUs requiring a restart but keeps the SGA local. The best choice depends on factors like database count and whether guaranteed CPU resources are needed for some databases.
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque... (Data Con LA)
Scylla is a new, open-source NoSQL data store with a novel design optimized for modern hardware, capable of 1.8 million requests per second per node, while providing Apache Cassandra compatibility and scaling properties. While conventional NoSQL databases suffer from latency hiccups, expensive locking, and low throughput due to low processor utilization, the Scylla design is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. The result is a NoSQL database that delivers an order of magnitude more performance, with less performance tuning needed from the administrator.
With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Come for the tech details on what Scylla does under the hood, and leave with some ideas on how to do more with NoSQL, faster.
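The shared-nothing, engine-per-core idea described above boils down to deterministically routing every key to exactly one shard that owns its data, so no locks are shared between cores. A minimal Python sketch of that routing rule (shard count and key names are invented; this is the concept, not Scylla's implementation, which hashes tokens rather than raw keys):

```python
import zlib

N_SHARDS = 4  # e.g. one shard per CPU core

def shard_of(key: str) -> int:
    # Deterministic hash routing: the same key always lands on the same shard,
    # so only that core's engine ever touches its data (no cross-core locking).
    return zlib.crc32(key.encode()) % N_SHARDS

# Each shard has its own private queue/memory; requests are handed off, not shared.
queues = [[] for _ in range(N_SHARDS)]

def submit(key: str, op: str):
    queues[shard_of(key)].append((key, op))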
Speaker bio
Don Marti is technical marketing manager for ScyllaDB. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.
Cassandra by example - the path of read and write requestsgrro
This article describes how Cassandra handles and processes requests. It will help you to get a better impression about Cassandra's internals and architecture. The path of a single read request as well as the path of a single write request will be described in detail.
Vous avez récemment commencé à travailler sur Spark et vos jobs prennent une éternité pour se terminer ? Cette présentation est faite pour vous.
Himanshu Arora et Nitya Nand YADAV ont rassemblé de nombreuses bonnes pratiques, optimisations et ajustements qu'ils ont appliqué au fil des années en production pour rendre leurs jobs plus rapides et moins consommateurs de ressources.
Dans cette présentation, ils nous apprennent les techniques avancées d'optimisation de Spark, les formats de sérialisation des données, les formats de stockage, les optimisations hardware, contrôle sur la parallélisme, paramétrages de resource manager, meilleur data localité et l'optimisation du GC etc.
Ils nous font découvrir également l'utilisation appropriée de RDD, DataFrame et Dataset afin de bénéficier pleinement des optimisations internes apportées par Spark.
This document provides an agenda and introduction for a presentation on Apache Cassandra and DataStax Enterprise. The presentation covers an introduction to Cassandra and NoSQL, the CAP theorem, Apache Cassandra features and architecture including replication, consistency levels and failure handling. It also discusses the Cassandra Query Language, data modeling for time series data, and new features in DataStax Enterprise like Spark integration and secondary indexes on collections. The presentation concludes with recommendations for getting started with Cassandra in production environments.
Cassandra Community Webinar | Practice Makes Perfect: Extreme Cassandra Optim...DataStax
Ooyala has been using Apache Cassandra since version 0.4.Their data ingest volume has exploded since 0.4 and Cassandra has scaled along with it. In this webinar, Al will share lessons that he has learned across an array of topics from an operational perspective including how to manage, tune, and scale Cassandra in a production environment.
Speaker: Al Tobey, Tech Lead, Compute and Data Services at Ooyala
Al Tobey is Tech Lead of the Compute and Data services team at Ooyala. His team develops and operates Ooyala's internal big data platform, consisting of Apache Cassandra, Hadoop, and internally developed tools. When not in front of a computer, Al is a father, husband, and trombonist.
Building Apache Cassandra clusters for massive scaleAlex Thompson
Covering theory and operational aspects of bring up Apache Cassandra clusters - this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...DataStax
Making sure your Data Model will work on the production cluster after 6 months as well as it does on your laptop is an important skill. It's one that we use every day with our clients at The Last Pickle, and one that relies on tools like the cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive re-writes late in the project.
In this talk Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results and even how to extend the tool for your own use cases. While this may be called premature optimisation for a RDBS, a successful Cassandra project depends on it's data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part time consultant at The Last Pickle where he works with clients to help them succeed with Apache Cassandra as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: Untested software, code ownership. You can checkout his blog at: http://www.batey.info
Apache Cassandra operations have the reputation to be simple on single datacenter deployments and / or low volume clusters but they become way more complex on high latency multi-datacenter clusters with high volume and / or high throughout: basic Apache Cassandra operations such as repairs, compactions or hints delivery can have dramatic consequences even on a healthy high latency multi-datacenter cluster.
In this presentation, Julien will go through Apache Cassandra mutli-datacenter concepts first then show multi-datacenter operations essentials in details: bootstrapping new nodes and / or datacenter, repairs strategy, Java GC tuning, OS tuning, Apache Cassandra configuration and monitoring.
Based on his 3 years experience managing a multi-datacenter cluster against Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent / mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
1. BASIC STUFF YOU NEED TO KNOW ABOUT CASSANDRA
AUG. 2018
YU-CHANG HO (ANDY)
FORMER RESEARCH ASSISTANT, ACADEMIA SINICA
2. A GREEK STORY
➡An Ancient Greek Prophet
➡Second-most beautiful woman in the world
➡Gift of Prophecy from Apollo
➡Figure of Tragedy
‣ Ref. https://www.wikiwand.com/en/Cassandra
3. APACHE CASSANDRA
WHAT IS APACHE CASSANDRA (C*)?
▸ Originated at Facebook Inc.
▸ Combines the concept of Google BigTable & Amazon Dynamo.
▸ Data Modeling: BigTable
▸ System Architecture: Dynamo
▸ A distributed database system with high scalability.
▸ Written in Java (The JVM Tuning Hell!!).
Ref. https://www.wikiwand.com/en/Apache_Cassandra
Ref. https://www.wikiwand.com/en/Dynamo_(storage_system)
Ref. https://www.wikiwand.com/en/Bigtable
4. APACHE CASSANDRA
WHAT IS APACHE CASSANDRA (C*)?- CONT.
▸ It is a popular database system! (Ranked in 2018)
1 Oracle
2 MySQL
3 Microsoft SQL Server
4 PostgreSQL
5 MongoDB
6 DB2
7 Redis
8 Elasticsearch
9 Microsoft Access
10 Cassandra
Ref. https://db-engines.com/en/ranking
5. APACHE CASSANDRA
WHAT IS APACHE CASSANDRA (C*)?- CONT.
▸ There is no master/slave relationship among C* nodes!
▸ Every node can serve both reads and writes.
▸ In our scenario, we assume the GCP node to be the “Master” that controls data insertion.
▸ The version we currently use: 3.11.2.
6. APACHE CASSANDRA
THE CAP THEOREM
▸ Eric Brewer, UC Berkeley
▸ C: Consistency
▸ A: Availability
▸ P: Partition-tolerance
▸ A distributed system cannot fully satisfy all three parts of CAP at the same time.
Ref. https://www.wikiwand.com/en/CAP_theorem
7. APACHE CASSANDRA
THE CAP THEOREM OF CASSANDRA
▸ C: The consistency of data → Eventual consistency
▸ A: The availability of service → Always available
▸ P: Ability to distribute that load effectively → High Scalability
▸ Still, we can try to satisfy all three parts by tuning the consistency level for reads/writes (R/W).
▸ C* provides high availability together with a tunable level of consistency.
Ref. https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
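The tuning idea above has a simple rule behind it, which we can sketch in a few lines of Python (a toy illustration, not Cassandra code): if a write waits for W replicas and a read consults R replicas out of RF total, the two sets are guaranteed to overlap whenever R + W > RF, so the read sees the latest write.

```python
def is_strongly_consistent(rf: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return r + w > rf

# QUORUM reads + QUORUM writes with RF = 3: 2 + 2 > 3 -> strong consistency.
assert is_strongly_consistent(3, 2, 2)
# ONE read + ONE write with RF = 3: 1 + 1 <= 3 -> only eventual consistency.
assert not is_strongly_consistent(3, 1, 1)
```

This is why QUORUM/QUORUM is a common choice: it keeps the overlap guarantee while still tolerating one node failure at RF = 3.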
11. APACHE CASSANDRA
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Keyspace: the counterpart of a database in MySQL; a container for tables (historically called “column families”, a term borrowed from BigTable)
▸ Table: Just table, don’t be confused! :-)
▸ MemTable: Cassandra will first store in memory. After a
certain among of data is reached, flush to disk (SSTable).
▸ Commit_log: Not only store in memory, C* will also first
create a log for those new data to prevent from failure and
is able to restore those data if bad thing happens.
▸ SSTable: The compressed files of data stored in disk.
12. APACHE CASSANDRA
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Replica: a copy of the data.
▸ Replication Factor (RF): the number of replicas you wish to maintain in a given data center.
▸ Partitioner: determines how data is distributed across the nodes in the cluster (a token created by hashing the partition key).
▸ Coordinator: the role a node takes when it receives a query. It locates the data among the nodes; on each node, the MemTable and SSTables are checked.
13. APACHE CASSANDRA
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Gossip Protocol: the protocol a C* node uses to discover information about other nodes.
▸ Seed Node: a node that mainly keeps the topology information (new nodes contact it first).
▸ Currently, our seed nodes are GCP (TW), UCSD (US), and NTU (JP).
▸ Snitch: the mechanism a C* node uses to map IPs to racks and data centers (the topology).
▸ A snitch is especially useful when performing a read.
▸ It builds the topology and helps decide which node to query.
14. APACHE CASSANDRA
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Consistency Level (CL): how much consistency a query must achieve.
ANY Lowest level. Even if all replica nodes are down, the write can still succeed.
ONE At least one replica node must succeed.
QUORUM (RF / 2) + 1 nodes must succeed.
ALL Highest level; every replica must succeed.
LOCAL_ONE For multiple data centers. One node in a given data center must succeed.
LOCAL_QUORUM For multiple data centers. See QUORUM (applied within the local data center).
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
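The QUORUM formula above is integer math, which is easy to get wrong by hand. A quick sketch of the quorum sizes, including the per-DC variant used by LOCAL_QUORUM (the DC names and RF values are hypothetical):

```python
# Quorum size per the formula on this slide: (RF / 2) + 1, integer division.
def quorum(rf: int) -> int:
    return rf // 2 + 1

# LOCAL_QUORUM only counts replicas in the coordinator's own data center.
def local_quorum(rf_by_dc: dict, local_dc: str) -> int:
    return quorum(rf_by_dc[local_dc])

for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}")
# RF=3 tolerates one replica down at QUORUM; RF=5 tolerates two.

rf = {"DC1": 3, "DC2": 2}  # hypothetical multi-DC layout
print(local_quorum(rf, "DC1"))  # 2
```

Note that RF=2 gives quorum=2, i.e. no failure tolerance at QUORUM, which is why odd RFs are usually preferred.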
15. APACHE CASSANDRA
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Compaction: merges SSTables, purging deleted data and rewriting the remaining data into new SSTables.
▸ When performing a repair, SSTable rebuild, or cleanup, you might see C* running compactions in order to make the data consistent.
▸ Tombstone: deletion is not done the usual way. A delete is performed as an insertion (a marker that flags the data as deleted).
▸ gc_grace_seconds: a period of time during which C* ensures all nodes have received the tombstone info. (Default: 10 days)
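The interaction between tombstones, gc_grace_seconds, and compaction can be sketched in a few lines (a toy model, not Cassandra code; the cell keys are made-up examples):

```python
# Sketch of tombstone handling: a delete writes a marker; compaction may
# only drop the marker after gc_grace_seconds, so every replica has had
# time to learn about the delete (otherwise the value could "rise again").
import time

GC_GRACE_SECONDS = 10 * 24 * 3600  # default: 10 days

def compact(cells, now):
    """Keep live cells, and tombstones still inside the grace period."""
    kept = []
    for cell in cells:
        if cell["tombstone"] and now - cell["ts"] > GC_GRACE_SECONDS:
            continue  # safe to purge: the grace period has passed
        kept.append(cell)
    return kept

now = time.time()
cells = [
    {"key": "pm25:site1", "tombstone": True,  "ts": now - 11 * 24 * 3600},
    {"key": "pm25:site2", "tombstone": True,  "ts": now - 3600},
    {"key": "pm25:site3", "tombstone": False, "ts": now - 3600},
]
print([c["key"] for c in compact(cells, now)])  # site1's tombstone is purged
```

This is also why every node must be repaired within gc_grace_seconds: a node that misses the tombstone for longer than that can resurrect deleted data.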
17. APACHE CASSANDRA
CASSANDRA CONFIGURATION
cluster_name <cluster name>
listen_interface <ethernet interface name>
listen_address <the IP address on the main interface>
authenticator PasswordAuthenticator
authorizer CassandraAuthorizer
endpoint_snitch GossipingPropertyFileSnitch
seeds <the seed server address>
broadcast_address <External IP address>
permissions_validity_in_ms 20000
concurrent_reads 16 * num. of disk used by data_file_directories
concurrent_writes 8 * num. of cores
concurrent_counter_writes 16 * num. of disk used by data_file_directories
streaming_keep_alive_period_in_secs 3600 (1hr)
read_request_timeout_in_ms 10000
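The concurrent_* rules in the table above are simple arithmetic over the hardware. A sketch with assumed example numbers (2 data disks, 8 cores; substitute your machine's real values):

```python
# The sizing rules quoted above, as arithmetic.
# data_disks / cores are hypothetical example values.
data_disks = 2  # disks backing data_file_directories
cores = 8

print("concurrent_reads:", 16 * data_disks)           # 32
print("concurrent_writes:", 8 * cores)                # 64
print("concurrent_counter_writes:", 16 * data_disks)  # 32
```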
18. APACHE CASSANDRA
CASSANDRA CONFIGURATION- CONT.
listen_interface <ethernet interface name>
listen_address <the IP address on the main interface>
broadcast_address <External IP address>
▸ Most of our machines are VMs, which may sit behind a local DHCP environment. The main interface might listen on a local IP, say 192.168.xxx.xxx.
▸ In that case, you need to set broadcast_address so the other nodes can find the node you are going to add.
19. APACHE CASSANDRA
CASSANDRA CONFIGURATION- CONT.
▸ authenticator / authorizer: a pair of settings for Cassandra account management.
▸ (PasswordAuthenticator / CassandraAuthorizer) is a fixed pair; don't change them.
▸ endpoint_snitch: which kind of snitch you would like to use.
▸ GossipingPropertyFileSnitch: you need to modify cassandra-rackdc.properties to use this snitch.
authenticator PasswordAuthenticator
authorizer CassandraAuthorizer
endpoint_snitch GossipingPropertyFileSnitch
20. APACHE CASSANDRA
CASSANDRA CONFIGURATION- CONT.
▸ permissions_validity_in_ms: how long to cache the authorization info.
permissions_validity_in_ms 20000
concurrent_reads 16 * num. of disk used by data_file_directories
concurrent_writes 8 * num. of cores
concurrent_counter_writes 16 * num. of disk used by data_file_directories
▸ concurrent_*: Hardware resource dependent.
21. APACHE CASSANDRA
CASSANDRA CONFIGURATION- CONT.
▸ Some machines have higher network latency; these settings help prevent Cassandra from timing out.
streaming_keep_alive_period_in_secs 3600 (1hr)
read_request_timeout_in_ms 10000
22. APACHE CASSANDRA
CASSANDRA CONFIGURATION- CONT.
▸ Still a lot of configuration to learn and discover!
▸ Lots of comments are available in cassandra.yaml. Check them out when you have time.
24. APACHE CASSANDRA
HOW IS DATA WRITTEN?
1. Write data to the MemTable (memory) & log it in the commit_log (disk)
‣ Durable writes: failure tolerance!
2. Flush data from the MemTable
‣ commitlog_total_space_in_mb: threshold that triggers a flush
3. Store data on disk in SSTables
25. APACHE CASSANDRA
HOW IS DATA WRITTEN?- CONT.
‣ The commit_log is replayed on restart. This is why a Cassandra reboot sometimes takes longer and sometimes shorter: it depends on how much data must be replayed.
[Diagram: a write request goes to the MemTable (memory) and the commit_log (disk); the MemTable is flushed to SSTables (committed data) on disk]
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
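The write path above can be sketched as a toy model: append to the commit log first (durability), update the MemTable (speed), and flush to an immutable "SSTable" past a threshold. Not Cassandra code; the flush threshold is an arbitrary example:

```python
# Toy model of the C* write path: commit log -> MemTable -> flush to SSTable.
FLUSH_THRESHOLD = 3  # arbitrary; Cassandra flushes on memory/commit-log size

class ToyNode:
    def __init__(self):
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # in-memory writes
        self.sstables = []     # immutable flushed segments

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        self.sstables.append(dict(self.memtable))  # immutable snapshot
        self.memtable.clear()
        self.commit_log.clear()  # flushed data no longer needs replay

node = ToyNode()
for i in range(4):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)  # 1 {'k3': 3}
```

On restart, anything still in the toy commit_log would be replayed into the MemTable, which is exactly the replay step described on this slide.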
26. APACHE CASSANDRA
HOW IS DATA WRITTEN?- CONT.
‣ Note that it is recommended to keep the commit_log and SSTables on different disks.
‣ If possible, attach at least 3 hard disk drives to your machine. (SSDs are more than welcome!)
Ref. https://wiki.apache.org/cassandra/PerformanceTuning
27. APACHE CASSANDRA
HOW IS DATA READ?
▸ The coordinator finds which node(s) to ask for the required data.
▸ On the responsible node:
▸ Try to find the data in the MemTable first.
▸ Then look for the data in the compressed SSTable files.
▸ Combine the results (from MemTable & SSTables) and return them to the coordinator.
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
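The merge step can be sketched as follows (a toy model; real Cassandra merges by per-cell write timestamp and uses bloom filters to skip SSTables, which this ignores):

```python
# Toy read path: check the MemTable first, then SSTables from newest
# to oldest; the most recent write for the key wins.
def read(key, memtable, sstables):
    if key in memtable:
        return memtable[key]
    for sstable in reversed(sstables):  # newest flushed segment first
        if key in sstable:
            return sstable[key]
    return None  # key not found on this node

memtable = {"a": 3}
sstables = [{"a": 1, "b": 1}, {"b": 2}]  # oldest to newest
print(read("a", memtable, sstables))  # 3 (MemTable shadows older SSTables)
print(read("b", memtable, sstables))  # 2 (newest SSTable wins)
```

This is why a key's read cost grows with the number of SSTables that contain it, and why compaction (merging SSTables) also speeds up reads.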
28. APACHE CASSANDRA
HOW IS DATA DELETED?
▸ Keep in mind that this is a large-scale distributed system. Careless deletion can harm consistency.
▸ Deletion as insertion: the Tombstone.
▸ gc_grace_seconds: prevents party-rock zombies!!
▸ Compaction actually "clears" the data.
▸ You may assign a TTL to a data row!
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
Everyday I’m shuffling!
31. APACHE CASSANDRA
REPLICA, REPLICATION FACTOR (RF)
▸ How to determine the placement of replica?
▸ SimpleStrategy & NetworkTopologyStrategy
▸ SimpleStrategy: places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring, without considering topology.
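The clockwise walk is simple modular arithmetic. A minimal sketch (the ring and node names are hypothetical, and real rings are ordered by token, not by name):

```python
# Sketch of SimpleStrategy replica placement: the partitioner picks the
# first replica's position; the remaining RF-1 replicas are the next
# nodes clockwise, ignoring racks and data centers.
def simple_strategy_replicas(first_index, ring, rf):
    return [ring[(first_index + i) % len(ring)] for i in range(rf)]

ring = ["node-a", "node-b", "node-c", "node-d"]  # token order, hypothetical
print(simple_strategy_replicas(3, ring, 3))  # ['node-d', 'node-a', 'node-b']
```

Because it ignores topology, SimpleStrategy can put all replicas in one rack, which is why NetworkTopologyStrategy (next slide) is preferred for real deployments.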
32. APACHE CASSANDRA
REPLICA, REPLICATION FACTOR (RF)- CONT.
▸ NetworkTopologyStrategy: requires setting the RF for each data center.
▸ NetworkTopologyStrategy: places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack.
ALTER KEYSPACE <keyspace> WITH REPLICATION
= {'class': 'NetworkTopologyStrategy',
'DC1': <num>, 'DC2': <num>}
AND durable_writes = true;
37. APACHE CASSANDRA
REPLICATION- CONT.
▸ It’s all about fault-tolerance (Availability).
▸ Enable the system to continue working even though there
are some node is not available.
▸ Fault-tolerance in the level of data center, rack.
▸ Do not let RF > {NUM. OF NODES IN A DC}!!!
▸ Always remember to increase the RF of system_auth
keyspace before you add a new node!!!
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeReplication.html
38. APACHE CASSANDRA
HINTED-HANDOFF
▸ The process that helps a dead node recover the writes it missed.
▸ The other nodes keep the data for the dead node for a certain period of time. When the node comes back online, they stream the stored writes to the revived node.
▸ Default hint window: 3 hours (max_hint_window_in_ms). We should deal with a dead node and bring it back within this period; beyond it, a repair is needed.
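The handoff logic can be sketched as a toy coordinator (not Cassandra code; the node name and mutation tuples are made-up examples):

```python
# Sketch of hinted handoff: while a replica is down, the coordinator
# stores hints for it (up to the hint window); when the replica comes
# back, the stored writes are replayed to it.
HINT_WINDOW_SECONDS = 3 * 3600  # max_hint_window_in_ms defaults to 3 hours

class Coordinator:
    def __init__(self):
        self.hints = {}  # node -> list of missed writes

    def write(self, node, mutation, node_down_for):
        if node_down_for == 0:
            return "delivered"
        if node_down_for > HINT_WINDOW_SECONDS:
            return "dropped"  # down too long: only repair can fix it now
        self.hints.setdefault(node, []).append(mutation)
        return "hinted"

    def replay(self, node):
        """Called when the node rejoins: hand it everything it missed."""
        return self.hints.pop(node, [])

c = Coordinator()
c.write("ntu-jp", ("k1", "v1"), node_down_for=600)  # node briefly down
print(c.replay("ntu-jp"))  # [('k1', 'v1')]
```

The "dropped" branch is the important one: past the hint window, hints stop accumulating, so a node that was down longer must be repaired, not just restarted.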
39. APACHE CASSANDRA
NODETOOL
▸ A monitoring/management tool for C*.
▸ To operate C*, you should be familiar with this guy.
▸ Refer to: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsNodetool.html
40. APACHE CASSANDRA
CQLSH
▸ CQL: Cassandra Query Language; looks like traditional SQL commands.
▸ A command shell for interacting with C*.
▸ It looks like this:
41. APACHE CASSANDRA
CQLSH- CONT.
▸ You may alter the settings of an existing keyspace or table using CQLSH, for example changing the RF of a keyspace.
▸ Of course, CQLSH can also be used to create/delete/modify/query keyspaces and tables.
▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=865346154186617362484552&h2=The-cqlsh-Command
42. APACHE CASSANDRA
THE SYSTEM STATUS CHECK
▸ This command returns the status of all existing nodes.
▸ Status interpretation:
▸ UN (Up/Normal): the node is working properly
▸ DN (Down/Normal): the node is offline
▸ UL (Up/Leaving): the node is leaving the cluster (node removal)
$ nodetool status
43. APACHE CASSANDRA
THE SYSTEM STATUS CHECK- CONT.
▸ This command also tells you each node's data portion per DC, its disk usage, and its UUID.
$ nodetool status
44. APACHE CASSANDRA
THE SYSTEM STATUS CHECK- CONT.
▸ This command shows the listening ports on the machine.
▸ It's a quick way to check whether C* is still online.
▸ Cassandra port usage:
$ netstat -lnt
7000 Gossiping port (unencrypted)
9042 CQLSH/client API communication port
7199 JMX monitoring port
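The same check can be done programmatically; a small hedged sketch (host and port below are examples, and a successful TCP connect only proves something is listening, not that C* is healthy):

```python
# Quick programmatic version of the netstat check: try to connect to a
# port and report whether anything is listening there.
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. is_listening("127.0.0.1", 9042) to check the CQL client port
```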
45. APACHE CASSANDRA
THE SYSTEM STATUS CHECK- CONT.
▸ This command shows the status of the C* process. It should always be in the status "active (running)".
▸ If you see the status "active (exited)", C* has already died due to some error. Check the logs for further information.
$ service cassandra status
48. APACHE CASSANDRA
REPAIR
▸ The process that maintains data consistency across the cluster.
▸ This is the operation that will make you burn the midnight oil……
$ nodetool repair [option]
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
49. APACHE CASSANDRA
REPAIR- CONT.
▸ Full Repair & Incremental Repair
▸ This should be done periodically!
▸ As recommended by the C* docs:
▸ Incremental repair every 1 - 3 days (within GC grace
period)
▸ Full repair every 1 - 3 weeks
Ref. http://cassandra.apache.org/doc/latest/operating/repair.html
50. APACHE CASSANDRA
REPAIR- CONT.
▸ How do you monitor repair progress?? Good question!
▸ The log files
▸ Useful commands:
$ nodetool netstats
# print the status of streaming
$ nodetool compactionstats
# print the status of compaction
$ nodetool tpstats
# show the thread pool statistics
51. APACHE CASSANDRA
SSTABLE CORRUPTION
▸ If a repair fails or the data sync is not performed well, this can happen……
▸ For example, when you see this after a repair is done:
▸ Prepare a cup of coffee; you might need it……. 😨
[2017-05-16 00:26:40,555] Repair session dbbf6510-39ef-11e7-8027-d710f406f829 for range
(-4631786651008530880,-4578496872070625882] failed with error [repair #dbbf6510-39ef-11e7-8027-
d710f406f829 on watchtower_keyspace/release_stages,
(-4631786651008530880,-4578496872070625882]] Validation failed in /xxx.xxx.xxx.xxx (progress: 0%)
52. APACHE CASSANDRA
SSTABLE CORRUPTION- CONT.
▸ All you need to do is run the following on the node with IP xxx.xxx.xxx.xxx:
▸ As with repair, use the same set of nodetool commands to check whether C* is still working.
▸ If everything goes well, try the repair again and hope nothing bad happens again.
$ nodetool scrub
Ref. https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair
53. APACHE CASSANDRA
RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION?
▸ Remember that, as of now, the Master node has only 100GB of disk space. The data grows approximately 1.xGB each month.
▸ Frequently check the following:
$ nodetool status
# check the data portion and disk usage
$ df -h
# check the real hard disk space usage
54. APACHE CASSANDRA
RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION?- CONT.
▸ If C* eats up too much space, you can trigger data cleanup by issuing a repair:
▸ Or you can try clearing the data snapshots:
$ nodetool repair [option]
# repair the data; this will trigger compaction
$ nodetool clearsnapshot
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAboutSnapshots.html
55. APACHE CASSANDRA
NEW NODE COMING IN, GREAT!
▸ Make sure the RF of system_auth is increased first.
▸ Perform network connectivity and performance checks.
▸ Refer to here: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=308409713240027648094943&h2=Add-a-New-Node
▸ Bootstrap of a new node might fail; check the log files frequently!
56. APACHE CASSANDRA
NEW NODE COMING IN, GREAT!- CONT.
▸ How can I tell that the bootstrap failed?
▸ The log files (of course!)
▸ nodetool status shows a highly imbalanced data distribution.
▸ It might be a network throughput issue; try to fix it and resume the bootstrap:
$ nodetool bootstrap resume
57. APACHE CASSANDRA
LESS POSSIBLE BUT COULD HAPPEN, NODE DELETION
▸ You might want to remove a node when some issue comes up.
▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=454006913486500030503564&h2=Delete/Remove-a-Node
▸ If everything goes fine, reduce the RF of system_auth so that it is not larger than the total number of nodes.
58. APACHE CASSANDRA
CASSANDRA OPERATIONS
▸ Too many things to discuss; it is hard to cover them all in this talk.
▸ Please check the doc frequently for further information:
▸ https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd
59. APACHE CASSANDRA
SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT
▸ I keep talking about the system_auth keyspace, so what is it anyway?
▸ system_auth: the keyspace that keeps Cassandra's account info.
▸ If the data in system_auth is inconsistent, authentication might fail on certain nodes. You will see authentication failures for a certain period of time.
▸ Data loss!!!
60. APACHE CASSANDRA
SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT- CONT.
▸ Increasing the RF of system_auth before adding a new node is just the theoretical approach……
▸ Current user accounts in Cassandra:
cassandra Default superuser, now treated as a backup superuser. Has the same password as the iisnrl account.
iisnrl The main superuser.
kairosdb The user for master KairosDB insertion. Non-superuser.
lassgroup The user for participating parties to archive data. Non-superuser.
superuser.