The document provides an overview of Apache Cassandra, including its key components, data replication, scalability, read/write operations, and tunable data consistency. It discusses how Cassandra is a distributed, decentralized database that provides high availability and horizontal scalability. The key components that enable these features are nodes, partitioners, snitches, gossip protocols, and the replication of data across multiple nodes.
TechEvent Apache Cassandra
1. Apache Cassandra
Under The Hood
Robert Bialek
2. Who Am I
Senior Principal Consultant and Trainer at Trivadis GmbH in Munich.
– Master of Science in Computer Engineering.
– At Trivadis since 2004.
– Trivadis Partner since 2012.
Focus:
– Data and service high availability, disaster recovery.
– Architecture design, optimization, automation.
– Troubleshooting.
– Trainer: O-RAC, O-DG.
3. Agenda
1. Introduction
2. Key Components
3. Data Replication
4. Scalability
5. Read/Write Operations
6. Data Consistency
7. Summary
5. What is Apache Cassandra?
Distributed NoSQL (wide column) partitioned row store database, which runs within a JVM.
Decentralized, highly fault-tolerant database with no single point of failure.
Horizontally scalable system (computing resources/performance).
Initially developed at Facebook, released as an open source project in July 2008.
– Based on Amazon's Dynamo and Google's Bigtable.
6. Apache Cassandra & CAP Theorem?
According to the CAP (Brewer's) theorem, "it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees":
– Consistency
– Availability
– Partition tolerance
Apache Cassandra is an AP system.
– Data results are eventually consistent (though consistency is tunable).
– Does not adhere to all ACID properties.
7. Cassandra for Enterprise Applications
Support 24x7x365.
Enterprise features, e.g.: DSE Advanced Security, DSE Analytics, DSE Search, DSE
Graph, DSE Advanced Replication, DSE Tiered Storage, DSE NodeSync, ...
Administration and monitoring with DSE OpsCenter (real-time monitoring, tuning,
provisioning, backup, security management).
According to DataStax, 2x or more throughput compared to Apache Cassandra.
Documentation, client drivers and DSE for development are free to use.
8. Who is Using Cassandra Database?
Source http://cassandra.apache.org
– Apple: over 75,000 nodes storing over 10 PB of data.
– Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day.
– Chinese search engine Easou: 270 nodes, 300 TB, over 800 million requests per
day.
– eBay: over 100 nodes, 250 TB.
Source https://www.datastax.com/customers
– Microsoft, UBS, Sony, Sky, ING, NEC, Coursera, CISCO, Walmart, NVIDIA,
Samsung, …
10. Node – Basic Database Infrastructure
Commodity hardware, ideally with local storage (to reduce dependencies).
Hosts software and configuration files:
– cassandra.yaml, cassandra-rackdc.properties, …
Hosts data and accompanying structures:
– SSTable component files on a Cassandra node (DSE: transactional node): Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt.
11. Keyspaces & Tables
Table (Column Family)
– Stores data based on a primary key.
• Primary key: partitioning key plus, optionally, clustering columns.
– Physically split into partitions.
– Denormalization (data duplication) is necessary.
Keyspace
– Grouping of data, similar to a schema.
– Defines replication properties (see the CQL sketch below).
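A minimal CQL sketch of these concepts, using a hypothetical sensor_data keyspace and readings table (the names, columns and replication factor are illustrative, not from the slides):

  -- Keyspace: carries the replication properties
  CREATE KEYSPACE IF NOT EXISTS sensor_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  -- Table: the partitioning key selects the partition, the clustering column orders rows within it
  CREATE TABLE IF NOT EXISTS sensor_data.readings (
    sensor_id    text,        -- partitioning key
    reading_time timestamp,   -- clustering column
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);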
12. Partitioner – Data Distribution
Determines which node receives data, based on the token computed from the partitioning key.
Supplied partitioners (a custom partitioner can also be implemented):
– Murmur3Partitioner (default)
– RandomPartitioner
– ByteOrderedPartitioner
Example from the slide: the partition key 'Cassandra' hashes to the token 356242581507269238.
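The token that the partitioner assigns to a partition key can be inspected with the CQL token() function; a small sketch against the hypothetical readings table introduced above (the key 'sensor-42' is made up):

  -- Returns the token computed for the partitioning key of each matching row
  SELECT sensor_id, token(sensor_id)
  FROM sensor_data.readings
  WHERE sensor_id = 'sensor-42';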
13. Cassandra Ring – Single Token Architecture
Each node is assigned exactly one token (initial_token); together the nodes cover the whole token range of the partitioner.
Example from the slide (token range 1 – 40, four nodes):
– initial_token: 1 owns the range 31 – 40, 1
– initial_token: 10 owns the range 2 – 10
– initial_token: 20 owns the range 11 – 20
– initial_token: 30 owns the range 21 – 30
The partitioner maps each data item's token into exactly one of these ranges.
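Which tokens a node actually owns can be checked with a query against the system tables; a sketch (works for single-token as well as vnode setups):

  -- Tokens of the node cqlsh is connected to
  SELECT tokens FROM system.local;

  -- Tokens of the other nodes in the ring
  SELECT peer, tokens FROM system.peers;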
15. Snitches – Ring Topology
Determines the physical location (data center and rack) of a Cassandra node.
Dynamic snitching (enabled by default):
– Monitors the read performance and ring health.
Supplied snitches:
– SimpleSnitch/DseSimpleSnitch (default)
– GossipingPropertyFileSnitch
– PropertyFileSnitch
– Ec2Snitch/Ec2MultiRegionSnitch/GoogleCloudSnitch/CloudstackSnitch
– RackInferringSnitch
Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2.
16. Gossip – Internode Communication
Peer-to-peer communication protocol to exchange
ring state information.
Gossip process runs every second and exchanges
messages with up to three other nodes in the ring.
Eventually, all nodes learn (indirectly) about all
other nodes.
18. Cassandra Ring – Scale Out
Increases computing power and throughput of a Cassandra ring.
Online and transparent to the applications.
Figure: a new node with installed software and configuration files contacts a SEED node for ring information, generates its tokens, bootstraps (data streaming from the existing nodes) and finally finishes joining the ring.
19. Cassandra Ring – Scale In
Decreases computing power of a Cassandra ring.
Online and transparent to the applications.
Figure: on DECOMMISSION the node streams its data to the remaining nodes, its tokens are removed, and the node ends up DECOMMISSIONED.
21. Replication – Data High Availability
To ensure data and service high availability, Cassandra stores data on multiple nodes in a cluster.
All replicas are equally important (no primary or secondary data).
Replication strategy and replication factor (RF) are defined on a keyspace (application) level.
– RF can be set differently in different data centers (see the CQL sketch below).
Two replication strategies are available:
– SimpleStrategy
– NetworkTopologyStrategy
Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2.
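A sketch of how this is declared in CQL; the keyspace name app_data and the data center names DC1/DC2 are assumptions and must match the names reported by the snitch:

  -- NetworkTopologyStrategy with a different RF per data center
  CREATE KEYSPACE IF NOT EXISTS app_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};

  -- The replication settings can be changed later without downtime
  ALTER KEYSPACE app_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

After increasing the RF, a repair is typically required so that existing data is streamed to the newly responsible replicas.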
22. Replication – SimpleStrategy (RF: 2)
Figure: Data Center 1 – SimpleStrategy places the two replicas on consecutive nodes clockwise in the ring, without considering racks or data centers.
23. Replication – NetworkTopologyStrategy (RF/DC: 2)
Figure: Data Center 1 and Data Center 2, each with Rack 1 and Rack 2 – NetworkTopologyStrategy places two replicas in each data center, preferring distinct racks.
25. Read Request Flow on a Cassandra Node
Figure: a read request flows through the memtable and row cache in memory, then the bloom filter, partition key cache, partition summary, partition index and compression offset map, and finally the SSTables on disk.
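The individual steps of this read path can be observed per query with request tracing in cqlsh; a sketch using the hypothetical table from earlier:

  TRACING ON;
  -- The trace output lists the coordinator and replica activity for this read,
  -- including memtable and SSTable lookups
  SELECT * FROM sensor_data.readings WHERE sensor_id = 'sensor-42' LIMIT 10;
  TRACING OFF;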
26. Write Request Flow on a Cassandra Node
Figure: a write is appended to the commit log on disk and applied to the memtable in memory; when the memtable is flushed, a new SSTable is written to disk (Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt), and SSTables are later merged by the compaction process.
27. Upserts on a Cassandra Node
Table t: partition key TAG, primary key (TAG, ID).
Existing SSTables for the partition TAG = 'CASSANDRA':
– ID=1, C1=2, C2='TEST1', timestamp 100
– ID=2, C1=3, C2='TEST2', timestamp 50
Statements executed:
INSERT INTO t (TAG, ID, C1, C2) VALUES ('CASSANDRA', 1, 5, 'TEST3');
UPDATE t SET C2='PROD1' WHERE TAG='CASSANDRA' AND ID=1;
DELETE FROM t WHERE TAG='CASSANDRA' AND ID=2;
Resulting memtable entries (on read, the cell with the newest timestamp wins):
– ID=1, C1=5, C2='TEST3', timestamp 150
– ID=1, C2='PROD1', timestamp 200 (only the updated column is written)
– ID=2, tombstone (marked_deleted), timestamp 250
Inserts, updates and deletes never modify existing SSTable data; they only append new cells or tombstones.
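A plausible CQL definition of the table t used in this example, inferred from the slide (partition key TAG, primary key TAG, ID); the column types are assumptions:

  CREATE TABLE t (
    tag text,   -- partitioning key
    id  int,    -- clustering column
    c1  int,
    c2  text,
    PRIMARY KEY ((tag), id)
  );

  -- writetime() exposes the cell timestamps the upsert example relies on
  SELECT id, c1, c2, writetime(c2) FROM t WHERE tag = 'CASSANDRA';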
28. Compaction Process on a Cassandra Node
Compaction merges several SSTables into a new SSTable, keeping for each column only the cell with the newest timestamp.
Input SSTable contents in the example:
– ID=1, C1=2, C2='TEST1', timestamp 100
– ID=2, C1=3, C2='TEST2', timestamp 50
– ID=1, C1=5, C2='TEST3', timestamp 150
– ID=1, C2='PROD1', timestamp 200
– ID=2, tombstone (marked_deleted), timestamp 250
– ID=3, C1=4, C2='TEST3', timestamp 120
New SSTable after compaction:
– ID=1, C1=5, C2='PROD1', timestamp 300
– ID=2, tombstone (marked_deleted), timestamp 250 – kept, because gc_grace_seconds has not been reached yet
– ID=3, C1=4, C2='TEST3', timestamp 120
Compaction strategies (see the CQL sketch below):
– SizeTieredCompactionStrategy (STCS)
– LeveledCompactionStrategy (LCS)
– TimeWindowCompactionStrategy (TWCS)
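Both the compaction strategy and gc_grace_seconds are per-table settings; a sketch in CQL against the hypothetical table t (864000 seconds, i.e. 10 days, is the default gc_grace_seconds):

  -- Switch the table to leveled compaction
  ALTER TABLE t
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
    AND gc_grace_seconds = 864000;  -- how long tombstones must be kept before they may be dropped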
30. Data Consistency – Overview
Cassandra offers tunable data consistency for read and write operations.
Two types of read requests:
– Direct read request.
– Digest read request.
Inconsistent data can be repaired automatically by:
– Background read repair requests.
– NodeSync continuous background repair (DSE 6 only).
Inconsistent data can be repaired manually by:
– Anti-entropy repair.
31. Tunable Consistency
A tradeoff between data consistency and availability (see the cqlsh sketch below).
WRITE consistency levels: ANY, ONE, TWO, THREE, LOCAL_ONE, LOCAL_QUORUM, QUORUM, EACH_QUORUM, ALL.
READ consistency levels: ONE, TWO, THREE, LOCAL_ONE, LOCAL_QUORUM, QUORUM, ALL, SERIAL, LOCAL_SERIAL.
– ANY and EACH_QUORUM are supported only for writes; SERIAL and LOCAL_SERIAL only for reads.
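In cqlsh the consistency level for subsequent requests can be set per session; a minimal sketch using the hypothetical table from earlier:

  -- Show the current consistency level (the cqlsh default is ONE)
  CONSISTENCY;

  -- Require a quorum of replicas in the local data center
  CONSISTENCY LOCAL_QUORUM;

  SELECT * FROM sensor_data.readings WHERE sensor_id = 'sensor-42';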
32. Read Requests & Tunable Consistency (1)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: the coordinator sends a direct read to one replica and a digest read to a second one; speculative_retry can contact an additional replica if a response is slow.
33. Read Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: direct read plus digest read; in addition, a background read repair may be sent to the remaining replica (read_repair_chance=0.10).
34. Read Requests & Tunable Consistency (3)
Two DCs, CONSISTENCY=QUORUM, RF=3.
Figure: the coordinator in DC=1 sends a direct read and digest reads, including digest reads to replicas in DC=2.
35. Read Requests & Tunable Consistency (4)
Two DCs, CONSISTENCY=LOCAL_QUORUM, RF=3.
Figure: the coordinator sends the direct read and the digest read only to replicas in its local data center (DC=1).
36. Write Requests & Tunable Consistency (1)
One DC, CONSISTENCY=ONE, RF=3.
Figure: the coordinator forwards the write to all three replicas and acknowledges it as soon as one replica has confirmed.
37. Write Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3.
Figure: a DELETE is acknowledged by a quorum while one replica is down; the missed write is kept as a hinted handoff, otherwise the deleted data could later reappear as a zombie.
38. Data Consistency – Anti-Entropy Repair
Manual data repair:
– A Merkle tree is built for each replica.
– Merkle trees are compared between all replicas.
Repair can be performed:
– Sequentially.
– In parallel.
– Datacenter-parallel.
Source: DSE 6.0 Architecture Guide
40. Summary
Cassandra is a very powerful distributed and decentralized NoSQL database with no single point of failure.
It guarantees service and data availability in the case of a partitioned network, though the data might be stale.
Designed for large data stores which require a performant and scalable system.
The application data model needs to be designed for Cassandra.
Many ways to interact with the database:
– CQLSH (Cassandra Query Language Shell).
– Drivers and tools provided by DataStax.
DataStax offers support for enterprise customers and good documentation.
41. Robert Bialek
Senior Principal Consultant
Tel. +49 89 99 27 59 38
robert.bialek@trivadis.com