This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008
- Key features of Cassandra including linear scalability, continuous availability, support for multiple data centers, operational simplicity, and analytics capabilities
- Details on Cassandra's architecture including its cluster layer based on Amazon Dynamo and data store layer based on Google BigTable
- Explanations of Cassandra's data distribution, token ranges, replication, coordinator nodes, tunable consistency levels, and write path
- Descriptions of Cassandra's data model including last write win and examples of CRUD operations and table schemas
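The "last write wins" rule mentioned above can be illustrated with a minimal Python sketch (not Cassandra's actual implementation): each cell carries a write timestamp, and when replicas disagree, the value with the higher timestamp survives.

```python
def lww_merge(cell_a, cell_b):
    """Return the winning (timestamp, value) pair under last-write-wins.

    Tuples compare by timestamp first, then by value, which doubles as a
    deterministic tie-break so all replicas converge to the same cell.
    """
    return max(cell_a, cell_b)

# Two replicas saw different writes for the same column:
replica1 = (1700000000123, "alice@old.example")
replica2 = (1700000000456, "alice@new.example")

print(lww_merge(replica1, replica2))  # the later write wins
```

The same merge applied in any order on any replica yields the same winner, which is why the rule gives eventual convergence without coordination.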
Apache Cassandra, part 2 – data model example, machinery | Andrey Lomakin
The aim of this presentation is to provide enough information for an enterprise architect to decide whether Cassandra should be the project's data store. The presentation describes each nuance of Cassandra's architecture and ways to design data and work with it.
Apache Cassandra is an open source, distributed, decentralized, scalable, highly available, fault-tolerant, and tunably consistent database. It is based on Amazon's Dynamo and Google's Bigtable models. Cassandra provides linear scalability, high availability with no single points of failure, and tunable consistency. It uses a distributed data model across a cluster with configurable replication for fault tolerance.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2 | DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra, and learn the basics of the massively scalable NoSQL database. This webinar will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2, including virtual nodes, the CQL 3 language, and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
Cassandra data structures and algorithms | Duyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
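To make the bloom-filter math mentioned above concrete, here is a minimal Python bloom filter (a sketch, not Cassandra's implementation) together with the standard false-positive estimate (1 - e^(-kn/m))^k for n items inserted into m bits with k hash functions.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions from one digest (a simplified hashing scheme).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means only "possibly present".
        return all(self.bits & (1 << pos) for pos in self._positions(item))

def false_positive_rate(m, k, n):
    # Standard estimate for a bloom filter with m bits, k hashes, n inserts.
    return (1 - math.exp(-k * n / m)) ** k

bf = BloomFilter(m_bits=1024, k_hashes=3)
bf.add("row-42")
print(bf.might_contain("row-42"))        # True
print(false_positive_rate(1024, 3, 100))
```

The "no false negatives" property is what lets Cassandra's read path skip SSTables cheaply: a negative answer is definitive, so only possible matches are read from disk.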
Cassandra nice use cases and worst anti patterns no sql-matters barcelona | Duyhai Doan
This document summarizes a presentation on Cassandra use cases and anti-patterns. It discusses several anti-patterns to avoid such as queue-like designs, intensive updates on the same column, and designing around a dynamic schema. It also provides examples of good use cases such as rate limiting, anti-fraud detection, and account validation. The document contains an agenda, descriptions of each anti-pattern and their level of failure, as well as explanations and demonstrations of the example use cases.
This document provides an introduction to Cassandra including:
1) An overview of Cassandra's key architecture including its linear scalability, continuous availability across data centers, and operational simplicity.
2) A discussion of Cassandra's data model including its use of Last Write Wins for conflict resolution and examples of modeling one-to-many relationships using clustered tables.
3) Details on Cassandra's consistency levels and how they impact availability and durability of writes and reads.
- Cassandra is an open source, distributed database management system designed to handle large amounts of data across many commodity servers. It was originally developed at Facebook in 2008 and is now an Apache project.
- Cassandra provides high availability with no single point of failure, linear scalability and performance of tens of thousands of queries per second. It is used by many large companies including Netflix, Twitter and eBay.
- Data is organized into tables within keyspaces. Tables must have a primary key which determines how data is partitioned and indexed. Cassandra uses a decentralized architecture with no single point of failure and automatic data distribution across nodes.
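The consistency-level trade-off summarized above comes down to simple arithmetic: with replication factor N, a write acknowledged by W replicas and a read contacting R replicas are guaranteed to overlap on at least one up-to-date replica exactly when W + R > N. A small illustrative Python check:

```python
def read_sees_latest_write(n_replicas, write_acks, read_acks):
    """True if any R replicas must intersect any W replicas (W + R > N)."""
    return write_acks + read_acks > n_replicas

# RF=3 with QUORUM writes (2) and QUORUM reads (2): guaranteed overlap.
print(read_sees_latest_write(3, 2, 2))  # True
# RF=3 with ONE/ONE: a read may hit replicas the write never reached.
print(read_sees_latest_write(3, 1, 1))  # False
```

This is why QUORUM/QUORUM gives strong read-your-writes behavior while ONE/ONE trades that guarantee for lower latency and higher availability.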
This document summarizes a research paper on Google's globally distributed database called Spanner. Spanner provides strong consistency and transactions across globally distributed data. It addresses the need for a scalable database to replace Google's sharded MySQL deployment. Spanner uses TrueTime to synchronize clocks across datacenters with bounded uncertainty. It assigns timestamps to transactions using this synchronized time to ensure consistency. Spanner supports different types of transactions like read-write, read-only, and snapshot reads through its consistency and concurrency control mechanisms. Evaluation results show Spanner can provide low latency, high throughput and availability even during leader failures.
This document provides an introduction and overview of the Python programming language. It describes Python's origins, philosophy, features, and uses. Key points covered include Python's support for rapid development, object-oriented programming, embedding and extending with C, and its portability across platforms. Examples of Python code are provided to illustrate concepts like modules, functions, control flow, and data types.
Spanner is Google's globally distributed database that provides strong consistency across data centers while remaining highly available and scalable. It uses a combination of techniques including external consistency based on TrueTime, which bounds global clock uncertainty, and a multi-version concurrency control approach using timestamps assigned based on TrueTime. Spanner supports ACID transactions, snapshot reads to allow for high concurrency, and other features like atomic schema changes through the use of these techniques.
Spanner is Google's globally distributed database that provides ACID transactions across data centers. It uses TrueTime to assign timestamps for distributed transactions, ensuring consistency. A Spanner deployment consists of servers organized into zones across datacenters, with a universe master and placement driver coordinating. Transactions are executed across servers and committed using time-based consensus. Evaluation shows Spanner provides low-latency reads and commits at large scale for Google applications.
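The TrueTime idea summarized above can be sketched in Python: a clock call returns an interval [earliest, latest] bounding the true time, and a transaction "commit-waits" until its chosen timestamp is guaranteed to be in the past on every node. The names and the uncertainty bound below are illustrative, not Spanner's actual API.

```python
import time
from dataclasses import dataclass

EPSILON = 0.005  # assumed clock-uncertainty bound in seconds (illustrative)

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    # True time is guaranteed to lie within [earliest, latest].
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(txn_id):
    # Choose the commit timestamp at the upper bound of current uncertainty,
    # then commit-wait until that timestamp is definitely in the past.
    ts = tt_now().latest
    while tt_now().earliest < ts:   # the "commit wait"
        time.sleep(0.001)
    return ts

ts = commit("txn-1")
print(ts <= tt_now().earliest)  # True: the timestamp is now safely in the past
```

The wait costs roughly 2 * EPSILON per commit, which is why Spanner invests in GPS and atomic clocks to keep the uncertainty bound small.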
The document discusses using Elasticsearch's percolator to store queries in an index and then retrieve matching queries by indexing documents. It provides an example of storing queries for terms like "bad protocol" and then percolating a log message to return matching queries. The document also describes techniques for optimizing percolation performance like rewriting queries and messages into a shared sparse vector space for faster matching.
Beyond php - it's not (just) about the code | Wim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and the network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Quite often, "new" people are only new to Postgres. This is my summary of do's and don'ts when it comes to teaching Postgres and what to take note of, with an emphasis on teaching.
Parallel R in snow (English after 2nd slide) | Cdiscount
This presentation discusses parallelizing computations in R using the snow package. It demonstrates how to:
1. Create a cluster with multiple R sessions using makeCluster()
2. Split data across the sessions using clusterSplit() and export data to each node
3. Write functions to execute in parallel on each node using clusterEvalQ()
4. Collect the results, such as by summing outputs, to obtain the final parallelized computation. As an example, it shows how to parallelize the likelihood calculation for a probit regression model, reducing the computation time.
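The four snow steps above map closely onto Python's multiprocessing module; the sketch below is a loose analogy (not the R code): split the data into chunks, evaluate a function on each chunk in parallel worker processes, and sum the partial results, much like the probit-likelihood example.

```python
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # Work done independently on each "node" (worker process).
    return sum(x * x for x in chunk)

def split(data, n_chunks):
    # Analogous to clusterSplit(): divide the data across workers.
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1, 101))
    with Pool(processes=4) as pool:            # ~ makeCluster(4)
        partials = pool.map(partial_sum_of_squares, split(data, 4))
    print(sum(partials))  # 338350, same as the serial computation
```

As in the snow example, the speedup comes from the per-chunk work dominating the cost of shipping data to and results back from the workers.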
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open | PostgresOpen
This document discusses federating queries across PostgreSQL databases using foreign data wrappers (FDWs). It begins by introducing the author and their background. It then covers using FDWs to partition tables across multiple nodes for queries, the benefits over traditional views, and demonstrates counting rows across nodes. It notes limitations like network overhead, lack of keys/constraints, and single-threaded execution. Finally, it discusses strategies like using many small nodes, node-level partitioning, distributed processing, and multi-headed setups to optimize federated querying.
Salmon is a proposed protocol that defines a standard way for comments and annotations on one site to "swim upstream" and be posted to the original source, allowing for a virtuous cycle of commentary. It works by having content signed and posted to a target site's Salmon endpoint, with the signature then verified to authenticate the sender before the target site decides how to handle the received content. Specifications are provided for Salmon implementations using Atom and JSON formats.
In this presentation I talked about how Windows user account passwords can be cracked using methods described in Philippe Oechslin's paper "Making a Faster Cryptanalytic Time-Memory Trade-Off" and demonstrated the ideas by stealing hashes using fgdump or Ophcrack and using Rainbow Tables (Cain with RainbowCrack) to actually crack the passwords of students present at the talk. There is some interesting stuff about secure passwords along with bunch of other things.
- The document discusses strategies for analyzing large datasets that are too big to fit into memory, including using cloud computing, the ff and rsqlite packages in R, and sampling with the data.sample package.
- The ff and rsqlite packages allow working with data beyond RAM limits but require rewriting code, while data.sample provides sampling without rewriting code but introduces sampling error.
- Cloud computing avoids rewriting code and has no memory limits but requires setup, and sampling is good for analysis but not reporting exact values.
Monitoring Postgres at Scale | PGConf.ASIA 2018 | Lukas Fittl | Citus Data
Your PostgreSQL database is one of the most important pieces of your architecture. What should you really watch out for, send reports on and alert on? We’ll discuss how query performance statistics can be made accessible to application developers, critical entries one should monitor in the PostgreSQL log files, how to collect EXPLAIN plans at scale, how to watch over autovacuum and VACUUM operations, and how to flag issues based on schema statistics.
Spanner is Google's globally distributed database that provides SQL queries and ACID transactions across datacenters. It uses TrueTime to assign timestamps based on a global clock with bounded uncertainty. This allows lock-free read transactions to retrieve a consistent view of data. Write transactions use two-phase locking and commit in timestamp order respecting the global clock to ensure external consistency.
About Flexible Indexing
Postgres’ rich variety of data structures and data-type-specific indexes can be confusing for newer and experienced Postgres users alike, who may be unsure when and how to use them. For example, GIN indexing specializes in the rapid lookup of keys with many duplicates — an area where traditional B-tree indexes perform poorly. This is particularly useful for JSON and full-text searching. GiST allows for efficient indexing of two-dimensional values and range types.
To listen to the recorded presentation with Bruce Momjian, visit Enterprisedb.com > Resources > Webcasts > Ondemand Webcasts.
For product information and subscriptions, please email sales@enterprisedb.com.
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis | Citus Data
Postgres relies heavily on an extension ecosystem, but that is almost 100% dependent on C; which cuts out developers, libraries, and ideas from the world of Postgres. postgres-extension.rs changes that by supporting development of extensions in Rust. Rust is a memory-safe language that integrates nicely in any environment, has powerful libraries, a vibrant ecosystem, and a prolific developer community.
Rust is a unique language because it supports high-level features but all the magic happens at compile time, and the resulting code is not dependent on an intrusive or bulky runtime. That makes it ideal for integrating with Postgres, which has a lot of its own runtime, like memory contexts and signal handlers. postgres-extension.rs offers this integration, allowing the development of extensions in Rust, even ones deeply integrated into the Postgres internals, and helping handle tricky issues like error handling. This is done through a collection of Rust function declarations, macros, and utility functions that allow Rust code to call into Postgres and safely handle resulting errors.
This document summarizes an introduction to advanced MySQL query and schema tuning techniques presented by Alexander Rubin. It discusses how to identify and address slow queries through better indexing, temporary tables, and query optimization. Specific techniques covered include using indexes to optimize equality and range queries, ordering fields in composite indexes, and avoiding disk-based temporary tables for GROUP BY and other complex queries.
The State of (Full) Text Search in PostgreSQL 12 | Jimmy Angelakos
How to navigate the rich but confusing field of (Full) Text Search in PostgreSQL. A short introduction will explain the concepts involved, followed by a discussion of functions, operators, indexes and collation support in Postgres in relevance to searching for text. Examples of usage will be provided, along with some stats demonstrating the differences.
Extending Cassandra with Doradus OLAP for High Performance Analytics | randyguck
Slides from an O'Reilly Webinar given on July 29th, 2015. This presentation describes how the Doradus database framework and the OLAP storage service extend Cassandra to provide a unique database solution for certain big data applications. Doradus OLAP uses columnar storage, application-level sharding, compression, and other techniques to store data very densely, yielding fast loading and queries that can scan millions of objects per second.
Cassandra-Based Image Processing: Two Case Studies (Kerry Koitzsch, Kildane) ... | DataStax
In this presentation, we will detail two image processing applications that rely on a Cassandra-centric architecture to achieve distributed, high-accuracy analysis of a variety of image formats, types, and quality levels, and which require different kinds of metadata processing as well as feature extraction from the images themselves. We will outline the architecture choices made for the two case studies and how we found Cassandra to be the ideal choice for the persistence-layer implementation technology. In conclusion, we will discuss extensions to the two use cases and some of the 'lessons learned' from the two implementation projects.
About the Speaker
Kerry Koitzsch Project Lead, Kildane Software Technologies, Inc
Kerry Koitzsch is a software engineer and architect specializing in big data applications, NoSQL databases, and image processing. He currently works for Correlli Software Systems, a big data analytics company in Sunnyvale CA.
Apache Cassandra is one of the most renowned NoSQL databases. Although it's often associated with great scalability, improper usage might result in shooting yourself in the foot. In this talk I'll present a set of ideas and guidelines - both for developers and administrators - which will help you to make your project an epic failure.
Cassandra Day Chicago 2015: Advanced Data Modeling | DataStax Academy
The document discusses modeling data in Cassandra using the Chebotko method. It begins by explaining the conceptual, logical, and physical modeling stages of the Chebotko method. It then provides an example of modeling user data in a music database, showing the conceptual model, identifying access patterns, and designing the logical model with tables to satisfy each query. The logical model example shows how to design Cassandra tables for queries about performers, albums, tracks, users and their activities.
Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.
The document describes the KDM tool, which automates Cassandra data modeling tasks. It streamlines the data modeling methodology by guiding users and automating conceptual to logical mapping, physical optimizations, and CQL generation. The KDM tool simplifies the complex data modeling process, eliminates human errors, and helps users build, verify, and learn data modeling. Future work on the tool includes support for materialized views, user defined types, application workflow design, and additional diagram types.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
Cassandra introduction apache con 2014 budapestDuyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
The document discusses Cassandra architecture and operations. It provides an overview of key Cassandra concepts like data distribution across nodes, replication, consistency levels, and the write and read paths. It also covers topics like compaction strategies, best practices for configuration, and operational recommendations.
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns - ...NoSQLmatters
This document discusses Cassandra use cases and anti-patterns. It describes queue-like designs, intensive updates on the same column, and designing around a dynamic schema as anti-patterns that can lead to failures. Rate limiting, fraud prevention, and account validation are provided as examples of good use cases. Key-value modeling, clustering, compaction strategies, and time-to-live features are also overviewed.
There are a few options for performing more complex queries in Cassandra beyond the restrictions of the WHERE clause:
1. Denormalize/duplicate data across tables to allow querying on different columns. For example, have one table keyed on user ID and another keyed on message date to allow filtering by date.
2. Offload complex queries to an external search index like Solr or Elasticsearch that can handle full-text and complex queries, and keep Cassandra as the system of record.
3. Use Spark/Hive on Cassandra to run queries across the cluster and leverage their more powerful query engines.
4. Consider a different database if your queries require joins, complex where clauses, or otherwise don't map well to Cassandra's query-first data model.
The document provides an introduction to Cassandra presented by Duy Hai Doan. It discusses Cassandra's history, key features including linear scalability, availability, support for multiple data centers, operational simplicity, and analytics capabilities. It also covers Cassandra architecture including the cluster layer based on Dynamo and data-store layer based on BigTable, data distribution, replication, consistency levels, and the write path. The data model of using the last write to resolve conflicts is explained along with CQL basics and modeling one-to-many relationships with clustered tables.
This document summarizes the Cassandra Java driver and tools. It discusses the driver's architecture including connection pooling, request pipelining, load balancing policies, and automatic failover. It also covers using statements, asynchronous reads, the query builder, and the object mapper. Lastly, it discusses new automatic paging functionality in the driver.
This document summarizes Cassandra drivers and tools. It discusses the Java driver architecture including connection pooling, load balancing policies, and automatic paging. It also demonstrates Cassandra Unit for testing, the Java driver object mapper module, and Achilles object mapper with features like dirty checking. Live coding examples are provided for these tools.
Cassandra nice use cases and worst anti patternsDuyhai Doan
This document discusses Cassandra use cases and anti-patterns. Some good use cases include rate limiting, fraud prevention, account validation, and storing sensor time series data. Poor designs include using Cassandra like a queue, storing null values, intensive updates to the same column, and dynamically changing the schema. The document provides examples and explanations of how to properly implement these scenarios in Cassandra.
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
This document provides an overview of the NodeJS Cassandra driver. It begins with a brief introduction of Cassandra and then discusses the driver's architecture, streaming capabilities, and API. Key aspects covered include connection pooling, request pipelining, load balancing policies, automatic failover, and data type mappings. The presentation concludes with a code example and demonstration of the driver.
This document provides an overview of big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure and consensus. It also covers data sharding, replication, and the CAP theorem. Key points include how latency is impacted by network delays, different failure modes, and that the CAP theorem states that a distributed system can only guarantee two of consistency, availability, and partition tolerance at once.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
Big data 101 for beginners riga dev daysDuyhai Doan
This document provides an overview and introduction to big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure modes, and consensus protocols. It also covers data sharding and replication techniques. The document explains the CAP theorem and how it relates to consistency and availability. Finally, it discusses different distributed systems architectures like master/slave versus masterless designs.
Vienna Feb 2015: Cassandra: How it works and what it's good for!Christopher Batey
This document provides an overview of Apache Cassandra and how it works. It discusses Cassandra being a distributed, masterless database based on Amazon Dynamo and Google BigTable. Key aspects covered include replication, fault tolerance, tunable consistency levels, and data modeling. Various use cases for Cassandra are also presented such as for storing time series data, sensor data, and financial transactions.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop eco-system for both real-time and batch processing. During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with a live code demo.
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisDuyhai Doan
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
This document discusses using Spark with Apache Cassandra for various use cases including loading data from various sources, performing analytics, and sanitizing, validating, and transforming data. It provides examples of using Spark jobs to import data, clean data, perform schema migrations, and run analytics queries. It also covers aspects of the connector architecture like data locality, failure handling, and cross data center operations. The document concludes with discussing a benchmark that used Spark and Cassandra to perform parallel data ingestion and top-K queries on 3.2 billion rows of data with SASI indices.
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
This is the first webinar of a Back to Basics series that will introduce you to the MongoDB database, what it is, why you would use it, and what you would use it for.
Similar to Introduction to Cassandra & Data model (20)
This document provides an overview of DataStax Enterprise, a database platform for cloud applications. It discusses key features of DataStax Enterprise including that it is certified for production, offers automatic management services for configuration and administration through OpsCenter, and provides 24/7 expert support. The document also summarizes various DataStax Enterprise technologies and capabilities like advanced replication, tiered storage, security features, and integration with search, analytics, and graph databases.
Datastax day 2016 introduction to apache cassandraDuyhai Doan
This document provides an overview of Apache Cassandra and discusses its key features. It describes how Cassandra distributes and replicates data across multiple nodes for continuous availability and linear scalability. It also covers Cassandra's consistency model and how consistency levels can be tuned to balance availability and durability. The document lists Cassandra's features like collections, user-defined types, materialized views, and JSON support for flexible data modeling.
Datastax day 2016 : Cassandra data modeling basicsDuyhai Doan
This document discusses data modeling with Apache Cassandra. It covers:
1. The objectives of data modeling like reducing query latency and avoiding disasters
2. Choosing the right partition key which is the main entry point for queries and helps distribute data
3. Using clustering columns to simulate one-to-many relationships and enable sorting and range queries
4. Other critical details like avoiding huge partitions, sub-partitioning techniques, and how deletes create tombstones
This document discusses Apache Cassandra and its features and use cases. It provides an overview of Cassandra's key characteristics like massive scalability, extreme availability, and rich data modeling. Example use cases mentioned include messaging, collections/playlists, fraud detection, recommendations, and IoT sensor data. New features introduced in Cassandra in 2016 are also summarized, such as delete by range, materialized views, atomic UDT updates, a new SASI index, and support for GROUP BY queries.
Spark zeppelin-cassandra at synchrotronDuyhai Doan
This document discusses using Spark, Cassandra, and Zeppelin for storing and aggregating metrics data from a particle accelerator project called HDB++. It provides an overview of the HDB++ project, how it previously used MySQL but now stores data in Cassandra. It describes the Spark jobs that are run to load metrics data from Cassandra and generate statistics that are written back to Cassandra. It also demonstrates visualizing the data using Zeppelin and discusses some tricks and traps to be aware of when using this stack.
Sasi, cassandra on full text search rideDuyhai Doan
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
Cassandra 3 new features @ Geecon Krakow 2016Duyhai Doan
Duyhai Doan gave a presentation on new features in Cassandra 3.0, including materialized views, user defined functions, user defined aggregates, and the new SASI full text search index. Materialized views allow pre-computing common queries to improve performance. User defined functions and aggregates enable pushing computation to the server. The new SASI index provides improved full text search capabilities in Cassandra.
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Apache zeppelin the missing component for the big data ecosystemDuyhai Doan
Duy Hai Doan presented Apache Zeppelin, an open-source web-based notebook that allows users to interact with data. Zeppelin provides a front-end GUI and display system for data analysis tools and uses interpreters to connect to back-end systems like Spark, Cassandra, and Flink. Doan demonstrated Zeppelin's notebook interface, display options, and how users can write their own interpreters to connect new systems to Zeppelin. Future plans for Zeppelin include improving usability, adding authentication and authorization, and developing more interpreters and visualizations.
This document discusses user-defined functions and materialized views in Cassandra. It provides information on how to create user-defined functions and user-defined aggregates, including the syntax and best practices. It also covers how user-defined functions and aggregates are executed. The document then discusses materialized views, including why they are useful and how they work at a high level. It provides the syntax for creating materialized views and describes how updates are handled.
This document discusses Cassandra and the Datastax Academy. It provides examples of companies using Cassandra as infrastructure including ING, Netflix, Sony, and Microsoft. It also discusses the increasing SQL support in Cassandra, such as user defined functions, materialized views, and secondary indexes. The document notes that skills in Cassandra are in high demand but difficult to find. It promotes the Datastax Academy as a free solution to this problem, offering self-paced courses, instructor-led training, and O'Reilly certification to boost careers.
Apache zeppelin, the missing component for the big data ecosystemDuyhai Doan
Apache Zeppelin is a web-based notebook that allows users to interact with data via interpreters like Spark, SQL, and Cassandra. It provides a GUI for data scientists to write code and visualizations in notebooks. Zeppelin has a modular architecture that allows new interpreters to be easily added. It also includes features like scheduling, sharing, and exporting of notebooks.
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
Fast track to getting started with DSE Max @ INGDuyhai Doan
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
Distributed algorithms for big data @ GeeConDuyhai Doan
This document discusses distributed algorithms for big data. It begins with an overview of HyperLogLog for estimating cardinality and counting distinct elements in a large data set. It then explains how HyperLogLog works by using a hash function to distribute the data across buckets and applying the LogLog algorithm to each bucket before taking the harmonic mean. The document also covers Paxos for distributed consensus, explaining the phases of prepare, promise, accept and learn to reach agreement in the presence of failures.
2. Shameless self-promotion!
@doanduyhai
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• production troubleshooting
• Europe technical point of contact
☞ duy_hai.doan@datastax.com
3. Datastax!
• Founded in April 2010
• We drive Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Home to the Cassandra chair & most committers (≈80%)
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
4. Agenda!
Architecture
• Cluster, Replication, Consistency
Data model
• Last Write Win (LWW), CQL basics, From SQL to CQL
Dev Center Demo
DSE overview
CQL In Depth (time permitting)
5. Cassandra history!
NoSQL database
• created at Facebook
• open-sourced since 2008
• current version = 2.1
• column-oriented ☞ distributed table
12. Cassandra architecture!
Cluster layer
• Amazon Dynamo paper
• masterless architecture
Data-store layer
• Google BigTable paper
• columns / column families
13. Cassandra architecture!
Every node stacks the same layers, and a client request can enter through any node's API:
API (CQL & RPC) → CLUSTER (DYNAMO) → DATA STORE (BIG TABLE) → DISKS
(diagram: two identical nodes, Node1 and Node2, each receiving client requests)
14. Data distribution!
Random: hash of #partition → token = hash(#p)
Hash range: ]-X, X], with X a huge number (2^64 / 2)
(diagram: ring of 8 nodes, n1 … n8)
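The token assignment above can be sketched in a few lines of Python. This is an illustrative model only: the 8-node ring and the md5-based hash are stand-ins for Cassandra's actual Murmur3 partitioner, chosen so the example is self-contained.

```python
import bisect
import hashlib

# Token space ]-X, X] with X = 2**63 (i.e. 2**64 / 2), as on the slide.
RING = 2**64
NODES = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]

def token(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token.
    (md5 is a stand-in for Cassandra's Murmur3 partitioner.)"""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")
    return h - 2**63

# Upper bound of each node's token range: 8 equal slices of the ring.
BOUNDS = [(i + 1) * RING // 8 - 2**63 for i in range(8)]

def owner(partition_key: str) -> str:
    """The node whose range ]bound[i-1], bound[i]] contains the token."""
    return NODES[bisect.bisect_left(BOUNDS, token(partition_key))]
```

Because the hash spreads keys uniformly over the token space, each node ends up owning roughly the same amount of data, which is what gives Cassandra its linear scalability.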
15. Token Ranges!
The token space is split into 8 contiguous ranges, one per node (n1 … n8):
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
(diagram: ring of nodes n1 … n8, each labelled with its range A … H)
17. Failure tolerance!
Replication Factor (RF) = 3
Each node stores its own range plus replicas of the two preceding ranges, e.g. three consecutive nodes hold {B, A, H}, {C, B, A} and {D, C, B}.
(diagram: ring of 8 nodes, the 3 replicas of a write labelled 1, 2, 3)
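The replica placement above, seen from a range's point of view, can be sketched as: the replicas of a token range are its primary owner plus the next RF-1 nodes walking the ring clockwise (a minimal model, assuming the 8-node ring from the previous slides).

```python
NODES = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]

def replicas(primary: int, rf: int = 3) -> list[str]:
    """Replica set for a token range: the primary owner plus the
    next (rf - 1) nodes walking the ring clockwise."""
    return [NODES[(primary + i) % len(NODES)] for i in range(rf)]
```

With RF = 3, losing any single node still leaves two live replicas for every range it stored.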
18. Coordinator node!
Incoming requests (read/write)
Coordinator node handles the request
Every node can be coordinator → masterless
(diagram: a client request hits the coordinator, which forwards it to replicas 1, 2, 3)
19. Consistency!
Tunable at runtime:
• ONE
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both reads & writes
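The three levels boil down to how many replica acknowledgements the coordinator waits for; QUORUM is the strict majority floor(RF / 2) + 1. A minimal sketch (function names are illustrative, not Cassandra API):

```python
def quorum(rf: int) -> int:
    """Strict majority with respect to RF: floor(RF / 2) + 1."""
    return rf // 2 + 1

def acks_needed(level: str, rf: int) -> int:
    """Replica acknowledgements the coordinator waits for at each level."""
    return {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]
```

For RF = 3 this gives 1, 2 and 3 acks respectively, which is why QUORUM reads + QUORUM writes always overlap on at least one replica.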
20-22. Write consistency!
Write ONE
• write request sent to all replicas in parallel (//)
• wait for ONE ack before returning to the client
• other acks arrive later, asynchronously
(diagram: coordinator fans out to replicas 1, 2, 3; acks arrive after 5 μs, 10 μs and 120 μs)
23. Write consistency!
Write QUORUM
• write request sent to all replicas in parallel (//)
• wait for QUORUM acks before returning to the client
• other acks arrive later, asynchronously
(diagram: coordinator fans out to replicas 1, 2, 3; acks arrive after 5 μs, 10 μs and 120 μs)
24-25. Read consistency!
Read ONE
• read from one node among all replicas
• contact the fastest node (based on latency stats)
(diagram: coordinator sends the read to a single replica)
26. Read consistency!
Read QUORUM
• read from one fastest node
@doanduyhai
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
27. Read consistency!
Read QUORUM
• read from one fastest node
• AND request digest from other
replicas to reach QUORUM
@doanduyhai
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
28. Read consistency!
Read QUORUM
• read from the fastest node
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
29. Read consistency!
Read QUORUM
• read from the fastest node
• AND request a digest from the other replicas to reach QUORUM
• return the most up-to-date data to the client
• repair stale replicas if a digest mismatch is detected
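The QUORUM read path above — full data from the fastest replica, digests from the others, read repair on mismatch — can be sketched as a toy model. The `digest` and `quorum_read` helpers and the `(value, timestamp)` cell layout are invented for the example (and for simplicity it contacts all replicas, where a real QUORUM read contacts only a quorum):

```python
import hashlib

def digest(entry):
    # A digest is a cheap hash of the cell, not the cell itself.
    value, ts = entry
    return hashlib.md5(f"{value}:{ts}".encode()).hexdigest()

def quorum_read(replicas, key):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    full = replicas[0][key]                         # full read (fastest node)
    digests = [digest(r[key]) for r in replicas[1:]]
    if any(d != digest(full) for d in digests):
        # Mismatch: fetch all versions, keep the last write, repair the rest.
        newest = max((r[key] for r in replicas), key=lambda e: e[1])
        for r in replicas:
            r[key] = newest                         # read repair
        return newest[0]
    return full[0]

r1 = {"jdoe": ("33", 1)}        # stale replica
r2 = {"jdoe": ("34", 2)}        # up to date
r3 = {"jdoe": ("34", 2)}
print(quorum_read([r1, r2, r3], "jdoe"))  # prints 34, and r1 is repaired
```

The digest trick keeps the extra replica reads cheap: full cell data travels only once unless a mismatch is found.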
39. Consistency summary!
ONE read + ONE write
☞ still available for reads/writes even with (RF − 1) replicas down
QUORUM read + QUORUM write
☞ still available for reads/writes with 1 replica down (RF = 3), and the read & write quorums overlap, so reads see the latest write
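This summary can be sanity-checked with a few lines of arithmetic (a sketch, not driver code; `available` and `strongly_consistent` are made-up helper names). The key fact is that QUORUM = ⌊RF/2⌋ + 1, and R + W > RF means the read and write replica sets must overlap:

```python
# Replicas required per consistency level, for a given RF.
def needed(level, rf):
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def available(level, rf, replicas_up):
    # The operation succeeds while enough replicas are reachable.
    return replicas_up >= needed(level, rf)

def strongly_consistent(read_level, write_level, rf):
    # Overlapping quorums (R + W > RF) guarantee read-your-writes.
    return needed(read_level, rf) + needed(write_level, rf) > rf

RF = 3
# ONE + ONE: available with RF - 1 = 2 replicas down...
print(available("ONE", RF, replicas_up=1))            # True
# ...but the quorums don't overlap: no consistency guarantee
print(strongly_consistent("ONE", "ONE", RF))          # False
# QUORUM + QUORUM: available with 1 replica down, and consistent
print(available("QUORUM", RF, replicas_up=2))         # True
print(strongly_consistent("QUORUM", "QUORUM", RF))    # True
```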
47. Last Write Win (LWW)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

#partition jdoe → { age: 33, name: 'John DOE' }
48. Last Write Win (LWW)!
jdoe → { age: 33 (t1), name: 'John DOE' (t1) }

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
every column value carries an auto-generated timestamp (μs)
49. Last Write Win (LWW)!
UPDATE users SET age = 34 WHERE login = 'jdoe';

SSTable1: jdoe → { age: 33 (t1), name: 'John DOE' (t1) }
SSTable2: jdoe → { age: 34 (t2) }
50. Last Write Win (LWW)!
DELETE age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { age: 33 (t1), name: 'John DOE' (t1) }
SSTable2: jdoe → { age: 34 (t2) }
SSTable3: jdoe → { age: ✕ tombstone (t3) }
51. Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
? ? ? (one candidate value per SSTable)

SSTable1: jdoe → { age: 33 (t1), name: 'John DOE' (t1) }
SSTable2: jdoe → { age: 34 (t2) }
SSTable3: jdoe → { age: ✕ tombstone (t3) }
52. Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
✕ ✕ ✓ (the cell with the highest timestamp, t3, wins)

SSTable1: jdoe → { age: 33 (t1), name: 'John DOE' (t1) }
SSTable2: jdoe → { age: 34 (t2) }
SSTable3: jdoe → { age: ✕ tombstone (t3) }
53. Compaction!
SSTable1: jdoe → { age: 33 (t1), name: 'John DOE' (t1) }
SSTable2: jdoe → { age: 34 (t2) }
SSTable3: jdoe → { age: ✕ tombstone (t3) }

New SSTable: jdoe → { age: ✕ tombstone (t3), name: 'John DOE' (t1) }
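The compaction step above boils down to a last-write-wins merge per column. A minimal sketch, assuming cells are `(value, timestamp)` pairs and a tombstone is just a special sentinel value (`TOMBSTONE` and `compact` are names invented for the example):

```python
# Sentinel marking a deleted cell; compaction keeps it like any other
# cell so that older values on other replicas stay shadowed.
TOMBSTONE = object()

def compact(sstables):
    """Merge SSTables: for each column keep the highest-timestamp cell.

    sstables: list of {partition: {column: (value, timestamp)}}.
    """
    merged = {}
    for table in sstables:
        for partition, columns in table.items():
            row = merged.setdefault(partition, {})
            for col, (value, ts) in columns.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (value, ts)       # last write wins
    return merged

sstable1 = {"jdoe": {"age": (33, 1), "name": ("John DOE", 1)}}
sstable2 = {"jdoe": {"age": (34, 2)}}
sstable3 = {"jdoe": {"age": (TOMBSTONE, 3)}}

new_sstable = compact([sstable1, sstable2, sstable3])
# age resolves to the t3 tombstone, name keeps its t1 value
print(new_sstable["jdoe"]["age"][0] is TOMBSTONE)   # True
print(new_sstable["jdoe"]["name"])                  # ('John DOE', 1)
```

The same merge logic answers reads across SSTables, which is why the SELECT on the previous slide picks the t3 tombstone.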
54. CRUD operations!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';
57. Queries!
Get messages by user and message_id (date)
SELECT * FROM mailbox WHERE login = 'jdoe'
AND message_id = '2014-09-25 16:00:00';

Get messages by user and date interval
SELECT * FROM mailbox WHERE login = 'jdoe'
AND message_id <= '2014-09-25 16:00:00'
AND message_id >= '2014-09-20 16:00:00';
58. Queries!
Get messages by message_id only (#partition not provided)
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00';

Get messages by date interval (#partition not provided)
SELECT * FROM mailbox WHERE
message_id <= '2014-09-25 16:00:00'
AND message_id >= '2014-09-20 16:00:00';
59. Queries!
Get messages by user range (range query on #partition)
SELECT * FROM mailbox WHERE login >= 'hsue' AND login <= 'jdoe';

Get messages by user pattern (non-exact match on #partition)
SELECT * FROM mailbox WHERE login LIKE '%doe%';
60. WHERE clause restrictions!
All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition
Only exact match (=) on the #partition; range queries (<, ≤, >, ≥) are not allowed
• ☞ they would require a full cluster scan
On clustering columns, only range queries (<, ≤, >, ≥) and exact match are allowed
The WHERE clause is only possible on columns defined in the PRIMARY KEY
61. WHERE clause restrictions!
What if I want to perform an "arbitrary" WHERE clause?
• search form scenario, dynamic search fields
62. WHERE clause restrictions!
What if I want to perform an "arbitrary" WHERE clause?
• search form scenario, dynamic search fields
☞ Apache Solr (Lucene) integration (DSE)

SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND sex:male';
SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';
63. Collections & maps!
CREATE TABLE users (
login text,
name text,
age int,
friends set<text>,
hobbies list<text>,
languages map<int, text>,
…
PRIMARY KEY(login));
64. User Defined Type (UDT)!
Instead of
CREATE TABLE users (
login text,
…
street_number int,
street_name text,
postcode int,
country text,
…
PRIMARY KEY(login));
65. User Defined Type (UDT)!
CREATE TYPE address (
street_number int,
street_name text,
postcode int,
country text);
CREATE TABLE users (
login text,
…
location frozen <address>,
…
PRIMARY KEY(login));
69. From SQL to CQL!
Remember…
CQL is not SQL
70. From SQL to CQL!
Remember…
there is no join
(do you want to scale?)
71. From SQL to CQL!
Remember…
there is no integrity constraint
(do you want to read-before-write?)
72. From SQL to CQL!
Paradigm change
• space is cheap (somehow …), latency is precious
• embrace immutability
• think query first
• denormalize !!!
73. From SQL to CQL!
Normalized
[Diagram: User 1 — n Comment]
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author_id text, // typical join id
content text,
PRIMARY KEY((article_id), comment_id));
74. From SQL to CQL!
De-normalized
[Diagram: User 1 — n Comment]
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author person, // person is UDT
content text,
PRIMARY KEY((article_id), comment_id));
75. Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
76. Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
Denormalize
• wisely, only duplicate necessary & immutable data
• functional/technical trade-off
77. Data modeling best practices!
Person UDT
- firstname/lastname
- date of birth
- gender
- mood
- location
78. Data modeling best practices!
[Mock profile card: John DOE, male — birthdate: 21/02/1981 — subscribed since 03/06/2011 — ☉ San Mateo, CA — "Impossible is not John DOE"; full detail read from the User table on click]
83. Training Day | December 3rd
Beginner Track
• Introduction to Cassandra
• Introduction to Spark, Shark, Scala and
Cassandra
Advanced Track
• Data Modeling
• Performance Tuning
Conference Day | December 4th
Cassandra Summit Europe 2014 will be the single
largest gathering of Cassandra users in Europe.
Learn how the world's most successful companies are
transforming their businesses and growing faster than
ever using Apache Cassandra.
http://bit.ly/cassandrasummit2014
84. CQL In Depth!
Simple Table!
Clustered Table!
Bucketing!
99. Query With Clustered Table!
Select by operator and city for all dates
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND city = 'Austin';

Select by operator and city range for all dates
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND city >= 'Austin' AND city <= 'New York';
100. Query With Clustered Table!
Select by operator and city and date
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND city = 'Austin' AND date = 20140910;

Select by operator and city and range of dates
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND city = 'Austin'
AND date >= 20140910 AND date <= 20140913;
101. Query With Clustered Table!
Select by operator and city and date tuples
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND city = 'Austin'
AND date IN (20140910, 20140913);
102. Query With Clustered Table!
Select by operator and date without city (the city clustering column is skipped)
SELECT * FROM daily_3g_quality_per_city
WHERE operator = 'verizon' AND date = 20140910;

Internal layout:
Map<operator, SortedMap<city, SortedMap<date, SortedMap<column_label, value>>>>
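The nested-map layout explains the WHERE clause rules: a lookup must walk the nesting from the partition key down through the clustering columns, in order. A sketch with plain Python dicts (not real sorted maps, and the `data` values are invented, but the lookup path is the same):

```python
# Nested layout: operator (partition key) -> city -> date -> cells.
data = {
    "verizon": {
        "Austin": {
            20140910: {"quality": 0.8},
            20140913: {"quality": 0.9},
        },
        "New York": {
            20140910: {"quality": 0.7},
        },
    },
}

# operator + city + date: a direct path through the nesting
print(data["verizon"]["Austin"][20140910])      # {'quality': 0.8}

# operator + date without city: there is no direct path -- every city
# subtree would have to be scanned, which is why CQL rejects the query
hits = [cells for city in data["verizon"].values()
        for date, cells in city.items() if date == 20140910]
print(hits)   # [{'quality': 0.8}, {'quality': 0.7}]
```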
108. Bucketing!
But how can I select raw data between 14:45 and 15:10?
14:45 → ? (bucket sensor_id:2014091014)
15:00 → 15:10 (bucket sensor_id:2014091015)

Partition sensor_id:2014091014 → [date1: blob1, date2: blob2, date3: blob3, date4: blob4, …]
Partition sensor_id:2014091015 → [date11: blob11, date12: blob12, date13: blob13, date14: blob14, …]
109. Bucketing!
Solution
• use an IN clause on the partition key component
• with a range condition on the date column
☞ the date column should be monotonic (increasing/decreasing)

SELECT * FROM sensor_data WHERE sensor_id = xxx
AND date_bucket IN (2014091014, 2014091015)
AND date >= '2014-09-10 14:45:00.000'
AND date <= '2014-09-10 15:10:00.000';
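Client-side, the bucket list for the IN clause can be derived from the query interval. A sketch (the `hour_buckets` helper is hypothetical, matching the yyyyMMddHH bucket format used above):

```python
from datetime import datetime, timedelta

def hour_buckets(start, end):
    """Return the yyyyMMddHH partition-key buckets covering [start, end]."""
    buckets = []
    t = start.replace(minute=0, second=0, microsecond=0)  # snap to the hour
    while t <= end:
        buckets.append(int(t.strftime("%Y%m%d%H")))
        t += timedelta(hours=1)
    return buckets

start = datetime(2014, 9, 10, 14, 45)
end = datetime(2014, 9, 10, 15, 10)
print(hour_buckets(start, end))   # [2014091014, 2014091015]
```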
110. Bucketing Caveats!
@doanduyhai
110
The IN clause on the #partition is not a silver bullet!
• use it sparingly
• keep the cardinality low (≤ 5)
[Ring diagram: a single coordinator must fan out to the replicas of both partitions sensor_id:2014091014 and sensor_id:2014091015]
111. Bucketing Caveats!
@doanduyhai
111
The IN clause on the #partition is not a silver bullet!
• use it sparingly
• keep the cardinality low (≤ 5)
• prefer parallel async queries
• trade-off: ease of query vs performance
[Diagram: an async client issues one query per partition in parallel, each hitting its own coordinator]
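The "parallel async queries" alternative can be sketched with a thread pool: one single-partition query per bucket, results merged client-side. `fetch_partition` is a stand-in for a real driver call, not an actual Cassandra API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(bucket):
    # Placeholder for a driver call such as:
    #   SELECT * FROM sensor_data WHERE sensor_id = xxx
    #   AND date_bucket = <bucket> AND date >= ... AND date <= ...
    return [f"row-from-{bucket}"]

def fetch_all(buckets):
    """Issue one query per bucket concurrently and merge the rows in order."""
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        results = pool.map(fetch_partition, buckets)
    return [row for rows in results for row in rows]

print(fetch_all([2014091014, 2014091015]))
# ['row-from-2014091014', 'row-from-2014091015']
```

Each single-partition query is routed to its own coordinator and replica set, instead of funneling every partition through one coordinator as the IN clause does.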