The document provides an introduction to Cassandra presented by Duy Hai Doan. It discusses Cassandra's history and key features, including linear scalability, availability, support for multiple data centers, operational simplicity, and analytics capabilities. It also covers Cassandra's architecture, including the cluster layer based on Dynamo, the data-store layer based on BigTable, data distribution, replication, consistency levels, and the write path. The "last write wins" conflict-resolution model is explained, along with CQL basics and modeling one-to-many relationships with clustered tables.
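The "last write wins" rule mentioned above can be sketched in a few lines. This toy resolver (names like `Cell` and `resolve_lww` are illustrative, not driver API) picks, among conflicting replica copies of a column, the one with the highest write timestamp:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """A single column value with its write timestamp (microseconds)."""
    value: str
    timestamp: int

def resolve_lww(replica_cells):
    """Pick the winning cell among conflicting replica copies.

    Cassandra resolves concurrent writes to the same column by
    comparing write timestamps: the highest timestamp wins.
    """
    return max(replica_cells, key=lambda c: c.timestamp)

# Two replicas saw different writes to the same column:
winner = resolve_lww([Cell("alice@old.com", 100), Cell("alice@new.com", 250)])
```

The real implementation also breaks timestamp ties deterministically, but the timestamp comparison is the essential mechanism.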
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns - ... NoSQLmatters
This document discusses Cassandra use cases and anti-patterns. It describes queue-like designs, intensive updates on the same column, and designing around a dynamic schema as anti-patterns that can lead to failures. Rate limiting, fraud prevention, and account validation are provided as examples of good use cases. Key-value modeling, clustering, compaction strategies, and time-to-live features are also overviewed.
Speaker: Charlie Swanson
Learn how MongoDB answers your queries from a query system engineer. If you've ever had a performance problem with a query but didn't know how to find the cause, or if you've ever needed to confirm that your shiny new index is being put to work, the explain command is an excellent place to start. MongoDB's explain system is a powerful tool for solving this type of problem, but can be intimidating and unwieldy to use. In this talk, we will discuss how the explain command works and break down its output into digestible pieces.
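Explain output is indeed unwieldy; a small helper that condenses it makes the talk's point concrete. This sketch reads the standard `executionStats` counters from an explain result (the sample document here is hand-made for illustration, not real server output):

```python
def summarize_explain(explain_doc):
    """Condense an explain result into the numbers that matter most.

    `explain_doc` is expected to look like the output of
    db.collection.explain("executionStats"), which nests its counters
    under an "executionStats" sub-document.
    """
    stats = explain_doc["executionStats"]
    examined = stats["totalKeysExamined"] + stats["totalDocsExamined"]
    returned = stats["nReturned"]
    return {
        "returned": returned,
        "examined": examined,
        # A low returned/examined ratio hints at a missing or unused index.
        "efficiency": returned / examined if examined else 1.0,
    }

# Illustrative explain output for a query that examined far more than it returned:
sample = {"executionStats": {"totalKeysExamined": 1000,
                             "totalDocsExamined": 1000,
                             "nReturned": 10}}
report = summarize_explain(sample)
```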
Speaker: André Spiegel
Many applications require processes that load large amounts of data into MongoDB. It is easy to get these processes wrong, resulting in hours or days of loading time when it could be done in minutes. This talk identifies common mistakes and pitfalls and shows design patterns that can dramatically improve performance. The patterns introduced here can be used with any tool or programming language.
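One of the classic fixes the talk alludes to is inserting in large batches instead of one document per round trip. A minimal, language-agnostic version of the batching pattern (with pymongo this chunk would go to `collection.insert_many(chunk, ordered=False)`; here we only count the round trips):

```python
from itertools import islice

def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# 2500 documents in batches of 1000 means 3 round trips instead of 2500.
docs = ({"_id": i} for i in range(2500))
round_trips = sum(1 for _ in batches(docs, 1000))
```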
Powerful Analysis with the Aggregation Pipeline – MongoDB
Speaker: Asya Kamsky
Think you need to move your data "elsewhere" to do powerful analysis? Think again. The most efficient way to analyze your data is where it already lives. MongoDB Aggregation Pipeline has been getting more and more powerful and using new stages, expressions and tricks we can do extensive analysis of our data inside MongoDB Server.
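To make the pipeline idea tangible, here is a toy, in-memory stand-in for a `$match` → `$group`/`$sum` pipeline. The function name and signature are invented for illustration; only the stage semantics mirror MongoDB's:

```python
from collections import defaultdict

def run_pipeline(docs, match, group_key, sum_field):
    """A toy, in-memory version of a $match -> $group/$sum pipeline.

    This mimics the shape of
      [{"$match": match},
       {"$group": {"_id": "$" + group_key,
                   "total": {"$sum": "$" + sum_field}}}]
    purely in Python, so the behaviour is easy to see and test.
    """
    totals = defaultdict(float)
    for d in docs:
        if all(d.get(k) == v for k, v in match.items()):   # $match
            totals[d[group_key]] += d[sum_field]           # $group + $sum
    return dict(totals)

orders = [
    {"status": "paid", "city": "Lyon", "amount": 10.0},
    {"status": "paid", "city": "Lyon", "amount": 5.0},
    {"status": "open", "city": "Lyon", "amount": 99.0},
    {"status": "paid", "city": "Nice", "amount": 7.0},
]
result = run_pipeline(orders, {"status": "paid"}, "city", "amount")
```

The point of the talk, of course, is that MongoDB runs this where the data lives instead of shipping documents to the client.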
The document discusses several topics related to education including communities of practice, epistemic frames, reflection-in-action, ways of knowing and doing, and epistemic network analysis. It also includes examples of codes used in epistemic network analysis and diagrams showing connections between codes across utterances and stanzas in a chat discourse.
Building Data Driven Products at LinkedIn – Mitul Tiwari
This document discusses building data products at LinkedIn using Hadoop. It describes how LinkedIn builds recommendations products like "People You May Know" by processing member connection data with Hadoop tools. The workflow involves using Kafka to transfer data to HDFS, Pig and MapReduce to process the data, Azkaban to manage Hadoop jobs, and Voldemort to store results and serve recommendations to members. Triangle closing algorithms in Pig are used to find common connections between members and predict potential new connections. The results are pushed to production systems to power features like "People You May Know" recommendations.
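The triangle-closing idea scores pairs of unconnected members by how many connections they share. A minimal sketch of that scoring step (in Python rather than Pig, with an invented graph for illustration):

```python
def triangle_closing_scores(connections):
    """Score non-connected pairs by how many connections they share.

    `connections` maps a member to the set of members they are
    connected to. Pairs with many common connections are good
    "People You May Know" candidates.
    """
    scores = {}
    members = list(connections)
    for i, a in enumerate(members):
        for b in members[i + 1:]:
            if b in connections[a]:
                continue  # already connected, nothing to recommend
            common = len(connections[a] & connections[b])
            if common:
                scores[(a, b)] = common
    return scores

graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}
scores = triangle_closing_scores(graph)
```

At LinkedIn scale this pairwise loop is what the Pig/MapReduce jobs parallelize.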
This document provides instructions for installing MongoDB on Windows and CentOS. It outlines 5 steps for installing on Windows which include downloading MongoDB, creating a data folder, extracting the download package, connecting to MongoDB using mongo.exe, and testing with sample data. It also outlines 5 steps for installing on CentOS that mirror the Windows steps. The document then discusses additional MongoDB concepts like connecting to databases, creating collections and inserting documents, using cursors, querying for specific documents, and core CRUD operations.
This document discusses user-defined functions and materialized views in Cassandra. It provides information on how to create user-defined functions and user-defined aggregates, including the syntax and best practices. It also covers how user-defined functions and aggregates are executed. The document then discusses materialized views, including why they are useful and how they work at a high level. It provides the syntax for creating materialized views and describes how updates are handled.
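A user-defined aggregate executes as described here: a state function is folded over every matching row, then an optional final function converts the accumulated state into the result. A miniature model of that execution scheme (the helper `run_uda` is illustrative, not Cassandra API):

```python
def run_uda(rows, initcond, state_func, final_func=lambda s: s):
    """How a Cassandra user-defined aggregate executes, in miniature:
    the state function is folded over every row, then the optional
    final function turns the accumulated state into the result."""
    state = initcond
    for row in rows:
        state = state_func(state, row)
    return final_func(state)

# An "average" aggregate: state is (count, sum), finalised to sum/count.
avg = run_uda(
    rows=[4.0, 8.0, 12.0],
    initcond=(0, 0.0),
    state_func=lambda s, v: (s[0] + 1, s[1] + v),
    final_func=lambda s: s[1] / s[0],
)
```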
This document summarizes Cassandra drivers and tools. It discusses the Java driver architecture including connection pooling, load balancing policies, and automatic paging. It also demonstrates Cassandra Unit for testing, the Java driver object mapper module, and Achilles object mapper with features like dirty checking. Live coding examples are provided for these tools.
SASI, Cassandra on the full text search ride at Voxxed Days Belgrade 2016 – Duyhai Doan
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
This document provides an introduction to Cassandra including:
1) An overview of Cassandra's key architecture including its linear scalability, continuous availability across data centers, and operational simplicity.
2) A discussion of Cassandra's data model including its use of Last Write Wins for conflict resolution and examples of modeling one-to-many relationships using clustered tables.
3) Details on Cassandra's consistency levels and how they impact availability and durability of writes and reads.
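The consistency-level trade-off in point 3 boils down to simple arithmetic: a read is guaranteed to see the latest write when the write and read replica sets overlap, i.e. when W + R > RF, and QUORUM is a strict majority of replicas. A sketch of that arithmetic:

```python
def quorum(replication_factor):
    """Number of replicas that must answer for QUORUM (strict majority)."""
    return replication_factor // 2 + 1

def is_strongly_consistent(write_replicas, read_replicas, replication_factor):
    """Reads see the latest write when the write and read replica sets
    must overlap, i.e. when W + R > RF."""
    return write_replicas + read_replicas > replication_factor

rf = 3
w = r = quorum(rf)                          # QUORUM writes and QUORUM reads
strong = is_strongly_consistent(w, r, rf)   # 2 + 2 > 3: always consistent
one_one = is_strongly_consistent(1, 1, rf)  # ONE/ONE: fast but eventual
```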
This document summarizes a presentation about the KillrChat messaging application. KillrChat is a scalable messaging app built using AngularJS, Spring, and Cassandra. It demonstrates denormalization and provides an exercise for attendees to work with user and chat room management, as well as chat messages. The document outlines the architecture, data models, and solutions for handling concurrent requests to avoid inconsistencies through the use of lightweight transactions in Cassandra.
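The lightweight transactions used here are compare-and-set operations; for example, `INSERT ... IF NOT EXISTS` only applies when the row is absent, which prevents two concurrent sign-ups from silently overwriting each other. An in-memory stand-in for that behaviour (class and method names are illustrative):

```python
class TinyTable:
    """In-memory stand-in for a Cassandra table, just enough to show
    the compare-and-set behaviour of INSERT ... IF NOT EXISTS."""

    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, key, row):
        """Return (applied, existing_row), like the [applied] column
        a lightweight transaction returns to the client."""
        if key in self.rows:
            return False, self.rows[key]
        self.rows[key] = row
        return True, None

users = TinyTable()
first, _ = users.insert_if_not_exists("jdoe", {"email": "jdoe@a.com"})
second, existing = users.insert_if_not_exists("jdoe", {"email": "x@b.com"})
```

In real Cassandra this guarantee is obtained via a Paxos round among replicas, which is why lightweight transactions cost noticeably more than plain writes.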
This document discusses using Spark with Apache Cassandra for various use cases including loading data from various sources, performing analytics, and sanitizing, validating, and transforming data. It provides examples of using Spark jobs to import data, clean data, perform schema migrations, and run analytics queries. It also covers aspects of the connector architecture like data locality, failure handling, and cross data center operations. The document concludes with discussing a benchmark that used Spark and Cassandra to perform parallel data ingestion and top-K queries on 3.2 billion rows of data with SASI indices.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
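The token-range and replication concepts in the last bullet can be sketched together: a partition key hashes to a token on a ring, the node owning that range is the primary replica, and the next distinct nodes clockwise hold the copies. A simplified model (MD5 stands in for Murmur3, and placement mirrors SimpleStrategy, ignoring racks and data centers):

```python
import bisect
import hashlib

def token_of(partition_key):
    """Hash a partition key onto a 0..2**64 ring (stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def replicas_for(partition_key, ring, rf):
    """`ring` is a sorted list of (token, node) pairs. The primary
    replica is the first node at or after the key's token (wrapping
    around), and the remaining rf - 1 distinct nodes follow clockwise."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token_of(partition_key)) % len(ring)
    picked = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in picked:
            picked.append(node)
        if len(picked) == rf:
            break
    return picked

ring = [(2**62, "node-A"), (2**63, "node-B"), (3 * 2**62, "node-C")]
owners = replicas_for("user:42", ring, rf=2)
```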
Fast track to getting started with DSE Max @ ING – Duyhai Doan
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
The presentation introduces KillrChat, a scalable messaging app built using Cassandra to demonstrate denormalization. It discusses the technology stack including Cassandra, Spring Boot, and AngularJS. It then covers the data models and solutions for various entities like users, chat rooms, and messages to handle concurrent modifications using lightweight transactions. Real-time features are implemented with WebSockets. The presentation provides a hands-on exercise for attendees and highlights how to build a real application with the Cassandra ecosystem.
The document describes the KillrChat application, which is a scalable chat application built with AngularJS, Cassandra, and Spring Boot. It discusses the application architecture including using Cassandra for distributed data storage and scaling out via a message broker. It also summarizes the key components of the application including controllers, services, REST resources, directives, and how data is distributed in Cassandra.
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Datastax Day 2016: introduction to Apache Cassandra – Duyhai Doan
This document provides an overview of Apache Cassandra and discusses its key features. It describes how Cassandra distributes and replicates data across multiple nodes for continuous availability and linear scalability. It also covers Cassandra's consistency model and how consistency levels can be tuned to balance availability and durability. The document lists Cassandra's features like collections, user-defined types, materialized views, and JSON support for flexible data modeling.
Cassandra and Spark, closing the gap between NoSQL and analytics – codemotio... – Duyhai Doan
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
Spark Cassandra integration, theory and practice – Duyhai Doan
This document discusses Spark and Cassandra integration. It begins with an introduction to Spark, describing it as a general data processing framework that is faster than Hadoop. It then discusses the Cassandra database and its data distribution using token ranges. The document provides examples of using the Spark/Cassandra connector for reading and writing data between Spark and Cassandra, including techniques for ensuring data locality. It discusses best practices for cluster deployment and handling failures while maintaining data locality. Finally, it presents some use cases for using Spark/Cassandra including data cleaning, schema migration, and analytics.
This document provides building code and construction information for a proposed remodel of a 2,377 square foot KFC restaurant located in Surprise, Arizona. The scope of work includes renovations to the dining area including new seating, flooring, wall finishes and lighting. Restroom remodels are planned to provide accessibility compliance upgrades. Exterior alterations consist of a new facade, finishes and lighting. Site improvements will modify sidewalks and parking areas to comply with accessibility standards. Construction will conform to the 2006 International Building Code and other applicable local codes.
This document discusses Libon's migration of contact data from an SQL database to Cassandra. It began with billions of contact records stored relationally in Oracle, where performance became unpredictable at scale. Tuning Oracle helped, but new challenges like high availability and multi-datacenter support remained. The migration strategy involved writing to both databases, migrating old data, and switching fully to Cassandra, with no downtime and a safe rollback path. Business-code refactoring kept the existing tests passing by modifying services and repositories to work with the new Cassandra data model.
Cassandra nice use cases and worst anti patterns – NoSQL matters Barcelona – Duyhai Doan
This document summarizes a presentation on Cassandra use cases and anti-patterns. It discusses several anti-patterns to avoid such as queue-like designs, intensive updates on the same column, and designing around a dynamic schema. It also provides examples of good use cases such as rate limiting, anti-fraud detection, and account validation. The document contains an agenda, descriptions of each anti-pattern and their level of failure, as well as explanations and demonstrations of the example use cases.
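The rate-limiting use case typically maps to one counter row per (client, time window), expired automatically via TTL. A sketch of that fixed-window scheme, with the TTL simulated by timestamps instead of a real table (class and parameter names are invented for illustration):

```python
import time

class FixedWindowRateLimiter:
    """Fixed-window rate limiter in the spirit of the Cassandra use
    case: one counter per (client, window), dropped when the window
    passes, which a TTL would do automatically in a real table."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (client, window_index) -> count

    def allow(self, client, now=None):
        now = time.time() if now is None else now
        bucket = (client, int(now // self.window))
        count = self.counters.get(bucket, 0)
        if count >= self.limit:
            return False
        self.counters[bucket] = count + 1
        return True

limiter = FixedWindowRateLimiter(limit=3, window_seconds=60)
answers = [limiter.allow("api-key-1", now=1000.0) for _ in range(4)]
```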
This document discusses Cassandra and the Datastax Academy. It provides examples of companies using Cassandra as infrastructure including ING, Netflix, Sony, and Microsoft. It also discusses the increasing SQL support in Cassandra, such as user defined functions, materialized views, and secondary indexes. The document notes that skills in Cassandra are in high demand but difficult to find. It promotes the Datastax Academy as a free solution to this problem, offering self-paced courses, instructor-led training, and O'Reilly certification to boost careers.
There are a few options for performing more complex queries in Cassandra beyond the restrictions of the WHERE clause:
1. Denormalize/duplicate data across tables to allow querying on different columns. For example, have one table keyed on user ID and another keyed on message date to allow filtering by date.
2. Offload complex queries to an external search index like Solr or Elasticsearch that can handle full-text and complex queries, and keep Cassandra as the system of record.
3. Use Spark/Hive on Cassandra to run queries across the cluster and leverage their more powerful query engines.
4. Consider a different database if your queries require joins or complex WHERE clauses, or don't map well to Cassandra's partition-based data model.
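Option 1 above means every write goes to one table per query pattern. A small in-memory model of that double-write discipline (table and key names are illustrative):

```python
class MessageStore:
    """Denormalized writes: every message is stored twice, once per
    query pattern, mirroring the two-table approach in option 1."""

    def __init__(self):
        self.by_user = {}   # like a messages_by_user table (PK: user_id)
        self.by_date = {}   # like a messages_by_date table (PK: day)

    def save(self, user_id, day, text):
        # One logical write fans out to both "tables".
        self.by_user.setdefault(user_id, []).append((day, text))
        self.by_date.setdefault(day, []).append((user_id, text))

    def messages_of(self, user_id):
        return [t for _, t in self.by_user.get(user_id, [])]

    def messages_on(self, day):
        return [t for _, t in self.by_date.get(day, [])]

store = MessageStore()
store.save("u1", "2015-06-01", "hello")
store.save("u2", "2015-06-01", "hi")
store.save("u1", "2015-06-02", "bye")
```

Storage is duplicated, but each query hits exactly one partition, which is the trade Cassandra modeling is built around.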
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008
- Key features of Cassandra including linear scalability, continuous availability, support for multiple data centers, operational simplicity, and analytics capabilities
- Details on Cassandra's architecture including its cluster layer based on Amazon Dynamo and data store layer based on Google BigTable
- Explanations of Cassandra's data distribution, token ranges, replication, coordinator nodes, tunable consistency levels, and write path
- Descriptions of Cassandra's data model, including "last write wins" semantics, with examples of CRUD operations and table schemas
Cassandra nice use cases and worst anti patterns – Duyhai Doan
This document discusses Cassandra use cases and anti-patterns. Some good use cases include rate limiting, fraud prevention, account validation, and storing sensor time series data. Poor designs include using Cassandra like a queue, storing null values, intensive updates to the same column, and dynamically changing the schema. The document provides examples and explanations of how to properly implement these scenarios in Cassandra.
The document discusses Cassandra architecture and operations. It provides an overview of key Cassandra concepts like data distribution across nodes, replication, consistency levels, and the write and read paths. It also covers topics like compaction strategies, best practices for configuration, and operational recommendations.
This document summarizes the Cassandra Java driver and tools. It discusses the driver's architecture including connection pooling, request pipelining, load balancing policies, and automatic failover. It also covers using statements, asynchronous reads, the query builder, and the object mapper. Lastly, it discusses new automatic paging functionality in the driver.
Cassandra introduction apache con 2014 budapestDuyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
This document discusses query optimization in database systems. It explains that a query optimizer is needed because there are many possible ways to execute a query with different tables and joins. The optimizer uses statistics, cost modeling, and explores the search space of options to pick the most efficient plan. It also shows how database internals knowledge like indexes, joins, and parallelism can help the optimizer generate better execution plans.
This document discusses Achilles, an object mapper for Cassandra. It provides a live demo of Achilles' main API for insert, update, remove, and find operations. The document also outlines Achilles' documentation, slice and typed queries, native queries, options, and roadmap including asynchronous support and integration with Elasticsearch.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
Apache Spark is a general data processing framework which allows you perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark flexible API and Cassandra performance, we get an interesting alternative to the Hadoop eco-system for both real-time and batch processing. During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with live code demo.
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisDuyhai Doan
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...hamidsamadi
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It introduces Spark and its characteristics like being fast, easy to use, and having a rich API. It then discusses Cassandra's data distribution using token ranges and how Spark partitions data to maximize data locality when reading from and writing to Cassandra. The document demonstrates the Spark-Cassandra connector architecture and how it exposes Cassandra tables as RDDs and DataFrames while pushing predicates down for filtering. It also provides examples of using the connector API to read and write data and ensuring data locality.
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Cassandra data structures and algorithmsDuyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
Crate.io offers a solution for the Internet of Things (IoT) and its requirements. CrateDB is a distributed SQL database that can handle huge amounts of diverse data from IoT devices in real-time. It features automatic scaling, high availability, dynamic schemas, and geospatial querying capabilities. CrateDB uses a column-oriented approach which optimizes common operations on sets of data like aggregations and counting, unlike row-oriented databases. Real-world use cases of CrateDB were presented from companies handling wind turbine data, vehicle telemetry, security monitoring, and more. A live demo then showed how CrateDB can simplify IoT architectures by replacing queueing systems and multiple databases with a single scalable
Cassandra Summit 2014: Real Data Models of Silicon ValleyDataStax Academy
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
KillrChat: Building Your First Application in Apache Cassandra (English)DataStax Academy
KillrChat is a scalable messaging app built using AngularJS, Spring, and Cassandra. It demonstrates how to build a real messaging application using the Cassandra database and handle features like user accounts, chat rooms, joining/leaving rooms, and messaging in a scalable way. The presentation covered the architecture, data models for users, rooms, and messages, and how to handle concurrent modifications to data using lightweight transactions in Cassandra.
This document provides an introduction to the Python programming language. It discusses why Python is useful, highlighting that it is easy to read and learn, has a powerful interactive interpreter, and is scalable and high-level. It also outlines key features like being procedural, object-oriented, and dynamically typed. The document then discusses popular domains where Python is used, like web development, machine learning, and data analysis. It covers execution modes, variables, data types, operators, conditional execution, functions, and building a "Who Wants to Be a Millionaire" game in Python.
The document discusses Dynamic Data Exchange (DDE), an early Windows API that allows data sharing between applications; it outlines Business Information Server (BIS) support for DDE including initiating conversations, reading/writing data, and executing commands; examples are provided for using DDE between BIS and applications like Excel, Word, and Visual Basic.
Cassandra Community Webinar | Data Model on FireDataStax
Functional data models are great, but how can you squeeze out more performance and make them awesome? Let's talk through some example Cassandra 2.0 models, go through the tuning steps and understand the tradeoffs. Many time's just a simple understanding of the underlying Cassandra 2.0 internals can make all the difference. I've helped some of the biggest companies in the world do this and I can help you. Do you feel the need for Cassandra 2.0 speed?
PHP performance 101: so you need to use a databaseLeon Fayer
Being involved in performance audits on systems of every size, from start-up sites hacked together overnight, to a ginormous applications built by world-recognized brand companies, I’ve seen a lot of interesting (and sometimes very unique) performance issues in every level of the stack: code, architecture, databases (sometimes all of the above). But there are a few particular, very “Performance 101″, issues that (unfortunately) appear in a lot of code bases. In this talk I present the most common database-related performance bottlenecks that can happen in most PHP applications.
The Ring programming language version 1.4.1 book - Part 14 of 31Mahmoud Samir Fayed
This document discusses creating a 2D game engine in Ring for desktop and mobile games. It describes the different layers of the project including the games layer, game engine classes, interface to graphics libraries, and graphics library bindings. Key classes for the engine are described like Game, GameObject, Sprite, Text, Animate, Sound and Map. The engine is designed to use declarative programming in the games layer to create games. Examples of games that could be built with it include Stars Fighter, Flappy Bird 3000, and Super Man 2016. The interface layers allow switching between Allegro and SDL graphics libraries.
Similar to Cassandra introduction at FinishJUG (20)
This document provides an overview of big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure and consensus. It also covers data sharding, replication, and the CAP theorem. Key points include how latency is impacted by network delays, different failure modes, and that the CAP theorem states that a distributed system can only guarantee two of consistency, availability, and partition tolerance at once.
Big data 101 for beginners riga dev daysDuyhai Doan
This document provides an overview and introduction to big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure modes, and consensus protocols. It also covers data sharding and replication techniques. The document explains the CAP theorem and how it relates to consistency and availability. Finally, it discusses different distributed systems architectures like master/slave versus masterless designs.
This document provides an overview of DataStax Enterprise, a database platform for cloud applications. It discusses key features of DataStax Enterprise including that it is certified for production, offers automatic management services for configuration and administration through OpsCenter, and provides 24/7 expert support. The document also summarizes various DataStax Enterprise technologies and capabilities like advanced replication, tiered storage, security features, and integration with search, analytics, and graph databases.
Datastax day 2016 : Cassandra data modeling basicsDuyhai Doan
This document discusses data modeling with Apache Cassandra. It covers:
1. The objectives of data modeling like reducing query latency and avoiding disasters
2. Choosing the right partition key which is the main entry point for queries and helps distribute data
3. Using clustering columns to simulate one-to-many relationships and enable sorting and range queries
4. Other critical details like avoiding huge partitions, sub-partitioning techniques, and how deletes create tombstones
This document discusses Apache Cassandra and its features and use cases. It provides an overview of Cassandra's key characteristics like massive scalability, extreme availability, and rich data modeling. Example use cases mentioned include messaging, collections/playlists, fraud detection, recommendations, and IoT sensor data. New features introduced in Cassandra in 2016 are also summarized, such as delete by range, materialized views, atomic UDT updates, a new SASI index, and support for GROUP BY queries.
Spark zeppelin-cassandra at synchrotronDuyhai Doan
This document discusses using Spark, Cassandra, and Zeppelin for storing and aggregating metrics data from a particle accelerator project called HDB++. It provides an overview of the HDB++ project, how it previously used MySQL but now stores data in Cassandra. It describes the Spark jobs that are run to load metrics data from Cassandra and generate statistics that are written back to Cassandra. It also demonstrates visualizing the data using Zeppelin and discusses some tricks and traps to be aware of when using this stack.
Sasi, cassandra on full text search rideDuyhai Doan
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
Cassandra 3 new features @ Geecon Krakow 2016Duyhai Doan
Duyhai Doan gave a presentation on new features in Cassandra 3.0, including materialized views, user defined functions, user defined aggregates, and the new SASI full text search index. Materialized views allow pre-computing common queries to improve performance. User defined functions and aggregates enable pushing computation to the server. The new SASI index provides improved full text search capabilities in Cassandra.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
Apache zeppelin the missing component for the big data ecosystemDuyhai Doan
Duy Hai Doan presented Apache Zeppelin, an open-source web-based notebook that allows users to interact with data. Zeppelin provides a front-end GUI and display system for data analysis tools and uses interpreters to connect to back-end systems like Spark, Cassandra, and Flink. Doan demonstrated Zeppelin's notebook interface, display options, and how users can write their own interpreters to connect new systems to Zeppelin. Future plans for Zeppelin include improving usability, adding authentication and authorization, and developing more interpreters and visualizations.
Apache zeppelin, the missing component for the big data ecosystemDuyhai Doan
Apache Zeppelin is a web-based notebook that allows users to interact with data via interpreters like Spark, SQL, and Cassandra. It provides a GUI for data scientists to write code and visualizations in notebooks. Zeppelin has a modular architecture that allows new interpreters to be easily added. It also includes features like scheduling, sharing, and exporting of notebooks.
Distributed algorithms for big data @ GeeConDuyhai Doan
This document discusses distributed algorithms for big data. It begins with an overview of HyperLogLog for estimating cardinality and counting distinct elements in a large data set. It then explains how HyperLogLog works by using a hash function to distribute the data across buckets and applying the LogLog algorithm to each bucket before taking the harmonic mean. The document also covers Paxos for distributed consensus, explaining the phases of prepare, promise, accept and learn to reach agreement in the presence of failures.
Spark cassandra connector.API, Best Practices and Use-CasesDuyhai Doan
- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Keywords: AI, Containeres, Kubernetes, Cloud Native
Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Ukraine
Під час доповіді відповімо на питання, навіщо потрібно підвищувати продуктивність аплікації і які є найефективніші способи для цього. А також поговоримо про те, що таке кеш, які його види бувають та, основне — як знайти performance bottleneck?
Відео та деталі заходу: https://bit.ly/45tILxj
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
2. @doanduyhai
Who Am I ?!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
2
3. @doanduyhai
Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
3
11. @doanduyhai
Multi-DC usages!
Prod data copy for testing/benchmarking
[Diagram: an 8-node production cluster (n1 … n8) replicating a data copy to a tiny 3-node test cluster (n1 … n3)]
Use LOCAL consistency on the production DC
NEVER WRITE to the test DC !!!
11
34. @doanduyhai
Consistency summary!
ONE Read + ONE Write
☞ available for read/write even with (N-1) replicas down
QUORUM Read + QUORUM Write
☞ available for read/write even with ⌊(N-1)/2⌋ replicas down
34
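The availability/consistency arithmetic behind this summary can be sketched in a few lines (illustrative only, not Cassandra code): with replication factor N, reading R replicas and writing W replicas gives strong consistency whenever R + W > N.

```python
# Illustrative sketch of tunable-consistency math (not Cassandra internals).

def quorum(n: int) -> int:
    """Number of replicas forming a majority (QUORUM) for replication factor n."""
    return n // 2 + 1

def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    """True when any read replica set must overlap any write replica set."""
    return r + w > n

N = 3  # replication factor
# ONE + ONE: fast and highly available, but reads may miss the latest write
print(is_strongly_consistent(1, 1, N))
# QUORUM + QUORUM: read and write sets always overlap
print(is_strongly_consistent(quorum(N), quorum(N), N))
# Replicas that may be down while QUORUM reads/writes still succeed:
print(N - quorum(N))
```

With N = 3 this prints `False`, `True`, `1`: QUORUM + QUORUM stays strongly consistent while tolerating one replica down.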
42. @doanduyhai
Last Write Win (LWW)!
jdoe
age
name
33 John DOE
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
#partition
42
43. @doanduyhai
Last Write Win (LWW)!
jdoe
age (t1) name (t1)
33 John DOE
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
auto-generated timestamp
43
44. @doanduyhai
Last Write Win (LWW)!
UPDATE users SET age = 34 WHERE login = 'jdoe';
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2
44
45. @doanduyhai
Last Write Win (LWW)!
DELETE age FROM users WHERE login = 'jdoe';
jdoe
age (t3)
✕
tombstone
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
SSTable1 SSTable2 SSTable3
45
46. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
???
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
✕
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
46
47. @doanduyhai
Last Write Win (LWW)!
SELECT age FROM users WHERE login = 'jdoe';
✓✕✕
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
✕
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
47
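The read resolution shown above can be sketched as a timestamp-based merge (an illustrative model, not Cassandra internals): for each cell, the fragment with the highest timestamp wins, and a tombstone at the highest timestamp means the cell is deleted.

```python
# Illustrative last-write-win merge across SSTable fragments
# (a model of the behavior, not Cassandra internals).

TOMBSTONE = object()  # sentinel standing in for a deletion marker

def merge_cell(fragments):
    """Return the live value of a cell, or None if the latest write is a delete.
    fragments: iterable of (timestamp, value) pairs from different SSTables."""
    ts, value = max(fragments, key=lambda f: f[0])  # highest timestamp wins
    return None if value is TOMBSTONE else value

# SSTable1: INSERT age=33 at t1; SSTable2: UPDATE age=34 at t2;
# SSTable3: DELETE age at t3 (tombstone)
print(merge_cell([(1, 33), (2, 34), (3, TOMBSTONE)]))  # tombstone wins
print(merge_cell([(1, 33), (2, 34)]))                  # latest write wins
```

The first query resolves to the tombstone (the cell reads as deleted), the second to 34, matching the ✓/✕ marks on the slide.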
49. @doanduyhai
Historical data!
You want to keep data history ?
• do not rely on the internally generated timestamp !!! (LWW keeps only the latest value)
• ☞ model it as time-series data instead
[Diagram: SSTable1 and SSTable2 holding dated columns date1(t1) … date11(t11) for the same partition id]
49
50. @doanduyhai
CRUD operations!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';
50
52. @doanduyhai
What about joins ?!
How can I join data between tables ?
How can I model 1 – N relationships ?
How to model a mailbox ?
User 1 ── n Emails
52
55. @doanduyhai
Queries!
Get message by user and message_id (date)
SELECT * FROM mailbox WHERE login = 'jdoe'
and message_id = '2014-09-25 16:00:00';
Get message by user and date interval
SELECT * FROM mailbox WHERE login = 'jdoe'
and message_id <= '2014-09-25 16:00:00'
and message_id >= '2014-09-20 16:00:00';
55
56. @doanduyhai
Queries!
Get message by message_id only ?
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00';
Get message by date interval only ?
SELECT * FROM mailbox
WHERE message_id <= '2014-09-25 16:00:00'
and message_id >= '2014-09-20 16:00:00';
❓
❓
56
57. @doanduyhai
Queries!
Get message by message_id only (#partition not provided)
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00';
Get message by date interval only (#partition not provided)
SELECT * FROM mailbox
WHERE message_id <= '2014-09-25 16:00:00'
and message_id >= '2014-09-20 16:00:00';
57
61. @doanduyhai
Queries!
SELECT * FROM mailbox WHERE login >= 'hsue' and login <= 'jdoe';
Get message by user range (range query on #partition)
SELECT * FROM mailbox WHERE login like '%doe%';
Get message by user pattern (non-exact match on #partition)
61
62. @doanduyhai
WHERE clause restrictions!
All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition
Only exact match (=) on the #partition; range queries (<, ≤, >, ≥) are not allowed
• ☞ they would require a full cluster scan
On clustering columns, only exact match and range queries (<, ≤, >, ≥)
WHERE clause only possible
• on columns defined in the PRIMARY KEY
• on indexed columns
62
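These restrictions follow directly from the storage model. A rough mental model (illustrative sketch with made-up data, not Cassandra internals): a table with PRIMARY KEY((login), message_id) behaves like a hash map of partitions, each holding rows sorted by the clustering column.

```python
# Illustrative model of partition key + clustering column lookups
# (made-up data; not Cassandra internals).
import bisect

table = {
    # partition key -> rows sorted by clustering column: (message_id, content)
    "jdoe": [("2014-09-20 10:00:00", "hello"),
             ("2014-09-25 16:00:00", "world")],
    "hsue": [("2014-09-22 09:00:00", "hi")],
}

def select_range(login, lo, hi):
    """Exact match on the partition key, range scan on the clustering column:
    one hash lookup, then a contiguous slice of a sorted list."""
    rows = table[login]                                   # O(1) partition lookup
    keys = [message_id for message_id, _ in rows]
    return rows[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(select_range("jdoe", "2014-09-20 16:00:00", "2014-09-25 16:00:00"))
# Querying by message_id alone would mean scanning every partition on every
# node -- which is exactly why Cassandra rejects such WHERE clauses.
```

This is why range queries on the clustering column are cheap (a contiguous slice on one node) while any query that omits the partition key is refused.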
64. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
64
65. @doanduyhai
WHERE clause restrictions!
What if I want to perform « arbitrary » WHERE clause ?
• search form scenario, dynamic search fields
DO NOT RE-INVENT THE WHEEL !
☞ Apache Solr (Lucene) integration (Datastax Enterprise)
☞ Same JVM, 1-cluster-2-products (Solr & Cassandra)
SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND gender:male';
SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';
65
66. @doanduyhai
Collections & maps!
CREATE TABLE users (
login text,
name text,
age int,
friends set<text>,
hobbies list<text>,
languages map<int, text>,
…
PRIMARY KEY(login));
66
Keep the cardinality low ≈ 1000
67. @doanduyhai
User Defined Type (UDT)!
CREATE TABLE users (
login text,
…
street_number int,
street_name text,
postcode int,
country text,
…
PRIMARY KEY(login));
Instead of
67
68. @doanduyhai
User Defined Type (UDT)!
CREATE TYPE address (
street_number int,
street_name text,
postcode int,
country text);
CREATE TABLE users (
login text,
…
location frozen <address>,
…
PRIMARY KEY(login));
68
70. @doanduyhai
UDT update!
UPDATE users SET location =
{
street_number: 125,
street_name: 'Congress Avenue',
postcode: 95054,
country: 'USA'
}
WHERE login = 'jdoe';
Can be nested ☞ store documents
• but no dynamic fields (or use map<text, blob>)
70
71. @doanduyhai
From SQL to CQL!
Normalized
User 1 ── n Comment
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author_login text, // typical join id
content text,
PRIMARY KEY((article_id), comment_id));
71
72. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
User 1 ── n Comment
72
73. @doanduyhai
From SQL to CQL
1 SELECT
- 10 last comments
- 10 author_login
What to do with 10 author_login ???
10 extra SELECTs → the N+1 SELECT problem!
User (1) ⟶ (n) Comment
73
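The N+1 pattern the slide warns about looks like this in practice (a sketch; the users table is assumed from earlier slides):

```sql
-- 1 query for the comments...
SELECT comment_id, author_login, content
FROM comments WHERE article_id = ? LIMIT 10;

-- ...then 1 extra query per distinct author_login, up to 10 round-trips
SELECT name, age FROM users WHERE login = 'alice';
SELECT name, age FROM users WHERE login = 'bob';
-- ...and so on for each remaining author
```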
74. @doanduyhai
From SQL to CQL!
De-normalized model: User (1) ⟶ (n) Comment
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author frozen<person>, // person is UDT
content text,
PRIMARY KEY((article_id), comment_id));
74
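With the author embedded as a UDT, the same read scenario collapses back to a single query. A sketch, assuming a person UDT carrying only the display fields (field names are illustrative):

```sql
-- person is an application-defined UDT, for example:
CREATE TYPE person (
  login text,
  firstname text,
  lastname text);

-- one query now returns the comments AND their author details
SELECT comment_id, author, content
FROM comments WHERE article_id = ? LIMIT 10;
```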
76. @doanduyhai
Data modeling best practices!
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
Denormalize
• wisely, only duplicate necessary & immutable data
• functional/technical trade-off
76
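"1 read scenario ≈ 1 SELECT" typically means one table per query path, with the data duplicated between them. A hedged sketch (table and column names are illustrative, not from the slides):

```sql
-- read path 1: look up a user by login
CREATE TABLE users_by_login (
  login text,
  email text,
  name text,
  PRIMARY KEY(login));

-- read path 2: look up the same user by email (denormalized copy)
CREATE TABLE users_by_email (
  email text,
  login text,
  name text,
  PRIMARY KEY(email));
```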
78. @doanduyhai
Data modeling best practices!
John DOE, male
birthdate: 21/02/1981
subscribed since 03/06/2011
☉ San Mateo, CA
"Impossible is not John DOE"
Full detail is read from the User table on click
78
82. @doanduyhai
Data modeling trade-off
2 strategies
• either accept to normalize some data (extra SELECT required)
• or de-normalize and update everywhere upon data mutation
But keep those scenarios rare (5%-10% max); focus on the other 90%
Example: Twitter tweet deletion
82
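To make the tweet-deletion example concrete: if tweets are denormalized into each follower's timeline, a delete must be propagated to every copy by the application. A sketch (the table layout is illustrative, not from the slides):

```sql
-- denormalized timeline: each follower holds a copy of the tweet
CREATE TABLE timeline (
  follower_login text,
  tweet_id timeuuid,
  author_login text,
  content text,
  PRIMARY KEY((follower_login), tweet_id));

-- deleting one tweet = one DELETE per follower, issued by the application
DELETE FROM timeline WHERE follower_login = 'alice' AND tweet_id = ?;
DELETE FROM timeline WHERE follower_login = 'bob' AND tweet_id = ?;
```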
84. @doanduyhai
Lightweight Transaction (LWT)!
What ? ☞ make operations linearizable
Why ? ☞ solve a class of race conditions in Cassandra that
would otherwise require an external lock manager
84
85. @doanduyhai
Lightweight Transaction (LWT)!
Two concurrent clients check-then-insert the same account:
Client 1: SELECT * FROM account WHERE id = 'jdoe'; → (0 rows)
Client 2: SELECT * FROM account WHERE id = 'jdoe'; → (0 rows)
Client 1: INSERT INTO account (id, email) VALUES ('jdoe', 'john_doe@fiction.com');
Client 2: INSERT INTO account (id, email) VALUES ('jdoe', 'jdoe@fiction.com'); ← winner
85
86. @doanduyhai
Lightweight Transaction (LWT)!
How ? ☞ by implementing the Paxos protocol in Cassandra
Syntax ?
INSERT INTO account (id, email) VALUES ('jdoe', 'john_doe@fiction.com')
IF NOT EXISTS;
UPDATE account SET email = 'jdoe@fiction.com'
WHERE id = 'jdoe' IF email = 'john_doe@fiction.com';
86
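A conditional statement reports back through a synthetic [applied] column, so the client knows whether it won the race (a sketch; the row values are illustrative):

```sql
INSERT INTO account (id, email) VALUES ('jdoe', 'john_doe@fiction.com')
IF NOT EXISTS;
--  [applied]
-- -----------
--       True

-- a second conditional insert on the same id loses,
-- and the existing row is returned alongside [applied] = False
INSERT INTO account (id, email) VALUES ('jdoe', 'other@fiction.com')
IF NOT EXISTS;
--  [applied] | id   | email
-- -----------+------+----------------------
--      False | jdoe | john_doe@fiction.com
```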