A brief high-level comparison of data modeling in relational databases and Cassandra, followed by a brief description of how Cassandra achieves global availability.
- In Cassandra, data is modeled differently than in relational databases, with an emphasis on denormalizing data and organizing it to support common queries with minimal disk seeks
- Cassandra uses keyspaces, column families, rows, columns and timestamps to organize data, with columns ordered to enable efficient querying of ranges
- To effectively model data in Cassandra, you should think about common queries and design schemas to co-locate frequently accessed data on disk to minimize I/O during queries
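The ordered-columns point can be made concrete with any sorted map: once keys are stored in order, a range of them can be read in one sequential pass. As a hedged analogy (Rust's `BTreeMap` standing in for a sorted Cassandra row; this is not Cassandra client code):

```rust
use std::collections::BTreeMap;

// Columns in a Cassandra row are kept sorted by column name, so a
// contiguous slice of columns can be read with a single range scan.
// A BTreeMap models the same ordered-key property in miniature.
fn slice_columns(row: &BTreeMap<String, String>, from: &str, to: &str) -> Vec<String> {
    row.range(from.to_string()..to.to_string())
        .map(|(_, v)| v.clone())
        .collect()
}

fn main() {
    let mut row = BTreeMap::new();
    row.insert("2012-01-01".to_string(), "event A".to_string());
    row.insert("2012-01-02".to_string(), "event B".to_string());
    row.insert("2012-02-01".to_string(), "event C".to_string());

    // One ordered range scan returns all of January's columns.
    let january = slice_columns(&row, "2012-01-01", "2012-02-01");
    assert_eq!(january, vec!["event A", "event B"]);
    println!("{:?}", january);
}
```

The same principle is why co-locating frequently queried columns under one row key keeps a query to a single seek plus a sequential read.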
Talk from CassandraSF 2012 showing the importance of real durability, with examples of row-level isolation in Cassandra and the implementation of a transaction log pattern. The example is a banking system built on top of Cassandra with support for crediting/debiting an account, viewing an account balance, and transferring money between accounts.
Introduction to Apache Cassandra and support within WSO2 Platform – Srinath Perera
Cassandra can be used within the WSO2 platform to provide NoSQL data support. It offers high scalability and availability with no single point of failure. WSO2 supports Cassandra in two ways: 1) as a service, where users can access Cassandra keyspaces through the web console with security integration; 2) as standalone Cassandra nodes integrated with WSO2 security. A sample program demonstrates how to insert and query data using the Hector Cassandra client.
This talk will explore two libraries, a Cassandra native CQL client and a Clojure DSL for writing CQL3 queries.
It will demonstrate how Cassandra and Clojure are a great fit, show the strength of the functional approach in this domain, and in particular how the data-centric nature of Clojure makes a lot of sense in this context.
This document discusses functional programming concepts like pure functions, immutable data, and avoiding side effects and shared state, comparing them to object-oriented and imperative approaches. It shows examples of representing and transforming data as immutable Clojure data structures like vectors and maps. Functions for processing this data in a declarative way are demonstrated, with an emphasis on composability through chaining of pure functions.
Rust
Why do you care about Rust? Who has the time to learn all these new languages? It seems like a new one is popping up every other week, and this trend is growing at an exponential rate. Good news: a fair number of them are crafted really well and efficiently solve specific problems. Bad news: how do you keep up with all of this, let alone decide which languages to include in your company's technology portfolio?
Despite the challenges of all these new languages, a majority of developers are intrigued by the idea of becoming a polyglot, but don't know where to begin or don't have the time. In my polyglot travels, there is one language of late that is the sure-fire answer to the above questions: Rust.
In this talk I'll explore the value of becoming a more polyglot developer, how to pick languages to learn, and then dive deep into the language of Rust, which in my opinion is hands down the best up-and-coming language to learn.
About the Presenter
Anthony Broad-Crawford has been a developer since the year 2000 with a short side stint as a semi-professional poker player. Since his transition to software development Anthony has...
1. Built 8 patent-receiving technologies
2. Founded two global companies
3. Been a CTO (3x), CPO (1x), and CEO (1x)
and is currently the CTO at Fooda, where he manages product, user experience, and engineering. Fooda is a predominantly web and mobile technology company focused on bringing great, healthy food from the best restaurants to people while at the office.
Through his career, Anthony has used Ruby, Java, Java (Android), Objective-C and Swift, .NET, Erlang, Scala, Node.js, LISP, Smalltalk, and even assembly in production applications, with his recent favorite being Rust. No, not all at the same time in the same application.
Anthony now spends his time building great teams that leverage great technology to build great products, but still looks to code every chance he can get :)
Rust tutorial from Boston Meetup 2015-07-22 – nikomatsakis
The document discusses various topics related to learning and using the Rust programming language. It begins with an introduction to some of Rust's core concepts like ownership and borrowing which provide memory safety without garbage collection. It then covers everyday usage of Rust including common data types, modules, cargo, and derives. The document also demonstrates concepts like methods, enums, slices, iterators, and privacy. It concludes by recommending additional resources for learning more about Rust.
Rust provides both control and safety through its ownership and borrowing model. It enforces safe patterns using the type system to prevent issues like data races, use-after-free errors, and iterator invalidation. This is achieved with no runtime overhead. Rust also supports building efficient abstractions through features like zero-cost abstractions and its approach to concurrency that guarantees freedom from data races. While Rust borrows ideas from other languages, its type system and ownership rules allow building applications that have control without compromising on safety.
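The ownership and borrowing rules summarized above can be seen in a few lines. A minimal sketch (not code from the tutorial itself):

```rust
// Ownership: each value has exactly one owner; moving transfers it.
// Borrowing: &T gives shared read access, &mut T exclusive write access.
// The compiler enforces these rules statically, with no runtime cost.
fn total_len(words: &[String]) -> usize {
    // Shared borrow: we may read `words`, but not modify or free it.
    words.iter().map(|w| w.len()).sum()
}

fn shout(word: &mut String) {
    // Exclusive borrow: no other reference may exist while we mutate.
    word.push('!');
}

fn main() {
    let mut words = vec!["hello".to_string(), "world".to_string()];
    let n = total_len(&words); // shared borrow ends after this call
    shout(&mut words[0]);      // so a mutable borrow is now allowed
    assert_eq!(n, 10);
    assert_eq!(words[0], "hello!");
    // `words` is dropped here; the compiler has proved no dangling
    // references remain, so there is no use-after-free and no GC.
    println!("{} chars before shouting", n);
}
```

Holding the shared borrow across the `shout` call would be rejected at compile time, which is exactly how data races and iterator invalidation are ruled out before the program runs.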
This document contains C code for loading sentences or queries into dynamic variables based on the input parameters. It includes a switch statement with 8 cases that set the values of dynamic variables like dyn_column, dyn_title, dyn_sts, and dyn_size based on the input sentence number. These dynamic variables are used to define SQL queries and reports on Oracle database instance statistics and performance metrics.
This document provides an introduction to the Rust programming language. It describes that Rust was developed by Mozilla Research beginning in 2009 to combine the type safety of Haskell, concurrency of Erlang, and speed of C++. Rust reached version 1.0 in 2015 and is a generic, multiparadigm systems programming language that runs on platforms including ARM, Apple, Linux, Windows and embedded devices. It emphasizes security, performance and fine-grained memory safety without garbage collection.
Rust: Reach Further (from QCon Sao Paolo 2018) – nikomatsakis
Rust is a new programming language that is growing rapidly. Rust's goal is to support a high-level coding style while offering performance comparable to C and C++ as well as minimal runtime requirements -- it does not require a runtime or garbage collector, and you can even choose to forego the standard library. At the same time, Rust offers strong support for parallel programming, including guaranteed freedom from data-races (something that GC’d languages like Java or Go do not provide).
Rust’s slim runtime requirements make it an ideal choice for integrating into other languages and projects. Anywhere that you could integrate a C or C++ library, you can choose to use Rust instead. Mozilla, for example, has rewritten a portion of the Firefox web browser in Rust -- while keeping the rest in C++. There are also projects for writing native extensions to Python, Ruby, and Node in Rust, as well as a recent effort to have the Rust compiler generate WebAssembly.
This talk will cover some of the highlights of Rust's design, and show how Rust's type system not only supports different parallel styles but also encourages users to write code that is amenable to parallelization. I'll also talk a bit about some of the experiences of using Rust in production, as well as how to integrate Rust into existing projects written in different languages.
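The guaranteed freedom from data races mentioned above follows from move semantics: each thread owns the data it touches, and results flow back through `join` rather than shared mutation. A minimal sketch (assumptions: the chunking strategy and `parallel_sum` helper are illustrative, not from the talk):

```rust
use std::thread;

// Each spawned thread takes ownership of its own copy of a slice of
// the input, so no two threads can race on the same memory; results
// come back through join(), not through shared mutable state.
fn parallel_sum(data: Vec<i64>, workers: usize) -> i64 {
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    let handles: Vec<_> = data
        .chunks(chunk_len)
        .map(|c| {
            let owned = c.to_vec(); // move an owned copy into the thread
            thread::spawn(move || owned.iter().sum::<i64>())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = parallel_sum((1..=100).collect(), 4);
    assert_eq!(total, 5050);
    println!("total = {}", total);
}
```

Trying to mutate the original `data` from two threads at once simply would not compile, which is the compile-time guarantee GC'd languages like Java or Go do not give.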
The document provides an introduction to the syntax and semantics of the Smalltalk programming language. It summarizes literals like numbers, strings, arrays, variables, assignments, returns, pseudo-variables, message expressions, block expressions, and conditionals and loops. The summary focuses on key concepts like objects communicating through message passing, different types of messages, variable scope, and blocks acting as anonymous methods.
This document provides an overview of working with bytecode in Pharo. It discusses:
- The Pharo compiler and how it translates code to an abstract syntax tree (AST), intermediate representation (IR), and finally bytecode.
- Generating bytecode using the IRBuilder tool, which allows defining bytecode instructions like stack operations, jumps, sends, and accessing variables.
- Parsing and interpreting bytecode using the InstructionStream hierarchy, which allows analyzing, printing, and decompiling bytecode.
Cassandra data structures and algorithms – Duyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
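The bloom filter math referred to above has a standard closed form: with m bits, k hash functions, and n inserted keys, the expected false-positive rate is approximately (1 − e^(−kn/m))^k. A small sketch of that formula (the formula is standard; the function name is mine, not from the deck):

```rust
// Expected bloom filter false-positive rate with m bits, k hash
// functions and n inserted keys: (1 - e^(-k*n/m))^k.
fn bloom_false_positive_rate(m: f64, k: f64, n: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

fn main() {
    // ~10 bits per key with 7 hashes gives roughly a 1% rate,
    // the classic rule of thumb.
    let p = bloom_false_positive_rate(10_000.0, 7.0, 1_000.0);
    assert!(p > 0.005 && p < 0.015);
    println!("false positive rate ~ {:.4}", p);
}
```

This is why Cassandra's read path can skip most SSTables that do not contain a requested key at the cost of a small, tunable chance of one extra disk read.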
Windows 10 Nt Heap Exploitation (Chinese version) – Angel Boy
The document discusses Windows memory allocation and the NT heap. It describes the core data structures used, including the _HEAP, _HEAP_ENTRY chunks, and _HEAP_LIST_LOOKUP BlocksIndex. It explains how allocated, freed, and VirtualAlloc chunks are structured and managed in the Back-End, including using freelist chains and BlocksIndex to efficiently service allocation requests.
Apache Cassandra in Bangalore - Cassandra Internals and Performance – aaronmorton
Cassandra internals and performance were presented. The key points covered include:
1) Cassandra has a layered architecture with APIs, a Dynamo layer, and a database layer. The Dynamo layer implements the Dynamo paper and handles replication and failure handling.
2) The database layer includes the memtable, SSTables, commit log and more. It handles writes, flushes, compactions and reads from storage.
3) A number of performance tests were shown measuring the impact of configuration parameters like memtable flush queue size, commit log sync period, and secondary indexes on write and read latency. Bloom filters, compactions and concurrency were also discussed.
Apache Cassandra, part 2 – data model example, machinery – Andrey Lomakin
The aim of this presentation is to provide enough information for an enterprise architect to decide whether Cassandra will be the project's data store. The presentation describes each nuance of Cassandra's architecture and ways to design data and work with it.
Rust provides safe, fast code through its ownership and borrowing model which prevents common bugs like use-after-free and data races. It enables building efficient parallel programs while avoiding the need for locking. Traits allow defining common interfaces that can be implemented for different types, providing abstraction without runtime costs. The language also supports unsafe code for interfacing with other systems while still enforcing safety within Rust programs through the type system.
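The trait mechanism described above can be shown in a few lines. A minimal sketch (the `Area` trait and shapes are illustrative examples, not from the source):

```rust
// A trait defines a common interface; impls provide it per type.
// Calls through generics are monomorphized at compile time, so the
// abstraction compiles down to direct calls -- no runtime cost.
trait Area {
    fn area(&self) -> f64;
}

struct Circle { radius: f64 }
struct Square { side: f64 }

impl Area for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.radius * self.radius }
}
impl Area for Square {
    fn area(&self) -> f64 { self.side * self.side }
}

// Generic over any type implementing Area; resolved statically.
fn describe<T: Area>(shape: &T) -> String {
    format!("area = {:.2}", shape.area())
}

fn main() {
    assert_eq!(describe(&Square { side: 3.0 }), "area = 9.00");
    println!("{}", describe(&Circle { radius: 1.0 }));
}
```

The same trait can also be used through `&dyn Area` when dynamic dispatch is genuinely wanted; the point is that the choice is explicit rather than imposed.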
Look Ma, "update DB to HTML5 using C++", no hands! – aleks-f
This document discusses using C++ and the POCO framework to generate HTML output from database queries without requiring any server-side code generation or browser plugins.
The POCO framework provides classes like RecordSet and RowFormatter that allow querying a database and outputting the results as HTML or other formats. Classes like Poco::Dynamic::Var allow strong typing while retaining flexibility. Together these enable generating HTML output directly from SQL queries in a performance-conscious way without extra processing on the server.
1) A message is a request sent from one object to another to execute a method.
2) The message receiver is the object that receives the message.
3) The method selector is the name of the method being called in the message.
Smalltalk uses only message passing between objects to perform operations.
This document discusses virtual machines and their implementation. It begins with an introduction to virtual machines, noting that they provide an abstract computing architecture that supports programming languages in a hardware-independent way. It then outlines the main components of a virtual machine, including the heap store, interpreter, automatic memory management, and threading system. The document dives into more detail on each of these components and also discusses optimizations that can be performed. It concludes by noting that while virtual machines provide benefits like platform independence, there is an overhead to their execution.
This document discusses various best practice patterns for object-oriented programming in Smalltalk, including:
1. Naming conventions for classes, methods, variables and parameters to clearly indicate their purpose and role.
2. Using delegation to share implementation between classes without inheritance. Delegation patterns include simple delegation, self delegation, and reversing methods.
3. Double dispatch and method objects for handling multiple cases through message passing rather than conditionals.
4. Lazy initialization, caching, and conversion methods to optimize performance of expensive computations.
5. Patterns for collections, intervals and streams including comparing and ordering objects.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008
- Key features of Cassandra including linear scalability, continuous availability, support for multiple data centers, operational simplicity, and analytics capabilities
- Details on Cassandra's architecture including its cluster layer based on Amazon Dynamo and data store layer based on Google BigTable
- Explanations of Cassandra's data distribution, token ranges, replication, coordinator nodes, tunable consistency levels, and write path
- Descriptions of Cassandra's data model including last write win and examples of CRUD operations and table schemas
This document discusses principles of object-oriented design, including:
1. Parametrizing code to avoid hardcoding constants and allow runtime changes without recompilation.
2. Defining behavior independently of state through abstract state and message layers.
3. Dividing programs into short, single-purpose methods to improve readability, reuse, and maintenance.
1. The document introduces macros in Scala and Clojure, providing an example of a macro that implements an assert function.
2. It demonstrates a Clojure macro that implements infix notation and shows how macros allow transforming code at compilation.
3. Various libraries and tools that use macros are listed, such as Wartremover, Datomisca, Expecty, Async, and MacWire.
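The assert example above is in Scala and Clojure; the same idea carries over to Rust's declarative macros, sketched here as an analogy (my `my_assert!` name, not code from the talk):

```rust
// A macro receives the unevaluated expression as tokens, so the
// failure message can include the source text of the condition --
// something a plain function, which only sees the evaluated
// boolean, cannot do. This is the classic motivating case for macros.
macro_rules! my_assert {
    ($cond:expr) => {
        if !$cond {
            panic!("assertion failed: {}", stringify!($cond));
        }
    };
}

fn main() {
    let x = 2;
    my_assert!(x + 2 == 4); // passes silently
    // my_assert!(x == 3) would panic with "assertion failed: x == 3"
    println!("ok");
}
```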
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu... – JAX London
2011-11-02 | 05:45 PM - 06:35 PM | Victoria
The Disruptor is a new open-source concurrency framework, designed as a high-performance mechanism for inter-thread messaging. It was developed at LMAX as part of our efforts to build the world's fastest financial exchange. Using the Disruptor as an example, this talk will explain some of the more detailed and less understood areas of concurrency, such as memory barriers and cache coherency. These concepts are often regarded as scary, complex magic accessible only to wizards like Doug Lea and Cliff Click. Our talk will try to demystify them and show that concurrency can be understood by us mere mortal programmers.
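A taste of the memory-barrier territory the talk covers, sketched in Rust rather than the Disruptor's Java (the publish/observe pattern is standard; none of this is LMAX code): a writer stores a payload and then raises a flag with Release ordering, while a reader spins on the flag with Acquire ordering. That Release/Acquire pair is the barrier guaranteeing the payload is visible once the flag is.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn publish_and_read() -> u64 {
    let value = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (v, r) = (Arc::clone(&value), Arc::clone(&ready));
    let writer = thread::spawn(move || {
        v.store(42, Ordering::Relaxed);   // write the payload
        r.store(true, Ordering::Release); // barrier: payload visible once flag is
    });

    while !ready.load(Ordering::Acquire) {} // spin until the flag is raised
    let seen = value.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}

fn main() {
    assert_eq!(publish_and_read(), 42);
    println!("reader observed the payload after the barrier");
}
```

Without the Release/Acquire pairing, the compiler or CPU would be free to reorder the two stores, and the reader could see the flag without the payload.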
The document discusses data modeling goals and examples for Cassandra. It provides guidance on keeping related data together on disk, avoiding normalization, and modeling time series data. Examples covered include mapping time series data points to Cassandra rows and columns, querying time slices, bucketing data, and eventually consistent transaction logging to provide atomicity. The document aims to help with common Cassandra modeling questions and patterns.
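The bucketing idea mentioned above boils down to deriving a row key from the timestamp so that no single row grows without bound and a time-slice query touches only the buckets it overlaps. A small sketch (hour-sized buckets and the key format are illustrative choices, not the document's):

```rust
// Time-series points are grouped into rows keyed by
// (series id, time bucket); bucket size here is one hour (3600 s).
fn bucket_row_key(series: &str, unix_ts: u64) -> String {
    let bucket = unix_ts - (unix_ts % 3600); // truncate to hour start
    format!("{}:{}", series, bucket)
}

fn main() {
    // Two points in the same hour land in the same row...
    assert_eq!(bucket_row_key("cpu", 7200), bucket_row_key("cpu", 7260));
    // ...while the next hour starts a new row.
    assert_ne!(bucket_row_key("cpu", 7200), bucket_row_key("cpu", 10800));
    println!("{}", bucket_row_key("cpu", 7260));
}
```

Within each bucket row, the point timestamps become the ordered column names, so a time slice is an ordered range read over one or a few rows.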
This document summarizes several Cassandra anti-patterns including:
- Using a non-Oracle JVM which is not recommended.
- Putting the commit log and data directories on the same disk which can impact performance.
- Using EBS volumes on EC2 which can have unpredictable performance and throughput issues.
- Configuring overly large JVM heaps over 16GB which can cause garbage collection issues.
- Performing large batch mutations in a single operation which risks timeouts if not broken into smaller batches.
A high level overview of common Cassandra use cases, adoption reasons, BigData trends, DataStax Enterprise and the future of BigData given at the 7th Advanced Computing Conference in Seoul, South Korea
The document summarizes a workshop on Cassandra data modeling. It discusses four use cases: (1) modeling clickstream data by storing sessions and clicks in separate column families, (2) modeling a rolling time window of data points by storing each point in a column with a TTL, (3) modeling rolling counters by storing counts in columns indexed by time bucket, and (4) using transaction logs to achieve eventual consistency when modeling many-to-many relationships by serializing transactions and deleting logs after commit. The document provides recommendations and alternatives for each use case.
strangeloop 2012 apache cassandra anti patternsMatthew Dennis
A random list of Apache Cassandra anti-patterns. There is a lot of info on what to use Cassandra for and how, but not a lot of information on what not to do. This presentation works towards filling that gap.
This document discusses best practices for running Cassandra on Amazon EC2. It recommends instance sizes like m1.xlarge for most use cases. It emphasizes configuring data and commit logs on ephemeral drives for better performance than EBS volumes. It also stresses the importance of distributing nodes across availability zones and regions for high availability. Overall, the document provides guidance on optimizing Cassandra deployments on EC2 through choices of hardware, data storage, networking and operational practices.
The document discusses planning for failure when building software systems. It notes that as software projects grow larger with more engineers, complexity and the potential for failures increases. The author discusses how the taxi app Hailo has grown significantly and now uses a service-oriented architecture across multiple data centers to improve reliability. Key technologies discussed include Zookeeper, Elasticsearch, NSQ, and Cruftflake which provide distributed and resilient capabilities. The importance of testing failures through simulation is emphasized to improve reliability.
Slides from my Planning to Fail talk given at PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was on how you can build resilient systems by embracing failure.
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.
Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880
Cassandra concepts, patterns and anti-patternsDave Gardner
The document discusses Cassandra concepts, patterns, and anti-patterns. It begins with an agenda that covers choosing NoSQL, Cassandra concepts based on Dynamo and Bigtable, and patterns and anti-patterns of use. It then delves into Cassandra concepts such as consistent hashing, vector clocks, gossip protocol, hinted handoff, read repair, and consistency levels. It also discusses Bigtable concepts like sparse column-based data model, SSTables, commit log, and memtables. Finally, it outlines several patterns and anti-patterns of Cassandra use.
Cassandra's data model is more flexible than typically assumed.
Cassandra allows tuning of consistency levels to balance availability and consistency. It can be made consistent when certain replication conditions are met.
Cassandra uses a row-oriented model where rows are uniquely identified by keys and group columns and super columns. Super column families allow grouping columns under a common name and are often used for denormalizing data.
Cassandra's data model is query-based rather than domain-based. It focuses on answering questions through flexible querying rather than storing predefined objects. Design patterns like materialized views and composite keys can help support different types of queries.
Unique ID generation in distributed systemsDave Gardner
The document discusses different strategies for generating unique IDs in a distributed system. It covers using auto-incrementing numeric IDs in MySQL, which are not resilient, and various solutions like UUIDs, Twitter Snowflake IDs, and Flickr ticket servers that generate IDs in a distributed and ordered way without coordination between data centers. It also provides code examples of generating Twitter Snowflake-like IDs in PHP without coordination using ZeroMQ.
1. Cassandra is a distributed, decentralized, and fault-tolerant NoSQL database that distributes data across nodes in a cluster to provide high availability and no single points of failure.
2. It is optimized for high write throughput and sacrifices consistency in favor of availability. Data is distributed across nodes and replicated for fault tolerance, with tunable consistency levels.
3. Cassandra is best for write-heavy workloads with large volumes of log or time series data where data is accessed by key and queries are satisfied by a single partition. It is not a general purpose database and lacks features like transactions and joins.
This document provides an overview of how Cassandra is used at odnoklassniki.ru to store photo marks data. Some key points:
Cassandra stores over 1.5TB of photo marks data across 8 nodes, with a replication factor of 2. The data is partitioned across 256 column families to distribute load evenly. Column bloom filters are used to optimize for the majority of queries being "NOT EXISTS". Uniform replication is used to distribute load uniformly in the case of node failures. The data model uses three column families - MarksByPhoto, MarksByOwner, and MarksUserIndex - to support common query patterns like totals by photo, marks by owner, and user cleanup.
The document discusses data-oriented design principles for game engine development in C++. It emphasizes understanding how data is represented and used to solve problems, rather than focusing on writing code. It provides examples of how restructuring code to better utilize data locality and cache lines can significantly improve performance by reducing cache misses. Booleans packed into structures are identified as having extremely low information density, wasting cache space.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra, and learn the basics of the massively scalable NoSQL database. This webinar will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2 including virtual nodes, the CQL 3 language and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2aaronmorton
This document provides an introduction to Apache Cassandra, including an overview of key concepts like the cluster, nodes, data model, and data modeling best practices. It discusses Cassandra's origins and popularity. The presentation covers the cluster architecture with consistent hashing and token ranges, replication strategies, consistency levels, and more. It also summarizes the Cassandra data model including tables, columns, SSTables, caching, compaction and discusses building a Twitter-like data model in CQL.
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications most of them are use case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
works as an IT-Consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark. Yet he does not lose track of other tools in the area of big data. Matthias shares his experiences on conferences, meetups and usergroups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He wrote a couple of journal articles and blog posts on subjects of both fields. His interests reach from legal questions to questions of architecture and design of cloud computing and big data systems to technical details of NoSQL databases.
This document provides an overview and introduction to ClickHouse, an open source column-oriented data warehouse. It discusses installing and running ClickHouse on Linux and Docker, designing tables, loading and querying data, available client libraries, performance tuning techniques like materialized views and compression, and strengths/weaknesses for different use cases. More information resources are also listed.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started... which soon becomes 50 distributed replica sets as volume increases. This talk describes how we designed a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, though, it has some limitations: millisecond time granularity and lack of atomic upsert operations which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. In this deep-dive talk, we explore how we've used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.
Querying federations of Triple Pattern FragmentsRuben Verborgh
This document discusses querying datasets using Triple Pattern Fragments (TPF), which enable low-cost federated querying over the web. TPF interfaces return partial RDF datasets matching a given triple pattern. Intelligent clients can decompose SPARQL queries into triple patterns and query multiple TPF servers in parallel to solve queries. This achieves high query performance and availability even with many clients, as TPF servers have lightweight query processing and clients handle query planning and execution. The document compares TPF federation to other federated querying systems.
Cassandra stands out amongst the big data products in its ability to handle optimized writes of large amounts of data while providing configurable fault tolerance and data integrity. Two popular libraries that allow the JVM developer to leverage these capabilities are Hector and the recently open sourced Astyanax. In this talk, Joe presents examples of storing time series data in a Cassandra data store using both of these libraries. There will be code! As an added bonus, a mechanism to unit test using an embedded Cassandra client will be presented.
Code can be downloaded from https://github.com/jmctee/Cassandra-Client-Tutorial
Riak is an open source NoSQL database that implements the principles of Amazon's Dynamo paper for distributed data storage. It uses an eventually consistent model and picks availability and partition tolerance over consistency in the CAP theorem. Data is stored in Riak as simple key-value pairs that can be any data type and accessed via a RESTful API using HTTP verbs like GET, PUT, POST, and DELETE.
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...DataStax Academy
Wait! Back away from the Cassandra 2ndary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadins videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With it’s easy Solr API and Spark integration so you can search and analyze data stored in your Cassandra database until your heart’s content. Take our hand. WE will show you how.
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperConnor McDonald
A look at the techniques that middle tier developers can employ to get greater value out of their applications, simply by having an understanding of how the database works and how to make it sing.
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceHeroku
Rob Sullivan took the stage at this year's Waza 2013 to present "Your Database: A Story of Indifference." For more from Rob, ping him at @datachomp.
For Waza videos stay tuned at http://blog.heroku.com or visit http://vimeo.com/herokuwaza.
Beyond PHP - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Cassandra, Modeling and Availability at AMUG
1. Conceptual Modeling Differences From A RDBMS
Matthew F. Dennis, DataStax // @mdennis
Austin MySQL User Group
January 11, 2012
2. Cassandra Is Not Relational
get out of the relational mindset when working
with Cassandra (or really any NoSQL DB)
3. Work Backwards From Queries
Think in terms of queries, not in terms of
normalizing the data; in fact, you often want to
denormalize (already common in the data
warehousing world, even in RDBMS)
4. OK great, but how do I do that?
Well, you need to know how Cassandra Models
Data (e.g. Google Big Table)
research.google.com/archive/bigtable-osdi06.pdf
Go Read It!
5. In Cassandra:
➔ data is organized into Keyspaces (usually one per app)
➔ each Keyspace can have multiple Column Families
➔ each Column Family can have many Rows
➔ each Row has a Row Key and a variable number of Columns
➔ each Column consists of a Name, Value and Timestamp
6. In Cassandra, Keyspaces:
➔ are similar in concept to a “database” in some RDBMSs
➔ are stored in separate directories on disk
➔ are usually one-one with applications
➔ are usually the administrative unit for things related to ops
➔ contain multiple column families
7. In Cassandra, in Keyspaces, Column Families:
➔ are similar in concept to a “table” in most RDBMSs
➔ are stored in separate files on disk (many per CF)
➔ are usually approximately one-one with query type
➔ are usually the administrative unit for things related to your data
➔ can contain many (~billion* per node) rows
* for a good sized node
(you can always add nodes)
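The Keyspace / Column Family / Row / Column hierarchy of slides 5-7 can be sketched as nested maps. This is only an illustration with invented ticker data, not Cassandra's actual storage engine:

```python
# Hypothetical sketch of the (pre-CQL) data model as nested maps:
# Keyspace -> Column Family -> Row Key -> {column name: (value, timestamp)}
keyspace = {
    "ticks": {                              # column family (like a table)
        "INTC": {                           # row key
            "ts0": ("$25.20", 1326240000),  # column = name, value, timestamp
            "ts1": ("$25.25", 1326240060),
        },
    },
}

# Reading one column is a walk down the hierarchy:
value, ts = keyspace["ticks"]["INTC"]["ts0"]
print(value)  # $25.20
```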
13. thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Rows Are Randomly Ordered
(if using the RandomPartitioner)
14. thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Columns Are Ordered by Name
(by a configurable comparator)
15. Columns are ordered because
doing so allows very efficient
implementations of useful and
common operations
(e.g. merge joins)
16. In particular, within a row I can
find given columns by name very
quickly (ordered names => log(n)
binary search).
17. More importantly, I can query for a slice between a start and end
[row diagram: row key RK with columns ts0, ts1, ..., tsM, ..., tsN, ...; “start” and “end” mark the slice boundaries]
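Because column names are kept sorted, both the point lookup of slide 16 and the slice of slide 17 reduce to binary search. A minimal sketch using Python's bisect module (the column names and values are invented):

```python
import bisect

# Columns of one row, kept sorted by name
# (as Cassandra's comparator keeps them on disk).
names  = ["ts0", "ts1", "ts2", "ts5", "ts9"]
values = ["$25.20", "$25.25", "$25.30", "$6.82", "$0.26"]

# Slide 16: point lookup by column name is an O(log n) binary search.
i = bisect.bisect_left(names, "ts5")
assert names[i] == "ts5" and values[i] == "$6.82"

# Slide 17: a slice query locates start and end, then reads contiguously.
lo = bisect.bisect_left(names, "ts1")    # first column >= start
hi = bisect.bisect_right(names, "ts5")   # one past the last column <= end
slice_cols = list(zip(names[lo:hi], values[lo:hi]))
print(slice_cols)  # [('ts1', '$25.25'), ('ts2', '$25.30'), ('ts5', '$6.82')]
```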
18. Why does that matter?
Because columns within a row aren't static!
19. The Column Name Can Be Part of Your Data
INTC ts0: $25.20 ts1: $25.25 ...
AMR ts0: $6.20 ts9: $0.26 ...
CRDS ts0: $1.05 ts5: $6.82 ...
Columns Are Ordered by Name
(in this case by a TimeUUID Comparator)
20. Turns Out That Pattern Comes Up A Lot
➔ stock ticks
➔ event logs
➔ ad clicks/views
➔ sensor records
➔ access/error logs
➔ plane/truck/person/”entity” locations
➔…
21. OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
22. How it Looks In a RDBMS
ticker | timestamp | bid | ask | ...
AMR    | ts0       | ... | ... | ...   <- data I care about
...    | ...       | ... | ... | ...
CRDS   | ts0       | ... | ... | ...
...    | ...       | ... | ... | ...
AMR    | ts1       | ... | ... | ...   <- data I care about
...    | ...       | ... | ... | ...
AMR    | ts2       | ... | ... | ...   <- data I care about
...    | ...       | ... | ... | ...
23. How it Looks In a RDBMS
ticker | timestamp | bid | ask | ...
AMR    | ts0       | ... | ... | ...
  (rows in between: larger than your page size => disk seeks)
AMR    | ts1       | ... | ... | ...
  (rows in between: larger than your page size => disk seeks)
AMR    | ts2       | ... | ... | ...
24. OK, but what about ...
➔ PostgreSQL Cluster Command?
➔ MySQL Cluster Indexes?
➔ Oracle Index Organized Tables?
➔ SQLServer Clustered Index?
25. OK, but what about ...
➔ PostgreSQL Cluster Using? Meh ...
➔ MySQL [InnoDB] Cluster Indexes?
➔ Oracle Index Organized Table?
➔ SQLServer Clustered Index? (seriously, who uses SQLServer?!)
26. The on-disk management of that clustering results in tons of IO …
In the case of PostgreSQL:
➔ clustering is a one time operation (implies you must periodically rewrite the entire table)
➔ new data is *not* written in clustered order (which is often the data you care most about)
28. Not a bad idea, except in MySQL there is a limit of
1024 partitions and generally less if using NDB
(you should probably still do it if using MySQL though)
http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
29. OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling?
[row diagram: seek once to the row (RK: ts0 ts1 ... tsM ... tsN ...), then read precisely my data *]
* more on some caveats later
30. Well, that's what is meant by “work backwards
from your queries” or “think in terms of queries”
(NB: this concept, in general, applies to RDBMS
at scale as well; it is not specific to Cassandra)
31. An Example From Fraud Detection
To calculate risk it is common to need to know all the
emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
32. In a normalized model that usually translates to a table for each type of entity being tracked
accounts:     (id, name, ...)   -> (1, guy, ...), (2, gal, ...)
devices:      (id, device, ...) -> (1000, 0xdead, ...), (2000, 0xb33f, ...)
destinations: (id, dest, ...)   -> (15, USA, ...), (25, Finland, ...)
emails:       (id, email, ...)  -> (100, guy@, ...), (200, gal@, ...)
origins:      (id, origin, ...) -> (150, USA, ...), (250, Nigeria, ...)
33. The problem is that at scale that also means a disk seek for each one …
(even for perfect IOT et al if across multiple tables)
➔ Previous emails? That's a seek …
➔ Previous devices? That's a seek …
➔ Previous destinations? That's a seek ...
34. But In Cassandra I Store The Data I Query
Together On Disk Together
(remember, column names need not be static)
Data I Care About:
acctY → ...    ...  ...  ...    ...    ...
acctX → dest21 dev2 dev7 email3 email9 orig4 ...
acctZ → ...    ...  ...  ...    ...    ...
(each column is Name = Value, e.g. “email:cassandra@mailinator.com” = dateEmailWasLastUsed)
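The contrast in slides 32-34 fits in a few lines (the entity names are invented for illustration): the normalized model needs one lookup per entity table, while the denormalized wide row answers the whole question in one read.

```python
# Normalized model: one "table" per entity type, so answering
# "everything acctX ever used" touches every table --
# at scale, roughly one disk seek each.
emails  = {"acctX": ["email3", "email9"]}
devices = {"acctX": ["dev2", "dev7"]}
dests   = {"acctX": ["dest21"]}

profile = [c for table in (dests, devices, emails) for c in table["acctX"]]

# Denormalized wide row: the same facts stored together under one row key
# (the column *names* carry the data, as in slide 34), so one seek reads all.
wide_row = {"acctX": ["dest21", "dev2", "dev7", "email3", "email9"]}

assert sorted(profile) == sorted(wide_row["acctX"])
print(wide_row["acctX"])
```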
35. Don't treat Cassandra (or any DB) as a black box
➔ Understand how your DBs (and data structures) work
➔ Understand the building blocks they provide
➔ Understand the work complexity (“big O”) of queries
➔ For data sets > memory, goal is to minimize seeks *
* on a related note, SSDs are awesome
37. Availability Has Many Levels
➔ Component Failure (disk)
➔ Machine Failure (NIC, cpu, power supply)
➔ Site Failure (UPS, power grid, tornado)
➔ Political Failure (war, coup)
42. Row Key Determines Node
[ring diagram: nodes t0, t1, t2, t3; MD5(RK) => T, and here t3 < T < 2^127, so t0 is the First Replica]
43. Walk The Ring To Find Subsequent Replicas *
MD5(RK) => T
[ring diagram: first replica t0 (t3 < T < 2^127); walking the ring clockwise, the second replica is t1]
* by default
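Slides 42–43 can be sketched in Python (assumptions: an evenly spaced four-node ring t0–t3, an MD5 token space as with RandomPartitioner, and simple clockwise replica selection as in the default placement strategy):

```python
import hashlib
from bisect import bisect_left

RING_MAX = 2 ** 127
# hypothetical evenly spaced ring: each node owns the range ending at its token
TOKENS = sorted((i * RING_MAX // 4, "t%d" % i) for i in range(4))

def token_for(row_key):
    # the row key hashes (MD5) to a point T on the 0..2^127 ring
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % RING_MAX

def replicas(row_key, rf=2):
    t = token_for(row_key)
    ring = [tok for tok, _ in TOKENS]
    # first replica: the next token clockwise from T, wrapping past 2^127
    # (so t3 < T < 2^127 lands on t0, as in the diagram)
    i = bisect_left(ring, t) % len(TOKENS)
    # subsequent replicas: keep walking the ring
    return [TOKENS[(i + k) % len(TOKENS)][1] for k in range(rf)]
```

With rf=2, `replicas` returns the first replica and its clockwise neighbor, matching the "walk the ring" rule.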
44. Writes Happen In Parallel To All Replicas
[ring diagram: the client sends RK = ... to coordinator t2 (not a master), which forwards the write in parallel to first replica t0 and second replica t1]
45. Some Or All Replicas Respond
[ring diagram: replicas return "ok" to the coordinator; one node or path is marked failed (X)]
Coordinator Waits For Ack(s)
From Destination Node(s)
46. The Coordinator Responds To Client
[ring diagram: having received "ok" from the replicas, the coordinator returns "ok" to the client]
Coordinator Waits For Ack(s)
From Destination Node(s)
47. What Nodes Can Be A Coordinator?
The coordinator for any given read or
write is really just whatever node the
client connected to for that request
any node for any request at any time
48. How Many Replicas Does The
Coordinator Wait For?
➔ configurable, per query
➔ ONE / QUORUM are the most common
(more on this in a moment)
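The arithmetic behind these choices can be sketched in Python (a generic R + W > RF overlap check, not driver code):

```python
# With replication factor RF, a write waiting for W acks and a read
# waiting for R responses are guaranteed to overlap in at least one
# replica whenever R + W > RF -- that overlap is what makes
# QUORUM reads of QUORUM writes consistent.

def quorum(rf):
    # a strict majority of the replicas
    return rf // 2 + 1

def overlaps(r, w, rf):
    # do the read set and write set necessarily share a replica?
    return r + w > rf

RF = 3
# QUORUM + QUORUM always overlaps:
assert overlaps(quorum(RF), quorum(RF), RF)
# ONE + ONE need not, which is why CL.One reads can return stale data:
assert not overlaps(1, 1, RF)
```

The same check explains the mixed case: writing at ONE and reading at ALL (R = RF) also overlaps, trading read latency for write latency.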
49. Writing At CL.One
[ring diagram: the write is sent to first, second, and third replicas; one replica is down (X)]
Wait For At Least One Node
(eventually all nodes get updates)
50. Writing At CL.One
[ring diagram: one replica acks "ok" and the coordinator returns "ok" to the client, even though a replica is down (X)]
Wait For At Least One Node
(eventually all nodes get updates)
51. Reading At CL.One
[ring diagram: the read is sent to the replicas; one replica is down (X)]
Wait For At Least One Node
(so you might read stale data)
52. Reading At CL.One
[ring diagram: a single replica answers "old" and the coordinator returns "old" to the client]
Wait For At Least One Node
(so you might read stale data)
53. Writing At CL.Quorum
[ring diagram: the write is sent to first, second, and third replicas; one replica is down (X)]
Wait For Majority Of Nodes
(eventually all nodes get updates)
54. Writing At CL.Quorum
[ring diagram: two replicas ack "ok" and the coordinator returns "ok" to the client, even though a replica is down (X)]
Wait For Majority Of Nodes
(eventually all nodes get updates)
55. Reading At CL.Quorum
[ring diagram: the read is sent to the three replicas; one replica is down (X)]
Wait For Majority Of Nodes
(majority => overlap => consistent)
56. Reading At CL.Quorum
[ring diagram: the surviving replicas return differing versions ("ok" and "old"); the coordinator chooses the response to the client based on the client-supplied per-column timestamps]
Wait For Majority Of Nodes
(majority => overlap => consistent)
57. Reading At CL.Quorum
[ring diagram: the coordinator already has its response for the client; the "current" version is pushed to the stale replica]
Read Repair Updates Stale Nodes
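The reconciliation step on slides 56–57 can be sketched like this (a toy model; `reconcile` is a hypothetical helper for illustration, not a Cassandra API):

```python
def reconcile(replica_responses):
    """For one column, pick the winning version by client-supplied timestamp.

    replica_responses: {node: (value, timestamp)}
    Returns the winning (value, timestamp) and the stale nodes that
    read repair should update with the winner.
    """
    winner = max(replica_responses.values(), key=lambda vt: vt[1])
    stale = [n for n, vt in replica_responses.items() if vt != winner]
    return winner, stale

# two replicas answer a quorum read with differing versions
responses = {"t0": ("current", 200), "t1": ("old", 100)}
(value, ts), stale_nodes = reconcile(responses)
# the coordinator returns "current" to the client, then read repair
# pushes ("current", 200) to t1
```

The key detail from slide 56 is that the timestamps are client-supplied per column, so "newest" is decided by the writers, not by replica clocks.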
59. Is The Same As A Lost Request
t0
X
RK = ...
t3
* In Regards To Meeting Consistency
60. Which Is The Same As A Failed/Slow Node
X
t0
RK = ...
t3
* In Regards To Meeting Consistency
61. In fact, it is actually impossible for the originator
to reliably distinguish between the three cases
62. One More Important Piece:
writes are idempotent *
* except with the counter API, but if you want that it can be done
63. Why is that important?
It means we can replay/retry writes, even late
and/or out of order, and get the same results
➔ After/during node failures
➔ After/during network partitions
➔ After/during upgrades
64. In other words you can concurrently issue
conflicting updates to two different nodes while
those nodes have no communication between them
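A small Python sketch of why this works: with last-write-wins per column (by client-supplied timestamp), replaying the same writes late, duplicated, or in any order converges to one state (the write tuples and `apply` helper here are invented for illustration):

```python
import itertools

def apply(state, write):
    # last-write-wins per column: keep the version with the highest timestamp
    col, val, ts = write
    if col not in state or ts > state[col][1]:
        state[col] = (val, ts)
    return state

writes = [("email", "a@x", 1), ("email", "b@x", 3), ("dest", "USA", 2)]

# replay the writes in every possible order, with every write duplicated,
# as if two partitioned nodes each accepted them and later exchanged them
results = set()
for perm in itertools.permutations(writes + writes):
    state = {}
    for w in perm:
        apply(state, w)
    results.add(tuple(sorted(state.items())))

# every ordering converges to the same single state
assert len(results) == 1
```

This is the property slide 64 relies on: two nodes can accept conflicting updates while partitioned, and once writes flow between them again, both converge without coordination.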
66. Availability Has Many Levels
➔ Component Failure (disk)
➔ Machine Failure (NIC, cpu, power supply)
➔ Site Failure (UPS, power grid, tornado)
➔ Political Failure (war, coup)
67. If you care about global availability you must
serve reads and writes from multiple data centers
There is no way around this