Abstract:
Cassandra is a new kind of database: it is more than a single-machine system. It naturally runs in a High-Availability configuration. All nodes in the system are symmetric; there is no single point of failure. As you add machines, failure becomes routine, and Cassandra is built to tolerate that with no interruptions.
Cassandra is linearly scalable with good performance characteristics for very small and very large data stores. Unlike earlier efforts, Cassandra is more than just a key-value store; it is a structured data store which can facilitate complex use cases and queries. Cassandra allows for random access to your data organized into rows and columns.
Cassandra is different, and exciting. This presentation will discuss the pros and cons of using Cassandra, and why it has seen such amazing adoption in the past year.
Bio:
Ben Coverston is Director of Operations at DataStax (formerly known as Riptano), a provider of software, support, services, training, resources and help for Cassandra. He has been involved in enterprise software his entire career. Working in the airline industry, he helped to build some of the highest-volume online booking sites in the world. He saw firsthand the consequences of trying to solve real-world scalability problems at the limit of what traditional relational databases are capable of.
- Understanding Time Series
- What's the Fundamental Problem?
- Prometheus Solution (v1.x)
- New Design of Prometheus (v2.x)
- Data Compression Algorithm
Storing time series data with Apache Cassandra - Patrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice, and now you can learn how to do it. We'll look at possible data models and the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work, and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise - Patrick McFadin
Wait! Back away from the Cassandra 2ndary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Further discussion on Data Modeling with Apache Cassandra. Overview of formal data modeling techniques as well as practical ones. Real-world use cases and associated data models.
Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues, a.k.a. Column Stride Fields, is one of the "next generation" features.
C* Summit 2013: Cassandra at Instagram by Rick Branson - DataStax Academy
Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent... - DataStax Academy
Wait! Back away from the Cassandra 2ndary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... - StampedeCon
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Managing large volumes of data isn't trivial and needs a plan. Fast Data is how we describe the nature of data in a heavily consumer-driven world. Fast in. Fast out. Is your data infrastructure ready? You will learn some important reference architectures for large-scale data problems. Three main areas are covered:
Organize - Manage the incoming data stream and ensure it is processed correctly and on time. No data left behind.
Process - Analyze volumes of data you receive in near real-time or in a batch. Be ready for fast serving in your application.
Store - Reliably store data in the data models to support your application. Never accept downtime or slow response times.
Cassandra Data Modeling - Practical Considerations @ Netflix - nkorla1share
The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
Cassandra Community Webinar: Back to Basics with CQL3 - DataStax
Cassandra is a distributed, massively scalable, fault tolerant, columnar data store, and if you need the ability to make fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data, and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Functional data models are great, but how can you squeeze out more performance and make them awesome? Let's talk through some example models, go through the tuning steps and understand the tradeoffs. Many times, just a simple understanding of the underlying internals can make all the difference. I've helped some of the biggest companies in the world do this and I can help you. Do you feel the need for Cassandra 2.0 speed?
Presentation by Marco Slaviero at BlackHat USA in 2010.
This presentation is about mining information from memcached. The presentation begins with a brief introduction to memcached. go-derper.rb, a tool developed by the presenter for hacking memcached servers, is introduced and a few memcached mining examples are given. The presentation ends with a brief discussion of serialized objects exposed in the cache.
Cassandra Summit 2014: Cassandra at Instagram 2014 - DataStax Academy
Presenter: Rick Branson, Infrastructure Engineer at Instagram
As Instagram has scaled to over 200 million users, so has our use of Cassandra. We've built new features and rebuilt old on Cassandra, and it's become an extremely mission-critical foundation of our production infrastructure. Rick will deliver a refresh of our use cases and go deep on the technical challenges we faced during our expansion.
Cassandra Community Webinar | In Case of Emergency Break Glass - DataStax
The design of Apache Cassandra allows applications to provide constant uptime. Peer-to-Peer technology ensures there are no single points of failure, and the Consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, Developers and Managers will all benefit from this exposition of Cassandra in the wild.
Advanced Percona XtraDB Cluster in a nutshell... la suite (PLSC 2016) - Frederic Descamps
This is a tutorial I gave with my colleague Kenny Gryp at Percona Live 2016 in Santa Clara.
Percona XtraDB Cluster is a high availability and high scalability solution for MySQL clustering. Percona XtraDB Cluster integrates Percona Server with the Galera synchronous replication library in a single product package, which enables you to create a cost-effective MySQL cluster.
For three years at Percona Live, we've introduced people to this technology... but what's next? This tutorial continues your education, and targets users that already have experience with Percona XtraDB Cluster and want to go further.
This tutorial will cover the following topics:
- Bootstrapping in detail
- Certification errors: understanding and preventing them
- Replication failures: how to deal with them
- Secrets of Galera Cache
- Mastering flow control
- Understanding and verifying replication throughput
- How to use WAN replication
- Implications of consistent reads
- Backups
- Load balancers and proxy protocol
Renegotiating the boundary between database latency and consistency - ScyllaDB
With the increasing complexity of modern distributed systems, concerns around latency, availability, and consistency have become almost 'universal'. In response, a new generation of distributed databases is taking over: databases capable of harnessing the power and capabilities of the multi-cloud ecosystem. This new generation of distributed databases is challenging many of the traditional tradeoffs between relational and non-relational models.
This webinar will explore the technologies and trends behind this new generation of distributed databases, then take a technical deep dive into one example: the open source non-relational database ScyllaDB. ScyllaDB was built specifically for extreme low latencies, but has recently increased consistency by implementing the Raft consensus protocol. Engineers will share how they are implementing a low-latency architecture, and how strongly consistent topology and schema changes enable highly reliable and safe systems, without sacrificing low-latency characteristics.
Slides for the talk "Cassandra and Spark: Love at First Sight" given at Texas Linux Fest 2015. Gives an introduction to both Cassandra and Spark and how they work together.
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv... - MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
MongoDB: Optimising for Performance, Scale & Analytics - Server Density
MongoDB is easy to download and run locally but requires some thought and further understanding when deploying to production. At scale, schema design, indexes and query patterns really matter. So does data structure on disk, sharding, replication and data centre awareness. This talk will examine these factors in the context of analytics, and more generally, to help you optimise MongoDB for any scale.
Presented at MongoDB Days London 2013 by David Mytton.
Distributed Database Consistency: Architectural Considerations and Tradeoffs - ScyllaDB
With the increasing complexity of modern distributed systems, concerns around latency, availability, and consistency have come to the forefront. In response, a new generation of distributed databases is taking over: databases capable of harnessing the power and capabilities of the multi-cloud ecosystem. This new generation of distributed databases is challenging many of the traditional tradeoffs between relational and non-relational models.
This webinar will explore the technologies and trends behind this new generation of distributed databases, then take a technical deep dive into one example: ScyllaDB. ScyllaDB was built specifically for extreme low latencies, but has recently increased consistency by implementing the Raft consensus protocol. Engineers will share how they are implementing a low-latency architecture, and how strongly consistent topology and schema changes enable highly reliable and safe systems, without sacrificing low-latency characteristics.
Presentation at the December meetup of the Silicon Valley Cassandra users group. Summarizes how the NASA supercomputer center at Ames is currently using a Cassandra cluster.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendations systems (collaborative filtering).
The talk was recorded and is available on youtube: https://www.youtube.com/watch?v=_gFgU3phogQ
Persistent Data Structures - partial::Conf - Ivan Vergiliev
The slides from my talk on Persistent Data Structures at http://partialconf.com/ . The "Implementation" part assumes a bit of prior knowledge on how persistent data structures work, but the rest should be generally accessible.
ScyllaDB V Developer Deep Dive Series: Resiliency and Strong Consistency via ... - ScyllaDB
ScyllaDB’s implementation of the Raft consensus protocol translates to strong, immediately consistent schema updates, topology changes, tables and indexes, and more. This eliminates schema and data conflicts, enables rapid and safe increases in cluster capacity, and provides a leap forward in manageability. Join this webinar to learn how the Raft consensus algorithm has been implemented, what you can do with it today, and what radical new capabilities it will enable in the days ahead.
1. Ben Coverston, Director of Operations, ben.coverston@datastax.com
Hosted By: Matthew O'Keefe, MorningStar
2. History
• Open Sourced by FB in July 2008
• Apache Incubator March 2009
• Graduated March 2010
• Riptano Founded April 2010
• First Summit August 2010
• Riptano Changed to DataStax January 2011
3. You Changed Your Name? Why!?
• Suits
– Marketing
– Relevancy
– Riptano too "Skateboard"
• The Real Reason?
– "The X makes it sound cool." – Bender Bending Rodriguez, Futurama
4. Strengths
• Scalable
• Reliable
– Replication that works
– Multi-DC Support
– No Single Point of Failure
• Analytics in the same system as OLTP (with "integrated" Hadoop support)
5. Weaknesses
• No ACID Transactions
• Limited Support for (OLTP) ad-hoc queries
• ..but you lost that when you started to shard your relational system.
6. A Short History of Big Data (Or Why Cassandra)
• Relational databases scale poorly
• B-trees are slow
– ..and require read before write.
– ..hope your dataset fits in memory
14. What do we end up with? ("The eBay Architecture," Randy Shoup and Dan Pritchett)
15.
16. BASE
• BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
– Dan Pritchett – NoSQL Pioneer, eBay Engineer
http://queue.acm.org/detail.cfm?id=1394128
17. Myth
• Lack of ACID means that I have to give up transactional guarantees and consistency.
• Paraphrasing: At Netflix we tend to be optimistic. When things don't quite work out we try again. – Siddharth Anand
• Achievable
18. Cassandra In Production
• Netflix: Streaming Bookmarks
• Digital Reasoning: NLP & Entity Analytics
• OpenX: largest publisher-side ad network
• Cloudkick: performance data & aggregation
• SimpleGeo: location-as-API
• Ooyala: video analytics and business intelligence
• ngmoco: massively multiplayer online game worlds
• Kosmix: social media aggregation
• Reddit: vote tracking system
• Twitter: Rainbird, geo data, analytics
• … lots more
19. Who is investing in Cassandra?
• DataStax
• Twitter:
– We're investing in Cassandra every day. It'll be with us for a long time and our usage of it will only grow.
• Rackspace
• > 100 different individuals have submitted patches to C*
• You?
20. Durability
• Write to Commit Log
– fsync is cheap (append only)
– Latency is only subject to rotational latency
• Separate partition (no seeking)
• SSD won't hurt, but it may not help either.
• Write to memtable
• Flush memtable to SSTable
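The write path on this slide can be sketched as a toy simulation. This is purely illustrative Python (the class name and memtable limit are invented, not Cassandra code): append to the commit log first, update the in-memory memtable, and flush full memtables to immutable, sorted SSTables.

```python
# Toy sketch of the write path (illustration only, not the real
# implementation): commit log first, then memtable, then flush.

class ToyStore:
    def __init__(self, memtable_limit=3):
        self.commit_log = []      # append-only: durable before acking a write
        self.memtable = {}        # in-memory, mutable
        self.sstables = []        # immutable, sorted by key once written
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # a real system fsyncs here
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTables are written once, in key order, never updated in place
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # the newest SSTable holds the most recent flushed value
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

The append-only commit log is why fsync stays cheap: writes never seek, they only extend the file.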
29. Replication
• Simple Replication Strategy
• Network Topology Strategy
– How many replicas in each datacenter for each keyspace?
– Generalization of Rack Aware Strategy
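A rough sketch of how the simple (rack-unaware) strategy places replicas. The helper names here are invented for illustration, and Cassandra's real partitioners and strategy classes differ in detail: hash the key onto a token ring and take the next `rf` nodes clockwise.

```python
# Toy replica placement for a simple replication strategy (illustration
# only). Each node owns one token on a ring; a key hashes to a token and
# its replicas are the next rf nodes clockwise (assumes rf <= node count).
import hashlib

RING_SIZE = 2 ** 32

def key_token(key):
    # stable hash of the key onto the ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas(key, node_tokens, rf):
    """node_tokens maps node name -> token; returns the rf replica nodes."""
    ring = sorted(node_tokens.items(), key=lambda kv: kv[1])
    t = key_token(key)
    # first node whose token is >= the key's token, wrapping around to 0
    start = next((i for i, (_, tok) in enumerate(ring) if tok >= t), 0)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]
```

Network Topology Strategy generalizes this by running a walk like the one above per datacenter, so each DC gets its own configured replica count.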
33. Reliability
• No Single Points of Failure
• Multiple Datacenters
• Monitorable
– JMX (or whatever plugs into it – lots of counters)
– Cacti
– Munin
– Nagios
34. Expectation of Failure
• C* is designed to fail
• No "Clean Shutdown"
• kill -9, it's ok.
56. I can has smarter clients?
• Don't use thrift directly
• Higher level clients have a lot of features you want
– Knowledge about data types
– Connection pooling
– Automatic retries
– Logging
58. Raw thrift API: Inserting

import time
from cassandra.ttypes import (Column, ColumnOrSuperColumn,
                              Mutation, ConsistencyLevel)

data = {'id': useruuid, ...}
columns = [Column(k, v, time.time())
           for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c))
             for c in columns]
rows = {useruuid: {'User': mutations}}
client.batch_mutate('Twissandra', rows,
                    ConsistencyLevel.ONE)
61. Language support
• Python
– pycassa
– telephus
• Ruby
– Speed is a negative
• Java
– Hector
• PHP (soon with less suckage!)
62. Done yet?
• Still doing 1+N queries per page
• Solution: Supercolumns
• Err.. Well maybe…
63. Supercolumns: limitations
• Requires reading an entire SC (not the entire row) from disk even if you just want one subcolumn
• No Secondary Indexes
• It's just an extra map layer.
• Probably best to avoid them if you can.
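The "extra map layer" point is easy to see in plain Python: a supercolumn family is essentially one more level of nesting, and reading any subcolumn means materializing its whole supercolumn (the row key and names below are made up for illustration).

```python
# A supercolumn family is one extra map layer:
# row key -> supercolumn name -> subcolumn name -> value.
row = {
    'user123': {                    # row key
        'followers': {              # supercolumn
            'alice': '2010-04-01',  # subcolumn -> value
            'bob': '2010-05-12',
        },
    },
}

# The limitation: fetching even one subcolumn deserializes the whole
# supercolumn, e.g. all of 'followers' just to read 'alice'.
followers = row['user123']['followers']
alice = followers['alice']
```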
64. UUIDs
• Column names should be uuids, not longs, to avoid collisions
• Version 1 UUIDs can be sorted by time ("TimeUUID")
• Any UUID can be sorted by its raw bytes ("LexicalUUID")
– Usually Version 4
– Slightly less overhead
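The two orderings can be demonstrated with Python's standard uuid module (illustrative only):

```python
# Version 1 UUIDs embed a 60-bit, 100-nanosecond timestamp, so they can be
# ordered by creation time ("TimeUUID"); any UUID can also be ordered by
# its raw bytes ("LexicalUUID").
import uuid

ids = [uuid.uuid1() for _ in range(5)]         # version 1: time-based

by_time = sorted(ids, key=lambda u: u.time)    # u.time: embedded timestamp
by_bytes = sorted(ids, key=lambda u: u.bytes)  # byte order, unrelated to time
```

Byte order is what a comparator sees for version 4 UUIDs, which is why they sort stably but carry no useful time ordering.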
66. Lucandra
• What documents contain term X?
• … and term Y?
• … or start with Z?
67. FAQ: counting
• UUIDs + batch process
• Mutex (contrib/mutex or "cages")
• Use redis or mysql or memcached
• column-per-app-server
• counter API (after .7 is out)
68. Tips
• Insert instead of check-then-insert
• Use client-side clock to your advantage
• use TTL
• Wider rows (but not too wide)
• Start with queries, work backwards
• Avoid storing extra "timestamp" columns
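Two of these tips, insert-instead-of-check-then-insert and using the client-side clock, can be sketched with a toy last-write-wins store. The names are invented for illustration and the TTL handling is heavily simplified:

```python
# Toy last-write-wins column store (illustration only): writes carry a
# client-supplied timestamp and the newest timestamp wins, so a plain
# insert replaces check-then-insert; a TTL expires columns automatically.
import time

class ToyColumnFamily:
    def __init__(self):
        self.cols = {}  # name -> (value, write_timestamp, expires_at or None)

    def insert(self, name, value, timestamp=None, ttl=None):
        ts = timestamp if timestamp is not None else time.time()
        current = self.cols.get(name)
        if current is None or ts >= current[1]:   # last write wins
            expires = (time.time() + ttl) if ttl is not None else None
            self.cols[name] = (value, ts, expires)

    def get(self, name):
        entry = self.cols.get(name)
        if entry is None:
            return None
        value, _, expires = entry
        if expires is not None and time.time() >= expires:
            del self.cols[name]   # expired: behaves as if never written
            return None
        return value
```

Because the highest client timestamp wins regardless of arrival order, there is no need to read before writing, which is exactly why insert beats check-then-insert.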