NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet-scale focus areas. With the advent of the cloud, the churn has increased even more, yet there is no crystal-clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys and wherefores, architectural patterns, caveats and techniques that will augment your decision-making process and boost your perception of architecting scalable, fault-tolerant and distributed solutions.
This presentation of the popular NoSQL database Apache Cassandra was created by our team in the context of the module "Business Intelligence and Big Data Analysis".
Demystifying Benchmarks: How to Use Them To Better Evaluate Databases - Clustrix
When looking for the “right” RDBMS for your application, there are many variables you need to consider to ensure you make the right choice. Not all databases are created equal, and you are inevitably going to come across some performance benchmark statistics when evaluating your options. There are a confusing variety of published benchmarks out there: YCSB, Sysbench with a variety of different versions and transaction mixes like 95:5 or 50:50, and others. What do these all mean? How do they relate to what I am trying to accomplish with my application? Our benchmarking guru, Peter Friedenbach, unraveled the mysteries for you at Percona Live 2017. These slides are an outline of Friedenbach's presentation which explained what these different benchmarks measure, why they matter, and which ones best apply to your particular use-case – to arrive at a more scientific selection of the database that’s right for your needs. Please reach out to Clustrix for a recording of the presentation.
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL? - Clustrix
At Clustrix, we think sharding is like stepping in quicksand. Once you make that step, you are stuck constantly maintaining it.
If you are trying to decide to shard or not to shard your MySQL database, or if you are just sick of living with sharding, give our webinar a listen. We’ll walk you through how to think about the problem at hand, and how to avoid getting mired in that quicksand down the road by answering these questions:
- Why do DBAs think sharding is the only end-game?
- What are the long-term costs of sharding?
- What is a better alternative to sharding MySQL?
- How real is it? Is it too good to be true?
View the webcast of this Tech Talk on our YouTube channel.
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env... - Clustrix
For high-value, high-throughput sites, downtime can cost hundreds of thousands to millions of dollars. Service architectures have baked lots of resiliency into apps, but databases and their system of record design are often vulnerable to single points of failure, bringing down entire systems. Worse still, when the database is recovered, there can be missing data. How many database transactions can your workload handle losing if your primary database goes down?
There are many strategies to minimize MySQL downtime, usually using replication and redundant hardware. Often these systems involve some manual intervention and potential downtime as failover protocols take hold. Also, these strategies may be expensive and require redundant hardware.
At Clustrix, we think there are alternative strategies that may be a better fit for modern apps in a MySQL environment.
In our final Tech Talk in this series on scaling MySQL, we evaluate multiple HA strategies. We also discuss the following topics:
- The difference between fault tolerance and high availability
- Best practices for achieving high availability with MySQL
- What are the costs of achieving HA? What can be the most cost-effective strategy?
- How is it possible to survive a multi-node failure in MySQL?
View the webcast of this Tech Talk on our YouTube channel.
Tech Talk Series, Part 3: Why is your CFO right to demand you scale down MySQL? - Clustrix
Many web businesses enjoy a spike in traffic at some point in the year. Whether it's Black Friday, the NFL draft day, or Mother’s Day, your app needs to be able to scale and capture customer value when it is most needed. Downtime is not an option.
For a database, that means having enough capacity to ensure transaction latency stays within acceptable limits. For high capacity apps using MySQL, this means you may need to deploy triple the normal capacity usage to sustain traffic for one day. But what do you do with that hardware for the rest of the year? Do you leave it idling? That unused capacity is costing you an arm and a leg, and wasted expenses make CFOs grumpy.
In Part 3 of our Tech Talk series, we discuss what the options are for scaling down MySQL, as well as explore answers to the following questions:
- How do I figure out the costs of not scaling down?
- How does ClustrixDB scale-down differently than MySQL?
- How real is elastically scaling in ClustrixDB? What are the catches?
View the webcast of this Tech Talk on our YouTube channel.
Basic introduction to Cassandra, its architecture and strategies, and the big data challenge. What is a NoSQL database?
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
Relational databases vs Non-relational databases - James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Database Architecture & Scaling Strategies, in the Cloud & on the Rack - Clustrix
Watch the recording here: https://www.youtube.com/watch?v=ZwERp38ynxQ&feature=youtu.be
In this webinar, Robbie Mihayli, VP of Engineering at Clustrix explores how to set up a SQL RDBMS architecture that scales out and is both elastic and consistent, while simultaneously delivering fault tolerance and ACID compliance.
He also covers how data gets distributed in this architecture, how the query processor works, how rebalancing happens and other architectural elements. Examples cited include cloud deployments and e-commerce use-cases.
In this webinar, you will learn:
1. Five RDBMS scaling strategies along with their trade offs
2. The importance of having no single point of failure for OLTP (fault tolerance)
3. The vagaries of the cloud and how it impacts using an RDBMS in the cloud
Who should watch?
1. People interested in high performance, real-time database solutions
2. Companies who have MySQL in their infrastructure and are concerned that their growth will soon overwhelm MySQL’s single-box design
3. DBAs who implement ‘read slaves’, ‘multiple-masters’ and ‘sharding’ for MySQL databases and want to learn about better ways to scale
Daniel Abadi Keynote at EDBT 2013
This talk discusses: (1) Motivation for HadoopDB (2) Overview of HadoopDB (3) Lessons learned from commercializing HadoopDB into Hadapt (4) Ideas for overcoming the loading challenge (Invisible Loading)
At Signal we've been running Apache Cassandra in production since late 2011. We use a multi-region Cassandra deployment to make our data available globally to our customers. While Cassandra does much of the heavy lifting for us, we've run into interesting challenges during periods of rapid growth. In this presentation we'll focus on one of those scenarios, including our before and after data model, methodology and tools we used to recover and lessons learned along the way.
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
"While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns. "
Introduction to HBase. HBase is a NoSQL database that has seen a tremendous increase in popularity in recent years. Large companies like Facebook, LinkedIn and Foursquare are using HBase. In this presentation we will address questions like: What is HBase? How does it compare to relational databases? What is the architecture? How does HBase work? What about the schema design? What about the IT resources? These questions should help you consider whether this solution might be suitable in your case.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL - Arseny Chernov
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat... - DataStax Academy
- Quick review of Cassandra functionality that applies to this use case
- Common Data Center and application architectures for highly available inventory applications, and why they were designed that way
- Cassandra implementations vis-a-vis infrastructure capabilities
The impedance mismatch: compromises made to fit into IT infrastructures designed and implemented with an old mindset
1) Apache Cassandra in terms of the CAP Theorem
2) What makes Apache Cassandra "Available"?
3) How does Apache Cassandra ensure data consistency?
4) Cassandra advantages and disadvantages
5) Frameworks/libraries to access Apache Cassandra + performance comparison
NoSQL A brief look at Apache Cassandra Distributed Database
1. NoSQL (Not Only SQL)
Next generation web-scale databases
A brief look at Apache Cassandra
Distributed Database
2. Who am I
• Joe Alex
– Software Architect / Data Scientist
Loves to code in Java, Scala
– Areas of Interest: Big Data, Data Analytics,
Machine Learning, Hadoop, Cassandra
– Currently working as Team Lead for Managed
Security Services Portal at Verizon
3. 3
New Face of data
Scale out not up
•Big Data
–user generated; Amazon, Social Networks: Twitter, Facebook, Foursquare
–machine generated; credit cards, RFID, POS, cell phones, GPS,
firewalls, routers
–more and more connected
–less structured
–data sets becoming larger and larger
–joins and relationships are exploding
–cloud computing - scaling and tolerance needs
–backing up is replaced with having multiple active copies
–nodes can crash and applications should survive
–nodes can be added or removed at any point of time
4. 4
New Face of data
Internet of Things (real-world objects connect to the Internet)
– "The 'Internet of Things' will infuse intelligence into all our systems and present us with a whole new way to run a home, an enterprise, a community or an economy. In a 4G world, wireless will connect everything, and there's really no limit to the number of connections that can be part of the mobile grid: vehicles, appliances, buildings, roads, medical monitors."
– recently announced a partnership with American Security Logistics (ASL), to "wirelessly connect a series of location based tracking devices that can be used to help keep tabs on an array of valuables - from people to pets to pallets."
– By 2013, the number of devices connected to the Internet will reach 1 trillion - up from 500 million in 2007.
5. 5
New Face of data
Scale out not up
•Traditional RDBMS
– neither economical nor capable
– scaling up doesn't work
– scaling out with traditional DB is not easy
• scaling reads to a relational DB is hard
• scaling writes is almost impossible
– when you try, it is not relational anymore
– sharding scales
• but you lose all features that make RDBMS useful
• operational nightmare
– volumes of data strain commercial RDBMS
– cloud computing
– rethink how we store data. Understand your data, find the most efficient model
– de-normalization. normalization strives to remove duplication but duplication is an
interesting alternative to joins
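The de-normalization point above can be made concrete with a small sketch. The `users` and `posts` data and both function names are hypothetical, chosen only for illustration: the first function resolves the author with a join-like lookup at read time, the second copies the author name into every post at write time so a read needs no join.

```python
# Hypothetical sample data: a normalized layout with a user table and a post table.
users = {1: {"name": "keyhole"}, 2: {"name": "spacer"}}
posts = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "first post"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def posts_with_author_join(users, posts):
    """Normalized read: look up the author for each post (a join)."""
    return [
        {"text": p["text"], "author": users[p["user_id"]]["name"]}
        for p in posts
    ]


def denormalize(users, posts):
    """De-normalized write: duplicate the author name into each post row
    and group rows by author, so one key lookup answers the query."""
    by_author = {}
    for p in posts:
        name = users[p["user_id"]]["name"]
        row = {"post_id": p["post_id"], "text": p["text"], "author": name}
        by_author.setdefault(name, []).append(row)
    return by_author


timeline = denormalize(users, posts)
print(timeline["keyhole"])  # all of keyhole's posts, no join needed
```

The cost of the second layout is the duplication itself: renaming a user means rewriting every copied row.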
6. 6
New Face of data
What is wrong with RDBMS
•Pros
–SQL lets you query all data at once
–enforces data integrity
–minimizes repetition
–proven
–familiar to DBA, users
•Cons
–rigidly schematic
–joins rapidly become a bottleneck
–difficult to scale up
–gets in the way of parallelization
–optimization may mitigate benefits of normalization (Sharding)
7. 7
New Face of data
What is good with NRDBMS
•Pros
–schemaless
–master-master replication
–scales well
–everything runs in parallel
–built for the web
•Cons
–integrity-enforcement migrates to code
–limited ORM tooling
–significant learning curve
–proven only in a sub-set of cases
–Unlearning normalization is difficult
8. 8
New Face of data
What is good with NRDBMS
– Relational databases do not fit every problem
– stuffing files into an RDBMS? Maybe there is something better
– using an RDBMS for caching? Perhaps a lighter-weight solution is better
– cramming log data into an RDBMS? Perhaps a key/value store is better
– trying to do parallel processing with a DB? Maybe Hadoop MapReduce is better
– executing a long-running process that takes a few hours? Maybe MapReduce with Hadoop/HBase is better and gets it done in minutes
– Despite the hype, RDBMS are not doomed, but
– their role and place will certainly change
– Scaling is a real challenge for relational db
• sharding is a band-aid, not feasible beyond a few nodes
– There is a hit in overcoming the initial learning curve
• it changes how you build applications (jsp, jsf, jpa)
– Drop ACID and think about data
9. 9
New Face of data
What is good with NRDBMS
–Webapps need
• elastic scalability
• flexible schemas
• geographic distribution
• high availability
• reliable storage
–Webapps can do without
• complicated queries
• strong transactions ( some form of consistency is still desirable)
–DB vs NoSQL
• Strong consistency vs Eventual consistency
• Big dataset vs Huge Datasets
• Scaling is possible vs Scaling is easy
• SQL vs MapReduce, API etc
• Good availability vs Very high availability
10. 10
CAP Theorem
You can't have it all
–What is ACID
• Atomic
• Consistent
• Isolated
• Durable
–ACID trips when
• downtime is unacceptable
• reliability is >= 2 nodes
• challenging over Networks
11. 11
CAP Theorem
You can't have it all
•What is CAP Theorem
– A distributed system can guarantee at most two of the three
• Consistency (every read sees the most recent write)
– ACID transactions
• Availability (the system can be read and written all the time)
– Total Redundancy
• Partition Tolerance (the system keeps working when nodes cannot reach each other)
– Infinite scale out
– CA - corruption is possible if live nodes can't communicate
– CP - data may become inaccessible while nodes are dead or unreachable
– AP - always available, but reads do not always return the most recent write
– Cassandra chooses A and P, but consistency is tunable per operation to get more C
– RDBMS are typically CA
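Tunable consistency is often explained with three counts: N replicas per row, W replicas that must acknowledge a write, and R replicas that must answer a read. A read is guaranteed to overlap the latest acknowledged write when R + W > N. The sketch below only checks that inequality; the function name and the replication factor of 3 are assumptions made for illustration.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when every read set must share a replica with every write set."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


N = 3  # replication factor, assumed for this example
for label, r, w in [("ONE/ONE", 1, 1), ("QUORUM/QUORUM", 2, 2), ("ONE/ALL", 1, 3)]:
    if quorums_overlap(N, r, w):
        verdict = "reads overlap the latest write"
    else:
        verdict = "reads may be stale"
    print(f"{label}: R={r} W={w} -> {verdict}")
```

Lower R and W favour latency and availability; higher values move the same cluster toward the C corner.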
12. 12
CAP Theorem
You can't have it all
•What is BASE
– ACID Alternative
– Basically Available (appears to work all the time)
– Soft state (doesn't have to be consistent all the time)
– Eventually consistent (but eventually it will be)
–BASE (basically available, soft state, eventually consistent) rather than ACID
(atomicity, consistency, isolation, durability )
13. 13
NoSQL
It is really Not Only SQL
•What problems does it solve
–Reliable and simple scaling
–No single point of failure (all nodes are identical)
–High write throughput
–Large data sets
–Scale out not up
–Online load balancing, cluster growth
–flexible schema
–key-oriented queries
–CAP aware
14. 14
NoSQL
It is really Not Only SQL
•Many choices
–Key/Value Stores (distributed hash tables)
Stores entities as key value pairs in large hash tables
– Voldemort, Redis, Riak, SimpleDB, Tokyo Cabinet, Dynomite, MemcacheDB
–Column Oriented (semi-structured)
Stores entities by Column
– Cassandra, Bigtable, HBase, Hypertable, Azure table services
–Document (semi-structured)
stores documents (JSON)
– CouchDB, MongoDB
–Graph (stores entities as nodes and edges)
– Neo4j
16. 16
Cassandra
Highly scalable distributed database
• Created at Facebook
– Designed by Avinash Lakshman and Prashant Malik
– Open sourced by Facebook in 2008
– Entered the Apache Incubator in March 2009
– Graduated to a top-level Apache project in February 2010
– Dynamo's fully distributed design
– Bigtable's Column Family-based data model
17. 17
Cassandra
Highly scalable distributed database
– Proven
• largest production cluster has over 100 TB of data in over 150 machines.
– Fault Tolerant
• automatically replicated to multiple nodes for fault-tolerance
• Replication across multiple data centers supported
• Failed nodes can be replaced with no downtime
– Decentralized
• Every node in the cluster is identical
• no network bottlenecks
• no SPOF
– You're in control
• Choose between synchronous or asynchronous replication for each update
• Highly available asynchronous operations are optimized with features like Hinted Handoff
and Read Repair
– Rich Data Model
• Allows efficient use for many applications beyond simple key/value
– Elastic
• Read and write throughput both increase linearly as new machines are added, with no
downtime or interruption to application
– Durable
• Cassandra is suitable for applications that can't afford to lose data, even when an entire
data center goes down
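One way to picture how identical nodes share data and how each row ends up on several of them is a hash ring. The sketch below is a simplified illustration, not Cassandra's code: the `Ring` class, the node names and the single token per node are assumptions, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring using a stable hash."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.rf = replication_factor
        # Each node owns one token; the ring is the nodes sorted by token.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, row_key: str):
        """First node at or after the key's token, then the next rf - 1 nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas("keyhole"))  # three distinct nodes holding this row
```

Because every node can compute the same placement, any node can accept a request and forward it, which is what removes the single point of failure.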
18. 18
Cassandra
Highly scalable distributed database
–High Availability. Writes never fail.
–Incremental scalability
–Eventually Consistent (Hinted Handoff, Read Repair)
–Tunable tradeoffs between consistency and latency
– partitioning, replication
–Minimal administration
–No Single Point Of Failure (SPOF)
–Key-Value store (with some structure)
–Schemaless
–MapReduce support
–Two read paths available: high-performance weak reads/quorum reads
–Reads and writes atomic within a single Column Family
–Versioning and conflict resolution (last update wins)
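"Last update wins" can be sketched as a merge over timestamped columns. This is an illustrative model only: the `Column` class and helper names are hypothetical, and ties are resolved by keeping the first version seen, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # write time supplied with the update


def resolve(a: Column, b: Column) -> Column:
    """Keep the version with the larger timestamp (ties keep `a`)."""
    if a.name != b.name:
        raise ValueError("can only resolve versions of the same column")
    return a if a.timestamp >= b.timestamp else b


def merge_replicas(replicas):
    """Merge the column maps returned by several replicas into one row."""
    merged = {}
    for replica in replicas:
        for name, col in replica.items():
            merged[name] = resolve(merged[name], col) if name in merged else col
    return merged


stale = {"email": Column("email", "old@bar.com", 100)}
fresh = {"email": Column("email", "foo@bar.com", 200),
         "phone": Column("phone", "(888) 888-8888", 150)}
row = merge_replicas([stale, fresh])
print(row["email"].value)
```

A read repair can then write the merged row back to any replica that returned an older version.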
19. 19
Cassandra
Who is using it
• Used by
– Twitter
– Facebook
– Digg
– Rackspace
– Reddit
– IBM
– Cisco
– SimpleGeo
– Cloudkick
– Comcast
– Mahalo
– Ooyala
– OpenX
23. 23
Cassandra
Highly scalable distributed database
• Writes
– no reads
– no seeks
– sequential disk access
– atomic within CF
– Fast
– Any node
– Always writable (hinted hand-off)
– Writes go to a commit log and in-memory storage (memtable)
– Memtable is occasionally flushed to disk (SSTable)
– The SSTables are periodically compacted
– Partitioner
– Wait for W responses
– client issues a write request to a random node in the Cassandra cluster
– partitioner determines the nodes responsible for the data
– No locks in critical path
– always writable - accepts writes during failure scenarios
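The write path listed above (commit log, memtable, flush to SSTable, compaction) can be modelled in a few lines. `ToyStore` is a hypothetical in-memory stand-in written for this note: it keeps everything in Python lists and dicts, flushes after a fixed number of keys, and clears the log on flush, all of which are simplifying assumptions.

```python
class ToyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of writes
        self.memtable = {}        # in-memory view of recent writes
        self.sstables = []        # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # A write is an append plus an in-memory update: no read, no seek.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted, immutable table."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self):
        """Merge all tables into one, newer values replacing older ones."""
        merged = {}
        for table in self.sstables:  # oldest first, so later tables win
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


store = ToyStore(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(k, v)
print(len(store.sstables), store.read("a"))
```

Note how a read may have to consult several tables until compaction merges them, which is why reads are listed as slower than writes.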
24. 24
Cassandra
Highly scalable distributed database
• Reads
– Any nodes
– read repair
– usual cache conventions apply
– Bloom Filters before SSTable
– reads (memtable, sstable)
– Partitioner
– Wait for N – R responses in the background and perform read repair
– Read multiple SSTables
– Slower than writes (but still fast)
– Scales to billions of rows
– Read repair when out of sync
– Row cache avoids SSTable lookup
– Key cache avoids index scan
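The "Bloom Filters before SSTable" step can be illustrated with a small filter: before scanning a table, ask the filter whether the key could be there at all. The class below is a generic Bloom filter written for this note, with assumed sizes and SHA-256 as the hash; it is not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive several positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


sstable = {"keyhole": "keyhole@bar.com", "spacer": "spacer@bar.com"}
bloom = BloomFilter()
for row_key in sstable:
    bloom.add(row_key)


def lookup(key):
    if not bloom.might_contain(key):
        return None          # skip the table entirely
    return sstable.get(key)  # possible hit: do the real lookup


print(lookup("spacer"))
```

A filter never reports a stored key as absent, so skipping a table on a negative answer is always safe; a positive answer may still turn out to be a miss.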
26. 26
Compared with MySQL
• MySQL
– 300ms write
– 350ms read
• Cassandra
– 0.12 ms write
– 15ms read
– on 50GB data
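Taken at face value, the figures on this slide work out to roughly a 2,500x lower write latency and about a 23x lower read latency. The short calculation below only restates the slide's numbers; it says nothing about how the measurement was done.

```python
# Latencies in milliseconds, copied from the slide (50GB data set).
mysql_ms = {"write": 300.0, "read": 350.0}
cassandra_ms = {"write": 0.12, "read": 15.0}


def speedup(operation: str) -> float:
    """Ratio of MySQL latency to Cassandra latency for one operation."""
    return mysql_ms[operation] / cassandra_ms[operation]


for operation in ("write", "read"):
    print(f"{operation}: about {speedup(operation):,.0f}x lower latency")
```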
27. 27
Clients
• Most common way to access is via Thrift Interface.
• Other clients for most languages
• http://wiki.apache.org/cassandra/ClientExamples
• Fauna – Twitter’s Ruby client
• Lazyboy - Digg’s Python library
28. 28
Datamodels
• Cluster: machines (nodes) in logical Cassandra instance. Clusters can contain
multiple keyspaces.
• Keyspace: namespace for ColumnFamilies. (Analogous to DB schema)
• ColumnFamilies: contain multiple columns, referenced by row keys. (Analogous to
table)
• SuperColumns: columns that themselves have subcolumns.
30. 30
Column
• Lowest increment of data. Analogous to Name/Value pairs or Attribute. Key is ID.
• { "name": "emailAddress",
"value": "foo@bar.com",
"timestamp": 123456789 }
31. 31
SuperColumn
• Value is a Map of Columns
• { name: "address",
    value: {
      street: {name: "street", value: "888 anywhere", timestamp: 123456789},
      city: {name: "city", value: "reston", timestamp: 123456789},
      zip: {name: "zip", value: "20190", timestamp: 123456789}
    }
  }
32. 32
Column Families
• Analogous to Tables. Rows can have different columns. Columns can be created
dynamically. Columns are always sorted in row by Column name.
• User = {
    keyhole: {
      username: "keyhole",
      email: "keyhole@bar.com"},
    spacer: {
      username: "spacer",
      email: "spacer@bar.com",
      phone: "(888) 888-8888"}
  }
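The same layout can be written as nested Python dictionaries to show the three properties mentioned: rows with different columns, columns added on the fly, and columns read back sorted by name. The helper names are hypothetical and the sort uses plain string order as a stand-in for a column comparator.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "User": {
        "keyhole": {"username": "keyhole", "email": "keyhole@bar.com"},
        "spacer": {
            "username": "spacer",
            "email": "spacer@bar.com",
            "phone": "(888) 888-8888",
        },
    }
}


def insert_column(keyspace, column_family, row_key, name, value):
    """Add a column to one row; other rows are unaffected (no fixed schema)."""
    keyspace[column_family].setdefault(row_key, {})[name] = value


def columns_in_order(keyspace, column_family, row_key):
    """Return the row's columns sorted by column name."""
    return sorted(keyspace[column_family][row_key].items())


insert_column(keyspace, "User", "keyhole", "city", "reston")
print(columns_in_order(keyspace, "User", "keyhole"))
```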
36. 36
Type of Queries
• Single column
• Slice
• Key range
• Querying: get(), multiget(), get_slice(), multiget_slice(), get_count(), get_range_slice()
• Column comparators - TimeUUID, LexicalUUID, UTF8, Long, Bytes, ...
• Updating - insert(), batch_insert(), remove(), batch_mutate(), remove key range
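The three query shapes (single column, slice, key range) can be sketched against an in-memory column family. `ToyColumnFamily` and its method signatures are stand-ins invented for this note and do not match the Thrift calls listed above; the key range here walks row keys in sorted order, which on a real cluster assumes an order-preserving partitioner.

```python
import bisect


class ToyColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def insert(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def remove(self, row_key, column):
        self.rows.get(row_key, {}).pop(column, None)

    def get(self, row_key, column):
        """Single column lookup."""
        return self.rows[row_key][column]

    def get_slice(self, row_key, start, finish):
        """Columns of one row whose names fall in [start, finish], in name order."""
        row = self.rows.get(row_key, {})
        names = sorted(row)
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, finish)
        return [(name, row[name]) for name in names[lo:hi]]

    def get_range(self, start_key, end_key):
        """All rows whose keys fall in [start_key, end_key], in key order."""
        return [
            (key, dict(cols))
            for key, cols in sorted(self.rows.items())
            if start_key <= key <= end_key
        ]


users = ToyColumnFamily()
users.insert("keyhole", "username", "keyhole")
users.insert("keyhole", "email", "keyhole@bar.com")
users.insert("spacer", "username", "spacer")
users.insert("spacer", "email", "spacer@bar.com")
users.insert("spacer", "phone", "(888) 888-8888")

print(users.get("spacer", "phone"))
print(users.get_slice("spacer", "a", "q"))
print(users.get_range("k", "l"))
```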
37. 37
Cassandra
• Conclusions
– You probably do not need an NRDBMS now, but ought to learn one anyway
– It's not just for Twitter and bleeding-edge startups: Amazon, Facebook, Google, IBM, Microsoft all get this
– Sometimes it is simply the right tool for the job
– if you are in the cloud you are going to use them
– best of both worlds - external mapping layer JPA driver
– Next Big thing - In Memory elastic DB
• memory can be much more efficient than disk
• RAMClouds become much more attractive for apps with high-throughput requirements
38. 38
More…
•Other articles/videos about Cassandra
–http://wiki.apache.org/cassandra/
–#cassandra on irc.freenode.net
–http://wiki.apache.org/cassandra/ArticlesAndPresentations