Real World NoSQL (by Chris Yuen)

The Hong Kong Big Data community had a guest speaker at our Tuesday, 18 February meeting. Chris Yuen from Demyst Data discussed his experience with three NoSQL solutions: Cassandra, MongoDB, and HBase. For more information see http://www.infoincog.com/hong-kong-big-data-meeting-tuesday-18-february/.

Published in: Technology

  1. Real World NoSQL x Big Data
  2. Overview: Introduction; Motivation for NoSQL; The NoSQL landscape; Experience sharing; HBase; MongoDB; Cassandra; Tying it up – how does it really matter
  3. Motivation: Too much data – the need to “scale out”; CAP theorem
  4. Motivation: Too much data – the need to “scale out”; CAP theorem; Performance; RDBMS joining is slow; Denormalization; Key value data store; Alternative data representation; Schemaless “No SQL”
  5. Motivation: Too much data – the need to “scale out”; CAP theorem; Performance; RDBMS joining is slow; Denormalization; Key value data store; Alternative data representation; Schemaless “No SQL”; Document data store
  6. HBase: Builds on top of HDFS; Consistent “big-data” database; Automatically scales out
  7. HBase: … but we didn’t use it in the end
  8. HBase: A nightmare to set up and maintain; Depends on Hadoop, HDFS, Zookeeper
  9. HBase: A nightmare to set up and maintain; Depends on Hadoop, HDFS, Zookeeper; No secondary index; “Table” alteration requires downtime; Not spectacular latency for OLTP usage
  10. MongoDB: De-facto “big-data” “NoSQL” database; Document based data representation
  11. MongoDB: De-facto “big-data” “NoSQL” database; Document based data representation
  12. MongoDB: A good balance of “traditional” usage and “NoSQL” usage; Supports secondary index; Range query; Can do table scan
  13. MongoDB: “Big-data” features: sharding, replica set
  14. MongoDB: … but it got ugly pretty fast; Devil’s in the details; Replica set management fiasco; Sharding is difficult to set up and poorly implemented; https://github.com/kizzx2/mongolab
  15. MongoDB
  16. MongoDB: Reality – it doesn’t scale beyond one machine; Replica set
  17. Cassandra: Column Family data store
  18. Cassandra: Column Family data store
  19. Cassandra: Column Family data store; More “NoSQL” than MongoDB, with fewer features; Column data store – strictly key/value query
  20. Cassandra: Auto-sharding just works; Replica set requires 0 configuration; Append only, LSM-tree based storage format; Good for SSD; High insert throughput; For storing analytic data
  21. Cassandra: Has rudimentary support for secondary index; Difficult to do table scan or range scan; Requires a substantial application / paradigm shift
  22. Real World Implications: Why does NoSQL matter to Big Data?; Schemaless storage model; Performance; Scalability; Rapidly incorporate unstructured new data sources without extensive planning
  23. How to Choose: Maintenance / Scalability; Supported operations; OLAP vs. OLTP
  24. Thank You: Chris Yuen; http://cfc.kizzx2.com; http://github.com/kizzx2; @kizzx2; chris@kizzx2.com
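
Slide 20 links Cassandra's high insert throughput to its append-only, LSM-tree based storage format. The sketch below is a toy illustration of that general idea, not Cassandra's actual implementation: the class name `ToyLSM`, the `memtable_limit` threshold, and the in-memory lists standing in for on-disk files are all assumptions made for the example. Writes only ever touch an in-memory table, full tables are flushed as immutable sorted segments, and reads consult the newest data first.

```python
from bisect import bisect_left


class ToyLSM:
    """Toy log-structured merge store (hypothetical, for illustration only)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # assumed flush threshold
        self.memtable = {}                    # recent writes, held in memory
        self.segments = []                    # immutable sorted runs, oldest first

    def put(self, key, value):
        # A write never modifies an existing segment; it only updates the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Emit the memtable as one sorted, immutable segment.
        # On a real disk this would be a single sequential write.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect_left(segment, (key,))  # binary search within the sorted run
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:  # oldest first, so later values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = ToyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # reaches the limit, so the memtable is flushed to a segment
store.put("a", 3)   # a newer value for "a" shadows the flushed one
latest_a = store.get("a")
```

The trade-off visible here matches the slides: inserts are cheap because nothing is rewritten in place, while a read may have to check several segments, and anything beyond a lookup by key needs extra machinery.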
