Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

At StampedeCon 2012 in St. Louis, Scott Fines of NISC presents: Recent years have seen a sudden and rapid introduction of new technologies for distributing applications to essentially arbitrary levels. The variety and depth of these systems have grown to match, and it can be a challenge just to keep up. In this talk, I’ll discuss some of the more common systems such as Hadoop, HBase, and Cassandra, and some of the different scenarios and pitfalls of using them. I’ll cover when MapReduce is powerful and helpful, and when it’s better to use a different approach. Putting it all together, I’ll mention ZooKeeper, Flume, and some of the surrounding small projects that can help make a usable system.

Published in: Technology

  1. Welcome to the Jungle: Building Distributed Systems for Large Data Sets
  2. • SQL solves all our problems!
          • Or does it?
  3. The Problem with SQL
      • At some point, data is too large to fit on a single machine.
          • Then what do you do?
  4. Your cluster: SQL, Application (diagram)
  5. The first sign of trouble
      • Can do small queries pretty well
      • Large analytical queries?
          • forget it!
          • Takes too long
          • Uses too many resources
  6. Hadoop for Bulk Processing
  7. • Hadoop = HDFS + MapReduce
          • HDFS = Distributed, Fault Tolerant File System
          • MapReduce = Highly distributed processing engine
  8. • MapReduce works if:
          • Your algorithm needs to touch every piece of data in the set
          • You can write your algorithm in a MapReduce structure
          • Your data set is gigantic
  9. • MapReduce is not so good if:
          • Your data set is very small
          • Your algorithm doesn't need to touch everything
          • You only want to query specific pieces of data
  10. • No Indexing
      • Job startup cost
      • No indices
          • Always touches all the data
  11. • MapReduce code is usually a pain to write
          • requires a Java developer
          • lots of boilerplate for common tasks
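
The "MapReduce structure" that slide 8 refers to can be sketched in a few lines. This is a single-process Python illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example. It shows the shape an algorithm has to take: a map step that emits key-value pairs from each record, a grouping step by key, and a reduce step that combines each group.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    # Every record is read: there is no index to skip irrelevant data.
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(run_job(["the jungle is big", "the data is big"]))
```

Even in this toy form, the job reads every record, which matches the "always touches all the data" point on slide 10.
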
  12. Pig and Hive!
  13. Apache Pig
      • Data Flow Language
          • feels like using sed/awk
          • good at transformations of data
  14. Apache Hive
      • SQL-like interface
          • good for large queries
          • maintains table information from files
  15. Pig vs. Hive
      • Both can do the same thing
          • Hive is easier to learn
          • Pig is easier to maintain
      • Pretty much a matter of taste
  16. The second sign
      • Your bulk processing and ad-hoc analysis are working great in Hadoop
      • But now your small queries are sucking
  17. Scale SQL?
      • A few options:
          • Buy Oracle RAC...$$$$
          • Static Sharding...hard to maintain
          • Don't do it?
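
Why static sharding is "hard to maintain" (slide 17) shows up in a small sketch. It assumes the simplest scheme, shard = hash(key) mod N, which the slides do not specify; `shard_for`, `moved_fraction`, and the key names are hypothetical. The point is that changing the shard count reassigns most keys, and each reassigned key is data that would have to be moved between machines.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (assumed scheme: hash mod N)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)

if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    fraction = moved_fraction(keys, 4, 5)
    print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder mod 4 and mod 5, so roughly four in five keys move when a fifth shard is added.
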
  18. HBase and Cassandra
  19. Column-Oriented Storage
      • SQL =
          • Fixed Columns, infinite rows
      • Column-Oriented:
          • Rows are groups of Key-Value pairs
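
The contrast on slide 19 can be made concrete with plain Python data structures. This only illustrates the data model, not how HBase or Cassandra lay data out on disk; the table contents, column names, and the `get_cell` helper are invented for the example.

```python
# SQL-style table: every row has the same fixed columns,
# even when a value is missing.
sql_columns = ("id", "name", "email", "phone")
sql_rows = [
    (1, "Ada", "ada@example.com", None),
    (2, "Bob", None, None),
]

# Column-oriented model: each row key maps to its own group of
# key-value pairs. Rows need not share columns, and absent columns
# are simply not stored.
wide_rows = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Bob", "last_login": "2012-08-01", "cart:item-42": "1"},
}

def get_cell(rows, row_key, column, default=None):
    """Look up one value by (row key, column name)."""
    return rows.get(row_key, {}).get(column, default)

if __name__ == "__main__":
    print(get_cell(wide_rows, "user:2", "last_login"))
    print(get_cell(wide_rows, "user:1", "last_login"))
```
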
  20. HBase/Cassandra
      • Both Column-oriented stores
      • Both highly available
      • Both rely on memory for performance
  21. Apache Cassandra
      • Highly Available and Partition Tolerant
      • Attempts to hold as much data as possible in memory
      • Manages files on local disk
  22. Eventual Consistency
      • Cassandra has Eventual Consistency
          • It is possible to read out-of-date data!
          • Also possible to guarantee consistency, at a cost
  23. Why Eventual Consistency?
      • Data is only written once
          • Either it's there or not
      • You don't care if you get out-of-date data
          • Shopping Carts
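
The "guarantee consistency, at a cost" point on slide 22 is commonly reasoned about with the quorum rule: with N replicas, a read of R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The sketch below is a toy model with hypothetical names (`Replica`, `write`, `read`, `is_strongly_consistent`); it is not Cassandra's API, and it uses a plain version counter where a real system would compare timestamps.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

def write(replicas, w, value, version):
    """Apply the write to the first `w` replicas only; the rest lag behind."""
    for replica in replicas[:w]:
        replica.version, replica.value = version, value

def read(replicas, r):
    """Read the last `r` replicas (the worst case) and return the newest value."""
    newest = max(replicas[-r:], key=lambda rep: rep.version)
    return newest.value

def is_strongly_consistent(n, r, w):
    """Quorum rule: read and write sets must share at least one replica."""
    return r + w > n

if __name__ == "__main__":
    n = 3
    for r, w in [(1, 1), (2, 2)]:
        replicas = [Replica() for _ in range(n)]
        write(replicas, w, "new", version=1)
        overlap = is_strongly_consistent(n, r, w)
        print(f"R={r} W={w} overlap={overlap} read={read(replicas, r)}")
```

The cost is that each operation has to wait for more replicas to respond, which raises latency and reduces how many replica failures the cluster can tolerate while still serving requests.
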
  24. Cassandra Strengths
      • Fast
          • Writes faster than Reads!
      • Easy to maintain
          • Self-contained
  25. Cassandra Weaknesses
      • Consistency Model is complex
      • Scanning over rows is excruciating
  26. Apache HBase
      • Uses HDFS as storage mechanism
      • Holds large proportion of data in RAM
          • need RAM >= 1% of your data size!
  27. HBase Strengths
      • Strong consistency guarantee
      • Good at scanning over rows
      • Strong community
          • part of the Hadoop ecosystem
  28. HBase weaknesses
      • Slower than Cassandra
          • HDFS is higher latency than direct disk
      • Complex to maintain
          • requires running
              • HDFS
              • ZooKeeper
  29. HBase vs. Cassandra
      • Pick Cassandra if:
          • Doing lots of writes
          • need easy maintenance
          • don't care about consistency so much
      • Pick HBase if:
          • Scanning over rows a lot
          • comfortable with maintaining Hadoop/ZooKeeper
          • Need simple consistency guarantees
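
The slides do not say why row scans favor HBase. One likely reason is that HBase keeps rows sorted by row key, so a range of keys is a contiguous read, while a hash-partitioned store scatters neighboring keys across nodes. The sketch below illustrates that difference under those assumptions; `range_scan`, `node_for`, and the key format are hypothetical, and the hash-mod-N placement is a simplification of real partitioners.

```python
import bisect
import hashlib

def range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop) from a sorted list using binary search."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

def node_for(key, num_nodes):
    """Hash-partitioned placement: adjacent keys land on unrelated nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

if __name__ == "__main__":
    keys = sorted(f"event-{i:04d}" for i in range(1000))
    wanted = range_scan(keys, "event-0100", "event-0110")
    print(wanted)                             # one contiguous slice
    print({node_for(k, 4) for k in wanted})   # nodes a hashed store must contact
```
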
  30. Your cluster: Hadoop, HBase/Cassandra, SQL, Application (diagram)
  31. This is complicated!
      • How do we configure it?
      • What if we have to run an algorithm on only a single node at a time?
      • What if we need to coordinate actions?
  32. Apache ZooKeeper
      • Distributed Coordination System
          • Designed for creating distributed concurrency controls
          • also good for storing configuration
          • NOT good for storing anything else!
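
Slide 31 asks how to run an algorithm on only one node at a time. A common ZooKeeper pattern for this is a lock built from ephemeral sequential nodes: each client registers, receives an increasing sequence number, and the lowest live number holds the lock. The class below is an in-memory stand-in that only mimics that idea; `ToyCoordinator` and its methods are hypothetical and are not the ZooKeeper client API.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> owning client

    def create_sequential(self, client):
        """Register a client and hand back a monotonically increasing number."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = client
        return seq

    def remove(self, seq):
        """Called when a client releases the lock or its session dies."""
        self._nodes.pop(seq, None)

    def holder(self):
        """The client with the lowest live sequence number holds the lock."""
        return self._nodes[min(self._nodes)] if self._nodes else None

if __name__ == "__main__":
    zk = ToyCoordinator()
    a = zk.create_sequential("node-a")
    b = zk.create_sequential("node-b")
    print(zk.holder())  # node-a registered first, so it runs the job
    zk.remove(a)        # node-a finishes or crashes
    print(zk.holder())  # node-b takes over
```

In a real deployment the removal step happens automatically when a client's session ends, which is what lets another node take over after a crash.
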
  33. • Now you have:
          • Bulk Processing with Hadoop
          • Large data queries with HBase/Cassandra
          • Coordination with ZooKeeper
          • Your old SQL database!
  34. • Chances are, you still need SQL for some stuff
      • If the data sizes are manageable, SQL is tried-and-true
  35. The People Problem
      • Big Data systems are complicated
          • Lots of moving parts
          • Lots of places where things can go wrong
          • Need good people!
  36. • Try and hire an expert directly...
          • Not that many out there
  37. • Train 2 or 3 experts instead
          • Worth every penny
  38. Who should I hire?
      • Probably won't find direct experts
      • Look instead for people who:
          • are good with algorithms
          • are fast learners
          • are not risk-averse
  39. Questions?
  40. Thank You
      • email: scottfines@gmail.com
      • github: scottfines
      • linkedin: scottfines
