Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

At StampedeCon 2012 in St. Louis, Scott Fines of NISC presents: Recent years have seen a sudden and rapid introduction of new technologies for distributing applications to essentially arbitrary levels. The variety and depth of these systems have grown to match, and it can be a challenge just to keep up. In this talk, I’ll discuss some of the more common systems, such as Hadoop, HBase, and Cassandra, and some of the scenarios and pitfalls of using them. I’ll cover when MapReduce is powerful and helpful, and when it’s better to use a different approach. Putting it all together, I’ll mention ZooKeeper, Flume, and some of the surrounding small projects that can help make a usable system.

  • 1. Welcome to the Jungle: Building Distributed Systems for Large Data Sets
  • 2. SQL solves all our problems!
      • Or does it?
  • 3. The Problem with SQL
      • At some point, data is too large to fit on a single machine.
          • Then what do you do?
  • 4. Your cluster: SQL, Application
  • 5. The first sign of trouble
      • Can do small queries pretty well
      • Large analytical queries?
          • Forget it!
          • Takes too long
          • Uses too many resources
  • 6. Hadoop for Bulk Processing
  • 7. Hadoop = HDFS + MapReduce
      • HDFS = Distributed, Fault-Tolerant File System
      • MapReduce = Highly distributed processing engine
  • 8. MapReduce works if:
      • Your algorithm needs to touch every piece of data in the set
      • You can write your algorithm in a MapReduce structure
      • Your data set is gigantic
  • 9. MapReduce is not so good if:
      • Your data set is very small
      • Your algorithm doesn't need to touch everything
      • You only want to query specific pieces of data
  • 10. No Indexing
      • Job startup cost
      • No indices
          • Always touches all the data
  • 11. MapReduce code is usually a pain to write
      • requires a Java developer
      • lots of boilerplate for common tasks
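
To make "a MapReduce structure" from slide 8 concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using a word count. It is plain Python, not Hadoop's API; the names `map_words`, `reduce_counts`, and `run_mapreduce` are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine every value that was emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    # Shuffle step: group mapped values by key. Note that every record
    # is visited, which matches the "touches all the data" point above.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data big cluster", "big jungle"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

In a real cluster the mapper and reducer run on many machines and the shuffle moves data over the network; the sketch only shows the shape an algorithm has to fit.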
  • 12. Pig and Hive!
  • 13. Apache Pig
      • Data Flow Language
          • feels like using sed/awk
          • good at transformations of data
  • 14. Apache Hive
      • SQL-like interface
          • good for large queries
          • maintains table information from files
  • 15. Pig vs. Hive
      • Both can do the same thing
          • Hive is easier to learn
          • Pig is easier to maintain
      • Pretty much a matter of taste
  • 16. The second sign
      • Your bulk processing and ad-hoc analysis are working great in Hadoop
      • But now your small queries are sucking
  • 17. Scale SQL?
      • A few options:
          • Buy Oracle RAC...$$$$
          • Static sharding...hard to maintain
          • Don't do it?
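
One possible reason static sharding is "hard to maintain" (the slide does not spell it out) is that a fixed `hash(key) % shard_count` layout reshuffles most rows whenever the shard count changes. The sketch below measures that; `shard_for`, `moved_fraction`, and the `customer-N` keys are hypothetical.

```python
import hashlib
from typing import List


def shard_for(key: str, shard_count: int) -> int:
    """Static sharding: a stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys: List[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"customer-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
```

With a uniform hash, a key stays put only when `hash % 4 == hash % 5`, which holds for about one key in five, so growing from 4 to 5 shards would move roughly 80% of the rows.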
  • 18. HBase and Cassandra
  • 19. Column-Oriented Storage
      • SQL =
          • Fixed columns, infinite rows
      • Column-Oriented:
          • Rows are groups of key-value pairs
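
A rough way to picture "rows are groups of key-value pairs" is a map from row key to a per-row map of columns, so that each row carries only the columns it actually has. This is an in-memory illustration, not how HBase or Cassandra lay data out on disk; `table`, `put`, and `get` are made-up names.

```python
from typing import Dict, Optional

# Hypothetical model: row key -> {column name: value}.
table: Dict[str, Dict[str, str]] = {}


def put(row_key: str, column: str, value: str) -> None:
    """Store one key-value pair inside the given row."""
    table.setdefault(row_key, {})[column] = value


def get(row_key: str, column: str) -> Optional[str]:
    """Return the value for a column, or None if the row lacks it."""
    return table.get(row_key, {}).get(column)


put("user:1", "name", "Ada")
put("user:1", "email", "ada@example.com")
put("user:2", "name", "Grace")
put("user:2", "last_login", "2012-08-01")  # a column no other row has
```

Unlike a fixed SQL schema, adding `last_login` to one row needs no change to any other row.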
  • 20. HBase/Cassandra
      • Both column-oriented stores
      • Both highly available
      • Both rely on memory for performance
  • 21. Apache Cassandra
      • Highly Available and Partition Tolerant
      • Attempts to hold as much data as possible in memory
      • Manages files on local disk
  • 22. Eventual Consistency
      • Cassandra has Eventual Consistency
          • It is possible to read out-of-date data!
          • Also possible to guarantee consistency, at a cost
  • 23. Why Eventual Consistency?
      • Data is only written once
          • Either it's there or not
      • You don't care if you get out-of-date data
          • Shopping Carts
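
The "guarantee consistency, at a cost" point on slide 22 is usually explained with quorums: if a write is acknowledged by W replicas and a read consults R replicas out of N, then R + W > N forces the two sets to overlap. The toy model below shows a stale read and a consistent one. It is a sketch under that assumption, not Cassandra's implementation, and `ReplicaSet` and `Versioned` are invented names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Versioned:
    value: str
    timestamp: int


class ReplicaSet:
    """Toy model of N replicas of a single value."""

    def __init__(self, n: int) -> None:
        self.replicas: List[Versioned] = [Versioned("", 0) for _ in range(n)]

    def write(self, value: str, timestamp: int, w: int) -> None:
        # Only the first w replicas acknowledge; the others stay stale.
        for i in range(w):
            self.replicas[i] = Versioned(value, timestamp)

    def read(self, r: int) -> str:
        # Contact the last r replicas (the worst case for overlap with
        # the writers) and keep the newest version seen.
        contacted = self.replicas[-r:]
        return max(contacted, key=lambda v: v.timestamp).value


cluster = ReplicaSet(3)
cluster.write("cart-v1", timestamp=1, w=1)
stale = cluster.read(r=1)   # R + W = 2, not greater than N = 3

cluster.write("cart-v2", timestamp=2, w=2)
fresh = cluster.read(r=2)   # R + W = 4, greater than N = 3
```

The cost is that each operation waits on more replicas, so latency goes up and fewer node failures can be tolerated before requests start failing.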
  • 24. Cassandra Strengths
      • Fast
          • Writes faster than Reads!
      • Easy to maintain
          • Self-contained
  • 25. Cassandra Weaknesses
      • Consistency Model is complex
      • Scanning over rows is excruciating
  • 26. Apache HBase
      • Uses HDFS as storage mechanism
      • Holds large proportion of data in RAM
          • need RAM >= 1% of your data size!
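
The 1% rule of thumb on slide 26 is easy to turn into a quick sizing check. The slide does not say whether the figure is per node or cluster-wide; this sketch simply applies the ratio to a total data size, and `min_ram_gb` is a hypothetical helper.

```python
def min_ram_gb(data_size_gb: float, ram_percent: float = 1.0) -> float:
    """Minimum RAM implied by a 'RAM >= ram_percent% of data size' rule."""
    return data_size_gb * ram_percent / 100


# 50 TB of data, expressed in GB.
ram_needed_gb = min_ram_gb(50 * 1024)
```

For 50 TB of data the rule asks for at least 512 GB of RAM.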
  • 27. HBase Strengths
      • Strong consistency guarantee
      • Good at scanning over rows
      • Strong community
          • part of the Hadoop ecosystem
  • 28. HBase Weaknesses
      • Slower than Cassandra
          • HDFS is higher latency than direct disk
      • Complex to maintain
          • requires running
              • HDFS
              • ZooKeeper
  • 29. HBase vs. Cassandra
      • Pick Cassandra if:
          • Doing lots of writes
          • Need easy maintenance
          • Don't care about consistency so much
      • Pick HBase if:
          • Scanning over rows a lot
          • Comfortable with maintaining Hadoop/ZooKeeper
          • Need simple consistency guarantees
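
The slides do not say why row scans favor HBase. A common explanation, offered here as an assumption, is that HBase keeps rows sorted by row key, while a hash-based partitioner (typical of Cassandra deployments at the time) scatters neighbouring keys across nodes. The sketch contrasts the two layouts in memory; all function names are invented.

```python
import bisect
import hashlib
from typing import List


def scan_sorted(sorted_keys: List[str], start: str, stop: str) -> List[str]:
    """Range scan over sorted keys: two binary searches and one slice."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def partition_by_hash(keys: List[str], node_count: int) -> List[List[str]]:
    """Place each key on a node chosen by a hash of the key."""
    nodes: List[List[str]] = [[] for _ in range(node_count)]
    for key in keys:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        nodes[int.from_bytes(digest[:8], "big") % node_count].append(key)
    return nodes


def scan_hashed(nodes: List[List[str]], start: str, stop: str) -> List[str]:
    """With hash placement, any node may hold part of the range, so
    every node has to be checked and the results merged."""
    hits = [key for node in nodes for key in node if start <= key < stop]
    return sorted(hits)


keys = sorted(f"event-{i:04d}" for i in range(1000))
nodes = partition_by_hash(keys, node_count=4)

from_sorted = scan_sorted(keys, "event-0100", "event-0110")
from_hashed = scan_hashed(nodes, "event-0100", "event-0110")
```

Both scans return the same ten keys, but the sorted layout reads one contiguous slice while the hashed layout has to visit every node.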
  • 30. Your cluster: Hadoop, HBase/Cassandra, SQL, Application
  • 31. This is complicated!
      • How do we configure it?
      • What if we have to run an algorithm on only a single node at a time?
      • What if we need to coordinate actions?
  • 32. Apache ZooKeeper
      • Distributed Coordination System
          • Designed for creating distributed concurrency controls
          • also good for storing configuration
          • NOT good for storing anything else!
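
As an illustration of the "single node at a time" question from slide 31, here is an in-memory sketch loosely modeled on the ZooKeeper lock recipe, in which each client creates a sequentially numbered node and the lowest number holds the lock. `ToyCoordinator` is a stand-in written for this example; it is not the ZooKeeper client API and ignores sessions, watches, and failures.

```python
import itertools
from typing import Dict, Optional


class ToyCoordinator:
    """In-memory stand-in for a coordination service."""

    def __init__(self) -> None:
        self._sequence = itertools.count()
        self._waiting: Dict[int, str] = {}  # sequence number -> client name

    def enqueue(self, client: str) -> int:
        """Register interest in the lock and return a sequence number."""
        ticket = next(self._sequence)
        self._waiting[ticket] = client
        return ticket

    def holder(self) -> Optional[str]:
        """The client with the lowest sequence number holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def release(self, ticket: int) -> None:
        """Give up the lock (or leave the queue) for this ticket."""
        self._waiting.pop(ticket, None)


coordinator = ToyCoordinator()
ticket_a = coordinator.enqueue("node-a")
ticket_b = coordinator.enqueue("node-b")

first_holder = coordinator.holder()
coordinator.release(ticket_a)
second_holder = coordinator.holder()
```

Only one node holds the lock at any moment, and the next node in line takes over as soon as the holder releases it.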
  • 33. Now you have:
      • Bulk processing with Hadoop
      • Large data queries with HBase/Cassandra
      • Coordination with ZooKeeper
      • Your old SQL database!
  • 34. Chances are, you still need SQL for some stuff
      • If the data sizes are manageable, SQL is tried-and-true
  • 35. The People Problem
      • Big Data systems are complicated
          • Lots of moving parts
          • Lots of places where things can go wrong
          • Need good people!
  • 36. Try to hire an expert directly...
      • Not that many out there
  • 37. Train 2 or 3 experts instead
      • Worth every penny
  • 38. Who should I hire?
      • Probably won't find direct experts
      • Look instead for people who:
          • are good with algorithms
          • are fast learners
          • are not risk-averse
  • 39. Questions?
  • 40. Thank You
      • email: scottfines@gmail.com
      • github: scottfines
      • linkedin: scottfines