
Final deck



SXSW 2011 Panel Presentation - "Big Data For Everyone (No Data Scientists Required)"



  1. Big Data for Everyone Twitter: #bd4e
  2. Introduction to Big Data Steve Watt Hadoop Strategy @wattsteve #bd4e
  3. What is "Big Data"?
     - "Every two days we create as much information as we did from the dawn of civilization up until 2003" – Eric Schmidt, Google
     - Current state of affairs:
       - Explosion of user generated content
       - Storage is really cheap, so we can store what we want
       - Traditional data stores have reached critical mass
     - Issues:
       - Enterprise amnesia
       - Traditional architectures become brittle and slow when tasked with processing data at petabyte scale
       - How do we process unstructured data?
  4. How were these issues addressed?
     - 2004 – Google publishes seminal whitepapers on Map/Reduce and the Google File System, a new programming paradigm for processing data at Internet scale
     - The whitepapers describe the use of massive parallelism to let a system scale horizontally, achieving linear performance improvements
     - This approach is well suited to a cloud model, whereby additional instances can be commissioned/decommissioned to have an immediate effect on performance
     - The approaches described in the Google whitepapers were incorporated into the open source Apache Hadoop project
  5. What is Apache Hadoop?
     - A cluster technology with a single master and multiple slaves, designed for commodity hardware
     - It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce
     - As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy
     - Self-contained jobs are written in Map/Reduce and submitted to the cluster; the jobs run in parallel on each of the machines in the cluster, processing the data held locally (data locality)
     - Hadoop may execute or re-execute a job on any node in the cluster; node failures are handled automatically by the framework
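     As a concrete illustration (not from the slides themselves), a minimal word-count job written against the standard Hadoop Java API looks roughly like this: the mapper runs where each HDFS block lives and emits (word, 1) pairs, and the framework groups those pairs so the reducer can sum a total per word.

      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Map: runs on the nodes holding each HDFS block (data locality) and
        // emits (word, 1) for every token in its input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context ctx)
              throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
              word.set(it.nextToken());
              ctx.write(word, ONE);
            }
          }
        }

        // Reduce: the framework delivers all counts for one word together; sum them.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenMapper.class);
          job.setCombinerClass(SumReducer.class);   // pre-aggregate on the map side
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

     Packaged as a jar, it would be submitted with something like "hadoop jar wordcount.jar WordCount <input> <output>"; the cluster schedules map tasks next to the data and re-runs any task whose node fails.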
  6. The Big Data Ecosystem
     - Provisioning: ClusterChef / Apache Whirr / EC2
     - Import/Export Tooling: Nutch / SQOOP / Flume
     - Offline Systems (Analytics): Hadoop
     - Scripting: Pig / WuKong / Cascading
     - Online Systems (OLTP @ Scale) / NoSQL: Cassandra / HBase
     - DBA: Hive / Karmasphere
     - Non-Programmer / Human Consumption: BigSheets / Datameer
     - Visualizations
     - All running on commodity hardware
  7. Offline customer scenario Eric Sammer Solution Architect @esammer #bd4e
  8. Use Case: Product Recommendations
     - "We can provide a better experience (and make more money) if we provide meaningful product recommendations."
     - We need data:
       - What products did a user buy?
       - What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?
       - What are the attributes of the user? (e.g. income, gender, friends)
       - What are our margins on products, inventory, upcoming promotions?
  9. Problems
     - That's a lot of data! (2 months of activity + all purchase data + all user data)
       - Activity: ~20 GB per day x ~60 days = 1.2 TB
       - User data: ~2 GB
       - Purchase data: ~5 GB
       - Misc: inventory, product costs, promotion schedules
     - Distilling data to aggregates would reduce fidelity
     - Easy to see how looking at more data could improve recommendations
     - How do we keep this information current?
  10. The Answer
     - Calculate all qualifying products once a day for each user and store them for quick display
     - Use Hadoop to process data in parallel on hundreds of machines
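     A sketch of what that once-a-day batch could look like, continuing the Hadoop Java API example above. It is illustrative only: the activity-log format (userId <TAB> productId <TAB> action) and the naive "score a product by how often the user touched it" rule are assumptions standing in for a real recommender, not the panel's actual implementation.

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.stream.Collectors;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class DailyRecommendations {

        // Map: one activity-log line ("userId <TAB> productId <TAB> action")
        // becomes (userId, "productId:action") so a user's events are grouped together.
        public static class ActivityMapper extends Mapper<LongWritable, Text, Text, Text> {
          @Override
          protected void map(LongWritable key, Text value, Context ctx)
              throws IOException, InterruptedException {
            String[] f = value.toString().split("\t");
            if (f.length == 3) {
              ctx.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
            }
          }
        }

        // Reduce: all of one user's events arrive together; count touches per product
        // and emit the user's top candidate products for quick lookup at serving time.
        public static class RecommendReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text user, Iterable<Text> events, Context ctx)
              throws IOException, InterruptedException {
            Map<String, Integer> score = new HashMap<>();
            for (Text e : events) {
              String product = e.toString().split(":")[0];
              score.merge(product, 1, Integer::sum);
            }
            String top = score.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(","));
            ctx.write(user, new Text(top));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "daily recommendations");
          job.setJarByClass(DailyRecommendations.class);
          job.setMapperClass(ActivityMapper.class);
          job.setReducerClass(RecommendReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

     The output (one line per user with its candidate products) can then be bulk-loaded into the serving store for quick display.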
  12. Online customer scenario Matt Pfeil CEO @mattz62 #bd4e
  13. What is Apache Cassandra?
  14. Use Case: Managing Email
     - "My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can't keep up. I need to scale horizontally and get incredibly fast real time performance."
     - The problem:
       - How do we achieve scalability, redundancy, high performance?
       - How do we store billions of files on commodity hardware?
       - How do we increase capacity by simply adding machines? (No SANs!)
       - How do we make it FAST?
  15. Requirements
     - Storage for email:
       - Billions of emails (<100 KB avg)
       - 2M users, 100 MB of storage each = 190 TB
       - Growth of 50% every 6 months
       - Durable
     - Requirements for the storage system:
       - No master / single point of failure
       - Linear scalability + redundancy
       - Multiple active data centers
       - Many reads, many writes
       - Millisecond response times
       - Commodity hardware
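     One way those requirements could map onto Cassandra, sketched with CQL and the current DataStax Java driver (both postdate this 2011 deck, which would have used the Thrift API; the keyspace, table, data center names, and replication settings below are illustrative assumptions). Partitioning the mailbox by user keeps each inbox on a bounded set of replicas in every data center, and clustering by message id makes "newest messages first" a single fast read.

      import com.datastax.oss.driver.api.core.CqlSession;   // com.datastax.oss:java-driver-core
      import java.net.InetSocketAddress;

      public class MailboxSketch {
        public static void main(String[] args) {
          try (CqlSession session = CqlSession.builder()
              .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
              .withLocalDatacenter("dc_east")
              .build()) {

            // Replicate to both data centers; no master, any replica can serve reads and writes.
            session.execute("CREATE KEYSPACE IF NOT EXISTS mail WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'dc_east': 3, 'dc_west': 3}");

            // One partition per user; messages clustered newest-first within it.
            session.execute("CREATE TABLE IF NOT EXISTS mail.inbox ("
                + "  user_id text, msg_id timeuuid, subject text, body text,"
                + "  PRIMARY KEY (user_id, msg_id)"
                + ") WITH CLUSTERING ORDER BY (msg_id DESC)");

            // Adding machines adds capacity linearly; no SAN, just local commodity disks.
            session.execute("INSERT INTO mail.inbox (user_id, msg_id, subject, body) "
                + "VALUES ('alice', now(), 'Hello', '...')");
          }
        }
      }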
  16. Solution
     - 800 TB of storage
     - ~1.75 million reads or writes/sec (no cache!)
     - 130 machines
       - Read/write at both data centers
       - No "master" data center
  17. Where to next? The Adjacent Possible Flip Kromer CTO @mrflip #bd4e
  18. Something about Anything
  19. Everything about Something
  20. Bigger than One Computer
  21. Bigger than Frontal Lobe
  22. Bigger than Excel
  23. what’s coming to help
  24. myth of the "data base"
     - Hold your data
     - Ask questions
  25. Managing & Shipping
     - Hadoop FTW
     - Cassandra, HBase, ElasticSearch, ...
       - Integration is still too hard
     - Dev Ops
     - Reliable decoupling: Flume, Graphite
  26. Hadoop as a haiku, each line annotated with the step it describes:
      "Data flutters by" (label)
      "Elephants make sturdy piles" ({GROUP})
      "Number becomes thought" (process_group)
  27. Twitter Parser in a Tweet
      # Parse one JSON tweet per line and emit its values sorted by field name.
      # (Streamer stands in for Wukong's line-streamer base class, abbreviated to fit in a tweet.)
      class TwStP < Streamer
        def process(line)
          a = JSON.load(line) rescue {}
          yield a.values_at(*a.keys.sort)
        end
      end
  28. pure functionality
  29. pure fun ctionality
  30. Data Stores in Production: Cassandra, HBase, ElasticSearch, MySQL, Redis, TokyoTyrant, SimpleDB, MongoDB, sqlite, whisper (graphite), file system, S3
  31. Dev Ops: Rethink Hard (the same stores: Cassandra, HBase, ElasticSearch, MySQL, Redis, TokyoTyrant, SimpleDB, MongoDB, sqlite, whisper (graphite), file system, S3)
  32. Still Blind
     - Visual grammar to see it: NYTimes, Stamen, Ben Fry
     - Interactive tools: Tableau, Spotfire, d3.js, Gephi
  33. Human-Scale Tools
     - Data-as-a-Service: Infochimps, SimpleGeo, DrawnToScale
     - Business Intelligence
     - Familiar paradigm, new scale: BigSheets, Datameer
  34. Panel Discussion Stu Hood Software Engineer @stuhood #bd4e
  35. Thanks for coming! Stu Hood (@stuhood), Flip Kromer (@mrflip), Matt Pfeil (@mattz62), Eric Sammer (@esammer), Steve Watt (@wattsteve)