Final deck


SXSW 2011 Panel Presentation - "Big Data For Everyone (No Data Scientists Required)"

Published in: Technology
  • What about cascading?
  • What did a user buy? We don’t want to recommend products they’ve already purchased. What products did a user browse, hover over, rate, or add to cart? These all provide additional information we can use in making meaningful recommendations. Obviously you wouldn’t want to recommend a product a user has rated poorly (which may also indicate they own it but bought it from someone else). We can sometimes derive implicit information like this from explicit actions. Adding an item to a cart but not purchasing it indicates a strong “likely purchase.” User attributes can tell us more about the appropriate products to recommend for a user. Maybe we should recommend products within a user’s expected budget to increase the likelihood of purchase. Also, we don’t want to offer deep discounts to users who can pay (or have historically paid) full price for products. Knowing our margins on a product can also affect our decisions. Inventory is also interesting: we don’t want to recommend something that’s not in stock!
  • If we can read data sequentially at ~250MBps (SATA II, 3Gbps minus some overhead), we’re talking about 1200GB * 1024 (convert to MB) / 250MBps = ~4915 seconds / 60 = ~82 minutes just to read the activity logs. This doesn’t account for growth. Distilling data (i.e. grouping users into categories of interest and then picking interesting products in each category) drastically reduces accuracy. More data helps. Could we provide better recommendations by looking a year back? What if we looked at people who browsed / bought a significant number of products in common with you and based recommendations on that? What if we could take social data like Facebook and Twitter into account and recommend products based on friends? Repeating this operation to keep data up to date could get expensive quickly.
  • A batch operation to calculate recommendations daily probably makes sense. For each user, store a list of product IDs and present them pseudo-randomly or when a user views a product in the same category. Hadoop Map Reduce allows you to parallelize this type of operation so it can be run on hundreds or thousands of machines.
  • Walk through the basic architecture for a recommendation engine running on Hadoop with RDBMS import / export and app servers logging to HDFS. Mention that the RDBMS could instead be HBase, Cassandra, or another “big data” data store. Sqoop can handle RDBMS / Hadoop import and export. Flume and similar systems can facilitate reliable logging to HDFS from various sources.
  • From the Big Data leaders: Google, Amazon, Facebook
  • Scalability = ability to add machines for horizontal/easy growth. Redundancy = data replicated. High performance = fast.
  • Cost = ~$300K
  • So outside infochimps the internet is this wonderful smorgasbord of facts: With Google and Wikipedia and the rest you can find something about anything you seek.
  • But we’re ready to move to a world where you can find heartier fare -- to find out everything about something
  • Program according to this haiku and hadoop takes care of the janitorial work
  • Here’s a working Twitter stream JSON-to-TSV parser that fits in a tweet. Run it with: cat stream.json | ruby -r'wukong/script' -rjson parser_in_a_tweet.rb --map. Both fit comfortably in a tweet.
  • what’s left is pure functionality,
  • which to a programmer is pure fun.
  • Now I know this sounds like the lunacy of a ritalin-addled architecture astronaut spending too much time on StackOverflow.
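
The read-time estimate in the activity-log note above (1200GB read sequentially at ~250MBps) can be reproduced with a few lines of Ruby. This is only a back-of-the-envelope sketch: the constant and method names are made up for illustration, and the multi-disk figures assume a perfectly even split with no coordination overhead, which the notes do not claim.

```ruby
# Back-of-the-envelope estimate of one sequential pass over the activity logs.
# Figures come from the notes; the names below are illustrative only.
ACTIVITY_GB_PER_DAY = 20      # ~20GB of activity logs per day
DAYS_RETAINED       = 60      # ~2 months of history
READ_MB_PER_SEC     = 250.0   # ~250MBps sequential read (SATA II minus overhead)

# Seconds needed to read total_gb gigabytes at mb_per_sec megabytes per second.
def scan_seconds(total_gb, mb_per_sec)
  total_gb * 1024 / mb_per_sec
end

total_gb = ACTIVITY_GB_PER_DAY * DAYS_RETAINED        # 1200 GB, i.e. ~1.2TB
seconds  = scan_seconds(total_gb, READ_MB_PER_SEC)
minutes  = seconds / 60

puts format("%d GB at %d MB/s: %.0f s (~%.0f min) on one disk",
            total_gb, READ_MB_PER_SEC, seconds, minutes)

# Assumed best case: the same scan split evenly across N disks read in parallel.
[10, 100].each do |disks|
  puts format("%3d disks: ~%.1f min", disks, minutes / disks)
end
```

The single-disk figure is the cost of just reading the logs once, which is the motivation the notes give for spreading the work across many machines.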

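The daily batch job described in the notes (for each user, compute and store a list of product IDs) has a map, group, reduce shape. The sketch below imitates that shape in plain Ruby inside one process. It does not use Hadoop, and the event layout, the weights, and the sample data are assumptions made for illustration, not anything taken from the deck.

```ruby
# Toy, single-process imitation of the map -> group -> reduce shape of the daily batch.
# Event layout, weights, and sample data are hypothetical.
Event = Struct.new(:user, :action, :product)

# Assumed weights: a cart add is treated as a stronger signal than a browse.
WEIGHTS = { "browse" => 1, "cart" => 3 }.freeze

# Map: emit [user, [product, weight]]; a purchase is marked with a nil weight.
def map_event(event)
  weight = event.action == "buy" ? nil : WEIGHTS.fetch(event.action, 0)
  [event.user, [event.product, weight]]
end

# Reduce: for one user, sum weights per product and drop anything already bought.
def reduce_user(pairs)
  bought = pairs.select { |_, w| w.nil? }.map(&:first)
  scores = Hash.new(0)
  pairs.each { |product, w| scores[product] += w unless w.nil? }
  scores.reject { |product, _| bought.include?(product) }
        .sort_by { |product, score| [-score, product] }
        .map(&:first)
end

# Group by user (the "shuffle" step), then reduce each user's pairs independently.
def recommend(events)
  events.map { |e| map_event(e) }
        .group_by(&:first)
        .transform_values { |kvs| reduce_user(kvs.map(&:last)) }
end

EVENTS = [
  Event.new("alice", "browse", "p1"),
  Event.new("alice", "cart",   "p2"),
  Event.new("alice", "buy",    "p1"),
  Event.new("alice", "browse", "p3"),
  Event.new("bob",   "browse", "p2"),
].freeze

recommend(EVENTS).each { |user, products| puts "#{user}\t#{products.join(',')}" }
```

Because each user's reduce step only needs that user's events, the per-user work is independent, which is the property that lets a framework like Hadoop spread it across machines.
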
    1. Big Data for Everyone Twitter: #bd4e
    2. Introduction to Big Data Steve Watt Hadoop Strategy @wattsteve #bd4e
    3. What is “Big Data”?
        • “Every two days we create as much information as we did from the dawn of civilization up until 2003” – Eric Schmidt, Google
        • Current state of affairs:
            • Explosion of user generated content
            • Storage is really cheap so we can store what we want
            • Traditional data stores have reached critical mass
        • Issues:
            • Enterprise Amnesia
            • Traditional architectures become brittle and slow when tasked with trying to process data at petabyte scale
            • How do we process unstructured data?
    4. How were these issues addressed?
        • 2004: Google publishes seminal whitepapers on Map/Reduce and the Google File System, a new programming paradigm to process data at Internet Scale
        • The whitepapers describe the use of Massive Parallelism to allow a system to scale horizontally, achieving linear performance improvements
        • This approach is well suited to a cloud model whereby additional instances can be commissioned/decommissioned to have an immediate effect on performance.
        • The approaches described in the Google white papers were incorporated into the open source Apache Hadoop project.
    5. What is Apache Hadoop?
        • It is a cluster technology with a single master and multiple slaves, designed for commodity hardware.
        • It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce.
        • As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy.
        • Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality).
        • Hadoop may execute or re-execute a job on any node in the cluster. Node failures are automatically handled by the framework.
    6. The Big Data Ecosystem
        • Provisioning: ClusterChef / Apache Whirr / EC2
        • Offline Systems (Analytics): Hadoop
        • Online Systems (OLTP @ Scale): NoSQL (Cassandra / HBase)
        • Scripting: Pig / WuKong / Cascading
        • DBA: Hive / Karmasphere
        • Non-Programmer: BigSheets / DataMeer
        • Import/Export Tooling: Nutch / SQOOP / Flume
        • Human Consumption: Visualizations
        • Commodity Hardware
    7. Offline customer scenario Eric Sammer Solution Architect @esammer #bd4e
    8. Use Case: Product Recommendations
        • “We can provide a better experience (and make more money) if we provide meaningful product recommendations.”
        • We need data:
            • What products did a user buy?
            • What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?
            • What are the attributes of the user? (e.g. income, gender, friends)
            • What are our margins on products, inventory, upcoming promotions?
    9. Problems
        • That’s a lot of data! (2 months of activity + all purchase data + all user data)
            • Activity: ~20GB per day x ~60 days = 1.2TB
            • User Data: ~2GB
            • Purchase Data: ~5GB
            • Misc: Inventory, product costs, promotion schedules
        • Distilling data to aggregates would reduce fidelity.
        • Easy to see how looking at more data could improve recommendations.
        • How do we keep this information current?
    10. The Answer
        • Calculate all qualifying products once a day for each user and store them for quick display
        • Use Hadoop to process data in parallel on hundreds of machines
    12. Online customer scenario Matt Pfeil CEO @mattz62 #bd4e
    13. What is Apache Cassandra? 03/17/11
    14. Use Case: Managing Email
        • “My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can’t keep up. I need to scale horizontally and get incredibly fast real time performance.”
        • The Problem
        • How do we achieve scalability, redundancy, high performance?
        • How do we store billions of files on commodity hardware?
        • How do we increase capacity by simply adding machines? (No SANs!)
        • How do we make it FAST?
    15. Requirements
        • Storage for Email
            • Billions of emails (<100KB avg)
            • 2M users, 100 MB of storage each = 190 TB
            • Growth of 50% every 6 months
            • Durable
        • Requirements for Storage System
            • No Master/Single Point of Failure
            • Linear scalability + redundancy
            • Multiple Active Data Centers
            • Many reads, many writes
            • Millisecond response times
            • Commodity hardware
    16. Solution
        • 800 TB of Storage
        • ~1.75 Million reads or writes/sec (No Cache!)
        • 130 Machines
            • Read/Write at both Data Centers
            • No “Master” Data Center
    17. Where to next? The Adjacent Possible Flip Kromer CTO @mrflip #bd4e
    18. Something about Anything
    19. Everything about Something
    20. Bigger than One Computer
    21. Bigger than Frontal Lobe
    22. Bigger than Excel
    23. what’s coming to help
    24. myth of the “data base”
        • Hold your data
        • Ask questions
    25. Managing & Shipping
        • Hadoop FTW
        • Cassandra, HBase, ElasticSearch, ...
            • Integration is still too hard
        • Dev Ops
        • Reliable Decoupling: Flume, Graphite
    26. Hadoop
        • Data flutters by (label)
        • Elephants make sturdy piles ({GROUP})
        • Number becomes thought (process_group)
    27. Twitter Parser in a Tweet: class TwStP < Streamer; def process line; a = JSON.load(line) rescue {}; yield a.values_at(*a.keys.sort); end; end
    28. pure functionality
    29. pure fun ctionality
    30. Data Stores in Production
        • Cassandra
        • HBase
        • ElasticSearch
        • MySQL
        • Redis
        • TokyoTyrant
        • SimpleDB
        • MongoDB
        • sqlite
        • whisper (graphite)
        • file system
        • S3
    31. Dev Ops: Rethink Hard
        • Cassandra
        • HBase
        • ElasticSearch
        • MySQL
        • Redis
        • TokyoTyrant
        • SimpleDB
        • MongoDB
        • sqlite
        • whisper (graphite)
        • file system
        • S3
    32. Still Blind
        • Visual Grammar to see it: NYTimes, Stamen, Ben Fry
        • Interactive tools:
            • Tableau, Spotfire
            • d3.js, Gephi
    33. Human-Scale Tools
        • Data-as-a-Service:
            • Infochimps, SimpleGeo
            • DrawnToScale
        • Business Intelligence
        • Familiar Paradigm, New Scale
            • BigSheets, Datameer
    34. Panel Discussion Stu Hood Software Engineer @stuhood #bd4e
    35. Thanks for coming! Stu Hood @stuhood Flip Kromer @mrflip Matt Pfeil @mattz62 Eric Sammer @esammer Steve Watt @wattsteve
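
The storage figures on the Requirements slide (2M users at 100 MB each, growing 50% every 6 months) can be checked with a short calculation. The sketch below is an illustration under stated assumptions: it uses 1024-based units, it ignores replication (which the slide does not specify), and the constant and method names are invented for this example.

```ruby
# Sizing arithmetic from the Requirements slide; names are illustrative only.
USERS           = 2_000_000
MB_PER_USER     = 100
GROWTH_PER_STEP = 1.5        # +50% every 6 months

# Raw (unreplicated) storage in TB, converting MB -> GB -> TB with 1024-based units.
def raw_tb(users, mb_per_user)
  users * mb_per_user / (1024.0 * 1024.0)
end

# Storage after a number of 6-month periods of compound growth.
def projected_tb(start_tb, half_years)
  start_tb * GROWTH_PER_STEP**half_years
end

start_tb = raw_tb(USERS, MB_PER_USER)
(0..4).each do |n|
  puts format("after %3.1f years: %7.1f TB", n / 2.0, projected_tb(start_tb, n))
end
```

Compound growth at this rate roughly quintuples the raw data set in two years, which is why the slide stresses adding capacity by simply adding machines.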
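
The parser on the “Twitter Parser in a Tweet” slide relies on the Wukong gem for its input and output plumbing. As a rough standalone illustration of the same per-line logic, the sketch below uses only Ruby's bundled json library. The method name json_line_to_tsv and the sample lines are hypothetical, and unlike the slide's version (which falls back to an empty hash) it skips lines that do not parse to a JSON object.

```ruby
require "json"

# Turn one line of JSON into one TSV row, with values ordered by sorted key,
# mirroring `a.values_at(*a.keys.sort)` from the slide.
# Returns nil for unparseable or non-object lines so the caller can skip them.
def json_line_to_tsv(line)
  record = JSON.parse(line) rescue nil
  return nil unless record.is_a?(Hash)
  record.values_at(*record.keys.sort).join("\t")
end

# Hypothetical sample input standing in for a stream of tweets.
SAMPLE_LINES = [
  '{"text":"hi","id":1,"user":"a"}',
  "not json",
].freeze

SAMPLE_LINES.each do |line|
  row = json_line_to_tsv(line)
  puts row if row
end
```

Nested values (for example a user object inside a tweet) are simply stringified here; flattening them would need a further step that the slide does not show.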