The Other Way of Doing Big Data

The Other Way of Doing Big Data: Declarative, Decoupled, Federated, Simple, and Resilient.

Also known as: How to Win at Scale and its Influence on People. Originally presented by Flip Kromer to the Research Board, http://www.researchboard.com/, June 2012.

Published in: Technology

The Other Way of Doing Big Data

  1. 1. How to Win at Scale and its Influence on People. Philip (flip) Kromer, CTO, Infochimps.com
  2. 2. Big Data is Inevitable. It Demands a New Approach
  3. 3. There’s Another Way
  4. 4. There’s Another Way. You’re Going to Have to Follow It
  5. 5. There’s Another Way. You’re Going to Have to Follow It. It Might Be a Better Way
  6. 6. The Other Way
  7. 7. Massive component count
  8. 8. Federated Truth: email, MySQL, HBase, s3, spreadsheets, elasticsearch, HDFS, hipchat, redis, MongoDB, log files, salesforce, zabbix, hubspot, ADP, Chargify, BC/BS, ZenDesk, google docs
  9. 9. Low Coupling
  10. 10. Reliable Resilient
  11. 11.
      • Manage 100s of machines: architecture as code
      • Contain system complexity: relentlessly decouple
      • Maintain coherency: federated truth
      • Manage true costs: optimize for people, not machines
      • Manage failure & change: resiliency engineering
  12. 12. The Other Way: Declarative, not Homogenous; Decoupled, not Standardized; Federated, not Centralized; Simple, not Performant; Resilient, not Reliable
  13. 13. Declarative
  14. 14. Architecture as Code [stack diagram; labels: Lightweight Dashboard, HBase, API, Data Transport, ESh, flume, ElasticSearch, Operations, Application, Ironfan, ops, ics.com, Hadoop, On-Demand Hadoop, Chef]
  15. 15. [same stack diagram as slide 14, with the HBase cluster expanded into machines: HM, NN, ZK, and six RS]
  16. 16. provision machine, run state, settings, standard components, cluster-specific, facet groups
  17. 17. [same stack diagram as slide 14, with the HBase cluster (HM, NN, ZK, six RS) and one machine expanded into its components: regionserver, ssh, nfs, datanode, zbx, stargate, log, tasktracker, fw, zookeeper]
  18. 18. Wins from Declarative [same stack diagram as slide 14]
  19. 19. Recapitulatable
  20. 20. Portable
  21. 21. Decoupled
  22. 22. Our Stack [stack diagram; labels: Lightweight Dashboard, HBase, API, Data Transport, ESh, flume, ElasticSearch, Operations, Application, ops, ics.com, Hadoop, On-Demand Hadoop]
  23. 23. Our Stack
  24. 24. Our Stack
  25. 25. Engineer : System = 1:10
      • >60 distinct components
      • 50-150 machines
      • 1 ops + 5 hackers + 1 analyst
  26. 26. Self-similar [same stack diagram as slide 14, zoomed into the HBase cluster (HM, NN, ZK, six RS) and then into four machines (alpha, beta, gamma, delta); per-machine component labels: ssh, nfs, zbx, log, fw, hb master, 2d mstr, namenode, 2d nn, jobtracker, zookeeper, regionserver, datanode, stargate, tasktracker]
  27. 27. Example: Scraper. Jobs → Scraper → disk → tail’er → decorator → sink → database
  28. 28. Scraper | flume. Jobs → Scraper → disk → tail’er → decorator → sink → database. Scraper: while true: get_job, fetch_url, dump_to_disk
  29. 29. Scraper | flume. Jobs → Scraper → disk → tail’er → decorator → sink → database. Scraper: while true: get_job, fetch_url, dump_to_disk. tail’er: ensures reliable delivery
  30. 30. Scraper | flume. Jobs → Scraper → disk → tail’er → decorator → sink → database. Scraper: while true: get_job, fetch_url, dump_to_disk. tail’er: ensures reliable delivery. decorator: parse raw => objects
  31. 31. Scraper | flume. Jobs → Scraper → disk → tail’er → decorator → sink → database. Scraper: while true: get_job, fetch_url, dump_to_disk. tail’er: ensures reliable delivery. decorator: parse raw => objects. sink: store object => database
  32. 32. alice, alice, bob, alice, bob, bob
  33. 33. Simple
  34. 34.
      • Immediately Understandable
      • Clear Interface
      • Few Moving Parts
  35. 35. Federated
  36. 36. Data Stores in Production
      • HBase          • MySQL
      • ElasticSearch  • Redis
      • Cassandra      • sqlite
      • TokyoTyrant    • whisper (graphite)
      • SimpleDB       • file system
      • MongoDB        • S3
  37. 37. Programs Used for This Talk
      • Emacs          • Skitch
      • Keynote        • finder
      • Preview        • flickr.com
      • Chrome         • google image search
      • ruby (pry)     • ssh
  38. 38. How’s my Batch Job Going?
      • 1 x Job Status
      • 1 x Counters & App Metrics
      • N x Task Status
      • M x Machine System Stats
      • 1 x Cloud Status
      • 1 x Chef Server
  39. 39. Dataflow is All
  40. 40. [same stack diagram as slide 14] System Diagram, Dataflow, Workflow
  41. 41. [same stack diagram as slide 14] System Diagram, Dataflow, Workflow, Org Chart
  42. 42. Robots are Cheap. People are Important
  43. 43. Expensive / Not Expensive
      1 trillion 10 kb objects:
      • 100% in RAM: $212,000/mo
      • 10% in RAM: $21,000/mo
      • On Disk: $3,000/mo
      • On S3: $1,200/mo
  44. 44. Expensive / Not Expensive
      1 trillion 10 kb objects:
      • 100% in RAM: $212,000/mo
      • 10% in RAM: $21,000/mo
      • On Disk: $3,000/mo
      • On S3: $1,200/mo
      1 Intern, part-time: $1,500/mo
  45. 45. Scalability is People
  46. 46. Monolithic Software means Meetings
  47. 47. Meetings are Death
  48. 48. Decentralize. Decouple.
  49. 49. n^2 law of coupling: 100 things vs. 50 + 30 + 20 things + 20 (tax)
  50. 50. n^2 law of coupling: 100^2 = 10,000 things to go wrong vs. 2500 + 900 + 400 + 400 = 4200 things to go wrong
  51. 51. Infochimps.com 2011 [diagram; labels: text search, Planet of the APIs, acctg, infochimps.com, models, A/B testing, cloud services]
  52. 52. Infochimps.com 2012 [diagram; labels: datasets, catalog API, API docs, text search, content, dashboards, Planet of the APIs, acctg, auth & payment, layout, console, models, A/B testing, blog, press, cloud services, collateral]
  53. 53. Infochimps.com 2012 [diagram; labels: (infochimps), icsexpl, catalog API, (saas), capuchin, elasticsrch, kanzi, beergoggls, Planet of the APIs, MongoDB, george (twice), alphamale, MySQL, redis, WPEngine, totem, cloud services, hubspot]
  54. 54. this drawing fits in my head [diagram label: datasets catalog API] this app fits in my head, and my laptop
  55. 55. Infochimps.com 2012 [diagram; labels: (infochimps), icsexpl, catalog API, (saas), capuchin, elasticsrch, kanzi, beergoggls, Planet of the APIs, MongoDB, george (twice), alphamale, MySQL, redis, WPEngine, totem, cloud services, hubspot]
  56. 56. fin. http://infochimps.com http://github.com/infochimps-labs
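
Slide 16 lists what a declarative cluster description covers: the machine to provision, its run state, settings, standard components, cluster-specific parts, and facet groups. The talk's own tooling is Ironfan and Chef; the sketch below is not that API. It is a minimal Python illustration of the idea, and the names `Cluster`, `Facet`, `machines`, the `machine_type` value, and the `settings` entry are all hypothetical. Component names are taken from the slide 17 and slide 26 labels.

```python
from dataclasses import dataclass, field


@dataclass
class Facet:
    """A group of identical machines within a cluster (hypothetical model)."""
    name: str
    instances: int
    components: list = field(default_factory=list)


@dataclass
class Cluster:
    """A declaration of what should exist, not a script of steps to run."""
    name: str
    machine_type: str                       # what to provision (made-up value below)
    run_state: str = "start"                # desired state, not a command
    settings: dict = field(default_factory=dict)
    standard_components: list = field(default_factory=list)
    facets: list = field(default_factory=list)

    def machines(self):
        """Expand the declaration into one record per machine."""
        for facet in self.facets:
            for index in range(facet.instances):
                yield {
                    "name": f"{self.name}-{facet.name}-{index}",
                    "machine_type": self.machine_type,
                    "run_state": self.run_state,
                    "components": self.standard_components + facet.components,
                }


hbase = Cluster(
    name="hbase",
    machine_type="large",                                    # hypothetical
    settings={"example_setting": 1},                         # hypothetical
    standard_components=["ssh", "nfs", "zbx", "log", "fw"],  # on every machine
    facets=[
        Facet("master", 1, ["hb_master", "namenode", "jobtracker", "zookeeper"]),
        Facet("worker", 2, ["regionserver", "datanode", "tasktracker", "stargate"]),
    ],
)

for machine in hbase.machines():
    print(machine["name"], machine["components"])
```

Because the whole cluster is data, the same declaration can be expanded again elsewhere, which is one way to read the "Recapitulatable" and "Portable" slides.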

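The scraper example on slides 27 to 31 splits one job into stages that only share a file on disk: the scraper dumps raw results, a tail’er delivers them, a decorator parses raw text into objects, and a sink stores objects in a database. The slides use flume for the last three stages; the sketch below is plain Python standing in for it, under several assumptions: `fetch_url` is a stub that makes no network call, the "database" is a dict, the job list is finite rather than a `while true` loop, and every function name is illustrative.

```python
import json
import os
import tempfile


def fetch_url(url):
    """Stand-in for a real HTTP fetch (assumption: no network in this sketch)."""
    return f"<html><title>{url}</title></html>"


def scraper(jobs, spool_path):
    """Scraper: get a job, fetch it, dump the raw result to disk. Nothing else."""
    with open(spool_path, "a", encoding="utf-8") as spool:
        for url in jobs:
            raw = fetch_url(url)
            spool.write(json.dumps({"url": url, "raw": raw}) + "\n")


def tail(spool_path):
    """Tail'er: read spooled lines from disk and hand them downstream."""
    with open(spool_path, encoding="utf-8") as spool:
        for line in spool:
            yield line


def decorate(lines):
    """Decorator: parse raw lines into objects."""
    for line in lines:
        record = json.loads(line)
        title = record["raw"].split("<title>")[1].split("</title>")[0]
        yield {"url": record["url"], "title": title}


def sink(objects, database):
    """Sink: store each object in the database (a dict here)."""
    for obj in objects:
        database[obj["url"]] = obj


jobs = ["http://example.com/a", "http://example.com/b"]
database = {}
with tempfile.TemporaryDirectory() as spool_dir:
    spool_path = os.path.join(spool_dir, "scraper.log")
    scraper(jobs, spool_path)                    # stage 1 knows only about the disk
    sink(decorate(tail(spool_path)), database)   # stages 2-4 know only about the file

print(database)
```

The scraper never imports the parser or the database client, so either side can be rewritten or restarted without touching the other.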
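The arithmetic behind the "n^2 law of coupling" on slides 49 and 50 can be checked directly. The sketch assumes each group of n tightly coupled things has about n^2 ways to go wrong, and reads the decoupled groups as 50, 30, and 20 things plus a tax of 20, which is inferred from the squares 2500, 900, 400, and 400 on slide 50. The function name is illustrative.

```python
def things_to_go_wrong(group_sizes):
    """Sum of squared group sizes: n coupled things have about n^2 interactions."""
    return sum(n * n for n in group_sizes)


monolith = things_to_go_wrong([100])
decoupled = things_to_go_wrong([50, 30, 20, 20])  # the last 20 is the decoupling "tax"

print(monolith)   # 10000
print(decoupled)  # 4200
```

Even after paying a tax of 20 extra things, the decoupled system has fewer than half as many things to go wrong.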