The Other Way of Doing Big Data
The Other Way of Doing Big Data: Declarative, Decoupled, Federated, Simple, and Resilient.

Also known as: How to Win at Scale and its Influence on People. Originally presented by Flip Kromer to the Research Board (http://www.researchboard.com/), June 2012.

  • Slide 8: This is at a 15-person organization. Federated, meaning the data is semantically disparate.
  • Slide 11: People are walking around as if we used to have one kind of database and now we have two. The important fact isn’t that one of them is sharded. The important fact is that they’re proliferating, and that’s a good thing.
  • Slide 12: Google, Facebook, Amazon had to solve the scalability problem.
  • Slides 38-39: Now I know this sounds like the lunacy of a Ritalin-addled architecture astronaut spending too much time on StackOverflow.
  • Slides 43-44: cost per month for 10 TB of data:
      • 10 TB in RAM, on 146 Amazon EC2 m2.4xlarge: 10_000 * 2.00 * 24 * 30.25 / 68.4 = $212,280 (about $200k)
      • 10 TB data, 10% in RAM, on 57 Amazon EC2 m2.xlarge: 0.1 * 10_000 * 0.50 * 24 * 30.25 / 17.5 = $20,743 (about $20k)
      • 10 TB data on disk, on 6 Amazon EC2 c1.xlarge: machines, price, disk, ram = [6, 0.68, 1_690, 7] ; [(tot_disk = disk * machines), (machine_dollars_mo = (machines * price * 24 * 30.25).round)] gives $2,962 (about $3k)
      • 10 TB data on S3: $1,250 (about $1.2k)
      • 1 intern, $10/hr, 25 hrs/wk, not incl. overhead: $1,100
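
The cost arithmetic in the notes for slides 43-44 can be replayed as a short Ruby script. This is a minimal sketch, not the author's own tooling: the method names are hypothetical, the prices and RAM sizes are the ones quoted in the notes (read here as dollars per machine-hour and GB of RAM per machine), and the 52/12 weeks-per-month figure for the intern is an assumption.

```ruby
# Replays the monthly cost arithmetic from the slide notes for 10 TB of data.
HOURS_PER_MONTH = 24 * 30.25 # the notes use a 30.25-day month
DATA_GB = 10_000             # 10 TB, as in the notes

# Cost of holding `fraction` of the data in RAM.
# price_per_hour and ram_gb_per_machine are the figures quoted in the notes.
def ram_cost(fraction, price_per_hour, ram_gb_per_machine)
  fraction * DATA_GB * price_per_hour * HOURS_PER_MONTH / ram_gb_per_machine
end

# Cost of a fixed number of machines running all month.
def fleet_cost(machines, price_per_hour)
  machines * price_per_hour * HOURS_PER_MONTH
end

all_in_ram   = ram_cost(1.0, 2.00, 68.4) # m2.4xlarge figures from the notes
tenth_in_ram = ram_cost(0.1, 0.50, 17.5) # m2.xlarge figures from the notes
on_disk      = fleet_cost(6, 0.68)       # 6 x c1.xlarge
total_disk   = 6 * 1_690                 # GB of disk across the 6 machines
intern       = 10 * 25 * 52 / 12.0       # $10/hr, 25 hrs/wk; 52/12 weeks per month is an assumption

{
  '100% in RAM' => all_in_ram,
  '10% in RAM'  => tenth_in_ram,
  'on disk'     => on_disk,
  'intern'      => intern
}.each do |label, dollars|
  puts "#{label}: $#{dollars.round} / month"
end
puts "disk available: #{total_disk} GB"
```

The point the slides make survives the rounding: every storage option except all-RAM costs about as much as, or far less than, a handful of people.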

    1. How to Win at Scale and its Influence on People. Philip (flip) Kromer, CTO, Infochimps.com
    2. Big Data is Inevitable. It Demands a New Approach.
    3. There’s Another Way
    4. There’s Another Way. You’re Going to Have to Follow It.
    5. There’s Another Way. You’re Going to Have to Follow It. It Might be a Better Way.
    6. The Other Way
    7. Massive component count
    8. Federated Truth: email, MySQL, HBase, s3, spreadsheets, elasticsearch, HDFS, hipchat, redis, MongoDB, log files, salesforce, zabbix, hubspot, ADP, Chargify, BC/BS, ZenDesk, google docs
    9. Low Coupling
    10. Reliable Resilient
    11. • Manage 100s of machines: architecture as code
        • Contain system complexity: relentlessly decouple
        • Maintain coherency: federated truth
        • Manage true costs: optimize for people not machines
        • Manage failure & change: resiliency engineering
    12. The Other Way: Declarative, not Homogenous; Decoupled, not Standardized; Federated, not Centralized; Simple, not Performant; Resilient, not Reliable
    13. Declarative
    14. Architecture as Code [stack diagram; labels: Lightweight Dashboard, HBase, API, Data Transport, ESh, flume, ElasticSearch, Operations, Application, ops, ics.com, Hadoop, On-Demand Hadoop, Ironfan + Chef]
    15. [stack diagram as on slide 14, with machine labels HM, NN, ZK and six RS]
    16. provision machine; run state; settings; standard components; cluster-specific; facet groups
    17. [stack diagram as on slide 14, with machine labels HM, NN, ZK and six RS, and component labels: regionserver, ssh, nfs, datanode, zbx, stargate, log, tasktracker, fw, zookeeper]
    18. Wins from Declarative [stack diagram as on slide 14]
    19. Recapitulatable
    20. Portable
    21. Decoupled
    22. Our Stack [stack diagram as on slide 14]
    23. Our Stack
    24. Our Stack
    25. Engineer : System = 1:10
        • >60 distinct components
        • 50-150 machines
        • 1 ops + 5 hackers + 1 analyst
    26. Self-similar [stack diagram as on slide 14, with machine labels HM, NN, ZK and six RS; machines alpha and beta carry the labels ssh, hb master, hb 2d mstr, nfs, namenode, 2d nn, jobtracker, zbx, log, zookeeper, fw; machines gamma and delta each carry regionserver, ssh, nfs, datanode, zbx, stargate, log, tasktracker, fw, zookeeper]
    27. Example: Scraper [pipeline diagram; labels: Scraper, disk, tail’er, decorator, sink, Jobs, database]
    28. Scraper [pipeline diagram, now with a flume label]; caption: while true: get_job; fetch_url; dump_to_disk
    29. Scraper [pipeline diagram with flume label]; captions: while true: get_job; fetch_url; dump_to_disk | ensures reliable delivery
    30. Scraper [pipeline diagram with flume label]; captions: while true: get_job; fetch_url; dump_to_disk | ensures reliable delivery | parse raw => objects
    31. Scraper [pipeline diagram with flume label]; captions: while true: get_job; fetch_url; dump_to_disk | ensures reliable delivery | parse raw => objects | store object => database
    32. alice alice bob alice bob bob
    33. Simple
    34. • Immediately Understandable
        • Clear Interface
        • Few Moving Parts
    35. Federated
    36. Data Stores in Production: HBase, MySQL, ElasticSearch, Redis, Cassandra, sqlite, TokyoTyrant, whisper (graphite), SimpleDB, file system, MongoDB, S3
    37. Programs Used for This Talk: Emacs, Skitch, Keynote, finder, Preview, flickr.com, Chrome, google image search, ruby (pry), ssh
    38. How’s my Batch Job Going?
        • 1 x Job Status
        • 1 x Counters & App Metrics
        • N x Task Status
        • M x Machine System Stats
        • 1 x Cloud Status
        • 1 x Chef Server
    39. Dataflow is All
    40. [stack diagram as on slide 14, captioned: System Diagram, Dataflow, Workflow]
    41. [stack diagram as on slide 14, captioned: System Diagram, Dataflow, Workflow, Org Chart]
    42. Robots are Cheap. People are Important.
    43. Expensive / Not Expensive. 1 billion 10 kB objects (10 TB):
        • 100% in RAM: $212,000 /mo
        • 10% in RAM: $21,000 /mo
        • On Disk: $3,000 /mo
        • On S3: $1,200 /mo
    44. Expensive / Not Expensive. 1 billion 10 kB objects (10 TB):
        • 100% in RAM: $212,000 /mo
        • 10% in RAM: $21,000 /mo
        • On Disk: $3,000 /mo
        • On S3: $1,200 /mo
        • 1 Intern, part-time: $1,500 /mo
    45. Scalability is People
    46. Monolithic Software means Meetings
    47. Meetings are Death
    48. Decentralize. Decouple.
    49. n^2 law of coupling: 100 things vs. 50 + 30 + 20 things + 20 (tax)
    50. n^2 law of coupling: 10,000 things to go wrong vs. 2500 + 900 + 400 + 400 = 4200 things to go wrong
    51. Infochimps.com 2011 [diagram labels: text search, Planet of the APIs, API acctg, infochimps.com, models, A/B testing, cloud services]
    52. Infochimps.com 2012 [diagram labels: datasets, catalog API, API docs, text search, content, dashboards, Planet of the APIs, API acctg, auth & payment, layout, console, models, A/B testing, blog, press, cloud services, collateral]
    53. Infochimps.com 2012 [diagram labels: (infochimps), icsexpl, catalog API, (saas), capuchin, elasticsrch, kanzi, beergoggls, Planet of the APIs, MongoDB, george, alphamale, MySQL, redis, WPEngine, totem, cloud services, hubspot]
    54. this drawing fits in my head [datasets, catalog API]; this app fits in my head, and my laptop
    55. Infochimps.com 2012 [diagram labels: (infochimps), icsexpl, catalog API, (saas), capuchin, elasticsrch, kanzi, beergoggls, Planet of the APIs, MongoDB, george, alphamale, MySQL, redis, WPEngine, totem, cloud services, hubspot]
    56. fin. http://infochimps.com http://github.com/infochimps-labs
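
Slides 14-17 describe "architecture as code": a cluster is declared as facet groups, each with standard and cluster-specific components, and machines are derived from that declaration. The sketch below illustrates the idea with plain Ruby data. It is not Ironfan's or Chef's DSL; the hash layout, key names, machine naming scheme, the `hb_master` component name and the `expand` helper are all hypothetical. The regionserver component names are the ones shown on slide 17.

```ruby
# A hypothetical declarative cluster description as plain Ruby data.
# This is NOT the Ironfan or Chef DSL; layout and key names are illustrative.
STANDARD_COMPONENTS = %w[ssh nfs zbx log fw].freeze

CLUSTER = {
  name: 'hbase',
  facets: {
    master:       { instances: 1, components: %w[hb_master namenode zookeeper] },
    regionserver: { instances: 6, components: %w[regionserver datanode stargate tasktracker zookeeper] }
  }
}.freeze

# Expand the declaration into one record per machine:
# every machine gets the standard components plus its facet's own.
def expand(cluster, standard)
  cluster[:facets].flat_map do |facet, spec|
    (0...spec[:instances]).map do |index|
      {
        name: "#{cluster[:name]}-#{facet}-#{index}",
        components: (standard + spec[:components]).uniq
      }
    end
  end
end

machines = expand(CLUSTER, STANDARD_COMPONENTS)
machines.each { |m| puts "#{m[:name]}: #{m[:components].join(', ')}" }
```

Because the machine list is computed from the declaration, changing `instances` or a component list and re-running is the whole change procedure, which is the property slides 19-20 call recapitulatable and portable.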
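
Slides 27-31 walk through a scraper built from decoupled stages: scraper, disk, tail’er, decorator, sink. A minimal Ruby sketch of that shape follows. It assumes the captions belong to the scraper, tail’er, decorator and sink in that order, replaces the network fetch with a stub, uses a temporary file for the disk spool and a Hash for the database, and makes a single pass where the slide shows an endless loop. It does not reproduce flume or its delivery guarantees; every method name here is hypothetical.

```ruby
require 'json'
require 'tempfile'

# Stand-in job queue (assumption): a fixed list of URLs.
JOBS = ['http://example.com/a', 'http://example.com/b'].freeze

# Hypothetical stub; a real scraper would perform an HTTP GET here.
def fetch_url(url)
  "<html>page for #{url}</html>"
end

# Scraper: get_job, fetch_url, dump_to_disk.
# It only writes raw lines to the spool and knows nothing about later stages.
def scrape(jobs, spool)
  jobs.each do |url|
    spool.puts(JSON.generate('url' => url, 'body' => fetch_url(url)))
  end
  spool.flush
end

# Tail'er: hands each raw line on the spool to the next stage.
def tail(path, &block)
  File.foreach(path) { |line| block.call(line.chomp) }
end

# Decorator: parse raw => objects.
def decorate(raw_line)
  record = JSON.parse(raw_line)
  { url: record['url'], bytes: record['body'].bytesize }
end

# Sink: store object => database (a Hash stands in for the database).
def sink(object, database)
  database[object[:url]] = object
end

database = {}
Tempfile.create('scraper-spool') do |spool|
  scrape(JOBS, spool)
  tail(spool.path) { |raw| sink(decorate(raw), database) }
end

puts database.keys
```

Each stage touches only its immediate input and output, so the scraper can keep running while a parser bug is fixed: the raw lines stay on disk and can be replayed.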
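
The "n^2 law of coupling" on slides 49-50 is a one-line computation. The squares on slide 50 (2500, 900, 400, 400) correspond to groups of 50, 30 and 20 things plus a 20-thing decoupling tax, compared with a single group of 100. A small sketch, with a hypothetical method name:

```ruby
# Under the slide's rule of thumb, a group of n coupled things
# has on the order of n^2 things that can go wrong.
def things_to_go_wrong(group_sizes)
  group_sizes.sum { |n| n**2 }
end

monolith  = things_to_go_wrong([100])
decoupled = things_to_go_wrong([50, 30, 20, 20]) # the last 20 is the "tax"

puts "monolith:  #{monolith}"
puts "decoupled: #{decoupled}"
```

Even after paying a tax of 20 extra things, the decoupled system has well under half as many ways to fail.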