Infochimps + CloudCon: Infinite Monkey Theorem


Big Data expert and Infochimps CEO Jim Kaskade presents the Infinite Monkey Theorem at CloudCon Expo. He offers an energetic, inspiring, and practical perspective on why Big Data is disruptive. It's more than historical data analyzed on Hadoop, and more than real-time streaming data stored and queried using NoSQL. Learn more at

Published in: Technology
  • Avinash Kaushik gave a talk at Strata 2012 in Santa Clara in March. If you listen to all the hype around Big Data, it solves the first problem. If you listen to all the vendors, there is a lot of emphasis on the first part (perhaps Infochimps included) and very little on the second. I think that's because we don't exactly know how to truly empower the organization to interact directly with any and all available data. It's too expensive, risky, and complex.
  • 40%+ YoY growth, with 2012 alone generating 2.4 zettabytes.
  • AMP: Access Module Processor. PE: Parsing Engine. BYNET: Banyan cross-bar switch YNET (Y Network). Store: the Parsing Engine dispatches a request to insert a row; the BYNET ensures that the row gets to the appropriate AMP via the hashing algorithm; the AMP stores the row on its associated disk (each AMP can have multiple physical disks associated with it). Retrieve: the Parsing Engine dispatches a request to retrieve one or more rows; the BYNET ensures that the appropriate AMP(s) are activated; the AMPs locate and retrieve the desired rows in parallel, sorting, aggregating, or formatting if needed; the BYNET returns the retrieved rows to the Parsing Engine, which returns them to the requesting client application. Teradata's shared-nothing architecture allows for highly scalable data volumes.
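The hash-based row routing described above can be sketched in a few lines of Python. This is a toy illustration, not Teradata's actual hashing algorithm; the AMP count and the use of MD5 are assumptions made for the example.

```python
import hashlib

NUM_AMPS = 4  # assumed number of Access Module Processors

def amp_for_row(primary_index_value):
    """Hash the primary index value to pick the AMP that owns this row."""
    digest = hashlib.md5(str(primary_index_value).encode()).hexdigest()
    return int(digest, 16) % NUM_AMPS

# Store: each AMP keeps only the rows the hash assigns to it
amps = {i: [] for i in range(NUM_AMPS)}
for customer_id in range(10):
    amps[amp_for_row(customer_id)].append(customer_id)

# Retrieve: every AMP scans its local rows in parallel (simulated serially here),
# and the results are merged and sorted on the way back to the client
result = sorted(row for rows in amps.values() for row in rows)
```

Because the hash is deterministic, a later lookup of the same primary index value lands on the same AMP, which is what makes single-row retrieval a one-AMP operation.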
  • 3-node Hadoop system: $8K/node, $10K switch, $4K/node Hadoop distro. $24K + $10K x 25% x 3 maintenance = $43K; $4K x 3 x 3 = $36K; Total =
There are three essential elements of an analytic platform: strong support for analytic database query, in a variety of query styles (at a minimum SQL, MDX, or graph); strong support for analytic processes other than queries, typically mathematics (statistics, predictive analytics, data mining, linear algebra, optimization, graph theory, etc.) and/or data transformation (e.g., sessionization, entity extraction); and strong integration between the first two. The point is that an analytic platform is something on which you can build a range of powerful analytic applications. Some specifics of what to look for in an analytic platform may be found in the link above.
Data warehouse (full or partial). Kinds of data likely to be included: all, but especially operational. Likely use styles: all. Canonical example: central EDW for a big enterprise. Stresses: concurrency, reliability, workload management. Classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server.
Traditional data mart. Kinds of data likely to be included: all. Likely use styles: business intelligence, budgeting/consolidation, investigative. Examples: reporting servers, planning/consolidation servers, anything MOLAP, etc. Stresses: performance, concurrency, TCO. A columnar DBMS might have more attractive performance and TCO (total cost of ownership); the same goes for Netezza. Some of them, e.g. Sybase IQ and Vertica, have excellent track records in concurrent usage as well.
Investigative data mart (agile). Kinds of data likely to be included: all, especially customer-centric. Likely use styles: investigative. Canonical example: a few analysts getting a few TB to examine. Stresses: ease of setup/load, ease of admin, price/performance. Infobright is often cost-effective among columnar analytic DBMSs.
Investigative data mart (big). Kinds of data likely to be included: all, especially customer-centric, logs, financial trade, scientific. Likely use styles: investigative. Canonical example: single-subject 20 TB to 20 PB relational database. Stresses: performance, scale-out, analytic functionality. Performance and scalability are major challenges, usually best addressed by MPP (massively parallel processing) systems such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum.
Bit bucket (Hadoop). Kinds of data likely to be included: logs, other technical/external. Likely use styles: staging/ETL, investigative. Canonical example: log files in a Hadoop cluster. Stresses: TCO, scale-out, transform/big-query performance, ETL functionality.
Archival data store. Kinds of data likely to be included: operational, CDR (call detail record), security log. Likely use styles: archival, reporting (for compliance), possibly also investigative. Examples: any long-term detailed historical store. Stresses: TCO, compression, scale-out, performance (if multi-use). Perhaps only Rainstor truly embraces the archival positioning.
Outsourced data mart. Kinds of data likely to be included: all. Likely use styles: traditional BI, investigative analytics, staging/ETL. Examples: advertising tracking, SaaS CRM. Stresses: performance, TCO, reliability, concurrency. In Oracle shops, Vertica gets the nod in a number of these cases.
Operational analytics server. Kinds of data likely to be included: customer-centric, log, financial trade. Likely use styles: advanced operational analytics. Examples: lower latency, web or call-center personalization and anti-fraud; higher latency, customer profiling and Basel 3 risk analysis. Stresses: performance, reliability, analytic functionality, perhaps concurrency.
  • Being the CEO of Infochimps, I felt compelled to share a little "chimpy" research with you. The "Infinite Monkey Theorem" is a metaphor that directly relates to Big Data, and one I think you'll appreciate. So what is the "Infinite Monkey Theorem"? The following definition is a variant of the original theorem; let me read it to you. This theorem has been traced back to Aristotle's "On Generation and Corruption," where he makes deductions about the unexperienced and unobservable based on real experiences and real observations.
  • This theorem has been traced back to Aristotle's "On Generation and Corruption," where he makes deductions about the unexperienced and unobservable based on real experiences and real observations. Think about this a little: we're talking about analyzing real-world experiences and observations to predict what will happen with our business in the future, the unexperienced and unobserved. This is fundamentally what Big Data proposes to help with.
  • So as a metaphor, the "monkey" is not an actual monkey, but a stand-in for an abstract device that produces a sequence of letters and symbols. "Almost surely" is a mathematical term with a precise meaning. And Shakespeare's Hamlet carries a broader meaning as well: it represents any text, any work, any insight.
  • So let's look at this in more depth. The infinite number of monkeys represents today's seemingly unlimited computational power of either public or private clouds, as an elastic delivery method. The keys on a typewriter capture discrete transactions, which only when analyzed together can yield meaning; again, we amass the computational power to process data. "Almost surely" translates into a mathematical term, namely the concept of significance. And finally, Shakespeare's Hamlet is what we strive to create; it is the source of our happiness, our translation of this raw resource into insight.
  • Now this may seem "chimpy," but this is beautiful. I love this metaphor. But we have a LARGE problem…
  • We have a problem today with our data infrastructure, with our ability to glean insights. I think all of you know what I'm referring to: the fact that we're operating on less than 15% of the corporate data available to us. Even with the ENTERPRISE DATA WAREHOUSE, the EDW, which supposedly stores a COMPLETE, SINGLE VIEW OF THE TRUTH, we're still giving our business users only a tiny bit of data.
  • The Business User
  • The Business User
  • So why is an elastic, unlimited computational resource important? Op-ex vs. cap-ex; cost reduction due to better utilization and productivity; time to market.
  • Hedge funds and Wall Street firms are using Cold War-style satellite surveillance to gather market-moving information. The Port of Long Beach is the second-busiest container port in the United States and acts as a major gateway for trade between the US and Asia. With activity from this port estimated at over $100 billion per year, it is a location that pays to keep track of.

Satellite analysts use these images to count shipping containers coming off ships in California and get a sense of overall US import activity, comparing it month by month. This analysis is being performed in Amazon's EC2.
  • Now let's talk about processing your enterprise data assets, your Big Data. Again, we can leverage cloud infrastructure to scale to the level of any processing needs you may have.
  • The current image shows a Walmart in Wichita, Kansas. Analysts count cars in Wal-Mart parking lots to measure overall customer traffic and understand growth versus the competition. For example, Wal-Mart's growth was determined to come mostly from areas of high unemployment. This type of analysis is being performed in Amazon's EC2.
  • The current image shows a Target in the Moraine Point Plaza located in Gardiner, North. Analysts comparing satellite parking-lot data with regional unemployment trends found Target's growth tended to come in areas of lower-than-average unemployment.

Again, these processes are being performed in Amazon EC2. This is interesting, but how do we process the data further to help derive more relevant insights?
  • The way this is performed is by taking data sources like images and storing them in Hadoop, then using Big Data tools like MapReduce to perform sophisticated analysis on those aggregated data sets. Why is this concept so disruptive? A fraction of the price; no structured data model (i.e., no star schema); and yet the ability to run sophisticated queries and algorithms against all your detailed data.
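The map-then-aggregate pattern above can be sketched in plain Python. This is a toy single-process illustration of the MapReduce idea, not actual Hadoop code; `count_cars` is a hypothetical stand-in for real image analysis, and the data is pre-labeled for the example.

```python
from collections import defaultdict

def count_cars(image):
    """Hypothetical mapper: in a real pipeline this would run computer
    vision over a satellite image; here each 'image' carries its count."""
    return image["store"], image["cars"]

def map_reduce(images):
    # Map: emit (store, car_count) pairs from each satellite image
    pairs = [count_cars(img) for img in images]
    # Shuffle + Reduce: sum car counts per store key
    totals = defaultdict(int)
    for store, cars in pairs:
        totals[store] += cars
    return dict(totals)

images = [
    {"store": "walmart_wichita", "cars": 120},
    {"store": "walmart_wichita", "cars": 95},
    {"store": "target_gardiner", "cars": 80},
]
totals = map_reduce(images)
# totals == {'walmart_wichita': 215, 'target_gardiner': 80}
```

The disruptive part is that in Hadoop the map and reduce phases run in parallel across the cluster over the raw, unmodeled data, with no star schema designed up front.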
  • The Business User
  • The previous examples of Walmart and Target involved a regression algorithm executed against the satellite data plus other data to produce a quarterly revenue prediction that BEAT all previous models.
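A toy version of that regression, fitting quarterly revenue against a single parking-lot signal with ordinary least squares. The real models combined many signals; the single feature and all numbers here are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical history: average cars counted in lots vs. quarterly revenue ($B)
cars = [100, 110, 120, 130]
revenue = [10.0, 11.0, 12.0, 13.0]

a, b = fit_line(cars, revenue)
prediction = a + b * 140  # forecast for a quarter averaging 140 cars per lot
```

Extending this to satellite counts plus employment, weather, and sentiment data turns it into a multiple regression, but the idea is the same: simple algorithms over lots of data.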
  • Which brings us to the discussion around insights.
  • A quote that sets the theme: the definition of the "Infinite Monkey Theorem."
  • The Business User

    1. Big Data & Cloud: Infinite Monkey Theorem. CloudCon Expo & Conference, October 2012
    2. First: What is Big Data? "Data sets so large and complex that it becomes difficult to process using on-hand database management tools." Request a Demo
    3. Data Volume Growing 44x: 2010 = 1.2 zettabytes/yr; 2020 = 35.2 zettabytes/yr. Source: 2011 IDC Digital Universe Study
    4. Enterprise Data Warehouse (diagram: a request flows through Parsing Engines and the BYNET interconnect to AMPs on each node, which return the answer)
    5. Big Data Warehouse (diagram: search, recommend, rank, score, and next-best-action analytics; a master with Name Node and Job Tracker, connected over an Ethernet interconnect to slaves running Task Tracker and Data Node over semi-structured data)
    6. (quadrant diagram: real-time vs. batch and small vs. large enterprise, spanning the traditional operational application ecosystem, analytic appliances, toolset integration, traditional decision support, and hardened deployment in public/private cloud)
    7. Next: Infinite Monkey Theorem: an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    8. "unexperienced and unobservable" based on "real experiences and real observations"
    9. Infinite Monkey Theorem: an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    10. (table) infinite number of monkeys = unlimited computational power; keys on a typewriter = processing data; almost surely = statistically significant; Shakespeare's Hamlet = insights
    11. #thisischimpy
    12. Problem: "Little Data For Business Users"
    13. (image slide)
    14. (image slide)
    15. "Big Data For Business Users"
    16. Reduce Friction (diagram: data reaches the executive only through several costly intermediate stages)
    17. #thisisreallygood
    18. Unlimited computational power (public, private, or virtual private cloud)
    19. Analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity. (10/22/2012, Infochimps Confidential)
    20. Data processing (public, private, or virtual private cloud)
    21. Walmart
    22. Target
    23. (architecture diagram: sources such as images, web/mobile, CRM/ERP/SCM business transactions, docs & text, web interactions, web logs, social, sensors, and GPS feed SQL, NoSQL, NewSQL, EDW, and MPP stores, which in turn feed business intelligence and analytics: dashboards, reports, visualization)
    24. Statistically significant (public, private, or virtual private cloud)
    25. #lotsofdata + #simplealgorithms
    26. (diagram: cars in lot, news text, web pricing, social sentiment, weather sensors, and local employment feed a quarterly revenue prediction)
    27. Insights (public, private, or virtual private cloud)
    28. (diagram: new-media sources such as Gnip Powertrack, Gnip EDC, Moreover Metabase, and sentiment feeds, plus traditional media via TV, radio, and print transcription, flow through listening services, in-motion data delivery APIs, and NoSQL to data scientists, app developers, business users, and IT staff)
    29. Unlimited computational power; processing data; statistically significant; insights
    30. #inspiredbyAvinashKaushik
    31. #1 Big Data Platform. Learn More: Download the Infochimps™ Platform Technical White Paper, "Gain Big Insights from Big Data."