Big Data & Cloud - Infinite Monkey Theorem


See how an infinite number of monkeys on typewriters eventually recreate Shakespeare, and how the theorem serves as a metaphor for cloud computing and big data. More importantly, see a number of Big Data use-case examples.

  • Avinash Kaushik gave a talk at Strata 2012 in Santa Clara in March. If you listen to all the hype around Big Data, it solves the first problem. If you listen to all the vendors, there is a lot of emphasis on the first part (perhaps Infochimps included) and very little on the second. I think that's because we don't exactly know how to truly empower the organization to interact directly with any and all data available. It's too expensive, risky, and complex.
  • 40%+ YoY growth, with 2012 alone generating 2.4 zettabytes.
  • AMP: Access Module Processor. PE: Parsing Engine. BYNET: Banyan cross-bar switch YNET (Y Network).
    Store: The Parsing Engine dispatches a request to insert a row. The BYNET ensures the row gets to the appropriate AMP via the hashing algorithm. The AMP stores the row on its associated disk; each AMP can have multiple physical disks associated with it.
    Retrieve: The Parsing Engine dispatches a request to retrieve one or more rows. The BYNET ensures that the appropriate AMP(s) are activated. The AMPs locate and retrieve the desired rows in parallel and will sort, aggregate, or format them if needed. The BYNET returns the retrieved rows to the Parsing Engine, which returns the row(s) to the requesting client application.
    Teradata's shared-nothing architecture allows for highly scalable data volumes.
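The store/retrieve flow above can be sketched in a few lines of Python. This is a minimal illustration of the shared-nothing idea, not Teradata's actual implementation: all class and method names are invented, and a simple hash stands in for Teradata's hashing algorithm.

```python
# Sketch of shared-nothing row distribution, loosely modeled on the
# PE -> BYNET -> AMP flow described above. All names are illustrative.
import hashlib

class Amp:
    """One Access Module Processor with its own local storage."""
    def __init__(self):
        self.rows = []

    def store(self, row):
        self.rows.append(row)

    def retrieve(self, predicate):
        # Each AMP scans only its own rows, so AMPs can work in parallel.
        return [r for r in self.rows if predicate(r)]

class ByNet:
    """Interconnect: routes each row to one AMP via a hash of its key."""
    def __init__(self, n_amps):
        self.amps = [Amp() for _ in range(n_amps)]

    def amp_for(self, key):
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.amps[int(digest, 16) % len(self.amps)]

    def insert(self, key, row):
        self.amp_for(key).store(row)

    def query(self, predicate):
        # Fan out to every AMP, then merge the partial results,
        # as the BYNET returns retrieved rows to the Parsing Engine.
        out = []
        for amp in self.amps:
            out.extend(amp.retrieve(predicate))
        return out

net = ByNet(n_amps=3)
for i in range(100):
    net.insert(i, {"id": i, "value": i * 2})
rows = net.query(lambda r: r["value"] > 190)
print(len(rows))
```

Because each row lands on exactly one AMP and each AMP owns its own disk, adding nodes adds both storage and query parallelism, which is why the architecture scales.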
  • 3-node Hadoop system: $8K/node; $10K switch; $4K/node Hadoop distro. $24K + $10K x 25% x 3 maintenance = $43K; $4K x 3 x 3 = $36K; Total =
    There are three essential elements of an analytic platform: strong support for analytic database query, with a variety of query styles (at a minimum SQL, MDX, or graph); strong support for analytic processes other than queries, typically mathematics (statistics, predictive analytics, data mining, linear algebra, optimization, graph theory, etc.) and/or data transformation (e.g. sessionization, entity extraction); and strong integration between the first two. The point is that an analytic platform is something on which you can build a range of powerful analytic applications. Some specifics of what to look for in an analytic platform may be found in the link above.
    Data warehouse (full or partial). Kinds of data likely to be included: all, but especially operational. Likely use styles: all. Canonical example: central EDW for a big enterprise. Stresses: concurrency, reliability, workload management. Classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server.
    Traditional data mart. Kinds of data likely to be included: all. Likely use styles: business intelligence, budgeting/consolidation, investigative. Examples: reporting servers, planning/consolidation servers, anything MOLAP, etc. Stresses: performance, concurrency, TCO. Columnar DBMS might have more attractive performance and TCO (total cost of ownership); the same goes for Netezza. Some of them, e.g. Sybase IQ and Vertica, have excellent track records in concurrent usage as well.
    Investigative data mart (agile). Kinds of data likely to be included: all, especially customer-centric. Likely use styles: investigative. Canonical example: a few analysts getting a few TB to examine. Stresses: ease of setup/load, ease of admin, price/performance. Infobright is often cost-effective among columnar analytic DBMS.
    Investigative data mart (big). Kinds of data likely to be included: all, especially customer-centric, logs, financial trade, scientific. Likely use styles: investigative. Canonical example: single-subject 20 TB to 20 PB relational database. Stresses: performance, scale-out, analytic functionality. Performance and scalability are major challenges, usually best addressed by MPP (massively parallel processing) systems such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum.
    Bit bucket (Hadoop). Kinds of data likely to be included: logs, other technical/external. Likely use styles: staging/ETL, investigative. Canonical example: log files in a Hadoop cluster. Stresses: TCO, scale-out, transform/big-query performance, ETL functionality.
    Archival data store. Kinds of data likely to be included: operational, CDR (call detail record), security log. Likely use styles: archival, reporting (for compliance), possibly also investigative. Examples: any long-term detailed historical store. Stresses: TCO, compression, scale-out, performance (if multi-use). Perhaps only Rainstor truly embraces the archival positioning.
    Outsourced data mart. Kinds of data likely to be included: all. Likely use styles: traditional BI, investigative analytics, staging/ETL. Examples: advertising tracking, SaaS CRM. Stresses: performance, TCO, reliability, concurrency. Oracle shops: Vertica gets the nod in a number of these cases.
    Operational analytic(s) server. Kinds of data likely to be included: customer-centric, log, financial trade. Likely use styles: advanced operational analytics. Examples: lower latency, web or call-center personalization and anti-fraud; higher latency, customer profiling and Basel 3 risk analysis. Stresses: performance, reliability, analytic functionality, perhaps concurrency.
  • Being the CEO of Infochimps, I felt compelled to share a little “chimpy” research with you: the “Infinite Monkey Theorem”, a METAPHOR that directly relates to Big Data, and one I think you'll appreciate. So what is the “Infinite Monkey Theorem”? The following definition is a variant of the original theorem; let me read it to you. This theorem has been traced back to Aristotle's “On Generation and Corruption”, where he makes deductions about the unexperienced and unobservable based on real experiences and real observations.
  • Think about this a little: we're talking about analyzing real-world experiences and observations to predict what will happen with our business in the future, the unexperienced and unobserved. This is fundamentally what Big Data proposes to help with.
  • So as a metaphor, the “monkey” is not an actual monkey but an abstract device that produces a sequence of letters and symbols. And “almost surely” is a mathematical term with a precise meaning. Shakespeare's Hamlet also carries a broader meaning: it represents any text, any work, any insight.
  • So let's look at this in more depth. The infinite number of monkeys represents today's seemingly unlimited computational power of public or private clouds, as an elastic delivery method. The keys on a typewriter capture discrete transactions, which derive meaning only when analyzed together; again, we amass the computational power to process data. “Almost surely” translates into a mathematical term, namely the concept of significance. And finally, Shakespeare's Hamlet is what we strive to create, the source of our happiness: our translation of this raw resource into insight.
  • Now this may seem “chimpy”, but this is beautiful. I love this metaphor. But we have a LARGE problem…
  • We have a problem today with our data infrastructure: our ability to glean insights. I think all of you know what I'm referring to. It's the fact that we're operating on less than 15% of the corporate data available to us, even with the ENTERPRISE DATA WAREHOUSE, the EDW, which is supposedly storing a COMPLETE, SINGLE VIEW OF THE TRUTH. We're still giving our business users only a tiny bit of data.
  • The Business User
  • So why is an elastic, unlimited computational resource important? Op-ex vs. cap-ex; cost reduction due to better utilization and productivity; time-to-market.
  • Hedge funds and Wall Street firms are using Cold War-style satellite surveillance to gather market-moving information. The Port of Long Beach is the second-busiest container port in the United States and acts as a major gateway for trade between the US and Asia. With activity estimated at over $100 billion per year, this specific port is a location it pays to keep track of.

Satellite analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity, comparing activity month by month. This analysis is being performed in Amazon's EC2.
  • Now let's talk about processing your enterprise data assets, your Big Data. Again, we can leverage cloud infrastructure to scale to the level of any processing needs you may have.
  • The current image shows a Walmart in Wichita, Kansas. Analysts count cars in Wal-Mart parking lots to measure overall customer traffic and understand growth versus the competition. For example, Wal-Mart's growth was determined to come mostly from areas of high unemployment. This type of analysis is being performed in Amazon's EC2.
  • The current image shows a Target in the Moraine Point Plaza located in Gardiner, North. Analysts comparing satellite parking-lot data with regional unemployment trends found Target's growth tended to come in areas of lower-than-average unemployment.

Again, these processes are being performed in Amazon EC2. This is interesting, but how do we process the data further to derive more relevant insights?
  • The way this is performed is by taking data sources like images and storing them in Hadoop, then using Big Data tools like MapReduce to perform sophisticated analysis on those aggregated data sets. Why is this concept so disruptive? Things like a fraction of the price and no structured data model (i.e. no star schema), yet the ability to run sophisticated queries and algorithms against all your detailed data.
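The map/shuffle/reduce pattern behind that analysis can be shown in miniature. This is a plain-Python sketch of the idea, not the Hadoop API: the record layout, store names, and car counts are all made up for illustration.

```python
# Minimal map/reduce sketch of the parking-lot analysis: total the car
# counts per store across many image-derived records. Illustrative only.
from collections import defaultdict

# Pretend each record was extracted from one satellite image.
records = [
    {"store": "walmart_wichita", "cars": 180},
    {"store": "walmart_wichita", "cars": 210},
    {"store": "target_gardiner", "cars": 95},
    {"store": "target_gardiner", "cars": 120},
]

def map_phase(record):
    # Emit (key, value) pairs, as a Hadoop mapper would.
    yield record["store"], record["cars"]

def reduce_phase(key, values):
    # Aggregate all values for one key, as a Hadoop reducer would.
    return key, sum(values)

# Shuffle: group mapper output by key.
grouped = defaultdict(list)
for rec in records:
    for key, value in map_phase(rec):
        grouped[key].append(value)

totals = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(totals)  # {'walmart_wichita': 390, 'target_gardiner': 215}
```

The disruptive part is that the mapper reads raw, unmodeled records: no star schema is defined up front, yet arbitrary aggregation logic can still run over all the detail.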
  • The previous examples of Walmart and Target involved a regression algorithm executed against the satellite data plus other data sources to produce a quarterly revenue prediction that BEAT all previous models.
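To make the regression idea concrete, here is a minimal one-variable least-squares sketch: fit a line from average cars-in-lot to quarterly revenue, then forecast the next quarter. The actual models combined many more inputs (news text, sentiment, weather, employment), and every number below is invented for illustration.

```python
# Illustrative least-squares regression: cars counted in parking lots
# as a predictor of quarterly revenue. All numbers are made up.
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical history: avg cars counted per lot vs. revenue ($B).
cars = [150, 160, 175, 190]
revenue = [108.0, 110.5, 114.0, 117.5]

a, b = fit_line(cars, revenue)
next_quarter = a + b * 205  # forecast from the newest satellite count
print(round(next_quarter, 1))
```

A production model would be multivariate, but the principle is the same: cheap, externally observable signals become regressors for a number the market cares about.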
  • Which brings us to the discussion around insights.
  • Quote that sets the theme: the definition of the “Infinite Monkey Theorem”.
  • Big Data & Cloud - Infinite Monkey Theorem

    1. Big Data & Cloud: Infinite Monkey Theorem. CloudCon Expo & Conference, October 2012
    2. First: What is Big Data? “data sets so large and complex that it becomes difficult to process using on-hand database management tools.” 10/19/2012 Infochimps Confidential
    3. Data Volume Growing 44x: 2010 = 1.2 zettabytes/yr; 2020 = 35.2 zettabytes/yr. Source: 2011 IDC Digital Universe Study
    4. Enterprise Data Warehouse (diagram): client request/answer → Parsing Engines → BYNET interconnect → AMPs, one set per node
    5. Big Data Warehouse (diagram): search/recommend/rank/score/next-best-action analytic requests → master (Name Node, Job Tracker) → Ethernet interconnect → slaves (Task Tracker, Data Node) holding semi-structured data
    6. (chart) Axes: real-time vs. batch, large vs. small enterprise; elements include the traditional operational application ecosystem, analytic appliances, deployment in public/private cloud, toolset integration, traditional decision support, and hardened batch
    7. Next: Infinite Monkey Theorem (2): an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    8. “unexperienced and unobservable” based on “real experiences and real observations”
    9. Infinite Monkey Theorem (2): an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    10. infinite number of monkeys → unlimited computational power; keys on a typewriter → processing data; almost surely → statistically significant; Shakespeare's Hamlet → insights
    11. #thisischimpy
    12. Problem: “Little Data for Business Users”
    13. “Big Data for Business Users”
    14. Reduce Friction (diagram: $ flowing between Executive and Data)
    15. #thisisreallygood
    16. unlimited computational power: Public / Private / Virtual Private
    17. Analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity
    18. data processing: Public / Private / Virtual Private
    19. Walmart
    20. Target
    21. (diagram) Sources (images; web, mobile, CRM, ERP, SCM; business docs & text; transactions; interactions; web logs; social; sensors; GPS) → SQL, NoSQL, NewSQL, EDW, MPP → Business Intelligence & Analytics: dashboards, reports, visualization
    22. statistically significant: Public / Private / Virtual Private
    23. #lotsofdata + #simplealgorithms
    24. (diagram) Inputs (cars in lot, news text, web pricing, social sentiment, weather sensors, local employment) → Quarterly Revenue Prediction
    25. insights: Public / Private / Virtual Private
    26. (diagram) New-media sources (Gnip Powertrack, Gnip EDC sources, Moreover Metabase, Listening Service) and traditional media (TV, radio, and print transcription) → sentiment → in-motion data delivery APIs → NoSQL → application → data scientist, app developer, business users, IT staff
    27. unlimited computational power; processing data; statistically significant; insights
    28. #1BigDataCloudService
    29. #inspiredbyAvinashKaushik