Infochimps Cloudcon 2012


A Keynote on the "Infinite Monkey Theorem"

Published in: Technology, Business
Speaker notes
  • Avinash Kaushik gave a talk at Strata 2012 in Santa Clara in March. If you listen to all the hype around Big Data, it solves the first problem. If you listen to all the vendors, there is a lot of emphasis on the first part (perhaps Infochimps included) and very little on the second. I think that's because we don't yet know how to truly empower the organization to interact directly with any and all available data: it's too expensive, too risky, too complex.
  • 40%+ YoY growth, with 2012 generating 2.4 zettabytes alone.
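As a sanity check, the growth rate implied by the IDC figures quoted on slide 3 of the deck (1.2 ZB in 2010, 35.2 ZB projected for 2020) can be computed directly; the inputs below come from the slides, nothing else is assumed:

```python
# Implied compound annual growth rate from the IDC figures cited on slide 3.
vol_2010 = 1.2   # zettabytes in 2010, per the deck
vol_2020 = 35.2  # zettabytes projected for 2020, per the deck

years = 2020 - 2010
cagr = (vol_2020 / vol_2010) ** (1 / years) - 1  # ~0.40, i.e. the "40%+ YoY"

# Projecting two years forward from 2010 lands close to the 2.4 ZB cited for 2012.
vol_2012 = vol_2010 * (1 + cagr) ** 2

print(f"implied YoY growth: {cagr:.1%}")            # 40.2%
print(f"estimated 2012 volume: {vol_2012:.1f} ZB")  # 2.4 ZB
```

So the "40%+ YoY" and "2.4 zettabytes in 2012" figures are mutually consistent with the 2010 and 2020 endpoints.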
  • AMP: Access Module Processor. PE: Parsing Engine. BYNET: Banyan cross-bar switch YNET (Y Network).
Store: The Parsing Engine dispatches a request to insert a row. The BYNET ensures that the row gets to the appropriate AMP via the hashing algorithm. The AMP stores the row on its associated disk; each AMP can have multiple physical disks associated with it.
Retrieve: The Parsing Engine dispatches a request to retrieve one or more rows. The BYNET ensures that the appropriate AMP(s) are activated. The AMPs locate and retrieve the desired rows in parallel and will sort, aggregate, or format them if needed. The BYNET returns the retrieved rows to the Parsing Engine, which returns them to the requesting client application.
Teradata's shared-nothing architecture allows for highly scalable data volumes.
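The store/retrieve flow above can be sketched in miniature. This is an illustrative toy model of hash-based shared-nothing partitioning, not Teradata's actual hashing algorithm; the class and method names are invented for the example:

```python
# Toy model of shared-nothing row distribution: a hash of the key picks the
# "AMP" (partition) that owns each row, so point stores and lookups touch one
# partition while full scans fan out across all of them.
import hashlib

class SharedNothingStore:
    def __init__(self, num_amps=4):
        # Each AMP owns its own private slice of the data (shared nothing).
        self.amps = [dict() for _ in range(num_amps)]

    def _amp_for(self, key):
        # Stand-in for the hashing algorithm: stable hash -> AMP index.
        digest = hashlib.sha256(str(key).encode()).hexdigest()
        return int(digest, 16) % len(self.amps)

    def store(self, key, row):
        # "The BYNET ensures that the row gets to the appropriate AMP."
        self.amps[self._amp_for(key)][key] = row

    def retrieve(self, key):
        # A point lookup activates exactly one AMP.
        return self.amps[self._amp_for(key)].get(key)

    def scan(self):
        # A full scan: every AMP works through its own slice independently.
        for amp in self.amps:
            yield from amp.values()

store = SharedNothingStore(num_amps=4)
for i in range(100):
    store.store(i, {"id": i, "value": i * 10})

print(store.retrieve(42))            # {'id': 42, 'value': 420}
print(sum(1 for _ in store.scan()))  # 100 rows, spread across the 4 AMPs
```

Because each partition owns a disjoint slice, adding AMPs adds both storage and scan throughput, which is the scalability property the note describes.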
  • 3-node Hadoop system: $8K/node, $10K switch, $4K/node Hadoop distro. $24K + $10K x 25% x 3 maintenance = $43K; $4K x 3 x 3 = $36K; Total =
There are three essential elements of an analytic platform:
1. Strong support for analytic database query, in a variety of query styles: at a minimum, SQL, MDX, or graph.
2. Strong support for analytic processes other than queries, typically in the areas of mathematics (statistics, predictive analytics, data mining, linear algebra, optimization, graph theory, etc.) and/or data transformation (e.g. sessionization, entity extraction).
3. Strong integration between the first two.
The point is that an analytic platform is something on which you can build a range of powerful analytic applications. Some specifics of what to look for in an analytic platform may be found in the link above.

Enterprise data warehouse (full or partial)
Kinds of data likely to be included: all, but especially operational
Likely use styles: all
Canonical example: central EDW for a big enterprise
Stresses: concurrency, reliability, workload management
Classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server.

Traditional data mart
Kinds of data likely to be included: all
Likely use styles: business intelligence, budgeting/consolidation, investigative
Examples: reporting servers, planning/consolidation servers, anything MOLAP, etc.
Stresses: performance, concurrency, TCO
A columnar DBMS might have more attractive performance and TCO (total cost of ownership); the same goes for Netezza. Some of them, e.g. Sybase IQ and Vertica, have excellent track records in concurrent usage as well.

Investigative data mart (agile)
Kinds of data likely to be included: all, especially customer-centric
Likely use styles: investigative
Canonical example: a few analysts getting a few TB to examine
Stresses: ease of setup/load, ease of admin, price/performance
Infobright is often cost-effective among columnar analytic DBMSs.

Investigative data mart (big)
Kinds of data likely to be included: all, especially customer-centric, logs, financial trade, scientific
Likely use styles: investigative
Canonical example: single-subject 20 TB to 20 PB relational database
Stresses: performance, scale-out, analytic functionality
Performance and scalability are major challenges, usually best addressed by MPP (massively parallel processing) systems such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum.

Bit bucket (Hadoop)
Kinds of data likely to be included: logs, other technical/external
Likely use styles: staging/ETL, investigative
Canonical example: log files in a Hadoop cluster
Stresses: TCO, scale-out, transform/big-query performance, ETL functionality

Archival data store
Kinds of data likely to be included: operational, CDR (call detail record), security log
Likely use styles: archival, reporting (for compliance), possibly also investigative
Examples: any long-term detailed historical store
Stresses: TCO, compression, scale-out, performance (if multi-use)
Perhaps only Rainstor truly embraces the archival positioning.

Outsourced data mart
Kinds of data likely to be included: all
Likely use styles: traditional BI, investigative analytics, staging/ETL
Examples: advertising tracking, SaaS CRM
Stresses: performance, TCO, reliability, concurrency
Oracle shops: Vertica gets the nod in a number of these cases.

Operational analytic(s) server
Kinds of data likely to be included: customer-centric, log, financial trade
Likely use styles: advanced operational analytics
Examples: lower latency: web or call-center personalization, anti-fraud; higher latency: customer profiling, Basel 3 risk analysis
Stresses: performance, reliability, analytic functionality, perhaps concurrency
  • Being the CEO of Infochimps, I felt compelled to share a little "chimpy" research with you. The "Infinite Monkey Theorem" is a metaphor that directly relates to Big Data, and one I think you'll appreciate. So what is the "Infinite Monkey Theorem"? The following definition is a variant of the original theorem; let me read it to you. This theorem has been traced back to Aristotle's "On Generation and Corruption", where he makes deductions about the unexperienced and unobservable based on real experiences and real observations.
  • This theorem has been traced back to Aristotle's "On Generation and Corruption", where he makes deductions about the unexperienced and unobservable based on real experiences and real observations. Think about this a little: we're talking about analyzing real-world experiences and observations to predict what will happen, what will happen with our business in the future, the unexperienced and unobserved. This is fundamentally what Big Data proposes to help with.
  • So, as a metaphor: the "monkey" is not an actual monkey but stands for an abstract device that produces a sequence of letters and symbols. "Almost surely" is a mathematical term with a precise meaning. And Shakespeare's Hamlet also carries a broader meaning: it represents any text, any work, any insight.
  • So let's look at this in more depth. The infinite number of monkeys represents today's seemingly unlimited computational power of either public or private clouds, as an elastic delivery method. The keys on a typewriter capture discrete transactions, which only when analyzed together can yield meaning; again, we amass the computational power to process data. "Almost surely" translates into a mathematical term, namely the concept of statistical significance. And finally, Shakespeare's Hamlet is what we strive to create: it is the source of our happiness, our translation of this raw resource into insight.
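The "almost surely" in the theorem has a concrete form: if a single random attempt succeeds with probability p > 0, the chance that n independent attempts all miss is (1 - p)^n, which goes to zero as n grows. A short illustration; the target word and alphabet are arbitrary choices for the demo:

```python
# "Almost surely": the probability that at least one of n independent random
# attempts types the target text approaches 1 as n grows.
import random
import string

random.seed(0)  # make the simulation reproducible

target = "ape"                     # arbitrary short target for the demo
alphabet = string.ascii_lowercase  # a 26-key "typewriter"

# Probability that a single random 3-letter attempt matches the target.
p = (1 / len(alphabet)) ** len(target)  # = 1/17576

def p_at_least_one_hit(n):
    """P(some attempt matches the target) after n independent attempts."""
    return 1 - (1 - p) ** n

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} attempts: P(hit) = {p_at_least_one_hit(n):.4f}")

# Simulate: with enough attempts, a hit is (almost surely) found.
attempts = ("".join(random.choices(alphabet, k=len(target))) for _ in range(100_000))
print(target in attempts)
```

The same logic underlies the talk's mapping: given enough compute (attempts) over enough data, a real signal becomes statistically near-certain to surface.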
  • Now this may seem "chimpy", but this is beautiful. I love this metaphor. But we have a LARGE problem.
  • We have a problem today with our data infrastructure and our ability to glean insights. I think all of you know what I'm referring to: it's the fact that we're operating on less than 15% of the corporate data available to us, even with the enterprise data warehouse, the EDW, which is supposedly storing a complete, single view of the truth. We're still giving our business users only a tiny bit of data.
  • The Business User
  • The Business User
  • The Business User
  • So why is an elastic, unlimited computational resource important? Op-Ex vs. Cap-Ex; cost reduction due to better utilization and productivity; faster time-to-market.
  • Hedge funds and Wall Street firms are using Cold War-style satellite surveillance to gather market-moving information. The Port of Long Beach is the second-busiest container port in the United States and acts as a major gateway for trade between the US and Asia. With the activity from this port estimated at over $100 billion per year, this specific port is a location it will pay to keep track of.

Satellite analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity, comparing activity month by month. This analysis is being performed in Amazon's EC2.
  • Now let's talk about processing your enterprise data assets, your Big Data. Again, we can leverage the cloud infrastructure to scale to the level of any processing needs you may have.
  • The current image shows a Walmart in Wichita, Kansas. Analysts count cars in Walmart parking lots to measure overall customer traffic and to understand its growth versus the competition. For example, Walmart's growth was determined to come mostly from areas of high unemployment. This type of analysis is being performed in Amazon's EC2.
  • The current image shows a Target in the Moraine Point Plaza located in Gardiner, North. Analysts comparing satellite parking-lot data with regional unemployment trends found Target's growth tended to come in areas of lower-than-average unemployment.

Again, these processes are being performed in Amazon EC2. This is interesting, but how do we process the data further to help derive more relevant insights?
  • The way this is performed is by taking data sources like images and storing them in Hadoop, then using Big Data tools like MapReduce to perform sophisticated analysis on those aggregated data sets. Why is this concept so disruptive? Things like a fraction of the price and no structured data model (no star schema), yet the ability to run sophisticated queries and algorithms against all your detailed data.
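A minimal sketch of the map/reduce pattern described above, assuming upstream image analysis has already reduced each satellite photo to a (store, car count) record; the records, store names, and function names are invented for illustration:

```python
# Minimal in-process MapReduce: group per-image car counts by store and
# aggregate them, the same shape a Hadoop job would use at cluster scale.
from collections import defaultdict

# Pretend upstream image analysis emitted (store, cars_counted) per photo.
records = [
    ("walmart_wichita", 310), ("walmart_wichita", 295),
    ("target_gardiner", 180), ("target_gardiner", 210),
    ("walmart_wichita", 330),
]

def mapper(record):
    store, cars = record
    yield store, cars  # key: store, value: cars seen in one image

def reducer(store, counts):
    return store, sum(counts) / len(counts)  # average traffic per image

# Shuffle phase: group mapper output by key, as the framework would.
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

results = dict(reducer(key, values) for key, values in groups.items())
print(results)  # {'walmart_wichita': 311.66..., 'target_gardiner': 195.0}
```

Note what is absent: no schema was declared up front. The mapper imposes structure on the raw records at read time, which is the "no star schema, yet sophisticated queries" point.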
  • The Business User
  • The previous examples of Walmart and Target involved a regression algorithm which was executed against the satellite data plus other data to produce a quarterly revenue prediction that BEAT all previous models.
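A toy version of that idea: an ordinary least-squares fit of quarterly revenue against a parking-lot traffic signal. The training pairs below are fabricated for illustration and are not the analysts' actual data or model:

```python
# Simple least-squares regression: predict quarterly revenue from an
# observable proxy (average cars counted per parking lot). Made-up data.

# (avg cars per lot, quarterly revenue in $B) -- illustrative training pairs
history = [(300, 98.0), (320, 104.0), (310, 101.0), (340, 110.0), (330, 107.0)]

n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n

# slope = cov(x, y) / var(x); intercept follows from the means
slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
         / sum((x - mean_x) ** 2 for x, _ in history))
intercept = mean_y - slope * mean_x

def predict_revenue(avg_cars):
    """Revenue estimate for a quarter with the given lot traffic."""
    return intercept + slope * avg_cars

print(f"revenue(350 cars) ~= ${predict_revenue(350):.1f}B")  # $113.0B
```

A real model would regress on many such signals at once (the deck's slide 24 lists cars in lot, news text, web pricing, social sentiment, weather, and local employment), but the mechanics are the same.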
  • Which brings us to the discussion around insights.
  • Quote that sets the theme: the definition of the "Infinite Monkey Theorem".
  • The Business User
Transcript

    1. Big Data & Cloud: Infinite Monkey Theorem. CloudCon Expo & Conference, October 2012
    2. First. What is Big Data? "data sets so large and complex that it becomes difficult to process using on-hand database management tools." (8/17/2013, Infochimps Confidential)
    3. Source: 2011 IDC Digital Universe Study. 2010 = 1.2 zettabytes/yr; 2020 = 35.2 zettabytes/yr. Data volume growing 44x.
    4. [Diagram: enterprise data warehouse; AMP nodes connected through the BYNET interconnect to parsing engines; request in, answer out]
    5. [Diagram: Big Data warehouse; a master (Name Node, Job Tracker) and slaves (Task Tracker, Data Node) over an Ethernet interconnect; semi-structured data in, answers to search, recommend, rank, score, and next-best-action requests out]
    6. [Diagram: traditional operational, traditional decision support, and analytic appliances plotted on real-time vs. batch and large-enterprise vs. small-enterprise axes; application ecosystem, deployment in public/private cloud, toolset integration, hardened]
    7. Next. Infinite Monkey Theorem (2): an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    8. "unexperienced and unobservable" based on "real experiences and real observations"
    9. Infinite Monkey Theorem (2): an infinite number of monkeys hitting keys on a typewriter for a period of time will almost surely type a given text, such as Shakespeare's Hamlet.
    10. infinite number of monkeys -> unlimited computational power; keys on a typewriter -> processing data; almost surely -> statistically significant; Shakespeare's Hamlet -> insights
    11. #thisischimpy
    12. The "Little Data for Business Users" Problem
    13. "Big Data for Business Users"
    14. [Diagram: data, dollars, and the executive; reduce friction]
    15. #thisisreallygood
    16. unlimited computational power: public, private, virtual private
    17. Analysts use these images to count shipping containers coming off ships in California and are able to get a sense of overall US import activity.
    18. data processing: public, private, virtual private
    19. Walmart
    20. Target
    21. [Diagram: sources (images; docs, text; web logs; social; sensors; GPS) feed business transactions and interactions (web, mobile, CRM, ERP, SCM, ...) and business intelligence and analytics (dashboards, reports, visualization, ...) through SQL, NoSQL, NewSQL, EDW, and MPP stores]
    22. statistically significant: public, private, virtual private
    23. #lotsofdata + #simplealgorithms
    24. Cars in lot, news text, web pricing, social sentiment, weather sensors, local employment -> quarterly revenue prediction
    25. insights: public, private, virtual private
    26. [Diagram: new-media sources (Gnip Powertrack, Gnip EDC) and traditional-media sources (Moreover Metabase; TV, radio, and print transcription) feed an in-motion data delivery service, NoSQL store, sentiment APIs, and a listening application, serving business users, app developers, data scientists, and IT staff]
    27. unlimited computational power -> processing data -> statistically significant -> insights
    28. #1BigDataCloudService
    29. #inspiredbyAvinashKaushik