Successfully reported this slideshow.

Analytics Drives Big Data Drives Infrastructure

852 views

Published on

A personal perspective of how analytics have evolved from the 80s to current and how it has driven demands on the computing and storage infrastructure. Examples are given from using machine learning ("AI") techniques using neural networks and genetic algorithms in 80s and 90s to Aumnidata's social media analytics in 2008-10 and real-time intent detection by Cruxly from 2011 onwards.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Analytics Drives Big Data Drives Infrastructure

  1. 1. Analytics Drives Big Data Drives Infrastructure Confessions of Storage turned Analytics Geeks Dr. Aloke Guha 29th IEEE Conference on Massive Data Storage May 8th, 2013 aloke@cruxly.com
  2. 2. 2 What’s Common Between a Sensor that could Distinguish a fine Cognac, and Predicting Movies You’d Like on Netflix? Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  3. 3. The Sommelier “Robot” Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 3
  4. 4. Predicting What Movies You’d Watch Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 4
  5. 5. 5 (Analytics, BigData, DataStore)+ Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  6. 6. 6 Many Analytics Techniques . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 Statistics Regression Linear Time-Series Decision Trees R AI (McCarthy) 1956 Expert Systems Machine Learning Neural Networks SVM LDA Naïve Bayes K-nearest neighbor Random Forests . . . Genetic Algorithms Random Forests SNARC (Minsky) 1951 Dendral (Feigenbaum) 1965 Fraser and Burnell (1970) . . . Vapnik (1992) Ihaka and Gentleman (1993)
  7. 7. 7 Common Analytics Processing pre-2000 • Sources: Local • Data: Numeric, Homogeneous • Processing: Local • Consumer: Local • Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  8. 8. Flavor Predictor – Neural Networks USPTO #5,373,452 (1994) 1988 Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 8
  9. 9. Pattern Recognition – Genetic Algorithms US PTO #5,140,530, 1992 Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 9
  10. 10. 10 Small to Big http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  11. 11. 11 Typical Analytics: 2000-2006 • Sources: Global , Social Networks • Data: Heterogeneous, Numeric, Text • Processing: Hosted/Scale • Consumer: Global • Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc. Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  12. 12. 2007- : Internet Data Analytics Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 12
  13. 13. Financial Risk Scoring: Detect Risk Scoring: detect incremental change in # occurrences where corporate officers mention “risk” (or equivalent terms) during earnings call Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 13
  14. 14. Financial Risk Scoring: Listen *Risk Scoring: detect incremental change in occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 14
  15. 15. Banking: Credit Worthiness – remember 2008? Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 15
  16. 16. Share of Voice: Online Buzz Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 16
  17. 17. Sentiment Analysis Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 17
  18. 18. 18 Analytics Processing: 2007- • Sources: Global, Mobile, New Social (Instagram, . . ) • Data: Multi-Dimensional, Heterogeneous, Audio/Video • Processing: Hosted/Scale • Consumer: Global • Analytics: Batch, Streaming, . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  19. 19. 2008 - : Real-Time/Streaming Analytics Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 19
  20. 20. Brand Marketing Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 20
  21. 21. Brand Management 21
  22. 22. Customer Support Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 22
  23. 23. Customer Support 23
  24. 24. 24 Lead Generation Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  25. 25. . . . More Data, Faster http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 25
  26. 26. “Internet of Things” http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for- m2m-technology-to-drive-connected-smarter-cities/ Message Queuing Telemetry Transport Machine-to-Machine Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 26
  27. 27. 27 AumniData: Batch Processing Data Collector (Batch Scheduled) Twitter Blog/Web Site Data Collector (Batch Scheduled) RSS/ATOM Feed Requestor/ URL Scanner NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP Stack+ AumniData Classifier + Analytics* (RackSpace VM) Dashboard Application (.3rd party App) Blog/Web Site Blog/Web SiteYouTube Dashboard Configuration (TomCat) Custom Analytics Display Ad-Hoc Query Summary Data Collector (Batch Scheduled) Content Store Content / Metadata Index (MySQL) Dashboard Store (SQL Server) Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  28. 28. 28 Cruxly: Stream Processing Streaming API Client (Heroku Worker) (24x7) Streaming API Client (Heroku Worker) (24x7) NLP+ Cruxly Intent Detection (AWS) Streaming API Client (Heroku Worker) (24x7) Tweets (Keywords) Request (Keywords) Tweets (Keywords) Tweet ID + Intent Signal (Heroku PostgresSQL) Tweets Content Store (DynamoDB) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP (NER, etc + Cruxly Intent Detection (AWS) Reports / Dashboard Tracker Editor (web app - Heroku) Twitter Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  29. 29. 29 Data Analytics Demands . . . Store Process Analyze View Store Process Analyze View Storm Data Collector Text / Sensor Data/ Stream . . . NLP Classify Index Query/ RT Query Ad Hoc/ Search/ SQL Custom Analytics Dashboards Chart Report Machine Learning Library Stats Library R Yarn
  30. 30. Storage Implications: Back to the Future MB/s – Batch IOPs – Stream Both? Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 30
  31. 31. Storage Implications: Back to the Future II, III Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 Task tracker Task tracker Task tracker Job Tracker Zookeeper Hive Pig Oozie HUE HDFS clientData Node Data Node Data Node Name Node MapReduceHDFS Master Slave #1 Slave #N Mgmt Node Storage Capacity Scaling? 31 Storage Tiering? Import/Export Data?
  32. 32. A More General Data Analytics Framework? Data Ingesters (Basic) Data Ingesters (Smart) Content StoreMetadata / In-Mem Store Processing Stream and Batch Data Ingesters Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 AnalyticsProcessing SensorProcessing:DataIntegration VisualizationLibrary/InteractiveQuery LocalStorage/Flash/DAS MapReduce/DistributedDataStore 32
  33. 33. 33 Conclusion • Data Analytics ⇒ Big Data ⇒ Scale-Out • Variety ⇒ Infrastructure • Volume ⇒ Bandwidth Support • Velocity ⇒ Streaming Support • We Solved the Processing Problem • We Need to Solve the Larger Storage Problem Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  34. 34. 34 Grateful Acknowledgements • Kapil Tundwal • Dr. Kirill Kireyev • Dr. Andrew Lampert • Venky Madireddy • Dr. Shumin Wu • Joan Wrabetz Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

×