
Analytics and Data Mining Industry Overview


My keynote talk at the San Diego Superdata conference, looking at the history and current state of Analytics and Data Mining and examining the effects of Big Data.


  1. Analytics Industry Overview: To Big Data and Beyond! Gregory Piatetsky, www.KDnuggets.com/gps.html (c) KDnuggets 2011
  2. My Data Path • PhD in applying Machine Learning to databases • Researcher at GTE Labs – started first project on Knowledge Discovery in Databases in 1989 • Organized first 3 KDD workshops (1989-93), cofounded KDD conferences and ACM SIGKDD • Chief Scientist at analytics startup, 1998-2001 • Chair, SIGKDD, 2005-2009 • Analytics/Data Mining Consultant, 2001-
  3. KDnuggets • Stands for Knowledge Discovery Nuggets • 1993: started KDnuggets News email newsletter (~12,000 email subscribers now) • Early website in 1994, www.KDnuggets.com in 1997 – 2011 best year, 45,000-50,000 unique visitors/month • twitter.com/kdnuggets: ~3,000 followers • facebook.com/kdnuggets page • group: KDnuggets Analytics & Data Mining • Recently featured on CNN
  4. KDnuggets mission: cover the Analytics and Data Mining field • News, Jobs, Software, Data (most popular) • Also Academic positions, CFP, Companies, Consulting, Courses, Meetings, Polls, Publications, Solutions, Webcasts • Subscribe to bi-weekly KDnuggets News at www.kdnuggets.com/subscribe.html
  5. Analyzing Data or … • Statistics • Data mining • Knowledge Discovery in Data • KDD • Analytics • Data Science • …? Core: Finding Useful Patterns in Data
  6. History • Statistics: 1800 – • Data dredging, data “fishing”: 1960s • Data Mining: 1980 – • Database Mining: ~1985 (was HNC trademark, not used) • Knowledge Discovery in Data: 1989 – (KDD workshop in 1989) • Analytics: 2006 – (Google Analytics, “Competing on Analytics” book) • Data Science: 2010 –
  7. Pre-history: Statistics is the biggest term in the 20th century, but data mining and analytics appear in the late 1990s. From Google Ngram Viewer, English-language books. Note: our analysis uses only English-language data. Other languages, especially Chinese, need to be considered for the full picture.
  8. Recent History: Analytics, Data Mining, Knowledge Discovery. Analytics has been used since 1800, but started to rise in 2005. Data Mining jumps around 1996 (soon after the first KDD conference) but declines after 2003 (TIA controversy, association with government invasion of privacy). Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.
  9. Google N-gram results are case sensitive: different capitalizations change the counts, but using lowercase is probably appropriate to measure general popularity.
  10. Earliest use of “data mining”: 1962? This is after eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year). The earliest “data mining” reference I found is shown on the slide as an image. Source: Google Books
  11. Google Trends: After 2006, Data Mining < Analytics
  12. Google Trends: Analytics observations • Google Analytics introduced, Dec 2005 • “Competing on Analytics” book, Apr 2007 • December vacation drop
  13. Half of “Analytics” searches are for “Google Analytics”
  14. Excluding Google Analytics
  15. Google Insights: searches for “data mining” and “analytics -google” are most popular in India and the US
  16. Data Mining >> Predictive Analytics
  17. Business, Predictive, Text Analytics
  18. Analytics > Data Mining > Data Science
  19. Data Science, Big Data
  20. Analytics Today: KDnuggets Polls Findings, www.KDnuggets.com/polls/
  21. Where did you apply analytics/data mining? [Bar chart, axis from 0% to 30%; avg 2.4] Categories: CRM/consumer analytics, Banking, Health care/HR, Fraud Detection, Direct Marketing/Fundraising, Finance, Telecom/Cable, Science, Insurance, Advertising, Education, Web usage mining, Credit Scoring, Retail industries, Medical/Pharma, Manufacturing, e-Commerce, Social Networks, Search/Web content mining, Government/Military, Biotech/Genomics, Investment/Stocks, Entertainment/Music, Security/Anti-terrorism, Travel/Hospitality, Social Policy/Survey analysis, Junk email/Anti-spam, Other. www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html
  22. Data Types Analyzed/Mined: www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html
  23. Data Types with Most Growth in 2011 • location/geo/mobile data • music/audio • time series • Genomics, according to John Mattison
  24. Largest Dataset Analyzed? 2011 median dataset size ~10-20 GB, vs. 8-10 GB in 2010. Increase in the 10 GB to 1 PB range. www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
  25. Largest Dataset Analyzed by Region
  26. Which methods/algorithms did you use for data analysis in 2011? [Bar chart: % of analysts who used each method, axis from 0% to 70%] Methods: Decision Trees, Regression, Clustering, Statistics, Visualization, Time series/Sequence analysis, Support Vector (SVM), Association rules, Ensemble methods, Text Mining, Neural Nets, Boosting, Bayesian, Bagging, Factor Analysis, Anomaly/Deviation detection, Social Network Analysis, Survival Analysis, Genetic algorithms, Uplift modeling. www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  27. Algorithms with highest Industry Affinity: www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  28. “Academic” algorithms have the lowest Industry Affinity: www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  29. Cloud Analytics is not common (yet): www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  30. JOBS AND SKILLS
  31. Shortage of Skills • McKinsey: shortage by 2018 in the US of – 140,000-190,000 people with deep analytical skills – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions. Source: www.mckinsey.com/mgi/publications/big_data/
  32. Job data: Data Scientist
  33. Jobs: Data Mining >> Data Scientist
  34. “Ground” Analytics (LinkedIn Skills): ~75,000 with Data Mining skill; ~7,000 with Predictive Modeling. Also ~20,000 with Predictive Analytics (not related to Predictive Modeling??)
  35. Cloud (Big Data) Analytics Skills
  36. Analytics LinkedIn Skills: Predictive Analytics, Machine Learning, Text Mining, MapReduce
  37. Data Tsunami • In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey) • 90 percent of the world's data has been generated in the past two years (IBM). Image with apologies to KDD-2011
  38. Big Data Aspects? • Volume – Terabytes to Petabytes … • Velocity – online streaming • Variety – numbers, text, links, images, audio, video, …
  39. Volume + Velocity => No consistency • CAP Theorem (Eric Brewer, 2000): for highly scalable distributed systems, you can only have two of the following: 1) consistency, 2) high availability, and 3) (network) partition tolerance (network failure tolerance). http://www.julianbrowne.com/article/viewer/brewers-cap-theorem Implication: Big data solutions must stop worrying about consistency if they want high availability
  40. Big Data • 2nd Industrial Revolution • Do old activities better • Create new activities/businesses
  41. Application areas • Doing old things better – Churn prediction – Direct marketing/Customer modeling – Recommendations – Fraud detection – Security/Intelligence – … • Competition will level companies
  42. Limit to Predicting Customer Behavior? • There is fundamental randomness in human behavior, and once we find first-level effects, more data or better algorithms will give diminishing returns in most cases • Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms
  43. Direct Marketing: Random and Model-sorted Lists. [Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), with curves for Random and Model] 5% of a random list has 5% of hits; 5% of a model-score-ranked list has 21% of hits. Lift(5%) = 21%/5% = 4.2
  44. Most lift curves are surprisingly similar. Study of lift curves in banking, telecom • Best lift curves are similar • Special point: T = Target percentage • Lift(T) ~ sqrt(1/T) [Chart: Actual lift(T) and Est. lift(T); Lift (0 to 14) vs. 100*T% (0 to 25)] G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
  45. Big Data Enables New Things! – Google: first big success of big data – Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data – Location analytics – Health care • Personalized medicine – Semantics and AI? • Imagine IBM Watson, Siri in 2020?
  46. Big Data Growth By Industry. Source: http://www.mckinsey.com/mgi/publications/big_data/
  47. Research and Industry Disconnect? • Uplift modeling needs more research • Association rules need fewer papers • Data Mining with Privacy research: industry use? • KDD conference aims to bring researchers and industry people together
  48. Hot Growth Areas • Social Analytics – Klout – many Twitter micro-analytics (twitalyzer, TweetEffect, TweetStats) • Mobile Analytics – Privacy and data tracks (KDD Lab, Pisa)
  49. Big Data Bubble? Big Data Gartner Hype Cycle
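
The lift arithmetic on slides 43 and 44 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `lift_at` and `estimated_lift` and the 1,000-record example list are made up for the sketch, and reading Lift(T) ~ sqrt(1/T) as the lift at list depth T, where T is the target fraction, is one interpretation of the slide.

```python
import math


def lift_at(hits_sorted, fraction):
    """Lift at the top `fraction` of a list of 1 (hit) / 0 (miss) flags.

    `hits_sorted` is assumed to be ordered by descending model score.
    """
    n = len(hits_sorted)
    k = max(1, round(n * fraction))          # records in the selected top slice
    total_hits = sum(hits_sorted)
    cph = sum(hits_sorted[:k]) / total_hits  # cumulative fraction of hits captured
    return cph / (k / n)                     # relative to a randomly ordered list


def estimated_lift(target_fraction):
    """Rule of thumb from slide 44: Lift(T) ~ sqrt(1/T)."""
    return math.sqrt(1.0 / target_fraction)


# Hypothetical list of 1,000 records with 100 hits; the top 5% (50 records)
# holds 21 of them, mirroring the numbers on slide 43.
example = [1] * 21 + [0] * 29 + [1] * 79 + [0] * 871

print(round(lift_at(example, 0.05), 2))   # measured lift at 5% of the list
print(round(estimated_lift(0.05), 2))     # rule-of-thumb lift for a 5% target rate
```

With these made-up counts the measured lift at 5% reproduces the 21%/5% = 4.2 from slide 43, while the rule of thumb for a 5% target rate gives sqrt(20), roughly 4.47. The two numbers answer slightly different questions, so they are shown side by side only to make the formulas concrete.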

Editor's Notes

  • Boris Evelson (Forrester) also adds a 4th V: Variability (meaning not constant)
