Analytics and Data Mining Industry Overview

My keynote talk at San Diego Superdata conference, looking at history and current state of Analytics and Data Mining, and examining the effects of Big Data

Published in: Technology, Education
  • Boris Evelson (Forrester) also adds a 4th V: Variability (meaning not constant)
  • Transcript of "Analytics and Data Mining Industry Overview"

    1. Analytics Industry Overview: To Big Data and Beyond!
        Gregory Piatetsky, www.KDnuggets.com/gps.html
        (c) KDnuggets 2011
    2. My Data Path
        • PhD in applying Machine Learning to databases
        • Researcher at GTE Labs – started first project on Knowledge Discovery in Databases in 1989
        • Organized first 3 KDD workshops (1989-93), cofounded KDD conferences and ACM SIGKDD
        • Chief Scientist at analytics startup 1998-2001
        • Chair, SIGKDD, 2005-2009
        • Analytics/Data Mining Consultant, 2001-
    3. KDnuggets
        • Stands for Knowledge Discovery Nuggets
        • 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now)
        • early website in 1994, www.KDnuggets.com in 1997
            – 2011 best year, 45-50,000 unique visitors/month
        • twitter.com/kdnuggets ~3,000 followers
        • facebook.com/kdnuggets page
        • group: KDnuggets Analytics & Data Mining
        • Recently featured on CNN
    4. KDnuggets mission
        Cover Analytics and Data Mining field:
        • News, Jobs, Software, Data (most popular)
        • Also Academic positions, CFP, Companies, Consulting, Courses, Meetings, Polls, Publications, Solutions, Webcasts
        • Subscribe to bi-weekly KDnuggets News at www.kdnuggets.com/subscribe.html
    5. Analyzing Data or …
        • Statistics
        • Data mining
        • Knowledge Discovery in Data
        • KDD
        • Analytics
        • Data Science
        • …?
        Core: Finding Useful Patterns in Data
    6. History
        • Statistics: 1800 –
        • Data dredging, data “fishing”: 1960s
        • Data Mining: 1980 –
        • Database Mining ~ 1985 (was HNC trademark, not used)
        • Knowledge Discovery in Data: 1989 –
            – KDD workshop in 1989
        • Analytics: 2006 –
            – Google Analytics, “Competing on Analytics” book
        • Data Science: 2010 –
    7. Pre-history
        Statistics is the biggest term in the 20th century, but data mining and analytics appear in the late 1990s.
        From Google Ngram viewer – English language books.
        Note: Our analysis uses only English language data. Other languages, especially Chinese, need to be considered for the full picture.
    8. Recent History: Analytics, Data Mining, Knowledge Discovery
        Analytics has been used since 1800, but started to rise in 2005.
        Data Mining jumps around 1996 (soon after first KDD conference) but declines after 2003 (TIA controversy, associated with gov. invasion of privacy).
        Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.
    9. Google N-gram results are case sensitive
        Different capitalizations change the counts, but using lowercase is probably appropriate to measure general popularity.
    10. Earliest use of “data mining”: 1962?
        After eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year), the earliest “data mining” reference I found is [image on slide].
        Source: Google Books
    11. Google Trends: After 2006, Data Mining < Analytics
    12. Google Trends: Analytics observations
        Chart annotations: Google Analytics introduced, Dec 2005; Competing on Analytics book, Apr 2007; December vacation drop
    13. Half of “Analytics” searches are for “Google Analytics”
    14. Excluding Google Analytics
    15. Google Insights: searches for data mining, analytics -google are most popular in India, US
    16. Data Mining >> Predictive Analytics
    17. Business, Predictive, Text Analytics
    18. Analytics > Data Mining > Data Science
    19. Data Science, Big Data
    20. Analytics Today: KDnuggets Polls Findings
        www.KDnuggets.com/polls/
    21. Where did you apply analytics/data mining?
        [Bar chart, horizontal axis 0.0% to 30.0%; annotation: avg 2.4 industries]
        Categories, in chart order: CRM/ consumer analytics; Banking; Health care/ HR; Fraud Detection; Direct Marketing/ Fundraising; Finance; Telecom / Cable; Science; Insurance; Advertising; Education; Web usage mining; Credit Scoring; Retail; Medical/ Pharma; Manufacturing; e-Commerce; Social Networks; Search / Web content mining; Government/Military; Biotech/Genomics; Investment / Stocks; Entertainment/ Music; Security / Anti-terrorism; Travel / Hospitality; Social Policy/Survey analysis; Junk email / Anti-spam; Other
        www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html
    22. Data Types Analyzed/Mined
        www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html
    23. Data Types w. Most Growth in 2011
        • location/geo/mobile data
        • music / audio
        • time series
        • Genomics, according to John Mattison
    24. Largest Dataset Analyzed?
        2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010. Increase in 10 GB to 1 PB range.
        www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
    25. Largest Dataset Analyzed by Region
    26. Which methods/algorithms did you use for data analysis in 2011
        [Bar chart: % analysts who used it, horizontal axis 0% to 70%]
        Methods, in chart order: Decision Trees; Regression; Clustering; Statistics; Visualization; Time series/Sequence analysis; Support Vector (SVM); Association rules; Ensemble methods; Text Mining; Neural Nets; Boosting; Bayesian; Bagging; Factor Analysis; Anomaly/Deviation detection; Social Network Analysis; Survival Analysis; Genetic algorithms; Uplift modeling
        www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
    27. Algorithms with highest Industry Affinity
        www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
    28. “Academic” algorithms: lowest Industry affinity
        www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
    29. Cloud Analytics is not common (yet)
        www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
    30. JOBS AND SKILLS
    31. Shortage of Skills
        • McKinsey: shortage by 2018 in the US of
            – 140-190,000 people with deep analytical skills
            – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.
        Source: www.mckinsey.com/mgi/publications/big_data/
    32. Job data: Data Scientist
    33. Jobs: Data Mining >> Data Scientist
    34. “Ground” Analytics (LinkedIn Skills)
        ~75,000 with Data Mining skill
        ~7,000 with Predictive Modeling
        Also ~20,000 with Predictive Analytics (not related with Predictive Modeling??)
    35. Cloud (Big Data) Analytics Skills
    36. Analytics LinkedIn Skills
        Chart panels: Predictive Analytics; Machine Learning; Text Mining; MapReduce
    37. Data Tsunami
        • In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey)
        • 90 percent of the world's data has been generated in the past two years (IBM)
        Image with apologies to KDD-2011
    38. Big Data Aspects?
        • Volume
            – Terabytes to Petabytes …
        • Velocity
            – online streaming
        • Variety
            – numbers, text, links, images, audio, video, …
    39. Volume + Velocity => No consistency
        • CAP Theorem (Eric Brewer, 2000)
          For highly scalable distributed systems, you can only have two of the following:
            – 1) consistency,
            – 2) high availability, and
            – 3) (network) partition tolerance (network failure tolerance)
          http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
        Implication: Big data solutions must stop worrying about consistency if they want high availability
    40. Big Data
        • 2nd Industrial Revolution
        • Do old activities better
        • Create new activities/businesses
    41. Application areas
        • Doing old things better
            – Churn prediction
            – Direct marketing/Customer modeling
            – Recommendations
            – Fraud detection
            – Security/Intelligence
            – …
        • Competition will level companies
    42. Limit to Predicting Customer Behavior?
        • There is fundamental randomness in human behavior, and once we find 1-level effects, more data or better algorithms will give diminishing returns in most cases
        • Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms
    43. Direct Marketing: Random and Model-sorted Lists
        [Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), with Random and Model curves]
        5% of random list have 5% of hits.
        5% of model-score ranked list have 21% of hits.
        Lift(5%) = 21%/5% = 4.2
    44. Most lift curves are surprisingly similar
        Study of lift curves in banking, telecom
        Best lift curves are similar
        Special point T = Target percentage
        Lift(T) ~ sqrt(1/T)
        [Chart: Actual lift(T) and Est. lift(T); Lift (0 to 14) vs. 100*T% (0 to 25)]
        G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
    45. Big Data Enables New Things!
        – Google – first big success of big data
        – Social networks (facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
        – Location analytics
        – Health-care
            • Personalized medicine
        – Semantics and AI?
            • Imagine IBM Watson, Siri in 2020?
    46. Big Data Growth By Industry
        Source: http://www.mckinsey.com/mgi/publications/big_data/
    47. Research and Industry Disconnect?
        • Uplift modeling – needs more research
        • Association rules need fewer papers
        • Data Mining with Privacy research – industry use?
        • KDD conference aims to bring researchers and industry people together
    48. Hot Growth Areas
        • Social Analytics
            – Klout
            – many twitter micro-analytics (twitalyzer, TweetEffect, TweetStats)
        • Mobile Analytics
            – Privacy and data tracks (KDD Lab, Pisa)
    49. Big Data Bubble?
        [Figure: Gartner Hype Cycle, labeled “Big Data”]
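
The lift arithmetic on slides 43 and 44 can be sketched in a few lines of Python. This is an illustration rather than code from the talk: the function names `lift` and `estimated_lift` are assumptions, and the 5% target rate passed to `estimated_lift` is a hypothetical input chosen only to show the rule of thumb Lift(T) ~ sqrt(1/T).

```python
import math


def lift(pct_hits_captured, pct_list):
    """Lift = share of all hits found in the top slice / size of that slice.

    Both arguments are percentages, e.g. 21.0 and 5.0 for slide 43.
    """
    return pct_hits_captured / pct_list


def estimated_lift(target_fraction):
    """Rule of thumb from slide 44: Lift(T) ~ sqrt(1/T).

    target_fraction is T expressed as a fraction (0 < T <= 1).
    """
    return math.sqrt(1.0 / target_fraction)


# Slide 43: the top 5% of the model-ranked list holds 21% of the hits.
print(round(lift(21.0, 5.0), 2))

# A random list gives lift 1: the top 5% holds 5% of the hits.
print(round(lift(5.0, 5.0), 2))

# Hypothetical target rate T = 5% (not a figure from the slides).
print(round(estimated_lift(0.05), 2))
```

The first call reproduces the slide's Lift(5%) = 21%/5% = 4.2. The last call only shows what the square-root estimate would give for an assumed target rate.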