Strata: 9 laws of Data Mining


My 9 Laws of Data Mining presentation from Strata Santa Clara 2013-02-26

Published in: Education

  1. Advanced Analytics: THE NINE LAWS OF DATA MINING. Duncan Ross (@duncan3ross). Based on the 9 Laws of Data Mining by Tom Khabaza.
  2. What you won't get from this presentation
     • The last two algorithms you need to know!
     • An explanation of Bayes' theorem
     • The name of the software that will make you $ millions
       > Not even a comparison of different software!
     Image: The grave of Thomas Bayes (probably), near "silicon roundabout" (via Wikimedia)
     2/28/2013 @duncan3ross
  3. THE 0TH LAW: Data Mining laws also work as Data Science laws
  4. What is data mining?
     • This question generates more arguments than answers
     • Common features
       > Predicting or classifying things
       > Based on historical cases (with or without outcomes)
       > Machine learning techniques
       > No predefined underlying model assumed
     Image via Wikimedia
  5. What, where, why and how of data mining
     • Who? Why? 9 Laws
     • How? CRISP-DM
     • What? Where? Unified data architecture
  6. CRISP-DM created to help
  7. THE 7TH LAW: Prediction increases information locally by generalisation
  8. This may seem obvious
     • Data mining learns from generalisations
       > Historical cases build a model of reality
     • These general models then predict an outcome that is local to a case and a time
       > How likely is it that someone will purchase product 'x'?
       > Will person a influence person b?
       > What number will the ball land on in roulette?
     • The knowledge gained may have been implied in the data, but it is new and valuable
  9. Why the 7th Law is important
     • Results need to be thought of at a group level for assessment
       > Individual results may be poor even when generated from a great model
     • Two levels of value
       > Prediction (what, when etc.)
       > Model (how…)
     • The gap between the general and the local is the difference between model building and scoring
       > Hadoop?
       > R?
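The point on slide 9 about assessing results at a group level can be illustrated with a small simulation. This is only a sketch under stated assumptions: the customers, the three score bands and the 0.5 decision threshold are all made up, and the "model" is assumed to report each customer's true purchase probability exactly, which no real model does.

```python
import random

rng = random.Random(7)

# Hypothetical scored customers: each has a true purchase probability,
# and the "model" is assumed to report that probability exactly.
scores = [rng.choice([0.1, 0.3, 0.6]) for _ in range(30000)]
outcomes = [1 if rng.random() < p else 0 for p in scores]

# Individual level: call "will purchase" when the score is at least 0.5.
hits = sum((p >= 0.5) == bool(y) for p, y in zip(scores, outcomes))
print(f"individual calls correct: {hits / len(scores):.1%}")

# Group level: compare the predicted and observed purchase rate per score band.
for band in (0.1, 0.3, 0.6):
    observed = [y for p, y in zip(scores, outcomes) if p == band]
    print(f"score {band:.1f}: observed rate {sum(observed) / len(observed):.3f}")
```

Even with a perfectly calibrated model, a sizeable share of the individual calls are wrong, while the observed rate in each band should sit close to the predicted score.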
  10. THE 5TH LAW: There are always patterns
  11. The heart of data science… is taking the 5th Law to heart
      • A major difference between the approach of data mining and data science is in the "Field of Dreams"
        > Data mining (usually) requires measurable ROI prior to projects
        > Data science is trading on probable ROI prior to projects
      • Fortunately there is still a lot of gold in those hills
        > And as technologies and data increase the number of hills is also increasing
  12. Graph of hills vs gold extracted
  13. But…
      • Just because there are always patterns doesn't mean that they are useful
        > Algorithms can (and will) cluster a cloud
        > Without Laws 1 and 2 patterns may not be a good thing
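A quick way to see that algorithms "can (and will) cluster a cloud" is to run a clustering algorithm on data that has no groups at all. The sketch below uses a plain k-means (Lloyd's algorithm) written out by hand on points drawn uniformly from the unit square; the choice of k-means, the value k = 3 and the random seeds are illustrative assumptions, not something taken from the slides.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            for x, y in points
        ]
        # Move each centroid to the mean of its members (leave it alone if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# A structureless "cloud": points drawn uniformly from the unit square.
rng = random.Random(42)
cloud = [(rng.random(), rng.random()) for _ in range(500)]

labels = kmeans(cloud, k=3)
sizes = [labels.count(c) for c in range(3)]
print(sizes)  # cluster sizes reported for data that has no real groups
```

The algorithm returns a tidy partition regardless, which is why the pattern still has to be checked against the business question.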
  14. THE 1ST LAW: Business objectives are the origin of every data mining solution
      THE 2ND LAW: Business knowledge is central to every step of the data mining process
  15. The sad tale of churn
      • This story begins with a gains curve…
  16. What was the business objective?
      • To predict churn
      • What was the definition of churn?
      • What did the business actually want to do?
        > Predict "churn"?
        > Predict people who became inactive?
        > Predict people who became inactive who might not if contacted?
  17. Why the 1st and 2nd Laws are important
      • Because we aren't doing this for the fun of it
        > Or at least not just for the fun of it
      • At every stage ask:
        > Does this relate to the business question?
        > Is the original business question still valid?
        > Is there a better question that could be asked of this data?
        > Can this be acted on?
        > What does this actually mean?
      • Document the answers, and refer back to them
  18. THE 4TH LAW: There is no free lunch for the data miner
  19. The last algorithm you will need to learn
      • Is…
      • I spent a lot of time on this in the 1990s
        > Neural nets
        > Regression
        > Decision trees
      • If you know in advance what technique you need to use, the problem has already been solved
  20. The case that worked... then didn't
      Campaign Topic: Identify fingerprint of churners
      Description: SNA offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.
      Diagram labels: Current Approach; New Approach; Active; Inactive; Churn detected
  21. Why the 4th Law is important
      • Solutions are not generally reproducible
        > It may work here, but not there
      • Methodologies are reproducible
      • Learnings may have value
      • Time will invalidate even the best models
  22. THE 3RD LAW: Data preparation is more than half of every data mining process
  23. Data preparation through a case…
  24. The problems of text data
  25. Data quality raises its head…
  26. What events lead up to a reboot? Note number of paths with a reboot, following another reboot!

      CREATE dimension table wrk.npath_reboot_5events AS
      SELECT path, COUNT(*) AS path_count
      FROM nPath (
        ON wrk.w_event_f
        PARTITION BY srv_id
        ORDER BY evt_ts desc
        MODE (NONOVERLAPPING)
        PATTERN (X{0,5}.reboot)
        SYMBOLS (true as X, evt_name = REBOOT AS reboot)
        RESULT (FIRST(srv_id OF X) AS srv_id,
                ACCUMULATE (evt_name OF ANY (X,reboot)) AS path)
      )
      GROUP BY 1;

      SELECT *
      FROM GraphGen (
        ON (SELECT * from wrk.npath_reboot_5events
            ORDER BY path_count
            LIMIT 30)
        PARTITION BY 1
        ORDER BY path_count desc
        item_format(npath)
        item1_col(path)
        score_col(path_count)
        output_format(sankey)
        justify(right));
  27. More data issues
      Looks like an issue with the data on the 30th September and beyond: the Reboot data for October seems to have been aggregated and added to the 30th September.
  28. Data preparation is tough
      • Duncan's theorem
        > The usefulness of a variable in a model is inversely related to the amount of time you spend creating it
      • Edouard's corollary
        > If it turns out to be useful you could have created it in the time indicated by Duncan's theorem
  29. Welcome to the world of big data
      • Data just got noisier and less consistent
      • Maintaining an analytical data dictionary just moved from vital to really really vital
  30. Why the 3rd Law is important
      • Because data prep is such a huge task you need to plan for it well
        > Assume that you will need to do it at least twice
          – Experimentation
          – Model building
          – Deployment
      • Look for software that makes it easy
        > And repeatable
        > And documentable
          – Scripts ≠ documentation
      • Documentation of your data is even more important than documentation of your models
        > Models can be very sensitive to data inputs
  31. THE 6TH LAW: Data mining amplifies perception in the business domain
  32. Look for patterns in Network Infrastructure
      • Too many end customers to visualise as a graph but network has a hierarchy
        > Internet Gateway Area Hub Customer Router
      • Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view

      srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
      20785675  lgp44-2        2        248  MZL             2              15
      22254516  ltc56-1        4        314  BOT            10              15
      21059184  bch66-1        2        184  RIV            15              15
      21149846  tsm83-1        2        308  LCR             3              13
      20833837  did75-4       10        216  DID            23              13
      22295785  gbw68-1       36        170  HRS             1              12
      21807750  gmo34-1        2        117  BER            17              12
      21374927  bgl93-1        2        246  G5Y             8              12
      20291116  ien11-1        2        211  ALZ             2              12
      21459244  pai34-1        4        210  M7C             3              11
      21027647  bel60-1        4        223  TRO            10              11
      20551629  pla13-1       10        332  BED             4              11
      20633112  crj95-2        2        332  G5Y             8              11
      20585199  bau06-1       46        349  BLA            21              10
      21477790  cvl92-1        4        180  IMS            35              10
      21292874  che78-1        2        163  PIT             2              10
  33. Visualise as a Graph using Aster GraphGen
      Size of Node = number of customers
      Width of Edge = number of errors

      SELECT *
      FROM graphgen (
        ON (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
            FROM wrk.srvid_dslam_err)
        PARTITION BY 1
        ORDER BY errorspersrv
        item_format(cfilter)
        item1_col(dmt_act_dslam)
        item2_col(nra_id)
        score_col(errorspersrv)
        cnt1_col(nbr_of_srvid)
        cnt2_col(nbr_of_dslam)
        output_format(sigma)
        directed(false)
        width_max(10)
        width_min(1)
        nodesize_max (3)
        nodesize_min (1));
  34. Zoom in on the area where the edge width/colour indicates a problem
  35. Add churn information
      • Add churn information to find customers connected to this Hub that have cancelled their accounts
  36. Synch Issues by Hub Type
  37. Error and Complaint rates by equipment type
  38. Why the 6th Law is important
      • We don't exist in a vacuum
        > We need to sell the results of analysis
      • This is a virtuous feedback loop
  39. THE 8TH LAW: The value of data mining results is not determined by the accuracy or stability of predictive models
  40. If your model is 98% accurate – so what?
      • Or if it's right 1 time in 35?
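Slide 40's two questions can be made concrete with a little arithmetic. The numbers below are invented for illustration: a population with a 2% churn rate, and a single-number roulette bet paying 35 to 1 (the standard payout, used here as one possible reading of "1 time in 35", which the slide itself does not spell out).

```python
# Hypothetical population: 1,000 customers, 20 of whom churn (a 2% base rate).
actual = [1] * 20 + [0] * 980

# A "model" that never predicts churn at all.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
churners_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy: {accuracy:.0%}")          # 98%
print(f"churners found: {churners_found}")  # 0

# The opposite case: a prediction that is right only 1 time in 35,
# on a bet that pays 35 to 1 when right and loses the stake of 1 when wrong.
wins, losses, payout = 1, 34, 35
profit_per_35_bets = wins * payout - losses
print(f"profit per 35 unit bets: {profit_per_35_bets}")  # 1
```

The first model is 98% accurate and finds nobody; the second is wrong almost every time and still comes out ahead under the assumed payout.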
  41. How can you evaluate models?
      • Type I and Type II errors
        > What is the cost (opportunity and actual) of a false positive?
        > What is the cost of a false negative?
      • Gains curves
        > But beware the over-accurate curve
      • Don't forget the user
        > Decision trees fight back
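The two evaluation ideas on slide 41, costing each kind of error and reading a gains curve, can be sketched in a few lines. Everything concrete here is a made-up assumption: the ten customers, their scores, the 0.5 threshold, and the costs of 5 for a wasted retention offer and 100 for a lost customer. The helper names `total_cost` and `gains_curve` are hypothetical.

```python
def total_cost(actual, predicted, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives."""
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return fp * cost_fp + fn * cost_fn

def gains_curve(actual, scores, steps=10):
    """Cumulative share of positives captured in the top 1/steps, 2/steps, ... of scores."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: -t[0])]
    total = sum(ranked)
    return [sum(ranked[: len(ranked) * i // steps]) / total
            for i in range(1, steps + 1)]

# Hypothetical data: 10 customers, 1 = churned, with model scores.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]

# Assumed costs: a wasted retention offer costs 5, a lost customer costs 100.
print(total_cost(actual, predicted, cost_fp=5, cost_fn=100))  # 105
print(gains_curve(actual, scores, steps=5))
```

With these numbers one false negative dominates the bill, so lowering the threshold could reduce total cost even though it adds false positives.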
  42. THE 9TH LAW: All patterns are subject to change
  43. SUMMARY
      0 Listen to data miners…
      7 Data mining brings new knowledge
      5 And there will always be new knowledge
      1 Start with the business
      2 Keep going back to the business
      4 It won’t get easier with time
      3 Especially given the state your data is in
      6 But you will improve business results
      8 As long as you look for the right outputs
      9 Goto 0
  44. RESOURCES
      • The Society of Data Miners (coming soon)
        > Available on LinkedIn
      • CRISP-DM