Strata: 9 laws of Data Mining


My 9 Laws of Data Mining presentation from Strata Santa Clara 2013-02-26

    Strata: 9 laws of Data Mining, Presentation Transcript

    • Advanced Analytics THE NINE LAWS OF DATA MINING Duncan Ross @duncan3ross duncan.ross@teradata.com Based on the 9 Laws of Data Mining by Tom Khabaza
    • What you won’t get from this presentation
        • The last two algorithms you need to know!
        • An explanation of Bayes’ theorem
        • The name of the software that will make you $ millions
            > Not even a comparison of different software!
        • Image: the grave of Thomas Bayes (probably), near “silicon roundabout” (via Wikimedia)
        2/28/2013 @duncan3ross
    • THE 0TH LAW: Data Mining laws also work as Data Science laws
    • What is data mining?
        • This question generates more arguments than answers
        • Common features
            > Predicting or classifying things
            > Based on historical cases (with or without outcomes)
            > Machine learning techniques
            > No predefined underlying model assumed
        • Image via Wikimedia
    • What, where, why and how of data mining
        • Who? Why? 9 Laws
        • How? CRISP-DM
        • What? Where? Unified data architecture
    • CRISP-DM created to help
    • THE 7TH LAW: Prediction increases information locally by generalisation
    • This may seem obvious
        • Data mining learns from generalisations
            > Historical cases build a model of reality
        • These general models then predict an outcome that is local to a case and a time
            > How likely is it that someone will purchase product ‘x’?
            > Will person A influence person B?
            > What number will the ball land on in roulette?
        • The knowledge gained may have been implied in the data, but it is new and valuable
    • Why the 7th Law is important
        • Results need to be thought of at a group level for assessment
            > Individual results may be poor even when generated from a great model
        • Two levels of value
            > Prediction (what, when etc…)
            > Model (how…)
        • The gap between the general and the local is the difference between model building and scoring
            > Hadoop?
            > R?
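The point about group-level assessment can be illustrated with a small simulation. This is a minimal sketch, assuming a hypothetical, perfectly calibrated model that gives every case in a group the same 10% purchase probability; the function name and all numbers are illustrative and do not come from the talk.

```python
import random


def simulate_group(n_cases, predicted_prob, seed=0):
    """Draw one True/False outcome per case, each with the predicted probability.

    Assumption: the model is perfectly calibrated, so real outcomes
    occur at exactly the predicted rate.
    """
    rng = random.Random(seed)
    return [rng.random() < predicted_prob for _ in range(n_cases)]


predicted_prob = 0.10
outcomes = simulate_group(n_cases=10_000, predicted_prob=predicted_prob)
observed_rate = sum(outcomes) / len(outcomes)

# Individual level: betting "will purchase" on any single case is usually wrong.
# Group level: the share of purchasers is close to what the model predicted.
print(f"predicted rate: {predicted_prob:.2f}, observed rate: {observed_rate:.3f}")
```

Judged case by case, roughly nine in ten "will purchase" calls would be wrong, yet the group-level rate matches the prediction closely, which is the level at which the slide suggests results should be assessed.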
    • THE 5TH LAW: There are always patterns
    • The heart of data science… is taking the 5th Law to heart
        • A major difference between the approach of data mining and data science is in the “Field of Dreams”
            > Data mining (usually) requires measurable ROI prior to projects
            > Data science is trading on probable ROI prior to projects
        • Fortunately there is still a lot of gold in those hills
            > And as technologies and data increase, the number of hills is also increasing
    • Graph of hills vs gold extracted
    • But…
        • Just because there are always patterns doesn’t mean that they are useful
            > Algorithms can (and will) cluster a cloud
            > Without Laws 1 and 2 patterns may not be a good thing
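"Algorithms can (and will) cluster a cloud" is easy to demonstrate. The following is a minimal sketch, assuming a plain k-means written from scratch and a cloud of uniformly random points with no real structure; the function, the value of k and the data are all illustrative choices, not anything shown in the talk.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Very small 2-D k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# A structureless "cloud": 500 points drawn uniformly from the unit square.
data_rng = random.Random(42)
cloud = [(data_rng.random(), data_rng.random()) for _ in range(500)]

centroids, assignments = kmeans(cloud, k=3)
cluster_sizes = [assignments.count(c) for c in range(3)]
print("cluster sizes:", cluster_sizes)
```

The algorithm returns three tidy-looking clusters even though the data has none, which is why the business objective and business knowledge have to decide whether a pattern means anything.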
    • THE 1ST LAW: Business objectives are the origin of every data mining solution
    • THE 2ND LAW: Business knowledge is central to every step of the data mining process
    • The sad tale of churn
        • This story begins with a gains curve…
    • What was the business objective?
        • To predict churn
        • What was the definition of churn?
        • What did the business actually want to do?
            > Predict “churn”?
            > Predict people who became inactive?
            > Predict people who became inactive who might not if contacted?
    • Why the 1st and 2nd Laws are important
        • Because we aren’t doing this for the fun of it
            > Or at least not just for the fun of it
        • At every stage ask:
            > Does this relate to the business question?
            > Is the original business question still valid?
            > Is there a better question that could be asked of this data?
            > Can this be acted on?
            > What does this actually mean?
        • Document the answers, and refer back to them
    • THE 4TH LAW: There is no free lunch for the data miner
    • The last algorithm you will need to learn
        • Is…
        • I spent a lot of time on this in the 1990s
            > Neural nets
            > Regression
            > Decision trees
        • If you know in advance what technique you need to use, the problem has already been solved
    • The case that worked... then didn’t
        • Campaign Topic: Identify fingerprint of churners
        • Description: SNA offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.
        • [Diagram comparing the Current Approach and the New Approach; labels: Active, Inactive, Churn detected]
    • Why the 4th Law is important
        • Solutions are not generally reproducible
            > It may work here, but not there
        • Methodologies are reproducible
        • Learnings may have value
        • Time will invalidate even the best models
    • THE 3RD LAW: Data preparation is more than half of every data mining process
    • Data preparation through a case…
    • The problems of text data
    • Data quality raises its head…
    • What events lead up to a reboot?
        • Note the number of paths with a reboot following another reboot!
        • nPath query: count the paths of up to five events that end in a reboot

          CREATE dimension table wrk.npath_reboot_5events AS
          SELECT path, COUNT(*) AS path_count
          FROM nPath (
              ON wrk.w_event_f
              PARTITION BY srv_id
              ORDER BY evt_ts desc
              MODE (NONOVERLAPPING)
              PATTERN (X{0,5}.reboot)
              SYMBOLS (true as X, evt_name = REBOOT AS reboot)
              RESULT (FIRST(srv_id OF X) AS srv_id,
                      ACCUMULATE(evt_name OF ANY (X,reboot)) AS path)
          )
          GROUP BY 1;

        • GraphGen query: draw the counted paths as a Sankey diagram

          SELECT *
          FROM GraphGen (
              ON (SELECT * from wrk.npath_reboot_5events
                  ORDER BY path_count
                  LIMIT 30)
              PARTITION BY 1
              ORDER BY path_count desc
              item_format(npath)
              item1_col(path)
              score_col(path_count)
              output_format(sankey)
              justify(right));
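For readers without access to Aster, the idea behind the path query can be approximated in a few lines of Python. This is a simplified stand-in, not a reimplementation of nPath: it assumes events arrive as (srv_id, evt_ts, evt_name) tuples, it looks back over the five events immediately before each reboot (so windows may overlap, unlike NONOVERLAPPING mode), and the event name LINK_DOWN and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict


def reboot_paths(events, max_prior=5):
    """Count the event paths that end in a reboot, per server.

    events: iterable of (srv_id, evt_ts, evt_name) tuples.
    Returns a Counter mapping "A > B > REBOOT" style paths to counts.
    """
    by_server = defaultdict(list)
    for srv_id, evt_ts, evt_name in events:
        by_server[srv_id].append((evt_ts, evt_name))

    path_counts = Counter()
    for server_events in by_server.values():
        server_events.sort()  # oldest event first
        names = [name for _, name in server_events]
        for i, name in enumerate(names):
            if name == "REBOOT":
                # Up to max_prior preceding events, which may include reboots.
                prior = names[max(0, i - max_prior):i]
                path_counts[" > ".join(prior + ["REBOOT"])] += 1
    return path_counts


# Hypothetical sample data: server s1 reboots twice in a row.
sample_events = [
    ("s1", 1, "LINK_DOWN"),
    ("s1", 2, "REBOOT"),
    ("s1", 3, "REBOOT"),
    ("s2", 1, "LINK_DOWN"),
    ("s2", 2, "REBOOT"),
]
paths = reboot_paths(sample_events)
for path, count in paths.most_common():
    print(count, path)
```

Because any event can appear in the look-back window, a reboot that follows another reboot shows up as its own path, which is the kind of data-quality signal the slide calls out.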
    • More data issues
        • Looks like an issue with the data on 30th September and beyond: the reboot data for October seems to have been aggregated and added to 30th September
    • Data preparation is tough
        • Duncan’s theorem
            > The usefulness of a variable in a model is inversely related to the amount of time you spend creating it
        • Edouard’s corollary
            > If it turns out to be useful you could have created it in the time indicated by Duncan’s theorem
    • Welcome to the world of big data
        • Data just got noisier and less consistent
        • Maintaining an analytical data dictionary just moved from vital to really really vital
    • Why the 3rd Law is important
        • Because data prep is such a huge task you need to plan for it well
            > Assume that you will need to do it at least twice
                – Experimentation
                – Model building
                – Deployment
        • Look for software that makes it easy
            > And repeatable
            > And documentable
                – Scripts ≠ documentation
        • Documentation of your data is even more important than documentation of your models
            > Models can be very sensitive to data inputs
    • THE 6TH LAW: Data mining amplifies perception in the business domain
    • Look for patterns in Network Infrastructure
        • Too many end customers to visualise as a graph but network has a hierarchy
            > Internet Gateway Area Hub Customer Router
        • Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view

          srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
          20785675  lgp44-2        2        248  MZL             2              15
          22254516  ltc56-1        4        314  BOT            10              15
          21059184  bch66-1        2        184  RIV            15              15
          21149846  tsm83-1        2        308  LCR             3              13
          20833837  did75-4       10        216  DID            23              13
          22295785  gbw68-1       36        170  HRS             1              12
          21807750  gmo34-1        2        117  BER            17              12
          21374927  bgl93-1        2        246  G5Y             8              12
          20291116  ien11-1        2        211  ALZ             2              12
          21459244  pai34-1        4        210  M7C             3              11
          21027647  bel60-1        4        223  TRO            10              11
          20551629  pla13-1       10        332  BED             4              11
          20633112  crj95-2        2        332  G5Y             8              11
          20585199  bau06-1       46        349  BLA            21              10
          21477790  cvl92-1        4        180  IMS            35              10
          21292874  che78-1        2        163  PIT             2              10
    • Visualise as a Graph using Aster GraphGen
        • Size of Node = number of customers
        • Width of Edge = number of errors

          SELECT *
          FROM graphgen (
              ON (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
                  FROM wrk.srvid_dslam_err)
              PARTITION BY 1
              ORDER BY errorspersrv
              item_format(cfilter)
              item1_col(dmt_act_dslam)
              item2_col(nra_id)
              score_col(errorspersrv)
              cnt1_col(nbr_of_srvid)
              cnt2_col(nbr_of_dslam)
              output_format(sigma)
              directed(false)
              width_max(10)
              width_min(1)
              nodesize_max(3)
              nodesize_min(1));
    • Zoom in on area where the edge width/colour indicates a problem
    • Add churn information
        • Add churn information to find customers connected to this Hub that have cancelled their accounts
    • Synch Issues by Hub Type
    • Error and Complaint rates by equipment type
    • Why the 6th Law is important
        • We don’t exist in a vacuum
            > We need to sell the results of analysis
        • This is a virtuous feedback loop
    • THE 8TH LAW: The value of data mining results is not determined by the accuracy or stability of predictive models
    • If your model is 98% accurate, so what?
        • Or if it’s right 1 time in 35?
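The two questions on this slide can be made concrete with a little arithmetic. This is a minimal sketch under stated assumptions: the churn population, the two models and every monetary figure are hypothetical, and the roulette reading assumes the "1 time in 35" refers to a single-number bet paying the standard 35 to 1.

```python
from fractions import Fraction


def accuracy(tp, fp, fn, tn):
    """Share of all cases the model classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def net_value(tp, fp, fn, value_tp, cost_fp, cost_fn):
    """Business value of acting on the predictions (hypothetical payoffs)."""
    return tp * value_tp - fp * cost_fp - fn * cost_fn


# Hypothetical population: 1,000 customers, 20 of whom churn.
# Model A predicts "nobody churns"; Model B flags 100 customers and catches 15.
model_a = dict(tp=0, fp=0, fn=20, tn=980)
model_b = dict(tp=15, fp=85, fn=5, tn=895)

# Hypothetical payoffs: 100 per saved customer, 5 per wasted contact,
# 100 per churner that was missed.
payoffs = dict(value_tp=100, cost_fp=5, cost_fn=100)

acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
val_a = net_value(model_a["tp"], model_a["fp"], model_a["fn"], **payoffs)
val_b = net_value(model_b["tp"], model_b["fp"], model_b["fn"], **payoffs)
print(f"Model A: accuracy {acc_a:.0%}, net value {val_a}")
print(f"Model B: accuracy {acc_b:.0%}, net value {val_b}")

# Roulette: right 1 time in 35 on a bet that pays 35 to 1 per unit staked.
p_win = Fraction(1, 35)
expected_profit = p_win * 35 - (1 - p_win) * 1
print(f"Roulette: accuracy {float(p_win):.1%}, expected profit per unit {expected_profit}")
```

Under these assumed numbers the 98% accurate model loses money, the 91% accurate model makes money, and a predictor that is wrong 34 times out of 35 still has a positive expected return.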
    • How can you evaluate models?
        • Type I and Type II errors
            > What is the cost (opportunity and actual) of a false positive?
            > What is the cost of a false negative?
        • Gains curves
            > But beware the over-accurate curve
        • Don’t forget the user
            > Decision trees fight back
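For readers who have not met a gains curve, it can be computed directly from model scores and known outcomes. This is a minimal sketch, assuming a list of scores with matching 0/1 outcomes; the function name, the number of bins and the sample data are illustrative and not taken from the talk.

```python
def gains_curve(scores, outcomes, n_bins=10):
    """Cumulative gains: share of positives captured in the top-scored fraction.

    Returns a list of (fraction_contacted, fraction_of_positives_captured).
    """
    total_positives = sum(outcomes)
    if total_positives == 0:
        raise ValueError("gains curve is undefined without positive outcomes")
    # Rank cases from highest to lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    curve = []
    for b in range(1, n_bins + 1):
        cutoff = round(len(ranked) * b / n_bins)
        captured = sum(outcome for _, outcome in ranked[:cutoff])
        curve.append((b / n_bins, captured / total_positives))
    return curve


# Hypothetical scores and outcomes for ten customers (1 = churned).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

curve = gains_curve(scores, outcomes, n_bins=5)
for contacted, captured in curve:
    print(f"top {contacted:.0%} of customers -> {captured:.0%} of churners")
```

In this toy data, contacting the top 20% of customers by score reaches half of the churners, against the 20% a random selection would be expected to reach.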
    • THE 9TH LAW: All patterns are subject to change
    • SUMMARY
        0 Listen to data miners…
        7 Data mining brings new knowledge
        5 And there will always be new knowledge
        1 Start with the business
        2 Keep going back to the business
        4 It won’t get easier with time
        3 Especially given the state your data is in
        6 But you will improve business results
        8 As long as you look for the right outputs
        9 Goto 0
    • RESOURCES
        • http://khabaza.codimension.net/index_files/9laws.htm
        • The Society of Data Miners (coming soon)
            > Available on LinkedIn
        • CRISP-DM