BAQMaR - Conference DM


Published in: Technology, Design


  1. 1. Annual Conference 2010
  2. 3. Data Mining Track WIFI ibahn_conference CODE: 01A3D9
  3. 4. Walking In Circles
  5. 6. SO, WHAT DO YOU DO FOR A LIVING?
      • Something with data
      • Some stuff with statistics and numbers
      • It’s complicated… (do you have an hour?)
  6. 7. SO, WHAT DO YOU DO FOR A LIVING? However…
  8. 9.
  9. 10. FURTHER READING…
      • K. Coussement and D. Van den Poel, “Improving Customer Complaint Management by Automatic Email Classification Using Linguistic Style Features as Predictors,” Decision Support Systems, 44 (4) (2008), pp. 370-382.
  10. 11.
  11. 12. FURTHER READING…
      • W. Buckinx, G. Verstraeten, and D. Van den Poel, “Predicting customer loyalty using the internal transactional database,” Expert Systems with Applications, 32 (1) (2007).
  12. 13. IF IT DOESN’T EXIST…
      • ESTIMATE IT!
  13. 14.
  14. 15. THE MODEL FARM
      • Easy analyses, high value
      • Outsourcing to Romania, India, …
      • Modeling automation
      • MODEL FARM
  15. 16. Modeling Automation
  17. 18. MODEL FARMERS ?
  21. 22. LET’S EVALUATE RATIONALLY
      • The charm argument…
          • bOb van Reeth
      • Performance…
          • Statistical and/or financial?
      • Model farm…
          • Design is more important than performance
  22. 23. RE!SET ?
      • Evolutionary
      • First Movers
      • Laggards
  23. 24. Data Mining Track WIFI ibahn_conference CODE: 01A3D9
  24. 25. Web Analytics 2.0: state of the art, challenges and opportunities
      Prof. dr. Bart Baesens
      Department of Decision Sciences and Information Management, K.U.Leuven (Belgium)
      Vlerick Leuven Ghent Management School (Belgium)
      School of Management, University of Southampton (United Kingdom)
      [email_address]
      Twitter: DataMiningApps
      Facebook: Data Mining with Bart
  25. 26. Overview
      • Introduction and example applications
      • Data collection on the Web
      • Web usage metrics and analysis challenges
      • Example advanced data mining applications
      • Conclusions
  26. 27. Web Intelligence
      • Web Intelligence: advanced analysis techniques applied to the Web; often referred to as Web mining
      • Common categories of Web mining
          • Web usage mining: discovering interesting patterns in how visitors use a Web site
              • E.g. association rules for visited pages
          • Web content mining: extracting useful information or discovering knowledge from Web page contents
              • E.g. information retrieval/extraction, automatic document categorization, etc.
          • Web structure mining: mining the hyperlink structure of the Web
              • E.g. identifying authoritative pages, web community detection, etc.
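
The "association rules for visited pages" example on this slide can be illustrated with a small sketch. It is not taken from the talk: the session data, the `pair_rules` helper, and the 0.5 support threshold are all hypothetical, and only rules between pairs of pages are considered.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each set holds the pages seen in one visit.
sessions = [
    {"/home", "/shop", "/cart"},
    {"/home", "/shop"},
    {"/home", "/about"},
    {"/shop", "/cart"},
]


def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for page pairs."""
    n = len(sessions)
    page_counts = Counter(page for s in sessions for page in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of sessions containing both pages
        if support < min_support:
            continue
        # confidence(a -> b) = P(b in session | a in session)
        rules.append((a, b, support, count / page_counts[a]))
        rules.append((b, a, support, count / page_counts[b]))
    return rules


rules = pair_rules(sessions)
```

With this toy data, every session that contains `/cart` also contains `/shop`, so the rule `/cart -> /shop` has confidence 1.0.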
  27. 28. Example goals of Web analytics & mining
      • Improve web site design
          • e.g. how do (segments of) customers navigate through my site and do they find what they are looking for?
      • Measure the effectiveness of marketing strategies, SEM (SEO/PPC) activities, and advertising campaigns
          • e.g. monitor effect of new strategy on traffic and (more importantly) conversion rate, effectiveness of banner campaigns, search engine visibility, etc.
          • Landing (destination) page optimization: try design variations & test differences in conversion, etc.
      • Personalization: deliver content specific to each visitor
          • e.g. recommender systems, targeted offers, etc.
      • Etc.
  28. 29. Data collection on the Web
      • Web server logs
      • Page tagging (client side)
          • “Tagging” a web page with a code snippet referencing a separate JavaScript file
      • Cookies
          • Small text string that a Web server can send to a visitor's Web browser (as part of its HTTP response)
      • Web beacons
      - - [27/Jun/2002:00:01:54 +0200] "GET /dutch/shop/detail.html HTTP/1.1" 200 38890 "; "Mozilla/4.0 (MSIE 6.0)"
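
As a rough illustration of working with Web server logs, the sketch below parses one entry, assuming the server writes the common "combined" log format. The client address and referrer are missing from the log line on the slide, so the sample here fills them with placeholders (`192.0.2.1`, `http://example.com/`); `LOG_PATTERN` and `parse_log_line` are hypothetical names.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Placeholder host and referrer; the remaining fields follow the slide's example.
sample = (
    '192.0.2.1 - - [27/Jun/2002:00:01:54 +0200] '
    '"GET /dutch/shop/detail.html HTTP/1.1" 200 38890 '
    '"http://example.com/" "Mozilla/4.0 (MSIE 6.0)"'
)
record = parse_log_line(sample)
```

Lines that do not match the assumed format return `None`, which makes it easy to count and inspect rejected entries during preprocessing.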
  29. 30. Example web usage metrics (Kaushik, 2009)
      • Page Views
          • Not that meaningful on its own
          • Page definition issues
      • Sessions/visits
          • Pages visited within one session
          • Based on IP address/user agent/cookies…
      • New/Repeat/Return visitors
      • Top entry/exit pages/destinations, …
      • Time on page, Time on site
      • Site abandonment rate
      • Average visits/days to purchase
      • Referrers
      • Search terms
      • Engagement
      • …
  30. 31. Analysis challenges
      • Extremely messy data
          • Extensive preprocessing needed (e.g. irrelevant requests, sessionization, …)
      • Information overload: too many metrics!
      • Focus on actionable metrics
          • Bounce rate: ratio of visits where visitor left instantly
          • Conversion rate: percentage of visits or of unique visitors for which we observed the event (e.g. purchase, pdf download, registration, …)
      • Event-driven Web analytics (e.g. AJAX)
          • Impact on existing metrics (e.g. bounce rate)
          • Definition of new metrics
          • Granularity of event capture
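
A minimal sketch of sessionization and the two actionable metrics named above. Several choices are assumptions rather than anything stated on the slide: a 30-minute inactivity timeout separates sessions, a "bounce" is operationalized as a single-page session, conversion is measured per visit, and the hit data is invented.

```python
from datetime import datetime, timedelta


def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (visitor_id, timestamp, page) hits into sessions (lists of pages)."""
    sessions = []
    last_seen = {}  # visitor_id -> (time of last hit, that visitor's open session)
    for visitor, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous[0] > timeout:
            current = []  # gap too long (or first hit): start a new session
            sessions.append(current)
        else:
            current = previous[1]
        current.append(page)
        last_seen[visitor] = (ts, current)
    return sessions


def bounce_rate(sessions):
    """Share of sessions with exactly one page view (assumed bounce definition)."""
    return sum(len(s) == 1 for s in sessions) / len(sessions)


def conversion_rate(sessions, goal_page):
    """Share of sessions in which the goal page was reached."""
    return sum(goal_page in s for s in sessions) / len(sessions)


t0 = datetime(2010, 1, 1, 12, 0)
hits = [
    ("a", t0, "/home"),
    ("a", t0 + timedelta(minutes=5), "/shop"),
    ("a", t0 + timedelta(hours=2), "/home"),  # new session after a long gap
    ("b", t0, "/home"),
    ("b", t0 + timedelta(minutes=1), "/checkout/done"),
]
sessions = sessionize(hits)
```

Event-driven pages complicate the single-page bounce definition used here, since a visitor can interact heavily without ever requesting a second page.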
  31. 32. Analysis challenges (contd.)
      • Imperfect data
          • Cookies blocked/deleted
          • Users sharing computers, IP addresses, …
      • Focus on
          • Trends
              • Compare against previous period
              • Upward/downward trend, time series analysis, …
          • Segmentation
              • Bounce rates can be segmented by, e.g., traffic source (which sources are sending you bad traffic?), referral page, search engine, top landing pages, countries (geography), …
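
Segmenting bounce rate by traffic source could look like the sketch below. The visit records and the `bounce_rate_by_segment` helper are hypothetical, and a bounce is again assumed to mean a visit with exactly one page view.

```python
from collections import defaultdict


def bounce_rate_by_segment(visits):
    """visits: iterable of (segment, pages_viewed). Returns {segment: bounce rate}."""
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for segment, pages_viewed in visits:
        totals[segment] += 1
        if pages_viewed == 1:  # assumed definition of a bounce
            bounces[segment] += 1
    return {segment: bounces[segment] / totals[segment] for segment in totals}


# Hypothetical visits: (traffic source, number of pages viewed)
visits = [
    ("search", 1),
    ("search", 4),
    ("banner", 1),
    ("banner", 1),
    ("direct", 3),
]
rates = bounce_rate_by_segment(visits)
```

In this toy data the banner campaign bounces every visit, which is the kind of "bad traffic" signal an overall bounce rate would hide.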
  32. 33. Dashboards
  33. 34. Example advanced data mining applications
      • Navigation analysis using sequence mining
      • Multivariate testing using experimental design
      • Semi-supervised learning for message/document labeling
      • Recommender systems using collaborative filtering
      • (Social) network based learning
  34. 35. Navigation analysis
      • Path analysis: analysis of frequent navigation patterns
          • From a given page, which other pages does a group of users visit next in x% of cases?
      • Funnel: focus on a pre-determined sequence
          • Scope of one visit: e.g. stages of the checkout process
          • Conversion funnel over a longer period of time
      • Page overlay / click density analysis:
          • Clicks or other metrics overlaid directly on actual pages
          • Can traverse the website as groups of users navigated through it
          • Can also show conversion rate for each link on a page
      • Heat maps: coloring indicates click frequencies
      • Apply segmentation where possible!
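
Path analysis and funnel counts can be sketched on ordered sessions as follows. The sessions, page names, and both helpers are hypothetical; the funnel here counts steps reached in order within one visit and allows other pages in between, which is only one possible definition.

```python
from collections import Counter


def next_page_shares(sessions, page):
    """For a given page, return {next page: share of transitions from that page}."""
    followers = Counter()
    for session in sessions:
        for current, following in zip(session, session[1:]):
            if current == page:
                followers[following] += 1
    total = sum(followers.values())
    return {p: c / total for p, c in followers.items()} if total else {}


def funnel_counts(sessions, steps):
    """Count the sessions that reach each funnel step, in the given order."""
    counts = [0] * len(steps)
    for session in sessions:
        position = 0  # index of the next funnel step this session must hit
        for page in session:
            if position < len(steps) and page == steps[position]:
                counts[position] += 1
                position += 1
    return counts


# Hypothetical sessions as ordered lists of pages
sessions = [
    ["/home", "/shop", "/cart", "/checkout"],
    ["/home", "/shop", "/home"],
    ["/home", "/about"],
    ["/shop", "/cart"],
]
shares = next_page_shares(sessions, "/home")
funnel = funnel_counts(sessions, ["/shop", "/cart", "/checkout"])
```

Running either helper on a subset of sessions (for example, one traffic source) gives the segmented view the slide recommends.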
  35. 36. SAS Web Analytics: funnel example
  36. 37. SAS Web Analytics: site overlay
  37. 38. Experiment and test
      • Present different pages, page elements, etc. to a random sample of actual visitors
      • Statistically compare the metric of interest
          • For example, is conversion rate, bounce rate, etc. significantly better for one page design than for the other?
      • Examples of pages to optimize:
          • Landing page (the page you land on after clicking on an ad)
          • Page in checkout process
          • Most popular pages
          • Pages with high bounce rates
      • Design of Experiments!
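
One common way to "statistically compare" conversion rates between two page designs is a two-proportion z-test. The slide does not name a specific test, so this choice is an assumption, as are the visitor and conversion counts below.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: design A converts 100/1000 visits, design B 130/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

The normal approximation behind this test is only reasonable when each variant has a fair number of conversions and non-conversions; with very small counts an exact test would be a safer choice.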
  38. 39. Multivariate testing
      • Variables are sections or elements of the page for which you want to test different variations
          • X1: version a or b
          • X2: version a or b or c
          • X3: version a or b
          • X4: a (blue) or b (green)
      [Figure labels: X1: headline; X2: sales copy; X3: button text; X3: image (e.g. “hero shot”); X4: button color]
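
To make the size of such a test concrete, the sketch below enumerates a full factorial design for the four variables listed on the slide. The `full_factorial` helper is hypothetical, and a full factorial is only one option: the talk's "Design of Experiments" remark may well point to smaller fractional designs.

```python
from itertools import product

# Variables and their versions, as listed on the slide
factors = {
    "X1": ["a", "b"],
    "X2": ["a", "b", "c"],
    "X3": ["a", "b"],
    "X4": ["a (blue)", "b (green)"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


designs = full_factorial(factors)
```

Here 2 x 3 x 2 x 2 gives 24 page variants, each of which needs enough traffic for a meaningful comparison.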
  39. 40. Semi-Supervised Classification
      • Motivation: data labeling can be expensive, difficult and unreliable, especially in a Web data context
      • Examples (Joachims 1999)
          • Social Media Analytics
              • Sentiment analysis using Twitter Tweets
          • Netnews filtering
              • User labels some news articles as interesting or not (training set)
          • Spam e-mail detection
          • Web page classification
      • Data mining techniques needed (e.g. transductive SVM)
  40. 41. Example Recommender System
  41. 42. Collaborative Filtering: Methods
      • When identifying buying patterns, make recommendation decisions for a specific user based on the judgments of users with similar interests (Resnick et al. 1994)
      • User-User methods
          • Identify like-minded users (e.g. k-nearest neighbor)
      • Item-Item methods
          • Correlation analysis
          • Regression analysis
          • Association rule mining
          • Bayesian belief networks
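
A user-user method can be sketched as a k-nearest-neighbor prediction: find the users most similar to the target user and average their ratings for the item, weighted by similarity. Cosine similarity over co-rated items, the ratings table, and the function names are all assumptions for illustration; they are not taken from the talk or from Resnick et al.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "cat": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, k=2):
    """Similarity-weighted mean rating of the k most similar users who rated item."""
    neighbours = [
        (cosine(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    top = sorted(neighbours, reverse=True)[:k]
    weight = sum(similarity for similarity, _ in top)
    if weight == 0:
        return None  # nobody comparable has rated this item
    return sum(similarity * rating for similarity, rating in top) / weight


prediction = predict(ratings, "ann", "D")
```

Because ann's ratings resemble bob's more than cat's, the prediction for item D lands closer to bob's rating of 4 than to cat's rating of 2.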
  42. 43. Example Bayesian Network
  43. 44. Social Networks Applications
      • Social networks
          • E-mail traffic
          • Research papers connected by citations
          • Telephone calls
          • LinkedIn, Facebook, MySpace, Friendster, Xing, …
      • Applications
          • Web community mining
          • Fraud detection
          • Terrorism detection (suspicion scoring)
          • Product recommendations
          • Churn detection
          • Epidemiology (spread of illness)
          • Protein-Protein interactions
  44. 45. Twitter metrics
      • Number of followers
          • Calculate using Twitter API or e.g.
      • Churn rate: number of followers you lose in a given period
      • Message amplification: number of retweets of your messages, e.g. number of retweets per thousand followers in the last week, retweet quotient (# retweets/# tweets), …
          • Measures viralness of your tweets
      • Average click-through rate for links you share on Twitter
          • Also calculate conversion rate for users clicking those links
      • Conversation rate: replies sent per day, replies received per day, tweets sent per day, average tweet length, …
      • Composite metrics: e.g. Klout score ( )
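
The two message amplification ratios named on the slide are simple to compute once the counts are available. The sketch below assumes the counts have already been collected (it makes no Twitter API calls), and the function names and numbers are hypothetical.

```python
def retweet_quotient(retweets, tweets):
    """Retweets received per tweet sent (# retweets / # tweets)."""
    return retweets / tweets if tweets else 0.0


def retweets_per_thousand_followers(retweets, followers):
    """Retweets in a period, scaled to an audience of 1,000 followers."""
    return 1000 * retweets / followers if followers else 0.0


# Hypothetical week: 30 tweets sent, 45 retweets received, 1,500 followers
quotient = retweet_quotient(45, 30)
per_thousand = retweets_per_thousand_followers(45, 1500)
```

Scaling by follower count makes accounts of different sizes comparable, which the raw retweet count does not.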
  45. 46. Components of a Network Learning System
      • Non-relational (local) classifiers
          • Use only local (e.g., customer-specific) information
          • Can be estimated using traditional machine learning methods (nearest neighbor, decision trees, …)
          • Used to generate the priors for the relational learning and collective inference
      • Relational model
          • Makes use of the relations/links in the network
      • Collective inference
          • Determines how the unknown values are estimated together, influencing each other
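
How the three components fit together can be sketched with a very simple setup. The talk does not specify the models, so everything here is an assumption: a constant `prior` stands in for the local classifier's output, the relational model scores a node as the mean of its neighbours' scores, and collective inference is a fixed number of synchronous update rounds. The graph and labels are invented.

```python
def collective_inference(edges, known, prior=0.5, iterations=20):
    """Estimate P(label == 1) for every node in an undirected graph.

    edges: dict mapping node -> list of neighbouring nodes
    known: dict mapping node -> observed 0/1 label
    prior: stand-in for the local (non-relational) classifier's estimate
    """
    scores = {
        node: float(known[node]) if node in known else prior for node in edges
    }
    for _ in range(iterations):
        updated = dict(scores)
        for node, neighbours in edges.items():
            if node in known or not neighbours:
                continue  # observed labels stay fixed
            # Relational model: mean score of the node's neighbours
            updated[node] = sum(scores[m] for m in neighbours) / len(neighbours)
        scores = updated  # all unknown nodes are re-estimated together
    return scores


# Hypothetical chain a - b - c - d with known labels at both ends
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
known = {"a": 1, "d": 0}
scores = collective_inference(edges, known)
```

In this chain the unknown nodes settle near 2/3 and 1/3, reflecting how far each sits from the positively and negatively labelled ends.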
  46. 47. Example [figure: nodes labeled “?”]
  47. 48. Conclusions
      • Lots of (imperfect) data in a web based context
      • Data preprocessing
      • Lots of opportunities for advanced data mining
      • Large scale, on-line, real-time data mining
      • Integration of on-line/off-line data
  48. 50. Annual Conference 2010