How To Web - Introduction To Data Mining For Web Applications
  • Transcript of "How To Web - Introduction To Data Mining For Web Applications"

    1. Introduction to Data Mining for Web Applications
       Paul-Alexandru Chirita, Ph.D.
    2. About Me
       Education:
         • Ph.D., Information Retrieval & Data Mining, Univ. of Hannover, Germany
         • B.Sc., Ecole Polytechnique, Paris, France + “Politehnica” Univ. Bucharest, CS Dept.
       Roughly 8 years in IT, 7 of them in IR & DM
       Now at Adobe Romania (previously L3S, Yahoo!, Schlumberger and others)
    3. Web Mining
       The application of Data Mining algorithms to discover patterns in the Web.
       Three dimensions:
         • Usage Mining
             • Analyzes various access logs in order to provide input to business decisions
             • By far the most used, with the highest ROI
         • Content Mining
             • Analyzes Web page content in order to extract useful information (e.g., keywords, topic, content type, sentiment, etc.)
         • Structure Mining
             • Also known as “Link Analysis”
             • Investigates the hyperlink structure of the Web to improve current algorithms
    4. Agenda
         • Client side tools
             • Google Analytics
             • Omniture
         • Server side tools
             • AWStats
             • Webalizer / AWFFull
         • Advanced analytics
    6. Client side tools
       Purpose:
         • Report basic information about the traffic on your Web site and its SEO
         • Most of them are also (partly) integrated with monetization tools (e.g., AdWords)
       Pros:
         • Hosted by third-party sites, zero or minimal cost for you
         • Easy to implement and integrate, no maintenance
       Cons:
         • The client-side tracking code adds overhead (roughly 200-600 ms of additional response time)
         • If your traffic increases “too much” you have to pay
    7. Client-side tools: Google Analytics
       Free, and well-engineered!
       Shows statistics about:
         • Basics: visits, pages, etc.
         • Visitor profiles: browser, OS, language/locale
         • Visitor loyalty: how many times each visitor returned to your site, when they last did so, and for how long
         • Trends: whether your traffic and popularity are growing or decreasing
         • Traffic sources: entry/exit pages, referring sites and search engines
       Some customization planned for the near-term future
       Good for personal or small-scale sites
       https://www.google.com/analytics
    8. Client-side tools: Google Analytics [2]
    9. Omniture: SiteCatalyst
       Low price per thousand entries, but may become costly if you have a lot of traffic (millions of visits per day) or many dozens of sensors
       Same statistics as Google Analytics, but you can drill down very deep:
         • Statistics per hour of day, per file type (html, cfm, etc.), per action type (download, view page, etc.)
         • Visitor segmentation down to the level of city
         • Purchases, promotions, and many metrics for e-commerce (e.g., how many products added to the cart have actually been checked out)
       Most importantly, you can define ANY metric you want! (e.g., how many people click on my survey link, how many of them fill it in, etc.)
       www.omniture.com
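A custom metric such as "how many people click on my survey link, and how many of them fill it in" boils down to a small funnel computation. The sketch below is not SiteCatalyst code. The event names (`view_page`, `survey_click`, `survey_submit`), the visitor IDs, and the `funnel` helper are all made up for illustration, and the order of events within a visit is ignored.

```python
from collections import defaultdict

# Hypothetical event stream: (visitor_id, event_name). Not data from the talk.
EVENTS = [
    ("v1", "view_page"), ("v1", "survey_click"), ("v1", "survey_submit"),
    ("v2", "view_page"), ("v2", "survey_click"),
    ("v3", "view_page"),
    ("v4", "view_page"), ("v4", "survey_click"), ("v4", "survey_submit"),
]


def funnel(events, steps):
    """Count distinct visitors who triggered each step and all steps before it.

    Simplification: event order within a visit is not checked.
    """
    seen = defaultdict(set)  # event name -> set of visitor ids
    for visitor, event in events:
        seen[event].add(visitor)

    remaining = None
    result = []
    for step in steps:
        # Keep only visitors who also reached every earlier step.
        remaining = set(seen[step]) if remaining is None else remaining & seen[step]
        result.append((step, len(remaining)))
    return result


report = funnel(EVENTS, ["view_page", "survey_click", "survey_submit"])
for step, visitors in report:
    print(f"{step}: {visitors} visitors")
```

With this toy data, 4 visitors view a page, 3 click the survey link, and 2 submit the survey.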
    10. Omniture: SiteCatalyst [2]
    12. Server side tools
        Purpose:
          • Report basic information about the traffic on your Web site
          • Similar to the client-side tools, but currently more focused on reliability and application improvements
        Pros:
          • Most importantly, zero bandwidth overhead for your app (every ms counts!)
          • Show a lot of developer-specific information (errors, visitor browsers/OS, etc.)
          • Very easy to install
        Cons:
          • Usually open source, but hard to extend with your own metrics
    13. FREE Server side tools
        Similar statistics as with the client-side tools, but…
          • Less business-specific information (do not include Visitor Loyalty, Trends, etc.)
          • More developer-specific data (errors & error types, HTTP status codes, etc.)
        Good for medium and large-scale sites
        http://awstats.sourceforge.net/
        http://www.stedee.id.au/awffull/
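To give a feel for the developer-specific data these tools report, here is a minimal sketch that counts HTTP status codes in a Web server access log. It is not how AWStats or AWFFull work internally. It assumes the log uses the Common Log Format, and the sample lines (with documentation-range IP addresses) are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Common Log Format (not from the talk).
LINES = [
    '192.0.2.1 - - [28/Sep/2010:06:49:42 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [28/Sep/2010:06:49:43 +0000] "GET /missing.html HTTP/1.1" 404 312',
    '192.0.2.1 - - [28/Sep/2010:06:49:44 +0000] "POST /search HTTP/1.1" 500 128',
    '192.0.2.3 - - [28/Sep/2010:06:49:45 +0000] "GET /index.html HTTP/1.1" 200 5120',
]

# Fields: host ident user [time] "request" status size
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$')


def status_counts(lines):
    """Return a Counter mapping HTTP status code (as a string) to hit count."""
    counts = Counter()
    for line in lines:
        match = CLF.match(line)
        if match:  # silently skip lines that do not parse
            counts[match.group(4)] += 1
    return counts


counts = status_counts(LINES)
# Client (4xx) and server (5xx) errors together.
errors = sum(n for status, n in counts.items() if status[0] in "45")
print(dict(counts), "errors:", errors)
```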
    14. Server side tools: AWStats
    15. Server side tools: Webalizer / AWFFull
    16. Paid Server side tools
        Overcome most limitations of the free tools
          • Log everything into text files (see next section)
          • Provide some sort of SQL-like query language which helps you define any type of query you want
          • Run reports much faster
        The most expensive of them all, meant for professional use
        http://www.splunk.com/
    18. How is this done in the heavyweight category? ;-)
          • Multiple log files, one per tracked functionality
          • As simple as possible (see next slide for an example)
          • The main guideline is to be able to parse any log file and generate statistics using only the command line
          • Example: tab-separated fields
    19. Sample log
        Date & Time      IP (hashed)    User ID (hashed)   Query          Parameters
        Sep 28 06:49:42  Ea9hjnc4ufTfU  anonymous          spell checker  :0:10:en_US:en_US:0:0
        Sep 28 06:49:42  8NCTsHqR366    anonymous          javascript     :0:10:fr_FR:fr_FR:0:1
        Sep 28 06:49:42  K4nD5xy/R5fw   anonymous          text           :0:10:en_US:en_US:0:1
        Sep 28 06:49:43  lRqBaIaUWxna   yxDkhBEqC6xxR8z=   module         :0:10:en_US:en_US:0:0
        Sep 28 06:49:44  jMjJpy6bHAdb   hPFLKaMNeShD0=     delete spread  :0:10:en_US:en_US:0:0
        Sep 28 06:49:44  r3xgRLagX1cQ6  anonymous          _x             :0:10:ru_RU:ru_RU:0:0
        Sep 28 06:49:45  b2DLBl3VTT67Q  anonymous          anti a         :0:10:de_DE:de_DE:0:0
        Sep 28 06:49:45  KaKiB2ITEdPeM  VcLic9CIy4QxVtJQ=  create a star  :0:10:en_US:en_US:0:0
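A log this simple can be summarized with a few lines of script. The sketch below reads three rows of the sample log as tab-separated text and counts requests per locale and per user type. The talk does not document what the colon-separated Parameters mean, so treating the fourth colon-separated field as a locale is an assumption, and the `summarize` helper is hypothetical.

```python
import csv
import io
from collections import Counter

# Three rows from the sample log, written out as tab-separated text.
SAMPLE_LOG = (
    "Sep 28 06:49:42\tEa9hjnc4ufTfU\tanonymous\tspell checker\t:0:10:en_US:en_US:0:0\n"
    "Sep 28 06:49:42\t8NCTsHqR366\tanonymous\tjavascript\t:0:10:fr_FR:fr_FR:0:1\n"
    "Sep 28 06:49:43\tlRqBaIaUWxna\tyxDkhBEqC6xxR8z=\tmodule\t:0:10:en_US:en_US:0:0\n"
)


def summarize(log_text):
    """Count log entries per locale and per user type (anonymous vs. logged in)."""
    locales = Counter()
    users = Counter()
    reader = csv.reader(io.StringIO(log_text), delimiter="\t")
    for timestamp, ip_hash, user_id, query, params in reader:
        fields = params.split(":")
        # ASSUMPTION: the fourth colon-separated field is a locale.
        locales[fields[3]] += 1
        users["anonymous" if user_id == "anonymous" else "logged_in"] += 1
    return locales, users


locales, users = summarize(SAMPLE_LOG)
print(locales.most_common())
print(dict(users))
```

Keeping one record per line with a fixed separator is what makes this kind of ad hoc analysis cheap, whether it is done in a script or directly on the command line.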
    20. What can be done using this data
        You can basically measure everything ;-)
        Plus you can enable loads of new features:
          • Personalization for search, sold/promoted products, etc.
          • Browsing recommendations
          • Improve site organization (make popular pages more accessible, promote some other pages and track their traffic increase, etc.)
          • Search suggestions
          • Advertising (keyword selection, etc.)
    21. Personalized search and promotions
        Show different results/ads to different users
    22. Browsing recommendations
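The slide does not say how the recommendations are computed. One simple way to derive them from usage logs is to count which pages are visited together in the same session ("visitors who viewed this page also viewed..."). The sketch below illustrates that idea only. The session data, page paths, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sessions: the pages each visitor viewed in one visit.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/products", "/cart"],
    ["/home", "/blog"],
]


def build_covisits(sessions):
    """Map each page to a Counter of pages seen in the same session."""
    covisits = defaultdict(Counter)
    for pages in sessions:
        # set() so a page viewed twice in one session counts only once.
        for a, b in permutations(set(pages), 2):
            covisits[a][b] += 1
    return covisits


def recommend(covisits, page, k=2):
    """Return up to k pages most often co-visited with `page`."""
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(covisits[page].items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:k]]


covisits = build_covisits(SESSIONS)
print(recommend(covisits, "/cart"))
print(recommend(covisits, "/products"))
```

Raw co-visit counts favor pages that are popular everywhere, so a real system would normally normalize them. That step is left out here to keep the sketch short.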
    23. Search suggestions
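Search suggestions can be bootstrapped directly from the Query column of a log like the sample above. A minimal sketch, assuming the simplest possible approach: rank previously seen queries that start with the typed prefix by how often they occurred. The query list and the `suggest` function are illustrative, not taken from a production system.

```python
from collections import Counter

# Hypothetical queries pulled from a query log (repeats indicate popularity).
QUERIES = [
    "create a star", "create a table", "create a star", "crop image",
    "delete spread", "create a star", "crop image",
]


def suggest(queries, prefix, k=3):
    """Return up to k logged queries starting with `prefix`, most frequent first."""
    counts = Counter(q.strip().lower() for q in queries)
    wanted = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(wanted)]
    # Sort by descending frequency, then alphabetically for stable ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]


print(suggest(QUERIES, "cr"))
print(suggest(QUERIES, "de"))
```

A linear scan is fine for a sketch. At real query volumes a prefix index such as a trie would be the usual choice.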
