web-mining project



  1. 1. Adaptive Web Sites, Devesh Sinha
  2. 2. Introduction and Background <ul><ul><li>Organizations use the World Wide Web to conduct business. </li></ul></ul><ul><ul><li>They generate and collect large volumes of data in daily operations. </li></ul></ul><ul><ul><ul><li>Case: www.amazon.com </li></ul></ul></ul><ul><ul><ul><li>Analysis: the solution is not the same as for a grocery store </li></ul></ul></ul><ul><ul><li>Case: mati.eas.asu.edu </li></ul></ul><ul><ul><ul><li>Objective of this website </li></ul></ul></ul><ul><ul><ul><li>No sales, so no buying information </li></ul></ul></ul>
  3. 3. Mati: Architecture (figure; labels: WWW, User, NAT, servers)
  4. 4. Request Profiling <ul><li>Definition </li></ul><ul><ul><li>the process by which information is gathered, organized and interpreted to create a summary or description of the user </li></ul></ul><ul><li>Approaches </li></ul><ul><ul><li>web server log </li></ul></ul><ul><ul><li>asking the user (registration & feedback) </li></ul></ul><ul><ul><li>pre-established profiles </li></ul></ul>
  5. 6. Log file types <ul><li>Access log </li></ul><ul><li>Referrer log </li></ul><ul><li>Agent log </li></ul><ul><li>Error log </li></ul>
  6. 7. <ul><li>Data Sources: </li></ul><ul><li>Server-level collection: the server stores data about the requests made by clients, so the data generally covers just one source. </li></ul><ul><li>Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. </li></ul><ul><li>Proxy-level collection: information is stored at the proxy, so the data covers several websites, but only for users whose Web clients pass through the proxy. </li></ul>
  7. 9. Web Server Access Logs <ul><li>Typical Data in a Server Access Log </li></ul>
      looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
      mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
      mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
      mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat&dd=C&ft=1 HTTP
      mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
      mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
      looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
      ...
      <ul><li>Access Log Format </li></ul><ul><li>IP address, userid, time, method, URL, protocol, status, size </li></ul><ul><li>mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html </li></ul><ul><li>Other Server Logs: referrer logs, agent logs </li></ul>
  8. 10. Request Profiling <ul><li>Web Server Log </li></ul><ul><ul><li>client IP address or hostname </li></ul></ul><ul><ul><li>user id (“-” if anonymous) </li></ul></ul><ul><ul><li>access time </li></ul></ul><ul><ul><li>HTTP request method (e.g. GET, POST, HEAD, ...) </li></ul></ul><ul><ul><li>path of the resource on the Web server (URL) </li></ul></ul><ul><ul><li>the protocol (e.g. HTTP/1.0, HTTP/1.1, ...) </li></ul></ul><ul><ul><li>the status code (e.g. 404 for Not Found) </li></ul></ul><ul><ul><li>the number of bytes transmitted </li></ul></ul>
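The fields listed above can be pulled out of an access-log entry with a regular expression. The following is a minimal sketch, assuming entries shaped like the samples on the "Web Server Access Logs" slide; `LOG_PATTERN`, `parse_log_line`, and the field name `extra` are illustrative names, not part of the original project.

```python
import re
from datetime import datetime

# One access-log entry: host, user id, a third identity field, timestamp,
# request line, status code, and an optional byte count.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) (?P<extra>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+|-))?'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 09/Aug/1996:09:53:52 -0500.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    size = entry["size"]
    # The byte count may be missing or "-" when no body was sent.
    entry["size"] = int(size) if size and size.isdigit() else None
    return entry

sample = 'mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291'
entry = parse_log_line(sample)
```

Lines that do not fit the pattern, such as truncated entries, come back as `None` so the caller can decide whether to drop or log them.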
  9. 12. Figure 4. web usage mining research projects and products
  10. 13. Approaches: <ul><li>Concept 1: Prepared Log + Statistical </li></ul><ul><li>Concept 2: Prepared Log + Mining </li></ul>
  11. 14. Preprocessing: <ul><li>Integrate logs </li></ul><ul><li>Logs are only meant for post-mortem analysis </li></ul><ul><li>Clean logs – eliminate outliers </li></ul>
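The slide does not say which entries count as outliers. A common cleaning step in web usage mining is to drop requests for embedded resources and failed requests, and the sketch below assumes that interpretation; `IGNORED_SUFFIXES` and `clean_entries` are hypothetical names, and entries are assumed to be dicts with `url` and `status` keys.

```python
# Assumption: embedded resources (images, styles, scripts) are not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")

def clean_entries(entries):
    """Keep only successful or redirected requests for actual pages."""
    kept = []
    for entry in entries:
        # Ignore the query string when checking the file type.
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(IGNORED_SUFFIXES):
            continue
        # Drop client and server errors (4xx and 5xx).
        if not 200 <= entry["status"] < 400:
            continue
        kept.append(entry)
    return kept

raw = [
    {"url": "/", "status": 200},
    {"url": "/images/backgnds/paper.gif", "status": 200},
    {"url": "advisor", "status": 302},
    {"url": "/missing.html", "status": 404},
]
cleaned = clean_entries(raw)
```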
  12. 15. Typical Web Usage Mining Preprocessing
  13. 16. Transaction Identification <ul><li>Main Questions: </li></ul><ul><ul><li>how to identify unique users </li></ul></ul><ul><ul><li>how to identify/define a user transaction </li></ul></ul><ul><li>Problems: </li></ul><ul><ul><li>user ids are often suppressed due to security concerns </li></ul></ul><ul><ul><li>individual IP addresses are sometimes hidden behind proxy servers </li></ul></ul><ul><ul><li>client-side & proxy caching makes server log data less reliable </li></ul></ul><ul><li>Standard Solutions/Practices: </li></ul><ul><ul><li>user registration – is it practical? </li></ul></ul><ul><ul><li>client-side cookies – not foolproof </li></ul></ul><ul><ul><li>cache busting – increases network traffic </li></ul></ul>
  14. 17. A Heuristic Approach <ul><li>Identifying User Sessions </li></ul><ul><ul><li>use IP, agent, and OS fields as key attributes; </li></ul></ul><ul><ul><li>use client-side cookies & unique user ids, if available; </li></ul></ul><ul><ul><li>use session time-outs; </li></ul></ul><ul><ul><li>use synchronized referrer log entries and time stamps to expand user paths belonging to a session; </li></ul></ul><ul><ul><li>use path completion to infer cached references: </li></ul></ul><ul><ul><li>Example: expanding a session A ==> B ==> C by an access pair </li></ul></ul><ul><ul><li>(B ==> D) results in: A ==> B ==> C ==> B ==> D </li></ul></ul><ul><ul><li>to disambiguate paths, sessions are expanded based on page attributes (size, type), reference length, and the number of back references required to complete the path. </li></ul></ul>
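Two of the heuristics above, session time-outs and path completion, can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the 30-minute time-out, the `(ip, agent)` key, and the function names `sessionize` and `complete_path` are all chosen for the example.

```python
from collections import defaultdict

def sessionize(requests, timeout=1800):
    """Group (ip, agent, timestamp_seconds, url) tuples into sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same (ip, agent) pair exceeds the time-out.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in requests:
        by_user[(ip, agent)].append((ts, url))
    sessions = []
    for key, hits in by_user.items():
        hits = sorted(hits)
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((key, current))
                current = []
            current.append(url)
        sessions.append((key, current))
    return sessions

def complete_path(session, referrer, page):
    """Extend a session with page, inferring cached back-button steps."""
    path = list(session)
    if path and path[-1] != referrer and referrer in path:
        # Index of the most recent visit to the referrer.
        last = len(path) - 1 - path[::-1].index(referrer)
        # Replay the backward steps from the current page to the referrer.
        path.extend(reversed(path[last:-1]))
    path.append(page)
    return path

expanded = complete_path(["A", "B", "C"], "B", "D")
```

With the slide's example, `expanded` is the path A, B, C, B, D: the return to B was served from the cache, so it never reached the server log and has to be inferred.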
  15. 18. Example: Session Inference with Referrer Log
      No.  IP           Time      URL  Referrer  Agent
      1    www.aol.com  08:30:00  A    #         Mozilla/2.0; AIX 4.1.4
      2    www.aol.com  08:30:01  B    E         Mozilla/2.0; AIX 4.1.4
      3    www.aol.com  08:30:02  C    B         Mozilla/2.0; AIX 4.1.4
      4    www.aol.com  08:30:01  B    #         Mozilla/2.0; Win 95
      5    www.aol.com  08:30:03  C    B         Mozilla/2.0; Win 95
      6    www.aol.com  08:30:04  F    #         Mozilla/2.0; Win 95
      7    www.aol.com  08:30:04  B    A         Mozilla/2.0; AIX 4.1.4
      8    www.aol.com  08:30:05  G    B         Mozilla/2.0; AIX 4.1.4
      Identified Sessions:
      S1: # ==> A ==> B ==> G (from references 1, 7, 8)
      S2: E ==> B ==> C (from references 2, 3)
      S3: # ==> B ==> C (from references 4, 5)
      S4: # ==> F (from reference 6)
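One way to arrive at the four sessions in this example is to attach each request to an existing session of the same host and agent whose last page equals the request's referrer, and to start a new session otherwise. The sketch below assumes that rule and treats `#` as "no referrer"; `infer_sessions` and the tuple layout are illustrative, and the slide does not state the exact procedure used.

```python
def infer_sessions(rows):
    """rows: (ref_no, host, time, url, referrer, agent) tuples."""
    sessions = []
    # Process requests in time order; the sort is stable for equal times.
    for ref_no, host, time, url, referrer, agent in sorted(rows, key=lambda r: r[2]):
        key = (host, agent)
        target = None
        if referrer != "#":
            # Look for the most recently started session of this client
            # that currently ends at the referrer page.
            for session in reversed(sessions):
                if session["key"] == key and session["path"][-1] == referrer:
                    target = session
                    break
        if target is None:
            target = {"key": key, "path": [referrer], "refs": []}
            sessions.append(target)
        target["path"].append(url)
        target["refs"].append(ref_no)
    return [(s["path"], s["refs"]) for s in sessions]

AIX = "Mozilla/2.0; AIX 4.1.4"
WIN = "Mozilla/2.0; Win 95"
rows = [
    (1, "www.aol.com", "08:30:00", "A", "#", AIX),
    (2, "www.aol.com", "08:30:01", "B", "E", AIX),
    (3, "www.aol.com", "08:30:02", "C", "B", AIX),
    (4, "www.aol.com", "08:30:01", "B", "#", WIN),
    (5, "www.aol.com", "08:30:03", "C", "B", WIN),
    (6, "www.aol.com", "08:30:04", "F", "#", WIN),
    (7, "www.aol.com", "08:30:04", "B", "A", AIX),
    (8, "www.aol.com", "08:30:05", "G", "B", AIX),
]
result = infer_sessions(rows)
```

On these eight rows the sketch yields the same paths and reference numbers as S1 to S4. Real logs would need tie-breaking when several sessions end on the same page, which is where the page attributes mentioned on the previous slide come in.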
  16. 19. Example (figure: site graph with pages A through T) <ul><li>USER1: A B F O G A D </li></ul><ul><li>USER2: A B C J </li></ul><ul><li>USER3: L R </li></ul>
  17. 20. Concept 1: Binary exponential backoff <ul><li>Binary exponential backoff algorithm: </li></ul><ul><ul><li>after the 1st collision, wait 0 or 1 slots, chosen at random. </li></ul></ul><ul><ul><li>after the 2nd collision, wait 0, 1, 2, or 3 slots, chosen at random. </li></ul></ul><ul><ul><li>and so on, doubling the range each time, up to a maximum of 1023 slots. </li></ul></ul><ul><ul><li>after 16 collisions, give up and raise an exception. </li></ul></ul>(Figure labels: frame, contention interval, contention slot, idle)
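The wait-time rule above can be written as a short function. This is a minimal sketch of the standard algorithm as described on the slide; the names `backoff_slots`, `MAX_EXPONENT`, and `MAX_ATTEMPTS` are chosen for the example.

```python
import random

MAX_EXPONENT = 10   # caps the wait range at 0..1023 slots
MAX_ATTEMPTS = 16   # give up once this many collisions have occurred

def backoff_slots(collisions, rng=random):
    """Return how many slots to wait after the given number of collisions."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions, giving up")
    # After the i-th collision, pick uniformly from 0 .. 2**i - 1,
    # with the exponent frozen at 10 so the range never exceeds 1023.
    exponent = min(collisions, MAX_EXPONENT)
    return rng.randrange(2 ** exponent)

waits = [backoff_slots(n) for n in range(1, 6)]
```

Passing a seeded `random.Random` instance as `rng` makes the waits reproducible for testing.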
  18. 21. Concept 1: <ul><li>Similarly: </li></ul><ul><ul><li>From Current Logs: </li></ul></ul><ul><ul><ul><li>Rank accessed pages </li></ul></ul></ul><ul><ul><ul><li>Use Binary Backoff to change the ranks </li></ul></ul></ul>
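The first step, ranking accessed pages, can be sketched with simple access counts. The slides do not specify how binary backoff then adjusts the ranks, so that part is deliberately left out; `rank_pages` is a hypothetical helper, and the paths are the three user paths from the earlier example slide.

```python
from collections import Counter

def rank_pages(paths):
    """Rank pages by access count, breaking ties alphabetically."""
    counts = Counter(page for path in paths for page in path)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

paths = [
    ["A", "B", "F", "O", "G", "A", "D"],  # USER1
    ["A", "B", "C", "J"],                  # USER2
    ["L", "R"],                            # USER3
]
ranking = rank_pages(paths)
```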
  19. 22. Concept 2: <ul><li>Use the NAT as level 1 filtering </li></ul><ul><li>Filter the traffic according to the request pattern </li></ul><ul><ul><li>Users can reach the same page, but with the option to choose further </li></ul></ul><ul><ul><li>Rule-Based Prediction </li></ul></ul>
  20. 23. Rule Induction <ul><li>Rule Induction (rule-based prediction) </li></ul><ul><ul><li>We first generate a set of rules from a data warehouse, </li></ul></ul><ul><ul><li>then use them to predict values for new data items. </li></ul></ul><ul><ul><li>It works much better on larger (and real) data sets, not just on samples of data. </li></ul></ul><ul><li>Two phases: </li></ul><ul><ul><li>Rule discovery: analyze a historical database and generate a set of rules by automatic discovery. </li></ul></ul><ul><ul><li>Prediction: apply the rules to a new data set and match the rules to make predictions. </li></ul></ul>
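The two phases can be illustrated with a very small rule learner. The deck does not name the induction algorithm it used, so this sketch substitutes simple "page X is followed by page Y" rules filtered by support and confidence; `discover_rules`, `predict_next`, and the thresholds are assumptions made for the example.

```python
from collections import Counter

def discover_rules(sessions, min_support=2, min_confidence=0.5):
    """Phase 1: learn 'current page -> next page' rules from past sessions."""
    pair_counts = Counter()
    page_counts = Counter()
    for path in sessions:
        for current, nxt in zip(path, path[1:]):
            pair_counts[(current, nxt)] += 1
            page_counts[current] += 1
    rules = {}
    for (current, nxt), count in pair_counts.items():
        # Confidence: share of transitions out of `current` that go to `nxt`.
        confidence = count / page_counts[current]
        if count >= min_support and confidence >= min_confidence:
            best = rules.get(current)
            if best is None or confidence > best[1]:
                rules[current] = (nxt, confidence)
    return rules

def predict_next(rules, page):
    """Phase 2: apply the rules to a new request."""
    rule = rules.get(page)
    return rule[0] if rule else None

history = [["A", "B", "C"], ["A", "B", "D"], ["A", "C"]]
rules = discover_rules(history)
prediction = predict_next(rules, "A")
```

Here only the transition from A to B is seen often enough to become a rule, so a visitor on page A is predicted to go to B, and no prediction is made for other pages.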
  21. 24. Rule Induction Example Training Set
  22. 25. Results: <ul><li>Statistical approach performance: </li></ul><ul><ul><li>Slow to adapt to changes </li></ul></ul><ul><ul><li>Good performance with general access patterns </li></ul></ul><ul><li>NAT rule-based performance: </li></ul><ul><ul><li>High accuracy so far </li></ul></ul><ul><ul><li>Future work required: </li></ul></ul><ul><ul><ul><li>Multi-Level Association </li></ul></ul></ul><ul><ul><ul><li>Better Feature Selection </li></ul></ul></ul><ul><ul><ul><li>Scalable Distributed Tool </li></ul></ul></ul>
  23. 26. Project: Comments <ul><li>Problems faced: </li></ul><ul><ul><li>Data Cleaning </li></ul></ul><ul><ul><li>Learning Curve </li></ul></ul><ul><li>Future Applications : </li></ul><ul><ul><li>Network processors </li></ul></ul><ul><ul><li>Intelligent Parking slots </li></ul></ul>
  24. 27. References <ul><li>www.powerfulforces.org.nz/Papers/Lim.pdf </li></ul><ul><li>Towards Adaptive Web Sites: Conceptual Framework & Case Study </li></ul><ul><ul><li>Mike Perkowitz, Oren Etzioni </li></ul></ul><ul><li>Web Usage Mining: Discovery and Applications of Usage Patterns </li></ul><ul><ul><li>Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan </li></ul></ul><ul><li>WUM: A Web Utilization Miner (URL: http://wum.wiwi.hu-berlin.de/index.html) </li></ul><ul><li>WEKA: Machine Learning Algorithms in Java </li></ul><ul><li>Improving the Effectiveness of a Web Site with Web Usage Mining </li></ul><ul><ul><li>Myra Spiliopoulou, Carsten Pohle, Lukas C. Faulstich </li></ul></ul>