  1. 1. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data Srivastava J., Cooley R., Deshpande M., Tan P.N. Appeared in SIGKDD Explorations, Vol. 1, Issue 2, 2000
  2. 2. Web Mining <ul><li>What is it? </li></ul><ul><ul><li>Data mining efforts associated with the Web </li></ul></ul><ul><li>What kinds are there? </li></ul><ul><ul><li>Content Mining </li></ul></ul><ul><ul><li>Structure Mining </li></ul></ul><ul><ul><li>Usage Mining </li></ul></ul>
  3. 3. Web Data <ul><li>Content </li></ul><ul><ul><li>Ex) texts and graphics </li></ul></ul><ul><li>Structure </li></ul><ul><ul><li>Ex) HTML tags </li></ul></ul><ul><li>Usage </li></ul><ul><ul><li>Ex) IP address, page reference, date/time </li></ul></ul><ul><li>User profile </li></ul><ul><ul><li>Ex) registration data, customer profile </li></ul></ul>
  4. 4. Web Usage Mining <ul><li>The application of data mining techniques to discover usage patterns from Web data. </li></ul><ul><li>Three phases </li></ul><ul><ul><li>Preprocessing </li></ul></ul><ul><ul><li>Pattern discovery </li></ul></ul><ul><ul><li>Pattern analysis </li></ul></ul>
  5. 5. Data Sources <ul><li>Where can usage data be collected? </li></ul><ul><li>Server Level Collection </li></ul><ul><ul><li>The Web server log records the browsing behavior of site visitors, but page views served from a cache are not recorded. </li></ul></ul><ul><ul><li>Packet sniffing extracts usage data directly from TCP/IP packets. </li></ul></ul>
  6. 6. Data Sources (contd.) <ul><li><Sample Web Server Log> </li></ul><ul><li># IP Address Userid Time Method/ URL/ Protocol Status Size Referrer Agent </li></ul><ul><li>1 123.456.78.9 - [25/Apr/1998:03:04:41 -0500] &quot;GET A.html HTTP/1.0&quot; 200 3290 - Mozilla/3.04 (Win95, I) </li></ul><ul><li>2 123.456.78.9 - [25/Apr/1998:03:05:34 -0500] &quot;GET B.html HTTP/1.0&quot; 200 2050 A.html Mozilla/3.04 (Win95, I) </li></ul><ul><li>3 123.456.78.9 - [25/Apr/1998:03:05:39 -0500] &quot;GET L.html HTTP/1.0&quot; 200 4130 - Mozilla/3.04 (Win95, I) </li></ul><ul><li>4 123.456.78.9 - [25/Apr/1998:03:06:02 -0500] &quot;GET F.html HTTP/1.0&quot; 200 5096 B.html Mozilla/3.04 (Win95, I) </li></ul><ul><li>5 123.456.78.9 - [25/Apr/1998:03:06:58 -0500] &quot;GET A.html HTTP/1.0&quot; 200 3290 - Mozilla/3.01 (X11, I, IRIX6.2, IP22) </li></ul><ul><li>6 123.456.78.9 - [25/Apr/1998:03:07:42 -0500] &quot;GET B.html HTTP/1.0&quot; 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22) </li></ul><ul><li>7 123.456.78.9 - [25/Apr/1998:03:07:55 -0500] &quot;GET R.html HTTP/1.0&quot; 200 8140 L.html Mozilla/3.04 (Win95, I) </li></ul><ul><li>8 123.456.78.9 - [25/Apr/1998:03:09:50 -0500] &quot;GET C.html HTTP/1.0&quot; 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22) </li></ul><ul><li>9 123.456.78.9 - [25/Apr/1998:03:10:02 -0500] &quot;GET O.html HTTP/1.0&quot; 200 2270 F.html Mozilla/3.04 (Win95, I) </li></ul><ul><li>10 123.456.78.9 - [25/Apr/1998:03:10:45 -0500] &quot;GET J.html HTTP/1.0&quot; 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22) </li></ul><ul><li>11 123.456.78.9 - [25/Apr/1998:03:12:23 -0500] &quot;GET G.html HTTP/1.0&quot; 200 7220 B.html Mozilla/3.04 (Win95, I) </li></ul><ul><li>12 209.456.78.2 - [25/Apr/1998:05:05:22 -0500] &quot;GET A.html HTTP/1.0&quot; 200 3290 - Mozilla/3.04 (Win95, I) </li></ul><ul><li>13 209.456.78.3 - [25/Apr/1998:05:06:03 -0500] &quot;GET D.html HTTP/1.0&quot; 200 1680 A.html Mozilla/3.04 (Win95, I) </li></ul>
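
A minimal sketch of how one entry of the sample log above could be split into its fields. The regular expression and the names `LOG_PATTERN` and `parse_log_line` are assumptions made for illustration; they follow the column layout shown on the slide (unquoted referrer, agent as the rest of the line), which differs from the log formats real servers emit by default.

```python
import re
from datetime import datetime

# Pattern matching the slide's column order:
# IP, userid, [time], "method url protocol", status, size, referrer, agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<referrer>\S+) (?P<agent>.+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the typed fields; everything else stays a string.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    return entry

# Entry 2 of the sample log, with the HTML entities decoded to plain quotes.
sample = ('123.456.78.9 - [25/Apr/1998:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)')
entry = parse_log_line(sample)
print(entry["url"], entry["referrer"], entry["agent"])
```

A referrer of `-` means the request carried no referring page, which matters later when sessions are reconstructed.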
  7. 7. Data Sources (contd.) <ul><li>Client Level Collection </li></ul><ul><ul><li>By using remote agents </li></ul></ul><ul><ul><li>e.g. Java applets (loading overhead), JavaScript (cannot capture all user clicks) </li></ul></ul><ul><ul><li>By modifying the source code of an existing browser </li></ul></ul><ul><ul><li>e.g. Mosaic (hard to convince users to adopt the modified browser) </li></ul></ul>
  8. 8. Data Sources (contd.) <ul><li>Proxy Level Collections </li></ul><ul><ul><li>Intermediate level of caching between web server and client browser. </li></ul></ul><ul><ul><li>Characterize the browsing behavior of a group of users sharing a common proxy server. </li></ul></ul>
  9. 9. Data Abstractions <ul><li>User : a single individual who accesses files from one or more Web servers through a browser </li></ul><ul><li>Page View : every file that contributes to the display on a user’s browser at one time </li></ul><ul><li>Click Stream : a sequential series of page view requests </li></ul><ul><li>User Session : the click stream of page views for a single user across the entire Web </li></ul><ul><li>Server Session : the set of page views in a user session for a particular Web site </li></ul><ul><li>Episode : any semantically meaningful subset of a user or server session </li></ul>
  10. 10. Web Usage Mining Process
  11. 11. Preprocessing <ul><li>Usage Preprocessing </li></ul><ul><li>The most difficult task, because the available data (IP address, agent, server-side click stream) is incomplete </li></ul><ul><ul><li>Single IP address/Multiple Server Sessions </li></ul></ul><ul><ul><li>Multiple IP addresses/Single Server Session </li></ul></ul><ul><ul><li>Multiple IP addresses/Single User </li></ul></ul><ul><ul><li>Multiple Agents/Single User </li></ul></ul>
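
One common way to approach the single-IP/multiple-sessions problem can be sketched as follows: treat each (IP address, agent) pair as a separate visitor and start a new session after a period of inactivity. This is only a heuristic under stated assumptions, not the method prescribed by the slides; the function name `sessionize` and the 30-minute default timeout are illustrative choices. The requests are entries 1, 2, 5 and 6 of the sample server log.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (ip, agent, timestamp, url) tuples into lists of URLs.

    A new session starts when an (ip, agent) pair is seen for the first
    time or has been idle for longer than `timeout`.
    """
    sessions = []
    open_session = {}  # (ip, agent) -> index of its current session
    last_seen = {}     # (ip, agent) -> timestamp of its latest request
    for ip, agent, ts, url in sorted(requests, key=lambda r: r[2]):
        key = (ip, agent)
        if key not in open_session or ts - last_seen[key] > timeout:
            sessions.append([])
            open_session[key] = len(sessions) - 1
        sessions[open_session[key]].append(url)
        last_seen[key] = ts
    return sessions

WIN95 = "Mozilla/3.04 (Win95, I)"
IRIX = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
requests = [
    ("123.456.78.9", WIN95, datetime(1998, 4, 25, 3, 4, 41), "A.html"),
    ("123.456.78.9", WIN95, datetime(1998, 4, 25, 3, 5, 34), "B.html"),
    ("123.456.78.9", IRIX, datetime(1998, 4, 25, 3, 6, 58), "A.html"),
    ("123.456.78.9", IRIX, datetime(1998, 4, 25, 3, 7, 42), "B.html"),
]
sessions = sessionize(requests)
print(sessions)
```

The same IP address yields two sessions here because two different agents appear. The heuristic still fails in the other listed cases, for example when one user switches IP addresses or browsers mid-visit.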
  12. 12. Preprocessing (contd.) <ul><li>Content Preprocessing </li></ul><ul><ul><li>Converting text, images, and scripts into useful forms (e.g. vectors of words) </li></ul></ul><ul><ul><li>Classification/clustering algorithms can be used to filter discovered patterns based on topic or intended use </li></ul></ul><ul><li>Structure Preprocessing </li></ul><ul><ul><li>Hyperlinks between page views </li></ul></ul>
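
As an illustration of the "vectors of words" form, the sketch below turns page text into word-count vectors over a shared vocabulary. The page texts and the names `tokenize` and `word_vector` are hypothetical; a real pipeline would typically also remove stop words and weight the terms.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def word_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical page contents, keyed by page name.
pages = {
    "A.html": "Music CDs and music videos",
    "B.html": "Sports equipment and sports news",
}
vocabulary = sorted({w for text in pages.values() for w in tokenize(text)})
vectors = {name: word_vector(text, vocabulary) for name, text in pages.items()}
print(vocabulary)
print(vectors)
```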
  13. 13. Pattern Discovery <ul><li>Statistical Analysis </li></ul><ul><ul><li>Page views, viewing time, length of navigational path </li></ul></ul><ul><li>Association Rules </li></ul><ul><ul><li>Apriori algorithm: finds pages that are often referenced together in a server session, revealing correlations between groups of users </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Usage clustering: groups users with similar browsing patterns, e.g. to infer user demographics </li></ul></ul><ul><ul><li>Page clustering: groups pages having related content </li></ul></ul>
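
The association-rule idea can be sketched by counting which pairs of pages appear in the same session and reporting support and confidence for each pair. This is a simplified pairwise count and not a full Apriori implementation, which would grow candidate itemsets level by level; the sessions, the function name `pair_rules`, and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for page pairs.

    support    = fraction of sessions containing both pages
    confidence = sessions with both pages / sessions with the antecedent
    """
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # ignore repeat visits within a session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        rules.append((a, b, support, count / page_counts[a]))
        rules.append((b, a, support, count / page_counts[b]))
    return rules

# Hypothetical server sessions, each a list of visited pages.
sessions = [["A", "B", "F"], ["A", "B", "C"], ["A", "D"], ["L", "R"]]
for antecedent, consequent, support, confidence in pair_rules(sessions):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Here only the pair A and B reaches the threshold. The rule B -> A has higher confidence than A -> B because every session containing B also contains A, while A also appears without B.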
  14. 14. Pattern Discovery (contd.) <ul><li>Classification </li></ul><ul><ul><li>e.g. 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast. </li></ul></ul><ul><li>Sequential Patterns </li></ul><ul><ul><li>Found in time-ordered sets of sessions; used to predict future visit patterns, e.g. to decide where to place advertisements </li></ul></ul>
  15. 15. Pattern Analysis <ul><li>Motivation </li></ul><ul><ul><li>Filter out uninteresting rules / patterns from the set found in the pattern discovery phase. </li></ul></ul>
  16. 16. Application Areas
  17. 17. Examples <ul><li>Personalization </li></ul><ul><ul><li>http://aztec.cs.depaul.edu/scripts/ACR2/ </li></ul></ul><ul><li>Business </li></ul><ul><ul><li> </li></ul></ul>