Web Usage Mining Chris Yang



  1. 1. Web Usage Mining Chris Yang
  2. 2. Three Phases of Web Usage Mining <ul><li>Discover usage patterns from Web data to understand and better serve the needs of Web-based applications (Srivastava et al., 2000) </li></ul><ul><li>Three phases </li></ul><ul><ul><li>Preprocessing </li></ul></ul><ul><ul><li>Pattern discovery </li></ul></ul><ul><ul><li>Pattern analysis </li></ul></ul>
  3. 4. Motivation of Web Usage Mining <ul><li>Bring vendor and end customer in electronic commerce closer </li></ul><ul><li>Mass customization </li></ul><ul><ul><li>Vendor may personalize his product message for individual customers at a massive scale </li></ul></ul>
  4. 5. Data Sources <ul><li>Server </li></ul><ul><ul><li>Web server log explicitly records the browsing behavior of site visitors and reflects the access of a Web site by multiple users </li></ul></ul><ul><ul><li>Formats </li></ul></ul><ul><ul><ul><li>Common log </li></ul></ul></ul><ul><ul><ul><li>Extended log </li></ul></ul></ul><ul><ul><li>Web log may not be completely reliable </li></ul></ul><ul><ul><ul><li>Caching – pages served from a client or proxy cache are never requested from the server, so they do not appear in the log </li></ul></ul></ul><ul><ul><ul><li>Information passed through the POST method is not available in a server log </li></ul></ul></ul>
  5. 6. HTTP <ul><li>The Web's RPC on top of TCP/IP </li></ul><ul><li>It is stateless: the server keeps no memory of earlier requests, and in HTTP/1.0 a separate connection is made for every request </li></ul><ul><ul><li>Simple to implement, yet incurs overhead </li></ul></ul><ul><li>Each HTTP client/server interaction consists of </li></ul><ul><ul><li>a single request/reply interchange </li></ul></ul><ul><ul><ul><li>HTTP request </li></ul></ul></ul><ul><ul><ul><li>HTTP response </li></ul></ul></ul>
  6. 7. <ul><li>HTTP request message consists of: </li></ul><ul><ul><li>request line </li></ul></ul><ul><ul><ul><li>method or command to apply to a server resource </li></ul></ul></ul><ul><ul><ul><ul><li>e.g. GET, POST </li></ul></ul></ul></ul><ul><ul><ul><li>URL (without protocol and server domain name) </li></ul></ul></ul><ul><ul><ul><li>the protocol version used by the client, e.g. HTTP/1.0 </li></ul></ul></ul><ul><ul><li>request header fields </li></ul></ul><ul><ul><ul><li>Pass additional information about the request and the client itself to the server - much like RPC parameters </li></ul></ul></ul><ul><ul><ul><li>Each header field consists of a name, followed by “:” and the field value </li></ul></ul></ul><ul><ul><li>the entity body (optional) </li></ul></ul><ul><ul><ul><li>Clients use it to pass bulk information to the server (CGI) </li></ul></ul></ul><ul><li>Examples of HTTP methods </li></ul><ul><ul><li>GET - retrieve the specified URL </li></ul></ul><ul><ul><li>POST - send this data to the specified URL </li></ul></ul><ul><li>Examples of HTTP header fields </li></ul><ul><ul><li>Accept - lists acceptable MIME type/subtype contents </li></ul></ul><ul><ul><li>User-Agent - provides client browser information </li></ul></ul><ul><li>Note: CRLF = carriage return/line feed </li></ul>
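The request layout on this slide can be illustrated by assembling a message by hand. This is only a sketch: the function name `build_request`, the resource path, and the browser name are invented for illustration and do not come from the slides.

```python
CRLF = "\r\n"  # carriage return + line feed ends each line of the message


def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.0 request message as a string."""
    request_line = f"{method} {path} HTTP/1.0"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # An empty line separates the header fields from the optional entity body.
    return CRLF.join([request_line, *header_lines]) + CRLF + CRLF + body


message = build_request(
    "GET",
    "/products/index.html",  # hypothetical resource
    {"Accept": "text/html", "User-Agent": "ExampleBrowser/1.0"},  # hypothetical client
)
print(message)
```

The first line carries the method, the URL without protocol or host, and the protocol version; every following line up to the blank line is one `name: value` header field.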
  7. 8. <ul><li>HTTP response message </li></ul><ul><ul><li>response header line </li></ul></ul><ul><ul><ul><li>HTTP version, the status of the response, and an explanation of the returned status </li></ul></ul></ul><ul><ul><li>response header fields </li></ul></ul><ul><ul><ul><li>Information that describes the server's attributes and the returned HTML document to client </li></ul></ul></ul><ul><ul><li>entity body </li></ul></ul><ul><ul><ul><li>Contains an HTML document that a client has requested </li></ul></ul></ul><ul><ul><li>Each HTML document needs a separate request message </li></ul></ul><ul><ul><ul><li>stateless </li></ul></ul></ul><ul><li>The result code 200 indicates that the request is successful. </li></ul>
  8. 9. Data Source - Server <ul><li>Web server log in extended log format </li></ul>
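A minimal sketch of reading one such log entry, assuming the common-log fields followed by quoted referrer and user-agent fields. The sample line, host address, and URLs are invented for illustration; real server configurations may order or name the fields differently.

```python
import re

# Common log fields, then quoted referrer and user agent (assumed layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "ExampleBrowser/1.0"'
)
entry = parse_log_line(sample)
print(entry["host"], entry["uri"], entry["status"], entry["agent"])
```

The referrer and agent fields are what later preprocessing steps rely on to separate users who share an IP address.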
  9. 10. Data Source - Server <ul><li>Packet sniffing </li></ul><ul><ul><li>Monitor network traffic coming to a Web server </li></ul></ul><ul><ul><li>Extract usage data directly from TCP/IP packets </li></ul></ul><ul><li>Cookies </li></ul><ul><ul><li>Tokens generated by the Web server for individual client browsers to automatically track the site visitor </li></ul></ul><ul><ul><li>HTTP protocol is stateless which makes tracking individual users difficult </li></ul></ul><ul><ul><li>Cookies rely on implicit user cooperation </li></ul></ul><ul><li>Query data </li></ul><ul><li>CGI scripts </li></ul><ul><ul><li>URI for CGI programs may contain additional parameter values to be passed to CGI applications </li></ul></ul>
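As a rough illustration of the last point, parameter values carried in a CGI URI can be recovered with Python's standard `urllib.parse` module. The URI and parameter names below are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def query_parameters(uri):
    """Split a request URI into its path and a dict of query parameters."""
    parts = urlsplit(uri)
    return parts.path, parse_qs(parts.query)


# Hypothetical CGI request as it might appear in a server log.
path, params = query_parameters("/cgi-bin/search.cgi?category=music&page=2")
print(path, params)
```

Only parameters sent in the URI (GET) are visible this way; values sent in a POST body do not reach the log.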
  10. 11. Data Source - Client <ul><li>Client </li></ul><ul><ul><li>Remote agent (e.g. JavaScript or Java applets) </li></ul></ul><ul><ul><li>Modifying the source code of an existing browser to enhance data collection capabilities </li></ul></ul><ul><ul><li>Difficulty - Requires client cooperation, either to enable JavaScript and Java applets or to voluntarily use the modified browser </li></ul></ul>
  11. 12. Data Source - Proxy <ul><li>Proxy </li></ul><ul><ul><li>Caching between client browsers and Web servers </li></ul></ul><ul><ul><li>Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers </li></ul></ul><ul><ul><li>They help characterize the browsing behavior of a group of anonymous users sharing a common proxy server </li></ul></ul>
  12. 13. Data Abstractions <ul><li>Data from server, client and proxy help us construct data abstractions </li></ul><ul><ul><li>Users, server sessions, episodes, click-streams, and page views </li></ul></ul><ul><li>W3C Web Characterization Activity (WCA) has drafted definitions of Web terms relevant to Web usage ( http://www.w3.org/WCA ) </li></ul><ul><li>User – a single individual who accesses files from one or more Web servers through a browser </li></ul><ul><ul><li>Users are difficult to identify – a user may access the Web through different machines or use more than one agent on a single machine </li></ul></ul><ul><li>Page view – every file that contributes to the display on a user’s browser at one time </li></ul><ul><ul><li>Includes several files such as frames, graphics, and scripts </li></ul></ul><ul><ul><li>When a user downloads a “Web page” by clicking an anchor text or submitting a URL, he/she is not aware of how many frames, graphics, images, or scripts he/she is receiving </li></ul></ul><ul><li>Click-stream – a sequential series of page view requests </li></ul><ul><ul><li>Server may not have all information to obtain the click-stream </li></ul></ul><ul><ul><li>Page views served from a client or proxy-level cache are not available at the server </li></ul></ul><ul><li>User session – the click-stream of page views for a single user across the entire Web </li></ul><ul><ul><li>In practice, only the portion of a user session that accesses a particular site can be identified. </li></ul></ul><ul><li>Server session – the set of page views in a user session for a particular Web site </li></ul><ul><li>Episode – any semantically meaningful subset of a user or server session </li></ul>
  13. 14. Phase 1 – Preprocessing <ul><li>Usage Preprocessing </li></ul><ul><ul><li>Due to the incompleteness of available data, usage preprocessing is a difficult task </li></ul></ul><ul><ul><li>Typical problems </li></ul></ul><ul><ul><ul><li>Unless client side tracking is used, only IP address, agent, and server-side click stream are available </li></ul></ul></ul><ul><ul><ul><li>Single IP address / Multiple server sessions </li></ul></ul></ul><ul><ul><ul><ul><li>Internet service providers (ISPs) have a pool of proxy servers </li></ul></ul></ul></ul><ul><ul><ul><ul><li>A proxy server may have several users accessing a Web site, potentially over the same time period </li></ul></ul></ul></ul><ul><ul><ul><li>Multiple IP addresses / Single server session </li></ul></ul></ul><ul><ul><ul><ul><li>Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses </li></ul></ul></ul></ul><ul><ul><ul><li>Multiple IP addresses / Single user </li></ul></ul></ul><ul><ul><ul><ul><li>A user accesses the Web from different machines (a different IP address from session to session) </li></ul></ul></ul></ul><ul><ul><ul><li>Multiple agents / Single user </li></ul></ul></ul><ul><ul><ul><ul><li>A user who uses more than one browser appears as multiple users </li></ul></ul></ul></ul>
  14. 15. Usage Preprocessing <ul><li>Segmenting click-stream into sessions </li></ul><ul><ul><li>It is difficult to know when a user leaves a Web site </li></ul></ul><ul><ul><li>A thirty-minute timeout is often used (Catledge and Pitkow, 1995) </li></ul></ul><ul><ul><li>In some cases, a session ID is embedded in each URI, and the session is defined by the content server </li></ul></ul><ul><li>Content from user action </li></ul><ul><ul><li>Content servers maintain state variables for each active session, so the information needed to determine the content served for a user request is not always available </li></ul></ul>
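The thirty-minute timeout rule can be sketched as follows, assuming the requests of one user are already sorted by time. The function name `sessionize`, the timestamps, and the page names are illustrative.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) requests into sessions.

    A new session starts whenever the gap since the previous request
    exceeds the timeout.
    """
    sessions = []
    last_time = None
    for timestamp, page in requests:
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions


clicks = [
    (datetime(2000, 10, 10, 10, 0), "/a"),
    (datetime(2000, 10, 10, 10, 5), "/b"),
    (datetime(2000, 10, 10, 10, 50), "/c"),  # 45-minute gap: new session
    (datetime(2000, 10, 10, 11, 0), "/d"),
]
print(sessionize(clicks))
```

The timeout is a heuristic: a user who reads one page for forty minutes is split into two sessions, and two visits twenty minutes apart are merged.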
  15. 16. <ul><li>Using referrer and agent information, 4 sessions are determined </li></ul>
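One way such a heuristic might work is sketched below. It assumes that requests with the same IP address and agent belong together only if the referrer is a page already seen in that session; otherwise a new session is opened. The log entries are invented and do not reproduce the data behind the four sessions on the slide.

```python
def split_by_agent_and_referrer(entries):
    """Group time-ordered (ip, agent, page, referrer) entries into sessions.

    Assumed heuristic: an entry joins the most recent session with the same
    (ip, agent) pair that already contains its referrer page; otherwise it
    starts a new session.
    """
    sessions = []
    for ip, agent, page, referrer in entries:
        target = None
        for session in reversed(sessions):
            if session["key"] == (ip, agent) and referrer in session["pages"]:
                target = session
                break
        if target is None:
            target = {"key": (ip, agent), "pages": []}
            sessions.append(target)
        target["pages"].append(page)
    return sessions


# Hypothetical entries from a single IP address; "-" means no referrer.
entries = [
    ("192.0.2.1", "BrowserA", "/a", "-"),
    ("192.0.2.1", "BrowserA", "/b", "/a"),
    ("192.0.2.1", "BrowserB", "/a", "-"),
    ("192.0.2.1", "BrowserA", "/c", "/b"),
    ("192.0.2.1", "BrowserB", "/d", "/a"),
    ("192.0.2.1", "BrowserA", "/x", "-"),
]
sessions = split_by_agent_and_referrer(entries)
for session in sessions:
    print(session["key"], session["pages"])
```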
  16. 17. Content Preprocessing and Structure Preprocessing <ul><li>Content Preprocessing </li></ul><ul><ul><li>Converting the text, image, scripts, and other multimedia files into forms that are useful for Web usage mining </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><ul><li>By content </li></ul></ul></ul><ul><ul><ul><li>By intended use (Cooley et al., 1999; Pirolli et al., 1996) </li></ul></ul></ul><ul><ul><ul><ul><li>Convey information, gather information from user, allow navigation, or combination </li></ul></ul></ul></ul><ul><li>Structure Preprocessing </li></ul><ul><ul><li>Hyperlinks between page views </li></ul></ul>
  17. 18. Phase 2 – Pattern Discovery <ul><li>Statistical Analysis </li></ul><ul><ul><li>Perform descriptive statistical analysis (such as mean, median, frequency etc.) on page views, viewing time and length of a navigational path from session file </li></ul></ul><ul><ul><li>Web traffic analysis tools produce periodic reports </li></ul></ul><ul><ul><ul><li>Most frequently accessed pages </li></ul></ul></ul><ul><ul><ul><li>Average view time of a page </li></ul></ul></ul><ul><ul><ul><li>Average length of a path through a site </li></ul></ul></ul><ul><ul><li>Useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions </li></ul></ul>
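The three report items listed above can be computed directly from a session file. The sketch below assumes each session is a list of (page, viewing time in seconds) pairs; the data is made up.

```python
from collections import Counter
from statistics import mean

# Hypothetical session file: each session is a list of (page, seconds viewed).
sessions = [
    [("/home", 10), ("/products", 40), ("/cart", 25)],
    [("/home", 5), ("/products", 20)],
    [("/products", 30)],
]

# Most frequently accessed pages.
page_counts = Counter(page for session in sessions for page, _ in session)

# Average view time of each page.
view_times = {}
for session in sessions:
    for page, seconds in session:
        view_times.setdefault(page, []).append(seconds)
average_view_time = {page: mean(times) for page, times in view_times.items()}

# Average length of a path through the site.
average_path_length = mean(len(session) for session in sessions)

print(page_counts.most_common(1), average_view_time, average_path_length)
```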
  18. 19. <ul><li>Association Rules </li></ul><ul><ul><li>Relate pages that are most often referenced together in a single server session </li></ul></ul><ul><ul><li>Sets of pages that are accessed together with a support value exceeding some specified threshold </li></ul></ul><ul><ul><li>These pages may not be directly connected by hyperlinks </li></ul></ul><ul><ul><li>Useful for Web designers to restructure their Web sites </li></ul></ul><ul><ul><li>These rules serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site </li></ul></ul>
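A minimal sketch of the support computation for page pairs, assuming support is the fraction of server sessions containing both pages. It counts pairs only and is not a full association-rule miner; the sessions and threshold are invented.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return page pairs whose support (fraction of sessions) meets the threshold."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each pair once per session, regardless of visit order.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


sessions = [
    ["/home", "/music", "/cart"],
    ["/home", "/music"],
    ["/home", "/books"],
    ["/music", "/cart"],
]
print(frequent_page_pairs(sessions, min_support=0.5))
```

Note that `/cart` and `/music` qualify whether or not a hyperlink connects them, which is the point made on the slide.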
  19. 20. <ul><li>Clustering </li></ul><ul><ul><li>Group together a set of items having similar characteristics </li></ul></ul><ul><ul><li>Usage clusters </li></ul></ul><ul><ul><ul><li>Establish groups of users exhibiting similar browsing patterns </li></ul></ul></ul><ul><ul><ul><li>Useful for inferring user demographics in order to perform market segmentation </li></ul></ul></ul><ul><ul><li>Page clusters </li></ul></ul><ul><ul><ul><li>Discover groups of pages that have related content </li></ul></ul></ul><ul><ul><ul><li>Useful for search engines and Web assistance providers </li></ul></ul></ul>
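Usage clustering can be illustrated with a deliberately simple stand-in: a greedy "leader" clustering over the sets of pages each user visited, using Jaccard similarity. The slides do not name a specific algorithm, so the method, the threshold, and the user data here are all assumptions.

```python
def jaccard(a, b):
    """Similarity of two non-empty page sets: shared pages over all pages."""
    return len(a & b) / len(a | b)


def leader_clusters(user_pages, threshold):
    """Assign each user to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader page set, member names)
    for user, pages in user_pages.items():
        for leader_pages, members in clusters:
            if jaccard(pages, leader_pages) >= threshold:
                members.append(user)
                break
        else:
            # No similar leader found: this user starts a new cluster.
            clusters.append((pages, [user]))
    return [members for _, members in clusters]


# Hypothetical users and the pages they visited.
user_pages = {
    "u1": {"/music", "/cart", "/home"},
    "u2": {"/music", "/cart"},
    "u3": {"/books", "/home"},
    "u4": {"/books"},
}
print(leader_clusters(user_pages, threshold=0.5))
```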
  20. 21. <ul><li>Classification </li></ul><ul><ul><li>Mapping a data item into one of several predefined classes </li></ul></ul><ul><ul><li>Develop a profile of users belonging to a particular class or category </li></ul></ul><ul><ul><li>Requires extraction and selection of features that best describe the properties of a given class or category </li></ul></ul><ul><ul><li>Techniques </li></ul></ul><ul><ul><ul><li>Decision tree classifiers, naïve Bayesian classifier, k-nearest neighbor classifiers, support vector machines, etc. </li></ul></ul></ul><ul><ul><li>E.g. </li></ul></ul><ul><ul><ul><li>30% of users who place online orders in /Product/Music are in the 19-25 age group and live on the West coast </li></ul></ul></ul>
  21. 22. <ul><li>Sequential Pattern </li></ul><ul><ul><li>Find inter-session patterns </li></ul></ul><ul><ul><ul><li>The presence of a set of items is followed by another item in a time-ordered set of sessions or episodes </li></ul></ul></ul><ul><ul><li>Useful for predicting future patterns in order to place advertisements for certain user groups </li></ul></ul><ul><ul><li>Temporal analysis </li></ul></ul><ul><ul><ul><li>Trend analysis, change point detection, or similarity analysis </li></ul></ul></ul>
  22. 23. <ul><li>Dependency Modeling </li></ul><ul><ul><li>Develop a model capable of representing significant dependencies among the various variables in the Web domain </li></ul></ul><ul><ul><li>E.g. </li></ul></ul><ul><ul><ul><li>A model representing the different stages a visitor undergoes while shopping in an online store based on the action chosen (from casual visitor to a serious potential buyer) </li></ul></ul></ul><ul><ul><li>Techniques </li></ul></ul><ul><ul><ul><li>Hidden Markov models, Bayesian belief network </li></ul></ul></ul>
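As a much simpler stand-in for the hidden Markov models and Bayesian belief networks named above, a first-order Markov chain already captures how likely a visitor is to move from one stage to the next. The sketch below estimates transition probabilities from sessions; the stage names and data are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }


# Hypothetical shopping sessions, from casual visit to checkout.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product"],
    ["home", "product", "cart"],
    ["home", "search"],
]
model = transition_probabilities(sessions)
print(model["home"])
```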
  23. 24. Phase 3 – Pattern Analysis <ul><li>Filter out uninteresting rules or patterns from the set found in the pattern discovery phase </li></ul>
  24. 25. Major Application Areas for Web Usage Mining (Srivastava et al., 2000)
  25. 26. Architecture of the WebSIFT system (Cooley et al., 1999)
  26. 27. WUM – Web Usage Miner Navigation behavior in Web sites (Berendt and Spiliopoulou, 2000) <ul><li>Web site is a network of structurally or semantically interrelated nodes (built in a way that reflects the designers’ intuition). </li></ul><ul><li>Quality of a Web site </li></ul><ul><ul><li>The conformance of the Web site’s structure to the intuition of each group of visitors accessing the site. </li></ul></ul><ul><ul><ul><li>Intuition of visitors is indirectly reflected in their navigation behavior (represented in the browsing pattern) </li></ul></ul></ul><ul><ul><li>Measure of the quality of Web site </li></ul></ul><ul><ul><ul><li>Quality of service (e.g. response time) </li></ul></ul></ul><ul><ul><ul><li>Quality of navigation </li></ul></ul></ul><ul><ul><ul><li>Accessibility </li></ul></ul></ul><ul><ul><ul><li>Information utility </li></ul></ul></ul><ul><ul><ul><li>Ease of use </li></ul></ul></ul><ul><ul><ul><li>Attractiveness of the presentation metaphor </li></ul></ul></ul>
  27. 28. Sequence Mining <ul><li>Sequence mining supports the discovery of frequent paths composed of not necessarily adjacent pages </li></ul><ul><li>Given a collection of transactions ordered in time (each transaction contains a set of items), discover sequences of maximal length with support above a given threshold </li></ul><ul><li>A sequence is an ordered list of elements, an element being a set of items appearing together in a transaction </li></ul><ul><li>Elements need not be adjacent in time but their ordering in a sequence must not violate the time ordering of the supporting transactions </li></ul><ul><li>Example </li></ul><ul><ul><li>Consider a Web site with pages W, A, B, C, D, E and a link from W to D </li></ul></ul><ul><ul><li>Observed paths: WABC (1000 times), WDBC (100 times), WABDEC (400 times) </li></ul></ul><ul><ul><li>Frequency threshold = 25% </li></ul></ul><ul><ul><li>The sequence WD appears 500 (400+100) times out of 1500 (=33%), which is above the threshold </li></ul></ul><ul><li>In the above example, the link from W to D is used in only 1 out of 5 occurrences of WD (100 of 500). Therefore, sequence mining alone is not useful for understanding the usefulness of a hyperlink. </li></ul><ul><li>In WUM, a navigation pattern is a directed acyclic graph composed of a group of sequences that conform to a template </li></ul><ul><ul><li>The purpose is to determine which links' usage is responsible for the frequency of sequences </li></ul></ul>
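The arithmetic of the W/D example can be checked with a short sketch. The path counts are those on the slide; the function name is illustrative.

```python
def contains_subsequence(path, pattern):
    """True if the pages of `pattern` occur in `path` in order, not necessarily adjacent."""
    remaining = iter(path)
    return all(page in remaining for page in pattern)


# Path counts from the example above.
paths = {"WABC": 1000, "WDBC": 100, "WABDEC": 400}
total = sum(paths.values())

# Support of the sequence W ... D (D somewhere after W).
wd_support = sum(n for path, n in paths.items() if contains_subsequence(path, "WD"))

# Uses of the direct hyperlink W -> D (D immediately after W).
direct_link = sum(n for path, n in paths.items() if "WD" in path)

print(wd_support, round(wd_support / total, 2), direct_link / wd_support)
```

The sequence WD clears the 25% threshold, yet only a fifth of its occurrences actually use the direct link, which is why frequency alone says little about the link.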
  28. 29. WUM – Navigation Sequences and Navigation Patterns <ul><li>A session is a directed list of page accesses performed by a user during his/her visit in a site </li></ul><ul><li>A navigation pattern is a structure that </li></ul><ul><ul><li>Emphasizes the common parts among the sessions </li></ul></ul><ul><ul><li>Does not purge the dissimilar parts </li></ul></ul><ul><ul><li>Annotates both common and non-common parts with quantitative information </li></ul></ul><ul><li>P is the set of Web pages in the site </li></ul><ul><ul><li>If the site is dynamic in nature, P is the set of all pages that can be generated </li></ul></ul><ul><li>D is a dataset of sessions </li></ul><ul><li>A session is a directed list of elements from P </li></ul><ul><li>A sequence of length n is a vector s ∈ (P × N)^n, where N is the set of positive integers; each element pairs a page with its occurrence number in the session </li></ul><ul><li>U = P × N </li></ul><ul><li>Example </li></ul><ul><ul><li>P = {a,b,c,d,e,f,g,h} </li></ul></ul><ul><ul><li>ab, ac, abcde, bcbf, abdfhe are sessions appearing in D </li></ul></ul><table><tr><th>No.</th><th>Session</th><th>Sequence</th><th>Appearances</th></tr><tr><td>1</td><td>ab</td><td>(a,1) (b,1)</td><td>40</td></tr><tr><td>2</td><td>ac</td><td>(a,1) (c,1)</td><td>20</td></tr><tr><td>3</td><td>abcde</td><td>(a,1) (b,1) (c,1) (d,1) (e,1)</td><td>30</td></tr><tr><td>4</td><td>bcbf</td><td>(b,1) (c,1) (b,2) (f,1)</td><td>5</td></tr><tr><td>5</td><td>abdfhe</td><td>(a,1) (b,1) (d,1) (f,1) (h,1) (e,1)</td><td>10</td></tr></table>
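The step from a session to its sequence, where each page is tagged with its occurrence number, can be sketched as follows. The function name `to_sequence` is an assumption made for illustration.

```python
from collections import Counter


def to_sequence(session):
    """Pair each page with the number of times it has occurred so far in the session."""
    seen = Counter()
    sequence = []
    for page in session:
        seen[page] += 1
        sequence.append((page, seen[page]))
    return sequence


print(to_sequence("bcbf"))  # the second visit to b becomes ('b', 2)
```

Numbering the occurrences keeps repeated visits to the same page distinguishable, as in session 4 of the example.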
  29. 30. Generalized sequences <ul><li>The wildcard [low; high] is matched by any sequence of elements of length at least low and at most high (low ≥ 0, high ≥ low) </li></ul><ul><li>The wildcard * is used when the range is not of interest </li></ul><ul><li>A generalized sequence g is a vector g_1 * g_2 * … * g_n, where each * stands for a wildcard </li></ul><ul><ul><li>The number of non-wildcard elements in g is the length of g, length(g) </li></ul></ul><ul><li>Example </li></ul><ul><ul><li>(a,1) * (b,1) [2;4] (e,1) matches Sessions 3 and 5 of the session table </li></ul></ul><ul><li>The group of sequences that match g constitutes the “navigation pattern of g”, navp(g) </li></ul><ul><li>The hits of g, hits(g), is the number of sequences matched by g </li></ul><ul><li>confidence(g_i, g_j, g) = hits(g_1 * … * g_{i-1} * g_i) / hits(g_1 * … * g_j) </li></ul><ul><ul><li>g = (a,1) * (b,1) [2;4] (e,1) </li></ul></ul><ul><ul><li>hits(g) = 30 + 10 = 40 </li></ul></ul>
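The hits(g) = 40 result can be reproduced with a small matcher. This is a sketch under assumptions: the pattern is written as a list alternating elements and gap ranges, the unranged wildcard is treated as "zero or more elements", and a sequence counts as a match if the pattern is found anywhere in it. None of these representation choices come from WUM itself.

```python
from collections import Counter

ANY = (0, None)  # unranged wildcard: any number of elements, including none


def to_sequence(session):
    """Pair each page with its occurrence number within the session."""
    seen = Counter()
    sequence = []
    for page in session:
        seen[page] += 1
        sequence.append((page, seen[page]))
    return sequence


def matches(sequence, pattern):
    """pattern alternates elements and gaps: [elem, (low, high), elem, ...]."""
    elements, gaps = pattern[0::2], pattern[1::2]

    def match_from(position, index):
        # elements[index] was found at `position`; try to place the rest.
        if index == len(elements) - 1:
            return True
        low, high = gaps[index]
        longest = len(sequence) - position - 2  # largest gap that still fits
        if high is None or high > longest:
            high = longest
        for gap in range(low, high + 1):
            following = position + 1 + gap
            if sequence[following] == elements[index + 1] and match_from(following, index + 1):
                return True
        return False

    return any(
        sequence[start] == elements[0] and match_from(start, 0)
        for start in range(len(sequence))
    )


# Sessions and appearance counts from the example table.
aggregate_log = {"ab": 40, "ac": 20, "abcde": 30, "bcbf": 5, "abdfhe": 10}
g = [("a", 1), ANY, ("b", 1), (2, 4), ("e", 1)]
hits = sum(n for s, n in aggregate_log.items() if matches(to_sequence(s), g))
print(hits)
```

Only abcde (two pages between b and e) and abdfhe (three pages between b and e) satisfy the [2;4] range, giving 30 + 10.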
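  30. 31. Aggregate tree and log <ul><li>navp(g) is modeled as a tree structure (aggregate tree) </li></ul><ul><li>Aggregate log </li></ul><table><tr><th>No.</th><th>Session</th><th>Sequence</th><th>Appearances</th></tr><tr><td>1</td><td>ab</td><td>(a,1) (b,1)</td><td>40</td></tr><tr><td>2</td><td>ac</td><td>(a,1) (c,1)</td><td>20</td></tr><tr><td>3</td><td>abcde</td><td>(a,1) (b,1) (c,1) (d,1) (e,1)</td><td>30</td></tr><tr><td>4</td><td>bcbf</td><td>(b,1) (c,1) (b,2) (f,1)</td><td>5</td></tr><tr><td>5</td><td>abdfhe</td><td>(a,1) (b,1) (d,1) (f,1) (h,1) (e,1)</td><td>10</td></tr></table>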
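One plausible reading of the aggregate tree is a prefix tree in which sequences sharing a prefix share a branch and each node records how many sessions pass through it. The sketch below builds such a tree from the aggregate log; the dict-based node layout is an assumption, not WUM's actual data structure.

```python
from collections import Counter


def to_sequence(session):
    """Pair each page with its occurrence number within the session."""
    seen = Counter()
    sequence = []
    for page in session:
        seen[page] += 1
        sequence.append((page, seen[page]))
    return sequence


def build_aggregate_tree(log):
    """Merge sequences with common prefixes; each node counts the sessions through it."""
    root = {"support": 0, "children": {}}
    for session, appearances in log.items():
        node = root
        node["support"] += appearances
        for element in to_sequence(session):
            node = node["children"].setdefault(element, {"support": 0, "children": {}})
            node["support"] += appearances
    return root


# Sessions and appearance counts from the aggregate log.
aggregate_log = {"ab": 40, "ac": 20, "abcde": 30, "bcbf": 5, "abdfhe": 10}
tree = build_aggregate_tree(aggregate_log)
a_node = tree["children"][("a", 1)]
print(tree["support"], a_node["support"], a_node["children"][("b", 1)]["support"])
```

With this layout, the support of a prefix such as (a,1) (b,1) is read off a single node (40 + 30 + 10 = 80) instead of being recounted from the log.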
  31. 32. Discover navigation pattern <ul><li>A “template” is a vector composed of variables ranging over the domain U and of wildcards </li></ul><ul><li>A mining query is a template declaration accompanied by a conjunction of constraints on the permissible values of the template variables </li></ul><ul><li>Example </li></ul><ul><ul><li>NODE AS x y z </li></ul></ul><ul><ul><li>TEMPLATE x * y [2;4] z AS t </li></ul></ul><ul><ul><li>WHERE x.support ≥ 85 </li></ul></ul><ul><ul><li>AND (y.support / x.support) ≥ 0.8 </li></ul></ul>