
Web Usage Mining and Using Ontology for Capturing Web Usage Semantic


Professor Ismail Toroslu gave a lecture on "Web Usage Mining and Using Ontology for Capturing Web Usage Semantic" in the Distinguished Lecturer Series - Leon The Mathematician.

More Information available at:

Published in: Education, Technology

  1. Web Usage Mining and Using Ontology for Capturing Web Usage Semantic
       İsmail Hakkı Toroslu, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey
  2. PART I: A New Approach for Reactive Web Usage Data Processing (08/28/11)
  3. Outline
       • Web Mining
       • Previous Session Reconstruction Heuristics
       • Smart-SRA
       • Agent Simulator
       • Experimental Results
       • Conclusion
  4. Web Mining
       • Data mining: discover and retrieve useful and interesting patterns from a large dataset.
       • Web mining: the dataset is the huge body of web data.
       • Dimensions:
           • Web content mining
           • Web structure mining
           • Web usage mining
  5. Web Mining: Web Usage Mining (WUM)
       Application of data mining techniques to web log data in order to discover user access patterns.
       Log fields: IP Address, Request Time, Method, URL, Protocol, Success of Return Code, Number of Bytes Transmitted.
       Example user web access log (request time, method, URL, protocol, return code, bytes):
         [25/Apr/2005:03:04:41–05]   GET   A.html   HTTP/1.0   200   3290
         [25/Apr/2005:03:04:43–05]   GET   B.html   HTTP/1.0   200   2050
         [25/Apr/2005:03:04:48–05]   GET   C.html   HTTP/1.0   200   4130
  6. Web Mining: Phases of Web Usage Mining
       Raw server log -> Pre-Processing (session reconstruction heuristics) -> User session file -> Pattern Discovery (Apriori, GSP, SPADE) -> Rules and patterns -> Pattern Analysis -> Interesting knowledge -> Applications
  7. Previous Session Reconstruction Heuristics: Session Reconstruction
       • Sessions are reconstructed by using heuristics that select and group requests belonging to the same user session.
       • Types:
           • Reactive: requests are processed after they have been handled by the web server.
           • Proactive: processing occurs during the user's interactive browsing of the web site.
  8. Previous Session Reconstruction Heuristics
       • Time-oriented heuristics
       • Navigation-oriented heuristic
       New reactive session reconstruction technique: Smart-SRA. It combines these heuristics with "site topology" information in order to increase the accuracy of the reconstructed sessions.
  9. Previous Session Reconstruction Heuristics: Example Web Topology Graph and Example Web Page Request Sequence
       Page:       P1   P20   P13   P49   P34   P23
       Timestamp:  0    6     15    29    32    47
  10. Previous Session Reconstruction Heuristics: Time-Oriented Heuristics (1)
       • Total session time: the duration of a discovered session is limited by a threshold.
       • Request sequence (page@timestamp): P1@0, P20@6, P13@15, P49@29, P34@32, P23@47
       • Discovered sessions (threshold 30 mins):
           • [P1, P20, P13, P49]
           • [P34, P23]
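The total-session-time rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `split_by_total_time` and the choice to treat the threshold as inclusive (a request exactly 30 minutes after the session start still belongs to it) are assumptions.

```python
def split_by_total_time(requests, max_duration=30):
    """Split (page, minute) requests so that no session spans more than max_duration."""
    sessions, current, start = [], [], None
    for page, minute in requests:
        # Close the current session once this request would exceed the duration limit.
        if current and minute - start > max_duration:
            sessions.append(current)
            current = []
        if not current:
            start = minute  # first request of a new session
        current.append(page)
    if current:
        sessions.append(current)
    return sessions


# Request sequence from the slide (timestamps in minutes).
slide_requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_total_time(slide_requests))
# [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```

The result matches the two sessions listed on the slide for the 30-minute threshold.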
  11. Previous Session Reconstruction Heuristics: Time-Oriented Heuristics (2)
       • Page-stay time: the time spent on any page is limited by a threshold.
       • Request sequence (page@timestamp): P1@0, P20@6, P13@15, P49@29, P34@32, P23@47
       • Discovered sessions (threshold 10 mins):
           • [P1, P20, P13]
           • [P49, P34]
           • [P23]
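The page-stay rule differs from the total-time rule only in what is compared: the gap between two consecutive requests rather than the distance from the session start. The sketch below assumes an inclusive threshold, and the name `split_by_page_stay` is hypothetical.

```python
def split_by_page_stay(requests, max_stay=10):
    """Split (page, minute) requests wherever the gap between consecutive requests exceeds max_stay."""
    sessions, current, previous = [], [], None
    for page, minute in requests:
        # A gap longer than max_stay means the user stayed too long on the previous page.
        if current and minute - previous > max_stay:
            sessions.append(current)
            current = []
        current.append(page)
        previous = minute
    if current:
        sessions.append(current)
    return sessions


# Request sequence from the slide (timestamps in minutes).
slide_requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_page_stay(slide_requests))
# [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```

The gaps in the example are 6, 9, 14, 3 and 15 minutes, so the sequence is cut before P49 and before P23, as on the slide.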
  12. Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic
       • Adding page WP(N+1) to a session [WP(1), WP(2), …, WP(N)]:
           • If WP(N) has a hyperlink to WP(N+1), the session becomes [WP(1), WP(2), …, WP(N), WP(N+1)].
           • If WP(N) does not have a hyperlink to WP(N+1), and WP(Kmax) is the nearest page having a hyperlink to WP(N+1), backward browser moves are added:
             [WP(1), WP(2), …, WP(N), WP(N-1), WP(N-2), ..., WP(Kmax), WP(N+1)]
  13. Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic (example)
       • Current session: [ ]; new page: P1
       • Current session: [P1]; condition: Link[P1, P20] = 1; new page: P20
       • Current session: [P1, P20]; condition: Link[P20, P13] = 0, Link[P1, P13] = 1; new page: P13
       • Current session: [P1, P20, P1, P13]; condition: Link[P13, P49] = 1; new page: P49
       • Current session: [P1, P20, P1, P13, P49]; condition: Link[P49, P34] = 0, Link[P13, P34] = 1; new page: P34
       • Current session: [P1, P20, P1, P13, P49, P13, P34]; condition: Link[P34, P23] = 1; new page: P23
       • Final session: [P1, P20, P1, P13, P49, P13, P34, P23]
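A minimal Python sketch of the navigation-oriented rule follows. The `links` dictionary is reconstructed from the Link[...] conditions and session examples in the slides, since the topology figure itself is not in the transcript. Starting a new session when no page in the current session links to the new page is an assumption, and the function name is hypothetical.

```python
def navigation_sessions(pages, links):
    """Rebuild sessions by inserting backward browser moves where a direct hyperlink is missing."""
    sessions, current = [], []
    for page in pages:
        # Index of the nearest (latest) page in the session that links to the new page.
        k = next((i for i in range(len(current) - 1, -1, -1)
                  if page in links.get(current[i], set())), None)
        if k is None:
            # Assumption: no referrer in the session, so the page opens a new session.
            if current:
                sessions.append(current)
            current = [page]
        else:
            # Backward moves WP(N-1), ..., WP(Kmax); empty when the last page links directly.
            backward = [current[i] for i in range(len(current) - 2, k - 1, -1)]
            current = current + backward + [page]
    if current:
        sessions.append(current)
    return sessions


# Hyperlinks inferred from the slide examples (an assumption, not the full topology).
slide_links = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P49": {"P23"},
    "P34": {"P23"},
}
print(navigation_sessions(["P1", "P20", "P13", "P49", "P34", "P23"], slide_links))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```

The inserted P1 and P13 entries are the artificial "back" moves that Smart-SRA is designed to avoid.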
  14. Smart-SRA
       Contains two phases:
       • Phase 1: Shorter request sequences are constructed by using the overall session duration time and page-stay time criteria.
           • Each sequence satisfies the overall session duration time limit.
       • Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that:
           • between each consecutive page pair in a session there is a hyperlink from the previous page to the next page
           • the page-stay time criterion is also satisfied
       • Adds the referrer constraints of the topology rule while eliminating the need for inserting backward browser movements.
  15. Smart-SRA: Steps of Phase 2
       Process a candidate session from left to right by repeating the following steps until the candidate session is empty:
       1. Determine the web pages without any referrer (on their left) and remove them from the candidate session.
       2. For each one of these pages:
           • For each previously constructed session:
               • If there is a hyperlink from the last page of the session to the web page, then append the web page to the session (if the page-stay time constraint is satisfied).
       3. Remove non-maximal sessions.
  16. Smart-SRA: Example Candidate Session and Example Web Topology
       Page:       P1   P20   P13   P49   P34   P23
       Timestamp:  0    6     9     12    14    15
  17. Smart-SRA (example run of Phase 2)
       • Iteration 1
           • Candidate session: [P1, P20, P13, P49, P34, P23]
           • New session set (before): (empty)
           • Temp page set: {P1}
           • Temp session set: [P1]
           • New session set (after): [P1]
       • Iteration 2
           • Candidate session: [P20, P13, P49, P34, P23]
           • New session set (before): [P1]
           • Temp page set: {P20, P13}
           • Temp session set: [P1, P20], [P1, P13]
           • New session set (after): [P1, P20], [P1, P13]
       • Iteration 3
           • Candidate session: [P49, P34, P23]
           • New session set (before): [P1, P20], [P1, P13]
           • Temp page set: {P49, P34}
           • Temp session set: [P1, P13, P49], [P1, P13, P34]
           • New session set (after): [P1, P20], [P1, P13, P49], [P1, P13, P34]
       • Iteration 4
           • Candidate session: [P23]
           • New session set (before): [P1, P20], [P1, P13, P49], [P1, P13, P34]
           • Temp page set: {P23}
           • Temp session set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
           • New session set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
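The three steps of Phase 2 can be sketched as follows. This is an illustrative reading of the slides rather than the published algorithm. The `links` dictionary is inferred from the example sessions, the referrer test in step 1 checks hyperlinks only, and a page that cannot extend any existing session is assumed to start a new one-page session (which is what happens to P1 in iteration 1).

```python
def smart_sra_phase2(candidate, links, max_stay=10):
    """Partition a candidate session of (page, minute) pairs into maximal linked sub-sessions."""
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, minute) pairs
    while remaining:
        # Step 1: pages that have no referrer to their left in the candidate session.
        starters = [
            (page, minute) for i, (page, minute) in enumerate(remaining)
            if not any(page in links.get(q, set()) for q, _ in remaining[:i])
        ]
        remaining = [r for r in remaining if r not in starters]
        # Step 2: append each such page to every session whose last page links to it.
        extended, used = [], set()
        for page, minute in starters:
            grew = False
            for idx, session in enumerate(sessions):
                last_page, last_minute = session[-1]
                if page in links.get(last_page, set()) and minute - last_minute <= max_stay:
                    extended.append(session + [(page, minute)])
                    used.add(idx)
                    grew = True
            if not grew:
                extended.append([(page, minute)])  # assumption: start a new session
        # Step 3: sessions that were extended are no longer maximal and are dropped.
        sessions = [s for i, s in enumerate(sessions) if i not in used] + extended
    return [[page for page, _ in s] for s in sessions]


# Hyperlinks inferred from the slide examples (an assumption, not the full topology).
slide_links = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P49": {"P23"},
    "P34": {"P23"},
}
slide_candidate = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
for session in smart_sra_phase2(slide_candidate, slide_links):
    print(session)
```

On the slide's candidate session this yields [P1, P20, P23], [P1, P13, P49, P23] and [P1, P13, P34, P23], the same final session set as the example run, with no backward moves inserted.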
  18. Agent Simulator
       • Models the behavior of web users and generates both the web user navigation and the log data kept by the web server.
       • Used to compare the performance of alternative session reconstruction heuristics.
       • Uses 4 primitive behaviors for simulating the complex navigation of a web user.
  19. Agent Simulator: User Behavior I
       The web user can start a new session with any one of the possible entry pages of the web site.
  20. Agent Simulator: User Behavior II
       The web user can select a new page having a link from the most recently accessed page. (Figure: navigation steps on the example topology P1, P13, P20, P23, P34, P49.)
  21. Agent Simulator: User Behavior III
       The web user can select as the next page a page having a link from any one of the previously browsed pages. (Figure: navigation steps on the example topology.)
  22. Agent Simulator: User Behavior IV
       The web user can terminate the session. (Figure: navigation steps on the example topology.)
  23. Agent Simulator: Parameters for simulating the behavior of a web user
       • Session Termination Probability (STP)
       • Link from Previous pages Probability (LPP)
       • New Initial page Probability (NIP)
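One way the three probabilities could drive the four primitive behaviors is sketched below. The slides do not say in which order the probabilities are tested or whether they are applied independently, so the ordering here (termination first, then new initial page, then link from a previous page, otherwise link from the current page) is purely an assumption, as is the function name `choose_behavior`.

```python
import random


def choose_behavior(rng, stp, lpp, nip):
    """Pick the agent's next primitive behavior from the three simulator probabilities."""
    if rng.random() < stp:
        return "terminate"            # Behavior IV
    if rng.random() < nip:
        return "new_initial_page"     # Behavior I
    if rng.random() < lpp:
        return "link_from_previous"   # Behavior III
    return "link_from_current"        # Behavior II


# Tally of behaviors using the fixed values from the experiments (STP 5%, LPP 30%, NIP 30%).
rng = random.Random(42)
tally = {}
for _ in range(10_000):
    behavior = choose_behavior(rng, stp=0.05, lpp=0.30, nip=0.30)
    tally[behavior] = tally.get(behavior, 0) + 1
print(tally)
```

A full simulator would also need the topology graph and page-stay times drawn around the configured average, which this sketch leaves out.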
  24. Experimental Results: Heuristics Tested
       • Time-oriented heuristic (heur1): total time < 30 min
       • Time-oriented heuristic (heur2): page stay < 10 min
       • Navigation-oriented heuristic (heur3)
       • Smart-SRA heuristic (heur4)
  25. Experimental Results: Accuracy
       • A reconstructed session H captures a real session R if R occurs as a subsequence of H (R ⊏ H).
       • R = [P1, P3, P5]
       • H = [P9, P1, P3, P5, P8] => R ⊏ H
       • H = [P1, P9, P3, P5, P8] => R ⋢ H
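The second example on the slide shows that the pages of R must appear consecutively in H, so "subsequence" here means a contiguous run. A minimal check is sketched below. The `accuracy` helper, which reports the fraction of real sessions captured by at least one reconstructed session, is one plausible way to turn the capture relation into a score and is an assumption, not a formula given in the slides.

```python
def captures(h, r):
    """True if real session r occurs as a contiguous run inside reconstructed session h."""
    n = len(r)
    return any(h[i:i + n] == r for i in range(len(h) - n + 1))


def accuracy(real_sessions, reconstructed_sessions):
    """Assumed score: share of real sessions captured by some reconstructed session."""
    if not real_sessions:
        return 0.0
    hit = sum(any(captures(h, r) for h in reconstructed_sessions) for r in real_sessions)
    return hit / len(real_sessions)


r = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], r))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], r))  # False
```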
  26. Experimental Results: Parameters for generating user sessions and web topology
       • Number of web pages (nodes) in topology: 300
       • Average outdegree: 15
       • Average page stay time: 2.2 min
       • Deviation of page stay time: 0.5 min
       • Number of agents: 10000
       • STP: fixed 5%, range 1%-20%
       • LPP: fixed 30%, range 0%-90%
       • NIP: fixed 30%, range 0%-90%
  27. Experimental Results: Accuracy vs. STP (chart)
  28. Experimental Results: Accuracy vs. LPP (chart)
  29. Experimental Results: Accuracy vs. NIP (chart)
  30. Conclusion
       • New session reconstruction heuristic: Smart-SRA
           • Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous page to the next one)
           • No insertion of artificial browser (back) requests in order to prevent unrelated consecutive requests
           • Only maximal sessions
       • Agent simulator
       • Accuracy measure
       • Experimental results show that Smart-SRA outperforms the previous heuristics.
  31. PART II: Semantically Enriched Event Based Model for Web Usage Mining
  32. Outline
       • Introduction
       • Related Work
       • Semantic Event Based Sessions
       • Formal Definition of Semantic Events
       • Algorithms for Mining Semantic Event Patterns
       • Experimental Results
       • Conclusion
  33. Introduction
       • Traditional WUM is based on pageviews, but the user interaction model is changing.
       • Users do not care about pageviews; they use a web site to achieve high-level goals such as:
           • Finding and viewing a video
           • Buying tickets
           • Searching for the nearest Italian restaurant
           • Listening to a song, etc.
  34. Introduction
       • We should analyze usage data as a series of "events":
           • Search Mediterranean restaurants
           • Search Italian restaurants
           • View the reviews for Restaurant A
           • View the reviews for Restaurant B
           • Click the web site link of Restaurant A
       • Incorporating semantic knowledge in the process is the logical choice:
           • A method should be devised to capture user behavior.
           • Usage data should be mapped to semantic space.
           • An algorithm should be developed to exploit semantic relations.
  35. Introduction
       • In this work we propose methods for:
           • tracking and logging domain-level events
           • injecting semantics into events
           • semantic ordering of events
           • an algorithm for computing sequences of frequent events
       • The proposed system was tested with 2 web sites:
           • Music streaming site
           • Mobile network operator's site
  36. Semantic Event Based Sessions
       • Events are conceptual actions that the user performs to achieve a certain effect.
       • Events are used to capture business actions that are defined in the site's domain.
       • The site admin is responsible for defining and tracking events.
       • Events are tracked via a JavaScript client.
  37. Semantic Event Based Sessions
       • Example events:
           • Play a video event
           • Add to shopping cart event
           • Add friend action
       • Sometimes we may be interested in properties of events, such as:
           • the "query" property of a "search event"
           • the "category" property of a "view video event"
  38. Semantic Event Based Sessions
       • Every event is defined as an object.
       • Objects can have properties which relate an object
           • with another object, or
           • with a datatype value.
       • The relations between objects are captured in a tree:
           • Each individual and property is a node.
           • Object property nodes have an object as parent and an object as child.
  39. A sample event from a hypothetical video viewing site (figure)
  40. Events as Semantic Objects
       • Events can be used to capture all relevant actions of the user, including plain pageviews.
       • With a mapping of events to the web site's ontology we can define "semantic events".
       • Events are mapped to semantic space by using the class and property names in the ontology.
       • As a result of this mapping, the data to be mined is:
           • An ontology containing the terminological part
           • Logs containing semantic objects
  41. Definitions
       • A Session is an ordered set of atom-trees that correspond to the events of a single user in a certain browsing activity.
       • An Atom-tree is a tree of connected atoms. The atom-tree represents a domain event in the web site's ontology.
       • An Atom is either an individual of a class, a datatype property assertion, or an object property assertion.
  42. Definitions
       • A Pattern is an ordered set of atom-trees.
       • A session S supports the pattern Q iff Q is a subsequence of S, where the isMoreGeneralThan relation is used instead of equality in determining the subsequence relation.
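The support relation can be illustrated with a deliberately simplified model in which each atom-tree is reduced to a single class name in a small taxonomy. The taxonomy, the class names and the order-preserving (not necessarily contiguous) reading of "subsequence" are all assumptions made for this sketch. Real atom-trees also carry property assertions, which are ignored here.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "SearchItalian": "SearchRestaurant",
    "SearchRestaurant": "Search",
    "ViewReview": "View",
}


def is_more_general_than(a, b):
    """True if class a equals b or is an ancestor of b in the taxonomy."""
    while b is not None:
        if a == b:
            return True
        b = PARENT.get(b)
    return False


def supports(session, pattern):
    """True if pattern matches session in order, comparing with is_more_general_than."""
    events = iter(session)
    # Each pattern element consumes events until one it generalizes is found.
    return all(any(is_more_general_than(q, e) for e in events) for q in pattern)


def support(sessions, pattern):
    """Fraction of sessions that support the pattern."""
    return sum(supports(s, pattern) for s in sessions) / len(sessions)


sessions = [["SearchItalian", "ViewReview"], ["ViewReview"], ["SearchRestaurant", "ViewReview"]]
print(supports(sessions[0], ["Search", "View"]))  # True
print(supports(sessions[0], ["View", "Search"]))  # False
```

Because the general pattern [Search, View] matches sessions that contain different specific searches, it can be frequent even when no single specific pattern is.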
  43. (figure)
  44. Algorithm
       • For a given set of sessions, the problem is to find the set of patterns with support greater than the threshold value, minSupport.
       • Two-phase Apriori-like algorithm:
           • The first phase finds frequent atom-trees (patterns containing a single atom-tree).
           • The second phase searches for frequent atom-tree patterns.
  45. Phase I: Find Frequent Atom-trees
       • Apriori property: if atom-tree a1 is more general than atom-tree a2, then the support of a1 is at least the support of a2.
       • getMostGeneralForms generates the set of trees that are
           • more general than the given atom-tree
           • not less general than any other atom-tree generated
       • For level-wise search, a one-step refinement operator is defined over the set of individual atoms and object and datatype property assertion atoms.
  46. Phase I: Find Frequent Atom-trees
       • Given an atom, the one-step refinement operator refines another atom by refining to a subclass, refining to a sub-property, or refining a child of the node.
       • A one-step refinement over the set of atom-trees returns a set of atom-trees by either
           • refining a single node, or
           • adding the most general form of a node.
       • One-step refinement takes two atom-trees and returns atom-trees that are more similar forms of the second towards the first.
  47. Phase I: Find Frequent Atom-trees
       INPUT:  session data containing semantic events
       OUTPUT: list of frequent atom-trees
       generate the initial candidate set using getMostGeneralForm of all events
       iterate until no candidates can be generated
       {
           compare candidate set with the data set
           for each atom-tree in the data set
           {
               increment the frequency of each atom-tree in candidate set
                   that is more general than the atom-tree
           }
           filter out the candidates that are less frequent than minSupport
           generate next candidate set using oneStepRefinement operator
               on the current candidate set atom-trees
       }
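The level-wise loop above can be tried out on a toy model in which every atom-tree is just a class name and one-step refinement means moving to a direct subclass. That simplification, the taxonomy, and the per-session counting of support are assumptions for illustration only. The actual getMostGeneralForm and oneStepRefinement operators work on trees with property assertions.

```python
def frequent_classes(sessions, children, roots, min_support):
    """Level-wise search for frequent classes, refining only candidates that stay frequent."""
    parent = {c: p for p, cs in children.items() for c in cs}

    def ancestors_or_self(cls):
        while cls is not None:
            yield cls
            cls = parent.get(cls)

    frequent = {}
    candidates = set(roots)  # stands in for the most general forms
    while candidates:
        counts = {c: 0 for c in candidates}
        for session in sessions:
            # Every class that generalizes some event of this session.
            covered = {a for event in session for a in ancestors_or_self(event)}
            for c in candidates & covered:
                counts[c] += 1
        kept = {c for c, n in counts.items() if n / len(sessions) >= min_support}
        frequent.update({c: counts[c] / len(sessions) for c in kept})
        # Refinement stand-in: only children of frequent candidates become candidates.
        candidates = {child for c in kept for child in children.get(c, ())}
    return frequent


toy_children = {"Event": ["Search", "Play"], "Search": ["SearchArtist", "SearchSong"]}
toy_sessions = [["SearchArtist", "Play"], ["SearchSong"], ["Play"], ["SearchArtist"]]
print(frequent_classes(toy_sessions, toy_children, ["Event"], min_support=0.5))
```

With a minimum support of 0.5 the search keeps Event, Search, Play and SearchArtist and prunes SearchSong. The pruning is safe because of the Apriori property: a refinement can never be more frequent than the form it refines.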
  48. Phase II: Find Frequent Sequences
       • Similar to GSP.
       • The taxonomy introduced by the isMoreGeneralThan relation is used.
       • The data set is converted:
           • each frequent atom-tree is mapped to an integer hash
           • each atom-tree in a session is replaced by the set of hashes of the atom-tree and its ancestors
       • The subsequence relation is modified to respect set inclusion.
  49. Phase II: Find Frequent Sequences
       INPUT:  session data containing semantic events and frequent atom-trees from phase one
       OUTPUT: list of frequent atom-tree patterns
       convert data set and frequent atom-trees to hashes
       while the candidate set is not empty
       {
           generate candidate set from previous frequent patterns
           count candidates
           select frequent candidates
       }
       reconvert patterns
  50. Experiments
       • Two sites are tested:
           • A music streaming site
               • Single-page, AJAX-based music listening site
               • 280K events in 75K sessions
               • Events are tracked via a JavaScript client
           • A mobile network operator's site
               • Content-heavy, mostly static, high-traffic web site
               • 1M pageviews in 175K sessions
               • Events are extracted from access logs
  51. Music Streaming Site: Events (figure)
  52. Music Streaming Site: Patterns
       • 38.9% of the sessions: user made a search
       • 9.3% of the sessions: user removed a song from her playlist
       • 95.5% of the sessions: user made an action about a song
       • 27.5% of the sessions: user added a song to a playlist
       • 139 frequent patterns are found:
           • A frequent pattern of length 2 describes a search performed after playing a particular song.
           • A frequent pattern of length 6 describes the sequential removal of songs from a playlist (due to the lack of a "clear playlist" button in the interface).
  53. Mobile Network Operator Site: Events
       • Two days of logs
       • More than 1 million pageviews occurred in 175K sessions.
       • Semi-automatically generated ontology
       • A total of 503 classes
       • 7-level hierarchy
  54. Mobile Network Operator Site: Ontology (figure)
  55. Mobile Network Operator Site: Patterns
       • 10% of the sessions: at least one search action
       • 38% of the sessions: a page not categorized in the ontology is visited
       • 71% of the sessions: user visits the home page (interesting)
       • Some other subjectively interesting patterns:
           • users' browsing behavior between subclasses of the content class
           • users visited the home page and then jumped to some specific content
           • users searched and moved on to a specific category
  56. Conclusions
       The proposed system:
       • is more generic than some of the previous semantic web usage mining attempts
       • captures the usage model more correctly
       • is intuitive and sound
       • uses most of the ontology constructs
       • is applicable to real web sites with varying domains
       • is parallelizable and suitable for MapReduce