Web Usage Mining and Using Ontology for Capturing Web Usage Semantic

Professor Ismail Toroslu gave a lecture on "Web Usage Mining and Using Ontology for Capturing Web Usage Semantic" in the Distinguished Lecturer Series - Leon The Mathematician.

More Information available at:
http://dls.csd.auth.gr

    Presentation Transcript

    • Web Usage Mining and Using Ontology for Capturing Web Usage Semantic. İsmail Hakkı Toroslu, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey
    • 08/28/11 PART I A New Approach for Reactive Web Usage Data Processing
      • Web Mining
      • Previous Session Reconstruction Heuristics
      • Smart-SRA
      • Agent Simulator
      • Experimental Results
      • Conclusion
      OUTLINE
    • Web Mining
      • Data Mining: Discover and retrieve useful and interesting patterns from a large dataset.
      • Web mining: Dataset is the huge web data.
      • Dimensions:
        • Web content mining
        • Web structure mining
        • Web usage mining
    • Web Usage Mining (WUM): application of data mining techniques to web log data in order to discover user access patterns.
      • Example user web access log (columns: IP Address | Request Time | Method | URL | Protocol | Success of Return Code | Number of Bytes Transmitted)
        • 144.123.121.23 | [25/Apr/2005:03:04:41-05] | GET | A.html | HTTP/1.0 | 200 | 3290
        • 144.123.121.23 | [25/Apr/2005:03:04:43-05] | GET | B.html | HTTP/1.0 | 200 | 2050
        • 144.123.121.23 | [25/Apr/2005:03:04:48-05] | GET | C.html | HTTP/1.0 | 200 | 4130
      Web Mining
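As a minimal sketch of how such log records could be read before any mining step, the Python snippet below parses whitespace-separated records in the column order of the example log and groups the requested URLs by client IP. The names `LogEntry`, `parse_line` and `requests_by_ip` are hypothetical, and treating one IP address as one user is a simplifying assumption that the slides do not state.

```python
from collections import defaultdict
from typing import NamedTuple


class LogEntry(NamedTuple):
    ip: str
    time: str
    method: str
    url: str
    protocol: str
    status: int
    size: int


def parse_line(line: str) -> LogEntry:
    """Parse one whitespace-separated record in the example log's column order."""
    ip, time, method, url, protocol, status, size = line.split()
    return LogEntry(ip, time.strip("[]"), method, url, protocol, int(status), int(size))


def requests_by_ip(lines):
    """Group successfully served URLs by client IP, preserving log order."""
    grouped = defaultdict(list)
    for entry in map(parse_line, lines):
        if entry.status == 200:  # keep only successful requests
            grouped[entry.ip].append(entry.url)
    return dict(grouped)


# The three records of the example log (ASCII hyphen used in the time zone offset).
LOG = [
    "144.123.121.23 [25/Apr/2005:03:04:41-05] GET A.html HTTP/1.0 200 3290",
    "144.123.121.23 [25/Apr/2005:03:04:43-05] GET B.html HTTP/1.0 200 2050",
    "144.123.121.23 [25/Apr/2005:03:04:48-05] GET C.html HTTP/1.0 200 4130",
]

print(requests_by_ip(LOG))
```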
    • Phases of Web Usage Mining
      • Pre-Processing: Raw Server Log → Session Reconstruction Heuristics → User Session File
      • Pattern Discovery: User Session File → Apriori, GSP, SPADE → Rules and Patterns
      • Pattern Analysis: Rules and Patterns → Interesting Knowledge → Applications
      Web Mining
    • Session Reconstruction
      • Sessions are reconstructed by using heuristics that select and group requests belonging to the same user session
      • Types:
        • Reactive: processing requests after they are handled by the web server,
        • Proactive: processing occurs during the interactive browsing of the web site by the user
      Previous Session Reconstruction Heuristics
      • Time-oriented heuristics
      • Navigation-oriented heuristic
      • New reactive session reconstruction technique: Smart-SRA combines these heuristics with "site topology" information in order to increase the accuracy of the reconstructed sessions
      Previous Session Reconstruction Heuristics
    • Example Web Topology Graph (figure) and Example Web Page Request Sequence
      • Page: P1 | P20 | P13 | P49 | P34 | P23
      • Timestamp: 0 | 6 | 15 | 29 | 32 | 47
      Previous Session Reconstruction Heuristics
    • Time-oriented Heuristics 1
      • Total session time: the duration of a discovered session is limited by a threshold
          • Discovered sessions (30 mins):
          • [P1, P20, P13, P49]
          • [P34, P23]
      Previous Session Reconstruction Heuristics
    • Time-oriented Heuristics 2
      • Page-stay time: the time spent on any page is limited by a threshold
          • Discovered sessions (10 mins):
          • [P1, P20, P13]
          • [P49, P34]
          • [P23]
      Previous Session Reconstruction Heuristics
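The two time-oriented heuristics can be sketched in a few lines of Python. This is an illustrative sketch, not the lecturer's implementation: the function names are hypothetical, the input is assumed to be one user's requests sorted by timestamp (in minutes), and the strict comparison against the threshold is an assumption.

```python
def split_by_total_time(requests, max_total=30):
    """Heuristic 1: a session may not span max_total minutes or more."""
    sessions = []
    for page, ts in requests:
        # Compare against the timestamp of the FIRST request of the open session.
        if sessions and ts - sessions[-1][0][1] < max_total:
            sessions[-1].append((page, ts))
        else:
            sessions.append([(page, ts)])
    return [[page for page, _ in s] for s in sessions]


def split_by_page_stay(requests, max_stay=10):
    """Heuristic 2: the gap between consecutive requests must stay below max_stay."""
    sessions = []
    for page, ts in requests:
        # Compare against the timestamp of the LAST request of the open session.
        if sessions and ts - sessions[-1][-1][1] < max_stay:
            sessions[-1].append((page, ts))
        else:
            sessions.append([(page, ts)])
    return [[page for page, _ in s] for s in sessions]


# Example request sequence from the slides: (page, timestamp in minutes).
REQUESTS = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]

print(split_by_total_time(REQUESTS))  # [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
print(split_by_page_stay(REQUESTS))   # [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```

Both outputs agree with the discovered sessions listed on the two slides.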
    • Navigation-Oriented Heuristic
      • Adding page WP_{N+1} to a session [WP_1, WP_2, …, WP_N]:
          • If WP_N has a hyperlink to WP_{N+1}, the session becomes
          • [WP_1, WP_2, …, WP_N, WP_{N+1}]
          • If WP_N does not have a hyperlink to WP_{N+1} and WP_Kmax is the nearest page having a hyperlink to WP_{N+1}, backward browser moves are added:
          • [WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_Kmax, WP_{N+1}]
      Previous Session Reconstruction Heuristics
    • Navigation-Oriented Heuristic: example trace (columns: Current Session | Condition | New Page)
      • [ ] | (none) | P1
      • [P1] | Link[P1, P20] = 1 | P20
      • [P1, P20] | Link[P20, P13] = 0, Link[P1, P13] = 1 | P13
      • [P1, P20, P1, P13] | Link[P13, P49] = 1 | P49
      • [P1, P20, P1, P13, P49] | Link[P49, P34] = 0, Link[P13, P34] = 1 | P34
      • [P1, P20, P1, P13, P49, P13, P34] | Link[P34, P23] = 1 | P23
      • Resulting session: [P1, P20, P1, P13, P49, P13, P34, P23]
      Previous Session Reconstruction Heuristics
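A possible Python sketch of the navigation-oriented heuristic follows. The function name and the `LINKS` set are hypothetical; the links are only those that the trace marks with `= 1`. Starting a new session when no page of the current session links to the new page is an assumption, since the slides do not cover that case.

```python
def navigation_sessions(pages, links):
    """Rebuild sessions, inserting backward browser moves where needed."""
    sessions, current = [], []
    for page in pages:
        # Index of the nearest page (searching from the end) that links to `page`.
        k = next(
            (i for i in range(len(current) - 1, -1, -1) if (current[i], page) in links),
            None,
        )
        if k is None:
            # Assumption: no referrer in the current session, so start a new one.
            if current:
                sessions.append(current)
            current = [page]
        else:
            # Back moves revisit WP_{N-1}, ..., WP_Kmax before the new page.
            current = current + current[k:-1][::-1] + [page]
    if current:
        sessions.append(current)
    return sessions


# Links marked "= 1" in the example trace.
LINKS = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"), ("P34", "P23")}

print(navigation_sessions(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS))
```

On the example request sequence this yields the single session [P1, P20, P1, P13, P49, P13, P34, P23], matching the final row of the trace.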
    • Smart-SRA
      Contains Two Phases:
      • Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria
          • Each candidate session satisfies the overall session duration time limit
      • Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that:
        • between each consecutive page pair in a session there is a hyperlink from the previous page to the next page
        • the page-stay time criterion is also satisfied
      • Smart-SRA adds the referrer constraints of the topology rule while eliminating the need for inserting backward browser movements.
    • Smart-SRA: Steps of Phase 2
      • Process a candidate session from left to right by repeating the following steps until the candidate session is empty:
        • 1. Determine the web pages without any referrer (on their left) and remove them from the candidate session
        • 2. For each one of these pages
          • For each previously constructed session
            • If there is a hyperlink from the last page of the session to the web page (and the page-stay time constraint is satisfied), then append the web page to the session
        • 3. Remove non-maximal sessions
    • Example Candidate Session and Example Web Topology (figure)
      • Page: P1 | P20 | P13 | P49 | P34 | P23
      • Timestamp: 0 | 6 | 9 | 12 | 14 | 15
      Smart-SRA
    • Smart-SRA: Phase 2 trace on the example candidate session
      • Iteration 1
        • Candidate Session: [P1, P20, P13, P49, P34, P23]
        • New Session Set (before): (empty)
        • Temp Page Set: {P1}
        • Temp Session Set: [P1]
        • New Session Set (after): [P1]
      • Iteration 2
        • Candidate Session: [P20, P13, P49, P34, P23]
        • New Session Set (before): [P1]
        • Temp Page Set: {P20, P13}
        • Temp Session Set: [P1, P20], [P1, P13]
        • New Session Set (after): [P1, P20], [P1, P13]
      • Iteration 3
        • Candidate Session: [P49, P34, P23]
        • New Session Set (before): [P1, P20], [P1, P13]
        • Temp Page Set: {P49, P34}
        • Temp Session Set: [P1, P13, P34], [P1, P13, P49]
        • New Session Set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]
      • Iteration 4
        • Candidate Session: [P23]
        • New Session Set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
        • Temp Page Set: {P23}
        • Temp Session Set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
        • New Session Set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
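The Phase 2 steps can be sketched in Python as below. This is a hedged reading of the slides rather than the authors' code: `smart_sra_phase2` and `LINKS` are hypothetical names, the link set is inferred from the example trace because the topology figure is not part of the transcript, the 10-minute page-stay threshold with a strict comparison is an assumption, and a page that cannot extend any session after the first iteration is simply dropped.

```python
def smart_sra_phase2(candidate, links, max_stay=10):
    """Partition one candidate session into maximal link-connected sessions."""
    remaining = list(candidate)  # (page, timestamp) pairs, left to right
    sessions = []
    while remaining:
        # Step 1: pages with no referrer among the pages to their left.
        start = [
            i for i, (page, _) in enumerate(remaining)
            if not any((prev, page) in links for prev, _ in remaining[:i])
        ]
        picked = [remaining[i] for i in start]
        remaining = [r for i, r in enumerate(remaining) if i not in start]
        if not sessions:
            # Assumption: in the first iteration these pages seed the sessions.
            sessions = [[r] for r in picked]
            continue
        # Step 2: try to append every picked page to every existing session.
        extended, used = [], set()
        for page, ts in picked:
            for idx, session in enumerate(sessions):
                last_page, last_ts = session[-1]
                if (last_page, page) in links and ts - last_ts < max_stay:
                    extended.append(session + [(page, ts)])
                    used.add(idx)
        # Step 3: sessions that were extended are no longer maximal.
        sessions = [s for i, s in enumerate(sessions) if i not in used] + extended
    return [[page for page, _ in s] for s in sessions]


# Links inferred from the example trace (assumption).
LINKS = {
    ("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
    ("P49", "P23"), ("P34", "P23"), ("P20", "P23"),
}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]

for session in smart_sra_phase2(CANDIDATE, LINKS):
    print(session)
```

On the example candidate session the sketch ends with the same three sessions as iteration 4 of the trace, and no backward moves are inserted.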
    • Agent Simulator
      • Models the behavior of web users and generates web user navigation and the log data kept by the web server
      • Used to compare the performances of alternative session reconstruction heuristics
      • Uses four primitive behaviors to simulate the complex navigation of a web user.
    • Agent Simulator, User-Behavior I: the web user can start a new session with any one of the possible entry pages of the web site
    • Agent Simulator, User-Behavior II: the web user can select a new page having a link from the most recently accessed page
    • Agent Simulator, User-Behavior III: the web user can select as the next page one having a link from any one of the previously browsed pages
    • Agent Simulator, User-Behavior IV: the web user can terminate the session
    • Parameters for simulating behavior of web user
      • Session Termination Probability (STP)
      • Link from Previous pages Probability (LPP)
      • New Initial page Probability (NIP)
      Agent Simulator
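A rough sketch of how the four behaviors and the three probabilities might drive one simulated agent is shown below. Everything about the control flow is an assumption made for illustration: the order in which STP, NIP and LPP are tested, resetting the visited list on a new entry page, stopping at a page without outgoing links, and returning only the request sequence a server would log. The function and variable names are hypothetical.

```python
import random


def simulate_agent(links, entry_pages, stp, lpp, nip, rng, max_steps=100):
    """Generate the request sequence of one simulated web user."""
    outgoing = {}
    for src, dst in links:
        outgoing.setdefault(src, []).append(dst)

    page = rng.choice(entry_pages)          # Behavior I: start at an entry page
    requests, visited = [page], [page]
    for _ in range(max_steps):
        if rng.random() < stp:              # Behavior IV: terminate the session
            break
        if rng.random() < nip:              # Behavior I again: new initial page
            page = rng.choice(entry_pages)
            visited = [page]
        else:
            if rng.random() < lpp and len(visited) > 1:
                source = rng.choice(visited[:-1])   # Behavior III: earlier page
            else:
                source = visited[-1]                # Behavior II: latest page
            targets = outgoing.get(source, [])
            if not targets:                 # Assumption: a dead end stops the agent
                break
            page = rng.choice(targets)
            visited.append(page)
        requests.append(page)
    return requests


CHAIN = {("A", "B"), ("B", "C")}
rng = random.Random(0)
print(simulate_agent(CHAIN, ["A"], stp=0.05, lpp=0.3, nip=0.3, rng=rng))
```

With STP set to 1 the agent stops immediately after its entry page, and with all three probabilities set to 0 it simply follows links from the most recent page until it reaches a page without outgoing links.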
    • Heuristics Tested
      • Time-oriented heuristic (heur1): total time < 30 min
      • Time-oriented heuristic (heur2): page stay < 10 min
      • Navigation-oriented heuristic (heur3)
      • Smart-SRA heuristic (heur4)
      Experimental Results
    • Accuracy
      • A reconstructed session H captures a real session R if R occurs as a subsequence of H (R ⊏ H); as the examples show, the pages of R must appear consecutively in H
      • R = [P1, P3, P5]
      • H = [P9, P1, P3, P5, P8] => R ⊏ H
      • H = [P1, P9, P3, P5, P8] => R ⋢ H
      Experimental Results
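The capture test used by the accuracy measure can be written as a short Python function. The name `captures` is hypothetical, and the contiguous-match interpretation is taken from the two examples on the slide.

```python
def captures(reconstructed, real):
    """True if `real` occurs as a contiguous run inside `reconstructed`."""
    n = len(real)
    return any(
        reconstructed[i:i + n] == real
        for i in range(len(reconstructed) - n + 1)
    )


R = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], R))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], R))  # False
```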
    • Parameters for generating user sessions and web topology
      • Number of web pages (nodes) in topology: 300
      • Average outdegree: 15
      • Average page stay time: 2.2 min
      • Deviation for page stay time: 0.5 min
      • Number of agents: 10000
      • STP: fixed 5%, range 1%-20%
      • LPP: fixed 30%, range 0%-90%
      • NIP: fixed 30%, range 0%-90%
      Experimental Results
    • Experimental Results: Accuracy vs. STP (figure)
    • Experimental Results: Accuracy vs. LPP (figure)
    • Experimental Results: Accuracy vs. NIP (figure)
    • Conclusion
      • New session reconstruction heuristic: Smart-SRA
        • Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous one to the next one)
        • No insertion of artificial browser (back) requests in order to prevent unrelated consecutive requests
        • Only maximal sessions
      • Agent simulator
      • Accuracy measure
      • Experimental results show Smart-SRA outperforms previous heuristics
    • PART II Semantically Enriched Event Based Model for Web Usage Mining
      • Introduction
      • Related Work
      • Semantic Event Based Sessions
      • Formal Definition of Semantic Events
      • Algorithms for Mining Semantic Event Patterns
      • Experimental Results
      • Conclusion
      OUTLINE
      • Traditional WUM is based on pageviews, but the user interaction model is changing
      • Users do not care about pageviews; they use a web site to achieve high-level goals such as
        • Finding and viewing a video
        • Buying tickets
        • Searching for the nearest Italian restaurant
        • Listening to a song, etc.
      Introduction
      • We should analyze usage data as a series of “events”
        • Search Mediterranean Restaurants
        • Search Italian Restaurants
        • View the reviews for Restaurant A
        • View the reviews for Restaurant B
        • Click the web site link of Restaurant A
      • Incorporating semantic knowledge in the process is the logical choice
        • A method should be devised to capture user behavior
        • Usage data should be mapped to semantic space
        • An algorithm should be developed to exploit semantic relations
      Introduction
      • In this work we propose methods for:
        • tracking and logging domain-level events
        • injecting semantics into events
        • semantic ordering of events
        • an algorithm for computing sequences of frequent events
      • The proposed system was tested with two web sites
        • Music Streaming Site
        • Mobile Network Operator’s Site
      Introduction
      • Events are conceptual actions that the user performs to achieve a certain effect
      • Events are used to capture business actions that are defined in the site’s domain
      • The site admin is responsible for defining and tracking events
      • Events are tracked via a JavaScript client
      Semantic Event Based Sessions
      • Example events:
        • Play a video event
        • Add to shopping cart event
        • Add friend action
      • Sometimes we may be interested in properties of events, such as
        • “query” property of a “search event”
        • “category” property of a “view video event”
      Semantic Event Based Sessions
      • Every event is defined as an object.
      • Objects can have properties which relate an object
        • with another object or
        • with a datatype value
      • The relations between objects are captured in a tree
        • Each individual and property is a node
        • Object property nodes have an object as a parent and an object as a child
      Semantic Event Based Sessions
      • A sample event from a hypothetical video viewing site
      • Events can be used to capture all relevant actions of the user including plain pageviews
      • With a mapping of events to the web site’s ontology we can define ‘semantic events’
      • Events are mapped to semantic space by using the class and property names in the ontology
      • As a result of this mapping, the data to be mined is
        • An ontology containing the terminological part
        • Logs containing semantic objects
      Events as Semantic Objects
      • A Session is an ordered set of atom-trees that correspond to the events of a single user in a certain browsing activity
      • An Atom-tree is a tree of connected atoms. The atom-tree represents a domain event in the web site's ontology
      • An Atom is either an individual of a class, a datatype property assertion, or an object property assertion
      Definitions
      • A Pattern is an ordered set of atom-trees
      • A session S supports the pattern Q iff Q is a subsequence of S, where the isMoreGeneralThan relation is used instead of equality in determining the subsequence relation
      Definitions
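To illustrate the support definition, the sketch below reduces each atom-tree to a single class name in a small taxonomy, which is a strong simplification of the ontology-based atom-trees in the slides. The taxonomy `PARENT`, the class names, and the choice to let a class count as more general than itself are assumptions made for the example.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
    "Search": "Event",
    "ViewReview": "Event",
}


def is_more_general_than(general, specific, parent=PARENT):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if general == specific:
            return True
        specific = parent.get(specific)
    return False


def supports(session, pattern, parent=PARENT):
    """Session supports pattern iff pattern is a subsequence up to generality."""
    events = iter(session)
    # Each pattern item must match some later event; `any` consumes the iterator.
    return all(
        any(is_more_general_than(item, event, parent) for event in events)
        for item in pattern
    )


def support(sessions, pattern, parent=PARENT):
    """Fraction of sessions that support the pattern."""
    return sum(supports(s, pattern, parent) for s in sessions) / len(sessions)


SESSIONS = [
    ["SearchItalian", "ViewReview"],
    ["SearchMediterranean"],
    ["ViewReview", "SearchItalian"],
]

print(support(SESSIONS, ["Search"]))
print(support(SESSIONS, ["Search", "ViewReview"]))
```

Every session supports the general pattern [Search], while only the first one supports [Search, ViewReview], because the order of events matters.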
      • For a given set of sessions, the problem is to find the set of patterns with support greater than the threshold value, minSupport
      • Two-phase Apriori-like algorithm
        • The first phase finds frequent atom-trees (patterns containing a single atom-tree)
        • The second phase searches for frequent atom-tree patterns
      Algorithm
      • Apriori property:
      • If atom-tree a1 is more general than atom-tree a2, then the support of a1 is greater than or equal to the support of a2
      • getMostGeneralForms generates the set of trees that are
        • more general than the given atom-tree
        • not less general than any other atom-tree generated
      • For level-wise search, a one-step refinement operator is defined over the set of individual atoms and object and datatype property assertion atoms.
      Phase I: Find Frequent Atom-trees
      • Given an atom, the one-step refinement operator produces a refined atom by moving to a subclass, moving to a sub-property, or refining a child of the node.
      • A one-step refinement over the set of atom-trees returns a set of atom-trees by either
        • Refining a single node
        • Adding the most general form of a node
      • When given two atom-trees, one-step refinement returns atom-trees that are refinements of the second and more similar to the first
      Phase I: Find Frequent Atom-trees
      • INPUT: session data containing semantic events
      • OUTPUT: list of frequent atom-trees
      • generate the initial candidate set using getMostGeneralForms of all events
      • iterate until no candidates can be generated
      • {
      •   compare the candidate set with the data set:
      •   for each atom-tree in the data set
      •   {
      •     increment the frequency of each atom-tree in the candidate set that is more general than the atom-tree
      •   }
      •   filter out the candidates that are less frequent than minSupport
      •   generate the next candidate set using the oneStepRefinement operator on the current candidate set atom-trees
      • }
      Phase I: Find Frequent Atom-trees
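The level-wise search of Phase I can be illustrated with a heavily simplified Python sketch. Here an atom-tree is reduced to a single class name, the most general form is the root of a hypothetical taxonomy, and one-step refinement means moving to the direct subclasses; the real operators in the slides work on whole atom-trees with property assertions. The names and the use of an absolute count for `min_support` are assumptions.

```python
from collections import Counter

# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
    "Search": "Event",
    "ViewReview": "Event",
}


def frequent_classes(events, parent, min_support):
    """General-to-specific search for classes covering >= min_support events."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def ancestors_or_self(cls):
        while cls is not None:
            yield cls
            cls = parent.get(cls)

    def most_general_form(cls):
        while parent.get(cls) is not None:
            cls = parent[cls]
        return cls

    candidates = {most_general_form(e) for e in events}
    frequent = {}
    while candidates:
        counts = Counter()
        for event in events:
            # Count every candidate that is more general than this event.
            for cls in ancestors_or_self(event):
                if cls in candidates:
                    counts[cls] += 1
        kept = {c: counts[c] for c in candidates if counts[c] >= min_support}
        frequent.update(kept)
        # One-step refinement (simplified): direct subclasses of frequent classes.
        candidates = {sub for c in kept for sub in children.get(c, [])}
    return frequent


EVENTS = ["SearchItalian", "SearchItalian", "SearchMediterranean", "ViewReview"]
print(frequent_classes(EVENTS, PARENT, min_support=2))
```

Only frequent candidates are refined, which relies on the Apriori property: a refinement can never be more frequent than the more general form it came from.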
      • Similar to GSP
      • The taxonomy introduced by isMoreGeneralThan relation is used
      • Data set is converted:
        • each frequent atom-tree is mapped to an integer hash
        • each atom-tree in a session is replaced by a set of hashes of the atom-tree and its ancestors
      • Subsequence relation is modified to respect set inclusion
      Phase II: Find Frequent Sequences
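The data conversion described for Phase II might look like the following sketch, again with atom-trees reduced to class names in a hypothetical taxonomy. Sequential integer ids stand in for the integer hashes mentioned on the slide, and all function names are assumptions.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "SearchItalian": "Search",
    "SearchMediterranean": "Search",
    "Search": "Event",
    "ViewReview": "Event",
}


def convert_sessions(sessions, frequent, parent):
    """Replace each event by the id set of its frequent ancestors-or-self."""
    ids = {name: i for i, name in enumerate(sorted(frequent))}

    def id_set(event):
        found = set()
        while event is not None:
            if event in ids:
                found.add(ids[event])
            event = parent.get(event)
        return frozenset(found)

    return ids, [[id_set(e) for e in session] for session in sessions]


def contains_pattern(converted_session, pattern_ids):
    """Subsequence test where an item matches if its id is in the event's set."""
    items = iter(converted_session)
    return all(any(pid in item for item in items) for pid in pattern_ids)


FREQUENT = {"Event", "Search", "SearchItalian"}
ids, converted = convert_sessions([["SearchItalian", "ViewReview"]], FREQUENT, PARENT)

print(converted[0])
print(contains_pattern(converted[0], [ids["Search"], ids["Event"]]))
```

After the conversion, the generality test between atom-trees becomes a plain set-membership check, so a GSP-style candidate counter can be reused with only the matching step changed.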
      • INPUT: session data containing semantic events and frequent atom-trees from phase one
      • OUTPUT: list of frequent atom-tree patterns
      • convert the data set and the frequent atom-trees to hashes
      • while the candidate set is not empty
      • {
      •   generate the candidate set from the previous frequent patterns
      •   count candidates
      •   select frequent candidates
      • }
      • reconvert patterns
      Phase II: Find Frequent Sequences
      • Two sites are tested:
        • A music streaming site
          • Single-page, AJAX-based music listening site
          • 280K events in 75K sessions
          • Events are tracked via a JavaScript client
        • A mobile network operator’s site
          • Content-heavy, mostly static, high-traffic web site
          • 1M pageviews in 175K sessions
          • Events are extracted from access logs
      Experiments
    • Music Streaming Site - Events
      • 38.9% of the sessions: user made a search
      • 9.3% of the sessions: user removed a song from her playlist
      • 95.5% of the sessions: user made an action about a song
      • 27.5% of the sessions: user added a song to playlist
      • 139 frequent patterns are found
        • A frequent pattern of length 2 describes a search performed after playing a particular song
        • A frequent pattern of length 6 describes sequential removal of songs from the playlist (due to the lack of a ‘clear playlist’ button in the interface)
      Music Streaming Site - Patterns
      • Two days of logs
      • More than 1 million pageviews occurred in 175K sessions.
      • Semi-automatically generated ontology
      • A total of 503 classes
      • 7-level hierarchy
      Mobile Network Operator Site - Events
    • Mobile Network Operator Site - Ontology
      • 10% of the sessions: at least one search action
      • 38% of the sessions: a page not categorized in the ontology is visited
      • 71% of the sessions: user visits the home page (interesting)
      • Some other subjectively interesting patterns
        • users’ browsing behaviors between subclasses of the content class
        • users visited the home page, then jumped to some specific content
        • users searched and moved on to a specific category
      Mobile Network Operator Site - Patterns
      • The proposed system
        • is more generic than some of the previous semantic web usage mining attempts
        • captures the usage model more correctly
        • is intuitive and sound
        • uses most of the ontology constructs
        • is applicable to real web sites with varying domains
        • is parallelizable and suitable for MapReduce
      Conclusions