Transcript

  • 1. Data and Web Mining, 825368 Paolo Gobbo
    Smart Miner: A New Framework for Mining Large Scale Web Usage Data (Bayir, Toroslu, Cosar, Fidan)
  • 2. Data Mining on Web
    Web mining: discover and retrieve useful and interesting patterns from large web datasets
    • web content mining: text and multimedia documents (the real data in web pages)
    • web structure mining: hyperlink structure (data that describes the organization of the content)
    • web usage mining: web log records (data that describes the pattern of usage of web pages)
  • 3. PreProcessing
    • INPUT: Site File, Access Log, Referrer Log, Agent Log, Registration
    • PREPROCESSING: Site Crawler, Data Cleaning, Path Completion, Session Identification, User Identification, Transaction Identification
    • Files in the diagram: Site Topology, User Session File, Transaction File
    • SQL Query
  • 4. Session Identification
    Partitioning each user's activities into sequences (sessions) of entries from web request logs
    • time oriented heuristics: temporal boundaries (session length, page-stay)
    • navigation oriented heuristics: link between web pages
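The time oriented heuristics can be illustrated with a minimal Python sketch. The record shape `(page, timestamp)`, the function name `split_sessions`, and both thresholds (30 and 10 minutes) are assumptions chosen for illustration, not values taken from the slide.

```python
from typing import List, Tuple

Request = Tuple[str, float]  # (page, timestamp in minutes); hypothetical record shape


def split_sessions(
    requests: List[Request],
    max_session_length: float = 30.0,  # assumed session length threshold
    max_page_stay: float = 10.0,       # assumed page-stay threshold
) -> List[List[Request]]:
    """Split one user's requests into sessions using two time oriented rules."""
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for page, ts in sorted(requests, key=lambda r: r[1]):
        if current:
            # Session length rule: total duration since the first request.
            too_long = ts - current[0][1] > max_session_length
            # Page-stay rule: gap since the previous request.
            idle = ts - current[-1][1] > max_page_stay
            if too_long or idle:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions
```

A navigation oriented heuristic would additionally check that a hyperlink exists between consecutive pages; that check is left out of this sketch.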
  • 5. Sequential Mining
    Association mining with the order of transactions
    • itemset/element: a set of items
    • sequence: an ordered list of elements, each of which is an itemset
    • sequence size: number of itemsets/elements
    • sequence length: number of items
    • subsequence
    Given a set of data sequences, find all sequences with a user-specified minimum support
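The terms above can be made concrete with a short Python sketch. It assumes the usual containment definition from sequential pattern mining (a sequence a is a subsequence of b when the elements of a can be matched, in order, to distinct elements of b that are supersets of them), and it assumes support is the fraction of data sequences that contain the pattern. The function names are hypothetical.

```python
from typing import List, Set

Sequence = List[Set[int]]  # a sequence is an ordered list of itemsets


def sequence_size(seq: Sequence) -> int:
    """Number of itemsets/elements in the sequence."""
    return len(seq)


def sequence_length(seq: Sequence) -> int:
    """Number of items in the sequence."""
    return sum(len(element) for element in seq)


def is_subsequence(a: Sequence, b: Sequence) -> bool:
    """True if each element of a is a subset of a distinct element of b, in order."""
    pos = 0
    for element in a:
        # Greedily take the earliest element of b that contains this itemset.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def support(pattern: Sequence, database: List[Sequence]) -> float:
    """Fraction of data sequences that contain the pattern (assumed definition)."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)
```

Note that `[{30}, {70}]` is not contained in `[{30, 50, 70}]`: both items sit in the same element, so there is no order between them.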
  • 6. Sequential Mining algorithms
    GSP, APrioriAll, APrioriSome
    • Sort Phase: transforms customer transactions into customer sequences
    • LargeItemSet Phase: generates the set of large itemsets
    • Transformation Phase: represents customer sequences based on large itemsets
    • Sequence Phase: derives large k-sequences based on large (k-1)-sequences
    • Maximal Phase: prunes non-maximal sequences
  • 7. Smart-SRA session
    A Smart-SRA session is a path in the web site that satisfies:
    • timestamp ordering (time oriented) rule
    • topology (navigation oriented) rule
    • maximality rule
  • 8. Smart Miner
    • DATA STREAM
    • SMART-SRA (SESSION CONSTRUCTION): Candidate Session, then Smart Session
    • SEQUENTIAL MINING: Sequential AprioriAll
    • FREQUENT ACCESS PATTERN
  • 9. Smart Miner: First Phase Smart SRA
    • time oriented heuristics
        • session length
        • page-stay
      • no backward movement
    Candidate session construction (Web Site Graph over pages P1, P13, P20, P49, P34, P23)
    Candidate Session
      Page       P1   P20  P13  P49  P34  P23
      TimeStamp  0    6    9    12   14   15
    Second table on the slide
      Page       P13  P20  P23  P49
      TimeStamp  0    5    9    10
  • 10. Smart Miner: Second Phase Smart SRA
    • time oriented heuristics
        • inherited session length
        • re-check page-stay
      • no backward movement
      • maximality
      • topology rule
    Smart session construction (Web Site Graph over pages P1, P13, P20, P49, P34, P23)
      Page       P1   P20  P13  P49  P34  P23
      TimeStamp  0    6    9    12   14   15
    Smart Session
      [P1, P13, P34, P23]
      [P1, P13, P49, P23]
      [P1, P20, P23]
  • 11. Smart Miner: Second Phase
    SMART SESSION RECONSTRUCTION
    foreach CandidateSession in CandSessionSet
      NewSessionSet = {}
      while CandidateSession ≠ Ø
        TSessionSet = {}; TPageSet = {}
        // pages with no incoming link
        foreach Page_i in CandidateSession
          StartPageFlag = TRUE
          foreach Page_j in CandidateSession with j < i
            if Link[Page_j, Page_i] and TimeDiff(Page_i, Page_j) ≤ σ then
              StartPageFlag = FALSE
          endfor
          if StartPageFlag then TPageSet = TPageSet U {Page_i}
        endfor
        CandidateSession = CandidateSession − TPageSet
        if NewSessionSet = {} then
          // session set construction
          foreach Page_i in TPageSet
            TSessionSet = TSessionSet U {[Page_i]}
        else
          // session set extension
          foreach Page_i in TPageSet
            foreach Session_j in NewSessionSet
              if Link[Last(Session_j), Page_i] and TimeDiff(Last(Session_j), Page_i) ≤ σ then
                TSession = Session_j
                TSession.mark = UNEXTENDED
                TSession = TSession • Page_i
                TSessionSet = TSessionSet U {TSession}
                Session_j.mark = EXTENDED
              endif
            endfor
          endfor
        endif
        // session set extension with the sessions that were not extended
        foreach Session_j in NewSessionSet
          if Session_j.mark ≠ EXTENDED then TSessionSet = TSessionSet U {Session_j}
        endfor
        NewSessionSet = TSessionSet
      endwhile
    endfor
  • 12. Session Construction Example
    • Iteration 1
        CandidateSession: [P1, P20, P13, P49, P34, P23]
        TPageSet: {P1}
        NewSessionSet: [P1]
    • Iteration 2
        CandidateSession: [P20, P13, P49, P34, P23]
        TPageSet: {P20, P13}
        NewSessionSet: [P1, P20] [P1, P13]
    • Iteration 3
        CandidateSession: [P49, P34, P23]
        TPageSet: {P49, P34}
        NewSessionSet: [P1, P13, P34] [P1, P13, P49] [P1, P20]
    • Iteration 4
        CandidateSession: [P23]
        TPageSet: {P23}
        NewSessionSet: [P1, P13, P34, P23] [P1, P13, P49, P23] [P1, P20, P23]
    Web Site Graph over pages P1, P13, P20, P49, P34, P23
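The second phase can be traced with a minimal Python sketch that reproduces the example sessions. Several things here are assumptions rather than content of the slides: the link set is inferred from the three resulting smart sessions (the web site graph itself is a figure), the page-stay bound σ is set to 10, and each iteration removes the selected start pages from the candidate session, as the shrinking CandidateSession column of the example suggests.

```python
from typing import List, Set, Tuple

Visit = Tuple[str, int]  # (page, timestamp)


def smart_sessions(candidate: List[Visit], links: Set[Tuple[str, str]], sigma: int) -> List[List[str]]:
    """Rebuild link-consistent sessions from one candidate session sorted by timestamp."""
    remaining = list(candidate)
    sessions: List[List[Visit]] = []
    while remaining:
        # Start pages: no earlier remaining page links to them within sigma.
        start_pages = [
            (p, t)
            for i, (p, t) in enumerate(remaining)
            if not any((q, p) in links and t - u <= sigma for q, u in remaining[:i])
        ]
        remaining = [visit for visit in remaining if visit not in start_pages]
        if not sessions:
            # Session set construction: each start page opens a session.
            new_sessions = [[visit] for visit in start_pages]
        else:
            # Session set extension: append start pages to sessions that link to them.
            new_sessions = []
            extended = set()
            for p, t in start_pages:
                for idx, session in enumerate(sessions):
                    last_page, last_time = session[-1]
                    if (last_page, p) in links and t - last_time <= sigma:
                        new_sessions.append(session + [(p, t)])
                        extended.add(idx)
            # Sessions that were not extended are carried over unchanged.
            new_sessions += [s for idx, s in enumerate(sessions) if idx not in extended]
        sessions = new_sessions
    return [[page for page, _ in session] for session in sessions]


# Link set inferred from the example sessions (an assumption).
LINKS = {
    ("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
    ("P20", "P23"), ("P34", "P23"), ("P49", "P23"),
}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
RESULT = smart_sessions(CANDIDATE, LINKS, sigma=10)
```

With these inputs the sketch yields the three smart sessions of the example, although not necessarily in the order shown on the slide.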
  • 13. Sequential APrioriAll
    Pruning
      • topological constraint
        • for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
      • string matching constraint
        • a session S supports a pattern P if and only if P is a subsequence of S that does not violate string matching, i.e. the pages of P appear consecutively in S
            • <1,2,3> supports <1,2>
            • <1,2,3> does not support <1,3>
    • pruning is applied during candidate sequence generation, before the support of the candidates is calculated
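Both constraints are small enough to sketch in Python. The function names and the representation of the topology as a set of `(source, target)` pairs are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def satisfies_topology(pattern: List[int], links: Set[Tuple[int, int]]) -> bool:
    """Topological constraint: each page must link to the page that follows it."""
    return all((a, b) in links for a, b in zip(pattern, pattern[1:]))


def supports(session: List[int], pattern: List[int]) -> bool:
    """String matching constraint: the pattern must occur as a contiguous run."""
    m = len(pattern)
    return any(session[i:i + m] == pattern for i in range(len(session) - m + 1))
```

For the slide's example, `supports([1, 2, 3], [1, 2])` holds while `supports([1, 2, 3], [1, 3])` does not, because page 2 sits between 1 and 3.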
  • 14. Support
    Support(I, S), where I is a pattern and S is the set of reconstructed user sessions
    • computed in one scan through the transaction database by keeping the candidate sessions in a hash map
  • 15. Sequential Apriori Algorithm
    SEQUENTIAL APRIORI
    INPUT:
      minimum support frequency: δ
      reconstructed sessions: S
      topology information: Link
      set of all web pages: P
    OUTPUT:
      set of maximal frequent patterns: Max
    // length-1 candidate pattern generation
    L_1 = {}
    for i = 1 to |P| do
      if Support([P_i], S) > δ then L_1 = L_1 U {[P_i]}
    // length-(k+1) candidate pattern generation
    for k = 1 to N-1 do
      if L_k = Ø then
        Halt                                  // no further generation
      else
        L_{k+1} = {}
        foreach I_i in L_k
          foreach P_j in P
            if Link[Last(I_i), P_j] then      // joining step, topological rule
              T = I_i • P_j                   // append page
              if Support(T, S) > δ then       // pruning step, support rule
                T.maximal = true              // maximality rule
                I_i.maximal = false
                V = [T_2, T_3, …, T_|T|]
                if V in L_k then V.maximal = false
                L_{k+1} = L_{k+1} U {T}
              endif
            endif
          endfor
        endfor
      endif
    // union of the sets of maximal patterns
    Max = {}
    for k = 1 to N-1 do
      Max = Max U {S | S in L_k and S.maximal = true}
    endfor
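A compact Python sketch of the same procedure follows. It is an illustration under stated assumptions, not the authors' implementation: support is taken to be the fraction of sessions that contain the pattern as a contiguous run, length-1 patterns start out marked maximal, and the names `sequential_apriori`, `contains` and `support` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, ...]


def contains(session: List[str], pattern: Pattern) -> bool:
    """String matching: the pattern occurs as a contiguous run of the session."""
    m = len(pattern)
    return any(tuple(session[i:i + m]) == pattern for i in range(len(session) - m + 1))


def support(pattern: Pattern, sessions: List[List[str]]) -> float:
    """Fraction of sessions supporting the pattern (assumed definition)."""
    return sum(contains(s, pattern) for s in sessions) / len(sessions)


def sequential_apriori(
    sessions: List[List[str]], links: Set[Tuple[str, str]], pages: List[str], delta: float
) -> List[Pattern]:
    # Each level maps a frequent pattern to its "maximal" flag.
    level: Dict[Pattern, bool] = {
        (p,): True for p in pages if support((p,), sessions) > delta
    }
    levels = [level]
    while level:
        next_level: Dict[Pattern, bool] = {}
        for pattern in level:
            for page in pages:
                if (pattern[-1], page) not in links:  # topological rule
                    continue
                candidate = pattern + (page,)         # append page
                if support(candidate, sessions) > delta:  # support rule
                    next_level[candidate] = True
                    level[pattern] = False            # prefix is no longer maximal
                    suffix = candidate[1:]
                    if suffix in level:
                        level[suffix] = False         # suffix is no longer maximal
        levels.append(next_level)
        level = next_level
    return [p for lvl in levels for p, maximal in lvl.items() if maximal]
```

Extending a pattern only along existing hyperlinks keeps the candidate set small, since a candidate that violates the topology is never generated and therefore never counted.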
  • 16. Accuracy Metric
    • compared sets: the frequent maximal patterns of the agent simulator and the frequent maximal patterns of the heuristic
    • measures: recall, precision, accuracy
  • 17. Agent Simulator
    Agent Simulator Parameters
    • STP (Session Termination Probability): probability of terminating the session
    • LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one
    • LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page
    • NIP (New Initial page Probability): probability of selecting one of the starting pages of a web site during the navigation
  • 18. Simulated Data
    Web topology
    • number of web pages from 10 to 1000
    • number of users from 1000 to 10000
    Agent simulator parameters
    • NIP/STP 0.1 , 0.2 , 0.5 , 1.0 , 2.0 , 5.0 , 10.0
    • LPC/LPP 0.1 , 0.2 , 0.5 , 1.0 , 2.0 , 5.0 , 10.0
    • 49 different cases
    Support parameter
    • Values 0.001 , 0.0025 , 0.005 , 0.0075 , 0.01
    Runs of agent simulator
    • 10 random different runs
  • 19. Results on Simulated Data
    Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA, NIP = New Initial Page Probability, STP = Session Termination Probability
  • 20. Results on Simulated Data
    Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
  • 21. Real Data
    AGMLAB's company web site
    • 4 months user activity
    • 3801 users
    • 30 minutes session time-out
    • 10 web pages
    • link graph densely connected
    User Activity
    • action tracking program
    • cookies
    • cookie information recorded to a server log file
  • 22. Results on Real Data
    Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
  • 23. Scalability
    • charts: Performance on 100 GB Data, Performance with 50 nodes
    • MAP/REDUCE paradigm: each node processes a block of the session database, computing the local frequency of each candidate pattern
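The map/reduce counting step can be sketched on a single machine in plain Python, with lists standing in for the blocks that the nodes would process. The function names and the use of `collections.Counter` are assumptions for illustration; no cluster framework is involved.

```python
from collections import Counter
from functools import reduce
from typing import List, Tuple

Pattern = Tuple[str, ...]


def contains(session: List[str], pattern: Pattern) -> bool:
    """The pattern occurs as a contiguous run of the session."""
    m = len(pattern)
    return any(tuple(session[i:i + m]) == pattern for i in range(len(session) - m + 1))


def map_block(block: List[List[str]], candidates: List[Pattern]) -> Counter:
    """Map step: local frequency of every candidate pattern in one block."""
    return Counter({c: sum(contains(s, c) for s in block) for c in candidates})


def count_patterns(blocks: List[List[List[str]]], candidates: List[Pattern]) -> Counter:
    """Reduce step: add up the local frequencies reported for each block."""
    local_counts = [map_block(block, candidates) for block in blocks]
    return reduce(lambda a, b: a + b, local_counts, Counter())
```

Because the local counts are simply added, the blocks can be counted independently and in any order, which is what makes the step easy to distribute.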
  • 24. Web Sources/Bibliography
    • M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data, 2009
    • R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web, 1999
    • J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, 2000
    • M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction, 2005
    • J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining, 2005
    • R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995
  • 25. GSP
    GSP: GENERALIZED SEQUENTIAL PATTERN
    C_1 = Init_Pass
    L_1 = {<{f}> | f in C_1, with minimum support}
    for (k=2; L_{k-1} ≠ Ø; k++) do begin
      C_k = Candidate-gen-SPM(L_{k-1})
      foreach sequence s in the database D do
        foreach candidate c in C_k
          if (c in s) then update candidate c
      L_k = candidates c in C_k with minimum support
    end
    result = U_k(L_k)
    CANDIDATE-GEN-SPM
    // join step
    foreach p in L_{k-1}
      foreach q in L_{k-1}
        if ( ) then C_k = C_k U {p_1, …, p_{k-1}, q_{k-1}}
    // prune step
    foreach s in C_k
      if exists(r | ˄ ) then C_k = C_k - s
  • 26. GSP Example
    • L3-sequences: <{1,2},{4}> <{1,2},{5}> <{1},{4,5}> <{1,4},{6}> <{2},{4,5}> <{2},{4},{6}>
    • Candidate 4-sequences (join step): <{1,2},{4,5}> <{1,2},{4},{6}>
    • Candidate 4-sequences (prune step): <{1,2},{4,5}>
    • <{1},{4},{6}> is not among the L3-sequences, so <{1,2},{4},{6}> is pruned
  • 27. APrioriAll
    APRIORIALL
    L_1 = {large 1-sequences}
    for (k=2; L_{k-1} ≠ Ø; k++) do begin
      C_k = Apriori-generate(L_{k-1})
      foreach sequence c in the database D do
        update candidates in C_k that are contained in c
      L_k = candidates in C_k with minimum support
    end
    result = maximal sequences in U_k(L_k)
    APRIORI-GENERATE
    // join step
    foreach p in L_{k-1}
      foreach q in L_{k-1}
        if (p.x_1 = q.x_1) ˄ (p.x_2 = q.x_2) ˄ … ˄ (p.x_{k-2} = q.x_{k-2}) then
          C_k = C_k U {<p.x_1, …, p.x_{k-1}, q.x_{k-1}>}
    // prune step
    foreach s in C_k
      if exists(r | ˄ ) then C_k = C_k - s
  • 28. APrioriAll Example
    • L3-sequences: <1,2,3> <1,2,4> <1,3,4> <1,3,5> <2,3,4>
    • Candidate 4-sequences (join step): <1,2,3,4> <1,2,4,3> <1,3,4,5> <1,3,5,4>
    • Candidate 4-sequences (prune step): <1,2,3,4>
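The example can be reproduced with a short Python sketch of the two steps. The join follows the condition on the slide (two different large sequences that agree on their first k-2 items). The prune condition is not legible in the transcript, so the sketch assumes the usual Apriori rule: a candidate is dropped when one of its (k-1)-subsequences is not a large sequence. The function name is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Seq = Tuple[int, ...]


def apriori_generate(large: List[Seq]) -> Tuple[List[Seq], List[Seq]]:
    """Return (candidates after the join step, candidates after the prune step)."""
    large_set = set(large)
    k_minus_1 = len(large[0])
    # Join step: p and q share their first k-2 items; append q's last item to p.
    joined = [
        p + (q[-1],)
        for p in large
        for q in large
        if p != q and p[:-1] == q[:-1]
    ]
    # Prune step (assumed rule): every (k-1)-subsequence must itself be large.
    pruned = [
        c for c in joined
        if all(sub in large_set for sub in combinations(c, k_minus_1))
    ]
    return joined, pruned


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
JOINED, PRUNED = apriori_generate(L3)
```

For instance, <1,2,4,3> is pruned because its subsequence <1,4,3> is not among the L3-sequences.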
  • 29. APrioriSome
    APRIORISOME
    // Forward Phase
    L_1 = {large 1-sequences}; C_1 = L_1; last = 1
    for (k=2; C_{k-1} ≠ Ø; k++) do begin
      if (L_{k-1} known) then
        C_k = Apriori-generate(L_{k-1})
      else
        C_k = Apriori-generate(C_{k-1})
      if (k = next(last)) then
        foreach sequence c in the database D do
          update candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support; last = k
    end
    // Backward Phase
    for (k--; k >= 1; k--) do begin
      if (L_k not found) then
        delete all sequences in C_k contained in some L_i, i > k
        foreach sequence c in the database D do
          update candidates in C_k that are contained in c
        L_k = candidates in C_k with minimum support
      else
        delete all sequences in L_k contained in some L_i, i > k
    end
    result = maximal sequences in U_k(L_k)
  • 30. Sequential Mining Algorithm
    Customer ID   Transaction Time   Items
    1             June 25 '93        30
    1             June 25 '93        90
    2             June 10 '93        10,20
    2             June 15 '93        30
    2             June 20 '93        40,60,70
    3             June 25 '93        30,50,70
    4             June 25 '93        30
    4             June 30 '93        40,70
    4             July 25 '93        90
    5             June 12 '93        90
    Customer ID   Customer Sequence
    1             <(30) (90)>
    2             <(10 20) (30) (40 60 70)>
    3             <(30 50 70)>
    4             <(30) (40 70) (90)>
    5             <(90)>
    Large itemset   Mapped to
    (30)            1
    (40)            2
    (70)            3
    (40 70)         4
    (90)            5
    Customer ID   Customer Sequence
    1             <{1} {5}>
    2             <{1} {2, 3, 4}>
    3             <{1, 3}>
    4             <{1} {2, 3, 4} {5}>
    5             <{5}>
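The last step of the example, rewriting each customer sequence in terms of the large itemsets, can be sketched in Python. Each transaction is replaced by the set of identifiers of the large itemsets it contains, and transactions that contain none are dropped. The function name `transform` and the dictionary layout are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List, Set

# Large itemsets and the identifiers they are mapped to, as in the example.
LARGE_ITEMSETS: Dict[FrozenSet[int], int] = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(sequence: List[Set[int]], large: Dict[FrozenSet[int], int]) -> List[Set[int]]:
    """Replace each transaction by the ids of the large itemsets it contains."""
    transformed = []
    for transaction in sequence:
        ids = {ident for itemset, ident in large.items() if itemset <= transaction}
        if ids:  # transactions without any large itemset are dropped
            transformed.append(ids)
    return transformed


CUSTOMERS = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
MAPPED = {cid: transform(seq, LARGE_ITEMSETS) for cid, seq in CUSTOMERS.items()}
```

Customer 2 shows both effects: the transaction (10 20) disappears because it contains no large itemset, and (40 60 70) becomes {2, 3, 4} because it contains (40), (70) and (40 70).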