Mining web data: Techniques for understanding the user ...

Presentation Transcript

  • Mining web data: Techniques for understanding the user behavior in the Web
    Juan D. Velásquez Silva, PhD Information Engineering, University of Tokyo, Japan
    jvelasqu@dii.uchile.cl, http://wi.dii.uchile.cl
    Department of Industrial Engineering, University of Chile
    Mining web data Juan D. Velásquez © 2006 (http://wi.dii.uchile.cl/)
  • Outline
    • Motivation.
    • Web data.
    • Web mining.
    • Applications.
    • Summary.
  • 1.- Motivation
  • The new sales window
    • Markets are moving towards business virtualization.
    • Many companies have begun to use the Web as a new channel for mass distribution and worldwide coverage.
    • How can we improve our web site?
    • By understanding what describes the user behavior [Brusilovsky96].
  • OK, but how can we do it?
    • Analyzing the user browsing behavior.
    • Analyzing the web page preferences.
    • Analyzing the user profile.
    • In fact, “extracting knowledge from web data”
    • … applying Web mining techniques.
  • 2.- Data originated in the Web: Web Data
  • Web server and web browser
  • Web logs
     #  IP               Id  Access  Time                 Method/URL/Protocol  Status  Bytes  Referer  Agent
     1  165.182.168.101  -   -       16/06/2002:16:24:06  GET p1.htm HTTP/1.1  200     3821   out.htm  Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
     2  165.182.168.101  -   -       16/06/2002:16:24:10  GET A.gif HTTP/1.1   200     3766   p1.htm   Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
     3  165.182.168.101  -   -       16/06/2002:16:24:57  GET B.gif HTTP/1.1   200     2878   p1.htm   Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
     4  204.231.180.195  -   -       16/06/2002:16:32:06  GET p3.htm HTTP/1.1  304     0      -        Mozilla/4.0 (MSIE 6.0; Win98)
     5  204.231.180.195  -   -       16/06/2002:16:32:20  GET C.gif HTTP/1.1   304     0      -        Mozilla/4.0 (MSIE 6.0; Win98)
     6  204.231.180.195  -   -       16/06/2002:16:34:10  GET p1.htm HTTP/1.1  200     3821   p3.htm   Mozilla/4.0 (MSIE 6.0; Win98)
     7  204.231.180.195  -   -       16/06/2002:16:34:31  GET A.gif HTTP/1.1   200     3766   p1.htm   Mozilla/4.0 (MSIE 6.0; Win98)
     8  204.231.180.195  -   -       16/06/2002:16:34:53  GET B.gif HTTP/1.1   200     2878   p1.htm   Mozilla/4.0 (MSIE 6.0; Win98)
     9  204.231.180.195  -   -       16/06/2002:16:38:40  GET p2.htm HTTP/1.1  200     2960   p1.htm   Mozilla/4.0 (MSIE 6.0; Win98)
    10  165.182.168.101  -   -       16/06/2002:16:39:02  GET p1.htm HTTP/1.1  200     3821   out.htm  Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    11  165.182.168.101  -   -       16/06/2002:16:39:15  GET A.gif HTTP/1.1   200     3766   p1.htm   Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    12  165.182.168.101  -   -       16/06/2002:16:39:45  GET B.gif HTTP/1.1   200     2878   p1.htm   Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    13  165.182.168.101  -   -       16/06/2002:16:39:58  GET p2.htm HTTP/1.1  200     2960   p1.htm   Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    14  165.182.168.101  -   -       16/06/2002:16:42:03  GET p3.htm HTTP/1.1  200     4036   p2.htm   Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    15  165.182.168.101  -   -       16/06/2002:16:42:07  GET p2.htm HTTP/1.1  200     2960   p1.htm   Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
    16  165.182.168.101  -   -       16/06/2002:16:42:08  GET C.gif HTTP/1.1   200     3423   p2.htm   Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
    17  204.231.180.195  -   -       16/06/2002:17:34:20  GET p3.htm HTTP/1.1  200     2342   out.htm  Mozilla/4.0 (MSIE 6.0; Win98)
    18  204.231.180.195  -   -       16/06/2002:17:34:48  GET C.gif HTTP/1.1   200     3423   p2.htm   Mozilla/4.0 (MSIE 6.0; Win98)
    19  204.231.180.195  -   -       16/06/2002:17:35:45  GET p4.htm HTTP/1.1  200     3523   p3.htm   Mozilla/4.0 (MSIE 6.0; Win98)
    20  204.231.180.195  -   -       16/06/2002:17:35:56  GET D.gif HTTP/1.1   200     3231   p4.htm   Mozilla/4.0 (MSIE 6.0; Win98)
    21  204.231.180.195  -   -       16/06/2002:17:36:06  GET E.gif HTTP/1.1   404     0      p4.htm   Mozilla/4.0 (MSIE 6.0; Win98)
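Before any mining step, each log entry has to be split into its fields. The following is a minimal sketch, assuming the simplified space-separated layout of the log excerpt above (the regular expression and the names `LOG_PATTERN` and `parse_log_line` are illustrative assumptions, not part of the presentation; real server logs often use a different layout).

```python
import re
from datetime import datetime

# Hypothetical pattern for the simplified layout of the excerpt:
# IP, Id, Access, Time, Method, URL, Protocol, Status, Bytes, Referer, Agent.
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) (?P<ident>\S+) (?P<access>\S+) (?P<time>\S+) "
    r"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+) "
    r"(?P<status>\d{3}) (?P<size>\d+) (?P<referer>\S+) (?P<agent>.+)"
)


def parse_log_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps in the excerpt look like 16/06/2002:16:24:06.
    record["time"] = datetime.strptime(record["time"], "%d/%m/%Y:%H:%M:%S")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


line = ("165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 "
        "200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)")
record = parse_log_line(line)
print(record["ip"], record["url"], record["status"], record["referer"])
```

Records parsed this way can then be filtered (for instance, dropping image requests or error statuses) before sessionization.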
  • Web site contents
    • You can put anything you want on a Web page, from family to business information.
    • Hyperlink structure
  • Nature of web data
    • Content. The web page content, i.e., pictures, free text, sounds, etc.
    • Structure. Data that show the internal web page structure. In general, these are HTML or XML tags, some of which contain information about hyperlink connections with other web pages.
    • Usage. Data that describe the visitor preferences while browsing a web site. They can be found in web log files.
    • User profile. A collection of information about a user, combining personal information (e.g. name, age, etc.), usage information (e.g. pages visited) and interests.
  • Understanding the visitor behavior in a web site
    • Visitor browsing behavior: web logs.
    • Visitor preferences: web pages.
    • Problems:
      • Web logs contain a lot of irrelevant data.
      • A web site is a huge collection of heterogeneous, unlabelled, distributed, time-variant, semi-structured and high-dimensional data.
  • The dream
    • “Transform the visitors into customers and retain the existing ones”
    • Some solutions:
      • Continuous improvement of the web site structure and content.
      • Personalization of the relationship between the user and the web site.
      • Understanding the user behavior in the web site.
  • Data cleaning and preprocessing
    • Web logs. Applying the sessionization process.
      • To identify the real visitor sessions.
    • Web pages. Vector space model.
      • Transforming the web pages into feature vectors.
    • Web hyperlink structure.
      • Identifying social networks.
  • Web logs (repeats the log excerpt shown earlier)
  • WUM – Heuristics
    • The session identification heuristics:
      • Timeout: if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.
      • IP/Agent: each different agent type for an IP address represents a different session.
      • Referring page: if the referring page of a request is not part of an open session, it is assumed that the request comes from a different session.
      • Same IP-Agent/different sessions (Closest): assigns the request to the session that is closest to the referring page at the time of the request.
      • Same IP-Agent/different sessions (Recent): when multiple sessions are at the same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time.
  • Sessionization - Example
    • Based on examples in the PKDD-2002 tutorial by Berendt et al.
  • Sessionization - Example (2)
    • Sort the users (IP+Agent).
  • Sessionization - Example (3)
    • Sessionize using heuristics (h1 with 30 min).
    • The h1 heuristic (timeout = 30 min) results in two sessions.
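The IP/Agent and timeout heuristics can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the function name `sessionize`, the tuple layout of a request and the helper `at` are assumptions, and the sample requests are the page requests (images left out) of one visitor taken from the web log excerpt, not the figure shown on these slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (ip, agent, time, url) page requests into sessions.

    Requests are grouped by user, identified here by IP + agent, and a
    user's requests are split whenever two consecutive ones are more
    than `timeout` apart (the timeout heuristic).
    """
    by_user = defaultdict(list)
    for ip, agent, time, url in requests:
        by_user[(ip, agent)].append((time, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


def at(clock):
    """Hypothetical helper: a time of day on the date used in the log."""
    return datetime.strptime("16/06/2002 " + clock, "%d/%m/%Y %H:%M:%S")


ip, agent = "204.231.180.195", "MSIE 6.0; Win98"
requests = [
    (ip, agent, at("16:32:06"), "p3.htm"),
    (ip, agent, at("16:34:10"), "p1.htm"),
    (ip, agent, at("16:38:40"), "p2.htm"),
    (ip, agent, at("17:34:20"), "p3.htm"),
    (ip, agent, at("17:35:45"), "p4.htm"),
]
for user, pages in sessionize(requests):
    print(user[0], pages)
```

The gap between 16:38:40 and 17:34:20 exceeds 30 minutes, so this visitor's requests are split into two sessions. The referrer-based heuristics would need the Referer field as well and are not covered by this sketch.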
  • Sessionization - Example (4)
    • Sessionize using heuristics (with href).
    • By using the referrer-based heuristics, we have only a single session.
  • Sessionization - Example (5)
    • Path completion.
  • Web text content
    • Among the different kinds of web page content, free text receives special attention.
    • For the moment, searching is performed using keywords.
    • The text information must be represented as a feature vector before a mining process is applied.
    • The representation must consider that the words in a web page do not all have the same importance.
  • Web text content: filters
    • Some words carry no information (articles, prepositions, conjunctions, etc.).
    • A cleaning process is necessary:
      • HTML tags.
      • Stop words (e.g. pronouns, prepositions, conjunctions, etc.).
      • Word stemming. Process of suffix removal, to generate word stems.
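The three filters can be sketched as a small pipeline. Everything here is an illustrative assumption: the stop-word list and suffix list are tiny samples, and `naive_stem` is a crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Illustrative stop-word and suffix lists; real systems use larger ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def strip_tags(html):
    """Replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", html)


def naive_stem(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(html):
    """Tags out, lower-case words only, stop words out, suffixes removed."""
    words = re.findall(r"[a-z]+", strip_tags(html).lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(clean("<p>Mining the <b>logs</b> of visited pages</p>"))
```

The stems produced by such a simple rule are rough, but they are enough to map word variants to a common feature before weights are computed.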
  • Processing the web site: Vector space model [Salton75]
    • The model associates a weight with each word in the page, based on its frequency in the whole web site.
    • Let $n_i$ be the number of pages containing word $i$ and $Q$ the total number of pages. A simple estimate of the relevance of a word is
      $$w_{ij} = \frac{n_i}{Q}$$
    • The inverse document frequency can be used as a weight:
      $$IDF = \log\left(\frac{Q}{n_i}\right)$$
    • A variation of the last expression is known as TF*IDF, where $f_{ij}$ is the frequency of word $i$ in page $j$:
      $$TF*IDF = f_{ij} \cdot \log\left(\frac{Q}{n_i}\right)$$
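A minimal sketch of the TF*IDF weighting, assuming that the term frequency is the raw count of a word in a page, that $n_i$ counts the pages containing the word, and that pages arrive as lists of already cleaned tokens (the function name `tf_idf` and the sample pages are hypothetical):

```python
import math
from collections import Counter


def tf_idf(pages):
    """Return the vocabulary and one weight vector per page.

    pages: list of token lists, one per page.
    Weight of word i in page j: f_ij * log(Q / n_i).
    """
    Q = len(pages)
    counts = [Counter(tokens) for tokens in pages]
    vocabulary = sorted(set().union(*pages))
    # n_i: number of pages in which word i appears at least once.
    n = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    vectors = [
        [c[w] * math.log(Q / n[w]) for w in vocabulary]
        for c in counts
    ]
    return vocabulary, vectors


pages = [["web", "mining", "web"], ["web", "log"], ["mining", "log", "log"]]
vocabulary, vectors = tf_idf(pages)
print(vocabulary)
print(vectors[0])
```

A word that appears in every page gets weight zero, since log(Q / n_i) = log(1) = 0, which is the intended effect: such a word does not help to tell pages apart.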
  • Web page: vectorial representation
    • Its vectorial representation would be a matrix of R x Q.
    • Q is the number of pages in the web site and R is the number of different words in P.

          Word       1   2   …   Q
      1   advise     1   0   …   1
      2   business   0   1   …   0
      .   …          .   .   …   .
      R   zambia     1   0   …   0
  • Vector space model: a graphic example
  • Comparing web pages
      $$M = (m_{ij}), \quad m_{ij} = f_{ij} \cdot \log\left(\frac{Q}{n_i}\right)$$
      $$p_i \to (m_{1i}, \ldots, m_{Ri}), \quad p_j \to (m_{1j}, \ldots, m_{Rj})$$
      $$dp(p_i, p_j) = \cos\theta = \frac{\sum_{k=1}^{R} m_{ki}\, m_{kj}}{\sqrt{\sum_{k=1}^{R} (m_{ki})^2}\,\sqrt{\sum_{k=1}^{R} (m_{kj})^2}}$$
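The cosine comparison of two page vectors can be sketched as below. The function name `cosine` and the convention of returning 0 for an all-zero vector are assumptions made for this illustration.

```python
import math


def cosine(p_i, p_j):
    """Cosine of the angle between two page vectors of equal length."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumed convention for pages without weighted words
    return dot / (norm_i * norm_j)


print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # no word in common
```

Because the measure depends only on the angle, a long page and a short page that use words in the same proportions are rated as similar.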
  • Vector Model, Summarized
    • The best term-weighting schemes use tf-idf weights: w_ij = f(i,j) * log(Q/n_i)
    • For the query term weights, a suggestion is w_iq = (0.5 + [0.5 * freq(i,q) / max_l(freq(l,q))]) * log(Q/n_i)
    • This model is very good in practice:
      • tf-idf works well with general collections
      • Simple and fast to compute
      • The vector model is usually as good as the known ranking alternatives
  • Pros & Cons of Vector Model
    • Advantages:
      • term weighting improves the quality of the answer set
      • partial matching allows retrieval of docs that approximate the query conditions
      • the cosine ranking formula sorts documents according to their degree of similarity to the query
    • Disadvantages:
      • assumes independence of index terms; it is not clear whether this is a good or a bad assumption
  • Web hyperlinks structure
    • Why does a web page point to another one?
    • Link analysis: use the link structure to determine credibility.
    • If a web page is pointed to by other pages, it may be because the page contains relevant information.
    • We can understand the formation of a web community.
    • We can improve our web site.
  • Processing the hyperlinks structure
  • Processing the hyperlinks structure (2)
    • Identifying which pages contain more relevant information than others [Kleinberg99]:
      • Authorities. A natural information repository for the community.
      • Hubs. These concentrate links to authority web pages, for instance, “my favorite sites”.
    • With $x_p$ the authority weight and $y_p$ the hub weight of page $p$:
      $$x_p = \sum_{q \,:\, q \to p} y_q, \qquad y_p = \sum_{q \,:\, p \to q} x_q$$
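The two update rules can be iterated as in the following sketch. It assumes that x denotes authority weights and y hub weights, as in Kleinberg's formulation, and it normalizes both after each round so that the values stay bounded; the link graph, the page names and the fixed number of iterations are hypothetical.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to one."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x_p: sum of the hub weights y_q of the pages q that point to p.
        authority = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                authority[p] += hub[q]
        # y_p: sum of the authority weights x_q of the pages q that p points to.
        hub = {p: sum(authority[q] for q in links.get(p, [])) for p in pages}
        authority = normalize(authority)
        hub = normalize(hub)
    return authority, hub


links = {
    "my_favorite_sites": ["a", "b"],
    "other_list": ["a"],
}
authority, hub = hits(links)
print(sorted(authority.items(), key=lambda item: -item[1]))
print(sorted(hub.items(), key=lambda item: -item[1]))
```

In this toy graph, page `a` ends up as the strongest authority because both link lists point to it, and `my_favorite_sites` is the strongest hub because it points to both authorities.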
  • 4.- Web mining
  • Data mining techniques
    • Association rules.
    • Clustering.
    • Classification.
    • Prediction.
    • Probabilistic models.
    • Regression.
  • What happens with web data?
    • “The web is a huge collection of heterogeneous, unlabelled, distributed, time variant, semi-structured and high dimensional data” (S.K. Pal, 2002)
  • Web Mining Taxonomy [Jooshi00,Lu03]
    • Web Mining
      • Web Content Mining
      • Web Structure Mining
      • Web Usage Mining
  • Web Structure Mining
    • It deals with the mining of the web hyperlink structure (inter-document structure).
    • A web site is represented by a graph of its links, within the site or between sites.
    • Facts like the popularity of a web page can be studied, for instance, whether a page is referred to by many other pages in the web.
    • The web link structure allows the development of a notion of hyperlinked communities.
    • It can be used by search engines, like Google or AltaVista, to get the set of pages most cited for a particular subject.
  • WSM (2)
    • Finding authoritative Web pages
      • Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
    • Hyperlinks can infer the notion of authority
      • The Web consists not only of pages, but also of hyperlinks pointing from one page to another
      • These hyperlinks contain an enormous amount of latent human annotation
      • A hyperlink pointing to another Web page can be considered as the author's endorsement of that page
  • Google’s PageRank (Brin/Page 98)
    • Mine the structure of the web graph independently of the query!
    • Each web page is a node, each hyperlink is a directed edge.
    • Assumes a random walk (surf) through the web:
      • Start at a random page.
      • At each step, the surfer proceeds
        • to a randomly chosen web page with probability 1 - d
        • to a randomly chosen successor of the current page with probability d
    • The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
  • PageRank Algorithm
    • The assumption is that the importance of a page is given by the importance of the pages that point to it:
      $$x_p^{(k+1)} = (1 - d)\,\frac{1}{n} + d \sum_{q \in P \,:\, q \to p} \frac{x_q^{(k)}}{N_q}$$
    • $x_p^{(k+1)}$: importance of page p.
    • $x_q^{(k)}$: importance of page q.
    • $d$: probability that the surfer follows one of the out-links of q when visiting that page.
    • $n$: number of pages in the web graph.
    • $N_q$: number of out-links from page q.
  • PageRank Algorithm (2)
    • Using a matrix representation, the PageRank equation is rewritten as
      $$X^{(k+1)} = (1 - d)\,D + d\,M X^{(k)}$$
    • where
      $$X^{(k)} = (x_{p_1}^{(k)}, \ldots, x_{p_n}^{(k)}), \quad D = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right) \quad \text{and} \quad M = \left(m_{ij} = \frac{1}{N_j}\right)$$
    • with $N_j$ the number of out-links from page j, for j such that $j \to i$ (and $m_{ij} = 0$ otherwise).
  • PageRank Algorithm (3)
    1. Initialize $X^{(0)} = (0, \ldots, 0)$ and $D = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$.
    2. Set d, usually d = 0.85.
    3. Calculate $X^{(k+1)} = (1 - d)\,D + d\,M X^{(k)}$.
    4. If $\|X^{(k+1)} - X^{(k)}\| < \epsilon$, stop and return $X^{(k+1)}$.
    5. Else set $X^{(k)} = X^{(k+1)}$ and go to step 3.
  • PageRank Algorithm (4)
      $$M = \begin{pmatrix} 0 & 0.5 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 1 & 0.5 & 0 & 0 \end{pmatrix} \quad \text{and} \quad D = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}$$
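The iteration X(k+1) = (1 - d) D + d M X(k) can be sketched in a few lines, using the 4 x 4 matrix M and d = 0.85 from the slides. This is only an illustration under stated assumptions: the function name `pagerank`, the use of the largest absolute component difference as the norm, the tolerance and the iteration cap are not given in the presentation. All components are updated at once from the previous iterate, so iterates after the first one need not coincide with values printed on the slides.

```python
def pagerank(M, d=0.85, eps=1e-8, max_iter=1000):
    """Iterate X <- (1 - d) D + d M X, starting from X = (0, ..., 0)."""
    n = len(M)
    D = [1.0 / n] * n
    X = [0.0] * n
    for _ in range(max_iter):
        X_next = [
            (1 - d) * D[i] + d * sum(M[i][j] * X[j] for j in range(n))
            for i in range(n)
        ]
        # Assumed norm: largest absolute change of any component.
        if max(abs(a - b) for a, b in zip(X_next, X)) < eps:
            return X_next
        X = X_next
    return X


# m_ij = 1 / N_j if page j links to page i, else 0 (matrix from the slide).
M = [
    [0, 0.5, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0.5, 0, 0],
]
X = pagerank(M)
print([round(x, 4) for x in X])
```

Page 3 has no in-links (its row in M is all zeros), so its value stays at (1 - d)/n = 0.0375, while the values of all pages together approach 1 because every column of M sums to 1.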
  • PageRank Algorithm (5)
    [Figure: the example web graph annotated with the iterates $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$. $X^{(1)} = (0.0375, 0.0375, 0.0375, 0.0375)$; the later iterates shown include the values 0.0853 and 0.126.]
  • PageRank Algorithm (6)
    • About the disadvantages:
      • If a page is pointed to by another one, the page receives a vote in the PageRank calculation.
      • If a page is pointed to by many pages, the page is assumed to be important.
    • “Only the good pages are pointed to by other ones”, but:
      • Reciprocal links. If page A links to page B, then page B will link to page A.
      • Link requirements. Some web pages give electronic gifts, like a program or a document, if another page points to them.
      • Communities of close persons. For instance, friends and relatives who point from their pages to the page of another friend or relative only because of the human relationship between them.
  • WSM summary
    • HITS and PageRank use back-links as a means of adjusting the “worthiness” or “importance” of a page.
    • Both use an iterative process over matrix/vector values to reach a convergence point.
    • HITS is query-dependent (thus too expensive to compute in general).
    • PageRank is query-independent and considered more stable.
    • Just one piece of the overall picture…
  • Web Content Mining
    • The goal is to find useful information in the web content.
    • In this sense, WCM is similar to Information Retrieval (IR).
    • However, web content is not only free text; other objects like pictures, sounds and movies also belong to the content.
    • There are two main areas in WCM:
      • the mining of document contents (web page content mining)
      • the improvement of content search in tools like search engines (search result mining).
  • Issues in Web Content Mining
    • Developing intelligent tools for IR
      • Finding keywords and key phrases
      • Discovering grammatical rules and collocations
      • Hypertext classification/categorization
      • Extracting key phrases from text documents
      • Learning extraction models/rules
      • Hierarchical clustering
      • Predicting (words) relationship
  • WEBSOM
    • It is a means of organizing miscellaneous text documents into meaningful maps for exploration and search.
    • It is based on SOM (Self-Organizing Map), which automatically organizes documents into a two-dimensional grid so that related documents appear close to each other.
    • http://www.cis.hut.fi/websom
  • WEBSOM
    • Word category map: a self-organizing semantic map with 15x21 neurons. Interrelated words that have similar contexts appear close to each other on the map.
    • All words of a document are mapped into the word category map, and a histogram of “hits” on it is formed.
    • Document map: a self-organizing map.
    • Largest experiments have used:
      • word-category map: 315 neurons with 270 inputs each
      • document map: 104040 neurons with 315 inputs each
      • training done with 1124134 documents
  • WEBSOM
  • Sample search
    • A new document or any document description can be used for finding related documents.
    • The circle on the map denotes the location of the most representative document for the question.
  • How to use the Maps
    • As a filter that notifies the user of interesting documents.
    • As a search engine.
    • As a new indexing method.
  • Extraction of key-text components from web pages
    • The key-text components are parts of an entire document, for instance a paragraph, a phrase or even a word, that contain significant information about a particular topic, from the web site user's point of view.
    • A web site keyword is “a word or possibly a set of words that make a web page more attractive for an eventual user during his visit to the web site” [Velasquez05b].
    • The assumption is that there is a correlation between the time that the user spends on a page and his/her interest in its content [Velasquez04b].
    • Usually, the keywords of a web site have been associated with the “most frequently used words”.
    • In [Buyukkokten01] a method to extract keywords from a huge set of web pages is introduced.
  • WCM summary
    • The vector space model is a recurrent method to represent a document as a feature vector.
    • Because the set of words used in the construction of the web site can be very large, it is necessary to apply stop word cleaning and stemming.
    • Web page content differs from a common document: a web page contains semi-structured text, i.e., text with tags that give additional information about the text components.
    • A page can also contain pictures, sounds, movies, etc.
    • Sometimes the page text content is a short text or even a set of unconnected words.
  • Web Usage Mining (WUM)
    • Also known as Web log mining.
    • Mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of the users while surfing the web.
  • WUM: Considerations [Pierrakos03]
    • WUM applies traditional data mining methods in order to analyze usage data.
    • The sessionization process is necessary to correct the problems detected in the data.
    • The goal is to discover patterns in usage data by applying different kinds of data mining techniques.
    • Applications of WUM can be grouped into two main categories:
      • User modeling in adaptive interfaces, known as personalization.
      • User navigation patterns, in order to improve the web site structure.
  • WUM: Applications
    • Target potential customers for electronic commerce
    • Enhance the quality and delivery of Internet information services to the end user
    • Improve Web server system performance
    • Identify potential prime advertisement locations
    • Facilitate personalization/adaptive sites
    • Improve site design
    • Fraud/intrusion detection
    • Predict user’s actions (allows prefetching)
  • Statistics
    • Several tools use conventional statistics for analyzing the user behavior in a web site (http://www.accrue.com/, http://www.netgenesis.com, http://www.webtrends.com).
    • Each of them uses graphical interfaces, like histograms, with associated statistics.
    • For instance, the number of clicks per page during the last month.
    • By applying conventional statistics to web logs, one can perform different kinds of analysis [Boving04].
  • Statistics (2)
    • These basic statistics are used to improve the web site structure and content in many institutions, above all the commercial ones [Dellmann04].
    • Their application is simple: there are many web traffic analysis tools and their reports are easy to understand.
    • Although statistical analysis may seem a very simple tool for mining web logs, it solves many problems related to the performance of the web site.
  • Examples:
    • 60% of clients who accessed /products/ also accessed /products/software/webminer.htm.
    • 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
    • (Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis} (α = 69.7%, σ = 0.35%)
  • Clustering
    • Groups together a set of items having similar characteristics
    • User clusters
      • Discover groups of users exhibiting similar browsing patterns
    • Page recommendation
      • The user's partial session is classified into a single cluster
      • The links contained in this cluster are recommended
  • Clustering
    • Grouping users with common browsing behavior and pages with similar content [Srivastava00].
    • An important element of any clustering technique is the similarity measure or method used to compare the input vectors when creating the clusters.
    • In the WUM case, the similarities are built on the usage data of the web site, i.e., the set of pages used or visited during the user session.
  • WUM: Clustering example
    • Clusters of user browsing behavior
  • Association Rule Generation
    • Discovers the correlations between pages that are most often referenced together in a single server session
    • Provides information such as:
      • What are the sets of pages frequently accessed together by Web users?
      • What page will be fetched next?
      • What are the paths frequently accessed by Web users?
    • Association rule: A ⇒ B [Support = 60%, Confidence = 80%]
    • Example: “50% of visitors who accessed URLs /diplomas.html and BI/info.html also visited webmining.html”
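Support and confidence of a rule A ⇒ B can be computed directly from sessionized data. The sketch below assumes the usual definitions (support: fraction of all sessions containing both A and B; confidence: fraction of the sessions containing A that also contain B). The function name `rule_metrics` and the five sessions are hypothetical; only the page names are borrowed from the example above.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [s for s in sessions if antecedent <= set(s)]
    with_both = [s for s in with_a if consequent <= set(s)]
    support = len(with_both) / len(sessions)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical sessions, each a list of visited pages.
sessions = [
    ["/diplomas.html", "BI/info.html", "webmining.html"],
    ["/diplomas.html", "BI/info.html"],
    ["/diplomas.html", "webmining.html"],
    ["BI/info.html"],
    ["/diplomas.html", "BI/info.html", "webmining.html"],
]
support, confidence = rule_metrics(
    sessions, ["/diplomas.html", "BI/info.html"], ["webmining.html"]
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Here three sessions contain both antecedent pages and two of them also contain `webmining.html`, so the rule holds in 2 of 5 sessions overall and in 2 of the 3 sessions where it applies.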
  • Association rules
    • In WUM, association rules focus mainly on the discovery of relations between pages visited by the users [Mobasher01].
    • For instance, an association rule for an MBA program is mba/seminar.html ⇒ mba/speakers.html
    • From a different point of view, in [Schwarzkopf01] a Bayesian network is used for defining taxonomic relations between topics shown in a web site.
  • 5.- Applications
  • Improving the relationship between the web site and the user
    • Recommendations to modify the web site structure and content.
    • Web personalization (online navigation recommendations).
    • Intelligent web site.
  • Web personalization [Lu03,Mombasher00]
    • It is the process where the web server and the related applications dynamically customize the content for a particular user, based on information about his/her behavior in a web site.
    • This is different from another related concept called “customization”, where the visitor interacts with the web server using an interface to create her/his own web site, e.g., “MyBanking Page”.
  • Intelligent Web site [Coenen00, Perkowitz98, Velasquez04b]
    • It represents the main concepts behind the “new portal generation”.
    • They are systems that, “based on the user behavior, allow to implement changes in the current web site structure and content”.
  • What should we do? [Mombasher01]
    • Modeling the visitor behavior.
    • A model for the visitor browsing behavior must consider the visited pages, the time spent and the page sequence.
    • A model for the visitor preferences must consider the content of the pages and the time spent on each of them in a session.
  • Visitor browsing behavior ¿?
  • Visitor browsing behavior (2)
    • Three variables are considered: the web page path, its content and the time spent on it by a visitor.
    • The visitor behavior vector is defined and a similarity measure between visitor sessions is introduced:
      $$\vec{v} = [(p_1, t_1), \ldots, (p_n, t_n)]$$
    • where $(p_i, t_i)$ is a component that represents the content and path of the $i$-th page visited and the percentage of time spent on it.
    • The vector maintains the page visit order.
  • Vector space model: A variation proposed
      $$M = (m_{ij}), \quad m_{ij} = f_{ij}\left(1 + \frac{sw_i}{TR}\right) \cdot \log\left(\frac{Q}{n_i}\right)$$
    • sw: special words array.
    • TR: total special words.
      $$p_i \to (m_{1i}, \ldots, m_{Ri}), \quad p_j \to (m_{1j}, \ldots, m_{Rj})$$
      $$dp(p_i, p_j) = \cos\theta = \frac{\sum_{k=1}^{R} m_{ki}\, m_{kj}}{\sqrt{\sum_{k=1}^{R} (m_{ki})^2}\,\sqrt{\sum_{k=1}^{R} (m_{kj})^2}}$$
  • An example of variation proposed
  • Comparing the sequences [Runkler03, Velasquez04a]
    • The sequence of a navigation can be represented by a graph. Each page is identified by an identification number.
      $$G_1 = \{1 \to 2,\; 2 \to 6,\; 2 \to 5,\; 5 \to 8\}, \quad G_2 = \{1 \to 3,\; 3 \to 6,\; 3 \to 7\}$$
    • We need to know how similar or different both sequence representations are!
      $$E(G_1) = \{1, 2, 6, 5, 8\}, \quad E(G_2) = \{1, 3, 6, 7\}$$
    • Notion of similarity:
      $$dG(G_1, G_2) = 2\,\frac{\|E(G_1) \cap E(G_2)\|}{\|E(G_1)\| + \|E(G_2)\|}$$
      which is 0 if $E(G_1) \cap E(G_2) = \emptyset$ and 1 if $E(G_1) = E(G_2)$.
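As a quick check of this notion of similarity, the sketch below evaluates it for the two graphs G1 and G2 given above, reading the denominator as the sum of the two node-set sizes. The function names `nodes` and `graph_overlap` and the representation of a graph as a list of (from, to) pairs are assumptions for illustration.

```python
def nodes(edges):
    """E(G): the set of pages that appear in the navigation graph."""
    return {page for edge in edges for page in edge}


def graph_overlap(g1, g2):
    """2 * |E(G1) & E(G2)| / (|E(G1)| + |E(G2)|)."""
    e1, e2 = nodes(g1), nodes(g2)
    return 2 * len(e1 & e2) / (len(e1) + len(e2))


G1 = [(1, 2), (2, 6), (2, 5), (5, 8)]
G2 = [(1, 3), (3, 6), (3, 7)]
print(sorted(nodes(G1)), sorted(nodes(G2)))
print(graph_overlap(G1, G2))
```

The two sessions share pages 1 and 6, so the value is 2 · 2 / (5 + 4) = 4/9. The measure only looks at which pages were visited, not at the order of the visits, which is what the sequence-based comparison addresses.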
  • Comparing sequences: an example
      ◦ $S_1 = \text{"12648"}$, $S_2 = \text{"1367"}$, $S_3 = \text{"12856"}$, $S_4 = \text{"1367"}$
      ◦ $L(S_1, S_2) = 0$ if $S_1 = S_2$; otherwise $L(S_1, S_2) \in\ ]0, \max\{\|E(G_1)\|, \|E(G_2)\|\}]$
      ◦ $dG(G_1, G_2) = 1 - 2\,\frac{L(S_1, S_2)}{\|E(G_1)\| + \|E(G_2)\|}$
      ◦ $L(S_1, S_2) = 3$, so $dG(G_1, G_2) = 1 - 2 \cdot \frac{3}{5 + 4} = 0.\overline{3}$
      ◦ $L(S_3, S_4) = 4$
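The numbers in this example can be reproduced with a short script if L is taken to be the Levenshtein (edit) distance between the two page sequences. That reading is an assumption, chosen because it yields the values 3 and 4 shown on the slide; the function names are hypothetical, and the sequence length stands in for ||E(G)||, which holds when no page repeats.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def sequence_dg(s1: str, s2: str) -> float:
    """1 - 2 * L(S1, S2) / (|S1| + |S2|), using sequence lengths for ||E(G)||."""
    return 1 - 2 * levenshtein(s1, s2) / (len(s1) + len(s2))


l12 = levenshtein("12648", "1367")
l34 = levenshtein("12856", "1367")
dg12 = sequence_dg("12648", "1367")
```

Unlike the set-based measure, this version is sensitive to visit order: two sessions over the same pages in a different order no longer score as identical.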
  • Comparing browsing behavior
      ◦ Then the similarity measure is:
        $sm(\alpha, \beta) = dG(p_\alpha^h, p_\beta^h)\,\frac{1}{L}\sum_{k=1}^{L} \min\left(\frac{t_k^\alpha}{t_k^\beta}, \frac{t_k^\beta}{t_k^\alpha}\right) dp(p_{\alpha,k}^c, p_{\beta,k}^c)$
      ◦ $\min\left(\frac{t_k^\alpha}{t_k^\beta}, \frac{t_k^\beta}{t_k^\alpha}\right)$ is an indicator of visitor interest.
      ◦ $dp(p_{\alpha,k}^c, p_{\beta,k}^c)$ is the "page distance" between the content of the visited pages.
      ◦ $dG$ is a "graph distance", i.e., how similar the paths of the two sessions are.
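A sketch of how the three ingredients combine. To stay self-contained it takes the graph term as a precomputed number and the page term as a callable; the names `session_similarity`, `time_ratio` and `toy_dp` are illustrative, and treating a zero time as "no shared interest" is an assumption.

```python
from typing import Callable, Sequence, Tuple

Component = Tuple[int, float]  # (page_id, fraction of session time)


def time_ratio(ta: float, tb: float) -> float:
    """min(ta/tb, tb/ta): 1.0 for equal times, near 0 for very different times."""
    if ta <= 0 or tb <= 0:
        return 0.0  # assumption: a zero time carries no shared interest
    return min(ta / tb, tb / ta)


def session_similarity(alpha: Sequence[Component], beta: Sequence[Component],
                       dg: float, dp: Callable[[int, int], float]) -> float:
    """dG * (1/L) * sum_k time_ratio(t_k) * dp(p_k) over two sessions of equal length L."""
    if len(alpha) != len(beta) or not alpha:
        raise ValueError("sessions must be non-empty and have the same length")
    total = sum(time_ratio(ta, tb) * dp(pa, pb)
                for (pa, ta), (pb, tb) in zip(alpha, beta))
    return dg * total / len(alpha)


def toy_dp(p: int, q: int) -> float:
    """Hypothetical page similarity: identical pages score 1.0, others 0.5."""
    return 1.0 if p == q else 0.5


alpha = [(1, 0.5), (2, 0.5)]
beta = [(1, 0.25), (3, 0.75)]
sm = session_similarity(alpha, beta, dg=0.5, dp=toy_dp)
```

Because the terms are multiplied, two sessions score high only when their paths, their page contents and their time distributions all agree.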
  • What is she/he looking for?
      ◦ Free text
      ◦ Movies
      ◦ Pictures
      ◦ Sounds
  • Visitor preferences
      ◦ The main focus is to understand the text preferences.
      ◦ Web site keywords: the word or set of words that makes a web page more attractive to the visitor.
      ◦ From the visitor behavior vector, we want to select the most important pages, assuming that the degree of importance is correlated with the percentage of time spent on each page.
  • Important pages vector [Velasquez05b]
      ◦ We can suppose that the most important content for the visitors is in the web pages that concentrate the greatest share of the time spent in their visit.
      ◦ Then, to find the most important web page content, we can select a set of components from $v$. Let $\iota$ be the number of components selected:
        $\vartheta_\iota(v) = [(\rho_1, \tau_1), \dots, (\rho_\iota, \tau_\iota)]$
      ◦ $(\rho_k, \tau_k)$ is the component of $v$ that represents the $k$-th most important page and the time spent on it.
      ◦ This vector is obtained by sorting the components of $v$ in descending order of time spent and selecting the first $\iota$.
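The selection step amounts to a sort followed by a cut. A minimal sketch, assuming the behavior vector is a list of (page_id, time_fraction) pairs; the function name and the sample values are hypothetical.

```python
from typing import List, Tuple

Component = Tuple[int, float]  # (page_id, fraction of session time)


def important_pages(v: List[Component], iota: int) -> List[Component]:
    """Return the iota components of v with the largest time share, largest first.

    Python's sort is stable, so pages with equal time keep their visit order.
    """
    return sorted(v, key=lambda component: component[1], reverse=True)[:iota]


v = [(1, 0.1), (2, 0.4), (6, 0.2), (5, 0.3)]
top3 = important_pages(v, 3)
```

Note that the result is ordered by importance, not by visit order, so the sequence information of the original vector is deliberately dropped.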
  • Similarity measure using text
      ◦ $st(\vartheta_\iota(\alpha), \vartheta_\iota(\beta)) = \frac{1}{\iota}\sum_{k=1}^{\iota} \min\left\{\frac{\tau_k^\alpha}{\tau_k^\beta}, \frac{\tau_k^\beta}{\tau_k^\alpha}\right\} \cdot dp(\rho_k^\alpha, \rho_k^\beta)$
      ◦ This equation multiplies the content similarity of the most important visited pages by the ratio of the times spent on them.
      ◦ In this way we can distinguish between two pages with similar content but different times spent.
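The effect described in the last bullet can be seen in a small sketch. The function `text_similarity` and the constant page similarity used here are assumptions for illustration; in practice the page term would be the cosine measure over page vectors.

```python
from typing import Callable, Sequence, Tuple

Component = Tuple[int, float]  # (page_id, fraction of session time)


def text_similarity(a: Sequence[Component], b: Sequence[Component],
                    dp: Callable[[int, int], float]) -> float:
    """(1/iota) * sum_k min(ta/tb, tb/ta) * dp(page_a, page_b)."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and have the same length")
    total = 0.0
    for (pa, ta), (pb, tb) in zip(a, b):
        ratio = min(ta / tb, tb / ta) if ta > 0 and tb > 0 else 0.0
        total += ratio * dp(pa, pb)
    return total / len(a)


def same_content(p: int, q: int) -> float:
    """Hypothetical page similarity: every pair of pages has identical content."""
    return 1.0


a = [(2, 0.4), (5, 0.3)]
b = [(2, 0.2), (5, 0.3)]
st = text_similarity(a, b, same_content)
```

Even though the content term is 1.0 for every pair, the result is below 1 because the first page received twice as much attention in one session as in the other.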
  • Identifying web site keywords
      ◦ A clustering algorithm can be applied to find groups of similar visitor sessions.
      ◦ Based on the cluster centroids, the most important words for each cluster are identified.
      ◦ Using the vector space model, a method to determine the most important keywords and their importance in each cluster is proposed.
      ◦ A measure (geometric mean) of the importance of each word relative to each cluster is defined as:
        $kw[i] = \sqrt[\iota]{\prod_{p \in \zeta} m_{ip}}$, $i = 1, \dots, R$
        where $\zeta$ is a cluster with $\iota$ vectors.
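The geometric mean can be sketched as follows. The nested-dictionary layout (word to page to weight), the function name and all weight values are hypothetical and only serve to show the computation.

```python
import math
from typing import Dict, Sequence


def keyword_importance(m: Dict[str, Dict[int, float]],
                       cluster_pages: Sequence[int]) -> Dict[str, float]:
    """kw[word] = (product of m[word][p] for p in the cluster) ** (1 / cluster size)."""
    iota = len(cluster_pages)
    if iota == 0:
        raise ValueError("cluster must contain at least one page")
    return {
        word: math.prod(weights.get(p, 0.0) for p in cluster_pages) ** (1 / iota)
        for word, weights in m.items()
    }


# Hypothetical weights for two words on three pages of one cluster.
m = {
    "credito": {6: 2.0, 8: 4.0, 190: 1.0},
    "tarjeta": {6: 1.0, 8: 0.0, 190: 3.0},
}
kw = keyword_importance(m, [6, 8, 190])
ranking = sorted(kw, key=kw.get, reverse=True)
```

A consequence of using a product is that a word missing from any single page of the cluster gets importance zero, so only words present in all of the cluster's pages can become keywords.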
  • Applying clustering algorithms [Hinnserbury03, Theodoridis99, Velasquez03]
      ◦ For example, Self Organizing Feature Maps (SOFM).
      ◦ Schematically, a SOFM is presented as a two-dimensional array in whose positions the neurons are located.
      ◦ Each neuron is an n-dimensional vector whose components are the synaptic weights.
      ◦ The notion of neighborhood among the neurons provides diverse topologies.
      ◦ In this case a toroidal topology was used, which means that the closest neighbors of the neurons on the upper edge are located on the lower and lateral edges.
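The wrap-around neighborhood of a toroidal map can be illustrated with a grid distance. This is a sketch only: the slides do not say which grid metric was used, so the Chebyshev (chessboard) distance here is an assumption, as are the function name and the 5 x 5 map size.

```python
from typing import Tuple

Position = Tuple[int, int]  # (row, column) of a neuron in the map


def toroidal_distance(a: Position, b: Position, rows: int, cols: int) -> int:
    """Grid distance between two neurons when opposite edges of the map are joined."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)  # going around the top/bottom edge may be shorter
    dc = min(dc, cols - dc)  # going around the left/right edge may be shorter
    return max(dr, dc)       # assumption: Chebyshev metric on the grid


# On a 5 x 5 map, the top-left neuron is adjacent to the bottom-left one.
d_wrap = toroidal_distance((0, 0), (4, 0), rows=5, cols=5)
d_inner = toroidal_distance((0, 0), (2, 2), rows=5, cols=5)
```

With this topology no neuron sits on a border, so every neuron has the same number of neighbors during training.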
  • Working with real web sites
  • Bank web site
      ◦ It belongs to a Chilean virtual bank, i.e., a bank that has no physical branches and where all transactions are made by electronic means such as e-mail, portals, etc.
      ◦ Written in Spanish.
      ◦ 217 static web pages.
      ◦ Approximately eight million raw web log records, corresponding to the period from January to March 2003.
  • Visitor Behavior Vectors
      ◦ Only 16% of the visitors visit 10 or more pages, and 18% visit fewer than 4.
      ◦ The average number of visited pages is 6.
      ◦ With these data, we fixed 6 as the maximum number of components in a visitor behavior vector.
      ◦ Finally, after applying the filters described above, approximately 200,000 vectors were identified.
  • Visitor Behavior Vectors (2)
  • Cluster analysis
      ◦ Cluster identification and validation are generally subjective tasks.
      ◦ The support of a business expert is necessary; in this case, a bank business expert.
      ◦ Eight clusters were identified, but only four of them were accepted by the business expert.
      ◦ The criterion is "Which clusters make sense to the business expert?"
  • Results
  • Important page clustering
  • Cluster analysis
      ◦ A cluster must contain only pages related by main topic.
      ◦ The business expert defines the main topic of each page.
      ◦ Of the twelve clusters identified, only eight were accepted.
  • Results
        Cluster   Pages visited
        1         (6, 8, 190)
        2         (100, 128, 30)
        3         (86, 150, 97)
        4         (101, 105, 1)
        5         (3, 9, 147)
        6         (100, 126, 58)
        7         (70, 186, 137)
        8         (157, 169, 180)
  • Cluster analysis
      ◦ Check which words the pages in each cluster have in common.
      ◦ Using the page vector definition, it is possible to get the keywords and their relative importance in each cluster:
        $M = (m_{ij})$, with $m_{ij} = f_{ij}\left(1 + \frac{sw_i}{TR}\right) \cdot \log\left(\frac{Q}{n_i}\right)$
        $kw[i] = \sqrt[\iota]{\prod_{p \in \zeta} m_{ip}}$
      ◦ Example: $\zeta = \{6, 8, 190\}$, $kw_i = \sqrt[3]{m_{i6}\, m_{i8}\, m_{i190}}$
  • Cluster analysis
  • Some identified keywords
        #   Keyword (Spanish)   English
        1   Crédito             Credit
        2   Hipotecario         Mortgage
        3   Tarjeta             Card
        4   Promoción           Promotion
        5   Concurso            Contest
        6   Puntos              Points
        7   Descuento           Discount
        8   Cuenta              Account
  • Offline recommendations
      ◦ Structure:
          ▪ Add intra-cluster links, e.g., a link from page 150 to page 137.
          ▪ Add inter-cluster links, e.g., a link from page 105 to page 126.
          ▪ Eliminate links, e.g., the link from page 150 to page 186.
      ◦ Content: use the web site keywords as links or as content in the pages.
  • Online recommendations
      ◦ The current user session is classified into one of the clusters found.
      ◦ The online navigation recommendation is created as a set of links to pages belonging to the current web site.
      ◦ The user may or may not select any of the links.
      ◦ Default case: no recommendation.
      ◦ It is too risky to apply the recommendations directly on the real web site. However, it is possible to run a simulation.
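The flow described above can be sketched in a few lines. This is not the method of the slides: the classification step is replaced by a plain count of shared pages, standing in for the similarity measures, and the clusters, the function name and the `max_links` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence


def recommend(visited: Sequence[int], clusters: Dict[int, Sequence[int]],
              max_links: int = 3) -> List[int]:
    """Suggest unvisited pages from the cluster that best matches the session.

    Stand-in classifier: the best cluster is the one sharing the most pages
    with the pages visited so far. Returns [] (no recommendation) when no
    cluster shares a page with the session.
    """
    seen = set(visited)
    best = max(clusters.values(), key=lambda pages: len(seen & set(pages)))
    if not seen & set(best):
        return []  # default case: no recommendation
    return [page for page in best if page not in seen][:max_links]


# Hypothetical clusters, each given as a tuple of page identifiers.
clusters = {1: (6, 8, 190), 5: (3, 9, 147), 8: (157, 169, 180)}
links = recommend([3, 9], clusters)
no_links = recommend([42], clusters)
```

Because the function only reads historical clusters and a session prefix, it can be replayed over logged sessions to simulate recommendations without touching the live site.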
  • A methodology to test online navigation recommendations
      ◦ Pages of the web site: $\{p_1, \dots, p_n\}$
      ◦ Visitor session: $\alpha = [(p_1, t_1), \dots, (p_m, t_m), \dots, (p_H, t_H)]$
      ◦ Recommendations for $p_{m+1}$: $R_{m+1}(\alpha) = \{l_{m+1,0}, \dots, l_{m+1,k}, \dots, l_{m+1,j}\}$
      ◦ Cluster $Cl_q$: $\forall\, l_{m+1,j} \in R_{m+1}(\alpha)$, $l_{m+1,j} \in Cl_q$, $j > 0$
      ◦ [Diagram: the recommended links $l_{m+1,x}$ are compared with the page $p_{m+1}$ actually visited next.]
  • Online recommendations (2)
  • Testing the online navigation effectiveness
      ◦ [Chart: percentage of acceptance (0 to 100) versus number of page links per suggestion (1, 2, 3), with one series each for the fourth, fifth and sixth page.]
  • Intelligent web site [Velasquez04b]
  • Knowledge Base
      ◦ Parametric rules
      ◦ Historic pattern repository
  • 5.- Summary
  • Summary
      ◦ Web data are a real source for analyzing user behavior on the web.
      ◦ Cleaning and pre-processing the web data is an important step.
      ◦ Applying web mining techniques makes it possible to find previously unknown patterns.
      ◦ These patterns must be validated or rejected by an expert in the business under investigation.
      ◦ With the validated patterns, we can personalize a web site.
  • Thank you very much for your attention