Web Mining

  • Web Mining
    A. Bellaachia
    1. Objectives
    2. Typical Web Log Structure
       2.1. Problems with web log
    3. Web Mining Taxonomy
       3.1. Definitions
       3.2. Data Mining
    4. Web Content Mining
       4.1. Web Crawlers
    5. Web Structure Mining
       5.1. Definitions
    6. Web Usage Mining
       6.1. Objectives
       6.2. Definitions
       6.3. Web Log Analysis Tools
       6.4. Data Preparation
  • 1. Objectives
    • Basic idea:
      o Assist users or site owners in finding something useful, interesting, or relevant
    • Web Mining: The User-Centric View
      o Discovery of documents on a subject
      o Discovery of semantically related documents or document segments
      o Extraction of relevant knowledge about a subject from multiple sources
      o Knowledge/information filtering
    • Web Mining: The Owner-Centric View
      o Increasing contact / conversion efficiency (Web marketing)
      o Targeted promotion of goods, services, products, ads
      o Measuring effectiveness of site content / structure
      o Providing dynamic personalized services or content
  • 2. Typical Web Log Structure
    • There are several kinds of log formats
    • Most common format: Common Log Format (CLF)
    • Common Log Format (www.webdeveloper.com)
      o The common log format appears exactly as follows:
        host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD /PATH HTTP/1.0" code bytes
      ▪ host/ip: If reverse DNS works and DNS lookup is enabled, the hostname of the client is recorded; otherwise the IP address is recorded.
      ▪ rfcname: A name for the user, retrieved from the remote server. If no value is present, a "-" is substituted.
      ▪ logname: If you're using local authentication and registration, the user's log name appears; likewise, if no value is present, a "-" is substituted.
      ▪ datestamp: The format is day, month (three-letter abbreviation), year, hour on a 24-hour clock, minute, second, and the offset from Greenwich Mean Time (for example, Pacific Standard Time is -0800).
      ▪ retrieval: Method is GET, PUT, POST, or HEAD; path is the path and file retrieved; HTTP/1.0 defines the protocol.
      ▪ code: HTTP completion code. 200 is successful, 304 is a reload from cache, 404 is file not found, and so forth.
      ▪ bytes: Number of bytes in the file retrieved.
    • Here's an example:
        sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278
    2.1. Problems with web log
    • Identifying users
      o Clients may have multiple streams
      o Clients may access the web from multiple hosts
      o Proxy servers: many clients/one address
      o Proxy servers: one client/many addresses
    • Data not in log
      o POST data (i.e., CGI request) not recorded
      o Cookie data stored elsewhere
    • Missing data
      o Pages may be cached
      o Referring page requires client cooperation
      o When does a session end?
      o Use of forward and backward pointers
    • Typically a 30-minute timeout is used to decide that a session has ended
    • Web content may be dynamic
      o May not be able to reconstruct what the user saw
    • Use of spiders and automated agents, which request web pages automatically
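The Common Log Format fields described in Section 2 can be separated with a regular expression. The following is a minimal Python sketch, not a reference parser: the names `CLF_PATTERN` and `parse_clf_line` and the sample log line are illustrative assumptions, and the month abbreviation is assumed to be in English (the default C locale). A "-" placeholder is mapped to `None`, and a datestamp that is not a real calendar date (such as 30 February) leaves the parsed `time` as `None` while the raw string is kept.

```python
import re
from datetime import datetime

# Field order: host rfcname logname [datestamp] "METHOD /PATH PROTOCOL" code bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
    r'\[(?P<datestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # "-" marks a missing value.
    for key in ("rfcname", "logname"):
        if entry[key] == "-":
            entry[key] = None
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    entry["code"] = int(entry["code"])
    try:
        # %z accepts GMT offsets such as -0800.
        entry["time"] = datetime.strptime(entry["datestamp"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        entry["time"] = None  # malformed or impossible date; raw string is kept
    return entry


# Hypothetical log line, made up for illustration.
sample = ('client.example.org - - [12/Feb/1996:06:03:24 -0800] '
          '"GET /film/logos/main.gif HTTP/1.0" 200 278')
entry = parse_clf_line(sample)
print(entry["host"], entry["method"], entry["path"], entry["code"], entry["bytes"])
```

Note that nothing in these fields identifies an individual user or a session, which is why the identification problems listed in Section 2.1 have to be handled in a later preparation step.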
  • 3. Web Mining Taxonomy
    [Figure: taxonomy tree. Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining; the leaf nodes are Web Page Content Mining, Search Results Mining, General Access Pattern Tracking, and Customized Usage Tracking.]
    3.1. Definitions
    • Web Content Mining
      o Web Page Content Mining
      o Summarization of Web page contents (WebSQL, WebOQL, WebML, WebLog, W3QL)
    • Web Structure Mining
      o Search Result Mining
      o Summarization of search engine results (PageRank™)
      o Capturing the Web's structure using link interconnections (HITS)
    • Web Usage Mining
      o General Access Pattern Mining
        ▪ Uses KDD techniques to understand general user patterns (WUM, WEBMiner, WAP, WebLogMiner)
      o Customized Usage Tracking
        ▪ Adaptive sites
    3.2. Data Mining
    • Frequent Itemsets
      o Which pages are accessed together in at least a given number of sessions.
    • Association Rules
      o When a page Pi is accessed in a session, Pj is also accessed x% of the time.
    • Clustering: Content-Based or Usage-Based
      o Customer/visitor segmentation
      o Categorization of pages and products
    • Classification
      o Visitors who bought PDAs and laptops have an income of 80K+ and live in zip code 11111.
      o Send a banner ad to visitors in class Ci: potential buyers of a product A.
    4. Web Content Mining
    • It is related to data mining
    • It extends the work of basic search engines
    • It includes: text, multimedia
    • Research areas:
      o Classification of complete Web sites
      o Focused Web crawling
    4.1. Web Crawlers
  • Robot (spider):
      o Traverses the hypertext structure of the Web
      o Collects information from visited pages
      o Used to construct indexes for search engines
    • Traditional Crawler
      o Visits the entire Web and replaces the index
    • Periodic Crawler
      o Visits portions of the Web and updates a subset of the index
    • Incremental Crawler
      o Selectively searches the Web and incrementally modifies the index
    • Focused Crawler
      o Visits pages related to a particular subject
    5. Web Structure Mining
    • Mine the structure (links, graph) of the Web
    • Techniques: PageRank, HITS
    5.1. Definitions
    • Important pages: A page is important if important pages link to it
    • PageRank:
      o Used by Google
      o Prioritizes pages returned from a search
      o The importance of a page is calculated based on the number of pages that point to it (backlinks).
      o Weighting is used to give more importance to backlinks coming from important pages.
      o PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n), where pages 1, …, n are the pages that link to p, N_i is the number of outgoing links on page i, and c is a normalization constant.
    • Definitions
      o Authoritative pages: Highly important pages.
      o Hub pages: Pages that contain links to highly important pages.
    [Figure: an authority page and a hub page.]
    • HITS (Hyperlink-Induced Topic Search) Algorithm
      o The approach consists of two phases:
        ▪ First phase: the query terms are used to collect a starting set of pages (about 200) from an index-based search engine; this is the root set of pages. The root set is then expanded into a base set by including all the pages that the root set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such as 2000-5000.
        ▪ Second phase: weight propagation. This is an iterative process that determines numerical estimates of hub and authority weights. Links between two pages within the same Web domain usually serve a navigation function and thus do not confer authority; such links are excluded from the analysis.
    [Figure: the root set (pages R1 … Rn) inside the larger base set (pages S1 … Sn), with a page labeled u.]
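Both link-analysis techniques in Section 5 reduce to short iterations over a link graph. The Python sketch below is an illustration under stated assumptions, not the production form of either algorithm: `pagerank` iterates the simplified formula PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n) with c chosen so the ranks sum to 1 (no damping factor), and `hits` uses the standard hub/authority update with normalization for the weight-propagation phase. The function names and the four-page graph are hypothetical, and the exclusion of same-domain links is not modeled.

```python
import math


def pagerank(links, iterations=50):
    """Iterate PR(p) = c * sum(PR(q) / N_q) over the pages q that link to p.

    `links` maps every page to the list of pages it links to; every link
    target must also appear as a key. c is chosen so the ranks sum to 1.
    """
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in links}
        for q, targets in links.items():
            for p in targets:
                new_rank[p] += rank[q] / len(targets)  # PR(q) / N_q
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


def hits(links, iterations=50):
    """Weight propagation: return (hub, authority) weight dictionaries."""
    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing at p.
        auth = {p: 0.0 for p in links}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub weight: sum of the authority weights of the pages q points to.
        hub = {q: sum(auth[p] for p in targets) for q, targets in links.items()}
        # Normalize so the squared weights sum to 1.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical four-page link graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
hub, auth = hits(graph)
print("PageRank:", {p: round(r, 3) for p, r in ranks.items()})
print("Best authority:", max(auth, key=auth.get))
print("Best hub:", max(hub, key=hub.get))
```

In this graph page D has no backlinks, so its rank drops to zero, while page C, which every other page links to, ends up as the strongest authority.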
  • 6. Web Usage Mining
    6.1. Objectives
    • Enhance server performance
    • Improve web site navigation
    • Improve system design of web applications
    • Target customers for electronic commerce
    • Identify potential prime advertisement locations
    6.2. Definitions
    • User (Visitor): a single individual who accesses files from one or more Web servers through a browser
    • Page File: a file that is served through the HTTP protocol
    • Pageview: the set of Page Files that contribute to a single display in a Web browser
    • User Session: the set of Pageviews served due to a series of HTTP requests from a single User across the entire Web
    6.3. Web Log Analysis Tools
    • Frequently used, pre-defined reports:
      ▪ Summary report of hits and bytes transferred
      ▪ List of top requested URLs
      ▪ List of top referrers
      ▪ List of most common browsers
      ▪ Hits per hour/day/week/month reports
      ▪ Hits per Internet domain
      ▪ Error report, etc.
    • Tools are limited in their performance and depth of analysis.
    • Examples:
      ▪ Webtrends
      ▪ Analog
    6.4. Data Preparation
    • Data cleaning
      o Remove irrelevant references and fields in server logs
      o Remove references due to spider navigation
      o Remove erroneous references
    • Data integration
      o Synchronize data from multiple server logs
      o Integrate e-commerce and application server data
      o Integrate meta-data, content and structure data
      o Integrate demographic / registration data
    • Data transformation
      o Pageview identification
      o User identification
      o Sessionization / episode identification
    • Data reduction
      o Sampling; dimensionality reduction (ignoring certain pageviews / items)
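Two of the preparation steps in Section 6.4, data cleaning and sessionization, can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: requests are already parsed into dictionaries with hypothetical keys `host`, `time`, `path`, and `status`; "irrelevant references" are approximated by a made-up list of file suffixes; "erroneous references" are approximated by status codes of 400 and above; each host is treated as one user; and a session ends after 30 minutes of inactivity. Spider removal and data integration are not modeled.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)
# Assumed list of embedded-resource suffixes treated as irrelevant references.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def clean(requests):
    """Data cleaning: drop failed requests and embedded-resource requests."""
    return [
        r for r in requests
        if r["status"] < 400 and not r["path"].lower().endswith(IRRELEVANT_SUFFIXES)
    ]


def sessionize(requests):
    """Group requests by host, then start a new session whenever the gap
    between two consecutive requests of a host exceeds SESSION_TIMEOUT."""
    by_host = defaultdict(list)
    for r in sorted(requests, key=lambda r: r["time"]):
        by_host[r["host"]].append(r)
    sessions = []
    for stream in by_host.values():
        current = [stream[0]]
        for prev, nxt in zip(stream, stream[1:]):
            if nxt["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(nxt)
        sessions.append(current)
    return sessions


# Hypothetical, already-parsed log entries.
t0 = datetime(2024, 1, 1, 9, 0)
log = [
    {"host": "10.0.0.1", "time": t0, "path": "/index.html", "status": 200},
    {"host": "10.0.0.1", "time": t0 + timedelta(seconds=1), "path": "/logo.gif", "status": 200},
    {"host": "10.0.0.1", "time": t0 + timedelta(minutes=5), "path": "/products.html", "status": 200},
    {"host": "10.0.0.1", "time": t0 + timedelta(minutes=50), "path": "/index.html", "status": 200},
    {"host": "10.0.0.2", "time": t0 + timedelta(minutes=2), "path": "/missing.html", "status": 404},
    {"host": "10.0.0.2", "time": t0 + timedelta(minutes=3), "path": "/index.html", "status": 200},
]
for session in sessionize(clean(log)):
    print(session[0]["host"], [r["path"] for r in session])
```

Treating each host as one user is the weakest assumption here: proxy servers and shared machines break it, so real preparation pipelines typically combine the host with other evidence such as cookies or registration data.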