Web Mining
  • 1. Web Mining
    1. Objectives ... 2
    2. Typical Web Log Structure ... 3
       2.1. Problems with web log ... 4
    3. Web Mining Taxonomy ... 5
       3.1. Definitions ... 5
       3.2. Data Mining ... 6
    4. Web Content Mining ... 6
       4.1. Web Crawlers ... 6
    5. Web Structure Mining ... 7
       5.1. Definitions ... 7
    6. Web Usage Mining ... 10
       6.1. Objectives ... 10
       6.2. Definitions ... 10
       6.3. Web Log Analysis Tools ... 10
       6.4. Data Preparation ... 11
    A. Bellaachia Page: 1
  • 2. 1. Objectives
    • Basic Idea:
      o Assist users or site owners in finding something useful/interesting/relevant
    • Web Mining: The User-Centric View
      o Discovery of documents on a subject
      o Discovery of semantically related documents or document segments
      o Extraction of relevant knowledge about a subject from multiple sources
      o Knowledge/information filtering
    • Web Mining: The Owner-Centric View
      o Increasing contact / conversion efficiency (Web marketing)
      o Targeted promotion of goods, services, products, ads
      o Measuring effectiveness of site content / structure
      o Providing dynamic personalized services or content
  • 3. 2. Typical Web Log Structure
    • There are several kinds of log formats
    • Most common format: Common Log Format (CLF)
    • Common Log Format (www.webdeveloper.com)
      o The common log format appears exactly as follows:
        host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD /PATH HTTP/1.0" code bytes
        ▪ host/ip: If reverse DNS works and DNS lookup is enabled, the hostname of the client is recorded; otherwise the IP number is shown.
        ▪ rfcname: A name for the user retrieved from the remote server. If no value is present, a "-" is substituted.
        ▪ logname: If you're using local authentication and registration, the user's log name appears; likewise, if no value is present, a "-" is substituted.
        ▪ datestamp: The format is day, month (three-letter abbreviation), year, hour in 24-hour clock, minute, second, and the offset from Greenwich Mean Time (for example, Pacific Standard Time is -0800).
        ▪ retrieval: Method is GET, PUT, POST, or HEAD; path is the path and file retrieved; HTTP/1.0 defines the protocol.
        ▪ code: HTTP completion code. 200 is successful, 304 is not modified (the client's cached copy is still valid), 404 is file not found, and so forth.
  • 4.
        ▪ bytes: Number of bytes in the file retrieved.
    • Here's an example:
      sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278
    2.1. Problems with web log
    • Identifying users
      o Clients may have multiple streams
      o Clients may access the web from multiple hosts
      o Proxy servers: many clients/one address
      o Proxy servers: one client/many addresses
    • Data not in log
      o POST data (e.g., CGI requests) not recorded
      o Cookie data stored elsewhere
    • Missing data
      o Pages may be cached
      o Referring page requires client cooperation
      o When does a session end?
      o Use of forward and backward pointers
    • Typically a 30-minute session timeout is used
    • Web content may be dynamic
      o May not be able to reconstruct what the user saw
    • Use of spiders and automated agents, which request web pages automatically
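The field layout described in Section 2 can be turned into a small parser. The following is a minimal sketch, not a complete log reader: the regular expression and the names `CLF_PATTERN` and `parse_clf` are assumptions made for illustration, and the sample line reuses the example above with the day changed to 28, since 30 February is not a valid calendar date and would be rejected when the timestamp is parsed.

```python
import re
from datetime import datetime

# Hypothetical pattern following the layout quoted in Section 2:
# host/ip rfcname logname [datestamp] "METHOD /PATH PROTOCOL" code bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Day/three-letter month/year:hour:minute:second plus GMT offset.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["code"] = int(record["code"])
    # Some servers write "-" when no body was sent; treat that as 0 bytes.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = ('sniksnak.foobar.org - - [28/Feb/1996:06:03:24 -0800] '
          '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278')
record = parse_clf(sample)
print(record["host"], record["path"], record["code"], record["bytes"])
```

Note that a parsed record still says nothing about who the user is: the host field only identifies a client address, which is exactly the limitation listed in Section 2.1.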
  • 5. 3. Web Mining Taxonomy
    [Figure: taxonomy tree. Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining; leaf nodes are Web Page Content Mining, Search Results Mining, General Access Pattern Tracking, and Customized Usage Tracking.]
    3.1. Definitions
    • Web Content Mining
      o Web Page Content Mining
      o Summarization of Web page contents (WebSQL, WebOQL, WebML, WebLog, W3QL)
    • Web Structure Mining
      o Search Result Mining
      o Summarization of search engine results (PageRank™)
      o Capturing the Web's structure using link interconnections (HITS)
    • Web Usage Mining
      o General Access Pattern Mining
        ▪ Uses KDD techniques to understand general user patterns (WUM, WEBMiner, WAP, WebLogMiner)
      o Customized Usage Tracking
  • 6.
        ▪ Adaptive sites
    3.2. Data Mining
    • Frequent Itemsets
      o Which pages are accessed together in at least a given number of sessions?
    • Association Rules
      o When a page Pi is accessed in a session, Pj is also accessed x% of the time.
    • Clustering: Content-Based or Usage-Based
      o Customer/visitor segmentation
      o Categorization of pages and products
    • Classification
      o Visitors who bought PDAs and laptops have an income of 80K+ and live in zip code 11111.
      o Send a banner ad to visitors in class Ci: potential buyers of a product A.
    4. Web Content Mining
    • It is related to data mining
    • It extends the work of basic search engines
    • It includes: text, multimedia
    • Research areas:
      o Classification of complete Web sites
      o Focused Web crawling
    4.1. Web Crawlers
  • 7.
    • Robot (spider):
      o Traverses the hypertext structure of the Web
      o Collects information from visited pages
      o Used to construct indexes for search engines
    • Traditional Crawler
      o Visits the entire Web and replaces the index
    • Periodic Crawler
      o Visits portions of the Web and updates a subset of the index
    • Incremental Crawler
      o Selectively searches the Web and incrementally modifies the index
    • Focused Crawler
      o Visits pages related to a particular subject
    5. Web Structure Mining
    • Mine the structure (links, graph) of the Web
    • Techniques: PageRank
    5.1. Definitions
    • Important pages: A page is important if important pages link to it
    • PageRank:
      o Used by Google
      o Prioritizes pages returned from a search
  • 8.
      o Importance of a page is calculated based on the number of pages that point to it (backlinks).
      o Weighting is used to give more importance to backlinks coming from important pages.
      o PR(p) = c (PR(1)/N1 + … + PR(n)/Nn), where pages 1, …, n are the pages that link to p, Ni is the number of outgoing links on page i, and c is a normalization constant.
    • Definitions
      o Authoritative Pages: Highly important pages.
      o Hub Pages: Pages that contain links to highly important pages.
      [Figure: authority and hub pages]
    • HITS (Hyperlink-Induced Topic Search) Algorithm
      o The approach consists of two phases:
        ▪ It uses the query terms to collect a starting set of pages (200 pages) from an index-based search engine: the root set of pages.
        ▪ The root set is expanded into a base set by including all the pages that the root set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such as 2000-5000.
        ▪ A weight-propagation phase is initiated. This is an iterative process that determines numerical estimates of hub and authority weights. Links between two pages within the same Web domain usually serve a navigation function and thus do not confer authority; such links are excluded from the analysis.
  • 9. [Figure: the root set (pages R1 … Rn) expanded into the base set (pages S1 … Sn)]
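Both link-analysis techniques in Section 5 can be sketched on a toy link graph. This is an illustrative sketch under stated assumptions, not Google's or the HITS authors' implementation: the function names, the graphs, and the iteration counts are hypothetical. `pagerank` iterates the simplified formula PR(p) = c (PR(1)/N1 + … + PR(n)/Nn) with c chosen so the ranks sum to 1 (no damping factor, and every page is assumed to have at least one outgoing link). `hits` runs only the weight-propagation phase and assumes the base set has already been built and same-domain links already removed.

```python
import math

def pagerank(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each backlink contributes PR(source) / N_source.
                new_rank[target] += rank[source] / len(targets)
        total = sum(new_rank.values())  # c = 1 / total
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

def hits(links, iterations=50):
    """Return (hub, authority) weights after iterative propagation."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub weights of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight of a page: sum of authority weights it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graphs for illustration.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(ranks)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In the first graph, page C collects links from both A and B, and A receives all of C's weight, so A and C end up ranked above B. In the second graph, a1 is linked from both hubs and becomes the strongest authority, while h1 links to both authorities and becomes the strongest hub.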
  • 10. 6. Web Usage Mining
    6.1. Objectives
    • Enhance server performance
    • Improve web site navigation
    • Improve system design of web applications
    • Target customers for electronic commerce
    • Identify potential prime advertisement locations
    6.2. Definitions
    • User (Visitor) - Single individual who is accessing files from one or more Web servers through a browser
    • Page File - File that is served through the HTTP protocol
    • Pageview - Set of Page Files that contribute to a single display in a Web browser
    • User Session - Set of Pageviews served due to a series of HTTP requests from a single User across the entire Web
    6.3. Web Log Analysis Tools
    • Frequently used, pre-defined reports:
        ▪ Summary report of hits and bytes transferred
        ▪ List of top requested URLs
        ▪ List of top referrers
        ▪ List of most common browsers
        ▪ Hits per hour/day/week/month reports
        ▪ Hits per Internet domain
        ▪ Error report, etc.
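Several of the pre-defined reports listed above reduce to simple counting over parsed log records. The sketch below assumes the log has already been parsed into tuples of (hour, URL, status code, bytes); the `records` data and variable names are hypothetical and only illustrate the kind of aggregation such tools perform.

```python
from collections import Counter

# Hypothetical parsed log records: (hour, url, status_code, bytes_sent)
records = [
    (6, "/index.html", 200, 1200),
    (6, "/film/logos/the.movies.main.gif", 200, 278),
    (7, "/index.html", 200, 1200),
    (7, "/missing.html", 404, 0),
    (7, "/index.html", 304, 0),
]

# Summary report of hits and bytes transferred.
total_hits = len(records)
total_bytes = sum(r[3] for r in records)

# Most requested URL and hits per hour.
top_url = Counter(r[1] for r in records).most_common(1)
hits_per_hour = Counter(r[0] for r in records)

# Error report: requests with a 4xx or 5xx completion code.
errors = [r for r in records if r[2] >= 400]

print(total_hits, total_bytes, top_url, dict(hits_per_hour), errors)
```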
  • 11.
    • Tools are limited in their performance and depth of analysis.
    • Examples:
        ▪ Webtrends
        ▪ Analog
    6.4. Data Preparation
    • Data cleaning
      o Remove irrelevant references and fields in server logs
      o Remove references due to spider navigation
      o Remove erroneous references
    • Data integration
      o Synchronize data from multiple server logs
      o Integrate e-commerce and application server data
      o Integrate meta-data, content and structure data
      o Integrate demographic / registration data
    • Data transformation
      o Pageview identification
      o User identification
      o Sessionization / episode identification
    • Data reduction
      o Sampling; dimensionality reduction (ignoring certain pageviews / items)
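Sessionization with the 30-minute timeout from Section 2.1, followed by the frequent-itemset and association-rule measures from Section 3.2, can be sketched as follows. This is a simplified illustration under assumptions: the request tuples, the user keys, and the function names are hypothetical, user identification is taken as already solved, and timestamps are plain seconds.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split (user, time, page) requests into per-user sessions of pages."""
    by_user = defaultdict(list)
    for user, ts, page in requests:
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        visits.sort()
        current, last_ts = [], None
        for ts, page in visits:
            # A gap longer than the timeout starts a new session.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        sessions.append(current)
    return sessions

def support(sessions, pages):
    """Fraction of sessions in which all the given pages were accessed."""
    wanted = set(pages)
    return sum(1 for s in sessions if wanted <= set(s)) / len(sessions)

def confidence(sessions, pi, pj):
    """Of the sessions that access pi, the fraction that also access pj."""
    with_pi = [s for s in sessions if pi in s]
    return sum(1 for s in with_pi if pj in s) / len(with_pi)

# Hypothetical cleaned requests: (user, seconds since start, page)
requests = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 4000, "/home"),
    ("u2", 10, "/home"), ("u2", 100, "/products"), ("u2", 200, "/cart"),
]
sessions = sessionize(requests)
print(sessions)
print(support(sessions, ["/home", "/products"]))
print(confidence(sessions, "/products", "/home"))
```

Here the third request from u1 arrives more than 30 minutes after the second, so it opens a new session. A page set counts as frequent when its support meets a chosen threshold, and the confidence value is the x% in the rule "when Pi is accessed, Pj is also accessed x% of the time".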