Web Mining
A. Bellaachia

Contents
1. Objectives
2. Typical Web Log Structure
   2.1. Problems with web logs
3. Web Mining Taxonomy
   3.1. Definitions
   3.2. Data Mining
4. Web Content Mining
   4.1. Web Crawlers
5. Web Structure Mining
   5.1. Definitions
6. Web Usage Mining
   6.1. Objectives
   6.2. Definitions
   6.3. Web Log Analysis Tools
   6.4. Data Preparation
1. Objectives

• Basic Idea:
  o Assist users or site owners in finding something useful/interesting/relevant

• Web Mining: The User-Centric View
  o Discovery of documents on a subject
  o Discovery of semantically related documents or document segments
  o Extraction of relevant knowledge about a subject from multiple sources
  o Knowledge/information filtering

• Web Mining: The Owner-Centric View
  o Increasing contact/conversion efficiency (Web marketing)
  o Targeted promotion of goods, services, products, ads
  o Measuring effectiveness of site content/structure
  o Providing dynamic personalized services or content
2. Typical Web Log Structure

• There are several kinds of log formats.
• Most common format: Common Log Format (CLF)
• Common Log Format (www.webdeveloper.com)
  o The common log format appears exactly as follows:

    host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD /PATH HTTP/1.0" code bytes

  o host/ip
    - If reverse DNS works and DNS lookup is enabled, the hostname of the client is recorded; otherwise the IP number is displayed.
  o rfcname
    - A user name retrieved from the remote host's ident service. If no value is present, a "-" is substituted.
  o logname
    - If local authentication and registration are in use, the user's log name appears; likewise, if no value is present, a "-" is substituted.
  o datestamp
    - The format is day, month (three-letter abbreviation), year, hour in 24-hour clock, minute, second, and the offset from Greenwich Mean Time (for example, Pacific Standard Time is -0800).
  o retrieval
    - Method is GET, PUT, POST, or HEAD; path is the path and file retrieved; HTTP/1.0 identifies the protocol.
  o code
    - HTTP completion code: 200 is successful, 304 is a reload from cache, 404 is file not found, and so forth.
  o bytes
    - Number of bytes in the file retrieved.

• Here's an example (a parsing sketch follows at the end of this section):

    sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278

2.1. Problems with web logs

• Identifying users
  o Clients may have multiple streams
  o Clients may access the web from multiple hosts
  o Proxy servers: many clients/one address
  o Proxy servers: one client/many addresses
• Data not in the log
  o POST data (i.e., CGI request data) is not recorded
  o Cookie data is stored elsewhere
• Missing data
  o Pages may be cached
  o The referring page requires client cooperation
  o When does a session end?
  o Use of forward and backward navigation is not recorded
  o Typically, a 30-minute timeout is used to delimit sessions
• Web content may be dynamic
  o It may not be possible to reconstruct what the user saw
• Use of spiders and automated agents that automatically request web pages
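A minimal sketch of reading CLF entries like the example above, assuming Python and its standard re module; the field names mirror the format description, and the regular expression is illustrative rather than a complete CLF grammar.

    import re

    # Rough pattern for the Common Log Format described above:
    # host rfcname logname [datestamp] "method path protocol" code bytes
    CLF_PATTERN = re.compile(
        r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
        r'\[(?P<datestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
        r'(?P<code>\d{3}) (?P<bytes>\d+|-)'
    )

    def parse_clf_line(line):
        """Return a dict of CLF fields, or None if the line does not match."""
        match = CLF_PATTERN.match(line.strip())
        return match.groupdict() if match else None

    example = ('sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
               '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278')
    entry = parse_clf_line(example)
    print(entry["host"], entry["path"], entry["code"], entry["bytes"])

Parsing is only the first step; the problems listed above (caching, proxies, dynamic content) mean the resulting records still need the cleaning described in section 6.4.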
3. Web Mining Taxonomy

[Taxonomy diagram: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining, with sub-branches for Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, and Customized Usage Tracking.]

3.1. Definitions

• Web Content Mining
  o Web Page Content Mining
  o Summarization of Web page contents (WebSQL, WebOQL, WebML, WebLog, W3QL)
• Web Structure Mining
  o Search Result Mining
  o Summarization of search engine results (PageRank)
  o Capturing the Web's structure using link interconnections (HITS)
• Web Usage Mining
  o General Access Pattern Mining
    - Uses KDD techniques to understand general user patterns (WUM, WEBMiner, WAP, WebLogMiner)
  o Customized Usage Tracking
    - Adaptive sites

3.2. Data Mining

• Frequent Itemsets
  o Which pages are accessed together in a certain number of sessions?
• Association Rules
  o When a page Pi is accessed in a session, Pj is also accessed x% of the time.
• Clustering: Content-Based or Usage-Based
  o Customer/visitor segmentation
  o Categorization of pages and products
• Classification
  o Example: visitors who bought PDAs and laptops have an income of 80K+ and live in zipcode 11111.
  o Send a banner ad to visitors in class Ci: potential buyers of product A.
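As an illustration of the frequent-itemset and association-rule ideas above, here is a minimal sketch, assuming Python and toy session data; the page names and thresholds are made up for the example.

    from itertools import combinations
    from collections import Counter

    # Toy sessions: each session is the set of pages a visitor accessed.
    sessions = [
        {"home", "products", "cart"},
        {"home", "products"},
        {"home", "about"},
        {"home", "products", "cart", "checkout"},
    ]

    # Frequent pairs: pages accessed together in at least min_support sessions.
    min_support = 2
    pair_counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            pair_counts[pair] += 1
    frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
    print("Frequent pairs:", frequent_pairs)

    # Association rule Pi -> Pj: when Pi is accessed, Pj is also accessed x% of the time.
    def confidence(pi, pj):
        with_pi = [s for s in sessions if pi in s]
        if not with_pi:
            return 0.0
        return sum(1 for s in with_pi if pj in s) / len(with_pi)

    print("confidence(products -> cart):", confidence("products", "cart"))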
4. Web Content Mining

• It is related to data mining.
• It extends the work of basic search engines.
• It includes text and multimedia.
• Research areas:
  o Classification of complete web sites
  o Focused web crawling

4.1. Web Crawlers

• Robot (spider):
  o Traverses the hypertext structure of the Web
  o Collects information from visited pages
  o Used to construct indexes for search engines
• Traditional crawler
  o Visits the entire Web and replaces the index
• Periodic crawler
  o Visits portions of the Web and updates a subset of the index
• Incremental crawler
  o Selectively searches the Web and incrementally modifies the index
• Focused crawler
  o Visits pages related to a particular subject
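A minimal sketch of the crawling loop described above, assuming Python with the third-party requests and BeautifulSoup libraries; the seed URL and page limit are placeholders, and a real crawler would also respect robots.txt, rate limits, and politeness policies.

    from collections import deque
    from urllib.parse import urljoin

    import requests                   # third-party: pip install requests
    from bs4 import BeautifulSoup     # third-party: pip install beautifulsoup4

    def crawl(seed_url, max_pages=20):
        """Breadth-first traversal of the hyperlink structure from a seed page."""
        seen = {seed_url}
        frontier = deque([seed_url])
        index = {}                    # url -> extracted text, for a toy index

        while frontier and len(index) < max_pages:
            url = frontier.popleft()
            try:
                response = requests.get(url, timeout=5)
            except requests.RequestException:
                continue              # skip unreachable pages
            soup = BeautifulSoup(response.text, "html.parser")
            index[url] = soup.get_text(" ", strip=True)[:200]

            # Follow out-links; a focused crawler would also score them by topic.
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return index

    # Example (placeholder seed URL):
    # pages = crawl("https://example.org", max_pages=10)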
5. Web Structure Mining

• Mine the structure (links, graph) of the Web
• Techniques: PageRank, HITS

5.1. Definitions

• Important pages: a page is important if important pages link to it.
• PageRank:
  o Used by Google
  o Prioritizes pages returned from a search
  o The importance of a page is calculated based on the number of pages that point to it – its backlinks.
  o Weighting is used to give more importance to backlinks coming from important pages.
  o PR(p) = c * (PR(p1)/N1 + ... + PR(pn)/Nn), where p1, ..., pn are the pages linking to p, Ni is the number of outgoing links on pi, and c is a normalization constant.
• Definitions
  o Authoritative pages: highly important pages.
  o Hub pages: pages that contain links to highly important pages.

[Diagram: an authority is a page pointed to by many hubs; a hub is a page that points to many authorities.]

• HITS (Hyperlink-Induced Topic Search) Algorithm
  o The approach consists of two phases:
    - It uses the query terms to collect a starting set of pages (about 200 pages) from an index-based search engine – the root set of pages.
    - The root set is expanded into a base set by including all the pages that the root set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such as 2000-5000 pages.
    - A weight-propagation phase is then initiated. This is an iterative process that determines numerical estimates of hub and authority weights. (Links between two pages within the same Web domain usually serve a navigation function and thus do not confer authority – such links are excluded from the analysis.)

[Diagram: the root set of pages R1, ..., Rn is expanded into the base set by adding pages S1, ..., Sn that link to, or are linked from, pages in the root set.]
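A minimal sketch of the two link-analysis computations above, assuming Python and a tiny hand-made link graph; the graph, the constant c, the normalization steps, and the iteration counts are illustrative, not part of the original notes.

    # Toy Web graph: page -> set of pages it links to (illustrative data).
    links = {
        "A": {"B", "C"},
        "B": {"C"},
        "C": {"A"},
        "D": {"C"},
    }

    def pagerank(links, c=0.85, iterations=50):
        """Iterate PR(p) = c * sum(PR(q)/Nq) over the pages q linking to p."""
        pr = {p: 1.0 / len(links) for p in links}
        for _ in range(iterations):
            new_pr = {}
            for p in links:
                backlinks = [q for q in links if p in links[q]]
                new_pr[p] = c * sum(pr[q] / len(links[q]) for q in backlinks)
            # Renormalize so the scores keep summing to 1.
            total = sum(new_pr.values()) or 1.0
            pr = {p: score / total for p, score in new_pr.items()}
        return pr

    def hits(links, iterations=50):
        """Weight propagation: good hubs point to good authorities, and vice versa."""
        hub = {p: 1.0 for p in links}
        auth = {p: 1.0 for p in links}
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
            hub = {p: sum(auth[q] for q in links[p]) for p in links}
            # Normalize to keep the weights from growing without bound.
            auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / auth_norm for p, v in auth.items()}
            hub = {p: v / hub_norm for p, v in hub.items()}
        return hub, auth

    print("PageRank:", pagerank(links))
    print("HITS (hub, authority):", hits(links))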
6. Web Usage Mining

6.1. Objectives

• Enhance server performance
• Improve web site navigation
• Improve the system design of web applications
• Target customers for electronic commerce
• Identify potential prime advertisement locations

6.2. Definitions

• User (Visitor): a single individual accessing files from one or more Web servers through a browser.
• Page File: a file that is served through the HTTP protocol.
• Pageview: the set of page files that contribute to a single display in a Web browser.
• User Session: the set of pageviews served due to a series of HTTP requests from a single user across the entire Web.

6.3. Web Log Analysis Tools

• Frequently used, pre-defined reports:
  o Summary report of hits and bytes transferred
  o List of top requested URLs
  o List of top referrers
  o List of most common browsers
  o Hits per hour/day/week/month reports
  o Hits per Internet domain
  o Error report, etc.
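As a sketch of the kind of pre-defined reports listed above, assuming Python and log entries already parsed into dictionaries (for instance by the CLF parser sketched in section 2); the sample entries are made up.

    from collections import Counter

    # Sample parsed log entries (fields as in the Common Log Format above).
    entries = [
        {"host": "sniksnak.foobar.org", "path": "/index.html",
         "datestamp": "30/Feb/1996:06:03:24 -0800", "code": "200", "bytes": "278"},
        {"host": "alpha.example.com", "path": "/index.html",
         "datestamp": "30/Feb/1996:07:15:02 -0800", "code": "200", "bytes": "512"},
        {"host": "alpha.example.com", "path": "/missing.gif",
         "datestamp": "30/Feb/1996:07:15:05 -0800", "code": "404", "bytes": "0"},
    ]

    # Summary of hits and bytes transferred.
    total_hits = len(entries)
    total_bytes = sum(int(e["bytes"]) for e in entries if e["bytes"].isdigit())

    # Top requested URLs and hits per hour.
    top_urls = Counter(e["path"] for e in entries).most_common(10)
    hits_per_hour = Counter(e["datestamp"].split(":")[1] for e in entries)

    # Error report: responses outside the 2xx/3xx range.
    errors = Counter(e["code"] for e in entries if not e["code"].startswith(("2", "3")))

    print(total_hits, total_bytes, top_urls, dict(hits_per_hour), dict(errors))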
• Tools are limited in their performance and depth of analysis.
• Examples:
  o WebTrends
  o Analog

6.4. Data Preparation

• Data cleaning
  o Remove irrelevant references and fields in server logs
  o Remove references due to spider navigation
  o Remove erroneous references
• Data integration
  o Synchronize data from multiple server logs
  o Integrate e-commerce and application server data
  o Integrate meta-data, content, and structure data
  o Integrate demographic/registration data
• Data transformation
  o Pageview identification
  o User identification
  o Sessionization/episode identification
• Data reduction
  o Sampling; dimensionality reduction (ignoring certain pageviews/items)
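A minimal sketch of the sessionization step above, assuming Python, requests already grouped per user (or per host/IP), timestamps as datetime objects, and the 30-minute timeout mentioned in section 2.1; in practice user identification itself is harder, for the reasons listed there.

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def sessionize(requests):
        """Split one user's time-ordered (timestamp, path) requests into sessions
        whenever the gap between consecutive requests exceeds the timeout."""
        sessions = []
        current = []
        last_time = None
        for timestamp, path in sorted(requests):
            if last_time is not None and timestamp - last_time > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(path)
            last_time = timestamp
        if current:
            sessions.append(current)
        return sessions

    # Example: three requests, the last one after a long gap -> two sessions.
    requests = [
        (datetime(1996, 2, 28, 6, 3, 24), "/index.html"),
        (datetime(1996, 2, 28, 6, 10, 0), "/products.html"),
        (datetime(1996, 2, 28, 8, 0, 0), "/index.html"),
    ]
    print(sessionize(requests))   # [['/index.html', '/products.html'], ['/index.html']]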
