The WWW is a huge, widely distributed, global information service centre for:
- Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
- Hyper-link information
- Access and usage information
The WWW provides rich sources of data for data mining.
The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and diverse
3. Information/data of almost all types exist on the Web
4. Much of the Web information is semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of the surface Web and the deep Web
    - Surface Web: pages that can be browsed using a browser
    - Deep Web: databases that can only be accessed through parameterized query interfaces
What is Web Data?
Web data includes:
1. Web content – text, images, records, etc.
2. Web structure – hyperlinks, tags, etc.
3. Web usage – HTTP logs, app server logs, etc.
4. Intra-page structures
5. Inter-page structures
6. Supplemental data:
   - Profiles
   - Registration information
   - Cookies
Web Mining
- Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
- Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
- Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
Web Mining
- Discovering useful information from the World-Wide Web and its usage patterns
- My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Why Mine the Web?
- Enormous wealth of information on the Web
  - Financial information (e.g. stock quotes)
  - Book/CD/video stores (e.g. Amazon)
  - Restaurant information
  - Car prices
- Lots of data on user access patterns
  - Web logs contain the sequence of URLs accessed by each user
- Possible to mine interesting nuggets of information
  - People who ski also travel frequently to Europe
  - Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?
- The Web is a huge collection of documents, plus:
  - Hyper-link information
  - Access and usage information
- The Web is very dynamic
  - New pages are constantly being generated
- Challenge: develop new Web mining algorithms, and adapt traditional data mining algorithms, to:
  - Exploit hyper-links and access patterns
  - Be incremental
Web Mining: Subtasks
- Resource finding
  - Retrieving intended documents
- Information selection/pre-processing
  - Select and pre-process specific information from the retrieved documents
- Generalization
  - Discover general patterns within and across web sites
- Analysis
  - Validation and/or interpretation of the mined patterns
Web Mining Issues
- Size
  - Grows at about 1 million pages a day
  - Google indexes 9 billion documents
- Number of web sites
  - Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
- Diverse types of data
  - Images
  - Text
  - Audio/video
  - XML
  - HTML
Web Mining Applications
- E-commerce (infrastructure)
  - Generate user profiles
  - Targeted advertising
  - Fraud detection
  - Similar image retrieval
- Information retrieval (search) on the Web
  - Automated generation of topic hierarchies
  - Web knowledge bases
  - Extraction of schema for XML documents
- Network management
  - Performance management
  - Fault management
Web Data Mining
- The use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services
- Web mining may be divided into three categories:
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
Web Content Mining
- Discovery of useful information from web contents / data / documents
- Web content data types:
  1. text
  2. images
  3. audio
  4. video
  5. metadata
  6. hyperlinks
Web Content Mining
- Examines the contents of web pages as well as the results of web searching
- Can be thought of as extending the work performed by basic search engines
- Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
- Web content mining is the process of extracting knowledge from web contents
Web Content Mining
- Basic search provides no information about the structure of the content we are searching for, and no information about the various categories of documents that are found
- We therefore need more sophisticated tools for searching or discovering Web content
Web Content Mining
- Discovering useful information from the contents of Web pages
- Web content is very rich, consisting of text, images, audio, video, etc., as well as metadata and hyperlinks
- The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured
Web Content Data Structure
- Unstructured – free text
- Semi-structured – HTML
- More structured – tables or database-generated HTML pages
- Multimedia data – receives less attention than text or hypertext
Web Content Mining
- Web content mining is related to data mining and text mining
  - It is related to data mining because many data mining techniques can be applied in Web content mining
  - It is related to text mining because much of the web content is text
  - However, Web data are mainly semi-structured and/or unstructured, while data mining deals mainly with structured data and text mining with unstructured text
Web Content Data Structure
- Web content consists of several types of data: text, images, audio, video, hyperlinks
- Unstructured – free text
- Semi-structured – HTML
- More structured – data in tables or database-generated HTML pages
- Note: much of the Web content data is unstructured text data
Semi-structured Data
- Content is, in general, semi-structured
- Example document fields:
  - Title
  - Author
  - Publication_Date
  - Length
  - Category
  - Abstract
  - Content
Web Content Mining: IR View
- Unstructured documents
  - Bag-of-words, or phrase-based, feature representation
  - Features can be boolean or frequency based
  - Features can be reduced using different feature selection techniques
  - Word stemming: combining morphological variations into one feature
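The bag-of-words representation with boolean or frequency features can be sketched as follows; the vocabulary, sample document, and function names are illustrative, not prescribed by the slides:

```python
from collections import Counter

def bag_of_words(text, vocabulary, binary=False):
    """Represent a document as a feature vector over a fixed vocabulary.

    binary=False gives term-frequency features; binary=True gives
    boolean (presence/absence) features.
    """
    counts = Counter(text.lower().split())
    if binary:
        return [1 if term in counts else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

vocab = ["web", "mining", "data"]
doc = "Web mining applies data mining to Web data"
print(bag_of_words(doc, vocab))               # -> [2, 2, 2]
print(bag_of_words(doc, vocab, binary=True))  # -> [1, 1, 1]
```

Note the representation is invariant to word order: permuting the document's words yields the same vector.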
Web Content Mining: IR View
- Semi-structured documents
  - Use richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  - Use common data mining methods (whereas unstructured documents might use more text mining methods)
Web Content Mining: DB View
- Tries to infer the structure of a Web site or transform a Web site to become a database
  - Better information management
  - Better querying on the Web
- Can be achieved by:
  - Finding the schema of Web documents
  - Building a Web warehouse
  - Building a Web knowledge base
  - Building a virtual database
Web Content Mining: DB View
- Mainly uses the Object Exchange Model (OEM)
  - Represents semi-structured data (some structure, no rigid schema) by a labeled graph
- The process typically starts with manual selection of Web sites for content mining
- Main application: building a structural summary of semi-structured data (schema extraction or discovery)
Techniques for Web Content Mining
- Classification
- Clustering
- Association
Web Content Mining: Topics
- Structured data extraction
- Unstructured text extraction
- Sentiment classification, analysis and summarization of consumer reviews
- Information integration and schema matching
- Knowledge synthesis
- Template detection and page segmentation
Structured Data Extraction
- The most widely studied research topic
- A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
- Such Web data records are important: they often present the essential information of their host pages, e.g., lists of products and services
Structured Data Extraction
- Applications: integrated and value-added services, e.g., comparative shopping, meta-search & query, etc.
Structured Data Extraction: Approaches
1. Wrapper generation
2. Wrapper induction or wrapper learning
3. Automatic approach
Structured Data Extraction: Approaches
- Wrapper generation
  - Write an extraction program for each website based on observed format patterns
  - Labor intensive & time consuming
Automatic Approach
- Structured data objects on the web are normally database records
  - Retrieved from databases & displayed in web pages with fixed templates
- Find patterns/grammars from the web pages & then use them to extract data
- e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
Wrapper Induction or Wrapper Learning
- The main technique currently
- The user first manually labels a set of training pages
- A learning system then generates rules from the training pages
- The resulting rules are then applied to extract target items from web pages
- e.g. WIEN, Stalker, BWI, WL, etc.
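As a toy illustration of the kind of rule such systems apply: landmark-based wrappers (in the spirit of Stalker) extract the text between a learned prefix and suffix. Here the rule and the HTML snippet are written by hand purely for illustration; a real system would induce the landmarks from the labelled training pages:

```python
def apply_rule(page, prefix, suffix):
    """Extract the text between a prefix landmark and a suffix landmark.

    Returns None when either landmark is missing from the page.
    """
    start = page.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = page.find(suffix, start)
    return page[start:end] if end != -1 else None

html = "<b>Price:</b> $19.99<br>"
print(apply_rule(html, "<b>Price:</b> ", "<br>"))  # -> $19.99
```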
Supervised Learning
- Supervised learning is a machine learning technique for creating a function from training data
- Documents are categorized
- The output can predict a class label for the input object (called classification)
- Techniques used include:
  - Nearest neighbor classifier
  - Feature selection
  - Decision trees
Feature Selection
- Removes terms in the training documents which are statistically uncorrelated with the class labels
- Simple heuristics:
  - Remove stop words like "a", "an", "the", etc.
  - Use empirically chosen thresholds for ignoring "too frequent" or "too rare" terms
  - Discard "too frequent" and "too rare" terms
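The stop-word and frequency-threshold heuristics can be sketched as follows; the stop-word list, threshold values, and sample documents are illustrative choices, not prescribed by the slides:

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "that", "and", "to"}

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare (df < min_df) nor too
    frequent (df > max_df_ratio * number of documents), after
    discarding stop words. df = number of documents containing the term.
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, d in df.items()
                  if t not in STOP_WORDS and min_df <= d <= max_df_ratio * n)

docs = ["the web is huge",
        "web mining extracts knowledge",
        "mining the web data",
        "a rare zebra"]
print(select_features(docs))  # -> ['mining', 'web']
```

Terms like "zebra" are dropped as too rare, and "the" is dropped as a stop word, leaving only the class-discriminating vocabulary.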
Examples of Discovered Patterns
- Association rules
  - 98% of AOL users also have E-trade accounts
- Classification
  - People with age less than 40 and salary > 40k trade on-line
- Clustering
  - Users A and B access similar URLs
- Outlier detection
  - User A spends more than twice the average amount of time surfing on the Web
User Profiles
- Important for improving customization
  - Provide users with pages and advertisements of interest
  - Example profiles: on-line trader, on-line shopper
- Generate user profiles based on their access patterns
  - Cluster users based on frequently accessed URLs
  - Use a classifier to generate a profile for each cluster
- Engage Technologies
  - Tracks web traffic to create anonymous user profiles of Web surfers
  - Has profiles for more than 35 million anonymous users
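A toy sketch of clustering users by their frequently accessed URLs, as described above; the greedy single-pass scheme, the Jaccard similarity, the threshold, and the sample access data are all illustrative choices:

```python
def jaccard(a, b):
    """Similarity between two users' sets of accessed URLs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_users(access, threshold=0.5):
    """Greedy single-pass clustering: assign each user to the first
    cluster whose representative URL set is similar enough, otherwise
    start a new cluster.
    """
    clusters = []  # list of (representative_urls, member_users)
    for user, urls in access.items():
        for rep, members in clusters:
            if jaccard(urls, rep) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((urls, [user]))
    return [members for _, members in clusters]

access = {
    "A": ["/stocks", "/quotes", "/news"],
    "B": ["/stocks", "/quotes", "/portfolio"],
    "C": ["/games", "/reviews"],
}
print(cluster_users(access))  # -> [['A', 'B'], ['C']]
```

Users A and B land in one cluster (on-line traders, say), for which a profile could then be generated.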
Internet Advertising
- Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites
- Plenty of startups doing internet advertising
  - Doubleclick, AdForce, Flycast, AdKnowledge
- Internet advertising is probably the "hottest" web mining application today
Internet Advertising
- Scheme 1: Manually associate a set of ads with each user profile
  - For each user, display an ad from the set based on the profile
- Scheme 2: Automate the association between ads and users
  - Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
  - For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
Internet Advertising
- Use collaborative filtering (e.g. Likeminds, Firefly)
- Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
  - Rij = rating of user Ui for ad Aj
- Problem: compute user Ui's rating for an unrated ad Aj
Internet Advertising
- Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
- User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
  - Consider each user Uk who has rated ad Aj
  - Compute Dik, the distance between Ui's and Uk's ratings on common ads
  - Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik
- Display to Ui the ad Aj with the highest computed rating
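The nearest-neighbour scheme above can be sketched as follows. Mean absolute difference over commonly rated ads is used here as one simple choice for the distance Dik, and the ratings data are made up for illustration:

```python
def predict_rating(ratings, target_user, ad):
    """Predict target_user's rating for `ad` as R_kj, where U_k is the
    user with the smallest distance D_ik to target_user, among users
    who have rated `ad`.

    `ratings` maps user -> {ad: rating}; D_ik is the mean absolute
    difference over commonly rated ads.
    """
    target = ratings[target_user]
    best_user, best_dist = None, float("inf")
    for user, r in ratings.items():
        if user == target_user or ad not in r:
            continue
        common = [a for a in target if a in r and a != ad]
        if not common:
            continue
        dist = sum(abs(target[a] - r[a]) for a in common) / len(common)
        if dist < best_dist:
            best_user, best_dist = user, dist
    return ratings[best_user][ad] if best_user else None

ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 1, "A3": 4},
    "U3": {"A1": 1, "A2": 5, "A3": 2},
}
print(predict_rating(ratings, "U1", "A3"))  # -> 4 (U2 is most similar to U1)
```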
Fraud Detection
- With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important
- Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
- If the buying pattern changes significantly, then signal fraud
- HNC Software uses domain knowledge and neural networks for credit card fraud detection
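A toy stand-in for the signature idea: the real HNC system uses neural networks, but a mean/standard-deviation signature over past purchase amounts illustrates the "flag significant deviation" principle (all names and thresholds are illustrative):

```python
def is_suspicious(history, new_amount, threshold=3.0):
    """Flag a purchase whose amount deviates from the user's spending
    signature (mean and standard deviation of past amounts) by more
    than `threshold` standard deviations.
    """
    n = len(history)
    mean = sum(history) / n
    std = (sum((x - mean) ** 2 for x in history) / n) ** 0.5
    if std == 0:
        return new_amount != mean
    return abs(new_amount - mean) / std > threshold

past = [20.0, 25.0, 22.0, 18.0, 24.0]
print(is_suspicious(past, 23.0))   # normal purchase -> False
print(is_suspicious(past, 900.0))  # large deviation -> True
```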
Similar Image Retrieval
- Given: a set of images
- Find:
  - All images similar to a given image
  - All pairs of similar images
- Sample applications:
  - Medical diagnosis
  - Weather prediction
  - Web search engine for images
  - E-commerce
Similar Image Retrieval
- QBIC, Virage, Photobook
  - Compute a feature signature for each image
    - QBIC uses color histograms
    - WBIIS and WALRUS use wavelets
  - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
- WALRUS decomposes an image into regions
  - A single signature is stored for each region
  - Two images are considered similar if they have enough similar region pairs
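A minimal sketch of histogram-based image signatures in the spirit of QBIC's color histograms; the bin count, the L1 distance, and the tiny hand-made "images" (lists of RGB tuples) are all illustrative choices:

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` buckets and count pixels,
    giving a signature of length 3 * bins (per-channel histograms)."""
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        for channel, value in enumerate((r, g, b)):
            hist[channel * bins + min(value * bins // 256, bins - 1)] += 1
    return hist

def l1_distance(h1, h2):
    """Smaller distance = more similar images."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

reddish = [(250, 10, 10), (240, 20, 15)]
bluish = [(10, 10, 250), (20, 5, 245)]
print(l1_distance(color_histogram(reddish), color_histogram(reddish)))  # -> 0
print(l1_distance(color_histogram(reddish), color_histogram(bluish)))   # -> 8
```

A spatial index over such signatures would then let the system retrieve the stored image closest to a query signature without a linear scan.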
Today's search engines are plagued by problems:
- The abundance problem (99% of the info is of no interest to 99% of the people)
- Limited coverage of the Web (internet sources hidden behind search interfaces)
  - The largest crawlers cover < 18% of all web pages
- Limited query interface, based on keyword-oriented search
- Limited customization to individual users
- The Web is highly dynamic
  - Lots of pages are added, removed, and updated every day
- Very high dimensionality
- Use Web directories (or topic hierarchies)
  - Provide a hierarchical classification of documents (e.g., Yahoo!)
  - Searches performed in the context of a topic are restricted to only the subset of web pages related to that topic
[Diagram: Yahoo! topic hierarchy – home page with categories such as Business, News, Recreation, Science, and subcategories such as Companies, Finance, Jobs, Sports, Travel]
- In the Clever project, hyper-links between Web pages are taken into account when categorizing them
  - Uses a Bayesian classifier
  - Exploits knowledge of the classes of the immediate neighbors of the document to be classified
  - Shows that simply taking text from neighbors and using standard document classifiers to classify the page does not work
- Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
Network Management
- Objective: to deliver content to users quickly and reliably
- Traffic management
- Fault management
[Diagram: service provider network with routers and servers]
- While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
- The result is frequent congestion at servers and on network links
  - During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  - Olympic sites during the games
  - NASA sites close to launches and landings of shuttles
- Key ideas:
  - Dynamically replicate/cache content at multiple sites within the network, closer to the user
  - Maintain multiple paths between any pair of sites
  - Route user requests to the server closest to the user or to the least loaded server
  - Use the path with the least congested network links
  - e.g. Akamai, Inktomi
[Diagram: a request routed through a service provider network, avoiding a congested server and a congested link]
- Need to mine network and Web traffic to determine:
  - What content to replicate?
  - Which servers should store replicas?
  - Which server to route a user request to?
  - What path to use to route packets?
- Network design issues:
  - Where to place servers?
  - Where to place routers?
  - Which routers should be connected by links?
- One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers
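One simple way to mine access sequences for caching/prefetching candidates is sketched below. Counting page-to-page transitions and prefetching the most frequent successor is an illustrative stand-in for the sequential pattern mining mentioned above; the session data are made up:

```python
from collections import Counter, defaultdict

def next_page_counts(sessions):
    """Count page-to-page transitions across user sessions; the most
    frequent successor of a page is a candidate for prefetching."""
    successors = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            successors[current][nxt] += 1
    return successors

def prefetch_candidate(successors, page):
    """Return the most frequent next page after `page`, or None."""
    counts = successors.get(page)
    return counts.most_common(1)[0][0] if counts else None

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/products", "/cart", "/checkout"],
]
print(prefetch_candidate(next_page_counts(sessions), "/products"))  # -> /cart
```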
- Fault management involves:
  - Quickly identifying failed/congested servers and links in the network
  - Re-routing user requests and packets to avoid congested/down servers and links
- Need to analyze alarm and traffic data to carry out root cause analysis of faults
- Bayesian classifiers can be used to predict the root cause given a set of alarms
[Chart: total sites across all domains, August 1995 – October 2007]
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server!
  - Need large farms of servers
- How to organize hardware/software to mine multi-terabyte data sets without breaking the bank?
- Structured data
- Unstructured data
- OLE DB offers some solutions!
Web Usage Mining
- Pages contain information
- Links are 'roads'
- How do people navigate the Internet?
  - Web usage mining (clickstream analysis)
- Information on navigation paths is available in log files
- Logs can be mined from a client or a server perspective
Why Analyze Website Usage?
- Knowledge about how visitors use a Website could:
  - Provide guidelines for web site reorganization; help prevent disorientation
  - Help designers place important information where visitors look for it
  - Enable pre-fetching and caching of web pages
  - Provide an adaptive Website (personalization)
- Questions which could be answered:
  - What are the differences in usage and access patterns among users?
  - What user behaviors change over time?
  - How do usage patterns change with quality of service (slow/fast)?
  - What is the distribution of network traffic over time?
Analog – Web Log File Analyser
- Gives basic statistics such as:
  - number of hits
  - average hits per time period
  - what the popular pages on your site are
  - who is visiting your site
  - what keywords users are searching for to get to you
  - what is being downloaded
- http://www.analog.cx/
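A minimal sketch of the kind of statistic such a log analyser computes: hits per page, parsed from Common Log Format lines (the sample log lines below are made up):

```python
from collections import Counter

def page_hits(log_lines):
    """Count hits per requested URL from Common Log Format lines.

    The request is the quoted field: "GET /path HTTP/1.0".
    """
    hits = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]  # text between the first pair of quotes
            url = request.split()[1]      # METHOD URL PROTOCOL
        except IndexError:
            continue                      # skip malformed lines
        hits[url] += 1
    return hits

log = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.0" 200 512',
    '10.0.0.5 - - [10/Oct/2023:13:56:02 +0000] "GET /index.html HTTP/1.0" 200 2326',
]
print(page_hits(log).most_common(1))  # -> [('/index.html', 2)]
```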
- Many methods are designed to analyze structured data
- If we can represent documents by a set of attributes, we will be able to use existing data mining methods
- How to represent a document?
  - Vector-based representation (referred to as "bag of words", as it is invariant to permutations)
  - Use statistics to add a numerical dimension to unstructured text
- A document representation aims to capture what the document is about
- One possible approach:
  - Each entry describes a document
  - Attributes describe whether or not a term appears in the document
- Another approach:
  - Each entry describes a document
  - Attributes represent the frequency with which a term appears in the document
- Stop word removal: many words are not informative and thus irrelevant for document representation
  - the, and, a, an, is, of, that, …
- Stemming: reducing words to their root form (reduces dimensionality)
  - A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword "fishing"
  - Different words that share the same word stem should be represented by the stem, e.g. "fish", instead of the actual word
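The two preprocessing steps can be sketched as follows. The suffix-stripping rule here is a crude, hypothetical stand-in for a real stemmer such as Porter's algorithm, and the stop-word list is illustrative:

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "that", "and", "to"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a
    real stemming algorithm)."""
    for suffix in ("ers", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("fish fishes fisher fishers fishing"))
# -> ['fish', 'fish', 'fish', 'fish', 'fish']
```

All five morphological variants collapse to the single stem "fish", so a query for "fishing" would now match the document.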