Web content mining
1. Web Content Mining
   Akanksha Dombe, JNEC, Aurangabad
2. The WWW
   - The WWW is a huge, widely distributed, global information service centre for:
     - Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
     - Hyper-link information
     - Access and usage information
   - The WWW provides rich sources of data for data mining
3. The Web: Opportunities & Challenges
   1. The amount of information on the Web is huge
   2. The coverage of Web information is very wide and diverse
   3. Information/data of almost all types exist on the Web
   4. Much of the Web information is semi-structured
   5. Much of the Web information is linked
   6. Much of the Web information is redundant
4. The Web: Opportunities & Challenges (contd.)
   7. The Web is noisy
   8. The Web is also about services
   9. The Web is dynamic
   10. Above all, the Web is a virtual society
   11. The Web consists of the surface Web and the deep Web:
      - Surface Web: pages that can be browsed using a browser
      - Deep Web: databases that can only be accessed through parameterized query interfaces
5. What is Web Data?
   Web data includes:
   1. Web content - text, images, records, etc.
   2. Web structure - hyperlinks, tags, etc.
   3. Web usage - HTTP logs, app server logs, etc.
   4. Intra-page structures
   5. Inter-page structures
   6. Supplemental data: profiles, registration information, cookies
6. Web Mining
   - Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
   - Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
   - Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
7. Web Mining
   - Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
   - Discovering useful information from the World-Wide Web and its usage patterns
   - My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
8. Why Mine the Web?
   - Enormous wealth of information on the Web:
     - Financial information (e.g., stock quotes)
     - Book/CD/video stores (e.g., Amazon)
     - Restaurant information
     - Car prices
   - Lots of data on user access patterns:
     - Web logs contain the sequence of URLs accessed by users
   - Possible to mine interesting nuggets of information:
     - People who ski also travel frequently to Europe
     - Tech stocks have corrections in the summer and rally from November until February
9. Why is Web Mining Different?
   - The Web is a huge collection of documents, plus:
     - Hyper-link information
     - Access and usage information
   - The Web is very dynamic: new pages are constantly being generated
   - Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to:
     - Exploit hyper-links and access patterns
     - Be incremental
10. Web Mining: Subtasks
   - Resource finding: retrieving intended documents
   - Information selection/pre-processing: select and pre-process specific information from the selected documents
   - Generalization: discover general patterns within and across web sites
   - Analysis: validation and/or interpretation of the mined patterns
11. Web Mining Issues
   - Size:
     - Grows at about 1 million pages a day
     - Google indexes 9 billion documents
     - The Netcraft survey reports 72 million web sites (http://news.netcraft.com/archives/web_server_survey.html)
   - Diverse types of data: images, text, audio/video, XML, HTML
12. Web Mining Applications
   - E-commerce (infrastructure):
     - Generating user profiles
     - Targeted advertising
     - Fraud detection
     - Similar image retrieval
   - Information retrieval (search) on the Web:
     - Automated generation of topic hierarchies
     - Web knowledge bases
     - Extraction of schema for XML documents
   - Network management:
     - Performance management
     - Fault management
13. Web Mining Taxonomy (diagram)
14. Web Data Mining
   - The use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services
   - Web mining may be divided into three categories:
     1. Web content mining
     2. Web structure mining
     3. Web usage mining
15. What is "Web Content Mining"?
16. Web Content Mining
   - Discovery of useful information from web contents/data/documents
   - Web data contents: text, image, audio, video, metadata, and hyperlinks
17. Web Content Mining
   - Examines the contents of web pages as well as the results of web searching
   - Can be thought of as extending the work performed by basic search engines
   - Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to users
   - Web content mining is the process of extracting knowledge from web contents
18. Web Content Mining
   - Basic search provides no information about the structure of the content we are searching for, and no information about the various categories of documents that are found
   - We therefore need more sophisticated tools for searching or discovering Web content
19. Web Content Mining
   - Discovering useful information from the contents of Web pages
   - Web content is very rich, consisting of text, image, audio, video, etc., as well as metadata and hyperlinks
   - The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured
20. Web Content Data Structure
   - Unstructured: free text
   - Semi-structured: HTML
   - More structured: tables or database-generated HTML pages
   - Multimedia data: receives less attention than text or hypertext
21. Web Content Mining
   - Web content mining is related to both data mining and text mining:
     - It is related to data mining because many data mining techniques can be applied in web content mining
     - It is related to text mining because much of the web content is text
   - Web data are mainly semi-structured and/or unstructured, while data mining deals mainly with structured data and text mining with unstructured text
22. Web Content Data Structure
   - Web content consists of several types of data: text, image, audio, video, hyperlinks
   - Unstructured: free text
   - Semi-structured: HTML
   - More structured: data in tables or database-generated HTML pages
   - Note: much of the Web content data is unstructured text data
23. Semi-structured Data
   - Content is, in general, semi-structured
   - Example document fields: Title, Author, Publication_Date, Length, Category, Abstract, Content
24. Web Content Mining: IR View
   - Unstructured documents:
     - Bag-of-words or phrase-based feature representation
     - Features can be boolean or frequency based
     - Features can be reduced using different feature selection techniques
     - Word stemming: combining morphological variations into one feature
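The boolean bag-of-words representation above can be sketched in a few lines. This is a minimal illustration; the sample documents and the vocabulary-building step are invented, not taken from the slides.

```python
# Boolean bag-of-words: each document becomes a membership vector
# over the vocabulary (1 if the term occurs, 0 otherwise).
docs = ["web mining extracts knowledge",
        "data mining techniques find knowledge"]

# Build the vocabulary from all documents
vocab = sorted({w for d in docs for w in d.split()})

def bool_features(doc, vocab):
    # boolean features are invariant to word order and repetition
    words = set(doc.split())
    return [1 if t in words else 0 for t in vocab]

vectors = [bool_features(d, vocab) for d in docs]
```

Frequency-based variants replace the 0/1 test with a term count; feature selection then prunes columns of this matrix.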
25. Web Content Mining: IR View
   - Semi-structured documents:
     - Use richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
     - Use common data mining methods (whereas unstructured documents might call for more text mining methods)
26. Web Content Mining: DB View
   - Tries to infer the structure of a web site or transform a web site into a database, for:
     - Better information management
     - Better querying on the Web
   - Can be achieved by:
     - Finding the schema of web documents
     - Building a web warehouse
     - Building a web knowledge base
     - Building a virtual database
27. Web Content Mining: DB View
   - Mainly uses the Object Exchange Model (OEM), which represents semi-structured data (some structure, no rigid schema) by a labeled graph
   - The process typically starts with manual selection of web sites for content mining
   - Main application: building a structural summary of semi-structured data (schema extraction or discovery)
28. Techniques for Web Content Mining
   - Classification
   - Clustering
   - Association
29. Web Content Mining: Topics
   - Structured data extraction
   - Unstructured text extraction
   - Sentiment classification, analysis and summarization of consumer reviews
   - Information integration and schema matching
   - Knowledge synthesis
   - Template detection and page segmentation
30. Structured Data Extraction
   - The most widely studied research topic
   - A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
   - Such web data records are important because they often present the essential information of their host pages, e.g., lists of products and services
31. Structured Data Extraction
   - Applications: integrated and value-added services, e.g., comparative shopping, meta-search & query, etc.
32. Structured Data Extraction: Approaches
   1. Wrapper generation
   2. Wrapper induction or wrapper learning
   3. Automatic approach
33. Structured Data Extraction: Approaches
   - Wrapper generation:
     - Write an extraction program for each website based on observed format patterns
     - Labor intensive and time consuming
34.-36. (figures; source: CS511, Bing Liu, UIC)
37. Structured Data Extraction: Approaches
   - Automatic approach:
     - Structured data objects on the web are normally database records, retrieved from databases and displayed in web pages with fixed templates
     - Find patterns/grammars in the web pages and then use them to extract data
     - Examples: IEPAD, MDR, ROADRUNNER, EXALG, etc.
38. Structured Data Extraction: Approaches
   - Wrapper induction or wrapper learning:
     - The main technique currently
     - The user first manually labels a set of training pages
     - A learning system then generates rules from the training pages
     - The resulting rules are then applied to extract target items from web pages
     - Examples: WIEN, Stalker, BWI, WL, etc.
39. Supervised Learning
   - Supervised learning is a machine learning technique for creating a function from training data
   - Documents are categorized; the output can predict a class label of the input object (called classification)
   - Techniques used:
     - Nearest neighbor classifier
     - Feature selection
     - Decision tree
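A nearest-neighbor classifier of the kind listed above can be sketched as follows, here using Jaccard similarity over word sets; the training documents and labels are invented for illustration.

```python
# 1-nearest-neighbor text classifier: a new document receives the
# label of the most similar training document.
def jaccard(a, b):
    # similarity of two documents as sets of words
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

train = [("cheap stocks trade online now", "finance"),
         ("football match score goals", "sports")]

def classify(doc):
    # pick the training document with the highest similarity
    return max(train, key=lambda td: jaccard(doc, td[0]))[1]
```

In practice the comparison would run over the feature vectors of the previous slide rather than raw strings, but the idea is the same.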
40. Feature Selection
   - Removes terms in the training documents which are statistically uncorrelated with the class labels
   - Simple heuristics:
     - Remove stop words like "a", "an", "the", etc.
     - Discard "too frequent" and "too rare" terms, using empirically chosen thresholds
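The heuristics above (stop-word removal plus document-frequency thresholds) can be sketched like this; the stop-word list and the thresholds `min_df` and `max_df_ratio` are illustrative choices, not values from the slides.

```python
# Feature selection by document frequency: drop stop words plus terms
# that are "too frequent" or "too rare" across the corpus.
from collections import Counter

stop_words = {"a", "an", "the", "is", "of"}

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    df = Counter()
    for d in docs:
        for t in set(d.split()):          # document frequency, not raw count
            df[t] += 1
    n = len(docs)
    return {t for t, c in df.items()
            if t not in stop_words and c >= min_df and c / n <= max_df_ratio}

docs = ["the web mining", "the data mining", "the rare term"]
selected = select_terms(docs)
```

Here "the" is dropped as a stop word (and as too frequent), while terms appearing in only one document are dropped as too rare.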
41. Examples of Discovered Patterns
   - Association rules: 98% of AOL users also have E-trade accounts
   - Classification: people with age less than 40 and salary > 40k trade on-line
   - Clustering: users A and B access similar URLs
   - Outlier detection: user A spends more than twice the average amount of time surfing on the Web
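The association-rule example above ("98% of AOL users also have E-trade accounts") is a statement about rule confidence. A minimal sketch, with fabricated transactions:

```python
# Confidence of an association rule X -> Y: among transactions
# containing X, the fraction that also contain Y.
def confidence(transactions, x, y):
    with_x = [t for t in transactions if x in t]
    with_both = [t for t in with_x if y in t]
    return len(with_both) / len(with_x)

transactions = [{"aol", "etrade"}, {"aol", "etrade"},
                {"aol"}, {"yahoo", "etrade"}]
```

A full association-rule miner (e.g. Apriori) would also filter rules by support before computing confidence.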
42. User Profiles
   - Important for improving customization: provide users with pages and advertisements of interest
   - Example profiles: on-line trader, on-line shopper
   - Generate user profiles based on their access patterns:
     - Cluster users based on frequently accessed URLs
     - Use a classifier to generate a profile for each cluster
   - Engage Technologies:
     - Tracks web traffic to create anonymous user profiles of web surfers
     - Has profiles for more than 35 million anonymous users
43. Internet Advertising
   - Ads are a major source of revenue for web portals (e.g., Yahoo, Lycos) and e-commerce sites
   - Plenty of startups doing internet advertising: Doubleclick, AdForce, Flycast, AdKnowledge
   - Internet advertising is probably the "hottest" web mining application today
44. Internet Advertising
   - Scheme 1: manually associate a set of ads with each user profile
     - For each user, display an ad from the set based on the profile
   - Scheme 2: automate the association between ads and users
     - Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
     - For each cluster, find the ads that occur most frequently in the cluster; these become the ads for that set of users
45. Internet Advertising
   - Use collaborative filtering (e.g., Likeminds, Firefly)
   - Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
   - Rij = rating of user Ui for ad Aj
   - Problem: compute user Ui's rating for an unrated ad Aj
46. Internet Advertising
   - Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui's
   - User Ui's rating for an ad Aj that has not previously been displayed to Ui is computed as follows:
     - Consider each user Uk who has rated ad Aj
     - Compute Dik, the distance between Ui's and Uk's ratings on common ads
     - Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik
   - Display to Ui the ad Aj with the highest computed rating
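The nearest-neighbor rating scheme above can be sketched as follows, assuming Euclidean distance over commonly rated ads; the users and ratings are invented.

```python
# Nearest-neighbour collaborative filtering: Ui's rating for an unseen
# ad Aj is copied from the user Uk closest to Ui on commonly rated ads.
def distance(r1, r2):
    common = set(r1) & set(r2)
    return sum((r1[a] - r2[a]) ** 2 for a in common) ** 0.5

def predict(ratings, ui, aj):
    # candidate users who have rated aj, excluding ui itself
    cands = [u for u in ratings if u != ui and aj in ratings[u]]
    uk = min(cands, key=lambda u: distance(ratings[ui], ratings[u]))
    return ratings[uk][aj]

ratings = {"U1": {"A1": 5, "A2": 1},
           "U2": {"A1": 5, "A2": 1, "A3": 4},
           "U3": {"A1": 1, "A2": 5, "A3": 0}}
```

U1 agrees with U2 on the commonly rated ads A1 and A2, so U1 inherits U2's rating for the unseen ad A3.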
47. Fraud Detection
   - With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web are becoming important
   - Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
   - If the buying pattern changes significantly, then signal fraud
   - HNC Software uses domain knowledge and neural networks for credit card fraud detection
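The signature idea above can be sketched as a simple deviation test on spending amounts; the 3-standard-deviation threshold is an illustrative assumption, not HNC's actual method.

```python
# Per-user spending signature with a deviation test: flag a transaction
# if it strays too far from the user's historical mean.
import statistics

def is_suspicious(history, amount, n_sigma=3.0):
    mean = statistics.mean(history)
    sd = statistics.pstdev(history) or 1.0   # avoid division by zero
    return abs(amount - mean) / sd > n_sigma

history = [20, 22, 18, 21, 19]   # invented past purchase amounts
```

A real system would keep a multi-dimensional signature (amounts, categories, timing) rather than a single spending series.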
48. Similar Image Retrieval
   - Given: a set of images
   - Find:
     - All images similar to a given image
     - All pairs of similar images
   - Sample applications: medical diagnosis, weather prediction, web search engines for images, e-commerce
49. Similar Image Retrieval
   - Systems: QBIC, Virage, Photobook
   - Compute a feature signature for each image:
     - QBIC uses color histograms
     - WBIIS and WALRUS use wavelets
   - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
   - WALRUS decomposes an image into regions:
     - A single signature is stored for each region
     - Two images are considered similar if they have enough similar region pairs
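A color-histogram signature in the spirit of QBIC can be sketched as follows; for simplicity the pixels here are single grey values and the bin count is arbitrary, whereas real systems histogram RGB color.

```python
# Histogram signatures: bucket pixel values into a small, normalized
# histogram and compare signatures by L1 distance.
def histogram(pixels, bins=4, max_val=256):
    h = [0] * bins
    for p in pixels:
        h[p * bins // max_val] += 1
    # normalize so images of different sizes are comparable
    return [c / len(pixels) for c in h]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

# invented 4-pixel "images": a and b are dark/bright in the same way
ha = histogram([0, 10, 200, 220])
hb = histogram([5, 12, 210, 230])
hc = histogram([100, 110, 120, 130])
```

The spatial index mentioned above would then retrieve the stored signature minimizing this distance to the query's signature.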
50. (figure: query image and retrieved similar images)
51. Today's search engines are plagued by problems:
   - The abundance problem (99% of the information is of no interest to 99% of the people)
   - Limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover less than 18% of all web pages
   - Limited query interface, based on keyword-oriented search
   - Limited customization to individual users
52. Today's search engines are plagued by problems (contd.):
   - The Web is highly dynamic: lots of pages are added, removed, and updated every day
   - Very high dimensionality
53. Use web directories (or topic hierarchies):
   - Provide a hierarchical classification of documents (e.g., Yahoo!)
   - Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
   - (figure: Yahoo home page with top-level categories such as Business, News, Recreation, Science, Sports, Travel, and subcategories such as Companies, Finance, Jobs)
54. In the Clever project, hyper-links between web pages are taken into account when categorizing them:
   - Uses a Bayesian classifier
   - Exploits knowledge of the classes of the immediate neighbors of the document to be classified
   - Shows that simply taking text from the neighbors and using standard document classifiers to classify the page does not work
   - Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
55. Network Management
   - Objective: to deliver content to users quickly and reliably
     - Traffic management
     - Fault management
   - (figure: service provider network with routers and servers)
56. While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
   - The result is frequent congestion at servers and on network links:
     - During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
     - Olympic sites during the games
     - NASA sites close to launch and landing of shuttles
57. Key ideas:
   - Dynamically replicate/cache content at multiple sites within the network, closer to the user
   - Maintain multiple paths between any pair of sites
   - Route user requests to the server closest to the user, or to the least loaded server
   - Use the path with the least congested network links
   - Examples: Akamai, Inktomi
58. (figure: service provider network showing a request routed around a congested server and a congested link)
59. Need to mine network and Web traffic to determine:
   - What content to replicate?
   - Which servers should store replicas?
   - To which server should a user request be routed?
   - Which path should be used to route packets?
   - Network design issues:
     - Where to place servers?
     - Where to place routers?
     - Which routers should be connected by links?
   - One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at a server
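The cache/prefetch idea above can be sketched with a toy sequential-pattern table: from logged sessions, record which page most often follows each page, and prefetch that successor. The session log is invented for illustration.

```python
# Build a prefetch table from past request sessions: for each page,
# remember its most frequent immediate successor.
from collections import Counter, defaultdict

def build_prefetch_table(sessions):
    follows = defaultdict(Counter)
    for s in sessions:
        for a, b in zip(s, s[1:]):        # consecutive page pairs
            follows[a][b] += 1
    # for each page, keep the most frequent successor
    return {a: c.most_common(1)[0][0] for a, c in follows.items()}

sessions = [["home", "products", "cart"],
            ["home", "products", "specs"],
            ["home", "products", "cart"]]
table = build_prefetch_table(sessions)
```

A real sequential-pattern miner would consider longer subsequences and support thresholds, but the cache decision has the same shape: fetch what the pattern predicts next.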
60. Fault management involves:
   - Quickly identifying failed/congested servers and links in the network
   - Re-routing user requests and packets to avoid congested/down servers and links
   - The need to analyze alarm and traffic data to carry out root cause analysis of faults
   - Bayesian classifiers can be used to predict the root cause given a set of alarms
61. (chart: Total Sites Across All Domains, August 1995 - October 2007)
62. Web data sets can be very large: tens to hundreds of terabytes
   - Cannot mine them on a single server; large farms of servers are needed
   - How to organize hardware/software to mine multi-terabyte data sets without breaking the bank?
63. Structured and unstructured data: OLE DB offers some solutions!
64. Web Usage Mining
   - Pages contain information; links are "roads"
   - How do people navigate the Internet? -> Web usage mining (clickstream analysis)
   - Information on navigation paths is available in log files
   - Logs can be mined from a client or a server perspective
65. Why analyze website usage?
   - Knowledge about how visitors use a website could:
     - Provide guidelines for web site reorganization and help prevent disorientation
     - Help designers place important information where visitors look for it
     - Support pre-fetching and caching of web pages
     - Provide an adaptive website (personalization)
   - Questions which could be answered:
     - What are the differences in usage and access patterns among users?
     - Which user behaviors change over time?
     - How do usage patterns change with quality of service (slow/fast)?
     - What is the distribution of network traffic over time?
66. Analog - a web log file analyser (http://www.analog.cx/)
   - Gives basic statistics such as:
     - Number of hits
     - Average hits per time period
     - The popular pages on your site
     - Who is visiting your site
     - What keywords users are searching for to reach you
     - What is being downloaded
67. Semi-structured Data
   - Content is, in general, semi-structured
   - Example document fields: Title, Author, Publication_Date, Length, Category, Abstract, Content
68. Many methods are designed to analyze structured data
   - If we can represent documents by a set of attributes, we will be able to use existing data mining methods
   - How to represent a document?
     - Vector-based representation (referred to as "bag of words", as it is invariant to permutations)
     - Use statistics to add a numerical dimension to unstructured text
69. A document representation aims to capture what the document is about
   - One possible approach:
     - Each entry describes a document
     - Attributes describe whether or not a term appears in the document
70. Another approach:
   - Each entry describes a document
   - Attributes represent the frequency with which a term appears in the document
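The frequency-based representation above (a document-term matrix of raw counts) can be sketched as follows; the sample documents are made up.

```python
# Document-term matrix: one row per document, one column per term,
# cells hold the number of times the term occurs in the document.
from collections import Counter

docs = ["fish and more fish", "fishers fish daily"]

vocab = sorted({w for d in docs for w in d.split()})

def tf_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

matrix = [tf_vector(d, vocab) for d in docs]
```

Unlike the boolean version, repeated occurrences now carry weight, which schemes such as TF-IDF later refine.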
71. Stop word removal and stemming
   - Stop word removal: many words are not informative and thus irrelevant for document representation ("the", "and", "a", "an", "is", "of", "that", ...)
   - Stemming: reducing words to their root form (reduces dimensionality)
     - A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword "fishing"
     - Different words that share the same word stem should be represented by the stem, e.g. "fish", instead of the actual word
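Both steps above can be sketched together; note that the suffix-stripping stemmer here is a toy, not the Porter algorithm, and the stop-word list is an illustrative subset.

```python
# Stop-word removal plus a crude suffix-stripping stemmer, enough to
# map fish/fishes/fisher/fishers/fishing to the stem "fish".
STOP = {"a", "an", "the", "is", "of", "and", "that"}

def stem(word):
    # try the longest suffixes first; keep a stem of at least 3 letters
    for suffix in ("ers", "er", "es", "s", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

def preprocess(text):
    return [stem(w) for w in text.lower().split() if w not in STOP]
```

After this step, a query for "fishing" and a document mentioning "fishers" meet at the shared stem "fish".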