42       Common Tasks - Cleaning              The HTML needs to be cleaned up                      Lots of messy data, esp...
43       Common Tasks - Charset              You get bytes back from the web server              You need a charset to con...
44       Common Tasks - Link Extraction              Needed to have a crawl - where new links come from              Means...
45       Common Tasks - Boilerplate              For unstructured and semi-structured              Can improve the quality...
46       Common Tasks - Language              Often used for filtering or alternative processing                      Targe...
47       Unstructured Extraction              Goal is extracting text, without much additional processing              Oft...
48       Semi-structured Extraction              Goal is finding structured data in random text                      Can be...
49       Structured Extraction              Precise extraction of specific types of data              Typically is to one a...
50       How to figure out XPath              Firebug is your friend                      Plug-in for Firefox              ...
51       XPath demo              Firebug              XPath tool         Copyright (c) 2011 Scale Unlimited. All Rights Re...
52       Dealing with Javascript              Required if page generates target content using JS              Forces you t...
53       Javascript challenges              10x slower than just loading the page text              Good way to make a web...
54       Q&A              What are the three general approaches for data extraction?              Why might you want to de...
55                                                                                                                        ...
56       Key Questions              Were you able to build and run the code locally?              Were you able to run a c...
57       ImageFinder Details              Find images about Ultimate Frisbee              Focused crawl                   ...
58       Running ImageFinder              Can run locally, with restrictions                      Only one fetcher thread ...
59       Running in Elastic MapReduce              Watch your job via http://strata.scaleunlimited.com:9100/              ...
60       Solr Results Issue              Note links vs. images              These are “image” URLs that are actually to pa...
61       Lab Details              Code is in strata-web-mining folder you downloaded              Details of code in strat...
62       Lab Exercises              First goal is to build code and run locally              Next is to build code and run...
63       Trouble-shooting & Timing              I’ll be walking around - raise your hand if you need help              But...
64       Q&A              Were you able to build and run the code locally?              Were you able to run a crawl in El...
65                                                                                                                        ...
66       Key Challenges              Complexity of large scale web crawling              Challenges with extracting the ri...
67       Ethical Crawling              Always have a real, valid, informative user agent name              Always honor th...
68       Avoiding getting blocked              Follow all ethical crawling guidelines              Gradually ramp up your ...
69       Resources              Hadoop - http://hadoop.apache.org              Cascading - http://www.cascading.org       ...
70       Question?              I might have answers              ken@scaleunlimited.com              @kkrugler         Co...
Strata web mining tutorial

These are the slides from my tutorial at Strata Santa Clara 2012 today.

Transcript of "Strata web mining tutorial"

  1. Scale Unlimited - Web Mining (Strata 2012)
       • Photo by: i_pinz, flickr
       • Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
       • Tuesday, February 28, 2012
  2. Welcome to Web Mining!
       • This class is a tutorial on large scale web mining
       • Topics covered
           • Overview of web mining
           • Web crawling - broad & focused
           • Text mining - extracting value
           • Hands-on lab
           • Tips and traps
  3. Meet Your Instructor
       • Ken Krugler - direct from Nevada City, California
       • Founder of TransPac Software, Krugle, Bixo Labs/Scale Unlimited
       • Developer of Bixo web mining toolkit
       • Committer on Apache Tika
       • Developer and trainer for Hadoop, Solr and Cascading
       • Actively web mining for six years
  4. Agenda
       • 9:00am - Overview
       • 9:30am - Web Crawling
       • 10:00am - Text Mining
       • 10:30am - Break
       • 11:00am - Web Mining Lab
       • 11:45am - Lab Review
       • 12:00pm - Summary
       • 12:15pm - Q&A
  5. Overview
       • Photo by: exfordy, flickr
  6. Key Questions
       • Which of the three types of web mining are we focusing on today?
       • What makes web pages “noisy”?
  15. What is web mining?
       • Extracting useful information from the World-wide Web
       • Web structure - link graph analysis
       • Web usage - server logs, for example:
           173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
           67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
           89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
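The log excerpt on this slide follows the usual Apache-style access log layout, so web usage mining typically starts by splitting each line into fields. Below is a minimal Python sketch, assuming the field order visible in the three sample lines (client IP, two placeholder fields, timestamp, request line, status, size, referrer). The names `LOG_PATTERN` and `parse_line` are illustrative and do not come from the tutorial.

```python
import re
from collections import Counter

# Assumed field layout, inferred from the three sample lines on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referrer>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


SAMPLE_LINES = [
    '173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"',
    '67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"',
    '89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"',
]

records = [parse_line(line) for line in SAMPLE_LINES]
# Count responses per HTTP status code, skipping lines that failed to parse.
status_counts = Counter(r["status"] for r in records if r is not None)
print(status_counts)
```

Real logs vary by server configuration, so the pattern would need adjusting for any other log format.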
  20. What is web mining?
       • Extracting useful information from the World-wide Web
       • Web structure - link graph analysis
       • Web usage - server logs
       • Web content - text and images
  21. Web Content Mining
       • Analyzing data from web pages
       • Typically three types of page processing
           • Unstructured - get rid of “boilerplate” text, analyze sentiment
           • Semi-structured - find names of people with phone numbers
           • Structured - find hotel name, address, phone number, reviews
       • Plus inter-document analysis
           • Clustering
  30. Crawling versus Mining
       • Web mining combines...
           • web crawling - finding & fetching content
           • data mining - extracting useful information
       • Both fields are broad and deep - for example
           • optimal crawling strategies
           • machine learning for page classification
           • automatically extracting structured data
  31. What is “Large Scale”?
       • More than what you can handle with one server
           • Many single-server solutions for mining web pages
           • Harder when you include text analytics
           • And (almost) impossible when you get to 100M+ pages
       • So you need some kind of distributed processing framework
  32. Key Aspects of Web Mining
       • Crawling - finding the “good stuff”
       • Extracting - getting the “right data”
       • Processing - turning bytes into bucks
  33. Finding the “good stuff”
       • Often feels like “needle in a haystack”
           • E.g. even 100M pages is 0.1% of total web
       • Need to optimize time + cost per useful result
           • Can’t afford to waste time on pages that aren’t useful
           • And each page has cost to data provider
  34. Getting the “right data”
       • Scale and precision are in opposition
           • One area of one site can be precision-processed
           • All areas of 50M domains means you have to be general
       • Pages are noisy
           • Ads
           • Boilerplate (navigation, etc)
           • SEO
  35. Processing the results
       • 1TB data file has very little value
           • Actually less value than a small file that can be opened & viewed
           • Has to be turned into something with value
       • Often processing is considered part of web mining
           • Reduction - turning petabytes into pie charts
           • Indexing - being able to search the data
           • Analytics - clustering, training models for recommenders
  36. Q&A
       • Which of the three types of web mining are we focusing on today?
       • What makes web pages “noisy”?
  37. Web Crawling
       • Photo by: Graham Racher, flickr
  38. Key Questions
       • What are three general types of web crawls?
       • What can make it hard to accurately score a page?
  39. What is “web crawling”?
       • Includes fetching pages, of course
       • But also has aspect of spider crawling over a web
           • Extracting outlinks to discover new pages
           • Which means parsing the fetched content
           • Managing state of the crawl
       • And all of the implicit rules
           • Robots exclusion protocol
           • User agent
           • Request rate
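The three implicit rules on this slide (robots exclusion protocol, user agent, request rate) can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch: the agent name, the robots.txt text, and the example.com URLs are made up for illustration, and a real crawler would download the site's robots.txt rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent name; a real crawler should use an informative one.
USER_AGENT = "example-crawler"

# Made-up robots.txt content, parsed inline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# Robots exclusion protocol: check each URL before fetching it.
allowed = rules.can_fetch(USER_AGENT, "http://www.example.com/index.html")
blocked = rules.can_fetch(USER_AGENT, "http://www.example.com/private/data.html")

# Request rate: honor Crawl-delay if present, else fall back to an assumed default.
delay_seconds = rules.crawl_delay(USER_AGENT) or 1.0

print(allowed, blocked, delay_seconds)
```

A fetch loop would sleep for `delay_seconds` between requests to the same host and send `USER_AGENT` in the request headers.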
  40. Types of web crawls
       • Broad
           • Few or no limits to what domains/pages to process
           • Typically what people think of - Googlebot, bingbot, Baiduspider, ...
       • Focused
           • Uses page scoring -> outlinks to guess at quality of unfetched pages
           • Often has whitelist of domains to avoid traps
       • Domain
           • For a limited number of domains
           • Typically for precise extraction of data
  41. The “don’t crawl” crawl
       • Leverage other people’s crawl data
           • Can be faster, cheaper
           • Reduces load on servers
       • Public datasets
           • Common Crawl
           • Wikipedia - use data dump!
       • Commercial providers
           • Spinner, InfoChimps
  42. Crawling Solutions
       • General rule - don’t roll your own!
           • Easy to make something simple
           • Hard to make something scalable, robust, efficient
       • Open source options
           • Java - Nutch, Heritrix, Bixo, Droids
           • Python - http://scrapy.org/
           • PHP - http://astellar.com/php-crawler/
  43. What makes it hard?
       • Web mining breaks the implicit contract with web sites
           • You often aren’t creating an index that drives traffic to them
           • So why should they let you use bandwidth & server cycles?
       • The web is a nearly infinite set of edge cases
           • Every possible problem will occur, with a broad enough crawl
       • And not everybody plays nice
           • Link farms/honeypots, malicious sites, angry webmasters
       • Plus you have to be able to work at scale
  44. Scaling Solutions
       • Needs to be reliable, scalable, fault tolerant
       • Single server can fetch lots of pages
       • But scaling is an issue with post-processing
       • Several options
           • Hadoop - Nutch, Bixo
           • Custom queuing system - Heritrix, Droids
           • Storm - scalable queuing
  45. Focused Crawling 101
       • How to maximize results while minimizing cost
           • aka Finding Good Stuff Fast
       • Only crawl pages that you think are likely to be good
       • Reduces cost through
           • Less time spent fetching worthless pages
           • Lower bandwidth/CPU/storage costs
           • Fewer angry webmasters
  46. Focused Crawl Details
       • Seed URLs - Good starting point
       • URL State - DB of all known URLs
       • Page Score - “Quality” of page
       • Link Score - Page Score/outlinks
       • Fetched Pages - Saved results
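The five pieces named on this slide fit together as a loop: fetch the best-scoring known URL, score the page, then spread that score over its outlinks. The Python sketch below shows one way that loop could look; it is not Bixo's implementation. It assumes a small in-memory `TOY_WEB` dictionary in place of real fetching and scoring, seeds that start with a link score of 1.0, and a URL that keeps the best link score seen so far. All names are hypothetical.

```python
import heapq

# Toy "web": url -> (page score, outlinks). Hypothetical data that stands in
# for fetching a page and scoring its content.
TOY_WEB = {
    "seed": (0.9, ["a", "b", "c"]),
    "a": (0.6, ["d"]),
    "b": (0.0, ["e"]),
    "c": (0.3, []),
    "d": (0.8, []),
    "e": (0.1, []),
}


def focused_crawl(seed_urls, max_pages):
    """Fetch up to max_pages URLs, always taking the highest link score next."""
    url_state = {}   # URL State: best link score seen for every known URL
    fetched = []     # Fetched Pages: (url, page score) in fetch order
    frontier = []    # heap of (-link score, url); negated to pop the best first
    done = set()

    for url in seed_urls:
        url_state[url] = 1.0  # assumption: seeds start with the top score
        heapq.heappush(frontier, (-1.0, url))

    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in done:
            continue  # stale heap entry for an already fetched URL
        done.add(url)

        page_score, outlinks = TOY_WEB[url]
        fetched.append((url, page_score))
        if not outlinks:
            continue

        # Link Score = Page Score / number of outlinks, as on the slide.
        link_score = page_score / len(outlinks)
        for target in outlinks:
            if target not in done and link_score > url_state.get(target, -1.0):
                url_state[target] = link_score
                heapq.heappush(frontier, (-link_score, target))

    return fetched


crawl_order = [url for url, _ in focused_crawl(["seed"], max_pages=4)]
print(crawl_order)
```

In this toy run, page "d" is fetched before "b" and "c" because it is the only outlink of a fairly good page, so it inherits a higher link score.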
  47. Finding Seed URLs (sources ordered from broad to narrow)
       • List of all registered domains - Complete, but big (100M+)
       • DMOZ - lots of spam/porn
       • Alexa/Quantcast “top sites” list - top 1M US sites by traffic
       • Wikipedia - use outlink dump if possible
       • Tweets - with filtering, e.g. Gnip, DataSift
       • Using search
           • Manually entering URLs - slow, but curated
           • Using API - faster, typically limited, can have junk
  48. Scoring Pages
       • Analyze text on page
           • Typically means tokenizing text
           • “The sport of ultimate is..” => “the”, “sport”, “of”, “ultimate”, “is”, ...
       • Simple term-based
           • Count occurrences of all phrases, good phrases, bad phrases
           • Calculate ratios of counts: good/all - bad/all = score
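The simple term-based score on this slide can be sketched in a few lines of Python. The sketch makes two simplifying assumptions: it counts single-word tokens rather than multi-word phrases, and it uses small example term lists (`GOOD_TERMS`, `BAD_TERMS`) chosen for an Ultimate Frisbee crawl. A real crawl would tune both lists.

```python
import re

# Assumed example term lists for an Ultimate Frisbee crawl.
GOOD_TERMS = {"ultimate", "frisbee", "disc"}
BAD_TERMS = {"golf", "timeshare"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score_page(text):
    """Return good/all - bad/all, where 'all' is the total token count."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # no text at all: treat as neutral
    good = sum(1 for token in tokens if token in GOOD_TERMS)
    bad = sum(1 for token in tokens if token in BAD_TERMS)
    return good / len(tokens) - bad / len(tokens)


print(tokenize("The sport of ultimate is.."))
print(score_page("Ultimate frisbee golf game"))
```

The score ranges from -1.0 (every token is a bad term) to 1.0 (every token is a good term).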
  49. Scoring using SVM
       • SVM = support vector machine
       • Trained using “documents” that have features, and a class
           • “good” : “ultimate”, “ultimate frisbee”, “disc”, “sport”, “throw”, “run”
           • “bad” : “golf”, “timeshare”, “aardvark”, “potato”
       • Creates a statistical model
           • Divides all training documents into separate classes
       • Used to give an unknown document a class
  50. Challenges with Scoring
       • How do you decide that a page is “good”?
           • Might be mostly graphics with few words
           • Could be a definition of the term
       • Min threshold for amount of real content
       • Detecting link farms with fake content
  51. Chrome, cruft, and boilerplate
       • Navigational links
       • Sidebar elements
       • Ads, SEO links
       • Can use Boilerpipe & other “cleaners”
  52. Expanding the crawl frontier
       • Have to parse the page to find outlinks
       • Need to normalize links
           • http://scaleunlimited.com == http://www.scaleunlimited.com
           • http://www.scaleunlimited.com == http://www.scaleunlimited.com/
       • Skipping links to low-value pages
           • Links to images, pdf files, other binary types (using suffix)
           • Links to DB-generated pages
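Both link-handling steps on this slide (normalizing links, and skipping binary files by suffix) can be sketched with Python's standard `urllib.parse`. Treating the bare domain and its `www.` form as the same site is a heuristic taken from the slide's example, not a general rule. The suffix list and the function names are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of suffixes for binary content a text-focused crawl would skip.
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")


def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""      # hostname is already lowercased
    if host.startswith("www."):
        host = host[4:]              # heuristic: treat www.X and X as one site
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"         # an empty path means the root page
    # Drop the fragment, since it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def should_skip(url):
    """Return True for links that point at binary types, judged by suffix."""
    return urlsplit(url).path.lower().endswith(SKIP_SUFFIXES)


for candidate in ("http://scaleunlimited.com",
                  "http://www.scaleunlimited.com",
                  "http://www.scaleunlimited.com/"):
    print(normalize_url(candidate))
```

All three spellings from the slide normalize to the same string, so the URL State DB would hold only one entry for them.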
  53. Focused Domain Crawl
       • Very specific, explicit crawl of one domain
       • Typically involves discovery of target content pages
       • Often uses URL patterns to synthesize links
           • Page X in site has list of products: a, b, c, d...
           • Product pages are <domain>/product/a or b or c or d...
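Synthesizing links from a URL pattern, as described on this slide, amounts to filling product identifiers into a template. Here is a minimal Python sketch, assuming the `<domain>/product/<id>` pattern from the slide and an `http` scheme. The function name and the example domain are hypothetical.

```python
from urllib.parse import quote


def synthesize_product_urls(domain, product_ids):
    """Build details-page URLs from identifiers found on a listing page."""
    # quote() escapes characters that are not safe inside a URL path segment.
    return [f"http://{domain}/product/{quote(pid, safe='')}" for pid in product_ids]


product_urls = synthesize_product_urls("www.example.com", ["a", "b", "c d"])
print(product_urls)
```

The identifiers would normally come from parsing the listing page, which keeps the crawl from having to follow every link on the site.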
  54. Discovery vs Extraction
      - Focused Domain Crawl has two distinct phases
          - Crawling to discover detail pages
          - Fetching/processing detail pages
      - Often the phases are commingled, for efficiency
      - Need to track what kind of page each URL is in the URL State DB
  55. Goby Crawl Example
      - http://www.goby.com has information on lots of attractions
      - http://www.goby.com/boston-ma has a list of categories
      - http://www.goby.com/<category>--near--<city>-<state>
      - Often need to paginate listing pages, to get all detail links
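Synthesizing listing URLs from the pattern above might look like the following sketch. The function name and the example category values are hypothetical, and lowercasing the parts is an assumption based on the `boston-ma` example. Pagination is left out because the slide does not show how the site paginates.

```python
from urllib.parse import quote

def listing_url(category, city, state, base="http://www.goby.com"):
    """Build <base>/<category>--near--<city>-<state> from its parts."""
    slug = f"{category}--near--{city}-{state}".lower()
    return f"{base}/{quote(slug)}"

# Hypothetical category names, used only to show the call.
urls = [listing_url(c, "Boston", "MA") for c in ("museums", "parks")]
```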
  56. Q&A
      - What are three general types of web crawls?
      - What can make it hard to accurately score a page?
  57. Data Extraction
      Web Mining, Strata 2012 (photo by: Graham Racher, flickr)
  58. Key Questions
      - What are the three general approaches for data extraction?
      - Why might you want to detect the language of a page?
  59. You’ve got a page, now what?
      - Time to extract the data you need
      - Three attributes of extraction, pick any two
          - Broad - across lots of domains and page formats
          - Precise - very specific types of data
          - Accurate - low error rate
      - Three general approaches
          - Unstructured (broad, accurate) - “just text”
          - Semi-structured (broad, precise) - finding meaning in text
          - Structured (precise, accurate) - getting exactly the data you need
  60. Common Tasks - Cleaning
      - The HTML needs to be cleaned up
          - Lots of messy data, especially when hand-edited
          - Even HTML (2.0? 3.2? 4.01?) should be converted to XHTML
      - Various libraries help with “cleaning” the HTML
          - TagSoup, NekoHTML, HtmlCleaner
      - Note that the end result won’t match the original text
  61. Common Tasks - Charset
      - You get bytes back from the web server
      - You need a charset to convert bytes to characters
          - HTTP response header - “Content-Type: text/html; charset=UTF-8”
          - HTML meta tag - <meta http-equiv="Content-Type" content="..." />
          - Analysis of text - byte sequence statistics
      - Several packages support this
          - Tika, ICU
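The first two charset signals can be combined in a small fallback chain, sketched below with the Python standard library. The function names are assumptions, and the third signal (byte sequence statistics) is not implemented here; that is the part packages such as Tika and ICU provide. The sketch simply falls back to a default charset instead.

```python
import re

CHARSET_RE = r'charset\s*=\s*["\']?([\w.:-]+)'

def charset_from_header(content_type):
    """Pull the charset parameter out of a Content-Type header value."""
    match = re.search(CHARSET_RE, content_type or "", re.I)
    return match.group(1) if match else None

def charset_from_meta(raw_bytes):
    """Look for a charset declared in a <meta> tag near the top of the page."""
    head = raw_bytes[:4096].decode("ascii", errors="replace")
    match = re.search(r"<meta[^>]+" + CHARSET_RE, head, re.I)
    return match.group(1) if match else None

def decode_page(raw_bytes, content_type=None, default="utf-8"):
    """Try header, then meta tag, then the default charset."""
    candidates = (charset_from_header(content_type),
                  charset_from_meta(raw_bytes),
                  default)
    for name in candidates:
        if not name:
            continue
        try:
            return raw_bytes.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong charset: try the next signal
    return raw_bytes.decode(default, errors="replace")
```

Trusting the header before the meta tag is a design choice made for this sketch; servers are often misconfigured, so a production crawler may weigh the signals differently.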
  62. Common Tasks - Link Extraction
      - Needed to have a crawl - it’s where new links come from
      - Means you need XHTML so you can parse the markup
      - Not just <a href="xxx">
          - img, frame, iframe, link, map, area
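A minimal sketch of link extraction that looks beyond `<a href>` is shown below. It uses Python's lenient `html.parser` rather than an XHTML parser, which is a substitution made for this example; the class name, the tag-to-attribute table, and the example URLs in the usage note are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the link for each tag of interest.
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href",
    "img": "src", "frame": "src", "iframe": "src",
}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (tag, absolute URL)

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/about">x</a><img src="logo.png"/>', "http://example.com/dir/page.html")` resolves both links against the page URL and keeps the tag name, so image links can be told apart from page links.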
  63. Common Tasks - Boilerplate
      - Removing it matters for unstructured and semi-structured extraction
      - Can improve the quality of results
      - Especially important for machine learning
          - Boilerplate text can dramatically skew statistics
          - Creates a noisier signal
  64. Common Tasks - Language
      - Often used for filtering or alternative processing
          - Target audience is only interested in Spanish
          - I need to tokenize Japanese differently
          - Clustering improves when it’s segmented by language
      - Multiple signals for selecting language, same as charset detection
          - HTTP response header - Content-Language: es
          - HTML meta tag - <meta http-equiv="Content-Language" content="es" />
          - HTML tag attributes - <html lang="es">
          - Analysis of text - ngram statistics, short words
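Two of these signals, the `lang` attribute and short-word statistics, can be sketched in a few lines. The word lists below are tiny and cover only English and Spanish, and the function names are assumptions; real detectors rely on n-gram models across many languages.

```python
import re

# Tiny short-word lists, for illustration only.
SHORT_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "los", "las", "un", "una", "es"},
}

def declared_language(html):
    """Read the lang attribute from the <html> tag, if present."""
    match = re.search(r'<html[^>]*\blang\s*=\s*["\']?([A-Za-z-]+)', html, re.I)
    return match.group(1).lower() if match else None

def guess_language(text):
    """Pick the language whose short words occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = {lang: sum(w in vocab for w in words)
              for lang, vocab in SHORT_WORDS.items()}
    best = max(counts, key=counts.get)
    others = [count for lang, count in counts.items() if lang != best]
    if counts[best] == 0 or counts[best] in others:
        return None  # no evidence, or a tie
    return best
```

Returning `None` on a tie or on no evidence lets the caller fall back to the declared language instead of guessing.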
  65. Unstructured Extraction
      - Goal is extracting text, without much additional processing
      - Often has a few fields, from HTML
          - Title - from <head><title>The title of my page</title></head>
          - Description - from <meta name="description" content="ultimate frisbee" />
          - Body - from <body>...all elements that contain text, like <p>...</body>
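The three fields above can be pulled out with Python's standard `html.parser`, as in the sketch below. The class and function names are assumptions, and skipping `script` and `style` content is a choice made for this example so that code does not end up in the body text.

```python
from html.parser import HTMLParser

class PageFields(HTMLParser):
    """Collects title, meta description, and visible body text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif self._in_body and data.strip():
            self.body_parts.append(data.strip())

def extract_fields(html):
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(),
            "description": parser.description,
            "body": " ".join(parser.body_parts)}
```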
  66. Semi-structured Extraction
      - Goal is finding structured data in random text
      - Can be applied broadly, since it’s not (very) format-specific
      - Accuracy suffers, because of breadth of input data formats
      - Beware the academic algorithm
      - Examples of what does work...
          - Easy patterns: telephone numbers, dates
          - Microformats: hCalendar, hCard, hReview, ...
          - Natural Language Processing (NLP): named entities
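The "easy patterns" case can be sketched with two regular expressions. Both patterns are assumptions chosen for illustration: the phone pattern covers only common US-style formats and the date pattern only ISO `YYYY-MM-DD` dates, so broader input would need more patterns.

```python
import re

# US-style phone numbers such as (530) 555-0100 or 530-555-0199.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
# ISO dates such as 2012-02-28.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_phones(text):
    return PHONE_RE.findall(text)

def find_dates(text):
    return DATE_RE.findall(text)
```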
  67. Structured Extraction
      - Precise extraction of specific types of data
      - Typically targets one area of one site
      - Often handled with XPath, and maybe regular expressions
          - //div[@id='<id of target div>']/p
      - div and span are beautiful tags
          - Commonly used with CSS
          - Which means they are (more) stable
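An XPath expression of this shape can be tried with Python's standard `xml.etree.ElementTree`, as in the sketch below. Two caveats: ElementTree supports only a subset of XPath, and it needs well-formed input, so this assumes the page has already been cleaned to XHTML. The sample markup, the `details` id, and the function name are made up for the example, and the sample omits the XHTML namespace that real documents usually declare.

```python
import xml.etree.ElementTree as ET

# Made-up, already-cleaned page fragment.
XHTML = """<html><body>
  <div id="nav"><p>Home</p></div>
  <div id="details"><p>Ultimate Frisbee disc</p><p>175 grams</p></div>
</body></html>"""

def extract_paragraphs(xhtml, div_id):
    """Return the text of each <p> directly inside the div with this id."""
    root = ET.fromstring(xhtml)
    return [p.text for p in root.findall(f".//div[@id='{div_id}']/p")]
```

Here `extract_paragraphs(XHTML, "details")` returns the two paragraph texts from the target div and ignores the navigation div.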
  68. How to figure out XPath
      - Firebug is your friend
          - Plug-in for Firefox
          - Will show you the full XPath for each element
      - Note that browsers will re-write HTML (e.g. adding a tbody element)
      - The DOM you see is often generated with JavaScript
  69. XPath demo
      - Firebug XPath tool
  70. Dealing with JavaScript
      - Required if page generates target content using JS
      - Forces you to use Firebug or equivalent to inspect the DOM
      - Options for processing include...
          - HtmlUnit
          - qt-webkit
          - headless Mozilla
  71. JavaScript challenges
      - 10x slower than just loading the page text
      - Good way to make a webmaster angry
          - Lots of extra load on server
          - Can skew website stats
      - Often has issues
          - Pages that work in FF or IE but not HtmlUnit
          - Pages that cause HtmlUnit to hang
  72. Q&A
      - What are the three general approaches for data extraction?
      - Why might you want to detect the language of a page?
  73. Web Crawling Lab
      Web Mining, Strata 2012 (photo by: Graham Racher, flickr)
  74. Key Questions
      - Were you able to build and run the code locally?
      - Were you able to run a crawl in Elastic MapReduce?
      - Were you able to improve the focused crawl?
  75. ImageFinder Details
      - Find images about Ultimate Frisbee
      - Focused crawl
          - Fixed list of seed URLs
          - Positive & negative terms used to score pages
      - Extract images from page
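Scoring a page with positive and negative terms can be as simple as the sketch below. This is not the lab's ImageFinder code: the term lists, the `min_words` threshold, and the scoring formula are assumptions chosen to show the idea.

```python
import re

# Illustrative term lists; the lab's real lists may differ.
POSITIVE_TERMS = {"ultimate", "frisbee", "disc", "throw"}
NEGATIVE_TERMS = {"golf", "timeshare"}

def score_page(text, min_words=5):
    """Return (positive hits - negative hits) / word count.

    Pages with fewer than min_words words score 0.0, a simple
    minimum threshold for the amount of real content.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return 0.0
    positive = sum(w in POSITIVE_TERMS for w in words)
    negative = sum(w in NEGATIVE_TERMS for w in words)
    return (positive - negative) / len(words)
```

Dividing by the word count keeps long pages from winning on volume alone; a crawl would then follow outlinks only from pages whose score clears some cutoff.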
  76. Running ImageFinder
      - Can run locally, with restrictions
          - Only one fetcher thread
          - Only 5 pages/loop
      - Can run in Hadoop cluster
          - Amazon Elastic MapReduce
          - CrawlRunner uploads job jar, creates “Job Flow Step”
          - Limited to 100 pages/loop, 2 loops
          - Will take up to 10 minutes to run loops
  77. Running in Elastic MapReduce
      - Watch your job via http://strata.scaleunlimited.com:9100/
          - Your job name will include your username
      - Results get added to searchable index
          - http://strata.scaleunlimited.com/solr/strata/
          - Search for student:<username> to find your results
  78. Solr Results Issue
      - Note links vs. images
          - These are “image” URLs that actually point to pages
      - Double-bonus on exercise
          - ...fix this problem :)
  79. Lab Details
      - Code is in the strata-web-mining folder you downloaded
      - Details of code in strata-web-mining/doc/README-Description
      - Instructions are in strata-web-mining/doc/README-Lab
      - Please follow the lab steps carefully
          - Missing a step will cause pain and suffering later
  80. Lab Exercises
      - First goal is to build code and run locally
      - Next is to build code and run in real cluster
      - Then you get to try to optimize the focused crawl
      - And (if you’re fast) try finding images for a different topic
  81. Trouble-shooting & Timing
      - I’ll be walking around - raise your hand if you need help
      - But with 100+ people, I’ll be talking fast :)
      - We’ve got an hour (or more) before summary/Q&A
      - Have fun...
  82. Q&A
      - Were you able to build and run the code locally?
      - Were you able to run a crawl in Elastic MapReduce?
      - Were you able to improve the focused crawl?
  83. Summary & Q&A
      Web Mining, Strata 2012 (photo by: Graham Racher, flickr)
  84. Key Challenges
      - Complexity of large scale web crawling
      - Challenges with extracting the right data
      - Extra work to turn results into value
  85. Ethical Crawling
      - Always have a real, valid, informative user agent name
      - Always honor the robots exclusion protocol - robots.txt
      - Limit your crawl rate - parallelism, crawl delay, pages/day
      - Immediately comply with blacklisting and data removal requests
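Honoring robots.txt and a crawl delay can be sketched with Python's standard `urllib.robotparser`. The user agent name, the sample robots.txt, and the function name are hypothetical. The sketch parses robots.txt text that has already been fetched, so it makes no network calls.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical; use a real, informative name

# Sample robots.txt content, made up for this example.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def check_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return (allowed, crawl_delay_seconds) for this URL and agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)
```

A fetcher would skip any URL where `allowed` is false and sleep for at least the returned delay (when it is not `None`) between requests to the same host.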
  86. Avoiding getting blocked
      - Follow all ethical crawling guidelines
      - Gradually ramp up your crawl rate
          - Gives webmasters time to complain before it’s a serious problem
      - Avoid JavaScript if at all possible
      - Don’t follow form links
      - Grovel shamelessly
  87. Resources
      - Hadoop - http://hadoop.apache.org
      - Cascading - http://www.cascading.org
      - Bixo - http://openbixo.org
      - Web Data Mining by Bing Liu
  88. Questions?
      - I might have answers
      - ken@scaleunlimited.com
      - @kkrugler
