MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
Strata web mining tutorial
1. 1
Scale Unlimited
Web Mining
photo by: i_pinz, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
2. 2
Welcome to Web Mining!
This class is a tutorial on large scale web mining
Topics covered
Overview of web mining
Web crawling - broad & focused
Text mining - extracting value
Hands-on lab
Tips and traps
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
3. 3
Meet Your Instructor
Ken Krugler - direct from Nevada City, California
Founder of TransPac Software, Krugle, Bixo Labs/Scale Unlimited
Developer of Bixo web mining toolkit
Committer on Apache Tika
Developer and trainer for Hadoop, Solr and Cascading
Actively web mining for six years
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
4. 4
Agenda
9:00am - Overview 11:00am - Web Mining Lab
9:30am - Web Crawling 11:45am - Lab Review
10:00am - Text Mining 12:00pm - Summary
10:30am - Break 12:15pm - Q&A
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
5. 5
Overview
Web Mining
photo by: exfordy, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
6. 6
Key Questions
Which of the three types of web mining are we focusing on today?
What makes web pages “noisy”?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
7. 7
What is web mining?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
8. 7
What is web mining?
Extracting useful information from the World-wide Web
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
9. 8
What is web mining?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
10. 8
What is web mining?
Extracting useful information from the World-wide Web
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
11. 8
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
12. 9
What is web mining?
173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
13. 9
What is web mining?
Extracting useful information from the World-wide Web
173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
14. 9
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
15. 9
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
Web usage - server logs
173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
16. 10
What is web mining?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
17. 10
What is web mining?
Extracting useful information from the World-wide Web
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
18. 10
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
19. 10
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
Web usage - server logs
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
20. 10
What is web mining?
Extracting useful information from the World-wide Web
Web structure - link graph analysis
Web usage - server logs
Web content - text and images
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
21. 11
Web Content Mining
Analyzing data from web pages
Typically three types of page processing
Unstructured - get rid of “boilerplate” text, analyze sentiment
Semi-structured - find names of people with phone numbers
Structured - find hotel name, address, phone number, reviews
Plus inter-document analysis
Clustering
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
22. 12
Crawling versus Mining
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
23. 12
Crawling versus Mining
Web mining combines...
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
24. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
25. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
26. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
27. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Both fields are broad and deep - for example
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
28. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Both fields are broad and deep - for example
optimal crawling strategies
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
29. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Both fields are broad and deep - for example
optimal crawling strategies
machine learning for page classification
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
30. 12
Crawling versus Mining
Web mining combines...
web crawling - finding & fetching content
data mining - extracting useful information
Both fields are broad and deep - for example
optimal crawling strategies
machine learning for page classification
automatically extracting structured data
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
31. 13
What is “Large Scale”?
More than what you can handle with one server
Many single-server solutions for mining web pages
Harder when you include text analytics
And (almost) impossible when you get to 100M+ pages
So you need some kind of distributed processing framework
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
32. 14
Key Aspects of Web Mining
Crawling - finding the “good stuff”
Extracting - getting the “right data”
Processing - turning bytes into bucks
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
33. 15
Finding the “good stuff”
Often feels like “needle in a haystack”
E.g. even 100M pages is 0.1% of total web
Need to optimize time + cost per useful result
Can’t afford to waste time on pages that aren’t useful
And each page has cost to data provider
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
34. 16
Getting the “right data”
Scale and precision are in opposition
One area of one site can be precision-processed
All areas of 50M domains means you have to be general
Pages are noisy
Ads
Boilerplate (navigation, etc)
SEO
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
35. 17
Processing the results
1TB data file has very little value
Actually less value that a small file that can be opened & viewed
Has to be turned into something with value
Often processing is considered part of web mining
Reduction - turning petabytes into pie charts
Indexing - being able to search the data
Analytics - clustering, training models for recommenders
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
36. 18
Q&A
Which of the three types of web mining are we focusing on today?
What makes web pages “noisy”?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
37. 19
Web Crawling
Web Mining
photo by: Graham Racher, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
38. 20
Key Questions
What are three general types of web crawls?
What can make it hard to accurately score a page?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
39. 21
What is “web crawling”?
Includes fetching pages, of course
But also has aspect of spider crawling over a web
Extracting outlinks to discover new pages
Which means parsing the fetched content
Managing state of the crawl
And all of the implicit rules
Robots exclusion protocol
User agent
Request rate
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
40. 22
Types of web crawls
Broad
Few or no limits to what domains/pages to process
Typically what people think of - Googlebot, bingbot, Baiduspider, ...
Focused
Uses page scoring -> outlinks to guess at quality of unfetched pages
Often has whitelist of domains to avoid traps
Domain
For a limited number of domains
Typically for precise extraction of data
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
41. 23
The “don’t craw” crawl
Leverage other people’s crawl data
Can be faster, cheaper
Reduces load on servers
Public datasets
Common crawl
Wikipedia - use data dump!
Commercial providers
Spinner, InfoChimps
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
42. 24
Crawling Solutions
General rule - don’t roll your own!
Easy to make something simple
Hard to make something scalable, robust, efficient
Open source options
Java - Nutch, Heritrix, Bixo, Droids
Python - http://scrapy.org/
PHP - http://astellar.com/php-crawler/
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
43. 25
What makes it hard?
Web mining breaks the implicit contract with web sites
You often aren’t creating an index that drives traffic to them
So why should they let you use bandwidth & server cycles?
The web is a nearly infinite set of edge cases
Every possible problem will occur, with a broad enough crawl
And not everybody plays nice
Link farms/honeypots, malicious sites, angry webmasters
Plus you have to be able to work at scale
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
44. 26
Scaling Solutions
Needs to be reliable, scalable, fault tolerant
Single server can fetch lots of pages
But scaling is issue with post-processing
Several options
Hadoop - Nutch, Bixo
Custom queuing system - Heritrix, Droids
Storm - scalable queuing
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
45. 27
Focused Crawling 101
How to maximize results while minimizing cost
aka Finding Good Stuff Fast
Only crawl pages that you think are likely to be good
Reduces cost through
Less time spent fetching worthless pages
Lower bandwidth/CPU/storage costs
Fewer angry webmasters
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
46. 28
Focused Crawl Details
Seed URLs - Good starting point
URL State - DB of all known URLs
Page Score - “Quality” of page
Link Score - Page Score/outlinks
Fetched Pages - Saved results
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
47. 29
Finding Seed URLs
List of all registered domains - Complete, but big (100M+) Broad
DMOZ - lots of spam/porn
Alexa/Quantcast “top sites” list - top 1M US sites by traffic
Wikipedia - use outlink dump if possible
Tweets - with filtering, e.g. Gnip, DataSift
Using search Narrow
Manually entering URLs - slow, but curated
Using API - faster, typically limited, can have junk
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
48. 30
Scoring Pages
Analyze text on page
Typically means tokenizing text
“The sport of ultimate is..” => “the”, “sport”, “of”, “ultimate”, “is”, ...
Simple term-based
Count occurrences of all phrases, good phrase, bad phrases
Calculate ratios of counts: good/all - bad/all = score
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
49. 31
Scoring using SVM
SVM = support vector machine
Trained using “documents” that have features, and a class
“good” : “ultimate”, “ultimate frisbee”, “disc”, “sport”, “throw”, “run”
“bad” : “golf”, “timeshare”, “aardvark”, “potato”
Creates a statistical model
Divides all training documents into separate classes
Used to give an unknown document a class
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
50. 32
Challenges with Scoring
How do you decide that a page is “good”?
Might be mostly graphics with few words
Could be a definition of the term
Min threshold for amount of real content
Detecting link farms with fake content
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
51. 33
Chrome, cruft, and boilerplate
Navigational links
Sidebar elements
Ads, SEO links
Can use Boilerpipe & other “cleaners”
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
52. 34
Expanding the crawl frontier
Have to parse the page to find outlinks
Need to normalize links
http://scaleunlimited.com == http://www.scaleunlimited.com
http://www.scaleunlimited.com == http://www.scaleunlimited.com/
Skipping links to low-value pages
Links to images, pdf files, other binary types (using suffix)
Links to DB-generated pages
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
53. 35
Focused Domain Crawl
Very specific, explicit crawl of one domain
Typically involves discovery of target content pages
Often uses URL patterns to synthesize links
Page X in site has list of product; a, b, c, d...
Product pages are <domain>/product/a or b or c or d...
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
54. 36
Discovery vs Extraction
Focused Domain Crawl has two distinct phases
Crawling to discover details pages
Fetching/processing details pages
Often phases are co-mingled, for efficiency
Need to track what kind of page in the URL State DB
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
55. 37
Goby Crawl Example
http://www.goby.com has information on lots of attractions
http://www.goby.com/boston-ma has list of categories
http://www.goby.com/<category>--near--<city>-<state>
Often need to paginate listing pages, to get all details links
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
56. 38
Q&A
What are three general types of web crawls?
What can make it hard to accurately score a page?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
57. 39
Data Extraction
Web Mining
photo by: Graham Racher, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
58. 40
Key Questions
What are the three general approaches for data extraction?
Why might you want to detect the language of a page?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
59. 41
You’ve got a page, now what?
Time to extract the data you need
Three attributes of extraction, pick any two
Broad - across lots of domains and page formats
Precise - very specific types of data
Accurate - low error rate
Three general approaches
Unstructured (broad, accurate) - “just text”
Semi-structured (broad, precise) - finding meaning in text
Structured (precise, accurate) - getting exactly the data you need
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
60. 42
Common Tasks - Cleaning
The HTML needs to be cleaned up
Lots of messy data, especially when hand-edited
Even HTML (2.0? 3.2? 4.0.1?) should be converted to XHTML
Various libraries help with “cleaning” the HTML
TagSoup, NekoHTML, HtmlCleaner
Note that end result won’t match original text
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
61. 43
Common Tasks - Charset
You get bytes back from the web server
You need a charset to convert bytes to characters
HTTP response header - “Content-Type: text/html; charset=UTF-8”
HTML meta tag - <meta http-equiv=”Content-Type” content=“...” />
Analysis of text - byte sequence statistics
Several packages support this
Tika, ICU
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
62. 44
Common Tasks - Link Extraction
Needed to have a crawl - where new links come from
Means you need XHTML so you can parse the markup
Not just <a href=“xxx”>
img, frame, iframe, link, map, area
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
63. 45
Common Tasks - Boilerplate
For unstructured and semi-structured
Can improve the quality of results
Especially important for machine learning
Boilerplate text can dramatically skew statistics
Creates a noisier signal
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
64. 46
Common Tasks - Language
Often used for filtering or alternative processing
Target audience is only interested in Spanish
I need to tokenize Japanese differently
Clustering improves when it’s segmented by language
Multiple signals for selecting language, same as charset detection
HTTP response header: Content-Language: es
HTML meta tag - <meta http-equiv=”Content-Language” content=“es” />
HTML tag attributes - <html lang=“es”>
Analysis of text - ngram statistics, short words
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
65. 47
Unstructured Extraction
Goal is extracting text, without much additional processing
Often has a few fields, from HTML
Title - from <head><title>The title of my page</title></head>
Description - from <meta name=“description” content=“ultimate frisbee” />
Body - from <body>...all elements that contain text, like <p>...</body>
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
66. 48
Semi-structured Extraction
Goal is finding structured data in random text
Can be applied broadly, since it’s not (very) format-specific
Accuracy suffers, because of breadth of input data formats
Beware the academic algorithm
Examples of what does work...
Easy patterns: telephone numbers, dates
Microformats: hCalendar, hCard, hReview, ...
Natural Language Processing (NLP): named entities
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
67. 49
Structured Extraction
Precise extraction of specific types of data
Typically is to one area of one site
Often handled with XPath, and maybe regular expressions
//div[@id=‘<id of target div>’]/p
div and span are beautiful tags
Commonly used with CSS
Which means they are (more) stable
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
68. 50
How to figure out XPath
Firebug is your friend
Plug-in for Firefox
Will show you the full XPath for each element
Note that browsers will re-write HTML (e.g. tbody element)
The DOM you see is often generated with Javascript
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
69. 51
XPath demo
Firebug
XPath tool
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
70. 52
Dealing with Javascript
Required if page generates target content using JS
Forces you to use Firebug or equivalent to inspect the DOM
Options for processing include...
HtmlUnit
qt-webkit
headless Mozilla
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
71. 53
Javascript challenges
10x slower than just loading the page text
Good way to make a webmaster angry
Lots of extra load on server
Can skew website stats
Often has issues
Pages that work in FF or IE but not HtmlUnit
Pages that cause HtmlUnit to hang
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
72. 54
Q&A
What are the three general approaches for data extraction?
Why might you want to detect the language of a page?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
73. 55
Web Crawling Lab
Web Mining
photo by: Graham Racher, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
74. 56
Key Questions
Were you able to build and run the code locally?
Were you able to run a crawl in Elastic MapReduce?
Were you able to improve the focused crawl?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
75. 57
ImageFinder Details
Find images about Ultimate Frisbee
Focused crawl
Fixed list of seed URLs
Positive & negative terms used to score pages
Extract images from page
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
76. 58
Running ImageFinder
Can run locally, with restrictions
Only one fetcher thread
Only 5 pages/loop
Can run in Hadoop cluster
Amazon Elastic MapReduce
CrawlRunner uploads job jar, creates “Job Flow Step”
Limited to 100 pages/loop, 2 loops
Will take up to 10 minutes to run loops
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
77. 59
Running in Elastic MapReduce
Watch your job via http://strata.scaleunlimited.com:9100/
Your job name will include your username
Results get added to searchable index
http://strata.scaleunlimited.com/solr/strata/
Search for student:<username> to find your results
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
78. 60
Solr Results Issue
Note links vs. images
These are “image” URLs that are actually to pages
Double-bonus on exercise
...fix this problem :)
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
79. 61
Lab Details
Code is in strata-web-mining folder you downloaded
Details of code in strata-web-mining/doc/README-Description
Instructions are in strata-web-mining/doc/README-Lab
Please follow the lab steps carefully
Missing a step will cause pain and suffering later
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
80. 62
Lab Exercises
First goal is to build code and run locally
Next is to build code and run in real cluster
Then you get to try to optimize the focused crawl
And (if you’re fast) try finding images for a different topic
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
81. 63
Trouble-shooting & Timing
I’ll be walking around - raise your hand if you need help
But with 100+ people, I’ll be talking fast :)
We’ve got an hour (or more) before summary/Q&A
Have fun...
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
82. 64
Q&A
Were you able to build and run the code locally?
Were you able to run a crawl in Elastic MapReduce?
Were you able to improve the focused crawl?
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
83. 65
Summary & QA
Web Mining
photo by: Graham Racher, flickr
Strata 2012
Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
84. 66
Key Challenges
Complexity of large scale web crawling
Challenges with extracting the right data
Extra work to turn results into value
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
85. 67
Ethical Crawling
Always have a real, valid, informative user agent name
Always honor the robot exclusion protocol - robots.txt
Limit your crawl rate - parallelism, crawl delay, pages/day
Immediately comply with blacklisting and data removal requests
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
86. 68
Avoiding getting blocked
Follow all ethical crawling guidelines
Gradually ramp up your crawl rate
Gives webmasters time to complain before it’s a serious problem
Avoid Javascript if at all possible
Don’t follow form links
Grovel shamelessly
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
87. 69
Resources
Hadoop - http://hadoop.apache.org
Cascading - http://www.cascading.org
Bixo - http://openbixo.org
Web Data Mining by Bing Liu
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12
88. 70
Question?
I might have answers
ken@scaleunlimited.com
@kkrugler
Copyright (c) 2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Tuesday, February 28, 12