

                                                                                                                         Scale Unlimited




                                                                                                                        Web Mining




                                                                                                                                                                             photo by: i_pinz, flickr
                                                                                                                        Strata 2012
         Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

Tuesday, February 28, 12




       Welcome to Web Mining!

              This class is a tutorial on large-scale web mining
              Topics covered
                      Overview of web mining
                      Web crawling - broad & focused
                      Text mining - extracting value
                      Hands-on lab
                      Tips and traps







       Meet Your Instructor

              Ken Krugler - direct from Nevada City, California
              Founder of TransPac Software, Krugle, Bixo Labs/Scale Unlimited
              Developer of Bixo web mining toolkit
              Committer on Apache Tika
              Developer and trainer for Hadoop, Solr and Cascading
              Actively web mining for six years







        Agenda


              9:00am - Overview
              9:30am - Web Crawling
              10:00am - Text Mining
              10:30am - Break
              11:00am - Web Mining Lab
              11:45am - Lab Review
              12:00pm - Summary
              12:15pm - Q&A






                                                                                                                         Overview








       Key Questions



              Which of the three types of web mining are we focusing on today?
              What makes web pages “noisy”?





       What is web mining?

              Extracting useful information from the World-wide Web
              Web structure - link graph analysis
              Web usage - server logs
                      173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"
                      67.124.22.71    - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"
                      89.105.44.90    - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"
              Web content - text and images
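
For the web usage case, server log entries like the three shown above can be parsed with a regular expression. This is a minimal sketch, assuming the fields are client IP, identity, user, timestamp, request line, status, response size and referrer; the names `parse_log_line` and `status_counts` are illustrative and not part of any toolkit mentioned in this tutorial.

```python
import re
from collections import Counter

# Assumed layout: ip identity user [timestamp] "method path protocol" status size "referrer"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


def status_counts(lines):
    """Count how many requests ended with each HTTP status code."""
    entries = (parse_log_line(line) for line in lines)
    return Counter(entry["status"] for entry in entries if entry is not None)


SAMPLE = [
    '173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"',
    '67.124.22.71    - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"',
    '89.105.44.90    - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"',
]

if __name__ == "__main__":
    print(status_counts(SAMPLE))
```

Counting status codes is only one possible reduction; the same parsed entries could feed per-path or per-client statistics.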








       Web Content Mining

              Analyzing data from web pages
              Typically three types of page processing
                      Unstructured - get rid of “boilerplate” text, analyze sentiment
                      Semi-structured - find names of people with phone numbers
                      Structured - find hotel name, address, phone number, reviews
              Plus inter-document analysis
                      Clustering



       Crawling versus Mining

              Web mining combines...
                      web crawling - finding & fetching content
                      data mining - extracting useful information


              Both fields are broad and deep - for example
                      optimal crawling strategies
                      machine learning for page classification
                      automatically extracting structured data






       What is “Large Scale”?

              More than what you can handle with one server
                      Many single-server solutions for mining web pages
                      Harder when you include text analytics
                      And (almost) impossible when you get to 100M+ pages


              So you need some kind of distributed processing framework








       Key Aspects of Web Mining


              Crawling - finding the “good stuff”


              Extracting - getting the “right data”


              Processing - turning bytes into bucks








       Finding the “good stuff”

              Often feels like “needle in a haystack”
                      E.g. even 100M pages is only about 0.1% of the total web


              Need to optimize time + cost per useful result
                      Can’t afford to waste time on pages that aren’t useful
                      And each page has cost to data provider








       Getting the “right data”

              Scale and precision are in opposition
                      One area of one site can be precision-processed
                      All areas of 50M domains means you have to be general


              Pages are noisy
                      Ads
                      Boilerplate (navigation, etc)
                      SEO






       Processing the results

              1TB data file has very little value
                      Actually less value than a small file that can be opened & viewed
                      Has to be turned into something with value


              Often processing is considered part of web mining
                      Reduction - turning petabytes into pie charts
                      Indexing - being able to search the data
                      Analytics - clustering, training models for recommenders






       Q&A



              Which of the three types of web mining are we focusing on today?
              What makes web pages “noisy”?





                                                                                                                         Web Crawling








       Key Questions



              What are three general types of web crawls?
              What can make it hard to accurately score a page?








       What is “web crawling”?
              Includes fetching pages, of course
              But also has the aspect of a spider crawling over a web
                      Extracting outlinks to discover new pages
                      Which means parsing the fetched content
                      Managing state of the crawl
              And all of the implicit rules
                      Robots exclusion protocol
                      User agent
                      Request rate
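
The implicit rules above (robots exclusion, user agent, request rate) can be checked before each fetch. The sketch below uses Python's standard `urllib.robotparser`; the user agent name and the robots.txt content are hypothetical, and the rules are parsed from a local string so the example runs without network access.

```python
from urllib import robotparser

# Hypothetical crawler identity; a real crawler should use its own name and contact details.
USER_AGENT = "example-miner"

# Hypothetical robots.txt content, parsed locally instead of being fetched from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)


def may_fetch(url):
    """True if the parsed rules allow our user agent to fetch this URL."""
    return rules.can_fetch(USER_AGENT, url)


def delay_seconds(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default
```

A crawler would typically call `may_fetch` before queuing a URL and sleep for `delay_seconds()` between requests to the same host; the 1.0 second default is an arbitrary choice for this sketch.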





       Types of web crawls
              Broad
                      Few or no limits to what domains/pages to process
                      Typically what people think of - Googlebot, bingbot, Baiduspider, ...
              Focused
                      Uses page scoring -> outlinks to guess at quality of unfetched pages
                      Often has whitelist of domains to avoid traps
              Domain
                      For a limited number of domains
                      Typically for precise extraction of data





       The “don’t crawl” crawl

              Leverage other people’s crawl data
                      Can be faster, cheaper
                      Reduces load on servers
              Public datasets
                      Common Crawl
                      Wikipedia - use data dump!
              Commercial providers
                      Spinn3r, InfoChimps






       Crawling Solutions

              General rule - don’t roll your own!
                      Easy to make something simple
                      Hard to make something scalable, robust, efficient
              Open source options
                      Java - Nutch, Heritrix, Bixo, Droids
                      Python - http://scrapy.org/
                      PHP - http://astellar.com/php-crawler/







       What makes it hard?

              Web mining breaks the implicit contract with web sites
                      You often aren’t creating an index that drives traffic to them
                      So why should they let you use bandwidth & server cycles?
              The web is a nearly infinite set of edge cases
                      Every possible problem will occur, with a broad enough crawl
              And not everybody plays nice
                      Link farms/honeypots, malicious sites, angry webmasters
              Plus you have to be able to work at scale






       Scaling Solutions

              Needs to be reliable, scalable, fault tolerant
              Single server can fetch lots of pages
                      But scaling becomes an issue with post-processing
              Several options
                      Hadoop - Nutch, Bixo
                      Custom queuing system - Heritrix, Droids
                      Storm - scalable queuing







       Focused Crawling 101

              How to maximize results while minimizing cost
                      aka Finding Good Stuff Fast
              Only crawl pages that you think are likely to be good
              Reduces cost through
                      Less time spent fetching worthless pages
                      Lower bandwidth/CPU/storage costs
                      Fewer angry webmasters







       Focused Crawl Details


              Seed URLs - Good starting point
              URL State - DB of all known URLs
              Page Score - “Quality” of page
              Link Score - Page Score/outlinks
              Fetched Pages - Saved results
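
The pieces listed above fit together as a loop: pick the best unfetched URL, fetch and score it, then spread its score across its outlinks. The sketch below is one possible reading, not the Bixo implementation. It assumes link score = page score / number of outlinks (as on the slide), that a URL's priority is the sum of the link scores pointing at it (an assumption), and it uses a hypothetical in-memory `FAKE_WEB` instead of real fetching and scoring.

```python
# Hypothetical in-memory "web": url -> (page score, outlinks).
# A real crawl would fetch each page and compute its score instead of looking it up.
FAKE_WEB = {
    "http://a.example/": (0.9, ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": (0.8, ["http://a.example/2"]),
    "http://a.example/2": (0.1, []),
    "http://b.example/": (0.0, ["http://b.example/spam"]),
    "http://b.example/spam": (0.0, []),
}


def focused_crawl(seed_urls, max_pages):
    """Fetch up to max_pages pages, always taking the highest-scoring unfetched URL."""
    # URL State: every known URL with its accumulated link score and fetch status.
    url_state = {url: {"score": 1.0, "fetched": False} for url in seed_urls}
    fetched_pages = []  # Fetched Pages: (url, page score) in fetch order

    while len(fetched_pages) < max_pages:
        candidates = [
            (-state["score"], url)
            for url, state in url_state.items()
            if not state["fetched"]
        ]
        if not candidates:
            break
        _, url = min(candidates)  # highest score first, ties broken by URL
        url_state[url]["fetched"] = True

        page_score, outlinks = FAKE_WEB.get(url, (0.0, []))
        fetched_pages.append((url, page_score))

        if outlinks:
            link_score = page_score / len(outlinks)  # Link Score = Page Score / outlinks
            for link in outlinks:
                state = url_state.setdefault(link, {"score": 0.0, "fetched": False})
                state["score"] += link_score

    return fetched_pages
```

With a budget of three pages and the seed `http://a.example/`, the low-scoring `b.example` pages are never fetched, which is the point of a focused crawl. Scanning all known URLs on every step is fine for a sketch but would not scale; a real system needs a proper URL state store.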







       Finding Seed URLs

              Sources, ordered from broad to narrow
              List of all registered domains - complete, but big (100M+)
              DMOZ - lots of spam/porn
              Alexa/Quantcast “top sites” list - top 1M US sites by traffic
              Wikipedia - use outlink dump if possible
              Tweets - with filtering, e.g. Gnip, DataSift
              Using search
                      Manually entering URLs - slow, but curated
                      Using API - faster, typically limited, can have junk






       Scoring Pages

              Analyze text on page
                      Typically means tokenizing text
                      “The sport of ultimate is..” => “the”, “sport”, “of”, “ultimate”, “is”, ...
              Simple term-based
                      Count occurrences of all phrases, good phrases, bad phrases
                      Calculate ratios of counts: good/all - bad/all = score
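
The simple term-based score above can be sketched in a few lines. For brevity this version counts single tokens rather than multi-word phrases, and the good and bad term lists are hypothetical examples for an ultimate frisbee crawl.

```python
import re

# Hypothetical term lists; a real crawl would curate these for its own topic.
GOOD_TERMS = {"ultimate", "frisbee", "disc"}
BAD_TERMS = {"golf", "timeshare"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score_page(text):
    """Return good/all - bad/all, or 0.0 for a page with no tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if token in GOOD_TERMS)
    bad = sum(1 for token in tokens if token in BAD_TERMS)
    total = len(tokens)
    return good / total - bad / total
```

The score ranges from -1.0 (every token is a bad term) to 1.0 (every token is a good term), so a crawl can apply a single threshold to it.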








       Scoring using SVM
              SVM = support vector machine
              Trained using “documents” that have features, and a class
                      “good” : “ultimate”, “ultimate frisbee”, “disc”, “sport”, “throw”, “run”
                      “bad” : “golf”, “timeshare”, “aardvark”, “potato”


              Creates a statistical model
                      Divides all training documents into separate classes
                      Used to give an unknown document a class
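
To make the training step concrete, here is a small linear SVM trained by subgradient descent on the hinge loss, written in plain Python so it is self-contained. It is a teaching sketch under several assumptions: documents are sets of terms, labels are +1 ("good") and -1 ("bad"), there is no bias term, and the learning rate, regularization and epoch count are arbitrary. A real project would more likely use an existing library such as LIBSVM.

```python
def train_linear_svm(documents, epochs=50, learning_rate=0.1, reg=0.01):
    """Train term weights from (set_of_terms, label) pairs, label being +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for terms, label in documents:
            margin = label * sum(weights.get(term, 0.0) for term in terms)
            # L2 regularization: shrink every weight slightly on each step.
            for term in weights:
                weights[term] *= 1.0 - learning_rate * reg
            # Hinge loss: only documents inside the margin push the weights.
            if margin < 1.0:
                for term in terms:
                    weights[term] = weights.get(term, 0.0) + learning_rate * label
    return weights


def classify(weights, terms):
    """Give an unknown document a class based on the sign of its score."""
    score = sum(weights.get(term, 0.0) for term in terms)
    return "good" if score > 0 else "bad"


# Hypothetical training documents built from the feature examples above.
TRAINING = [
    ({"ultimate", "disc", "sport", "throw", "run"}, +1),
    ({"ultimate", "frisbee", "disc"}, +1),
    ({"golf", "timeshare"}, -1),
    ({"aardvark", "potato", "golf"}, -1),
]

MODEL = train_linear_svm(TRAINING)
```

Calling `classify(MODEL, {"ultimate", "disc"})` yields "good" and `classify(MODEL, {"golf", "potato"})` yields "bad". A document with only unseen terms scores 0 and falls into "bad" here, which is a design choice of this sketch rather than a property of SVMs.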







       Challenges with Scoring
              How do you decide that a page is “good”?
                      Might be mostly graphics with few words
                      Could be a definition of the term




              Min threshold for amount of real content
              Detecting link farms with fake content





       Chrome, cruft, and boilerplate
              Navigational links
              Sidebar elements
              Ads, SEO links




              Can use Boilerpipe & other “cleaners”







       Expanding the crawl frontier

              Have to parse the page to find outlinks
              Need to normalize links
                      http://scaleunlimited.com == http://www.scaleunlimited.com
                      http://www.scaleunlimited.com == http://www.scaleunlimited.com/
              Skipping links to low-value pages
                      Links to images, pdf files, other binary types (using suffix)
                      Links to DB-generated pages
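
Link normalization and suffix-based skipping can be sketched with the standard `urllib.parse` module. The rules here are assumptions chosen to match the two equivalences above: a leading "www." is dropped, an empty path becomes "/", and fragments are discarded. Treating any URL with a query string as DB-generated is a crude stand-in, and the suffix list is only illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

# Illustrative list of suffixes for binary content we do not want to fetch.
SKIPPED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".exe")


def normalize_url(url):
    """Return a canonical form so equivalent links map to one URL State entry."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www.example.com and example.com are the same site
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # fragment dropped


def is_low_value(url):
    """True for links to binary types (by suffix) or, crudely, DB-generated pages."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(SKIPPED_SUFFIXES):
        return True
    return bool(parts.query)  # assumption: a query string suggests a DB-generated page
```

Dropping "www." is not always safe, since some sites serve different content on the two hosts, so a production crawl may prefer to learn the canonical host from redirects.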







       Focused Domain Crawl


              Very specific, explicit crawl of one domain
              Typically involves discovery of target content pages
              Often uses URL patterns to synthesize links
                      Page X in site has a list of products: a, b, c, d...
                      Product pages are <domain>/product/a or b or c or d...
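
Synthesizing links from a URL pattern is mostly string templating. A minimal sketch, assuming the `/product/<id>` pattern from the example above; the domain and product identifiers are placeholders.

```python
from urllib.parse import quote


def synthesize_product_urls(domain, product_ids):
    """Build product page URLs from identifiers found on a listing page."""
    return [
        f"http://{domain}/product/{quote(str(product_id), safe='')}"
        for product_id in product_ids
    ]
```

Percent-encoding each identifier keeps names with spaces or slashes from producing malformed URLs; whether a given site expects that encoding is something to verify per domain.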








       Discovery vs Extraction


              Focused Domain Crawl has two distinct phases
                      Crawling to discover details pages
                      Fetching/processing details pages
              Often phases are co-mingled, for efficiency
                      Need to track the kind of each page in the URL State DB








       Goby Crawl Example
              http://www.goby.com has information on lots of attractions
              http://www.goby.com/boston-ma has list of categories




              http://www.goby.com/<category>--near--<city>-<state>
              Often need to paginate listing pages, to get all details links





       Q&A



              What are three general types of web crawls?
              What can make it hard to accurately score a page?





                                                                                                                         Data Extraction








       Key Questions



              What are the three general approaches for data extraction?
              Why might you want to detect the language of a page?








       You’ve got a page, now what?

              Time to extract the data you need
              Three attributes of extraction, pick any two
                      Broad - across lots of domains and page formats
                      Precise - very specific types of data
                      Accurate - low error rate
              Three general approaches
                      Unstructured (broad, accurate) - “just text”
                      Semi-structured (broad, precise) - finding meaning in text
                      Structured (precise, accurate) - getting exactly the data you need





       Common Tasks - Cleaning

              The HTML needs to be cleaned up
                      Lots of messy data, especially when hand-edited
                      Even HTML (2.0? 3.2? 4.01?) should be converted to XHTML
              Various libraries help with “cleaning” the HTML
                      TagSoup, NekoHTML, HtmlCleaner
              Note that end result won’t match original text








       Common Tasks - Charset

              You get bytes back from the web server
              You need a charset to convert bytes to characters
                      HTTP response header - “Content-Type: text/html; charset=UTF-8”
                      HTML meta tag - <meta http-equiv="Content-Type" content="..." />
                      Analysis of text - byte sequence statistics
              Several packages support this
                      Tika, ICU
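
A minimal sketch of the signal order above, assuming the raw bytes and the Content-Type header value are already in hand. The function names are hypothetical, and the fixed default only stands in for the statistical analysis step that a package such as Tika or ICU would perform.

```python
import re

# Matches a charset declaration inside a <meta> tag in the raw bytes.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(raw_bytes):
    """Look for a charset declaration in the first 4 KB of the page."""
    match = META_CHARSET_RE.search(raw_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None

def decode_page(raw_bytes, content_type=None, default="utf-8"):
    """Try the header, then the meta tag, then a default (assumed fallback)."""
    for candidate in (charset_from_header(content_type),
                      charset_from_meta(raw_bytes),
                      default):
        if candidate is None:
            continue
        try:
            return raw_bytes.decode(candidate), candidate
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong charset, try the next signal
    # Last resort: decode with the default and replace undecodable bytes.
    return raw_bytes.decode(default, errors="replace"), default
```

Servers and pages often disagree about the charset, so the order of trust here is a design choice rather than a rule.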







       Common Tasks - Link Extraction


              Needed for a crawl - this is where new links come from
              Means you need XHTML so you can parse the markup
              Not just <a href="xxx">
                      img, frame, iframe, link, map, area
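
A rough sketch of collecting outlinks from more than just anchor tags. It uses Python's lenient html.parser as a stand-in for the clean-then-parse step. The tag-to-attribute table and the names LinkCollector and extract_links are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that carries a URL, following the list above.
# (<map> itself has no URL; its <area> children do.)
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href",
    "img": "src", "frame": "src", "iframe": "src",
}

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))

def extract_links(html, base_url):
    """Return (tag, absolute_url) pairs in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Keeping the tag with each URL lets a crawler treat page links and embedded resources differently later.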








       Common Tasks - Boilerplate


              For unstructured and semi-structured
              Can improve the quality of results
              Especially important for machine learning
                      Boilerplate text can dramatically skew statistics
                      Creates a noisier signal








       Common Tasks - Language
              Often used for filtering or alternative processing
                      Target audience is only interested in Spanish
                      I need to tokenize Japanese differently
                      Clustering improves when it’s segmented by language
              Multiple signals for selecting language, same as charset detection
                      HTTP response header - “Content-Language: es”
                      HTML meta tag - <meta http-equiv="Content-Language" content="es" />
                      HTML tag attributes - <html lang="es">
                      Analysis of text - ngram statistics, short words
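
A toy sketch of combining these signals, assuming explicit declarations are trusted before text statistics. The short-word lists are tiny hand-picked samples and the function names are hypothetical. A real detector would use much larger n-gram models.

```python
import re

# Tiny, hand-picked short-word lists (an assumption for illustration only).
SHORT_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu", "den"},
}

def language_from_markup(html):
    """Return the value of <html lang="..."> if present, else None."""
    match = re.search(r'<html[^>]*\blang\s*=\s*["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None

def language_from_text(text):
    """Guess the language by counting hits against each short-word list."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in SHORT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def detect_language(html, text, content_language=None):
    """Prefer explicit signals (header, markup), fall back to text statistics."""
    return content_language or language_from_markup(html) or language_from_text(text)
```

Declared languages are sometimes wrong (templates get copied between sites), so a production pipeline might compare the declared value against the statistical guess instead of trusting it outright.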





       Unstructured Extraction


              Goal is extracting text, without much additional processing
              Often has a few fields, from HTML
                      Title - from <head><title>The title of my page</title></head>
                      Description - from <meta name="description" content="ultimate frisbee" />
                      Body - from <body>...all elements that contain text, like <p>...</body>
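
A small sketch of pulling those three fields out of a page, using only Python's standard html.parser. The class and function names are hypothetical, and skipping script and style content is an added assumption about what counts as body text.

```python
from html.parser import HTMLParser

class PageFields(HTMLParser):
    """Collects the title, the meta description and visible body text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        elif self._in_body:
            self.body_parts.append(text)

def extract_fields(html):
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {"title": parser.title,
            "description": parser.description,
            "body": " ".join(parser.body_parts)}
```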








       Semi-structured Extraction

              Goal is finding structured data in random text
                      Can be applied broadly, since it’s not (very) format-specific
                      Accuracy suffers, because of breadth of input data formats
                      Beware the academic algorithm
              Examples of what does work...
                      Easy patterns: telephone numbers, dates
                      Microformats: hCalendar, hCard, hReview, ...
                      Natural Language Processing (NLP): named entities
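
The "easy patterns" case can be sketched with two regular expressions. These patterns are deliberately narrow assumptions (North American phone layouts, ISO and US-style numeric dates). Broader coverage trades away accuracy, as noted above.

```python
import re

# Narrow illustrative patterns, not a complete grammar for either data type.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")

def find_phones(text):
    """Return phone-like strings such as (530) 555-0100 or 415-555-0199."""
    return PHONE_RE.findall(text)

def find_dates(text):
    """Return date-like strings such as 2012-02-28 or 2/28/2012."""
    return DATE_RE.findall(text)
```

Neither pattern validates what it finds (2/31/2012 would still match), so a follow-up validation step is usually worth adding.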






       Structured Extraction

              Precise extraction of specific types of data
              Typically specific to one area of one site
              Often handled with XPath, and maybe regular expressions
                      //div[@id='<id of target div>']/p
              div and span are beautiful tags
                      Commonly used with CSS
                      Which means they are (more) stable
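
A short sketch of that XPath pattern, assuming the page has already been cleaned into well-formed XHTML with no XML namespace. Python's ElementTree supports only a subset of XPath, and the function name and sample ids are made up for the example.

```python
import xml.etree.ElementTree as ET

def paragraphs_in_div(xhtml, div_id):
    """Return the text of each <p> directly under the div with the given id."""
    root = ET.fromstring(xhtml)
    # ".//" searches all descendants; "/p" then selects direct <p> children.
    nodes = root.findall(f".//div[@id='{div_id}']/p")
    # itertext() also picks up text inside nested inline tags such as <b>.
    return ["".join(p.itertext()).strip() for p in nodes]
```

Real XHTML usually declares the http://www.w3.org/1999/xhtml namespace, in which case the element names in the path need a namespace prefix.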







       How to figure out XPath


              Firebug is your friend
                      Plug-in for Firefox
                      Will show you the full XPath for each element
              Note that browsers will re-write HTML (e.g. tbody element)
              The DOM you see is often generated with Javascript








       XPath demo



              Firebug
              XPath tool








       Dealing with Javascript

              Required if page generates target content using JS
              Forces you to use Firebug or equivalent to inspect the DOM
              Options for processing include...
                      HtmlUnit
                      qt-webkit
                      headless Mozilla








       Javascript challenges

              10x slower than just loading the page text
              Good way to make a webmaster angry
                      Lots of extra load on server
                      Can skew website stats
              Often has issues
                      Pages that work in FF or IE but not HtmlUnit
                      Pages that cause HtmlUnit to hang







       Q&A



              What are the three general approaches for data extraction?
              Why might you want to detect the language of a page?





                                                                                                                         Web Crawling Lab




                                                                                                                        Web Mining




                                                                                                                                                                              photo by: Graham Racher, flickr
                                                                                                                        Strata 2012




       Key Questions



              Were you able to build and run the code locally?
              Were you able to run a crawl in Elastic MapReduce?
              Were you able to improve the focused crawl?








       ImageFinder Details


              Find images about Ultimate Frisbee
              Focused crawl
                      Fixed list of seed URLs
                      Positive & negative terms used to score pages
              Extract images from page
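
One way term-based page scoring could work is sketched below. This is a hypothetical formula for illustration, not the scoring that the lab code uses. The term sets and the name score_page are assumptions.

```python
import re

def score_page(text, positive_terms, negative_terms):
    """Score = (positive hits - negative hits) / total words; 0.0 if empty."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(w in positive_terms for w in words)
    negative = sum(w in negative_terms for w in words)
    # Dividing by page length keeps long pages from winning on volume alone.
    return (positive - negative) / len(words)
```

A focused crawler could then follow outlinks only from pages whose score clears some threshold, and tuning the term lists changes which pages qualify.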








       Running ImageFinder

              Can run locally, with restrictions
                      Only one fetcher thread
                      Only 5 pages/loop
              Can run in Hadoop cluster
                      Amazon Elastic MapReduce
                      CrawlRunner uploads job jar, creates “Job Flow Step”
                      Limited to 100 pages/loop, 2 loops
                      Will take up to 10 minutes to run loops






       Running in Elastic MapReduce


              Watch your job via http://strata.scaleunlimited.com:9100/
                      Your job name will include your username
              Results get added to searchable index
                      http://strata.scaleunlimited.com/solr/strata/
                      Search for student:<username> to find your results








       Solr Results Issue
              Note links vs. images
              These are “image” URLs that actually point to pages
              Double-bonus on exercise
              ...fix this problem :)








       Lab Details


              Code is in strata-web-mining folder you downloaded
              Details of code in strata-web-mining/doc/README-Description
              Instructions are in strata-web-mining/doc/README-Lab
                      Please follow the lab steps carefully
                      Missing a step will cause pain and suffering later








       Lab Exercises


              First goal is to build the code and run it locally
              Next is to build the code and run it in a real cluster
              Then you get to try to optimize the focused crawl
              And (if you’re fast) try finding images for a different topic








       Trouble-shooting & Timing


              I’ll be walking around - raise your hand if you need help
              But with 100+ people, I’ll be talking fast :)
              We’ve got an hour (or more) before summary/Q&A
              Have fun...








       Q&A



              Were you able to build and run the code locally?
              Were you able to run a crawl in Elastic MapReduce?
              Were you able to improve the focused crawl?





                                                                                                                         Summary & Q&A




                                                                                                                        Web Mining




                                                                                                                                                                              photo by: Graham Racher, flickr
                                                                                                                        Strata 2012




       Key Challenges


              Complexity of large scale web crawling


              Challenges with extracting the right data


              Extra work to turn results into value








       Ethical Crawling


              Always have a real, valid, informative user agent name
              Always honor the robot exclusion protocol - robots.txt
              Limit your crawl rate - parallelism, crawl delay, pages/day
              Immediately comply with blacklisting and data removal requests
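
A minimal sketch of the robots.txt and crawl-delay checks using Python's standard urllib.robotparser. It assumes the robots.txt content has already been fetched. The user agent name and the one-second default delay are placeholders.

```python
from urllib import robotparser

USER_AGENT = "example-crawler"  # hypothetical; use a real, informative name

def build_rules(robots_txt):
    """Parse robots.txt content that has already been fetched."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def fetch_allowed(rules, url):
    """True if the robots.txt rules permit this user agent to fetch the URL."""
    return rules.can_fetch(USER_AGENT, url)

def polite_delay(rules, default_delay=1.0):
    """Honor Crawl-delay when the site sets one, else use a default (assumed)."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_delay
```

A fetch loop would call fetch_allowed before every request and sleep for polite_delay seconds between requests to the same host.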








       Avoiding getting blocked

              Follow all ethical crawling guidelines
              Gradually ramp up your crawl rate
                      Gives webmasters time to complain before it’s a serious problem
              Avoid Javascript if at all possible
              Don’t follow form links
              Grovel shamelessly








       Resources


              Hadoop - http://hadoop.apache.org
              Cascading - http://www.cascading.org
              Bixo - http://openbixo.org
              Web Data Mining by Bing Liu








       Questions?


              I might have answers


              ken@scaleunlimited.com
              @kkrugler




DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 

Strata web mining tutorial

  • 1. 1 Scale Unlimited Web Mining photo by: i_pinz, flickr Strata 2012 Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Tuesday, February 28, 12
  • 2. Welcome to Web Mining! This class is a tutorial on large scale web mining. Topics covered: overview of web mining; web crawling - broad & focused; text mining - extracting value; hands-on lab; tips and traps.
  • 3. Meet Your Instructor: Ken Krugler - direct from Nevada City, California. Founder of TransPac Software, Krugle, Bixo Labs/Scale Unlimited. Developer of Bixo web mining toolkit. Committer on Apache Tika. Developer and trainer for Hadoop, Solr and Cascading. Actively web mining for six years.
  • 4. Agenda: 9:00am - Overview; 9:30am - Web Crawling; 10:00am - Text Mining; 10:30am - Break; 11:00am - Web Mining Lab; 11:45am - Lab Review; 12:00pm - Summary; 12:15pm - Q&A.
  • 5. Overview. Web Mining, Strata 2012. Photo by: exfordy, flickr.
  • 6. Key Questions: Which of the three types of web mining are we focusing on today? What makes web pages “noisy”?
  • 15. What is web mining? Extracting useful information from the World-wide Web. Web structure - link graph analysis. Web usage - server logs, for example: 173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"; 67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"; 89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-".
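Web usage mining starts by turning log lines like the three on this slide into records. The following is a minimal sketch, assuming the log layout is exactly the one shown (host, two placeholder fields, timestamp, request, status, size, referrer); the names `LOG_LINE`, `parse_line` and `missing_paths` are illustrative, not part of the tutorial.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines on the slide:
# host ident user [timestamp] "METHOD path PROTOCOL" status size "referrer"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

SAMPLE = [
    '173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"',
    '67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"',
    '89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"',
]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


records = [r for r in map(parse_line, SAMPLE) if r is not None]
status_counts = Counter(r["status"] for r in records)          # requests per status code
missing_paths = [r["path"] for r in records if r["status"] == 404]  # broken URLs
```

Counting status codes and listing the 404 paths is about the simplest "useful information" such a log yields; real usage mining would aggregate across many more lines.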
  • 20. What is web mining? Extracting useful information from the World-wide Web. Web structure - link graph analysis. Web usage - server logs. Web content - text and images.
  • 21. Web Content Mining: Analyzing data from web pages. Typically three types of page processing: Unstructured - get rid of “boilerplate” text, analyze sentiment; Semi-structured - find names of people with phone numbers; Structured - find hotel name, address, phone number, reviews. Plus inter-document analysis: clustering.
  • 30. Crawling versus Mining: Web mining combines web crawling (finding & fetching content) and data mining (extracting useful information). Both fields are broad and deep - for example: optimal crawling strategies; machine learning for page classification; automatically extracting structured data.
  • 31. What is “Large Scale”? More than what you can handle with one server. Many single-server solutions for mining web pages. Harder when you include text analytics. And (almost) impossible when you get to 100M+ pages. So you need some kind of distributed processing framework.
  • 32. Key Aspects of Web Mining: Crawling - finding the “good stuff”. Extracting - getting the “right data”. Processing - turning bytes into bucks.
  • 33. Finding the “good stuff”: Often feels like “needle in a haystack”. E.g. even 100M pages is 0.1% of total web. Need to optimize time + cost per useful result. Can’t afford to waste time on pages that aren’t useful. And each page has cost to data provider.
  • 34. Getting the “right data”: Scale and precision are in opposition. One area of one site can be precision-processed. All areas of 50M domains means you have to be general. Pages are noisy: ads; boilerplate (navigation, etc); SEO.
  • 35. Processing the results: A 1TB data file has very little value. Actually less value than a small file that can be opened & viewed. Has to be turned into something with value. Often processing is considered part of web mining: Reduction - turning petabytes into pie charts; Indexing - being able to search the data; Analytics - clustering, training models for recommenders.
  • 36. Q&A: Which of the three types of web mining are we focusing on today? What makes web pages “noisy”?
  • 37. Web Crawling. Web Mining, Strata 2012. Photo by: Graham Racher, flickr.
  • 38. Key Questions: What are three general types of web crawls? What can make it hard to accurately score a page?
  • 39. What is “web crawling”? Includes fetching pages, of course. But also has aspect of spider crawling over a web: extracting outlinks to discover new pages, which means parsing the fetched content; managing state of the crawl. And all of the implicit rules: robots exclusion protocol; user agent; request rate.
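The "implicit rules" on this slide can be checked in code before a fetch. Below is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-miner` user agent and the `may_fetch` helper are made up for illustration. A real crawler would download each site's robots.txt rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-miner"  # hypothetical user agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url):
    """True if the robots exclusion rules allow our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds to wait between requests, if the site declares a Crawl-delay.
delay = parser.crawl_delay(USER_AGENT)
```

The request-rate rule then amounts to sleeping at least `delay` seconds (or a conservative default when none is declared) between requests to the same host.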
  • 40. Types of web crawls: Broad - few or no limits to what domains/pages to process; typically what people think of - Googlebot, bingbot, Baiduspider, ... Focused - uses page scoring -> outlinks to guess at quality of unfetched pages; often has whitelist of domains to avoid traps. Domain - for a limited number of domains; typically for precise extraction of data.
  • 41. The “don’t crawl” crawl: Leverage other people’s crawl data. Can be faster, cheaper. Reduces load on servers. Public datasets: Common Crawl; Wikipedia - use data dump! Commercial providers: Spinner, InfoChimps.
  • 42. Crawling Solutions: General rule - don’t roll your own! Easy to make something simple. Hard to make something scalable, robust, efficient. Open source options: Java - Nutch, Heritrix, Bixo, Droids; Python - http://scrapy.org/; PHP - http://astellar.com/php-crawler/.
  • 43. What makes it hard? Web mining breaks the implicit contract with web sites. You often aren’t creating an index that drives traffic to them. So why should they let you use bandwidth & server cycles? The web is a nearly infinite set of edge cases. Every possible problem will occur, with a broad enough crawl. And not everybody plays nice: link farms/honeypots, malicious sites, angry webmasters. Plus you have to be able to work at scale.
  • 44. Scaling Solutions: Needs to be reliable, scalable, fault tolerant. Single server can fetch lots of pages. But scaling is an issue with post-processing. Several options: Hadoop - Nutch, Bixo; custom queuing system - Heritrix, Droids; Storm - scalable queuing.
  • 45. Focused Crawling 101: How to maximize results while minimizing cost, aka Finding Good Stuff Fast. Only crawl pages that you think are likely to be good. Reduces cost through: less time spent fetching worthless pages; lower bandwidth/CPU/storage costs; fewer angry webmasters.
  • 46. Focused Crawl Details: Seed URLs - good starting point. URL State - DB of all known URLs. Page Score - “quality” of page. Link Score - Page Score/outlinks. Fetched Pages - saved results.
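The pieces named on this slide can be wired together as a priority queue. The sketch below is one possible reading, not the tutorial's implementation: the `Frontier` class and its method names are hypothetical, seeds get an assumed score of 1.0, and a URL reached from several pages keeps its best link score (summing scores would be an equally plausible choice).

```python
import heapq


class Frontier:
    """In-memory stand-in for the URL State DB, ordered by link score."""

    def __init__(self, seeds):
        self.state = {}       # url -> best link score seen so far
        self.fetched = set()  # urls already handed out for fetching
        self._heap = []       # (-score, url) so the highest score pops first
        for url in seeds:
            self.add(url, 1.0)  # assumption: seeds start with the top score

    def add(self, url, link_score):
        if url in self.fetched:
            return
        if link_score > self.state.get(url, float("-inf")):
            self.state[url] = link_score
            heapq.heappush(self._heap, (-link_score, url))

    def next_url(self):
        """Pop the unfetched URL with the highest link score, or None."""
        while self._heap:
            neg_score, url = heapq.heappop(self._heap)
            if url in self.fetched or -neg_score != self.state[url]:
                continue  # stale heap entry, a better score was pushed later
            self.fetched.add(url)
            return url
        return None

    def record_page(self, page_score, outlinks):
        """Link Score = Page Score / number of outlinks, as on the slide."""
        if not outlinks:
            return
        link_score = page_score / len(outlinks)
        for link in outlinks:
            self.add(link, link_score)


frontier = Frontier(["http://seed.example/"])
first = frontier.next_url()
frontier.record_page(0.8, ["http://a.example/", "http://b.example/"])  # 0.4 each
frontier.record_page(0.9, ["http://c.example/"])                       # 0.9
second = frontier.next_url()
```

Dividing the page score among the outlinks means a good page with few links passes on more confidence per link than a good page with hundreds of them.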
  • 47. Finding Seed URLs, from broad to narrow: List of all registered domains - complete, but big (100M+). DMOZ - lots of spam/porn. Alexa/Quantcast “top sites” list - top 1M US sites by traffic. Wikipedia - use outlink dump if possible. Tweets - with filtering, e.g. Gnip, DataSift. Using search: manually entering URLs - slow, but curated; using API - faster, typically limited, can have junk.
  • 48. Scoring Pages: Analyze text on page. Typically means tokenizing text: “The sport of ultimate is..” => “the”, “sport”, “of”, “ultimate”, “is”, ... Simple term-based: count occurrences of all phrases, good phrases, bad phrases; calculate ratios of counts: good/all - bad/all = score.
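The term-based score on this slide can be sketched in a few lines. This is a simplified version under stated assumptions: it counts single words rather than multi-word phrases, and the `GOOD_TERMS` and `BAD_TERMS` sets are small illustrative lists, not the ones used in the tutorial.

```python
import re

# Illustrative term lists; a real crawl would curate these for its topic.
GOOD_TERMS = {"ultimate", "frisbee", "disc"}
BAD_TERMS = {"golf", "timeshare"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score(text):
    """Return good/all - bad/all, a value between -1.0 and 1.0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if token in GOOD_TERMS)
    bad = sum(1 for token in tokens if token in BAD_TERMS)
    return good / len(tokens) - bad / len(tokens)
```

For the slide's sample text, `tokenize("The sport of ultimate is..")` gives the five tokens listed there, and one good term out of five tokens scores 0.2.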
  • 49. Scoring using SVM: SVM = support vector machine. Trained using “documents” that have features, and a class: “good” : “ultimate”, “ultimate frisbee”, “disc”, “sport”, “throw”, “run”; “bad” : “golf”, “timeshare”, “aardvark”, “potato”. Creates a statistical model. Divides all training documents into separate classes. Used to give an unknown document a class.
  • 50. Challenges with Scoring: How do you decide that a page is “good”? Might be mostly graphics with few words. Could be a definition of the term. Min threshold for amount of real content. Detecting link farms with fake content.
  • 51. Chrome, cruft, and boilerplate: Navigational links. Sidebar elements. Ads, SEO links. Can use Boilerpipe & other “cleaners”.
  • 52. Expanding the crawl frontier: Have to parse the page to find outlinks. Need to normalize links: http://scaleunlimited.com == http://www.scaleunlimited.com; http://www.scaleunlimited.com == http://www.scaleunlimited.com/. Skipping links to low-value pages: links to images, pdf files, other binary types (using suffix); links to DB-generated pages.
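Both equivalences on this slide, plus the suffix filter, can be sketched with `urllib.parse`. The rules below are assumptions chosen to match the slide's examples: treating a leading `www.` as optional is not safe for every site, and `SKIP_SUFFIXES` is an illustrative list.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
SKIP_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip"}  # illustrative


def normalize(url):
    """Map equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www.example.com and example.com are one site
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path or "/"           # an empty path means the root page
    return urlunsplit((parts.scheme, host, path, parts.query, ""))  # drop fragment


def should_skip(url):
    """True for links whose suffix suggests an image or other binary file."""
    suffix = posixpath.splitext(urlsplit(url).path)[1].lower()
    return suffix in SKIP_SUFFIXES
```

With these rules all three spellings on the slide normalize to `http://scaleunlimited.com/`, so the URL State DB stores the page once.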
  • 53. Focused Domain Crawl: Very specific, explicit crawl of one domain. Typically involves discovery of target content pages. Often uses URL patterns to synthesize links: page X in the site has a list of products a, b, c, d...; product pages are <domain>/product/a or b or c or d...
  • 54. Discovery vs Extraction: Focused Domain Crawl has two distinct phases: crawling to discover details pages; fetching/processing details pages. Often phases are co-mingled, for efficiency. Need to track what kind of page in the URL State DB.
  • 55. Goby Crawl Example: http://www.goby.com has information on lots of attractions. http://www.goby.com/boston-ma has list of categories. http://www.goby.com/<category>--near--<city>-<state>. Often need to paginate listing pages, to get all details links.
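Synthesizing listing URLs from the pattern on this slide can be sketched as a simple template fill. Only the pattern itself comes from the slide; the category names and the lowercase slug form of city and state are assumptions for illustration, and pagination is left out because the slide does not give its URL form.

```python
# Pattern from the slide: http://www.goby.com/<category>--near--<city>-<state>
LISTING_PATTERN = "http://www.goby.com/{category}--near--{city}-{state}"


def listing_url(category, city, state):
    """Fill the slide's URL pattern for one category and location."""
    return LISTING_PATTERN.format(category=category, city=city, state=state)


# Hypothetical category names, as they might be discovered on a city page.
categories = ["museums", "parks"]
urls = [listing_url(category, "boston", "ma") for category in categories]
```

In a focused domain crawl these synthesized URLs would be added to the URL State DB tagged as listing pages, so that the links found on them can be treated as details pages.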
  • 56. Q&A: What are three general types of web crawls? What can make it hard to accurately score a page?
  • 57. Data Extraction. Web Mining, Strata 2012. Photo by: Graham Racher, flickr.
  • 58. Key Questions: What are the three general approaches for data extraction? Why might you want to detect the language of a page?
  • 59. You’ve got a page, now what? Time to extract the data you need. Three attributes of extraction, pick any two: Broad - across lots of domains and page formats; Precise - very specific types of data; Accurate - low error rate. Three general approaches: Unstructured (broad, accurate) - “just text”; Semi-structured (broad, precise) - finding meaning in text; Structured (precise, accurate) - getting exactly the data you need.
  • 60. Common Tasks - Cleaning: The HTML needs to be cleaned up. Lots of messy data, especially when hand-edited. Even valid HTML (2.0? 3.2? 4.01?) should be converted to XHTML. Various libraries help with “cleaning” the HTML: TagSoup, NekoHTML, HtmlCleaner. Note that end result won’t match original text.
  • 61. Common Tasks - Charset: You get bytes back from the web server. You need a charset to convert bytes to characters: HTTP response header - "Content-Type: text/html; charset=UTF-8"; HTML meta tag - <meta http-equiv="Content-Type" content="..." />; analysis of text - byte sequence statistics. Several packages support this: Tika, ICU.
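The first two charset signals on this slide can be sketched with the standard library. The fallback order (header, then meta tag, then a default) and the `detect_charset` name are assumptions; the third signal, byte sequence statistics, is what packages such as Tika and ICU provide and is not attempted here.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form shown on the slide.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(body):
    """Look for a charset declaration in the first bytes of the HTML."""
    match = META_CHARSET.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(content_type, body, default="utf-8"):
    """Header first, then meta tag, then an assumed default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default
```

The resulting name can be passed to `body.decode(charset, errors="replace")`; declared charsets are sometimes wrong, which is why statistical detection is still worth having as a cross-check.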
  • 62. Common Tasks - Link Extraction: Needed to have a crawl - where new links come from. Means you need XHTML so you can parse the markup. Not just <a href="xxx">: also img, frame, iframe, link, map, area.
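Collecting links from more than just anchor tags can be sketched with Python's `html.parser`, which tolerates messy markup, so this sketch skips the XHTML conversion step the slide mentions. The tag-to-attribute table is an assumption about which attribute carries the URL; `map` is omitted because its links live on the nested `area` elements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute holds the URL for each tag of interest.
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href",
    "img": "src", "frame": "src", "iframe": "src",
}


class LinkExtractor(HTMLParser):
    """Collect (tag, absolute URL) pairs while parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page's own URL.
                self.links.append((tag, urljoin(self.base_url, value)))


def extract_links(base_url, html):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    extractor.close()
    return extractor.links
```

Keeping the tag with each URL lets the crawler treat `img` and `link` targets differently from `a` targets, for example by never adding them to the frontier.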
  • 63. Common Tasks - Boilerplate: For unstructured and semi-structured extraction. Can improve the quality of results. Especially important for machine learning. Boilerplate text can dramatically skew statistics. Creates a noisier signal.
  • 64. Common Tasks - Language: Often used for filtering or alternative processing: target audience is only interested in Spanish; I need to tokenize Japanese differently; clustering improves when it’s segmented by language. Multiple signals for selecting language, same as charset detection: HTTP response header - Content-Language: es; HTML meta tag - <meta http-equiv="Content-Language" content="es" />; HTML tag attributes - <html lang="es">; analysis of text - ngram statistics, short words.
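The "short words" signal on this slide can be sketched by counting very common function words per language. The word lists, the two supported languages and the rule that a declared language wins over text analysis are all assumptions for illustration; production detectors rely on ngram statistics trained on large corpora.

```python
import re

# Tiny illustrative lists of frequent short words per language.
SHORT_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "es": {"el", "la", "de", "que", "y", "los"},
}


def guess_from_text(text):
    """Return the language whose short words occur most often, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in SHORT_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def detect_language(text, header_lang=None, html_lang=None):
    """Prefer a declared language (header, then html lang), else analyze the text."""
    for declared in (header_lang, html_lang):
        if declared:
            return declared.split("-")[0].strip().lower()  # "es-MX" -> "es"
    return guess_from_text(text)
```

Declared languages on real pages are often missing or wrong, so comparing them with the text-based guess is a reasonable sanity check.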
  • 65. Unstructured Extraction: Goal is extracting text, without much additional processing. Often has a few fields, from HTML: Title - from <head><title>The title of my page</title></head>; Description - from <meta name="description" content="ultimate frisbee" />; Body - from <body>...all elements that contain text, like <p>...</body>.
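Pulling the three fields named on this slide out of a page can be sketched with `html.parser`. The `PageText` class and `extract` function are hypothetical names, and skipping `script` and `style` content is an added assumption about what counts as body text; no boilerplate removal is attempted.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect the title, meta description and visible body text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and not self._skip_depth:
            text = data.strip()
            if text:
                self.body_parts.append(text)


def extract(html):
    parser = PageText()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "body": " ".join(parser.body_parts),
    }
```

The resulting dictionary maps directly onto the title, description and body fields that a search index or text-analytics step would consume.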
  • 66. Semi-structured Extraction: Goal is finding structured data in random text. Can be applied broadly, since it’s not (very) format-specific. Accuracy suffers, because of breadth of input data formats. Beware the academic algorithm. Examples of what does work: easy patterns - telephone numbers, dates; microformats - hCalendar, hCard, hReview, ...; Natural Language Processing (NLP) - named entities.
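The "easy patterns" case can be sketched with two regular expressions. Both patterns are assumptions that cover only a narrow set of formats (US-style phone numbers and ISO `YYYY-MM-DD` dates); the sample text is made up, and broader coverage would need many more variants.

```python
import re

# US-style numbers such as (530) 555-0100 or 530-555-0199.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b")
# ISO-style dates such as 2012-02-28.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_phones(text):
    return PHONE.findall(text)


def find_dates(text):
    return ISO_DATE.findall(text)


sample = "Call (530) 555-0100 or 530-555-0199 before 2012-02-28."
phones = find_phones(sample)
dates = find_dates(sample)
```

Patterns like these are broad, because they run on any page text, and precise, because they target one data type, but they trade away accuracy: a product code that happens to look like a phone number will match too.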
  • 67. Structured Extraction: Precise extraction of specific types of data. Typically is to one area of one site. Often handled with XPath, and maybe regular expressions: //div[@id='<id of target div>']/p. div and span are beautiful tags: commonly used with CSS, which means they are (more) stable.
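The XPath pattern on this slide can be tried with Python's `xml.etree.ElementTree`, which supports a limited XPath subset. This sketch assumes the page has already been cleaned into well-formed XHTML without namespaces; the markup and the `reviews` id are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical cleaned page; ElementTree requires well-formed markup.
XHTML = """<html><body>
<div id="nav"><p>Home</p></div>
<div id="reviews"><p>Great hotel.</p><p>Friendly staff.</p></div>
</body></html>"""

root = ET.fromstring(XHTML)

# Same shape as the slide's //div[@id='<id of target div>']/p expression;
# ElementTree needs the leading "." to search relative to the root element.
paragraphs = [p.text for p in root.findall(".//div[@id='reviews']/p")]
```

Anchoring on a `div` id rather than on a position in the document is what makes the expression survive small layout changes, which is the stability point the slide makes.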
  • 68. How to figure out XPath
      Firebug is your friend
          Plug-in for Firefox
          Will show you the full XPath for each element
      Note that browsers will re-write HTML (e.g. adding a tbody element to tables)
      The DOM you see is often generated with Javascript
  • 69. XPath demo
      Firebug XPath tool
  • 70. Dealing with Javascript
      Required if page generates target content using JS
      Forces you to use Firebug or equivalent to inspect the DOM
      Options for processing include...
          HtmlUnit
          qt-webkit
          headless Mozilla
  • 71. Javascript challenges
      10x slower than just loading the page text
      Good way to make a webmaster angry
          Lots of extra load on server
          Can skew website stats
      Often has issues
          Pages that work in FF or IE but not HtmlUnit
          Pages that cause HtmlUnit to hang
  • 72. Q&A
      What are the three general approaches for data extraction?
      Why might you want to detect the language of a page?
  • 73. Web Crawling Lab
      photo by: Graham Racher, flickr
  • 74. Key Questions
      Were you able to build and run the code locally?
      Were you able to run a crawl in Elastic MapReduce?
      Were you able to improve the focused crawl?
  • 75. ImageFinder Details
      Find images about Ultimate Frisbee
      Focused crawl
          Fixed list of seed URLs
          Positive & negative terms used to score pages
      Extract images from page
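One way to score a page with positive and negative terms is sketched below. This is an illustrative formula, not the scoring used by the lab's ImageFinder code: the function name, the single-word lowercase terms, and the normalization by page length are all assumptions.

```python
import re
from collections import Counter


def score_page(text, positive_terms, negative_terms):
    """Score = (positive hits - negative hits) / total words; 0.0 for empty pages."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    positive_hits = sum(counts[term] for term in positive_terms)
    negative_hits = sum(counts[term] for term in negative_terms)
    return (positive_hits - negative_hits) / len(words)
```

A focused crawl would then prefer outlinks from higher-scoring pages, so tuning the two term lists is the main lever for improving the crawl.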
  • 76. Running ImageFinder
      Can run locally, with restrictions
          Only one fetcher thread
          Only 5 pages/loop
      Can run in Hadoop cluster
          Amazon Elastic MapReduce
          CrawlRunner uploads job jar, creates “Job Flow Step”
          Limited to 100 pages/loop, 2 loops
          Will take up to 10 minutes to run loops
  • 77. Running in Elastic MapReduce
      Watch your job via http://strata.scaleunlimited.com:9100/
          Your job name will include your username
      Results get added to searchable index
          http://strata.scaleunlimited.com/solr/strata/
          Search for student:<username> to find your results
  • 78. Solr Results Issue
      Note links vs. images
          These are “image” URLs that actually point to pages
      Double-bonus on exercise
          ...fix this problem :)
  • 79. Lab Details
      Code is in strata-web-mining folder you downloaded
      Details of code in strata-web-mining/doc/README-Description
      Instructions are in strata-web-mining/doc/README-Lab
      Please follow the lab steps carefully
          Missing a step will cause pain and suffering later
  • 80. Lab Exercises
      First goal is to build code and run locally
      Next is to build code and run in real cluster
      Then you get to try to optimize the focused crawl
      And (if you’re fast) try finding images for a different topic
  • 81. Trouble-shooting & Timing
      I’ll be walking around - raise your hand if you need help
      But with 100+ people, I’ll be talking fast :)
      We’ve got an hour (or more) before summary/Q&A
      Have fun...
  • 82. Q&A
      Were you able to build and run the code locally?
      Were you able to run a crawl in Elastic MapReduce?
      Were you able to improve the focused crawl?
  • 83. Summary & Q&A
      photo by: Graham Racher, flickr
  • 84. Key Challenges
      Complexity of large scale web crawling
      Challenges with extracting the right data
      Extra work to turn results into value
  • 85. Ethical Crawling
      Always have a real, valid, informative user agent name
      Always honor the robot exclusion protocol - robots.txt
      Limit your crawl rate - parallelism, crawl delay, pages/day
      Immediately comply with blacklisting and data removal requests
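Honoring robots.txt and a crawl delay can be sketched with Python's standard `urllib.robotparser`. The helper name `build_policy`, the user agent string in the usage note, and the 30-second fallback delay are assumptions for illustration, not values from the talk. The robots.txt body is passed in as text so the sketch does not depend on how the file was fetched.

```python
import urllib.robotparser


def build_policy(robots_txt, user_agent, default_delay=30.0):
    """Return (allowed, delay): a URL check and the seconds to wait between requests."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    # Use the site's Crawl-delay if it declares one, else the assumed default.
    declared = parser.crawl_delay(user_agent)
    delay = float(declared) if declared is not None else default_delay
    return allowed, delay
```

A fetcher would call `allowed(url)` before every request to a host and sleep for `delay` seconds between requests to that host, for example with `allowed, delay = build_policy(robots_text, "example-crawler")`.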
  • 86. Avoiding getting blocked
      Follow all ethical crawling guidelines
      Gradually ramp up your crawl rate
          Gives webmasters time to complain before it’s a serious problem
      Avoid Javascript if at all possible
      Don’t follow form links
      Grovel shamelessly
  • 87. Resources
      Hadoop - http://hadoop.apache.org
      Cascading - http://www.cascading.org
      Bixo - http://openbixo.org
      Web Data Mining by Bing Liu
  • 88. Question?
      I might have answers
      ken@scaleunlimited.com
      @kkrugler