Scraping Data from
Documents and the Web
Tommy Tavenner
National Wildlife Federation

Scraping data from the web and documents


Published on

From a talk given at the APRA Data Analytics Symposium, Las Vegas, NV, July 2014

Published in: Technology


  1. Scraping Data from Documents and the Web
     Tommy Tavenner, National Wildlife Federation
  2. What is it?
     © 2014 Tommy Tavenner
  3. What is Scraping?
     • Converting data from human-readable form into machine-readable form
     • This data is sometimes called ‘unstructured’, but it is really just not structured properly for systematic parsing
     • The data is often embedded in layers of formatting metadata, such as HTML or PDF formatting for font colors and tables
     • The job of the scraper is to separate the data from the formatting, in some cases even using the formatting to interpret the data
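To make "separating the data from the formatting" concrete, here is a minimal sketch using only Python's standard `html.parser` module. The `TextExtractor` class, the `strip_formatting` helper, and the sample table row are illustrative assumptions, not material from the talk.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text; tags and attributes never reach this method.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_formatting(html):
    """Return the pieces of text found in `html`, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical fragment: two values wrapped in presentation markup.
SAMPLE = '<tr><td><font color="red">Bald Eagle</font></td><td><b>2014</b></td></tr>'

print(strip_formatting(SAMPLE))  # ['Bald Eagle', '2014']
```

Here the formatting is simply discarded. A fuller scraper might instead use it, for example treating each `<td>` as a column boundary.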
  4. Is it Legal?
  5. Maybe!
  6. Is Scraping Legal?
     • It depends
     • Most publicly available data in the US falls within the sphere of copyright protection
       > Creativity in producing the source data
       > The manner in which the data is presented
       > Fair use on the web
     • What is the purpose of the scraping?
  7. Is Scraping Legal?
     • Terms of Service
       > Do they explicitly prohibit scraping?
       > Do they prohibit storing information privately?
  8. Is Scraping Legal?
     • Feist v. Rural Telephone (1991)
       > Feist, a phone book compiler in Kansas, copied the contents of Rural Telephone’s directory after Rural refused to license the information.
       > Rural sued Feist for copyright infringement. Because of the nature of the information, the case eventually made it to the Supreme Court.
       > The case centered on originality and whether compiling facts constitutes an original work.
       > The Court ruled that the phone directory did not constitute an original compilation because no discretion was exercised in deciding on its contents.
  9. Is Scraping Legal?
     • LinkedIn case (2014)
       > LinkedIn is suing a group of unknown defendants in California.
       > LinkedIn alleges that this group used a series of bots and fake profiles on the site to scrape content from other members’ profiles.
       > The case is based on the Digital Millennium Copyright Act.
  10. Jargon
      • Spider – Searches for links within content and follows them, building up a site map or web of content.
      • Crawler – Synonym for Spider
      • Training Data – As in supervised machine learning, training data is used to teach a spider how to interpret the content it will be processing.
      • IP Proxy/Switching – Regularly switching IP addresses to bypass the limits web servers set on the number of connections per client. May be a sign of less than legal or honorable intent in scraping.
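The spider idea (find links, follow them, build up a site map) can be sketched in a few lines. This is an illustration under stated assumptions rather than a production crawler: the three-page `SITE` dictionary stands in for a real website so the example runs without a network, and `LinkCollector` and `crawl` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical site held in memory: path -> HTML content.
SITE = {
    "/": '<a href="/about">About</a> <a href="/data">Data</a>',
    "/about": '<a href="/">Home</a>',
    "/data": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}


def crawl(start, fetch):
    """Breadth-first walk from `start`; returns {page: [links found on it]}."""
    site_map = {}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in site_map:
            continue  # already visited
        html = fetch(page)
        if html is None:
            continue  # page could not be loaded, skip it
        collector = LinkCollector()
        collector.feed(html)
        site_map[page] = collector.links
        queue.extend(link for link in collector.links if link not in site_map)
    return site_map


site_map = crawl("/", SITE.get)
for page, links in site_map.items():
    print(page, "->", links)
```

A real spider would replace `SITE.get` with a function that downloads each page, and would normally add a delay between requests and a limit on how far it wanders.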
  11. Anatomy of a Scraper
      Document Load
      • Pull in the complete web page, PDF, XML, etc.
      Parsing
      • Parse the HTML, XML, or PDF metadata into something the script can understand
      Extraction
      • Use the results of parsing to extract the data we are looking for
      Transformation
      • Convert the data into useful formats, e.g. currency, dates, etc.
  12. Anatomy of a Scraper: Document Load
      • Load the entire document or HTML page, generally as a string of characters
      • For larger documents this may involve splitting the document into multiple pages
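A minimal sketch of the document-load step, using the standard library's `urllib.request`. The helper names `load_document` and `split_pages` are hypothetical, and the form-feed page separator is an assumption: plain text extracted from PDFs often marks page breaks that way, but other sources will need a different rule.

```python
from pathlib import Path
from urllib.request import urlopen


def load_document(url, encoding="utf-8"):
    """Fetch the whole resource at `url` and return it as one string.

    Works for http(s):// URLs and for local file:// URLs.
    """
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def split_pages(text, separator="\f"):
    """Split a long document into pages, dropping empty ones.

    Assumes pages are separated by `separator` (form feed by default).
    """
    return [page for page in text.split(separator) if page.strip()]


# Splitting works on any string, so it can be tried without loading anything:
print(split_pages("page one\fpage two\f"))  # ['page one', 'page two']

# Loading a local file would look like this (hypothetical path):
# text = load_document(Path("report.txt").resolve().as_uri())
```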
  13. Anatomy of a Scraper: Parsing
      • Interpret the document to make searching possible
      • Biggest potential failure point
      • Specific to the source data
        > HTML: Document Object Model
        > PDF: Grid Model
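The sketch below shows parsing as a failure point. It uses the standard library's strict `xml.etree.ElementTree` parser as a stand-in; this is a swapped-in technique chosen to keep the example dependency-free, not the parser the talk recommends. The `parse_markup` helper and both sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse_markup(text):
    """Return the root element of the parsed tree, or None if parsing fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# Well-formed markup parses into a searchable tree.
GOOD = "<table><tr><td>Bald Eagle</td><td>2014</td></tr></table>"
# Unclosed <td> tags are common in real-world HTML and break a strict parser.
BAD = "<table><tr><td>Bald Eagle<td>2014</tr></table>"

root = parse_markup(GOOD)
print(root.tag, [cell.text for cell in root.iter("td")])
print(parse_markup(BAD))  # None
```

Real web pages are rarely this clean, which is why tolerant HTML parsers are usually preferred over a strict XML parser for scraping.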
  14. Anatomy of a Scraper: Extraction
      • Search parsed data for particular pieces of information, e.g. a file name, link, or table
      • Separate data into individual pieces for later processing
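As an illustration of extraction, the following sketch searches an already parsed tree for links, file names, and table rows. The sample page and the three `extract_*` helpers are hypothetical, and the page is deliberately well-formed so the standard library's `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so the strict XML parser accepts it.
PAGE = """
<html><body>
  <a href="reports/2013.pdf">2013 report</a>
  <a href="reports/2014.pdf">2014 report</a>
  <table>
    <tr><th>Species</th><th>Count</th></tr>
    <tr><td>Bald Eagle</td><td>12</td></tr>
    <tr><td>Osprey</td><td>7</td></tr>
  </table>
</body></html>
"""


def extract_links(root):
    """Return the href of every <a> element in document order."""
    return [a.get("href") for a in root.iter("a") if a.get("href")]


def extract_file_names(links):
    """Return the last path segment of each link."""
    return [link.rsplit("/", 1)[-1] for link in links]


def extract_rows(root):
    """Return each data row as a list of cell strings; header rows are skipped."""
    rows = []
    for tr in root.iter("tr"):
        cells = [(td.text or "").strip() for td in tr.findall("td")]
        if cells:
            rows.append(cells)
    return rows


root = ET.fromstring(PAGE.strip())
links = extract_links(root)
print(links)
print(extract_file_names(links))
print(extract_rows(root))
```

Every value is still a string at this point; converting `"12"` into a number belongs to the transformation step.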
  15. Anatomy of a Scraper: Transformation
      • Convert data into proper output
      • Apply standards
      • Change type, e.g. date string → date
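A small sketch of the transformation step: scraped values arrive as strings and are converted to proper types. The field names, the US-style `MM/DD/YYYY` date format, and the dollar formatting are assumptions made for the example; real sources will need their own rules.

```python
from datetime import date, datetime
from decimal import Decimal


def to_date(text, fmt="%m/%d/%Y"):
    """Convert a date string into a date object (assumes MM/DD/YYYY by default)."""
    return datetime.strptime(text.strip(), fmt).date()


def to_amount(text):
    """Convert a currency string such as '$1,250.00' into a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def to_count(text):
    """Convert a count with thousands separators, such as '1,204', into an int."""
    return int(text.strip().replace(",", ""))


# Hypothetical scraped record: every value is still a string.
record = {"gift_date": "07/24/2014", "amount": "$1,250.00", "count": "1,204"}

clean = {
    "gift_date": to_date(record["gift_date"]),
    "amount": to_amount(record["amount"]),
    "count": to_count(record["count"]),
}
print(clean)
```

`Decimal` is used for money rather than `float` so that amounts are not subject to binary rounding error.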
  16. Visual Scraping Tools
      • Require no programming knowledge
      • Primarily web-based
      • Allow quick access to data
      • Because they are not bespoke, they may require more scrubbing of the data after scraping
  17. ScraperWiki
      • Paid service with a very basic free plan
      • Focused on table extraction and Twitter data
      • Takes a single page or document as its source
  18. ScraperWiki
      • Allows you to quickly access the data or summarize it
      • Works well with PDFs of tables but struggles with mixed data
  19. Import.io
      • In early stages; currently free, with professional accounts
      • Downloadable Java app, multi-platform
      • Focused more on crawling sites to build up data sources
      • Offers limited training or refining abilities to make sure it extracts data correctly
      • Enables access to the data source either as a downloadable file or as an API
  20. Import.io
      • Data can be extracted either for a single page or for a full site
  21. Import.io
  22. Scrapinghub
      • Designed for much larger scraping jobs, including multi-site
  23. Scrapinghub
      • Sits somewhere between a visual scraper and a scraping library
      • Custom scrapers may be developed in Python and hosted by Scrapinghub
      • The autoscraper allows annotating pages and training the scraper
      • The crawler starts with a single page and works outward, following links on the pages it finds and quickly building large databases
  24. Scraping with a Scripting Language
      • Libraries are available in most languages
      • They primarily make it easier to understand a certain format, e.g. HTML or PDF
      • Require strong knowledge of the language
      • Require more fine-tuning but result in much higher-quality data
  25. R
      • scrapeR – for parsing HTML/XML
      • XML package – for parsing HTML/XML
      • tm – for parsing PDFs using Xpdf or Poppler engines
  26. Python
      • ScraperWiki
      • Scrapy
      • BeautifulSoup – for parsing HTML
      • XPath
      • PDFMiner – for parsing PDFs
  27. PHP
      • Simple HTML DOM
      • PDF Parser
  28. JavaScript
      • Node.js using Request and Cheerio
      • jsPDF
      • pdf2json
