Scraping talk public


  1. Getting data from the web for research. Andrew Whitby, 27 February 2014
  2. Web data projects I’ve worked on
     Table columns: A project…; Website; Data items; Scrape; API
     • Examining the global trade in music. Website: various websites incl. Wikipedia, MusicBrainz. Data items: 8 million chart entries, ~50k unique artists.
     • Analysing promotion techniques for artists in foreign markets. Website: a social network. Data items: 5k users with 2+ million user preferences (similar to ‘likes’).
     • Investigation of data skills. Website: university course database. Data items: 20,000 courses.
     • Modelling political orientation of various organisations*. Website: Twitter. Data items: 10ks of followers.
     * Not at Nesta
  3. Do you really need to scrape? Options, from easiest to hardest:
     • Bulk download: some sites make their data available as a download. Check!
     • Use an API: a programming interface designed to expose data directly.
     • Manually collect the data: for up to 100s of items, this can be quicker (intern, contract researcher?)
     • Contact the site owner: for smaller sites this can be surprisingly effective.
     • Scrape the website: do this as a last resort.
  4. Can it be scraped?
     • Structured or semi-structured = Scraping
     • Unstructured text = A different problem
  5. Scrapers
  6. Web 101
     • Clients (your browser) send requests to servers, e.g. using HyperText Transfer Protocol (HTTP)
     • Depending on the request, the server might return
       – A web page, in HTML
       – An image (e.g. a PNG or JPG)
       – Some data, as XML or JSON
       – Etc.
     • Scraping and APIs both use HTTP
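To make the request/response exchange on this slide concrete, here is a minimal sketch in Python that builds the text of an HTTP request and picks apart a canned response. The host `www.example.org`, the path `/people`, and the sample response are made up for illustration; in practice a library does this work for you.

```python
def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # which site on the server we want
        "Connection: close",
        "",                       # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ")[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# A made-up response; the Content-Type header says what kind of thing came back
# (HTML page, image, JSON data, and so on).
SAMPLE_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"name": "Albert"}'
)

request_text = build_get_request("www.example.org", "/people")
status, headers, body = parse_response(SAMPLE_RESPONSE)
```

The `Content-Type` header is how a client tells an HTML page from an image or JSON data, which is why scraping and APIs can share the same protocol.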
  7. So how does web scraping work?
     • In the (good) old days web pages were very simple, handcrafted, marked-up text
     • Now most are automatically generated from databases of content according to templates, so they naturally have a repetitive structure
     • Scraping exploits the regularities of this (semi-)structure to extract data using text-manipulation algorithms
  8. Scraping example: Nesta People
     • Ordinary URL that you would browse to
     • Extraneous information, formatting, etc.
     • The data you actually want: either as a table or list here, or possibly as a link to a page per item
     • Pagination, e.g. <<First <Prev 1 2 3 Next> Last>>
  9. Scraping example: under the bonnet
  10. Scraping example: under the bonnet
      Callouts on the entries for Adam and Albert: start of an entry; photo link; end of an entry; link to Albert’s main page; name text
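The callouts on this slide (start of an entry, photo link, page link, name text, end of an entry) map directly onto parser events. Below is a minimal sketch using Python's standard `html.parser`. The markup in `SAMPLE_HTML` is invented for illustration and is not the real Nesta page; the class name `person` and the URL paths are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical markup with the repetitive, template-generated structure
# that scraping relies on.
SAMPLE_HTML = """
<ul>
  <li class="person">
    <img src="/photos/adam.jpg">
    <a href="/people/adam">Adam</a>
  </li>
  <li class="person">
    <img src="/photos/albert.jpg">
    <a href="/people/albert">Albert</a>
  </li>
</ul>
"""


class PeopleParser(HTMLParser):
    """Collect one dict per <li class="person"> entry."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._current = None   # the entry being built, if any
        self._in_link = False  # True while inside the entry's <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "person":
            # Start of an entry
            self._current = {"name": "", "link": None, "photo": None}
        elif self._current is not None and tag == "img":
            # Photo link
            self._current["photo"] = attrs.get("src")
        elif self._current is not None and tag == "a":
            # Link to the person's main page
            self._current["link"] = attrs.get("href")
            self._in_link = True

    def handle_data(self, data):
        if self._current is not None and self._in_link:
            # Name text
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            # End of an entry
            self.people.append(self._current)
            self._current = None


def extract_people(html):
    parser = PeopleParser()
    parser.feed(html)
    parser.close()
    return parser.people


people = extract_people(SAMPLE_HTML)
```

Libraries such as BeautifulSoup, mentioned later in the talk, make this shorter, but the idea is the same: find the markers that repeat for every entry and read the fields relative to them.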
  11. Scraping: legal considerations
      • Jurisdiction issues
      • Laws that have been relied upon
        – Contract: terms of service
        – Copyright law
        – EU Database Directive (research exemption?)
        – US Computer Fraud & Abuse Act
        – US Digital Millennium Copyright Act
      • Case law
        – Unsettled: conflicting decisions
      Bottom line: this is a grey area and not without legal risk. (Also: I’m not a lawyer, this is not legal advice.)
  12. Scraping: ethical considerations
      • Remember, the site wasn’t designed for this purpose: be sympathetic to the site owner
      • Avoid putting an unreasonable burden on the site
        – Some run on massive datacentres, others on a single machine
        – Rule of thumb: don’t scrape multiple items in parallel
      • Ask permission if you can
        – But be realistic, and remember a lot of web traffic is scraping (Google, Bing, etc.)
      • Observe robots.txt
        – But this is (probably) not legally binding either way
      This is before even thinking about privacy (if user data is involved)
  13. Scraping courtesy: robots.txt. If this file exists it will be at
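Checking robots.txt can be automated. The sketch below uses Python's standard `urllib.robotparser` on a made-up rules file; the user agent name `research-bot`, the `www.example.org` URLs, and the rules themselves are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is asked to stay out of /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS_TXT)
may_fetch_people = rules.can_fetch("research-bot", "https://www.example.org/people")
may_fetch_private = rules.can_fetch("research-bot", "https://www.example.org/private/report")
```

Against a live site, `RobotFileParser` can also download the file itself via `set_url(...)` followed by `read()`; the sketch parses a string so that it runs without network access.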
  14. Scraping: practical issues. Sites may reject connections, or challenge your humanity with CAPTCHAs
  15. Getting around limits
      The simple options
        – Slow down requests, introduce random delays
        – Use ‘user agent’ to pretend to be human
      The serious option
        – Tor (“the onion router”)
        – Anonymises your network location
        – Ethical consideration though
          • Tor is a fragile community with better uses
      If these don’t work, give up. If they’ve gone to this much trouble to prevent scraping, they’re more likely to get upset and possibly take action against you.
      These aren’t the droids you’re looking for
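The two simple options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended configuration: the `USER_AGENT` string, the 2 to 5 second delay range, and the helper names are all hypothetical. Pages are fetched one at a time, in line with the earlier rule of thumb about not scraping in parallel.

```python
import random
import time
from urllib.request import Request

# Hypothetical user agent string, for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; example-research-script)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_slowly(urls, fetch, min_delay=2.0, max_delay=5.0,
                 sleep=time.sleep, rng=random):
    """Fetch URLs one at a time, pausing a random interval between requests.

    `fetch` is any callable that takes a Request and returns the page;
    passing it in keeps this sketch runnable without network access.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests are slow and not evenly spaced
            sleep(rng.uniform(min_delay, max_delay))
        results.append(fetch(build_request(url)))
    return results
```

For real use, `fetch` could be `urllib.request.urlopen`, which accepts a `Request` object.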
  16. APIs (Application Programming Interfaces)
  17. How do APIs work?
      • Way of extracting structured data from a web site or service
        – A service intentionally made available by the data owner
      • Just a set of rules for communicating / exchanging data
        – Request is usually made as a specially-constructed web address
        – Response is usually encoded as JSON or XML
      • You can access an API:
        – directly in your browser (good for testing)
        – using a tool like curl
        – by programming it directly
        – by using a ‘wrapper’ in your language of choice (Python, Ruby, Java, etc.)
  18. An API is a set of rules
  19. API example: Companies House
      • Specially constructed URL (‘request’)
      • Structured, unformatted data returned (‘response’)
      A RESTful request using HTTP with data returned in JSON format
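The request/response pattern on this slide can be sketched as follows. The endpoint `https://api.example.org/companies`, its `q` and `page` parameters, and the sample JSON are all hypothetical; they are not the actual Companies House API, whose URL format and fields should be taken from its own documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical API endpoint, for illustration only.
BASE_URL = "https://api.example.org/companies"


def build_request_url(base_url, **params):
    """Build the 'specially constructed URL': base address plus encoded parameters."""
    return f"{base_url}?{urlencode(params)}"


def parse_response(body):
    """Decode a JSON response body into Python dicts and lists."""
    return json.loads(body)


# A made-up response body in JSON format.
SAMPLE_BODY = '{"items": [{"name": "EXAMPLE WIDGETS LTD", "status": "active"}]}'

url = build_request_url(BASE_URL, q="example widgets", page=1)
data = parse_response(SAMPLE_BODY)
names = [item["name"] for item in data["items"]]
```

Because the response is already structured, pulling out a field is a dictionary lookup rather than the pattern matching that scraping needs.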
  20. API example: Companies House
      • Formatted, human-friendly page returned
      The same data rendered in a human-friendly web format.
  21. APIs: legal issues
      • Situation is simpler/safer than scraping
      • Publishing an API means a data provider is encouraging use, and explicitly controlling the amount of data you can collect
      • With an API you are more likely to have to expressly agree to something (“clickwrap”); with a paid API you’ll have a formal contract
  22. APIs: ethical issues
      • As with scraping, avoid putting an unreasonable burden on the site
      • But often API owners will be explicit about what a reasonable burden is
        – This may be voluntary
        – Or enforced via a ‘rate limit’
      • Easier for API owner to enforce, so responsibility is shifted somewhat
  23. APIs: practical issues
      • APIs will often be ‘rate limited’: that is, a limit is imposed on how many requests you can make per minute/hour
      • This can increase the elapsed time it takes to collect large quantities of information
        – But often free registration will increase your rate limit
        – And paid accounts may increase it further
        – Don’t try to work around this any other way
      • APIs may not provide all the same fields web users see; they are often designed for third-party apps rather than research
        – In which case, scraping may be an option
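Staying inside a rate limit is easiest if the client paces itself. The sketch below spaces calls evenly; the class name `RateLimiter` and the figure of 30 requests per 60 seconds in the usage note are assumptions for illustration, and a real API's documented limit should be used instead.

```python
import time


class RateLimiter:
    """Allow at most one call every period_seconds / max_requests seconds."""

    def __init__(self, max_requests, period_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Usage would be to create one limiter, for example `RateLimiter(max_requests=30, period_seconds=60)`, and call `limiter.wait()` immediately before each API request. This also makes the elapsed-time cost visible: at one request every 2 seconds, 100,000 requests take roughly 56 hours.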
  24. DIY web data access
      Tools for scraping and API access, by how much code they need:
      • Point and click: Yahoo Dapper, Yahoo Pipes, various browser extensions (e.g. Chrome Scraper), Kimono?, ScraperWiki (Twitter)
      • Some code: ScraperWiki (scraping), ScraperWiki (API access)
      • Lots of code: Scrapy, BeautifulSoup, your language of choice (Python+Requests is good)
      Also see this list of non-code scraping things to try courtesy of a pair of US journalists: here
  25. Contracted web data access
      • How much:
        – e.g. ScraperWiki: $3-10k upfront, $200-500 per month
      • Think about
        – How will you receive/analyse the data?
        – What is the time period of interest?
        – Is it a well-known API (e.g. Twitter) or something exotic (e.g. Douban)?
      Case study
      • Data: Twitter public API (~800 users, 1m tweets over Jan-Oct 2012, plus network snapshots at 3 times)
      • Cost: £10-15k
      • Time: months (limitations of API history + rate limits)
      Issues
      • Lack of transparency/documentation about data processing decisions (what’s in, what’s out) in getting from complex to flat data structures
      • Need for iteration, constant communication
      • Data collection skills may not coexist with report-writing skills
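The step from complex to flat data structures mentioned under Issues is where many of those undocumented decisions get made. A minimal sketch of one way to flatten nested API records into CSV rows follows; the record in `SAMPLE`, its field names, and the choices made here (dotted column names, lists joined with `;`) are assumptions for illustration, not the method used in the case study.

```python
import csv
import io


def flatten(record, prefix=""):
    """Turn a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse, prefixing child keys with the parent key
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # A processing decision: lists are squashed into one text cell
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Flatten each record and write them all as CSV text."""
    rows = [flatten(record) for record in records]
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up record, loosely shaped like a social media post.
SAMPLE = {
    "id": 1,
    "text": "hello",
    "user": {"name": "albert", "followers": 10},
    "hashtags": ["data", "web"],
}

flat = flatten(SAMPLE)
csv_text = to_csv([SAMPLE])
```

Each choice here (joining lists, which fields to keep) changes what the analyst eventually sees, which is why it is worth asking a contractor to document them.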
  26. Summary
      1. Consider your non-scraping options
      2. A legally grey area: be aware of this
      3. If you scrape, scrape ethically
      4. Scraping starts simply, but can get complicated
      5. Life is easier with open APIs
  27. Glossary
      • API key: a secret string that you use to identify yourself to an API
      • CAPTCHA: Completely Automated Public Turing Test to tell Computers and Humans Apart
      • HTML (HyperText Markup Language): the language in which web pages are constructed
      • HTTP (HyperText Transfer Protocol): the communications protocol that is used to transfer web pages from the server to your browser. APIs use this too
      • JSON: a very simple data format based on the JavaScript language, that is quite readable to humans too
      • Rate limit: a limit on how frequently you can make requests to the API
      • REST: a popular semantic approach to using HTTP for APIs
      • XML: a more complex data format that predates JSON