Onlineinfo2012 - Scraping


Is open data disruptive to data vendors/verticals in the information industry?
How can scrapers turn data published as information on the web or in PDFs back into structured data?
What business models or publications are built from scraped data?

Published in: Business

  • Tony Hirst
    Twitter: @psychemedia
    Blog: http://blog.ouseful.info
    Presentation prepared for: Online Info, 12/11/2012

    DATA LIBERATION: OPENING UP DATA BY HOOK OR BY CROOK - DATA SCRAPING, LINKAGE AND THE VALUE OF A GOOD IDENTIFIER

    The 1/9/90 rule is often used to characterise the way in which a small number of creators generate content that a larger number (but still small percentage in the greater scheme of things) comment on or amplify, whilst the majority just passively consume. In this presentation, I will explore the extent to which a similar view applies to the world of "data liberation". After reviewing the idea of data scraping, and some of the techniques surrounding it, I will describe how online tools such as Scraperwiki provide a platform for concentrating data scraping activity and expertise, as well as supporting the publication of data /as data/ in a variety of formats, in addition to 'end user' views in the form of graphical charts and interactive visualisations.

    One of the major motivations for data scraping is the aggregation of data from a variety of data sources into a larger, integrated whole. For example, the aggregation of research council funding data from separate research councils allows us to view a large proportion of the publicly funded research grants received by a single institution; or the collection of local council spending data across all UK councils allows us to see how councils spend money with each other across a range of transaction areas. But how do we actually create such aggregations when the data is sourced from different areas? In order to do this, we need to know when different datasets are actually talking about the same thing, which is where common identifiers come in. For it is surely the case that when we have common identifiers, we can have linkage, and as a result start to realise some of the benefits of Linked Data (as well as developing a wider appreciation of what those benefits might actually be...)

    (As an aside, I'll describe how we might go about deriving such identifiers when they are missing from a data set that might otherwise, or more conveniently, be expected to publish them.)

    Throughout the presentation, I will draw on practical examples of how aggregated "liberated" data has been used as the basis of wider interest, and even status quo disrupting, services, as well as reflecting on what other sources of data we might see the data liberators turning their attention to next...

    Key learning points:
    1 - What is "data scraping", how can I do it and is my website at risk of it?
    2 - Why the secret to understanding "Linked Data" is the very idea of it, not just (or not even) the technology.
    3 - How has data scraping been used to "open up" data in actual practice?
  • The focus of this presentation is not the release of “information”, but the release of data in raw form so that it can be interpreted and presented in informative ways by other parties.
  • The London Datastore is an early example of a council-centric open data website. Early signs suggest it is natural to locate data websites at addresses of the form or
  • Google Spreadsheets provides another example of how CSV can be used to help data flow. The =importData formula allows a user to specify a source data URL and pull the CSV data found at that location into the spreadsheet. Unlike Many Eyes Wikified, if the source data at the URL is updated, the update will (eventually) be pulled into the spreadsheet automatically.
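
For readers working outside a spreadsheet, the same "pull CSV from a URL" idea can be sketched in a few lines of Python. This is only an analogue of =importData, not how Google implements it; the function name `import_data` is made up for illustration, and the sketch assumes the data is UTF-8 encoded.

```python
import csv
import io
from urllib.request import urlopen


def import_data(csv_url):
    """Fetch the CSV found at csv_url and return it as a list of rows.

    Hypothetical helper: a rough stand-in for the spreadsheet formula.
    Assumes the response body is UTF-8 text.
    """
    with urlopen(csv_url) as response:
        text = response.read().decode("utf-8")
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Calling `import_data` again re-fetches the URL, which is the manual equivalent of the spreadsheet's automatic refresh.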
  • One of the really good reasons for getting data into a data processing environment such as a spreadsheet is that you can start to work it. In the case of Google Spreadsheets, the spreadsheet environment can also be used as a database environment. That is, we can treat one or more data containing sheets in a spreadsheet as a database, and generate new views over the data, as well as running queries over that data.
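
To make the "sheet as database table" idea concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the spreadsheet. The table name, column names and rows are invented for illustration, and SQLite is a substitute here, not what Google Spreadsheets uses.

```python
import sqlite3

# Invented sample rows standing in for one sheet of data.
rows = [
    ("Anytown", "Transport", 1200.0),
    ("Anytown", "Housing", 300.0),
    ("Otherville", "Transport", 800.0),
]

# Treat the sheet as a database table...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (council TEXT, area TEXT, amount REAL)")
conn.executemany("INSERT INTO spend VALUES (?, ?, ?)", rows)

# ...and generate a new view over it by running a query.
totals = conn.execute(
    "SELECT council, SUM(amount) FROM spend GROUP BY council ORDER BY council"
).fetchall()
```

The query result is a new, summarised view of the data (total amount per council) that did not exist as a sheet in its own right.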
  • Another way of using a Google Spreadsheet as a database is via the Google Spreadsheets API. The Google Visualisation API provides a way of passing queries, written in its query language, from an arbitrary web page or web application and receiving the resulting data in a standard JSON-based format, which also happens to play nicely with the Google Visualisation API. The Guardian Datastore Explorer is a crude demonstration, from around 2009, of how data from the Guardian Datastore (data that is stored across a range of Google spreadsheets) can be explored, queried and visualised via these APIs. Users can select a dataset from a drop-down menu, fed from a delicious account to which various datastore spreadsheets have been bookmarked using a particular set of tags, or paste in the URL of an arbitrary (public) Google spreadsheet. The first row/headings of the data can then be previewed (a simple spreadsheet is assumed, in which column headings appear in the first row of the spreadsheet).
  • A series of list boxes are then populated with the column labels and their names, and provide a certain amount of help for the creation of a query over the spreadsheet data. A range of output formats can also be selected, from simple HTML data tables to a range of charts. URLs are also generated for HTML and CSV representations of the data returned from the query.
  • One of the nice things about the data table widget (a standard Google Visualisation API component in this case, though similar examples exist for YUI, the Yahoo User Interface Libraries, or frameworks such as JQuery) is that it supports things like row sorting by column (for free – no programming required!), allowing even further manipulation of the data, albeit at a simplistic level. (It’s probably worth pointing out here that it may be worth providing a preview of the column headings and first few rows, or a sample of random rows, of data when datasets are published, just so that users can see what sort of data is on offer without having to download the whole data set.)
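
As a rough illustration of passing a query to a spreadsheet, the sketch below only builds a query URL; it makes no network request. The URL pattern is an assumption based on the endpoint commonly used with present-day Google Sheets, and may differ from what was available when the talk was given. `gviz_query_url` and `SPREADSHEET_KEY` are placeholder names.

```python
from urllib.parse import urlencode


def gviz_query_url(spreadsheet_key, query, out="csv"):
    """Build a URL that asks a public spreadsheet to run a query.

    Assumption: the /gviz/tq endpoint pattern of current Google Sheets.
    `tq` carries the query text; `tqx=out:...` selects the output format.
    """
    params = urlencode({"tq": query, "tqx": f"out:{out}"})
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_key}/gviz/tq?{params}"
    )


# Columns are addressed by letter in the query language.
url = gviz_query_url(
    "SPREADSHEET_KEY", "select A, B where C > 100 order by B desc"
)
```

The query text reads much like SQL, which is what lets the spreadsheet behave as a queryable data source for other pages and applications.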
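
The preview suggestion is cheap to implement. A minimal sketch, assuming the dataset is CSV with headings in the first row; the `preview` helper and the sample data are invented for illustration.

```python
import csv
import io
from itertools import islice


def preview(csv_text, n=3):
    """Return the column headings and the first n data rows of a CSV text."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, [])  # empty list if the text has no rows at all
    return header, list(islice(reader, n))


# Invented sample data.
SAMPLE = (
    "council,area,amount\n"
    "Anytown,Transport,1200\n"
    "Anytown,Housing,300\n"
    "Otherville,Transport,800\n"
)

header, first_rows = preview(SAMPLE, n=2)
```

Because the reader is consumed lazily, only the first few rows are parsed, so the same approach works when streaming a large file.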
  • If you’re in the business of selling information as data, you are under threat where that information is published in an openly licensed way.
  • Linked Data (the TM is something of a joke) refers to a particular style of publishing data according to a set of principles first outlined by the inventor of the World Wide Web, Sir Tim Berners-Lee. It is one of the data formats that the Government’s data task force favours for the publication of data.
  • There is a problem, though: at the moment, there are barriers to entry to the Linked Data world from both the query side (not many people speak SPARQL, or know how to construct a SPARQL query to an endpoint) and the results side (data is returned as RDF).
  • So – do you speak SPARQL?
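
For those who do not, the sketch below shows roughly what both sides of the barrier look like: a small SPARQL query, and a helper that flattens a result document into plain rows. It assumes the endpoint can return SELECT results in the W3C SPARQL JSON results format, which many endpoints can do as an alternative to RDF. The query, the example.org URIs and the response are all invented for illustration, and nothing is sent over the network.

```python
import json

# An illustrative query: ask for things and their labels.
QUERY = """
SELECT ?council ?label WHERE {
  ?council <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 2
"""

# Hand-written example of a SPARQL JSON results document (invented data).
RESPONSE = """
{
  "head": {"vars": ["council", "label"]},
  "results": {"bindings": [
    {"council": {"type": "uri", "value": "http://example.org/id/council/1"},
     "label": {"type": "literal", "value": "Anytown Council"}},
    {"council": {"type": "uri", "value": "http://example.org/id/council/2"},
     "label": {"type": "literal", "value": "Otherville Council"}}
  ]}
}
"""


def bindings_to_rows(sparql_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(sparql_json)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


rows = bindings_to_rows(RESPONSE)
```

Once flattened like this, the results are ordinary rows that can be written to CSV or loaded into a spreadsheet.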

    1. DATA LIBERATION: Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier. Tony Hirst, Department of Communication and Systems, The Open University
    2. data NOT information (by Vick)
    3. [Disruptive Innovation?]
    4. “First” generation: data catalogues
    5. Breathing life into data…
    6. =importData(“CSV_URL”)
    7. the spreadsheet becomes A DATABASE
    8. “Second” generation: data management systems
    9. There’s lots more data that’s locked up in web pages…
    10. Scraping…
    11. “grabbing web content in a machine readable format and then processing it for your own purposes”
    12. [Diagram labels: Original HTML web page; Extract Information; Accessible web page -> data]
    13. Recreating the database that was used to populate a (templated) page
    14. …quick’n’dirty
    15. [Diagram labels: Scrapers; SQLite database; Views]
    16. Sometimes the data is spread across different files…
    17. Row based aggregation
    18. Sometimes the data is spread across different websites…
    19. … Normalisation…
    20. Data Enrichment
    21. Column Additions/Annotations
    22. Sometimes the data is split across different files…
    23. Column based merge
    24. -> Data cleansing
    25. Clustering…
    26. Martin Hawksey/@mhawksey
    27. “Finessing” a common identifier
    28. Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column
    29. Book Title -> ISBN
    30. I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc etc
    31. Reconciliation…
    32. Linked Data™
    33. So who speaks SPARQL? (Diners - Journal Canteen by avlxyz)
    34. You DON’T have to….
    35. Just think about how one piece of data might be related to another through a common means of addressing them…
    36. @psychemedia
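
Several of the slide themes (extracting data from a templated HTML page, "finessing" a common identifier, and a column-based merge on a common key) can be strung together in one short sketch. This is an illustration under stated assumptions, not the approach used in the talk or by Scraperwiki: the page, the datasets, and the helper names `scrape_table`, `normalise` and `merge_on_key` are all invented, and the normalisation rule is deliberately crude.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn an HTML table (first row = headings) back into a list of records."""
    scraper = TableScraper()
    scraper.feed(html)
    header, *body = scraper.rows
    return [dict(zip(header, row)) for row in body]


def normalise(name):
    """Crude identifier 'finessing': lowercase, drop punctuation, squeeze spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def merge_on_key(left, right, key):
    """Column-based merge: add columns from `right` to matching rows of `left`."""
    lookup = {normalise(row[key]): row for row in right}
    merged = []
    for row in left:
        match = lookup.get(normalise(row[key]), {})
        merged.append({**match, **row})  # values from `left` win on clashes
    return merged


# Invented sample data, for illustration only.
PAGE = """
<table>
  <tr><th>Council</th><th>Spend</th></tr>
  <tr><td>Anytown Council</td><td>1200</td></tr>
  <tr><td>Otherville Council</td><td>800</td></tr>
</table>
"""
REGIONS = [
    {"Council": "ANYTOWN  COUNCIL.", "Region": "North"},
    {"Council": "otherville council", "Region": "South"},
]

rows = scrape_table(PAGE)
combined = merge_on_key(rows, REGIONS, "Council")
```

The two datasets spell the council names differently, so the join only works because both sides are passed through `normalise` first. A published common identifier would make that step unnecessary.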