Onlineinfo2012 - Scraping


Is open data disruptive to data vendors/verticals in the information industry? How can scrapers turn data published as information on the web or in PDFs back into structured data? What business models or publications are built from scraped data?


DATA LIBERATION
Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier
Tony Hirst
Department of Communication and Systems
The Open University

“Second” generation:
data management
systems

There’s lots more
data that’s locked
up in web pages…

“grabbing web content
in a machine readable
format and then
processing it for your
own purposes”

[Diagram: Original HTML web page / Extract Information / Accessible web page -> data]

Recreating the
database that was used
to populate a
(templated) page
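
As a minimal sketch of what recreating that database might look like, the Python below walks a templated HTML table and turns each row back into a record. The page fragment, the column names and the `TableScraper` / `scrape_table` names are all invented for illustration; only the standard library `html.parser` module is assumed, and real pages will usually need rather more care than this.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the cells of the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical templated page fragment (invented for illustration).
SAMPLE_PAGE = """
<table>
  <tr><th>Council</th><th>Supplier</th><th>Amount</th></tr>
  <tr><td>Council A</td><td>Supplier X</td><td>1200.50</td></tr>
  <tr><td>Council B</td><td>Supplier Y</td><td>310.00</td></tr>
</table>
"""

records = scrape_table(SAMPLE_PAGE)
```

Each record comes back as a dictionary keyed by column heading, which is roughly the shape the data presumably had before the template turned it into a page.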

[Diagram labels: Scrapers, Scraper, SQLite database, Views]

Sometimes the
data is spread
across different
files…

Sometimes the
data is spread
across different
websites…

Sometimes the
data is split
across different
files…

http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey

Common identifiers
(common KEYS) make
it MUCH easier to JOIN
datasets by column

I am “psychemedia”
on Twitter, delicious,
slideshare, flickr, etc
etc

So who speaks SPARQL?

Diners - Journal Canteen
by avlxyz

Just think about how one piece of
data might be related to another
through a common means of
addressing them…


1. DATA LIBERATION Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier Tony Hirst Department of Communication and Systems The Open University

2. data NOT information by Vick

3. [Disruptive Innovation?]

5. “First” generation: data catalogues

6. Breathing life into data…

7. =importData("CSV_URL")

8. the spreadsheet becomes A DATABASE


12. “Second” generation: data management systems


15. There’s lots more data that’s locked up in web pages…

16. Scraping…


18. “grabbing web content in a machine readable format and then processing it for your own purposes”


22. [Diagram: Original HTML web page / Extract Information / Accessible web page -> data]

23. Recreating the database that was used to populate a (templated) page


30. …quick’n’dirty


39. [Diagram labels: Scrapers, Scraper, SQLite database, Views]


43. Sometimes the data is spread across different files…


45. Row based aggregation

46. Sometimes the data is spread across different websites…

47. … Normalisation…


49. Data Enrichment

50. Column Additions/Annotations


52. Sometimes the data is split across different files…

53. Column based merge


55. -> Data cleansing

56. Clustering…

57. http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/ /via Martin Hawksey/@mhawksey


60. “Finessing” a common identifier

61. Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column

62. Book Title -> ISBN

63. I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc etc


65. Reconciliation…


71. Linked Data™


73. So who speaks SPARQL? Diners - Journal Canteen by avlxyz

74. You DON’T have to….

75. Just think about how one piece of data might be related to another through a common means of addressing them…

76. http://ouseful.info @psychemedia

Editor's Notes

Tony Hirst
Twitter: @psychemedia
Blog: http://blog.ouseful.info
Presentation prepared for: Online Info 12/11/2012

DATA LIBERATION: OPENING UP DATA BY HOOK OR BY CROOK - DATA SCRAPING, LINKAGE AND THE VALUE OF A GOOD IDENTIFIER

The 1/9/90 rule is often used to characterise the way in which a small number of creators generate content that a larger number (but still small percentage in the greater scheme of things) comment on or amplify, whilst the majority just passively consume. In this presentation, I will explore the extent to which a similar view applies to the world of "data liberation". After reviewing the idea of data scraping, and some of the techniques surrounding it, I will describe how online tools such as Scraperwiki provide a platform for concentrating data scraping activity and expertise, as well as supporting the publication of data /as data/ in a variety of formats, in addition to 'end user' views in the form of graphical charts and interactive visualisations.

One of the major motivations for data scraping is the aggregation of data from a variety of data sources into a larger, integrated whole. For example, the aggregation of research council funding data from separate research councils allows us to view a large proportion of the publicly funded research grants received by a single institution; or the collection of local council spending data across all UK councils allows us to see how councils spend money with each other across a range of transaction areas. But how do we actually create such aggregations when the data is sourced from different areas? In order to do this, we need to know when different datasets are actually talking about the same thing, which is where common identifiers come in. For it is surely the case that when we have common identifiers, we can have linkage, and as a result start to realise some of the benefits of Linked Data (as well as developing a wider appreciation of what those benefits might actually be...)

(As an aside, I'll describe how we might go about deriving such identifiers when they are missing from a data set that might otherwise, or more conveniently, be expected to publish them.)

Throughout the presentation, I will draw on practical examples of how aggregated "liberated" data has been used as the basis of wider interest, and even status quo disrupting, services, as well as reflecting on what other sources of data we might see the data liberators turning their attention to next...

Key learning points:
1 - What is "data scraping", how can I do it and is my website at risk of it?
2 - Why the secret to understanding "Linked Data" is the very idea of it, not just (or not even) the technology.
3 - How has data scraping been used to "open up" data in actual practice?
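
To make the aggregation and linkage idea a little more concrete, here is a rough Python sketch: records from two sources are stacked row-wise, then linked to a third dataset through a shared identifier that has first been "finessed" into a consistent form. The funders, institutions, `org_id` values and function names are all hypothetical, and the normalisation shown (lower-casing and collapsing whitespace) is only one crude assumption about how a common key might be derived.

```python
def normalise_key(name):
    """Crude identifier finessing: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def aggregate_rows(*sources):
    """Row-based aggregation: stack the records from several sources."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined


def join_on_key(left, right, key):
    """Column-based merge: add the columns of `right` to matching `left` rows.

    Rows match when their normalised `key` values are equal. Left rows
    without a match are kept unchanged; left values win on any clash.
    """
    lookup = {normalise_key(row[key]): row for row in right}
    merged = []
    for row in left:
        match = lookup.get(normalise_key(row[key]), {})
        merged.append({**match, **row})
    return merged


# Hypothetical grant records from two separate funders.
funder_a = [{"institution": "Example University", "grant_value": 100}]
funder_b = [
    {"institution": "example  university ", "grant_value": 250},
    {"institution": "Sample College", "grant_value": 40},
]
# Hypothetical register that publishes an identifier for each institution.
register = [{"institution": "EXAMPLE UNIVERSITY", "org_id": "ORG-001"}]

grants = aggregate_rows(funder_a, funder_b)
linked = join_on_key(grants, register, "institution")
total = sum(r["grant_value"] for r in linked if r.get("org_id") == "ORG-001")
```

Once both funders' rows carry the same `org_id`, totalling the grants received by a single institution becomes a simple filter, which is the practical payoff of a good identifier.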
The focus of this presentation is not the release of “information”, but the release of data in raw form so that it can be interpreted and presented in informative ways by other parties.
The London Datastore is an early example of a council-centric open data website. Early signs suggest it is natural to locate data websites at addresses of the form data.COUNCILNAME.gov.uk or www.COUNCILNAME.gov.uk/data
Another example of how CSV can be used to help data flow is provided by Google Spreadsheets. The =importData formula allows a user to specify a source data URL, and pull the CSV data found at that location into the spreadsheet. Unlike Many Eyes Wikified, if the source data at the URL is updated, the update will (eventually) be pulled into the spreadsheet automatically.
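
The same pull-a-CSV-from-a-URL pattern can be sketched outside a spreadsheet. The Python below is a loose analogue of =importData rather than a description of how Google implements it: `import_data` and `parse_csv` are names made up here, the sample data is invented, and UTF-8 encoding is assumed.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def import_data(csv_url):
    """Fetch the CSV published at `csv_url` and parse it (assumes UTF-8)."""
    with urlopen(csv_url) as response:
        text = response.read().decode("utf-8")
    return parse_csv(text)


# Invented sample standing in for the contents of a published CSV file.
SAMPLE_CSV = "council,amount\nCouncil A,1200.50\nCouncil B,310.00\n"
rows = parse_csv(SAMPLE_CSV)
```

Calling `import_data` again at a later date would pick up whatever the publisher has changed in the meantime, which is the "data flow" property the formula provides.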
One of the really good reasons for getting data into a data processing environment such as a spreadsheet is that you can start to work it. In the case of Google Spreadsheets, the spreadsheet environment can also be used as a database environment. That is, we can treat one or more data-containing sheets in a spreadsheet as a database, and generate new views over the data, as well as running queries over that data.
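
As an illustration of treating a sheet of rows as a database, the sketch below loads some rows into an in-memory SQLite table and runs a query over them. It uses Python's standard `sqlite3` module rather than anything spreadsheet-specific; the table name, column names and figures are all invented.

```python
import sqlite3

# Hypothetical rows standing in for one sheet of a spreadsheet.
SHEET_ROWS = [
    ("Council A", "Supplier X", 100.0),
    ("Council A", "Supplier Y", 50.0),
    ("Council B", "Supplier X", 25.0),
]


def load_sheet(rows):
    """Create an in-memory database holding the sheet as a single table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spending (council TEXT, supplier TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_by_council(conn):
    """A new 'view' over the data: total spend for each council."""
    cursor = conn.execute(
        "SELECT council, SUM(amount) FROM spending "
        "GROUP BY council ORDER BY council"
    )
    return cursor.fetchall()


conn = load_sheet(SHEET_ROWS)
totals = total_by_council(conn)
```

The query result is itself a small table, so it can be charted, exported or queried again in the same way as the original sheet.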
Another way of using a Google Spreadsheet as a database is via the Google Spreadsheets API. The Google Visualisation API (?) provides a way of passing queries written using the Google ???viz query language from an arbitrary web page or web application, and receiving the resulting data in a standard JSON-based format, which also happens to play nicely with the Google Visualisation API???
The Guardian Datastore explorer is a crude demonstration from 2009(??) showing how data from the Guardian datastore, which is stored across a range of Google spreadsheets, can be explored, queried and visualised via these APIs. Users can select a dataset from a drop-down menu, fed from a delicious account to which various datastore spreadsheets have been bookmarked using a particular set of tags, or paste in the URL of an arbitrary (public) Google spreadsheet. The first row/headings of the data can then be previewed (a simple spreadsheet is assumed, in which column headings appear in the first row of the spreadsheet).
A series of list boxes are then populated with the column labels and their names, and provide a certain amount of help for the creation of a query over the spreadsheet data. A range of output formats can also be selected, from simple HTML data tables, to a range of charts. URLs are also generated for HTML and CSV representations of the data returned from the query.
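
The kind of query URL such an explorer generates might be built along the following lines. This is a sketch under stated assumptions, not the explorer's own code: the URL layout follows the `gviz/tq` endpoint with `tq` (query) and `tqx` (output format) parameters as currently documented for public Google spreadsheets, which may well differ from the URLs in use when these notes were written. `SPREADSHEET_KEY` is a placeholder and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(spreadsheet_key, query, output="csv"):
    """Build a URL that asks a public spreadsheet to run `query`.

    Assumption: the gviz/tq endpoint layout; `output` might be
    "csv", "html" or "json".
    """
    params = urlencode({"tq": query, "tqx": "out:" + output})
    return "https://docs.google.com/spreadsheets/d/{}/gviz/tq?{}".format(
        spreadsheet_key, params
    )


# Placeholder key and an illustrative query over columns A, B and C.
url = build_query_url("SPREADSHEET_KEY", "select A, B where C > 100")

# Decode the query string again to check what was actually encoded.
decoded = parse_qs(urlsplit(url).query)
```

Because the query travels in the URL, the same link can be pasted into a browser, bookmarked, or handed to another application as a data source.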
One of the nice things about the data table widget (a standard Google Visualisation API component in this case, though similar examples exist for YUI, the Yahoo User Interface Library, or frameworks such as jQuery) is that it supports things like row sorting by column (for free, no programming required!), allowing even further manipulation of the data, albeit at a simplistic level.
(It’s probably worth pointing out here that it may be worth providing a preview of the column headings and first few rows (or a sample of random rows) of data when datasets are published, just so that users can see what sort of data is on offer without having to download the whole data set.)
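
A dataset preview of the sort suggested here, along with sorting rows by a column, might be sketched as follows. The `preview` and `sort_by_column` helpers and the sample CSV are invented for illustration; a publisher would presumably generate the preview once, at publication time.

```python
import csv
import io
import random


def preview(csv_text, n=3, sample=False, seed=None):
    """Return (header, rows): the headings plus the first n rows,
    or n randomly sampled rows when sample=True."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    rows = list(reader)
    if sample and len(rows) > n:
        rows = random.Random(seed).sample(rows, n)
    else:
        rows = rows[:n]
    return header, rows


def sort_by_column(header, rows, column, numeric=False):
    """Sort rows by the named column, comparing as numbers if asked."""
    index = header.index(column)
    if numeric:
        return sorted(rows, key=lambda row: float(row[index]))
    return sorted(rows, key=lambda row: row[index])


# Invented sample data.
SAMPLE_CSV = (
    "council,amount\n"
    "Council C,1200\n"
    "Council A,310\n"
    "Council B,75\n"
    "Council D,5\n"
)

header, first_rows = preview(SAMPLE_CSV, n=2)
_, sampled_rows = preview(SAMPLE_CSV, n=2, sample=True, seed=1)
by_amount = sort_by_column(header, first_rows, "amount", numeric=True)
```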
If you’re in the business of selling information as data, you are under threat where that information is published in an openly licensed way.
Linked Data (the TM is something of a joke, and refers to the particular style of publishing data according to a set of principles first outlined by the inventor of the World Wide Web, Sir Tim Berners-Lee) is one of the data formats that the Government’s data task force favours for the publication of data.
There is a problem though: at the moment, there are barriers to entry to the Linked Data world from both the query side (not many people speak SPARQL, or know how to construct a SPARQL query to an endpoint) and the results side (data is returned as RDF).
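
To give a flavour of both barriers, the sketch below shows what a small SPARQL query looks like and how a result set might be flattened into ordinary rows. It assumes an endpoint that can return SELECT results in the W3C SPARQL JSON results format, which is an assumption on my part (the note above only mentions RDF). The query, the `example.org` vocabulary and URIs, and the sample results are all invented, and no real endpoint is contacted.

```python
import json

# An invented query: find things that have an ISBN, using a made-up vocabulary.
QUERY = """
SELECT ?book ?isbn WHERE {
  ?book <http://example.org/vocab/isbn> ?isbn .
}
"""

# Invented response in the SPARQL JSON results layout:
# "head.vars" names the columns, "results.bindings" holds the rows.
SAMPLE_RESULTS = json.loads("""
{
  "head": {"vars": ["book", "isbn"]},
  "results": {"bindings": [
    {"book": {"type": "uri", "value": "http://example.org/book/1"},
     "isbn": {"type": "literal", "value": "0000000001"}},
    {"book": {"type": "uri", "value": "http://example.org/book/2"}}
  ]}
}
""")


def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of plain dicts.

    Variables left unbound in a row come back as None.
    """
    variables = results["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in results["results"]["bindings"]
    ]


table = flatten_bindings(SAMPLE_RESULTS)
```

Once flattened, the results are just rows and columns again, and can be written out as CSV for people who do not speak SPARQL.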
So – do you speak SPARQL?
