SlideShare a Scribd company logo
1 of 18
Eurostat
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Web scraping tools,
an introduction
ESTP course on Big Data Sources – Web, Social Media and
Text Analytics, Day 1
Olav ten Bosch, Statistics Netherlands
Eurostat
Outline
• Introduction
• Scraping tools
• Some scraping knowledge
• A simple scraping exercise
2
Eurostat
Introduction (1)
• There are many different tools for scraping
available, which differ in their functionality and
use.
• Tools and frameworks come and go, choose the
one that fits the job.
• Scraping is not like typical (statistical) IT. The life
cycle (design, develop, test, maintain) is much
shorter, it might not even be a cycle (one time
use).
3
Eurostat
Introduction (2)
• In this course we mention some tools we came
across in the past years which show different
functionality without claiming this list is
exhaustive or these are the best available tools.
• Any tool is useless without some basic knowledge
of web technology and internet experience, so we
provide you some.
• At the end of this session we will do a simple
scraping exercise with a freely available web-
based scraping tool. 4
Eurostat
Introduction (3)
• We make a rough distinction between:
• Scraping: the actual extraction of data / information
from a web page
• Crawling: following hyperlinks on the internet to
traverse multiple pages and / or sites
• Search: using (third party) search engines (such as
google) automatically to find information on the web
• Many tools offer a mix of these
5
Eurostat
Scraping tools (1)
• Imacros (imacros.net):
• Available for quite some years
• Point and click (record and replay) as well as
coding via API (application programming
interface)
• Browser add-in as well as standalone program
• Type of functionality: scrape
• Free and commercial version
• Used by some NSI’s for scraping prices
• Easy to start with
6
Eurostat
Screendump Imacros(2)
7
Eurostat
Imacros: generated code
8
Eurostat
Scraping tools (2)
• Scrapy(scrapy.org):
• Python-based scraping and crawling framework
• More IT oriented: coding skills required
• Open source
• Large user community
• Used by some NSI’s for various scraping tasks
9
Eurostat
Scrapy example
10
Eurostat
Scraping tools (3)
• Import.io:
• Point and click and coding
• Fully web-based and hosted scraping
• Type of functionality: scrape
• Free and commercial licenses
• Used by some NSI’s
11
Eurostat
Screendump ImportIO
12
Eurostat
Screendump ImportIO
13
NEW!
Reduced price
finally…
Dinner in your own
garden!
Eurostat
Screendump ImportIO
14
Eurostat
Scraping tools (5)
• There are many more, such as:
• Nutch for crawling (Apache, java)
• An extensive list is available on:
https://github.com/lorien/awesome-web-scraping
• Scraping tools by Statistics Netherlands:
• CBS Robot Framework (more in the afternoon)
• CBS Robottool, a tool for detecting changes on
websites (robot-assisted price collection)
15
Eurostat
Some scraping knowledge (1)
• HTTP: the communication protocol
• HTML: the language in which web pages are
defined
• JS: javascript (code executing in the browser)
• CSS: style sheets, how web pages are styled.
Important, but does not contain data.
• JPG, PNG, BMP: images, usually not interesting
• CSV / TXT / JSON / XML: data, interesting !!!
16
Eurostat
Some scraping knowledge (2)
• Analyse a website:
• For example using firebug in Firefox or Web
developer extensions in Chrome
• Live demo
• Keep an eye on the format of a hyperlink:
• Fictitious example:
http://www.example.com/getdata?subject=books&display=label,price
• The parameters may be useful for scraping
• Live demo
17
Eurostat
A simple scraping exercise
• The exercise will be presented during the course
18

More Related Content

Similar to 2_Big Data Sources part3-Day 1-B Tools.pptx

Fuzzing RTC @ Kamailio World 2019
Fuzzing RTC @ Kamailio World 2019Fuzzing RTC @ Kamailio World 2019
Fuzzing RTC @ Kamailio World 2019Lorenzo Miniero
 
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01Hiro Yoshioka
 
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformAutopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformJason Letourneau
 
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformAutopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformBasis Technology
 
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...NESSHI and GEPHI: sociology of science as a breeding ground for tool building...
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...Clement Levallois
 
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...mfrancis
 
Citizen Developer Tools - session at SPS New England 10/20/2018
Citizen Developer Tools - session at SPS New England 10/20/2018Citizen Developer Tools - session at SPS New England 10/20/2018
Citizen Developer Tools - session at SPS New England 10/20/2018Antti Koskela
 
Citizen Developer Tools are not just for Citizen Developers (session at Share...
Citizen Developer Tools are not just for Citizen Developers (session at Share...Citizen Developer Tools are not just for Citizen Developers (session at Share...
Citizen Developer Tools are not just for Citizen Developers (session at Share...Antti Koskela
 
Dd13.2013.milano.open ntf
Dd13.2013.milano.open ntfDd13.2013.milano.open ntf
Dd13.2013.milano.open ntfUlrich Krause
 
OpenNTF - The Lotus Notes and Domino Open Source Organization
OpenNTF - The Lotus Notes and Domino Open Source OrganizationOpenNTF - The Lotus Notes and Domino Open Source Organization
OpenNTF - The Lotus Notes and Domino Open Source OrganizationBruce Elgort
 
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...Antti Koskela
 
Highlights from MS build\\2016 Conference
Highlights from MS build\\2016 ConferenceHighlights from MS build\\2016 Conference
Highlights from MS build\\2016 ConferenceEastBanc Tachnologies
 
Programming the Real World: Javascript for Makers
Programming the Real World: Javascript for MakersProgramming the Real World: Javascript for Makers
Programming the Real World: Javascript for Makerspchristensen
 
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13Dominopoint - Italian Lotus User Group
 
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012Jeroen Ticheler
 
Open source softrware, group 5 final
Open source softrware, group 5 finalOpen source softrware, group 5 final
Open source softrware, group 5 finalbigrouge
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning processDenis Dus
 

Similar to 2_Big Data Sources part3-Day 1-B Tools.pptx (20)

Fuzzing RTC @ Kamailio World 2019
Fuzzing RTC @ Kamailio World 2019Fuzzing RTC @ Kamailio World 2019
Fuzzing RTC @ Kamailio World 2019
 
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01
Using oss and hacker culture at an internet company at osc/tokyo 2014/03/01
 
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformAutopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
 
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics PlatformAutopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
Autopsy 3: Free Open Source End-to-End Windows-based Digital Forensics Platform
 
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...NESSHI and GEPHI: sociology of science as a breeding ground for tool building...
NESSHI and GEPHI: sociology of science as a breeding ground for tool building...
 
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...
OSGi Alliance Community Event 2007 - Business Session#2 - Abdallah Bushnaq, A...
 
Python with dataScience
Python with dataSciencePython with dataScience
Python with dataScience
 
Citizen Developer Tools - session at SPS New England 10/20/2018
Citizen Developer Tools - session at SPS New England 10/20/2018Citizen Developer Tools - session at SPS New England 10/20/2018
Citizen Developer Tools - session at SPS New England 10/20/2018
 
SIG-NOC Tools Survey 2015
SIG-NOC Tools Survey 2015SIG-NOC Tools Survey 2015
SIG-NOC Tools Survey 2015
 
Citizen Developer Tools are not just for Citizen Developers (session at Share...
Citizen Developer Tools are not just for Citizen Developers (session at Share...Citizen Developer Tools are not just for Citizen Developers (session at Share...
Citizen Developer Tools are not just for Citizen Developers (session at Share...
 
Dd13.2013.milano.open ntf
Dd13.2013.milano.open ntfDd13.2013.milano.open ntf
Dd13.2013.milano.open ntf
 
OpenNTF - The Lotus Notes and Domino Open Source Organization
OpenNTF - The Lotus Notes and Domino Open Source OrganizationOpenNTF - The Lotus Notes and Domino Open Source Organization
OpenNTF - The Lotus Notes and Domino Open Source Organization
 
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...
Citizen Developer Tools (session at SharePoint Saturday Houston 4/28/2018) by...
 
Highlights from MS build\\2016 Conference
Highlights from MS build\\2016 ConferenceHighlights from MS build\\2016 Conference
Highlights from MS build\\2016 Conference
 
Programming the Real World: Javascript for Makers
Programming the Real World: Javascript for MakersProgramming the Real World: Javascript for Makers
Programming the Real World: Javascript for Makers
 
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13
The Latest and Greatest from OpenNTF and the IBM Social Business Toolkit, #dd13
 
Network Automation e-Academy
Network Automation e-AcademyNetwork Automation e-Academy
Network Automation e-Academy
 
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012
Introduction to GeoNetwork and GeoCat Bridge - teknologiforum Oslo 11-2012
 
Open source softrware, group 5 final
Open source softrware, group 5 finalOpen source softrware, group 5 final
Open source softrware, group 5 final
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 

Recently uploaded

APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 

Recently uploaded (20)

APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 

2_Big Data Sources part3-Day 1-B Tools.pptx

  • 1. Eurostat THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Web scraping tools, an introduction ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1 Olav ten Bosch, Statistics Netherlands
  • 2. Eurostat Outline • Introduction • Scraping tools • Some scraping knowledge • A simple scraping exercise 2
  • 3. Eurostat Introduction (1) • There are many different tools for scraping available, which differ in their functionality and use. • Tools and frameworks come and go, choose the one that fits the job. • Scraping is not like typical (statistical) IT. The life cycle (design, develop, test, maintain) is much shorter, it might not even be a cycle (one time use). 3
  • 4. Eurostat Introduction (2) • In this course we mention some tools we came across in the past years which show different functionality without claiming this list is exhaustive or these are the best available tools. • Any tool is useless without some basic knowledge of web technology and internet experience, so we provide you some. • At the end of this session we will do a simple scraping exercise with a freely available web- based scraping tool. 4
  • 5. Eurostat Introduction (3) • We make a rough distinction between: • Scraping: the actual extraction of data / information from a web page • Crawling: following hyperlinks on the internet to traverse multiple pages and / or sites • Search: using (third party) search engines (such as google) automatically to find information on the web • Many tools offer a mix of these 5
  • 6. Eurostat Scraping tools (1) • Imacros (imacros.net): • Available for quite some years • Point and click (record and replay) as well as coding via API (application programming interface) • Browser add-in as well as standalone program • Type of functionality: scrape • Free and commercial version • Used by some NSI’s for scraping prices • Easy to start with 6
  • 9. Eurostat Scraping tools (2) • Scrapy(scrapy.org): • Python-based scraping and crawling framework • More IT oriented: coding skills required • Open source • Large user community • Used by some NSI’s for various scraping tasks 9
  • 11. Eurostat Scraping tools (3) • Import.io: • Point and click and coding • Fully web-based and hosted scraping • Type of functionality: scrape • Free and commercial licenses • Used by some NSI’s 11
  • 15. Eurostat Scraping tools (5) • There are many more, such as: • Nutch for crawling (Apache, java) • An extensive list is available on: https://github.com/lorien/awesome-web-scraping • Scraping tools by Statistics Netherlands: • CBS Robot Framework (more in the afternoon) • CBS Robottool, a tool for detecting changes on websites (robot-assisted price collection) 15
  • 16. Eurostat Some scraping knowledge (1) • HTTP: the communication protocol • HTML: the language in which web pages are defined • JS: javascript (code executing in the browser) • CSS: style sheets, how web pages are styled. Important, but does not contain data. • JPG, PNG, BMP: images, usually not interesting • CSV / TXT / JSON / XML: data, interesting !!! 16
  • 17. Eurostat Some scraping knowledge (2) • Analyse a website: • For example using firebug in Firefox or Web developer extensions in Chrome • Live demo • Keep an eye on the format of a hyperlink: • Fictitious example: http://www.example.com/getdata?subject=books&display=label,price • The parameters may be useful for scraping • Live demo 17
  • 18. Eurostat A simple scraping exercise • The exercise will be presented during the course 18