Intro to Data Scraping
PRESENTED BY
DAVID SELASSIE OPOKU
@sdopoku
13 July 2015
Outline
1. Target audience
2. What is Data Scraping, and Why?
3. Use cases
4. Basic steps & Best practices
5. Tools
6. Reference Resources
Target Audience
This should be useful to ...
● Non-tech-savvy data journalists
● Advanced data journalists
● Web developers & data publishers
● School of Data fellows
● Open Data enthusiasts
What is & Why Data Scraping?
Data Scraping: what is it?
scrape [ verb ˈskrāp ]
: to remove from a surface by usually repeated strokes of an edged instrument
: to collect by or as if by scraping —often used with up or together <scrape up the
price of a ticket>
- Merriam Webster
“The transformation of unstructured data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a central local database or
spreadsheet.”
- Wikipedia (web scraping)
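A minimal sketch of that definition in practice: an HTML table (unstructured markup) is turned into a list of rows (structured data). It uses only Python's standard-library `html.parser`, not any of the tools named in this deck, and the `TableParser` class, the `scrape_table` helper, and the sample table are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table cells in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page; the names and numbers are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Population</th></tr>
  <tr><td>North</td><td>1,200</td></tr>
  <tr><td>South</td><td>3,400</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each inner list is one table row, ready to be written to a spreadsheet or database. Real pages are usually messier (nested tables, merged cells), which is where the dedicated tools listed later tend to help.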
When should you scrape data?
● PDF Data
● HTML data
Machine-readable data
Example Use Cases
Cases when you can scrape
● Create a dataset for a data workshop
● Create a database for a data-driven app
● Create a data visualisation for a story
Best Practices
Best Practices For Scrapers
1. Scraping is not scary!
a. Use existing tools
2. Use a modern and friendly browser
a. Chrome, Firefox, Opera, Safari
b. Avoid Internet Explorer
3. Map out the process
a. Where does scraping fit in?
Best Practices For Data Publishers
1. Have a consistent structure
a. Websites
b. PDFs
2. Always think about your data end users
a. Before, during & after publishing
Steps
1. Map out the process/pipeline for your data project
2. Identify your data source (website, PDF, API?)
3. Decide on storage format for your scraped data
a. CSV file, Spreadsheet, Google Docs
b. Database
4. Select scraping tool
5. Verify and Clean data
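Steps 3 and 5 above can be sketched as follows, assuming the scraped data has already arrived as rows of text with a name column and a count column. The `clean_rows` and `to_csv` helpers and the sample rows are hypothetical; the cleaning rules shown (trim whitespace, drop blank and repeated rows, turn "1,200" into a number) are only examples of what verification might involve.

```python
import csv
import io


def clean_rows(raw_rows):
    """Verify row width, strip whitespace, drop blank/duplicate rows, parse counts."""
    header, *body = raw_rows
    header = [h.strip() for h in header]
    cleaned, seen = [], set()
    for row in body:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank rows
        if len(cells) != len(header):
            raise ValueError(f"unexpected number of columns: {row!r}")
        name, count_text = cells
        count = int(count_text.replace(",", ""))  # "1,200" -> 1200
        if name in seen:
            continue  # skip repeated entries
        seen.add(name)
        cleaned.append([name, count])
    return [header] + cleaned


def to_csv(rows):
    """Serialise rows as CSV text (write this string to a file to store it)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped rows with typical problems: stray spaces,
# an empty row, thousands separators, and a repeated entry.
raw = [
    ["Region ", "Population"],
    [" North", "1,200"],
    ["", ""],
    ["South", "3,400 "],
    [" North", "1,200"],
]

csv_text = to_csv(clean_rows(raw))
print(csv_text)
```

CSV is used here because it opens directly in a spreadsheet; the same cleaned rows could just as well be loaded into a database.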
Tools
Tools: Web Browsers
Tools: Scraping Apps
1. Point and click
a. Scraper Google Chrome extension
b. ScraperWiki (Classic version)
c. Import.io, Kimono Labs, Webscraper.io
d. Tabula (PDF)
2. Programming (Python libraries)
a. Beautiful Soup
b. Pattern (PDF and HTML)
c. Scrapy
Tools: Storage & Sharing
1. Google Spreadsheets
2. GitHub
3. Datahub.io
Resources - Readings and Tools
1. Five data scraping tools for would-be data journalists
2. Making data on the web useful: scraping
3. Liberating HTML Data Tables
4. BeautifulSoup
5. Pattern
6. Scrapy
7. Datahub
8. Import.io
9. Kimono
10. Webscraper.io
11. Tabula

Skillshare - Introduction to Data Scraping
