Datasets, APIs, and Web Scraping
Damian Gordon
 For this project you need a dataset.
 Two ways of getting a dataset are:
1. Finding an existing one
2. Generating a new one
 Option 1 is waaaaay easier, but it can often be difficult to find the exact dataset you need.
 But more often than not, it’s both.
 Ways to get data:
◦ Downloads and Torrents
◦ Application Programming Interfaces
◦ Web Scraping
 Data journalism sites that make the datasets used in their articles available online
 FiveThirtyEight
◦ https://github.com/fivethirtyeight/data
 BuzzFeed
◦ https://github.com/BuzzFeedNews/everything
 Some I.T. companies provide tonnes of datasets, but you need to set up a (free) login:
 Amazon/AWS
◦ https://registry.opendata.aws/
 Google
◦ https://cloud.google.com/bigquery/public-data/
 Some social sites have full site dumps, often
including media
 Wikipedia: Media
◦ https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media0
 Wikipedia: Full Site Dumps
◦ https://dumps.wikimedia.org/
 Reddit: Submission Corpus 2016
◦ https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
 Government sites with data
 Ireland
◦ https://data.gov.ie/
 UK (get it before it brexits)
◦ https://data.gov.uk/
 USA
◦ https://www.dataquest.io/blog/free-datasets-for-projects/
 Some sites have lots of data, but it needs a bit of cleaning
 The World Bank datasets
◦ https://data.worldbank.org/
 Socrata
◦ https://opendata.socrata.com/
 Academic Sites that provide datasets
 SAGE Datasets
◦ https://methods.sagepub.com/Datasets
 Academic Torrents
 (all sorts of data, in all kinds of state)
◦ https://academictorrents.com/
 Lists of datasets
 https://libraryguides.missouri.edu/c.php?g=213300&p=1407295
 https://guides.lib.vt.edu/c.php?g=580714
 https://libguides.babson.edu/datasets
 https://piktochart.com/blog/100-data-sets/
 APIs (Application Programming Interfaces) are intermediaries that allow one piece of software to talk to another.
 In simple terms, you send JSON to an API and, in return, it gives you JSON back.
 There will always be a set of rules about what you can send in the JSON and what it can return.
 These rules are strict and can’t change unless someone actually changes the API itself.
 So when using an API to collect data, you are strictly governed by those rules, and there are only specific data fields that you can get.
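For example, a minimal sketch of that JSON-in, JSON-out cycle (assuming the public JSONPlaceholder test API and the requests library; the userId parameter is just illustrative):

```python
import requests

# Query a public test API; only the parameters the API documents are accepted.
response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params={"userId": 1},   # the API's "rules" decide what you may send...
    timeout=10,
)
response.raise_for_status()

posts = response.json()     # ...and what JSON comes back
print(posts[0]["title"])    # only the fields the API chooses to expose
```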
 Data journalism sites that have APIs
 ProPublica
◦ https://www.propublica.org/datastore/apis
 Social Media sites that have APIs
 Twitter
◦ https://developer.twitter.com/en/docs
 Government sites that have APIs
 Ireland
◦ https://data.gov.ie/pages/developers
 UK
◦ https://content-api.publishing.service.gov.uk/#gov-uk-content-api
 USA
◦ data.gov/developers/apis
 OECD
◦ https://data.oecd.org/api/
 Data sites that have APIs
 data.world
◦ https://apidocs.data.world/api
 Kaggle
◦ https://www.kaggle.com/docs/api
 Other sites that have APIs
 GitHub
◦ https://developer.github.com/v3/
 Wunderground (weather site, needs login)
◦ https://www.wunderground.com/login
 Creating a dataset using an API with Python
 https://towardsdatascience.com/creating-a-dataset-using-an-api-with-python-dcc1607616d
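The article above walks through the full process; a minimal sketch under different assumptions (the free Open-Meteo weather API and the requests and pandas libraries, rather than the article’s own code) looks like this:

```python
import requests
import pandas as pd

# Pull hourly temperatures from the Open-Meteo API (free, no login) and save them as a CSV dataset.
resp = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={"latitude": 53.35, "longitude": -6.26, "hourly": "temperature_2m"},
    timeout=10,
)
resp.raise_for_status()
hourly = resp.json()["hourly"]

# Turn the JSON response into a tabular dataset and write it to disk.
df = pd.DataFrame({"time": hourly["time"], "temperature_2m": hourly["temperature_2m"]})
df.to_csv("hourly_temperatures.csv", index=False)
print(df.head())
```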
 Good analytics tools to distribute the processing across multiple nodes (see the PySpark sketch below):
 Apache Spark
◦ https://spark.apache.org/
 Apache Hadoop
◦ http://hadoop.apache.org/
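As an illustration, a minimal PySpark sketch (assuming pyspark is installed and a local people.csv file with a country column exists; on a real cluster the same code runs distributed across nodes):

```python
from pyspark.sql import SparkSession

# Start a (local) Spark session; the same code scales out to a cluster unchanged.
spark = SparkSession.builder.appName("DatasetSummary").getOrCreate()

# Load a CSV dataset and run a simple distributed aggregation.
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.groupBy("country").count().orderBy("count", ascending=False).show(10)

spark.stop()
```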
 Web scraping is much more customizable and complex, and is not governed by any set rules.
 You can get any data that you can see on a website using a scraping setup.
 As for how you scrape the data, you can apply any technique available; you are constrained only by your imagination.
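As a minimal sketch (assuming the requests and beautifulsoup4 libraries, and the quotes.toscrape.com practice site, which exists specifically for scraping exercises):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out every quote and its author - anything visible on the page is fair game.
page = requests.get("https://quotes.toscrape.com/", timeout=10)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    print(author, "-", text)
```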
 In other words…
 If you know what you are looking for, and you are repeatedly looking to get the same data from the same source to fulfil a specific objective … go with APIs.
 But if you need something more customizable and complex, not governed by any set rules … you can get any data that you can see on a site using a web scraper.
 Some web spider code, and great videos
 http://damiantgordon.com/Videos/ProgrammingAndAlgorithms/SearchEngine.html
 Five Python Libraries for Scraping (an lxml example follows the list):
◦ The Requests library
 https://2.python-requests.org//en/master/user/quickstart/
◦ Beautiful Soup 4
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
◦ lxml
 https://lxml.de/index.html#introduction
◦ Selenium
 http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
◦ Scrapy
 https://scrapy.org/
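As one illustration from the list above, a small lxml sketch using an XPath query (assuming the lxml and requests libraries and the quotes.toscrape.com practice site):

```python
import requests
from lxml import html

# Parse the page and extract each quote's text with an XPath expression.
page = requests.get("https://quotes.toscrape.com/", timeout=10)
tree = html.fromstring(page.content)

quotes = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')
for q in quotes:
    print(q)
```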
 Some general advice on web scraping:
 Robots.txt
 Check if the root directory of the domain has a file in it called robots.txt.
 This defines which areas of a website crawlers are not allowed to search.
 This simple text file can exclude entire domains, complete directories, one or more subdirectories, or individual files from search engine crawling.
 Crawling a website that doesn’t allow web crawling is very, very rude (and illegal in some countries), so it should not be attempted.
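A minimal sketch of checking robots.txt before crawling, using Python’s built-in urllib.robotparser (the site and user-agent string are just examples):

```python
from urllib import robotparser

# Load the site's robots.txt and ask whether our crawler may fetch a given page.
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

if rp.can_fetch("MyResearchBot/1.0", "https://en.wikipedia.org/wiki/Web_scraping"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt - do not crawl")
```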
 CAPTCHAs
 A lot of websites have CAPTCHAs, and they pose real challenges for web crawlers.
 There are tools to get around them, e.g.
◦ http://bypasscaptcha.com/
 Note that, however you circumvent them, they can still slow down the scraping process a good bit.
 EXCEPTION HANDLING
 I’m speaking for myself here …
 Very often I leave out the exception handling, but in this particular circumstance, catch everything you can.
 Your code will bomb from time to time, and it’s a good idea to know what happened.
 Also, try to avoid hard-coding things; make everything as parameterised as possible.
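A small sketch of that advice (the fetch_page helper, retry counts, and delays are all illustrative, not a prescribed design):

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)

def fetch_page(url, retries=3, delay=5, timeout=10):
    """Fetch a URL, logging and retrying on failure instead of crashing the whole crawl."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as err:
            logging.warning("Attempt %d/%d failed for %s: %s", attempt, retries, url, err)
            time.sleep(delay)
    return None  # the caller decides what to do about pages that never came back
```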
 IP BLOCKING
 Sometimes websites will mistake a reasonably harmless crawler for something more malicious, and will block you.
 When a server detects a high number of requests from the same IP address, or if the crawler makes multiple parallel requests, it may get blocked.
 You might need to create a pool of IP addresses, or spoof a user agent
◦ http://www.whatsmyuseragent.com/
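For instance, a sketch of sending a browser-like User-Agent and spacing requests out (the header string and two-second delay are illustrative; rotating through a proxy pool would follow the same pattern):

```python
import time
import requests

HEADERS = {
    # Present a browser-like User-Agent instead of the default "python-requests/x.y".
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) MyResearchBot/1.0",
}

urls = ["https://example.com/page1", "https://example.com/page2"]

with requests.Session() as session:
    session.headers.update(HEADERS)
    for url in urls:
        response = session.get(url, timeout=10)
        print(url, response.status_code)
        time.sleep(2)  # be polite: don't hammer the server with rapid-fire requests
```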
 DYNAMIC WEBSITES
 Newer websites use a lot of dynamic coding practices that are not crawler friendly.
 Examples are lazy-loading images, infinite scrolling, and product variants being loaded via AJAX calls.
 This type of website is more difficult to crawl.
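A minimal Selenium sketch for JavaScript-rendered pages (assuming Chrome with a matching driver is installed; the quotes.toscrape.com/js/ page builds its content client-side, which a plain HTTP fetch would miss):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium drives a real browser, so content rendered by JavaScript becomes visible to the scraper.
driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/js/")
    driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()
```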
 WEBSITE STRUCTURE
 Websites periodically upgrade their UI, which can lead to numerous structural changes on the website.
 Since web crawlers are set up according to the code elements present on the website at that time, the scrapers will require changes too.
 Web scrapers usually need adjustments every few weeks, as a minor change in the target website affecting the fields you scrape might either give you incomplete data or crash the scraper, depending on the logic of the scraper.
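One defensive pattern, sketched here with BeautifulSoup (the selectors and field names are purely illustrative), is to check every expected element and fail loudly when the site’s structure has changed:

```python
from bs4 import BeautifulSoup

EXPECTED_FIELDS = {"title": "h1.product-title", "price": "span.price"}  # illustrative selectors

def parse_product(html_text):
    """Extract the expected fields; report exactly which selector broke if the layout changed."""
    soup = BeautifulSoup(html_text, "html.parser")
    record, missing = {}, []
    for field, selector in EXPECTED_FIELDS.items():
        element = soup.select_one(selector)
        if element is None:
            missing.append(f"{field} ({selector})")
        else:
            record[field] = element.get_text(strip=True)
    if missing:
        raise ValueError("Page structure changed; selectors not found: " + ", ".join(missing))
    return record
```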
 HONEYPOT TRAPS
 Some website designers put honeypot traps inside websites to detect and trap web spiders.
 They may be links that a normal user can’t see but a crawler can.
 Some honeypot links to detect crawlers will have the CSS style “display: none” or will be colour-disguised to blend in with the page’s background colour.
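A rough sketch of skipping the most obvious honeypot links, here just those hidden with an inline “display: none” or “visibility: hidden” style (colour-based disguises would need the page’s CSS as well):

```python
from bs4 import BeautifulSoup

def visible_links(html_text):
    """Return the hrefs a human could plausibly see, skipping inline-hidden honeypot links."""
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue  # likely a trap aimed at naive crawlers
        links.append(a["href"])
    return links
```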