Datasets, APIs, and Web Scraping
Damian Gordon
 For this project you need a dataset.
 Two ways of getting a dataset are:
1. Finding an existing one
2. Generating a new one
 Option 1 is waaaaay easier, but it can often be difficult to find the exact dataset you need.
 But more often than not, it’s both.
 Ways to get data:
◦ Downloads and Torrents
◦ Application Programming Interfaces
◦ Web Scraping
 Data journalism sites that make the datasets used in their articles available online
 FiveThirtyEight
◦ https://github.com/fivethirtyeight/data
 BuzzFeed
◦ https://github.com/BuzzFeedNews/everything
 Some I.T. companies provide tonnes of datasets, but you need to set up a (free) login:
 Amazon/AWS
◦ https://registry.opendata.aws/
 Google
◦ https://cloud.google.com/bigquery/public-data/
 Some social sites have full site dumps, often
including media
 Wikipedia: Media
◦ https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media0
 Wikipedia: Full Site Dumps
◦ https://dumps.wikimedia.org/
 Reddit: Submission Corpus 2016
◦ https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
 Government sites with data
 Ireland
◦ https://data.gov.ie/
 UK (get it before it brexits)
◦ https://data.gov.uk/
 USA
◦ https://www.dataquest.io/blog/free-datasets-for-projects/
 Some sites have lots of data, but they need a
bit of cleaning
 The World Bank datasets
◦ https://data.worldbank.org/
 Socrata
◦ https://opendata.socrata.com/
 Academic Sites that provide datasets
 SAGE Datasets
◦ https://methods.sagepub.com/Datasets
 Academic Torrents
 (all sorts of data, in all kinds of state)
◦ https://academictorrents.com/
 Lists of datasets
 https://libraryguides.missouri.edu/c.php?g=213300&p=1407295
 https://guides.lib.vt.edu/c.php?g=580714
 https://libguides.babson.edu/datasets
 https://piktochart.com/blog/100-data-sets/
 APIs (Application Programming Interfaces) are an intermediary that allows one piece of software to talk to another.
 In simple terms, you send JSON to an API and it sends JSON back.
 There is always a set of rules about what you can send in that JSON and what the API can return.
 These rules are strict and can’t change unless someone actually changes the API itself.
 So when using an API to collect data, you are strictly governed by those rules, and there are only certain data fields you can get; a short sketch follows.
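 For example, here is a minimal sketch of calling a JSON API with the Python requests library; the endpoint and parameters are hypothetical placeholders, not a real API:

import requests

# Hypothetical endpoint and query parameters -- the real API's rules
# define exactly what you may send and what fields come back.
response = requests.get(
    "https://api.example.com/v1/records",
    params={"category": "weather", "limit": 50},
    timeout=10,
)
response.raise_for_status()      # fail loudly if the request was rejected
data = response.json()           # the API answers with JSON; parse it

for record in data.get("results", []):
    print(record)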
 Data journalism sites that have APIs
 ProPublica
◦ https://www.propublica.org/datastore/apis
 Social Media sites that have APIs
 Twitter
◦ https://developer.twitter.com/en/docs
 Government sites that have APIs
 Ireland
◦ https://data.gov.ie/pages/developers
 UK
◦ https://content-api.publishing.service.gov.uk/#gov-uk-content-api
 USA
◦ data.gov/developers/apis
 OECD
◦ https://data.oecd.org/api/
 Data sites that have APIs
 data.world
◦ https://apidocs.data.world/api
 Kaggle
◦ https://www.kaggle.com/docs/api
 Other sites that have APIs
 GitHub
◦ https://developer.github.com/v3/
 Wunderground (weather site, needs login)
◦ https://www.wunderground.com/login
 Creating a dataset using an API with Python
 https://towardsdatascience.com/creating-a-dataset-using-an-api-with-python-dcc1607616d
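 In the same spirit as that tutorial (not its exact code), here is a sketch of collecting records from a hypothetical paginated API into a CSV file:

import csv
import requests

BASE_URL = "https://api.example.com/v1/records"   # hypothetical endpoint

rows = []
page = 1
while True:
    resp = requests.get(BASE_URL, params={"page": page}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    rows.extend(payload["results"])
    if not payload.get("next_page"):   # stop when the API reports no further pages
        break
    page += 1

# The collected records become the dataset.
if rows:
    with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)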
 Good analytics tools for distributing the processing across multiple nodes (a minimal sketch follows below):
 Apache Spark
◦ https://spark.apache.org/
 Apache Hadoop
◦ http://hadoop.apache.org/
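 As a rough illustration, a minimal PySpark sketch that reads a large CSV and runs a distributed aggregation; the file path and column name are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-exploration").getOrCreate()

# Spark splits the read and the group-by across the cluster's nodes.
df = spark.read.csv("hdfs:///data/large_dataset.csv", header=True, inferSchema=True)
df.groupBy("category").count().show()

spark.stop()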
 Web scraping is much more customizable and complex, and it is not governed by any set of rules.
 You can get any data that you can see on a
website using a scraping setup.
 As for how you can scrape data, you can
apply any techniques available, and you are
constrained only by your imagination.
 In other words…
 If you know what you are looking for, and you repeatedly need to get the same data from the same source for a specific objective … go with APIs
 But if you need something more customizable and complex, not governed by any set of rules … you can get any data you can see on a site with a web scraper
 Some web spider code, and great videos
 http://damiantgordon.com/Videos/ProgrammingAndAlgorithms/SearchEngine.html
 Five Python Libraries for Scraping (a short example combining two of them follows the list):
◦ The Requests library
 https://2.python-requests.org//en/master/user/quickstart/
◦ Beautiful Soup 4
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
◦ Lxml
 https://lxml.de/index.html#introduction
◦ Selenium
 http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
◦ Scrapy
 https://scrapy.org/
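 A minimal sketch combining two of these: requests fetches the page and Beautiful Soup 4 parses it (the URL and CSS selector are hypothetical placeholders):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for heading in soup.select("h2.article-title"):   # hypothetical selector
    print(heading.get_text(strip=True))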
 Some general advice on web scraping:
 Robots.txt
 Check whether the root directory of the domain has a file in it called robots.txt
 This defines which areas of a website crawlers are not allowed to search.
 This simple text file can exclude entire domains, complete directories, one or more subdirectories, or individual files from search engine crawling.
 Crawling a website that doesn’t allow web crawling is very, very rude (and illegal in some countries), so it should not be attempted; a sketch of checking robots.txt follows.
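 Python's standard library can do this check for you; a minimal sketch (the domain, user agent string, and path are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt - do not crawl it")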
 CAPTCHAs
 A lot of websites have CAPTCHAs, and they pose
real challenges for web crawlers
 There are tools to get around them, e.g.
◦ http://bypasscaptcha.com/
 Note that however you circumvent them, they can
still slow down the scraping process a good bit.
 EXCEPTION HANDLING
 I’m speaking for myself here …
 Very often I leave out the exception handling, but
in this particular circumstance, catch everything
you can.
 Your code will bomb from time to time, and it’s a good idea to know what happened.
 Also try to avoid hard-coding things; make everything as parameterised as possible (a sketch follows).
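 A sketch of that advice: the URL pattern and page range are parameters rather than hard-coded values, and a broad try/except logs whatever goes wrong (everything here is a hypothetical example):

import logging
import requests

logging.basicConfig(filename="scraper.log", level=logging.INFO)

BASE_URL = "https://example.com/page/{}"   # parameterised, not hard-coded
PAGES = range(1, 11)
TIMEOUT = 10

for page in PAGES:
    url = BASE_URL.format(page)
    try:
        resp = requests.get(url, timeout=TIMEOUT)
        resp.raise_for_status()
        # ... parse resp.text here ...
    except Exception:
        # Catch everything and log it, so when the code bombs you know what happened.
        logging.exception("Failed while fetching %s", url)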
 IP BLOCKING
 Sometimes websites will mistake a reasonably
harmless crawler for something more malignant,
and will block you.
 When a server detects a high number of requests from the same IP address, or the crawler makes multiple parallel requests, it may get blocked.
 You might need to create a pool of IP addresses, or
spoof a user agent
◦ http://www.whatsmyuseragent.com/
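 A sketch of two of those mitigations: sending a browser-like User-Agent header and pausing between requests so the server does not see a burst of hits (the user-agent string and URLs are only examples):

import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

urls = ["https://example.com/page/1", "https://example.com/page/2"]

for url in urls:
    resp = requests.get(url, headers=headers, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)   # polite delay between requests to avoid looking like a flood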
 DYNAMIC WEBSITES
 New websites use a lot of dynamic coding practices that are not crawler friendly.
 Examples are lazy-loading images, infinite scrolling, and product variants being loaded via AJAX calls.
 Websites like these are much harder to crawl.
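 For pages like these, a browser-driving tool such as Selenium can render the JavaScript before you read the page; a minimal sketch, assuming Chrome and its driver are installed (the URL and selector are hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/products")
    driver.implicitly_wait(10)   # wait up to 10 seconds for dynamic elements to appear
    for item in driver.find_elements(By.CSS_SELECTOR, "div.product-name"):
        print(item.text)
finally:
    driver.quit()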
 WEBSITE STRUCTURE
 Websites periodically upgrade their UI, which can lead to numerous structural changes on the website.
 Since web crawlers are set up according to the code elements present on the website at the time they were written, the scrapers require changes too.
 Web scrapers usually need adjustments every few weeks: a minor change in the target website affecting the fields you scrape might either give you incomplete data or crash the scraper, depending on the logic of the scraper.
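 One way to fail loudly when the structure changes is to check that the selector you depend on still matches anything; a sketch with a hypothetical URL and selector:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/listings", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

rows = soup.select("table#listings tr.item")   # the selector the scraper depends on
if not rows:
    # Either the site changed its structure or the page did not load fully.
    raise RuntimeError("Expected selector matched nothing - check the site's structure")

for row in rows:
    print(row.get_text(" ", strip=True))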
 HONEYPOT TRAPS
 Some website designers put honeypot traps inside websites to detect and trap web spiders.
 They may be links that a normal user can’t see but a crawler can.
 Some honeypot links to detect crawlers will have the CSS style “display: none” or will be colour-disguised to blend in with the page’s background colour.
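 A simple (and far from complete) defence is to skip links whose inline style hides them; links hidden via CSS classes or background-matching colours need more work than this sketch shows:

from bs4 import BeautifulSoup

html = """
<a href="/real-page">Visible link</a>
<a href="/trap" style="display: none">Hidden honeypot link</a>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=True):
    style = (link.get("style") or "").replace(" ", "").lower()
    if "display:none" in style:
        continue   # likely a honeypot link - do not follow it
    print("Following:", link["href"])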