How to Scrape Data as an Economics Student
Nikolay Tretyakov
04.12.2018
Department of Economics
University of Ioannina
Agenda
1. Introduction
2. Theoretical Background
2.1. The language behind HTML code
2.2. Navigating HTML using XPath
3. No-Coding Tricks
3.1. Chrome/Firefox extensions
3.2. Google Spreadsheets functions
4. Scraping static pages using Python
4.1. Urllib, Requests
4.2. Beautiful Soup
5. Scraping dynamic pages using Python
5.1. Selenium WebDriver
6. Conclusion
7. References
Introduction
- Erasmus trainee from Otto-von-Guericke-Universität, Magdeburg,
Germany
- Researching the tourism industry in Epirus, Greece
- Scraping is 70 percent of the work
- Scraped data supports a wide variety of analyses:
- descriptive statistics
- sentiment analysis
- seasonality
Theoretical Background
- Web scraping, or web harvesting: methods of extracting data
from across the internet, mostly using software that
simulates user behavior
- Web crawler, spider, or web robot: a program that
browses the World Wide Web in a methodical manner
- Arguably the most advanced web crawler: the Google search engine
- Access the HTML code: view the page
source in the browser
- XML stands for eXtensible Markup
Language
- (W3C) DOM: an API that treats an XML
document as a tree in which each node is
an object representing a part of the document
<?xml version="1.0" encoding="utf-8"?>
<destinationslist>
<dest>
<dest_en> Ioannina </dest_en>
<dest_ru> Янина </dest_ru>
<dest_gr> Ιωάννινα </dest_gr>
</dest>
</destinationslist>
Language behind HTML code
Example of XML structure
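Translated into code, the XML structure above can be parsed with Python's standard library (the snippet is reproduced inline, with whitespace trimmed and the XML declaration dropped):

```python
import xml.etree.ElementTree as ET

# the XML snippet from the slide
xml_doc = """<destinationslist>
  <dest>
    <dest_en>Ioannina</dest_en>
    <dest_ru>Янина</dest_ru>
    <dest_gr>Ιωάννινα</dest_gr>
  </dest>
</destinationslist>"""

root = ET.fromstring(xml_doc)         # the root element node
for dest in root.findall("dest"):     # child element nodes
    print(dest.find("dest_en").text)  # -> Ioannina (a text node)
```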
Navigating HTML using XPath
XPath query:
/html[@class='js bootstrap-anchors-processed']/body[@class='html not-front not-logged-in one-sidebar sidebar-first
page-traineeships navbar-is-fixed-top']/div[@class='main-container container']/div[@class='row']/div[@class='region
region-content col-sm-9']/section[@id='block-system-main']/div[@class='view view-erasmusintern-traineeships-search
view-id-erasmusintern_traineeships_search view-display-id-page media-list-container
view-dom-id-bfcb6580db560c43fd748355dce05662']/div[@class='view-content']/div[1]/div[@class='node node-traineeship
view-mode-media_list clearfix']/div[@class='row media-list-items']/div[@class='col-md-12']/div[@class='ds-header
inline-header-content']/div[@class='field field-name-title field-type-ds field-label-hidden
pull-left']/div[@class='field-items']/div[@class='field-item even']/h3[@class='dot-title']/a
Key characters:
/ : starts at the root; leads to a child
node;
// : starts anywhere (relative path);
@ : selects an attribute;
[] : answers the question "Which one?";
[*] : selects everything
XPath: "//div[@class='field-item even']/h3[@class='dot-title']/a[1]"
Types of nodes:
- element
- attribute
- text
- namespace
- processing instruction
- comment
- document node
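As a sketch of these key characters in action, Python's standard library can evaluate a limited XPath subset (the HTML fragment below is a made-up, simplified version of the listing markup):

```python
import xml.etree.ElementTree as ET

# hypothetical fragment shaped like the listing markup above
snippet = """<div class="field-item even">
  <h3 class="dot-title"><a href="/t/1">Manager</a></h3>
</div>"""

root = ET.fromstring(snippet)
# ElementTree supports a limited XPath subset: .// (anywhere below
# the current node), [@attr='value'] filters, and / child steps
links = root.findall(".//h3[@class='dot-title']/a")
print(links[0].text)  # -> Manager
```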
No-Coding Tricks
- Using browser extensions (demo1):
- XPath Helper (Google Chrome)
- SelectorGadget
- ChroPath (Firefox, Chrome)
- Firefox Quantum Developer Edition (replaced Firebug)
- IMPORTXML() in Google Spreadsheets:
- arguments: url and xpath
- also IMPORTFEED() and IMPORTHTML()
- Third-party services with free tiers
Scraping static pages using Python
- Requests and urllib libraries
- Beautiful Soup library
1. Get to the desired URL
2. Scrape the content on the page
3. Save a JSON or CSV file
Scraping static pages using BS4 + Requests
Python source code and example JSON output

import requests
from bs4 import BeautifulSoup as BS

def save_the_json():
    # erasmus_intern_url: the search-results page URL, defined elsewhere
    response = requests.get(erasmus_intern_url)
    soup = BS(response.text, "lxml")
    scrape_results = soup.find_all("div", class_="field-item even")
    for element in scrape_results:
        title = element.find("h3", class_="dot-title").find_next("a").get_text()

save_the_json()

"results": [
    {
        "id": 1,
        "city": "Berlin",
        "company": "UIZ",
        "title": "Manager"
    }
]
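The final step of the workflow, saving the JSON file, might look like this; the `results` list here is illustrative, standing in for what the scraping loop collects:

```python
import json

# illustrative results; in practice this list is filled inside the scraping loop
results = [{"id": 1, "city": "Berlin", "company": "UIZ", "title": "Manager"}]

# write a UTF-8 JSON file without escaping non-ASCII characters
with open("results.json", "w", encoding="utf-8") as f:
    json.dump({"results": results}, f, ensure_ascii=False, indent=2)
```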
Crawling dynamic web pages using Python
- Selenium WebDriver (source,
output, demo2):
- initially built for automated testing
- makes AJAX requests
- submits forms, clicks buttons,
closes pop-ups
- waits are important
(recommended when making
more than 20 HTTP requests
per minute)
- Requests-HTML package:
- handles simple JavaScript calls
- introduced in early 2018
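The 20-requests-per-minute guideline can be enforced with a small throttle object; a minimal sketch, independent of Selenium (names are illustrative):

```python
import time

class Throttle:
    """Sleep between calls so we never exceed max_per_minute requests."""
    def __init__(self, max_per_minute=20):
        self.min_interval = 60.0 / max_per_minute
        self.last = None

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()

throttle = Throttle(max_per_minute=20)
# call throttle.wait() before every driver.get(...) or requests.get(...)
```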
Web etiquette and the cat-and-mouse game
- Websites protect themselves!
- crawling puts extra load on servers
- copyright issues
- loss of income
- Sanctions range from a temporary IP ban to
a lawsuit
- Respect robots.txt and the Terms of Service (ToS)
- Try to obtain an API instead
- Do not publish scraped copyrighted information
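Respect for robots.txt can be checked programmatically; a minimal sketch with Python's standard library (the rules shown are a made-up example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse a robots.txt body directly; against a live site, use
# rp.set_url("https://example.com/robots.txt") followed by rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```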
Conclusion
- Automate the boring stuff:
https://automatetheboringstuff.com/
- Apply scraping only if it is worth it
- Try Erasmus!
References
1. https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
2. https://erasmusintern.org/traineeships?search_api_views_fulltext=&field_traineeship_full_location_field_traineeship_location_count=242
3. https://www.slideshare.net/anniecushing/web-scraping-for-codeophobes
4. https://www.pythonforbeginners.com/requests/using-requests-in-python
5. https://www.techopedia.com/definition/5212/web-scraping
6. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
7. https://selenium-python.readthedocs.io/
