Web Scraping
Gathering Data from Websites
M.MOHAMED MUSTHAFA
Web Scraping - What We’ll Cover
1. Build a data corpus of congressional press releases
2. Use APIs to gather latitude and longitude -- using JSON-formatted data
3. A brief hands-on introduction to HTML parsing
4. APIs and Documentation (FTP) -- OpenSecrets.org
5. Discussion of APIs and Social Media data gathering
6. A brief discussion on the ethics of scraping
2
This is not a programming workshop, but...
1. We will discuss Python and BeautifulSoup
2. We will not learn or use Python in the workshop
3. However, some automation tools are used in this workshop
4. Web Scraping is about deconstructing websites. Effective scraping requires
learning about technical infrastructure as well as subject content
5. Not a workshop on Text Analysis (tools that calculate or correlate your data)
6. Not a workshop on data cleaning
3
Technical Definitions
Deconstruction vs. Construction
4
Definitions
5
● Scraping
Using tools to gather data you can see on
a webpage
A wide range of web scraping techniques
and tools exists. These can be as simple
as copy/paste and increase in complexity
to automation tools, HTML parsing, APIs,
and programming
Scraping propolis from the sides of the bee box
Image by Abalg~commonswiki
Definitions
6
● Scraping
● HTTP
HyperText Transfer Protocol
The protocol machines use to exchange
information over the Internet, enabling
the multi-media data exchange known as
the World Wide Web (WWW).
The protocol defines aspects of
authentication, requests, status codes,
persistent connections, client/server
request/response, etc.
A client reaches a server on port 80 by
default (port 443 for HTTPS); the response
declares the type of document it carries
(HTML, XML, JSON, etc.)
Images from commons.WikiMedia.org
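The request side of this exchange can be sketched with Python's standard library. This is only an illustration: the URL and the User-Agent string are made up, and nothing is actually sent over the network.

```python
from http import HTTPStatus
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical URL, used only for illustration; no request is sent.
url = "http://www.example.com/press-releases?page=2"

# A request pairs a method (GET by default) with a URL and optional headers.
request = Request(url, headers={"User-Agent": "workshop-demo/0.1"})

parts = urlsplit(url)
# urlsplit reports no port when the URL omits one; plain HTTP then uses 80.
port = parts.port or 80

print(request.get_method(), request.full_url)
print("host:", parts.hostname, "port:", port)

# Status codes are how a server summarizes the outcome of a request.
for status in (HTTPStatus.OK, HTTPStatus.NOT_FOUND):
    print(status.value, status.phrase)
```

A real scraper would pass the request to `urllib.request.urlopen` and then read the status code and body from the response.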
● Scraping
● HTTP
● HTML
HyperText Markup Language
The standard markup language on the
Web
As the web evolves, more and more
technical wrappers surround the
visible content of websites
(text and data)
Definitions
7
● Scraping
● HTTP
● HTML
● Parsing
The act of analyzing the strings and
symbols to reveal only the data you need
Definitions
8
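Parsing, as defined above, can be sketched in Python with the standard library's `html.parser`. The sample markup, the `LinkExtractor` class, and the `/press/` prefix are all invented for illustration; the workshop itself relies on point-and-click tools rather than code.

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><a href="/about">About</a></div>
  <ul id="releases">
    <li><a href="/press/1">Statement on the budget</a></li>
    <li><a href="/press/2">Statement on health care</a></li>
  </ul>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for links whose href starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []
        self._current_href = None  # set while inside a matching <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._current_href = href

    def handle_data(self, data):
        # Keep only text that appears inside a matching link.
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

parser = LinkExtractor(prefix="/press/")
parser.feed(SAMPLE_HTML)
for href, text in parser.links:
    print(href, "->", text)
```

The navigation link is skipped because its address does not match the prefix: the parser reveals only the data you asked for.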
● Scraping
● HTTP
● HTML
● Parsing
● Crawling
Moving across or through a website in an
attempt to gather data from more than one
URL or page
Definitions
9
Image by Dave Gingrich
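A crawl can be sketched as a breadth-first walk over links. To keep the example runnable offline, the "website" below is an invented in-memory dictionary and `fetch` is a stand-in for a real HTTP request; both are assumptions, not part of the workshop materials.

```python
import re
from collections import deque

# A tiny in-memory "website": URL path -> HTML. All pages are invented.
FAKE_SITE = {
    "/press?page=1": '<a href="/press/1">A</a> <a href="/press/2">B</a> '
                     '<a href="/press?page=2">Next</a>',
    "/press?page=2": '<a href="/press/3">C</a> <a href="/press?page=1">Previous</a>',
    "/press/1": "<h1>Release one</h1>",
    "/press/2": "<h1>Release two</h1>",
    "/press/3": "<h1>Release three</h1>",
}

def fetch(url):
    """Stand-in for an HTTP GET; a real crawler would request the page here."""
    return FAKE_SITE.get(url, "")

def crawl(start_url):
    """Breadth-first crawl: visit each reachable URL once, in discovery order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # A regex is enough for this toy markup; real HTML needs a proper parser.
        for link in re.findall(r'href="([^"]+)"', fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("/press?page=1"))
```

The `seen` set is what keeps the "Previous" link on page 2 from sending the crawler back around in a loop.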
● Scraping
● HTTP
● HTML
● Parsing
● Crawling
● JSON
JavaScript Object Notation
Readable text used to transmit data
objects consisting of attribute-value pairs
Definitions
10
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}
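As a rough illustration of how a program reads such a record, the sketch below loads the same JSON text with Python's built-in `json` module and pulls out a few values. The variable names are arbitrary.

```python
import json

# The record from the slide, as the raw text a server might send.
raw = """
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ],
  "children": [],
  "spouse": null
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
person = json.loads(raw)

# Nested objects are reached one key at a time.
print(person["address"]["city"])

# Arrays can be filtered like any Python list.
office = [p["number"] for p in person["phoneNumbers"] if p["type"] == "office"]
print(office)

# JSON true / null / [] arrive as Python True / None / [].
print(person["isAlive"], person["spouse"], person["children"])
```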
● Scraping
● HTTP
● HTML
● Parsing
● JSON
● Crawling
● API
Application Programming Interface
A set of rules and protocols through which
software applications talk to one another. In
the context of Web Scraping, an API is a method
used to gather clean data from a website (i.e.
data that is not wrapped in HTML, JavaScript,
etc.)
Definitions
11
Image by Tsahi Levent-Levi
Webscraper.io
Demonstration & Hands-on
12
Demo: Scraping Congressional Press Releases
● Representative Nancy Pelosi’s Press Releases
○ CONTENT
■ Structure of the Press Release subsection of the site
● Pagination
● Links to each release
● Common elements of release
○ TOOLS
■ Webscraper.io tool works inside of Chrome
● Tutorials
● Documentation
● Community
● Free or, alternatively, Fee for Service
13
14
Now You Try It
1. http://v.gd/webscraping1111
2. Follow the download & installation instructions: step 9a
3. Find some congressional press release sites: step 9b
4. Follow the instructions: steps 1-8
15
Google Sheets
16
Demonstration
ImportHTML
ImportXML
Google Sheets -- ImportHTML
● Example Site: http://www.boxofficemojo.com/
● IMPORTHTML(url, query, index)
● In Practice:
=IMPORTHTML("http://www.boxofficemojo.com/","table",3)
Or, with a URL stored in cell A6, e.g.
http://www.boxofficemojo.com/movies/?page=intl&id=annie2014.htm
=IMPORTHTML(A6,"table",6)
Help Docs: https://support.google.com/docs/answer/3093339?hl=en
17
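For readers curious about what a call like IMPORTHTML(url, "table", index) is doing, here is a minimal Python sketch of the same idea: find every table in a page and return the one at a given 1-based position. The sample markup, the `TableExtractor` class, and the `import_table` helper are invented for illustration, and the sketch ignores nested tables.

```python
from html.parser import HTMLParser

# Invented sample page: a layout table followed by a data table.
SAMPLE_HTML = """
<table><tr><td>layout</td></tr></table>
<table>
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td>Film A</td><td>100</td></tr>
  <tr><td>Film B</td><td>250</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect every (non-nested) table as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])          # start a new table
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])      # start a new row
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")  # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()

def import_table(html, index):
    """Return the index-th table (1-based, like IMPORTHTML) as rows of strings."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]

for row in import_table(SAMPLE_HTML, 2):
    print(row)
```

As with the spreadsheet function, finding the right index is often trial and error, because pages frequently use extra tables for layout.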
Google Sheets -- ImportXML
● Example Site: http://www.nytimes.com/
● IMPORTXML(url, xpath_query)
● In Practice:
=IMPORTXML("http://nytimes.com/", "//*/p[contains(@class, 'summary')]")
Help Docs: https://support.google.com/docs/answer/3093342?hl=en
18
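The XPath idea behind IMPORTXML can be tried out in Python with the standard library's `xml.etree.ElementTree`. This is a sketch under two assumptions: the sample document is invented and well-formed, and because ElementTree implements only a subset of XPath (no `contains()`), the query matches the class attribute exactly.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; real pages are rarely this tidy.
SAMPLE_XHTML = """
<html>
  <body>
    <div>
      <h2>Headline one</h2>
      <p class="summary">First summary.</p>
    </div>
    <div>
      <h2>Headline two</h2>
      <p class="byline">By a reporter</p>
      <p class="summary">Second summary.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_XHTML)

# ".//p" means "any <p> below this element"; the predicate filters on class.
summaries = [p.text for p in root.findall(".//p[@class='summary']")]
print(summaries)
```

The byline paragraph is left out because its class does not match, which is the same filtering the spreadsheet formula performs.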
Resources
YouTube Videos
● Web Scraping with Google Sheets
● Importing Data 4 ImportXML
● Web scraping using Google Docs - Xpath
Other Resources
● CSS Tutorial
● XPath
● XPath Language defined by W3C
Your Turn:
● https://github.com/data-and-visualization/Rfun
19
APIs, Parsing, & JSON
20
OpenRefine & JSON file format
● Demonstration (http://v.gd/parsing3333)
○ A step-by-step guide using OpenRefine to gather JSON data via the Google Maps API; then
parse the JSON for latitude & longitude
21
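The same two steps (build the API request, then parse the JSON for latitude and longitude) can be sketched in Python. The endpoint and the results / geometry / location layout follow the Google Geocoding API; the address, the API key placeholder, the sample response values, and the helper names are invented for illustration, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

def geocode_url(address, api_key):
    """Build a Geocoding API request URL; the key here is a placeholder."""
    base = "https://maps.googleapis.com/maps/api/geocode/json"
    return base + "?" + urlencode({"address": address, "key": api_key})

# Trimmed, invented response in the shape the API returns.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "formatted_address": "Durham, NC, USA",
      "geometry": {"location": {"lat": 35.994, "lng": -78.8986}}
    }
  ],
  "status": "OK"
}
"""

def extract_lat_lng(response_text):
    """Return (lat, lng) from the first result, or None if nothing matched."""
    data = json.loads(response_text)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

print(geocode_url("Durham, NC", "YOUR_API_KEY"))
print(extract_lat_lng(SAMPLE_RESPONSE))
```

Checking the status field first matters in practice: an address with no match still returns valid JSON, just with an empty results list.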
Parsing HTML
22
OpenRefine & ParseHtml
● BeautifulSoup Libraries
○ Refine supports Jython (Python on the Java platform) and includes jsoup
○ jsoup is a Java library for HTML extraction, similar in purpose to Python's BeautifulSoup
● Resources (OpenRefine)
○ Step-by-step example documented in the demonstration above
● Documentation
○ Refine’s documentation on HTML Parsing
○ jSoup Documentation
● Now You Try it -- http://v.gd/parsing2222
23
Case Study: OpenSecrets
and documentation
24
OpenSecrets API and FTP
● OpenSecrets tracks the effects of money and lobbying in elections and
politics
● OpenSecrets has an API
● OpenSecrets API Documentation
● OpenSecrets Bulk Data downloader
○ Login
○ Lobby.zip
25
Social Media
26
Social Media
1. Many ways to gather social media data
a. IFTTT, where you compose rules to connect sites and can deposit data in a spreadsheet
b. APIs - often require registered keys
c. Buy your data from a service such as GNIP
2. After you download it you may want to perform analysis
a. Sentiment Analysis, Word Frequency, Correlation, etc.
b. Text Analysis tools (from Digital Humanities LibGuide)
c. Digital Studio’s program on working with Texts: Comparing and choosing texts analysis tools
27
TAGS: a tool for collecting Twitter streams
● TAGS (“New Sheets”; Version 6.0ns) - https://tags.hawksey.info/
○ Form driven (not command line)
○ Minimal setup
○ Data are collected in Google Sheets
○ Gather Twitter stream data by type
■ screen-name stream data
■ screen-name status updates
■ Twitter user favorited tweets
■ Search term for last 7 days: hashtag stream, username, boolean logic
○ Limit by date
○ Schedule to run hourly - set your interval, or run once
○ 3-minute setup video; easy to use - https://youtu.be/Vm0kjAvH5HM
○ Outputs: raw CSV structured data, plus default social graph visualizations
28
Thank You!
29
Please complete feedback forms

Editor's Notes

  1. The main point of this slide is that making data available via [public] APIs is an evolving trend. Older standards, like FTP, still facilitate the release of information and clean data. Sometimes the data are right under your nose if you simply read the available documentation. In this example, the APIs and webpages do not support a basic data need, but the Bulk Downloader -- once you have registered and logged in -- does make data available in the needed format/configuration. https://www.opensecrets.org/resources/create/data.php