Big Data and Automated Content Analysis
Week 6 – Wednesday
»Web scraping«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Department of Communication Science
Universiteit van Amsterdam
6 May 2014
Magic Cap OK, but this surely can be done more elegantly? Yes!
Today
1 We put on our magic cap, pretend we are Firefox, scrape all
comments from GeenStijl, clean up the mess, and put the
comments in a neat CSV table
2 OK, but this surely can be done more elegantly? Yes!
Big Data and Automated Content Analysis Damian Trilling
Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
Operation Magic Cap
1 Download the page
• They might block us, so let’s pretend we are a web browser!
2 Remove all line breaks (\n, but maybe also \r\n or \r?) and
TABs (\t): we want one long string
3 Isolate the comment section (it started with <div
class="commentlist"> and ended with </div>)
4 Within the comment section, identify each comment
(<article>)
5 Within each comment, separate the text (<p>) from the
metadata (<footer>)
6 Put text and metadata in lists and save them to a CSV file
And how can we achieve this?
from urllib import request
import re
import csv

onlycommentslist=[]
metalist=[]

req = request.Request('http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html', headers={'User-Agent': "Mozilla/5.0"})
tekst=request.urlopen(req).read()
tekst=tekst.decode(encoding="utf-8",errors="ignore").replace("\n"," ").replace("\t"," ")

commentsection=re.findall(r'<div class="commentlist">.*?</div>',tekst)
print(commentsection)
comments=re.findall(r'<article.*?>(.*?)</article>',commentsection[0])
print(comments)
print("There are",len(comments),"comments")
for co in comments:
    metalist.append(re.findall(r'<footer>(.*?)</footer>',co))
    onlycommentslist.append(re.findall(r'<p>(.*?)</p>',co))
writer=csv.writer(open("geenstijlcomments.csv",mode="w",encoding="utf-8"))
output=zip(onlycommentslist,metalist)
writer.writerows(output)
Some remarks
The regexp
• .*? instead of .* means lazy matching. As .* matches
everything, the part where the regexp should stop would not
be analyzed (greedy matching) – we would get the whole rest
of the document (or the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only
returns what’s between them and not the surrounding stuff
(like <footer> and </footer>)
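The difference can be seen on a minimal example (the string below is made up for illustration, it is not from the GeenStijl page):

```python
import re

s = "<footer>meta1</footer> text <footer>meta2</footer>"

# Greedy: .* runs on to the LAST </footer>, swallowing everything in between
greedy = re.findall(r"<footer>(.*)</footer>", s)
# Lazy: .*? stops at the FIRST possible </footer>, so each footer is matched separately
lazy = re.findall(r"<footer>(.*?)</footer>", s)

print(greedy)  # ['meta1</footer> text <footer>meta2']
print(lazy)    # ['meta1', 'meta2']
```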
Optimization
• Only save the 0th (and only) element of the list
• Separate the username and interpret date and time
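Both optimizations could look like this (the footer format shown is invented; the real GeenStijl metadata would need its own pattern):

```python
import re
from datetime import datetime

co = "<footer>piet | 06-05-14 | 10:35</footer><p>Some comment</p>"

# Save the 0th (and only) element instead of a one-item list
meta = re.findall(r"<footer>(.*?)</footer>", co)[0]

# Separate the username from date and time, then interpret them
username, date, time = [part.strip() for part in meta.split("|")]
timestamp = datetime.strptime(date + " " + time, "%d-%m-%y %H:%M")

print(username)   # piet
print(timestamp)  # 2014-05-06 10:35:00
```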
Further reading
Doing this with other sites?
• It’s basically puzzling with regular expressions.
• Look at the source code of the website to see how
well-structured it is.
OK, but this surely can be done more elegantly? Yes!
Scraping
GeenStijl example
• Worked well (and we could do it with the knowledge we
already had)
• But we can also use existing parsers (that can interpret the
structure of the HTML page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228.
It uses the module lxml.
What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a
minute what this is)
Playing around with the Firefox XPath Checker
Some things to play around with:
• // means ‘arbitrary depth’ (=may be nested in many higher
levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are
all paragraphs)
• If you want to refer to a specific attribute of an HTML tag,
you can use [@...]. For example, //*[@id="reviews-container"]
would grab a tag like <div id="reviews-container"
class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code of the web page to think of
other possible XPATHs!
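These building blocks can also be tried out without a browser, on a small made-up HTML snippet (the ids and tags below are invented for illustration):

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="reviews-container">
    <article><p>First</p><p>Second</p></article>
    <article><p>Third</p></article>
  </div>
</body></html>""")

# // = arbitrary depth, [@id=...] = attribute filter, /text() = the text itself
print(page.xpath('//*[@id="reviews-container"]//p/text()'))  # ['First', 'Second', 'Third']

# p[2] = only the second paragraph within each article
print(page.xpath('//article/p[2]/text()'))  # ['Second']
```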
The XPATH
You get something like
//*[@id="tabbedReviewsDiv"]/dl[1]/dd
//*[@id="tabbedReviewsDiv"]/dl[2]/dd
The * means “every”.
Also, to get the text of the element, the XPATH should end in
/text().
We can infer that we (probably) get all comments with
//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()
Let’s scrape them!
from lxml import html
from urllib import request

req=request.Request("http://www.kieskeurig.nl/tablet/samsung/galaxy_tab_3_101_wifi_16gb/reviews/1344691")
tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8",errors="ignore"))

reviews = tree.xpath('//*[@id="reviews-container"]//*[@class="text margin-mobile-bottom-large"]/text()')

print(len(reviews),"reviews scraped. Showing the first 60 characters of each:")
i=0
for review in reviews:
    print("Review",i,":",review[:60])
    i+=1
The output – perfect!
34 reviews scraped. Showing the first 60 characters of each:
Review 0 : Ideaal in combinatie met onze Samsung curved tv en onze mobi
Review 1 : Gewoon een goed ding!!!!Ligt goed in de hand. Is duidelijk e
Review 2 : Prachtig mooi levendig beeld, hoever of kort bij zit maakt n
Review 3 : Opstartsnelheid is zeer snel.
Recap
General idea
1 Identify each element by its XPATH (look it up in your
browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a
module like lxml)
4 Do something with the list (preprocess, analyze, save)
Alternatives: Scrapy, BeautifulSoup, regular expressions, . . .
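For example, the GeenStijl-style extraction could be rewritten with BeautifulSoup, which parses the HTML tree for us (a sketch on a made-up snippet, not a drop-in replacement for the scraper above):

```python
from bs4 import BeautifulSoup

htmlsource = """
<div class="commentlist">
  <article><p>Nice article!</p><footer>piet | 06-05-14</footer></article>
  <article><p>Disagree.</p><footer>kees | 06-05-14</footer></article>
</div>"""

# Let the parser build the tree; no regular expressions needed
soup = BeautifulSoup(htmlsource, "html.parser")
comments = soup.find("div", class_="commentlist").find_all("article")
for c in comments:
    print(c.p.get_text(), "/", c.footer.get_text())
```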
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxraviapr7
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxheathfieldcps1
 
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptx
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptxSOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptx
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptxSyedNadeemGillANi
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxiammrhaywood
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17Celine George
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdfJayanti Pande
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeCeline George
 

Recently uploaded (20)

Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustVani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
March 2024 Directors Meeting, Division of Student Affairs and Academic Support
March 2024 Directors Meeting, Division of Student Affairs and Academic SupportMarch 2024 Directors Meeting, Division of Student Affairs and Academic Support
March 2024 Directors Meeting, Division of Student Affairs and Academic Support
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdf
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptx
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptx
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptxSOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptx
SOLIDE WASTE in Cameroon,,,,,,,,,,,,,,,,,,,,,,,,,,,.pptx
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
 
How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17How to Create a Toggle Button in Odoo 17
How to Create a Toggle Button in Odoo 17
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
How to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using CodeHow to Send Emails From Odoo 17 Using Code
How to Send Emails From Odoo 17 Using Code
 

BD-ACA Week6

  • 1. Big Data and Automated Content Analysis Week 6 – Wednesday »Web scraping« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 6 May 2014
  • 2. Magic Cap OK, but this surely can be done more elegantly? Yes! Today 1 We put on our magic cap, pretend we are Firefox, scrape all comments from GeenStijl, clean up the mess, and put the comments in a neat CSV table 2 OK, but this surely can be done more elegantly? Yes! Big Data and Automated Content Analysis Damian Trilling
  • 3. We put on our magic cap, pretend we are Firefox, scrape all comments from GeenStijl, clean up the mess, and put the comments in a neat CSV table
  • 7. Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
  • 16. Operation Magic Cap
1 Download the page
• They might block us, so let’s pretend we are a web browser!
2 Remove all line breaks (\n, but maybe also \n\r or \r?) and TABs (\t): We want one long string
3 Isolate the comment section (it started with <div class="commentlist"> and ended with </div>)
4 Within the comment section, identify each comment (<article>)
5 Within each comment, separate the text (<p>) from the metadata (<footer>)
6 Put text and metadata in lists and save them to a CSV file
And how can we achieve this?
  • 17.
from urllib import request
import re
import csv

onlycommentslist=[]
metalist=[]

req = request.Request('http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html', headers={'User-Agent': "Mozilla/5.0"})
tekst=request.urlopen(req).read()
tekst=tekst.decode(encoding="utf-8",errors="ignore").replace("\n"," ").replace("\t"," ")

commentsection=re.findall(r'<div class="commentlist">.*?</div>',tekst)
print (commentsection)
comments=re.findall(r'<article.*?>(.*?)</article>',commentsection[0])
print (comments)
print ("There are",len(comments),"comments")
for co in comments:
    metalist.append(re.findall(r'<footer>(.*?)</footer>',co))
    onlycommentslist.append(re.findall(r'<p>(.*?)</p>',co))
writer=csv.writer(open("geenstijlcomments.csv",mode="w",encoding="utf-8"))
output=zip(onlycommentslist,metalist)
writer.writerows(output)
  • 21. Some remarks
The regexp
• .*? instead of .* means lazy matching. As .* matches everything, the part where the regexp should stop would not be respected (greedy matching) – we would get the whole rest of the document (or of the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only returns what is between them, not the surrounding parts (like <footer> and </footer>)
Optimization
• Only save the 0th (and only) element of the list
• Separate the username and interpret date and time
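The two remarks above can be seen in action on a tiny hand-made string (the HTML snippet below is invented for illustration, not taken from GeenStijl):

```python
import re

html = "<footer>user1 | 06-05-14</footer><p>first</p><footer>user2</footer>"

# Greedy: .* runs all the way to the LAST </footer>, swallowing everything in between
print(re.findall(r"<footer>.*</footer>", html))
# -> ['<footer>user1 | 06-05-14</footer><p>first</p><footer>user2</footer>']

# Lazy: .*? stops at the FIRST possible </footer>, so each comment footer is matched separately
print(re.findall(r"<footer>.*?</footer>", html))
# -> ['<footer>user1 | 06-05-14</footer>', '<footer>user2</footer>']

# Parentheses: only the captured group is returned, without the surrounding tags
print(re.findall(r"<footer>(.*?)</footer>", html))
# -> ['user1 | 06-05-14', 'user2']
```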
  • 22. Further reading
Doing this with other sites?
• It’s basically puzzling with regular expressions.
• Look at the source code of the website to see how well-structured it is.
  • 23. OK, but this surely can be done more elegantly? Yes!
  • 27. Scraping Geenstijl-example
• Worked well (and we could do it with the knowledge we already had)
• But we can also use existing parsers (that can interpret the structure of the HTML page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It uses the module lxml
  • 28. What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a minute what this is)
  • 36. Playing around with the Firefox XPath Checker
Some things to play around with:
• // means ‘arbitrary depth’ (= may be nested in many higher levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are all paragraphs)
• If you want to refer to a specific attribute of an HTML tag, you can use [@attribute="value"]. For example, //*[@id="reviews-container"] would grab a tag like <div id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code of the web page to think of other possible XPATHs!
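These XPath features can also be tried out offline with lxml on a small hand-made snippet (the HTML below is invented for illustration; real pages are of course messier):

```python
from lxml import html

# A tiny fake page: one container div with two articles
snippet = html.fromstring(
    '<div id="reviews-container">'
    '<article><p>first</p><p>second</p></article>'
    '<article><p>third</p></article>'
    '</div>')

# // searches at arbitrary depth, * matches any tag,
# [@id="..."] filters on an attribute, /text() returns the text content
print(snippet.xpath('//*[@id="reviews-container"]//p/text()'))
# -> ['first', 'second', 'third']

# p[2] is the second <p> within each <article>
print(snippet.xpath('//article/p[2]/text()'))
# -> ['second']
```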
  • 39. The XPATH
You get something like
//*[@id="tabbedReviewsDiv"]/dl[1]/dd
//*[@id="tabbedReviewsDiv"]/dl[2]/dd
The * means “every”. Also, to get the text of the element, the XPATH should end on /text().
We can infer that we (probably) get all comments with
//*[@id="tabbedReviewsDiv"]/dl[*]/dd/text()
  • 40. Let’s scrape them!
from lxml import html
from urllib import request

req=request.Request("http://www.kieskeurig.nl/tablet/samsung/galaxy_tab_3_101_wifi_16gb/reviews/1344691")
tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8",errors="ignore"))

reviews = tree.xpath('//*[@id="reviews-container"]//*[@class="text margin-mobile-bottom-large"]/text()')

print (len(reviews),"reviews scraped. Showing the first 60 characters of each:")
i=0
for review in reviews:
    print("Review",i,":",review[:60])
    i+=1
  • 41. The output – perfect!
34 reviews scraped. Showing the first 60 characters of each:
Review 0 : Ideaal in combinatie met onze Samsung curved tv en onze mobi
Review 1 : Gewoon een goed ding!!!!Ligt goed in de hand. Is duidelijk e
Review 2 : Prachtig mooi levendig beeld, hoever of kort bij zit maakt n
Review 3 : Opstartsnelheid is zeer snel.
  • 42. Recap
General idea
1 Identify each element by its XPATH (look it up in your browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a module like lxml)
4 Do something with the list (preprocess, analyze, save)
Alternatives: scrapy, beautifulsoup, regular expressions, . . .
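As a sketch of one of the alternatives mentioned in the recap, the same select-text-by-class idea with BeautifulSoup (the HTML string is invented for illustration; on a real page you would pass the downloaded page source instead):

```python
from bs4 import BeautifulSoup

# A tiny fake page with two review texts
page = ('<div id="reviews-container">'
        '<div class="text">Great tablet</div>'
        '<div class="text">Screen could be brighter</div>'
        '</div>')

soup = BeautifulSoup(page, "html.parser")
# find_all with class_ selects by CSS class, much like the @class test in the XPath above
reviews = [tag.get_text() for tag in soup.find_all("div", class_="text")]
print(reviews)
# -> ['Great tablet', 'Screen could be brighter']
```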