DMI Workshop: Crawling and Scraping
Published in: Education, Technology
Transcript

  • 1. CRAWLING AND SCRAPING
    Noortje Marres (Goldsmiths, University of London)
    Michael Stevenson (Digital Methods Initiative, University of Amsterdam)
    Esther Weltevrede (Digital Methods Initiative, University of Amsterdam)
    Digital Methods Summer School, 28 June 2011
    Wednesday, June 29, 2011
  • 2. CRAWLING AND SCRAPING
    Techniques for online data capture and analysis:
      • Issuecrawler
      • Lippmannian Device
    Implications for research methods:
      • dynamic data sets
      • formatted data
    Real-time research?
  • 3. MAPPING NETWORKS WITH ISSUE CRAWLER
    A Web-based tool for the location and visualization of hyperlink networks on the Web
  • 4. Locating issue networks on the Web
    How to demarcate networks that have configured around specific affairs on the Web? To do this, Issue Crawler relies on:
      • well-chosen starting points or Web pages that disclose activity around a particular issue on the Web by way of hyperlinks
      • the ‘intelligence’ of aggregated, live hyperlinking
  • 5. Extractive Industries Review network, 2004
  • 6. More about hyperlink analysis
    Issue Crawler performs iterations of co-link analysis
      • the critique of ‘absolute’ citation measures (as in: PageRank)
    Compare this with co-citation analysis in the social studies of science (Callon et al., 1983)
      • topical relevance vs overall popularity
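The idea of co-link analysis can be illustrated with a minimal sketch. It assumes that a page counts as part of the network when at least two distinct starting points link to it; the function names, the threshold parameter and the `*.example` hosts are hypothetical, and this is not Issue Crawler's actual code.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host(url):
    """Reduce a URL to its host name so links are compared per site."""
    name = urlparse(url).netloc.lower()
    return name[4:] if name.startswith("www.") else name


def colink(outlinks, min_inlinks=2):
    """Return hosts linked to by at least `min_inlinks` distinct source hosts."""
    linked_by = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if host(target) != host(source):  # ignore site-internal links
                linked_by[host(target)].add(host(source))
    return {h for h, sources in linked_by.items() if len(sources) >= min_inlinks}


# Hypothetical outlinks harvested from three starting points.
outlinks = {
    "http://a.example/links": ["http://x.example/", "http://y.example/"],
    "http://b.example/": ["http://x.example/page", "http://z.example/"],
    "http://c.example/": ["http://y.example/", "http://x.example/", "http://c.example/about"],
}
network = colink(outlinks)
```

Here `z.example` is dropped because only one starting point links to it. The threshold is relative to the chosen starting points, which is one way to read the contrast between topical relevance and overall popularity.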
  • 7. Issue Crawler as a tool of online social research (1/2)
    To perform immanent critique of the supposed ‘egalitarianism’ of the Internet: to highlight specific asymmetries of relevance and/or authority among organizations’ Web pages
    To deploy hyperlink analysis for purposes of issue analysis: in the politics of issues, “experts and activists define issues by sharing information about them” (Heclo, 1974)
  • 8. Issues in the Fergana Valley, Uzbekistan, according to the Web. Fall 2001
    Issues are on the Web, but which issues?
    The characterisation of the Fergana Valley issues depends on the sites accessed.
    NGO addresses, phone numbers, and occasionally e-mail addresses available on the Web: Mumtozbegim; Business Women's Association, Kokand; Girl's Studio, Kokand; Nozigim, Makhalla Women's Club, Kokand; Manoviy Barkamollik Centre; Malika Family Social and Legal Support Centre; Iftizor Open Youth Club; Soglom Avlod Uchun Charity Foundation*; Abdulla Kadyri Foundation, Kokand; Consumer Rights Protection Society, Fergana; Jamila Charitable Foundation; Iftizor Centre, Kokand; MADAD Center for the lonely, aged and disabled, Andijan; Red Crescent Society, Kokand*
  • 9. NGO addresses, phone numbers, and occasionally e-mail addresses available on the Web: Mumtozbegim; Business Women's Association, Kokand; Girl's Studio, Kokand; Nozigim, Makhalla Women's Club, Kokand; Manoviy Barkamollik Centre; Malika Family Social and Legal Support Centre; Iftizor Open Youth Club; Soglom Avlod Uchun Charity Foundation*; Abdulla Kadyri Foundation, Kokand; Consumer Rights Protection Society, Fergana; Jamila Charitable Foundation; Iftizor Centre, Kokand; MADAD Center for the lonely, aged and disabled, Andijan; Red Crescent Society, Kokand*; Mehr-Sahovat Charitable Centre; ECOSAN International Foundation, Kokand*; Mussafo Ecological Centre, Kokand
    *GONGO (government-organised NGO)
    No Internet, no issues from the ground?
    NGOs in the Fergana Valley may not have Web sites, but their issues are on the Web.
  • 10. Issue Crawler: How to use it
  • 11. Issue Crawler: http://issuecrawler.net
    Request account and log in
  • 12. Issue Crawler lobby
      • News: workshops, software
      • Queue: time sharing
      • Current: three simultaneous crawlers
  • 13. Issue Crawler harvester
    Enter text, URLs will be stripped out
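What "URLs will be stripped out" of pasted text might look like can be sketched as follows. The regular expression and the `harvest_urls` helper are assumptions made for illustration; the harvester's real extraction rules are not given in the slides.

```python
import re

# Assumption: a URL is anything starting with http:// or https:// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def harvest_urls(text):
    """Return the unique URLs found in free text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop sentence punctuation glued to the URL
        if url not in urls:
            urls.append(url)
    return urls


pasted = (
    "See http://issuecrawler.net and http://www.govcom.org/scenarios_use.htm, "
    "then log in at http://issuecrawler.net."
)
starting_points = harvest_urls(pasted)
```

The resulting list keeps each address once, so a page of notes or a links page can be pasted in as is and used as a set of starting points.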
  • 14. Crawling and analysis
      • Crawling: to a certain depth
      • Analysis: snowball, inter-actor, co-link
      • Iterate (optional)
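Crawling "to a certain depth" in snowball fashion can be pictured as a breadth-first walk that stops after a fixed number of link hops. The sketch below runs on an in-memory link graph rather than the live Web; the `snowball` function and the page names are hypothetical.

```python
from collections import deque


def snowball(link_graph, seeds, depth):
    """Return every page reachable within `depth` link hops, with its hop count."""
    found = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if found[page] == depth:
            continue  # do not follow links beyond the requested depth
        for target in link_graph.get(page, []):
            if target not in found:
                found[target] = found[page] + 1
                queue.append(target)
    return found


# Hypothetical link graph: page -> pages it links to.
graph = {"seed": ["a", "b"], "a": ["c"], "c": ["d"]}
two_hops = snowball(graph, ["seed"], depth=2)
```

With a depth of 2 the walk reaches `a`, `b` and `c` but not `d`, which sits three hops from the seed.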
  • 15. Issue Crawler as a tool of online social research (2/2)
    More generally, to adopt an empirical approach to the study of public controversies:
      • is there a network? (is there an issue?)
      • who are the actors?
      • how are they related?
      • what are the issues?
      • where are they happening?
  • 16. Co-link settings
      • 1 iteration ~ social or event network
      • 2 iterations ~ issue network
      • 3 iterations ~ establishment network
    See http://www.govcom.org/scenarios_use.htm
  • 17. THE LIPPMANNIAN DEVICE*
    Scraping and other digital methods skills
    * a.k.a. The Google Scraper
  • 18. WHEN SEARCH BECOMES RESEARCH
    Turning Google into a research tool
  • 19. WALTER LIPPMANN (1889-1974)
    The Phantom Public, 1927
    "The problem is to locate by clear and coarse objective tests the actor in a controversy who is most worthy of public support" (p.120)
  • 20. LIPPMANNIAN DEVICE - MODES OF ANALYSIS
    Showing the partisanship of an actor. Showing the issue agenda of an organization.
      • Issue cloud: Issue agenda. Which issues are on the agenda of an organization or movement?
      • Source cloud: Partisanship or commitment. Which sources mention the issue?
  • 21. ISSUE CLOUD: GREENPEACE ISSUES
    An organization’s issue agenda (or commitment)
    Greenpeace has issues. Which are they most committed to?
  • 23. ISSUE CLOUD: GREENPEACE ISSUES
    Greenpeace issues, http://www.greenpeace.org/international/campaigns:
      • Stop climate change
      • Protect ancient forests
      • Defending our Oceans
      • Say no to genetic engineering
      • Eliminate toxic chemicals
      • Demand Peace and Disarmament
      • End the nuclear age
      • Encourage sustainable trade
    Keep most significant issue language: "climate change", "ancient forests", "oceans", "genetic engineering", "toxic chemicals", "disarmament", "nuclear power", "sustainable trade"
    ---> Query Design workshop
  • 25. ISSUE CLOUD: GREENPEACE ISSUES
    Greenpeace’s issue agenda (distribution of commitment)
    Greenpeace’s issue commitment. Greenpeace’s campaign issue list, ranked according to number of mentions of issues on greenpeace.org, 11 October 2009.
  • 26. EXAMPLE: SOURCE CLOUD
    Method for showing the partisanship or commitment of sources to names
    Method:
      1. Gather source list (e.g. through Issuecrawler or top Google results)
      2. Query source list for one or more experts
    Digital Methods Initiative, 2007
  • 27. SOURCE CLOUD: CLIMATE CHANGE SKEPTICS
    Query design: What are the sources?
    Climate Change Skeptics: Who recognizes them?
      1. Top 100 results for the query “climate change”
         http://www.google.com/search?q=%22climate+change%22&num=100
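The query URL on this slide encodes an exact-phrase search (`%22` is a double quote, `+` a space) and asks for 100 results. A small sketch of building such a URL with Python's standard library follows; the `google_query_url` helper is a hypothetical name, and only the `q` and `num` parameters shown on the slide are used.

```python
from urllib.parse import urlencode


def google_query_url(phrase, num=100):
    """Build a search URL for an exact phrase, requesting `num` results."""
    params = {"q": f'"{phrase}"', "num": num}  # quotes force an exact-phrase match
    return "http://www.google.com/search?" + urlencode(params)


url = google_query_url("climate change")
```

Building the URL programmatically avoids hand-encoding mistakes such as a stray space inside the quoted phrase.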
  • 28. SOURCE CLOUD: CLIMATE CHANGE SKEPTICS
    Query design: What are the issues?
    Derive list of climate change skeptics
    Sources: motherjones.com, wikipedia.org, heartland.org
    Compare the three lists and retain the skeptics that are mentioned in at least two of the lists
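The "at least two of the three lists" rule is a simple counting step. The sketch below uses placeholder names, since the slides do not say which skeptic appeared on which list; the function name and the `minimum` parameter are likewise assumptions.

```python
from collections import Counter


def mentioned_in_at_least(lists, minimum=2):
    """Return the names that appear in at least `minimum` of the given lists."""
    # set() ensures a name repeated within one list is counted once for that list.
    counts = Counter(name for names in lists for name in set(names))
    return sorted(name for name, n in counts.items() if n >= minimum)


# Placeholder names standing in for the lists derived from the three sources.
list_one = ["Name A", "Name B", "Name C"]
list_two = ["Name B", "Name C", "Name D"]
list_three = ["Name C", "Name E", "Name E"]
retained = mentioned_in_at_least([list_one, list_two, list_three])
```

Only the names corroborated by two or more sources survive, which keeps a single source's idiosyncratic choices out of the query list.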
  • 29. SOURCE CLOUD: CLIMATE CHANGE SKEPTICS
    Skeptics: S. Fred Singer, Robert Balling, Sallie Baliunas, Patrick Michaels, Richard Lindzen, Steven Milloy, Timothy Ball, Paul Driessen, Willie Soon, Sherwood B. Idso, Frederick Seitz
  • 31. GOOGLE BLOCKING
    Check query design before launching a scrape
    Number of sources x number of issues = number of requests to Google
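The request count can be checked before launching anything by enumerating the source and issue pairs. In this sketch the `site:` query form is an assumption about how each source is queried, and `build_queries` is a hypothetical helper; the sources and names are taken from the slides.

```python
from itertools import product


def build_queries(sources, issues):
    """Pair every source with every issue; each pair is one request."""
    # Assumption: one site-restricted, exact-phrase query per pair.
    return [f'site:{source} "{issue}"' for source, issue in product(sources, issues)]


sources = ["epa.gov", "bbc.co.uk", "ipcc.ch"]
issues = ["S. Fred Singer", "Frederick Seitz"]
queries = build_queries(sources, issues)
request_count = len(queries)  # 3 sources x 2 issues
```

The same multiplication at full scale, the top 100 sources against the eleven skeptics listed, would come to 1,100 requests, which is why the query design is worth checking first.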
  • 33. ----> data visualization: clouding workshop
  • 34. Climate Change Sceptics on the Web (Frederick Seitz)
    Research Question: To what extent are climate change skeptics present in the climate change spaces on the Web?
    Findings: There is distance between the skeptics and the top of the search engine returns.
    Sources (number of mentions): epa.gov (0), bbc.co.uk (0), defra.gov.uk (0), unep.org (0), bom.gov.au (0), ipcc.ch (0), pewclimate.org (0), davidsuzuki.org (0), panda.org (0), mfe.govt.nz (0), ec.gc.ca (0), exploratorium.edu (0), climatechange.com.au (0), greenpeace.org (0), climatechallenge.gov.uk (0), guardian.co.uk (0), iisd.org (0), g8.gov.uk (0), campaigncc.org (1), foe.co.uk (0), state.gov (0), scidev.net (0), eea.europa.eu (0), whoi.edu (0), cbc.ca (0), energy.gov (0), marshall.org (8), climateark.org (4), un.org (0), dar.csiro.au (0), theglobeandmail.com (0), acfonline.org.au (0), gcrio.org (0), nature.com (0), grida.no (0), nature.org (0), ecokids.ca (0), royalsoc.ac.uk (0), climatechangecentral.com (0), iea.org (0), ecn.ac.uk (0), ecy.wa.gov (0), worldwildlife.org (0), realclimate.org (35), metoffice.gov.uk (0), open2.net (0), scienceagogo.com (0), eldis.org (0), ft.com (0), who.int (0), climatecrisis.net (0), faqs.org (0), ltscotland.org.uk (0), abc.net.au (0), climatechange.ca.gov (0), envirolink.org (0), mofa.go.jp (0), sourcewatch.org (21), iucn.org (0), dfat.gov.au (0), ncdc.noaa.gov (0), climatescience.gov (0), climatechangecollege.org (0), ciel.org (0), ucar.edu (0)
    Source: google.com. Query: “Frederick Seitz”. Method: Search for query “Frederick Seitz” in top 100. Organized in order. Tools: Google Scraper and Tag Cloud Generator. Date: 30 July 2007.
    Product of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design: Anne Helmond. CC BY:NC:SA
  • 35. Climate Change Sceptics on the Web (Steven Milloy)
    Research Question: To what extent are climate change skeptics present in the climate change spaces on the Web?
    Findings: There is distance between the skeptics and the top of the search engine returns.
    Sources (number of mentions): epa.gov (1), bbc.co.uk (0), defra.gov.uk (1), unep.org (1), bom.gov.au (0), ipcc.ch (1), pewclimate.org (1), davidsuzuki.org (0), panda.org (0), mfe.govt.nz (0), ec.gc.ca (0), exploratorium.edu (0), climatechange.com.au (0), greenpeace.org (1), climatechallenge.gov.uk (1), guardian.co.uk (0), iisd.org (0), g8.gov.uk (0), campaigncc.org (0), foe.co.uk (0), state.gov (1), eea.europa.eu (1), whoi.edu (1), cbc.ca (0), energy.gov (1), marshall.org (0), climateark.org (2), un.org (0), dar.csiro.au (1), theglobeandmail.com (0), acfonline.org.au (0), gcrio.org (0), nature.com (0), grida.no (0), nature.org (1), ecokids.ca (0), climatechangecentral.com (0), iea.org (0), ecn.ac.uk (1), ecy.wa.gov (1), worldwildlife.org (0), realclimate.org (33), open2.net (0), eldis.org (0), ft.com (0), who.int (1), climatecrisis.net (1), faqs.org (0), metoffice.gov.uk (1), ltscotland.org.uk (1), abc.net.au (0), climatechange.ca.gov (1), envirolink.org (1), mofa.go.jp (1), sourcewatch.org (27), iucn.org (0), dfat.gov.au (0), ncdc.noaa.gov (1), climatescience.gov (0), climatechangecollege.org (1), ciel.org (0), ucar.edu (0)
    Source: google.com. Query: “Stephen Milloy”. Method: Search for query “Stephen Milloy” in top 100. Organized in order. Tools: Google Scraper and Tag Cloud Generator. Date: 30 July 2007.
    Product of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design: Anne Helmond. CC BY:NC:SA
  • 36. Climate Change Sceptics on the Web (S. Fred Singer)
    Research Question: To what extent are climate change skeptics present in the climate change spaces on the Web?
    Findings: There is distance between the skeptics and the top of the search engine returns.
    Sources (number of mentions): epa.gov (0), bbc.co.uk (0), defra.gov.uk (0), unep.org (0), bom.gov.au (0), ipcc.ch (0), pewclimate.org (0), davidsuzuki.org (0), panda.org (0), mfe.govt.nz (0), ec.gc.ca (0), exploratorium.edu (0), climatechange.com.au (0), greenpeace.org (1), climatechallenge.gov.uk (0), guardian.co.uk (0), iisd.org (0), g8.gov.uk (0), campaigncc.org (1), foe.co.uk (0), state.gov (0), scidev.net (0), eea.europa.eu (0), whoi.edu (0), cbc.ca (0), energy.gov (0), marshall.org (0), climateark.org (1), un.org (0), dar.csiro.au (0), theglobeandmail.com (0), acfonline.org.au (0), gcrio.org (0), nature.com (0), grida.no (0), nature.org (0), ecokids.ca (0), royalsoc.ac.uk (0), climatechangecentral.com (0), iea.org (0), ecn.ac.uk (0), ecy.wa.gov (0), worldwildlife.org (0), realclimate.org (14), faqs.org (0), metoffice.gov.uk (0), open2.net (0), scienceagogo.com (0), eldis.org (0), ft.com (0), who.int (0), climatecrisis.net (0), ltscotland.org.uk (0), abc.net.au (0), climatechange.ca.gov (0), sourcewatch.org (64), envirolink.org (0), mofa.go.jp (0), iucn.org (0), dfat.gov.au (0), ncdc.noaa.gov (0), climatescience.gov (11), climatechangecollege.org (0), ciel.org (0), ucar.edu (0)
    Source: google.com. Query: “Fred Singer”. Method: Search for query “Fred Singer” in top 100. Organized in order. Tools: Google Scraper and Tag Cloud Generator. Date: 30 July 2007.
    Product of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design: Anne Helmond. CC BY:NC:SA
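Turning such per-source counts into a cloud amounts to ranking the sources and scaling a font size by the count. The sketch below uses the non-zero counts from the Frederick Seitz slide (plus one zero entry to show it being dropped); the linear scaling and the point sizes are assumptions, not the Tag Cloud Generator's documented behaviour.

```python
def cloud_sizes(counts, min_pt=10, max_pt=40):
    """Rank sources with at least one mention and assign each a font size."""
    nonzero = {name: n for name, n in counts.items() if n > 0}
    if not nonzero:
        return []
    top = max(nonzero.values())
    ranked = sorted(nonzero.items(), key=lambda item: item[1], reverse=True)
    # Assumption: size grows linearly with the count, up to max_pt for the top source.
    return [(name, n, round(min_pt + (max_pt - min_pt) * n / top)) for name, n in ranked]


seitz_counts = {
    "epa.gov": 0,
    "campaigncc.org": 1,
    "marshall.org": 8,
    "climateark.org": 4,
    "realclimate.org": 35,
    "sourcewatch.org": 21,
}
cloud = cloud_sizes(seitz_counts)
```

Sources with zero mentions drop out of the ranking, so the few sites that do mention the name are the ones that stand out.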
  • 37. LIPPMANNIAN DEVICE
    Modes of analysis
      • Issue agenda check. What are the current commitments of one or more organizations? Use the issue cloud
      • Partisanship check. Which side is an actor on? Use the source cloud
  • 38. Tools and references
    http://tools.digitalmethods.net
    http://digitalmethods.net
    http://govcom.org
  • 39. Climate Change Sceptics on the Web (Frederick Seitz): repeat of slide 34
  • 40. EXERCISE: SOURCING CLIMATE CHANGE SKEPTICS
    Research Question: Which climate change issue actors mention the skeptics, and what kinds of actors are more likely to mention them?
    Method: Comparative. Query skeptics in two source sets (‘top’ sources and climate change blogs), outputting source cloud.
  • 41. SOURCE SETS (1)
    Top ten Google returns for “climate change” (mix of media as well as governmental organizations)
  • 42. SOURCE SETS (2)
    Climate change blogs network (IssueCrawler results - mix of ‘establishment’ blogs, media and governmental and non-governmental organizations)
  • 43. EXERCISE: SOURCING CLIMATE CHANGE SKEPTICS
    Steps:
    - Acquire source sets and skeptics list from Michael.
    - Launch the Lippmannian device (aka Google Scraper - see tools.digitalmethods.net).
    - Enter source sets and skeptics names. Query the source sets separately, and remember to use “” to get exact returns.
    - Wait. Use this moment to discuss hypotheses.
    - Explore the output, and present findings.
