Crawling and Scraping tutorial at the Digital Methods Summer School 2013


  1. Crawling and Scraping: The Issuecrawler and the Lippmannian device. Michael Stevenson
  2. Issuecrawler. What does it do?
  3. Crawl starting points: sites A, B, C.
  4. Crawl starting points: sites A, B, C. Crawl depth one (follow all of the starting points' outlinks): sites A, B, C, D.
  5. Crawl starting points: sites A, B, C. Crawl depth one (follow all of the starting points' outlinks): sites A, B, C, D. Crawl depth two (follow all outlinks from the pages found in the previous depth): sites A, B, C, D, E, F, G, H.
  6. Analysis, snowball: retain all links and sites discovered during the crawl. Sites A, B, C, D, E, F, G, H.
  7. Analysis, inter-actor: retain only links between the starting points. Sites A, B, C.
  8. Analysis, co-link: retain sites that receive links from at least two other sites. Sites B, D.
  9. Climate change blogs network. Starting points: blogroll from RealClimate.org.
  10. Climate change blogs network. Results: mix of blogs, social media, traditional media, and governmental and non-governmental organizations.
  11. Climate change science network. Starting points: “science links” from RealClimate.org. Results: mix of governmental, non-governmental, educational and media organizations.
  12. OK... We have the issue networks, but what can we say about their content?
  13. Lippmannian device (aka the Google Scraper).
  14. What does it do? 1. Explore a source’s partisanship or commitment. 2. Show the issue agenda of an organization or movement. Source cloud: partisanship or commitment. Which sources mention the expert’s name? Issue cloud: issue agenda. Which issues are on the agenda of an organization or movement?
  15. Lippmannian device. “Source cloud”: showing the partisanship or commitment of sources to one name. Caption: Craig Venter's presence in the synthetic biology issue space, March 2008. Top sources on "synthetic biology" according to a Google query, with number of mentions of Venter per source, ordered.
  16. Lippmannian device. “Source cloud”: method for showing the partisanship or commitment of sources to names. 1. Gather source list (e.g. through Issuecrawler). 2. Query source list for one or more experts.
  17. Lippmannian device. “Source cloud”: showing the partisanship or commitment of sources to names. Climate Change Skeptics: Who recognizes them? (Digital Methods Initiative, 2007) https://wiki.digitalmethods.net/Dmi/ClimateChangeSkeptics
  18. Lippmannian device. “Making an Issue cloud”: an organization’s issue agenda (or commitment). Public Knowledge, a digital rights NGO, has issues. Which are they most committed to?
  19. Lippmannian device. “Issue cloud”: showing the issue commitments of the NGO Public Knowledge. Caption: Public Knowledge's issue commitment. Lower six issues on Public Knowledge's issue list, ranked according to number of mentions of issues on publicknowledge.org, 2 October 2009.
  20. Lippmannian device. “Making an Issue cloud”. Greenpeace issues, http://www.greenpeace.org/international/campaigns: Stop climate change; Protect ancient forests; Defending our Oceans; Say no to genetic engineering; Eliminate toxic chemicals; Demand Peace and Disarmament; End the nuclear age; Encourage sustainable trade. Keep most significant issue language: "climate change", "ancient forests", oceans, "genetic engineering", "toxic chemicals", disarmament, "nuclear power", "sustainable trade".
  21. Lippmannian device. “Issue cloud”: Greenpeace’s issue agenda (distribution of commitment). Caption: Greenpeace's issue commitment. Greenpeace's campaign issue list, ranked according to number of mentions of issues on greenpeace.org, 11 October 2009.
  22. Lippmannian device. “Making an Issue cloud”: multiple sources, multiple issues. What is the agenda of the global human rights network? Which issues are at the top and at the bottom of the agenda? What is the current level of commitment to a particular issue?
  23. Lippmannian device. “Making an Issue cloud”: multiple sources, multiple issues. This is more complicated, but still doable (Govcom.org, University of Pittsburgh, UMass Amherst, ongoing).
  24. Lippmannian device. “Making an Issue cloud”: take three good lists of human rights organizations (global south, global north, UN’s).
  25. Lippmannian device. “Making an Issue cloud”: make a list of all issues listed on all websites.
  26. Lippmannian device. “Issue cloud”: showing the issue commitments of the global human rights network. Caption: Global human rights issue agenda. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.
  27. Lippmannian device. “Issue cloud”: showing the issue commitments of the global human rights network. Caption: Global human rights issue agenda, bottom. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.
  28. Lippmannian device. Partisanship check: which side of the controversy is an actor on? Use the source cloud.
  29. Lippmannian device. 1. Check an organization’s issue agenda. What are its current commitments? 2. Check a national or global movement’s issue agenda. What are its current commitments? Use the issue cloud.
  30. Questions.
  31. Exercise: Sourcing Climate Change Skeptics.
  32. Climate Change Sceptics on the Web (Frederick Seitz). Research question: To what extent are climate change skeptics present in the climate change spaces on the Web? Findings: There is distance between the skeptics and the top of the search engine returns. Source: google.com. Query: “Frederick Seitz”. Method: Search for query “Frederick Seitz” in top 100. Organized in order. Tools: Google Scraper and Tag Cloud Generator. Date: 30 July 2007. Product of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design: Anne Helmond. CC BY:NC:SA. Mentions per source: campaigncc.org (1), climateark.org (4), marshall.org (8), realclimate.org (35), sourcewatch.org (21), abc.net.au (0), acfonline.org.au (0), bbc.co.uk (0), bom.gov.au (0), cbc.ca (0), ciel.org (0), climatechallenge.gov.uk (0), climatechange.ca.gov (0), climatechange.com.au (0), climatechangecentral.com (0), climatechangecollege.org (0), climatecrisis.net (0), climatescience.gov (0), dar.csiro.au (0), davidsuzuki.org (0), defra.gov.uk (0), dfat.gov.au (0), ec.gc.ca (0), ecn.ac.uk (0), ecokids.ca (0), ecy.wa.gov (0), eea.europa.eu (0), eldis.org (0), energy.gov (0), envirolink.org (0), epa.gov (0), exploratorium.edu (0), faqs.org (0), foe.co.uk (0), ft.com (0), g8.gov.uk (0), gcrio.org (0), greenpeace.org (0), grida.no (0), guardian.co.uk (0), iea.org (0), iisd.org (0), ipcc.ch (0), iucn.org (0), ltscotland.org.uk (0), metoffice.gov.uk (0), mfe.govt.nz (0), mofa.go.jp (0), nature.com (0), nature.org (0), ncdc.noaa.gov (0), open2.net (0), panda.org (0), pewclimate.org (0), royalsoc.ac.uk (0), scidev.net (0), scienceagogo.com (0), state.gov (0), theglobeandmail.com (0), ucar.edu (0), un.org (0), unep.org (0), who.int (0), whoi.edu (0), worldwildlife.org (0).
  33. Research question: Which climate change issue actors mention the skeptics, and what kinds of actors are more likely to mention them? Method: comparative query of the skeptics in three source sets (‘top’ sources, climate change blogs and climate change science network), outputting a source cloud for each.
  34. Source sets: (1) Top ten Google returns for “climate change” (mix of media as well as governmental organizations).
  35. Source sets: (2) Climate change blogs network (IssueCrawler results: mix of blogs, social media, traditional media and governmental and non-governmental organizations).
  36. Source sets: (3) Climate change science network (IssueCrawler results: governmental, non-governmental, educational and media organizations).
  37. Steps:
      - Install the DMI toolbar, and open the Lippmannian device (aka Google Scraper; see tools.digitalmethods.net).
      - Acquire source sets and skeptics list.
      - Enter source sets and skeptics' names. Query the source sets separately, and remember to use “” to get exact returns.
      - Wait, and fill in CAPTCHAs if necessary. Also use this moment to discuss hypotheses.
      - Explore the output, and present findings.
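
The source cloud method (slide 16) and the exercise steps (slide 37) come down to querying each source for each name as an exact phrase, then ordering the sources by number of mentions. A minimal Python sketch of that bookkeeping follows. It assumes a site-restricted query of the form `site:<source> "<name>"`; the slides do not show the Lippmannian device's actual query syntax or internals. `build_query`, `source_cloud` and the `count_mentions` callback are hypothetical names, and the counts are a subset copied from slide 32, so no search engine is contacted.

```python
from typing import Callable, Iterable, List, Tuple


def build_query(source: str, name: str) -> str:
    """Exact-phrase query restricted to one site (assumed query shape)."""
    return f'site:{source} "{name}"'


def source_cloud(
    sources: Iterable[str],
    name: str,
    count_mentions: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Rank sources by number of mentions of `name`, highest first."""
    counts = {s: count_mentions(build_query(s, name)) for s in sources}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Subset of the per-source counts shown on slide 32 (30 July 2007).
SLIDE_32_COUNTS = {
    "campaigncc.org": 1,
    "climateark.org": 4,
    "marshall.org": 8,
    "realclimate.org": 35,
    "sourcewatch.org": 21,
    "bbc.co.uk": 0,
    "greenpeace.org": 0,
}

# Hypothetical stand-in for a search backend: canned answers keyed by query.
canned = {
    build_query(site, "Frederick Seitz"): n
    for site, n in SLIDE_32_COUNTS.items()
}

cloud = source_cloud(SLIDE_32_COUNTS, "Frederick Seitz", canned.__getitem__)
for site, mentions in cloud:
    print(f"{site} ({mentions})")
```

The same ranking gives an issue cloud if the roles are swapped: hold one source fixed and vary the quoted issue term. For the exercise, one would run it once per source set and compare the three resulting clouds.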

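
Slides 3 to 8 describe a crawl that follows outlinks one depth at a time, and three ways of filtering what the crawl finds. The sketch below illustrates those definitions on an invented toy link graph, chosen only so that the site letters match the slides. It is not the Issuecrawler's implementation, and `crawl`, `inter_actor`, `co_link` and `TOY_WEB` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (linking site, linked site)


def crawl(graph: Dict[str, List[str]], seeds: Iterable[str], depth: int) -> Set[Edge]:
    """Follow outlinks from the seeds, one depth at a time."""
    edges: Set[Edge] = set()
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        next_frontier: Set[str] = set()
        for site in frontier:
            for target in graph.get(site, []):
                edges.add((site, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return edges


def sites_in(edges: Set[Edge]) -> Set[str]:
    """All sites that appear at either end of a link."""
    return {site for edge in edges for site in edge}


def snowball(edges: Set[Edge]) -> Set[Edge]:
    """Snowball: retain everything discovered during the crawl."""
    return set(edges)


def inter_actor(edges: Set[Edge], seeds: Iterable[str]) -> Set[Edge]:
    """Inter-actor: retain only links between the starting points."""
    seed_set = set(seeds)
    return {(s, t) for s, t in edges if s in seed_set and t in seed_set}


def co_link(edges: Set[Edge], min_inlinks: int = 2) -> Set[str]:
    """Co-link: retain sites linked to by at least `min_inlinks` other sites."""
    inlinkers: Dict[str, Set[str]] = {}
    for source, target in edges:
        if source != target:
            inlinkers.setdefault(target, set()).add(source)
    return {site for site, srcs in inlinkers.items() if len(srcs) >= min_inlinks}


# Invented toy web; E's outlink lies beyond depth two and is never followed.
TOY_WEB = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["E", "F", "G", "H"],
    "E": ["A"],
}
SEEDS = ["A", "B", "C"]

depth_one = crawl(TOY_WEB, SEEDS, depth=1)
depth_two = crawl(TOY_WEB, SEEDS, depth=2)

print("depth one sites:", sorted(sites_in(depth_one)))
print("snowball sites:", sorted(sites_in(snowball(depth_two))))
print("inter-actor links:", sorted(inter_actor(depth_two, SEEDS)))
print("co-link sites:", sorted(co_link(depth_two)))
```

Keeping the crawl result as a set of links, rather than a set of sites, lets all three analyses run as filters over the same crawl.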