Scraping in 60 minutes


Presentation at Data Harvest 2014

Published in: Education, Technology

  1. Paul Bradshaw, Leanpub.com/scrapingforjournalists. Scraping in 60 mins. Saturday, 10 May 14
  2. https://www.youtube.com/watch?v=Efr-VEkwWoM
  12. Function (Arguments) (aka parameters)
  13. Function (arguments): =SUM(A2:A50), =AVERAGE(B2:B300), =COUNTIF(A10:A3000,"Smith")
  14. Function (parameters): =SUM(range of cells to be summed), =AVERAGE(range of cells to be averaged), =COUNTIF(range of cells to be counted, what to count)
  15. ("string", index)
  16. Tip: search for documentation
  17. Variable
  18. Variables
  19. Jargon checklist: Function, Arguments, Parameters, String, Index, Variable, Documentation
  20. Vote: =importXML or Python?
  21. Another function?
  22. Search for documentation! https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
  23. Query (XPath)
  24. XPath is a path through XML (or HTML): <table> = //table; <table><tr> = //table//tr; <table><tr><td> = //table//tr//td
  25. Search for documentation! http://www.w3schools.com/XPath/xpath_syntax.asp
  26. Tip: search for structure around data. http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
  27. http:// www.w4mpjobs.org/ SearchJobs.aspx? (full URL: http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs)
  29. "//div[@class='leftcolumn']"
  30. //div[starts-with(@class, 'jobWrap')]
  31. A crib sheet:
  32. Chrome extension:
  34. #!/usr/bin/env python
      import scraperwiki
      Annotations: 'This is a Python script' (shebang); import the Scraperwiki library
  35. #!/usr/bin/env python
      import scraperwiki
      html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
      print html
      Annotation: Function (argument)
  36. #!/usr/bin/env python
      import scraperwiki
      html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
      print html
      Annotation: Comes from Scraperwiki library (check documentation)
  37. #!/usr/bin/env python
      import scraperwiki
      html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
      print html
      Annotations: Variable (assigned with = sign); statement used to show variable
  38. #!/usr/bin/env python
      import scraperwiki
      html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
      print html
  39. Jargon checklist: Library, Shebang, List
  40. Paul Bradshaw, Leanpub.com/scrapingforjournalists. Thank you.
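The spreadsheet jargon from slides 12 to 19 (function, arguments, string, index, variable) carries over to Python almost unchanged. A minimal sketch, assuming made-up sample data: the surnames and ages below are hypothetical and do not come from the talk.

```python
from statistics import mean

# Variables: names that hold values (here, two lists). Sample data is hypothetical.
surnames = ["Smith", "Jones", "Smith", "Patel"]
ages = [34, 51, 28, 45]

# Function (arguments): the function name, then its arguments in brackets.
total = sum(ages)                  # like =SUM(A2:A50)
average = mean(ages)               # like =AVERAGE(B2:B300)
smiths = surnames.count("Smith")   # like =COUNTIF(A10:A3000,"Smith")

# Index: a position in a list or string. Python counts from 0,
# so index 1 is the second item.
second_surname = surnames[1]
first_letter = "Smith"[0]

print(total, average, smiths, second_surname, first_letter)
```

As with spreadsheet functions, the quickest way to learn what arguments a Python function expects is to search for its documentation.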

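The XPath queries from slides 24, 29 and 30 can also be tried out in Python. This is a rough sketch using only the standard library's xml.etree.ElementTree, which understands a limited subset of XPath: paths must start with "." and starts-with() is not available, so that query is imitated with an ordinary string check. The HTML snippet is invented for illustration; only the class names 'leftcolumn' and 'jobWrap' come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not the real w4mpjobs.org markup).
SAMPLE = """
<html><body>
  <div class="leftcolumn">
    <div class="jobWrap odd"><a href="/job1">Researcher</a></div>
    <div class="jobWrap even"><a href="/job2">Caseworker</a></div>
  </div>
  <table>
    <tr><td>A1</td><td>B1</td></tr>
    <tr><td>A2</td><td>B2</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# //table//tr//td : every cell in every row of every table
cells = [td.text for td in root.findall(".//table//tr//td")]

# //div[@class='leftcolumn'] : divs whose class is exactly 'leftcolumn'
left_columns = root.findall(".//div[@class='leftcolumn']")

# //div[starts-with(@class, 'jobWrap')] : imitated with str.startswith,
# because ElementTree does not implement the starts-with() function
job_divs = [d for d in root.iter("div") if d.get("class", "").startswith("jobWrap")]
job_titles = [d.find("a").text for d in job_divs]

print(cells)
print(len(left_columns))
print(job_titles)
```

Real web pages are rarely well-formed XML, so a full scraper would normally need a more forgiving HTML parser with complete XPath support; this sketch only shows how the paths map onto page structure.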
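The script on slides 35 to 38 is Python 2 and depends on the Scraperwiki library. For readers without that library, the same idea (a function takes a URL as its argument and returns the page source, which is stored in a variable) can be sketched in Python 3 with the standard library alone. The scrape helper below is a hypothetical stand-in written for this example, not the library's actual implementation.

```python
from urllib.request import urlopen


def scrape(url, timeout=30):
    """Fetch a URL and return its body as text.

    Hypothetical helper that mimics the idea of scraperwiki.scrape(url);
    it is not the Scraperwiki library's own code.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling html = scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/') followed by print(html) would mirror the slides, provided the page is still online and accepts the request. Note that print is a function in Python 3, so its argument goes in brackets.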