A Novel Approach to Scraping Websites - Rob Ousbey, MozCon 2020

Rob Ousbey, VP Product, Moz (@RobOusbey, rob.ousbey@moz.com)

Want to build a quick Google scraper? Want a bookmarklet to crawl data from a website? Want to combine data from two different SaaS tools in one place?

At MozCon 2020, I presented this technique to achieve these things, and much more. By injecting JavaScript into a site (via bookmarklets and files) you can have it run in the page with full permissions, and create wonderful things!

The code samples for this presentation are available at https://www.ousbey.com/mozcon

  1. A Novel Approach to Scraping Websites - Rob Ousbey, VP Product, Moz (@RobOusbey, rob.ousbey@moz.com)
  2.–3. (image-only slides)
  4. ⚠️ Warning: Code Ahead
  5. Paths forward: A: Just use the code that I share today. B: Buddy up with a real developer. C: Learn some JavaScript.
  6. Learn JavaScript? Resources link bundle at: ousbey.com/mozcon
  7.–10. A Novel Approach to Scraping Websites (title slide, repeated)
  11. Insert your code into any website (using Google Chrome)
  12. Option 1: Write / paste code directly into the Console. Press F12 to open Chrome DevTools and click on the 'Console' tab.
  13. (image-only slide)
  14.–15. $(selector): finds an element on the current page, using the typical CSS selectors that you're used to.
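      For instance (illustrative one-liners, not from the deck; they assume the page already loads jQuery and that elements matching these generic selectors exist):

        // Read the text of the page's main heading
        $('h1').text();

        // Get the href of the first link inside an element with id "search"
        $('#search a').first().attr('href');

        // Count how many elements match a class selector
        $('.product-listing').length;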
  16. Option 2: Store code in a 'JavaScript bookmarklet' • Storing code in a bookmarklet lets you click to run it anytime you want • Begin the URL with javascript: and then add your code • Lines of code should be separated by semi-colons
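      As a concrete example (an illustrative bookmarklet, not one from the deck), saving a bookmark whose URL is the single line below gives you a click-to-run snippet that pops up the current page's title and link count; it uses only built-in DOM methods, so it works whether or not the site loads jQuery:

        javascript:(function(){var links=document.querySelectorAll('a').length;alert(document.title+': '+links+' links on this page');})();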
  17. Option 3: Store code in an online JavaScript file • Host your code somewhere accessible online (including Dropbox, etc.) • Create a bookmarklet (see Option 2) that does nothing but import this code • Include it on the page by creating a new script element, using code like:
      javascript:(function(){document.body.appendChild(document.createElement('script')).src='https://ousbey.com/mycode.js';})();
  18. Option 3, continued: the same loader with a cache-busting parameter appended, so that changes to your hosted file are picked up on every click:
      javascript:(function(){document.body.appendChild(document.createElement('script')).src='https://ousbey.com/mycode.js?x='+Math.random();})();
  19. Chapter 1: Some basic scraping
  20.–24. Pseudo code (built up one bullet per slide). For each one of multiple products, we will: • Search for the product name • Get the URL from the top result • Go to that URL, and scrape: the overall score and the detailed ratings • Put all that data in a table of some kind
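      Condensed into a single sketch (the selectors are the ones used on the slides that follow and were specific to G2's markup at the time, so treat this as illustrative rather than guaranteed to run today; it is pasted into the console while on the review site):

        searchQuery = "zendesk";

        // Steps 1-2: search for the product and grab the top result's URL
        $.get("/search/products?query=" + searchQuery, function(data) {
          var $page = $(data);
          var topListing = $($page.find(".product-listing")[0]);
          var productUrl = topListing.find(".product-listing__title a").attr("href");

          // Step 3: fetch the product page and scrape one of the detailed ratings
          $.get(productUrl, function(pageContent) {
            var $productPage = $(pageContent);
            var easeOfUse = $productPage
              .find('.cell.small-7 div:contains("Ease of Use")')
              .closest('.grid-x')
              .find('.charts--doughnut__reviews')
              .text();
            console.log(searchQuery, productUrl, easeOfUse);
          });
        });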
  25.–28. (image-only slides)
  29. Get data from a URL; parse the response:
      searchQuery = "zendesk";
      $.get( "/search/products?query=" + searchQuery, function( data ) {
        var $page = $(data);
        console.log( $page.find("h1").text() );
        topListing = $( $page.find(".product-listing")[0] );
        productData = {
          url: topListing.find(".product-listing__title a").attr("href"),
          name: topListing.find(".product-listing__product-name").text(),
          rating: topListing.find(".product-listing__star-rating .fw-semibold").first().text()
        }
        console.log(productData);
      });
  30. (same code as slide 29) $.get(url, function): fetch a page from elsewhere on the site, and process the result.
  31. (same code as slide 29) .find(selector): parse a page, and find the one or more elements that match the CSS selector. .text(): grab the text from a selected element.
  32.–34. (image-only slides)
  35. Get data from a URL; parse the response:
      $.get(productData.url, function( pageContent ) {
        var $productPage = $(pageContent);
        ratingDetails = {};
        ratingDetails.use = $productPage.find('.cell.small-7 div:contains("Ease of Use")')
          .closest('.grid-x').find('.charts--doughnut__reviews').text();
        ratingDetails.support = $productPage.find('.cell.small-7 div:contains("Quality of Support")')
          .closest('.grid-x').find('.charts--doughnut__reviews').text();
        ratingDetails.setup = $productPage.find('.cell.small-7 div:contains("Ease of Setup")')
          .closest('.grid-x').find('.charts--doughnut__reviews').text();
      });
  36.–37. (same code as slide 35)
  38.–39. (image-only slides)
  40. Writing data:
      $('#container').html('<div id="output"></div>');
      $('#output').append('<span class="product" data-id="zendesk"></span>');
      $('span[data-id="zendesk"]').text('4.3');
      .html(data): replace the HTML of an element. .append(data): add HTML on to the end of an element's content. .text(data): replace the text of an element.
  41. Writing data to a table:
      $("html").html(`<html>
        <head></head>
        <body>
          <textarea id="input"></textarea>
          <button id="start">Go</button>
          <table id="output">
            <tr>
              <th>Product</th> <th>Reviews</th> <th>Score</th>
              <th>Ease of Use</th> <th>Support</th> <th>Ease of Setup</th>
            </tr>
          </table>
        </body>
      </html>`);
  42. (image-only slide)
  43. Writing data to a table:
      function populateTable(){
        productList = $("#input").val().split(/\r?\n/);
        $.each( productList, function( i, inputName ){
          inputName = inputName.trim().toLowerCase();
          $("#output").append('<tr class="product" data-name="'+inputName+'" data-processed="false">' +
            '<td class="name">'+inputName+'</td><td class="count"></td><td class="rating"></td>' +
            '<td class="rating_use"></td><td class="rating_support"></td><td class="rating_setup"></td></tr>');
        });
      }
      $.each(array, function): run the same function against every item in an array.
  44. Adding interactivity:
      $("#start").click(function() {
        populateTable();
        startCrawl();
      });
      .click(function): specifies code to be run when you click the selected element(s).
  45. (image-only slide)
  46. Loops:
      function startCrawl(){
        $("#output tr").each(function( index ) {
          crawlSearchResults( $(this) );
        });
      }
  47.–50. (image-only slides)
  51. Use the G2 scraper for yourself! Inspect the code, or install the bookmarklet, at ousbey.com/mozcon
  52. Who else can we scrape?
  53. Chapter 2: When the scraper becomes the scrapee
  54.–55. (image-only slides)
  56. Folder-by-folder Google indexation counter. Let's answer the question: how many pages does Google have indexed from each folder on the site?
  57.–61. Pseudo code (built up one bullet per slide). For the site in question, we will: • Run a "site:domain.com" search • Make a list of all the folders we find (e.g. moz.com: /learn, /blog, /community) • Do another site: search for each folder to count the number of indexed pages • Record the results (e.g. /learn – 447 pages, /blog – 7,900 pages, /community – 54,000 pages) • Do another site: search - excluding the folders you already found - to discover more folders and continue the process
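      A condensed sketch of the first four steps (like the code on the slides that follow, it assumes jQuery is available in the page and that Google's result markup still uses these selectors, so treat it as illustrative; it is run while on a Google results page so the relative /search URLs resolve):

        domain = "moz.com";

        // Steps 1-2: run a site: search and collect the folders that appear
        $.get("/search?q=site:" + domain, function(data) {
          var folders = {};
          $(data).find('#search a[href^="https://' + domain + '"]').each(function() {
            var slug = $(this).attr("href").split("://")[1].split("/")[1].toLowerCase();
            if (slug) { folders[slug] = true; }
          });

          // Steps 3-4: count indexed pages per folder, spaced to one request per second
          Object.keys(folders).forEach(function(slug, i) {
            setTimeout(function() {
              $.get("/search?q=site%3A" + domain + "/" + slug + "/", function(result) {
                var count = $(result).find("#result-stats").text();
                console.log(slug, count);
              });
            }, (i + 1) * 1000);
          });
        });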
  62. Replace the page with an input / output UI:
      $("html").html(`<html>
        <head></head>
        <body>
          <textarea id="input"></textarea>
          <table id="output"><tr>
            <th>Folder</th>
            <th>Indexed Pages</th>
          </tr></table>
          <button id="collect_serps">Get more folders</button>
          <button id="get_counts">Count indexed Pages</button>
        </body>
      </html>`);
  63.–64. (image-only slides)
  65. Find an initial list of folders on the site:
      domain = $("#input").val();
      query = 'site:' + domain;
      $.get( '/search?q='+ query, function( data ) {
        var $page = $(data);
        $page.find('#search a[href^="https://'+domain+'"]').each(function(index){
          url = $(this).attr("href") + '/';
          slug = url.split("://")[1].split("/")[1].toLowerCase();
          $("#output").append('<tr class="folder" data-name="'+slug+'" data-processed="false">' +
            '<td class="slug">'+slug+'</td><td class="count"></td></tr>');
        });
      });
  66. (image-only slide: the moz.com example)
  67. Count the indexed pages in any given folder:
      function getFolderCount(slug){
        $.get( "/search?q=site%3A"+ domain+"/"+slug+"/", function( data ){
          var $page = $(data);
          count = $page.find('#result-stats').text().replace('About ','').split(' ')[0]
          $('#output tr[data-name="'+slug+'"] td.count').text(count);
        });
      }
  68.–69. (image-only slides)
  70. (same code as slide 67)
  71. Limit the scraping speed:
      var crawlDelay = 1000; // timing in milliseconds
      function getAllCounts(){
        scheduledTimeFromNow = 0;
        $('#output tr').each(function(index){
          scheduledTimeFromNow += crawlDelay;
          getFolderCountWithDelay($(this).attr('data-name'), scheduledTimeFromNow);
        });
      }
      function getFolderCountWithDelay(slug, delay){
        setTimeout( function(){ getFolderCount(slug); }, delay);
      }
      setTimeout(function, delay): schedules the function to be run a certain number of milliseconds in the future.
  72.–73. (image-only slides)
  74. Use this Google scraper for yourself! Inspect the code, or install the bookmarklet, at ousbey.com/mozcon
  75. Chapter 3: Multi-site Scraping
  76. Can we grab some SEO data?
  77.–82. (image-only slides)
  83. Take a 30-day free trial at moz.com/pro
  84.–88. (image-only slides)
  89. Extend the existing SERP table with new columns:
      lighthouseFeatures = [
        {id:'performance', title:'Performance'}, {id:'accessibility', title:'Accessibility'},
        {id:'best-practices', title:'Best Practices'}, {id:'seo', title:'SEO'}
      ]
      lighthouseFeatures.forEach(function (item) {
        $('table.table.table-basic thead tr').append('<th class="table-header" role="columnheader" scope="col" style="width: 110px;">' +
          '<div class="table-header-container"><span class="table-header-name">'+item.title+'</span></div></th>');
        $('table.table.table-basic tbody tr').append('<td class="lighthouse '+item.id+'" colspan="1"></td>');
      });
      $('td.lighthouse.performance').html('<a class="run_lighthouse" href="#">Run Lighthouse</a>');
  90. Write data into the table:
      $('a.run_lighthouse').click(function(event) {
        targetUrl = $(this).closest('tr').find('a.external-link').attr('href');
        $(this).closest('tr').attr('data-targeturl', targetUrl);
        $.ajax({
          type: 'POST',
          url: 'https://lighthouse-dot-webdotdevsite.appspot.com//lh/newaudit?replace=true&save=false&url='+encodeURIComponent(targetUrl),
          crossDomain: true,
          success: function(responseData, textStatus, jqXHR) {
            requestedUrl = responseData.lhr.requestedUrl;
            outputRow = $('tr[data-targeturl="'+requestedUrl+'"]');
            lighthouseFeatures.forEach(function (item) {
              outputRow.find('.'+item.id).text(Math.round(100 * responseData.lhr.categories[item.id].score));
            });
          }
        });
      });
      $.ajax( … ): a more customizable alternative to $.get.
  91.–92. (image-only slides)
  93. See Lighthouse metrics in your Moz Pro campaign! Inspect the code, or install the bookmarklet, at ousbey.com/mozcon
  94. Epilogue: Your Toolkit
  95. Run JavaScript as if it's on the site. Paste code into the console: very quick to do, especially while testing, and easy to change the code every time; but you have to store the code somewhere to copy from, it takes a bit of work, and it's difficult to share with other people. Store code in a bookmarklet: very easy to run code any time you want; but this can be cumbersome to update if you're changing code frequently, and it's harder to share with others. Store code in an online JavaScript file: also very easy to run code any time you want, and to share with others; but you have to find somewhere to host your code online.
  96. The most useful tools at your disposal: $.get() or $.post() or $.ajax(): request content from any other URL on the site that you're on. $('#input') or .find('h2.rating'): use regular CSS selectors to choose an element on a page. .each( … ): loop through every matching element you found on a page. .attr('href') or .text() or .text('new text here'): grab content from an element on a page, or write it to the page.
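      Put together, a typical snippet built from just those pieces looks something like this (a generic illustration with a made-up URL and selectors):

        // Fetch another page on this site, collect every matching element's text,
        // and write a summary back into the current page
        $.get('/some-listing-page', function(data) {
          var ratings = [];
          $(data).find('h2.rating').each(function() {
            ratings.push($(this).text().trim());
          });
          $('#output').text('Found ' + ratings.length + ' ratings: ' + ratings.join(', '));
        });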
  97. Tips for becoming an effective JavaScript hacker: • Learn a little bit of JavaScript, and a lot of jQuery • Write code slowly and check / debug it a bit at a time • Google everything - it's all been solved before • If it works, then it works • Start with existing code, and edit it to understand how it works • Ignore everything about code quality, elegance, efficiency and scalability
  98. ousbey.com/mozcon
  99. Thanks! twitter.com/RobOusbey • ousbey.com/mozcon
