Scraping data to drive content marketing campaigns
(without knowing how to code)
@jeremycabral
Data-driven content DOMINATES
Insights from Analyzing 1 Million Articles
“Original research based content has the potential to achieve much higher numbers of domain links than other forms of content”
- Steve Rayson (Director, BuzzSumo)
BuzzSumo Study
Priceonomics
From price guides to content marketing
Pivot to data-driven content marketing
23,000+ linking root domains
Price comparison: Airbnb vs Hotels
125 Linking root domains
URL: https://priceonomics.com/hotels/
The Hipster Music Index
204,219 views
92 Linking root domains
URL: https://priceonomics.com/the-hipster-music-index/
Data mining fuels fast, cheap and repeatable content marketing ideas
But… what if the data you need isn’t available by API or downloadable?
Disclaimer
Seek legal advice before committing to a scraping project.
Scraping data could breach the terms of service of a website.
Scraping at a disruptive rate could slow down or even crash a website.
What is data scraping?
Data scraping is an automated way of using scripts and crawlers to:
1. Fetch a page
2. Parse the data in that page to extract information
3. Format the data in an organised way
4. Store or export that data to create a dataset (DB, CSV, TXT, etc.)
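A minimal Python sketch of these four steps, using the requests and BeautifulSoup libraries. The URL and the .product/.name/.price selectors are hypothetical placeholders, not a real target site:

    import csv
    import requests
    from bs4 import BeautifulSoup

    # 1. Fetch a page
    response = requests.get("https://example.com/products")
    response.raise_for_status()

    # 2. Parse the data in that page to extract information
    soup = BeautifulSoup(response.text, "html.parser")

    # 3. Format the data in an organised way
    rows = []
    for item in soup.select(".product"):  # hypothetical class
        rows.append({
            "name": item.select_one(".name").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })

    # 4. Store or export that data to create a dataset (a CSV here)
    with open("products.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)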
Patterns in HTML & CSS
It’s easier to scrape content when a unique id or class is assigned to the element you want to extract
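For example, a short snippet (the markup is made up) showing why a unique id or class beats a positional selector:

    from bs4 import BeautifulSoup

    # Hypothetical markup: the price sits in an element with a unique class
    html = '<div id="listing"><span class="price">$120</span><span>per night</span></div>'
    soup = BeautifulSoup(html, "html.parser")

    # Stable: target the unique id/class directly
    print(soup.select_one("#listing .price").text)  # $120

    # Brittle: positional selection breaks as soon as the layout changes
    print(soup.find_all("span")[0].text)  # $120, for now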
Basic overview of XPath
XPath can be used to navigate through elements and attributes in a document.
It’s important to understand how tags are nested, as a scraper will follow this tree.
Learn more: https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
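A hedged sketch with the lxml library, assuming a made-up reviews list, showing how an XPath query walks that nested tree:

    from lxml import html

    doc = html.fromstring("""
    <ul class="reviews">
      <li><a href="/review/1">Great stay</a><span class="score">9.1</span></li>
      <li><a href="/review/2">Too noisy</a><span class="score">4.5</span></li>
    </ul>
    """)

    # Follow the nesting: ul -> li -> a, then grab the link text
    titles = doc.xpath('//ul[@class="reviews"]/li/a/text()')
    scores = doc.xpath('//ul[@class="reviews"]/li/span[@class="score"]/text()')
    print(list(zip(titles, scores)))  # [('Great stay', '9.1'), ('Too noisy', '4.5')]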
Finding an API
Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
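Once you spot a JSON endpoint in the browser’s Network tab you can often call it directly and skip HTML parsing altogether. A sketch under that assumption; the endpoint, parameters and field names below are invented for illustration:

    import requests

    # Hypothetical endpoint the page's own JavaScript calls for its data
    url = "https://example.com/api/listings"
    params = {"page": 1, "per_page": 100}

    data = requests.get(url, params=params).json()
    for listing in data.get("results", []):
        print(listing.get("name"), listing.get("price"))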
Important Excel analysis skills
1. Match the same data across multiple spreadsheets:
   a. VLOOKUP
   b. INDEX MATCH
2. Summarising data:
   a. Pivot Tables
   b. Charts
3. Cleaning data:
   a. =TRIM()
   b. =SPLIT() (Google Sheets; use Text to Columns in Excel)
Learn more:
● https://www.distilled.net/excel-for-seo/
● https://trumpexcel.com/clean-data-in-excel/
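For instance, hedged examples of the formulas above (cell and sheet references are illustrative):

    Match the same key (A2) across sheets:
    =VLOOKUP(A2, Sheet2!A:C, 3, FALSE)
    =INDEX(Sheet2!C:C, MATCH(A2, Sheet2!A:A, 0))

    Clean messy values:
    =TRIM(A2)
    =SPLIT(A2, ",")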
Data sources
Web Apps
● Engines / Listings (product data, reviews)
● Search results (with filters applied)
Calculators
● Automatically input values with scripts (see the sketch below)
● Store every combination of calculator results
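A sketch of the calculator idea, assuming a made-up endpoint that accepts form fields via POST; the field names are hypothetical:

    import itertools
    import time
    import requests

    url = "https://example.com/loan-calculator"  # hypothetical calculator
    amounts = range(100000, 500001, 100000)
    terms = [15, 25, 30]

    # Submit every combination of inputs and store each result
    results = []
    for amount, term in itertools.product(amounts, terms):
        resp = requests.post(url, data={"amount": amount, "term": term})
        results.append({"amount": amount, "term": term, "html": resp.text})
        time.sleep(1)  # pace requests so you don't disrupt the site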
Public Datasets
● Upside: easy to download, regularly maintained by others
● Downside: everyone has easy access to the same data as you
APIs
● Upside: everything is structured and (often) documented
● Downside: sometimes not all data is available in an API
How to get the data
Scraping Frameworks & Languages
Popular languages:
PHP
Python
Ruby
Perl
Node.js
These are important for your own development or when choosing a freelancer.
Try to use a language your developers are familiar with.
Simulating the user in the browser
● Selenium Web Driver
● PhantomJS w/CasperJS
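A minimal Selenium WebDriver example in Python (current Selenium syntax; the page and selector are placeholders):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Drive a real browser so JavaScript-rendered content is available
    driver = webdriver.Chrome()  # requires ChromeDriver on your PATH
    driver.get("https://example.com/search?q=hotels")

    # Hypothetical selector: collect each result title after the page renders
    for element in driver.find_elements(By.CSS_SELECTOR, ".result .title"):
        print(element.text)

    driver.quit()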
Data scraping tools

Desktop tools (scripts run on your local machine):
Scrape Similar (“Scraper”)
artoo.js
Tabula - extracts tables from PDFs
Parsehub (free & paid versions)
Screaming Frog
URL Profiler

Hosted services:
Google Sheets (ImportXML, ImportJSON, ImportHTML)
Import.io - automatic page scraper
Mozenda - point-and-click screen scraping (Windows only)
Diffbot (artificial intelligence)
Connotate
Scraping with Google Sheets
Google Sheets Formulas (built in):
=importXML(url, xpath_query) -- imports structured data using XPath
=importHTML(url, query, index) -- imports data from a table or list within an HTML page; the index identifies which table or list in the source code to import
Learn more: https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
=ImportJSON(url, query, parseOptions) -- imports JSON feeds into Google Sheets (a community Apps Script, not built in)
http://date.jsontest.com/
{
"time": "11:35:24 AM",
"milliseconds_since_epoch": 1493552124786,
"date": "02-14-2014"
}
Learn more:
http://blog.fastfedora.com/projects/import-json
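Concrete versions of the three formulas above (the first two URLs and queries are placeholders; the last uses the date.jsontest.com feed shown earlier):

    =importXML("https://example.com/page", "//h2[@class='title']")
    =importHTML("https://example.com/stats", "table", 1)
    =ImportJSON("http://date.jsontest.com", "/date", "noInherit")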
Scraping with Screaming Frog
Using custom extraction and filters
Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/
Import.io example
Run on a frequency you set + stores data historically
Predictive model for real estate value
Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science
Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML
Scrape Similar (“Scraper”)
Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/
Diffbot.com
4 main APIs that use artificial intelligence for data extraction:
1. Article: clean article text, HTML, author, date info, related images, videos
2. Discussion: content of forum threads, article comments, product reviews
3. Product: pricing information, product IDs, images, product specs
4. Video: author/uploader, duration, title, description, date uploaded, stats
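For example, a hedged sketch of calling the Article API with Python; the v3-style endpoint and response shape are from memory, so check Diffbot’s current docs, and DIFFBOT_TOKEN is a placeholder:

    import requests

    resp = requests.get(
        "https://api.diffbot.com/v3/article",
        params={"token": "DIFFBOT_TOKEN", "url": "https://example.com/an-article"},
    )
    article = resp.json()["objects"][0]
    print(article.get("title"), article.get("author"), article.get("date"))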
Getting help with data scraping
Find scraping experts
Upwork
Freelancer.com
Codementor.io (CodementorX)
Briefing a freelancer
Inputs:
1. Project goal
2. List of URLs:
   a. Provide it yourself
   b. Provide an endpoint and a pattern of URLs that you’d like captured
3. Specific inputs into any filters/data input fields which may be required to capture all the data combinations:
   a. Form values (numbers, sliders, etc.)
   b. Login details
4. Technical requirements:
   a. Location of IP when scraping
   b. Frequency of scrape
   c. Scraping language
Outputs:
1. Where will the data be stored?
   a. Local file (CSV, TXT)
   b. Database (SQLite)
   c. Stored on a webserver
2. Provide an example spreadsheet showing how you would like the data presented
3. Specify any data manipulation needed to have clean output from the scrape
4. Specify how the data will be used:
   a. HTML table, or
   b. Single-page application (React/AngularJS) embedded with oEmbed
Avoid getting blocked
● Spoof your User-Agent header as Googlebot
● Run the scrape from multiple IP addresses
● Run the scrape slowly
● Be careful scraping behind a login
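A sketch of the first three tactics in Python (the URL list is a placeholder; swap in a proxies mapping to rotate IPs):

    import random
    import time
    import requests

    # Spoofed Googlebot header (per the disclaimer earlier, this may breach a site's ToS)
    headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}

    urls = [f"https://example.com/page/{i}" for i in range(1, 51)]
    for url in urls:
        # Optionally pass proxies={"https": "http://user:pass@proxyhost:port"} to rotate IPs
        resp = requests.get(url, headers=headers)
        # ... parse resp.text here ...
        time.sleep(random.uniform(2, 5))  # run the scrape slowly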
Data scraping services
Typically $2k+ per project… ouch!
Priceonomics
Promptcloud
Scrapinghub
Datahen
How to turn your data into visually appealing content
Maps
To do this you need to capture addresses or latitude/longitude coordinates
Batchgeo
Turn spreadsheets into maps
Charts
Easiest way to visualise data, but the hardest to make look sexy with Excel & Google Sheets
Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
Tip: Tableau suggests charts
Place your data set in Tableau and use the ‘Show Me’ functionality
Interactive Tables
Helpful to use a database tool for larger datasets
https://tablepress.org/
Interactive visualisations
Highly engaging and allows the user to filter the data
Source: https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/
Inspect element to find frameworks
This visualisation is using amcharts.com
Interactive visualisation brief example
Download the full brief template
http://bit.ly/datavizbrief
Data visualisation inspiration
Graphiq.com
TheAtlas.com
Flowingdata.com
Storybench.org
Dribbble
reddit.com/r/Dataisbeautiful
Blueprints for data-driven content marketing
Provide a new dimension on a dataset
How? IMDB + the idea that people want their favourite TV shows to come back on air
335,830 views
142 Linking root domains
Recognise patterns and service them
How? Combined results from NBN map search + real estate listings
1,000+ new users within 72 hours
Display data in an accessible format
How? Allflicks.net combining IMDB with Netflix library plus filters
1.13k Linking root domains
● Filterable
● Sortable
● Categorised
● Indexable!
Visualise trends
How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com
426 Linking root domains
Big data analysis ‘taster’
How? Scraped Google to analyse rich snippets + a blog post with a ‘taste’ of the data
128 Linking root domains
+ Lead source
Want more ideas?
1. Scrape an online community to get a list of URLs and their:
   a. Post titles
   b. # of upvotes
   c. # of comments
   d. Date posted
2. Mash the data together with social shares and link data using URL Profiler
3. Analyse the data using pivot tables in Excel or Google Sheets
Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/
Scrape reddit, growthhackers.com, inbound.org, Hacker News
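A minimal pandas version of step 3 above, assuming hypothetical scraped data already mashed together:

    import pandas as pd

    # Hypothetical scraped community posts merged with share counts
    df = pd.DataFrame({
        "title": ["Post A", "Post B", "Post C", "Post D"],
        "topic": ["SEO", "SEO", "PPC", "PPC"],
        "upvotes": [120, 45, 80, 200],
        "shares": [300, 90, 150, 410],
    })

    # Pivot-table-style summary: average engagement per topic
    summary = df.pivot_table(index="topic", values=["upvotes", "shares"], aggfunc="mean")
    print(summary)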
Promoting data-driven content
Content Distribution
Supernodes:
Reddit
Digg
Hacker News
Slashdot
Inbound.org
Q&A websites (Quora, etc)
Online communities
Forums
Subreddits
Facebook Groups/Pages
List of content distribution websites: bit.ly/content-distribution-list
Good ol’ fashioned reachout
Find websites with audiences that will be interested in your data.
Give journalists and bloggers a unique angle, and potentially a different dimension on the dataset, so they can write their own unique story.
Make contact - don’t be afraid to use the phone or go for a coffee.
Build email lists
Even small email lists can be powerful for spreading your content online
We are always hiring!
finder.com.au/careers
jeremy@finder.com

Editor's Notes

  • #5 Originally part of Y Combinator; blogged Airbnb rates
  • #7 Combined data from the Pitchfork music review website + Facebook likes for each review. Pitchfork: music reviews for independent music. Facebook likes: artists with the fewest likes were the most hipster, because the second criterion is that it should be a band you’ve never heard of
  • #13 Understanding this will help you understand page structures and what’s possible
  • #19 Data.gov.au; Kaggle, the data science community recently bought by Google, publishes a lot of datasets
  • #22 Talk about how they aren’t niche; second opinion
  • #24 https://blog.hartleybrody.com/web-scraping/
  • #25 Talk about other JSON examples
  • #27 30 seconds to produce this
  • #28 Point and click; can run on a regular basis
  • #29 An easy-to-use tool for intermediate to advanced users who are comfortable with XPath. More advanced than ImportXML; allows you to capture more information than is possible in Google Sheets
  • #36 Pricey. Expect to pay upwards of $2k per project
  • #49 Where to live next: where they can get the NBN