Jeremy cabral search marketing summit - scraping data-driven content (1)

Scraping data to drive
content marketing
campaigns
(without knowing how to code)
@jeremycabral

Insights from
Analyzing 1
Million Articles
“Original research based content
has the potential to achieve much
higher numbers of domain links
than other forms of content”
- Steve Rayson (Director -
BuzzSumo)
BuzzSumo Study

Priceonomics
From price guides to content
marketing
Pivot to data-driven
content marketing
23,000+ linking root domains

Price comparison: Airbnb vs Hotels
125
Linking root
domains
URL: https://priceonomics.com/hotels/

The Hipster Music Index
204,219
views
92
Linking root domains
URL: https://priceonomics.com/the-hipster-music-index/

Data mining fuels fast,
cheap and repeatable
content marketing ideas

But… what if the data you
need isn’t available by API
or downloadable?

Disclaimer
Seek legal advice before
committing to a scraping
project
Scraping data could breach the
terms of service of a website
Scraping at a disruptive rate
could slow down or even crash
a website

What is data
scraping?
Data scraping is an automated way
of using scripts and crawlers to:
1. Fetch a page
2. Parse the data in that page to
extract information
3. Format the data in an
organised way
4. Store or export that data to
create a dataset (DB, CSV,
TXT etc)
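
The four steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a production scraper: the sample HTML, the table layout and the function names are assumptions made up for the example, and `fetch` is defined but not called so the sketch runs without touching a real website.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: fetch a page and return its HTML as text (not called below)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class TableRowParser(HTMLParser):
    """Step 2: parse the HTML and collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Step 3: format the data by trimming whitespace from every cell.
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def parse_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Step 4: export the rows as CSV text (write it to a file to store it)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical page content standing in for the result of fetch(url).
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Price</th></tr>
  <tr><td> Sydney </td><td>120</td></tr>
  <tr><td>Melbourne</td><td>95</td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

In a real project `SAMPLE_HTML` would be replaced by `fetch(url)` for each page in the URL list.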

Patterns in HTML & CSS
It’s easier to scrape content broken up by a unique id or class assigned to the
element you want to extract

Basic overview of XPath
XPath can be used to navigate through
elements and attributes in a document
Important to understand how tags are nested as
a scraper will follow this tree
Learn more:
https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
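
As a rough illustration of how an XPath query follows the nesting of tags, the sketch below uses Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup. The sample markup, class names and values are assumptions invented for the example; real pages are often messier and may need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: two listings nested inside a container.
SAMPLE = """
<html>
  <body>
    <div id="listings">
      <div class="listing">
        <span class="name">Hotel A</span>
        <span class="price">150</span>
      </div>
      <div class="listing">
        <span class="name">Hotel B</span>
        <span class="price">99</span>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# ".//div[@class='listing']" finds every listing div at any depth;
# "/span[@class='name']" then steps down to the child span with that class.
names = [el.text for el in root.findall(".//div[@class='listing']/span[@class='name']")]
prices = [int(el.text) for el in root.findall(".//div[@class='listing']/span[@class='price']")]

print(list(zip(names, prices)))
```

The same query strings can usually be tried in a browser's developer tools first, before they go into a script or a spreadsheet formula.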

Finding an API
Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

Important
Excel analysis
skills
1. Match the same data across
multiple spreadsheets:
a. VLOOKUP
b. INDEX MATCH
2. Summarising data
a. Pivot Tables
b. Charts
3. Cleaning data
a. =TRIM()
b. =SPLIT() (Google Sheets; in Excel use Text to Columns)
Learn more:
● https://www.distilled.net/excel-for-seo/
● https://trumpexcel.com/clean-data-in-excel/
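
For readers who later move from spreadsheets to scripts, the three skills above map onto a few lines of Python. This is a sketch with made-up data: the suburb names, prices and postcode table are hypothetical, and `strip()` only approximates `=TRIM()` because it removes leading and trailing spaces but does not collapse repeated spaces inside the text.

```python
from collections import defaultdict

# Hypothetical scraped rows; note the stray whitespace a spreadsheet TRIM would remove.
listings = [
    {"suburb": " Bondi ", "price": 900},
    {"suburb": "Bondi", "price": 1100},
    {"suburb": "Newtown ", "price": 700},
]

# Hypothetical second sheet to match against (the lookup table).
postcodes = {"Bondi": "2026", "Newtown": "2042"}

for row in listings:
    # Cleaning: strip() plays roughly the role of =TRIM().
    row["suburb"] = row["suburb"].strip()
    # Matching: a dictionary lookup stands in for VLOOKUP or INDEX MATCH.
    row["postcode"] = postcodes.get(row["suburb"], "")

# Summarising: group by suburb and average the price, like a simple pivot table.
prices_by_suburb = defaultdict(list)
for row in listings:
    prices_by_suburb[row["suburb"]].append(row["price"])

average_price = {
    suburb: sum(prices) / len(prices)
    for suburb, prices in prices_by_suburb.items()
}

print(average_price)
```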

Web Apps
● Engines / Listings (product
data, reviews)
● Search results (with filters
applied)

Calculators
● Automatically input values
with scripts
● Store every calculator
results combination
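
Storing every combination of calculator inputs can be sketched as below. The `repayment` function is a hypothetical stand-in for whatever the online calculator does (here, a standard loan repayment formula); in a real scrape a script would submit each combination to the calculator and record what comes back. The input values are assumptions chosen for the example.

```python
from itertools import product


def repayment(principal, annual_rate, years):
    """Hypothetical stand-in for the calculator: monthly repayment on a loan."""
    monthly_rate = annual_rate / 12
    payments = years * 12
    if monthly_rate == 0:
        return principal / payments
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -payments)


# Assumed input values to sweep through.
principals = [100_000, 200_000]
rates = [0.03, 0.05]
terms = [15, 30]

# product() yields every combination of the three input lists.
results = [
    {
        "principal": p,
        "rate": r,
        "years": y,
        "monthly": round(repayment(p, r, y), 2),
    }
    for p, r, y in product(principals, rates, terms)
]

print(len(results), "combinations stored")
```

The number of combinations grows quickly with each extra input, so it is worth limiting each field to the values that matter for the story.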

Public
Datasets
● Upside: easy to download,
regularly maintained by
others
● Downside: everyone has
easy access to the same
data as you

APIs
● Upside: everything is
structured and (often)
documented
● Downside: sometimes not all
data is available in an API

Scraping
Frameworks &
Languages
Popular languages
PHP
Python
Ruby
Perl
Node.js
These are important for your
own development or choosing a
freelancer
Try and use a language your
developers are familiar with

Simulating the
user in the
browser
● Selenium Web Driver
● PhantomJS w/CasperJS

Data scraping tools
Desktop tools
Scrape Similar
artoo.js
Tabula - extract tables from PDFs
Parsehub (free & paid versions)
Screaming Frog
URL Profiler
Scripts run on your local machine
Hosted Services
Google Sheets (ImportXML,
ImportJSON, ImportHTML)
Import.io - automatic page scraper
Mozenda - point and click screen
scraping (Windows only)
Diffbot (Artificial Intelligence)
Connotate

Scraping with Google Sheets
Google Sheets Formulas (built in)
=importXML(url, xpath_query) -- imports
structured data using XPath
=importHTML(url, query, index) – imports data
from a table or list within an HTML page. Index
identifies which table in the source code
Learn more:
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
=ImportJSON(url, query, parseOptions) --
imports JSON feeds into Google Sheets
(a custom script, not a built-in formula)
http://date.jsontest.com/
{
"time": "11:35:24 AM",
"milliseconds_since_epoch": 1493552124786,
"date": "02-14-2014"
}
Learn more:
http://blog.fastfedora.com/projects/import-json
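
Outside a spreadsheet, the same JSON feed can be read with a few lines of Python. The sketch below parses a copy of the sample response shown above rather than calling the live URL, and converts the epoch field into a readable timestamp. The epoch value converts to 30 April 2017, which differs from the sample's `date` field, so treat the sample as illustrative only.

```python
import json
from datetime import datetime, timezone

# Copy of the sample response shown on the slide.
payload = """
{
  "time": "11:35:24 AM",
  "milliseconds_since_epoch": 1493552124786,
  "date": "02-14-2014"
}
"""

record = json.loads(payload)

# The feed reports milliseconds; Python's datetime expects seconds.
timestamp = datetime.fromtimestamp(
    record["milliseconds_since_epoch"] / 1000, tz=timezone.utc
)

print(record["time"], timestamp.strftime("%Y-%m-%d %H:%M:%S"))
```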

Scraping with
Screaming
Frog
Using custom extraction and
filters
Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/

Import.io example
Run on a frequency you set + stores data historically

Predictive model for real estate
value
Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science
Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML

Scrape Similar (“Scraper”)
Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/

Diffbot.com
4 main APIs that use artificial
intelligence for data extraction
1. Article: clean text from article,
html, author, date info, related
images, videos
2. Discussion: content of forum
threads, article comments,
product reviews
3. Product: pricing information,
product IDs, images, product
specs
4. Video: Author/uploader,
duration, title, description, date
uploaded, stats.

Getting help with data
scraping

Find scraping
experts
Upwork
Freelancer.com
Codementor.io (CodementorX)

Briefing a freelancer
Inputs:
1. Project Goal
2. List of URLs
1. Provide it yourself
2. Provide an endpoint and a pattern of URLs
that you’d like captured
3. Specific inputs into any filters/data input fields
which may be required to capture all the data
combinations
1. Form values (numbers, sliders, etc)
2. Login details
4. Technical requirements
1. Location of IP when scraping
2. Frequency of scrape
3. Scraping language
Outputs:
1. Where will the data be stored?
a. Local file (CSV, TXT)
b. Database (SQLite)
c. Stored on webserver
2. Provide an example spreadsheet showing
how you would like the data presented
3. Specify any data manipulation needed to
have clean output from the scrape
4. Specify how the data will be used
a. HTML Table or
b. Single page application (React/Angular JS)
embedded with oEmbed

Avoid getting
blocked
● Spoof the User-Agent header as Googlebot
● Run scrape from multiple IP
addresses
● Run the scrape slowly
● Be careful scraping behind a
login
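
"Run the scrape slowly" can be as simple as pausing between requests. The sketch below is one possible approach, not a complete anti-blocking strategy: the function name, the two-second default and the fake `fetch` used in the demo are all assumptions for illustration. The `fetch` and `sleep` arguments are passed in so the idea can be tried without hitting a real website.

```python
import time


def crawl_slowly(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to avoid disrupting the site."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Demo with a fake fetch function and no real waiting (hypothetical URLs).
demo_pages = crawl_slowly(
    ["https://example.com/page/1", "https://example.com/page/2"],
    fetch=lambda url: "<html>content of " + url + "</html>",
    delay_seconds=0,
)

print(len(demo_pages), "pages fetched")
```

A longer delay is usually the safer choice when the target site is small or the page list is long.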

Data scraping
services
Typically $2k+ per project...
ouch!
Priceonomics
Promptcloud
Scrapinghub
Datahen

How to turn your data into
visually appealing content

Maps
In order to do this you need to
capture addresses or
latitude/longitude

Batchgeo
Turn spreadsheets into maps

Charts
Easiest way to visualise data,
hardest to make look sexy with
Excel & Google Sheets
Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/

Tip: Tableau
suggests
charts
Place your data set in Tableau and
use the ‘show me’ functionality

Interactive
Tables
Helpful to use a database tool
for larger datasets
https://tablepress.org/

Interactive
visualisations
Highly engaging and allows the
user to filter the data
Source:
https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/

Inspect
element to find
frameworks
This visualisation is using
amcharts.com

Interactive
visualisation
brief example
Download the full brief template
http://bit.ly/datavizbrief

Data
visualisation
inspiration
Graphiq.com
TheAtlas.com
Flowingdata.com
Storybench.org
Dribbble
reddit.com/r/Dataisbeautiful

Blueprints for data-driven
content marketing

Provide a new dimension on a
dataset
How? IMDB + the idea that people want their favourite TV shows to come back on air
335,830
views
142

Recognise patterns and service
them
How? Combined results from NBN map search + real estate listings
1,000+
New users within 72
hours

Display data in an accessible format
How? Allflicks.net combining IMDB with Netflix library plus filters
1.13k
● Filterable
● Sortable
● Categorised
● Indexable!

Visualise trends
How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com
426

Big data analysis ‘taster’
How? Scraped Google to analyse rich snippets + blog post with ‘taste’ of the data
128
+
Lead source

Want more
ideas?
1. Scrape an online community to get a
list of URLs and their
a. Post titles
b. # of Upvotes
c. # of comments
d. Date posted
2. Mash together the data with social
shares, link data using URL Profiler
3. Analyse the data using pivot tables in
Excel or Google Sheets
Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/
Scrape Reddit,
growthhackers.com,
inbound.org, Hacker News

Content
Distribution
Supernodes:
Reddit
Digg
Hacker News
Slashdot
Inbound.org
Q&A websites (Quora, etc)
Online communities
Forums
Subreddits
Facebook Groups/Pages
List of content distribution websites: bit.ly/content-distribution-list

Good ol’
fashioned
reachout
Find websites with audiences that
will be interested in your data
Give journalists and bloggers a
unique angle and potentially a
different dimension on the
dataset so they can write their
own unique story
Make contact - don’t be afraid to
use the phone or go for a
coffee

Build email
lists
Even small email lists can be
powerful to spread your content
online

We are always hiring!
finder.com.au/careers
jeremy@finder.com
