Scraping data to drive
content marketing
campaigns
(without knowing how to code)
@jeremycabral
Data-driven content
DOMINATES
Insights from
Analyzing 1
Million Articles
“Original research based content
has the potential to achieve much
higher numbers of domain links
than other forms of content”
- Steve Rayson (Director -
BuzzSumo)
BuzzSumo Study
Priceonomics
From price guides to content
marketing
Pivot to data-driven
content marketing
23,000+ linking root domains
Price comparison: Airbnb vs Hotels
125
Linking root
domains
URL: https://priceonomics.com/hotels/
The Hipster Music Index
204,219
views
92
Linking root domains
URL: https://priceonomics.com/the-hipster-music-index/
Data mining fuels fast,
cheap and repeatable
content marketing ideas
But… what if the data you
need isn’t available by API
or downloadable?
Disclaimer
Seek legal advice before
committing to a scraping
project
Scraping data could breach the
terms of service of a website
Scraping at a disruptive rate
could slow down or even crash
a website
What is data
scraping?
Data scraping is an automated way of
using scripts and crawlers to:
1. Fetch a page
2. Parse the data in that page to
extract information
3. Format the data in an
organised way
4. Store or export that data to
create a dataset (DB, CSV,
TXT etc)
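The four steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production scraper: the sample HTML, the `name` and `price` class names, and the helper names are all assumptions made for the example, and `fetch` is defined but not called so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: fetch a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class ClassTextParser(HTMLParser):
    """Step 2: collect the text of every element carrying a given class.

    Simplification: assumes matching elements are not nested inside
    another element with the same tag name.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing_tag = None
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.capturing_tag = tag
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.capturing_tag:
            self.capturing_tag = None

    def handle_data(self, data):
        if self.capturing_tag:
            self.values[-1] += data


def extract(html, wanted_class):
    parser = ClassTextParser(wanted_class)
    parser.feed(html)
    return [value.strip() for value in parser.values]


def scrape(html):
    """Step 3: format the extracted values into tidy rows."""
    names = extract(html, "name")
    prices = extract(html, "price")
    return [
        {"name": name, "price": float(price.lstrip("$"))}
        for name, price in zip(names, prices)
    ]


def to_csv(rows):
    """Step 4: export the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up page content standing in for the result of fetch(url).
SAMPLE_HTML = """
<ul>
  <li><span class="name">Hotel A</span> <span class="price">$120</span></li>
  <li><span class="name">Hotel B</span> <span class="price">$95.50</span></li>
</ul>
"""

rows = scrape(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

In a real project the call would be `scrape(fetch(url))`, and the CSV text would be written to a file or loaded into a database.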
Patterns in HTML & CSS
It’s easier to scrape content broken up by a unique id or class assigned to the
element you want to extract
Basic overview of XPath
XPath can be used to navigate through
elements and attributes in a document
Important to understand how tags are nested as
a scraper will follow this tree
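To make the nesting point concrete, here is a small sketch using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup, ids and class names are invented for the example, and the parser needs well-formed markup, so real-world pages may call for a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; real pages are rarely this tidy.
DOC = """
<html>
  <body>
    <div id="listings">
      <div class="listing">
        <h2>Hotel A</h2>
        <span class="price">120</span>
      </div>
      <div class="listing">
        <h2>Hotel B</h2>
        <span class="price">95</span>
      </div>
    </div>
    <div id="footer"><span class="price">0</span></div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# Follow the tree: the div with id="listings", then its "listing" children.
listings = root.findall(".//div[@id='listings']/div[@class='listing']")
names = [item.find("h2").text for item in listings]
prices = [int(item.find("span[@class='price']").text) for item in listings]

# A looser query that ignores nesting also picks up the footer price.
all_prices = [int(el.text) for el in root.findall(".//span[@class='price']")]

print(names, prices, all_prices)
```

The second query shows why the nesting matters: matching on the class alone pulls in an element from a different part of the tree.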
Learn more:
https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
Finding an API
Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
Important
Excel analysis
skills
1. Match the same data across
multiple spreadsheets:
a. VLOOKUP
b. INDEX MATCH
2. Summarising data
a. Pivot Tables
b. Charts
3. Cleaning data
a. =TRIM()
b. =SPLIT() (Google Sheets; in Excel use Text to Columns)
Learn more:
● https://www.distilled.net/excel-for-seo/
● https://trumpexcel.com/clean-data-in-excel/
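For readers who end up handling scraped data in a script rather than a spreadsheet, the matching and cleaning skills above have rough equivalents in plain Python. The sketch below is only an analogy: the two "sheets" are made-up sample data, and the column names are assumptions.

```python
import csv
import io

# Made-up sample data standing in for two spreadsheets.
SHEET_A = "page,links\n /guide-a ,12\n/guide-b,7\n"
SHEET_B = "page,views\n/guide-a,3400\n/guide-b,1250\n"


def read_rows(text):
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Like =TRIM(): strip stray whitespace from every cell.
        rows.append({key: value.strip() for key, value in row.items()})
    return rows


links = read_rows(SHEET_A)

# Like VLOOKUP / INDEX MATCH: a lookup table keyed on the shared column.
views_by_page = {row["page"]: int(row["views"]) for row in read_rows(SHEET_B)}

merged = [
    {
        "page": row["page"],
        "links": int(row["links"]),
        "views": views_by_page.get(row["page"]),  # None if there is no match
    }
    for row in links
]

# Like =SPLIT(): break one cell into parts on a delimiter.
parts = merged[0]["page"].strip("/").split("-")

print(merged, parts)
```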
Data sources
Web Apps
● Engines / Listings (product
data, reviews)
● Search results (with filters
applied)
Calculators
● Automatically input values
with scripts
● Store every combination of
calculator results
Public
Datasets
● Upside: easy to download,
regularly maintained by
others
● Downside: everyone has
easy access to the same
data as you
APIs
● Upside: everything is
structured and (often)
documented
● Downside: sometimes not all
data is available in an API
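The calculator idea above (feed in values automatically, store every combination of results) can be sketched as a loop over an input grid. In this sketch a local loan-repayment formula stands in for the website's calculator so it runs offline; in a real project that function would be replaced by a script that fills in the site's form. The input values and field names are assumptions.

```python
import itertools


def monthly_repayment(principal, annual_rate, years):
    """Stand-in for a scraped calculator: amortised loan repayment."""
    monthly_rate = annual_rate / 12
    payments = years * 12
    if monthly_rate == 0:
        return principal / payments
    factor = (1 + monthly_rate) ** payments
    return principal * monthly_rate * factor / (factor - 1)


# Hypothetical input grid: every combination gets one row in the dataset.
principals = [300_000, 500_000]
rates = [0.04, 0.05, 0.06]
terms = [25, 30]

results = [
    {
        "principal": p,
        "rate": r,
        "years": y,
        "repayment": round(monthly_repayment(p, r, y), 2),
    }
    for p, r, y in itertools.product(principals, rates, terms)
]

print(len(results), "combinations stored")
```

Two principals, three rates and two terms give twelve rows, which is the kind of table that can later feed a chart or an interactive tool.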
How to get the data
Scraping
Frameworks &
Languages
Popular languages
PHP
Python
Ruby
Perl
Node.js
These are important for your
own development or choosing a
freelancer
Try to use a language your
developers are familiar with
Simulating the
user in the
browser
● Selenium WebDriver
● PhantomJS w/CasperJS
Data scraping tools
Desktop tools
Scrape Similar
artoo.js
Tabula - extract tables from PDFs
Parsehub (free & paid versions)
Screaming Frog
URL Profiler
Scripts run on your local machine
Hosted Services
Google Sheets (ImportXML,
ImportJSON, ImportHTML)
Import.io - automatic page scraper
Mozenda - point and click screen
scraping (Windows only)
Diffbot (Artificial Intelligence)
Connotate
Scraping with Google Sheets
Google Sheets Formulas (built in)
=importXML(url, xpath_query) -- imports
structured data using XPath
=importHTML(url, query, index) -- imports data
from a table or list within an HTML page. query is
"table" or "list"; index (starting at 1) identifies
which one in the source code to return
Learn more:
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
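For anyone curious what importHTML is doing conceptually, the sketch below mimics the "table" case in Python: collect every table in a page, then return the one at a 1-based index. It is an illustration only, not how Google Sheets is implemented; the class and function names and the sample page are made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.tables[-1][-1][-1] += data


def import_html_table(html, index):
    """Return the index-th table on the page, counting from 1."""
    parser = TableParser()
    parser.feed(html)
    table = parser.tables[index - 1]
    return [[cell.strip() for cell in row] for row in table]


# Made-up page with two tables.
PAGE = """
<table><tr><th>Site</th></tr><tr><td>A</td></tr></table>
<table>
  <tr><th>City</th><th>Price</th></tr>
  <tr><td>Sydney</td><td>210</td></tr>
</table>
"""

first = import_html_table(PAGE, 1)
second = import_html_table(PAGE, 2)
print(second)
```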
=ImportJSON(url, query, parseOptions) --
imports JSON feeds into Google Sheets (a custom
script function, not built in)
http://date.jsontest.com/
{
"time": "11:35:24 AM",
"milliseconds_since_epoch": 1493552124786,
"date": "04-30-2017"
}
Learn more:
http://blog.fastfedora.com/projects/import-json
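Outside Google Sheets, a JSON feed like the sample above is just as easy to read with a script. The sketch below parses a payload with the same timestamp fields as the sample and turns the epoch value into a date and time; the payload is pasted in as a string so nothing is fetched, and UTC is assumed.

```python
import json
from datetime import datetime, timezone

# Payload with the timestamp fields from the sample response.
FEED = """
{
  "time": "11:35:24 AM",
  "milliseconds_since_epoch": 1493552124786
}
"""

record = json.loads(FEED)

# The epoch value is in milliseconds; datetime expects seconds.
seconds = record["milliseconds_since_epoch"] // 1000
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)

# One tidy row, ready for a spreadsheet or CSV file.
row = [moment.date().isoformat(), moment.time().isoformat()]
print(row)
```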
Scraping with
Screaming
Frog
Using custom extraction and
filters
Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/
Import.io example
Runs at a frequency you set and stores data historically
Predictive model for real estate
value
Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science
Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML
Scrape Similar (“Scraper”)
Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/
Diffbot.com
4 main APIs that use artificial
intelligence for data extraction
1. Article: clean text from article,
html, author, date info, related
images, videos
2. Discussion: content of forum
threads, article comments,
product reviews
3. Product: pricing information,
product IDs, images, product
specs
4. Video: Author/uploader,
duration, title, description, date
uploaded, stats.
Getting help with data
scraping
Find scraping
experts
Upwork
Freelancer.com
Codementor.io (CodementorX)
Briefing a freelancer
Inputs:
1. Project Goal
2. List of URLs
1. Provide it yourself
2. Provide an endpoint and a pattern of URLs
that you’d like captured
3. Specific inputs into any filters/data input fields
which may be required to capture all the data
combinations
1. Form values (numbers, sliders, etc)
2. Login details
4. Technical requirements
1. Location of IP when scraping
2. Frequency of scrape
3. Scraping language
Outputs:
1. Where will the data be stored?
a. Local file (CSV, TXT)
b. Database (SQLite)
c. Stored on webserver
2. Provide an example spreadsheet showing
how you would like the data presented
3. Specify any data manipulation needed to
have clean output from the scrape
4. Specify how the data will be used
a. HTML Table or
b. Single page application (React/Angular JS)
embedded with oEmbed
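One of the inputs listed above is "an endpoint and a pattern of URLs". A brief can make that pattern concrete by showing how the full URL list is generated. The sketch below is one way to do it; the endpoint, the `suburb` and `page` parameters, and the suburb names are all hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://example.com/listings"  # hypothetical endpoint


def build_urls(endpoint, suburbs, pages):
    """Expand a URL pattern into the full list of pages to capture."""
    urls = []
    for suburb in suburbs:
        for page in range(1, pages + 1):
            query = urlencode({"suburb": suburb, "page": page})
            urls.append(f"{endpoint}?{query}")
    return urls


urls = build_urls(ENDPOINT, ["Surry Hills", "Newtown"], pages=2)
for url in urls:
    print(url)
```

Handing a freelancer the generated list, or the pattern plus the input values, removes any ambiguity about which pages are in scope.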
Avoid getting
blocked
● Spoof the User-Agent header as Googlebot
● Run scrape from multiple IP
addresses
● Run the scrape slowly
● Be careful scraping behind a
login
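"Run the scrape slowly" usually just means pausing between requests. A minimal sketch, assuming a two-second default delay (an arbitrary choice) and a `fetch` function supplied by the caller; a fake `fetch` and a fake `sleep` are used in the demonstration so it runs instantly and offline.

```python
import time


def scrape_slowly(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position:  # no need to wait before the first request
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Demonstration with stand-ins: nothing is downloaded and nothing sleeps.
waits = []
pages = scrape_slowly(
    ["a", "b", "c"],
    fetch=lambda url: url.upper(),  # fake fetch
    delay_seconds=1.5,
    sleep=waits.append,             # record the pauses instead of sleeping
)
print(pages, waits)
```

In real use, leave `sleep` at its default and pass a function that actually downloads the page.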
Data scraping
services
Typically $2k+ per project...
ouch!
Priceonomics
Promptcloud
Scrapinghub
Datahen
How to turn your data into
visually appealing content
Maps
In order to do this you need to
capture addresses or
latitude/longitude
Batchgeo
Turn spreadsheets into maps
Charts
Easiest way to visualise data,
hardest to make look sexy with
Excel & Google Sheets
Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
Tip: Tableau
suggests
charts
Place your data set in Tableau and
use the ‘show me’ functionality
Interactive
Tables
Helpful to use a database tool
for larger datasets
https://tablepress.org/
Interactive
visualisations
Highly engaging and allows the
user to filter the data
Source:
https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/
Inspect
element to find
frameworks
This visualisation is using
amcharts.com
Interactive
visualisation
brief example
Download the full brief template
http://bit.ly/datavizbrief
Data
visualisation
inspiration
Graphiq.com
TheAtlas.com
Flowingdata.com
Storybench.org
Dribbble
reddit.com/r/Dataisbeautiful
Blueprints for data-driven
content marketing
Provide a new dimension on a
dataset
How? IMDB + the idea that people want their favourite TV shows to come back on air
335,830
views
142
Linking root domains
Recognise patterns and service
them
How? Combined results from NBN map search + real estate listings
1,000+
New users within 72
hours
Display data in an accessible format
How? Allflicks.net combining IMDB with Netflix library plus filters
1.13k
Linking root domains
● Filterable
● Sortable
● Categorised
● Indexable!
Visualise trends
How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com
426
Linking root domains
Big data analysis ‘taster’
How? Scraped Google to analyse rich snippets + blog post with ‘taste’ of the data
128
Linking root domains
+
Lead source
Want more
ideas?
1. Scrape an online community to get a
list of URLs and their
a. Post titles
b. # of Upvotes
c. # of comments
d. Date posted
2. Mash together the data with social
shares, link data using URL Profiler
3. Analyse the data using pivot tables in
Excel or Google Sheets
Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/
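The pivot-table step in the list above boils down to grouping rows by one column and summing another. A small sketch of that idea in Python, using made-up rows in place of a real scraped community listing; the field names mirror the list above but are otherwise assumptions.

```python
from collections import defaultdict

# Made-up rows standing in for a scraped community listing.
posts = [
    {"title": "Post one", "upvotes": 120, "comments": 14, "posted": "2017-03-02"},
    {"title": "Post two", "upvotes": 45, "comments": 3, "posted": "2017-03-19"},
    {"title": "Post three", "upvotes": 80, "comments": 22, "posted": "2017-04-07"},
]


def pivot(rows, key, value):
    """Group rows by key(row) and sum the chosen value column."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[value]
    return dict(totals)


# Total upvotes and comments per month (first 7 characters of the date).
upvotes_by_month = pivot(posts, key=lambda row: row["posted"][:7], value="upvotes")
comments_by_month = pivot(posts, key=lambda row: row["posted"][:7], value="comments")

print(upvotes_by_month, comments_by_month)
```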
Scrape Reddit,
growthhackers.com,
inbound.org, Hacker News
Promoting data-driven
content
Content
Distribution
Supernodes:
Reddit
Digg
Hacker News
Slashdot
Inbound.org
Q&A websites (Quora, etc)
Online communities
Forums
Subreddits
Facebook Groups/Pages
List of content distribution websites: bit.ly/content-distribution-list
Good ol’
fashioned
outreach
Find websites with audiences that
will be interested in your data
Give journalists and bloggers a
unique angle and potentially a
different dimension on the
dataset so they can write their
own unique story
Make contact - don’t be afraid to
use the phone or go for a
coffee
Build email
lists
Even small email lists can be
powerful to spread your content
online
We are always hiring!
finder.com.au/careers
jeremy@finder.com

Digital Strategy Master Class - Andrew Rupert
Digital Strategy Master Class - Andrew RupertDigital Strategy Master Class - Andrew Rupert
Digital Strategy Master Class - Andrew Rupert
 
BLOOM_May2024 (r). Balmer Lawrie Online Monthly Bulletin
BLOOM_May2024 (r). Balmer Lawrie Online Monthly BulletinBLOOM_May2024 (r). Balmer Lawrie Online Monthly Bulletin
BLOOM_May2024 (r). Balmer Lawrie Online Monthly Bulletin
 

Jeremy Cabral - Search Marketing Summit - Scraping data-driven content

  • 1. Scraping data to drive content marketing campaigns (without knowing how to code) @jeremycabral
  • 3. Insights from Analyzing 1 Million Articles “Original research based content has the potential to achieve much higher numbers of domain links than other forms of content” - Steve Rayson (Director - BuzzSumo) BuzzSumo Study
  • 4. Priceonomics From price guides to content marketing Pivot to data-driven content marketing 23,000+ linking root domains
  • 5. Price comparison: Airbnb vs Hotels 125 Linking root domains URL: https://priceonomics.com/hotels/
  • 6. The Hipster Music Index 204,219 views 92 Linking root domains URL: https://priceonomics.com/the-hipster-music-index/
  • 7. Data mining fuels fast, cheap and repeatable content marketing ideas
  • 8. But… what if the data you need isn’t available by API or downloadable?
  • 9. Disclaimer Seek legal advice before committing to a scraping project Scraping data could breach the terms of service of a website Scraping at a disruptive rate could slow down or even crash a website
  • 10. What is data scraping? Data scraping is an automated way using scripts and crawlers to 1. Fetch a page 2. Parse the data in that page to extract information 3. Format the data in an organised way 4. Store or export that data to create a dataset (DB, CSV, TXT etc)
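
The four steps on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: the page content, the `listing`/`name`/`price` class names, and the helper names are all made up for the example, and the fetch step is replaced by a hard-coded string so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (fetch): a real scraper would download the page, for example with
# urllib.request.urlopen(url). This hypothetical page stands in for the download.
PAGE = """
<ul>
  <li class="listing"><span class="name">Hotel A</span><span class="price">120</span></li>
  <li class="listing"><span class="name">Hotel B</span><span class="price">95</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Step 2 (parse): collect the text of each 'name' and 'price' span."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per <li class="listing">
        self._field = None  # class of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            self.rows.append({})
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(page):
    parser = ListingParser()
    parser.feed(page)
    # Step 3 (format): fixed column names, prices converted to integers.
    return [{"name": r["name"], "price": int(r["price"])} for r in parser.rows]


def to_csv(rows):
    # Step 4 (store/export): write CSV text; a real run would write to a file.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = scrape(PAGE)
csv_text = to_csv(rows)
print(csv_text)
```

The same structure holds whichever language or tool does the work: only the fetch and parse steps change from site to site, while the format and store steps can usually be reused.
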
  • 11. Patterns in HTML & CSS It’s easier to scrape content broken up by a unique id or class assigned to the element you want to extract
  • 12. Basic overview of XPath XPath can be used to navigate through elements and attributes in a document Important to understand how tags are nested as a scraper will follow this tree Learn more: https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
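
To make the nesting idea concrete, here is a small sketch using Python's standard-library `xml.etree.ElementTree`, which supports a limited subset of XPath. The document, the `results`/`product`/`price` names, and the values are invented for illustration. Real web pages are rarely well-formed XML, so in practice a fuller XPath engine (such as the one in lxml) would likely be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Note the footer also has a
# class="price" span that we do NOT want to capture.
DOC = """
<html>
  <body>
    <div id="results">
      <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
      <div class="product"><h2>Gadget</h2><span class="price">24.50</span></div>
    </div>
    <div id="footer"><span class="price">ignore me</span></div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# The path walks the tree level by level:
# body -> div with id "results" -> div with class "product" -> span with class "price".
price_path = "./body/div[@id='results']/div[@class='product']/span[@class='price']"
prices = [el.text for el in root.findall(price_path)]

# ".//" searches at any depth, so this finds every product heading.
names = [el.text for el in root.findall(".//div[@class='product']/h2")]

print(list(zip(names, prices)))
```

Anchoring the path on the unique `id` is what keeps the footer's stray `price` span out of the results, which is why unique ids and classes make a page easier to scrape.
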
  • 13. Finding an API Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
  • 14. Important Excel analysis skills 1. Match the same data across multiple spreadsheets: a. VLOOKUP b. INDEX MATCH 2. Summarising data a. Pivot Tables b. Charts 3. Cleaning data a. =TRIM() b. =SPLIT() (a Google Sheets function; in Excel, use Text to Columns) Learn more: ● https://www.distilled.net/excel-for-seo/ ● https://trumpexcel.com/clean-data-in-excel/
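
For readers who end up with the data in a script rather than a spreadsheet, the matching and cleaning skills above have direct equivalents. The sketch below is an assumption-laden illustration: both "sheets", their column names, and the numbers are hypothetical.

```python
# Two hypothetical "sheets": scraped pages and link data for some of them.
scraped = [
    {"url": " https://example.com/a ", "title": "Post A"},
    {"url": "https://example.com/b", "title": "Post B"},
    {"url": "https://example.com/c", "title": "Post C"},
]
link_data = [
    {"url": "https://example.com/a", "linking_domains": 12},
    {"url": "https://example.com/b", "linking_domains": 3},
]

# Cleaning, like =TRIM(): stray spaces would otherwise break the match.
for row in scraped:
    row["url"] = row["url"].strip()

# Matching, like VLOOKUP: build a lookup keyed on the shared column,
# then pull the value across. Missing keys give None (Excel would show #N/A).
lookup = {row["url"]: row["linking_domains"] for row in link_data}
merged = [{**row, "linking_domains": lookup.get(row["url"])} for row in scraped]

for row in merged:
    print(row)
```
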
  • 16. Web Apps ● Engines / Listings (product data, reviews) ● Search results (with filters applied)
  • 17. Calculators ● Automatically input values with scripts ● Store every calculator results combination
  • 18. Public Datasets ● Upside: easy to download, regularly maintained by others ● Downside: everyone has access easily to the same data as you
  • 19. APIs ● Upside: everything is structured and (often) documented ● Downside: sometimes not all data is available in an API
  • 20. How to get the data
  • 21. Scraping Frameworks & Languages Popular languages PHP Python Ruby Perl Node.js These are important for your own development or choosing a freelancer Try and use a language your developers are familiar with
  • 22. Simulating the user in the browser ● Selenium WebDriver ● PhantomJS w/CasperJS
  • 23. Data scraping tools Desktop tools Scrapesimilar artoo.js Tabula - extract tables from PDFs Parsehub (free & paid versions) Screaming Frog URL Profiler Scripts run on your local machine Hosted Services Google Sheets (ImportXML, ImportJSON, ImportHTML) Import.io - automatic page scraper Mozenda - point and click screen scraping (Windows only) DIFFBot (Artificial Intelligence) Connotate
  • 24. Scraping with Google Sheets Google Sheets Formulas (built in) =importXML(url, xpath_query) -- imports structured data using XPath =importHTML(url, query, index) -- imports data from a table or list within an HTML page. Index identifies which table in the source code Learn more: https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/ =ImportJSON(url, query, parseOptions) -- imports JSON feeds into Google Sheets (a custom script function, not built in) http://date.jsontest.com/ { "time": "11:35:24 AM", "milliseconds_since_epoch": 1493552124786, "date": "02-14-2014" } Learn more: http://blog.fastfedora.com/projects/import-json
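
As a rough picture of what a JSON import does, the sketch below takes the sample payload from the slide above and flattens it into a header row and a value row, the shape a spreadsheet needs. It is a simplified illustration in Python, not how ImportJSON itself is implemented; the payload is pasted in as a string so nothing is fetched.

```python
import json
from datetime import datetime

# The sample response shown on the slide, as a string (no network call).
PAYLOAD = (
    '{"time": "11:35:24 AM", '
    '"milliseconds_since_epoch": 1493552124786, '
    '"date": "02-14-2014"}'
)

record = json.loads(PAYLOAD)

# Keys become column headers, values become the row underneath.
header = list(record.keys())
row = [record[key] for key in header]

# Cleaning step: the date arrives as month-day-year text, so parse it
# into a real date before sorting or charting.
parsed_date = datetime.strptime(record["date"], "%m-%d-%Y").date()

print(header)
print(row)
print(parsed_date.isoformat())
```
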
  • 25. Scraping with Screaming Frog Using custom extraction and filters Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/
  • 26. Import.io example Run on a frequency you set + stores data historically
  • 27. Predictive model for real estate value Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML
  • 28. Scrape Similar (“Scraper”) Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/
  • 29. Diffbot.com 4 main APIs that use artificial intelligence for data extraction 1. Article: clean text from article, html, author, date info, related images, videos 2. Discussion: content of forum threads, article comments, product reviews 3. Product: pricing information, product IDs, images, product specs 4. Video: Author/uploader, duration, title, description, date uploaded, stats.
  • 30.
  • 31. Getting help with data scraping
  • 33. Briefing a freelancer Inputs: 1. Project Goal 2. List of URLs 1. Provide it yourself 2. Provide an endpoint and a pattern of URLs that you’d like captured 3. Specific inputs into any filters/data input fields which may be required to capture all the data combinations 1. Form values (numbers, sliders, etc) 2. Login details 4. Technical requirements 1. Location of IP when scraping 2. Frequency of scrape 3. Scraping language Outputs: 1. Where will the data be stored? a. Local file (CSV, TXT) b. Database (SQLite) c. Stored on webserver 2. Provide an example spreadsheet showing how you would like the data presented 3. Specify any data manipulation needed to have clean output from the scrape 4. Specify how the data will be used a. HTML Table or b. Single page application (React/Angular JS) embedded with oEmbed
  • 34. Avoid getting blocked ● Spoof header as Googlebot ● Run scrape from multiple IP addresses ● Run the scrape slowly ● Be careful scraping behind a login
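
"Run the scrape slowly" usually comes down to pausing between requests. The sketch below shows one way to do that in Python; `scrape_slowly`, `fake_fetch`, the URLs, and the two-second delay are all hypothetical choices for illustration, and the fetch function is passed in so the example runs without touching a real website.

```python
import time


def scrape_slowly(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is whatever function downloads a page; `sleep` is injectable
    so the pause can be replaced when testing.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    # Stand-in for a real download, so this sketch stays offline.
    return f"<html>{url}</html>"


pauses = []  # records each pause instead of actually sleeping
pages = scrape_slowly(
    ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
    fake_fetch,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), "pages fetched with pauses of", pauses)
```

A fixed delay is the simplest option; the right value depends on the site, and a slower rate also addresses the earlier warning about scraping at a disruptive rate.
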
  • 35. Data scraping services Typically $2k+ per project.. ouch! Priceonomics Promptcloud Scrapinghub Datahen
  • 36. How to turn your data into visually appealing content
  • 37. Maps In order to do this you need to capture addresses or latitude/longitude
  • 39. Charts Easiest way to visualise data, hardest to make look sexy with Excel & Google Sheets Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
  • 40. Tip: Tableau suggests charts Place your data set in Tableau and use the ‘show me’ functionality
  • 41. Interactive Tables Helpful to use a database tool for larger datasets https://tablepress.org/
  • 42. Interactive visualisations Highly engaging and allows the user to filter the data Source: https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/
  • 43. Inspect element to find frameworks This visualisation is using amcharts.com
  • 44. Interactive visualisation brief example Download the full brief template http://bit.ly/datavizbrief
  • 47. Provide a new dimension on a dataset How? IMDB + the idea that people want their fav tv shows to come back on air 335,830 views 142 Linking root domains
  • 48. Recognise patterns and service them How? Combined results from NBN map search + real estate listings 1,000+ New users within 72 hours
  • 49. Display data in an accessible format How? Allflicks.net combining IMDB with Netflix library plus filters 1.13k Linking root domains ● Filterable ● Sortable ● Categorised ● Indexable!
  • 50. Visualise trends How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com 426 Linking root domains
  • 51. Big data analysis ‘taster’ How? Scraped Google to analyse rich snippets + blog post with ‘taste’ of the data 128 Linking root domains + Lead source
  • 52. Want more ideas? 1. Scrape an online community to get a list of URLs and their a. Post titles b. # of Upvotes c. # of comments d. Date posted 2. Mash together the data with social shares, link data using URL Profiler 3. Analyse the data using pivot tables in Excel or Google Sheets Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/ Scrape Reddit, growthhackers.com, inbound.org, Hacker News
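
The pivot-table step in the idea above amounts to grouping rows and summing columns. Here is a minimal sketch of that summary in Python; the posts, community names, and counts are invented sample data, not scraped results.

```python
from collections import defaultdict

# Hypothetical scraped rows: one per post.
posts = [
    {"community": "reddit", "title": "Post A", "upvotes": 120, "comments": 14},
    {"community": "hackernews", "title": "Post B", "upvotes": 300, "comments": 45},
    {"community": "reddit", "title": "Post C", "upvotes": 80, "comments": 6},
]

# Group by community (the pivot "row" field) and total the numeric columns.
totals = defaultdict(lambda: {"posts": 0, "upvotes": 0, "comments": 0})
for post in posts:
    group = totals[post["community"]]
    group["posts"] += 1
    group["upvotes"] += post["upvotes"]
    group["comments"] += post["comments"]

# Sort so the community with the most upvotes comes first.
pivot = sorted(totals.items(), key=lambda item: item[1]["upvotes"], reverse=True)

for community, summary in pivot:
    print(community, summary)
```
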
  • 54. Content Distribution Supernodes: Reddit Digg Hacker News Slashdot Inbound.org Q&A websites (Quora, etc) Online communities Forums Subreddits Facebook Groups/Pages List of content distribution websites: bit.ly/content-distribution-list
  • 55. Good ol’ fashioned reachout Find websites with audiences that will be interested in your data Give journalists and bloggers a unique angle and potentially a different dimension on the dataset so they can write their own unique story Make contact - don’t be afraid to use the phone or go for a coffee List of content distribution websites: bit.ly/content-distribution-list
  • 56. Build email lists Even small email lists can be powerful to spread your content online
  • 57. We are always hiring! finder.com.au/careers jeremy@finder.com

Editor's Notes

  1. Originally part of Y-Combinator Blogs Airbnb rates
  2. Combined data from Pitchfork music reviews + Facebook likes for each review. Pitchfork: music reviews for independent music. Facebook likes: artists with the fewest likes were the most hipster, because the second criterion is that it should be a band you’ve never heard of
  3. Understanding this will help you understand page structures and what’s possible
  4. Data.gov.au. Kaggle (the data science community recently bought by Google) publishes a lot of their datasets
  5. Talk about they aren’t niche Second opinion
  6. https://blog.hartleybrody.com/web-scraping/
  7. Talk about other JSON examples
  8. 30 seconds to produce this
  9. Point and click Can run on a regular basis
  10. Easy-to-use tool for intermediate to advanced users who are comfortable with XPath. More advanced than ImportXML. Allows you to capture more information than is possible in Google Sheets
  11. Pricey. Expect to pay upwards of $2k per project
  12. Where to live next, where they can get the NBN