@frankseo
frank@orchidbox.com
Clients we work for
Overview
The basic techniques of web crawling
Backlink tools - Moz, Ahrefs and Majestic SEO
Outreachr.com – the tool
An Australian challenge
Insights into the Outreachr database
Owning the data – what you can do with it
Take-aways
Web crawling – an introduction
- A web crawler is a computer program that
browses the web in a methodical and
automated manner.
- They are called crawlers because they
crawl through a site one page at a
time, following the links to other pages on
the site until all pages have been read.
- All major search engines and SEO tools
deploy crawlers - also known as "spiders" or
"bots".
Breadth First Search
Web crawling – an Introduction
• BFS begins at a root node and inspects
all neighbouring nodes.
• For each of those neighbours in turn, it
inspects their unvisited neighbours,
and so on.
• Assumption: If we start with "good"
pages, this keeps us close to other
good pages.
• Variations of this algorithm are more
memory efficient and are popular in
computing.
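A minimal sketch of the breadth-first idea, assuming the link graph is already in memory as a dict mapping each page to the pages it links to. The names `bfs_crawl`, `TOY_LINKS` and `max_pages` are illustrative only; a real crawler would fetch and parse pages instead of reading a dict.

```python
from collections import deque


def bfs_crawl(links, seeds, max_pages=None):
    """Return pages in breadth-first order, starting from the seed pages."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    visited = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        page = queue.popleft()  # oldest discovered page first
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


# Toy link graph (hypothetical pages, not real crawl data).
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

bfs_order = bfs_crawl(TOY_LINKS, ["a"])
print(bfs_order)
```

All pages one link away from the seed are read before any page two links away, which is why the choice of seed pages matters so much.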
Web Crawling – An Introduction
Depth First Search
• Invented in the 19th century by the
French mathematician Charles Pierre
Trémaux (as a strategy for solving
mazes).
• Algorithm for traversing or
searching tree or graph data
structures.
• Starts at the root and explores as far
as possible along each branch
before backtracking.
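For comparison, a minimal depth-first sketch over the same kind of in-memory link graph. The names `dfs_crawl` and `TOY_GRAPH` are illustrative assumptions; the explicit stack replaces recursion so deep sites do not hit Python's recursion limit.

```python
def dfs_crawl(links, root):
    """Return pages in depth-first order, starting from the root page."""
    visited = set()
    order = []
    stack = [root]
    while stack:
        page = stack.pop()  # most recently discovered page first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push in reverse so the first link on a page is explored first.
        for target in reversed(links.get(page, [])):
            if target not in visited:
                stack.append(target)
    return order


# Toy link graph (hypothetical pages, not real crawl data).
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

dfs_order = dfs_crawl(TOY_GRAPH, "a")
print(dfs_order)
```

Here the branch a, b, d is followed to its end before the crawler backtracks to c.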
Popular SEO tools
Web crawling tools
Tool index sizes
[Chart: index size per tool (Moz, Majestic Fresh, Ahrefs) in billion URLs, million root domains and billion links; axis 0 to 2000]
Remember
- Number of pages per domain
- Number of links per domain
E.g. eBay AU has 80M pages
2 years ago we came up with an internal tool to handle outreach
We had to come up with a new tool
Why?
- Be more efficient in finding the right sites for our clients
- Speed up the contact process
- Outsource some of the most repetitive work (e.g. sending
emails/filling contact forms)
- Work for various clients in various languages
- Codebase ownership = freedom to run custom campaigns
- We don’t want to piss people off! We have a historical index of
who we have contacted in the past.
Outreachr.com - how we do it
- Discovery (engine scraping, Twitter, own index)
- Get SEO stats (Moz & PR)
- Social
- Contact extraction (crawling sites, Whois data)
- Sorting algorithm
- New campaign queries
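A rough sketch of how the final sorting step of a pipeline like this could look, assuming prospects have already been discovered and enriched. The `Prospect` fields, the `sort_prospects` function and the ordering rule (contactable sites first, then higher Domain Authority) are all hypothetical and are not Outreachr's actual sorting algorithm. The `already_contacted` set stands in for the historical index of past contacts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prospect:
    domain: str
    domain_authority: float = 0.0  # hypothetical SEO stat
    email: Optional[str] = None    # filled by contact extraction, if found
    twitter: Optional[str] = None  # filled by the social step, if found


def sort_prospects(prospects, already_contacted):
    """Drop domains contacted before, then rank the remaining prospects."""
    fresh = [p for p in prospects if p.domain not in already_contacted]
    return sorted(
        fresh,
        key=lambda p: (
            p.email is None and p.twitter is None,  # contactable first
            -p.domain_authority,                    # then highest DA
        ),
    )


# Placeholder prospects (invented values for illustration).
prospects = [
    Prospect("site-a.example", 30, email="hello@site-a.example"),
    Prospect("site-b.example", 50),
    Prospect("site-c.example", 40, twitter="@site_c"),
    Prospect("site-d.example", 60, email="hi@site-d.example"),
]
ranked = sort_prospects(prospects, already_contacted={"site-d.example"})
ranked_domains = [p.domain for p in ranked]
print(ranked_domains)
```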
Outreachr - interface
Insights into the Aussie web graph
The Australian challenge
Step 1 - We started with a small, tight seed set
(abc.net.au, news.com.au, theaustralian.com.au and other popular
Australian news sites)
After collecting over 1M URLs and analysing over 8M links, we found
only 90,000 unique domains out of 2.4M registered .au domains
The Australian web graph is hard to crawl
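The arithmetic behind these figures, as a quick sketch. The variable names are illustrative, and since the slide says "over" 1M URLs and "over" 8M links, the ratios are approximate.

```python
# Figures quoted on the slide (URL and link counts are lower bounds).
urls_collected = 1_000_000
links_analysed = 8_000_000
unique_domains_found = 90_000
registered_au_domains = 2_400_000

# Share of registered .au domains reached by the crawl.
coverage = unique_domains_found / registered_au_domains

# Roughly how many links had to be analysed per unique domain found.
links_per_domain = links_analysed / unique_domains_found

print(f"coverage: {coverage:.2%}")
print(f"links analysed per unique domain: {links_per_domain:.0f}")
```

That is under 4% of registered .au domains, at a cost of roughly 89 analysed links per unique domain.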
2012 stats from AusRegistry – 2.4M registered .au domains
Source http://www.auda.org.au/pdf/ausregistry-q4-1112.pdf
Any tool using breadth-first
search will struggle to
crawl Aussie sites
efficiently
Australian sites link out to sites all over the world
.com.au sites link to .com as much as .com.au
[Diagram: outbound links from domain.com.au by target extension: AU (45), Com (40), Net (24)]
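A small sketch of how outbound links could be tallied by target extension, assuming the links are available as absolute URLs. The function name `outlink_tlds` and the sample URLs are hypothetical; only the final label of the hostname is counted, so .com.au and .net.au both land under "au".

```python
from collections import Counter
from urllib.parse import urlsplit


def outlink_tlds(urls):
    """Count outbound links by the last label of the target hostname."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None for relative or malformed URLs
        if not host:
            continue
        counts[host.rsplit(".", 1)[-1]] += 1
    return counts


# Placeholder outbound links (invented for illustration).
sample_links = [
    "http://news.example.com.au/a",
    "https://example.com/b",
    "http://example.net/c",
    "http://shop.example.com.au/d",
    "/relative/link",
]
tld_counts = outlink_tlds(sample_links)
print(tld_counts)
```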
So what have we learned from our database?
- Ranking domains (1.5M)
- Breadth-first crawl from ranking domains (2M)
- Twitter domains (0.4M)
.edu domains have an average PR of 5.49!
[Chart: average PR (axis 0 to 6) for .com, .net, .org and .edu]
Regional level – Australia has the highest average PR
[Chart: average PR (axis 0 to 3.5) for .co.uk, .fr, .au and .nz]
Moz loves .org sites
[Chart: average DA (axis 23 to 30) for .com, .net, .org and .edu]
Australia has the highest average DA
[Chart: average DA (axis 0 to 60) for .co.uk, .fr, .au and .nz]
Quite a big disparity between PR and DA
[Charts: average DA (axis 0 to 60) and average PR (axis 0 to 6) for .ac.uk, .co.uk, .fr, .au and .nz]
You need fewer links to rank in Australia
[Chart: root domain links: .com 84, .uk 67, .au 48]
AU: 43% success rate in grabbing emails off domains
[Pie chart: email found vs. no email found]
COM: 50% success rate in grabbing emails off domains
[Pie chart: email found vs. no email found]
AU: 16% of sites linked to their Facebook page
[Pie chart: link to Facebook page vs. no link found]
COM: 18% of sites linked to their Facebook page
[Pie chart: link to Facebook page vs. no link found]
AU: 61% of sites linked to their Twitter page
[Pie chart: link to Twitter page vs. no link found]
COM: 70% of sites linked to their Twitter page
[Pie chart: link to Twitter page vs. no link found]
And … domain extension distribution
- .com.au: 74%
- other .au (net.au, org.au ..): 7%
- other (com, net ..), usually au.domain.com: 19%
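A sketch of how hostnames could be bucketed into the three groups above. The function names `classify_extension` and `extension_shares` are illustrative assumptions, and the four sample hostnames are just a toy input, so the printed shares are not the slide's figures.

```python
from collections import Counter


def classify_extension(host):
    """Bucket a hostname as '.com.au', 'other .au' or 'other'."""
    host = host.lower().rstrip(".")
    if host.endswith(".com.au"):
        return ".com.au"
    if host.endswith(".au"):
        return "other .au"
    return "other"


def extension_shares(hosts):
    """Return the percentage of hostnames falling in each bucket."""
    counts = Counter(classify_extension(h) for h in hosts)
    total = sum(counts.values())
    return {bucket: 100 * n / total for bucket, n in counts.items()}


# Toy input only; shares below say nothing about the real distribution.
sample_hosts = [
    "www.ebay.com.au",
    "abc.net.au",
    "itunes.apple.com",
    "au.answers.yahoo.com",
]
shares = extension_shares(sample_hosts)
print(shares)
```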
Owning this data is really cool
Analysing ranking pages on Google (e.g. PR, DA, keywords in URL)
How difficult is it to rank, based on the sites we found on the 1st
page?
Who are my online SERP competitors?
Based on a keyword set you control and you care about
eBay is the most visible site across the 17k keywords analysed
Domain                            In top 10   Saturation
ebay.com.au                       2691        25.07
truelocal.com.au                  2308        21.5
yellowpages.com.au                1894        17.65
gumtree.com.au                    1819        16.95
google (images/video/shopping)    1765        16.44
tripadvisor.com.au                1753        16.33
forums.whirlpool.net.au           1392        12.97
productreview.com.au              1208        11.26
myshopping.com.au                 1130        10.53
abc.net.au                        1101        10.26
smh.com.au                        1100        10.25
itunes.apple.com                  1077        10.03
whitepages.com.au                 990         9.22
yelp.com.au                       965         8.99
whereis.com                       893         8.32
news.com.au                       833         7.76
wotif.com                         783         7.3
au.answers.yahoo.com              774         7.21
expedia.com.au                    672         6.26
getprice.com.au                   628         5.85
Compiled by analysing over 100,000 ranking domains
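A sketch of how a visibility table like this could be built from your own ranking data. The slides do not define "Saturation"; this assumes it is the percentage of analysed keywords for which a domain appears at least once in the top 10. The function name `top10_saturation` and the placeholder keywords and domains are hypothetical.

```python
from collections import Counter


def top10_saturation(serps):
    """Map each domain to (keywords where it is in the top 10, saturation %).

    serps: dict of keyword -> list of ranking domains, best position first.
    """
    appearances = Counter()
    for domains in serps.values():
        # set() so a domain ranking twice for one keyword counts once.
        for domain in set(domains[:10]):
            appearances[domain] += 1
    total = len(serps)
    return {
        domain: (count, round(100 * count / total, 2))
        for domain, count in appearances.most_common()
    }


# Placeholder rankings (invented, not real search results).
toy_serps = {
    "kw1": ["site-a.example", "site-b.example", "site-a.example"],
    "kw2": ["site-a.example"],
    "kw3": ["site-c.example"],
}
saturation = top10_saturation(toy_serps)
print(saturation)
```

Run against a keyword set you control, the same tally answers the "who are my online SERP competitors?" question.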
Big surprise! Nothing to do with home appliances
broadband choice
adsl 2
microsoft certification
modem router
Take-aways
- If you want to outreach in Australia, you probably need to be on Twitter.
- The top Aussie sites are aggregators (products, reviews or local business) - get
listed to increase visibility.
- You are already lucky! You don’t need to earn links from as many root domains as
you would in other countries such as the UK.
- Use a range of tools, including Open Site Explorer, ahrefs.com and MajesticSEO,
to check your backlink profile, as no single tool seems to do a great job of indexing the
Australian subnet.
- You need a .com.au to rank in Australia. 19% are .com, but usually with an
Australian subdomain (e.g. au.domain.com).
@frankseo
frank@outreachr.com
frank@orchidbox.com
(Send me a tweet to get free Outreachr pro access for a month!)

Datamining the Australian web graph by Frank Vitetta