Datamining the Australian web graph by Frank Vitetta

How easy is it to crawl the Australian web graph - or, in other words, crawl all Australian sites? Frank has set himself this challenge and in his talk he will cover web crawling in depth, as well as a number of interesting findings and trends about the Australian web market that he came across along the way.

Published in: Technology, Business

  1. @frankseo frank@orchidbox.com Clients we work for
  2. Overview
     - The basic techniques of web crawling
     - Backlink tools: Moz, Ahrefs and Majestic SEO
     - Outreachr.com: the tool
     - An Australian challenge
     - Insights into the Outreachr database
     - Owning the data: what you can do with it
     - Take-aways
  3. Web crawling: an introduction
     - A web crawler is a computer program that browses the web in a methodical and automated manner.
     - They are called crawlers because they crawl through a site one page at a time, following the links to other pages on the site until all pages have been read.
     - All major search engines and SEO tools deploy crawlers, also known as "spiders" or "bots".
  4. Web crawling: an introduction. Breadth-first search (BFS)
     • BFS begins at a root node and inspects all neighbouring nodes.
     • For each neighbour node in turn, it inspects that node's unvisited neighbours, and so on.
     • Assumption: if we start with "good" pages, this keeps us close to other good pages.
     • Variations of this algorithm are more memory efficient and popular in computing.
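
The breadth-first strategy on this slide can be sketched in a few lines. This is a minimal illustration, not the crawler used for the talk: the function name `bfs_crawl_order`, the `max_pages` limit and the toy link graph (hypothetical `.example` hosts) are all assumptions, and a real crawler would fetch pages over HTTP instead of reading an in-memory dictionary.

```python
from collections import deque


def bfs_crawl_order(link_graph, seeds, max_pages=None):
    """Return pages in the order a breadth-first crawler would visit them.

    link_graph maps a page to the list of pages it links to (a stand-in
    for fetching the page and extracting its links).
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    visited = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        page = queue.popleft()  # FIFO queue: oldest discovered page first
        order.append(page)
        for neighbour in link_graph.get(page, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical toy graph, for illustration only.
toy_graph = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": [],
    "d.example": ["seed.example"],
}

bfs_order = bfs_crawl_order(toy_graph, ["seed.example"])
print(bfs_order)
```

Every page one link away from the seed is visited before any page two links away, which is the "stay close to good pages" assumption from the slide.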
  5. Web crawling: an introduction. Depth-first search (DFS)
     • Invented in the 19th century by French mathematician Charles Pierre Trémaux as a strategy for solving mazes.
     • An algorithm for traversing or searching tree or graph data structures.
     • Starts at the root and explores as far as possible along each branch before backtracking.
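
For comparison, a depth-first traversal of the same kind of link graph might look like the sketch below. The name `dfs_crawl_order` and the toy graph are assumptions made for illustration; the only difference from a breadth-first version is that newly discovered pages go onto a stack (last in, first out) instead of a queue.

```python
def dfs_crawl_order(link_graph, root):
    """Return pages in the order an iterative depth-first crawl visits them."""
    visited = set()
    order = []
    stack = [root]
    while stack:
        page = stack.pop()  # LIFO stack: most recently discovered page first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push links in reverse so the first link on the page is explored first.
        for neighbour in reversed(link_graph.get(page, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


# Hypothetical toy graph, for illustration only.
toy_graph = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": [],
    "d.example": ["seed.example"],
}

dfs_order = dfs_crawl_order(toy_graph, "seed.example")
print(dfs_order)
```

Here the crawl follows `a.example` all the way down to `c.example` before it backtracks to `b.example`.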
  6. Popular SEO tools
  7. Web crawling tools
  8. Tool index sizes. [Bar chart: Moz, Majestic Fresh and Ahrefs compared by billion URLs, million root domains and billion links.] Remember: number of pages per domain; number of links per domain. E.g. eBay AU has 80M pages.
  9. 2 years ago we came up with an internal tool to handle outreach
  10. We had to come up with a new tool
  11. Why?
      - Be more efficient in finding the right sites for our clients
      - Speed up the contact process
      - Outsource some of the most repetitive work (e.g. sending emails/filling contact forms)
      - Work for various clients in various languages
      - Codebase ownership = freedom to run custom campaigns
      - We don't want to piss people off! We have a historical index of who we have contacted in the past.
  12. Outreachr.com: how we do it. [Flow diagram boxes: Discovery (engine scraping, Twitter, own index); Get SEO stats (Moz & PR); Social; Contact extraction (crawling sites, Whois data); Sorting algorithm; New campaign queries]
  13. Outreachr: interface
  14. Insights into the Aussie web graph
  15. The Australian challenge
  16. Step 1: we started with a small, tight seed set (abc.net.au, news.com.au, theaustralian.com.au and other popular Australian news sites). After obtaining over 1M URLs and analysing over 8M links, we found only 90,000 unique domains, against 2.4M registered .au domains. The Australian web graph is hard to crawl.
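
The figures on this slide can be turned into a rough coverage estimate. This is back-of-envelope arithmetic only: the slide gives "over 1M" URLs and "over 8M" links, so the rounded values below are lower bounds, and the variable names are made up for the example. The 90,000 unique domains may also include non-.au domains, in which case the real coverage of the .au space is lower still.

```python
# Figures quoted on the slide (rounded; "over 1M" and "over 8M" are lower bounds).
unique_domains_found = 90_000
registered_au_domains = 2_400_000
links_analysed = 8_000_000

# Share of the registered .au domain space reached from the news-site seeds.
coverage = unique_domains_found / registered_au_domains

# How many links had to be analysed, on average, per unique domain found.
links_per_unique_domain = links_analysed / unique_domains_found

print(f"coverage: {coverage:.2%}")
print(f"links analysed per unique domain: {links_per_unique_domain:.0f}")
```

In other words, the crawl reached under 4% of the registered domains and needed roughly 89 links for each unique domain it found.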
  17. 2012 stats from AusRegistry: 2.4M registered .au domains. Source: http://www.auda.org.au/pdf/ausregistry-q4-1112.pdf
  18. Any tool using breadth-first search will struggle to crawl Aussie sites efficiently
  19. Australian sites link out to sites all over the world
  20. .com.au sites link to .com as much as to .com.au. [Diagram: domain.com.au with AU (45), Com (40), Net (24)]
  21. So what have we learned from our database? [Diagram: ranking domains (1.5M); breadth-first crawl from ranking domains (2M); Twitter domains (0.4M)]
  22. EDU domains have an average PR of 5.49! [Bar chart: average PR for .com, .net, .org, .edu]
  23. Regional level: Australia has the highest average PR. [Bar chart: average PR for .co.uk, .fr, .au, .nz]
  24. Moz loves .org sites. [Bar chart: average DA for .com, .net, .org, .edu]
  25. Australia has the highest average DA. [Bar chart: average DA for .co.uk, .fr, .au, .nz]
  26. Quite a big disparity between PR and DA. [Bar charts: DA and PR for .ac.uk, .co.uk, .fr, .au, .nz]
  27. You need fewer links to rank in Australia. [Bar chart, root domain links: .com 84, .uk 67, .au 48]
  28. AU: 43% success rate in grabbing emails off domains. [Pie chart: email found vs. no email found]
  29. COM: 50% success rate in grabbing emails off domains. [Pie chart: email found vs. no email found]
  30. AU: 16% of sites linked to their Facebook page. [Pie chart: link to Facebook page vs. no link found]
  31. COM: 18% of sites linked to their Facebook page. [Pie chart: link to Facebook page vs. no link found]
  32. AU: 61% of sites linked to their Twitter page. [Pie chart: link to Twitter page vs. no link found]
  33. COM: 70% of sites linked to their Twitter page. [Pie chart: link to Twitter page vs. no link found]
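
The email and social-profile figures above come from crawling each site for contact details. The sketch below shows one way such a check could be done on a single page of HTML using only the Python standard library. It is not Outreachr's implementation: the class `ContactParser`, the function `extract_contacts`, the simplified email pattern and the sample page are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Deliberately simple email pattern; real-world addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ContactParser(HTMLParser):
    """Collect email addresses and Facebook/Twitter profile links from HTML."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.social = {"facebook": set(), "twitter": set()}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            address = href[len("mailto:"):].split("?")[0]
            if EMAIL_RE.fullmatch(address):
                self.emails.add(address.lower())
        for network in self.social:
            if re.match(rf"https?://(www\.)?{network}\.com/", href, re.IGNORECASE):
                self.social[network].add(href)

    def handle_data(self, data):
        # Addresses written out in visible text.
        self.emails.update(match.lower() for match in EMAIL_RE.findall(data))


def extract_contacts(html):
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    return {
        "emails": sorted(parser.emails),
        "facebook": sorted(parser.social["facebook"]),
        "twitter": sorted(parser.social["twitter"]),
    }


# Hypothetical page, for illustration only.
sample_html = (
    "<p>Contact: info@example.com.au</p>"
    '<a href="mailto:Sales@example.com.au?subject=Hi">Email sales</a>'
    '<a href="https://twitter.com/example">Follow us</a>'
)
contacts = extract_contacts(sample_html)
print(contacts)
```

A site would count as "email found" when `contacts["emails"]` is non-empty, and as linking to Twitter or Facebook when the matching list is non-empty.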
  34. Domain extension distribution: .com.au 74%; other .au (net.au, org.au, ...) 7%; other (com, net, ...) 19%, usually au.domain.com
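
A distribution like this one can be computed by bucketing each domain by its suffix. The sketch below uses the three buckets from the slide; the function names and the four-domain sample (domains that appear elsewhere in this deck) are assumptions for illustration, not the data set behind the 74% / 7% / 19% split.

```python
from collections import Counter


def classify_extension(domain):
    """Bucket a domain the way the slide does: .com.au, other .au, or other."""
    domain = domain.lower().rstrip(".")
    if domain.endswith(".com.au"):
        return ".com.au"
    if domain.endswith(".au"):
        return "other .au"
    return "other"


def extension_distribution(domains):
    """Return the share of domains in each bucket (shares sum to 1)."""
    counts = Counter(classify_extension(d) for d in domains)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: count / total for bucket, count in counts.items()}


# Tiny sample, for illustration only.
sample_domains = ["ebay.com.au", "abc.net.au", "au.answers.yahoo.com", "gumtree.com.au"]
distribution = extension_distribution(sample_domains)
print(distribution)
```

Note that `au.answers.yahoo.com` lands in "other": it is an Australian subdomain on a .com, which is the au.domain.com pattern the slide mentions.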
  35. Owning this data is really cool
  36. Analysing ranking pages on Google (e.g. PR, DA, keywords in URL): how difficult is it to rank, based on the sites we found on the first page?
  37. Who are my online SERP competitors? Based on a keyword set that you control and care about.
  38. eBay is the most visible site across 17k keywords analysed (compiled analysing over 100,000 ranking domains)

      Domain                           In top 10   Saturation
      ebay.com.au                      2691        25.07
      truelocal.com.au                 2308        21.5
      yellowpages.com.au               1894        17.65
      gumtree.com.au                   1819        16.95
      google (images/video/shopping)   1765        16.44
      tripadvisor.com.au               1753        16.33
      forums.whirlpool.net.au          1392        12.97
      productreview.com.au             1208        11.26
      myshopping.com.au                1130        10.53
      abc.net.au                       1101        10.26
      smh.com.au                       1100        10.25
      itunes.apple.com                 1077        10.03
      whitepages.com.au                990         9.22
      yelp.com.au                      965         8.99
      whereis.com                      893         8.32
      news.com.au                      833         7.76
      wotif.com                        783         7.3
      au.answers.yahoo.com             774         7.21
      expedia.com.au                   672         6.26
      getprice.com.au                  628         5.85
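
A table of this shape can be built from stored search results. The slide does not say how "Saturation" is calculated, so the sketch below assumes it is the percentage of analysed keywords for which a domain appears in the top 10. The function name `top10_visibility` and the three-keyword sample are likewise illustrative assumptions.

```python
from collections import defaultdict


def top10_visibility(serps):
    """Build (domain, keywords_in_top_10, saturation_percent) rows.

    serps maps each keyword to its ranked list of result domains.
    Saturation is ASSUMED to be the share of keywords, in percent,
    for which the domain shows up in the first ten results.
    """
    keywords_by_domain = defaultdict(set)
    for keyword, ranked_domains in serps.items():
        for domain in ranked_domains[:10]:
            keywords_by_domain[domain].add(keyword)
    total_keywords = len(serps)
    rows = [
        (domain, len(keywords), round(100 * len(keywords) / total_keywords, 2))
        for domain, keywords in keywords_by_domain.items()
    ]
    # Most visible first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical rankings, for illustration only.
sample_serps = {
    "modem router": ["ebay.com.au", "forums.whirlpool.net.au"],
    "adsl 2": ["forums.whirlpool.net.au"],
    "hotels": ["tripadvisor.com.au", "ebay.com.au"],
}
visibility = top10_visibility(sample_serps)
for row in visibility:
    print(row)
```

Using a set of keywords per domain means a domain with several results for the same keyword is still counted once for that keyword.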
  39. Big surprise! Nothing to do with home appliances. [Keywords shown: broadband choice, adsl 2, microsoft certification, modem router]
  40. Take-aways
      - If you want to do outreach in Australia, you probably need to be on Twitter.
      - The top Aussie sites are aggregators (products, reviews or local business): get listed to increase visibility.
      - You are already lucky! You don't need as many linking root domains to rank as you would in other countries like the UK.
      - Use a range of tools, including Open Site Explorer, Ahrefs and Majestic SEO, to check backlink profiles, as no single tool seems to do a great job at indexing the Australian web.
      - You need a .com.au to rank in Australia. 19% are .com, but usually with an Australian subdomain (e.g. au.domain.com).
  41. Contact: @frankseo, frank@outreachr.com, frank@orchidbox.com (Send me a tweet to get free Outreachr pro access for a month!)
