Your First Sitemap.xml
& Robots.txt Implementation
Jérôme Verstrynge
For Ligatures.net
December, 2014
License: CC BY-ND 4.0
Introduction to XML sitemap and robots.txt files for SEO beginners. Covers the basics of implementing them for your first website.

Table Of Contents
● Introduction
● Sitemap: XML vs HTML
● Location:
  – Sitemap.xml
  – Robots.txt
● Sitemap
  – Content I & II
  – Generators
  – Recommendations
● Robots.txt
  – Content I & II
  – Basic example
  – Recommendations & Warnings
● Additional References
  – Further reading
Introduction
● Web Crawler
  – A search engine computer searching for content on the Internet for later indexation
  – They read the robots.txt and sitemap.xml files found on websites
● Robots.txt
  – A text file containing instructions for web crawlers
● Sitemap.xml
  – A text file listing page URLs to help web crawlers find content on a website
Sitemaps: XML vs HTML (a common confusion)
● HTML Sitemap:
  – A web page containing links that facilitate user navigation on a website
  – Displayed in web browsers; visited by users and web crawlers
● XML Sitemap:
  – A structured text file containing the URLs of a website's pages, meant for web crawlers
  – Never displayed to users; read by web crawlers only
  – That's the kind we are interested in!
Sitemap.xml Locations
● By default, most web crawlers search for a sitemap.xml file in the website root
● But sitemaps can be located anywhere, although the recommended practice is to put them all in the root
● A website can have more than one sitemap (for example, sitemap.xml in the root and sitemap2.xml in a subdirectory such as /mydir)
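When a site has several sitemaps, the sitemap protocol also defines a sitemap index file that lists them all in one place. A minimal sketch is shown below; the URLs are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://mysite.com/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://mysite.com/mydir/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>
```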
Robots.txt Location
● By default, all web crawlers search for a robots.txt file in the website root
● A website may not have a robots.txt file...
● ...but it is recommended to always have one (even if minimal)
Sitemap.xml Content - I
● A structured document defining a <urlset>
● One <url>...</url> section per web page URL
● <urlset>, <url> and <loc> are required; the other elements are optional

  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://mysite.com/page.html</loc>
      <lastmod>2014-10-04T13:27:58+03:00</lastmod>
      <changefreq>daily</changefreq>
      <priority>0.7</priority>
    </url>
    ...
  </urlset>
Sitemap.xml Content - II
● <loc>: the URL of a page on the website
● <lastmod>: when the page was last modified
● <changefreq>: how often it is modified
● <priority>: your opportunity to tell web crawlers which pages you think they should spend their time on first (it has no impact on rankings)

  <loc>http://mysite.com/page.html</loc>
  <lastmod>2014-10-04T13:27:58+03:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.7</priority>
Sitemap.xml Generators
● Creating a sitemap.xml manually can be very time consuming
● Many site owners can generate one automatically for their websites...
● ...but not everyone is technical!
● Solution?
  – Use free online sitemap generators
  – Some plugins are available for blog platforms
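For the technically inclined, a small script can also do the job. The sketch below builds a minimal <urlset> document with Python's standard library; the page list and domain are hypothetical placeholders:

```python
# A minimal sitemap.xml generator sketch using only the standard library.
# The page URLs and dates below are illustrative placeholders.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Return a sitemap <urlset> document (as a string) for (loc, lastmod) pairs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # <lastmod> is optional; leave it out if you lack a reliable value
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

pages = [
    ("http://mysite.com/page.html", "2014-10-04"),
    ("http://mysite.com/other.html", None),
]
print(build_sitemap(pages))
```

Following the recommendation above, the script simply omits <lastmod> for pages where no reliable modification date is available.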
Sitemap.xml Recommendations
● Create at least one sitemap.xml in the root
● Be as exhaustive as possible
● Leave out <lastmod> and <changefreq> if you can't set reliable values
● Don't try to fool search engines with <lastmod>, <changefreq> and <priority>; it does not work and can backfire
● You may submit your sitemaps to search engines (but it is not mandatory)
Robots.txt Content - I
● Rules apply top-down; the last matching rule prevails
● User-agent: tells which web crawler (a.k.a. robot) the rules apply to; * means all
● Disallow = forbid access; if left empty, it forbids access to nothing (in other words, allows everything)
● Allow = authorize access

  User-agent: *
  Disallow:

  User-agent: Googlebot
  Disallow: /mydir/
  Allow: /mydir/myfile.html
Robots.txt Content - II
● The robots.txt below says:
  – All web crawlers (except Google's) can access everything on the website
  – Google's web crawler cannot access the content of the /mydir directory, except myfile.html in that directory

  User-agent: *
  Disallow:

  User-agent: Googlebot
  Disallow: /mydir/
  Allow: /mydir/myfile.html
Robots.txt – Basic Example
● Use the example below for a start
● Allow all web crawlers access to all of your website content
● Register all your sitemaps in robots.txt, otherwise web crawlers likely won't find them
● Locations are case-sensitive
● Directory locations should end with a '/'

  user-agent: *
  disallow:

  sitemap: http://www.mysite.com/sitemap.xml
  sitemap: http://www.mysite.com/sitemap2.xml
  ...
Robots.txt Recommendations & Warnings
● Always create (at least) a minimal robots.txt where all sitemaps are declared
● Never block access to CSS and JavaScript content
● Disallow instructions can be bypassed by malicious web crawlers; they are not a means of protecting content
● Debug your robots.txt with online checkers
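Besides online checkers, rules can also be sanity-checked locally. The sketch below feeds the rules from the earlier example to Python's standard urllib.robotparser; the site URL and the "ExampleBot" name are placeholders. Note that this particular parser applies the first matching rule within a group, which is why the Allow line is listed before the Disallow it overrides.

```python
# Sanity-check robots.txt rules locally with Python's standard library parser.
# The domain and the "ExampleBot" user-agent are illustrative placeholders.
import urllib.robotparser

rules = """\
User-agent: *
Disallow:

User-agent: Googlebot
Allow: /mydir/myfile.html
Disallow: /mydir/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Any other crawler may fetch everything (empty Disallow = allow all)
print(rp.can_fetch("ExampleBot", "http://www.mysite.com/anything.html"))     # True
# Googlebot is blocked from /mydir/ ...
print(rp.can_fetch("Googlebot", "http://www.mysite.com/mydir/secret.html"))  # False
# ... except the explicitly allowed file
print(rp.can_fetch("Googlebot", "http://www.mysite.com/mydir/myfile.html"))  # True
```

In production you would call rp.set_url("http://www.mysite.com/robots.txt") and rp.read() instead of parsing an inline string.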
Additional References
● Further reading:
  – Troubleshooting web site indexation issues
  – Troubleshooting web pages indexation issues
  – Getting started with SEO
  – SEO guidelines & checklists
