Wednesday, May 03, 2006

Spiders & Google Sitemaps

Internet searching has become one of the most popular services on the internet. Google, Yahoo, AltaVista and AllTheWeb are all examples of search engines. Despite how simple searching looks to the user, the technique behind it is not as easy as we might think: search engines have to update their indexes regularly and endlessly.

Such engines use a special kind of software called “spiders” or “robots”. These programs try to scan the whole internet to update the index with new – or modified – locations, URLs and content. Whatever algorithm or technique is used, the spiders extract keywords from each page’s content and store the results in huge databases.
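To make the idea concrete, here is a toy sketch in Python of the basic step a spider performs on one page: fetch it, collect its outgoing links, and count the most frequent words as crude “keywords”. This is of course not Google’s actual code, just an illustration using only the standard library; the commented-out URL at the end is simply this blog’s address used as an example.

import re
import urllib.request
from collections import Counter
from html.parser import HTMLParser

class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and visible text from one fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl_once(url):
    """Fetch a single page; return its top words and the links to visit next."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    parser = LinkAndTextParser()
    parser.feed(html)
    words = re.findall(r"[a-z]{4,}", " ".join(parser.text_parts).lower())
    keywords = Counter(words).most_common(10)  # very crude "keyword extraction"
    return keywords, parser.links

# keywords, next_urls = crawl_once("http://mnour.blogspot.com")

A real spider would repeat this over a queue of URLs, respect robots.txt, and use far smarter ranking than word counts, but the fetch-extract-store loop is the core of it.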

To complement its service, Google provides an effective tool for webmasters to increase the traffic to their websites. The idea is that each website has an XML file called “sitemap.xml”; this file should be uploaded to the top-level root of the site’s web (FTP) space. It lists the site’s pages and how frequently they change. Here is a sample file for a small website:


<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">

<url>
<loc>http://mnour.blogspot.com</loc>
<lastmod>2006-05-01</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>

<url>
<loc>http://mnour.blogspot.com/2006/04/sql-injection-part-1.html</loc>
<lastmod>2006-05-01</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>

<url>
<loc>http://mnour.blogspot.com/2006/04/sql-injection-part-2.html</loc>
<lastmod>2006-05-01</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>

<url>
<loc>http://mnour.blogspot.com/2006/04/ajax-new-giant.html</loc>
<lastmod>2006-05-01</lastmod>
<changefreq>weekly</changefreq>
<priority>1.0</priority>
</url>

</urlset>

While Google’s spiders scan the network, they try to read this file – “sitemap.xml” – from each scanned domain. Then they check whether any pages were modified since the last time the domain was scanned.
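Here is a rough sketch, again in Python, of what a crawler could do with the sitemap: fetch it, parse it, and keep only the URLs whose “lastmod” date is newer than the previous crawl. The namespace string matches the 0.84 schema in the sample above; the function name and the example dates are mine, not anything Google publishes.

import urllib.request
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.google.com/schemas/sitemap/0.84"}

def urls_changed_since(sitemap_url, last_crawl):
    """Return (loc, lastmod) pairs for entries newer than the previous crawl date."""
    xml_data = urllib.request.urlopen(sitemap_url).read()
    root = ET.fromstring(xml_data)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod and date.fromisoformat(lastmod) > last_crawl:
            changed.append((loc, lastmod))
    return changed

# changed = urls_changed_since("http://mnour.blogspot.com/sitemap.xml", date(2006, 4, 15))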

As you can notice, each “url” tag in the file states how frequently the content of that URL is expected to change. In practice, many webmasters mark all of their pages as updated hourly, thinking this will increase their chances of appearing in Google’s first results. However, Google has stated that this parameter is not taken at face value: a URL declared as updated monthly may well be indexed by Google’s spiders more often than one declared as updated hourly!

This technique benefits both sides: webmasters and Google. Webmasters get their sites into the search results of the most popular search engine on the net, while Google’s spiders need less time to update their index.
The main drawback falls on the webmasters: they have to update this file after every change to their site’s content.
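That chore is easy to script, though. The sketch below regenerates a sitemap.xml automatically; the page list, change frequencies and priorities here are just made-up example values, and a real site would pull them from its own content database.

import xml.etree.ElementTree as ET
from datetime import date

PAGES = [
    ("http://mnour.blogspot.com", "daily", "1.0"),
    ("http://mnour.blogspot.com/2006/04/sql-injection-part-1.html", "weekly", "1.0"),
]

def build_sitemap(pages, out_file="sitemap.xml"):
    # Root element with the same 0.84 schema namespace as the sample above.
    urlset = ET.Element("urlset",
                        xmlns="http://www.google.com/schemas/sitemap/0.84")
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = date.today().isoformat()
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    ET.ElementTree(urlset).write(out_file, encoding="UTF-8", xml_declaration=True)

# build_sitemap(PAGES)  # run after each content update, then upload the result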