Monday, 18 April 2011

Web Crawlers

Web crawlers are bots or programs designed to crawl web pages automatically. They are also known as spiders or ants. Crawlers make a copy of each web page they visit so that the search engine can index it later. Other kinds of crawlers exist as well: some validate the HTML code of web pages, while others, known as harvesters, gather email addresses from them. Crawlers require a feed of URLs, known as seeds, to begin crawling. Most crawlers are paired with a dedicated URL server, often on the same host machine from which the crawling is carried out.
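
To make the seed idea concrete, here is a minimal sketch in Python. The seed list and the save_copy helper are placeholders for illustration, not part of any real search engine:

    from urllib.request import urlopen

    seeds = ["http://example.com/", "http://example.org/"]   # illustrative seed feed

    def save_copy(url, html):
        # In a real search engine this copy would be handed to the indexer.
        name = url.replace("://", "_").replace("/", "_") + ".html"
        with open(name, "w", encoding="utf-8") as f:
            f.write(html)

    for url in seeds:
        html = urlopen(url).read().decode("utf-8", errors="replace")
        save_copy(url, html)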

The Technique of Web Crawlers
As web pages are crawled, the hyperlinks found on them are noted and queued to be crawled next. This queue of pending URLs is known as the crawl frontier. How deep a crawler mines into the web depends on the individual search engine; search engines have been estimated to crawl only about 16% of the web. Because crawlers are programs written in high-level languages such as C, C++, Python or Java, they can run many connections at the same time. Many search engines keep 300 to 350 crawler processes or open connections running in parallel, which lets them mine deep into the web and vary their crawling behaviour according to the conditions coded into them.
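
The sketch below shows how a crawler can keep several connections open at once, using only Python's standard library. The frontier contents and the worker count of 8 are assumptions for illustration; production crawlers run hundreds of connections:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    frontier = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

    def fetch(url):
        # Each worker holds one open connection while it downloads a page.
        return url, urlopen(url, timeout=10).read()

    with ThreadPoolExecutor(max_workers=8) as pool:   # e.g. 8 parallel connections
        for url, body in pool.map(fetch, frontier):
            print(url, len(body), "bytes")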

Different Types of Crawlers

Breadth First Search
Breadth First Search, or BFS, is a graph search algorithm that begins at the root node and then visits the nodes connected to it, level by level. Crawlers use this strategy to crawl the web and make copies of web pages. As each page is crawled, a copy of it is saved and the hyperlinks on the page are collected to be crawled next. The crawl continues until it reaches the maximum number of pages the crawler is configured to copy.
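
Here is a sketch of that breadth-first loop in Python. The seed URL, the MAX_PAGES limit and the simple link parser are assumptions for illustration:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    MAX_PAGES = 100   # illustrative page limit

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def bfs_crawl(seed):
        queue, seen, copies = deque([seed]), {seed}, {}
        while queue and len(copies) < MAX_PAGES:
            url = queue.popleft()
            html = urlopen(url).read().decode("utf-8", errors="replace")
            copies[url] = html                      # save a copy of the page
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:               # hyperlinks join the frontier
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return copies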

Path Ascending Crawlers
Path ascending crawlers are designed to crawl and copy every web page along the URL path of a given page. If a page sits in the third sub-directory of a website, the pages in each parent directory up to the site root are crawled as well. These crawlers are specific to the websites being crawled and are known to download a large amount of content, including videos, PowerPoint presentations, images, and PDF files.
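
The core of the path-ascending idea is deriving every parent directory from one deep URL so those pages can be crawled too. A sketch in Python follows; the example URL is made up:

    from urllib.parse import urlparse

    def ascending_paths(url):
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        base = f"{parts.scheme}://{parts.netloc}"
        # Drop the page itself, then walk up each parent directory to the root.
        paths = []
        for i in range(len(segments) - 1, -1, -1):
            paths.append(base + "/" + "/".join(segments[:i]))
        return [p.rstrip("/") + "/" for p in paths]

    print(ascending_paths("http://example.com/docs/2011/april/report.pdf"))
    # ['http://example.com/docs/2011/april/', 'http://example.com/docs/2011/',
    #  'http://example.com/docs/', 'http://example.com/']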

Focused Crawlers
Focused crawlers are also known as topical crawlers. In this crawling method, a website is crawled on the basis of the topics it covers: the crawler downloads pages that are topically similar to one another, following the hyperlinks on the pages it has already judged relevant and crawling them in the same fashion.
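
The decision at the heart of a focused crawler can be sketched as a simple relevance check before a page's links are followed. The keyword list and threshold below are assumptions; real focused crawlers use far richer topic models:

    TOPIC_KEYWORDS = {"crawler", "search", "index", "spider"}   # illustrative topic
    THRESHOLD = 0.01

    def relevance(text):
        words = text.lower().split()
        if not words:
            return 0.0
        hits = sum(1 for w in words if w.strip(".,!?") in TOPIC_KEYWORDS)
        return hits / len(words)

    def should_expand(page_text):
        # Only pages judged on-topic contribute their hyperlinks to the frontier.
        return relevance(page_text) >= THRESHOLD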



Distributed Crawling
This method is more a matter of crawler hardware configuration than a crawling strategy. Crawlers are spread across different computers or servers to achieve maximum download throughput and to reduce the bandwidth load on any central crawling server.
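
One common way to split the work, sketched below as an assumption rather than a documented design, is to hash each URL's host so that every crawler machine gets its own stable share of the web:

    import hashlib
    from urllib.parse import urlparse

    CRAWLER_NODES = ["crawler-1", "crawler-2", "crawler-3"]   # hypothetical machines

    def assign_node(url):
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode()).hexdigest()
        return CRAWLER_NODES[int(digest, 16) % len(CRAWLER_NODES)]

    print(assign_node("http://example.com/page"))   # same host -> same node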

Crawling the Web
Web administrators update website content every week or every few weeks, while crawlers may only return to a website many weeks after their initial crawl. By then, new content has been added to its pages, yet search queries will not show those newly updated pages in their results for the same reason. Crawlers handle this by maintaining a crawl frequency for every page. If a page is found to change often after its initial crawl, it is given a higher update frequency and is revisited more often; the revisit frequency of a frequently updated page is therefore higher than that of a rarely updated one. Relevance and page rank also depend on these update patterns. Search engines may penalize websites that update content frequently without adding anything of value with each update, which can hurt their eventual relevance and page rank; frequent updates without real value addition are seen as a weakness by search engines.
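
A simple revisit-scheduling rule can be sketched as follows: pages that changed since the last crawl are revisited sooner, unchanged pages less often. The interval bounds and the doubling/halving factor are illustrative assumptions:

    import hashlib

    MIN_DAYS, MAX_DAYS = 1, 60   # illustrative bounds on the revisit interval

    def next_interval(old_html, new_html, current_days):
        changed = (hashlib.sha1(old_html.encode()).hexdigest()
                   != hashlib.sha1(new_html.encode()).hexdigest())
        if changed:
            return max(MIN_DAYS, current_days // 2)   # crawl more frequently
        return min(MAX_DAYS, current_days * 2)        # back off on stale pages
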
Web crawlers are known to consume a great deal of server bandwidth, causing servers to slow down, misbehave or even crash. The robots exclusion protocol uses the robots.txt file to specify the pages that crawlers should ignore, and the interval between requests to download resources from a website can be set with the 'crawl-delay' parameter in the same file. Web crawlers are designed with highly optimized architectures to deliver maximum performance and efficiency; crawler strategies and architectures are kept as closely guarded business secrets. Search engine spamming is another major concern, along with commercial operators who try to bait the crawlers and take advantage of them.
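
A polite crawler can honour robots.txt with Python's standard urllib.robotparser module, as in the sketch below. The site URL and the user-agent string are placeholders:

    import time
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://example.com/robots.txt")
    rp.read()

    USER_AGENT = "MyCrawler"   # hypothetical bot name

    delay = rp.crawl_delay(USER_AGENT) or 1   # fall back to 1 second if unset
    url = "http://example.com/some/page.html"
    if rp.can_fetch(USER_AGENT, url):         # skip pages the file disallows
        time.sleep(delay)                     # respect the crawl-delay directive
        # ...download the page here...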

Different Web Crawlers
The Yahoo! crawler is called Slurp. It uses different crawling styles because it is built on different technologies acquired from other search engines. The Google crawler is written in C++ and Python and is tightly integrated with Google's indexing process. FAST Crawler is a distributed crawler. Methabot is a web crawler written in C. PolyBot is another distributed crawler written in C++ and Python; it has a crawl manager, a dedicated downloader and a DNS resolver, and the URLs collected from web pages are stored and processed later. WebCrawler was built to create a simple full-text index of the web and was designed around the breadth first search algorithm. The World Wide Web Worm maintained a simple index of document titles and URLs, which could be searched with the UNIX grep command.

WebFountain is a distributed, modular crawler written in C++. It uses a controller machine connected to several spider or ant machines in a modular crawling pattern. WebRace is a crawler and cacher developed in Java; it downloads web pages in response to user requests, working much like a directory, and when pages are updated it downloads them again and notifies users of the change.
