Bomber Charlie: How Search Engines Work

Search engines are designed to crawl and index the World Wide Web discreetly and efficiently in order to produce useful and informative search results. They index millions of web pages containing a wide variety of search terms, and they answer thousands of queries from their users every day. As the web grows every day, running those queries becomes even more challenging. Hardware performance and cost also influence the speed of queries and the creation of indexes.

How Search Engines Rank the Web

Search engines base their ranking on an algorithm known as PageRank. PageRank is based on the link structure of the web, which forms a citation graph. Search engines build graphs containing millions of hyperlinks, and these graphs allow a web page’s PageRank to be calculated rapidly. PageRank is modeled largely on academic citation, in which importance is judged by the citations and back links a web page receives. The greater the quality and number of citations and back links a page receives, the higher its PageRank. Hence, a citation from a page that is itself highly cited confers more PageRank than a citation from a page with few citations of its own.
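To make the idea concrete, here is a minimal sketch of a rank calculation over a tiny link graph. It is only an illustration, not the method any particular search engine uses: the function name `pagerank`, the damping value of 0.85, the fixed iteration count and the four-page graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for every page in a small link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small base share regardless of its back links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: pages a, b and d all link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, page c collects the most back links and ends up with the highest rank. Page a comes second because its single back link comes from the highly ranked c, which mirrors the point that a citation from an important page counts for more.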

Search Engine Architecture

Most search engines are written in high-level languages such as C or C++. These languages support the development of robust and reliable crawler applications, which typically run on Solaris or Linux systems. Search engines have dedicated crawler programs that are fed lists of URLs. The web pages retrieved by the crawlers are compressed and stored in a repository, and every web page is assigned an ID number. The indexer is a program that parses the web pages and records every word of each page in a repository. The indexer is also responsible for recording the citations and back links, which eventually determine the PageRank of a web page. Parsed web pages are also sorted into PageRank stacks before being stored in the repositories.
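The storage and indexing steps described above can be sketched in a few lines. This is a toy model under stated assumptions, not a real search engine component: the class name `Repository`, its methods and the example URLs are hypothetical, and zlib stands in for whatever compression a real system would use.

```python
import re
import zlib


class Repository:
    """Toy page store plus word index (all names here are hypothetical)."""

    def __init__(self):
        self.pages = {}    # page ID -> compressed page text
        self.urls = {}     # page ID -> URL
        self.index = {}    # word -> {page ID: [word positions]}
        self.next_id = 0

    def store(self, url, text):
        """Assign an ID, compress the page and record every word in it."""
        page_id = self.next_id
        self.next_id += 1
        self.urls[page_id] = url
        self.pages[page_id] = zlib.compress(text.encode("utf-8"))
        # Record the position of each word so queries can find it later.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            self.index.setdefault(word, {}).setdefault(page_id, []).append(position)
        return page_id

    def load(self, page_id):
        """Decompress and return the stored page text."""
        return zlib.decompress(self.pages[page_id]).decode("utf-8")


repo = Repository()
first = repo.store("http://example.com/a", "Search engines crawl the web")
second = repo.store("http://example.com/b", "Crawlers crawl pages")
print(repo.index["crawl"])
print(repo.load(first))
```

Looking up the word "crawl" returns both page IDs together with the position of the word in each page, which is the information a query needs later on.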

How Search Engines Crawl the Web

Search engines have about 300 crawlers, or ‘open connections’, ready to crawl the web at any time. A URL server feeds URLs to these hungry crawlers. The crawlers need to fetch over 100 web pages per second, which amounts to about 600KB of data per second, or roughly 6KB per page. Under the Robots Exclusion Protocol, search engines skip web pages whose robots <META> tag is set to NOINDEX, NOFOLLOW. Web sites that do not want their pages crawled or indexed by search engines use this tag to say so.
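As a rough illustration of how a crawler might honor the robots <META> tag, the sketch below reads the tag with Python's standard `html.parser` module. The class and function names (`RobotsMetaParser`, `robots_directives`, `may_index`, `may_follow`) are made up for this example, and a real crawler would handle many more cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html):
    return "noindex" not in robots_directives(html)


def may_follow(html):
    return "nofollow" not in robots_directives(html)


blocked = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
allowed = "<html><head><title>Welcome</title></head></html>"
print(may_index(blocked), may_follow(blocked))
print(may_index(allowed), may_follow(allowed))
```

A page without the tag is treated as open to both indexing and link following, which is the usual default.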

How a Query Generates a Search Result

A single-word query is matched against the parsed web pages stored in the repositories. The query word is checked for occurrences in the title, anchor text, URL, meta tags, body content and other attributes. The number of occurrences in each attribute is that attribute’s weight, and the weights of the different attributes are then compared with one another. A factor is also involved in assessing a query: the importance and value assigned to an occurrence in each attribute. Together, the weights and factors give a collective score, which is used to assess the relevance of the web page to the keyword in comparison to other web pages. This score is combined with PageRank to present the most relevant results first.
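The weight-and-factor idea can be sketched as follows. The factor values in `FIELD_FACTORS`, the function names and the choice to multiply relevance by the page rank are all assumptions made for illustration; the text above does not say what the real factors are or how the two numbers are combined.

```python
import re

# Hypothetical importance factor for an occurrence in each attribute.
FIELD_FACTORS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "meta": 2.0,
    "body": 1.0,
}


def count_occurrences(word, text):
    """Weight of an attribute: how often the query word occurs in it."""
    return re.findall(r"\w+", text.lower()).count(word.lower())


def relevance(word, page):
    """Sum of weight times factor over all attributes of a page."""
    total = 0.0
    for field, factor in FIELD_FACTORS.items():
        total += factor * count_occurrences(word, page.get(field, ""))
    return total


def score(word, page, page_rank):
    """Combine relevance with the page's rank (a simple product, by assumption)."""
    return relevance(word, page) * page_rank


page = {
    "title": "Search engines",
    "url": "http://example.com/search",
    "body": "A search engine runs a search",
}
print(relevance("search", page))   # 1 title hit, 1 URL hit, 2 body hits -> 10.0
print(score("search", page, 0.5))  # -> 5.0
```

Here one occurrence in the title counts for more than two in the body, which shows how the factor lets some attributes matter more than others.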

Multiple-word queries are assessed similarly, but the weight also depends on the proximity of the search keywords. The closer together they occur in a web page, the more weight and factor they earn. Proximity is thus treated as a sign of relevance and meaning when the collective weight and factor are assigned.
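One simple way to turn proximity into a number is to take the smallest distance between the two keywords in a page and use its inverse as the weight. This is only a sketch of the idea: the function name `proximity_weight` and the 1/distance formula are assumptions, not a description of how any search engine actually scores proximity.

```python
import re


def proximity_weight(text, word_a, word_b):
    """Return a weight that grows as the two words occur closer together.

    Returns 0.0 if either word is missing. The 1/distance formula is an
    illustrative assumption.
    """
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == word_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b.lower()]
    if not positions_a or not positions_b:
        return 0.0
    # Smallest gap, in words, between any occurrence of each keyword.
    distance = min(abs(a - b) for a in positions_a for b in positions_b)
    # Guard against a zero distance when both keywords are the same word.
    return 1.0 / max(distance, 1)


text = "Search engines crawl the web"
print(proximity_weight(text, "search", "engines"))  # adjacent words -> 1.0
print(proximity_weight(text, "search", "web"))      # four words apart -> 0.25
```

Adjacent keywords earn the full weight, while keywords four words apart earn only a quarter of it.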

How They Earn Income and Stay in Business

Search engines depend on commercial advertisements displayed alongside search results for their income. These advertisements are relevant to the search keywords. Search engines allow advertisers to promote their products online for a fee, and the advertisements are displayed at the top and on the right-hand side of the search results. Users who find these advertisements useful will most probably check them out too. This leads to more exposure and profitability for the advertisers.

Bomber Charlie

Monday, 18 April 2011
