This post gives a high-level overview of the architecture of Google Search.
Google Search relies on web crawlers.
Web crawling is the process of downloading web pages.
Crawling is not done by a single crawler but by several distributed crawlers.
The process starts with the URL Server, which sends the crawlers the lists of URLs that need to be fetched.
Fetched web pages are forwarded to the Storeserver, which compresses them.
The compressed web pages are stored in a repository.
Note that every fetched web page has a URL and a unique ID assigned to it.
This ID is known as the docID.
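
The crawl-and-store flow above can be sketched in a few lines of Python. This is only a minimal illustration, not Google's actual code: the class names `URLServer` and `StoreServer`, the `fake_fetch` helper (which stands in for a real HTTP download), and the round-robin crawler loop are all assumptions made for the example. The docID here is simply the next free integer.

```python
import zlib
from collections import deque


class URLServer:
    """Hands out URLs that still need to be fetched (hypothetical interface)."""

    def __init__(self, urls):
        self.pending = deque(urls)

    def next_batch(self, size):
        batch = []
        while self.pending and len(batch) < size:
            batch.append(self.pending.popleft())
        return batch


class StoreServer:
    """Compresses fetched pages and keeps them in a repository keyed by docID."""

    def __init__(self):
        self.repository = {}  # docID -> (url, compressed HTML)
        self.doc_ids = {}     # url -> docID

    def doc_id_for(self, url):
        # Assumption: a docID is the next free integer for a new URL.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        doc_id = self.doc_id_for(url)
        self.repository[doc_id] = (url, zlib.compress(html.encode("utf-8")))
        return doc_id

    def load(self, doc_id):
        url, blob = self.repository[doc_id]
        return url, zlib.decompress(blob).decode("utf-8")


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return f"<html><body>Page at {url}</body></html>"


def crawl(url_server, store_server, crawler_count=2, batch_size=2):
    # Each simulated crawler takes turns asking the URL server for work.
    while url_server.pending:
        for _ in range(crawler_count):
            for url in url_server.next_batch(batch_size):
                store_server.store(url, fake_fetch(url))


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
store = StoreServer()
crawl(URLServer(urls), store)
print(store.load(0))
```

In a real deployment the crawlers would run on separate machines and fetch pages over the network; the loop here only imitates that division of work.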
The Indexer box shown in the image takes care of indexing and sorting.
The Indexer has several roles: reading the repository, uncompressing the documents, and parsing them.
Each document is then converted into hits: a set of word occurrences.
A hit records details such as the word, its position in the document, font size, and capitalization.
The Indexer distributes the hits into barrels, forming a partially sorted forward index.
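
To make hits and barrels more concrete, here is a small sketch of how a document could be turned into hits and spread over barrels. It rests on several assumptions: the `Hit` tuple, the `word_id` and `barrel_for` helpers, and the modulo rule for choosing a barrel are invented for the example, and font size is left out because the input is plain text.

```python
import re
from collections import defaultdict, namedtuple

# A hit records where a word occurs and whether it was capitalized.
Hit = namedtuple("Hit", ["position", "capitalized"])

lexicon = {}      # word -> wordID
NUM_BARRELS = 2   # assumption: a tiny, fixed number of barrels


def word_id(word):
    # Assign the next free integer to a word seen for the first time.
    if word not in lexicon:
        lexicon[word] = len(lexicon)
    return lexicon[word]


def parse_hits(text):
    """Return {wordID: [Hit, ...]} for one document."""
    hits = defaultdict(list)
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        hits[word_id(token.lower())].append(Hit(position, token[0].isupper()))
    return dict(hits)


def barrel_for(wid):
    # Assumption: pick the barrel with a simple modulo on the wordID.
    return wid % NUM_BARRELS


def index_documents(docs):
    """docs: {docID: text}. Each barrel holds (docID, wordID, hits) entries."""
    barrels = [[] for _ in range(NUM_BARRELS)]
    for doc_id in sorted(docs):  # visiting docs in docID order keeps barrels sorted by docID
        for wid, hit_list in parse_hits(docs[doc_id]).items():
            barrels[barrel_for(wid)].append((doc_id, wid, hit_list))
    return barrels


docs = {0: "Google search engine", 1: "search the Web"}
barrels = index_documents(docs)
for number, barrel in enumerate(barrels):
    print(number, barrel)
```

Each barrel ends up ordered by docID, which is why the forward index is called partially sorted: the entries are grouped by document, not yet by word.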
The Indexer also performs the important function of parsing every link in the web page and extracting the information held in anchor tags.
This information is stored in an anchors file, which has enough detail to determine which page each link comes from, where it points to, and the text of the link.
We can see that information from the Anchors box is forwarded to the URL Resolver.
The URL Resolver reads the anchors file and converts relative URLs into absolute URLs.
Each absolute URL is in turn mapped to a unique docID. The anchor text is put into the forward index, associated with the docID that the anchor points to.
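
The URL Resolver step can be illustrated with Python's standard `urllib.parse.urljoin`, which turns a relative link into an absolute URL given the page it appears on. The rest of this sketch is assumed for illustration: the `(source_url, href, anchor_text)` record layout, the `doc_id_for` helper, and the plain dictionary that stands in for the forward index.

```python
from urllib.parse import urljoin

doc_ids = {}  # absolute URL -> docID


def doc_id_for(url):
    # Assumption: a docID is the next free integer for a new URL.
    if url not in doc_ids:
        doc_ids[url] = len(doc_ids)
    return doc_ids[url]


def resolve_anchors(anchors):
    """anchors: iterable of (source_url, href, anchor_text) records.

    Returns {target docID: [anchor text, ...]}, so each anchor text is
    attached to the document the link points to.
    """
    anchor_text_by_doc = {}
    for source_url, href, text in anchors:
        absolute = urljoin(source_url, href)  # relative -> absolute
        target_id = doc_id_for(absolute)
        anchor_text_by_doc.setdefault(target_id, []).append(text)
    return anchor_text_by_doc


anchors = [
    ("http://example.com/docs/index.html", "../about.html", "about us"),
    ("http://example.com/docs/index.html", "guide.html", "user guide"),
    ("http://example.com/blog/", "/about.html", "who we are"),
]
result = resolve_anchors(anchors)
print(doc_ids)
print(result)
```

Note that two differently written links (`../about.html` and `/about.html`) resolve to the same absolute URL and therefore share one docID.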
A database of links (pairs of docIDs) is also generated; this links database is used to compute the PageRank of every document.
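
As a rough illustration of how PageRank can be computed from pairs of docIDs, here is a small power-iteration sketch. It is not Google's production algorithm: the function name `pagerank`, the fixed iteration count, the normalized form in which all ranks sum to 1, and the even spreading of rank from pages without outgoing links are assumptions chosen to keep the example short. The damping factor of 0.85 is the value suggested in the referenced paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: list of (source docID, target docID) pairs."""
    nodes = sorted({node for pair in links for node in pair})
    out_links = {node: [] for node in nodes}
    for src, dst in links:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every page passes an equal share of its rank to each page it links to.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


links = [(0, 1), (0, 2), (1, 2), (2, 0)]
ranks = pagerank(links)
for doc_id, score in sorted(ranks.items()):
    print(doc_id, round(score, 3))
```

In this tiny graph, document 1 has only one incoming link, from a page that splits its rank two ways, so it ends up with the lowest score.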
The Sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index.
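
The Sorter's job can be pictured as re-ordering the same entries by a different key. In the sketch below, the `(docID, wordID, positions)` entry layout and the `sort_barrel` function are assumptions made for illustration; the point is only that sorting by wordID groups together every document containing a given word.

```python
from collections import defaultdict


def sort_barrel(forward_barrel):
    """forward_barrel: list of (docID, wordID, positions), sorted by docID.

    Returns an inverted barrel: {wordID: [(docID, positions), ...]}.
    """
    # Re-sort by wordID first, then by docID within each word.
    by_word = sorted(forward_barrel, key=lambda entry: (entry[1], entry[0]))
    inverted = defaultdict(list)
    for doc_id, wid, positions in by_word:
        inverted[wid].append((doc_id, positions))
    return dict(inverted)


forward_barrel = [
    (0, 7, [0, 4]),
    (0, 3, [1]),
    (1, 3, [2]),
    (2, 7, [5]),
]
inverted = sort_barrel(forward_barrel)
print(inverted)
```

After sorting, looking up a wordID gives the list of documents that contain it, which is exactly what a search needs.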
There is a program called DumpLexicon. It takes the list of wordIDs produced by the Sorter together with the lexicon (a kind of vocabulary) produced by the Indexer and generates a new lexicon for the Searcher to use.
The Searcher is run by a web server. It uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks of documents to answer users' search queries.
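
A heavily simplified Searcher might combine these three pieces as follows. The `search` function, the data layout, and the ranking rule are assumptions for illustration: this sketch requires every query word to appear in a document and orders the matches by PageRank alone, whereas the real system also weighs the hits themselves.

```python
def search(query, lexicon, inverted_index, pagerank):
    """Return docIDs containing every query word, best PageRank first."""
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word means no document can match
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # Intersect the document lists of all query words.
    matches = set(inverted_index.get(word_ids[0], []))
    for wid in word_ids[1:]:
        matches &= set(inverted_index.get(wid, []))

    return sorted(matches, key=lambda doc_id: pagerank.get(doc_id, 0.0), reverse=True)


# Hypothetical toy data: word -> wordID, wordID -> docIDs, docID -> PageRank.
lexicon = {"google": 0, "search": 1, "engine": 2}
inverted_index = {0: [10, 11], 1: [10, 11, 12], 2: [11, 12]}
pagerank = {10: 0.2, 11: 0.5, 12: 0.3}

print(search("search engine", lexicon, inverted_index, pagerank))
print(search("Google search", lexicon, inverted_index, pagerank))
```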
This is how the architecture of Google Search works.
Reference for this article:
http://infolab.stanford.edu/~backrub/google.html