How Google Works

by joepue on 2008-08-26 12:32:58

In the past 12 months, Google has doubled its head count and made its search engine faster. It now answers more queries than Microsoft and Yahoo combined. But there is one query we have to answer for ourselves: how does Google work?

It all started with a spelling mistake. Ten years ago this September, a group of Stanford graduate students were helping Larry Page name his search engine. "Googolplex," Sean Anderson suggested, after the unimaginably huge number (they had already sensed how big this thing could become). "Googol," Page countered. Anderson, checking whether the name was available, typed g-o-o-g-l-e into his browser and made the most famous spelling error since p-o-t-a-t-o-e[1]. Page registered the name within hours, and today Google is no longer a typo; it is a verb, a verb backed by a market value of roughly $160 billion.

Below is a guide to what happens during an ordinary search, with automatic spell-checking included, of course.

1. Query Box

The story begins when someone types in a query, such as which dog food is safest, when the traffic bureau closes, or what China's preferential interest rate is.

2. DNS

"Hello, this is the operator..."

Google's domain name servers run in data centers around the world that Google rents or owns, including a hub in the Port Authority building in Manhattan. Their sole task is to route each search request as efficiently as possible to a Google cluster, weighing which cluster is closest to the searcher and which is least busy at that moment.
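
Google has never published its routing policy. As a toy illustration of the idea, pick the nearest cluster that is not already overloaded, a minimal Python sketch might look like this; the cluster names, distances, and load figures are all invented:

    # Toy cluster selection: prefer the closest cluster, but skip any that
    # is already running too hot. Every value below is made up.

    CLUSTERS = [
        {"name": "us-east", "distance_ms": 20, "load": 0.92},
        {"name": "us-west", "distance_ms": 70, "load": 0.40},
        {"name": "eu-west", "distance_ms": 95, "load": 0.35},
    ]

    def pick_cluster(clusters, max_load=0.85):
        """Return the closest cluster whose load is below max_load."""
        healthy = [c for c in clusters if c["load"] < max_load]
        candidates = healthy or clusters      # fall back if everything is busy
        return min(candidates, key=lambda c: c["distance_ms"])

    print(pick_cluster(CLUSTERS)["name"])     # prints "us-west"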

3. Cluster

The search request then arrives at one of at least two hundred clusters. These clusters are located in Google’s data centers around the world.

4. Google Web Server

This program splits a search request across hundreds or even thousands of machines so they can all work on it simultaneously. It is the difference between shopping a grocery store alone and having 100 people each grab one item and toss it into your cart at the same time.
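
Google's server software is proprietary; a minimal sketch of this scatter-gather idea, with the index "shards" simulated by in-process dictionaries instead of separate machines, could look like this:

    # Toy scatter-gather: send the same query to every shard at once,
    # then merge whatever comes back, best-scoring hits first.
    from concurrent.futures import ThreadPoolExecutor

    FAKE_SHARDS = [
        {"dog food": [("doc-11", 0.9)], "hedge funds": [("doc-42", 0.7)]},
        {"dog food": [("doc-23", 0.8)]},
        {"hedge funds": [("doc-57", 0.95)], "dog food": [("doc-31", 0.5)]},
    ]

    def search_shard(shard, query):
        """Each shard returns the (doc_id, score) pairs it knows about."""
        return shard.get(query, [])

    def scatter_gather(query):
        with ThreadPoolExecutor(max_workers=len(FAKE_SHARDS)) as pool:
            partials = list(pool.map(lambda s: search_shard(s, query), FAKE_SHARDS))
        hits = [hit for partial in partials for hit in partial]
        return sorted(hits, key=lambda h: h[1], reverse=True)

    print(scatter_gather("dog food"))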

5. Index Server

Everything Google knows is stored in one enormous database. Rather than wait for a single computer to sift through so many gigabytes of data, Google lets hundreds of computers scan its "card catalog" simultaneously for any relevant entries. Popular searches (Britney Spears, for example) are cached, kept in memory for several hours rather than being run again from scratch.
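
As a rough illustration (not Google's actual data structures), a tiny inverted-index "card catalog" with an in-memory cache for popular queries might look like the following; the index contents and the few-hours lifetime are invented:

    # Toy index server: an inverted index maps each word to the documents
    # containing it, and recent answers are kept in memory for a while.
    import time

    INVERTED_INDEX = {
        "britney": {1, 4, 9},
        "spears":  {1, 9},
        "hedge":   {2, 7},
        "funds":   {2, 5, 7},
    }

    CACHE_TTL = 3 * 60 * 60          # keep popular answers for a few hours
    _cache = {}                      # query -> (expiry_time, result)

    def lookup(query):
        now = time.time()
        if query in _cache and _cache[query][0] > now:
            return _cache[query][1]                    # served from memory
        words = query.lower().split()
        sets = [INVERTED_INDEX.get(w, set()) for w in words]
        result = set.intersection(*sets) if sets else set()
        _cache[query] = (now + CACHE_TTL, result)
        return result

    print(lookup("britney spears"))   # computed: {1, 9}
    print(lookup("britney spears"))   # the second call is answered from the cache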

6. Document Server

After the index server produces its results, the document server pulls the relevant documents (including links and snippets) out of that same enormous database. How does Google search the entire Web so fast? It doesn't. It keeps three copies of the Internet's content on its document servers, and all of that data has already been sorted in advance.
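
The lookup step itself can be pictured very simply: document IDs go in, stored URLs and snippets come out. A toy version, with every record invented, might be:

    # Toy document server: given the doc IDs the index server matched,
    # return each document's URL and snippet from the stored copy.
    DOCUMENT_STORE = {
        1: {"url": "http://example.com/pop",   "snippet": "Britney Spears news and..."},
        2: {"url": "http://example.com/hedge", "snippet": "A hedge fund is a pooled..."},
        9: {"url": "http://example.com/chart", "snippet": "Pop charts this week..."},
    }

    def fetch_documents(doc_ids):
        return [DOCUMENT_STORE[d] for d in doc_ids if d in DOCUMENT_STORE]

    for doc in fetch_documents([1, 9]):
        print(doc["url"], "-", doc["snippet"])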

7. Spell-Check Server

Google doesn't understand sentences; it looks for patterns of words, whether in English or Sanskrit. If it finds 1,000 results matching the pattern of your query but a million results matching a very similar pattern, it connects the dots and politely asks whether you meant those other words instead, even after it has already shown you results. Say, for example, your fat fingers typed "hwedge funds"[2].
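
One crude way to capture that connect-the-dots behavior (purely illustrative, not Google's actual method) is to compare how many results the typed query gets against how many a near-identical spelling gets:

    # Toy "did you mean": if a one-character variant of the query matches far
    # more documents than what was actually typed, suggest the variant.
    # The result counts and the 100x threshold are made up.
    RESULT_COUNTS = {
        "hedge funds":  1_000_000,
        "hwedge funds": 1_000,
    }

    def one_deletion_variants(text):
        """Every string obtainable by dropping a single character."""
        return {text[:i] + text[i + 1:] for i in range(len(text))}

    def did_you_mean(query, counts=RESULT_COUNTS, ratio=100):
        own = counts.get(query, 0)
        best = max(one_deletion_variants(query), key=lambda v: counts.get(v, 0))
        return best if counts.get(best, 0) > own * ratio else None

    print(did_you_mean("hwedge funds"))   # prints "hedge funds"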

8. Ad Server

Every search query also passes through an ad database, and any matching ads are handed to the Web server to be placed on the results page. The ad team is literally racing the search team: Google promises to serve every search as fast as possible, so if generating the ads takes longer than generating the results, those ads never make it onto the page, and Google earns nothing from that search.
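
A minimal sketch of that race, assuming a simple deadline rule (the timings and the grace period are invented, and the "searches" are just sleeps), might look like this:

    # Toy ad race: the ad lookup runs alongside the search; if it has not
    # finished by the time the results are ready, the page ships without ads.
    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def run_search(query):
        time.sleep(0.05)                  # pretend the search takes 50 ms
        return ["result 1", "result 2"]

    def run_ads(query):
        time.sleep(0.20)                  # the ad lookup is slower today
        return ["sponsored link"]

    def build_page(query):
        with ThreadPoolExecutor(max_workers=1) as pool:
            ad_future = pool.submit(run_ads, query)
            results = run_search(query)   # the search sets the deadline
            try:
                ads = ad_future.result(timeout=0.01)   # tiny grace period
            except TimeoutError:
                ads = []                  # too slow: no ads, no revenue
        return {"results": results, "ads": ads}

    print(build_page("hedge funds"))      # the ads list comes back empty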

9. Page Generator

The Google web server gathers the thousands of results generated for the query, organizes all that data, and renders Google's clean, uncluttered results page in your browser window, all in less time than it takes to read this sentence.

10. Search Results Displayed

Usually within 0.25 seconds, often less.

Cluster Control

Google's real genius lies in its networking software, which lets thousands of inexpensive computers work together in a cluster like one giant machine. Cheap hardware means parts can be replaced without stopping the whole operation: if a computer dies, an engineer simply pulls the faulty machine, and at least two other machines nearby take over its work.
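
A toy picture of that failover, assuming each slice of the index simply lives on several machines (the machine names and replica counts are invented), could be:

    # Toy failover: every shard has multiple replicas, so when one machine
    # dies, the first healthy neighbor answers in its place.
    SHARD_REPLICAS = {
        "shard-7": ["rack3-pc12", "rack5-pc02", "rack9-pc44"],
    }
    DEAD_MACHINES = {"rack3-pc12"}        # pretend this box just failed

    def serve_from(shard):
        for machine in SHARD_REPLICAS[shard]:
            if machine not in DEAD_MACHINES:
                return machine            # first healthy replica wins
        raise RuntimeError("all replicas are down")

    print(serve_from("shard-7"))          # prints "rack5-pc02"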

Power

The only thing limiting Google's performance is how much electricity the company can buy. Its recently built data center on the Columbia River in The Dalles, Oregon, taps into 1.8 billion watts of hydroelectric power; conveniently, it also sits along the network route between Asia and the United States. This factory of bytes holds two computing centers, each the size of a football field.

Capacity (Petabytes)

Based on figures Google has released, experts estimate that its servers hold at least 20 petabytes (20 quadrillion bytes)[3] of data. Googlers, however, are famously secretive; Wired magazine says Google's capacity may be closer to 200 petabytes. How much is that? If your iPod held 1 petabyte (one million gigabytes), it could store two hundred million songs. And if you started downloading 1 petabyte over your high-speed Internet connection right now, your great-great-great-great-grandchildren might still be waiting; the last byte would not arrive until around 2514.
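
A quick back-of-the-envelope check on that last claim, assuming a sustained rate of about 0.5 megabits per second (the article never states a speed; a faster link shrinks the figure, but it stays in the centuries):

    # Rough download-time arithmetic for 1 petabyte. The 0.5 Mbit/s rate is
    # an assumption made here purely to reproduce the order of magnitude.
    PETABYTE_BITS = 8 * 10**15            # 1 PB = 10^15 bytes = 8 * 10^15 bits
    RATE_BPS = 0.5 * 10**6                # assumed sustained throughput
    SECONDS_PER_YEAR = 3600 * 24 * 365.25

    years = PETABYTE_BITS / RATE_BPS / SECONDS_PER_YEAR
    print(f"about {years:.0f} years")     # roughly five hundred years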

PageRank

Google decides how trustworthy a website is, and how important its content is when it assembles a list of search results, by weighing more than two hundred factors alongside the content itself. But the secret lies in Google's patented formula, which sizes up a site by looking at all of the links on a page; in practice, a site's credibility depends largely on the quality of the sites that link to it.
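
The full formula and its two hundred-plus signals are secret, but the published core idea, PageRank, can be sketched as a few lines of power iteration over a made-up four-page link graph:

    # Miniature PageRank: each page spreads its score evenly across its
    # outgoing links, plus a small "teleport" term, iterated until it settles.
    LINKS = {                     # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            rank = new
        return rank

    for page, score in sorted(pagerank(LINKS).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))      # "c" wins: three pages link to it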

Googlebots

Google uses programs called spiders to make a copy of the Internet. On popular sites, Googlebots may follow every link several times an hour. As they move through the pages, the crawlers save every byte of text and code. That raw data is hauled back to the cluster, run through the mill, and gradually replaces the older data stored in the index and document servers, keeping the results fresh instead of letting them turn into fossils.
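
A toy Googlebot, crawling a small invented "web" held in memory rather than fetching real pages over HTTP, might look like this:

    # Toy crawler: start from a seed page, store every page it sees, and
    # follow the links it finds until there is nothing new left.
    import re
    from collections import deque

    FAKE_WEB = {
        "http://a.example": '<a href="http://b.example">b</a> hello',
        "http://b.example": '<a href="http://a.example">a</a> world',
    }

    def crawl(seed):
        stored, frontier, seen = {}, deque([seed]), {seed}
        while frontier:
            url = frontier.popleft()
            html = FAKE_WEB.get(url, "")
            stored[url] = html                          # keep every byte
            for link in re.findall(r'href="([^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return stored                                   # a fresh copy of the "web"

    print(sorted(crawl("http://a.example")))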