Recording unvisited URLs, visited URLs, and web page content summaries

by scseoer on 2011-10-21 22:11:11

The gatherer starts from a pre-defined list of URLs (which can be understood as the initial list of unvisited URLs). These URLs are typically extracted from previous access records, especially popular sites and "What's New" webpages. In addition, many search engines also accept URLs submitted by users.

After the gatherer visits a webpage, it analyzes the page and extracts new URLs, adding them to the list of URLs to visit, thus recursively traversing the Web.
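The link-extraction step can be sketched roughly as follows. This is only an illustration using Python's standard library, not the gatherer's actual code; the names LinkExtractor and extract_links are hypothetical, and the sketch assumes that only href attributes of anchor tags count as new URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop any
                # "#fragment", since fragments point into the same page.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Return the URLs found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned URLs are candidates only; whether each one is actually queued depends on the duplicate checks described in the following paragraphs.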

Even with just one gatherer performing the task of collecting webpages, the issue of how to avoid revisiting the same webpage must be addressed. Therefore, two tables are defined: the "unvisited table" and the "visited table." The "unvisited table" stores URLs that are ready to be added to the pending visit queue, while the "visited table" stores URLs for which the corresponding webpages have already been requested.

When the gatherer parses new URLs after visiting a new webpage, it can use the "unvisited table" and "visited table" to determine which tasks have already been completed, thereby avoiding duplicate collection.
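As a minimal sketch of the two-table idea, assuming in-memory Python sets and a FIFO queue: the class name Frontier, the crawl function, and the fetch_links callback are hypothetical stand-ins, and a real gatherer would fetch pages over the network rather than call a function.

```python
from collections import deque


class Frontier:
    """Keeps an "unvisited table" and a "visited table" (illustrative only)."""

    def __init__(self, seeds):
        self.unvisited = deque()    # pending visit queue, in arrival order
        self.unvisited_set = set()  # same URLs, for fast membership tests
        self.visited = set()        # URLs that have already been requested
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue `url` unless it is already visited or already pending."""
        if url in self.visited or url in self.unvisited_set:
            return False
        self.unvisited.append(url)
        self.unvisited_set.add(url)
        return True

    def next_url(self):
        """Move the oldest pending URL to the visited table and return it."""
        if not self.unvisited:
            return None
        url = self.unvisited.popleft()
        self.unvisited_set.discard(url)
        self.visited.add(url)
        return url


def crawl(seeds, fetch_links):
    """Visit every reachable URL once; `fetch_links(url)` returns its links."""
    frontier = Frontier(seeds)
    order = []
    while True:
        url = frontier.next_url()
        if url is None:
            break
        order.append(url)
        for link in fetch_links(url):
            frontier.add(link)
    return order
```

With a toy link graph in which pages link back to one another, each page is still visited exactly once, because every parsed URL is checked against both tables before it is queued.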

In addition to summary information for the "visited table" and "unvisited table" described above, TSE (Tiny Search Engine) also stores summaries of the content of webpages it has already fetched. Content summaries are stored because the Web contains a large number of duplicated webpages: different URLs may lead to webpages with identical content.
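One way to detect identical content behind different URLs is sketched below. It assumes the page body is hashed byte for byte with MD5, so only exact duplicates are caught; the names ContentIndex and register are hypothetical and not taken from TSE.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Hex MD5 digest of the raw page body."""
    return hashlib.md5(body).hexdigest()


class ContentIndex:
    """Remembers which URL first delivered each distinct page body."""

    def __init__(self):
        self.first_url_by_digest = {}

    def register(self, url, body):
        """Return None if the content is new, else the URL that had it first."""
        digest = content_digest(body)
        if digest in self.first_url_by_digest:
            return self.first_url_by_digest[digest]
        self.first_url_by_digest[digest] = url
        return None
```

For example, if http://example.com/ and http://example.com/index.html return the same bytes, the second call to register reports the first URL, and the gatherer can skip storing the page again. Pages that differ by even one byte (a timestamp, for instance) produce different digests and are not treated as duplicates under this assumption.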

In TSE, MD5 digests are computed for visited URLs, unvisited URLs, and fetched webpage content, forming three sets of unique entries. When a new URL is parsed, its digest is first checked against the MD5 set of visited URLs to determine whether the URL has already been crawled. If it has not, the URL is added to the unvisited URL database; otherwise it is discarded. Provided the sets are hash-based, each lookup takes O(1) time on average.
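The check described above might look like the following sketch. It assumes the digest sets are Python sets of 16-byte MD5 values and that URLs are encoded as UTF-8 before hashing; url_digest, offer_url, and mark_visited are hypothetical names, not TSE functions.

```python
import hashlib


def url_digest(url: str) -> bytes:
    """16-byte MD5 digest of the URL (fixed size regardless of URL length)."""
    return hashlib.md5(url.encode("utf-8")).digest()


def offer_url(url, visited_md5, unvisited_md5):
    """Apply the check to a newly parsed URL.

    Returns True if the URL was accepted into the unvisited set,
    False if it was discarded because it has already been crawled.
    """
    digest = url_digest(url)
    if digest in visited_md5:   # average O(1) lookup in a hash-based set
        return False
    unvisited_md5.add(digest)   # adding an existing digest is a no-op
    return True


def mark_visited(url, visited_md5, unvisited_md5):
    """Move a URL's digest from the unvisited set to the visited set."""
    digest = url_digest(url)
    unvisited_md5.discard(digest)
    visited_md5.add(digest)
```

Storing fixed-size digests instead of full URLs keeps memory use per entry constant, at the cost of a very small chance that two different URLs collide on the same digest.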