Heritrix Advantages and Disadvantages

by stilling2006 on 2009-10-01 09:50:52

Heritrix is the open-source, extensible, web-scale archival crawler project of the Internet Archive (IA).

The Heritrix project began in early 2003. IA's goal was to develop a specialized crawler for archiving web resources and building a digital library of the internet. In the six years since, IA has accumulated 400 TB of data.

IA envisions several types of crawler:

Broad crawler: crawls a large number of sites at high bandwidth.

Focused crawler: concentrates on selected topics.

Continuous crawler: crawls current web pages and also revisits them later to capture updates.

Experimental crawler: experiments with crawling techniques, such as deciding what to crawl, crawling over different protocols, and analyzing crawl results.

Heritrix's homepage is http://crawler.archive.org

Heritrix is a crawler framework that incorporates various interchangeable components.
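
To make "interchangeable components" concrete, here is a minimal sketch in Python. It is not Heritrix's actual API (Heritrix itself is written in Java), and every name in it (CrawlItem, fake_fetcher, link_extractor, null_writer, run_pipeline) is hypothetical and chosen only for illustration. The point is that each stage shares one simple interface, so any stage can be swapped out or dropped without touching the others.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CrawlItem:
    """Hypothetical record passed from one component to the next."""
    url: str
    content: str = ""
    links: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


# A component is any callable that takes a CrawlItem and updates it in place.
Component = Callable[[CrawlItem], None]


def fake_fetcher(item: CrawlItem) -> None:
    # Stand-in for a real HTTP fetch; builds a tiny page, no network access.
    item.content = f"<a href='{item.url}/next'>next</a>"
    item.notes.append("fetched")


def link_extractor(item: CrawlItem) -> None:
    # Naive link extraction with a regular expression (illustration only).
    item.links = re.findall(r"href='([^']+)'", item.content)
    item.notes.append("extracted")


def null_writer(item: CrawlItem) -> None:
    # Stand-in for a storage component; only records that it ran.
    item.notes.append("stored")


def run_pipeline(item: CrawlItem, components: List[Component]) -> CrawlItem:
    # Run every configured component in order on the same item.
    for component in components:
        component(item)
    return item
```

Calling run_pipeline(CrawlItem("http://example.org"), [fake_fetcher, link_extractor, null_writer]) runs all three stages in order. Leaving link_extractor out of the list gives a crawl that fetches and stores but discovers no new links, and no other component has to change.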

Heritrix executes recursively, repeating the following main steps: