Heritrix is the Internet Archive's (IA) open-source, extensible, web-scale archiving crawler project.
The Heritrix project began in early 2003, when IA set out to develop a specialized crawler for archiving web resources and building a digital library of the Internet. Over the past six years, IA has accumulated about 400 TB of data.
IA envisions several types of crawler:
Broadband crawler: crawls sites at high bandwidth.
Focused crawler: concentrates on selected topics.
Continuous crawler: crawls current web pages and also revisits pages that are updated later.
Experimental crawler: experiments with crawling techniques to decide what to crawl, and analyzes crawl results for different protocols.
Heritrix's homepage is http://crawler.archive.org
Heritrix is a crawler framework that incorporates various interchangeable components.
Its execution proceeds recursively, primarily involving the following steps: