Preliminary Summary of Using Heritrix

by stilling2006 on 2009-10-01 09:52:13

Before using Heritrix, make sure that the JDK, Eclipse, and the relevant Eclipse plugins are installed on your machine. I initially made the mistake of skipping Eclipse and debugging in JBuilder instead, and that never worked.

1. Installation:

The current version at the time of writing is 1.12.1, and the official website is http://crawler.archive.org/. For a standard installation, unzip the package into a directory of your choice, then set the system environment variable `HERITRIX_HOME` to point to that directory (this assumes the Java environment has already been configured).
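As a quick sanity check after setting the variable, something like the following Python sketch can confirm that `HERITRIX_HOME` is set and points to an existing directory. The function name `resolve_heritrix_home` is made up for this post and is not part of Heritrix.

```python
import os
from pathlib import Path


def resolve_heritrix_home(env=None):
    """Return the directory named by HERITRIX_HOME, or raise if it is unusable.

    `env` is a mapping of environment variables; it defaults to os.environ.
    This helper is hypothetical and only illustrates the check.
    """
    env = os.environ if env is None else env
    value = env.get("HERITRIX_HOME")
    if not value:
        raise RuntimeError("HERITRIX_HOME is not set")
    home = Path(value)
    if not home.is_dir():
        raise RuntimeError(f"HERITRIX_HOME does not point to a directory: {home}")
    return home
```

Calling `resolve_heritrix_home()` in a fresh console shows whether the new environment variable is actually visible to newly started processes.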

2. Post-Installation Steps:

Unzip `%HERITRIX_HOME%\heritrix-1.12.1.jar` to a temporary directory, then copy the `profiles` directory from it to `%HERITRIX_HOME%\conf\`. This step resolves a bug in Heritrix related to the default Profile configuration.
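Since a jar is an ordinary zip archive, this step can also be scripted. The sketch below is one possible way to do it in Python; it extracts the `profiles` entries straight into `conf` instead of going through a temporary directory, and it assumes `profiles/` sits at the top level of the jar. The function name `copy_profiles_from_jar` is hypothetical.

```python
import zipfile
from pathlib import Path


def copy_profiles_from_jar(heritrix_home, jar_name="heritrix-1.12.1.jar"):
    """Extract the jar's profiles/ entries into <heritrix_home>/conf/."""
    home = Path(heritrix_home)
    conf_dir = home / "conf"
    conf_dir.mkdir(exist_ok=True)
    with zipfile.ZipFile(home / jar_name) as jar:
        # Assumption: the profiles directory is at the root of the archive.
        members = [n for n in jar.namelist() if n.startswith("profiles/")]
        if not members:
            raise FileNotFoundError("no profiles/ entries found in " + jar_name)
        # Only the selected members are written; the rest of the jar is ignored.
        jar.extractall(conf_dir, members=members)
    return conf_dir / "profiles"
```

After it runs, the profiles should be found under `%HERITRIX_HOME%\conf\profiles`, the same result as the manual unzip and copy.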

3. Configure Management Account:

Copy `%HERITRIX_HOME%\conf\jmxremote.password.template` to `%HERITRIX_HOME%` and rename the copy to `jmxremote.password`. Then edit the password section of this file, replacing the `@PASSWORD@` placeholder with the password you want to use. For example, change `monitorRole @PASSWORD@` to `monitorRole admin`.
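The copy, rename, and edit can be done in one go. The following is a minimal Python sketch under two assumptions: every `@PASSWORD@` placeholder in the template gets the same password, and the template lives in `conf` as described above. The function name `create_jmx_password_file` is hypothetical.

```python
from pathlib import Path


def create_jmx_password_file(heritrix_home, password):
    """Write <heritrix_home>/jmxremote.password from the template in conf/."""
    home = Path(heritrix_home)
    template = home / "conf" / "jmxremote.password.template"
    target = home / "jmxremote.password"
    text = template.read_text()
    if "@PASSWORD@" not in text:
        raise ValueError("template has no @PASSWORD@ placeholder")
    # Assumption: all roles share one password; edit the file by hand otherwise.
    target.write_text(text.replace("@PASSWORD@", password))
    return target
```

For example, `create_jmx_password_file(r"C:\heritrix", "admin")` (the path is only an illustration) would turn the line `monitorRole @PASSWORD@` into `monitorRole admin`.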