Customization of Heritrix crawler -- Format Filtering

by stilling2006 on 2009-10-01 09:53:07

This is perhaps the one small breakthrough I have to show after studying Heritrix for so long recently.

First, open the Heritrix project in Eclipse. Create a new class under the `my.postprocess` package; you can name it anything. Copy the existing code from `FrontierSchedulerFor163Mobile` into it. Eclipse will report two errors, but they are very simple to fix: just make the class name in the copied code match the name of the new file.

Find the following code:

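```java
// url and caUri come from the surrounding method, which is not shown here.
try {
    // Only URLs that match one of these patterns are considered for scheduling.
    if (url.indexOf("mobile.163.com/0011/product/0011000B/product") != -1
            || url.indexOf("mobile.163.com/0011/product/0011000B/mark") != -1
            || url.endsWith(".gif")
            || url.endsWith(".jpg")
            || url.endsWith(".jpeg")
            || url.indexOf("robots.txt") != -1
            || url.indexOf("dns:") != -1) {
        // URLs that contain a fragment marker are skipped.
        if (url.indexOf("#") == -1) {
            getController().getFrontier().schedule(caUri);
        }
    }
}
// The excerpt stops here; the handler that closes this try block is not shown.
```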
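To experiment with the condition without starting a crawl, the URL test can be pulled out into a plain method. The following is only a minimal sketch and is not part of Heritrix: the class name `UrlFormatFilter`, the method `shouldSchedule`, and the `example.com` URLs are made up for illustration, while the patterns themselves are copied from the code above.

```java
/** Stand-alone sketch of the URL test; class and method names are hypothetical. */
final class UrlFormatFilter {

    // Substrings copied from the indexOf(...) checks in the condition above.
    private static final String[] FRAGMENTS = {
        "mobile.163.com/0011/product/0011000B/product",
        "mobile.163.com/0011/product/0011000B/mark",
        "robots.txt",
        "dns:"
    };

    // Suffixes copied from the endsWith(...) checks in the condition above.
    private static final String[] SUFFIXES = {".gif", ".jpg", ".jpeg"};

    private UrlFormatFilter() {
    }

    /** Returns true when the condition above would hand the URL to the frontier. */
    static boolean shouldSchedule(String url) {
        if (url.indexOf('#') != -1) {
            return false; // URLs containing '#' are never scheduled
        }
        for (String fragment : FRAGMENTS) {
            if (url.contains(fragment)) {
                return true;
            }
        }
        for (String suffix : SUFFIXES) {
            if (url.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, used only to exercise the method.
        String[] samples = {
            "http://mobile.163.com/0011/product/0011000B/product/1.html",
            "http://example.com/logo.gif",
            "http://example.com/page.html",
            "http://example.com/photo.jpg#top"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + shouldSchedule(sample));
        }
    }
}
```

Keeping the patterns in two arrays means that changing which formats are let through only requires editing `SUFFIXES`, and the matching is case-sensitive exactly as in the original condition.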
