HtmlParser does not retrieve all hyperlinks from a web page

by kpkmd54461 on 2012-03-04 17:55:12


I used HtmlParser to retrieve all hyperlinks from the main page of NetEase Blogs (http://blog.163.com/), but encountered a `java.net.ProtocolException`. The following is my code:

```java
public void testGetLinks() throws IOException, ParserException {
    URL url = new URL("http://blog.163.com/");
    URLConnection conn = url.openConnection();
    Parser parser = new Parser(conn);
    parser.setEncoding("gb2312");
    NodeList list = parser.parse(new TagNameFilter("a"));
    System.out.println(list.size());
}
```

The total number of hyperlinks obtained was 211, which is clearly much smaller than the actual value. Later, I saved the source code of http://blog.163.com/ and used UltraEdit to search and count the number of links in the source code, finding that there were actually 897 links. This discrepancy indicates that HtmlParser did not capture all the links!

Could you please help me understand where the problem lies?

...

------ **Solution Reference** --------------------------------------------------------

For reference: How to extract link content and corresponding links from the following string using regular expressions? Thank you.

Link: [http://www.myexception.cn/c-sharp/78793.html](http://www.myexception.cn/c-sharp/78793.html)


---

### Explanation of the Problem:

HtmlParser parses only the static HTML that the server returns through `URLConnection`; it does not execute JavaScript. Links that a page inserts with scripts after loading therefore never appear as `<a>` tags in the parsed document, although a browser shows them and includes them when it saves the rendered page. A mismatched character encoding or a parsing limitation in the library could also contribute to missing links.

To address this issue, consider the following solutions:

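1. **Use a Headless Browser**: Tools like Selenium (driving a headless browser) or HtmlUnit can execute JavaScript and expose the resulting DOM for parsing. Note that Jsoup, like HtmlParser, is a static parser and does not run scripts.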

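2. **Check Encoding**: Ensure the encoding passed to `setEncoding` matches the charset the page actually declares in its `Content-Type` header or `<meta>` tag. The code above assumes `gb2312`.

3. **Regular Expressions**: As the referenced article suggests, use regular expressions to extract links directly from the raw HTML source. This is approximate, since a regular expression cannot handle every form of HTML, but it works as a quick cross-check.

4. **Debugging**: Verify whether the raw HTML retrieved by `URLConnection` contains all the expected links. Count the `<a>` tags in that response, rather than in a page saved from the browser, and compare the result with `list.size()`.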
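
Items 3 and 4 can be combined into a small cross-check. The following is a minimal sketch, not part of HtmlParser: the class `LinkExtractor` and its methods `countAnchorTags` and `extractHrefs` are hypothetical names introduced here for illustration, and the patterns are only an approximation of HTML syntax. It works on a string, so the raw response read from `URLConnection` would need to be decoded with the page's charset first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for cross-checking a parser's link count against the raw HTML.
class LinkExtractor {

    // Opening of an <a> tag: "<a" followed by whitespace or ">", any letter case.
    private static final Pattern ANCHOR_OPEN =
            Pattern.compile("<a[\\s>]", Pattern.CASE_INSENSITIVE);

    // href value of an <a> tag: double-quoted, single-quoted, or unquoted.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a(?:\\s[^>]*?)?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
            Pattern.CASE_INSENSITIVE);

    // Counts opening <a> tags, with or without an href attribute.
    static int countAnchorTags(String html) {
        Matcher m = ANCHOR_OPEN.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns the href values of all <a> tags, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is non-null per match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    hrefs.add(m.group(g));
                    break;
                }
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample input; in practice this would be the decoded raw response body.
        String html = "<a href=\"http://blog.163.com/\">Home</a>"
                + "<A HREF='about.html'>About</A>"
                + "<a name=\"top\"></a>";
        System.out.println("anchor tags: " + countAnchorTags(html));
        System.out.println("hrefs: " + extractHrefs(html));
    }
}
```

If `countAnchorTags` on the raw response is close to the 211 reported by HtmlParser, the remaining links are probably added by scripts or counted differently by the text search (for example `<link>` or `<area>` tags, which this sketch deliberately ignores). If it is close to 897, the parser or the encoding setting is the more likely cause.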
