HtmlParser does not retrieve all hyperlinks from a webpage

by kpkmd54461 on 2012-03-04 17:55:12


I used HtmlParser to retrieve all hyperlinks from the main page of NetEase Blogs (http://blog.163.com/), but encountered a `java.net.ProtocolException`. The following is my code:

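```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public void testGetLinks() throws IOException, ParserException {
    URL url = new URL("http://blog.163.com/");
    URLConnection conn = url.openConnection();
    Parser parser = new Parser(conn);
    parser.setEncoding("gb2312");
    // Collect every <a> tag the parser finds in the page.
    NodeList list = parser.parse(new TagNameFilter("a"));
    System.out.println(list.size());
}
```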

The parser reported 211 hyperlinks, which is far fewer than the page actually contains. I then saved the source of http://blog.163.com/ and searched it in UltraEdit to count the links, which turned up 897. So HtmlParser did not capture all of the links!
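
The manual count in the editor can also be reproduced in code, which makes it easier to compare against the parser's result. The following is only a minimal sketch and not part of the original post: the class name `AnchorCounter` and the method `countAnchors` are hypothetical, and the regular expression is an assumed approximation of an opening `<a>` tag. Note that a raw text search like this also counts anchors that appear inside script strings or HTML comments, which a parser generally does not report as tags, so the two counts will not necessarily agree.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post): counts opening <a> tags
// in raw HTML text, similar to a manual text search in an editor.
class AnchorCounter {
    // Assumed approximation of an opening anchor tag.
    // The \b keeps tags such as <abbr> or <area> from being counted.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static int countAnchors(String html) {
        Matcher m = ANCHOR.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"/x\">x</a>"
                + "<script>document.write('<A href=\"/y\">y</A>');</script>"
                + "<abbr>z</abbr>";
        // The anchor inside the script string is counted as well.
        System.out.println(countAnchors(sample)); // prints 2
    }
}
```

To check a real page, the saved source could be read into a string and passed to `countAnchors`, and the result compared with `list.size()` from the HtmlParser code.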

Could you please help me understand where the problem lies?

...

------ **Solution Reference** --------------------------------------------------------

For reference, see this related thread: "How to extract link content and corresponding links from the following string using regular expressions? Thank you."

Link: [http://www.myexception.cn/c-sharp/78793.html](http://www.myexception.cn/c-sharp/78793.html)