A webpage downloaded with HttpClient comes out garbled. How can I fix it?

by kpkmd54461 on 2012-03-01 21:52:17

I always get garbled characters when I use HttpClient to download the page at news.sohu.com. Please look at the details below and suggest how the code should be changed. I've tried several approaches, but none of them works. Could the server be detecting and restricting HttpClient? Could someone with more experience offer some guidance?

The code I am using is as follows:

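```java
import org.apache.http.client.HttpClient;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HTTP;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class HttpGetDemo {

    private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0";
    private static final String CONTENT_CHARSET = "GBK"; // declared but never used

    public static String httpGet(String url) throws Exception {
        HttpClient httpclient = new DefaultHttpClient();
        httpclient.getParams().setParameter("http.protocol.content-charset", "gb2312");
        try {
            HttpGet httpget = new HttpGet(url);
            httpget.setHeader(HTTP.USER_AGENT, USER_AGENT);
            httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
            ResponseHandler<String> responseHandler = new BasicResponseHandler();
            // First request: the already-decoded String is re-encoded with the
            // platform default charset and then decoded again as gb2312.
            System.out.println(new String(httpclient.execute(httpget, responseHandler).getBytes(), "gb2312"));
            // Second request: the same URL is fetched again for the return value.
            return httpclient.execute(httpget, responseHandler);
        } finally {
            // When the HttpClient instance is no longer needed, shut down the
            // connection manager to release all system resources immediately.
            httpclient.getConnectionManager().shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        httpGet("http://news.sohu.com");
    }
}
```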

The page encoding for news.sohu.com is:

---

### Solution Reference:

For reference, see the following link on garbled characters when downloading web pages:

[How to resolve garbled characters when downloading web pages](http://www.myexception.cn/java-web/14391.html)

---

### Additional Notes:

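1. **Encoding Issue**: The actual encoding of `news.sohu.com` might not match the charset the code assumes (`gb2312`), so check what the server really sends. Note also that in HttpClient 4.x `BasicResponseHandler` decodes the body with the charset declared in the response's `Content-Type` header and falls back to ISO-8859-1 when none is declared. Calling `new String(text.getBytes(), "gb2312")` afterwards re-encodes the text with the platform default charset, which in general cannot restore the original bytes. Decoding the raw response bytes once, with the correct charset, avoids the problem.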

2. **Server Restrictions**: Some servers may detect automated requests and block them based on headers or IP addresses. You can try mimicking a real browser more closely by adding additional headers (e.g., `Accept-Language`, `Referer`) or rotating user agents.

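3. **Debugging Tips**: Use a tool such as Postman or the browser's developer tools to inspect the actual response from the server. The `Content-Type` header (and any `<meta>` charset declaration in the HTML) shows the encoding, and the `Content-Encoding` header shows whether the body is compressed. Adjust the code to match what you find.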
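
To make the encoding point concrete, here is a minimal sketch that uses only the JDK. It assumes you already have the raw response bytes and the `Content-Type` header value (with HttpClient 4.x these could come from `EntityUtils.toByteArray` and the response headers). The class name `CharsetDecodeSketch`, its helper methods, and the 2048-byte probe size are illustrative choices, not part of the original post or of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks a charset for a response body and decodes it once.
class CharsetDecodeSketch {

    // Matches "charset=..." in a Content-Type value or an HTML <meta> tag.
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the text, or the fallback if none is usable. */
    static Charset findCharset(String text, Charset fallback) {
        if (text != null) {
            Matcher m = CHARSET_PATTERN.matcher(text);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    /** Decodes the raw bytes: header charset first, then <meta> charset, then fallback. */
    static String decode(byte[] body, String contentTypeHeader, Charset fallback) {
        Charset fromHeader = findCharset(contentTypeHeader, null);
        if (fromHeader != null) {
            return new String(body, fromHeader);
        }
        // ISO-8859-1 maps every byte to one char, so an ASCII <meta> tag
        // stays readable even if the rest of the page is in another encoding.
        int probeLength = Math.min(body.length, 2048);
        String probe = new String(body, 0, probeLength, StandardCharsets.ISO_8859_1);
        return new String(body, findCharset(probe, fallback));
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String word = "\u65b0\u95fb"; // two Chinese characters, written as escapes
        byte[] body = ("<meta charset=\"GBK\"><title>" + word + "</title>").getBytes(gbk);

        // Decoding with the wrong charset loses the characters.
        String wrong = new String(body, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.contains(word)); // prints false

        // No charset in the header, so the <meta> tag decides.
        String right = decode(body, "text/html", StandardCharsets.ISO_8859_1);
        System.out.println(right.contains(word)); // prints true
    }
}
```

The key design choice is to keep the body as bytes until the charset is known, instead of letting a handler decode it and then trying to repair the resulting string.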

If the problem persists, consider providing more details about the response headers or any specific errors encountered.
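
One response detail worth checking is compression. The original post does not confirm that this server compresses its responses, so treat this as an assumption to verify: if the body arrives gzip-compressed and the client does not decompress it, decoding the bytes as text produces garbage whatever charset is used. The sketch below, again JDK-only and with the hypothetical class name `GzipCheckSketch`, tests for the gzip magic number and decompresses before any charset decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: detects and undoes gzip compression on a response body.
class GzipCheckSketch {

    /** True if the bytes start with the gzip magic number 0x1f 0x8b. */
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Returns the decompressed bytes if the input is gzip data, else the input unchanged. */
    static byte[] gunzipIfNeeded(byte[] data) throws IOException {
        if (!looksGzipped(data)) {
            return data;
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a compressed response body.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        byte[] restored = gunzipIfNeeded(compressed.toByteArray());
        System.out.println(new String(restored, StandardCharsets.UTF_8)); // prints hello
    }
}
```

If the `Content-Encoding` header of the real response does say `gzip`, run the body through a step like `gunzipIfNeeded` first and only then decode it with the page's charset.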