Explanation of Unicode Programming in VC++

by henxue on 2010-07-18 21:20:03

### What is Unicode?

Let's start with ASCII. ASCII is an encoding standard used to represent English characters. Each ASCII character occupies 1 byte, so a single byte can represent at most 256 values (from 00H to FFH). In reality, there aren't that many English characters, and ASCII itself defines only the first 128 (from 00H to 7FH, where the highest bit is 0). These include control characters, numbers, uppercase and lowercase letters, and some other symbols. The additional 128 values (from 80H to FFH), with the highest bit set to 1, are called "Extended ASCII" and are typically used to store box-drawing characters, some phonetic symbols, and other symbols.

This character encoding rule works fine for handling English. However, when it comes to complex scripts like Chinese or Arabic, 256 characters are clearly insufficient.

As a result, various countries have developed their own text encoding standards. For Chinese, the encoding standard is called "GB2312-80," which is compatible with ASCII. Essentially, it takes advantage of the fact that Extended ASCII was never fully standardized and represents one Chinese character with two bytes from the Extended ASCII range (both with the highest bit set), which distinguishes Chinese characters from the ASCII part.

However, this method has problems. The biggest issue is the overlap between Chinese character encoding and Extended ASCII. Many software programs use Extended ASCII's box-drawing characters to draw tables. When these programs run on Chinese systems, pairs of box-drawing bytes may be interpreted as Chinese characters, resulting in garbled text.
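The overlap can be made concrete with a small sketch. The helper names `is_gb2312_pair` and `count_gb2312_chars` are hypothetical, and the byte ranges assume GB2312 in its common two-byte form (lead byte A1H to F7H, trail byte A1H to FEH). It is only meant to show why two box-drawing bytes cannot be told apart from one Chinese character.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if two bytes form a valid GB2312 double-byte character
// (assumed ranges: lead A1H..F7H, trail A1H..FEH).
bool is_gb2312_pair(unsigned char lead, unsigned char trail) {
    return lead >= 0xA1 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}

// Counts characters in a NUL-terminated byte string, treating every valid
// GB2312 pair as one character and every other byte as one character.
std::size_t count_gb2312_chars(const unsigned char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != 0) {
        if (s[1] != 0 && is_gb2312_pair(s[0], s[1])) {
            s += 2;   // two bytes, one Chinese character
        } else {
            s += 1;   // plain single-byte character
        }
        ++count;
    }
    return count;
}
```

In code page 437, the byte C4H is a horizontal box-drawing line. Two such bytes in a row pass `is_gb2312_pair`, so a GB2312-aware system reads a table border as a run of Chinese characters.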

Additionally, since different countries and regions have their own text encoding rules, they often conflict with each other, making information exchange between them very difficult.

To truly solve this problem, we cannot approach it from the perspective of extending ASCII. Instead, we need an entirely new encoding system that considers all languages uniformly, assigning a unique code to each character.

Thus, Unicode was born.

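Unicode is also a character encoding method. As originally designed, it uses 2 bytes per character (from 0000H to FFFFH), accommodating up to 65,536 characters, which covers the characters in everyday use in the world's major languages. (Later versions of the standard extend the code space up to 10FFFFH; in the UTF-16 form used by Windows, characters above FFFFH are stored as a pair of 16-bit units. This article deals only with characters that fit in a single 16-bit unit.)

In Unicode, all characters are treated equally. A Chinese character is no longer "two Extended ASCII characters" but "one Unicode character." This means all characters are handled as single characters, each having a unique Unicode code.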

### Benefits of Using Unicode Encoding

Using Unicode encoding allows your project to support multiple languages simultaneously, making your project internationalized.

Moreover, Windows NT was developed using Unicode, and the entire system is based on it. If you call an API function and pass it an ANSI string (here "ANSI" means the ASCII character set and its compatible derivatives, such as GB2312), the system first converts the string to Unicode and then passes the Unicode string on to the operating system. If the function returns a string to an ANSI caller, the system first converts the Unicode string to ANSI and then returns the result to your application. These conversions cost time and memory. By developing your application with Unicode, you avoid them and the application runs more efficiently.

Below are a few examples of character encodings to simply demonstrate the differences between ANSI and Unicode:

| Character | A | N | 和 |
|-----------|----|----|----|
| ANSI Code | 41H | 4EH | BACDH |
| Unicode Code | 0041H | 004EH | 548CH |
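The Unicode row of the table can be checked against the compiler's own UTF-16 literals. This is a minimal sketch that assumes a C++11 compiler (for `char16_t`); the names `kHeGb2312`, `kHeUtf16`, and `check_table` are made up for illustration. The GB2312 bytes of 和 are written out explicitly in memory order (BAH, then CDH) so the example does not depend on the encoding of the source file.

```cpp
#include <cassert>

// The GB2312 ("ANSI") bytes of 和, in the order they appear in memory.
const unsigned char kHeGb2312[] = { 0xBA, 0xCD };

// The same character as one UTF-16 code unit, written with an escape.
const char16_t kHeUtf16 = u'\u548C';

// Checks the table entries against the compiler's literals.
void check_table() {
    assert('A' == 0x41);             // ANSI: one byte
    assert(u'A' == 0x0041);          // Unicode: one 16-bit unit
    assert(u'N' == 0x004E);
    assert(kHeUtf16 == 0x548C);
    assert(sizeof(kHeGb2312) == 2);  // two bytes in GB2312
    assert(sizeof(kHeUtf16) == 2);   // two bytes in UTF-16 as well
}
```

For ASCII characters the Unicode value is simply the ANSI value widened to 16 bits, while the Chinese character has unrelated values in the two encodings.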

### Using C++ for Unicode Programming

Support for wide characters is actually part of the ANSI C standard, designed to support multi-byte representation of characters. Wide characters and Unicode are not entirely equivalent; Unicode is just one type of wide character encoding.

#### 1. Definition of Wide Characters

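In ANSI, a character (`char`) is one byte long. When using Unicode on Windows, a character occupies one 16-bit word (two bytes). In VC++ 6.0, the most basic wide character type, `wchar_t`, is defined in the `wchar.h` header file:

```cpp
typedef unsigned short wchar_t;
```

From this, we can clearly see that a wide character in VC++ 6.0 is essentially an unsigned short integer. (In standard C++ and in later versions of VC++, `wchar_t` is a built-in type, and on some other platforms it is 32 bits wide.)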

#### 2. Constant Wide Strings

For C++ programmers, constructing string constants is a common task. So how do you construct a constant wide character string? It's simple: just add a capital L before the string constant, like this:

```cpp
const wchar_t *str1 = L"Hello";  // a wide literal is read-only, so point to it with const
```

This L is very important. Only by including it does the compiler know that you want the string stored as one wide character per character. Also, note that there must be no space between the L and the opening quote.
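The effect of the L prefix can be observed directly. The following is a small sketch using only the standard `<cwchar>` header; the names `kGreeting`, `greeting_length`, `greeting_bytes`, and `check_greeting` are illustrative. Note that `sizeof(wchar_t)` is 2 with VC++ but may be 4 with other compilers, so the byte count is expressed in terms of `sizeof(wchar_t)` rather than as a fixed number.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// A wide string literal has type "array of const wchar_t".
const wchar_t* const kGreeting = L"Hello";

// Number of wide characters before the terminating L'\0'.
std::size_t greeting_length() {
    return std::wcslen(kGreeting);
}

// Storage used by the literal in bytes, including the terminator.
std::size_t greeting_bytes() {
    return sizeof(L"Hello");
}

void check_greeting() {
    assert(greeting_length() == 5);
    assert(greeting_bytes() == 6 * sizeof(wchar_t));
}
```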

#### 3. Wide String Library Functions

To manipulate wide strings, C++ has defined a set of functions. For example, the function to get the length of a wide string is:

```cpp
size_t __cdecl wcslen(const wchar_t*);
```

Why are these functions defined separately? The fundamental reason is that an ANSI string is terminated by a single '\0' byte, whereas a wide string is terminated by a wide null character, L'\0' (two zero bytes when a wide character is 16 bits), and many string functions depend on the terminator to work correctly. In the case of wide characters, one character occupies a whole word in memory, which makes it impossible for ANSI string functions to operate on it correctly. Taking the string "Hello" as an example, in wide characters, its five characters are:

```

0x0048 0x0065 0x006c 0x006c 0x006f

```

In memory, the actual arrangement is:

```

48 00 65 00 6c 00 6c 00 6f 00

```

Thus, an ANSI string function like `strlen`, upon encountering the first 00 byte after 48, considers the string ended, so `strlen` returns 1 for this wide string (and for any wide string whose first character has a zero high byte)!
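This behavior is easy to reproduce without any Windows headers. The sketch below stores the byte sequence shown above in a plain `char` array (with a two-byte terminator added) and compares what a byte-oriented function sees with a count of 16-bit units. The names `kHelloUtf16le`, `byte_oriented_length`, and `utf16_length` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Hello" as little-endian 16-bit units, followed by a two-byte terminator.
const char kHelloUtf16le[] = {
    0x48, 0x00, 0x65, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x00, 0x00
};

// What a byte-oriented function sees: it stops at the first zero byte.
std::size_t byte_oriented_length() {
    return std::strlen(kHelloUtf16le);
}

// Counts 16-bit units up to the first unit whose two bytes are both zero.
std::size_t utf16_length(const char* bytes) {
    assert(bytes != nullptr);
    std::size_t n = 0;
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0) {
        ++n;
    }
    return n;
}
```

Here `byte_oriented_length()` yields 1, while `utf16_length(kHelloUtf16le)` yields 5, which is the answer a wide string function such as `wcslen` would give on Windows.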

#### 4. Using Macros to Achieve General Programming for ANSI and Unicode

Clearly, C++ has a complete set of data types and functions to implement Unicode programming, meaning you can completely use C++ to achieve Unicode programming.

If we want our program to have two versions, an ANSI version and a Unicode version, writing two separate sets of code is certainly feasible. However, maintaining two sets of code for ANSI and Unicode characters is very cumbersome. To reduce this burden, the C runtime library that ships with VC++ defines a series of macros (in `tchar.h`) that help you write code that is generic across ANSI and Unicode.

The essence of C++ macros achieving general programming for ANSI and Unicode is based on whether `_UNICODE` (note the underscore) is defined. These macros expand into either ANSI or Unicode characters (strings).

Below is a partial excerpt from the `tchar.h` header file:

```cpp
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#else
typedef char TCHAR;
#define __T(x) x
#endif

// _T is defined for both builds and forwards to __T
#define _T(x) __T(x)
```

It can be seen that these macros expand into either ANSI or Unicode characters based on whether `_UNICODE` is defined. The macros defined in the `tchar.h` header file can be divided into two categories:

A. Macros for defining characters and constant strings:

We list only the two most commonly used macros:

| Macro | `_UNICODE` Not Defined (ANSI Character) | `_UNICODE` Defined (Unicode Character) |
|------------|-----------------------------------------|----------------------------------------|
| `TCHAR` | `char` | `wchar_t` |
| `_T(x)` | `x` | `L##x` |

Note:

`##` is the token-pasting operator of the ANSI C preprocessor. Here it pastes the prefix L onto the front of the macro argument. In other words, when `_UNICODE` is defined, `_T("Hello")` expands to `L"Hello"`.

B. Macros for string function calls:

C++ also defines a series of macros for string functions. Again, we only list a few commonly used ones:

| Macro | `_UNICODE` Not Defined (ANSI Character) | `_UNICODE` Defined (Unicode Character) |
|------------|-----------------------------------------|----------------------------------------|
| `_tcschr` | `strchr` | `wcschr` |
| `_tcscmp` | `strcmp` | `wcscmp` |
| `_tcslen` | `strlen` | `wcslen` |
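The mechanism behind both categories of macros can be imitated in portable C++. The sketch below does not use `tchar.h`; `MY_UNICODE`, `MyChar`, `MY_T`, and `my_strlen` are hypothetical stand-ins for `_UNICODE`, `TCHAR`, `_T`, and `_tcslen`, chosen so the example compiles with any standard compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE.
// Uncomment the next line to build the wide version.
// #define MY_UNICODE

#ifdef MY_UNICODE
typedef wchar_t MyChar;
#define MY_T_IMPL(x) L##x
#define my_strlen std::wcslen
#else
typedef char MyChar;
#define MY_T_IMPL(x) x
#define my_strlen std::strlen
#endif

// The extra level lets a macro argument expand before L is pasted on.
#define MY_T(x) MY_T_IMPL(x)

// Code written once against the generic names.
std::size_t greeting_length() {
    const MyChar* text = MY_T("Hello");
    return my_strlen(text);
}

void check_generic_code() {
    assert(greeting_length() == 5);  // holds in both builds
}
```

The body of `greeting_length` never mentions `char` or `wchar_t`, which is the property the `tchar.h` macros are designed to give real code.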

### Using Win32 API for Unicode Programming

The Win32 API defines some of its own character data types. These data types are defined in the `winnt.h` header file. For example:

```cpp
typedef char CHAR;
typedef unsigned short WCHAR; // wc, 16-bit UNICODE character
typedef CONST CHAR *LPCSTR, *PCSTR;
```

The Win32 API defines some macros in the `winnt.h` header file for implementing character and constant string operations for general ANSI/Unicode programming. Again, we only list a few commonly used ones:

```cpp
#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR;
typedef LPWSTR LPTCH, PTCH;
typedef LPWSTR PTSTR, LPTSTR;
typedef LPCWSTR LPCTSTR;
#define __TEXT(quote) L##quote // r_winnt
#else /* UNICODE */ // r_winnt
typedef char TCHAR, *PTCHAR;
typedef LPSTR LPTCH, PTCH;
typedef LPSTR PTSTR, LPTSTR;
typedef LPCSTR LPCTSTR;
#define __TEXT(quote) quote // r_winnt
#endif /* UNICODE */ // r_winnt
```

From the above header file, we can see that `winnt.h` performs conditional compilation based on whether `UNICODE` (no underscore) is defined.

The Win32 API also defines a set of string functions that expand into either ANSI or Unicode string functions depending on whether `UNICODE` is defined, such as `lstrlen`. The API's string manipulation functions and C++'s string manipulation functions can achieve the same functionality, so if needed, it is recommended to use C++'s string functions without spending too much effort learning the API's string functions.

You may have never noticed, but the Win32 API actually has two versions. One version accepts MBCS strings, while the other accepts Unicode strings. For example, there is no `SetWindowText()` API function; instead, there are `SetWindowTextA()` and `SetWindowTextW()`. The suffix A indicates this is the MBCS function, and the suffix W indicates this is the Unicode version of the function. These API functions are declared in the `winuser.h` header file. Below is an excerpt of the declaration of the `SetWindowText()` function in `winuser.h`:

```cpp
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE
```

It can be seen that the API function points to either the Unicode version or the MBCS version based on whether `UNICODE` is defined.
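The A/W pattern, together with the conversion step described earlier, can be sketched in portable C++. Everything here is hypothetical: `SetTitleA`, `SetTitleW`, `SetTitle`, `CurrentTitle`, and `MY_UNICODE` are made-up names, and the narrow-to-wide conversion only handles 7-bit ASCII, whereas Windows converts according to the current code page.

```cpp
#include <cassert>
#include <string>

// Stand-in for state that the "system" keeps internally as wide text.
std::wstring g_title;

const std::wstring& CurrentTitle() {
    return g_title;
}

// Wide version: stores the text directly.
void SetTitleW(const wchar_t* text) {
    assert(text != nullptr);
    g_title = text;
}

// Narrow version: converts to wide text, then forwards to the wide version.
// Assumption: input is 7-bit ASCII, so each byte maps to one wide character.
void SetTitleA(const char* text) {
    assert(text != nullptr);
    std::wstring wide;
    for (const char* p = text; *p != '\0'; ++p) {
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*p)));
    }
    SetTitleW(wide.c_str());
}

// Hypothetical switch standing in for UNICODE.
#ifdef MY_UNICODE
#define SetTitle SetTitleW
#else
#define SetTitle SetTitleA
#endif
```

The narrow entry point does extra work (allocating and filling a temporary wide string) on every call, which is the overhead a Unicode build avoids.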

Careful readers might have already noticed the difference between `UNICODE` and `_UNICODE`: the former has no underscore and is used by the Windows header files; the latter has an underscore prefix and is used by the C runtime header files. In other words, the C runtime macros (such as `_T` and `_tcslen`) expand into Unicode or ANSI forms based on whether `_UNICODE` (with an underscore) is defined, and the Windows macros (such as `TCHAR` in `winnt.h` and `SetWindowText`) expand based on whether `UNICODE` (without an underscore) is defined.

In practice, we don't strictly distinguish between the two and define both `_UNICODE` and `UNICODE` to achieve Unicode version programming.

### Writing Unicode Encoded Applications in VC++ 6.0

VC++ 6.0 supports Unicode programming, but the default is ANSI. Developers only need to change their coding habits slightly to write applications that support Unicode.

Using VC++ 6.0 for Unicode programming mainly involves the following steps:

1. Add `UNICODE` and `_UNICODE` preprocessing options to the project.

Specific steps: Open the [Project] -> [Settings...] dialog box. As shown in Figure 1, on the C/C++ tab, remove `_MBCS` from "Preprocessor definitions" and add `_UNICODE` and `UNICODE`, separated by commas. After the change, the settings look like Figure 2.

When neither `UNICODE` nor `_UNICODE` is defined, all functions and types default to the ANSI version. After defining `UNICODE` and `_UNICODE`, all MFC classes and Windows APIs become wide-character versions.

2. Set the program entry point.

Since MFC applications have a dedicated entry point for Unicode, we need to set the entry point. Otherwise, a linking error will occur.

To set the entry point: Open [Project] -> [Settings...] dialog box, and in the Link page under the Output category, fill in `wWinMainCRTStartup` in the Entry Point field.

3. Use ANSI/Unicode generic data types.

Microsoft provides some generic data types and macros that work in both ANSI and Unicode builds. The most commonly used are the `_T` macro and the `TCHAR`, `LPTSTR`, and `LPCTSTR` types.

Incidentally, `LPCTSTR` and `const TCHAR*` are completely identical. The L stands for long pointer, which is a legacy from 16-bit operating systems like Windows 3.1. In Win32 and other 32-bit operating systems, long pointers and near/far modifiers are only for compatibility and have no practical significance. P (pointer) indicates that this is a pointer; C (const) indicates that it is a constant; T (_T macro) indicates compatibility with ANSI and Unicode; STR (string) indicates that this variable is a string. Therefore, `LPCTSTR` represents a pointer to a constant string whose semantics can change based on some macro definitions. For example:

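```cpp
const TCHAR* pszText = _T("Hello!");    // pointer to a read-only literal
TCHAR szText[] = _T("I Love You");      // writable array initialized from a literal
LPCTSTR lpszText = _T("你好!");          // same type as const TCHAR*
```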

Function parameters should also change accordingly, for example:

```cpp

MessageBox(_T("你好"));

```

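Note that in an ANSI build the above statement also compiles without the `_T` macro. In a Unicode build, however, `MessageBox` expects a wide string, and passing the plain literal "你好" without `_T` (or an explicit `L` prefix) fails to compile. I therefore recommend always using the `_T` macro, which keeps the code valid in both builds and shows that you are aware of Unicode encoding.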

4. Modify string operation issues.

Some string manipulation functions need to obtain the number of characters (`sizeof(szBuffer)/sizeof(TCHAR)`), while others may need to obtain the number of bytes (`sizeof(szBuffer)`). You should be mindful of this issue and carefully analyze string manipulation functions to ensure correct results.

ANSI operation functions start with `str`, such as `strcpy()`, `strcat()`, `strlen()`.

Unicode operation functions start with `wcs`, such as `wcscpy()`, `wcslen()`.

ANSI/Unicode operation functions start with `_tcs` (C runtime library).

ANSI/Unicode operation functions start with `lstr` (Windows function).

Considering compatibility between ANSI and Unicode, we need to use general string manipulation functions that start with `_tcs` or `lstr`.
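The distinction in step 4 between a count of characters and a count of bytes can be illustrated with a portable sketch. The helpers `element_count` and `copy_wide` are hypothetical and use `wchar_t` directly instead of `TCHAR`, so the example compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Number of elements in a fixed-size array, independent of the element type.
// Equivalent to sizeof(buffer) / sizeof(buffer[0]).
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) {
    return N;
}

// Copies src into a fixed-size wide buffer, truncating if necessary and
// always terminating the result. The limit N is counted in characters.
template <std::size_t N>
void copy_wide(wchar_t (&dest)[N], const wchar_t* src) {
    assert(src != nullptr);
    std::size_t i = 0;
    for (; i + 1 < N && src[i] != L'\0'; ++i) {
        dest[i] = src[i];
    }
    dest[i] = L'\0';
}
```

For a buffer declared as `wchar_t buffer[4]`, `element_count(buffer)` is 4 while `sizeof(buffer)` is `4 * sizeof(wchar_t)`. Passing the byte count to a function that expects a character count would let it write past the end of the buffer.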

### Example of Unicode Programming

Step 1:

Open VC++ 6.0, create a new dialog-based project named `Unicode`. In the main dialog `IDD_UNICODE_DIALOG`, add a button control. Double-click the control and add its response function:

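```cpp
void CUnicodeDlg::OnButton1()
{
    // _T() makes the literal narrow or wide to match the build
    const TCHAR* str1 = _T("ANSI and UNICODE encoding test");
    m_disp = str1;      // m_disp is the CString bound to IDC_DISP
    UpdateData(FALSE);  // copy member variables to the dialog controls
}
```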

Add a static text box `IDC_DISP` and use ClassWizard to add a `CString` type variable `m_disp`. Compile the project using the default ANSI encoding environment to generate `Unicode.exe`.

Step 2:

Open "Control Panel", click "Date, Time, Language, and Regional Options", and in the "Date, Time, Language, and Regional Options" window, continue clicking "Regional and Language Options". In the "Regional and Language Options" dialog box, click the "Advanced" tab and change the "Language for non-Unicode programs" option to "Japanese". Click the "Apply" button, as shown in Figure 4.

Click "Yes" in the pop-up dialog box and restart the computer to make the settings take effect.

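Run `Unicode.exe` and click the "Button1" button. If the test string contains non-ASCII characters such as Chinese text, the static text box now shows garbled text, because the ANSI bytes in the program are interpreted using the Japanese code page.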

Step 3:

Recompile the project in the Unicode encoding environment to generate `Unicode.exe`. Run `Unicode.exe` again and click the "Button1" button. Now you can see the advantage of Unicode encoding.

That's all. Good luck!

Article Address: [Eden Network](http://www.edenw.com/tech/devdeloper/vc/2010-07-18/4786.html)