**MySQL Full-Text Indexing Issues with Chinese (MediaWiki Chinese Search Problem)**
Category: Technology — ssmax @ 15:24:59
Today I dug into the MediaWiki source code, because its Chinese search is not very accurate and I wanted to find out why by looking at how the search mechanism is implemented.
The database used is MySQL, which employs a full-text index table for searching:
```sql
CREATE TABLE `searchindex` (
`si_page` int(10) unsigned NOT NULL,
`si_title` varchar(255) NOT NULL DEFAULT '',
`si_text` mediumtext NOT NULL,
UNIQUE KEY `si_page` (`si_page`),
FULLTEXT KEY `si_title` (`si_title`),
FULLTEXT KEY `si_text` (`si_text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
MySQL's `FULLTEXT` indexing has traditionally had poor support for Chinese: since Chinese text has no spaces between words, an unsegmented UTF-8 string gets indexed as one long "word" and the index is effectively useless. To work around this, MediaWiki uses a clever trick: it converts each UTF-8 character into `U8` followed by the hex of the character's bytes (e.g. `U8e4b8ad` for 中) and separates these tokens with ASCII spaces, so the full-text engine can index them as ordinary words.
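With the text stored in this encoded form, searching is just a boolean-mode `MATCH ... AGAINST` query. Here is a rough illustration of querying the `searchindex` table directly (not MediaWiki's actual code path; the connection parameters are placeholders, and the two sample terms are the encodings of 中 and 文):
```php
<?php
// Illustrative direct query against the searchindex table: a boolean-mode
// full-text search for the encoded characters 中 (u8e4b8ad) and 文 (u8e69687).
// Connection parameters are hypothetical.
$db  = new mysqli('localhost', 'wikiuser', 'secret', 'wikidb');
$sql = "SELECT si_page, si_title
          FROM searchindex
         WHERE MATCH(si_text) AGAINST('+u8e4b8ad +u8e69687' IN BOOLEAN MODE)";

foreach ($db->query($sql) as $row) {
    printf("%d\t%s\n", $row['si_page'], $row['si_title']);
}
```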
Below is the character conversion code from MediaWiki, which is quite useful:
```php
<?php
// Source file: wiki/languages/classes/LanguageZh_cn.php

/**
 * @addtogroup Language
 */
class LanguageZh_cn extends Language {
    function stripForSearch($string) {
        # MySQL full-text index doesn't understand utf-8, so we need to
        # fold cases and convert to hex. We also separate characters as "words".
        if (function_exists('mb_strtolower')) {
            return preg_replace('/([\xc0-\xff][\x80-\xbf]*)/e', "' U8' . bin2hex('$1')", mb_strtolower($string));
        } else {
            // Fallback for PHP builds without mbstring: use MediaWiki's own case maps.
            list(, $wikiLowerChars) = Language::getCaseMaps();
            return preg_replace('/([\xc0-\xff][\x80-\xbf]*)/e', "' U8' . bin2hex(strtr('$1', $wikiLowerChars))", $string);
        }
    }
}
```
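Note that the `/e` (eval) modifier used here was deprecated in PHP 5.5 and removed in PHP 7. On a modern PHP, the same transformation can be written with `preg_replace_callback`; a minimal sketch of my own (not MediaWiki code), with a worked example:
```php
<?php
// Modern equivalent of stripForSearch(), assuming PHP 7+ where the /e
// modifier no longer exists. Matches each multi-byte UTF-8 sequence
// (lead byte 0xC0-0xFF plus continuation bytes 0x80-0xBF) as raw bytes
// and folds it into a space-separated ' U8' + hex token.
function stripForSearch(string $string): string {
    $string = mb_strtolower($string, 'UTF-8');
    return preg_replace_callback(
        '/([\xc0-\xff][\x80-\xbf]*)/',
        function (array $m): string {
            return ' U8' . bin2hex($m[1]);
        },
        $string
    );
}

// 中 is E4 B8 AD and 文 is E6 96 87 in UTF-8:
echo stripForSearch('中文');   // " U8e4b8ad U8e69687"
```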
This code converts each Chinese character into a space-separated `U8`-plus-hex token, allowing MySQL's full-text index to be used. In fact, starting with MySQL 5.0 it is possible to build a full-text index over UTF-8 text directly. However, since MySQL still cannot segment Chinese words, each character would have to be separated by spaces anyway, and the minimum indexed word length (`ft_min_word_len`, which defaults to 4) would have to be lowered and the indexes rebuilt. MediaWiki's approach therefore remains more convenient.
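For comparison, here is a sketch of that native-UTF-8 alternative (my own illustration, not MediaWiki code): make every Han character a standalone space-delimited "word". For MySQL to index single characters, `ft_min_word_len` must be lowered in `my.cnf` (it cannot be changed at runtime, so the server needs a restart and the MyISAM full-text indexes must be rebuilt, e.g. with `REPAIR TABLE searchindex QUICK`):
```php
<?php
// Per-character segmentation so a UTF-8 full-text index (MySQL 5.0+) can
// treat each Chinese character as a word. Assumes ft_min_word_len has been
// lowered from its default of 4 and the indexes rebuilt.
function segmentPerCharacter(string $string): string {
    // \p{Han} with the /u modifier matches CJK ideographs.
    return preg_replace('/(\p{Han})/u', ' $1', $string);
}

echo segmentPerCharacter('中文搜索');   // " 中 文 搜 索"
```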
Since each Chinese character is treated as an independent word, with no requirement that matched characters be adjacent or in order, the results often do not match the way Chinese speakers actually read. To improve accuracy, the source code can be modified to wrap each search phrase in double quotes (MySQL's boolean-mode phrase syntax), yielding more precise results.
Edit `wiki/includes/SearchMySQL4.php` and locate the following code:
```php
if ($this->strictMatching && ($terms[1] == '')) {
    $terms[1] = '+';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch($terms[2]);
```
Modify it as follows:
```php
if ($this->strictMatching && ($terms[1] == '')) {
    // $terms[1] = '+';
    $terms[1] = '+"';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch($terms[2]) . '"';
```
This modification enables more precise searches.
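To see concretely what changes, here is a simulation (using the `stripForSearch()` sketch from above, not MediaWiki's real call path) of the boolean-mode string built for the query 中文:
```php
<?php
// Boolean-mode search strings before and after the patch, assuming the
// stripForSearch() sketch defined earlier in this post.
$stripped = stripForSearch('中文');    // " U8e4b8ad U8e69687"

$before = '+' . $stripped;             // '+ U8e4b8ad U8e69687'
$after  = '+"' . $stripped . '"';      // '+" U8e4b8ad U8e69687"'

// Before the patch, the two encoded characters are loose terms: they can
// match anywhere in a page, in any order. After it, the quotes invoke
// boolean-mode phrase syntax, so only pages containing the two characters
// adjacent and in order, i.e. the literal string 中文, will match.
echo $before, "\n", $after, "\n";
```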
---
The same change, as a diff against `/var/lib/mediawiki/includes/SearchMySQL.php`:
```diff
--- /var/lib/mediawiki/includes/SearchMySQL.php.orig	2011-06-13 09:18:52.000000000 +0800
+++ /var/lib/mediawiki/includes/SearchMySQL.php	2011-06-13 09:11:32.000000000 +0800
@@ -51,9 +51,9 @@
 foreach ($m as $terms) {
     if ($searchon !== '') {
         $searchon .= ' ';
     }
     if ($this->strictMatching && ($terms[1] == '')) {
-        $terms[1] = '+';
+        $terms[1] = '+"';
     }
-    $searchon .= $terms[1] . $wgContLang->stripForSearch($terms[2]);
+    $searchon .= $terms[1] . $wgContLang->stripForSearch($terms[2]) . '"';
     if (!empty($terms[3])) {
         // Match individual terms in result highlighting...
         $regexp = preg_quote($terms[3], '/');
```
With these changes, the search functionality becomes more accurate for Chinese text.