Tutorial: regex breaking Chinese string


When I run this code on some Chinese strings, the ni (你) character (and maybe others) gets chopped off and comes out broken.

$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\s,]+/", $sample);
var_dump($parts);

// outputs
array(4) {
  [0]=>
  string(2) "�"
  [1]=>
  string(9) "不喜欢"
  [2]=>
  string(6) "香蕉"
  [3]=>
  string(3) "吗"
}

// in: 我觉得 你很 麻烦
// out
array(4) {
  [0]=>
  string(9) "我觉得"
  [1]=>
  string(2) "�"
  [2]=>
  string(3) "很"
  [3]=>
  string(6) "麻烦"
}

Is my regex wrong?


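If your string is in UTF-8, you must use the u modifier. Without it, PCRE treats the subject as a sequence of single bytes. 你 is encoded in UTF-8 as the bytes E4 BD A0, and on some builds and locales \s matches the byte 0xA0 (a non-breaking space in Latin-1). That would explain the 2-byte broken string in your output: the split happens after the second byte of 你.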

$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\\s,]+/u", $sample);
var_dump($parts);
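As a rough check of the above, here is a minimal sketch that prints the raw bytes of 你 and then splits with the u modifier. It assumes the script file itself is saved as UTF-8; the variable names are only for illustration. It also checks the return value, since preg_split returns false when the subject is not valid UTF-8 under the u modifier.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals below are UTF-8 bytes.
$sample = "你不喜欢 香蕉 吗";

// Show the raw bytes of 你 (U+4F60). The last byte is 0xA0.
$niBytes = bin2hex("你");
echo $niBytes, "\n";

// With the u modifier, pattern and subject are handled as UTF-8,
// so a multi-byte character is never split in the middle.
$parts = preg_split('/[\s,]+/u', $sample);

if ($parts === false) {
    // Under /u, invalid UTF-8 input makes preg_split fail instead of
    // returning a partial result.
    if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
        echo "Input is not valid UTF-8\n";
    }
} else {
    var_dump($parts);
}
```

If the call returns false with PREG_BAD_UTF8_ERROR, the string is probably in another encoding and the u modifier alone will not help.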

If it's in another encoding, see unicornaddict's answer.


Since the input string is multi-byte, I guess you'll have to use mb_split in place of preg_split.
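A minimal sketch of the mb_split approach, assuming the mbstring extension is loaded and the data is UTF-8 (set mb_regex_encoding to whatever encoding the data actually uses, provided mbstring's regex functions support it). Note that mb_split takes the pattern without delimiters or modifiers.

```php
<?php
// Assumption: the mbstring extension is available and this file is saved as UTF-8.
// Tell the mb_ereg family of functions which encoding the subject uses.
mb_regex_encoding('UTF-8');

$sample = "你不喜欢 香蕉 吗";

// Unlike preg_split, the pattern has no /.../ delimiters.
$parts = mb_split('[\s,]+', $sample);

var_dump($parts);
```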

