
Question:
Assume I have the string "HET1200 text string" and I need to change it to "HET1200 Text String". The encoding is UTF-8.
How can I do that? Currently, I use mb_convert_case($string, MB_CASE_TITLE, "UTF-8");
but that changes "HET1200" to "Het1200".
I could specify exceptions, but the list would never be exhaustive, so I would rather have all-uppercase words remain uppercase.
Thanks :)
Solution 1:
OK, let's try to recreate mb_convert_case as closely as possible, but only changing the first character of every word.
The relevant part of the mb_convert_case implementation is this:
int mode = 0;
for (i = 0; i < unicode_len; i += 4) {
    int res = php_unicode_is_prop(
        BE_ARY_TO_UINT32(&unicode_ptr[i]),
        UC_MN|UC_ME|UC_CF|UC_LM|UC_SK|UC_LU|UC_LL|UC_LT|UC_PO|UC_OS, 0);
    if (mode) {
        if (res) {
            UINT32_TO_BE_ARY(&unicode_ptr[i],
                php_unicode_tolower(BE_ARY_TO_UINT32(&unicode_ptr[i]),
                    _src_encoding TSRMLS_CC));
        } else {
            mode = 0;
        }
    } else {
        if (res) {
            mode = 1;
            UINT32_TO_BE_ARY(&unicode_ptr[i],
                php_unicode_totitle(BE_ARY_TO_UINT32(&unicode_ptr[i]),
                    _src_encoding TSRMLS_CC));
        }
    }
}
Basically, this does the following:
- Set mode to 0. mode determines whether we are at the first character of a word. If it's 0, we are; otherwise, we're not.
- Iterate through the characters of the string.
  - Determine what kind of character it is.
    - Set res to 1 if it's a word character. More specifically, set it to 1 if it has the property "Mark, Non-Spacing", "Mark, Enclosing", "Other, Format", "Letter, Modifier", "Symbol, Modifier", "Letter, Uppercase", "Letter, Lowercase", "Letter, Titlecase", "Punctuation, Other" or "Other, Surrogate". Oddly, "Letter, Other" is not included.
  - If we're not at the beginning of a word
    - If we're at a word character, convert it to lowercase (this is what we don't want).
    - Otherwise, we're not at a word character, and we set mode to 0 to signal we're moving to the beginning of a word.
  - If we're at the beginning of a word and we indeed have a word character
    - Convert this character to title case.
    - Signal we're no longer at the beginning of a word.
The mbstring extension does not seem to expose the character properties. This leaves us with a problem, because we don't have a good way to determine whether a character has any of the 10 properties that mb_convert_case tests for.
Fortunately, Unicode character properties in regular expressions (the \p{..} escapes, available with the /u modifier) can save us here.
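As a quick illustration of the idea, here is a minimal sketch of a per-character property test. The helper name is_word_char is made up for this example, and the property list is my reading of the flags in the C excerpt above; "Other, Surrogate" (Cs) is left out on the assumption that the input is valid UTF-8, where surrogates cannot occur.

```php
<?php
// Hypothetical helper (not part of mbstring): reports whether a single
// UTF-8 character has one of the general categories tested by the C loop.
// Assumption: $char holds exactly one character of valid UTF-8.
function is_word_char(string $char): bool
{
    // Mn, Me, Cf, Lm, Sk, Lu, Ll, Lt, Po, combined into one character class.
    return preg_match(
        '/[\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}\p{Lu}\p{Ll}\p{Lt}\p{Po}]/u',
        $char
    ) === 1;
}

var_dump(is_word_char('H')); // true: Letter, Uppercase
var_dump(is_word_char('é')); // true: Letter, Lowercase
var_dump(is_word_char('!')); // true: Punctuation, Other
var_dump(is_word_char('1')); // false: Number, Decimal Digit
var_dump(is_word_char(' ')); // false: Separator, Space
```

Note that digits are not word characters under this test, which is why "1200" ends the word "HET" in the question's example.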
A faithful reproduction of mb_convert_case without the problematic conversion to lowercase becomes:
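function mb_convert_case_utf8_variation($s) {
    // Split the UTF-8 string into an array of single characters.
    $arr = preg_split("//u", $s, -1, PREG_SPLIT_NO_EMPTY);
    $result = "";
    $mode = false; // true while we are inside a word
    foreach ($arr as $char) {
        // The same ten properties as the C code:
        // Mn, Me, Cf, Lm, Sk, Lu, Ll, Lt, Po, Cs.
        $res = preg_match(
            '/\\p{Mn}|\\p{Me}|\\p{Cf}|\\p{Lm}|\\p{Sk}|\\p{Lu}|\\p{Ll}|'.
            '\\p{Lt}|\\p{Po}|\\p{Cs}/u', $char) == 1;
        if ($mode) {
            if (!$res)
                $mode = false; // left the word; note there is no lowercasing here
        } elseif ($res) {
            // First character of a word: convert it to title case.
            $mode = true;
            $char = mb_convert_case($char, MB_CASE_TITLE, "UTF-8");
        }
        $result .= $char;
    }
    return $result;
}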
Test:
echo mb_convert_case_utf8_variation("HETÁ1200 Ááxt ítring uii");
gives:
HETÁ1200 Ááxt Ítring Uii
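If you prefer something more compact, the same state machine can be written as a single regex replacement: a character gets title-cased exactly when it is a word character that is not preceded by another word character. This is only a sketch and not part of the original approach. The function name title_case_keep_upper is made up, Cs is omitted on the assumption of valid UTF-8 input, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical compact variant: title-case the first character of each
// word and leave every other character untouched.
function title_case_keep_upper(string $s): string
{
    // General categories treated as "word characters" (Cs omitted).
    $word = '\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}\p{Lu}\p{Ll}\p{Lt}\p{Po}';

    // A word character with no word character directly before it.
    $pattern = '/(?<![' . $word . '])[' . $word . ']/u';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            return mb_convert_case($m[0], MB_CASE_TITLE, 'UTF-8');
        },
        $s
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8);
    // fall back to the unmodified input in that case.
    return $result ?? $s;
}

echo title_case_keep_upper("HET1200 text string"), "\n"; // HET1200 Text String
```

The character-by-character loop is easier to compare against the C source, while this version avoids splitting the string into an array first.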