Ticket #485 (fixed bug)
Four-byte UTF-8 characters truncate text at that character
- Created: 2011-09-03 16:23:04
- Reported by: Otomatic
- Assigned to: Franz
- Milestone: 1.4.8
- Component: parser
- Priority: normal
In a post, any Unicode character whose UTF-8 representation consists of four bytes truncates the text at that character: everything after it is deleted. The preview of the post is not truncated.
For example, Unicode characters:
- U+10000 LINEAR B SYLLABLE B008 A
Entity: 𐀀
UTF-8: 0xF0 90 80 80
- U+1F601 GRINNING FACE WITH SMILING EYES
Entity: 😁
UTF-8: 0xF0 9F 98 81
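As a quick illustration of what "four bytes" means for the two characters above, the following sketch writes them as raw UTF-8 bytes and inspects them. The variable names are only illustrative, and the three-byte euro sign is added purely for comparison.

```php
<?php
// The two characters from the report, written as raw UTF-8 bytes so the
// snippet does not depend on the encoding of the source file.
$linear_b = "\xF0\x90\x80\x80"; // U+10000
$grinning = "\xF0\x9F\x98\x81"; // U+1F601
$euro     = "\xE2\x82\xAC";     // U+20AC, a three-byte character for comparison

// strlen() counts bytes, not characters.
echo strlen($linear_b), "\n"; // 4
echo strlen($grinning), "\n"; // 4
echo strlen($euro), "\n";     // 3

// The lead byte of a four-byte sequence has the bit pattern 11110xxx.
var_dump((ord($linear_b[0]) & 0xF8) === 0xF0); // bool(true)
var_dump((ord($euro[0]) & 0xF8) === 0xF0);     // bool(false)
```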
History
Otomatic 2011-09-04 08:23:07

I think the problem is MySQL's utf8 charset, which accepts UTF-8 characters of up to three bytes but does not accept four-byte characters.
http://dev.mysql.com/doc/refman/5.5/en/ … f8mb4.html
To solve this problem, I think the best way is to replace any four-byte character with a three-byte "unknown" character.
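A minimal sketch of that idea, assuming the "unknown" character meant here is U+FFFD REPLACEMENT CHARACTER (which is three bytes in UTF-8). The function name is hypothetical and not part of FluxBB or its utf8 library.

```php
<?php
// Hypothetical helper: replace every code point above U+FFFF (the ones that
// need four bytes in UTF-8) with U+FFFD, written here as raw bytes.
function replace_four_byte_chars($str)
{
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    // preg_replace() returns null when $str is not valid UTF-8; this sketch
    // then hands back the input unchanged, which a real fix would need to
    // handle separately.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $str);
    return ($result === null) ? $str : $result;
}

// Usage: the four-byte character becomes a single three-byte U+FFFD.
echo bin2hex(replace_four_byte_chars("a\xF0\x9F\x98\x81b")), "\n"; // 61efbfbd62
```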
Franz 2011-12-02 09:50:38

Does that mean we would only have to truncate 4-byte characters for MySQL?
Otomatic 2011-12-02 14:23:53

You have to use functions from include/utf8/utils/bad.php, such as utf8_bad_replace, before inserting strings into the database, but it seems that this function does not work correctly and does not replace 4-byte characters.
Franz 2011-12-14 15:31:43

WordPress has the same problem and somebody wrote a patch for escaping those characters before storing them in the database and unescaping on retrieval.
That solution is a little big for such a small problem; I think we should just replace the characters.
Nonetheless, this WP ticket (along with the attached patch) might help us find a good solution.
Otomatic 2011-12-14 15:48:57

I also think that the best solution is to replace four-byte UTF-8 characters with a ? or a space.
It is possible to accept utf8mb4, but only with MySQL version 5.5.3 or later.
I wrote an article for the French FluxBB help site (sorry, only in French) here: http://fluxbb.fr/aide/doku.php?id=mysql … tre_octets
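The version constraint above could be checked along these lines. This is only a sketch: the helper name is hypothetical, it works on a version string passed in by the caller rather than on a live connection, and it ignores the fact that tables and columns would also have to use utf8mb4 before four-byte characters can be stored.

```php
<?php
// Hypothetical helper: choose a connection charset from a MySQL server
// version string, using the 5.5.3 threshold mentioned in this ticket.
function pick_mysql_charset($server_version)
{
    // Version strings may carry suffixes such as "5.5.8-log"; keep only the
    // leading numeric part before comparing.
    if (!preg_match('/^\d+\.\d+\.\d+/', $server_version, $m))
    {
        return 'utf8';
    }
    return version_compare($m[0], '5.5.3', '>=') ? 'utf8mb4' : 'utf8';
}

echo pick_mysql_charset('5.1.73'), "\n";    // utf8
echo pick_mysql_charset('5.5.8-log'), "\n"; // utf8mb4
```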
Franz 2011-12-14 15:53:00

Frank (FSX) told me he's working on it. Let's see if we can port it to FluxBB's (relatively) old version of the utf8 library.
Franz 2011-12-14 15:55:15

- Owner set to Franz.
In fact, if somebody can tell me how I can recognize four-byte characters in PHP, I would happily fix our implementation myself.
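One possible way to recognize four-byte characters, sketched at the byte level: a four-byte UTF-8 sequence starts with a lead byte in the range 0xF0 to 0xF4, followed by three continuation bytes in the range 0x80 to 0xBF. The function name is hypothetical, and the check does not validate the rest of the string.

```php
<?php
// Hypothetical helper: report whether a string contains at least one
// four-byte UTF-8 sequence. The pattern is matched byte by byte (no /u
// modifier), so it also works on input that is not fully valid UTF-8.
function contains_four_byte_chars($str)
{
    return preg_match('/[\xF0-\xF4][\x80-\xBF]{3}/', $str) === 1;
}

var_dump(contains_four_byte_chars("\xF0\x90\x80\x80")); // bool(true)
var_dump(contains_four_byte_chars("\xE2\x82\xAC"));     // bool(false)
```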
Otomatic 2011-12-14 17:32:53

Hi,
For FluxBB 1.4.7, in post.php, just after:
$now = time();
add:
//modif oto - Replace bad character by '?' including utf8 four bytes
$message = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
    '|[\x00-\x7F][\x80-\xBF]+'.
    '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
    '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
    '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
    '?', $message );
$message = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
    '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $message );
//End modif oto
Franz 2011-12-14 21:56:13

What about applying this function?
//
// Replace four-byte characters with a question mark
//
// As MySQL cannot properly handle four-byte characters with the default utf-8
// charset up until version 5.5.3 (where a special charset has to be used), they
// need to be replaced, by question marks in this case.
//
function strip_bad_multibyte_chars($str)
{
    $result = '';
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++)
    {
        // Replace four-byte characters (11110www 10zzzzzz 10yyyyyy 10xxxxxx)
        if ((ord($str[$i]) & 0xF8) == 0xF0)
        {
            $result .= '?';
            // Skip the three continuation bytes that follow the lead byte
            $i += 3;
        }
        else
        {
            $result .= $str[$i];
        }
    }
    return $result;
}
Franz 2011-12-14 22:00:16

And a version that is a little easier to read, inspired by this excellent article:
//
// Replace four-byte characters with a question mark
//
// As MySQL cannot properly handle four-byte characters with the default utf-8
// charset up until version 5.5.3 (where a special charset has to be used), they
// need to be replaced, by question marks in this case.
//
function strip_bad_multibyte_chars($str)
{
    $result = '';
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++)
    {
        // Replace four-byte characters (11110www 10zzzzzz 10yyyyyy 10xxxxxx)
        $ord = ord($str[$i]);
        if ($ord >= 240 && $ord <= 244)
        {
            $result .= '?';
            $i += 3;
        }
        else
        {
            $result .= $str[$i];
        }
    }
    return $result;
}
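One thing to note about the loop above: after a lead byte it always skips the next three bytes, whatever they are. On malformed input, where a lead byte is not followed by three continuation bytes, that would also drop ordinary characters. The following is a sketch of a more defensive variant, under a hypothetical name, that consumes only real continuation bytes (10xxxxxx). It is not the code that was committed for this ticket.

```php
<?php
// Hypothetical variant: replace four-byte characters with '?', but skip
// only bytes that really are continuation bytes.
function strip_four_byte_chars_checked($str)
{
    $result = '';
    $length = strlen($str);
    $i = 0;
    while ($i < $length)
    {
        $ord = ord($str[$i]);
        if ($ord >= 240 && $ord <= 244)
        {
            $result .= '?';
            $i++;
            // Consume at most three continuation bytes (10xxxxxx).
            $skipped = 0;
            while ($skipped < 3 && $i < $length && (ord($str[$i]) & 0xC0) === 0x80)
            {
                $i++;
                $skipped++;
            }
        }
        else
        {
            $result .= $str[$i];
            $i++;
        }
    }
    return $result;
}

echo strip_four_byte_chars_checked("a\xF0\x9F\x98\x81b"), "\n"; // a?b
echo strip_four_byte_chars_checked("\xF0abc"), "\n";            // ?abc
```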
Franz 2011-12-19 15:28:35

@Otomatic: could you test my version?
Also, we probably need to apply this to more inputs than just the post contents.
Otomatic 2011-12-20 09:39:44

Hi,
I tested the second version in post.php, and four-byte UTF-8 characters are replaced by '?'.
It works!
Franz 2011-12-20 23:41:29

- Status changed from open to fixed.
Should be fixed now.
I guess we don't necessarily need to do this in other places. It is just very annoying when posts get truncated; I can live with the rest.