Ticket #485 (fixed bug)

Four-byte UTF-8 character truncates text at that character

  • Created: 2011-09-03 16:23:04
  • Reported by: Otomatic
  • Assigned to: Franz
  • Milestone: 1.4.8
  • Component: parser
  • Priority: normal

In a post, any Unicode character whose UTF-8 representation consists of four bytes truncates the text at that character. Everything after that character is deleted. The post preview is not truncated.
For example, these Unicode characters:
- U+10000 LINEAR B SYLLABLE B008 A
  Entity: &#65536;
  UTF-8: 0xF0 90 80 80
- U+1F601 GRINNING FACE WITH SMILING EYES
  Entity: &#128513;
  UTF-8: 0xF0 9F 98 81
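
As a rough illustration of the byte sequences listed above, the following PHP sketch finds the first four-byte lead byte in a string. The helper name `four_byte_lead_offset` is hypothetical and not part of FluxBB, and the sketch assumes the input is otherwise valid UTF-8.

```php
<?php
// Hypothetical helper (not part of FluxBB): returns the byte offset of the
// first four-byte UTF-8 lead byte in $str, or -1 if there is none.
// Assumes $str is otherwise valid UTF-8.
function four_byte_lead_offset($str)
{
	$length = strlen($str); // strlen() counts bytes, not characters

	for ($i = 0; $i < $length; $i++)
	{
		// Lead bytes of four-byte sequences have the bit pattern 11110xxx
		if ((ord($str[$i]) & 0xF8) === 0xF0)
			return $i;
	}

	return -1;
}

// Byte sequences taken from the ticket description
$linear_b = "\xF0\x90\x80\x80"; // U+10000
$grinning = "\xF0\x9F\x98\x81"; // U+1F601

$offset = four_byte_lead_offset('abc'.$grinning.'def');
```

Here `$offset` is 3, the byte position of the character at which the stored post was reported to be cut off.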

History

Otomatic 2011-09-04 08:23:07

I think the problem is that MySQL's utf8 charset accepts UTF-8 characters of up to three bytes and does not accept four-byte UTF-8 characters.
http://dev.mysql.com/doc/refman/5.5/en/ … f8mb4.html
To solve this problem, I think the best way is to replace every four-byte character with a three-byte "unknown" character.
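
A minimal sketch of that idea, assuming the "unknown" character is U+FFFD REPLACEMENT CHARACTER (three bytes in UTF-8: EF BF BD) and that the PCRE extension is built with UTF-8 support. The function name `replace_four_byte_chars` is hypothetical and is not the fix that was eventually committed for this ticket.

```php
<?php
// Hypothetical helper: replaces every code point above U+FFFF (the ones that
// need four bytes in UTF-8) with $replacement, U+FFFD by default.
function replace_four_byte_chars($str, $replacement = "\xEF\xBF\xBD")
{
	// The /u modifier makes PCRE treat the pattern and the subject as UTF-8
	$result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $str);

	// With /u, preg_replace() returns null when the subject is not valid
	// UTF-8; in that case this sketch hands the input back unchanged.
	return $result === null ? $str : $result;
}

$cleaned = replace_four_byte_chars("a\xF0\x9F\x98\x81b");
```

`$cleaned` then holds `a`, the three bytes EF BF BD, and `b`, which the three-byte utf8 charset can store.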

Reines 2011-09-09 15:25:34

  • Milestone set to 1.4.7.

Reines 2011-09-13 19:37:41

  • Milestone changed from 1.4.7 to 1.4.8.

Franz 2011-12-02 09:50:38

Does that mean we would only have to truncate 4-byte characters for MySQL?

Otomatic 2011-12-02 14:23:53

You have to use the functions in include/utf8/utils/bad.php, such as utf8_bad_replace, before inserting strings into the database, but it seems that this function does not work correctly and does not replace 4-byte characters.

Franz 2011-12-14 15:31:43

WordPress has the same problem and somebody wrote a patch for escaping those characters before storing them in the database and unescaping on retrieval.

That solution is a little heavy for such a small problem; I think we should just replace the characters.

Nonetheless, this WP ticket (along with the attached patch) might help us find a good solution.

Otomatic 2011-12-14 15:48:57

I also think that the best solution is to replace four-byte UTF-8 characters with a ? or a space.
It is possible to accept utf8mb4, but only with MySQL version 5.5.3 or later.
I wrote an article for the French FluxBB help pages (sorry, only in French) here: http://fluxbb.fr/aide/doku.php?id=mysql … tre_octets
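
The version requirement could be checked along these lines. This is only a sketch: `mysql_supports_utf8mb4` is a hypothetical helper, and it assumes the server version string starts with a plain `major.minor.patch` number, which may not hold for every MySQL-compatible server.

```php
<?php
// Hypothetical helper: true if the given server version string is at least
// 5.5.3, the first MySQL release that offers the utf8mb4 charset.
function mysql_supports_utf8mb4($server_version)
{
	// Version strings can carry suffixes such as "5.5.3-m3-log";
	// keep only the leading numeric part.
	if (!preg_match('/^\d+\.\d+\.\d+/', $server_version, $matches))
		return false;

	return version_compare($matches[0], '5.5.3', '>=');
}

$can_use_utf8mb4 = mysql_supports_utf8mb4('5.1.73-log'); // false for this string
```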

Franz 2011-12-14 15:53:00

Frank (FSX) told me he's working on it, let's see if we can port it to FluxBB's (relatively) old version of the utf8 library.

Franz 2011-12-14 15:55:15

  • Owner set to Franz.

In fact, if somebody can tell me how I can recognize four-byte characters in PHP, I would happily fix our implementation myself. :)

Otomatic 2011-12-14 17:32:53

Hi,

For FluxBB 1.4.7, in post.php, just after :

   $now = time();

add :

//modif oto - Replace bad character by '?' including utf8 four bytes
$message = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
'|[\x00-\x7F][\x80-\xBF]+'.
'|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
'|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
'|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
'?', $message );
$message = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
'|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $message );
//End modif oto

Franz 2011-12-14 21:30:39

Any way to combine that with the utf8_bad_replace() function?

Franz 2011-12-14 21:56:13

What about applying this function?

//
// Replace four-byte characters with a question mark
//
// As MySQL cannot properly handle four-byte characters with the default utf-8
// charset up until version 5.5.3 (where a special charset has to be used), they
// need to be replaced, by question marks in this case. 
//
function strip_bad_multibyte_chars($str)
{
	$result = '';
	$length = strlen($str);
	
	for ($i = 0; $i < $length; $i++)
	{
		// Replace four-byte characters (11110www 10zzzzzz 10yyyyyy 10xxxxxx)
		if ((ord($str[$i]) & 0xF8) == 0xF0)
		{
			$result .= '?';
			$i += 3;
		}
		else
		{
			$result .= $str[$i];
		}
	}
	
	return $result;
}

Franz 2011-12-14 22:00:16

And a version that is a little easier to read, inspired by this excellent article:

//
// Replace four-byte characters with a question mark
//
// As MySQL cannot properly handle four-byte characters with the default utf-8
// charset up until version 5.5.3 (where a special charset has to be used), they
// need to be replaced, by question marks in this case. 
//
function strip_bad_multibyte_chars($str)
{
	$result = '';
	$length = strlen($str);
	
	for ($i = 0; $i < $length; $i++)
	{
		// Replace four-byte characters (11110www 10zzzzzz 10yyyyyy 10xxxxxx)
		$ord = ord($str[$i]);
		if ($ord >= 240 && $ord <= 244)
		{
			$result .= '?';
			$i += 3;
		}
		else
		{
			$result .= $str[$i];
		}
	}
	
	return $result;
}

Franz 2011-12-19 15:28:35

@Otomatic: could you test my version?

Also, we probably need to apply this to more inputs than just the post contents.
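
One way this might look, as a sketch only: the field names below are made up for illustration, and the function body is a copy of the second version posted above so that the example runs on its own.

```php
<?php
// Copy of the second version from this ticket, repeated here so the
// sketch is self-contained.
function strip_bad_multibyte_chars($str)
{
	$result = '';
	$length = strlen($str);

	for ($i = 0; $i < $length; $i++)
	{
		// Lead bytes 0xF0-0xF4 start a four-byte sequence
		$ord = ord($str[$i]);
		if ($ord >= 240 && $ord <= 244)
		{
			$result .= '?';
			$i += 3; // skip the three continuation bytes
		}
		else
		{
			$result .= $str[$i];
		}
	}

	return $result;
}

// Hypothetical form fields; the real input names may differ
$input = array(
	'subject'  => "Hi \xF0\x9F\x98\x81",
	'message'  => "Fine \xF0\x90\x80\x80 text",
	'username' => 'plain',
);

// array_map() with a single array keeps the string keys
$clean = array_map('strip_bad_multibyte_chars', $input);
```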

Otomatic 2011-12-20 09:39:44

Hi,

I tested the second version in post.php, and the four-byte UTF-8 characters are replaced by '?'.
It works!

Franz 2011-12-20 23:34:09

Commit 3721201 to fluxbb fluxbb-1.4

#485: Add function strip_bad_multibyte_chars().

Franz 2011-12-20 23:36:38

Commit 0d35b37 to fluxbb fluxbb-1.4

#485: Replace four-byte characters in posts (as MySQL cannot handle them properly).

Franz 2011-12-20 23:38:52

Commit a39c2ef to fluxbb fluxbb-1.4

Replace four-byte characters when editing posts, too.

Related to #485.

Franz 2011-12-20 23:41:29

  • Status changed from open to fixed.

Should be fixed now.

I guess we don't necessarily need to do this in other places. It is just very annoying when posts get truncated; I can live with the rest.