Ticket #483 (open bug)
Improve performance of character validation
- Created: 2011-08-22 15:22:34
- Reported by: Pierre
- Assigned to: None
- Milestone: 2.0-alpha3
- Component: parser
- Priority: normal
I have noticed some timeouts (script was killed after 30s) triggered by posting.
The hot spot seems to be the utf8_bad_strip() function in bad.php. This function and all functions implemented in a similar way are very inefficient.
This knowledge could be used for a simple DOS attack and reduces overall performance in general.
I have uploaded a simple test including a solution to this problem to https://users.archlinux.de/~pierre/tmp/ … st.php.txt
Running the test with utf8_bad_strip takes more than 3 minutes while utf8_bad_strip_improved can do the same in 2 seconds.
The same technique should be used in other utf8 functions as well.
Here is a copy of the improved function:
function utf8_bad_strip_improved($str)
{
$UTF8_GOOD =
'[\x00-\x7F]'. # ASCII (including control chars)
'|[\xC2-\xDF][\x80-\xBF]'. # Non-overlong 2-byte
'|\xE0[\xA0-\xBF][\x80-\xBF]'. # Excluding overlongs
'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'. # Straight 3-byte
'|\xED[\x80-\x9F][\x80-\xBF]'. # Excluding surrogates
'|\xF0[\x90-\xBF][\x80-\xBF]{2}'. # Planes 1-3
'|[\xF1-\xF3][\x80-\xBF]{3}'. # Planes 4-15
'|\xF4[\x80-\x8F][\x80-\xBF]{2}'; # Plane 16
$result = '';
for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
{
preg_match_all('%(?:'.$UTF8_GOOD.')+%',
mb_substr($str, $i, 1000, 'UTF-8'),
$clean_pieces);
$result .= implode('', $clean_pieces[0] );
}
return $result;
}
Note that the input string needs to be split in several peaces as pcre cannot handle large input and tends to crash (depending on the version).
History
Visman 2011-10-16 12:13:06

...
for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
...
mb_substr($str, $i, 1000, 'UTF-8'),
...
?
Reines 2011-10-16 12:19:20

Just to note, the problem with this solution is it requires mbstring extension, which isn't enabled by default in a stock PHP install.
Visman 2011-10-16 13:33:34

function utf8_bad_strip_improved($str) { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
$ret = '';
$i = 0;
$s = strlen($str);
for (;$i < $s;) {
$tmp = $str{$i++};
$ch = ord($tmp);
if ($ch > 0x7F) {
if ($ch < 0xC0) continue;
elseif ($ch < 0xE0) $di = 1;
elseif ($ch < 0xF0) $di = 2;
elseif ($ch < 0xF8) $di = 3;
elseif ($ch < 0xFC) $di = 4;
elseif ($ch < 0xFE) $di = 5;
else continue;
for ($j = 0;$j < $di;$j++) {
$tmp .= $ch = $str{$i + $j};
$ch = ord($ch);
if ($ch < 0x80 || $ch > 0xBF) continue 2;
}
$i += $di;
}
$ret .= $tmp;
}
return $ret;
}
Visman 2011-10-16 13:50:08

Test:
Post text 661 200 bytes
utf8_bad_strip() - script was killed after 30s
my variant utf8_bad_strip_improved() - [ Generated in 3.587 seconds, 7 queries executed - Memory usage: 15.75 MiB (Peak: 18.8 MiB) ]
Reines 2011-12-22 20:19:22

So I did some work on this today...
@Pierre: While your solution is much more efficient than the current one, it relies on the mb_string extension, which we cannot assume is available. The native utf8_substr function in the utf8 library fails, as it fails when presented with invalid unicode. I'm also seeing segfaults (?!) when attempting to use that function on a non-invalid ASCII-only string.
@Visman: Where did that function come from? I don't see any direct reference to it on the linked wikipedia article. It appears to work much better, and doesn't rely on any extensions.
In the below "improved" is Pierre's solution, "improved (compatible)" is Pierre's solution using utf8 library instead of mb_string, and "improved (test)" is Visman's solution.
19:35:59 jamie@penguin ~ php test.php
Original took 499.747825 seconds
Improved took 3.285618 seconds, result passed
Improved (compatible) took 5.084029 seconds, result failed
Improved (test) took 0.702509 seconds, result passed
Here's a slightly tidied up version, but I don't want to commit anything yet if the licensing is going to be an issue...
function utf8_bad_strip_improved_test($str, $replace = '') { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
$result = '';
$strlen = strlen($str);
for ($i = 0; $i < $strlen;) {
$tmp = $str[$i++];
$ch = ord($tmp);
if ($ch > 0x7F) {
if ($ch < 0xC0) {
$result .= $replace;
continue;
}
else if ($ch < 0xE0) $di = 1;
else if ($ch < 0xF0) $di = 2;
else if ($ch < 0xF8) $di = 3;
else if ($ch < 0xFC) $di = 4;
else if ($ch < 0xFE) $di = 5;
else {
$result .= $replace;
continue;
}
for ($j = 0; $j < $di; $j++) {
$tmp .= $ch = $str[$i + $j];
$ch = ord($ch);
if ($ch < 0x80 || $ch > 0xBF) {
$result .= $replace;
continue 2;
}
}
$i += $di;
}
$result .= $tmp;
}
return $result;
}

Here's a version without the replace argument. It also passes the tests:
function badClean($str)
{
$UTF8_GOOD =
'[\x00-\x7F]'. # ASCII (including control chars)
'|[\xC2-\xDF][\x80-\xBF]'. # Non-overlong 2-byte
'|\xE0[\xA0-\xBF][\x80-\xBF]'. # Excluding overlongs
'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'. # Straight 3-byte
'|\xED[\x80-\x9F][\x80-\xBF]'. # Excluding surrogates
'|\xF0[\x90-\xBF][\x80-\xBF]{2}'. # Planes 1-3
'|[\xF1-\xF3][\x80-\xBF]{3}'. # Planes 4-15
'|\xF4[\x80-\x8F][\x80-\xBF]{2}'; # Plane 16
preg_match_all('%(?:'.$UTF8_GOOD.')+%', $str, $clean_pieces);
return implode('', $clean_pieces[0]);
}
I was also thinking about mb_strlen. It doesn't count invalid characters so chopping the text in pieces and process it piece by piece is not going to work properly, because you aren't taking the invalid characters into account.
Reines 2011-12-22 22:16:26

I'm also getting a segfault on that when giving it a large clean ASCII string.
Also it's memory consumption is insane - using a string of size 20,000,000 it is uses 2.6Gb memory and takes 14 seconds. The modified one Visman posted takes 14 seconds and uses 30Mb memory.

I don't think you will find direct reference to it on the linked wikipedia article. Googled "SiMM, based on ru.wikipedia.org/wiki/Unicode" and found following postings by SiMM:
http://phpclub.ru/talk/threads/iconv-%D … f-8.24038/
http://forum.ru-board.com/topic.cgi?for … 0&start=20
Reines 2011-12-31 00:18:44

- Status changed from open to fixed.
I've implemented a version of SiMM's code from scratch and commit it. I'll pull it to fluxbb.org now. Can someone please give it a look over and testing when they get a chance?
Cheers.

- Status changed from fixed to open.
I run the unit test on this function and it fails on the tests that include 0xFC, 0xA1 and 0xF8. It doesn't filter them out.
Maybe this document contains useful information: https://www.securecoding.cert.org/confl … ted+Issues
I'll read through it later. Need to sleep.
Franz 2012-05-03 11:53:57

Nicholas Grekas, creator of Patchwork/Utf8 has suggested trying this:
iconv('UTF-8', 'UTF-8//IGNORE', $s);

i tought of it but tought it was the same as issue with mb_string extension ( but that could be used and the current solution ( or solution solved by fsx ) used as fallback )
Franz 2012-10-05 12:35:38

According to the documentation, iconv() is too buggy to be used for this. (Read the first few comments on that page.)
Any news, guys?
Studio384 2012-11-01 10:25:43

Well actualy, it was ment to be fixed in the 1.4 branch, but since 1.4.7 its moved to 1.4.8, 1.4.9 and now 1.5.1. So: nop.
Franz 2012-11-01 10:44:19

I would love for it to be fixed, but there doesn't seem to be a simple solution...