Fork me on GitHub
Subscribe 7

Ticket #483 (fixed bug)

Improve performance of character validation

  • Created: 2011-08-22 15:22:34
  • Reported by: Pierre
  • Assigned to: Reines
  • Milestone: 1.4.8
  • Component: parser
  • Priority: normal

I have noticed some timeouts (script was killed after 30s) triggered by posting.

The hot spot seems to be the utf8_bad_strip() function in bad.php. This function and all functions implemented in a similar way are very inefficient.

This knowledge could be used for a simple DOS attack and reduces overall performance in general.

I have uploaded a simple test including a solution to this problem to https://users.archlinux.de/~pierre/tmp/ … st.php.txt

Running the test with utf8_bad_strip takes more than 3 minutes while utf8_bad_strip_improved can do the same in 2 seconds.

The same technique should be used in other utf8 functions as well.

Here is a copy of the improved function:

function utf8_bad_strip_improved($str)
{
	$UTF8_GOOD =
		'[\x00-\x7F]'.                           # ASCII (including control chars)
		'|[\xC2-\xDF][\x80-\xBF]'.               # Non-overlong 2-byte
		'|\xE0[\xA0-\xBF][\x80-\xBF]'.           # Excluding overlongs
		'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'.    # Straight 3-byte
		'|\xED[\x80-\x9F][\x80-\xBF]'.           # Excluding surrogates
		'|\xF0[\x90-\xBF][\x80-\xBF]{2}'.        # Planes 1-3
		'|[\xF1-\xF3][\x80-\xBF]{3}'.            # Planes 4-15
		'|\xF4[\x80-\x8F][\x80-\xBF]{2}';        # Plane 16
	$result = '';
	for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
	{
		preg_match_all('%(?:'.$UTF8_GOOD.')+%',
			mb_substr($str, $i, 1000, 'UTF-8'),
			$clean_pieces);

		$result .= implode('', $clean_pieces[0] );
	}
	return $result;
}

Note that the input string needs to be split in several peaces as pcre cannot handle large input and tends to crash (depending on the version).

History

Reines 2011-09-09 15:25:27

  • Milestone set to 1.4.7.

Reines 2011-09-09 20:27:08

  • Owner set to Reines.

Reines 2011-09-13 19:37:55

  • Milestone changed from 1.4.7 to 1.4.8.

Visman 2011-10-16 12:13:06

...
for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
...
mb_substr($str, $i, 1000, 'UTF-8'),
...

?

Reines 2011-10-16 12:19:20

Just to note, the problem with this solution is it requires mbstring extension, which isn't enabled by default in a stock PHP install.

Visman 2011-10-16 13:33:34

function utf8_bad_strip_improved($str) { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
$ret = '';
$i = 0;
$s = strlen($str);
for (;$i < $s;) {
    $tmp = $str{$i++};
    $ch = ord($tmp);
    if ($ch > 0x7F) {
        if ($ch < 0xC0) continue;
        elseif ($ch < 0xE0) $di = 1;
        elseif ($ch < 0xF0) $di = 2;
        elseif ($ch < 0xF8) $di = 3;
        elseif ($ch < 0xFC) $di = 4;
        elseif ($ch < 0xFE) $di = 5;
        else continue;

        for ($j = 0;$j < $di;$j++) {
            $tmp .= $ch = $str{$i + $j};
            $ch = ord($ch);
            if ($ch < 0x80 || $ch > 0xBF) continue 2;
        }
        $i += $di;
    }
    $ret .= $tmp;
}
return $ret;
}

Visman 2011-10-16 13:50:08

Test:
Post text 661 200 bytes
utf8_bad_strip() - script was killed after 30s
my variant utf8_bad_strip_improved() - [ Generated in 3.587 seconds, 7 queries executed - Memory usage: 15.75 MiB (Peak: 18.8 MiB) ]

FSX 2011-10-24 21:36:14

Since this affects php-utf8, please also make an issue at https://github.com/FSX/php-utf8

Visman 2011-10-25 01:07:19

new utf8_bad_clean == old utf8_bad_strip
The problem remains.

FSX 2011-10-25 08:33:54

I haven't looked into it yet. I just said: create an issue at the php-utf8 repository.

Reines 2011-12-22 20:19:22

So I did some work on this today...

@Pierre: While your solution is much more efficient than the current one, it relies on the mb_string extension, which we cannot assume is available. The native utf8_substr function in the utf8 library fails, as it fails when presented with invalid unicode. I'm also seeing segfaults (?!) when attempting to use that function on a non-invalid ASCII-only string.

@Visman: Where did that function come from? I don't see any direct reference to it on the linked wikipedia article. It appears to work much better, and doesn't rely on any extensions.

In the below "improved" is Pierre's solution, "improved (compatible)" is Pierre's solution using utf8 library instead of mb_string, and "improved (test)" is Visman's solution.

19:35:59 jamie@penguin ~ php test.php
Original took 499.747825 seconds
Improved took 3.285618 seconds, result passed
Improved (compatible) took 5.084029 seconds, result failed
Improved (test) took 0.702509 seconds, result passed

Here's a slightly tidied up version, but I don't want to commit anything yet if the licensing is going to be an issue...

function utf8_bad_strip_improved_test($str, $replace = '') { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
	$result = '';

	$strlen = strlen($str);
	for ($i = 0; $i < $strlen;) {
		$tmp = $str[$i++];
		$ch = ord($tmp);

		if ($ch > 0x7F) {
			if ($ch < 0xC0) {
				$result .= $replace;
				continue;
			}
			else if ($ch < 0xE0) $di = 1;
			else if ($ch < 0xF0) $di = 2;
			else if ($ch < 0xF8) $di = 3;
			else if ($ch < 0xFC) $di = 4;
			else if ($ch < 0xFE) $di = 5;
			else {
				$result .= $replace;
				continue;
			}

			for ($j = 0; $j < $di; $j++) {
				$tmp .= $ch = $str[$i + $j];
				$ch = ord($ch);

				if ($ch < 0x80 || $ch > 0xBF) {
					$result .= $replace;
					continue 2;
				}
			}

			$i += $di;
		}

		$result .= $tmp;
	}

	return $result;
}

FSX 2011-12-22 20:45:51

Here's a version without the replace argument. It also passes the tests:

function badClean($str)
{
	$UTF8_GOOD =
		'[\x00-\x7F]'.                           # ASCII (including control chars)
		'|[\xC2-\xDF][\x80-\xBF]'.               # Non-overlong 2-byte
		'|\xE0[\xA0-\xBF][\x80-\xBF]'.           # Excluding overlongs
		'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'.    # Straight 3-byte
		'|\xED[\x80-\x9F][\x80-\xBF]'.           # Excluding surrogates
		'|\xF0[\x90-\xBF][\x80-\xBF]{2}'.        # Planes 1-3
		'|[\xF1-\xF3][\x80-\xBF]{3}'.            # Planes 4-15
		'|\xF4[\x80-\x8F][\x80-\xBF]{2}';        # Plane 16

	preg_match_all('%(?:'.$UTF8_GOOD.')+%', $str, $clean_pieces);
	return implode('', $clean_pieces[0]);
}

I was also thinking about mb_strlen. It doesn't count invalid characters so chopping the text in pieces and process it piece by piece is not going to work properly, because you aren't taking the invalid characters into account.

Reines 2011-12-22 22:16:26

I'm also getting a segfault on that when giving it a large clean ASCII string.

Also it's memory consumption is insane - using a string of size 20,000,000 it is uses 2.6Gb memory and takes 14 seconds. The modified one Visman posted takes 14 seconds and uses 30Mb memory.

FSX 2011-12-23 11:34:23

Can you show your test code and data?

quy 2011-12-27 15:55:34

I don't think you will find direct reference to it on the linked wikipedia article. Googled "SiMM, based on ru.wikipedia.org/wiki/Unicode" and found following postings by SiMM:
http://phpclub.ru/talk/threads/iconv-%D … f-8.24038/
http://forum.ru-board.com/topic.cgi?for … 0&start=20

Reines committed 8c1402d fluxbb-1.4 2011-12-31 00:12:32

Improving performance of utf8_bad_strip. #483

Reines 2011-12-31 00:18:44

  • Status changed from open to fixed.

I've implemented a version of SiMM's code from scratch and commit it. I'll pull it to fluxbb.org now. Can someone please give it a look over and testing when they get a chance?

Cheers.