Fork me on GitHub
Subscribe 10

Ticket #483 (open bug)

Improve performance of character validation

  • Created: 2011-08-22 15:22:34
  • Reported by: Pierre
  • Assigned to: None
  • Milestone: 2.0-alpha3
  • Component: parser
  • Priority: normal

I have noticed some timeouts (script was killed after 30s) triggered by posting.

The hot spot seems to be the utf8_bad_strip() function in bad.php. This function and all functions implemented in a similar way are very inefficient.

This knowledge could be used for a simple DOS attack and reduces overall performance in general.

I have uploaded a simple test including a solution to this problem to https://users.archlinux.de/~pierre/tmp/ … st.php.txt

Running the test with utf8_bad_strip takes more than 3 minutes while utf8_bad_strip_improved can do the same in 2 seconds.

The same technique should be used in other utf8 functions as well.

Here is a copy of the improved function:

function utf8_bad_strip_improved($str)
{
	$UTF8_GOOD =
		'[\x00-\x7F]'.                           # ASCII (including control chars)
		'|[\xC2-\xDF][\x80-\xBF]'.               # Non-overlong 2-byte
		'|\xE0[\xA0-\xBF][\x80-\xBF]'.           # Excluding overlongs
		'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'.    # Straight 3-byte
		'|\xED[\x80-\x9F][\x80-\xBF]'.           # Excluding surrogates
		'|\xF0[\x90-\xBF][\x80-\xBF]{2}'.        # Planes 1-3
		'|[\xF1-\xF3][\x80-\xBF]{3}'.            # Planes 4-15
		'|\xF4[\x80-\x8F][\x80-\xBF]{2}';        # Plane 16
	$result = '';
	for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
	{
		preg_match_all('%(?:'.$UTF8_GOOD.')+%',
			mb_substr($str, $i, 1000, 'UTF-8'),
			$clean_pieces);

		$result .= implode('', $clean_pieces[0] );
	}
	return $result;
}

Note that the input string needs to be split in several peaces as pcre cannot handle large input and tends to crash (depending on the version).

History

Reines 2011-09-09 15:25:27

  • Milestone set to 1.4.7.

Reines 2011-09-09 20:27:08

  • Owner set to Reines.

Reines 2011-09-13 19:37:55

  • Milestone changed from 1.4.7 to 1.4.8.

Visman 2011-10-16 12:13:06

...
for ($i = 0; $i <= mb_strlen($str, 'UTF-8'); $i += 10000)
...
mb_substr($str, $i, 1000, 'UTF-8'),
...

?

Reines 2011-10-16 12:19:20

Just to note, the problem with this solution is it requires mbstring extension, which isn't enabled by default in a stock PHP install.

Visman 2011-10-16 13:33:34

function utf8_bad_strip_improved($str) { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
$ret = '';
$i = 0;
$s = strlen($str);
for (;$i < $s;) {
    $tmp = $str{$i++};
    $ch = ord($tmp);
    if ($ch > 0x7F) {
        if ($ch < 0xC0) continue;
        elseif ($ch < 0xE0) $di = 1;
        elseif ($ch < 0xF0) $di = 2;
        elseif ($ch < 0xF8) $di = 3;
        elseif ($ch < 0xFC) $di = 4;
        elseif ($ch < 0xFE) $di = 5;
        else continue;

        for ($j = 0;$j < $di;$j++) {
            $tmp .= $ch = $str{$i + $j};
            $ch = ord($ch);
            if ($ch < 0x80 || $ch > 0xBF) continue 2;
        }
        $i += $di;
    }
    $ret .= $tmp;
}
return $ret;
}

Visman 2011-10-16 13:50:08

Test:
Post text 661 200 bytes
utf8_bad_strip() - script was killed after 30s
my variant utf8_bad_strip_improved() - [ Generated in 3.587 seconds, 7 queries executed - Memory usage: 15.75 MiB (Peak: 18.8 MiB) ]

FSX 2011-10-24 21:36:14

Since this affects php-utf8, please also make an issue at https://github.com/FSX/php-utf8

Visman 2011-10-25 01:07:19

new utf8_bad_clean == old utf8_bad_strip
The problem remains.

FSX 2011-10-25 08:33:54

I haven't looked into it yet. I just said: create an issue at the php-utf8 repository.

Reines 2011-12-22 20:19:22

So I did some work on this today...

@Pierre: While your solution is much more efficient than the current one, it relies on the mb_string extension, which we cannot assume is available. The native utf8_substr function in the utf8 library fails, as it fails when presented with invalid unicode. I'm also seeing segfaults (?!) when attempting to use that function on a non-invalid ASCII-only string.

@Visman: Where did that function come from? I don't see any direct reference to it on the linked wikipedia article. It appears to work much better, and doesn't rely on any extensions.

In the below "improved" is Pierre's solution, "improved (compatible)" is Pierre's solution using utf8 library instead of mb_string, and "improved (test)" is Visman's solution.

19:35:59 jamie@penguin ~ php test.php
Original took 499.747825 seconds
Improved took 3.285618 seconds, result passed
Improved (compatible) took 5.084029 seconds, result failed
Improved (test) took 0.702509 seconds, result passed

Here's a slightly tidied up version, but I don't want to commit anything yet if the licensing is going to be an issue...

function utf8_bad_strip_improved_test($str, $replace = '') { // (C) SiMM, based on ru.wikipedia.org/wiki/Unicode
	$result = '';

	$strlen = strlen($str);
	for ($i = 0; $i < $strlen;) {
		$tmp = $str[$i++];
		$ch = ord($tmp);

		if ($ch > 0x7F) {
			if ($ch < 0xC0) {
				$result .= $replace;
				continue;
			}
			else if ($ch < 0xE0) $di = 1;
			else if ($ch < 0xF0) $di = 2;
			else if ($ch < 0xF8) $di = 3;
			else if ($ch < 0xFC) $di = 4;
			else if ($ch < 0xFE) $di = 5;
			else {
				$result .= $replace;
				continue;
			}

			for ($j = 0; $j < $di; $j++) {
				$tmp .= $ch = $str[$i + $j];
				$ch = ord($ch);

				if ($ch < 0x80 || $ch > 0xBF) {
					$result .= $replace;
					continue 2;
				}
			}

			$i += $di;
		}

		$result .= $tmp;
	}

	return $result;
}

FSX 2011-12-22 20:45:51

Here's a version without the replace argument. It also passes the tests:

function badClean($str)
{
	$UTF8_GOOD =
		'[\x00-\x7F]'.                           # ASCII (including control chars)
		'|[\xC2-\xDF][\x80-\xBF]'.               # Non-overlong 2-byte
		'|\xE0[\xA0-\xBF][\x80-\xBF]'.           # Excluding overlongs
		'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'.    # Straight 3-byte
		'|\xED[\x80-\x9F][\x80-\xBF]'.           # Excluding surrogates
		'|\xF0[\x90-\xBF][\x80-\xBF]{2}'.        # Planes 1-3
		'|[\xF1-\xF3][\x80-\xBF]{3}'.            # Planes 4-15
		'|\xF4[\x80-\x8F][\x80-\xBF]{2}';        # Plane 16

	preg_match_all('%(?:'.$UTF8_GOOD.')+%', $str, $clean_pieces);
	return implode('', $clean_pieces[0]);
}

I was also thinking about mb_strlen. It doesn't count invalid characters so chopping the text in pieces and process it piece by piece is not going to work properly, because you aren't taking the invalid characters into account.

Reines 2011-12-22 22:16:26

I'm also getting a segfault on that when giving it a large clean ASCII string.

Also it's memory consumption is insane - using a string of size 20,000,000 it is uses 2.6Gb memory and takes 14 seconds. The modified one Visman posted takes 14 seconds and uses 30Mb memory.

FSX 2011-12-23 11:34:23

Can you show your test code and data?

quy 2011-12-27 15:55:34

I don't think you will find direct reference to it on the linked wikipedia article. Googled "SiMM, based on ru.wikipedia.org/wiki/Unicode" and found following postings by SiMM:
http://phpclub.ru/talk/threads/iconv-%D … f-8.24038/
http://forum.ru-board.com/topic.cgi?for … 0&start=20

Reines 2011-12-31 00:12:32

Commit 8c1402d to fluxbb fluxbb-1.4

Improving performance of utf8_bad_strip. #483

Reines 2011-12-31 00:18:44

  • Status changed from open to fixed.

I've implemented a version of SiMM's code from scratch and commit it. I'll pull it to fluxbb.org now. Can someone please give it a look over and testing when they get a chance?

Cheers.

FSX 2012-02-28 00:29:33

  • Status changed from fixed to open.

I run the unit test on this function and it fails on the tests that include 0xFC, 0xA1 and 0xF8. It doesn't filter them out.

Maybe this document contains useful information: https://www.securecoding.cert.org/confl … ted+Issues

I'll read through it later. Need to sleep.

Comment edited 1 times (Diff)

Reines 2012-03-01 16:57:10

  • Milestone changed from 1.4.8 to 1.4.9.

Franz 2012-04-03 22:38:12

So, any ideas on this?

FSX 2012-04-05 08:01:18

Sorry, I forgot about it. I'll continue with it this week.

Franz 2012-04-24 13:13:59

  • Milestone changed from 1.4.9 to 1.5.1.

Franz 2012-05-03 11:53:57

Nicholas Grekas, creator of Patchwork/Utf8 has suggested trying this:

iconv('UTF-8', 'UTF-8//IGNORE', $s);
Comment edited 1 times (Diff)

arw 2012-05-03 22:06:16

i tought of it but tought it was the same as issue with mb_string extension ( but that could be used and the current solution ( or solution solved by fsx ) used as fallback )

Franz 2012-10-05 12:35:38

According to the documentation, iconv() is too buggy to be used for this. (Read the first few comments on that page.)

Any news, guys?

adaur 2012-11-01 10:17:32

Do we really have to fix this in the 1.5 branch?

Studio384 2012-11-01 10:25:43

Well actualy, it was ment to be fixed in the 1.4 branch, but since 1.4.7 its moved to 1.4.8, 1.4.9 and now 1.5.1. So: nop.

Franz 2012-11-01 10:44:19

I would love for it to be fixed, but there doesn't seem to be a simple solution...

adaur 2012-11-07 20:52:15

So, won't fix until FluxBB 2 ?

Studio384 2012-11-07 21:09:21

I think that's the best thing for now.

Studio384 2012-11-08 18:34:17

  • Milestone changed from 1.5.1 to 2.0-alpha3.

Reines 2015-07-21 09:38:56

  • Description changed. (Diff)
  • Owner Reines removed.