Fork me on GitHub
Subscribe 7

Ticket #508 (fixed bug)

Thousands of returns in message crash the forum

  • Created: 2011-10-12 08:07:40
  • Reported by: Otomatic
  • Assigned to: Reines
  • Milestone: 1.4.8
  • Component: parser
  • Priority: high

In a message write, for example :

Test

Thousands of return

Test

See https://fluxbb.org/forums/viewtopic.php?id=5824

History

Franz 2011-10-12 09:46:07

  • Milestone set to 1.4.8.
  • Status changed from open to duplicate.

Sorry, I had already created a (private) ticket.

adaur 2011-10-12 12:26:44

This shouldn't be public imo.

I'm currently trying to create a patch, without success for the moment.

I think this part has to be modified

	// Deal with newlines, tabs and multiple spaces
	$pattern = array("\n", "\t", '  ', '  ');
	$replace = array('<br />', '&#160; &#160; ', '&#160; ', ' &#160;');
	$text = str_replace($pattern, $replace, $text);

in include/parser.php

adaur 2011-10-12 12:31:29

I just found a fix.

Replace the code above by

	// Deal with newlines, tabs and multiple spaces
	$text = preg_replace("/\n\n+/s", "\n\n", $text);
	$pattern = array("\n", "\t", '  ', '  ');
	$replace = array('<br />', '&#160; &#160; ', '&#160; ', ' &#160;');
	$text = str_replace($pattern, $replace, $text);

and it's working.

Franz 2011-10-12 12:43:12

So this would essentially get rid of areas of more than two newlines?

What happens when we have messages like that, but with tons of tabs instead of newlines?

adaur 2011-10-12 12:48:22

I tested this with newlines, it replaces all multiple newlines by one when the post is displayed.

I didn't test the bug with tabs for the moment.

adaur 2011-10-12 12:49:45

Ok, for tabs, use this:

$text = preg_replace("/\t\t+/s", "\t", $text);

Franz 2011-10-12 12:53:51

What about a mix of tabs and newlines? tongue

Also, should we do this when the post is displayed? Or during preparsing?

adaur 2011-10-12 13:06:30

I think it's better to do this before insterting it in DB.

Should we keep this for display anyway?

To clean the code before inserting it, add

	// Remove multiples newlines and tabs
	$text = preg_replace("/\n\n+/s", "\n\n", $text);
	$text = preg_replace("/\t\t+/s", "\t", $text);

before

return pun_trim($text);

in the function preparse_bbcode (include/parser.php)

Otomatic 2011-10-12 13:26:07

Hi,

I am not a specialist "regex", but I think it would be better to remove any repetitive sequence of more than x characters, including return and tab itself repeatedly over five times.

adaur 2011-10-12 15:12:44

Don't you think a such regex would be greedy?

Plus I don't see why FluxBB should censor any character repeated more than 5 times, this can be useful sometimess.

taylorchu 2011-10-12 15:38:27

the problem is php regexp does not allow content length > 100k. pcre.backtrack_limit is 100k by default. so we should check content length.

adaur 2011-10-14 14:20:21

Any news ?

Otomatic 2011-10-14 17:05:36

A "regexp" is impossible in this case.
This is the kind of work done by compression algorithms with dictionaries, that is another approach of text processing than regular expressions.
I think the "parser" would quickly become a "gas plant"

Smartys 2011-10-16 13:06:09

What is this actually crashing in the first place? Why is a simple str_replace resulting in a 500 error?

Reines 2011-10-17 13:54:38

  • Owner set to Reines.
  • Status changed from duplicate to open.

Otomatic 2011-10-18 08:54:18

Hi,
In parser.php file, just after :
    // Deal with newlines, tabs and multiple spaces
add
    //modif oto - More than two consecutive newlines, replaced by only two newlines
    $text = preg_replace("/\n\n+/s", "\n\n", $text);
    //modif oto - Suppress any repetitive sequence consisting of at least 10 characters, repeated at least 5 times.
    if(preg_match('/(.{10,})\1{5,}/s', $text, $matches, PREG_OFFSET_CAPTURE))
        $text = str_replace($matches[0][0], "", $text);

Otomatic 2011-10-18 15:07:31

Not only for newline or tab, but for any repetitive sequence of characters.
As we can be dealing with crafty repeat sequences that are different, in order to delete multiple distinct repetitive sequences, we use preg_match_all ().
    // Deal with newlines, tabs and multiple spaces
    //modif adaur - More than two consecutive newlines, replaced by only two newlines
    $text = preg_replace("/\n\n+/s", "\n\n", $text);
    //modif oto - Suppress any repetitive sequence consisting of at least $nb_char characters, repeated at least $nb_times times.
    $nb_char = 10;
    $nb_times = 5;
    if(preg_match_all('/(.{'.$nb_char.',})\1{'.$nb_times.',}/s', $text, $matches)) {
        $text = str_replace($matches[1], "",$text);
        $text = "Any repetitive sequence of at least ".$nb_char." characters, repeated at least ".$nb_times." times, were removed.\n\n".$text;
    }

Otomatic 2011-10-19 17:02:43

Hi,

Repetitive and abusive sequences are removed from preview then from the message prior to insertion into the database.

See : http://fluxbb.fr/forums/viewtopic.php?p … 80#p103880
The message is in French but the explanations are in English.

Franz 2011-10-19 23:18:48

  • Summary changed from Thousand of Retun in message crash the forum to Thousands of returns in message crash the forum.

ridgerunner 2011-10-27 21:47:00

Part of the problem is that the maximum post length was increased from 64KB. That was a bad move IMHO. Can someone send me an email with an attachment containing the text of a post which causes the crash? I'd like to see if my parser experiences the same crash.

I agree that this bug should not be public!

Franz 2011-10-27 22:25:08

  • Visibility set to private.

Fine then.

quy 2011-10-28 04:07:29

http://dl.dropbox.com/u/4284608/test_data.txt

The link contains the data to reproduce the crash. Basically it has 1 line of text at the beginning, 10,000 carriage returns in the middle and 1 line of text at the end.

Beginning line
Thousands of returns
Ending line

quy 2011-10-28 04:41:29

Followup: This is reproducible on my live server, but not on my local setup. It may work differently with your setup.

quy 2011-11-01 22:07:34

ridgerunner: tested with your parser and experienced no crashes.

Franz 2011-12-23 10:31:45

In my error log, I see a segmentation fault in the Apache process, so it seems there's a bug with PHP's str_replace() function when it comes to big data.

Otomatic's patch fixes it for me - it might not be the ideal solution, but as the new parser fixes this whole thing, maybe we should just go with this workaround (which is executed when displaying the page, not when preparsing the post).

Opinions? This is important!

Otomatic 2011-12-23 14:28:09

Hi,
My patch works, but it is very time and cpu consuming for long message. So I limit the second part for message less than 20000 characters.
  // Deal with newlines, tabs and multiple spaces
    //modif oto - More than two consecutive newlines or tab, replaced by only two newlines or one tab
    $message = preg_replace(array("/\n\n+/s", "/\t\t+/s"), array("\n\n", "\t"), $message);
  /* modif oto - Suppress any repetitive sequence consisting of at least $nb_char characters,
     repeated at least $nb_times times.
     Only for messages less than 20000 characters because It's very time and cpu consuming */
  if(strlen($message) < 20000) {
    $nb_char = 10;
    $nb_times = 5;
    if(preg_match_all('/(.{'.$nb_char.',})\1{'.$nb_times.',}/s', $message, $matches))
    $message = str_replace($matches[1], "",$message);
  }

Reines 2011-12-24 22:18:43

I've only skimmed through the comments here - but surely the problem is only large numbers of new lines? The str_replace isn't causing any errors - all it does is turn new lines into <br> tags - it's the regular expression after it that is an issue:

// Add paragraph tag around post, but make sure there are no empty paragraphs
$text = preg_replace('%<br />\s*?<br />((\s*<br />)*)%i', "</p>$1<p>", $text);

quy 2011-12-27 03:59:06

// Add paragraph tag around post, but make sure there are no empty paragraphs
$text = preg_replace('%<br />\s*?<br />((\s*<br />)*)%i', "</p>$1<p>", $text);

This is definitely the culprit. It is not an issue with the new parser because this regex is not included. Maybe it can be removed to fix the issue.

Franz 2011-12-27 17:02:25

Gosh, that thing is messing with my head. Especially as I'm not completely sure how it is supposed to work!?

Not important, but we can probably remove the lazy star selector, as it is not necessary (replace by simple star).

If I understand this thing correctly, it removes the first two linebreaks if there are at least two of them in a row. What in the world is the point of doing that? Since we remove multiple newlines much earlier, can we maybe simplify this thing drastically?

@ridgerunner: any ideas on how to fix that regex?

quy 2011-12-28 19:19:40

Here are results with/without the regex and from the new parser.

With the regex:

<p>This is line 1.</p><p>This is line 3.</p><br /><p>This is line 6.</p><br /><br /><p>This is line 10.</p>

Without the regex:

<p>This is line 1.<br /><br />This is line 3.<br /><br /><br />This is line 6.<br /><br /><br /><br />This is line 10.</p>

New parser:

<p>This is line 1.<br /><br />This is line 3.<br /><br /><br />This is line 6.<br /><br /><br /><br />This is line 10.</p>

The regex has been omitted in the new parser.

quy 2011-12-31 18:16:33

This is not ideal, but it does the trick. 100 is arbitrary chosen and could be less since a post most likely will not have that many newlines.

	if (substr_count($text, '<br />') < 100) 
		$text = preg_replace('%<br />\s*?<br />((\s*<br />)*)%i', "</p>$1<p>", $text);

adaur 2012-01-01 09:59:48

Don't forget tabs wink

Reines 2012-01-01 13:28:06

adaur wrote:

Don't forget tabs wink

Unless I'm totally misunderstanding the problem, tabs aren't a issue here... The issue is the regex which I posted - which throws up on too many <br> tags (which are only introduced by newlines).

Looking at the regex, this should be fixed by making the final <br> capture possessive? i.e.

%<br />\s*?<br />((\s*<br />)*+)%i

Indeed, this change solves the problem - though I haven't tested it in the real world. I don't think the chance of possessiveness should break anything, but I'm not an expert. It either needs some testing (if I haven't heard back I'll try commit and pull to fluxbb.org this evening and see what happens), or input from someone like Ridgerunner!

Reines 2012-01-01 13:58:18

Commit 12c6c03 to fluxbb fluxbb-1.4

Fixing paragraph regex in the parser to be possessive. #508

Reines 2012-01-01 14:01:39

  • Status changed from open to fixed.

In fact I've just commit and pushed this now - so let me know if it broke anything!

Reines 2012-01-01 14:10:55

  • Status changed from fixed to open.

While that fixed an issue, and it now runs fine locally, the post in the original description is still giving an error 500 - so I've re-opened this. Will look again later.

Franz 2012-01-01 14:27:29

I'll test it tonight, too.

Reines 2012-01-01 18:24:23

Yup that fixed it locally but not fully - it looks like PHP has a higher default PCRE backtrack/recursion limit on Archlinux than Ubuntu.

Obviously my change helped a bit, but not enough...

Franz 2012-01-01 18:36:45

Can we somehow get around using backtracking/recursion like that?

Obviously, if we just decrease the amount we use, I can make the forum break with even larger posts, I would think.

Franz 2012-01-01 20:16:39

Indeed, it did not work on my Ubuntu test install, either.

adaur 2012-01-01 21:02:41

@Reines: tabs are an issue as they crash the forum like the spaces.

https://fluxbb.org/forums/viewtopic.php … 998#p43998

In this test they do not break it but it would be better to fix this isn't it?

Reines 2012-01-01 21:58:46

They don't crash the forum - they just make it hard work for the browser to render. It should probably be dealt with yes, but it's a separate issue from this - feel free to make another ticket for it if you wish.

Reines 2012-01-01 22:06:25

I think the following should work...

	// Add paragraph tag around post, but make sure there are no empty paragraphs
	$text = preg_replace('%<br />\s*?<br />%i', '</p><p>', $text);

	$text = str_replace('<p><br />', '<br /><p>', '<p>'.$text.'</p>');
	$text = str_replace('<br /></p>', '</p><br />', $text);
	$text = str_replace('<p></p>', '<br /><br />', $text);

Reines 2012-01-01 22:23:29

Okay that works-ish. It solves the problem - but we end up with double <br> tags before and after quotes - which shouldn't be there.

Reines 2012-01-01 22:33:25

Commit 6ddce72 to fluxbb fluxbb-1.4

Removing starting and ending line breaks from posts - i.e trimming any content. Part of #508.

Reines 2012-01-01 22:36:35

  • Visibility set to public.
  • Status changed from open to fixed.

I don't like this method of trimming - I hadn't noticed quotes inside content (not just ones at the start/end) also have the same problem. I'm going to make a new ticket though as it's turning into a different issue.