Ticket #319 (fixed enhancement)

Fallback if unicode-properties aren't enabled

  • Created: 2011-02-25 17:13:00
  • Reported by: Reines
  • Assigned to: Franz
  • Milestone: 1.4.5
  • Component: regex
  • Priority: normal

CentOS and Red Hat have shipped (and may still ship) PCRE builds without unicode-properties enabled, so regexes that use \p{...} fail to compile. We should attempt to fall back and recover from this as best we can.

History

Reines 2011-03-20 22:37:11

  • Owner changed from Reines to Franz.

preg_match() returns false if an error occurs (preg_replace() returns null), so if a call fails instead of returning its normal result, we could fall back and try again with the old regex.

To give an example from search_idx.php:

    // Remove any apostrophes or dashes which aren't part of words
    $replaced = preg_replace('/((?<=[^\p{L}\p{N}])[\'\-]|[\'\-](?=[^\p{L}\p{N}]))/u', '', ' '.$text.' ');
    // preg_replace() returns null on error, e.g. when unicode-properties aren't available
    if ($replaced === null)
        $replaced = preg_replace('/((?<=\W)[\'\-]|[\'\-](?=\W))/u', '', ' '.$text.' ');

    $text = substr($replaced, 1, -1);

Though it would be quite nice if there was a way to do this without duplicating the rest of the regex...
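
One possible way to avoid the duplication, sketched here under assumptions: probe once whether the PCRE build understands \p{...}, then build the pattern from whichever character class is available. The helper name pcre_has_unicode_properties() and the sample text are hypothetical and not part of FluxBB.

```php
<?php
// Hypothetical helper (not part of FluxBB): returns true when this PCRE build
// understands \p{...} escapes. The result is cached after the first call.
function pcre_has_unicode_properties()
{
    static $supported = null;

    if ($supported === null)
    {
        // Without unicode-properties the pattern fails to compile: PHP raises
        // a warning (suppressed here) and preg_match() returns false.
        $supported = @preg_match('/\p{L}/u', 'a') === 1;
    }

    return $supported;
}

// Pick the "not part of a word" class once, then build the pattern from it
$not_word = pcre_has_unicode_properties() ? '[^\p{L}\p{N}]' : '\W';

$text = "it's a 'quoted' word";
$replaced = preg_replace('/((?<='.$not_word.')[\'\-]|[\'\-](?='.$not_word.'))/u', '', ' '.$text.' ');
$text = substr($replaced, 1, -1);
```

With either character class the apostrophe inside "it's" is kept and the quotes around "quoted" are removed. The rest of the pattern is written only once.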

Franz 2011-03-22 00:23:38

  • Uploaded patch 314-unicode.patch.

Got a nice little patch.
It's very much a workaround: it provides a wrapper around the preg_replace() function that replaces the nasty part of the regex pattern with something more comforting for these non-conforming builds.

Is that a good solution? If so, please merge.

Reines 2011-03-22 09:16:47

This doesn't fully solve it at the moment, though it could probably be adapted to.

That patch will only handle the following being used:

[^\p{L}\p{N}]

In certain places we have other patterns, such as:

[\p{L}\p{N}\-]

In the search indexer, the old expression removed a blacklist of symbols and assumed anything left was part of a word. Using the unicode-properties, I changed it to remove anything that isn't considered a word. If we just replace the unicode-property match there, we will end up removing anything that isn't considered a Latin word, effectively breaking search for foreign-language boards. Mind you, I guess it could be argued that this is acceptable for a fallback, and that non-Latin search simply requires unicode-properties to be enabled.

Reines 2011-03-22 09:31:49

What I think would actually work here is replacing \p{L}\p{N} with \w, rather than replacing the negated class [^\p{L}\p{N}] as a whole. This would still have the issue with searching in foreign languages, but I think we can argue that is okay.

I would be tempted though to call the function something like ucp_preg_replace since it is unicode-character-property related - calling it safe is misleading.
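
As a rough illustration of the idea above (a sketch only, not the contents of the attached patch), a wrapper could try the pattern as written and, if preg_replace() reports an error, retry with \p{L}\p{N} swapped for \w. The function body and the sample text below are assumptions made for illustration.

```php
<?php
// Sketch only: an assumed implementation, not the code from the patch.
function ucp_preg_replace($pattern, $replace, $subject)
{
    // Suppress the compile warning raised on builds without unicode-properties
    $replaced = @preg_replace($pattern, $replace, $subject);

    // preg_replace() returns null on error; retry with \w in place of \p{L}\p{N}
    if ($replaced === null)
    {
        // str_replace() handles both a single pattern and an array of patterns
        $pattern = str_replace('\p{L}\p{N}', '\w', $pattern);
        $replaced = preg_replace($pattern, $replace, $subject);
    }

    return $replaced;
}

$text = "rock-n-roll 'fine'";
$replaced = ucp_preg_replace('/((?<=[^\p{L}\p{N}])[\'\-]|[\'\-](?=[^\p{L}\p{N}]))/u', '', ' '.$text.' ');
$text = substr($replaced, 1, -1);
```

Because only the \p{L}\p{N} part is substituted, the same wrapper covers both [^\p{L}\p{N}] (which becomes [^\w]) and [\p{L}\p{N}\-] (which becomes [\w\-]). One difference to keep in mind is that \w also matches the underscore.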

PS. This also needs to be applied in db_update.php.

Franz 2011-03-22 17:27:27

  • Uploaded patch 314-unicode2.patch.

I agree with ignoring international search when Unicode support is disabled. Here's the updated version. Thanks for the heads-up.

Franz 2011-03-22 17:38:13

Note that this is a hack. I don't see a better solution, though, and it certainly is more elegant than using a hard-coded fallback in every one of these cases.

Franz 2011-03-22 19:46:35

Commit 280dccd to fluxbb fluxbb-1.4

#314: Provide a regex fallback if unicode-properties aren't enabled.

Franz 2011-03-22 19:51:49

  • Status changed from open to fixed.

Fixed in 280dccd.