
Ticket #179 (fixed bug)

Indexer can't recognize Unicode punctuation

  • Created: 2010-11-06 16:51:24
  • Reported by: Freeman
  • Assigned to: Reines
  • Milestone: 1.4.4
  • Component: search
  • Priority: normal

Unicode defines far more punctuation than the ASCII range covers, for example:
U+2000..U+206F: General Punctuation
U+3000..U+303F: CJK Symbols and Punctuation (includes the ideographic full stop 。)
plus many other code points in script-specific blocks.

For example, my index now contains words that start with the « character.
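
To illustrate the report, here is a minimal Python sketch (not FluxBB code, which is PHP) that looks up the Unicode general category of a few characters. The helper name `is_punct_or_symbol` and the sample strings are assumptions made for this example. It also shows why a list of ASCII punctuation alone leaves « attached to a word.

```python
import string
import unicodedata


def is_punct_or_symbol(ch: str) -> bool:
    """True when the general category is P* (punctuation) or S* (symbol)."""
    return unicodedata.category(ch)[0] in ("P", "S")


# Print the code point, its general category, and the classification.
for ch in ["«", "。", "!", "a", "я"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_punct_or_symbol(ch)}")

# An ASCII-only punctuation list does not contain «, so nothing is stripped.
ascii_stripped = "«word»".strip(string.punctuation)
print(ascii_stripped)
```

Characters such as « and 。 fall in the P categories even though they sit outside ASCII, so a category-based test catches them where a fixed ASCII list does not.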

History

Reines 2010-11-08 15:58:43

  • Milestone set to 1.4.4.

Thanks, I'll look into this, but I'm going to mark it for the 1.4.4 release. Obviously this isn't a huge change, but it will require people to re-index their boards, which is a bit of a pain.

Reines 2011-01-23 14:07:42

Commit ebf12e1 to fluxbb fluxbb-1.4

Changing the regular expressions to use \p{L} instead of \w and \P{L} instead of \W. This fixes numerous Unicode-related issues:
- #166: Censoring still doesn't work fully with utf-8
- #179: Indexer can't recognize Unicode punctuation
- #239: Unicode hyperlink not clickable and truncated
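
The difference between \w and \p{L} can be sketched in Python, with two caveats: the stdlib `re` module has no \p{L}, so `str.isalpha()` stands in for it here, and `re.ASCII` is only an approximation of how a non-Unicode-aware \w behaves. The function name `letter_runs` and the sample text are assumptions for this example, not taken from the commit.

```python
import re


def letter_runs(text: str) -> list[str]:
    """Split text into runs of Unicode letters (an approximation of \\p{L}+)."""
    runs, current = [], []
    for ch in text:
        if ch.isalpha():          # true for general categories Lu, Ll, Lt, Lm, Lo
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


text = "привет, мир"

# ASCII-only \w sees no word characters in Cyrillic text.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Letter-based matching finds both words.
unicode_words = letter_runs(text)

print(ascii_words, unicode_words)
```

Note that \w also accepts digits and the underscore while \p{L} accepts letters only, so in this sketch `letter_runs("foo_bar2")` yields two words where an ASCII \w+ would yield one.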

Reines 2011-01-23 14:14:23

  • Owner set to Reines.

I've implemented a fix for this - just waiting for some testing.

https://github.com/fluxbb/fluxbb/pull/9

Reines 2011-01-26 11:29:47

  • Milestone changed from 1.4.4 to 1.4.5.

Sorry, my fix doesn't actually fix anything here - obviously it's the punctuation regex we need to change, not just the word separators.

Ideally we could simply use [\p{P}\p{S}], which matches all Unicode punctuation and symbols. However, when running a search rather than indexing, we want to let certain characters through (* and % for example), and the above regex would strip them. I'll need to ask ridgerunner's advice on this one.
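
One possible way to reconcile the two needs is sketched below in Python: strip every punctuation or symbol character, but accept an optional set of characters to keep when the text is a search query. This is not the approach FluxBB settled on. The names `strip_punctuation` and `SEARCH_WILDCARDS` are hypothetical, and `unicodedata` categories stand in for the [\p{P}\p{S}] class.

```python
import unicodedata

# Characters the comment above says a search must let through (assumed set).
SEARCH_WILDCARDS = frozenset("*%")


def strip_punctuation(text: str, keep: frozenset = frozenset()) -> str:
    """Replace P* and S* characters with spaces, except those listed in keep."""
    out = []
    for ch in text:
        if ch not in keep and unicodedata.category(ch)[0] in ("P", "S"):
            out.append(" ")
        else:
            out.append(ch)
    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join("".join(out).split())


sample = "«flux*» 。bb%"
indexed = strip_punctuation(sample)                    # indexing: strip everything
searched = strip_punctuation(sample, SEARCH_WILDCARDS)  # searching: keep * and %
print(indexed)
print(searched)
```

Both * and % have general category Po, which is why a plain category match would remove them unless they are exempted explicitly.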

Reines 2011-01-26 15:27:04

  • Milestone changed from 1.4.5 to 1.4.4.

Actually I'm going to change this back to the 1.4.4 milestone. I've already updated the SI revision in 1.4.4, so doing an update will require re-indexing all posts. If possible it would make sense to have this fixed rather than requiring reindexing again in 1.4.5.

Reines 2011-01-27 14:20:24

Commit 83462a1 to fluxbb master

Changing the search index to remove all non-letter or number characters, instead of simply having a blacklist of known punctuation. This should hopefully result in a more tidy search words table. #179
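
The whitelist idea in this commit message can be sketched in Python as follows. It is an illustration under stated assumptions rather than the committed PHP code: `index_words` is a hypothetical name, lowercasing is added only to make the output easy to compare, and `unicodedata` categories L* and N* stand in for "letter or number".

```python
import unicodedata


def index_words(text: str) -> list[str]:
    """Keep only letters (L*) and numbers (N*); everything else separates words."""
    cleaned = "".join(
        ch if unicodedata.category(ch)[0] in ("L", "N") else " "
        for ch in text
    )
    return cleaned.lower().split()


words = index_words("«Привет», мир! FluxBB 1.4.4。")
print(words)
```

Because anything that is not a letter or number becomes a separator, new punctuation never has to be added to a list. One consequence in this sketch is that combining marks (category M) also act as separators, so text with decomposed accents would be split.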

Reines 2011-01-27 14:25:53

  • Status changed from open to fixed.

Hopefully sorted now.