Philippe wrote:
The Unicode solution would be to normalize with full decomposition, then use a regex to match \p{InCombiningDiacriticalMarks} and replace each match with an empty string, or something similar.
I don't think that is enough. I think the normalization is language dependent. o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.
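To illustrate the language-dependence point, here is a minimal sketch in Python, not a complete transliteration scheme. The table `LANGUAGE_RULES`, the function `fold`, and the language codes are hypothetical names introduced for this example; only the German entry for o-umlaut comes from the discussion, and any language without a table entry falls back to plain mark stripping. Note that `unicodedata.combining` matches all combining characters, which is broader than the Combining Diacritical Marks block alone.

```python
import unicodedata

# Hypothetical per-language replacement table (illustrative, not exhaustive).
# Only the German o-umlaut rule is taken from the discussion.
LANGUAGE_RULES = {
    "de": {"\u00f6": "oe", "\u00d6": "Oe"},  # o-umlaut -> oe in German
}

def fold(text, lang):
    """Apply language-specific replacements, then strip remaining marks."""
    rules = LANGUAGE_RULES.get(lang, {})
    # Compose first, so "o" + U+0308 and the precomposed o-umlaut
    # both hit the same table entry.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    # Fallback for everything else: decompose and drop combining characters.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this sketch, `fold("sch\u00f6n", "de")` gives `schoen`, while `fold("co\u00f6rdinatie", "nl")` gives `coordinatie`, because no Dutch rule is defined and the mark is simply removed.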
Stephan
2009/6/3 stephan@stack.nl:
Philippe wrote:
The Unicode solution would be to normalize with full decomposition, then use a regex to match \p{InCombiningDiacriticalMarks} and replace each match with an empty string, or something similar.
I don't think that is enough. I think the normalization is language dependent. o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.
I think we are talking about different issues. Unicode normalization [1] solves the problem that some characters (like o-umlaut) can be represented in different ways in Unicode, by settling on only one of those representations. What I proposed would then simply remove the diacritical marks (remove the umlaut, keep the o). What you propose is more sophisticated.
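A minimal sketch of both steps in Python, assuming "full decomposition" means canonical decomposition (NFD). Python's standard `re` module does not support `\p{InCombiningDiacriticalMarks}`, so the explicit range U+0300 to U+036F (the Combining Diacritical Marks block) stands in for it here. The names `COMBINING_MARKS` and `strip_diacritics` are made up for this example.

```python
import re
import unicodedata

# U+0300..U+036F is the Combining Diacritical Marks block; this range is
# a stand-in for \p{InCombiningDiacriticalMarks}.
COMBINING_MARKS = re.compile("[\u0300-\u036f]")

def strip_diacritics(text):
    """Decompose to NFD, then delete combining diacritical marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return COMBINING_MARKS.sub("", decomposed)

precomposed = "\u00f6"  # o-umlaut as a single code point
combining = "o\u0308"   # "o" followed by COMBINING DIAERESIS

# The two spellings differ as raw strings but agree after normalization.
same_after_nfd = (
    unicodedata.normalize("NFD", precomposed)
    == unicodedata.normalize("NFD", combining)
)
```

Both `strip_diacritics(precomposed)` and `strip_diacritics(combining)` yield a plain `o`, which is the "remove the umlaut, keep the o" behaviour and nothing more.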
But you're right, most text operations are language dependent, including conversion between upper and lower case.
[1] http://www.unicode.org/unicode/reports/tr15/tr15-23.html
Cheers Philippe
squeak-dev@lists.squeakfoundation.org