[squeak-dev] Re: a diacritics free version of a string

Philippe Marschall philippe.marschall at gmail.com
Wed Jun 3 15:45:45 UTC 2009


2009/6/3  <stephan at stack.nl>:
> Philippe wrote:
>>
>> The Unicode solution would be to do normalization with full
>> decomposition and then a regex on \p{InCombiningDiacriticalMarks} and
>> replace it with an empty string or something similar.
>
> I don't think that is enough. I think the normalization is language
> dependent.
> o-umlaut is replaced by oe in German, but the equivalent in Dutch is o.

I think we are talking about different issues. Unicode normalization [1]
solves the problem that some characters (like o-umlaut) can be
represented in different ways in Unicode: it picks one canonical
representation. What I proposed would then simply remove the diacritical
marks (remove the umlaut, keep the o). What you propose is more
sophisticated.
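
To make the simple approach concrete, here is a rough sketch in Python
(not Squeak, just for illustration). It assumes the standard
unicodedata module for the full decomposition (NFD). Python's re module
has no \p{...} classes, so the Combining Diacritical Marks block is
spelled out as the range U+0300..U+036F. The name strip_diacritics is
made up for this example.

```python
import re
import unicodedata

# The Combining Diacritical Marks block is U+0300..U+036F. Python's re
# module does not support \p{InCombiningDiacriticalMarks}, so the range
# is written out explicitly.
COMBINING_MARKS = re.compile("[\u0300-\u036f]")


def strip_diacritics(text):
    """Return text with combining diacritical marks removed.

    strip_diacritics is a hypothetical helper name for this sketch.
    """
    # Full canonical decomposition: a precomposed o-umlaut becomes
    # "o" followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFD", text)
    # Replace every combining mark in the block with the empty string.
    return COMBINING_MARKS.sub("", decomposed)


print(strip_diacritics("sch\u00f6n"))   # precomposed o-umlaut
print(strip_diacritics("scho\u0308n"))  # already decomposed form
```

Both calls give the same result because normalization maps the two
representations onto one before the marks are removed. Characters
without a canonical decomposition (o with stroke, for example) are
left untouched by this.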

But you're right, most text operations are language dependent,
including upper and lower case conversion.
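
A language-aware variant could look roughly like the following Python
sketch. The LANGUAGE_RULES table and the fold function are invented for
illustration, and the German entries are only the common transliteration
convention, not a complete or authoritative list. Languages without an
entry (Dutch here) fall back to plain removal of the marks.

```python
import unicodedata

# Hypothetical per-language replacement tables, applied before the
# generic removal of diacritical marks. Only German is filled in.
LANGUAGE_RULES = {
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
        "\u00df": "ss",
    },
}


def fold(text, language):
    """Remove diacritics, honouring language-specific replacements.

    fold is a hypothetical helper name for this sketch.
    """
    rules = LANGUAGE_RULES.get(language, {})
    # Compose first so that "o" + U+0308 matches the precomposed keys.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    # Then decompose fully and drop whatever combining characters remain.
    # Note: this drops every character with a non-zero combining class,
    # which is slightly broader than the U+0300..U+036F block.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold("sch\u00f6n", "de"))  # German rule: o-umlaut becomes oe
print(fold("sch\u00f6n", "nl"))  # no Dutch entry: the umlaut is dropped
```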

 [1] http://www.unicode.org/unicode/reports/tr15/tr15-23.html

Cheers
Philippe


