Remaining to-do items for 3.7

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Fri Feb 20 00:02:50 UTC 2004


  Hello,

  Great.  Before I could ask the question, Avi asked it, and
now Ned has answered it.

  I haven't tried it myself, but does this "UTF-8 support" handle
characters outside of the BMP?
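
  To make the question concrete, here is a minimal Python sketch
(Python is only used for illustration; it is neither PCRE nor Squeak
code). The helper name utf8_length and the two sample characters are
assumptions picked for the example. It shows that a character outside
the BMP needs four bytes in UTF-8, while a BMP character such as a
hiragana letter needs three.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4  # outside the BMP (U+10000 and above)


bmp_char = "\u3042"          # HIRAGANA LETTER A, inside the BMP
non_bmp_char = "\U00020000"  # a CJK Extension B ideograph, outside the BMP

for ch in (bmp_char, non_bmp_char):
    encoded = ch.encode("utf-8")
    # The built-in encoder and the hand-written length rule should agree.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex(' ')}")
```

  Whether the PCRE release in question accepts such four-byte
sequences is exactly what would need checking.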

  It shouldn't be too hard to support it from m17n Squeak.  It would
give me one less reason to use Ruby now and then^^; However, we're
going to lose the ability to do Japanese/Chinese-sensitive searches, hmm.

-- Yoshiki

At Mon, 16 Feb 2004 20:00:52 -0800,
Ned Konz wrote:
> 
> On Monday 16 February 2004 7:20 pm, Avi Bryant wrote:
> 
> > The VB-Regex package is a 120k fileout, the RePlugin is about a 450k
> > fileout, FWIW.  Not sure why the RePlugin is so big, it may just have a
> > larger test suite.
> 
> It has tests, and quite a bit of documentation (as well as a deprecated 
> class).
> 
> > But here's a more pertinent question - which will handle i18n better?
> > Probably neither will out of the box, but which will be more work to
> > adapt?  I don't know the answer, though in general I would expect such
> > things to be easier when we have Smalltalk code all the way down...
> 
> It looks like the PCRE engine under the RePlugin can handle UTF-8 strings and 
> patterns. So if we wanted we could feed it UTF-8 (assuming that conversion is 
> easy).
> 
> Here's the notes on PCRE support for larger character sets:
> ----
> Starting at release 3.3, PCRE has had some support for character strings 
> encoded in the UTF-8 format. For release 4.0 this has been greatly extended 
> to cover most common requirements.
> 
> In order to process UTF-8 strings, you must build PCRE to include UTF-8 support 
> in the code, and, in addition, you must call pcre_compile() with the 
> PCRE_UTF8 option flag. When you do this, both the pattern and any subject 
> strings that are matched against it are treated as UTF-8 strings instead of 
> just strings of bytes.
> 
> If you compile PCRE with UTF-8 support, but do not use it at run time, the 
> library will be a bit bigger, but the additional run time overhead is limited 
> to testing the PCRE_UTF8 flag in several places, so should not be very large.
> 
> The following comments apply when PCRE is running in UTF-8 mode:
> 
> 1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It 
> does not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to 
> PCRE, the results are undefined.
> 
> 2. In a pattern, the escape sequence \x{...}, where the contents of the braces 
> is a string of hexadecimal digits, is interpreted as a UTF-8 character whose 
> code number is the given hexadecimal number, for example: \x{1234}. If a 
> non-hexadecimal digit appears between the braces, the item is not recognized. 
> This escape sequence can be used either as a literal, or within a character 
> class.
> 
> 3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 
> character if the value is greater than 127.
> 
> 4. Repeat quantifiers apply to complete UTF-8 characters, not to individual 
> bytes, for example: \x{100}{3}.
> 
> 5. The dot metacharacter matches one UTF-8 character instead of a single byte.
> 
> 6. The escape sequence \C can be used to match a single byte in UTF-8 mode, 
> but its use can lead to some strange effects.
> 
> 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test 
> characters of any code value, but the characters that PCRE recognizes as 
> digits, spaces, or word characters remain the same set as before, all with 
> values less than 256.
> 
> 8. Case-insensitive matching applies only to characters whose values are less 
> than 256. PCRE does not support the notion of "case" for higher-valued 
> characters.
> 
> 9. PCRE does not support the use of Unicode tables and properties or the Perl 
> escapes \p, \P, and \X.
> ---
> 
> -- 
> Ned Konz
> http://bike-nomad.com/squeak/
> 
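
Points 1, 4 and 5 of the quoted PCRE notes (input must be valid UTF-8;
quantifiers and the dot apply to whole characters rather than bytes)
can be illustrated with a small sketch. This uses Python's own re
module as a stand-in, not PCRE or the RePlugin, so it only shows the
byte-versus-character distinction and says nothing about PCRE's actual
behaviour. The helper is_valid_utf8 and the sample strings are
assumptions made up for the example.

```python
import re


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical pre-check)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 'a', HIRAGANA LETTER A, a non-BMP ideograph, 'b'
subject = "a\u3042\U00020000b"
encoded = subject.encode("utf-8")

# Character mode: the dot consumes one whole character per match.
char_matches = re.findall(".", subject, re.DOTALL)

# Byte mode: the same pattern over raw bytes consumes one byte per match.
byte_matches = re.findall(b".", encoded, re.DOTALL)

# Quantifiers: in character mode {3} repeats the whole character U+0100;
# in byte mode it repeats only the last byte of its two-byte encoding.
char_quantified = re.fullmatch("\u0100{3}", "\u0100" * 3)
byte_quantified = re.fullmatch("\u0100{3}".encode("utf-8"),
                               ("\u0100" * 3).encode("utf-8"))

print(len(char_matches), len(byte_matches))
print(char_quantified is not None, byte_quantified is not None)
print(is_valid_utf8(encoded), is_valid_utf8(encoded[:-2]))
```

Since the notes say invalid input gives undefined results, a check
along the lines of is_valid_utf8 would presumably belong on the Squeak
side of any conversion before strings are handed to the plugin.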


