Remaining to-do items for 3.7
Yoshiki Ohshima
Yoshiki.Ohshima at acm.org
Fri Feb 20 00:02:50 UTC 2004
Hello,
Great. Before I could ask the question, Avi asked it, and
now Ned has answered it.
I haven't tried it myself, but does this "UTF-8 support" handle
characters outside of the BMP?
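
To make the question concrete, here is a minimal sketch in Python (not
Squeak, and not PCRE) of what "outside of the BMP" means for UTF-8: any
code point above U+FFFF needs four bytes. The helper names and the sample
characters are only illustrative assumptions.

```python
# Illustration only: how code points inside and outside the BMP
# (Basic Multilingual Plane, U+0000..U+FFFF) are encoded in UTF-8.

def utf8_length(ch: str) -> int:
    """Number of bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def is_outside_bmp(ch: str) -> bool:
    """True if the code point of ch is above U+FFFF."""
    return ord(ch) > 0xFFFF

# ASCII letter, Latin-1 letter, Hiragana, CJK Extension B ideograph.
samples = ["A", "\u00e9", "\u3042", "\U00020000"]

for ch in samples:
    print(f"U+{ord(ch):04X}  outside BMP: {is_outside_bmp(ch)}  "
          f"UTF-8 bytes: {utf8_length(ch)}")
```

A regex engine that treats UTF-8 sequences of up to three bytes as one
character but not four-byte sequences would handle the first three samples
and fail on the last one, which is the case being asked about.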
It shouldn't be too hard to support it from m17n Squeak. It would
give me one less reason to use Ruby now and then^^; However, we're
going to lose the ability to do Japanese/Chinese-sensitive search, hmm.
-- Yoshiki
At Mon, 16 Feb 2004 20:00:52 -0800,
Ned Konz wrote:
>
> On Monday 16 February 2004 7:20 pm, Avi Bryant wrote:
>
> > The VB-Regex package is a 120k fileout, the RePlugin is about a 450k
> > fileout, FWIW. Not sure why the RePlugin is so big, it may just have a
> > larger test suite.
>
> It has tests, and quite a bit of documentation (as well as a deprecated
> class).
>
> > But here's a more pertinent question - which will handle i18n better?
> > Probably neither will out of the box, but which will be more work to
> > adapt? I don't know the answer, though in general I would expect such
> > things to be easier when we have Smalltalk code all the way down...
>
> It looks like the PCRE engine under the RePlugin can handle UTF-8 strings and
> patterns. So if we wanted we could feed it UTF-8 (assuming that conversion is
> easy).
>
> Here are the notes on PCRE support for larger character sets:
> ----
> Starting at release 3.3, PCRE has had some support for character strings
> encoded in the UTF-8 format. For release 4.0 this has been greatly extended
> to cover most common requirements.
>
> In order to process UTF-8 strings, you must build PCRE to include UTF-8 support
> in the code, and, in addition, you must call pcre_compile() with the
> PCRE_UTF8 option flag. When you do this, both the pattern and any subject
> strings that are matched against it are treated as UTF-8 strings instead of
> just strings of bytes.
>
> If you compile PCRE with UTF-8 support, but do not use it at run time, the
> library will be a bit bigger, but the additional run time overhead is limited
> to testing the PCRE_UTF8 flag in several places, so should not be very large.
>
> The following comments apply when PCRE is running in UTF-8 mode:
>
> 1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It
> does not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to
> PCRE, the results are undefined.
>
> 2. In a pattern, the escape sequence \x{...}, where the contents of the braces
> is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
> code number is the given hexadecimal number, for example: \x{1234}. If a
> non-hexadecimal digit appears between the braces, the item is not recognized.
> This escape sequence can be used either as a literal, or within a character
> class.
>
> 3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8
> character if the value is greater than 127.
>
> 4. Repeat quantifiers apply to complete UTF-8 characters, not to individual
> bytes, for example: \x{100}{3}.
>
> 5. The dot metacharacter matches one UTF-8 character instead of a single byte.
>
> 6. The escape sequence \C can be used to match a single byte in UTF-8 mode,
> but its use can lead to some strange effects.
>
> 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
> characters of any code value, but the characters that PCRE recognizes as
> digits, spaces, or word characters remain the same set as before, all with
> values less than 256.
>
> 8. Case-insensitive matching applies only to characters whose values are less
> than 256. PCRE does not support the notion of "case" for higher-valued
> characters.
>
> 9. PCRE does not support the use of Unicode tables and properties or the Perl
> escapes \p, \P, and \X.
> ---
>
> --
> Ned Konz
> http://bike-nomad.com/squeak/
>
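
The difference between byte mode and UTF-8 mode in points 4 and 5 of the
quoted notes can be sketched with Python's re module. This is an analogy
only, not PCRE: in Python a str pattern matches per character and a bytes
pattern matches per byte, which roughly mirrors compiling with and without
PCRE_UTF8. The function names are made up for the illustration.

```python
import re

# Three copies of U+0100, the same character used in the \x{100}{3} example.
text = "\u0100\u0100\u0100"
raw = text.encode("utf-8")   # six bytes: each U+0100 becomes C4 80

def dot_matches_in_chars(s: str) -> int:
    """Character mode: the dot consumes one whole character per match."""
    return len(re.findall(r".", s))

def dot_matches_in_bytes(b: bytes) -> int:
    """Byte mode: the dot consumes one byte per match."""
    return len(re.findall(rb".", b))

def quantifier_matches_chars(s: str) -> bool:
    """Character mode: {3} repeats the whole character U+0100."""
    return re.fullmatch("\u0100{3}", s) is not None

def quantifier_matches_bytes(b: bytes) -> bool:
    """Byte mode: {3} repeats only the trailing byte 0x80, so this fails."""
    return re.fullmatch(rb"\xc4\x80{3}", b) is not None

print(dot_matches_in_chars(text), dot_matches_in_bytes(raw))
print(quantifier_matches_chars(text), quantifier_matches_bytes(raw))
```

So if Squeak strings were converted to UTF-8 and handed to the engine
without the UTF-8 option, dots and quantifiers would silently operate on
bytes rather than on characters.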