Remaining to-do items for 3.7
Ned Konz
ned at squeakland.org
Tue Feb 17 04:00:52 UTC 2004
On Monday 16 February 2004 7:20 pm, Avi Bryant wrote:
> The VB-Regex package is a 120k fileout, the RePlugin is about a 450k
> fileout, FWIW. Not sure why the RePlugin is so big, it may just have a
> larger test suite.
It has tests, and quite a bit of documentation (as well as a deprecated
class).
> But here's a more pertinent question - which will handle i18n better?
> Probably neither will out of the box, but which will be more work to
> adapt? I don't know the answer, though in general I would expect such
> things to be easier when we have Smalltalk code all the way down...
It looks like the PCRE engine under the RePlugin can handle UTF-8 strings and
patterns. So if we wanted we could feed it UTF-8 (assuming that conversion is
easy).
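To give a feel for how small that conversion step is, here is a minimal sketch in Python (not Squeak, and not anything from the RePlugin). The helper names `encode_utf8` and `encode_string` are made up for illustration, and the input is assumed to be a sequence of integer Unicode code points.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value: %r" % code_point)
    if code_point < 0x80:
        # 7 bits: a single byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def encode_string(code_points) -> bytes:
    """Encode a sequence of code points as one UTF-8 byte string."""
    return b"".join(encode_utf8(cp) for cp in code_points)
```

The result of `encode_string` is the kind of byte string that would be handed to the regex engine as pattern or subject.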
Here are the notes on PCRE support for larger character sets:
----
Starting at release 3.3, PCRE has had some support for character strings
encoded in the UTF-8 format. For release 4.0 this has been greatly extended
to cover most common requirements.
In order to process UTF-8 strings, you must build PCRE to include UTF-8 support
in the code, and, in addition, you must call pcre_compile() with the
PCRE_UTF8 option flag. When you do this, both the pattern and any subject
strings that are matched against it are treated as UTF-8 strings instead of
just strings of bytes.
If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
to testing the PCRE_UTF8 flag in several places, so should not be very large.
The following comments apply when PCRE is running in UTF-8 mode:
1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It
does not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to
PCRE, the results are undefined.
2. In a pattern, the escape sequence \x{...}, where the contents of the braces
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
code number is the given hexadecimal number, for example: \x{1234}. If a
non-hexadecimal digit appears between the braces, the item is not recognized.
This escape sequence can be used either as a literal, or within a character
class.
3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8
character if the value is greater than 127.
4. Repeat quantifiers apply to complete UTF-8 characters, not to individual
bytes, for example: \x{100}{3}.
5. The dot metacharacter matches one UTF-8 character instead of a single byte.
6. The escape sequence \C can be used to match a single byte in UTF-8 mode,
but its use can lead to some strange effects.
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
characters of any code value, but the characters that PCRE recognizes as
digits, spaces, or word characters remain the same set as before, all with
values less than 256.
8. Case-insensitive matching applies only to characters whose values are less
than 256. PCRE does not support the notion of "case" for higher-valued
characters.
9. PCRE does not support the use of Unicode tables and properties or the Perl
escapes \p, \P, and \X.
---
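Points 4 and 5 in the notes are the ones that matter most in practice. The sketch below shows the same byte-versus-character distinction using Python's own `re` module as a stand-in. It is an analogy only: Python's `re` is not PCRE, and the helper names are made up for illustration. A `bytes` pattern behaves like a byte-oriented engine, a `str` pattern like a character-oriented one.

```python
import re

A_MACRON = "\u0100"                  # the character the notes write as \x{100}
ENCODED = A_MACRON.encode("utf-8")   # b"\xc4\x80": two bytes in UTF-8


def dot_matches_whole(subject) -> bool:
    """True if a single '.' consumes the entire subject."""
    pattern = "." if isinstance(subject, str) else b"."
    return re.fullmatch(pattern, subject) is not None


def repeat3_matches(subject) -> bool:
    """True if '<U+0100>{3}' consumes the entire subject."""
    if isinstance(subject, str):
        pattern = A_MACRON + "{3}"   # quantifier applies to the character
    else:
        pattern = ENCODED + b"{3}"   # quantifier applies to the last byte only
    return re.fullmatch(pattern, subject) is not None
```

In character mode `.` consumes U+0100 whole and `{3}` repeats the full character. In byte mode `.` sees only the first of its two bytes, and `{3}` repeats just the trailing `\x80`, so the byte pattern matches `b"\xc4\x80\x80\x80"` rather than three encoded characters.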
--
Ned Konz
http://bike-nomad.com/squeak/
More information about the Squeak-dev
mailing list