Remaining to-do items for 3.7

Ned Konz ned at squeakland.org
Tue Feb 17 04:00:52 UTC 2004


On Monday 16 February 2004 7:20 pm, Avi Bryant wrote:

> The VB-Regex package is a 120k fileout, the RePlugin is about a 450k
> fileout, FWIW.  Not sure why the RePlugin is so big, it may just have a
> larger test suite.

It has tests, and quite a bit of documentation (as well as a deprecated 
class).

> But here's a more pertinent question - which will handle i18n better?
> Probably neither will out of the box, but which will be more work to
> adapt?  I don't know the answer, though in general I would expect such
> things to be easier when we have Smalltalk code all the way down...

It looks like the PCRE engine under the RePlugin can handle UTF-8 strings and 
patterns. So if we wanted we could feed it UTF-8 (assuming that conversion is 
easy).
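
To give a feel for what that conversion involves, here is a minimal sketch in 
Python (not Squeak code, and not taken from PCRE or the RePlugin) that encodes 
a single code point as UTF-8 bytes. The name utf8_encode is made up for 
illustration; it just follows the standard UTF-8 bit layout.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point (an int) as UTF-8 bytes.

    Hypothetical helper for illustration only; it is not part of PCRE
    or the RePlugin.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:
        # 7 bits: a single byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Anything below 128 stays a single byte, so plain ASCII strings and patterns 
would pass through unchanged; everything else becomes two or more bytes.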

Here are the notes on PCRE support for larger character sets:
----
Starting at release 3.3, PCRE has had some support for character strings 
encoded in the UTF-8 format. For release 4.0 this has been greatly extended 
to cover most common requirements.

In order to process UTF-8 strings, you must build PCRE to include UTF-8 support 
in the code, and, in addition, you must call pcre_compile() with the 
PCRE_UTF8 option flag. When you do this, both the pattern and any subject 
strings that are matched against it are treated as UTF-8 strings instead of 
just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at run time, the 
library will be a bit bigger, but the additional run time overhead is limited 
to testing the PCRE_UTF8 flag in several places, so should not be very large.

The following comments apply when PCRE is running in UTF-8 mode:

1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It 
does not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to 
PCRE, the results are undefined.

2. In a pattern, the escape sequence \x{...}, where the contents of the braces 
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose 
code number is the given hexadecimal number, for example: \x{1234}. If a 
non-hexadecimal digit appears between the braces, the item is not recognized. 
This escape sequence can be used either as a literal, or within a character 
class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 
character if the value is greater than 127.

4. Repeat quantifiers apply to complete UTF-8 characters, not to individual 
bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead of a single byte.

6. The escape sequence \C can be used to match a single byte in UTF-8 mode, 
but its use can lead to some strange effects.

7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test 
characters of any code value, but the characters that PCRE recognizes as 
digits, spaces, or word characters remain the same set as before, all with 
values less than 256.

8. Case-insensitive matching applies only to characters whose values are less 
than 256. PCRE does not support the notion of "case" for higher-valued 
characters.

9. PCRE does not support the use of Unicode tables and properties or the Perl 
escapes \p, \P, and \X.
---
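
Points 1, 4 and 5 are the ones that would matter most to a caller. As a rough 
analogy only (Python's re module is a different engine from PCRE, and none of 
these names come from the RePlugin), the sketch below checks a byte string for 
valid UTF-8 before use, since PCRE will not diagnose it, and contrasts 
byte-wise with character-wise matching of the same text.

```python
import re


def is_valid_utf8(data):
    """Return True if the bytes object decodes cleanly as UTF-8.

    Hypothetical pre-check a caller might run, because the notes say
    the engine itself does not diagnose invalid input.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Three copies of U+0100, which takes two bytes each in UTF-8.
subject = "\u0100\u0100\u0100".encode("utf-8")

# Byte-oriented matching: "." consumes one byte at a time.
byte_matches = re.findall(rb".", subject)

# Character-oriented matching: "." consumes one whole character.
char_matches = re.findall(r".", subject.decode("utf-8"))

# A repeat count applies to whole characters, not bytes
# (compare \x{100}{3} in the PCRE notes).
repeat_match = re.fullmatch(r"\u0100{3}", subject.decode("utf-8"))
```

Here the byte-wise search finds six one-byte matches and the character-wise 
search finds three, which is the difference the PCRE_UTF8 flag is described as 
making for the dot and for quantifiers.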

-- 
Ned Konz
http://bike-nomad.com/squeak/


