[squeak-dev] Regular Expressions

Hans-Martin Mosner hmm at heeg.de
Fri Nov 18 14:05:53 UTC 2016


On 18.11.2016 at 14:39, Edgar De Cleene wrote:
> Folks:
> I wish to remove tags from HTML.
> According to https://regex101.com/ and http://www.freeformatter.com/regex-tester.html and also of my old Nissus Pro.
>
> <.+?>
>
> Should be a valid expression.
>
> But 
>
> | regex |
> regex := RxMatcher forString: '<.+?>'.
>
> Gives me an error.
>
> Any help ?
>
> Edgar
> @morplenauta
>
I was going to write this:

    The "+" already means "match one or more of the previous", where
    "previous" in this case is ".", which means "any character".

    The "?" means "match zero or one of the previous", but it cannot be
    combined with "+".

But then I realized that "+?" is defined in regex syntax as "lazy"
matching, i.e. it matches as few of the previous tokens as needed to
make the pattern match (in contrast, standard "+" matches greedily, so
it consumes as much as possible while still matching the pattern).
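
To see the greedy behaviour in a workspace, here is a minimal sketch. It
assumes the Regex package that provides RxMatcher is loaded; the sample
string is made up for illustration.

```smalltalk
"Greedy '+' runs from the first '<' to the LAST '>' on the line."
| regex result |
regex := RxMatcher forString: '<.+>'.
result := regex copy: '<b>bold</b> text' replacingMatchesWith: ''.
Transcript show: result; cr.
"The single match is '<b>bold</b>', so the word 'bold' is removed
 along with the tags and only ' text' should remain."
```

A lazy quantifier would stop at the first ">" instead and leave "bold"
in place.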

However, the Rx framework in Squeak is quite old and does not have these
extensions. A pattern that should work would be "<[^>]+>" which matches
an opening angle bracket, any characters that are not closing angle
brackets, and finally the closing bracket.
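
A minimal workspace sketch of how this pattern could be used, assuming
the Regex package that provides RxMatcher is loaded. The sample HTML is
made up for illustration.

```smalltalk
"Strip simple tags by replacing every match with the empty string."
| regex stripped |
regex := RxMatcher forString: '<[^>]+>'.
stripped := regex copy: '<p>Hello <b>world</b></p>' replacingMatchesWith: ''.
Transcript show: stripped; cr.
"Each of <p>, <b>, </b> and </p> is matched separately,
 so this should show: Hello world"
```

Because the character class excludes ">", each match ends at the first
closing bracket, which gives the effect the lazy "+?" was meant to have.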

Be aware though that correctly stripping tags from HTML is not possible
(or at least not trivial) with regex. For example, in your pattern, the
"." would not match newlines, but tags can extend over multiple lines,
so you would not be able to strip out a multiline tag. My pattern
apparently works with newlines, too, but there are other cases that it
does not handle (for example, see
http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value).
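
A small sketch of that failure case, under the same assumption that the
Regex package with RxMatcher is loaded. The input string is a made-up
example of a ">" inside an attribute value.

```smalltalk
"A '>' inside a quoted attribute value ends the match too early."
| regex stripped |
regex := RxMatcher forString: '<[^>]+>'.
stripped := regex copy: '<a title="x > y">link</a>' replacingMatchesWith: ''.
Transcript show: stripped; cr.
"The first match is only '<a title=""x >', so the remainder of the
 attribute leaks into the output, followed by the word 'link'."
```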

So unless you know that your input is going to be fairly regular, don't
rely on regex to strip tags. Use a proper HTML/SGML/XML parser; they are
designed to do it right.

Cheers,

Hans-Martin
