regexp strangeness

Dr. K. nick
Hey all,


The documentation for regexp says:

'\w'
          Match any word character

What exactly is a word character (and, maybe even more important, what isn't)?
Am I right in assuming it's
[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], i.e. letters? What
about non-English characters like öäßłńŚ?


And here is some other behavior that seems strange to me:

>> regexp("#w#","#\w#")
ans =  1            <- seems to work in general as expected...
>> regexp("#d#","#\w#")
ans = [](1x0)       <- why does this happen? I've provided a word character (letter)
>> regexp("#d#","#\\w#")
ans =  1            <- Ahhh, so we need to double escape these special characters... no mention of that in the help...
>> regexp("#j#","#\\w#")
ans =  1            <- ok, seems to work fine...
>> regexp("#E#","#\\w#")
ans =  1            <- ok
>> regexp("#E#","#\\w*#")
ans =  1            <- ok
>> regexp("##","#\\w*#")
ans =  1            <- ok
>> regexp("#.#","#\\w*#")
ans = [](1x0)       <- why? Asterisk (*) is supposed to match zero or more times. Here there is zero times a letter, so it should match...

Especially the last one >> regexp("#.#","#\\w*#") ans = [](1x0) looks
like a bug to me. Or am I getting something wrong here?

Thanks


Kay



Re: regexp strangeness

Andreas Weber-6
On 08.02.20 at 12:47, Kay Nick wrote:
> the documentation to regexp says:
>
> '\w'
>           Match any word character
>
> what exactly is a word character (maybe even more important what isn't)?

It's always worth having a look at the underlying library, PCRE in this
case: https://www.pcre.org/original/doc/html/pcrepattern.html

...A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is controlled
by PCRE's low-valued character tables, and may vary if locale-specific
matching is taking place (see "Locale support" in the pcreapi page). For
example, in a French locale such as "fr_FR" in Unix-like systems, or
"french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales
with Unicode is discouraged. ....
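
The definition quoted above can be checked empirically. The following is a minimal sketch, assuming a default Octave build that uses PCRE's built-in character tables; the variable names are only illustrative. It lists the printable ASCII characters that \w matches:

```octave
% List the printable ASCII characters that \w matches.
% Single quotes keep the backslash in the pattern intact.
ascii = char(32:126);                      % space through tilde
word_chars = ascii(regexp(ascii, '\w'));   % regexp returns the start index of every match
disp(word_chars)
```

Under that assumption the result should contain the digits, the letters A-Z and a-z, and the underscore, and nothing else from the ASCII range.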

>>> regexp("#d#","#\w#")
> ans = [](1x0)                     <- why does this happen? I've provided
> a word character (letter)
>>> regexp("#d#","#\\w#")
> ans =  1                             <- Ahhh, so we need to double
> escape these special characters... no mention of that in the help...

The handling of escape sequences applies to all double-quoted strings, not
just those passed to regexp; see
https://octave.org/doc/v4.0.1/Escape-Sequences-in-String-Constants.html

I don't think it makes sense to document this separately in the help text
for regexp.
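
To illustrate the point about string constants, here is a small sketch (the variable names are hypothetical). It compares a single-quoted pattern, where the backslash is passed through unchanged, with a double-quoted one, where the backslash has to be doubled to survive escape processing:

```octave
% Single-quoted strings do not process backslash escapes;
% double-quoted strings do, so the backslash must be doubled there.
p_single = '#\w#';     % four characters: # \ w #
p_double = "#\\w#";    % the same four characters after escape processing
same = strcmp(p_single, p_double);

% Both forms hand the identical pattern to regexp.
idx_single = regexp('#d#', p_single);
idx_double = regexp('#d#', p_double);
printf('same pattern: %d, match starts at %d and %d\n', same, idx_single, idx_double);
```

Writing "#\w#" in double quotes instead leaves only "#w#" as the pattern, which is why it matched "#w#" but not "#d#" in the original session.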

>>> regexp("#.#","#\\w*#")
> ans = [](1x0)                    <- why? Asterisk (*) is supposed to
> match zero or more times. Here there is zero times a letter, so it
> should match...

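No, it would match "##" but not "#.#". The "." between the two "#" is not a
word character, so \w* can only match zero characters at that point, and the
pattern then expects a "#" where the subject has a ".".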
You can play around here: https://regex101.com/r/sYXfWy/1
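
The same behavior can be reproduced directly in Octave. This is only a sketch with illustrative variable names; the last line uses a different pattern, '#[^#]*#', as one possible way to allow arbitrary non-"#" characters between the delimiters:

```octave
pat = '#\w*#';                          % '#', zero or more word characters, '#'

idx_empty = regexp('##', pat);          % \w* matches zero characters
idx_word  = regexp('#abc#', pat);       % \w* matches 'abc'
idx_dot   = regexp('#.#', pat);         % no match: '.' is not a word character

% Alternative pattern that accepts any character except '#' in between.
idx_any   = regexp('#.#', '#[^#]*#');

printf('%d %d %d %d\n', idx_empty, idx_word, isempty(idx_dot), idx_any);
```

A quantifier such as * only makes the preceding element optional; it does not let the pattern skip over characters that nothing in the pattern matches.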

> Especially the last one >> regexp("#.#","#\\w*#") ans = [](1x0) looks
> like a bug to me. Or am I getting something wrong here?

Yes, see above.

-- Andy


Re: regexp strangeness

Dr. K. nick
> I don't think it makes sense to document this especially or additionally
> in the help text for regexp.
I agree that we should avoid redundant explanatory paragraphs in the
documentation. But I think that clear pointers to the sources of your
explanation, i.e. the links you provided (thanks for those, by the way),
would be very helpful in the help text for regexp. That way we would
improve the documentation without creating an additional burden of
keeping it up to date.

Kay



Re: regexp strangeness

Andrew Janke-2


On 2/8/20 10:57 AM, Kay Nick wrote:

> On 08.02.20 15:01, Andreas Weber wrote:
>> On 08.02.20 at 12:47, Kay Nick wrote:
>>> the documentation to regexp says:
>>>
>>> '\w'
>>>           Match any word character
>>>
>>> what exactly is a word character (maybe even more important what isn't)?
>> It's always worth to have a look at the underlying library, PCRE in this
>> case: https://www.pcre.org/original/doc/html/pcrepattern.html
>>
>> ...A "word" character is an underscore or any character that is a letter
>> or digit. By default, the definition of letters and digits is controlled
>> by PCRE's low-valued character tables, and may vary if locale-specific
>> matching is taking place (see "Locale support" in the pcreapi page). For
>> example, in a French locale such as "fr_FR" in Unix-like systems, or
>> "french" in Windows, some character codes greater than 127 are used for
>> accented letters, and these are then matched by \w. The use of locales
>> with Unicode is discouraged. ....

Matlab compatibility note: in Matlab's regexp() functions, the \w
metacharacter appears to match any alphanumeric character in any script
within Unicode, not just the ASCII-compatible '[a-zA-Z0-9_]'. Over
there, it seems like \w is equivalent to '[\p{L}\p{N}_]'.

The Matlab documentation is not very explicit about this, and its
wording is a little muddled.

Sounds like maybe Octave should be running PCRE in Unicode mode, and
compiling its patterns with the PCRE_UCP option set?
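
A quick way to see how a particular Octave build treats non-ASCII letters is to probe it directly. This is a hedged sketch: the result depends on how Octave and PCRE were built, so no particular output is claimed, and the variable names are only illustrative.

```octave
% 'ö' given as its two UTF-8 bytes, so the test does not depend on the
% encoding of the source file.
s = char([195 182]);

% true if \w matches anywhere in s on this build, false otherwise
matched = ~isempty(regexp(s, '\w', 'once'));
printf('word-character match for o-umlaut: %d\n', matched);
```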

Cheers,
Andrew