Aw: Re: regexp: matching expressions b4 and after ....

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Aw: Re: regexp: matching expressions b4 and after ....

lombardo-2
 

Hi,

thanks for the reply Ben.
As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.

In practice I'm looking for the look-around operators for OCTAVE as described in MATLAB.


 
----- Original Nachricht ----
Von:     Ben Abbott <[hidden email]>
An:      [hidden email]
Datum:   08.09.2008 18:22
Betreff: Re: regexp: matching expressions b4 and after ....

> On Monday, September 08, 2008, at 11:38AM, <[hidden email]>
> wrote:
> >Hi,
> >
> >I'm trying to convert a set of matlab codes into Octave. These codes
> heavily use string manipulation functions
> >
> >In particular I'm trying to extract a set of equations from an ASCII file.
> From these equations I extract/replace variables etc.
> >
> >So, for example, I have g="x^(-1)+y(-1)+z(-1)=0"
> >
> >I need to extract/replace x, y and (-1). So, for example, I used the
> command regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> >
> >which returned "x^(-1)+y_minus1+z_minus1"
> >
> >Can I do the same in Octave? How?
> >
>
> Doesn't work for me either
>
> octave:1> g="x^(-1)+y(-1)+z(-1)=0"
> g = x^(-1)+y(-1)+z(-1)=0
> octave:2>
> octave:2> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> error: syntax error in pattern
>
> Running Matlab2007b
>
> >> g='x^(-1)+y(-1)+z(-1)=0';
> >> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> I'd suggest you try to isolate the error. Perhaps if the developers knew
> where the specific problem was they could fix it.
>
> I'm not experienced with regexp, but the simpler one below works for your
> example.
>
> regexprep(g,'\(\-[1-9]*\)','_minus1')
> ans = x^_minus1+y_minus1+z_minus1=0
>
> Ben
>
_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
[hidden email] wrote:
>  
>
> Hi,
>
> thanks for the reply Ben.
> As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.
>
> In practice I'm looking for the look-around operators for OCTAVE as described in MATLAB.
>  

Where exactly is "look-around" explained?

D.


--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
David Bateman wrote:

> [hidden email] wrote:
>>  
>>
>> Hi,
>>
>> thanks for the reply Ben. As you see in your attempt, also the first
>> "(-1)" gets replaced, which is exactly what the regexprep example
>> command (in MATLAB) avoids.
>>
>> In practice I'm looking for the look-around operators for OCTAVE as
>> described in MATLAB.
>>  
>
> Where exactly is "look-around" explained?
>
> D.
>
>
Ok, forget it.. I figured it out.. The issue is that matlab uses a
different syntax for named tokens than PCRE, so we are obliged to look
for named tokens like "(?<name>)" and replace them with the PCRE
compatible "(?P<name>)". The test in Octave to do this was trapping
"(?<=...)" and "(?<!...") as a syntax error for a matlab named token.
The other lookaround operator "(?=...)" and "(?!...)" seem to work as
pretty much as expected.

One issue is that PCRE does not accept arbitrary length lookaround
expressions and so  "(?<=[a-z]*)" is not legal with PCRE. Though maximum
length lookarounds are acceptable, so you can write instead
"(?<=[a-z]{10})" for example.

I have a changeset to address this, but wonder if I should look for
lookaround operators with "*" or "+" and replace with "{MAX_LENGTH}" and
"{1:MAX_LENGTH}" respectively, with a warning about this limitation.
Should I do this before submitting the changeset?

D.

--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Aw: Re: regexp: matching expressions b4 and after ....

bpabbott
Administrator
In reply to this post by lombardo-2
On Tuesday, September 09, 2008, at 06:17AM, <[hidden email]> wrote:
>
>
>Hi,
>
>thanks for the reply Ben.
>As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.

hmmm ... Matlab gives me the result below ...

>> g='x^(-1)+y(-1)+z(-1)=0';
>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0

What does matlab give you?

Ben
_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
In reply to this post by David Bateman-3
David Bateman wrote:

> Ok, forget it.. I figured it out.. The issue is that matlab uses a
> different syntax for named tokens than PCRE, so we are obliged to look
> for named tokens like "(?<name>)" and replace them with the PCRE
> compatible "(?P<name>)". The test in Octave to do this was trapping
> "(?<=...)" and "(?<!...") as a syntax error for a matlab named token.
> The other lookaround operator "(?=...)" and "(?!...)" seem to work as
> pretty much as expected.
>
> One issue is that PCRE does not accept arbitrary length lookaround
> expressions and so  "(?<=[a-z]*)" is not legal with PCRE. Though
> maximum length lookarounds are acceptable, so you can write instead
> "(?<=[a-z]{10})" for example.
>
> I have a changeset to address this, but wonder if I should look for
> lookaround operators with "*" or "+" and replace with "{MAX_LENGTH}"
> and "{1:MAX_LENGTH}" respectively, with a warning about this
> limitation. Should I do this before submitting the changeset?
>

Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
but "(?<[a-z]*)" isn't. I'd hoped to replace this with
"(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
not ok either. What I'd have to do is replace it with

((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))

which used the alternate operator and MALENGTH+1 copies of the
lookbehind expression to get the effect. This seems to be a ridiculous
amount of extra crap in the pattern space to get this functionality. Is
it worth supporting arbitrary length lookbehind expressions like
"(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
it worth supporting it but limits max_length, and print a warning? If so
what value should be the limit?

Frankly I wonder how mathworks got this to work as they appear to be
using the Boost regex library which also doesn't support arbitrary
length lookbehind expressions....

D.


--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Aw: Re: regexp: matching expressions b4 and after ....

bpabbott
Administrator
On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:

>
>Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
>lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
>but "(?<[a-z]*)" isn't. I'd hoped to replace this with
>"(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
>not ok either. What I'd have to do is replace it with
>
>((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
>
>which used the alternate operator and MALENGTH+1 copies of the
>lookbehind expression to get the effect. This seems to be a ridiculous
>amount of extra crap in the pattern space to get this functionality. Is
>it worth supporting arbitrary length lookbehind expressions like
>"(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
>it worth supporting it but limits max_length, and print a warning? If so
>what value should be the limit?
>
>Frankly I wonder how mathworks got this to work as they appear to be
>using the Boost regex library which also doesn't support arbitrary
>length lookbehind expressions....
>
>D.

David,

Have you tried the example in Matlab?

Using 2007b, It does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.

>> g='x^(-1)+y(-1)+z(-1)=0';
>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0

If I understand correctly the result should be

ans =
x^(-1)+y_minus1+z_minus1=0

Correct?

Ben


_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
In reply to this post by bpabbott
Ben Abbott wrote:

> On Tuesday, September 09, 2008, at 06:17AM, <[hidden email]> wrote:
>  
>> Hi,
>>
>> thanks for the reply Ben.
>> As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.
>>    
>
> hmmm ... Matlab gives me the result below ...
>
>  
>>> g='x^(-1)+y(-1)+z(-1)=0';
>>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
>>>      
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> What does matlab give you?
>
>  
Yes the pattern should be '(?<=[a-z]+)\(\-[1-9]*\)' to get the desired
behavior.

D.

--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
In reply to this post by bpabbott
Ben Abbott wrote:

> On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
>  
>> Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
>> lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
>> but "(?<[a-z]*)" isn't. I'd hoped to replace this with
>> "(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
>> not ok either. What I'd have to do is replace it with
>>
>> ((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
>>
>> which used the alternate operator and MALENGTH+1 copies of the
>> lookbehind expression to get the effect. This seems to be a ridiculous
>> amount of extra crap in the pattern space to get this functionality. Is
>> it worth supporting arbitrary length lookbehind expressions like
>> "(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
>> it worth supporting it but limits max_length, and print a warning? If so
>> what value should be the limit?
>>
>> Frankly I wonder how mathworks got this to work as they appear to be
>> using the Boost regex library which also doesn't support arbitrary
>> length lookbehind expressions....
>>
>> D.
>>    
>
> David,
>
> Have you tried the example in Matlab?
>
> Using 2007b, It does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
>
>  
>>> g='x^(-1)+y(-1)+z(-1)=0';
>>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
>>>      
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> If I understand correctly the result should be
>
> ans =
> x^(-1)+y_minus1+z_minus1=0
>
> Correct?
>
> Ben
>
>
>
>  
The message

http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff

seems to imply that mathworks have their own regexp engine and that
lookbehind is inefficient. I therefore don't consider it that much of an
issue to duplicate the lookbehind pattern in the pattern space and so
propose the attached changeset that replaces "(?>=[a-z]*)" with
"((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on
it. It also issues a warning about the maximum length string if the
lookbehind might be an issue. So the limitation is that "+" then
represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
expression. This limitation doesn't apply to lookaheads, etc.

D.

--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary


# HG changeset patch
# User David Bateman <[hidden email]>
# Date 1220976632 -7200
# Node ID efb313d0d98265789218594c6d3dd5d0e19f9ce1
# Parent  917a977250d994aceb97d2bd299e9a6048ca750f
Treat PCRE lookbehind operators in a manner that is approximately correct

diff --git a/src/ChangeLog b/src/ChangeLog
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@ 2008-09-02  Michael Goffioul  <michael.g
+2008-09-09  David Bateman  <[hidden email]>
+
+ * DLD-FUNCTIONS/regexp.cc (octregexp_list): Distinguish between
+ matlab named tokens and perl lookbehind expressions. For
+ lookbehind expression replace "*" and "+" with a limited number of
+ fixed length expressions to simulate arbitrary length look behind.
+
 2008-09-02  Michael Goffioul  <[hidden email]>
 
  * graphics.cc (hggroup::update_axis_limits): Also reacts on
diff --git a/src/DLD-FUNCTIONS/regexp.cc b/src/DLD-FUNCTIONS/regexp.cc
--- a/src/DLD-FUNCTIONS/regexp.cc
+++ b/src/DLD-FUNCTIONS/regexp.cc
@@ -80,6 +80,9 @@ public:
 
 typedef std::list<regexp_elem>::const_iterator const_iterator;
 
+#define MAXLOOKBEHIND 10
+static bool lookbehind_warned = false;
+
 static int
 octregexp_list (const octave_value_list &args, const std::string &nm,
  bool case_insensitive, std::list<regexp_elem> &lst,
@@ -96,6 +99,9 @@ octregexp_list (const octave_value_list
   once = false;
 
   std::string buffer = args(0).string_value ();
+  size_t max_length = (buffer.length () > MAXLOOKBEHIND ?
+       MAXLOOKBEHIND: buffer.length ());
+
   if (error_state)
     {
       gripe_wrong_type_arg (nm.c_str(), args(0));
@@ -190,12 +196,6 @@ octregexp_list (const octave_value_list
 
       // named tokens "(?<name>...)" are only treated with PCRE not regex.
 #if HAVE_PCRE
-      // The syntax of named tokens in pcre is "(?P<name>...)" while we need
-      // a syntax "(?<name>...)", so fix that here. Also an expression like
-      // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)" should
-      // be perfectly legal, while pcre does not allow the same named token
-      // name on both sides of the alternative. Also fix that here by replacing
-      // name tokens by dummy names, and dealing with the dummy names later.
       
       size_t pos = 0;
       size_t new_pos;
@@ -204,44 +204,131 @@ octregexp_list (const octave_value_list
       std::ostringstream buf;
       Array<int> named_idx;
 
-      while ((new_pos = pattern.find ("(?<",pos)) != std::string::npos)
- {
-  size_t tmp_pos = pattern.find_first_of ('>',new_pos);
-
-  if (tmp_pos == std::string::npos)
+      while ((new_pos = pattern.find ("(?",pos)) != std::string::npos)
+ {
+  if (pattern.at (new_pos + 2) == '<' &&  
+      !(pattern.at (new_pos + 3) == '=' ||
+ pattern.at (new_pos + 3) == '!'))
     {
-      error ("syntax error in pattern");
-      break;
+      // The syntax of named tokens in pcre is "(?P<name>...)" while
+      // we need a syntax "(?<name>...)", so fix that here. Also an
+      // expression like
+      // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)"
+      // should be perfectly legal, while pcre does not allow the same
+      // named token name on both sides of the alternative. Also fix
+      // that here by replacing name tokens by dummy names, and dealing
+      // with the dummy names later.
+
+      size_t tmp_pos = pattern.find_first_of ('>',new_pos);
+
+      if (tmp_pos == std::string::npos)
+ {
+  error ("syntax error in pattern");
+  break;
+ }
+
+      std::string tmp_name =
+ pattern.substr(new_pos+3,tmp_pos-new_pos-3);
+      bool found = false;
+
+      for (int i = 0; i < nnames; i++)
+ if (named(i) == tmp_name)
+  {
+    named_idx.resize(inames+1);
+    named_idx(inames) = i;
+    found = true;
+    break;
+  }
+      if (! found)
+ {
+  named_idx.resize(inames+1);
+  named_idx(inames) = nnames;
+  named.append(tmp_name);
+  nnames++;
+ }
+
+      if (new_pos - pos > 0)
+ buf << pattern.substr(pos,new_pos-pos);
+      if (inames < 10)
+ buf << "(?P<n00" << inames++;
+      else if (inames < 100)
+ buf << "(?P<n0" << inames++;
+      else
+ buf << "(?P<n" << inames++;
+      pos = tmp_pos;
     }
-
-  std::string tmp_name = pattern.substr(new_pos+3,tmp_pos-new_pos-3);
-  bool found = false;
-
-  for (int i = 0; i < nnames; i++)
-    if (named(i) == tmp_name)
-      {
- named_idx.resize(inames+1);
- named_idx(inames) = i;
- found = true;
- break;
-      }
-  if (! found)
+  else if (pattern.at (new_pos + 2) == '<')
     {
-      named_idx.resize(inames+1);
-      named_idx(inames) = nnames;
-      named.append(tmp_name);
-      nnames++;
+      // Find lookbehind operators of arbitrary length (ie like
+      // "(?<=[a-z]*)") and replace with a maximum length operator
+      // as PCRE can not yet handle arbitrary length lookahead
+      // operators. Use the string length as the maximum length to
+      // avoid issues.
+
+      int brackets = 1;
+      size_t tmp_pos1 = new_pos + 2;
+      size_t tmp_pos2 = tmp_pos1;
+      while (tmp_pos1 <= pattern.length () && brackets > 0)
+ {
+  char ch = pattern.at (tmp_pos1);
+  if (ch == '(')
+    brackets++;
+  else if (ch == ')')
+    {
+      if (brackets > 1)
+ tmp_pos2 = tmp_pos1;
+
+      brackets--;
+    }
+  tmp_pos1++;
+ }
+
+      if (brackets != 0)
+ {
+  buf << pattern.substr (pos, new_pos - pos) << "(?";
+  pos = new_pos + 2;
+ }
+      else
+ {
+  size_t tmp_pos3 = pattern.find_first_of ("*+", tmp_pos2);
+  if (tmp_pos3 != std::string::npos && tmp_pos3 < tmp_pos1)
+    {
+      if (!lookbehind_warned)
+ {
+  lookbehind_warned = true;
+  warning ("%s: arbitrary length lookbehind patterns are only support up to length %d", nm.c_str(), MAXLOOKBEHIND);
+ }
+
+      buf << pattern.substr (pos, new_pos - pos) << "(";
+
+      size_t i;
+      if (pattern.at (tmp_pos3) == '*')
+ i = 0;
+      else
+ i = 1;
+
+      for (; i < max_length + 1; i++)
+ {
+  buf <<pattern.substr(new_pos, tmp_pos3 - new_pos)
+      << "{" << i << "}";
+  buf << pattern.substr(tmp_pos3 + 1,
+ tmp_pos1 - tmp_pos3 - 1);
+  if (i != max_length)
+    buf << "|";
+ }
+      buf << ")";
+    }
+  else
+    buf << pattern.substr (pos, tmp_pos1 - pos);
+  pos = tmp_pos1;
+ }
     }
-
-  if (new_pos - pos > 0)
-    buf << pattern.substr(pos,new_pos-pos);
-  if (inames < 10)
-    buf << "(?P<n00" << inames++;
-  else if (inames < 100)
-    buf << "(?P<n0" << inames++;
   else
-    buf << "(?P<n" << inames++;
-  pos = tmp_pos;
+    {
+      buf << pattern.substr (pos, new_pos - pos) << "(?";
+      pos = new_pos + 2;
+    }
+
  }
 
       buf << pattern.substr(pos);
@@ -1041,6 +1128,16 @@ The pattern is taken literally.\n\
 %!assert(regexp({'asdfg-dfd','-dfd-dfd-','qasfdfdaq'},'-'),{6,[1,5,9],zeros(1,0)})
 %!assert(regexp({'asdfg-dfd';'-dfd-dfd-';'qasfdfdaq'},{'-';'f';'q'}),{6;[3,7];[1,9]})
 %!assert(regexp('Strings',{'t','s'}),{2,7})
+
+## Test case for lookaround operators
+%!assert(regexp('Iraq','q(?!u)'),4)
+%!assert(regexp('quit','q(?!u)'), zeros(1,0))
+%!assert(regexp('quit','q(?=u)','match'), {'q'})
+%!assert(regexp("quit",'q(?=u+)','match'), {'q'})
+%!assert(regexp("qit",'q(?=u+)','match'), cell(1,0))
+%!assert(regexp("qit",'q(?=u*)','match'), {'q'})
+
+%!assert(regexp('thingamabob','(?<=a)b'), 9)
 
 */
 
@@ -1569,6 +1666,9 @@ Alternatively, use (?x) or (?-x) in the
 %!assert(regexprep({"abc","cba"},"b","?"),{"a?c","c?a"})
 %!assert(regexprep({"abc","cba"},{"b","a"},{"?","!"}),{"!?c","c?!"})
 
+# Nasty lookbehind expression
+%!assert(regexprep('x^(-1)+y(-1)+z(-1)=0','(?<=[a-z]+)\(\-[1-9]*\)','_minus1'),'x^(-1)+y_minus1+z_minus1=0')
+
 */
 
 /*

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

John W. Eaton-6
On  9-Sep-2008, David Bateman wrote:

| Ben Abbott wrote:
| > On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
| >  
| >> Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
| >> lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
| >> but "(?<[a-z]*)" isn't. I'd hoped to replace this with
| >> "(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
| >> not ok either. What I'd have to do is replace it with
| >>
| >> ((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
| >>
| >> which used the alternate operator and MALENGTH+1 copies of the
| >> lookbehind expression to get the effect. This seems to be a ridiculous
| >> amount of extra crap in the pattern space to get this functionality. Is
| >> it worth supporting arbitrary length lookbehind expressions like
| >> "(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
| >> it worth supporting it but limits max_length, and print a warning? If so
| >> what value should be the limit?
| >>
| >> Frankly I wonder how mathworks got this to work as they appear to be
| >> using the Boost regex library which also doesn't support arbitrary
| >> length lookbehind expressions....
| >>
| >> D.
| >>    
| >
| > David,
| >
| > Have you tried the example in Matlab?
| >
| > Using 2007b, It does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
| >
| >  
| >>> g='x^(-1)+y(-1)+z(-1)=0';
| >>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
| >>>      
| > ans =
| > x^_minus1+y_minus1+z_minus1=0
| >
| > If I understand correctly the result should be
| >
| > ans =
| > x^(-1)+y_minus1+z_minus1=0
| >
| > Correct?
| >
| > Ben
| >
| >
| >
| >  
|
| The message
|
| http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff
|
| seems to imply that mathworks have their own regexp engine and that
| lookbehind is inefficient. I therefore don't consider it that much of an
| issue to duplicate the lookbehind pattern in the pattern space and so
| propose the attached changeset that replaces "(?>=[a-z]*)" with
| "((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on
| it. It also issues a warning about the maximum length string if the
| lookbehind might be an issue. So the limitation is that "+" then
| represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
| expression. This limitation doesn't apply to lookaheads, etc.

Is the bug report

  http://bugs.exim.org/show_bug.cgi?id=547

the same problem?  Note the comment

  I can't see an efficient way of doing this with the current
  implementation.  Note that Perl is even more restrictive - all
  alternatives in the lookbehind have to be the same length in Perl.

I guess it might be worth asking whether there is a way to get this
feature, even if it is not efficient.

Meanwhile, I've applied your changeset.

Thanks,

jwe
_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

Jaroslav Hajek-2
On Tue, Sep 9, 2008 at 6:37 PM, John W. Eaton <[hidden email]> wrote:

> On  9-Sep-2008, David Bateman wrote:
>
> | Ben Abbott wrote:
> | > On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
> | >
> | >> Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
> | >> lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
> | >> but "(?<[a-z]*)" isn't. I'd hoped to replace this with
> | >> "(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
> | >> not ok either. What I'd have to do is replace it with
> | >>
> | >> ((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
> | >>
> | >> which used the alternate operator and MALENGTH+1 copies of the
> | >> lookbehind expression to get the effect. This seems to be a ridiculous
> | >> amount of extra crap in the pattern space to get this functionality. Is
> | >> it worth supporting arbitrary length lookbehind expressions like
> | >> "(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
> | >> it worth supporting it but limits max_length, and print a warning? If so
> | >> what value should be the limit?
> | >>
> | >> Frankly I wonder how mathworks got this to work as they appear to be
> | >> using the Boost regex library which also doesn't support arbitrary
> | >> length lookbehind expressions....
> | >>
> | >> D.
> | >>
> | >
> | > David,
> | >
> | > Have you tried the example in Matlab?
> | >
> | > Using 2007b, It does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
> | >
> | >
> | >>> g='x^(-1)+y(-1)+z(-1)=0';
> | >>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> | >>>
> | > ans =
> | > x^_minus1+y_minus1+z_minus1=0
> | >
> | > If I understand correctly the result should be
> | >
> | > ans =
> | > x^(-1)+y_minus1+z_minus1=0
> | >
> | > Correct?
> | >
> | > Ben
> | >
> | >
> | >
> | >
> |
> | The message
> |
> | http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff
> |
> | seems to imply that mathworks have their own regexp engine and that
> | lookbehind is inefficient. I therefore don't consider it that much of an
> | issue to duplicate the lookbehind pattern in the pattern space and so
> | propose the attached changeset that replaces "(?>=[a-z]*)" with
> | "((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on
> | it. It also issues a warning about the maximum length string if the
> | lookbehind might be an issue. So the limitation is that "+" then
> | represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
> | expression. This limitation doesn't apply to lookaheads, etc.
>
> Is the bug report
>
>  http://bugs.exim.org/show_bug.cgi?id=547
>
> the same problem?  Note the comment
>
>  I can't see an efficient way of doing this with the current
>  implementation.  Note that Perl is even more restrictive - all
>  alternatives in the lookbehind have to be the same length in Perl.
>
> I guess it might be worth asking whether there is a way to get this
> feature, even if it is not efficient.
>
> Meanwhile, I've applied your changeset.
>

I transplanted it to 3.0.x. The patch to regexp.cc applied cleanly
only after I replaced std::string::npos with NPOS (8211 was never
applied to 3.0.x).

thx


--
RNDr. Jaroslav Hajek
computing expert
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz
_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

David Bateman-3
In reply to this post by John W. Eaton-6
John W. Eaton wrote:

> On  9-Sep-2008, David Bateman wrote:
>
> | Ben Abbott wrote:
> | > On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
> | >  
> | >> Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length
> | >> lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok
> | >> but "(?<[a-z]*)" isn't. I'd hoped to replace this with
> | >> "(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is
> | >> not ok either. What I'd have to do is replace it with
> | >>
> | >> ((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))
> | >>
> | >> which used the alternate operator and MALENGTH+1 copies of the
> | >> lookbehind expression to get the effect. This seems to be a ridiculous
> | >> amount of extra crap in the pattern space to get this functionality. Is
> | >> it worth supporting arbitrary length lookbehind expressions like
> | >> "(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is
> | >> it worth supporting it but limits max_length, and print a warning? If so
> | >> what value should be the limit?
> | >>
> | >> Frankly I wonder how mathworks got this to work as they appear to be
> | >> using the Boost regex library which also doesn't support arbitrary
> | >> length lookbehind expressions....
> | >>
> | >> D.
> | >>    
> | >
> | > David,
> | >
> | > Have you tried the example in Matlab?
> | >
> | > Using 2007b, It does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
> | >
> | >  
> | >>> g='x^(-1)+y(-1)+z(-1)=0';
> | >>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> | >>>      
> | > ans =
> | > x^_minus1+y_minus1+z_minus1=0
> | >
> | > If I understand correctly the result should be
> | >
> | > ans =
> | > x^(-1)+y_minus1+z_minus1=0
> | >
> | > Correct?
> | >
> | > Ben
> | >
> | >
> | >
> | >  
> |
> | The message
> |
> | http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff
> |
> | seems to imply that mathworks have their own regexp engine and that
> | lookbehind is inefficient. I therefore don't consider it that much of an
> | issue to duplicate the lookbehind pattern in the pattern space and so
> | propose the attached changeset that replaces "(?>=[a-z]*)" with
> | "((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on
> | it. It also issues a warning about the maximum length string if the
> | lookbehind might be an issue. So the limitation is that "+" then
> | represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
> | expression. This limitation doesn't apply to lookaheads, etc.
>
> Is the bug report
>
>   http://bugs.exim.org/show_bug.cgi?id=547
>
> the same problem?  Note the comment
>
>   I can't see an efficient way of doing this with the current
>   implementation.  Note that Perl is even more restrictive - all
>   alternatives in the lookbehind have to be the same length in Perl.
>  
Well I added this as alternative lookbehinds rather than alternatives in
the lookbend expression itself. However yes it is the same issue.


> I guess it might be worth asking whether there is a way to get this
> feature, even if it is not efficient.
>  
The inefficient way of doing it is essentially do the pattern space
expansion I did but in PCRE itself. However it can be more efficient in
PCRE as it can know how much it has to expand the search length. There
also cases like "(?<=Nov(ember))" to consider that match both "Nov" and
"November" and so need to be expanded as "((?<=Nov)|(?<=November))" that
I haven't taken into account. Maybe this is what the bug report is
taking about about PCRE handling alternatively in the lookbehind
expressions.

Yes it would be better if PCRE handled this internal rather than leaving
us to do it by modifying the pattern.

Cheers
David

> Meanwhile, I've applied your changeset.
>
> Thanks,
>
> jwe
>
>  


--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave