Aw: Re: regexp: matching expressions b4 and after ....

Aw: Re: regexp: matching expressions b4 and after ....

Hi,

thanks for the reply Ben.
As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.

In practice I'm looking for the look-around operators for OCTAVE as described in MATLAB.

----- Original Message -----
From:    Ben Abbott <[hidden email]>
To:      [hidden email]
Date:    08.09.2008 18:22
Subject: Re: regexp: matching expressions b4 and after ....

> On Monday, September 08, 2008, at 11:38AM, <[hidden email]> wrote:
> >Hi,
> >
> >I'm trying to convert a set of matlab codes into Octave. These codes
> >heavily use string manipulation functions
> >
> >In particular I'm trying to extract a set of equations from an ASCII file.
> >From these equations I extract/replace variables etc.
> >
> >So, for example, I have g="x^(-1)+y(-1)+z(-1)=0"
> >
> >I need to extract/replace x, y and (-1). So, for example, I used the
> >command regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> >
> >which returned "x^(-1)+y_minus1+z_minus1"
> >
> >Can I do the same in Octave? How?
>
> Doesn't work for me either
>
> octave:1> g="x^(-1)+y(-1)+z(-1)=0"
> g = x^(-1)+y(-1)+z(-1)=0
> octave:2>
> octave:2> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> error: syntax error in pattern
>
> Running Matlab2007b
>
> >> g='x^(-1)+y(-1)+z(-1)=0';
> >> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> I'd suggest you try to isolate the error. Perhaps if the developers knew
> where the specific problem was they could fix it.
>
> I'm not experienced with regexp, but the simpler one below works for your
> example.
>
> regexprep(g,'\(\-[1-9]*\)','_minus1')
> ans = x^_minus1+y_minus1+z_minus1=0
>
> Ben

_______________________________________________
Help-octave mailing list
[hidden email]
https://www-old.cae.wisc.edu/mailman/listinfo/help-octave

Re: Aw: Re: regexp: matching expressions b4 and after ....

[hidden email] wrote:
> Hi,
>
> thanks for the reply Ben.
> As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.
>
> In practice I'm looking for the look-around operators for OCTAVE as described in MATLAB.

Where exactly is "look-around" explained?

D.

--
David Bateman                                [hidden email]
Motorola Labs - Paris                        +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 6 72 01 06 33 (Mob)
91193 Gif-Sur-Yvette FRANCE                  +33 1 69 35 77 01 (Fax)

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

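David Bateman wrote:
> [hidden email] wrote:
>> Hi,
>>
>> thanks for the reply Ben. As you see in your attempt, also the first
>> "(-1)" gets replaced, which is exactly what the regexprep example
>> command (in MATLAB) avoids.
>>
>> In practice I'm looking for the look-around operators for OCTAVE as
>> described in MATLAB.
>
> Where exactly is "look-around" explained?
>
> D.

Ok, forget it.. I figured it out.. The issue is that matlab uses a different syntax for named tokens than PCRE, so we are obliged to look for named tokens like "(?<name>...)" and replace them with the PCRE compatible "(?P<name>...)". The test in Octave to do this was trapping "(?<=...)" and "(?<!...)" [...] The other lookaround operators "(?=...)" and "(?!...)" seem to work pretty much as expected.

One issue is that PCRE does not accept arbitrary length lookaround expressions and so "(?<=[a-z]*)" is not legal with PCRE. Though maximum length lookarounds are acceptable, so you can write instead "(?<=[a-z]{10})" for example.

I have a changeset to address this, but wonder if I should look for lookaround operators with "*" or "+" and replace with "{MAX_LENGTH}" and "{1,MAX_LENGTH}" respectively, with a warning about this limitation. Should I do this before submitting the changeset?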

Re: Aw: Re: regexp: matching expressions b4 and after ....

In reply to this post by lombardo-2
On Tuesday, September 09, 2008, at 06:17AM, <[hidden email]> wrote:
>Hi,
>
>thanks for the reply Ben.
>As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.

hmmm ... Matlab gives me the result below ...

>> g='x^(-1)+y(-1)+z(-1)=0';
>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0

What does matlab give you?

Ben

Re: Aw: Re: regexp: matching expressions b4 and after ....

In reply to this post by David Bateman-3
David Bateman wrote:
> Ok, forget it.. I figured it out.. The issue is that matlab uses a
> different syntax for named tokens than PCRE, so we are obliged to look
> for named tokens like "(?<name>...)" and replace them with the PCRE
> compatible "(?P<name>...)". The test in Octave to do this was trapping
> "(?<=...)" and "(?<!...)" [...]
> The other lookaround operators "(?=...)" and "(?!...)" seem to work
> pretty much as expected.
>
> One issue is that PCRE does not accept arbitrary length lookaround
> expressions and so "(?<=[a-z]*)" is not legal with PCRE. Though
> maximum length lookarounds are acceptable, so you can write instead
> "(?<=[a-z]{10})" for example.
>
> I have a changeset to address this, but wonder if I should look for
> lookaround operators with "*" or "+" and replace with "{MAX_LENGTH}"
> and "{1,MAX_LENGTH}" respectively, with a warning about this
> limitation. Should I do this before submitting the changeset?

Grrrr, it's more annoying than I thought. PCRE CAN do arbitrary length lookahead, but not arbitrary length lookbehind. Thus "(?=[a-z]*)" is ok but "(?<=[a-z]*)" isn't. I'd hoped to replace this with "(?<=[a-z]{0,MAXLENGTH})", but a bounded but variable length is not ok either. What I'd have to do is replace it with

((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{MAXLENGTH}))

which uses the alternation operator and MAXLENGTH+1 copies of the lookbehind expression to get the effect. This seems to be a ridiculous amount of extra crap in the pattern space to get this functionality. Is it worth supporting arbitrary length lookbehind expressions like "(?<=[a-z]*)" if this is what is needed to get it to work with PCRE? Is it worth supporting it with a limited max_length and a printed warning? If so, what should the limit be?

Frankly I wonder how mathworks got this to work as they appear to be using the Boost regex library, which also doesn't support arbitrary length lookbehind expressions....

D.
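
The workaround described above (one fixed-length lookbehind per allowed length, joined by alternation) can be sketched directly in Octave. This is only an illustration, not the code from the changeset: the variable names and the limit of 3 are arbitrary choices, and it assumes an Octave built with PCRE.

```octave
% Build "(?:(?<=[a-z]{1})|(?<=[a-z]{2})|(?<=[a-z]{3}))" by hand.
atom = '[a-z]';      % expression repeated inside the lookbehind
max_len = 3;         % illustrative limit (the changeset uses 10)

% One fixed-length lookbehind per length from 1 to max_len.
alts = arrayfun (@(n) sprintf ('(?<=%s{%d})', atom, n), 1:max_len, ...
                 'UniformOutput', false);
lookbehind = ['(?:' strjoin(alts, '|') ')'];

g = 'x^(-1)+y(-1)+z(-1)=0';
result = regexprep (g, [lookbehind '\(-[1-9]*\)'], '_minus1');

disp (lookbehind);
disp (result);       % x^(-1)+y_minus1+z_minus1=0
```

Starting the range at 1 corresponds to "+" inside the lookbehind; starting it at 0 would correspond to "*".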

Re: Aw: Re: regexp: matching expressions b4 and after ....

On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
>
>Grrrr, it's more annoying than I thought. PCRE CAN do arbitrary length
>lookahead, but not arbitrary length lookbehind. Thus "(?=[a-z]*)" is ok
>but "(?<=[a-z]*)" isn't. I'd hoped to replace this with
>"(?<=[a-z]{0,MAXLENGTH})", but a bounded but variable length is
>not ok either. What I'd have to do is replace it with
>
>((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{MAXLENGTH}))
>
>which uses the alternation operator and MAXLENGTH+1 copies of the
>lookbehind expression to get the effect. This seems to be a ridiculous
>amount of extra crap in the pattern space to get this functionality. Is
>it worth supporting arbitrary length lookbehind expressions like
>"(?<=[a-z]*)" if this is what is needed to get it to work with PCRE? Is
>it worth supporting it with a limited max_length and a printed warning?
>If so, what should the limit be?
>
>Frankly I wonder how mathworks got this to work as they appear to be
>using the Boost regex library, which also doesn't support arbitrary
>length lookbehind expressions....
>
>D.

David,

Have you tried the example in Matlab?

Using 2007b, it does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.

>> g='x^(-1)+y(-1)+z(-1)=0';
>> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0

If I understand correctly the result should be

ans =
x^(-1)+y_minus1+z_minus1=0

Correct?

Ben

Re: Aw: Re: regexp: matching expressions b4 and after ....

In reply to this post by bpabbott
Ben Abbott wrote:
> On Tuesday, September 09, 2008, at 06:17AM, <[hidden email]> wrote:
>> Hi,
>>
>> thanks for the reply Ben.
>> As you see in your attempt, also the first "(-1)" gets replaced, which is exactly what the regexprep example command (in MATLAB) avoids.
>
> hmmm ... Matlab gives me the result below ...
>
> >> g='x^(-1)+y(-1)+z(-1)=0';
> >> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> What does matlab give you?

Yes, the pattern should be '(?<=[a-z]+)\(\-[1-9]*\)' to get the desired behavior.

D.
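
The difference between "*" and "+" here is that a lookbehind allowing zero characters succeeds at every position, so "(?<=[a-z]*)" filters nothing. Requiring at least one letter behind the match is the same as requiring that the single preceding character is a letter, which can be written as a fixed-length lookbehind that PCRE accepts as is. A small sketch, assuming an Octave built with PCRE (the variable names are illustrative only):

```octave
g = 'x^(-1)+y(-1)+z(-1)=0';

% No lookbehind (equivalent to a zero-or-more lookbehind): every "(-1)"
% is replaced, including the exponent after "^".
all_replaced = regexprep (g, '\(-[1-9]*\)', '_minus1');

% One-letter lookbehind: only a "(-1)" directly preceded by a letter
% is replaced, so "x^(-1)" is left alone.
after_letter = regexprep (g, '(?<=[a-z])\(-[1-9]*\)', '_minus1');

disp (all_replaced);   % x^_minus1+y_minus1+z_minus1=0
disp (after_letter);   % x^(-1)+y_minus1+z_minus1=0
```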

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

In reply to this post by bpabbott
Ben Abbott wrote:
> On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
>> Grrrr, it's more annoying than I thought. PCRE CAN do arbitrary length
>> lookahead, but not arbitrary length lookbehind. Thus "(?=[a-z]*)" is ok
>> but "(?<=[a-z]*)" isn't. I'd hoped to replace this with
>> "(?<=[a-z]{0,MAXLENGTH})", but a bounded but variable length is
>> not ok either. What I'd have to do is replace it with
>>
>> ((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{MAXLENGTH}))
>>
>> which uses the alternation operator and MAXLENGTH+1 copies of the
>> lookbehind expression to get the effect. This seems to be a ridiculous
>> amount of extra crap in the pattern space to get this functionality. Is
>> it worth supporting arbitrary length lookbehind expressions like
>> "(?<=[a-z]*)" if this is what is needed to get it to work with PCRE? Is
>> it worth supporting it with a limited max_length and a printed warning?
>> If so, what should the limit be?
>>
>> Frankly I wonder how mathworks got this to work as they appear to be
>> using the Boost regex library, which also doesn't support arbitrary
>> length lookbehind expressions....
>>
>> D.
>
> David,
>
> Have you tried the example in Matlab?
>
> Using 2007b, it does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
>
> >> g='x^(-1)+y(-1)+z(-1)=0';
> >> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
> ans =
> x^_minus1+y_minus1+z_minus1=0
>
> If I understand correctly the result should be
>
> ans =
> x^(-1)+y_minus1+z_minus1=0
>
> Correct?
>
> Ben

The message

http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff

seems to imply that mathworks have their own regexp engine and that lookbehind is inefficient. I therefore don't consider it that much of an issue to duplicate the lookbehind pattern in the pattern space, and so propose the attached changeset, which replaces "(?<=[a-z]*)" with "((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{10}))" before calling PCRE on it. It also issues a warning about the maximum string length if the lookbehind might be affected. So the limitation is that "+" then represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind expression. This limitation doesn't apply to lookaheads, etc.

D.

# HG changeset patch
# User David Bateman <[hidden email]>
# Date 1220976632 -7200
# Node ID efb313d0d98265789218594c6d3dd5d0e19f9ce1
# Parent  917a977250d994aceb97d2bd299e9a6048ca750f
Treat PCRE lookbehind operators in a manner that is approximately correct

diff --git a/src/ChangeLog b/src/ChangeLog
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@
 2008-09-02  Michael Goffioul
+
+    * DLD-FUNCTIONS/regexp.cc (octregexp_list): Distinguish between
+    matlab named tokens and perl lookbehind expressions. For
+    lookbehind expression replace "*" and "+" with a limited number of
+    fixed length expressions to simulate arbitrary length look behind.
+
 2008-09-02  Michael Goffioul  <[hidden email]>
 
     * graphics.cc (hggroup::update_axis_limits): Also reacts on
diff --git a/src/DLD-FUNCTIONS/regexp.cc b/src/DLD-FUNCTIONS/regexp.cc
--- a/src/DLD-FUNCTIONS/regexp.cc
+++ b/src/DLD-FUNCTIONS/regexp.cc
@@ -80,6 +80,9 @@ public:
 
 typedef std::list::const_iterator const_iterator;
 
+#define MAXLOOKBEHIND 10
+static bool lookbehind_warned = false;
+
 static int
 octregexp_list (const octave_value_list &args, const std::string &nm,
                 bool case_insensitive, std::list &lst,
@@ -96,6 +99,9 @@ octregexp_list (const octave_value_list
   once = false;
 
   std::string buffer = args(0).string_value ();
+  size_t max_length = (buffer.length () > MAXLOOKBEHIND ?
+                       MAXLOOKBEHIND: buffer.length ());
+
   if (error_state)
     {
       gripe_wrong_type_arg (nm.c_str(), args(0));
@@ -190,12 +196,6 @@ octregexp_list (const octave_value_list
 
       // named tokens "(?...)" are only treated with PCRE not regex.
 #if HAVE_PCRE
-      // The syntax of named tokens in pcre is "(?P...)" while we need
-      // a syntax "(?...)", so fix that here. Also an expression like
-      // "(?\w+)\s+(?\w+)|(?\w+),\s+(?\w+)" should
-      // be perfectly legal, while pcre does not allow the same named token
-      // name on both sides of the alternative. Also fix that here by replacing
-      // name tokens by dummy names, and dealing with the dummy names later.
 
       size_t pos = 0;
       size_t new_pos;
@@ -204,44 +204,131 @@ octregexp_list (const octave_value_list
       std::ostringstream buf;
       Array named_idx;
 
-      while ((new_pos = pattern.find ("(?<",pos)) != std::string::npos)
-        {
-          size_t tmp_pos = pattern.find_first_of ('>',new_pos);
-
-          if (tmp_pos == std::string::npos)
+      while ((new_pos = pattern.find ("(?",pos)) != std::string::npos)
+        {
+          if (pattern.at (new_pos + 2) == '<' &&
+              !(pattern.at (new_pos + 3) == '=' ||
+                pattern.at (new_pos + 3) == '!'))
             {
-              error ("syntax error in pattern");
-              break;
+              // The syntax of named tokens in pcre is "(?P...)" while
+              // we need a syntax "(?...)", so fix that here. Also an
+              // expression like
+              // "(?\w+)\s+(?\w+)|(?\w+),\s+(?\w+)"
+              // should be perfectly legal, while pcre does not allow the same
+              // named token name on both sides of the alternative. Also fix
+              // that here by replacing name tokens by dummy names, and dealing
+              // with the dummy names later.
+
+              size_t tmp_pos = pattern.find_first_of ('>',new_pos);
+
+              if (tmp_pos == std::string::npos)
+                {
+                  error ("syntax error in pattern");
+                  break;
+                }
+
+              std::string tmp_name =
+                pattern.substr(new_pos+3,tmp_pos-new_pos-3);
+              bool found = false;
+
+              for (int i = 0; i < nnames; i++)
+                if (named(i) == tmp_name)
+                  {
+                    named_idx.resize(inames+1);
+                    named_idx(inames) = i;
+                    found = true;
+                    break;
+                  }
+              if (! found)
+                {
+                  named_idx.resize(inames+1);
+                  named_idx(inames) = nnames;
+                  named.append(tmp_name);
+                  nnames++;
+                }
+
+              if (new_pos - pos > 0)
+                buf << pattern.substr(pos,new_pos-pos);
+              if (inames < 10)
+                buf << "(?P [...] 0)
+                {
+                  char ch = pattern.at (tmp_pos1);
+                  if (ch == '(')
+                    brackets++;
+                  else if (ch == ')')
+                    {
+                      if (brackets > 1)
+                        tmp_pos2 = tmp_pos1;
+
+                      brackets--;
+                    }
+                  tmp_pos1++;
+                }
+
+              if (brackets != 0)
+                {
+                  buf << pattern.substr (pos, new_pos - pos) << "(?";
+                  pos = new_pos + 2;
+                }
+              else
+                {
+                  size_t tmp_pos3 = pattern.find_first_of ("*+", tmp_pos2);
+                  if (tmp_pos3 != std::string::npos && tmp_pos3 < tmp_pos1)
+                    {
+                      if (!lookbehind_warned)
+                        {
+                          lookbehind_warned = true;
+                          warning ("%s: arbitrary length lookbehind patterns are only support up to length %d", nm.c_str(), MAXLOOKBEHIND);
+                        }
+
+                      buf << pattern.substr (pos, new_pos - pos) << "(";
+
+                      size_t i;
+                      if (pattern.at (tmp_pos3) == '*')
+                        i = 0;
+                      else
+                        i = 1;
+
+                      for (; i < max_length + 1; i++)
+                        {
+                          buf < [...] 0)
-            buf << pattern.substr(pos,new_pos-pos);
-          if (inames < 10)
-            buf << "(?P [...]
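
The central test in the changeset is the character that follows "(?<": "=" or "!" marks a lookbehind, anything else is treated as a MATLAB-style named token. That classification can be sketched in a few lines of Octave. The example pattern and variable names below are made up for illustration, and the sketch ignores escaped parentheses and a "(?<" at the very end of the pattern.

```octave
% Illustrative pattern: one named token followed by two lookbehinds.
pattern = '(?<word>[a-z]+)(?<=[a-z])\d(?<!x)';

starts = strfind (pattern, '(?<');   % start index of every "(?<"
kinds = cell (size (starts));

for k = 1:numel (starts)
  next_char = pattern(starts(k) + 3);   % character right after "(?<"
  if next_char == '=' || next_char == '!'
    kinds{k} = 'lookbehind';
  else
    kinds{k} = 'named token';
  end
end

for k = 1:numel (kinds)
  printf ('%2d: %s\n', starts(k), kinds{k});
end
```

Only the groups classified as named tokens need the "(?P<...>" rewrite; the lookbehinds are then inspected separately for "*" and "+".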

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....

On  9-Sep-2008, David Bateman wrote:

| Ben Abbott wrote:
| > On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <[hidden email]> wrote:
| >> Grrrr, it's more annoying than I thought. PCRE CAN do arbitrary length
| >> lookahead, but not arbitrary length lookbehind. Thus "(?=[a-z]*)" is ok
| >> but "(?<=[a-z]*)" isn't. I'd hoped to replace this with
| >> "(?<=[a-z]{0,MAXLENGTH})", but a bounded but variable length is
| >> not ok either. What I'd have to do is replace it with
| >>
| >> ((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{MAXLENGTH}))
| >>
| >> which uses the alternation operator and MAXLENGTH+1 copies of the
| >> lookbehind expression to get the effect. This seems to be a ridiculous
| >> amount of extra crap in the pattern space to get this functionality. Is
| >> it worth supporting arbitrary length lookbehind expressions like
| >> "(?<=[a-z]*)" if this is what is needed to get it to work with PCRE? Is
| >> it worth supporting it with a limited max_length and a printed warning?
| >> If so, what should the limit be?
| >>
| >> Frankly I wonder how mathworks got this to work as they appear to be
| >> using the Boost regex library, which also doesn't support arbitrary
| >> length lookbehind expressions....
| >>
| >> D.
| >
| > David,
| >
| > Have you tried the example in Matlab?
| >
| > Using 2007b, it does *not* work for me. My 2008a/b is busy running some simulations, so I can't try it there until later.
| >
| > >> g='x^(-1)+y(-1)+z(-1)=0';
| > >> regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
| > ans =
| > x^_minus1+y_minus1+z_minus1=0
| >
| > If I understand correctly the result should be
| >
| > ans =
| > x^(-1)+y_minus1+z_minus1=0
| >
| > Correct?
| >
| > Ben
|
| The message
|
| http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff
|
| seems to imply that mathworks have their own regexp engine and that
| lookbehind is inefficient. I therefore don't consider it that much of an
| issue to duplicate the lookbehind pattern in the pattern space, and so
| propose the attached changeset, which replaces "(?<=[a-z]*)" with
| "((?<=[a-z]{0})|(?<=[a-z]{1})|...|(?<=[a-z]{10}))" before calling PCRE on
| it. It also issues a warning about the maximum string length if the
| lookbehind might be affected. So the limitation is that "+" then
| represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind
| expression. This limitation doesn't apply to lookaheads, etc.

Is the bug report

  http://bugs.exim.org/show_bug.cgi?id=547

the same problem?  Note the comment

  I can't see an efficient way of doing this with the current
  implementation.  Note that Perl is even more restrictive - all
  alternatives in the lookbehind have to be the same length in Perl.

I guess it might be worth asking whether there is a way to get this
feature, even if it is not efficient.

Meanwhile, I've applied your changeset.

Thanks,

jwe