Re: first help sentence truncated

Rik-4
This is likely to be caused by get_first_help_sentence.m in Octave core.

With the following file b.m

--- File: b.m ---
## This plots x vs. y on a green background.

function b (x)
  disp ('hello');
endfunction
--- End File ---

get_first_help_sentence ("b.m") returns
ans =  This plots x vs.

The code is

  ## Extract first line by searching for a period followed by a space class
  ## character (to support periods in numbers or words) ...
  period_idx = regexp (help_text, '\.\s', "once");
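
The truncation can be reproduced without the file by applying the same pattern to a string. This is a minimal sketch; it assumes the help text reaches the pattern as the plain string below, and the variable name first_sentence is only for illustration.

```octave
## Help text of b.m as a plain string (assumed; the real function first
## extracts and formats the help text before searching it).
help_text = "This plots x vs. y on a green background.\n";

## Same pattern as quoted above: a period followed by one whitespace
## character.
period_idx = regexp (help_text, '\.\s', "once");

## The first match is the period of "vs.", so the sentence is cut there.
first_sentence = help_text(1:period_idx);
disp (first_sentence)   # prints: This plots x vs.
```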

One way to resolve this is to have a list of abbreviations as Oliver suggests.  This could get cumbersome though as we would probably not recognize a new abbreviation in the first sentence of help until somebody reported an error.  Another solution would be to require the convention (used in Octave core) that a sentence-ending period is followed by *two* spaces.  Then the regular expression above could be modified to support this case.  This would work on all in-sentence abbreviations, and on phrases like "Plot Y vs. X on a semilog background.  The second help sentence".  A third possibility would be to re-write the documentation--either to expand the abbreviation like vs. to versus if length is not a problem, or to remove the abbreviation entirely.  For example, the existing semilogy documentation avoids using "vs." entirely and says "Produce a 2-D plot using a logarithmic scale for the y-axis."
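
As a rough illustration of the first option (the abbreviation list), a negative lookbehind can skip periods that directly follow a known abbreviation. The list below is hypothetical and deliberately short; it is not taken from Octave or generate_html, and it shows the maintenance problem described above, since any abbreviation missing from the list still truncates the sentence.

```octave
## Hypothetical abbreviation list (an assumption for this sketch).
## Each alternative has a fixed length, which PCRE lookbehind requires.
abbrev = 'vs|e\.g|i\.e|cf';

## A period followed by whitespace, unless an abbreviation precedes it.
pattern = ['(?<!', abbrev, ')\.\s'];

help_text = "This plots x vs. y on a green background.  More text.\n";
period_idx = regexp (help_text, pattern, "once");
first_sentence = help_text(1:period_idx);
disp (first_sentence)   # prints: This plots x vs. y on a green background.
```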

--Rik

On 08/27/2018 09:00 AM, [hidden email] wrote:
Subject: Re: Octave-Forge: Redesign with responsive layout
From: Oliver Heimlich [hidden email]
Date: 08/26/2018 10:58 PM
To: "Dmitri A. Sergatskov" [hidden email]
CC: octave-maintainers [hidden email]

Hi Dmitri,

Thank you for the observation. Please create a bug report in the category 'Octave Forge Package'. The extraction of the first help sentence happens in the package generate_html.

Fortunately, it doesn't happen very often. Maybe we can simply scan for common abbreviations to prevent it from happening.

Oliver

On 27 August 2018 04:44:16 CEST, "Dmitri A. Sergatskov" [hidden email] wrote:

Looking at the Octave-Forge website, I noticed that the description text of individual functions often gets truncated.


yet the problem is widespread. (I assume the code that extracts the first sentence from the help gets tricked by the period in miscellaneous abbreviations.)

I could file a bug report but I do not see a proper Category.


Dmitri.

Rik-4
Assuming moving to a two-space convention to end sentences, this regular expression will work

regexp (help_text, '\.(?:\s\s|\n|$)', 'once')
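
A short sketch of how this pattern behaves on the phrase from the earlier message, and on the same phrase written with a single space after the sentence-ending period. The test strings and the variable names are chosen here for illustration only.

```octave
pattern = '\.(?:\s\s|\n|$)';

## Two spaces after the sentence-ending period: "vs." is skipped.
two_spaces = "Plot Y vs. X on a semilog background.  The second help sentence";
idx2 = regexp (two_spaces, pattern, "once");
first_sentence = two_spaces(1:idx2);
disp (first_sentence)   # prints: Plot Y vs. X on a semilog background.

## One space after the period: the sentence end is not recognized, and
## the first match is the final period (through the $ alternative).
one_space = "Plot Y vs. X on a semilog background. The second help sentence.";
idx1 = regexp (one_space, pattern, "once");
disp (idx1 == numel (one_space))   # prints: 1
```

The second case shows the cost of the approach: help text that does not follow the two-space convention is returned whole, subject to the length cutoff.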

--Rik

On 08/27/2018 10:11 AM, Rik wrote:
This is likely to be caused by get_first_help_sentence.m in Octave core.

Dmitri A. Sergatskov
I think we should also put a limit on max string length to avoid possible DoS
(intentional or not) / web page corruption, if regex is not found.

Dmitri.

On Mon, Aug 27, 2018 at 12:29 PM Rik <[hidden email]> wrote:
Assuming moving to a two-space convention to end sentences, this regular expression will work

regexp (help_text, '\.(?:\s\s|\n|$)', 'once')

--Rik

Dmitri A. Sergatskov
Also if we truncate the message we could add "..." at the end so it gives a clue that
it is truncated intentionally.

Dmitri.

On Mon, Aug 27, 2018 at 1:14 PM Dmitri A. Sergatskov <[hidden email]> wrote:
I think we should also put a limit on max string length to avoid possible DoS
(intentional or not) / web page corruption, if regex is not found.

Dmitri.

Rik-4
On 08/27/2018 11:14 AM, Dmitri A. Sergatskov wrote:
I think we should also put a limit on max string length to avoid possible DoS
(intentional or not) / web page corruption, if regex is not found.

Dmitri.

This is already implemented with a default cutoff of 80 characters.  The help text for get_first_help_sentence is:

 -- TEXT = get_first_help_sentence (NAME)
 -- TEXT = get_first_help_sentence (NAME, MAX_LEN)
 -- [TEXT, STATUS] = get_first_help_sentence (...)
     Return the first sentence of a function's help text.

     The first sentence is defined as the text after the function
     declaration until either the first period (".")  or the first
     appearance of two consecutive newlines ("\n\n").  The text is
     truncated to a maximum length of MAX_LEN, which defaults to 80.

--Rik

Rik-4
On 08/27/2018 11:21 AM, Dmitri A. Sergatskov wrote:
Also if we truncate the message we could add "..." at the end so it gives a clue that
it is truncated intentionally.

Yes, that would be a good addition.

--Rik

Rik-4
On 08/27/2018 01:10 PM, Rik wrote:
On 08/27/2018 11:21 AM, Dmitri A. Sergatskov wrote:
Also if we truncate the message we could add "..." at the end so it gives a clue that
it is truncated intentionally.

Yes, that would be a good addition.

--Rik

I added this feature to the development branch in this cset: https://hg.savannah.gnu.org/hgweb/octave/rev/6784059127f5.
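
For readers who do not want to open the changeset, the idea can be sketched as follows. This is not the code from that changeset; it is a minimal illustration that assumes the three dots count toward the length limit, using the 80-character default mentioned earlier in the thread.

```octave
max_len = 80;   # default cutoff of get_first_help_sentence

## 150 characters with no sentence-ending period (made-up test input).
text = repmat ("long help text ", 1, 10);

if (numel (text) > max_len)
  ## Reserve three characters so the result still fits in max_len.
  text = [text(1:max_len-3), "..."];
endif

disp (numel (text))   # prints: 80
```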

--Rik

Juan Pablo Carbajal-2
Hi,
Just a question: instead of re-inventing the English language, why not
check for a period followed by a string of non-word characters and the
end of the line?
Something along the lines of (assuming non-greedy *):   '.*[.]\W*$'
I do not see why a regex cannot handle abbreviations vs. periods.


On Mon, Aug 27, 2018 at 11:47 PM Rik <[hidden email]> wrote:

>
> On 08/27/2018 01:10 PM, Rik wrote:
>
> On 08/27/2018 11:21 AM, Dmitri A. Sergatskov wrote:
>
> Also if we truncate the message we could add "..." at the end so it gives a clue that
> it is truncated intentionally.
>
>
> Yes, that would be a good addition.
>
> --Rik
>
>
> I added this feature to the development branch in this cset: https://hg.savannah.gnu.org/hgweb/octave/rev/6784059127f5.
>
> --Rik

Rik-4
On 08/29/2018 01:41 PM, Juan Pablo Carbajal wrote:
> Hi,
> Just a question: instead of re-inventing the English language, why not
> check for a period followed by a string of non-word characters and the
> end of the line?
> Something along the lines of (assuming non-greedy *):   '.*[.]\W*$'
> I do not see why a regex cannot handle abbreviations vs. periods.
That would catch multiple sentences.  For example,

octave:1> str = "This is sentence 1. This is sentence 2.\n";
octave:2> str
str = This is sentence 1. This is sentence 2.

octave:3> regexp (str, '.*[.]\W*$')
ans =  1
octave:4> [s,e] = regexp (str, '.*[.]\W*$')
s =  1
e =  40
octave:5> str(1:e)
ans = This is sentence 1. This is sentence 2.
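
The session above uses the greedy form. As a hedged follow-up, the same check with an explicitly lazy quantifier (written here as .*?, which is an interpretation of "assuming non-greedy *") gives the same span, because the $ anchor still forces the match to run to the end of the string.

```octave
str = "This is sentence 1. This is sentence 2.\n";

## Lazy quantifier: .*? instead of .*
[s, e] = regexp (str, '.*?[.]\W*$', "once");

## After the first period, \W* can only consume the space, and $ does not
## match before "This", so the engine extends the match to the last period.
disp ([s, e])   # prints: 1   40
```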

--Rik

John W. Eaton
Administrator
On 08/29/2018 05:06 PM, Rik wrote:
> That would catch multiple sentences.  For example,


It seems to me we could use the same rules as Texinfo (based on the
rules for TeX) for determining the ends of sentences and then require
the extra work for those cases when the rules aren't sufficient.  The
Texinfo rules may be found here:

 
https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html#Ending-a-Sentence

and

 
https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html#Not-Ending-a-Sentence

I wouldn't work too hard on this, but it seems fairly straightforward if
someone is interested.

Also, for the record, I generally type two spaces at the end of a
sentence.  It seems natural to me, as it's a habit formed many years ago
when learning to type on real typewriters with fixed-space characters.
But I also realize that style is falling out of favor now, even when
using fixed-width fonts or lousy "word processors" that don't
automatically insert the extra space at ends of sentences that one would
expect to see with good quality typesetting.

jwe

Rik-4
On 08/30/2018 09:18 AM, John W. Eaton wrote:
I wouldn't work too hard on this, but it seems fairly straightforward if someone is interested.

I agree, this isn't worth too much time.  I changed the regexp pattern on the development branch:

+  ## Extract first line by searching for a period followed by whitespace
+  ## followed by a capital letter (Nearly the same rule as Texinfo).
+  period_idx = regexp (help_text, '\.\s+(?:[A-Z]|\n)', "once");

This is closer to what is meant by a "sentence" but it still won't work on the motivating example.  The HTML for that is shown below.

Smooths the y vs. x values of 1D data by Tikhonov regularization. The smooth y-values are returned as yhat.

The texinfo has @var{y} and @var{x} so when this is expanded by makeinfo to plain text the result is

Smooths the Y vs. X values of 1D data by Tikhonov regularization. The smooth y-values are returned as YHAT.

The abbreviation "vs." still looks like a sentence end because the next letter following the period is capitalized.  At this point, I think it would be easier to rewrite the help text and expand "vs." to "versus".
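
Both points can be checked with a short sketch that applies the new pattern to the plain-text sentence quoted above, first as written and then with "vs." expanded. The variable names are illustrative only.

```octave
pattern = '\.\s+(?:[A-Z]|\n)';
txt = ["Smooths the Y vs. X values of 1D data by Tikhonov regularization. ", ...
       "The smooth y-values are returned as YHAT."];

## "vs." is followed by a space and the capital X, so it still matches.
idx = regexp (txt, pattern, "once");
truncated = txt(1:idx);
disp (truncated)   # prints: Smooths the Y vs.

## With the abbreviation expanded, the first match is the real sentence end.
txt2 = strrep (txt, "vs.", "versus");
idx2 = regexp (txt2, pattern, "once");
full_sentence = txt2(1:idx2);
disp (full_sentence)
```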

--Rik

Colin Macdonald-2
On 2018-08-30 01:13 PM, Rik wrote:
> The abbreviation "vs." still looks like a sentence end because the next
> letter following the period is capitalized.  At this point, I think it
> would be easier to rewrite the help text and expand "vs." to "versus".

Perhaps generate_html could just write out what it thinks the first
sentence is as a debug message.  That way we punt strange cases back to
the package maintainer.

But I think the best approach is to have *both* bits of text (the
one-liner and the longer paragraph) as separate data, so the package
author can decide what to put in each field.  Add a "Summary: <one
sentence>" field to the DESCRIPTION file.  If the new field is missing,
we take the first sentence, as we do now.
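
A rough sketch of that fallback logic, with the caveat that the "Summary" field is only the proposal made here and does not exist in current DESCRIPTION files. The sample contents and the simple first-sentence pattern used for the fallback are assumptions for illustration.

```octave
## Hypothetical DESCRIPTION contents; "Summary" is the proposed new field.
desc = ["Name: image\n", ...
        "Summary: Image processing and feature extraction\n", ...
        "Description: Provides functions for processing images. More text.\n"];

## Look for the proposed field at the start of a line.
tok = regexp (desc, '^Summary:[ \t]*([^\n]*)', ...
              "tokens", "once", "lineanchors");

if (! isempty (tok))
  summary = tok{1};
else
  ## Field missing: fall back to the first sentence of Description.
  tok = regexp (desc, '^Description:[ \t]*(.*?\.)(?:\s|$)', ...
                "tokens", "once", "lineanchors");
  summary = "";
  if (! isempty (tok))
    summary = tok{1};
  endif
endif

disp (summary)   # prints: Image processing and feature extraction
```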

This also aligns nicely with the <summary> and <description> tags from
the foo.metainfo.xml file.  E.g., from octave-image.metainfo.xml:

   <summary>Image processing, feature extraction, transformations,
morphological operations, filters, and more</summary>
   <description>
     <p>Provides functions for processing images, such as feature
     extraction, image statistics, spatial and geometric
     transformations, morphological operations, linear filtering, and
     much more.</p>
   </description>

(so in principle a Makefile could generate one from the other.)

cheers,
Colin