utf8 does not appear to work for function documentation strings generated with texinfo

11 messages

utf8 does not appear to work for function documentation strings generated with texinfo

Alan W. Irwin
N.B. to understand this post your system has to be set up to deal
properly with utf8 mail so that the utf8 character ≥ for the
math symbol for "greater than or equal to" in this post is
displayed properly.

To illustrate a problem I have encountered with utf8 in function help
strings in a simple way, I have defined the following function which
outputs the greater than or equal utf8 math symbol:

irwin@raven> cat test_utf8.m
## -*- texinfo -*-
## The unicode character, ≥, is output
function test_utf8
printf("The unicode character, ≥, is output\n")
endfunction

If you run this function from octave 3.6.2, you get the expected
utf8 results.

octave:1> test_utf8
The unicode character, ≥, is output

However, the help text for this function is truncated at the unicode
character.

octave:2> help test_utf8
`test_utf8' is a function from the file
/home/irwin/test_octave/test_utf8.m

The unicode character,
                       ^^^
[...]

On the other hand, if you drop the

## -*- texinfo -*-

line, the complete help string is output.

Therefore, there appears to be some issue with how octave uses texinfo
or an issue with texinfo itself that is causing the truncation
problem. According to
https://www.gnu.org/software/texinfo/manual/texinfo/html_node/_0040documentencoding.html
texinfo does support utf8.  However, when I experimented with the
recommended "@documentencoding UTF-8" command to specify that encoding
in the above function, the help text was still truncated in every
case.

Note, I am interested in the texinfo variant of documentation strings
because that variant appears to be the preferred form of documentation.
Furthermore, for my original much more complicated problem (the Octave binding
for PLplot generated with swig as a C++ extension), the extension is
generated using the DEFUNX_DLD macro, and swig automatically inserts
"-*- texinfo -*-" as the first part of the documentation string for
that macro.

Alan
__________________________
Alan W. Irwin

Astronomical research affiliation with Department of Physics and Astronomy,
University of Victoria (astrowww.phys.uvic.ca).

Programming affiliations with the FreeEOS equation-of-state
implementation for stellar interiors (freeeos.sf.net); the Time
Ephemerides project (timeephem.sf.net); PLplot scientific plotting
software package (plplot.sf.net); the libLASi project
(unifont.org/lasi); the Loads of Linux Links project (loll.sf.net);
and the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave

Re: utf8 does not appear to work for function documentation strings generated with texinfo

CdeMills
Using utf-8 with TeX requires specifying the encoding explicitly.

With Texinfo, you need to add

@documentencoding UTF-8

in the help preamble.

Regards

Pascal

Re: utf8 does not appear to work for function documentation strings generated with texinfo

Alan W. Irwin
On 2014-03-26 03:03-0700 CdeMills wrote:

> Using utf-8 with TeX requires specifying the encoding explicitly.
>
> With Texinfo, you need to add
>
> @documentencoding UTF-8

Thanks for your reply to my question.  However, as I stated in my OP,
I had tried some experiments with @documentencoding UTF-8, but could
not get it to work.  To be specific, here is one example that does not
work here with octave-3.6.2.

__________________
## -*- texinfo -*-
## @deftypefn  {Function File} {@var{a} =} fn (@var{x}, …)
## @documentencoding UTF-8
## The unicode character, ≥, is output
## @end deftypefn

function test_utf8
printf("The unicode character, ≥, is output\n")
endfunction
__________________

It yields the following truncated help results:

octave:1> test_utf8
The unicode character, ≥, is output
octave:2> help test_utf8
`test_utf8' is a function from the file
/home/irwin/test_octave/test_utf8.m

  -- Function File: A = fn (X,
      The unicode character,
                            ^^^
Additional help for built-in functions and operators is
[....]

I also tried the experiment of putting the ## @documentencoding UTF-8
just after the texinfo line, but I still get the truncated help result.

Do you get untruncated help results with this simple example there for
your version of octave or is there some problem with how this simple
example is implemented?

Alan

Re: utf8 does not appear to work for function documentation strings generated with texinfo

Mike Miller
On Wed, Mar 26, 2014 at 09:32:11 -0700, Alan W. Irwin wrote:
> Do you get untruncated help results with this simple example there for
> your version of octave or is there some problem with how this simple
> example is implemented?

I get the same as you, and with some simple use of Octave's debugger
you can easily tell that this is Octave's fault, not a Texinfo
problem. By using dbstop and dbnext you can step through the help
function to find where the text is transformed incorrectly.

  dbstop help
  help test_utf8
  dbstep
  dbstep
  ... ## step until get_help_text is called
  text
  dbstep
  ...

In the __makeinfo__ function, the texinfo string is written to a
temporary file which is passed to the makeinfo program, see

  http://hg.savannah.gnu.org/hgweb/octave/file/75467145096f/scripts/help/__makeinfo__.m#l120

In writing to this file, it looks like all non-ASCII bytes are set to
zero by Octave. This is then sent to makeinfo, and that's why the
string is truncated at that point when it is printed by the help
command, because the first zero is seen as a null string terminator.
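
The effect described above can be reproduced outside Octave. The following is only a minimal Python sketch of the diagnosis, not a call into Octave or makeinfo: the zeroing step is a hypothetical stand-in for the faulty write, and the "consumer" simply stops at the first NUL byte the way code handling C strings would.

```python
# Model of the failure described in this post. The zeroing step mimics the
# thread's diagnosis of the temporary file contents; it is an assumption
# made for illustration and does not call Octave or makeinfo.
help_text = "The unicode character, ≥, is output\n"

# Bytes as they are stored in a UTF-8 encoded .m file.
raw = help_text.encode("utf-8")

# Hypothetical stand-in for the faulty write: every non-ASCII byte becomes 0.
written = bytes(b if b < 0x80 else 0 for b in raw)

# A consumer that treats the data as a C string stops at the first NUL byte.
visible = written.split(b"\x00", 1)[0].decode("ascii")

print(list(raw[23:26]))   # the three UTF-8 bytes of the symbol
print(written.count(0))   # how many bytes were zeroed
print(repr(visible))      # what survives up to the first NUL
```

The surviving prefix ends right after "character, ", which is where the help output in the earlier posts stops.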

Feel free to report this as a bug, but this shows that the underlying
cause is the behavior of fwrite with a char matrix argument. And note
that Octave does not natively support wide characters in the "char"
type yet, so there may not be a nice way to get this working until
full wide character support is added.

HTH,

--
mike


Re: utf8 does not appear to work for function documentation strings generated with texinfo

CdeMills
Alan W. Irwin wrote:
> Thanks for your reply to my question.  However, as I stated in my OP,
> I had tried some experiments with @documentencoding UTF-8, but could
> not get it to work.  To be specific, here is one example that does not
> work here with octave-3.6.2.

Sorry, I was too fast in reading the OP :-)

Tested the same under 3.8.1-GUI; same behaviour. Yet it was started as
env LANG=fr_FR.UTF-8 octave --force-gui

I ran it under the debugger. help.m calls __makeinfo__.m. At line 126, we have:
 fwrite (fid, text);
 
With your example, the end of text contains unicode characters, yet in the generated file those characters are garbled. This file is then passed to the external program 'makeinfo', which should respect the @documentencoding directive. But garbled input results in garbled output.

I suppose the issue lies in the fwrite implementation. Could you please report it as a bug on Savannah?

Regards

Pascal

Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

Alan W. Irwin
To Pascal and Mike:

You guys have shown the issue is with fwrite.

So I have prepared a simple test case of the fwrite problem
for a (future) bug report.

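function test_fwrite_utf8
  ## Write a string containing a multi-byte utf8 character with the
  ## default fwrite precision, then close the file so it is flushed.
  fid = fopen ("test.out", "w");
  fwrite (fid, "The unicode character, ≥, is output\n");
  fclose (fid);
endfunction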

That creates the following file:

irwin@raven> od -a test.out
0000000   T   h   e  sp   u   n   i   c   o   d   e  sp   c   h   a   r
0000020   a   c   t   e   r   ,  sp nul nul nul   ,  sp   i   s  sp   o
0000040   u   t   p   u   t  nl
0000046

The default precision for fwrite (according to the documentation of
that function) is uchar.  I believe that would work if the Octave type
of the utf8 string, "The unicode character, ≥, is output\n" is also uchar
since all that is needed here is to transmit the bytes of that string
unmolested to the output file.

So is the problem that the Octave utf8 string does not have a uchar
type?  Or does the fwrite built-in have some unnecessary filtering in
place to zero bytes with a non-zero eighth bit (i.e., non-ascii utf8
bytes)?

I am pretty much an octave newbie so it would be good to obtain
agreement here on what the exact problem is before I create
the requested bug report.

Alan

Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

Mike Miller
On Wed, Mar 26, 2014 at 3:13 PM, Alan W. Irwin
<[hidden email]> wrote:

> function test_fwrite_utf8
>   fid = fopen ("test.out", "w");
>   fwrite (fid, "The unicode character, ≥, is output\n");
>   fclose (fid);
> endfunction
>
> That creates the following file:
>
> irwin@raven> od -a test.out
> 0000000   T   h   e  sp   u   n   i   c   o   d   e  sp   c   h   a   r
> 0000020   a   c   t   e   r   ,  sp nul nul nul   ,  sp   i   s  sp   o
> 0000040   u   t   p   u   t  nl
> 0000046

Yep, same here.

> The default precision for fwrite (according to the documentation of
> that function) is uchar.  I believe that would work if the Octave type
> of the utf8 string, "The unicode character, ≥, is output\n" is also uchar
> since all that is needed here is to transmit the bytes of that string
> unmolested to the output file.

Unfortunately there is no "uchar" type in the Octave (or Matlab)
language, this keyword is only meaningful to the fread and fwrite
functions.

> So is the problem that the Octave utf8 string does not have a uchar
> type?  Or does the fwrite built-in have some unnecessary filtering in
> place to zero bytes with a non-zero eighth bit (i.e., non-ascii utf8
> bytes)?

The *real* problem is that the "char" type is supposed to (and will
someday) represent a Unicode character. It currently only represents a
C "char" one-byte value in Octave.

If you want to focus on this specific situation, the problem is that
the argument is a string of type "char", which is probably limiting
the range to [-128,127], which is then limited to [0,127] when
converted to the "uchar" precision range internally by fwrite.

If you try

  fwrite (fid, "The unicode character, ≥, is output\n", "schar");

or

  fwrite (fid, double("The unicode character, ≥, is output\n"));

instead, then the values are not limited and it should work (both work for me).
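
To make the range argument concrete, here is a small Python sketch of that hypothesis. It assumes, as suggested above, that the string bytes are held as signed 8-bit values and that the default unsigned precision saturates out-of-range values; saturate_to_uchar is a hypothetical helper written for this illustration and says nothing certain about Octave's internals.

```python
import struct


def saturate_to_uchar(value):
    """Clamp an integer into the unsigned 8-bit range [0, 255]."""
    return max(0, min(255, value))


# The three UTF-8 bytes of the symbol: 0xE2 0x89 0xA5.
utf8_bytes = "≥".encode("utf-8")

# Reinterpret each byte as a signed 8-bit value, as a signed C "char" would.
as_signed = struct.unpack("3b", utf8_bytes)

# Assumed default path: negative values saturate to 0 in the unsigned range.
clamped = bytes(saturate_to_uchar(v) for v in as_signed)

# "schar"-like path: writing signed 8-bit values keeps the bit patterns.
round_trip = struct.pack("3b", *as_signed)

print(as_signed)                  # all three values are negative
print(clamped)                    # three zero bytes, as in the od dump
print(round_trip == utf8_bytes)   # the signed write is lossless
```

Under these assumptions the unsigned path yields exactly the three nul bytes seen in the od output, while the signed path returns the original bytes unchanged.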

This may or may not be an acceptable workaround for the particular
case of help strings, but internally fwrite (and other functions) may
still effectively apply ASCII range limits to char matrices (strings)
until Octave actually supports wide characters.

--
mike


Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

Alan W. Irwin
On 2014-03-26 16:19-0400 Mike Miller wrote:

> [...]If you try
>
>  fwrite (fid, "The unicode character, ≥, is output\n", "schar");
>
> [...] instead, then the values are not limited and it should work (both work for me).

That works here too.  In fact, if I use
the following patch

--- __makeinfo__.m_original 2014-03-26 13:56:42.741106684 -0700
+++ __makeinfo__.m 2014-03-26 13:56:19.005546479 -0700
@@ -120,7 +120,7 @@
      if (fid < 0)
        error ("__makeinfo__: could not create temporary file");
      endif
-    fwrite (fid, text);
+    fwrite (fid, text, "schar");
      fclose (fid);

      ## Take action depending on output type

then it solves the original issue (without the suggested
## @documentencoding UTF-8 line) I posted concerning utf8 help strings
for functions.  So that is how I have written up the bug report
at http://savannah.gnu.org/bugs/index.php?41965.

> This may or may not be an acceptable workaround for the particular
> case of help strings, but internally fwrite (and other functions) may
> still effectively apply ASCII range limits to char matrices (strings)
> until Octave actually supports wide characters.

My understanding is one of the principal points about utf8 is you
don't need to use wide characters.  Instead, a utf8 string should be
represented as an array of 8-bit bytes terminated by a null character.
Of course, if you convert utf8 to UCS4 (for example), then you will
need an array of wide characters to represent the latter.  That said,
it does appear from the success of "schar" for fwrite that Octave utf8
strings are stored internally as a vector of signed char's (as opposed
to unsigned char's) so the only way to keep invalid conversions of the
utf8 string from happening is to also use the "schar" type with fwrite
as in the above patch.
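
As a small illustration of that distinction, the following Python sketch compares the two encodings of the symbol used in this thread. It is only a sketch of the encodings themselves and makes no claim about how Octave stores strings.

```python
symbol = "≥"  # U+2265 GREATER-THAN OR EQUAL TO

# utf8: a variable-length sequence of 8-bit code units.
utf8 = symbol.encode("utf-8")

# UCS4 (UTF-32, big-endian, no byte order mark): one 32-bit code unit.
ucs4 = symbol.encode("utf-32-be")

print(hex(ord(symbol)))  # the code point
print(list(utf8))        # three bytes, none of them zero
print(list(ucs4))        # four bytes, two of them zero

# Plain ASCII text has the same bytes in utf8, so byte-oriented code that
# passes data through untouched needs no wide character type.
assert "is output".encode("utf-8") == "is output".encode("ascii")
```

Because the utf8 form contains no zero bytes, it can live in an ordinary null-terminated byte array, whereas the UCS4 form cannot.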

Thanks very much for discovering that "schar" possibility which I
think is likely the correct solution to the problem of allowing utf8 help
strings to be correctly processed by  __makeinfo__.

Alan

Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

CdeMills
Mike Miller wrote:
> If you try
>
>   fwrite (fid, "The unicode character, ≥, is output\n", "schar");
>
> or
>
>   fwrite (fid, double("The unicode character, ≥, is output\n"));

Mike,
"char" should be better. From the C++ standard definition:

Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

For a "plain char", all bits participate in the representation, which is what we require. It also works on the same example.

This opens another issue: I spotted other cases of two-argument fwrite calls in the source:

scripts/image/imread.m:%!   fwrite (fid, vpng);
scripts/pkg/private/create_pkgadddel.m:      fwrite (instfid, extract_pkg (nam, ['^[#%][#%]* *' nm ': *(.*)$']));

The first should be tested.
The second implies that the package name should not contain exotic characters, which is probably a good thing.

Regards

Pascal

Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

Mike Miller
On Thu, Mar 27, 2014 at 02:22:05 -0700, CdeMills wrote:
> "char" should be better. From the C++ standard definition:
> [...]
> For a "plain char", all bits participate to the representation; which is
> what we require. On the same example, it works too.

Yeah, ok I guess that is more consistent. Also, the Matlab help for
fwrite lists "char" as an encoding-dependent character, which means it's
probably the appropriate conversion type for fwrite-ing strings anyway.

So after we do support wide characters, "char" in fwrite will mean
something different but it will still be the correct term to use.

I actually just realized an even better solution - use fprintf instead
of fwrite when dealing with strings.

> This opens another issue: I spotted other cases of two-argument fwrite calls in
> the source:
>
> scripts/image/imread.m:%!   fwrite (fid, vpng);
> scripts/pkg/private/create_pkgadddel.m:      fwrite (instfid, extract_pkg (nam, ['^[#%][#%]* *' nm ': *(.*)$']));
>
> The first should be tested.

The first *is* a test, so simply running "test imread" shows that it
works (vpng is a matrix of type double whose values are in the range
[0,255]).

> The second implies that package name should not contain exotic characters,
> which is probably a good thing.

Looks good to me.

--
mike


Re: [help-octave] Re: [help-octave] Re: utf8 does not appear to work for function documentation strings generated with texinfo

Alan W. Irwin
On 2014-03-27 09:25-0400 Mike Miller wrote:

> I actually just realized an even better solution - use fprintf instead
> of fwrite when dealing with strings.

Yes, that should be more robust since it is free of assumptions about
the exact char type being used to internally represent Octave strings.

Also, it works for me both for my simple test and also in a much
broader context (utf-8 help strings for the Octave functions
automatically created by the swig-generated Octave binding of PLplot).

Encouraged by that success I also tried the following test using the
utf-8 strings for a wide variety of human languages (adapted from
http://en.wikipedia.org/wiki/Xetex). (WARNING, this example will only
be legible if you have the appropriate fonts installed on your
system).

## -*- texinfo -*-
## @deftypefn  {Function File} {@var{a} =} fn (@var{x}, …)
##English
##
##All human beings are born free and equal in dignity and rights.
##
##Íslenska
##
##Hver maður er borinn frjáls og jafn öðrum að virðingu og réttindum.
##
##Русский
##
##Все люди рождаются свободными и равными в своем достоинстве и
##правах.
##
##Tiếng Việt
##
##Tất cả mọi người sinh ra đều được tự do và bình đẳng về nhân phẩm và
##quyền lợi.
##
##Ελληνικά
##
##Ὅλοι οἱ ἄνθρωποι γεννιοῦνται ἐλεύθεροι καὶ ἴσοι στὴν ἀξιοπρέπεια
##καὶ τὰ δικαιώματα.
##
##Legacy syntax
##
##When he goes---``Hello World!''\\
##She replies—“Hello dear!”
##
##Ligatures
##
##Questo è strano assai!
## @end deftypefn

function test_utf8_languages
printf("The unicode character, ≥, is output\n")
endfunction

The result was

warning: function ./__makeinfo__.m shadows a core library function
octave:1> help test_utf8_languages
`test_utf8_languages' is a function from the file /home/irwin/test_octave/test_utf8_languages.m

  -- Function File: A = fn (X, …)
      English

      All human beings are born free and equal in dignity and rights.

      Íslenska

      Hver maður er borinn frjáls og jafn öðrum að virðingu og réttindum.

      Русский

      Все люди рождаются свободными и равными в своем достоинстве и
      правах.

      Tiếng Việt

      Tất cả mọi người sinh ra đều được tự do và bình đẳng về nhân phẩm
      và quyền lợi.

      Ελληνικά

      Ὅλοι οἱ ἄνθρωποι γεννιοῦνται ἐλεύθεροι καὶ ἴσοι στὴν ἀξιοπρέπεια
      καὶ τὰ δικαιώματα.

      Legacy syntax

      When he goes--"Hello World!"\\ She replies—“Hello dear!”

      Ligatures

      Questo è strano assai!

Additional help for built-in functions and operators is
[...]

So it appears to me your simple change to __makeinfo__.m is ready for
your next release with some nice implications.  For example, your
change makes it possible to include any of a vast array of UTF-8
mathematical glyphs into Octave function help strings. Furthermore,
because UTF-8 covers all glyphs occurring in the scripts of human
languages (and also Klingon, :-)), it should make it convenient to
translate Octave function help strings into all languages that have a
simple left-to-right text layout, e.g., the above languages and many
more, but excluding complex text layout languages (e.g.,
Arabic, Devanagari, or Thai according to
<http://en.wikipedia.org/wiki/Complex_text_layout>).

Alan