generate_html breaks documentation encoding

Oliver Heimlich
Hello,

I am preparing the first release of the interval package and have run
into the following problem:

The m-files are encoded in UTF-8 and most documentation strings contain
non-ASCII characters.

The generate_html command somehow tries to convert the document strings
from iso-8859-1 to utf-8 and labels the result as iso-8859-1 in the html
header. This is wrong in at least two ways and the resulting html page
is broken.
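
The symptom described above can be reproduced outside Octave. The following is a minimal Python sketch, not code from generate_html or __makeinfo__; the sample string and variable names are made up for illustration. It assumes the failure mode is exactly what the message describes: UTF-8 bytes are misread as iso-8859-1, converted to UTF-8 again, and the page is then labelled iso-8859-1.

```python
# Hypothetical docstring fragment containing one non-ASCII character.
original = "für"

# Step 0: the m-file stores the text as UTF-8 ("ü" becomes the bytes C3 BC).
utf8_bytes = original.encode("utf-8")

# Step 1: the bytes are wrongly interpreted as iso-8859-1, so each byte
# becomes its own character, and the result is encoded as UTF-8 again.
misread = utf8_bytes.decode("iso-8859-1")
double_encoded = misread.encode("utf-8")

# Step 2: the page is labelled iso-8859-1, so a browser decodes the
# double-encoded bytes one byte per character.
displayed = double_encoded.decode("iso-8859-1")

print(utf8_bytes)       # b'f\xc3\xbcr'
print(double_encoded)   # b'f\xc3\x83\xc2\xbcr'
print(len(original), len(displayed))  # 3 6
```

One character has turned into four on screen, which matches the "wrong in at least two ways" observation: one unwanted conversion plus one wrong label.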

Is this a general problem with the generate_html package or a
misconfiguration of my system's locales?

I have found out that the unwanted conversion happens in the
__makeinfo__ function. The html header in the __makeinfo__ output is
then replaced by the template in the generate_html package without
re-encoding.

Oliver Heimlich

Re: generate_html breaks documentation encoding

Julien Bect
On 15/01/2015 23:13, Oliver Heimlich wrote:
> The m-files are encoded in UTF-8 and most documentation strings
> contain non-ASCII characters.
>
> The generate_html command somehow tries to convert the document
> strings from iso-8859-1 to utf-8 and labels the result as iso-8859-1
> in the html header. This is wrong in at least two ways and the
> resulting html page is broken.

Hello Oliver,

About the "iso-8859-1" in the HTML header: it depends on the option
structure that you pass to generate_package_html (). If you use the
"octave-forge" style, then yes, it is automatically labelled as
"iso-8859-1"; see get_html_options ().

[everyone: I can add an optional field in the structure that would allow
the package manager to specify the encoding that he wants. Any thoughts
about that? Does it sound like a good idea?]

About the fact that "The generate_html command somehow tries to convert
the document strings": I don't know about that. Can you provide a
tarball for your package and give me a specific example of this conversion?

[everyone: does anybody know about this conversion? where does this
happen in the package?]

@++
Julien

Re: generate_html breaks documentation encoding

Oliver Heimlich
On 16.01.2015 at 08:33, Julien Bect wrote:

> On 15/01/2015 23:13, Oliver Heimlich wrote:
>> The m-files are encoded in UTF-8 and most documentation strings
>> contain non-ASCII characters.
>>
>> The generate_html command somehow tries to convert the document
>> strings from iso-8859-1 to utf-8 and labels the result as iso-8859-1
>> in the html header. This is wrong in at least two ways and the
>> resulting html page is broken.
>
> Hello Oliver,
>
> About the "iso-8859-1" in the HTML header: it depends on the option
> structure that you pass to generate_package_html (). If you use the
> "octave-forge" style, then yes, it is automatically labelled as
> "iso-8859-1"; see get_html_options ().
>
> [everyone: I can add an optional field in the structure that would allow
> the package manager to specify the encoding that he wants. Any thoughts
> about that? Does it sound like a good idea?]

No. I would just change the encoding of the octave-forge style to utf-8.
Otherwise you would have to deal with characters that are not available
in the target charset. The default output encoding of makeinfo is utf-8.

> About the fact that "The generate_html command somehow tries to convert
> the document strings": I don't know about that. Can you provide a
> tarball for your package and give me a specific example of this conversion?
>
> [everyone: does anybody know about this conversion? where does this
> happen in the package?]

The conversion happens during the call to __makeinfo__ (see my previous
mail), which is a core function of octave. Until then all strings are
encoded correctly in utf-8.
The output of __makeinfo__ is labeled as utf-8, but has been reencoded.
I haven't debugged any further yet.

I use octave 3.8.2. Do you want a tarball of the function files or the
documentation files?

Re: generate_html breaks documentation encoding

Oliver Heimlich
On 16.01.2015 at 19:18, Oliver Heimlich wrote:

> On 16.01.2015 at 08:33, Julien Bect wrote:
>> On 15/01/2015 23:13, Oliver Heimlich wrote:
>>> The m-files are encoded in UTF-8 and most documentation strings
>>> contain non-ASCII characters.
>>>
>>> The generate_html command somehow tries to convert the document
>>> strings from iso-8859-1 to utf-8 and labels the result as iso-8859-1
>>> in the html header. This is wrong in at least two ways and the
>>> resulting html page is broken.
>>
>> About the fact that "The generate_html command somehow tries to convert
>> the document strings": I don't know about that. Can you provide a
>> tarball for your package and give me a specific example of this
>> conversion?
>>
>> [everyone: does anybody know about this conversion? where does this
>> happen in the package?]
>
> The conversion happens during the call to __makeinfo__ (see my previous
> mail), which is a core function of octave. Until then all strings are
> encoded correctly in utf-8.
> The output of __makeinfo__ is labeled as utf-8, but has been reencoded.
> I haven't debugged any further yet.
>
> I use octave 3.8.2. Do you want a tarball of the function files or the
> documentation files?
>

I debugged into the __makeinfo__ function and found the error: The
temporary file that is parsed by the system (cmd) should definitely
carry a @documentencoding utf-8 line at the beginning.

I am going to post a patch in the bug tracker...

Re: generate_html breaks documentation encoding

Julien Bect
On 17/01/2015 09:16, Oliver Heimlich wrote:
> I debugged into the __makeinfo__ function and found the error: The
> temporary file that is parsed by the system (cmd) should definitely
> carry a @documentencoding utf-8 line at the beginning.
>
> I am going to post a patch in the bug tracker...

I think there are several issues here (at least two).


Issue #1: which encoding is supposed to be used in the texinfo
documentation of help functions ?

__makeinfo__.m just adds a minimal header (\input texinfo) and footer
(@bye), letting makeinfo decide.

In other words, currently, Octave doesn't enforce any specific encoding.

My opinion: if you use any "non-standard" character (say, anything
outside the range 0x20-0x7E), you should insert a @documentencoding
statement in your texinfo documentation to be safe.
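
As a rough illustration of that rule of thumb, here is a small Python sketch that flags documentation text containing characters outside 0x20-0x7E. The function name is hypothetical, and treating newlines and tabs as harmless is an assumption of the sketch, not something stated in this thread.

```python
def needs_documentencoding(doc_text):
    """Return True if doc_text contains a character outside 0x20-0x7E.

    Newlines, carriage returns and tabs are allowed as well; that is an
    assumption of this sketch, since line breaks are not an encoding issue.
    """
    allowed_controls = {"\n", "\r", "\t"}
    return any(
        not (0x20 <= ord(ch) <= 0x7E) and ch not in allowed_controls
        for ch in doc_text
    )


print(needs_documentencoding("Compute the midpoint of interval X.\n"))  # False
print(needs_documentencoding("Return true if X ⊆ Y.\n"))                # True
```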

If I understand correctly, you intend to enforce "@documentencoding
utf-8" for all m-files. I don't know about that, but certainly other
people on this list will have an opinion. The discussion will probably
continue on the bug tracker if you propose a patch. An option would be
to add "@documentencoding utf-8" only if another @documentencoding
statement is not already present.
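
The "only if not already present" option could look roughly like the Python sketch below. It is not the code of __makeinfo__.m; it only mimics the minimal header (\input texinfo) and footer (@bye) mentioned above, and the function name and default are hypothetical.

```python
import re


def wrap_texinfo(doc_text, default_encoding="utf-8"):
    """Wrap a documentation string in a minimal Texinfo header and footer.

    A @documentencoding line is added only when the text does not
    already contain one, so an explicit choice by the author wins.
    """
    already_declared = re.search(
        r"^\s*@documentencoding\b", doc_text, flags=re.MULTILINE
    )
    header = "\\input texinfo\n"
    if already_declared is None:
        header += "@documentencoding " + default_encoding + "\n"
    return header + doc_text.rstrip("\n") + "\n@bye\n"


print(wrap_texinfo("Return true if X ⊆ Y."))
print(wrap_texinfo("@documentencoding ISO-8859-1\nPlain text."))
```

The first call gets the default declaration inserted; the second keeps the author's own @documentencoding line and adds nothing.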


Issue #2: generate_package_html() does not honor the "charset=utf-8" in
the output of makeinfo

I think this is a bug: generate_package_html() should honor whichever
encoding comes out of makeinfo.

I will fix this in the generate_html package.
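
To make "honor whichever encoding comes out of makeinfo" concrete, here is a minimal Python sketch of the idea: read the charset from the generated HTML and reuse it when writing the replacement header. The sample HTML is a hand-written stand-in rather than real makeinfo output, and the function name and header template are hypothetical, not taken from generate_html.

```python
import re


def extract_charset(html_text, fallback="utf-8"):
    """Return the charset named in html_text (lowercased), or the fallback."""
    match = re.search(
        r"charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", html_text, re.IGNORECASE
    )
    return match.group(1).lower() if match else fallback


# Stand-in for the HTML produced by makeinfo (illustrative only).
makeinfo_output = (
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    "</head><body>...</body></html>"
)

charset = extract_charset(makeinfo_output)
new_header = (
    '<meta http-equiv="content-type" content="text/html; charset=%s" />' % charset
)
print(new_header)
```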


@++
Julien


Re: generate_html breaks documentation encoding

Oliver Heimlich
On 17.01.2015 at 10:11, Julien Bect wrote:

> On 17/01/2015 09:16, Oliver Heimlich wrote:
>> I debugged into the __makeinfo__ function and found the error: The
>> temporary file that is parsed by the system (cmd) should definitely
>> carry a @documentencoding utf-8 line at the beginning.
>>
>> I am going to post a patch in the bug tracker...
>
> I think there are several issues here (at least two).
>
>
> Issue #1: which encoding is supposed to be used in the texinfo
> documentation of help functions ?
>
> __makeinfo__.m just adds a minimal header (\input texinfo) and footer
> (@bye), letting makeinfo decide.
>
> In other words, currently, Octave doesn't enforce any specific encoding.
>
> My opinion: if you use any "non-standard" character (say, anything
> outside the range 0x20-0x7E), you should insert a @documentencoding
> statement in your texinfo documentation to be safe.

Very good point! This resolves my problem and I will do that.

> If I understand correctly, you intend to enforce "@documentencoding
> utf-8" for all m-files. I don't know about that, but certainly other
> people on this list will have an opinion. The discussion will probably
> continue on the bug tracker if you propose a patch. An option would be
> to add "@documentencoding utf-8" only if another @documentencoding
> statement is not already present.

Since the utf-8 encoding in Octave is standard by accident [1], this
would probably work, because most source files should be encoded in
utf-8. However, one would start to mess with texinfo input encoding.

I like your idea to explicitly label the documentation strings that are
not encoded in us-ascii, which is the "latex way" to solve it.

[1]
http://wiki.octave.org/International_Characters_Support#The_state_of_Octave

> Issue #2: generate_package_html() does not honor the "charset=utf-8" in
> the output of makeinfo
>
> I think this is a bug: generate_package_html() should honor whichever
> encoding comes out of makeinfo.
>
> I will fix this in the generate_html package.

I have checked the HTTP headers of the sourceforge web server. They do
not enforce a particular encoding and the charset information in the
html page can be changed.

Thanks for fixing ... and for your help with issue #1.


Re: [generate_html] Encoding of NEWS file

Julien Bect
*** Please keep the mailing list in cc ***

On 18/01/2015 13:46, Oliver Heimlich wrote:

> Hello Julien,
>
>> Issue #2: generate_package_html() does not honor the "charset=utf-8" in
>> the output of makeinfo
>>
>> I think this is a bug: generate_package_html() should honor whichever
>> encoding comes out of makeinfo.
>>
>> I will fix this in the generate_html package.
>
> I have found another bug that you could fix as well.
>
> In line 533 of generate_package_html.m, where the contents of the
> COPYING.html file are written, please use the insert_char_entities
> function. The line currently reads:
>     fprintf (fid, "<pre>%s</pre>\n\n", contents);
>
> The current version does not escape html special characters, which is
> problematic. The GPLv3 text contains several "<" and ">" characters.
>
> The section where the NEWS file is written already does it right:
>       fprintf (fid, "<pre>%s</pre>\n\n", insert_char_entities
> (news_content));

Yes, I will fix that too.
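
For readers unfamiliar with the problem, the effect of escaping can be shown with a short Python sketch. It uses Python's standard html.escape as a stand-in for insert_char_entities (whose exact behaviour is not shown in this thread); the helper name and the sample line are made up for illustration.

```python
import html


def pre_block(contents):
    """Wrap text in <pre>, escaping &, < and > so they display literally."""
    return "<pre>%s</pre>\n\n" % html.escape(contents, quote=False)


# Hypothetical license-style line with angle brackets around a URL.
license_line = "see <http://www.gnu.org/licenses/>."
print(pre_block(license_line), end="")
```

Without the escaping step a browser would treat the bracketed URL as an unknown tag and drop it from the rendered page.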

Re: [generate_html] Encoding of NEWS file

Julien Bect
On 18/01/2015 17:38, Julien Bect wrote:

> *** Please keep the mailing list in cc ***
>
> On 18/01/2015 13:46, Oliver Heimlich wrote:
>> In line 533 of generate_package_html.m, where the contents of the
>> COPYING.html file are written, please use the insert_char_entities
>> function. The line currently reads:
>>     fprintf (fid, "<pre>%s</pre>\n\n", contents);
>>
>> The current version does not escape html special characters, which is
>> problematic. The GPLv3 text contains several "<" and ">" characters.
>>
>> The section where the NEWS file is written already does it right:
>>       fprintf (fid, "<pre>%s</pre>\n\n", insert_char_entities
>> (news_content));
>
> Yes, I will fix that too.

Done.

https://sourceforge.net/p/octave/generate_html/merge-requests/1/


Re: generate_html breaks documentation encoding

Julien Bect
On 19/02/2015 01:13, Oliver Heimlich wrote:

> Julien,
>
> On 17.01.2015 at 10:11, Julien Bect wrote:
>> I think there are several issues here (at least two).
>
>> Issue #2: generate_package_html() does not honor the "charset=utf-8" in
>> the output of makeinfo
>>
>> I think this is a bug: generate_package_html() should honor whichever
>> encoding comes out of makeinfo.
>>
>> I will fix this in the generate_html package.
>
> Since you didn't fix this yet, I have created a patch. May I push it
> into the repository?
>
> Oliver
Sorry, I was going to look into this in a couple of days.  But of
course you can push this patch.

I think we should also warn (or error?) in case the charset in the
output of makeinfo is overwritten. To make this possible, I would add a
%charset variable so that the charset can be controlled from any
option set, make it default to "utf-8", and then compare it with the
one extracted from makeinfo's output. I will do that soon, but don't
hesitate to push your patch in the meantime.
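
The proposed check could be sketched as follows. This is Python rather than Octave and is not the code that was later committed; the function name, the warning text and the sample input are hypothetical, and the sketch only assumes that the charset can be read from the HTML text with a simple pattern.

```python
import re
import warnings


def check_charset(configured, makeinfo_html):
    """Warn if the configured charset differs from the one makeinfo declared.

    Returns the charset found in makeinfo_html (lowercased), or None if
    the HTML does not declare one.
    """
    match = re.search(
        r"charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", makeinfo_html, re.IGNORECASE
    )
    if match is None:
        return None
    found = match.group(1).lower()
    if found != configured.lower():
        warnings.warn(
            "charset %r from makeinfo would be overwritten by %r"
            % (found, configured)
        )
    return found


# Hand-written stand-in for makeinfo output (illustrative only).
sample = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
check_charset("utf-8", sample)        # matches: no warning
check_charset("iso-8859-1", sample)   # mismatch: emits a UserWarning
```

Whether a mismatch should be a warning or an error is left open in the message above; the sketch picks a warning only because that is the milder of the two.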

@++



Attachment: generate_html-fix-enconding.patch (1K)
Re: generate_html breaks documentation encoding

jbect
On 19/02/2015 07:16, Julien Bect wrote:

> On 19/02/2015 01:13, Oliver Heimlich wrote:
>> Julien,
>>
>> On 17.01.2015 at 10:11, Julien Bect wrote:
>>> I think there are several issues here (at least two).
>>
>>> Issue #2: generate_package_html() does not honor the "charset=utf-8" in
>>> the output of makeinfo
>>>
>>> I think this is a bug: generate_package_html() should honor whichever
>>> encoding comes out of makeinfo.
>>>
>>> I will fix this in the generate_html package.
>>
>> Since you didn't fix this yet, I have created a patch. May I push it
>> into the repository?
>>
>> Oliver
>
> Sorry, I was going to look into this in a couple of days.  But of
> course you can push this patch.
>
> I think we should also warn (or error?) in case the charset in the
> output of makeinfo is overwritten. To make this possible, I would add a
> %charset variable so that the charset can be controlled from any
> option set, make it default to "utf-8", and then compare it with the
> one extracted from makeinfo's output. I will do that soon, but don't
> hesitate to push your patch in the meantime.

For the record: the encoding problem raised in this thread is solved in
the dev version of generate_html (to be released soon).

http://sourceforge.net/p/octave/generate_html/ci/bc4bd4215c680ecc1fec89a772169cd018806659/

http://sourceforge.net/p/octave/generate_html/ci/e9ba76250d9b8ee3456e688f068f4ab959f6fe5d/

UTF-8 is now the default encoding.