Handle encoding of Octave strings

Handle encoding of Octave strings

mmuetzel
Octave Developers,

At the moment Octave strings are parsed as if they were a simple byte
stream. That means a (non-ASCII) character can be represented differently
depending on the encoding of the file the string comes from.
However, the user generally doesn't want (and shouldn't need) to care about
the byte representation of a character. A character should always represent
that character, no matter the encoding of the source file.
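
To make the problem concrete, here is a small sketch in Python (not Octave code; the character and the two encodings are picked only for illustration). It shows the same character ending up as different bytes depending on the encoding of the file it was saved in:

```python
# Illustration only: the same character stored under two file encodings.
ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

latin1_bytes = ch.encode("latin-1")  # one byte
utf8_bytes = ch.encode("utf-8")      # two bytes

print(latin1_bytes)  # b'\xe4'
print(utf8_bytes)    # b'\xc3\xa4'

# A parser that treats the file as a plain byte stream sees two different
# "strings" here, although the user typed the same character.
same_bytes = latin1_bytes == utf8_bytes
print(same_bytes)    # False
```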

At the moment, we don't know the encoding of an Octave string when we handle
its content. That can lead to problems (e.g. bug #51210, bug #53646, ...).

To get things more consistent, I'd like to propose that the parser (or
lexer?) should take care of converting any source string to an encoding that
covers all Unicode characters when parsing m-files. Matlab uses UTF-16 (or
more specifically UCS-2). But since UTF-8 seems to be the predominant encoding
on Linux-y systems, I'd like to propose that we use that.

In a next step, we could take care of converting the strings to whatever
encoding we need when we pass them on (e.g. to UTF-32 for FreeType or to
UTF-16 for Qt).

Any opinions? Hints where that should go?

Markus



--
Sent from: http://octave.1599824.n4.nabble.com/Octave-Maintainers-f1638794.html

Re: Handle encoding of Octave strings

John W. Eaton
Administrator
On 04/15/2018 06:40 AM, mmuetzel wrote:

> Octave Developers,
>
> At the moment Octave strings are parsed as if they were a simple byte
> stream. That means a (non-ASCII) character can be represented differently
> depending on the encoding of the file the string comes from.
> However, the user generally doesn't want (and shouldn't need) to care about
> the byte representation of a character. A character should always represent
> that character, no matter the encoding of the source file.
>
> At the moment, we don't know the encoding of an Octave string when we handle
> its content. That can lead to problems (e.g. bug #51210, bug #53646, ...).
>
> To get things more consistent, I'd like to propose that the parser (or
> lexer?) should take care of converting any source string to an encoding that
> covers all Unicode characters when parsing m-files. Matlab uses UTF-16 (or
> more specifically UCS-2). But since UTF-8 seems to be the predominant encoding
> on Linux-y systems, I'd like to propose that we use that.
>
> In a next step, we could take care of converting the strings to whatever
> encoding we need when we pass them on (e.g. to UTF-32 for FreeType or to
> UTF-16 for Qt).
>
> Any opinions? Hints where that should go?

I agree that we need to do something about this issue.

Should we care about exact compatibility with Matlab?

Is there a way to make this change incrementally?

jwe

Re: Handle encoding of Octave strings

mmuetzel
One advantage of using UTF-8 as the internal encoding would be that the
change would be less intrusive. The character matrix type could keep being
stored in an 8 bit char type.
I don't think that we need to be Matlab compatible at that level. Even if we
decided to do so later, the switchover would probably be even easier once
the interfaces for conversion are there.
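
As a rough illustration of what keeping an 8-bit element type means (a Python sketch, not Octave code; the sample word is arbitrary): every UTF-8 code unit fits into 8 bits, but the number of stored elements can be larger than the number of characters.

```python
# Illustration only: UTF-8 text fits in an array of 8-bit elements, but the
# number of elements (bytes) can exceed the number of characters.
text = "Gr\u00fc\u00dfe"  # "Gruesse" written with u-umlaut and sharp s

stored = list(text.encode("utf-8"))  # what an 8-bit char array would hold

n_chars = len(text)    # 5 characters
n_bytes = len(stored)  # 7 elements: the two non-ASCII letters take 2 bytes each

all_8bit = all(0 <= b <= 255 for b in stored)
round_trip = bytes(stored).decode("utf-8")

print(n_chars, n_bytes, all_8bit)  # 5 7 True
```

So anything that counts or indexes elements of such an array would be working on bytes rather than on characters.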

We already have the conversion functions to and from UTF-8 in liboctave (via
our wrapper functions to gnulib: "octave_u8_conv_to_encoding" and
"octave_u8_conv_from_encoding"). So that means the first step is already
done.

We still need a way to determine the encoding of the strings in the m-file.
In the GUI that could be linked to the encoding setting of the editor. For
the CLI, we would need a way for setting the "source encoding". We could
probably define an interface in liboctave to set and to change the source
encoding (the default would be the system encoding).
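
A minimal sketch of what such a setting could look like, written in Python purely for illustration. The names `set_source_encoding`, `get_source_encoding` and `source_bytes_to_utf8` are hypothetical and not an existing Octave or liboctave interface:

```python
import codecs
import locale

# Hypothetical names, for illustration only; not an existing Octave interface.
_source_encoding = None  # None means "fall back to the system encoding"


def set_source_encoding(name):
    """Set the encoding assumed for m-file sources."""
    global _source_encoding
    codecs.lookup(name)  # raises LookupError for an unknown encoding name
    _source_encoding = name


def get_source_encoding():
    """Return the configured source encoding, or the system default."""
    return _source_encoding or locale.getpreferredencoding(False)


def source_bytes_to_utf8(raw):
    """Convert raw bytes read from a source file to UTF-8 bytes."""
    return raw.decode(get_source_encoding()).encode("utf-8")


set_source_encoding("latin-1")
print(source_bytes_to_utf8(b"\xe4"))  # b'\xc3\xa4'
```

In the GUI, the editor's encoding setting could feed the same setter, while the CLI would need some other way to call it.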

I don't know whether it is possible to distinguish between variables being
read from m-files (any encoding possible) or defined from the command window
(always UTF-8 as it stands right now). Finding a way to distinguish that
might be another step.

Once that is done, we could apply the conversion to strings that are read
from .m files.

More or less independent of the above steps is finding the places where we
would need to convert from the internal encoding (e.g. to UTF-32 for
FreeType and to UTF-16 for Qt) and actually implement that. Conversion
functions for these operations are also available in gnulib ("u8-to-u16" and
"u8-to-u32").
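
To show what these outgoing conversions produce, here is a Python sketch (it does not use gnulib, and the helper names `utf8_to_utf16_units` and `utf8_to_utf32_units` are made up for this example). Note that a character outside the Basic Multilingual Plane becomes two UTF-16 code units but a single UTF-32 unit:

```python
# Illustration only; the helper names are hypothetical.

def utf8_to_utf16_units(data):
    """Return the UTF-16 code units for UTF-8 encoded bytes."""
    raw = data.decode("utf-8").encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_to_utf32_units(data):
    """Return the UTF-32 code units (code points) for UTF-8 encoded bytes."""
    return [ord(c) for c in data.decode("utf-8")]


# "a", EURO SIGN, and one character outside the Basic Multilingual Plane
internal = "a\u20ac\U0001f600".encode("utf-8")

print([hex(u) for u in utf8_to_utf16_units(internal)])
# ['0x61', '0x20ac', '0xd83d', '0xde00']
print([hex(u) for u in utf8_to_utf32_units(internal)])
# ['0x61', '0x20ac', '0x1f600']
```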


