Handle encoding of Octave strings

Handle encoding of Octave strings

mmuetzel
Octave Developers,

At the moment, Octave strings are parsed as if they were a simple byte
stream. That means a (non-ASCII) character can be represented by different
bytes depending on the encoding of the file the string comes from.
However, the user generally doesn't want (and shouldn't need) to care about
the byte representation of a character. A character should always represent
that character, no matter the encoding of the source file.
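
To make the problem concrete, here is a minimal sketch in Python (used only
for illustration; the helper name `encoded_bytes` is made up for this
example). It shows that the same character ends up as different bytes
depending on the encoding of the file it was saved in:

```python
def encoded_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as a list of integers."""
    return list(text.encode(encoding))


# The same character, stored in two differently encoded source files.
print(encoded_bytes("ä", "latin-1"))  # [228]
print(encoded_bytes("ä", "utf-8"))    # [195, 164]
```

A parser that treats the file as a plain byte stream would hand these two
byte sequences to the interpreter as two different strings.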

At the moment, we don't know the encoding of an Octave string when we handle
its content. That can lead to problems (e.g. bug #51210, bug #53646, ...).

To make things more consistent, I'd like to propose that the parser (or
lexer?) take care of converting any source string to an encoding that
covers all Unicode characters when parsing m-files. Matlab uses UTF-16 (or
more specifically UCS-2). But since UTF-8 seems to be the predominant
encoding on Linux-y systems, I propose that we use that.

In a next step, we could take care of converting the strings to whatever
encoding we need when we pass them on (e.g. to UTF-32 for FreeType or to
UTF-16 for Qt).

Any opinions? Hints where that should go?

Markus



--
Sent from: http://octave.1599824.n4.nabble.com/Octave-Maintainers-f1638794.html

Re: Handle encoding of Octave strings

John W. Eaton
Administrator
On 04/15/2018 06:40 AM, mmuetzel wrote:

> Octave Developers,
>
> At the moment Octave strings are parsed as if they were a simple byte
> stream. That means a (non-ASCII) character can be represented differently
> depending on the encoding of the file the string comes from.
> However, generally the user doesn't want (and shouldn't need) to care about
> byte representation of a character. A character should always represent that
> character no matter the encoding of the source file.
>
> At the moment, we don't know the encoding of an Octave string when we handle
> its content. That can lead to problems (e.g. bug #51210, bug #53646, ...).
>
> To get things more consistent, I'd like to propose that the parser (or
> lexer?) should take care of converting any source string to an encoding that
> covers all Unicode characters when parsing m-files. Matlab uses UTF-16 (or
> more specifically UCS-2). But since UTF-8 seems the predominant encoding on
> Linux-y systems, I'd like to propose, we use that.
>
> In a next step, we could take care of converting the strings to whatever
> encoding we need when we pass it on (e.g. to UTF-16 for FreeType or Qt).
>
> Any opinions? Hints where that should go?

I agree that we need to do something about this issue.

Should we care about exact compatibility with Matlab?

Is there a way to make this change incrementally?

jwe

Re: Handle encoding of Octave strings

mmuetzel
One advantage of using UTF-8 as the internal encoding would be that the
change would be less intrusive. The character matrix type could continue to
be stored in an 8-bit char type.
I don't think that we need to be Matlab compatible at that level. Even if we
decided to do so later, the switchover would probably be even easier once
the interfaces for conversion are there.

We already have the conversion functions to and from UTF-8 in liboctave (via
our wrapper functions to gnulib: "octave_u8_conv_to_encoding" and
"octave_u8_conv_from_encoding"). So that means the first step is already
done.

We still need a way to determine the encoding of the strings in the m-file.
In the GUI that could be linked to the encoding setting of the editor. For
the CLI, we would need a way for setting the "source encoding". We could
probably define an interface in liboctave to set and to change the source
encoding (the default would be the system encoding).

I don't know whether it is possible to distinguish between variables being
read from m-files (any encoding possible) or defined from the command window
(always UTF-8 as it stands right now). Finding a way to distinguish that
might be another step.

Once that is done, we could apply the conversion to strings that are read
from .m files.
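
A rough sketch of that conversion step, in Python for illustration only. The
function name `to_internal_utf8` and its `source_encoding` parameter are
hypothetical; in Octave itself the work would presumably go through the
liboctave wrappers mentioned above:

```python
def to_internal_utf8(raw, source_encoding="utf-8"):
    """Decode file bytes from `source_encoding` and re-encode them as UTF-8.

    `source_encoding` stands in for the configurable "source encoding"
    setting discussed above (assumption: it names a codec Python knows).
    """
    return raw.decode(source_encoding).encode("utf-8")


# The same line of an .m file, saved once as Latin-1 and once as UTF-8.
latin1_line = "x = 'ä';".encode("latin-1")
utf8_line = "x = 'ä';".encode("utf-8")

# After conversion, both files yield identical internal bytes.
print(to_internal_utf8(latin1_line, "latin-1") == to_internal_utf8(utf8_line))
```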

More or less independent of the above steps is finding the places where we
would need to convert from the internal encoding (e.g. to UTF-32 for
FreeType and to UTF-16 for Qt) and actually implementing that. Conversion
functions for these operations are also available in gnulib ("u8-to-u16" and
"u8-to-u32").



Re: Handle encoding of Octave strings

mmuetzel
I opened a bug report with a patch for the initial step of reading .m files
with arbitrary character encoding [1].

If this is accepted, the next step could be to nicely integrate the feature
into the GUI, maybe by linking it to the character encoding selected for
the file editor.

[1]: https://savannah.gnu.org/bugs/index.php?53842



Re: Handle encoding of Octave strings

mmuetzel
I would like to make "islower" and "isupper" Unicode aware.
At the moment, I see the following:
octave:1> islower ('ä')
ans =

  0  0

Since we are using UTF-8 for character arrays, the single lower-case letter
"ä" is represented by two bytes:
octave:2> size ('ä')
ans =

   1   2

Should islower('ä') return true(1,2) or true(1,1)? I am tending towards the
former.
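
For illustration, the true(1,2) variant could be sketched like this in
Python (not Octave code; `islower_per_byte` is a made-up name). Every byte of
a character gets the result for the whole character, so the mask keeps the
size of the byte array:

```python
def islower_per_byte(data):
    """Return one boolean per byte of the UTF-8 string `data`.

    A byte is marked True if the character it belongs to is lower case.
    """
    mask = []
    for ch in data.decode("utf-8"):
        width = len(ch.encode("utf-8"))  # number of bytes of this character
        mask.extend([ch.islower()] * width)
    return mask


print(islower_per_byte("ä".encode("utf-8")))  # [True, True]
```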

This leads to the bigger question: How should indexing on (multi-byte)
character arrays work? At the moment, a user has to be somewhat aware of the
fact that Octave uses UTF-8:
octave:3> str = "aäbc"
str = aäbc
octave:4> str(1)
ans = a
octave:5> str(2)
ans = �
octave:6> str(3)
ans = �
octave:7> str(4)
ans = b
octave:8> str(2:3)
ans = ä

To index the second character in the string, the user has to access the
second and(!) third element. The third character is indexed with the fourth
element and so forth.
Is this OK?
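
The index arithmetic a user currently has to do by hand can be sketched as
follows (Python, for illustration only; `char_byte_ranges` is a hypothetical
helper). It lists, for each character, the 1-based range of elements it
occupies in a UTF-8 byte array:

```python
def char_byte_ranges(data):
    """Return a 1-based (first, last) element range for each character."""
    ranges = []
    start = 1
    for ch in data.decode("utf-8"):
        width = len(ch.encode("utf-8"))  # bytes used by this character
        ranges.append((start, start + width - 1))
        start += width
    return ranges


print(char_byte_ranges("aäbc".encode("utf-8")))
# [(1, 1), (2, 3), (4, 4), (5, 5)]
```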

Markus



Re: Handle encoding of Octave strings

John W. Eaton
Administrator
On 05/16/2018 04:10 PM, mmuetzel wrote:

> I would like to make "islower" and "isupper" Unicode aware.
> At the moment, I see the following:
> octave:1> islower ('ä')
> ans =
>
>    0  0
>
> Since we are using UTF-8 for character arrays, the single lower-case letter
> "ä" is represented by two bytes:
> octave:2> size ('ä')
> ans =
>
>     1   2
>
> Should islower('ä') return true(1,2) or true(1,1)? I am tending towards the
> former.
>
> This leads to the bigger question: How should indexing on (multi-byte)
> character arrays work? At the moment, a user has to be somewhat aware of the
> fact that Octave uses UTF-8:
> octave:3> str = "aäbc"
> str = aäbc
> octave:4> str(1)
> ans = a
> octave:5> str(2)
> ans = �
> octave:6> str(3)
> ans = �
> octave:7> str(4)
> ans = b
> octave:8> str(2:3)
> ans = ä
>
> To index the second character in the string, the user has to access the
> second and(!) third element. The third character is indexed with the fourth
> element and so forth.
> Is this OK?

What does Matlab do?  If your choice is different, I am sure that we
will see bug reports about it.

jwe

Re: Handle encoding of Octave strings

nrjank
On Wed, May 16, 2018 at 5:24 PM, John W. Eaton <[hidden email]> wrote:
> On 05/16/2018 04:10 PM, mmuetzel wrote:
>>
> What does Matlab do?  If your choice is different, I am sure that we will
> see bug reports about it.

Matlab 2017b does not have an islower or isupper implementation.

I do get:

>> lower ('ä')
ans =
    'ä'
>> upper ('ä')
ans =
    'Ä'

On Windows, it seems I'm unable to copy and paste an ä, which seems to be
char(228), or an 'Ä', which is char(196). Octave gives me:

>> islower(char(228))
ans = 0
>> isupper(char(228))
ans = 0
>> upper(char(228))
ans =
>> lower(char(228))
ans =

>> upper(char(228)) - lower(char(228))
ans = 0

Not sure whether any of that is meaningful.

Re: Handle encoding of Octave strings

mmuetzel
In reply to this post by John W. Eaton
> What does Matlab do?  If your choice is different, I am sure that we
> will see bug reports about it.

In Matlab:
>>  str = 'aäbc'
str =
aäbc
>> str(1)
ans =
a
>> str(2)
ans =
ä
>> str(3)
ans =
b
>> str(4)
ans =
c
>> whos str
  Name      Size            Bytes  Class    Attributes
  str       1x4                 8  char              


So in Matlab, one "char" has a size of 2 bytes. In Octave, on the other hand,
one "char" has 1 byte.
Do we want to change the way Octave stores its char class? Initially, I was
in favor of keeping the relation of 1 byte = 1 char (hence using UTF-8). But
it would make indexing more straightforward if we changed to UTF-16 (1
"char" = 2 bytes), at least for the BMP (Basic Multilingual Plane), which
encompasses characters from most current scripts.
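
A small Python sketch of the BMP caveat (illustration only; the helper name
`utf16_units` is made up). Characters inside the BMP take one 16-bit code
unit each, so index and character line up; characters outside it take two:

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


print(len(utf16_units("aäbc")))        # 4 units, i.e. 8 bytes as in "whos"
print(len(utf16_units("\U0001F600")))  # 2 units: a surrogate pair
```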

A first step towards this could be to add "from_u8", "to_u8", ("from_u16",
"to_u16") methods to our char class.
Then we would need to identify all places in the code where we construct
char arrays from external sources (.m files, terminal, reading from files,
...) and where we pass strings to external consumers (library functions,
writing to files, ...).
When this is done, we might be able to switch the internal representation
from C-"char" to "uint16_t" without breaking everything...
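
As a toy model of that idea (Python, purely illustrative; the class
`CharArray` and its `element` method are hypothetical, and only the method
names "from_u8" and "to_u8" come from the proposal above), conversion
happens at the boundaries while the internal storage is free to change:

```python
class CharArray:
    """Toy char array that stores UTF-16 code units internally."""

    def __init__(self, units):
        self.units = list(units)

    @classmethod
    def from_u8(cls, data):
        """Build the array from UTF-8 bytes coming from an external source."""
        raw = data.decode("utf-8").encode("utf-16-le")
        return cls(int.from_bytes(raw[i:i + 2], "little")
                   for i in range(0, len(raw), 2))

    def to_u8(self):
        """Return UTF-8 bytes for passing the string to an external consumer."""
        raw = b"".join(unit.to_bytes(2, "little") for unit in self.units)
        return raw.decode("utf-16-le").encode("utf-8")

    def element(self, index):
        """Return the 1-based element `index` as a new CharArray."""
        return CharArray([self.units[index - 1]])


s = CharArray.from_u8("aäbc".encode("utf-8"))
print(s.element(2).to_u8().decode("utf-8"))  # ä
```

Callers that only use the conversion methods would not notice whether the
storage behind them is 8-bit or 16-bit.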

Do you think that this is feasible?

Markus



Re: Handle encoding of Octave strings

mmuetzel
In reply to this post by nrjank
nrjank wrote:
> Matlab 2017b does not have an islower or isupper implementation.

The closest thing to islower in Matlab is probably:
>> a = 'ä';
>> lower(a)==a & upper(a)~=a
ans =
     1
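
The same test, written as a Python sketch for comparison (illustration only;
`looks_lower` is a made-up name and this is not how Octave implements
islower): a character counts as lower case if lower-casing leaves it
unchanged but upper-casing changes it.

```python
def looks_lower(ch):
    """True if `ch` is unchanged by lower() but changed by upper()."""
    return ch.lower() == ch and ch.upper() != ch


print(looks_lower("ä"))  # True
print(looks_lower("Ä"))  # False
print(looks_lower("1"))  # False: digits have no case
```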

> on windows it seems I'm unable to copy paste a ä, which seems to be
> char(228), or 'Ä' which is char(196). Octave gives me:

This is bug #47571:
https://savannah.gnu.org/bugs/index.php?47571

> >> islower(char(228))
> ans = 0

Octave expects UTF-8 encoded characters. So "ä" is char([195 164]), at least
at the moment (see my previous mail).
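
A quick way to see the difference, sketched in Python (illustration only;
`is_valid_utf8` is a hypothetical helper): the single byte 228 is "ä" in
Latin-1 but not a complete UTF-8 sequence, while the pair 195 164 is "ä" in
UTF-8.

```python
def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(bytes([228])))       # False
print(is_valid_utf8(bytes([195, 164])))  # True
print(bytes([228]).decode("latin-1"))    # ä
```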

> >> upper(char(228))
> ans =

Support for UTF-8 encoded strings by "upper" and "lower" is covered in bug
#53873:
https://savannah.gnu.org/bugs/index.php?53873

Markus



