Re: char type in Octave


Re: char type in Octave

Rik-4
On 05/17/2018 04:05 AM, [hidden email] wrote:
Subject: Re: Handle encoding of Octave strings
From: mmuetzel
Date: 05/17/2018 03:52 AM

What does Matlab do?  If your choice is different, I am sure that we
will see bug reports about it.
In Matlab:
 str = 'aäbc'
str =
aäbc
str(1)
ans =
a
str(2)
ans =
ä
str(3)
ans =
b
str(4)
ans =
c
whos str
  Name      Size            Bytes  Class    Attributes
  str       1x4                 8  char               


So in Matlab one "char" has a size of 2 bytes. On the other hand, in Octave
one "char" has 1 byte.

This is a known difference.  Matlab uses wide chars (wchar_t), which are 16
bits, rather than regular char (8 bits).
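A quick way to see the same difference at the element level (the values shown
assume the 'ä' arrives as UTF-8; 195 164 are the UTF-8 bytes of U+00E4, and
228 is its Unicode code point):

% Matlab (char elements are UTF-16 code units):
double ('ä')     % ans = 228, and 'ä' is a 1x1 char
% Octave (char elements are UTF-8 bytes):
double ('ä')     % ans = 195  164, and 'ä' is a 1x2 char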

Do we want to change the way Octave stores its char class? Initially I was
in favor of keeping the relation of 1 byte = 1 char (hence using UTF-8). But
it would make indexing more straightforward if we changed to UTF-16 (1
"char" = 2 bytes), at least for the BMP, which encompasses characters from
most current scripts.

A first step towards this could be to add "from_u8", "to_u8" (and "from_u16",
"to_u16") methods to our char class.
Then we would need to identify all places in the code where we construct
char arrays from external sources (.m files, terminal, reading from files,
...) and where we pass strings to external sources (library functions,
writing to files, ...).
When this is done, we might be able to switch the internal representation
from C "char" to "uint16_t" without breaking everything...

Do you think that this is feasible?

If we want perfect compatibility we may be driven this way, but it will be a lot of work.  Part of the point of Octave is to rely on good quality code found in external libraries, so there are a lot of interfaces (regexp in PCRE, file operations in stdlib, font rendering libraries, external programs via pipes like gnuplot, etc.).  Is the gain in compatibility going to be worth the pain of implementing this?

--Rik

Re: char type in Octave

Michael Godfrey


On 05/17/2018 10:55 PM, Rik wrote:
> [...]
> If we want perfect compatibility we may be driven this way, but it will be
> a lot of work.  Part of the point of Octave is to rely on good quality code
> found in external libraries, so there are a lot of interfaces (regexp in
> PCRE, file operations in stdlib, font rendering libraries, external
> programs via pipes like gnuplot, etc.).  Is the gain in compatibility going
> to be worth the pain of implementing this?
>
> --Rik
The arguments against include:

1. A LOT of work.
2. Residual induced bugs lasting probably for years.
3. Compatibility with UTF-8 packages, etc.

Does anyone know what the specific Matlab compatibility cases are?

Michael

Re: char type in Octave

mmuetzel
TL;DR: Let's stay with UTF-8.

Longer version:
I had a (not so) quick look at the code, and the amount of effort for
switching our char representation seems unreasonably high.
If we kept our current 8-bit representation, the main "issue" from a user's
point of view might be with indexing: a user might expect that a char
vector with N characters always has N elements, and that indexing the n-th
element returns the n-th character.
But even if we moved from an 8-bit representation of characters to a 16-bit
representation, we wouldn't be able to represent characters from higher
Unicode planes with one char element. Even if we went one step further and
used a 32-bit representation, there are character modifiers (e.g. combining
accents), so one character could still end up being represented by several
basic elements (whether 8-bit, 16-bit, or 32-bit).
Thus, indexing into character arrays will always be problematic in some
cases, no matter which UTF flavour we use.
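A small illustration of the combining-mark case in current Octave (the byte
values are simply the UTF-8 encodings of the code points involved):

precomposed = 'é';                   % U+00E9, stored as 2 UTF-8 bytes [195 169]
decomposed  = char ([101 204 129]);  % 'e' plus combining acute U+0301, 3 bytes
numel (precomposed)                  % ans = 2
numel (decomposed)                   % ans = 3

Both variants render as the same character, yet neither a 16-bit nor a 32-bit
element type would make the decomposed form a single element.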
I am seconding Rik's and Michael's reasoning and would like to vote for
staying with 8-bit chars.

However, I am still in favor of consistently using and supporting Unicode
(UTF-8) wherever possible.
We could mitigate the possible issue with indexing by providing dedicated
functions. These could help with indexing into char arrays by identifying
the elements that belong to one character.
Something along the lines of:
str = 'aäbc'
str_idx = u8_char_idx(str)

which could result in:
str_idx = [ 1 2 2 3 4 ]

Indexing the n-th character would be as easy as:
str(str_idx==n)
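
u8_char_idx does not exist in Octave yet; a minimal sketch of how such a
helper could be implemented (assuming the char array holds UTF-8 bytes, where
continuation bytes always have the bit pattern 10xxxxxx):

function idx = u8_char_idx (str)
  % Map each byte of STR to the index of the UTF-8 character it belongs to.
  bytes = uint8 (str);
  % Continuation bytes are in the range 0x80-0xBF; every other byte starts
  % a new character.
  is_start = ~(bytes >= 128 & bytes <= 191);
  idx = cumsum (double (is_start));
endfunction

For str = 'aäbc' this returns [1 2 2 3 4], matching the example above.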

That also leads back to my initial question of whether "element-wise"
operators on character arrays like isupper or islower should return an array
of the same size as the input. IMHO they should.

Markus





Re: char type in Octave

Rik-4
On 05/24/2018 09:00 AM, [hidden email] wrote:
Subject: Re: char type in Octave
From: mmuetzel
Date: 05/24/2018 08:29 AM

> TL;DR: Let's stay with UTF-8.
> [...]
> Thus, indexing into character arrays will always be problematic in some
> cases, no matter which UTF flavour we use.
> I am seconding Rik's and Michael's reasoning and would like to vote for
> staying with 8-bit chars.
I do think that is a good idea.  And UTF-8 is well understood, which means we don't need to work out a solution from scratch.  There must be loads of other programs that have made the transition, and we can use the same strategy they did.


> However, I am still in favor of consistently using and supporting Unicode
> (UTF-8) wherever possible.
> [...]
> That also leads back to my initial question of whether "element-wise"
> operators on character arrays like isupper or islower should return an
> array of the same size as the input. IMHO they should.

Yes, one core programming idea is the Principle of Least Surprise (https://en.wikipedia.org/wiki/Principle_of_least_astonishment).  As a programmer I would be very surprised--even upset--if I called a function like toupper with a 5-byte string, and it came back as a 10-byte string.

--Rik


Re: char type in Octave

mmuetzel
> As a programmer I would be very surprised--even upset--if I called a
> function like toupper with a 5-byte string, and it came back as a 10-byte
> string.

Unfortunately, for the "toupper" and "tolower" functions we don't have much
choice but to use what Unicode has defined. Consider e.g. the uppercase
character U+0130 "İ", which is represented in UTF-8 by two bytes (C4 B0). Its
lowercase version is U+0069 "i", which is only one byte long in UTF-8. (Same
but vice versa for lowercase U+0131 "ı" and its uppercase U+0049 "I".)
If such a case occurs (i.e. the size of the result wouldn't match the size of
the input), I chose to emit a warning and fall back to the non-Unicode-aware
standard library functions (see bug #53873).
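
For illustration, the byte-level view of those code points in Octave's
current UTF-8 representation (assuming the script itself is UTF-8 encoded):

double ('İ')   % ans = 196  176   -> U+0130 occupies two UTF-8 bytes
double ('i')   % ans = 105        -> its lowercase U+0069 occupies one byte
double ('ı')   % ans = 196  177   -> U+0131 occupies two UTF-8 bytes
double ('I')   % ans = 73         -> its uppercase U+0049 occupies one byte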

For "islower" and "isupper" (and other similar functions), I'll try to stick
to that Principle of Least Surprise.

Markus



