Unicode support in io Forge package

apjanke-floss
Hi, Octave and io maintainers,

I'm confused by the Unicode support in the io package. In particular,
the functions unicode2utf8 and utf82unicode, and the "encode_utf"
options in some of the ods/xls read/write functions.

What is the encoding that utf82unicode/unicode2utf8 are calling
"unicode" here? It looks like it's doing a single-byte encoding,
treating each byte as an unsigned int 0-255, and treating those 0-255
values directly as Unicode code point values. That's not any of the
standard Unicode encodings. (But I think it is exactly the same as
Latin-1/ISO 8859-1.)
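
To make the distinction concrete, here is a minimal sketch. It is not the io package's code; it only uses core Octave's native2unicode and unicode2native (available in Octave 4.4 and later), and the code point U+00E9 is an arbitrary example. In Latin-1 that character is the single byte 233, which equals its code point value, while in UTF-8 it takes two bytes.

```octave
## Illustrative only: U+00E9 (e with acute accent) is an arbitrary example.
latin1_bytes = uint8 (233);           # Latin-1: byte value == code point value

## Decode the Latin-1 byte into Octave's internal UTF-8 char representation.
utf8_str = native2unicode (latin1_bytes, "ISO-8859-1");
utf8_bytes = double (utf8_str(:).');  # the two UTF-8 bytes, 195 and 169

## Encoding the UTF-8 string back to Latin-1 recovers the single byte.
roundtrip = unicode2native (utf8_str, "ISO-8859-1");
```

A function that treats each input byte as a code point is therefore only correct for Latin-1 input, which matches the suspicion above.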

As I understand it, since about Octave 4.4, Octave's internal encoding
(that is, how it interprets Octave char arrays) is either UTF-8 or an
opaque array of bytes; it's never in the "system code page" or some
other locale-specific encoding.

Is this UTF-8 support in io still relevant/correct? Maybe it should be
deprecated or renamed/removed? Since Octave now supports UTF-8, I think
you'd want to just leave UTF-8 text as is in all cases.

Cheers,
Andrew

Re: Unicode support in io Forge package

mmuetzel
Andrew,

IIRC, the interface uses UTF-16. The conversion function really only works for Latin-1 encoded input.
There really is no UTF-8 in this. TBH, I chose the name before I had a sufficient grasp of that encoding mess.

Forge packages usually target a wider range of Octave versions. I don't know whether this workaround can be safely removed without losing support for Latin-1 in older Octave versions supported by io.
I didn't re-read the code, but I believe that "unicode2native" is used if it is available.

Markus

PS: Sorry for top-posting. My mobile phone app doesn't allow otherwise.

Re: Unicode support in io Forge package

mmuetzel
In reply to this post by apjanke-floss
Looking at the code again, I believe that the aim was to map the Latin-1 subset of UTF-16.
But there might also be something wrong with the conversion function that is set up in oct2xls.m at around line 197.
Maybe I'll have time to think about it again in the next few days.

Markus

Re: Unicode support in io Forge package

PhilipNienhuis
In reply to this post by apjanke-floss
apjanke-floss wrote

> Hi, Octave and io maintainers,
>
> I'm confused by the Unicode support in the io package. In particular,
> the functions unicode2utf8 and utf82unicode, and the "encode_utf"
> options in some of the ods/xls read/write functions.
>
> What is the encoding that utf82unicode/unicode2utf8 are calling
> "unicode" here? It looks like it's doing a single-byte encoding,
> treating each byte as an unsigned int 0-255, and treating those 0-255
> values directly as Unicode code point values. That's not any of the
> standard Unicode encodings. (But I think it is exactly the same as
> Latin-1/ISO 8859-1.)
>
> As I understand it, since about Octave 4.4, Octave's internal encoding
> (that is, how it interprets Octave char arrays) is either UTF-8 or an
> opaque array of bytes; it's never in the "system code page" or some
> other locale-specific encoding.
>
> Is this UTF-8 support in io still relevant/correct? Maybe it should be
> deprecated or renamed/removed? Since Octave now supports UTF-8, I think
> you'd want to just leave UTF-8 text as is in all cases.

AFAIR, an option needs to be set explicitly for unicode2utf8 and utf82unicode
to be applied.
I've also forgotten why it was included (and have no time to dive into the
Mercurial logs now), but there surely was a good reason for it, such as bug
reports.

In core Octave there are native2unicode and unicode2native; maybe those are
better alternatives.

Philip




--
Sent from: https://octave.1599824.n4.nabble.com/Octave-Maintainers-f1638794.html

Re: Unicode support in io Forge package

apjanke-floss


On 10/19/19 5:51 AM, PhilipNienhuis wrote:

> AFAIR, an option needs to be set explicitly for unicode2utf8 and utf82unicode
> to be applied.
> I've also forgotten why it was included (and have no time to dive into the
> Mercurial logs now), but there surely was a good reason for it, such as bug
> reports.
>
> In core Octave there are native2unicode and unicode2native; maybe those are
> better alternatives.

The io code uses native2unicode as an alternative if it's available,
using a feature test. Here's an example from xls2oct.m:


   ## Convert from UTF-8 and strip characters that are not supported by Octave
   ## (any chars < 32 or > 255).
   if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
     if (exist ("native2unicode", "file"))
       conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
     else
       conv_fcn = @utf82unicode;
     endif
     rawarr = tidyxml (rawarr, conv_fcn);
   endif

This is leaving me even more confused: I'm not sure what the round trip
through both native2unicode and unicode2native accomplishes, especially
since native2unicode converts from the specified code page to UTF-8, so
doing native2unicode(str, "UTF-8") should basically be a no-op.
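
The two steps can be checked in isolation with a small sketch. The assumptions here: the input bytes are an invented example, and unicode2native is given an explicit "ISO-8859-1" codepage so that the result does not depend on the machine. The quoted code omits that argument, which, as I read the core documentation, makes it fall back to the system default codepage.

```octave
## Bytes of a UTF-8 encoded string: "a", then U+00E9 (195 169), then "b".
in_bytes = uint8 ([97 195 169 98]);

## Step 1: "decode" UTF-8 into Octave's internal representation (also UTF-8).
str = native2unicode (in_bytes, "UTF-8");
same_bytes = isequal (double (str(:).'), double (in_bytes));  # no-op check

## Step 2: re-encode to a single-byte codepage. The codepage is explicit here;
## the quoted io code omits it and so depends on the system default.
out_bytes = unicode2native (str, "ISO-8859-1");
out_vals = double (out_bytes(:).');   # one value per character in Latin-1
```

So the first call leaves valid UTF-8 unchanged, and all of the actual transcoding happens in the unicode2native call.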

Putting aside the first native2unicode call, I _think_ the use of
unicode2native here is incorrect, because even on Windows, Octave's
internal strings are now UTF-8 and not the system default code page. I'm
going to do some more research and set up some test spreadsheets, but I
suspect all the encoding conversion logic here should just be removed.

Cheers,
Andrew

Re: Unicode support in io Forge package

mmuetzel
On 19 October 2019 at 20:35, "Andrew Janke" wrote:

> The io code uses native2unicode as an alternative if it's available,
> using a feature test. Here's an example from xls2oct.m:
>
>
>    ## Convert from UTF-8 and strip characters that are not supported by Octave
>    ## (any chars < 32 or > 255).
>    if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
>      if (exist ("native2unicode", "file"))
>        conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
>      else
>        conv_fcn = @utf82unicode;
>      endif
>      rawarr = tidyxml (rawarr, conv_fcn);
>    endif
>
> This is leaving me even more confused: I'm not sure what the round trip
> through both native2unicode and unicode2native accomplishes, especially
> since native2unicode converts from the specified code page to UTF-8, so
> doing native2unicode(str, "UTF-8") should basically be a no-op.
>
> Putting aside the first native2unicode call, I _think_ the use of
> unicode2native here is incorrect, because even on Windows, Octave's
> internal strings are now UTF-8 and not the system default code page. I'm
> going to do some more research and set up some test spreadsheets, but I
> suspect all the encoding conversion logic here should just be removed.
>

Please, ignore my previous messages.
I think you are right! I also believe it should be removed completely. The XML in the .xlsx files is encoded in UTF-8 (always?) and that is Octave's internal encoding. No transcoding should be done at all.
The code was originally introduced for bug #49222:
https://savannah.gnu.org/bugs/?49222
It's embarrassing to re-read how I initially completely misunderstood the issue and came up with a fix that seemed to work (on a Western-locale Windows system) back then.
If I correctly understand the last few comments, the problem was (or is?) that UTF-8 encoded strings weren't displayed correctly on legacy Windows. But I don't think that the io package should interfere with the encoding of the strings it reads to work around this.
If this is still an issue (it isn't on Windows 10), it should be resolved differently.

Markus



Re: Unicode support in io Forge package

PhilipNienhuis
mmuetzel wrote

> Please, ignore my previous messages.
> I think you are right! I also believe it should be removed completely. The
> XML in the .xlsx files is encoded in UTF-8 (always?) and that is Octave's
> internal encoding. No transcoding should be done at all.
> The code was originally introduced for bug #49222:
> https://savannah.gnu.org/bugs/?49222
> It's embarrassing to re-read how I initially completely mis-understood the
> issue and came up with a fix that seemed to work (on a western Windows)
> back then.

Please don't judge yourself too harshly :-) At the time we both agreed the
fix worked, and that it worked is what counts. Things can always be done
better in hindsight.

Looking back at my own code in the io package, I would also do many things
quite differently.
E.g., collapsing the separate but largely identical ods and xls code sets
into one is a fix for an IMO big mistake I made early on.


> If I correctly understand the last few comments, the problem was (or is?)
> that UTF-8 encoded strings weren't displayed correctly on legacy Windows.
> But I don't think that the io package should interfere with the encoding
> of the strings it reads to work around this.
> If this is still an issue (it isn't on Windows 10), it should be resolved
> differently.

In an earlier post in this thread you wrote something about legacy systems
and releases.
io is to be backwards-compatible to Octave-4.0.0 and 32-bit systems, so we
need to be sure it's still working there.

Support for Windows 7 formally ends January next year but I'd like to keep
io working on Win7 a little longer.

I'm too unfamiliar with Linux & BSDs to know if there are any distros that
might be affected (I've been using Mageia and predecessors for decades and
that has always been fairly up-to-date).

Philip




Re: Unicode support in io Forge package

apjanke-floss


On 10/20/19 6:13 AM, PhilipNienhuis wrote:

> mmuetzel wrote
>> If I correctly understand the last few comments, the problem was (or is?)
>> that UTF-8 encoded strings weren't displayed correctly on legacy Windows.
>> But I don't think that the io package should interfere with the encoding
>> of the strings it reads to work around this.
>> If this is still an issue (it isn't on Windows 10), it should be resolved
>> differently.
>
> In an earlier post in this thread you wrote something about legacy systems
> and releases.
> io is to be backwards-compatible to Octave-4.0.0 and 32-bit systems, so we
> need to be sure it's still working there.
>
> Support for Windows 7 formally ends January next year but I'd like to keep
> io working on Win7 a little longer.

So it sounds like we need to find an authoritative answer for which
Octave version had pretty much switched over to UTF-8 internal
representations, and wrap all this Unicode conversion stuff inside a
ver_less_than() test?

Unless someone knows this offhand, that might be hard to determine, since
Octave didn't use to be formal about documenting how it handled character
encodings in chars.

Cheers,
Andrew

Re: Unicode support in io Forge package

mmuetzel
On 20 October 2019 at 15:44, "Andrew Janke" wrote:

> On 10/20/19 6:13 AM, PhilipNienhuis wrote:
> > mmuetzel wrote
> >> If I correctly understand the last few comments, the problem was (or is?)
> >> that UTF-8 encoded strings weren't displayed correctly on legacy Windows.
> >> But I don't think that the io package should interfere with the encoding
> >> of the strings it reads to work around this.
> >> If this is still an issue (it isn't on Windows 10), it should be resolved
> >> differently.
> >
> > In an earlier post in this thread you wrote something about legacy systems
> > and releases.
> > io is to be backwards-compatible to Octave-4.0.0 and 32-bit systems, so we
> > need to be sure it's still working there.
> >
> > Support for Windows 7 formally ends January next year but I'd like to keep
> > io working on Win7 a little longer.
>
> So it sounds like we need to find an authoritative answer for which
> Octave version had pretty much switched over to UTF-8 internal
> representations, and wrap all this Unicode conversion stuff inside a
> ver_less_than() test?
>
> Unless someone knows this offhand, might be kind of hard since Octave
> didn't used to be formal about discussing how it handled character
> encodings in chars.

Unfortunately, I have to agree again. I think it is very hard to pinpoint an exact version at which we switched to using UTF-8 consistently. In earlier versions, one had to use a bunch of different work-arounds to use non-ASCII characters in different situations, most of them not compatible with each other. And for some situations, there just was no work-around at all.
I think that the transition to using UTF-8 more or less consistently still isn't over. If I remember correctly, it started around or shortly after the work on bug #49222 - so about 3 years ago.
Maybe, we could "define" the change for bug #43099 to be the decisive step for this particular bug (displaying strings in the command window). So if we really want to keep the old work-around (I'm not sure I want to vote for it since it breaks other things), I'd vote for disabling it for Octave version 4.4 and later.
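
A version gate along those lines could look like the following sketch. It rests on assumptions: the "4.4.0" cutoff is only the suggestion made above, the variable names are invented for illustration, and the legacy branch calls io's utf82unicode through feval so that the snippet still runs when io is not loaded. compare_versions and OCTAVE_VERSION are core Octave functions.

```octave
## Assumed cutoff, following the 4.4 suggestion in this thread.
cutoff = "4.4.0";
apply_workaround = compare_versions (OCTAVE_VERSION (), cutoff, "<");

if (apply_workaround)
  ## Legacy Octave: defer to the io package's converter. It is resolved
  ## lazily via feval, so io need not be loaded to define the handle.
  conv_fcn = @(str) feval ("utf82unicode", str);
else
  ## Octave 4.4 and later: leave the UTF-8 text untouched.
  conv_fcn = @(str) str;
endif
```

Keeping the decision in a single flag means the cutoff can be moved, or the legacy branch dropped, without touching the read/write code paths.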

Markus


Re: Unicode support in io Forge package

apjanke-floss


On 10/20/19 7:15 AM, "Markus Mützel" wrote:

> On 20 October 2019 at 15:44, "Andrew Janke" wrote:
>> So it sounds like we need to find an authoritative answer for which
>> Octave version had pretty much switched over to UTF-8 internal
>> representations, and wrap all this Unicode conversion stuff inside a
>> ver_less_than() test?
>>
>> Unless someone knows this offhand, might be kind of hard since Octave
>> didn't used to be formal about discussing how it handled character
>> encodings in chars.
>
> Unfortunately, I have to agree again. I think it is very hard to pinpoint an exact version at which we switched to using UTF-8 consistently. In earlier versions, one had to use a bunch of different work-arounds to use non-ASCII characters in different situations, most of them not compatible with each other. And for some situations, there just was no work-around at all.
> I think that the transition to using UTF-8 more or less consistently still isn't over. If I remember correctly, it started around or shortly after the work on bug #49222 - so about 3 years ago.
> Maybe, we could "define" the change for bug #43099 to be the decisive step for this particular bug (displaying strings in the command window). So if we really want to keep the old work-around (I'm not sure I want to vote for it since it breaks other things), I'd vote for disabling it for Octave version 4.4 and later.

Okay. I have no opinion as to where to make that cutoff or whether to
keep it for older versions.

Andrew