How should we treat invalid UTF-8?

How should we treat invalid UTF-8?

mmuetzel
Hi,

Some time ago, we decided to use UTF-8 as the default encoding in Octave.
In particular, a change to allow (and require!) UTF-8 in regular expressions [1] triggered a few bug reports and questions on the mailing lists that involved invalid UTF-8 (e.g. [2]).
Background: Some characters in UTF-8 are encoded with multiple bytes (e.g. the German umlaut "ä" is encoded as decimal [195 164]). As a consequence of how Unicode codepoints are encoded in UTF-8, there are some byte sequences that cannot be correctly decoded to a Unicode codepoint (e.g. a byte with the decimal value 228 on its own). Such byte sequences are called "invalid".
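For illustration, the same byte-level behavior can be reproduced in Python, whose bytes/str split makes the encoding explicit (this is just a sketch of the concept, not Octave code):

```python
# "ä" occupies two bytes in UTF-8: decimal [195, 164].
umlaut = "ä".encode("utf-8")
print(list(umlaut))          # [195, 164]

# A lone byte 228 (0xE4) is the start of a three-byte sequence, so by
# itself it cannot be decoded to a code point: an invalid byte sequence.
try:
    bytes([228]).decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")
```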
At the moment, we don't have any logic for handling those invalid byte sequences specially. This can lead to a whole lot of different errors and is not limited to the regexp family of functions. E.g. entering "char (228)" at the Octave prompt leads to a replacement character ("�") being displayed at the command window on Linux (at least for me on Ubuntu 19.04), but it completely breaks the command window on Windows (e.g. [3]).
Similarly, there are issues when using invalid UTF-8 for strings in plots.

There are different approaches for handling invalid byte sequences in UTF-8 (some of them suggested by the standard). I can't find a direct reference right now, but here is what Wikipedia says about it: [4].
They can mainly be assigned to these three groups:
1. Throw an error.
2. Replace each invalid byte with the same or different replacement characters.
3. Fall back to a different encoding for such bytes (e.g. ISO-8859-1 or CP1252).
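The three groups map directly onto, e.g., Python's decoding error modes. (Option 3 is shown here in its crudest form, decoding the whole buffer as ISO-8859-1, rather than falling back only for the invalid bytes.)

```python
data = bytes([104, 105, 228])            # "hi" plus one invalid byte

# 1. Throw an error (strict decoding is the default).
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("error")

# 2. Replace each invalid byte with a replacement character (U+FFFD).
print(data.decode("utf-8", errors="replace"))   # 'hi�'

# 3. Fall back to a different encoding; 228 is "ä" in ISO-8859-1.
print(data.decode("latin-1"))                   # 'hiä'
```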

Judging from some error reports, (western) users seem to expect that they get a micro sign on entering "char(181)" (and similarly for other printable characters at codepoints 128-255). If we implemented falling back to "ISO-8859-1" or "CP1252", we would follow that principle of least surprise in that respect.

However, it is not clear to me at which level we would implement that fallback conversion: For some users, it might feel "most natural" to see a "µ" everywhere when they use "char(181)" in their code. Others might be surprised if the conversion from one type (double) to another type (char) and back leads to a different result (different number of elements even!).
If we don't do the validation on creation of the char vector, there are probably a lot of places where strings should be validated before we use them.
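A sketch (in Python, for illustration) of that round-trip concern under an ISO-8859-1 fallback: one input element can come back as two.

```python
# char(228) under an ISO-8859-1 fallback would mean "ä" ...
c = bytes([228]).decode("latin-1")       # 'ä'

# ... which would be stored as two UTF-8 bytes, so converting the char
# back to double would yield two elements where the input had one.
stored = c.encode("utf-8")
print(len(c), list(stored))              # 1 [195, 164]
```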

A similar question arises when reading strings from a file (fopen, fread, fgets, fgetl, textscan, ...): Should we return the bytes as stored in the file? Or should we rather ensure that the strings are valid?

Matlab doesn't have the same problem (for western users) because they don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally in UCS-2).
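That identity is easy to check: every ISO-8859-1 byte value equals its Unicode code point number, which is also its numeric value in UTF-16/UCS-2 (Python sketch):

```python
# ISO-8859-1 maps each byte value to the Unicode code point with the
# same number, so the numeric char value is unchanged in UTF-16/UCS-2.
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(ord(bytes([181]).decode("latin-1")))   # 181, i.e. 'µ'
```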

I am slightly leaning towards implementing some sort of fallback mechanism (see e.g. bug #57107 [2] comment #17). But I'm open to any ideas of how to implement that exactly.

Another "solution" would be to review our initial decision to use UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our "char" class. But that would probably involve some major changes and a lot of conversions on interfaces to libraries we use.

Markus

[1]: http://hg.savannah.gnu.org/hgweb/octave/rev/94d490815aa8
[2]: https://savannah.gnu.org/bugs/index.php?57107
[3]: https://savannah.gnu.org/bugs/index.php?57133
[4]: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences


Re: How should we treat invalid UTF-8?

apjanke-floss

On 11/2/19 8:24 AM, "Markus Mützel" wrote:

> [...]

Hi all,

I'm coming around to the idea that Octave should be conservative and
strict about encodings at I/O and library boundaries, and lean toward
erroring out or using replacement characters, and not doing any
mixed-encoding fallback mechanisms. At least for our basic stuff like
fopen/fread/csvread. I think it would support higher-quality code, and
it would be easier for users to understand and diagnose, given a little
explanation.

I don't think we can fully protect users from having to know about
character encodings, and having to know what encoding their input data
is in. And trying to get fancy there could make it harder to do the
"right" thing when program correctness is important.

> There are different approaches for how to handle invalid byte
> sequences in UTF-8 [...]

One note: I don't think this is strictly about invalid byte sequences in
UTF-8, but rather invalid byte sequences in text data in any encoding.

My inclination is to handle invalid encoded byte sequences by:
  1. When doing file input or output, raise an error immediately.
    a) That probably (maybe?) goes for encoding-aware text-oriented
network I/O, like urlread(), too.
  2. When doing transcoding explicitly requested by the user (like a
unicode2native() call), raise an error unless the user explicitly
requested a character-replacement or fallback scheme. (This would be a
change from current behavior.)
  3. When passing text to a UI presentation element that Octave controls
(like a GUI widget, a plot element, or terminal output), use the
"invalid character" replacement character.
Validation would then happen whenever a string crosses an encoding
boundary or a library/system-call boundary.
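A minimal Python sketch of that policy; the function name and shape are illustrative, not a proposed Octave API:

```python
def decode_at_boundary(raw: bytes, encoding: str = "utf-8",
                       for_display: bool = False) -> str:
    """Decode bytes crossing an encoding/library boundary.

    I/O and explicit transcoding are strict and raise on invalid bytes
    (points 1 and 2); UI presentation substitutes U+FFFD instead of
    failing (point 3).
    """
    errors = "replace" if for_display else "strict"
    return raw.decode(encoding, errors=errors)

print(decode_at_boundary(bytes([228]), for_display=True))  # '�'
```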

Doing "smart" fallback is a convenience for users who are using Octave
interactively and looking at their data as it's processed, so they can
recognize garbage (if the data set is small enough). But for automated
processes or stuff with long processing pipelines, it could end up
silently passing incorrect data through, which isn't good. And I think
it would be nice if Octave would support those scenarios. Raising an
error at the point of the conversion failure makes sure that the
user/maintainer notices the problem, and makes it easy to locate (and
with a decent error message, hopefully easy to Google to figure out what
went wrong).

> Matlab doesn't have the same problem (for western users) because they
> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
> in UCS-2).
>
> Another "solution" would be to review our initial decision to use
> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
> "char" class. But that would probably involve some major changes and a
> lot of conversions on interfaces to libraries we use.

I don't think that's why Matlab has it "easy" here. I think it's because
a) all their text I/O is encoding-aware, and b) on Windows, they use the
system default legacy code page as the default encoding, which gives you
ISO-8859-1 in the West. The fact that Matlab's internal encoding is
UCS-2 and that's an easy transformation from ISO-8859-1 is just an
internal implementation detail.

Matlab does have the opposite problem: if your input data is actually
UTF-8 (which I think is the more common case these days) or if you want
your code to be portable across OSes or regions, you need to explicitly
specify UTF-8 or some other known encoding whenever your code does an
fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
garble your data.
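The garbling is ordinary mojibake; for example, assuming UTF-8 input read through a Western legacy code page:

```python
data = "ärger".encode("utf-8")   # UTF-8 bytes on disk
garbled = data.decode("cp1252")  # read as if it were Windows-1252
print(garbled)                   # 'Ã¤rger' -- silently wrong, no error raised
```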

If we changed Octave char to be 16-bit UTF-16 code points, we'd still
have the same problem of deciding what to use for a default encoding,
and what to do when the input didn't match that encoding.

Cheers,
Andrew


Re: How should we treat invalid UTF-8?

mmuetzel
On 4 November 2019 at 21:48, "Andrew Janke" wrote:

> Hi all,
>
> I'm coming around to the idea that Octave should be conservative and
> strict about encodings at I/O and library boundaries, and lean toward
> erroring out or using replacement characters, and not doing any
> mixed-encoding fallback mechanisms. At least for our basic stuff like
> fopen/fread/csvread. I think it would support higher-quality code, and
> it would be easier for users to understand and diagnose, given a little
> explanation.
>
> I don't think we can fully protect users from having to know about
> character encodings, and having to know what encoding their input data
> is in. And trying to get fancy there could make it harder to do the
> "right" thing when program correctness is important.

I agree.

> > There are different approaches for how to handle invalid byte
> > sequences in UTF-8 [...]
>
> One note: I don't think this is strictly about invalid byte sequences in
> UTF-8, but rather invalid byte sequences in text data in any encoding.

We should primarily focus on UTF-8.

> My inclination is to handle invalid encoded byte sequences by:
>   1. When doing file input or output, raise an error immediately

I kind of like that approach. However, would that mean that a user would need to clean up any encoding errors with other tools before they would be able to read such files?

>     a) That probably (maybe?) goes for encoding-aware text-oriented
> network I/O, like urlread(), too.
What could a user do about encoding errors in sources that are beyond their influence?

>   2. When doing transcoding explicitly requested by the user (like a
> unicode2native() call), raise an error unless the user explicitly
> requested a character-replacement or fallback scheme. (This would be a
> change from current behavior.)

"unicode2native()" currently fails on invalid UTF-8. IMHO, it would probably be better to have a separate function that provides a (configurable?) fallback conversion.

>   3. When passing text to a UI presentation element that Octave controls
> (like a GUI widget, a plot element, or terminal output), use the
> "invalid character" replacement character
> Where validation probably happens whenever you're crossing an encoding
> boundary or library/system-call boundary.

That might be hard (especially thinking of the command window on Windows). But it might be achievable.

> Doing "smart" fallback is a convenience for users who are using Octave
> interactively and looking at their data as it's processed, so they can
> recognize garbage (if the data set is small enough). But for automated
> processes or stuff with long processing pipelines, it could end up
> silently passing incorrect data through, which isn't good. And I think
> it would be nice if Octave would support those scenarios. Raising an
> error at the point of the conversion failure makes sure that the
> user/maintainer notices the problem, and makes it easy to locate (and
> with a decent error message, hopefully easy to Google to figure out what
> went wrong).

I agree that a smart fallback mechanism (maybe even including some heuristics) is probably *not* what we want. But maybe we could use a more "straightforward" fallback mechanism. (If that exists.)

> > Matlab doesn't have the same problem (for western users) because they
> > don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
> > encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
> > in UCS-2).
> >
> > Another "solution" would be to review our initial decision to use
> > UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
> > "char" class. But that would probably involve some major changes and a
> > lot of conversions on interfaces to libraries we use.
>
> I don't think that's why Matlab has it "easy" here. I think it's because
> a) all their text I/O is encoding-aware, and b) on Windows, they use the
> system default legacy code page as the default encoding, which gives you
> ISO-8859-1 in the West. The fact that Matlab's internal encoding is
> UCS-2 and that's an easy transformation from ISO-8859-1 is just an
> internal implementation detail.

I was more thinking of "Matlab compatibility" bug reports to come. Like: "Why does my code using char(181) work in Matlab but fail in Octave?"

> Matlab does have the opposite problem: if your input data is actually
> UTF-8 (which I think is the more common case these days) or if you want
> your code to be portable across OSes or regions, you need to explicitly
> specify UTF-8 or some other known encoding whenever your code does an
> fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
> garble your data.

Is that on all platforms? Or only on Windows?

> If we changed Octave char to be 16-bit UTF-16 code points, we'd still
> have the same problem of deciding what to use for a default encoding,
> and what to do when the input didn't match that encoding.

I agree. Those are two separate questions. One is: What should be the size of one char in Octave? The other is: What should be the default encoding for reading (and writing) 8-bit sources?
But any fallback mechanism (if we wanted to have one) would depend on the answer to the former question.

Markus



Re: How should we treat invalid UTF-8?

apjanke-floss


On 11/4/19 5:12 PM, "Markus Mützel" wrote:
> On 4 November 2019 at 21:48, "Andrew Janke" wrote:
>
>>> There are different approaches for how to handle invalid byte
>>> sequences in UTF-8 [...]
>>
>> One note: I don't think this is strictly about invalid byte sequences in
>> UTF-8, but rather invalid byte sequences in text data in any encoding.
>
> We should primarily focus on UTF-8.

I don't think I agree: we're designing how Octave handles strings and
encodings in general, and we live in an international, multi-encoding
world. We should come up with a system that works for multiple encodings.


>> My inclination is to handle invalid encoded byte sequences by:
>>   1. When doing file input or output, raise an error immediately
>
> I kind of like that approach. However, would that mean that a user would need to clean up any encoding errors with other tools before they would be able to read such files?

Yes and no. Yes, you would need to fix the read error somehow. In my
experience, this usually means that you're just using the wrong
encoding, and you just need to specify the right encoding instead of
modifying the input data. If there actually are encoding errors,
then your data is corrupt, and you should fix it up before slurping it
into Octave chars. You could do this with external tools. Or if you
wanted to do it in Octave, you could just read the file in binary mode
and work with the raw encoded bytes, and then transcode it once cleaned up.
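A Python sketch of that binary-mode cleanup flow; the file name and the specific repair are hypothetical:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "corrupt.txt")
with open(path, "wb") as f:
    f.write(b"Gr\xe4fe")            # a stray raw 0xE4 where UTF-8 "ä" belongs

with open(path, "rb") as f:         # binary mode: bytes come back untouched
    raw = f.read()

cleaned = raw.replace(b"\xe4", "ä".encode("utf-8"))  # repair known corruption
text = cleaned.decode("utf-8")      # transcode only once the bytes are valid
print(text)                         # Gräfe
```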

>>     a) That probably (maybe?) goes for encoding-aware text-oriented
>> network I/O, like urlread(), too.
> What could a user do about encoding errors in sources that are beyond their influence?

What can a user do about any data corruption in a data source beyond
their influence? Talk to the source to get it fixed, or write a tool to
correct it yourself. You could do this in Octave using byte-oriented
binary I/O. Or we could provide a "read with fallback-to-replacement
character" function or mode as a convenience; I just don't think that
should be the default, because we shouldn't silently lose data unless asked.

You can always fall back and look at the raw bytes using byte-oriented
I/O and munge them there. I just think we should be conservative about
what actually makes it into chars using the text-oriented I/O.

>>   2. When doing transcoding explicitly requested by the user (like a
>> unicode2native() call), raise an error unless the user explicitly
>> requested a character-replacement or fallback scheme. (This would be a
>> change from current behavior.)
>
> "unicode2native()" currently fails on invalid UTF-8.

Ah! Okay, I just misread the helptext for it. I think that is a good
behavior.

> Imho, it would probably be better to have a separate function that provides a (configurable?) fallback conversion.

I agree.

>>   3. When passing text to a UI presentation element that Octave controls
>> (like a GUI widget, a plot element, or terminal output), use the
>> "invalid character" replacement character
>> Where validation probably happens whenever you're crossing an encoding
>> boundary or library/system-call boundary.
>
> That might be hard (especially thinking of the command window on Windows). But it might be achievable.

Afraid I don't really know enough about GUI programming to be much help
here. But for Qt and Windows GUI widgets, they're not using C char *s
for their strings; they're using QString or Windows wchar_t/TCHAR/PWSTR
values, right? Which aren't UTF-8, so there's already gotta be a
translation point between Octave strings and the GUI toolkit's strings,
I'd think? That's where you'd slap your validation + fallback-char
replacement calls.

>> Doing "smart" fallback is a convenience for users who are using Octave
>> interactively and looking at their data as its processed, so they can
>> recognize garbage (if the data set is small enough). But for automated
>> processes or stuff with long processing pipelines, it could end up
>> silently passing incorrect data through, which isn't good. And I think
>> it would be nice if Octave would support those scenarios. Raising an
>> error at the point of the conversion failure makes sure that the
>> user/maintainer notices the problem, and makes it easy to locate (and
>> with a decent error message, hopefully easy to Google to figure out what
>> went wrong).
>
> I agree that a smart fallback mechanism (maybe even including some heuristics) is probably *not* what we want. But maybe we could use a more "straight forward" fallback mechanism. (If that exists.)
>
>>> Matlab doesn't have the same problem (for western users) because they
>> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
>> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
>> in UCS-2).
>>>
>>> Another "solution" would be to review our initial decision to use
>> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
>> "char" class. But that would probably involve some major changes and a
>> lot of conversions on interfaces to libraries we use.
>>
>> I don't think that's why Matlab has it "easy" here. I think it's because
>> a) all their text I/O is encoding-aware, and b) on Windows, they use the
>> system default legacy code page as the default encoding, which gives you
>> ISO-8859-1 in the West. The fact that Matlab's internal encoding is
>> UCS-2 and that's an easy transformation from ISO-8859-1 is just an
>> internal implementation detail.
>
> I was more thinking of "Matlab compatibility" bug reports to come. Like: "Why does my code using char(181) work in Matlab but fail in Octave?"

I think that's kind of a different issue than I/O encoding. And
achieving a degree of Matlab compatibility would be feasible.

Mathematically speaking, Matlab's char(double) takes
the input doubles, narrows them to 16-bit, and casts them to char,
squashing non-UCS-2 values to placeholders. You can also view Matlab's
char(double) as working like this: it treats the input doubles as
numeric Unicode code point values (not ISO-8859-1 or any other
encoding), and it converts those into the Matlab-native char (UCS-2)
values that represent those code points, squashing out-of-range values
to the 0xFFFF placeholder replacement character. We could get somewhat
equivalent behavior by having Octave's char(double) also treat its
inputs as Unicode code points, and have it return the Octave-native char
(UTF-8) vector that represents that sequence of code points.

char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
1-long UCS-2 Matlab string. We could have char(181) in Octave also give
you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
char string.
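In Python terms (code points via chr, storage via UTF-8 encode):

```python
s = chr(181)                    # code point U+00B5, the micro sign
utf8 = s.encode("utf-8")
print(len(s), list(utf8))       # 1 [194, 181] -- i.e. 0xC2 0xB5, two bytes
```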

>> Matlab does have the opposite problem: if your input data is actually
>> UTF-8 (which I think is the more common case these days) or if you want
>> your code to be portable across OSes or regions, you need to explicitly
>> specify UTF-8 or some other known encoding whenever your code does an
>> fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
>> garble your data.
>
> Is that on all platforms? Or only on Windows?

It is not on all platforms. Just Windows. On Linux, it uses your locale,
which will give you UTF-8 as the default encoding on most modern
systems. And on macOS, for some reason, it seems to default to
ISO-8859-1, even though that's not the system default encoding in any
sense that I'm aware of. So if you want cross-OS portable Matlab code,
you must always specify an encoding when calling fopen(), or write
OS-specific logic.

>> If we changed Octave char to be 16-bit UTF-16 code points, we'd still
>> have the same problem of deciding what to use for a default encoding,
>> and what to do when the input didn't match that encoding.
>
> I agree. Those are two different pairs of shoes. One is: What should be the size of one char in Octave? The other is: What should be the default encoding for reading (and writing) 8bit sources?

Yep. And the first one is up for grabs. That depends on: What's a good
encoding in general? How much do we care about Matlab compatibility? Do
users need random access to characters within a string? Etc etc.

> But any fallback mechanism (if we wanted to have one) would depend on the answer to the former question.

I don't think it does, really. Transcoding to your internal string
representation is a three-step process:

1. Parse the input byte sequence in the input encoding to get a sequence
of code points (characters) in the input character set.
2. Map those input character set code points to code points in your
internal character set.
3. Encode those internal character set code points into your internal
"char"/string type's encoding.

I think the fallback mechanism happens entirely in steps 1 and 2, as
long as your internal string representation uses a character set that
can represent whatever replacement characters you want. (And if you're
using Unicode, that's always true.) Whether Octave uses UTF-8, UTF-16,
or some other Unicode encoding only affects step 3.
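A sketch of the three steps in Python; swapping the internal encoding in step 3 leaves the fallback handling in steps 1 and 2 untouched:

```python
def to_internal(raw: bytes, src_encoding: str, errors: str = "strict") -> bytes:
    # Steps 1+2: parse the input bytes into Unicode code points; any
    # replacement/fallback policy is applied here, via `errors`.
    code_points = raw.decode(src_encoding, errors=errors)
    # Step 3: encode the code points into the internal representation.
    # "utf-8" here, but "utf-16" would work identically for steps 1+2.
    return code_points.encode("utf-8")

print(list(to_internal(b"\xb5", "latin-1")))   # [194, 181]
```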

Cheers,
Andrew


Re: How should we treat invalid UTF-8?

mmuetzel
On 5 November 2019 at 00:00, "Andrew Janke" wrote:

> On 11/4/19 5:12 PM, "Markus Mützel" wrote:
> > On 4 November 2019 at 21:48, "Andrew Janke" wrote:
> >
> >>> There are different approaches for how to handle invalid byte
> >>> sequences in UTF-8 [...]
> >>
> >> One note: I don't think this is strictly about invalid byte sequences in
> >> UTF-8, but rather invalid byte sequences in text data in any encoding.
> >
> > We should primarily focus on UTF-8.
>
> I don't think I agree: we're designing how Octave handles strings and
> encodings in general, and we live in an international, multi-encoding
> world. We should come up with a system that works for multiple encodings.

As a first step, I am more worried about the char(double) oddities. But you are right: should we decide to error on invalid input, that should also apply to other multi-byte input encodings that can have invalid byte sequences.

>
> >> My inclination is to handle invalid encoded byte sequences by:
> >>   1. When doing file input or output, raise an error immediately
> >
> > I kind of like that approach. However, would that mean that a user would need to clean up any encoding errors with other tools before they would be able to read such files?
>
> Yes and no. Yes, you would need to fix the read error somehow. In my
> experience, this usually means that you're just using the wrong
> encoding, and you just need to specify the right encoding instead of
> modifying the input data. If there actually are encoding errors,
> then your data is corrupt, and you should fix it up before slurping it
> into Octave chars. You could do this with external tools. Or if you
> wanted to do it in Octave, you could just read the file in binary mode
> and work with the raw encoded bytes, and then transcode it once cleaned up.
>
> >>     a) That probably (maybe?) goes for encoding-aware text-oriented
> >> network I/O, like urlread(), too.
> > What could a user do about encoding errors in sources that are beyond their influence?
>
> What can a user do about any data corruption in a data source beyond
> their influence? Talk to the source to get it fixed, or write a tool to
> correct it yourself. You could do this in Octave using byte-oriented
> binary I/O. Or we could provide a "read with fallback-to-replacement
> character" function or mode as a convenience; I just don't think that
> should be the default, because we shouldn't silently lose data unless asked.
>
> You can always fall back and look at the raw bytes using byte-oriented
> I/O and munge them there. I just think we should be conservative about
> what actually makes it into chars using the text-oriented I/O.
>
Agreed on both your points.

> >>   2. When doing transcoding explicitly requested by the user (like a
> >> unicode2native() call), raise an error unless the user explicitly
> >> requested a character-replacement or fallback scheme. (This would be a
> >> change from current behavior.)
> >
> > "unicode2native()" currently fails on invalid UTF-8.
>
> Ah! Okay, I just misread the helptext for it. I think that is a good
> behavior.
>
> > Imho, it would probably be better to have a separate function that provides a (configurable?) fallback conversion.
>
> I agree.
>
> >>   3. When passing text to a UI presentation element that Octave controls
> >> (like a GUI widget, a plot element, or terminal output), use the
> >> "invalid character" replacement character
> >> Where validation probably happens whenever you're crossing an encoding
> >> boundary or library/system-call boundary.
> >
> > That might be hard (especially thinking of the command window on Windows). But it might be achievable.
>
> Afraid I don't really know enough about GUI programming to be much help
> here. But for Qt and Windows GUI widgets, they're not using C char *s
> for their strings; they're using QString or Windows wchar_t/TCHAR/PWSTR
> values, right? Which aren't UTF-8, so there's already gotta be a
> translation point between Octave strings and the GUI toolkit's strings,
> I'd think? That's where you'd slap your validation + fallback-char
> replacement calls.

Unfortunately, the command window is not a Qt widget. The Windows implementation in particular is a beast because the Windows console that we use has so many limitations, especially when it comes to variable-width byte encodings. (See e.g. the bug about output stopping completely after an invalid byte.)

>
> >> Doing "smart" fallback is a convenience for users who are using Octave
> >> interactively and looking at their data as its processed, so they can
> >> recognize garbage (if the data set is small enough). But for automated
> >> processes or stuff with long processing pipelines, it could end up
> >> silently passing incorrect data through, which isn't good. And I think
> >> it would be nice if Octave would support those scenarios. Raising an
> >> error at the point of the conversion failure makes sure that the
> >> user/maintainer notices the problem, and makes it easy to locate (and
> >> with a decent error message, hopefully easy to Google to figure out what
> >> went wrong).
> >
> > I agree that a smart fallback mechanism (maybe even including some heuristics) is probably *not* what we want. But maybe we could use a more "straight forward" fallback mechanism. (If that exists.)
> >
> >>> Matlab doesn't have the same problem (for western users) because they
> >> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
> >> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
> >> in UCS-2).
> >>>
> >>> Another "solution" would be to review our initial decision to use
> >> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
> >> "char" class. But that would probably involve some major changes and a
> >> lot of conversions on interfaces to libraries we use.
> >>
> >> I don't think that's why Matlab has it "easy" here. I think it's because
> >> a) all their text I/O is encoding-aware, and b) on Windows, they use the
> >> system default legacy code page as the default encoding, which gives you
> >> ISO-8859-1 in the West. The fact that Matlab's internal encoding is
> >> UCS-2 and that's an easy transformation from ISO-8859-1 is just an
> >> internal implementation detail.
> >
> > I was more thinking of "Matlab compatibility" bug reports to come. Like: "Why does my code using char(181) work in Matlab but fail in Octave?"
>
> I think that's kind of a different issue than I/O encoding. And
> achieving a degree of Matlab compatibility would be feasible.
>
> On the one hand, mathematically speaking, Matlab's char(double) takes
> the input doubles and narrows them to 16-bit and casts them to char,
> squashing non-UCS-2 values to placeholders. You can also view Matlab's
> char(double) as working like this: it treats the input doubles as
> numeric Unicode code point values (not ISO-8859-1 or any other
> encoding), and it converts those into the Matlab-native char (UCS-2)
> values that represent those code points, squashing out-of-range values to the
> 0xFFFF placeholder replacement character. We could get somewhat
> equivalent behavior by having Octave's char(double) also treat its
> inputs as Unicode code points, and have it return the Octave-native char
> (UTF-8) vector that represents that sequence of code points.
>
> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
> char string.

I prefer that approach of extending the logic to all Unicode code points over my initial idea of only doing it for the first 256 (which seems odd now that I think about it).
But still: Do we really want this? That would lead to the same round trip oddities.
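For concreteness, the byte-level mapping under discussion can be sketched in Python (purely illustrative; not Octave code):

```python
# Code point 181 (U+00B5 MICRO SIGN) encodes to two bytes in UTF-8.
micro = chr(181)
print(micro.encode("utf-8"))           # b'\xc2\xb5'

# A lone byte 228 (0xE4) is invalid UTF-8 on its own...
try:
    bytes([228]).decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")

# ...but it *is* "ä" in ISO-8859-1, which is where the
# least-surprise expectation for western users comes from.
print(bytes([228]).decode("latin-1"))  # ä
```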

> >> Matlab does have the opposite problem: if your input data is actually
> >> UTF-8 (which I think is the more common case these days) or if you want
> >> your code to be portable across OSes or regions, you need to explicitly
> >> specify UTF-8 or some other known encoding whenever your code does an
> >> fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
> >> garble your data.
> >
> > Is that on all platforms? Or only on Windows?
>
> It is not on all platforms. Just Windows. On Linux, it uses your locale,
> which will give you UTF-8 as the default encoding on most modern
> systems. And on macOS, for some reason, it seems to default to
> ISO-8859-1, even though that's not the system default encoding in any
> sense that I'm aware of. So if you want cross-OS portable Matlab code,
> you must always specify an encoding when calling fopen(), or write
> OS-specific logic.
>
> >> If we changed Octave char to be 16-bit UTF-16 code points, we'd still
> >> have the same problem of deciding what to use for a default encoding,
> >> and what to do when the input didn't match that encoding.
> >
> > I agree. Those are two separate questions. One is: What should be the size of one char in Octave? The other is: What should be the default encoding for reading (and writing) 8-bit sources?
>
> Yep. And the first one is up for grabs. That depends on: What's a good
> encoding in general? How much do we care about Matlab compatibility? Do
> users need random access to characters within a string? Etc etc.
>
> > But any fallback mechanism (if we wanted to have one) would depend on the answer to the former question.
>
> I don't think it does, really. Transcoding to your internal string
> representation is a three-step process:
>
> 1. Parse the input byte sequence in the input encoding to get a sequence
> of code points (characters) in the input character set.
> 2. Map those input character set code points to code points in your
> internal character set.
> 3. Encode those internal character set code points into your internal
> "char"/string type's encoding.
>
> I think the fallback mechanism happens entirely in steps 1 and 2, as
> long as your internal string representation uses a character set that
> can represent whatever replacement characters you want. (And if you're
> using Unicode, that's always true.) Whether Octave uses UTF-8, UTF-16,
> or some other Unicode encoding only affects step 3.

Still, I am more focussed on the char(double) issue. The round trip oddities would disappear (or become much less prominent) if we used a wider char representation.
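As a sketch of the three-step view quoted above (Python, illustration only; the `policy` argument stands in for whichever error/replacement strategy we'd pick):

```python
def transcode(raw, input_encoding, internal_encoding="utf-8", policy="strict"):
    # Steps 1+2: parse the input bytes into Unicode code points. Any
    # fallback happens here: policy="strict" raises on invalid input,
    # policy="replace" substitutes U+FFFD replacement characters.
    text = raw.decode(input_encoding, errors=policy)
    # Step 3: encode the code points into the internal representation.
    # This step is independent of the fallback choice.
    return text.encode(internal_encoding)

print(transcode(b"\xe4", "latin-1"))                  # b'\xc3\xa4' ("ä")
print(transcode(b"\xe4", "utf-8", policy="replace"))  # b'\xef\xbf\xbd' (U+FFFD)
```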

Markus



Re: How should we treat invalid UTF-8?

apjanke-floss

On 11/4/19 6:29 PM, "Markus Mützel" wrote:
> Am 05. November 2019 um 00:00 Uhr schrieb "Andrew Janke":
>> On 11/4/19 5:12 PM, "Markus Mützel" wrote:

Okay. I feel good about where this conversation is with respect to file
I/O and encodings.

>> Afraid I don't really know enough about GUI programming to be much help
>> here. But for Qt and Windows GUI widgets, they're not using C char *s
>> for their strings; they're using QString or Windows wchar_t/TCHAR/PWSTR
>> values, right? Which aren't UTF-8, so there's already gotta be a
>> translation point between Octave strings and the GUI toolkit's strings,
>> I'd think? That's where you'd slap your validation + fallback-char
>> replacement calls.
>
> Unfortunately, the command window is not a Qt widget. The Windows implementation in particular is a beast because the Windows prompt that we use has so many limitations, especially when it comes to variable-byte encodings. (See e.g. the bug about output stopping completely after an invalid byte.)

I'm out of my depth here. Guess I need to add "learn the Windows
terminal widget" to my long TODO list.

>> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
>> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
>> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
>> char string.
>
> I prefer that approach of extending the logic to all Unicode code points to my initial idea of only doing that for the first 256 (which seems odd now thinking about it).
> But still: Do we really want this? That would lead to the same round trip oddities.

We could totally round-trip it!

Define double(char) and char(double) to work along row vectors instead
of on individual elements. (You have to define char(double) this way for
the conversion I suggested above to make sense in a UTF-8 world anyway.)

Define double(char) as "takes a row vector of chars that contain UTF-8
(or whatever Octave's internal encoding is) and returns a row vector of
doubles that contain the sequence of Unicode code point values encoded
by those chars". That's the inverse of the char(double) I describe
above, and it should round-trip just fine.

Then you could say:

x = [ 121 117 109 hex2dec('1F34C') ];  % "yum🍌"
str = char (x);  % Get back 7-long char with values 0x79 0x75 0x6D 0xF0 0x9F 0x8D 0x8C
x2 = double (str); % Get back [121 117 109 127820]
isequal (x, x2);  % Returns true

Now, this wouldn't work for 2-D or higher-dimension arrays that contain
non-ASCII (>127) code points. But I think that's a small, acceptable
loss: IMHO, 2-D char arrays are terrible, and you should pretty much
never use them.

(For 2-D and higher arrays: still define it as operating
row-vector-wise, and if all the row vector operations result in outputs
that have compatible dimensions and can cat() cleanly, cat and return
that; else error with a "dimension mismatch" error.)

This even gives you a win over Matlab: UCS-2 can't represent U+1F34C, so
Matlab squashes the banana into the 0xFFFF replacement character, and x2
does not equal x.
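The round trip above can be checked outside Octave, too; here is the same conversion sketched in Python (illustrative only):

```python
codepoints = [121, 117, 109, 0x1F34C]   # "yum" plus the banana, U+1F34C
s = "".join(map(chr, codepoints))

utf8 = s.encode("utf-8")                # the 7-byte sequence shown above
print(list(utf8))                       # [121, 117, 109, 240, 159, 141, 140]

round_tripped = [ord(c) for c in utf8.decode("utf-8")]
print(round_tripped == codepoints)      # True: 0x1F34C survives in UTF-8
```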

> Still, I am more focussed on the char(double) issue. The round trip oddities would disappear (or become much less prominent) if we used a wider char representation.

True. Having 16-bit chars would mean you could mostly define these
operations *elementwise* instead of vector-wise, and then get nice
round-trip results and support higher-dimensional arrays as long as you
stay in the Basic Multilingual Plane. And indexing becomes more
intuitive, especially for less-experienced users. (And maybe better
MAT-file compatibility?)

Personally, I like being able to support non-BMP characters, because I
like working with emoji, and there are some mathematical symbols there
that may be of interest to Octave users creating plots or documents.
Granted, unless you're doing sentiment analysis on Twitter or Slack
streams or something, the vast majority of your text is going to be all BMP. But as
long as your 16-bit char type passes through surrogate pairs unmolested,
you can still use non-BMP characters; it's just a bit less convenient to
construct them in code. And all that could easily be wrapped up in
user-defined helper functions.
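For reference, the surrogate-pair mechanics mentioned above look like this (Python, illustrative):

```python
import struct

# U+1F34C in UTF-16 (big-endian) occupies two 16-bit code units:
units = struct.unpack(">2H", chr(0x1F34C).encode("utf-16-be"))
print([hex(u) for u in units])   # ['0xd83c', '0xdf4c'] - a surrogate pair

# A 16-bit char type that passes these units through unmolested still
# carries the character; recombining the pair recovers the code point.
hi, lo = units
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp))                   # 0x1f34c
```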

My personal desire has always been to see Octave switch to a 16-bit
UCS-2 or UTF-16 char type, because maximal Matlab compatibility is my
highest hope. (I'm coming from an enterprise background where we'd like
to maybe some day use Octave to replace Matlab for some workloads, with
minimal porting effort for our existing M-code. And I think there are
maybe other people in that situation.) But maybe that's not what's best
for the Octave community as a whole, or feasible for the Octave developers.

Maybe the best way to approach this is to discuss what use cases or
coding techniques a wider char type enables. From what I can see, the
big thing you get with UCS-2 or UTF-32 (and UTF-16 if you're sloppy
about it) is random access for characters: for a char array str and
integer i, str(i) is a single character, which is also a scalar char.
That's very useful if you want to do character-wise manipulation of
strings. But do people actually want to do that in practice? I can
think of lots of toy examples like "reverse a string" or "replace a
couple of characters in a string with something else", but those
operations mostly show up in tutorials and coding interviews.

The one use case I can think of that I've actually done in recent years
on the job is parsing of fixed-width-field format records. Like you have
a weather station identifier in the format "TSSZZZZZ-nnnn" where "T" is
a one-letter code for the station type, "SS" is the 2-character state
abbreviation, "ZZZZZ" is the zip code, "nnnn" is the ID number, and so
on. With 16-bit chars, you can do this conveniently with direct
character indexing, and can vectorize the operation using a 2-D char
array. With 8-bit UTF-8 chars, you can still do that, but you have to do
an intermediate step where you call a function that maps character
indexes to (start_index, end_index) pairs that index into the byte
offsets of the char array.
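That intermediate mapping step might look like the following sketch (Python; `char_spans` is a hypothetical helper, and it assumes the input is already valid UTF-8):

```python
def char_spans(utf8):
    """Map character indexes to (start, end) byte offsets, 0-based inclusive."""
    spans, i = [], 0
    while i < len(utf8):
        b = utf8[i]
        # In valid UTF-8, the leading byte determines the sequence length.
        n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        spans.append((i, i + n - 1))
        i += n
    return spans

print(char_spans("TSSZZZZZ-nnnn".encode("utf-8"))[:2])  # [(0, 0), (1, 1)]
print(char_spans("aäbc".encode("utf-8")))  # [(0, 0), (1, 2), (3, 3), (4, 4)]
```

For an all-ASCII record format the spans are trivial, so the extra step only costs something when non-ASCII data can actually appear.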

Cheers,
Andrew


Re: How should we treat invalid UTF-8?

John W. Eaton
Administrator
On 11/4/19 11:46 PM, Andrew Janke wrote:

> Guess I need to add "learn the Windows
> terminal widget" to my long TODO list.

I wouldn't work too hard on that.  I still hope to replace the existing
Unix and Windows terminal widgets with a grand unified command window
widget that will use Qt to handle input and output.  My plans are that
it won't really be a "terminal" and it won't be possible to execute
external programs in it (no big loss, you can't really do that now) and
Octave won't be using the Windows Console or a Unix pty.  Instead it
will read input and send it to Octave and get output back that it will
display.  It should be much simpler and as functional as we want it to be.

jwe


Re: How should we treat invalid UTF-8?

mmuetzel
Am 05. November 2019 um 07:24 Uhr schrieb "John W. Eaton":

> On 11/4/19 11:46 PM, Andrew Janke wrote:
>
> > Guess I need to add "learn the Windows
> > terminal widget" to my long TODO list.
>
> I wouldn't work too hard on that.  I still hope to replace the existing
> Unix and Windows terminal widgets with a grand unified command window
> widget that will use Qt to handle input and output.  My plans are that
> it won't really be a "terminal" and it won't be possible to execute
> external programs in it (no big loss, you can't really do that now) and
> Octave won't be using the Windows Console or a Unix pty.  Instead it
> will read input and send it to Octave and get output back that it will
> display.  It should be much simpler and as functional as we want it to be.

It is very nice to read that this is still your goal.
Can you estimate yet whether it will be ready for Octave 6.1, or is it more likely to come later?

Markus



Re: How should we treat invalid UTF-8?

mmuetzel
In reply to this post by apjanke-floss
Am 05. November 2019 um 05:46 Uhr schrieb "Andrew Janke":

> On 11/4/19 6:29 PM, "Markus Mützel" wrote:
> > Am 05. November 2019 um 00:00 Uhr schrieb "Andrew Janke":
> >> On 11/4/19 5:12 PM, "Markus Mützel" wrote:
>
> Okay. I feel good about where this conversation is with respect to file
> I/O and encodings.
>
> >> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
> >> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
> >> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
> >> char string.
> >
> > I prefer that approach of extending the logic to all Unicode code points to my initial idea of only doing that for the first 256 (which seems odd now thinking about it).
> > But still: Do we really want this? That would lead to the same round trip oddities.
>
> We could totally round-trip it!
>
> Define double(char) and char(double) to work along row vectors instead
> of on individual elements. (You have to define char(double) this way for
> the conversion I suggested above to make sense in a UTF-8 world anyway.)
>
> Define double(char) as "takes a row vector of chars that contain UTF-8
> (or whatever Octave's internal encoding is) and returns a row vector of
> doubles that contain the sequence of Unicode code point values encoded
> by those chars". That's the inverse of the char(double) I describe
> above, and it should round-trip just fine.
>
> Then you could say:
>
> x = [ 121 117 109 hex2dec('1F34C') ];  % "yum🍌"
> str = char (x);  % Get back 7-long char with values 0x79 0x75 0x6D 0xF0 0x9F 0x8D 0x8C
> x2 = double (str); % Get back [121 117 109 127820]
> isequal (x, x2);  % Returns true

You are right. I was still thinking that we wanted to implement this as a fallback mechanism.
But if we always interpret double input to char() as "Unicode code points" (resembling UTF-32), round trips would be safe.

Do we want char on double (and vice versa) to do more than a simple cast-like operation? If we can answer that question with "yes", I think we could be close to a possible solution.

What about single and the integer classes as input to char()? It would probably be reasonable to do the same for them.

> Now, this wouldn't work for 2-D or higher-dimension arrays that contain
> non-ASCII (>127) code points. But I think that's a small, acceptable
> loss: IMHO, 2-D char arrays are terrible, and you should pretty much
> never use them.
>
> (For 2-D and higher arrays: still define it as operating
> row-vector-wise, and if all the row vector operations result in outputs
> that have compatible dimensions and can cat() cleanly, cat and return
> that; else error with a "dimension mismatch" error.)
>
> This even gives you a win over Matlab: UCS-2 can't represent U+1F34C, so
> Matlab squashes the banana into the 0xFFFF replacement character, and x2
> does not equal x.
>
> > Still, I am more focused on the char(double) issue. The round trip oddities would disappear (or become much less prominent) if we used a wider char representation.
>
> True. Having 16-bit chars would mean you could mostly define these
> operations *elementwise* instead of vector-wise, and then get nice
> round-trip results and support higher-dimensional arrays as long as you
> stay in the Basic Multilingual Plane. And indexing becomes more
> intuitive, especially for less-experienced users. (And maybe better
> MAT-file compatibility?)
>
> Personally, I like being able to support non-BMP characters, because I
> like working with emoji, and there are some mathematical symbols there
> that may be of interest to Octave users creating plots or documents.
> Granted, unless you're doing sentiment analysis on Twitter or Slack
> streams or something, the vast majority of your text is going to be all BMP. But as
> long as your 16-bit char type passes through surrogate pairs unmolested,
> you can still use non-BMP characters; it's just a bit less convenient to
> construct them in code. And all that could easily be wrapped up in
> user-defined helper functions.
>
> My personal desire has always been to see Octave switch to a 16-bit
> UCS-2 or UTF-16 char type, because maximal Matlab compatibility is my
> highest hope. (I'm coming from an enterprise background where we'd like
> to maybe some day use Octave to replace Matlab for some workloads, with
> minimal porting effort for our existing M-code. And I think there are
> maybe other people in that situation.) But maybe that's not what's best
> for the Octave community as a whole, or feasible for the Octave developers.
>
> Maybe the best way to approach this is to discuss what use cases or
> coding techniques a wider char type enables. From what I can see, the
> big thing you get with UCS-2 or UTF-32 (and UTF-16 if you're sloppy
> about it) is random access for characters: for a char array str and
> integer i, str(i) is a single character, which is also a scalar char.
> That's very useful if you want to do character-wise manipulation of
> strings. But do people actually want to do that in practice? I can
> think of lots of toy examples like "reverse a string" or "replace a
> couple of characters in a string with something else", but those
> operations mostly show up in tutorials and coding interviews.

We have the Octave-specific unicode_idx() function that might help in these situations:
str = "aäbc";
str(unicode_idx (str)==2) % is the second character
But I agree that it adds complexity to use that function instead of simply indexing into the string.

We could also add more functions that could better support more use cases.

> The one use case I can think of that I've actually done in recent years
> on the job is parsing of fixed-width-field format records. Like you have
> a weather station identifier in the format "TSSZZZZZ-nnnn" where "T" is
> a one-letter code for the station type, "SS" is the 2-character state
> abbreviation, "ZZZZZ" is the zip code, "nnnn" is the ID number, and so
> on. With 16-bit chars, you can do this conveniently with direct
> character indexing, and can vectorize the operation using a 2-D char
> array. With 8-bit UTF-8 chars, you can still do that, but you have to do
> an intermediate step where you call a function that maps character
> indexes to (start_index, end_index) pairs that index into the byte
> offsets of the char array.

This use case also doesn't work in Octave (but does in Matlab with its wider chars). It's probably bad coding style anyway, though:
a = "a";
a(end+1) = "ä";

Markus



Re: How should we treat invalid UTF-8?

apjanke-floss


On 11/6/19 4:57 AM, "Markus Mützel" wrote:

> Am 05. November 2019 um 05:46 Uhr schrieb "Andrew Janke":
>> On 11/4/19 6:29 PM, "Markus Mützel" wrote:
>> [...]
>>
>> Define double(char) and char(double) to work along row vectors instead
>> of on individual elements. (You have to define char(double) this way for
>> the conversion I suggested above to make sense in a UTF-8 world anyway.)
>> [...]
>
> You are right. I was still thinking that we wanted to implement this as a fallback mechanism.
> But if we always interpret double input to char() as "Unicode code points" (resembling UTF-32), round trips would be safe.
>
> Do we want char on double (and vice versa) to do more than a simple cast-like operation? If we can answer that question with "yes", I think we could be close to a possible solution.
>
> What about single and the integer classes as input to char()? It would probably be reasonable to do the same for them.

That makes sense to me, with the provision that it should probably throw
an error on "overflow", i.e. when the char string you're converting
contains a code point that won't fit in the target type.
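A sketch of that overflow check (Python, with the limit hard-coded; `char_to_uint` is a hypothetical name, illustration only):

```python
def char_to_uint(s, nbits):
    """Convert characters to integer code point values, erroring on overflow."""
    limit = (1 << nbits) - 1
    out = []
    for ch in s:
        cp = ord(ch)
        if cp > limit:
            raise OverflowError("U+%04X does not fit in uint%d" % (cp, nbits))
        out.append(cp)
    return out

print(char_to_uint("a\u00b5", 16))   # [97, 181]
try:
    char_to_uint("\U0001F34C", 16)   # U+1F34C > 0xFFFF, so this errors
except OverflowError as err:
    print(err)
```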

> We have the Octave-specific unicode_idx() function that might help in these situations:
> str = "aäbc";
> str(unicode_idx (str)==2) % is the second character
> But I agree that it adds complexity to use that function instead of simply indexing into the string.
>
> We could also add more functions that could better support more use cases.

Yeah, that's what you'd need.

That "==" step sounds expensive. Could unicode_idx also return a
precomputed lookup table of the start and end bytes/elements for each
character index? That way, you could call unicode_idx once on a string
and then have O(1) access to each of its characters. Like this:

str = "aäbc"
[idx, idx2] = unicode_idx (str)
% idx = [1 2 2 3 4]
% idx2 = [1 1; 2 3; 4 4; 5 5]
nth_character = str(idx2(n,1):idx2(n,2))  % No == needed, so this is O(1)
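A sketch of how that two-output form could be computed (Python mimicking the proposed interface; assumes valid UTF-8 input):

```python
def unicode_idx2(utf8):
    """Return per-byte character indexes plus per-character byte ranges (1-based)."""
    idx, idx2, i, c = [], [], 0, 0
    while i < len(utf8):
        b = utf8[i]
        # In valid UTF-8, the leading byte determines the sequence length.
        n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        c += 1
        idx.extend([c] * n)           # character index for each byte
        idx2.append((i + 1, i + n))   # 1-based inclusive byte range
        i += n
    return idx, idx2

idx, idx2 = unicode_idx2("aäbc".encode("utf-8"))
print(idx)    # [1, 2, 2, 3, 4]
print(idx2)   # [(1, 1), (2, 3), (4, 4), (5, 5)]
```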

>> [...] With 16-bit chars, you can do this conveniently with direct
>> character indexing, and can vectorize the operation using a 2-D char
>> array. [...]
>
> Also this use case doesn't work in Octave (but does in Matlab with the wider chars). But it's probably bad coding style anyway:
> a = "a";
> a(end+1) = "ä";

There's also ==, <, and >.

"xx" == "ä"  % runs, but doesn't do what you'd probably expect
"foobär" == "ä"  % dimension mismatch error

>> if any('ä' == 'Ê'); disp('yep!'); end
yep!

And sort(), unique(), and ismember():

>> sort("späm")
ans = ��mps
>> unique("foobär")
ans = ��bfor
>>

(That's not even valid UTF-8 in the results.)

>> [tf, ix] = ismember('ä', 'foobär')
tf =
  1  1
ix =
   5   6
>> [tf, ix] = ismember('ä', 'foobÊr')
tf =
  1  0
ix =
   5   0

Would anyone actually ever want to do these things on strings with
non-ASCII characters? I honestly don't know the answer to that.

This brings up another point: it'll be useful to still have a way of
getting at the raw underlying bytes inside a string, without any
validation or transcoding, for debugging odd results or invalid strings.
typecast() seems appropriate for this, and it seems to already work.
Like in that sort() result: since the results are not valid UTF-8, I
want to just look at the raw bytes to see what's going on.

>> sort("späm")
ans = ��mps
>> typecast(ans, 'uint8')
ans =
  164  195  109  112  115

Side note: that's a surprising result. If sort() is working byte-wise on
the char elements, why are the high bytes sorted to the beginning of the
result? That might be a bug.
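For comparison, a plain byte-wise ascending sort would put the high bytes last (Python, just to show the expected byte order):

```python
raw = "späm".encode("utf-8")   # s, p, the two bytes of ä (0xC3 0xA4), m
print(list(raw))               # [115, 112, 195, 164, 109]
print(sorted(raw))             # [109, 112, 115, 164, 195]
```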

Cheers,
Andrew