locale encoding and core functions


locale encoding and core functions

mmuetzel
TL;DR: Is there a way to get information whether an .m file is from Octave core or from a user function?

Some background:
With the upcoming Octave 5 it will be possible to set the mfile_encoding that is used to read .m files. This is important because Octave has to know which encoding is used in an .m file to correctly display non-ASCII characters in strings (e.g. in the "workspace" view or in plots). This is done by converting from whatever encoding the user set up to UTF-8 internally, and converting to whatever encoding is necessary at each interface.
However, there is a problem when we read core .m files, which are always encoded in UTF-8 (and not in the encoding the user set up). Converting these files from the locale encoding to UTF-8 garbles any non-ASCII characters.
E.g. the German character "ä" is represented in UTF-8 by the two bytes c3 a4. Assume a user sets the mfile_encoding to "ISO 8859-1" (Latin-1). Those two bytes are then interpreted as the two Latin-1 characters "Ã¤". This means that a string from a core .m file containing the letter "ä" would display as "Ã¤" for that user.
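The round trip described above can be reproduced outside Octave; here is a minimal Python sketch of the byte-level behavior (Python stands in here only to illustrate what the locale-to-UTF-8 conversion does to bytes that were already UTF-8):

```python
# A UTF-8 encoded "ä" is the two bytes C3 A4.
utf8_bytes = "ä".encode("utf-8")
assert utf8_bytes == b"\xc3\xa4"

# If the reader assumes the file is Latin-1, each byte is decoded
# as a separate Latin-1 character ...
misread = utf8_bytes.decode("latin-1")   # "Ã¤"

# ... and re-encoding that misreading as UTF-8 internally turns the
# original two bytes into four: the mojibake is now baked in.
garbled = misread.encode("utf-8")        # b"\xc3\x83\xc2\xa4"
print(misread)                           # prints "Ã¤"
```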

None of the core .m files contain any non-ASCII characters at the moment. However, there are a few help texts in some Octave Forge packages that do. See also bug #55195 [1].

The conversion to UTF-8 is done in "file_reader::get_input" in the file "input.cc".
If we knew in that function that the file we read from was from the core (or an Octave Forge package), we could skip the conversion from the locale encoding to mitigate the problem.

So back to the initial question: Is there a way to pass this information down to that function?

Markus

PS: This problem mostly affects Windows users, where the default mfile_encoding depends on the Windows locale (see also bug #49685 [2]). But in general, any user who prefers an encoding other than UTF-8 in their .m files is affected by this bug.

[1]: https://savannah.gnu.org/bugs/index.php?55195
[2]: https://savannah.gnu.org/bugs/index.php?49685


Re: locale encoding and core functions

apjanke-floss


On 2/23/19 4:12 AM, "Markus Mützel" wrote:

> TL;DR: Is there a way to get information whether an .m file is from Octave core or from a user function?
> [...snip...]
>

Fixed-encoding support like this sounds like a good idea. I would like
to be able to use non-ASCII characters in .m source code in a portable
manner. And I can see use cases for this in core M-code: example and
test data may want to use international or special characters, both to
test that the code under test supports it, and to provide examples for
advanced usage. It would be convenient to enter these as literal
characters instead of having to use \x escape sequences.

But just switching on "core/Forge" vs "user" .m files may not be the
best way to do it in the long run. In particular, I think these encoding
concerns apply to non-core Octave code, too.

There's no direct way to detect whether an .m file is from core Octave.
But you could build a function to do so on top of __pathorig__() pretty
easily: Take that path and remove all the paths under the pkg
installation locations. What's left is, I think, the Octave default core
path. You could consider any .m file from one of those paths to be
"core" Octave; anything else to be user-defined Octave.

You could also use that path to detect files which are pkg-installed vs
on the user path. But that's not the same as detecting Octave Forge
packages, because users might also install non-Forge packages using pkg.
You would have to look into the installation metadata for each package
to determine Forge vs non-Forge.

But this core-vs-user detection has a couple of drawbacks, at least for
Octave developers. It's really convenient to be able to work on Octave's
.m files by cloning the octave repo, firing up a reference installed
Octave, and sticking selected directories from your local repo's
scripts/ dir on the front of the Octave path. If the encoding of those
.m files were detected differently in that case, this wouldn't work
portably when there are non-ASCII characters in the source files.

My real issue is that this doesn't support portability for .m code
outside core Octave, which I think is a worthy goal. In today's
globalized world, you might well want to share code between developers
or users that are in different locales and have different default
encodings on their machines. It would be nice if Octave projects were
easily portable between those users without requiring them to do special
configuration on their machines.

Let's say I have colleagues Edward in the UK, Cixin in China, and Juri
in Japan. Edward uses an English Windows machine. Cixin runs a machine
with GB2312 default encoding, and Juri runs Shift-JIS default encoding.
I'm running a US English Mac. Edward, Cixin, and Juri have each written
Octave library projects, with .m files in their local default encoding,
and we all want to write programs that use all those libraries. How can
this be done? If "non-core" .m files are always read with the default
system encoding, then Cixin and Juri's files will always be garbled for
Edward, and vice versa. And there's no system default encoding I can set
that will allow me to use all these libraries at the same time. (Without
manually transcoding their source files, which is a big pain, and a
total no-go if you have developers in multiple encodings working from
the same project git repo.)
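The incompatibility here is at the byte level; a small Python sketch makes it concrete (the codecs stand in for Cixin's and Juri's system encodings):

```python
text = "日本"  # the same two characters appearing in each source file

sjis = text.encode("shift_jis")   # Juri's bytes
gb = text.encode("gb2312")        # Cixin's bytes
assert sjis != gb                 # identical text, incompatible bytes

# Reading Juri's file under Cixin's default encoding fails outright
# (or garbles, depending on the byte sequence):
try:
    sjis.decode("gb2312")
except UnicodeDecodeError:
    print("no single system encoding can read both files")
```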

Another example: I have an Octave package octave-table
(https://github.com/apjanke/octave-table), and I would like its
+table_examples namespace to include examples with international text
and emoji and the like, to demonstrate that they are supported. How
should its source files be written so that they work for users running
under any default encoding? I think they need to be encoded in Unicode,
and Octave has to have a mechanism to know to interpret them as Unicode
(or as a specific UTF format).

And if Octave does encoding detection differently for Octave Forge and
non-Octave-Forge packages, would I then need to transcode my files if my
package is eventually accepted to Octave Forge? When doing further
development, would I also need to go through a "pkg install" step each
time I changed some source code and wanted to test it?

I suspect the only way to resolve this is something like either:
a) support an explicit source code encoding indicator at a per-project,
per-directory, or per-m-file level, or
b) take a big breaking change, and require all .m source files to always
be in Unicode. Then locales are irrelevant when reading source.

For a), you could support a special .encoding file in either each M-code
source dir (the things added to the Octave path) or project root (would
have to be inferred by just traversing up the directory path above
source root files), and add UTF-8 .encoding files to all Octave core and
Octave Forge code dirs. Or for the file-level indicator, you could
support a magic "%encoding <whatever>" comment, like Ruby and Python do.
I would prefer a per-project/dir .encoding, because you only need to
remember to do it once, and not per file. Which also makes it easier to
add it in after the fact for existing projects that need to be
internationalized.

Figuring out the Matlab compatibility situation is difficult. There are
some threads discussing this, but they all confuse source code file
encoding with the runtime's I/O and character data processing, and no
docs come right out and explicitly say how Matlab handles character
encoding of its .m source files.

https://www.mathworks.com/matlabcentral/answers/340903-unicode-characters-in-m-file
https://www.mathworks.com/matlabcentral/answers/262114-why-i-can-not-read-comments-in-chinese-in-my-mfile
https://stackoverflow.com/questions/4984532/unicode-characters-in-matlab-source-files
https://www.mathworks.com/help/matlab/matlab_env/how-the-matlab-process-uses-locale-settings.html

Reading between the lines (and using memories from the dim past), I
think Matlab always treats .m source files as being in the system
default encoding. So I don't think there's a way to support full easy
Matlab portability and full easy locale portability at the same time.
And the Matlab editor does not have good non-ASCII support, so it's
harder to tell what's going on.

Here's another weird edge case: If different .m files are going to be
interpreted as being in different encodings, how do strings with "\x"
escape sequences in those files work? Are those byte sequences produced
by the "\x" escapes interpreted as being in the same encoding as that
source file? Or are they always considered to be in the internal
encoding used by Octave's string objects? More generally, what
transcoding is applied to string literals in M source, and does the "\x"
escape interpretation happen before or after that transcoding? In either
of these scenarios, is it actually possible for a developer to portably
write a string literal that uses \x escapes to encode multibyte
international characters?
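The concern is concrete: the same escaped byte means a complete character, or nothing valid at all, depending on the assumed encoding. A Python sketch of the ambiguity (Python bytes standing in for what a "\x" escape would produce):

```python
# One escaped byte, \xe4:
b = bytes([0xE4])

# Under Latin-1 it is a complete character ...
assert b.decode("latin-1") == "ä"

# ... but under UTF-8 it is an incomplete multibyte sequence.
try:
    b.decode("utf-8")
except UnicodeDecodeError:
    print("a lone \\xe4 is not valid UTF-8")

# To spell "ä" in UTF-8 via escapes, two of them are needed: \xc3\xa4
assert bytes([0xC3, 0xA4]).decode("utf-8") == "ä"
```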

Cheers,
Andrew


Re: locale encoding and core functions

mmuetzel
On 5 March 2019 at 05:33, "Andrew Janke" wrote:

> On 2/23/19 4:12 AM, "Markus Mützel" wrote:
> I suspect the only way to resolve this is something like either:
> a) support an explicit source code encoding indicator at a per-project,
> per-directory, or per-m-file level, or
> b) take a big breaking change, and require all .m source files to always
> be in Unicode. Then locales are irrelevant when reading source.
>
> For a), you could support a special .encoding file in either each M-code
> source dir (the things added to the Octave path) or project root (would
> have to be inferred by just traversing up the directory path above
> source root files), and add UTF-8 .encoding files to all Octave core and
> Octave Forge code dirs. Or for the file-level indicator, you could
> support a magic "%encoding <whatever>" comment, like Ruby and Python do.
> I would prefer a per-project/dir .encoding, because you only need to
> remember to do it once, and not per file. Which also makes it easier to
> add it in after the fact for existing projects that need to be
> internationalized.

Your idea of .encoding files in each directory sounds promising. Maybe we should use ".mfile-encoding" or some other, more specific name.
I'd rather not traverse up the directory tree to look for that file: when should we stop looking? Should we traverse all the way up to the root? And what should be done if we reach a directory without read access?
I would also prefer not to parse each source file for a magic comment.
Both of these options also sound like they might hurt first-run performance.


> Figuring out the Matlab compatibility situation is difficult.
I think anything we'd do in that respect would automatically beat Matlab, which is ignorant of the source file encoding.

> Reading between the lines (and using memories from the dim past), I
> think Matlab always treats .m source files as being in the system
> default encoding.
That is what I gathered as well.

> Here's another weird edge case: If different .m files are going to be
> interpreted as being in different encodings, how do strings with "\x"
> escape sequences in those files work? Are those byte sequences produced
> by the "\x" escapes interpreted as being in the same encoding as that
> source file? Or are they always considered to be in the internal
> encoding used by Octave's string objects? More generally, what
> transcoding is applied to string literals in M source, and does the "\x"
> escape interpretation happen before or after that transcoding? In either
> of these scenarios, is it actually possible for a developer to portably
> write a string literal that uses \x escapes to encode multibyte
> international characters?
Do we expand \x escape sequences when parsing .m files? Or is that something the interpreter does when processing double quoted strings?
In the latter case, I don't think we have to worry about this.

Markus


Re: locale encoding and core functions

apjanke-floss

On 3/9/19 10:10 AM, "Markus Mützel" wrote:

> Your idea with .encoding files in each directory sounds promising. Maybe we should use ".mfile-encoding" or some other name more specific.

Yes, ".mfile-encoding" or similar is better; ".encoding" is too generic
and there's no standard for it.

> I'd rather not traverse up the directory tree to look for that file. When should we stop looking for that file? Should we traverse up until root? What should be done in case we reach a directory without read access?
> I would also prefer to not parse each source file for a magic comment.
> Both of these options also sound like they might impact first run performance.

Now that I think about it some more, sticking with an .mfile-encoding
file for each PATH entry is probably best. Octave projects tend to have
few source dirs, so it's not a burden on users. It avoids your
performance concerns, is easier to code, and won't interact in
surprising ways with .mfile-encoding files that users stick elsewhere in
their directory tree (which might not even be included in source
control! E.g. a user might think editing ~/.mfile-encoding is the way to
use the feature; now it's just making things more complicated).
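For concreteness, here is a rough Python model of the per-PATH-entry lookup (the ".mfile-encoding" file name and the UTF-8 fallback are assumptions from this thread, not a settled design):

```python
from pathlib import Path

def encoding_for(mfile, default="utf-8"):
    """Return the encoding to use for an .m file: the contents of an
    .mfile-encoding file in the same directory if one exists, else the
    default.  Only the file's own directory is consulted; there is no
    traversal up the tree."""
    marker = Path(mfile).resolve().parent / ".mfile-encoding"
    if marker.is_file():
        return marker.read_text().strip()
    return default
```

In the real implementation this lookup would presumably be done once per load-path directory and cached, rather than per file.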


>> Figuring out the Matlab compatibility situation is difficult.
> I think anything we'd do in that respect would automatically beat Matlab that is ignorant to the source file encoding.

It's not about beating Matlab; it's about being able to exchange source
file collections with them unmodified.

>> Reading between the lines (and using memories from the dim past), I
>> think Matlab always treats .m source files as being in the system
>> default encoding.
> That is what I gathered as well.
>
>> Here's another weird edge case: If different .m files are going to be
>> interpreted as being in different encodings, how do strings with "\x"
>> escape sequences in those files work? Are those byte sequences produced
>> by the "\x" escapes interpreted as being in the same encoding as that
>> source file? Or are they always considered to be in the internal
>> encoding used by Octave's string objects? More generally, what
>> transcoding is applied to string literals in M source, and does the "\x"
>> escape interpretation happen before or after that transcoding? In either
>> of these scenarios, is it actually possible for a developer to portably
>> write a string literal that uses \x escapes to encode multibyte
>> international characters?
> Do we automatically escape \x sequences when parsing .m files? Or is this something the interpreter does when processing double quoted strings?
> In the latter case, I don't think that we have to worry about that.

I'm still unclear on whether Octave strings are internally always UTF-8,
or are in the system default encoding. If they're UTF-8, this sounds
fine; \x escapes are always UTF-8 bytes (code units). But if they're
system default encoded, then the \x escape meaning will vary depending
on the locale you're running Octave in.

Cheers,
Andrew


Re: locale encoding and core functions

John W. Eaton
Administrator
On 3/9/19 12:07 PM, Andrew Janke wrote:

> [...snip...]
>
> Now that I think some more, sticking with an .mfile-encoding for each
> PATH entry is probably best. Octave projects tend to have few source
> dirs, so it's not a burden on users. Avoids your performance concerns,
> easier to code, and it won't interact in surprising ways with
> .mfile-encoding files that users stick elsewhere in their directory tree
> (which might not be included in source control! e.g. maybe a user thinks
> editing ~/.mfile-encoding is the way to use it; now this feature is just
> making things more complicated.).

If you would like to do this, let's consider making the file more
generally useful.  We could also use it to store other per-directory
information.  For example, we could mark directories as "traditional" so
that full-on Matlab compatibility mode could be enforced, which could
make a feature like warning/error for Matlab incompatibility actually
useful.

Since the info only applies to directories in the load-path, the
performance hit shouldn't be too high.  We already scan those
directories at startup and rescan them when they change.  And the number
of directories is not large.  Compare this with the .dir-locals.el files
that Emacs uses -- they may appear in any parent directory of any file
that Emacs opens, not just the files where .el files appear.

Also, here's how Emacs handles .dir-locals.el files:

https://www.gnu.org/software/emacs/manual/html_node/emacs/Directory-Variables.html#Directory-Variables

Note that it looks up the directory tree until it finds a .dir-locals.el
file.  But since our search is limited to the load path, we wouldn't
have to scan the filesystem each time this information is needed.  We
could already have it cached in the load_path object.

In Emacs these directory local files normally set variables, but can
also eval code.  Octave doesn't have variables in the same way that
Emacs does, so we would have to decide whether we are willing to execute
arbitrary code, a subset of functions, or just allow some limited number
of settings/options.

Also in Emacs, I think that anything that can be set on a per-directory
basis may also be set in each file by scanning for special comments.
Would it also be worth doing that, at least for .m files that we already
parse?

jwe


Re: locale encoding and core functions

apjanke-floss


On 3/9/19 1:03 PM, John W. Eaton wrote:

> On 3/9/19 12:07 PM, Andrew Janke wrote:
>> [...snip...]
>
> If you would like to do this, let's consider making the file more
> generally useful.  We could also use it to store other per-directory
> information.  For example, we could mark directories as "traditional" so
> that full-on Matlab compatibility mode could be enforced, which could
> make a feature like warning/error for Matlab incompatibility actually
> useful.

Now that's kind of exciting. That could be the way that support for
Matlab-compatible double-quoted string object literals gets into Octave.

And maybe if it's going to be so powerful, it shouldn't be a hidden
dot-file. Maybe "mcode.properties"?

> Also, here's how Emacs handles .dir-locals.el files:
>
> https://www.gnu.org/software/emacs/manual/html_node/emacs/Directory-Variables.html#Directory-Variables 
>
> [...snip...]
>
> In Emacs these directory local files normally set variables, but can
> also eval code.  Octave doesn't have variables in the same way that
> Emacs does, so we would have to decide whether we are willing to execute
> arbitrary code, a subset of functions, or just allow some limited number
> of settings/options.

The idea of running arbitrary code in a pre-code-reading context is kind
of scary to me. Especially because relative execution time & environment
of this may depend on what order source files from different directories
get called & loaded in. My gut says to stick with a predefined set of
settings/options.

Andrew


Re: locale encoding and core functions

John W. Eaton
Administrator
On 3/9/19 2:24 PM, Andrew Janke wrote:

> The idea of running arbitrary code in a pre-code-reading context is kind
> of scary to me. Especially because relative execution time & environment
> of this may depend on what order source files from different directories
> get called & loaded in. My gut says to stick with a predefined set of
> settings/options.

Yeah.  Long ago, Emacs would execute arbitrary code without asking.  Now
it asks before doing that and allows ways to make it happen
automatically.  But for Octave I also think we can just allow a limited
number of settings.

jwe




Re: locale encoding and core functions

Mike Miller-4
On Sat, Mar 09, 2019 at 14:49:47 -0500, John W. Eaton wrote:

> On 3/9/19 2:24 PM, Andrew Janke wrote:
>
> > The idea of running arbitrary code in a pre-code-reading context is kind
> > of scary to me. Especially because relative execution time & environment
> > of this may depend on what order source files from different directories
> > get called & loaded in. My gut says to stick with a predefined set of
> > settings/options.
>
> Yeah.  Long ago, Emacs would execute arbitrary code without asking.  Now it
> asks before doing that and allows ways to make it happen automatically.  But
> for Octave I also think we can just allow a limited number of settings.
The PKG_ADD and PKG_DEL files can be dropped into any directory on the
load path, and they already support executing arbitrary commands.

Do we need to define a new file or can we install PKG_ADD files that run
some command that does what you need? Something like

    dn = fileparts (mfilename ("fullpath"));
    set_dir_file_encoding (dn);

?

--
mike


Re: locale encoding and core functions

Carnë Draug
In reply to this post by mmuetzel
On Sat, 9 Mar 2019 at 15:10, "Markus Mützel" <[hidden email]> wrote:
> [...]
> I would also prefer to not parse each source file for a magic comment.
> Both of these options also sound like they might impact first run performance.
> [...]

You don't have to parse the whole file for such a magic comment; you can
specify where the notation must appear.  For example, in HTML the
charset must be declared within the first 1024 bytes.  As another
example, in Python it must be on the first or second line [1].

[1] https://www.python.org/dev/peps/pep-0263/
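For illustration, a sketch of such a restricted scan in Python (the "%encoding <name>" comment syntax is the hypothetical one suggested earlier in this thread, not an existing Octave feature):

```python
import re

# Accept forms like "%encoding UTF-8" or "% encoding: latin-1".
MAGIC = re.compile(r"^\s*%+\s*encoding[:=]?\s+([-\w.]+)", re.IGNORECASE)

def sniff_mfile_encoding(lines):
    """Look for an encoding magic comment, but only in the first two
    lines (the PEP 263 rule), so the whole file never needs scanning."""
    for line in lines[:2]:
        m = MAGIC.match(line)
        if m:
            return m.group(1)
    return None
```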


Re: locale encoding and core functions

John W. Eaton
Administrator
In reply to this post by Mike Miller-4
On 3/9/19 3:25 PM, Mike Miller wrote:
 > The PKG_ADD and PKG_DEL files can be dropped into any directory on the
 > load path, and they already support executing arbitrary commands.

Good point.  We might as well use those files.  As a separate issue, we
might think about whether it is safe to execute arbitrary code found in
those files.

 > Do we need to define a new file or can we install PKG_ADD files that run
 > some command that does what you need? Something like
 >
 >      dn = fileparts (mfilename ("fullpath"));
 >      set_dir_file_encoding (dn);
Seems reasonable to me.

jwe




Re: locale encoding and core functions

apjanke-floss


On 3/12/19 12:49 AM, John W. Eaton wrote:

> On 3/9/19 3:25 PM, Mike Miller wrote:
>  > The PKG_ADD and PKG_DEL files can be dropped into any directory on the
>  > load path, and they already support executing arbitrary commands.
>
> Good point.  We might as well use those files.  As a separate issue, we
> might think about whether it is safe to execute arbitrary code found in
> those files..
>
>  > Do we need to define a new file or can we install PKG_ADD files that run
>  > some command that does what you need? Something like
>  >
>  >      dn = fileparts (mfilename ("fullpath"));
>  >      set_dir_file_encoding (dn);
> Seems reasonable to me.
>
> jwe
>

This sounds like it would work.

How about a more generic function that supports compatibility mode and
other code base properties like jwe suggested earlier? Something like:

# Signature
codebase_properties (dir, property_name, property_val)

# Set encoding
my_dir = fileparts (mfilename ("fullpath"))
codebase_properties (my_dir, "encoding", "UTF-8")
# Set compatibility mode
codebase_properties (my_dir, "compatibility", "octave")
codebase_properties (my_dir, "compatibility", "matlab")  # or "traditional"

And maybe have no-arg codebase_properties() dump a list of all the dirs
with properties set and the values of all their properties, as a
debugging convenience.
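To make the proposed semantics concrete, here is a toy Python model of that interface (the name, signature, and behavior are all speculative, not Octave API):

```python
# Hypothetical model of the codebase_properties() proposal above.
_props = {}   # directory -> {property name: value}

def codebase_properties(dirname=None, name=None, value=None):
    """With three arguments, set a property for a directory.  With no
    arguments, return a snapshot of every directory's properties (the
    proposed debugging dump)."""
    if dirname is None:
        return {d: dict(p) for d, p in _props.items()}
    _props.setdefault(dirname, {})[name] = value

# Mirror of the usage sketched above:
codebase_properties("/my/proj", "encoding", "UTF-8")
codebase_properties("/my/proj", "compatibility", "matlab")
```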

Andrew