8-bit char problem

Paul Kienzle-6
Under mingw and cygwin, toascii(setstr(200)) == 72 and
toascii(setstr(-100)) == 28.  It seems we only have 7-bit
characters to work with.

I'm recompiling with -funsigned-char to see if I can get
toascii(setstr(200)) == 200, but the proper solution is
to tag all chars with unsigned.  Otherwise we have to
figure out how to tell all the compilers that can compile
octave how to force unsigned chars.  Or is there already
an autoconf test to do this?  The only one I see is
AC_C_CHAR_UNSIGNED which defines __CHAR_UNSIGNED__ but
doesn't modify CFLAGS and CXXFLAGS so that they are
forced to be unsigned.
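
For reference, the two numbers above are what you get from masking to the low 7 bits, independent of whether plain char is signed. The following is only a minimal sketch in C; the helper names strip_top_bit and char_value_unsigned are hypothetical and are not part of Octave, and it assumes an 8-bit, two's-complement char:

```c
#include <assert.h>

/* Roughly what toascii() does: keep only the low 7 bits. */
int strip_top_bit(int c)
{
    return c & 0x7f;
}

/* Hypothetical helper: store a value in a plain char, then read it back
   through unsigned char so the result is in 0..255 whether or not plain
   char is signed on this compiler. */
int char_value_unsigned(int v)
{
    assert(v >= -128 && v <= 255);  /* values that fit in 8 bits */
    char c = (char) v;              /* may become negative if char is signed */
    return (unsigned char) c;       /* reinterpret the same 8 bits as 0..255 */
}
```

With these, strip_top_bit(200) is 72 and strip_top_bit(-100) is 28, while char_value_unsigned(200) is 200, which is the result -funsigned-char was expected to give.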

Paul Kienzle
[hidden email]


8-bit char problem

John W. Eaton-6
On 10-Oct-2002, Paul Kienzle <[hidden email]> wrote:

| Under mingw and cygwin, toascii(setstr(200)) == 72 and
| toascii(setstr(-100)) == 28.  It seems we only have 7-bit
| characters to work with.

FWIW, this is also what happens on my Debian system:

  bevo:386> octave
  GNU Octave, version 2.1.36 (i386-pc-linux-gnu).
  Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002 John W. Eaton.
  This is free software; see the source code for copying conditions.
  There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
  FITNESS FOR A PARTICULAR PURPOSE.  For details, type `warranty'.

  Report bugs to <[hidden email]>.

  octave:1> toascii(setstr(200))
  ans = 72
  octave:2> toascii(setstr(-100))
  ans = 28

| I'm recompiling with -funsigned-char to see if I can get
| toascii(setstr(200)) == 200, but the proper solution is
| to tag all chars with unsigned.

This could be somewhat painful, but I suppose it might be worth
doing.

jwe


Re: 8-bit char problem

Paul Kienzle-6
On Thu, Oct 10, 2002 at 04:29:19PM -0500, John W. Eaton wrote:

> On 10-Oct-2002, Paul Kienzle <[hidden email]> wrote:
>
> | Under mingw and cygwin, toascii(setstr(200)) == 72 and
> | toascii(setstr(-100)) == 28.  It seems we only have 7-bit
> | characters to work with.
>
> FWIW, this is also what happens on my Debian system:
>
>   bevo:386> octave
>   GNU Octave, version 2.1.36 (i386-pc-linux-gnu).
>   Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002 John W. Eaton.
>   This is free software; see the source code for copying conditions.
>   There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
>   FITNESS FOR A PARTICULAR PURPOSE.  For details, type `warranty'.
>
>   Report bugs to <[hidden email]>.
>
>   octave:1> toascii(setstr(200))
>   ans = 72
>   octave:2> toascii(setstr(-100))
>   ans = 28

Mine too :-(  From the toascii man page, I see that it is supposed to strip
the top bit.  I did some more tests of the problem I'm trying to solve
(colormaps on my images are messed up on Windows), and setstr is not the
cause.

Perhaps I will write a function double() to convert from characters to
unsigned(?) numbers, but I don't need it for now.

>
> | I'm recompiling with -funsigned-char to see if I can get
> | toascii(setstr(200)) == 200, but the proper solution is
> | to tag all chars with unsigned.
>
> This could be somewhat painful, but I suppose it might be worth doing
> it.

Yes it will be painful because the system string functions and the
consumers of charMatrix use char, and we don't want to replace them all, so
casts will be sprinkled everywhere.  I'm inclined to ignore it for now.

I wonder how much more work it would be to support unicode?

Paul Kienzle
[hidden email]


Re: 8-bit char problem

John W. Eaton-6
On 11-Oct-2002, Paul Kienzle <[hidden email]> wrote:

| Mine too :-(  From the toascii man page, I see that it is supposed to strip
| the top bit.

Yes, I just saw that too.  So I don't think we should change the
behavior of toascii.  But I've just made some changes that make Octave
behave as follows:

  octave:1> toascii (setstr ([-100, 100, 200, 300]))
  warning: range error for conversion to character value
  ans =

      0  100   72    0

  octave:2> abs (setstr ([-100, 100, 200, 300]))
  warning: range error for conversion to character value
  ans =

      0  100  200    0

(AFAIK, abs was originally the recommended way to convert a string to
ASCII in Matlab; maybe now they say to use double).

I'm open to suggestions for better things to do with out of range
values other than converting to zero, but I'm afraid that anything
else will not be easy.
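
The conversion shown above can be sketched roughly as follows. This is not the actual Octave change; to_char_value and convert_values are hypothetical names, and truncating fractional values is an assumption made only for the illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: map a numeric value to an 8-bit character value.
   Anything outside 0..255 (including NaN) becomes 0 and sets the flag. */
unsigned char to_char_value(double x, int *range_error)
{
    assert(range_error != NULL);
    if (!(x >= 0.0 && x <= 255.0)) {
        *range_error = 1;
        return 0;
    }
    return (unsigned char) x;  /* assumption: fractional part is dropped */
}

/* Convert n values, warning once if any were out of range.
   Returns 1 if a range error occurred, 0 otherwise. */
int convert_values(const double *in, unsigned char *out, size_t n)
{
    int range_error = 0;
    for (size_t i = 0; i < n; i++)
        out[i] = to_char_value(in[i], &range_error);
    if (range_error)
        fprintf(stderr,
                "warning: range error for conversion to character value\n");
    return range_error;
}
```

Applied to [-100, 100, 200, 300] this yields 0 100 200 0, matching the abs output above; masking each result with 0x7f then gives the toascii row, 0 100 72 0.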

Matlab doesn't have this "problem" because character matrices are
stored as arrays of double values with a special flag set, so setstr
simply sets the flag and abs unsets it, which would preserve out of
range values, except that it also checks for negative values and
converts those to zero (with a warning).  I'm not sure why they don't
trap large values, since they don't seem too useful in strings.  For
example, I see this weird behavior:

  >> fprintf ('%s\n', setstr (100))
  d
  >> fprintf ('%s\n', setstr (400))
  4.000000e+02

Does this make any sense?

| Perhaps I will write a function double() to convert from characters to
| unsigned(?) numbers, but I don't need it for now.

Yes, we should probably have a double function for compatibility.

| Yes it will be painful because the system string functions and the
| consumers of charMatrix use char, and we don't want to replace them all, so
| casts will be sprinkled everywhere.

Maybe there aren't that many places where it really matters.
 
| I wonder how much more work it would be to support unicode?

I have no idea since I'm not really up to date on things like this.
But clues from others would be helpful.

Thanks,

jwe


Re: 8-bit char problem

Paul Kienzle-6
> (AFAIK, abs was originally the recommended way to convert a string to
> ASCII in Matlab; maybe now they say to use double).

I think they introduced char(), double(), etc. in V5.

> I'm open to suggestions for better things to do with out of range
> values other than converting to zero, but I'm afraid that anything
> else will not be easy.
>
> Matlab doesn't have this "problem" because character matrices are
> stored as arrays of double values with a special flag set, so setstr
> simply sets the flag and abs unsets it, which would preserve out of
> range values, except that it also checks for negative values and
> converts those to zero (with a warning).  I'm not sure why they don't
> trap large values, since they don't seem too useful in strings.  For
> example, I see this weird behavior:
>
>   >> fprintf ('%s\n', setstr (100))
>   d
>   >> fprintf ('%s\n', setstr (400))
>   4.000000e+02
>
> Does this make any sense?

This makes sense if they are using unicode and require a 16-bit character
set.  I believe they have patchy support for it in the most recent release.

>
> | Yes it will be painful because the system string functions and the
> | consumers of charMatrix use char, and we don't want to replace them all, so
> | casts will be sprinkled everywhere.
>
> Maybe there aren't that many places where it really matters.

I find it painful using casts for functions like strlen which really don't
care if the values are signed or unsigned, even if there aren't that many.
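
To illustrate the kind of cast I mean, here is a minimal sketch. It assumes the character data has been switched to unsigned char; ustrlen is a hypothetical wrapper, not something that exists in Octave:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: strlen() takes const char *, so data held as
   unsigned char needs a cast at every call even though the length does
   not depend on signedness. Wrapping it keeps the cast in one place. */
size_t ustrlen(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *) s);
}
```

Without such a wrapper, each call site would carry its own (const char *) cast, and the same goes for the other system string functions.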

- Paul


Re: 8-bit char problem

John W. Eaton-6
On 11-Oct-2002, Paul Kienzle <[hidden email]> wrote:

| >   >> fprintf ('%s\n', setstr (100))
| >   d
| >   >> fprintf ('%s\n', setstr (400))
| >   4.000000e+02
| >
| > Does this make any sense?
|
| This makes sense if they are using unicode and require a 16-bit character
| set.  I believe they have patchy support for it in the most recent release.

I don't understand why a '%s' conversion would result in numeric
output when you are only printing one character.  If the value of 400
is invalid, setstr should complain.  If it's not, then '%s' should
convert it to a character, not a number.  But maybe this is what you
mean about support being patchy?

jwe