Discussion:
[ragel-users] signed/unsigned portability issue
Peter van Dijk
2013-10-24 18:52:17 UTC
Hello folks,

We (PowerDNS) have a small Ragel parser for segmenting and unescaping DNS TXT record data. Some time ago, we expanded the allowed inputs for this parser to the full 8-bit 'extended ASCII' range (which Ragel calls 'extend').

This works well on most platforms - but it failed for us on Debian/s390x.

After a lot of digging I found that char is unsigned on s390x, while it is signed on amd64, i386 and many other platforms.

I have added 'alphtype unsigned char' to our Ragel file. This makes the parser work reliably on both amd64 and s390x (and, hopefully, many other platforms).
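
To illustrate why reading the input as unsigned char behaves the same everywhere, here is a minimal C sketch. It is not taken from the PowerDNS parser or from Ragel's generated code; the function names are hypothetical. It counts bytes in the 128..255 range twice, once through a plain char pointer and once through an unsigned char pointer.

```c
#include <limits.h>
#include <stddef.h>

/* 1 when plain char is signed for this compiler/target, 0 otherwise. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: counts bytes >= 128 while reading through plain char.
 * Where char is signed, those bytes widen to negative ints, so the count
 * is 0; where char is unsigned, they widen to 128..255 and are counted. */
static size_t count_high_plain(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        int v = buf[i];          /* widening keeps the platform's signedness */
        if (v >= 128)
            n++;
    }
    return n;
}

/* Same count, but reading through unsigned char: the result does not
 * depend on the signedness of plain char. */
static size_t count_high_unsigned(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 128)
            n++;
    }
    return n;
}
```

For the four bytes 'a', 0xE9, 'b', 0xFF the unsigned variant reports 2 on any platform, while the plain char variant reports 0 or 2 depending on what plain_char_is_signed() returns.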

However, I feel something is wrong. It seems that on s390x, Ragel is confused about the signedness of char: it generates a parser that treats extend as -128..127, but the non-ASCII input bytes arrive as values in the 128..255 range. This discrepancy feels like a Ragel issue to me.
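
The mismatch described above can be sketched in a few lines of C. This is only an illustration under the assumption that the range bounds are -128..127 as stated in the post; the names are hypothetical and this is not Ragel's generated code. The same byte is tested against those bounds after being widened the signed way and the unsigned way.

```c
/* Bounds as described in the post: 'extend' treated as -128..127. */
enum { EXTEND_LO = -128, EXTEND_HI = 127 };

/* Hypothetical range test against the signed bounds. */
static int in_extend(int c)
{
    return EXTEND_LO <= c && c <= EXTEND_HI;
}

/* Value of byte b when read through an unsigned char: 0..255. */
static int as_unsigned(unsigned char b)
{
    return b;
}

/* Value of byte b when read through a two's-complement signed char:
 * -128..127. Computed arithmetically to stay portable. */
static int as_signed(unsigned char b)
{
    return b < 128 ? b : b - 256;
}
```

With these helpers, in_extend(as_signed(0xE9)) is 1 but in_extend(as_unsigned(0xE9)) is 0, because 0xE9 widens to -23 in the first case and to 233 in the second. ASCII bytes such as 'a' pass either way, which would explain why only non-ASCII input fails.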

A much longer version of this story is at https://www.evernote.com/shard/s344/sh/cb968134-4d58-4e46-8b5e-47366a129038/60fafaf56d5a350edf891cf82cefc66d

My question: is this a Ragel bug? Regardless of yes/no, is what I did (alphtype unsigned char) the best workaround?

I did most of the debugging with ragel 6.7-1 (Debian version number), but verified that the problem is identical in 6.8-1.

Kind regards,
--
Peter van Dijk
Netherlabs Computer Consulting BV - http://www.netherlabs.nl/

William Ahern
2013-10-24 19:53:12 UTC
IMHO it would probably be better for Ragel to use unsigned char arithmetic
for both char and unsigned char. Off the top of my head it even seems like
Ragel should treat all input as unsigned.

FWIW, I always use unsigned arithmetic, for Ragel and most everything else.
Signed arithmetic is for mathematical formulas, not bit twiddling and string
processing. At the very least, it quickly leads to undefined behavior,
whereas signed->unsigned conversions in C are always well defined.
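
As a small illustration of the point about conversions, the following C sketch (helper names are hypothetical) relies only on behavior the C standard defines: converting an out-of-range value to an unsigned type wraps modulo one more than the type's maximum, and the <ctype.h> functions expect a value representable as unsigned char (or EOF).

```c
#include <ctype.h>

/* Converting any int to unsigned char is well defined: the result is
 * the value modulo UCHAR_MAX + 1 (256 when CHAR_BIT is 8). */
static unsigned char to_byte(int v)
{
    return (unsigned char)v;
}

/* Passing a negative plain char straight to isalpha() is undefined
 * behavior, so convert to unsigned char first. */
static int is_alpha_byte(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```

Assuming 8-bit bytes, to_byte(-1) is 255 and to_byte(-23) is 233, regardless of whether plain char is signed on the target.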

Does anybody on the list actually use or depend on signed behavior in their
machines?
Adrian Thurston
2013-11-02 14:55:50 UTC
We need Ragel's internal data structures to match the signedness of the input
array, and sometimes you just need a signed type because you're parsing a
stream of integers.
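
A rough C sketch of why a signed alphabet is sometimes needed, assuming a hypothetical machine that accepts integer symbols in -10..10 (the range and function names are made up for illustration). If the same bounds and symbols are forced through an unsigned view, the lower bound wraps above the upper bound and nothing matches.

```c
/* Range test with signed symbols: works as intended. */
static int in_range_signed(int sym)
{
    return -10 <= sym && sym <= 10;
}

/* The same test after forcing everything to unsigned. */
static int in_range_unsigned_view(int sym)
{
    unsigned lo = (unsigned)-10;   /* wraps to UINT_MAX - 9 */
    unsigned hi = 10u;
    unsigned u = (unsigned)sym;
    return lo <= u && u <= hi;     /* lo > hi, so this is never true */
}
```

Here in_range_signed(-3) is 1, while in_range_unsigned_view(-3) and in_range_unsigned_view(3) are both 0.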

Perhaps what might be better is defaulting the C alphtype to unsigned char, if
that's the more common case.

-Adrian
_______________________________________________
ragel-users mailing list
ragel-users at complang.org
http://www.complang.org/mailman/listinfo/ragel-users
Adrian Thurston
2013-11-02 14:52:37 UTC
Definitely a bug. We take the min and max values for the type from CHAR_MIN and
CHAR_MAX, which should be set appropriately for the architecture.

However, there is an isSigned bit that is not drawn from the compilation
environment. We should be doing that somehow.

This code is in common.{h,cc}
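
One way to draw the signedness from the compilation environment could look like the C sketch below. The struct and function are hypothetical and are not Ragel's actual definitions in common.{h,cc}; the idea is only that the signed flag is derived from CHAR_MIN, so it cannot disagree with the min and max values.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical description of an alphabet type; not Ragel's structure. */
struct alph_type {
    const char *name;
    bool is_signed;
    long min_val;
    long max_val;
};

/* Fills in plain char from <limits.h>, deriving the signed flag from
 * CHAR_MIN instead of hard-coding it. */
static struct alph_type describe_plain_char(void)
{
    struct alph_type t = { "char", CHAR_MIN < 0, CHAR_MIN, CHAR_MAX };
    return t;
}
```

Note that this reflects the compiler that builds the tool. If the generated parser is compiled for a different target, the values could still differ, so an explicit alphtype remains the safer choice there.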

-Adrian