[ragel-users] How to implement lookahead in Ragel?

Discussion:

Solomon Gibbs

2012-12-04 01:22:20 UTC

I have two states; one is a specific instance of the other, more
general, state. I believe that the right way to avoid entering both
states simultaneously is to implement lookahead with k>1, but I can't
find any examples of how to do this.

The Ragle user's guide says:

In both the use of fhold and fexec the user must be cautious of
combining the resulting machine with another in such a way that the
transition on which the current position is adjusted is not combined
with a transition from the other machine.

I'm not entirely sure what this means, except perhaps "don't try to
read past the end of the current expression".

My machine looks like this:

seglen16 = any{2} >{ swab(p, &len, 2); len = len - 2; };
action check {len--}
buffer = (any when check)* %when !check @{ printf("[%d]:%d\n", len, *p); };

# JPEG Markers
mk_app0 = 0xFF 0xE0;
mk_appx = 0xFF (0xE0..0xEF);
marker = 0xFF ^0x00;
nonmarker = !marker - zlen;

# JPEG APP Segments
seg_app0_jfif = mk_app0 seglen16 "JFIF" 0x00 buffer @{ printf("jfif app0\n"); };
seg_appx_unk = mk_appx nonmarker* @{ printf("unknown app content\n"); };
seg_app = (seg_app0_jfif | seg_app1_exif | seg_appx_unk);

# Main Machine
expr = (mk_soi @lerr(bad) nonmarker* seg_app* nonmarker* mk_eoi);

I want to tokenize a JPEG header, skipping unknown segments and
handling well-known segments like JFIF. The JPEG application segment
app0 starts with 0xFFE0. If app0 contains JFIF data, the app0 marker
will be followed by a two-byte length and the string "JFIF\0". This
means I need 7 bytes of lookahead when identifying application
segments

Thanks for any pointers.

Tim Goddard

2012-12-04 02:03:11 UTC

Permalink

You shouldn't need to do any form of lookahead there. Since the behaviour
for the general case is just to skip, why not try to parse the header and
use the error action to skip to the end if it fails?

app

Post by Solomon Gibbs
content\n"); }; seg_app = (seg_app0_jfif | seg_app1_exif |

seg_appx_unk);

Post by Solomon Gibbs
# Main Machine
I want to tokenize a JPEG header, skipping unknown segments and
handling well-known segments like JFIF. The JPEG application segment
app0 starts with 0xFFE0. If app0 contains JFIF data, the app0 marker
will be followed by a two-byte length and the string "JFIF\0". This
means I need 7 bytes of lookahead when identifying application
segments
Thanks for any pointers.
_______________________________________________
ragel-users mailing list
ragel-users at complang.org
http://www.complang.org/mailman/listinfo/ragel-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.complang.org/pipermail/ragel-users/attachments/20121204/1392457f/attachment.html>

Solomon Gibbs

2012-12-04 03:18:30 UTC

Permalink

Post by Tim Goddard
You shouldn't need to do any form of lookahead there. Since the behaviour
for the general case is just to skip, why not try to parse the header and
use the error action to skip to the end if it fails?

This is a general question -- the JPEG example just happens to be a
simple illustration.

If I don't do lookahead, then every APP0 segment type (JFIF, EXIF,
unknown, etc.) will be entered; I will then have to do everything in
finishing transitions or implement logic to roll back everything
except the segment that succeeds. This also rules out the possibility
of seeking the input to a different location based on the content of
the segment.

Generally, it seems like lookahead should produce a simpler, faster, machine.

Post by Tim Goddard

Post by Solomon Gibbs
I have two states; one is a specific instance of the other, more
general, state. I believe that the right way to avoid entering both
states simultaneously is to implement lookahead with k>1, but I can't
find any examples of how to do this.
In both the use of fhold and fexec the user must be cautious of
combining the resulting machine with another in such a way that the
transition on which the current position is adjusted is not combined
with a transition from the other machine.
I'm not entirely sure what this means, except perhaps "don't try to
read past the end of the current expression".
seglen16 = any{2} >{ swab(p, &len, 2); len = len - 2; };
action check {len--}
# JPEG Markers
mk_app0 = 0xFF 0xE0;
mk_appx = 0xFF (0xE0..0xEF);
marker = 0xFF ^0x00;
nonmarker = !marker - zlen;
# JPEG APP Segments
content\n"); }; seg_app = (seg_app0_jfif | seg_app1_exif | seg_appx_unk);
# Main Machine
I want to tokenize a JPEG header, skipping unknown segments and
handling well-known segments like JFIF. The JPEG application segment
app0 starts with 0xFFE0. If app0 contains JFIF data, the app0 marker
will be followed by a two-byte length and the string "JFIF\0". This
means I need 7 bytes of lookahead when identifying application
segments
Thanks for any pointers.
_______________________________________________
ragel-users mailing list
ragel-users at complang.org
http://www.complang.org/mailman/listinfo/ragel-users

Tim Goddard

2012-12-04 22:12:59 UTC

Permalink

Post by Solomon Gibbs
This is a general question -- the JPEG example just happens to be a
simple illustration.
If I don't do lookahead, then every APP0 segment type (JFIF, EXIF,
unknown, etc.) will be entered; I will then have to do everything in
finishing transitions or implement logic to roll back everything
except the segment that succeeds. This also rules out the possibility
of seeking the input to a different location based on the content of
the segment.
Generally, it seems like lookahead should produce a simpler, faster, machine.

If you need lookahead, Ragel might not be the best option. If you know it's
fed in particular ways (all data at once / whole blocks only / etc) then you
might be able to do this, but Ragel itself only assumes it has the current
character available when choosing a transition.

If these are reliably "tagged" somehow then you can embed priorities
which make a decision when they recognise the tag:

general = (...) $ (my_alternatives, 0);
specific1 = "tag1" @ (my_alternatives, 1) (...);
specific2 = "tag2" @ (my_alternatives, 1) (...);

main = general | specific1 | specific2;

That would abandon the general case as soon as a tag is recognised. If you
can work without lookahead, it will be faster.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.complang.org/pipermail/ragel-users/attachments/20121205/8376cb35/attachment.html>

Michael Conrad

2012-12-06 04:25:23 UTC

Permalink

Post by Solomon Gibbs

This is a general question -- the JPEG example just happens to be a
simple illustration.
If I don't do lookahead, then every APP0 segment type (JFIF, EXIF,
unknown, etc.) will be entered; I will then have to do everything in
finishing transitions or implement logic to roll back everything
except the segment that succeeds. This also rules out the possibility
of seeking the input to a different location based on the content of
the segment.
Generally, it seems like lookahead should produce a simpler, faster, machine.

Unlikely. As soon as you need multiple characters to consider, you need
to coordinate a buffering strategy between library and host program, and
you have more/larger tables to work with. You would have a more
complicated process for pausing the machine while you fetch more input.
As it is, Ragel's state is a single integer and it can process every
character in a buffer and never needs to backtrack. (And, it doesn't
require a callback mechanism for fetching more input, which I think is
great!)

There are lots of Parser designs that use lookahead, and only run
actions for the parse tree that is actually found. However Parser
algorithms generally run slower than state machines, which is why it is
most common to use a scanner to produce tokens from characters, and then
parse the tokens.

Try a design where each possibility in the machine stores small bits of
data in a struct, and then whichever finishing action happens first
takes only the relevant bits from the struct and acts on them. Then use
"fgoto" to prevent other finishing actions from running. If it gets too
complicated, make the machine generate tokens for a parser and handle
the complication there.

Thats the strategy that worked well for me.

-Mike