[ragel-users] Conditional parsing

Discussion:

Iñaki Baz Castillo

2014-04-10 20:08:04 UTC

Hi,

I'm building a parser for a protocol message similar to HTTP (say, a
main header followed by N "key: value" lines separated by CRLF, ending
with a final double CRLF). My concern is:

- I parse the messages in a "Dispatcher" module that just needs to
parse a few fields in each message.
- Then the Dispatcher passes the message to a Worker thread via UNIX Socket.
- And the Worker must parse it again, but in this case I need all the
fields parsed.

Note that during the Worker's parsing, a complex C++ object is built
with all the parsed fields mapped into member variables, so I don't
want to deal with those complex objects in the Dispatcher module.

How could I reuse the same Ragel machine for both cases? Of course I
would like something like:

%%{
machine Parser;

[...]

if (dispatcher) {
main := xxxxxxx
}
else {
main := yyyyyyy
}

}%%

Thanks a lot.

--
Iñaki Baz Castillo
<ibc at aliax.net>

Iñaki Baz Castillo

2014-04-10 20:52:41 UTC

Mmm, I think using the "when" statement is the way to go :)

--
Iñaki Baz Castillo
<ibc at aliax.net>

Adrian Thurston

2014-06-28 22:43:17 UTC

Hi Iñaki,

Using conditionals is one way. I'm not sure I fully understand your use
case, but I think you can also make use of the ability to enter into any
named instantiation. All machine instances are defined in the data
section as constant values. It's just a matter of setting the current
state to the appropriate value.

-Adrian

Iñaki Baz Castillo

2014-07-01 10:21:50 UTC

Hi, do you mean setting cs to point to a specific machine "fragment"?
Could you please provide a small example?

Thanks a lot.

--
Iñaki Baz Castillo
<ibc at aliax.net>
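
A minimal sketch of the entry-point approach Adrian describes, assuming
Ragel 6 with the C host language. Each ":=" instantiation gets a
<machine>_en_<name> constant from "write data", so the caller can choose
where to start by assigning cs itself instead of using "write init". The
machine name "parser", the "dispatcher" instantiation, the toy field
grammar and the worker_fields counter are all hypothetical, not taken
from anyone's real code:

```ragel
#include <stddef.h>

%%{
    machine parser;

    crlf  = "\r\n";
    name  = [A-Za-z\-]+;
    value = (any - [\r\n])*;
    field = name ":" value crlf;

    action worker_field { worker_fields++; }

    # Worker entry point: every completed field runs an action.
    main := (field @worker_field)* crlf;

    # Dispatcher entry point: same grammar, no per-field work.
    dispatcher := field* crlf;

    write data;
}%%

/* Returns 1 if buf holds a complete message, 0 otherwise. */
int parse(const char *buf, size_t len, int is_dispatcher, int *worker_count)
{
    const char *p = buf, *pe = buf + len;
    int worker_fields = 0;

    /* Pick the entry point by hand; "write init" would always use main. */
    int cs = is_dispatcher ? parser_en_dispatcher : parser_en_main;

    %% write exec;

    *worker_count = worker_fields;
    return cs >= parser_first_final;
}
```

Both instantiations live in the same generated tables, so this suits the
case where one binary needs both modes and selects one at run time.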

Tim Goddard

2014-06-30 02:19:31 UTC

Sounds like you have a common language, but want two separate sets of actions.

You can include a file within a Ragel machine. I'd write the grammar in one
file, then declare the "main" object and actions within two separate parsers.

Cheers,

Tim
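
A minimal sketch of this split, assuming Ragel 6 with the C host language.
To keep it in a single file it uses the same-file form of include (by
machine name only); with a separate grammar file the statement would take
the form: include message_grammar "message_grammar.rl"; All machine names,
the toy grammar and the actions are hypothetical. The shared machine holds
only definitions, and each parser attaches its own actions when it builds
its main:

```ragel
#include <stdio.h>
#include <stddef.h>

/* Shared grammar: definitions only, no actions and no main. */
%%{
    machine message_grammar;

    crlf  = "\r\n";
    name  = [A-Za-z\-]+;
    value = (any - [\r\n])*;
}%%

/* Dispatcher: same grammar, one cheap action per field. */
%%{
    machine dispatcher_parser;
    include message_grammar;

    action count_field { nfields++; }

    field = name ":" value crlf @count_field;
    main := field* crlf;

    write data;
}%%

/* Worker: same grammar, captures every name and value. */
%%{
    machine worker_parser;
    include message_grammar;

    action mark      { mark = p; }
    action name_end  { printf("name:  %.*s\n", (int)(p - mark), mark); }
    action value_end { printf("value: %.*s\n", (int)(p - mark), mark); }

    field = (name >mark %name_end) ":" (value >mark %value_end) crlf;
    main := field* crlf;

    write data;
}%%

/* Returns 1 on a complete message; *count receives the number of fields. */
int dispatcher_parse(const char *buf, size_t len, int *count)
{
    const char *p = buf, *pe = buf + len;
    int cs, nfields = 0;

    %%{ machine dispatcher_parser; write init; write exec; }%%

    *count = nfields;
    return cs >= dispatcher_parser_first_final;
}

/* Returns 1 on a complete message; prints each field as it is parsed. */
int worker_parse(const char *buf, size_t len)
{
    const char *p = buf, *pe = buf + len, *mark = buf;
    int cs;

    %%{ machine worker_parser; write init; write exec; }%%

    return cs >= worker_parser_first_final;
}
```

The Dispatcher never sees the Worker's actions, so the complex objects stay
out of the Dispatcher module entirely.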

William Ahern

2014-06-30 02:49:34 UTC

Here's an example from my own code. For various reasons (expediency,
simplicity) I used different machines to parse individual headers. But they
all use the same library of tokenization sub-machines.

The first machine is the basic library. You could put this in a separate
file, but mine is in the same file as everything else HTTP/RTSP-related. The
second and third machines are parser examples. Note that most of the context
is missing, so you won't be able to copy+paste this. For example, I have a
basic tokenizer written in pure C (which follows DJB's algorithm for
structured MIME header parsing) which emits tagged characters as short
integers (e.g. an escaped or quoted character will have a high bit set).
This made it easier for me to handle things like quoted strings and
parenthetical comments. That said, I wrote this years ago, and today I might
find it easier to handle those problems with Ragel's fcall and fgoto
statements. But the truly beautiful thing about Ragel is how it lets you
mix and match approaches, so there's really no wrong way. I would
counsel a novice to avoid attempts at Ragel purity, i.e. trying to do
everything in Ragel, such as handling recursive structures directly in Ragel.
You can do it (and I do in some other code, like my Flash FLV, Microsoft
ASF, and SMTP parsers), but it's not worth struggling over.

%%{
    machine tokenizer;

    crlf = [\r\n];
    lwsp = [ \t];

    # Tagged characters: ASCII code with 0x0100 set (0x0130 is a tagged '0').
    qdigit  = (0x0130 - 0x0139);
    qxdigit = (0x0141 - 0x0146) | (0x0161 - 0x0166) | qdigit;

    digits  = digit | qdigit;
    xdigits = xdigit | qxdigit;

    # Tagged A-Z and a-z (both are ranges).
    qalpha = (0x0141 - 0x015a) | (0x0161 - 0x017a);

    # Accumulate a decimal number; the mask strips the tag bits.
    action num_begin { num = 0; }
    action num_write { num *= 10; num += (0xff & fc) - '0'; }

    # Accumulate a hexadecimal number.
    action hex_begin { num = 0; }
    action hex_write { num <<= 4; num += ((0xff & fc) > '9')? (10 + (tolower((0xff & fc)) - 'a')) : (0xff & fc) - '0'; }

    # Collect a string into the obs buffer.
    action str_begin {
        str = 0;
        if ((error = obs_new(obs, 0)))
            goto error;
    }

    action str_write {
        if ((error = obs_putc(obs, 0xff & fc)))
            goto error;
    }

    action str_end { str = obs_top(obs); }
}%%

%%{
    machine x_sessioncookie_parser;
    alphtype short;

    include tokenizer;

    action oops {
        rtsp_badparse("x-sessioncookie", src, len, p);
        error = EINVAL;
        goto error;
    }

    token = (alnum | "+" | "/")+ >str_begin $str_write %str_end %{ hdr->token = str; };

    main := (token lwsp*) $!oops;

    write data;
}%%

%%{
    machine content_type_parser;
    alphtype short;

    getkey (0xff & (*fpc)); # Mask high-order bits.

    include tokenizer;

    action oops {
        rtsp_badparse("Content-Type", src, len, p);
        error = EINVAL;
        goto error;
    }

    equal = lwsp** "=" lwsp**;

    reg_name = (alnum | [!#$&.+\-\^_]){1,127}; # RFC 4288 4.2

    charset  = "charset" equal reg_name >str_begin $str_write %str_end %{ hdr->charset = str; };
    boundary = "boundary" equal reg_name >str_begin $str_write %str_end %{ hdr->boundary = str; };

    attrib = (charset | boundary)? <: ^";"**;

    type    = reg_name >str_begin $str_write %str_end %{ hdr->type = str; };
    subtype = reg_name >str_begin $str_write %str_end %{ hdr->subtype = str; };

    main := (type "/" subtype lwsp** (";" lwsp** attrib)*) $!oops;

    write data;
}%%
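
The tagged-character idea mentioned above (a pre-tokenizer in plain C that
hands Ragel short integers) could look roughly like this. It is only an
illustrative sketch: it is not William's tokenizer and does not implement
DJB's algorithm. The tag value 0x0100 is inferred from the qdigit range
0x0130 - 0x0139 in the tokenizer machine; the names QUOTED and tag_chars
are hypothetical:

```c
#include <stddef.h>

#define QUOTED 0x0100 /* assumed tag bit, so a quoted '0' becomes 0x0130 */

/*
 * Copy src into dst as shorts. Characters inside double quotes, or
 * directly after a backslash, get the QUOTED bit set. The quote
 * delimiters and the escaping backslash are consumed, not emitted.
 * dst must have room for len elements. Returns the number written.
 */
size_t tag_chars(const char *src, size_t len, short *dst)
{
    size_t n = 0;
    int in_quotes = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];

        if (c == '\\' && i + 1 < len) {
            /* Escaped character: skip the backslash, tag what follows. */
            dst[n++] = (short)(QUOTED | (unsigned char)src[++i]);
        } else if (c == '"') {
            in_quotes = !in_quotes;
        } else {
            dst[n++] = (short)(in_quotes ? (QUOTED | c) : c);
        }
    }
    return n;
}
```

A machine declared with alphtype short can then match 0x0130 - 0x0139 for a
quoted digit, or mask with 0xff (as the getkey line and the actions above
do) when the tag does not matter.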

Iñaki Baz Castillo

2014-07-01 10:23:49 UTC

Great! Thanks a lot.

_______________________________________________
ragel-users mailing list
ragel-users at complang.org
http://www.complang.org/mailman/listinfo/ragel-users

--
Iñaki Baz Castillo
<ibc at aliax.net>

Iñaki Baz Castillo

2014-07-01 10:22:30 UTC

Post by Tim Goddard
Sounds like you have a common language, but want two separate sets of actions.
You can include a file within a ragel machine - I'd write the grammar in one
file, then declare the "main" object and actions within two separate parsers.

Good point!

--
Iñaki Baz Castillo
<ibc at aliax.net>