[ragel-users] Hints about ragel for performance

Discussion:

Michael Lachmann

2012-11-03 22:34:30 UTC

Hi,

I'm starting to learn ragel because I'd like to write a very fast
parser to a fairly simple file structure.
I'd like to learn some of the tricks of increasing the performance of
the resulting program. So,
here are a few questions:

1. Is there a good sample program in terms of performance? I
downloaded awkemu - is that a good example?
2. Often, one can use **, or one can find a terminating character.
For example, awkemu has:
line = ( blineElements** '\n' )
I think here just * would have been enough, because there is the
terminating \n - is that right? Does it matter?
Should ** be avoided if possible?

3. Is there a disadvantage of using the lex-like scanner with
\*
pat =>
pat =>
etc., vs just specifying the full machine?
4. Is there a disadvantage of using intersection? For example, I think
the above line handling can written as:
line = something & [^\n]* '\n'
where something doesn't care about handling end-of-line. Is it just as
fast as writing expressions that also handle end-of line?

5. awkemu uses the following:
--
/* Find the last newline by searching backwards. This is where
* we will stop processing on this iteration. */
p = buf;
pe = buf + have + len - 1;
while ( *pe != '\n' && pe >= buf )
pe--;
pe += 1;

/* fprintf( stderr, "running on: %i\n", pe - p ); */

%% write exec;

/* How much is still in the buffer. */
have = data + len - pe;
if ( have > 0 )
memmove( buf, pe, have );
--
Is the first running backward to find the last eol necessary? It seems
to run part of the file through two parsers.

Thanks!
Michael

Michael Lachmann

2012-11-03 23:51:28 UTC

Permalink

Just continuing the performance question.
I tried to write a simple wc in ragel.

The main part looks like this:
--
%%{
machine wc;
word = (^space)+ > { nwords++; };
line = (space* . (word . space+)* word?) &
(^'\n')* > { nlines++; };
main := (line . '\n')** (line)? ;
}%%
--
(NOTE: I think this is still not totally correct. How do I handle a
line without '\n' at the end of the file correctly?)

Then I coded it in C:
--
switch( *B.cur )
{
case '\n':
n_lines++;
case ' ':
case '\t':
in_word = false ;
break ;
default:
if( !in_word )
{
in_word = true ;
nwords++ ;
}
}
--
The ragel parser, compiled with g++ -O3, runs at ~35MB/s on a big
file, and the hand-crafted parser at least at 130MB/s.
Did I do some errors in the ragel coding? Could I have done it more efficiently?
Or, will hand coding always be a fairly bit faster than ragel generated code?

Thanks!
Michael

Post by Michael Lachmann
Hi,
I'm starting to learn ragel because I'd like to write a very fast
parser to a fairly simple file structure.
I'd like to learn some of the tricks of increasing the performance of
the resulting program. So,
1. Is there a good sample program in terms of performance? I
downloaded awkemu - is that a good example?
2. Often, one can use **, or one can find a terminating character.
line = ( blineElements** '\n' )
I think here just * would have been enough, because there is the
terminating \n - is that right? Does it matter?
Should ** be avoided if possible?
3. Is there a disadvantage of using the lex-like scanner with
\*
pat =>
pat =>
etc., vs just specifying the full machine?
4. Is there a disadvantage of using intersection? For example, I think
line = something & [^\n]* '\n'
where something doesn't care about handling end-of-line. Is it just as
fast as writing expressions that also handle end-of line?
--
/* Find the last newline by searching backwards. This is where
* we will stop processing on this iteration. */
p = buf;
pe = buf + have + len - 1;
while ( *pe != '\n' && pe >= buf )
pe--;
pe += 1;
/* fprintf( stderr, "running on: %i\n", pe - p ); */
%% write exec;
/* How much is still in the buffer. */
have = data + len - pe;
if ( have > 0 )
memmove( buf, pe, have );
--
Is the first running backward to find the last eol necessary? It seems
to run part of the file through two parsers.
Thanks!
Michael

Michael Conrad

2012-11-11 08:52:04 UTC

Permalink

Post by Michael Lachmann
Just continuing the performance question.
I tried to write a simple wc in ragel.
--
%%{
machine wc;
word = (^space)+ > { nwords++; };
line = (space* . (word . space+)* word?) &
(^'\n')* > { nlines++; };
main := (line . '\n')** (line)? ;
}%%
--
(NOTE: I think this is still not totally correct. How do I handle a
line without '\n' at the end of the file correctly?)

If you want a machine closer in spirit to the C code, try
%%{
machine wc;
word = (^space)+ >{ nwords++; };
line = (any-'\n')* '\n' >{ nlines++; };
words = (space* word space)*;
main := words | line*;
}%%

Keep in mind that ragel runs the sub-machines "words" and "line*" in
parallel, much like your C loop, so all actions occur regardless of
which one reaches a final state.

(also, this machine doesn't bother to make the end of the file a valid
final state)

Post by Michael Lachmann
Did I do some errors in the ragel coding? Could I have done it more efficiently?
Or, will hand coding always be a fairly bit faster than ragel generated code?

Well, you wrote a hand-crafted parser with essentially 2 states and such
simple transitions that you don't even need to check which state you're
in before you switch() on the next character. The assembly is probably
a dozen bytes long, and probably runs closer to 1GB/s except your hard
drive and ram can only deliver 130MB/s.

When you get into any significantly complex state machine, you'll need
to track which actions to fire more carefully, and end up about the same
speed as what ragel generates.

Adrian Thurston

2012-11-06 17:28:28 UTC

Permalink

In general, if you're careful to not introduce unwanted non-deterministic constructions, then performance comes down to the code style that you choose and what's in your actions

The scanner construction is super-regular and can indroduce an extra cost beyond a single pass because it backtracks. In the best case, the backtracking introduces 0 additional cost, yeilding O(n) running time overall. At worst you can have O(n^2) running time.
-----Original Message-----
From: Michael Lachmann <lachmann at eva.mpg.de>
Sender: ragel-users-bounces at complang.org
Date: Sat, 3 Nov 2012 23:34:30
To: <ragel-users at complang.org>
Reply-To: ragel-users at complang.org
Subject: [ragel-users] Hints about ragel for performance

Hi,

I'm starting to learn ragel because I'd like to write a very fast
parser to a fairly simple file structure.
I'd like to learn some of the tricks of increasing the performance of
the resulting program. So,
here are a few questions:

1. Is there a good sample program in terms of performance? I
downloaded awkemu - is that a good example?
2. Often, one can use **, or one can find a terminating character.
For example, awkemu has:
line = ( blineElements** '\n' )
I think here just * would have been enough, because there is the
terminating \n - is that right? Does it matter?
Should ** be avoided if possible?

3. Is there a disadvantage of using the lex-like scanner with
\*
pat =>
pat =>
etc., vs just specifying the full machine?
4. Is there a disadvantage of using intersection? For example, I think
the above line handling can written as:
line = something & [^\n]* '\n'
where something doesn't care about handling end-of-line. Is it just as
fast as writing expressions that also handle end-of line?

5. awkemu uses the following:
--
/* Find the last newline by searching backwards. This is where
* we will stop processing on this iteration. */
p = buf;
pe = buf + have + len - 1;
while ( *pe != '\n' && pe >= buf )
pe--;
pe += 1;

/* fprintf( stderr, "running on: %i\n", pe - p ); */

%% write exec;

/* How much is still in the buffer. */
have = data + len - pe;
if ( have > 0 )
memmove( buf, pe, have );
--
Is the first running backward to find the last eol necessary? It seems
to run part of the file through two parsers.

Thanks!
Michael