Monday, April 25, 2016

Ragel modes comparison

I have recently decided to create a parser of SMTP addresses (RFC 5321) in ragel. Therefore, I have implemented the basic grammar using it in rspamd: smtp_address.rl.

Ragel has three modes of code generation:

  1. Table driven (T mode) - all states are pushed into one table and transitions are calculated using the next state and the input character performing states table lookup depending on the current state
  2. Alphabet driven (F mode) - same as previous but states are searched using the current character as index
  3. Goto driven mode - no tables are created but there are really many goto statements in the code

Initially, I thought that goto driven mode is not very friendly for the modern CPU. However, I decided to perform some performance tests. I've used the ragel generated code to parse 100K of email addresses all in the form of '<@domain1,@domain2:addr%d@example.com>'. Yes, this is a valid SMTP address (counting that %d is replaced with some number) and this enforces the most of state machine to be used.

Here are results from 30 measurements using -O0 optimization level:

And with heavy optimizations - -O3 -march=native 

Tests were performed on my macbook with Haswell CPU. Compiler - clang-3.8.

As you can see, T mode is the slowest and G mode is the winner. F mode has shown the intermediate results. However, the generated code size for G mode is the largest among these options and F mode is the winner in this case. On the other hand, F mode is not suitable for wide input alphabet mode.