Monday, April 25, 2016

Ragel modes comparison

I have recently decided to create a parser of SMTP addresses (RFC 5321) in ragel. Therefore, I have implemented the basic grammar using it in rspamd: smtp_address.rl.

Ragel has three modes of code generation:

  1. Table driven (T mode) - all states are pushed into one table and transitions are calculated using the next state and the input character performing states table lookup depending on the current state
  2. Alphabet driven (F mode) - same as previous but states are searched using the current character as index
  3. Goto driven mode - no tables are created but there are really many goto statements in the code

Initially, I thought that goto driven mode is not very friendly for the modern CPU. However, I decided to perform some performance tests. I've used the ragel generated code to parse 100K of email addresses all in the form of '<@domain1,>'. Yes, this is a valid SMTP address (counting that %d is replaced with some number) and this enforces the most of state machine to be used.

Here are results from 30 measurements using -O0 optimization level:

And with heavy optimizations - -O3 -march=native 

Tests were performed on my macbook with Haswell CPU. Compiler - clang-3.8.

As you can see, T mode is the slowest and G mode is the winner. F mode has shown the intermediate results. However, the generated code size for G mode is the largest among these options and F mode is the winner in this case. On the other hand, F mode is not suitable for wide input alphabet mode.

Wednesday, April 29, 2015

How fast is your memchr

Recently, I got a simple task: calculate line counts in some text. The trick was to do it as quickly as possible. The naive approach was just to use some cycle on a text and count all characters whose code is '\n'. But of course this approach is not very clever.

The second approach was to take the standard memchr function from the C library. This is the most portable but not equally effective approach on different platforms.

The third approach was to treat input as a sequence of 32 or 64 bit integers and use 'XOR' operation with the repeated pattern. However, the trick was to use special magic bits to track if some of bytes in this word are zero after 'XOR'. The idea was taken from GNU C library, where it is used in memchr function.

The code that evaluates the performance of each method is placed here:

Currently, it uses linux specific timer to get time intervals, so if you want to test it on your OS then you need to change this timer's value appropriately.

Here are some results:

Linux/amd64 (gcc 4.8 -O3)

Naive: 5583 0.001655
Stupid xor: 5583 0.001553
Memchr: 5583 0.000229
Magic bits: 5583 0.000352

Here we can see that memchr in glibc is blazingly fast.

FreeBSD/amd64 (clang 3.4.1 -O3)

Naive: 5589 0.001343
Stupid xor: 5589 0.001431
Memchr: 5589 0.001323
Magic bits: 5589 0.000444

In FreeBSD, memchr is almost as slow as naive implementation.

Solaris/amd64 (gcc 4.4 -O2) - this is another hardware platform and test

Naive: 9013 0.872925
Stupid xor: 9013 0.812501
Memchr: 9013 0.889934
Magic bits: 9013 0.955014

-m64 -O2

Naive: 9013 1.138129
Stupid xor: 9013 0.982253
Memchr: 9013 0.741068
Magic bits: 9013 0.349103

As we can see, in Solaris, memchr is reasonably fast but is almost twice slower than bithack approach when used in 64 bits mode.

UPDATE: I've added some more cases to the evaluation, namely SSE2 and AVX2 versions using compiler intrinsics. Finally, I've evaluated the performance of algorithms on a larger file using my Mac laptop with Haswell CPU:

-m64 -mavx2 -O2

Naive: 4471200 0.179973
Stupid xor: 4471200 0.186705
Memchr: 4471200 0.083217
Memchr (musl): 4471200 0.164340
Magic bits: 4471200 0.122103
SSE: 4471200 0.073190
AVX2: 4471200 0.06790

So OSX libc is fast enough to beat the bithack version, so I presume it uses SSE for speed. Nonetheless, in 32 bits mode it sucks (but naive version comes even faster than libc one, because clang is smart enough to optimize it using SSE instructions):

-m32 -mavx2 -O2

Naive: 4471200 0.123135
Stupid xor: 4471200 0.109367
Memchr: 4471200 0.313270
Memchr (musl): 4471200 0.336193
Magic bits: 4471200 0.223416
SSE: 4471200 0.071051
AVX2: 4471200 0.070952

The performance of avx2 and sse2 versions is almost the same in this case. I've found that the version with vmovntdqa was slightly slower than a version with vpmaskmovd.

Monday, June 23, 2014

How to find running statements in sqlite3

Quite often, sqlite3 users get SQLITE_LOCKED error code when trying to execute some query. Sometimes it is related to concurrent access and is well documented here. However, in case of DROP TABLE operation it is practically useless. I have used the following method to detect which statements are not finished yet. To use it, you need some debugger that can attach to a running process. Then you need to set a breakpoint just near sqlite3_exec or whatever other function that causes SQLITE_LOCKED. Then, you need to step inside that function and print the database object (and you need sqlite3 debug symbols + source code to do these checks):

(lldb) p *db

nVdbeActive = 1
nVdbeRead = 1

So you have a single unfinished query. But to find it we need to apply some sort of black magic. First of all, sqlite stores statements that are not finalized (e.g. prepared statements) inside db object in a linked list of Vdbe* structures. Each Vdbe structure contains a single statement (actually, sqlite3_stmt is an opaque name for Vdbe structure). But how could we find, which query is finished and which is active. Here is the code from sqlite3 itself:

SQLITE_API int sqlite3_stmt_busy(sqlite3_stmt *pStmt){
  Vdbe *v = (Vdbe*)pStmt;
  return v!=0 && v->pc>0 && v->magic==VDBE_MAGIC_RUN;

Hence, we need to check 2 fields of Vdbe structure: pc (program count) and magic. In decimal form magic for active statements is 3186757027. The subsequent operations are quite simple: iterate over linked list (by using pNext field or by writing some macro to simplify this procedure) and find queries, that have pc > 0 and magic equal to 3186757027:

(lldb) p *db->pVdbe->pNext->pNext->pNext->pNext->pNext->pNext->pNext
zSql = 0x0000000801067c08 "SELECT version FROM packages WHERE origin=?1"
magic = 3186757027
pc = 12

Finally, examine and fix your code to reset that statement prior to calling for DROP TABLE/INDEX.

Thursday, May 3, 2012

AIO в Linux.


Не так давно я реализовывал систему асинхронного io для эффективной работы в Linux. Данный пост является компиляцией моего опыта в данной теме и описывает, в основном, ядерный io (который осуществляется через io_submit). Желающих ознакомиться с данным опытом прошу под кат.

Особенности флагов оптимизации gcc в ubuntu.

При включении оптимизационных флагов gcc в ubuntu (начиная с -O), включаются дополнительные проверки вызовов функций. При компиляции статических приложений (тех, что не включают libc и gcc с целью минимизации генерируемого кода) это может вызвать проблемы, например, такие:

/tmp/cc3z0FbU.o: In function `main':
sgio.c:(.text.startup+0x248): undefined reference to `__printf_chk'
sgio.c:(.text.startup+0x25e): undefined reference to `__printf_chk'

В мануале об этом написано так:
NOTE: In Ubuntu 8.10 and later versions, -D_FORTIFY_SOURCE=2 is set by default, and is activated when -O is set to 2 or higher. This enables
additional compile-time and run-time checks for several libc functions. To disable, specify either -U_FORTIFY_SOURCE or -D_FORTIFY_SOURCE=0.

При следовании этим рекомендациям, проблема исчезает, однако, эта "фича" далеко не так очевидна.

Monday, July 25, 2011

VirtualBox OSE и PXE

Чтобы в VirtualBox OSE работал pxeboot в guest системах, надо загрузить и установить в настройках расширение Oracle VM VirtualBox Extension Pack. Найти можно тут.

Обзор rspamd

Планировал разместить эту статью на хабре, но, к сожалению, по каким-то причинам она не прошла премодерации в "песочнице". Поэтому продублирую ее здесь.