Wednesday, April 29, 2015

How fast is your memchr

Recently, I got a simple task: count the lines in some text. The trick was to do it as quickly as possible. The naive approach is just to loop over the text and count every byte equal to '\n', but of course this is not very clever.

The second approach was to use the standard memchr function from the C library. This is the most portable option, but, as we will see, its performance varies considerably between platforms.

The third approach was to treat the input as a sequence of 32- or 64-bit integers and XOR each word with the target byte repeated in every position, so that matching bytes become zero. The trick is then to use special magic bits to detect whether any byte of the word is zero after the XOR. The idea comes from the GNU C library, where it is used in the memchr implementation.

The code that evaluates the performance of each method is available here:

Currently, it uses a Linux-specific timer to measure intervals, so if you want to test it on another OS you will need to substitute an appropriate timer.

Here are some results:

Linux/amd64 (gcc 4.8 -O3)

Naive: 5583 0.001655
Stupid xor: 5583 0.001553
Memchr: 5583 0.000229
Magic bits: 5583 0.000352

Here we can see that the memchr in glibc is blazingly fast.

FreeBSD/amd64 (clang 3.4.1 -O3)

Naive: 5589 0.001343
Stupid xor: 5589 0.001431
Memchr: 5589 0.001323
Magic bits: 5589 0.000444

In FreeBSD, memchr is almost as slow as the naive implementation.

Solaris/amd64 (gcc 4.4 -O2) - note that this is different hardware and a different test file

Naive: 9013 0.872925
Stupid xor: 9013 0.812501
Memchr: 9013 0.889934
Magic bits: 9013 0.955014

-m64 -O2

Naive: 9013 1.138129
Stupid xor: 9013 0.982253
Memchr: 9013 0.741068
Magic bits: 9013 0.349103

As we can see, on Solaris memchr is reasonably fast, but in 64-bit mode it is still almost twice as slow as the bit-hack approach.

UPDATE: I've added some more cases to the evaluation, namely SSE2 and AVX2 versions using compiler intrinsics. Finally, I've evaluated the performance of the algorithms on a larger file on my Mac laptop with a Haswell CPU:

-m64 -mavx2 -O2

Naive: 4471200 0.179973
Stupid xor: 4471200 0.186705
Memchr: 4471200 0.083217
Memchr (musl): 4471200 0.164340
Magic bits: 4471200 0.122103
SSE: 4471200 0.073190
AVX2: 4471200 0.06790

So the OS X libc is fast enough to beat the bit-hack version; I presume it uses SSE internally for speed. In 32-bit mode, however, it performs poorly (and the naive version comes out even faster than the libc one, because clang is smart enough to auto-vectorize it with SSE instructions):

-m32 -mavx2 -O2

Naive: 4471200 0.123135
Stupid xor: 4471200 0.109367
Memchr: 4471200 0.313270
Memchr (musl): 4471200 0.336193
Magic bits: 4471200 0.223416
SSE: 4471200 0.071051
AVX2: 4471200 0.070952

The performance of the AVX2 and SSE2 versions is almost the same in this case. I also found that a version using vmovntdqa was slightly slower than one using vpmaskmovd.