Wednesday, April 29, 2015

How fast is your memchr

Recently, I got a simple task: count the lines in a piece of text. The trick was to do it as quickly as possible. The naive approach is just to loop over the text and count all bytes equal to '\n'. But of course, this approach is not particularly clever.
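In C this is just a byte loop, something like the following sketch (the function name and signature here are mine, not taken from the gist):

#include <stddef.h>

static size_t
count_newlines_naive (const char *buf, size_t len)
{
    size_t i, cnt = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            cnt++;
        }
    }

    return cnt;
}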

The second approach was to use the standard memchr function from the C library. This is the most portable approach, but its performance varies widely between platforms.
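Counting lines with memchr means calling it in a loop and restarting the search just past each match; a sketch of how that can look (again, the names are mine):

#include <stddef.h>
#include <string.h>

static size_t
count_newlines_memchr (const char *buf, size_t len)
{
    size_t cnt = 0;
    const char *p = buf, *end = buf + len;

    /* Let libc find each '\n' and restart just past the match */
    while ((p = memchr (p, '\n', (size_t)(end - p))) != NULL) {
        cnt++;
        p++;
    }

    return cnt;
}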

The third approach was to treat the input as a sequence of 32- or 64-bit integers and XOR each word with the search byte repeated in every byte position. The trick is then to use magic bit masks to detect whether any byte in the word became zero after the XOR. The idea is taken from the GNU C library, where it is used in the memchr function.
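To give the flavor of the trick, here is a minimal sketch of such a counter (my own reconstruction, not glibc's or the gist's actual code; it assumes the buffer is suitably aligned for 64-bit loads):

#include <stddef.h>
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL
/* Non-zero iff at least one byte of v is zero: the glibc/musl bithack */
#define HASZERO(v) (((v) - ONES) & ~(v) & HIGHS)

static size_t
count_newlines_magic (const unsigned char *buf, size_t len)
{
    size_t cnt = 0;
    const uint64_t pat = ONES * (uint8_t)'\n'; /* '\n' in every byte */
    const uint64_t *w = (const uint64_t *)buf; /* assumes aligned input */
    const uint64_t *we = w + len / sizeof (uint64_t);
    const unsigned char *p, *pe = buf + len;

    for (; w < we; w++) {
        uint64_t v = *w ^ pat; /* matching bytes become zero */

        if (HASZERO (v)) {
            /* Rare path: recount this word byte by byte */
            for (p = (const unsigned char *)w;
                 p < (const unsigned char *)(w + 1); p++) {
                if (*p == '\n') {
                    cnt++;
                }
            }
        }
    }

    /* Tail shorter than one word */
    for (p = (const unsigned char *)we; p < pe; p++) {
        if (*p == '\n') {
            cnt++;
        }
    }

    return cnt;
}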

The code that evaluates the performance of each method is available here:
https://gist.github.com/vstakhov/99c04d6aa18e47c7950d

Currently, it uses a Linux-specific timer to measure time intervals, so if you want to test it on another OS, you will need to substitute an appropriate timer.
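For instance, on any POSIX system a monotonic clock should do; here is a sketch of a drop-in get_ticks, assuming the harness only needs seconds as a double (the gist itself uses a Linux-specific clock id):

#include <time.h>

static double
get_ticks (void)
{
    struct timespec ts;

    clock_gettime (CLOCK_MONOTONIC, &ts);

    return ts.tv_sec + ts.tv_nsec / 1e9;
}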

Here are some results (in each line, the first number is the count of newlines found and the second is the elapsed time in seconds):

Linux/amd64 (gcc 4.8 -O3)

Naive: 5583 0.001655
Stupid xor: 5583 0.001553
Memchr: 5583 0.000229
Magic bits: 5583 0.000352

Here we can see that memchr in glibc is blazingly fast.

FreeBSD/amd64 (clang 3.4.1 -O3)

Naive: 5589 0.001343
Stupid xor: 5589 0.001431
Memchr: 5589 0.001323
Magic bits: 5589 0.000444

In FreeBSD, memchr is almost as slow as the naive implementation.

Solaris/amd64 (gcc 4.4 -O2), a different hardware platform and a different test input

Naive: 9013 0.872925
Stupid xor: 9013 0.812501
Memchr: 9013 0.889934
Magic bits: 9013 0.955014

The same machine and test, built with -m64 -O2:

Naive: 9013 1.138129
Stupid xor: 9013 0.982253
Memchr: 9013 0.741068
Magic bits: 9013 0.349103


As we can see, on Solaris memchr is reasonably fast, but in 64-bit mode it is almost twice as slow as the bithack approach.

UPDATE: I've added some more cases to the evaluation, namely SSE2 and AVX2 versions written with compiler intrinsics (a sketch of the general scheme appears after the results below). Finally, I've evaluated the performance of all the algorithms on a larger file using my Mac laptop with a Haswell CPU:

-m64 -mavx2 -O2

Naive: 4471200 0.179973
Stupid xor: 4471200 0.186705
Memchr: 4471200 0.083217
Memchr (musl): 4471200 0.164340
Magic bits: 4471200 0.122103
SSE: 4471200 0.073190
AVX2: 4471200 0.06790

So the OS X libc is fast enough to beat the bithack version, so I presume it uses SSE for speed. In 32-bit mode, however, it sucks (and the naive version even comes out faster than the libc one, because clang is smart enough to optimize it using SSE instructions):

-m32 -mavx2 -O2

Naive: 4471200 0.123135
Stupid xor: 4471200 0.109367
Memchr: 4471200 0.313270
Memchr (musl): 4471200 0.336193
Magic bits: 4471200 0.223416
SSE: 4471200 0.071051
AVX2: 4471200 0.070952

The performance of the AVX2 and SSE2 versions is almost the same in this case. I've also found that a version using vmovntdqa loads was slightly slower than one using vpmaskmovd (see the sketch below for where those loads fit in).
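For reference, here is a minimal sketch of how such an AVX2 counter can look (my own reconstruction, not the code from the gist; it assumes a 32-byte-aligned buffer, and the SSE2 variant is the same scheme with 16-byte vectors):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static size_t
count_newlines_avx2 (const unsigned char *buf, size_t len)
{
    size_t i = 0, cnt = 0;
    const __m256i nl = _mm256_set1_epi8 ('\n');
    const __m256i zero = _mm256_setzero_si256 ();
    __m256i acc64 = zero;
    uint64_t tmp[4];

    while (i + 32 <= len) {
        __m256i acc8 = zero;
        size_t blocks = (len - i) / 32;

        /* Byte lanes wrap at 256 matches, so flush them at least
         * every 255 blocks via VPSADBW */
        if (blocks > 255) {
            blocks = 255;
        }

        while (blocks-- > 0) {
            /* Plain aligned load; _mm256_stream_load_si256 (vmovntdqa)
             * or _mm256_maskload_epi32 (vpmaskmovd) could be used here
             * instead, which is the difference mentioned above */
            __m256i v = _mm256_load_si256 ((const __m256i *)(buf + i));
            __m256i eq = _mm256_cmpeq_epi8 (v, nl); /* 0xFF on match */

            acc8 = _mm256_sub_epi8 (acc8, eq); /* subtracting -1 adds 1 */
            i += 32;
        }

        acc64 = _mm256_add_epi64 (acc64, _mm256_sad_epu8 (acc8, zero));
    }

    _mm256_storeu_si256 ((__m256i *)tmp, acc64);
    cnt = tmp[0] + tmp[1] + tmp[2] + tmp[3];

    /* Scalar tail for the last bytes */
    for (; i < len; i++) {
        if (buf[i] == '\n') {
            cnt++;
        }
    }

    return cnt;
}

The key point is flushing the byte accumulators through the SAD instruction before any lane can overflow; the SSE2 code in the comments below does the same with saturating adds over blocks of 120 words.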

3 comments:

  1. cebka,

    The following SSE2 code (to be inserted into your main function) was, in my experiments, significantly faster than the other versions. Code for AVX2 could be written using a similar method.

    #ifdef __SSE__
    t1 = get_ticks ();
    {
        __m128i xmm0, xmm1, xmm2;
        p = buf;
        ip = (uint64_t *)p;
        xmm2 = _mm_set1_epi8 ('\n');
        __m128i xmm3, xmm4, xmm5, xmm6, xmm7;
        xmm3 = _mm_set1_epi8 (1);
        xmm6 = _mm_setzero_si128 ();
        xmm7 = xmm6;
        /* First part. */
        if (r >= 960) {
            uint64_t *ipe1 = ip + ((r - (size_t)952) >> 3);
            do {
                __m128i xmm10, xmm11, xmm14, xmm15;
                xmm4 = xmm7;
                xmm14 = xmm7;
                uint64_t *ipe2 = ip + 120;
                do {
                    xmm0 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
                    xmm10 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
                    xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);
                    xmm11 = _mm_cmpeq_epi8 (xmm10, xmm2);
                    xmm5 = _mm_min_epu8 (xmm1, xmm3);
                    xmm15 = _mm_min_epu8 (xmm11, xmm3);
                    xmm4 = _mm_adds_epu8 (xmm4, xmm5);
                    xmm14 = _mm_adds_epu8 (xmm14, xmm15);
                }
                while (ip>4)<<4));
                for (; ip < ipe2; ip += 2) {
                    xmm0 = _mm_load_si128 ((const __m128i *)ip);
                    xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);
                    xmm5 = _mm_min_epu8 (xmm1, xmm3);
                    xmm4 = _mm_adds_epu8 (xmm4, xmm5);
                }
                xmm1 = _mm_sad_epu8 (xmm4, xmm7);
                xmm6 = _mm_add_epi64 (xmm6, xmm1);
                cnt = (size_t)_mm_cvtsi128_si64 (xmm6);
                cnt += (size_t)_mm_cvtsi128_si64 (
                    _mm_castpd_si128 (
                        _mm_shuffle_pd (_mm_castsi128_pd (xmm6),
                                        _mm_castsi128_pd (xmm6), 1)));
            }
            while (false);
            /* Last part. */
            do {
                unsigned char *pe = p + r;
                p = (unsigned char *)(ip);
                for (; p < pe; ++p)
                    if (*p == '\n')
                        ++cnt;
            }
            while (false);
        }
        t2 = get_ticks ();
        printf ("SSE2: %zd %.6f\n", cnt, t2 - t1);
    #endif

    Replies
    1. Sorry, an error crept into the code. Here is the correct version:

      #ifdef __SSE__
      t1 = get_ticks ();
      {
          __m128i xmm0, xmm1, xmm2;
          p = buf;
          ip = (uint64_t *)p;
          xmm2 = _mm_set1_epi8 ('\n');
          __m128i xmm3, xmm4, xmm5, xmm6, xmm7;
          xmm3 = _mm_set1_epi8 (1);          /* clamp for matched bytes */
          xmm6 = _mm_setzero_si128 ();       /* 64-bit accumulators */
          xmm7 = xmm6;
          /* First part: blocks of 120 words; the saturating byte counters
           * are flushed into xmm6 via PSADBW after each block. */
          if (r >= 960) {
              uint64_t *ipe1 = ip + ((r - (size_t)952) >> 3);
              do {
                  __m128i xmm10, xmm11, xmm14, xmm15;
                  xmm4 = xmm7;
                  xmm14 = xmm7;
                  uint64_t *ipe2 = ip + 120;
                  do {
                      /* Two interleaved 16-byte compares per iteration */
                      xmm0 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
                      xmm10 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
                      xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);
                      xmm11 = _mm_cmpeq_epi8 (xmm10, xmm2);
                      xmm5 = _mm_min_epu8 (xmm1, xmm3);   /* 0xFF -> 1 */
                      xmm15 = _mm_min_epu8 (xmm11, xmm3);
                      xmm4 = _mm_adds_epu8 (xmm4, xmm5);
                      xmm14 = _mm_adds_epu8 (xmm14, xmm15);
                  } while (ip < ipe2);
                  xmm1 = _mm_sad_epu8 (xmm4, xmm7);
                  xmm11 = _mm_sad_epu8 (xmm14, xmm7);
                  xmm6 = _mm_add_epi64 (xmm6, xmm1);
                  xmm6 = _mm_add_epi64 (xmm6, xmm11);
              } while (ip < ipe1);
          }
          /* Second part: remaining whole 16-byte blocks. */
          do {
              xmm4 = xmm7;
              uint64_t *ipe2 = (uint64_t *)(p + ((r >> 4) << 4));
              for (; ip < ipe2; ip += 2) {
                  xmm0 = _mm_load_si128 ((const __m128i *)ip);
                  xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);
                  xmm5 = _mm_min_epu8 (xmm1, xmm3);
                  xmm4 = _mm_adds_epu8 (xmm4, xmm5);
              }
              xmm1 = _mm_sad_epu8 (xmm4, xmm7);
              xmm6 = _mm_add_epi64 (xmm6, xmm1);
              /* Sum the two 64-bit halves of xmm6 */
              cnt = (size_t)_mm_cvtsi128_si64 (xmm6);
              cnt += (size_t)_mm_cvtsi128_si64 (
                  _mm_castpd_si128 (
                      _mm_shuffle_pd (_mm_castsi128_pd (xmm6),
                                      _mm_castsi128_pd (xmm6), 1)));
          } while (false);
          /* Last part: scalar tail. */
          do {
              unsigned char *pe = p + r;
              p = (unsigned char *)(ip);
              for (; p < pe; ++p)
                  if (*p == '\n')
                      ++cnt;
          } while (false);
      }
      t2 = get_ticks ();
      printf ("SSE2: %zd %.6f\n", cnt, t2 - t1);
      #endif

  2. Well, as far as I can see, you have interleaved the inner loop. I had planned to do that for both the SSE and AVX2 versions but was too lazy to get around to it. Here are the updated results:

    Naive: 32852 0.004941
    Stupid xor: 32852 0.004136
    Memchr: 32852 0.000610
    Memchr (musl): 32852 0.002223
    Magic bits: 32852 0.001735
    SSE: 32852 0.000825
    SSE2: 32852 0.000768
    AVX2: 32852 0.000458
