Recently, I got a simple task: count the lines in a chunk of text, and the trick was to do it as quickly as possible. The naive approach is just to loop over the text and count every character equal to '\n'. But of course this approach is not very clever.
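For reference, here is a minimal sketch of what I mean by the naive approach (the function name and signature are mine, not the ones from the gist):

#include <stddef.h>

static size_t
count_lines_naive (const char *buf, size_t len)
{
    size_t cnt = 0, i;

    for (i = 0; i < len; i ++) {
        if (buf[i] == '\n') {
            cnt ++;
        }
    }

    return cnt;
}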
The second approach was to use the standard memchr function from the C library. This is the most portable option, but its performance varies considerably across platforms.
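The memchr variant simply hops from one match to the next; a sketch (again with my own naming):

#include <stddef.h>
#include <string.h>

static size_t
count_lines_memchr (const char *buf, size_t len)
{
    size_t cnt = 0;
    const char *p = buf, *end = buf + len;

    /* memchr finds the next '\n'; step over it and continue */
    while (p < end && (p = memchr (p, '\n', end - p)) != NULL) {
        cnt ++;
        p ++;
    }

    return cnt;
}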
The third approach was to treat the input as a sequence of 32- or 64-bit integers and XOR each word with a repeated '\n' pattern. The trick is then to use a "magic bits" expression to detect whether any byte in the resulting word is zero. The idea was taken from the GNU C library, where it is used in the memchr function.
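A sketch of that trick follows. One caveat worth knowing: the expression (v - 0x01...01) & ~v & 0x80...80 is exact only as a whole-word "is there a zero byte?" test (borrow propagation can set spurious high bits), so this sketch merely uses it to skip clean words and counts bytewise inside words that do contain a match; the gist's version may organize the counting differently:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t
count_lines_magic (const char *buf, size_t len)
{
    const uint64_t pat = 0x0A0A0A0A0A0A0A0AULL;   /* '\n' repeated 8 times */
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    size_t cnt = 0, i, j;

    for (i = 0; i + sizeof (uint64_t) <= len; i += sizeof (uint64_t)) {
        uint64_t v;

        memcpy (&v, buf + i, sizeof (v));  /* sidestep alignment issues */
        v ^= pat;                          /* matching bytes become zero */

        if (((v - ones) & ~v & highs) != 0) {
            /* At least one newline in this word: count the bytes */
            for (j = 0; j < sizeof (uint64_t); j ++) {
                if (buf[i + j] == '\n') {
                    cnt ++;
                }
            }
        }
    }
    for (; i < len; i ++) {      /* tail bytes */
        if (buf[i] == '\n') {
            cnt ++;
        }
    }

    return cnt;
}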
The code that evaluates the performance of each method can be found here:
https://gist.github.com/vstakhov/99c04d6aa18e47c7950d
Currently, it uses a Linux-specific timer to measure time intervals, so if you want to test it on another OS you will need to replace that timer appropriately.
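For example, a portable replacement based on POSIX clock_gettime could look like this (I'm assuming here that the timer function returns seconds as a double; check the gist for the exact interface it expects):

#include <time.h>

static double
get_ticks (void)
{
    struct timespec ts;

    clock_gettime (CLOCK_MONOTONIC, &ts);

    return ts.tv_sec + ts.tv_nsec / 1e9;
}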
Here are some results (each line shows the number of newlines found and the elapsed time):
Linux/amd64 (gcc 4.8 -O3)
Naive: 5583 0.001655
Stupid xor: 5583 0.001553
Memchr: 5583 0.000229
Magic bits: 5583 0.000352
Here we can see that memchr in glibc is blazingly fast.
FreeBSD/amd64 (clang 3.4.1 -O3)
Naive: 5589 0.001343
Stupid xor: 5589 0.001431
Memchr: 5589 0.001323
Magic bits: 5589 0.000444
In FreeBSD, memchr is almost as slow as the naive implementation.
Solaris/amd64 (gcc 4.4 -O2) - note that this is different hardware and a different test input
Naive: 9013 0.872925
Stupid xor: 9013 0.812501
Memchr: 9013 0.889934
Magic bits: 9013 0.955014
-m64 -O2
Naive: 9013 1.138129
Stupid xor: 9013 0.982253
Memchr: 9013 0.741068
Magic bits: 9013 0.349103
As we can see, on Solaris memchr is reasonably fast, but in 64-bit mode it is almost twice as slow as the bithack approach.
UPDATE: I've added some more cases to the evaluation, namely SSE2 and AVX2 versions written with compiler intrinsics. I've also evaluated the performance of the algorithms on a larger file, using my Mac laptop with a Haswell CPU:
-m64 -mavx2 -O2
Naive: 4471200 0.179973
Stupid xor: 4471200 0.186705
Memchr: 4471200 0.083217
Memchr (musl): 4471200 0.164340
Magic bits: 4471200 0.122103
SSE: 4471200 0.073190
AVX2: 4471200 0.06790
So OS X libc is fast enough to beat the bithack version, and I presume it uses SSE internally. In 32-bit mode, however, it sucks (and the naive version comes out even faster than the libc one, because clang is smart enough to vectorize it with SSE instructions):
-m32 -mavx2 -O2
Naive: 4471200 0.123135
Stupid xor: 4471200 0.109367
Memchr: 4471200 0.313270
Memchr (musl): 4471200 0.336193
Magic bits: 4471200 0.223416
SSE: 4471200 0.071051
AVX2: 4471200 0.070952
The performance of the AVX2 and SSE2 versions is almost the same in this case. I've also found that the version using vmovntdqa was slightly slower than the one using vpmaskmovd.
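To give an idea of the general shape of such code, here is a hedged AVX2 sketch. It uses the straightforward movemask + popcount scheme rather than the exact accumulation scheme from my gist, so treat it as an illustration, not the code that produced the numbers above:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Compile with -mavx2; __builtin_popcount is a gcc/clang builtin */
static size_t
count_lines_avx2 (const char *buf, size_t len)
{
    const __m256i nl = _mm256_set1_epi8 ('\n');
    size_t cnt = 0, i;

    for (i = 0; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256 ((const __m256i *)(buf + i));
        __m256i eq = _mm256_cmpeq_epi8 (v, nl);  /* 0xff in matching lanes */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8 (eq);

        cnt += (size_t)__builtin_popcount (mask);
    }
    for (; i < len; i ++) {      /* tail bytes */
        if (buf[i] == '\n') {
            cnt ++;
        }
    }

    return cnt;
}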
cebka,
The following SSE2 code (to be inserted into your main function) was, in my experiments, significantly faster than the others. Code for AVX2 could be written using a similar method:
#ifdef __SSE__
	t1 = get_ticks ();
	{
		__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;

		p = buf;                     /* assumed 16-byte aligned */
		ip = (uint64_t *)p;
		xmm2 = _mm_set1_epi8 ('\n'); /* pattern to match */
		xmm3 = _mm_set1_epi8 (1);    /* used to clamp 0xff matches to 1 */
		xmm6 = _mm_setzero_si128 (); /* 64-bit grand totals */
		xmm7 = xmm6;                 /* constant zero */

		/* First part: whole 960-byte blocks with an interleaved inner
		 * loop. Per-byte counters are flushed into the 64-bit totals
		 * after each block (at most 30 increments per lane, so the
		 * saturating adds never actually saturate). */
		if (r >= 960) {
			uint64_t *ipe1 = ip + ((r - (size_t)952) >> 3);

			do {
				__m128i xmm10, xmm11, xmm14, xmm15;
				uint64_t *ipe2 = ip + 120;

				xmm4 = xmm7;
				xmm14 = xmm7;
				do {
					/* Two independent dependency chains */
					xmm0 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
					xmm10 = _mm_load_si128 ((const __m128i *)ip); ip += 2;
					xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);   /* 0xff where '\n' */
					xmm11 = _mm_cmpeq_epi8 (xmm10, xmm2);
					xmm5 = _mm_min_epu8 (xmm1, xmm3);     /* 0xff -> 1 */
					xmm15 = _mm_min_epu8 (xmm11, xmm3);
					xmm4 = _mm_adds_epu8 (xmm4, xmm5);    /* per-byte counters */
					xmm14 = _mm_adds_epu8 (xmm14, xmm15);
				} while (ip < ipe2);

				/* Horizontal sums of the byte counters into 64-bit lanes */
				xmm1 = _mm_sad_epu8 (xmm4, xmm7);
				xmm11 = _mm_sad_epu8 (xmm14, xmm7);
				xmm6 = _mm_add_epi64 (xmm6, xmm1);
				xmm6 = _mm_add_epi64 (xmm6, xmm11);
			} while (ip < ipe1);
		}

		/* Second part: remaining whole 16-byte vectors */
		{
			uint64_t *ipe2 = (uint64_t *)(p + ((r >> 4) << 4));

			xmm4 = xmm7;
			for (; ip < ipe2; ip += 2) {
				xmm0 = _mm_load_si128 ((const __m128i *)ip);
				xmm1 = _mm_cmpeq_epi8 (xmm0, xmm2);
				xmm5 = _mm_min_epu8 (xmm1, xmm3);
				xmm4 = _mm_adds_epu8 (xmm4, xmm5);
			}
			xmm1 = _mm_sad_epu8 (xmm4, xmm7);
			xmm6 = _mm_add_epi64 (xmm6, xmm1);

			/* Extract and add both 64-bit lanes of the total */
			cnt = (size_t)_mm_cvtsi128_si64 (xmm6);
			cnt += (size_t)_mm_cvtsi128_si64 (
				_mm_castpd_si128 (
					_mm_shuffle_pd (_mm_castsi128_pd (xmm6),
						_mm_castsi128_pd (xmm6), 1)));
		}

		/* Last part: tail bytes, counted one by one */
		{
			unsigned char *pe = p + r;

			p = (unsigned char *)ip;
			for (; p < pe; ++p) {
				if (*p == '\n') {
					++cnt;
				}
			}
		}
	}
	t2 = get_ticks ();
	printf ("SSE2: %zd %.6f\n", cnt, t2 - t1);
#endif
Well, as far as I can see, you have interleaved the inner loop. I planned to do that for both the SSE and AVX2 versions but was actually too lazy. Here are the updated results:
Naive: 32852 0.004941
Stupid xor: 32852 0.004136
Memchr: 32852 0.000610
Memchr (musl): 32852 0.002223
Magic bits: 32852 0.001735
SSE: 32852 0.000825
SSE2: 32852 0.000768
AVX2: 32852 0.000458