Move away from integral casts in xorbuf #1020

Closed
noloader opened this issue Mar 17, 2021 · 0 comments
Comments

@noloader
Collaborator

noloader commented Mar 17, 2021

xorbuf in misc.cpp has code to XOR arbitrary buffers. For example:

void xorbuf(byte *output, const byte *input, const byte *mask, size_t count)
{
    size_t i=0;
    if (IsAligned<word32>(output) && IsAligned<word32>(input) && IsAligned<word32>(mask))
    {
        if (IsAligned<word64>(output) && IsAligned<word64>(input) && IsAligned<word64>(mask))
        {
            for (i=0; i<count/8; i++)
                ((word64*)(void*)output)[i] = ((word64*)(void*)input)[i] ^ ((word64*)(void*)mask)[i];
            count -= 8*i;
            if (!count)
                return;
            output += 8*i;
            input += 8*i;
            mask += 8*i;
        }

        for (i=0; i<count/4; i++)
            ((word32*)(void*)output)[i] = ((word32*)(void*)input)[i] ^ ((word32*)(void*)mask)[i];
        count -= 4*i;
        if (!count)
            return;
        output += 4*i;
        input += 4*i;
        mask += 4*i;
    }

    for (i=0; i<count; i++)
        output[i] = input[i] ^ mask[i];
}

The casting is kind of shady nowadays. It violates strict aliasing rules, so it could draw the ire of the compiler and earn us a demerit.

In a simpler form, this is closer to what we should be doing. It avoids the casts and honors alignment. Compilers nowadays know when they can elide the memcpy and simply perform the XOR.

void xorbuf(byte *output, const byte *input, const byte *mask, size_t count)
{
    while (count >= 4)
    {
        word32 b, m, r;
        memcpy(&b, input, 4); memcpy(&m, mask, 4);

        r = b ^ m;
        memcpy(output, &r, 4);

        output += 4; input += 4; mask += 4; count -= 4;
    }

    for (size_t i=0; i<count; i++)
        output[i] = input[i] ^ mask[i];
}
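
For anyone who wants to kick the tires, here is a minimal harness. It is a sketch, not library code: byte and word32 are re-declared locally, and the memcpy-based xorbuf above is assumed to be pasted into the same file. It checks the result against a plain byte-at-a-time XOR on an odd-sized buffer so both the 4-byte loop and the tail loop run:

#include <cstdio>
#include <cstring>

typedef unsigned char byte;     // local stand-ins for the library typedefs
typedef unsigned int word32;

// ... paste the memcpy-based xorbuf from above here ...

int main()
{
    byte in[19], mask[19], out[19];
    for (size_t i = 0; i < 19; i++)
        { in[i] = (byte)i; mask[i] = (byte)(0xA5 ^ i); }

    xorbuf(out, in, mask, 19);   // 19 bytes = 4 full words + 3 tail bytes

    // verify against the naive byte loop
    for (size_t i = 0; i < 19; i++)
        if (out[i] != (byte)(in[i] ^ mask[i]))
            { printf("mismatch at %u\n", (unsigned)i); return 1; }

    printf("ok\n");
    return 0;
}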

In fact, we can add some architectural speed-ups to make things run even faster without the casting. The code below runs 0.1 cpb to 0.4 cpb faster on x86_64. An x86_64 machine will enable the __SSE2__ code path without extra arch options, since SSE2 is part of the x86_64 baseline.

void xorbuf(byte *buf, const byte *mask, size_t count)
{
#if defined(__AVX__)
    while (count >= 32)
    {
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(buf));
        __m256i m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(buf), _mm256_castps_si256(
            _mm256_xor_ps(_mm256_castsi256_ps(b), _mm256_castsi256_ps(m))));
        buf += 32; mask += 32; count -= 32;
    }
#endif
#if defined(__SSE2__)
    while (count >= 16)
    {
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));
        __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mask));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), _mm_castps_si128(
            _mm_xor_ps(_mm_castsi128_ps(b), _mm_castsi128_ps(m))));
        buf += 16; mask += 16; count -= 16;
    }

    if (count == 0) return;
#endif

    while (count >= 4)
    {
        word32 r, b, m;
        memcpy(&b, buf, 4); memcpy(&m, mask, 4);

        r = b ^ m;
        memcpy(buf, &r, 4);

        buf += 4; mask += 4; count -= 4;
    }

    for (size_t i=0; i<count; i++)
        buf[i] ^= mask[i];
}
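
A quick way to sanity-check the SIMD paths is to drive the function across every length that straddles the 32-, 16- and 4-byte block boundaries and compare against the plain byte loop. Again a sketch, not library code; it assumes the in-place xorbuf above is pasted into the same file and compiled with something like g++ -O2 (add -mavx to light up the AVX path):

#include <immintrin.h>   // intrinsics used by the SSE2/AVX paths above
#include <cstdio>
#include <cstring>

typedef unsigned char byte;
typedef unsigned int word32;

// ... paste the in-place xorbuf(byte*, const byte*, size_t) from above here ...

int main()
{
    byte buf[96], mask[96], expect[96];

    // every length from 0 to 96 exercises the 32/16/4/1-byte code paths
    for (size_t len = 0; len <= 96; len++)
    {
        for (size_t i = 0; i < len; i++)
        {
            buf[i] = (byte)(3*i+1); mask[i] = (byte)(5*i+7);
            expect[i] = buf[i] ^ mask[i];   // reference result
        }

        xorbuf(buf, mask, len);

        if (memcmp(buf, expect, len) != 0)
            { printf("mismatch at len %u\n", (unsigned)len); return 1; }
    }

    printf("all lengths ok\n");
    return 0;
}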

This bug report will track the cut-over.

noloader added a commit to noloader/cryptopp that referenced this issue Mar 17, 2021
noloader added a commit that referenced this issue Mar 17, 2021
We think this is another instance of the problem that surfaced under GH #683 when inString==outString. It violates aliasing rules and the compiler begins removing code.

The ultimate workaround was to add a member variable m_tempOutString as scratch space when inString==outString. We did not lose much in the way of performance for some reason. It looks like AES/CTR lost about 0.03-0.05 cpb.

When combined with the updated xorbuf from GH #1020, the net result was a speedup of 0.1-0.6 cpb. In fact, some ciphers, like RC6, gained almost 5 cpb.
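
For readers following along, the shape of that workaround is roughly the following. This is a simplified sketch, not the actual Crypto++ code; m_tempOutString is the real member name, but the class and everything else here are illustrative:

#include <cstring>
#include <vector>

typedef unsigned char byte;

// Illustrative only: when the caller passes the same pointer for input and
// output, writing the result directly can run afoul of the aliasing
// assumptions the optimizer makes. Staging the result in a member buffer
// (m_tempOutString in the real code) sidesteps the problem.
class CipherSketch
{
public:
    void ProcessData(byte *outString, const byte *inString, size_t length)
    {
        if (length == 0) return;

        if (inString == outString)
        {
            m_tempOutString.resize(length);
            ProcessInternal(&m_tempOutString[0], inString, length);
            memcpy(outString, &m_tempOutString[0], length);
        }
        else
        {
            ProcessInternal(outString, inString, length);
        }
    }

private:
    void ProcessInternal(byte *out, const byte *in, size_t length)
    {
        // stand-in for the real keystream XOR
        for (size_t i = 0; i < length; i++)
            out[i] = in[i] ^ 0xAA;
    }

    std::vector<byte> m_tempOutString;
};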
EAddario pushed a commit to EAddario/cryptopp that referenced this issue Apr 10, 2021