Move away from integral casts in xorbuf #1020

Closed
noloader opened this issue Mar 17, 2021 · 0 comments
Comments

@noloader
Collaborator

noloader commented Mar 17, 2021

xorbuf in misc.cpp has code to XOR arbitrary buffers. For example:

void xorbuf(byte *output, const byte *input, const byte *mask, size_t count)
{
    size_t i=0;
    if (IsAligned<word32>(output) && IsAligned<word32>(input) && IsAligned<word32>(mask))
    {
        if (IsAligned<word64>(output) && IsAligned<word64>(input) && IsAligned<word64>(mask))
        {
            for (i=0; i<count/8; i++)
                ((word64*)(void*)output)[i] = ((word64*)(void*)input)[i] ^ ((word64*)(void*)mask)[i];
            count -= 8*i;
            if (!count)
                return;
            output += 8*i;
            input += 8*i;
            mask += 8*i;
        }

        for (i=0; i<count/4; i++)
            ((word32*)(void*)output)[i] = ((word32*)(void*)input)[i] ^ ((word32*)(void*)mask)[i];
        count -= 4*i;
        if (!count)
            return;
        output += 4*i;
        input += 4*i;
        mask += 4*i;
    }

    for (i=0; i<count; i++)
        output[i] = input[i] ^ mask[i];
}

The casting is kind of shady nowadays. It violates strict aliasing rules, so it could draw the ire of the compiler and earn us a demerit.

In a simpler form, this is closer to what we should be doing. It avoids the casts and honors alignment. Compilers nowadays know when they can elide the memcpy and simply perform the XOR.

void xorbuf(byte *output, const byte *input, const byte *mask, size_t count)
{
    while (count >= 4)
    {
        word32 b, m, r;
        memcpy(&b, input, 4); memcpy(&m, mask, 4);

        r = b ^ m;
        memcpy(output, &r, 4);

        output += 4; input += 4; mask += 4; count -= 4;
    }

    for (size_t i=0; i<count; i++)
        output[i] = input[i] ^ mask[i];
}
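
For anyone who wants to kick the tires, here is a minimal harness. It is a sketch, not library code: byte and word32 are re-declared locally, and the memcpy-based xorbuf above is assumed to be pasted into the same file. It checks the result against a plain byte-at-a-time XOR on an odd-sized buffer so both the 4-byte loop and the tail loop run:

#include <cstdio>
#include <cstring>

typedef unsigned char byte;     // local stand-ins for the library typedefs
typedef unsigned int word32;

// ... paste the memcpy-based xorbuf from above here ...

int main()
{
    byte in[19], mask[19], out[19];
    for (size_t i = 0; i < 19; i++)
        { in[i] = (byte)i; mask[i] = (byte)(0xA5 ^ i); }

    xorbuf(out, in, mask, 19);   // 19 bytes = 4 full words + 3 tail bytes

    // verify against the naive byte loop
    for (size_t i = 0; i < 19; i++)
        if (out[i] != (byte)(in[i] ^ mask[i]))
            { printf("mismatch at %u\n", (unsigned)i); return 1; }

    printf("ok\n");
    return 0;
}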

In fact, we can add some architectural speed-ups to make things run even faster without the casting. The code below runs 0.1 cpb to 0.4 cpb faster on x86_64. An x86_64 machine will enable the __SSE2__ code path without extra arch options, since SSE2 is part of the x86_64 baseline.

void xorbuf(byte *buf, const byte *mask, size_t count)
{
#if defined(__AVX__)
    while (count >= 32)
    {
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(buf));
        __m256i m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(buf), _mm256_castps_si256(
            _mm256_xor_ps(_mm256_castsi256_ps(b), _mm256_castsi256_ps(m))));
        buf += 32; mask += 32; count -= 32;
    }
#endif
#if defined(__SSE2__)
    while (count >= 16)
    {
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));
        __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mask));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), _mm_castps_si128(
            _mm_xor_ps(_mm_castsi128_ps(b), _mm_castsi128_ps(m))));
        buf += 16; mask += 16; count -= 16;
    }

    if (count == 0) return;
#endif

    while (count >= 4)
    {
        word32 r, b, m;
        memcpy(&b, buf, 4); memcpy(&m, mask, 4);

        r = b ^ m;
        memcpy(buf, &r, 4);

        buf += 4; mask += 4; count -= 4;
    }

    for (size_t i=0; i<count; i++)
        buf[i] ^= mask[i];
}
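
A quick way to sanity-check the SIMD paths is to drive the function across every length that straddles the 32-, 16- and 4-byte block boundaries and compare against the plain byte loop. Again a sketch, not library code; it assumes the in-place xorbuf above is pasted into the same file and compiled with something like g++ -O2 (add -mavx to light up the AVX path):

#include <immintrin.h>   // intrinsics used by the SSE2/AVX paths above
#include <cstdio>
#include <cstring>

typedef unsigned char byte;
typedef unsigned int word32;

// ... paste the in-place xorbuf(byte*, const byte*, size_t) from above here ...

int main()
{
    byte buf[96], mask[96], expect[96];

    // every length from 0 to 96 exercises the 32/16/4/1-byte code paths
    for (size_t len = 0; len <= 96; len++)
    {
        for (size_t i = 0; i < len; i++)
        {
            buf[i] = (byte)(3*i+1); mask[i] = (byte)(5*i+7);
            expect[i] = buf[i] ^ mask[i];   // reference result
        }

        xorbuf(buf, mask, len);

        if (memcmp(buf, expect, len) != 0)
            { printf("mismatch at len %u\n", (unsigned)len); return 1; }
    }

    printf("all lengths ok\n");
    return 0;
}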

This bug report will track the cut-over.

noloader added a commit to noloader/cryptopp that referenced this issue Mar 17, 2021
noloader added a commit that referenced this issue Mar 17, 2021
We think this is another instance of the problem that surfaced under GH #683 when inString==outString. It violates aliasing rules and the compiler begins removing code.

The ultimate workaround was to add a member variable m_tempOutString as scratch space when inString==outString. We did not lose much in the way of performance for some reason. It looks like AES/CTR lost about 0.03-0.05 cpb.

When combined with the updated xorbuf from GH #1020, the net result was a speedup of 0.1-0.6 cpb. In fact, some ciphers, like RC6, gained almost 5 cpb.
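
For readers following along, the shape of that workaround is roughly the following. This is a simplified sketch, not the actual Crypto++ code; m_tempOutString is the real member name, but the class and everything else here are illustrative:

#include <cstring>
#include <vector>

typedef unsigned char byte;

// Illustrative only: when the caller passes the same pointer for input and
// output, writing the result directly can run afoul of the aliasing
// assumptions the optimizer makes. Staging the result in a member buffer
// (m_tempOutString in the real code) sidesteps the problem.
class CipherSketch
{
public:
    void ProcessData(byte *outString, const byte *inString, size_t length)
    {
        if (length == 0) return;

        if (inString == outString)
        {
            m_tempOutString.resize(length);
            ProcessInternal(&m_tempOutString[0], inString, length);
            memcpy(outString, &m_tempOutString[0], length);
        }
        else
        {
            ProcessInternal(outString, inString, length);
        }
    }

private:
    void ProcessInternal(byte *out, const byte *in, size_t length)
    {
        // stand-in for the real keystream XOR
        for (size_t i = 0; i < length; i++)
            out[i] = in[i] ^ 0xAA;
    }

    std::vector<byte> m_tempOutString;
};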
EAddario pushed a commit to EAddario/cryptopp that referenced this issue Apr 10, 2021