| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* configure.ac: Added chacha20 and poly1305 assembly implementations.
* cipher/chacha20-p10le-8x.s: (New) - support 8 blocks (512 bytes)
unrolling.
* cipher/poly1305-p10le.s: (New) - support 4 blocks (128 bytes)
unrolling.
* cipher/Makefile.am: Added new chacha20 and poly1305 files.
* cipher/chacha20.c: Added PPC p10 le support for 8x chacha20.
* cipher/poly1305.c: Added PPC p10 le support for 4x poly1305.
* cipher/poly1305-internal.h: Added PPC p10 le support for poly1305.
---
GnuPG-bug-id: 6006
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
[jk: cosmetic changes to C code]
[jk: fix building on ppc64be]
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* LICENSES: Add 3-clause BSD license for poly1305-amd64-avx512.S.
* cipher/Makefile.am: Add 'poly1305-amd64-avx512.S'.
* cipher/poly1305-amd64-avx512.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_AVX512): New.
(poly1305_context_s): Add 'use_avx512'.
* cipher/poly1305.c (ASM_FUNC_ABI, ASM_FUNC_WRAPPER_ATTR): New.
[POLY1305_USE_AVX512] (_gcry_poly1305_amd64_avx512_blocks)
(poly1305_amd64_avx512_blocks): New.
(poly1305_init): Use AVX512 is HW feature available (set use_avx512).
[USE_MPI_64BIT] (poly1305_blocks): Rename to ...
[USE_MPI_64BIT] (poly1305_blocks_generic): ... this.
[USE_MPI_64BIT] (poly1305_blocks): New.
--
Patch adds AMD64 AVX512-FMA52 implementation for Poly1305.
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
POLY1305 | 0.306 ns/B 3117 MiB/s 1.25 c/B 4090
After (5.0x faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
POLY1305 | 0.061 ns/B 15699 MiB/s 0.249 c/B 4095±3
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/asm-poly1305-amd64.h: New.
* cipher/Makefile.am: Add 'asm-poly1305-amd64.h'.
* cipher/chacha20-amd64-avx2.S (QUATERROUND2): Add interleave
operators.
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): New.
* cipher/chacha20-amd64-ssse3.S (QUATERROUND2): Add interleave
operators.
(_gcry_chacha20_poly1305_amd64_ssse3_blocks4)
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1): New.
* cipher/chacha20.c (_gcry_chacha20_poly1305_amd64_ssse3_blocks4)
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1)
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): New prototypes.
(chacha20_encrypt_stream): Split tail to...
(do_chacha20_encrypt_stream_tail): ... new function.
(_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt): New.
* cipher/cipher-internal.h (_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt): New prototypes.
* cipher/cipher-poly1305.c (_gcry_cipher_poly1305_encrypt): Call
'_gcry_chacha20_poly1305_encrypt' if cipher is ChaCha20.
(_gcry_cipher_poly1305_decrypt): Call
'_gcry_chacha20_poly1305_decrypt' if cipher is ChaCha20.
* cipher/poly1305-internal.h (_gcry_cipher_poly1305_update_burn): New
prototype.
* cipher/poly1305.c (poly1305_blocks): Make static.
(_gcry_poly1305_update): Split main function body to ...
(_gcry_poly1305_update_burn): ... new function.
--
Benchmark on Intel Skylake (i5-6500, 3200 Mhz):
Before, 8-way AVX2:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.378 ns/B 2526 MiB/s 1.21 c/B
STREAM dec | 0.373 ns/B 2560 MiB/s 1.19 c/B
POLY1305 enc | 0.685 ns/B 1392 MiB/s 2.19 c/B
POLY1305 dec | 0.686 ns/B 1390 MiB/s 2.20 c/B
POLY1305 auth | 0.315 ns/B 3031 MiB/s 1.01 c/B
After, 8-way AVX2 (~36% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.503 ns/B 1896 MiB/s 1.61 c/B
POLY1305 dec | 0.485 ns/B 1965 MiB/s 1.55 c/B
Benchmark on Intel Haswell (i7-4790K, 3998 Mhz):
Before, 8-way AVX2:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.318 ns/B 2999 MiB/s 1.27 c/B
STREAM dec | 0.317 ns/B 3004 MiB/s 1.27 c/B
POLY1305 enc | 0.586 ns/B 1627 MiB/s 2.34 c/B
POLY1305 dec | 0.586 ns/B 1627 MiB/s 2.34 c/B
POLY1305 auth | 0.271 ns/B 3524 MiB/s 1.08 c/B
After, 8-way AVX2 (~30% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.452 ns/B 2108 MiB/s 1.81 c/B
POLY1305 dec | 0.440 ns/B 2167 MiB/s 1.76 c/B
Before, 4-way SSSE3:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.627 ns/B 1521 MiB/s 2.51 c/B
STREAM dec | 0.626 ns/B 1523 MiB/s 2.50 c/B
POLY1305 enc | 0.895 ns/B 1065 MiB/s 3.58 c/B
POLY1305 dec | 0.896 ns/B 1064 MiB/s 3.58 c/B
POLY1305 auth | 0.271 ns/B 3521 MiB/s 1.08 c/B
After, 4-way SSSE3 (~20% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.733 ns/B 1301 MiB/s 2.93 c/B
POLY1305 dec | 0.726 ns/B 1314 MiB/s 2.90 c/B
Before, 1-way SSSE3:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 1.56 ns/B 609.6 MiB/s 6.25 c/B
POLY1305 dec | 1.56 ns/B 609.4 MiB/s 6.26 c/B
After, 1-way SSSE3 (~18% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 1.31 ns/B 725.4 MiB/s 5.26 c/B
POLY1305 dec | 1.31 ns/B 727.3 MiB/s 5.24 c/B
For comparison to other libraries (on Intel i7-4790K, 3998 Mhz):
bench-slope-openssl: OpenSSL 1.1.1 11 Sep 2018
Cipher:
chacha20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.301 ns/B 3166.4 MiB/s 1.20 c/B
STREAM dec | 0.300 ns/B 3174.7 MiB/s 1.20 c/B
POLY1305 enc | 0.463 ns/B 2060.6 MiB/s 1.85 c/B
POLY1305 dec | 0.462 ns/B 2063.8 MiB/s 1.85 c/B
POLY1305 auth | 0.162 ns/B 5899.3 MiB/s 0.646 c/B
bench-slope-nettle: Nettle 3.4
Cipher:
chacha | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 1.65 ns/B 578.2 MiB/s 6.59 c/B
STREAM dec | 1.65 ns/B 578.2 MiB/s 6.59 c/B
POLY1305 enc | 2.05 ns/B 464.8 MiB/s 8.20 c/B
POLY1305 dec | 2.05 ns/B 464.7 MiB/s 8.20 c/B
POLY1305 auth | 0.404 ns/B 2359.1 MiB/s 1.62 c/B
bench-slope-botan: Botan 2.6.0
Cipher:
ChaCha | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc/dec | 0.855 ns/B 1116.0 MiB/s 3.42 c/B
POLY1305 enc | 1.60 ns/B 595.4 MiB/s 6.40 c/B
POLY1305 dec | 1.60 ns/B 595.8 MiB/s 6.40 c/B
POLY1305 auth | 0.752 ns/B 1268.3 MiB/s 3.01 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Include '../mpi' for 'longlong.h'; Remove
'poly1305-sse2-amd64.S', 'poly1305-avx2-amd64.S' and
'poly1305-armv7-neon.S'.
* cipher/poly1305-armv7-neon.S: Remove.
* cipher/poly1305-avx2-amd64.S: Remove.
* cipher/poly1305-sse2-amd64.S: Remove.
* cipher/poly1305-internal.h (POLY1305_BLOCKSIZE)
(POLY1305_STATE): New.
(POLY1305_SYSV_FUNC_ABI, POLY1305_REF_BLOCKSIZE)
(POLY1305_REF_STATESIZE, POLY1305_REF_ALIGNMENT)
(POLY1305_USE_SSE2, POLY1305_SSE2_BLOCKSIZE, POLY1305_SSE2_STATESIZE)
(POLY1305_SSE2_ALIGNMENT, POLY1305_USE_AVX2, POLY1305_AVX2_BLOCKSIZE)
(POLY1305_AVX2_STATESIZE, POLY1305_AVX2_ALIGNMENT)
(POLY1305_USE_NEON, POLY1305_NEON_BLOCKSIZE, POLY1305_NEON_STATESIZE)
(POLY1305_NEON_ALIGNMENT, POLY1305_LARGEST_BLOCKSIZE)
(POLY1305_LARGEST_STATESIZE, POLY1305_LARGEST_ALIGNMENT)
(POLY1305_STATE_BLOCKSIZE, POLY1305_STATE_STATESIZE)
(POLY1305_STATE_ALIGNMENT, OPS_FUNC_ABI, poly1305_key_s)
(poly1305_ops_s): Remove.
(poly1305_context_s): Rewrite.
* cipher/poly1305.c (_gcry_poly1305_amd64_sse2_init_ext)
(_gcry_poly1305_amd64_sse2_finish_ext)
(_gcry_poly1305_amd64_sse2_blocks, poly1305_amd64_sse2_ops)
(poly1305_init_ext_ref32, poly1305_blocks_ref32)
(poly1305_finish_ext_ref32, poly1305_default_ops)
(_gcry_poly1305_amd64_avx2_init_ext)
(_gcry_poly1305_amd64_avx2_finish_ext)
(_gcry_poly1305_amd64_avx2_blocks)
(poly1305_amd64_avx2_ops, poly1305_get_state): Remove.
(poly1305_init): Rewrite.
(USE_MPI_64BIT, USE_MPI_32BIT): New.
[USE_MPI_64BIT] (ADD_1305_64, MUL_MOD_1305_64, poly1305_blocks)
(poly1305_final): New implementation using 64-bit limbs.
[USE_MPI_32BIT] (UMUL_ADD_32, ADD_1305_32, MUL_MOD_1305_32)
(poly1305_blocks): New implementation using 32-bit limbs.
(_gcry_poly1305_update, _gcry_poly1305_finish)
(_gcry_poly1305_init): Adapt to new implementation.
* configure.ac: Remove 'poly1305-sse2-amd64.lo',
'poly1305-avx2-amd64.lo' and 'poly1305-armv7-neon.lo'.
--
Intel Core i7-4790K CPU @ 4.00GHz (x86_64):
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.284 ns/B 3358.6 MiB/s 1.14 c/B
Intel Core i7-4790K CPU @ 4.00GHz (i386):
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.888 ns/B 1073.9 MiB/s 3.55 c/B
Cortex-A53 @ 1152Mhz (armv7):
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 4.40 ns/B 216.7 MiB/s 5.07 c/B
Cortex-A53 @ 1152Mhz (aarch64):
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 2.60 ns/B 367.0 MiB/s 2.99 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/poly1305-avx2-amd64.S: Enable when
HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS defined.
(ELF): New macro to mask lines with ELF specific commands.
* cipher/poly1305-sse2-amd64.S: Enable when
HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS defined.
(ELF): New macro to mask lines with ELF specific commands.
* cipher/poly1305-internal.h (POLY1305_SYSV_FUNC_ABI): New.
(POLY1305_USE_SSE2, POLY1305_USE_AVX2): Enable when
HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS defined.
(OPS_FUNC_ABI): New.
(poly1305_ops_t): Use OPS_FUNC_ABI.
* cipher/poly1305.c (_gcry_poly1305_amd64_sse2_init_ext)
(_gcry_poly1305_amd64_sse2_finish_ext)
(_gcry_poly1305_amd64_sse2_blocks, _gcry_poly1305_amd64_avx2_init_ext)
(_gcry_poly1305_amd64_avx2_finish_ext)
(_gcry_poly1305_amd64_avx2_blocks, _gcry_poly1305_armv7_neon_init_ext)
(_gcry_poly1305_armv7_neon_finish_ext)
(_gcry_poly1305_armv7_neon_blocks, poly1305_init_ext_ref32)
(poly1305_blocks_ref32, poly1305_finish_ext_ref32)
(poly1305_init_ext_ref8, poly1305_blocks_ref8)
(poly1305_finish_ext_ref8): Use OPS_FUNC_ABI.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Add 'poly1305-armv7-neon.S'.
* cipher/poly1305-armv7-neon.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_NEON)
(POLY1305_NEON_BLOCKSIZE, POLY1305_NEON_STATESIZE)
(POLY1305_NEON_ALIGNMENT): New.
* cipher/poly1305.c [POLY1305_USE_NEON]
(_gcry_poly1305_armv7_neon_init_ext)
(_gcry_poly1305_armv7_neon_finish_ext)
(_gcry_poly1305_armv7_neon_blocks, poly1305_armv7_neon_ops): New.
(_gcry_poly1305_init) [POLY1305_USE_NEON]: Select NEON implementation
if HWF_ARM_NEON set.
* configure.ac [neonsupport=yes]: Add 'poly1305-armv7-neon.lo'.
--
Add Andrew Moon's public domain NEON implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmark on Cortex-A8 (--cpu-mhz 1008):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 12.34 ns/B 77.27 MiB/s 12.44 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 2.12 ns/B 450.7 MiB/s 2.13 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Add 'poly1305-avx2-amd64.S'.
* cipher/poly1305-avx2-amd64.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_AVX2)
(POLY1305_AVX2_BLOCKSIZE, POLY1305_AVX2_STATESIZE)
(POLY1305_AVX2_ALIGNMENT): New.
(POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
(POLY1305_STATE_ALIGNMENT): Use AVX2 versions when needed.
* cipher/poly1305.c [POLY1305_USE_AVX2]
(_gcry_poly1305_amd64_avx2_init_ext)
(_gcry_poly1305_amd64_avx2_finish_ext)
(_gcry_poly1305_amd64_avx2_blocks, poly1305_amd64_avx2_ops): New.
(_gcry_poly1305_init) [POLY1305_USE_AVX2]: Use AVX2 implementation if
AVX2 supported by CPU.
* configure.ac [host=x86_64]: Add 'poly1305-avx2-amd64.lo'.
--
Add Andrew Moon's public domain AVX2 implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmarks on Intel i5-4570 (haswell):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.448 ns/B 2129.5 MiB/s 1.43 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.205 ns/B 4643.5 MiB/s 0.657 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Add 'poly1305-sse2-amd64.S'.
* cipher/poly1305-internal.h (POLY1305_USE_SSE2)
(POLY1305_SSE2_BLOCKSIZE, POLY1305_SSE2_STATESIZE)
(POLY1305_SSE2_ALIGNMENT): New.
(POLY1305_LARGEST_BLOCKSIZE, POLY1305_LARGEST_STATESIZE)
(POLY1305_STATE_ALIGNMENT): Use SSE2 versions when needed.
* cipher/poly1305-sse2-amd64.S: New.
* cipher/poly1305.c [POLY1305_USE_SSE2]
(_gcry_poly1305_amd64_sse2_init_ext)
(_gcry_poly1305_amd64_sse2_finish_ext)
(_gcry_poly1305_amd64_sse2_blocks, poly1305_amd64_sse2_ops): New.
(_gcry_polu1305_init) [POLY1305_USE_SSE2]: Use SSE2 version.
* configure.ac [host=x86_64]: Add 'poly1305-sse2-amd64.lo'.
--
Add Andrew Moon's public domain SSE2 implementation of Poly1305. Original
source is available at: https://github.com/floodyberry/poly1305-opt
Benchmarks on Intel i5-4570 (haswell):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.844 ns/B 1130.2 MiB/s 2.70 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.448 ns/B 2129.5 MiB/s 1.43 c/B
Benchmarks on Intel i5-2450M (sandy-bridge):
Old:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 1.25 ns/B 763.0 MiB/s 3.12 c/B
New:
| nanosecs/byte mebibytes/sec cycles/byte
POLY1305 | 0.605 ns/B 1575.9 MiB/s 1.51 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
* cipher/Makefile.am: Add 'mac-poly1305.c', 'poly1305.c' and
'poly1305-internal.h'.
* cipher/mac-internal.h (poly1305mac_context_s): New.
(gcry_mac_handle): Add 'u.poly1305mac'.
(_gcry_mac_type_spec_poly1305mac): New.
* cipher/mac-poly1305.c: New.
* cipher/mac.c (mac_list): Add Poly1305.
* cipher/poly1305-internal.h: New.
* cipher/poly1305.c: New.
* src/gcrypt.h.in: Add 'GCRY_MAC_POLY1305'.
* tests/basic.c (check_mac): Add Poly1035 test vectors; Allow
overriding lengths of data and key buffers.
* tests/bench-slope.c (mac_bench): Increase max algo number from 500 to
600.
* tests/benchmark.c (mac_bench): Ditto.
--
Patch adds Bernstein's Poly1305 message authentication code to libgcrypt.
Implementation is based on Andrew Moon's public domain implementation
from: https://github.com/floodyberry/poly1305-opt
The algorithm added by this patch is the plain Poly1305 without AES and
takes 32-bit key that must not be reused.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|