| Commit message (Collapse) | Author | Age | Files | Lines |
| |
* cipher/Makefile.am [ENABLE_O_FLAG_MUNGING]: Support -Oz.
* random/Makefile.am [ENABLE_O_FLAG_MUNGING]: Support -Oz.
--
GnuPG-bug-id: 6432
Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
| |
* cipher/Makefile.am: Add 'sm4-ppc.c'.
* cipher/sm4-ppc.c: New.
* cipher/sm4.c (USE_PPC_CRYPTO): New.
(SM4_context): Add 'use_ppc8le' and 'use_ppc9le'.
[USE_PPC_CRYPTO] (_gcry_sm4_ppc8le_crypt_blk1_16)
(_gcry_sm4_ppc9le_crypt_blk1_16, sm4_ppc8le_crypt_blk1_16)
(sm4_ppc9le_crypt_blk1_16): New.
(sm4_setkey) [USE_PPC_CRYPTO]: Set use_ppc8le and use_ppc9le
based on HW features.
(sm4_get_crypt_blk1_16_fn) [USE_PPC_CRYPTO]: Add PowerPC
implementation selection.
--
Benchmark on POWER9:
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 14.47 ns/B 65.89 MiB/s 33.29 c/B
ECB dec | 14.47 ns/B 65.89 MiB/s 33.29 c/B
CBC enc | 35.09 ns/B 27.18 MiB/s 80.71 c/B
CBC dec | 16.69 ns/B 57.13 MiB/s 38.39 c/B
CFB enc | 35.09 ns/B 27.18 MiB/s 80.71 c/B
CFB dec | 16.76 ns/B 56.90 MiB/s 38.55 c/B
CTR enc | 16.88 ns/B 56.50 MiB/s 38.82 c/B
CTR dec | 16.88 ns/B 56.50 MiB/s 38.82 c/B
After (ECB ~4.4x faster):
SM4 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 3.26 ns/B 292.3 MiB/s 7.50 c/B
ECB dec | 3.26 ns/B 292.3 MiB/s 7.50 c/B
CBC enc | 35.10 ns/B 27.17 MiB/s 80.72 c/B
CBC dec | 3.33 ns/B 286.3 MiB/s 7.66 c/B
CFB enc | 35.10 ns/B 27.17 MiB/s 80.74 c/B
CFB dec | 3.36 ns/B 283.8 MiB/s 7.73 c/B
CTR enc | 3.47 ns/B 275.0 MiB/s 7.98 c/B
CTR dec | 3.47 ns/B 275.0 MiB/s 7.98 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
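The selection described above (flags set in the setkey function from HW features, then a bulk function picked from those flags) can be sketched roughly as below. This is a minimal illustration, not the libgcrypt code: the `DEMO_*` feature bits, `demo_context` and the `demo_*` functions are hypothetical names, and the stand-in implementations only return an identifier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; libgcrypt's real HWF_* flags are different. */
#define DEMO_HWF_PPC8  (1u << 0)
#define DEMO_HWF_PPC9  (1u << 1)

enum { DEMO_ID_GENERIC = 0, DEMO_ID_PPC8LE = 1, DEMO_ID_PPC9LE = 2 };

/* Stand-ins for the real block functions; each only reports which one ran. */
typedef int (*demo_crypt_fn_t)(void);
static int demo_crypt_generic(void) { return DEMO_ID_GENERIC; }
static int demo_crypt_ppc8le(void)  { return DEMO_ID_PPC8LE; }
static int demo_crypt_ppc9le(void)  { return DEMO_ID_PPC9LE; }

struct demo_context
{
  unsigned int use_ppc8le:1;
  unsigned int use_ppc9le:1;
};

/* Record once, at key setup, which implementations the hardware allows. */
static void demo_setkey(struct demo_context *ctx, unsigned int hwf)
{
  assert(ctx != NULL);
  ctx->use_ppc8le = (hwf & DEMO_HWF_PPC8) != 0;
  ctx->use_ppc9le = (hwf & DEMO_HWF_PPC9) != 0;
}

/* Prefer the most specific implementation, fall back to generic code. */
static demo_crypt_fn_t demo_get_crypt_fn(const struct demo_context *ctx)
{
  assert(ctx != NULL);
  if (ctx->use_ppc9le)
    return demo_crypt_ppc9le;
  if (ctx->use_ppc8le)
    return demo_crypt_ppc8le;
  return demo_crypt_generic;
}
```

Deciding once at key setup keeps feature checks out of the per-block path; callers only fetch a function pointer.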
| |
* cipher/Makefile.am: Add 'camellia-aarch64-ce.(c|o|lo)'.
(aarch64_neon_cflags): New.
* cipher/camellia-aarch64-ce.c: New.
* cipher/camellia-glue.c (USE_AARCH64_CE): New.
(CAMELLIA_context): Add 'use_aarch64ce'.
(_gcry_camellia_aarch64ce_encrypt_blk16)
(_gcry_camellia_aarch64ce_decrypt_blk16)
(_gcry_camellia_aarch64ce_keygen, camellia_aarch64ce_enc_blk16)
(camellia_aarch64ce_dec_blk16, aarch64ce_burn_stack_depth): New.
(camellia_setkey) [USE_AARCH64_CE]: Set use_aarch64ce if HW has
HWF_ARM_AES; Use AArch64/CE key generation if supported by HW.
(camellia_encrypt_blk1_32, camellia_decrypt_blk1_32)
[USE_AARCH64_CE]: Add AArch64/CE code path.
--
Patch enables 128-bit vector intrinsics implementation of Camellia
cipher for AArch64.
Benchmark on AWS Graviton2:
Before:
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 5.99 ns/B 159.2 MiB/s 14.97 c/B 2500
ECB dec | 5.99 ns/B 159.1 MiB/s 14.98 c/B 2500
CBC enc | 6.16 ns/B 154.7 MiB/s 15.41 c/B 2500
CBC dec | 6.12 ns/B 155.8 MiB/s 15.29 c/B 2499
CFB enc | 6.49 ns/B 147.0 MiB/s 16.21 c/B 2500
CFB dec | 6.05 ns/B 157.6 MiB/s 15.13 c/B 2500
CTR enc | 6.09 ns/B 156.7 MiB/s 15.22 c/B 2500
CTR dec | 6.09 ns/B 156.6 MiB/s 15.22 c/B 2500
XTS enc | 6.16 ns/B 154.9 MiB/s 15.39 c/B 2500
XTS dec | 6.16 ns/B 154.8 MiB/s 15.40 c/B 2499
GCM enc | 6.31 ns/B 151.1 MiB/s 15.78 c/B 2500
GCM dec | 6.31 ns/B 151.1 MiB/s 15.78 c/B 2500
GCM auth | 0.206 ns/B 4635 MiB/s 0.514 c/B 2500
OCB enc | 6.63 ns/B 143.9 MiB/s 16.57 c/B 2499
OCB dec | 6.63 ns/B 143.9 MiB/s 16.56 c/B 2499
OCB auth | 6.55 ns/B 145.7 MiB/s 16.37 c/B 2499
After (ECB ~2.1x faster):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 2.77 ns/B 344.2 MiB/s 6.93 c/B 2499
ECB dec | 2.76 ns/B 345.3 MiB/s 6.90 c/B 2499
CBC enc | 6.17 ns/B 154.7 MiB/s 15.41 c/B 2499
CBC dec | 2.89 ns/B 330.3 MiB/s 7.22 c/B 2500
CFB enc | 6.48 ns/B 147.1 MiB/s 16.21 c/B 2499
CFB dec | 2.84 ns/B 336.1 MiB/s 7.09 c/B 2499
CTR enc | 2.90 ns/B 328.8 MiB/s 7.25 c/B 2499
CTR dec | 2.90 ns/B 328.9 MiB/s 7.25 c/B 2500
XTS enc | 2.93 ns/B 325.3 MiB/s 7.33 c/B 2500
XTS dec | 2.92 ns/B 326.2 MiB/s 7.31 c/B 2500
GCM enc | 3.10 ns/B 307.2 MiB/s 7.76 c/B 2500
GCM dec | 3.10 ns/B 307.2 MiB/s 7.76 c/B 2499
GCM auth | 0.206 ns/B 4635 MiB/s 0.514 c/B 2500
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'camellia-simd128.h',
'camellia-ppc8le.c' and 'camellia-ppc9le.c'.
* cipher/camellia-glue.c (USE_PPC_CRYPTO): New.
(CAMELLIA_context) [USE_PPC_CRYPTO]: Add 'use_ppc', 'use_ppc8'
and 'use_ppc9'.
[USE_PPC_CRYPTO] (_gcry_camellia_ppc8_encrypt_blk16)
(_gcry_camellia_ppc8_decrypt_blk16, _gcry_camellia_ppc8_keygen)
(_gcry_camellia_ppc9_encrypt_blk16)
(_gcry_camellia_ppc9_decrypt_blk16, _gcry_camellia_ppc9_keygen)
(camellia_ppc_enc_blk16, camellia_ppc_dec_blk16)
(ppc_burn_stack_depth): New.
(camellia_setkey) [USE_PPC_CRYPTO]: Setup 'use_ppc', 'use_ppc8'
and 'use_ppc9' and use PPC key-generation if HWF is available.
(camellia_encrypt_blk1_32)
(camellia_decrypt_blk1_32) [USE_PPC_CRYPTO]: Add 'use_ppc' paths.
(_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth): Enable
generic bulk path when USE_PPC_CRYPTO is defined.
* cipher/camellia-ppc8le.c: New.
* cipher/camellia-ppc9le.c: New.
* cipher/camellia-simd128.h: New.
* configure.ac: Add 'camellia-ppc8le.lo' and 'camellia-ppc9le.lo'.
--
Patch adds 128-bit vector intrinsics implementation of Camellia
cipher and enables implementation for POWER8 and POWER9.
Benchmark on POWER9:
Before:
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 13.45 ns/B 70.90 MiB/s 30.94 c/B
ECB dec | 13.45 ns/B 70.92 MiB/s 30.93 c/B
CBC enc | 15.22 ns/B 62.66 MiB/s 35.00 c/B
CBC dec | 13.54 ns/B 70.41 MiB/s 31.15 c/B
CFB enc | 15.24 ns/B 62.59 MiB/s 35.04 c/B
CFB dec | 13.53 ns/B 70.48 MiB/s 31.12 c/B
CTR enc | 13.60 ns/B 70.15 MiB/s 31.27 c/B
CTR dec | 13.62 ns/B 70.02 MiB/s 31.33 c/B
XTS enc | 13.67 ns/B 69.74 MiB/s 31.45 c/B
XTS dec | 13.74 ns/B 69.41 MiB/s 31.60 c/B
GCM enc | 18.18 ns/B 52.45 MiB/s 41.82 c/B
GCM dec | 17.76 ns/B 53.69 MiB/s 40.86 c/B
GCM auth | 4.12 ns/B 231.7 MiB/s 9.47 c/B
OCB enc | 14.40 ns/B 66.22 MiB/s 33.12 c/B
OCB dec | 14.40 ns/B 66.23 MiB/s 33.12 c/B
OCB auth | 14.37 ns/B 66.37 MiB/s 33.05 c/B
After (ECB ~4.1x faster):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 3.25 ns/B 293.7 MiB/s 7.47 c/B
ECB dec | 3.25 ns/B 293.4 MiB/s 7.48 c/B
CBC enc | 15.22 ns/B 62.68 MiB/s 35.00 c/B
CBC dec | 3.36 ns/B 284.1 MiB/s 7.72 c/B
CFB enc | 15.25 ns/B 62.55 MiB/s 35.07 c/B
CFB dec | 3.36 ns/B 284.0 MiB/s 7.72 c/B
CTR enc | 3.47 ns/B 275.1 MiB/s 7.97 c/B
CTR dec | 3.47 ns/B 275.1 MiB/s 7.97 c/B
XTS enc | 3.54 ns/B 269.0 MiB/s 8.15 c/B
XTS dec | 3.54 ns/B 269.6 MiB/s 8.14 c/B
GCM enc | 3.69 ns/B 258.2 MiB/s 8.49 c/B
GCM dec | 3.69 ns/B 258.2 MiB/s 8.50 c/B
GCM auth | 0.226 ns/B 4220 MiB/s 0.520 c/B
OCB enc | 3.81 ns/B 250.2 MiB/s 8.77 c/B
OCB dec | 4.08 ns/B 233.8 MiB/s 9.38 c/B
OCB auth | 3.53 ns/B 270.0 MiB/s 8.12 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'aria-gfni-avx512-amd64.S'.
* cipher/aria-gfni-avx512-amd64.S: New.
* cipher/aria.c (USE_GFNI_AVX512): New.
[USE_GFNI_AVX512] (MAX_PARALLEL_BLKS): New.
(ARIA_context): Add 'use_gfni_avx512'.
(_gcry_aria_gfni_avx512_ecb_crypt_blk64)
(_gcry_aria_gfni_avx512_ctr_crypt_blk64)
(aria_gfni_avx512_ecb_crypt_blk64)
(aria_gfni_avx512_ctr_crypt_blk64): New.
(aria_crypt_blocks) [USE_GFNI_AVX512]: Add 64 parallel block
AVX512/GFNI processing.
(_gcry_aria_ctr_enc) [USE_GFNI_AVX512]: Add 64 parallel block
AVX512/GFNI processing.
(aria_setkey): Enable GFNI/AVX512 based on HW features.
* configure.ac: Add 'aria-gfni-avx512-amd64.lo'.
--
This patch adds AVX512/GFNI accelerated ARIA block cipher
implementation for libgcrypt. This implementation is based on
work by Taehee Yoo, with the following notable changes:
- Integration to libgcrypt, use of 'aes-common-amd64.h'.
- Use round loop instead of unrolling for smaller code size and
increased performance.
- Use stack for temporary storage instead of external buffers.
- Add byte-addition fast path for CTR.
===
Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
GFNI/AVX512:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.203 ns/B 4703 MiB/s 0.953 c/B 4700
ECB dec | 0.204 ns/B 4675 MiB/s 0.959 c/B 4700
CTR enc | 0.207 ns/B 4609 MiB/s 0.973 c/B 4700
CTR dec | 0.207 ns/B 4608 MiB/s 0.973 c/B 4700
===
Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off):
GFNI/AVX512:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.362 ns/B 2635 MiB/s 1.08 c/B 2992
ECB dec | 0.361 ns/B 2639 MiB/s 1.08 c/B 2992
CTR enc | 0.362 ns/B 2633 MiB/s 1.08 c/B 2992
CTR dec | 0.362 ns/B 2633 MiB/s 1.08 c/B 2992
[v2]:
- Add byte-addition fast path for CTR.
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'aria-aesni-avx-amd64.S' and
'aria-aesni-avx2-amd64.S'.
* cipher/aria-aesni-avx-amd64.S: New.
* cipher/aria-aesni-avx2-amd64.S: New.
* cipher/aria.c (USE_AESNI_AVX, USE_GFNI_AVX, USE_AESNI_AVX2)
(USE_GFNI_AVX2, MAX_PARALLEL_BLKS, ASM_FUNC_ABI, ASM_EXTRA_STACK): New.
(ARIA_context): Add 'use_aesni_avx', 'use_gfni_avx',
'use_aesni_avx2' and 'use_gfni_avx2'.
(_gcry_aria_aesni_avx_ecb_crypt_blk1_16)
(_gcry_aria_aesni_avx_ctr_crypt_blk16)
(_gcry_aria_gfni_avx_ecb_crypt_blk1_16)
(_gcry_aria_gfni_avx_ctr_crypt_blk16)
(aria_avx_ecb_crypt_blk1_16, aria_avx_ctr_crypt_blk16)
(_gcry_aria_aesni_avx2_ecb_crypt_blk32)
(_gcry_aria_aesni_avx2_ctr_crypt_blk32)
(_gcry_aria_gfni_avx2_ecb_crypt_blk32)
(_gcry_aria_gfni_avx2_ctr_crypt_blk32)
(aria_avx2_ecb_crypt_blk32, aria_avx2_ctr_crypt_blk32): New.
(aria_crypt_blocks) [USE_AESNI_AVX2]: Add 32 parallel block
AVX2/AESNI/GFNI processing.
(aria_crypt_blocks) [USE_AESNI_AVX]: Add 3 to 16 parallel block
AVX/AESNI/GFNI processing.
(_gcry_aria_ctr_enc) [USE_AESNI_AVX2]: Add 32 parallel block
AVX2/AESNI/GFNI processing.
(_gcry_aria_ctr_enc) [USE_AESNI_AVX]: Add 16 parallel block
AVX/AESNI/GFNI processing.
(_gcry_aria_ctr_enc, _gcry_aria_cbc_dec, _gcry_aria_cfb_enc)
(_gcry_aria_ecb_crypt, _gcry_aria_xts_crypt, _gcry_aria_ctr32le_enc)
(_gcry_aria_ocb_crypt, _gcry_aria_ocb_auth): Use MAX_PARALLEL_BLKS
for parallel processing width.
(aria_setkey): Enable AESNI/AVX, GFNI/AVX, AESNI/AVX2, GFNI/AVX2 based
on HW features.
* configure.ac: Add 'aria-aesni-avx-amd64.lo' and
'aria-aesni-avx2-amd64.lo'.
---
This patch adds AVX/AVX2/AESNI/GFNI accelerated ARIA block cipher
implementations for libgcrypt. This implementation is based on work
by Taehee Yoo, with the following notable changes:
- Integration to libgcrypt, use of 'aes-common-amd64.h'.
- Use 'vmovddup' for loading GFNI constants.
- Use round loop instead of unrolling for smaller code size and
increased performance.
- Use stack for temporary storage instead of external buffers.
- Merge ECB encryption/decryption into a single function.
- Add 1 to 15 blocks support for AVX ECB functions.
- Add byte-addition fast path for CTR.
===
Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.715 ns/B 1333 MiB/s 3.36 c/B 4700
ECB dec | 0.712 ns/B 1339 MiB/s 3.35 c/B 4700
CTR enc | 0.714 ns/B 1336 MiB/s 3.36 c/B 4700
CTR dec | 0.714 ns/B 1335 MiB/s 3.36 c/B 4700
GFNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.516 ns/B 1847 MiB/s 2.43 c/B 4700
ECB dec | 0.519 ns/B 1839 MiB/s 2.44 c/B 4700
CTR enc | 0.517 ns/B 1846 MiB/s 2.43 c/B 4700
CTR dec | 0.518 ns/B 1843 MiB/s 2.43 c/B 4700
AESNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.416 ns/B 2292 MiB/s 1.96 c/B 4700
ECB dec | 0.421 ns/B 2266 MiB/s 1.98 c/B 4700
CTR enc | 0.415 ns/B 2298 MiB/s 1.95 c/B 4700
CTR dec | 0.415 ns/B 2300 MiB/s 1.95 c/B 4700
GFNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.235 ns/B 4056 MiB/s 1.11 c/B 4700
ECB dec | 0.234 ns/B 4079 MiB/s 1.10 c/B 4700
CTR enc | 0.232 ns/B 4104 MiB/s 1.09 c/B 4700
CTR dec | 0.233 ns/B 4094 MiB/s 1.10 c/B 4700
===
Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 1.26 ns/B 757.6 MiB/s 3.77 c/B 2993
ECB dec | 1.27 ns/B 753.1 MiB/s 3.79 c/B 2992
CTR enc | 1.25 ns/B 760.3 MiB/s 3.75 c/B 2992
CTR dec | 1.26 ns/B 759.1 MiB/s 3.76 c/B 2992
GFNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.967 ns/B 986.6 MiB/s 2.89 c/B 2992
ECB dec | 0.966 ns/B 987.1 MiB/s 2.89 c/B 2992
CTR enc | 0.972 ns/B 980.8 MiB/s 2.91 c/B 2993
CTR dec | 0.971 ns/B 982.5 MiB/s 2.90 c/B 2993
AESNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.817 ns/B 1167 MiB/s 2.44 c/B 2992
ECB dec | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992
CTR enc | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992
CTR dec | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992
GFNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.506 ns/B 1886 MiB/s 1.51 c/B 2992
ECB dec | 0.505 ns/B 1887 MiB/s 1.51 c/B 2992
CTR enc | 0.564 ns/B 1691 MiB/s 1.69 c/B 2992
CTR dec | 0.565 ns/B 1689 MiB/s 1.69 c/B 2992
===
Benchmark on AMD Ryzen 7 5800X (zen3, turbo-freq off):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.921 ns/B 1035 MiB/s 3.50 c/B 3800
ECB dec | 0.922 ns/B 1034 MiB/s 3.50 c/B 3800
CTR enc | 0.923 ns/B 1033 MiB/s 3.51 c/B 3800
CTR dec | 0.923 ns/B 1033 MiB/s 3.51 c/B 3800
AESNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.559 ns/B 1707 MiB/s 2.12 c/B 3800
ECB dec | 0.560 ns/B 1703 MiB/s 2.13 c/B 3800
CTR enc | 0.570 ns/B 1672 MiB/s 2.17 c/B 3800
CTR dec | 0.568 ns/B 1679 MiB/s 2.16 c/B 3800
===
Benchmark on AMD EPYC 7642 (zen2):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 1.22 ns/B 784.5 MiB/s 4.01 c/B 3298
ECB dec | 1.22 ns/B 784.8 MiB/s 4.00 c/B 3292
CTR enc | 1.22 ns/B 780.1 MiB/s 4.03 c/B 3299
CTR dec | 1.22 ns/B 779.1 MiB/s 4.04 c/B 3299
AESNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.735 ns/B 1298 MiB/s 2.42 c/B 3299
ECB dec | 0.738 ns/B 1292 MiB/s 2.44 c/B 3299
CTR enc | 0.732 ns/B 1303 MiB/s 2.41 c/B 3299
CTR dec | 0.732 ns/B 1303 MiB/s 2.41 c/B 3299
===
Benchmark on Intel Core i5-6500 (skylake):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 1.24 ns/B 766.6 MiB/s 4.48 c/B 3598
ECB dec | 1.25 ns/B 764.9 MiB/s 4.49 c/B 3598
CTR enc | 1.25 ns/B 761.7 MiB/s 4.50 c/B 3598
CTR dec | 1.25 ns/B 761.6 MiB/s 4.51 c/B 3598
AESNI/AVX2:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.829 ns/B 1150 MiB/s 2.98 c/B 3599
ECB dec | 0.831 ns/B 1147 MiB/s 2.99 c/B 3598
CTR enc | 0.829 ns/B 1150 MiB/s 2.98 c/B 3598
CTR dec | 0.828 ns/B 1152 MiB/s 2.98 c/B 3598
===
Benchmark on Intel Core i5-2450M (sandy-bridge, turbo-freq off):
AESNI/AVX:
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 2.11 ns/B 452.7 MiB/s 5.25 c/B 2494
ECB dec | 2.10 ns/B 454.5 MiB/s 5.23 c/B 2494
CTR enc | 2.10 ns/B 453.2 MiB/s 5.25 c/B 2494
CTR dec | 2.10 ns/B 453.2 MiB/s 5.25 c/B 2494
[v2]
- Optimization for CTR mode: Use the CTR byte-addition path when
counter carry-overflow happens only on the ctr-variable but not in
the generated counter vector registers.
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
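The CTR byte-addition fast path mentioned above can be illustrated with a scalar sketch. The real implementation is assembly working on vector registers; the code below only shows the condition, under the assumption of a 16-byte big-endian counter. `ctr_add_be128` and `ctr_generate` are hypothetical names. When the last counter byte does not overflow across the generated blocks, each block differs from the base counter only in that byte, so no carry propagation is needed for the generated counters, even if advancing the stored counter afterwards does carry.

```c
#include <assert.h>
#include <string.h>

#define BLK 16

/* Full big-endian 128-bit addition of a small value, with carry. */
static void ctr_add_be128(unsigned char ctr[BLK], unsigned int add)
{
  int i;

  for (i = BLK - 1; i >= 0 && add != 0; i--)
    {
      add += ctr[i];
      ctr[i] = add & 0xff;
      add >>= 8;
    }
}

/* Write counter blocks ctr+0 .. ctr+(n-1) to 'out', then advance 'ctr'
 * by n.  Returns 1 if the byte-addition fast path was used. */
static int ctr_generate(unsigned char ctr[BLK], unsigned char *out,
                        unsigned int n)
{
  unsigned int i;
  int fast;

  assert(n >= 1 && n <= 256);

  /* Fast path: none of the generated counters overflows the last byte. */
  fast = (ctr[BLK - 1] + (n - 1) <= 0xff);

  for (i = 0; i < n; i++)
    {
      memcpy(out + i * BLK, ctr, BLK);
      if (fast)
        out[i * BLK + BLK - 1] = (unsigned char)(ctr[BLK - 1] + i);
      else
        ctr_add_be128(out + i * BLK, i);
    }

  /* The stored counter may still carry here; that is handled separately. */
  ctr_add_be128(ctr, n);
  return fast;
}
```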
| |
* cipher/Makefile.am: Add 'aria.c'.
* cipher/aria.c: New.
* cipher/cipher.c (cipher_list, cipher_list_algo301): Add ARIA cipher
specs.
* cipher/mac-cmac.c (map_mac_algo_to_cipher): Add GCRY_MAC_CMAC_ARIA.
(_gcry_mac_type_spec_cmac_aria): New.
* cipher/mac-gmac.c (map_mac_algo_to_cipher): Add GCRY_MAC_GMAC_ARIA.
(_gcry_mac_type_spec_gmac_aria): New.
* cipher/mac-internal.h (_gcry_mac_type_spec_cmac_aria)
(_gcry_mac_type_spec_gmac_aria)
(_gcry_mac_type_spec_poly1305mac_aria): New.
* cipher/mac-poly1305.c (poly1305mac_open): Add GCRY_MAC_GMAC_ARIA.
(_gcry_mac_type_spec_poly1305mac_aria): New.
* cipher/mac.c (mac_list, mac_list_algo201, mac_list_algo401)
(mac_list_algo501): Add ARIA MAC specs.
* configure.ac (available_ciphers): Add 'aria'.
(GCRYPT_CIPHERS): Add 'aria.lo'.
(USE_ARIA): New.
* doc/gcrypt.texi: Add GCRY_CIPHER_ARIA128, GCRY_CIPHER_ARIA192,
GCRY_CIPHER_ARIA256, GCRY_MAC_CMAC_ARIA, GCRY_MAC_GMAC_ARIA and
GCRY_MAC_POLY1305_ARIA.
* src/cipher.h (_gcry_cipher_spec_aria128, _gcry_cipher_spec_aria192)
(_gcry_cipher_spec_aria256): New.
* src/gcrypt.h.in (gcry_cipher_algos): Add GCRY_CIPHER_ARIA128,
GCRY_CIPHER_ARIA192 and GCRY_CIPHER_ARIA256.
(gcry_mac_algos): Add GCRY_MAC_CMAC_ARIA, GCRY_MAC_GMAC_ARIA and
GCRY_MAC_POLY1305_ARIA.
* tests/basic.c (check_ecb_cipher, check_ctr_cipher)
(check_cfb_cipher, check_ocb_cipher) [USE_ARIA]: Add ARIA test-vectors.
(check_ciphers) [USE_ARIA]: Add GCRY_CIPHER_ARIA128, GCRY_CIPHER_ARIA192
and GCRY_CIPHER_ARIA256.
(main): Also run 'check_bulk_cipher_modes' for 'cipher_modes_only'-mode.
* tests/bench-slope.c (bench_mac_init): Add GCRY_MAC_POLY1305_ARIA
setiv-handling.
* tests/benchmark.c (mac_bench): Likewise.
--
This patch adds ARIA block cipher for libgcrypt. This implementation
is based on work by Taehee Yoo, with the following notable changes:
- Integration to libgcrypt, use of bithelp.h and bufhelp.h helper
functions where possible.
- Added lookup table prefetching as is done in AES, GCM and SM4
implementations.
- Changed `get_u8` to return `u32` as returning `byte` caused
sub-optimal code generation with gcc-12/x86-64 (zero extending
from 8-bit to 32-bit register, followed by extraneous sign
extending from 32-bit to 64-bit register).
- Changed 'aria_crypt' loop structure a bit for tiny performance
increase (~1% seen with gcc-12/x86-64/zen4).
Benchmark on AMD Ryzen 9 7900X (x86-64):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 3.99 ns/B 239.1 MiB/s 22.43 c/B 5625
ECB dec | 4.00 ns/B 238.4 MiB/s 22.50 c/B 5625
Benchmark on AMD Ryzen 9 7900X (win32):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 4.57 ns/B 208.7 MiB/s 25.31 c/B 5538
ECB dec | 4.66 ns/B 204.8 MiB/s 25.39 c/B 5453
Benchmark on ARM Cortex-A53 (aarch64):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 74.69 ns/B 12.77 MiB/s 48.40 c/B 647.9
ECB dec | 74.99 ns/B 12.72 MiB/s 48.58 c/B 647.9
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
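The `get_u8` return-type change noted above can be pictured with the sketch below. It is an assumption-laden illustration, not the code from 'aria.c': the byte numbering (index 0 is the most significant byte) and the `get_u8_narrow` name are made up here. The point is only that the `u32` variant hands the caller a value already in a 32-bit type, so no narrowing and re-widening is implied when the result is used as a table index.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef unsigned char byte;

/* Variant returning 'byte': the result is narrowed to 8 bits and has to
 * be widened again by the caller before it can index a table. */
static inline byte get_u8_narrow(u32 x, unsigned int y)
{
  assert(y < 4);
  return (byte)(x >> ((3 - y) * 8));
}

/* Variant returning 'u32': the masked value stays in a 32-bit type. */
static inline u32 get_u8(u32 x, unsigned int y)
{
  assert(y < 4);
  return (x >> ((3 - y) * 8)) & 0xff;
}

/* Typical use: the extracted byte selects an entry of a 256-entry table. */
static inline u32 demo_lookup(const u32 table[256], u32 x, unsigned int y)
{
  return table[get_u8(x, y)];
}
```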
| |
* cipher/Makefile.am: Add 'sha512-armv8-aarch64-ce.S'.
* cipher/sha512-armv8-aarch64-ce.S: New.
* cipher/sha512.c (ATTR_ALIGNED_64, USE_ARM64_SHA512): New.
(k): Make array aligned to 64 bytes.
[USE_ARM64_SHA512] (_gcry_sha512_transform_armv8_ce): New.
[USE_ARM64_SHA512] (do_sha512_transform_armv8_ce): New.
(sha512_init_common) [USE_ARM64_SHA512]: Use ARMv8-SHA512 accelerated
implementation if HW feature available.
* configure.ac: Add 'sha512-armv8-aarch64-ce.lo'.
(gcry_cv_gcc_inline_asm_aarch64_sha3_sha512_sm3_sm4)
(HAVE_GCC_INLINE_ASM_AARCH64_SHA3_SHA512_SM3_SM4): New.
--
Benchmark on AWS Graviton3:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA512 | 2.36 ns/B 404.2 MiB/s 6.13 c/B 2600
After (2.4x faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA512 | 0.977 ns/B 976.6 MiB/s 2.54 c/B 2600
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'blake2b-amd64-avx512.S' and
'blake2s-amd64-avx512.S'.
* cipher/blake2.c (USE_AVX512): New.
(ASM_FUNC_ABI): Setup attribute if USE_AVX2 or USE_AVX512 enabled in
addition to USE_AVX.
(BLAKE2B_CONTEXT_S, BLAKE2S_CONTEXT_S): Add 'use_avx512'.
(_gcry_blake2b_transform_amd64_avx512)
(_gcry_blake2s_transform_amd64_avx512): New.
(blake2b_transform, blake2s_transform) [USE_AVX512]: Add AVX512 path.
(blake2b_init_ctx, blake2s_init_ctx) [USE_AVX512]: Use AVX512 if HW
feature available.
* cipher/blake2b-amd64-avx512.S: New.
* cipher/blake2s-amd64-avx512.S: New.
* configure.ac: Add 'blake2b-amd64-avx512.lo' and
'blake2s-amd64-avx512.lo'.
--
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before (AVX/AVX2 implementations):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
BLAKE2B_512 | 0.841 ns/B 1134 MiB/s 3.44 c/B 4089
BLAKE2S_256 | 1.29 ns/B 741.2 MiB/s 5.26 c/B 4089
After (blake2b ~19% faster, blake2s ~26% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
BLAKE2B_512 | 0.705 ns/B 1353 MiB/s 2.88 c/B 4088
BLAKE2S_256 | 1.02 ns/B 933.3 MiB/s 4.18 c/B 4088
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* LICENSES: Add 'cipher/keccak-amd64-avx512.S'.
* configure.ac: Add 'keccak-amd64-avx512.lo'.
* cipher/Makefile.am: Add 'keccak-amd64-avx512.S'.
* cipher/keccak-amd64-avx512.S: New.
* cipher/keccak.c (USE_64BIT_AVX512, ASM_FUNC_ABI): New.
[USE_64BIT_AVX512] (_gcry_keccak_f1600_state_permute64_avx512)
(_gcry_keccak_absorb_blocks_avx512, keccak_f1600_state_permute64_avx512)
(keccak_absorb_lanes64_avx512, keccak_avx512_64_ops): New.
(keccak_init) [USE_64BIT_AVX512]: Enable x86-64 AVX512 implementation
if supported by HW features.
--
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before (BMI2 instructions):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA3-224 | 1.77 ns/B 540.3 MiB/s 7.22 c/B 4088
SHA3-256 | 1.86 ns/B 514.0 MiB/s 7.59 c/B 4089
SHA3-384 | 2.43 ns/B 393.1 MiB/s 9.92 c/B 4089
SHA3-512 | 3.49 ns/B 273.2 MiB/s 14.27 c/B 4088
SHAKE128 | 1.52 ns/B 629.1 MiB/s 6.20 c/B 4089
SHAKE256 | 1.86 ns/B 511.6 MiB/s 7.62 c/B 4089
After (~33% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA3-224 | 1.32 ns/B 721.8 MiB/s 5.40 c/B 4089
SHA3-256 | 1.40 ns/B 681.7 MiB/s 5.72 c/B 4089
SHA3-384 | 1.83 ns/B 522.5 MiB/s 7.46 c/B 4089
SHA3-512 | 2.63 ns/B 362.1 MiB/s 10.77 c/B 4088
SHAKE128 | 1.13 ns/B 840.4 MiB/s 4.64 c/B 4089
SHAKE256 | 1.40 ns/B 682.1 MiB/s 5.72 c/B 4089
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'sm4-gfni-avx512-amd64.S'.
* cipher/sm4-gfni-avx512-amd64.S: New.
* cipher/sm4.c (USE_GFNI_AVX512): New.
(SM4_context): Add 'use_gfni_avx512' and 'crypt_blk1_16'.
(_gcry_sm4_gfni_avx512_expand_key, _gcry_sm4_gfni_avx512_ctr_enc)
(_gcry_sm4_gfni_avx512_cbc_dec, _gcry_sm4_gfni_avx512_cfb_dec)
(_gcry_sm4_gfni_avx512_ocb_enc, _gcry_sm4_gfni_avx512_ocb_dec)
(_gcry_sm4_gfni_avx512_ocb_auth, _gcry_sm4_gfni_avx512_ctr_enc_blk32)
(_gcry_sm4_gfni_avx512_cbc_dec_blk32)
(_gcry_sm4_gfni_avx512_cfb_dec_blk32)
(_gcry_sm4_gfni_avx512_ocb_enc_blk32)
(_gcry_sm4_gfni_avx512_ocb_dec_blk32)
(_gcry_sm4_gfni_avx512_crypt_blk1_16)
(_gcry_sm4_gfni_avx512_crypt_blk32, sm4_gfni_avx512_crypt_blk1_16)
(sm4_crypt_blk1_32, sm4_encrypt_blk1_32, sm4_decrypt_blk1_32): New.
(sm4_expand_key): Add GFNI/AVX512 code-path.
(sm4_setkey): Use GFNI/AVX512 if supported by CPU; Setup
`ctx->crypt_blk1_16`.
(sm4_encrypt, sm4_decrypt, sm4_get_crypt_blk1_16_fn, _gcry_sm4_ctr_enc)
(_gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt)
(_gcry_sm4_ocb_auth) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path.
(_gcry_sm4_xts_crypt): Change parallel block size from 16 to 32.
* configure.ac: Add 'sm4-gfni-avx512-amd64.lo'.
--
Benchmark on Intel i3-1115G4 (tigerlake):
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 9.45 ns/B 101.0 MiB/s 38.63 c/B 4089
CBC dec | 0.647 ns/B 1475 MiB/s 2.64 c/B 4089
CFB enc | 9.43 ns/B 101.1 MiB/s 38.57 c/B 4089
CFB dec | 0.648 ns/B 1472 MiB/s 2.65 c/B 4089
CTR enc | 0.661 ns/B 1443 MiB/s 2.70 c/B 4089
CTR dec | 0.661 ns/B 1444 MiB/s 2.70 c/B 4089
XTS enc | 0.767 ns/B 1243 MiB/s 3.14 c/B 4089
XTS dec | 0.772 ns/B 1235 MiB/s 3.16 c/B 4089
OCB enc | 0.671 ns/B 1421 MiB/s 2.74 c/B 4089
OCB dec | 0.676 ns/B 1410 MiB/s 2.77 c/B 4089
OCB auth | 0.668 ns/B 1428 MiB/s 2.73 c/B 4090
After:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 7.80 ns/B 122.2 MiB/s 31.91 c/B 4090
CBC dec | 0.293 ns/B 3258 MiB/s 1.20 c/B 4095±3
CFB enc | 7.80 ns/B 122.2 MiB/s 31.90 c/B 4089
CFB dec | 0.294 ns/B 3247 MiB/s 1.20 c/B 4096±3
CTR enc | 0.306 ns/B 3120 MiB/s 1.25 c/B 4098±4
CTR dec | 0.300 ns/B 3182 MiB/s 1.23 c/B 4103±6
XTS enc | 0.431 ns/B 2211 MiB/s 1.77 c/B 4107±9
XTS dec | 0.431 ns/B 2213 MiB/s 1.77 c/B 4102±6
OCB enc | 0.324 ns/B 2946 MiB/s 1.33 c/B 4096±3
OCB dec | 0.326 ns/B 2923 MiB/s 1.34 c/B 4093±2
OCB auth | 0.536 ns/B 1779 MiB/s 2.19 c/B 4089
CBC/CFB enc: 1.20x faster
CBC/CFB dec: 2.20x faster
CTR: 2.18x faster
XTS: 1.78x faster
OCB enc/dec: 2.07x faster
OCB auth: 1.24x faster
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
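The summary factors above are consistent with dividing the "Before" ns/B figure by the "After" ns/B figure of the same row. A small sketch of that arithmetic follows; the function names are hypothetical and nothing here is part of the benchmark tools.

```c
#include <assert.h>

/* Speedup factor from two nanoseconds-per-byte figures.
 * A result above 1.0 means the "after" run is faster. */
static double speedup_factor(double before_ns_per_byte,
                             double after_ns_per_byte)
{
  assert(before_ns_per_byte > 0.0 && after_ns_per_byte > 0.0);
  return before_ns_per_byte / after_ns_per_byte;
}

/* Same comparison expressed as "percent faster". */
static double percent_faster(double before_ns_per_byte,
                             double after_ns_per_byte)
{
  return (speedup_factor(before_ns_per_byte, after_ns_per_byte) - 1.0)
         * 100.0;
}
```

For example, the CBC dec row gives 0.647 / 0.293, which is about 2.2, and the OCB auth row gives 0.668 / 0.536, which is about 1.24.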
| |
* cipher/Makefile.am: Add 'sm4-armv9-aarch64-sve-ce.S'.
* cipher/sm4-armv9-aarch64-sve-ce.S: New.
* cipher/sm4.c (USE_ARM_SVE_CE): New.
(SM4_context) [USE_ARM_SVE_CE]: Add 'use_arm_sve_ce'.
(_gcry_sm4_armv9_sve_ce_crypt, _gcry_sm4_armv9_sve_ce_ctr_enc)
(_gcry_sm4_armv9_sve_ce_cbc_dec, _gcry_sm4_armv9_sve_ce_cfb_dec)
(sm4_armv9_sve_ce_crypt_blk1_16): New.
(sm4_setkey): Enable ARMv9 SVE CE if supported by HW.
(sm4_get_crypt_blk1_16_fn) [USE_ARM_SVE_CE]: Add ARMv9 SVE CE
bulk functions.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
[USE_ARM_SVE_CE]: Add ARMv9 SVE CE bulk functions.
* configure.ac: Add 'sm4-armv9-aarch64-sve-ce.lo'.
--
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
| |
* cipher/Makefile.am (libcipher_la_SOURCES): Add bulkhelp.h.
--
Fixes-commit: 9388279803ff82ea0ccd12a83157b94c807e7a8f
Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
| |
* configure.ac: Added chacha20 and poly1305 assembly implementations.
* cipher/chacha20-p10le-8x.s: (New) - support 8 blocks (512 bytes)
unrolling.
* cipher/poly1305-p10le.s: (New) - support 4 blocks (128 bytes)
unrolling.
* cipher/Makefile.am: Added new chacha20 and poly1305 files.
* cipher/chacha20.c: Added PPC p10 le support for 8x chacha20.
* cipher/poly1305.c: Added PPC p10 le support for 4x poly1305.
* cipher/poly1305-internal.h: Added PPC p10 le support for poly1305.
---
GnuPG-bug-id: 6006
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
[jk: cosmetic changes to C code]
[jk: fix building on ppc64be]
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Remove 'cipher-selftest.c' and 'cipher-selftest.h'.
* cipher/cipher-selftest.c: Remove (refactor these tests to
tests/basic.c).
* cipher/cipher-selftest.h: Remove.
* cipher/blowfish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/camellia-glue.c (selftest_ctr_128, selftest_cbc_128)
(selftest_cfb_128): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/cast5.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/des.c (bulk_selftest_setkey, selftest_ctr, selftest_cbc)
(selftest_cfb): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/rijndael.c (selftest_basic_128, selftest_basic_192)
(selftest_basic_256): Allocate context from stack instead of heap and
handle alignment manually.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/serpent.c (selftest_ctr_128, selftest_cbc_128)
(selftest_cfb_128): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/sm4.c (selftest_ctr_128, selftest_cbc_128)
(selftest_cfb_128): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* cipher/twofish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
(selftest): Remove CTR/CBC/CFB bulk self-tests.
* tests/basic.c (buf_xor, cipher_cbc_bulk_test, buf_xor_2dst)
(cipher_cfb_bulk_test, cipher_ctr_bulk_test): New.
(check_ciphers): Run cipher_cbc_bulk_test(), cipher_cfb_bulk_test() and
cipher_ctr_bulk_test() for block ciphers.
---
CBC/CFB/CTR bulk self-tests are quite computationally heavy and
slow down use cases where an application opens a cipher context once,
does its processing and exits. A better place for these tests is
`tests/basic`.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
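The rijndael entry above mentions allocating the self-test context from the stack and handling alignment manually. One common way to do that is to over-allocate a byte buffer and round the pointer up, as in the sketch below. This is only an illustration of the technique: `align_up`, `demo_selftest` and the `DEMO_*` sizes are hypothetical and not taken from 'rijndael.c'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CTX_SIZE   64  /* assumed context size, for illustration */
#define DEMO_CTX_ALIGN  16  /* assumed alignment requirement */

/* Round 'ptr' up to the next multiple of 'align' (a power of two). */
static void *align_up(void *ptr, size_t align)
{
  uintptr_t p = (uintptr_t)ptr;

  assert(align != 0 && (align & (align - 1)) == 0);
  return (void *)((p + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Carve an aligned context out of a plain stack buffer.  The buffer is
 * over-allocated by (alignment - 1) bytes so the aligned region fits. */
static int demo_selftest(void)
{
  unsigned char ctxmem[DEMO_CTX_SIZE + DEMO_CTX_ALIGN - 1];
  unsigned char *ctx = align_up(ctxmem, DEMO_CTX_ALIGN);

  memset(ctx, 0, DEMO_CTX_SIZE);
  return ((uintptr_t)ctx % DEMO_CTX_ALIGN) == 0
         && ctx + DEMO_CTX_SIZE <= ctxmem + sizeof(ctxmem);
}
```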
| |
* cipher/Makefile.am: Add 'camellia-gfni-avx512-amd64.S'.
* cipher/bulkhelp.h (bulk_ocb_prepare_L_pointers_array_blk64): New.
* cipher/camellia-aesni-avx2-amd64.h: Rename internal functions from
"__camellia_???" to "FUNC_NAME(???)"; Minor changes to comments.
* cipher/camellia-gfni-avx512-amd64.S: New.
* cipher/camellia-glue.c (USE_GFNI_AVX512): New.
(CAMELLIA_context): Add 'use_gfni_avx512'.
(_gcry_camellia_gfni_avx512_ctr_enc, _gcry_camellia_gfni_avx512_cbc_dec)
(_gcry_camellia_gfni_avx512_cfb_dec, _gcry_camellia_gfni_avx512_ocb_enc)
(_gcry_camellia_gfni_avx512_ocb_dec)
(_gcry_camellia_gfni_avx512_enc_blk64)
(_gcry_camellia_gfni_avx512_dec_blk64, avx512_burn_stack_depth): New.
(camellia_setkey): Use GFNI/AVX512 if supported by CPU.
(camellia_encrypt_blk1_64, camellia_decrypt_blk1_64): New.
(_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec, _gcry_camellia_cfb_dec)
(_gcry_camellia_ocb_crypt) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path.
(_gcry_camellia_xts_crypt): Change parallel block size from 32 to 64.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Increase test
block size.
* cipher/chacha20-amd64-avx512.S: Clear k-mask registers with xor.
* cipher/poly1305-amd64-avx512.S: Likewise.
* cipher/sha512-avx512-amd64.S: Likewise.
---
Benchmark on Intel i3-1115G4 (tigerlake):
Before (GFNI/AVX2):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.356 ns/B 2679 MiB/s 1.46 c/B 4089
CFB dec | 0.374 ns/B 2547 MiB/s 1.53 c/B 4089
CTR enc | 0.409 ns/B 2332 MiB/s 1.67 c/B 4089
CTR dec | 0.406 ns/B 2347 MiB/s 1.66 c/B 4089
XTS enc | 0.430 ns/B 2216 MiB/s 1.76 c/B 4090
XTS dec | 0.433 ns/B 2201 MiB/s 1.77 c/B 4090
OCB enc | 0.460 ns/B 2071 MiB/s 1.88 c/B 4089
OCB dec | 0.492 ns/B 1939 MiB/s 2.01 c/B 4089
After (GFNI/AVX512):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.207 ns/B 4600 MiB/s 0.827 c/B 3989
CFB dec | 0.207 ns/B 4610 MiB/s 0.825 c/B 3989
CTR enc | 0.218 ns/B 4382 MiB/s 0.868 c/B 3990
CTR dec | 0.217 ns/B 4389 MiB/s 0.867 c/B 3990
XTS enc | 0.330 ns/B 2886 MiB/s 1.35 c/B 4097±4
XTS dec | 0.328 ns/B 2904 MiB/s 1.35 c/B 4097±3
OCB enc | 0.246 ns/B 3879 MiB/s 0.981 c/B 3990
OCB dec | 0.247 ns/B 3855 MiB/s 0.987 c/B 3990
CBC dec: 70% faster
CFB dec: 80% faster
CTR: 87% faster
XTS: 31% faster
OCB: 92% faster
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| |
* cipher/Makefile.am: Add 'sm4-gfni-avx2-amd64.S'.
* cipher/sm4-gfni-avx2-amd64.S: New.
* cipher/sm4.c (USE_GFNI_AVX2): New.
(SM4_context): Add 'use_gfni_avx2'.
(crypt_blk1_8_fn_t): Rename to...
(crypt_blk1_16_fn_t): ...this.
(sm4_aesni_avx_crypt_blk1_8): Rename to...
(sm4_aesni_avx_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(_gcry_sm4_gfni_avx2_expand_key, _gcry_sm4_gfni_avx2_ctr_enc)
(_gcry_sm4_gfni_avx2_cbc_dec, _gcry_sm4_gfni_avx2_cfb_dec)
(_gcry_sm4_gfni_avx2_ocb_enc, _gcry_sm4_gfni_avx2_ocb_dec)
(_gcry_sm4_gfni_avx2_ocb_auth, _gcry_sm4_gfni_avx2_crypt_blk1_16)
(sm4_gfni_avx2_crypt_blk1_16): New.
(sm4_aarch64_crypt_blk1_8): Rename to...
(sm4_aarch64_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_armv8_ce_crypt_blk1_8): Rename to...
(sm4_armv8_ce_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_expand_key): Add GFNI/AVX2 path.
(sm4_setkey): Enable GFNI/AVX2 implementation if HW features
available; Disable AESNI implementations when GFNI implementation is
enabled.
(sm4_encrypt) [USE_GFNI_AVX2]: New.
(sm4_decrypt) [USE_GFNI_AVX2]: New.
(sm4_get_crypt_blk1_8_fn): Rename to...
(sm4_get_crypt_blk1_16_fn): ...this; Update to use *_blk1_16 functions;
Add GFNI/AVX2 selection.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Add GFNI/AVX2 path; Widen
generic bulk processing from 8 blocks to 16 blocks.
(_gcry_sm4_xts_crypt): Widen generic bulk processing from 8 blocks to
16 blocks.
--
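The selection rule described for sm4_setkey above (enable GFNI/AVX2 when the hardware offers it, and turn the AES-NI path off in that case) can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the DEMO_HWF_* bits, the context struct and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; the real HW feature flags live elsewhere. */
enum
{
  DEMO_HWF_AESNI = 1 << 0,
  DEMO_HWF_AVX2  = 1 << 1,
  DEMO_HWF_GFNI  = 1 << 2
};

/* Hypothetical context holding one flag per implementation. */
struct demo_sm4_context
{
  unsigned int use_aesni_avx2:1;
  unsigned int use_gfni_avx2:1;
};

/* Pick implementations from the feature mask; GFNI takes precedence. */
static void demo_select_impl(struct demo_sm4_context *ctx, unsigned int hwf)
{
  assert(ctx != NULL);

  ctx->use_aesni_avx2 = (hwf & DEMO_HWF_AESNI) && (hwf & DEMO_HWF_AVX2);
  ctx->use_gfni_avx2 = (hwf & DEMO_HWF_GFNI) && (hwf & DEMO_HWF_AVX2);

  /* Disable the AES-NI path when the GFNI path is enabled. */
  if (ctx->use_gfni_avx2)
    ctx->use_aesni_avx2 = 0;
}
```

Keeping the flags mutually exclusive lets the bulk functions test them in any order without one path shadowing the other.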
Benchmark on Intel i3-1115G4 (tigerlake):
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 10.34 ns/B 92.21 MiB/s 42.29 c/B 4089
ECB dec | 10.34 ns/B 92.24 MiB/s 42.29 c/B 4090
CBC enc | 11.06 ns/B 86.26 MiB/s 45.21 c/B 4090
CBC dec | 1.13 ns/B 844.8 MiB/s 4.62 c/B 4090
CFB enc | 11.06 ns/B 86.27 MiB/s 45.22 c/B 4090
CFB dec | 1.13 ns/B 846.0 MiB/s 4.61 c/B 4090
CTR enc | 1.14 ns/B 834.3 MiB/s 4.67 c/B 4089
CTR dec | 1.14 ns/B 834.5 MiB/s 4.67 c/B 4089
XTS enc | 1.93 ns/B 494.1 MiB/s 7.89 c/B 4090
XTS dec | 1.94 ns/B 492.5 MiB/s 7.92 c/B 4090
OCB enc | 1.16 ns/B 823.3 MiB/s 4.74 c/B 4090
OCB dec | 1.16 ns/B 818.8 MiB/s 4.76 c/B 4089
OCB auth | 1.15 ns/B 831.0 MiB/s 4.69 c/B 4089
After:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 8.39 ns/B 113.6 MiB/s 34.33 c/B 4090
ECB dec | 8.40 ns/B 113.5 MiB/s 34.35 c/B 4090
CBC enc | 9.45 ns/B 101.0 MiB/s 38.63 c/B 4089
CBC dec | 0.650 ns/B 1468 MiB/s 2.66 c/B 4090
CFB enc | 9.44 ns/B 101.1 MiB/s 38.59 c/B 4090
CFB dec | 0.660 ns/B 1444 MiB/s 2.70 c/B 4090
CTR enc | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090
CTR dec | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090
XTS enc | 0.756 ns/B 1262 MiB/s 3.09 c/B 4090
XTS dec | 0.757 ns/B 1260 MiB/s 3.10 c/B 4090
OCB enc | 0.673 ns/B 1417 MiB/s 2.75 c/B 4090
OCB dec | 0.675 ns/B 1413 MiB/s 2.76 c/B 4090
OCB auth | 0.672 ns/B 1418 MiB/s 2.75 c/B 4090
ECB: 1.2x faster
CBC-enc / CFB-enc: 1.17x faster
CBC-dec / CFB-dec / CTR / OCB: 1.7x faster
XTS: 2.5x faster
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add "camellia-gfni-avx2-amd64.S".
* cipher/camellia-aesni-avx2-amd64.h [CAMELLIA_GFNI_BUILD]: Add GFNI
support.
* cipher/camellia-gfni-avx2-amd64.S: New.
* cipher/camellia-glue.c (USE_GFNI_AVX2): New.
(CAMELLIA_context) [USE_AESNI_AVX2]: New member "use_gfni_avx2".
[USE_GFNI_AVX2] (_gcry_camellia_gfni_avx2_ctr_enc)
(_gcry_camellia_gfni_avx2_cbc_dec, _gcry_camellia_gfni_avx2_cfb_dec)
(_gcry_camellia_gfni_avx2_ocb_enc, _gcry_camellia_gfni_avx2_ocb_dec)
(_gcry_camellia_gfni_avx2_ocb_auth): New.
(camellia_setkey) [USE_GFNI_AVX2]: Enable GFNI if supported by HW.
(_gcry_camellia_ctr_enc) [USE_GFNI_AVX2]: Add GFNI support.
(_gcry_camellia_cbc_dec) [USE_GFNI_AVX2]: Add GFNI support.
(_gcry_camellia_cfb_dec) [USE_GFNI_AVX2]: Add GFNI support.
(_gcry_camellia_ocb_crypt) [USE_GFNI_AVX2]: Add GFNI support.
(_gcry_camellia_ocb_auth) [USE_GFNI_AVX2]: Add GFNI support.
* configure.ac: Add "camellia-gfni-avx2-amd64.lo".
--
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before (VAES/AVX2 implementation):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.579 ns/B 1646 MiB/s 2.37 c/B 4090
CFB dec | 0.579 ns/B 1648 MiB/s 2.37 c/B 4089
CTR enc | 0.586 ns/B 1628 MiB/s 2.40 c/B 4090
CTR dec | 0.587 ns/B 1626 MiB/s 2.40 c/B 4090
OCB enc | 0.607 ns/B 1570 MiB/s 2.48 c/B 4089
OCB dec | 0.611 ns/B 1561 MiB/s 2.50 c/B 4089
OCB auth | 0.602 ns/B 1585 MiB/s 2.46 c/B 4089
After (~80% faster):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.299 ns/B 3186 MiB/s 1.22 c/B 4090
CFB dec | 0.314 ns/B 3039 MiB/s 1.28 c/B 4089
CTR enc | 0.322 ns/B 2962 MiB/s 1.32 c/B 4090
CTR dec | 0.321 ns/B 2970 MiB/s 1.31 c/B 4090
OCB enc | 0.339 ns/B 2817 MiB/s 1.38 c/B 4089
OCB dec | 0.346 ns/B 2756 MiB/s 1.41 c/B 4089
OCB auth | 0.337 ns/B 2831 MiB/s 1.38 c/B 4089
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'chacha20-amd64-avx512.S'.
* cipher/chacha20-amd64-avx512.S: New.
* cipher/chacha20.c (USE_AVX512): New.
(CHACHA20_context_s): Add 'use_avx512'.
[USE_AVX512] (_gcry_chacha20_amd64_avx512_blocks16): New.
(chacha20_do_setkey) [USE_AVX512]: Setup 'use_avx512' based on
HW features.
(do_chacha20_encrypt_stream_tail) [USE_AVX512]: Use AVX512
implementation if supported.
(_gcry_chacha20_poly1305_encrypt) [USE_AVX512]: Disable stitched
chacha20-poly1305 implementations if AVX512 implementation is used.
(_gcry_chacha20_poly1305_decrypt) [USE_AVX512]: Disable stitched
chacha20-poly1305 implementations if AVX512 implementation is used.
--
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
STREAM enc | 0.276 ns/B 3451 MiB/s 1.13 c/B 4090
STREAM dec | 0.284 ns/B 3359 MiB/s 1.16 c/B 4090
POLY1305 enc | 0.411 ns/B 2320 MiB/s 1.68 c/B 4098±3
POLY1305 dec | 0.408 ns/B 2338 MiB/s 1.67 c/B 4091±1
POLY1305 auth | 0.060 ns/B 15785 MiB/s 0.247 c/B 4090±1
After (stream 1.7x faster, poly1305-aead 1.8x faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
STREAM enc | 0.162 ns/B 5869 MiB/s 0.665 c/B 4092±1
STREAM dec | 0.162 ns/B 5884 MiB/s 0.664 c/B 4096±3
POLY1305 enc | 0.221 ns/B 4306 MiB/s 0.907 c/B 4097±3
POLY1305 dec | 0.220 ns/B 4342 MiB/s 0.900 c/B 4096±3
POLY1305 auth | 0.060 ns/B 15797 MiB/s 0.247 c/B 4085±2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* LICENSES: Add 3-clause BSD license for poly1305-amd64-avx512.S.
* cipher/Makefile.am: Add 'poly1305-amd64-avx512.S'.
* cipher/poly1305-amd64-avx512.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_AVX512): New.
(poly1305_context_s): Add 'use_avx512'.
* cipher/poly1305.c (ASM_FUNC_ABI, ASM_FUNC_WRAPPER_ATTR): New.
[POLY1305_USE_AVX512] (_gcry_poly1305_amd64_avx512_blocks)
(poly1305_amd64_avx512_blocks): New.
(poly1305_init): Use AVX512 if HW feature is available (set use_avx512).
[USE_MPI_64BIT] (poly1305_blocks): Rename to ...
[USE_MPI_64BIT] (poly1305_blocks_generic): ... this.
[USE_MPI_64BIT] (poly1305_blocks): New.
--
Patch adds AMD64 AVX512-FMA52 implementation for Poly1305.
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
POLY1305 | 0.306 ns/B 3117 MiB/s 1.25 c/B 4090
After (5.0x faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
POLY1305 | 0.061 ns/B 15699 MiB/s 0.249 c/B 4095±3
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'sm3-armv8-aarch64-ce.S'.
* cipher/sm3-armv8-aarch64-ce.S: New.
* cipher/sm3.c (USE_ARM_CE): New.
[USE_ARM_CE] (_gcry_sm3_transform_armv8_ce)
(do_sm3_transform_armv8_ce): New.
(sm3_init) [USE_ARM_CE]: New.
* configure.ac: Add 'sm3-armv8-aarch64-ce.lo'.
--
Benchmark on T-Head Yitian-710 2.75 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 2.84 ns/B 335.3 MiB/s 7.82 c/B 2749
After (~55% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 1.84 ns/B 518.1 MiB/s 5.06 c/B 2749
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
|
| |
* cipher/Makefile.am: Use EXEEXT_FOR_BUILD.
* doc/Makefile.am: Likewise.
--
Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
|
| |
* LICENSES: Add 'cipher/sha512-avx512-amd64.S'.
* cipher/Makefile.am: Add 'sha512-avx512-amd64.S'.
* cipher/sha512-avx512-amd64.S: New.
* cipher/sha512.c (USE_AVX512): New.
(do_sha512_transform_amd64_ssse3, do_sha512_transform_amd64_avx)
(do_sha512_transform_amd64_avx2): Add ASM_EXTRA_STACK to return value
only if assembly routine returned non-zero value.
[USE_AVX512] (_gcry_sha512_transform_amd64_avx512)
(do_sha512_transform_amd64_avx512): New.
(sha512_init_common) [USE_AVX512]: Use AVX512 implementation if HW
feature supported.
---
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA512 | 1.51 ns/B 631.6 MiB/s 6.17 c/B 4089
After (~29% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SHA512 | 1.16 ns/B 819.0 MiB/s 4.76 c/B 4090
GnuPG-bug-id: T4460
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'.
* cipher/sm4-armv8-aarch64-ce.S: New.
* cipher/sm4.c (USE_ARM_CE): New.
(SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'.
[USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key)
(_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc)
(_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec)
(_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New.
(sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup.
(sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW.
(sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption.
(sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add
ARMv8/AArch64/CE bulk functions.
* configure.ac: Add 'sm4-armv8-aarch64-ce.lo'.
--
This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk
functions process eight blocks in parallel.
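A bulk loop of this kind can be sketched as follows. This is a minimal illustration, not the code from the patch: demo_crypt_blk1_8 is a hypothetical stand-in for the assembly routine (it only XORs each byte with a constant), and the block size and batch width are written as local constants.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLOCK_SIZE 16
#define DEMO_PARALLEL_BLOCKS 8

/* Hypothetical stand-in for a routine that handles 1 to 8 blocks per call. */
static void demo_crypt_blk1_8(unsigned char *out, const unsigned char *in,
                              size_t nblocks)
{
  size_t i;

  assert(nblocks >= 1 && nblocks <= DEMO_PARALLEL_BLOCKS);
  for (i = 0; i < nblocks * DEMO_BLOCK_SIZE; i++)
    out[i] = in[i] ^ 0x5a;  /* placeholder transform, not SM4 */
}

/* Process full batches of eight blocks, then the remaining 1 to 7 blocks.
   Returns the number of calls made to the block routine. */
static size_t demo_bulk_crypt(unsigned char *out, const unsigned char *in,
                              size_t nblocks)
{
  size_t calls = 0;

  while (nblocks > 0)
    {
      size_t n = nblocks > DEMO_PARALLEL_BLOCKS ? DEMO_PARALLEL_BLOCKS
                                                : nblocks;

      demo_crypt_blk1_8(out, in, n);
      in += n * DEMO_BLOCK_SIZE;
      out += n * DEMO_BLOCK_SIZE;
      nblocks -= n;
      calls++;
    }

  return calls;
}
```

For example, 19 input blocks are handled in three calls (8 + 8 + 3), so only the last call runs with a partially filled batch.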
Benchmark on T-Head Yitian-710 2.75 GHz:
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750
CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749
CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750
CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750
CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750
GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750
GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750
OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750
OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750
OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750
After (10x - 19x faster than ARMv8/AArch64 impl):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 1.25 ns/B 762.7 MiB/s 3.44 c/B 2749
CBC dec | 0.243 ns/B 3927 MiB/s 0.668 c/B 2750
CFB enc | 1.25 ns/B 763.1 MiB/s 3.44 c/B 2750
CFB dec | 0.245 ns/B 3899 MiB/s 0.673 c/B 2750
CTR enc | 0.298 ns/B 3199 MiB/s 0.820 c/B 2750
CTR dec | 0.298 ns/B 3198 MiB/s 0.820 c/B 2750
GCM enc | 0.487 ns/B 1957 MiB/s 1.34 c/B 2749
GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750
GCM auth | 0.189 ns/B 5048 MiB/s 0.519 c/B 2750
OCB enc | 0.443 ns/B 2150 MiB/s 1.22 c/B 2749
OCB dec | 0.486 ns/B 1964 MiB/s 1.34 c/B 2750
OCB auth | 0.369 ns/B 2585 MiB/s 1.01 c/B 2749
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
|
| |
* cipher/Makefile.am [ENABLE_PPC_VCRYPTO_EXTRA_CFLAGS]
(ppc_vcrypto_cflags): Add '-O2'.
* configure.ac (gcry_cv_cc_ppc_altivec): Check for missing compiler
optimization with vec_sld_u32 inline function.
* configure.ac (gcry_cv_cc_ppc_altivec_cflags): Check for missing
compiler optimization with vec_sld_u32 inline function; Add '-O2' to
CFLAGS.
--
Attempt to enable optimization for PPC vector register implementations
if PPC altivec check does not pass otherwise.
GnuPG-bug-id: T5785
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'sm4-aarch64.S'.
* cipher/sm4-aarch64.S: New.
* cipher/sm4.c (USE_AARCH64_SIMD): New.
(SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'.
[USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt)
(_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec)
(_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8)
(sm4_aarch64_crypt_blk1_8): New.
(sm4_setkey): Enable ARMv8/AArch64 if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]:
Add ARMv8/AArch64 bulk functions.
* configure.ac: Add 'sm4-aarch64.lo'.
--
This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk
functions process eight blocks in parallel.
Benchmark on T-Head Yitian-710 2.75 GHz:
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 12.10 ns/B 78.81 MiB/s 33.28 c/B 2750
CBC dec | 7.19 ns/B 132.6 MiB/s 19.77 c/B 2750
CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750
CFB dec | 7.24 ns/B 131.8 MiB/s 19.90 c/B 2750
CTR enc | 7.24 ns/B 131.7 MiB/s 19.90 c/B 2750
CTR dec | 7.24 ns/B 131.7 MiB/s 19.91 c/B 2750
GCM enc | 9.49 ns/B 100.4 MiB/s 26.11 c/B 2750
GCM dec | 9.49 ns/B 100.5 MiB/s 26.10 c/B 2750
GCM auth | 2.25 ns/B 423.1 MiB/s 6.20 c/B 2750
OCB enc | 7.35 ns/B 129.8 MiB/s 20.20 c/B 2750
OCB dec | 7.36 ns/B 129.6 MiB/s 20.23 c/B 2750
OCB auth | 7.29 ns/B 130.8 MiB/s 20.04 c/B 2749
After (~55% faster):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750
CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749
CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750
CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750
CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750
GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750
GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750
GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750
OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750
OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750
OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
|
| |
* cipher/Makefile.am: Add 'sm3-aarch64.S'.
* cipher/sm3-aarch64.S: New.
* cipher/sm3.c (USE_AARCH64_SIMD): New.
[USE_AARCH64_SIMD] (_gcry_sm3_transform_aarch64)
(do_sm3_transform_aarch64): New.
(sm3_init) [USE_AARCH64_SIMD]: New.
* configure.ac: Add 'sm3-aarch64.lo'.
* tests/basic.c (main): Add command-line option '--hash' for running
only hash algorithm tests.
--
Benchmark on AWS Graviton2:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 4.24 ns/B 224.8 MiB/s 10.61 c/B 2500
After (~34% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 3.15 ns/B 302.4 MiB/s 7.88 c/B 2500
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* configure.ac: Added p10 assembly implementation file and associated file.
* cipher/Makefile.am: Added p10 assembly implementation file and associated
file.
* cipher/rijndael.c: Added p10 function.
* cipher/rijndael-p10le.c: New wrapper file for AES-GCM call.
* cipher/rijndael-gcm-p10le.s: New implementation of AES-GCM bulk function in
Power Assembly.
* src/g10lib.h: Added Power arch 3.1 definition for p10.
* src/hwf-ppc.c: Added Power arch 3.1 definition for p10.
* src/hwfeatures.c: Added Power arch 3.1 definition for p10.
--
GnuPG-bug-id: 5700
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
[jk: fixes for C coding style]
[jk: prefix assembly functions with '_gcry_ppc10']
[jk: add assert check for gcm_table size]
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'sm3-avx-bmi2-amd64.S'.
* cipher/sm3-avx-bmi2-amd64.S: New.
* cipher/sm3.c (USE_AVX_BMI2, ASM_FUNC_ABI, ASM_EXTRA_STACK): New.
(SM3_CONTEXT): Define 'h' as array instead of separate fields 'h1',
'h2', etc.
[USE_AVX_BMI2] (_gcry_sm3_transform_amd64_avx_bmi2)
(do_sm3_transform_amd64_avx_bmi2): New.
(sm3_init): Select AVX/BMI2 transform function if supported by HW; Update
to use 'hd->h' as array.
(transform_blk, sm3_final): Update to use 'hd->h' as array.
* configure.ac: Add 'sm3-avx-bmi2-amd64.lo'.
--
Benchmark on AMD Zen3:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 2.18 ns/B 436.6 MiB/s 10.59 c/B 4850
After (~43% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 1.52 ns/B 627.4 MiB/s 7.37 c/B 4850
Benchmark on Intel Skylake:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 4.35 ns/B 219.2 MiB/s 13.48 c/B 3098
After (~34% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 3.24 ns/B 294.4 MiB/s 10.04 c/B 3098
Benchmark on AMD Zen2:
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 2.73 ns/B 348.9 MiB/s 11.86 c/B 4339
After (~38% faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz
SM3 | 1.97 ns/B 483.0 MiB/s 8.52 c/B 4318
Reviewed-and-tested-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add a space.
* doc/Makefile.am: Ditto.
--
Signed-off-by: Alexander Kanavin <alex.kanavin@gmail.com>
|
| |
* configure.ac [host=s390x-*-*]: Add 'poly1305-s390x.lo'.
* cipher/Makefile.am: Move 'poly1305-s390x.S' to
'EXTRA_libcipher_la_SOURCES'.
--
GnuPG-bug-id: 5694
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'cipher-gcm-siv.c'.
* cipher/cipher-gcm-siv.c: New.
* cipher/cipher-gcm.c (_gcry_cipher_gcm_setupM): New.
* cipher/cipher-internal.h (gcry_cipher_handle): Add 'siv_keylen'.
(_gcry_cipher_gcm_setupM, _gcry_cipher_gcm_siv_encrypt)
(_gcry_cipher_gcm_siv_decrypt, _gcry_cipher_gcm_siv_set_nonce)
(_gcry_cipher_gcm_siv_authenticate)
(_gcry_cipher_gcm_siv_set_decryption_tag)
(_gcry_cipher_gcm_siv_get_tag, _gcry_cipher_gcm_siv_check_tag)
(_gcry_cipher_gcm_siv_setkey): New prototypes.
(cipher_block_bswap): New helper function.
* cipher/cipher.c (_gcry_cipher_open_internal): Add
'GCRY_CIPHER_MODE_GCM_SIV'; Refactor mode requirement checks for
better size optimization (check pointers & blocksize in same order
for all).
(cipher_setkey, cipher_reset, _gcry_cipher_setup_mode_ops)
(_gcry_cipher_info): Add GCM-SIV.
(_gcry_cipher_ctl): Handle 'set decryption tag' for GCM-SIV.
* doc/gcrypt.texi: Add GCM-SIV.
* src/gcrypt.h.in (GCRY_CIPHER_MODE_GCM_SIV): New.
(GCRY_SIV_BLOCK_LEN, gcry_cipher_set_decryption_tag): Add to comment
that these are also for GCM-SIV in addition to SIV mode.
* tests/basic.c (check_gcm_siv_cipher): New.
(check_cipher_modes): Check for GCM-SIV.
* tests/bench-slope.c (bench_gcm_siv_encrypt_do_bench)
(bench_gcm_siv_decrypt_do_bench, bench_gcm_siv_authenticate_do_bench)
(gcm_siv_encrypt_ops, gcm_siv_decrypt_ops)
(gcm_siv_authenticate_ops): New.
(cipher_modes): Add GCM-SIV.
(cipher_bench_one): Check key length requirement for GCM-SIV.
--
GnuPG-bug-id: T4485
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'cipher-siv.c'.
* cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Rename to
_gcry_cipher_ctr_encrypt_ctx and add algo context parameter.
(_gcry_cipher_ctr_encrypt): New using _gcry_cipher_ctr_encrypt_ctx.
* cipher/cipher-internal.h (gcry_cipher_handle): Add 'u_mode.siv'.
(_gcry_cipher_ctr_encrypt_ctx, _gcry_cipher_siv_encrypt)
(_gcry_cipher_siv_decrypt, _gcry_cipher_siv_set_nonce)
(_gcry_cipher_siv_authenticate, _gcry_cipher_siv_set_decryption_tag)
(_gcry_cipher_siv_get_tag, _gcry_cipher_siv_check_tag)
(_gcry_cipher_siv_setkey): New.
* cipher/cipher-siv.c: New.
* cipher/cipher.c (_gcry_cipher_open_internal, cipher_setkey)
(cipher_reset, _gcry_cipher_setup_mode_ops, _gcry_cipher_info): Add
GCRY_CIPHER_MODE_SIV handling.
(_gcry_cipher_ctl): Add GCRYCTL_SET_DECRYPTION_TAG handling.
* doc/gcrypt.texi: Add documentation for SIV mode.
* src/gcrypt.h.in (GCRYCTL_SET_DECRYPTION_TAG): New.
(GCRY_CIPHER_MODE_SIV): New.
(gcry_cipher_set_decryption_tag): New.
* tests/basic.c (check_siv_cipher): New.
(check_cipher_modes): Add call for 'check_siv_cipher'.
* tests/bench-slope.c (bench_encrypt_init): Use double size key for
SIV mode.
(bench_aead_encrypt_do_bench, bench_aead_decrypt_do_bench)
(bench_aead_authenticate_do_bench): Reset cipher context on each run.
(bench_aead_authenticate_do_bench): Support nonce-less operation.
(bench_siv_encrypt_do_bench, bench_siv_decrypt_do_bench)
(bench_siv_authenticate_do_bench, siv_encrypt_ops)
(siv_decrypt_ops, siv_authenticate_ops): New.
(cipher_modes): Add SIV mode benchmarks.
(cipher_bench_one): Restrict SIV mode testing to 16 byte block-size.
--
GnuPG-bug-id: T4486
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Move arch specific 'cipher-gcm-*.[cS]' files
from libcipher_la_SOURCES to EXTRA_libcipher_la_SOURCES.
* configure.ac: Add 'cipher-gcm-intel-pclmul.lo' and
'cipher-gcm-arm*.lo'.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'cipher-gcm-ppc.c'.
* cipher/cipher-gcm-ppc.c: New.
* cipher/cipher-gcm.c [GCM_USE_PPC_VPMSUM] (_gcry_ghash_setup_ppc_vpmsum)
(_gcry_ghash_ppc_vpmsum, ghash_setup_ppc_vpmsum, ghash_ppc_vpmsum): New.
(setupM) [GCM_USE_PPC_VPMSUM]: Select ppc-vpmsum implementation if
HW feature "ppc-vcrypto" is available.
* cipher/cipher-internal.h (GCM_USE_PPC_VPMSUM): New.
(gcry_cipher_handle): Move 'ghash_fn' at end of 'gcm' block to align
'gcm_table' to 16 bytes.
* configure.ac: Add 'cipher-gcm-ppc.lo'.
* tests/basic.c (_check_gcm_cipher): New AES256 test vector.
* AUTHORS: Add 'CRYPTOGAMS'.
* LICENSES: Add original license to 3-clause-BSD section.
--
https://dev.gnupg.org/D501:
10-20x speedup.
However, this POWER9 machine is faster than the one used for the previous
POWER9 benchmarks of the optimized versions, so although the results are
better than with the last patch, not all of the gain is due to the code.
Before:
GCM enc | 4.23 ns/B 225.3 MiB/s - c/B
GCM dec | 3.58 ns/B 266.2 MiB/s - c/B
GCM auth | 3.34 ns/B 285.3 MiB/s - c/B
After:
GCM enc | 0.370 ns/B 2578 MiB/s - c/B
GCM dec | 0.371 ns/B 2571 MiB/s - c/B
GCM auth | 0.159 ns/B 6003 MiB/s - c/B
Signed-off-by: Shawn Landden <shawn@git.icu>
[jk: coding style fixes, Makefile.am integration, patch from Differential
to git, commit changelog, fixed few compiler warnings]
GnuPG-bug-id: 5040
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'rijndael-vaes.c' and
'rijndael-vaes-avx2-amd64.S'.
* cipher/rijndael-internal.h (USE_VAES): New.
* cipher/rijndael-vaes-avx2-amd64.S: New.
* cipher/rijndael-vaes.c: New.
* cipher/rijndael.c (_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_cbc_dec)
(_gcry_aes_vaes_ctr_enc, _gcry_aes_vaes_ocb_crypt)
(_gcry_aes_vaes_xts_crypt): New.
(do_setkey) [USE_VAES]: Add detection for VAES.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128)
[USE_VAES]: Increase number of selftest blocks.
* configure.ac: Add 'rijndael-vaes.lo' and
'rijndael-vaes-avx2-amd64.lo'.
--
Patch adds VAES/AVX2 accelerated implementation for CBC-decryption,
CFB-decryption, CTR-encryption, OCB-en/decryption and XTS-en/decryption.
Benchmarks on AMD Ryzen 5800X:
Before:
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.067 ns/B 14314 MiB/s 0.323 c/B 4850
CFB dec | 0.067 ns/B 14322 MiB/s 0.323 c/B 4850
CTR enc | 0.066 ns/B 14429 MiB/s 0.321 c/B 4850
CTR dec | 0.066 ns/B 14433 MiB/s 0.320 c/B 4850
XTS enc | 0.087 ns/B 10910 MiB/s 0.424 c/B 4850
XTS dec | 0.088 ns/B 10856 MiB/s 0.426 c/B 4850
OCB enc | 0.070 ns/B 13633 MiB/s 0.339 c/B 4850
OCB dec | 0.069 ns/B 13911 MiB/s 0.332 c/B 4850
After (XTS ~1.7x faster, others ~1.9x faster):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.034 ns/B 28159 MiB/s 0.164 c/B 4850
CFB dec | 0.034 ns/B 27955 MiB/s 0.165 c/B 4850
CTR enc | 0.034 ns/B 28214 MiB/s 0.164 c/B 4850
CTR dec | 0.034 ns/B 28146 MiB/s 0.164 c/B 4850
XTS enc | 0.051 ns/B 18539 MiB/s 0.249 c/B 4850
XTS dec | 0.051 ns/B 18655 MiB/s 0.248 c/B 4850
GCM auth | 0.088 ns/B 10817 MiB/s 0.428 c/B 4850
OCB enc | 0.037 ns/B 25824 MiB/s 0.179 c/B 4850
OCB dec | 0.038 ns/B 25359 MiB/s 0.182 c/B 4850
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am: Add 'camellia-aesni-avx2-amd64.h' and
'camellia-vaes-avx2-amd64.S'.
* cipher/camellia-aesni-avx2-amd64.S: New, old content moved to...
* cipher/camellia-aesni-avx2-amd64.h: ...here.
(IF_AESNI, IF_VAES, FUNC_NAME): New.
* cipher/camellia-vaes-avx2-amd64.S: New.
* cipher/camellia-glue.c (USE_VAES_AVX2): New.
(CAMELLIA_context): New member 'use_vaes_avx2'.
(_gcry_camellia_vaes_avx2_ctr_enc, _gcry_camellia_vaes_avx2_cbc_dec)
(_gcry_camellia_vaes_avx2_cfb_dec, _gcry_camellia_vaes_avx2_ocb_enc)
(_gcry_camellia_vaes_avx2_ocb_dec)
(_gcry_camellia_vaes_avx2_ocb_auth): New.
(camellia_setkey): Check for HWF_INTEL_VAES.
(_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec)
(_gcry_camellia_cfb_dec, _gcry_camellia_ocb_crypt)
(_gcry_camellia_ocb_auth): Add USE_VAES_AVX2 code.
* configure.ac: Add 'camellia-vaes-avx2-amd64.lo'.
--
Camellia AES-NI/AVX2 implementation had to split 256-bit vector
to 128-bit parts for AES processing, but now we can use those
256-bit registers directly with VAES.
Benchmarks on AMD Ryzen 5800X:
Before (AES-NI/AVX2):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.539 ns/B 1769 MiB/s 2.62 c/B 4852
CFB dec | 0.528 ns/B 1806 MiB/s 2.56 c/B 4852±1
CTR enc | 0.552 ns/B 1728 MiB/s 2.68 c/B 4850
OCB enc | 0.550 ns/B 1734 MiB/s 2.65 c/B 4825
OCB dec | 0.577 ns/B 1653 MiB/s 2.78 c/B 4825
OCB auth | 0.546 ns/B 1747 MiB/s 2.63 c/B 4825
After (VAES/AVX2, CBC-dec ~13%, CFB-dec/CTR/OCB ~20% faster):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.477 ns/B 1999 MiB/s 2.31 c/B 4850
CFB dec | 0.433 ns/B 2201 MiB/s 2.10 c/B 4850
CTR enc | 0.438 ns/B 2176 MiB/s 2.13 c/B 4851
OCB enc | 0.449 ns/B 2122 MiB/s 2.18 c/B 4850
OCB dec | 0.468 ns/B 2038 MiB/s 2.27 c/B 4850
OCB auth | 0.447 ns/B 2131 MiB/s 2.17 c/B 4850
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
* cipher/Makefile.am (o_flag_munging): Add handling for '-Og'.
* random/Makefile.am (o_flag_munging): Add handling for '-Og'.
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|\
| |
| |
| |
| |
| | |
--
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
| | |
* cipher/Makefile.am: Add 'poly1305-s390x.S' and
'asm-poly1305-s390x.h'.
* cipher/asm-poly1305-s390x.h: New
* cipher/chacha20-s390x.S (_gcry_chacha20_poly1305_s390x_vx_blocks8)
(_gcry_chacha20_poly1305_s390x_vx_blocks4_2_1): New, stitched
chacha20-poly1305 implementation.
* cipher/chacha20.c (USE_S390X_VX_POLY1305): New.
(_gcry_chacha20_poly1305_s390x_vx_blocks8)
(_gcry_chacha20_poly1305_s390x_vx_blocks4_2_1): New prototypes.
(_gcry_chacha20_poly1305_encrypt, _gcry_chacha20_poly1305_decrypt): Add
s390x/VX stitched chacha20-poly1305 code-path.
* cipher/poly1305-s390x.S: New.
* cipher/poly1305.c (USE_S390X_ASM, HAVE_ASM_POLY1305_BLOCKS): New.
[USE_S390X_ASM] (_gcry_poly1305_s390x_blocks1, poly1305_blocks): New.
* configure.ac (gcry_cv_gcc_inline_asm_s390x): Check for 'risbgn' and
'algrk' instructions.
* tests/basic.c (_check_poly1305_cipher): Add large chacha20-poly1305
test vector.
--
Patch adds Poly1305 and stitched ChaCha20-Poly1305 implementation
for zSeries. Stitched implementation interleaves ChaCha20 and Poly1305
processing for higher instruction level parallelism and better
utilization of execution units.
Benchmark on z15 (4504 Mhz):
Before:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 1.16 ns/B 823.2 MiB/s 5.22 c/B
POLY1305 dec | 1.16 ns/B 823.2 MiB/s 5.22 c/B
POLY1305 auth | 0.736 ns/B 1295 MiB/s 3.32 c/B
After (chacha20-poly1305 ~71% faster, poly1305 ~29% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.677 ns/B 1409 MiB/s 3.05 c/B
POLY1305 dec | 0.655 ns/B 1456 MiB/s 2.95 c/B
POLY1305 auth | 0.569 ns/B 1675 MiB/s 2.56 c/B
GnuPG-bug-id: 5202
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
| | |
* cipher/Makefile.am: Add 'asm-common-s390x.h' and 'chacha20-s390x.S'.
* cipher/asm-common-s390x.h: New.
* cipher/chacha20-s390x.S: New.
* cipher/chacha20.c (USE_S390X_VX): New.
(CHACHA20_context_t): Change 'use_*' bit-field to unsigned type; Add
'use_s390x'.
(_gcry_chacha20_s390x_vx_blocks8)
(_gcry_chacha20_s390x_vx_blocks4_2_1): New.
(chacha20_do_setkey): Add HW feature detect for s390x/VX.
(chacha20_blocks, do_chacha20_encrypt_stream_tail): Add s390x/VX
code-path.
* configure.ac: Add 'chacha20-s390x.lo'.
--
Patch adds VX vector instruction set accelerated ChaCha20
implementation for zSeries.
Benchmark on z15 (4504 Mhz):
Before:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 2.62 ns/B 364.0 MiB/s 11.80 c/B
STREAM dec | 2.62 ns/B 363.8 MiB/s 11.81 c/B
After (~5x faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.505 ns/B 1888 MiB/s 2.28 c/B
STREAM dec | 0.506 ns/B 1887 MiB/s 2.28 c/B
GnuPG-bug-id: 5201
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
| | |
* cipher/Makefile.am: Add 'asm-inline-s390x.h'.
* cipher/asm-inline-s390x.h: New.
* cipher/cipher-gcm.c [GCM_USE_S390X_CRYPTO] (ghash_s390x_kimd): New.
(setupM) [GCM_USE_S390X_CRYPTO]: Add setup for s390x GHASH function.
* cipher/cipher-internal.h (GCM_USE_S390X_CRYPTO): New.
* cipher/rijndael-s390x.c (u128_t, km_functions_e): Move to
'asm-inline-s390x.h'.
(aes_s390x_gcm_crypt): New.
(_gcry_aes_s390x_setup_acceleration): Use 'km_function_to_mask'; Add
setup for GCM bulk function.
--
This patch adds zSeries acceleration for GHASH and AES-GCM.
Benchmarks (z15, 5.2Ghz):
Before:
AES | nanosecs/byte mebibytes/sec cycles/byte
GCM enc | 2.64 ns/B 361.6 MiB/s 13.71 c/B
GCM dec | 2.64 ns/B 361.3 MiB/s 13.72 c/B
GCM auth | 2.58 ns/B 370.1 MiB/s 13.40 c/B
After:
AES | nanosecs/byte mebibytes/sec cycles/byte
GCM enc | 0.059 ns/B 16066 MiB/s 0.309 c/B
GCM dec | 0.059 ns/B 16114 MiB/s 0.308 c/B
GCM auth | 0.057 ns/B 16747 MiB/s 0.296 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
| |
| | |
* configure.ac: Add 'rijndael-s390x.lo'.
* cipher/Makefile.am: Add 'rijndael-s390x.c'.
* cipher/rijndael-internal.h (USE_S390X_CRYPTO): New.
(RIJNDAEL_context_s) [USE_S390X_CRYPTO]: New 'km*_func' members.
* cipher/rijndael-s390x.c: New.
* cipher/rijndael.c (_gcry_aes_s390x_setup_acceleration)
(_gcry_aes_s390x_setup_setkey)
(_gcry_aes_s390x_setup_prepare_decryption, _gcry_aes_s390x_encrypt)
(_gcry_aes_s390x_decrypt): New.
(do_setkey) [USE_S390X_CRYPTO]: Add s390x acceleration setup.
--
Patch adds acceleration for single-block AES and the following modes:
- CBC, CBC-MAC, CFB, OFB, CTR, XTS and OCB
Benchmarks (z15, 5.2Ghz):
Before:
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 3.81 ns/B 250.2 MiB/s 19.82 c/B
ECB dec | 4.13 ns/B 231.1 MiB/s 21.46 c/B
CBC enc | 3.69 ns/B 258.5 MiB/s 19.19 c/B
CBC dec | 3.71 ns/B 257.1 MiB/s 19.29 c/B
CFB enc | 3.69 ns/B 258.7 MiB/s 19.17 c/B
CFB dec | 3.56 ns/B 267.8 MiB/s 18.52 c/B
OFB enc | 3.85 ns/B 247.8 MiB/s 20.01 c/B
OFB dec | 3.85 ns/B 247.9 MiB/s 20.01 c/B
CTR enc | 3.65 ns/B 261.6 MiB/s 18.96 c/B
CTR dec | 3.64 ns/B 261.6 MiB/s 18.95 c/B
XTS enc | 3.66 ns/B 260.8 MiB/s 19.02 c/B
XTS dec | 3.75 ns/B 254.2 MiB/s 19.51 c/B
CCM enc | 7.34 ns/B 129.9 MiB/s 38.19 c/B
CCM dec | 7.34 ns/B 129.9 MiB/s 38.19 c/B
CCM auth | 3.70 ns/B 257.6 MiB/s 19.25 c/B
EAX enc | 7.34 ns/B 129.8 MiB/s 38.19 c/B
EAX dec | 7.35 ns/B 129.8 MiB/s 38.20 c/B
EAX auth | 3.70 ns/B 257.8 MiB/s 19.24 c/B
GCM enc | 6.22 ns/B 153.3 MiB/s 32.36 c/B
GCM dec | 6.23 ns/B 153.0 MiB/s 32.42 c/B
GCM auth | 2.59 ns/B 368.9 MiB/s 13.44 c/B
OCB enc | 3.82 ns/B 249.7 MiB/s 19.86 c/B
OCB dec | 3.90 ns/B 244.2 MiB/s 20.31 c/B
OCB auth | 3.88 ns/B 245.5 MiB/s 20.20 c/B
After:
AES | nanosecs/byte mebibytes/sec cycles/byte
ECB enc | 2.10 ns/B 453.1 MiB/s 10.94 c/B
ECB dec | 2.11 ns/B 453.0 MiB/s 10.95 c/B
CBC enc | 0.182 ns/B 5240 MiB/s 0.946 c/B
CBC dec | 0.044 ns/B 21581 MiB/s 0.230 c/B
CFB enc | 0.206 ns/B 4623 MiB/s 1.07 c/B
CFB dec | 0.140 ns/B 6826 MiB/s 0.727 c/B
OFB enc | 0.183 ns/B 5222 MiB/s 0.950 c/B
OFB dec | 0.182 ns/B 5252 MiB/s 0.944 c/B
CTR enc | 0.059 ns/B 16095 MiB/s 0.308 c/B
CTR dec | 0.059 ns/B 16045 MiB/s 0.309 c/B
XTS enc | 0.043 ns/B 21998 MiB/s 0.225 c/B
XTS dec | 0.043 ns/B 22012 MiB/s 0.225 c/B
CCM enc | 0.239 ns/B 3989 MiB/s 1.24 c/B
CCM dec | 0.239 ns/B 3987 MiB/s 1.24 c/B
CCM auth | 0.180 ns/B 5288 MiB/s 0.938 c/B
EAX enc | 0.242 ns/B 3940 MiB/s 1.26 c/B
EAX dec | 0.243 ns/B 3926 MiB/s 1.26 c/B
EAX auth | 0.183 ns/B 5218 MiB/s 0.950 c/B
GCM enc | 2.64 ns/B 361.6 MiB/s 13.71 c/B
GCM dec | 2.64 ns/B 361.3 MiB/s 13.72 c/B
GCM auth | 2.58 ns/B 370.1 MiB/s 13.40 c/B
OCB enc | 0.186 ns/B 5132 MiB/s 0.966 c/B
OCB dec | 0.176 ns/B 5414 MiB/s 0.916 c/B
OCB auth | 0.149 ns/B 6394 MiB/s 0.776 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|/
|
|
|
|
|
|
| |
* cipher/Makefile.am (EXTRA_DIST): Remove hmac-tests.c.
* cipher/hmac-tests.c: Remove, merge into...
* cipher/mac-hmac.c: ... here.
Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Prepare merge of hmac-tests.c into mac-hmac.c.
* cipher/hmac-tests.c: Ifdef-out run_selftests and _gcry_hmac_selftest.
* cipher/mac-internal.h: Include cipher-proto.h for selftest.
(gcry_mac_spec_ops): Add selftest field.
* cipher/mac-hmac.c: Include hmac-tests.c for migration.
(hmac_selftest): New.
(hmac_ops): Add hmac_selftest.
* cipher/gost28147.c, cipher/mac-cmac.c: Add new field for selftest.
* cipher/mac-gmac.c, cipher/mac-poly1305.c: Likewise.
* cipher/mac.c (_gcry_mac_selftest): New.
* src/fips.c (run_mac_selftests): Rename from run_hmac_selftests.
Use GCRY_MAC_HMAC_*, and call _gcry_mac_selftest.
(_gcry_fips_run_selftests): Use run_mac_selftests.
Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
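
The change above routes self-tests through a per-algorithm operations table. A minimal sketch of that pattern follows. All type, function and constant names ending in "_sketch" are hypothetical simplifications and not the real libgcrypt definitions; the algorithm numbers are arbitrary, and leaving the second entry without a self-test is an assumption made only to show the fallback path.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the structures named above;
   the real libgcrypt definitions have more members.  */
typedef int (*selftest_fn_t) (int algo, int extended);

struct mac_ops_sketch
{
  selftest_fn_t selftest;   /* NULL if the algorithm has no self-test.  */
};

struct mac_spec_sketch
{
  int algo;
  const struct mac_ops_sketch *ops;
};

enum { SKETCH_OK = 0, SKETCH_ERR_NO_ALGO = 1, SKETCH_ERR_NO_SELFTEST = 2 };

/* Placeholder self-test that always passes.  */
int
hmac_selftest_sketch (int algo, int extended)
{
  (void) algo;
  (void) extended;
  return SKETCH_OK;
}

const struct mac_ops_sketch hmac_ops_sketch = { hmac_selftest_sketch };
const struct mac_ops_sketch other_ops_sketch = { NULL };

/* Arbitrary algorithm numbers, chosen for illustration only.  */
const struct mac_spec_sketch mac_list_sketch[] = {
  { 101, &hmac_ops_sketch },
  { 201, &other_ops_sketch },
};

/* Look up ALGO and run its self-test through the ops table.  */
int
mac_selftest_sketch (int algo, int extended)
{
  size_t i;
  size_t n = sizeof mac_list_sketch / sizeof mac_list_sketch[0];

  for (i = 0; i < n; i++)
    if (mac_list_sketch[i].algo == algo)
      {
        const struct mac_ops_sketch *ops = mac_list_sketch[i].ops;

        /* Invariant of this sketch: every spec has an ops table.  */
        assert (ops != NULL);
        if (!ops->selftest)
          return SKETCH_ERR_NO_SELFTEST;
        return ops->selftest (algo, extended);
      }
  return SKETCH_ERR_NO_ALGO;
}
```

With this layout a caller such as a FIPS self-test runner only needs the algorithm number; it does not have to know which source file implements the test.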
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Add 'sm4-aesni-avx2-amd64.S'.
* cipher/sm4-aesni-avx2-amd64.S: New.
* cipher/sm4.c (USE_AESNI_AVX2): New.
(SM4_context) [USE_AESNI_AVX2]: Add 'use_aesni_avx2'.
[USE_AESNI_AVX2] (_gcry_sm4_aesni_avx2_ctr_enc)
(_gcry_sm4_aesni_avx2_cbc_dec, _gcry_sm4_aesni_avx2_cfb_dec)
(_gcry_sm4_aesni_avx2_ocb_enc, _gcry_sm4_aesni_avx2_ocb_dec)
(_gcry_sm4_aesni_avx2_ocb_auth): New.
(sm4_setkey): Enable AES-NI/AVX2 if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX2]: Add
AES-NI/AVX2 bulk functions.
* configure.ac: Add 'sm4-aesni-avx2-amd64.lo'.
--
This patch adds x86-64/AES-NI/AVX2 bulk encryption/decryption. Bulk
functions process 16 blocks in parallel.
Benchmark on AMD Ryzen 7 3700X:
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 8.98 ns/B 106.2 MiB/s 38.62 c/B 4300
CBC dec | 1.55 ns/B 613.7 MiB/s 6.64 c/B 4275
CFB enc | 8.96 ns/B 106.4 MiB/s 38.52 c/B 4300
CFB dec | 1.54 ns/B 617.4 MiB/s 6.60 c/B 4275
CTR enc | 1.57 ns/B 607.8 MiB/s 6.75 c/B 4300
CTR dec | 1.57 ns/B 608.9 MiB/s 6.74 c/B 4300
OCB enc | 1.58 ns/B 603.8 MiB/s 6.75 c/B 4275
OCB dec | 1.57 ns/B 605.7 MiB/s 6.73 c/B 4275
OCB auth | 1.53 ns/B 624.5 MiB/s 6.57 c/B 4300
After (~56% faster than AES-NI/AVX impl.):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 8.93 ns/B 106.8 MiB/s 38.61 c/B 4326
CBC dec | 0.984 ns/B 969.5 MiB/s 4.23 c/B 4300
CFB enc | 8.93 ns/B 106.8 MiB/s 38.62 c/B 4325
CFB dec | 0.983 ns/B 970.3 MiB/s 4.23 c/B 4300
CTR enc | 0.998 ns/B 955.1 MiB/s 4.29 c/B 4300
CTR dec | 0.996 ns/B 957.4 MiB/s 4.28 c/B 4300
OCB enc | 1.00 ns/B 951.8 MiB/s 4.31 c/B 4300
OCB dec | 1.00 ns/B 951.8 MiB/s 4.31 c/B 4300
OCB auth | 0.993 ns/B 960.2 MiB/s 4.28 c/B 4304±2
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
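
Since the bulk functions here take 16 blocks per call, a caller has to decide what to do with input lengths that are not a multiple of 16. The sketch below shows one plausible way to split the work, assuming the widest routine is tried first, a narrower 8-block routine handles what it can of the remainder, and a generic one-block path takes the rest. The type and function names are hypothetical, and the commit message does not state the actual fallback order, so this is an illustration rather than a description of sm4.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping type, for illustration only.  */
struct bulk_plan
{
  size_t calls16;   /* Calls to a 16-block (AVX2-style) routine.  */
  size_t calls8;    /* Calls to an 8-block (AVX-style) routine.  */
  size_t single;    /* Blocks left for the generic one-block path.  */
};

/* Split NBLOCKS so that the widest available routine is used first.  */
struct bulk_plan
plan_bulk (size_t nblocks, int have_avx2, int have_avx)
{
  struct bulk_plan plan = { 0, 0, 0 };
  const size_t total = nblocks;

  if (have_avx2)
    {
      plan.calls16 = nblocks / 16;
      nblocks %= 16;
    }
  if (have_avx)
    {
      plan.calls8 = nblocks / 8;
      nblocks %= 8;
    }
  plan.single = nblocks;

  /* Every input block is accounted for exactly once.  */
  assert (plan.calls16 * 16 + plan.calls8 * 8 + plan.single == total);
  return plan;
}
```

For 45 blocks with both routines available, this yields two 16-block calls, one 8-block call and five single blocks.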
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am: Add 'sm4-aesni-avx-amd64.S'.
* cipher/sm4-aesni-avx-amd64.S: New.
* cipher/sm4.c (USE_AESNI_AVX, ASM_FUNC_ABI): New.
(SM4_context) [USE_AESNI_AVX]: Add 'use_aesni_avx'.
[USE_AESNI_AVX] (_gcry_sm4_aesni_avx_expand_key)
(_gcry_sm4_aesni_avx_crypt_blk1_8, _gcry_sm4_aesni_avx_ctr_enc)
(_gcry_sm4_aesni_avx_cbc_dec, _gcry_sm4_aesni_avx_cfb_dec)
(_gcry_sm4_aesni_avx_ocb_enc, _gcry_sm4_aesni_avx_ocb_dec)
(_gcry_sm4_aesni_avx_ocb_auth, sm4_aesni_avx_crypt_blk1_8): New.
(sm4_expand_key) [USE_AESNI_AVX]: Use AES-NI/AVX key setup.
(sm4_setkey): Enable AES-NI/AVX if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX]: Add
AES-NI/AVX bulk functions.
* configure.ac: Add 'sm4-aesni-avx-amd64.lo'.
--
This patch adds x86-64/AES-NI/AVX bulk encryption/decryption and key
setup for SM4 cipher. Bulk functions process eight blocks in parallel.
Benchmark on AMD Ryzen 7 3700X:
Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 8.94 ns/B 106.7 MiB/s 38.66 c/B 4325
CBC dec | 4.78 ns/B 199.7 MiB/s 20.42 c/B 4275
CFB enc | 8.95 ns/B 106.5 MiB/s 38.72 c/B 4325
CFB dec | 4.81 ns/B 198.2 MiB/s 20.57 c/B 4275
CTR enc | 4.81 ns/B 198.2 MiB/s 20.69 c/B 4300
CTR dec | 4.80 ns/B 198.8 MiB/s 20.63 c/B 4300
GCM auth | 0.116 ns/B 8232 MiB/s 0.504 c/B 4351
OCB enc | 4.88 ns/B 195.5 MiB/s 20.86 c/B 4275
OCB dec | 4.85 ns/B 196.6 MiB/s 20.86 c/B 4301
OCB auth | 4.80 ns/B 198.9 MiB/s 20.62 c/B 4301
After (~3.0x faster):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC enc | 8.98 ns/B 106.2 MiB/s 38.62 c/B 4300
CBC dec | 1.55 ns/B 613.7 MiB/s 6.64 c/B 4275
CFB enc | 8.96 ns/B 106.4 MiB/s 38.52 c/B 4300
CFB dec | 1.54 ns/B 617.4 MiB/s 6.60 c/B 4275
CTR enc | 1.57 ns/B 607.8 MiB/s 6.75 c/B 4300
CTR dec | 1.57 ns/B 608.9 MiB/s 6.74 c/B 4300
OCB enc | 1.58 ns/B 603.8 MiB/s 6.75 c/B 4275
OCB dec | 1.57 ns/B 605.7 MiB/s 6.73 c/B 4275
OCB auth | 1.53 ns/B 624.5 MiB/s 6.57 c/B 4300
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
* cipher/cipher.c (cipher_list, cipher_list_algo301): Add
_gcry_cipher_spec_sm4.
* cipher/mac-cmac.c (map_mac_algo_to_cipher): Add cmac SM4.
(_gcry_mac_type_spec_cmac_sm4): Add cmac SM4.
* cipher/mac-internal.h: Declare spec_cmac_sm4.
* cipher/mac.c (mac_list, mac_list_algo201): Add cmac SM4.
* cipher/sm4.c: New.
* configure.ac (available_ciphers): Add sm4.
* doc/gcrypt.texi: Add SM4 documentation.
* src/cipher.h: Add declarations for SM4 and cmac SM4.
* src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
--
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
[jk: add missing mapping in mac-cmac.c:map_mac_algo_to_cipher]
[jk: add GCRY_MAC_CMAC_SM4 to gcrypt.texi]
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* configure.ac: Add 'rijndael-ppc9le.lo'.
* cipher/Makefile.am: Add 'rijndael-ppc9le.c', 'rijndael-ppc-common.h'
and 'rijndael-ppc-functions.h'.
* cipher/rijndael-internal.h (USE_PPC_CRYPTO_WITH_PPC9LE): New.
(RIJNDAEL_context_s): Add 'use_ppc9le_crypto'.
* cipher/rijndael.c (_gcry_aes_ppc9le_encrypt)
(_gcry_aes_ppc9le_decrypt, _gcry_aes_ppc9le_cfb_enc)
(_gcry_aes_ppc9le_cfb_dec, _gcry_aes_ppc9le_ctr_enc)
(_gcry_aes_ppc9le_cbc_enc, _gcry_aes_ppc9le_cbc_dec)
(_gcry_aes_ppc9le_ocb_crypt, _gcry_aes_ppc9le_ocb_auth)
(_gcry_aes_ppc9le_xts_crypt): New.
(do_setkey, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc)
(_gcry_aes_ctr_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_dec)
(_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth, _gcry_aes_xts_crypt)
[USE_PPC_CRYPTO_WITH_PPC9LE]: New.
* cipher/rijndael-ppc.c: Split common code to headers
'rijndael-ppc-common.h' and 'rijndael-ppc-functions.h'.
* cipher/rijndael-ppc-common.h: Split from 'rijndael-ppc.c'.
(asm_add_uint64, asm_sra_int64, asm_swap_uint64_halfs): New.
* cipher/rijndael-ppc-functions.h: Split from 'rijndael-ppc.c'.
(CFB_ENC_FUNC, CBC_ENC_FUNC): Unroll loop by 2.
(XTS_CRYPT_FUNC, GEN_TWEAK): Tweak generation without vperm
instruction.
* cipher/rijndael-ppc9le.c: New.
--
Provide POWER9 little-endian optimized variant of PPC vcrypto AES
implementation. This implementation uses 'lxvb16x' and 'stxvb16x'
instructions to load/store vectors directly in big-endian order.
Benchmark on POWER9 (~3.8 GHz):
Before:
AES | nanosecs/byte mebibytes/sec cycles/byte
CBC enc | 1.04 ns/B 918.7 MiB/s 3.94 c/B
CBC dec | 0.222 ns/B 4292 MiB/s 0.844 c/B
CFB enc | 1.04 ns/B 916.9 MiB/s 3.95 c/B
CFB dec | 0.224 ns/B 4252 MiB/s 0.852 c/B
CTR enc | 0.226 ns/B 4218 MiB/s 0.859 c/B
CTR dec | 0.225 ns/B 4233 MiB/s 0.856 c/B
XTS enc | 0.500 ns/B 1907 MiB/s 1.90 c/B
XTS dec | 0.494 ns/B 1932 MiB/s 1.88 c/B
OCB enc | 0.288 ns/B 3312 MiB/s 1.09 c/B
OCB dec | 0.292 ns/B 3266 MiB/s 1.11 c/B
OCB auth | 0.267 ns/B 3567 MiB/s 1.02 c/B
After (ctr & ocb & cbc-dec & cfb-dec ~15% and xts ~8% faster):
AES | nanosecs/byte mebibytes/sec cycles/byte
CBC enc | 1.04 ns/B 914.2 MiB/s 3.96 c/B
CBC dec | 0.191 ns/B 4984 MiB/s 0.727 c/B
CFB enc | 1.03 ns/B 930.0 MiB/s 3.90 c/B
CFB dec | 0.194 ns/B 4906 MiB/s 0.739 c/B
CTR enc | 0.196 ns/B 4868 MiB/s 0.744 c/B
CTR dec | 0.197 ns/B 4834 MiB/s 0.750 c/B
XTS enc | 0.460 ns/B 2075 MiB/s 1.75 c/B
XTS dec | 0.455 ns/B 2097 MiB/s 1.73 c/B
OCB enc | 0.250 ns/B 3812 MiB/s 0.951 c/B
OCB dec | 0.253 ns/B 3764 MiB/s 0.963 c/B
OCB auth | 0.232 ns/B 4106 MiB/s 0.883 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* configure.ac (enabled_pubkey_ciphers): Add ecc-sm2.
* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add ecc-sm2.c.
* cipher/pubkey-util.c (_gcry_pk_util_parse_flaglist,
_gcry_pk_util_preparse_sigval): Add sm2 flags.
* cipher/ecc.c: Support ecc-sm2.
* cipher/ecc-common.h: Add declarations for ecc-sm2.
* cipher/ecc-sm2.c: New.
* src/cipher.h: Define PUBKEY_FLAG_SM2.
--
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
|