path: root/cipher/Makefile.am
Commit message | Author | Age | Files | Lines
* build: Allow build with -Oz. | NIIBE Yutaka | 2023-04-03 | 1 | -1/+1

    * cipher/Makefile.am [ENABLE_O_FLAG_MUNGING]: Support -Oz.
    * random/Makefile.am [ENABLE_O_FLAG_MUNGING]: Support -Oz.
    --
    GnuPG-bug-id: 6432
    Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>

* Add PowerPC vector implementation of SM4 | Jussi Kivilinna | 2023-03-06 | 1 | -0/+7

    * cipher/Makefile.am: Add 'sm4-ppc.c'.
    * cipher/sm4-ppc.c: New.
    * cipher/sm4.c (USE_PPC_CRYPTO): New.
    (SM4_context): Add 'use_ppc8le' and 'use_ppc9le'.
    [USE_PPC_CRYPTO] (_gcry_sm4_ppc8le_crypt_blk1_16)
    (_gcry_sm4_ppc9le_crypt_blk1_16, sm4_ppc8le_crypt_blk1_16)
    (sm4_ppc9le_crypt_blk1_16): New.
    (sm4_setkey) [USE_PPC_CRYPTO]: Set use_ppc8le and use_ppc9le
    based on HW features.
    (sm4_get_crypt_blk1_16_fn) [USE_PPC_CRYPTO]: Add PowerPC
    implementation selection.
    --

    Benchmark on POWER9:

    Before:
    SM4          |  nanosecs/byte   mebibytes/sec   cycles/byte
    ECB enc      |     14.47 ns/B     65.89 MiB/s     33.29 c/B
    ECB dec      |     14.47 ns/B     65.89 MiB/s     33.29 c/B
    CBC enc      |     35.09 ns/B     27.18 MiB/s     80.71 c/B
    CBC dec      |     16.69 ns/B     57.13 MiB/s     38.39 c/B
    CFB enc      |     35.09 ns/B     27.18 MiB/s     80.71 c/B
    CFB dec      |     16.76 ns/B     56.90 MiB/s     38.55 c/B
    CTR enc      |     16.88 ns/B     56.50 MiB/s     38.82 c/B
    CTR dec      |     16.88 ns/B     56.50 MiB/s     38.82 c/B

    After (ECB ~4.4x faster):
    SM4          |  nanosecs/byte   mebibytes/sec   cycles/byte
    ECB enc      |      3.26 ns/B     292.3 MiB/s      7.50 c/B
    ECB dec      |      3.26 ns/B     292.3 MiB/s      7.50 c/B
    CBC enc      |     35.10 ns/B     27.17 MiB/s     80.72 c/B
    CBC dec      |      3.33 ns/B     286.3 MiB/s      7.66 c/B
    CFB enc      |     35.10 ns/B     27.17 MiB/s     80.74 c/B
    CFB dec      |      3.36 ns/B     283.8 MiB/s      7.73 c/B
    CTR enc      |      3.47 ns/B     275.0 MiB/s      7.98 c/B
    CTR dec      |      3.47 ns/B     275.0 MiB/s      7.98 c/B

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

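The SM4 PowerPC entry above describes `sm4_setkey` recording hardware support in the context and `sm4_get_crypt_blk1_16_fn` choosing an implementation from those flags. A minimal C sketch of that selection pattern follows. Every `demo_`/`DEMO_` name, the feature bits, and the integer tags returned by the placeholder functions are hypothetical; the sketch only illustrates "prefer the newest supported variant, else fall back to generic code" and is not libgcrypt's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits standing in for the library's HW feature flags. */
#define DEMO_HWF_POWER8 (1u << 0)
#define DEMO_HWF_POWER9 (1u << 1)

/* Signature of a bulk routine that handles 1 to 16 blocks. */
typedef int (*demo_blk_fn) (unsigned char *out, const unsigned char *in,
                            size_t nblks);

struct demo_ctx
{
  unsigned int use_ppc8le:1;
  unsigned int use_ppc9le:1;
};

/* Placeholder implementations; each returns a tag identifying itself. */
int demo_generic_blk1_16 (unsigned char *out, const unsigned char *in,
                          size_t nblks)
{ (void)out; (void)in; (void)nblks; return 0; }

int demo_ppc8le_blk1_16 (unsigned char *out, const unsigned char *in,
                         size_t nblks)
{ (void)out; (void)in; (void)nblks; return 8; }

int demo_ppc9le_blk1_16 (unsigned char *out, const unsigned char *in,
                         size_t nblks)
{ (void)out; (void)in; (void)nblks; return 9; }

/* Record which accelerated variants the detected hardware supports. */
void demo_setkey (struct demo_ctx *ctx, unsigned int hwf)
{
  assert (ctx != NULL);
  ctx->use_ppc8le = (hwf & DEMO_HWF_POWER8) != 0;
  ctx->use_ppc9le = (hwf & DEMO_HWF_POWER9) != 0;
}

/* Pick the most specific implementation that the context allows. */
demo_blk_fn demo_get_blk1_16_fn (const struct demo_ctx *ctx)
{
  assert (ctx != NULL);
  if (ctx->use_ppc9le)
    return demo_ppc9le_blk1_16;
  if (ctx->use_ppc8le)
    return demo_ppc8le_blk1_16;
  return demo_generic_blk1_16;
}
```

Resolving the function pointer once per bulk call keeps the per-block loop free of feature checks, which is presumably why the selection is split from the processing.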
* camellia: add AArch64 crypto-extension implementation | Jussi Kivilinna | 2023-02-28 | 1 | -1/+13

    * cipher/Makefile.am: Add 'camellia-aarch64-ce.(c|o|lo)'.
    (aarch64_neon_cflags): New.
    * cipher/camellia-aarch64-ce.c: New.
    * cipher/camellia-glue.c (USE_AARCH64_CE): New.
    (CAMELLIA_context): Add 'use_aarch64ce'.
    (_gcry_camellia_aarch64ce_encrypt_blk16)
    (_gcry_camellia_aarch64ce_decrypt_blk16)
    (_gcry_camellia_aarch64ce_keygen, camellia_aarch64ce_enc_blk16)
    (camellia_aarch64ce_dec_blk16, aarch64ce_burn_stack_depth): New.
    (camellia_setkey) [USE_AARCH64_CE]: Set use_aarch64ce if HW has
    HWF_ARM_AES; Use AArch64/CE key generation if supported by HW.
    (camellia_encrypt_blk1_32, camellia_decrypt_blk1_32)
    [USE_AARCH64_CE]: Add AArch64/CE code path.
    --

    Patch enables 128-bit vector intrinsics implementation of Camellia
    cipher for AArch64.

    Benchmark on AWS Graviton2:

    Before:
    CAMELLIA128  |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |      5.99 ns/B     159.2 MiB/s     14.97 c/B      2500
    ECB dec      |      5.99 ns/B     159.1 MiB/s     14.98 c/B      2500
    CBC enc      |      6.16 ns/B     154.7 MiB/s     15.41 c/B      2500
    CBC dec      |      6.12 ns/B     155.8 MiB/s     15.29 c/B      2499
    CFB enc      |      6.49 ns/B     147.0 MiB/s     16.21 c/B      2500
    CFB dec      |      6.05 ns/B     157.6 MiB/s     15.13 c/B      2500
    CTR enc      |      6.09 ns/B     156.7 MiB/s     15.22 c/B      2500
    CTR dec      |      6.09 ns/B     156.6 MiB/s     15.22 c/B      2500
    XTS enc      |      6.16 ns/B     154.9 MiB/s     15.39 c/B      2500
    XTS dec      |      6.16 ns/B     154.8 MiB/s     15.40 c/B      2499
    GCM enc      |      6.31 ns/B     151.1 MiB/s     15.78 c/B      2500
    GCM dec      |      6.31 ns/B     151.1 MiB/s     15.78 c/B      2500
    GCM auth     |     0.206 ns/B      4635 MiB/s     0.514 c/B      2500
    OCB enc      |      6.63 ns/B     143.9 MiB/s     16.57 c/B      2499
    OCB dec      |      6.63 ns/B     143.9 MiB/s     16.56 c/B      2499
    OCB auth     |      6.55 ns/B     145.7 MiB/s     16.37 c/B      2499

    After (ecb ~2.1x faster):
    CAMELLIA128  |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |      2.77 ns/B     344.2 MiB/s      6.93 c/B      2499
    ECB dec      |      2.76 ns/B     345.3 MiB/s      6.90 c/B      2499
    CBC enc      |      6.17 ns/B     154.7 MiB/s     15.41 c/B      2499
    CBC dec      |      2.89 ns/B     330.3 MiB/s      7.22 c/B      2500
    CFB enc      |      6.48 ns/B     147.1 MiB/s     16.21 c/B      2499
    CFB dec      |      2.84 ns/B     336.1 MiB/s      7.09 c/B      2499
    CTR enc      |      2.90 ns/B     328.8 MiB/s      7.25 c/B      2499
    CTR dec      |      2.90 ns/B     328.9 MiB/s      7.25 c/B      2500
    XTS enc      |      2.93 ns/B     325.3 MiB/s      7.33 c/B      2500
    XTS dec      |      2.92 ns/B     326.2 MiB/s      7.31 c/B      2500
    GCM enc      |      3.10 ns/B     307.2 MiB/s      7.76 c/B      2500
    GCM dec      |      3.10 ns/B     307.2 MiB/s      7.76 c/B      2499
    GCM auth     |     0.206 ns/B      4635 MiB/s     0.514 c/B      2500

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

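The three numeric columns in these benchmark tables appear to be derived from one measurement, so any of them can be cross-checked from the ns/B figure. The helpers below are a small sketch of that arithmetic, assuming "auto Mhz" is the measured clock in MHz; the function names are illustrative and not part of libgcrypt's benchmark tool.

```c
#include <assert.h>

/* cycles/byte = (ns/byte) * (cycles/ns), with cycles/ns = MHz / 1000. */
double demo_cycles_per_byte (double ns_per_byte, double mhz)
{
  assert (mhz > 0.0);
  return ns_per_byte * mhz / 1000.0;
}

/* MiB/s = bytes per second / 2^20, with bytes per second = 1e9 / (ns/byte). */
double demo_mebibytes_per_sec (double ns_per_byte)
{
  assert (ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speedup factor quoted as "Nx faster": time before divided by time after. */
double demo_speedup (double before_ns_per_byte, double after_ns_per_byte)
{
  assert (after_ns_per_byte > 0.0);
  return before_ns_per_byte / after_ns_per_byte;
}
```

For the Camellia ECB row above, 5.99 ns/B at 2500 MHz works out to roughly 14.97 c/B and 159.2 MiB/s, and 5.99 / 2.77 gives a speedup of about 2.16, in line with the "~2.1x" in the commit message.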
* camellia: add POWER8/POWER9 vcrypto implementation | Jussi Kivilinna | 2023-02-28 | 1 | -0/+13

    * cipher/Makefile.am: Add 'camellia-simd128.h',
    'camellia-ppc8le.c' and 'camellia-ppc9le.c'.
    * cipher/camellia-glue.c (USE_PPC_CRYPTO): New.
    (CAMELLIA_context) [USE_PPC_CRYPTO]: Add 'use_ppc', 'use_ppc8'
    and 'use_ppc9'.
    [USE_PPC_CRYPTO] (_gcry_camellia_ppc8_encrypt_blk16)
    (_gcry_camellia_ppc8_decrypt_blk16, _gcry_camellia_ppc8_keygen)
    (_gcry_camellia_ppc9_encrypt_blk16)
    (_gcry_camellia_ppc9_decrypt_blk16, _gcry_camellia_ppc9_keygen)
    (camellia_ppc_enc_blk16, camellia_ppc_dec_blk16)
    (ppc_burn_stack_depth): New.
    (camellia_setkey) [USE_PPC_CRYPTO]: Setup 'use_ppc', 'use_ppc8'
    and 'use_ppc9' and use PPC key-generation if HWF is available.
    (camellia_encrypt_blk1_32)
    (camellia_decrypt_blk1_32) [USE_PPC_CRYPTO]: Add 'use_ppc' paths.
    (_gcry_camellia_ocb_crypt, _gcry_camellia_ocb_auth): Enable
    generic bulk path when USE_PPC_CRYPTO is defined.
    * cipher/camellia-ppc8le.c: New.
    * cipher/camellia-ppc9le.c: New.
    * cipher/camellia-simd128.h: New.
    * configure.ac: Add 'camellia-ppc8le.lo' and 'camellia-ppc9le.lo'.
    --

    Patch adds 128-bit vector intrinsics implementation of Camellia
    cipher and enables implementation for POWER8 and POWER9.

    Benchmark on POWER9:

    Before:
    CAMELLIA128  |  nanosecs/byte   mebibytes/sec   cycles/byte
    ECB enc      |     13.45 ns/B     70.90 MiB/s     30.94 c/B
    ECB dec      |     13.45 ns/B     70.92 MiB/s     30.93 c/B
    CBC enc      |     15.22 ns/B     62.66 MiB/s     35.00 c/B
    CBC dec      |     13.54 ns/B     70.41 MiB/s     31.15 c/B
    CFB enc      |     15.24 ns/B     62.59 MiB/s     35.04 c/B
    CFB dec      |     13.53 ns/B     70.48 MiB/s     31.12 c/B
    CTR enc      |     13.60 ns/B     70.15 MiB/s     31.27 c/B
    CTR dec      |     13.62 ns/B     70.02 MiB/s     31.33 c/B
    XTS enc      |     13.67 ns/B     69.74 MiB/s     31.45 c/B
    XTS dec      |     13.74 ns/B     69.41 MiB/s     31.60 c/B
    GCM enc      |     18.18 ns/B     52.45 MiB/s     41.82 c/B
    GCM dec      |     17.76 ns/B     53.69 MiB/s     40.86 c/B
    GCM auth     |      4.12 ns/B     231.7 MiB/s      9.47 c/B
    OCB enc      |     14.40 ns/B     66.22 MiB/s     33.12 c/B
    OCB dec      |     14.40 ns/B     66.23 MiB/s     33.12 c/B
    OCB auth     |     14.37 ns/B     66.37 MiB/s     33.05 c/B

    After (ECB ~4.1x faster):
    CAMELLIA128  |  nanosecs/byte   mebibytes/sec   cycles/byte
    ECB enc      |      3.25 ns/B     293.7 MiB/s      7.47 c/B
    ECB dec      |      3.25 ns/B     293.4 MiB/s      7.48 c/B
    CBC enc      |     15.22 ns/B     62.68 MiB/s     35.00 c/B
    CBC dec      |      3.36 ns/B     284.1 MiB/s      7.72 c/B
    CFB enc      |     15.25 ns/B     62.55 MiB/s     35.07 c/B
    CFB dec      |      3.36 ns/B     284.0 MiB/s      7.72 c/B
    CTR enc      |      3.47 ns/B     275.1 MiB/s      7.97 c/B
    CTR dec      |      3.47 ns/B     275.1 MiB/s      7.97 c/B
    XTS enc      |      3.54 ns/B     269.0 MiB/s      8.15 c/B
    XTS dec      |      3.54 ns/B     269.6 MiB/s      8.14 c/B
    GCM enc      |      3.69 ns/B     258.2 MiB/s      8.49 c/B
    GCM dec      |      3.69 ns/B     258.2 MiB/s      8.50 c/B
    GCM auth     |     0.226 ns/B      4220 MiB/s     0.520 c/B
    OCB enc      |      3.81 ns/B     250.2 MiB/s      8.77 c/B
    OCB dec      |      4.08 ns/B     233.8 MiB/s      9.38 c/B
    OCB auth     |      3.53 ns/B     270.0 MiB/s      8.12 c/B

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* aria: add x86_64 GFNI/AVX512 accelerated implementation | Jussi Kivilinna | 2023-02-22 | 1 | -0/+1

    * cipher/Makefile.am: Add 'aria-gfni-avx512-amd64.S'.
    * cipher/aria-gfni-avx512-amd64.S: New.
    * cipher/aria.c (USE_GFNI_AVX512): New.
    [USE_GFNI_AVX512] (MAX_PARALLEL_BLKS): New.
    (ARIA_context): Add 'use_gfni_avx512'.
    (_gcry_aria_gfni_avx512_ecb_crypt_blk64)
    (_gcry_aria_gfni_avx512_ctr_crypt_blk64)
    (aria_gfni_avx512_ecb_crypt_blk64)
    (aria_gfni_avx512_ctr_crypt_blk64): New.
    (aria_crypt_blocks) [USE_GFNI_AVX512]: Add 64 parallel block
    AVX512/GFNI processing.
    (_gcry_aria_ctr_enc) [USE_GFNI_AVX512]: Add 64 parallel block
    AVX512/GFNI processing.
    (aria_setkey): Enable GFNI/AVX512 based on HW features.
    * configure.ac: Add 'aria-gfni-avx512-amd64.lo'.
    --

    This patch adds AVX512/GFNI accelerated ARIA block cipher
    implementation for libgcrypt. This implementation is based on
    work by Taehee Yoo, with following notable changes:
     - Integration to libgcrypt, use of 'aes-common-amd64.h'.
     - Use round loop instead of unrolling for smaller code size and
       increased performance.
     - Use stack for temporary storage instead of external buffers.
     - Add byte-addition fast path for CTR.

    ===

    Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

    GFNI/AVX512:
    ARIA128      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |     0.203 ns/B      4703 MiB/s     0.953 c/B      4700
    ECB dec      |     0.204 ns/B      4675 MiB/s     0.959 c/B      4700
    CTR enc      |     0.207 ns/B      4609 MiB/s     0.973 c/B      4700
    CTR dec      |     0.207 ns/B      4608 MiB/s     0.973 c/B      4700

    ===

    Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off):

    GFNI/AVX512:
    ARIA128      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |     0.362 ns/B      2635 MiB/s      1.08 c/B      2992
    ECB dec      |     0.361 ns/B      2639 MiB/s      1.08 c/B      2992
    CTR enc      |     0.362 ns/B      2633 MiB/s      1.08 c/B      2992
    CTR dec      |     0.362 ns/B      2633 MiB/s      1.08 c/B      2992

    [v2]:
     - Add byte-addition fast path for CTR.

    Cc: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* aria: add x86_64 AESNI/GFNI/AVX/AVX2 accelerated implementations | Jussi Kivilinna | 2023-02-22 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'aria-aesni-avx-amd64.S' and 'aria-aesni-avx2-amd64.S'. * cipher/aria-aesni-avx-amd64.S: New. * cipher/aria-aesni-avx2-amd64.S: New. * cipher/aria.c (USE_AESNI_AVX, USE_GFNI_AVX, USE_AESNI_AVX2) (USE_GFNI_AVX2, MAX_PARALLEL_BLKS, ASM_FUNC_ABI, ASM_EXTRA_STACK): New. (ARIA_context): Add 'use_aesni_avx', 'use_gfni_avx', 'use_aesni_avx2' and 'use_gfni_avx2'. (_gcry_aria_aesni_avx_ecb_crypt_blk1_16) (_gcry_aria_aesni_avx_ctr_crypt_blk16) (_gcry_aria_gfni_avx_ecb_crypt_blk1_16) (_gcry_aria_gfni_avx_ctr_crypt_blk16) (aria_avx_ecb_crypt_blk1_16, aria_avx_ctr_crypt_blk16) (_gcry_aria_aesni_avx2_ecb_crypt_blk32) (_gcry_aria_aesni_avx2_ctr_crypt_blk32) (_gcry_aria_gfni_avx2_ecb_crypt_blk32) (_gcry_aria_gfni_avx2_ctr_crypt_blk32) (aria_avx2_ecb_crypt_blk32, aria_avx2_ctr_crypt_blk32): New. (aria_crypt_blocks) [USE_AESNI_AVX2]: Add 32 parallel block AVX2/AESNI/GFNI processing. (aria_crypt_blocks) [USE_AESNI_AVX]: Add 3 to 16 parallel block AVX/AESNI/GFNI processing. (_gcry_aria_ctr_enc) [USE_AESNI_AVX2]: Add 32 parallel block AVX2/AESNI/GFNI processing. (_gcry_aria_ctr_enc) [USE_AESNI_AVX]: Add 16 parallel block AVX/AESNI/GFNI processing. (_gcry_aria_ctr_enc, _gcry_aria_cbc_dec, _gcry_aria_cfb_enc) (_gcry_aria_ecb_crypt, _gcry_aria_xts_crypt, _gcry_aria_ctr32le_enc) (_gcry_aria_ocb_crypt, _gcry_aria_ocb_auth): Use MAX_PARALLEL_BLKS for parallel processing width. (aria_setkey): Enable AESNI/AVX, GFNI/AVX, AESNI/AVX2, GFNI/AVX2 based on HW features. * configure.ac: Add 'aria-aesni-avx-amd64.lo' and 'aria-aesni-avx2-amd64.lo'. 
--- This patch adds AVX/AVX2/AESNI/GFNI accelerated ARIA block cipher implementations for libgcrypt. This implementation is based on work by Taehee Yoo, with following notable changes: - Integration to libgcrypt, use of 'aes-common-amd64.h'. - Use 'vmovddup' for loading GFNI constants. - Use round loop instead of unrolling for smaller code size and increased performance. - Use stack for temporary storage instead of external buffers. - Use merge ECB encryption/decryption to single function. - Add 1 to 15 blocks support for AVX ECB functions. - Add byte-addition fast path for CTR. === Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.715 ns/B 1333 MiB/s 3.36 c/B 4700 ECB dec | 0.712 ns/B 1339 MiB/s 3.35 c/B 4700 CTR enc | 0.714 ns/B 1336 MiB/s 3.36 c/B 4700 CTR dec | 0.714 ns/B 1335 MiB/s 3.36 c/B 4700 GFNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.516 ns/B 1847 MiB/s 2.43 c/B 4700 ECB dec | 0.519 ns/B 1839 MiB/s 2.44 c/B 4700 CTR enc | 0.517 ns/B 1846 MiB/s 2.43 c/B 4700 CTR dec | 0.518 ns/B 1843 MiB/s 2.43 c/B 4700 AESNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.416 ns/B 2292 MiB/s 1.96 c/B 4700 ECB dec | 0.421 ns/B 2266 MiB/s 1.98 c/B 4700 CTR enc | 0.415 ns/B 2298 MiB/s 1.95 c/B 4700 CTR dec | 0.415 ns/B 2300 MiB/s 1.95 c/B 4700 GFNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.235 ns/B 4056 MiB/s 1.11 c/B 4700 ECB dec | 0.234 ns/B 4079 MiB/s 1.10 c/B 4700 CTR enc | 0.232 ns/B 4104 MiB/s 1.09 c/B 4700 CTR dec | 0.233 ns/B 4094 MiB/s 1.10 c/B 4700 === Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.26 ns/B 757.6 MiB/s 3.77 c/B 2993 ECB dec | 1.27 ns/B 753.1 MiB/s 3.79 c/B 2992 CTR enc | 1.25 ns/B 760.3 MiB/s 3.75 c/B 2992 CTR dec | 1.26 ns/B 759.1 MiB/s 3.76 c/B 2992 GFNI/AVX: 
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.967 ns/B 986.6 MiB/s 2.89 c/B 2992 ECB dec | 0.966 ns/B 987.1 MiB/s 2.89 c/B 2992 CTR enc | 0.972 ns/B 980.8 MiB/s 2.91 c/B 2993 CTR dec | 0.971 ns/B 982.5 MiB/s 2.90 c/B 2993 AESNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.817 ns/B 1167 MiB/s 2.44 c/B 2992 ECB dec | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992 CTR enc | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992 CTR dec | 0.819 ns/B 1164 MiB/s 2.45 c/B 2992 GFNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.506 ns/B 1886 MiB/s 1.51 c/B 2992 ECB dec | 0.505 ns/B 1887 MiB/s 1.51 c/B 2992 CTR enc | 0.564 ns/B 1691 MiB/s 1.69 c/B 2992 CTR dec | 0.565 ns/B 1689 MiB/s 1.69 c/B 2992 === Benchmark on AMD Ryzen 7 5800X (zen3, turbo-freq off): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.921 ns/B 1035 MiB/s 3.50 c/B 3800 ECB dec | 0.922 ns/B 1034 MiB/s 3.50 c/B 3800 CTR enc | 0.923 ns/B 1033 MiB/s 3.51 c/B 3800 CTR dec | 0.923 ns/B 1033 MiB/s 3.51 c/B 3800 AESNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.559 ns/B 1707 MiB/s 2.12 c/B 3800 ECB dec | 0.560 ns/B 1703 MiB/s 2.13 c/B 3800 CTR enc | 0.570 ns/B 1672 MiB/s 2.17 c/B 3800 CTR dec | 0.568 ns/B 1679 MiB/s 2.16 c/B 3800 === Benchmark on AMD EPYC 7642 (zen2): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 1.22 ns/B 784.5 MiB/s 4.01 c/B 3298 ECB dec | 1.22 ns/B 784.8 MiB/s 4.00 c/B 3292 CTR enc | 1.22 ns/B 780.1 MiB/s 4.03 c/B 3299 CTR dec | 1.22 ns/B 779.1 MiB/s 4.04 c/B 3299 AESNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.735 ns/B 1298 MiB/s 2.42 c/B 3299 ECB dec | 0.738 ns/B 1292 MiB/s 2.44 c/B 3299 CTR enc | 0.732 ns/B 1303 MiB/s 2.41 c/B 3299 CTR dec | 0.732 ns/B 1303 MiB/s 2.41 c/B 3299 === Benchmark on Intel Core i5-6500 (skylake): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 
1.24 ns/B 766.6 MiB/s 4.48 c/B 3598 ECB dec | 1.25 ns/B 764.9 MiB/s 4.49 c/B 3598 CTR enc | 1.25 ns/B 761.7 MiB/s 4.50 c/B 3598 CTR dec | 1.25 ns/B 761.6 MiB/s 4.51 c/B 3598 AESNI/AVX2: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 0.829 ns/B 1150 MiB/s 2.98 c/B 3599 ECB dec | 0.831 ns/B 1147 MiB/s 2.99 c/B 3598 CTR enc | 0.829 ns/B 1150 MiB/s 2.98 c/B 3598 CTR dec | 0.828 ns/B 1152 MiB/s 2.98 c/B 3598 === Benchmark on Intel Core i5-2450M (sandy-bridge, turbo-freq off): AESNI/AVX: ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 2.11 ns/B 452.7 MiB/s 5.25 c/B 2494 ECB dec | 2.10 ns/B 454.5 MiB/s 5.23 c/B 2494 CTR enc | 2.10 ns/B 453.2 MiB/s 5.25 c/B 2494 CTR dec | 2.10 ns/B 453.2 MiB/s 5.25 c/B 2494 [v2] - Optimization for CTR mode: Use CTR byte-addition path when counter carry-overflow happen only on ctr-variable but not in generated counter vector registers. Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
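The "[v2]" note above describes taking the CTR byte-addition path when the carry only affects the stored counter variable, not the counter values generated for the current batch. The following C sketch illustrates that idea with plain byte arrays in place of vector registers. It assumes a 128-bit big-endian counter; all `demo_` names are hypothetical and this is not the assembly's actual logic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Add 'val' to a 128-bit big-endian counter, propagating carries. */
void demo_ctr_add (unsigned char ctr[16], unsigned int val)
{
  int i;

  for (i = 15; i >= 0 && val != 0; i--)
    {
      val += ctr[i];
      ctr[i] = (unsigned char)(val & 0xff);
      val >>= 8;
    }
}

/* Write counters ctr, ctr+1, ..., ctr+nblks-1 to 'out' and advance 'ctr'
   by nblks.  */
void demo_ctr_generate (unsigned char out[][16], unsigned char ctr[16],
                        size_t nblks)
{
  size_t i;

  assert (nblks >= 1 && nblks <= 64);

  if ((unsigned int)ctr[15] + (nblks - 1) <= 0xff)
    {
      /* Fast path: none of the generated counters carries out of the
         last byte, so a byte-wise addition on that byte is enough. */
      for (i = 0; i < nblks; i++)
        {
          memcpy (out[i], ctr, 16);
          out[i][15] = (unsigned char)(ctr[15] + i);
        }
    }
  else
    {
      /* Slow path: full-width increment for every generated counter. */
      unsigned char tmp[16];

      memcpy (tmp, ctr, 16);
      for (i = 0; i < nblks; i++)
        {
          memcpy (out[i], tmp, 16);
          demo_ctr_add (tmp, 1);
        }
    }

  /* The stored counter may still carry here even on the fast path. */
  demo_ctr_add (ctr, (unsigned int)nblks);
}
```

With a last byte of 0xfc and four blocks, the generated counters end in 0xfc to 0xff and stay on the fast path, while the stored counter itself wraps to the next higher byte.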
* Add ARIA block cipher | Jussi Kivilinna | 2023-01-06 | 1 | -0/+1

    * cipher/Makefile.am: Add 'aria.c'.
    * cipher/aria.c: New.
    * cipher/cipher.c (cipher_list, cipher_list_algo301): Add ARIA
    cipher specs.
    * cipher/mac-cmac.c (map_mac_algo_to_cipher): Add GCRY_MAC_CMAC_ARIA.
    (_gcry_mac_type_spec_cmac_aria): New.
    * cipher/mac-gmac.c (map_mac_algo_to_cipher): Add GCRY_MAC_GMAC_ARIA.
    (_gcry_mac_type_spec_gmac_aria): New.
    * cipher/mac-internal.h (_gcry_mac_type_spec_cmac_aria)
    (_gcry_mac_type_spec_gmac_aria)
    (_gcry_mac_type_spec_poly1305mac_aria): New.
    * cipher/mac-poly1305.c (poly1305mac_open): Add GCRY_MAC_GMAC_ARIA.
    (_gcry_mac_type_spec_poly1305mac_aria): New.
    * cipher/mac.c (mac_list, mac_list_algo201, mac_list_algo401)
    (mac_list_algo501): Add ARIA MAC specs.
    * configure.ac (available_ciphers): Add 'aria'.
    (GCRYPT_CIPHERS): Add 'aria.lo'.
    (USE_ARIA): New.
    * doc/gcrypt.texi: Add GCRY_CIPHER_ARIA128, GCRY_CIPHER_ARIA192,
    GCRY_CIPHER_ARIA256, GCRY_MAC_CMAC_ARIA, GCRY_MAC_GMAC_ARIA and
    GCRY_MAC_POLY1305_ARIA.
    * src/cipher.h (_gcry_cipher_spec_aria128, _gcry_cipher_spec_aria192)
    (_gcry_cipher_spec_aria256): New.
    * src/gcrypt.h.in (gcry_cipher_algos): Add GCRY_CIPHER_ARIA128,
    GCRY_CIPHER_ARIA192 and GCRY_CIPHER_ARIA256.
    (gcry_mac_algos): GCRY_MAC_CMAC_ARIA, GCRY_MAC_GMAC_ARIA and
    GCRY_MAC_POLY1305_ARIA.
    * tests/basic.c (check_ecb_cipher, check_ctr_cipher)
    (check_cfb_cipher, check_ocb_cipher) [USE_ARIA]: Add ARIA test-vectors.
    (check_ciphers) [USE_ARIA]: Add GCRY_CIPHER_ARIA128,
    GCRY_CIPHER_ARIA192 and GCRY_CIPHER_ARIA256.
    (main): Also run 'check_bulk_cipher_modes' for
    'cipher_modes_only'-mode.
    * tests/bench-slope.c (bench_mac_init): Add GCRY_MAC_POLY1305_ARIA
    setiv-handling.
    * tests/benchmark.c (mac_bench): Likewise.
    --

    This patch adds ARIA block cipher for libgcrypt. This implementation
    is based on work by Taehee Yoo, with following notable changes:
     - Integration to libgcrypt, use of bithelp.h and bufhelp.h helper
       functions where possible.
     - Added lookup table prefetching as is done in AES, GCM and SM4
       implementations.
     - Changed `get_u8` to return `u32` as returning `byte` caused
       sub-optimal code generation with gcc-12/x86-64 (zero extending
       from 8-bit to 32-bit register, followed by extraneous sign
       extending from 32-bit to 64-bit register).
     - Changed 'aria_crypt' loop structure a bit for tiny performance
       increase (~1% seen with gcc-12/x86-64/zen4).

    Benchmark on AMD Ryzen 9 7900X (x86-64):
    ARIA128      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |      3.99 ns/B     239.1 MiB/s     22.43 c/B      5625
    ECB dec      |      4.00 ns/B     238.4 MiB/s     22.50 c/B      5625

    Benchmark on AMD Ryzen 9 7900X (win32):
    ARIA128      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |      4.57 ns/B     208.7 MiB/s     25.31 c/B      5538
    ECB dec      |      4.66 ns/B     204.8 MiB/s     25.39 c/B      5453

    Benchmark on ARM Cortex-A53 (aarch64):
    ARIA128      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    ECB enc      |     74.69 ns/B     12.77 MiB/s     48.40 c/B     647.9
    ECB dec      |     74.99 ns/B     12.72 MiB/s     48.58 c/B     647.9

    Cc: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

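The note about `get_u8` returning `u32` concerns a byte-extraction helper whose result is used as a lookup-table index. A rough C sketch of such a helper is shown below. The name `demo_get_u8` and the convention that index 0 selects the most significant byte are assumptions for illustration; only the choice of a 32-bit return type is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Return byte 'y' of the 32-bit word 'x' as a 32-bit value, where
   y == 0 is assumed to select the most significant byte.  Returning
   uint32_t rather than uint8_t keeps the value in the width in which
   it is later used as an array index. */
uint32_t demo_get_u8 (uint32_t x, unsigned int y)
{
  assert (y < 4);
  return (x >> ((3 - y) * 8)) & 0xff;
}
```

A caller would typically write something like `table[demo_get_u8(word, 2)]`, where `table` is a hypothetical 256-entry lookup table.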
* sha512: add AArch64 crypto/SHA512 extension implementation | Jussi Kivilinna | 2022-07-25 | 1 | -1/+1

    * cipher/Makefile.am: Add 'sha512-armv8-aarch64-ce.S'.
    * cipher/sha512-armv8-aarch64-ce.S: New.
    * cipher/sha512.c (ATTR_ALIGNED_64, USE_ARM64_SHA512): New.
    (k): Make array aligned to 64 bytes.
    [USE_ARM64_SHA512] (_gcry_sha512_transform_armv8_ce): New.
    [USE_ARM64_SHA512] (do_sha512_transform_armv8_ce): New.
    (sha512_init_common) [USE_ARM64_SHA512]: Use ARMv8-SHA512
    accelerated implementation if HW feature available.
    * configure.ac: Add 'sha512-armv8-aarch64-ce.lo'.
    (gcry_cv_gcc_inline_asm_aarch64_sha3_sha512_sm3_sm4)
    (HAVE_GCC_INLINE_ASM_AARCH64_SHA3_SHA512_SM3_SM4): New.
    --

    Benchmark on AWS Graviton3:

    Before:
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    SHA512       |      2.36 ns/B     404.2 MiB/s      6.13 c/B      2600

    After (2.4x faster):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    SHA512       |     0.977 ns/B     976.6 MiB/s      2.54 c/B      2600

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* blake2: add AVX512 accelerated implementations | Jussi Kivilinna | 2022-07-25 | 1 | -1/+2

    * cipher/Makefile.am: Add 'blake2b-amd64-avx512.S' and
    'blake2s-amd64-avx512.S'.
    * cipher/blake2.c (USE_AVX512): New.
    (ASM_FUNC_ABI): Setup attribute if USE_AVX2 or USE_AVX512 enabled in
    addition to USE_AVX.
    (BLAKE2B_CONTEXT_S, BLAKE2S_CONTEXT_S): Add 'use_avx512'.
    (_gcry_blake2b_transform_amd64_avx512)
    (_gcry_blake2s_transform_amd64_avx512): New.
    (blake2b_transform, blake2s_transform) [USE_AVX512]: Add AVX512 path.
    (blake2b_init_ctx, blake2s_init_ctx) [USE_AVX512]: Use AVX512 if HW
    feature available.
    * cipher/blake2b-amd64-avx512.S: New.
    * cipher/blake2s-amd64-avx512.S: New.
    * configure.ac: Add 'blake2b-amd64-avx512.lo' and
    'blake2s-amd64-avx512.lo'.
    --

    Benchmark on Intel Core i3-1115G4 (tigerlake):

    Before (AVX/AVX2 implementations):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    BLAKE2B_512  |     0.841 ns/B      1134 MiB/s      3.44 c/B      4089
    BLAKE2S_256  |      1.29 ns/B     741.2 MiB/s      5.26 c/B      4089

    After (blake2s ~19% faster, blake2b ~25% faster):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    BLAKE2B_512  |     0.705 ns/B      1353 MiB/s      2.88 c/B      4088
    BLAKE2S_256  |      1.02 ns/B     933.3 MiB/s      4.18 c/B      4088

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* sha3: Add x86-64 AVX512 accelerated implementation | Jussi Kivilinna | 2022-07-25 | 1 | -1/+2

    * LICENSES: Add 'cipher/keccak-amd64-avx512.S'.
    * configure.ac: Add 'keccak-amd64-avx512.lo'.
    * cipher/Makefile.am: Add 'keccak-amd64-avx512.S'.
    * cipher/keccak-amd64-avx512.S: New.
    * cipher/keccak.c (USE_64BIT_AVX512, ASM_FUNC_ABI): New.
    [USE_64BIT_AVX512] (_gcry_keccak_f1600_state_permute64_avx512)
    (_gcry_keccak_absorb_blocks_avx512, keccak_f1600_state_permute64_avx512)
    (keccak_absorb_lanes64_avx512, keccak_avx512_64_ops): New.
    (keccak_init) [USE_64BIT_AVX512]: Enable x86-64 AVX512 implementation
    if supported by HW features.
    --

    Benchmark on Intel Core i3-1115G4 (tigerlake):

    Before (BMI2 instructions):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    SHA3-224     |      1.77 ns/B     540.3 MiB/s      7.22 c/B      4088
    SHA3-256     |      1.86 ns/B     514.0 MiB/s      7.59 c/B      4089
    SHA3-384     |      2.43 ns/B     393.1 MiB/s      9.92 c/B      4089
    SHA3-512     |      3.49 ns/B     273.2 MiB/s     14.27 c/B      4088
    SHAKE128     |      1.52 ns/B     629.1 MiB/s      6.20 c/B      4089
    SHAKE256     |      1.86 ns/B     511.6 MiB/s      7.62 c/B      4089

    After (~33% faster):
                 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    SHA3-224     |      1.32 ns/B     721.8 MiB/s      5.40 c/B      4089
    SHA3-256     |      1.40 ns/B     681.7 MiB/s      5.72 c/B      4089
    SHA3-384     |      1.83 ns/B     522.5 MiB/s      7.46 c/B      4089
    SHA3-512     |      2.63 ns/B     362.1 MiB/s     10.77 c/B      4088
    SHAKE128     |      1.13 ns/B     840.4 MiB/s      4.64 c/B      4089
    SHAKE256     |      1.40 ns/B     682.1 MiB/s      5.72 c/B      4089

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* sm4: add amd64 GFNI/AVX512 implementation | Jussi Kivilinna | 2022-07-21 | 1 | -3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'sm4-gfni-avx512-amd64.S'. * cipher/sm4-gfni-avx512-amd64.S: New. * cipher/sm4-gfni.c (USE_GFNI_AVX512): New. (SM4_context): Add 'use_gfni_avx512' and 'crypt_blk1_16'. (_gcry_sm4_gfni_avx512_expand_key, _gcry_sm4_gfni_avx512_ctr_enc) (_gcry_sm4_gfni_avx512_cbc_dec, _gcry_sm4_gfni_avx512_cfb_dec) (_gcry_sm4_gfni_avx512_ocb_enc, _gcry_sm4_gfni_avx512_ocb_dec) (_gcry_sm4_gfni_avx512_ocb_auth, _gcry_sm4_gfni_avx512_ctr_enc_blk32) (_gcry_sm4_gfni_avx512_cbc_dec_blk32) (_gcry_sm4_gfni_avx512_cfb_dec_blk32) (_gcry_sm4_gfni_avx512_ocb_enc_blk32) (_gcry_sm4_gfni_avx512_ocb_dec_blk32) (_gcry_sm4_gfni_avx512_crypt_blk1_16) (_gcry_sm4_gfni_avx512_crypt_blk32, sm4_gfni_avx512_crypt_blk1_16) (sm4_crypt_blk1_32, sm4_encrypt_blk1_32, sm4_decrypt_blk1_32): New. (sm4_expand_key): Add GFNI/AVX512 code-path (sm4_setkey): Use GFNI/AVX512 if supported by CPU; Setup `ctx->crypt_blk1_16`. (sm4_encrypt, sm4_decrypt, sm4_get_crypt_blk1_16_fn, _gcry_sm4_ctr_enc) (_gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt) (_gcry_sm4_ocb_auth) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path. (_gcry_sm4_xts_crypt): Change parallel block size from 16 to 32. * configure.ac: Add 'sm4-gfni-avx512-amd64.lo'. 
-- Benchmark on Intel i3-1115G4 (tigerlake): Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 9.45 ns/B 101.0 MiB/s 38.63 c/B 4089 CBC dec | 0.647 ns/B 1475 MiB/s 2.64 c/B 4089 CFB enc | 9.43 ns/B 101.1 MiB/s 38.57 c/B 4089 CFB dec | 0.648 ns/B 1472 MiB/s 2.65 c/B 4089 CTR enc | 0.661 ns/B 1443 MiB/s 2.70 c/B 4089 CTR dec | 0.661 ns/B 1444 MiB/s 2.70 c/B 4089 XTS enc | 0.767 ns/B 1243 MiB/s 3.14 c/B 4089 XTS dec | 0.772 ns/B 1235 MiB/s 3.16 c/B 4089 OCB enc | 0.671 ns/B 1421 MiB/s 2.74 c/B 4089 OCB dec | 0.676 ns/B 1410 MiB/s 2.77 c/B 4089 OCB auth | 0.668 ns/B 1428 MiB/s 2.73 c/B 4090 After: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 7.80 ns/B 122.2 MiB/s 31.91 c/B 4090 CBC dec | 0.293 ns/B 3258 MiB/s 1.20 c/B 4095±3 CFB enc | 7.80 ns/B 122.2 MiB/s 31.90 c/B 4089 CFB dec | 0.294 ns/B 3247 MiB/s 1.20 c/B 4096±3 CTR enc | 0.306 ns/B 3120 MiB/s 1.25 c/B 4098±4 CTR dec | 0.300 ns/B 3182 MiB/s 1.23 c/B 4103±6 XTS enc | 0.431 ns/B 2211 MiB/s 1.77 c/B 4107±9 XTS dec | 0.431 ns/B 2213 MiB/s 1.77 c/B 4102±6 OCB enc | 0.324 ns/B 2946 MiB/s 1.33 c/B 4096±3 OCB dec | 0.326 ns/B 2923 MiB/s 1.34 c/B 4093±2 OCB auth | 0.536 ns/B 1779 MiB/s 2.19 c/B 4089 CBC/CFB enc: 1.20x faster CBC/CFB dec: 2.20x faster CTR: 2.18x faster XTS: 1.78x faster OCB enc/dec: 2.07x faster OCB auth: 1.24x faster Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SM4 ARMv9 SVE CE assembly implementation | Tianjia Zhang | 2022-07-21 | 1 | -0/+1

    * cipher/Makefile.am: Add 'sm4-armv9-aarch64-sve-ce.S'.
    * cipher/sm4-armv9-aarch64-sve-ce.S: New.
    * cipher/sm4.c (USE_ARM_SVE_CE): New.
    (SM4_context) [USE_ARM_SVE_CE]: Add 'use_arm_sve_ce'.
    (_gcry_sm4_armv9_sve_ce_crypt, _gcry_sm4_armv9_sve_ce_ctr_enc)
    (_gcry_sm4_armv9_sve_ce_cbc_dec, _gcry_sm4_armv9_sve_ce_cfb_dec)
    (sm4_armv9_sve_ce_crypt_blk1_16): New.
    (sm4_setkey): Enable ARMv9 SVE CE if supported by HW.
    (sm4_get_crypt_blk1_16_fn) [USE_ARM_SVE_CE]: Add ARMv9 SVE CE
    bulk functions.
    (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
    [USE_ARM_SVE_CE]: Add ARMv9 SVE CE bulk functions.
    * configure.ac: Add 'sm4-armv9-aarch64-sve-ce.lo'.
    --
    Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>

* cipher: Add buildhelp.h to source to be distributed. | NIIBE Yutaka | 2022-07-19 | 1 | -1/+2

    * cipher/Makefile.am (libcipher_la_SOURCES): Add bulkhelp.h.
    --
    Fixes-commit: 9388279803ff82ea0ccd12a83157b94c807e7a8f
    Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>

* Chacha20/poly1305 - Optimized chacha20/poly1305 for P10 operation | Danny Tsen | 2022-06-12 | 1 | -0/+2

    * configure.ac: Added chacha20 and poly1305 assembly implementations.
    * cipher/chacha20-p10le-8x.s: (New) - support 8 blocks (512 bytes)
    unrolling.
    * cipher/poly1305-p10le.s: (New) - support 4 blocks (128 bytes)
    unrolling.
    * cipher/Makefile.am: Added new chacha20 and poly1305 files.
    * cipher/chacha20.c: Added PPC p10 le support for 8x chacha20.
    * cipher/poly1305.c: Added PPC p10 le support for 4x poly1305.
    * cipher/poly1305-internal.h: Added PPC p10 le support for poly1305.
    ---
    GnuPG-bug-id: 6006
    Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
    [jk: cosmetic changes to C code]
    [jk: fix building on ppc64be]
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

* cipher: move CBC/CFB/CTR self-tests to tests/basic | Jussi Kivilinna | 2022-05-11 | 1 | -1/+0

    * cipher/Makefile.am: Remove 'cipher-selftest.c' and
    'cipher-selftest.h'.
    * cipher/cipher-selftest.c: Remove (refactor these tests to
    tests/basic.c).
    * cipher/cipher-selftest.h: Remove.
    * cipher/blowfish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/camellia-glue.c (selftest_ctr_128, selftest_cbc_128)
    (selftest_cfb_128): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/cast5.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/des.c (bulk_selftest_setkey, selftest_ctr, selftest_cbc)
    (selftest_cfb): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/rijndael.c (selftest_basic_128, selftest_basic_192)
    (selftest_basic_256): Allocate context from stack instead of heap and
    handle alignment manually.
    (selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/serpent.c (selftest_ctr_128, selftest_cbc_128)
    (selftest_cfb_128): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/sm4.c (selftest_ctr_128, selftest_cbc_128)
    (selftest_cfb_128): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * cipher/twofish.c (selftest_ctr, selftest_cbc, selftest_cfb): Remove.
    (selftest): Remove CTR/CBC/CFB bulk self-tests.
    * tests/basic.c (buf_xor, cipher_cbc_bulk_test, buf_xor_2dst)
    (cipher_cfb_bulk_test, cipher_ctr_bulk_test): New.
    (check_ciphers): Run cipher_cbc_bulk_test(), cipher_cfb_bulk_test() and
    cipher_ctr_bulk_test() for block ciphers.
    ---

    CBC/CFB/CTR bulk self-tests are quite computationally heavy and
    slow down use cases where application opens cipher context once,
    does processing and exits. Better place for these tests is in
    `tests/basic`.

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

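The rijndael.c entry above mentions allocating the self-test context on the stack and handling alignment manually. One common way to do that in C is to reserve a slightly oversized byte buffer and round the pointer up to the required boundary. The sketch below shows the technique under assumed names (`demo_context`, `demo_align_up`, `DEMO_CTX_ALIGN`) and an assumed 16-byte alignment; it is not the code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CTX_ALIGN 16  /* hypothetical alignment requirement */

/* Stand-in for a cipher context that wants aligned storage. */
struct demo_context
{
  unsigned char keyschedule[240];
  int rounds;
};

/* Round 'buf' up to the next multiple of 'align' (a power of two).
   The caller must provide at least 'align - 1' spare bytes in the
   buffer so that the aligned object still fits. */
void *demo_align_up (void *buf, size_t align)
{
  uintptr_t p = (uintptr_t)buf;

  assert (align != 0 && (align & (align - 1)) == 0);
  return (void *)((p + align - 1) & ~(uintptr_t)(align - 1));
}
```

A caller would declare `unsigned char buf[sizeof(struct demo_context) + DEMO_CTX_ALIGN - 1];` on the stack and use `demo_align_up(buf, DEMO_CTX_ALIGN)` as the context pointer, which avoids a heap allocation for a short-lived self-test.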
* camellia: add amd64 GFNI/AVX512 implementation | Jussi Kivilinna | 2022-05-11 | 1 | -1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'camellia-gfni-avx512-amd64.S'. * cipher/bulkhelp.h (bulk_ocb_prepare_L_pointers_array_blk64): New. * cipher/camellia-aesni-avx2-amd64.h: Rename internal functions from "__camellia_???" to "FUNC_NAME(???)"; Minor changes to comments. * cipher/camellia-gfni-avx512-amd64.S: New. * cipher/camellia-gfni.c (USE_GFNI_AVX512): New. (CAMELLIA_context): Add 'use_gfni_avx512'. (_gcry_camellia_gfni_avx512_ctr_enc, _gcry_camellia_gfni_avx512_cbc_dec) (_gcry_camellia_gfni_avx512_cfb_dec, _gcry_camellia_gfni_avx512_ocb_enc) (_gcry_camellia_gfni_avx512_ocb_dec) (_gcry_camellia_gfni_avx512_enc_blk64) (_gcry_camellia_gfni_avx512_dec_blk64, avx512_burn_stack_depth): New. (camellia_setkey): Use GFNI/AVX512 if supported by CPU. (camellia_encrypt_blk1_64, camellia_decrypt_blk1_64): New. (_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec, _gcry_camellia_cfb_dec) (_gcry_camellia_ocb_crypt) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path. (_gcry_camellia_xts_crypt): Change parallel block size from 32 to 64. (selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Increase test block size. * cipher/chacha20-amd64-avx512.S: Clear k-mask registers with xor. * cipher/poly1305-amd64-avx512.S: Likewise. * cipher/sha512-avx512-amd64.S: Likewise. 
--- Benchmark on Intel i3-1115G4 (tigerlake): Before (GFNI/AVX2): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.356 ns/B 2679 MiB/s 1.46 c/B 4089 CFB dec | 0.374 ns/B 2547 MiB/s 1.53 c/B 4089 CTR enc | 0.409 ns/B 2332 MiB/s 1.67 c/B 4089 CTR dec | 0.406 ns/B 2347 MiB/s 1.66 c/B 4089 XTS enc | 0.430 ns/B 2216 MiB/s 1.76 c/B 4090 XTS dec | 0.433 ns/B 2201 MiB/s 1.77 c/B 4090 OCB enc | 0.460 ns/B 2071 MiB/s 1.88 c/B 4089 OCB dec | 0.492 ns/B 1939 MiB/s 2.01 c/B 4089 After (GFNI/AVX512): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.207 ns/B 4600 MiB/s 0.827 c/B 3989 CFB dec | 0.207 ns/B 4610 MiB/s 0.825 c/B 3989 CTR enc | 0.218 ns/B 4382 MiB/s 0.868 c/B 3990 CTR dec | 0.217 ns/B 4389 MiB/s 0.867 c/B 3990 XTS enc | 0.330 ns/B 2886 MiB/s 1.35 c/B 4097±4 XTS dec | 0.328 ns/B 2904 MiB/s 1.35 c/B 4097±3 OCB enc | 0.246 ns/B 3879 MiB/s 0.981 c/B 3990 OCB dec | 0.247 ns/B 3855 MiB/s 0.987 c/B 3990 CBC dec: 70% faster CFB dec: 80% faster CTR: 87% faster XTS: 31% faster OCB: 92% faster Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SM4 x86-64/GFNI/AVX2 implementationJussi Kivilinna2022-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'sm4-gfni-avx2-amd64.S'. * cipher/sm4-aesni-avx2-amd64.S: New. * cipher/sm4.c (USE_GFNI_AVX2): New. (SM4_context): Add 'use_gfni_avx2'. (crypt_blk1_8_fn_t): Rename to... (crypt_blk1_16_fn_t): ...this. (sm4_aesni_avx_crypt_blk1_8): Rename to... (sm4_aesni_avx_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (_gcry_sm4_gfni_avx_expand_key, _gcry_sm4_gfni_avx2_ctr_enc) (_gcry_sm4_gfni_avx2_cbc_dec, _gcry_sm4_gfni_avx2_cfb_dec) (_gcry_sm4_gfni_avx2_ocb_enc, _gcry_sm4_gfni_avx2_ocb_dec) (_gcry_sm4_gfni_avx2_ocb_auth, _gcry_sm4_gfni_avx2_crypt_blk1_16) (sm4_gfni_avx2_crypt_blk1_16): New. (sm4_aarch64_crypt_blk1_8): Rename to... (sm4_aarch64_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (sm4_armv8_ce_crypt_blk1_8): Rename to... (sm4_armv8_ce_crypt_blk1_16): ...this and add handling for 9 to 16 input blocks. (sm4_expand_key): Add GFNI/AVX2 path. (sm4_setkey): Enable GFNI/AVX2 implementation if HW features available; Disable AESNI implementations when GFNI implementation is enabled. (sm4_encrypt) [USE_GFNI_AVX2]: New. (sm4_decrypt) [USE_GFNI_AVX2]: New. (sm4_get_crypt_blk1_8_fn): Rename to... (sm4_get_crypt_blk1_16_fn): ...this; Update to use *_blk1_16 functions; Add GFNI/AVX2 selection. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Add GFNI/AVX2 path; Widen generic bulk processing from 8 blocks to 16 blocks. (_gcry_sm4_xts_crypt): Widen generic bulk processing from 8 blocks to 16 blocks. 
-- Benchmark on Intel i3-1115G4 (tigerlake): Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 10.34 ns/B 92.21 MiB/s 42.29 c/B 4089 ECB dec | 10.34 ns/B 92.24 MiB/s 42.29 c/B 4090 CBC enc | 11.06 ns/B 86.26 MiB/s 45.21 c/B 4090 CBC dec | 1.13 ns/B 844.8 MiB/s 4.62 c/B 4090 CFB enc | 11.06 ns/B 86.27 MiB/s 45.22 c/B 4090 CFB dec | 1.13 ns/B 846.0 MiB/s 4.61 c/B 4090 CTR enc | 1.14 ns/B 834.3 MiB/s 4.67 c/B 4089 CTR dec | 1.14 ns/B 834.5 MiB/s 4.67 c/B 4089 XTS enc | 1.93 ns/B 494.1 MiB/s 7.89 c/B 4090 XTS dec | 1.94 ns/B 492.5 MiB/s 7.92 c/B 4090 OCB enc | 1.16 ns/B 823.3 MiB/s 4.74 c/B 4090 OCB dec | 1.16 ns/B 818.8 MiB/s 4.76 c/B 4089 OCB auth | 1.15 ns/B 831.0 MiB/s 4.69 c/B 4089 After: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 8.39 ns/B 113.6 MiB/s 34.33 c/B 4090 ECB dec | 8.40 ns/B 113.5 MiB/s 34.35 c/B 4090 CBC enc | 9.45 ns/B 101.0 MiB/s 38.63 c/B 4089 CBC dec | 0.650 ns/B 1468 MiB/s 2.66 c/B 4090 CFB enc | 9.44 ns/B 101.1 MiB/s 38.59 c/B 4090 CFB dec | 0.660 ns/B 1444 MiB/s 2.70 c/B 4090 CTR enc | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090 CTR dec | 0.664 ns/B 1437 MiB/s 2.71 c/B 4090 XTS enc | 0.756 ns/B 1262 MiB/s 3.09 c/B 4090 XTS dec | 0.757 ns/B 1260 MiB/s 3.10 c/B 4090 OCB enc | 0.673 ns/B 1417 MiB/s 2.75 c/B 4090 OCB dec | 0.675 ns/B 1413 MiB/s 2.76 c/B 4090 OCB auth | 0.672 ns/B 1418 MiB/s 2.75 c/B 4090 ECB: 1.2x faster CBC-enc / CFB-enc: 1.17x faster CBC-dec / CFB-dec / CTR / OCB: 1.7x faster XTS: 2.5x faster Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
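The effect of widening bulk processing from 8 to 16 blocks can be pictured as a change in how a long input is split into calls to a "1 to N blocks" routine. The sketch below is an assumption-laden illustration, not the code in cipher/sm4.c: `split_into_bulk_calls` is a hypothetical helper that only counts how many blocks each call would receive.

```python
def split_into_bulk_calls(nblocks: int, max_blocks: int = 16) -> list[int]:
    """Return the block count handed to each bulk call for an input of nblocks."""
    calls = []
    while nblocks > 0:
        cur = min(nblocks, max_blocks)  # a blk1_N routine takes 1..N blocks
        calls.append(cur)
        nblocks -= cur
    return calls


# Hypothetical 37-block input: five calls at width 8, three calls at width 16.
old_calls = split_into_bulk_calls(37, max_blocks=8)
new_calls = split_into_bulk_calls(37, max_blocks=16)
print(old_calls, new_calls)
```

Fewer, wider calls give a vectorized implementation more independent blocks to work on at once, which fits the larger gains reported for the parallelizable modes (CBC-dec, CFB-dec, CTR, OCB, XTS) than for CBC-enc and CFB-enc.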
* Add GFNI/AVX2 implementation of CamelliaJussi Kivilinna2022-04-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add "camellia-gfni-avx2-amd64.S". * cipher/camellia-aesni-avx2-amd64.h [CAMELLIA_GFNI_BUILD]: Add GFNI support. * cipher/camellia-gfni-avx2-amd64.S: New. * cipher/camellia-glue.c (USE_GFNI_AVX2): New. (CAMELLIA_context) [USE_AESNI_AVX2]: New member "use_gfni_avx2". [USE_GFNI_AVX2] (_gcry_camellia_gfni_avx2_ctr_enc) (_gcry_camellia_gfni_avx2_cbc_dec, _gcry_camellia_gfni_avx2_cfb_dec) (_gcry_camellia_gfni_avx2_ocb_enc, _gcry_camellia_gfni_avx2_ocb_dec) (_gcry_camellia_gfni_avx2_ocb_auth): New. (camellia_setkey) [USE_GFNI_AVX2]: Enable GFNI if supported by HW. (_gcry_camellia_ctr_enc) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_cbc_dec) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_cfb_dec) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_ocb_crypt) [USE_GFNI_AVX2]: Add GFNI support. (_gcry_camellia_ocb_auth) [USE_GFNI_AVX2]: Add GFNI support. * configure.ac: Add "camellia-gfni-avx2-amd64.lo". -- Benchmark on Intel Core i3-1115G4 (tigerlake): Before (VAES/AVX2 implementation): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.579 ns/B 1646 MiB/s 2.37 c/B 4090 CFB dec | 0.579 ns/B 1648 MiB/s 2.37 c/B 4089 CTR enc | 0.586 ns/B 1628 MiB/s 2.40 c/B 4090 CTR dec | 0.587 ns/B 1626 MiB/s 2.40 c/B 4090 OCB enc | 0.607 ns/B 1570 MiB/s 2.48 c/B 4089 OCB dec | 0.611 ns/B 1561 MiB/s 2.50 c/B 4089 OCB auth | 0.602 ns/B 1585 MiB/s 2.46 c/B 4089 After (~80% faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.299 ns/B 3186 MiB/s 1.22 c/B 4090 CFB dec | 0.314 ns/B 3039 MiB/s 1.28 c/B 4089 CTR enc | 0.322 ns/B 2962 MiB/s 1.32 c/B 4090 CTR dec | 0.321 ns/B 2970 MiB/s 1.31 c/B 4090 OCB enc | 0.339 ns/B 2817 MiB/s 1.38 c/B 4089 OCB dec | 0.346 ns/B 2756 MiB/s 1.41 c/B 4089 OCB auth | 0.337 ns/B 2831 MiB/s 1.38 c/B 4089 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* chacha20: add AVX512 implementationJussi Kivilinna2022-04-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'chacha20-amd64-avx512.S'. * cipher/chacha20-amd64-avx512.S: New. * cipher/chacha20.c (USE_AVX512): New. (CHACHA20_context_s): Add 'use_avx512'. [USE_AVX512] (_gcry_chacha20_amd64_avx512_blocks16): New. (chacha20_do_setkey) [USE_AVX512]: Setup 'use_avx512' based on HW features. (do_chacha20_encrypt_stream_tail) [USE_AVX512]: Use AVX512 implementation if supported. (_gcry_chacha20_poly1305_encrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used. (_gcry_chacha20_poly1305_decrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used. -- Benchmark on Intel Core i3-1115G4 (tigerlake): Before: | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.276 ns/B 3451 MiB/s 1.13 c/B 4090 STREAM dec | 0.284 ns/B 3359 MiB/s 1.16 c/B 4090 POLY1305 enc | 0.411 ns/B 2320 MiB/s 1.68 c/B 4098±3 POLY1305 dec | 0.408 ns/B 2338 MiB/s 1.67 c/B 4091±1 POLY1305 auth | 0.060 ns/B 15785 MiB/s 0.247 c/B 4090±1 After (stream 1.7x faster, poly1305-aead 1.8x faster): | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.162 ns/B 5869 MiB/s 0.665 c/B 4092±1 STREAM dec | 0.162 ns/B 5884 MiB/s 0.664 c/B 4096±3 POLY1305 enc | 0.221 ns/B 4306 MiB/s 0.907 c/B 4097±3 POLY1305 dec | 0.220 ns/B 4342 MiB/s 0.900 c/B 4096±3 POLY1305 auth | 0.060 ns/B 15797 MiB/s 0.247 c/B 4085±2 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* poly1305: add AVX512 implementationJussi Kivilinna2022-04-061-1/+1
* LICENSES: Add 3-clause BSD license for poly1305-amd64-avx512.S.
* cipher/Makefile.am: Add 'poly1305-amd64-avx512.S'.
* cipher/poly1305-amd64-avx512.S: New.
* cipher/poly1305-internal.h (POLY1305_USE_AVX512): New.
(poly1305_context_s): Add 'use_avx512'.
* cipher/poly1305.c (ASM_FUNC_ABI, ASM_FUNC_WRAPPER_ATTR): New.
[POLY1305_USE_AVX512] (_gcry_poly1305_amd64_avx512_blocks)
(poly1305_amd64_avx512_blocks): New.
(poly1305_init): Use AVX512 if HW feature available (set use_avx512).
[USE_MPI_64BIT] (poly1305_blocks): Rename to ...
[USE_MPI_64BIT] (poly1305_blocks_generic): ... this.
[USE_MPI_64BIT] (poly1305_blocks): New.
--

Patch adds AMD64 AVX512-FMA52 implementation for Poly1305.

Benchmark on Intel Core i3-1115G4 (tigerlake):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 POLY1305       |     0.306 ns/B      3117 MiB/s      1.25 c/B      4090

After (5.0x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 POLY1305       |     0.061 ns/B     15699 MiB/s     0.249 c/B    4095±3

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SM3 ARMv8/AArch64/CE assembly implementationTianjia Zhang2022-04-041-1/+1
* cipher/Makefile.am: Add 'sm3-armv8-aarch64-ce.S'.
* cipher/sm3-armv8-aarch64-ce.S: New.
* cipher/sm3.c (USE_ARM_CE): New.
[USE_ARM_CE] (_gcry_sm3_transform_armv8_ce)
(do_sm3_transform_armv8_ce): New.
(sm3_init) [USE_ARM_CE]: New.
* configure.ac: Add 'sm3-armv8-aarch64-ce.lo'.
--

Benchmark on T-Head Yitian-710 2.75 GHz:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      2.84 ns/B     335.3 MiB/s      7.82 c/B      2749

After (~55% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      1.84 ns/B     518.1 MiB/s      5.06 c/B      2749

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
* build: Fix for build for Windows.NIIBE Yutaka2022-03-281-4/+4
* cipher/Makefile.am: Use EXEEXT_FOR_BUILD.
* doc/Makefile.am: Likewise.
--

Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
* SHA512: Add AVX512 implementationJussi Kivilinna2022-03-101-1/+1
* LICENSES: Add 'cipher/sha512-avx512-amd64.S'.
* cipher/Makefile.am: Add 'sha512-avx512-amd64.S'.
* cipher/sha512-avx512-amd64.S: New.
* cipher/sha512.c (USE_AVX512): New.
(do_sha512_transform_amd64_ssse3, do_sha512_transform_amd64_avx)
(do_sha512_transform_amd64_avx2): Add ASM_EXTRA_STACK to return value
only if assembly routine returned non-zero value.
[USE_AVX512] (_gcry_sha512_transform_amd64_avx512)
(do_sha512_transform_amd64_avx512): New.
(sha512_init_common) [USE_AVX512]: Use AVX512 implementation if HW
feature supported.
---

Benchmark on Intel Core i3-1115G4 (tigerlake):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SHA512         |      1.51 ns/B     631.6 MiB/s      6.17 c/B      4089

After (~29% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SHA512         |      1.16 ns/B     819.0 MiB/s      4.76 c/B      4090

GnuPG-bug-id: T4460
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
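The three measurement columns in these tables are tied together by simple unit arithmetic. The sketch below shows the relationship using the SHA512 "Before" row. The helper names `mib_per_sec` and `cycles_per_byte` are hypothetical, and it assumes a mebibyte is 2^20 bytes and that the "auto Mhz" column is the clock used for the cycles figure. Because the printed ns/B values are rounded, results agree with the table only to within rounding.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_per_sec(ns_per_byte: float) -> float:
    """Throughput in MiB/s for a given time per byte in nanoseconds."""
    return 1e9 / ns_per_byte / MIB


def cycles_per_byte(ns_per_byte: float, mhz: float) -> float:
    """CPU cycles spent per byte at the given clock frequency in MHz."""
    return ns_per_byte * mhz / 1000.0


# SHA512 "Before" row: 1.51 ns/B at 4089 MHz.
throughput = mib_per_sec(1.51)
cycles = cycles_per_byte(1.51, 4089)
print(f"{throughput:.1f} MiB/s, {cycles:.2f} c/B")
```

Cycles per byte is the most useful column when comparing rows measured at different clock speeds, since it factors the frequency out.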
* Add SM4 ARMv8/AArch64/CE assembly implementationTianjia Zhang2022-03-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'sm4-armv8-aarch64-ce.S'. * cipher/sm4-armv8-aarch64-ce.S: New. * cipher/sm4.c (USE_ARM_CE): New. (SM4_context) [USE_ARM_CE]: Add 'use_arm_ce'. [USE_ARM_CE] (_gcry_sm4_armv8_ce_expand_key) (_gcry_sm4_armv8_ce_crypt, _gcry_sm4_armv8_ce_ctr_enc) (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) (_gcry_sm4_armv8_ce_crypt_blk1_8, sm4_armv8_ce_crypt_blk1_8): New. (sm4_expand_key) [USE_ARM_CE]: Use ARMv8/AArch64/CE key setup. (sm4_setkey): Enable ARMv8/AArch64/CE if supported by HW. (sm4_encrypt) [USE_ARM_CE]: Use SM4 CE encryption. (sm4_decrypt) [USE_ARM_CE]: Use SM4 CE decryption. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_ARM_CE]: Add ARMv8/AArch64/CE bulk functions. * configure.ac: Add 'sm4-armv8-aarch64-ce.lo'. -- This patch adds ARMv8/AArch64/CE bulk encryption/decryption. Bulk functions process eight blocks in parallel. 
Benchmark on T-Head Yitian-710 2.75 GHz: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 12.10 ns/B 78.79 MiB/s 33.28 c/B 2750 CBC dec | 4.63 ns/B 205.9 MiB/s 12.74 c/B 2749 CFB enc | 12.14 ns/B 78.58 MiB/s 33.37 c/B 2750 CFB dec | 4.64 ns/B 205.5 MiB/s 12.76 c/B 2750 CTR enc | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 CTR dec | 4.69 ns/B 203.3 MiB/s 12.90 c/B 2750 GCM enc | 4.88 ns/B 195.4 MiB/s 13.42 c/B 2750 GCM dec | 4.88 ns/B 195.5 MiB/s 13.42 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.520 c/B 2750 OCB enc | 4.86 ns/B 196.0 MiB/s 13.38 c/B 2750 OCB dec | 4.90 ns/B 194.7 MiB/s 13.47 c/B 2750 OCB auth | 4.79 ns/B 199.0 MiB/s 13.18 c/B 2750 After (10x - 19x faster than ARMv8/AArch64 impl): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 1.25 ns/B 762.7 MiB/s 3.44 c/B 2749 CBC dec | 0.243 ns/B 3927 MiB/s 0.668 c/B 2750 CFB enc | 1.25 ns/B 763.1 MiB/s 3.44 c/B 2750 CFB dec | 0.245 ns/B 3899 MiB/s 0.673 c/B 2750 CTR enc | 0.298 ns/B 3199 MiB/s 0.820 c/B 2750 CTR dec | 0.298 ns/B 3198 MiB/s 0.820 c/B 2750 GCM enc | 0.487 ns/B 1957 MiB/s 1.34 c/B 2749 GCM dec | 0.487 ns/B 1959 MiB/s 1.34 c/B 2750 GCM auth | 0.189 ns/B 5048 MiB/s 0.519 c/B 2750 OCB enc | 0.443 ns/B 2150 MiB/s 1.22 c/B 2749 OCB dec | 0.486 ns/B 1964 MiB/s 1.34 c/B 2750 OCB auth | 0.369 ns/B 2585 MiB/s 1.01 c/B 2749 Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
* powerpc: check for missing optimization level for vector register usageJussi Kivilinna2022-02-241-1/+1
* cipher/Makefile.am [ENABLE_PPC_VCRYPTO_EXTRA_CFLAGS]
(ppc_vcrypto_cflags): Add '-O2'.
* configure.ac (gcry_cv_cc_ppc_altivec): Check for missing compiler
optimization with vec_sld_u32 inline function.
* configure.ac (gcry_cv_cc_ppc_altivec_cflags): Check for missing
compiler optimization with vec_sld_u32 inline function; Add '-O2' to
CFLAGS.
--

Attempt to enable optimization for PPC vector register implementations
if PPC altivec check does not pass otherwise.

GnuPG-bug-id: T5785
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SM4 ARMv8/AArch64 assembly implementationTianjia Zhang2022-02-231-1/+1
* cipher/Makefile.am: Add 'sm4-aarch64.S'.
* cipher/sm4-aarch64.S: New.
* cipher/sm4.c (USE_AARCH64_SIMD): New.
(SM4_context) [USE_AARCH64_SIMD]: Add 'use_aarch64_simd'.
[USE_AARCH64_SIMD] (_gcry_sm4_aarch64_crypt)
(_gcry_sm4_aarch64_ctr_enc, _gcry_sm4_aarch64_cbc_dec)
(_gcry_sm4_aarch64_cfb_dec, _gcry_sm4_aarch64_crypt_blk1_8)
(sm4_aarch64_crypt_blk1_8): New.
(sm4_setkey): Enable ARMv8/AArch64 if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AARCH64_SIMD]: Add
ARMv8/AArch64 bulk functions.
* configure.ac: Add 'sm4-aarch64.lo'.
--

This patch adds ARMv8/AArch64 bulk encryption/decryption. Bulk
functions process eight blocks in parallel.

Benchmark on T-Head Yitian-710 2.75 GHz:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |     12.10 ns/B     78.81 MiB/s     33.28 c/B      2750
        CBC dec |      7.19 ns/B     132.6 MiB/s     19.77 c/B      2750
        CFB enc |     12.14 ns/B     78.58 MiB/s     33.37 c/B      2750
        CFB dec |      7.24 ns/B     131.8 MiB/s     19.90 c/B      2750
        CTR enc |      7.24 ns/B     131.7 MiB/s     19.90 c/B      2750
        CTR dec |      7.24 ns/B     131.7 MiB/s     19.91 c/B      2750
        GCM enc |      9.49 ns/B     100.4 MiB/s     26.11 c/B      2750
        GCM dec |      9.49 ns/B     100.5 MiB/s     26.10 c/B      2750
       GCM auth |      2.25 ns/B     423.1 MiB/s      6.20 c/B      2750
        OCB enc |      7.35 ns/B     129.8 MiB/s     20.20 c/B      2750
        OCB dec |      7.36 ns/B     129.6 MiB/s     20.23 c/B      2750
       OCB auth |      7.29 ns/B     130.8 MiB/s     20.04 c/B      2749

After (~55% faster):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |     12.10 ns/B     78.79 MiB/s     33.28 c/B      2750
        CBC dec |      4.63 ns/B     205.9 MiB/s     12.74 c/B      2749
        CFB enc |     12.14 ns/B     78.58 MiB/s     33.37 c/B      2750
        CFB dec |      4.64 ns/B     205.5 MiB/s     12.76 c/B      2750
        CTR enc |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        CTR dec |      4.69 ns/B     203.3 MiB/s     12.90 c/B      2750
        GCM enc |      4.88 ns/B     195.4 MiB/s     13.42 c/B      2750
        GCM dec |      4.88 ns/B     195.5 MiB/s     13.42 c/B      2750
       GCM auth |     0.189 ns/B      5048 MiB/s     0.520 c/B      2750
        OCB enc |      4.86 ns/B     196.0 MiB/s     13.38 c/B      2750
        OCB dec |      4.90 ns/B     194.7 MiB/s     13.47 c/B      2750
       OCB auth |      4.79 ns/B     199.0 MiB/s     13.18 c/B      2750

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
* Add SM3 ARM/AArch64 assembly implementationJussi Kivilinna2022-01-111-1/+1
* cipher/Makefile.am: Add 'sm3-aarch64.S'.
* cipher/sm3-aarch64.S: New.
* cipher/sm3.c (USE_AARCH64_SIMD): New.
[USE_AARCH64_SIMD] (_gcry_sm3_transform_aarch64)
(do_sm3_transform_aarch64): New.
(sm3_init) [USE_AARCH64_SIMD]: New.
* configure.ac: Add 'sm3-aarch64.lo'.
* tests/basic.c (main): Add command-line option '--hash' for running
only hash algorithm tests.
--

Benchmark on AWS Graviton2:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      4.24 ns/B     224.8 MiB/s     10.61 c/B      2500

After (~34% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      3.15 ns/B     302.4 MiB/s      7.88 c/B      2500

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* AES-GCM: Bulk implementation of AES-GCM acceleration for ppc64leDanny Tsen2021-12-211-0/+7
* configure.ac: Added p10 assembly implementation file and associated
file.
* cipher/Makefile.am: Added p10 assembly implementation file and
associated file.
* cipher/rijndael.c: Added p10 function.
* cipher/rijndael-p10le.c: New wrapper file for AES-GCM call.
* cipher/rijndael-gcm-p10le.s: New implementation of AES-GCM bulk
function in Power Assembly.
* src/g10lib.h: Added Power arch 3.1 definition for p10.
* src/hwf-ppc.c: Added Power arch 3.1 definition for p10.
* src/hwfeatures.c: Added Power arch 3.1 definition for p10.
--

GnuPG-bug-id: 5700
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
[jk: fixes for C coding style]
[jk: prefix assembly functions with '_gcry_ppc10']
[jk: add assert check for gcm_table size]
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SM3 x86-64 AVX/BMI2 assembly implementationJussi Kivilinna2021-12-141-1/+1
* cipher/Makefile.am: Add 'sm3-avx-bmi2-amd64.S'.
* cipher/sm3-avx-bmi2-amd64.S: New.
* cipher/sm3.c (USE_AVX_BMI2, ASM_FUNC_ABI, ASM_EXTRA_STACK): New.
(SM3_CONTEXT): Define 'h' as array instead of separate fields 'h1',
'h2', etc.
[USE_AVX_BMI2] (_gcry_sm3_transform_amd64_avx_bmi2)
(do_sm3_transform_amd64_avx_bmi2): New.
(sm3_init): Select AVX/BMI2 transform function if supported by HW;
Update to use 'hd->h' as array.
(transform_blk, sm3_final): Update to use 'hd->h' as array.
* configure.ac: Add 'sm3-avx-bmi2-amd64.lo'.
--

Benchmark on AMD Zen3:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      2.18 ns/B     436.6 MiB/s     10.59 c/B      4850

After (~43% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      1.52 ns/B     627.4 MiB/s      7.37 c/B      4850

Benchmark on Intel Skylake:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      4.35 ns/B     219.2 MiB/s     13.48 c/B      3098

After (~34% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      3.24 ns/B     294.4 MiB/s     10.04 c/B      3098

Benchmark on AMD Zen2:

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      2.73 ns/B     348.9 MiB/s     11.86 c/B      4339

After (~38% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 SM3            |      1.97 ns/B     483.0 MiB/s      8.52 c/B      4318

Reviewed-and-tested-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* build: cipher/Makefile.am, doc/Makefile.am: add a missing spaceAlexander Kanavin2021-12-071-1/+1
* cipher/Makefile.am: Add a space.
* doc/Makefile.am: Ditto.
--

Signed-off-by: Alexander Kanavin <alex.kanavin@gmail.com>
* Do not build poly1305-s390x.S on foreign architecturesJussi Kivilinna2021-11-181-1/+1
* configure.ac [host=s390x-*-*]: Add 'poly1305-s390x.lo'.
* cipher/Makefile.am: Move 'poly1305-s390x.S' to
'EXTRA_libcipher_la_SOURCES'.
--

GnuPG-bug-id: 5694
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add AES-GCM-SIV mode (RFC 8452)Jussi Kivilinna2021-08-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'cipher-gcm-siv.c'. * cipher/cipher-gcm-siv.c: New. * cipher/cipher-gcm.c (_gcry_cipher_gcm_setupM): New. * cipher/cipher-internal.h (gcry_cipher_handle): Add 'siv_keylen'. (_gcry_cipher_gcm_setupM, _gcry_cipher_gcm_siv_encrypt) (_gcry_cipher_gcm_siv_decrypt, _gcry_cipher_gcm_siv_set_nonce) (_gcry_cipher_gcm_siv_authenticate) (_gcry_cipher_gcm_siv_set_decryption_tag) (_gcry_cipher_gcm_siv_get_tag, _gcry_cipher_gcm_siv_check_tag) (_gcry_cipher_gcm_siv_setkey): New prototypes. (cipher_block_bswap): New helper function. * cipher/cipher.c (_gcry_cipher_open_internal): Add 'GCRY_CIPHER_MODE_GCM_SIV'; Refactor mode requirement checks for better size optimization (check pointers & blocksize in same order for all). (cipher_setkey, cipher_reset, _gcry_cipher_setup_mode_ops) (_gcry_cipher_setup_mode_ops, _gcry_cipher_info): Add GCM-SIV. (_gcry_cipher_ctl): Handle 'set decryption tag' for GCM-SIV. * doc/gcrypt.texi: Add GCM-SIV. * src/gcrypt.h.in (GCRY_CIPHER_MODE_GCM_SIV): New. (GCRY_SIV_BLOCK_LEN, gcry_cipher_set_decryption_tag): Add to comment that these are also for GCM-SIV in addition to SIV mode. * tests/basic.c (check_gcm_siv_cipher): New. (check_cipher_modes): Check for GCM-SIV. * tests/bench-slope.c (bench_gcm_siv_encrypt_do_bench) (bench_gcm_siv_decrypt_do_bench, bench_gcm_siv_authenticate_do_bench) (gcm_siv_encrypt_ops, gcm_siv_decrypt_ops) (gcm_siv_authenticate_ops): New. (cipher_modes): Add GCM-SIV. (cipher_bench_one): Check key length requirement for GCM-SIV. -- GnuPG-bug-id: T4485 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add SIV mode (RFC 5297)Jussi Kivilinna2021-08-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'cipher-siv.c'. * cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Rename to _gcry_cipher_ctr_encrypt_ctx and add algo context parameter. (_gcry_cipher_ctr_encrypt): New using _gcry_cipher_ctr_encrypt_ctx. * cipher/cipher-internal.h (gcry_cipher_handle): Add 'u_mode.siv'. (_gcry_cipher_ctr_encrypt_ctx, _gcry_cipher_siv_encrypt) (_gcry_cipher_siv_decrypt, _gcry_cipher_siv_set_nonce) (_gcry_cipher_siv_authenticate, _gcry_cipher_siv_set_decryption_tag) (_gcry_cipher_siv_get_tag, _gcry_cipher_siv_check_tag) (_gcry_cipher_siv_setkey): New. * cipher/cipher-siv.c: New. * cipher/cipher.c (_gcry_cipher_open_internal, cipher_setkey) (cipher_reset, _gcry_cipher_setup_mode_ops, _gcry_cipher_info): Add GCRY_CIPHER_MODE_SIV handling. (_gcry_cipher_ctl): Add GCRYCTL_SET_DECRYPTION_TAG handling. * doc/gcrypt.texi: Add documentation for SIV mode. * src/gcrypt.h.in (GCRYCTL_SET_DECRYPTION_TAG): New. (GCRY_CIPHER_MODE_SIV): New. (gcry_cipher_set_decryption_tag): New. * tests/basic.c (check_siv_cipher): New. (check_cipher_modes): Add call for 'check_siv_cipher'. * tests/bench-slope.c (bench_encrypt_init): Use double size key for SIV mode. (bench_aead_encrypt_do_bench, bench_aead_decrypt_do_bench) (bench_aead_authenticate_do_bench): Reset cipher context on each run. (bench_aead_authenticate_do_bench): Support nonce-less operation. (bench_siv_encrypt_do_bench, bench_siv_decrypt_do_bench) (bench_siv_authenticate_do_bench, siv_encrypt_ops) (siv_decrypt_ops, siv_authenticate_ops): New. (cipher_modes): Add SIV mode benchmarks. (cipher_bench_one): Restrict SIV mode testing to 16 byte block-size. -- GnuPG-bug-id: T4486 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Compile arch specific GCM implementations only on target archJussi Kivilinna2021-03-071-3/+3
* cipher/Makefile.am: Move arch specific 'cipher-gcm-*.[cS]' files from
libcipher_la_SOURCES to EXTRA_libcipher_la_SOURCES.
* configure.ac: Add 'cipher-gcm-intel-pclmul.lo' and
'cipher-gcm-arm*.lo'.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* VPMSUMD acceleration for GCM mode on PPCShawn Landden2021-03-071-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'cipher-gcm-ppc.c'. * cipher/cipher-gcm-ppc.c: New. * cipher/cipher-gcm.c [GCM_USE_PPC_VPMSUM] (_gcry_ghash_setup_ppc_vpmsum) (_gcry_ghash_ppc_vpmsum, ghash_setup_ppc_vpsum, ghash_ppc_vpmsum): New. (setupM) [GCM_USE_PPC_VPMSUM]: Select ppc-vpmsum implementation if HW feature "ppc-vcrypto" is available. * cipher/cipher-internal.h (GCM_USE_PPC_VPMSUM): New. (gcry_cipher_handle): Move 'ghash_fn' at end of 'gcm' block to align 'gcm_table' to 16 bytes. * configure.ac: Add 'cipher-gcm-ppc.lo'. * tests/basic.c (_check_gcm_cipher): New AES256 test vector. * AUTHORS: Add 'CRYPTOGAMS'. * LICENSES: Add original license to 3-clause-BSD section. -- https://dev.gnupg.org/D501: 10-20X speed. However this Power 9 machine is faster than the last Power 9 benchmarks on the optimized versions, so while better than the last patch, it is not all due to the code. Before: GCM enc | 4.23 ns/B 225.3 MiB/s - c/B GCM dec | 3.58 ns/B 266.2 MiB/s - c/B GCM auth | 3.34 ns/B 285.3 MiB/s - c/B After: GCM enc | 0.370 ns/B 2578 MiB/s - c/B GCM dec | 0.371 ns/B 2571 MiB/s - c/B GCM auth | 0.159 ns/B 6003 MiB/s - c/B Signed-off-by: Shawn Landden <shawn@git.icu> [jk: coding style fixes, Makefile.am integration, patch from Differential to git, commit changelog, fixed few compiler warnings] GnuPG-bug-id: 5040 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* rijndael: add x86_64 VAES/AVX2 accelerated implementationJussi Kivilinna2021-02-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'rijndael-vaes.c' and 'rijndael-vaes-avx2-amd64.S'. * cipher/rijndael-internal.h (USE_VAES): New. * cipher/rijndael-vaes-avx2-amd64.S: New. * cipher/rijndael-vaes.c: New. * cipher/rijndael.c (_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_cbc_dec) (_gcry_aes_vaes_ctr_enc, _gcry_aes_vaes_ocb_crypt) (_gcry_aes_vaes_xts_crypt): New. (do_setkey) [USE_VAES]: Add detection for VAES. (selftest_ctr_128, selftest_cbc_128, selftest_cfb_128) [USE_VAES]: Increase number of selftest blocks. * configure.ac: Add 'rijndael-vaes.lo' and 'rijndael-vaes-avx2-amd64.lo'. -- Patch adds VAES/AVX2 accelerated implementation for CBC-decryption, CFB-decryption, CTR-encryption, OCB-en/decryption and XTS-en/decryption. Benchmarks on AMD Ryzen 5800X: Before: AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.067 ns/B 14314 MiB/s 0.323 c/B 4850 CFB dec | 0.067 ns/B 14322 MiB/s 0.323 c/B 4850 CTR enc | 0.066 ns/B 14429 MiB/s 0.321 c/B 4850 CTR dec | 0.066 ns/B 14433 MiB/s 0.320 c/B 4850 XTS enc | 0.087 ns/B 10910 MiB/s 0.424 c/B 4850 XTS dec | 0.088 ns/B 10856 MiB/s 0.426 c/B 4850 OCB enc | 0.070 ns/B 13633 MiB/s 0.339 c/B 4850 OCB dec | 0.069 ns/B 13911 MiB/s 0.332 c/B 4850 After (XTS ~1.7x faster, others ~1.9x faster): AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.034 ns/B 28159 MiB/s 0.164 c/B 4850 CFB dec | 0.034 ns/B 27955 MiB/s 0.165 c/B 4850 CTR enc | 0.034 ns/B 28214 MiB/s 0.164 c/B 4850 CTR dec | 0.034 ns/B 28146 MiB/s 0.164 c/B 4850 XTS enc | 0.051 ns/B 18539 MiB/s 0.249 c/B 4850 XTS dec | 0.051 ns/B 18655 MiB/s 0.248 c/B 4850 GCM auth | 0.088 ns/B 10817 MiB/s 0.428 c/B 4850 OCB enc | 0.037 ns/B 25824 MiB/s 0.179 c/B 4850 OCB dec | 0.038 ns/B 25359 MiB/s 0.182 c/B 4850 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* camellia: add x86_64 VAES/AVX2 accelerated implementationJussi Kivilinna2021-02-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'camellia-aesni-avx2-amd64.h' and 'camellia-vaes-avx2-amd64.S'. * cipher/camellia-aesni-avx2-amd64.S: New, old content moved to... * cipher/camellia-aesni-avx2-amd64.h: ...here. (IF_AESNI, IF_VAES, FUNC_NAME): New. * cipher/camellia-vaes-avx2-amd64.S: New. * cipher/camellia-glue.c (USE_VAES_AVX2): New. (CAMELLIA_context): New member 'use_vaes_avx2'. (_gcry_camellia_vaes_avx2_ctr_enc, _gcry_camellia_vaes_avx2_cbc_dec) (_gcry_camellia_vaes_avx2_cfb_dec, _gcry_camellia_vaes_avx2_ocb_enc) (_gcry_camellia_vaes_avx2_ocb_dec) (_gcry_camellia_vaes_avx2_ocb_auth): New. (camellia_setkey): Check for HWF_INTEL_VAES. (_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec) (_gcry_camellia_cfb_dec, _gcry_camellia_ocb_crypt) (_gcry_camellia_ocb_auth): Add USE_VAES_AVX2 code. * configure.ac: Add 'camellia-vaes-avx2-amd64.lo'. -- Camellia AES-NI/AVX2 implementation had to split 256-bit vector to 128-bit parts for AES processing, but now we can use those 256-bit registers directly with VAES. Benchmarks on AMD Ryzen 5800X: Before (AES-NI/AVX2): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.539 ns/B 1769 MiB/s 2.62 c/B 4852 CFB dec | 0.528 ns/B 1806 MiB/s 2.56 c/B 4852±1 CTR enc | 0.552 ns/B 1728 MiB/s 2.68 c/B 4850 OCB enc | 0.550 ns/B 1734 MiB/s 2.65 c/B 4825 OCB dec | 0.577 ns/B 1653 MiB/s 2.78 c/B 4825 OCB auth | 0.546 ns/B 1747 MiB/s 2.63 c/B 4825 After (VAES/AVX2, CBC-dec ~13%, CFB-dec/CTR/OCB ~20% faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.477 ns/B 1999 MiB/s 2.31 c/B 4850 CFB dec | 0.433 ns/B 2201 MiB/s 2.10 c/B 4850 CTR enc | 0.438 ns/B 2176 MiB/s 2.13 c/B 4851 OCB enc | 0.449 ns/B 2122 MiB/s 2.18 c/B 4850 OCB dec | 0.468 ns/B 2038 MiB/s 2.27 c/B 4850 OCB auth | 0.447 ns/B 2131 MiB/s 2.17 c/B 4850 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add handling for -Og with O-flag mungingJussi Kivilinna2021-02-031-1/+1
* cipher/Makefile.am (o_flag_munging): Add handling for '-Og'.
* random/Makefile.am (o_flag_munging): Add handling for '-Og'.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
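For readers unfamiliar with the term, "O-flag munging" refers to rewriting the optimization flag on a compile command line before it is run. The sketch below is a rough Python model, not the rule from Makefile.am. It assumes the munging lowers any level above -O1 (including -Os, -Og, -Oz and -Ofast) to -O1. The function name `munge_o_flags` and the regular expression are hypothetical; the actual sed expression is not shown in this log and may differ.

```python
import re

# Hypothetical stand-in for the sed-based o_flag_munging rule: match a
# whole -O<level> token whose level is 2-9, s, g, z or "fast".
_HIGH_O_FLAG = re.compile(r"(?<!\S)-O(?:[2-9sgz]|fast)(?!\S)")


def munge_o_flags(cflags: str) -> str:
    """Lower higher optimization levels to -O1; leave other flags untouched."""
    return _HIGH_O_FLAG.sub("-O1", cflags)


print(munge_o_flags("gcc -g -Og -Wall -c tiger.c"))
```

Under these assumptions, -O0 and -O1 pass through unchanged, so the rule only ever lowers the optimization level and never raises it.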
* Merge remote-tracking branch 'origin/cipher-s390x-optimizations' into masterJussi Kivilinna2021-01-191-1/+6
|\ | | | | | | | | | | -- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| * Add s390x/zSeries implementation of Poly1305cipher-s390x-optimizationsJussi Kivilinna2020-12-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'poly1305-s390x.S' and 'asm-poly1305-s390x.h'. * cipher/asm-poly1305-s390x.h: New * cipher/chacha20-s390x.S (_gcry_chacha20_poly1305_s390x_vx_blocks8) (_gcry_chacha20_poly1305_s390x_vx_blocks4_2_1): New, stitched chacha20-poly1305 implementation. * cipher/chacha20.c (USE_S390X_VX_POLY1305): New. (_gcry_chacha20_poly1305_s390x_vx_blocks8) (_gcry_chacha20_poly1305_s390x_vx_blocks4_2_1): New prototypes. (_gcry_chacha20_poly1305_encrypt, _gcry_chacha20_poly1305_decrypt): Add s390x/VX stitched chacha20-poly1305 code-path. * cipher/poly1305-s390x.S: New. * cipher/poly1305.c (USE_S390X_ASM, HAVE_ASM_POLY1305_BLOCKS): New. [USE_S390X_ASM] (_gcry_poly1305_s390x_blocks1, poly1305_blocks): New. * configure.ac (gcry_cv_gcc_inline_asm_s390x): Check for 'risbgn' and 'algrk' instructions. * tests/basic.c (_check_poly1305_cipher): Add large chacha20-poly1305 test vector. -- Patch adds Poly1305 and stitched ChaCha20-Poly1305 implementation for zSeries. Stitched implementation interleaves ChaCha20 and Poly1305 processing for higher instruction level parallelism and better utilization of execution units. Benchmark on z15 (4504 Mhz): Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte POLY1305 enc | 1.16 ns/B 823.2 MiB/s 5.22 c/B POLY1305 dec | 1.16 ns/B 823.2 MiB/s 5.22 c/B POLY1305 auth | 0.736 ns/B 1295 MiB/s 3.32 c/B After (chacha20-poly1305 ~71% faster, poly1305 ~29% faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte POLY1305 enc | 0.677 ns/B 1409 MiB/s 3.05 c/B POLY1305 dec | 0.655 ns/B 1456 MiB/s 2.95 c/B POLY1305 auth | 0.569 ns/B 1675 MiB/s 2.56 c/B GnuPG-bug-id: 5202 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| * Add s390x/zSeries implementation of ChaCha20Jussi Kivilinna2020-12-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * cipher/Makefile.am: Add 'asm-common-s390x.h' and 'chacha20-s390x.S'. * cipher/asm-common-s390x.h: New. * cipher/chacha20-s390x.S: New. * cipher/chacha20.c (USE_S390X_VX): New. (CHACHA20_context_t): Change 'use_*' bit-field to unsigned type; Add 'use_s390x'. (_gcry_chacha20_s390x_vx_blocks8) (_gcry_chacha20_s390x_vx_blocks4_2_1): New. (chacha20_do_setkey): Add HW feature detect for s390x/VX. (chacha20_blocks, do_chacha20_encrypt_stream_tail): Add s390x/VX code-path. * configure.ac: Add 'chacha20-s390x.lo'. -- Patch adds VX vector instruction set accelerated ChaCha20 implementation for zSeries. Benchmark on z15 (4504 Mhz): Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 2.62 ns/B 364.0 MiB/s 11.80 c/B STREAM dec | 2.62 ns/B 363.8 MiB/s 11.81 c/B After (~5x faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 0.505 ns/B 1888 MiB/s 2.28 c/B STREAM dec | 0.506 ns/B 1887 MiB/s 2.28 c/B GnuPG-bug-id: 5201 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| * Add bulk AES-GCM acceleration for s390x/zSeries  Jussi Kivilinna  2020-12-18  1  -0/+1
    * cipher/Makefile.am: Add 'asm-inline-s390x.h'.
    * cipher/asm-inline-s390x.h: New.
    * cipher/cipher-gcm.c [GCM_USE_S390X_CRYPTO] (ghash_s390x_kimd): New.
    (setupM) [GCM_USE_S390X_CRYPTO]: Add setup for s390x GHASH function.
    * cipher/cipher-internal.h (GCM_USE_S390X_CRYPTO): New.
    * cipher/rijndael-s390x.c (u128_t, km_functions_e): Move to
    'asm-inline-s390x.h'.
    (aes_s390x_gcm_crypt): New.
    (_gcry_aes_s390x_setup_acceleration): Use 'km_function_to_mask'; Add
    setup for GCM bulk function.
    --

    This patch adds zSeries acceleration for GHASH and AES-GCM.

    Benchmarks (z15, 5.2Ghz):

    Before:
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    GCM enc  |     2.64 ns/B    361.6 MiB/s    13.71 c/B
    GCM dec  |     2.64 ns/B    361.3 MiB/s    13.72 c/B
    GCM auth |     2.58 ns/B    370.1 MiB/s    13.40 c/B

    After:
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    GCM enc  |    0.059 ns/B    16066 MiB/s    0.309 c/B
    GCM dec  |    0.059 ns/B    16114 MiB/s    0.308 c/B
    GCM auth |    0.057 ns/B    16747 MiB/s    0.296 c/B

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
| * Add s390x/zSeries acceleration for AES  Jussi Kivilinna  2020-12-18  1  -0/+1
    * configure.ac: Add 'rijndael-s390x.lo'.
    * cipher/Makefile.am: Add 'rijndael-s390x.c'.
    * cipher/rijndael-internal.c (USE_S390X_CRYPTO): New.
    (RIJNDAEL_context_s) [USE_S390X_CRYPTO]: New 'km*_func' members.
    * cipher/rijndael-s390x.c: New.
    * cipher/rijndael.c (_gcry_aes_s390x_setup_acceleration)
    (_gcry_aes_s390x_setup_setkey)
    (_gcry_aes_s390x_setup_prepare_decryption, _gcry_aes_s390x_encrypt)
    (_gcry_aes_s390x_decrypt): New.
    (do_setkey) [USE_S390X_CRYPTO]: Add s390x acceleration setup.
    --

    Patch adds acceleration for single-block AES and the following modes:
     - CBC, CBC-MAC, CFB, OFB, CTR, XTS and OCB

    Benchmarks (z15, 5.2Ghz):

    Before:
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    ECB enc  |     3.81 ns/B    250.2 MiB/s    19.82 c/B
    ECB dec  |     4.13 ns/B    231.1 MiB/s    21.46 c/B
    CBC enc  |     3.69 ns/B    258.5 MiB/s    19.19 c/B
    CBC dec  |     3.71 ns/B    257.1 MiB/s    19.29 c/B
    CFB enc  |     3.69 ns/B    258.7 MiB/s    19.17 c/B
    CFB dec  |     3.56 ns/B    267.8 MiB/s    18.52 c/B
    OFB enc  |     3.85 ns/B    247.8 MiB/s    20.01 c/B
    OFB dec  |     3.85 ns/B    247.9 MiB/s    20.01 c/B
    CTR enc  |     3.65 ns/B    261.6 MiB/s    18.96 c/B
    CTR dec  |     3.64 ns/B    261.6 MiB/s    18.95 c/B
    XTS enc  |     3.66 ns/B    260.8 MiB/s    19.02 c/B
    XTS dec  |     3.75 ns/B    254.2 MiB/s    19.51 c/B
    CCM enc  |     7.34 ns/B    129.9 MiB/s    38.19 c/B
    CCM dec  |     7.34 ns/B    129.9 MiB/s    38.19 c/B
    CCM auth |     3.70 ns/B    257.6 MiB/s    19.25 c/B
    EAX enc  |     7.34 ns/B    129.8 MiB/s    38.19 c/B
    EAX dec  |     7.35 ns/B    129.8 MiB/s    38.20 c/B
    EAX auth |     3.70 ns/B    257.8 MiB/s    19.24 c/B
    GCM enc  |     6.22 ns/B    153.3 MiB/s    32.36 c/B
    GCM dec  |     6.23 ns/B    153.0 MiB/s    32.42 c/B
    GCM auth |     2.59 ns/B    368.9 MiB/s    13.44 c/B
    OCB enc  |     3.82 ns/B    249.7 MiB/s    19.86 c/B
    OCB dec  |     3.90 ns/B    244.2 MiB/s    20.31 c/B
    OCB auth |     3.88 ns/B    245.5 MiB/s    20.20 c/B

    After:
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    ECB enc  |     2.10 ns/B    453.1 MiB/s    10.94 c/B
    ECB dec  |     2.11 ns/B    453.0 MiB/s    10.95 c/B
    CBC enc  |    0.182 ns/B     5240 MiB/s    0.946 c/B
    CBC dec  |    0.044 ns/B    21581 MiB/s    0.230 c/B
    CFB enc  |    0.206 ns/B     4623 MiB/s     1.07 c/B
    CFB dec  |    0.140 ns/B     6826 MiB/s    0.727 c/B
    OFB enc  |    0.183 ns/B     5222 MiB/s    0.950 c/B
    OFB dec  |    0.182 ns/B     5252 MiB/s    0.944 c/B
    CTR enc  |    0.059 ns/B    16095 MiB/s    0.308 c/B
    CTR dec  |    0.059 ns/B    16045 MiB/s    0.309 c/B
    XTS enc  |    0.043 ns/B    21998 MiB/s    0.225 c/B
    XTS dec  |    0.043 ns/B    22012 MiB/s    0.225 c/B
    CCM enc  |    0.239 ns/B     3989 MiB/s     1.24 c/B
    CCM dec  |    0.239 ns/B     3987 MiB/s     1.24 c/B
    CCM auth |    0.180 ns/B     5288 MiB/s    0.938 c/B
    EAX enc  |    0.242 ns/B     3940 MiB/s     1.26 c/B
    EAX dec  |    0.243 ns/B     3926 MiB/s     1.26 c/B
    EAX auth |    0.183 ns/B     5218 MiB/s    0.950 c/B
    GCM enc  |     2.64 ns/B    361.6 MiB/s    13.71 c/B
    GCM dec  |     2.64 ns/B    361.3 MiB/s    13.72 c/B
    GCM auth |     2.58 ns/B    370.1 MiB/s    13.40 c/B
    OCB enc  |    0.186 ns/B     5132 MiB/s    0.966 c/B
    OCB dec  |    0.176 ns/B     5414 MiB/s    0.916 c/B
    OCB auth |    0.149 ns/B     6394 MiB/s    0.776 c/B

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* | Merge hmac-tests.c into mac-hmac.c.  NIIBE Yutaka  2020-12-21  1  -1/+1
|/
    * cipher/Makefile.am (EXTRA_DIST): Remove hmac-tests.c.
    * cipher/hmac-tests.c: Remove, merge into...
    * cipher/mac-hmac.c: ... here.

    Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
* Reorganize self-tests for HMAC.  NIIBE Yutaka  2020-12-18  1  -2/+1
    * cipher/Makefile.am: Prepare merge of hmac-tests.c into mac-hmac.c.
    * cipher/hmac-tests.c: Ifdef-out run_selftests and
    _gcry_hmac_selftest.
    * cipher/mac-internal.h: Include cipher-proto.h for selftest.
    (gcry_mac_spec_ops): Add selftest field.
    * cipher/mac-hmac.c: Include hmac-tests.c for migration.
    (hmac_selftest): New.
    (hmac_ops): Add hmac_selftest.
    * cipher/gost28147.c, cipher/mac-cmac.c: Add new field for selftest.
    * cipher/mac-gmac.c, cipher/mac-poly1305.c: Likewise.
    * cipher/mac.c (_gcry_mac_selftest): New.
    * src/fips.c (run_mac_selftests): Rename from run_hmac_selftests.
    Use GCRY_MAC_HMAC_*, and call _gcry_mac_selftest.
    (_gcry_fips_run_selftests): Use run_mac_selftests.

    Signed-off-by: NIIBE Yutaka <gniibe@fsij.org>
* Add SM4 x86-64/AES-NI/AVX2 implementation  Jussi Kivilinna  2020-06-20  1  -1/+1
    * cipher/Makefile.am: Add 'sm4-aesni-avx2-amd64.S'.
    * cipher/sm4-aesni-avx2-amd64.S: New.
    * cipher/sm4.c (USE_AESNI_AVX2): New.
    (SM4_context) [USE_AESNI_AVX2]: Add 'use_aesni_avx2'.
    [USE_AESNI_AVX2] (_gcry_sm4_aesni_avx2_ctr_enc)
    (_gcry_sm4_aesni_avx2_cbc_dec, _gcry_sm4_aesni_avx2_cfb_dec)
    (_gcry_sm4_aesni_avx2_ocb_enc, _gcry_sm4_aesni_avx2_ocb_dec)
    (_gcry_sm4_aesni_avx_ocb_auth): New.
    (sm4_setkey): Enable AES-NI/AVX2 if supported by HW.
    (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
    (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX2]: Add
    AES-NI/AVX2 bulk functions.
    * configure.ac: Add 'sm4-aesni-avx2-amd64.lo'.
    --

    This patch adds x86-64/AES-NI/AVX2 bulk encryption/decryption. Bulk
    functions process 16 blocks in parallel.

    Benchmark on AMD Ryzen 7 3700X:

    Before:
    SM4      | nanosecs/byte  mebibytes/sec  cycles/byte  auto Mhz
    CBC enc  |     8.98 ns/B    106.2 MiB/s    38.62 c/B      4300
    CBC dec  |     1.55 ns/B    613.7 MiB/s     6.64 c/B      4275
    CFB enc  |     8.96 ns/B    106.4 MiB/s    38.52 c/B      4300
    CFB dec  |     1.54 ns/B    617.4 MiB/s     6.60 c/B      4275
    CTR enc  |     1.57 ns/B    607.8 MiB/s     6.75 c/B      4300
    CTR dec  |     1.57 ns/B    608.9 MiB/s     6.74 c/B      4300
    OCB enc  |     1.58 ns/B    603.8 MiB/s     6.75 c/B      4275
    OCB dec  |     1.57 ns/B    605.7 MiB/s     6.73 c/B      4275
    OCB auth |     1.53 ns/B    624.5 MiB/s     6.57 c/B      4300

    After (~56% faster than AES-NI/AVX impl.):
    SM4      | nanosecs/byte  mebibytes/sec  cycles/byte  auto Mhz
    CBC enc  |     8.93 ns/B    106.8 MiB/s    38.61 c/B      4326
    CBC dec  |    0.984 ns/B    969.5 MiB/s     4.23 c/B      4300
    CFB enc  |     8.93 ns/B    106.8 MiB/s    38.62 c/B      4325
    CFB dec  |    0.983 ns/B    970.3 MiB/s     4.23 c/B      4300
    CTR enc  |    0.998 ns/B    955.1 MiB/s     4.29 c/B      4300
    CTR dec  |    0.996 ns/B    957.4 MiB/s     4.28 c/B      4300
    OCB enc  |     1.00 ns/B    951.8 MiB/s     4.31 c/B      4300
    OCB dec  |     1.00 ns/B    951.8 MiB/s     4.31 c/B      4300
    OCB auth |    0.993 ns/B    960.2 MiB/s     4.28 c/B    4304±2

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
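The pattern this commit describes (a context flag set at key setup when the hardware supports a feature, then a bulk routine that consumes 16 blocks per call, with a generic path for the remainder) can be sketched as follows. This is a toy illustration and not libgcrypt code: every name is hypothetical, the per-block transform is a placeholder XOR rather than SM4, and the "hardware feature" is simply a flag passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE  16  /* SM4 block size in bytes */
#define BULK_BLOCKS 16  /* blocks per bulk call, as stated in the commit */

/* Hypothetical context; 'use_bulk16' plays the role of a flag
   such as 'use_aesni_avx2'.  The counters exist only for the demo. */
typedef struct
{
  unsigned int use_bulk16:1;
  size_t bulk_calls;
  size_t single_calls;
} toy_context;

/* Key setup: record whether the accelerated path may be used. */
static void
toy_setkey (toy_context *ctx, int hw_has_feature)
{
  assert (ctx != NULL);
  ctx->use_bulk16 = hw_has_feature ? 1 : 0;
  ctx->bulk_calls = 0;
  ctx->single_calls = 0;
}

/* Placeholder single-block transform (NOT a real cipher). */
static void
toy_crypt_block (unsigned char *dst, const unsigned char *src)
{
  size_t i;
  for (i = 0; i < BLOCK_SIZE; i++)
    dst[i] = src[i] ^ 0x5a;
}

/* Stand-in for a vectorized routine handling 16 blocks at once. */
static void
toy_crypt_bulk16 (unsigned char *dst, const unsigned char *src)
{
  size_t b;
  for (b = 0; b < BULK_BLOCKS; b++)
    toy_crypt_block (dst + b * BLOCK_SIZE, src + b * BLOCK_SIZE);
}

/* Dispatch: full groups of 16 go to the bulk path when enabled,
   any remaining blocks go through the generic path. */
static void
toy_crypt (toy_context *ctx, unsigned char *dst,
           const unsigned char *src, size_t nblocks)
{
  if (ctx->use_bulk16)
    {
      while (nblocks >= BULK_BLOCKS)
        {
          toy_crypt_bulk16 (dst, src);
          ctx->bulk_calls++;
          dst += BULK_BLOCKS * BLOCK_SIZE;
          src += BULK_BLOCKS * BLOCK_SIZE;
          nblocks -= BULK_BLOCKS;
        }
    }
  while (nblocks > 0)
    {
      toy_crypt_block (dst, src);
      ctx->single_calls++;
      dst += BLOCK_SIZE;
      src += BLOCK_SIZE;
      nblocks--;
    }
}
```

With the flag set, a 40-block buffer takes two bulk calls plus eight single-block calls; with it cleared, all 40 blocks go through the generic path. This is also why the CBC and CFB encryption rows barely change in the tables: each block there depends on the previous ciphertext block, so they cannot be batched this way.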
* Add SM4 x86-64/AES-NI/AVX implementation  Jussi Kivilinna  2020-06-20  1  -1/+1
    * cipher/Makefile.am: Add 'sm4-aesni-avx-amd64.S'.
    * cipher/sm4-aesni-avx-amd64.S: New.
    * cipher/sm4.c (USE_AESNI_AVX, ASM_FUNC_ABI): New.
    (SM4_context) [USE_AESNI_AVX]: Add 'use_aesni_avx'.
    [USE_AESNI_AVX] (_gcry_sm4_aesni_avx_expand_key)
    (_gcry_sm4_aesni_avx_crypt_blk1_8, _gcry_sm4_aesni_avx_ctr_enc)
    (_gcry_sm4_aesni_avx_cbc_dec, _gcry_sm4_aesni_avx_cfb_dec)
    (_gcry_sm4_aesni_avx_ocb_enc, _gcry_sm4_aesni_avx_ocb_dec)
    (_gcry_sm4_aesni_avx_ocb_auth, sm4_aesni_avx_crypt_blk1_8): New.
    (sm4_expand_key) [USE_AESNI_AVX]: Use AES-NI/AVX key setup.
    (sm4_setkey): Enable AES-NI/AVX if supported by HW.
    (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
    (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX]: Add
    AES-NI/AVX bulk functions.
    * configure.ac: Add 'sm4-aesni-avx-amd64.lo'.
    --

    This patch adds x86-64/AES-NI/AVX bulk encryption/decryption and key
    setup for SM4 cipher. Bulk functions process eight blocks in parallel.

    Benchmark on AMD Ryzen 7 3700X:

    Before:
    SM4      | nanosecs/byte  mebibytes/sec  cycles/byte  auto Mhz
    CBC enc  |     8.94 ns/B    106.7 MiB/s    38.66 c/B      4325
    CBC dec  |     4.78 ns/B    199.7 MiB/s    20.42 c/B      4275
    CFB enc  |     8.95 ns/B    106.5 MiB/s    38.72 c/B      4325
    CFB dec  |     4.81 ns/B    198.2 MiB/s    20.57 c/B      4275
    CTR enc  |     4.81 ns/B    198.2 MiB/s    20.69 c/B      4300
    CTR dec  |     4.80 ns/B    198.8 MiB/s    20.63 c/B      4300
    GCM auth |    0.116 ns/B     8232 MiB/s    0.504 c/B      4351
    OCB enc  |     4.88 ns/B    195.5 MiB/s    20.86 c/B      4275
    OCB dec  |     4.85 ns/B    196.6 MiB/s    20.86 c/B      4301
    OCB auth |     4.80 ns/B    198.9 MiB/s    20.62 c/B      4301

    After (~3.0x faster):
    SM4      | nanosecs/byte  mebibytes/sec  cycles/byte  auto Mhz
    CBC enc  |     8.98 ns/B    106.2 MiB/s    38.62 c/B      4300
    CBC dec  |     1.55 ns/B    613.7 MiB/s     6.64 c/B      4275
    CFB enc  |     8.96 ns/B    106.4 MiB/s    38.52 c/B      4300
    CFB dec  |     1.54 ns/B    617.4 MiB/s     6.60 c/B      4275
    CTR enc  |     1.57 ns/B    607.8 MiB/s     6.75 c/B      4300
    CTR dec  |     1.57 ns/B    608.9 MiB/s     6.74 c/B      4300
    OCB enc  |     1.58 ns/B    603.8 MiB/s     6.75 c/B      4275
    OCB dec  |     1.57 ns/B    605.7 MiB/s     6.73 c/B      4275
    OCB auth |     1.53 ns/B    624.5 MiB/s     6.57 c/B      4300

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

    sm4 avx fix
    sm4 avx fix
* Add SM4 symmetric cipher algorithm  Tianjia Zhang  2020-06-16  1  -0/+1
    * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
    * cipher/cipher.c (cipher_list, cipher_list_algo301): Add
    _gcry_cipher_spec_sm4.
    * cipher/mac-cmac.c (map_mac_algo_to_cipher): Add cmac SM4.
    (_gcry_mac_type_spec_cmac_sm4): Add cmac SM4.
    * cipher/mac-internal.h: Declare spec_cmac_sm4.
    * cipher/mac.c (mac_list, mac_list_algo201): Add cmac SM4.
    * cipher/sm4.c: New.
    * configure.ac (available_ciphers): Add sm4.
    * doc/gcrypt.texi: Add SM4 document.
    * src/cipher.h: Add declarations for SM4 and cmac SM4.
    * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
    --

    Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
    [jk: add missing mapping in mac-cmac.c:map_mac_algo_to_cipher]
    [jk: add GCRY_MAC_CMAC_SM4 to gcrypt.texi]
    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add POWER9 little-endian variant of PPC AES implementation  Jussi Kivilinna  2020-02-02  1  -1/+8
    * configure.ac: Add 'rijndael-ppc9le.lo'.
    * cipher/Makefile.am: Add 'rijndael-ppc9le.c', 'rijndael-ppc-common.h'
    and 'rijndael-ppc-functions.h'.
    * cipher/rijndael-internal.h (USE_PPC_CRYPTO_WITH_PPC9LE): New.
    (RIJNDAEL_context_s): Add 'use_ppc9le_crypto'.
    * cipher/rijndael.c (_gcry_aes_ppc9le_encrypt)
    (_gcry_aes_ppc9le_decrypt, _gcry_aes_ppc9le_cfb_enc)
    (_gcry_aes_ppc9le_cfb_dec, _gcry_aes_ppc9le_ctr_enc)
    (_gcry_aes_ppc9le_cbc_enc, _gcry_aes_ppc9le_cbc_dec)
    (_gcry_aes_ppc9le_ocb_crypt, _gcry_aes_ppc9le_ocb_auth)
    (_gcry_aes_ppc9le_xts_crypt): New.
    (do_setkey, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc)
    (_gcry_aes_ctr_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_dec)
    (_gcry_aes_ocb_crypt, _gcry_aes_ocb_auth, _gcry_aes_xts_crypt)
    [USE_PPC_CRYPTO_WITH_PPC9LE]: New.
    * cipher/rijndael-ppc.c: Split common code to headers
    'rijndael-ppc-common.h' and 'rijndael-ppc-functions.h'.
    * cipher/rijndael-ppc-common.h: Split from 'rijndael-ppc.c'.
    (asm_add_uint64, asm_sra_int64, asm_swap_uint64_halfs): New.
    * cipher/rijndael-ppc-functions.h: Split from 'rijndael-ppc.c'.
    (CFB_ENC_FUNC, CBC_ENC_FUNC): Unroll loop by 2.
    (XTS_CRYPT_FUNC, GEN_TWEAK): Tweak generation without vperm
    instruction.
    * cipher/rijndael-ppc9le.c: New.
    --

    Provide POWER9 little-endian optimized variant of PPC vcrypto AES
    implementation. This implementation uses 'lxvb16x' and 'stxvb16x'
    instructions to load/store vectors directly in big-endian order.

    Benchmark on POWER9 (~3.8Ghz):

    Before:
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    CBC enc  |     1.04 ns/B    918.7 MiB/s     3.94 c/B
    CBC dec  |    0.222 ns/B     4292 MiB/s    0.844 c/B
    CFB enc  |     1.04 ns/B    916.9 MiB/s     3.95 c/B
    CFB dec  |    0.224 ns/B     4252 MiB/s    0.852 c/B
    CTR enc  |    0.226 ns/B     4218 MiB/s    0.859 c/B
    CTR dec  |    0.225 ns/B     4233 MiB/s    0.856 c/B
    XTS enc  |    0.500 ns/B     1907 MiB/s     1.90 c/B
    XTS dec  |    0.494 ns/B     1932 MiB/s     1.88 c/B
    OCB enc  |    0.288 ns/B     3312 MiB/s     1.09 c/B
    OCB dec  |    0.292 ns/B     3266 MiB/s     1.11 c/B
    OCB auth |    0.267 ns/B     3567 MiB/s     1.02 c/B

    After (ctr & ocb & cbc-dec & cfb-dec ~15% and xts ~8% faster):
    AES      | nanosecs/byte  mebibytes/sec  cycles/byte
    CBC enc  |     1.04 ns/B    914.2 MiB/s     3.96 c/B
    CBC dec  |    0.191 ns/B     4984 MiB/s    0.727 c/B
    CFB enc  |     1.03 ns/B    930.0 MiB/s     3.90 c/B
    CFB dec  |    0.194 ns/B     4906 MiB/s    0.739 c/B
    CTR enc  |    0.196 ns/B     4868 MiB/s    0.744 c/B
    CTR dec  |    0.197 ns/B     4834 MiB/s    0.750 c/B
    XTS enc  |    0.460 ns/B     2075 MiB/s     1.75 c/B
    XTS dec  |    0.455 ns/B     2097 MiB/s     1.73 c/B
    OCB enc  |    0.250 ns/B     3812 MiB/s    0.951 c/B
    OCB dec  |    0.253 ns/B     3764 MiB/s    0.963 c/B
    OCB auth |    0.232 ns/B     4106 MiB/s    0.883 c/B

    Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add elliptic curve SM2 implementation.  Tianjia Zhang  2020-01-21  1  -1/+1
    * configure.ac (enabled_pubkey_ciphers): Add ecc-sm2.
    * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add ecc-sm2.c.
    * cipher/pubkey-util.c (_gcry_pk_util_parse_flaglist,
    _gcry_pk_util_preparse_sigval): Add sm2 flags.
    * cipher/ecc.c: Support ecc-sm2.
    * cipher/ecc-common.h: Add declarations for ecc-sm2.
    * cipher/ecc-sm2.c: New.
    * src/cipher.h: Define PUBKEY_FLAG_SM2.
    --

    Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>