delta/libgcrypt.git - dev.gnupg.org: source/libgcrypt.git

	Commit message	Author	Age	Files	Lines
*	aria-avx512: small optimization for aria_diff_m	Jussi Kivilinna	2023-02-22	1	-10/+6
	* cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
	3-way XOR operation.
	---

	Using vpternlogq gives small performance improvement on AMD Zen4. With
	Intel tiger-lake speed is the same as before.

	Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

	Before:
	 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
	        ECB enc |     0.203 ns/B      4703 MiB/s     0.953 c/B      4700
	        ECB dec |     0.204 ns/B      4675 MiB/s     0.959 c/B      4700
	        CTR enc |     0.207 ns/B      4609 MiB/s     0.973 c/B      4700
	        CTR dec |     0.207 ns/B      4608 MiB/s     0.973 c/B      4700

	After (~3% faster):
	 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
	        ECB enc |     0.197 ns/B      4847 MiB/s     0.925 c/B      4700
	        ECB dec |     0.197 ns/B      4852 MiB/s     0.924 c/B      4700
	        CTR enc |     0.200 ns/B      4759 MiB/s     0.942 c/B      4700
	        CTR dec |     0.200 ns/B      4772 MiB/s     0.939 c/B      4700

	Cc: Taehee Yoo <ap420073@gmail.com>
	Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
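	The 3-way XOR mentioned above can be illustrated in portable C. The sketch
	below emulates a three-input ternary-logic operation bit by bit and shows
	that the truth-table immediate 0x96 yields a ^ b ^ c, which is how one
	instruction can replace two chained XORs. The names ternlog64, xor3 and
	xor3_check are hypothetical helpers for illustration; the operands and
	immediate actually used in aria_diff_m are not shown on this page, so this
	is an assumption about the technique rather than a copy of the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise emulation of a three-input ternary-logic operation on a 64-bit
   lane.  For every bit position, the bits of a, b and c form a 3-bit index
   (a is the most significant bit) and the result bit is bit 'index' of the
   8-bit truth table imm8. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
  uint64_t r = 0;

  for (int i = 0; i < 64; i++)
    {
      unsigned idx = (unsigned)((((a >> i) & 1) << 2)
                                | (((b >> i) & 1) << 1)
                                | ((c >> i) & 1));
      r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
  return r;
}

/* 0x96 = 0b10010110: the table bit is set exactly for the indices with an
   odd number of set input bits (1, 2, 4, 7), i.e. a three-input XOR. */
static uint64_t xor3(uint64_t a, uint64_t b, uint64_t c)
{
  return ternlog64(a, b, c, 0x96);
}

/* Small self-check: the single ternary operation matches two XORs. */
static void xor3_check(void)
{
  assert(xor3(1, 2, 4) == (1 ^ 2 ^ 4));
}
```

	A caller would use xor3(x, y, z) wherever the code previously computed
	(x ^ y) ^ z; on hardware the saving comes from issuing one instruction
	instead of two.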
*	aria: add x86_64 GFNI/AVX512 accelerated implementation	Jussi Kivilinna	2023-02-22	1	-0/+1014
	* cipher/Makefile.am: Add 'aria-gfni-avx512-amd64.S'.
	* cipher/aria-gfni-avx512-amd64.S: New.
	* cipher/aria.c (USE_GFNI_AVX512): New.
	[USE_GFNI_AVX512] (MAX_PARALLEL_BLKS): New.
	(ARIA_context): Add 'use_gfni_avx512'.
	(_gcry_aria_gfni_avx512_ecb_crypt_blk64)
	(_gcry_aria_gfni_avx512_ctr_crypt_blk64)
	(aria_gfni_avx512_ecb_crypt_blk64)
	(aria_gfni_avx512_ctr_crypt_blk64): New.
	(aria_crypt_blocks) [USE_GFNI_AVX512]: Add 64 parallel block
	AVX512/GFNI processing.
	(_gcry_aria_ctr_enc) [USE_GFNI_AVX512]: Add 64 parallel block
	AVX512/GFNI processing.
	(aria_setkey): Enable GFNI/AVX512 based on HW features.
	* configure.ac: Add 'aria-gfni-avx512-amd64.lo'.
	--

	This patch adds AVX512/GFNI accelerated ARIA block cipher
	implementation for libgcrypt. This implementation is based on
	work by Taehee Yoo, with following notable changes:
	 - Integration to libgcrypt, use of 'aes-common-amd64.h'.
	 - Use round loop instead of unrolling for smaller code size and
	   increased performance.
	 - Use stack for temporary storage instead of external buffers.
	 - Add byte-addition fast path for CTR.

	===

	Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

	GFNI/AVX512:
	 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
	        ECB enc |     0.203 ns/B      4703 MiB/s     0.953 c/B      4700
	        ECB dec |     0.204 ns/B      4675 MiB/s     0.959 c/B      4700
	        CTR enc |     0.207 ns/B      4609 MiB/s     0.973 c/B      4700
	        CTR dec |     0.207 ns/B      4608 MiB/s     0.973 c/B      4700

	===

	Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off):

	GFNI/AVX512:
	 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
	        ECB enc |     0.362 ns/B      2635 MiB/s      1.08 c/B      2992
	        ECB dec |     0.361 ns/B      2639 MiB/s      1.08 c/B      2992
	        CTR enc |     0.362 ns/B      2633 MiB/s      1.08 c/B      2992
	        CTR dec |     0.362 ns/B      2633 MiB/s      1.08 c/B      2992

	[v2]:
	 - Add byte-addition fast path for CTR.

	Cc: Taehee Yoo <ap420073@gmail.com>
	Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
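	The "byte-addition fast path for CTR" noted above can be sketched in plain
	C. The idea, as assumed here, is that when the lowest byte of the
	big-endian counter cannot overflow while producing a batch of blocks, each
	counter block differs from the base only in that last byte, so a cheap
	byte-wise add replaces a full 128-bit add with carry. The function names
	and the exact overflow condition are hypothetical; the real
	_gcry_aria_gfni_avx512_ctr_crypt_blk64 is written in assembly and may
	differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Slow path helper: add a small value to a 128-bit big-endian counter,
   propagating the carry from the last byte towards the first. */
static void ctr_add_be128(uint8_t ctr[BLOCK_SIZE], unsigned add)
{
  for (int i = BLOCK_SIZE - 1; i >= 0 && add; i--)
    {
      add += ctr[i];
      ctr[i] = (uint8_t)add;
      add >>= 8;
    }
}

/* Write nblks consecutive counter blocks to out and advance ctr by nblks.
   Fast path: if the low byte plus nblks stays within one byte, no carry can
   occur, so only the last byte of each block needs to change. */
static void ctr_gen_blocks(uint8_t *out, uint8_t ctr[BLOCK_SIZE],
                           unsigned nblks)
{
  if ((unsigned)ctr[BLOCK_SIZE - 1] + nblks <= 0xFF)
    {
      for (unsigned i = 0; i < nblks; i++)
        {
          memcpy(out + i * BLOCK_SIZE, ctr, BLOCK_SIZE);
          out[i * BLOCK_SIZE + BLOCK_SIZE - 1] =
            (uint8_t)(ctr[BLOCK_SIZE - 1] + i);
        }
      ctr[BLOCK_SIZE - 1] = (uint8_t)(ctr[BLOCK_SIZE - 1] + nblks);
      return;
    }

  /* Slow path: a carry is possible, so increment the full counter. */
  for (unsigned i = 0; i < nblks; i++)
    {
      memcpy(out + i * BLOCK_SIZE, ctr, BLOCK_SIZE);
      ctr_add_be128(ctr, 1);
    }
}
```

	With 64 blocks per call, the fast path applies for most counter values;
	only batches that would cross a low-byte boundary take the carry-aware
	path.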