path: root/cipher/rijndael-armv8-aarch64-ce.S
Commit message | Author | Age | Files | Lines
* aarch64-asm: align functions to 16 bytes | Jussi Kivilinna | 2023-01-19 | 1 | -17/+17
  * cipher/camellia-aarch64.S: Align functions to 16 bytes.
  * cipher/chacha20-aarch64.S: Likewise.
  * cipher/cipher-gcm-armv8-aarch64-ce.S: Likewise.
  * cipher/crc-armv8-aarch64-ce.S: Likewise.
  * cipher/rijndael-aarch64.S: Likewise.
  * cipher/rijndael-armv8-aarch64-ce.S: Likewise.
  * cipher/sha1-armv8-aarch64-ce.S: Likewise.
  * cipher/sha256-armv8-aarch64-ce.S: Likewise.
  * cipher/sha512-armv8-aarch64-ce.S: Likewise.
  * cipher/sm3-aarch64.S: Likewise.
  * cipher/sm3-armv8-aarch64-ce.S: Likewise.
  * cipher/sm4-aarch64.S: Likewise.
  * cipher/sm4-armv8-aarch64-ce.S: Likewise.
  * cipher/sm4-armv9-aarch64-sve-ce.S: Likewise.
  * cipher/twofish-aarch64.S: Likewise.
  * mpi/aarch64/mpih-add1.S: Likewise.
  * mpi/aarch64/mpih-mul1.S: Likewise.
  * mpi/aarch64/mpih-mul2.S: Likewise.
  * mpi/aarch64/mpih-mul3.S: Likewise.
  * mpi/aarch64/mpih-sub1.S: Likewise.
  --
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* rijndael: add ECB acceleration (for benchmarking purposes) | Jussi Kivilinna | 2022-10-26 | 1 | -4/+121
  * cipher/cipher-internal.h (cipher_bulk_ops): Add 'ecb_crypt'.
  * cipher/cipher.c (do_ecb_crypt): Use bulk function if available.
  * cipher/rijndael-aesni.c (do_aesni_enc_vec8): Change asm label '.Ldeclast' to '.Lenclast'.
    (_gcry_aes_aesni_ecb_crypt): New.
  * cipher/rijndael-armv8-aarch32-ce.S (_gcry_aes_ecb_enc_armv8_ce)
    (_gcry_aes_ecb_dec_armv8_ce): New.
  * cipher/rijndael-armv8-aarch64-ce.S (_gcry_aes_ecb_enc_armv8_ce)
    (_gcry_aes_ecb_dec_armv8_ce): New.
  * cipher/rijndael-armv8-ce.c (_gcry_aes_ocb_enc_armv8_ce)
    (_gcry_aes_ocb_dec_armv8_ce, _gcry_aes_ocb_auth_armv8_ce): Change return value from void to size_t.
    (ocb_crypt_fn_t, xts_crypt_fn_t): Remove.
    (_gcry_aes_armv8_ce_ocb_crypt, _gcry_aes_armv8_ce_xts_crypt): Remove indirect function call; Return value from called function (allows tail call optimization).
    (_gcry_aes_armv8_ce_ocb_auth): Return value from called function (allows tail call optimization).
    (_gcry_aes_ecb_enc_armv8_ce, _gcry_aes_ecb_dec_armv8_ce)
    (_gcry_aes_armv8_ce_ecb_crypt): New.
  * cipher/rijndael-vaes-avx2-amd64.S (_gcry_vaes_avx2_ecb_crypt_amd64): New.
  * cipher/rijndael-vaes.c (_gcry_vaes_avx2_ecb_crypt_amd64)
    (_gcry_aes_vaes_ecb_crypt): New.
  * cipher/rijndael.c (_gcry_aes_aesni_ecb_crypt)
    (_gcry_aes_vaes_ecb_crypt, _gcry_aes_armv8_ce_ecb_crypt): New.
    (do_setkey): Setup ECB bulk function for x86 AESNI/VAES and ARM CE.
  --
  Benchmark on AMD Ryzen 9 7900X:

  Before (OCB for reference):
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | auto Mhz
    ECB enc | 0.128 ns/B | 7460 MiB/s | 0.720 c/B | 5634±1
    ECB dec | 0.134 ns/B | 7103 MiB/s | 0.753 c/B | 5608
    OCB enc | 0.029 ns/B | 32930 MiB/s | 0.163 c/B | 5625
    OCB dec | 0.029 ns/B | 32738 MiB/s | 0.164 c/B | 5625

  After:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | auto Mhz
    ECB enc | 0.028 ns/B | 33761 MiB/s | 0.159 c/B | 5625
    ECB dec | 0.028 ns/B | 33917 MiB/s | 0.158 c/B | 5625

  GnuPG-bug-id: T6242
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
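The `do_ecb_crypt` entry in this commit ("Use bulk function if available") describes a dispatch pattern: call an accelerated multi-block routine when one is registered, otherwise fall back to per-block processing. A minimal Python sketch of that pattern follows. It is not libgcrypt's code: `BulkOps`, `toy_encrypt_block`, and `toy_bulk_ecb` are hypothetical names, and the toy cipher is a plain XOR, not AES.

```python
from dataclasses import dataclass
from typing import Callable, Optional

BLOCK_SIZE = 16  # AES block size in bytes


def toy_encrypt_block(key: int, block: bytes) -> bytes:
    """Stand-in for a single-block cipher call (NOT AES): XOR with a key byte."""
    return bytes(b ^ key for b in block)


def toy_bulk_ecb(key: int, data: bytes) -> bytes:
    """Stand-in for an accelerated routine that handles many blocks per call."""
    return bytes(b ^ key for b in data)


@dataclass
class BulkOps:
    # Hypothetical counterpart of the 'ecb_crypt' slot named in the commit.
    ecb_crypt: Optional[Callable[[int, bytes], bytes]] = None


def do_ecb_crypt(ops: BulkOps, key: int, data: bytes) -> bytes:
    if len(data) % BLOCK_SIZE:
        raise ValueError("ECB input must be a whole number of blocks")
    if ops.ecb_crypt is not None:
        # Fast path: hand the whole buffer to the bulk function.
        return ops.ecb_crypt(key, data)
    # Generic path: one cipher call per 16-byte block.
    out = bytearray()
    for off in range(0, len(data), BLOCK_SIZE):
        out += toy_encrypt_block(key, data[off:off + BLOCK_SIZE])
    return bytes(out)
```

Both paths must produce identical output; only the number of calls into the cipher differs, which is the per-block overhead the ECB benchmark above measures.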
* Add straight-line speculation hardening for aarch64 assembly | Jussi Kivilinna | 2022-01-11 | 1 | -15/+15
  * cipher/asm-common-aarch64.h (ret_spec_stop): New.
  * cipher/asm-poly1305-aarch64.h: Use 'ret_spec_stop' for 'ret' instruction.
  * cipher/camellia-aarch64.S: Likewise.
  * cipher/chacha20-aarch64.S: Likewise.
  * cipher/cipher-gcm-armv8-aarch64-ce.S: Likewise.
  * cipher/crc-armv8-aarch64-ce.S: Likewise.
  * cipher/rijndael-aarch64.S: Likewise.
  * cipher/rijndael-armv8-aarch64-ce.S: Likewise.
  * cipher/sha1-armv8-aarch64-ce.S: Likewise.
  * cipher/sha256-armv8-aarch64-ce.S: Likewise.
  * cipher/sm3-aarch64.S: Likewise.
  * cipher/twofish-aarch64.S: Likewise.
  * mpi/aarch64/mpih-add1.S: Likewise.
  * mpi/aarch64/mpih-mul1.S: Likewise.
  * mpi/aarch64/mpih-mul2.S: Likewise.
  * mpi/aarch64/mpih-mul3.S: Likewise.
  * mpi/aarch64/mpih-sub1.S: Likewise.
  --
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Optimizations for AES aarch64-ce assembly implementation | Jussi Kivilinna | 2022-01-11 | 1 | -514/+713
  * cipher/rijndael-armv8-aarch64-ce.S (vk14): Remove.
    (vklast, __, _): New.
    (aes_preload_keys): Setup vklast.
    (do_aes_one128/192/256): Split to ...
    (do_aes_one_part1, do_aes_part2_128/192/256): ... these and add interleave ops.
    (do_aes_one128/192/256): New using above part1 and part2 macros.
    (aes_round_4): Rename to ...
    (aes_round_4_multikey): ... this and allow different key used for parallel blocks.
    (aes_round_4): New using above multikey macro.
    (aes_lastround_4): Reorder AES round and xor instructions, allow different last key for parallel blocks.
    (do_aes_4_128/192/256): Split to ...
    (do_aes_4_part1_multikey, do_aes_4_part1)
    (do_aes_4_part2_128/192/256): ... these.
    (do_aes_4_128/192/256): New using above part1 and part2 macros.
    (CLEAR_REG): Use movi for clearing registers.
    (aes_clear_keys): Remove branching and clear all key registers.
    (_gcry_aes_enc_armv8_ce, _gcry_aes_dec_armv8_ce): Adjust to macro changes.
    (_gcry_aes_cbc_enc_armv8_ce, _gcry_aes_cbc_dec_armv8_ce)
    (_gcry_aes_cfb_enc_armv8_ce, _gcry_aes_cfb_enc_armv8_ce)
    (_gcry_aes_ctr32le_enc_armv8_ce): Apply entry/loop-body/exit optimization for better interleaving of input/output processing; First/last round key and input/output xoring optimization to reduce critical path length.
    (_gcry_aes_ctr_enc_armv8_ce): Add fast path for counter incrementing without byte-swaps when counter does not overflow 8-bit; Apply entry/loop-body/exit optimization for better interleaving of input/output processing; First/last round key and input/output xoring optimization to reduce critical path length.
    (_gcry_aes_ocb_enc_armv8_ce, _gcry_aes_ocb_dec_armv8_ce): Add aligned processing for nblk and OCB offsets; Apply entry/loop-body/exit optimization for better interleaving of input/output processing; First/last round key and input/output xoring optimization to reduce critical path length; Change to use same function body macro for both encryption and decryption.
    (_gcry_aes_xts_enc_armv8_ce, _gcry_aes_xts_dec_armv8_ce): Apply entry/loop-body/exit optimization for better interleaving of input/output processing; First/last round key and input/output xoring optimization to reduce critical path length; Change to use same function body macro for both encryption and decryption.
  --
  Benchmark on AWS Graviton2 (2500Mhz):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    CBC enc | 0.663 ns/B | 1439 MiB/s | 1.66 c/B
    CBC dec | 0.288 ns/B | 3310 MiB/s | 0.720 c/B
    CFB enc | 0.657 ns/B | 1453 MiB/s | 1.64 c/B
    CFB dec | 0.288 ns/B | 3313 MiB/s | 0.720 c/B
    CTR dec | 0.314 ns/B | 3039 MiB/s | 0.785 c/B
    XTS enc | 0.357 ns/B | 2674 MiB/s | 0.891 c/B
    XTS dec | 0.358 ns/B | 2666 MiB/s | 0.894 c/B
    OCB enc | 0.343 ns/B | 2784 MiB/s | 0.856 c/B
    OCB dec | 0.341 ns/B | 2795 MiB/s | 0.853 c/B
    GCM-SIV enc | 0.526 ns/B | 1813 MiB/s | 1.31 c/B

  After:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | perf increase
    CBC enc | 0.500 ns/B | 1906 MiB/s | 1.25 c/B | +33%
    CBC dec | 0.263 ns/B | 3622 MiB/s | 0.658 c/B | +9%
    CFB enc | 0.500 ns/B | 1906 MiB/s | 1.25 c/B | +31%
    CFB dec | 0.263 ns/B | 3620 MiB/s | 0.658 c/B | +9%
    CTR enc | 0.264 ns/B | 3618 MiB/s | 0.659 c/B | +19%
    XTS enc | 0.350 ns/B | 2722 MiB/s | 0.876 c/B | +2%
    OCB enc | 0.275 ns/B | 3468 MiB/s | 0.687 c/B | +25%
    OCB dec | 0.276 ns/B | 3459 MiB/s | 0.689 c/B | +24%
    GCM-SIV enc | 0.494 ns/B | 1929 MiB/s | 1.24 c/B | +6%

  Benchmark on Cortex-A53 (1152Mhz):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    CBC enc | 1.41 ns/B | 675.9 MiB/s | 1.63 c/B
    CBC dec | 0.910 ns/B | 1048 MiB/s | 1.05 c/B
    CFB enc | 1.30 ns/B | 732.2 MiB/s | 1.50 c/B
    CFB dec | 0.910 ns/B | 1048 MiB/s | 1.05 c/B
    CTR enc | 1.03 ns/B | 924.4 MiB/s | 1.19 c/B
    XTS enc | 1.25 ns/B | 763.0 MiB/s | 1.44 c/B
    OCB enc | 1.21 ns/B | 789.5 MiB/s | 1.39 c/B
    OCB dec | 1.21 ns/B | 788.9 MiB/s | 1.39 c/B
    GCM-SIV enc | 1.92 ns/B | 496.6 MiB/s | 2.21 c/B

  After:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | perf increase
    CBC enc | 1.14 ns/B | 836.6 MiB/s | 1.31 c/B | +24%
    CBC dec | 0.843 ns/B | 1132 MiB/s | 0.971 c/B | +8%
    CFB enc | 1.19 ns/B | 798.8 MiB/s | 1.38 c/B | +9%
    CFB dec | 0.842 ns/B | 1132 MiB/s | 0.970 c/B | +8%
    CTR enc | 0.898 ns/B | 1062 MiB/s | 1.03 c/B | +16%
    XTS enc | 1.22 ns/B | 779.9 MiB/s | 1.41 c/B | +2%
    OCB enc | 0.992 ns/B | 961.0 MiB/s | 1.14 c/B | +22%
    OCB dec | 0.993 ns/B | 960.5 MiB/s | 1.14 c/B | +22%
    GCM-SIV enc | 1.88 ns/B | 507.3 MiB/s | 2.17 c/B | +2%

  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
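The `_gcry_aes_ctr_enc_armv8_ce` entry in this commit mentions a fast path for counter incrementing "when counter does not overflow 8-bit". The idea can be sketched in Python, assuming a 16-byte big-endian counter as in standard CTR mode: if adding the block count to the last byte produces no carry, only that byte changes and no wide addition is needed. The function names and the exact condition are illustrative assumptions; the assembly itself is not shown in this log.

```python
from typing import List, Tuple

MASK128 = (1 << 128) - 1


def counter_blocks_generic(ctr: bytes, n: int) -> Tuple[List[bytes], bytes]:
    """Full 128-bit big-endian add: returns n counter blocks and the next counter."""
    value = int.from_bytes(ctr, "big")
    blocks = [((value + i) & MASK128).to_bytes(16, "big") for i in range(n)]
    return blocks, ((value + n) & MASK128).to_bytes(16, "big")


def counter_blocks(ctr: bytes, n: int) -> Tuple[List[bytes], bytes]:
    """Same result as the generic version, with a byte-only fast path."""
    low = ctr[15]
    if low + n <= 0xFF:
        # Fast path: no carry leaves the lowest byte, so the other
        # 15 bytes are reused unchanged.
        prefix = ctr[:15]
        blocks = [prefix + bytes([low + i]) for i in range(n)]
        return blocks, prefix + bytes([low + n])
    # Slow path: a carry is possible, fall back to the wide addition.
    return counter_blocks_generic(ctr, n)
```

For small batches of blocks the fast path applies to most calls, since a carry out of the low byte can occur only when that byte is within `n` of 0xFF.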
* Add ARMv8-CE HW acceleration for GCM-SIV counter mode | Jussi Kivilinna | 2021-08-26 | 1 | -0/+109
  * cipher/rijndael-armv8-aarch32-ce.S (_gcry_aes_ctr32le_enc_armv8_ce): New.
  * cipher/rijndael-armv8-aarch64-ce.S (_gcry_aes_ctr32le_enc_armv8_ce): New.
  * cipher/rijndael-armv8-ce.c (_gcry_aes_ctr32le_enc_armv8_ce)
    (_gcry_aes_armv8_ce_ctr32le_enc): New.
  * cipher/rijndael.c (_gcry_aes_armv8_ce_ctr32le_enc): New prototype.
    (do_setkey): Add setup of 'bulk_ops->ctr32le_enc' for ARMv8-CE.
  --
  Benchmark on Cortex-A53 (aarch64):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | auto Mhz
    GCM-SIV enc | 11.77 ns/B | 81.03 MiB/s | 7.63 c/B | 647.9
    GCM-SIV dec | 11.92 ns/B | 79.98 MiB/s | 7.73 c/B | 647.9
    GCM-SIV auth | 2.99 ns/B | 318.9 MiB/s | 1.94 c/B | 648.0

  After (~2.4x faster):
    AES | nanosecs/byte | mebibytes/sec | cycles/byte | auto Mhz
    GCM-SIV enc | 4.66 ns/B | 204.5 MiB/s | 3.02 c/B | 647.9
    GCM-SIV dec | 4.82 ns/B | 198.0 MiB/s | 3.12 c/B | 647.9
    GCM-SIV auth | 3.00 ns/B | 318.4 MiB/s | 1.94 c/B | 648.0

  GnuPG-bug-id: T4485
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
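The `ctr32le` in the function names of this commit refers to the counter mode used by GCM-SIV. As specified in RFC 8452, only the first four bytes of the counter block are incremented, as a little-endian 32-bit integer that wraps modulo 2^32, while the remaining twelve bytes stay fixed. A small Python sketch of that increment is shown here; `ctr32le_increment` is a hypothetical name and the code is based on the RFC, not on the assembly.

```python
def ctr32le_increment(block: bytes, by: int = 1) -> bytes:
    """Advance a 16-byte GCM-SIV style counter block by 'by' steps."""
    if len(block) != 16:
        raise ValueError("counter block must be 16 bytes")
    # Only the first 4 bytes form the counter, read as little-endian.
    low = int.from_bytes(block[:4], "little")
    low = (low + by) & 0xFFFFFFFF  # wrap modulo 2**32, no carry into byte 4
    return low.to_bytes(4, "little") + block[4:]
```

This differs from standard CTR mode, where the counter is big-endian and carries can propagate through the whole block, which is why it needs its own bulk routine.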
* Move data pointer macro for 64-bit ARM assembly to common header | Jussi Kivilinna | 2019-04-26 | 1 | -5/+0
  * cipher/asm-common-aarch64.h (GET_DATA_POINTER): New.
  * cipher/chacha20-aarch64.S (GET_DATA_POINTER): Remove.
  * cipher/cipher-gcm-armv8-aarch64-ce.S (GET_DATA_POINTER): Remove.
  * cipher/crc-armv8-aarch64-ce.S (GET_DATA_POINTER): Remove.
  * cipher/rijndael-armv8-aarch64-ce.S (GET_DATA_POINTER): Remove.
  * cipher/sha1-armv8-aarch64-ce.S (GET_DATA_POINTER): Remove.
  * cipher/sha256-armv8-aarch64-ce.S (GET_DATA_POINTER): Remove.
  --
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Add CFI unwind assembly directives for 64-bit ARM assembly | Jussi Kivilinna | 2019-04-26 | 1 | -3/+29
  * cipher/asm-common-aarch64.h (CFI_STARTPROC, CFI_ENDPROC)
    (CFI_REMEMBER_STATE, CFI_RESTORE_STATE, CFI_ADJUST_CFA_OFFSET)
    (CFI_REL_OFFSET, CFI_DEF_CFA_REGISTER, CFI_REGISTER, CFI_RESTORE)
    (DW_REGNO_SP, DW_SLEB128_7BIT, DW_SLEB128_28BIT, CFI_CFA_ON_STACK)
    (CFI_REG_ON_STACK): New.
  * cipher/camellia-aarch64.S: Add CFI directives.
  * cipher/chacha20-aarch64.S: Add CFI directives.
  * cipher/cipher-gcm-armv8-aarch64-ce.S: Add CFI directives.
  * cipher/crc-armv8-aarch64-ce.S: Add CFI directives.
  * cipher/rijndael-aarch64.S: Add CFI directives.
  * cipher/rijndael-armv8-aarch64-ce.S: Add CFI directives.
  * cipher/sha1-armv8-aarch64-ce.S: Add CFI directives.
  * cipher/sha256-armv8-aarch64-ce.S: Add CFI directives.
  * cipher/twofish-aarch64.S: Add CFI directives.
  * mpi/aarch64/mpih-add1.S: Add CFI directives.
  * mpi/aarch64/mpih-mul1.S: Add CFI directives.
  * mpi/aarch64/mpih-mul2.S: Add CFI directives.
  * mpi/aarch64/mpih-mul3.S: Add CFI directives.
  * mpi/aarch64/mpih-sub1.S: Add CFI directives.
  * mpi/asm-common-aarch64.h: Include "../cipher/asm-common-aarch64.h".
    (ELF): Remove.
  --
  This commit adds CFI directives that add DWARF unwinding information for
  debugger to backtrace when executing code from 64-bit ARM assembly files.

  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* aarch64/assembly: only use the lower 32 bit of an int parameters | Jussi Kivilinna | 2018-03-28 | 1 | -4/+8
  * cipher/camellia-aarch64.S (_gcry_camellia_arm_encrypt_block)
    (__gcry_camellia_arm_decrypt_block): Make comment section about input registers match usage.
  * cipher/rijndael-armv8-aarch64-ce.S (_gcry_aes_ocb_auth_armv8_ce): Use 'w12' and 'w7' instead of 'x12' and 'x7'.
    (_gcry_aes_xts_enc_armv8_ce, _gcry_aes_xts_dec_armv8_ce): Fix function prototype in comments.
  * mpi/aarch64/mpih-add1.S: Use 32-bit registers for 32-bit mpi_size_t parameters.
  * mpi/aarch64/mpih-mul1.S: Ditto.
  * mpi/aarch64/mpih-mul2.S: Ditto.
  * mpi/aarch64/mpih-mul3.S: Ditto.
  * mpi/aarch64/mpih-sub1.S: Ditto.
  --
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* aarch64: Enable building the aarch64 cipher assembly for windows | Martin Storsjö | 2018-03-28 | 1 | -29/+29
  * cipher/asm-common-aarch64.h: New.
  * cipher/camellia-aarch64.S: Use ELF macro, use x19 instead of x18.
  * cipher/chacha20-aarch64.S: Use ELF macro, don't use GOT on windows.
  * cipher/cipher-gcm-armv8-aarch64-ce.S: Use ELF macro.
  * cipher/rijndael-aarch64.S: Use ELF macro.
  * cipher/rijndael-armv8-aarch64-ce.S: Use ELF macro.
  * cipher/sha1-armv8-aarch64-ce.S: Use ELF macro.
  * cipher/sha256-armv8-aarch64-ce.S: Use ELF macro.
  * cipher/twofish-aarch64.S: Use ELF macro.
  * configure.ac: Don't require .size and .type in aarch64 assembly check.
  --
  Don't require .type and .size in configure; we can make them optional
  via a preprocessor macro.

  This is mostly a mechanical change, wrapping the .type and .size
  directives in an ELF() macro, with two actual manual changes
  (when targeting windows):
  - Don't load global symbols via a GOT (in chacha20)
  - Don't use the x18 register (in camellia); back up and restore x19
    in the prologue/epilogue and use that instead.

  x18 is a platform specific register; on linux, it's free to be used
  by user code, while it's reserved for platform use on windows and
  darwin. Always use x19 instead of x18 for consistency.

  Signed-off-by: Martin Storsjö <martin@martin.st>
* Add ARMv8/CE acceleration for AES-XTS | Jussi Kivilinna | 2018-01-20 | 1 | -0/+274
  * cipher/rijndael-armv8-aarch32-ce.S (_gcry_aes_xts_enc_armv8_ce)
    (_gcry_aes_xts_dec_armv8_ce): New.
  * cipher/rijndael-armv8-aarch64-ce.S (_gcry_aes_xts_enc_armv8_ce)
    (_gcry_aes_xts_dec_armv8_ce): New.
  * cipher/rijndael-armv8-ce.c (_gcry_aes_xts_enc_armv8_ce)
    (_gcry_aes_xts_dec_armv8_ce, xts_crypt_fn_t)
    (_gcry_aes_armv8_ce_xts_crypt): New.
  * cipher/rijndael.c (_gcry_aes_armv8_ce_xts_crypt): New.
    (_gcry_aes_xts_crypt) [USE_ARM_CE]: New.
  --
  Benchmark on Cortex-A53 (AArch64, 1152 Mhz):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 4.88 ns/B | 195.5 MiB/s | 5.62 c/B
    XTS dec | 4.94 ns/B | 192.9 MiB/s | 5.70 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 5.55 ns/B | 171.8 MiB/s | 6.39 c/B
    XTS dec | 5.61 ns/B | 169.9 MiB/s | 6.47 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 6.22 ns/B | 153.3 MiB/s | 7.17 c/B
    XTS dec | 6.29 ns/B | 151.7 MiB/s | 7.24 c/B
    =

  After (~2.6x faster):
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 1.83 ns/B | 520.9 MiB/s | 2.11 c/B
    XTS dec | 1.82 ns/B | 524.9 MiB/s | 2.09 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 1.97 ns/B | 483.3 MiB/s | 2.27 c/B
    XTS dec | 1.96 ns/B | 486.9 MiB/s | 2.26 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 2.11 ns/B | 450.9 MiB/s | 2.44 c/B
    XTS dec | 2.10 ns/B | 453.8 MiB/s | 2.42 c/B
    =

  Benchmark on Cortex-A53 (AArch32, 1152 Mhz):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 6.52 ns/B | 146.2 MiB/s | 7.51 c/B
    XTS dec | 6.57 ns/B | 145.2 MiB/s | 7.57 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 7.10 ns/B | 134.3 MiB/s | 8.18 c/B
    XTS dec | 7.11 ns/B | 134.2 MiB/s | 8.19 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 7.30 ns/B | 130.7 MiB/s | 8.41 c/B
    XTS dec | 7.38 ns/B | 129.3 MiB/s | 8.50 c/B
    =

  After (~2.7x faster):
    Cipher:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 2.33 ns/B | 409.6 MiB/s | 2.68 c/B
    XTS dec | 2.35 ns/B | 405.3 MiB/s | 2.71 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 2.53 ns/B | 377.6 MiB/s | 2.91 c/B
    XTS dec | 2.54 ns/B | 375.5 MiB/s | 2.93 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    XTS enc | 2.75 ns/B | 346.8 MiB/s | 3.17 c/B
    XTS dec | 2.76 ns/B | 345.2 MiB/s | 3.18 c/B
    =

  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* Fix building with clang on ARM64/FreeBSD | Jussi Kivilinna | 2017-02-27 | 1 | -1/+1
  * cipher/cipher-gcm-armv8-aarch64-ce.S: Use '.cpu generic+simd+crypto' instead of '.arch armv8-a+crypto'.
  * cipher/rijndael-armv8-aarch64-ce.S: Ditto.
  * cipher/sha1-armv8-aarch64-ce.S: Ditto.
  * cipher/sha256-armv8-aarch64-ce.S: Ditto.
  * configure.ac (gcry_cv_gcc_inline_asm_aarch64_neon): Ditto.
    (gcry_cv_gcc_inline_asm_aarch64_crypto): Ditto; and include NEON instructions to crypto instructions check.
  --
  GnuPG-bug-id: 2975
  Reported-by: Kirill Ponomarev <kp@krion.cc>
  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
* OCB ARM CE: Move ocb_get_l handling to assembly part | Jussi Kivilinna | 2016-12-10 | 1 | -38/+87
  * cipher/rijndael-armv8-aarch32-ce.S: Add OCB 'L_{ntz(i)}' calculation.
  * cipher/rijndael-armv8-aarch64-ce.S: Ditto.
  * cipher/rijndael-armv8-ce.c (_gcry_aes_ocb_enc_armv8_ce)
    (_gcry_aes_ocb_dec_armv8_ce, _gcry_aes_ocb_auth_armv8_ce)
    (ocb_cryt_fn_t): Updated arguments.
    (_gcry_aes_armv8_ce_ocb_crypt, _gcry_aes_armv8_ce_ocb_auth): Remove 'ocb_get_l' handling and splitting input to 32 block chunks, instead pass full buffers to assembly.
  --
  Performance on Cortex-A53 (AArch32):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    OCB enc | 1.63 ns/B | 583.8 MiB/s | 1.88 c/B
    OCB dec | 1.67 ns/B | 572.1 MiB/s | 1.92 c/B
    OCB auth | 1.33 ns/B | 717.1 MiB/s | 1.53 c/B

  After (~12% faster):
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    OCB enc | 1.47 ns/B | 650.2 MiB/s | 1.69 c/B
    OCB dec | 1.48 ns/B | 644.5 MiB/s | 1.70 c/B
    OCB auth | 1.19 ns/B | 798.2 MiB/s | 1.38 c/B

  Performance on Cortex-A53 (AArch64):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    OCB enc | 1.29 ns/B | 738.5 MiB/s | 1.49 c/B
    OCB dec | 1.32 ns/B | 723.5 MiB/s | 1.52 c/B
    OCB auth | 1.15 ns/B | 827.0 MiB/s | 1.33 c/B

  After (~8% faster):
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    OCB enc | 1.21 ns/B | 789.1 MiB/s | 1.39 c/B
    OCB dec | 1.21 ns/B | 789.2 MiB/s | 1.39 c/B
    OCB auth | 1.10 ns/B | 867.0 MiB/s | 1.27 c/B

  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
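The 'L_{ntz(i)}' calculation named in this commit comes from the OCB mode definition (RFC 7253): each block's offset is the previous offset XORed with a precomputed value selected by the number of trailing zero bits in the block index. The Python sketch below follows the RFC's definitions of `ntz` and `double`. It is an illustration rather than a port of the assembly, and the `l_star` input is a placeholder, since in real OCB it is the AES encryption of an all-zero block, which is not performed here.

```python
from typing import List


def ntz(i: int) -> int:
    """Number of trailing zero bits in i (block indices start at 1)."""
    if i < 1:
        raise ValueError("block index must be >= 1")
    return (i & -i).bit_length() - 1


def double(block: bytes) -> bytes:
    """Multiply a 128-bit big-endian value by x in GF(2^128), per RFC 7253."""
    value = int.from_bytes(block, "big")
    doubled = (value << 1) & ((1 << 128) - 1)
    if value >> 127:
        doubled ^= 0x87  # reduction when the top bit was set
    return doubled.to_bytes(16, "big")


def l_table(l_star: bytes, size: int) -> List[bytes]:
    """Return [L_0, L_1, ...], where L_0 = double(L_$) and L_$ = double(L_*)."""
    table = [double(double(l_star))]
    while len(table) < size:
        table.append(double(table[-1]))
    return table


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ocb_offsets(offset: bytes, table: List[bytes], first: int, n: int) -> List[bytes]:
    """Offsets for blocks first..first+n-1: Offset_i = Offset_{i-1} xor L_{ntz(i)}."""
    out = []
    for i in range(first, first + n):
        offset = xor(offset, table[ntz(i)])
        out.append(offset)
    return out
```

Because `ntz(i)` is 0 for every odd index, half of all blocks use `L_0`, so the selection of the `L` value sits on the hot path of OCB processing.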
* Add ARMv8/AArch64 Crypto Extension implementation of AES | Jussi Kivilinna | 2016-09-05 | 1 | -0/+1265
  * cipher/Makefile.am: Add 'rijndael-armv-aarch64-ce.S'.
  * cipher/rijndael-armv8-aarch64-ce.S: New.
  * cipher/rijndael-internal.h (USE_ARM_CE): Enable for ARMv8/AArch64.
  * configure.ac: Add 'rijndael-armv-aarch64-ce.lo' and 'rijndael-armv8-ce.lo' for ARMv8/AArch64.
  --
  Improvement vs AArch64 assembly on Cortex-A53:

    Mode | AES-128 | AES-192 | AES-256
    CBC enc: | 13.19x | 13.53x | 13.76x
    CBC dec: | 20.53x | 21.91x | 22.60x
    CFB enc: | 14.29x | 14.50x | 14.63x
    CFB dec: | 20.42x | 21.69x | 22.50x
    CTR: | 18.29x | 19.61x | 20.53x
    OCB enc: | 15.21x | 16.32x | 17.12x
    OCB dec: | 14.95x | 16.11x | 16.88x
    OCB auth: | 16.73x | 17.93x | 18.66x

  Benchmark on Cortex-A53 (1152 Mhz):

  Before:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 21.86 ns/B | 43.62 MiB/s | 25.19 c/B
    ECB dec | 22.68 ns/B | 42.05 MiB/s | 26.13 c/B
    CBC enc | 18.66 ns/B | 51.10 MiB/s | 21.50 c/B
    CBC dec | 18.72 ns/B | 50.95 MiB/s | 21.56 c/B
    CFB enc | 18.61 ns/B | 51.25 MiB/s | 21.44 c/B
    CFB dec | 18.61 ns/B | 51.25 MiB/s | 21.44 c/B
    OFB enc | 22.84 ns/B | 41.75 MiB/s | 26.31 c/B
    OFB dec | 22.84 ns/B | 41.75 MiB/s | 26.31 c/B
    CTR enc | 18.89 ns/B | 50.50 MiB/s | 21.76 c/B
    CTR dec | 18.89 ns/B | 50.50 MiB/s | 21.76 c/B
    CCM enc | 37.55 ns/B | 25.40 MiB/s | 43.25 c/B
    CCM dec | 37.55 ns/B | 25.40 MiB/s | 43.25 c/B
    CCM auth | 18.77 ns/B | 50.80 MiB/s | 21.63 c/B
    GCM enc | 20.18 ns/B | 47.25 MiB/s | 23.25 c/B
    GCM dec | 20.18 ns/B | 47.25 MiB/s | 23.25 c/B
    GCM auth | 1.30 ns/B | 732.5 MiB/s | 1.50 c/B
    OCB enc | 19.67 ns/B | 48.48 MiB/s | 22.66 c/B
    OCB dec | 19.73 ns/B | 48.34 MiB/s | 22.72 c/B
    OCB auth | 19.46 ns/B | 49.00 MiB/s | 22.42 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 25.39 ns/B | 37.56 MiB/s | 29.25 c/B
    ECB dec | 26.15 ns/B | 36.47 MiB/s | 30.13 c/B
    CBC enc | 22.08 ns/B | 43.19 MiB/s | 25.44 c/B
    CBC dec | 22.25 ns/B | 42.87 MiB/s | 25.63 c/B
    CFB enc | 22.03 ns/B | 43.30 MiB/s | 25.38 c/B
    CFB dec | 22.03 ns/B | 43.29 MiB/s | 25.38 c/B
    OFB enc | 26.26 ns/B | 36.32 MiB/s | 30.25 c/B
    OFB dec | 26.26 ns/B | 36.32 MiB/s | 30.25 c/B
    CTR enc | 22.30 ns/B | 42.76 MiB/s | 25.69 c/B
    CTR dec | 22.30 ns/B | 42.76 MiB/s | 25.69 c/B
    CCM enc | 44.38 ns/B | 21.49 MiB/s | 51.13 c/B
    CCM dec | 44.38 ns/B | 21.49 MiB/s | 51.13 c/B
    CCM auth | 22.20 ns/B | 42.97 MiB/s | 25.57 c/B
    GCM enc | 23.60 ns/B | 40.41 MiB/s | 27.19 c/B
    GCM dec | 23.60 ns/B | 40.41 MiB/s | 27.19 c/B
    GCM auth | 1.30 ns/B | 732.4 MiB/s | 1.50 c/B
    OCB enc | 23.09 ns/B | 41.31 MiB/s | 26.60 c/B
    OCB dec | 23.21 ns/B | 41.09 MiB/s | 26.74 c/B
    OCB auth | 22.88 ns/B | 41.68 MiB/s | 26.36 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 28.76 ns/B | 33.17 MiB/s | 33.13 c/B
    ECB dec | 29.46 ns/B | 32.37 MiB/s | 33.94 c/B
    CBC enc | 25.45 ns/B | 37.48 MiB/s | 29.31 c/B
    CBC dec | 25.50 ns/B | 37.40 MiB/s | 29.38 c/B
    CFB enc | 25.39 ns/B | 37.56 MiB/s | 29.25 c/B
    CFB dec | 25.39 ns/B | 37.56 MiB/s | 29.25 c/B
    OFB enc | 29.62 ns/B | 32.19 MiB/s | 34.13 c/B
    OFB dec | 29.62 ns/B | 32.19 MiB/s | 34.13 c/B
    CTR enc | 25.67 ns/B | 37.15 MiB/s | 29.57 c/B
    CTR dec | 25.67 ns/B | 37.15 MiB/s | 29.57 c/B
    CCM enc | 51.11 ns/B | 18.66 MiB/s | 58.88 c/B
    CCM dec | 51.11 ns/B | 18.66 MiB/s | 58.88 c/B
    CCM auth | 25.56 ns/B | 37.32 MiB/s | 29.44 c/B
    GCM enc | 26.96 ns/B | 35.37 MiB/s | 31.06 c/B
    GCM dec | 26.98 ns/B | 35.35 MiB/s | 31.08 c/B
    GCM auth | 1.30 ns/B | 733.4 MiB/s | 1.50 c/B
    OCB enc | 26.45 ns/B | 36.05 MiB/s | 30.47 c/B
    OCB dec | 26.53 ns/B | 35.95 MiB/s | 30.56 c/B
    OCB auth | 26.24 ns/B | 36.34 MiB/s | 30.23 c/B
    =

  After:
    Cipher:
    AES | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 4.83 ns/B | 197.5 MiB/s | 5.56 c/B
    ECB dec | 4.99 ns/B | 191.1 MiB/s | 5.75 c/B
    CBC enc | 1.41 ns/B | 675.5 MiB/s | 1.63 c/B
    CBC dec | 0.911 ns/B | 1046.9 MiB/s | 1.05 c/B
    CFB enc | 1.30 ns/B | 732.2 MiB/s | 1.50 c/B
    CFB dec | 0.911 ns/B | 1046.7 MiB/s | 1.05 c/B
    OFB enc | 5.81 ns/B | 164.3 MiB/s | 6.69 c/B
    OFB dec | 5.81 ns/B | 164.3 MiB/s | 6.69 c/B
    CTR enc | 1.03 ns/B | 924.0 MiB/s | 1.19 c/B
    CTR dec | 1.03 ns/B | 924.1 MiB/s | 1.19 c/B
    CCM enc | 2.50 ns/B | 381.8 MiB/s | 2.88 c/B
    CCM dec | 2.50 ns/B | 381.7 MiB/s | 2.88 c/B
    CCM auth | 1.57 ns/B | 606.1 MiB/s | 1.81 c/B
    GCM enc | 2.33 ns/B | 408.5 MiB/s | 2.69 c/B
    GCM dec | 2.34 ns/B | 408.4 MiB/s | 2.69 c/B
    GCM auth | 1.30 ns/B | 732.1 MiB/s | 1.50 c/B
    OCB enc | 1.29 ns/B | 736.6 MiB/s | 1.49 c/B
    OCB dec | 1.32 ns/B | 724.4 MiB/s | 1.52 c/B
    OCB auth | 1.16 ns/B | 819.6 MiB/s | 1.34 c/B
    =
    AES192 | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 5.48 ns/B | 174.0 MiB/s | 6.31 c/B
    ECB dec | 5.64 ns/B | 169.0 MiB/s | 6.50 c/B
    CBC enc | 1.63 ns/B | 585.8 MiB/s | 1.88 c/B
    CBC dec | 1.02 ns/B | 935.8 MiB/s | 1.17 c/B
    CFB enc | 1.52 ns/B | 627.7 MiB/s | 1.75 c/B
    CFB dec | 1.02 ns/B | 935.9 MiB/s | 1.17 c/B
    OFB enc | 6.46 ns/B | 147.7 MiB/s | 7.44 c/B
    OFB dec | 6.46 ns/B | 147.7 MiB/s | 7.44 c/B
    CTR enc | 1.14 ns/B | 836.1 MiB/s | 1.31 c/B
    CTR dec | 1.14 ns/B | 835.9 MiB/s | 1.31 c/B
    CCM enc | 2.83 ns/B | 337.6 MiB/s | 3.25 c/B
    CCM dec | 2.82 ns/B | 338.0 MiB/s | 3.25 c/B
    CCM auth | 1.79 ns/B | 532.7 MiB/s | 2.06 c/B
    GCM enc | 2.44 ns/B | 390.3 MiB/s | 2.82 c/B
    GCM dec | 2.44 ns/B | 390.2 MiB/s | 2.82 c/B
    GCM auth | 1.30 ns/B | 731.9 MiB/s | 1.50 c/B
    OCB enc | 1.41 ns/B | 674.7 MiB/s | 1.63 c/B
    OCB dec | 1.44 ns/B | 662.0 MiB/s | 1.66 c/B
    OCB auth | 1.28 ns/B | 746.1 MiB/s | 1.47 c/B
    =
    AES256 | nanosecs/byte | mebibytes/sec | cycles/byte
    ECB enc | 6.13 ns/B | 155.5 MiB/s | 7.06 c/B
    ECB dec | 6.29 ns/B | 151.5 MiB/s | 7.25 c/B
    CBC enc | 1.85 ns/B | 516.8 MiB/s | 2.13 c/B
    CBC dec | 1.13 ns/B | 845.6 MiB/s | 1.30 c/B
    CFB enc | 1.74 ns/B | 549.5 MiB/s | 2.00 c/B
    CFB dec | 1.13 ns/B | 846.1 MiB/s | 1.30 c/B
    OFB enc | 7.11 ns/B | 134.2 MiB/s | 8.19 c/B
    OFB dec | 7.11 ns/B | 134.2 MiB/s | 8.19 c/B
    CTR enc | 1.25 ns/B | 763.5 MiB/s | 1.44 c/B
    CTR dec | 1.25 ns/B | 763.4 MiB/s | 1.44 c/B
    CCM enc | 3.15 ns/B | 302.9 MiB/s | 3.63 c/B
    CCM dec | 3.15 ns/B | 302.9 MiB/s | 3.63 c/B
    CCM auth | 2.01 ns/B | 474.2 MiB/s | 2.32 c/B
    GCM enc | 2.55 ns/B | 374.2 MiB/s | 2.94 c/B
    GCM dec | 2.55 ns/B | 373.7 MiB/s | 2.94 c/B
    GCM auth | 1.30 ns/B | 732.2 MiB/s | 1.50 c/B
    OCB enc | 1.54 ns/B | 617.6 MiB/s | 1.78 c/B
    OCB dec | 1.57 ns/B | 606.8 MiB/s | 1.81 c/B
    OCB auth | 1.40 ns/B | 679.8 MiB/s | 1.62 c/B
    =

  Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
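The improvement factors listed at the top of this commit message appear to be consistent with dividing the "Before" cycles/byte figure by the "After" figure for the same mode. This is an observation from the AES-128 numbers in this commit message, not something the commit states. The short Python sketch below reproduces a few of them and also shows how cycles/byte relates to ns/B at the stated 1152 MHz clock; the helper names are hypothetical.

```python
def speedup(before_cpb: float, after_cpb: float) -> float:
    """Improvement factor as a ratio of cycles/byte, rounded to 2 decimals."""
    return round(before_cpb / after_cpb, 2)


def cycles_per_byte(ns_per_byte: float, mhz: float) -> float:
    """Convert ns/B to cycles/B: ns * 1e-9 s * MHz * 1e6 Hz = ns * MHz / 1000."""
    return ns_per_byte * mhz / 1000.0


# AES-128 cycles/byte figures copied from the tables in this commit message.
BEFORE = {"CBC enc": 21.50, "CBC dec": 21.56, "CTR enc": 21.76, "OCB auth": 22.42}
AFTER = {"CBC enc": 1.63, "CBC dec": 1.05, "CTR enc": 1.19, "OCB auth": 1.34}

for mode in BEFORE:
    print(f"{mode}: {speedup(BEFORE[mode], AFTER[mode])}x")
```

Because the published ns/B values are themselves rounded, converting them back to cycles/byte can differ from the table in the last digit.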