author: Lynne <dev@lynne.ee> 2022-02-03 11:27:03 +0000
committer: Lynne <dev@lynne.ee> 2022-08-25 17:40:28 +0200
commit: f932b89ea3ef0b8a31077c839371fe7fa2b2baa3
tree: 95954a33fa30355e036dc7b0c39d9a7d1e0487f2 /libavutil/tx.c
parent: 9bf9d42d013bb6ff17aca90d63e2e257a50a6dc3
lavu/tx: implement aarch64 NEON SIMD FFT

The fastest fast Fourier transform in not just the west, but the world,
now for the most popular toy ISA.

On a high level, it follows the design of the AVX2 version closely,
with the exception that the input is slightly less permuted as we don't have
to do lane switching with the input on double 4pt and 8pt.

On a low level, the lack of subadd/addsub instructions REALLY penalizes
any attempt at writing an FFT. That single register matters a lot,
and reloading it simply takes unacceptably long.

In x86 land, vendors would've noticed developers need this.
In ARM land, you get a badly designed complex multiplication instruction
we cannot use, that's not present on 95% of devices. Because only
compilers matter, right?

Future optimization options are very few, perhaps better register
management to use more ld1/st1s.

All timings below are in cycles:

A53:
Length |      C      | New (lavu)  | Old (lavc)  | FFTW
------ |-------------|-------------|-------------|-----
     4 |         842 |         420 |        1210 | 1460
     8 |        1538 |        1020 |        1850 | 2520
    16 |        3717 |        1900 |        3700 | 3990
    32 |        9156 |        4070 |        8289 | 8860
    64 |       21160 |        9931 |       18600 | 19625
   128 |       49180 |       23278 |       41922 | 41922
   256 |      112073 |       53876 |       93202 | 101092
   512 |      252864 |      122884 |      205897 | 207868
  1024 |      560512 |      278322 |      458071 | 453053
  2048 |     1295402 |      775835 |     1038205 | 1020265
  4096 |     3281263 |     2021221 |     2409718 | 2577554
  8192 |     8577845 |     4780526 |     5673041 | 6802722

Apple M1
New  - Total for len 512 reps 2097152 = 1.459141 s
Old  - Total for len 512 reps 2097152 = 2.251344 s
FFTW - Total for len 512 reps 2097152 = 1.868429 s

New  - Total for len 1024 reps 4194304 = 6.490080 s
Old  - Total for len 1024 reps 4194304 = 9.604949 s
FFTW - Total for len 1024 reps 4194304 = 7.889281 s

New  - Total for len 16384 reps 262144 = 10.374001 s
Old  - Total for len 16384 reps 262144 = 15.266713 s
FFTW - Total for len 16384 reps 262144 = 12.341745 s

New  - Total for len 65536 reps 8192 = 1.769812 s
Old  - Total for len 65536 reps 8192 = 4.209413 s
FFTW - Total for len 65536 reps 8192 = 3.012365 s

New  - Total for len 131072 reps 4096 = 1.942836 s
Old  - Segfaults
FFTW - Total for len 131072 reps 4096 = 3.713713 s

Thanks to wbs for some simplifications, assembler fixes and a review
and to jannau for giving it a look.

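The remark about missing subadd/addsub instructions can be illustrated with a scalar sketch. Without a fused instruction, one common workaround is to keep a sign-mask constant around, XOR it into one operand to negate selected lanes, and then do a plain add. The code below assumes that approach and borrows the lane pattern of x86's ADDSUBPS (even lanes subtract, odd lanes add). The helper names `xor_sign` and `addsub4` are hypothetical, and nothing here is taken from the NEON assembly in this commit, which does not spell out its method.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: flip sign bits of a float by XOR-ing its
 * bit pattern with a mask (0x80000000 negates, 0 leaves it alone). */
static float xor_sign(float v, uint32_t mask)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof(bits));
    bits ^= mask;
    memcpy(&v, &bits, sizeof(v));
    return v;
}

/* Emulated 4-lane "addsub": even lanes compute a - b, odd lanes a + b.
 * The sign_mask array plays the role of the extra constant register
 * that a native addsub instruction would make unnecessary. */
static void addsub4(float out[4], const float a[4], const float b[4])
{
    static const uint32_t sign_mask[4] = { 0x80000000u, 0, 0x80000000u, 0 };

    for (int i = 0; i < 4; i++)
        out[i] = a[i] + xor_sign(b[i], sign_mask[i]);
}
```

In a SIMD version the mask would occupy a vector register for the whole transform, which may be what "that single register" refers to; that reading is an assumption.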
Diffstat (limited to 'libavutil/tx.c')
-rw-r--r-- libavutil/tx.c | 3
1 file changed, 3 insertions, 0 deletions
diff --git a/libavutil/tx.c b/libavutil/tx.c
index c90ca509f5..28e49a5d41 100644
--- a/libavutil/tx.c
+++ b/libavutil/tx.c
@@ -458,6 +458,9 @@ av_cold int ff_tx_init_subtx(AVTXContext *s, enum AVTXType type,
 #if HAVE_X86ASM
         ff_tx_codelet_list_float_x86,
 #endif
+#if ARCH_AARCH64
+        ff_tx_codelet_list_float_aarch64,
+#endif
     };
     int codelet_list_num = FF_ARRAY_ELEMS(codelet_list);
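
The hunk above adds one more per-architecture list to an array of codelet lists, guarded by a preprocessor check. A minimal standalone sketch of that registration pattern follows. Every name in it (`Codelet`, `DEMO_ARCH_AARCH64`, `all_lists`, `count_codelets`) is a hypothetical stand-in and not FFmpeg's real type, macro, or function, and NULL-terminated lists are an assumption made for the sketch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a codelet descriptor. */
typedef struct Codelet {
    const char *name;
} Codelet;

/* Hypothetical build-time switch; a real build system would define it. */
#ifndef DEMO_ARCH_AARCH64
#define DEMO_ARCH_AARCH64 0
#endif

#define ARRAY_ELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Portable C codelets are always available. */
static const Codelet c_fft = { "fft_c" };
static const Codelet * const list_c[] = { &c_fft, NULL };

#if DEMO_ARCH_AARCH64
/* Architecture-specific codelets exist only when the switch is on. */
static const Codelet neon_fft = { "fft_neon" };
static const Codelet * const list_aarch64[] = { &neon_fft, NULL };
#endif

/* Array of lists: each optional entry is compiled in only on its target. */
static const Codelet * const * const all_lists[] = {
    list_c,
#if DEMO_ARCH_AARCH64
    list_aarch64,
#endif
};

/* Walk every registered list and count the codelets it contains. */
static size_t count_codelets(void)
{
    size_t n = 0;

    for (size_t i = 0; i < ARRAY_ELEMS(all_lists); i++)
        for (const Codelet * const *cd = all_lists[i]; *cd; cd++)
            n++;
    return n;
}
```

Because the array length is taken with an `ARRAY_ELEMS`-style macro, code that iterates over the lists needs no changes when a new guarded entry is added, which would explain why the diff is only three lines.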