| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_avg4_neon: 27.2 23.7
vp9_avg8_neon: 56.5 54.7
vp9_avg16_neon: 169.9 167.4
vp9_avg32_neon: 585.8 585.2
vp9_avg64_neon: 2460.3 2294.7
vp9_avg_8tap_smooth_4h_neon: 132.7 125.2
vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0
vp9_avg_8tap_smooth_4v_neon: 126.0 93.7
vp9_avg_8tap_smooth_8h_neon: 241.7 234.2
vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5
vp9_avg_8tap_smooth_8v_neon: 245.0 205.5
vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1
vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1
vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1
vp9_put4_neon: 18.0 17.2
vp9_put8_neon: 40.2 37.7
vp9_put16_neon: 97.4 99.5
vp9_put32_neon/armv8: 346.0 307.4
vp9_put64_neon/armv8: 1319.0 1107.5
vp9_put_8tap_smooth_4h_neon: 126.7 118.2
vp9_put_8tap_smooth_4hv_neon: 465.7 434.0
vp9_put_8tap_smooth_4v_neon: 113.0 86.5
vp9_put_8tap_smooth_8h_neon: 229.7 221.6
vp9_put_8tap_smooth_8hv_neon: 658.9 621.3
vp9_put_8tap_smooth_8v_neon: 215.0 187.5
vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8
vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9
vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for ther larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
| |
2caa93b813adc5dbb7771dfe615da826a2947d18
|
|
|
|
| |
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
|
| |
This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.
|
|
|
|
|
| |
This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.
|
| |
|
|
|
|
| |
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
| |
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
|
|
| |
Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, ff_h264_idct_add_neon (originally in the arm version) used
a non-regular transpose in order to be able to use more instructions
that deal with registers as 128 bit register pairs. The aarch64
translation doesn't do it to the same extent, but brought along the
same structure since it was a straight translation.
This reshuffles ff_h264_idct_add_neon, bringing it closer to
the C implementation, making the transpose_4x4H macro do a regular
transpose, usable for other algorithms as well.
Previously, the third and fourth output from transpose_4x4H were
swapped, and prior to cc29d96d5a, the same inputs as well. In
addition to just swapping the outputs, also renumber the intermediate
registers for better readability (making the register order match
transpose_4x8B).
This runs with the same number of cycles as before.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
| |
|
| |
|
|
|
|
|
| |
They were superseded with their integer equivalents. Rename integer
decode_hf to decode_hf.
|
|
|
|
|
|
| |
Fix related register order issue in ff_h264_idct_add_neon.
Found-by: zjh8890 <243186085@qq.com>
|
|
|
|
|
|
|
|
|
|
| |
3% faster dts decoding on a cortex-a57.
cortex-a57 cortex-a53
int32_to_float_fmul_array8_c: 1270.9 4475.6
int32_to_float_fmul_array8_neon: 328.6 569.2
int32_to_float_fmul_scalar_c: 928.5 4119.6
int32_to_float_fmul_scalar_neon: 309.1 524.1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
~25% faster dts decoding overall. The checkasm CPU cycles numbers are
not that useful since synth_filter_float() calls FFTContext.imdct_half().
cortex-a57 cortex-a53
synth_filter_float_c: 1866.2 3490.9
synth_filter_float_neon: 915.0 1531.5
With fftc.imdct_half forced to imdct_half_neon:
cortex-a57 cortex-a53
synth_filter_float_c: 1718.4 3025.3
synth_filter_float_neon: 926.2 1530.1
|
|
|
|
|
|
|
|
|
|
|
|
| |
~2% faster dts decoding overall.
cortex-a57 cortex-a53
dca_decode_hf_c: 474.8 1659.9
dca_decode_hf_neon: 225.2 301.1
dca_lfe_fir0_c: 913.2 1537.7
dca_lfe_fir0_neon: 286.8 451.9
dca_lfe_fir1_c: 848.7 1711.5
dca_lfe_fir1_neon: 387.1 506.4
|
| |
|
| |
|
|
|
|
| |
It will be reused by the AAC decoder.
|
|
|
|
|
|
|
| |
This reverts commit c00365b46d464ce47716315c1801818d811bdb9a
in addition to using a different section.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
|
|
|
| |
This allows running the code on android, where 64 bit binaries with
text relocations aren't allowed to be loaded.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
|
|
|
| |
llvm's integrated assembler does not accept spaces as macro argument
delimiter when targeting darwin. Using a explicit delimiter is a good
idea in principle since it makes case like 'macro 4 -2' vs 'macro 4 - 2'
clear.
|
|
|
|
|
| |
Adapt commit 982b596ea6640bfe218a31f6c3fc542d9fe61c31 for the arm and
aarch64 NEON asm. 5-10% faster on Cortex-A9.
|
|
|
|
|
| |
Opus celt decoding 11% faster and the iMDCT over 2.5 times faster on
Apple's A7.
|
|
|
|
| |
Values are positive powers of two, so just replace it with right shift.
|
|
|
|
|
| |
From the ARMv7 NEON version. 16 times faster as the C version, overall
more than 12% faster vorbis decoding on Apple's A7.
|
|
|
|
|
| |
30%/25% (fixed/float) faster mp3 decoding on Apple's A7. The floating
point decoder is approximately 7% faster.
|
|
|
|
| |
Approximately as fast as the ARM NEON version on Apple's A7.
|
|
|
|
| |
Approximately as fast as the ARM NEON version on Apple's A7.
|
|
|
|
| |
8% faster h264 decoding on Apple A7.
|
|
|
|
| |
This is in line with how the top-level libavcodec Makefile is structured.
|
|
|
|
|
|
|
| |
Based on the x86 branchless get_cabac asm. get_cabac_noinline() gets
approximately 20% faster (no cycle counts available) compared to clang
from Xcode 5.1 beta5. More than 6% faster overall. A part of the overall
speedup might be explained by additional inlining of get_cabac().
|
|
|
|
| |
Based on e3fec3f095ab5ea08ee662942d98526aaf5e3635 for arm.
|
| |
|
|
|
|
| |
Ported from ARMv7 NEON.
|
|
|
|
| |
Ported from ARMv7 NEON.
|
|
|
|
| |
Ported from ARMv7 NEON.
|
|
|
|
| |
Ported from ARMv7 NEON.
|
|
|
|
| |
Ported from ARMv7 NEON.
|
|
Since RV40 and VC-1 use almost the same algorithm so optimizations for
those two decoders are easy to do and included.
|