delta/ffmpeg.git - git.ffmpeg.org: ffmpeg.git

	Commit message	Author	Age	Files	Lines
*	aarch64: h264idct: Use the offset parameter to movrel	Martin Storsjö	2016-11-10	1	-1/+1
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: vp9: Add NEON optimizations of VP9 MC functions	Martin Storsjö	2016-11-10	3	-0/+834
	This work is sponsored by, and copyright, Google.
	These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers.
	Examples of runtimes vs the 32 bit version, on a Cortex A53:
		ARM	AArch64
	vp9_avg4_neon:	27.2	23.7
	vp9_avg8_neon:	56.5	54.7
	vp9_avg16_neon:	169.9	167.4
	vp9_avg32_neon:	585.8	585.2
	vp9_avg64_neon:	2460.3	2294.7
	vp9_avg_8tap_smooth_4h_neon:	132.7	125.2
	vp9_avg_8tap_smooth_4hv_neon:	478.8	442.0
	vp9_avg_8tap_smooth_4v_neon:	126.0	93.7
	vp9_avg_8tap_smooth_8h_neon:	241.7	234.2
	vp9_avg_8tap_smooth_8hv_neon:	690.9	646.5
	vp9_avg_8tap_smooth_8v_neon:	245.0	205.5
	vp9_avg_8tap_smooth_64h_neon:	11273.2	11280.1
	vp9_avg_8tap_smooth_64hv_neon:	22980.6	22184.1
	vp9_avg_8tap_smooth_64v_neon:	11549.7	10781.1
	vp9_put4_neon:	18.0	17.2
	vp9_put8_neon:	40.2	37.7
	vp9_put16_neon:	97.4	99.5
	vp9_put32_neon/armv8:	346.0	307.4
	vp9_put64_neon/armv8:	1319.0	1107.5
	vp9_put_8tap_smooth_4h_neon:	126.7	118.2
	vp9_put_8tap_smooth_4hv_neon:	465.7	434.0
	vp9_put_8tap_smooth_4v_neon:	113.0	86.5
	vp9_put_8tap_smooth_8h_neon:	229.7	221.6
	vp9_put_8tap_smooth_8hv_neon:	658.9	621.3
	vp9_put_8tap_smooth_8v_neon:	215.0	187.5
	vp9_put_8tap_smooth_64h_neon:	10636.7	10627.8
	vp9_put_8tap_smooth_64hv_neon:	21076.8	21026.9
	vp9_put_8tap_smooth_64v_neon:	9635.0	9632.4
	These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster.
	The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for the larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit.
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813adc5dbb7771dfe615da826a2947d18	Diego Biurrun	2016-11-10	1	-2/+3
*	aarch64: Add missing sign extension in ff_h264_idct8_add_neon	Martin Storsjö	2016-10-10	1	-0/+1
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	mpegaudiodsp: Change type of array stride parameters to ptrdiff_t	Diego Biurrun	2016-09-29	1	-1/+0
	This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.
*	h264chroma: Change type of stride parameters to ptrdiff_t	Diego Biurrun	2016-09-29	4	-27/+24
	This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.
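The effect of the two stride changes above can be sketched in plain C. This is only an illustration, not code from either commit: `sum_column` is a hypothetical helper. With a `ptrdiff_t` stride the argument is already pointer-width when it arrives, so an address computation such as `src + y * stride` needs no widening step, and negative strides (bottom-up images) work directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of FFmpeg: sums the first byte of each
 * row. stride is the distance in bytes between consecutive rows and may
 * be negative when the image is stored bottom-up. */
unsigned sum_column(const uint8_t *src, ptrdiff_t stride, int rows)
{
    unsigned sum = 0;
    assert(rows >= 0);
    for (int y = 0; y < rows; y++) {
        /* y * stride is computed in ptrdiff_t, so the offset is already
         * pointer-width and correctly signed. */
        sum += src[(ptrdiff_t)y * stride];
    }
    return sum;
}
```

With an `int` stride, hand-written assembly receiving the same argument would have to extend it to 64 bits itself before using it in address arithmetic, which is the manual step the commit messages refer to.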
*	lavc: add clobber tests for the new encoding/decoding API	Anton Khirnov	2016-09-28	1	-0/+20
*	libavcodec: fix constness in clobber test avcodec_open2() wrappers	Clément Bœsch	2016-06-26	1	-1/+1
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	cosmetics: Fix spelling mistakes	Vittorio Giovara	2016-05-04	1	-1/+1
	Signed-off-by: Diego Biurrun <diego@biurrun.de>
*	build: miscellaneous cosmetics	Diego Biurrun	2016-04-07	1	-4/+13
	Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.
*	aarch64: Make transpose_4x4H do a regular transpose	Martin Storsjö	2016-03-26	2	-18/+18
	Previously, ff_h264_idct_add_neon (originally in the arm version) used a non-regular transpose in order to be able to use more instructions that deal with registers as 128 bit register pairs. The aarch64 translation doesn't do it to the same extent, but brought along the same structure since it was a straight translation.
	This reshuffles ff_h264_idct_add_neon, bringing it closer to the C implementation, making the transpose_4x4H macro do a regular transpose, usable for other algorithms as well.
	Previously, the third and fourth output from transpose_4x4H were swapped, and prior to cc29d96d5a, the same inputs as well. In addition to just swapping the outputs, also renumber the intermediate registers for better readability (making the register order match transpose_4x8B).
	This runs with the same number of cycles as before.
	Signed-off-by: Martin Storsjö <martin@martin.st>
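For reference, a "regular" 4x4 transpose in the sense used above can be modelled in scalar C as follows. This is a sketch under assumptions: the function name is hypothetical, the matrix is stored row-major in a flat array, and 16-bit elements are assumed on the guess that the H in transpose_4x4H denotes halfword lanes. The commit itself implements the operation on NEON registers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a regular 4x4 transpose: element (r, c) of the output
 * is element (c, r) of the input, with output rows in natural order
 * 0, 1, 2, 3. Hypothetical name; not the NEON macro itself. */
void transpose_4x4_i16(int16_t out[16], const int16_t in[16])
{
    assert(out != in);  /* this model does not support in-place use */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[4 * r + c] = in[4 * c + r];
}
```

In these terms, the earlier non-regular variant described in the message would have delivered output rows in the order 0, 1, 3, 2.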
*	fft: Split MDCT bits off from FFT	Diego Biurrun	2016-03-01	3	-12/+40
*	fft: arm: Drop unnecessary #include, add missing ones	Diego Biurrun	2016-02-26	1	-0/+3
*	dca: remove unused decode_hf function and quant_d tables	Alexandra Hájková	2015-12-24	2	-67/+0
	They were superseded with their integer equivalents. Rename integer decode_hf to decode_hf.
*	arm64: fix inverted register order in transpose_4x4H	Janne Grunau	2015-12-21	2	-4/+4
	Fix related register order issue in ff_h264_idct_add_neon.
	Found-by: zjh8890 <243186085@qq.com>
*	arm64: int32_to_float_fmul neon asm	Janne Grunau	2015-12-14	3	-0/+121
	3% faster dts decoding on a cortex-a57.
		cortex-a57	cortex-a53
	int32_to_float_fmul_array8_c:	1270.9	4475.6
	int32_to_float_fmul_array8_neon:	328.6	569.2
	int32_to_float_fmul_scalar_c:	928.5	4119.6
	int32_to_float_fmul_scalar_neon:	309.1	524.1
*	arm64: port synth_filter_float_neon from arm	Janne Grunau	2015-12-14	4	-1/+140
	~25% faster dts decoding overall. The checkasm CPU cycles numbers are not that useful since synth_filter_float() calls FFTContext.imdct_half().
		cortex-a57	cortex-a53
	synth_filter_float_c:	1866.2	3490.9
	synth_filter_float_neon:	915.0	1531.5
	With fftc.imdct_half forced to imdct_half_neon:
		cortex-a57	cortex-a53
	synth_filter_float_c:	1718.4	3025.3
	synth_filter_float_neon:	926.2	1530.1
*	arm64: convert dcadsp neon asm from arm	Janne Grunau	2015-12-14	3	-0/+222
	~2% faster dts decoding overall.
		cortex-a57	cortex-a53
	dca_decode_hf_c:	474.8	1659.9
	dca_decode_hf_neon:	225.2	301.1
	dca_lfe_fir0_c:	913.2	1537.7
	dca_lfe_fir0_neon:	286.8	451.9
	dca_lfe_fir1_c:	848.7	1711.5
	dca_lfe_fir1_neon:	387.1	506.4
*	h264: aarch64: intra prediction optimisations	Janne Grunau	2015-07-20	3	-0/+456
*	arm64: constify src in h264qpel dsp function definitions	Janne Grunau	2015-06-24	1	-64/+64
*	opus: Factor out imdct15 into a standalone component	Diego Biurrun	2015-02-02	3	-11/+12
	It will be reused by the AAC decoder.
*	aarch64: Use .data.rel.ro for const data with relocations	Martin Storsjö	2014-12-09	2	-28/+27
	This reverts commit c00365b46d464ce47716315c1801818d811bdb9a in addition to using a different section.
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: Make the function pointer tables position independent	Martin Storsjö	2014-11-16	2	-25/+26
	This allows running the code on android, where 64 bit binaries with text relocations aren't allowed to be loaded.
	Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: add ',' between assembler macro arguments where missing	Janne Grunau	2014-08-04	3	-6/+6
	llvm's integrated assembler does not accept spaces as macro argument delimiter when targeting darwin. Using an explicit delimiter is a good idea in principle since it makes cases like 'macro 4 -2' vs 'macro 4 - 2' clear.
*	h264: avoid using uninitialized memory in NEON chroma mc	Janne Grunau	2014-06-23	1	-4/+55
	Adapt commit 982b596ea6640bfe218a31f6c3fc542d9fe61c31 for the arm and aarch64 NEON asm. 5-10% faster on Cortex-A9.
*	aarch64: opus NEON iMDCT and FFT	Janne Grunau	2014-05-15	4	-0/+724
	Opus celt decoding 11% faster and the iMDCT over 2.5 times faster on Apple's A7.
*	aarch64: assembler in clang-3.4 ignores the division by two	Janne Grunau	2014-05-13	1	-1/+1
	Values are positive powers of two, so just replace it with right shift.
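The equivalence this workaround relies on can be checked in C. The sketch below is an illustration only, with hypothetical helper names; the actual change is in an assembler expression, not in C. For a power of two of at least 2, a right shift by one gives exactly the same result as division by two.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if n has exactly one bit set. */
int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: halves a power of two with a shift instead of a
 * division. The precondition mirrors the commit message's assumption
 * that the values are positive powers of two. */
uint32_t half_pow2(uint32_t n)
{
    assert(is_pow2(n) && n >= 2);
    return n >> 1;
}
```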
*	aarch64: NEON vorbis_inverse_coupling	Janne Grunau	2014-04-22	3	-0/+119
	From the ARMv7 NEON version. 16 times faster than the C version, overall more than 12% faster vorbis decoding on Apple's A7.
*	aarch64: NEON fixed/floating point MPADSP apply_window	Janne Grunau	2014-04-22	3	-0/+267
	30%/25% (fixed/float) faster mp3 decoding on Apple's A7. The floating point decoder is approximately 7% faster.
*	aarch64: NEON float (i)MDCT	Janne Grunau	2014-04-22	3	-0/+334
	Approximately as fast as the ARM NEON version on Apple's A7.
*	aarch64: NEON float FFT	Janne Grunau	2014-04-22	3	-0/+481
	Approximately as fast as the ARM NEON version on Apple's A7.
*	aarch64: implement videodsp.prefetch	Janne Grunau	2014-04-06	3	-0/+63
	8% faster h264 decoding on Apple A7.
*	build: Group general components separate from de/encoders in arch Makefiles	Diego Biurrun	2014-03-20	1	-0/+1
	This is in line with how the top-level libavcodec Makefile is structured.
*	aarch64: get_cabac inline asm	Janne Grunau	2014-03-09	1	-0/+104
	Based on the x86 branchless get_cabac asm. get_cabac_noinline() gets approximately 20% faster (no cycle counts available) compared to clang from Xcode 5.1 beta5. More than 6% faster overall. A part of the overall speedup might be explained by additional inlining of get_cabac().
*	aarch64: use EXTERN_ASM consistently for exported symbols	Janne Grunau	2014-02-20	1	-8/+8
	Based on e3fec3f095ab5ea08ee662942d98526aaf5e3635 for arm.
*	aarch64: port neon clobber test from arm	Janne Grunau	2014-01-15	2	-0/+80
*	aarch64: h264 (bi)weight NEON optimizations	Janne Grunau	2014-01-15	2	-0/+264
	Ported from ARMv7 NEON.
*	aarch64: h264 loop filter NEON optimizations	Janne Grunau	2014-01-15	4	-1/+299
	Ported from ARMv7 NEON.
*	aarch64: hpeldsp NEON optimizations	Janne Grunau	2014-01-15	4	-5/+528
	Ported from ARMv7 NEON.
*	aarch64: h264 qpel NEON optimizations	Janne Grunau	2014-01-15	4	-0/+1172
	Ported from ARMv7 NEON.
*	aarch64: h264 idct NEON assembler optimizations	Janne Grunau	2014-01-15	4	-0/+533
	Ported from ARMv7 NEON.
*	aarch64: h264 chroma motion compensation NEON optimizations	Janne Grunau	2014-01-15	5	-0/+561
	RV40 and VC-1 use almost the same algorithm, so optimizations for those two decoders are easy to do and are included.