diff options
author | Marko Mäkelä <marko.makela@mariadb.com> | 2020-09-04 10:31:41 +0300 |
---|---|---|
committer | Marko Mäkelä <marko.makela@mariadb.com> | 2020-09-04 10:31:41 +0300 |
commit | 24f510bba4f0c3236aa5a8c96c209eb71317f4fe (patch) | |
tree | 8dd7bee940c7a5340c3f375f198572dc1ec91423 /include | |
parent | 1cda462f46305daf2a5becb1ed0ce4fcdf3ae404 (diff) | |
download | mariadb-git-24f510bba4f0c3236aa5a8c96c209eb71317f4fe.tar.gz |
MDEV-23633 MY_RELAX_CPU performs unnecessary compare-and-swap on ARM
This follows up MDEV-14374, which was filed against MariaDB Server 10.3.
Back then, on a 48-core Qualcomm Centriq 2400, the performance of
delay loops for spinloops was tested both with and without the dummy
compare-and-swap operation, and it was decided to keep the dummy
operation.
On target architectures where nothing special is available (other than
x86 (IA-32, AMD64) or POWER), we perform a dummy compare-and-swap operation.
This is contrary to the idea of the x86 PAUSE instruction and the
__ppc_get_timebase(), which aim to keep the memory bus idle for a while,
to allow other cores to better execute code while a spinloop is waiting
for something to be changed.
On MariaDB Server 10.4 and another implementation of the ARMv8 ISA,
omitting the dummy compare-and-swap improved performance by up to 12%.
So, let us avoid the dummy compare-and-swap on ARM.
For now, we are retaining the dummy compare-and-swap on other ISAs
(such as SPARC, MIPS, S390x, RISC-V) because we do not have any
performance data for them.
Diffstat (limited to 'include')
-rw-r--r-- | include/my_cpu.h | 4 |
1 files changed, 4 insertions, 0 deletions
diff --git a/include/my_cpu.h b/include/my_cpu.h index b7d7008a8e3..0b51d3ef90f 100644 --- a/include/my_cpu.h +++ b/include/my_cpu.h @@ -53,6 +53,7 @@ #ifdef _WIN32 #elif defined HAVE_PAUSE_INSTRUCTION #elif defined(_ARCH_PWR8) +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__) #else # include "my_atomic.h" #endif @@ -80,6 +81,9 @@ static inline void MY_RELAX_CPU(void) #endif #elif defined(_ARCH_PWR8) __ppc_get_timebase(); +#elif defined __GNUC__ && (defined __arm__ || defined __aarch64__) + /* Mainly, prevent the compiler from optimizing away delay loops */ + __asm__ __volatile__ ("":::"memory") #else int32 var, oldval = 0; my_atomic_cas32_strong_explicit(&var, &oldval, 1, MY_MEMORY_ORDER_RELAXED, |