author | Bogdan Andone <bogdan.andone@intel.com> | 2015-07-29 14:19:04 +0300 |
---|---|---|
committer | Bogdan Andone <bogdan.andone@intel.com> | 2015-07-29 14:51:57 +0300 |
commit | 68185bafbe2c7ab025703917d259c4c19ce456eb (patch) | |
tree | 7925372b64ce2f9ac0ff88079384cbd76b625569 | |
parent | 4e66cce87ce0e57a7394486412e61abcfc5f3520 (diff) | |
download | php-git-68185bafbe2c7ab025703917d259c4c19ce456eb.tar.gz |
opcache: Patch SSE based fast_memcpy() implementation
Use _mm_store_si128() instead of _mm_stream_si128(). This keeps the copied memory
in the data cache, which helps because the interpreter starts using this
data right away and does not need to go back to memory. _mm_stream* is intended for
stores where we want to avoid reading data into the cache and polluting it;
in our scenario, keeping the data in the cache appears to have a positive impact.
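The difference between the two store flavors can be sketched as follows. This is a minimal illustration, not the opcache code: `copy_cached()` and `copy_streaming()` are hypothetical names, and both assume 16-byte-aligned pointers and a nonzero size that is a multiple of 64 bytes.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy with regular stores. The written lines are
 * allocated in the data cache, so a reader that touches dest soon after
 * the copy is likely to hit in cache. */
static void copy_cached(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(size != 0 && size % 64 == 0);

    do {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        s += 4;
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
        d += 4;
    } while (s != end);
}

/* Hypothetical helper: copy with non-temporal stores. The data is written
 * with a hint to bypass the cache, which avoids cache pollution but means
 * a subsequent read of dest typically has to go to memory. */
static void copy_streaming(void *dest, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dest;
    const __m128i *s = (const __m128i *)src;
    const __m128i *end = (const __m128i *)((const char *)src + size);

    assert(((uintptr_t)dest & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(size != 0 && size % 64 == 0);

    do {
        __m128i x0 = _mm_load_si128(s + 0);
        __m128i x1 = _mm_load_si128(s + 1);
        __m128i x2 = _mm_load_si128(s + 2);
        __m128i x3 = _mm_load_si128(s + 3);
        s += 4;
        _mm_stream_si128(d + 0, x0);
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);
        d += 4;
    } while (s != end);

    /* Non-temporal stores are weakly ordered; fence before others read dest. */
    _mm_sfence();
}
```

Both helpers produce the same bytes in dest; they differ only in where the data ends up in the memory hierarchy, which is what matters when the copied data is read again right away.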
Tests on WordPress 4.1 show ~1% performance increase with fast_memcpy() in place
versus standard memcpy() when running php-cgi -T10000 wordpress/index.php.
I also added SW prefetching on target memory, but its contribution is almost negligible:
the prefetched address is used within a couple of cycles (at the next iteration),
while the data from memory only becomes available after >100 cycles.
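The timing argument can be made concrete with a little arithmetic. The sketch below uses a hypothetical helper, `prefetch_distance_iters()`, and illustrative cycle counts that are assumptions rather than measurements from this commit. It estimates how many loop iterations ahead a prefetch would have to be issued to hide the memory latency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of loop iterations a prefetch must run ahead
 * of its use so that the line has arrived by the time it is needed.
 * Computes ceil(latency_cycles / cycles_per_iter). */
static size_t prefetch_distance_iters(size_t latency_cycles, size_t cycles_per_iter)
{
    assert(cycles_per_iter != 0);
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Hypothetical helper: the same distance in bytes, given that each
 * iteration of the loop copies bytes_per_iter bytes. */
static size_t prefetch_distance_bytes(size_t latency_cycles, size_t cycles_per_iter,
                                      size_t bytes_per_iter)
{
    return prefetch_distance_iters(latency_cycles, cycles_per_iter) * bytes_per_iter;
}
```

Assuming, for illustration only, 100 cycles of memory latency and 4 cycles per 64-byte iteration, the prefetch would need to run about 25 iterations (1600 bytes) ahead. The loop here prefetches only one iteration (64 bytes) ahead, which is consistent with the small observed effect.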
-rw-r--r-- | ext/opcache/zend_accelerator_util_funcs.c | 9 |
1 files changed, 5 insertions, 4 deletions
diff --git a/ext/opcache/zend_accelerator_util_funcs.c b/ext/opcache/zend_accelerator_util_funcs.c
index e20f3d16f6..cfb03a00e4 100644
--- a/ext/opcache/zend_accelerator_util_funcs.c
+++ b/ext/opcache/zend_accelerator_util_funcs.c
@@ -658,16 +658,17 @@ static zend_always_inline void fast_memcpy(void *dest, const void *src, size_t s
 
 	do {
 		_mm_prefetch(dqsrc + 4, _MM_HINT_NTA);
+		_mm_prefetch(dqdest + 4, _MM_HINT_T0);
 
 		__m128i xmm0 = _mm_load_si128(dqsrc + 0);
 		__m128i xmm1 = _mm_load_si128(dqsrc + 1);
 		__m128i xmm2 = _mm_load_si128(dqsrc + 2);
 		__m128i xmm3 = _mm_load_si128(dqsrc + 3);
 		dqsrc += 4;
-		_mm_stream_si128(dqdest + 0, xmm0);
-		_mm_stream_si128(dqdest + 1, xmm1);
-		_mm_stream_si128(dqdest + 2, xmm2);
-		_mm_stream_si128(dqdest + 3, xmm3);
+		_mm_store_si128(dqdest + 0, xmm0);
+		_mm_store_si128(dqdest + 1, xmm1);
+		_mm_store_si128(dqdest + 2, xmm2);
+		_mm_store_si128(dqdest + 3, xmm3);
 		dqdest += 4;
 	} while (dqsrc != end);
 }