author     Wang Yuan <wangyuan21@baidu.com>   2021-08-05 04:01:46 +0800
committer  GitHub <noreply@github.com>        2021-08-04 23:01:46 +0300
commit     d4bca53cd9879e0296bfa0a7c17df79dd52496ae (patch)
tree       a95ebc8a6afe01c030b259144919e9a6898597db /src/object.c
parent     56eb7f7de407c66b479005d3179fb36002099f4d (diff)
download   redis-d4bca53cd9879e0296bfa0a7c17df79dd52496ae.tar.gz
Use madvise(MADV_DONTNEED) to release memory to reduce COW (#8974)
## Background

As we know, after `fork`, one process copies pages when it writes to them (copy-on-write), while the other process still keeps the old pages, so together they consume more memory. For Redis this means the server uses a lot of extra memory while the fork child is serializing key/values, which may even cause OOM. But we found that in the fork child process, some memory doesn't need to be kept even though the parent may write to or update it; for example, the child will never access a key-value pair again once it has been serialized, while users may keep updating it in the parent. So we can reduce COW if the child process releases memory it no longer needs.

## Implementation

To release key/values in the child process, one might simply call `decrRefCount` to free the memory, but the fork child still uses a lot of memory even when no data is written to Redis, and the extra work slows down bgsave considerably. That is probably because the memory allocator doesn't really release the memory to the OS, and it modifies its internal data for every free operation, especially when freeing small objects. Moreover, CoW works at page granularity, so an easier way is to only free memory chunks that are not smaller than the kernel page size. `madvise(MADV_DONTNEED)` can quickly release the pages of a specified region to the OS, bypassing the memory allocator: the allocator still considers this memory in use and its internal data is not changed (a rough sketch of this page-level release follows the message).

There are some buffers we can release in the fork child process:

- **Serialized key-values**: the fork child never accesses key-values it has already serialized, so we try to free them. Because we can only release large chunks of memory, and iterating over all items/members/fields/entries of a complex data type is time consuming, we iterate over them and try to release them only when the average size of an item/member/field/entry is larger than the OS page size.
- **Replication backlog**: the replication backlog is a circular buffer that changes quickly when Redis has heavy write traffic, but the fork child process doesn't need to access it.
- **Client buffers**: if clients send requests while the fork child exists, the clients' buffers also change frequently. This memory includes the client query buffer, the output buffer, and the memory used by the client struct itself.

To get the child process peak private dirty memory, we need to track the peak memory rather than the last used value, because the child process may keep releasing memory (until now COW could only grow, so the last value was equivalent to the peak). We also add a new `current_cow_peak` info variable to complement the existing `current_cow_size`.

Co-authored-by: Oran Agra <oran@redislabs.com>
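The page-granular release helper itself is not part of the `src/object.c` hunk below (the added code calls `dismissMemory()`, which in turn relies on `zmadvise_dontneed()` defined elsewhere in the patch). A minimal, hypothetical sketch of the idea described above, not the commit's actual implementation, could look like this; `dismiss_region_sketch()` is an invented name:

```c
/* Hypothetical sketch: release the whole pages covered by an allocation back
 * to the OS with madvise(MADV_DONTNEED), without involving the allocator.
 * The commit's real helpers (dismissMemory()/zmadvise_dontneed()) differ in
 * detail. */
#define _DEFAULT_SOURCE   /* for madvise() on glibc */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

static void dismiss_region_sketch(void *ptr, size_t size) {
    if (ptr == NULL || size == 0) return;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* madvise() works on whole pages: round the region start up and its end
     * down to page boundaries. A region smaller than one page can't be
     * dismissed at all. */
    uintptr_t start = ((uintptr_t)ptr + page - 1) & ~((uintptr_t)page - 1);
    uintptr_t end   = ((uintptr_t)ptr + size) & ~((uintptr_t)page - 1);
    if (end <= start) return;

    /* The kernel may now drop these pages. The allocator's metadata is left
     * untouched, so from its point of view the block is still allocated. */
    madvise((void *)start, (size_t)(end - start), MADV_DONTNEED);
}
```

Nothing is freed in the allocator's sense: for private anonymous memory, later reads of a dismissed page simply return zero-filled pages, which is acceptable only because the fork child never touches this memory again.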
Diffstat (limited to 'src/object.c')
-rw-r--r--   src/object.c   163
1 file changed, 163 insertions, 0 deletions
diff --git a/src/object.c b/src/object.c
index 05831156c..e8c45e72a 100644
--- a/src/object.c
+++ b/src/object.c
@@ -377,6 +377,169 @@ void decrRefCount(robj *o) {
}
}
+/* See dismissObject() */
+void dismissSds(sds s) {
+ dismissMemory(sdsAllocPtr(s), sdsAllocSize(s));
+}
+
+/* See dismissObject() */
+void dismissStringObject(robj *o) {
+ if (o->encoding == OBJ_ENCODING_RAW) {
+ dismissSds(o->ptr);
+ }
+}
+
+/* See dismissObject() */
+void dismissListObject(robj *o, size_t size_hint) {
+ if (o->encoding == OBJ_ENCODING_QUICKLIST) {
+ quicklist *ql = o->ptr;
+ serverAssert(ql->len != 0);
+ /* We iterate all nodes only when average node size is bigger than a
+ * page size, and there's a high chance we'll actually dismiss something. */
+ if (size_hint / ql->len >= server.page_size) {
+ quicklistNode *node = ql->head;
+ while (node) {
+ if (quicklistNodeIsCompressed(node)) {
+ dismissMemory(node->zl, ((quicklistLZF*)node->zl)->sz);
+ } else {
+ dismissMemory(node->zl, node->sz);
+ }
+ node = node->next;
+ }
+ }
+ }
+}
+
+/* See dismissObject() */
+void dismissSetObject(robj *o, size_t size_hint) {
+ if (o->encoding == OBJ_ENCODING_HT) {
+ dict *set = o->ptr;
+ serverAssert(dictSize(set) != 0);
+ /* We iterate all nodes only when average member size is bigger than a
+ * page size, and there's a high chance we'll actually dismiss something. */
+ if (size_hint / dictSize(set) >= server.page_size) {
+ dictEntry *de;
+ dictIterator *di = dictGetIterator(set);
+ while ((de = dictNext(di)) != NULL) {
+ dismissSds(dictGetKey(de));
+ }
+ dictReleaseIterator(di);
+ }
+
+ /* Dismiss hash table memory. */
+ dismissMemory(set->ht[0].table, set->ht[0].size*sizeof(dictEntry*));
+ dismissMemory(set->ht[1].table, set->ht[1].size*sizeof(dictEntry*));
+ } else if (o->encoding == OBJ_ENCODING_INTSET) {
+ dismissMemory(o->ptr, intsetBlobLen((intset*)o->ptr));
+ }
+}
+
+/* See dismissObject() */
+void dismissZsetObject(robj *o, size_t size_hint) {
+ if (o->encoding == OBJ_ENCODING_SKIPLIST) {
+ zset *zs = o->ptr;
+ zskiplist *zsl = zs->zsl;
+ serverAssert(zsl->length != 0);
+ /* We iterate all nodes only when average member size is bigger than a
+ * page size, and there's a high chance we'll actually dismiss something. */
+ if (size_hint / zsl->length >= server.page_size) {
+ zskiplistNode *zn = zsl->tail;
+ while (zn != NULL) {
+ dismissSds(zn->ele);
+ zn = zn->backward;
+ }
+ }
+
+ /* Dismiss hash table memory. */
+ dict *d = zs->dict;
+ dismissMemory(d->ht[0].table, d->ht[0].size*sizeof(dictEntry*));
+ dismissMemory(d->ht[1].table, d->ht[1].size*sizeof(dictEntry*));
+ } else if (o->encoding == OBJ_ENCODING_ZIPLIST) {
+ dismissMemory(o->ptr, ziplistBlobLen((unsigned char*)o->ptr));
+ }
+}
+
+/* See dismissObject() */
+void dismissHashObject(robj *o, size_t size_hint) {
+ if (o->encoding == OBJ_ENCODING_HT) {
+ dict *d = o->ptr;
+ serverAssert(dictSize(d) != 0);
+ /* We iterate all fields only when average field/value size is bigger than
+ * a page size, and there's a high chance we'll actually dismiss something. */
+ if (size_hint / dictSize(d) >= server.page_size) {
+ dictEntry *de;
+ dictIterator *di = dictGetIterator(d);
+ while ((de = dictNext(di)) != NULL) {
+                /* Only dismiss the value's memory, since field sizes
+                 * are usually small. */
+ dismissSds(dictGetVal(de));
+ }
+ dictReleaseIterator(di);
+ }
+
+ /* Dismiss hash table memory. */
+ dismissMemory(d->ht[0].table, d->ht[0].size*sizeof(dictEntry*));
+ dismissMemory(d->ht[1].table, d->ht[1].size*sizeof(dictEntry*));
+ } else if (o->encoding == OBJ_ENCODING_ZIPLIST) {
+ dismissMemory(o->ptr, ziplistBlobLen((unsigned char*)o->ptr));
+ }
+}
+
+/* See dismissObject() */
+void dismissStreamObject(robj *o, size_t size_hint) {
+ stream *s = o->ptr;
+ rax *rax = s->rax;
+ if (raxSize(rax) == 0) return;
+
+    /* Iterate only over stream entries: size_hint may also include serialized
+     * consumer group info, but stream entries usually take up most of
+     * the space. */
+ if (size_hint / raxSize(rax) >= server.page_size) {
+ raxIterator ri;
+ raxStart(&ri,rax);
+ raxSeek(&ri,"^",NULL,0);
+ while (raxNext(&ri)) {
+ dismissMemory(ri.data, lpBytes(ri.data));
+ }
+ raxStop(&ri);
+ }
+}
+
+/* When creating a snapshot in a fork child process, the main process and child
+ * process share the same physical memory pages, and if / when the parent
+ * modifies any keys due to write traffic, it'll cause CoW which consumes
+ * physical memory. In the child process, after serializing the key and value,
+ * the data is definitely not accessed again, so to avoid unnecessary CoW, we
+ * try to release their memory back to the OS. See dismissMemory().
+ *
+ * Because of the cost of iterating over all nodes/fields/members/entries of
+ * complex data types, we iterate and dismiss them only when we estimate that
+ * the average size of an individual allocation is larger than the OS page size.
+ * 'size_hint' is the size of the serialized value. This method is not accurate,
+ * but it can reduce unnecessary iteration for complex data types that are
+ * probably not going to release any memory. */
+void dismissObject(robj *o, size_t size_hint) {
+ /* madvise(MADV_DONTNEED) may not work if Transparent Huge Pages is enabled. */
+ if (server.thp_enabled) return;
+
+    /* Currently we use zmadvise_dontneed only when we use jemalloc, so we
+     * avoid these pointless loops when they're not going to do anything. */
+#if defined(USE_JEMALLOC)
+ if (o->refcount != 1) return;
+ switch(o->type) {
+ case OBJ_STRING: dismissStringObject(o); break;
+ case OBJ_LIST: dismissListObject(o, size_hint); break;
+ case OBJ_SET: dismissSetObject(o, size_hint); break;
+ case OBJ_ZSET: dismissZsetObject(o, size_hint); break;
+ case OBJ_HASH: dismissHashObject(o, size_hint); break;
+ case OBJ_STREAM: dismissStreamObject(o, size_hint); break;
+ default: break;
+ }
+#else
+ UNUSED(o); UNUSED(size_hint);
+#endif
+}
+
/* This variant of decrRefCount() gets its argument as void, and is useful
* as free method in data structures that expect a 'void free_object(void*)'
* prototype for the free method. */