doc/development/redis/new_redis_instance.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271

---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Add a new Redis instance

GitLab can make use of multiple [Redis instances](../redis.md#redis-instances).
These instances are functionally partitioned so that, for example, we
can store [CI trace chunks](../../administration/job_logs.md#incremental-logging-architecture)
from one Redis instance while storing sessions in another.

From time to time we might want to add a new Redis instance. Typically this will
be a functional partition split from one of the existing instances such as the
cache or shared state. This document describes an approach
for adding a new Redis instance that handles existing data, based on
prior examples:

- [Dedicated Redis instance for Trace Chunk storage](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/462).
- [Create dedicated Redis instance for Rate Limiting data](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/526).

This document does not cover the operational side of preparing and configuring
the new Redis instance in detail, but the example epics do contain information
on previous approaches to this.

## Step 1: Support configuring the new instance

Before we can switch any features to using the new instance, we have to support
configuring it and referring to it in the codebase. We must support the
main installation types:

- Source installs (including development environments) - [example MR](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/62767)
- Omnibus - [example MR](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5316)
- Helm charts - [example MR](https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2031)

### Fallback instance

In the application code, we need to define a fallback instance in case the new
instance is not configured. For example, if a GitLab instance has already
configured a separate shared state Redis, and we are partitioning data from the
shared state Redis, our new instance's configuration should default to that of
the shared state Redis when it's not present. Otherwise we could break instances
that don't configure the new Redis instance as soon as it's available.

You can [define a `.config_fallback` method](https://gitlab.com/gitlab-org/gitlab/-/blob/a75471dd744678f1a59eeb99f71fca577b155acd/lib/gitlab/redis/wrapper.rb#L69-87)
in `Gitlab::Redis::Wrapper` (the base class for all Redis instances)
that defines the instance to be used if this one is not configured. If we were
adding a `Foo` instance that should fall back to `SharedState`, we can do that
like this:

```ruby
module Gitlab
  module Redis
    class Foo < ::Gitlab::Redis::Wrapper
      # The data we store on Foo used to be stored on SharedState.
      def self.config_fallback
        SharedState
      end
    end
  end
end
```

We should also add specs like those in
[`trace_chunks_spec.rb`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/spec/lib/gitlab/redis/trace_chunks_spec.rb)
to ensure that this fallback works correctly.

## Step 2: Support writing to and reading from the new instance

When migrating to the new instance, we must account for cases where data is
either on:

- The 'old' (original) instance.
- The new one that we have just added support for.

As a result we may need to support reading from and writing to both
instances, depending on some condition.

The exact condition to use varies depending on the data to be migrated. For
the trace chunks case above, there was already a database column indicating where the
data was stored (as there are other storage options than Redis).

This step may not apply if the data has a very short lifetime (a few minutes at most)
and is not critical. In that case, we
may decide that it is OK to incur a small amount of data loss and switch
over through configuration only.

If there is not a more natural way to mark where the data is stored, using a
[feature flag](../feature_flags/index.md) may be convenient:

- It does not require an application restart to take effect.
- It applies to all application instances (Sidekiq, API, web, etc.) at
  the same time.
- It supports incremental rollout - ideally by actor (project, group,
  user, etc.) - so that we can monitor for errors and roll back easily.

## Step 3: Migrate the data

We then need to configure the new instance for GitLab.com's production and
staging environments. Hopefully it will be possible to test this change
effectively on staging, to at least make sure that basic usage continues to
work.

After that is done, we can roll out the change to production. Ideally this would
be in an incremental fashion, following the
[standard incremental rollout](../feature_flags/controls.md#rolling-out-changes)
documentation for feature flags.

When we have been using the new instance 100% of the time in production for a
while and there are no issues, we can proceed.

### Proposed solution: Migrate data by using MultiStore with the fallback strategy

We need a way to migrate users to a new Redis store without causing any inconveniences from UX perspective.
We also want the ability to fall back to the "old" Redis instance if something goes wrong with the new instance.

Migration Requirements:

- No downtime.
- No loss of stored data until the TTL for storing data expires.
- Partial rollout using Feature Flags or ENV vars or combinations of both.
- Monitoring of the switch.
- Prometheus metrics in place.
- Easy rollback without downtime in case the new instance or logic does not behave as expected.

It is somewhat similar to the zero-downtime DB table rename.
We need to write data into both Redis instances (old + new).
We read from the new instance, but we need to fall back to the old instance when pre-fetching from the new dedicated Redis instance that failed.
We need to log any issues or exceptions with a new instance, but still fall back to the old instance.

The proposed migration strategy is to implement and use the [MultiStore](https://gitlab.com/gitlab-org/gitlab/-/blob/fcc42e80ed261a862ee6ca46b182eee293ae60b6/lib/gitlab/redis/multi_store.rb).
We used this approach with [adding new dedicated Redis instance for session keys](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/579).
Also MultiStore comes with corresponding [specs](https://gitlab.com/gitlab-org/gitlab/-/blob/master/spec/lib/gitlab/redis/multi_store_spec.rb).

The MultiStore looks like a `redis-rb ::Redis` instance.

In the new Redis instance class you added in [Step 1](#step-1-support-configuring-the-new-instance),
override the [Redis](https://gitlab.com/gitlab-org/gitlab/-/blob/fcc42e80ed261a862ee6ca46b182eee293ae60b6/lib/gitlab/redis/sessions.rb#L20-28) method from the `::Gitlab::Redis::Wrapper`

```ruby
module Gitlab
  module Redis
    class Foo < ::Gitlab::Redis::Wrapper
      ...
      def self.redis
        # Don't use multistore if redis.foo configuration is not provided
        return super if config_fallback?

        primary_store = ::Redis.new(params)
        secondary_store = ::Redis.new(config_fallback.params)

        MultiStore.new(primary_store, secondary_store, store_name)
      end
    end
  end
end
```

MultiStore is initialized by providing the new Redis instance as a primary store, and [old (fallback-instance)](#fallback-instance) as a secondary store.
The third argument is `store_name` which is used for logs, metrics and feature flag names, in case we use MultiStore implementation for different Redis stores at the same time.

By default, the MultiStore reads and writes only from the default Redis store.
The default Redis store is `secondary_store` (the old fallback-instance).
This allows us to introduce MultiStore without changing the default behavior.

MultiStore uses two feature flags to control the actual migration:

- `use_primary_and_secondary_stores_for_[store_name]`
- `use_primary_store_as_default_for_[store_name]`

For example, if our new Redis instance is called `Gitlab::Redis::Foo`, we can [create](../feature_flags/index.md#create-a-new-feature-flag) two feature flags by executing:

```shell
bin/feature-flag use_primary_and_secondary_stores_for_foo
bin/feature-flag use_primary_store_as_default_for_foo
```

By enabling `use_primary_and_secondary_stores_for_foo` feature flag, our `Gitlab::Redis::Foo` will use `MultiStore` to write to both new Redis instance
and the [old (fallback-instance)](#fallback-instance).
If we fail to fetch data from the new instance, we will fallback and read from the old Redis instance.
We can monitor logs for `Gitlab::Redis::MultiStore::ReadFromPrimaryError`, and also the Prometheus counter `gitlab_redis_multi_store_read_fallback_total`.

For pipelined commands (`pipelined` and `multi`), we execute the entire operation in both stores and then compare the results. If they differ, we emit a
`Gitlab::Redis::MultiStore:PipelinedDiffError` error, and track it in the `gitlab_redis_multi_store_pipelined_diff_error_total` Prometheus counter.

Once we stop seeing those errors, this means that we are no longer relying on the data stored on the old Redis store.
At this point, we are probably safe to move the traffic to the new Redis store.

By enabling `use_primary_store_as_default_for_foo` feature flag, the `MultiStore` will use `primary_store` (new instance) as default Redis store.

Once this feature flag is enabled, we can disable `use_primary_and_secondary_stores_for_foo` feature flag.
This will allow the MultiStore to read and write only from the primary Redis store (new store), moving all the traffic to the new Redis store.

Once we have moved all our traffic to the primary store, our data migration is complete.
We can safely remove the MultiStore implementation and continue to use newly introduced Redis store instance.

#### Implementation details

MultiStore implements read and write Redis commands separately.

##### Read commands

- `get`
- `mget`
- `smembers`
- `scard`

##### Write commands

- `set`
- `setnx`
- `setex`
- `sadd`
- `srem`
- `del`
- `pipelined`
- `flushdb`
- `rpush`

##### Pipelined commands

**NOTE:** The Ruby block passed to these commands will be executed twice, once per each store.
Thus, excluding the Redis operations performed, the block should be idempotent.

- `pipelined`
- `multi`

When a command outside of the supported list is used, `method_missing` will pass it to the old Redis instance and keep track of it.
This ensures that anything unexpected behaves like it would before.

NOTE:
By tracking `gitlab_redis_multi_store_method_missing_total` counter and `Gitlab::Redis::MultiStore::MethodMissingError`,
a developer will need to add an implementation for missing Redis commands before proceeding with the migration.

##### Errors

| error                                             | message                                                                                     |
|---------------------------------------------------|---------------------------------------------------------------------------------------------|
| `Gitlab::Redis::MultiStore::ReadFromPrimaryError` | Value not found on the Redis primary store. Read from the Redis secondary store successful. |
| `Gitlab::Redis::MultiStore::PipelinedDiffError`   | Pipelined command executed on both stores successfully but results differ between them.     |
| `Gitlab::Redis::MultiStore::MethodMissingError`   | Method missing. Falling back to execute method on the Redis secondary store.                |

##### Metrics

| metrics name                                          | type               | labels                 | description                                            |
|-------------------------------------------------------|--------------------|------------------------|--------------------------------------------------------|
| `gitlab_redis_multi_store_read_fallback_total`        | Prometheus Counter | command, instance_name | Client side Redis MultiStore reading fallback total    |
| `gitlab_redis_multi_store_pipelined_diff_error_total` | Prometheus Counter | command, instance_name | Redis MultiStore pipelined command diff between stores |
| `gitlab_redis_multi_store_method_missing_total`       | Prometheus Counter | command, instance_name | Client side Redis MultiStore method missing total      |

## Step 4: clean up after the migration

<!-- markdownlint-disable MD044 -->
We may choose to keep the migration paths or remove them, depending on whether
or not we expect self-managed instances to perform this migration.
[gitlab-com/gl-infra/scalability#1131](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1131#note_603354746)
contains a discussion on this topic for the trace chunks feature flag. It may
be - as in that case - that we decide that the maintenance costs of supporting
the migration code are higher than the benefits of allowing self-managed
instances to perform this migration seamlessly, if we expect self-managed
instances to cope without this functional partition.
<!-- markdownlint-enable MD044 -->

If we decide to keep the migration code:

- We should document the migration steps.
- If we used a feature flag, we should ensure it's an 
  [ops type feature flag](../feature_flags/index.md#ops-type), as these are long-lived flags.

Otherwise, we can remove the flags and conclude the project.