doc/development/transient/prevention-patterns.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141

---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Preventing Transient Bugs

This page will cover architectural patterns and tips for developers to follow to prevent [transient bugs.](https://about.gitlab.com/handbook/engineering/quality/issue-triage/#transient-bugs)

## Common root causes

We've noticed a few root causes that come up frequently when addressing transient bugs. 

- Needs better state management in the backend or frontend.
- Frontend code needs improvements.
- Lack of test coverage.
- Race conditions.

## Frontend

### Don't rely on response order

When working with multiple requests, it's easy to assume the order of the responses will match the order in which they are triggered.

That's not always the case and can cause bugs that only happen if the order is switched.

**Example:**

- `diffs_metadata.json` (lighter)
- `diffs_batch.json` (heavier)

If your feature requires data from both, ensure that the two have finished loading before working on it.

### Simulate slower connections when testing manually

Add a network condition template to your browser's developer tools to enable you to toggle between a slow and a fast connection.

**Example:**

- Turtle:
  - Down: 50kb/s
  - Up: 20kb/s
  - Latency: 10000ms

### Collapsed elements

When setting event listeners, if not possible to use event delegation, ensure all relevant event listeners are set for expanded content.

Including when that expanded content is:

- **Invisible** (`display: none;`). Some JavaScript requires the element to be visible to work properly, such as when taking measurements.
- **Dynamic content** (AJAX/DOM manipulation).

### Using assertions to detect transient bugs caused by unmet conditions

Transient bugs happen in the context of code that executes under the assumption
that the application’s state meets one or more conditions. We may write a feature
that assumes a server-side API response always include a group of attributes or that
an operation only executes when the application has successfully transitioned to a new
state.

Transient bugs are difficult to debug because there isn’t any mechanism that alerts
the user or the developer about unsatisfied conditions. These conditions are usually
not expressed explicitly in the code. A useful debugging technique for such situations
is placing assertions to make any assumption explicit. They can help detect the cause
which unmet condition causes the bug.

#### Asserting pre-conditions on state mutations

A common scenario that leads to transient bugs is when there is a polling service
that should mutate state only if a user operation is completed. We can use
assertions to make this pre-condition explicit:

```javascript
// This action is called by a polling service. It assumes that all pre-conditions
// are satisfied by the time the action is dispatched.
export const updateMergeableStatus = ({ commit }, payload) => {
  commit(types.SET_MERGEABLE_STATUS, payload);
};

// We can make any pre-condition explicit by adding an assertion
export const updateMergeableStatus = ({ state, commit }, payload) => {
  console.assert(
    state.isResolvingDiscussion === true,
    'Resolve discussion request must be completed before updating mergeable status'
  );
  commit(types.SET_MERGEABLE_STATUS, payload);
};
```

#### Asserting API contracts

Another useful way of using assertions is to detect if the response payload returned
by the server-side endpoint satisfies the API contract.

#### Related reading

[Debug it!](https://pragprog.com/titles/pbdp/debug-it/) explores techniques to diagnose
and fix non-determinstic bugs and write software that is easier to debug.

## Backend

### Sidekiq jobs with locks

When dealing with asynchronous work via Sidekiq, it is possible to have 2 jobs with the same arguments
getting worked on at the same time. If not handled correctly, this can result in an outdated or inaccurate state.

For instance, consider a worker that updates a state of an object. Before the worker updates the state
(for example, `#update_state`) of the object, it needs to check what the appropriate state should be
(for example, `#check_state`).

When there are 2 jobs being worked on at the same time, it is possible that the order of operations will go like:

1. (Worker A) Calls `#check_state`
1. (Worker B) Calls `#check_state`
1. (Worker B) Calls `#update_state`
1. (Worker A) Calls `#update_state`

In this example, `Worker B` is meant to set the updated status. But `Worker A` calls `#update_state` a little too late.

This can be avoided by utilizing either database locks or `Gitlab::ExclusiveLease`. This way, jobs will be
worked on one at a time. This also allows them to be marked as [idempotent](../sidekiq_style_guide.md#idempotent-jobs).

### Retry mechanism handling

There are times that an object/record will be on a failed state which can be rechecked.

If an object is in a state that can be rechecked, ensure that appropriate messaging is shown to the user
so they know what to do. Also, make sure that the retry functionality will be able to reset the state
correctly when triggered.

### Error Logging

Error logging doesn't necessarily directly prevents transient bugs but it can help to debug them.

When coding, sometimes we expect some exceptions to be raised and we rescue them.

Logging whenever we rescue an error helps in case it's causing transient bugs that a user may see.
While investigating a bug report, it may require the engineer to look into logs of when it happened.
Seeing an error being logged can be a signal of something that went wrong which can be handled differently.