chromium/testing/libfuzzer/efficient_fuzzer.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302

# Efficient Fuzzer Guide

This document describes ways to determine efficiency of a fuzz target and ways
to improve it.

## Overview

Being a coverage-driven fuzzing engine, libFuzzer considers a certain input
*interesting* if it results in new code coverage, i.e. it reaches a code that
has not been reached before. The set of all interesting inputs is called
*corpus*.

Items in corpus are constantly mutated in search of new interesting inputs.
Corpus can be shared across fuzzer runs and grows over time as new code is
reached.

There are several metrics you should look at to determine effectiveness of your
fuzz target:

* [Execution Speed](#Execution-Speed)
* [Code Coverage](#Code-Coverage)
* [Corpus Size](#Corpus-Size)

You can collect these metrics manually or take them from [ClusterFuzz status]
pages after a fuzz target is checked in Chromium repository.

The following things are extremely useful for improving fuzzing efficiency, so
we *strongly recommend* them for any fuzz target:

* [Seed Corpus](#Seed-Corpus)
* [Fuzzer Dictionary](#Fuzzer-Dictionary)

There are other ways that are useful in some cases, but not always applicable:
* [Custom Options](#Custom-Options)
* [Custom Build](#Custom-Build)


## Execution Speed

Fuzz target speed is calculated in executions per second. It is printed while a
fuzz target is running:

```
#19346  NEW    cov: 2815 bits: 1082 indir: 43 units: 150 exec/s: 19346 L: 62
```

Because libFuzzer performs randomized mutations, it is critical to have it run
as fast as possible to navigate through the large search space efficiently and
find interesting code paths. You should try to get to at least 1,000 exec/s from
your fuzz target locally before submitting it to the Chromium repository.


### Initialization/Cleanup

Try to keep `LLVMFuzzerTestOneInput` function as simple as possible. If your
fuzzing function is too complex, it can bring down fuzzer execution speed OR it
can target very specific usecases and fail to account for unexpected scenarios.

Prefer to use static initialization and shared resources rather than performing
setup and teardown on every single input. Checkout example on
[startup initialization] in libFuzzer documentation.

You can skip freeing static resources. However, all resources allocated within
`LLVMFuzzerTestOneInput` function should be de-allocated since this function is
called millions of times during a fuzzing session. Otherwise, we will hit OOMs
frequently and reduce overall fuzzing efficiency.


### Memory Usage

Avoid allocation of dynamic memory wherever possible. Memory instrumentation
works faster for stack-based and static objects, than for heap allocated ones.

It is always a good idea to try different variants for your fuzz target locally,
and then submit the fastest implementation.


## Code Coverage

[ClusterFuzz status] page provides source-level coverage report for fuzz targets
from recent runs. Looking at the report might provide an insight on how to
improve code coverage of a fuzz target.

You can also generate source-level coverage report locally by running the
[coverage script] stored in Chromium repository. The script provides detailed
instructions as well as an usage example.

Note that code coverage of a fuzz target **depends heavily** on the corpus
provided when running the target, i.e. code coverage report generated by a fuzz
target launched without any corpus would not make much sense.

We encourage you to try out the [coverage script], as it usually generates a better
code coverage visualization compared to the coverage report hosted on
ClusterFuzz.
*NOTE: This is an experimental feature and an active area of work. We are
working on improving this process.*


## Corpus Size

After running for a while, a fuzz target would reach a plateau and may stop
discovering new interesting inputs. Corpus for a reasonably complex target
should contain hundreds (if not thousands) of items.

Too small of a corpus size indicates that fuzz target is hitting a code barrier
and is unable to get past it. Common cases of such issues include: checksums,
magic numbers, etc. The easiest way to diagnose this problem is to generate and
analyze a [coverage report](#Coverage). To fix the issue, you can:

* Change the code (e.g. disable crc checks while fuzzing), see [Custom Build](#Custom-Build).
* Prepare or improve [seed corpus](#Seed-Corpus).
* Prepare or improve [fuzzer dictionary](#Fuzzer-Dictionary).
* Add [custom options](#Custom-Options).


## Seed Corpus

Seed corpus is a set of *valid* and *interesting* inputs that serve as starting
points for a fuzz target. If one is not provided, a fuzzing engine would have to
guess these inputs from scratch, which can take an indefinite amount of time
depending on size of the inputs and complexity of the target format.

Seed corpus works especially well for strictly defined file formats and data
transmission protocols.

* For file format parsers, add valid files from your test suite.
* For protocol parsers, add valid raw streams from test suite into separate files.

Other examples include a graphics library seed corpus, which would be a variety
of small PNG/JPG/GIF files.

If you are running a fuzz target locally, you can pass a corpus directory as an
argument:

```
./out/libfuzzer/my_fuzzer ~/tmp/my_fuzzer_corpus
```

The fuzzer would store all the interesting inputs it finds in that directory.

While libFuzzer can start with an empty corpus, seed corpus is always useful and
in many cases is able to increase code coverage by an order of magnitude.

ClusterFuzz uses seed corpus defined in Chromium source repository. You need to
add a `seed_corpus` attribute to your `fuzzer_test` definition in BUILD.gn file:

```
fuzzer_test("my_protocol_fuzzer") {
  ...
  seed_corpus = "test/fuzz/testcases"
  ...
}
```

You may specify multiple seed corpus directories via `seed_corpuses` attribute:

```
fuzzer_test("my_protocol_fuzzer") {
  ...
  seed_corpuses = [ "test/fuzz/testcases", "test/unittest/data" ]
  ...
}
```

All files found in these directories and their subdirectories will be archived
into a `<my_fuzzer_name>_seed_corpus.zip` output archive.

If you can't store seed corpus in Chromium repository (e.g. it is too large,
cannot be open sourced, etc), you can upload the corpus to Google Cloud Storage
bucket used by ClusterFuzz:

1) Go to [Corpus GCS Bucket].
2) Open directory named `<my_fuzzer_name>_static`. If the directory does not
exist, please create it.
3) Upload corpus files into the directory.

Alternative and faster way is to use [gsutil] command line tool:
```bash
gsutil -m rsync <path_to_corpus> gs://clusterfuzz-corpus/libfuzzer/<my_fuzzer_name>_static
```

### Corpus Minimization

It's important to minimize seed corpus to a *small set of interesting inputs*
before uploading. The reason being that seed corpus is synced to all fuzzing
bots for every iteration, so it is important to keep it small both for fuzzing
efficiency and to prevent our bots from running out of disk space.

The minimization can be done using `-merge=1` option of libFuzzer:

```bash
# Create an empty directory.
mkdir seed_corpus_minimized

# Run the fuzzer with -merge=1 flag.
./my_fuzzer -merge=1 ./seed_corpus_minimized ./seed_corpus
```

After running the command above, `seed_corpus_minimized` directory will contain
a minimized corpus that gives the same code coverage as the initial
`seed_corpus` directory.


## Fuzzer Dictionary

It is very useful to provide fuzz target with a set of *common words or values*
that you expect to find in the input. Adding a dictionary highly improves the
efficiency of finding new units and works especially well in certain usecases
(e.g. fuzzing file format decoders or text based protocols like XML).

To add a dictionary, first create a dictionary file. This is a flat text file
where tokens are listed one per line in the format of `name="value"`, where
`name` is optional and can be omitted, although it is a convenient way to
document the meaning of a particular token. The value must appear in quotes,
with hex escaping (\xNN) applied to all non-printable, high-bit, or otherwise
problematic characters (\\ and \" shorthands are recognized too). This syntax is
similar to the one used by [AFL] fuzzing engine (-x option).

An example dictionary looks like:

```
# Lines starting with '#' and empty lines are ignored.

# Adds "blah" word (w/o quotes) to the dictionary.
kw1="blah"
# Use \\ for backslash and \" for quotes.
kw2="\"ac\\dc\""
# Use \xAB for hex values.
kw3="\xF7\xF8"
# Key name before '=' can be omitted:
"foo\x0Abar"
```

Make sure to test your dictionary by running your fuzz target locally:

```bash
./out/libfuzzer/my_protocol_fuzzer -dict=<path_to_dict> <path_to_corpus>
```

If the dictionary is effective, you should see new units discovered in the
output.

To submit a dictionary to Chromium repository:

1) Add the dictionary file in the same directory as your fuzz target
2) Add `dict` attribute to `fuzzer_test` definition in BUILD.gn file:

```
fuzzer_test("my_protocol_fuzzer") {
  ...
  dict = "my_protocol_fuzzer.dict"
}
```

The dictionary will be used automatically by ClusterFuzz once it picks up a new
revision build.


## Custom Options

Custom options help to fine tune libFuzzer execution parameters and will also
override the default values used by ClusterFuzz. Please read [libFuzzer options]
page for detailed documentation on how these work.

Add the options needed in `libfuzzer_options` attribute to your `fuzzer_test`
definition in BUILD.gn file:

```
fuzzer_test("my_protocol_fuzzer") {
  ...
  libfuzzer_options = [
    "max_len=2048",
    "use_traces=1",
  ]
}
```

Please note that `dict` parameter should be provided [separately](#Fuzzer-Dictionary).
All other options can be passed using `libfuzzer_options` property.


## Custom Build

If you need to change the code being tested by your fuzz target, you may use an
`#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION` macro in your target code.

Note that patching target code is not a preferred way of improving corresponding
fuzz target, but in some cases that might be the only way possible, e.g. when
there is no intended API to disable checksum verification, or when target code
uses random generator that affects reproducibility of crashes.


[AFL]: http://lcamtuf.coredump.cx/afl/
[ClusterFuzz status]: clusterfuzz.md#Status-Links
[Corpus GCS Bucket]: https://console.cloud.google.com/storage/clusterfuzz-corpus/libfuzzer
[issue 638836]: https://bugs.chromium.org/p/chromium/issues/detail?id=638836
[coverage script]: https://cs.chromium.org/chromium/src/tools/code_coverage/coverage.py
[gsutil]: https://cloud.google.com/storage/docs/gsutil
[libFuzzer options]: https://llvm.org/docs/LibFuzzer.html#options
[startup initialization]: https://llvm.org/docs/LibFuzzer.html#startup-initialization