Rebase: Merge BOLT codebase in monorepo

Summary: This commit is the first step in rebasing all of BOLT history in the LLVM monorepo. It also solves trivial build issues by updating BOLT codebase to use current LLVM. There is still work left in rebasing some BOLT features and in making sure everything is working as intended. History has been rewritten to put BOLT in the /bolt folder, as opposed to /tools/llvm-bolt. (cherry picked from FBD33289252)
author: Amir Ayupov <aaupov@fb.com> 2020-12-01 16:29:39 -0800
committer: Maksim Panchenko <maks@fb.com> 2020-12-01 16:29:39 -0800
commit: 1c5d3a056c4abfe1022f67e1e366765d769782fa (patch)
tree: 50a46c82ac2ecd1dcd25d2af6b36704d34cbf373 /bolt/docs
parent: 0a8aaf56bb58ba9b7d561dda02a27a292f517954 (diff)
download: llvm-1c5d3a056c4abfe1022f67e1e366765d769782fa.tar.gz
4 files changed, 353 insertions, 0 deletions
diff --git a/bolt/docs/Heatmap.png b/bolt/docs/Heatmap.png
new file mode 100644
index 000000000000..e3f76fb22932
--- /dev/null
+++ b/bolt/docs/Heatmap.png
diff --git a/bolt/docs/Heatmaps.md b/bolt/docs/Heatmaps.md
new file mode 100644
index 000000000000..526b3900b2d1
--- /dev/null
+++ b/bolt/docs/Heatmaps.md
@@ -0,0 +1,50 @@
+# Code Heatmaps
+
+BOLT has gained the ability to print code heatmaps based on
+sampling-based LBR profiles generated by `perf`. The output is produced
+in colored ASCII to be displayed in a color-capable terminal. It looks
+something like this:
+
+![](./Heatmap.png)
+
+Heatmaps can be generated for BOLTed and non-BOLTed binaries. You can
+use them to compare the code layout before and after optimizations.
+
+To generate a heatmap, start with running your app under `perf`:
+
+```bash
+$ perf record -e cycles:u -j any,u -- <executable with args>
+```
+or if you want to monitor the existing process(es):
+```bash
+$ perf record -e cycles:u -j any,u [-p PID|-a] -- sleep <interval>
+```
+
+Note that at the moment running with LBR (`-j any,u` or `-b`) is
+a requirement.
+
+Once the run is complete and `perf.data` is generated, run BOLT in
+heatmap mode:
+
+```bash
+$ llvm-bolt heatmap -p perf.data <executable>
+```
+
+By default the heatmap will be dumped to *stdout*. You can change it
+with the `-o <heatmapfile>` option. Each character/block in the heatmap
+shows the execution data accumulated for the corresponding 64 bytes of
+code. You can change this granularity with the `-block-size` option.
+E.g. set it to 4096 to see code usage grouped by 4K pages.
+Other useful options are:
+
+```bash
+-line-size=<uint>   - number of entries per line (default 256)
+-max-address=<uint> - maximum address considered valid for heatmap (default 4GB)
+```
+
+If you prefer to look at the data in a browser (or would like to share
+it that way), then you can use an HTML conversion tool. E.g.:
+
+```bash
+$ aha -b -f <heatmapfile> > <heatmapfile>.html
+```
diff --git a/bolt/docs/OptimizingClang.md b/bolt/docs/OptimizingClang.md
new file mode 100644
index 000000000000..12ac2fe19e23
--- /dev/null
+++ b/bolt/docs/OptimizingClang.md
@@ -0,0 +1,266 @@
+# Optimizing Clang: A Practical Example of Applying BOLT
+
+## Preface
+
+*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
+performance by laying out code in a manner that helps the CPU better utilize its caching and
+branch predicting resources.
+
+The most obvious candidates for BOLT optimizations
+are programs that suffer from many instruction cache and iTLB misses, such as
+large applications measuring over hundreds of megabytes in size. However, medium-sized
+programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
+is a good example of the latter. Its code size could easily be in the order of tens of megabytes.
+As we will see, the Clang binary suffers from many instruction cache
+misses and can be significantly improved with BOLT, even on top of profile-guided and
+link-time optimizations.
+
+In this tutorial we will first build Clang with PGO and LTO, and then show how to
+apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
+the compile-time performance gains are coming from, and verify that the speed-ups are
+sustainable while building other applications.
+
+## Building Clang
+
+The process of getting Clang sources and performing the build is very similar to the
+one described at http://clang.llvm.org/get_started.html. For completeness, we provide the detailed steps
+on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.
+
+The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during
+the final link. This option saves relocation metadata in the executable file, but does not affect
+the generated code in any way.
+
+## Optimizing Clang with BOLT
+
+We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto).
+Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`.
+
+Before we can run BOLT optimizations, we need to collect the profile for Clang, and we will use
+Clang/LLVM sources for that.
+Collecting an accurate profile requires running `perf` on hardware that
+implements taken branch sampling (`-b/-j` flag). For that reason, it may not be possible to
+collect an accurate profile in a virtualized environment, e.g. in the cloud.
+We do support regular sampling profiles, but the performance
+improvements are expected to be more modest.
+
+```bash
+$ mkdir ${TOPLEV}/stage3
+$ cd ${TOPLEV}/stage3
+$ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/
+$ cmake -G Ninja ${TOPLEV}/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
+    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install
+$ perf record -e cycles:u -j any,u -- ninja clang
+```
+
+Once the last command is finished, it will create a `perf.data` file larger than 10GiB.
+We will first convert this profile into a more compact aggregated
+form suitable to be consumed by BOLT:
+```bash
+$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
+```
+Notice that we are passing `clang-7` to `perf2bolt`, since it is the real binary that
+`clang` and `clang++` are symlinking to. The next step will optimize Clang using
+the generated profile:
+```bash
+$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \
+    -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 \
+    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
+```
+The output will look similar to the one below:
+```t
+...
+BOLT-INFO: enabling relocation mode
+BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile.
+...
+BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables.
+BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile.
+BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
+...
+           660155947 : executed forward branches (-2.3%)
+            48252553 : taken forward branches (-57.2%)
+           129897961 : executed backward branches (+13.8%)
+            52389551 : taken backward branches (-19.5%)
+            35650038 : executed unconditional branches (-33.2%)
+           128338874 : all function calls (=)
+            19010563 : indirect calls (=)
+             9918250 : PLT calls (=)
+          6113398840 : executed instructions (-0.6%)
+          1519537463 : executed load instructions (=)
+           943321306 : executed store instructions (=)
+            20467109 : taken jump table branches (=)
+           825703946 : total branches (-2.1%)
+           136292142 : taken branches (-41.1%)
+           689411804 : non-taken conditional branches (+12.6%)
+           100642104 : taken conditional branches (-43.4%)
+           790053908 : all conditional branches (=)
+...
+```
+The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
+the `cycles` counter, their accuracy is affected. However, the relative improvement in
+`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.
+
+## Measuring Compile-time Improvement
+
+`clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang:
+```bash
+$ mv $CPATH/clang-7 $CPATH/clang-7.org
+$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
+```
+Doing a new build of Clang using the new binary shows a significant overall
+build time reduction on a 48-core Haswell system:
+```bash
+$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
+$ ninja clean && /bin/time -f %e ninja clang -j48
+202.72
+$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
+$ ninja clean && /bin/time -f %e ninja clang -j48
+180.11
+```
+That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build.
+Notice that we are measuring an improvement of the total build time, which includes the time spent in the linker.
+Compilation time improvements for individual files differ, and speedups over 15% are not uncommon.
+If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build is finished in 253.32 seconds),
+the gains we see are over 50 seconds (25%),
+but, as expected, the result is still slower than *PGO+LTO+BOLT* build.
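+
+As a rough check of the *PGO+LTO* versus *PGO+LTO+BOLT* figures above (a sketch using only the two timings reported in the listing, and assuming that "faster" refers to the ratio of build times rather than to the fraction of time saved):
+
+$$
+\Delta t = 202.72\,\text{s} - 180.11\,\text{s} = 22.61\,\text{s}, \qquad
+\frac{202.72}{180.11} \approx 1.126, \qquad
+\frac{22.61}{202.72} \approx 0.112
+$$
+
+By the first measure the build is about 12.6% faster; by the second it takes about 11.2% less wall-clock time.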
+
+## Source of the Wins
+
+We mentioned that Clang suffers from considerable instruction cache misses. This can be measured with `perf`:
+```bash
+$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
+$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
+  ...
+   16,366,101,626,647      instructions
+      359,996,216,537      L1-icache-misses
+```
+That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application
+has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT.
+Now let's see how many misses are in the BOLTed binary:
+```bash
+$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
+$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
+  ...
+  16,319,818,488,769      instructions
+     244,888,677,972      L1-icache-misses
+```
+The number of misses per thousand instructions went down from 22 to 15, significantly reducing
+the number of stalls in the CPU front-end.
+Notice how the number of executed instructions stayed roughly the same. That's because we didn't
+run any optimizations beyond the ones affecting the code layout. Other than instruction cache misses,
+BOLT also improves branch mispredictions, iTLB misses, and misses in L2 and L3.
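+
+The per-thousand figures follow directly from the two `perf stat` outputs above. As a worked check (MPKI, misses per thousand instructions, is shorthand introduced here and is not used elsewhere in this document):
+
+$$
+\text{MPKI}_{\text{PGO+LTO}} = 1000 \times \frac{359{,}996{,}216{,}537}{16{,}366{,}101{,}626{,}647} \approx 22.0
+$$
+
+$$
+\text{MPKI}_{\text{PGO+LTO+BOLT}} = 1000 \times \frac{244{,}888{,}677{,}972}{16{,}319{,}818{,}488{,}769} \approx 15.0
+$$
+
+In absolute terms, the miss count dropped by about 32% (from roughly 360 billion to 245 billion), while the instruction count changed by less than 0.3%.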
+
+## Using Clang for Other Applications
+
+We have collected the profile for Clang using its own source code. Would it be enough to speed up
+the compilation of other projects? We picked `mysqld`, an open-source database, to do the test.
+
+On our 48-core Haswell system using the *PGO+LTO* Clang, the build finished in 136.06 seconds, while using the *PGO+LTO+BOLT* Clang, 126.10 seconds.
+That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
+This is partially because the number of instruction cache misses is slightly lower in this scenario: 19 vs. 22.
+Another reason is that Clang is run with a different set of options while building `mysqld` compared
+to the training run.
+
+Different options exercise different code paths, and
+if we trained without a specific option, we may have misplaced parts of the code responsible for handling it.
+To test this theory, we have collected another `perf` profile while building `mysqld`, and merged it with an existing profile
+using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able
+to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to *PGO+LTO* Clang.
+The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.
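+
+For comparison with the Clang self-build numbers, the `mysqld` timings quoted above work out as follows (a sketch, assuming "faster" refers to the ratio of build times):
+
+$$
+\frac{136.06}{126.10} \approx 1.079, \qquad
+\frac{136.06}{124.74} \approx 1.091, \qquad
+136.06\,\text{s} - 124.74\,\text{s} = 11.32\,\text{s}
+$$
+
+So the original profile yields a speed-up of about 8% on `mysqld`, and the merged profile about 9%.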
+
+Ideally, the profile run has to be done with a superset of all commonly used options. However, the main improvement is expected with just the basic set.
+
+## Summary
+
+In this tutorial we demonstrated how to use BOLT to improve the
+performance of the Clang compiler. Similarly, BOLT could be used to improve the performance
+of GCC, or any other application suffering from a high number of instruction
+cache misses.
+
+----
+# Appendix
+
+## Bootstrapping Clang-7 with PGO and LTO
+
+Below we describe detailed steps to build Clang, and make it ready for BOLT optimizations. If you
+already have the build setup, you can skip this section, except for the last step that adds the `-Wl,-q` linker flag to the final build.
+
+### Getting Clang-7 Sources
+
+Set `$TOPLEV` to the directory of your preference where you would like to do
+builds. E.g. `TOPLEV=~/clang-7/`. Follow with commands to clone the `release_70` branches
+of LLVM, Clang, lld linker, and the compiler runtime:
+```bash
+$ cd ${TOPLEV}
+$ git clone -q --depth=1 --branch=release_70 https://git.llvm.org/git/llvm.git/ llvm
+$ cd llvm/tools
+$ git clone -q --depth=1 --branch=release_70 https://git.llvm.org/git/clang.git/
+$ cd ../projects
+$ git clone -q --depth=1 --branch=release_70 https://git.llvm.org/git/lld.git/
+$ git clone -q --depth=1 --branch=release_70 https://git.llvm.org/git/compiler-rt.git/
+```
+
+### Building Stage 1 Compiler
+
+Stage 1 will be the first build we are going to do, and we will be using the
+default system compiler to build Clang. If your system lacks a compiler, use your distribution package manager to install one
+that supports C++11. In this example we are going to use GCC. In addition to the compiler,
+you will need the `cmake` and `ninja` packages.
+```bash
+$ mkdir ${TOPLEV}/stage1
+$ cd ${TOPLEV}/stage1
+$ cmake -G Ninja ${TOPLEV}/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
+      -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \
+      -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install
+$ ninja install
+```
+
+### Building Stage 2 Compiler With Instrumentation
+
+Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with profile generation capabilities:
+```bash
+$ mkdir ${TOPLEV}/stage2-prof-gen
+$ cd ${TOPLEV}/stage2-prof-gen
+$ CPATH=${TOPLEV}/stage1/install/bin/
+$ cmake -G Ninja ${TOPLEV}/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
+    -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \
+    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install
+$ ninja install
+```
+
+### Generating Profile for PGO
+
+While there are many ways to obtain the profile data, we are going to use the source code already at our
+disposal, i.e. we are going to collect the profile while building Clang itself:
+```bash
+$ mkdir ${TOPLEV}/stage3-train
+$ cd ${TOPLEV}/stage3-train
+$ CPATH=${TOPLEV}/stage2-prof-gen/install/bin
+$ cmake -G Ninja ${TOPLEV}/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
+    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install
+$ ninja clang
+```
+Once the build is completed, the profile files will be saved under `${TOPLEV}/stage2-prof-gen/profiles`. We will merge them before they can be passed back into Clang:
+```bash
+$ cd ${TOPLEV}/stage2-prof-gen/profiles
+$ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata *
+```
+
+### Building Clang with PGO and LTO
+
+Now the profile can be used to guide optimizations to produce better code for our scenario, i.e. building Clang.
+We will also enable link-time optimizations to allow cross-module inlining and other optimizations. Finally, we are going to add one extra step that is useful for BOLT: a linker flag instructing it to preserve relocations in the output binary. Note that this flag does not affect the generated code or data used at runtime; it only writes metadata to the file on disk:
+```bash
+$ mkdir ${TOPLEV}/stage2-prof-use-lto
+$ cd ${TOPLEV}/stage2-prof-use-lto
+$ CPATH=${TOPLEV}/stage1/install/bin/
+$ export LDFLAGS="-Wl,-q"
+$ cmake -G Ninja ${TOPLEV}/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
+    -DLLVM_ENABLE_LTO=Full -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \
+    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install
+$ ninja install
+```
+Now we have a Clang compiler that can build itself much faster. As we will see, it builds other applications faster as well, and, with BOLT, the compile time can be improved even further.
diff --git a/bolt/docs/RuntimeLibrary.md b/bolt/docs/RuntimeLibrary.md
new file mode 100644
index 000000000000..58d9497a195b
--- /dev/null
+++ b/bolt/docs/RuntimeLibrary.md
@@ -0,0 +1,37 @@
+# BOLT ORC-based linker
+
+This document gives a high-level view of the simple linker used to insert auxiliary/library code into the final binary produced by BOLT. It is built on top of LLVM's ORC infrastructure (the newest iteration of JITting for LLVM).
+
+## Several levels of code injection
+
+When BOLT starts processing an input executable, its first task is to raise the binary to a low-level IR with CFG. After this is done, we are ready to change code in this binary. Throughout BOLT's pipeline of code transformations, there are plenty of situations when we need to insert new code or fix existing code.
+
+If operating with small code changes inside a basic block, we typically defer this work to MCPlusBuilder. This is our target-independent interface to create new instructions, but it also contains some functions that may create code spanning multiple basic blocks (for instance, when doing indirect call promotion and unrolling an indirect call into a ladder of comparisons/direct calls). The implementation here usually boils down to programmatically creating new MCInst instructions while setting their opcodes according to the target list (see X86GenInstOpcodes.inc generated by tablegen in an LLVM build).
+
+However, this approach quickly becomes awkward if we want to insert a lot of code, especially if this code is frozen and never changes. In these situations, it is more convenient to have a runtime library with all the code you need to insert. This library defines some symbols and can be linked into the final binary. In this case, all you need to do in a BOLT transformation is to insert a call to your library.
+
+## The runtime library
+
+Currently, our runtime library is written in C++ and contains code that helps us instrument a binary.
+
+### Limitations
+Our library is not written with regular C++ code as it is not linked against any other libraries (this means we cannot rely on anything defined in libstdc++, glibc, libgcc etc), but is self-sufficient. In runtime/CMakeLists.txt, we can see it is built with -ffreestanding, which requires the compiler to avoid using a runtime library by itself.
+
+While this requires us to make our own syscalls, it does simplify our linker a lot, which is very limited and can only do basic function name resolving. However, this is a big improvement in comparison with programmatically generating the code in assembly language using MCInsts.
+
+A few more quirks:
+
+* No BSS section: don't use uninitialized globals
+* No dependencies on foreign code: self-sufficient
+* You should closely watch the generated bolt_rt object files: anything requiring fancy linker features will break. We only support bare-bones .text, .data and nothing else.
+
+Read the opening comment in instr.cpp for more details.
+
+
+## Linking
+
+While RewriteInstance::emitAndLink() will perform an initial link step to resolve all references of the input program, it will not start linking the runtime library right away. The input program lives in its own module that may end up with unresolved references to the runtime library.
+
+RewriteInstance::linkRuntime() has the job of actually reading individual .o files and adding them to the binary. We currently have a single .o file, so after it is read, ORC can finally resolve references from the first module to the newly inserted .o objects.
+
+This sequence of steps is done by calls to addObject() and emitAndFinalize(). The latter will trigger symbol resolution, relying on the symbol resolver provided by us when calling createLegacyLookupResolver().