author Richard Samuels <richard.l.samuels@gmail.com> 2022-05-11 18:56:16 +0000
committer Evergreen Agent <no-reply@evergreen.mongodb.com> 2022-05-11 19:25:12 +0000
commit 6b0215b0d9a50d51733d2d85bf587eb586ccd57f
tree 1adb3a2d9678261064a8d883415128126f9024b3 /docs
parent 9659df2d74bcb2caf56d7c6444adb8f3c82ea14a
SERVER-65696 Document hang analyzer
Diffstat (limited to 'docs')
-rw-r--r-- docs/testing/hang_analyzer.md | 67
1 files changed, 67 insertions, 0 deletions
diff --git a/docs/testing/hang_analyzer.md b/docs/testing/hang_analyzer.md
new file mode 100644
index 00000000000..7a3e560e3e9
--- /dev/null
+++ b/docs/testing/hang_analyzer.md
@@ -0,0 +1,67 @@
+# Hang Analyzer
+
+The hang analyzer is a tool that collects core dumps and other information from
+processes that are suspected to have hung. Any task that exceeds its timeout in
+Evergreen is automatically hang-analyzed, and the collected information is
+compressed and uploaded to S3.
+
+The hang analyzer can also be invoked locally at any time. For all non-Jepsen
+tasks, the invocation is `buildscripts/resmoke.py hang-analyzer -o file -o stdout -m exact -p python`. You may need to replace `python` with the name of the Python binary
+you are using, which may be `python` or `python3`, or on Windows `Python` or
+`Python3`.
+
+For Jepsen tasks, the invocation is `buildscripts/resmoke.py hang-analyzer -o file -o stdout -p dbtest,java,mongo,mongod,mongos,python,_test`.
+
+## Interesting Processes
+The hang analyzer detects and runs against processes which are considered
+interesting.
+
+For tasks whose name contains "jepsen", an interesting process is any process
+whose name exactly matches one of `dbtest,java,mongo,mongod,mongos,python,_test`.
+
+In all other scenarios, including local use of the hang analyzer, an interesting
+process is either of:
+* a process whose name starts with `python` or `live-record`
+* a process that was spawned as a child process of resmoke
+
+The resmoke subcommand `hang-analyzer` sends SIGUSR1 (or, on Windows, calls
+SetEvent) to signal resmoke to:
+* Print stack traces for all Python threads
+* Collect core dumps and other information for any non-Python child
+processes (see Data Collection below)
+* Re-signal any Python child processes to do the same
+
+## Data Collection
+Data collection occurs in the following sequence:
+* Pause all non-Python processes
+* Grab debug symbols on non-sanitizer builds
+* Signal Python processes
+* Dump cores of as many processes as possible, until the disk quota is exceeded.
+The default quota is 90% of total volume space.
+
+* Collect additional, non-core data. Ideally:
+    * Print C++ stack traces
+    * Print MozJS stack traces
+    * Dump lock and mutex info
+    * Dump server sessions
+    * Dump recovery units
+    * Dump storage engine info
+* Dump Java processes (Jepsen tests) with jstack
+* SIGABRT (Unix) or terminate (Windows) Go processes
+
+Note that the list of non-core data collected is only accurate on Linux. Other
+platforms only perform a subset of these operations.
+
+Additionally, note that the hang analyzer is subject to Evergreen post-task
+timeouts, and may not have enough time to collect all information before
+being terminated by the Evergreen agent. When running locally, there is no
+timeout, so the hang analyzer itself may, ironically, hang indefinitely.
+
+
+### Implementations
+Platform-specific concerns for data collection are handled by dumper objects in
+`buildscripts/resmokelib/hang_analyzer/dumper.py`.
+* Linux: See `GDBDumper`
+* macOS: See `LLDBDumper`
+* Windows: See `WindowsDumper` and `JstackWindowsDumper`
+* Java (non-Windows): See `JstackDumper`
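
The platform list above can be read as a small lookup table. The following is only a sketch of that mapping, not the code in `dumper.py`: the `DumperNames` type, the `dumper_names_for` helper, and the use of `sys.platform` prefixes as keys are all assumptions made for illustration. Only the dumper class names come from the document.

```python
import sys
from typing import NamedTuple


class DumperNames(NamedTuple):
    """Hypothetical container: dumper class names documented for one platform."""

    native: str  # dumper used for native (non-Java) processes
    jstack: str  # dumper used for Java processes


# Class names are taken from the list above; the keys are sys.platform
# prefixes, which is an assumption of this sketch.
_DUMPERS_BY_PLATFORM = {
    "linux": DumperNames(native="GDBDumper", jstack="JstackDumper"),
    "darwin": DumperNames(native="LLDBDumper", jstack="JstackDumper"),
    "win32": DumperNames(native="WindowsDumper", jstack="JstackWindowsDumper"),
}


def dumper_names_for(platform: str = sys.platform) -> DumperNames:
    """Return the documented dumper names for a sys.platform-style string."""
    for prefix, names in _DUMPERS_BY_PLATFORM.items():
        if platform.startswith(prefix):
            return names
    raise ValueError(f"no dumper documented for platform {platform!r}")
```

Calling `dumper_names_for("darwin")` yields the `LLDBDumper`/`JstackDumper` pair, while an undocumented platform raises `ValueError` rather than silently picking a default.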