author     Lamont Granquist <lamont@scriptkiddie.org>  2019-03-07 10:39:24 -0800
committer  Lamont Granquist <lamont@scriptkiddie.org>  2019-03-07 10:39:47 -0800
commit     21af0c448a4c9c987a42446dc89e29e56e042fdc
tree       d58fdec4473b07728bf9f08b2f8271ec36088a59
parent     2bc5a4f177d9218d7910a62061693cbcceac3579
checkpoint
Signed-off-by: Lamont Granquist <lamont@scriptkiddie.org>
-rw-r--r-- docs/dev/action_collection.md | 2
-rw-r--r-- docs/dev/data_collector.md    | 77
2 files changed, 79 insertions, 0 deletions
diff --git a/docs/dev/action_collection.md b/docs/dev/action_collection.md
index fb0566c865..dd32f51a4e 100644
--- a/docs/dev/action_collection.md
+++ b/docs/dev/action_collection.md
@@ -97,3 +97,5 @@ This list will be necessarily incomplete of any unprocessed sub-resources in cus
executed actions and built their own sub-resource collections.
This was a design requirement of the data collector.
+
+To implement this more sanely, the runner that evaluates the resource collection now tracks the resources it visits.
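The idea can be sketched as follows, assuming a hypothetical `Runner` class that walks an ordered list of resources (plain strings here; Chef's real runner and resource collection have different interfaces). Because a resource is recorded only when the runner reaches it, a failure partway through leaves every later resource out of the visited list.

```ruby
# Hypothetical sketch: a runner that records each resource as it visits it.
# Resources are plain strings here; the real Chef runner and resource
# collection have different interfaces.
class Runner
  attr_reader :visited

  def initialize
    @visited = []
  end

  # Visit resources in order. The block stands in for running a resource's
  # actions; if it raises, later resources are never visited or recorded.
  def converge(resources)
    resources.each do |resource|
      @visited << resource
      yield resource if block_given?
    end
  end
end

runner = Runner.new
begin
  runner.converge(%w{file[a] file[b] file[c]}) do |resource|
    raise "converge failed" if resource == "file[b]"
  end
rescue RuntimeError => e
  puts "run failed: #{e.message}"
end
p runner.visited # => ["file[a]", "file[b]"]
```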
diff --git a/docs/dev/data_collector.md b/docs/dev/data_collector.md
new file mode 100644
index 0000000000..be597c12b6
--- /dev/null
+++ b/docs/dev/data_collector.md
@@ -0,0 +1,77 @@
+---
+title: Data Collector
+---
+
+# Data Collector RFC
+
+The Data Collector design and API are covered in:
+
+https://github.com/chef/chef-rfc/blob/master/rfc077-mode-agnostic-data-collection.md
+
+This document focuses entirely on the nuts and bolts of the Data Collector implementation.
+
+## Action Collection Integration
+
+Most of the work of tracking the actions of Chef resources is done by the separate action collection. If the Data Collector is not enabled, it never registers with the
+action collection, and the action collection does no work.
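A minimal sketch of this opt-in arrangement, assuming a hypothetical `ActionCollection` with `register` and `resource_completed` methods (the names are loosely modeled on Chef's event hooks, but the signatures are simplified and the real class differs):

```ruby
# Hypothetical sketch: an action collection that does no work unless some
# consumer (for example a data collector) has registered with it.
class ActionCollection
  attr_reader :records

  def initialize
    @consumers = []
    @records = []
  end

  # A consumer opts in; the Data Collector would only do this when enabled.
  def register(consumer)
    @consumers << consumer
  end

  # Event hook for a finished resource action (signature simplified).
  def resource_completed(resource, action)
    return if @consumers.empty? # nobody registered: skip all tracking
    @records << [resource, action]
  end
end

collection = ActionCollection.new
collection.resource_completed("file[/tmp/a]", :create) # ignored: no consumers yet
collection.register(:data_collector)
collection.resource_completed("file[/tmp/b]", :create)
p collection.records # => [["file[/tmp/b]", :create]]
```

Putting the "is anyone listening?" check inside the collection keeps the disabled case cheap without the Data Collector having to unhook anything.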
+
+## Additional Collected Information
+
+The Data Collector also collects:
+
+- the expanded run list
+- deprecations
+- the node
+- formatted error output for exceptions
+
+Most of this is done by hooking events directly in the Data Collector itself. Error handling is broken out into the ErrorHandlers module, which is mixed directly into
+the Data Collector so that this concern lives in a separate file (it is straightforward and holds little state, but consists of a lot of hooked methods).
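The mixin arrangement might look roughly like this. The `ErrorHandlers` name comes from the text above; the hook methods shown, their simplified signatures, the `DataCollector` class, and the shape of `error_description` are assumptions for illustration only.

```ruby
# Hypothetical sketch of the mixin split: failure hooks live in their own
# module, while other events are hooked directly in the collector class.
module ErrorHandlers
  # Formatted description of the first failure seen (nil when none).
  attr_reader :error_description

  def run_list_expand_failed(node, exception)
    record_error("Error expanding the run list", exception)
  end

  def cookbook_sync_failed(node, exception)
    record_error("Error syncing cookbooks", exception)
  end

  private

  # This sketch keeps only the first error reported during the run.
  def record_error(title, exception)
    @error_description ||= { "title" => title, "message" => exception.message }
  end
end

class DataCollector
  include ErrorHandlers

  attr_reader :deprecations

  def initialize
    @deprecations = []
  end

  # A non-error event hooked directly in the collector (signature simplified).
  def deprecation(message)
    @deprecations << message
  end
end

collector = DataCollector.new
collector.deprecation("resource foo is deprecated")
collector.cookbook_sync_failed(nil, RuntimeError.new("connection refused"))
p collector.error_description
```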
+
+## Resiliency to Failures
+
+The Data Collector in Chef >= 15.0 is resilient to failures that occur anywhere in the main loop of the `Chef::Client#run` method. To achieve this, there is a lot
+of defensive coding around internal data structures that may be nil (e.g. a failure before the node is loaded leaves the node nil). The spec tests for the Data
+Collector now run through a long sequence of events, which should catch any issues the Data Collector has with early failures; unfortunately, that sequence must be
+kept in sync manually with the events in `Chef::Client` if those events are ever moved around. The specs also serve as documentation of what the messages look like
+under different failure conditions. The goal was to keep the messages as close as possible to the same schema even in the presence of failures, although some data
+structures will be entirely empty.
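The nil-safe style described above can be illustrated with Ruby's safe-navigation operator (`&.`, Ruby 2.3+). `Node` and `node_fields` are hypothetical names, and the real messages contain many more fields; the point is only that an early failure produces the same keys with empty values.

```ruby
# Hypothetical sketch: fields for a message are built defensively so a
# failure before node loading (node == nil) still yields the same keys.
Node = Struct.new(:name, :run_list)

def node_fields(node)
  {
    "node_name" => node&.name || "",
    "run_list" => node&.run_list || [],
  }
end

p node_fields(nil)                                  # early failure: same keys, empty values
p node_fields(Node.new("web01", ["recipe[nginx]"])) # normal run
```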
+
+When the Chef run fails extraordinarily early, the Data Collector still sends both a start and an end message, even if the failure happens before the point at which
+it would normally have sent the start message.
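One way to guarantee that pairing is for the failure hook to back-fill a missing start message. `MessageSender` and its methods are hypothetical; only the `run_started` and `run_failed` names mirror events from the chef-client event sequence.

```ruby
# Hypothetical sketch: the end-of-run hook sends a start message first if the
# run failed before the normal start message went out.
class MessageSender
  attr_reader :sent

  def initialize
    @sent = []
    @start_sent = false
  end

  # Normal path: the start message goes out when the run starts.
  def run_started
    send_start_message
  end

  def run_failed(exception)
    send_start_message unless @start_sent # failed too early: back-fill the start
    @sent << [:run_end, exception.message]
  end

  private

  def send_start_message
    @sent << [:run_start]
    @start_sent = true
  end
end

sender = MessageSender.new
sender.run_failed(RuntimeError.new("registration failed"))
p sender.sent # => [[:run_start], [:run_end, "registration failed"]]
```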
+
+## Decision to Be Enabled
+
+This logic is complicated due to over-design and is encapsulated in the `#should_be_enabled?` method and the ConfigValidation module. The `#should_be_enabled?` method
+and ConfigValidation should probably be merged into one renamed Config module to isolate the concern of processing the `Chef::Config` options and acting on them correctly.
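A merged Config module could expose a single predicate over the configuration. This is a simplified sketch only: `DataCollectorConfig` is a hypothetical name, the configuration is a plain Hash, and its keys are illustrative rather than the exact `Chef::Config` options or the full set of checks.

```ruby
# Hypothetical sketch of a merged Config module that owns the enable decision.
# The Hash keys are illustrative and simplified, not the exact Chef::Config options.
module DataCollectorConfig
  def self.enabled?(config)
    dc = config.fetch(:data_collector, {})
    return false if config[:why_run]                                   # this sketch skips simulated runs
    return false if dc[:server_url].nil? && dc[:output_locations].nil? # nowhere to send data
    running_mode = config[:local_mode] ? :solo : :client
    [:both, running_mode].include?(dc.fetch(:mode, :both))             # mode must match this run
  end
end

p DataCollectorConfig.enabled?({ data_collector: { server_url: "https://example.com/dc" } })                # => true
p DataCollectorConfig.enabled?({ why_run: true, data_collector: { server_url: "https://example.com/dc" } }) # => false
```

Taking the configuration as an argument, rather than reading global state, keeps each combination of options easy to unit-test.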
+
+## Run Start and Run End Message Modules
+
+These are separated out into their own modules, which are very deliberately not mixed into the main Data Collector. They use the public interfaces of the Data Collector
+and the action collection and are stateless themselves, which keeps the collaboration between them and the Data Collector very easy to understand. The start message is
+relatively simple and straightforward. The complexity of the end message comes mostly from walking through the action collection and all of the action records
+collected during the entire run, along with a lot of defensive programming to deal with early errors.
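A sketch of the stateless-module pattern, assuming hypothetical `Collector` and `ActionRecord` structs that stand in for the public readers of the Data Collector and the action records. The message field names are illustrative and only loosely modeled on RFC 077; this is not the full schema.

```ruby
# Hypothetical sketch: stateless message builders. They only call public
# readers on the objects passed in and keep no state of their own.
Collector = Struct.new(:run_id, :node_name)   # stands in for the data collector
ActionRecord = Struct.new(:resource, :status) # stands in for an action record

module RunStartMessage
  def self.construct_message(collector)
    { "message_type" => "run_start", "run_id" => collector.run_id, "node_name" => collector.node_name }
  end
end

module RunEndMessage
  def self.construct_message(collector, action_records, status)
    {
      "message_type" => "run_converge",
      "run_id" => collector.run_id,
      "node_name" => collector.node_name,
      "status" => status,
      "total_resource_count" => action_records.size,
      "updated_resource_count" => action_records.count { |record| record.status == :updated },
    }
  end
end

collector = Collector.new("1234", "web01")
records = [ActionRecord.new("package[nginx]", :updated), ActionRecord.new("service[nginx]", :up_to_date)]
p RunEndMessage.construct_message(collector, records, "success")
```

Because the builders are module functions with no instance state, each message can be tested by passing in simple stand-in objects.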
+
+## Relevant Event Sequence
+
+The sequence, as it happens in an actual chef-client run:
+
+1. `events.register(data_collector)`
+2. `events.register(action_collection)`
+3. `run_status.run_id = request_id`
+4. `events.run_start(Chef::VERSION, run_status)`
+   * failures during registration will cause `events.registration_failed(node_name, exception, config)` here and skip to #13
+   * failures during node loading will cause `events.node_load_failed(node_name, exception, config)` here and skip to #13
+5. `events.node_load_success(node)`
+6. `run_status.node = node`
+   * failures during run list expansion will cause `events.run_list_expand_failed(node, exception)` here and skip to #13
+   * failures during cookbook resolution will cause `events.cookbook_resolution_failed(node, exception)` here and skip to #13
+   * failures during cookbook sync will cause `events.cookbook_sync_failed(node, exception)` here and skip to #13
+7. `events.run_list_expanded(expansion)`
+8. `run_status.start_clock`
+9. `events.run_started(run_status)`
+10. `events.cookbook_compilation_start(run_context)`
+11. < the resource events happen here which hit the action collection >
+12. `events.converge_complete` or `events.converge_failed(exception)`
+13. `run_status.stop_clock`
+14. `run_status.exception = exception` if it failed
+15. `events.run_completed(node, run_status)` or `events.run_failed(exception, run_status)`
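The spec tests mentioned under "Resiliency to Failures" replay this sequence. A toy version of the idea, assuming a hypothetical `EventRecorder` and `simulate_run` (real event hooks take arguments, and several steps such as the `run_status` calls are omitted), shows how an early failure jumps straight to the end of the run:

```ruby
# Hypothetical sketch: a recorder that logs event names, plus a toy run that
# follows a reduced version of the sequence, including the early-failure jump.
class EventRecorder
  EVENTS = %i{run_start node_load_success node_load_failed run_list_expanded run_started
              cookbook_compilation_start converge_complete converge_failed
              run_completed run_failed}.freeze

  attr_reader :events

  def initialize
    @events = []
  end

  # Define one hook per event name; each ignores its arguments and records its name.
  EVENTS.each do |name|
    define_method(name) { |*_args| @events << name }
  end
end

# Toy run: fail_at may be nil, :node_load, or :converge.
def simulate_run(events, fail_at: nil)
  events.run_start
  if fail_at == :node_load
    events.node_load_failed
    events.run_failed # skip to #13: run_started never fires
    return events
  end
  events.node_load_success
  events.run_list_expanded
  events.run_started
  events.cookbook_compilation_start
  if fail_at == :converge
    events.converge_failed
    events.run_failed
  else
    events.converge_complete
    events.run_completed
  end
  events
end

p simulate_run(EventRecorder.new, fail_at: :node_load).events
# => [:run_start, :node_load_failed, :run_failed]
```

Feeding a recorder like this through every failure path is a cheap way to check that handlers cope when events such as `run_started` never arrive.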