author | Lamont Granquist <lamont@scriptkiddie.org> | 2019-03-07 10:39:24 -0800 |
---|---|---|
committer | Lamont Granquist <lamont@scriptkiddie.org> | 2019-03-07 10:39:47 -0800 |
commit | 21af0c448a4c9c987a42446dc89e29e56e042fdc (patch) | |
tree | d58fdec4473b07728bf9f08b2f8271ec36088a59 | |
parent | 2bc5a4f177d9218d7910a62061693cbcceac3579 (diff) | |
download | chef-21af0c448a4c9c987a42446dc89e29e56e042fdc.tar.gz |
checkpoint
Signed-off-by: Lamont Granquist <lamont@scriptkiddie.org>
-rw-r--r-- | docs/dev/action_collection.md | 2 |
-rw-r--r-- | docs/dev/data_collector.md | 77 |
2 files changed, 79 insertions, 0 deletions
diff --git a/docs/dev/action_collection.md b/docs/dev/action_collection.md
index fb0566c865..dd32f51a4e 100644
--- a/docs/dev/action_collection.md
+++ b/docs/dev/action_collection.md
@@ -97,3 +97,5 @@ This list will be necessarily incomplete of any unprocessed sub-resources in cus
 executed actions and built their own sub-resource collections.
 
 This was a design requirement of the data collector.
+
+To implement this in a more sane manner the runner that evaluates the resource collection now tracks the resources that it visits.
diff --git a/docs/dev/data_collector.md b/docs/dev/data_collector.md
new file mode 100644
index 0000000000..be597c12b6
--- /dev/null
+++ b/docs/dev/data_collector.md
@@ -0,0 +1,77 @@
+---
+title: Data Collector
+---
+
+# Data Collector RFC
+
+The Data Collector design and API is covered in:
+
+https://github.com/chef/chef-rfc/blob/master/rfc077-mode-agnostic-data-collection.md
+
+This document will focus entirely on the nuts and bolts of the Data Collector
+
+## Action Collection Integration
+
+Most of the work is done by the separate action collection to track the actions of Chef resources. If the Data Collector is not enabled, it never registers with the
+action collection and no work will be done by the action collection.
+
+## Additional Collected Information
+
+The Data Collector also collects:
+
+- the expanded run list
+- deprecations
+- the node
+- formatted error output for exceptions
+
+Most of this is done through hooking events directly in the Data Collector itself. The ErrorHandlers module is broken out into a module which is directly mixed into
+the Data Collector to separate that concern out into a different file (it is straightforwards with fairly little state, but is just a lot of hooked methods).
+
+## Resiliency to Failures
+
+The Data Collector in Chef >= 15.0 is resilient to failures that occur anywhere in the main loop of the `Chef::Client#run` method. In order to do this there is a lot
+of defensive coding around internal data structures that may be nil (e.g. failures before the node is loaded will result in the node being nil). The spec tests for
+the data collector now run through a large sequence of events (which must, unfortunately, be manually kept in sync with the events in the Chef::Client if those events
+are ever 'moved' around) which should catch any issues in the data collector with early failures. The specs should also serve as documentation for what the messages
+will look like under different failure conditions. The goal was to keep the format of the messages to look as near as possible to the same schema as possible even
+in the presence of failures. But some data structures will be entirely empty.
+
+When the data collector fails extraordinarily early it still sends both a start and an end message. This will happen if it fails so early that it would not normally
+have sent a start message.
+
+## Decision to Be Enabled
+
+This is complicated due to over-design and is encapsulated in the `#should_be_enabled?` method and the ConfigValidation module. The `#should_be_enabled?` message and
+ConfigValidation should probably be merged into one renamed Config module to isolate the concern of processing the Chef::Config options and doing the correct thing.
+
+## Run Start and Run End Message modules
+
+These are separated out into their own modules, which are very deliberately not mixed into the main data collector. They use the data collector and action collection
+public interfaces. They are stateless themselves. This keeps the collaboration between them and the data collector very easy to understand. The start message is
+relatively simple and straightforwards. The complication of the end message is mostly due to walking through the action collection and all the collected action
+records from the entire run, along with a lot of defensive programming to deal with early errors.
+
+## Relevant Event Sequence
+
+As it happens in the actual chef-client run:
+
+1. `events.register(data_collector)`
+2. `events.register(action_collection)`
+3. `run_status.run_id = request_id`
+4. `events.run_start(Chef::VERSION, run_status)`
+   * failures during registration will cause `registration_failed(node_name, exception, config)` here and skip to #13
+   * failures during node loading will cause `node_load_failed(node_name, exception, config)` here and skip to #13
+5. `events.node_load_success(node)`
+6. `run_status.node = node`
+   * failures during run list expansion will cause `run_list_expand_failed(node, exception)` here and skip to #13
+   * failures during cookbook resolution will cause `events.cookbook_resolution_failed(node, exception)` here and skip to #13
+   * failures during cookbook synch will cause `events.cookbook_sync_failed(node, exception)` and skip to #13
+7. `events.run_list_expanded(expansion)`
+8. `run_status.start_clock`
+9. `events.run_started(run_status)`
+10. `events.cookbook_compilation_start(run_context)`
+11. < the resource events happen here which hit the action collection >
+12. `events.converge_complete` or `events.converge_failed(exception)`
+13. `run_status.stop_clock`
+14. `run_status.exception = exception` if it failed
+15. `events.run_completed(node, run_status)` or `events.run_failed(exception, run_status)`
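
Reader's note on the added document: the early-failure behavior it describes (a start and an end message are both sent even when the run fails before the start message would normally go out, and the node may still be nil) can be illustrated with a small sketch. This is not Chef's actual `Chef::DataCollector` code. The class name `SketchCollector`, the hash-based `run_status` and `node`, and the message fields are all hypothetical, chosen only to mirror steps 5, 9 and 15 of the event sequence.

```ruby
# Hypothetical sketch; NOT Chef's actual data collector implementation.
# run_status and node are plain hashes here purely for illustration.
class SketchCollector
  attr_reader :sent

  def initialize
    @sent = []        # messages "sent", recorded in order
    @node = nil       # stays nil if the run fails before node load
    @started = false  # whether a start message has gone out yet
  end

  # Mirrors step 5: the node only becomes available here.
  def node_load_success(node)
    @node = node
  end

  # Mirrors step 9: the normal point for the start message.
  def run_started(run_status)
    send_start(run_status)
  end

  # Mirrors step 15, success branch.
  def run_completed(_node, run_status)
    send_end(run_status, "success")
  end

  # Mirrors step 15, failure branch: may fire before run_started ever did,
  # so the start message is sent late rather than skipped.
  def run_failed(_exception, run_status)
    send_start(run_status) unless @started
    send_end(run_status, "failure")
  end

  private

  def send_start(run_status)
    @started = true
    @sent << { type: "start", run_id: run_status[:run_id], node_name: node_name }
  end

  def send_end(run_status, status)
    @sent << { type: "end", run_id: run_status[:run_id], status: status, node_name: node_name }
  end

  # Defensive access: the node may be nil on very early failures.
  def node_name
    @node.nil? ? nil : @node[:name]
  end
end
```

Calling `run_failed` on a fresh `SketchCollector` records a start message followed by an end message, both with a nil `node_name`. That keeps the message shape the same as in a successful run while leaving some fields empty, which is the goal the added document states.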