author | Tim Smith <tsmith@chef.io> | 2019-04-08 20:02:17 -0700 |
---|---|---|
committer | Tim Smith <tsmith@chef.io> | 2019-04-23 16:52:14 -0700 |
commit | 596df89fec3311d27cbe714d06678b36aec142fd (patch) | |
tree | f37bb87464f42fc4b8f4a3fe3636b039498b5686 /docs/dev/design_documents | |
parent | 72ab794d1d3f40fa03c899b7756c73b4cfc4d782 (diff) | |
download | chef-596df89fec3311d27cbe714d06678b36aec142fd.tar.gz |
Move design docs to their own directory
Signed-off-by: Tim Smith <tsmith@chef.io>
Diffstat (limited to 'docs/dev/design_documents')
-rw-r--r-- | docs/dev/design_documents/action_collection.md | 106 |
-rw-r--r-- | docs/dev/design_documents/data_collector.md | 114 |
2 files changed, 220 insertions, 0 deletions
diff --git a/docs/dev/design_documents/action_collection.md b/docs/dev/design_documents/action_collection.md
new file mode 100644
index 0000000000..a0735f65fb
--- /dev/null
+++ b/docs/dev/design_documents/action_collection.md
@@ -0,0 +1,106 @@
+---
+title: Action Collection
+---
+
+# Action Collection Design
+
+* Extract common code from the Resource Reporter and Data Collector.
+* Expose a general purpose API for querying a record of all actions taken during the Chef run.
+* Enable utilities like the 'zap' cookbook to be written to interact properly with Custom Resources.
+
+The Action Collection tracks all actions taken by all Chef resources. The resources can be in recipe code, as sub-resources of custom resources or
+they may be built "by hand". Since the Action Collection hooks the events which are fired from the `run_action` method on Chef::Resource it does
+not matter how the resources were built (as long as they were correctly passed the Chef `run_context`).
+
+This is complementary, but superior, to the resource collection which has an incomplete picture of what might happen or has happened in the run since there are
+many common ways of invoking resource actions which are not captured by how the resource collection is built. Replaying the sequence of actions in
+the Action Collection would be closer to replaying the chef-client converge than trying to re-converge the resource collection (although both of
+those models are still flawed in the presence of any imperative code that controls the shape of those objects).
+
+This design extracts common duplicated code from the Data Collector and old Resource Reporter, and is designed to be used by other consumers which
+need to ask questions like "in this run, what file resources had actions fired on them?", which can then be used to answer questions like
+"which files is Chef managing in this directory?".
+
+# Usage
+
+## Action Collection Event Hook Registration
+
+Consumers may register an event handler which hooks the `action_collection_registration` hook. This event is fired directly before recipes are
+compiled and converged (after library loading, attributes, etc). This is just before the earliest point in time that a resource should fire an
+action, so it represents the latest point at which a consumer should decide whether it needs the Action Collection to be enabled.
+
+Consumers can hook this method. They will be passed the Action Collection instance, which can be saved by the caller to be queried later. They
+should then register themselves with the Action Collection (since without registering any interest, the Action Collection will disable itself).
+
+```ruby
+  def action_collection_registration(action_collection)
+    @action_collection = action_collection
+    action_collection.register(self)
+  end
+```
+
+## Library Registration
+
+Any cookbook library code may also register itself with the Action Collection. The Action Collection will be registered with the `run_context` after
+it is created, so registration may be accomplished easily:
+
+```ruby
+  Chef.run_context.action_collection.register(self)
+```
+
+## Action Collection Requires Registration
+
+If one of the prior methods is not used to register for the Action Collection, then the Action Collection will disable itself and will not record
+any actions, in order to avoid the memory overhead of tracking the actions during the run. The Data Collector takes advantage of this
+since if the run start message from the Data Collector is refused by the server, then the Data Collector disables itself, and then does not register
+with the Action Collection, which would disable the Action Collection. This makes use of the delayed hooking through the `action_collection_registration`
+hook so that the Data Collector never registers itself after it is disabled.
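Reviewer annotation (not part of the commit): the opt-in behavior described in the hunk above can be modeled with a small, self-contained sketch. `SketchActionCollection` and `SketchConsumer` are hypothetical stand-ins invented for illustration and are not Chef's real classes; only the `action_collection_registration` hook name and the `register` call come from the document.

```ruby
# Toy stand-in for the Action Collection; NOT Chef's real class.
class SketchActionCollection
  attr_reader :records

  def initialize
    @consumers = []
    @records = []
  end

  # Consumers declare interest here; without this, nothing is tracked.
  def register(consumer)
    @consumers << consumer
  end

  def enabled?
    !@consumers.empty?
  end

  # Called once per resource action; a no-op while no consumer is registered.
  def record(action)
    @records << action if enabled?
  end
end

# Hypothetical consumer that opts in through the registration hook.
class SketchConsumer
  def action_collection_registration(action_collection)
    @action_collection = action_collection
    action_collection.register(self)
  end
end

# No consumer registered: the action is dropped.
unregistered = SketchActionCollection.new
unregistered.record("file[/tmp/a] create")

# A consumer registered before the first action: the action is kept.
registered = SketchActionCollection.new
SketchConsumer.new.action_collection_registration(registered)
registered.record("file[/tmp/a] create")
```

In this model a consumer that never registers costs nothing beyond an empty array, which is the memory-saving property the document describes.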
+
+## Searching
+
+There is a function `filtered_collection` which returns "slices" of the `ActionCollection` object. The `max_nesting` argument can be used to prune
+how deep into sub-resources the returned view goes (`max_nesting: 0` will return only resources in recipe context, with any hand created resources, but
+no subresources). There are also 5 different states of the action: `up_to_date`, `skipped`, `updated`, `failed`, `unprocessed` which can be filtered
+on. All of these are true by default, so they must be disabled to remove them from the filtered collection.
+
+The `ActionCollection` object itself implements Enumerable and returns `ActionRecord` objects (see the `ActionCollection` code for the fields exposed on
+`ActionRecord`s).
+
+This would return all file resources in any state in the recipe context:
+
+```
+Chef.run_context.action_collection.filtered_collection(max_nesting: 0).select { |rec| rec.new_resource.is_a?(Chef::Resource::File) }
+```
+
+NOTE:
+As the Action Collection API was initially designed around the Resource Reporter and Data Collector use cases, the searching API is currently rudimentary
+and could easily lift some of the searching features on the name of the resource from the resource collection, and could use a more fluent API
+for composing searches.
+
+# Implementation Details
+
+## Resource Event Lifecycle Hooks
+
+Resource actions fire off several events in sequence:
+
+1. `resource_action_start` - this is always fired first
+2. `resource_current_state_loaded` - this is normally second, but may be skipped in the case of a resource which throws an exception during
+`load_current_resource` (which means that the `current_resource` off the `ActionRecord` may be nil).
+3. `resource_up_to_date` / `resource_skipped` / `resource_updated` / `resource_failed` - one of these is always called which corresponds to the state of the action.
+4. `resource_completed` - this is always fired last
+
+For skipped resources, the conditional will be saved in the `ActionRecord`. For failed resources the exception is saved in the `ActionRecord`.
+
+## Unprocessed Resources
+
+The unprocessed resource concept is to report on resources which are left in the resource collection after a failure. A successful Chef run should
+never leave any unprocessed resources (`action :nothing` resources are still inspected by the resource collection and are processed). Unprocessed
+resources only occur when an exception is thrown during the execution of the resource collection; they are the resources that the runner executing
+the resource collection never visited.
+
+This list will necessarily omit any unprocessed sub-resources in custom resources, since the run was aborted before those resources
+executed actions and built their own sub-resource collections.
+
+This was a design requirement of the Data Collector.
+
+To implement this in a more sane manner, the runner that evaluates the resource collection now tracks the resources that it visits.
diff --git a/docs/dev/design_documents/data_collector.md b/docs/dev/design_documents/data_collector.md
new file mode 100644
index 0000000000..be0a92e7fb
--- /dev/null
+++ b/docs/dev/design_documents/data_collector.md
@@ -0,0 +1,114 @@
+---
+title: Data Collector
+---
+
+# Data Collector Design
+
+The Data Collector design and API are covered in:
+
+https://github.com/chef/chef-rfc/blob/master/rfc077-mode-agnostic-data-collection.md
+
+This document will focus entirely on the nuts and bolts of the Data Collector.
+
+## Action Collection Integration
+
+Most of the work is done by a separate Action Collection to track the actions of Chef resources. If the Data Collector is not enabled, it never registers with the
+Action Collection and no work will be done by the Action Collection to track resources.
+
+## Additional Collected Information
+
+The Data Collector also collects:
+
+- the expanded run list
+- deprecations
+- the node
+- formatted error output for exceptions
+
+Most of this is done through hooking events directly in the Data Collector itself. The ErrorHandlers module is broken out into a module which is directly mixed into
+the Data Collector to separate that concern out into a different file (it is straightforward with fairly little state, but is just a lot of hooked methods).
+
+## Basic Configuration Modes
+
+### Configured for Automate
+
+Do nothing. The URL is constructed from the base `Chef::Config[:chef_server_url]`, auth is just Chef Server API authentication, and the default behavior is that it
+is configured.
+
+### Configured to Log to a File
+
+Set up a file output location, no token is necessary:
+
+```
+Chef::Config[:data_collector][:output_locations] = { files: [ "/Users/lamont/data_collector.out" ] }
+```
+
+Note that you can't assign directly to `Chef::Config[:data_collector][:output_locations][:files]`; you will get a NoMethodError if you try.
+
+### Configured to Log to a Non-Chef Server Endpoint
+
+Set up a server URL, which requires a token:
+
+```
+Chef::Config[:data_collector][:server_url] = "https://chef.acme.local/myendpoint.html"
+Chef::Config[:data_collector][:token] = "mytoken"
+```
+
+This works for chef-clients which are configured to hit a chef server, but use a custom non-Chef-Automate endpoint for reporting, or for chef-solo/zero users.
+
+XXX: There is also the `Chef::Config[:data_collector][:output_locations] = { uri: [ "https://chef.acme.local/myendpoint.html" ] }` method -- which is going to behave
+differently, particularly on non-chef-solo use cases. In that case the Data Collector `server_url` will still be automatically derived from the `chef_server_url` and
+the Data Collector will attempt to contact that endpoint, but with the token being supplied it will use that and will not use Chef Server authentication, and the
+server should 403 back, and if `raise_on_failure` is left to the default of false then it will simply drop that failure and continue without raising, which will
+appear to work, and output will be sent to the configured `output_locations`. Note that the presence of a token flips all external URIs to using the token so that
+it is **not** possible to use this feature to talk to both a Chef Automate endpoint and a custom URI reporting endpoint (which would seem to be the most useful application of an
+incredibly marginally useful feature, and it does not work). But given how hopelessly complicated this is, the recommendation is to use the `server_url` and to avoid
+using any `url` options in the `output_locations` since that feature is fairly poorly designed at this point in time.
+
+## Resiliency to Failures
+
+The Data Collector in Chef >= 15.0 is resilient to failures that occur anywhere in the main loop of the `Chef::Client#run` method. In order to do this there is a lot
+of defensive coding around internal data structures that may be nil (e.g. failures before the node is loaded will result in the node being nil). The spec tests for
+the Data Collector now run through a large sequence of events (which must, unfortunately, be manually kept in sync with the events in the Chef::Client if those events
+are ever 'moved' around) which should catch any issues in the Data Collector with early failures. The specs should also serve as documentation for what the messages
+will look like under different failure conditions. The goal was to keep the format of the messages as close as possible to the same schema even
+in the presence of failures. But some data structures will be entirely empty.
+
+When the Data Collector fails extraordinarily early it still sends both a start and an end message. This will happen if it fails so early that it would not normally
+have sent a start message.
+
+## Decision to Be Enabled
+
+This is complicated due to over-design and is encapsulated in the `#should_be_enabled?` method and the ConfigValidation module. The `#should_be_enabled?` method and
+ConfigValidation should probably be merged into one renamed Config module to isolate the concern of processing the Chef::Config options and doing the correct thing.
+
+## Run Start and Run End Message modules
+
+These are separated out into their own modules, which are very deliberately not mixed into the main Data Collector. They use the Data Collector and Action Collection
+public interfaces. They are stateless themselves. This keeps the collaboration between them and the Data Collector very easy to understand. The start message is
+relatively simple and straightforward. The complication of the end message is mostly due to walking through the Action Collection and all the collected action
+records from the entire run, along with a lot of defensive programming to deal with early errors.
+
+## Relevant Event Sequence
+
+As it happens in the actual chef-client run:
+
+1. `events.register(data_collector)`
+2. `events.register(action_collection)`
+3. `run_status.run_id = request_id`
+4. `events.run_start(Chef::VERSION, run_status)`
+    * failures during registration will cause `registration_failed(node_name, exception, config)` here and skip to #13
+    * failures during node loading will cause `node_load_failed(node_name, exception, config)` here and skip to #13
+5. `events.node_load_success(node)`
+6. `run_status.node = node`
+    * failures during run list expansion will cause `run_list_expand_failed(node, exception)` here and skip to #13
+7. `events.run_list_expanded(expansion)`
+8. `run_status.start_clock`
+9. `events.run_started(run_status)`
+    * failures during cookbook resolution will cause `events.cookbook_resolution_failed(node, exception)` here and skip to #13
+    * failures during cookbook sync will cause `events.cookbook_sync_failed(node, exception)` and skip to #13
+10. `events.cookbook_compilation_start(run_context)`
+11. < the resource events happen here which hit the Action Collection, may throw any of the other failure events >
+12. `events.converge_complete` or `events.converge_failed(exception)`
+13. `run_status.stop_clock`
+14. `run_status.exception = exception` if it failed
+15. `events.run_completed(node, run_status)` or `events.run_failed(exception, run_status)`
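
Reviewer annotation (not part of the commit): the "skip to #13" behavior in the event sequence above can be sketched as a `begin`/`rescue`/`ensure` block. This is a simplified model under stated assumptions, not the actual `Chef::Client#run` code. `sketch_run`, its `fail_at:` parameter and the `FAILURE_EVENTS` table are hypothetical; the event names are taken from the list, with registration and cookbook resolution failures omitted for brevity. The real client fires each failure event at the point of failure rather than from one shared `rescue`.

```ruby
# Hypothetical mapping from a failing phase to the failure event named in the list.
FAILURE_EVENTS = {
  node_load: :node_load_failed,
  run_list_expansion: :run_list_expand_failed,
  cookbook_sync: :cookbook_sync_failed,
  converge: :converge_failed,
}.freeze

# Returns the ordered list of events a run would emit.
# fail_at: the phase that raises, or nil for a successful run.
def sketch_run(fail_at: nil)
  events = []
  exception = nil
  begin
    events << :run_start
    raise "node load failed" if fail_at == :node_load
    events << :node_load_success
    raise "run list expansion failed" if fail_at == :run_list_expansion
    events << :run_list_expanded
    events << :run_started
    raise "cookbook sync failed" if fail_at == :cookbook_sync
    events << :cookbook_compilation_start
    raise "converge failed" if fail_at == :converge
    events << :converge_complete
  rescue StandardError => e
    # Any failure emits its event and falls through to step 13.
    exception = e
    events << FAILURE_EVENTS.fetch(fail_at)
  ensure
    # Step 13 runs on every path, successful or not.
    events << :stop_clock
  end
  # Step 15: exactly one terminal event is emitted.
  events << (exception ? :run_failed : :run_completed)
  events
end

early_failure = sketch_run(fail_at: :node_load)
clean_run = sketch_run
```

Even a failure during node loading still ends with `:stop_clock` and a terminal `:run_failed`, which mirrors why the Data Collector has to tolerate a nil node when it builds the end message.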