Move design docs to their own directory

Signed-off-by: Tim Smith <tsmith@chef.io>
author: Tim Smith <tsmith@chef.io> 2019-04-08 20:02:17 -0700
committer: Tim Smith <tsmith@chef.io> 2019-04-23 16:52:14 -0700
commit: 596df89fec3311d27cbe714d06678b36aec142fd (patch)
tree: f37bb87464f42fc4b8f4a3fe3636b039498b5686 /docs/dev/design_documents
parent: 72ab794d1d3f40fa03c899b7756c73b4cfc4d782 (diff)
download: chef-596df89fec3311d27cbe714d06678b36aec142fd.tar.gz
2 files changed, 220 insertions, 0 deletions
diff --git a/docs/dev/design_documents/action_collection.md b/docs/dev/design_documents/action_collection.md
new file mode 100644
index 0000000000..a0735f65fb
--- /dev/null
+++ b/docs/dev/design_documents/action_collection.md
@@ -0,0 +1,106 @@
+---
+title: Action Collection
+---
+
+# Action Collection Design
+
+* Extract common code from the Resource Reporter and Data Collector.
+* Expose a general purpose API for querying a record of all actions taken during the Chef run.
+* Enable utilities like the 'zap' cookbook to be written to interact properly with Custom Resources.
+
+The Action Collection tracks all actions taken by all Chef resources.  The resources can be declared in recipe code, created as sub-resources of custom
+resources, or built "by hand".  Since the Action Collection hooks the events which are fired from the `run_action` method on `Chef::Resource`, it does
+not matter how the resources were built (as long as they were correctly passed the Chef `run_context`).
+
+This is complementary, but superior, to the resource collection which has an incomplete picture of what might happen or has happened in the run since there are
+many common ways of invoking resource actions which are not captured by how the resource collection is built.  Replaying the sequence of actions in
+the Action Collection would be closer to replaying the chef-client converge than trying to re-converge the resource collection (although both of
+those models are still flawed in the presence of any imperative code that controls the shape of those objects).
+
+This design extracts common duplicated code from the Data Collector and old Resource Reporter, and is designed to be used by other consumers which
+need to ask questions like "in this run, what file resources had actions fired on them?", which can then be used to answer questions like
+"which files is Chef managing in this directory?".
+
+# Usage
+
+## Action Collection Event Hook Registration
+
+Consumers may register an event handler which hooks the `action_collection_registration` hook.  This event is fired directly before recipes are
+compiled and converged (after library loading, attributes, etc).  This is just before the earliest point in time that a resource should fire an
+action, so it is the latest point at which a consumer should decide whether it needs the Action Collection to be enabled.
+
+Consumers can hook this method.  They will be passed the Action Collection instance, which can be saved by the caller to be queried later.  They
+should then register themselves with the Action Collection (since without registering any interest, the Action Collection will disable itself).
+
+```ruby
+  def action_collection_registration(action_collection)
+    @action_collection = action_collection
+    action_collection.register(self)
+  end
+```
+
+## Library Registration
+
+Any cookbook library code may also register itself with the Action Collection.  The Action Collection will be registered with the `run_context` after
+it is created, so registration may be accomplished easily:
+
+```ruby
+  Chef.run_context.action_collection.register(self)
+```
+
+## Action Collection Requires Registration
+
+If neither of the prior methods is used to register with the Action Collection, then the Action Collection will disable itself and will not record
+any actions, which avoids the memory overhead of tracking the actions during the run.  The Data Collector takes advantage of this:
+if the run start message from the Data Collector is refused by the server, then the Data Collector disables itself and does not register
+with the Action Collection, which would disable the Action Collection.  This relies on the delayed hooking through `action_collection_registration`,
+so that the Data Collector never registers itself after it is disabled.
+
+## Searching
+
+There is a function `filtered_collection` which returns "slices" off of the `ActionCollection` object.  The `max_nesting` argument can be used to prune
+how deep into sub-resources the returned view goes (`max_nesting: 0` will return only resources in recipe context, with any hand created resources, but
+no subresources).  There are also 5 different states of the action:  `up_to_date`, `skipped`, `updated`, `failed`, `unprocessed` which can be filtered
+on.  All of these are true by default, so they must be disabled to remove them from the filtered collection.
+
+The `ActionCollection` object itself implements `Enumerable` and returns `ActionRecord` objects (see the `ActionCollection` code for the fields exposed on
+`ActionRecord`s).
+
+This would return all file resources in any state in the recipe context:
+
+```ruby
+Chef.run_context.action_collection.filtered_collection(max_nesting: 0).select { |rec| rec.new_resource.is_a?(Chef::Resource::File) }
+```
+
+NOTE:
+As the Action Collection API was initially designed around the Resource Reporter and Data Collector use cases, the searching API is currently rudimentary
+and could easily lift some of the searching features on the name of the resource from the resource collection, and could use a more fluent API
+for composing searches.
+
+# Implementation Details
+
+## Resource Event Lifecycle Hooks
+
+Resource actions fire off several events in sequence:
+
+1. `resource_action_start` - this is always fired first
+2. `resource_current_state_loaded` - this is normally second, but may be skipped in the case of a resource which throws an exception during
+`load_current_resource` (which means that the `current_resource` off the `ActionRecord` may be nil).
+3. `resource_up_to_date` / `resource_skipped` / `resource_updated` / `resource_failed` - one of these is always called which corresponds to the state of the action.
+4. `resource_completed` - this is always fired last
+
+For skipped resources, the conditional will be saved in the `ActionRecord`.  For failed resources the exception is saved in the `ActionRecord`.
+
+## Unprocessed Resources
+
+The unprocessed resource concept is to report on resources which are left in the resource collection after a failure.  A successful Chef run should
+never leave any unprocessed resources (`action :nothing` resources are still inspected by the resource collection and are processed).  Unprocessed
+resources exist only when an exception is thrown during the execution of the resource collection; they are the resources that were never visited by
+the runner that executes the resource collection.
+
+This list will necessarily omit any unprocessed sub-resources in custom resources, since the run was aborted before those resources
+executed actions and built their own sub-resource collections.
+
+This was a design requirement of the Data Collector.
+
+To implement this in a more sane manner the runner that evaluates the resource collection now tracks the resources that it visits.
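The registration and filtering behaviour described in action_collection.md above can be illustrated with a small stand-alone toy model. This is only a sketch: `ToyActionCollection`, `ToyRecord`, `enabled?`, and the `record` helper are hypothetical names invented for illustration and are not Chef's implementation. Only `register`, `filtered_collection`, `max_nesting`, and the five status names are taken from the document.

```ruby
# Toy stand-in for the behaviour described above; NOT Chef's implementation.
# ToyRecord, ToyActionCollection, #enabled? and #record are hypothetical names.
ToyRecord = Struct.new(:name, :status, :nesting_level)

class ToyActionCollection
  include Enumerable

  STATUSES = %i[up_to_date skipped updated failed unprocessed].freeze

  def initialize
    @consumers = []
    @records = []
  end

  # Consumers must register, otherwise nothing is tracked.
  def register(consumer)
    @consumers << consumer
  end

  def enabled?
    !@consumers.empty?
  end

  # Hypothetical helper: called once per resource action in this toy model.
  def record(name, status, nesting_level = 0)
    return unless enabled? # no registered consumer, so no memory is spent

    @records << ToyRecord.new(name, status, nesting_level)
  end

  def each(&block)
    @records.each(&block)
  end

  # Every status is included by default; pass e.g. skipped: false to drop it.
  def filtered_collection(max_nesting: nil, **statuses)
    wanted = STATUSES.reject { |s| statuses[s] == false }
    select do |rec|
      wanted.include?(rec.status) &&
        (max_nesting.nil? || rec.nesting_level <= max_nesting)
    end
  end
end

# Without a registered consumer the toy collection records nothing.
idle = ToyActionCollection.new
idle.record("file[/etc/motd]", :updated)

# With a consumer registered, actions are tracked and can be sliced.
collection = ToyActionCollection.new
collection.register(:my_consumer)
collection.record("file[/etc/motd]", :updated)
collection.record("package[nginx]", :up_to_date)
collection.record("file[/etc/nginx/nginx.conf]", :skipped, 1) # sub-resource
top_level = collection.filtered_collection(max_nesting: 0, up_to_date: false)
```

In this toy model `top_level` holds only the top-level records whose status was not filtered out, which mirrors how the document describes disabling states and pruning sub-resources with `max_nesting`.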
diff --git a/docs/dev/design_documents/data_collector.md b/docs/dev/design_documents/data_collector.md
new file mode 100644
index 0000000000..be0a92e7fb
--- /dev/null
+++ b/docs/dev/design_documents/data_collector.md
@@ -0,0 +1,114 @@
+---
+title: Data Collector
+---
+
+# Data Collector Design
+
+The Data Collector design and API is covered in:
+
+https://github.com/chef/chef-rfc/blob/master/rfc077-mode-agnostic-data-collection.md
+
+This document will focus entirely on the nuts and bolts of the Data Collector.
+
+## Action Collection Integration
+
+Most of the work is done by a separate Action Collection to track the actions of Chef resources.  If the Data Collector is not enabled, it never registers with the
+Action Collection and no work will be done by the Action Collection to track resources.
+
+## Additional Collected Information
+
+The Data Collector also collects:
+
+- the expanded run list
+- deprecations
+- the node
+- formatted error output for exceptions
+
+Most of this is done through hooking events directly in the Data Collector itself.  The error handling is broken out into the ErrorHandlers module, which is mixed
+directly into the Data Collector to separate that concern out into a different file (it is straightforward with fairly little state, but is just a lot of hooked methods).
+
+## Basic Configuration Modes
+
+### Configured for Automate
+
+Do nothing.  The URL is constructed from the base `Chef::Config[:chef_server_url]`, auth is just Chef Server API authentication, and the default behavior is that it
+is configured.
+
+### Configured to Log to a File
+
+Set up a file output location; no token is necessary:
+
+```ruby
+Chef::Config[:data_collector][:output_locations] = { files:  [ "/Users/lamont/data_collector.out" ] }
+```
+
+Note that you can't assign to `Chef::Config[:data_collector][:output_locations][:files]` directly; doing so will raise a NoMethodError.
+
+### Configured to Log to a Non-Chef Server Endpoint
+
+Set up a server URL, which requires a token:
+
+```ruby
+Chef::Config[:data_collector][:server_url] = "https://chef.acme.local/myendpoint.html"
+Chef::Config[:data_collector][:token] = "mytoken"
+```
+
+This works for chef-clients which are configured to hit a chef server, but use a custom non-Chef-Automate endpoint for reporting, or for chef-solo/zero users.
+
+XXX: There is also the `Chef::Config[:data_collector][:output_locations] = { uri: [ "https://chef.acme.local/myendpoint.html" ] }` method -- which is going to behave
+differently, particularly on non-chef-solo use cases.  In that case the Data Collector `server_url` will still be automatically derived from the `chef_server_url` and
+the Data Collector will attempt to contact that endpoint, but with the token being supplied it will use that and will not use Chef Server authentication, and the
+server should 403 back, and if `raise_on_failure` is left to the default of false then it will simply drop that failure and continue without raising, which will
+appear to work, and output will be sent to the configured `output_locations`.  Note that the presence of a token flips all external URIs to using the token so that
+it is **not** possible to use this feature to talk to both a Chef Automate endpoint and a custom URI reporting endpoint (which would seem to be the most useful of an
+incredibly marginally useful feature and it does not work).  But given how hopelessly complicated this is, the recommendation is to use the `server_url` and to avoid
+using any `url` options in the `output_locations` since that feature is fairly poorly designed at this point in time.
+
+## Resiliency to Failures
+
+The Data Collector in Chef >= 15.0 is resilient to failures that occur anywhere in the main loop of the `Chef::Client#run` method.  In order to do this there is a lot
+of defensive coding around internal data structures that may be nil (e.g. failures before the node is loaded will result in the node being nil).  The spec tests for
+the Data Collector now run through a large sequence of events (which must, unfortunately, be manually kept in sync with the events in the Chef::Client if those events
+are ever 'moved' around) which should catch any issues in the Data Collector with early failures.  The specs should also serve as documentation for what the messages
+will look like under different failure conditions.  The goal was to keep the format of the messages as close as possible to the same schema even
+in the presence of failures.  But some data structures will be entirely empty.
+
+When the Data Collector fails extraordinarily early it still sends both a start and an end message.  This happens even if it fails so early that it would not
+normally have sent a start message.
+
+## Decision to Be Enabled
+
+This is complicated due to over-design and is encapsulated in the `#should_be_enabled?` method and the ConfigValidation module.  The `#should_be_enabled?` method and
+ConfigValidation should probably be merged into one renamed Config module to isolate the concern of processing the Chef::Config options and doing the correct thing.
+
+## Run Start and Run End Message Modules
+
+These are separated out into their own modules, which are very deliberately not mixed into the main Data Collector.  They use the Data Collector and Action Collection
+public interfaces.  They are stateless themselves.  This keeps the collaboration between them and the Data Collector very easy to understand.  The start message is
+relatively simple and straightforward.  The complication of the end message is mostly due to walking through the Action Collection and all the collected action
+records from the entire run, along with a lot of defensive programming to deal with early errors.
+
+## Relevant Event Sequence
+
+As it happens in the actual chef-client run:
+
+1. `events.register(data_collector)`
+2. `events.register(action_collection)`
+3. `run_status.run_id = request_id`
+4. `events.run_start(Chef::VERSION, run_status)`
+  * failures during registration will cause `registration_failed(node_name, exception, config)` here and skip to #13
+  * failures during node loading will cause `node_load_failed(node_name, exception, config)` here and skip to #13
+5. `events.node_load_success(node)`
+6. `run_status.node = node`
+  * failures during run list expansion will cause `run_list_expand_failed(node, exception)` here and skip to #13
+7. `events.run_list_expanded(expansion)`
+8. `run_status.start_clock`
+9. `events.run_started(run_status)`
+  * failures during cookbook resolution will cause `events.cookbook_resolution_failed(node, exception)` here and skip to #13
+  * failures during cookbook sync will cause `events.cookbook_sync_failed(node, exception)` here and skip to #13
+10. `events.cookbook_compilation_start(run_context)`
+11. < the resource events happen here which hit the Action Collection, may throw any of the other failure events >
+12. `events.converge_complete` or `events.converge_failed(exception)`
+13. `run_status.stop_clock`
+14. `run_status.exception = exception` if it failed
+15. `events.run_completed(node, run_status)` or `events.run_failed(exception, run_status)`
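The "skip to #13" behaviour in the event sequence above can be sketched with a stand-alone toy function. This is an illustration only, not Chef's `Chef::Client#run`: `toy_run`, `FAILURE_EVENTS`, and the stage symbols passed to `fail_at:` are hypothetical names, and only three of the failure points are modelled. The event names themselves are taken from the list.

```ruby
# Toy model of the event ordering above; NOT Chef's implementation.
# toy_run, FAILURE_EVENTS and the stage symbols are hypothetical names.
FAILURE_EVENTS = {
  node_load: :node_load_failed,
  run_list_expansion: :run_list_expand_failed,
  cookbook_sync: :cookbook_sync_failed,
}.freeze

# Returns the ordered list of events fired for a run that optionally
# fails at one of the modelled stages.
def toy_run(fail_at: nil)
  fired = [:run_start]
  exception = nil
  begin
    raise "node load failed" if fail_at == :node_load
    fired << :node_load_success
    raise "run list expansion failed" if fail_at == :run_list_expansion
    fired << :run_list_expanded << :run_started
    raise "cookbook sync failed" if fail_at == :cookbook_sync
    fired << :cookbook_compilation_start << :converge_complete
  rescue RuntimeError => e
    exception = e
    # The matching failure event fires, then control "skips to #13".
    fired << FAILURE_EVENTS.fetch(fail_at)
  end
  # Steps 13-15 run on both the success and the failure path.
  fired << :stop_clock
  fired << (exception ? :run_failed : :run_completed)
end

early_failure = toy_run(fail_at: :node_load)
clean_run = toy_run
```

Even when the failure happens at node load, the toy run still reaches the clock stop and the final `run_failed` event, which matches the document's point that the end of the run is always reported.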