author    Zuul <zuul@review.opendev.org>    2022-09-09 15:49:43 +0000
committer Gerrit Code Review <review@openstack.org>    2022-09-09 15:49:43 +0000
commit    76ec804a893253755c794e8271f56e567b0c1f3d (patch)
tree      ea2081a6faeec5cd14bd9a0444a1c17ab02c27be
parent    defe3d4136c2a415bdfce641de242a20a9745f55 (diff)
parent    781a8470d3790e6257fa5e813e9d8750c41b72bc (diff)
download  zuul-76ec804a893253755c794e8271f56e567b0c1f3d.tar.gz
Merge "Add Nodepool in Zuul spec"
-rw-r--r--  doc/source/developer/specs/index.rst            |   9 +-
-rw-r--r--  doc/source/developer/specs/nodepool-in-zuul.rst | 743 +++++++++++++++++++++++
2 files changed, 748 insertions, 4 deletions
diff --git a/doc/source/developer/specs/index.rst b/doc/source/developer/specs/index.rst
index 78c11bbb8..a75084429 100644
--- a/doc/source/developer/specs/index.rst
+++ b/doc/source/developer/specs/index.rst
@@ -16,11 +16,12 @@ documentation instead.
 .. toctree::
    :maxdepth: 1

-   tenant-scoped-admin-web-API
-   kubernetes-operator
    circular-dependencies
-   zuul-runner
+   community-matrix
    enhanced-regional-executors
+   kubernetes-operator
+   nodepool-in-zuul
    tenant-resource-quota
-   community-matrix
+   tenant-scoped-admin-web-API
    tracing
+   zuul-runner
diff --git a/doc/source/developer/specs/nodepool-in-zuul.rst b/doc/source/developer/specs/nodepool-in-zuul.rst
new file mode 100644
index 000000000..10a3ad7dc
--- /dev/null
+++ b/doc/source/developer/specs/nodepool-in-zuul.rst
@@ -0,0 +1,743 @@
+Nodepool in Zuul
+================
+
+.. warning:: This is not authoritative documentation. These features
+   are not currently available in Zuul. They may change significantly
+   before final implementation, or may never be fully completed.
+
+The following specification describes a plan to move Nodepool's
+functionality into Zuul and end development of Nodepool as a separate
+application. This will allow for more node and image related features
+as well as simpler maintenance and deployment.
+
+Introduction
+------------
+
+Nodepool exists as a distinct application from Zuul largely due to
+historical circumstances: it was originally a process for launching
+nodes, attaching them to Jenkins, detaching them from Jenkins and
+deleting them. Once Zuul grew its own execution engine, Nodepool
+could have been adopted into Zuul at that point, but the existing
+loose API meant it was easy to maintain them separately and combining
+them wasn't particularly advantageous.
+
+However, now we find ourselves with a very robust framework in Zuul
+for dealing with ZooKeeper, multiple components, web services and REST
+APIs. All of these are lagging behind in Nodepool, and it is time to
+address that one way or another. We could of course upgrade
+Nodepool's infrastructure to match Zuul's, or even separate out these
+frameworks into third-party libraries. However, there are other
+reasons to consider tighter coupling between Zuul and Nodepool, and
+these tilt the scales in favor of moving Nodepool functionality into
+Zuul.
+
+Designing Nodepool as part of Zuul would allow for more features
+related to Zuul's multi-tenancy. Zuul is quite good at
+fault-tolerance as well as scaling, so designing Nodepool around that
+could allow for better cooperation between node launchers. Finally,
+as part of Zuul, Nodepool's image lifecycle can be more easily
+integrated with Zuul-based workflow.
+
+There are two Nodepool components: nodepool-builder and
+nodepool-launcher. We will address the functionality of each in the
+following sections on Image Management and Node Management.
+
+This spec contemplates a new Zuul component to handle image and node
+management: zuul-launcher. Much of the Nodepool configuration will
+become Zuul configuration as well. That is detailed in its own
+section, but for now, it's enough to know that the Zuul system as a
+whole will know what images and node labels are present in the
+configuration.
+
+Image Management
+----------------
+
+Part of nodepool-builder's functionality is important to have as a
+long-running daemon, and part of what it does would make more sense as
+a Zuul job. By moving the actual image build into a Zuul job, we can
+make the activity more visible to users of the system. It will be
+easier for users to test changes to image builds (inasmuch as they can
+propose a change and a check job can run on that change to see if the
+image builds successfully). Build history and logs will be visible in
+the usual way in the Zuul web interface.
+
+A frequently requested feature is the ability to verify images before
+putting them into service. This is not practical with the current
+implementation of Nodepool because of the loose coupling with Zuul.
+However, once we are able to include Zuul jobs in the workflow of
+image builds, it is easier to incorporate Zuul jobs to validate those
+images as well. This spec includes a mechanism for that.
+
+The parts of nodepool-builder that make sense as a long-running
+daemon are the parts dealing with image lifecycles. Uploading builds
+to cloud providers, keeping track of image builds and uploads,
+deciding when those images should enter or leave service, and deleting
+them are all better done with state management and long-running
+processes (we should know -- early versions of Nodepool attempted to
+do all of that with Jenkins jobs with limited success).
+
+The sections below describe how we will implement image management in
+Zuul.
+
+First, a reminder that using custom images is optional with Zuul.
+Many Zuul systems will be able to operate using only stock cloud
+provider images. One of the strengths of nodepool-builder is that it
+can build an image for Zuul without relying on any particular cloud
+provider images. A Zuul system whose operator wants to use custom
+images will need to bootstrap that process, and under the proposed
+system where images are built in Zuul jobs, that would need to be done
+using a stock cloud image. In other words, to bootstrap a system such
+as OpenDev from scratch, the operators would need to use a stock cloud
+image to run the job to build the custom image. Once a custom image
+is available, further image builds could be run on either the stock
+cloud image or the custom image. That decision is left to the
+operator and involves consideration of fault tolerance and disaster
+recovery scenarios.
+
+To build a custom image, an operator will define a fairly typical Zuul
+job for each image they would like to produce. For example, a system
+may have one job to build a debian-stable image, a second job for
+debian-unstable, a third job for ubuntu-focal, a fourth job for
+ubuntu-jammy. Zuul's job inheritance system could be very useful here
+to deal with many variations of a similar process.
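+
+As a sketch of how that inheritance might look (the parent job name,
+playbook path, and variable here are illustrative assumptions, not
+part of this spec):
+
+.. code-block:: yaml
+
+   # A hypothetical abstract parent job holding the common build logic.
+   - job:
+       name: build-diskimage
+       abstract: true
+       run: playbooks/build-diskimage.yaml
+
+   # Per-image variants then only supply their specifics.
+   - job:
+       name: build-debian-unstable-image
+       parent: build-diskimage
+       vars:
+         diskimage: debian-unstable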
+
+Currently nodepool-builder will build an image under three
+circumstances: 1) the image (or the image in a particular format) is
+missing; 2) a user has directly requested a build; 3) on an automatic
+interval (typically daily). To map this into Zuul, we will use Zuul's
+existing pipeline functionality, but we will add a new trigger for
+case #1. Case #2 can be handled by a manual Zuul enqueue command, and
+case #3 by a periodic pipeline trigger.
+
+Since Zuul knows what images are configured and what their current
+states are, it will be able to emit trigger events when it detects
+that a new image (or image format) has been added to its
+configuration. In these cases, the `zuul` driver in Zuul will enqueue
+an `image-build` trigger event on startup or reconfiguration for every
+missing image. The event will include the image name. Pipelines will
+be configured to trigger on `image-build` events as well as on a timer
+trigger.
+
+Jobs will include an extra attribute to indicate they build a
+particular image. This serves two purposes: first, in the case of an
+`image-build` trigger event, it will act as a matcher so that only
+jobs matching the image that needs building are run. Second, it will
+allow Zuul to determine which formats are needed for that image (based
+on which providers are configured to use it) and include that
+information as job data.
+
+The job will be responsible for building the image and uploading the
+result to some storage system. The URLs for each image format built
+should be returned to Zuul as artifacts.
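+
+For example, a build job could use Zuul's existing ``zuul_return``
+Ansible module in a post-run playbook to report those artifacts (the
+storage URL here is illustrative):
+
+.. code-block:: yaml
+
+   - hosts: localhost
+     tasks:
+       - name: Return image artifact to Zuul
+         zuul_return:
+           data:
+             zuul:
+               artifacts:
+                 - name: raw image
+                   url: https://storage.example.com/new_image.raw
+                   metadata:
+                     type: zuul_image
+                     image_name: debian-unstable
+                     format: raw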
+
+Finally, the `zuul` driver reporter will accept parameters which will
+tell it to search the result data for these artifact URLs and update
+the internal image state accordingly.
+
+An example configuration for a simple single-stage image build:
+
+.. code-block:: yaml
+
+   - pipeline:
+       name: image
+       trigger:
+         zuul:
+           events:
+             - image-build
+         timer:
+           time: 0 0 * * *
+       success:
+         zuul:
+           image-built: true
+           image-validated: true
+
+   - job:
+       name: build-debian-unstable-image
+       image-build-name: debian-unstable
+
+This job would run whenever Zuul determines it needs a new
+debian-unstable image or daily at midnight. Once the job completes,
+because of the ``image-built: true`` report, Zuul will look for
+artifact data like this:
+
+.. code-block:: yaml
+
+   artifacts:
+     - name: raw image
+       url: https://storage.example.com/new_image.raw
+       metadata:
+         type: zuul_image
+         image_name: debian-unstable
+         format: raw
+     - name: qcow2 image
+       url: https://storage.example.com/new_image.qcow2
+       metadata:
+         type: zuul_image
+         image_name: debian-unstable
+         format: qcow2
+
+Zuul will update internal records in ZooKeeper for the image to record
+the storage URLs. The zuul-launcher process will then start
+background processes to download the images from the storage system
+and upload them to the configured providers (much as nodepool-builder
+does now with files on disk). As a special case, if it detects that
+the image files are stored in a location from which a provider can
+import directly, it may skip the local download and import from the
+storage location instead.
+
+To handle image validation, a flag will be stored for each image
+upload indicating whether it has been validated. The example above
+specifies ``image-validated: true`` and therefore Zuul will put the
+image into service as soon as all image uploads are complete.
+However, if it were false, then Zuul would emit an `image-validate`
+event after each upload is complete. A second pipeline can be
+configured to perform image validation. It can run any number of
+jobs, and since Zuul has complete knowledge of image states, it will
+supply nodes using the new image upload (which is not yet in service
+for normal jobs). An example of this might look like:
+
+.. code-block:: yaml
+
+   - pipeline:
+       name: image-validate
+       trigger:
+         zuul:
+           events:
+             - image-validate
+       success:
+         zuul:
+           image-validated: true
+
+   - job:
+       name: validate-debian-unstable-image
+       image-build-name: debian-unstable
+       nodeset:
+         nodes:
+           - name: node
+             label: debian
+
+The label should specify the same image that is being validated. Its
+node request will be made with extra specifications so that it is
+fulfilled with a node built from the image under test. This process
+may repeat for each of the providers using that image (normal pipeline
+queue deduplication rules may need a special case to allow this).
+Once the validation jobs pass, the entry in ZooKeeper will be updated
+and the image will go into regular service.
+
+A more specific process definition follows:
+
+After a buildset reports with ``image-built: true``, Zuul will scan
+the result data and, for each artifact it finds, create an entry in
+ZooKeeper at `/zuul/images/<image_name>/<sequence>`. Zuul will know
+not to emit any more `image-build` events for that image at this
+point.
+
+For every provider using that image, Zuul will create an entry in
+ZooKeeper at
+`/zuul/image-uploads/<image_name>/<image_number>/provider/<provider_name>`.
+It will set the remote image ID to null and the `image-validated` flag
+to whatever was specified in the reporter.
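+
+As an illustration, such an upload record might contain something like
+the following (the field names here are assumptions made for the sake
+of example; the spec only requires the remote image ID and the
+validation flag):
+
+.. code-block:: yaml
+
+   # /zuul/image-uploads/debian-unstable/0000000001/provider/rax-dfw
+   external-id: null    # remote image ID; set once the upload completes
+   validated: false     # from the reporter's image-validated setting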
+
+Whenever zuul-launcher observes a new `image-upload` record without an
+ID, it will:
+
+* Lock the whole image
+* Lock each upload it can handle
+* Unlock the image while retaining the upload locks
+* Download the artifact (if needed) and upload images to the provider
+* If an upload requires validation, enqueue an `image-validate` zuul
+  driver trigger event
+* Unlock the upload
+
+The locking sequence is so that a single launcher can perform multiple
+uploads from a single artifact download if it has the opportunity.
+
+Once more than two builds of an image are in service, the oldest is
+deleted: the image's ZooKeeper record is set to the `deleting` state,
+and zuul-launcher will delete its uploads from the providers. The
+`zuul` driver then emits an `image-delete` event with item data for
+the image artifact, which will trigger an image-delete job that can
+delete the artifact from the cloud storage.
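+
+Mirroring the earlier examples, a sketch of how such a cleanup
+pipeline and job might be configured (this exact configuration is not
+prescribed by the spec):
+
+.. code-block:: yaml
+
+   - pipeline:
+       name: image-delete
+       trigger:
+         zuul:
+           events:
+             - image-delete
+
+   - job:
+       name: delete-debian-unstable-image
+       image-build-name: debian-unstable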
+
+All of these pipeline definitions should typically be in a single
+tenant (but need not be), but the images they build are potentially
+available to each tenant that includes the image definition
+configuration object (see the Configuration section below). Any repo
+in a tenant with an image build pipeline will be able to cause images
+to be built and uploaded to providers.
+
+Snapshot Images
+~~~~~~~~~~~~~~~
+
+Nodepool does not currently support snapshot images, but the spec for
+the current version of Nodepool does contemplate the possibility of a
+snapshot-based nodepool-builder process. Likewise, this spec does not
+require us to support snapshot image builds, but in case we want to
+add support in the future, we should have a plan for it.
+
+The image build job in Zuul could, instead of running
+diskimage-builder, act on the remote node to prepare it for a
+snapshot. A special job attribute could indicate that it is a
+snapshot image job, and instead of having the zuul-launcher component
+delete the node at the end of the job, it could snapshot the node and
+record that information in ZooKeeper. Unlike an image-build job, an
+image-snapshot job would need to run in each provider (similar to how
+it is proposed that an image-validate job will run in each provider).
+An image-delete job would not be required.
+
+
+Node Management
+---------------
+
+The techniques we have developed for cooperative processing in Zuul
+can be applied to the node lifecycle. This is a good time to make a
+significant change to the nodepool protocol. We can achieve several
+long-standing goals:
+
+* Scaling and fault-tolerance: rather than having a 1:N relationship
+  of provider:nodepool-launcher, we can have multiple zuul-launcher
+  processes, each of which is capable of handling any number of
+  providers.
+
+* More intentional request fulfillment: almost no intelligence goes
+  into selecting which provider will fulfill a given node request; by
+  assigning providers intentionally, we can more efficiently utilize
+  providers.
+
+* Fulfilling node requests from multiple providers: by designing
+  zuul-launcher for cooperative work, we can have nodesets that
+  request nodes which are fulfilled by different providers. Generally
+  we should favor the same provider for a set of nodes (since they may
+  need to communicate over a LAN), but if that is not feasible,
+  allowing multiple providers to fulfill a request will permit
+  nodesets with diverse node types (e.g., VM + static, or VM +
+  container).
+
+Each zuul-launcher process will execute a number of processing loops
+in series: first a global request processing loop, and then a
+processing loop for each provider. Each one will involve obtaining a
+ZooKeeper lock so that only one zuul-launcher process will perform
+each function at a time.
+
+Zuul-launcher will need to know about every connection in the system
+so that it may have a full copy of the configuration, but operators
+may wish to localize launchers to specific clouds. To support this,
+zuul-launcher will take an optional command-line argument to indicate
+on which connections it should operate.
+
+Currently a node request as a whole may be declined by providers. We
+will make that more granular and store information about each node in
+the request (in other words, individual nodes may be declined by
+providers).
+
+All drivers for providers should implement the state machine
+interface. Any state machine information currently stored in memory
+in nodepool-launcher will need to move to ZooKeeper so that other
+launchers can resume state machine processing.
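+
+For illustration, a node record in ZooKeeper might then carry both the
+per-node decline bookkeeping and the in-progress state machine data
+(all paths and field names below are assumptions made for the sake of
+example):
+
+.. code-block:: yaml
+
+   # /zuul/nodes/0000000042 (illustrative path and fields)
+   state: building
+   provider: rax-dfw-main
+   declined-by: []            # providers that declined this node
+   state-machine:
+     phase: create-server     # resumable by any zuul-launcher
+     external-id: a1b2c3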
+
+The individual provider loop will:
+
+* Lock a provider in ZooKeeper (`/zuul/provider/<name>`)
+* Iterate over every node assigned to that provider in a `building` state
+
+  * Drive the state machine
+  * If success, update request
+  * If failure, determine if it's a temporary or permanent failure
+    and update the request accordingly
+  * If quota available, unpause provider (if paused)
+
+The global queue process will:
+
+* Lock the global queue
+* Iterate over every pending node request, and every node within that request
+
+  * If all providers have failed the request, clear all temp failures
+  * If all providers have permanently failed the request, return error
+  * Identify providers capable of fulfilling the request
+  * Assign nodes to any provider with sufficient quota
+  * If no provider has sufficient quota, assign the node to the first
+    (highest priority) provider that can fulfill it later and pause
+    that provider
+
+Configuration
+-------------
+
+The configuration currently handled by Nodepool will be refactored and
+added to Zuul's configuration syntax. It will be loaded directly from
+git repos like most Zuul configuration; however, it will be
+non-speculative (like pipelines and semaphores -- changes must merge
+before they take effect).
+
+Information about connecting to a cloud will be added to ``zuul.conf``
+as a ``connection`` entry. The rate limit setting will be moved to
+the connection configuration. Providers will then reference these
+connections by name.
+
+Because providers and images reference global (i.e., outside tenant
+scope) concepts, ZooKeeper paths for data related to those should
+include the canonical name of the repo where these objects are
+defined. For example, a `debian-unstable` image in the
+`opendev/images` repo should be stored at
+``/zuul/zuul-images/opendev.org%2fopendev%2fimages/``. This avoids
+collisions if different tenants contain different image objects with
+the same name.
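+
+For example (the second repository here is hypothetical), two images
+with the same short name defined in different repos occupy distinct
+ZooKeeper paths and therefore cannot collide::
+
+   /zuul/zuul-images/opendev.org%2fopendev%2fimages/debian-unstable
+   /zuul/zuul-images/example.com%2facme%2fimages/debian-unstable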
+
+The actual Zuul config objects will be tenant scoped. Image
+definitions which should be available to a tenant should be included
+in that tenant's config. Again using the OpenDev example, the
+hypothetical `opendev/images` repository should be included in every
+OpenDev tenant so all of those images are available.
+
+Within a tenant, image names must be unique (otherwise it is a tenant
+configuration error, similar to a job name collision).
+
+The diskimage-builder related configuration items will no longer be
+necessary since they will be encoded in Zuul jobs. This will reduce
+the complexity of the configuration significantly.
+
+The provider configuration will change as we take the opportunity to
+make it more "Zuul-like". Instead of a top-level dictionary, we will
+use lists. We will standardize on attributes used across drivers
+where possible, as well as attributes which may be located at
+different levels of the configuration.
+
+The goals of this reorganization are:
+
+* Allow projects to manage their own image lifecycle (if permitted by
+  site administrators).
+* Manage access control to labels, images and flavors via standard
+  Zuul mechanisms (whether an item appears within a tenant).
+* Reduce repetition and boilerplate for systems with many clouds,
+  labels, or images.
+
+The new configuration objects are:
+
+Image
+ This represents any kind of image (A Zuul image built by a job
+ described above, or a cloud image). By using one object to
+ represent both, we open the possibility of having a label in one
+ provider use a cloud image and in another provider use a Zuul image
+ (because the label will reference the image by short-name which may
+ resolve to a different image object in different tenants). A given
+ image object will specify what type it is, and any relevant
+ information about it (such as the username to use, etc).
+
+Flavor
+ This is a new abstraction layer to reference instance types across
+ different cloud providers. Much like labels today, these probably
+ won't have much information associated with them other than to
+ reserve a name for other objects to reference. For example, a site
+ could define a `small` and a `large` flavor. These would later be
+ mapped to specific instance types on clouds.
+
+Label
+ Unlike the current Nodepool ``label`` definitions, these labels will
+ also specify the image and flavor to use. These reference the two
+ objects above, which means that labels themselves contain the
+ high-level definition of what will be provided (e.g., a `large
+ ubuntu` node) while the specific mapping of what `large` and
+ `ubuntu` mean are left to the more specific configuration levels.
+
+Section
+ This looks a lot like the current ``provider`` configuration in
+ Nodepool (but also a little bit like a ``pool``). Several parts of
+ the Nodepool configuration (such as separating out availability
+ zones from providers into pools) were added as an afterthought, and
+ we can take the opportunity to address that here.
+
+ A ``section`` is part of a cloud. It might be a region (if a cloud
+ has regions). It might be one or more availability zones within a
+ region. A lot of the specifics about images, flavors, subnets,
+ etc., will be specified here. Because a cloud may have many
+ sections, we will implement inheritance among sections.
+
+Provider
+ This is mostly a mapping of labels to sections and is similar to a
+ provider pool in the current Nodepool configuration. It exists as a
+ separate object so that site administrators can restrict ``section``
+ definitions to central repos and allow tenant administrators to
+ control their own image and labels by allowing certain projects to
+ define providers.
+
+ It mostly consists of a list of labels, but may also include images.
+
+When launching a node, relevant attributes may come from several
+sources (the image, flavor, label, section, or provider). Not all attributes
+make sense in all locations, but where we can support them in multiple
+locations, the order of application (later items override earlier
+ones) will be:
+
+* ``image`` stanza
+* ``flavor`` stanza
+* ``label`` stanza
+* ``section`` stanza (top level)
+* ``image`` within ``section``
+* ``flavor`` within ``section``
+* ``provider`` stanza (top level)
+* ``label`` within ``provider``
+
+This reflects that the configuration is built upwards from general and
+simple objects toward more specific objects: image, flavor, label,
+section, provider. Generally speaking, inherited scalar values will
+override, dicts will merge, and lists will concatenate.
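+
+As a small illustration of those merge rules, consider the ``tags``
+dicts defined in the example below: a node launched with the
+``ubuntu`` label through the ``rax-dfw-main`` provider would carry the
+merged result of the section-level tags and the label's tags:
+
+.. code-block:: yaml
+
+   # Merged tags seen by the launched node (illustrative):
+   tags:
+     section-info: foo
+     provider-info: bar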
+
+An example configuration follows. First, some configuration which may
+appear in a central project and shared among multiple tenants:
+
+.. code-block:: yaml
+
+   # Images, flavors, and labels are the building blocks of the
+   # configuration.
+
+   - image:
+       name: centos-7
+       type: zuul
+       # Any other image-related info such as:
+       # username: ...
+       # python-path: ...
+       # shell-type: ...
+       # A default that can be overridden by a provider:
+       # config-drive: true
+
+   - image:
+       name: ubuntu
+       type: cloud
+
+   - flavor:
+       name: large
+
+   - label:
+       name: centos-7
+       min-ready: 1
+       flavor: large
+       image: centos-7
+
+   - label:
+       name: ubuntu
+       flavor: small
+       image: ubuntu
+
+   # A section for each cloud+region+az
+
+   - section:
+       name: rax-base
+       abstract: true
+       connection: rackspace
+       boot-timeout: 120
+       launch-timeout: 600
+       key-name: infra-root-keys-2020-05-13
+       # The launcher will apply the minimum of the quota reported by the
+       # driver (if available) or the values here.
+       quota:
+         instances: 2000
+       subnet: some-subnet
+       tags:
+         section-info: foo
+       # We attach both kinds of images to providers in order to provide
+       # image-specific info (like config-drive) or username.
+       images:
+         - name: centos-7
+           config-drive: true
+           # This is a Zuul image
+         - name: ubuntu
+           # This is a cloud image, so the specific cloud image name is required
+           image-name: ibm-ubuntu-20-04-3-minimal-amd64-1
+           # Other information may be provided
+           # username ...
+           # python-path: ...
+           # shell-type: ...
+       flavors:
+         - name: small
+           cloud-flavor: "Performance 8G"
+         - name: large
+           cloud-flavor: "Performance 16G"
+
+   - section:
+       name: rax-dfw
+       parent: rax-base
+       region: 'DFW'
+       availability-zones: ["a", "b"]
+
+   # A provider to indicate what labels are available to a tenant from
+   # a section.
+
+   - provider:
+       name: rax-dfw-main
+       section: rax-dfw
+       labels:
+         - name: centos-7
+         - name: ubuntu
+           key-name: infra-root-keys-2020-05-13
+           tags:
+             provider-info: bar
+
+The following configuration might appear in a repo that is only used
+in a single tenant:
+
+.. code-block:: yaml
+
+   - image:
+       name: devstack
+       type: zuul
+
+   - label:
+       name: devstack
+
+   - provider:
+       name: rax-dfw-devstack
+       section: rax-dfw
+       # Images can be attached to the provider just as on a section.
+       images:
+         - name: devstack
+           config-drive: true
+       labels:
+         - name: devstack
+
+Here is a potential static node configuration:
+
+.. code-block:: yaml
+
+   - label:
+       name: big-static-node
+
+   - section:
+       name: static-nodes
+       connection: null
+       nodes:
+         - name: static.example.com
+           labels:
+             - big-static-node
+           host-key: ...
+           username: zuul
+
+   - provider:
+       name: static-provider
+       section: static-nodes
+       labels:
+         - big-static-node
+
+Each of the above stanzas may only appear once in a tenant for a
+given name (like pipelines or semaphores, they are singleton objects).
+If they appear in more than one branch of a project, the definitions
+must be identical; if they differ, or if they appear in more than one
+repo, the second definition is an error. These are meant to be used in
+unbranched repos. Whatever tenants they appear in will be permitted
+to access those respective resources.
+
+The purpose of the ``provider`` stanza is to associate labels, images,
+and sections. Much of the configuration related to launching an
+instance (including the availability of zuul or cloud images) may be
+supplied in the ``provider`` stanza and will apply to any labels
+within. The ``section`` stanza also allows configuration of the same
+information except for the labels themselves. The ``section``
+supplies default values and the ``provider`` can override them or add
+any missing values. Images are additive -- any images that appear in
+a ``provider`` will augment those that appear in a ``section``.
+
+The result is a modular scheme for configuration, where a single
+``section`` instance can be used to set as much information as
+possible that applies globally to a provider. A simple configuration
+may then have a single ``provider`` instance to attach labels to that
+section. A more complex installation may define a "standard" provider
+that is present in every tenant, and then tenant-specific providers as
+well. These providers will all attach to the same section.
+
+References to sections, images and labels will be internally converted
+to canonical repo names to avoid ambiguity. Under the current
+Nodepool system, labels are truly a global object, but under this
+proposal, a label short name in one tenant may refer to a different
+object than the same short name in another. Therefore the node request
+will internally specify the
+canonical label name instead of the short name. Users will never use
+canonical names, only short names.
+
+For static nodes, there is some repetition in labels: first, labels
+must be associated with the individual nodes defined on the section,
+then the labels must appear again on a provider. This allows an
+operator to define a collection of static nodes centrally on a
+section, then include tenant-specific sets of labels in a provider.
+For the simple case where all static node labels in a section should
+be available in a provider, we could consider adding a flag to the
+provider to allow that (e.g., ``include-all-node-labels: true``).
+Static nodes themselves are configured on a section with a ``null``
+connection (since there is no cloud provider associated with static
+nodes). In this case, the additional ``nodes`` section attribute
+becomes available.
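+
+A sketch of that possible flag, extending the static-node example
+above (this is a suggestion in this spec, not a committed feature):
+
+.. code-block:: yaml
+
+   - provider:
+       name: static-provider
+       section: static-nodes
+       # Hypothetical flag: expose every node label defined on the section.
+       include-all-node-labels: true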
+
+Upgrade Process
+---------------
+
+Most users of diskimages will need to create new jobs to build these
+images. This proposal also includes significant changes to the node
+allocation system which come with operational risks.
+
+To make the transition as minimally disruptive as possible, we will
+support both systems in Zuul, and allow for selection of one system or
+the other on a per-label and per-tenant basis.
+
+By default, if a nodeset specifies a label that is not defined by a
+``label`` object in the tenant, Zuul will use the old system and place
+a ZooKeeper request in ``/nodepool``. If a matching ``label`` is
+available in the tenant, the request will use the new system and be
+sent to ``/zuul/node-requests``. Once a tenant has completely
+converted, a configuration flag may be set in the tenant configuration
+and that will allow Zuul to treat nodesets that reference unknown
+labels as configuration errors. A later version of Zuul will remove
+the backwards compatibility and make this the standard behavior.
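+
+A sketch of how that tenant flag might look (the flag name here is an
+assumption; this spec does not define it):
+
+.. code-block:: yaml
+
+   - tenant:
+       name: example-tenant
+       # Hypothetical flag: treat nodesets that reference labels with no
+       # matching label object as configuration errors.
+       require-label-objects: true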
+
+Because each of the systems will have unique metadata, they will not
+recognize each other's nodes, and it will appear to each that another
+system is using part of their quota. Nodepool is already designed to
+handle this case (at least, handle it as well as possible).
+
+Library Requirements
+--------------------
+
+The new zuul-launcher component will need most of Nodepool's current
+dependencies, which will entail adding many third-party cloud provider
+interfaces. As of writing, this uses another 420M of disk space.
+Since our primary method of distribution at this point is container
+images, if the additional space is a concern, we could restrict the
+installation of these dependencies to only the zuul-launcher image.
+
+Diskimage-Builder Testing
+-------------------------
+
+The diskimage-builder project team has come to rely on Nodepool in its
+testing process. It uses Nodepool to upload images to a devstack
+cloud, launch nodes from those images, and verify that they
+function. To aid in continuity of testing in the diskimage-builder
+project, we will extract the OpenStack image upload and node launching
+code into a simple Python script that can be used in diskimage-builder
+test jobs in place of Nodepool.
+
+Work Items
+----------
+
+* In existing Nodepool, convert the following drivers to statemachine:
+  gce, kubernetes, openshift, openstack (openstack is the only one
+  likely to require substantial effort; the others should be trivial)
+* Replace Nodepool with an image upload script in diskimage-builder
+  test jobs
+* Add roles to zuul-jobs to build images using diskimage-builder
+* Implement node-related config items in Zuul config and Layout
+* Create zuul-launcher executable/component
+* Add image-name item data
+* Add image-build-name attribute to jobs
+
+  * Include a job matcher based on item image-name
+  * Include image format information based on global config
+* Add zuul driver pipeline trigger/reporter
+* Add image lifecycle manager to zuul-launcher
+
+  * Emit image-build events
+  * Emit image-validate events
+  * Emit image-delete events
+* Add Nodepool driver code to Zuul
+* Update zuul-launcher to perform image uploads and deletion
+* Implement node launch global request handler
+* Implement node launch provider handlers
+* Update Zuul nodepool interface to handle both Nodepool and
+  zuul-launcher node request queues
+* Add tenant feature flag to switch between them
+* Release a minor version of Zuul with support for both
+* Remove Nodepool support from Zuul
+* Release a major version of Zuul with only zuul-launcher support
+* Retire Nodepool