Versions 2.8 and 2.9 are no longer supported by the Ansible project.
Change-Id: I888ddcbecadd56ced83a27ae5a6e70377dc3bf8c
This adds support for configuring tracing in Zuul along with
basic documentation of the configuration.
It also adds test infrastructure that runs a gRPC-based collector
so that we can test tracing end-to-end, and exercises a simple
test span.
Change-Id: I4744dc2416460a2981f2c90eb3e48ac93ec94964
This feature instructs Zuul to make one or more additional node requests
with a different node configuration (i.e., possibly different labels)
if the first one fails.
It is intended to address the case where a cloud provider is unable
to supply specialized high-performance nodes, and the user would like
the job to proceed anyway on lower-performance nodes.
Change-Id: Idede4244eaa3b21a34c20099214fda6ecdc992df
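As a hedged sketch (the nodeset name and the alternatives attribute below are assumptions, not taken from this commit), such a fallback might be expressed as:
  - nodeset:
      name: fast-or-slow
      alternatives:
        # Hypothetical nodeset names: ask for the high-performance labels
        # first, and fall back to standard labels if that request fails.
        - high-performance-nodes
        - standard-nodes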
This was deprecated quite some time ago and we should remove it as
part of the next major release.
Also remove a very old Zuul v1 layout.yaml from the test fixtures.
Change-Id: I40030840b71e95f813f028ff31bc3e9b3eac4d6a
This was previously deprecated and should be removed shortly before
we release Zuul v7.
Change-Id: Idbdfca227d2f7ede5583f031492868f634e1a990
This adds an option to include result data from a job in the MQTT
reporter. It is off by default since the result data may be quite
large for some jobs.
Change-Id: I802adee834b60256abd054eda2db834f8db82650
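As a hedged sketch (the include-returned-data option name and topic format below are assumptions), enabling this for an MQTT reporter might look like:
  # Hypothetical excerpt of a pipeline definition; only the reporter is shown.
  success:
    mqtt:
      topic: "zuul/{pipeline}/result"
      include-returned-data: true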
Change-Id: I0d450d9385b9aaab22d2d87fb47798bf56525f50
Change-Id: I2576d0dcec7c8f7bbb76bdd469fd992874742edc
I noticed in some of our testing that a construct like
  debug:
    msg: '{{ ansible_version }}'
was actually erroring out; if you look at the console output you'll see
Ansible output: b'TASK [Print ansible version msg={{ ansible_version }}] *************************'
Ansible output: b'[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin'
Ansible output: b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
Ansible output: b"0x7f502760b490>): 'dict' object has no attribute 'startswith'"
and job-output.txt will be empty for this task (this is detected
by I9f569a411729f8a067de17d99ef6b9d74fc21543).
This is because the msg value here comes in as a dict, and in several
places we assume it is a string. This changes the places where we inspect
the msg variable to use the standard Ansible way of making a text string
(the to_text function) and ensures the logging function converts its
input to a string.
We test for this with updated tasks in the remote_zuul_stream tests,
which are slightly refactored to do partial matches so we can use the
version strings, which is where we saw the issue.
Change-Id: I6e6ed8dba2ba1fc74e7fc8361e8439ea6139279e
Several of our tests which validate Ansible behavior with Zuul are
not versioned, so they do not exercise all supported versions of Ansible.
For those cases, add versioned tests and fix any discrepancies
uncovered by the additional tests (fortunately all are
minor test syntax issues and do not affect real-world usage).
One of our largest versioned Ansible tests was not actually testing
multiple Ansible versions -- we just ran it three times on the default
version. Correct that and add validation that the version that ran was
the expected version.
Change-Id: I26213f69fe844776408fce24322749a197e07551
This adds a config-error pipeline reporter configuration option and
now also reports config errors and merge conflicts to the database
as buildset failures.
The driving use case is that if periodic pipelines encounter config
errors (such as being unable to freeze a job graph), they might send
email if configured to send email on merge conflicts, but otherwise
their results are not reported to the database.
To make this more visible, first we need Zuul pipelines to report
buildset ends to the database in more cases -- currently we typically
only report a buildset end if there are jobs (and so a buildset start),
or in some other special cases. This change adds config errors and
merge conflicts to the set of cases where we report a buildset end.
Because of some shortcuts previously taken, that would end up reporting
a merge conflict message to the database instead of the actual error
message. To resolve this, we add a new config-error reporter action
and adjust the config error reporter handling path to use it instead
of the merge-conflicts action.
Tests of this as well as the merge-conflicts code path are added.
Finally, a small debug aid is added to the GerritReporter so that we
can easily see in the logs which reporter action was used.
Change-Id: I805c26a88675bf15ae9d0d6c8999b178185e4f1f
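As a hedged sketch (reporter details assumed; the config-error attribute name comes from the description above), a periodic pipeline might route these errors like:
  # Hypothetical excerpt of a periodic pipeline definition.
  failure:
    smtp:
      to: ci-admins@example.com
  config-error:
    smtp:
      to: ci-admins@example.com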
Change-Id: I12e8a056a2e5cd1bb18c1f24ecd7db55405f0a8c
We call item.setResult after a build is complete so that the queue
item can do any internal processing necessary (for example, prepare
data structures for child jobs, or move the build to the retry_builds
list).
In the case of deduplicated builds, we should do that for every queue
item the build participates in since each item may have a different
job graph.
We were not correctly identifying other builds of deduplicated jobs
and so in the case of a dependency cycle we would call setResult on
jobs of the same name in that cycle regardless of whether they were
deduplicated.
This corrects the issue and adds a test to detect that case.
Change-Id: I4c47beb2709a77c21c11c97f1d1a8f743d4bf5eb
There is no good reason to deduplicate the noop job (it consumes no
resources), and it's difficult to disable a behavior for the
noop job globally since it has no definition. Let's never have it
deduplicate, so that we keep things simple for folks who want to
avoid deduplication.
Change-Id: Ib3841ce5ef020540edef1cfa479d90c65be97112
In the before times when we only had a single scheduler, it was
naturally the case that reconfiguration events were processed as they
were encountered and no trigger events which arrived after them would
be processed until the reconfiguration was complete. As we added more
event queues to support SOS, it became possible for trigger events
which arrived at the scheduler to be processed before a tenant
reconfiguration caused by a preceding event to be complete. This is
now even possible with a single scheduler.
As a concrete example, imagine a change merges which updates the jobs
which should run on a tag, and then a tag is created. A scheduler
will process both of those events in succession. The first will cause
it to submit a tenant reconfiguration event, and then forward the
trigger event to any matching pipelines. The second event will also
be forwarded to pipeline event queues. The pipeline events will then
be processed, and then only at that point will the scheduler return to
the start of the run loop and process the reconfiguration event.
To correct this, we can take one of two approaches: make the
reconfiguration more synchronous, or make it safer to be
asynchronous. To make reconfiguration more synchronous, we would need
to be able to upgrade a tenant read lock into a tenant write lock
without releasing it. The lock recipes we use from kazoo do not
support this. While it would be possible to extend them to do so, it
would lead us further from parity with the upstream kazoo recipes, so
this approach is not used.
Instead, we will make it safer for reconfiguration to be asynchronous
by annotating every trigger event we forward with the last
reconfiguration event that was seen before it. This means that every
trigger event now specifies the minimum reconfiguration time for that
event. If our local scheduler has not reached that time, we should
stop processing trigger events and wait for it to catch up. This
means that schedulers may continue to process events up to the point
of a reconfiguration, but will then stop. The already existing
short-circuit to abort processing once a scheduler is ready to
reconfigure a tenant (where we check the tenant write lock contenders
for a waiting reconfiguration) helps us get out of the way of pending
reconfigurations as well. In short, once a reconfiguration is ready
to start, we won't start processing tenant events anymore because of
the existing lock check. And up until that happens, we will process
as many events as possible until any further events require the
reconfiguration.
We will use the ltime of the tenant trigger event as our timestamp.
As we forward tenant trigger events to the pipeline trigger event
queues, we decide whether an event should cause a reconfiguration.
Whenever one does, we note the ltime of that event and store it as
metadata on the tenant trigger event queue so that we always know what
the most recent required minimum ltime is (ie, the ltime of the most
recently seen event that should cause a reconfiguration). Every event
that we forward to the pipeline trigger queue will be annotated to
specify that its minimum required reconfiguration ltime is that most
recently seen ltime. And each time we reconfigure a tenant, we store
the ltime of the event that prompted the reconfiguration in the layout
state. If we later process a pipeline trigger event with a minimum
required reconfigure ltime greater than the current one, we know we
need to stop and wait for a reconfiguration, so we abort early.
Because this system involves several event queues and objects each of
which may be serialized at any point during a rolling upgrade, every
involved object needs to have appropriate default value handling, and
a synchronized model api change is not helpful. The remainder of this
commit message is a description of what happens with each object when
handled by either an old or new scheduler component during a rolling
upgrade.
When forwarding a trigger event and submitting a tenant
reconfiguration event:
The tenant trigger event zuul_event_ltime is initialized
from zk, so will always have a value.
The pipeline management event trigger_event_ltime is initialized to the
tenant trigger event zuul_event_ltime, so a new scheduler will write
out the value. If an old scheduler creates the tenant reconfiguration
event, it will be missing the trigger_event_ltime.
The _reconfigureTenant method is called with a
last_reconfigure_event_ltime parameter, which is either the
trigger_event_ltime above in the case of a tenant reconfiguration
event forwarded by a new scheduler, or -1 in all other cases
(including other types of reconfiguration, or a tenant reconfiguration
event forwarded by an old scheduler). If it is -1, it will use the
current ltime so that if we process an event from an old scheduler
which is missing the event ltime, or we are bootstrapping a tenant or
otherwise reconfiguring in a context where we don't have a triggering
event ltime, we will use an ltime which is very new so that we don't
defer processing trigger events. We also ensure we never go backward,
so that if we process an event from an old scheduler (and thus use the
current ltime) then process an event from a new scheduler with an
older (than "now") ltime, we retain the newer ltime.
Each time a tenant reconfiguration event is submitted, the ltime of
that reconfiguration event is stored on the trigger event queue. This
is then used as the min_reconfigure_ltime attribute on the forwarded
trigger events. This is updated by new schedulers, and ignored by old
ones, so if an old scheduler processes a tenant trigger event queue it
won't update the min ltime. That will just mean that any events
processed by a new scheduler may continue to use an older ltime as
their minimum, which should not cause a problem. Any events forwarded
by an old scheduler will omit the min_reconfigure_ltime field; that
field will be initialized to -1 when loaded on a new scheduler.
When processing pipeline trigger events:
In process_pipeline_trigger_queue we compare two values: the
last_reconfigure_event_ltime on the layout state which is either set
to a value as above (by a new scheduler), or will be -1 if it was last
written by an old scheduler (including in the case it was overwritten
by an old scheduler; it will re-initialize to -1 in that case). The
event.min_reconfigure_ltime field will either be the most recent
reconfiguration ltime seen by a new scheduler forwarding trigger
events, or -1 otherwise. If the min_reconfigure_ltime of an event is
-1, we retain the old behavior of processing the event regardless.
Only if we have a min_reconfigure_ltime > -1 and it is greater than
the layout state last_reconfigure_event_ltime (which itself may be -1,
and thus less than the min_reconfigure_ltime) do we abort processing
the event.
(The test_config_update test for the Gerrit checks plugin is updated
to include an extra waitUntilSettled since a potential test race was
observed during development.)
Change-Id: Icb6a7858591ab867e7006c7c80bfffeb582b28ee
The Zuul Ansible callback stream plugin assumed that the ansible loop
var was always called 'item' in the result_dict. You can override this
value (and it is often necessary to do so to avoid collisions) to
something less generic. In those cases we would get errors like:
b'[WARNING]: Failure using method (v2_runner_item_on_ok) in callback plugin'
b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
b"0x7fbecc97c910>): 'item'"
And stream output would not include the info typically logged.
Address this by checking if ansible_loop_var is in the result_dict and
using that value for the loop var name instead. We still fall back to
'item' as I'm not sure that ansible_loop_var is always present.
Change-Id: I408e6d4af632f8097d63c04cbcb611d843086f6c
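For reference, a standard Ansible construct like the following (names illustrative) overrides the loop variable and previously broke streaming:
  - name: Install packages
    command: "apt-get install -y {{ pkg }}"
    loop:
      - git
      - curl
    loop_control:
      # Overriding the default loop variable name ('item'); zuul_stream now
      # reads ansible_loop_var from the result dict to find it.
      loop_var: pkg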
If the command didn't exist:
- Popen would throw an exception
- 't' would not be updated (t is None)
- the return code would not be written to the console
- zuul_stream would wait unnecessarily for 10 seconds
Since rc is defined in the normal case as well as in both exception
cases, it can be written to the console in every case.
Change-Id: I77a4e1bdc6cd163143eacda06555b62c9195ee38
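A trivial illustration of the failure mode described above (the task itself is illustrative):
  - name: Run a command that does not exist on the node
    command: /usr/local/bin/definitely-not-installed --version
    # Previously: Popen raised, no return code was written to the console,
    # and zuul_stream waited about 10 seconds before giving up.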
In a previous change, support for the gitlab merge was added, but the
parameters dict was not properly passed to the invocation method.
Fix this now and add a corresponding test.
Change-Id: I781c02848abc524ca98e03984539507b769d19fe
In a multi-host Gerrit environment (HA or failover) it's plausible
that admins may use one mechanism for managing ingress for HTTP
requests and a different one for SSH requests. Or admins may have
different firewall rules for each. To accommodate these situations,
add an "ssh_server" configuration item for Gerrit. This makes the
set of hostname-like items the following:
* server: the HTTP hostname and default for all others
* canonical_hostname: what to use for golang-style git paths
* ssh_server: the hostname to use for SSH connections
* baseurl: the base URL for HTTP connections
The following are equivalent:
server=review.example.com
ssh_server=ssh-review.example.com
and:
server=ssh-review.example.com
baseurl=https://review.example.com
Change-Id: I6e9cd9f48c1a78d8d24bfe176efbb932a18ec83c
This adds an option to specify that certain branches should always trigger
dynamic configuration and never be included in static configuration.
The use case is a large number of rarely used feature branches, where
developers would still like to be able to run pre-merge check jobs, and alter
those jobs on request, but otherwise not have the configuration clogged up
with hundreds of generally unused job variants.
Change-Id: I60ed7a572d66a20a2ee014f72da3cb7132a550da
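As a hedged sketch (the always-dynamic-branches option name and its placement are assumptions), the tenant configuration might look like:
  # Hypothetical tenant config excerpt.
  - tenant:
      name: example-tenant
      source:
        gerrit:
          untrusted-projects:
            - org/project:
                always-dynamic-branches:
                  - "^feature/.*"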
This adds support for global semaphores which can be used by multiple tenants.
This supports the use case where they represent real-world resources which
operate independently of Zuul tenants.
This implements and removes the spec describing the feature. One change from
the spec is that the configuration object in the tenant config file is
"global-semaphore" rather than "semaphore". This makes it easier to distinguish
them in documentation (facilitating easier cross-references and deep links),
and may also make it easier for users to understand that they have distinct
behaviors.
Change-Id: I5f2225a700d8f9bef0399189017f23b3f4caad17
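A minimal sketch of the new tenant config object (the global-semaphore name comes from this message; the attributes shown are assumptions):
  # Hypothetical main.yaml excerpt.
  - global-semaphore:
      name: external-test-hardware
      max: 2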
When a project configuration defined a queue, but did not directly
specify any pipeline configuration (e.g. only referenced templates), the
relative priority queues were not set up correctly.
This could happen in pipelines using the independent and supercedent
managers. Other pipelines using the shared change queue mixin handle this
correctly.
This edge case is now tested in
`test_scheduler.TestScheduler.test_nodepool_relative_priority_check` by
slightly modifying the config to use a template for one of the projects.
Change-Id: I1f682e6593ccdad3cfacf5817fc1a1cf7de8856b
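The shape of configuration that exposed the bug is roughly the following (names illustrative):
  - project:
      name: org/project1
      queue: integrated
      # Only a template reference; no pipeline stanza is defined directly,
      # which previously left the relative priority queues unset.
      templates:
        - shared-check-jobs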
This adds support for deduplicating jobs within dependency cycles.
By default, this will happen automatically if we can determine that the
results of two builds would be expected to be identical. This uses a
heuristic which should almost always be correct; the behavior can be
overridden otherwise.
Change-Id: I890407df822035d52ead3516942fd95e3633094b
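As a hedged sketch (the deduplicate attribute name and its values are assumptions), overriding the automatic behavior on a job might look like:
  - job:
      name: integration-test
      # Assumed default is automatic; set explicitly to opt out of (or force)
      # deduplication for this job.
      deduplicate: false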
Node request failures cause a queue item to fail (naturally). In a normal
queue without cycles, that just means that we would cancel jobs behind and
wait for the current item to finish the remaining jobs. But with cycles,
the items in the bundle detect that items ahead (which are part of the bundle)
are failing and so they cancel their own jobs more aggressively. If they do
this before all the jobs have started (i.e., because we are waiting on an
unfulfilled node request), they can end up in a situation where they never
run builds, yet they don't report because they are still expecting
those builds.
This likely points to a larger problem in that we should probably not be
canceling those jobs so aggressively. However, the more serious and immediate
problem is the race condition that can cause items not to report.
To correct this immediate problem, tell the scheduler to create fake build
objects with a result of "CANCELED" when the pipeline manager cancels builds
and there is no existing build already. This will at least mean that all
expected builds are present regardless of whether the node request has been
fulfilled.
A later change can be made to avoid canceling jobs in the first place without
needing to change this behavior.
Change-Id: I1e1150ef67c03452b9a98f9366434c53a5ad26fb
If a role is applied to a host more than once (via either play
roles or include_roles, but not via an include_role loop), it will
have the same task UUID from ansible which means Zuul's command
plugin will write the streaming output to the same filename, and
the log streaming will request the same file. That means the file
might look like this after the second invocation:
2022-05-19 17:06:23.673625 | one
2022-05-19 17:06:23.673781 | [Zuul] Task exit code: 0
2022-05-19 17:06:29.226463 | two
2022-05-19 17:06:29.226605 | [Zuul] Task exit code: 0
But since we stop reading the log after "Task exit code", the user
would see "one" twice, and never see "two".
Here are some potential fixes for this that don't work:
* Accessing the task vars from zuul_stream to store any additional
information: the callback plugins are not given the task vars.
* Setting the log id on the task args in zuul_stream instead of
command: the same Task object is used for each host and therefore
the command module might see the task object after it has been
further modified (in other words, nothing host-specific can be
set on the task object).
* Setting an even more unique uuid than Task._uuid on the Task
object in zuul_stream and using that in the command module instead
of Task._uuid: in some rare cases, the actual task Python object
may be different between the callback and command plugin, yet still
have the same _uuid; therefore the new attribute would be missing.
Instead, a global variable is used in order to transfer data between
zuul_stream and command. This variable holds a counter for each
task+host combination. Most of the time it will be 1, but if we run
the same task on the same host again, it will increment. Since Ansible
will not run more than one task on a host simultaneously, so there is
no race between the counter being incremented in zuul_stream and used
in command.
Because Ansible is re-invoked for each playbook, the memory usage is
not a concern.
There may be a fork between zuul_stream and command, but that's fine
as long as we treat it as read-only in the command plugin. It will
have the data for our current task+host from the most recent zuul_stream
callback invocation.
This change also includes a somewhat unrelated change to the test
infrastructure. Because we were not setting the log stream port on
the executor in tests, we were actually relying on the "real" OpenDev
Zuul starting zuul_console on the test nodes rather than the
zuul_console we set up for each specific Ansible version from the tests.
This corrects that and uses the correct zuul_console port, so that if we
make any changes to zuul_console in the future, we will test the
changed version, not the one from the Zuul which actually runs the
tox-remote job.
Change-Id: Ia656db5f3dade52c8dbd0505b24049fe0fff67a5
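For reference, and per the description above, a play along these lines (role name illustrative) applies the same role to a host twice without a loop, which is the situation where the task UUID is re-used:
  - hosts: all
    tasks:
      - include_role:
          name: common-setup
      # Including the same role again on the same host re-uses the same task
      # UUIDs inside the role, so its streamed log files previously collided.
      - include_role:
          name: common-setup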
This allows operators to filter the set of branches from which
Zuul loads configuration. They are similar to exclude-unprotected-branches
but apply to all drivers.
Change-Id: I8201b3a19efb266298decb4851430b7205e855a1
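As a hedged sketch (the include-branches option name and placement are assumptions), such a filter might appear in the tenant configuration as:
  # Hypothetical tenant config excerpt.
  - tenant:
      name: example-tenant
      source:
        gerrit:
          untrusted-projects:
            - org/project:
                include-branches:
                  - "^master$"
                  - "^stable/.*"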
This is a follow-on from I02bcd307bcfad8d99dd0db13d979ce7ba3d5e0e4,
which creates a fake library to simulate different result types being
returned to the zuul_json callback plugin.
Change-Id: Ib0d2360f98daf33e05207c9c285528ce2f51cbf9
Change-Id: I0358608cb588000f6f9c0ec8ac0c4db179f8fab7