With the default "linear" strategy (and likely others), Ansible will
send the on_task_start callback and then fork a worker process to
execute that task. Since we spawn a thread in the on_task_start
callback, we can end up emitting a log message in this method while
Ansible is forking. If a forked process inherits a Python file object
(e.g., stdout) that is locked by a thread that doesn't exist in the
fork (i.e., this one), it can deadlock when trying to flush the file
object. To minimize the chances of that happening, we should avoid
using _display outside the main thread.
The Python logging module is supposed to use internal locks which are
automatically acquired and released across a fork. Assuming this is
(still) true and functioning correctly, we should be okay to issue
our Python logging module calls at any time. If there is a fault in
this system, however, it could potentially cause a similar problem.
If we can convince the Ansible maintainers to lock _display across
forks, we may be able to revert this change in the future.
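For illustration, a minimal sketch of the pattern this implies (names
are illustrative, not the plugin's actual code):

    import logging
    import threading

    log = logging.getLogger("zuul.ansible.callback")  # hypothetical name

    def on_task_start(task_name):
        def _worker():
            # Presumed safe: the logging module's handler locks are
            # acquired and released around fork() in CPython (via
            # os.register_at_fork), so a forked child should not
            # inherit a lock held by this thread.
            log.debug("task starting: %s", task_name)

        threading.Thread(target=_worker, daemon=True).start()
        # Avoided: calling _display (which writes to the shared stdout
        # file object) from this thread, since Ansible may fork a
        # worker while the flush lock is held.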
Change-Id: Ifc6b835c151539e6209284728ccad467bef8be6f
This adds a config-error pipeline reporter configuration option and
now also reports config errors and merge conflicts to the database
as buildset failures.
The driving use case is that if periodic pipelines encounter config
errors (such as being unable to freeze a job graph), they might send
email if configured to send email on merge conflicts, but otherwise
their results are not reported to the database.
To make this more visible, first we need Zuul pipelines to report
buildset ends to the database in more cases -- currently we typically
only report a buildset end if there are jobs (and so a buildset start),
or in some other special cases. This change adds config errors and
merge conflicts to the set of cases where we report a buildset end.
Because of some shortcuts previously taken, that would end up reporting
a merge conflict message to the database instead of the actual error
message. To resolve this, we add a new config-error reporter action
and adjust the config error reporter handling path to use it instead
of the merge-conflicts action.
Tests of this as well as the merge-conflicts code path are added.
Finally, a small debug aid is added to the GerritReporter so that we
can easily see in the logs which reporter action was used.
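A condensed sketch of the dispatch this describes (method and
attribute names are assumptions for illustration, not Zuul's actual
API):

    def report_failure(pipeline, item, sql_reporter):
        if item.getConfigErrors():
            # New: config errors get their own reporter action rather
            # than reusing the merge-conflicts message.
            reporters = pipeline.config_error_actions
        elif item.didMergerFail():
            reporters = pipeline.merge_conflict_actions
        else:
            reporters = pipeline.failure_actions
        for reporter in reporters:
            reporter.report(item)
        # New: record a buildset end in the database even when no jobs
        # (and thus no buildset start) were reported.
        sql_reporter.reportBuildsetEnd(item.current_build_set)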
Change-Id: I805c26a88675bf15ae9d0d6c8999b178185e4f1f
The report message "This change depends on a change that failed to
merge" (and a similar message for circular dependency bundles) is
famously vague. To help users identify the actual problem, include
URLs for the change(s) that caused it so that users may more easily
resolve the issue.
Change-Id: Id8b9f8cf2c108703e9209e30bdc9a3933f074652
To avoid issues with outdated Github access tokens in the Git config,
we only update the remote URL on the repo object after the config
update has succeeded.
This also adds a missing repo lock when building the repo state.
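In outline (hypothetical helper; the real code drives the repo via
GitPython), the ordering is:

    def update_remote(repo, new_url):
        # Write the URL (carrying the fresh access token) to the git
        # config first; this raises if the update fails.
        repo.git.remote("set-url", "origin", new_url)
        # Only now mirror it onto the repo object, so the cached URL
        # can never get ahead of what the config actually contains.
        repo.cached_remote_url = new_url  # hypothetical cached attribute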
Change-Id: I8e1b5b26f03cb75727d2b2e3c9310214a3eac447
Change-Id: I12e8a056a2e5cd1bb18c1f24ecd7db55405f0a8c
Sometimes Gerrit events may arrive in batches (for example, an
automated process modifies several related changes nearly
simultaneously). Because of our inbuilt delay (10 seconds by
default), it's possible that in these cases, many or all of the
updates represented by these events will have settled on the Gerrit
server before we even start processing the first event. In these
cases, we don't need to query the same changes multiple times.
Take for example a stack of 10 changes. Someone approves all 10
simultaneously. That would produce (at least) 10 events for Zuul
to process. Each event would cause Zuul to query all 10 changes in
the series (since they are related). That's 100 change queries
(and each change query requires 2 or 3 HTTP queries).
But if we know that all the events arrived before our first set of
change queries, we can reduce that set of 100 queries to 10 by
suppressing any queries after the first.
This change generates a logical timestamp (ltime) immediately
before querying Gerrit for a change, and stores that ltime in the
change cache. Whenever an event arrives for processing with an
ltime earlier than the query ltime, we assume the change is already
up to date with that event and do not perform another query.
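In pseudocode, the suppression amounts to something like this (helper
names are illustrative):

    def get_change(connection, change_cache, change_key, event):
        cached = change_cache.get(change_key)
        if cached is not None and \
                event.zuul_event_ltime <= cached.query_ltime:
            # The change was queried after this event settled on the
            # Gerrit server; the cached copy already reflects it.
            return cached
        query_ltime = connection.getCurrentLtime()  # hypothetical helper
        data = connection.queryChange(change_key)   # 2-3 HTTP requests
        return change_cache.update(change_key, data, query_ltime)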
Change-Id: Ib1b9245cc84ab3f5a0624697f4e3fc73bc8e03bd
In zuul_stream.py:v2_playbook_on_task_start() we check for
"task.loop" and exit early if the task is part of a loop. However,
the library/command.py override still writes out the console log even
though it is never read. To avoid leaving this file around, mark a
sentinel uuid in the action plugin if the command is part of a loop.
In that case, for simplicity, we just write to /dev/null -- that way
no other assumptions in library/command.py have to change; it just
doesn't leave a file on disk.
This is currently difficult to test because the zuul_console running
in the infrastructure leaves /tmp/console-* files behind, and we
cannot tell which of those come from production use and which from
testing. After this and the related change
I823156dc2bcae91bd6d9770bd1520aa55ad875b4 are deployed to the
infrastructure executors, we can write a simple and complete test by
ensuring no /tmp/console-* files are left behind after testing. I
have tested this locally and no longer see files from loops, which I
was seeing before this change.
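A sketch of the sentinel handling on both sides (the constant and
helper names are invented for illustration):

    import uuid

    LOOP_SENTINEL = "in-loop-ignore"  # hypothetical sentinel value

    # Action plugin side: tag loop tasks with the sentinel instead of
    # a real log id.
    def assign_log_id(task):
        return LOOP_SENTINEL if task.loop else uuid.uuid4().hex

    # library/command.py side: keep the write path unchanged, but
    # point it at /dev/null so no console file is left on disk.
    def console_path(zuul_log_id):
        if zuul_log_id == LOOP_SENTINEL:
            return "/dev/null"
        return "/tmp/console-%s.log" % zuul_log_id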
Change-Id: I4f4660c3c0b0f170561c14940cc159dc43eadc79
An executor can have a zone configured and at the same time allow
unzoned jobs. In this case the executor was not counted in the zoned
executor metrics (online/accepting).
Change-Id: Ib39947e3403d828b595cf2479e64789e049e63cc
In Ief366c092e05fb88351782f6d9cd280bfae96237 I missed that this runs
in the context of the remote node; meaning that it must support all
the Python versions that might run there. f-strings are not 3.5
compatible.
I'm thinking about how to lint this better (a syntax check run?)
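One possible syntax gate, as a sketch (not the project's actual lint
setup): byte-compile the remote-side files under the oldest
interpreter they must support, so 3.6+ syntax such as f-strings fails
fast.

    import glob
    import subprocess
    import sys

    def syntax_check(files, interpreter="python3.5"):  # assumed oldest
        ok = True
        for path in files:
            # py_compile exits non-zero on a SyntaxError under that
            # interpreter's grammar.
            result = subprocess.run([interpreter, "-m", "py_compile", path])
            if result.returncode != 0:
                ok = False
        return ok

    if __name__ == "__main__":
        files = glob.glob("zuul/ansible/**/*.py", recursive=True)
        sys.exit(0 if syntax_check(files) else 1)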
Change-Id: Ia4133b061800791196cd631f2e6836cb77347664
When using protocol version 1, send a finalise message when streaming
is complete so that the zuul_console daemon can delete the temporary
file.
We test this by inspecting the Ansible console output, which logs a
message with the UUID of the streaming job. We dump the temporary
files on the remote side and make sure a console file for that job
isn't present.
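A sketch of the exchange (the wire framing shown is invented for
illustration):

    import os

    def finish_streaming(sock, log_uuid):
        # zuul_stream side: tell the daemon we are done with this log.
        sock.sendall(("f:%s\n" % log_uuid).encode("ascii"))

    def handle_finalise(log_uuid):
        # zuul_console side: the streamed spool file is no longer
        # needed, so remove it.
        try:
            os.unlink("/tmp/console-%s.log" % log_uuid)
        except FileNotFoundError:
            pass  # already gone; nothing to clean up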
Change-Id: I823156dc2bcae91bd6d9770bd1520aa55ad875b4
A refresher on how this works, to the best of my knowledge:
1. Zuul's Ansible has a library task, "zuul_console:", which is run
against the remote node; this forks a console daemon listening on a
default port.
2. We have an action plugin that runs for each task and, if that task
is a command/shell task, assigns it a unique id.
3. We then override the command/shell tasks with library/command.py,
a version that forks and runs the process on the target node as
usual, but also saves the stdout/stderr output to a temporary file
named for the unique uuid from the step above.
4. At the same time, we have the callback plugin zuul_stream.py,
which Ansible calls as it moves through starting, running, and
finishing tasks. This looks at the task and, if it has a UUID [2],
sends a request to the zuul_console daemon [1], which opens the
temporary file [3] and starts streaming it back.
5. We loop reading this until the connection is closed by [1],
eventually outputting each line.
In this way, the console log is effectively streamed and saved into
our job output.
We have established that the console daemon [1] is updated
asynchronously from the command/streaming parts [3, 4] in situations
such as static nodes. This poses a problem if we ever want to update
either part -- for example, we cannot change the file name that
command.py logs to, because an old zuul_console: will not know to
open the new file. Other things you might like to do (e.g. negotiate
compression) would have similar issues.
To provide the flexibility for these types of changes, implement a
simple protocol where the zuul_stream and zuul_console sides exchange
their respective version numbers before sending the log files. This
way both sides can decide which operations are mutually compatible.
Luckily the existing protocol, which is really just sending a plain
uuid, can be adapted to this. When an old zuul_console server gets
the protocol request, it will just look like an invalid log file
name, which zuul_stream can handle, and from which it assumes the
remote end doesn't know about protocols.
This change bumps the testing timeout; it seems that the extra calls
make for random failures. The failures are random and not in the
same place; I've run this separately in 850719 several times and not
seen any problems with the higher timeout. This test already has a
slightly higher settle timeout, so I think it must have just been on
the edge.
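A sketch of the negotiation from the zuul_stream side (the framing
shown is illustrative, not the exact wire format):

    def negotiate(sock, my_version, log_uuid):
        # Advertise our protocol version before requesting the log.
        sock.sendall(("protocol-version: %d\n" % my_version).encode("ascii"))
        reply = sock.makefile().readline().strip()
        if reply.startswith("protocol-version:"):
            server_version = int(reply.split(":", 1)[1])
        else:
            # An old zuul_console treated our version line as a bogus
            # log file name; interpret that as "knows no protocol".
            server_version = 0
        sock.sendall(("%s\n" % log_uuid).encode("ascii"))
        # Both sides proceed with the highest mutually supported version.
        return min(my_version, server_version)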
Change-Id: Ief366c092e05fb88351782f6d9cd280bfae96237
| | | |
Adds the pipeline for the change to the subject format. This makes it
easier to include information about the pipeline (e.g. its name) in the
e-mail subject
Change-Id: I6ec973635543b4404c125589f23ffd1ba5504c17
This fixes pep8 E275 which wants whitespace after assert and del.
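For reference, the shape of the fix E275 asks for:

    value = 1
    temporary = object()

    # Flagged by pycodestyle E275 (missing whitespace after keyword):
    #   assert(value == 1)
    #   del(temporary)

    # Fixed:
    assert value == 1
    del temporary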
Change-Id: I1f8659f462aa91c3fdf8f7eb8b939b67c0ce9f55
So that operators can see in aggregate how long merge, files-changes,
and repo-state merge operations take in certain pipelines, add
metrics for the merge operations themselves (these exclude the
overhead of pipeline processing and job dispatching).
Change-Id: I8a707b8453c7c9559d22c627292741972c47c7d7
The pipeline change cache is used to avoid repeated cache lookups
for change dependencies, but it is not as robust as the "real" source
change cache at managing the lifetime of change objects. If it is
used to store some change objects on one scheduler which are later
dequeued by a second scheduler, then the next time those changes show
up in that pipeline on the first scheduler, it may use old change
objects instead of new ones from the ZooKeeper cache.
To fix this, clear the pipeline change cache before we refresh the
pipeline state. This will cause some extra ZK ChangeCache hits to
repopulate the cache, but only in the case of commit dependencies
(the pipeline change cache isn't used for git dependencies or the
changes in the pipeline themselves unless they are commit
dependencies).
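In outline (attribute names are assumptions, not Zuul's exact ones),
the fix is an extra step at the start of the refresh:

    def refresh_pipeline_state(pipeline, context):
        # Drop locally cached Change objects first; another scheduler
        # may have dequeued and invalidated them since our last pass.
        pipeline.manager.change_cache.clear()
        # Commit dependencies will repopulate the cache from the real
        # ZooKeeper-backed ChangeCache as they are encountered.
        pipeline.state.refresh(context)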
Change-Id: I0a20dc972917440d4f3e8bb59295b77c13913a48
Nodepool updated its internal interface for registering launchers.
Since Zuul uses that internal interface to determine what labels
are available, update it to match.
Change-Id: Iffa0c1c1d9ef8d195c5e1ea1b625de9d119add3b
When we deduplicate a build, we use the same Build object in
multiple BuildSets. But when a second scheduler loads the pipeline
state, even if the build path points to a build outside of a given
BuildSet, it will create new Build objects and set their local object
references for the buildset and job to the "current" buildset rather
than the one being referenced. Instead, we need to find the Build
object for the other buildset, but at this point in the process the
queues haven't been completely loaded yet, so we can't go searching
for it.
To correct this, when we load a Build from ZooKeeper that is outside
of our buildset, we insert a BuildReference object which is a
placeholder until we finish loading the pipeline state. Then at the
end of that process, we go through and replace the BuildReference
objects with the actual builds.
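A sketch of the two-pass load (the class shape is illustrative):

    class BuildReference:
        """Placeholder for a build stored outside this buildset."""
        def __init__(self, build_path):
            self.build_path = build_path  # ZK path of the real build

    def resolve_build_references(buildsets, builds_by_path):
        # Second pass, run once every queue has been loaded: swap each
        # placeholder for the deduplicated Build it points at.
        for buildset in buildsets:
            for job_name, build in list(buildset.builds.items()):
                if isinstance(build, BuildReference):
                    buildset.builds[job_name] = \
                        builds_by_path[build.build_path]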
Change-Id: I0f0e5e4b8f8c9d34b86ccd83ad0b0578ce4a780c
We call item.setResult after a build is complete so that the queue
item can do any internal processing necessary (for example, prepare
data structures for child jobs, or move the build to the retry_builds
list).
In the case of deduplicated builds, we should do that for every queue
item the build participates in since each item may have a different
job graph.
We were not correctly identifying other builds of deduplicated jobs,
so in the case of a dependency cycle we would call setResult on jobs
of the same name in that cycle regardless of whether they were
actually deduplicated.
This corrects the issue and adds a test to detect that case.
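A sketch of the corrected identification (names illustrative): match
on build identity, not on job name.

    def notify_items_of_completion(build, pipeline):
        for item in pipeline.getAllItems():
            for candidate in item.current_build_set.getBuilds():
                # Only the very same (deduplicated) Build object
                # counts; a same-named job elsewhere in the cycle
                # does not.
                if candidate is build:
                    item.setResult(build)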
Change-Id: I4c47beb2709a77c21c11c97f1d1a8f743d4bf5eb
There is no good reason to deduplicate the noop job (it consumes no
resources), and it's difficult to disable a behavior for the noop job
globally since it has no definition. Let's never have it deduplicate
so that we keep things simple for folks who want to avoid
deduplication.
Change-Id: Ib3841ce5ef020540edef1cfa479d90c65be97112
For heavy users of "dispatch jobs" (where many jobs are declared as
dependencies of a single job which then mutates the child_jobs
return value to indicate which few of those should be run), there
may be large numbers of "SKIPPED" jobs in the status page and in the
final job report, which reduces the usability of both of those.
Yet it is important for users to be able to see skipped jobs since
they may represent an error (they may be inadvertently skipped).
To address this, we remove "SKIPPED" jobs from the status page by
default, but add a button at the bottom of the change box which
can toggle their display.
We remove "SKIPPED" jobs from the report, but add a note at the
bottom which says "Skipped X jobs". Users can follow the buildset
link to see which ones were skipped.
The buildset page will continue to show all the jobs for the buildset.
Change-Id: Ie297168cdf5b39d1d6f219e9b2efc44c01e87f35
Currently the timer trigger accepts an optional "jitter" specification
which can delay the start of a pipeline timer trigger by up to a
certain number of seconds. It applies uniformly to every project-branch
that participates in the pipeline. For example, if a periodic pipeline
with nova and glance is configured to trigger at midnight, and has
a jitter of 30 seconds, then the master and stable branches of nova and
glance will all be enqueued at the same time (perhaps 00:00:17).
This provides some utility: if other systems are configured to do
things around midnight, this pipeline may not join a thundering herd
with them; and if there are many periodic pipelines configured for
midnight (perhaps across different tenants, or just with slightly
different purposes), they won't form a thundering herd with each
other.
But to the extent that jobs within a given pipeline might want to avoid
a thundering herd with other similar jobs in the same pipeline, it offers
no relief. While Zuul may be able to handle it (especially since multiple
schedulers allows other pipelines to continue to operate), these jobs
may interact with remote systems which would appreciate not being DoS'd.
To alleviate this, we change the jitter from applying to the pipeline
as a whole to individual project-branches. To be clear, it is still the
case that the pipeline has only a single configured trigger time (this
change does not allow projects to configure their own triggers). But
instead of firing a single event for the entire pipeline, we will fire
a unique event for every project-branch in that pipeline, and these
events will have the jitter applied to them individually. So in our
example above, nova@master might fire at 00:00:05, nova@stable/zulu
may fire at 00:00:07, glance@master at 00:00:13, etc.
This behavior is similar enough in spirit to the current behavior that
we can consider it a minor implementation change, and it doesn't require
any new configuration options, feature flags, deprecation notice, etc.
The documentation is updated to describe the new behavior, as well as
correct an error relating to the description of jitter (it only delays,
not advances, events).
We currently add a single job to APScheduler for every timer triggered
pipeline in every tenant (so the number of jobs is the sum of the
periodic pipelines in every tenant). OpenDev for example may have on
the order of 20 APScheduler jobs. With the new approach, we will
enqueue a job for each project-branch in a periodic pipeline. For a
system like OpenDev, that could potentially be thousands of jobs.
In reality, based on current configuration and pipeline participation,
it should be 176.
Even though it will result in the same number of Zuul trigger events,
there is overhead to having more APScheduler jobs. To characterize
this, I performed a benchmark where I added a certain number of
APScheduler jobs with the same trigger time (and no jitter) and
recorded the amount of time needed to add the jobs and also, once the
jobs began firing, the elapsed time from the first to the last job.
This should characterize the additional overhead the scheduler will
encounter with this change.
Time needed to add jobs to APScheduler (seconds)
1: 0.00014448165893554688
10: 0.0009338855743408203
100: 0.00925445556640625
1000: 0.09204769134521484
10000: 0.9236903190612793
100000: 11.758053541183472
1000000: 223.83168983459473
Time to run jobs (last-first in seconds)
1: 2.384185791015625e-06
10: 0.006863832473754883
100: 0.09936022758483887
1000: 0.22670435905456543
10000: 1.517075777053833
100000: 19.97287678718567
1000000: 399.24730825424194
Given that this operates primarily at the tenant level (when a tenant
reconfiguration happens, jobs need to be removed and added), I think
it makes sense to consider up to 10,000 jobs a reasonable high end.
It looks like we can go a little past that (between 10,000 and 100,000)
while still seeing something like a linear increase. As we approach
1,000,000 jobs, it starts looking more polynomial and I would not
consider the performance to be acceptable. But 100,000 is already an
unlikely number, so I think this level of performance is okay within
the likely range of jobs.
The default executor used by APScheduler is a standard Python
ThreadPoolExecutor with a maximum of 10 simultaneous workers. This
will cause us to fire up to 10 Zuul events simultaneously (whereas
before we were only likely to fire simultaneous events if multiple
tenants had identical pipeline timer triggers). This could result in
more load on the connection sources and change cache as they update
the branch tips in the change cache. It seems plausible that 10
simultaneous events are something that the sources and ZK can handle.
If not, we can reduce the granularity of the lock we use to prevent
updating the same project at the same time (to perhaps a single lock
for all projects), or construct the APScheduler with a lower number
of max_workers.
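A minimal sketch of the per-project-branch approach using
APScheduler's own jitter support (Zuul's actual implementation
differs; project names here are examples):

    from apscheduler.schedulers.background import BackgroundScheduler
    from apscheduler.triggers.cron import CronTrigger

    def enqueue_timer_event(project, branch):
        # In Zuul this would create a trigger event for one
        # project-branch; printing stands in for that here.
        print("firing for %s@%s" % (project, branch))

    scheduler = BackgroundScheduler()
    for project, branch in [("nova", "master"), ("nova", "stable/zulu"),
                            ("glance", "master")]:
        # One job per project-branch, each independently jittered by
        # up to 30 seconds past midnight.
        scheduler.add_job(enqueue_timer_event,
                          CronTrigger(hour=0, minute=0, jitter=30),
                          args=(project, branch))
    scheduler.start()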
Change-Id: I27fc23763da81273eb135e14cd1d0bd95964fd16
In the before times when we only had a single scheduler, it was
naturally the case that reconfiguration events were processed as they
were encountered and no trigger events which arrived after them would
be processed until the reconfiguration was complete. As we added more
event queues to support SOS, it became possible for trigger events
which arrived at the scheduler to be processed before a tenant
reconfiguration caused by a preceding event to be complete. This is
now even possible with a single scheduler.
As a concrete example, imagine a change merges which updates the jobs
which should run on a tag, and then a tag is created. A scheduler
will process both of those events in succession. The first will cause
it to submit a tenant reconfiguration event, and then forward the
trigger event to any matching pipelines. The second event will also
be forwarded to pipeline event queues. The pipeline events will then
be processed, and then only at that point will the scheduler return to
the start of the run loop and process the reconfiguration event.
To correct this, we can take one of two approaches: make the
reconfiguration more synchronous, or make it safer to be
asynchronous. To make reconfiguration more synchronous, we would need
to be able to upgrade a tenant read lock into a tenant write lock
without releasing it. The lock recipes we use from kazoo do not
support this. While it would be possible to extend them to do so, it
would lead us further from parity with the upstream kazoo recipes, so
this approach is not used.
Instead, we will make it safer for reconfiguration to be asynchronous
by annotating every trigger event we forward with the last
reconfiguration event that was seen before it. This means that every
trigger event now specifies the minimum reconfiguration time for that
event. If our local scheduler has not reached that time, we should
stop processing trigger events and wait for it to catch up. This
means that schedulers may continue to process events up to the point
of a reconfiguration, but will then stop. The already existing
short-circuit to abort processing once a scheduler is ready to
reconfigure a tenant (where we check the tenant write lock contenders
for a waiting reconfiguration) helps us get out of the way of pending
reconfigurations as well. In short, once a reconfiguration is ready
to start, we won't start processing tenant events anymore because of
the existing lock check. And up until that happens, we will process
as many events as possible until any further events require the
reconfiguration.
We will use the ltime of the tenant trigger event as our timestamp.
As we forward tenant trigger events to the pipeline trigger event
queues, we decide whether an event should cause a reconfiguration.
Whenever one does, we note the ltime of that event and store it as
metadata on the tenant trigger event queue so that we always know what
the most recent required minimum ltime is (ie, the ltime of the most
recently seen event that should cause a reconfiguration). Every event
that we forward to the pipeline trigger queue will be annotated to
specify that its minimum required reconfiguration ltime is that most
recently seen ltime. And each time we reconfigure a tenant, we store
the ltime of the event that prompted the reconfiguration in the layout
state. If we later process a pipeline trigger event with a minimum
required reconfigure ltime greater than the current one, we know we
need to stop and wait for a reconfiguration, so we abort early.
Because this system involves several event queues and objects each of
which may be serialized at any point during a rolling upgrade, every
involved object needs to have appropriate default value handling, and
a synchronized model api change is not helpful. The remainder of this
commit message is a description of what happens with each object when
handled by either an old or new scheduler component during a rolling
upgrade.
When forwarding a trigger event and submitting a tenant
reconfiguration event:
The tenant trigger event zuul_event_ltime is initialized
from zk, so will always have a value.
The pipeline management event trigger_event_ltime is initialized to the
tenant trigger event zuul_event_ltime, so a new scheduler will write
out the value. If an old scheduler creates the tenant reconfiguration
event, it will be missing the trigger_event_ltime.
The _reconfigureTenant method is called with a
last_reconfigure_event_ltime parameter, which is either the
trigger_event_ltime above in the case of a tenant reconfiguration
event forwarded by a new scheduler, or -1 in all other cases
(including other types of reconfiguration, or a tenant reconfiguration
event forwarded by an old scheduler). If it is -1, it will use the
current ltime so that if we process an event from an old scheduler
which is missing the event ltime, or we are bootstrapping a tenant or
otherwise reconfiguring in a context where we don't have a triggering
event ltime, we will use an ltime which is very new so that we don't
defer processing trigger events. We also ensure we never go backward,
so that if we process an event from an old scheduler (and thus use the
current ltime) then process an event from a new scheduler with an
older (than "now") ltime, we retain the newer ltime.
Each time a tenant reconfiguration event is submitted, the ltime of
that reconfiguration event is stored on the trigger event queue. This
is then used as the min_reconfigure_ltime attribute on the forwarded
trigger events. This is updated by new schedulers, and ignored by old
ones, so if an old scheduler processes a tenant trigger event queue it
won't update the min ltime. That will just mean that any events
processed by a new scheduler may continue to use an older ltime as
their minimum, which should not cause a problem. Any events forwarded
by an old scheduler will omit the min_reconfigure_ltime field; that
field will be initialized to -1 when loaded on a new scheduler.
When processing pipeline trigger events:
In process_pipeline_trigger_queue we compare two values: the
last_reconfigure_event_ltime on the layout state which is either set
to a value as above (by a new scheduler), or will be -1 if it was last
written by an old scheduler (including in the case it was overwritten
by an old scheduler; it will re-initialize to -1 in that case). The
event.min_reconfigure_ltime field will either be the most recent
reconfiguration ltime seen by a new scheduler forwarding trigger
events, or -1 otherwise. If the min_reconfigure_ltime of an event is
-1, we retain the old behavior of processing the event regardless.
Only if we have a min_reconfigure_ltime > -1 and it is greater than
the layout state last_reconfigure_event_ltime (which itself may be -1,
and thus less than the min_reconfigure_ltime) do we abort processing
the event.
(The test_config_update test for the Gerrit checks plugin is updated
to include an extra waitUntilSettled since a potential test race was
observed during development.)
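Condensed into pseudocode (attribute names follow the description
above), the pipeline-side check is roughly:

    def should_process(event, layout_state):
        min_ltime = getattr(event, "min_reconfigure_ltime", -1)
        if min_ltime == -1:
            # Event forwarded by an old scheduler: keep the old
            # behavior and process it unconditionally.
            return True
        # layout_state.last_reconfigure_event_ltime is -1 when last
        # written by an old scheduler, so it compares as "very old"
        # and the event is deferred until the reconfiguration
        # catches up.
        return min_ltime <= layout_state.last_reconfigure_event_ltime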
Change-Id: Icb6a7858591ab867e7006c7c80bfffeb582b28ee
Using a badly formatted token resulted in a 500 error from zuul-web.
Return a more precise error message and a 401 error from zuul-web
when this occurs.
Also fix a typo in default messages for some auth-related exceptions.
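The shape of the fix, sketched with PyJWT (zuul-web's actual handler
and exception types differ):

    import jwt

    class AuthorizationError(Exception):
        """Mapped to HTTP 401 by the web layer (hypothetical name)."""

    def decode_token(raw_token, key, algorithms):
        try:
            return jwt.decode(raw_token, key, algorithms=algorithms)
        except jwt.DecodeError as e:
            # Previously this propagated and surfaced as a 500; now
            # the web layer turns it into a 401 with a precise message.
            raise AuthorizationError("Token could not be decoded: %s" % e)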
Change-Id: I4abe013e76ac51c3dad7ccd969ffe79f5cb459e3
The Zuul Ansible callback stream plugin assumed that the Ansible loop
var was always called 'item' in the result_dict. You can override
this name (and it is often necessary to do so to avoid collisions)
with something less generic. In those cases we would get errors like:
b'[WARNING]: Failure using method (v2_runner_item_on_ok) in callback plugin'
b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at'
b"0x7fbecc97c910>): 'item'"
And stream output would not include the info typically logged.
Address this by checking whether ansible_loop_var is in the
result_dict and using that value for the loop var name instead. We
still fall back to 'item' as I'm not sure that ansible_loop_var is
always present.
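The fallback described above, in brief:

    def get_loop_item(result_dict):
        # Ansible records the loop variable's name under
        # 'ansible_loop_var' when loop_var is overridden; fall back
        # to 'item' since the key may not always be present.
        loop_var = result_dict.get("ansible_loop_var", "item")
        return result_dict.get(loop_var)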
Change-Id: I408e6d4af632f8097d63c04cbcb611d843086f6c
We previously only reported statsd timing info for SUCCESS and FAILURE
builds. But it would be useful to get info for POST_FAILURE and
TIMED_OUT builds as well. In particular with POST_FAILURE builds we can
track how long they are running before they fail.
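A sketch of the widened condition (names illustrative):

    RESULTS_WITH_TIMING = {"SUCCESS", "FAILURE",
                           "POST_FAILURE", "TIMED_OUT"}

    def emit_build_timing(statsd, key, build):
        # Previously only SUCCESS and FAILURE were reported.
        if build.result in RESULTS_WITH_TIMING:
            duration_ms = int((build.end_time - build.start_time) * 1000)
            statsd.timing(key, duration_ms)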
Change-Id: I2fe443ac2f69f40b7419e5280a38958d3ac7c080
When exclude-unprotected-branches is in effect and a project doesn't
have any protected branches (e.g. a wrong branch protection rule in
Github, or none at all), the branch cache will contain an empty list.
Since the empty list was so far used to indicate a fetch error, those
projects showed up with a config error in zuul-web ("Will not fetch
project branches as read-only is set").
For the branch cache we need to distinguish three different cases:
1. branch cache miss (no fetch yet)
2. branch cache hit (including an empty list of branches)
3. fetch error
Instead of storing an empty list in case of a fetch error we will store
a sentinel value of `None` in those cases. However, with this change we
can't use `None` as an indicator for a cache miss anymore, so we are now
raising a `LookupError` instead.
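A sketch of the resulting three-state contract (hypothetical class;
the real cache lives in Zuul's connection layer):

    class BranchCache:
        def __init__(self):
            self._branches = {}  # project -> list of branches, or None

        def setBranches(self, project, branches):
            # branches: a list (possibly empty) on success, or None to
            # record a fetch error.
            self._branches[project] = branches

        def getBranches(self, project):
            try:
                # None here means "fetch failed", distinct from both a
                # miss and a valid empty list.
                return self._branches[project]
            except KeyError:
                # True cache miss: no fetch has happened yet.
                raise LookupError("No branch info cached for %s" % project)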
Change-Id: I5b51805f7d0a5d916a46fe89db7b32f14063d7aa