| author    | Lars Wirzenius <lars.wirzenius@codethink.co.uk> | 2014-01-20 14:24:27 +0000 |
| committer | Lars Wirzenius <lars.wirzenius@codethink.co.uk> | 2014-01-29 16:41:42 +0000 |
| commit    | cea3ac2dd484da35fb0e1c7bb10cc29ab7d7db53 (patch) | |
| tree      | ee9a9422e9b53575a4a40accdd844bda288bca6c | |
| parent    | 190e553f25e0ef01f510eab050b9a411c35cbd66 (diff) | |
| download  | lorry-controller-cea3ac2dd484da35fb0e1c7bb10cc29ab7d7db53.tar.gz | |
Add new Lorry Controller requirements and architecture
-rw-r--r-- | ARCH | 393
1 file changed, 393 insertions, 0 deletions
@@ -0,0 +1,393 @@

% Architecture of daemonised Lorry Controller
% Codethink Ltd

Introduction
============

This is an architecture document for Lorry Controller. It is aimed at
those who develop the software.

Lorry is a tool in Baserock for mirroring code from whatever format
upstream provides it into git repositories, converting them to git as
needed. Lorry Controller is a service, running on a Trove, which runs
Lorry against all configured upstreams, including other Troves.

Lorry Controller reads its configuration from a git repository. The
configuration includes specifications of which upstreams to
mirror/convert, including which upstream Troves to mirror. Lorry
Controller instructs Lorry to push to a Trove's git repositories.

Lorry specifications, and upstream Trove specifications, may include
scheduling information, which the Lorry Controller uses to decide when
to execute which specification.

Requirements
============

Some concepts/terminology:

* CONFGIT is the git repository the Lorry Controller instance uses for
  its configuration.
* Lorry specification: which upstream version control repository or
  tarball to mirror.
* Trove specification: which upstream Trove to mirror. This gets
  broken into generated Lorry specifications, one per git repository
  on the upstream Trove. There can be many Trove specifications to
  mirror many Troves.
* job: an instance of executing a Lorry specification. Each job has an
  identifier and associated data (such as the output produced by the
  running job, and whether it succeeded).
* run queue: all the Lorry specifications (from CONFGIT or generated
  from the Trove specifications) a Lorry Controller knows about; this
  is the set of things that get scheduled. The queue has a linear
  order (the first job in the queue is the next job to execute).
* admin: a person who can control or reconfigure a Lorry Controller
  instance.

The original set of requirements, which have been broken down and
detailed below:

* Lorry Controller should be capable of being reconfigured at runtime
  to allow new tasks to be added and old tasks to be removed.
  (RC/ADD, RC/RM, RC/START)
* Lorry Controller should not allow all tasks to become stuck if one
  task is taking a long time. (RR/MULTI)
* Lorry Controller should not allow stuck tasks to remain stuck
  forever. (Configurable timeout? Monitoring of disk usage or CPU to
  see if work is being done?) (RR/TIMEOUT)
* Lorry Controller should be able to be controlled at runtime to allow:
  - querying of the current task set (RQ/SPECS, RQ/SPEC)
  - querying of currently running tasks (RQ/RUNNING)
  - promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
  - supporting of the health monitoring to allow appropriate alerts
    to be sent out (MON/STATIC, MON/DU)

The detailed requirements (prefixed by a unique identifier, which is
used elsewhere to refer to the exact requirement):

* (FW) Lorry Controller can access upstream Troves from behind firewalls.
    * (FW/H) Lorry Controller can access the upstream Trove using HTTP
      or HTTPS only, without using ssh, in order to get a list of
      repositories to mirror. (Lorry itself also needs to be able to
      access the upstream Trove using HTTP or HTTPS only, bypassing
      ssh, but that's a Lorry problem and outside the scope of Lorry
      Controller, so it'll need to be dealt with separately.)
    * (FW/C) Lorry Controller does not verify SSL/TLS certificates
      when accessing the upstream Trove.
* (RC) Lorry Controller can be reconfigured at runtime.
    * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
      a running Lorry Controller will add it to its run queue as soon
      as it is notified of the change.
    * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
      running Lorry Controller will remove it from its run queue as
      soon as it is notified of the change.
    * (RC/START) A Lorry Controller reads CONFGIT when it starts,
      updating its run queue if anything has changed.
* (RT) Lorry Controller can be controlled at runtime.
    * (RT/KILL) An admin can get their Lorry Controller to stop a
      running job.
    * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
      spec to the beginning of the run queue.
    * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
      spec to the end of the run queue.
    * (RT/QSTOP) An admin can stop their Lorry Controller from
      scheduling any new jobs.
    * (RT/QSTART) An admin can get their Lorry Controller to start
      scheduling jobs again.
* (RQ) Lorry Controller can be queried at runtime.
    * (RQ/RUNNING) An admin can list all currently running jobs.
    * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
      Controller still remembers.
    * (RQ/SPECS) An admin can list all existing Lorry specifications
      in the run queue.
    * (RQ/SPEC) An admin can query existing Lorry specifications in
      the run queue for any information the Lorry Controller holds for
      them, such as the last time they successfully finished running.
* (RR) Lorry Controller is reasonably robust.
    * (RR/CONF) Lorry Controller ignores any broken Lorry or Trove
      specifications in CONFGIT, and runs without them.
    * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
      long.
    * (RR/MULTI) Lorry Controller can run multiple jobs at the same
      time, and lets the maximum number of such jobs be configured by
      the admin.
    * (RR/DU) Lorry Controller (and the way it runs Lorry) is designed
      to be frugal about disk space usage.
    * (RR/CERT) Lorry Controller tells Lorry to not worry about
      unverifiable SSL/TLS certificates and to continue even if the
      certificate can't be verified or the verification fails.
* (RS) Lorry Controller is reasonably scalable.
    * (RS/SPECS) Lorry Controller works for the number of Lorry
      specifications we have on git.baserock.org (a number that will
      increase, and is currently about 500).
    * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
      (about 500 git repositories).
    * (RS/HW) Lorry Controller may assume that CPU, disk, and
      bandwidth are sufficient, though they should not be needlessly
      wasted.
* (MON) Lorry Controller can be monitored from the outside.
    * (MON/STATIC) Lorry Controller updates, at least once a minute, a
      static HTML file, which shows its current status in sufficient
      detail that an admin knows if things get stuck or break.
    * (MON/DU) Lorry Controller measures, at least, the disk usage of
      each job and Lorry specification.
* (SEC) Lorry Controller is reasonably secure.
    * (SEC/API) Access to the Lorry Controller run-time query and
      control interfaces is managed with iptables (for now).
    * (SEC/CONF) Access to CONFGIT is managed by the git server that
      hosts it. (Gitano on Trove.)

Architecture design
===================

Constraints
-----------

Python is not good at multiple threads (partly due to the global
interpreter lock), and mixing threads and executing subprocesses is
quite tricky to get right in general. Thus, this design avoids using
threads.

Entities
--------

* An admin is a human being who communicates with the Lorry
  Controller using an HTTP API. They might do it using a command line
  client.
* Lorry Controller runs Lorry appropriately, and consists of several
  components described below.
* The local Trove is where Lorry Controller tells its Lorry to push
  the results.
* An upstream Trove is a Trove that Lorry Controller mirrors to the
  local Trove. There can be multiple upstream Troves.

Components of Lorry Controller
------------------------------

* CONFGIT is a git repository for Lorry Controller configuration,
  which the Lorry Controller can access and pull from. Pushing is not
  required and should be prevented by Gitano. CONFGIT is hosted on the
  local Trove.
* STATEDB is persistent storage for the Lorry Controller's state: what
  Lorry specs it knows about (provided by the admin, or generated from
  a Trove spec by Lorry Controller itself), their ordering, jobs that
  have been run or are being run, information about the jobs, etc.
  The idea is that the Lorry Controller process can terminate (cleanly
  or by crashing), be restarted, and continue approximately where it
  was. Also, persistent storage is useful if there are multiple
  processes involved due to how bottle.py and WSGI work. STATEDB is
  implemented using sqlite3. (A rough sketch of what it might contain
  follows this list.)
* WEBAPP is the controlling part of Lorry Controller, which maintains
  the run queue, and provides an HTTP API for monitoring and
  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
  application.
* MINION runs jobs (external processes) on behalf of WEBAPP. It
  communicates with WEBAPP over HTTP: it requests a job to run, starts
  it, and while it waits, sends partial output to the WEBAPP and asks
  the WEBAPP whether the job should be aborted or not. MINION may
  eventually run on a different host than WEBAPP, for added
  scalability.
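The exact layout of STATEDB is an implementation detail that this
document does not fix. As a rough sketch only (the table and column
names below are illustrative assumptions, not the real schema), the
kind of state it needs to hold could look like this:

```python
import sqlite3

def open_statedb(path='/var/lib/lorry-controller/state.db'):  # placeholder path
    '''Open STATEDB, creating the illustrative tables if they do not exist.'''
    db = sqlite3.connect(path)
    db.executescript('''
        -- Lorry specifications currently in the run queue, in queue order.
        CREATE TABLE IF NOT EXISTS specs (
            spec_id    TEXT PRIMARY KEY,  -- e.g. "upstream-trove/some-repo"
            spec_text  TEXT NOT NULL,     -- the Lorry specification itself
            queue_pos  INTEGER NOT NULL,  -- smaller value = earlier in the queue
            generated  INTEGER NOT NULL,  -- 1 if generated from a Trove spec
            last_run   INTEGER            -- time of last successful run, if any
        );

        -- Jobs that are running, or finished but still remembered.
        CREATE TABLE IF NOT EXISTS jobs (
            job_id         INTEGER PRIMARY KEY AUTOINCREMENT,
            spec_id        TEXT NOT NULL REFERENCES specs(spec_id),
            started        INTEGER NOT NULL,
            updated        INTEGER,            -- last job-update from a MINION
            ended          INTEGER,            -- NULL while the job is running
            exit_code      INTEGER,
            kill_requested INTEGER NOT NULL DEFAULT 0,  -- set by /1.0/stop-job
            output         TEXT NOT NULL DEFAULT ''     -- accumulated stdout/stderr
        );

        -- Global flags, e.g. whether the run queue is currently scheduling jobs.
        CREATE TABLE IF NOT EXISTS queue_state (
            running INTEGER NOT NULL
        );
    ''')
    db.commit()
    return db
```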
Components external to Lorry Controller
---------------------------------------

* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
  that multiple instances (processes) can run at once, and thus serve
  many clients.
* bottle.py is a Python microframework for web applications. We
  already have it in Baserock, where we use it for morph-cache-server,
  and it seems to be acceptable.
* systemd is the operating system component that starts services and
  processes.

How the components work together
--------------------------------

* Each WEBAPP instance is started by the web server when a request
  comes in. The web server is started by a systemd unit.
* Each MINION instance is started by a systemd unit. Each MINION
  handles one job at a time, and doesn't block other MINIONs from
  running other jobs. The admins decide how many MINIONs run at once,
  depending on hardware resources and other considerations. (RR/MULTI)
* An admin communicates with the WEBAPP only, by making HTTP requests.
  Each request is either a query (GET) or a command (POST). Queries
  report state as stored in STATEDB. Commands cause the WEBAPP
  instance to do something and alter STATEDB accordingly.
* When an admin makes changes to CONFGIT, and pushes them to the local
  Trove, the Trove's git post-update hook makes an HTTP request to
  WEBAPP to update STATEDB from CONFGIT (see the sketch after this
  list). (RC/ADD, RC/RM)
* Each MINION likewise communicates only with the WEBAPP, using HTTP
  requests. MINION requests a job to run (which triggers WEBAPP's job
  scheduling), and then reports results to the WEBAPP (which causes
  WEBAPP to store them in STATEDB), which tells MINION whether to
  continue running the job or not (RT/KILL). There is no separate
  scheduling process: all scheduling happens when there is a MINION
  available.
* At system start-up, a systemd unit makes an HTTP request to WEBAPP
  to make it refresh STATEDB from CONFGIT. (RC/START)
* A timer unit for systemd makes an HTTP request to get WEBAPP to
  refresh the static HTML status page. (MON/STATIC)

In summary: systemd starts WEBAPP and MINIONs, and whenever a
MINION can do work, it asks WEBAPP for something to do, and reports
back results. Meanwhile, admins can query and control via HTTP
requests to WEBAPP, and WEBAPP instances communicate via STATEDB.
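The notifications from the post-update hook, the start-up unit, and
the timer unit are all plain HTTP requests to WEBAPP, so the caller
can be very small. A minimal sketch, with the base URL as a
placeholder (the actual address and port are deployment-specific):

```python
import urllib.request

def notify_webapp(base_url='http://localhost:12765'):  # placeholder address
    '''Ask WEBAPP to re-read CONFGIT, e.g. from the git post-update hook.'''
    request = urllib.request.Request(base_url + '/1.0/read-configuration',
                                     data=b'')  # a request body makes this a POST
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status  # 200 means WEBAPP accepted the request

if __name__ == '__main__':
    notify_webapp()
```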
The WEBAPP
----------

The WEBAPP provides an HTTP API as described below.

Requests for admins:

* `GET /1.0/status` causes WEBAPP to return a JSON object that
  describes the state of Lorry Controller. This information is meant
  to be programmatically usable and may or may not be the same as in
  the HTML page.
* `GET /1.0/status/disk-free-bytes` causes WEBAPP to return an integer
  JSON value giving the amount of free disk space, in bytes, on the
  filesystem Lorry Controller uses. This information is included in
  the `/1.0/status` query as well, but this is easier to deal with for
  simple monitoring. (MON/DU)
* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
  run. Any currently running jobs are not affected. (RT/QSTOP)
* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
  again. (RT/QSTART)

* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
  all Lorry specifications in the run queue, in the order they are in
  the run queue. (RQ/SPECS)
* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
  (dict) with all the information about the specified Lorry
  specification. (RQ/SPEC)
* `POST /1.0/move-to-top/<lorryspecid>`, where `lorryspecid` is the id
  of a Lorry specification in the run queue, causes WEBAPP to move the
  specified spec to the head of the run queue, and store this in
  STATEDB. It doesn't affect currently running jobs. (RT/TOP)
* `POST /1.0/move-to-bottom/<lorryspecid>` is like `/move-to-top`, but
  moves the spec to the end of the run queue. (RT/BOT)

* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
  ids of all currently running jobs. (RQ/RUNNING)
* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
  with all the information about the specified job.
* `POST /1.0/stop-job/<jobid>`, where `jobid` is the id of a running
  job, causes WEBAPP to record in STATEDB that the job is to be
  killed. (The actual killing is done when MINION gets around to it.)
  This request returns as soon as the STATEDB change is done.
* `GET /1.0/list-all-jobs` causes WEBAPP to return a JSON list of ids
  of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)
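As an illustration of how such requests could be wired up, here is a
minimal bottle.py sketch of two of the endpoints above. It assumes the
illustrative STATEDB layout sketched earlier and is not the actual
WEBAPP implementation:

```python
import sqlite3
import bottle

app = bottle.Bottle()
STATEDB = '/var/lib/lorry-controller/state.db'  # placeholder path

@app.get('/1.0/list-queue')
def list_queue():
    '''Return the ids of all Lorry specs, in run queue order. (RQ/SPECS)'''
    db = sqlite3.connect(STATEDB)
    try:
        rows = db.execute(
            'SELECT spec_id FROM specs ORDER BY queue_pos').fetchall()
    finally:
        db.close()
    # bottle.py serialises a returned dict as a JSON response
    return {'queue': [spec_id for (spec_id,) in rows]}

@app.post('/1.0/stop-queue')
def stop_queue():
    '''Stop scheduling new jobs; running jobs are not affected. (RT/QSTOP)'''
    db = sqlite3.connect(STATEDB)
    try:
        db.execute('UPDATE queue_state SET running = 0')
        db.commit()
    finally:
        db.close()
    return {'queue-running': False}
```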
Requests for MINION:

* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
  WEBAPP will either return a JSON object describing the job to run,
  or return a status code indicating that there is nothing to do.
  WEBAPP may wait a fairly long time until there is something for
  MINION to do, so MINION doesn't busy-wait. WEBAPP updates STATEDB to
  record that the job is allocated to a MINION.
* `POST /1.0/job-update` is used by MINION to push updates about the
  job it is running to WEBAPP. The body is a JSON object containing
  additional information about the job, such as data from its
  stdout/stderr, and current resource usage. There MUST be at least
  one `job-update` call, which indicates the job has terminated.
  WEBAPP responds with a status indicating whether the job should
  continue to run or be terminated (RR/TIMEOUT). WEBAPP records the
  job as terminated only after MINION tells it the job has been
  terminated. MINION makes the `job-update` request frequently, even
  if the job has produced no output, so that WEBAPP can update a
  timestamp in STATEDB to indicate the job is still alive.

Other requests:

* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
  CONFGIT and update STATEDB based on the new configuration, if it has
  changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)
* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
  describes the state of Lorry Controller. This also updates an
  on-disk copy of the HTML page, which the web server is configured to
  serve using a normal HTTP request. (MON/STATIC)

The MINION
----------

* Do `GET /1.0/give-me-job` to WEBAPP.
* If it didn't get a job, ask again.
* If it did get a job, fork and exec that.
* In a loop: wait for output from the job (or its termination), for a
  suitably short period of time, with `select` or a similar mechanism,
  and send anything (if anything) you get to WEBAPP. If the WEBAPP
  told us to kill the job, kill it, then send an update to that effect
  to WEBAPP.
* Go back to the top to request a new job.
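A rough Python sketch of this loop follows. The base URL and the JSON
field names (`job_id`, `argv`, `kill`) are illustrative assumptions,
not the actual wire format:

```python
import json
import select
import subprocess
import time
import urllib.error
import urllib.request

BASE_URL = 'http://localhost:12765'  # placeholder WEBAPP address

def http_json(path, payload=None):
    '''GET (when payload is None) or POST JSON to WEBAPP; decode the JSON reply.'''
    data = None if payload is None else json.dumps(payload).encode('utf-8')
    request = urllib.request.Request(BASE_URL + path, data=data,
                                     headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode('utf-8'))

def run_one_job(job):
    '''Fork/exec the job and relay its output to WEBAPP until it ends.'''
    proc = subprocess.Popen(job['argv'], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    while True:
        # Wait briefly for output so WEBAPP gets frequent job-update
        # calls even when the job is quiet.
        readable, _, _ = select.select([proc.stdout], [], [], 1.0)
        output = proc.stdout.read1(4096) if readable else b''
        exited = proc.poll() is not None
        reply = http_json('/1.0/job-update', {
            'job_id': job['job_id'],
            'output': output.decode('utf-8', 'replace'),
            'exit': proc.returncode if exited else None,
        })
        if exited:
            break
        if reply.get('kill'):  # WEBAPP told us to stop the job (RT/KILL)
            proc.terminate()

def main():
    while True:
        try:
            job = http_json('/1.0/give-me-job')
        except urllib.error.HTTPError:  # e.g. a "nothing to do" status code
            job = None
        if not job or 'job_id' not in job:
            time.sleep(10)  # WEBAPP had nothing for us; ask again later
            continue
        run_one_job(job)

if __name__ == '__main__':
    main()
```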
Implementation plan
===================

The following are meant to be a good sequence of steps to implement
the design as described above.

* Make a skeleton Lorry Controller and yarn test suite for it (2d)

  Write a simplistic skeleton of a Lorry Controller WEBAPP and MINION,
  and a few representative tests for them using yarn. The goal here is
  not to have applications that do something real, or tests that test
  something real, but to have a base upon which to start building, and
  especially to make it easy to write tests (including new step
  implementations) in the future.

* Implement /1.0/status and /1.0/status-html in Lorry Controller
  WEBAPP (1d)

  This is the very basic core of the status reporting. Every
  subsequent change will include updating the status reporting as
  necessary.

* Implement /1.0/status/disk-free-bytes in Lorry Controller WEBAPP (1d)

* Implement /1.0/stop-queue and /1.0/start-queue in Lorry Controller
  WEBAPP (1d)

  This should just affect the bit in STATEDB that decides whether we
  are currently running jobs from the run queue or not. This
  implementation step does not need to actually implement running
  jobs.

* Implement /1.0/read-configuration and /1.0/list-queue in Lorry
  Controller WEBAPP (1d)

  This requires implementing parsing of the configuration files in
  CONFGIT, generation of Lorry specs from Trove specs, and
  adding/removing/updating specs in the run queue according to the
  changes. list-queue needs to be implemented so that the results of
  read-configuration can be verified.

* Implement /1.0/lorry/ in Lorry Controller WEBAPP (1d)

* Implement /1.0/move-to-top/ and /1.0/move-to-bottom/ in Lorry
  Controller WEBAPP (1d)

* Implement running jobs in Lorry Controller WEBAPP (1d)

  Requests /1.0/give-me-job, /1.0/job-update,
  /1.0/list-running-jobs, /1.0/stop-job/. These do not actually run
  anything, of course, since that is a job for MINION, but they
  change the state of the job in STATEDB, and that's what needs to
  be implemented and tested.

* Implement /1.0/list-all-jobs and /1.0/job/ in Lorry Controller
  WEBAPP (1d)

* Implement MINION in Lorry Controller (1d)

* Add new Lorry Controller to Trove (2d)

  Replace the old Lorry Controller with the new one, and add any
  systemd units needed to make it functional. Create at least a very
  basic sanity check, using yarn, to verify that a deployed, running
  system has a working Lorry Controller (a sketch of the kind of check
  intended follows at the end of this plan).

* Review Lorry Controller situation and decide on further work

  No implementation plan survives contact with reality, and thus
  things will need to be reviewed at the end, in case something has
  been forgotten or requirements have changed.
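The basic sanity check mentioned above is to be written as a yarn
scenario. As a rough illustration only, the kind of assertion it would
make could be sketched in Python like this (the URL is a placeholder):

```python
import json
import urllib.request

def check_webapp_alive(base_url='http://localhost:12765'):  # placeholder URL
    '''Fail loudly unless WEBAPP answers /1.0/status with valid JSON.'''
    with urllib.request.urlopen(base_url + '/1.0/status', timeout=30) as resp:
        assert resp.status == 200, 'unexpected HTTP status: %d' % resp.status
        status = json.loads(resp.read().decode('utf-8'))
    assert isinstance(status, dict), 'status is not a JSON object'
    return status

if __name__ == '__main__':
    check_webapp_alive()
    print('Lorry Controller WEBAPP responds to /1.0/status')
```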