| author    | Lars Wirzenius <lars.wirzenius@codethink.co.uk> | 2014-01-20 14:24:27 +0000 |
| committer | Lars Wirzenius <lars.wirzenius@codethink.co.uk> | 2014-01-29 16:41:42 +0000 |
| commit    | cea3ac2dd484da35fb0e1c7bb10cc29ab7d7db53 (patch) | |
| tree      | ee9a9422e9b53575a4a40accdd844bda288bca6c | |
| parent    | 190e553f25e0ef01f510eab050b9a411c35cbd66 (diff) | |
| download  | lorry-controller-cea3ac2dd484da35fb0e1c7bb10cc29ab7d7db53.tar.gz | |
Add new Lorry Controller requirements and architecture
-rw-r--r-- | ARCH | 393
1 file changed, 393 insertions, 0 deletions
@@ -0,0 +1,393 @@

% Architecture of daemonised Lorry Controller
% Codethink Ltd

Introduction
============

This is an architecture document for Lorry Controller. It is aimed at
those who develop the software.

Lorry is a tool in Baserock for mirroring code from whatever format
upstream provides it into git repositories, converting them to git as
needed. Lorry Controller is a service, running on a Trove, which runs
Lorry against all configured upstreams, including other Troves.

Lorry Controller reads its configuration from a git repository. The
configuration includes specifications of which upstreams to
mirror/convert, including which upstream Troves to mirror. Lorry
Controller instructs Lorry to push to a Trove's git repositories.

Lorry specifications, and upstream Trove specifications, may include
scheduling information, which the Lorry Controller uses to decide when
to execute which specification.

Requirements
============

Some concepts/terminology:

* CONFGIT is the git repository the Lorry Controller instance uses for
  its configuration.
* Lorry specification: which upstream version control repository or
  tarball to mirror.
* Trove specification: which upstream Trove to mirror. This gets
  broken into generated Lorry specifications, one per git repository
  on the upstream Trove. There can be many Trove specifications to
  mirror many Troves.
* job: an instance of executing a Lorry specification. Each job has an
  identifier and associated data (such as the output produced by the
  running job, and whether it succeeded).
* run queue: all the Lorry specifications (from CONFGIT or generated
  from the Trove specifications) a Lorry Controller knows about; this
  is the set of things that get scheduled. The queue has a linear
  order (the first job in the queue is the next job to execute).
* admin: a person who can control or reconfigure a Lorry Controller
  instance.

The original set of requirements, which have been broken down and
detailed below:

* Lorry Controller should be capable of being reconfigured at runtime
  to allow new tasks to be added and old tasks to be removed.
  (RC/ADD, RC/RM, RC/START)
* Lorry Controller should not allow all tasks to become stuck if one
  task is taking a long time. (RR/MULTI)
* Lorry Controller should not allow stuck tasks to remain stuck
  forever. (Configurable timeout? Monitoring of disk usage or CPU to
  see if work is being done?) (RR/TIMEOUT)
* Lorry Controller should be able to be controlled at runtime to allow:
  - querying of the current task set (RQ/SPECS, RQ/SPEC)
  - querying of currently running tasks (RQ/RUNNING)
  - promotion or demotion of a task in the queue (RT/TOP, RT/BOT)
  - supporting of the health monitoring to allow appropriate alerts
    to be sent out (MON/STATIC, MON/DU)

The detailed requirements (prefixed by a unique identifier, which is
used elsewhere to refer to the exact requirement):

* (FW) Lorry Controller can access upstream Troves from behind firewalls.
    * (FW/H) Lorry Controller can access the upstream Trove using HTTP
      or HTTPS only, without using ssh, in order to get a list of
      repositories to mirror. (Lorry itself also needs to be able to
      access the upstream Trove using HTTP or HTTPS only, bypassing
      ssh, but that's a Lorry problem and outside the scope of Lorry
      Controller, so it'll need to be dealt with separately.)
    * (FW/C) Lorry Controller does not verify SSL/TLS certificates
      when accessing the upstream Trove.
* (RC) Lorry Controller can be reconfigured at runtime.
    * (RC/ADD) A new Lorry specification can be added to CONFGIT, and
      a running Lorry Controller will add it to its run queue as soon
      as it is notified of the change.
    * (RC/RM) A Lorry specification can be removed from CONFGIT, and a
      running Lorry Controller will remove it from its run queue as
      soon as it is notified of the change.
    * (RC/START) A Lorry Controller reads CONFGIT when it starts,
      updating its run queue if anything has changed.
* (RT) Lorry Controller can be controlled at runtime.
    * (RT/KILL) An admin can get their Lorry Controller to stop a
      running job.
    * (RT/TOP) An admin can get their Lorry Controller to move a Lorry
      spec to the beginning of the run queue.
    * (RT/BOT) An admin can get their Lorry Controller to move a Lorry
      spec to the end of the run queue.
    * (RT/QSTOP) An admin can stop their Lorry Controller from
      scheduling any new jobs.
    * (RT/QSTART) An admin can get their Lorry Controller to start
      scheduling jobs again.
* (RQ) Lorry Controller can be queried at runtime.
    * (RQ/RUNNING) An admin can list all currently running jobs.
    * (RQ/ALLJOBS) An admin can list all finished jobs that the Lorry
      Controller still remembers.
    * (RQ/SPECS) An admin can list all existing Lorry specifications
      in the run queue.
    * (RQ/SPEC) An admin can query existing Lorry specifications in
      the run queue for any information the Lorry Controller holds for
      them, such as the last time they successfully finished running.
* (RR) Lorry Controller is reasonably robust.
    * (RR/CONF) Lorry Controller ignores any broken Lorry or Trove
      specifications in CONFGIT, and runs without them.
    * (RR/TIMEOUT) Lorry Controller stops a job that runs for too
      long.
    * (RR/MULTI) Lorry Controller can run multiple jobs at the same
      time, and lets the maximum number of such jobs be configured by
      the admin.
    * (RR/DU) Lorry Controller (and the way it runs Lorry) is designed
      to be frugal about disk space usage.
    * (RR/CERT) Lorry Controller tells Lorry to not worry about
      unverifiable SSL/TLS certificates and to continue even if the
      certificate can't be verified or the verification fails.
* (RS) Lorry Controller is reasonably scalable.
    * (RS/SPECS) Lorry Controller works for the number of Lorry
      specifications we have on git.baserock.org (a number that will
      increase, and is currently about 500).
    * (RS/GITS) Lorry Controller works for mirroring git.baserock.org
      (about 500 git repositories).
    * (RS/HW) Lorry Controller may assume that CPU, disk, and
      bandwidth are sufficient, though they should not be needlessly
      wasted.
* (MON) Lorry Controller can be monitored from the outside.
    * (MON/STATIC) Lorry Controller updates, at least once a minute, a
      static HTML file, which shows its current status in sufficient
      detail that an admin knows if things get stuck or break.
    * (MON/DU) Lorry Controller measures, at least, the disk usage of
      each job and Lorry specification.
* (SEC) Lorry Controller is reasonably secure.
    * (SEC/API) Access to the Lorry Controller run-time query and
      control interfaces is managed with iptables (for now).
    * (SEC/CONF) Access to CONFGIT is managed by the git server that
      hosts it. (Gitano on Trove.)

Architecture design
===================

Constraints
-----------

Python is not good at multiple threads (partly due to the global
interpreter lock), and mixing threads and executing subprocesses is
quite tricky to get right in general. Thus, this design avoids using
threads.

Entities
--------

* An admin is a human being who communicates with the Lorry
  Controller using an HTTP API. They might do it using a command line
  client.
* Lorry Controller runs Lorry appropriately, and consists of several
  components described below.
* The local Trove is where Lorry Controller tells its Lorry to push
  the results.
* An upstream Trove is a Trove that Lorry Controller mirrors to the
  local Trove. There can be multiple upstream Troves.

Components of Lorry Controller
------------------------------

* CONFGIT is a git repository for Lorry Controller configuration,
  which the Lorry Controller can access and pull from. Pushing is not
  required and should be prevented by Gitano. CONFGIT is hosted on the
  local Trove.
* STATEDB is persistent storage for the Lorry Controller's state: what
  Lorry specs it knows about (provided by the admin, or generated from
  a Trove spec by Lorry Controller itself), their ordering, jobs that
  have been run or are being run, information about the jobs, etc.
  The idea is that the Lorry Controller process can terminate (cleanly
  or by crashing), be restarted, and continue approximately where it
  was. Also, persistent storage is useful if there are multiple
  processes involved due to how bottle.py and WSGI work. STATEDB is
  implemented using sqlite3. (A rough sketch of what it might contain
  follows this list.)
* WEBAPP is the controlling part of Lorry Controller, which maintains
  the run queue, and provides an HTTP API for monitoring and
  controlling Lorry Controller. WEBAPP is implemented as a bottle.py
  application.
* MINION runs jobs (external processes) on behalf of WEBAPP. It
  communicates with WEBAPP over HTTP: it requests a job to run, starts
  it, and while it waits, sends partial output to the WEBAPP and asks
  the WEBAPP whether the job should be aborted or not. MINION may
  eventually run on a different host than WEBAPP, for added
  scalability.
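The exact layout of STATEDB is an implementation detail that this
document does not fix. As a rough sketch only (the table and column
names below are illustrative assumptions, not the real schema), the
kind of state it needs to hold could look like this:

```python
import sqlite3

def open_statedb(path='/var/lib/lorry-controller/state.db'):  # placeholder path
    '''Open STATEDB, creating the illustrative tables if they do not exist.'''
    db = sqlite3.connect(path)
    db.executescript('''
        -- Lorry specifications currently in the run queue, in queue order.
        CREATE TABLE IF NOT EXISTS specs (
            spec_id    TEXT PRIMARY KEY,  -- e.g. "upstream-trove/some-repo"
            spec_text  TEXT NOT NULL,     -- the Lorry specification itself
            queue_pos  INTEGER NOT NULL,  -- smaller value = earlier in the queue
            generated  INTEGER NOT NULL,  -- 1 if generated from a Trove spec
            last_run   INTEGER            -- time of last successful run, if any
        );

        -- Jobs that are running, or finished but still remembered.
        CREATE TABLE IF NOT EXISTS jobs (
            job_id         INTEGER PRIMARY KEY AUTOINCREMENT,
            spec_id        TEXT NOT NULL REFERENCES specs(spec_id),
            started        INTEGER NOT NULL,
            updated        INTEGER,            -- last job-update from a MINION
            ended          INTEGER,            -- NULL while the job is running
            exit_code      INTEGER,
            kill_requested INTEGER NOT NULL DEFAULT 0,  -- set by /1.0/stop-job
            output         TEXT NOT NULL DEFAULT ''     -- accumulated stdout/stderr
        );

        -- Global flags, e.g. whether the run queue is currently scheduling jobs.
        CREATE TABLE IF NOT EXISTS queue_state (
            running INTEGER NOT NULL
        );
    ''')
    db.commit()
    return db
```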
Components external to Lorry Controller
---------------------------------------

* A web server. This runs the Lorry Controller WEBAPP, using WSGI so
  that multiple instances (processes) can run at once, and thus serve
  many clients.
* bottle.py is a Python microframework for web applications. We
  already have it in Baserock, where we use it for morph-cache-server,
  and it seems to be acceptable.
* systemd is the operating system component that starts services and
  processes.

How the components work together
--------------------------------

* Each WEBAPP instance is started by the web server when a request
  comes in. The web server is started by a systemd unit.
* Each MINION instance is started by a systemd unit. Each MINION
  handles one job at a time, and doesn't block other MINIONs from
  running other jobs. The admins decide how many MINIONs run at once,
  depending on hardware resources and other considerations. (RR/MULTI)
* An admin communicates with the WEBAPP only, by making HTTP requests.
  Each request is either a query (GET) or a command (POST). Queries
  report state as stored in STATEDB. Commands cause the WEBAPP
  instance to do something and alter STATEDB accordingly.
* When an admin makes changes to CONFGIT, and pushes them to the local
  Trove, the Trove's git post-update hook makes an HTTP request to
  WEBAPP to update STATEDB from CONFGIT (see the sketch after this
  list). (RC/ADD, RC/RM)
* Each MINION likewise communicates only with the WEBAPP, using HTTP
  requests. MINION requests a job to run (which triggers WEBAPP's job
  scheduling), and then reports results to the WEBAPP (which causes
  WEBAPP to store them in STATEDB), which tells MINION whether to
  continue running the job or not (RT/KILL). There is no separate
  scheduling process: all scheduling happens when there is a MINION
  available.
* At system start-up, a systemd unit makes an HTTP request to WEBAPP
  to make it refresh STATEDB from CONFGIT. (RC/START)
* A timer unit for systemd makes an HTTP request to get WEBAPP to
  refresh the static HTML status page. (MON/STATIC)

In summary: systemd starts WEBAPP and MINIONs, and whenever a
MINION can do work, it asks WEBAPP for something to do, and reports
back results. Meanwhile, admins can query and control via HTTP
requests to WEBAPP, and WEBAPP instances communicate via STATEDB.
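The notifications from the post-update hook, the start-up unit, and
the timer unit are all plain HTTP requests to WEBAPP, so the caller
can be very small. A minimal sketch, with the base URL as a
placeholder (the actual address and port are deployment-specific):

```python
import urllib.request

def notify_webapp(base_url='http://localhost:12765'):  # placeholder address
    '''Ask WEBAPP to re-read CONFGIT, e.g. from the git post-update hook.'''
    request = urllib.request.Request(base_url + '/1.0/read-configuration',
                                     data=b'')  # a request body makes this a POST
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status  # 200 means WEBAPP accepted the request

if __name__ == '__main__':
    notify_webapp()
```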
The WEBAPP
----------

The WEBAPP provides an HTTP API as described below.

Requests for admins:

* `GET /1.0/status` causes WEBAPP to return a JSON object that
  describes the state of Lorry Controller. This information is meant
  to be programmatically usable and may or may not be the same as in
  the HTML page.
* `GET /1.0/status/disk-free-bytes` causes WEBAPP to return an integer
  JSON value giving the amount of free disk space, in bytes, on the
  filesystem Lorry Controller uses. This information is included in
  the `/1.0/status` query as well, but this is easier to deal with for
  simple monitoring. (MON/DU)
* `POST /1.0/stop-queue` causes WEBAPP to stop scheduling new jobs to
  run. Any currently running jobs are not affected. (RT/QSTOP)
* `POST /1.0/start-queue` causes WEBAPP to start scheduling jobs
  again. (RT/QSTART)

* `GET /1.0/list-queue` causes WEBAPP to return a JSON list of ids of
  all Lorry specifications in the run queue, in the order they are in
  the run queue. (RQ/SPECS)
* `GET /1.0/lorry/<lorryspecid>` causes WEBAPP to return a JSON map
  (dict) with all the information about the specified Lorry
  specification. (RQ/SPEC)
* `POST /1.0/move-to-top/<lorryspecid>`, where `lorryspecid` is the id
  of a Lorry specification in the run queue, causes WEBAPP to move the
  specified spec to the head of the run queue, and store this in
  STATEDB. It doesn't affect currently running jobs. (RT/TOP)
* `POST /1.0/move-to-bottom/<lorryspecid>` is like `/move-to-top`, but
  moves the spec to the end of the run queue. (RT/BOT)

* `GET /1.0/list-running-jobs` causes WEBAPP to return a JSON list of
  ids of all currently running jobs. (RQ/RUNNING)
* `GET /1.0/job/<jobid>` causes WEBAPP to return a JSON map (dict)
  with all the information about the specified job.
* `POST /1.0/stop-job/<jobid>`, where `jobid` is the id of a running
  job, causes WEBAPP to record in STATEDB that the job is to be
  killed. (The actual killing is done when MINION gets around to it.)
  This request returns as soon as the STATEDB change is done.
* `GET /1.0/list-all-jobs` causes WEBAPP to return a JSON list of ids
  of all jobs, running or finished, that it knows about. (RQ/ALLJOBS)
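As an illustration of how such requests could be wired up, here is a
minimal bottle.py sketch of two of the endpoints above. It assumes the
illustrative STATEDB layout sketched earlier and is not the actual
WEBAPP implementation:

```python
import sqlite3
import bottle

app = bottle.Bottle()
STATEDB = '/var/lib/lorry-controller/state.db'  # placeholder path

@app.get('/1.0/list-queue')
def list_queue():
    '''Return the ids of all Lorry specs, in run queue order. (RQ/SPECS)'''
    db = sqlite3.connect(STATEDB)
    try:
        rows = db.execute(
            'SELECT spec_id FROM specs ORDER BY queue_pos').fetchall()
    finally:
        db.close()
    # bottle.py serialises a returned dict as a JSON response
    return {'queue': [spec_id for (spec_id,) in rows]}

@app.post('/1.0/stop-queue')
def stop_queue():
    '''Stop scheduling new jobs; running jobs are not affected. (RT/QSTOP)'''
    db = sqlite3.connect(STATEDB)
    try:
        db.execute('UPDATE queue_state SET running = 0')
        db.commit()
    finally:
        db.close()
    return {'queue-running': False}
```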
Requests for MINION:

* `GET /1.0/give-me-job` is used by MINION to get a new job to run.
  WEBAPP will either return a JSON object describing the job to run,
  or return a status code indicating that there is nothing to do.
  WEBAPP may wait a fairly long time until there is something for
  MINION to do, so MINION doesn't busy-wait. WEBAPP updates STATEDB to
  record that the job is allocated to a MINION.
* `POST /1.0/job-update` is used by MINION to push updates about the
  job it is running to WEBAPP. The body is a JSON object containing
  additional information about the job, such as data from its
  stdout/stderr, and current resource usage. There MUST be at least
  one `job-update` call, which indicates the job has terminated.
  WEBAPP responds with a status indicating whether the job should
  continue to run or be terminated (RR/TIMEOUT). WEBAPP records the
  job as terminated only after MINION tells it the job has been
  terminated. MINION makes the `job-update` request frequently, even
  if the job has produced no output, so that WEBAPP can update a
  timestamp in STATEDB to indicate the job is still alive.

Other requests:

* `POST /1.0/read-configuration` causes WEBAPP to update its copy of
  CONFGIT and update STATEDB based on the new configuration, if it has
  changed. Returns OK/ERROR status. (RC/ADD, RC/RM, RC/START)
* `GET /1.0/status-html` causes WEBAPP to return an HTML page that
  describes the state of Lorry Controller. This also updates an
  on-disk copy of the HTML page, which the web server is configured to
  serve using a normal HTTP request. (MON/STATIC)

The MINION
----------

* Do `GET /1.0/give-me-job` to WEBAPP.
* If it didn't get a job, ask again.
* If it did get a job, fork and exec that.
* In a loop: wait for output from the job (or its termination), for a
  suitably short period of time, with `select` or a similar mechanism,
  and send anything (if anything) you get to WEBAPP. If the WEBAPP
  told us to kill the job, kill it, then send an update to that effect
  to WEBAPP.
* Go back to the top to request a new job.
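A rough Python sketch of this loop follows. The base URL and the JSON
field names (`job_id`, `argv`, `kill`) are illustrative assumptions,
not the actual wire format:

```python
import json
import select
import subprocess
import time
import urllib.error
import urllib.request

BASE_URL = 'http://localhost:12765'  # placeholder WEBAPP address

def http_json(path, payload=None):
    '''GET (when payload is None) or POST JSON to WEBAPP; decode the JSON reply.'''
    data = None if payload is None else json.dumps(payload).encode('utf-8')
    request = urllib.request.Request(BASE_URL + path, data=data,
                                     headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode('utf-8'))

def run_one_job(job):
    '''Fork/exec the job and relay its output to WEBAPP until it ends.'''
    proc = subprocess.Popen(job['argv'], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    while True:
        # Wait briefly for output so WEBAPP gets frequent job-update
        # calls even when the job is quiet.
        readable, _, _ = select.select([proc.stdout], [], [], 1.0)
        output = proc.stdout.read1(4096) if readable else b''
        exited = proc.poll() is not None
        reply = http_json('/1.0/job-update', {
            'job_id': job['job_id'],
            'output': output.decode('utf-8', 'replace'),
            'exit': proc.returncode if exited else None,
        })
        if exited:
            break
        if reply.get('kill'):  # WEBAPP told us to stop the job (RT/KILL)
            proc.terminate()

def main():
    while True:
        try:
            job = http_json('/1.0/give-me-job')
        except urllib.error.HTTPError:  # e.g. a "nothing to do" status code
            job = None
        if not job or 'job_id' not in job:
            time.sleep(10)  # WEBAPP had nothing for us; ask again later
            continue
        run_one_job(job)

if __name__ == '__main__':
    main()
```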
Implementation plan
===================

The following are meant to be a good sequence of steps to implement
the design as described above.

* Make a skeleton Lorry Controller and yarn test suite for it (2d)

  Write a simplistic skeleton of a Lorry Controller WEBAPP and MINION,
  and a few representative tests for them using yarn. The goal here is
  not to have applications that do something real, or tests that test
  something real, but to have a base upon which to start building, and
  especially to make it easy to write tests (including new step
  implementations) in the future.

* Implement /1.0/status and /1.0/status-html in Lorry Controller
  WEBAPP (1d)

  This is the very basic core of the status reporting. Every
  subsequent change will include updating the status reporting as
  necessary.

* Implement /1.0/status/disk-free-bytes in Lorry Controller WEBAPP (1d)

* Implement /1.0/stop-queue and /1.0/start-queue in Lorry Controller
  WEBAPP (1d)

  This should just affect the bit in STATEDB that decides whether we
  are currently running jobs from the run queue or not. This
  implementation step does not need to actually implement running
  jobs.

* Implement /1.0/read-configuration and /1.0/list-queue in Lorry
  Controller WEBAPP (1d)

  This requires implementing parsing of the configuration files in
  CONFGIT, generation of Lorry specs from Trove specs, and
  adding/removing/updating specs in the run queue according to the
  changes. list-queue needs to be implemented so that the results of
  read-configuration can be verified.

* Implement /1.0/lorry/ in Lorry Controller WEBAPP (1d)

* Implement /1.0/move-to-top/ and /1.0/move-to-bottom/ in Lorry
  Controller WEBAPP (1d)

* Implement running jobs in Lorry Controller WEBAPP (1d)

  Requests /1.0/give-me-job, /1.0/job-update,
  /1.0/list-running-jobs, /1.0/stop-job/. These do not actually run
  anything, of course, since that is a job for MINION, but they
  change the state of the job in STATEDB, and that's what needs to
  be implemented and tested.

* Implement /1.0/list-all-jobs and /1.0/job/ in Lorry Controller
  WEBAPP (1d)

* Implement MINION in Lorry Controller (1d)

* Add new Lorry Controller to Trove (2d)

  Replace the old Lorry Controller with the new one, and add any
  systemd units needed to make it functional. Create at least a very
  basic sanity check, using yarn, to verify that a deployed, running
  system has a working Lorry Controller (a sketch of the kind of check
  intended follows at the end of this plan).

* Review Lorry Controller situation and decide on further work

  No implementation plan survives contact with reality, and thus
  things will need to be reviewed at the end, in case something has
  been forgotten or requirements have changed.
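The basic sanity check mentioned above is to be written as a yarn
scenario. As a rough illustration only, the kind of assertion it would
make could be sketched in Python like this (the URL is a placeholder):

```python
import json
import urllib.request

def check_webapp_alive(base_url='http://localhost:12765'):  # placeholder URL
    '''Fail loudly unless WEBAPP answers /1.0/status with valid JSON.'''
    with urllib.request.urlopen(base_url + '/1.0/status', timeout=30) as resp:
        assert resp.status == 200, 'unexpected HTTP status: %d' % resp.status
        status = json.loads(resp.read().decode('utf-8'))
    assert isinstance(status, dict), 'status is not a JSON object'
    return status

if __name__ == '__main__':
    check_webapp_alive()
    print('Lorry Controller WEBAPP responds to /1.0/status')
```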