Commit 69d6d1a7 by Sam Thursfield <sam.thursfield@codethink.co.uk>, 2017-10-13: "Rename README so it gets displayed in GitLab"
Baserock project public infrastructure
======================================

This repository contains the definitions for all of the Baserock Project's
infrastructure. This includes every service used by the project, except for
the mailing lists (hosted by [Pepperfish]), the wiki (hosted by [Branchable])
and the GitLab CI runners (set up by Javier Jardón).

Some of these systems are Baserock systems. This has proved an obstacle to
keeping them up to date with security updates, and we plan to switch everything
to run on mainstream distros in future.

All files necessary for (re)deploying the systems should be contained in this
Git repository. Private tokens should be encrypted using
[ansible-vault](https://www.ansible.com/blog/2014/02/19/ansible-vault).

[Pepperfish]: http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo
[Branchable]: http://www.branchable.com/


General notes
-------------

When instantiating a machine that will be public, remember to give shell
access to everyone on the ops team. This can be done using a post-creation
customisation script that injects all of their SSH keys. The SSH public
keys of the Baserock Operations team are collected in
`baserock-ops-team.cloud-config`.

Ensure SSH password login is disabled in all systems you deploy! See
<https://testbit.eu/is-ssh-insecure/> for why. The Ansible playbook
`admin/sshd_config.yaml` can ensure that all systems have password login
disabled.


Administration
--------------

You can use [Ansible] to automate tasks on the baserock.org systems.
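The `-i hosts` option used throughout points Ansible at the inventory file
checked into this repository. Purely for illustration, an INI-style inventory
that groups machines by distro might look like the following. The group members
shown here are invented placeholders, not the real contents of `./hosts`:

```ini
# Hypothetical inventory sketch -- the real ./hosts file in this repo
# is the authority. Groups let you target all Fedora or Ubuntu machines
# at once, as the ad-hoc commands below do.
[fedora]
frontend-haproxy

[ubuntu]
database

[baserock]
git.baserock.org
```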

To run a playbook:

    ansible-playbook -i hosts $PLAYBOOK.yaml

To run an ad-hoc command (upgrading, for example):

    ansible -i hosts fedora -m command -a 'sudo dnf update -y'
    ansible -i hosts ubuntu -m command -a 'sudo apt-get update -y'

[Ansible]: http://www.ansible.com


Security updates
----------------

Fedora security updates can be watched here:
<https://bodhi.fedoraproject.org/updates/?type=security>. Ubuntu issues
security advisories here: <http://www.ubuntu.com/usn/>.
The Baserock reference systems don't have such a service. The [LWN
Alerts](https://lwn.net/Alerts/) service gives you info from all major Linux
distributions.

If there is a vulnerability discovered in some software we use, we might need
to upgrade all of the systems that use that component at baserock.org.

Bear in mind that some systems are not accessible except via the
frontend-haproxy system. Those are usually less at risk than those that face
the web directly. Also bear in mind that we use OpenStack security groups to
block most ports.

### Prepare the patch for Baserock systems

First, you need to update the Baserock reference system definitions with a
fixed version of the component. Build that and test that it works. Submit
the patch to gerrit.baserock.org, get it reviewed, and merged. Then
cherry-pick that patch into infrastructure.git.

This is a long-winded process. There are shortcuts you can take, although
someone still has to complete the process described above at some point.

* You can modify the infrastructure.git definitions directly and start
  rebuilding the infrastructure systems right away, to avoid waiting for the
  Baserock patch review process.

* You can add the new version of the component as a stratum that sits above
  everything else in the build graph. For example, to do a 'hot-fix' for GLIBC,
  add a 'glibc-hotfix' stratum containing the new version to all of the systems
  you need to upgrade.
  Rebuilding them will be quick because you just need to
  build GLIBC, and can reuse the cached artifacts for everything else. The new
  GLIBC will overwrite the one that is lower down in the build graph in the
  resulting filesystem. Of course, if the new version of the component is not
  ABI compatible then this approach will break things. Be careful.

### Check the inventory

Make sure the Ansible inventory file is up to date, and that you have access to
all machines. Run this:

    ansible \* -i ./hosts -m ping

You should see lots of this sort of output:

    mail | success >> {
        "changed": false,
        "ping": "pong"
    }

    frontend-haproxy | success >> {
        "changed": false,
        "ping": "pong"
    }

You may find some host key errors like this:

    paste | FAILED => SSH Error: Host key verification failed.
    It is sometimes useful to re-run the command using -vvvv, which prints SSH
    debug output to help diagnose the issue.

If you have a host key problem, that could be because somebody redeployed
the system since the last time you connected to it with SSH, and did not
transfer the SSH host keys from the old system to the new one. Check with
other ops team members about this. If you are sure the new host keys can
be trusted, you can remove the old ones with `ssh-keygen -R 192.168.x.y`,
where 192.168.x.y is the internal IP address of the machine. You'll then be
prompted to accept the new ones when you run Ansible again.

Once all machines respond to the Ansible 'ping' module, double-check that
every machine you can see in the OpenStack Horizon dashboard has a
corresponding entry in the 'hosts' file, to ensure the next steps operate
on all of the machines.

### Check and upgrade Fedora systems

> Bear in mind that only the latest 2 versions of Fedora receive security
> updates. If any machines are not running the latest version of Fedora,
> you should redeploy them with the latest version.
> See the instructions below
> on how to (re)deploy each machine. You should deploy a new instance of a
> system and test it *before* terminating the existing instance. Switching
> over should be a matter of changing either its floating IP address or the
> IP address in `baserock_frontend/haproxy.cfg`.

You can find out what version of Fedora is in use with this command:

    ansible fedora -i hosts -m setup -a 'filter=ansible_distribution_version'

Check what version of a package is in use with this command (using GLIBC as an
example). You can compare this against Fedora package changelogs at
[Koji](https://koji.fedoraproject.org).

    ansible fedora -i hosts -m command -a 'rpm -q glibc --qf "%{VERSION}.%{RELEASE}\n"'

You can see what updates are available using the `dnf updateinfo info` command:

    ansible -i hosts fedora -m command -a 'dnf updateinfo info glibc'

You can then use `dnf upgrade -y` to install all available updates, or give the
name of a package to update just that package. Be aware that DNF is quite slow,
and if you forget to pass `-y` then it will hang forever waiting for input.

You will then need to restart services. The `dnf needs-restarting` command
might be useful, but rebooting the whole machine is probably easiest.

### Check and upgrade Ubuntu systems

> Bear in mind that only the latest release and the latest LTS release of
> Ubuntu receive security updates.

Find out what version of Ubuntu is in use with this command:

    ansible ubuntu -i hosts -m setup -a 'filter=ansible_distribution_version'

Check what version of a given package is in use with this command (using GLIBC
as an example).

    ansible -i hosts ubuntu -m command -a 'dpkg-query --show libc6'

Check for available updates, and what they contain:

    ansible -i hosts ubuntu -m command -a 'apt-cache policy libc6'
    ansible -i hosts ubuntu -m command -a 'apt-get changelog libc6' | head -n 20

You can update all the packages with:

    ansible -i hosts ubuntu -m command -a 'apt-get upgrade -y' --sudo

You will then need to restart services. Rebooting the machine is probably
easiest.

### Check and upgrade Baserock systems

Check what version of a given package is in use with this command (using GLIBC
as an example). Ideally Baserock reference systems would have a query tool for
this info, but for now we have to look at the JSON metadata file directly.

    ansible -i hosts baserock -m command \
        -a "grep '\"\(sha1\|repo\|original_ref\)\":' /baserock/glibc-bins.meta"

The default Baserock machine layout uses Btrfs for the root filesystem. Filling
up a Btrfs disk results in unpredictable behaviour. Before deploying any system
upgrades, check that each machine has enough free disk space to hold an
upgrade. Allow for at least 4GB of free space, to be safe.

    ansible -i hosts baserock -m command -a "df -h /"

A good way to free up space is to remove old system versions using the
`system-version-manager` tool. There may be other things that are
unnecessarily taking up space in the root file system, too.

Ideally, at this point you've prepared a patch for definitions.git to fix
the security issue in the Baserock reference systems, and it has been merged.
In that case, pull from the reference systems into infrastructure.git, using
`git pull git://git.baserock.org/baserock/baserock/definitions master`.

If the necessary patch isn't merged in definitions.git, it's still best to
merge 'master' from there into infrastructure.git, and then cherry-pick the
patch from Gerrit on top.

You then need to build and upgrade the systems one by one.
Do this from the
'devel-system' machine in the same OpenStack cloud that hosts the
infrastructure. Baserock upgrades currently involve transferring the whole
multi-gigabyte system image, so you *must* have a fast connection to the
target.

Each Baserock system has its own deployment instructions. Each should have
a deployment .morph file that you can pass to `morph upgrade`. For example,
to deploy an upgrade to git.baserock.org:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph gbo.VERSION_LABEL=2016-02-19

Once this completes successfully, rebooting the system should bring up the
new version. You may want to check that the new `/etc` is correct; you can
do this inside the machine by mounting `/dev/vda` and looking in
`systems/$VERSION_LABEL/run/etc`.

If you want to revert the upgrade, use `system-version-manager list` and
`system-version-manager set-default <old-version>` to set the previous
version as the default, then reboot. If the system doesn't boot at all,
reboot it while you have the graphical console open in Horizon, and you
should be able to press `ESC` fast enough to get the boot menu open. This
will allow booting into previous versions of the system. (You shouldn't
have any problems, though, since of course we test everything regularly.)

Beware of <https://storyboard.baserock.org/#!/story/77>.

For cache.baserock.org, you can reuse the deployment instructions for
git.baserock.org. Try:

    morph upgrade --local-changes=ignore \
        baserock_trove/baserock_trove.morph \
        gbo.update-location=root@cache.baserock.org \
        gbo.VERSION_LABEL=2016-02-19

Deployment to OpenStack
-----------------------

The intention is that all of the systems defined here are deployed to an
OpenStack cloud. The instructions here hardcode some details about the specific
tenancy at [DataCentred](http://www.datacentred.io) that the Baserock project
uses. It should be easy to adapt them for other OpenStack hosts, though.
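The `nova`, `glance` and `neutron` commands used in the sections that follow
all authenticate through environment variables, conventionally collected in an
'openrc' file that you source before working. As a sketch only, with every
value a placeholder rather than a real credential:

```shell
# Hypothetical openrc sketch: source this file before running the
# OpenStack CLI tools. All values below are placeholders.
export OS_AUTH_URL=https://keystone.example.com:5000/v2.0
export OS_TENANT_NAME=baserock
export OS_USERNAME=ops-user
export OS_PASSWORD=not-the-real-password
```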

### Credentials

The instructions below assume you have the following environment variables set
according to the OpenStack host you are deploying to:

 - `OS_AUTH_URL`
 - `OS_TENANT_NAME`
 - `OS_USERNAME`
 - `OS_PASSWORD`

When using `morph deploy` to deploy to OpenStack, you will also need to set
these variables, because currently Morph does not honour the standard ones.
See <https://storyboard.baserock.org/#!/story/35>.

 - `OPENSTACK_USER=$OS_USERNAME`
 - `OPENSTACK_PASSWORD=$OS_PASSWORD`
 - `OPENSTACK_TENANT=$OS_TENANT_NAME`

The `location` field in the deployment .morph file will also need to point to
the correct `$OS_AUTH_URL`.

### Firewall / Security Groups

The instructions assume the presence of a set of security groups. You can
create these by running the following Ansible playbook:

    ansible-playbook -i hosts firewall.yaml

### Placeholders

The commands below use a couple of placeholders like `$network_id`. You can
set them in your environment to allow you to copy and paste the commands below
as-is.

 - `export fedora_image_id=...` (find this with `glance image-list`)
 - `export network_id=...` (find this with `neutron net-list`)
 - `export keyname=...` (find this with `nova keypair-list`)

The `$fedora_image_id` should reference a Fedora Cloud image. You can import
these from <http://www.fedoraproject.org/>. At the time of writing, these
instructions were tested with Fedora Cloud 23 for x86_64.

Backups
-------

Backups of git.baserock.org's data volume are run by and stored on a
Codethink-managed machine named 'access'. They will need to migrate off this
system before long. The backups are taken without pausing services or
snapshotting the data, so they will not be 100% clean. The current
git.baserock.org data volume does not use LVM and cannot be easily snapshotted.

Backups of 'gerrit' and 'database' are handled by the
'baserock_backup/backup.py' script.
This currently runs on an instance in
Codethink's internal OpenStack cloud.

Instances themselves are not backed up. In the event of a crisis we will
redeploy them from the infrastructure.git repository. There should be nothing
valuable stored outside of the data volumes that are backed up.

To prepare the infrastructure to run the backup scripts you will need to run
the following playbooks:

    ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml
    ansible-playbook -i hosts baserock_database/instance-backup-config.yml
    ansible-playbook -i hosts baserock_gerrit/instance-backup-config.yml

NOTE: to run these playbooks you need to have the public SSH key of the backups
instance in `keys/backup.key.pub`.


Systems
-------

### Front-end

The front-end provides a reverse proxy, to allow more flexible routing than
simply pointing each subdomain to a different instance using separate public
IPs. It also provides a starting point for future load-balancing and failover
configuration.

To deploy this system:

    nova boot frontend-haproxy \
        --key-name=$keyname \
        --flavor=dc1.1x0 \
        --image=$fedora_image_id \
        --nic="net-id=$network_id" \
        --security-groups default,gerrit,shared-artifact-cache,web-server \
        --user-data ./baserock-ops-team.cloud-config
    ansible-playbook -i hosts baserock_frontend/image-config.yml
    ansible-playbook -i hosts baserock_frontend/instance-config.yml
    ansible-playbook -i hosts baserock_frontend/instance-backup-config.yml

    ansible -i hosts -m service -a 'name=haproxy enabled=true state=started' \
        --sudo frontend-haproxy

The baserock_frontend system is stateless.

Full HAProxy 1.5 documentation: <https://cbonte.github.io/haproxy-dconv/configuration-1.5.html>.
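For orientation, routing one subdomain through the frontend to an internal
instance looks roughly like the fragment below. This is a sketch only: the
subdomain, backend name and IP address are invented, and the real
`baserock_frontend/haproxy.cfg` in this repository is the authority.

```
# Hypothetical haproxy.cfg fragment: route example.baserock.org to an
# internal instance. All names and the address are placeholders.
frontend http-in
    bind *:80
    acl host_example hdr(host) -i example.baserock.org
    use_backend be_example if host_example

backend be_example
    # OpenStack provides no internal DNS here, so fixed IPs are hardcoded
    server example0 192.168.222.99:80 check
```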

If you want to add a new service to the Baserock Project infrastructure via
the frontend, do the following:

- request a subdomain that points at 185.43.218.170 (frontend)
- alter the haproxy.cfg file in the baserock_frontend/ directory in this repo
  as necessary to proxy requests to the real instance
- run the baserock_frontend/instance-config.yml playbook
- run `ansible -i hosts -m service -a 'name=haproxy enabled=true
  state=restarted' --sudo frontend-haproxy`

OpenStack doesn't provide any kind of internal DNS service, so you must put the
fixed IP of each instance in the proxy configuration.

The internal IP address of this machine is hardcoded in some places (beyond the
usual haproxy.cfg file); use 'git grep' to find all of them. You'll need to
update all the relevant config files. We really need some internal DNS system
to avoid this hassle.

### Trove

To deploy to production, run these commands in a Baserock 'devel'
or 'build' system:

    nova volume-create \
        --display-name git.baserock.org-home \
        --display-description '/home partition of git.baserock.org' \
        --volume-type Ceph \
        300

    git clone git://git.baserock.org/baserock/baserock/infrastructure.git
    cd infrastructure

    morph build systems/trove-system-x86_64.morph
    morph deploy baserock_trove/baserock_trove.morph

    nova boot git.baserock.org \
        --key-name $keyname \
        --flavor 'dc1.8x16' \
        --image baserock_trove \
        --nic "net-id=$network_id,v4-fixed-ip=192.168.222.58" \
        --security-groups default,git-server,web-server,shared-artifact-cache \
        --user-data baserock-ops-team.cloud-config

    nova volume-attach git.baserock.org <volume-id> /dev/vdb

    # Note: if this floating IP is not available, you will have to change
    # the DNS record at the DNS provider.
    nova add-floating-ip git.baserock.org 185.43.218.183

    ansible-playbook -i hosts baserock_trove/instance-config.yml

    # Before configuring the Trove you will need to create some SSH
    # keys for it. You can also use existing keys.

    mkdir private
    ssh-keygen -N '' -f private/lorry.key
    ssh-keygen -N '' -f private/worker.key
    ssh-keygen -N '' -f private/admin.key

    # Now you can finish the configuration of the Trove with:

    ansible-playbook -i hosts baserock_trove/configure-trove.yml

### OSTree artifact cache

To deploy this system to production:

    nova volume-create \
        --display-name ostree-volume \
        --display-description 'OSTree cache volume' \
        --volume-type Ceph \
        300

    nova boot ostree.baserock.org \
        --key-name $keyname \
        --flavor dc1.2x8.40 \
        --image $fedora_image_id \
        --nic "net-id=$network_id,v4-fixed-ip=192.168.222.153" \
        --security-groups default,web-server \
        --user-data ./baserock-ops-team.cloud-config

    nova volume-attach ostree.baserock.org <volume-id> /dev/vdb

    ansible-playbook -i hosts baserock_ostree/image-config.yml
    ansible-playbook -i hosts baserock_ostree/instance-config.yml
    ansible-playbook -i hosts baserock_ostree/ostree-access-config.yml

SSL certificates
================

The certificates used for our infrastructure are provided for free
by Let's Encrypt. These certificates expire every 3 months. Here we
will explain how to renew the certificates, and how to deploy them.

Generation of certificates
--------------------------

> Note: This should be automated in the next upgrade.
> The instructions sound like a lot of effort.

To generate the SSL certs, first you need to clone the following repositories:

    git clone https://github.com/lukas2511/letsencrypt.sh.git
    git clone https://github.com/mythic-beasts/letsencrypt-mythic-dns01.git

The version used the first time was `0.4.0`, with SHA
`116386486b3749e4c5e1b4da35904f30f8b2749b` (just in case future releases break
these instructions).

Now, inside the letsencrypt.sh repo, create a `domains.txt` file with the
information about the subdomains:

    cd letsencrypt.sh
    cat >domains.txt <<'EOF'
    baserock.org
    docs.baserock.org download.baserock.org irclogs.baserock.org ostree.baserock.org paste.baserock.org spec.baserock.org
    git.baserock.org
    EOF

And the `config` file needed:

    cat >config <<'EOF'
    CONTACT_EMAIL="admin@baserock.org"
    HOOK="../letsencrypt-mythic-dns01/letsencrypt-mythic-dns01.sh"
    CHALLENGETYPE="dns-01"
    EOF

Create a `dnsapi.config.txt` with the contents of `private/dnsapi.config.txt`
decrypted. To show the contents of this file, run the following in an
`infrastructure.git` repo checkout:

    ansible-vault view private/dnsapi.config.txt

Now, to generate the certs, run:

    ./dehydrated -c

> If this is the first time, you will be asked to run
> `./dehydrated --register --accept-terms` first.

In the `certs` folder you will have all the certificates generated.
To construct the
certificates that are present in `certs` and `private` you will have to:

    cd certs
    mkdir -p tmp/private tmp/certs

    # Create some full certs, including the key, for services that need
    # them in this form
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem git.baserock.org/privkey.pem > tmp/private/git-with-key.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem docs.baserock.org/privkey.pem > tmp/private/frontend-with-key.pem

    # Copy key files
    cp git.baserock.org/privkey.pem tmp/private/git.pem
    cp docs.baserock.org/privkey.pem tmp/private/frontend.pem

    # Copy cert files
    cp git.baserock.org/cert.csr tmp/certs/git.csr
    cp git.baserock.org/cert.pem tmp/certs/git.pem
    cp git.baserock.org/chain.pem tmp/certs/git-chain.pem
    cp docs.baserock.org/cert.csr tmp/certs/frontend.csr
    cp docs.baserock.org/cert.pem tmp/certs/frontend.pem
    cp docs.baserock.org/chain.pem tmp/certs/frontend-chain.pem

    # Create full certs without keys
    cat git.baserock.org/cert.csr git.baserock.org/cert.pem git.baserock.org/chain.pem > tmp/certs/git-full.pem
    cat docs.baserock.org/cert.csr docs.baserock.org/cert.pem docs.baserock.org/chain.pem > tmp/certs/frontend-full.pem

Before replacing the current ones, make sure you **encrypt** the ones that
contain keys (located in the `private` folder):

    ansible-vault encrypt tmp/private/*

And copy them to the repo:

    cp tmp/certs/* ../../certs/
    cp tmp/private/* ../../private/


Deploy certificates
-------------------

For `git.baserock.org` just run:

    ansible-playbook -i hosts baserock_trove/configure-trove.yml

This playbook will copy the certificates to the Trove and run the scripts
that configure them.
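After deploying, it is worth confirming the certificate's subject and expiry
date, for example by inspecting what the host actually serves with
`openssl s_client`. The snippet below rehearses the inspection step against a
throwaway self-signed certificate so that it is runnable anywhere; the CN is
just a placeholder and this is not the real baserock.org certificate:

```shell
# Rehearsal: create a throwaway 90-day self-signed cert with a placeholder
# CN, then print its subject and expiry as you would for a deployed cert.
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout /tmp/demo.key -out /tmp/demo.pem \
    -days 90 -subj "/CN=git.baserock.org"
openssl x509 -in /tmp/demo.pem -noout -subject -enddate
```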

For the frontend, run:

    ansible-playbook -i hosts baserock_frontend/instance-config.yml
    ansible -i hosts -m service -a 'name=haproxy enabled=true state=restarted' --sudo frontend-haproxy

This will install the certificates and then restart the services needed.


GitLab CI runners setup
=======================

Baserock uses [GitLab CI] for build and test automation. For performance reasons
we provide our own runners and avoid using the free, shared runners provided by
GitLab. The runners are hosted at [DigitalOcean] and managed by the 'baserock'
team account there.

There is a persistent 'manager' machine with a public IP of 138.68.143.2 that
runs GitLab Runner and [docker-machine]. This doesn't run any builds itself --
we use the [autoscaling feature] of GitLab Runner to spawn new VMs for building
in. The configuration for this is in `/etc/gitlab-runner/config.toml`.

Each build occurs in a Docker container on one of the transient VMs. As per
the [\[runners.docker\] section] of `config.toml`, each gets a newly created
volume mounted at `/cache`. The YBD and BuildStream cache directories are
located here because jobs were running out of disk space when using the default
configuration.

There is a second persistent machine with a public IP of 46.101.48.48 that
hosts a Docker registry and a [Minio] cache. These services run as Docker
containers. The Docker registry exists to cache the Docker images we use, which
improves the spin-up time of the transient builder VMs, as documented
[here](https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-docker-registry-mirroring).
The Minio cache is used for the [distributed caching] feature of GitLab CI.
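For orientation, a `config.toml` for this kind of autoscaling setup might look
roughly like the sketch below. Every value here is a placeholder (tokens,
addresses, counts, machine names), not the real configuration on the manager
machine:

```toml
# Hypothetical sketch of /etc/gitlab-runner/config.toml for an autoscaling
# runner. All tokens, addresses and names below are placeholders.
concurrent = 4

[[runners]]
  name = "baserock-autoscale"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker+machine"
  [runners.docker]
    image = "fedora:latest"
    # Per-build volume so YBD/BuildStream caches don't fill the root disk
    volumes = ["/cache"]
  [runners.cache]
    Type = "s3"
    ServerAddress = "46.101.48.48:9005"
    AccessKey = "REDACTED"
    SecretKey = "REDACTED"
    BucketName = "runner-cache"
  [runners.machine]
    IdleCount = 0
    MachineDriver = "digitalocean"
    MachineName = "ci-builder-%s"
```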

[GitLab CI]: https://about.gitlab.com/features/gitlab-ci-cd/
[DigitalOcean]: https://cloud.digitalocean.com/
[docker-machine]: https://docs.docker.com/machine/
[autoscaling feature]: https://docs.gitlab.com/runner/configuration/autoscale.html
[Minio]: https://www.minio.io/
[\[runners.docker\] section]: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section
[distributed caching]: https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching