Deploying Open edX with Google Cloud Build

When someone asks what CI/CD platforms we use at Appsembler, I’m tempted to give the snarky answer of “yes”.

There’s hardly one that we don’t either currently use or have used in the past. TravisCI, CircleCI, CodeShip, and Jenkins are all in current use and we’ve used others at various times.

When it comes to the Open edX deployments that we manage, we’ve mostly just used those platforms for Continuous Integration, the “CI” part of “CI/CD”. For deployment, we have largely stuck to the “standard” upstream community approach, deploying directly with Ansible from a developer’s workstation.

There are a number of problems with that approach, especially once you get to the large number of independent deployments that we manage. Naturally, moving to more of a CD world, or at least getting deployment to happen in a more controlled “push button” manner, has been a high priority for us.

The “Standard” Deploy

Let me do a quick review of how the “standard” deploy process works for us. Our developer must have locally checked out copies of:

  • our forked versions of edx-platform, configuration, and edx-themes
  • another repo with our own custom Ansible roles (so we don’t have to maintain them within the forked configuration)
  • a private repo that has customer themes
  • a private repo with customer configurations (basically a server-vars.yml and associated Ansible Vault files for each customer deployment)

Some of those use shared branches based on the Open edX release while others must be checked out to a customer-specific branch.

We have an internal tool, ax, which manages those branches, does some basic sanity checks, and simplifies things quite a bit. Eventually though, when we use ax to deploy, it’s just constructing an Ansible command and executing that.

This works, but it has had its share of problems:

  • The developer deploying must have direct SSH access to the servers.
  • Switching between customers/deployments can be annoying because of the local branch changes, especially when working on in-progress features. There tends to be a lot of git stash and similar cleanup work that has to happen on every context switch.
  • With all of those repos involved, it takes a concerted effort to verify that they are all checked out to the right branch and up to date (ax does some of that, but there are enough exceptions that it’s still a pain point).
  • Output from the deploy is only saved locally, so if something goes wrong and the developer wants help debugging, they have extra steps to capture the output and share it with the rest of the team.
  • The configs directory contains sensitive (though encrypted) data, and we have a security policy of not leaving that repo checked out on a workstation any longer than necessary. That means that most of the time, a deploy involves cloning that repo, doing everything else for the deploy, then deleting it. More manual steps.
  • Finally, the developer needs a steady, reliable internet connection for the entire deploy. Disconnecting in the middle of a deploy could leave a server in an inconsistent state until they are able to re-connect and resume or restart the deploy. Since Appsembler is a remote-first, global company, and many of our developers take advantage of that to travel while working, not being able to deploy without a reliable wifi connection is a major impediment (in fact, a lot of the first prototype of what I will be discussing in this post was developed while I was on a train in the Scottish countryside, tethering from my phone).

There are a few challenges in adapting that “run ansible from your workstation” deployment approach to a CI/CD tool. First, since our deploy uses Ansible over SSH, the CI/CD tool must have SSH credentials and access to our servers. Entrusting that access to a third party provider is inherently risky and something we strive to avoid. Similarly, the deploy process must have access to private Github repos and to various sensitive parameters (SSL certificates, database passwords, etc). We typically keep those encrypted in Ansible Vault files, but the deployment process still requires access to those Ansible Vault files.

All of the CI/CD services available have ways of handling these various secrets securely, but there is always a certain amount of trust that we would be placing in them. A smaller, but more immediate, problem we found with most of them is that they inherently expect a one-to-one correspondence between git repos and deployments. A Tahoe deploy involves a half dozen different repos, though, and that requires some awkward hacks to implement on most services.

Enter Google Cloud Build

If you’re wondering which one we went with for Tahoe/Open edX deployment, well, the title of the post kind of gives it away. We are currently deploying using Google Cloud Build. The rest of this post will explain roughly how it works, why it works well for us, and how things might be improved in the future.

The first thing that comes into play is the fact that we run the majority of our deployments and services on Google Cloud Platform. If we were primarily on AWS, Azure, or some other platform, we likely would have gone in a very different direction and selected different tools.

Cloud Build has a pretty straightforward setup. You define a cloudbuild.yaml file with some global configuration and a number of “steps”. The “steps” are essentially all just:

  • pull or build a docker image
  • mount some shared directories (so data can persist between the steps)
  • run a command using that docker container and environment

It’s a bit different from some services that have a model where you define steps and those steps all run within a provisioned VM/container. It makes for a more complicated config, but is conceptually simple and allows for some nice modularity and reusability. You can build custom images for each step, choose from Google provided ones (especially handy for interacting with their APIs), or just use anything off the Docker Hub. If you understand Docker already, and can read a YAML file, there’s not a lot more to learn to be able to do basic deployments with GCB.
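To make that concrete, here is a minimal, hypothetical cloudbuild.yaml along those lines. The builder images are real public ones, but the commands are just placeholders:

```yaml
# Each step is a container image plus a command. The /workspace directory is a
# shared volume that Cloud Build mounts into every step automatically, so data
# persists from one step to the next.
steps:
  # Step 1: run a stock image straight off the Docker Hub.
  - name: 'alpine'
    entrypoint: 'sh'
    args: ['-c', 'echo "hello from step one" > /workspace/build-info.txt']
  # Step 2: a different container, but anything written to /workspace
  # in step 1 is still there.
  - name: 'alpine'
    entrypoint: 'cat'
    args: ['/workspace/build-info.txt']
```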

Our Cloud Build code isn’t public, but the basic outline is something like:

  • Use a gsutil container to pull encrypted secrets down from a Cloud Storage bucket into a volume that will also be mounted for all the rest of the steps. These secrets are SSH deploy keys and an Ansible Vault password.
  • Use a gcloud KMS container to decrypt those secrets so later steps can actually use them.
  • Clone the different git repos that are involved, each checked out to the branch appropriate for the project and deployment environment. Instead of going directly from Github, it is much simpler to clone from Cloud Source Repositories which have been previously set up to mirror our Github repos.
  • Use a gcloud container to trigger an on-demand backup of the Cloud SQL database. This gives us a nice rollback point if we ever deploy some code that does something really bad and messes up our data.
  • Use a fairly standard Python container, with Ansible and our regular deployment tool installed, to run a fairly typical Open edX deploy.
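As a rough illustration, that pipeline might look something like the following sketch. The bucket, keyring, repo, variable, and image names here are invented placeholders, not our actual configuration:

```yaml
steps:
  # Pull encrypted secrets (SSH deploy keys, Ansible Vault password) into the
  # shared /workspace volume that all later steps can see.
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['cp', 'gs://example-secrets-bucket/deploy-secrets.enc', '/workspace/']
  # Decrypt them with Cloud KMS; the build's service account only needs
  # decrypt access to this one key.
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['kms', 'decrypt',
           '--location=global', '--keyring=example-keyring', '--key=example-key',
           '--ciphertext-file=/workspace/deploy-secrets.enc',
           '--plaintext-file=/workspace/deploy-secrets']
  # Clone each repo from its Cloud Source Repositories mirror, at the branch
  # appropriate for this deployment (one clone step per repo).
  - name: 'gcr.io/cloud-builders/git'
    args: ['clone', '--branch', '${_RELEASE_BRANCH}',
           'https://source.developers.google.com/p/${PROJECT_ID}/r/example-configuration']
  # Trigger an on-demand Cloud SQL backup as a rollback point.
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['sql', 'backups', 'create', '--instance=${_SQL_INSTANCE}']
  # Run the actual Ansible deploy from a custom image with our tooling baked in.
  - name: 'gcr.io/${PROJECT_ID}/ansible-deploy'
    args: ['ansible-playbook', '-i', 'inventory', 'deploy.yml']
```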

What I really like about this setup:

  • The basic docker container approach is simple to understand and reason about. It makes me confident that if our deploy changes to require some other tool that it will be straightforward to include it (just slap it in a docker image).
  • The service account model is great. Cloud Build runs as a designated service account in the project and you can grant it access to other GCP resources as needed. In our case, it has read access to the source repos and the storage bucket that has encrypted keys, decrypt access for the KMS keys used to encrypt the secrets, permission to trigger a backup on Cloud SQL, and basically no access to anything else. That’s a great tradeoff between the flexibility to allow a build to do all kinds of different things and securely limiting access to only what is needed.
  • The ease of integration with the rest of the GCP services and APIs is nice. Mostly that is just a combination of the service account model and the fact that gcloud and gsutil are available as builder steps that you can incorporate. For Tahoe, this made it pretty straightforward to integrate with Cloud Source Repositories, Storage Buckets, and KMS.
  • Parallelization is actually not bad. Steps normally run sequentially, but you can also specify a waitFor parameter that tells a step which other steps it depends on. If that is specified, Cloud Build will run anything it can in parallel as long as its dependencies have already run. There is no artificial limit on concurrent builds (unlike CircleCI). It doesn’t make a big difference in our case, since the main Ansible deploy dominates everything else, but as we look to deploying to multiple clusters in different regions simultaneously, it could be very valuable.
  • We didn’t really make use of it in the Tahoe or Prometheus deploys, but GCB makes it stupidly easy to build a docker image and push that to GCR. If your overall system consists of a bunch of independent services that run as docker containers managed by something like Kubernetes, GCB is basically exactly what you want. A commit triggers the building and publishing of a container, then it runs a gcloud or helm command against your cluster to deploy the new image and you are done. That’s a really nice way to run and deploy services so I like that GCB makes that the “golden path” and basically rewards you for architecting your system that way.
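As a small, hypothetical example of that waitFor behavior (the ids, repos, and images here are made up): giving each step an id lets the two fetch steps run in parallel while the deploy step waits on both:

```yaml
steps:
  # waitFor: ['-'] means "depend on nothing", so these two steps start
  # immediately and run concurrently.
  - id: 'fetch-code'
    name: 'gcr.io/cloud-builders/git'
    args: ['clone', 'https://source.developers.google.com/p/${PROJECT_ID}/r/example-repo']
    waitFor: ['-']
  - id: 'fetch-secrets'
    name: 'gcr.io/cloud-builders/gsutil'
    args: ['cp', 'gs://example-bucket/secrets.enc', '/workspace/']
    waitFor: ['-']
  # This step only starts once both fetch steps have finished.
  - id: 'deploy'
    name: 'alpine'
    args: ['echo', 'deploying']
    waitFor: ['fetch-code', 'fetch-secrets']
```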

It’s far from perfect though. These are some of the more difficult aspects that I struggled with while setting this up:

  • Steps run in order and the whole thing stops when one fails. Normally this is what you want in your build pipeline. The exception is some kind of “clean up” step or steps that should run at the end whether the build succeeded or not. You can hack around this by making the deploy step always succeed (with an || true in the shell script), but that means we don’t get notifications when it fails. You can work around that by writing the status to a file, doing the “cleanup” steps, and then having a final step set the overall build status based on that file. That’s getting really complicated though, and it would be simpler if they just had first-class support for the equivalent of a finally clause.
  • Similarly, all the steps in a config run every time (until one fails); there is no way to conditionally run a step. This becomes annoying if you want different triggers to do slightly different things, e.g., building and testing on every commit but only deploying on a push to a certain branch. Currently, you have to maintain separate triggers and cloudbuild.yaml configs for each of those cases. That wouldn’t be too bad if there were also some notion of importing or including build steps, but currently any overlap between the tasks has to be duplicated. There are at least a few issues for this in their issue tracker and it sounds like it’s coming “some day”.
  • You can put substituted variables into the arguments for a step, but nothing calculated dynamically. So args: ['${_ENVIRONMENT}-foo'] is fine and you can end up with staging-foo or prod-foo, but there’s no way to map, say, staging to foo and prod to bar. On the Tahoe deploy, this is a small issue because our staging and prod database instances are not quite named consistently. The old deploy script had a little if/then to handle it, but for Cloud Build, we had to add the semi-redundant instance name as a top-level variable that is specified for each triggered build. I would also really like to be able to calculate a value in one step and use it as an argument in a later step.
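The “always succeed, clean up, then re-raise” workaround from the first point above can be sketched like this; the commands and the deploy image are illustrative, not our real pipeline:

```yaml
steps:
  # Let the deploy step always exit 0, but record the deploy command's real
  # exit status to the shared /workspace volume. (The step's own exit status
  # is that of the final echo, which is 0.)
  - name: 'gcr.io/${PROJECT_ID}/ansible-deploy'   # hypothetical custom image
    entrypoint: 'sh'
    args: ['-c', 'ansible-playbook deploy.yml; echo $? > /workspace/deploy-status']
  # "Cleanup" steps run here regardless of whether the deploy succeeded.
  - name: 'alpine'
    entrypoint: 'sh'
    args: ['-c', 'echo "cleaning up temporary resources"']
  # Finally, re-raise the recorded status so the overall build result (and any
  # notifications) reflect whether the deploy actually worked.
  - name: 'alpine'
    entrypoint: 'sh'
    args: ['-c', 'exit "$(cat /workspace/deploy-status)"']
```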

Finally, there are also a lot of rough edges and things that I think could be improved or that we had to work around, but haven’t really been a significant problem:

  • The “Run Trigger” button in the console has a “down arrow” that makes it look like there are additional options that can be expanded. There aren’t. Click it expecting to see what’s there, and it just runs the trigger. In general, triggering builds through the web interface is very clunky. It’s designed entirely around triggering automatically on a git push, but we aren’t quite ready for that level of automation and have been running builds manually.
  • There’s supposed to be a way to hook it up to Github so it reports back build status on PRs. I haven’t been able to get that to work. So far we are using it for deploys rather than running tests on PRs so it hasn’t been that important, but if we really want to replace Travis/Circle with it some day, we will need to get that working.
  • Sending out notifications when a build succeeds or fails is not that easy. You can either do it manually as a step in your build (e.g., having a docker container make a curl request to a Slack API) or hook up a Google Cloud Function to the pub/sub notification that Cloud Build sends out. The former runs into the problem with the lack of a finally or cleanup step, making it hard to notify on a failed build in a clean way. The latter is more robust, but is way more plumbing to set up.
  • Variable substitutions defined directly in the cloudbuild.yaml file are limited to 100 characters, but that limit only kicks in when it runs as a triggered build. If you run manually via gcloud builds …, longer ones are fine. This bit me with encrypted variables. For a vault password or something, the simplest setup is to just stick it in the cloudbuild.yaml as an encrypted variable. That works fine at first, then fails when you switch to a triggered build because the encrypted field is too long. That means adding a manual step or three somewhere to pull the secret out of a file, decrypt it, and stick it in another file.
  • Logs in the console are pretty ugly. They aren’t really updated in real time. Build step status isn’t even updated until the whole thing finishes. Circle/Travis/etc. all do a much better job of providing a nice view of running builds.
