Appsembler Virtual Labs is a product we’ve created that enables software trainers to create repeatable environments for their students. It works seamlessly with Open edX. Check out this blog post for more details.
In this post I will cover the architecture and scalability solutions behind this service.
The Appsembler Virtual Labs (AVL) product allows students to spin up individual lab environments directly from Open edX courses (thanks to our open source “Launch Container” XBlock). The lab environment can range from a simple, shell-only Linux installation to a massive corporate database with a complex web front end.
The technology that allows us to simultaneously deploy several hundred of those environments is Docker Swarm, a powerful tool for managing a cluster of machines, each running Docker and hosting containers and images. It allows us to combine the resources of multiple nodes into a single pool which behaves, from the perspective of our application, like a single Docker server. Our Python application talks to Swarm exactly as it would to an ordinary, local Docker daemon, because Swarm exposes the same Docker API.
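To make the "same Docker API" point concrete, here is a minimal sketch that builds a Docker Engine API container-create request for two different endpoints. It only constructs the request and does not send it. The endpoint addresses, image name, and container name are assumptions made up for illustration, not values from our deployment.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints; neither address comes from the original post.
LOCAL_DAEMON = "http://127.0.0.1:2375"
SWARM_MANAGER = "http://swarm-manager.example.internal:4000"


def build_create_request(base_url, image, name):
    """Return (method, url, body) for a Docker Engine API container-create call.

    The request is only built here, never sent, so the sketch runs without
    a Docker daemon or a Swarm cluster.
    """
    query = urlencode({"name": name})
    url = f"{base_url}/containers/create?{query}"
    body = json.dumps({"Image": image})
    return "POST", url, body


def request_path(url, base_url):
    """Strip the endpoint prefix so two requests can be compared."""
    return url[len(base_url):]


# The image and container names are illustrative assumptions.
local = build_create_request(LOCAL_DAEMON, "ubuntu:16.04", "lab-student-42")
swarm = build_create_request(SWARM_MANAGER, "ubuntu:16.04", "lab-student-42")

# Only the endpoint differs; the path, query string and body are identical.
same_path = request_path(local[1], LOCAL_DAEMON) == request_path(swarm[1], SWARM_MANAGER)
same_body = local[2] == swarm[2]
```

In this sketch, switching from a single Docker host to the cluster is a one-line configuration change: the application code that builds requests stays the same.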
High Availability & Consul
All reliable solutions should avoid single points of failure, and Swarm clusters are no exception. To minimize the chance of suffering from this class of issues, we use Swarm’s High Availability feature. It allows us to run multiple instances of Swarm Manager and then provide discovery and coordination through HashiCorp’s Consul (a great tool for distributed service discovery and configuration).
Thanks to the combination of High Availability Swarm and Consul, if any individual Swarm node stops working, it’s temporarily removed from the cluster, without adversely affecting the availability of the Virtual Labs. Usually the only affected resources are the containers and images residing on the problematic node. However, the images are often still accessible through the remaining, healthy nodes in the cluster. (I plan to examine this further in a future blog post on the user interface side of Appsembler Virtual Labs, so stay tuned!)
As soon as the issue is fixed, the node is re-added to the cluster and the containers can be accessed again.
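The idea of avoiding a single point of failure can be illustrated from the client side with a small sketch: given several manager addresses, use the first one that passes a health probe. This is an illustration only. It is not how Swarm and Consul coordinate managers internally, and the addresses and the `is_healthy` probe are hypothetical.

```python
def first_healthy_manager(managers, is_healthy):
    """Return the first manager address that passes the health probe.

    `is_healthy` is a caller-supplied function (hypothetical here) that
    returns True or False, or raises OSError when the address is unreachable.
    """
    for address in managers:
        try:
            if is_healthy(address):
                return address
        except OSError:
            # Treat connection errors like a failed probe and keep looking.
            continue
    raise RuntimeError("no healthy Swarm manager available")


# Hypothetical manager addresses, with the first one simulated as down.
MANAGERS = [
    "manager-1.example.internal:4000",
    "manager-2.example.internal:4000",
    "manager-3.example.internal:4000",
]
DOWN = {"manager-1.example.internal:4000"}


def fake_probe(address):
    """Stand-in for a real health check; fails for addresses in DOWN."""
    if address in DOWN:
        raise OSError("connection refused")
    return True


chosen = first_healthy_manager(MANAGERS, fake_probe)
```

With one manager down, the application still gets a working endpoint; it only fails when every manager is unavailable.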
As the number of students and their containers grows, the likelihood of a security incident also grows. After all, a container is just a fancy name for an isolated group of processes. To increase the isolation (and therefore the security), containers use a separate overlay network and cannot access our own internal services.
We also started rolling out a “container namespace” feature, which allows us to decrease the privileges of processes running inside containers, further reducing the attack surface.
Monitoring and Reporting
The last (but not least!) piece of the puzzle is PagerDuty, which collects reports about malfunctioning subsystems and immediately notifies our engineers. You can read more about it in our post on monitoring of infrastructure and applications.
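As a rough sketch of what such a report might look like, the snippet below builds a trigger event in the shape expected by the PagerDuty Events API v2. Whether our monitoring uses this exact API is an assumption, and the routing key, summary, and source values are placeholders. The event is only built, not sent.

```python
import json

# Public endpoint of the PagerDuty Events API v2 (not called in this sketch).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON-serializable body of an Events API v2 trigger event."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


# Placeholder values; none of them come from a real integration.
event = build_trigger_event(
    routing_key="EXAMPLE_ROUTING_KEY",
    summary="Swarm node unreachable",
    source="swarm-node-7.example.internal",
)
event_json = json.dumps(event)
```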
Hosting online labs can be a challenging problem. The platform needs to be able to launch many instances of (often massive) training environments, isolate them from the core infrastructure, and minimize the impact of problems related to individual containers and servers in the cluster.
Docker Swarm and HashiCorp’s Consul turned out to be excellent choices for us. Thanks to their stability, performance and security features we can focus our efforts on developing new features for Appsembler Virtual Labs.