Deploying research code to a compute cluster is a pain. EC2 can be particularly thorny. My workflow in past projects usually looked something like this:
- Write code on a local machine
- Manually spin up an EC2 instance
- SSH into an instance, install the dependencies, and create an AMI
- Spin up more instances with the new AMI, launch jobs individually
- Manually collect output
I try not to think about how many hours I've spent this way. While it's pretty obvious that most of the workflow can in principle be automated, the path to automation is long and treacherous. Most open-source projects only solve a part of the problem, and putting the pieces together often requires a fairly high level of sophistication on the part of the user.
This is the part where I ought to pitch my new amazing framework that will solve all your problems with a clean, simple UI. The truth is I don't think any framework exists that gives researchers the combination of flexibility and simplicity they want. Maybe Grid AI will get there someday.
My hypothesis for why the problem of deploying research code is still so difficult is very simple. Researchers don't have the time or desire to learn frameworks. We want to write simple, transparent code that's easy to change, and then we want that same code to run on a cluster.
In this post I'll take you through my process for scaling to EC2 clusters. It strikes a healthy balance between ease-of-use and flexibility. We'll be using the following packages:
Each of the following steps will take some time, and may need to be tweaked to suit your particular situation. Think of this more as an example than a comprehensive guide. To try to keep this post to a reasonable length, I won't be including details that are well-documented, like package installation. If things don't work for you at first, keep at it! A little time invested here will save you hours of thankless grunt work later on. I've also created a working example to make things more concrete.
Hydra and Ray are in active development. Run pip install -r requirements.txt with the following requirements to set up your environment.

    # /path/to/project/requirements.txt
    omegaconf==2.1.0.dev26
    hydra-core==1.1.0.dev6
    ray==1.2.0
    cloudpickle==1.6.0
    pickle5==0.0.11
    hydra-ray-launcher==1.1.0.dev1
I've pinned the packages to specific versions that I use across several projects. The packages should ultimately be pinned to stable release versions, but once I find a combination of package versions that works I tend to lock it down and never touch it again. Pinning package versions will save you a lot of headaches if you want to be able to deploy your code without changes six months after you wrote it.
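If you want to confirm that an environment actually matches the pins, a few lines of Python will do it. This is only a sketch: the helper names parse_pins and find_mismatches are mine, not part of any package, and the comparison is a plain string match on name==version lines, so anything fancier in a requirements file (extras, version ranges, git URLs) is ignored.

```python
from importlib import metadata  # standard library in Python 3.8+


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if '==' in line:
            name, version = line.split('==', 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pins to the installed packages; return {package: (pinned, installed)}."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


example = """
# /path/to/project/requirements.txt
ray==1.2.0
cloudpickle==1.6.0
"""
pins = parse_pins(example)
print('pins:', pins)
print('mismatches:', find_mismatches(pins))
```

An empty mismatches dict means the environment agrees with the file. Running a check like this at the top of a job is a cheap way to catch an image that was built from stale requirements.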
1. Writing Python Scripts with Hydra (Death to argparse)
I genuinely don't understand why people still use argparse, with all its boilerplate bloat. Let's take a simple example. Suppose I have the following function,
    import time

    def naptime(duration):
        print('going to sleep')
        time.sleep(duration)
        print('waking up')

    naptime(1)

    going to sleep
    waking up
I'd like to turn the function into a script main.py, and I want to be able to change the duration of the nap from the command line. Here's what you have to do with argparse:

    # /path/to/project/main.py
    import time
    import argparse

    def naptime(duration):...

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--duration', type=int, default=1)
        args = parser.parse_args()
        naptime(args.duration)
python main.py --duration 10
That's all well and good. But what if I decide I want this program to be a full-blown naptime simulation, with GAN-generated dreams? Configuring such an application just from the command line will quickly become incredibly unwieldy. When I finally give in and start configuring my application with .yaml files, I need to add more boilerplate so that I can load the yaml config first, then process command-line overrides, and I still won't have a record of what I overrode unless I explicitly add logging boilerplate.
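To make that boilerplate concrete, here is roughly what the do-it-yourself version looks like. Treat it as a sketch under a few assumptions: DEFAULT_CONFIG stands in for whatever you would get from loading a yaml file (for example with PyYAML), the dream settings are invented for illustration, and apply_overrides is a hypothetical helper rather than anything from a library.

```python
import argparse
import json

# Stand-in for the dict you would get from loading a yaml config file.
DEFAULT_CONFIG = {'duration': 1, 'dream': {'enabled': False, 'seed': 0}}


def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict, returning a new dict."""
    merged = json.loads(json.dumps(config))  # cheap deep copy for JSON-like data
    for item in overrides:
        key, raw_value = item.split('=', 1)
        *parents, leaf = key.split('.')
        node = merged
        for name in parents:
            node = node.setdefault(name, {})
        try:
            node[leaf] = json.loads(raw_value)  # parses numbers, booleans, null
        except json.JSONDecodeError:
            node[leaf] = raw_value  # anything else is kept as a plain string
    return merged


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('overrides', nargs='*', help='key=value pairs')
    return parser.parse_args(argv)


# Simulate: python main.py duration=10 dream.enabled=true
args = parse_args(['duration=10', 'dream.enabled=true'])
config = apply_overrides(DEFAULT_CONFIG, args.overrides)

# Without these lines there is no record of what was overridden.
print('overrides:', args.overrides)
print('config:', json.dumps(config, sort_keys=True))
```

That is thirty-odd lines before the program does anything useful, and it still has no type checking, no saved copy of the final config, and no error handling for malformed overrides.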
Hydra solves all these problems, and adds loads of extra functionality that makes your life so much better. Here's how we'd do this with Hydra. We'd create a config/ directory containing main.yaml:

    # /path/to/project/config/main.yaml
    defaults:
      - hydra/launcher: basic
    duration: 1

The script is just a wrapper around the naptime function:

    # /path/to/project/main.py
    import time
    import hydra

    def naptime(duration):...

    @hydra.main(config_path='./config', config_name='main')
    def main(config):
        naptime(config.duration)

    if __name__ == '__main__':
        main()
python main.py duration=10
Now we have all the flexibility we had before, but with automatic config and override logs, plus goodies like tab-completion if I've forgotten what flags to use.
You also get Hydra's incredible composability features, which make it easy to swap out different modules of code from the command line. I won't go into this in a lot of detail because it isn't the focus of the post, but the documentation is great.
The features I just mentioned are worth the switch on their own merits, but we're not done. Hydra also includes plugins for cluster launchers (e.g. Submitit), and sweepers like Ax and Nevergrad. Crucially, the plugins are meant to work with no code changes and the same command line interface. And that's where our story really starts, with the Hydra Ray plugin.
2. Configuring AWS
After installing the plugin, the first thing you need to do to use EC2 is make sure AWS is set up correctly. I'm going to assume you've already set up an AWS account, requested a sufficient service quota, and configured your AWS CLI. There is a lot of documentation and learning resources out there for this part, so I won't spend more time here.
The Hydra Ray plugin documentation recommends setting up IAM roles for your head and worker nodes, as is done here. This part is crucial to make sure your head node can spin up more workers, and to make sure both the head and worker nodes can access S3.
In addition to using EC2 for compute resources, you can also use S3 to store output. I wrote a simple dataframe logger that writes directly to S3, as long as your AWS credentials are configured. At the end we can quickly pull all of our output to our local machine by running
aws s3 sync s3://my-bucket/path/to/remote/ path/to/project/
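If you'd rather roll your own logger, the core idea fits in a few lines. This is not the logger mentioned above, just a minimal sketch of the pattern: CsvLogger and make_s3_put are hypothetical names, rows are written as CSV with the standard library instead of a dataframe, and the only real AWS call is boto3's put_object, which needs boto3 installed and credentials configured. The demo at the bottom swaps S3 for an in-memory dict so it runs anywhere.

```python
import csv
import io


class CsvLogger:
    """Hypothetical minimal logger: collect rows, then write them out as one CSV."""

    def __init__(self, put):
        self.put = put  # callable(key, body_bytes) that stores the file somewhere
        self.rows = []

    def log(self, **row):
        self.rows.append(row)

    def flush(self, key):
        if not self.rows:
            return
        # Union of column names across all rows, in first-seen order
        fieldnames = list(dict.fromkeys(k for row in self.rows for k in row))
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(self.rows)
        self.put(key, buffer.getvalue().encode('utf-8'))


def make_s3_put(bucket):
    """Return a put(key, body) function backed by S3 (needs boto3 and credentials)."""
    import boto3  # imported lazily so the dry run below works without boto3
    client = boto3.client('s3')
    return lambda key, body: client.put_object(Bucket=bucket, Key=key, Body=body)


# Local dry run: a dict stands in for the bucket.
fake_bucket = {}
logger = CsvLogger(put=fake_bucket.__setitem__)
logger.log(duration=1, status='done')
logger.log(duration=2, status='done')
logger.flush('path/to/remote/results.csv')
print(fake_bucket['path/to/remote/results.csv'].decode('utf-8'))
```

On the cluster you would construct the logger with CsvLogger(put=make_s3_put('my-bucket')) instead, and the sync command above would pull the resulting files down.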
3. Containerizing your code
For a simple program like main.py, containerization is obviously overkill. In practice, ML research projects tend to have many dependencies, both at the system level (e.g. CUDA) and at the Python environment level (PyTorch, Pandas, Torchvision, etc.). There is also typically a dependency on source code that only exists in git repos. For many of us, retracing the steps we took to create a local compute environment is painful and time-consuming. Containerization adds some complexity upfront, but it ensures a stable environment for your code, free from the vagaries of package version changes and cluster system updates. To get started, you can try using the image I've created. Keep reading if you want to know how to make your own image.
You'll add a Dockerfile to the working directory that looks something like the one from my github example,
    # /path/to/project/Dockerfile
    FROM ubuntu:18.04

    # Ray wants these lines
    ENV LC_ALL=C.UTF-8
    ENV LANG=C.UTF-8

    RUN apt update && apt install software-properties-common -y
    # python3.8-dev includes headers that are needed to install pickle5 later
    RUN apt-get install python3.8-dev gcc -y

    # install python 3.8 virtual environment
    RUN add-apt-repository ppa:deadsnakes/ppa
    RUN apt update && apt install python3.8-venv git -y
    ENV VIRTUAL_ENV=/opt/venv
    RUN python3.8 -m venv $VIRTUAL_ENV
    ENV PATH=$VIRTUAL_ENV/bin:$PATH
    RUN python -m pip install --upgrade pip setuptools

    # install java, requirement to install hydra from source
    RUN apt install default-jre -y

    RUN mkdir src
    COPY hydra-ray-demo/ src/hydra-ray-demo/
    RUN python -m pip install -r src/hydra-ray-demo/requirements.txt
    WORKDIR src/hydra-ray-demo
There are a couple of things to note about this Dockerfile. First, even though this is a container, we're still using a virtual python environment (and it's not conda). The reason for keeping the virtual python environment is that it allows a clean separation between system-level packages managed with apt and environment-level packages managed with pip.
We aren't using conda because Docker and conda don't play very nicely together. For conda to work as intended, you need to source your .bash_profile before running anything. This happens automatically if you start an interactive bash session, or use /bin/bash --login -c. The default shell for Docker is /bin/sh, which does not source those files. You can always hack the PATH environment variable, but that won't necessarily replicate all the behavior of conda activate. Although I did initially try to use conda, I eventually wound up using virtualenv instead. As a bonus, here's a gist of how to write a Dockerfile that installs packages in a conda virtual environment, if you're set on using it.
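Since the Dockerfile activates the environment purely by putting it first on PATH, it's worth checking that python inside the container really resolves to the virtual environment. Here's a small sanity check you could run in the container. The function name in_virtualenv is my own, and the test relies on how environments created with python -m venv behave: sys.prefix points at the environment while sys.base_prefix points at the system install.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


print('interpreter:', sys.executable)
print('inside a virtual environment:', in_virtualenv())
```

In the image built from the Dockerfile above, I'd expect the interpreter path to sit under /opt/venv; if it doesn't, the PATH line is the first place to look.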
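To build your Docker image you need to be outside the project directory. The COPY instruction in the Dockerfile refers to the project directory by name (hydra-ray-demo/ in my example), so the build context has to be its parent directory. Once we've built the image, we'll push it to DockerHub to use later.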
    cd /path/to/project
    cd ..
    docker build . -f /path/to/project/Dockerfile --tag <DOCKERHUB_USER>/<IMAGE_NAME>:<TAG>
    docker push <DOCKERHUB_USER>/<IMAGE_NAME>:<TAG>
Note: Keep the contents of the parent directory to a minimum for fastest build times.
After using Docker for a while, your storage will quickly fill up with fragments of previous images. Run docker system prune -f to clean that up.
4. Configuring the Hydra Ray Plugin
Now we need to update our program config to accommodate the Hydra Ray launcher plugin. In order to maintain our ability to run our program locally, we'll make use of Hydra's composability and add another file,
    # /path/to/project/config/hydra/launcher/ray_aws.yaml
    _target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
    env_setup:
      pip_packages:
        omegaconf: null
        hydra_core: null
        ray: null
        cloudpickle: null
        pickle5: null
        hydra_ray_launcher: null
      commands: []
    ray:
      cluster:
        cluster_name: demo-cluster
        min_workers: 0
        max_workers: 0
        initial_workers: 0
        autoscaling_mode: default
        target_utilization_fraction: 0.8
        idle_timeout_minutes: 5
        docker:
          image: 'samuelstanton/hydra-ray-demo:latest'
          container_name: 'hydra-container'
          pull_before_run: true
          run_options: []
        provider:
          type: aws
          region: us-east-2
          availability_zone: us-east-2a,us-east-2b
        auth:
          ssh_user: ubuntu
        head_node:
          InstanceType: m4.large
          ImageId: ami-010bc10395b6826fb
        worker_nodes:
          InstanceType: m4.large
          ImageId: ami-010bc10395b6826fb
    stop_cluster: true
At last we have arrived at our destination. If you recall, the command we used to run our program locally was python main.py duration=10. You can still run that command, and the program's behavior will be unchanged. To simultaneously execute several realizations of your program on EC2, simply use

    python main.py -m hydra/launcher=ray_aws duration=1,2,4,8

and you're off to the races!
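If the comma-separated duration=1,2,4,8 looks like magic, the idea behind it is simple: in multirun mode each comma-separated value becomes its own job, and when several overrides carry multiple values you get one job per combination. The sketch below mimics that expansion in plain Python. expand_sweep is a hypothetical helper of mine, not Hydra's implementation, and it ignores the richer sweep syntax and quoting rules that Hydra supports.

```python
import itertools


def expand_sweep(overrides):
    """Expand comma-separated override values into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, values = item.split('=', 1)
        keys.append(key)
        choices.append(values.split(','))
    return [
        [f'{key}={value}' for key, value in zip(keys, combo)]
        for combo in itertools.product(*choices)
    ]


# The sweep from the command above: four values, so four jobs.
for job_number, job in enumerate(expand_sweep(['duration=1,2,4,8'])):
    print(job_number, job)
```

Adding a second swept parameter multiplies the job count, so keep an eye on max_workers and your service quota before launching a large grid.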
If you've stayed with me this long, thanks for reading! I hope you find this useful. I arrived at this procedure mostly through trial-and-error, so don't be afraid to try to improve on it!