Deploying Research Code with Hydra, Ray, and Docker

Deploying research code to a compute cluster is a pain. EC2 can be particularly thorny. My workflow in past projects usually looked something like this:

  • Write code on a local machine
  • Manually spin up an EC2 instance
  • SSH into an instance, install the dependencies, and create an AMI
  • Spin up more instances with the new AMI, launch jobs individually
  • Manually collect output

I try not to think about how many hours I've spent this way. While it's pretty obvious that most of the workflow can in principle be automated, the path to automation is long and treacherous. Most open-source projects only solve a part of the problem, and putting the pieces together often requires a fairly high level of sophistication on the part of the user.

This is the part where I ought to pitch my new amazing framework that will solve all your problems with a clean, simple UI. The truth is I don't think any framework exists that gives researchers the combination of flexibility and simplicity they want. Maybe Grid AI will get there someday.

My hypothesis for why the problem of deploying research code is still so difficult is very simple. Researchers don't have the time or desire to learn frameworks. We want to write simple, transparent code that's easy to change, and then we want that same code to run on a cluster.

In this post I'll take you through my process for scaling to EC2 clusters. It strikes a healthy balance between ease-of-use and flexibility. We'll be using the following packages:

  • Hydra for application configuration
  • Ray for cluster management
  • Docker to manage dependencies

Each of the following steps will take some time, and may need to be tweaked to suit your particular situation. Think of this more as an example than a comprehensive guide. To try to keep this post to a reasonable length, I won't be including details that are well-documented, like package installation. If things don't work for you at first, keep at it! A little time invested here will save you hours of thankless grunt work later on. I've also created a working example to make things more concrete.

Package Versions

Hydra and Ray are in active development. Use the following commands to install Hydra and the Ray plugin.

pip install git+https://github.com/facebookresearch/hydra.git@7e7832f72664b40ee04834ac8956e7146a3d37fb
pip install git+https://github.com/facebookresearch/hydra.git@7e7832f72664b40ee04834ac8956e7146a3d37fb#subdirectory=plugins/hydra_ray_launcher

I've pinned the installation to a specific commit because Docker is not supported by the Ray launcher plugin in the current release. When the next stable release is made available, I will update this section. The example should work with ray==1.2.0.

1. Writing Python Scripts with Hydra (Death to argparse)

I genuinely don't understand why people still use argparse, with all its boilerplate bloat. Let's take a simple example. Suppose I have the following function, naptime:

import time

def naptime(duration):
    print('going to sleep')
    time.sleep(duration)
    print('waking up')

naptime(1)
going to sleep
waking up

I'd like to turn the function into a script main.py, and I want to be able to change the duration of the nap from the command line. Here's what you have to do with argparse:

# /path/to/project/main.py
import time
import argparse

def naptime(duration):...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--duration', type=int, default=1)
    args = parser.parse_args()
    naptime(args.duration)

python main.py --duration 10

That's all well and good. But what if I decide I want this program to be a full-blown naptime simulation, with GAN-generated dreams? Configuring such an application just from the command line will quickly become incredibly unwieldy. When I finally give in and start configuring my application with .yaml files, I need to add more boilerplate so that I can load the yaml config first, then process command-line overrides, and I still won't have a record of what I overrode unless I explicitly add logging boilerplate.
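To make that boilerplate concrete, here is a minimal sketch of the hand-rolled version. It is only an illustration: FILE_CONFIG stands in for the parsed contents of a .yaml file (a real project would load it with a YAML parser), and the dream_model option and the build_config helper are hypothetical names I made up for the example.

```python
import argparse
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("naptime")

# Stand-in for the parsed contents of a .yaml file. In a real project this
# dict would come from a YAML parser; `dream_model` is a made-up option.
FILE_CONFIG = {"duration": 1, "dream_model": "gan"}


def build_config(argv):
    """Merge file defaults with command-line overrides, logging each override."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not passed" apart from "flag passed".
    parser.add_argument("--duration", type=int, default=None)
    parser.add_argument("--dream_model", type=str, default=None)
    args = parser.parse_args(argv)

    config = dict(FILE_CONFIG)  # copy so the defaults stay untouched
    for key, value in vars(args).items():
        if value is not None:
            log.info("override %s: %r -> %r", key, config[key], value)
            config[key] = value
    return config
```

Calling build_config(["--duration", "10"]) returns the defaults with duration replaced by 10 and logs the override. Every new option needs another add_argument line that mirrors the config file, which is exactly the kind of duplication that gets out of hand.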

Hydra solves all these problems, plus it adds loads of extra functionality that makes your life so much better. Here's how we'd do this with Hydra. We'd create a config/ directory and create config/main.yaml.

# /path/to/project/config/main.yaml
defaults:
    - hydra/launcher: basic

duration: 1

The script is just a wrapper around the runtime function,

# /path/to/project/main.py
import time
import hydra

def naptime(duration):...

@hydra.main(config_path='./config', config_name='main')
def main(config):
    naptime(config.duration)

if __name__ == '__main__':
    main()

python main.py duration=10

Now we have all the flexibility we had before, but with automatic config and override logs, plus goodies like tab-completion if I've forgotten what flags to use.

You also get Hydra's incredible composability features, which make it easy to swap out different modules of code from the command line. I won't go into this in a lot of detail because it isn't the focus of the post, but the documentation is great.

The features I just mentioned are worth the switch on their own merits, but we're not done. Hydra also includes plugins for cluster launchers (e.g. SubmitIt), and sweepers like Ax and Nevergrad. Crucially, the plugins are meant to work with no code changes and the same command line interface. And that's where our story really starts, with the Hydra Ray plugin.

2. Configuring AWS

After installing the plugin, the first thing you need to do to use EC2 is make sure AWS is set up correctly. I'm going to assume you've already set up an AWS account, requested a sufficient service quota, and configured your AWS CLI. There is a lot of documentation and learning resources out there for this part, so I won't spend more time here.

The Hydra Ray plugin documentation recommends setting up IAM roles for your head and worker nodes, as is done here. This part is crucial to make sure your head node can spin up more workers, and to make sure both the head and worker nodes can access S3.

In addition to using EC2 for compute resources, you can also use S3 to store output. I wrote a simple dataframe logger that writes directly to S3, as long as your AWS credentials are configured. At the end we can quickly pull all of our output to our local machine by running

aws s3 sync s3://my-bucket/path/to/remote/ path/to/project/
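The logger itself isn't reproduced here. As a rough sketch of the idea (not the actual implementation), the snippet below accumulates rows in memory and serializes them to CSV, handing the bytes to an upload callable. RecordLogger and make_s3_upload are hypothetical names, the sketch assumes every logged row has the same keys, and the boto3-backed uploader assumes boto3 is installed and your AWS credentials are configured.

```python
import csv
import io


class RecordLogger:
    """Accumulate rows in memory and serialize them to CSV on demand."""

    def __init__(self, upload):
        # `upload(key, body)` is any callable that stores bytes under a key.
        self.upload = upload
        self.rows = []

    def log(self, **row):
        # Assumes every row uses the same set of keys.
        self.rows.append(row)

    def flush(self, key):
        if not self.rows:
            return
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(self.rows[0]))
        writer.writeheader()
        writer.writerows(self.rows)
        self.upload(key, buffer.getvalue().encode("utf-8"))


def make_s3_upload(bucket):
    """Return an uploader backed by boto3 (imported lazily)."""
    import boto3

    client = boto3.client("s3")
    return lambda key, body: client.put_object(Bucket=bucket, Key=key, Body=body)
```

Keeping the uploader separate from the logger means you can pass in a dict's __setitem__ while testing locally and switch to make_s3_upload('my-bucket') on the cluster without touching the logging calls.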

3. Containerizing your code

For a simple program like main.py, containerization is obviously overkill. In practice ML research projects tend to have many dependencies, both at the system level (e.g. CUDA) and at the Python environment level (PyTorch, Pandas, Torchvision, etc.). There is also typically a dependency on source code that only exists in git repos. For many of us, retracing the steps we took to create a local compute environment is painful and time-consuming. Containerization adds some complexity upfront, but it ensures a stable environment for your code, free from the vagaries of package version changes and cluster system updates. To get started, you can try using the image I've created. Keep reading if you want to know how to make your own image.

You'll add a Dockerfile to the working directory that looks something like the one from my GitHub example,

# /path/to/project/Dockerfile
FROM ubuntu:18.04

# Ray wants these lines
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

RUN apt update && apt install software-properties-common -y
# python3.8-dev includes headers that are needed to install pickle5 later
RUN apt-get install python3.8-dev gcc -y
# install python 3.8 virtual environment
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt update && apt install python3.8-venv git -y
ENV VIRTUAL_ENV=/opt/venv
RUN python3.8 -m venv $VIRTUAL_ENV
ENV PATH=$VIRTUAL_ENV/bin:$PATH
RUN python -m pip install --upgrade pip setuptools

# install java, requirement to install hydra from source
RUN apt install default-jre -y

RUN mkdir src
COPY hydra-ray-demo/ src/hydra-ray-demo/
RUN python -m pip install -r src/hydra-ray-demo/requirements.txt
WORKDIR src/hydra-ray-demo

There are a couple of things to note about this Dockerfile. First, even though this is a container, we're still using a Python virtual environment (and it's not conda). The reason for keeping the virtual environment is that it allows a clean separation between system-level packages managed with apt and environment-level packages managed with pip. We aren't using conda because Docker and conda don't play very nicely together. For conda to work as intended, you need to source your .bashrc and .bash_profile before running anything. This happens automatically if you start an interactive bash session, or use /bin/bash --login -c. The default shell for Docker is /bin/sh, which does not source those files. You can always hack the PATH environment variable, but that won't necessarily replicate all the behavior of conda activate. Although I initially tried to use conda, I eventually wound up using a venv virtual environment instead. As a bonus, here's a gist of how to write a Dockerfile that installs packages in a conda virtual environment, if you're set on using it.

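To build your Docker image you need to be outside the project directory. The Dockerfile copies the whole project directory (hydra-ray-demo/ in the example) from the build context, so the build context has to be the parent directory. Once we've built the image, we'll push it to DockerHub to use later.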

cd /path/to/project
cd ..
docker build . -f /path/to/project/Dockerfile --tag <DOCKERHUB_USER>/<IMAGE_NAME>:<TAG>
docker push <DOCKERHUB_USER>/<IMAGE_NAME>:<TAG>

Note: Keep the contents of the parent directory to a minimum for fastest build times.

After using Docker for a while, your storage will quickly fill up with fragments of previous images. Use docker system prune -f to clean that up. Be aware that this also removes stopped containers and the build cache.

4. Configuring the Hydra Ray Plugin

Now we need to update our program config to accommodate the Hydra Ray launcher plugin. In order to maintain our ability to run our program locally, we'll make use of Hydra's composability and add another file, config/hydra/launcher/ray_aws.yaml:

# /path/to/project/config/hydra/launcher/ray_aws.yaml
_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
env_setup:
  pip_packages:
    omegaconf: null
    hydra_core: null
    ray: null
    cloudpickle: null
    pickle5: null
    hydra_ray_launcher: null
  commands: []
ray:
  cluster:
    cluster_name: demo-cluster
    min_workers: 0
    max_workers: 0
    initial_workers: 0
    autoscaling_mode: default
    target_utilization_fraction: 0.8
    idle_timeout_minutes: 5
    docker:
      image: 'samuelstanton/hydra-ray-demo:latest'
      container_name: 'hydra-container'
      pull_before_run: true
      run_options: []
    provider:
      type: aws
      region: us-east-2
      availability_zone: us-east-2a,us-east-2b
    auth:
      ssh_user: ubuntu
    head_node:
      InstanceType: m4.large
      ImageId: ami-010bc10395b6826fb
    worker_nodes:
      InstanceType: m4.large
      ImageId: ami-010bc10395b6826fb
stop_cluster: true

5. Profit.

At last we have arrived at our destination. If you recall, the command we used to run our program locally was python main.py duration=10. You can still run that command, and the program's behavior will be unchanged. To simultaneously execute several realizations of your program on EC2, simply use python main.py -m hydra/launcher=ray_aws duration=1,2,4,8, and you're off to the races!
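If it helps to see what a sweep like duration=1,2,4,8 turns into, the sketch below expands comma-separated overrides into one override list per job, taking the Cartesian product when several keys are swept. This is only an illustration of the idea, not Hydra's actual sweeper code, and expand_sweep is a hypothetical helper name.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand overrides like 'duration=1,2,4,8' into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # One job per element of the Cartesian product of all value lists.
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


jobs = expand_sweep(["hydra/launcher=ray_aws", "duration=1,2,4,8"])
for job in jobs:
    print(job)
```

An override without commas, like hydra/launcher=ray_aws, has a single choice, so the command above yields four jobs that differ only in duration.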

If you've stayed with me this long, thanks for reading! I hope you find this useful. I arrived at this procedure mostly through trial-and-error, so don't be afraid to try to improve on it!


Focus Groups and Academic Reviews

I've expressed the ideas that follow on Twitter, but for the sake of posterity I'll repeat my thoughts here.

As I prepare to respond to NeurIPS reviews, I've been reflecting on the analogy of conference proceedings to a marketplace, where ideas (products) are produced by research labs (competing firms) and exchanged for time/attention (currency). Each time your lab attempts to launch a new product, it first has to pass muster with a small focus group, namely your reviewers. The focus group evaluates your product rather subjectively, and in comparison to the substitutes available.

You don't need me to tell you that every year, there are more and more apparent substitutes for your product in the ML research marketplace. If you're a young researcher, launching your career is akin to entering a crowded, fiercely competitive attention price war. The goal of scientific publication (aside from personal incentives) may ultimately be collaborative, but the process is fundamentally competitive. Acceptance thresholds at conferences are driven by scarcity of attention, not scarcity of good ideas.

That's why I don't think introducing lower tiers of acceptance would change much. It might look nicer than "preprint" on your CV, but it wouldn't change the fact that it would fail to signal to employers and researchers that your work is particularly worthy of attention. I do think that there is much that could be changed about the conference publication process that could greatly improve both the participation experience and output. Eliminating toxic high-stakes deadlines and compensating reviewers are good examples.

I find it helpful to remember that conference publications are reserved almost exclusively for ideas that fare well w.r.t. the impact/attention ratio, and not the magnitude of the impact itself. There are two ways to make your ideas more competitive in the short term -- innovate more (very hard) and explain better (less hard). Conference feedback seems almost useless for the former, and somewhat helpful for the latter. In the long term you could try to keep outcompeting more established labs, or you could take a step back and consider where your competitive advantages lie. Behind every flashy conference paper there are critical weaknesses that the authors will do their best to disguise.

You might be more likely to hit upon a disruptive idea when asking yourself "In what frame is method X weak?", than "How can I make method X better in the conventional frame?" At any rate, I think it's worth a shot.

Hello, World!

New Year's resolutions are decidedly out of fashion. In spite of that, on a cold January day I promised myself that this would be the year I started writing. I suppose the necessary impetus has come from my experiences in grad school. The aspiring researcher has two fundamental objectives. First he must learn how to find and explore new ideas. Second he must learn to present his new-found knowledge in such a way that it can be consumed in less time than it took him to create it. Until he has mastered both these skills, he cannot fulfill the basic function of a scientist or thinker. Of course, 'thinker' is an obvious misnomer. Everyone thinks, and most people consider their thoughts rather good. To be a 'thinker' one must give form to thought, and thrust that form out into the world. And so here we are. Happy reading, friends.