Define the environment
Now that we have seen the basic building blocks of executing code in the cloud, the next step is to understand how to manage and run more complex projects. In particular, we need to package software dependencies - both our own code as well as any 3rd party libraries - for reliable cloud execution.
Dependency management in Python is known to be a headache (in particular affecting ML/AI projects that have extra requirements due to large packages and GPU drivers), but the good news is that Outerbounds streamlines the process while allowing you to choose an approach that works for your needs.
Specifically,
- Metaflow takes care of packaging your code automatically. You don't have to worry about packaging the project's Python files.
- Outerbounds makes it easier, faster, and more secure to work with 3rd party dependencies and Docker images. For details, see managing dependencies.
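For instance, recent versions of Metaflow let you declare 3rd party packages directly in the flow code with the @pypi decorator. The snippet below is only an illustrative sketch - the flow name and the pandas pin are placeholders, and the options supported on Outerbounds are covered in managing dependencies:

from metaflow import FlowSpec, step, pypi

class DependencyDemoFlow(FlowSpec):

    # @pypi attaches the declared packages to this step's execution
    # environment, both locally and in the cloud
    @pypi(packages={"pandas": "2.2.2"})
    @step
    def start(self):
        import pandas as pd
        print("pandas version:", pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DependencyDemoFlow()

Depending on your setup, you may need to enable the corresponding environment when running the flow, e.g. with --environment=pypi.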
Let's take a look at an example to see these features in action.
Example: PyTorch, optionally on GPUs
A litmus test for effective dependency management in Python is handling a large ML or AI framework like PyTorch. Beyond the package itself, using it with GPUs requires CUDA drivers, which adds to the overall complexity.
The example below, TorchTestFlow, implements a small benchmark - squaring a decent-sized tensor, an operation that benefits from accelerated hardware - to illustrate how Outerbounds manages dependencies and GPUs.
Save this flow in torchtest.py:
from metaflow import FlowSpec, step, current, Flow, resources, conda_base, card
from metaflow.cards import Markdown
from metaflow.profilers import gpu_profile
import time


class TorchTestFlow(FlowSpec):

    # ⚡ Enable these decorators to use a GPU ⚡
    # @resources(gpu=1)
    # @gpu_profile()
    @card(type="blank", refresh_interval=1, id="status")
    @step
    def start(self):
        t = self.create_tensor()
        self.run_squarings(t)
        self.next(self.end)

    def create_tensor(self, dim=5000):
        import torch  # pylint: disable=import-error

        print("Creating a random tensor")
        self.tensor = t = torch.rand((dim, dim))
        print("Tensor created! Shape", self.tensor.shape)
        print("Tensor is stored on", self.tensor.device)
        if torch.cuda.is_available():
            print("CUDA available! Moving tensor to GPU memory")
            t = self.tensor.to("cuda")
            print("Tensor is now stored on", t.device)
        else:
            print("CUDA not available")
        return t

    def run_squarings(self, tensor, seconds=60):
        import torch  # pylint: disable=import-error

        print("Starting benchmark")
        counter = Markdown("# Starting to square...")
        current.card["status"].append(counter)
        current.card["status"].refresh()
        count = 0
        s = time.time()
        while time.time() - s < seconds:
            for i in range(25):
                # square the tensor!
                torch.matmul(tensor, tensor)
            count += 25
            counter.update(f"# {count} squarings completed ")
            current.card["status"].refresh()
        elapsed = time.time() - s
        msg = f"⚡ {count/elapsed} squarings per second ⚡"
        current.card["status"].append(Markdown(f"# {msg}"))
        print(msg)

    @step
    def end(self):
        # show that we persisted the tensor artifact
        print("Tensor shape is still", self.tensor.shape)


if __name__ == "__main__":
    TorchTestFlow()
Importantly, the approach demonstrated here works with any PyTorch project (with or without GPUs), so you can start working with your own projects and models in no time.
Local development without hassle
If you have pytorch already installed in your environment, you can run the flow locally, like any Python project or notebook:
python torchtest.py run
No need to worry about Docker or any other special setup.
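If pytorch isn't installed locally yet, a standard pip install is usually enough for a CPU-only setup, for instance:
pip install torch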
More compute power, easily accessible
If you don't have all the packages installed already, or you want to run the flow with more compute power, you can run it in the cloud. You can do this using a prepackaged image from AWS that includes pytorch:
python torchtest.py run --with kubernetes:image=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
Note that the image is quite large (5-10GB), so it may take a few minutes to start the task the first time. Subsequent executions should be faster, as the image will be cached.
Run on a GPU
Tensor operations are massively faster on a GPU. Let's prove it!
Check the availability of GPU instances in your cluster as follows:
- Go to the Status view,
- click Pools,
- and see if any of the pools has the note Has Access to GPUs.
If you want, you can have GPUs added to your cluster by reaching out on the support Slack.
Once you have access to GPUs, enable these decorators for the start step:
@resources(gpu=1)
@gpu_profile()
The @resources decorator specifies that the task needs to run with a GPU. @gpu_profile is an optional decorator that shows a card visualizing GPU utilization in real time as the task is executing.
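After uncommenting them in torchtest.py, the top of the start step looks like this:

    @resources(gpu=1)
    @gpu_profile()
    @card(type="blank", refresh_interval=1, id="status")
    @step
    def start(self):
        t = self.create_tensor()
        self.run_squarings(t)
        self.next(self.end)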
With the decorators added, we can run the flow using the same command as above:
python torchtest.py run --with kubernetes:image=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
This clip shows the command in action:
- A GPU instance is brought up to execute the task. You can see auto-scaling events like this in the compute pools tab under the Status view.
- In the task view, you can observe GPU utilization thanks to @gpu_profile, as well as monitor the results of the benchmark in real time, thanks to a custom @card.
The process takes a few minutes the first time, as the cluster launches a new GPU instance and pulls the large image to it.
GPUs are fast! This GPU handles 74 squarings per second, whereas a modern MacBook can do about 8. You can test various environments yourself for fun.
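Since the end step shows that the tensor is persisted as an artifact, you can also inspect it after a run with the Metaflow Client API. Here is a minimal sketch - it assumes torch is installed in the environment where you run it, since torch is needed to deserialize the artifact:

from metaflow import Flow

# load the latest run of TorchTestFlow and read the persisted tensor artifact
run = Flow("TorchTestFlow").latest_run
print("Persisted tensor shape:", run.data.tensor.shape)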