Define the environment
Now that we have seen the basic building blocks of executing code in the cloud, the next step is to understand how to manage and run more complex projects. In particular, we need to package software dependencies - both our own code as well as any 3rd party libraries - for reliable cloud execution.
Dependency management in Python is known to be a headache (in particular affecting ML/AI projects that have extra requirements due to large packages and GPU drivers), but the good news is that Outerbounds streamlines the process while allowing you to choose an approach that works for your needs.
Specifically,
- Metaflow takes care of packaging your code automatically. You don't have to worry about packaging the project's Python files.
- Outerbounds makes it easier, faster, and more secure to work with 3rd party dependencies and Docker images. For details, see managing dependencies.
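For instance, recent versions of Metaflow let you declare 3rd party packages directly in the flow code with the @pypi decorator. The snippet below is only an illustrative sketch - the flow name and the pandas pin are placeholders, and the options supported on Outerbounds are covered in managing dependencies:

from metaflow import FlowSpec, step, pypi

class DependencyDemoFlow(FlowSpec):

    # @pypi attaches the declared packages to this step's execution
    # environment, both locally and in the cloud
    @pypi(packages={"pandas": "2.2.2"})
    @step
    def start(self):
        import pandas as pd
        print("pandas version:", pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DependencyDemoFlow()

Depending on your setup, you may need to enable the corresponding environment when running the flow, e.g. with --environment=pypi.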
Let's take a look at an example to see these features in action.
Example: PyTorch, optionally on GPUs
A litmus test for effective dependency management in Python is handling a large ML or AI framework like PyTorch. Beyond the package itself, using it with GPUs requires CUDA drivers, which adds to the overall complexity.
The example below, TorchTestFlow, implements a small benchmark - squaring a decent-sized tensor, an operation that benefits from accelerated hardware - to illustrate how Outerbounds manages dependencies and GPUs.
Save this flow in torchtest.py:
from metaflow import FlowSpec, step, current, Flow, resources, conda_base, card
from metaflow.cards import Markdown
from metaflow.profilers import gpu_profile
import time


class TorchTestFlow(FlowSpec):

    # ⚡ Enable these decorators to use a GPU ⚡
    # @resources(gpu=1)
    # @gpu_profile()
    @card(type="blank", refresh_interval=1, id="status")
    @step
    def start(self):
        t = self.create_tensor()
        self.run_squarings(t)
        self.next(self.end)

    def create_tensor(self, dim=5000):
        import torch  # pylint: disable=import-error

        print("Creating a random tensor")
        self.tensor = t = torch.rand((dim, dim))
        print("Tensor created! Shape", self.tensor.shape)
        print("Tensor is stored on", self.tensor.device)
        if torch.cuda.is_available():
            print("CUDA available! Moving tensor to GPU memory")
            t = self.tensor.to("cuda")
            print("Tensor is now stored on", t.device)
        else:
            print("CUDA not available")
        return t

    def run_squarings(self, tensor, seconds=60):
        import torch  # pylint: disable=import-error

        print("Starting benchmark")
        counter = Markdown("# Starting to square...")
        current.card["status"].append(counter)
        current.card["status"].refresh()
        count = 0
        s = time.time()
        while time.time() - s < seconds:
            for i in range(25):
                # square the tensor!
                torch.matmul(tensor, tensor)
            count += 25
            counter.update(f"# {count} squarings completed ")
            current.card["status"].refresh()
        elapsed = time.time() - s
        msg = f"⚡ {count/elapsed} squarings per second ⚡"
        current.card["status"].append(Markdown(f"# {msg}"))
        print(msg)

    @step
    def end(self):
        # show that we persisted the tensor artifact
        print("Tensor shape is still", self.tensor.shape)


if __name__ == "__main__":
    TorchTestFlow()
Importantly, the approach demonstrated here works with any PyTorch project (with or without GPUs), so you can start working with your own projects and models in no time.
Local development without hassle
If you have pytorch already installed in your environment, you can run the flow locally, like any Python project or notebook:
python torchtest.py run
No need to worry about Docker or any other special setup.
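If pytorch isn't installed locally yet, a standard pip install is usually enough for a CPU-only setup, for instance:
pip install torch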
More compute power, easily accessible
If you don't have all the packages installed already, or you want to run the flow with more compute power, you can run it in the cloud. You can do this using a prepackaged image from AWS that includes pytorch:
python torchtest.py run --with kubernetes:image=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
Note that the image is quite large (5-10GB), so it may take a few minutes to start the task the first time. Subsequent executions should be faster, as the image will be cached.
Run on a GPU
Tensor operations are massively faster on a GPU. Let's prove it!
Check the availability of GPU instances in your cluster as follows:
- Go to the Status view,
- click Pools,
- and see if any of the pools has the note Has Access to GPUs.
If you want, you can have GPUs added to your cluster by reaching out on the support Slack.
Once you have access to GPUs, enable these decorators for the start step:
@resources(gpu=1)
@gpu_profile()
The @resources decorator specifies that the task needs to run with a GPU. @gpu_profile is an optional decorator that shows a card visualizing GPU utilization in real time as the task is executing.
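After uncommenting them in torchtest.py, the top of the start step looks like this:

    @resources(gpu=1)
    @gpu_profile()
    @card(type="blank", refresh_interval=1, id="status")
    @step
    def start(self):
        t = self.create_tensor()
        self.run_squarings(t)
        self.next(self.end)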
With the decorators added, we can run the flow using the same command as above:
python torchtest.py run --with kubernetes:image=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2
This clip shows the command in action:
- A GPU instance is brought up to execute the task. You can see auto-scaling events like this in the compute pools tab under the Status view.
- In the task view, you can observe GPU utilization thanks to @gpu_profile, as well as monitor the results of the benchmark in real time, thanks to a custom @card.
The process takes a few minutes the first time, as the cluster launches a new GPU instance and pulls the large image to it.
GPUs are fast! This GPU handles 74 squarings per second, whereas a modern MacBook can do about 8. You can test various environments yourself for fun.
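Since the end step shows that the tensor is persisted as an artifact, you can also inspect it after a run with the Metaflow Client API. Here is a minimal sketch - it assumes torch is installed in the environment where you run it, since torch is needed to deserialize the artifact:

from metaflow import Flow

# load the latest run of TorchTestFlow and read the persisted tensor artifact
run = Flow("TorchTestFlow").latest_run
print("Persisted tensor shape:", run.data.tensor.shape)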