# Running steps across clouds
With Outerbounds, you can choose to execute steps of a flow across multiple cloud providers like AWS, Azure, and GCP. Multi-cloud compute is a powerful feature that helps you

- overcome constraints related to resource availability and services offered,
- gain access to specialized compute offerings, such as Trainium on AWS or TPUs on GCP,
- optimize cost by making it easy to move compute to the most cost-efficient environment,
- respect data locality by moving compute to the data.
For example, if your primary compute cluster is hosted on AWS and you would like to execute parts of your flow in a compute pool on Azure, just add `node_selector="outerbounds.co/provider=azure"` to the `@kubernetes` decorator of the step that should be executed on Azure.
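For instance, here is a minimal sketch of a flow with one step pinned to an Azure pool (the flow name `AzureStepFlow` and the prints are illustrative; the pool label matches the example below, but your deployment's labels may differ):

```python
from metaflow import FlowSpec, kubernetes, step

class AzureStepFlow(FlowSpec):

    # node_selector pins this step to nodes labeled outerbounds.co/provider=azure.
    @kubernetes(node_selector="outerbounds.co/provider=azure")
    @step
    def start(self):
        print("running on Azure")
        self.next(self.end)

    # Steps without a node_selector run in the default compute pool.
    @step
    def end(self):
        print("done")

if __name__ == '__main__':
    AzureStepFlow()
```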
To set up new compute pools across clouds, reach out to us on your support Slack channel.
## Example: Scaling out to Azure
Save this flow in `crosscloudflow.py`:
```python
from metaflow import FlowSpec, step, resources, kubernetes
import urllib.request


class CrossCloudFlow(FlowSpec):

    @kubernetes
    @step
    def start(self):
        # Retrieve a dataset in the primary cloud.
        req = urllib.request.Request(
            'https://raw.githubusercontent.com/dominictarr/random-name/master/first-names.txt'
        )
        with urllib.request.urlopen(req) as response:
            data = response.read().decode('utf-8')
        # Fan out over the first ten names in the dataset.
        self.titles = data.splitlines()[:10]
        self.next(self.process, foreach='titles')

    @resources(cpu=1, memory=512)
    @kubernetes(node_selector="outerbounds.co/provider=azure")
    @step
    def process(self):
        # Scale out processing to an Azure-based compute pool.
        self.title = '%s processed' % self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the results back in the primary cloud.
        self.results = [input.title for input in inputs]
        self.next(self.end)

    @step
    def end(self):
        print('\n'.join(self.results))


if __name__ == '__main__':
    CrossCloudFlow()
```
Here, `node_selector` is used to target an Azure-based compute pool. The flow illustrates a common pattern in cross-cloud processing:

- First, we retrieve a dataset in the primary cloud (the `start` step).
- Processing of the dataset is scaled out to another cloud (the `process` step).
- Results are retrieved back to the primary cloud (the `join` step).
Run the flow as usual:

```bash
python crosscloudflow.py run --with kubernetes
```
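Once the run completes, you can sanity-check the joined results with the Metaflow Client API, e.g. from a notebook or a Python shell (a quick sketch; it assumes the run above finished successfully):

```python
from metaflow import Flow

# Grab the latest successful run of the flow and print the artifacts
# produced by the join step.
run = Flow('CrossCloudFlow').latest_successful_run
print(run.data.results)
```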
Open Status to observe the load across compute pools in real time.