Overview of cost optimization
The cloud has made compute a relatively inexpensive commodity, at least compared to the cost of human experts. Hence it tends to be a good investment to convert compute cycles into human productivity. For instance, consider an ML/AI developer who needs to train multiple models: allowing them to train the models in parallel, rather than sequentially, can significantly reduce overall training time.
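In Metaflow, this kind of parallelism is naturally expressed as a foreach fan-out. Here is a minimal sketch - the flow name and hyperparameter values are hypothetical - that launches one training task per hyperparameter:

```python
# A minimal sketch of parallel model training with a Metaflow foreach fan-out.
# The flow name and hyperparameter values are hypothetical.
from metaflow import FlowSpec, step

class ParallelTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Fan out: each value spawns an independent, parallel training task.
        self.alphas = [0.01, 0.1, 1.0]
        self.next(self.train, foreach="alphas")

    @step
    def train(self):
        self.alpha = self.input
        # ... train a model with hyperparameter self.alpha here ...
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect results from the parallel branches.
        self.trained = [inp.alpha for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ParallelTrainingFlow()
```

Each branch of the foreach runs as a separate task, so the three models train concurrently instead of back to back.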
Or, consider deploying a new training workflow to production. Instead of blindly overwriting the previous production version, it is beneficial to deploy the new workflow alongside the existing one, so its correctness can be verified. This doubles compute requirements temporarily, but the cost of incorrect results would be orders of magnitude higher.
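Metaflow's @project decorator supports this pattern directly: a candidate deployment can live on its own branch next to production. A sketch, assuming an Argo Workflows deployment target and a hypothetical project name:

```python
# A sketch of deploying a candidate workflow alongside production using
# Metaflow's @project decorator; the project name is hypothetical.
from metaflow import FlowSpec, project, step

@project(name="demand_forecast")
class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # ... training logic ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainingFlow()

# Deploy the candidate next to production, e.g. on Argo Workflows:
#   python training_flow.py --branch candidate argo-workflows create
# The branch gets its own isolated namespace, so its results can be
# verified before promoting the flow with --production.
```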
Outerbounds is designed to give you cost-efficient access to compute, so you choose approaches that optimize for business outcomes without having to worry about skyrocketing infrastructure costs. For more background, see a blog article, The 6 Steps to Cost-Optimized ML/AI/Data Workloads.
Core Concepts
Let's start with an overview of core concepts of Outerbounds:
The green boxes denote individual Metaflow tasks. The size of a task is determined by its resource requirements - memory, CPU, GPU, and disk - defined with the @resources decorator. You can see the currently running tasks under the Runs view. For the purposes of this discussion, a workstation can be considered a special kind of task that similarly consumes compute resources. You can see the running workstations in Setup Workstations.
The purple boxes denote cloud instances that are used to execute the tasks. The instances are launched and terminated automatically by Outerbounds, based on demand for compute resources. You can observe the recent demand and the number of instances in the Status view, which is meant for operational monitoring. To understand the cost impact of instances, i.e. what instances have been running and why, see Nodes Report.
Each instance belongs to a compute pool, representing a capacity pool of certain types of instances, possibly originating from a cloud other than your main cloud. You can see the compute pools and their status under the pool tab in the Status view.
Optionally, you can define securely isolated environments through perimeters, for instance to separate production and development environments. Notably, compute pools may be shared across perimeters. You can see the currently available perimeters in the Perimeters view.
Optimizing cost
Before worrying about cost, take a look at Historical Spend to check the actual costs incurred. You may find that the total cost has been low enough that it does not warrant further optimization.
A key observation about cloud costs is that you pay for every second an instance is running, not for every second it is doing useful work. Hence a key lever to optimize costs is to increase the total utilization of instances, minimizing the number of wasted instance-seconds.
There are two main ways to increase utilization:
Right-size resource requests to ensure that tasks don't reserve more resources than they consume.
Leverage shared compute pools to ensure that there is enough work to occupy live instances.
Right-sizing resource requests
Imagine a task that loads a dataframe in memory. You approximate that the dataframe needs at most 100GB of RAM, so you annotate the task with @resources(memory=100_000). Hence, to execute the task, we need an instance with at least 100GB of memory, e.g. an r5.4xlarge instance that has 128GB of RAM.
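In code, the annotation looks like the following sketch - the flow and data path are hypothetical; note that @resources expresses memory in megabytes:

```python
# A minimal sketch of requesting 100GB of RAM for a step; the data path
# is hypothetical. @resources(memory=...) is expressed in megabytes.
from metaflow import FlowSpec, resources, step

class DataFrameFlow(FlowSpec):

    @resources(memory=100_000)  # ~100GB of RAM reserved for this task
    @step
    def start(self):
        import pandas as pd
        # Load the large dataframe; this is where peak memory is consumed.
        df = pd.read_parquet("s3://example-bucket/large-dataset.parquet")
        self.num_rows = len(df)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DataFrameFlow()
```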
In practice, the task may only occupy, say, 70GB of RAM, leading to a situation like this:
While the task executes, at least 30GB of RAM is underutilized, and possibly up to 58GB if no other tasks can fit on the same instance simultaneously. Worse, the effect is often multiplied over many instances, leading to significant underutilization of resources:
Resource underutilization is common in distributed computing systems like AWS EMR, Databricks, and others. This inefficiency is often difficult to detect and frequently goes unnoticed, resulting in unnecessarily high compute costs that are not accurately attributed to inefficient tasks.
Outerbounds has a view specifically designed to address this question and help you right-size resource requests. Open Flow Reporting to see exactly how resources have been utilized historically by each task executing on the platform. For more details on how to use the view, see Using cost reports.
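As a quick complement to the reports, you can also measure a task's peak memory from inside the step itself, for instance with Python's standard resource module - a rough, Linux-only sketch, not a platform feature:

```python
# A rough, Linux-only way to print a task's peak memory from within a step,
# useful for sanity-checking @resources requests. Standard library only.
import resource

def print_peak_memory():
    # ru_maxrss is reported in kilobytes on Linux.
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
    print(f"Peak RSS so far: {peak_gb:.1f} GB")
```

Comparing this figure against the requested memory shows how much headroom you could trim from the request.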
Optimizing instance types
In the above scenario, another potential opportunity for cost optimization is to use smaller instance types. Outerbounds efficiently bin-packs tasks on instances, enabling a larger instance to execute multiple smaller tasks simultaneously. Therefore, smaller instance types don't always lead to higher utilization, as more instances may be required to handle the load.
Usually, the best approach is to start with instances that are large enough to handle all workloads. The system collects instrumentation about the total utilization rate over time, which you can observe in the Nodes view. Over time, if there is a need for cost optimization and certain instance types seem suboptimal, it may be beneficial to swap them for other types.
Unless you are working with specialized and expensive instance types, such as large GPU instances, it is advisable to run actual workloads on the platform before optimizing the mix of instance types. Optimizing instance types is more effective when based on real utilization data.
Leveraging spot instances, reservations, discounts, and multiple clouds
Outerbounds works with any instance types available in your cloud account. You can utilize spot instances, instance reservations, negotiated discounts, and credits to further lower your compute costs. These resources are typically configured as a specific compute pool in your cluster.
In addition, Outerbounds makes it easy to bring in compute pools from clouds other than your main cloud - say, resources from GCP when you are mainly using AWS. This allows you to leverage credits, discounts, and other incentives across clouds, further lowering the total cost of compute.
Contact your support Slack to configure compute pools using spot instances and reservations, and to learn about available incentives to move compute between clouds.
Leveraging shared compute pools
There is another source of underutilization affecting systems with multiple compute pools and perimeters. Imagine a typical scenario where the system is set up with two perimeters, a Production environment and a Development environment. Both perimeters have their own dedicated compute pools, prod-aws-main and dev-aws-main respectively:
In this scenario, the production pool faces heavy demand, causing tasks to queue up due to insufficient capacity in the compute pool to handle all tasks simultaneously. Meanwhile, the development pool has an instance idling and another one underutilized.
In this case, separate compute pools might be beneficial to ensure that development workloads never consume resources needed for production. However, the strict boundary between the two results in suboptimal resource allocation and usage.
Alternatively, one compute pool could be shared between the two perimeters:
In this case, resources can be allocated on the fly to the perimeter that needs them the most, leading to higher throughput, higher utilization, and hence a lower total cost.
Contact your support Slack to set up perimeters and compute pools for them.