
Deployments Deep Dive

We've seen how to get up and running with a simple deployment. This document goes deeper into the available features when working with deployments, along with some best practices and tips.

Commands quick-start​

The CLI exposes three commands for working with deployments.

  • outerbounds app deploy [OPTIONS]: Create a new deployment, or modify an existing deployment with the same name.
  • outerbounds app list: List all the deployments on the platform.
  • outerbounds app delete: Delete a given deployment by its --name.

Set up the Deployment config

While it's entirely feasible to pass all your configuration options as CLI flags when calling the outerbounds app deploy command, we highly recommend using config files to configure your deployments, for the following reasons:

  • Clarity: All your deployment settings in one, easily readable place
  • Version control: Track configuration changes over time
  • Reusability: Easily replicate deployments across environments
  • Less error-prone: Avoid typos in long CLI commands
  • Explainability: It's easier to annotate fields with long, multi-line comments in a config file than in a bash command

πŸ’‘ For rapid prototyping, you can choose to override the fields defined in your config with the CLI. Look at CLI options for more details.

Using Environment Variables​

You may have some environment variables that your deployment depends on. For example, you may want to pass the S3 location of the model that you'd like to pull and serve. It makes sense to pass this as an environment variable.

To make sure that your deployment has all the required environment variables, use the environment top-level field in the config. The example below shows how we pass the two environment variables our deployment needs.

environment:
  DOWNLOAD_DIR: /tmp/models
  MODEL_NAME: llm

Using Secrets​

For any sensitive information that your deployment depends on, such as API keys, we recommend using Outerbounds resource integrations. Integrations give you an easy way to store secrets and use them safely in your Metaflow tasks or deployments.

To set up your secret, navigate to the "Integrations" tab in the Outerbounds UI and create your secret.

Once you've configured your secret, you can use it in your deployments:

secrets:
  - openai-api-key  # Should match the name of the integration you set up in the UI

After this, any keys defined inside your integration will be available as environment variables in your deployment. For example, if you set up a key called MY_API_KEY inside the openai-api-key integration, you can use it in your deployment as:

import os
api_key = os.environ.get("MY_API_KEY")

Packaging non-Python files

By default, we package all of the Python files on your local system so that they can run in the cloud. We also replicate the folder structure, so the relative path of each file remains exactly the same.

Just like Metaflow tasks, you can define an additional list of file suffixes to be included in your deployment.

package:
  suffixes:
    - .sql
    - .txt

Multi-Step Startups​

Sometimes you may have a set of bootstrap scripts that you want to run before starting your actual deployment. A good example is a model_downloader.py that downloads a model to a specified location, followed by an app that loads the model from that location and powers inference with it.

You can achieve this kind of setup by using the commands section.

commands:
  - "python model_downloader.py --model_name $MODEL_NAME"
  - "vllm serve $DOWNLOAD_DIR/$MODEL_NAME --dtype=half --task score"

API vs UI Access​

You may want to serve UI apps like Streamlit or TensorBoard (or any other UI app of your choice), or you may want to serve API endpoints such as Flask, FastAPI, or vLLM apps.

If you set up UI access, then anyone who has access to the Outerbounds UI will have access to your deployment. The deployment is guarded by the same auth that guards your Outerbounds UI.

If you set up API access, then the endpoint will be accessible to programmatic clients. You can access the endpoint by providing your Metaflow token as the x-api-key header.

Use the following block to control this setting:

auth:
  type: Browser  # UI access. Use 'API' for API access.
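As a rough illustration, an API-access deployment could be called like the sketch below. The endpoint URL is a placeholder, and reading the token from the METAFLOW_SERVICE_AUTH_KEY environment variable is an assumption; use whichever Metaflow token you normally authenticate with.

import os

import requests

# Placeholder endpoint; replace with your deployment's URL and route.
endpoint = "https://your-deployment.example.com/predict"
# Assumption: your Metaflow token is available in this environment variable.
token = os.environ["METAFLOW_SERVICE_AUTH_KEY"]

response = requests.post(
    endpoint,
    headers={"x-api-key": token},
    json={"prompt": "hello"},
)
print(response.status_code, response.text)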

Resource Management​

To make sure your deployments perform as expected, they need the right resources configured. Use the following block in your config to reserve resources for your deployment.

resources:
  cpu: "2"              # CPU cores
  memory: "8Gi"         # Memory (use Mi or Gi units)
  gpu: "1"              # Number of GPUs
  disk: "100Gi"         # Persistent storage
  shared_memory: "2Gi"  # Shared memory (useful for vLLM, Ray, etc.)

Scaling Workers​

Different deployments have different usage patterns and, hence, different requirements. Some deployments may have predictable traffic (whether high or low), while others may have variable traffic. Furthermore, your requirements may change depending on whether your deployment is meant for testing/prototyping use cases or for production use cases.

You can either have a fixed number of workers, or set up autoscaling based on requests.

Using a fixed number of workers

You can set up a fixed number of workers that never autoscale in the following way:

replicas:
  fixed: 3

This will ensure that you always have 3 workers available. If any of the workers encounter an error, they will automatically be replaced by new workers to maintain your configured worker count.

Here are some cases where you may want to have a fixed number of workers:

  • You have steady-state traffic that doesn't vary much.
  • Your SLAs are strict and cannot afford any delays when responding to requests.

You may have your own reasons to use a fixed number of replicas; the points above are just a few examples.

Using autoscaling of workers

You can also set up autoscaling based on the request rate per minute. To do this, use the following config:

replicas:
  min: 1      # Minimum workers
  max: 10     # Maximum workers
  scaling_policy:
    rpm: 100  # Scale up at 100 requests/minute per worker

In the example above, if you're seeing ~500 requests per minute, then you will automatically have 5 workers running. The workers will scale down once the request rate goes down.

You can also enable autoscaling to 0 by setting min: 0.
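For example, a scale-to-zero configuration (the max and rpm values here are just illustrative) could look like this:

replicas:
  min: 0      # Scale down to zero when there is no traffic
  max: 10
  scaling_policy:
    rpm: 100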

Here are some cases where autoscaling may be useful to you:

  • You have highly variable or unpredictable traffic patterns.
  • You want to keep costs down.
  • You don't have very strict latency requirements; scaling up workers can take some time, depending on the type of compute instances they're using.

Targeting compute pools​

There may be times where you want to make sure that your deployment runs on specific compute pools. This may be useful for a variety of reasons, like:

  • Cost tracking: If you have a separate compute pool carved out for your deployment, you can easily calculate the cost of running the deployment from the cost incurred by that compute pool.
  • Compute Isolation: For critical applications, you may want to isolate them from all other deployments/workstations/tasks so that they are not impacted by any other workload running on the platform.
  • Compute Requirements: Especially when using GPUs, not all instances are the same. You may want to target a particular class of GPUs for your deployment.

You can use the following config to make sure your deployment always goes to one or more specific compute pools.

compute_pools:
  - gpu-pool-1
  - gpu-pool-2

Important: For any compute pool to be able to run deployments, you need to make sure that the setting is enabled on the compute pool using the UI. Go to your Compute Pools page, select a compute pool (or create a new one), and make sure "Inference deployments" is checked under "Advanced Routing".

Authenticating for cloud access​

In general, your deployment may have dependencies in your cloud account that it needs to operate properly. For example, an app may need access to your S3 buckets or DynamoDB tables to serve a request.

A deployment automatically runs with the default task role of that perimeter. This means that by default, you have access to everything that a Metaflow task running in that perimeter would have access to.

However, if you want to override the default role used for your deployments, you can set the environment variable OBP_AWS_DEPLOYMENT_IDENTITY in your config to the role that you want to use. You need to make sure that the role you're using is properly tagged and assumable by the task role.
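For example, using the environment block shown earlier, this could look like the following (the role ARN is a placeholder):

environment:
  OBP_AWS_DEPLOYMENT_IDENTITY: arn:aws:iam::123456789012:role/my-deployment-role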

Dependency Management​

By default, if you have a requirements.txt at the root of your folder, we use it to bake a docker image for you that has all the packages specified in the requirements file.

You can also explicitly point your deployment to use a particular file for requirements.

dependencies:
  python: "3.11"  # Python version to use in your built docker container.
  from_requirements_file: requirements.txt
  # from_pyproject_toml: pyproject.toml

Just like Metaflow tasks, you can also define your dependencies purely by specifying the pypi/conda packages.

dependencies:
  python: "3.11"
  pypi:
    numpy: 1.23.0
    pandas: ''

or, using conda:

dependencies:
  python: "3.11"
  conda:
    numpy: 1.23.0
    pandas: ''

In each case, if you want to provide your own Docker image, you can do so:

image: python:3.10-slim

You can optionally declare that you want to use the provided image directly, without installing any packages on top of it.

image: python:3.10-slim
no_deps: true

Connecting to your PostgreSQL database​

As part of the Outerbounds platform, we provision a PostgreSQL DB inside your cloud account. While this mostly serves as the home for metadata about all your Metaflow runs, you can also easily use it as the storage layer for your deployment.

This can be particularly useful for use cases like hyperparameter optimization with Optuna, which uses a relational DB to record all experiment metadata.

To do so, simply set:

persistence: postgres

You can then connect to the DB by connecting to localhost:5432 inside your deployment and using your METAFLOW_SERVICE_AUTH_KEY as the DB password.
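As an illustration, a connection from inside the deployment could look like the sketch below. The user and database names are assumptions (only the host, port, and password source are documented above), as is the choice of psycopg2 as the client library.

import os

import psycopg2

# Host, port, and password come from the documentation above; user and dbname are
# assumptions -- substitute the values that apply to your database.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",    # assumed user
    dbname="postgres",  # assumed database name
    password=os.environ["METAFLOW_SERVICE_AUTH_KEY"],
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())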

Monitoring​

You can go to the Outerbounds UI and navigate to the Deployments tab to look at your deployment. Here you will find:

  • Logs of all your workers, which can be useful for debugging or general sanity checks.
  • Metrics on all your workers to understand resource tuning.
  • Metrics on your entire deployment (request rates, latencies) to understand performance.
  • Autoscaling charts to see how your deployment is scaling.
  • General health of your deployment and its workers.
  • Configuration attributes and update history of your deployment.

Example Deployments​

The outerbounds/inference-examples GitHub repository contains a set of tutorials to help you hit the ground running with deployments!