Project structure

After setting up an empty project, you can begin adding your own components. Fundamentally, all projects are composed of these top-level components:

Flows

Flows refer to Metaflow flows, often interconnected through events. They form the backbone of your projects, handling data processing and ETL, model training and fine-tuning, autonomous or batch inference, among other types of background processing and high-performance computing.

In projects, flows are stored under a flows subdirectory, with one Metaflow flow (named flow.py) per subdirectory, alongside any supporting Python modules and packages. As a best practice, add a README.md file for each flow describing its role; these files are surfaced in the UI as well.

Authoring ProjectFlows

Importantly, project flows should subclass ProjectFlow instead of Metaflow's standard FlowSpec. In other words, author your flows like this:

from obproject import ProjectFlow

class MyFlow(ProjectFlow):
    ...

This uses Metaflow's BaseFlow pattern to enrich flows with functionality related to the project structure. Beyond this small detail, you may leverage all Metaflow features in your flows.
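For instance, a minimal runnable flow could look like the sketch below, using Metaflow's standard @step decorator (all Metaflow features remain available):

from metaflow import step
from obproject import ProjectFlow

class HelloProjectFlow(ProjectFlow):

    @step
    def start(self):
        print("Hello from a project flow")
        self.next(self.end)

    @step
    def end(self):
        pass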

A typical flow hierarchy in a project repository ends up looking like this:

flows/
    etl/
        flow.py
        README.md
        feature_transformations.py
        sql/process_data.sql
    train_model/
        flow.py
        README.md
        model.py

Deployments

Deployments are microservices that serve requests through real-time APIs. Use cases include

  • Model hosting and inference, including GenAI models running on fleets of GPUs.
  • UIs and dashboards, such as TensorBoard, Streamlit apps, or other internal UIs.
  • Real-time agents that respond to incoming requests and take action based on LLM outputs.

The platform’s strength comes from the tight connection between flows and deployments, bridging the offline and online worlds. For instance,

  • A flow can continuously update a database for RAG, which is then used in real time by a deployed agent.
  • Or, you can have a custom app for monitoring model performance that you use to trigger a model retraining flow (see the sketch after this list).
  • It is also possible to deploy model endpoints programmatically from a flow, for instance, whenever a new model has been trained.
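For example, the retraining pattern above could be implemented by deploying a flow that subscribes to an event published by the monitoring app, using Metaflow's @trigger decorator. The event name and training logic below are illustrative assumptions:

from metaflow import step, trigger
from obproject import ProjectFlow

@trigger(event="retrain_fraud_model")  # hypothetical event published by a monitoring app
class RetrainFlow(ProjectFlow):

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        # Model training logic would go here.
        self.next(self.end)

    @step
    def end(self):
        pass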

In your project, place deployments in the deployments directory. Each deployment is defined by a configuration file, config.yml, as described in the deployments documentation. You can define dependencies for the deployment in a standard requirements.txt or pyproject.toml. As with flows, it is recommended to add a README.md for each deployment.

The project hierarchy will look like this:

deployments/
    monitoring_dashboard/
        streamlit_app.py
        config.yml
        pyproject.toml
        README.md
    model_endpoint/
        fastapi_server.py
        config.yml
        pyproject.toml
        README.md
    support_agent/
        agent.py
        config.yml
        pyproject.toml
        README.md
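To make this concrete, a minimal fastapi_server.py for the model_endpoint deployment could look like the following sketch; the endpoint path, request schema, and placeholder scoring logic are illustrative assumptions, not a prescribed interface:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictionRequest):
    # Placeholder scoring; a real deployment would load and apply the trained model here.
    return {"score": sum(req.features)}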

Code

Effective management of software dependencies is essential for building production-quality projects and enabling rapid iteration and collaboration.

A typical project consists of multiple layers of software dependencies:

  1. Code defining flows and deployments.
  2. Project-level shared libraries.
  3. Organization-level libraries shared across projects.
  4. Third-party dependencies, such as pandas and torch.

As an example, consider the following project that trains a fraud detection model and deploys it for real-time inference:

fraud_detection_model/
    obproject.toml
    pyproject.toml
    README.md
    src/
        feature_encoders/
            __init__.py
            feature_encoder.py
    flows/
        trainer/
            flow.py
            mymodel.py
            README.md
    deployments/
        inference/
            fastapi_server.py
            config.yml
            README.md

Code defining flows and deployments is organized into subdirectories. In addition to the entrypoint file (flow.py) or deployment server, each flow or deployment can include supporting modules and packages, such as mymodel.py.

Project-level shared libraries should be placed under the src directory. Here, we define a package feature_encoders which is used both during training and inference to ensure offline-online consistency of features. Importantly, you should include a line

METAFLOW_PACKAGE_POLICY = 'include'

in the __init__.py module of each package to ensure that the package gets included in Metaflow's code package. Packages under src/ are added to PYTHONPATH automatically, so they are readily usable in flows and deployments.
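Putting this together, the feature_encoders package from the example above could be declared and consumed roughly as follows; the FeatureEncoder class name is an illustrative assumption:

# src/feature_encoders/__init__.py
METAFLOW_PACKAGE_POLICY = 'include'  # ensure this package ships with Metaflow's code package

from .feature_encoder import FeatureEncoder  # hypothetical encoder class defined in feature_encoder.py

# flows/trainer/flow.py (excerpt)
from feature_encoders import FeatureEncoder  # importable since src/ is on PYTHONPATH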

Organization-level shared libraries can be handled in two ways:

  1. If you can set METAFLOW_PACKAGE_POLICY in the packages, you may simply pip install them as usual or add them to your PYTHONPATH. Once you import them in your flows and deployments, they'll get packaged automatically. This is a convenient option for private packages, even if they are not pip install-able from a package repository.

  2. If the shared libraries are pushed to a package repository - private or public - you can treat them similarly to third-party dependencies, described below.

Third-party dependencies can be handled through Metaflow's @pypi or @conda decorators, or through the standard pyproject.toml or requirements.txt files. When the project is deployed, Outerbounds uses Fast Bakery to bake the requirements into a container image automatically.
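For example, a step-level @pypi declaration could look like the sketch below; the package version is illustrative:

from metaflow import step, pypi
from obproject import ProjectFlow

class DataFlow(ProjectFlow):

    @pypi(packages={"pandas": "2.2.2"})
    @step
    def start(self):
        import pandas as pd  # available inside the step's isolated environment
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass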

For convenience, you may drop a project-wide pyproject.toml at the root of the project next to obproject.toml. For instance, to include pandas and fastapi in the project, you can specify a pyproject.toml as follows:

[project]
dependencies = [
    "pandas==2.2.2",
    "fastapi==0.116.1"
]

This file will be used universally in all flows (through @pypi_base) and deployments without you having to specify anything else manually. This is handy if you want to ensure that all components of the project use exactly the same set of dependencies.

Assets

What sets ML/AI projects apart from traditional software engineering is that they rely not only on code, but also on data and models.

A key difference between code and data/models is that in real-world systems, data and models are often updated continuously and automatically - new data streams in, and models are retrained constantly, whereas updating code is a much more manual process (even when the code is authored with AI co-pilots).

Metaflow artifacts are a core building block for managing data and models. Outerbounds Projects extends this concept with data assets and model assets, which complement artifacts by adding an extra layer of metadata and tracking.

Think of assets as a superset of artifacts: they let you elevate select data and models to a special status, making them easy to track in the UI. In practice, this gives you a model registry and data lineage tracking, seamlessly integrated with your projects. Read more about assets on a dedicated page.


Next, let's take a look at an example project that shows how all these pieces fit together.