Whether to Use a Flow's self Keyword
Question
How do I decide whether to store data using a flow's self keyword?
Solution
This page discusses two considerations to help you answer this question when writing Metaflow flows. The first is whether the object you want to assign to self.variable_name can be serialized with pickle, and the second is what type of data it is.
1 Why Assign Data to the self Keyword?
In Metaflow, data can be assigned to variables with the flow object's self keyword, like self.variable_name.
This makes the contents of self.variable_name accessible in downstream steps or outside of the flow's runtime environment.
Storing data with the self keyword in this way is referred to as storing flow artifacts.
2 The self Keyword and Serialization
When you use the self keyword, Metaflow uses Python's built-in pickle module to serialize artifacts. This allows Metaflow to move artifacts so that they are accessible in any downstream compute environment where your tasks run. Some objects from popular machine learning libraries are incompatible with pickle. In such cases, the library typically provides its own serialization mechanism that you can use instead. For example, XGBoost uses a dataset object called the DMatrix that cannot be serialized with pickle.
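The pickle constraint described above can be sketched with the standard library alone. This is a minimal illustration, not Metaflow's or XGBoost's actual code: NativeDataset is a hypothetical stand-in for a library object that pickle cannot handle, and its to_bytes and from_bytes methods are assumed names that play the role of a library-provided serialization mechanism.

```python
import json
import pickle
import threading


class NativeDataset:
    """Hypothetical stand-in for a library object that pickle cannot handle.

    It holds a lock only to make pickling fail, much as objects wrapping
    native handles often do. It is not part of XGBoost or Metaflow.
    """

    def __init__(self, rows):
        self.rows = rows
        self._handle = threading.Lock()  # unpicklable member

    def to_bytes(self):
        # Stand-in for the library's own serialization: keep only plain data.
        return json.dumps(self.rows).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload):
        return cls(json.loads(payload.decode("utf-8")))


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


dataset = NativeDataset([[1.0, 2.0], [3.0, 4.0]])
print(can_pickle({"learning_rate": 0.1}))  # True: a plain dict
print(can_pickle(dataset))                 # False: the lock cannot be pickled

# Serialize with the object's own mechanism; the resulting bytes are picklable.
payload = dataset.to_bytes()
restored = NativeDataset.from_bytes(payload)
```

In a flow, the same idea would mean assigning the serialized payload (or the name of a file the library wrote) to self rather than the object itself, and rebuilding the object in the downstream step.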
3 What Type of Data to Assign to self
Generally, there are three types of data that flows will read, create, and write.
- Input data
- Flow internal state
- Output data
The self keyword is meant for tracking flow internal state in objects that can be pickled.
These artifacts are intended to track the state of variables that change throughout the flow lifecycle.
In a machine learning context, examples of data you might consider a flow artifact include:
- The distribution of a dataset's features.
- Hyperparameters and corresponding performance metric values.
- A URL to a new dataset version that was created during the flow.
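As a rough sketch of why the examples above suit flow artifacts, the snippet below builds all three kinds of object and compares their pickled size with that of the raw data they describe. Everything here is an assumption for illustration: the feature column is randomly generated, the metric value is a placeholder rather than a real result, and the URL is hypothetical.

```python
import pickle
import random
import statistics

random.seed(0)

# Hypothetical feature column standing in for a real input dataset.
feature = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# A small summary of the feature's distribution.
feature_summary = {
    "mean": statistics.fmean(feature),
    "stdev": statistics.stdev(feature),
    "min": min(feature),
    "max": max(feature),
}

# Hyperparameters plus a metric value. The numbers are placeholders, not results.
trial = {"params": {"max_depth": 4, "learning_rate": 0.1}, "accuracy": 0.0}

# Pointer to a dataset version created during the flow (hypothetical URL).
dataset_url = "s3://example-bucket/datasets/v2/"

raw_size = len(pickle.dumps(feature))
artifact_size = len(pickle.dumps((feature_summary, trial, dataset_url)))
print(f"raw feature column: {raw_size} bytes pickled")
print(f"candidate artifacts: {artifact_size} bytes pickled")
```

Inside a step, these would be assigned as, for example, self.feature_summary, self.trial, and self.dataset_url, while the raw feature column stays in its original storage.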
How do I?
Pass Artifacts through a Join Step
Save and Version State of Artifacts
4 What Type of Data Not to Assign to self
Of the three kinds of data listed above, you typically will not want to use self for input and output data.
Input datasets are typically already stored in a data warehouse, so they don't need to be stored by Metaflow again. They are often large, and it can be costly to duplicate storage by copying them into your Metaflow data store. Examples of input datasets include raw data and features for model training.
Similarly, output datasets are meant to be consumed by systems outside Metaflow, so it is better to store them in another database or at a known location. This location might be an S3 bucket or a similar solution that makes sense for the downstream data access pattern. Examples of output datasets include transformed versions of raw datasets.
Instead of using self for these large datasets, you can efficiently load these kinds of data using Metaflow's built-in cloud data integrations.
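One way to picture this separation is to write the dataset to an external location and keep only a reference to it as flow state. The sketch below uses just the standard library, with a temporary directory standing in for an S3 bucket or warehouse table. The rows, file name, and variable names are all hypothetical, and it does not use Metaflow's cloud data integrations.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Hypothetical output rows produced by a flow.
rows = [{"id": i, "score": i * 0.5} for i in range(1000)]

# A temporary directory stands in for an S3 bucket or another known location.
output_path = Path(tempfile.mkdtemp()) / "scores.csv"
with output_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Only this short string would be assigned to self, for example
# self.output_location = output_location
output_location = str(output_path)

# A downstream consumer reads the data from the location, not from the artifact.
with open(output_location, newline="") as f:
    loaded = list(csv.DictReader(f))

print(len(pickle.dumps(output_location)), "bytes pickled for the reference")
print(len(pickle.dumps(rows)), "bytes pickled for the full dataset")
```

The reference stays small no matter how large the dataset grows, which is what keeps the Metaflow data store from duplicating it.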
How do I?
Load CSV Data in Metaflow Steps
Chunk a DataFrame using Foreach