Store Artifacts across Metaflow Steps
Question
How can I use Metaflow to save and version data artifacts such as numpy arrays, pandas dataframes, or other Python objects? How can I access and update artifacts throughout the steps of a flow?
Solution
In this example you will see how to save any Python object that can be pickled as an artifact (called some_data in this example) by storing it in self. Later steps can then access and update the artifact through self to propagate changes.
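Since the only requirement stated above is that the object can be pickled, a quick check with the standard library can tell you whether an object is a candidate for an artifact. The following is a minimal sketch; is_picklable is a hypothetical helper for illustration, not part of Metaflow.

```python
import pickle


def is_picklable(obj):
    """Return True if `obj` can be serialized with pickle (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except Exception:
        # pickle raises different exception types depending on the object
        return False
    return True


print(is_picklable([1, 2, 3]))     # a plain list can be pickled
print(is_picklable(lambda x: x))   # a lambda cannot be pickled
```

Objects that fail this check, such as lambdas or open file handles, are better kept as local variables inside a step than assigned to self.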
1. Run Flow
This flow shows how to:
- Store a flow artifact.
- Update the artifact in a downstream step.
- Watch how the artifacts change during the flow.
from metaflow import FlowSpec, step

class ArtFlow(FlowSpec):

    @step
    def start(self):
        self.some_data = [1, 2, 3]  # define artifact state
        self.next(self.middle)

    @step
    def middle(self):
        print(f'the data artifact is: {self.some_data}')
        self.some_data = [1, 2, 4]  # update artifact state
        self.next(self.end)

    @step
    def end(self):
        print(f'the data artifact is: {self.some_data}')

if __name__ == '__main__':
    ArtFlow()
When you run the flow, each step sees the artifact value left by the previous step. This works whether you run your flows locally or remotely (for example with @batch).
python pass_artifacts_between_steps.py run --run-id-file artifacts-run.txt
2. Access Artifacts Outside of Flow
You can use the client API to access data artifacts after a run is complete. The examples below show some of the ways to do this.
You can reference Run(<FlowName>/<Run ID>) to access artifacts:
from metaflow import Run
# the ID of the previous run was saved in artifacts-run.txt
with open('artifacts-run.txt') as f:
    run_id = f.read().strip()  # strip any stray whitespace
some_data = Run(f'ArtFlow/{run_id}').data.some_data
print(some_data)
You can also get the artifact from the latest run as demonstrated below:
from metaflow import Flow
assert Flow('ArtFlow').latest_run.data.some_data == [1,2,4]
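To watch how the artifact changes during the flow, you can also read its value step by step rather than only from the finished run. The sketch below assumes that iterating over a run yields step objects exposing id and task.data, as the Metaflow client API does, and that every step carries the artifact. artifact_by_step is a hypothetical helper name, not a Metaflow function.

```python
def artifact_by_step(run, name):
    """Return {step name: artifact value} for every step in `run`.

    `run` is expected to behave like a metaflow.Run: iterating over it
    yields steps that expose `.id` and `.task.data`. This sketch assumes
    the artifact `name` exists in every step.
    """
    values = {}
    for step in run:
        values[step.id] = getattr(step.task.data, name)
    return values

# Usage with the flow above (requires metaflow and a completed run):
#   from metaflow import Flow
#   print(artifact_by_step(Flow('ArtFlow').latest_run, 'some_data'))
```

For ArtFlow, the start step should map to [1, 2, 3], while middle and end map to [1, 2, 4], because each step's artifacts are stored as they stand when that step finishes.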