Package Files for Remote Compute
Question
How do I package files so they are usable in remote compute steps?
Solution
Metaflow includes all .py files in your flow script's directory in the distribution. This means all .py file dependencies will be available in remote compute environments.
The rest of this page shows how to include files that do not end in .py using the --package-suffixes argument.
This page describes how to use the --package-suffixes command line argument. You can also use the METAFLOW_DEFAULT_PACKAGE_SUFFIXES configuration option.
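As a rough sketch of the configuration route, the snippet below builds an environment for a child process with METAFLOW_DEFAULT_PACKAGE_SUFFIXES set. It assumes the option can be supplied as an environment variable and takes a comma-separated list of suffixes, like the command line argument. The helper name env_with_suffixes is hypothetical, and keeping .py in the list is a cautious choice rather than a documented requirement.

```python
import os


def env_with_suffixes(suffixes, base_env=None):
    """Return a copy of the environment with the package suffixes option set.

    Assumption: Metaflow reads METAFLOW_DEFAULT_PACKAGE_SUFFIXES from the
    environment as a comma-separated string such as ".py,.sql".
    """
    env = dict(os.environ if base_env is None else base_env)
    env["METAFLOW_DEFAULT_PACKAGE_SUFFIXES"] = ",".join(suffixes)
    return env


# Start from an empty base environment so the result is easy to inspect.
env = env_with_suffixes([".py", ".sql"], base_env={})
print(env["METAFLOW_DEFAULT_PACKAGE_SUFFIXES"])
```

The returned dictionary could then be passed as the env argument of subprocess.run when launching the flow script.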
1. Define a Dependency in the Flow Directory
Suppose you have the following two .sql files in your flow script's directory and you want to access them on a remote compute instance.
query1.sql:
SELECT * FROM DB1;
query2.sql:
SELECT * FROM DB2;
This pattern is not unique to the .sql file extension.
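To illustrate what selecting files by suffix means, here is a minimal sketch that lists the files in a directory whose names end with one of a given set of suffixes. It is not Metaflow's packaging code. The function name files_matching_suffixes and the extra files flow.py and notes.txt are made up for the example. It may be handy for previewing locally which files a suffix list would match.

```python
import tempfile
from pathlib import Path


def files_matching_suffixes(root, suffixes):
    """List files under root whose names end with one of the suffixes."""
    wanted = tuple(suffixes)
    return sorted(
        str(p.relative_to(root))
        for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(wanted)
    )


# Build a throwaway directory that mimics a flow directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("flow.py", "query1.sql", "query2.sql", "notes.txt"):
        Path(tmp, name).write_text("")
    selected = files_matching_suffixes(tmp, [".py", ".sql"])
    print(selected)  # notes.txt is left out because .txt was not requested
```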
2. Define a Flow that Uses the Dependencies Remotely
This flow shows how to:
- Run the start step remotely using Kubernetes.
- Use the query1.sql and query2.sql files in the remote compute environment.
from metaflow import FlowSpec, step, kubernetes

class PackageSuffixesFlow(FlowSpec):

    query1_file = 'query1.sql'
    query2_file = 'query2.sql'

    def read_query(self, file):
        with open(file, 'r') as file_obj:
            return file_obj.read()

    @kubernetes
    @step
    def start(self):
        self.query1 = self.read_query(self.query1_file)
        self.query2 = self.read_query(self.query2_file)
        self.next(self.end)

    @step
    def end(self):
        print("Query 1:", self.query1)
        print("Query 2:", self.query2)

if __name__ == "__main__":
    PackageSuffixesFlow()
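The flow above opens the .sql files by bare file name, which relies on the working directory being the flow script's directory. One possible variation, sketched below, resolves the file relative to the script itself. It assumes the packaged files keep their position next to the flow script in the remote environment. The base_dir parameter is a hypothetical addition, used here only so the example can run against a temporary directory.

```python
import tempfile
from pathlib import Path


def read_query(file_name, base_dir=None):
    """Read a text file located next to this script, or in base_dir if given."""
    if base_dir is None:
        # Default: the directory that contains the current script.
        base_dir = Path(__file__).resolve().parent
    return (Path(base_dir) / file_name).read_text()


# Demonstrate with a temporary directory standing in for the flow directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "query1.sql").write_text("SELECT * FROM DB1;")
    query = read_query("query1.sql", base_dir=tmp)
    print(query)
```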
3. Package the Dependencies and Run the Flow
The key to this page is Metaflow's --package-suffixes argument.
To run PackageSuffixesFlow and package the local .sql files so that they are accessible in the start step, which runs in a Kubernetes pod, run the following command:
python package_suffixes_flow.py --package-suffixes='.sql' run
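If more than one extension is needed, the sketch below assembles the same command with several suffixes. It assumes --package-suffixes accepts a comma-separated list. The helper build_run_command and the extra .txt suffix are hypothetical, and the command is only built, not executed.

```python
def build_run_command(flow_file, suffixes):
    """Build the argument list for running a flow with extra package suffixes.

    Assumption: multiple suffixes are passed as one comma-separated value.
    """
    return [
        "python",
        flow_file,
        "--package-suffixes=" + ",".join(suffixes),
        "run",
    ]


cmd = build_run_command("package_suffixes_flow.py", [".sql", ".txt"])
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run to launch the flow from another script.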