Package Files for Remote Compute

Question

How do I package files so they are usable in remote compute steps?

Solution

By default, Metaflow includes all .py files in your flow script's directory in the code package it ships to remote compute environments, so the .py dependencies in that directory are available there without extra configuration.

The rest of this page shows how to include files that do not end in .py using the --package-suffixes argument.

Note: Instead of passing --package-suffixes on the command line, you can also set the METAFLOW_DEFAULT_PACKAGE_SUFFIXES configuration option.
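To make the idea of suffix-based packaging concrete, the sketch below walks a directory and keeps only the files whose names end in one of the given suffixes. It is an illustration only, not Metaflow's packaging code: `select_files` is a hypothetical helper, and the temporary directory simply mirrors the file layout used on this page.

```python
import os
import tempfile


def select_files(root, suffixes=(".py",)):
    """Return sorted relative paths under root whose names end in a given suffix."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.endswith(tuple(suffixes)):
                full_path = os.path.join(dirpath, name)
                selected.append(os.path.relpath(full_path, root))
    return sorted(selected)


# Build a throwaway directory with the same file names as this page.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("package_suffixes_flow.py", "query1.sql", "query2.sql"):
        with open(os.path.join(demo_dir, name), "w") as f:
            f.write("")
    default_selection = select_files(demo_dir)
    with_sql = select_files(demo_dir, suffixes=(".py", ".sql"))

print(default_selection)  # only the .py file
print(with_sql)           # the .py file plus both .sql files
```

With only the .py suffix, the .sql files are left out; adding .sql to the suffix list picks them up, which is the effect the rest of this page relies on.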

1. Define a Dependency in the Flow Directory

Suppose you have the following two .sql files in your flow script's directory and you want to access them on a remote compute instance.

query1.sql
SELECT * FROM DB1;
query2.sql
SELECT * FROM DB2;
Note: This pattern is not specific to .sql files; it works the same way for other file extensions.

2. Define a Flow that Uses the Dependencies Remotely

This flow shows how to:

  • Run the start step remotely using Kubernetes.
  • Use the query1.sql and query2.sql files in the remote compute environment.
package_suffixes_flow.py
from metaflow import FlowSpec, step, kubernetes

class PackageSuffixesFlow(FlowSpec):

    # Files in the flow directory that the remote step needs.
    query1_file = 'query1.sql'
    query2_file = 'query2.sql'

    def read_query(self, file):
        # The context manager closes the file even if reading fails.
        with open(file, 'r') as file_obj:
            return file_obj.read()

    @kubernetes
    @step
    def start(self):
        # Runs remotely; the .sql files must be part of the code package.
        self.query1 = self.read_query(self.query1_file)
        self.query2 = self.read_query(self.query2_file)
        self.next(self.end)

    @step
    def end(self):
        print("Query 1:", self.query1)
        print("Query 2:", self.query2)

if __name__ == "__main__":
    PackageSuffixesFlow()
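The flow above opens the .sql files relative to the current working directory, which works when the flow is launched from its own directory. As a minimal sketch of a more location-independent variant, the helper below resolves file names against a base directory that defaults to the script's own directory. The name `read_query_near_script` and the `base_dir` parameter are hypothetical, and the sketch assumes the dependencies sit next to the script, as in the layout on this page.

```python
from pathlib import Path
import tempfile


def read_query_near_script(name, base_dir=None):
    """Read a text file located in base_dir (default: this script's directory)."""
    if base_dir is None:
        # Assumption: dependencies live in the same directory as this script.
        base_dir = Path(__file__).resolve().parent
    # Path.read_text opens, reads, and closes the file in one call.
    return (Path(base_dir) / name).read_text()


# Demonstrate with a temporary directory standing in for the flow directory.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "query1.sql").write_text("SELECT * FROM DB1;\n")
    query = read_query_near_script("query1.sql", base_dir=demo_dir)

print(repr(query))
```

Inside a flow, the same logic could replace the body of read_query; whether that is worthwhile depends on how often the flow is launched from a different working directory.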

3. Package the Dependencies and Run the Flow

The key step is Metaflow's --package-suffixes argument, which is passed before the run command. To run PackageSuffixesFlow and copy the local .sql files so that they are accessible to the start step running in a Kubernetes pod, run the following command:

python package_suffixes_flow.py --package-suffixes='.sql' run
     Workflow starting (run-id 186474):
[186474/start/1009082 (pid 67155)] Task is starting.
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Task is starting (Pod is pending, Container is waiting - ContainerCreating)...
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Setting up task environment.
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Downloading code package...
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Code package downloaded.
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Task is starting.
[186474/start/1009082 (pid 67155)] [pod t-8dvdn-8f2mw] Task finished with exit code 0.
[186474/start/1009082 (pid 67155)] Task finished successfully.
[186474/end/1009083 (pid 67159)] Task is starting.
[186474/end/1009083 (pid 67159)] Query 1: SELECT * FROM DB1;
[186474/end/1009083 (pid 67159)]
[186474/end/1009083 (pid 67159)] Query 2: SELECT * FROM DB2;
[186474/end/1009083 (pid 67159)]
[186474/end/1009083 (pid 67159)] Task finished successfully.
Done!
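If a flow depends on more than one kind of file, several suffixes can be passed as a single comma-separated value. The sketch below only assembles the argument list for such a run; `build_run_command` is a hypothetical helper, and the .json suffix is an example that does not appear in the flow above.

```python
import sys


def build_run_command(flow_file, suffixes):
    """Build the argument list for running a flow with extra package suffixes."""
    # Several suffixes are passed as one comma-separated value.
    suffix_arg = "--package-suffixes=" + ",".join(suffixes)
    # The option comes before the `run` command, as in the command above.
    return [sys.executable, flow_file, suffix_arg, "run"]


command = build_run_command("package_suffixes_flow.py", [".sql", ".json"])
print(" ".join(command[1:]))

# Launching the command requires Metaflow and a configured Kubernetes backend:
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from launching it keeps the suffix handling easy to check without starting a remote run.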

Further Reading