Share Local Data with S3
Question
How do I load data from a local directory structure on AWS Batch using Metaflow's S3 client?
Solution
When using Metaflow's @batch decorator as a compute environment for a step, there are several options for accessing data. This page will show how to:
- Serialize data in a non-pickle format from a local step.
- Upload it to S3 using Metaflow's client.
- Read the data back in a downstream step that runs on AWS Batch or Kubernetes.
1. Acquire Data
The example will access this CSV file:
local_data.csv
1, 2, 3
4, 5, 6
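If you want to follow along, you can create this file yourself. One way to do it is the short snippet below, which writes the two rows shown above to the file name that the flow later uses by default:

# Write the example CSV to the current directory.
with open('local_data.csv', 'w') as f:
    f.write('1, 2, 3\n4, 5, 6\n')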
2. Configure metaflow.S3
When using Metaflow's @batch decorator you need to have an S3 bucket configured. When S3 is configured in ~/.metaflowconfig/config.json, artifacts defined like self.artifact_name are serialized and stored on S3. This means that in most cases you don't need to call Metaflow's S3 client directly. However, for a variety of reasons you may want to access arbitrary S3 bucket contents.
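If you do need to reach beyond the flow's own datastore location, the S3 client also accepts an explicit root path. A minimal sketch, assuming a hypothetical bucket my-bucket and prefix my-prefix that you have access to:

from metaflow import S3

# Hypothetical bucket and prefix, for illustration only.
with S3(s3root='s3://my-bucket/my-prefix') as s3:
    # Upload a string to s3://my-bucket/my-prefix/example_key.
    s3.put('example_key', 'hello from metaflow')
    # Read it back as text.
    obj = s3.get('example_key')
    print(obj.text)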
3. Run Flow
This flow shows how to:
- Read the contents of local_data.csv using IncludeFile.
- Serialize the contents of the file using the json module.
- Put the data on AWS S3 so it can be read back from a step running on AWS Batch.
local_data_on_batch_s3.py
from metaflow import (FlowSpec, step, IncludeFile,
                      batch, S3)
import json

class S3FileFlow(FlowSpec):

    # Include the local CSV file in the flow package.
    data = IncludeFile('data',
                       default='./local_data.csv')

    @step
    def start(self):
        # Serialize the file contents as JSON and upload
        # them to this run's S3 location.
        with S3(run=self) as s3:
            res = json.dumps({'data': self.data})
            url = s3.put('data', res)
        self.next(self.read_from_batch)

    @batch(cpu=1)
    @step
    def read_from_batch(self):
        # change `run=self` to any run
        with S3(run=self) as s3:
            data = s3.get('data').text
            print(f"File contents: {json.loads(data)}")
        self.next(self.end)

    @step
    def end(self):
        print('Finished reading the data!')

if __name__ == '__main__':
    S3FileFlow()
python local_data_on_batch_s3.py run
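As the comment in read_from_batch hints, the same object can also be fetched outside the flow by pointing the S3 client at a specific run. A minimal sketch, assuming the flow above has completed at least once:

from metaflow import Flow, S3

# Look up the most recent run of the flow above.
run = Flow('S3FileFlow').latest_run

# Point the S3 client at that run's datastore location
# and fetch the object stored under the key 'data'.
with S3(run=run) as s3:
    print(s3.get('data').text)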
Further Reading
- Loading and storing data with Metaflow