Intermediate Computer Vision: Episode 3
Now that you have your data pipeline working locally, most likely in a notebook or a set of Python scripts, it is a good time to think about how you access cloud resources. How do you configure a development environment for your data pipelines when workflows involve big data, occasionally burst to the cloud, and need to version the results of many experiments?
Regardless of how you approach these challenges, you will need to move data back and forth between remote storage and the instances where you run compute tasks.
This episode will introduce you to the fundamentals of working with cloud data stores using Metaflow. You can follow along in this Jupyter notebook. We will focus on using AWS S3 to upload the files we just downloaded to an S3 bucket, but Metaflow works with other cloud providers as well. This puts the data in a place that is accessible to model training environments, whether they run on a laptop or on a remote GPU-powered instance.
1. How Metaflow Helps You Read and Write Data in the Cloud
You can use the AWS CLI or boto3 to communicate with AWS resources from Metaflow code, but using Metaflow's built-in S3 client has a few advantages.
First and foremost, it is fast. The S3 client is optimized for high throughput between S3 and AWS compute instances. This effect becomes especially powerful when reading and writing to S3 from a remote task running on AWS compute resources. Another principal benefit is the simplicity of the Metaflow client. There are a few intuitive APIs that interoperate seamlessly with your FlowSpec definitions. The functionality in metaflow.S3 includes:
- S3.get to access a single object by its S3 key.
- S3.get_many to access many objects in parallel, given a list of S3 keys.
- S3.put to put a single object at a user-specified S3 key.
- S3.put_files to upload a list of local files to corresponding S3 keys.

You can read details about these functions, and about more optimized S3 functionality, in the Metaflow API reference. A short sketch of these calls in action follows.
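To make these APIs concrete, here is a minimal sketch of the client in action. The bucket URI and the hello.txt key are placeholders for illustration; point s3root at a bucket you can actually read from and write to.

from metaflow import S3

# Placeholder URI for illustration; substitute a bucket you can write to.
DEMO_S3_URI = 's3://my-bucket/metaflow-s3-demo/'

with S3(s3root=DEMO_S3_URI) as s3:
    # Put a single in-memory object at a key under s3root.
    s3.put('hello.txt', 'hello from Metaflow')

    # Get a single object back; .text decodes it, .path is a local temporary file.
    obj = s3.get('hello.txt')
    print(obj.text)

    # Get many objects in parallel with a list of keys.
    objects = s3.get_many(['hello.txt'])
    print([o.key for o in objects])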
2. Move a Local Image Dataset to S3
When working with valuable data in an organization, such as a large image dataset you have curated, you will eventually want to store it in the cloud. In Episode 1, you saw how to download the dataset. Now, you will see how to push the data to an S3 bucket of your choosing. If you wish to run the code yourself, you will need to choose an S3 bucket that you can write to. You can read more about S3 policies and see examples here.
The following code snippet shows how you can upload the dataset (the two zip files downloaded in Episode 1) to S3. We use the put_files functionality from Metaflow's S3 client for this.
import os
from metaflow import S3

# Change this URI to that of an S3 bucket you want to write to.
S3_URI = 's3://outerbounds-tutorials/computer-vision/hand-gesture-recognition/'

# Relative, local paths that mirror the structure of the S3 bucket.
DATA_ROOT = 'data/'
images = os.path.join(DATA_ROOT, 'subsample.zip')
annotations = os.path.join(DATA_ROOT, 'subsample-annotations.zip')

# put_files takes a list of (key, local path) pairs; here the keys mirror the local paths.
with S3(s3root=S3_URI) as s3:
    s3.put_files([(images, images), (annotations, annotations)])
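If you want to confirm the upload landed where you expected, one optional sanity check (not part of the tutorial's own snippet, and assuming your credentials allow listing the bucket) is to list the keys under the same s3root:

# List everything under S3_URI and print each key with its size in bytes.
with S3(s3root=S3_URI) as s3:
    for obj in s3.list_recursive():
        print(obj.key, obj.size)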
3. Download an Image Dataset from S3
We can also use Metaflow's S3 client to download the data. The following code isn't necessary to run if you have been following along, since you already downloaded the data locally in the first episode of the tutorial.
The _download_from_s3 function is used in flows to move the data from S3 and then unzip it on the instance where model training is done. In the next episode, you will see how this function is used in context in the TrainHandGestureClassifier flow.
import zipfile
import os
from metaflow import S3

# S3_URI and DATA_ROOT are the constants defined in the upload snippet above.
# Inside a FlowSpec, you might store the URI on the flow and reference self.S3_URI instead.
def _download_from_s3(file):
    # Download the zip archive from S3 and extract it into a matching local directory.
    # Extraction happens inside the S3 context because result.path points to a temporary
    # file that Metaflow cleans up when the context exits.
    with S3(s3root=S3_URI) as s3:
        result = s3.get(file)
        with zipfile.ZipFile(result.path, 'r') as zip_ref:
            zip_ref.extractall(
                os.path.join(DATA_ROOT, file.split('.zip')[0])
            )

# EXAMPLE USES
# _download_from_s3('subsample.zip')
# _download_from_s3('subsample-annotations.zip')
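To give a sense of how this helper plugs into a flow, here is a minimal, hypothetical sketch; the DownloadGestureData class name is made up for illustration, and the real usage appears in the TrainHandGestureClassifier flow in the next episode. It assumes _download_from_s3 and the S3_URI/DATA_ROOT constants above live in the same module.

from metaflow import FlowSpec, step

class DownloadGestureData(FlowSpec):
    # Hypothetical flow for illustration only.

    @step
    def start(self):
        # Pull both archives from S3 and unzip them locally before any downstream work.
        _download_from_s3('subsample.zip')
        _download_from_s3('subsample-annotations.zip')
        self.next(self.end)

    @step
    def end(self):
        print('Data downloaded and extracted under', DATA_ROOT)

if __name__ == '__main__':
    DownloadGestureData()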
In the last two lessons, you saw how to use PyTorch DataLoaders and how Metaflow makes it easy to move data from your computer to cloud storage, and, eventually, to compute instances for tasks like data processing or model training. The ability to move data efficiently in these ways is fundamental when building a workflow for rapid prototyping. In the next lesson, we will shift focus to developing machine learning models in the cloud. Stay tuned for more on accessing GPUs, checkpointing model state, and more tips for setting up iterative, data-intensive model development workflows.