Intermediate Computer Vision: Episode 6
This episode will focus on tracking model training results with TensorBoard.
When building a machine learning system, you need to track results in order to make decisions that improve your models. It isn't always clear, though, where to store results so that they stay organized and accessible to the people who need to see them. In this episode, you will see how to use the built-in versioning of the Metaflow datastore to organize TensorBoard logs by the Metaflow run that produced them.
1. Controlling TensorBoard Logs
With TensorBoard, you can control where to log results using the log_dir parameter.
You may want to do this in cases like the train step of the TrainHandGestureClassifier, where we are writing TensorBoard logs from an ephemeral compute instance.
The goal is to write these logs to a persistent location that can be read from any computer with access to the S3 bucket.
The approach taken here is to use the existing Metaflow datastore, and its built-in versioning capabilities, to organize TensorBoard logs produced in Metaflow runs.
2. Store TensorBoard Results in S3
In our case, we are using the TensorBoard and PyTorch integration. We can set the log_dir location like this:
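import os
from torch.utils.tensorboard import SummaryWriter

# Logs for this experiment go under <tensorboard_s3_prefix>/<experiment_path>/logs.
log_dir = os.path.join(tensorboard_s3_prefix, experiment_path, "logs")
writer = SummaryWriter(log_dir=log_dir)
...
writer.add_scalar("loss/train", loss_value, step)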
If an S3 prefix is used for the log_dir argument of SummaryWriter, then TensorBoard will write its logs to that S3 location.
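As a minimal sketch of the logging pattern above, the helpers below create a writer and log one scalar per step. The function names make_writer and log_losses are hypothetical and do not come from the tutorial code. The torch import is deferred into make_writer so that log_losses can be exercised with any object that exposes add_scalar.

```python
from typing import Iterable


def make_writer(log_dir: str):
    """Create a SummaryWriter for log_dir (a local path or an S3 prefix)."""
    # Imported lazily so this module still loads where PyTorch is not installed.
    from torch.utils.tensorboard import SummaryWriter

    return SummaryWriter(log_dir=log_dir)


def log_losses(writer, losses: Iterable[float], tag: str = "loss/train") -> int:
    """Log each loss under `tag`, using its position as the step.

    `writer` is any object with an add_scalar(tag, value, step) method.
    Returns the number of values logged.
    """
    count = 0
    for step, loss_value in enumerate(losses):
        writer.add_scalar(tag, float(loss_value), step)
        count += 1
    return count
```

In a training loop you would typically call writer.add_scalar once per step as losses are produced, and call writer.close() at the end so pending events are flushed.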
We can use the Metaflow config to determine where we want to write the results.
For example, you will see the following logic to set the TensorBoard log storage location in the TrainHandGestureClassifier code:
datastore = metaflow_config.METAFLOW_CONFIG['METAFLOW_DATASTORE_SYSROOT_S3']
self.experiment_storage_prefix = os.path.join(datastore, current.flow_name, current.run_id)
The train step will then write TensorBoard logs to <experiment_storage_prefix>/experiments/logs.
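The path layout described above can be sketched as two small pure functions. This is an illustration rather than the flow's actual code: the names experiment_prefix and tensorboard_log_dir are hypothetical, and posixpath.join is used here instead of os.path.join on the assumption that you want forward slashes in the S3 URI regardless of the operating system.

```python
import posixpath


def experiment_prefix(datastore_root: str, flow_name: str, run_id: str) -> str:
    """Build <datastore_root>/<flow_name>/<run_id>, one prefix per Metaflow run."""
    # rstrip avoids a doubled slash when the configured root ends with "/".
    return posixpath.join(datastore_root.rstrip("/"), flow_name, str(run_id))


def tensorboard_log_dir(prefix: str) -> str:
    """Build the log location <prefix>/experiments/logs used by the train step."""
    return posixpath.join(prefix, "experiments", "logs")
```

Because the run ID is part of the prefix, each run writes to its own location and logs from different runs do not overwrite each other.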
3. View TensorBoard Results in S3
After running the TrainHandGestureClassifier flow, you will see a URI printed with the location where TensorBoard logs are stored.
You can run the following with your path:
tensorboard --logdir=<tensorboard_s3_prefix>/experiments
You can run this from the command line on your computer, provided you have access to the S3 bucket, which lives in the AWS account that hosts your Metaflow deployment.
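If you would rather look up the location programmatically than copy the printed URI, a sketch along these lines may help. It assumes that experiment_storage_prefix is stored as an artifact on the run and is still available at the end step. The helper names latest_experiment_prefix and tensorboard_args are hypothetical.

```python
from typing import List


def latest_experiment_prefix(flow_name: str = "TrainHandGestureClassifier") -> str:
    """Read the storage prefix from the latest successful run of the flow.

    Assumption: the flow saved `experiment_storage_prefix` as an artifact.
    """
    # Imported lazily so tensorboard_args can be used without Metaflow installed.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError(f"No successful run found for flow {flow_name!r}")
    return run.data.experiment_storage_prefix


def tensorboard_args(prefix: str) -> List[str]:
    """Build the argument list for `tensorboard --logdir=<prefix>/experiments`."""
    logdir = prefix.rstrip("/") + "/experiments"
    return ["tensorboard", f"--logdir={logdir}"]
```

You could then launch TensorBoard with subprocess.run(tensorboard_args(latest_experiment_prefix()), check=True), or print the argument list and run the command yourself.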
Summary
Congratulations! You have completed all of the episodes in our Computer Vision Training in the Cloud tutorial. In these episodes, you have learned how to:
- Use a PyTorch DataLoader and a custom Dataset.
- Use Metaflow's S3 client to efficiently move data between your local machine, S3, and ephemeral compute instances that run Metaflow tasks.
- Create a flow that performs transfer learning on state-of-the-art computer vision models.
- Train models on GPUs.
- Set up model checkpoints to resume model state in flows and notebooks, saving costly progress.
- Use TensorBoard to track model training results, leveraging Metaflow's built-in versioning to organize results.
To keep progressing in your Metaflow journey, you can:
- Check out the open-source repository.
- Join our Slack community and learn with us in #ask-metaflow.