Your First Flow
1. Why Flows?
You may be familiar with authoring data science projects in a notebook. A notebook is a list of cells that contain Python code that is executed sequentially, one after another. Metaflow extends this concept by allowing you to define a graph of cells that Metaflow calls steps.
A big benefit of this graph approach is that steps that do not depend on each other can be executed concurrently, which can make the code run much faster. As you will learn soon, structuring your projects as Metaflow flows brings many other benefits too. Luckily, if you know how to author a notebook, there isn't much new to learn when it comes to authoring Metaflow flows.
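To make the graph idea concrete, here is a minimal sketch in plain Python. It is a conceptual model only and does not use Metaflow's API or reflect its internals; the step names (train_a, train_b, compare) and the concurrent_groups helper are hypothetical. It groups the steps of a small graph into waves, where every step in a wave has all of its predecessors finished and could therefore run at the same time as the others in that wave.

```python
# Conceptual sketch only: plain Python, not Metaflow's API.
# Each (hypothetical) step maps to the steps that follow it.
flow_graph = {
    "start": ["train_a", "train_b"],  # two branches that do not depend on each other
    "train_a": ["compare"],
    "train_b": ["compare"],
    "compare": ["end"],
    "end": [],
}


def concurrent_groups(graph):
    """Group steps into waves; steps in the same wave could run concurrently."""
    # Count how many predecessors each step is still waiting for.
    waiting = {step: 0 for step in graph}
    for successors in graph.values():
        for nxt in successors:
            waiting[nxt] += 1

    groups = []
    ready = sorted(step for step, count in waiting.items() if count == 0)
    while ready:
        groups.append(ready)
        next_ready = []
        for step in ready:
            for nxt in graph[step]:
                waiting[nxt] -= 1
                if waiting[nxt] == 0:  # all predecessors have finished
                    next_ready.append(nxt)
        ready = sorted(next_ready)
    return groups


for wave in concurrent_groups(flow_graph):
    print(wave)
```

In a notebook, these five pieces of work would run strictly one after another. In the graph, train_a and train_b land in the same wave, which is where the potential speedup comes from.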
2. Write Your First Flow
Let's start with the simplest possible flow.
Every flow you create must contain a start and an end function. Above each of these functions, you will see the @step decorator, which marks the function as a step of the flow. You will learn all about this and other decorators in the next episode. You tell Metaflow the order in which to execute steps by calling self.next. Here you can see an example of a flow that contains only the start and end steps. Note that you can write Python scripts containing flows in any text editor or notebook environment.
from metaflow import FlowSpec, step


class MinimumFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print("Flow is done!")


if __name__ == "__main__":
    MinimumFlow()
All of your flows inherit from FlowSpec. In this example, you can see the MinimumFlow class doing so. That is the only thing you need to know about object-oriented programming to use Metaflow: you only need to write a class that inherits from FlowSpec in a Python script.
3. Run Your First Flow
Once you have saved the Python script containing your flow, you can run the flow from the command line using the run command:
python minimum_flow.py run
The console output contains a lot of information, including:
- Every Metaflow run gets a unique ID so that you can keep track of your experiments and have an unambiguous way to refer to the results of any particular run.
- A run executes the steps in order. The step that is currently being executed is denoted by the step name.
- When runtime processes are created for steps, they are called tasks. Each task is executed by a separate process (potentially in parallel) in your operating system, identified by a process ID, aka pid. You can use any operating system-level monitoring tool, such as top, to monitor the resource consumption of a task based on its process ID.
- The combination of a flow name, run ID, step name, and task ID uniquely identifies a task in your Metaflow environment, amongst all runs of any flows. Here, the flow name is omitted since it is the same for all lines. We call this globally unique identifier a pathspec.
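As a rough illustration of how the four parts of a pathspec fit together, the sketch below builds and parses one in plain Python. It assumes the slash-separated form flow name/run ID/step name/task ID; the Pathspec class, the parse_pathspec helper, and the run ID 1234 are hypothetical and are not part of Metaflow.

```python
from typing import NamedTuple


class Pathspec(NamedTuple):
    """Hypothetical helper (not part of Metaflow) holding the four parts of a pathspec."""

    flow_name: str
    run_id: str
    step_name: str
    task_id: str

    def __str__(self):
        # Join the four parts with slashes, e.g. MinimumFlow/1234/start/1
        return "/".join(self)


def parse_pathspec(text):
    """Split a slash-separated task pathspec into its four parts."""
    parts = text.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected flow/run/step/task, got {text!r}")
    return Pathspec(*parts)


# The run ID 1234 is made up for illustration; real run IDs come from Metaflow.
spec = Pathspec("MinimumFlow", "1234", "start", "1")
print(spec)

parsed = parse_pathspec("MinimumFlow/1234/end/2")
print(parsed.step_name, parsed.task_id)
```

Because the console omits the flow name, the lines you see there correspond to the last three parts only.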
Congratulations on running your first flow!
In the next episode, you will see how you can expand flows with decorators. Metaflow decorators can be used to send steps to the cloud, build experiment trackers, create data visualizations, and more.