Debug Metaflow Errors with Resume
Question
I have a prototype flow that failed and I want to identify why it failed, where it failed, and debug it.
Solution
1. Run Flow with Error
When running debug_error_with_resume.py, a ZeroDivisionError is produced in the join step.
This flow shows how to:
- Pass artifacts into a join step.
- Start a process to deal with Python errors in a task.
from metaflow import FlowSpec, step

class DebugFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into two parallel branches.
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 0
        self.next(self.join)

    @step
    def join(self, inputs):
        # The divisor in the next line is 0, so this raises ZeroDivisionError!
        self.result = inputs.a.x / inputs.b.x
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DebugFlow()
python debug_error_with_resume.py run
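To locate the failure without reading the console output, the failed tasks of a run can also be listed programmatically. The following is a minimal sketch, not part of the original flow: failed_tasks is a hypothetical helper, and it only assumes that iterating a run yields steps, that iterating a step yields tasks, and that each task exposes successful and pathspec attributes, as Metaflow's Client API objects do.

```python
def failed_tasks(run):
    """Return the pathspecs of all tasks in `run` that did not succeed.

    `run` is anything shaped like a Metaflow Run: iterating it yields steps,
    and iterating a step yields tasks with `successful` and `pathspec`.
    """
    return [
        task.pathspec
        for step in run
        for task in step
        if not task.successful
    ]

# Usage, assuming Metaflow is installed and DebugFlow has already been run:
#   from metaflow import Flow
#   print(failed_tasks(Flow("DebugFlow").latest_run))
```

For the flow above, the only pathspec listed should be the one belonging to the join step.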
2. Debug Flow
Having seen that the code failed at the join step, you can fix the cause and resume the flow from the failed step. The error comes from the division line in the join step of this script, where the divisor inputs.b.x is 0. You can replace this line with
    self.result = inputs.a.x / (inputs.b.x + 1e-12)
to fix the error.
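Adding a small epsilon avoids the exception, but with a zero divisor the result becomes very large (roughly 1e12 here) instead of signalling a problem. The sketch below contrasts the two options outside of Metaflow; safe_ratio and strict_ratio are hypothetical helper names introduced only for illustration, and which behaviour is appropriate depends on what a zero divisor means in your data.

```python
def safe_ratio(numerator, denominator, eps=1e-12):
    """Divide after nudging the denominator by eps, mirroring the fix above."""
    return numerator / (denominator + eps)


def strict_ratio(numerator, denominator):
    """Divide, but fail with a descriptive message when the divisor is zero."""
    if denominator == 0:
        raise ValueError("denominator is zero; check the upstream branch")
    return numerator / denominator


print(safe_ratio(1, 0))      # a very large float rather than an exception
print(strict_ratio(1, 2))    # 0.5
```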
3. Resume Flow from Failed Task
Now you can resume from join without re-running the start, a, and b steps. Note that by default the resume feature will enter the flow at the step that produced the error in the last run. In this example none of the steps are time-intensive. In scenarios such as model training, however, steps may take a long time to compute, and you wouldn't want to re-run a and b if those tasks did expensive model training and the error was in the downstream join task.
python debug_error_with_resume.py resume
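The idea behind resuming can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not how Metaflow implements resume: run_steps, the results dictionary, and the step functions are all hypothetical names. Results of steps that already succeeded are reused, and execution continues from the first step without a stored result.

```python
def run_steps(steps, previous=None):
    """Run (name, func) pairs in order, reusing results from `previous`.

    Stops at the first step that raises and returns the results so far.
    """
    results = dict(previous or {})
    for name, func in steps:
        if name in results:
            continue  # reuse the earlier result instead of recomputing
        try:
            results[name] = func(results)
        except Exception as exc:
            print(f"step {name!r} failed: {exc!r}")
            break
    return results


calls = []  # records which steps actually executed


def a(results):
    calls.append("a")
    return 1


def b(results):
    calls.append("b")
    return 0


def join_broken(results):
    return results["a"] / results["b"]


def join_fixed(results):
    return results["a"] / (results["b"] + 1e-12)


# First run: a and b succeed, join fails, so only a and b are stored.
first = run_steps([("a", a), ("b", b), ("join", join_broken)])

# "Resume" with the fixed join: a and b are not executed again.
second = run_steps([("a", a), ("b", b), ("join", join_fixed)], previous=first)
print(calls)
```

In this toy model, a and b each execute exactly once across both runs, which is the saving that matters when those steps are expensive.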