Debug Metaflow Errors with Resume
Question
I have a prototype flow that failed and I want to identify why it failed, where it failed, and debug it.
Solution
1. Run Flow with Error
Running debug_error_with_resume.py raises a ZeroDivisionError in the join step.
This flow shows how to:
- Pass artifacts into a join step.
- Begin debugging a Python error raised in a task.
from metaflow import FlowSpec, step
class DebugFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.a, self.b)
    @step
    def a(self):
        self.x = 1
        self.next(self.join)
    @step
    def b(self):
        self.x = 0
        self.next(self.join)
    @step
    def join(self, inputs):
        # the divisor in the next line is 0!
        self.result = inputs.a.x / inputs.b.x
        self.next(self.end)
    @step
    def end(self):
        pass
if __name__ == '__main__':
    DebugFlow()
python debug_error_with_resume.py run
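To confirm where the run failed, you can also inspect it from Python after the command exits. The following is a minimal sketch, not part of the original recipe: `failed_steps` is a hypothetical helper, and it assumes that iterating a run yields steps, iterating a step yields tasks, and each task exposes a `successful` attribute, as in the Metaflow Client API.

```python
def failed_steps(run):
    """Return the names of steps that have at least one unsuccessful task.

    `run` is assumed to behave like a Metaflow Client `Run`: iterating it
    yields step objects with an `id`, and iterating a step yields task
    objects with a boolean `successful` attribute.
    """
    failed = []
    for step in run:
        if any(not task.successful for task in step):
            failed.append(step.id)
    return failed


# Usage with Metaflow installed and a recorded run (not executed here):
#   from metaflow import Flow
#   print(failed_steps(Flow("DebugFlow").latest_run))
```

For the flow above, the expectation is that only the join step is reported, since start, a, and b complete before the division is attempted.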
2. Debug Flow
Having seen that the code failed at the join step, you can fix the cause and resume the flow from the faulty step. The line in the join step that raises the ZeroDivisionError is marked with a comment. You can replace this line with
  self.result = inputs.a.x / (inputs.b.x + 1e-12)
to fix the error.
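The effect of this fix can be checked outside Metaflow. The sketch below is plain Python with a hypothetical helper name, `divide_with_epsilon`, using the same values that steps a and b assign to x. It shows that the small constant removes the exception but yields a very large result when the divisor is zero.

```python
def divide_with_epsilon(numerator, denominator, eps=1e-12):
    """Divide after adding a small constant to the denominator.

    Mirrors the replacement line in the join step. A zero denominator no
    longer raises ZeroDivisionError; the result is simply very large.
    """
    return numerator / (denominator + eps)


# Same values as in the flow: a sets x = 1, b sets x = 0.
result = divide_with_epsilon(1, 0)
print(result)
```

If a very large value is not acceptable downstream, an explicit check on the divisor in the join step may be a better choice than the epsilon.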
3. Resume Flow from Failed Task
Now you can resume from join without re-running the start, a, and b steps. By default, the resume feature enters the flow at the step that produced the error in the last run. In this example none of the steps are time intensive, but the saving matters when steps are expensive: if a and b trained models and the error occurred in the downstream join task, you would not want to re-run a and b.
python debug_error_with_resume.py resume