Random Forest Flow

In this episode, you will build a random forest model in a flow.

1. Write a Random Forest Flow

The flow has the following structure:

- Parameter values are defined at the beginning of the class.
  - Default values can be overridden using command line arguments as shown in episode 1.4.
- The start step loads the iris dataset and separates it into features (self.X) and labels (self.y) for use in downstream tasks.
  - The dataset for this task is small, so we can store it in self without introducing much copying and storage overhead. Remember that you can only use self for objects that can be pickled. To learn more about using self, see episode 1.3.
- The train_rf step evaluates a sklearn.ensemble.RandomForestClassifier on the classification task using k-fold cross-validation.
- The end step prints the mean and standard deviation of the cross-validation accuracy scores.
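The note about pickling can be checked before you assign an object to self. The following is a minimal sketch, assuming artifacts are serialized with Python's standard pickle module as the text implies; is_picklable is a hypothetical helper written for this illustration, not part of Metaflow.

```python
import pickle


def is_picklable(obj):
    """Hypothetical helper: True if obj can be serialized by pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The standard pickle module raises one of these for objects
        # it cannot serialize (lambdas, generators, and similar).
        return False


# Plain containers of numbers, like a small dataset, pickle fine.
print(is_picklable({"X": [[5.1, 3.5, 1.4, 0.2]], "y": [0]}))  # True

# Lambdas and generators cannot be pickled by the standard module.
print(is_picklable(lambda x: x + 1))            # False
print(is_picklable(i for i in range(3)))        # False
```

If the check fails for an object, one option is to store the inputs needed to rebuild it rather than the object itself.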

random_forest_flow.py
from metaflow import FlowSpec, step, Parameter

class RandomForestFlow(FlowSpec):

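    # type=int ensures a value passed on the command line is parsed as an
    # integer; with default=None alone, Metaflow would treat it as a string.
    max_depth = Parameter("max_depth", default=None, type=int)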
    random_state = Parameter("seed", default=11)
    n_estimators = Parameter("n-est", default=10)
    min_samples_split = Parameter("min-samples", default=2)
    k_fold = Parameter("k", default=5)

    @step
    def start(self):
        from sklearn import datasets
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.train_rf)

    @step
    def train_rf(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        self.clf = RandomForestClassifier(
            n_estimators=self.n_estimators, 
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split, 
            random_state=self.random_state)
        self.scores = cross_val_score(
            self.clf, self.X, self.y, cv=self.k_fold)
        self.next(self.end)

    @step
    def end(self):
        import numpy as np
        msg = "Random Forest Accuracy: {} \u00B1 {}%"
        self.mean = round(100*np.mean(self.scores), 3)
        self.std = round(100*np.std(self.scores), 3)
        print(msg.format(self.mean, self.std))

if __name__ == "__main__":
    RandomForestFlow()
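To see what the train_rf step computes without running a flow, the same scikit-learn calls can be run as a plain script. This is a rough sketch that assumes scikit-learn is installed; cross_validate_rf is a hypothetical function name, and its defaults simply copy the parameter defaults from the flow above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def cross_validate_rf(n_estimators=10, max_depth=None,
                      min_samples_split=2, random_state=11, k_fold=5):
    """Hypothetical helper mirroring the start and train_rf steps."""
    iris = datasets.load_iris()
    X, y = iris["data"], iris["target"]
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    # With an integer cv and a classifier, cross_val_score uses stratified
    # k-fold splits and the estimator's default scorer (accuracy).
    # It fits a fresh clone of clf per fold; clf itself stays unfitted.
    return cross_val_score(clf, X, y, cv=k_fold)


scores = cross_validate_rf()
print(scores)  # one accuracy value per fold
```

Wrapping these calls in a flow adds versioned storage of X, y, and scores as artifacts, which the plain script does not provide.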

2. Run the Random Forest Flow

python random_forest_flow.py run

     Workflow starting (run-id 1666720721614183):
     [1666720721614183/start/1 (pid 52687)] Task is starting.
     [1666720721614183/start/1 (pid 52687)] Task finished successfully.
     [1666720721614183/train_rf/2 (pid 52691)] Task is starting.
     [1666720721614183/train_rf/2 (pid 52691)] Task finished successfully.
     [1666720721614183/end/3 (pid 52702)] Task is starting.
     [1666720721614183/end/3 (pid 52702)] Random Forest Accuracy: 96.0 ± 3.266%
     [1666720721614183/end/3 (pid 52702)] Task finished successfully.
     Done!
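The value after the ± sign in the output is the standard deviation of the per-fold accuracies, computed the same way as in the end step. The sketch below uses hypothetical fold scores made up for illustration (they are not taken from the run above) to show the arithmetic, and that np.std defaults to the population standard deviation (ddof=0).

```python
import numpy as np

# Hypothetical fold accuracies, invented for this example only.
scores = np.array([0.90, 1.00, 0.95, 0.95, 1.00])

# Same arithmetic as the end step: convert to percent, round to 3 places.
mean_pct = round(100 * np.mean(scores), 3)
std_pct = round(100 * np.std(scores), 3)  # population std (ddof=0)

# The sample standard deviation (ddof=1) is slightly larger for small k.
sample_std_pct = round(100 * np.std(scores, ddof=1), 3)

print("Accuracy: {} \u00B1 {}%".format(mean_pct, std_pct))
print("Sample std: {}%".format(sample_std_pct))
```

With only five folds the two definitions differ noticeably, so it helps to state which one a reported spread uses.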

In this episode, you trained a random forest and evaluated its accuracy. Once you have a workflow set up, Metaflow will work with any model you can express in Python code! Here are more examples of using scikit-learn and Metaflow together:

In the next episode, you will see a similar workflow for an XGBoost model.
