Random Forest Flow
In this episode, you will build a random forest model in a flow.
1. Write a Random Forest Flow
The flow has the following structure:
- Parameter values are defined at the beginning of the class.
- Default values can be overridden using command line arguments as shown in episode 1.4.
- The start step loads and splits a dataset to be used in downstream tasks.
  - The dataset for this task is small, so we can store it in self without introducing much copying and storage overhead. Remember that you can only use self for objects that can be pickled. To learn more about using self, see episode 1.3.
- The train_rf step fits a sklearn.ensemble.RandomForestClassifier for the classification task using cross-validation.
- The end step prints the accuracy scores for the classifier.
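The pickling requirement mentioned above can be checked before an object is assigned to self. The following is a minimal sketch using only the standard library; the helper name is_picklable is hypothetical, and this is a rough pre-check rather than Metaflow's own validation.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# Plain data such as lists and dicts pickles without trouble.
print(is_picklable({"X": [[5.1, 3.5]], "y": [0]}))
# A generator holds live execution state and cannot be pickled.
print(is_picklable(x for x in range(3)))
```

Objects that fail this check, such as generators or open connections, are better recreated inside the step that needs them than stored on self.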
random_forest_flow.py
from metaflow import FlowSpec, step, Parameter


class RandomForestFlow(FlowSpec):

    # Hyperparameters; the defaults can be overridden on the command line.
    max_depth = Parameter("max_depth", default=None)
    random_state = Parameter("seed", default=11)
    n_estimators = Parameter("n-est", default=10)
    min_samples_split = Parameter("min-samples", default=2)
    k_fold = Parameter("k", default=5)

    @step
    def start(self):
        from sklearn import datasets
        # Load the iris dataset and keep features and labels as artifacts.
        self.iris = datasets.load_iris()
        self.X = self.iris['data']
        self.y = self.iris['target']
        self.next(self.train_rf)

    @step
    def train_rf(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        self.clf = RandomForestClassifier(
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            random_state=self.random_state)
        # Evaluate the classifier with k-fold cross-validation.
        self.scores = cross_val_score(
            self.clf, self.X, self.y, cv=self.k_fold)
        self.next(self.end)

    @step
    def end(self):
        import numpy as np
        # Report mean accuracy and its standard deviation as percentages.
        msg = "Random Forest Accuracy: {} \u00B1 {}%"
        self.mean = round(100*np.mean(self.scores), 3)
        self.std = round(100*np.std(self.scores), 3)
        print(msg.format(self.mean, self.std))


if __name__ == "__main__":
    RandomForestFlow()
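To make the summary printed by the end step concrete, here is a small standalone sketch of the same arithmetic using only the standard library. The helper name summarize_scores and the fold accuracies are made up for illustration; they are not results from the flow. It relies on the fact that np.std defaults to the population standard deviation, which statistics.pstdev also computes.

```python
from statistics import mean, pstdev


def summarize_scores(scores):
    """Return (mean, std) of fold accuracies as percentages, rounded to 3 places."""
    # pstdev is the population standard deviation, matching np.std's default (ddof=0).
    return round(100 * mean(scores), 3), round(100 * pstdev(scores), 3)


# Hypothetical fold accuracies, for illustration only.
example_scores = [0.9, 1.0, 0.9, 1.0, 0.95]
avg, std = summarize_scores(example_scores)
print("Random Forest Accuracy: {} \u00B1 {}%".format(avg, std))
```

With only a handful of folds the standard deviation is a coarse estimate, so it is best read as a rough indication of how much accuracy varies between folds.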
2. Run the Random Forest Flow
python random_forest_flow.py run
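After the run completes, the artifacts stored on self can be read back from Python. The sketch below uses Metaflow's Client API and assumes the flow has at least one successful run in the current Metaflow environment; the helper name latest_scores is hypothetical.

```python
def latest_scores(flow_name="RandomForestFlow"):
    """Return the cross-validation scores stored by the latest successful run."""
    # Imported here so that defining the helper does not itself require Metaflow.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError("No successful run found for " + flow_name)
    # run.data exposes the artifacts assigned to self, as seen at the end step.
    return run.data.scores


# Example usage (requires a completed run of the flow):
# print(latest_scores())
```

Reading artifacts this way is convenient for comparing scores across runs in a notebook without re-executing the flow.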
In this episode, you trained a random forest and evaluated its accuracy. Once you have a workflow set up, Metaflow will work with any model you can express in Python code! Here are more examples of using scikit-learn and Metaflow together:
In the next episode, you will see a similar workflow for an XGBoost model.