Natural Language Processing - Episode 2
This episode references model.py.
At the end of this episode, you will be able to:
- Build a simple NLP model using TensorFlow and scikit-learn to classify fashion reviews.
- Set up your model so that it can be easily integrated into a Metaflow flow.
1. Description of a Custom Model
Now it’s time to build our ML model. We are going to define our model in a separate file with a custom class called `NbowModel`. The model contains two subcomponents: a count vectorizer for preprocessing and a neural network for modeling. The `NbowModel` class combines these two components so that we don't have to deal with them separately.
Here is an explanation of the various methods in this model:
- `__init__`: Initialize the count vectorizer, a preprocessor that counts the tokens in the text, and a neural network to do the modeling.
- `fit`: Fit the count vectorizer, followed by the model.
- `predict`: Transform the data with the count vectorizer before making predictions.
- `eval_acc`: Calculate model accuracy given a dataset and labels.
- `eval_rocauc`: Calculate the area under the ROC curve given a dataset and labels.
- `model_dict`: This exposes a dictionary with the two components that form this model, the count vectorizer and the neural network. We will use this to serialize the model's data into Metaflow.
- `from_dict`: This allows you to instantiate an `NbowModel` from a `model_dict`, which is useful for deserializing data in Metaflow.
2. How to Serialize Data
Anytime you create your own model library or define models in custom classes, we recommend explicitly defining how the model is serialized and loaded. This minimizes the chances that things will break as your model code changes: an explicit serialization contract lets you keep new versions of your code backward compatible with previously saved models, or at least handle any serialization and deserialization changes in a way that is transparent to you. That is the purpose of the `from_dict` method and the `model_dict` property in this example.
For Metaflow, it is very convenient to have an interface that saves the model's state as a pickleable object, because that is how Metaflow persists data. This is exactly what `model_dict` and `from_dict` provide: they save and restore the model via a pickleable data structure (see the round-trip sketch after the class definition below).
import tensorflow as tf
from tensorflow.keras import layers, optimizers, regularizers
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

class NbowModel:
    def __init__(self, vocab_sz):
        self.vocab_sz = vocab_sz
        # Instantiate the CountVectorizer
        self.cv = CountVectorizer(
            min_df=.005, max_df=.75, stop_words='english',
            strip_accents='ascii', max_features=self.vocab_sz
        )
        # Define the keras model
        inputs = tf.keras.Input(shape=(self.vocab_sz,), name='input')
        x = layers.Dropout(0.10)(inputs)
        x = layers.Dense(
            15, activation='relu',
            kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4)
        )(x)
        predictions = layers.Dense(1, activation='sigmoid')(x)
        self.model = tf.keras.Model(inputs, predictions)
        opt = optimizers.Adam(learning_rate=0.002)
        self.model.compile(loss='binary_crossentropy',
                           optimizer=opt, metrics=['accuracy'])

    def fit(self, X, y):
        res = self.cv.fit_transform(X).toarray()
        self.model.fit(x=res, y=y, batch_size=32,
                       epochs=10, validation_split=.2)

    def predict(self, X):
        res = self.cv.transform(X).toarray()
        return self.model.predict(res)

    def eval_acc(self, X, labels, threshold=.5):
        return accuracy_score(labels, self.predict(X) > threshold)

    def eval_rocauc(self, X, labels):
        return roc_auc_score(labels, self.predict(X))

    @property
    def model_dict(self):
        return {'vectorizer': self.cv, 'model': self.model}

    @classmethod
    def from_dict(cls, model_dict):
        "Get Model from dictionary"
        nbow_model = cls(len(model_dict['vectorizer'].vocabulary_))
        nbow_model.model = model_dict['model']
        nbow_model.cv = model_dict['vectorizer']
        return nbow_model
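To make the serialization round trip concrete, here is a minimal sketch, assuming a TensorFlow version recent enough that Keras models can be pickled; the file name model.pkl is a placeholder:

import pickle
import pandas as pd
from model import NbowModel

# Fit first, so the vectorizer has a vocabulary_ for from_dict to read
df = pd.read_parquet('train.parquet')
model = NbowModel(vocab_sz=750)
model.fit(X=df['review'], y=df['labels'])

# Save: model_dict is a plain dict, so it can be pickled directly
# (assumes a TF/Keras version whose models support pickling)
with open('model.pkl', 'wb') as f:
    pickle.dump(model.model_dict, f)

# Load: rebuild an equivalent NbowModel from the pickled dict
with open('model.pkl', 'rb') as f:
    restored = NbowModel.from_dict(pickle.load(f))

Metaflow performs the same pickling step for you whenever you assign `model_dict` to an artifact, which is why this interface fits it so well.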
3. Fit the Custom Model on the Dataset
Next, let's import the `NbowModel` and train it on this dataset. The purpose of doing this is to make sure the code works as we expect before using Metaflow. For this example, we will set `vocab_sz=750`.
from model import NbowModel
import pandas as pd
model = NbowModel(vocab_sz=750)
df = pd.read_parquet('train.parquet')
model.fit(X=df['review'], y=df['labels'])
4. Evaluate the Model Performance
Next, we can evaluate our model on the validation set as well, using the built-in evaluation methods we created:
valdf = pd.read_parquet('valid.parquet')
model_acc = model.eval_acc(
valdf['review'], valdf['labels'])
model_rocauc = model.eval_rocauc(
valdf['review'], valdf['labels'])
msg = 'Baseline Accuracy: {}\nBaseline AUC: {}'
print(msg.format(
round(model_acc, 3), round(model_rocauc, 3)
))
Great! This is an improvement upon our baseline! Now we have set up what we need to start using Metaflow. In the next lesson, we are going to operationalize the steps we manually performed here by refactoring them into a Metaflow flow.
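As a preview, here is a minimal sketch of what that refactoring could look like; the flow name NLPFlow is a hypothetical placeholder, not the flow built in the next lesson:

from metaflow import FlowSpec, step

class NLPFlow(FlowSpec):  # hypothetical name, for illustration only

    @step
    def start(self):
        import pandas as pd
        self.df = pd.read_parquet('train.parquet')
        self.next(self.train)

    @step
    def train(self):
        from model import NbowModel
        model = NbowModel(vocab_sz=750)
        model.fit(X=self.df['review'], y=self.df['labels'])
        # Persist the pickleable dict as a Metaflow artifact
        self.model_dict = model.model_dict
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    NLPFlow()

Any step that runs later can call `NbowModel.from_dict(self.model_dict)` to get the trained model back, which is exactly the pattern the next lesson builds on.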