Natural Language Processing - Episode 2
This episode references model.py.
At the end of this episode, you will be able to:
- Build a simple NLP model using TensorFlow and scikit-learn to classify fashion reviews.
- Set up your model so that it can be easily integrated into a Metaflow flow.
1. Description of a Custom Model
Now it’s time to build our ML model. We are going to define our model in a separate file with a custom class called `NbowModel`. The model contains two subcomponents: a count vectorizer for preprocessing and a neural network for modeling. The `NbowModel` class combines these two components so that we don't have to deal with them separately.
Here is an explanation of the various methods in this model:
- `__init__`: Initialize the count vectorizer, a preprocessor that counts the tokens in the text, and a neural network to do the modeling.
- `fit`: Fit the count vectorizer, followed by the model.
- `predict`: Transform the data with the count vectorizer before making predictions.
- `eval_acc`: Calculate model accuracy given a dataset and labels.
- `eval_rocauc`: Calculate the area under the ROC curve given a dataset and labels.
- `model_dict`: This exposes a dictionary with the two components that form this model, the count vectorizer and the neural network. We will use this to serialize the model's data into Metaflow.
- `from_dict`: This allows you to instantiate an `NbowModel` from a `model_dict`, which is useful for deserializing data in Metaflow.
2. How to Serialize Data
Anytime you create your own model library or define models in custom classes, we recommend explicitly defining how the model is serialized and loaded. This minimizes the chances that things will break as your model code changes: an explicit serialization contract lets you keep new versions of your code backward compatible with previously saved models, or at least handle any serialization and deserialization changes in a way that is transparent to you. That is the purpose of the `from_dict` method and the `model_dict` property in this example.
For Metaflow, it is very convenient to have an interface that saves the model's state as a pickleable object, because that is how Metaflow persists data. This is exactly what `model_dict` and `from_dict` provide: they save and restore the model via a pickleable data structure (see the round-trip sketch after the class definition below).
import tensorflow as tf
from tensorflow.keras import layers, optimizers, regularizers
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

class NbowModel:
    def __init__(self, vocab_sz):
        self.vocab_sz = vocab_sz
        # Instantiate the CountVectorizer
        self.cv = CountVectorizer(
            min_df=.005, max_df=.75, stop_words='english',
            strip_accents='ascii', max_features=self.vocab_sz
        )
        # Define the keras model
        inputs = tf.keras.Input(shape=(self.vocab_sz,), name='input')
        x = layers.Dropout(0.10)(inputs)
        x = layers.Dense(
            15, activation='relu',
            kernel_regularizer=regularizers.L1L2(l1=1e-5, l2=1e-4)
        )(x)
        predictions = layers.Dense(1, activation='sigmoid')(x)
        self.model = tf.keras.Model(inputs, predictions)
        opt = optimizers.Adam(learning_rate=0.002)
        self.model.compile(loss='binary_crossentropy',
                           optimizer=opt, metrics=['accuracy'])

    def fit(self, X, y):
        res = self.cv.fit_transform(X).toarray()
        self.model.fit(x=res, y=y, batch_size=32,
                       epochs=10, validation_split=.2)

    def predict(self, X):
        res = self.cv.transform(X).toarray()
        return self.model.predict(res)

    def eval_acc(self, X, labels, threshold=.5):
        return accuracy_score(labels, self.predict(X) > threshold)

    def eval_rocauc(self, X, labels):
        return roc_auc_score(labels, self.predict(X))

    @property
    def model_dict(self):
        return {'vectorizer': self.cv, 'model': self.model}

    @classmethod
    def from_dict(cls, model_dict):
        "Get Model from dictionary"
        nbow_model = cls(len(model_dict['vectorizer'].vocabulary_))
        nbow_model.model = model_dict['model']
        nbow_model.cv = model_dict['vectorizer']
        return nbow_model
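To make the serialization round trip concrete, here is a minimal sketch, assuming a TensorFlow version recent enough that Keras models can be pickled; the file name model.pkl is a placeholder:

import pickle
import pandas as pd
from model import NbowModel

# Fit first, so the vectorizer has a vocabulary_ for from_dict to read
df = pd.read_parquet('train.parquet')
model = NbowModel(vocab_sz=750)
model.fit(X=df['review'], y=df['labels'])

# Save: model_dict is a plain dict, so it can be pickled directly
# (assumes a TF/Keras version whose models support pickling)
with open('model.pkl', 'wb') as f:
    pickle.dump(model.model_dict, f)

# Load: rebuild an equivalent NbowModel from the pickled dict
with open('model.pkl', 'rb') as f:
    restored = NbowModel.from_dict(pickle.load(f))

Metaflow performs the same pickling step for you whenever you assign `model_dict` to an artifact, which is why this interface fits it so well.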
3. Fit the Custom Model on the Dataset
Next, let's import the `NbowModel` and train it on this dataset. The purpose of doing this is to make sure the code works as we expect before using Metaflow. For this example, we will set `vocab_sz=750`.
from model import NbowModel
import pandas as pd
model = NbowModel(vocab_sz=750)
df = pd.read_parquet('train.parquet')
model.fit(X=df['review'], y=df['labels'])
4. Evaluate the Model Performance
Next, we can evaluate our model on the validation set as well, using the built-in evaluation methods we created:
valdf = pd.read_parquet('valid.parquet')
model_acc = model.eval_acc(
valdf['review'], valdf['labels'])
model_rocauc = model.eval_rocauc(
valdf['review'], valdf['labels'])
msg = 'Baseline Accuracy: {}\nBaseline AUC: {}'
print(msg.format(
round(model_acc, 3), round(model_rocauc, 3)
))
Great! This is an improvement upon our baseline! Now we have set up what we need to start using Metaflow. In the next lesson, we are going to operationalize the steps we manually performed here by refactoring them into a Metaflow flow.
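As a preview, here is a minimal sketch of what that refactoring could look like; the flow name NLPFlow is a hypothetical placeholder, not the flow built in the next lesson:

from metaflow import FlowSpec, step

class NLPFlow(FlowSpec):  # hypothetical name, for illustration only

    @step
    def start(self):
        import pandas as pd
        self.df = pd.read_parquet('train.parquet')
        self.next(self.train)

    @step
    def train(self):
        from model import NbowModel
        model = NbowModel(vocab_sz=750)
        model.fit(X=self.df['review'], y=self.df['labels'])
        # Persist the pickleable dict as a Metaflow artifact
        self.model_dict = model.model_dict
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    NLPFlow()

Any step that runs later can call `NbowModel.from_dict(self.model_dict)` to get the trained model back, which is exactly the pattern the next lesson builds on.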