Getting started with LLMs

Introduction

Suppose you work at a fashion company where a few hundred (or a million) reviews come in each night. Your marketing team knows that word-of-mouth is the most effective channel, so your goal is to identify the reviewers most likely to recommend your product; the marketing team can then proactively incentivize them to share their recommendations widely and reward them for being loyal customers. You have been tasked with deciding which modeling approach to pursue.

The data engineering team has set up a data pipeline that feeds new reviews into a Postgres database at a regular interval (every few minutes in our simulation).

📈 Learning outcomes

You'll use a variety of LLMs and classical ML approaches to infer whether users are likely to recommend a product for each new batch of data. Along the way you'll learn how LLMs interact with the Outerbounds platform, and how to use the platform in general. Specifically, you'll cover these (and more!) topics:

  • Using vendor LLM APIs (OpenAI) and local models (HuggingFace) with Outerbounds
  • Accessing production databases (Postgres), not static datasets
  • Comparing zero-shot LLMs to a hybrid ML approach (embeddings + XGBoost)
  • Using Metaflow to organize results
  • Monitoring results across workflows and over time

Configure OpenAI API key

OpenAI models are used in several sessions, so an API key is required to run all of the code samples. You do not need an OpenAI API key to follow along with the journey, but without one you will not be able to run the code samples that call OpenAI.

If you are following along on Outerbounds, go to the Integrations view and configure your OpenAI API key as a resource integration. If you are not an admin, ask one to set up the resource integration for you. Alternatively, if you are not an admin or are following along on a non-Outerbounds Metaflow deployment, you can set your keys manually at the relevant places in the notebooks and workflow files.
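
If you set the key manually in a notebook, a minimal sketch looks like this (OPENAI_API_KEY is the environment variable the openai Python package reads by default; the placeholder value is not a real key):

import os

# Set the key only if it was not injected by a resource integration.
# Replace the placeholder with your own key; never commit real keys.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")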

📥 Download the content to your workstation

Run the following command to download the content to your workstation:

outerbounds tutorials pull --url https://outerbounds-journeys-content.s3.us-west-2.amazonaws.com/main/journeys.tar.gz --destination-dir ~/learn

The downloaded content may include code packages for several journeys like this one; the journey we are interested in resides under ~/learn/llm-end-to-end.

If you are not running this example on Outerbounds, you can change the ~/learn directory to a destination of your choice. If you are running on the platform, click next once you see Tutorials pulled successfully.

πŸ“ Setup notebook​

Open the notebook in the 00-setup-nb directory to set up your environment for this journey. The notebook will guide you through installing the required packages and validating your OpenAI connection.

🧪 Baseline notebook - zero-shot inference with LLMs

Open the notebook in the 01-baseline-nb directory to learn about the dataset and the problem. The notebook will introduce the dataset used throughout this journey, and will call OpenAI to score the sentiment of clothing reviews. If you don't have an OpenAI API key, you can read along for this lesson.
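
To give a flavor of what the notebook does, here is a minimal sketch of a zero-shot sentiment call (the model name and prompt are illustrative, not necessarily the ones used in the notebook):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

review = "Lovely dress, fits perfectly and the fabric feels great."
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer with a single word: yes or no."},
        {"role": "user", "content": f"Would this reviewer recommend the product?\n\n{review}"},
    ],
)
print(response.choices[0].message.content)  # e.g. "yes"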

🏭 Baseline workflow - evaluating with known labels

Use historical data with ground-truth labels to evaluate how well the LLM infers the sentiment of reviews zero-shot. Run this command in the terminal:

cd ~/learn/llm-end-to-end/02-baseline-flow
python flow.py --environment=fast-bakery run --with kubernetes --eval True --n 200
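
Conceptually, the evaluation reduces to comparing the LLM's yes/no predictions against the known recommended_ind labels. A minimal sketch of that comparison (the variable values are illustrative):

# preds: 0/1 predictions from the LLM; labels: known recommended_ind values
preds = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
print(f"zero-shot accuracy: {accuracy:.2%}")  # 75.00%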

🏭 Baseline workflow - inference on new reviews

Use the same workflow to infer the sentiment of new reviews. Run this command in the terminal:

cd ~/learn/llm-end-to-end/02-baseline-flow
python flow.py --environment=fast-bakery run --with kubernetes

Data changes in the real world

In real-world ML systems, there is often a delay between inference time, when the system makes a prediction, and the moment the true label arrives to tell you whether the prediction was correct. In our scenario, the marketing team may need to wait days or weeks before knowing whether a customer actually recommended the product to others in the way the intervention was designed to encourage.

As a data scientist using this system, here's what you need to know:

  • New reviews appear in the database at arbitrary times (every 10 minutes in our simulation).
  • The recommended_ind column uses NULL to mark reviews that have not received feedback yet. For labeled rows, the column is 1 if the reviewer recommended the product and 0 if they did not.
  • To work with historical data whose true labels have arrived, use fetch_table(table_name, only_labeled=True). This pattern is useful during development and evaluation.
  • To work with the most recent (still unlabeled) batch of data, use fetch_table(table_name, only_labeled=False), as illustrated in the sketch after this list.
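
As an illustration of what only_labeled filters on, such a helper could be implemented with a NULL check on recommended_ind. This is a hypothetical sketch (the psycopg2 dependency, the POSTGRES_DSN variable, and the function body are assumptions, not necessarily how the tutorial implements fetch_table):

import os

import pandas as pd
import psycopg2

def fetch_table(table_name: str, only_labeled: bool = False) -> pd.DataFrame:
    # NULL in recommended_ind marks reviews whose true label has not arrived yet.
    query = f"SELECT * FROM {table_name}"
    if only_labeled:
        query += " WHERE recommended_ind IS NOT NULL"
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn:  # hypothetical DSN variable
        return pd.read_sql(query, conn)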

For any given ML workflow, it is generally quite helpful to build a pattern that lets you both train and evaluate on historical data with known outcomes, and make predictions on new data where feedback on prediction quality is still pending.

If you are following along on the Outerbounds platform, click next in the right-hand side panel.

🚀 Deploy a sensor workflow

Deploy a sensor workflow that monitors the database every N minutes and sends an event to the platform when new data is available. You can adjust the interval by changing the schedule parameter in the argo-workflows-create command. Run this command in the terminal:

cd ~/learn/llm-end-to-end/03-sensor-flow
python flow.py --environment=fast-bakery argo-workflows-create
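
A sensor flow of this shape typically pairs Metaflow's @schedule decorator with an ArgoEvent publication. A minimal sketch, with illustrative event, flow, and payload names (the real flow also checks the database for new rows before publishing):

from metaflow import FlowSpec, schedule, step
from metaflow.integrations import ArgoEvent

@schedule(cron="*/10 * * * *")  # illustrative: run every 10 minutes
class ReviewSensorFlow(FlowSpec):

    @step
    def start(self):
        # Publish an event that downstream flows can subscribe to.
        ArgoEvent(name="new_reviews").publish(payload={"table": "reviews"})
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ReviewSensorFlow()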

🚀 Deploy the baseline workflow

Deploy a version of the baseline workflow that listens for the sensor workflow's event and runs inference when new data is available. Run this command in the terminal to deploy the workflow:

cd ~/learn/llm-end-to-end/04-deploy-baseline
python flow.py --production --environment=fast-bakery argo-workflows-create
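
On the consuming side, Metaflow's @trigger decorator subscribes a flow to an event. A minimal sketch of the wiring (event and flow names are illustrative):

from metaflow import FlowSpec, step, trigger

@trigger(event="new_reviews")  # fires when the sensor publishes this event
class BaselineInferenceFlow(FlowSpec):

    @step
    def start(self):
        # Fetch the latest unlabeled batch and score it with the LLM here.
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    BaselineInferenceFlow()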

🆕 Candidate workflow - iterating on the baseline

Deploy a candidate workflow that uses a different LLM to infer the sentiment of reviews. Run this command in the terminal to deploy the workflow:

cd ~/learn/llm-end-to-end/05-deploy-candidate
python flow.py --production --environment=fast-bakery argo-workflows-create
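
Swapping in a local model can be as simple as replacing the OpenAI call with a HuggingFace pipeline. A hedged sketch using the transformers library (the model choice is illustrative, not necessarily the one the candidate workflow uses):

from transformers import pipeline

# A lightweight sentiment model commonly used as a baseline.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Lovely dress, fits perfectly.")[0])
# e.g. {'label': 'POSITIVE', 'score': 0.9998}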

📊 Monitoring and comparing candidate and baseline workflows

Compare the performance of the candidate and baseline workflows. The accompanying notebook shows how to use the Metaflow Client API to compare performance across workflows and over time.
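
For example, the Metaflow Client API lets you look up runs and their artifacts across flows. A minimal sketch (the flow names and the accuracy artifact are illustrative assumptions):

from metaflow import Flow

# Compare the most recent successful run of each flow.
for flow_name in ("BaselineFlow", "CandidateFlow"):
    run = Flow(flow_name).latest_successful_run
    # Assumes each flow stores an `accuracy` artifact.
    print(flow_name, run.id, run.data.accuracy)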

If you feel adventurous, there is also a workflow version of the same reporting dashboard ready to deploy. Have fun!

Next steps

In this journey, you have learned how to use LLMs on Outerbounds to infer the sentiment of reviews. There are many more tasks AI systems can perform, and many more tools Outerbounds provides to help you design, implement, and scale them. You can explore the following resources to learn more: