{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the dataset\n", "\n", "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of this dataset, consisting of the 1,000 most recent reviews, for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n", "\n", "We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "import tiktoken\n", "from openai.embeddings_utils import get_embedding\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# embedding model parameters\n", "embedding_model = \"text-embedding-ada-002\"\n", "embedding_encoding = \"cl100k_base\" # this is the encoding for text-embedding-ada-002\n", "max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | Time | ProductId | UserId | Score | Summary | Text | combined |
|---|---|---|---|---|---|---|---|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... |

|   | ProductId | UserId | Score | Summary | Text | combined | n_tokens |
|---|---|---|---|---|---|---|---|
| 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 52 |
| 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not Wolfgang Puck good | Honestly, I have to admit that I expected a li... | Title: Good, but not Wolfgang Puck good; Conte... | 178 |
| 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 |
| 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato soup | I have a hard time finding packaged food of an... | Title: Best tomato soup; Content: I have a har... | 111 |
| 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflammation response | Truthfully wasn't crazy about the taste of the... | Title: Strange inflammation response; Content:... | 110 |
| 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and only MUSTARD | You've just got to experience this mustard... ... | Title: My favorite and only MUSTARD; Content:... | 80 |
| 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LOVE these! | Shake the container and they come running. Eve... | Title: My furbabies LOVE these!; Content: Shak... | 47 |
| 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for the daughter | all i have heard since she got a kuerig is why... | Title: got this for the daughter; Content: all... | 50 |
| 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Coffee! | My first experience with Maui Coffee was bring... | Title: I love Maui Coffee!; Content: My first ... | 118 |

1000 rows × 7 columns
\n", "