{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the dataset\n", "\n", "The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of this dataset, consisting of 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n", "\n", "We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "import tiktoken\n", "from openai.embeddings_utils import get_embedding\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# embedding model parameters\n", "embedding_model = \"text-embedding-ada-002\"\n", "embedding_encoding = \"cl100k_base\" # this the encoding for text-embedding-ada-002\n", "max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeProductIdUserIdScoreSummaryTextcombined
01351123200B003XPF9BOA3R7JR3FMEBXQB5where does one start...and stop... with a tre...Wanted to save some to bring to my Chicago fam...Title: where does one start...and stop... wit...
11351123200B003JK537SA3JBPC3WFUT5ZP1Arrived in piecesNot pleased at all. When I opened the box, mos...Title: Arrived in pieces; Content: Not pleased...
\n", "
" ], "text/plain": [ " Time ProductId UserId Score \\\n", "0 1351123200 B003XPF9BO A3R7JR3FMEBXQB 5 \n", "1 1351123200 B003JK537S A3JBPC3WFUT5ZP 1 \n", "\n", " Summary \\\n", "0 where does one start...and stop... with a tre... \n", "1 Arrived in pieces \n", "\n", " Text \\\n", "0 Wanted to save some to bring to my Chicago fam... \n", "1 Not pleased at all. When I opened the box, mos... \n", "\n", " combined \n", "0 Title: where does one start...and stop... wit... \n", "1 Title: Arrived in pieces; Content: Not pleased... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load & inspect dataset\n", "input_datapath = \"data/fine_food_reviews_1k.csv\" # to save space, we provide a pre-filtered dataset\n", "df = pd.read_csv(input_datapath, index_col=0)\n", "df = df[[\"Time\", \"ProductId\", \"UserId\", \"Score\", \"Summary\", \"Text\"]]\n", "df = df.dropna()\n", "df[\"combined\"] = (\n", " \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n", ")\n", "df.head(2)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# subsample to 1k most recent reviews and remove samples that are too long\n", "top_n = 1000\n", "df = df.sort_values(\"Time\").tail(top_n * 2) # first cut to first 2k entries, assuming less than half will be filtered out\n", "df.drop(\"Time\", axis=1, inplace=True)\n", "\n", "encoding = tiktoken.get_encoding(embedding_encoding)\n", "\n", "# omit reviews that are too long to embed\n", "df[\"n_tokens\"] = df.combined.apply(lambda x: len(encoding.encode(x)))\n", "df = df[df.n_tokens <= max_tokens].tail(top_n)\n", "len(df)\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ProductIdUserIdScoreSummaryTextcombinedn_tokens
0B003XPF9BOA3R7JR3FMEBXQB5where does one start...and stop... with a tre...Wanted to save some to bring to my Chicago fam...Title: where does one start...and stop... wit...52
297B003VXHGPKA21VWSCGW7UUAR4Good, but not Wolfgang Puck goodHonestly, I have to admit that I expected a li...Title: Good, but not Wolfgang Puck good; Conte...178
296B008JKTTUAA34XBAIFT02B601Should advertise coconut as an ingredient more...First, these should be called Mac - Coconut ba...Title: Should advertise coconut as an ingredie...78
295B000LKTTTWA14MQ40CCU8B135Best tomato soupI have a hard time finding packaged food of an...Title: Best tomato soup; Content: I have a har...111
294B001D09KAMA34XBAIFT02B601Should advertise coconut as an ingredient more...First, these should be called Mac - Coconut ba...Title: Should advertise coconut as an ingredie...78
........................
623B0000CFXYAA3GS4GWPIBV0NT1Strange inflammation responseTruthfully wasn't crazy about the taste of the...Title: Strange inflammation response; Content:...110
624B0001BH5YMA1BZ3HMAKK0NC5My favorite and only MUSTARDYou've just got to experience this mustard... ...Title: My favorite and only MUSTARD; Content:...80
625B0009ET7TCA2FSDQY5AI6TNX5My furbabies LOVE these!Shake the container and they come running. Eve...Title: My furbabies LOVE these!; Content: Shak...47
619B007PA32L2A15FF2P7RPKH6G5got this for the daughterall i have heard since she got a kuerig is why...Title: got this for the daughter; Content: all...50
999B001EQ5GEOA3VYU0VO6DYV6I5I love Maui Coffee!My first experience with Maui Coffee was bring...Title: I love Maui Coffee!; Content: My first ...118
\n", "

1000 rows × 7 columns

\n", "
" ], "text/plain": [ " ProductId UserId Score \\\n", "0 B003XPF9BO A3R7JR3FMEBXQB 5 \n", "297 B003VXHGPK A21VWSCGW7UUAR 4 \n", "296 B008JKTTUA A34XBAIFT02B60 1 \n", "295 B000LKTTTW A14MQ40CCU8B13 5 \n", "294 B001D09KAM A34XBAIFT02B60 1 \n", ".. ... ... ... \n", "623 B0000CFXYA A3GS4GWPIBV0NT 1 \n", "624 B0001BH5YM A1BZ3HMAKK0NC 5 \n", "625 B0009ET7TC A2FSDQY5AI6TNX 5 \n", "619 B007PA32L2 A15FF2P7RPKH6G 5 \n", "999 B001EQ5GEO A3VYU0VO6DYV6I 5 \n", "\n", " Summary \\\n", "0 where does one start...and stop... with a tre... \n", "297 Good, but not Wolfgang Puck good \n", "296 Should advertise coconut as an ingredient more... \n", "295 Best tomato soup \n", "294 Should advertise coconut as an ingredient more... \n", ".. ... \n", "623 Strange inflammation response \n", "624 My favorite and only MUSTARD \n", "625 My furbabies LOVE these! \n", "619 got this for the daughter \n", "999 I love Maui Coffee! \n", "\n", " Text \\\n", "0 Wanted to save some to bring to my Chicago fam... \n", "297 Honestly, I have to admit that I expected a li... \n", "296 First, these should be called Mac - Coconut ba... \n", "295 I have a hard time finding packaged food of an... \n", "294 First, these should be called Mac - Coconut ba... \n", ".. ... \n", "623 Truthfully wasn't crazy about the taste of the... \n", "624 You've just got to experience this mustard... ... \n", "625 Shake the container and they come running. Eve... \n", "619 all i have heard since she got a kuerig is why... \n", "999 My first experience with Maui Coffee was bring... \n", "\n", " combined n_tokens \n", "0 Title: where does one start...and stop... wit... 52 \n", "297 Title: Good, but not Wolfgang Puck good; Conte... 178 \n", "296 Title: Should advertise coconut as an ingredie... 78 \n", "295 Title: Best tomato soup; Content: I have a har... 111 \n", "294 Title: Should advertise coconut as an ingredie... 78 \n", ".. ... ... \n", "623 Title: Strange inflammation response; Content:... 110 \n", "624 Title: My favorite and only MUSTARD; Content:... 80 \n", "625 Title: My furbabies LOVE these!; Content: Shak... 47 \n", "619 Title: got this for the daughter; Content: all... 50 \n", "999 Title: I love Maui Coffee!; Content: My first ... 118 \n", "\n", "[1000 rows x 7 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Get embeddings and save them for future reuse" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n", "\n", "# This may take a few minutes\n", "df[\"embedding\"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))\n", "df.to_csv(\"data/fine_food_reviews_with_embeddings_1k.csv\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python3 (GPT)", "language": "python", "name": "gpt" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "vscode": { "interpreter": { "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" } } }, "nbformat": 4, "nbformat_minor": 4 }