{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load the dataset\n",
"\n",
"The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of this dataset, consisting of 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).\n",
"\n",
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy."
]
},
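{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal installation sketch for the packages listed above. Versions are left unpinned except for `openai`, which is kept below 1.0 because this notebook imports `openai.embeddings_utils`; adjust the command to your environment as needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# install the packages used in this notebook (sketch; adjust versions as needed)\n",
"%pip install pandas \"openai<1.0\" transformers plotly matplotlib scikit-learn torch torchvision scipy tiktoken\n"
]
},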
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import pandas as pd\n",
"import tiktoken\n",
"from openai.embeddings_utils import get_embedding\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# embedding model parameters\n",
"embedding_model = \"text-embedding-ada-002\"\n",
"embedding_encoding = \"cl100k_base\" # this the encoding for text-embedding-ada-002\n",
"max_tokens = 8000 # the maximum for text-embedding-ada-002 is 8191\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Time</th>\n",
" <th>ProductId</th>\n",
" <th>UserId</th>\n",
" <th>Score</th>\n",
" <th>Summary</th>\n",
" <th>Text</th>\n",
" <th>combined</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1351123200</td>\n",
" <td>B003XPF9BO</td>\n",
" <td>A3R7JR3FMEBXQB</td>\n",
" <td>5</td>\n",
" <td>where does one start...and stop... with a tre...</td>\n",
" <td>Wanted to save some to bring to my Chicago fam...</td>\n",
" <td>Title: where does one start...and stop... wit...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1351123200</td>\n",
" <td>B003JK537S</td>\n",
" <td>A3JBPC3WFUT5ZP</td>\n",
" <td>1</td>\n",
" <td>Arrived in pieces</td>\n",
" <td>Not pleased at all. When I opened the box, mos...</td>\n",
" <td>Title: Arrived in pieces; Content: Not pleased...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Time ProductId UserId Score \\\n",
"0 1351123200 B003XPF9BO A3R7JR3FMEBXQB 5 \n",
"1 1351123200 B003JK537S A3JBPC3WFUT5ZP 1 \n",
"\n",
" Summary \\\n",
"0 where does one start...and stop... with a tre... \n",
"1 Arrived in pieces \n",
"\n",
" Text \\\n",
"0 Wanted to save some to bring to my Chicago fam... \n",
"1 Not pleased at all. When I opened the box, mos... \n",
"\n",
" combined \n",
"0 Title: where does one start...and stop... wit... \n",
"1 Title: Arrived in pieces; Content: Not pleased... "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load & inspect dataset\n",
"input_datapath = \"data/fine_food_reviews_1k.csv\" # to save space, we provide a pre-filtered dataset\n",
"df = pd.read_csv(input_datapath, index_col=0)\n",
"df = df[[\"Time\", \"ProductId\", \"UserId\", \"Score\", \"Summary\", \"Text\"]]\n",
"df = df.dropna()\n",
"df[\"combined\"] = (\n",
" \"Title: \" + df.Summary.str.strip() + \"; Content: \" + df.Text.str.strip()\n",
")\n",
"df.head(2)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# subsample to 1k most recent reviews and remove samples that are too long\n",
"top_n = 1000\n",
"df = df.sort_values(\"Time\").tail(top_n * 2) # first cut to first 2k entries, assuming less than half will be filtered out\n",
"df.drop(\"Time\", axis=1, inplace=True)\n",
"\n",
"encoding = tiktoken.get_encoding(embedding_encoding)\n",
"\n",
"# omit reviews that are too long to embed\n",
"df[\"n_tokens\"] = df.combined.apply(lambda x: len(encoding.encode(x)))\n",
"df = df[df.n_tokens <= max_tokens].tail(top_n)\n",
"len(df)\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ProductId</th>\n",
" <th>UserId</th>\n",
" <th>Score</th>\n",
" <th>Summary</th>\n",
" <th>Text</th>\n",
" <th>combined</th>\n",
" <th>n_tokens</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>B003XPF9BO</td>\n",
" <td>A3R7JR3FMEBXQB</td>\n",
" <td>5</td>\n",
" <td>where does one start...and stop... with a tre...</td>\n",
" <td>Wanted to save some to bring to my Chicago fam...</td>\n",
" <td>Title: where does one start...and stop... wit...</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>297</th>\n",
" <td>B003VXHGPK</td>\n",
" <td>A21VWSCGW7UUAR</td>\n",
" <td>4</td>\n",
" <td>Good, but not Wolfgang Puck good</td>\n",
" <td>Honestly, I have to admit that I expected a li...</td>\n",
" <td>Title: Good, but not Wolfgang Puck good; Conte...</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>296</th>\n",
" <td>B008JKTTUA</td>\n",
" <td>A34XBAIFT02B60</td>\n",
" <td>1</td>\n",
" <td>Should advertise coconut as an ingredient more...</td>\n",
" <td>First, these should be called Mac - Coconut ba...</td>\n",
" <td>Title: Should advertise coconut as an ingredie...</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>295</th>\n",
" <td>B000LKTTTW</td>\n",
" <td>A14MQ40CCU8B13</td>\n",
" <td>5</td>\n",
" <td>Best tomato soup</td>\n",
" <td>I have a hard time finding packaged food of an...</td>\n",
" <td>Title: Best tomato soup; Content: I have a har...</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>294</th>\n",
" <td>B001D09KAM</td>\n",
" <td>A34XBAIFT02B60</td>\n",
" <td>1</td>\n",
" <td>Should advertise coconut as an ingredient more...</td>\n",
" <td>First, these should be called Mac - Coconut ba...</td>\n",
" <td>Title: Should advertise coconut as an ingredie...</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>623</th>\n",
" <td>B0000CFXYA</td>\n",
" <td>A3GS4GWPIBV0NT</td>\n",
" <td>1</td>\n",
" <td>Strange inflammation response</td>\n",
" <td>Truthfully wasn't crazy about the taste of the...</td>\n",
" <td>Title: Strange inflammation response; Content:...</td>\n",
" <td>110</td>\n",
" </tr>\n",
" <tr>\n",
" <th>624</th>\n",
" <td>B0001BH5YM</td>\n",
" <td>A1BZ3HMAKK0NC</td>\n",
" <td>5</td>\n",
" <td>My favorite and only MUSTARD</td>\n",
" <td>You've just got to experience this mustard... ...</td>\n",
" <td>Title: My favorite and only MUSTARD; Content:...</td>\n",
" <td>80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>625</th>\n",
" <td>B0009ET7TC</td>\n",
" <td>A2FSDQY5AI6TNX</td>\n",
" <td>5</td>\n",
" <td>My furbabies LOVE these!</td>\n",
" <td>Shake the container and they come running. Eve...</td>\n",
" <td>Title: My furbabies LOVE these!; Content: Shak...</td>\n",
" <td>47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>619</th>\n",
" <td>B007PA32L2</td>\n",
" <td>A15FF2P7RPKH6G</td>\n",
" <td>5</td>\n",
" <td>got this for the daughter</td>\n",
" <td>all i have heard since she got a kuerig is why...</td>\n",
" <td>Title: got this for the daughter; Content: all...</td>\n",
" <td>50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td>B001EQ5GEO</td>\n",
" <td>A3VYU0VO6DYV6I</td>\n",
" <td>5</td>\n",
" <td>I love Maui Coffee!</td>\n",
" <td>My first experience with Maui Coffee was bring...</td>\n",
" <td>Title: I love Maui Coffee!; Content: My first ...</td>\n",
" <td>118</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" ProductId UserId Score \\\n",
"0 B003XPF9BO A3R7JR3FMEBXQB 5 \n",
"297 B003VXHGPK A21VWSCGW7UUAR 4 \n",
"296 B008JKTTUA A34XBAIFT02B60 1 \n",
"295 B000LKTTTW A14MQ40CCU8B13 5 \n",
"294 B001D09KAM A34XBAIFT02B60 1 \n",
".. ... ... ... \n",
"623 B0000CFXYA A3GS4GWPIBV0NT 1 \n",
"624 B0001BH5YM A1BZ3HMAKK0NC 5 \n",
"625 B0009ET7TC A2FSDQY5AI6TNX 5 \n",
"619 B007PA32L2 A15FF2P7RPKH6G 5 \n",
"999 B001EQ5GEO A3VYU0VO6DYV6I 5 \n",
"\n",
" Summary \\\n",
"0 where does one start...and stop... with a tre... \n",
"297 Good, but not Wolfgang Puck good \n",
"296 Should advertise coconut as an ingredient more... \n",
"295 Best tomato soup \n",
"294 Should advertise coconut as an ingredient more... \n",
".. ... \n",
"623 Strange inflammation response \n",
"624 My favorite and only MUSTARD \n",
"625 My furbabies LOVE these! \n",
"619 got this for the daughter \n",
"999 I love Maui Coffee! \n",
"\n",
" Text \\\n",
"0 Wanted to save some to bring to my Chicago fam... \n",
"297 Honestly, I have to admit that I expected a li... \n",
"296 First, these should be called Mac - Coconut ba... \n",
"295 I have a hard time finding packaged food of an... \n",
"294 First, these should be called Mac - Coconut ba... \n",
".. ... \n",
"623 Truthfully wasn't crazy about the taste of the... \n",
"624 You've just got to experience this mustard... ... \n",
"625 Shake the container and they come running. Eve... \n",
"619 all i have heard since she got a kuerig is why... \n",
"999 My first experience with Maui Coffee was bring... \n",
"\n",
" combined n_tokens \n",
"0 Title: where does one start...and stop... wit... 52 \n",
"297 Title: Good, but not Wolfgang Puck good; Conte... 178 \n",
"296 Title: Should advertise coconut as an ingredie... 78 \n",
"295 Title: Best tomato soup; Content: I have a har... 111 \n",
"294 Title: Should advertise coconut as an ingredie... 78 \n",
".. ... ... \n",
"623 Title: Strange inflammation response; Content:... 110 \n",
"624 Title: My favorite and only MUSTARD; Content:... 80 \n",
"625 Title: My furbabies LOVE these!; Content: Shak... 47 \n",
"619 Title: got this for the daughter; Content: all... 50 \n",
"999 Title: I love Maui Coffee!; Content: My first ... 118 \n",
"\n",
"[1000 rows x 7 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Get embeddings and save them for future reuse"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
"\n",
"# This may take a few minutes\n",
"df[\"embedding\"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))\n",
"df.to_csv(\"data/fine_food_reviews_with_embeddings_1k.csv\")\n"
]
},
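{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `to_csv` serializes the embedding column as a string. The cell below is a minimal sketch (not part of the pipeline above) showing how to load the saved file later and parse the embeddings back into numeric arrays for reuse."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: reload the saved embeddings for reuse\n",
"import numpy as np\n",
"from ast import literal_eval\n",
"\n",
"df = pd.read_csv(\"data/fine_food_reviews_with_embeddings_1k.csv\", index_col=0)\n",
"# the embedding column was saved as a string; convert it back to numpy arrays\n",
"df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array)\n"
]
}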
],
"metadata": {
"kernelspec": {
"display_name": "Python3 (GPT)",
"language": "python",
"name": "gpt"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}