{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering\n", "\n", "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 1536)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# imports\n", "import numpy as np\n", "import pandas as pd\n", "from ast import literal_eval\n", "\n", "# load data\n", "datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n", "\n", "df = pd.read_csv(datafile_path)\n", "df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array) # convert string to numpy array\n", "matrix = np.vstack(df.embedding.values)\n", "matrix.shape\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Unnamed: 0 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 52 | [0.007018072064965963, -0.02731654793024063, 0... |
| 1 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not Wolfgang Puck good | Honestly, I have to admit that I expected a li... | Title: Good, but not Wolfgang Puck good; Conte... | 178 | [-0.003140551969408989, -0.009995664469897747,... |
| 2 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 | [-0.01757248118519783, -8.266511576948687e-05,... |
| 3 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato soup | I have a hard time finding packaged food of an... | Title: Best tomato soup; Content: I have a har... | 111 | [-0.0013932279543951154, -0.011112828738987446... |
| 4 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 | [-0.01757248118519783, -8.266511576948687e-05,... |