Viraj2307 committed on
Commit
c4751ad
·
1 Parent(s): 27083da
Notebooks/Product Recommendation System.ipynb CHANGED
@@ -2870,14 +2870,470 @@
2870
  "cell_type": "markdown",
2871
  "metadata": {},
2872
  "source": [
2873
- "The parameters i will use:/n/n\n",
2874
  "\n",
2875
- "window = 15: Defines the maximum distance between the current and predicted word within a sentence./n\n",
2876
- "sg = 1: Means the model will use the Skip-gram approach/n\n",
2877
- "hs = 0: Indicates that hierarchical softmax is not used because there arn't large vocabularies./n\n",
2878
- "negative=10: Sets the number of negative samples to 10./n\n",
2879
- "alpha=0.03: Set learning rate for the process to 0.03./n\n",
2880
- "min_alpha=0.0007: Sets the minimum learning rate to 0.0007./n"
2881
  ]
2882
  },
2883
  {
 
2870
  "cell_type": "markdown",
2871
  "metadata": {},
2872
  "source": [
2873
+ "The parameters we will use:\n",
2874
  "\n",
2875
+ "- window = 15: Defines the maximum distance between the current and predicted word within a sentence.\n",
2876
+ "\n",
2877
+ "- sg = 1: Means the model will use the Skip-gram approach.\n",
2878
+ "- hs = 0: Indicates that hierarchical softmax is not used because the vocabulary is not large; negative sampling is used instead.\n",
2879
+ "- negative = 10: Sets the number of negative samples to 10.\n",
2880
+ "- alpha = 0.03: Sets the initial learning rate to 0.03.\n",
2881
+ "- min_alpha = 0.0007: Sets the minimum learning rate, which alpha decays to during training, to 0.0007."
2882
+ ]
2883
+ },
2884
+ {
2885
+ "cell_type": "code",
2886
+ "execution_count": 22,
2887
+ "metadata": {},
2888
+ "outputs": [
2889
+ {
2890
+ "data": {
2891
+ "text/plain": [
2892
+ "(3525897, 3561890)"
2893
+ ]
2894
+ },
2895
+ "execution_count": 22,
2896
+ "metadata": {},
2897
+ "output_type": "execute_result"
2898
+ }
2899
+ ],
2900
+ "source": [
2901
+ "model = Word2Vec(window = 15, sg = 1, hs = 0,\n",
2902
+ " negative = 10, \n",
2903
+ " alpha=0.03, min_alpha=0.0007,\n",
2904
+ " seed = 121)\n",
2905
+ "\n",
2906
+ "model.build_vocab(purchases_train, progress_per=200)\n",
2907
+ "\n",
2908
+ "model.train(purchases_train, total_examples = model.corpus_count, \n",
2909
+ " epochs=10, report_delay=1)"
2910
+ ]
2911
+ },
2912
+ {
2913
+ "cell_type": "code",
2914
+ "execution_count": 23,
2915
+ "metadata": {},
2916
+ "outputs": [
2917
+ {
2918
+ "name": "stdout",
2919
+ "output_type": "stream",
2920
+ "text": [
2921
+ "Word2Vec<vocab=3146, vector_size=100, alpha=0.03>\n"
2922
+ ]
2923
+ }
2924
+ ],
2925
+ "source": [
2926
+ "print(model)"
2927
+ ]
2928
+ },
2929
+ {
2930
+ "cell_type": "markdown",
2931
+ "metadata": {},
2932
+ "source": [
2933
+ "The model has a vocabulary of 3,146 unique products, each represented by a vector of size 100.\n",
2934
+ "\n"
2935
+ ]
2936
+ },
2937
+ {
2938
+ "cell_type": "markdown",
2939
+ "metadata": {},
2940
+ "source": [
2941
+ "Extracting the vectors of all the words"
2942
+ ]
2943
+ },
2944
+ {
2945
+ "cell_type": "code",
2946
+ "execution_count": 24,
2947
+ "metadata": {},
2948
+ "outputs": [
2949
+ {
2950
+ "data": {
2951
+ "text/plain": [
2952
+ "(3146, 100)"
2953
+ ]
2954
+ },
2955
+ "execution_count": 24,
2956
+ "metadata": {},
2957
+ "output_type": "execute_result"
2958
+ }
2959
+ ],
2960
+ "source": [
2961
+ "X = model.wv[model.wv.index_to_key]\n",
2962
+ "\n",
2963
+ "X.shape"
2964
+ ]
2965
+ },
2966
+ {
2967
+ "cell_type": "markdown",
2968
+ "metadata": {},
2969
+ "source": [
2970
+ "Visualizing the model"
2971
+ ]
2972
+ },
2973
+ {
2974
+ "cell_type": "markdown",
2975
+ "metadata": {},
2976
+ "source": [
2977
+ "We have 100-dimensional embeddings. We can't visualize even 4 dimensions, let alone 100, so we are going to reduce the dimensions from 100 to 2 by using the UMAP algorithm."
2978
+ ]
2979
+ },
2980
+ {
2981
+ "cell_type": "code",
2982
+ "execution_count": 25,
2983
+ "metadata": {},
2984
+ "outputs": [
2985
+ {
2986
+ "name": "stderr",
2987
+ "output_type": "stream",
2988
+ "text": [
2989
+ "\n",
2990
+ "[notice] A new release of pip is available: 24.1.2 -> 24.3.1\n",
2991
+ "[notice] To update, run: python.exe -m pip install --upgrade pip\n"
2992
+ ]
2993
+ },
2994
+ {
2995
+ "name": "stdout",
2996
+ "output_type": "stream",
2997
+ "text": [
2998
+ "Collecting umap-learn\n",
2999
+ " Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)\n",
3000
+ "Requirement already satisfied: numpy>=1.17 in c:\\python3.11.1\\lib\\site-packages (from umap-learn) (1.26.4)\n",
3001
+ "Requirement already satisfied: scipy>=1.3.1 in c:\\python3.11.1\\lib\\site-packages (from umap-learn) (1.12.0)\n",
3002
+ "Requirement already satisfied: scikit-learn>=0.22 in c:\\python3.11.1\\lib\\site-packages (from umap-learn) (1.4.1.post1)\n",
3003
+ "Requirement already satisfied: numba>=0.51.2 in c:\\python3.11.1\\lib\\site-packages (from umap-learn) (0.59.1)\n",
3004
+ "Collecting pynndescent>=0.5 (from umap-learn)\n",
3005
+ " Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)\n",
3006
+ "Requirement already satisfied: tqdm in c:\\python3.11.1\\lib\\site-packages (from umap-learn) (4.66.2)\n",
3007
+ "Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in c:\\python3.11.1\\lib\\site-packages (from numba>=0.51.2->umap-learn) (0.42.0)\n",
3008
+ "Requirement already satisfied: joblib>=0.11 in c:\\python3.11.1\\lib\\site-packages (from pynndescent>=0.5->umap-learn) (1.3.2)\n",
3009
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\python3.11.1\\lib\\site-packages (from scikit-learn>=0.22->umap-learn) (3.3.0)\n",
3010
+ "Requirement already satisfied: colorama in c:\\python3.11.1\\lib\\site-packages (from tqdm->umap-learn) (0.4.6)\n",
3011
+ "Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)\n",
3012
+ " ---------------------------------------- 0.0/88.8 kB ? eta -:--:--\n",
3013
+ " ---------------------------------------- 88.8/88.8 kB 2.5 MB/s eta 0:00:00\n",
3014
+ "Downloading pynndescent-0.5.13-py3-none-any.whl (56 kB)\n",
3015
+ " ---------------------------------------- 0.0/56.9 kB ? eta -:--:--\n",
3016
+ " ---------------------------------------- 56.9/56.9 kB 2.9 MB/s eta 0:00:00\n",
3017
+ "Installing collected packages: pynndescent, umap-learn\n",
3018
+ "Successfully installed pynndescent-0.5.13 umap-learn-0.5.7\n"
3019
+ ]
3020
+ }
3021
+ ],
3022
+ "source": [
3023
+ "!pip install umap-learn"
3024
+ ]
3025
+ },
3026
+ {
3027
+ "cell_type": "code",
3028
+ "execution_count": 26,
3029
+ "metadata": {},
3030
+ "outputs": [
3031
+ {
3032
+ "data": {
3033
+ "image/png": "",
3034
+ "text/plain": [
3035
+ "<Figure size 1000x900 with 1 Axes>"
3036
+ ]
3037
+ },
3038
+ "metadata": {},
3039
+ "output_type": "display_data"
3040
+ }
3041
+ ],
3042
+ "source": [
3043
+ "import umap.umap_ as umap\n",
3044
+ "import warnings\n",
3045
+ "warnings.filterwarnings('ignore')\n",
3046
+ "\n",
3047
+ "cluster_embedding = umap.UMAP(n_neighbors=30, min_dist=0.0,\n",
3048
+ " n_components=2, random_state=42).fit_transform(X)\n",
3049
+ "\n",
3050
+ "plt.figure(figsize=(10,9))\n",
3051
+ "plt.scatter(cluster_embedding[:, 0], cluster_embedding[:, 1], s=3);"
3052
+ ]
3053
+ },
3054
+ {
3055
+ "cell_type": "markdown",
3056
+ "metadata": {},
3057
+ "source": [
3058
+ "Every dot in this plot is a product. There are several tiny clusters representing similar products."
3059
+ ]
3060
+ },
3061
+ {
3062
+ "cell_type": "markdown",
3063
+ "metadata": {},
3064
+ "source": [
3065
+ "Using the Product Recommendation System"
3066
+ ]
3067
+ },
3068
+ {
3069
+ "cell_type": "markdown",
3070
+ "metadata": {},
3071
+ "source": [
3072
+ "Now, our next step is to suggest similar products for a certain product or a product’s vector.\n",
3073
+ "\n",
3074
+ "Let's create a product-ID and product-description dictionary to easily map a product’s description to its ID and vice versa."
3075
+ ]
3076
+ },
3077
+ {
3078
+ "cell_type": "code",
3079
+ "execution_count": 28,
3080
+ "metadata": {},
3081
+ "outputs": [],
3082
+ "source": [
3083
+ "products = train_df[[\"StockCode\", \"Description\"]]\n",
3084
+ "\n",
3085
+ "# remove duplicates, keeping the last description seen for each StockCode\n",
3086
+ "products = products.drop_duplicates(subset='StockCode', keep=\"last\")\n",
3087
+ "\n",
3088
+ "products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()"
3089
+ ]
3090
+ },
3091
+ {
3092
+ "cell_type": "code",
3093
+ "execution_count": 29,
3094
+ "metadata": {},
3095
+ "outputs": [
3096
+ {
3097
+ "data": {
3098
+ "text/plain": [
3099
+ "['RED WOOLLY HOTTIE WHITE HEART.']"
3100
+ ]
3101
+ },
3102
+ "execution_count": 29,
3103
+ "metadata": {},
3104
+ "output_type": "execute_result"
3105
+ }
3106
+ ],
3107
+ "source": [
3108
+ "products_dict['84029E']"
3109
+ ]
3110
+ },
3111
+ {
3112
+ "cell_type": "markdown",
3113
+ "metadata": {},
3114
+ "source": [
3115
+ "Create a function to return top 5 similar products by default:"
3116
+ ]
3117
+ },
3118
+ {
3119
+ "cell_type": "code",
3120
+ "execution_count": 30,
3121
+ "metadata": {},
3122
+ "outputs": [],
3123
+ "source": [
3124
+ "def similar_products(v, n = 5):\n",
3125
+ "    \n",
3126
+ "    # extract most similar products; the first hit is dropped, assuming it is the query product itself\n",
3127
+ "    ms = model.wv.most_similar([v], topn= n+1)[1:]\n",
3128
+ "    \n",
3129
+ "    # extract name and similarity score\n",
3130
+ "    new_ms = []\n",
3131
+ "    for j in ms:\n",
3132
+ "        pair = (products_dict[j[0]][0], j[1])\n",
3133
+ "        new_ms.append(pair)\n",
3134
+ "    \n",
3135
+ "    return new_ms "
3136
+ ]
3137
+ },
3138
+ {
3139
+ "cell_type": "code",
3140
+ "execution_count": 31,
3141
+ "metadata": {},
3142
+ "outputs": [
3143
+ {
3144
+ "data": {
3145
+ "text/plain": [
3146
+ "['CHOCOLATE HOT WATER BOTTLE']"
3147
+ ]
3148
+ },
3149
+ "execution_count": 31,
3150
+ "metadata": {},
3151
+ "output_type": "execute_result"
3152
+ }
3153
+ ],
3154
+ "source": [
3155
+ "products_dict['22112']"
3156
+ ]
3157
+ },
3158
+ {
3159
+ "cell_type": "code",
3160
+ "execution_count": 32,
3161
+ "metadata": {},
3162
+ "outputs": [
3163
+ {
3164
+ "data": {
3165
+ "text/plain": [
3166
+ "[('HOT WATER BOTTLE I AM SO POORLY', 0.8668386936187744),\n",
3167
+ " ('HOT WATER BOTTLE TEA AND SYMPATHY', 0.8468259572982788),\n",
3168
+ " ('RETROSPOT HEART HOT WATER BOTTLE', 0.7833042740821838),\n",
3169
+ " ('HOT WATER BOTTLE KEEP CALM', 0.760246217250824),\n",
3170
+ " ('SCOTTIE DOG HOT WATER BOTTLE', 0.757988452911377)]"
3171
+ ]
3172
+ },
3173
+ "execution_count": 32,
3174
+ "metadata": {},
3175
+ "output_type": "execute_result"
3176
+ }
3177
+ ],
3178
+ "source": [
3179
+ "similar_products(model.wv['22112'])"
3180
+ ]
3181
+ },
3182
+ {
3183
+ "cell_type": "markdown",
3184
+ "metadata": {},
3185
+ "source": [
3186
+ "The results are pretty relevant and match well with the input product.\n",
3187
+ "\n",
3188
+ "BUT \n",
3189
+ "\n",
3190
+ "what we need is to recommend products based on the multiple purchases the user has made in the past. Let's see.\n",
3191
+ "\n",
3192
+ "One simple solution is to take the average of all the vectors of the products the user has bought so far and use this vector to find similar products."
3193
+ ]
3194
+ },
3195
+ {
3196
+ "cell_type": "code",
3197
+ "execution_count": 33,
3198
+ "metadata": {},
3199
+ "outputs": [],
3200
+ "source": [
3201
+ "def aggregate_vectors(products):\n",
3202
+ "    product_vec = []\n",
3203
+ "    for i in products:\n",
3204
+ "        try:\n",
3205
+ "            product_vec.append(model.wv[i])\n",
3206
+ "        except KeyError:\n",
3207
+ "            continue\n",
3208
+ "    \n",
3209
+ "    return np.mean(product_vec, axis=0)"
3210
+ ]
3211
+ },
3212
+ {
3213
+ "cell_type": "markdown",
3214
+ "metadata": {},
3215
+ "source": [
3216
+ "Using the validation list of purchase sequences."
3217
+ ]
3218
+ },
3219
+ {
3220
+ "cell_type": "code",
3221
+ "execution_count": 34,
3222
+ "metadata": {},
3223
+ "outputs": [
3224
+ {
3225
+ "data": {
3226
+ "text/plain": [
3227
+ "(242, 123)"
3228
+ ]
3229
+ },
3230
+ "execution_count": 34,
3231
+ "metadata": {},
3232
+ "output_type": "execute_result"
3233
+ }
3234
+ ],
3235
+ "source": [
3236
+ "len(purchases_val[0]), len(purchases_val[1])"
3237
+ ]
3238
+ },
3239
+ {
3240
+ "cell_type": "markdown",
3241
+ "metadata": {},
3242
+ "source": [
3243
+ "The length of the first list of products purchased by a user is 242 and the second is 123. We will pass these product sequences from the validation set to the function aggregate_vectors, which returns a 100-dimensional array for each."
3244
+ ]
3245
+ },
3246
+ {
3247
+ "cell_type": "code",
3248
+ "execution_count": 35,
3249
+ "metadata": {},
3250
+ "outputs": [
3251
+ {
3252
+ "data": {
3253
+ "text/plain": [
3254
+ "((100,), (100,))"
3255
+ ]
3256
+ },
3257
+ "execution_count": 35,
3258
+ "metadata": {},
3259
+ "output_type": "execute_result"
3260
+ }
3261
+ ],
3262
+ "source": [
3263
+ "aggregate_vectors(purchases_val[0]).shape, aggregate_vectors(purchases_val[1]).shape"
3264
+ ]
3265
+ },
3266
+ {
3267
+ "cell_type": "markdown",
3268
+ "metadata": {},
3269
+ "source": [
3270
+ "Now we can use this result to get the most similar products:"
3271
+ ]
3272
+ },
3273
+ {
3274
+ "cell_type": "code",
3275
+ "execution_count": 37,
3276
+ "metadata": {},
3277
+ "outputs": [
3278
+ {
3279
+ "data": {
3280
+ "text/plain": [
3281
+ "[('RIBBON REEL CHRISTMAS SOCK BAUBLE', 0.7117326259613037),\n",
3282
+ " ('RIBBON REEL SNOWY VILLAGE', 0.6936593651771545),\n",
3283
+ " ('RIBBON REEL SOCKS AND MITTENS', 0.6842150092124939),\n",
3284
+ " ('RIBBON REEL CHRISTMAS PRESENT ', 0.6820096373558044),\n",
3285
+ " ('RIBBON REEL MAKING SNOWMEN ', 0.6784197092056274)]"
3286
+ ]
3287
+ },
3288
+ "execution_count": 37,
3289
+ "metadata": {},
3290
+ "output_type": "execute_result"
3291
+ }
3292
+ ],
3293
+ "source": [
3294
+ "similar_products(aggregate_vectors(purchases_val[11]))"
3295
+ ]
3296
+ },
3297
+ {
3298
+ "cell_type": "markdown",
3299
+ "metadata": {},
3300
+ "source": [
3301
+ "WOOOW!! IT'S COOLL\n",
3302
+ "\n",
3303
+ "The model recommended highly relevant products based on the user's past purchases. We can just as easily get product suggestions based on the last few purchases."
3304
+ ]
3305
+ },
3306
+ {
3307
+ "cell_type": "code",
3308
+ "execution_count": 38,
3309
+ "metadata": {},
3310
+ "outputs": [
3311
+ {
3312
+ "data": {
3313
+ "text/plain": [
3314
+ "[('ASSORTED COLOUR MINI CASES', 0.681169331073761),\n",
3315
+ " ('BAKING SET 9 PIECE RETROSPOT ', 0.6694015264511108),\n",
3316
+ " ('RETROSPOT TEA SET CERAMIC 11 PC ', 0.6445483565330505),\n",
3317
+ " ('TRADITIONAL KNITTING NANCY', 0.6320570111274719),\n",
3318
+ " ('6 RIBBONS RUSTIC CHARM', 0.6178888082504272)]"
3319
+ ]
3320
+ },
3321
+ "execution_count": 38,
3322
+ "metadata": {},
3323
+ "output_type": "execute_result"
3324
+ }
3325
+ ],
3326
+ "source": [
3327
+ "similar_products(aggregate_vectors(purchases_val[11][-10:]))"
3328
+ ]
3329
+ },
3330
+ {
3331
+ "cell_type": "markdown",
3332
+ "metadata": {},
3333
+ "source": [
3334
+ "Coool!\n",
3335
+ "\n",
3336
+ "The customer seems to be interested in Home Decor recently."
3337
  ]
3338
  },
3339
  {
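Note on the notebook change above: the basket-level recommendation averages the vectors of the purchased products and then looks up the nearest products. A minimal, dependency-free sketch of that idea follows. The names `TOY_VECTORS`, `mean_vector`, `cosine`, and `recommend` are hypothetical and stand in for `model.wv`, `aggregate_vectors`, and `similar_products`; the 2-dimensional vectors are made up for illustration. One assumption that differs from the notebook: instead of dropping the first hit, this sketch filters out products already in the basket and returns an empty list when no product is in the vocabulary.

```python
import math

# Hypothetical toy embeddings standing in for model.wv (product code -> vector).
TOY_VECTORS = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.1, 0.9],
}


def mean_vector(codes, vectors):
    """Average the vectors of the known codes; return None if none are known."""
    known = [vectors[c] for c in codes if c in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(codes, vectors, n=5):
    """Rank products not already in `codes` by similarity to the basket mean."""
    query = mean_vector(codes, vectors)
    if query is None:
        return []
    seen = set(codes)
    scored = [(c, cosine(query, v)) for c, v in vectors.items() if c not in seen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# "unknown" is not in the vocabulary and is skipped, as in aggregate_vectors.
recs = recommend(["A", "unknown"], TOY_VECTORS, n=2)
print(recs)
```

Filtering by the basket contents, rather than slicing off the first result, may matter for averaged queries: the nearest product to an average is not necessarily one the user already bought, so dropping it unconditionally could discard the best suggestion.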
app.py CHANGED
@@ -148,7 +148,10 @@ basket = (
148
  .applymap(lambda x: 1 if x > 0 else 0)
149
  )
150
 
151
  frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
152
  if not frequent_itemsets.empty:
153
  rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
154
 
 
148
  .applymap(lambda x: 1 if x > 0 else 0)
149
  )
150
 
151
+ # Generate frequent itemsets
152
  frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
153
+
154
+ # Generate association rules
155
  if not frequent_itemsets.empty:
156
  rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
157
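Note on the app.py hunk above: the rules are filtered on lift with a threshold of 1, which, as far as I understand the metric, keeps item pairs that co-occur at least as often as independence would predict. A minimal sketch of how support and lift are computed, in plain Python, follows. The `baskets` data and the helper names `support` and `lift` are hypothetical and are not part of app.py or of the library it calls.

```python
# Hypothetical toy transactions; each set is one basket.
baskets = [
    {"tea", "mug"},
    {"tea", "mug", "spoon"},
    {"tea"},
    {"mug"},
    {"spoon"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def lift(antecedent, consequent, baskets):
    """support(A and C) / (support(A) * support(C)); 1.0 means independence."""
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / (support(antecedent, baskets) * support(consequent, baskets))


tea_mug_support = support({"tea", "mug"}, baskets)
tea_mug_lift = lift({"tea"}, {"mug"}, baskets)
print(tea_mug_support, tea_mug_lift)
```

In this toy data, tea and mug each appear in 3 of 5 baskets and together in 2 of 5, so the lift is 0.4 / (0.6 * 0.6), slightly above 1. With `min_support=0.05`, an itemset would need to appear in at least 5% of baskets before it is considered at all.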