{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Ab4iZfp4eXPk" }, "source": [ "# Bike Rides and the Poisson Model\n", "\n", "To help the urban planners, you are asked to model the daily bike rides in NYC using [this dataset](https://gist.github.com/sachinsdate/c17931a3f000492c1c42cf78bf4ce9fe/archive/7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip). The dataset contains the date, the high and low temperatures, the precipitation, and the bike ride count as columns.\n", "\n" ] }, { "cell_type": "code", "source": [ "!wget https://gist.github.com/sachinsdate/c17931a3f000492c1c42cf78bf4ce9fe/archive/7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip\n", "!unzip 7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gM-y0BWye7WN", "outputId": "ba5d103a-ec03-42ec-8bf2-1094d2ed99ee" }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2023-02-26 21:10:52-- https://gist.github.com/sachinsdate/c17931a3f000492c1c42cf78bf4ce9fe/archive/7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip\n", "Resolving gist.github.com (gist.github.com)... 192.30.255.112\n", "Connecting to gist.github.com (gist.github.com)|192.30.255.112|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://codeload.github.com/gist/c17931a3f000492c1c42cf78bf4ce9fe/zip/7a5131d3f02575668b3c7e8c146b6a285acd2cd7 [following]\n", "--2023-02-26 21:10:53-- https://codeload.github.com/gist/c17931a3f000492c1c42cf78bf4ce9fe/zip/7a5131d3f02575668b3c7e8c146b6a285acd2cd7\n", "Resolving codeload.github.com (codeload.github.com)... 192.30.255.120\n", "Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: unspecified [application/zip]\n", "Saving to: ‘7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip’\n", "\n", "7a5131d3f02575668b3 [ <=> ] 2.56K --.-KB/s in 0s \n", "\n", "2023-02-26 21:10:53 (27.7 MB/s) - ‘7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip’ saved [2623]\n", "\n", "Archive: 7a5131d3f02575668b3c7e8c146b6a285acd2cd7.zip\n", "7a5131d3f02575668b3c7e8c146b6a285acd2cd7\n", " creating: c17931a3f000492c1c42cf78bf4ce9fe-7a5131d3f02575668b3c7e8c146b6a285acd2cd7/\n", " inflating: c17931a3f000492c1c42cf78bf4ce9fe-7a5131d3f02575668b3c7e8c146b6a285acd2cd7/nyc_bb_bicyclist_counts.csv \n" ] } ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "RxdI4hgDeXPr" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "np.random.seed(0)\n", "sns.set_theme(style='whitegrid', palette='pastel')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "source": [ "filename = \"c17931a3f000492c1c42cf78bf4ce9fe-7a5131d3f02575668b3c7e8c146b6a285acd2cd7/nyc_bb_bicyclist_counts.csv\"\n", "\n", "df = pd.read_csv(filename)\n", "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ImGoQW97gLQN", "outputId": "9e481eb5-d4f1-4d14-c8b2-d128abdab273" }, "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date HIGH_T LOW_T PRECIP BB_COUNT\n", "0 1-Apr-17 46.0 37.0 0.00 606\n", "1 2-Apr-17 62.1 41.0 0.00 2021\n", "2 3-Apr-17 63.0 50.0 0.03 2470\n", "3 4-Apr-17 51.1 46.0 1.18 723\n", "4 5-Apr-17 63.0 46.0 0.00 2807" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "metadata": { "id": "rdF7NfHKeXPo" }, "source": [ "## Maximum Likelihood I\n", "\n", "A natural choice of distribution is the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), which depends on a single parameter, λ, the average number of occurrences per interval. We want to estimate this parameter using Maximum Likelihood Estimation.\n", "\n", "Implement a Gradient Descent algorithm from scratch that estimates the Poisson parameter according to the Maximum Likelihood criterion. Plot the estimated mean vs. iterations to showcase convergence towards the actual (sample) mean.\n", "\n", "References:\n", "\n", "1. [This blog post](https://towardsdatascience.com/the-poisson-process-everything-you-need-to-know-322aa0ab9e9a).\n", "\n", "2. [This blog post](https://towardsdatascience.com/understanding-maximum-likelihood-estimation-fa495a03017a) and note the negative log likelihood function.\n" ] },
{ "cell_type": "markdown", "source": [ "Negative log-likelihood of the Poisson distribution for observations $x_1, \\dots, x_n$:\n", "\n", "$n \\lambda + \\sum_{i=1}^n \\ln(x_i!) - \\ln(\\lambda) \\sum_{i=1}^n x_i$\n", "\n", "Derivative of the negative log-likelihood with respect to $\\lambda$:\n", "\n", "$n - {1 \\over \\lambda} \\sum_{i=1}^n x_i$\n", "\n", "Setting this derivative to zero gives $\\lambda = {1 \\over n} \\sum_{i=1}^n x_i$, the sample mean, so gradient descent should converge to that value." ], "metadata": { "id": "_TdBnxVrlRYH" } },
{ "cell_type": "code", "source": [ "def gradient(l, x):\n", "    # Derivative of the Poisson negative log-likelihood with respect to lambda\n", "    n = len(x)\n", "    lambda_gradient = n - (1 / l) * np.sum(x)\n", "    return lambda_gradient\n", "\n", "def gradient_descent(x, l, learning_rate, max_iter):\n", "    lam = l\n", "    lams = []  # lambda before each update, kept for the convergence plot\n", "    for _ in range(max_iter):\n", "        g = gradient(lam, x)\n", "\n", "        lams.append(lam)\n", "        lam = lam - learning_rate * g  # step against the gradient\n", "    return lam, lams" ], "metadata": { "id": "noR1cKD4hS2c" }, "execution_count": 18, "outputs": [] },
{ "cell_type": "code", "source": [ "cyclists = np.array(df['BB_COUNT'])\n", "iterations = 100000\n", "alpha = 0.001  # learning rate\n", "lam = 1  # initial guess for lambda\n", "\n", "estimated, lams = gradient_descent(cyclists, lam, alpha, iterations)\n", "\n", "# Sample mean, i.e. the closed-form maximum likelihood estimate of lambda\n", "actual = (1 / len(cyclists)) * np.sum(cyclists)\n", "\n", "print(f\"Estimated mean={estimated}\")\n", "print(f\"Actual mean={actual}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WsUk4PlAmY7L", "outputId": "c4eea6f9-db9a-47cc-e856-128348dcc496" }, "execution_count": 39, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Estimated mean=2679.715270113102\n", "Actual mean=2680.042056074766\n" ] } ] },
{ "cell_type": "code", "source": [ "sns.lineplot(x=range(len(lams)), y=lams)\n", "plt.xlabel('Iterations')\n", "plt.ylabel('Estimated mean')\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 285 }, "id": "Dz-7upxEoKno", "outputId": "41370016-06be-4c4c-fa64-7a5a294fe3f6" }, "execution_count": 38, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "\n" }, "metadata": {} } ] },
{ "cell_type": "markdown", "metadata": { "id": "XINkmCdVeXPr" }, "source": [ "## Maximum Likelihood II\n", "\n", "A colleague of yours suggests that the parameter $\\lambda$ must itself depend on the weather and other factors, since people bike when it's not raining. Assume that you model $\\lambda$ as\n", "\n", "$$\\lambda_i = \\exp(\\mathbf w^T \\mathbf x_i)$$\n", "\n", "where $\\mathbf x_i$ is the feature vector of the $i$-th example and $\\mathbf w$ is a vector of parameters.\n", "\n", "Train the model with SGD under this assumption and compare the MSE of its predictions with that of the `Maximum Likelihood I` approach.\n", "\n", "You may want to use [this partial derivative of the log likelihood function](http://home.cc.umanitoba.ca/~godwinrt/7010/poissonregression.pdf)." ] },
{ "cell_type": "markdown", "source": [ "${\\partial l \\over \\partial \\beta} = \\sum_{i=1}^n (y_i - \\exp(X'_i \\beta)) X_i$\n", "\n", "Here $\\beta$ plays the role of $\\mathbf w$ and $X_i$ that of $\\mathbf x_i$. Because this is the gradient of the log-likelihood $l$ (not of its negative), the parameters are updated by stepping along the gradient rather than against it." ], "metadata": { "id": "ni9r4QxeK4L3" } },
{ "cell_type": "code", "source": [ "def s_gradient(x, y, w):\n", "    # Gradient of the Poisson log-likelihood: sum_i (y_i - exp(x_i . w)) * x_i\n", "    y_est = np.exp(np.dot(x, w))\n", "    return np.dot(x.T, y - y_est)\n", "\n", "def s_gradient_descent(learning_rate, max_iter):\n", "    w = np.zeros(4)\n", "    batch_size = 0.2  # fraction of the rows sampled at each step\n", "\n", "    for _ in range(max_iter):\n", "        samples = df.sample(frac=batch_size)\n", "        n = len(samples)\n", "\n", "        # Design matrix: intercept column followed by the three weather features\n", "        x = np.hstack([\n", "            np.ones((n, 1)),\n", "            np.array(samples['HIGH_T']).reshape(n, 1),\n", "            np.array(samples['LOW_T']).reshape(n, 1),\n", "            np.array(samples['PRECIP']).reshape(n, 1)\n", "        ])\n", "        y = np.array(samples['BB_COUNT'])\n", "        g = s_gradient(x, y, w)\n", "        # g is the gradient of the log-likelihood, so move along it (ascent)\n", "        w = w + learning_rate * g\n", "\n", "    return w" ], "metadata": { "id": "9e8m9A9jsrO2" }, "execution_count": 157, "outputs": [] },
{ "cell_type": "code", "source": [ "# NOTE: alpha and iterations are untuned; with unscaled temperatures np.exp can\n", "# overflow, so a smaller alpha or standardized features may be needed.\n", "iterations = 10\n", "alpha = 0.0001\n", "lam = 1000\n", "\n", "w = s_gradient_descent(alpha, iterations)\n", "# print(w)\n", "\n", "n = len(df)\n", "x = np.hstack([\n", "    np.ones((n, 1)),\n", "    np.array(df['HIGH_T']).reshape(n, 1),\n", "    np.array(df['LOW_T']).reshape(n, 1),\n", "    np.array(df['PRECIP']).reshape(n, 1)\n", "])\n", "# Predicted lambda_i = exp(w . x_i) for each day; x is (n, 4) and w is (4,)\n", "l = np.exp(np.dot(x, w))\n", "print(l)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 223 }, "id": "FMOiaAn7A86p", "outputId": "f07affe7-c717-424e-c5ed-75c9f388ca00" }, "execution_count": 167, "outputs": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", 
"language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.9" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "7d6993cb2f9ce9a59d5d7380609d9cb5192a9dedd2735a011418ad9e827eb538" } }, "colab": { "provenance": [] } }, "nbformat": 4, "nbformat_minor": 0 }