{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import scipy as sp\n",
"import scipy.stats as st\n",
"import scipy.linalg as la\n",
"from math import sqrt\n",
"import pandas as pd\n",
"from pandas import DataFrame\n",
"import re\n",
"print(\"Modules Imported!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear Regression\n",
"In this lab we will apply linear regression to real and synthetic data and study Stochastic Gradient Descent.\n",
"\n",
"If you are not familiar with matrix operations, review [this page](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Elementary_Matrix_Operations_In_Python?lang=en) befor you start. \n",
"\n",
"Also, for matrix inversion, we will use [`numpy.linalg`](https://docs.scipy.org/doc/numpy/reference/routines.linalg.html), which we imported with \n",
"``` python\n",
"import numpy.linalg as la\n",
"```\n",
"In particular, you will find [`numpy.linalg.inv`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv) useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimating Box Office Gross\n",
"(40 pts) We will use opening weekend gross to estimate total gross based on data for 2017 (until mid september). First, we will use `pandas` to read a csv file in to an object called a dataframe. (`pandas` can do a lot of things, but here we only use it to read from a file.) \n",
"We use panda as:\n",
"```\n",
"df = pd.read_csv('BoxOffice2017.csv')\n",
"```\n",
"But it can also take other arguments. For example, if the names of the columns are not given in the file, we can use `names=[...]` or we can set the delimiter to be any whitespace instead of just comma with `delim_whitespace=True`, as shown below:\n",
"```\n",
"df = pd.DataFrame(raw_data, names = ['first_name', 'last_name', 'age', 'education','income'], delim_whitespace=True)\n",
"```\n",
"We then print the first few rows to check that everything has loaded properly, and check the data type for each column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('BoxOffice2017.csv')\n",
"df.head() # The first few rows of the dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"Next we extract data from the dataframe. Our input variable is the Openning Gross and our output is Total Gross. The data needs some preprocessing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"Y = df[['TotalGross']].values # extract grosses\n",
"# remove ',' and '$' and convert to float\n",
"# Y is a two dimensional array, so we use a two for loops\n",
"for i in range(np.shape(Y)[0]): \n",
" for j in range(np.shape(Y)[1]):\n",
" Y[i,j] = float(re.sub('[$,]','',Y[i,j])) \n",
"Y = Y.astype(float)\n",
"\n",
"\n",
"X = df[['OpeningGross']].values # if we need multiple columns we can write, e.g., X = df[['OpeningGross','Theaters']].values\n",
"# solution\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"___\n",
"1. Plot the input and output in a scatter plot.\n",
"2. Find $\\theta^*$ minimizing the RMSE.\n",
"3. Compute the RMSE.\n",
"4. Plot the line $X\\theta^*$ on the same plot as the input-output data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Maximum likelihood with synthetic data ##\n",
"(30 pts) This time we will use synthetic data. The data will be multivariate normal and so minimizing RMSE is the same as maximum likelihood estimation.\n",
"\n",
"Generate a samples $\\left\\{(x_1,y_1),\\dotsc,(x_{100},y_{100})\\right\\}$ of size $100$, where $$x_i\\sim \\mathcal N (\\mu,K),$$ with mean $\\mu=[0,0]$ and covariance $$\n",
"K=\\left[\\begin{array}{cc}\n",
"10 & 5\\\\\n",
"5 & 10\\\\\n",
"\\end{array}\\right],$$ and \n",
"$$\n",
"y_i \\sim \\mathcal(\\theta^T x_i,1)$$ with\n",
"$$\n",
"\\theta=\\left[\\begin{array}{c}\n",
"3 \\\\\n",
"-1\\\\\n",
"\\end{array}\\right].$$\n",
"\n",
"Arrange these in two matrixes $X$ and $Y$ of sizes $100\\times2$ and $100\\times1$, respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# solution\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the maximum likelihood estimate $\\theta^*$ of $\\theta$, and compare with true value. Note that this is the same as the estimate that minimizes RMSE. Find the RMSE."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stochastic Gradient Descent for ML\n",
"(30 pts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding the root: The basic case\n",
"Suppose that we want to solve the equation $$f(\\theta)=0,$$ which we know to have a unique solution $\\theta^*$ such that $f(\\theta)$ is increasing at this solution. One way to find this solution is to start with an arbitrary $\\theta_0$ and let \n",
"$$\n",
"\\theta_t = \\theta_{t-1} - a_t f(\\theta_t),\n",
"$$\n",
"where $a_i$ satisfies \n",
"$$\\sum_{t=0}^\\infty a_t = \\infty,\\quad \\sum_{t=0}^\\infty a_t^2 < \\infty.$$\n",
"For example, we may let $a_t=\\frac1t$. It can be shown that $\\theta_t$ converges to $\\theta^*$.\n",
"\n",
"___\n",
"Consider the function $f(\\theta)=\\arctan \\theta -1$ and let $\\theta_0=0$. Use the method described above to find $\\theta^*$ such that $f(\\theta^*)=1$. Set the number of iterations to 1000. Plot the values of $\\theta_t$ and compare $\\theta_{1000}$ with $\\theta^*$. Let $a_t=\\frac1{t^{3/4}}$.\n",
"\n",
"Bonus: It is also interesting to see how the behavior changes with $a_t=\\frac1{t^r}$ for different values of $r$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# solution:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding the root with noisy observations###\n",
"Importantly, this method works even if we can't actually compute $f(\\theta)$ but instead can obtain $F(\\theta)$, which is a noisy version of $f(\\theta)$ such that\n",
"$$\n",
"f(\\theta)=E[F(\\theta)].\n",
"$$\n",
"The good news is that using the noisy version works too and we can let \n",
"$$\n",
"\\theta_t = \\theta_{t-1} - a_t F(\\theta_t),\n",
"$$\n",
"and again $\\theta_t$ converges to $\\theta^*$.\n",
"___\n",
"To see this in action, let $F(\\theta)=f(\\theta) + Z = \\arctan \\theta - 1 + Z$, where $Z\\sim \\mathcal N(0,1)$. Start with $\\theta_0=0$ and let\n",
"$$\n",
"\\theta_t = \\theta_{t-1} - a_t (f(\\theta_{t-1}+Z_t)),\n",
"$$\n",
"where, for each $t$, $Z_t$ is a new sample from $\\mathcal N(0,1)$. Plot the values of $\\theta_t$ and compare $\\theta_{1000}$ with $\\theta^*$. Let $a_t=\\frac1{t^{3/4}}$.\n",
"\n",
"Bonus: It is also interesting to see how the behavior changes with $a_t=\\frac1{t^r}$ for different values of $r$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stochastic Gradient Descent###\n",
"Now consider two random variables $x$ and $y$ and suppose that we have $N$ samples: $\\{(x_1,y_1),(x_2,y_2),\\dotsc,(x_N,y_N)\\}$. Our goal is linear regression of $y$ with repect to $x$ as $y = \\theta^T x$. The gradient descent solution to this problem requires finding $\\theta$ such that \n",
"$$\\nabla J(\\theta)=-\\sum_{i=1}^N(y_i-\\theta^T x_i)x_i=0.$$\n",
"This can also be viewed as wanting to find the root of $f(\\theta)=-E[(y-\\theta^T x)x]$. But we don't know the distribution of $x$ and $y$. If we did, we could find the root by \n",
"$$\n",
"\\theta_t = \\theta_{t-1} - a_t f(\\theta_t),\n",
"$$\n",
"But based on the explanation above, we can instead find the root by\n",
"$$\n",
"\\theta_t = \\theta_{t-1} + a_t (y_i-\\theta_{t}^Tx_i)x_i,\n",
"$$\n",
"where $(x_i,y_i)$ is a random data point.\n",
"\n",
"In the context of linear regression, this method is refered to as _Least Mean Square_ or _LMS_ (which isn't a very good name). But the principle applies more generally and is refered to as _Stochastic Gradient Descent_, where _stochastic_ refers to the fact that we chose a random data point from our dataset in each step.\n",
"___\n",
"Set $\\theta_0=(0,0)$ and let\n",
"$$\n",
"\\theta_t=\\theta_{t-1} + \\frac{1}{(t+10)} (y_i-\\theta_t^Tx_i)x_i,\n",
"$$\n",
"for $t=1,\\dotsc,200$, where $i$ is a random index indicating which data point is used in this step. Use data generated in [Maximum likelihood with synthetic data](#Maximum-likelihood-with-synthetic-data). Note that $\\theta_t$ and $x_i$ are two-dimensional. Plot both elements of $\\theta_t$ and compare them with the ture values of [3,-1]."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}