In [ ]:
 

Download (right-click, save target as ...) this page as a jupyterlab notebook ES-29


Laboratory 29: Multiple Regression

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 29 - In Lab


Background

Download the data set ca_housing.csv and describe its contents (no not the describe function, but words - what does it appear to contain)

In [1]:
# Load the necessary packages
import numpy as np
import pandas as pd
import seaborn as sns 
import statistics 
import math
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf

Get the datafile

In [ ]:
import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab29/ca_housing.csv"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
open('ca_housing.csv','wb').write(rget.content); # extract from the remote the contents, assign to a local file same name

Read the datafile into a dataframe

In [11]:
housing = pd.read_csv('ca_housing.csv')
housing.describe() # verify the read
Out[11]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude AveHouseVal
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.070655 35.631861 -119.569704 206855.816909
std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.386050 2.135952 2.003532 115395.615874
min 0.499900 1.000000 0.846154 0.333333 3.000000 0.692308 32.540000 -124.350000 14999.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.429741 33.930000 -121.800000 119600.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.818116 34.260000 -118.490000 179700.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.282261 37.710000 -118.010000 264725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.333333 41.950000 -114.310000 500001.000000

Data preprocessing

After loading the data, it’s a good practice to see if there are any missing values in the data. Count the number of missing values for each feature using isnull() .

In [3]:
housing.isnull().sum()
Out[3]:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
AveHouseVal    0
dtype: int64

It appears that all values have non-null entries, so no cleaning necessary.

Exploratory Data Analysis

Plot the distribution of the target variable AveHouseVal depending on Latitude and Longitude. The code below should get the following figure (assuming you named your dataframe "housing")

In [4]:
plt.figure(figsize=(10,8))
plt.scatter(housing['Latitude'], housing['Longitude'], c=housing['AveHouseVal'], s=housing['Population']/100)
plt.colorbar()
Out[4]:
<matplotlib.colorbar.Colorbar at 0xffff55a62370>

So it sort of looks like Callyfornia, notice the high value homes are along the coast, and get cheaper as one moves inland. Aslo note we are not correctly projecting the Lat-Lon values, so we should not use our script as a GIS-type tool just yet.

Correlation map

Next, we create a correlation matrix that measures the linear relationships between the variables.

The script below should produce a correlation map that prints the off-diagional correlation matrix terms, and color codes them/

In [5]:
corrmat = housing.corr()
plt.subplots(figsize=(12,9))
mask = np.zeros_like(corrmat, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corrmat, vmax=0.9, square=True, annot=True, mask=mask, cbar_kws={"shrink": .5})
Out[5]:
<AxesSubplot:>

Build a Multiple-variable Model

The script below uses all the variables (its a dumb model but illustrates the syntax and package warnings we can use to improve the model)

In [25]:
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('AveHouseVal ~ MedInc + AveRooms + HouseAge + AveRooms + Latitude + Longitude + AveOccup + AveRooms + AveBedrms + Population ', data=housing) # model object constructor syntax
model = model.fit()
pred = model.predict()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            AveHouseVal   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3970.
Date:                Sun, 10 Apr 2022   Prob (F-statistic):               0.00
Time:                        21:34:28   Log-Likelihood:            -2.6025e+05
No. Observations:               20640   AIC:                         5.205e+05
Df Residuals:                   20631   BIC:                         5.206e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -3.694e+06   6.59e+04    -56.067      0.000   -3.82e+06   -3.57e+06
MedInc      4.367e+04    419.680    104.054      0.000    4.28e+04    4.45e+04
AveRooms   -1.073e+04    588.538    -18.235      0.000   -1.19e+04   -9578.623
HouseAge     943.5778     44.628     21.143      0.000     856.104    1031.052
Latitude   -4.213e+04    719.687    -58.541      0.000   -4.35e+04   -4.07e+04
Longitude  -4.345e+04    753.289    -57.682      0.000   -4.49e+04    -4.2e+04
AveOccup    -378.6543     48.741     -7.769      0.000    -474.191    -283.117
AveBedrms   6.451e+04   2813.494     22.928      0.000     5.9e+04       7e+04
Population    -0.3976      0.475     -0.837      0.402      -1.329       0.533
==============================================================================
Omnibus:                     4393.650   Durbin-Watson:                   0.885
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14087.596
Skew:                           1.082   Prob(JB):                         0.00
Kurtosis:                       6.420   Cond. No.                     2.38e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Select Useable Variables

To fit a linear regression model, we want to select those features that have a high correlation with our dependent variable AveHouseVal.

By looking at the correlation matrix we can see that MediaInc has a strong positive correlation with AverageHouseVal (0.69). The other two variables with highest correlation are HouseAge and AveRooms.

We should drop population as it could include zero (and its coefficient is already small). An important point when selecting features for a linear regression model is to check for multicollinearity. For example, the features Latitude and Longitude have 0.92 correlation with each other, so we should not include both of them simultaneously in our regression model.

Because the correlation between the variables MediaInc , HouseAve and AveRooms is not high, yet they have good correlation with AveHouseVal , we consider those three variables for our regression model.


Exercise 1

Build a model to predict AveHouseVal based on

  • MediaInc
  • HouseAge
  • AveRooms
  1. Report the equation of the model.
  2. Produce a histogram of AveHouseVal.
  3. Produce a histogram of the residuals.
  4. What is the mean value of the residuals?
  5. Do the residuals seem to be normally distributed? How will you assess?
  6. Are the residuals homoscedastic? (Yep you're gonna have to look that up)
In [ ]:
 

Exercise 2

Build a plot of AveHouseValue on the x-axis, and the predicted HouseValue on the y-axis.
Add an equal value line (i.e. [10000,500000],[10000,500000] in a second plot call).

Something like:

# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(housing['AveHouseVal'], pred, 'o')           # scatter plot actual vs model
plt.plot([10000,500000],[10000,500000] , 'r', linewidth=2)   # equal value line
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')
plt.title('Need a title')
plt.show();

If your model estimates a value of \$200,000 or less is your model over- or under-predicting?

In [ ]: