Download (right-click, save target as ...) this page as a jupyterlab notebook ES-29

Laboratory 29: Multiple Regression

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 29 - In Lab

Background¶

Download the data set ca_housing.csv and describe its contents (no not the describe function, but words - what does it appear to contain)

# Load the necessary packages
import numpy as np
import pandas as pd
import seaborn as sns 
import statistics 
import math
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf

Get the datafile

import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab29/ca_housing.csv"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
open('ca_housing.csv','wb').write(rget.content); # extract from the remote the contents, assign to a local file same name

Read the datafile into a dataframe

housing = pd.read_csv('ca_housing.csv')
housing.describe() # verify the read

Data preprocessing¶

After loading the data, it’s a good practice to see if there are any missing values in the data. Count the number of missing values for each feature using isnull() .

housing.isnull().sum()

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
AveHouseVal    0
dtype: int64

It appears that all values have non-null entries, so no cleaning necessary.

Exploratory Data Analysis¶

Plot the distribution of the target variable AveHouseVal depending on Latitude and Longitude. The code below should get the following figure (assuming you named your dataframe "housing")

plt.figure(figsize=(10,8))
plt.scatter(housing['Latitude'], housing['Longitude'], c=housing['AveHouseVal'], s=housing['Population']/100)
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0xffff55a62370>

So it sort of looks like Callyfornia, notice the high value homes are along the coast, and get cheaper as one moves inland. Aslo note we are not correctly projecting the Lat-Lon values, so we should not use our script as a GIS-type tool just yet.

Correlation map¶

Next, we create a correlation matrix that measures the linear relationships between the variables.

The script below should produce a correlation map that prints the off-diagional correlation matrix terms, and color codes them/

corrmat = housing.corr()
plt.subplots(figsize=(12,9))
mask = np.zeros_like(corrmat, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corrmat, vmax=0.9, square=True, annot=True, mask=mask, cbar_kws={"shrink": .5})

<AxesSubplot:>

Build a Multiple-variable Model¶

The script below uses all the variables (its a dumb model but illustrates the syntax and package warnings we can use to improve the model)

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('AveHouseVal ~ MedInc + AveRooms + HouseAge + AveRooms + Latitude + Longitude + AveOccup + AveRooms + AveBedrms + Population ', data=housing) # model object constructor syntax
model = model.fit()
pred = model.predict()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:            AveHouseVal   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3970.
Date:                Sun, 10 Apr 2022   Prob (F-statistic):               0.00
Time:                        21:34:28   Log-Likelihood:            -2.6025e+05
No. Observations:               20640   AIC:                         5.205e+05
Df Residuals:                   20631   BIC:                         5.206e+05
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -3.694e+06   6.59e+04    -56.067      0.000   -3.82e+06   -3.57e+06
MedInc      4.367e+04    419.680    104.054      0.000    4.28e+04    4.45e+04
AveRooms   -1.073e+04    588.538    -18.235      0.000   -1.19e+04   -9578.623
HouseAge     943.5778     44.628     21.143      0.000     856.104    1031.052
Latitude   -4.213e+04    719.687    -58.541      0.000   -4.35e+04   -4.07e+04
Longitude  -4.345e+04    753.289    -57.682      0.000   -4.49e+04    -4.2e+04
AveOccup    -378.6543     48.741     -7.769      0.000    -474.191    -283.117
AveBedrms   6.451e+04   2813.494     22.928      0.000     5.9e+04       7e+04
Population    -0.3976      0.475     -0.837      0.402      -1.329       0.533
==============================================================================
Omnibus:                     4393.650   Durbin-Watson:                   0.885
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14087.596
Skew:                           1.082   Prob(JB):                         0.00
Kurtosis:                       6.420   Cond. No.                     2.38e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Select Useable Variables¶

To fit a linear regression model, we want to select those features that have a high correlation with our dependent variable AveHouseVal.

By looking at the correlation matrix we can see that MediaInc has a strong positive correlation with AverageHouseVal (0.69). The other two variables with highest correlation are HouseAge and AveRooms.

We should drop population as it could include zero (and its coefficient is already small). An important point when selecting features for a linear regression model is to check for multicollinearity. For example, the features Latitude and Longitude have 0.92 correlation with each other, so we should not include both of them simultaneously in our regression model.

Because the correlation between the variables MediaInc , HouseAve and AveRooms is not high, yet they have good correlation with AveHouseVal , we consider those three variables for our regression model.

Exercise 1¶

Build a model to predict AveHouseVal based on

MediaInc
HouseAge
AveRooms

Report the equation of the model.
Produce a histogram of AveHouseVal.
Produce a histogram of the residuals.
What is the mean value of the residuals?
Do the residuals seem to be normally distributed? How will you assess?
Are the residuals homoscedastic? (Yep you're gonna have to look that up)

Exercise 2¶

Build a plot of AveHouseValue on the x-axis, and the predicted HouseValue on the y-axis.
Add an equal value line (i.e. [10000,500000],[10000,500000] in a second plot call).

Something like:

# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(housing['AveHouseVal'], pred, 'o')           # scatter plot actual vs model
plt.plot([10000,500000],[10000,500000] , 'r', linewidth=2)   # equal value line
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')
plt.title('Need a title')
plt.show();

If your model estimates a value of \$200,000 or less is your model over- or under-predicting?

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	AveHouseVal
count	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	3.870671	28.639486	5.429000	1.096675	1425.476744	3.070655	35.631861	-119.569704	206855.816909
std	1.899822	12.585558	2.474173	0.473911	1132.462122	10.386050	2.135952	2.003532	115395.615874
min	0.499900	1.000000	0.846154	0.333333	3.000000	0.692308	32.540000	-124.350000	14999.000000
25%	2.563400	18.000000	4.440716	1.006079	787.000000	2.429741	33.930000	-121.800000	119600.000000
50%	3.534800	29.000000	5.229129	1.048780	1166.000000	2.818116	34.260000	-118.490000	179700.000000
75%	4.743250	37.000000	6.052381	1.099526	1725.000000	3.282261	37.710000	-118.010000	264725.000000
max	15.000100	52.000000	141.909091	34.066667	35682.000000	1243.333333	41.950000	-114.310000	500001.000000