Download (right-click, save target as ...) this page as a jupyterlab notebook ES-29
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 29 - In Lab
Download the data set ca_housing.csv and describe its contents (no, not the describe function, but in words - what does it appear to contain?)
# Load the necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import statistics
import math
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
Get the datafile
import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab29/ca_housing.csv" # set the url
rget = requests.get(remote_url, allow_redirects=True) # get the remote resource, follow embedded links
open('ca_housing.csv','wb').write(rget.content); # extract from the remote the contents, assign to a local file same name
Read the datafile into a dataframe
housing = pd.read_csv('ca_housing.csv')
housing.describe() # verify the read
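A quick look at the first few rows can also help with the written description (a minimal sketch using the dataframe loaded above):
housing.head() # peek at the raw rows to see what each column contains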
After loading the data, it is good practice to check for missing values. Count the number of missing values for each feature using isnull().
housing.isnull().sum()
It appears that all columns have non-null entries, so no cleaning is necessary.
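If missing values were present, a minimal cleaning sketch might look like the following (not needed for this file; the median fill of MedInc is only an illustrative assumption):
housing_clean = housing.dropna() # drop rows with any missing entry (no effect here, nothing is missing)
housing['MedInc'] = housing['MedInc'].fillna(housing['MedInc'].median()) # or fill a numeric column with its median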
Plot the distribution of the target variable AveHouseVal depending on Latitude and Longitude. The code below should produce the following figure (assuming you named your dataframe "housing")
plt.figure(figsize=(10,8))
plt.scatter(housing['Latitude'], housing['Longitude'], c=housing['AveHouseVal'], s=housing['Population']/100) # color by house value, marker size scaled by population
plt.colorbar()
So it sort of looks like California; notice the high-value homes are along the coast and get cheaper as one moves inland. Also note we are not correctly projecting the Lat-Lon values, so we should not use our script as a GIS-type tool just yet.
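If you want the scatter oriented like a conventional map, one option (a minimal sketch using the same columns) is to put Longitude on the x-axis and Latitude on the y-axis:
plt.figure(figsize=(10,8))
plt.scatter(housing['Longitude'], housing['Latitude'], c=housing['AveHouseVal'], s=housing['Population']/100) # map-like orientation
plt.xlabel('Longitude'); plt.ylabel('Latitude')
plt.colorbar()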
Next, we create a correlation matrix that measures the linear relationships between the variables.
The script below should produce a correlation map that prints the off-diagonal correlation matrix terms and color codes them.
corrmat = housing.corr() # pairwise correlation matrix
plt.subplots(figsize=(12,9))
mask = np.zeros_like(corrmat, dtype=bool) # np.bool is removed in newer numpy; use the builtin bool
mask[np.triu_indices_from(mask)] = True # hide the upper triangle (it duplicates the lower)
sns.heatmap(corrmat, vmax=0.9, square=True, annot=True, mask=mask, cbar_kws={"shrink": .5})
The script below uses all the variables (it's a naive model, but it illustrates the syntax and the package warnings we can use to improve the model)
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('AveHouseVal ~ MedInc + HouseAge + AveRooms + AveBedrms + Population + AveOccup + Latitude + Longitude', data=housing) # model object constructor syntax (each predictor listed once)
model = model.fit()
pred = model.predict()
print(model.summary())
To fit a linear regression model, we want to select those features that have a high correlation with our dependent variable AveHouseVal.
By looking at the correlation matrix we can see that MedInc has a strong positive correlation with AveHouseVal (0.69). The other two variables with the highest correlation are HouseAge and AveRooms.
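You can read the same information directly by ranking each feature's correlation with the target (a minimal sketch reusing corrmat from above):
corrmat['AveHouseVal'].sort_values(ascending=False) # correlation of every column with the target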
We should drop Population because its coefficient is already small and its confidence interval could include zero. An important point when selecting features for a linear regression model is to check for multicollinearity. For example, the features Latitude and Longitude are strongly correlated with each other (about 0.92 in magnitude), so we should not include both of them simultaneously in our regression model.
Because the correlation between the variables MedInc, HouseAge, and AveRooms is not high, yet each has a good correlation with AveHouseVal, we consider those three variables for our regression model.
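One way to quantify multicollinearity among the candidate predictors is the variance inflation factor (VIF) from statsmodels; a minimal sketch (the feature list is assumed from the discussion above):
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = housing[['MedInc', 'HouseAge', 'AveRooms']].assign(const=1.0) # candidate predictors plus an intercept column
for i, name in enumerate(X.columns[:-1]): # skip the intercept column when reporting
    print(name, variance_inflation_factor(X.values, i))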
Build a model to predict AveHouseVal based on MedInc, HouseAge, and AveRooms (see the sketch below).
Build a plot of AveHouseVal on the x-axis and the predicted house value on the y-axis.
Add an equal value line (i.e. [10000,500000],[10000,500000] in a second plot call).
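A minimal sketch of the reduced model (assuming the three features selected above) that also produces the predictions used in the plot below:
# Fit a reduced model using only the three selected features
model2 = smf.ols('AveHouseVal ~ MedInc + HouseAge + AveRooms', data=housing)
model2 = model2.fit()
pred = model2.predict() # in-sample predictions for the actual-vs-predicted plot
print(model2.summary())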
Something like:
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(housing['AveHouseVal'], pred, 'o') # scatter plot actual vs model
plt.plot([10000,500000],[10000,500000] , 'r', linewidth=2) # equal value line
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')
plt.title('Need a title')
plt.show();
If your model estimates a value of \$200,000 or less, is your model over- or under-predicting?
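One hedged way to answer is to compare the mean actual and mean predicted values where the model predicts at or below \$200,000 (a minimal sketch reusing pred from the reduced model):
low = pred <= 200000 # predictions at or below $200,000
print('mean actual   :', housing['AveHouseVal'][low].mean())
print('mean predicted:', pred[low].mean())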