Download (right-click, save target as ...) this page as a jupyterlab notebook Lab29
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 29 - In Lab
Explore the data set heart.data.csv and determine the effect that the independent variables biking and smoking have on the dependent variable heart disease using a multiple linear regression model.
# Load the necessary packages
import numpy as np
import pandas as pd
import statistics
import math
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
Get the datafile
import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab29/heart.data.csv" # set the url
rget = requests.get(remote_url, allow_redirects=True) # get the remote resource, follow embedded links
with open('heart.data.csv','wb') as localfile: # write the remote contents to a local file of the same name
    localfile.write(rget.content)
Read into a dataframe. The original source of the data set (https://www.scribbr.com/statistics/multiple-linear-regression/) does not report the units of each variable; a good guess is that biking is miles per week, smoking is packs per week, and heart.disease is possibly hospital admissions per 100000 for coronary complications.
After the read we need some shenanigans to make the column names meaningful. In particular, the dot in heart.disease would not be read as a plain variable name by the formula interface used below.
heartattack = pd.read_csv('heart.data.csv')
data = heartattack.rename(columns={"biking":"Bike","smoking":"Smoke","heart.disease":"Disease"})
data.head(3)
Now build a linear model
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Disease ~ Bike + Smoke', data=data)
model = model.fit()
#print(model.summary())
# dir(model) # activate to find attributes
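intercept = model.params['Intercept'] # access coefficients by label rather than by position
slope_bike = model.params['Bike'] # one slope per predictor
slope_smoke = model.params['Smoke']
Rsquare = model.rsquared
RMSE = math.sqrt(model.mse_resid) # residual mean square; mse_total measures spread about the mean, not model error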
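As a cross-check on the quantities extracted from the fitted model, R-square and RMSE can also be computed by hand from the residuals. The following is a minimal sketch, assuming synthetic stand-in data (the coefficients, noise level, and variable names are made up for illustration and are not the lab data set) and using NumPy least squares instead of `statsmodels`. It also shows why the residual mean square, not the total mean square, is the quantity that measures model error.

```python
import numpy as np

# Synthetic stand-in data (made up for illustration, not the lab data set)
rng = np.random.default_rng(1330)
n = 200
bike = rng.uniform(1.0, 75.0, n)
smoke = rng.uniform(0.5, 30.0, n)
disease = 15.0 - 0.2 * bike + 0.18 * smoke + rng.normal(0.0, 0.5, n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), bike, smoke])
coef, *_ = np.linalg.lstsq(X, disease, rcond=None)

residuals = disease - X @ coef
sse = float(np.sum(residuals**2))                   # residual sum of squares
sst = float(np.sum((disease - disease.mean())**2))  # total sum of squares about the mean
p = X.shape[1]                                      # number of fitted parameters

r_squared = 1.0 - sse / sst
rmse_resid = np.sqrt(sse / (n - p))  # analogous to sqrt(model.mse_resid)
rms_total = np.sqrt(sst / (n - 1))   # analogous to sqrt(model.mse_total)

print('coefficients (intercept, bike, smoke):', np.round(coef, 3))
print('R squared =', round(r_squared, 3))
print('RMSE from residuals =', round(rmse_resid, 2))
print('RMS spread about the mean =', round(rms_total, 2))
```

The residual-based value should land near the noise level built into the synthetic data, while the total-based value reflects the overall spread of the response and is much larger.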
To find the various values, inspecting the fitted results object is useful (for example with dir(model), as in the commented line above). Below we construct a title line that contains R-square and RMSE using type casting and concatenation, then pass it to the plot.
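The same type casting and concatenation approach could also put the fitted equation into the title. A small sketch follows; the three coefficient values are hypothetical placeholders chosen only to show the string building, and in the notebook they would come from `model.params`.

```python
# Hypothetical placeholder coefficients, used only to illustrate the string building
intercept_demo = 10.0
bike_demo = -0.5
smoke_demo = 0.25

# Cast each rounded number to a string and concatenate the pieces
equation = ('Disease = ' + str(round(intercept_demo, 2))
            + ' + (' + str(round(bike_demo, 3)) + ')*Bike'
            + ' + (' + str(round(smoke_demo, 3)) + ')*Smoke')
print(equation)
```

The resulting string can be joined to a title line with `' \n'` in the same way as the R-square and RMSE pieces.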
# Predict values
heartfail = model.predict()
titleline = 'Disease Index versus Lifestyle Variables \n' + 'R squared = ' + str(round(Rsquare,3)) + ' \n RMSE = ' + str(round(RMSE,2))
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(data['Bike'], data['Disease'], 'o') # scatter plot showing actual data
plt.plot(data['Bike'], heartfail, marker = 's' ,color ='r', linewidth=0) # model predictions (markers only, the data are unsorted)
plt.xlabel('Biking (miles/week)')
plt.ylabel('Disease Index (Admissions/100,000 as per MMWR)')
plt.legend(['Observations','Model Prediction'])
plt.title(titleline)
plt.show()
titleline = 'Disease Index versus Lifestyle Variables \n' + 'R squared = ' + str(round(Rsquare,3)) + ' \n RMSE = ' + str(round(RMSE,2))
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(data['Smoke'], data['Disease'], 'o') # scatter plot showing actual data
plt.plot(data['Smoke'], heartfail, marker = 's' ,color ='r', linewidth=0) # model predictions (markers only, the data are unsorted)
plt.xlabel('Smoking (packs/week)')
plt.ylabel('Disease Index (Admissions/100,000 as per MMWR)')
plt.legend(['Observations','Model Prediction'])
plt.title(titleline)
plt.show()
Now let's learn about the actual model
print(model.summary())
Interpret the results.
Using the tools from Lab 28, produce labeled plots of prediction intervals for disease index using biking and smoking as predictor variables (note the work is already done above; you just need to access the object and make and label the plots).
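One possible way to obtain prediction intervals from a fitted `statsmodels` OLS result is `get_prediction()` followed by `summary_frame()`. The Lab 28 tools are not reproduced here, so this choice is an assumption about the approach, and the sketch below runs on synthetic stand-in data (the `demo` dataframe and its coefficients are made up, not the lab data set). Only the biking plot is shown; the smoking plot would follow the same pattern.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

# Synthetic stand-in data (made up for illustration, not the lab data set)
rng = np.random.default_rng(29)
n = 100
demo = pd.DataFrame({'Bike': rng.uniform(1.0, 75.0, n),
                     'Smoke': rng.uniform(0.5, 30.0, n)})
demo['Disease'] = (15.0 - 0.2 * demo['Bike'] + 0.18 * demo['Smoke']
                   + rng.normal(0.0, 0.7, n))

fit = smf.ols('Disease ~ Bike + Smoke', data=demo).fit()

# 95% intervals at the observed predictor values
# obs_ci_* columns are prediction intervals, mean_ci_* are confidence intervals on the mean
frame = fit.get_prediction().summary_frame(alpha=0.05)

plt.figure(figsize=(12, 6))
plt.plot(demo['Bike'], demo['Disease'], 'o')
plt.plot(demo['Bike'], frame['mean'], marker='s', color='r', linewidth=0)
plt.plot(demo['Bike'], frame['obs_ci_lower'], marker='v', color='g', linewidth=0)
plt.plot(demo['Bike'], frame['obs_ci_upper'], marker='^', color='g', linewidth=0)
plt.xlabel('Biking (synthetic units)')
plt.ylabel('Disease Index (synthetic units)')
plt.legend(['Observations', 'Model Prediction',
            'Lower 95% prediction limit', 'Upper 95% prediction limit'])
plt.title('Prediction interval sketch on synthetic data')
plt.show()
```

Markers are used throughout because the rows are not sorted by the predictor on the horizontal axis.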
The data are not ordered, hence we plot only markers, even for the model predictions. To get the usual plots using lines, you need to sort the fitted results and plot those; it's a bit of a hassle and is left as a bonus problem.
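The sorting step mentioned above can be sketched with `numpy.argsort`, which returns the row order that sorts one array and can then be applied to the matching fitted values. The two short arrays below are made-up stand-ins for `data['Bike']` and `model.predict()`, not real results.

```python
import numpy as np

# Made-up stand-ins for data['Bike'] and the fitted values from model.predict()
bike_demo = np.array([40.0, 5.0, 62.0, 18.0, 33.0])
fitted_demo = np.array([8.1, 15.2, 3.9, 12.4, 9.6])

# argsort gives the index order that sorts the predictor
order = np.argsort(bike_demo)

# Apply the same order to both arrays so each pair stays together
bike_sorted = bike_demo[order]
fitted_sorted = fitted_demo[order]

print(bike_sorted)
print(fitted_sorted)
```

With two predictors in the model, a line drawn through fitted values sorted on one predictor can still look jagged, because the other predictor varies from row to row.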