Download (right-click, save target as ...) this page as a jupyterlab notebook from: Lab18-TH
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Exercise Set 18 - Homework
Execute the code cell below to profile your computer.
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
Recall that in an earlier lab you accessed a file of concrete strength and related mixture variables.
#Get database -- use the Get Data From URL Script
#Step 1: import needed modules to interact with the internet
import requests
#Step 2: make the connection to the remote file (this is equivalent to running "curl -O http://fqdn/path ..." in bash)
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab10/concreteData.xls' # an Excel file
response = requests.get(remote_url) # Gets the file contents puts into an object
output = open('concreteData.xls', 'wb') # Prepare a destination, local
output.write(response.content) # write contents of object to named local file
output.close() # close the local file
Then you changed some column names
import pandas
concreteData = pandas.read_excel('concreteData.xls') # read the file
# rename the columns
req_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer",
"CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(concreteData.columns)
mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = req_col_names[i]  # map each current name to its replacement by position
concreteData = concreteData.rename(columns=mapper)
concreteData.head() # show the dataframe
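The renaming step above pairs old and new column names by position. As a minimal sketch of the same idea, the mapper can also be built with `dict(zip(...))`. The data frame `raw` and its column labels here are made-up stand-ins for illustration, not the actual headers in `concreteData.xls`.

```python
import pandas as pd

# Hypothetical stand-in frame; the real file has nine columns with longer labels
raw = pd.DataFrame({"Cement (component 1)": [1.0], "Water (component 4)": [2.0]})
new_names = ["Cement", "Water"]

# Pair each current column name with its replacement, in order
mapper = dict(zip(raw.columns, new_names))
renamed = raw.rename(columns=mapper)

print(list(renamed.columns))  # ['Cement', 'Water']
```

Either form relies on `req_col_names` being listed in the same order as the columns in the file.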
Then you did the multiple plots
# ! sudo /opt/jupyterhub/bin/python3 -m pip install seaborn
import matplotlib.pyplot
import seaborn
%matplotlib inline
seaborn.pairplot(concreteData)
matplotlib.pyplot.show()
So it's a cool plot, but the meaningful data science question is: which variable(s) have predictive value for estimating concrete strength?
Answer by using Cement as the predictor variable. What is its correlation coefficient? Then fit the data model
$Strength_{model} = \beta_0 + \beta_1 \cdot Cement$
# correlation coefficients
# plotting functions (ok to use built-in in pandas)
# data model trial-and-error fit
# assess model - sum of squares residuals
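The steps listed above (correlation coefficient, trial-and-error fit, sum of squared residuals) can be sketched on a small made-up data set. The arrays `x` and `y`, the trial values of $\beta_0$ and $\beta_1$, and the helper names are all assumptions for illustration; for the exercise, substitute the `Cement` and `CC_Strength` columns of `concreteData`.

```python
import numpy as np

# Hypothetical toy data (NOT the concrete data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # response

# Pearson correlation coefficient between predictor and response
r = np.corrcoef(x, y)[0, 1]
print("correlation coefficient:", r)

def linear_model(beta0, beta1, x):
    """Evaluate the data model beta0 + beta1 * x."""
    return beta0 + beta1 * x

def sum_squared_residuals(beta0, beta1, x, y):
    """Sum of squared differences between observations and the model."""
    residuals = y - linear_model(beta0, beta1, x)
    return float(np.sum(residuals ** 2))

# Trial and error: evaluate a few guesses and keep the one with the smallest SSR
trials = [(0.0, 1.0), (0.0, 2.0), (0.5, 1.8)]
for beta0, beta1 in trials:
    print(beta0, beta1, sum_squared_residuals(beta0, beta1, x, y))

best = min(trials, key=lambda b: sum_squared_residuals(b[0], b[1], x, y))
print("best trial (beta0, beta1):", best)
```

A smaller sum of squared residuals means the line passes closer to the data, so it gives a single number for ranking one guess against another.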
Repeat the exercise using Age, then Water, as the predictor variable.
# data model trial-and-error fit
# plot
# assess model - sum of squares residuals
Which is the better model of the three you examined?
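One possible way to compare candidate predictors is to compute the sum of squared residuals for each and pick the smallest. The sketch below swaps in `numpy.polyfit` (an ordinary least-squares line) in place of a trial-and-error fit, purely to keep the example short. The helper `ssr_for_predictor` and the frame `toy` are hypothetical, and the values in `toy` are made up; they are not results from the concrete data.

```python
import numpy as np
import pandas as pd

def ssr_for_predictor(df, predictor, response="CC_Strength"):
    """Fit response = beta0 + beta1 * predictor by least squares; return the SSR."""
    x = df[predictor].to_numpy(dtype=float)
    y = df[response].to_numpy(dtype=float)
    beta1, beta0 = np.polyfit(x, y, 1)  # degree-1 fit returns slope, then intercept
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# Hypothetical toy frame reusing the renamed column labels (values are invented)
toy = pd.DataFrame({
    "Cement": [100.0, 200.0, 300.0, 400.0],
    "Water": [180.0, 150.0, 190.0, 160.0],
    "CC_Strength": [10.0, 20.0, 30.0, 40.0],
})

scores = {name: ssr_for_predictor(toy, name) for name in ["Cement", "Water"]}
best_predictor = min(scores, key=scores.get)
print(scores)
print("smallest SSR:", best_predictor)
```

With the real data, the call would be along the lines of `ssr_for_predictor(concreteData, "Age")`, assuming the columns were renamed as shown earlier.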