Download (right-click, save target as ...) this page as a jupyterlab notebook from: Lab18-TH

Exercise Set 18: Correlation

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Exercise Set 18 - Homework

Exercise 0. Profile your computer¶

Execute the code cell below to profile your computer.

# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)

Exercise 1. All about concrete¶

Recall in an earlier lab that you accessed a file of concrete strength and related mixture variables.

#Get database -- use the Get Data From URL Script
#Step 1: import needed modules to interact with the internet
import requests
#Step 2: make the connection to the remote file (actually its implementing "bash curl -O http://fqdn/path ...")
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab10/concreteData.xls' # an Excel file
response = requests.get(remote_url) # Gets the file contents puts into an object
output = open('concreteData.xls', 'wb') # Prepare a destination, local
output.write(response.content) # write contents of object to named local file
output.close() # close the connection

Then you changed some column names

import pandas

concreteData = pandas.read_excel('concreteData.xls') # read the file
# rename the columns
req_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer",
                 "CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(concreteData.columns)

mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = req_col_names[i]

concreteData = concreteData.rename(columns=mapper)

concreteData.head() # show the dataframe

Then you did the mulitple plots

# ! sudo /opt/jupyterhub/bin/python3 -m pip install seaborn
import matplotlib.pyplot  
import seaborn 
%matplotlib inline
seaborn.pairplot(concreteData)
matplotlib.pyplot.show()

So it's a cool plot, but the meaningful data science question is which variable(s) have predictive value for estimating concrete strength?

Answer by:

Determine the correlation coefficient for the variable pairs.
Rank the predictive value of the variables from highest magnitude to lowest magnitude.
Build a linear data model based on the Cement variable, what is its correlation coefficient? $Strength_{model} = \beta_0 + \beta_1 \cdot Cement $
Build a scatterplot of of the data model and the observations, and use the plot to find values of the two parameters.
Your assessment of data model utility for this database?

# correlation coefficients

# plotting functions (ok to use built-in in pandas)

# data model trial-and-error fit

# assess model - sum of squares residuals

Repeat the exercise using Age then Water as the predictor variable.

# data model trial-and-error fit
# plot
# assess model - sum of squares residuals

Which is the better model of the three you examined?

	Cement	BlastFurnaceSlag	Water	Superplasticizer	CoarseAggregate	FineAggregate	Age	CC_Strength
0	540.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366
2	332.5	142.5	228.0	0.0	932.0	594.0	270	40.269535
3	332.5	142.5	228.0	0.0	932.0	594.0	365	41.052780
4	198.6	132.4	192.0	0.0	978.4	825.5	360	44.296075