Laboratory 18: On Precognition and Other Sins of the Human Brain

# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)

DESKTOP-EH6HD63
desktop-eh6hd63\farha
C:\Users\Farha\Anaconda3\python.exe
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)

Full name:

R#:

Title of the notebook

Date:

The human brain is amazing and mysterious in many ways. Have a look at these sequences. You, with the assistance of your brain, can guess the next item in each sequence, right?

A,B,C,D,E, __ ?
5,10,15,20,25, __ ?
2,4,8,16,32, __ ?
0,1,1,2,3, __ ?
1, 11, 21, 1211,111221, __ ?

But how does our brain do this? How do we guess, or predict, the next step? Is it that there is only one possible option? Is it that we have the previous items? Or is it the relationship between the items?
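One way to read these questions is that each guess applies a rule relating an item to the ones before it. The sketch below writes one such rule per sequence; the function names are made up for this illustration, and other rules could fit the same five items.

```python
from itertools import groupby

def next_arithmetic(seq):
    """Next term when consecutive terms differ by a constant."""
    return seq[-1] + (seq[-1] - seq[-2])

def next_geometric(seq):
    """Next term when consecutive terms share a constant integer ratio."""
    return seq[-1] * (seq[-1] // seq[-2])

def next_fibonacci(seq):
    """Next term when each term is the sum of the two before it."""
    return seq[-1] + seq[-2]

def next_look_and_say(term):
    """Describe the runs of digits in `term`: '1211' reads as
    one 1, one 2, two 1s, which gives '111221'."""
    return "".join(f"{len(list(run))}{digit}" for digit, run in groupby(term))

print(chr(ord("E") + 1))                      # the letter after E
print(next_arithmetic([5, 10, 15, 20, 25]))
print(next_geometric([2, 4, 8, 16, 32]))
print(next_fibonacci([0, 1, 1, 2, 3]))
print(next_look_and_say("111221"))
```

In every case the prediction comes from the relationship between items, not from the items alone.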

What if we have more than a single sequence? Maybe two sets of numbers? How can we predict the next "item" in a situation like that?

Blue Points? Red Line? Fit? Does it ring any bells?

3 Problem 2 (8 pts)

The table below contains some experimental observations.

Elapsed Time (s)	Speed (m/s)
0	0
1.0	3
2.0	7
3.0	12
4.0	20
5.0	30
6.0	45.6
7.0	60.3
8.0	77.7
9.0	97.3
10.0	121.1

Plot the speed vs time (speed on y-axis, time on x-axis) using a scatter plot. Use blue markers.
Plot a red line on the scatterplot based on the linear model $f(x) = mx + b$
By trial and error, find values of $m$ and $b$ that provide a good visual fit (i.e., make the red line explain the blue markers).
Using this data model estimate the speed at $t = 15~\texttt{sec.}$
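The trial-and-error step can be made a little less subjective by scoring each guess. The sketch below assumes the sum of squared errors as the score, and the three `(m, b)` pairs are hypothetical starting guesses, not the answer key; it leaves the plotting aside and only compares the guesses numerically.

```python
# Observations from the table above
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.1]

def linear_model(x, m, b):
    """The data model f(x) = m*x + b."""
    return m * x + b

def sum_squared_error(m, b):
    """Total squared vertical distance between the line and the markers."""
    return sum((y - linear_model(x, m, b)) ** 2 for x, y in zip(time, speed))

# Hypothetical trial values; adjust them and re-run until the error stops shrinking
trials = [(10.0, 0.0), (12.0, -10.0), (12.0, -17.0)]
for m, b in trials:
    print(f"m = {m:5.1f}, b = {b:6.1f}, SSE = {sum_squared_error(m, b):9.2f}")

# Keep the trial with the smallest error and use it for the estimate
best_m, best_b = min(trials, key=lambda mb: sum_squared_error(*mb))
print("estimated speed at t = 15 s:", linear_model(15, best_m, best_b))
```

A smaller error usually goes with a better-looking red line, so the score is a companion to the visual check, not a replacement for it.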

Let's go over some important terminology:

Linear Regression:
a basic predictive analytics technique that uses historical data to predict an output variable.

The Predictor variable (input):
the variable(s) that help predict the value of the output variable. It is commonly referred to as X.

The Output variable:
the variable that we want to predict. It is commonly referred to as Y.

To estimate Y using linear regression, we assume the equation: $Y_e = \beta X + \alpha$

where Yₑ is the estimated or predicted value of Y based on our linear equation.

Our goal is to find the values of the parameters α and β that minimise the difference between Y and Yₑ. If we are able to determine the optimum values of these two parameters, then we will have the line of best fit that we can use to predict the values of Y, given the value of X.

So, how do we estimate α and β?

We can use a method called Ordinary Least Squares (OLS).

The objective of the least squares method is to find values of α and β that minimise the sum of the squared differences between Y and Yₑ (the vertical distances between the linear fit and the observed points). We will not go through the derivation here, but using calculus we can show that the values of the unknown parameters are as follows:

$$\beta = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \alpha = \bar{Y} - \beta \bar{X}$$

where X̄ is the mean of the X values and Ȳ is the mean of the Y values. β is simply the covariance of X and Y (Cov(X, Y)) divided by the variance of X (Var(X)).
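For readers who want the omitted step, here is a sketch of the calculus argument. It assumes n observations $(X_i, Y_i)$ and writes $S$ for the sum of squared differences; $S$ and the index $i$ are notation introduced only for this sketch.

```latex
\begin{aligned}
S(\alpha, \beta) &= \sum_{i=1}^{n} \left( Y_i - \alpha - \beta X_i \right)^2 \\[4pt]
\frac{\partial S}{\partial \alpha} &= -2 \sum_{i=1}^{n} \left( Y_i - \alpha - \beta X_i \right) = 0
  \quad\Longrightarrow\quad \alpha = \bar{Y} - \beta \bar{X} \\[4pt]
\frac{\partial S}{\partial \beta} &= -2 \sum_{i=1}^{n} X_i \left( Y_i - \alpha - \beta X_i \right) = 0 \\[4pt]
% substitute \alpha = \bar{Y} - \beta \bar{X} into the second condition
&\Longrightarrow\quad \sum_{i=1}^{n} X_i \left[ (Y_i - \bar{Y}) - \beta (X_i - \bar{X}) \right] = 0 \\[4pt]
&\Longrightarrow\quad \beta
  = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{aligned}
```

The last equality holds because $\sum_i (Y_i - \bar{Y}) = 0$ and $\sum_i (X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ times either sum changes nothing. Dividing the numerator and the denominator by $n - 1$ turns the ratio into Cov(X, Y) over Var(X).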

Covariance:
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
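To see why the magnitude of the covariance is hard to interpret, the sketch below computes the sample covariance of the time and speed table by hand and then repeats it with the speeds converted to km/h. The helper name `sample_covariance` is made up for this illustration, and it divides by n - 1.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

# Observations from the table above
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]      # seconds
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.1]       # m/s
speed_kmh = [3.6 * s for s in speed]                                # same speeds in km/h

cov_ms = sample_covariance(time, speed)
cov_kmh = sample_covariance(time, speed_kmh)
print("covariance with speed in m/s :", cov_ms)
print("covariance with speed in km/h:", cov_kmh)
```

The sign stays positive in both cases, but the value grows by a factor of 3.6 simply because the unit changed.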

The Correlation Coefficient:
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson’s. Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression. The formula for Pearson’s R:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The formula returns a value between -1 and 1, where:

1 : A correlation coefficient of 1 means that for every increase in one variable, there is an increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
-1: A correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.
0 : Zero means that an increase in one variable is not associated with a consistent increase or decrease in the other. The two have no linear relationship.
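The three cases can be reproduced on tiny made-up datasets. The sketch below computes Pearson's R by hand; `pearson_r` and the three `y` lists are hypothetical names and values chosen for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's R: co-deviation of x and y, scaled by both spreads."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]          # straight line going up
falling = [10 - 3 * v for v in x]        # straight line going down
curved = [(v - 3) ** 2 - 2 for v in x]   # symmetric curve around x = 3

print("rising :", pearson_r(x, rising))
print("falling:", pearson_r(x, falling))
print("curved :", pearson_r(x, curved))
```

The curved case is worth a second look: y is completely determined by x, yet the coefficient is zero, because Pearson's R only measures the linear part of a relationship.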

Example 1: Let's have a look at Problem 1 from Exam II

We had a table of recorded times and speeds from some experimental observations:

Elapsed Time (s)	Speed (m/s)
0	0
1.0	3
2.0	7
3.0	12
4.0	20
5.0	30
6.0	45.6
7.0	60.3
8.0	77.7
9.0	97.3
10.0	121.1

First, let's create a dataframe:

# Load the necessary packages
import numpy as np
import pandas as pd
import statistics 
from matplotlib import pyplot as plt

# Create a dataframe:
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2]
data = pd.DataFrame({'Time':time, 'Speed':speed})
data

Now, let's explore the data:

data.describe()

time_var = statistics.variance(time)
speed_var = statistics.variance(speed)

print("Variance of recorded times is ",time_var)
print("Variance of recorded speed is ",speed_var)

Variance of recorded times is  11.0
Variance of recorded speed is  1697.7759999999998
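The value 11.0 is the sample variance, which divides by n - 1. The sketch below redoes the calculation by hand for the recorded times and compares it with the population variance, which divides by n; both `statistics.variance` and `statistics.pvariance` are part of the standard library.

```python
import statistics

time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

n = len(time)
mean = sum(time) / n
squared_deviations = sum((t - mean) ** 2 for t in time)

sample_var = squared_deviations / (n - 1)      # what statistics.variance computes
population_var = squared_deviations / n        # what statistics.pvariance computes

print("sample variance    :", sample_var, statistics.variance(time))
print("population variance:", population_var, statistics.pvariance(time))
```

This matters when results are compared across libraries: pandas `var()` defaults to the sample version (`ddof=1`), while `numpy.var` defaults to the population version (`ddof=0`).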

Is there a relationship (based on covariance and correlation) between time and speed?

# To find the covariance  
data.cov()

# To find the correlation among the columns 
# using pearson method 
data.corr(method ='pearson')

Let's do linear regression with primitive Python:

To estimate "y" using the OLS method, we need to calculate "xmean" and "ymean", the covariance of X and y ("xycov"), and the variance of X ("xvar") before we can determine the values for alpha and beta. In our case, X is Time and y is Speed.

# Calculate the mean of X and y
xmean = np.mean(time)
ymean = np.mean(speed)

# Calculate the terms needed for the numerator and denominator of beta
data['xycov'] = (data['Time'] - xmean) * (data['Speed'] - ymean)
data['xvar'] = (data['Time'] - xmean)**2

# Calculate beta and alpha
beta = data['xycov'].sum() / data['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')

alpha = -16.78636363636363
beta = 11.977272727272727
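As a cross-check on the hand calculation, a degree-1 polynomial fit should give the same two numbers. This is a sketch using `numpy.polyfit`, with the lists copied from the code cell above.

```python
import numpy as np

time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2]

# polyfit returns coefficients from the highest power down: [slope, intercept]
slope, intercept = np.polyfit(time, speed, deg=1)
print(f'alpha = {intercept}')
print(f'beta = {slope}')
```

Agreement up to floating-point rounding is expected, since a least-squares straight-line fit solves the same minimisation problem.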

We now have an estimate for alpha and beta! Our model can be written as Yₑ = 11.977 X - 16.786, and we can make predictions:

X = np.array(time)

ypred = alpha + beta * X
print(ypred)

[-16.78636364  -4.80909091   7.16818182  19.14545455  31.12272727
  43.1         55.07727273  67.05454545  79.03181818  91.00909091
 102.98636364]

Let’s plot our prediction ypred against the actual values of y, to get a better visual understanding of our model:

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred, color="red")     # regression line
plt.plot(time, speed, 'o', color="blue")   # scatter plot showing actual data (blue markers)
plt.title('Actual vs Predicted')
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')

plt.show()

The red line is our line of best fit, Yₑ = 11.977 X - 16.786. We can see from this graph that there is a positive linear relationship between X and y. Using our model, we can predict y from any value of X!

For example, if we had a value X = 20, we can predict that:

ypred_20 = alpha + beta * 20
print(ypred_20)

222.7590909090909

Linear Regression with statsmodels:¶

First, we use statsmodels’ ols function to initialise our simple linear regression model. This takes the formula y ~ X, where X is the predictor variable (Time) and y is the output variable (Speed). Then, we fit the model by calling the OLS object’s fit() method.

import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Speed ~ Time', data=data)
model = model.fit()

We no longer have to calculate alpha and beta ourselves as this method does it automatically for us! Calling model.params will show us the model’s parameters:

model.params

Intercept   -16.786364
Time         11.977273
dtype: float64

In the notation that we have been using, α is the intercept and β is the slope, i.e. α = -16.786364 and β = 11.977273.

# Predict values
speed_pred = model.predict()

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(data['Time'], data['Speed'], 'o')           # scatter plot showing actual data
plt.plot(data['Time'], speed_pred, 'r', linewidth=2)   # regression line
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.title('model vs observed')

plt.show()

How good do you feel about this predictive model? Would you trust it?
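One hedged way to approach this question is to look at the residuals (observed minus predicted) and at the coefficient of determination R². The sketch below reuses the alpha and beta printed earlier; the variable names are made up for this illustration.

```python
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.2]
alpha, beta = -16.78636363636363, 11.977272727272727   # values printed earlier

predicted = [alpha + beta * t for t in time]
residuals = [y - y_hat for y, y_hat in zip(speed, predicted)]

# R^2 = 1 - (unexplained variation) / (total variation)
y_mean = sum(speed) / len(speed)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_mean) ** 2 for y in speed)
r_squared = 1 - ss_res / ss_tot

for t, r in zip(time, residuals):
    print(f"t = {t:4.1f} s   residual = {r:7.2f}")
print(f"R^2 = {r_squared:.3f}")
```

R² alone looks respectable, but the residuals are positive at both ends and negative in the middle. That pattern suggests the data curve upward, so a straight line may underestimate the speed when extrapolating to later times, and it predicts a negative speed at t = 0.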

Example 2: Advertising and Sales!

This is a classic regression problem. We have a dataset of the spending on TV, Radio, and Newspaper advertisements and the number of sales for a specific product. We are interested in exploring the relationship between these parameters and answering the following questions:

Can TV advertising spending predict the number of sales for the product?
Can Radio advertising spending predict the number of sales for the product?
Can Newspaper advertising spending predict the number of sales for the product?
Can we use the three of them to predict the number of sales for the product? | Multiple Linear Regression Model
Which parameter is a better predictor of the number of sales for the product?

# Import and display first rows of the advertising dataset
df = pd.read_csv('advertising.csv')
df.head()

# Describe the df
df.describe()

tv = np.array(df['TV'])
radio = np.array(df['Radio'])
newspaper = np.array(df['Newspaper'])
sales = np.array(df['Sales'])

# Get Variance and Covariance - What can we infer?
df.cov()

# Get Correlation Coefficient - What can we infer?
df.corr(method ='pearson')

# Answer the first question: Can TV advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=df)
model = model.fit()
print(model.params)

Intercept    7.032594
TV           0.047537
dtype: float64
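The fitted parameters can be cross-checked against the OLS formulas using only summary numbers. The sketch below copies the TV and Sales entries from this notebook's `df.cov()` and `df.describe()` outputs; the variable names are made up for this illustration.

```python
# Values copied from the df.cov() and df.describe() outputs of this notebook
cov_tv_sales = 350.390195    # Cov(TV, Sales)
var_tv = 7370.949893         # Var(TV)
mean_tv = 147.0425
mean_sales = 14.0225

beta_tv = cov_tv_sales / var_tv                # slope = Cov(X, Y) / Var(X)
alpha_tv = mean_sales - beta_tv * mean_tv      # intercept = mean(Y) - slope * mean(X)

print(f"Intercept    {alpha_tv:.6f}")
print(f"TV           {beta_tv:.6f}")
```

The same two lines of arithmetic should apply to Radio and Newspaper when each is used as the only predictor.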

# Predict values
TV_pred = model.predict()

# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['TV'], df['Sales'], 'o')           # scatter plot showing actual data
plt.plot(df['TV'], TV_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with TV spendings only')

plt.show()

# Answer the second question: Can Radio advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ Radio', data=df)
model = model.fit()
print(model.params)

Intercept    9.311638
Radio        0.202496
dtype: float64

# Predict values
RADIO_pred = model.predict()

# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['Radio'], df['Sales'], 'o')           # scatter plot showing actual data
plt.plot(df['Radio'], RADIO_pred, 'r', linewidth=2)   # regression line
plt.xlabel('Radio advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with Radio spendings only')

plt.show()

# Answer the third question: Can Newspaper advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ Newspaper', data=df)
model = model.fit()
print(model.params)

Intercept    12.351407
Newspaper     0.054693
dtype: float64

# Predict values
NP_pred = model.predict()

# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['Newspaper'], df['Sales'], 'o')           # scatter plot showing actual data
plt.plot(df['Newspaper'], NP_pred, 'r', linewidth=2)   # regression line
plt.xlabel('Newspaper advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with Newspaper spendings only')

plt.show()

# Answer the fourth question: Can we use the three of them to predict the number of sales for the product?
# This is a case of multiple linear regression, which is simply a linear regression model with more than one predictor.
# It is modelled by: Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.
# In this case: Sales = α + β₁*TV + β₂*Radio + β₃*Newspaper
# Multiple Linear Regression with scikit-learn:
from sklearn.linear_model import LinearRegression

# Build linear regression model using TV,Radio and Newspaper as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio', 'Newspaper']
X = df[predictors]
y = df['Sales']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)

print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')

alpha = 2.9388893694594085
betas = [ 0.04576465  0.18853002 -0.00103749]

# Therefore, our model can be written as:
# Sales = 2.939 + 0.046*TV + 0.189*Radio - 0.001*Newspaper
# We can predict sales from any combination of TV, Radio, and Newspaper advertising costs.
# For example, to estimate the sales from investing $300 in TV advertising,
# $200 in Radio advertising, and $50 in Newspaper advertising,
# all we have to do is plug in the values:
new_X = pd.DataFrame([[300, 200, 50]], columns=predictors)  # same column names as used for fitting
print(model.predict(new_X))

[54.32241174]
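To make the multiple regression less of a black box, the sketch below solves the same kind of least-squares problem directly with NumPy. The data are made up (six observations generated from known coefficients) so the result can be checked; it is an illustration of the idea, not a description of scikit-learn's internals.

```python
import numpy as np

# Made-up data: 6 observations of 3 predictors, generated from known
# coefficients so that the fit can be checked
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 0.0, 1.5],
              [3.0, 1.0, 1.0],
              [4.0, 3.0, 2.5],
              [5.0, 2.0, 0.0],
              [6.0, 5.0, 3.0]])
true_alpha = 3.0
true_betas = np.array([0.5, 2.0, -1.0])
y = true_alpha + X @ true_betas

# Prepend a column of ones so the intercept is estimated as one more coefficient
design = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha, betas = coefficients[0], coefficients[1:]

print(f'alpha = {alpha}')
print(f'betas = {betas}')
```

Because the made-up data contain no noise, the recovered alpha and betas should match the ones used to generate them, up to floating-point rounding.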

# Answer the final question : Which parameter is a better predictor of the number of sales for the product?
# How can we answer that?
# WHAT CAN WE INFER FROM THE BETAs ?
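One possible starting point for this question, offered as a sketch and not as the intended answer: for a regression with a single predictor, R² equals the squared Pearson correlation, so the correlations in this notebook's `df.corr()` output already rank the three single-predictor models.

```python
# Pearson correlations with Sales, copied from the df.corr() output of this notebook
corr_with_sales = {'TV': 0.782224, 'Radio': 0.576223, 'Newspaper': 0.228299}

# For a regression with a single predictor, R^2 equals the squared correlation
r_squared = {name: r ** 2 for name, r in corr_with_sales.items()}

# List the predictors from the largest share of explained variation to the smallest
for name, value in sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} R^2 = {value:.3f}")
```

The betas themselves are harder to compare: each one depends on the units of its predictor and on which other predictors are in the model. The Newspaper coefficient, for instance, is about 0.055 on its own but close to zero in the three-predictor model, which may be related to the 0.354 correlation between Newspaper and Radio.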

This notebook was inspired by several blog posts, including:

"Introduction to Linear Regression in Python" by Lorraine Li, available at https://towardsdatascience.com/introduction-to-linear-regression-in-python-c12a072bedf0
"In Depth: Linear Regression", available at https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html
"A friendly introduction to linear regression (using Python)", available at https://www.dataschool.io/linear-regression-in-python/

Here are some great reads on linear regression:

"Linear Regression in Python" by Sadrach Pierre, available at https://towardsdatascience.com/linear-regression-in-python-a1d8c13f3242
"Introduction to Linear Regression in Python", available at https://cmdlinetips.com/2019/09/introduction-to-linear-regression-in-python/
"Linear Regression in Python" by Mirko Stojiljković, available at https://realpython.com/linear-regression-in-python/

Here are some great videos on linear regression:

"StatQuest: Fitting a line to data, aka least squares, aka linear regression." by StatQuest with Josh Starmer, available at https://www.youtube.com/watch?v=PaFPbb66DxQ&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU
"Statistics 101: Linear Regression, The Very Basics" by Brandon Foltz, available at https://www.youtube.com/watch?v=ZkjP5RJLQF4
"How to Build a Linear Regression Model in Python | Part 1" (and Parts 2, 3, and 4) by Sigma Coding, available at https://www.youtube.com/watch?v=MRm5sBfdBBQ

Output of data.describe() (Example 1):

	Time	Speed
count	11.000000	11.000000
mean	5.000000	43.100000
std	3.316625	41.204077
min	0.000000	0.000000
25%	2.500000	9.500000
50%	5.000000	30.000000
75%	7.500000	69.000000
max	10.000000	121.200000

Output of df.head() (Example 2):

	TV	Radio	Newspaper	Sales
0	230.1	37.8	69.2	22.1
1	44.5	39.3	45.1	10.4
2	17.2	45.9	69.3	9.3
3	151.5	41.3	58.5	18.5
4	180.8	10.8	58.4	12.9

Output of df.describe() (Example 2):

	TV	Radio	Newspaper	Sales
count	200.000000	200.000000	200.000000	200.000000
mean	147.042500	23.264000	30.554000	14.022500
std	85.854236	14.846809	21.778621	5.217457
min	0.700000	0.000000	0.300000	1.600000
25%	74.375000	9.975000	12.750000	10.375000
50%	149.750000	22.900000	25.750000	12.900000
75%	218.825000	36.525000	45.100000	17.400000
max	296.400000	49.600000	114.000000	27.000000

Output of df.cov() (Example 2):

	TV	Radio	Newspaper	Sales
TV	7370.949893	69.862492	105.919452	350.390195
Radio	69.862492	220.427743	114.496979	44.635688
Newspaper	105.919452	114.496979	474.308326	25.941392
Sales	350.390195	44.635688	25.941392	27.221853

Output of df.corr(method ='pearson') (Example 2):

	TV	Radio	Newspaper	Sales
TV	1.000000	0.054809	0.056648	0.782224
Radio	0.054809	1.000000	0.354104	0.576223
Newspaper	0.056648	0.354104	1.000000	0.228299
Sales	0.782224	0.576223	0.228299	1.000000