# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
The table below contains some experimental observations.
| Elapsed Time (s) | Speed (m/s) |
|---|---|
| 0 | 0 |
| 1.0 | 3 |
| 2.0 | 7 |
| 3.0 | 12 |
| 4.0 | 20 |
| 5.0 | 30 |
| 6.0 | 45.6 |
| 7.0 | 60.3 |
| 8.0 | 77.7 |
| 9.0 | 97.3 |
| 10.0 | 121.1 |
Linear Regression:
a basic predictive analytics technique that uses historical data to predict an output variable.
The Predictor variable (input):
the variable(s) that help predict the value of the output variable. It is commonly referred to as X.
The Output variable:
the variable that we want to predict. It is commonly referred to as Y.
The relationship is modelled by the linear equation Yₑ = α + βX, where Yₑ is the estimated or predicted value of Y based on our linear equation, α is the intercept and β is the slope (the coefficient of X).
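The least-squares estimates of these coefficients are β = Cov(X, Y) / Var(X) and α = Ȳ − βX̄; these are exactly the quantities the notebook computes by hand further below.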
Covariance:
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
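For a sample of n paired observations, the sample covariance is Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1); this is what the pandas .cov() method used later in the notebook computes.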
The Correlation Coefficient:
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson's. Pearson's correlation (also called Pearson's R) is a correlation coefficient commonly used in linear regression. The formula for Pearson's R is:

R = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

The formula returns a value between -1 and 1, where:
1 : A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
-1: A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.
0 : A correlation coefficient of 0 means that increases in one variable are not associated with consistent increases or decreases in the other; the two variables are not linearly related.
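As a quick illustration (a minimal sketch with made-up numbers, not the experimental data below), Pearson's R can be computed directly from its formula and checked against NumPy:
# Illustrative sketch: Pearson's R by hand vs. numpy, on toy data
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both close to +1 for this almost perfectly linear toy data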
# Load the necessary packages
import numpy as np
import pandas as pd
import statistics
from matplotlib import pyplot as plt
# Create a dataframe:
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
speed = [0, 3, 7, 12, 20, 30, 45.6, 60.3, 77.7, 97.3, 121.1]
data = pd.DataFrame({'Time':time, 'Speed':speed})
data
data.describe()
time_var = statistics.variance(time)
speed_var = statistics.variance(speed)
print("Variance of recorded times is ",time_var)
print("Variance of recorded speed is ",speed_var)
# To find the covariance
data.cov()
# To find the correlation among the columns
# using pearson method
data.corr(method ='pearson')
# Calculate the mean of X and y
xmean = np.mean(time)
ymean = np.mean(speed)
# Calculate the terms needed for the numerator and denominator of beta
data['xycov'] = (data['Time'] - xmean) * (data['Speed'] - ymean)
data['xvar'] = (data['Time'] - xmean)**2
# Calculate beta and alpha
beta = data['xycov'].sum() / data['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')
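As a consistency check (a small sketch using quantities already computed above), the slope should equal the sample covariance of Time and Speed divided by the sample variance of Time:
# Cross-check: beta = Cov(Time, Speed) / Var(Time)
print(data.cov().loc['Time', 'Speed'] / time_var)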
X = np.array(time)
ypred = alpha + beta * X
print(ypred)
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred, color="red") # regression line
plt.plot(time, speed, 'o', color="blue") # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.show()
ypred_20 = alpha + beta * 20
print(ypred_20)
import statsmodels.formula.api as smf
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Speed ~ Time', data=data)
model = model.fit()
model.params
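Beyond the coefficients, the fitted statsmodels results object also exposes goodness-of-fit information; for example (a brief sketch):
# R-squared and a full summary table for the fitted model
print(model.rsquared)
print(model.summary())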
# Predict values
speed_pred = model.predict()
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(data['Time'], data['Speed'], 'o') # scatter plot showing actual data
plt.plot(data['Time'], speed_pred, 'r', linewidth=2) # regression line
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.title('Model vs Observed')
plt.show()
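The fitted formula model can also predict for new inputs. For example, predicting the speed at t = 20 s again (a sketch; this is an extrapolation well outside the observed 0–10 s range, so it should give essentially the same value as the manual estimate above but be treated with caution):
# Predict at a new time value using the fitted statsmodels model
print(model.predict(pd.DataFrame({'Time': [20]})))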
# Import and display first rows of the advertising dataset
df = pd.read_csv('advertising.csv')
df.head()
# Describe the df
df.describe()
tv = np.array(df['TV'])
radio = np.array(df['Radio'])
newspaper = np.array(df['Newspaper'])
sales = np.array(df['Sales'])
# Get Variance and Covariance - What can we infer?
df.cov()
# Get Correlation Coefficient - What can we infer?
df.corr(method ='pearson')
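A convenient way to read this matrix (a small sketch) is to look only at the Sales column of the correlations, sorted by strength:
# Correlation of each column with Sales, sorted (Sales itself will be 1.0)
print(df.corr(method='pearson')['Sales'].sort_values(ascending=False))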
# Answer the first question: Can TV advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=df)
model = model.fit()
print(model.params)
# Predict values
TV_pred = model.predict()
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['TV'], df['Sales'], 'o') # scatter plot showing actual data
plt.plot(df['TV'], TV_pred, 'r', linewidth=2) # regression line
plt.xlabel('TV advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with TV spendings only')
plt.show()
# Answer the second question: Can Radio advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ Radio', data=df)
model = model.fit()
print(model.params)
# Predict values
RADIO_pred = model.predict()
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['Radio'], df['Sales'], 'o') # scatter plot showing actual data
plt.plot(df['Radio'], RADIO_pred, 'r', linewidth=2) # regression line
plt.xlabel('Radio advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with Radio spendings only')
plt.show()
# Answer the third question: Can Newspaper advertising spending predict the number of sales for the product?
import statsmodels.formula.api as smf
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ Newspaper', data=df)
model = model.fit()
print(model.params)
# Predict values
NP_pred = model.predict()
# Plot regression against actual data - What do we see?
plt.figure(figsize=(12, 6))
plt.plot(df['Newspaper'], df['Sales'], 'o') # scatter plot showing actual data
plt.plot(df['Newspaper'], NP_pred, 'r', linewidth=2) # regression line
plt.xlabel('Newspaper advertising spending')
plt.ylabel('Sales')
plt.title('Predicting with Newspaper spendings only')
plt.show()
# Answer the fourth question: Can we use the three of them to predict the number of sales for the product?
# This is a case of multiple linear regression: a linear regression model with more than one predictor,
# and is modelled by: Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ , where p is the number of predictors.
# In this case: Sales = α + β1*TV + β2*Radio + β3*Newspaper
# Multiple Linear Regression with scikit-learn:
from sklearn.linear_model import LinearRegression
# Build linear regression model using TV,Radio and Newspaper as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio', 'Newspaper']
X = df[predictors]
y = df['Sales']
# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)
print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')
# Therefore, our model can be written as:
# Sales = 2.938 + 0.046*TV + 0.1885*Radio - 0.001*Newspaper
# We can predict sales from any combination of TV, Radio and Newspaper advertising costs!
# For example, if we wanted to know how many sales we would make if we invested
# $300 in TV advertising, $200 in Radio advertising and $50 in Newspaper advertising,
# all we have to do is plug in the values:
new_X = pd.DataFrame([[300, 200, 50]], columns=predictors) # keep feature names consistent with the training data
print(model.predict(new_X))
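For comparison (a sketch, not part of the original analysis), the same multiple regression can be fitted with statsmodels, whose summary also reports standard errors and p-values for each coefficient:
# Multiple regression with statsmodels to inspect coefficient significance
full_model = smf.ols('Sales ~ TV + Radio + Newspaper', data=df).fit()
print(full_model.summary())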
# Answer the final question : Which parameter is a better predictor of the number of sales for the product?
# How can we answer that?
# WHAT CAN WE INFER FROM THE BETAs ?
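One simple way to compare the three single-predictor models (a sketch) is to look at the R² of each fit; a higher R² means that predictor explains more of the variation in Sales:
# Compare R-squared of the three single-predictor models
for col in ['TV', 'Radio', 'Newspaper']:
    r2 = smf.ols(f'Sales ~ {col}', data=df).fit().rsquared
    print(col, r2)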
This notebook was inspired by several blog posts, including:
Here are some great reads on linear regression:
Here are some great videos on linear regression: