Laboratory 15: "It's a Wrap"

In [100]:
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
DESKTOP-EH6HD63
desktop-eh6hd63\farha
C:\Users\Farha\Anaconda3\python.exe
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)

Full name:

R#:

HEX:

Title of the notebook

Date:

Step0- Import the necessary libraries

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import scipy.stats
import seaborn as sns

Step1- A case of mercury contamination of groundwater has been reported. Our field operation team has just returned from the first round of sampling. During this initial phase, three sets of 20 samples, one set from each of three wells, were collected and brought to the laboratory as a data file. Concentrations are reported in nanograms of mercury per liter of groundwater (ng/L). Read the "lab15_minidf.csv" file as a dataframe.

In [15]:
data = pd.read_csv("lab15_minidf.csv") 
data
Out[15]:
Set1 Set2 Set3
0 43.338812 52.433597 101.360589
1 14.292122 57.099923 78.725571
2 40.534000 54.109884 100.029478
3 30.028993 49.095951 56.684550
4 65.068913 67.372179 82.210959
5 16.270190 59.161718 46.741041
6 62.635690 58.610019 54.710679
7 59.538645 69.943972 52.788361
8 46.319841 52.490195 82.780907
9 61.200924 68.361822 42.596775
10 70.046684 67.839000 116.854545
11 29.822311 69.825583 70.091906
12 37.054487 56.997331 43.959810
13 45.519222 44.697895 90.688038
14 19.508576 42.026923 50.000178
15 48.268636 54.733521 26.898010
16 31.393480 56.917967 69.055142
17 39.074647 42.643469 30.516556
18 63.265439 64.290856 87.759761
19 53.990135 46.966161 63.085782

Step2- Let's explore the dataset.

In [16]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
Set1    20 non-null float64
Set2    20 non-null float64
Set3    20 non-null float64
dtypes: float64(3)
memory usage: 608.0 bytes

Step3- Use descriptive statistics and get an estimate of the center of the distribution for each set

In [18]:
#For set1:
set1 = data['Set1']

print('For set 1',' the arithmetic mean is: ',set1.mean())
print('For set 1',' the median is: ',set1.median())
For set 1  the arithmetic mean is:  43.858587392747665
For set 1  the median is:  44.42901682176475
In [20]:
#For set2:
set2 = data['Set2']

print('For set 2',' the arithmetic mean is: ',set2.mean())
print('For set 2',' the median is: ',set2.median())
For set 2  the arithmetic mean is:  56.78089826605933
For set 2  the median is:  56.95764933866367
In [21]:
#For set3:
set3 = data['Set3']

print('For set 3',' the arithmetic mean is: ',set3.mean())
print('For set 3',' the median is: ',set3.median())
For set 3  the arithmetic mean is:  67.37693190592594
For set 3  the median is:  66.07046180680064

Step4- Use descriptive statistics and quantify the spread of data points for each set

In [23]:
#For set1:
print('For set 1',' the range is: ',np.ptp(set1))
print('For set 1',' the IQR is: ',scipy.stats.iqr(set1))
print('For set 1',' the 5-number summary is: ',set1.describe())
print('For set 1',' the variance is: ',statistics.variance(set1))
print('For set 1',' the standard deviation is: ',statistics.stdev(set1))
For set 1  the range is:  55.75456205934812
For set 1  the IQR is:  28.90185639631407
For set 1  the 5-number summary is:  count    20.000000
mean     43.858587
std      16.850441
min      14.292122
25%      31.052359
50%      44.429017
75%      59.954215
max      70.046684
Name: Set1, dtype: float64
For set 1  the variance is:  283.937370064773
For set 1  the standard deviation is:  16.85044124243555
In [24]:
#For set2:
print('For set 2',' the range is: ',np.ptp(set2))
print('For set 2',' the IQR is: ',scipy.stats.iqr(set2))
print('For set 2',' the 5-number summary is: ',set2.describe())
print('For set 2',' the variance is: ',statistics.variance(set2))
print('For set 2',' the standard deviation is: ',statistics.stdev(set2))
For set 2  the range is:  27.91704859996223
For set 2  the IQR is:  13.462001321421276
For set 2  the 5-number summary is:  count    20.000000
mean     56.780898
std       9.017977
min      42.026923
25%      51.599185
50%      56.957649
75%      65.061187
max      69.943972
Name: Set2, dtype: float64
For set 2  the variance is:  81.32390287103372
For set 2  the standard deviation is:  9.0179766506148
In [25]:
#For set3:
print('For set 3',' the range is: ',np.ptp(set3))
print('For set 3',' the IQR is: ',scipy.stats.iqr(set3))
print('For set 3',' the 5-number summary is: ',set3.describe())
print('For set 3',' the variance is: ',statistics.variance(set3))
print('For set 3',' the standard deviation is: ',statistics.stdev(set3))
For set 3  the range is:  89.95653530194858
For set 3  the IQR is:  34.84022657880441
For set 3  the 5-number summary is:  count     20.000000
mean      67.376932
std       24.727707
min       26.898010
25%       49.185394
50%       66.070462
75%       84.025620
max      116.854545
Name: Set3, dtype: float64
For set 3  the variance is:  611.4594963965791
For set 3  the standard deviation is:  24.7277070590174

Step5- Use descriptive statistics and compare the skewness of all sets

In [26]:
skew1 = set1.skew()
skew2 = set2.skew()
skew3 = set3.skew()
print('For set 1 the skewness is ',skew1,'For set 2 the skewness is ',skew2,'For set 3 the skewness is ',skew3)
For set 1 the skewness is  -0.2131967717425499 For set 2 the skewness is  -0.02809145150056633 For set 3 the skewness is  0.23123717307961747
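
A side note (an addition, not part of the original lab): pandas' .skew() reports the bias-corrected (adjusted Fisher-Pearson) sample skewness, while scipy.stats.skew, used later for the gamma fit, defaults to the biased estimator, so the two can differ slightly for a sample of 20. A quick check:

In [ ]:
# Sketch: compare the two skewness estimators on Set1
print('pandas (bias-corrected)  :', set1.skew())
print('scipy  (biased, default) :', scipy.stats.skew(set1))
print('scipy  (bias=False)      :', scipy.stats.skew(set1, bias=False))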

Step6- Use boxplots and visually compare the spread of data points in all sets

In [27]:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot([set1, set2, set3], notch=True, sym='')  # notched boxes; outlier markers suppressed
plt.show()

Step7- Use histograms and visually compare the distribution of data points in all sets

In [31]:
set1.plot.hist(density=False, bins=6,color="red")
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dc29a08>
In [32]:
set2.plot.hist(density=False, bins=6,color="blue")
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dc99688>
In [33]:
set3.plot.hist(density=False, bins=6,color="gold")
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376b9a62c8>
In [34]:
fig, ax = plt.subplots()
data.plot.hist(density=False, ax=ax, bins=6,color=("red","blue","gold"))
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dcc8f48>

Step8- Use histograms with KDE and visually compare the continuous shape of distributions in all sets

In [36]:
sns.distplot(set1,color='red', rug=True,kde=True)
sns.distplot(set2,color='blue', rug=True,kde=True)
sns.distplot(set3,color='gold', rug=True,kde=True)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dd99488>

Step9- Use Gringorten Plotting Position Formula and draw a quantile plot for each set
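
For the i-th smallest value in a sample of size n, the Gringorten plotting position is p_i = (i - 0.44)/(n + 0.12); the function below implements exactly this.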

In [37]:
# First, define the function for the Gringorten Plotting Position Formula: 
def gringorten_pp(sample): # plotting position function
    """Return a list of Gringorten plotting positions; sample must be a numeric list or array (it is sorted in place)."""
    gringorten_pp = [] # empty list to fill and return
    sample.sort() # sort the sample in place
    for i in range(0,len(sample),1):
        gringorten_pp.append((i+1-0.44)/(len(sample)+0.12)) # values from the Gringorten formula
    return gringorten_pp
In [38]:
# Second, apply it on each set
set1 = np.array(set1)
set2 = np.array(set2)
set3 = np.array(set3)

set1_grin = gringorten_pp(set1)
set2_grin = gringorten_pp(set2)
set3_grin = gringorten_pp(set3)
In [42]:
# Third, plot them
myfigure = plt.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio

plt.scatter(set1_grin, set1 ,color ='red',
            marker ="^",  
            s = 50)
plt.scatter(set2_grin, set2 ,color ='blue',
            marker ="o",  
            s = 20)
plt.scatter(set3_grin, set3 ,color ='gold',
            marker ="s",  
            s = 20)

plt.xlabel("Density or Quantile Value") 
plt.ylabel("Value") 
plt.title("Quantile Plot for Set1, Set2, and Set3 based on the Gringorten Plotting Position Formula") 
plt.show()

Step10- Fit a Normal, Gumbel (Double Exponential), and Gamma Distribution Data Model and find the best alternative for each set.
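
As a cross-check on the hand-built quantile functions below, here is a minimal sketch (an addition, not part of the original lab) that fits the same three families with scipy.stats' built-in maximum-likelihood fitters. Because these are MLE fits, the parameters will differ somewhat from the moment-based estimates used in the lab code.

In [ ]:
# Sketch: maximum-likelihood fits of the three candidate families to Set1
from scipy.stats import norm, gumbel_r, gamma

print('Normal (loc, scale)   :', norm.fit(set1))
print('Gumbel (loc, scale)   :', gumbel_r.fit(set1))
print('Gamma  (a, loc, scale):', gamma.fit(set1))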

In [43]:
# Normal distribution CDF (used to draw the fitted model on the quantile plot)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist
In [46]:
#For Set1
mu = set1.mean() # Fitted Model
sigma = set1.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set1) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [48]:
#For Set2
mu = set2.mean() # Fitted Model
sigma = set2.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set2) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [51]:
#For Set3
mu = set3.mean() # Fitted Model
sigma = set3.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set3) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [52]:
# Gumbel (Extreme Value Type I) CDF

def ev1dist(x,alpha,beta):
    argument = (x - alpha)/beta
    constant = 1.0/beta
    ev1dist = math.exp(-1.0*math.exp(-1.0*argument))
    return ev1dist
In [54]:
#For Set1
sample = set1
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [56]:
#For Set2
sample = set2
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [57]:
#For Set3
sample = set3
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [58]:
# Gamma (Pearson Type III) CDF
def gammacdf(x,tau,alpha,beta): # Gamma cumulative distribution function - shifts/scales the three-parameter form onto scipy's one-parameter gamma
    xhat = x-tau
    lamda = 1.0/beta
    gammacdf = scipy.stats.gamma.cdf(lamda*xhat, alpha)
    return gammacdf
In [59]:
#For Set1
set1_mean  = np.array(set1).mean()
set1_stdev = np.array(set1).std()
set1_skew  = scipy.stats.skew(set1)
set1_alpha = 4.0/(set1_skew**2)
set1_beta  = np.sign(set1_skew)*math.sqrt(set1_stdev**2/set1_alpha)
set1_tau   = set1_mean - set1_alpha*set1_beta
#
x = []; ycdf = []
xlow = (0.9*min(set1)); xhigh = (1.1*max(set1)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set1_tau,set1_alpha,set1_beta)
    ycdf.append(yvalue) 
####
rycdf = ycdf[::-1] # the fitted beta is negative for Set1, so the computed CDF runs backwards; reverse it for plotting

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(rycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set1_mean) + "\n"
mytitle += "SD = " + str(set1_stdev) + "\n"
mytitle += "Skew = " + str(set1_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [60]:
#For Set2
set2_mean  = np.array(set2).mean()
set2_stdev = np.array(set2).std()
set2_skew  = scipy.stats.skew(set2)
set2_alpha = 4.0/(set2_skew**2)
set2_beta  = np.sign(set2_skew)*math.sqrt(set2_stdev**2/set2_alpha)
set2_tau   = set2_mean - set2_alpha*set2_beta
#
x = []; ycdf = []
xlow = (0.9*min(set2)); xhigh = (1.1*max(set2)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set2_tau,set2_alpha,set2_beta)
    ycdf.append(yvalue) 
####
rycdf = ycdf[::-1] # the fitted beta is negative for Set2, so the computed CDF runs backwards; reverse it for plotting

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(rycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set2_mean) + "\n"
mytitle += "SD = " + str(set2_stdev) + "\n"
mytitle += "Skew = " + str(set2_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [62]:
#For Set3
set3_mean  = np.array(set3).mean()
set3_stdev = np.array(set3).std()
set3_skew  = scipy.stats.skew(set3)
set3_alpha = 4.0/(set3_skew**2)
set3_beta  = np.sign(set3_skew)*math.sqrt(set3_stdev**2/set3_alpha)
set3_tau   = set3_mean - set3_alpha*set3_beta
#
x = []; ycdf = []
xlow = (0.9*min(set3)); xhigh = (1.1*max(set3)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set3_tau,set3_alpha,set3_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1] # not needed for Set3: the fitted skew (and beta) is positive, so the CDF is already increasing

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set3_mean) + "\n"
mytitle += "SD = " + str(set3_stdev) + "\n"
mytitle += "Skew = " + str(set3_skew) + "\n"
plt.title(mytitle) 
plt.show()

Step11- From visual assessment, the Normal Distribution for Set1 and Set2 and the Gamma Distribution for Set3 provide better fits. Run appropriate hypothesis tests and decide whether each set of samples has a normal distribution or not.

In [63]:
# The Shapiro-Wilk Normality Test for Set1
from scipy.stats import shapiro
stat, p = shapiro(set1)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.955, p=0.451
Probably Gaussian
In [64]:
# The Shapiro-Wilk Normality Test for Set2
from scipy.stats import shapiro
stat, p = shapiro(set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.941, p=0.246
Probably Gaussian
In [65]:
# The Shapiro-Wilk Normality Test for Set3
from scipy.stats import shapiro
stat, p = shapiro(set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.975, p=0.858
Probably Gaussian
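
Step11 suggested a gamma model for Set3, and the Shapiro-Wilk test only checks normality. As a complementary check (a sketch, an addition to the lab), a one-sample Kolmogorov-Smirnov test can compare Set3 against a scipy-fitted gamma; note that estimating the parameters from the same sample makes the p-value somewhat optimistic.

In [ ]:
# Sketch: K-S goodness-of-fit test of Set3 against a fitted gamma model
from scipy.stats import gamma, kstest
a, loc, scale = gamma.fit(set3)                      # maximum-likelihood fit
stat, p = kstest(set3, 'gamma', args=(a, loc, scale))
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Gamma model not rejected')
else:
    print('Gamma model rejected')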

Step13- Run appropriate hypothesis tests and decide whether the three sets are significantly different or not.

In [66]:
# The Student's t-test for Set1 and Set2
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-3.024, p=0.004
Probably different distributions
In [67]:
# The Student's t-test for Set1 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-3.515, p=0.001
Probably different distributions
In [68]:
# The Student's t-test for Set2 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-1.800, p=0.080
Probably the same distribution
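
Because all three sets passed the normality check, a one-way ANOVA across the three sets is a natural complement to the pairwise t-tests; this sketch mirrors the f_oneway call used later in the notebook for the larger dataset.

In [ ]:
# Sketch: one-way ANOVA across the three 20-sample sets
from scipy.stats import f_oneway
stat, p = f_oneway(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')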

Step14- Our field operation team installed a monitoring device on each well that samples and records the mercury concentration roughly 28 times per hour. After a month, the monitoring log was brought to the lab. Read the "lab15_maxidf.csv" file as a dataframe.

In [69]:
data = pd.read_csv("lab15_maxidf.csv") 
data
Out[69]:
SetA SetB SetC
0 38.929139 56.978021 44.285459
1 55.018891 64.192515 35.803822
2 67.735682 57.229914 64.765594
3 64.558983 64.927195 60.737594
4 50.817111 68.822775 44.945246
... ... ... ...
18857 27.688223 51.103824 21.939277
18858 34.347352 51.383197 48.454103
18859 41.411880 52.662563 34.445571
18860 64.918172 50.101195 59.970350
18861 19.794307 65.266238 42.728845

18862 rows × 3 columns
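
Quick sanity check on the sampling description: 28 samples per hour x 24 hours x 28 days = 18,816, which is consistent with the 18,862 records in the log (roughly a month of near-continuous monitoring).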

Step15- Let's explore the dataset.

In [70]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 3 columns):
SetA    18862 non-null float64
SetB    18862 non-null float64
SetC    18862 non-null float64
dtypes: float64(3)
memory usage: 442.2 KB

Step16- Use descriptive statistics and get an estimate of the center of the distribution for each set

In [72]:
#For set1:
set1 = data['SetA']

print('For set 1',' the arithmetic mean is: ',set1.mean())
print('For set 1',' the median is: ',set1.median())
For set 1  the arithmetic mean is:  50.13319492879079
For set 1  the median is:  50.07079019699734
In [73]:
#For set2:
set2 = data['SetB']

print('For set 2',' the arithmetic mean is: ',set2.mean())
print('For set 2',' the median is: ',set2.median())
For set 2  the arithmetic mean is:  54.93138284325201
For set 2  the median is:  54.911500610793205
In [74]:
#For set3:
set3 = data['SetC']

print('For set 3',' the arithmetic mean is: ',set3.mean())
print('For set 3',' the median is: ',set3.median())
For set 3  the arithmetic mean is:  60.295688567980314
For set 3  the median is:  53.64454069518722

Step17- Use descriptive statistics and quantify the spread of data points for each set

In [75]:
#For set1:
print('For set 1',' the range is: ',np.ptp(set1))
print('For set 1',' the IQR is: ',scipy.stats.iqr(set1))
print('For set 1',' the 5-number summary is: ',set1.describe())
print('For set 1',' the variance is: ',statistics.variance(set1))
print('For set 1',' the standard deviation is: ',statistics.stdev(set1))
For set 1  the range is:  122.21336316204193
For set 1  the IQR is:  20.44161568459856
For set 1  the 5-number summary is:  count    18862.000000
mean        50.133195
std         15.097537
min         -9.535961
25%         39.829720
50%         50.070790
75%         60.271335
max        112.677402
Name: SetA, dtype: float64
For set 1  the variance is:  227.93562995032482
For set 1  the standard deviation is:  15.09753721473555
C:\Users\Farha\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
In [76]:
#For set2:
print('For set 2',' the range is: ',np.ptp(set2))
print('For set 2',' the IQR is: ',scipy.stats.iqr(set2))
print('For set 2',' the 5-number summary is: ',set2.describe())
print('For set 2',' the variance is: ',statistics.variance(set2))
print('For set 2',' the standard deviation is: ',statistics.stdev(set2))
For set 2  the range is:  29.999847594508175
For set 2  the IQR is:  14.933746312221992
For set 2  the 5-number summary is:  count    18862.000000
mean        54.931383
std          8.666687
min         40.000107
25%         47.480143
50%         54.911501
75%         62.413890
max         69.999954
Name: SetB, dtype: float64
For set 2  the variance is:  75.11145928998276
For set 2  the standard deviation is:  8.666686753885983
In [77]:
#For set3:
print('For set 3',' the range is: ',np.ptp(set3))
print('For set 3',' the IQR is: ',scipy.stats.iqr(set3))
print('For set 3',' the 5-number summary is: ',set3.describe())
print('For set 3',' the variance is: ',statistics.variance(set3))
print('For set 3',' the standard deviation is: ',statistics.stdev(set3))
For set 3  the range is:  351.38258641900387
For set 3  the IQR is:  43.71636821620718
For set 3  the 5-number summary is:  count    18862.000000
mean        60.295689
std         34.727722
min          1.090492
25%         34.968173
50%         53.644541
75%         78.684541
max        352.473079
Name: SetC, dtype: float64
For set 3  the variance is:  1206.014687498265
For set 3  the standard deviation is:  34.727722175493525

Step18- Use descriptive statistics and compare the skewness of all sets

In [78]:
skew1 = set1.skew()
skew2 = set2.skew()
skew3 = set3.skew()
print('For set 1 the skewness is ',skew1,'For set 2 the skewness is ',skew2,'For set 3 the skewness is ',skew3)
For set 1 the skewness is  0.027136451602721625 For set 2 the skewness is  0.009331321763676393 For set 3 the skewness is  1.1709638917286735

Step19- Use boxplots and visually compare the spread of data points in all sets

In [79]:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot([set1, set2, set3], notch=True, sym='')  # notched boxes; outlier markers suppressed
plt.show()

Step20- Use histograms and visually compare the distribution of data points in all sets

In [80]:
set1.plot.hist(density=False, bins=50,color="red")
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e496f88>
In [81]:
set2.plot.hist(density=False, bins=50,color="blue")
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e4968c8>
In [82]:
set3.plot.hist(density=False, bins=50,color="gold")
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e0b4288>
In [83]:
fig, ax = plt.subplots()
data.plot.hist(density=False, ax=ax, bins=50,color=("red","blue","gold"))
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e183d08>

Step21- Use histograms with KDE and visually compare the continuous shape of distributions in all sets

In [85]:
sns.distplot(set1,color='red', rug=True,kde=True)
sns.distplot(set2,color='blue', rug=True,kde=True)
sns.distplot(set3,color='gold', rug=True,kde=True)
Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e1da7c8>

Step22- Use Gringorten Plotting Position Formula and draw a quantile plot for each set

In [86]:
# First, define the function for the Gringorten Plotting Position Formula: 
def gringorten_pp(sample): # plotting position function
    """Return a list of Gringorten plotting positions; sample must be a numeric list or array (it is sorted in place)."""
    gringorten_pp = [] # empty list to fill and return
    sample.sort() # sort the sample in place
    for i in range(0,len(sample),1):
        gringorten_pp.append((i+1-0.44)/(len(sample)+0.12)) # values from the Gringorten formula
    return gringorten_pp
In [87]:
# Second, apply it on each set
set1 = np.array(set1)
set2 = np.array(set2)
set3 = np.array(set3)

set1_grin = gringorten_pp(set1)
set2_grin = gringorten_pp(set2)
set3_grin = gringorten_pp(set3)
In [89]:
# Third, plot them
myfigure = plt.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio

plt.scatter(set1_grin, set1 ,color ='red',
            marker ="^",  
            s = 50)
plt.scatter(set2_grin, set2 ,color ='blue',
            marker ="o",  
            s = 20)
plt.scatter(set3_grin, set3 ,color ='gold',
            marker ="s",  
            s = 20)

plt.xlabel("Density or Quantile Value") 
plt.ylabel("Value") 
plt.title("Quantile Plot for Set1, Set2, and Set3 based on the Gringorten Plotting Position Formula") 
plt.show()

Step23- Fit a Normal, Gumbel (Double Exponential), and Gamma Distribution Data Model and find the best alternative for each set.
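
Before walking through the fitted quantile plots below, a quick numerical screen (a sketch, an addition to the lab) compares the three candidate families for set3 by maximized log-likelihood; higher is better, though gamma has one extra parameter, so a formal comparison would penalize it (e.g., via AIC).

In [ ]:
# Sketch: compare the candidate families for set3 by maximized log-likelihood
from scipy.stats import norm, gumbel_r, gamma
for name, dist in (('normal', norm), ('gumbel', gumbel_r), ('gamma', gamma)):
    params = dist.fit(set3)
    loglik = dist.logpdf(set3, *params).sum()
    print('%s : log-likelihood = %.1f' % (name, loglik))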

In [90]:
# Normal distribution CDF (used to draw the fitted model on the quantile plot)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist
In [92]:
#For Set1
mu = set1.mean() # Fitted Model
sigma = set1.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set1) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [93]:
#For Set2
mu = set2.mean() # Fitted Model
sigma = set2.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set2) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [95]:
#For Set3
mu = set3.mean() # Fitted Model
sigma = set3.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set3) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [60]:
# Gumbel (Extreme Value Type I) CDF

def ev1dist(x,alpha,beta):
    argument = (x - alpha)/beta
    constant = 1.0/beta
    ev1dist = math.exp(-1.0*math.exp(-1.0*argument))
    return ev1dist
In [97]:
#For Set1
sample = set1
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [98]:
#For Set2
sample = set2
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [99]:
#For Set3
sample = set3
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [100]:
# Gamma (Pearson Type III) CDF
def gammacdf(x,tau,alpha,beta): # Gamma cumulative distribution function - shifts/scales the three-parameter form onto scipy's one-parameter gamma
    xhat = x-tau
    lamda = 1.0/beta
    gammacdf = scipy.stats.gamma.cdf(lamda*xhat, alpha)
    return gammacdf
In [101]:
#For Set1
set1_mean  = np.array(set1).mean()
set1_stdev = np.array(set1).std()
set1_skew  = scipy.stats.skew(set1)
set1_alpha = 4.0/(set1_skew**2)
set1_beta  = np.sign(set1_skew)*math.sqrt(set1_stdev**2/set1_alpha)
set1_tau   = set1_mean - set1_alpha*set1_beta
#
x = []; ycdf = []
xlow = (0.9*min(set1)); xhigh = (1.1*max(set1)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set1_tau,set1_alpha,set1_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set1_mean) + "\n"
mytitle += "SD = " + str(set1_stdev) + "\n"
mytitle += "Skew = " + str(set1_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [102]:
#For Set2
set2_mean  = np.array(set2).mean()
set2_stdev = np.array(set2).std()
set2_skew  = scipy.stats.skew(set2)
set2_alpha = 4.0/(set2_skew**2)
set2_beta  = np.sign(set2_skew)*math.sqrt(set2_stdev**2/set2_alpha)
set2_tau   = set2_mean - set2_alpha*set2_beta
#
x = []; ycdf = []
xlow = (0.9*min(set2)); xhigh = (1.1*max(set2)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set2_tau,set2_alpha,set2_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set2_mean) + "\n"
mytitle += "SD = " + str(set2_stdev) + "\n"
mytitle += "Skew = " + str(set2_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [103]:
#For Set3
set3_mean  = np.array(set3).mean()
set3_stdev = np.array(set3).std()
set3_skew  = scipy.stats.skew(set3)
set3_alpha = 4.0/(set3_skew**2)
set3_beta  = np.sign(set3_skew)*math.sqrt(set3_stdev**2/set3_alpha)
set3_tau   = set3_mean - set3_alpha*set3_beta
#
x = []; ycdf = []
xlow = (0.9*min(set3)); xhigh = (1.1*max(set3)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set3_tau,set3_alpha,set3_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set3_mean) + "\n"
mytitle += "SD = " + str(set3_stdev) + "\n"
mytitle += "Skew = " + str(set3_skew) + "\n"
plt.title(mytitle) 
plt.show()

Step24- Run appropriate hypothesis tests and decide whether each set of samples has a normal distribution or not.

In [104]:
# The Shapiro-Wilk Normality Test for Set1
from scipy.stats import shapiro
stat, p = shapiro(set1)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=1.000, p=0.757
Probably Gaussian
C:\Users\Farha\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1660: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
In [105]:
# The Shapiro-Wilk Normality Test for Set2
from scipy.stats import shapiro
stat, p = shapiro(set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.955, p=0.000
Probably not Gaussian
In [106]:
# The Shapiro-Wilk Normality Test for Set3
from scipy.stats import shapiro
stat, p = shapiro(set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.929, p=0.000
Probably not Gaussian
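
Shapiro-Wilk itself warned that its p-value may be inaccurate for N > 5000, so for this 18862-row dataset a large-sample alternative such as D'Agostino's K-squared test is a reasonable cross-check (a sketch, an addition to the lab).

In [ ]:
# Sketch: D'Agostino's K^2 normality test, suited to large samples
from scipy.stats import normaltest
for name, s in zip(('SetA', 'SetB', 'SetC'), (set1, set2, set3)):
    stat, p = normaltest(s)
    verdict = 'Probably Gaussian' if p > 0.05 else 'Probably not Gaussian'
    print('%s: stat=%.3f, p=%.3f -> %s' % (name, stat, p, verdict))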

Step25- Run appropriate hypothesis tests and decide whether the three sets are significantly different or not.

In [108]:
# The Student's t-test for Set1 and Set2
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-37.854, p=0.000
Probably different distributions
In [109]:
# The Student's t-test for Set1 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-36.858, p=0.000
Probably different distributions
In [110]:
# The Student's t-test for Set2 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-20.583, p=0.000
Probably different distributions
In [111]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
stat, p = f_oneway(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=969.151, p=0.000
Probably different distributions
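
Since Set2 and Set3 were flagged as probably not Gaussian, a nonparametric alternative such as the Kruskal-Wallis H test provides a useful cross-check on the ANOVA result (a sketch, an addition to the lab).

In [ ]:
# Sketch: Kruskal-Wallis H test, a nonparametric analogue of one-way ANOVA
from scipy.stats import kruskal
stat, p = kruskal(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
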
In [ ]:
 


Exercise: Normality... who cares?

Why should we check data for normality?

Make sure to cite any resources that you may use.

In [ ]:
 

In [ ]: