Laboratory 15: "It's a Wrap"

In [100]:
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
DESKTOP-EH6HD63
desktop-eh6hd63\farha
C:\Users\Farha\Anaconda3\python.exe
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)

Full name:

R#:

HEX:

Title of the notebook

Date:

Step0- Import the necessary libraries

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import scipy.stats
import seaborn as sns

Step1- A case of mercury contamination of groundwater has been reported. Our field operation team has just returned from the first round of sampling. During this initial phase, three sets of 20 samples, one set from each of three wells, were collected and brought to the laboratory as a data file. Concentrations are reported in nanograms of mercury per liter of groundwater (ng/L). Read the "lab15_minidf.csv" file as a dataframe.

In [15]:
data = pd.read_csv("lab15_minidf.csv") 
data
Out[15]:
Set1 Set2 Set3
0 43.338812 52.433597 101.360589
1 14.292122 57.099923 78.725571
2 40.534000 54.109884 100.029478
3 30.028993 49.095951 56.684550
4 65.068913 67.372179 82.210959
5 16.270190 59.161718 46.741041
6 62.635690 58.610019 54.710679
7 59.538645 69.943972 52.788361
8 46.319841 52.490195 82.780907
9 61.200924 68.361822 42.596775
10 70.046684 67.839000 116.854545
11 29.822311 69.825583 70.091906
12 37.054487 56.997331 43.959810
13 45.519222 44.697895 90.688038
14 19.508576 42.026923 50.000178
15 48.268636 54.733521 26.898010
16 31.393480 56.917967 69.055142
17 39.074647 42.643469 30.516556
18 63.265439 64.290856 87.759761
19 53.990135 46.966161 63.085782

Step2- Let's explore the dataset.

In [16]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
Set1    20 non-null float64
Set2    20 non-null float64
Set3    20 non-null float64
dtypes: float64(3)
memory usage: 608.0 bytes

Step3- Use descriptive statistics and get an estimate of the center of the distribution for each set

In [18]:
#For set1:
set1 = data['Set1']

print('For set 1',' the arithmetic mean is: ',set1.mean())
print('For set 1',' the median is: ',set1.median())
For set 1  the arithmetic mean is:  43.858587392747665
For set 1  the median is:  44.42901682176475
In [20]:
#For set2:
set2 = data['Set2']

print('For set 2',' the arithmetic mean is: ',set2.mean())
print('For set 2',' the median is: ',set2.median())
For set 2  the arithmetic mean is:  56.78089826605933
For set 2  the median is:  56.95764933866367
In [21]:
#For set3:
set3 = data['Set3']

print('For set 3',' the arithmetic mean is: ',set3.mean())
print('For set 3',' the median is: ',set3.median())
For set 3  the arithmetic mean is:  67.37693190592594
For set 3  the median is:  66.07046180680064

Step4- Use descriptive statistics and quantify the spread of data points for each set

In [23]:
#For set1:
print('For set 1',' the range is: ',np.ptp(set1))
print('For set 1',' the IQR is: ',scipy.stats.iqr(set1))
print('For set 1',' the 5-number summary is: ',set1.describe())
print('For set 1',' the variance is: ',statistics.variance(set1))
print('For set 1',' the standard deviation is: ',statistics.stdev(set1))
For set 1  the range is:  55.75456205934812
For set 1  the IQR is:  28.90185639631407
For set 1  the 5-number summary is:  count    20.000000
mean     43.858587
std      16.850441
min      14.292122
25%      31.052359
50%      44.429017
75%      59.954215
max      70.046684
Name: Set1, dtype: float64
For set 1  the variance is:  283.937370064773
For set 1  the standard deviation is:  16.85044124243555
In [24]:
#For set2:
print('For set 2',' the range is: ',np.ptp(set2))
print('For set 2',' the IQR is: ',scipy.stats.iqr(set2))
print('For set 2',' the 5-number summary is: ',set2.describe())
print('For set 2',' the variance is: ',statistics.variance(set2))
print('For set 2',' the standard deviation is: ',statistics.stdev(set2))
For set 2  the range is:  27.91704859996223
For set 2  the IQR is:  13.462001321421276
For set 2  the 5-number summary is:  count    20.000000
mean     56.780898
std       9.017977
min      42.026923
25%      51.599185
50%      56.957649
75%      65.061187
max      69.943972
Name: Set2, dtype: float64
For set 2  the variance is:  81.32390287103372
For set 2  the standard deviation is:  9.0179766506148
In [25]:
#For set3:
print('For set 3',' the range is: ',np.ptp(set3))
print('For set 3',' the IQR is: ',scipy.stats.iqr(set3))
print('For set 3',' the 5-number summary is: ',set3.describe())
print('For set 3',' the variance is: ',statistics.variance(set3))
print('For set 3',' the standard deviation is: ',statistics.stdev(set3))
For set 3  the range is:  89.95653530194858
For set 3  the IQR is:  34.84022657880441
For set 3  the 5-number summary is:  count     20.000000
mean      67.376932
std       24.727707
min       26.898010
25%       49.185394
50%       66.070462
75%       84.025620
max      116.854545
Name: Set3, dtype: float64
For set 3  the variance is:  611.4594963965791
For set 3  the standard deviation is:  24.7277070590174

Step5- Use descriptive statistics and compare the skewness of all sets

In [26]:
skew1 = set1.skew()
skew2 = set2.skew()
skew3 = set3.skew()
print('For set 1 the skewness is ',skew1,'For set 2 the skewness is ',skew2,'For set 3 the skewness is ',skew3)
For set 1 the skewness is  -0.2131967717425499 For set 2 the skewness is  -0.02809145150056633 For set 3 the skewness is  0.23123717307961747
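
A side note (an addition, not part of the original lab): pandas' .skew() reports the bias-corrected (adjusted Fisher-Pearson) sample skewness, while scipy.stats.skew, used later for the gamma fit, defaults to the biased estimator, so the two can differ slightly for a sample of 20. A quick check:

In [ ]:
# Sketch: compare the two skewness estimators on Set1
print('pandas (bias-corrected)  :', set1.skew())
print('scipy  (biased, default) :', scipy.stats.skew(set1))
print('scipy  (bias=False)      :', scipy.stats.skew(set1, bias=False))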

Step6- Use boxplots and visually compare the spread of data points in all sets

In [27]:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot([set1, set2, set3], notch=True, sym='')  # notched boxes; outlier markers suppressed
plt.show()

Step7- Use histograms and visually compare the distribution of data points in all sets

In [31]:
set1.plot.hist(density=False, bins=6,color="red")
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dc29a08>
In [32]:
set2.plot.hist(density=False, bins=6,color="blue")
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dc99688>
In [33]:
set3.plot.hist(density=False, bins=6,color="gold")
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376b9a62c8>
In [34]:
fig, ax = plt.subplots()
data.plot.hist(density=False, ax=ax, bins=6,color=("red","blue","gold"))
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dcc8f48>

Step8- Use histograms with KDE and visually compare the continuous shape of distributions in all sets

In [36]:
sns.distplot(set1,color='red', rug=True,kde=True)
sns.distplot(set2,color='blue', rug=True,kde=True)
sns.distplot(set3,color='gold', rug=True,kde=True)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376dd99488>

Step9- Use Gringorten Plotting Position Formula and draw a quantile plot for each set
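
For the i-th smallest value in a sample of size n, the Gringorten plotting position is p_i = (i - 0.44)/(n + 0.12); the function below implements exactly this.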

In [37]:
# First, define the function for the Gringorten Plotting Position Formula: 
def gringorten_pp(sample): # plotting position function
    """Return a list of Gringorten plotting positions; sample must be a numeric list or array (it is sorted in place)."""
    gringorten_pp = [] # empty list to fill and return
    sample.sort() # sort the sample in place
    for i in range(0,len(sample),1):
        gringorten_pp.append((i+1-0.44)/(len(sample)+0.12)) # values from the Gringorten formula
    return gringorten_pp
In [38]:
# Second, apply it on each set
set1 = np.array(set1)
set2 = np.array(set2)
set3 = np.array(set3)

set1_grin = gringorten_pp(set1)
set2_grin = gringorten_pp(set2)
set3_grin = gringorten_pp(set3)
In [42]:
# Third, plot them
myfigure = plt.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio

plt.scatter(set1_grin, set1 ,color ='red',
            marker ="^",  
            s = 50)
plt.scatter(set2_grin, set2 ,color ='blue',
            marker ="o",  
            s = 20)
plt.scatter(set3_grin, set3 ,color ='gold',
            marker ="s",  
            s = 20)

plt.xlabel("Density or Quantile Value") 
plt.ylabel("Value") 
plt.title("Quantile Plot for Set1, Set2, and Set3 based on the Gringorten Plotting Position Formula") 
plt.show()

Step10- Fit a Normal, Gumbel (Double Exponential), and Gamma Distribution Data Model and find the best alternative for each set.
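
As a cross-check on the hand-built quantile functions below, here is a minimal sketch (an addition, not part of the original lab) that fits the same three families with scipy.stats' built-in maximum-likelihood fitters. Because these are MLE fits, the parameters will differ somewhat from the moment-based estimates used in the lab code.

In [ ]:
# Sketch: maximum-likelihood fits of the three candidate families to Set1
from scipy.stats import norm, gumbel_r, gamma

print('Normal (loc, scale)   :', norm.fit(set1))
print('Gumbel (loc, scale)   :', gumbel_r.fit(set1))
print('Gamma  (a, loc, scale):', gamma.fit(set1))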

In [43]:
# Normal distribution CDF (used to draw the fitted model on the quantile plot)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist
In [46]:
#For Set1
mu = set1.mean() # Fitted Model
sigma = set1.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set1) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [48]:
#For Set2
mu = set2.mean() # Fitted Model
sigma = set2.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set2) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [51]:
#For Set3
mu = set3.mean() # Fitted Model
sigma = set3.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set3) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [52]:
# Gumbel (Extreme Value Type I) CDF

def ev1dist(x,alpha,beta):
    argument = (x - alpha)/beta
    constant = 1.0/beta
    ev1dist = math.exp(-1.0*math.exp(-1.0*argument))
    return ev1dist
In [54]:
#For Set1
sample = set1
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [56]:
#For Set2
sample = set2
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [57]:
#For Set3
sample = set3
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [58]:
# Gamma (Pearson Type III) CDF
def gammacdf(x,tau,alpha,beta): # Gamma cumulative distribution function - shifts/scales the three-parameter form onto scipy's one-parameter gamma
    xhat = x-tau
    lamda = 1.0/beta
    gammacdf = scipy.stats.gamma.cdf(lamda*xhat, alpha)
    return gammacdf
In [59]:
#For Set1
set1_mean  = np.array(set1).mean()
set1_stdev = np.array(set1).std()
set1_skew  = scipy.stats.skew(set1)
set1_alpha = 4.0/(set1_skew**2)
set1_beta  = np.sign(set1_skew)*math.sqrt(set1_stdev**2/set1_alpha)
set1_tau   = set1_mean - set1_alpha*set1_beta
#
x = []; ycdf = []
xlow = (0.9*min(set1)); xhigh = (1.1*max(set1)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set1_tau,set1_alpha,set1_beta)
    ycdf.append(yvalue) 
####
rycdf = ycdf[::-1] # the fitted beta is negative for Set1, so the computed CDF runs backwards; reverse it for plotting

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(rycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set1_mean) + "\n"
mytitle += "SD = " + str(set1_stdev) + "\n"
mytitle += "Skew = " + str(set1_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [60]:
#For Set2
set2_mean  = np.array(set2).mean()
set2_stdev = np.array(set2).std()
set2_skew  = scipy.stats.skew(set2)
set2_alpha = 4.0/(set2_skew**2)
set2_beta  = np.sign(set2_skew)*math.sqrt(set2_stdev**2/set2_alpha)
set2_tau   = set2_mean - set2_alpha*set2_beta
#
x = []; ycdf = []
xlow = (0.9*min(set2)); xhigh = (1.1*max(set2)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set2_tau,set2_alpha,set2_beta)
    ycdf.append(yvalue) 
####
rycdf = ycdf[::-1] # the fitted beta is negative for Set2, so the computed CDF runs backwards; reverse it for plotting

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(rycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set2_mean) + "\n"
mytitle += "SD = " + str(set2_stdev) + "\n"
mytitle += "Skew = " + str(set2_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [62]:
#For Set3
set3_mean  = np.array(set3).mean()
set3_stdev = np.array(set3).std()
set3_skew  = scipy.stats.skew(set3)
set3_alpha = 4.0/(set3_skew**2)
set3_beta  = np.sign(set3_skew)*math.sqrt(set3_stdev**2/set3_alpha)
set3_tau   = set3_mean - set3_alpha*set3_beta
#
x = []; ycdf = []
xlow = (0.9*min(set3)); xhigh = (1.1*max(set3)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set3_tau,set3_alpha,set3_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1] # not needed for Set3: the fitted skew (and beta) is positive, so the CDF is already increasing

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set3_mean) + "\n"
mytitle += "SD = " + str(set3_stdev) + "\n"
mytitle += "Skew = " + str(set3_skew) + "\n"
plt.title(mytitle) 
plt.show()

Step11- From visual assessment, the Normal Distribution for Set1 and Set2 and the Gamma Distribution for Set3 provide better fits. Run appropriate hypothesis tests and decide whether each set of samples has a normal distribution or not.

In [63]:
# The Shapiro-Wilk Normality Test for Set1
from scipy.stats import shapiro
stat, p = shapiro(set1)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.955, p=0.451
Probably Gaussian
In [64]:
# The Shapiro-Wilk Normality Test for Set2
from scipy.stats import shapiro
stat, p = shapiro(set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.941, p=0.246
Probably Gaussian
In [65]:
# The Shapiro-Wilk Normality Test for Set3
from scipy.stats import shapiro
stat, p = shapiro(set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.975, p=0.858
Probably Gaussian
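
Step11 suggested a gamma model for Set3, and the Shapiro-Wilk test only checks normality. As a complementary check (a sketch, an addition to the lab), a one-sample Kolmogorov-Smirnov test can compare Set3 against a scipy-fitted gamma; note that estimating the parameters from the same sample makes the p-value somewhat optimistic.

In [ ]:
# Sketch: K-S goodness-of-fit test of Set3 against a fitted gamma model
from scipy.stats import gamma, kstest
a, loc, scale = gamma.fit(set3)                      # maximum-likelihood fit
stat, p = kstest(set3, 'gamma', args=(a, loc, scale))
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Gamma model not rejected')
else:
    print('Gamma model rejected')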

Step13- Run appropriate hypothesis tests and decide whether the three sets are significantly different or not.

In [66]:
# The Student's t-test for Set1 and Set2
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-3.024, p=0.004
Probably different distributions
In [67]:
# The Student's t-test for Set1 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-3.515, p=0.001
Probably different distributions
In [68]:
# The Student's t-test for Set2 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-1.800, p=0.080
Probably the same distribution
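
Because all three sets passed the normality check, a one-way ANOVA across the three sets is a natural complement to the pairwise t-tests; this sketch mirrors the f_oneway call used later in the notebook for the larger dataset.

In [ ]:
# Sketch: one-way ANOVA across the three 20-sample sets
from scipy.stats import f_oneway
stat, p = f_oneway(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')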

Step14- Our field operation team installed a monitoring device on each well that samples and records the mercury concentration roughly 28 times per hour. After a month, the monitoring log was brought to the lab. Read the "lab15_maxidf.csv" file as a dataframe.

In [69]:
data = pd.read_csv("lab15_maxidf.csv") 
data
Out[69]:
SetA SetB SetC
0 38.929139 56.978021 44.285459
1 55.018891 64.192515 35.803822
2 67.735682 57.229914 64.765594
3 64.558983 64.927195 60.737594
4 50.817111 68.822775 44.945246
... ... ... ...
18857 27.688223 51.103824 21.939277
18858 34.347352 51.383197 48.454103
18859 41.411880 52.662563 34.445571
18860 64.918172 50.101195 59.970350
18861 19.794307 65.266238 42.728845

18862 rows × 3 columns
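
Quick sanity check on the sampling description: 28 samples per hour x 24 hours x 28 days = 18,816, which is consistent with the 18,862 records in the log (roughly a month of near-continuous monitoring).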

Step15- Let's explore the dataset.

In [70]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 3 columns):
SetA    18862 non-null float64
SetB    18862 non-null float64
SetC    18862 non-null float64
dtypes: float64(3)
memory usage: 442.2 KB

Step16- Use descriptive statistics and get an estimate of the center of the distribution for each set

In [72]:
#For set1:
set1 = data['SetA']

print('For set 1',' the arithmetic mean is: ',set1.mean())
print('For set 1',' the median is: ',set1.median())
For set 1  the arithmetic mean is:  50.13319492879079
For set 1  the median is:  50.07079019699734
In [73]:
#For set2:
set2 = data['SetB']

print('For set 2',' the arithmetic mean is: ',set2.mean())
print('For set 2',' the median is: ',set2.median())
For set 2  the arithmetic mean is:  54.93138284325201
For set 2  the median is:  54.911500610793205
In [74]:
#For set3:
set3 = data['SetC']

print('For set 3',' the arithmetic mean is: ',set3.mean())
print('For set 3',' the median is: ',set3.median())
For set 3  the arithmetic mean is:  60.295688567980314
For set 3  the median is:  53.64454069518722

Step17- Use descriptive statistics and quantify the spread of data points for each set

In [75]:
#For set1:
print('For set 1',' the range is: ',np.ptp(set1))
print('For set 1',' the IQR is: ',scipy.stats.iqr(set1))
print('For set 1',' the 5-number summary is: ',set1.describe())
print('For set 1',' the variance is: ',statistics.variance(set1))
print('For set 1',' the standard deviation is: ',statistics.stdev(set1))
For set 1  the range is:  122.21336316204193
For set 1  the IQR is:  20.44161568459856
For set 1  the 5-number summary is:  count    18862.000000
mean        50.133195
std         15.097537
min         -9.535961
25%         39.829720
50%         50.070790
75%         60.271335
max        112.677402
Name: SetA, dtype: float64
For set 1  the variance is:  227.93562995032482
For set 1  the standard deviation is:  15.09753721473555
C:\Users\Farha\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
In [76]:
#For set2:
print('For set 2',' the range is: ',np.ptp(set2))
print('For set 2',' the IQR is: ',scipy.stats.iqr(set2))
print('For set 2',' the 5-number summary is: ',set2.describe())
print('For set 2',' the variance is: ',statistics.variance(set2))
print('For set 2',' the standard deviation is: ',statistics.stdev(set2))
For set 2  the range is:  29.999847594508175
For set 2  the IQR is:  14.933746312221992
For set 2  the 5-number summary is:  count    18862.000000
mean        54.931383
std          8.666687
min         40.000107
25%         47.480143
50%         54.911501
75%         62.413890
max         69.999954
Name: SetB, dtype: float64
For set 2  the variance is:  75.11145928998276
For set 2  the standard deviation is:  8.666686753885983
In [77]:
#For set3:
print('For set 3',' the range is: ',np.ptp(set3))
print('For set 3',' the IQR is: ',scipy.stats.iqr(set3))
print('For set 3',' the 5-number summary is: ',set3.describe())
print('For set 3',' the variance is: ',statistics.variance(set3))
print('For set 3',' the standard deviation is: ',statistics.stdev(set3))
For set 3  the range is:  351.38258641900387
For set 3  the IQR is:  43.71636821620718
For set 3  the 5-number summary is:  count    18862.000000
mean        60.295689
std         34.727722
min          1.090492
25%         34.968173
50%         53.644541
75%         78.684541
max        352.473079
Name: SetC, dtype: float64
For set 3  the variance is:  1206.014687498265
For set 3  the standard deviation is:  34.727722175493525

Step18- Use descriptive statistics and compare the skewness of all sets

In [78]:
skew1 = set1.skew()
skew2 = set2.skew()
skew3 = set3.skew()
print('For set 1 the skewness is ',skew1,'For set 2 the skewness is ',skew2,'For set 3 the skewness is ',skew3)
For set 1 the skewness is  0.027136451602721625 For set 2 the skewness is  0.009331321763676393 For set 3 the skewness is  1.1709638917286735

Step19- Use boxplots and visually compare the spread of data points in all sets

In [79]:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot([set1, set2, set3], notch=True, sym='')  # notched boxes; outlier markers suppressed
plt.show()

Step20- Use histograms and visually compare the distribution of data points in all sets

In [80]:
set1.plot.hist(density=False, bins=50,color="red")
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e496f88>
In [81]:
set2.plot.hist(density=False, bins=50,color="blue")
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e4968c8>
In [82]:
set3.plot.hist(density=False, bins=50,color="gold")
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e0b4288>
In [83]:
fig, ax = plt.subplots()
data.plot.hist(density=False, ax=ax, bins=50,color=("red","blue","gold"))
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e183d08>

Step21- Use histograms with KDE and visually compare the continuous shape of distributions in all sets

In [85]:
sns.distplot(set1,color='red', rug=True,kde=True)
sns.distplot(set2,color='blue', rug=True,kde=True)
sns.distplot(set3,color='gold', rug=True,kde=True)
Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x2376e1da7c8>

Step22- Use Gringorten Plotting Position Formula and draw a quantile plot for each set

In [86]:
# First, define the function for the Gringorten Plotting Position Formula: 
def gringorten_pp(sample): # plotting position function
    """Return a list of Gringorten plotting positions; sample must be a numeric list or array (it is sorted in place)."""
    gringorten_pp = [] # empty list to fill and return
    sample.sort() # sort the sample in place
    for i in range(0,len(sample),1):
        gringorten_pp.append((i+1-0.44)/(len(sample)+0.12)) # values from the Gringorten formula
    return gringorten_pp
In [87]:
# Second, apply it on each set
set1 = np.array(set1)
set2 = np.array(set2)
set3 = np.array(set3)

set1_grin = gringorten_pp(set1)
set2_grin = gringorten_pp(set2)
set3_grin = gringorten_pp(set3)
In [89]:
# Third, plot them
myfigure = plt.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio

plt.scatter(set1_grin, set1 ,color ='red',
            marker ="^",  
            s = 50)
plt.scatter(set2_grin, set2 ,color ='blue',
            marker ="o",  
            s = 20)
plt.scatter(set3_grin, set3 ,color ='gold',
            marker ="s",  
            s = 20)

plt.xlabel("Density or Quantile Value") 
plt.ylabel("Value") 
plt.title("Quantile Plot for Set1, Set2, and Set3 based on the Gringorten Plotting Position Formula") 
plt.show()

Step23- Fit a Normal, Gumbel (Double Exponential), and Gamma Distribution Data Model and find the best alternative for each set.
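
Before walking through the fitted quantile plots below, a quick numerical screen (a sketch, an addition to the lab) compares the three candidate families for set3 by maximized log-likelihood; higher is better, though gamma has one extra parameter, so a formal comparison would penalize it (e.g., via AIC).

In [ ]:
# Sketch: compare the candidate families for set3 by maximized log-likelihood
from scipy.stats import norm, gumbel_r, gamma
for name, dist in (('normal', norm), ('gumbel', gumbel_r), ('gamma', gamma)):
    params = dist.fit(set3)
    loglik = dist.logpdf(set3, *params).sum()
    print('%s : log-likelihood = %.1f' % (name, loglik))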

In [90]:
# Normal distribution CDF (used to draw the fitted model on the quantile plot)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist
In [92]:
#For Set1
mu = set1.mean() # Fitted Model
sigma = set1.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set1) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [93]:
#For Set2
mu = set2.mean() # Fitted Model
sigma = set2.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set2) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [95]:
#For Set3
mu = set3.mean() # Fitted Model
sigma = set3.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set3) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Normal Distribution Data Model | sample mean = " + str(mu) + " | sample variance = " + str(sigma**2)
plt.title(mytitle) 
plt.show()
In [60]:
# Gumbel (Extreme Value Type I) CDF

def ev1dist(x,alpha,beta):
    argument = (x - alpha)/beta
    constant = 1.0/beta
    ev1dist = math.exp(-1.0*math.exp(-1.0*argument))
    return ev1dist
In [97]:
#For Set1
sample = set1
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [98]:
#For Set2
sample = set2
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [99]:
#For Set3
sample = set3
sample_mean = np.array(sample).mean()
sample_variance = np.array(sample).std()**2
alpha_mom = sample_mean*math.sqrt(6)/math.pi
beta_mom = math.sqrt(sample_variance)*0.45

################
mu = sample_mean # Fitted Model
sigma = math.sqrt(sample_variance)
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(sample) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = ev1dist(xlow + i*xstep,alpha_mom,beta_mom)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Extreme Value Type I Distribution Data Model | sample mean = " + str(sample_mean) + " | sample variance = " + str(sample_variance)
plt.title(mytitle) 
plt.show()
In [100]:
# Gamma (Pearson Type III) CDF
def gammacdf(x,tau,alpha,beta): # Gamma cumulative distribution function - shifts/scales the three-parameter form onto scipy's one-parameter gamma
    xhat = x-tau
    lamda = 1.0/beta
    gammacdf = scipy.stats.gamma.cdf(lamda*xhat, alpha)
    return gammacdf
In [101]:
#For Set1
set1_mean  = np.array(set1).mean()
set1_stdev = np.array(set1).std()
set1_skew  = scipy.stats.skew(set1)
set1_alpha = 4.0/(set1_skew**2)
set1_beta  = np.sign(set1_skew)*math.sqrt(set1_stdev**2/set1_alpha)
set1_tau   = set1_mean - set1_alpha*set1_beta
#
x = []; ycdf = []
xlow = (0.9*min(set1)); xhigh = (1.1*max(set1)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set1_tau,set1_alpha,set1_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set1_grin, set1 ,color ='red') 
plt.plot(ycdf, x, color ='darkred') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set1 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set1_mean) + "\n"
mytitle += "SD = " + str(set1_stdev) + "\n"
mytitle += "Skew = " + str(set1_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [102]:
#For Set2
set2_mean  = np.array(set2).mean()
set2_stdev = np.array(set2).std()
set2_skew  = scipy.stats.skew(set2)
set2_alpha = 4.0/(set2_skew**2)
set2_beta  = np.sign(set2_skew)*math.sqrt(set2_stdev**2/set2_alpha)
set2_tau   = set2_mean - set2_alpha*set2_beta
#
x = []; ycdf = []
xlow = (0.9*min(set2)); xhigh = (1.1*max(set2)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set2_tau,set2_alpha,set2_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set2_grin, set2 ,color ='blue') 
plt.plot(ycdf, x, color ='darkblue') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set2 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set2_mean) + "\n"
mytitle += "SD = " + str(set2_stdev) + "\n"
mytitle += "Skew = " + str(set2_skew) + "\n"
plt.title(mytitle) 
plt.show()
In [103]:
#For Set3
set3_mean  = np.array(set3).mean()
set3_stdev = np.array(set3).std()
set3_skew  = scipy.stats.skew(set3)
set3_alpha = 4.0/(set3_skew**2)
set3_beta  = np.sign(set3_skew)*math.sqrt(set3_stdev**2/set3_alpha)
set3_tau   = set3_mean - set3_alpha*set3_beta
#
x = []; ycdf = []
xlow = (0.9*min(set3)); xhigh = (1.1*max(set3)) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = gammacdf(xlow + i*xstep,set3_tau,set3_alpha,set3_beta)
    ycdf.append(yvalue) 
####
#rycdf = ycdf[::-1]

myfigure = plt.figure(figsize = (7,8)) # generate a object from the figure class, set aspect ratio
plt.scatter(set3_grin, set3 ,color ='gold') 
plt.plot(ycdf, x, color ='orange') 
plt.xlabel("Quantile Value") 
plt.ylabel("Value") 
mytitle = "For Set3 | Pearson (Gamma) Type III Distribution Data Model\n "
mytitle += "Mean = " + str(set3_mean) + "\n"
mytitle += "SD = " + str(set3_stdev) + "\n"
mytitle += "Skew = " + str(set3_skew) + "\n"
plt.title(mytitle) 
plt.show()

Step24- Run appropriate hypothesis tests and decide whether each set of samples has a normal distribution or not.

In [104]:
# The Shapiro-Wilk Normality Test for Set1
from scipy.stats import shapiro
stat, p = shapiro(set1)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=1.000, p=0.757
Probably Gaussian
C:\Users\Farha\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1660: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
In [105]:
# The Shapiro-Wilk Normality Test for Set2
from scipy.stats import shapiro
stat, p = shapiro(set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.955, p=0.000
Probably not Gaussian
In [106]:
# The Shapiro-Wilk Normality Test for Set3
from scipy.stats import shapiro
stat, p = shapiro(set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')
stat=0.929, p=0.000
Probably not Gaussian
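
Shapiro-Wilk itself warned that its p-value may be inaccurate for N > 5000, so for this 18862-row dataset a large-sample alternative such as D'Agostino's K-squared test is a reasonable cross-check (a sketch, an addition to the lab).

In [ ]:
# Sketch: D'Agostino's K^2 normality test, suited to large samples
from scipy.stats import normaltest
for name, s in zip(('SetA', 'SetB', 'SetC'), (set1, set2, set3)):
    stat, p = normaltest(s)
    verdict = 'Probably Gaussian' if p > 0.05 else 'Probably not Gaussian'
    print('%s: stat=%.3f, p=%.3f -> %s' % (name, stat, p, verdict))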

Step25- Run appropriate hypothesis tests and decide whether the three sets are significantly different or not.

In [108]:
# The Student's t-test for Set1 and Set2
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-37.854, p=0.000
Probably different distributions
In [109]:
# The Student's t-test for Set1 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set1, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-36.858, p=0.000
Probably different distributions
In [110]:
# The Student's t-test for Set2 and Set3
from scipy.stats import ttest_ind

stat, p = ttest_ind(set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=-20.583, p=0.000
Probably different distributions
In [111]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
stat, p = f_oneway(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
stat=969.151, p=0.000
Probably different distributions
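
Since Set2 and Set3 were flagged as probably not Gaussian, a nonparametric alternative such as the Kruskal-Wallis H test provides a useful cross-check on the ANOVA result (a sketch, an addition to the lab).

In [ ]:
# Sketch: Kruskal-Wallis H test, a nonparametric analogue of one-way ANOVA
from scipy.stats import kruskal
stat, p = kruskal(set1, set2, set3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
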
In [ ]:
 


Exercise: Normality... who cares?

Why should we check data for normality?

Make sure to cite any resources that you may use.

In [ ]:
 

In [ ]: