Laboratory 14

Full name:

R#:

HEX:

Title of the notebook

Date:

Important Terminology:

Plotting Position: An empirical distribution, based on a random sample from a (possibly unknown) probability distribution, obtained by plotting the exceedance (or cumulative) probability of the sample distribution against the sample value.
The exceedance probability for a particular sample value is a function of sample size and the rank of the particular sample. For exceedance probabilities, the sample values are ranked from largest to smallest. The general expression in common use for plotting position is

$$ P = \frac{m - b}{N + 1 - 2b} $$

where m is the ordered rank of a sample value, N is the sample size, and b is a constant between 0 and 1, depending on the plotting method.

*From:
https://glossary.ametsoc.org/wiki/
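
The general expression above can be sketched as a single function of the sample size and the constant b. This is only an illustration: the function name `plotting_position` is hypothetical, and the constants shown (b = 0 for Weibull, b = 0.5 for Hazen) are the commonly quoted ones. Ranks here run from smallest to largest, which gives cumulative rather than exceedance probabilities.

```python
import numpy as np

def plotting_position(n, b):
    """Return P = (m - b) / (n + 1 - 2b) for ranks m = 1..n."""
    m = np.arange(1, n + 1)            # ordered ranks 1, 2, ..., n
    return (m - b) / (n + 1 - 2 * b)

# Commonly quoted constants (assumed here): Weibull b = 0, Hazen b = 0.5
weibull = plotting_position(100, 0.0)  # reduces to m / (n + 1)
hazen = plotting_position(100, 0.5)    # reduces to (m - 0.5) / n
print(weibull[0], weibull[-1])
print(hazen[0], hazen[-1])
```

Setting b changes only how close the first and last positions come to 0 and 1; the positions in between stay evenly spaced.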

Let's work through an example. First, import the necessary packages:

In [1]:
import numpy as np
import pandas as pd
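import matplotlib.pyplot # binds the name `matplotlib`, which the plotting cells below use directly
import matplotlib.pyplot as plt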

Read the "lab14_E1data.csv" file as a dataset:

In [2]:
data = pd.read_csv("lab14_E1data.csv") 
data
Out[2]:
Set1 Set2
0 46.688625 512.459480
1 44.825192 480.551364
2 71.453564 560.502112
3 30.360172 503.885912
4 47.657087 458.124749
... ... ...
95 60.040915 462.122309
96 21.527991 509.909507
97 59.523999 572.309957
98 38.173070 562.580099
99 39.671168 497.784981

100 rows × 2 columns

The dataset contains two sets of values: "Set1" and "Set2". Use descriptive functions to learn more about the sets.

In [3]:
# Let's check out set1 and set2
set1 = data['Set1']
set2 = data['Set2']
print(set1)
print(set2)
0     46.688625
1     44.825192
2     71.453564
3     30.360172
4     47.657087
        ...    
95    60.040915
96    21.527991
97    59.523999
98    38.173070
99    39.671168
Name: Set1, Length: 100, dtype: float64
0     512.459480
1     480.551364
2     560.502112
3     503.885912
4     458.124749
         ...    
95    462.122309
96    509.909507
97    572.309957
98    562.580099
99    497.784981
Name: Set2, Length: 100, dtype: float64
In [4]:
set1.describe()
Out[4]:
count    100.000000
mean      48.566581
std       15.861475
min       13.660911
25%       38.229562
50%       49.369139
75%       59.580899
max       86.356515
Name: Set1, dtype: float64
In [5]:
set2.describe()
Out[5]:
count    100.000000
mean     508.276381
std       47.978391
min      408.244489
25%      470.288351
50%      507.096010
75%      541.199481
max      629.497949
Name: Set2, dtype: float64

Recall the Weibull plotting position formula from the last session. Use it to plot the Set1 and Set2 quantiles on the same graph.
Do they look different? How?

In [6]:
def weibull_pp(sample): # Weibull plotting position function
    # returns a list of plotting positions; sample must be a numeric list or array
    weibull_pp = [] # empty list to fill and return
    sample.sort() # sort the sample in place (the caller's data is reordered)
    for i in range(0,len(sample),1):
        weibull_pp.append((i+1)/(len(sample)+1)) # values from the Weibull formula
    return weibull_pp
In [7]:
#Convert to numpy arrays
set1 = np.array(set1)
set2 = np.array(set2)
In [8]:
#Apply the weibull pp function
set1_wei = weibull_pp(set1)
set2_wei = weibull_pp(set2)
In [9]:
myfigure = matplotlib.pyplot.figure(figsize = (4,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(set1_wei, set1 ,color ='blue')
matplotlib.pyplot.scatter(set2_wei, set2 ,color ='orange')
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Value") 
matplotlib.pyplot.title("Quantile Plot for Set1 and Set2 based on Weibull Plotting Function") 
matplotlib.pyplot.show()
Out[9]:

Do they look different? How?
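
Note that `weibull_pp` sorts its argument in place, so the scatter plot works only because `set1` and `set2` have already been reordered by the call. A minimal sketch of a variant that leaves the input untouched is given here; the name `weibull_quantiles` and the sample numbers are made up for illustration.

```python
import numpy as np

def weibull_quantiles(sample):
    """Return (plotting positions, sorted copy) without modifying `sample`."""
    ordered = np.sort(np.asarray(sample, dtype=float))  # np.sort returns a sorted copy
    n = ordered.size
    positions = np.arange(1, n + 1) / (n + 1)           # Weibull: m / (n + 1)
    return positions, ordered

values = [46.7, 44.8, 71.5, 30.4, 47.7]  # illustrative numbers, not the lab data
pp, ordered = weibull_quantiles(values)
print(pp)
print(ordered)
print(values)  # the original order is preserved
```

Returning the sorted copy together with the positions keeps the two arrays paired, so they can be passed straight to a scatter plot.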

Define functions for the Gringorten, Cunnane, California, and Hazen plotting position formulas. Overlay them all for Set1 and Set2 on two separate graphs.

In [18]:
def gringorten_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    gringorten_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        gringorten_pp.append((i+1-0.44)/(len(sample)+0.12)) #values from the gringorten formula
    return gringorten_pp
In [19]:
set1_grin = gringorten_pp(set1)
set2_grin = gringorten_pp(set2)
In [20]:
def cunnane_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    cunnane_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        cunnane_pp.append((i+1-0.40)/(len(sample)+0.2)) #values from the cunnane formula
    return cunnane_pp
In [21]:
set1_cun = cunnane_pp(set1)
set2_cun = cunnane_pp(set2)
In [22]:
def california_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    california_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        california_pp.append((i+1)/(len(sample))) #values from the California formula
    return california_pp
In [23]:
set1_cal = california_pp(set1)
set2_cal = california_pp(set2)
In [24]:
def hazen_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    hazen_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        hazen_pp.append((i+1-0.5)/(len(sample))) #values from the Hazen formula
    return hazen_pp
In [25]:
set1_haz = hazen_pp(set1)
set2_haz = hazen_pp(set2)
In [26]:
myfigure = matplotlib.pyplot.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(set1_wei, set1 ,color ='blue',
            marker ="^",  
            s = 50)
matplotlib.pyplot.scatter(set1_grin, set1 ,color ='red',
            marker ="o",  
            s = 20)
matplotlib.pyplot.scatter(set1_cun, set1 ,color ='green',
            marker ="s",  
            s = 20)
matplotlib.pyplot.scatter(set1_cal, set1 ,color ='yellow',
            marker ="p",  
            s = 20)
matplotlib.pyplot.scatter(set1_haz, set1 ,color ='black',
            marker ="*",  
            s = 20)
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Value") 
matplotlib.pyplot.title("Quantile Plot for Set1 based on Weibull, Gringorten, Cunnane, California, and Hazen Plotting Functions") 
matplotlib.pyplot.show()
Out[26]:
In [27]:
myfigure = matplotlib.pyplot.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(set2_wei, set2 ,color ='blue',
            marker ="^",  
            s = 50)
matplotlib.pyplot.scatter(set2_grin, set2 ,color ='red',
            marker ="o",  
            s = 20)
matplotlib.pyplot.scatter(set2_cun, set2 ,color ='green',
            marker ="s",  
            s = 20)
matplotlib.pyplot.scatter(set2_cal, set2 ,color ='yellow',
            marker ="p",  
            s = 20)
matplotlib.pyplot.scatter(set2_haz, set2 ,color ='black',
            marker ="*",  
            s = 20)
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Value") 
matplotlib.pyplot.title("Quantile Plot for Set2 based on Weibull, Gringorten, Cunnane, California, and Hazen Plotting Functions") 
matplotlib.pyplot.show()
Out[27]:

Plot a histogram of Set1 with 10 bins.

In [28]:
import matplotlib.pyplot as plt
myfigure = matplotlib.pyplot.figure(figsize = (10,5)) # generate a object from the figure class, set aspect ratio

set1 = data['Set1']
set1.plot.hist(grid=False, bins=10, rwidth=1,
                   color='navy')
plt.title('Histogram of Set1')
plt.xlabel('Value')
plt.ylabel('Counts')
plt.grid(axis='y',color='yellow', alpha=1)
Out[28]:

Plot a histogram of Set2 with 10 bins.

In [29]:
set2 = data['Set2']
set2.plot.hist(grid=False, bins=10, rwidth=1,
                   color='darkorange')
plt.title('Histogram of Set2')
plt.xlabel('Value')
plt.ylabel('Counts')
plt.grid(axis='y',color='yellow', alpha=1)
Out[29]:

Plot a histogram of both Set1 and Set2 and discuss the differences.

In [30]:
fig, ax = plt.subplots()
data.plot.hist(density=False, ax=ax, title='Histogram: Set1 vs. Set2', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')
Out[30]:

The cool 'seaborn' package: Another way for plotting histograms and more!

In [32]:
import seaborn as sns
sns.displot(set1,color='navy', rug=True)
sns.displot(set2,color='darkorange', rug=True)
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7fe596d180a0>

Important Terminology:

Kernel Density Estimation (KDE): a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram.

*From:
https://en.wikipedia.org/wiki/Kernel_density_estimation
https://mathisonian.github.io/kde/ >> A SUPERCOOL Blog!
https://www.youtube.com/watch?v=fJoR3QsfXa0 >> A Nice Intro to distplot in seaborn | Note that displot is pretty much the same thing!
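
To make the definition concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. It is not how seaborn computes its curve internally; the function name `gaussian_kde_sketch`, the fixed bandwidth of 6.0, and the synthetic sample are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - sample[None, :]) / bandwidth      # shape (len(grid), len(sample))
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # standard normal kernel
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)            # seeded synthetic data, not the lab data
sample = rng.normal(50.0, 15.0, 100)
grid = np.linspace(sample.min() - 60.0, sample.max() + 60.0, 400)
density = gaussian_kde_sketch(sample, grid, bandwidth=6.0)

dx = grid[1] - grid[0]
area = density.sum() * dx                 # Riemann sum; should be close to 1
print(area)
```

A smaller bandwidth follows the individual observations more closely, while a larger one gives a smoother curve; the area under the estimate stays close to 1 either way.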

In [33]:
sns.displot(set1,color='navy',kind='kde',rug=True)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x7fe5b0249730>
Out[33]:
In [25]:
sns.displot(set1,color='navy',kde=True)
sns.displot(set2,color='orange',kde=True)
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x7f839eb7ce50>

Important Terminology:

Empirical Cumulative Distribution Function (ECDF): the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

*From:
https://en.wikipedia.org/wiki/Empirical_distribution_function
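
The "fraction of observations less than or equal to the specified value" can be computed directly. The sketch below is one possible way to do it with NumPy; the function name `ecdf` and the five sample values are illustrative, not part of the lab data.

```python
import numpy as np

def ecdf(sample, x):
    """Fraction of observations in `sample` that are <= x."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts every observation less than or equal to x
    return np.searchsorted(ordered, x, side="right") / ordered.size

values = [3.0, 1.0, 4.0, 1.0, 5.0]  # illustrative numbers, not the lab data
print(ecdf(values, 0.5))  # below the minimum            -> 0.0
print(ecdf(values, 1.0))  # two of five values are <= 1  -> 0.4
print(ecdf(values, 4.5))  # four of five values          -> 0.8
print(ecdf(values, 5.0))  # at the maximum               -> 1.0
```

Because the value 1.0 appears twice, the function jumps by 2/5 there; the 1/n jump per point in the definition adds up when observations are tied.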

In [26]:
sns.displot(set1,color='navy',kind='ecdf')
Out[26]:
<seaborn.axisgrid.FacetGrid at 0x7f839cab3940>
Out[26]:

Fit a Normal distribution data model to both Set1 and Set2. Plot them separately. Describe the fit.

In [35]:
set1 = data['Set1']
set2 = data['Set2']
set1 = np.array(set1)
set2 = np.array(set2)
set1_wei = weibull_pp(set1)
set2_wei = weibull_pp(set2)

# Normal cumulative distribution function (CDF)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist
# For set1
mu = set1.mean() # Fitted Model
sigma = set1.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set1) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(set1_wei, set1 ,color ='navy') 
matplotlib.pyplot.plot(ycdf, x, color ='gold',linewidth=3) 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("Set1 Value") 
mytitle = "Normal Distribution Data Model sample mean = : " + str(mu)+ " sample variance =:" + str(sigma**2)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[35]:
In [36]:
# For set2
mu = set2.mean() # Fitted Model
sigma = set2.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(set2) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(set2_wei, set2 ,color ='orange') 
matplotlib.pyplot.plot(ycdf, x, color ='purple',linewidth=3) 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("Set2 Value") 
mytitle = "Normal Distribution Data Model sample mean = : " + str(mu)+ " sample variance =:" + str(sigma**2)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[36]:
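
The `normdist` function used above maps a value to a cumulative probability. The opposite direction, from a probability to a value, is the quantile function. A minimal sketch using the standard library's `statistics.NormalDist` follows; the rounded `mu` and `sigma` are taken from the `describe()` output for Set1 and are only illustrative.

```python
from statistics import NormalDist

mu, sigma = 48.57, 15.86      # rounded Set1 summary values from describe()
model = NormalDist(mu, sigma)

n = 100
for m in (1, 50, 100):        # a few ranks, smallest to largest
    p = m / (n + 1)           # Weibull plotting position for rank m
    q = model.inv_cdf(p)      # value the fitted model places at probability p
    print(m, round(p, 4), round(q, 2))

# cdf and inv_cdf undo each other
round_trip = model.cdf(model.inv_cdf(0.25))
print(round_trip)
```

Comparing these model quantiles with the sorted sample values at the same ranks is the numerical counterpart of reading the curve against the points in the plots.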

Since the fit was appropriate, we can use the normal distribution to randomly generate another sample from the same population. Plot histograms of the newly generated sets and compare them visually with the originals.

In [37]:
mu1 = set1.mean()
sd1 = set1.std()
mu2 = set2.mean()
sd2 = set2.std()
set1_s = np.random.normal(mu1, sd1, 100)
set2_s = np.random.normal(mu2, sd2, 100)
In [38]:
data_d = pd.DataFrame({'Set1s':set1_s,'Set2s':set2_s})

fig, ax = plt.subplots()
data_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 samples vs. Set2 samples', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')
Out[38]:
In [39]:
fig, ax = plt.subplots()
data_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 and Set1 samples vs. Set2 and Set2 samples', bins=40)
data.plot.hist(density=False, ax=ax, bins=40)

ax.set_ylabel('Count')
ax.grid(axis='y')
Out[39]:
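
The generated samples above differ on every run because no seed is set. If repeatable figures are wanted, one option is NumPy's seeded `Generator`; the seed value 14 and the rounded Set1 parameters below are arbitrary choices for illustration.

```python
import numpy as np

mu, sd = 48.57, 15.86                  # rounded Set1 summary values from describe()

rng = np.random.default_rng(seed=14)   # the seed value is an arbitrary choice
draw_a = rng.normal(mu, sd, 100)

rng = np.random.default_rng(seed=14)   # same seed, so the same 100 values
draw_b = rng.normal(mu, sd, 100)
print(np.array_equal(draw_a, draw_b))

# NumPy's std() defaults to ddof=0, while pandas describe() reports ddof=1
print(draw_a.std(), draw_a.std(ddof=1))
```

The last line is a reminder that `set1.std()` on a NumPy array is slightly smaller than the `std` shown by `describe()`, because the two use different divisors (N versus N - 1).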

Use boxplots to compare the four sets. Discuss their differences.

In [41]:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot([set1, set1_s, set2, set2_s], notch=True, sym='') # notched boxes, outlier markers hidden
plt.show()
Out[41]:
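
The numbers a boxplot draws can also be computed directly, which helps when discussing the differences. The sketch below uses `np.percentile` and the 1.5 × IQR rule, which is matplotlib's default for whisker limits; the helper name `box_summary` and the nine sample values are made up for illustration.

```python
import numpy as np

def box_summary(sample):
    """Quartiles and 1.5*IQR fences for one sample."""
    q1, med, q3 = np.percentile(sample, [25, 50, 75])
    iqr = q3 - q1
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,  # whiskers stop at the last point inside the fences
        "upper_fence": q3 + 1.5 * iqr,
    }

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # illustrative numbers
summary = box_summary(values)
print(summary)
```

Applying the same helper to each of the four sets gives a small table of medians and IQRs to quote alongside the figure.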

The first pair and the second pair each look similar, while the two pairs look different from each other, right? The question is: how can we KNOW whether two sets are truly (significantly) different?

Exercise 1:

  • Step 1: Read the "lab14_E2data.csv" file as a dataset.
  • Step 2: Describe the dataset numerically (using descriptive functions) and in your own words.
  • Step 3: Plot histograms and compare the sets in the dataset. What do you infer from the histograms?
  • Step 3*: This is a bonus step | Use "seaborn" to plot histograms with KDE and rugs!
  • Step 4: Write appropriate functions for the Beard, Tukey, and Adamowski Plotting Position Formulas.
  • Step 5: Apply your functions for the Beard, Tukey, and Adamowski Plotting Position Formulas to both sets and make quantile plots.
  • Step 6: Use the Tukey Plotting Position Formula and fit a Normal and a LogNormal distribution data model. Plot them and visually assess which one provides a better fit for each set.
  • Step 7: Use the best distribution data model to create two sample sets (one for each set) with 100 values each.
  • Step 8: Use boxplots to illustrate the differences and similarities between the sets. What do you infer from the boxplots?
In [43]:
# Step1:
data2 = pd.read_csv("lab14_E2data.csv") 
data2
Out[43]:
Set A Set B
0 100.512420 1524.953702
1 76.074982 854.300836
2 133.489273 872.743741
3 166.805272 1158.848418
4 88.457689 954.394634
... ... ...
995 112.290638 762.269055
996 105.462474 1196.784740
997 114.075449 734.047220
998 116.526470 858.490748
999 110.916179 935.086228

1000 rows × 2 columns

In [44]:
#Step2:
data2.describe()

#We have two sets of 1000 values. Set A has a mean of 99.69 and a std of 24.35, while Set B has a mean of 1014.56 and a std of 245.10. Set B's values are roughly one order of magnitude larger than Set A's; Set B's minimum (176.19) is close to Set A's maximum (174.58).
Out[44]:
Set A Set B
count 1000.000000 1000.000000
mean 99.693460 1014.559330
std 24.353055 245.100128
min 22.024905 176.193608
25% 82.545272 862.899001
50% 100.100405 1009.909525
75% 116.534988 1181.943645
max 174.576681 1820.409944
In [45]:
# Step3:
fig, ax = plt.subplots()
data2.plot.hist(density=False, ax=ax, title='Histogram: Set A vs. Set B', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

#Set B has a much wider spread (Larger range and IQR) than Set A. 
Out[45]:
In [49]:
#Step3*: Bonus Step

setA = data2['Set A']
setB = data2['Set B']

import seaborn as sns
sns.displot(setA,color='green',kde=True,rug=True)
sns.displot(setB,color='crimson',kde=True,rug=True)
Out[49]:
<seaborn.axisgrid.FacetGrid at 0x7fe592372430>
In [50]:
#Step4: Functions for the Beard, Tukey, and Adamowski Plotting Position Formulas

def beard_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    beard_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        beard_pp.append((i+1-0.31)/(len(sample)+0.38)) #values from the Beard formula
    return beard_pp

def tukey_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    tukey_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        tukey_pp.append((i+1-1/3)/(len(sample)+1/3)) #values from the Tukey formula
    return tukey_pp

def adamowski_pp(sample): # plotting position function
# returns a list of plotting positions; sample must be a numeric list
    adamowski_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        adamowski_pp.append((i+1-0.25)/(len(sample)+0.5)) #values from the Adamowski formula
    return adamowski_pp
In [53]:
#Step 5:
setA = data2['Set A']
setB = data2['Set B']
setA = np.array(setA)
setB = np.array(setB)

setA_brd = beard_pp(setA)
setB_brd = beard_pp(setB)

setA_tky = tukey_pp(setA)
setB_tky = tukey_pp(setB)

setA_adm = adamowski_pp(setA)
setB_adm = adamowski_pp(setB)

myfigure = matplotlib.pyplot.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setA_brd, setA ,color ='blue',
            marker ="^",  
            s = 50)
matplotlib.pyplot.scatter(setA_tky, setA ,color ='red',
            marker ="o",  
            s = 20)
matplotlib.pyplot.scatter(setA_adm, setA ,color ='green',
            marker ="s",  
            s = 20)

matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Value") 
matplotlib.pyplot.title("Quantile Plot for SetA based on Beard, Tukey, and Adamowski Plotting Functions") 
matplotlib.pyplot.show()
Out[53]:
In [54]:
myfigure = matplotlib.pyplot.figure(figsize = (12,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setB_brd, setB ,color ='blue',
            marker ="^",  
            s = 50)
matplotlib.pyplot.scatter(setB_tky, setB ,color ='red',
            marker ="o",  
            s = 20)
matplotlib.pyplot.scatter(setB_adm, setB ,color ='green',
            marker ="s",  
            s = 20)

matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Value") 
matplotlib.pyplot.title("Quantile Plot for SetB based on Beard, Tukey, and Adamowski Plotting Functions") 
matplotlib.pyplot.show()
Out[54]:
In [63]:
#Step6:
setA = data2['Set A']
setB = data2['Set B']
setA = np.array(setA)
setB = np.array(setB)
setA_tky = tukey_pp(setA)
setB_tky = tukey_pp(setB)

# Normal cumulative distribution function (CDF)
import math

def normdist(x,mu,sigma):
    argument = (x - mu)/(math.sqrt(2.0)*sigma)    
    normdist = (1.0 + math.erf(argument))/2.0
    return normdist

# Helpers for the Log-Normal model: transform values to and from log space
def loggit(x):  # A prototype function to log transform x
    return(math.log(x))

def antiloggit(x):  # A prototype function to reverse log transform x
    return(math.exp(x))
In [58]:
# For setA
mu = setA.mean() # Fitted Model
sigma = setA.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(setA) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setA_tky, setA ,color ='navy') 
matplotlib.pyplot.plot(ycdf, x, color ='gold',linewidth=3) 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("SetA Value") 
mytitle = "Normal Distribution Data Model sample mean = : " + str(mu)+ " sample variance =:" + str(sigma**2)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[58]:
In [72]:
setA = data2['Set A'].apply(loggit).tolist() # log-transform Set A and put the values into a list
setA_mean = np.array(setA).mean()
setA_variance = np.array(setA).std()**2
setA.sort() # sort the sample in place!
setA_tky = tukey_pp(setA)

################
mu = setA_mean # Fitted Model in Log Space
sigma = math.sqrt(setA_variance)
x = []; ycdf = []
xlow = 1; xhigh = 1.05*max(setA) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)

# Fitting Data to Log-Normal Data Model
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setA_tky, setA ,color ='blue') 
matplotlib.pyplot.plot(ycdf, x, color ='red') 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("log of setA") 
mytitle = "Log Normal Data Model log sample mean = : " + str(setA_mean)+ " log sample variance  =:" + str(setA_variance)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[72]:
In [73]:
setA2 = data2['Set A'].tolist() # pull original list
setA2.sort() # sort in place

################
mu = setA_mean # Fitted Model in Log Space
sigma = math.sqrt(setA_variance)
x = []; ycdf = []
xlow = 1; xhigh = 1.05*max(setA) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(antiloggit(xlow + i*xstep))
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setA_tky, setA2 ,color ='navy') 
matplotlib.pyplot.plot(ycdf, x, color ='red') 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("Value of Set A") 
mytitle = "Log Normal Data Model sample log mean = : " + str((setA_mean))+ " sample log variance  =:" + str((setA_variance))
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[73]:
In [ ]:
# Normal Distribution Data Model is a better fit for Set A
In [74]:
# For setB
mu = setB.mean() # Fitted Model
sigma = setB.std()
x = []; ycdf = []
xlow = 0; xhigh = 1.2*max(setB) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)
# Fitting Data to Normal Data Model 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setB_tky, setB ,color ='navy') 
matplotlib.pyplot.plot(ycdf, x, color ='gold',linewidth=3) 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("SetB Value") 
mytitle = "Normal Distribution Data Model sample mean = : " + str(mu)+ " sample variance =:" + str(sigma**2)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[74]:
In [75]:
setB = data2['Set B'].apply(loggit).tolist() # log-transform Set B and put the values into a list
setB_mean = np.array(setB).mean()
setB_variance = np.array(setB).std()**2
setB.sort() # sort the sample in place!
setB_tky = tukey_pp(setB)

################
mu = setB_mean # Fitted Model in Log Space
sigma = math.sqrt(setB_variance)
x = []; ycdf = []
xlow = 1; xhigh = 1.05*max(setB) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(xlow + i*xstep)
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue)

# Fitting Data to Log-Normal Data Model
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setB_tky, setB ,color ='blue') 
matplotlib.pyplot.plot(ycdf, x, color ='red') 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("log of setB") 
mytitle = "Log Normal Data Model log sample mean = : " + str(setB_mean)+ " log sample variance  =:" + str(setB_variance)
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[75]:
In [76]:
setB2 = data2['Set B'].tolist() # pull original list
setB2.sort() # sort in place

################
mu = setB_mean # Fitted Model in Log Space
sigma = math.sqrt(setB_variance)
x = []; ycdf = []
xlow = 1; xhigh = 1.05*max(setB) ; howMany = 100
xstep = (xhigh - xlow)/howMany
for i in range(0,howMany+1,1):
    x.append(antiloggit(xlow + i*xstep))
    yvalue = normdist(xlow + i*xstep,mu,sigma)
    ycdf.append(yvalue) 
# Now plot the sample values and plotting position
myfigure = matplotlib.pyplot.figure(figsize = (7,9)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(setB_tky, setB2 ,color ='navy') 
matplotlib.pyplot.plot(ycdf, x, color ='red') 
matplotlib.pyplot.xlabel("Quantile Value") 
matplotlib.pyplot.ylabel("Value of Set B") 
mytitle = "Log Normal Data Model sample log mean = : " + str((setB_mean))+ " sample log variance  =:" + str((setB_variance))
matplotlib.pyplot.title(mytitle) 
matplotlib.pyplot.show()
Out[76]:
In [77]:
# Normal Distribution Data Model is a better fit for Set B
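
The visual assessment in Step 6 can be backed up with a simple number. The sketch below is one possible approach, not part of the original exercise: it computes the root-mean-square difference between the Tukey plotting positions and the fitted model CDF, in ordinary space for the Normal model and in log space for the Log-Normal model. The helper names and the synthetic sample (a stand-in for the lab data) are assumptions.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (math.sqrt(2.0) * sigma)))

def fit_rmse(sample, log_space=False):
    """RMSE between Tukey plotting positions and the fitted model CDF."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    if log_space:
        ordered = np.log(ordered)                         # log-normal: fit a normal to log(x)
    n = ordered.size
    positions = (np.arange(1, n + 1) - 1/3) / (n + 1/3)   # Tukey formula
    mu, sigma = ordered.mean(), ordered.std()
    model = np.array([normal_cdf(v, mu, sigma) for v in ordered])
    return float(np.sqrt(np.mean((model - positions) ** 2)))

rng = np.random.default_rng(1)            # synthetic stand-in, not the lab data
sample = rng.normal(100.0, 24.0, 1000)
sample = sample[sample > 0]               # logs need positive values

rmse_normal = fit_rmse(sample)
rmse_lognormal = fit_rmse(sample, log_space=True)
print("normal    :", rmse_normal)
print("log-normal:", rmse_lognormal)
```

A smaller value means the model curve passes closer to the plotted points; with the actual data, `sample` would be replaced by the Set A or Set B values.
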
In [79]:
#Step7:
setA = data2['Set A']
setB = data2['Set B']
setA = np.array(setA)
setB = np.array(setB)
mu1 = setA.mean()
sd1 = setA.std()
mu2 = setB.mean()
sd2 = setB.std()
setA_s = np.random.normal(mu1, sd1, 100)
setB_s = np.random.normal(mu2, sd2, 100)
In [81]:
#Step8:
fig = plt.figure(figsize =(10, 7)) 
plt.boxplot ([setA, setA_s, setB, setB_s],1, '')
plt.show()
Out[81]: