Download (right-click, save target as ...) this page as a jupyterlab notebook Lab17
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 17 - In-Lab
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
Do you remember where we left things off? Accept my gratitude if you do! But in case you saw Agent K and Agent J sometime after Thursday, or for any other reason do not recall it, here is where we left things:
We had a dataset with two sets of numbers (Set 1 and Set 2). We did a fair amount of analysis and decided that the Normal Distribution Data Model provides a good fit for both sample sets. We then used the fitted parameters of the Normal Data Model (mean and standard deviation) to generate one new sample set based on each original set. We then looked at the four sets next to each other and asked a rather simple question: Are these sets different or similar?
While we reached some conclusions based on visual assessment, we did not manage to solidify our assertion in any numerical way. Well, now is the time!
#Load the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's get our data from http://54.243.252.9/engr-1330-webroot/8-Labs/Lab17/lab14_E1data.csv
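If the file is not already sitting next to the notebook, one way to fetch it is the same requests pattern used later in this lab (a minimal sketch; it assumes the course file server is reachable):
import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab17/lab14_E1data.csv" # set the url
rget = requests.get(remote_url, allow_redirects=True) # get the remote resource
open('lab14_E1data.csv','wb').write(rget.content); # write the contents to a local file of the same name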
#Previously ...
data = pd.read_csv("lab14_E1data.csv")
set1 = np.array(data['Set1'])
set2 = np.array(data['Set2'])
mu1 = set1.mean()
sd1 = set1.std()
mu2 = set2.mean()
sd2 = set2.std()
set1_s = np.random.normal(mu1, sd1, 100)
set2_s = np.random.normal(mu2, sd2, 100)
data2 = pd.DataFrame({'Set1s':set1_s,'Set2s':set2_s})
#Previously ...
fig, ax = plt.subplots()
data2.plot.hist(density=False, ax=ax, title='Histogram: Set1 and Set1 samples vs. Set2 and Set2 samples', bins=40)
data.plot.hist(density=False, ax=ax, bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')
#Previously ...
fig = plt.figure(figsize=(10, 7))
plt.boxplot([set1, set1_s, set2, set2_s], notch=True, sym='')
plt.show()
We can use statistical hypothesis tests to check whether our sets are consistent with a Normal Distribution Data Model. Here we use the Shapiro-Wilk Normality Test:
# the Shapiro-Wilk Normality Test for set1
from scipy.stats import shapiro
stat, p = shapiro(data['Set1'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')
# the Shapiro-Wilk Normality Test for set2
from scipy.stats import shapiro
stat, p = shapiro(data['Set2'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')
Now let's check whether set1 and set1_s are from the same distribution. We can use the Mann-Whitney U Test for this:
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(data['Set1'],data2['Set1s'])
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
Let's also check whether set2 and set2_s are from the same distribution:
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(data['Set2'],data2['Set2s'])
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
Based on the results, we can say that set1 and set1_s probably belong to the same distribution. The same can be stated about set2 and set2_s. Now let's check whether set1 and set2 are SIGNIFICANTLY different or not:
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(data['Set1'],data['Set2'])
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
The test's results indicate that set1 and set2 belong to distributions with different measures of central tendency (medians). We can check the same for set1_s and set2_s as well:
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(data2['Set1s'],data2['Set2s'])
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
Now we can state, at a 95% confidence level, that set1 and set2 are different. The same holds for set1_s and set2_s.
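Since we repeat the same test-and-interpret pattern several times (including in the exercise below), a small helper can reduce the copy-and-paste. This is just a sketch; compare_sets is a hypothetical name, and the 5% significance level matches the one used above.
from scipy.stats import mannwhitneyu # import a useful non-parametric test

def compare_sets(a, b, alpha=0.05):
    # run a Mann-Whitney U test on two samples and report a plain-language interpretation
    stat, p = mannwhitneyu(a, b)
    print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
    if p > alpha:
        print('Probably the same distribution')
    else:
        print('Probably different distributions')
    return stat, p

compare_sets(data['Set1'], data['Set2']) # same comparison as above, now via the helper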
Repeat the analysis using http://54.243.252.9/engr-1330-webroot/8-Labs/Lab17/lab14_E2data.csv
# your code here
A dataset containing marks obtained by students on basic skills (math and language skills, i.e., reading and writing) was collected from an educational institution, and we have been tasked with drawing some important inferences from it.
Hypothesis: There is no difference in the means of student performance across the basic literacy skills, i.e., reading, writing, and math.
This is based on an example by Joju John Varghese on "Hypothesis Testing for Inference using a Dataset" available @ https://medium.com/swlh/hypothesis-testing-for-inference-using-a-data-set-aaa799e94cdf. The dataset is available @ https://www.kaggle.com/spscientist/students-performance-in-exams.
A local copy is available from http://54.243.252.9/engr-1330-webroot/8-Labs/Lab17/StudentsPerformance.csv
import requests # Module to process http/https requests
remote_url="http://54.243.252.9/engr-1330-webroot/8-Labs/Lab17/StudentsPerformance.csv" # set the url
rget = requests.get(remote_url, allow_redirects=True) # get the remote resource, follow imbedded links
open('StudentsPerformance.csv','wb').write(rget.content); # extract from the remote the contents, assign to a local file same name
df = pd.read_csv("StudentsPerformance.csv")
df.head()
df.describe()
set1 = df['math score']
set2 = df['reading score']
set3 = df['writing score']
import seaborn as sns
# Note: distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is the newer equivalent
sns.distplot(set1, color='navy', rug=True)
sns.distplot(set2, color='darkorange', rug=True)
sns.distplot(set3, color='green', rug=True)
plt.xlim(0, 100)
plt.xlabel('Test Results');
From the plots, it seems that all three samples have similar centers and that there is no striking difference between them. Let's set up the null and alternative hypotheses:
Ho: There is no difference in the performance of students among math, reading, and writing skills.
Ha: There is a difference in the performance of students among math, reading, and writing skills.
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(set1,set2)
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(set1,set3)
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
from scipy.stats import mannwhitneyu # import a useful non-parametric test
stat, p = mannwhitneyu(set2,set3)
print('statistic=%.3f, p-value at rejection =%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
from scipy.stats import kruskal
stat, p = kruskal(set1, set2, set3)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
*hint: The Daily Planet is a fictional broadsheet newspaper appearing in American comic books published by DC Comics, commonly in association with Superman. Read more @ https://en.wikipedia.org/wiki/Daily_Planet
#Day 1
group1= 699
group2= 699
sold1= 175
sold2 = 200
rate1= sold1/group1
rate2 = sold2/group2
print(f"The ratio for the first group is {rate1:0.4f} copies sold per person")
print(f"The ratio for the second group is {rate2:0.4f} copies sold per person")
from scipy.stats import mannwhitneyu
import numpy as np
# encode each reader in a group as 1 (bought a copy) or 0 (did not), so the two groups
# become samples we can compare with a rank-based test
a_dist = np.zeros(group1)
a_dist[:sold1] = 1
b_dist = np.zeros(group2)
b_dist[:sold2] = 1
# one-sided test: alternative="less" asks whether group A's sales tend to be lower than group B's
stat, p_value = mannwhitneyu(a_dist, b_dist, alternative="less")
print(f"Probability from Mann-Whitney U test for B <= A is {1.0-p_value:0.3f}")
#Week 1
group1= 5043
group2= 5043
sold1= 1072
sold2 = 1190
rate1= sold1/group1
rate2 = sold2/group2
print(f"The ratio for the first group is {rate1:0.3f} copies sold per person")
print(f"The ratio for the second group is {rate2:0.3f} copies sold per person")
from scipy.stats import mannwhitneyu
import numpy as np
a_dist = np.zeros(group1)
a_dist[:sold1] = 1
b_dist = np.zeros(group2)
b_dist[:sold2] = 1
stat, p_value = mannwhitneyu(a_dist, b_dist, alternative="less")
print(f"Probability from Mann-Whitney U test for B <= A is {1.0-p_value:0.3f}")
#Month 1
group1= 21000
group2= 21000
sold1= 4300
sold2 = 5700
rate1= sold1/group1
rate2 = sold2/group2
print(f"The ratio for the first group is {rate1:0.3f} copies sold per person")
print(f"The ratio for the second group is {rate2:0.3f} copies sold per person")
from scipy.stats import mannwhitneyu
import numpy as np
a_dist = np.zeros(group1)
a_dist[:sold1] = 1
b_dist = np.zeros(group2)
b_dist[:sold2] = 1
stat, p_value = mannwhitneyu(a_dist, b_dist, alternative="less")
print(f"Probability from Mann-Whitney U test for B <= A is {1.0 - p_value:0.15f}")