Download (right-click, save target as ...) this page as a jupyterlab notebook from: Lab21


Laboratory 21: "Towards Hypothesis Testing"

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 21 - In-Lab

Hypothesis Testing

Hypothesis tests are methods for quantifying whether two groups of data are similar or different. In this lab we get started using mostly exploratory data analysis and histograms; the next few labs explore the concept in increasing detail.

First import some necessary packages:

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next, let's get a dataset to work with; in this case it is simply two collections of numerical values.

In [ ]:
######### CODE TO AUTOMATICALLY DOWNLOAD THE DATABASE ################
#! pip install requests # install the package into the local environment if needed
import requests # module for making HTTP requests
# fetch the remote file (equivalent to "curl -O http://fqdn/path ..." in a shell)
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab21/Lab21_data.csv' # a csv file
response = requests.get(remote_url) # download the file contents into a response object
response.raise_for_status() # stop here if the download failed
with open('Lab21_data.csv', 'wb') as output: # local destination, opened in binary mode
    output.write(response.content) # write the downloaded bytes; the file closes automatically
In [31]:
mydata = pd.read_csv("Lab21_data.csv") 
mydata
Out[31]:
Set1 Set2
0 46.688625 512.459480
1 44.825192 480.551364
2 71.453564 560.502112
3 30.360172 503.885912
4 47.657087 458.124749
... ... ...
95 60.040915 462.122309
96 21.527991 509.909507
97 59.523999 572.309957
98 38.173070 562.580099
99 39.671168 497.784981

100 rows × 2 columns

Question 1

What are the names of the two series in "mydata"?

In [32]:
# put your answer here

Question 2

Describe the two data series. Which has the larger mean value? Which has the larger variance?

In [33]:
# put your answer here
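
One way to approach this kind of comparison is to tabulate the mean, variance, and standard deviation of every column at once. The sketch below uses made-up stand-in data (the frame `demo` and its columns `A` and `B` are hypothetical, not the lab data); the same calls should work on `mydata`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; in the lab, use mydata read from Lab21_data.csv instead.
rng = np.random.default_rng(seed=1)
demo = pd.DataFrame({
    'A': rng.normal(50.0, 15.0, 100),    # smaller mean, smaller spread
    'B': rng.normal(500.0, 40.0, 100),   # larger mean, larger spread
})

names = list(demo.columns)                   # the names of the series
summary = demo.agg(['mean', 'var', 'std'])   # one row per statistic, one column per series

print(names)
print(summary)
```

Note that pandas computes `var` and `std` with the sample (n-1) denominator by default.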

Now let's prepare histograms of the two data series. An easy way to generate two histograms on the same plot is listed below:

In [ ]:
fig, ax = plt.subplots()
mydata.plot.hist(density=False, ax=ax, title='Histogram: Set1 vs. Set2', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

Question 3:

Are the two data series similar or not?

Describe in words and complete sentences (not the method) how the series differ.

In [34]:
# put your answer here

Now let's generate two more series using the descriptive statistics from "Set1" and "Set2":

In [35]:
set1_s = np.random.normal(np.array(mydata['Set1']).mean(), np.array(mydata['Set1']).std(), 100) # random sample from a normal distribution function
set2_s = np.random.normal(np.array(mydata['Set2']).mean(), np.array(mydata['Set2']).std(), 100) # random sample from a normal distribution function
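
Two details of the sampling step are worth a small illustration: the draws change on every run unless the generator is seeded, and NumPy's `std()` divides by n by default while pandas divides by n-1. The sketch below uses a short hypothetical array (`values`) and an arbitrary seed, both assumptions made only for illustration.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

pop_std = values.std()            # ddof=0: divide by n (NumPy default)
sample_std = values.std(ddof=1)   # ddof=1: divide by n-1 (the pandas default)

rng = np.random.default_rng(seed=12345)            # arbitrary seed, so reruns repeat the draws
draws = rng.normal(values.mean(), pop_std, 100)    # 100 normal variates with matching mean and std

print(pop_std, sample_std)
print(draws[:5])
```

With 100 observations the two standard deviation conventions differ by well under one percent, so either choice should give similar-looking samples.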

Put these into a new dataframe

In [36]:
mydata_d = pd.DataFrame({'Set1s':set1_s,'Set2s':set2_s}) # make into a dataframe _d == derived

Now let's prepare histograms of the two derived data series, using the same approach as before:

In [ ]:
fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 samples vs. Set2 samples', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

Question 4:

Are the two new data series similar or not?

Describe in words and complete sentences (not the method) how the series differ.

In [37]:
# put your answer here

Now let's examine all 4 data series. First, a histogram of all 4 on the same graph:

In [ ]:
fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 and Set1 samples vs. Set2 and Set2 samples', bins=40,alpha=0.5)
mydata.plot.hist(density=False, ax=ax, bins=40,alpha=0.5)

ax.set_ylabel('Count')
ax.grid(axis='y')

Question 5:

Are the series "Set1" and "Set1s" the same or different? How do they compare? What about series "Set2" and "Set2s"?

In [38]:
# put your answer here

Another graphical tool is a set of boxplots

In [ ]:
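fig = plt.figure(figsize=(10, 7))
# set1 and set2 were never defined; take the original series directly from the dataframe
plt.boxplot([mydata['Set1'], set1_s, mydata['Set2'], set2_s], notch=True, sym='') # notched boxes, outlier markers hidden
plt.xticks([1, 2, 3, 4], ['Set1', 'Set1s', 'Set2', 'Set2s'])
plt.show()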

Question 6

Interpret the results of the boxplot. Are the Set2 "collections" different from the Set1 "collections"?

In [40]:
# put your answer here

Question 7

Suppose we are comparing the arithmetic means of the 4 collections. Are the mean values of "Set1" and "Set2" far apart?
How many "Set1" standard deviations separate the "Set1" mean from the "Set2" mean? How about the converse (in "Set2" standard deviations)?

In [41]:
# put your answer here
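
The distance between two means can be expressed in units of either collection's standard deviation by dividing the gap by that standard deviation. A minimal sketch, assuming two made-up arrays `a` and `b` (hypothetical stand-ins, not the lab data):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
a = rng.normal(50.0, 15.0, 100)    # hypothetical stand-in for one series
b = rng.normal(500.0, 40.0, 100)   # hypothetical stand-in for the other

gap = abs(b.mean() - a.mean())        # absolute distance between the two means
gap_in_a_std = gap / a.std(ddof=1)    # distance measured in standard deviations of a
gap_in_b_std = gap / b.std(ddof=1)    # distance measured in standard deviations of b

print(f"gap = {gap:.1f}")
print(f"{gap_in_a_std:.1f} standard deviations of a")
print(f"{gap_in_b_std:.1f} standard deviations of b")
```

The two answers differ because the collections have different spreads: the same gap looks larger when measured with the smaller standard deviation.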

Question 8

Visit SixSigma and, after reading the Wiki, decide whether the Set1 and Set2 collections are far enough apart to be considered "statistically" different. Repeat with Set1 and Set1s: are they far apart?

In [ ]:
# put your answer here


Exercise: What is the meaning of "statistically significant difference"?

Make sure to cite any resources that you may use.

In [ ]: