Laboratory 21: "Towards Hypothesis Testing"

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 21 - In-Lab

Hypothesis Testing

Hypothesis tests are methods to quantify whether two groups of data are similar or different. In this lab we get started using mostly exploratory data analysis and histograms; the next few labs explore the concept in increasing detail.

First import some necessary packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next, let's get a database to work with. In this case the database is simply two collections of numerical values.

######### CODE TO AUTOMATICALLY DOWNLOAD THE DATABASE ################
#! pip install requests #install packages into local environment
import requests # import needed modules to interact with the internet
# make the connection to the remote file (this is effectively "curl -O http://fqdn/path ...")
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab21/Lab21_data.csv' # a csv file
response = requests.get(remote_url) # get the file contents and put them into a response object
response.raise_for_status() # stop here if the download failed
with open('Lab21_data.csv', 'wb') as output: # prepare a local destination file
    output.write(response.content) # write the contents to the local file; the file closes automatically

mydata = pd.read_csv("Lab21_data.csv") 
mydata

Question 1

What are the names of the two series in "mydata"?

# put your answer here

Question 2

Describe the two data series. Which has the larger mean value? Which has the larger variance?

# put your answer here
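The descriptive statistics asked for in Question 2 can be read from a few pandas calls. The sketch below is only an illustration: it builds a stand-in dataframe named demo from a seeded random generator (the values 50, 15, 500, and 40 are assumptions, not the lab data) so that it runs without the download. In the lab, the same calls would be applied to mydata.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical values, for illustration only); use `mydata` in the lab
rng = np.random.default_rng(seed=1330)
demo = pd.DataFrame({
    "Set1": rng.normal(loc=50.0, scale=15.0, size=100),
    "Set2": rng.normal(loc=500.0, scale=40.0, size=100),
})

summary = demo.describe()   # count, mean, std, min, quartiles, max for each column
variances = demo.var()      # sample variance (ddof=1) for each column
print(summary)
print(variances)

larger_mean = demo.mean().idxmax()   # name of the column with the larger mean
larger_var = variances.idxmax()      # name of the column with the larger variance
print("Larger mean:", larger_mean, "| larger variance:", larger_var)
```

Note that describe() reports the standard deviation, not the variance, so var() is called separately.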

Now let's prepare histograms of the two data series. An easy way to generate two histograms on the same plot is listed below:

fig, ax = plt.subplots()
mydata.plot.hist(density=False, ax=ax, title='Histogram: Set1 vs. Set2', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

Question 3

Are the two data series similar or not?

Describe (using words and sentences, not the method) how the series are different.

# put your answer here

Now let's generate two more series using the descriptive statistics (mean and standard deviation) of "Set1" and "Set2":

set1_s = np.random.normal(np.array(mydata['Set1']).mean(), np.array(mydata['Set1']).std(), 100) # random sample from a normal distribution function
set2_s = np.random.normal(np.array(mydata['Set2']).mean(), np.array(mydata['Set2']).std(), 100) # random sample from a normal distribution function
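Two details of this step are easy to miss. First, np.random.normal draws new values on every run, so the derived series (and their histograms) change each time the cell is executed. Second, the std() method of a NumPy array defaults to the population formula (ddof=0), while the std() method of a pandas Series defaults to the sample formula (ddof=1). The sketch below illustrates both points using a seeded generator and a hypothetical stand-in series named values; it is not the lab's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=21)   # fixed seed: the same draws on every run

# Hypothetical stand-in for mydata['Set1'] (assumed mean 50, std 15)
values = pd.Series(rng.normal(50.0, 15.0, 100), name="demo")

pop_std = np.array(values).std()   # NumPy default: ddof=0 (divide by n)
sample_std = values.std()          # pandas default: ddof=1 (divide by n - 1)
print(pop_std, sample_std)

# Derived series of the same length, drawn from a normal distribution
# with the stand-in's mean and (population) standard deviation
derived = rng.normal(values.mean(), pop_std, size=len(values))
print(derived.mean(), derived.std())
```

With 100 values the two standard deviations differ by about half a percent, so either choice gives nearly the same derived series here.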

Put these into a new dataframe

mydata_d = pd.DataFrame({'Set1s':set1_s,'Set2s':set2_s}) # make into a dataframe _d == derived

Now let's prepare histograms of the two new data series, using the same approach as before:

fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 samples vs. Set2 samples', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

Question 4

Are the two new data series similar or not?

Describe (using words and sentences, not the method) how the series are different.

# put your answer here

Now let's examine all 4 data series. First, a histogram of all 4 on the same graph:

fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 and Set1 samples vs. Set2 and Set2 samples', bins=40,alpha=0.5)
mydata.plot.hist(density=False, ax=ax, bins=40,alpha=0.5)

ax.set_ylabel('Count')
ax.grid(axis='y')

Question 5

Are the series "Set1" and "Set1s" the same or different? How do they compare? What about series "Set2" and "Set2s"?

# put your answer here

Another graphical tool is a set of boxplots

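set1 = mydata['Set1'] # original series; these names were not defined earlier in the notebook
set2 = mydata['Set2']
fig = plt.figure(figsize=(10, 7))
plt.boxplot([set1, set1_s, set2, set2_s], notch=True, sym='') # notched boxes, outlier markers hidden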
plt.xticks([1, 2, 3, 4], ['Set1', 'Set1_s', 'Set2', 'Set2_s'])
plt.show()
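The numbers behind each box can also be computed directly, which may help when reading the plot. The sketch below defines a hypothetical helper, box_summary, that returns the quartiles, median, and interquartile range using np.percentile. It is demonstrated on a small made-up array rather than the lab data.

```python
import numpy as np

def box_summary(values):
    """Return the quartiles, median, and interquartile range of a collection."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

demo = np.arange(1, 10)        # made-up values 1, 2, ..., 9
summary = box_summary(demo)
print(summary)
```

In the lab the same helper could be called on each of the four collections, for example box_summary(mydata['Set1']).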

Question 6

Interpret the results of the boxplot. Are the Set2 "collections" different from the Set1 "collections"?

# put your answer here

Question 7

Suppose we are comparing the arithmetic means of the 4 collections. Are the mean values of "Set1" and "Set2" far apart?
How many "Set1" standard deviations separate the "Set1" mean from the "Set2" mean? How about the converse (measured in "Set2" standard deviations)?

# put your answer here
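One way to express "how far apart" is to divide the difference between the two means by the standard deviation of one collection. The sketch below shows that arithmetic with a hypothetical helper, separation_in_std, on two small made-up lists. The choice of the sample standard deviation (ddof=1) is an assumption. The result depends on whose standard deviation is used, which is why the question asks about the converse.

```python
import numpy as np

def separation_in_std(a, b):
    """Distance between the means of a and b, in units of a's standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return abs(b.mean() - a.mean()) / a.std(ddof=1)   # ddof=1: sample standard deviation

# Made-up collections: same shape, but b is ten times more spread out than a
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]

a_to_b = separation_in_std(a, b)   # measured in standard deviations of a
b_to_a = separation_in_std(b, a)   # measured in standard deviations of b
print(a_to_b, b_to_a)
```

In the lab the arguments would be mydata['Set1'] and mydata['Set2'], in both orders.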

Question 8

Visit SixSigma and, after reading the Wiki, decide whether the Set1 and Set2 collections are far enough apart to be considered "statistically" different. Repeat with Set1 and Set1s: are they far apart?

# put your answer here

Output of the mydata cell (first and last five of 100 rows):

	Set1	Set2
0	46.688625	512.459480
1	44.825192	480.551364
2	71.453564	560.502112
3	30.360172	503.885912
4	47.657087	458.124749
...	...	...
95	60.040915	462.122309
96	21.527991	509.909507
97	59.523999	572.309957
98	38.173070	562.580099
99	39.671168	497.784981