Download (right-click, save target as ...) this page as a jupyterlab notebook from: Lab21
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 21 - In-Lab
Hypothesis tests are methods to quantify whether two groups of data are similar or different. In this lab we get started with exploratory data analysis, mostly histograms, and we will explore the concept in increasing detail over the next few labs.
First import some necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Next, let's get a database to work with. In this case the database is simply two collections of numerical values.
######### CODE TO AUTOMATICALLY DOWNLOAD THE DATABASE ################
#! pip install requests #install packages into local environment
import requests # import needed modules to interact with the internet
# make the connection to the remote file (actually it's implementing "bash curl -O http://fqdn/path ...")
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab21/Lab21_data.csv' # a csv file
response = requests.get(remote_url) # get the file contents and store them in a response object
response.raise_for_status() # stop here if the download failed
with open('Lab21_data.csv', 'wb') as output: # prepare a local destination file
    output.write(response.content) # write the downloaded bytes; the file is closed automatically
mydata = pd.read_csv("Lab21_data.csv")
mydata
What are the names of the two series in "mydata"?
# put your answer here
Describe the two data series. Which has the larger mean value? Which has the larger variance?
# put your answer here
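One way to approach the question above is to let pandas compute the summary statistics. The sketch below is only an illustration: `demo` is a small made-up dataframe that borrows the column names `Set1` and `Set2`, and its values are hypothetical. Substitute `mydata` for `demo` to get the real numbers.

```python
import pandas as pd

# Hypothetical stand-in for mydata; these values are made up for illustration
demo = pd.DataFrame({'Set1': [46.0, 48.0, 50.0, 52.0, 54.0],
                     'Set2': [90.0, 95.0, 100.0, 105.0, 110.0]})

print(demo.columns.tolist())   # names of the series in the dataframe

means = demo.mean()            # arithmetic mean of each column
variances = demo.var()         # sample variance (pandas divides by n-1)

print(means)
print(variances)
print('Larger mean:', means.idxmax())
print('Larger variance:', variances.idxmax())
```

`demo.describe()` reports the count, mean, standard deviation, and quartiles in one table if a fuller summary is wanted.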
Now let's prepare histograms of the two data series. An easy way to generate two histograms on the same plot is listed below.
fig, ax = plt.subplots()
mydata.plot.hist(density=False, ax=ax, title='Histogram: Set1 vs. Set2', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')
Are the two data series similar or not?
Describe in words and complete sentences (not code) how the series are different.
# put your answer here
Now let's generate two more series using the descriptive statistics (mean and standard deviation) from "Set1" and "Set2".
set1_s = np.random.normal(np.array(mydata['Set1']).mean(), np.array(mydata['Set1']).std(), 100) # random sample from a normal distribution function
set2_s = np.random.normal(np.array(mydata['Set2']).mean(), np.array(mydata['Set2']).std(), 100) # random sample from a normal distribution function
Put these into a new dataframe
mydata_d = pd.DataFrame({'Set1s':set1_s,'Set2s':set2_s}) # make into a dataframe _d == derived
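Because the derived series are drawn at random, their values change on every run. The sketch below shows one possible way to make the draw repeatable with a seeded generator; the seed value `1330` and the dataframe `demo` are assumptions made for illustration, not part of the lab data. It also shows that NumPy's `.std()` divides by n by default while pandas' `.std()` divides by n-1, so the two give slightly different spreads.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for mydata; these values are made up for illustration
demo = pd.DataFrame({'Set1': [46.0, 48.0, 50.0, 52.0, 54.0],
                     'Set2': [90.0, 95.0, 100.0, 105.0, 110.0]})

mu = demo['Set1'].mean()
sigma_np = demo['Set1'].to_numpy().std()   # population form, divides by n
sigma_pd = demo['Set1'].std()              # sample form, divides by n-1

rng = np.random.default_rng(1330)          # arbitrary seed, chosen for illustration
sample = rng.normal(mu, sigma_pd, 100)     # 100 draws from a normal distribution

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(1330)
sample_again = rng_again.normal(mu, sigma_pd, 100)

print(sigma_np, sigma_pd)
print(np.array_equal(sample, sample_again))
```

With only 100 draws, the mean and standard deviation of `sample` will be close to, but not exactly equal to, `mu` and `sigma_pd`.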
Now let's prepare histograms of the two new data series, using the same approach as before.
fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 samples vs. Set2 samples', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')
Are the two new data series similar or not?
Describe in words and complete sentences (not code) how the series are different.
# put your answer here
Now let's examine all 4 data series. First, a histogram of all 4 on the same graph:
fig, ax = plt.subplots()
mydata_d.plot.hist(density=False, ax=ax, title='Histogram: Set1 and Set1 samples vs. Set2 and Set2 samples', bins=40,alpha=0.5)
mydata.plot.hist(density=False, ax=ax, bins=40,alpha=0.5)
ax.set_ylabel('Count')
ax.grid(axis='y')
Are the series "Set1" and "Set1s" the same or different? How do they compare? What about series "Set2" and "Set2s"?
# put your answer here
Another graphical tool is a set of boxplots
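set1 = mydata['Set1'] # original series, pulled from the dataframe
set2 = mydata['Set2']
fig = plt.figure(figsize =(10, 7))
plt.boxplot([set1, set1_s, set2, set2_s], notch=True, sym='') # notched boxes, outlier markers hidden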
plt.xticks([1, 2, 3, 4], ['Set1', 'Set1_s', 'Set2', 'Set2_s'])
plt.show()
Interpret the results of the boxplot. Are the Set2 "collections" different from the Set1 "collections"?
# put your answer here
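When reading a boxplot it can help to compute the numbers the box is drawn from. The sketch below uses a made-up array named `values` (hypothetical, not the lab data) and `np.percentile` to find the quartiles; the box spans the first to third quartile and the line inside it marks the median.

```python
import numpy as np

# Hypothetical values, made up for illustration
values = np.array([46.0, 48.0, 50.0, 52.0, 54.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])  # quartiles
iqr = q3 - q1                                         # interquartile range, the height of the box

print('Q1 =', q1, 'median =', median, 'Q3 =', q3, 'IQR =', iqr)
```

Two collections whose boxes do not overlap at all have middle halves that occupy different ranges, which is one informal sign that they differ.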
Suppose we are comparing the arithmetic means of the 4 collections. Are the mean values of "Set1" and "Set2" far apart?
How many "Set1" standard deviations separate the "Set1" mean from the "Set2" mean? How about the converse (in "Set2" standard deviations)?
# put your answer here
# put your answer here
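The distance between two means, expressed in standard deviations, can be sketched as follows. The arrays `set1_demo` and `set2_demo` are hypothetical stand-ins with made-up values; replace them with the real series to answer the question for the lab data. The sample standard deviation (`ddof=1`) is an assumption here; the population form (`ddof=0`) gives a slightly larger ratio.

```python
import numpy as np

# Hypothetical stand-ins for Set1 and Set2; values are made up for illustration
set1_demo = np.array([46.0, 48.0, 50.0, 52.0, 54.0])
set2_demo = np.array([90.0, 95.0, 100.0, 105.0, 110.0])

# Absolute difference between the two arithmetic means
diff = abs(set1_demo.mean() - set2_demo.mean())

# Express that difference in units of each collection's standard deviation
in_set1_sd = diff / set1_demo.std(ddof=1)
in_set2_sd = diff / set2_demo.std(ddof=1)

print('Difference in means:', diff)
print('In Set1 standard deviations:', round(in_set1_sd, 2))
print('In Set2 standard deviations:', round(in_set2_sd, 2))
```

The two ratios differ whenever the collections have different spreads, which is why the question asks for both directions.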