21: Testing Hypothesis - Introductions¶

Comparing two (or more) collections of observations (graphically)

A procedure to systematically decide if two data collections are similar or substantially different.

Objectives¶

To apply fundamental concepts involved in probability estimation modeling and descriptive statistics;
- Concept of a hypothesis
Hypothesis components
- Null hypothesis and alternative hypothesis
- Normal distribution model
- One-tail, two-tail tests
- Attained significance
Decision Error
- Type-1, Type-2

Computational Thinking Concepts¶

The CT concepts include:

Abstraction => Represent data behavior with a function
Pattern Recognition => Patterns in data models to make decision

In engineering, when we wish to start asking questions about the data and interpret the results, we use statistical methods that provide a confidence or likelihood about the answers. In general, this class of methods is called statistical hypothesis testing, or significance tests. The material for today’s lecture is inspired by and gathered from several resources including:

Hypothesis testing in Machine learning using Python by Yogesh Agrawal available at https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce
Demystifying hypothesis testing with simple Python examples by Tirthajyoti Sarkar available at https://towardsdatascience.com/demystifying-hypothesis-testing-with-simple-python-examples-4997ad3c5294
A Gentle Introduction to Statistical Hypothesis Testing by Jason Brownlee available at https://machinelearningmastery.com/statistical-hypothesis-tests/

Fundamental Concepts¶

What is hypothesis testing ?
¶

Hypothesis testing is a statistical method that is used in making statistical decisions (about population) using experimental data (samples). Hypothesis Testing is basically an assumption that we make about the population parameter.

Example : You state “on average, students in the class are taller than 5 ft and 4 inches” or “an average boy is taller than an average girl” or “a specific treatment is effective in treating COVID-19 patients”.

We need some mathematical way support that whatever we are stating is true. We validate these hypotheses, basing our conclusion on random samples and empirical distributions.

Why do we use it ?
¶

Hypothesis testing is an essential procedure in experimentation. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is supported by the sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test.

Comparing Two Collections¶

Lets first examine a typical data question; you will need an empty notebook to follow along, as only the code is supplied!

Do construction activities impact stormwater solids metrics?

The webroot for the subsequent examples/exercises is http://54.243.252.9/engr-1330-webroot/9-MyJupyterNotebooks/41A-HypothesisTests/

Background¶

The Clean Water Act (CWA) prohibits storm water discharge from construction sites that disturb 5 or more acres, unless authorized by a National Pollutant Discharge Elimination System (NPDES) permit. Permittees must provide a site description, identify sources of contaminants that will affect storm water, identify appropriate measures to reduce pollutants in stormwater discharges, and implement these measures. The appropriate measures are further divided into four classes: erosion and sediment control, stabilization practices, structural practices, and storm water management. Collectively the site description and accompanying measures are known as the facility’s Storm Water Pollution Prevention Plan (SW3P). The permit contains no specific performance measures for construction activities, but states that ”EPA anticipates that storm water management will be able to provide for the removal of at least 80% of the total suspended solids (TSS).” The rules also note ”TSS can be used as an indicator parameter to characterize the control of other pollutants, including heavy metals, oxygen demanding pollutants, and nutrients commonly found in stormwater discharges”; therefore, solids control is critical to the success of any SW3P. Although the NPDES permit requires SW3Ps to be in-place, it does not require any performance measures as to the effectiveness of the controls with respect to construction activities. The reason for the exclusion was to reduce costs associated with monitoring storm water discharges, but unfortunately the exclusion also makes it difficult for a permittee to assess the effectiveness of the controls implemented at their site. Assessing the effectiveness of controls will aid the permittee concerned with selecting the most cost effective SW3P.

Problem Statement
¶

The files precon.CSV and durcon.CSV contain observations of cumulative rainfall, total solids, and total suspended solids collected from a construction site on Nasa Road 1 in Harris County.
The data in the file precon.CSV was collected before construction began. The data in the file durcon.CSV were collected during the construction activity.
The first column is the date that the observation was made, the second column the total solids (by standard methods), the third column is is the total suspended solids (also by standard methods), and the last column is the cumulative rainfall for that storm.

These data are not time series (there was sufficient time between site visits that you can safely assume each storm was independent. Our task is to analyze these two data sets and decide if construction activities impact stormwater quality in terms of solids measures.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Lets introduce script to automatically get the files from the named resource, in this case a web server!

Note

You would need to insert this script into your notebook, and run it to replicate the example here.

import requests # Module to process http/https requests

remote_url="http://54.243.252.9/engr-1330-webroot/9-MyJupyterNotebooks/41A-HypothesisTests/precon.csv"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
open('precon.csv','wb').write(rget.content) # extract from the remote the contents, assign to a local file same name

remote_url="http://54.243.252.9/engr-1330-webroot/9-MyJupyterNotebooks/41A-HypothesisTests/durcon.csv"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
open('durcon.csv','wb').write(rget.content) # extract from the remote the contents, assign to a local file same name

Read and examine the files, see if we can understand their structure

precon = pd.read_csv("precon.csv")
durcon = pd.read_csv("durcon.csv") 

precon.head()

	DATE	TS.PRE	TSS.PRE	RAIN.PRE
0	03/27/97	408.5	111.0	1.00
1	03/31/97	524.5	205.5	0.52
2	04/04/97	171.5	249.0	0.95
3	04/07/97	436.5	65.0	0.55
4	04/11/97	627.0	510.5	2.19

durcon.head()

	TS.DUR	TSS.DUR	RAIN.DUR
0	3014.0	2871.5	1.59
1	1137.0	602.0	0.53
2	2362.5	2515.0	0.74
3	395.5	130.0	0.11
4	278.5	36.5	0.27

precon.describe()

	TS.PRE	TSS.PRE	RAIN.PRE
count	17.000000	17.000000	17.000000
mean	462.941176	286.882353	0.878235
std	361.852779	312.659786	0.882045
min	163.000000	25.000000	0.130000
25%	268.000000	111.000000	0.460000
50%	408.500000	205.500000	0.690000
75%	524.500000	312.000000	0.950000
max	1742.000000	1373.000000	3.760000

durcon.describe()

	TS.DUR	TSS.DUR	RAIN.DUR
count	37.000000	37.000000	37.000000
mean	3495.283784	2749.216216	1.016486
std	7104.602041	5322.194188	1.391886
min	124.000000	14.000000	0.100000
25%	270.000000	57.000000	0.270000
50%	1058.000000	602.000000	0.550000
75%	2671.500000	2515.000000	1.010000
max	29954.500000	24146.500000	6.550000

Lets make some exploratory histograms to guide our investigation

Is the rainfall different before construction?

precon['RAIN.PRE'].hist(alpha=0.4,color='red',density="True")
durcon['RAIN.DUR'].hist(alpha=0.4,color='blue',density="True")

<AxesSubplot:>

This will show that as “distributions” they look pretty similar, although the during construction data has a few larger events.

Now

Is the total solids (TS) different before construction?

precon['TS.PRE'].hist(alpha=0.4,color='red',density="True")
durcon['TS.DUR'].hist(alpha=0.4,color='blue',density="True")

<AxesSubplot:>

Here it is hard to tell, but the preconstruction values are all to the left while the during construction phase has some large values.

Lets compare means and standard deviations

\( \mu TS_{pre} = 463 \)
\( \sigma TS_{pre} = 361 \)
\( \mu TS_{dur} = 3495 \)
\( \sigma TS_{dur} = 7104 \)

Certainly different, and the mean during construction is 8 pre-construction standard deviations larger, hence supportive of a claim that there is a difference, however the standard deviation of the during phase is huge, easily encompassing the preconstruction mean, so is there really a difference? We could resort to simulation to try to answer the question.

pre_s = np.random.normal(np.array(precon['TS.PRE']).mean(), np.array(precon['TS.PRE']).std(), 10000) # random sample from a normal distribution function
dur_s = np.random.normal(np.array(durcon['TS.DUR']).mean(), np.array(durcon['TS.DUR']).std(), 10000) # random sample from a normal distribution function
myfakedata_d = pd.DataFrame({'PreSim':pre_s,'DurSim':dur_s}) # make into a dataframe _d == derived
fig, ax = plt.subplots()
myfakedata_d.plot.hist(density=False, ax=ax, title='Histogram: Pre samples vs. Dur samples', bins=40)
ax.set_ylabel('Count')
ax.grid(axis='y')

Here we learn the standard deviations mean a lot, and the normal distribution is probably not the best model (negative solids don’t make physical sense). However it does point to the important issue, how to quantify the sameness or differences?

Thats the goal of hypothesis testing methods.

In lab you will use similar explorations, and next time we will get into the actual methods.

References¶

Laboratory 21¶

Examine (click) Laboratory 21 as a webpage at Laboratory 21.html

Download (right-click, save target as …) Laboratory 21 as a jupyterlab notebook from Laboratory 21.ipynb

Exercise Set 21¶

Examine (click) Exercise Set 21 as a webpage at Exercise 21

Download (right-click, save target as …) Exercise Set 21 as a jupyterlab notebook at Exercise Set 21

ENGR 1330 Course Notes

ENGR 1330 Computational Thinking with Data Science

21: Testing Hypothesis - Introductions¶

Objectives¶

Computational Thinking Concepts¶

Fundamental Concepts¶

What is hypothesis testing ?
¶

Why do we use it ?
¶

Comparing Two Collections¶

Background¶

Problem Statement
¶

References¶

Laboratory 21¶

Exercise Set 21¶

ENGR 1330 Course Notes

ENGR 1330 Computational Thinking with Data Science

21: Testing Hypothesis - Introductions¶

Objectives¶

Computational Thinking Concepts¶

Fundamental Concepts¶

What is hypothesis testing ?¶

Why do we use it ?¶

Comparing Two Collections¶

Background¶

Problem Statement ¶

References¶

Laboratory 21¶

Exercise Set 21¶

What is hypothesis testing ?
¶

Why do we use it ?
¶

Problem Statement
¶