Download (right-click, save target as ...) this page as a JupyterLab notebook: Lab21-TH


Laboratory 21: Probability Estimation Modeling

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 21 - Homework


Important Terminology:

Population: In statistics, a population is the entire pool from which a statistical sample is drawn. A population may refer to an entire group of people, objects, events, hospital visits, or measurements.
Sample: In statistics and quantitative research methodology, a sample is a set of individuals or objects collected or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.
Distribution (Data Model): A data distribution is a function or a listing which shows all the possible values (or intervals) of the data. It also (and this is important) tells you how often each value occurs.

From https://www.investopedia.com/terms
https://www.statisticshowto.com/data-distribution/

Important Steps:

  1. Get descriptive statistics: mean, variance, standard deviation.
  2. Use plotting position formulas (e.g., Weibull, Gringorten, Cunnane) and plot the SAMPLES (the data you already have); a sketch of the Weibull formula follows this list.
  3. Use different data models (e.g., normal, log-normal, Gumbel) and find the one that best FITs your samples, either visually or numerically.
  4. Use the data model that provides the best fit to make inferences about the POPULATION.
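
For reference, here is a minimal sketch of a Weibull plotting-position helper; the function name and structure are just one possibility, not necessarily what the lecture notebook uses:

def weibull_pp(sample):
    # Weibull plotting position: the m-th smallest of n values is assigned
    # non-exceedance probability m/(n+1)
    ranked = sorted(sample)                       # ascending order
    n = len(ranked)
    pp = [(m + 1) / (n + 1) for m in range(n)]    # m = 1, 2, ..., n
    return ranked, pp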

Estimate the magnitude of the annual peak flow at Llano River near Junction, TX.

The file 08150000.pkf is an actual WATSTORE formatted file for a USGS gage at Junction, Texas. The first few lines of the file look like:

Z08150000                       USGS 
H08150000       3030150994403004848267SW120902041854   1849    1634.32          
N08150000       Llano Rv nr Junction, TX
Y08150000       
308150000       19160522  11100                8.80                      
308150000       19170511    192                1.88                      
308150000       19180414  14900               10.50                      
308150000       19190924  35700                                          
308150000       19200514  13700               10.00                      
308150000       19210319    880                2.19                      
308150000       19220403  16100               10.97                      
308150000       19230425  60400                         

The first column contains agency and station codes that identify the gage; in the data rows (after the first four header rows), the second column is a date in YYYYMMDD format, the third column is a peak discharge in CFS, and the fourth and fifth columns are not relevant for this laboratory exercise. The file was downloaded from

https://nwis.waterdata.usgs.gov/tx/nwis/peak?site_no=08150000&agency_cd=USGS&format=hn2

In the original file there are several codes that were manually removed, so use the cleaned file at

http://54.243.252.9/engr-1330-webroot/8-Labs/Lab21/08150000.pkf

The laboratory task is to fit the data models to these data, decide which model is best from a visual perspective, and report from that data model the magnitudes of peak flow associated with the probabilities below (i.e., populate the table).

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

The first step is to read the file, skipping the header rows, and then build a dataframe:

In [ ]:
# Code to download the file here, or manual download
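# One way to fetch the file programmatically (a sketch; it assumes the requests
# package is available in your environment, otherwise just download manually):
import requests

url = "http://54.243.252.9/engr-1330-webroot/8-Labs/Lab21/08150000.pkf"
response = requests.get(url, allow_redirects=True)
with open('08150000.pkf', 'wb') as localfile:
    localfile.write(response.content)   # save a local copy next to this notebook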
In [4]:
# Read the data file
matrix1 = []  # parsed rows (lists of strings)
col0 = []     # station identifier
col1 = []     # peak date (YYYYMMDD)
col2 = []     # peak discharge (CFS)
with open('08150000.pkf','r') as afile:
    lines_after_4 = afile.readlines()[4:]  # skip the 4 header rows; the with block closes the file
howmanyrows = len(lines_after_4)
for i in range(howmanyrows):
    matrix1.append(lines_after_4[i].strip().split())
for i in range(howmanyrows):
    col0.append(matrix1[i][0])
    col1.append(matrix1[i][1])
    col2.append(matrix1[i][2])
# col0 is the station code, col1 is the date, col2 is the peak flow
# now build a dataframe
In [5]:
import pandas
df = pandas.DataFrame(col0)   # station identifier column (pandas names it 0 by default)
df['date'] = col1             # peak date as a YYYYMMDD string
df['flow'] = col2             # peak discharge in CFS (still a string; convert to float before computing statistics)
In [6]:
df.head()
Out[6]:
0 date flow
0 308150000 19160522 11100
1 308150000 19170511 192
2 308150000 19180414 14900
3 308150000 19190924 35700
4 308150000 19200514 13700

Now explore whether you can plot the dataframe as peak flow versus date.

In [7]:
# Plot here
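# A minimal plotting sketch (assumes matplotlib is available); the flows were
# read as strings, so convert them to numbers before plotting.
import matplotlib.pyplot as plt

flow = df['flow'].astype(float)
dates = pandas.to_datetime(df['date'], format='%Y%m%d')

plt.figure(figsize=(10, 4))
plt.plot(dates, flow, marker='o', linewidth=1)
plt.xlabel('Date')
plt.ylabel('Annual peak flow (CFS)')
plt.title('Llano River near Junction, TX (08150000)')
plt.show()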

From here on you can proceed using the lecture notebook as a go-by, although you should use functions as much as practical to keep your work concise.

In [8]:
# Descriptive Statistics 
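# A sketch of the descriptive statistics step (mean, variance, standard deviation):
flow = df['flow'].astype(float)      # flows were read as strings
print('mean      :', flow.mean())
print('variance  :', flow.var())
print('std. dev. :', flow.std())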
In [9]:
# Weibull Plotting Position Function
In [10]:
# Normal Quantile Function 
In [11]:
# Fitting Data to Normal Data Model 
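# A sketch of a moment-fit normal data model using scipy (an assumption: the
# lecture notebook may hand-code the quantile function instead, with similar results).
from scipy.stats import norm

flow = df['flow'].astype(float)
mu, sigma = flow.mean(), flow.std()                     # sample mean and standard deviation

probs = [0.25, 0.50, 0.75, 0.90, 0.99, 0.998, 0.999]    # non-exceedance probabilities from the table
normal_flows = norm.ppf(probs, loc=mu, scale=sigma)
for p, q in zip(probs, normal_flows):
    print(f"{p:>6.1%}  {q:>12,.0f} CFS")                # a normal model can give negative low quantiles for skewed flood data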

Normal Distribution Data Model

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

In [12]:
# Log-Normal Quantile Function 
In [13]:
# Fitting Data to Log-Normal Data Model 
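# A sketch of the log-normal data model (assumes scipy): fit a normal model to the
# natural logs of the flows, then transform the quantiles back to CFS.
import numpy as np
from scipy.stats import norm

logflow = np.log(df['flow'].astype(float))
mu_ln, sigma_ln = logflow.mean(), logflow.std()

probs = [0.25, 0.50, 0.75, 0.90, 0.99, 0.998, 0.999]
lognormal_flows = np.exp(norm.ppf(probs, loc=mu_ln, scale=sigma_ln))   # back-transform to CFS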

Log-Normal Distribution Data Model

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

In [14]:
# Gumbel EV1 Quantile Function
In [15]:
# Fitting Data to Gumbel EV1 Data Model 
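# A sketch of a method-of-moments Gumbel (EV1) fit using scipy's gumbel_r
# (an assumption: a hand-coded EV1 quantile function is equivalent).
import numpy as np
from scipy.stats import gumbel_r

flow = df['flow'].astype(float)
beta = flow.std() * np.sqrt(6) / np.pi        # scale parameter by method of moments
alpha = flow.mean() - 0.5772 * beta           # location parameter (0.5772 is the Euler-Mascheroni constant)

probs = [0.25, 0.50, 0.75, 0.90, 0.99, 0.998, 0.999]
gumbel_flows = gumbel_r.ppf(probs, loc=alpha, scale=beta)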

Gumbel Double Exponential (EV1) Distribution Data Model

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

In [16]:
# Gamma (Pearson Type III) Quantile Function
In [17]:
# Fitting Data to Pearson (Gamma) III Data Model 
# This is new; in lecture the fit was to log-Pearson III. The procedure is the same, but without the log transform.
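# A sketch of a moment-fit Pearson Type III model (assumes scipy; pearson3 is
# parameterized by the skew coefficient, with loc and scale as the mean and std. dev.).
from scipy.stats import pearson3

flow = df['flow'].astype(float)
g = flow.skew()                               # sample coefficient of skewness

probs = [0.25, 0.50, 0.75, 0.90, 0.99, 0.998, 0.999]
pearson_flows = pearson3.ppf(probs, g, loc=flow.mean(), scale=flow.std())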

Pearson III Distribution Data Model

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

In [18]:
# Fitting Data to Log-Pearson (Log-Gamma) III Data Model 
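# A sketch of the log-Pearson III data model (assumes scipy): same as Pearson III,
# but fit to the base-10 logs of the flows and then transformed back.
import numpy as np
from scipy.stats import pearson3

logflow = np.log10(df['flow'].astype(float))
g_log = logflow.skew()

probs = [0.25, 0.50, 0.75, 0.90, 0.99, 0.998, 0.999]
lp3_flows = 10 ** pearson3.ppf(probs, g_log, loc=logflow.mean(), scale=logflow.std())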

Log-Pearson III Distribution Data Model

Non-exceedance Probability | Flow Value | Remarks
25%                        | ????       | 75% chance of greater value
50%                        | ????       | 50% chance of greater value
75%                        | ????       | 25% chance of greater value
90%                        | ????       | 10% chance of greater value
99%                        | ????       | 1% chance of greater value (in flood statistics, this is the 1 in 100-yr chance event)
99.8%                      | ????       | 0.2% chance of greater value (in flood statistics, this is the 1 in 500-yr chance event)
99.9%                      | ????       | 0.1% chance of greater value (in flood statistics, this is the 1 in 1000-yr chance event)

Summary of "Best" Data Model based on Graphical Fit

In [ ]:
# your interpretation here