# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
! pwd
%%html
<!--Script block to left align Markdown Tables-->
<style>
  table {margin-left: 0 !important;}
</style>
Last lesson we introduced empirical (based on observations) distributions and fit models to them - made the red line explain the blue dots. Ltes refresh a little:
Plotting Position: An empirical distribution, based on a random sample from some underlying distribution, is obtained by plotting an estimate of the exceedance (or cumulative) probability of the sample distribution against the sample value. 
The exceedance probability estimate for a particular sample value is a function of sample size and the rank of the particular sample. For exceedance probabilities, the sample values are ranked from largest to smallest.
The general expression in common use for plotting position is $ P = \frac{m - b}{N + 1 -2b} $
where m is the ordered rank of a sample value, N is the sample size, and b is a constant between 0 and 1, depending on the plotting method.
Lets look at some real data from https://www.depts.ttu.edu/techmrtweb/reports/complete_reports/0-6654-1_Final.pdf The database in the report is available at: http://54.243.252.9/engr-1330-psuedo-course/CECE-1330-PsuedoCourse/1-Lessons/Lesson14/database_txdot0-6654.txt
A proper URL reading would look something like:
import requests 
remote_url="http://54.243.252.9/engr-1330-psuedo-course/CECE-1330-PsuedoCourse/1-Lessons/Lesson14/database_txdot0-6654.txt"
rget = requests.get(remote_url, allow_redirects=True)  
open('database_txdot0-6654.txt','wb').write(rget.content) 
Of course neither of these will work with CoCalc, so you have to manually obtain the file.
Here we already have the file downloaded.
import pandas 
data = pandas.read_csv("database_txdot0-6654.txt",sep="|") 
data.head()
The file contains various fields, here let's just name a few.
CDA is the contributing drainage area, V is the mean section velocity in the stream at a gaging station, B is the topwidth, A is the cross section area of flow, Q is the beasured discharge.
data.describe()
Lets build two datasets, one with CDA > 300, and one CDA < 300 and examine the velocities under these two classifications.
dataBig = data[data['CDA'] >= 300][['STATION','CDA','V']]
dataLittle = data[data['CDA'] < 300][['STATION','CDA','V']]
Now lets pose a simple question. Are the velocities from large drainage areas, bigger than small areas? We can try a describe approach.
dataBig.describe()
dataLittle.describe()
Hard to tell, the mean values are pretty close, standard deviation about the same, the IQR is close. Lets try to plot two histograms side-by-side. We can get some guidance from https://stackoverflow.com/questions/45069828/how-to-plot-2-histograms-side-by-side?rq=1
And then Copy, Modify, Run (CMR) our new script.
import matplotlib.pyplot 
# make a couple of dataframes
set1 = dataBig['V']
set2 = dataLittle['V']
# use subplots
fig, axes = matplotlib.pyplot.subplots(1, 2)
axes[0].set_title('Velocity: Big Watersheds') 
axes[1].set_title('Velocity: Little Watersheds')
# here we build the plots - notice the ax variable
set1.hist( bins=100, rwidth=1, color='blue', density=True, ax=axes[0],figsize = (8,4))
set2.hist( bins=100, rwidth=1, color='red', density=True, ax=axes[1],figsize = (8,4))
The plots look pretty darned similar, not identical, but too similar to conclude the size matters.
So lets try our empirical distribution approach
# First copy into numpy arrays
import numpy
big1s = numpy.array(dataBig['V'])
little1s = numpy.array(dataLittle['V'])
def weibull_pp(sample): # Weibull plotting position function
# returns a list of plotting positions; sample must be a numeric list
    weibull_pp = [] # null list to return after fill
    sample.sort() # sort the sample list in place
    for i in range(0,len(sample),1):
        weibull_pp.append((i+1)/(len(sample)+1)) #values from the gringorten formula
    return weibull_pp
#Apply the weibull pp function
big1s_wei = weibull_pp(big1s)
little1s_wei = weibull_pp(little1s)
import matplotlib.pyplot
myfigure = matplotlib.pyplot.figure(figsize = (4,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(big1s_wei, big1s ,color ='blue')
matplotlib.pyplot.scatter(little1s_wei, little1s ,color ='orange')
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Velocity Value") 
matplotlib.pyplot.title("Quantile Plot for Big and Little based on Weibull Plotting Function") 
matplotlib.pyplot.show()
Now we have something, the gold line looks different in a probabilistic sense than the blue line except in the extreme part of the diagram, so the mean velocity is smaller on small watersheds, but such watersheds are quite capable of generating high velocities on rare occasion.
Lets log transform the results and look again - it does not change anything, but will highlight differences.
import math
def loggit(x):  # A prototype function to log transform x
    return(math.log(x))
big1s = numpy.array(dataBig['V'].apply(loggit).tolist())
little1s = numpy.array(dataLittle['V'].apply(loggit).tolist())
#Apply the weibull pp function
big1s_wei = weibull_pp(big1s)
little1s_wei = weibull_pp(little1s)
myfigure = matplotlib.pyplot.figure(figsize = (4,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(big1s_wei, big1s ,color ='blue')
matplotlib.pyplot.scatter(little1s_wei, little1s ,color ='orange')
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Velocity Value") 
matplotlib.pyplot.title("Quantile Plot for Big and Little based on Weibull Plotting Function") 
matplotlib.pyplot.show()
In log space we see even more of a difference. Next lesson we will learn about "hypothesis" tests where we can test if the differences are big enough to be meaningful, or are just a chance occurance.
Now let's play the same game with discharge, the product of velocity and area.
dataBig = data[data['CDA'] >= 300][['STATION','CDA','Q']]
dataLittle = data[data['CDA'] < 300][['STATION','CDA','Q']]
print(dataBig.describe())
print(dataLittle.describe())
# make a couple of dataframes
set1 = dataBig['Q']
set2 = dataLittle['Q']
# use subplots
fig, axes = matplotlib.pyplot.subplots(1, 2)
axes[0].set_title('Discharge: Big Watersheds') 
axes[1].set_title('Discharge: Little Watersheds')
# here we build the plots - notice the ax variable
set1.hist( bins=100, rwidth=1, color='blue', density=True, ax=axes[0],figsize = (8,4))
set2.hist( bins=100, rwidth=1, color='red', density=True, ax=axes[1],figsize = (8,4))
big1s = numpy.array(dataBig['Q'])
little1s = numpy.array(dataLittle['Q'])
#Apply the weibull pp function
big1s_wei = weibull_pp(big1s)
little1s_wei = weibull_pp(little1s)
myfigure = matplotlib.pyplot.figure(figsize = (8,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(big1s_wei, big1s ,color ='blue')
matplotlib.pyplot.scatter(little1s_wei, little1s ,color ='orange')
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Discharge Value") 
matplotlib.pyplot.title("Quantile Plot for Big and Little based on Weibull Plotting Function") 
matplotlib.pyplot.show()
Here we see some difference - but would be hard to quantify meaningfully, lets repeat in log space. We will log transform the values and reanalyze.
import math
def loggit(x):  # A prototype function to log transform x
    return(math.log(x))
big1s = numpy.array(dataBig['Q'].apply(loggit).tolist())
little1s = numpy.array(dataLittle['Q'].apply(loggit).tolist())
#Apply the weibull pp function
big1s_wei = weibull_pp(big1s)
little1s_wei = weibull_pp(little1s)
myfigure = matplotlib.pyplot.figure(figsize = (8,8)) # generate a object from the figure class, set aspect ratio
matplotlib.pyplot.scatter(big1s_wei, big1s ,color ='blue')
matplotlib.pyplot.scatter(little1s_wei, little1s ,color ='orange')
matplotlib.pyplot.xlabel("Density or Quantile Value") 
matplotlib.pyplot.ylabel("Discharge Value") 
matplotlib.pyplot.title("Quantile Plot for Big and Little based on Weibull Plotting Function") 
matplotlib.pyplot.show()
In these cases, yes (at least for the mean value), we will learn methods next time to test for more subtle differences - in the general field called hypothesis tests.