Percentiles
For example, let's consider the sizes of the five largest continents – Africa, Antarctica, Asia, North America, and South America – rounded to the nearest million square miles.
import numpy as np
sizes = np.array([12, 17, 6, 9, 7])
sizes
The 80th percentile is the smallest value that is at least as large as 80% of the elements of sizes.
Step 1: sort the list in ascending order
Step 2: count off 80% of the elements from left to right
sorted_sizes = np.sort(sizes)
sorted_sizes
number_of_elements = 0.8*(len(sizes)-1)
number_of_elements
Rounding 3.2 down, the 80th percentile is at index 3, which is the number 12.
sorted_sizes[3]
Rounding 3.2 up, the 80th percentile is at index 4, which is the number 17.
sorted_sizes[4]
Handling a fractional rank
number_of_elements = 0.7*(len(sizes)-1)
number_of_elements
Rounding 2.8 up gives index 3; the 70th percentile is then the number 12.
sorted_sizes[3]
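The round-up rule above can be collected into a small helper (the function name is illustrative, not part of numpy):

```python
import math
import numpy as np

def percentile_round_up(values, p):
    # Hypothetical helper: compute the rank r = p*(n-1), then round it
    # up to the next whole index, as described above.
    sorted_vals = np.sort(values)
    r = p * (len(values) - 1)
    return sorted_vals[math.ceil(r)]

sizes = np.array([12, 17, 6, 9, 7])
percentile_round_up(sizes, 0.7)  # rank 0.7*4 = 2.8 rounds up to index 3 -> 12
```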
Interpolate ("linear" approach) with a fractional rank
Step 1: Determine the elements at the calculated rank using the formula r = p(n-1). The 70th percentile is at r = 0.7*(5-1) = 2.8. A rank of 2.8 means the elements at positions 2 and 3, which are 9 and 12, respectively.
Step 2: Take the difference between these two elements and multiply it by the fractional portion of the rank. For our example, this is: (12 − 9) × 0.8 = 2.4.
Step 3: Take the lower-ranked value in Step 1 and add the value from Step 2 to obtain the interpolated value for the percentile. For our example, that value is 9 + 2.4 = 11.4.
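The three steps can be written out as one sketch (the function name is illustrative; it mirrors numpy's linear method):

```python
import numpy as np

def percentile_interpolated(values, p):
    # Sketch of the three steps above: fractional rank, difference
    # times the fractional part, then add to the lower-ranked value.
    sorted_vals = np.sort(values)
    r = p * (len(values) - 1)                  # Step 1: r = p(n-1)
    lo, hi = int(np.floor(r)), int(np.ceil(r))
    frac = r - lo                              # fractional portion of the rank
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac  # Steps 2-3

sizes = np.array([12, 17, 6, 9, 7])
percentile_interpolated(sizes, 0.7)            # 9 + (12 - 9)*0.8 = 11.4
```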
Using numpy and pandas
# NumPy 1.22 renamed the 'interpolation' keyword to 'method'
np.percentile(sizes, 80, method='linear')
np.percentile(sizes, 70, method='linear')
import pandas as pd
my_data = {
    "Size": sizes
}
df = pd.DataFrame(my_data)
df
df["Size"].quantile(0.8, interpolation='linear')
df["Size"].quantile(0.7, interpolation='linear')
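Note that the two libraries agree: numpy's percentile takes a value from 0 to 100, while pandas' quantile takes a fraction from 0 to 1, but with the linear method both compute the same interpolated value.

```python
import numpy as np
import pandas as pd

sizes = np.array([12, 17, 6, 9, 7])
s = pd.Series(sizes)

# np.percentile uses 0-100; Series.quantile uses 0-1. Both default to
# linear interpolation, so the results match.
np.isclose(np.percentile(sizes, 70), s.quantile(0.70))
```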
Another example
import pandas as pd
scores_and_sections = pd.read_csv('scores_by_section.csv')
scores_and_sections
scores_and_sections['Midterm'].hist(bins=np.arange(-0.5, 25.6, 1))
scores_and_sections['Midterm'].quantile(0.85)
Quantiles
scores_and_sections['Midterm'].quantile(0.25)
scores_and_sections['Midterm'].quantile(0.50)
scores_and_sections['Midterm'].quantile(0.75)
scores_and_sections['Midterm'].quantile(1)
scores_and_sections['Midterm'].max()
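The last two cells illustrate a general identity: the 1.0 quantile of a series is its maximum, and likewise the 0.5 quantile is its median. A toy series (made-up values, standing in for the Midterm column) makes this checkable without the CSV:

```python
import pandas as pd

# Toy scores (hypothetical values, not the real Midterm data).
s = pd.Series([3, 7, 12, 15, 22, 24])

s.quantile(1.0) == s.max()      # the 100th percentile is the maximum
s.quantile(0.5) == s.median()   # the 50th percentile is the median
```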
Bootstrap
We study the Total Compensation column.
df = pd.read_csv("san_francisco_2015.csv")
df
We will focus our attention on those who had at least the equivalent of a half-time job for the whole year. At a minimum wage of about $10 per hour, and 20 hours per week for 52 weeks, that's a salary of about $10,000.
df = df.loc[df["Salaries"] > 10000]
df
Visualize the histogram
my_bins = np.arange(0, 700000, 25000)
df['Total Compensation'].hist(bins=my_bins)
Compute the median
pop_median = df['Total Compensation'].median()
pop_median
df['Total Compensation'].quantile(0.50)
Now we estimate this value using bootstrap (resampling)
my_bins = np.arange(0, 700000, 25000)
our_sample = df.sample(500, replace=False)
our_sample['Total Compensation'].hist(bins=my_bins)
est_median = our_sample['Total Compensation'].median()
est_median
our_sample['Total Compensation'].quantile(0.50)
The sample size is large. By the law of averages, the distribution of the sample resembles that of the population, and consequently the sample median is not very far from the population median (though of course it is not exactly the same).
So now we have one estimate of the parameter. But had the sample come out differently, the estimate would have had a different value. We would like to be able to quantify the amount by which the estimate could vary across samples. That measure of variability will help us measure how accurately we can estimate the parameter.
resample_1 = our_sample.sample(frac=1.0, replace=True)
resample_1['Total Compensation'].hist(bins=my_bins)
Compute the median of the new sample
resample_1['Total Compensation'].median()
resample_2 = our_sample.sample(frac=1.0, replace=True)
resampled_median_2 = resample_2['Total Compensation'].median()
resampled_median_2
Resampling 5,000 times
bstrap_medians = []
for i in range(1, 5000+1):
    one_resample = our_sample.sample(frac=1.0, replace=True)
    one_median = one_resample['Total Compensation'].median()
    bstrap_medians.append(one_median)
my_median_data = {
    "Median": bstrap_medians
}
median_df = pd.DataFrame(my_median_data)
median_df
median_df.hist()
import matplotlib.pyplot as plt
plt.hist(bstrap_medians)
plt.xlabel("Median")
plt.ylabel("Frequency")
plt.show()
plt.hist(bstrap_medians, zorder=1)
plt.xlabel("Median")
plt.ylabel("Frequency")
plt.scatter(pop_median, 0, color='red', s=30, zorder=2);
plt.show()
Let's find out whether the middle 95% of the resampled medians contains the red dot.
left = median_df['Median'].quantile(0.025)
left
right = median_df['Median'].quantile(0.975)
right
The population median of $110,305 is between these two numbers. The interval and the population median are shown on the histogram below.
plt.hist(bstrap_medians, zorder=1)
plt.xlabel("Median")
plt.ylabel("Frequency")
plt.plot([left, right], [0, 0], color='yellow', lw=3, zorder=2)
plt.scatter(pop_median, 0, color='red', s=30, zorder=3);
plt.show()
So, the "middle 95%" interval of estimates captured the parameter in our example.
Let's repeat the process 100 times to see how frequently the interval contains the parameter. We will store the left and right ends from each simulation.
def bootstrap_sample(our_sample):
    bstrap_medians = []
    for i in range(1, 5000+1):
        one_resample = our_sample.sample(frac=1.0, replace=True)
        one_median = one_resample['Total Compensation'].median()
        bstrap_medians.append(one_median)
    return bstrap_medians
left_ends = []
right_ends = []
for i in range(1, 100+1):
    our_sample = df.sample(500, replace=False)
    bstrap_medians = bootstrap_sample(our_sample)
    my_median_data = {
        "Median": bstrap_medians
    }
    median_df = pd.DataFrame(my_median_data)
    left = median_df['Median'].quantile(0.025)
    right = median_df['Median'].quantile(0.975)
    left_ends.append(left)
    right_ends.append(right)
my_left_right = {
    "Left": left_ends,
    "Right": right_ends
}
left_right_df = pd.DataFrame(my_left_right)
left_right_df
good_experiments = left_right_df[(left_right_df["Left"] < pop_median) & (left_right_df["Right"] > pop_median)]
good_experiments
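The coverage rate quoted below is just the length of good_experiments divided by the number of replications. A self-contained sketch with tiny made-up intervals (not the real simulation output) shows the computation:

```python
import pandas as pd

pop_median = 110305  # the population median found above

# Made-up intervals standing in for left_right_df from the simulation.
left_right_df = pd.DataFrame({
    "Left":  [100000, 112000, 105000, 108000],
    "Right": [115000, 120000, 109000, 118000],
})

# An interval "captures" the parameter when it lies strictly inside.
good = left_right_df[(left_right_df["Left"] < pop_median) &
                     (left_right_df["Right"] > pop_median)]
coverage = len(good) / len(left_right_df)  # 2 of these 4 intervals -> 0.5
```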
for i in np.arange(100):
    left = left_right_df.at[i, "Left"]
    right = left_right_df.at[i, "Right"]
    plt.plot([left, right], [i, i], color='gold')
plt.plot([pop_median, pop_median], [0, 100], color='red', lw=2)
plt.xlabel('Median (dollars)')
plt.ylabel('Replication')
plt.title('Population Median and Intervals of Estimates')
plt.show()
In other words, this process of estimation captures the parameter about 92% of the time.