Download (right-click, save target as ...) this page as a jupyterlab notebook Lab18
LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 18 - In-Lab
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
inspired by :
*hint: According to UN estimates, just over 60 million (60,449,841) people live in Italy.
For the first example in this lab, we are going to look at a problem from two perspectives, or two "modes" if you will:
The GOD mode and the MAN mode.
In GOD mode, we assume that we know EVERYTHING about our population (in this case, the population of Italy).
Suppose we know (theoretically) the exact percentage of people in Italy who love soccer, and it is 75%.
totalpop = 60*10**6 # Total adult population of Italy (60M)
fbl_p = 0.75 #fraction of the population that loves soccer|football!
fblpop = int(totalpop * fbl_p) #Population of those who love football
nfblpop = int(totalpop * (1-fbl_p)) #Population of those who don't love football
import numpy as np
fblpop_1 = np.ones(fblpop) #An array of "1"s | its length is equal to the population of those who love football | DO NOT ATTEMPT TO PRINT!!!
nfblpop_0 = np.zeros(nfblpop) #An array of "0"s | its length is equal to the population of those who don't love football | DO NOT ATTEMPT TO PRINT!!!
totpop_01 = np.hstack([fblpop_1,nfblpop_0]) #An array of "0 & 1"s | its length is equal to the total population of Italy | DO NOT ATTEMPT TO PRINT!!!
print(np.mean(totpop_01))
np.mean(np.random.choice(totpop_01, size=1000)) # Run multiple times
# Let's do it in a more sophisticated/engineery/data sciency way!
for i in range(10): #Let's take 10 samples
    sample = np.random.choice(totpop_01, size=1000)
    print('Sample', i, ':', np.mean(sample))
You can see that we get a slightly different value for each sample, but intuition (and statistics theory) says that the average of a large number of samples should be very close to the real percentage. Let's do that! Let's take many samples and see what happens:
values = [] #Create an empty list
for i in range(10000): #Let's take 10000 samples
    sample = np.random.choice(totpop_01, size=1000) #Notice that the sample size is not changing
    mean = np.mean(sample)
    values.append(mean) #Store the mean of each sample set
print(np.mean(values)) #Printing the mean of means!
values = np.array(values)
print(values.std()) #Printing the standard deviation of means!
We created 10000 samples, computed the percentage of people who love soccer in each sample, and then averaged those percentages. We got about 74.99%, which is very close to the real value of 75% that we, in GOD mode, knew all along!
Let’s plot a histogram of all the values we got in all the samples. Interestingly, this histogram is very similar to the normal distribution!
import seaborn as sns # a nice package, with useful tools
sns.distplot(values,color='purple', rug=True,kde=True); # the semi-colon suppresses the object address, but still shows the plot
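As a quick check, the Central Limit Theorem predicts that the standard deviation of these sample means should be about sqrt(p*(1-p)/n); the short block below (a minimal sketch) compares that prediction with the spread we just simulated.
# Compare the simulated spread of the sample means to the CLT prediction
p, n = 0.75, 1000                           # true proportion (GOD mode) and sample size
theoretical_se = np.sqrt(p * (1 - p) / n)   # predicted standard deviation of the sample means
print("Theoretical standard error:", theoretical_se)
print("Simulated std of the means:", values.std())  # 'values' array from the cells above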
If we do this process a very large number of times (approaching an infinite number of times), we will get a histogram that is very close to the normal distribution, and we can estimate the parameters of this distribution. (The next code block takes a looong time to run, ~minutes.)
values = [] #Create an empty list
for i in range(1000000): #Let's take 1000000 samples
    sample = np.random.choice(totpop_01, size=1000) #Notice that the sample size is not changing
    mean = np.mean(sample)
    values.append(mean) #Store the mean of each sample set
    if i%100000 == 0: # a little printing to keep us awake
        print(i," samples")
print(np.mean(values)) #Printing the mean of means!
import seaborn as sns
sns.distplot(values,color='purple', rug=True,kde=True);
First of all, we can see that the center (the mean) of the histogram is near 75%, exactly as we expected. But we can say much more just by looking at the histogram: for example, half of the samples are larger than 75%, roughly 25% are larger than 76%, and almost 95% of the samples are between 72% and 78%. Let's also have a look at the boxplot:
import matplotlib.pyplot as plt
fig = plt.figure(figsize =(10, 7))
plt.boxplot(values, notch=1, sym='')  # notched boxplot; sym='' hides the outlier symbols
plt.show()
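We can also check those percentages directly against the simulated sample means (a minimal sketch; it only counts entries of the values list we already built):
vals = np.array(values)  # the simulated sample means from above
print("Fraction of samples above 75%:", np.mean(vals > 0.75))
print("Fraction of samples above 76%:", np.mean(vals > 0.76))
print("Fraction between 72% and 78%:", np.mean((vals >= 0.72) & (vals <= 0.78)))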
At this point, many people might ask two important questions: "How can I take an infinite number of samples?" and "How does it help me?"
The answer to the first one is that if you are GOD, there is no stopping you! If you are not GOD, you can't!
To answer the second question, let's go back to our example. We initially took one sample of 1000 people and got a value close to 75%, but not exactly 75%. We wanted to know: what is the chance that a random sample of 1000 people will have 73% (or fewer) soccer lovers? Using the information above, we can say that there is a chance of roughly 7% that we'll get a value smaller than or equal to 73%.
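Here is a minimal sketch that checks this chance two ways: by counting the simulated sample means directly, and with the normal approximation (the sketch assumes scipy is installed; scipy is not used elsewhere in this lab).
from scipy import stats                 # assumed available; only needed for the normal model
vals = np.array(values)                 # simulated sample means from above
se = np.sqrt(0.75 * 0.25 / 1000)        # standard error of a proportion, sqrt(p(1-p)/n)
print("Empirical P(sample mean <= 0.73):", np.mean(vals <= 0.73))
print("Normal-model P(sample mean <= 0.73):", stats.norm.cdf(0.73, loc=0.75, scale=se))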
We don’t actually need to do the infinite samples. In other words, you don't have to be GOD! You will know why and how in a few moments...
Back in our horrid and miserable MAN mode, we don’t know the actual percentage of soccer lovers in Italy. In fact, we know nothing about the population.
We do know one thing though: We just took a sample and got 73%. But how does it help us?
What we also DO know is that if we took an infinite number of samples, the distribution of their means would look something like this:
Here μ is the population mean (real percentage of soccer lovers in our example), and σ is the standard deviation of the population.
If we know this (and we know the standard deviation), we are able to say that ~68% of the samples will have a mean that falls in the red area, or that more than 95% of the samples will have a mean that falls outside the green area (in the middle) of this plot:
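In symbols (a minimal restatement of the rule the plot illustrates, via the Central Limit Theorem), the sample mean is approximately normally distributed:

$$\bar{X} \approx N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right), \qquad P\!\left(\mu-\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu+\frac{\sigma}{\sqrt{n}}\right)\approx 0.68, \qquad P\!\left(\mu-\frac{2\sigma}{\sqrt{n}} \le \bar{X} \le \mu+\frac{2\sigma}{\sqrt{n}}\right)\approx 0.95$$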
This is where the concept of margin of error becomes of great importance:
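Concretely, the margin of error is the half-width of that middle region (a minimal sketch of the formula used in the worked example later in this lab):

$$\text{margin of error} = z \cdot \frac{\sigma}{\sqrt{n}}, \qquad \text{with } z \approx 1.96 \text{ for } 95\% \text{ confidence}$$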
Let's mix the GOD mode and the MAN mode. LET'S DO MAD MODE!
Of course the distance is symmetric: if the sample percentage falls, 95% of the time, between the real percentage - 3 and the real percentage + 3, then the real percentage will, 95% of the time, fall between the sample percentage - 3 and the sample percentage + 3.
import seaborn as sns
sns.distplot(values,color='purple', rug=True,kde=True);
If we took a sample and got 73%, we can say that we are 95% confident that the real percentage is between 70% (73 - 3) and 76% (73 + 3). This is a Confidence Interval Estimate; the interval is 73 ± 3 and the confidence level is 95% (or, the Type-I error probability is 5%).
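The ± 3 above can be reproduced with the same margin-of-error idea, using the sample percentage in place of the unknown population value (a minimal sketch; the lab rounds the result to 3 percentage points):
import math
phat, n, z = 0.73, 1000, 1.96                   # sample proportion, sample size, z for 95% confidence
margin = z * math.sqrt(phat * (1 - phat) / n)   # margin of error for a proportion
print('Margin of error for the soccer sample:', margin)  # about 0.0275, i.e. roughly 3 percentage points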
From a normally distributed population, we randomly took a sample of 500 students with a mean score of 461 on the math section of the SAT. Suppose the standard deviation of the population is 100. What is the 95% confidence interval estimate of the true population mean?
# Step 1- Organize the data
n = 500 #Sample size
Xbar = 461 #Sample mean
C = 0.95 #Confidence level
std = 100 #Standard deviation (σ)
z = 1.96 #The z value associated with 95% Confidence Interval
# Assuming a normally distributed population
# Assuming randomly selected samples
# Step2- Calculate the margin of error
import math
margin = z*(std/math.sqrt(n))
print('The margin of error is equal to : ',margin)
# Step3- Find the estimated true population mean for the 95% confidence interval
# To find the range of values, just add and subtract the margin of error (about 8.765) to/from the sample mean of 461
low = Xbar-margin
high = Xbar+margin
print('the true population mean will be captured within the confidence interval of (',low,' , ',high,') and the confidence is 95%')
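Instead of memorizing z values, you can look them up; here is a minimal sketch that assumes scipy is available (it is not imported elsewhere in this lab) and prints the z values you will need below:
from scipy import stats
for C in (0.90, 0.95):                   # confidence levels used in this lab
    z = stats.norm.ppf(1 - (1 - C) / 2)  # two-sided critical value for confidence level C
    print('Confidence level', C, '-> z =', round(z, 3))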
From a normally distributed population, we randomly took a sample of 200 dogs with a mean weight of 70 pounds. Suppose the standard deviation of the population is 20:
# Step 1- Organize the data
# Step2- Calculate the margin of error
# Step3- Find the estimated true population mean for the 90% and 95% confidence intervals