import pandas as pd
This data set lists the individual observations for 934 children in 205 families on which Galton (1886) based his cross-tabulation.
In addition to the relation between the heights of parents and their offspring, for which this data set is mainly famous, Galton had another question that the data in this form can address: does marriage selection indicate a relationship between the heights of husbands and wives, a topic he called assortative mating? Keen p. 297-298 provides a brief discussion of this topic.
family
: family ID, a factor with levels 001-204
father
: height of father
mother
: height of mother
midparentHeight
: mid-parent height, calculated as (father + 1.08*mother)/2
children
: number of children in this family
childNum
: number of this child within family. Children are listed in decreasing order of height for boys followed by girls
gender
: child gender, a factor with levels female and male
childHeight
: height of child
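As a small worked example of the midparentHeight formula above, the sketch below applies it to one made-up pair of parents. The function name `midparent_height` and the input heights are hypothetical, chosen only for illustration; they are not taken from the data set.

```python
def midparent_height(father: float, mother: float) -> float:
    """Mid-parent height as described above: (father + 1.08*mother)/2."""
    return (father + 1.08 * mother) / 2


# Hypothetical parents, heights in inches (not rows from the data set)
example = midparent_height(70.0, 65.0)
print(round(example, 2))
```

The factor 1.08 scales the mother's height up before averaging, so the two parents contribute on a comparable scale.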
df = pd.read_csv("galton.csv")
df
Given a midparentHeight, the predicted child height is the average height of the children whose midparent height is similar (here: within 0.5 inches).
Step 1: For the given midparentHeight, find the children whose midparentHeight is within 0.5 inches of it.
Step 2: Compute the mean height of those children. That mean is the predicted child height.
def predictChildHeight(midparentHeight):
    # Children whose midparent height is within 0.5 inches of the given value
    similar_ones = df[(df["midparentHeight"] >= midparentHeight-0.5) & (df["midparentHeight"] <= midparentHeight+0.5)]
    return similar_ones["childHeight"].mean()
predictChildHeight(75.43)
prediction = []
for index, row in df.iterrows():
    predicted_height = predictChildHeight(row["midparentHeight"])
    prediction.append(predicted_height)
df["Prediction"] = prediction
df
ax1 = df.plot.scatter(x="midparentHeight", y="childHeight", label="Real")
df.plot.scatter(x="midparentHeight", y="Prediction", ax=ax1, color="red", label="Prediction")
df["midparentHeight_su"] = (df["midparentHeight"] - df["midparentHeight"].mean()) / df["midparentHeight"].std()
df["childHeight_su"] = (df["childHeight"] - df["childHeight"].mean()) / df["childHeight"].std()
df["midparentHeight"].std()
1 SD of midparentHeight is about 1.8 inches; hence 0.5 inches is 0.5 / 1.8 ≈ 0.28 SDs.
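The conversion above can also be computed from the data instead of being typed in by hand. The following is a minimal sketch using only the standard library; the helper name `window_in_sd` and the toy heights are assumptions for illustration. `statistics.stdev` is the sample standard deviation, which matches the default of pandas `.std()`.

```python
import statistics


def window_in_sd(values, window_inches=0.5):
    """Express a window given in inches as a number of standard deviations."""
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return window_inches / sd


# Toy heights in inches (illustrative only, not the Galton data)
toy_heights = [64, 66, 68, 70, 72]
print(window_in_sd(toy_heights))

# The arithmetic from the text: a 0.5 inch window with an SD of 1.8 inches
print(round(0.5 / 1.8, 2))
```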
def predictChildHeight_SU(midparentHeightSU):
    # Same nearest-neighbour average as before, with the window expressed in standard units
    similar_ones = df[(df["midparentHeight_su"] >= midparentHeightSU-0.28) & (df["midparentHeight_su"] <= midparentHeightSU+0.28)]
    return similar_ones["childHeight_su"].mean()
prediction = []
for index, row in df.iterrows():
    predicted_height = predictChildHeight_SU(row["midparentHeight_su"])
    prediction.append(predicted_height)
df["Prediction_SU"] = prediction
df
ax1 = df.plot.scatter(x="midparentHeight_su", y="childHeight_su", label="Real")
df.plot.scatter(x="midparentHeight_su", y="Prediction_SU", ax=ax1, color="red", label="Prediction")
Tall parents have children who are not quite as exceptionally tall, on average. Galton called this regression to mediocrity.
Exceptionally short parents had children who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.
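The regression effect can be illustrated with simulated data. The sketch below does not use the Galton data: it draws two variables in standard units with an assumed correlation of 0.32 (an illustrative value) and compares the averages within the group whose x is far above average. The cutoff of 1.5 SDs and all variable names are likewise assumptions.

```python
import random
import statistics

random.seed(0)

r = 0.32   # assumed correlation, for illustration only
n = 20000

# x and y are both in standard units and have correlation close to r
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [r * x + (1 - r ** 2) ** 0.5 * random.gauss(0, 1) for x in xs]

# Keep only the points whose x is exceptionally high (more than 1.5 SDs above average)
tall = [(x, y) for x, y in zip(xs, ys) if x > 1.5]
mean_x = statistics.fmean(x for x, _ in tall)
mean_y = statistics.fmean(y for _, y in tall)

# mean_y stays above average but is noticeably closer to 0 than mean_x
print(mean_x, mean_y)
```

In this group the average y is still positive, yet much smaller than the average x, which is the pattern described above.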
df.corr(numeric_only=True)  # correlate numeric columns only; gender is categorical
import matplotlib.pyplot as plt
plt.scatter(x=df["midparentHeight_su"], y=df["childHeight_su"], label="Original")
plt.scatter(x=df["midparentHeight_su"], y=df["Prediction_SU"], color="red", zorder=1, label="Prediction")
plt.plot([-3, 3], [-3*0.320950, 3*0.320950], color="yellow", lw=3, zorder=3, label="Regression line")
plt.legend()
plt.show()
Regression Line
In regression, we use the value of one variable (which we will call x) to predict the value of another (which we will call y). When the variables x and y are measured in standard units, the regression line for predicting y based on x has slope r, the correlation coefficient of x and y, and passes through the origin.
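The claim about the slope can be checked numerically. The following is a minimal sketch on synthetic data (not the Galton data): it converts both variables to standard units, computes r as the mean of the products of the standard units, and fits a least-squares line. The helper names `standard_units` and `least_squares_line` and the parameters of the synthetic data are assumptions made for this illustration.

```python
import random
import statistics


def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def least_squares_line(x, y):
    """Slope and intercept of the least-squares line for predicting y from x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic heights in inches (illustrative only)
random.seed(1)
parents = [random.gauss(69, 1.8) for _ in range(500)]
children = [67 + 0.6 * (p - 69) + random.gauss(0, 3.3) for p in parents]

x_su = standard_units(parents)
y_su = standard_units(children)

# r is the mean of the products of the two variables in standard units
r = statistics.fmean(a * b for a, b in zip(x_su, y_su))
slope, intercept = least_squares_line(x_su, y_su)

# slope agrees with r and intercept is 0, up to floating-point error
print(r, slope, intercept)
```

The same agreement holds if both variables are standardized with the sample SD, as pandas `.std()` does by default, because both axes are rescaled by the same factor.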