Chapter two of Allen Downey’s book focuses on descriptive statistics and distributions. He starts the chapter off by summarising mean and variance – before we get started with the example lets familiarise ourselves with the definitions.
the mean of a sample of items with values , is the sum of those items divided by the number of items .
The mean of a sample can sometimes be a good description of the set, but when the set fluctuates greatly the mean can be misleading. We can use the variance of the sample to calculate the spread of the items in the set.
As you can see the variance is the mean squared deviation, if we take the square root of the variance we get the standard deviation of the set.
Now that we have that out the way lets dive into the example.
Calculate the standard deviation of the gestation time for first and other babies. What does the comparison suggest about our statistical significance?
You will find a coded example of exercise 1 here
from chapter1.main.first import Summarize, MakeTables, ProcessTables
from chapter2.main.thinkstats import Mean, Var
def main():
data_dir = "../resources/"
table, firsts, others = MakeTables(data_dir)
ProcessTables(firsts, others)
mu1 = round(Mean(firsts.lengths),2)
mu2 = round(Mean(others.lengths),2)
var1 = round(Var(firsts.lengths, mu1),2)
var2 = round(Var(firsts.lengths, mu2),2)
print "firsts: %s +/- %s" % (mu1, var1)
print "others: %s +/- %s" % (mu2, var2)
if __name__ == '__main__':
main()
# Output
firsts: 38.6 +/- 7.79
others: 38.52 +/- 7.8
As you can see the standard deviation of the pregnancies gestation period is large and suggests that the first born children arrive later is not statistically significant. Of course we will ask more questions of the data before we come to a more formal conclusion.