Think Stats – Chapter 3 – Exercise 3

Now that we understand percentiles we can move on to the Cumulative Distribution Function (CDF). The CDF maps values to their percentile ranks. The CDF evaluated at a value {\bf x} is the fraction of values in the distribution less than or equal to {\bf x}.
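As a minimal sketch (a plain function, not the book's Cdf class), for a finite sample the CDF can be evaluated directly:

def Cdf(values, x):
    """Returns the fraction of the sample less than or equal to x."""
    count = sum(1 for v in values if v <= x)
    return float(count) / len(values)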

We can see the biased CDF of the relay race speeds below:

[Figure: observed_speeds_cdf_relay – CDF of the observed relay race speeds]

We can see that ~12 mph is at the 99th percentile, meaning that 99% of competitors had an average speed of less than 12 mph.

From the distribution above we can generate a new random sample based on the measured one. This is called resampling. The plot below shows the comparison between the measured and generated samples:

[Figure: observed_speeds_cdf_generated_relay – CDFs of the measured and resampled speeds]
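A minimal sketch of the resampling step, assuming the Percentile function from the Chapter 3 Example 2 post below: draw uniform percentile ranks and map them back to values through the CDF.

import random

def Resample(values, n):
    """Draws n new values by inverting the empirical CDF."""
    return [Percentile(values, random.uniform(0, 100)) for _ in range(n)]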

To reproduce the results above you can pull the code from GitHub.


Think Stats – Chapter 3 – Example 2

A percentile rank is the percentage of values in a distribution that fall at or below a given value. For example, if a student receives a percentile rank of 90% in an exam, they have done at least as well as 90% of the people who took it. Hence the student is in the top 10%.

One simple algorithm for calculating the percentile rank is shown below:


def PercentileRank(scores, your_score):
    """Computes the percentile rank relative to a sample of scores."""
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1

    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank

From this we can calculate the score that corresponds to a given percentile rank:

def Percentile(scores, percentile_rank):
    """Computes the value that corresponds to a given percentile rank."""
    scores.sort()
    for score in scores:
        if PercentileRank(scores, score) >= percentile_rank:
            return score

# Example

scores = [55, 66, 77, 88, 99]

print 'prank score'
for percentile_rank in [0, 20, 25, 40, 50, 60, 75, 80, 100]:
    print percentile_rank,
    print Percentile(scores, percentile_rank)

# Output
prank score
0 55
20 55
25 66
40 66
50 77
60 77
75 88
80 88
100 99
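For larger samples the loop above gets slow, because it calls PercentileRank for every score. A quicker variant (illustrative, not from the book) maps the rank straight to an index in the sorted scores:

import math

def Percentile2(scores, percentile_rank):
    """Maps a percentile rank directly to an index in the sorted scores."""
    scores = sorted(scores)
    index = int(math.ceil(percentile_rank * len(scores) / 100.0)) - 1
    return scores[max(index, 0)]

This reproduces the table above in O(n log n) rather than O(n^2).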

You can find the code above on GitHub.


Think Stats – Chapter 3 – Exercise 1

A biased pmf can misrepresent the data at hand. One common cause is oversampling.

An example of this can be seen in the class size example in Allen’s book. We are given binned data for class sizes in [5, 49]; this is the true (unbiased) data. If, however, we asked the students themselves, we would get a biased dataset: small classes have fewer students to ask and large classes have more, so the data depends on the class size itself.

In this example we can reproduce the biased pmf by scaling each bin by the class size and renormalising, thereby building the dependency on class size into the distribution.

For example:

Unbiased dataset

size     count
5-9      8
10-14    8
15-19    14
20-24    4
25-29    6
30-34    12
35-39    8
40-44    3
45-49    2

True values

pmf: mean 23.69
pmf: var 128.67

If we asked the students (biased):
biased_pmf: mean 29.12
biased_pmf: var 109.23
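A minimal sketch of the biasing step, assuming the pmf is stored as a plain dict keyed on the bin midpoints (7, 12, …, 47):

sizes = {7: 8, 12: 8, 17: 14, 22: 4, 27: 6, 32: 12, 37: 8, 42: 3, 47: 2}

total = float(sum(sizes.values()))
pmf = dict((x, n / total) for x, n in sizes.items())

def BiasPmf(pmf):
    """Scales each probability by its value and renormalises."""
    biased = dict((x, p * x) for x, p in pmf.items())
    norm = sum(biased.values())
    return dict((x, p / norm) for x, p in biased.items())

def Mean(pmf):
    return sum(x * p for x, p in pmf.items())

print Mean(pmf)           # ~23.69
print Mean(BiasPmf(pmf))  # ~29.12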

[Figure: class_size – unbiased and biased class size pmfs]

Another example of a bias can be seen in relay races, where the probability of passing a runner is proportional to the difference in speed.

This can be seen if we take the viewpoint of a competitor travelling at 7.5 mph and look at the sample of everyone else in the race. The biased pmf for this competitor will look like the one below:

[Figure: observed_speeds_relay_bias – biased pmf of observed speeds as seen by a 7.5 mph competitor]

You can see that with the biased pmf the probability of observing someone running at the same speed as the competitor is very small. This makes sense: the competitor will not be aware of the other competitors travelling at that speed, as he will neither pass nor be passed by them.

The unbiased pmf has a more normal shape as expected.

[Figure: observed_speeds_relay – unbiased pmf of observed speeds]

As always you can find the code for this example on GitHub.


Think Stats – Chapter 2 – Exercise 5

Conditional probabilities allow us to take a more focused look at certain conditions and how they affect the descriptive statistics.

For example, suppose a woman is still pregnant after 38 weeks: what is the chance the baby will be born in the next week? We now know the baby didn’t arrive in the first [0, 38] weeks, so how does this affect the probability of giving birth?

To calculate the new probability of the baby arriving in week 39 we take the original pmf, remove the items for weeks [0, 38], renormalise, and then evaluate the new pmf at 39 weeks.
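A minimal sketch of that conditioning step, assuming the pmf is a plain dict mapping gestation week to probability (a hypothetical helper, not the book's Pmf class):

def ConditionalProb(pmf, week):
    """P(birth in `week` | still pregnant at the start of `week`)."""
    # keep only the weeks that are still possible and renormalise
    remaining = dict((w, p) for w, p in pmf.items() if w >= week)
    norm = sum(remaining.values())
    return remaining.get(week, 0.0) / norm

For example, ConditionalProb(pmf, 39) gives the probability of a week 39 birth given the baby did not arrive in weeks [0, 38].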

Let's run the algorithm above and see the differences for first borns:

Risks:
ProbEarly first babies 0.00
ProbEarly others 0.00
ProbOnTime first babies 0.79
ProbOnTime others 0.88
ProbLate first babies 0.21
ProbLate others 0.12

Risk ratios (first babies / others):
ProbOnTime 0.91
ProbLate 1.67

We can see that the relative risk differs only slightly from the one computed with unconditional probabilities. The effect is in the region of ~1% and is probably a statistical fluctuation more than anything else. However, you can see that the individual probabilities have increased far more dramatically, as one would intuitively expect.

Below I have plotted the conditional probability for each week in [35, 46]. For example, if a woman is still pregnant after week 34, what is the probability of her giving birth in week 35, and so on.

[Figure: conditional – conditional probability of birth for each week]

Again, to reproduce these results you can pull the code from GitHub.


Think Stats – Chapter 2 – Exercise 4

Let's get back to the question at hand…

Do first babies arrive late?

Now let's break this down into different groups – a little more detail never hurt anyone.

In this example a baby born at 37 weeks or earlier is considered early, a baby born between weeks [38, 40] is on time, and any later birth is late. These ranges are also called bins. From this we can now calculate the relative risk, which is the ratio of two probabilities.

The probabilities for each bin come out as:

ProbEarly first babies 0.18
ProbEarly others 0.17
ProbOnTime first babies 0.66
ProbOnTime others 0.74
ProbLate first babies 0.16
ProbLate others 0.09

then we can calculate the relative risks:

Risk ratios (first babies / others):
ProbEarly 1.08
ProbOnTime 0.90
ProbLate 1.66

What's this saying…

We can conclude from the relative risks that first babies are 8% more likely to be early and 66% more likely to be late than other babies, while they are 10% less likely to arrive on time.
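As a quick sketch, the risk ratios can be reproduced from the probabilities above (since those are rounded to two decimal places, the ratios come out slightly different from the full-precision values quoted):

firsts = {'early': 0.18, 'ontime': 0.66, 'late': 0.16}
others = {'early': 0.17, 'ontime': 0.74, 'late': 0.09}

for key in ('early', 'ontime', 'late'):
    # ratios from rounded inputs differ slightly from those quoted above
    print key, round(firsts[key] / others[key], 2)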

If you would like to reproduce these results simply pull the code from GitHub.


Think Stats – Chapter 2 – Exercise 3

In this exercise we will look at how we can use different plots to aid our analysis of the gestation period for live births.

In the last blog we represented the data as a probability mass function (pmf). Now we will look at more visualisation techniques to better aid our analysis.

As usual you will find the code for this exercise on GitHub.

Let's start by looking at the difference between representing the data as a histogram and as a pmf.

[Figures: nsfg_hist and nsfg_pmf – histogram and pmf of the gestation periods]

As you can see the shapes of the histogram and the pmf are comparable, as expected (thank goodness for that). The pmf is of course normalised by the total frequency, so the sum over the pmf equals 1. The description of both plots given in the previous post still holds, so we will leave this visualisation at that.

Now that we have an idea of what is going on with the distributions, let's focus our attention on the week-by-week difference between first and other live births. I will represent this difference visually by taking the difference of the two pmfs, multiplying it by 100 and plotting it using our friend matplotlib!
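A minimal sketch of that difference plot, assuming first_pmf and other_pmf are plain dicts mapping week to probability:

import matplotlib.pyplot as plt

def PlotDiffs(first_pmf, other_pmf, low=35, high=46):
    """Plots 100 * (first - other) for each gestation week."""
    weeks = range(low, high + 1)
    diffs = [100.0 * (first_pmf.get(w, 0) - other_pmf.get(w, 0)) for w in weeks]
    plt.bar(weeks, diffs)
    plt.xlabel('weeks')
    plt.ylabel('difference in percentage points')
    plt.show()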

[Figure: nsfg_diffs – week-by-week difference between the first-born and other pmfs]

The plot above shows the relationship between first borns and others. We can see that others are ~6 percentage points more likely to be born in week 39 than first borns, while after 41 weeks first borns become the more likely ones. Of course we have not considered the statistical errors on these bins, nor the systematic uncertainties in how the data was recorded.


Think Stats – Chapter 2 – Exercise 2

In this exercise we will be plotting the probability mass function (pmf) of the gestation periods for first and other babies.

The pmf gives the probability of a birth occurring in a given week. A pmf describes a discrete random variable, whereas the probability density function (pdf) describes a continuous one.

A pmf is normalised, so that summing it over the set of possible values {\bf A} gives one:

\sum_{{\bf x} \in {\bf A}} {\bf f_X}({\bf x}) = {\bf 1}

We can replicate the normalisation above in Python by counting occurrences and dividing by the total number of values:

t = [1, 2, 2, 3, 5]
d = {}
for x in t:
    d[x] = d.get(x, 0) + 1

total = len(t)
factor = 1.0 / total
for x in d:
    d[x] *= factor

Now let's represent our dataset in the form of a pmf and plot it using matplotlib.

[Figure: first_and_others_gestation_periods – pmfs of the gestation period for first and other babies]

firsts: 38.60 +/- 7.79 : mode 39.00
others: 38.52 +/- 6.84 : mode 39.00

In the distribution above you can clearly see that the most common value is 39.00 weeks; this value is called the mode. Around the mode the data is asymmetric: it drops off quickly after 40 weeks, which can be explained by the fact that doctors usually intervene after 42 weeks. Values far from the mode are called outliers, and some of these are due to errors in taking or reporting the data. When we remove them you can clearly see the effect on the distribution's mean and variance.
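A minimal sketch of the trimming step, dropping a fraction p off each tail of the sorted data (the thinkstats module used in this repo provides a similar helper; this version is illustrative):

def Trim(t, p=0.01):
    """Returns a sorted copy of t with the p fraction cut off each end."""
    n = int(p * len(t))
    t = sorted(t)
    return t[n:len(t) - n]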

[Figure: trimmed_first_and_others_gestation_periods – pmfs after trimming the outliers]

Trimmed firsts: 38.68 +/- 5.45 : mode 39.00
Trimmed others: 38.61 +/- 4.22 : mode 39.00

You can find the code for producing the pmfs and plots here.


Think Stats – Chapter 2 – Exercise 1

Chapter two of Allen Downey’s book focuses on descriptive statistics and distributions. He starts the chapter off by summarising the mean and variance – before we get started with the example let's familiarise ourselves with the definitions.

The mean of a sample of {\bf n} items with values {\bf x_i} is the sum of those values divided by the number of items {\bf n}:

{\bf \mu} = \frac{\bf 1}{\bf n}\sum{\bf x_i}

The mean of a sample can sometimes be a good description of the set, but when the values fluctuate greatly the mean can be misleading. We can use the variance of the sample to quantify the spread of the items in the set:

{\bf \sigma^2} = \frac{\bf 1}{\bf n}\sum{({\bf x_i}-{\bf \mu})^2}

As you can see the variance is the mean squared deviation; if we take the square root of the variance we get the standard deviation of the set.
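The code below imports Mean and Var from a thinkstats module; a direct translation of the two formulas might look like this sketch:

def Mean(t):
    """Arithmetic mean of a sequence of values."""
    return float(sum(t)) / len(t)

def Var(t, mu=None):
    """Mean squared deviation from mu (defaults to the sample mean)."""
    if mu is None:
        mu = Mean(t)
    return sum((x - mu) ** 2 for x in t) / float(len(t))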

Now that we have that out of the way, let's dive into the example.

Calculate the standard deviation of the gestation time for first and other babies. What does the comparison suggest about the statistical significance of the difference?

You will find a coded example of exercise 1 here.

from chapter1.main.first import Summarize, MakeTables, ProcessTables
from chapter2.main.thinkstats import Mean, Var

def main():
    data_dir = "../resources/"
    table, firsts, others = MakeTables(data_dir)
    ProcessTables(firsts, others)

    mu1 = round(Mean(firsts.lengths), 2)
    mu2 = round(Mean(others.lengths), 2)

    var1 = round(Var(firsts.lengths, mu1), 2)
    var2 = round(Var(others.lengths, mu2), 2)

    print "firsts: %s +/- %s" % (mu1, var1)
    print "others: %s +/- %s" % (mu2, var2)

if __name__ == '__main__':
    main()

# Output
firsts: 38.6 +/- 7.79
others: 38.52 +/- 6.84

As you can see the variance of the gestation periods is large compared with the difference between the means, which suggests that the claim that first-born children arrive later is not statistically significant. Of course we will ask more questions of the data before we come to a more formal conclusion.


Think Stats – Chapter 1 – Exercise 2

Now that we can read the data, let's use some summary statistics to quantify it.

Here we will look at the mean, also referred to by some as the average. A simple definition of the mean is as follows:

Given a sample of {\bf n} values {\bf x_i}, the mean {\bf \mu} is the sum of all the values divided by the number of values {\bf n}: {\bf \mu} = \frac{{\bf 1}}{{\bf n}} \sum{{\bf x_i}}

We will now calculate the mean gestation period for first live births and other live births from the dataset and look at the difference between their means.

First run exercise2.py in the chapter1 directory:

python exercise2.py

Next have a look at the code snippet of what is getting executed – and also dive into the code to take a deeper look.

 """Prints summary statistics for first babies and others.

Returns:
tuple of Tables
"""
table, firsts, others = MakeTables(data_dir)
ProcessTables(firsts, others)

print 'Number of first babies', firsts.n
print 'Number of others', others.n

mu1, mu2 = firsts.mu, others.mu

print 'Mean gestation in weeks:'
print 'First babies', mu1
print 'Others', mu2

print 'Difference in hours', (mu1 - mu2) * 7.0 * 24

# Output
Number of first babies 4413
Number of others 4735
Mean gestation in weeks:
First babies 38.6009517335
Others 38.5229144667
Difference in hours 13.1102608186

Note:
I leave it up to the reader to look through the code. I am happy however to explain the code if needed – just ask.

As we can see, the gestation period for first babies is on average ~13 hours longer than for others.
Unfortunately we cannot simply declare that statistically significant without careful consideration of other effects.
For now we can call this difference an apparent effect in the data, as we are not too sure what's going on yet.
We still need to ask some more questions of the data to have better confidence in our result; this is what we will discuss in the coming exercises.


Think Stats – Chapter 1 – Exercise 1

Do First Babies Arrive Late?

This is the question posed by Allen on the second page of the book. We use data from the National Survey of Family Growth (NSFG), which gathers information on family life.

We call this data type cross-sectional, as it represents a group of people at a single point in time. Longitudinal data, in contrast, observes a group repeatedly over a period of time.

The data can be found in the resources directory on GitHub.

The first exercise is to test that we can retrieve the correct number of respondents and pregnancies reported from the data.

A respondent is someone who participates in a survey. A group of respondents is called a cohort.

Let's run exercise1.py from the chapter1 directory:

python exercise1.py

Number of respondents 7643
Number of pregnancies 13593
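Under the hood the counts amount to something like the sketch below. I am assuming the survey module that ships with Allen's code, where the Respondents and Pregnancies tables read the fixed-width NSFG files into a records list; treat the exact API as an assumption.

import survey

# read both NSFG tables from the resources directory (path is an assumption)
resps = survey.Respondents()
resps.ReadRecords('../resources/')

pregs = survey.Pregnancies()
pregs.ReadRecords('../resources/')

print 'Number of respondents', len(resps.records)
print 'Number of pregnancies', len(pregs.records)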