Hypothesis Testing and Power Calculations
=========================================
One of things that R is used for is to perform simple testing and power
calculations using *canned* functions. These functions are very simple
to run; beign able to use and interpret them correctly is the hard part.
What is covered in this section
-------------------------------
1. Simple summary statistics
2. Functions dealing with probability distributions
3. Hypothesis testing
4. Power / Sample size calculations
5. Tabulating data
6. Using simulations to calculate power (use of ``for`` and ``if``)
7. Customization of graphical plots
Set random see
--------------
.. code:: python
set.seed(123)
Estimation
----------
Often we want to estimate functions of data - for example, the mean or
median of a set of numbers.
.. code:: python
# First create a sample of 10 numbers from a normal distribution with mean 0 and sd 1
(x <- rnorm(10))
.. raw:: html
- 1.2369408226151
- 0.465511274432508
- 0.070684445387072
- 1.42149413805108
- -1.15858605831964
- 0.460407001136643
- -0.625685808667741
- 0.313346968415322
- -1.24274709077289
- -0.945266218602314
.. code:: python
mean(x)
.. raw:: html
0.0336346830742124
.. code:: python
median(x)
.. raw:: html
0.16910863599832
.. code:: python
sd(x)
.. raw:: html
1.21633299330955
.. code:: python
var(x)
.. raw:: html
1.47946595061337
Hypothesis testing
------------------
Asssume theroy covered in morning statistics lecutres.
Example: Comparing group means
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The t-test
^^^^^^^^^^
.. code:: python
x <- rnorm(mean=5.5, sd=1.5, n=10)
y <- rnorm(mean=5, sd=1.5, n=10)
result <- t.test(x, y, alternative = "two.sided", paired = FALSE, var.equal = TRUE)
result
.. parsed-literal::
Two Sample t-test
data: x and y
t = 0.4472, df = 18, p-value = 0.6601
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.105854 1.703865
sample estimates:
mean of x mean of y
5.611938 5.312933
What is going on and what does all this actually mean?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The variable ``result`` is a list (similar to a hashmap or dictionary in
other languages), and named parts can be extracte with the ``$`` method
- e.g. ``result$p.value`` gives the p-value.
The test firs cacluates a number called a :math:`t` statistic that is
given by some function of the data. From theoretical anaysis, we know
that if the null hypothesis and assumptions of equal vairance and normal
distributiono of the data are correct, then the :math:`t` random
variable will have a :math:`t` distribution with 18 degrees of freedom
(20 data points minus 2 estimated means).
The formula fot calcualting the :math:`t`-statitic is
.. math::
t = \frac{(\bar{x_1} - \bar{x_2})}{\text{se}}
where :math:`\bar{x_1}` and :math:`\bar{x_2}` are the sample means, and
se (standard error) is
.. math::
\text{se} = \sqrt{s_1^2/n_1 + s_2^2/n_2}
where :math:`s_1^2` and :math:`s_2^2` are the sample variances.
We will calculate all these values to show what goes on in the sausage
factory.
.. code:: python
# These are custom R functions which we briefly showed you how to write in yesterday's session.
se <- function(x1, x2) {
n1 <- length(x1)
n2 <- length(x2)
v1 <- var(x1)
v2 <- var(x2)
sqrt(v1/n1 + v2/n2)
}
.. code:: python
t <- function(x1, x2) {
(mean(x1) - mean(x2))/se(x1, x2)
}
.. code:: python
t(x, y)
.. raw:: html
0.447153048434479
Now we will make use of our knowledge of probability distributions to
understan what the p-value is.
We will plot the PDF of the t-distribution with df=18. The x-axis is the
value of the t-statistic and the y-axis is the density (you can think of
the density as the height of a histogram with total area normalized to
sum to 1). The red lines are at the 2.5th and 97.5th quantiles - so for
a two-sided test, the t-statistic must be more extreme than these two
red lines (i.e. to the right of the 97.5th quantile or to the left of
the 2.5th quantile) to reject the null hypothesis. We sse that our
:math:`t` statisitc (and its symmeetric negative) in dashed green do not
meet this requirement - hence the p-value is > 0.05.
.. code:: python
xp <- seq(-5, 5, length.out = 100)
plot(xp, dt(xp, df=18), type="l", col="blue")
score <- t(x, y)
abline(v=c(score, -score), col="green", lty=2)
thresholds <- c(qt(0.025, df=18), qt(0.975, df=18))
abline(v=thresholds, col="red", lty=1)
.. image:: Hypothesistestingandpowercalcuations_files/Hypothesistestingandpowercalcuations_21_0.png
:width: 600
The p-value is the areau under the curve more extreme than the green
lines. Since the t-statisic is positive, we can find the area to its
right as one minus the cumulative density up to the value of the
t-statitic. Doubling this (why?) gives us the p-value.
.. code:: python
2*(1 - pt(score, df=18))
.. raw:: html
0.660098549618083
Note that this agrees with the value given by ``t.test``.
The ranksum test
^^^^^^^^^^^^^^^^
This calculates a test statistic :math:`W` that is the sum of the
outcomes for all pairwise comparisons of the ranks of the values in
:math:`x` and :math:`y`. Outcomes are 1 if the first item > the second
item, 0.5 for a tie and 0 otherwise. This can be simplified to the
follwoing formula
.. math::
W = R_1 - \frac{n_1(n_1 + 1)}{2}
where :math:`R_1` is the sum of ranks for the values in :math:`x`. For
large samples, :math:`W` can be considered to come from the normal
distribution with means aand standard deviations that can be calculated
from the data (look up by Googling or any mathematical statitics
textbook if interested).
.. code:: python
wilcox.test(x, y)
.. parsed-literal::
Wilcoxon rank sum test
data: x and y
W = 53, p-value = 0.8534
alternative hypothesis: true location shift is not equal to 0
Explicit calculation of statistic
---------------------------------
.. code:: python
w <- function(x, y) {
r <- rank(c(x, y))
nx <- length(x)
sum(r[1:nx]) - nx*(nx+1)/2
}
.. code:: python
w(x, y)
.. raw:: html
53
Effect of sample size
~~~~~~~~~~~~~~~~~~~~~
.. code:: python
x <- rnorm(mean=5.5, sd=1.5, n=200)
y <- rnorm(mean=5, sd=1.5, n=200)
t.test(x, y, )
.. parsed-literal::
Welch Two Sample t-test
data: x and y
t = 3.1236, df = 397.539, p-value = 0.001917
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1676522 0.7370648
sample estimates:
mean of x mean of y
5.474823 5.022465
.. code:: python
wilcox.test(x, y)
.. parsed-literal::
Wilcoxon rank sum test with continuity correction
data: x and y
W = 23106, p-value = 0.007229
alternative hypothesis: true location shift is not equal to 0
Work!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supppose we measure the weights of 100 people before and after a
marathon. We want to know if there is a difference. What is the p-value
for an appropirate parametric and non-parametric test?
.. code:: python
before <- rnorm(n=100, mean=165, sd=25)
change <- rnorm(n=100, mean=-5, sd=10)
after <- before + change
.. code:: python
# Parametric test
.. code:: python
# Non-parametric test
Example: Comparing proportions
------------------------------
One sample
~~~~~~~~~~
.. code:: python
x <- sample(c('H', 'T'), 50, replace=TRUE)
table(x)
.. parsed-literal::
x
H T
26 24
.. code:: python
binom.test(table(x), p = 0.5)
.. parsed-literal::
Exact binomial test
data: table(x)
number of successes = 26, number of trials = 50, p-value = 0.8877
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3741519 0.6633949
sample estimates:
probability of success
0.52
Two samples
~~~~~~~~~~~
.. code:: python
size = 50
x <- rbinom(n=1, prob=0.3, size=size)
y <- rbinom(n=1, prob=0.35, size=size)
.. code:: python
(m <- matrix(c(x, y, size-x, size-y), ncol = 2))
.. raw:: html
.. code:: python
prop.test(m)
.. parsed-literal::
2-sample test for equality of proportions with continuity correction
data: m
X-squared = 1.2994, df = 1, p-value = 0.2543
alternative hypothesis: two.sided
95 percent confidence interval:
-0.31032527 0.07032527
sample estimates:
prop 1 prop 2
0.20 0.32
Effect of sample size
^^^^^^^^^^^^^^^^^^^^^
.. code:: python
size <- 500
x <- rbinom(n=1, prob=0.3, size=size)
y <- rbinom(n=1, prob=0.35, size=size)
.. code:: python
(m <- matrix(c(x, y, size-x, size-y), ncol = 2))
.. raw:: html
Test that proportions are equal using z-score (prop.test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
prop.test(m)
.. parsed-literal::
2-sample test for equality of proportions with continuity correction
data: m
X-squared = 6.0336, df = 1, p-value = 0.01404
alternative hypothesis: two.sided
95 percent confidence interval:
-0.13685795 -0.01514205
sample estimates:
prop 1 prop 2
0.310 0.386
Alternative using :math:`\chi^2` test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
chisq.test(m)
.. parsed-literal::
Pearson's Chi-squared test with Yates' continuity correction
data: m
X-squared = 6.0336, df = 1, p-value = 0.01404
Work!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You find 3 circulating DNA fragments with the following properties
- fragment 1 has length 100 and is 35% CG
- fragment 2 has length 110 and is 40% CG
- fragment 3 has length 120 and is 50% GC
Do you reject the null hypotheses that the percent GC content is the
same for all 3 fraemnets?
Sample size calculations
------------------------
Need some explanation and disclaimers here!
See `Qucik-R Power `__ for
more examples and more detailed explanation of function parameters.
.. code:: python
install.packages("pwr", repos = "http://cran.r-project.org")
.. parsed-literal::
The downloaded binary packages are in
/var/folders/xf/rzdg30ps11g93j3w0h589q780000gn/T//RtmpBJlIML/downloaded_packages
.. code:: python
library(pwr)
.. parsed-literal::
Warning message:
: package ‘pwr’ was built under R version 3.1.3
For simple power calculations, you need 3 out of 4 of the follwoing: - n
= number of samples / experimental units - sig.level = what "p-value"
you will be using to determine significance - power = fraction of
experiments that will reject null hypothesis - d = "effect size" ~
depends on context
.. code:: python
d <- (5.5 - 5.0)/1.5
pwr.t.test(d = d, sig.level = 0.05, power = 0.8)
.. parsed-literal::
Two-sample t test power calculation
n = 142.2462
d = 0.3333333
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
.. code:: python
p1 = 0.30
p2 = 0.35
(h = abs(2*asin(p1) - 2*asin(p2)))
pwr.2p.test(h = h, sig.level = 0.05, power = 0.9)
.. raw:: html
0.105756899260226
.. parsed-literal::
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.1057569
n = 1878.922
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: same sample sizes
Check our understanding of what power means
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note: The code below is an example of a statistical *simulation*. It can
be made more efficient by vectorization, but looping is initially easer
to understand and the speed makes no practical differnece for such a
small exmaple.
Before running the code - try to answer this question:
If we performed the same experiment 1000 times with n=1879 (from
power calucations above), how many experiments would yield a p-value
of less than 0.05?
.. code:: python
size <- 1879
hits <- 0
ntrials <- 1000
for (i in 1:ntrials) {
x <- rbinom(n=1, prob=0.3, size=size)
y <- rbinom(n=1, prob=0.35, size=size)
m <- matrix(c(x, y, size-x, size-y), ncol = 2)
result <- prop.test(m)
hits <- hits + (result$p.value < 0.05)
}
hits/ntrials
.. raw:: html
0.886
Work!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Suppose an investigator proposes to use an unpaired t-test to examine
differences between two groups of size 13 and 16. What is the power at
the usual 0.05 significance level for effet sizes of 0.1, 0.5 and 1.0.