Hypothesis Testing and Power Calculations
=========================================

One of the things that R is used for is to perform simple hypothesis
testing and power calculations using *canned* functions. These
functions are very simple to run; being able to use and interpret them
correctly is the hard part.

What is covered in this section
-------------------------------

1. Simple summary statistics
2. Functions dealing with probability distributions
3. Hypothesis testing
4. Power / sample size calculations
5. Tabulating data
6. Using simulations to calculate power (use of ``for`` and ``if``)
7. Customization of graphical plots

Set random seed
---------------

.. code:: r

    set.seed(123)

Estimation
----------

Often we want to estimate functions of data - for example, the mean or
median of a set of numbers.

.. code:: r

    # First create a sample of 10 numbers from a normal distribution
    # with mean 0 and sd 1
    (x <- rnorm(10))

.. parsed-literal::

     [1]  1.23694082  0.46551127  0.07068445  1.42149414 -1.15858606
     [6]  0.46040700 -0.62568581  0.31334697 -1.24274709 -0.94526622
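As an aside (not part of the original R material), the same
draw-a-sample-then-summarize workflow can be sketched in Python using
only the standard library. Like R's ``sd`` and ``var``, the
``statistics`` module's ``stdev`` and ``variance`` use the :math:`n-1`
sample formulas.

```python
import random
import statistics

# Draw 10 numbers from a normal distribution with mean 0 and sd 1
random.seed(123)
x = [random.gauss(0, 1) for _ in range(10)]

# Sample summaries; stdev and variance divide by n - 1, like R's sd/var
print(statistics.mean(x))
print(statistics.median(x))
print(statistics.stdev(x))
print(statistics.variance(x))
```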
.. code:: r

    mean(x)

.. parsed-literal::

    0.0336346830742124

.. code:: r

    median(x)

.. parsed-literal::

    0.16910863599832

.. code:: r

    sd(x)

.. parsed-literal::

    1.21633299330955

.. code:: r

    var(x)

.. parsed-literal::

    1.47946595061337

Hypothesis testing
------------------

We assume the theory covered in the morning statistics lectures.

Example: Comparing group means
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The t-test
^^^^^^^^^^

.. code:: r

    x <- rnorm(mean=5.5, sd=1.5, n=10)
    y <- rnorm(mean=5, sd=1.5, n=10)
    result <- t.test(x, y, alternative = "two.sided",
                     paired = FALSE, var.equal = TRUE)
    result

.. parsed-literal::

        Two Sample t-test

    data:  x and y
    t = 0.4472, df = 18, p-value = 0.6601
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.105854  1.703865
    sample estimates:
    mean of x mean of y
     5.611938  5.312933

What is going on and what does all this actually mean?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The variable ``result`` is a list (similar to a hashmap or dictionary
in other languages), and named parts can be extracted with the ``$``
operator - e.g. ``result$p.value`` gives the p-value.

The test first calculates a number called a :math:`t` statistic that is
given by some function of the data. From theoretical analysis, we know
that if the null hypothesis and the assumptions of equal variance and
normally distributed data are correct, then the :math:`t` random
variable will have a :math:`t` distribution with 18 degrees of freedom
(20 data points minus 2 estimated means). The formula for calculating
the :math:`t`-statistic is

.. math::

   t = \frac{(\bar{x_1} - \bar{x_2})}{\text{se}}

where :math:`\bar{x_1}` and :math:`\bar{x_2}` are the sample means, and
se (the standard error) is

.. math::

   \text{se} = \sqrt{s_1^2/n_1 + s_2^2/n_2}

where :math:`s_1^2` and :math:`s_2^2` are the sample variances. (With
equal group sizes, as here, this is identical to the pooled standard
error that ``t.test`` uses when ``var.equal = TRUE``.) We will
calculate all these values to show what goes on in the sausage
factory.

.. code:: r

    # These are custom R functions, which we briefly showed you how to
    # write in yesterday's session.
    se <- function(x1, x2) {
        n1 <- length(x1)
        n2 <- length(x2)
        v1 <- var(x1)
        v2 <- var(x2)
        sqrt(v1/n1 + v2/n2)
    }

.. code:: r

    t <- function(x1, x2) {
        (mean(x1) - mean(x2))/se(x1, x2)
    }

.. code:: r

    t(x, y)

.. parsed-literal::

    0.447153048434479

Now we will make use of our knowledge of probability distributions to
understand what the p-value is. We will plot the PDF of the
t-distribution with df=18. The x-axis is the value of the t-statistic
and the y-axis is the density (you can think of the density as the
height of a histogram with total area normalized to sum to 1). The red
lines are at the 2.5% and 97.5% quantiles - so for a two-sided test,
the t-statistic must be more extreme than these two red lines (i.e. to
the right of the 97.5% quantile or to the left of the 2.5% quantile)
to reject the null hypothesis. We see that our :math:`t` statistic
(and its symmetric negative) in dashed green do not meet this
requirement - hence the p-value is > 0.05.

.. code:: r

    xp <- seq(-5, 5, length.out = 100)
    plot(xp, dt(xp, df=18), type="l", col="blue")
    score <- t(x, y)
    abline(v=c(score, -score), col="green", lty=2)
    thresholds <- c(qt(0.025, df=18), qt(0.975, df=18))
    abline(v=thresholds, col="red", lty=1)

.. image:: Hypothesistestingandpowercalcuations_files/Hypothesistestingandpowercalcuations_21_0.png
   :width: 600

The p-value is the area under the curve more extreme than the green
lines. Since the t-statistic is positive, we can find the area to its
right as one minus the cumulative probability up to the value of the
t-statistic. Doubling this (why?) gives us the p-value.

.. code:: r

    2*(1 - pt(score, df=18))

.. parsed-literal::

    0.660098549618083

Note that this agrees with the value given by ``t.test``.
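As a cross-check of the manual calculation above (an illustrative
aside, not part of the original notebook), the same t-statistic and
p-value can be reproduced in Python with ``scipy.stats``. With equal
group sizes, the standard-error formula above coincides with the
pooled version used by ``ttest_ind(..., equal_var=True)``.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
x = rng.normal(loc=5.5, scale=1.5, size=10)
y = rng.normal(loc=5.0, scale=1.5, size=10)

# Manual calculation: difference in means over the standard error
se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
t_manual = (x.mean() - y.mean()) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(x) + len(y) - 2)

# Library version; equal_var=True matches t.test(..., var.equal=TRUE)
t_lib, p_lib = stats.ttest_ind(x, y, equal_var=True)

print(t_manual, p_manual)  # agrees with t_lib, p_lib
```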
The rank-sum test
^^^^^^^^^^^^^^^^^

This calculates a test statistic :math:`W` that is the sum of the
outcomes for all pairwise comparisons of the ranks of the values in
:math:`x` and :math:`y`. Outcomes are 1 if the first item > the second
item, 0.5 for a tie, and 0 otherwise. This can be simplified to the
following formula

.. math::

   W = R_1 - \frac{n_1(n_1 + 1)}{2}

where :math:`R_1` is the sum of the ranks of the values in :math:`x`.
For large samples, :math:`W` can be considered to come from a normal
distribution with a mean and standard deviation that can be calculated
from the data (look this up in any mathematical statistics textbook,
or by Googling, if interested).

.. code:: r

    wilcox.test(x, y)

.. parsed-literal::

        Wilcoxon rank sum test

    data:  x and y
    W = 53, p-value = 0.8534
    alternative hypothesis: true location shift is not equal to 0

Explicit calculation of the statistic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    w <- function(x, y) {
        r <- rank(c(x, y))
        nx <- length(x)
        sum(r[1:nx]) - nx*(nx+1)/2
    }

.. code:: r

    w(x, y)

.. parsed-literal::

    53

Effect of sample size
~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    x <- rnorm(mean=5.5, sd=1.5, n=200)
    y <- rnorm(mean=5, sd=1.5, n=200)
    t.test(x, y)

.. parsed-literal::

        Welch Two Sample t-test

    data:  x and y
    t = 3.1236, df = 397.539, p-value = 0.001917
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     0.1676522 0.7370648
    sample estimates:
    mean of x mean of y
     5.474823  5.022465

.. code:: r

    wilcox.test(x, y)

.. parsed-literal::

        Wilcoxon rank sum test with continuity correction

    data:  x and y
    W = 23106, p-value = 0.007229
    alternative hypothesis: true location shift is not equal to 0

Work!
^^^^^

Suppose we measure the weights of 100 people before and after a
marathon. We want to know if there is a difference. What is the
p-value for an appropriate parametric and an appropriate
non-parametric test?

.. code:: r

    before <- rnorm(n=100, mean=165, sd=25)
    change <- rnorm(n=100, mean=-5, sd=10)
    after <- before + change

.. code:: r

    # Parametric test

.. code:: r

    # Non-parametric test

Example: Comparing proportions
------------------------------

One sample
~~~~~~~~~~

.. code:: r

    x <- sample(c('H', 'T'), 50, replace=TRUE)
    table(x)

.. parsed-literal::

    x
     H  T
    26 24

.. code:: r

    binom.test(table(x), p = 0.5)

.. parsed-literal::

        Exact binomial test

    data:  table(x)
    number of successes = 26, number of trials = 50, p-value = 0.8877
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.3741519 0.6633949
    sample estimates:
    probability of success
                      0.52

Two samples
~~~~~~~~~~~

.. code:: r

    size = 50
    x <- rbinom(n=1, prob=0.3, size=size)
    y <- rbinom(n=1, prob=0.35, size=size)

.. code:: r

    (m <- matrix(c(x, y, size-x, size-y), ncol = 2))

.. parsed-literal::

         [,1] [,2]
    [1,]   10   40
    [2,]   16   34
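As an aside (not part of the original R notebook), the same 2x2 table
can be tested in Python: for a 2x2 table, ``scipy.stats.chi2_contingency``
with its default Yates continuity correction gives the same statistic
and p-value that R's ``prop.test`` and ``chisq.test`` report.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the example above: each row is a group,
# columns are successes and failures (10/40 vs 16/34)
m = np.array([[10, 40],
              [16, 34]])

# Yates' continuity correction is applied by default for 2x2 tables
chi2, p, dof, expected = chi2_contingency(m)
print(chi2, p)  # X-squared = 1.2994, p-value = 0.2543, as in prop.test
```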
.. code:: r

    prop.test(m)

.. parsed-literal::

        2-sample test for equality of proportions with continuity
        correction

    data:  m
    X-squared = 1.2994, df = 1, p-value = 0.2543
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.31032527  0.07032527
    sample estimates:
    prop 1 prop 2
      0.20   0.32

Effect of sample size
^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    size <- 500
    x <- rbinom(n=1, prob=0.3, size=size)
    y <- rbinom(n=1, prob=0.35, size=size)

.. code:: r

    (m <- matrix(c(x, y, size-x, size-y), ncol = 2))

.. parsed-literal::

         [,1] [,2]
    [1,]  155  345
    [2,]  193  307
Test that proportions are equal using a z-score (prop.test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    prop.test(m)

.. parsed-literal::

        2-sample test for equality of proportions with continuity
        correction

    data:  m
    X-squared = 6.0336, df = 1, p-value = 0.01404
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.13685795 -0.01514205
    sample estimates:
    prop 1 prop 2
     0.310  0.386

Alternative using the :math:`\chi^2` test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    chisq.test(m)

.. parsed-literal::

        Pearson's Chi-squared test with Yates' continuity correction

    data:  m
    X-squared = 6.0336, df = 1, p-value = 0.01404

Work!
^^^^^

You find 3 circulating DNA fragments with the following properties:

- fragment 1 has length 100 and is 35% GC
- fragment 2 has length 110 and is 40% GC
- fragment 3 has length 120 and is 50% GC

Do you reject the null hypothesis that the percent GC content is the
same for all 3 fragments?

Sample size calculations
------------------------

Need some explanation and disclaimers here! See the Quick-R page on
power analysis for more examples and a more detailed explanation of
the function parameters.

.. code:: r

    install.packages("pwr", repos = "http://cran.r-project.org")

.. parsed-literal::

    The downloaded binary packages are in
        /var/folders/xf/rzdg30ps11g93j3w0h589q780000gn/T//RtmpBJlIML/downloaded_packages

.. code:: r

    library(pwr)

.. parsed-literal::

    Warning message:
    : package ‘pwr’ was built under R version 3.1.3

For simple power calculations, you need 3 out of 4 of the following:

- n = number of samples / experimental units
- sig.level = the "p-value" threshold you will use to determine
  significance
- power = the fraction of experiments that will reject the null
  hypothesis when the alternative is true
- d = "effect size" (depends on context)

.. code:: r

    d <- (5.5 - 5.0)/1.5
    pwr.t.test(d = d, sig.level = 0.05, power = 0.8)

.. parsed-literal::

        Two-sample t test power calculation

                  n = 142.2462
                  d = 0.3333333
          sig.level = 0.05
              power = 0.8
        alternative = two.sided

    NOTE: n is number in *each* group

.. code:: r

    p1 = 0.30
    p2 = 0.35
    (h = abs(2*asin(p1) - 2*asin(p2)))
    pwr.2p.test(h = h, sig.level = 0.05, power = 0.9)

.. parsed-literal::

    0.105756899260226

.. parsed-literal::

        Difference of proportion power calculation for binomial
        distribution (arcsine transformation)

                  h = 0.1057569
                  n = 1878.922
          sig.level = 0.05
              power = 0.9
        alternative = two.sided

    NOTE: same sample sizes

(Note that the standard definition of Cohen's effect size :math:`h`
applies the arcsine to :math:`\sqrt{p}`, i.e.
:math:`h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}`; the code above
applies it to the raw proportions, which happens to give a similar
value for proportions in this range.)

Check our understanding of what power means
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: the code below is an example of a statistical *simulation*. It
could be made more efficient by vectorization, but looping is
initially easier to understand, and the speed makes no practical
difference for such a small example.

Before running the code, try to answer this question: if we performed
the same experiment 1000 times with n=1879 (from the power calculation
above), how many experiments would yield a p-value of less than 0.05?

.. code:: r

    size <- 1879
    hits <- 0
    ntrials <- 1000
    for (i in 1:ntrials) {
        x <- rbinom(n=1, prob=0.3, size=size)
        y <- rbinom(n=1, prob=0.35, size=size)
        m <- matrix(c(x, y, size-x, size-y), ncol = 2)
        result <- prop.test(m)
        hits <- hits + (result$p.value < 0.05)
    }
    hits/ntrials

.. parsed-literal::

    0.886

Work!
^^^^^

Suppose an investigator proposes to use an unpaired t-test to examine
differences between two groups of size 13 and 16. What is the power at
the usual 0.05 significance level for effect sizes of 0.1, 0.5 and
1.0?
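Returning to the power simulation above: the same idea can be sketched
in Python (an aside, not part of the original notebook), with
``scipy.stats.chi2_contingency`` and its default continuity correction
standing in for ``prop.test``. Fewer trials are used here just to keep
it quick.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
size = 1879      # per-group n from the power calculation above
ntrials = 500    # fewer than the 1000 used in the R example
hits = 0
for _ in range(ntrials):
    x = rng.binomial(size, 0.30)
    y = rng.binomial(size, 0.35)
    m = np.array([[x, size - x],
                  [y, size - y]])
    _, p, _, _ = chi2_contingency(m)
    hits += p < 0.05
print(hits / ntrials)  # roughly the target power of 0.9
```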