Practice 3

  1. Make a plot of the standard normal curve on the interval [-4, 4]. Give the plot a title “Standard normal curve”, an x label of “Normal deviate” and a y label of “Density”.
x <- seq(-4, 4, length.out=100)  # 100 evenly spaced points on [-4, 4]
y <- dnorm(x)
plot(x, y, type="l", main="Standard normal curve", xlab="Normal deviate", ylab="Density")
[Figure: standard normal curve on [-4, 4]; x axis "Normal deviate", y axis "Density"]
  1. What is the area under the curve to the right of x=3? In other words, what is the probability of drawing a random number from the normal distribution that is 3 standard deviations or more larger than the mean?
1 - pnorm(3)
0.0013498980316301
  1. If the expression values for a gene are normally distributed with mean 10 and standard deviation 2, what is the value of a gene at the 95th percentile?
qnorm(0.95, mean=10, sd=2)
13.2897072539029
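As a sanity check on the quantile above, the following sketch recomputes it two ways: by rescaling the standard normal quantile, and by feeding the result back through pnorm. The variable names are illustrative and not part of the exercise.

```r
# 95th percentile for a normal distribution with mean 10 and sd 2
q <- qnorm(0.95, mean = 10, sd = 2)

# Same value built from the standard normal quantile: mean + sd * z
q_manual <- 10 + 2 * qnorm(0.95)

# pnorm is the inverse of qnorm, so this should recover 0.95
p_back <- pnorm(q, mean = 10, sd = 2)

print(c(q = q, q_manual = q_manual, p_back = p_back))
```

The two quantiles should agree up to floating-point error, and p_back should come back as 0.95.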

Generate 50 numbers from a normal distribution with mean=10 and sd=2. Now transform this vector so that the numbers have a standard normal distribution with mean=0 and sd=1.

x <- rnorm(50, 10, 2)
z <- (x - mean(x))/sd(x)
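The standardization above can be checked with a short sketch. It assumes a fixed seed (chosen here only for reproducibility) and compares the manual z-scores with base R's scale function, which centers and scales by the sample mean and standard deviation by default.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

x <- rnorm(50, mean = 10, sd = 2)

# Manual standardization using the sample mean and sample sd
z <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.vector drops the attributes
z_scale <- as.vector(scale(x))

print(all.equal(z, z_scale))
print(c(mean = mean(z), sd = sd(z)))
```

Because the sample mean and sample sd are used, z has mean 0 and sd 1 up to floating-point error. Standardizing with the known parameters instead, (x - 10)/2, would give values whose sample mean and sd are only approximately 0 and 1.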
  1. A t-test with 6 degrees of freedom has a score of 3.5. Using only the dt, pt, qt or rt probability functions, what is the p-value if this was a two-sided test? Recall that a p-value is the probability of seeing a value as extreme or more extreme than the observed score, assuming the score was drawn from the specified distribution.
2*(1 - pt(3.5, df = 6))
0.0128263383328053
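The same two-sided p-value can be written in a few equivalent ways. The sketch below is one possible comparison; t_score and dof are illustrative names, not part of the exercise.

```r
t_score <- 3.5
dof <- 6

# Upper tail as 1 - CDF, doubled for a two-sided test
p_a <- 2 * (1 - pt(t_score, df = dof))

# Upper tail requested directly; tends to be more accurate for tiny tails
p_b <- 2 * pt(t_score, df = dof, lower.tail = FALSE)

# The t-distribution is symmetric, so the lower tail at -3.5 works too
p_c <- 2 * pt(-t_score, df = dof)

print(c(p_a, p_b, p_c))
```

All three expressions should agree up to floating-point error.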
  1. Draw 100,000 random numbers from the t-distribution with 6 degrees of freedom. How many times is the number less than -3.5 or greater than 3.5?
x <- rt(100000, df=6)
sum(abs(x) > 3.5)
1238
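The simulated count connects back to the two-sided p-value: the fraction of draws beyond ±3.5 should be close to 2*(1 - pt(3.5, df = 6)). A minimal sketch of that comparison, assuming an arbitrary seed and the same number of draws as the code above:

```r
set.seed(42)  # assumed seed; any seed should give a similar fraction

n <- 100000
x <- rt(n, df = 6)

# Fraction of draws at least as extreme as 3.5 in either direction
frac_extreme <- mean(abs(x) > 3.5)

# Theoretical two-sided tail probability
p_theory <- 2 * (1 - pt(3.5, df = 6))

print(c(simulated = frac_extreme, theoretical = p_theory))
```

The two numbers will not match exactly, since the simulated fraction carries sampling error that shrinks as n grows.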
  1. Find the mean value of all numeric variables for the mtcars data, grouping by number of gears and automatic or manual transmission. (Hint: Use the aggregate function)
with(mtcars, aggregate(mtcars, by=list(gear=gear, transmission=am), FUN=mean))
  gear transmission      mpg      cyl     disp       hp     drat     wt   qsec   vs am gear     carb
1    3            0 16.10667 7.466667    326.3 176.1333 3.132667 3.8926 17.692  0.2  0    3 2.666667
2    4            0    21.05        5  155.675   100.75   3.8625  3.305 20.025    1  0    4        3
3    4            1   26.275      4.5 106.6875   83.875  4.13375 2.2725 18.435 0.75  1    4        2
4    5            1    21.38        6   202.48    195.6    3.916 2.6326  15.64  0.2  1    5      4.4
library(plyr)
library(reshape2)
data(airquality)
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118    8   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
  1. Use melt to convert the airquality dataframe into a “tall” format using Month and Day as the id variables, saving it as a new dataframe. Print the first 6 rows.
md <- melt(airquality, id=c("Month", "Day"))
head(md)
  Month Day variable value
1     5   1    Ozone    41
2     5   2    Ozone    36
3     5   3    Ozone    12
4     5   4    Ozone    18
5     5   5    Ozone    NA
6     5   6    Ozone    28
  1. Find the average values of Ozone, Solar.R, Wind and Temp for each month using dcast. Hint: Give an extra argument na.rm = TRUE to ignore missing data.
dcast(md, Month ~ variable, mean, na.rm = TRUE)
  Month    Ozone  Solar.R     Wind     Temp
1     5 23.61538 181.2963 11.62258 65.54839
2     6 29.44444 190.1667 10.26667     79.1
3     7 59.11538 216.4839 8.941935 83.90323
4     8 59.96154 171.8571 8.793548 83.96774
5     9 31.44828 167.4333    10.18     76.9
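The monthly means can be cross-checked without reshape2. The sketch below is a base R alternative using aggregate, not the dcast approach the question asks for; the names vars and monthly are illustrative. It uses the data-frame interface rather than the formula interface, because the formula interface drops every row containing any NA by default and would give different means.

```r
data(airquality)

vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Mean of each variable per month; na.rm = TRUE is passed on to mean(),
# so missing values are dropped column by column
monthly <- aggregate(airquality[, vars],
                     by = list(Month = airquality$Month),
                     FUN = mean, na.rm = TRUE)

print(monthly)
```

The result should have one row per month (5 through 9) and the same four columns of means as the dcast output.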
  1. Find the average values of Ozone, Solar.R, Wind and Temp for each month using dcast, but only for the first 2 weeks of each month. Hint: Give an extra argument na.rm = TRUE to ignore missing data. Hint: Use the subset argument.
dcast(md, Month ~ variable, mean, subset = .(Day < 15), na.rm = TRUE)
  Month    Ozone  Solar.R     Wind     Temp
1     5 19.41667 200.0909 11.17857 66.28571
2     6     40.5 249.1429 10.73571 82.85714
3     7 64.81818 228.7143 9.007143 84.85714
4     8 58.41667 168.7273 8.721429     85.5
5     9 43.35714 188.6429 9.407143 82.21429

Questions below use the day.1 and day.2 dataframes

set.seed(123)
pid.1 <- c(1,1,2,2)
gid.1 <- c(1,2,1,2)
val.1 <- rnorm(4)
day.1 <- data.frame(pid=pid.1, gid=gid.1, val=val.1)

pid.2 <- c(1,1,2,2)
gid.2 <- c(1,2,1,2)
val.2 <- 1 + rnorm(4)
day.2 <- data.frame(pid=pid.2, gid=gid.2, val=val.2)
day.1
  pid gid         val
1   1   1  -0.5604756
2   1   2  -0.2301775
3   2   1    1.558708
4   2   2  0.07050839
day.2
  pid gid         val
1   1   1    1.129288
2   1   2    2.715065
3   2   1    1.460916
4   2   2  -0.2650612
  1. Suppose day.1 and day.2 are results from experiments performed on different days. Merge the data from day.1 and day.2 into a single dataframe called days to combine the data sets.
days <- merge(day.1, day.2, by=c("pid", "gid"), suffixes = 1:2)
days
  pid gid        val1        val2
1   1   1  -0.5604756    1.129288
2   1   2  -0.2301775    2.715065
3   2   1    1.558708    1.460916
4   2   2  0.07050839  -0.2650612
  1. Sort the days dataframe by val1 in decreasing order.
days[order(-days$val1),]
  pid gid        val1        val2
3   2   1    1.558708    1.460916
4   2   2  0.07050839  -0.2650612
2   1   2  -0.2301775    2.715065
1   1   1  -0.5604756    1.129288
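Negating the column inside order works for numeric data. An alternative that also works for non-numeric columns is the decreasing argument. The sketch below rebuilds days from the same seed and checks that both approaches give the same result; by_negation and by_decreasing are illustrative names.

```r
set.seed(123)
day.1 <- data.frame(pid = c(1, 1, 2, 2), gid = c(1, 2, 1, 2), val = rnorm(4))
day.2 <- data.frame(pid = c(1, 1, 2, 2), gid = c(1, 2, 1, 2), val = 1 + rnorm(4))

# Character suffixes give the column names val1 and val2
days <- merge(day.1, day.2, by = c("pid", "gid"), suffixes = c("1", "2"))

# Two ways to sort by val1 from largest to smallest
by_negation   <- days[order(-days$val1), ]
by_decreasing <- days[order(days$val1, decreasing = TRUE), ]

print(identical(by_negation, by_decreasing))
print(by_decreasing)
```

The two results can differ in how ties are ordered, but val1 has no ties here.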
  1. Remove duplicate rows from the following dataframe.
df <- read.csv("df.csv")
df
  pid gid        val1        val2
1   1   1  -0.5604756    1.129288
2   1   1  -0.5604756    1.129288
3   1   2  -0.2301775    2.715065
4   2   2  0.07050839  -0.2650612
5   2   2  0.07050839  -0.2650612
6   2   1    1.558708    1.460916
unique(df)
  pid gid        val1        val2
1   1   1  -0.5604756    1.129288
3   1   2  -0.2301775    2.715065
4   2   2  0.07050839  -0.2650612
6   2   1    1.558708    1.460916
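To see which rows unique drops, duplicated can be used to flag them explicitly. The sketch below builds the dataframe inline from the values printed above, as an assumption standing in for df.csv, which is not available here; dup and deduped are illustrative names.

```r
# Inline stand-in for df.csv, using the values printed above (assumption)
df <- data.frame(
  pid  = c(1, 1, 1, 2, 2, 2),
  gid  = c(1, 1, 2, 2, 2, 1),
  val1 = c(-0.5604756, -0.5604756, -0.2301775, 0.07050839, 0.07050839, 1.558708),
  val2 = c(1.129288, 1.129288, 2.715065, -0.2650612, -0.2650612, 1.460916)
)

# TRUE for every row that repeats an earlier row
dup <- duplicated(df)

# Keep only the first occurrence of each row
deduped <- df[!dup, ]

print(which(dup))
print(deduped)
```

Rows 2 and 5 repeat rows 1 and 4, so the result keeps rows 1, 3, 4 and 6, matching the output of unique(df).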