Some R functions are in libraries that have to be explicitly loaded. You
do this with the library command. The next cell shows an example of
loading the genefilter library which contains the rowttests
fucntion that we will be using later.
library(genefilter)
Attaching package: ‘genefilter’
The following object is masked from ‘package:base’:
anyNA
Most libraries have been pre-installed on the VM. However, if you need
to install a new library, all you need is to use the
install.packages() function. For example, you can install the MASS
(Modern Applied Statistics in S) library like so:
install.packages("MASS", repos = "http://cran.r-project.org")
Some specialized libraries come from the R BioConductor project, and the installation processs is slgihtly different. This will be covered in a later session.
By default, the result on the last line is displayed in the Jupyter
notebook. To see other lines, use the print function.
print(1 * 2 * 3 *4)
print(11 %% 3 == 1)
10 > 1
[1] 24
[1] FALSE
Type soemthing and hit the tab key - this will either complete the R command for you (if unique) or present a list of opitons.
Use ?topic to get a description of topic, or more verbosely, you can
also use help(topic).
?seq
| seq {base} | R Documentation |
Generate regular sequences. seq is a standard generic with a
default method. seq.int is a primitive which can be
much faster but has a few restrictions. seq_along and
seq_len are very fast primitives for two common cases.
seq(...)
## Default S3 method:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
seq.int(from, to, by, length.out, along.with, ...)
seq_along(along.with)
seq_len(length.out)
... |
arguments passed to or from methods. |
from, to |
the starting and (maximal) end values of the
sequence. Of length |
by |
number: increment of the sequence. |
length.out |
desired length of the sequence. A
non-negative number, which for |
along.with |
take the length from the length of this argument. |
Numerical inputs should all be finite (that is, not infinite,
NaN or NA).
The interpretation of the unnamed arguments of seq and
seq.int is not standard, and it is recommended always to
name the arguments when programming.
seq is generic, and only the default method is described here.
Note that it dispatches on the class of the first argument
irrespective of argument names. This can have unintended consequences
if it is called with just one argument intending this to be taken as
along.with: it is much better to use seg_along in that
case.
seq.int is an internal generic which dispatches on
methods for "seq" based on the class of the first supplied
argument (before argument matching).
Typical usages are
seq(from, to) seq(from, to, by= ) seq(from, to, length.out= ) seq(along.with= ) seq(from) seq(length.out= )
The first form generates the sequence from, from+/-1, ..., to
(identical to from:to).
The second form generates from, from+by, ..., up to the
sequence value less than or equal to to. Specifying to -
from and by of opposite signs is an error. Note that the
computed final value can go just beyond to to allow for
rounding error, but is truncated to to. (‘Just beyond’
is by up to 1e-10 times abs(from - to).)
The third generates a sequence of length.out equally spaced
values from from to to. (length.out is usually
abbreviated to length or len, and seq_len is much
faster.)
The fourth form generates the integer sequence 1, 2, ...,
length(along.with). (along.with is usually abbreviated to
along, and seq_along is much faster.)
The fifth form generates the sequence 1, 2, ..., length(from)
(as if argument along.with had been specified), unless
the argument is numeric of length 1 when it is interpreted as
1:from (even for seq(0) for compatibility with S).
Using either seq_along or seq_len is much preferred
(unless strict S compatibility is essential).
The final form generates the integer sequence 1, 2, ...,
length.out unless length.out = 0, when it generates
integer(0).
Very small sequences (with from - to of the order of 10^{-14}
times the larger of the ends) will return from.
For seq (only), up to two of from, to and
by can be supplied as complex values provided length.out
or along.with is specified. More generally, the default method
of seq will handle classed objects with methods for
the Math, Ops and Summary group generics.
seq.int, seq_along and seq_len are
primitive.
seq.int and the default method of seq for numeric
arguments return a vector of type "integer" or "double":
programmers should not rely on which.
seq_along and seq_len return an integer vector, unless
it is a long vector when it will be double.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
The methods seq.Date and seq.POSIXt.
:,
rep,
sequence,
row,
col.
seq(0, 1, length.out = 11) seq(stats::rnorm(20)) # effectively 'along' seq(1, 9, by = 2) # matches 'end' seq(1, 9, by = pi) # stays below 'end' seq(1, 6, by = 3) seq(1.575, 5.125, by = 0.05) seq(17) # same as 1:17, or even better seq_len(17)
Use example to see usage examples.
example(seq)
seq> seq(0, 1, length.out = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq> seq(stats::rnorm(20)) # effectively 'along'
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
seq> seq(1, 9, by = 2) # matches 'end'
[1] 1 3 5 7 9
seq> seq(1, 9, by = pi) # stays below 'end'
[1] 1.000000 4.141593 7.283185
seq> seq(1, 6, by = 3)
[1] 1 4
seq> seq(1.575, 5.125, by = 0.05)
[1] 1.575 1.625 1.675 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125
[13] 2.175 2.225 2.275 2.325 2.375 2.425 2.475 2.525 2.575 2.625 2.675 2.725
[25] 2.775 2.825 2.875 2.925 2.975 3.025 3.075 3.125 3.175 3.225 3.275 3.325
[37] 3.375 3.425 3.475 3.525 3.575 3.625 3.675 3.725 3.775 3.825 3.875 3.925
[49] 3.975 4.025 4.075 4.125 4.175 4.225 4.275 4.325 4.375 4.425 4.475 4.525
[61] 4.575 4.625 4.675 4.725 4.775 4.825 4.875 4.925 4.975 5.025 5.075 5.125
seq> seq(17) # same as 1:17, or even better seq_len(17)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
To search for all functions with the phrase topic, use
apropos(topic).
apropos("seq")
For a broader search using “fuzzy” matching, use ?? prepended to the
search term.
??seq
Finally, you can always use a search engine - e.g. Googling for “how to generate random numbers from a normal distribution in R” returens:
R Learning Module: Probabilities and Distributions www.ats.ucla.edu › stat › r › modules University of California, Los Angeles Generating random samples from a normal distribution ... (For more information on the random number generator used in R please refer to the help pages for the ...
R: The Normal Distribution https://stat.ethz.ch/R-manual/R-devel/library/.../Normal.html ETH Zurich Normal {stats}, R Documentation ... quantile function and random generation for the normal distribution with mean equal to mean and ... number of observations.
CASt R: Random numbers astrostatistics.psu.edu/.../R/Random.html Pennsylvania State University It is often necessary to simulate random numbers in R. There are many functions available to ... Let’s consider the well-known normal distribution as an example: ?Normal .... Creating functions in R, as illustrated above, is a common procedure.
These ntoebooks will be interspersed with many such Work! sections,
since the only way to learn R is by using R. Remember to use TAB and
? and apropos whenever you’re stuck. If you don’t like to work
so much, reanme them as Play! sections since in our world
work == play.
If you put USD 100 in a bank that pays5 percent interest compounded daily, how much will you have in the bank at the end of one year? Recall from high school math that
where \(A\) is the amount, \(P\) is the principal, \(r\) is the interest rate, \(n\) is the compoundings per period and \(m\) is the number of periods.
What is the (Euclideean) distance between two points at (1,5) and (4,8)? Recall that the Eucliean distacne between two points (x1, y1) and (x2, y2) is given by the Pythagorean theorem:
1:4
seq(2, 12, 3)
seq(from=0, to=pi, by=pi/4)
seq(from=0, to=pi, length.out = 5)
x <- 1:10
y <- 1:10
x * y
x <- seq(0, 2*pi, length.out = 5)
sin(x)
A function is code tha behaves like a black box - you give it some input
and it returns some output. Functions are essenital because they allow
you to design use potentially complicated algorithms without having to
keep track of the algorithmic details - all you need to remeber are the
function name (which you can search with apropos or prepnding ??
to the search term), the input arguments (TAB provides hings), and what
type of output will be generated. Most of R’s usefulness comes from the
large collection of functions that it provides - for example,
library(), seq(), as.fractions() are all functions. You can
also write your own R functions - this will be covered in detail in a
later session - but we show the basics here so that you can follow the
statistics lecutures.
A function has the following structure
function_name <- function(arguments) {
body_of_function
return(result)
}
The simple example below will make this explicit.
my.func <- function(a, b) {
c <- a * b
return(a + b - c)
}
my.func(5, 4)
my.func(b=4, a=5)
What is the sum of all the numbers from 1 to 100?
What is the sum of the natural log of all the numbers from 1 to 100?
What is the sum of all the odd numbers from 1 to 10?
Write a function that returns the sum of all the odd numbers from a
to b, where a and b are the minimum and maximum values of a
sequence of integers. For example, if the funciton was called
sum.odd, then you would answer the previous question by
sum.odd(1, 100)
x <- seq(0, 2*pi, length.out = 100)
plot(x, sin(x), main="Graph of Sin(x)")
par(mfrow=c(2,2))
plot(x, sin(x), main="Graph of Sin(x)")
plot(x, cos(x), main="Graph of Cos(x)", type="o", col="blue", lty=2)
plot(x, tan(x), main="Graph of Tan(x)", type="l", col="red")
The concept of “saving” a graphical plot in R involves the following steps
pdf device)dev.close() - this writes the plot to filepdf("trig.pdf", width=12, height=4)
par(mfrow=c(1,3))
plot(x, sin(x), main="Graph of Sin(x)")
plot(x, cos(x), main="Graph of Cos(x)", type="o", col="blue", lty=2)
plot(x, tan(x), main="Graph of Tan(x)", type="l", col="red")
dev.off()
?device
| Devices {grDevices} | R Documentation |
The following graphics devices are currently available:
pdf Write PDF graphics commands to a file
postscript Writes PostScript graphics commands to
a file
xfig Device for XFIG graphics file format
bitmap bitmap pseudo-device via
Ghostscript (if available).
pictex Writes TeX/PicTeX graphics commands to a
file (of historical interest only)
The following devices will be functional if R was compiled to use them (they exist but will return with a warning on other systems):
X11 The graphics device for the X11 windowing system
cairo_pdf, cairo_ps PDF and PostScript
devices based on cairo graphics.
svg SVG device based on cairo graphics.
png PNG bitmap device
jpeg JPEG bitmap device
bmp BMP bitmap device
tiff TIFF bitmap device
quartz The graphics device for the OS X
native Quartz 2d graphics system. (This is only functional on
OS X where it can be used from the R.app GUI and from the
command line: but it will display on the local screen even for a
remote session.)
If no device is open, using a high-level graphics function will cause
a device to be opened. Which device is given by
options("device") which is initially set as the most
appropriate for each platform: a screen device for most interactive use and
pdf (or the setting of R_DEFAULT_DEVICE)
otherwise. The exception is interactive use under Unix if no screen
device is known to be available, when pdf() is used.
It is possible for an R package to provide further graphics devices and several packages on CRAN do so. These include other devices outputting SVG and PGF/TiKZ (TeX-based graphics, see http://pgf.sourceforge.net/).
The individual help files for further information on any of the
devices listed here;
X11.options, quartz.options,
ps.options and pdf.options for how to
customize devices.
dev.interactive,
dev.cur, dev.print,
graphics.off, image,
dev2bitmap.
capabilities to see if X11,
jpeg png, tiff,
quartz and the cairo-based devices are available.
## Not run: ## open the default screen device on this platform if no device is ## open if(dev.cur() == 1) dev.new() ## End(Not run)
Generate 100 random numbers from a standard normal distribution and
assign to a variable r. Plot a histogram of the distribuiton of
numbers in r.
Save the histogram you just geneerated as a PNG file.
Plot the sin, cos and tan functions from the examples for x
beteen 0 and 2\(\pi\), but place them all in the same plot using
different colors and line styles to distinguish them. Add an appropriate
figure legend.
The statistics lectures will use statistical simulations extensively to illustrate key concepts. Basically, a simulaition is the use of random numbers drawn from some probability distribution to simulate experimental results. Simulations are a wonderful way to learn more about probability distributions and the concepts behind statistical inference.
set.seed(123) # fix random number seed for reproducibility
(x <- c("H", "T"))
sample(x, 20, replace=TRUE)
(x <- 1:6)
sample(x, 10, replace=FALSE)
Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
sample(x)
sample(x)
sample(x, size = 3)
sample(x, size=10, replace=TRUE, prob=c(10,10,10,1,1,1))
par(mfrow=c(2,2)) # arrange plots in a 2 x 2 grid
x <- rnorm(n=100, mean=0, sd=1)
hist(x, main="Normal distribution")
rug(x)
x <- rexp(n=100, rate = 1)
hist(x, main="Exponenital distribution")
rug(x)
x <- runif(n=100, min=0, max=10)
hist(x, main="Uniform distribution")
rug(x)
x <- rbeta(n=100, shape1 = 0.5, shape2=0.5)
hist(x, main="Beta distribution")
rug(x)
par(mfrow=c(2,2)) # arrange plots in a 2 x 2 grid
x <- rpois(n=100, lambda = 1)
barplot(table(x), main="Poisson distribution")
x <- rbinom(n=100, size=10, prob=0.3)
barplot(table(x), main="Bionmial distribution")
x <- rnbinom(n=100, size=10, prob=0.3)
barplot(table(x), main="Negative binomial distribution")
x <- rgeom(n=100, prob=0.3)
barplot(table(x), main="Geometric distribution")
qqplot¶Most commonly, we want to check if data comes from a normal
distribution. To do this, we can plot the sample quantiles against the
theoretical quantiles - if the data is normally distributed, this should
not deviate far from a straight line. To chack against a normal
distribution, use qqnorm(x) where x is the sample to be tested. More
generally, you can test if x comes from any distribution using
qqplot(x, dist).
options(repr.plot.width = 6)
options(repr.plot.height = 12)
par(mfrow=c(4,2))
n <- 250
x <- rnorm(n=n, mean=10, sd=2)
qqnorm(x, main="Normal disrribution")
qqline(x, col="red", lwd=2)
hist(x, probability = TRUE, main="Normal")
lines(density(x), col="red", lwd=2)
x <- rgamma(n=n, scale=1, shape=5)
qqnorm(x, main="Gamma distribution")
qqline(x, col="red", lwd=2)
hist(x, probability = TRUE, main="right skew")
lines(density(x), col="red", lwd=2)
x <- rt(n=n, df=3)
qqnorm(x, main="Student T distribution")
qqline(x, col="red", lwd=2)
hist(x, probability = TRUE, main="heavy tails")
lines(density(x), col="red", lwd=2)
x <- rbeta(n=n, shape1 = 0.5, shape2=0.5)
qqnorm(x, main="Beta distribution")
qqline(x, col="red", lwd=2)
hist(x, probability = TRUE, main="bimodal")
lines(density(x), col="red", lwd=2)
How frequently do we find significant differences (p < 0.05) between two groups when they are drawn from the same distribuiton if we use a t test?
As a bonus, this code snippet also shows a simple way to time chunks of code.
To repeat an action many times, we use loops. The most useful loop construct in R is the for loop. A simple example shows how to use it.
for (i in 1:4) {
print(c(i, i*i))
}
[1] 1 1
[1] 2 4
[1] 3 9
[1] 4 16
Loops many include an if condition to only run the code if some condition is met.
for (i in 1:4) {
if (i %% 2 == 0) { # is i an even number?
print(c(i, i*i))
}
}
[1] 2 4
[1] 4 16
Now that we know how to use loops, let’s write the actural simulation.
nexpts = 10000
nsamps = 100
s = 0
alpha = 0.05
ptm <- proc.time()
for (i in 1:nexpts) {
u <- rnorm(nsamps)
v <- rnorm(nsamps)
if (t.test(u, v)$p.value < alpha)
s = s + 1
}
print(s/nexpts)
proc.time() - ptm
[1] 0.047
user system elapsed
2.544 0.006 2.551
ptm <- proc.time()
x <- matrix(rnorm(2*nexpts*nsamps), nexpts, 2*nsamps)
grp <- factor(rep(0:1, c(nsamps, nsamps)))
print(sum(rowttests(x, grp)$p.value < alpha)/nexpts)
proc.time() - ptm
[1] 0.0507
user system elapsed
0.544 0.017 0.562
How frequently do we find significant differences (p < 0.05) between two groups when group 1 is drawn form a \(\mathcal{N}(0, 1)\) distribution and group 2 from an \(\mathcal{N}(0.2, 1)\) distribution? Write one version using explicit looping and another using vecgorized code, using 10000 experiments and 100 samples per group.