There are lots of great tutorials for R and RStudio available online. Here we have collated an introduction that will be relevant for our course, in which we will refer a lot to this comprehensive Introduction to R tutorial at r-tutor.com.
If you haven’t already, install R and RStudio here (make sure you install R before Rstudio).
R is the programming language we will use throughout this module. RStudio is an IDE (integrated development environment) - a program which allows you to use R but with higher quality graphics, better workspace management and an overall improved design. You can use R without Rstudio (but not RStudio without R), however RStudio will make your life much easier!
This document is written in R markdown (.Rmd) which outputs a nice html (or pdf) file. But you can follow this tutorial by creating a regular R script (.R), this will be easier. A detailed guide to R markdown can be found here if you want to learn more.
This will open a blank script that you can write in and save. When you run lines of the script, they will appear in the console with their output. You can type commands directly in the console instead, but making a script means you have a saved record of your code. To run code you have written in a script, you can use the ‘Run’ button at the top right of the script, or just use Ctrl+Enter to run line by line.
Read the page http://www.r-tutor.com/r-introduction.
Try creating some variables: for example, create a variable x that is 7^2 (49).
Use base R functions to compute the square root of x, 2x, sin(x), cos(x), x+3x^2, etc.
x <- 7^2
x
## [1] 49
sqrt(x)
## [1] 7
2*x
## [1] 98
sin
. You
can type help(sin)
as directed in the tutorial, or use the
search function of the help tab, which should be on the right hand side
of RStudio.A useful tip for when writing R scripts is to add comments alongside
the actual function code. You can do this with the #
symbol. Everything after #
in a line of R code is
considered a comment i.e. not actual code to be run. This is useful when
you come back to code you wrote in the past and can’t remember what it
does! For example
sqrt(x) # this function calculates the square root of x
## [1] 7
Next, advance to the ‘basic data types’ page at http://www.r-tutor.com/r-introduction/basic-data-types, where you can familiarize yourself with the basic data types: numeric, integer, complex, logical and character, by going through the pages and running some of the code.
For example:
y = sin(x)
What is the class of y? What is the value of y?
is.integer(10)
## [1] FALSE
Why do you think the answer is FALSE
?
Read the section on integers
10
, sin(x)
, "Hi"
and
"3.99"
.Play with complex numbers if you like, though we won’t need them for this module. Then move ahead to the logical section
Now move to the character section. This is important, because often data in character format (like which county someone lives in or which colour something is) are represented as ‘factors’ in R. Factors are important in epidemiological data analysis, as patterns can be different depending on the value of a variable (e.g. infection rates show different trends in different counties).
substr
function to extract the first and fourth
characters. Play around with pasting strings together using the
paste
function. See ?paste
or
help(paste)
if you would like to know what the options are
(as with any function).Moving to the next section, we look at vectors http://www.r-tutor.com/r-introduction/vector. Vectors are key. In epi and genomic analyses, we usually want to store our data in a data.frame (or, as we will see later, a similar object called a tibble). Each column is a vector - of integers, characters, numeric values, etc. Working with vectors will give you the basic tools you need to work with data frames.
c()
. Name your
vector mychars
. You can also look at how to define this
vector more efficiently using the function rep()
.As we have seen, in R the assignment operator is <-
.
But it is possible to use =
as well. Although
<-
is more robust overall, in practice =
will work in almost all circumstances. =
is also used to
give functions their arguments.
Assigning the value of a variable is different than testing
equality; this is true in any programming language. We test equality
with ==
. So you can type x=2
to assign 2 to x,
then test x==4
(FALSE) or x==2
(TRUE).
Comparisons work on vectors in a pointwise way. Use
==
to print which entries of mychars
are
“hi”.
Make some quick vectors like these:
x=1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x<5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
y=2*(1:10) # the : operator goes through 1, 2, 3, 4, ...
y
## [1] 2 4 6 8 10 12 14 16 18 20
c()
to join them together. If they have different formats,
what happens? With reference to the vector
arithmetic page in the tutorial, multiply and divide them. A word of
warning: be careful of the recycling rule! Personally, I prefer not to
use it - I find it too easy to get something that surprises me or to
make a silly error.Explore indexing with reference to the next page http://www.r-tutor.com/r-introduction/vector/numeric-index-vector. What happens when you use negative indices? This is actually a really convenient feature of R - it’s easy to exclude based on a condition.
which
. Use it with
the vectors you have created so far to find the indices where those
vectors meet conditions that you specify. For example
which(x<10 & x>6)
and so on.One of the things that can take some getting used to in R is the
difference between ( ) and [ ], particularly if you’re used to another
language like Matlab. So x(10) is NOT the 10th member of the vector x.
x(10) gives an error (try it!). Instead, x[10]
is the 10th
member of the vector x. While this can be annoying, it makes code easier
to read because it’s immediately clear when you are accessing a
component of a (vector or matrix) variable vs when you are applying a
function to an argument.
x[x<10]
or x[x>3]
. If you’re not sure
what this means, you can check out the next tutorial page http://www.r-tutor.com/r-introduction/vector/logical-index-vectorR can also name its vector entries, data frame columns, rows and other objects. This is extremely useful.
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
## [1] 2 4 6 8 10 12 14 16 18
## [1] 2 10 12 14 16 18 20 22 24 26 28 30
mymatrix
:## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
You can access the whole fist row of a matrix with
mymatrix[1,]
and the whole first column with
mymatrix[,1]
(or, instead of 1, another row or column).
Matrix multiplication is done with X*Y
where X and Y are
matrices. R has (of course) many built-in functions for eigenvalues,
inverting matrices, decompositions and so on. We won’t need anything
like that for this course, but many matrix functions are listed here if you
are interested.
Note the cbind
(column bind; appending columns to each
other to create a matrix) and rbind
(row-bind: append rows
on to your matrix) functions. These also work with data frames.
cbind
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 2 3 4 5
## [3,] 3 4 5 6
## [4,] 4 5 6 7
## [5,] 5 6 7 8
## [6,] 6 7 8 9
## [7,] 7 8 9 10
Now start with reading about data frames, here.
head(mtcars)
and str(mtcars)
. Use the
[,1]
or [,3]
or [,4]
etc. commands
to extract column and row slices from mtcars.There are loads of datasets included with R, you can see a list of
them with the data()
command. These can be very useful when
learning new techniques in R, and a lot of online tutorials will use
them too. There are also so many tutorials and introductions to data
frames out there, if you want to learn more. For example, there are more
examples and a motivation and discussion of data frames at https://www.stat.berkeley.edu/~spector/s133/R-2a.html.
mtcars$disp
. You can use this to make plots,
e.g. histograms hist(mtcars$disp)
or direct (non-ggplot -
you’ll learn what this means later) scatter plots like
plot(mtcars$mpg, mtcars$hp)
(miles per gallon and
horsepower).hist(mtcars$disp)
as.data.frame
. You can create a matrix by
glueing columns of vectors together,
e.g. mymatrix=cbind(0:10, 2:12)
, and you can set their
names with the colnames
function.You can also create a data frame with the data.frame
function, e.g.
mydf=data.frame(ID=1:10, names=paste("person", 1:10, sep=''), country ="UK")
mydf
## ID names country
## 1 1 person1 UK
## 2 2 person2 UK
## 3 3 person3 UK
## 4 4 person4 UK
## 5 5 person5 UK
## 6 6 person6 UK
## 7 7 person7 UK
## 8 8 person8 UK
## 9 9 person9 UK
## 10 10 person10 UK
Read a bit about factors in R here https://www.r-bloggers.com/data-types-part-3-factors/
How many unique values of mtcars$vs
are there?
(hint: use the function unique
)
Make your own copy of the mtcars data,
e.g. owncars= mtcars
. Convert owncars$vs
to a
factor using the function as.factor
.
Note: entering the command
as.factor(owncars$vs)
does not convert the type of
owncars$vs
to factor, it creates a new variable which is
owncars$vs
as a factor. In order to convert, we need to
define owncars$vs = as.factor(owncars$vs)
.
When you are working with things like the in-built datasets, it is usually a good idea to make your own copy like this before making any changes. Then you know that the original version is still in memory if you need it.
Read about lists (next two links; http://www.r-tutor.com/r-introduction/list and http://www.r-tutor.com/r-introduction/list/named-list-members).
mylist
containing three items: a
vector with value 1,2,3 named FredVector
, a matrix of zeros
named JustineMatrix
and a character variable with the value
“hi”, named greeting
.## $FredVector
## [1] 1 2 3
##
## $JustineMatrix
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
##
## $greeting
## [1] "hi"
If you have long lists with many entries, R is nice in that if you
start to type e.g. mylist$
into the console, it will pop up
with the names of the list elements. If you have a big list (or data
frame, matrix etc.) and it’s hard to read printed in the console, you
can also look in the ‘Environment’ tab in RStudio, or enter the command
View(mylist)
, View(mydf)
. This last one is
particularly useful for big data frames because you can easily and
non-permanently sort and filter columns (like in Excel).
Although a lot of functions are available in base R, there’s a whole
world of other functions available as R packages. These are free
libraries of code written by R users all over the world - and we’ll be
using several during this module. You can install new packages from
within RStudio by clicking ‘Tools’ -> ‘Install Packages…’, or simply
by typing
install.packages("<the package's name>")
.
install.packages("ggplot2")
You’ll need to have internet access to download packages. Using the
install.packages()
function as above will download them
from CRAN (Comprehensive R Archive Network). However, you only have to
install a package once. When you want to use it again in the future, you
can use the library()
function to tell R to access it from
your local files.
library(ggplot2)
It’s also possible to download packages directly from other people’s Github repositories. The commands are different but the approach is very similar. We won’t need to do this for the module.
As you might have seen from the plots you made earlier, the in-built R plotting tools aren’t the most visually pleasing (at least in their default form). A really popular alternative is ggplot. We can’t cover all of ggplot here, but we will give a brief intro. (Like with data frames, there are many ggplot tutorials online if you want to learn more).
The ggplot (and wider ‘tidyverse’ packages) syntax is very different to basic R plotting, and it can take some getting used to. But once you do, it’s much easier to make complex and high quality figures.
help(mtcars)
to remind yourself of the
different column namesIn the next few steps we’ll make a scatterplot of miles/gallon
mpg
against gross horsepower hp
.
ggplot(mtcars, aes(x=hp, y=mpg))
aes
stands for ‘aesthetic mapping’ - each aesthetic is a
mapping between something you see (here, the x- and y-axes) and the
variable you want to use for it. However, as you’ve probably noticed,
this command gave us the right axes, but didn’t plot any data!
geom_point
:ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point()
Geometric objects (geom
s) are the actual marks or lines
that make up our plot: geom_point
for points,
geom_line
for lines, geom_boxplot
for boxplots
and many
more.
There’s a whole range of further customizations we can make to our ggplot.
gg <- ggplot(mtcars, aes(x=hp, y=mpg)) +
geom_point(aes(col=carb), size=3) +
geom_smooth(method="lm", col="firebrick", linewidth=2) +
coord_cartesian(xlim=c(50, 300), ylim=c(10, 30)) +
labs(title="Miles per gallon against gross horsepower", subtitle="From mtcars dataset", y="Mpg (US)", x="Horsepower (gross)", caption="1974 Car Trends") + theme_minimal()
plot(gg)
## `geom_smooth()` using formula = 'y ~ x'
You can plot using the ggplot command directly, as we did in steps 2 and 3, or you can assign them to a variable as I did here.
vs
. For
example, use ggplot with data owncars
instead of
mtcars
. Again, set the aesthetic aes
options
to x=hp,y=mpg
to make a scatter plot of mpg vs hp. Colour
the plot by the variable vs
. Above, you made a copy of
mtcars and changed vs to a factor. How does this affect the plots in
ggplot?We can define our own functions with the syntax
funName <- function(argumentlist){lines of code}
, like
this:
mysquare <- function(x) { return(x^2) }
mysquare(10)
## [1] 100
Or a bigger one:
myfunction <- function(x,y) {
xs=x^2
ys=y^2
print("hi")
return(xs+ys)
}
myfunction(1,2)
## [1] "hi"
## [1] 5
Unlike some other languages, R doesn’t return more than one value. If
you need to return a lot of different things, you have to put them in a
list; you can use a syntax like
return(list( firstthing=xs, secondthing=ys))
.
You can read more about functions, environments and so on here: https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial#user
getv
that takes one argument,
called k
, and returns a vector whose entries are
k+1, k+2, ..., k+10
. Test your function with
k=1000
.## [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010
See https://www.r-bloggers.com/how-to-write-the-first-for-loop-in-r/.
The syntax of the for
loop in R is:
for (n in 1:10) {
print(2*n)
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
R loops over the variable in the ( ), performing all commands inside
{ } for each value of the variable. Here, that means R loops through
n=1
, n=2
, …, n=10
, and prints the
value of 2*n
for each n
.
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
## [16] 987 1597 2584 4181 6765
A nice feature of R is explained here: https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/. These functions (lapply, sapply, vapply) often mean that you don’t have to explicitly write loops. You don’t need to know about these for this module, but they are a useful tool if you are planning on using R beyond this.
We can create some random samples from different distributions using
commands runif
(random uniform), rnorm
(random
normal), etc. We often need random numbers for the statistical
approaches used in genomic and epidemiological analyses - for example in
MCMC which we will touch on in this module and which is the focus of
other SISMID modules.
x=rnorm(10)
x
## [1] -0.67600883 0.70229648 0.05057819 -0.59845211 0.36900525 -0.84212223
## [7] 1.84340457 -0.08383091 -0.25703431 -0.48986945
y=x+rnorm(10,mean=50,sd=.1)
y
## [1] 49.42933 50.62009 49.99346 49.42930 50.30133 49.09431 51.69213 49.62173
## [9] 49.74392 49.48761
cor(x,y) # finds the correlation between x and y
## [1] 0.9921876
You can get reproducible results, even with random numbers, by setting the ‘seed’ in the random number generator to a specific value. Since random numbers in computers aren’t truly random (they are ‘pseudorandom’), this allows you to reset to the same seed later, so that you can get exactly the same random sample again.
set.seed(1301)
rnorm(5)
## [1] 0.1606469 -0.3419885 0.3508287 0.7388312 -1.6348454
set.seed(3)
rnorm(5)
## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828
set.seed(1301)
rnorm(5)
## [1] 0.1606469 -0.3419885 0.3508287 0.7388312 -1.6348454
rnorm
and runif
to create some random
vectors: normally and uniformly distributed. Use plot
to
make some simple scatter plots of them.You can read files into R from any directory using a command like
read.csv("C:/MyFolder/MyFile.csv)
for .csv files,
read.table(...)
for .txt files and so on. But it can be
annoying to type the whole file path every time, particularly if your
files are buried inside lots of folders. A good solution is to put all
your data, code etc. for your project in one folder (let’s say it’s
“C:/MyAccount/Documents/SISMID2021”) and then use
setwd("C:\MyAccount\Documents\SISMID2021")
. This sets the
SISMID2021 folder as your ‘working directory’, and now to load or save
any files in the folder, you only need to use
read.csv("MyFile.csv)
.
getwd()
. If it’s not the folder you want it to be, set it
using setwd("<your file path here>")
.Note: if you want to make a new folder to have as
your working directory, make sure you do this outside of R first in the
usual way on your operating system. setwd()
cannot make
folders for you.
Save the variable y
you defined in the last section
to a text file using the command
write.table(y, file = "my_variable.txt")
. Check that your
file wrote to the folder you set as your working directory.
Remove variable y
from your R workspace using the
command rm(y)
. And then load your variable back in from
your file:
y=read.table("my_variable.txt")
Some other ways to view variables, data.frames etc. are:
head(y)
## [1] 49.42933 50.62009 49.99346 49.42930 50.30133 49.09431
str(y)
## num [1:10] 49.4 50.6 50 49.4 50.3 ...
?read.table
.
?read.table
also gives you the help for
read.csv
, which is basically the same as
read.table
but with slightly different options.In R, you don’t always need to specify all inputs for functions. Some
inputs are always required, like x
in
write.table
. But some inputs have a default setting that
will be assumed unless you tell R otherwise. For example,
append = FALSE
in write.table
: unless you set
append to TRUE, R will assume that you want to overwrite any existing
file with that name, rather than appending to the end of it.
That’s it for this tutorial. Hopefully this has given you a good introduction to R, you should now be ready to tackle our SISMID module and beyond.
If you want more R tutorials or ever get stuck when using R, remember the internet is your friend. There are almost endless resources available to help, from forums like Stack Overflow’s to detailed tutorials. For just one resource, check out https://www.datacamp.com/community/tags/r-programming.