The ape
package has a helpful function for reading
.fasta sequence files into R.
library(ape)
FMDseq <- read.FASTA(file = "AU-FMD-sequences.fasta")
FMDseq
## 50 DNA sequences in binary format stored in a list.
##
## All sequences of same length: 7667
##
## Labels:
## Case_1
## Case_2
## Case_3
## Case_4
## Case_5
## Case_6
## ...
##
## Base composition:
## a c g t
## 0.253 0.282 0.257 0.209
## (Total: 383.35 kb)
Note that, when we print an individual sequence, it shows numeric values rather than a, c, t, g.
head(FMDseq$Case_1)
## [1] 28 28 18 18 28 18
The ape
package has saved the sequences in byte format
for efficiency. If needed, we can request to print them as
characters.
head(as.character(FMDseq)$Case_1)
## [1] "c" "c" "t" "t" "c" "t"
There are many helpful R packages for analyzing genomic sequences.
library(ape)
The ape
(Analyses of Phylogenetics and Evolution)
package does more than read in sequences. We can also compute sequence
distances, plot and manipulate phylogenetic trees, and more. We’ll see
more from this package later in the module. For now, why not take a look
at the dist.dna()
function for computing distances between
the FMD sequences. How about image()
?
library(bioseq)
bioseq
is another useful package, specifically for
sequence manipulation. In contains in-built functions for clustering
sequences, among other things. Take a look at the
seq_cluster()
function. Note, bioseq
requires
its own sequence format, obtained through
e.g. as_dna(FMDseq)
Remember that we can read in other file types with e.g.
FMDepi <- read.csv("AU-FMD-epidata.csv")
head(FMDepi)
## herd.name herd.id latitude longitude herd.type n.animals day.onset
## 1 Case_1 109047 -36.5112796 144.8273 pigs large 3209 4
## 2 Case_2 104058 -36.49945583 144.9456 dairy 127 8
## 3 Case_3 120734 -36.46383352 145.0562 sheep 1235 8
## 4 Case_4 80984 -36.4563005 144.9038 mixed sheep 1705 9
## 5 Case_5 103912 -36.37673931 144.8036 dairy 213 9
## 6 Case_6 103959 -36.53691725 144.7564 dairy 561 10
## day.dx dx.reason day.culling.start
## 1 22 TP 25
## 2 34 DCP 37
## 3 27 SP 29
## 4 26 SP 30
## 5 34 ARP 37
## 6 35 ARP 38
Or as a tibble
library(tidyverse)
FMDepi <- read_csv("AU-FMD-epidata.csv")
head(FMDepi)
We could generate plots of this data using our favorite plotting
commands e.g. with ggplot
.
For the latitude/longitude data, several mapping packages exist in R. One such is
library(ggmap)
Although note, you will need to obtain a (free) Google API https://mapsplatform.google.com/ if you wish to use the Google Maps option.
The maps
package is a good, simpler, alternative.
library(maps)
You may want to consider how we could write a mathematical expression to combine these different sources of information (genomic distance, geographic location, times, …) so as to ascertain the most likely infectors of each herd. This will be the focus of much of this module.
Some ideas: what if we thought it was only possible for an infecting herd to be within x SNPs, y distance and z time of an infectee herd? What if we wanted to make it more likely that a larger herd transmitted infection than a smaller herd? Should the date of culling affect the chance that a herd transmits infection?