Getting genomic sequences into R

The ape package has a helpful function for reading .fasta sequence files into R.

library(ape)
FMDseq <- read.FASTA(file = "AU-FMD-sequences.fasta")
FMDseq
## 50 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 7667 
## 
## Labels:
## Case_1
## Case_2
## Case_3
## Case_4
## Case_5
## Case_6
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.253 0.282 0.257 0.209 
## (Total: 383.35 kb)

Note that, when we print an individual sequence, it shows numeric values rather than a, c, t, g.

head(FMDseq$Case_1)
## [1] 28 28 18 18 28 18

The ape package has saved the sequences in byte format for efficiency. If needed, we can request to print them as characters.

head(as.character(FMDseq)$Case_1)
## [1] "c" "c" "t" "t" "c" "t"

Working with genomic sequences in R

There are many helpful R packages for analyzing genomic sequences.

library(ape)

The ape (Analyses of Phylogenetics and Evolution) package does more than read in sequences. We can also compute sequence distances, plot and manipulate phylogenetic trees, and more. We’ll see more from this package later in the module. For now, why not take a look at the dist.dna() function for computing distances between the FMD sequences. How about image()?

library(bioseq)

bioseq is another useful package, specifically for sequence manipulation. In contains in-built functions for clustering sequences, among other things. Take a look at the seq_cluster() function. Note, bioseq requires its own sequence format, obtained through e.g. as_dna(FMDseq)

Accessing and analyzing the epidemiological data

Remember that we can read in other file types with e.g. 

FMDepi <- read.csv("AU-FMD-epidata.csv")
head(FMDepi)
##   herd.name herd.id     latitude longitude   herd.type n.animals day.onset
## 1    Case_1  109047  -36.5112796  144.8273  pigs large      3209         4
## 2    Case_2  104058 -36.49945583  144.9456       dairy       127         8
## 3    Case_3  120734 -36.46383352  145.0562       sheep      1235         8
## 4    Case_4   80984  -36.4563005  144.9038 mixed sheep      1705         9
## 5    Case_5  103912 -36.37673931  144.8036       dairy       213         9
## 6    Case_6  103959 -36.53691725  144.7564       dairy       561        10
##   day.dx dx.reason day.culling.start
## 1     22        TP                25
## 2     34       DCP                37
## 3     27        SP                29
## 4     26        SP                30
## 5     34       ARP                37
## 6     35       ARP                38

Or as a tibble

library(tidyverse)
FMDepi <- read_csv("AU-FMD-epidata.csv")
head(FMDepi)

We could generate plots of this data using our favorite plotting commands e.g. with ggplot.

For the latitude/longitude data, several mapping packages exist in R. One such is

library(ggmap)

Although note, you will need to obtain a (free) Google API https://mapsplatform.google.com/ if you wish to use the Google Maps option.

The maps package is a good, simpler, alternative.

library(maps)

Combining the above

You may want to consider how we could write a mathematical expression to combine these different sources of information (genomic distance, geographic location, times, …) so as to ascertain the most likely infectors of each herd. This will be the focus of much of this module.

Some ideas: what if we thought it was only possible for an infecting herd to be within x SNPs, y distance and z time of an infectee herd? What if we wanted to make it more likely that a larger herd transmitted infection than a smaller herd? Should the date of culling affect the chance that a herd transmits infection?