Reconstructing phylogenetic trees

In this exercise we will explore several approaches for constructing phylogenetic trees from timed sequence data in R: distance-based methods, maximum parsimony and maximum likelihood.

Introduction: getting started

  1. First, load up the packages we will need for this tutorial and set the seed (so that if you repeat this analysis you’ll get the same output).
library(stats)
library(ade4)
library(ape)
library(adegenet)
library(phangorn)
library(ape)
set.seed(8914) # You can put whatever number you want into the set.seed function

The beginning of this tutorial is adapted from the Adegenet package‘s tutorial ’Genetic data analysis using R: introduction to phylogenetics’, and we will use the same toy dataset to test out the methods. These data are US seasonal influenza (H3N2) sequences downloaded from Genbank (http://www.ncbi.nlm.nih.gov/genbank/), collected 1993-2008. The 80 sequences have already been aligned, and come from the hemagglutinin (HA) segment of the genome.

  1. Download the US flu data from the module webpage. As in exercise 2, the data consist of a .fasta sequence file and a .csv file with additional metadata (year and location of sequencing).

  2. Read the US flu data into R. Explore the data and confirm that the ‘accession numbers’ (unique id numbers) in the metadata match the names of the sequences. It’s always a good idea to do this quick check.

seqs <- read.dna(file = "usflu.fasta", format = "fasta")
meta <- read.csv("usflu.annot.csv", header = TRUE, row.names = 1)

seqs
## 80 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 1701 
## 
## Labels:
## CY013200
## CY013781
## CY012128
## CY013613
## CY012160
## CY012272
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.335 0.200 0.225 0.239 
## (Total: 136.08 kb)
class(seqs)
## [1] "DNAbin"
tail(meta)
##    accession year                              misc
## 75  EU516212 2007      (A/California/33/2007(H3N2))
## 76  FJ549055 2008        (A/Illinois/14/2008(H3N2))
## 77  EU779498 2008     (A/Mississippi/01/2008(H3N2))
## 78  EU779500 2008         (A/Indiana/02/2008(H3N2))
## 79  CY035190 2008 (A/Pennsylvania/PIT43/2008(H3N2))
## 80  EU852005 2008           (A/Texas/06/2008(H3N2))
table(meta$year) # We see that we have 5 sequences per year 1993-2008. 
## 
## 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
##    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5
all(meta$accession == rownames(seqs)) # Should return TRUE if the ids all match
## [1] TRUE

Notice that the sequences are stored as DNAbin obects; this is an efficient way to store sequences using bytes instead of character vectors (which take up less memory). We have 80 sequences each of length 1701 nucleotides.

Introduction: distance-based phylogenies

In this section we will tackle 3 main tasks for creating distance-based phylogenies:

  1. compute pairwise genetic distance between all 80 flu sequences

  2. represent these distances on a tree

  3. estimate the molecular clock

Task (i): compute distances.

  1. Use the dist.dna() function in the ape package to compute the genetic distance between the US flu sequences. Enter help(dist.dna) to read about the various distance measures that can be used.
D <- dist.dna(seqs, model = "TN93")
length(D) # D is of length (80*80/2 - 80) i.e. it contains the distance from each sequence to every other sequence.
## [1] 3160

Here I chose the TN93 (Tamura and Nei 1993) model which allows for: different rates of transitions (A↔︎G or C↔︎T) and transversions (all other base pair mutations), unequal base frequencies (e.g. the proportion of A is not equal to the proportion of T), and between-site variation in the rate at which substitutions occur. Choice of evolutionary model is unfortunately beyond the scope of this module, but a good resource is The Phylogenetic Handbook, chapter 10.

In practice, a model such as TN93, which is fairly simple but still flexible (varying transition/transversion rates, unequal base frequencies and between-site substitution rate variation), will often be a good choice. You could try re-running this analysis with different evolutionary models to see how much impact it has on your trees.

Visualizing and summarizing the distance matrix D can be a difficult task, particularly if you have a large number of sequences. We don’t have too many sequences, so a heatmap of pairwise distances shouldn’t be too bad.

  1. Plot a heatmap of your pairwise distances
matD <- as.data.frame(as.matrix(D))
table.paint(matD, cleg = 0, clabel.row = 0.5, clabel.col = 0.5)

We can see some kind of genetic structure from the heatmap (patches of similar shades), which matches the sequences being approximately ordered by time. However, it’s hard to read any actual evolutionary relationships from this figure. For this, we need a phylogenetic tree…

Task (ii): build trees.

Phylogenetic trees help us to understand the genetic distance between samples, as well as the overall structure of a set of samples. However, they still involve simplification of the rich information contained within sequences and therefore some information will, in practice, always be lost in translation. There are also many different approaches for building trees, and in all but the simplest case you will get a different answer depending on which method you use.

  1. Build and plot a phylogenetic tree for the US flu data using the nj (Neighbour-Joining) algorithm in the ape package.
my_tree <- nj(D)
my_tree
## 
## Phylogenetic tree with 80 tips and 78 internal nodes.
## 
## Tip labels:
##   CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
## 
## Unrooted; includes branch lengths.
class(my_tree)
## [1] "phylo"
plot(my_tree, cex = 0.5) # cex changes the label text size
title("NJ tree for US flu data")

# We can 'ladderize' the tree - gives the tree a nicer (in my opinion) layout - check out how the plot changes. It's exactly the same tree, just with the plotting of branches reordered
my_tree <- ladderize(my_tree)
plot(my_tree, cex = 0.5)
title("NJ tree for US flu data")

There are many ways to customize phylogeny plots to display additional data and/or make interpretation of the tree easier.

  1. Plot a version of your Neighbour-Joining tree that labels and colours tips by their collection year (included in the metadata).
# first, remove the tips so we can manually add colourful ones: 
 plot(my_tree, show.tip = FALSE) 
 title("Unrooted NJ tree for US flu data")
 # Create a custom colour palette (you can choose any colours here):
 myPal <- colorRampPalette(c("red", "pink", "purple", "blue"))
 # Add custom tip labels with the year
 tiplabels(meta$year, bg = num2col(meta$year, col.pal = myPal),
 cex = 0.5)
 # Turn your custom colours into a nice legend with 5 categories
 temp <- pretty(min(meta$year):max(meta$year), 5)
 legend("topright", fill = num2col(temp, col.pal = myPal),
 leg = temp, ncol = 2, cex=0.8)

 # If you find that your legend is obstructing the tree, you can try changing the legend size (cex) or the legend position ("topright")

Notice in the above figure that the x-axis does NOT correspond to ancestry through time or evolutionary distance. This is because we have plotted an unrooted tree i.e. we did not provide the nj algorithm with a taxon defined as the ‘most ancient’ split in the tree to root it by. When plotting unrooted trees, it is best to clearly display them as such. To do this you can add type = "unrooted" to the plot command. In the below figure, the tree is no longer arranged left->right and so it’s more obvious that you are showing an unrooted tree.

is.rooted(my_tree)
## [1] FALSE
plot(my_tree, type = "unrooted", show.tip = FALSE)
title("Unrooted NJ tree for US flu data")
tiplabels(meta$year, bg = num2col(meta$year, col.pal = myPal),
 cex = 0.5)
legend("topright", fill = num2col(temp, col.pal = myPal),
 leg = temp, ncol = 2, cex=0.8)

Since we do have years for our sequences (in the metadata) it is possible, and probably preferred, to create a rooted phylogeny. To do this, any of the oldest (1993) sequences would be a good choice of root.

  1. Root and plot your nj phylogeny. Remember that the sequences are roughly ordered by year, so you want to choose one of the early ones to root by (you can confirm this by looking at the object meta).
# Here I root by the first sequence, but you could pick any of the first 5 (all 1993). You can also use e.g. 'out=1' to select the first sequence, rather than using the name. 
my_tree2 <- root(my_tree, out = "CY013200", resolve.root = TRUE)
# Ladderize again for that nice plot layout
my_tree2 <- ladderize(my_tree2)

plot(my_tree2, show.tip = FALSE, edge.width = 2)
title("Rooted NJ tree for US flu data")
# We add some transparency (transp) to the label boxes so we can see the ones underneath
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5)
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill = num2col(temp, col.pal = myPal),
0.7, leg = temp, ncol = 2, cex=0.8)

In the rooted tree, the x-axis now represents evolutionary distance back to the root node. Please note, the x-axis is still not time - although, if we imagine that mutations accrue in relatively clock-like fashion, we will generally expect newer tips to be towards the right side of the tree and vice versa.

There are a wide range of other approaches than nj available for building phylogenetic trees in R. If you get time, check out the extension tasks section below to explore several of these alternatives.

Task (iii): estimating a molecular clock

Rooted trees, such as the one we just made, can be used to estimate rates of evolution. Unlike a regular clock which measures the passing of seconds, minutes or hours, a ‘molecular clock’ measures the accumulation of genetic distance. To estimate the clock rate of our rooted tree, we can perform a regression over all tips of the number of mutations from the root against the time since the root.

  1. Estimate the clock rate of your rooted influenza tree
# First, we plot the regression of mutations against time:

# How many mutations between each tip and the root?
mutFromRoot <- as.matrix(dist.dna(seqs, model="N"))[1,]
# How many years passed between each tip and the root?
yearFromRoot <- meta$year-meta$year[1]
plot(mutFromRoot~yearFromRoot, xlab="Years from the root",
ylab="Mutations from the root", main="US flu data molecular clock")

# Then we perform a regression of these quantities, and add to the plot. It should fit the points.
mclock <- lm(mutFromRoot~-1+yearFromRoot)
abline(mclock, col="blue",lwd=2)

# What are the regression coefficients?
summary(mclock)
## 
## Call:
## lm(formula = mutFromRoot ~ -1 + yearFromRoot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.327  -1.577   1.004   6.386  13.000 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## yearFromRoot  7.73274    0.07443   103.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.86 on 79 degrees of freedom
## Multiple R-squared:  0.9927, Adjusted R-squared:  0.9926 
## F-statistic: 1.079e+04 on 1 and 79 DF,  p-value: < 2.2e-16
mclock$coefficients
## yearFromRoot 
##     7.732742
mclock$coefficients/ncol(seqs)
## yearFromRoot 
##  0.004545998
365/mclock$coefficients
## yearFromRoot 
##     47.20189

The coefficient of yearFromRoot is ~7.7. This is our estimate of the number of substitutions per year for the HA segment of the H3N2 influenza genomes. Use this result to answer the following questions:

  • What is the substitution rate per site per year?
  • Therefore, how many days would you expect it to take to accumulate 1 mutation in a transmission chain?
  • Given your answers above and the fact that the generation time of influenza is 2-3 days, do you think that reconstructing influenza transmission trees from HA sequences is a good approach?

Introduction: maximum parsimony phylogenies

An alternative to a distance-based phylogeny is a maximum parsimony phylogeny.

In general terms, the scientific principle of parsimony can be described as a belief that the best explanation of your data is equal to the simplest possible explanation that is consistent with that data. Maximum parsimony phylogenetic reconstruction therefore seeks trees that minimize the number of evolutionary changes (substitutions) between ancestor and descendant. This makes for a simple and easy-to-motivate methodology, that is most appropriate for sequences with low divergence i.e. most sequences are very similar to one another.

In an ideal world we would search over the space of all possible trees to find the one which is most parsimonious. However, in reality this is usually computationally impossible unless you have very few tips in your tree. Many algorithms have been designed to instead search broadly over this space in a clever way - to try and find a decently parsimonious tree without needing to check them all. These algorithms generally take a strategy of (1) pick an initial tree and calculate it’s parsimony, (2) propose some small changes to the tree and calculate the new tree’s parsimony, (3) keep the new tree if it leads to a better parsimony, and (4) repeat until the parsimony score stops improving.

Parsimony-based phylogenetic reconstruction is implemented in the package phangorn. It requires a tree (in ape’s format, i.e. a phylo object) and the original DNA sequences in phangorn’s own format, phyDat. We convert the data and generate a tree to initialize the method:

  1. Create a maximum parsimony tree for the US flu data using the package phangorn. This requires an initial tree (you can use ape to make a neighbor-joining tree for this, as in the previous section) as well as the original DNA sequences, which must be converted to phangorn’s phyDat format, as follows:
seqs2 <- as.phyDat(seqs)
class(seqs2)
## [1] "phyDat"
seqs2
## 80 sequences with 1701 character and 269 different site patterns.
## The states are a c g t
#make the initial tree
tree_init <- nj(dist.dna(seqs, model = "raw"))
tree_init
## 
## Phylogenetic tree with 80 tips and 78 internal nodes.
## 
## Tip labels:
##   CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
## 
## Unrooted; includes branch lengths.
#The parsimony of the inital tree is given by:
parsimony(tree_init, seqs2)
## [1] 422
# Now perform the parsimony search
tree_parsim <- optim.parsimony(tree_init, seqs2)
## Final p-score 420 after  2 nni operations
tree_parsim 
## 
## Phylogenetic tree with 80 tips and 76 internal nodes.
## 
## Tip labels:
##   CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
## 
## Unrooted; no branch lengths.

The tree we get out will be unrooted without branch lengths. Here we find that the parsimony of the final tree is not much lower than the nj tree we started with, but that won’t always be the case. We can think of this as the methods (nj and optim.parsimony) relatively well agreeing on the best tree, which tends to happen for datasets with low divergence. As we have here!

  1. Plot your maximum parsimony tree (same approach as for the nj trees).
plot(tree_parsim, type = "unr", show.tip = FALSE, edge.width = 2)
title("Maximum-parsimony tree for US flu data")
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5, fg = "transparent")
temp <- pretty(1993:2008, 5)
legend("topright", fill = transp(num2col(temp, col.pal = myPal),
0.7), leg = temp, ncol = 2, bg = transp("white"), cex=0.8)

Introduction: maximum likelihood phylogenies

Our final phylogenetic tree building approach is maximum likelihood. The idea here is very similar to maximum parsimony, in that we try to search over the space of all possible trees in a smart way to find the ‘best’ one. Except, this time we want to maximize some pre-defined likelihood function rather than maximizing parsimony.

A big benefit of this approach is that the likelihood to be maximized can be defined by any model of sequence evolution we choose, which makes it much more flexible. As well as optimizing over the space of trees, we can also simultaneously optimize over the parameters of our chosen model - to find the best tree and the best model parameters at the same time. A drawback, however, is that this means we need to decide what is the most appropriate model to use.

A common choice is the GTR + \(\Gamma\)(4) model (GTR = global reversible time). In this model, all possible substitutions (all transitions, transversions) are allowed to have different rates as well as the substitution rate being allowed to vary across sites. It is therefore a very flexible model.

Like maximum parsimony, maximum likelihood tree reconstruction can be performed with the phangorn package in R. Again, we require an initial tree and the DNA sequences.

  1. Create an initial tree for the maximum parsimony analysis from the US flu data using the package phangorn.
# phyDat format, same as before:
seqs2 <- as.phyDat(seqs)

#make the initial tree
tree_init <- nj(dist.dna(seqs, model = "TN93"))

#To begin the optimization procedure, we need to know the likelihood of the initial tree
pml(tree_init, seqs2, k = 4)
## model: JC+G(4) 
## loglikelihood: -5641.785 
## unconstrained loglikelihood: -4736.539 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 1 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25

We find an initial log likelihood of -5641.785. If using different data/seed, you may find that the likelihood of your initial tree is NaN. This needs to be fixed, but isn’t a big problem. It is caused by missing data (missing or error bases) in the sequences. We should remove these before continuing. Even if you didn’t get an NaN likelihood, I recommend running the code to remove them anyway, just to avoid any possible problems later.

  1. Find the location of the missing bases in the data. Plot them, and remove them.
# First, find the location of the missing/incorrect bases (anything that isn't a,t,g,c)
na.posi <- which(apply(as.character(seqs), 2, function(e) any(!e %in%
                                                                 c("a", "t", "g", "c"))))
na.posi
##   [1]    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
##  [16]   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
##  [31]   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45
##  [46]   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60
##  [61]   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75
##  [76]   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90
##  [91]   91   92   93   94   95   96   97   98   99  259  334  336  460  721 1121
# Plot them
temp <- apply(as.character(seqs), 2, function(e) sum(!e %in% c("a", "t", "g", "c")))
plot(temp, type = "l", col = "red", xlab = "Position in HA segment of sequence",
ylab = "Number of NAs")

# Remove them
seqs3 <- seqs[, -na.posi]
seqs3
## 80 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 1596 
## 
## Labels:
## CY013200
## CY013781
## CY012128
## CY013613
## CY012160
## CY012272
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.340 0.197 0.226 0.238 
## (Total: 127.68 kb)
# Aha! Looks much better now. Before:
table(as.character(seqs))
## 
##     -     a     c     g     k     m     r     s     t     w 
##   147 45595 27170 30613     1     2     1     1 32549     1
# After:
table(as.character(seqs3))
## 
##     a     c     g     t 
## 43402 25104 28828 30346
# Don't forget! Change to phyDat format
seqs4 <- as.phyDat(seqs3)

It turns out that most of the missing bases were towards the beginning of the alignment: this is usual.

Now we’re ready to search for a maximum likelihood tree.

  1. Find a maximum likelihood phylogeny for the US flu data, using optim.pml. Remember to start with a new initial tree, using your new sequences with missing data removed.
#Make the initial tree
tree_init <- nj(dist.dna(seqs3, model = "TN93"))
#Find its (log) likelihood
lh_init <- pml(tree_init, seqs4, k = 4)
lh_init
## model: JC+G(4) 
## loglikelihood: -5184.119 
## unconstrained loglikelihood: -4043.367 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 1 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25
# Find the max. likelihood tree. We optimize the tree (optNni = TRUE), the base frequencies (optBf=TRUE), the substitution rates (optQ=TRUE) and assume a gamma distribution for the variation in substitution rates across sites (optGamma = TRUE).
fit <- optim.pml(lh_init, optNni = TRUE, optBf = TRUE, optQ = TRUE, optGamma = TRUE)
## optimize edge weights:  -5184.094 --> -5166.996 
## optimize base frequencies:  -5166.996 --> -5121.313 
## optimize rate matrix:  -5121.313 --> -4933.871 
## optimize shape parameter:  -4933.871 --> -4919.646 
## optimize edge weights:  -4919.646 --> -4919.326 
## optimize topology:  -4919.326 --> -4916.187  NNI moves:  2 
## optimize base frequencies:  -4916.187 --> -4915.89 
## optimize rate matrix:  -4915.89 --> -4915.868 
## optimize shape parameter:  -4915.868 --> -4915.867 
## optimize edge weights:  -4915.867 --> -4915.867 
## optimize topology:  -4915.867 --> -4915.867  NNI moves:  0 
## optimize base frequencies:  -4915.867 --> -4915.866 
## optimize rate matrix:  -4915.866 --> -4915.866 
## optimize shape parameter:  -4915.866 --> -4915.866 
## optimize edge weights:  -4915.866 --> -4915.866 
## optimize base frequencies:  -4915.866 --> -4915.866 
## optimize rate matrix:  -4915.866 --> -4915.866 
## optimize shape parameter:  -4915.866 --> -4915.866 
## optimize edge weights:  -4915.866 --> -4915.866 
## optimize base frequencies:  -4915.866 --> -4915.866 
## optimize rate matrix:  -4915.866 --> -4915.866 
## optimize shape parameter:  -4915.866 --> -4915.866 
## optimize edge weights:  -4915.866 --> -4915.866
fit
## model: F81+G(4) 
## loglikelihood: -4915.866 
## unconstrained loglikelihood: -4043.367 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 0.2829846 
## 
## Rate matrix:
##           a          c         g          t
## a 0.0000000  2.3836329 8.2983982  0.8563163
## c 2.3836329  0.0000000 0.1485362 10.0779972
## g 8.2983982  0.1485362 0.0000000  1.0000000
## t 0.8563163 10.0779972 1.0000000  0.0000000
## 
## Base frequencies:  
##         a         c         g         t 
## 0.3415991 0.1953602 0.2243303 0.2387104
names(fit) # what are the components of the results object?
##  [1] "logLik"    "inv"       "k"         "shape"     "Q"         "bf"       
##  [7] "rate"      "siteLik"   "weight"    "g"         "w"         "eig"      
## [13] "data"      "model"     "INV"       "ll.0"      "tree"      "lv"       
## [19] "call"      "df"        "wMix"      "llMix"     "ASC"       "site.rate"

The results object (here named fit) contains not just the optimal tree (fit$tree), but also other useful info. Take some time to explore these. How much did your likelihood improve from initial to optimal?

Before choosing this as our favourite maximum likelihood tree, we can confirm that the optimized tree is better than the initial one using ANOVA and AIC:

# ANOVA
anova(lh_init, fit)
## Likelihood Ratio Test Table
##   Log lik.  Df Df change Diff log lik. Pr(>|Chi|)    
## 1  -5184.1 158                                       
## 2  -4915.9 166         8        536.51  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare AIC
AIC(lh_init)
## [1] 10684.24
AIC(fit)
## [1] 10163.73

Indeed, ANOVA has a good p-value and the AIC of the new tree is better than the initial one. Great news! We have obtained an improved tree.

  1. Lastly, plot and save your maximum likelihood phylogeny for the US flu data (just like for the other methods).
tree_ml <- root(fit$tree, 1)
write.tree(tree_ml, file = "myflutree.nwk")


plot(tree_ml, show.tip = FALSE, edge.width = 2)
title("Maximum-likelihood tree for US flu data")
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5, fg = "transparent")
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill = transp(num2col(temp, col.pal = myPal),
0.7), leg = temp, ncol = 2, cex=0.8)

What patterns and observations do you see in your tree?

Overall, we find that for this data set, all methods return fairly similar trees. Our data is relatively simple, with few samples and low diversity, so this is not unexpected. For larger and more complex data, you will likely find much more stark differences between different methods. It’s usually a good idea to test out a few different tree building approaches before drawing any conclusions about your data.

How do we choose between methods?

  • Distance based methods. Pros: the fastest method, and fairly flexible too (can use different distance measures). Cons: no real way to compare between different distance-based methods to find the ‘best’.

  • Maximum parsimony. Pros: intuitive, works well for data with small amounts of genetic variation. Cons: simplistic evolutionary model, computationally intensive, works poorly for data with large amounts of genetic variation or where mutation rates vary across the tree.

  • Maximum likelihood. Pros: very flexible - can use any model of evolution, can perform model selection, usually works well. Cons: computationally intensive.

A popular branch of methods for tree reconstruction that we have not covered here are Bayesian methods - such as BEAST and BEAST2. These can be seen as an extension to maximum likelihood methods, except that instead of finding the single tree with ‘best’ likelihood, they seek samples of trees and model parameters which are most consistent with both the data and any prior knowledge you might have of the tree or parameters. Since the focus of our module is transmission inference and these methods are slightly more complex, we will not cover them here. However, these are very popular methods so worth a look if you are interested in this area.

We chose to focus on building trees in R here, in line with the rest of the module. But there also exist many other popular phylogenetic tree building and/or visualizing software. These include IQ-TREE, FastTree, GrapeTree, ggtree (in R), iTOL: Interactive Tree Of Life, and many more. Wikipedia actually has a very nice list here for building and here for visualizing.

Your own analysis: reconstructing TB and SARS-CoV-2 phylogenetic trees

  1. Use what you’ve learned from this tutorial to build phylogenies for the Roetzer TB genomes and/or the SARS-CoV-2 genomes from exercise 2

Remember that we don’t expect you to finish the whole tutorial for all 3 datasets within this single time slot. But hopefully you can make a good start, ask lots of questions, and then continue later if you want to.

Extension tasks and additional reading

A. The phylogenies we built above were not linked to time, though generally in transmission inference we have a time point associated with each sequence (e.g. the time of sampling). Check out the extension exercise linked from the module webpage for a tutorial on building timed trees.

B. Here are several alternatives to nj for building phylogenetic trees:

  • bionj (ape library): an improved version of Neighbor-Joining.

  • fastme.bal and fastme.ols (ape library): minimum evolution algorithms.

  • hclust (stats library): classical hierarchical clustering algorithms including single linkage, complete linkage, UPGMA, and others.

Test these packages for building a US flu data phylogeny and compare your results to that you obtained using nj. (Remember to look up the documentation for each package using e.g. ?bionj). How different do your phylogenies look? Are the branches longer/shorter/differently clustered in some methods? See point B below for more formal methods for comparison of trees.

C. Development of methods for comparison of phylogenetic trees is an active area of research. Investigate the R packages ape and treespace, or look up alternatives, for ways to compare trees, and apply these to your TB/COVID trees. Here is a document about treespace: https://cran.r-project.org/web/packages/treespace/vignettes/introduction.html