In this exercise we will explore several approaches for constructing phylogenetic trees from timed sequence data in R: distance-based methods, maximum parsimony and maximum likelihood.
library(stats)
library(ade4)
library(ape)
library(adegenet)
library(phangorn)
library(ape)
set.seed(8914) # You can put whatever number you want into the set.seed function
The beginning of this tutorial is adapted from the Adegenet package‘s tutorial ’Genetic data analysis using R: introduction to phylogenetics’, and we will use the same toy dataset to test out the methods. These data are US seasonal influenza (H3N2) sequences downloaded from Genbank (http://www.ncbi.nlm.nih.gov/genbank/), collected 1993-2008. The 80 sequences have already been aligned, and come from the hemagglutinin (HA) segment of the genome.
Download the US flu data from the module webpage. As in exercise 2, the data consist of a .fasta sequence file and a .csv file with additional metadata (year and location of sequencing).
Read the US flu data into R. Explore the data and confirm that the ‘accession numbers’ (unique id numbers) in the metadata match the names of the sequences. It’s always a good idea to do this quick check.
seqs <- read.dna(file = "usflu.fasta", format = "fasta")
meta <- read.csv("usflu.annot.csv", header = TRUE, row.names = 1)
seqs
## 80 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 1701
##
## Labels:
## CY013200
## CY013781
## CY012128
## CY013613
## CY012160
## CY012272
## ...
##
## Base composition:
## a c g t
## 0.335 0.200 0.225 0.239
## (Total: 136.08 kb)
class(seqs)
## [1] "DNAbin"
tail(meta)
## accession year misc
## 75 EU516212 2007 (A/California/33/2007(H3N2))
## 76 FJ549055 2008 (A/Illinois/14/2008(H3N2))
## 77 EU779498 2008 (A/Mississippi/01/2008(H3N2))
## 78 EU779500 2008 (A/Indiana/02/2008(H3N2))
## 79 CY035190 2008 (A/Pennsylvania/PIT43/2008(H3N2))
## 80 EU852005 2008 (A/Texas/06/2008(H3N2))
table(meta$year) # We see that we have 5 sequences per year 1993-2008.
##
## 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
all(meta$accession == rownames(seqs)) # Should return TRUE if the ids all match
## [1] TRUE
Notice that the sequences are stored as DNAbin
obects;
this is an efficient way to store sequences using bytes instead of
character vectors (which take up less memory). We have 80 sequences each
of length 1701 nucleotides.
In this section we will tackle 3 main tasks for creating distance-based phylogenies:
compute pairwise genetic distance between all 80 flu sequences
represent these distances on a tree
estimate the molecular clock
Task (i): compute distances.
dist.dna()
function in the ape
package to compute the genetic distance between the US flu sequences.
Enter help(dist.dna)
to read about the various distance
measures that can be used.D <- dist.dna(seqs, model = "TN93")
length(D) # D is of length (80*80/2 - 80) i.e. it contains the distance from each sequence to every other sequence.
## [1] 3160
Here I chose the TN93 (Tamura and Nei 1993) model which allows for: different rates of transitions (A↔︎G or C↔︎T) and transversions (all other base pair mutations), unequal base frequencies (e.g. the proportion of A is not equal to the proportion of T), and between-site variation in the rate at which substitutions occur. Choice of evolutionary model is unfortunately beyond the scope of this module, but a good resource is The Phylogenetic Handbook, chapter 10.
In practice, a model such as TN93, which is fairly simple but still flexible (varying transition/transversion rates, unequal base frequencies and between-site substitution rate variation), will often be a good choice. You could try re-running this analysis with different evolutionary models to see how much impact it has on your trees.
Visualizing and summarizing the distance matrix D can be a difficult task, particularly if you have a large number of sequences. We don’t have too many sequences, so a heatmap of pairwise distances shouldn’t be too bad.
matD <- as.data.frame(as.matrix(D))
table.paint(matD, cleg = 0, clabel.row = 0.5, clabel.col = 0.5)
We can see some kind of genetic structure from the heatmap (patches of similar shades), which matches the sequences being approximately ordered by time. However, it’s hard to read any actual evolutionary relationships from this figure. For this, we need a phylogenetic tree…
Task (ii): build trees.
Phylogenetic trees help us to understand the genetic distance between samples, as well as the overall structure of a set of samples. However, they still involve simplification of the rich information contained within sequences and therefore some information will, in practice, always be lost in translation. There are also many different approaches for building trees, and in all but the simplest case you will get a different answer depending on which method you use.
nj
(Neighbour-Joining) algorithm in the ape
package.my_tree <- nj(D)
my_tree
##
## Phylogenetic tree with 80 tips and 78 internal nodes.
##
## Tip labels:
## CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
##
## Unrooted; includes branch lengths.
class(my_tree)
## [1] "phylo"
plot(my_tree, cex = 0.5) # cex changes the label text size
title("NJ tree for US flu data")
# We can 'ladderize' the tree - gives the tree a nicer (in my opinion) layout - check out how the plot changes. It's exactly the same tree, just with the plotting of branches reordered
my_tree <- ladderize(my_tree)
plot(my_tree, cex = 0.5)
title("NJ tree for US flu data")
There are many ways to customize phylogeny plots to display additional data and/or make interpretation of the tree easier.
# first, remove the tips so we can manually add colourful ones:
plot(my_tree, show.tip = FALSE)
title("Unrooted NJ tree for US flu data")
# Create a custom colour palette (you can choose any colours here):
myPal <- colorRampPalette(c("red", "pink", "purple", "blue"))
# Add custom tip labels with the year
tiplabels(meta$year, bg = num2col(meta$year, col.pal = myPal),
cex = 0.5)
# Turn your custom colours into a nice legend with 5 categories
temp <- pretty(min(meta$year):max(meta$year), 5)
legend("topright", fill = num2col(temp, col.pal = myPal),
leg = temp, ncol = 2, cex=0.8)
# If you find that your legend is obstructing the tree, you can try changing the legend size (cex) or the legend position ("topright")
Notice in the above figure that the x-axis does NOT correspond to
ancestry through time or evolutionary distance. This is because we have
plotted an unrooted tree i.e. we did not provide the nj
algorithm with a taxon defined as the ‘most ancient’ split in the tree
to root it by. When plotting unrooted trees, it is best to clearly
display them as such. To do this you can add
type = "unrooted"
to the plot command. In the below figure,
the tree is no longer arranged left->right and so it’s more obvious
that you are showing an unrooted tree.
is.rooted(my_tree)
## [1] FALSE
plot(my_tree, type = "unrooted", show.tip = FALSE)
title("Unrooted NJ tree for US flu data")
tiplabels(meta$year, bg = num2col(meta$year, col.pal = myPal),
cex = 0.5)
legend("topright", fill = num2col(temp, col.pal = myPal),
leg = temp, ncol = 2, cex=0.8)
Since we do have years for our sequences (in the metadata) it is possible, and probably preferred, to create a rooted phylogeny. To do this, any of the oldest (1993) sequences would be a good choice of root.
nj
phylogeny. Remember that the
sequences are roughly ordered by year, so you want to choose one of the
early ones to root by (you can confirm this by looking at the object
meta
).# Here I root by the first sequence, but you could pick any of the first 5 (all 1993). You can also use e.g. 'out=1' to select the first sequence, rather than using the name.
my_tree2 <- root(my_tree, out = "CY013200", resolve.root = TRUE)
# Ladderize again for that nice plot layout
my_tree2 <- ladderize(my_tree2)
plot(my_tree2, show.tip = FALSE, edge.width = 2)
title("Rooted NJ tree for US flu data")
# We add some transparency (transp) to the label boxes so we can see the ones underneath
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5)
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill = num2col(temp, col.pal = myPal),
0.7, leg = temp, ncol = 2, cex=0.8)
In the rooted tree, the x-axis now represents evolutionary distance back to the root node. Please note, the x-axis is still not time - although, if we imagine that mutations accrue in relatively clock-like fashion, we will generally expect newer tips to be towards the right side of the tree and vice versa.
There are a wide range of other approaches than nj
available for building phylogenetic trees in R. If you get time, check
out the extension tasks section below to explore several of these
alternatives.
Task (iii): estimating a molecular clock
Rooted trees, such as the one we just made, can be used to estimate rates of evolution. Unlike a regular clock which measures the passing of seconds, minutes or hours, a ‘molecular clock’ measures the accumulation of genetic distance. To estimate the clock rate of our rooted tree, we can perform a regression over all tips of the number of mutations from the root against the time since the root.
# First, we plot the regression of mutations against time:
# How many mutations between each tip and the root?
mutFromRoot <- as.matrix(dist.dna(seqs, model="N"))[1,]
# How many years passed between each tip and the root?
yearFromRoot <- meta$year-meta$year[1]
plot(mutFromRoot~yearFromRoot, xlab="Years from the root",
ylab="Mutations from the root", main="US flu data molecular clock")
# Then we perform a regression of these quantities, and add to the plot. It should fit the points.
mclock <- lm(mutFromRoot~-1+yearFromRoot)
abline(mclock, col="blue",lwd=2)
# What are the regression coefficients?
summary(mclock)
##
## Call:
## lm(formula = mutFromRoot ~ -1 + yearFromRoot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.327 -1.577 1.004 6.386 13.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## yearFromRoot 7.73274 0.07443 103.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.86 on 79 degrees of freedom
## Multiple R-squared: 0.9927, Adjusted R-squared: 0.9926
## F-statistic: 1.079e+04 on 1 and 79 DF, p-value: < 2.2e-16
mclock$coefficients
## yearFromRoot
## 7.732742
mclock$coefficients/ncol(seqs)
## yearFromRoot
## 0.004545998
365/mclock$coefficients
## yearFromRoot
## 47.20189
The coefficient of yearFromRoot
is ~7.7. This is our
estimate of the number of substitutions per year for the HA segment of
the H3N2 influenza genomes. Use this result to answer the following
questions:
An alternative to a distance-based phylogeny is a maximum parsimony phylogeny.
In general terms, the scientific principle of parsimony can be described as a belief that the best explanation of your data is equal to the simplest possible explanation that is consistent with that data. Maximum parsimony phylogenetic reconstruction therefore seeks trees that minimize the number of evolutionary changes (substitutions) between ancestor and descendant. This makes for a simple and easy-to-motivate methodology, that is most appropriate for sequences with low divergence i.e. most sequences are very similar to one another.
In an ideal world we would search over the space of all possible trees to find the one which is most parsimonious. However, in reality this is usually computationally impossible unless you have very few tips in your tree. Many algorithms have been designed to instead search broadly over this space in a clever way - to try and find a decently parsimonious tree without needing to check them all. These algorithms generally take a strategy of (1) pick an initial tree and calculate it’s parsimony, (2) propose some small changes to the tree and calculate the new tree’s parsimony, (3) keep the new tree if it leads to a better parsimony, and (4) repeat until the parsimony score stops improving.
Parsimony-based phylogenetic reconstruction is implemented in the
package phangorn
. It requires a tree (in ape’s format,
i.e. a phylo object) and the original DNA sequences in
phangorn
’s own format, phyDat
. We convert the
data and generate a tree to initialize the method:
phangorn
. This requires an initial tree (you can
use ape
to make a neighbor-joining tree for this, as in the
previous section) as well as the original DNA sequences, which must be
converted to phangorn
’s phyDat
format, as
follows:seqs2 <- as.phyDat(seqs)
class(seqs2)
## [1] "phyDat"
seqs2
## 80 sequences with 1701 character and 269 different site patterns.
## The states are a c g t
#make the initial tree
tree_init <- nj(dist.dna(seqs, model = "raw"))
tree_init
##
## Phylogenetic tree with 80 tips and 78 internal nodes.
##
## Tip labels:
## CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
##
## Unrooted; includes branch lengths.
#The parsimony of the inital tree is given by:
parsimony(tree_init, seqs2)
## [1] 422
# Now perform the parsimony search
tree_parsim <- optim.parsimony(tree_init, seqs2)
## Final p-score 420 after 2 nni operations
tree_parsim
##
## Phylogenetic tree with 80 tips and 76 internal nodes.
##
## Tip labels:
## CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
##
## Unrooted; no branch lengths.
The tree we get out will be unrooted without branch lengths. Here we
find that the parsimony of the final tree is not much lower than the
nj
tree we started with, but that won’t always be the case.
We can think of this as the methods (nj
and
optim.parsimony
) relatively well agreeing on the best tree,
which tends to happen for datasets with low divergence. As we have
here!
nj
trees).plot(tree_parsim, type = "unr", show.tip = FALSE, edge.width = 2)
title("Maximum-parsimony tree for US flu data")
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5, fg = "transparent")
temp <- pretty(1993:2008, 5)
legend("topright", fill = transp(num2col(temp, col.pal = myPal),
0.7), leg = temp, ncol = 2, bg = transp("white"), cex=0.8)
Our final phylogenetic tree building approach is maximum likelihood. The idea here is very similar to maximum parsimony, in that we try to search over the space of all possible trees in a smart way to find the ‘best’ one. Except, this time we want to maximize some pre-defined likelihood function rather than maximizing parsimony.
A big benefit of this approach is that the likelihood to be maximized can be defined by any model of sequence evolution we choose, which makes it much more flexible. As well as optimizing over the space of trees, we can also simultaneously optimize over the parameters of our chosen model - to find the best tree and the best model parameters at the same time. A drawback, however, is that this means we need to decide what is the most appropriate model to use.
A common choice is the GTR + \(\Gamma\)(4) model (GTR = global reversible time). In this model, all possible substitutions (all transitions, transversions) are allowed to have different rates as well as the substitution rate being allowed to vary across sites. It is therefore a very flexible model.
Like maximum parsimony, maximum likelihood tree reconstruction can be
performed with the phangorn
package in R. Again, we require
an initial tree and the DNA sequences.
phangorn
.# phyDat format, same as before:
seqs2 <- as.phyDat(seqs)
#make the initial tree
tree_init <- nj(dist.dna(seqs, model = "TN93"))
#To begin the optimization procedure, we need to know the likelihood of the initial tree
pml(tree_init, seqs2, k = 4)
## model: JC+G(4)
## loglikelihood: -5641.785
## unconstrained loglikelihood: -4736.539
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 1
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.25 0.25 0.25 0.25
We find an initial log likelihood of -5641.785. If using different
data/seed, you may find that the likelihood of your initial tree is
NaN
. This needs to be fixed, but isn’t a big problem. It is
caused by missing data (missing or error bases) in the sequences. We
should remove these before continuing. Even if you didn’t get an
NaN
likelihood, I recommend running the code to remove them
anyway, just to avoid any possible problems later.
# First, find the location of the missing/incorrect bases (anything that isn't a,t,g,c)
na.posi <- which(apply(as.character(seqs), 2, function(e) any(!e %in%
c("a", "t", "g", "c"))))
na.posi
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## [46] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## [76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 259 334 336 460 721 1121
# Plot them
temp <- apply(as.character(seqs), 2, function(e) sum(!e %in% c("a", "t", "g", "c")))
plot(temp, type = "l", col = "red", xlab = "Position in HA segment of sequence",
ylab = "Number of NAs")
# Remove them
seqs3 <- seqs[, -na.posi]
seqs3
## 80 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 1596
##
## Labels:
## CY013200
## CY013781
## CY012128
## CY013613
## CY012160
## CY012272
## ...
##
## Base composition:
## a c g t
## 0.340 0.197 0.226 0.238
## (Total: 127.68 kb)
# Aha! Looks much better now. Before:
table(as.character(seqs))
##
## - a c g k m r s t w
## 147 45595 27170 30613 1 2 1 1 32549 1
# After:
table(as.character(seqs3))
##
## a c g t
## 43402 25104 28828 30346
# Don't forget! Change to phyDat format
seqs4 <- as.phyDat(seqs3)
It turns out that most of the missing bases were towards the beginning of the alignment: this is usual.
Now we’re ready to search for a maximum likelihood tree.
optim.pml
. Remember to start with a new initial tree, using
your new sequences with missing data removed.#Make the initial tree
tree_init <- nj(dist.dna(seqs3, model = "TN93"))
#Find its (log) likelihood
lh_init <- pml(tree_init, seqs4, k = 4)
lh_init
## model: JC+G(4)
## loglikelihood: -5184.119
## unconstrained loglikelihood: -4043.367
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 1
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.25 0.25 0.25 0.25
# Find the max. likelihood tree. We optimize the tree (optNni = TRUE), the base frequencies (optBf=TRUE), the substitution rates (optQ=TRUE) and assume a gamma distribution for the variation in substitution rates across sites (optGamma = TRUE).
fit <- optim.pml(lh_init, optNni = TRUE, optBf = TRUE, optQ = TRUE, optGamma = TRUE)
## optimize edge weights: -5184.094 --> -5166.996
## optimize base frequencies: -5166.996 --> -5121.313
## optimize rate matrix: -5121.313 --> -4933.871
## optimize shape parameter: -4933.871 --> -4919.646
## optimize edge weights: -4919.646 --> -4919.326
## optimize topology: -4919.326 --> -4916.187 NNI moves: 2
## optimize base frequencies: -4916.187 --> -4915.89
## optimize rate matrix: -4915.89 --> -4915.868
## optimize shape parameter: -4915.868 --> -4915.867
## optimize edge weights: -4915.867 --> -4915.867
## optimize topology: -4915.867 --> -4915.867 NNI moves: 0
## optimize base frequencies: -4915.867 --> -4915.866
## optimize rate matrix: -4915.866 --> -4915.866
## optimize shape parameter: -4915.866 --> -4915.866
## optimize edge weights: -4915.866 --> -4915.866
## optimize base frequencies: -4915.866 --> -4915.866
## optimize rate matrix: -4915.866 --> -4915.866
## optimize shape parameter: -4915.866 --> -4915.866
## optimize edge weights: -4915.866 --> -4915.866
## optimize base frequencies: -4915.866 --> -4915.866
## optimize rate matrix: -4915.866 --> -4915.866
## optimize shape parameter: -4915.866 --> -4915.866
## optimize edge weights: -4915.866 --> -4915.866
fit
## model: F81+G(4)
## loglikelihood: -4915.866
## unconstrained loglikelihood: -4043.367
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 0.2829846
##
## Rate matrix:
## a c g t
## a 0.0000000 2.3836329 8.2983982 0.8563163
## c 2.3836329 0.0000000 0.1485362 10.0779972
## g 8.2983982 0.1485362 0.0000000 1.0000000
## t 0.8563163 10.0779972 1.0000000 0.0000000
##
## Base frequencies:
## a c g t
## 0.3415991 0.1953602 0.2243303 0.2387104
names(fit) # what are the components of the results object?
## [1] "logLik" "inv" "k" "shape" "Q" "bf"
## [7] "rate" "siteLik" "weight" "g" "w" "eig"
## [13] "data" "model" "INV" "ll.0" "tree" "lv"
## [19] "call" "df" "wMix" "llMix" "ASC" "site.rate"
The results object (here named fit
) contains not just
the optimal tree (fit$tree
), but also other useful info.
Take some time to explore these. How much did your likelihood improve
from initial to optimal?
Before choosing this as our favourite maximum likelihood tree, we can confirm that the optimized tree is better than the initial one using ANOVA and AIC:
# ANOVA
anova(lh_init, fit)
## Likelihood Ratio Test Table
## Log lik. Df Df change Diff log lik. Pr(>|Chi|)
## 1 -5184.1 158
## 2 -4915.9 166 8 536.51 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare AIC
AIC(lh_init)
## [1] 10684.24
AIC(fit)
## [1] 10163.73
Indeed, ANOVA has a good p-value and the AIC of the new tree is better than the initial one. Great news! We have obtained an improved tree.
tree_ml <- root(fit$tree, 1)
write.tree(tree_ml, file = "myflutree.nwk")
plot(tree_ml, show.tip = FALSE, edge.width = 2)
title("Maximum-likelihood tree for US flu data")
tiplabels(meta$year, bg = transp(num2col(meta$year, col.pal = myPal),
0.7), cex = 0.5, fg = "transparent")
axisPhylo()
temp <- pretty(1993:2008, 5)
legend("topright", fill = transp(num2col(temp, col.pal = myPal),
0.7), leg = temp, ncol = 2, cex=0.8)
What patterns and observations do you see in your tree?
Overall, we find that for this data set, all methods return fairly similar trees. Our data is relatively simple, with few samples and low diversity, so this is not unexpected. For larger and more complex data, you will likely find much more stark differences between different methods. It’s usually a good idea to test out a few different tree building approaches before drawing any conclusions about your data.
How do we choose between methods?
Distance based methods. Pros: the fastest method, and fairly flexible too (can use different distance measures). Cons: no real way to compare between different distance-based methods to find the ‘best’.
Maximum parsimony. Pros: intuitive, works well for data with small amounts of genetic variation. Cons: simplistic evolutionary model, computationally intensive, works poorly for data with large amounts of genetic variation or where mutation rates vary across the tree.
Maximum likelihood. Pros: very flexible - can use any model of evolution, can perform model selection, usually works well. Cons: computationally intensive.
A popular branch of methods for tree reconstruction that we have not covered here are Bayesian methods - such as BEAST and BEAST2. These can be seen as an extension to maximum likelihood methods, except that instead of finding the single tree with ‘best’ likelihood, they seek samples of trees and model parameters which are most consistent with both the data and any prior knowledge you might have of the tree or parameters. Since the focus of our module is transmission inference and these methods are slightly more complex, we will not cover them here. However, these are very popular methods so worth a look if you are interested in this area.
We chose to focus on building trees in R here, in line with the rest of the module. But there also exist many other popular phylogenetic tree building and/or visualizing software. These include IQ-TREE, FastTree, GrapeTree, ggtree (in R), iTOL: Interactive Tree Of Life, and many more. Wikipedia actually has a very nice list here for building and here for visualizing.
Remember that we don’t expect you to finish the whole tutorial for all 3 datasets within this single time slot. But hopefully you can make a good start, ask lots of questions, and then continue later if you want to.
A. The phylogenies we built above were not linked to time, though generally in transmission inference we have a time point associated with each sequence (e.g. the time of sampling). Check out the extension exercise linked from the module webpage for a tutorial on building timed trees.
B. Here are several alternatives to nj
for building
phylogenetic trees:
bionj
(ape
library): an improved
version of Neighbor-Joining.
fastme.bal
and fastme.ols
(ape
library): minimum evolution algorithms.
hclust
(stats
library): classical
hierarchical clustering algorithms including single linkage, complete
linkage, UPGMA, and others.
Test these packages for building a US flu data phylogeny and compare
your results to that you obtained using nj
. (Remember to
look up the documentation for each package using
e.g. ?bionj
). How different do your phylogenies look? Are
the branches longer/shorter/differently clustered in some methods? See
point B below for more formal methods for comparison of trees.
C. Development of methods for comparison of phylogenetic trees is an active area of research. Investigate the R packages ape and treespace, or look up alternatives, for ways to compare trees, and apply these to your TB/COVID trees. Here is a document about treespace: https://cran.r-project.org/web/packages/treespace/vignettes/introduction.html