1 Building phylogenetic trees for transmission reconstruction

When we build phylogenetic trees with software such as IQtree, raxml, FastTree, PAML or other maximum-likelihood tree reconstruction tools, the branches are typically in units of genetic distance (in substitutions per site). Trees typically do not have a root (though when they get read into R, they are sometimes interpreted as rooted nonetheless).

In contrast, for reconstructing transmission – either person to person transmission in software like TransPhylo, or phylogeographic movements of viruses – we often require a timed phylogenetic tree. This is a rooted phylogenetic tree whose branches are in units of calendar time, not substitutions per site. The root is the oldest node in the tree. Nodes have calendar dates, such that each tip’s date corresponds to the date of collection of the sample corresponding to the tip.

The relationship between the untimed and timed trees is shaped by the molecular clock of the pathogen. This is the model defining how substitutions occur through time. For example, do all branches in the phylogeny evolve at the same rate (a “strict clock”) or not (a “relaxed clock”)?

treedater can take a tree whose branch lengths are in units of genetic distance and create a tree with a root, whose branches are in units of calendar time, such that the tips line up at their collection dates (or some other provided date). To create such a tree, you need to provide the input (genetic distance) tree, the dates of sampling of the tips, and the length of the sequences used to reconstruct the tree. Because you are providing the number of substitutions per site, the tree and the dates, treedater can estimate the rate of substitution in calendar time. It can also estimate dates of tips where only a lower and upper bound are provided.

This extension exercise was adapted from the treedater tutorial written by Erik Volz.

2 A brief introduction to building timed phylogenies

treedater uses a relaxed clock model, allowing substitution rates to vary across the tree using a Gamma-Poisson mixture model. A substitution rate is estimated for every lineage in the tree, but it is also designed to be fast and useable for large phylogenetic trees via a heuristic search algorithm. You can read more about the math behind this in the paper E.M. Volz and S.D.W. Frost (2017) Scalable relaxed clock phylogenetic dating.

In its most simple form, we can run treedater on our phylogeny of choice (tree, an ape phylo object) with the command

dater(tree, samps, s)

where samps is a vector containing the sampling time of each tip, and s is the length of the genetic sequences. Note, samps should be a named vector matching the tip names in tree.

2.1 A timed phylogeny for the US flu data

We will read in our maximum likelihood tree for the US flu data and, by adding in the sampling dates, build a timed phylogeny. Remember, we built and saved this tree to file in Exercise 3.

library(treedater)
## Loading required package: ape
## Loading required package: limSolve
tree <- ape::read.tree("myflutree.nwk")
tree
## 
## Phylogenetic tree with 80 tips and 78 internal nodes.
## 
## Tip labels:
##   CY006627, CY006243, CY006595, CY006267, CY006259, CY001720, ...
## 
## Unrooted; includes branch lengths.

Note that this tree is unrooted. When fitting the clock model, we will estimate the best choice of root.

As mentioned, we need to tell treedater the sequence length for inferring the clock.

seqlen <- 1701 # can confirm by looking back at 'seqs' object from exercise 3 main

We also need the sampling time for each tip. Recall that the years were contained in the metadata file, and that we need a named vector

meta <- read.csv("usflu.annot.csv", header = TRUE, row.names = 1)
samps <- setNames(meta$year, meta$accession)
head(samps)
## CY013200 CY013781 CY012128 CY013613 CY012160 CY012272 
##     1993     1993     1993     1993     1993     1994

Now we are ready to run treedater, in its simplest form with a strict clock

dattree <- dater(tree , samps, seqlen, clock = 'strict' )
## Note: Initial guess of substitution rate not provided. Will attempt to guess starting conditions. Provide initial guesses of the rate using *omega0* parameter. 
## Note: Minimum temporal branch length  (*minblen*) set to 0.01875. Increase *minblen* in the event of convergence failures. 
## Tree is not rooted. Searching for best root position. Increase searchRoot to try harder.
## Initial guesses of substitution rate: 0.0050439056393304,0.00506882273806997,0.00527090721759518,0.00530753538974721,0.00549790879585996,0.00554624804142444 
## Initial guesses of substitution rate: 0.00493267096545287,0.00506882273806997,0.00515572303197038,0.0053075353897472,0.00537877509848789,0.00554624804142444 
## Initial guesses of substitution rate: 0.00497940962099535,0.00506882273806997,0.00520558634013873,0.0053075353897472,0.0054317630592821,0.00554624804142444 
## Initial guesses of substitution rate: 0.00497902637816774,0.00506882273806997,0.00520518662020955,0.0053075353897472,0.00543134686225136,0.00554624804142444 
## Initial guesses of substitution rate: 0.00501508625155921,0.00506882273806997,0.0052458582897811,0.0053075353897472,0.00547663032800299,0.00554624804142444
dattree 
## 
## Phylogenetic tree with 80 tips and 79 internal nodes.
## 
## Tip labels:
##   CY006627, CY006243, CY006595, CY006267, CY006259, CY001720, ...
## 
## Rooted; includes branch lengths.
## 
##  Time of common ancestor 
## 1991.28853741851 
## 
##  Time to common ancestor (before most recent sample) 
## 16.7114625814866 
## 
##  Weighted mean substitution rate (adjusted by branch lengths) 
## 0.00297168135068025 
## 
##  Unadjusted mean substitution rate 
## 0.00297168135068025 
## 
##  Clock model  
## strict 
## 
##  Coefficient of variation of rates 
## 0

Note that the algorithm has searched to find the best root, and provided us a tree with timed branches. Had we wanted to pick our own root, we simply could have provided a rooted tree like we obtained in the main exercise 3 script. treedater also picked its own starting guess for omega0, the substitution rate. We could have provided our own initial guess, if we had one. Providing either of these can significantly speed up the run time, although hopefully you found that for this data it was fast anyhow.

For more computationally demanding runs, we can set the parameter ncpu to run the function in parallel using multiple cores (the default value of ncpu is 1).

Now lets take a closer look at the output. treedater has estimated the substitution rate, and provided a ‘TMRCA’ - time to most recent common ancestor. We can plot our timed tree in the same way as for the ape package before.

plot(dattree, cex = .5) # 'cex' changes font size of tips
title("Timed tree for US flu data")
axisPhylo()

Now the axis is counting backwards in years - success!

We can run a root-to-tip regression on this tree, to look for outlier tips. Nothing in the tree above looks particularly like an outlier (e.g. a single tip on a long branch), but lets run the regression anyway for practice.

rootToTipRegressionPlot(dattree)

## Root-to-tip mean rate: 0.00520518662020955 
## Root-to-tip p value: 1.07989692086262e-72 
## Root-to-tip R squared (variance explained): 0.984814582815966 
## Returning fitted linear model.

The p-value is tiny, suggesting we have sufficient signal of a clock-like process. There are no clear outliers in the plot (deviations from the x=y diagonal), but lets continue to check for them for practice.

2.2 Outliers

treedater has in-built functions for outlier detection and removal. The outlierTips function looks for tips that do not provide a good fit to the molecular clock model.

outliers <- outlierTips(dattree , alpha = 0.20) 
##             taxon           q            p loglik       rates branch.length
## CY003336 CY003336 0.006922869 0.0002596076     NA 0.002971681       0.01875
## CY003785 CY003785 0.006922869 0.0002596076     NA 0.002971681       0.01875
## CY001453 CY001453 0.006922869 0.0002596076     NA 0.002971681       0.01875
## CY006627 CY006627 0.073714334 0.0083342582     NA 0.002971681       0.01875
## CY006243 CY006243 0.073714334 0.0083342582     NA 0.002971681       0.01875
## CY000185 CY000185 0.073714334 0.0083342582     NA 0.002971681       0.01875
## CY000545 CY000545 0.073714334 0.0119785793     NA 0.002971681       1.01875
## CY002328 CY002328 0.073714334 0.0083342582     NA 0.002971681       0.01875
## CY001152 CY001152 0.073714334 0.0090299158     NA 0.002971681       1.07500
## CY019301 CY019301 0.073714334 0.0119785793     NA 0.002971681       1.01875
## CY006155 CY006155 0.073714334 0.0119785793     NA 0.002971681       1.01875
## CY001704 CY001704 0.073714334 0.0099218100     NA 0.002971681       1.05625
## CY009476 CY009476 0.073714334 0.0083342582     NA 0.002971681       0.01875

At the alpha = 20% level, this function returns a table of the tips (taxa) in ascending order of model fit. An option for outlier removal is to discard tips with low q-value (note: we do this from the original tree, not the timed tree)

tree2 <- ape::drop.tip(tree, rownames(outliers[outliers$q < 0.20,]) )

We dropped 13/80 tips. This is quite a large proportion of the data - in practice we should consider if this is indeed sensible by carefully investigating these 13 tips. We might have gone overboard with our definition of outlier. Nonetheless, let’s rerun treedater with the reduced tree, and plot both timed trees. Where are the main differences?

dattree2 <- dater(tree2, samps, seqlen, clock='uncorrelated')  
## Note: Initial guess of substitution rate not provided. Will attempt to guess starting conditions. Provide initial guesses of the rate using *omega0* parameter. 
## Note: Minimum temporal branch length  (*minblen*) set to 0.0223880597014925. Increase *minblen* in the event of convergence failures. 
## Tree is not rooted. Searching for best root position. Increase searchRoot to try harder.
## Initial guesses of substitution rate: 0.00508422825314188,0.00510823486209489,0.00530930998352896,0.00534693206367195,0.00553439171391604,0.005585629265249 
## Initial guesses of substitution rate: 0.00498161705233903,0.00510823486209489,0.0052061420706219,0.00534693206367195,0.00543066708890477,0.005585629265249 
## Initial guesses of substitution rate: 0.00501404827982491,0.00510823486209489,0.00524040646038632,0.00534693206367195,0.00546676464094772,0.005585629265249 
## Initial guesses of substitution rate: 0.0050113873749215,0.00510823486209489,0.00523766582163522,0.00534693206367195,0.00546394426834894,0.005585629265249 
## Initial guesses of substitution rate: 0.00505027467043432,0.00510823486209489,0.00528108451615075,0.00534693206367195,0.00551189436186717,0.005585629265249
dattree2
## 
## Phylogenetic tree with 67 tips and 66 internal nodes.
## 
## Tip labels:
##   CY006595, CY006267, CY006259, CY001720, CY003096, CY000584, ...
## 
## Rooted; includes branch lengths.
## 
##  Time of common ancestor 
## 1991.85828681952 
## 
##  Time to common ancestor (before most recent sample) 
## 16.1417131804842 
## 
##  Weighted mean substitution rate (adjusted by branch lengths) 
## 0.00318595666711704 
## 
##  Unadjusted mean substitution rate 
## 0.00328985603314783 
## 
##  Clock model  
## uncorrelated 
## 
##  Coefficient of variation of rates 
## 0.11457320340524
plot(dattree, cex = .5)
title("Timed tree for US flu data")
axisPhylo()

plot(dattree2, cex = .5) 
title("Timed tree for US flu data, with outliers removed")
axisPhylo()

Note that the second tree was generated with an uncorrelated (relaxed) clock model. We can test the suitability of the relaxed clock (vs a strict one) with the following test (that may be slow to run)

rct <- relaxedClockTest(dattree2, samps, seqlen)
## Note: Initial guess of substitution rate not provided. Will attempt to guess starting conditions. Provide initial guesses of the rate using *omega0* parameter. 
## Note: *dater* called with treedater input tree. Will use rooted tree with branch lengths in substitions.
## Note: Minimum temporal branch length  (*minblen*) set to 0.0223880597014925. Increase *minblen* in the event of convergence failures. 
## Tree is rooted. Not estimating root position.
## Initial guesses of substitution rate: 0.00501404827982491,0.00510823486209489,0.00524040646038632,0.00534693206367195,0.00546676464094772,0.005585629265249 
## Running in quiet mode. To print progress, set quiet=FALSE.
## NOTE: Running with overrideSearchRoot will speed up execution but may underestimate variance.
## NOTE: Running with overrideTempConstraint will speed up execution but may underestimate variance. Bootstrap tree replicates may have negative branch lengths.
## Note: Initial guess of substitution rate not provided. Will attempt to guess starting conditions. Provide initial guesses of the rate using *omega0* parameter. 
## Note: *dater* called with treedater input tree. Will use rooted tree with branch lengths in substitions.
## Note: Minimum temporal branch length  (*minblen*) set to 0.0223880597014925. Increase *minblen* in the event of convergence failures. 
## Tree is rooted. Not estimating root position.
## Initial guesses of substitution rate: 0.00501404827982491,0.00510823486209489,0.00524040646038632,0.00534693206367195,0.00546676464094772,0.005585629265249 
## Best clock model:  uncorrelated 
## Null distribution of rate coefficient of variation: 1.21509147332787e-11 9.69345281686395e-07 
## Returning best treedater fit

This test indicates a relaxed clock is preferred.

There is more that can be done with the treedater package. It is possible to estimate confidence intervals around the inferred dates and rates, using a parametric or nonparametric bootstrap. It is also possible to put uncertainty around the sampling times when they are only known e.g. to nearest month or year. This is exactly the case we had here, with only years known, and so would be a good extension to the work above. See the treedater documentation for more information about how to do this.

3 Next steps

If you have time, try building timed phylogenies for the TB and COVID-19 data. In exercise 4, we will be using timed trees to reconstruct transmission. We will be providing these timed trees for you, but if you get this far you could see how different your timed treedater trees are from those we provide.