Reconstructing transmission pathways

In this exercise we will use the first of several R packages for reconstructing transmission pathways. This will be `outbreaker2’, as introduced in the previous lecture.

Introduction: getting started

First, load up the packages we will need for this tutorial and set the seed (so that if you repeat this analysis you’ll get the same output).

library(ape) # reading, manipulating sequences
library(outbreaker2) # the main reconstruction package
library(Hmisc) # general data science package - we use for dealing with dates

## Warning: package 'Hmisc' was built under R version 4.3.3

library(lubridate) # also for dealing with dates
library(EpiEstim) # primarily a package for estimation of reproduction numbers. We want a small function for defining our generation time
set.seed(1) # You can put whatever number you want into the set.seed function

The outbreaker2 package includes fake_outbreak - a small simulated outbreak perfect for trying out the package.

Explore the contents of the fake_outbreak object, and make sure you understand what each element is. (If you’re not sure about any of them, try help(fake_outbreak)). Visualize some elements of the data if you wish.

str(fake_outbreak)

## List of 6
##  $ onset : num [1:30] 0 2 4 4 6 6 6 7 7 8 ...
##  $ sample: num [1:30] 3 5 6 6 7 9 8 9 9 9 ...
##  $ dna   : 'DNAbin' raw [1:30, 1:10000] t t t t ...
##  $ w     : num [1:4] 0.0426 0.2128 0.4255 0.3191
##  $ ances : int [1:30] NA 1 2 NA 3 4 4 5 6 6 ...
##  $ ctd   :'data.frame':  35 obs. of  2 variables:
##   ..$ i: chr [1:35] "4" "10" "13" "11" ...
##   ..$ j: chr [1:35] "7" "29" "30" "22" ...

counts <- table(fake_outbreak$onset) # this function makes a table of how many instances of each onset time there are 
barplot(counts, xlab="Onset time", ylab = "Number of cases", main="Fake outbreak symptom onset times", col="tomato1")

Introduction: Running outbreaker2 on the fake outbreak

The main function of the outbreaker2 package is outbreaker, this runs the transmission reconstruction. args(outbreaker) prints the possible inputs for the outbreaker function.

args(outbreaker)

## function (data = outbreaker_data(), config = create_config(), 
##     priors = custom_priors(), likelihoods = custom_likelihoods(), 
##     moves = custom_moves()) 
## NULL

In this tutorial, we will work with the default underlying outbreaker model, as introduced in the previous lecture. This means that all we need to provide is the data, outbreaker_data(), and we can leave the priors (custom_priors()), likelihood (custom_likelihoods()) and moves (custom_moves()) at their default settings. create_config() controls various aspects of the MCMC computation: we’ll leave that as default for now too.

Run outbreaker2 on the fake outbreak with the default settings. First, we use the outbreaker_data() function to format our fake_outbreak (check out help(outbreaker_data)), then we feed that object into outbreaker.

fake_data <- outbreaker_data(dna = fake_outbreak$dna, dates = fake_outbreak$sample, ctd = fake_outbreak$ctd, w_dens = fake_outbreak$w)

result <- outbreaker(data = fake_data) # This should take about a minute to run

Take a look at the results, here in an object labelled result

class(result) # the results are stored in a type of data.frame called outbreaker_chains

## [1] "outbreaker_chains" "data.frame"

result

## 
## 
##  ///// outbreaker results ///
## 
## class:  outbreaker_chains data.frame
## dimensions 201 rows,  98 columns
## ancestries not shown: alpha_1 - alpha_30
## infection dates not shown: t_inf_1 - t_inf_30
## intermediate generations not shown: kappa_1 - kappa_30
## 
## /// head //
##   step       post       like    prior           mu        pi       eps
## 1    1 -1198.0669 -1199.4211 1.354240 0.0001000000 0.9000000 0.5000000
## 2   50  -576.3898  -578.4358 2.046006 0.0001322360 0.9719080 0.6238018
## 3  100  -579.1822  -580.9235 1.741283 0.0001218749 0.9395509 0.5810956
##       lambda
## 1 0.05000000
## 2 0.06582716
## 3 0.10193745
## 
## ...
## /// tail //
##      step      post      like    prior           mu        pi       eps
## 199  9900 -563.6538 -565.2708 1.616983 0.0001236515 0.9266639 0.7634021
## 200  9950 -567.0488 -569.0806 2.031785 0.0001179255 0.9703719 0.7057554
## 201 10000 -564.4037 -565.6457 1.242090 0.0001456115 0.8888590 0.6301657
##         lambda
## 199 0.07471296
## 200 0.07418330
## 201 0.08802858

Each row of result contains a single step from the MCMC chain, that is, a single sample of each parameter. At each step, result returns the value of the log posterior distribution post, the log likelihood like, the log prior prior, and all parameters/augmented data. The inferred sampled ancestor alpha, date of infection t_inf and number of generations between the case and their sampled ancestor kappa of each case in the outbreak is also reported at each step. (str(result) will show you all elements of result).

Introduction: Analysis and visualization of results from the fake outbreak

Before we visualize the actual transmission pathways we have inferred, it is a good idea to take a look at the mixing of the MCMC algorithm to ensure our results are trustworthy.

One way to do this is to look at trace plots of the MCMC chain

Use plot(result) to show a trace plot of the log-posterior values.

plot(result)

You should see something like this, a transient stage at the very start of the chain followed by something bumpy but relatively stable. If you used a different seed than me, your trace plot will look a bit different: that’s ok. We used 10,000 iterations here, but in a full analysis we would use more, in part to minimize the effect of the seed choice on the results.

You can also use plot to get trace plots of any other column in result. Try a few with e.g. plot(result, "prior"), plot(result, "t_inf_29")…

plot(result, "prior")

plot(result, "t_inf_29")

“t_inf_29” (the time of infection of case 29) doesn’t look so well mixed - we can see discrete jumps between different states. This is in part because the algorithm might not update 29’s infection time at each iteration - this may be a necessary compromise for computation time’s sake. It suggests to me that I should run more iterations though, before trusting my results too much.

That transient stage we saw at the beginning of the posterior trace plot is also undesirable to include in our results, because it represents a chain which has not yet converged. We can discard that stage from our trace plot by defining a ‘burn-in’ - this is, we throw away the first x iterations.

There’s no fixed rule about how big a burn-in x to use, but in general we want to keep as many samples as we can whilst removing that un-converged stage. It is often said that we want to make the trace plot look like a ‘fuzzy caterpillar’!

Plot the posterior trace plot without the burn-in stage. (We probably needed more iterations in the MCMC to get a true fuzzy caterpillar, but maybe you can see how this is caterpillar-esque).

plot(result, burnin = 1000)

Ok, so it looks like our MCMC mixed fairly well, and therefore we can continue to get some preliminary results. If this were a real analysis, I would still want to run more iterations, but this can be time consuming. So, let’s visualize the transmission we inferred.

First, plot the distribution of the parameters we estimated: mutation rate mu, reporting probability pi, contact reporting coverage eps and non-infectious contact rate lambda. Remember to remove the burn-in.

# These are 2 plotting options, a histogram or a smoothed density plot.
plot(result, "mu", type="hist", burnin = 1000)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(result, "mu", type="density", burnin = 1000) # and so on for the other parameters

The plot command will recognize various type= arguments for plotting outbreaker_chains objects. We’ve just seen “hist” and “density”, and now we explore some others.

Next, visualize the inferred ancestries and transmission network. Remember your burn-in!

# Who-infected-who; the size of the dot pertains to the inferred probability that that person (row) was the infector of each case (column):
plot(result, type = "alpha", burnin = 1000)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the outbreaker2 package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Distribution of the inferred infection time of each case:
plot(result, type = "t_inf", burnin = 1000)

# How many generations were there between cases? Size of dot = probability again. E.g. 1 generation = no unsampled cases, sampled ancestor -> sampled descendant. 2 generations = 1 unsampled case, sampled ancestor -> unsampled ancestor -> sampled descendant.
plot(result, type = "kappa", burnin = 1000)

# The inferred transmission network! Thickness of line corresponds to probability of that transmission link (the proportion of the MCMC steps in which that pair were seen together)
plot(result, type = "network", burnin = 1000, min_support = 0.01)

Note that type = "network" opens up an interactive plot: try dragging the nodes around! The min_support argument is a useful one; it removes all edges with probability less than its value (0.01) so you don’t have a big mess of lines and can really see the strongest inferred transmission pairs.

Lastly, print out a summary of the outbreak inference.

summary(result, burnin = 1000)

## $step
##    first     last interval  n_steps 
##     1050    10000       50      180 
## 
## $post
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -570.1  -564.4  -561.6  -561.7  -559.2  -553.3 
## 
## $like
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -571.9  -566.1  -563.4  -563.6  -561.1  -555.4 
## 
## $prior
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6565  1.8102  2.0321  1.9284  2.1480  2.2984 
## 
## $mu
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 9.733e-05 1.304e-04 1.452e-04 1.484e-04 1.649e-04 2.108e-04 
## 
## $pi
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8329  0.9468  0.9704  0.9599  0.9830  0.9995 
## 
## $tree
##    from to time   support generations
## 1    NA  1   -1        NA          NA
## 2     1  2    1 1.0000000           1
## 3     2  3    3 1.0000000           1
## 4    NA  4    2        NA          NA
## 5     3  5    4 1.0000000           1
## 6    10  6    8 1.0000000           1
## 7     4  7    5 1.0000000           1
## 8     5  8    6 0.9555556           1
## 9     4  9    5 0.9500000           1
## 10    9 10    6 1.0000000           1
## 11    7 11    7 0.8166667           1
## 12    8 12    7 0.8388889           1
## 13    9 13    7 1.0000000           1
## 14    5 14    7 0.9888889           1
## 15    5 15    7 0.7277778           1
## 16    7 16    8 1.0000000           1
## 17    7 17    7 0.9944444           1
## 18    8 18    9 0.9833333           1
## 19    9 19    8 1.0000000           1
## 20   10 20   10 1.0000000           1
## 21   11 21   10 1.0000000           1
## 22   11 22   10 1.0000000           1
## 23   13 23    9 1.0000000           1
## 24   13 24    9 1.0000000           1
## 25   13 25    9 1.0000000           1
## 26   17 26    9 1.0000000           1
## 27   17 27   10 1.0000000           1
## 28   NA 28    9        NA          NA
## 29   10 29   10 1.0000000           1
## 30   13 30   10 1.0000000           1

This summary provides:

$step: a recap of which steps in the MCMC were used to create the summary $post, $like, $prior, $mu, $pi: distributional stats for these quantities *$tree: a consensus tree, that is, a tree made up of the most frequent infector/infectee pairs among your MCMC samples. The support column tells you the percentage of sampled trees which that pair occurred in, and generations is the most frequent value of kappa among those trees.

Your own analysis: TB and COVID-19

We have 2 datasets prepared for you to work with in this exercise. We will also be revisiting them in the later exercises. You may well not have time to finish working on both datasets by the end of this session: that’s ok! Feel free to keep working on them later, and post any questions on Slack.

Download the tuberculosis dataset from the module webpage and/or the COVID-19 dataset from the Slack channel.

The first dataset consists of 86 genomes from an outbreak of tuberculosis (TB) in Hamburg, Germany during the late 1990s/early 2000s. You can read more about the outbreak here: https://doi.org/10.1371/journal.pmed.1001387 (or on ResearchGate here). TB is caused by a slow-growing bacterium, Mycobacterium tuberculosis, and outbreaks can be hard to recreate using epidemiological information alone since they often occur over long time frames and ‘can occur during short contacts or in high risk populations (e.g., homeless or alcoholic populations)’. These 86 genomes belong to the largest strain cluster identified in Hamburg/Schleswig-Holstein from 1997-2010.

The second dataset consists of 40 early SARS-CoV-2 genomes from the UK, specifically from the B.1.13 lineage. They were collected during March and April 2020, by the COG Consortium UK https://www.cogconsortium.uk/.

Both datasets include aligned sequences in a .fasta file and dates of sampling in a .csv (SARS-CoV-2) or .txt (TB) file. The SARS-CoV-2 csv also includes some additional metadata, though we won’t be using it in this analysis.

Your task is to use outbreaker2 to infer transmission networks for the TB and/or SARS-CoV-2 outbreaks. Repeat the analysis we did for the fake outbreak. Loading the data and setting up outbreaker will require a couple of extra steps, some guidance on this is below. See also the ‘Hints, tips and common issues’ section.

2a) You can read in sequence files to R using the command myseq <- read.FASTA(file = "filenamehere.fasta") from the ape package. Remember that you can read in .csv files with read.csv and .txt files with read.table, or read_csv and read_table if you prefer to use tidyverse functionality.

2b) Recall that the outbreaker_data() function has 5 inputs. We have dna in the .fasta file, and there isn’t any contact tracing data ctd available for these datasets. So that leaves dates, w_dens and f_dens…

2c) When you read in the metadata, R will assume the date columns are character vectors unless you tell it otherwise. You can do this with the as.Date function e.g. as.Date(yourdatesvector). Note: the TB dates are provided in the same order as the sequence file.

Important! One of the trickier things in setting up outbreaker is making sure you input data on a suitable timescale. Outbreaker2 assumes transmission cannot occur over a time <1 unit of whatever timescale you use - so you should choose a timescale that is suitably small. But, choose a scale that is too small and your analysis will take ages to run. For the TB data, we suggest converting everything to months. For the COVID-19 data we suggest keeping the default timescale of days (so, you don’t need to do anything). Here is some simple code you can use to convert the TB data to months:

# For a vector of dates called 'dates'
dates<-(year(dates) - 1997)*12 + month(dates) + day(dates)/monthDays(dates) # convert to number of months since the start of 1997

2d) The last data aspect to define is generation time w_dens and colonization time distribution f_dens, also on a monthly scale for the TB data and a daily scale for the COVID data. There is a script to help you with this in the exercise 2 section of the module webpage wf_distribution.R.

Hints, tips and common issues:

Occasionally operating systems download .fasta files as .fasta.txt, which will cause errors in R. If this happens, you can just rename the file or e.g. in Windows: open in Notepad and save as type ‘All Files’ instead of text file.
For testing your method, you might want to use the create_config input to outbreaker to reduce the number of iterations in the MCMC - since these outbreaks are both much larger than the fake outbreak, outbreaker2 will take longer to run. We would suggest using 100 iterations whilst you get things working. You can increase this later if you want to do a longer run with more reliable results. For example:

my_config <- create_config(n_iter = 100)

You will need to ensure that the labels on each sample in your dna object match the labels on the date for each sample, exactly. To do this, before running outbreaker_data you could set names(dates) = labels(dna) (if your dates vector were named ‘dates’ and your dna object were named ‘dna’, and both were in the same order).
Mutation rate mu: mutation rates are typically expressed per unit time. However, outbreaker2 estimates them per generation of infection. Please keep this in mind if you compare your estimated rates to those in the literature.

Extension tasks and additional reading

A. To learn more about customizing the likelihood, priors and moves used in outbreaker2, check out the tutorial https://www.repidemicsconsortium.org/outbreaker2/articles/customisation.html. You could try this out with the TB/COVID data. For example, what if I told you that I think the reporting rate in both these outbreaks is around 30%, but that there’s no way it’s above 80%?

B. You can read more about outbreaker2 and outbreaker (the original package which outbreaker2 extends upon) at https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-018-2330-z (Campbell et al, 2017) and https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003457&type=printable (Jombart et al, 2014)

Exercise 2: ‘outbreaker2’ for outbreak reconstruction