Exploring MGS Bias

Background

Marker gene and metagenomic sequencing (MGS) measurements are biased. Or, as a recent preprint puts it:

...the community compositions measured by MGS are wrong.

This is because each step of an MGS workflow preferentially detects certain taxa (or genes) over others (Brooks 2016, Hugerth and Andersson 2017, Pollock et al. 2018).

Luckily, there’s hope for MGS still! In the preprint that I referenced above (Consistent and correctable bias in metagenomic sequencing measurements), McLaren and friends explore a model of how bias affects MGS measurements and what to do about it. Their model is nice and simple.

Bias multiplies the true relative abundances within each sample by taxon- and protocol-specific factors that describe the different efficiencies with which taxa are detected by the workflow.

Cool! So in this model, bias is multiplicative, and each step in the protocol has a bias multiplier for each individual taxon. Therefore, to get the overall protocol bias for a particular OTU, you just multiply together that OTU's bias (detection efficiency) from each step of the protocol.
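To make the multiplication concrete, here is a minimal Python sketch of the model as described above. The per-step numbers are hypothetical (they are not from the post or the preprint); I picked them only so that their products match the 1x, 18x, and 6x values in the Total bias column of the Protocol biases table. The function names and the even starting community are also just for illustration.

```python
from math import prod

# Hypothetical per-step detection efficiencies (NOT from the post or preprint),
# chosen only so their products give total biases of 1x, 18x, and 6x.
step_bias = {
    "extraction":     {"OTU_1": 1.0, "OTU_2": 3.0, "OTU_3": 2.0},
    "pcr":            {"OTU_1": 1.0, "OTU_2": 2.0, "OTU_3": 3.0},
    "sequencing":     {"OTU_1": 1.0, "OTU_2": 3.0, "OTU_3": 1.0},
    "bioinformatics": {"OTU_1": 1.0, "OTU_2": 1.0, "OTU_3": 1.0},
}


def total_bias(steps):
    """Multiply each taxon's bias across all protocol steps."""
    taxa = next(iter(steps.values())).keys()
    return {t: prod(step[t] for step in steps.values()) for t in taxa}


def observed_proportions(actual, bias):
    """Scale actual abundances by total bias, then renormalize to sum to 1."""
    weighted = {t: actual[t] * bias[t] for t in actual}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}


bias = total_bias(step_bias)
actual = {"OTU_1": 1.0, "OTU_2": 1.0, "OTU_3": 1.0}  # hypothetical even community
observed = observed_proportions(actual, bias)

print(bias)      # total bias per OTU
print(observed)  # what the biased protocol would report as proportions
```

Under these assumed numbers, a perfectly even community would be reported as 4% OTU_1, 72% OTU_2, and 24% OTU_3.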

The authors go into a lot of detail with suggestions and best practices for handling these biases, but one that is particularly easy to implement is their suggestion to use techniques based on taxon ratios (i.e., compositional differences between samples), as they are less sensitive to bias than those based on proportions. Or, as they put it:

The fold-change in taxon ratios between samples is invariant to bias.
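Here is a short sketch of why that holds under the multiplicative model. The notation is my own, not the preprint's: A_i(s) is the actual abundance of taxon i in sample s, B_i is that taxon's total bias (assumed to be the same in every sample, since the same protocol is used), and O_i(s) is the observed abundance.

```latex
% Assumed notation (not from the preprint):
%   A_i(s) = actual abundance of taxon i in sample s
%   B_i    = total protocol bias for taxon i (same for all samples)
%   O_i(s) = observed abundance of taxon i in sample s
\[
O_i(s) \propto A_i(s)\, B_i
\quad\Longrightarrow\quad
\frac{O_i(s)}{O_j(s)} = \frac{A_i(s)}{A_j(s)} \cdot \frac{B_i}{B_j}
\]
\[
\frac{O_i(s)/O_j(s)}{O_i(t)/O_j(t)}
= \frac{A_i(s)/A_j(s)}{A_i(t)/A_j(t)} \cdot \frac{B_i/B_j}{B_i/B_j}
= \frac{A_i(s)/A_j(s)}{A_i(t)/A_j(t)}
\]
```

The ratio within a single sample is still off by the factor B_i/B_j, but that factor is identical in samples s and t, so it cancels when you compare the two.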

If you want to switch to ratio-based techniques, check out "The statistical analysis of compositional data" by Aitchison and "Microbiome Datasets Are Compositional: And This Is Not Optional", a nice review about using compositional techniques to analyze microbiome data.

There is a LOT more cool stuff in the preprint (and I’m probably oversimplifying things), so I encourage you to take a look for more info!

Interactive app

To show how bias, as modeled in the preprint, affects observed community compositions and analyses, I whipped up this interactive app for you! (It's basically an interactive combination of Figures 2 and 3 from the preprint.)

The actual counts and protocol bias tables are editable. To do so, just click in one of the existing values and type in a new one! After changing a value, pressing tab, enter/return, or clicking elsewhere on the page will update the observed counts, bar charts, and sample distance calculations.

Play around with it and you can get a feel for the HUGE effect bias can have on your observed measurements.

Note: The sample distance table calculates a few common distance/dissimilarity stats used to compare samples (Euclidean, Bray-Curtis, and Aitchison). Ideally, the distance/dissimilarity between samples 1 and 2 should be the same for both the actual and observed rows, but as you will see, this only holds for Aitchison distance! (If you guessed that Aitchison distance uses taxon ratios, you were right. As it turns out, Aitchison distance is a distance measure for compositional data.)

Actual counts

OTU ID	Sample 1	Sample 2
OTU_1
OTU_2
OTU_3

Protocol biases

OTU ID	Extraction bias	PCR bias	Sequencing bias	Bioinformatics bias	Total bias
OTU_1					1.00x
OTU_2					18.00x
OTU_3					6.00x

Observed counts

OTU ID	Sample 1	Sample 2
OTU_1	1	15
OTU_2	18	18
OTU_3	6	24

Sample distance

S1 vs S2	Euclidean	Bray-Curtis	Aitchison
Actual	14.32	0.74	3.67
Observed	22.80	0.39	3.67
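If you want to check these numbers outside the app, here is a minimal Python sketch. It is not the app's actual code. The observed counts and total biases come from the tables above; the actual counts are my assumption, backed out by dividing each observed count by its total bias. The Aitchison distance here is the Euclidean distance between centered log-ratio (clr) transformed samples, which is one standard definition and may not be exactly what the app computes.

```python
import math

total_bias = [1.0, 18.0, 6.0]  # Total bias column for OTU_1..OTU_3
observed = {"S1": [1.0, 18.0, 6.0], "S2": [15.0, 18.0, 24.0]}

# Assumption: actual counts are recovered as observed / total bias.
actual = {
    s: [count / b for count, b in zip(counts, total_bias)]
    for s, counts in observed.items()
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bray_curtis(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


def clr(x):
    """Centered log-ratio transform; requires strictly positive values."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def aitchison(x, y):
    """Euclidean distance between clr-transformed samples."""
    return euclidean(clr(x), clr(y))


for label, data in (("Actual", actual), ("Observed", observed)):
    s1, s2 = data["S1"], data["S2"]
    print(label, euclidean(s1, s2), bray_curtis(s1, s2), aitchison(s1, s2))
```

With these inputs, the Euclidean and Bray-Curtis values round to the ones in the table and differ between the actual and observed rows, while the Aitchison distance is the same for both. The square of the Aitchison distance from this sketch rounds to 3.67, so my guess is that the table reports the squared value, but that is only a guess. Note that the clr transform takes logs, so real data with zero counts would need some kind of zero handling first.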

Composition charts
