Exploring MGS Bias
Marker-gene and metagenomic sequencing (MGS) measurements are biased, or, as a recent preprint puts it:
...the community compositions measured by MGS are wrong.
Luckily, there’s hope for MGS still! In the preprint that I referenced above (Consistent and correctable bias in metagenomic sequencing measurements), McLaren and friends explore a model of how bias affects MGS measurements and what to do about it. Their model is nice and simple.
Bias multiplies the true relative abundances within each sample by taxon- and protocol-specific factors that describe the different efficiencies with which taxa are detected by the workflow.
Cool! So in this model, bias is multiplicative, and each step in the protocol has a bias multiplier for each individual taxon. Therefore, to get the overall protocol bias for a particular OTU, you just multiply together that OTU's bias (detection efficiency) at each step of the protocol.
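To make the multiplication concrete, here is a minimal Python sketch of that idea. The step names and efficiency values are made up for illustration (they are not taken from the preprint), and `apply_bias` is just a hypothetical helper name.

```python
import numpy as np

# Hypothetical per-step detection efficiencies for three taxa (made-up values).
# Rows are protocol steps, columns are taxa.
step_bias = np.array([
    [1.0, 4.0, 0.5],  # e.g. DNA extraction
    [1.0, 1.5, 2.0],  # e.g. PCR amplification
    [1.0, 3.0, 1.0],  # e.g. sequencing and bioinformatics
])

# Overall protocol bias for each taxon: the product of its per-step biases.
total_bias = step_bias.prod(axis=0)


def apply_bias(actual, bias):
    """Multiply actual abundances by per-taxon bias, then renormalize to proportions."""
    observed = np.asarray(actual, dtype=float) * np.asarray(bias, dtype=float)
    return observed / observed.sum()


# An even community: every taxon makes up one third of the sample.
actual = np.array([1 / 3, 1 / 3, 1 / 3])
observed = apply_bias(actual, total_bias)

print("total bias:", total_bias)
print("observed proportions:", observed)
```

With these made-up numbers, the second taxon is detected 18 times more efficiently than the other two, so a perfectly even community ends up looking like it is dominated by one taxon.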
The authors go into a lot of detail with suggestions and best practices for handling these biases, but one that is particularly easy to implement is their suggestion to use techniques based on taxon ratios (i.e., compositional differences between samples), since these are less sensitive to bias than techniques based on proportions. Or as they put it:
The fold-change in taxon ratios between samples is invariant to bias.
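Here is a quick sketch of why that works under the multiplicative model. The ratio of two taxa within a sample gets multiplied by the ratio of their biases, and that same factor cancels when you compare two samples measured with the same protocol. The compositions and bias values below are made up for illustration, and the helper names are my own.

```python
import numpy as np

# Hypothetical overall protocol bias for three taxa (made-up values).
bias = np.array([1.0, 18.0, 6.0])

# Hypothetical actual compositions of two samples (proportions).
actual_1 = np.array([0.5, 0.25, 0.25])
actual_2 = np.array([0.2, 0.2, 0.6])


def observe(actual, bias):
    """Apply multiplicative bias and renormalize to proportions."""
    obs = actual * bias
    return obs / obs.sum()


def ratio_fold_change(a, b, i, j):
    """Fold-change from sample b to sample a in the ratio of taxon i to taxon j."""
    return (a[i] / a[j]) / (b[i] / b[j])


obs_1 = observe(actual_1, bias)
obs_2 = observe(actual_2, bias)

# Fold-change in the ratio of taxon 0 to taxon 1: the bias cancels.
actual_ratio_fc = ratio_fold_change(actual_1, actual_2, 0, 1)
observed_ratio_fc = ratio_fold_change(obs_1, obs_2, 0, 1)

# Fold-change in the proportion of taxon 0 alone: the bias does not cancel.
actual_prop_fc = actual_1[0] / actual_2[0]
observed_prop_fc = obs_1[0] / obs_2[0]

print("ratio fold-change (actual, observed):", actual_ratio_fc, observed_ratio_fc)
print("proportion fold-change (actual, observed):", actual_prop_fc, observed_prop_fc)
```

The two ratio fold-changes agree, while the two proportion fold-changes do not, because each sample's proportions are renormalized by a different total.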
If you want to switch to ratio-based techniques, check out "The statistical analysis of compositional data" by Aitchison and "Microbiome Datasets Are Compositional: And This Is Not Optional", a nice review about using compositional techniques to analyze microbiome data.
There is a LOT more cool stuff in the preprint (and I’m probably oversimplifying things), so I encourage you to take a look for more info!
To show how bias, as modeled in the preprint, affects observed community compositions and analyses, I whipped up this interactive app for you! (It's basically an interactive combination of Figures 2 and 3 from the preprint.)
The actual counts and protocol bias tables are editable. To do so, just click in one of the existing values and type in a new one! After changing a value, pressing tab, enter/return, or clicking elsewhere on the page will update the observed counts, bar charts, and sample distance calculations.
Play around with it and you can get a feel for the HUGE effect bias can have on your observed measurements.
Note: The sample distance table calculates a few common distance/dissimilarity stats used to compare samples (Euclidean, Bray-Curtis, and Aitchison). Ideally, the distance/dissimilarity between samples 1 and 2 should be the same in both the actual and observed rows, but as you will see, that only holds for Aitchison distance! (If you guessed that Aitchison distance uses taxon ratios, you were right. As it turns out, Aitchison distance is a distance measure for compositional data.)
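If you would rather check this outside the app, here is a rough Python sketch of the same comparison. It assumes the distances are computed on proportions with no zero values, and it uses the usual definition of Aitchison distance as the Euclidean distance between centered log-ratio (CLR) transformed samples. The app may compute things slightly differently, and the compositions and bias values are made up.

```python
import numpy as np


def observe(actual, bias):
    """Apply multiplicative bias and renormalize to proportions."""
    obs = actual * bias
    return obs / obs.sum()


def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))


def bray_curtis(u, v):
    return float(np.abs(u - v).sum() / (u + v).sum())


def clr(x):
    """Centered log-ratio transform (requires strictly positive values)."""
    log_x = np.log(x)
    return log_x - log_x.mean()


def aitchison(u, v):
    """Euclidean distance between CLR-transformed samples."""
    return euclidean(clr(u), clr(v))


# Hypothetical bias and actual compositions (made-up values, no zeros).
bias = np.array([1.0, 18.0, 6.0])
actual_1 = np.array([0.5, 0.25, 0.25])
actual_2 = np.array([0.2, 0.2, 0.6])
obs_1, obs_2 = observe(actual_1, bias), observe(actual_2, bias)

for name, dist in [("Euclidean", euclidean),
                   ("Bray-Curtis", bray_curtis),
                   ("Aitchison", aitchison)]:
    print(f"{name}: actual={dist(actual_1, actual_2):.4f}, "
          f"observed={dist(obs_1, obs_2):.4f}")
```

Only the Aitchison row should print matching actual and observed values: the CLR transform turns the multiplicative bias into an additive shift that is the same for both samples, so it drops out of the difference.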