This is as much a moral rumination as a call for opinions and guidance. How can we better practically resolve taxa as amplicon surveys grow out of control? Can placement algorithms replace identity binning? Should they?
Sequence reads derived from environmental surveys of phylogenetic marker gene amplicons, such as 16S rRNA, CO1, etc. are typically “clustered” (Uclust, CD-Hit, mothur) to form operational taxonomic units (OTUs) after alignment to a reference database and before subsequent phylogenetic analysis or classification. In the widespread application of these genes (molecular “clocks”), this creates problems of cohesion across studies and sequencing platforms (and even run-to-run) because OTUs are internally defined by local neighbors.
Because ecology is important (right?), identifying ecologically and evolutionarily meaningful OTUs has become important, in the microbial world now often described as finding “Ecotypes” of broad clades of microbes. This is impossible with databases, where huge diverse groups are often lumped with a code derived from a single clone picked decades ago. Nonetheless, many marker genes have robust, curated databases, and the problem becomes one of annotation. Comparing organisms across studies has become a real problem in my field of aquatic microbiology. We have lots of groups independently “naming” clades, and lots of reference libraries for marker genes (especially 16S), but read binning by identity is dataset dependent and it can be hard to maintain continuity through time or quickly determine if two groups are talking about the same organism.
I am particularly interested in stabilizing this trajectory in time-series work by using placement to assign reads to nodes of a reference tree. Frankly in practice this is now practical computationally because binning can be so slow as datasets grow. I’d guess a ref should be robustly calculated (ML) with backbone constraints from a curated MSA database and possibly initially expanded using existing sequence libraries from previous work in the ecosystem in question (or analogues) to establish un-curated clades relevant locally. Subsequent amplicon surveys could then “classify” reads according to placements (using pplacer, for example @ematsen ) and nodes could serve as stable, reference-able, visualizable, (expandable?), classifiable taxonomic units.
I’ve been working for the last year to derive a robust reference alignment from databases and structure a classified, constrained tree and workflow for curation of marker gene survey outputs within the pplacer ecosystem (pplacer/guppy/rppr/ and especially taxtastic). Importantly, we wouldn’t be progressively adding sequences to a tree. We would just be allowing for stable nodal annotation of reads as a short-term way of detecting ecologically meaningful differential placements. In essence our goal is to “classify” to node rather than database annotations.
One topic of discussion would be nodal “Assignment”. In the context of pplacer, it would be nice to get an alternate “unambiguous” placement for a given pquery which is the most derived node for which likelihood weight is above some threshold (say 70%). This seems like a reasonable option to incorporate into pplacer: if pquery has an unacceptably low pointmass likelihood weight, reassign to a basal node monophyletic for the placement using an LCA-like algorithm (already in use for your classification algorithms). Giovannoni’s group at OSU has taken an approach to this by modifying pplacer placements with the BioPerl script LCA, which basically re-attaches pqueries to common basal nodes when placements are unacceptably “Fuzzy” as a single pointmass (their group calls this pipeline “Phylotyper” Vergin et al. 2013 (ISME Journal).
Another discussion point that would be really useful is some criteria for determining if pqueries are likely to belong to a “new clade” that isn’t well-represented in the tree. Said differently, it would be nice if post-hoc analyses on a placement mass can suggest if the tree needs to be expanded with additional reference sequences to resolve/accommodate new subclades in specific regions Are any of the existing metrics (adcl, edpl) useful for quantifying this likelihood? Would there be a way to make a “new node” in a refpkg during the placement process if some critical mass of sequences was attaching to a basal node in a clade with better likelihood than more derived nodes?
Thoughts?