Whenever I see an acrimonious debate in which the evidence offered by each side consists of a collection of data sets and simulations, it makes me wonder: where are the theoreticians?
I think that there is such a debate concerning the impact of missing data in phylogenetics (Wiens 2003; Lemmon et al. 2009; Wiens and Morrill 2011; Simmons 2012; Roure et al. 2012). With pplacer, I have noticed that masking non-informative columns can have surprising effects on the relative likelihoods when the data are weak.
I think that the overall effect in standard phylogenetic analysis is probably weak when the gaps are uniformly distributed, but I don’t think it is weak when they are not. And because of primer bias, there is an interesting joint distribution on amplification probability by sequence identity for cases like RAD-seq.
It’s not uncommon to see people building trees from alignments that have a very high proportion of gaps. Sanderson, McMahon, and @mathmomike did some interesting related work in their phylogenetic terraces paper, but for me this doesn’t quite do what I would like. I would just like to know what the contribution to the phylogenetic likelihood is of adding a column with various patterns of gaps, given various phylogenetic trees.
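To make the question concrete, here is a minimal sketch of the per-column calculation using Felsenstein pruning. It rests on two assumptions that are mine rather than anything established above: the JC69 model, and gaps treated as missing data (a leaf with a gap gets a conditional likelihood of 1 for every state). The tree encoding and all function and variable names are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 transition probability from state i to state j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, column):
    """Conditional likelihood vector at a node.

    A node is either a leaf name (str) or a list of (child, branch_length) pairs.
    `column` maps leaf names to a base or to '-' for a gap.
    """
    if isinstance(node, str):
        state = column[node]
        if state == "-":
            # Assumption: a gap is missing data, so every state is equally compatible.
            return [1.0] * 4
        return [1.0 if b == state else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_partial = partial(child, column)
        for i in range(4):
            result[i] *= sum(
                jc69_prob(i, j, t) * child_partial[j] for j in range(4)
            )
    return result

def site_log_likelihood(tree, column):
    """Log-likelihood of one alignment column, with uniform root frequencies."""
    return math.log(sum(0.25 * p for p in partial(tree, column)))

# Two hypothetical four-taxon topologies with arbitrary branch lengths.
tree_a = [([("t1", 0.1), ("t2", 0.2)], 0.05), ([("t3", 0.15), ("t4", 0.1)], 0.05)]
tree_b = [([("t1", 0.1), ("t3", 0.15)], 0.05), ([("t2", 0.2), ("t4", 0.1)], 0.05)]

# Example columns: fully observed, partly gapped, and all gaps.
col_full = {"t1": "A", "t2": "A", "t3": "C", "t4": "C"}
col_gappy = {"t1": "A", "t2": "A", "t3": "C", "t4": "-"}
col_all_gap = {"t1": "-", "t2": "-", "t3": "-", "t4": "-"}

# tree_a with the gapped leaf t4 removed (its parent keeps a single child).
tree_a_pruned = [([("t1", 0.1), ("t2", 0.2)], 0.05), ([("t3", 0.15)], 0.05)]

# Per-column contribution to the log-likelihood difference between the two trees.
delta_full = site_log_likelihood(tree_a, col_full) - site_log_likelihood(tree_b, col_full)
delta_gappy = site_log_likelihood(tree_a, col_gappy) - site_log_likelihood(tree_b, col_gappy)
```

Under these assumptions, a gapped leaf is simply summed out, so a column's likelihood equals its likelihood on the subtree induced by the non-gap taxa, and an all-gap column contributes a log-likelihood of zero to every tree. Comparing `delta_full` with `delta_gappy` for different gap patterns and topologies is one way to start tabulating what each kind of column contributes.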
In principle I think I have all of the skills to do this on my own, but it would be more fun to have others involved, especially someone that would be willing to do some computer work. Anyone want to play? We could do the work as an open Massively Multiplayer Online Research Project, with phylobabblers kibitzing from the sidelines.
Or is this a bad idea? Much ado about nothing?