My current approach takes a lot of time and memory when the number of ambiguous patterns (e.g. NGC or RGC) gets large. As a result, I currently only allow Y, R, W, S, and N in codons, and disallow K, M, B, D, H, and V. Any thoughts?
Right now every codon (e.g. ATA) and every ambiguous pattern (e.g. NTY) gets its own index.
I am interested in what algorithms people use to actually read codons like KBN.
I use an unnecessarily complicated data representation system that, for each node in the tree, associates an emission likelihood with each possible state. So in the traditional setting, unobserved internal nodes would have 1.0 associated with each codon state, and leaves would have 1.0 associated with only the observed codon and 0.0 associated with all other codons. In this setting of complete overkill, if a taxon has the ambiguously coded codon KBN at a given site in the alignment, then the corresponding leaf would have 1.0 associated with all codons compatible with KBN and 0.0 associated with all codons not compatible with KBN.
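To make the leaf setup concrete, here is a rough Python sketch of that kind of per-state likelihood vector. It is not my actual code: the names `IUPAC`, `CODONS`, and `leaf_likelihoods` are made up for illustration, and it assumes all 64 codons are states in lexicographic order (a real codon model might drop stop codons).

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Assumed state ordering: all 64 codons, lexicographic over A, C, G, T.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def leaf_likelihoods(pattern):
    """Return one value per codon state: 1.0 if compatible with pattern, else 0.0."""
    return [
        1.0 if all(base in IUPAC[sym] for base, sym in zip(codon, pattern)) else 0.0
        for codon in CODONS
    ]


kbn = leaf_likelihoods("KBN")   # 1.0 for every codon matching [GT][CGT][ACGT]
ata = leaf_likelihoods("ATA")   # 1.0 only at the ATA state
```

With this layout an unambiguous codon is just the special case where exactly one entry is 1.0, and an unobserved node corresponds to the pattern NNN.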
This is probably not your question, but instead of precomputing a mask for each possible triple (e.g. KBN) you could precompute a mask for each ambiguous nucleotide for each of the three possible sites within a codon, and then bitwise-AND the three masks together later when you need it. So for example the mask associated with KBN could be something like mask[K0] & mask[B1] & mask[N2].
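A minimal sketch of the per-position mask idea in Python, under a few assumptions of my own: each mask is a 64-bit integer with one bit per codon, codons are ordered lexicographically, and the names `MASKS`, `codon_mask`, and `compatible_codons` are hypothetical. Only 15 symbols x 3 positions = 45 masks are stored, rather than one per possible triple.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Assumed state ordering: bit i of a mask corresponds to CODONS[i].
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def build_position_masks():
    """Precompute one bitmask per (symbol, codon position) pair."""
    masks = {}
    for pos in range(3):
        for sym, bases in IUPAC.items():
            bits = 0
            for i, codon in enumerate(CODONS):
                if codon[pos] in bases:
                    bits |= 1 << i
            masks[(sym, pos)] = bits
    return masks


MASKS = build_position_masks()


def codon_mask(pattern):
    """Combine the three per-position masks, e.g. mask[K0] & mask[B1] & mask[N2]."""
    return MASKS[(pattern[0], 0)] & MASKS[(pattern[1], 1)] & MASKS[(pattern[2], 2)]


def compatible_codons(pattern):
    """List the codons whose bit is set in the combined mask."""
    m = codon_mask(pattern)
    return [codon for i, codon in enumerate(CODONS) if (m >> i) & 1]


rgc = compatible_codons("RGC")
kbn_count = bin(codon_mask("KBN")).count("1")
```

Since the combined mask is built on demand from three lookups and two ANDs, nothing has to be stored per ambiguous triple, so allowing K, M, B, D, H, and V should not grow the table.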