I poked around this set a bit more, and I am not sure why it would be considered a good set for testing software. In my opinion, a good test set would have a phylogeny that is known from other criteria, such as the fossil record or morphology (in a type of organism where morphology can be used reliably to follow evolution). I don’t see this set of insect data as ideal. The insects shared a common ancestor some 400 million years ago, so even in conserved nuclear-encoded genes there are many molecular evolution issues to contend with. The butterflies are just a subset of insects, and their most recent common ancestor was closer to 280 million years ago.
Many researchers who use this type of data, protein-coding genes that have been under hundreds of millions of years of selection pressure, are careful to use only subsets of the data, such as the first and second positions of each codon. They throw away the third positions, which contribute more noise and misinformation (due to base composition bias, codon usage bias, etc.) than true phylogenetic signal.
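To make the codon-position filtering concrete, here is a minimal sketch in Python. It assumes the alignment is held as a plain dictionary of taxon name to sequence, that every sequence is in frame (the first character is codon position 1), and that all rows have the same length. The function names and the toy sequences are made up for illustration and do not come from any particular package.

```python
def drop_third_positions(seq: str) -> str:
    """Return seq with every third codon position removed.

    Assumes seq is an in-frame coding sequence (or alignment row)
    whose first character is codon position 1.
    """
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def strip_alignment(alignment: dict[str, str]) -> dict[str, str]:
    """Apply drop_third_positions to every row of an alignment."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return {name: drop_third_positions(seq) for name, seq in alignment.items()}


# Hypothetical toy alignment: the two taxa differ only at third positions.
toy = {"taxon_A": "ATGGCCAAA", "taxon_B": "ATGGCTAAG"}
stripped = strip_alignment(toy)
for name, seq in stripped.items():
    print(name, seq)
```

In this toy case the two rows become identical once the third positions are gone, which is the intended effect: differences confined to third positions no longer influence the analysis.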
Surely there must be good data sets that each present a particular type of problem, such as known shifts in codon usage bias (the GC:AT ratio has been shown to change rather rapidly as species adapt to high- and low-temperature environments), a “long branches attract” problem (human, gorilla, mouse, rat, dozens of carnivores and ungulates), or well-documented changes in rates of evolution between lineages.
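One rough way to check whether a candidate set shows a base composition shift is to compare GC content across taxa, overall and at third codon positions. The sketch below assumes in-frame DNA sequences and counts only unambiguous A, C, G and T, so gaps and ambiguity codes are ignored. The function names and the toy data are illustrative assumptions, not part of any existing tool.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def gc3(seq: str) -> float:
    """GC fraction at third codon positions of an in-frame sequence."""
    return gc_fraction(seq[2::3])


# Hypothetical toy sequences with different third-position composition.
toy = {"taxon_A": "ATGGCCAAG", "taxon_B": "ATGGCTAAA"}
for name, seq in toy.items():
    print(f"{name}: GC = {gc_fraction(seq):.2f}, GC3 = {gc3(seq):.2f}")
```

A large spread in GC3 between taxa would suggest that the set is a candidate for testing how a program copes with composition bias. It is only a screening statistic, not a formal test of compositional heterogeneity.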
Each data set might be appropriate for only a small range of phylogenetic analyses. The set that tests how well a given program can resist the “long branches attract” problem will not be the same set that can test population genetics questions.