
Full Tree of Life (ITIS edition)

cross-posted from: https://mander.xyz/post/41224832

ITIS Tree of Life

Just finished another visualization of an entire taxonomy tree. The previous one is buried here: GBIF ToL.

The main concept is very simple: each taxon is a point, and each taxon has a clockwise-bent arc from its parent taxon.
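A minimal sketch of one way to get such an arc, not necessarily how the tool draws it: a quadratic Bézier curve whose control point is pushed sideways off the parent-to-child chord. The function name `clockwise_arc` and the `bend` and `steps` parameters are made up for illustration, and "clockwise" here assumes math-style axes with y pointing up (on a screen with y pointing down, the sense flips).

```python
def clockwise_arc(parent, child, bend=0.25, steps=16):
    """Sample a quadratic Bezier from parent to child that bends clockwise.

    parent, child: (x, y) tuples. bend: sideways offset of the control
    point as a fraction of the chord length (hypothetical parameter).
    """
    (x0, y0), (x1, y1) = parent, child
    dx, dy = x1 - x0, y1 - y0
    # Left-hand normal of the chord is (-dy, dx) in y-up axes. Putting the
    # control point on that side makes the tangent rotate clockwise
    # as we travel from parent to child.
    cx = (x0 + x1) / 2 - dy * bend
    cy = (y0 + y1) / 2 + dx * bend
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * x0 + b * cx + c * x1,
                       a * y0 + b * cy + c * y1))
    return points


if __name__ == "__main__":
    for x, y in clockwise_arc((0.0, 0.0), (1.0, 0.0), steps=4):
        print(f"{x:.3f} {y:.3f}")
```

The returned points can be fed to any polyline drawing routine; a larger `bend` gives a more pronounced curl.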

The trick is to place those points in a meaningful way. At first, I used a force-directed algorithm to do it. In general, it succeeded in grouping points by clade, but it introduced a lot of branch overlap (check how the purple Echinodermata "intrudes" into Arthropoda in the GBIF version).

Force-directed algorithms can lay out not only trees but basically any graph, so I thought: maybe a tree-specific algorithm would produce a better result? I found a cool algorithm called Voronoi Treemap, which, for any given tree, builds a set of nested polygons, one polygon per node. Not only does it eliminate the branch overlap problem, it also ensures that branches fit into convex polygons, and you can even add gaps between adjacent branches. So I built a CLI wrapper around a Java implementation I found on GitHub.
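Voronoi treemaps size each polygon according to a per-node weight, so the tree has to be annotated with weights before it is handed to the layout. Here is a rough sketch of that preprocessing step, assuming the weight is simply the number of leaf taxa under each node. That choice is an assumption for illustration, as are the `leaf_counts` name and the child-to-parent dictionary format; none of this is the actual input format of the Java implementation.

```python
from collections import defaultdict


def leaf_counts(parent_of):
    """Return {taxon: number of leaf taxa in its subtree}.

    parent_of maps each taxon to its parent; the single root maps to None.
    (Hypothetical input format, chosen only for this sketch.)
    """
    children = defaultdict(list)
    root = None
    for node, parent in parent_of.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)

    counts = {}
    # Iterative post-order walk, so very large trees cannot hit
    # Python's recursion limit.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if children_done:
            # A leaf has no children and counts as 1.
            counts[node] = sum(counts[k] for k in kids) or 1
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return counts


if __name__ == "__main__":
    toy = {
        "Life": None,
        "Plantae": "Life", "Animalia": "Life",
        "Aloe": "Plantae", "Lemna": "Plantae",
        "Apis": "Animalia",
    }
    print(leaf_counts(toy))
```

With weights like these, a branch's polygon grows with whatever the database lists as leaves under it, so gaps in the source data show up directly as small polygons.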

At first, I used it for the NCBI database, but I didn't use gaps and haven't published an interactive version yet (there are PNGs on Wikimedia Commons, though). Then I made a treemap for ITIS. The points are drawn as points, and the polygons are used for the mouse hover feature. When I was making the force-directed GBIF version, I had to compute those polygons separately for each clade of the given ranks. Now both points and polygons are computed by one algorithm, which is nice.

What do you think?

11 comments
  • Very cool, and would also be nice to see the main subdivisions labelled accordingly.

    For example, Coleoptera should be the largest 'bubble' in the arthropod group, I think?

  • I don't fully understand how taxa are chosen for this to be displayed. It obviously shows only a fraction of all species, but how are they selected? I looked up Orchidaceae because I know there are about 27000 species. And Oncidium has 335 accepted species according to POWO, but only 5 are shown. Why these 5? Why not only the genus or all 335 species?

    ETA: Looking over to the Araceae, Lemna has only 13 accepted species, of which 7 are shown. Anthurium has 1357 accepted species, of which 10 are shown??? Interestingly, a lot of Aloe species (592) are displayed, although it has far fewer species than Anthurium. So I guess it is relative to its own group? And Lemna is considered its own group within the Araceae?

    And it seems to be really driven by the number of genera, not of species. For example Poaceae is huge with around 780 genera and around 12,000 species. Bromeliaceae is tiny in comparison with about 80 genera and 3700 known species (both numbers from Wikipedia). I bet the cloud size difference is about the 780 to 80 and not in relation to the species that they hold. So taxonomic groups with many lumpers will be smaller and groups where a lot of taxonomists are active that like to split are huge...

    Edit2: Looking at the gbif tool you linked to, it feels much more intuitive and the genera I mentioned above are actually represented much better. Species diversity within certain genera is very apparent.

    Would be really cool to have a high-resolution poster with most of the order and family names! Although I would prefer the older version.

    • You have some level of expertise, and with this visualisation you can notice how the underlying database leans towards certain filling strategies and how the maintainers put more work into one direction than another. This is quite cool.

    • Thank you for the detailed analysis! Can you check out the PNGs in the Wikimedia Commons category I’ve linked in the post? There are some for GBIF, ITIS and NCBI. Some of them are more readable than the online tool.

      For the visualisation, I’ve used the whole database each time, so what you see is determined by what is in the database. If you can’t find some genera or species in this tool, they are probably not present in the db. Usually these databases are backed by scientific organisations, which can be focused on specific areas, and that probably affects the level of detail. E.g. I know some services specialise in marine species, and Catalogue of Life doesn’t have any non-avian dinosaurs. For ITIS, I’ve noticed they have a very low number of Fungi. I don’t know why. GBIF is trying to get data from all the sources they can find, which is probably why the size of branches is close to what you expect. Also, they are the biggest, but they don’t keep track of intermediate ranks (subphyla, subkingdoms, etc.).

      I wonder if NCBI is representative? 

      These two files are from Commons category:

      GBIF Tree of Life - colored by class, force-directed.png

      NCBI Tree of Life - Voronoi Treemap - 4 Order.png

      • Haha, this wasn't even in detail and I only looked at it a bit on my phone screen. But it is quite enjoyable to have such a visualization, fantastic job!

        For plants, I usually go for POWO (Plants of the World Online), and they have their own database on vascular plants, which according to them is incorporated into the GBIF database (and then also into the Catalogue of Life Checklist).

        You are probably right in that this is based on the underlying datasets and that GBIF does the best job regarding plants.

        Looking over to the animals in both visualizations, I feel like the GBIF one looks better and the ITIS one gives a slightly better overview. E.g. looking at Hymenoptera, I need to zoom in much more in the ITIS one to get to the families, and it doesn't show any intermediate rank between order and family. The GBIF one does the same but shows the family names also when zoomed out more, although it is harder to distinguish the borders of different orders than in the ITIS dataset.

        However, looking at Hymenoptera also made me realize that both visualizations are a mess in their own way! There are many entries missing in the ITIS dataset: for example, there are 800 genera and over 8000 described species in the Symphyta, but they are only a tiny section south of the rest of the Hymenoptera. But just above, there is a wasp genus named Microgaster with similarly many points that apparently only has 100 described species!

        But in the GBIF visualization, the arrangement of various groups seems to be done in a haphazard way. For example, if you were to look for all the bees (Anthophila, but this rank is not shown), in the ITIS one they are all at least displayed close together (although not within one rank). But in the GBIF visualization, e.g. Apidae and Halictidae are at totally different ends within the Hymenoptera group. And there are much more basal groups like Tenthredinidae (in Symphyta) between. So within the Hymenoptera groups, this makes no sense at all! It would be much better if more basal groups would be closer to the origin and more distant lineages are more distant. But within this visualization all families within the Hymenoptera are just 'related' to the Hymenoptera and not each other.

        Maybe the problem is also that plants and animals have their own taxonomies that are structured differently. You usually don't need intermediate ranks between order and family in plants. In animals this seems to be quite different. Hm, not sure how to solve this in an elegant manner.

        Regarding your actual question of whether the NCBI visualization is representative: I cannot say. The PNG only shows plant orders and not even families, so it's impossible to tell which point cloud is which genus and how they are represented. Looking at it, though, I feel like there are some locations where a huge number of points link to a single origin. E.g. within the Lepidoptera, a third of all taxa lead to a single point. Not sure what this might be, because in the other two visualizations the Lepidoptera are much more diverse. The most diverse families are the Erebidae with about 25,000 species and the Geometridae with 23,000 species, but both are just a small portion of the 180,000 Lepidoptera species. So my guess is that the NCBI one shows superfamilies and not families as intermediate ranks, and in the case of the Lepidoptera, the large cloud is the Noctuoidea, which contain about 70,000 species. But then there are even fewer ranks than in the other visualizations, if all listed taxa within this superfamily point to a single origin.

        I could go on for ages! This is so much fun, haha :)
