Richness or Observed Species (S)

By Hand
How many unique bugs are in there?
Published

December 22, 2021

Richness \(\textcolor{#037bcf}{S}\) is a measure of alpha-diversity, and is perhaps the easiest to calculate and comprehend. \(\textcolor{#037bcf}{S}\) is simply the count of the number of species with an abundance > 0. \(\textcolor{#037bcf}{S}\) is a per-sample metric (as opposed to per-species) in that a single value is obtained describing the entire sample.

Simple Calculation

We can easily calculate the richness of the 6 mock communities generated previously. As a reminder, Samples 1-3 had the same underlying community structure but were sampled to different depths, and Samples 4-6 had a decrease in evenness but were sampled to the same depth.

load(here("static", "data", "mock_community.rds"))

S = mock_community %>%
  pivot_longer(cols = -ASV, names_to = "SAMPLE", values_to = "ABUNDANCE") %>%
  group_by(SAMPLE) %>%
  filter(ABUNDANCE > 0) %>%
  tally(name = "S") 
Sample S
Sample_1 19
Sample_2 13
Sample_3 10
Sample_4 20
Sample_5 20
Sample_6 20

Notice how although samples 1-3 were all sampled from the same true population, we miss the more rare species simply by not sequencing to enough depth to capture them and \(\textcolor{#037bcf}{S}\) goes down. We saw similar results in the Jaccard analysis of these communities.

The more samples you have, the more you increase the chance that a species found in one sample is missing from another. Or, the larger the range of sequencing effort, the greater chance species found in the deeply sequenced samples are missing from the shallow sequenced samples.

For Samples 4-6, they were all sampled to a sufficient depth so all 20 species were found. That is about all that \(\textcolor{#037bcf}{S}\) can tell us, however, and we do not see the distribution differences here.

Advanced

There are more complicated ways to calculate \(\textcolor{#037bcf}{S}\) that attempt to use more rigorous statistics. The most popular of these are Chao1 and ACE2, and are both implemented in the vegan package in the estimateR() function (try ?estimateR for more information). These calculate \(\textcolor{#037bcf}{S}\) as well as the standard error (denoted with “se”). S.obs is the “simple” richness calculated as we did above.

V = mock_community %>%
  column_to_rownames("ASV") %>%
  t() %>%
  vegan::estimateR()
Sample S.obs S.chao1 se.chao1 S.ACE se.ACE
Sample_1 19 19.5 1.287887 20.35391 2.103822
Sample_2 13 14.0 2.283481 16.66667 1.178569
Sample_3 10 10.0 0.000000 10.00000 1.449138
Sample_4 20 20.0 0.000000 NaN NaN
Sample_5 20 20.0 0.000000 NaN NaN
Sample_6 20 20.0 0.000000 NaN NaN

For Samples 1-3 both methods did a fairly good job. We will look a little more closely at this method (as well as calculate it “by hand”) below.

ACE doesn’t deal well when there are no singletons/doubletons (or even ASVs found less than 10 times3), and that is why it failed to return a value for Samples 4-6.

Look into ChaoSpecies in the SpadeR package for more methods. Our sample dataset it too tiny to get meaningful results from these powerful tools, and they work better on real data. There is a Chao2 metric which uses presence/absence, but it doesn’t seem to pop-up much anywhere.

Chao1

Of all the values obtained in this advanced section, only the Chao1 is reasonably simple to do by hand. In the equation below, F1 and F2 are the counts of singletons and doubletons, respectively.

\[\textcolor{#037bcf}{S_{Chao1} = S_{Obs} + {f_1(f_1 - 1)\over2(f_2 + 1)}}\]

C = mock_community %>%
  pivot_longer(cols = c(everything(), -ASV), names_to = "SAMPLE", values_to = "ABUNDANCE") %>%
  mutate(S = case_when(  
    ABUNDANCE >  0 ~ 1)) %>%
  mutate(F1 = case_when(  
    ABUNDANCE == 1 ~ 1)) %>%
  mutate(F2 = case_when(  
    ABUNDANCE == 2 ~ 1)) %>%
  group_by(SAMPLE) %>%
  summarise(S = sum(S, na.rm = T),
            F1 = sum(F1, na.rm = T), 
            F2= sum(F2, na.rm = T)) %>%
  mutate(CHAO1 = S + ((F1 * (F1 - 1)) / (2 * (F2 + 1))))
Sample S F1 F2 CHAO1
Sample_1 19 2 1 19.5
Sample_2 13 2 0 14.0
Sample_3 10 0 3 10.0
Sample_4 20 0 0 20.0
Sample_5 20 0 0 20.0
Sample_6 20 0 0 20.0

Looking at the equation above, it is obvious why singletons are so important to these calculations. If there are no singletons, then the fraction equates to \(0\) and all you get is \(\textcolor{#037bcf}{S_{Chao1} = S_{Obs}}\).

Additionally, we see an overestimation in Sample_2 because of the high singleton but low doubleton count. This makes

\[\textcolor{#037bcf}{S_{Chao1} = 14 + {4(4 - 1)\over2(0 + 1)}}\]

which simplifies to

\[\textcolor{#037bcf}{S_{Chao1} = 14 + \frac{12}{2} = 20}\]

and is almost a 50% overestimation!

Further Reading

  • Species accumulation curves

Footnotes

  1. https://pubmed.ncbi.nlm.nih.gov/3427163/↩︎

  2. https://onlinelibrary.wiley.com/doi/10.1111/biom.12200↩︎

  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC93182/↩︎