load(here("static", "data", "mock_community.rds"))
= mock_community %>%
S pivot_longer(cols = -ASV, names_to = "SAMPLE", values_to = "ABUNDANCE") %>%
group_by(SAMPLE) %>%
filter(ABUNDANCE > 0) %>%
tally(name = "S")
Richness or Observed Species (S)
Richness \(\textcolor{#037bcf}{S}\) is a measure of alpha-diversity, and is perhaps the easiest to calculate and comprehend. \(\textcolor{#037bcf}{S}\) is simply the count of the number of species with an abundance > 0. \(\textcolor{#037bcf}{S}\) is a per-sample metric (as opposed to per-species) in that a single value is obtained describing the entire sample.
Simple Calculation
We can easily calculate the richness of the 6 mock communities generated previously. As a reminder, Samples 1-3 had the same underlying community structure but were sampled to different depths, and Samples 4-6 had a decrease in evenness but were sampled to the same depth.
Sample | S |
---|---|
Sample_1 | 19 |
Sample_2 | 13 |
Sample_3 | 10 |
Sample_4 | 20 |
Sample_5 | 20 |
Sample_6 | 20 |
Notice how although samples 1-3 were all sampled from the same true population, we miss the more rare species simply by not sequencing to enough depth to capture them and \(\textcolor{#037bcf}{S}\) goes down. We saw similar results in the Jaccard analysis of these communities.
The more samples you have, the more you increase the chance that a species found in one sample is missing from another. Or, the larger the range of sequencing effort, the greater chance species found in the deeply sequenced samples are missing from the shallow sequenced samples.
For Samples 4-6, they were all sampled to a sufficient depth so all 20 species were found. That is about all that \(\textcolor{#037bcf}{S}\) can tell us, however, and we do not see the distribution differences here.
Advanced
There are more complicated ways to calculate \(\textcolor{#037bcf}{S}\) that attempt to use more rigorous statistics. The most popular of these are Chao1 and ACE2, and are both implemented in the vegan
package in the estimateR()
function (try ?estimateR
for more information). These calculate \(\textcolor{#037bcf}{S}\) as well as the standard error (denoted with “se”). S.obs
is the “simple” richness calculated as we did above.
= mock_community %>%
V column_to_rownames("ASV") %>%
t() %>%
::estimateR() vegan
Sample | S.obs | S.chao1 | se.chao1 | S.ACE | se.ACE |
---|---|---|---|---|---|
Sample_1 | 19 | 19.5 | 1.287887 | 20.35391 | 2.103822 |
Sample_2 | 13 | 14.0 | 2.283481 | 16.66667 | 1.178569 |
Sample_3 | 10 | 10.0 | 0.000000 | 10.00000 | 1.449138 |
Sample_4 | 20 | 20.0 | 0.000000 | NaN | NaN |
Sample_5 | 20 | 20.0 | 0.000000 | NaN | NaN |
Sample_6 | 20 | 20.0 | 0.000000 | NaN | NaN |
For Samples 1-3 both methods did a fairly good job. We will look a little more closely at this method (as well as calculate it “by hand”) below.
ACE doesn’t deal well when there are no singletons/doubletons (or even ASVs found less than 10 times3), and that is why it failed to return a value for Samples 4-6.
Look into ChaoSpecies
in the SpadeR
package for more methods. Our sample dataset it too tiny to get meaningful results from these powerful tools, and they work better on real data. There is a Chao2 metric which uses presence/absence, but it doesn’t seem to pop-up much anywhere.
Chao1
Of all the values obtained in this advanced section, only the Chao1 is reasonably simple to do by hand. In the equation below, F1 and F2 are the counts of singletons and doubletons, respectively.
\[\textcolor{#037bcf}{S_{Chao1} = S_{Obs} + {f_1(f_1 - 1)\over2(f_2 + 1)}}\]
= mock_community %>%
C pivot_longer(cols = c(everything(), -ASV), names_to = "SAMPLE", values_to = "ABUNDANCE") %>%
mutate(S = case_when(
> 0 ~ 1)) %>%
ABUNDANCE mutate(F1 = case_when(
== 1 ~ 1)) %>%
ABUNDANCE mutate(F2 = case_when(
== 2 ~ 1)) %>%
ABUNDANCE group_by(SAMPLE) %>%
summarise(S = sum(S, na.rm = T),
F1 = sum(F1, na.rm = T),
F2= sum(F2, na.rm = T)) %>%
mutate(CHAO1 = S + ((F1 * (F1 - 1)) / (2 * (F2 + 1))))
Sample | S | F1 | F2 | CHAO1 |
---|---|---|---|---|
Sample_1 | 19 | 2 | 1 | 19.5 |
Sample_2 | 13 | 2 | 0 | 14.0 |
Sample_3 | 10 | 0 | 3 | 10.0 |
Sample_4 | 20 | 0 | 0 | 20.0 |
Sample_5 | 20 | 0 | 0 | 20.0 |
Sample_6 | 20 | 0 | 0 | 20.0 |
Looking at the equation above, it is obvious why singletons are so important to these calculations. If there are no singletons, then the fraction equates to \(0\) and all you get is \(\textcolor{#037bcf}{S_{Chao1} = S_{Obs}}\).
Additionally, we see an overestimation in Sample_2 because of the high singleton but low doubleton count. This makes
\[\textcolor{#037bcf}{S_{Chao1} = 14 + {4(4 - 1)\over2(0 + 1)}}\]
which simplifies to
\[\textcolor{#037bcf}{S_{Chao1} = 14 + \frac{12}{2} = 20}\]
and is almost a 50% overestimation!
Further Reading
- Species accumulation curves