A = c("M1" = 3, "M2" = 0, "M3" = 1, "M4" = 7, "M5" = 0, "M6" = 1)
B = c("M1" = 0, "M2" = 0, "M3" = 3, "M4" = 5, "M5" = 2, "M6" = 0)
# dataframe with abundances
df.abundances = data.frame(A = A, B = B)
# dataframe with presence/absence
df.binary = df.abundances
df.binary[df.binary > 0] = 1
Jaccard Similarity and Distance
Jaccard is a similarity coefficient for the pairwise comparison of two groups considering the presence/absence of members (binary data). Like other similarity coefficients, it ranges from 0 to 1, with 1 stating the two groups are identical, and 0 indicating there are no shared members. Jaccard Similarity has the shorthand notation of \(\textcolor{#037bcf}{S_{7}}\) in Gower & Legendre Nomenclature1.
The Jaccard similarity can also be converted into a Jaccard distance by subtracting from 1. This distance will also range from 0 to 1, but with 1 stating the two groups are entirely different and have no members in common.
The interpretation is the percentage of members found in both samples versus the percentage found in only one sample. So, a Jaccard Similarity of 0.3 means 30% of the members were found in both samples, and 70% were found in only one of the two samples. Which of the two samples holds more of that 70% cannot be determined; in theory, all of it could be found in just one of the samples (see the examples below about how sample depth can skew Jaccard interpretations).
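As a minimal sketch of that last point, the two hypothetical pairs of samples below (toy presence/absence vectors, not the data used later in this post) give the same similarity of 0.3 even though the unshared 70% is split very differently. The helper `jaccard_similarity()` is written here for illustration and is not a function from any package.

```r
# Illustrative helper: shared members divided by members present in either sample
jaccard_similarity <- function(a, b) {
  sum(a > 0 & b > 0) / sum(a > 0 | b > 0)
}

# Pair 1: 3 shared members, and all 7 unshared members sit in x1
x1 <- rep(1, 10)
y1 <- c(rep(1, 3), rep(0, 7))

# Pair 2: 3 shared members, 4 found only in x2, and 3 found only in y2
x2 <- c(rep(1, 7), rep(0, 3))
y2 <- c(rep(1, 3), rep(0, 4), rep(1, 3))

s_pair1 <- jaccard_similarity(x1, y1)  # 3 / 10 = 0.3
s_pair2 <- jaccard_similarity(x2, y2)  # 3 / 10 = 0.3

# The matching Jaccard distances are both 1 - 0.3 = 0.7
d_pair1 <- 1 - s_pair1
d_pair2 <- 1 - s_pair2
```

Both pairs are indistinguishable by Jaccard alone, which is why the value says nothing about where the unshared members live.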
Setting Up the Data
For the examples below, we will consider two samples (A and B) with the abundance counts of 6 members (M1 - M6).
Abundance (blue) and Binary (red) Dataframes
Member | Abundance A | Abundance B | Binary A | Binary B |
---|---|---|---|---|
M1 | 3 | 0 | 1 | 0 |
M2 | 0 | 0 | 0 | 0 |
M3 | 1 | 3 | 1 | 1 |
M4 | 7 | 5 | 1 | 1 |
M5 | 0 | 2 | 0 | 1 |
M6 | 1 | 0 | 1 | 0 |
Calculating Jaccard “by Hand”
Scoring
First, we need to “score” each row of our dataframe according to its pattern in the A and B samples. We will use a J followed by its presence (1) or absence (0) in A and B. So, if a member is found in both A and B it is J11, if in A only it is J10, if in B only it is J01, and if in neither it is J00.
Binary
We will modify the df.binary dataframe to include this score in a new J column.
# using presence/absence data
library(dplyr) # provides %>%, mutate() and case_when()
df.binary = df.binary %>%
  mutate(J = case_when(
    A == 1 & B == 1 ~ "J11",
    A == 1 & B == 0 ~ "J10",
    A == 0 & B == 1 ~ "J01",
    TRUE ~ "J00" # this TRUE statement captures anything not grouped above
  ))
Binary dataframe with the score (red)
Member | A | B | J |
---|---|---|---|
M1 | 1 | 0 | J10 |
M2 | 0 | 0 | J00 |
M3 | 1 | 1 | J11 |
M4 | 1 | 1 | J11 |
M5 | 0 | 1 | J01 |
M6 | 1 | 0 | J10 |
Abundances
We could use the abundances instead of the binary dataframe by changing the == 1 to > 0. The resulting dataframe will be the same.
# using abundances instead of presence/absence. Code not run.
df.abundances %>%
  mutate(J = case_when(
    A > 0 & B > 0 ~ "J11",
    A > 0 & B == 0 ~ "J10",
    A == 0 & B > 0 ~ "J01",
    TRUE ~ "J00"
  ))
Tallying the Scores
Second, we need to group and tally up the scores, and plug them into the equation below.
J_counts = df.binary %>%
  group_by(J) %>%
  tally()
# make a named vector
J = J_counts$n
names(J) = J_counts$J
J
## J00 J01 J10 J11
##   1   1   2   2
\[\textcolor{#037bcf}{S_{7} = {J_{11}\over(J_{10} + J_{01} + J_{11})}}\]
Which, after plugging into the equation gives
\[\textcolor{#037bcf}{S_{7} = {2\over(2 + 1 + 2)} = 0.4}\]
The Jaccard similarity of 0.4 states that 40% of the members are found in both samples (M3 and M4), and 60% are found in only one sample (M1 and M6 in A, M5 in B).
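The same tally and calculation can be sketched in base R without the tidyverse. This is only an illustrative alternative; the vectors are copied from the setup, and the names `score`, `J` and `S7` are chosen here for the example.

```r
# Abundances for samples A and B (same values as the setup data)
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Build the score label for every member, e.g. "J10" = present in A, absent in B
score <- paste0("J", as.integer(A > 0), as.integer(B > 0))

# Count how many members fall into each of the four score classes
J <- sapply(c("J00", "J01", "J10", "J11"), function(s) sum(score == s))

# Plug the counts into the S7 equation; J00 is deliberately left out
S7 <- J[["J11"]] / (J[["J10"]] + J[["J01"]] + J[["J11"]])
S7
## [1] 0.4
```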
Note, double absences do not count towards similarity2, and the \(\textcolor{#037bcf}{J_{00}}\) counts are not used anywhere. Although we have 6 members in the dataframe, the percentages are based on 5 members because M2 was dropped for having a score of \(\textcolor{#037bcf}{J_{00}}\).
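A quick way to convince yourself that double absences are ignored is to pad both samples with members that are missing from both. The sketch below uses a small helper, `s7()`, written for this illustration only; the similarity stays at 0.4 no matter how many double-absent members are appended.

```r
# Abundances for samples A and B (same values as the setup data)
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Illustrative helper: J11 / (J10 + J01 + J11), written with logical tests
s7 <- function(a, b) sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

s_original <- s7(A, B)

# Append 100 extra members that are absent from both samples (all J00)
s_padded <- s7(c(A, rep(0, 100)), c(B, rep(0, 100)))

c(original = s_original, padded = s_padded)
## original   padded 
##      0.4      0.4
```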
Distance (D7)
Jaccard Distance is \(\textcolor{#037bcf}{D_{7} = 1 - S_{7}}\). This is a metric distance.
To get the Jaccard Distance, some R packages use 1-S, others use 1/S, and yet others use sqrt(1-S).
Scoring
The scoring is the same as above for Jaccard Similarity.
Tallying the Scores
Again, same as above, except this time the equation has a different numerator. This time, instead of using the members found in both samples, we use the members found in only one sample.
\[\textcolor{#037bcf}{D_{7} = {J_{10} + J_{01}\over(J_{10} + J_{01} + J_{11})}}\]
Which gives
\[\textcolor{#037bcf}{D_{7} = {2 + 1\over(2 + 1 + 2)} = 0.6}\]
And, of course, the interpretation is now the opposite of that for the similarity. A D7 of 0.6 states that 60% of the members are found in only one sample, while 40% are shared between the two.
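For readers who want to see why the new numerator is the right one, the distance formula follows directly from subtracting the similarity from 1 and putting both terms over the same denominator. This is a short derivation using only the symbols already defined above:

\[\textcolor{#037bcf}{D_{7} = 1 - S_{7} = {(J_{10} + J_{01} + J_{11}) - J_{11}\over(J_{10} + J_{01} + J_{11})} = {J_{10} + J_{01}\over(J_{10} + J_{01} + J_{11})}}\]

With the counts from our example, this is \(\textcolor{#037bcf}{1 - 0.4 = 0.6}\), matching the value computed from the tallies.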
Jaccard and the vegan R package
As we determined earlier, the Jaccard Distance for the above dataframe should be 0.6, but what happens if we put it into vegan?
Things work a little differently when running jaccard inside of vegdist compared to other methods. First, it calculates the Jaccard Dissimilarity (D7) rather than a similarity (S7). Second, if you give vegan a matrix of abundances and run jaccard, it will instead calculate an extended Jaccard3 value that is tied to Bray-Curtis4.
as.numeric(vegdist(t(df.binary), method = "jaccard"))
## [1] 0.6
as.numeric(vegdist(t(df.abundances), method = "jaccard"))
## [1] 0.625
We can see exactly how this extended Jaccard value was made by instead calculating the Bray-Curtis dissimilarity, and then running it through the equation.
bc.binary = as.numeric(vegdist(t(df.binary), method = "bray"))
bc.binary
## [1] 0.4285714
(bc.binary * 2) / (1 + bc.binary) # same value as vegdist above
## [1] 0.6
bc.abundances = as.numeric(vegdist(t(df.abundances), method = "bray"))
bc.abundances
## [1] 0.4545455
(bc.abundances * 2) / (1 + bc.abundances) # same value as vegdist above
## [1] 0.625
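To make the link between the two measures explicit without relying on vegan, here is a base R sketch. The helpers `bray_curtis()` and `quantitative_jaccard()` are names invented for this example, not vegan functions; they assume the usual definitions, Bray-Curtis as summed absolute differences over summed totals, and the quantitative Jaccard as one minus the ratio of summed minima to summed maxima.

```r
# Abundances for samples A and B (same values as the setup data)
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Bray-Curtis dissimilarity written out by hand
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Quantitative Jaccard written directly, without going through Bray-Curtis
quantitative_jaccard <- function(a, b) 1 - sum(pmin(a, b)) / sum(pmax(a, b))

bc <- bray_curtis(A, B)               # 10 / 22
via_bray <- (2 * bc) / (1 + bc)       # the conversion used in the text
direct <- quantitative_jaccard(A, B)  # 1 - 6 / 16

c(bray = bc, via_bray = via_bray, direct = direct)
##      bray  via_bray    direct 
## 0.4545455 0.6250000 0.6250000
```

Both routes land on 0.625, which suggests the extended value is a legitimate abundance-weighted generalisation; it is just not the presence/absence Jaccard described above.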
Likely, most microbial ecologists do not realize the Jaccard values they are given are an extended Jaccard, and some may not even realize they are working with Dissimilarities rather than similarities.
The vegan folks don’t seem to think this is an issue though5, and appear to actively fight against giving some informative warnings.
Correct Jaccard in Vegan
So, what are you supposed to do? The fix in the vegan documentation is to make sure you say binary=T.
as.numeric(vegdist(t(df.binary), method = "jaccard", binary = T))
## [1] 0.6
as.numeric(vegdist(t(df.abundances), method = "jaccard", binary = T))
## [1] 0.6
Here, we appear to get the correct dissimilarity values when using either the binary or abundance version of our dataframes.
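Conceptually, setting the binary argument amounts to converting the counts to presence/absence before the distance is computed. The base R sketch below mimics that idea with two illustrative helpers, `to_presence_absence()` and `jaccard_distance()`; it is an approximation of the behaviour for intuition, not vegan's actual implementation.

```r
# Abundances for samples A and B (same values as the setup data)
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Turn any positive count into 1 and leave zeroes as 0
to_presence_absence <- function(x) as.numeric(x > 0)

# Min/max form of the Jaccard distance; works on counts or on 0/1 data
jaccard_distance <- function(a, b) 1 - sum(pmin(a, b)) / sum(pmax(a, b))

d_abundance <- jaccard_distance(A, B)
d_binary <- jaccard_distance(to_presence_absence(A), to_presence_absence(B))

c(abundance = d_abundance, binary = d_binary)
## abundance    binary 
##     0.625     0.600
```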
Example
Let’s use the Mock Communities that we previously defined. Briefly, this dataset has Samples 1-3 with the same underlying population but sampled to different depths, and Samples 4-6 that have the same depth but a change in evenness from perfectly even (Sample 4) to fairly uneven (Sample 6).
We will calculate Jaccard using vegdist() and binary=T. Remember, this function returns the distance, not the similarity.
mock_community %>%
  column_to_rownames("ASV") %>%
  as.matrix() %>%
  t() %>%
  vegan::vegdist(method = "jaccard", binary = T) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("A") %>%
  pivot_longer(cols = c(everything(), -A), names_to = "B", values_to = "Jaccard") %>%
  mutate(B = fct_relevel(B, rev(unique(as.character(.$B))))) %>%
  ggplot(aes(x = A, y = B, fill = Jaccard)) +
  geom_tile(color = "gray30") +
  scale_fill_viridis_c() +
  labs(fill = "Jaccard Distance", x = "1st Sample", y = "2nd Sample", title = "Jaccard Distance (binary = T)") +
  theme(legend.position = "right")
The results are surprising at first, and show why it is best to be cautious when using these sorts of comparisons, and why it is good to have an intuitive understanding of what they are actually saying. Let’s look at each group of communities individually.
Samples 1-3 (Differing Sampling Depth)
As a reminder, these three Samples had very different sample sizes.
Size of Samples 1-3
Sample | Sample size |
---|---|
Sample_1 | 10000 |
Sample_2 | 1000 |
Sample_3 | 100 |
Even though these 3 samples come from the same true population, the Jaccard Distances between them appear greater than those within Samples 4-6. This happens because the different depths led to very different zero counts for the species in these three samples.
Number of species with and without a count
Sample | Non Zeroes | Zeroes |
---|---|---|
Sample_1 | 19 | 1 |
Sample_2 | 13 | 7 |
Sample_3 | 10 | 10 |
So, it is clear that Sample_1 looks very distant from Sample_3 simply because we undersampled (or undersequenced) Sample_3. Had we had a larger sample we likely would have found some of those species that were in the true population, but at too low of an abundance to be found with shallow random sampling.
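The depth effect can be reproduced with a small simulation. The sketch below does not use the Mock Communities; it assumes a made-up population of 20 species with steeply declining relative abundances, draws a deep and a shallow sample from it, and compares them. All names and values here are hypothetical choices for the illustration.

```r
set.seed(42)  # fixed seed so the draw is repeatable

# Hypothetical true population: 20 species, each half as common as the last
n_species <- 20
prob <- 2^-(1:n_species)
prob <- prob / sum(prob)

# Two samples of the SAME population, drawn at different depths
deep <- as.vector(rmultinom(1, size = 10000, prob = prob))
shallow <- as.vector(rmultinom(1, size = 100, prob = prob))

# Presence/absence Jaccard distance
jaccard_distance <- function(a, b) 1 - sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

# Species detected at each depth, and the resulting distance
detected <- c(deep = sum(deep > 0), shallow = sum(shallow > 0))
d_depth <- jaccard_distance(deep, shallow)

detected
d_depth
```

Because rare species are more likely to be missed in the shallow draw, the distance is typically well above 0 even though both samples share one underlying population.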
Samples 4-6 (Differing Diversity)
These samples all had the same sample size of 10,000, but they differed in the true population distribution of the 20 species from fairly even (Sample 4) to very uneven (Sample 6). Here, even though the communities were actually quite different, the Jaccard Distance between all three samples is 0, indicating that they are 100% similar. Of course, we know that Jaccard is only looking at presence/absence, so all we can conclude is that there are no species found in one sample that are missing from another. We cannot take this conclusion any further to say much else about the composition of the samples or the distributions of their community members. We will need other metrics for that.
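A toy version of this situation is sketched below with two invented communities of 20 species and 10,000 counts each (not the actual Mock Communities): one perfectly even and one dominated by a single species. Every species is present in both, so the presence/absence distance is 0, while an abundance-aware measure sees a large difference. The helper names are illustrative only.

```r
# Two hypothetical communities with the same 20 species and the same depth
even <- rep(500, 20)            # 20 * 500 = 10000 counts, perfectly even
uneven <- c(9050, rep(50, 19))  # 9050 + 19 * 50 = 10000 counts, one dominant species

# Presence/absence Jaccard distance
jaccard_binary <- function(a, b) 1 - sum(a > 0 & b > 0) / sum(a > 0 | b > 0)

# Abundance-weighted (min/max) form, for contrast
jaccard_weighted <- function(a, b) 1 - sum(pmin(a, b)) / sum(pmax(a, b))

d_binary <- jaccard_binary(even, uneven)      # 0: no species is missing from either
d_weighted <- jaccard_weighted(even, uneven)  # far from 0: the structure differs

c(binary = d_binary, weighted = d_weighted)
```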
Summary
In a real-world situation, the true population is unknown, so we cannot check our results for consistency with our expectations. In both sets of communities, a reliance on Jaccard alone would give us incorrect views of our samples. The first set (Samples 1-3) looked quite different although they all had the same true population, and the second set (Samples 4-6) looked very similar even though the population structure is quite different.
Further Reading
- Bray-Curtis
- Dice Coefficient
Footnotes
Metric and Euclidean properties of dissimilarity coefficients↩︎
For example, should jungles and deserts be considered similar because they both lack polar bears?↩︎
https://stats.stackexchange.com/questions/242110/nmds-from-jaccard-and-bray-curtis-identical-is-that-a-bad-thing↩︎