Jaccard Similarity and Distance

By Hand
Similarity/Distance coefficient on binary data
Published: November 29, 2021

Jaccard is a similarity coefficient for the pairwise comparison of two groups based on the presence/absence of their members (binary data). Like other similarity coefficients, it ranges from 0 to 1, with 1 meaning the two groups have identical membership and 0 meaning they share no members. Jaccard Similarity has the shorthand notation \(\textcolor{#037bcf}{S_{7}}\) in the Gower & Legendre nomenclature [1].

The Jaccard similarity can also be converted into a Jaccard distance by subtracting from 1. This distance will also range from 0 to 1, but with 1 stating the two groups are entirely different and have no members in common.

The interpretation is the share of members found in both samples versus the share found in only one sample, counting only members present in at least one of the two. So, a Jaccard Similarity of 0.3 means 30% of the members were found in both samples, and 70% were found in only one of the two samples. How that 70% is split between the two samples cannot be determined; in theory it could come entirely from one of the samples (see the examples below on how sample depth can skew Jaccard interpretations).
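As a rough illustration of this interpretation, the sketch below computes Jaccard from two sets of member names in base R. The function names `jaccard_similarity` and `jaccard_distance` and the letter-named members are made up for this example; they are not part of the dataset used later.

```r
# Jaccard similarity: shared members / members found in at least one group
jaccard_similarity <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Jaccard distance as 1 - similarity
jaccard_distance <- function(x, y) {
  1 - jaccard_similarity(x, y)
}

# Case 1: the unshared members are split between the two groups
g1 <- letters[1:6]   # a-f
g2 <- letters[4:10]  # d-j
s_split <- jaccard_similarity(g1, g2)   # 3 shared out of 10 members

# Case 2: all unshared members sit in the larger group
g3 <- letters[1:10]  # a-j
g4 <- letters[1:3]   # a-c
s_nested <- jaccard_similarity(g3, g4)  # 3 shared out of 10 members

c(split = s_split, nested = s_nested)
```

Both cases give a similarity of 0.3, so the value alone does not say which group holds the unshared members.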

Setting Up the Data

For the examples below, we will consider two samples (A and B) with the abundance counts of 6 members (M1 - M6).

A = c("M1" = 3, "M2" = 0, "M3" = 1, "M4" = 7, "M5" = 0, "M6" = 1)
B = c("M1" = 0, "M2" = 0, "M3" = 3, "M4" = 5, "M5" = 2, "M6" = 0)

# dataframe with abundances
df.abundances = data.frame(A = A, B = B)

# dataframe with presence/absence
df.binary = df.abundances
df.binary[df.binary > 0] = 1
Abundance (blue) and Binary (red) Dataframes
Member   Abundance A   Abundance B   Binary A   Binary B
M1       3             0             1          0
M2       0             0             0          0
M3       1             3             1          1
M4       7             5             1          1
M5       0             2             0          1
M6       1             0             1          0

Calculating Jaccard “by Hand”

Scoring

First, we need to “score” each row of our dataframe according to its patterns in the A and B samples. We will use a J followed by its presence (1) or absence (0) in A and B. So, if it is found in both A and B it is J11, if in A only it is J10, B only it is J01 and not in any it will be J00.

Binary

We will modify the df.binary dataframe to include this score in a new J column.

# using presence/absence data
library(dplyr) # provides %>%, mutate(), case_when(), group_by() and tally()
df.binary = df.binary %>%
    mutate(J = case_when(
        A == 1 & B == 1 ~ "J11",
        A == 1 & B == 0 ~ "J10",
        A == 0 & B == 1 ~ "J01",
        TRUE ~ "J00" # this TRUE statement captures anything not grouped above
    ))
Binary dataframe with the score (red)
Member   A   B   J
M1       1   0   J10
M2       0   0   J00
M3       1   1   J11
M4       1   1   J11
M5       0   1   J01
M6       1   0   J10

Abundances

We could use the abundances instead of the binary dataframe by modifying the == 1 to > 0. The resulting dataframe will be the same.

# using abundances instead of presence/absence. Code not run. 
df.abundances %>%
    mutate(J = case_when(
        A > 0 & B > 0 ~ "J11",
        A > 0 & B == 0 ~ "J10",
        A == 0 & B > 0 ~ "J01",
        TRUE ~ "J00"
    ))
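For readers without dplyr, the same scoring can be sketched in base R. This is only an alternative illustration under the same J-label convention; the variable name `score` is introduced here and is not used elsewhere in the post.

```r
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Build the label from presence (1) / absence (0) in A, then in B
score <- paste0("J", as.integer(A > 0), as.integer(B > 0))
names(score) <- names(A)
score

# Count how many members fall into each score
table(score)
```

Because the comparison is `> 0`, this works directly on the abundance counts. Note that `table()` only reports scores that actually occur in the data.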

Tallying the Scores

Second, we need to group and tally up the scores, and plug the counts into the equation below.

J_counts = df.binary %>%
    group_by(J) %>%
    tally()

# make a named vector 
J = J_counts$n
names(J) = J_counts$J
J
## J00 J01 J10 J11 
##   1   1   2   2

\[\textcolor{#037bcf}{S_{7} = {J_{11}\over(J_{10} + J_{01} + J_{11})}}\]

Which, after plugging into the equation gives

\[\textcolor{#037bcf}{S_{7} = {2\over(2 + 1 + 2)} = 0.4}\]

The Jaccard similarity of 0.4 is stating that 40% of the members are found in both Samples (M3 and M4), and 60% are found in only one sample (M1 and M6 in A, M5 in B).
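The arithmetic can be double-checked with a few lines of base R. This is a minimal sketch that counts the scores directly from the abundance vectors; the names `J11`, `J10`, `J01` and `S7` simply mirror the notation in the equation.

```r
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

J11 <- sum(A > 0 & B > 0)    # found in both samples
J10 <- sum(A > 0 & B == 0)   # found in A only
J01 <- sum(A == 0 & B > 0)   # found in B only

S7 <- J11 / (J10 + J01 + J11)
S7

# Which members drive each part of the result
shared   <- names(A)[A > 0 & B > 0]
unshared <- names(A)[xor(A > 0, B > 0)]
shared
unshared
```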

Note that double absences do not count towards similarity [2], so the \(\textcolor{#037bcf}{J_{00}}\) counts are not used anywhere. Although we have 6 members in the dataframe, the percentages are based on 5 members because M2 was dropped for having a score of \(\textcolor{#037bcf}{J_{00}}\).
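A quick way to convince yourself of this is to pad both samples with extra members that are absent from both and recompute. The sketch below does that in base R; `jaccard_sim` is a helper defined here for illustration, and the 10 padded members are hypothetical.

```r
# Jaccard similarity from two abundance (or binary) vectors
jaccard_sim <- function(a, b) {
  both   <- sum(a > 0 & b > 0)   # J11
  either <- sum(a > 0 | b > 0)   # J10 + J01 + J11
  both / either
}

A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Append 10 hypothetical members that are missing from both samples
A_padded <- c(A, rep(0, 10))
B_padded <- c(B, rep(0, 10))

original <- jaccard_sim(A, B)
padded   <- jaccard_sim(A_padded, B_padded)
c(original = original, padded = padded)
```

The two values are identical, because members scored J00 never enter the numerator or the denominator.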

Distance (D7)

Jaccard Distance is \(\textcolor{#037bcf}{D_{7} = 1 - S_{7}}\). This is a metric distance.

Important

To get the Jaccard Distance, some R packages use 1 - S, others use 1/S, and yet others use sqrt(1 - S).
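To see how much the choice of conversion matters, the sketch below applies all three to the similarity of 0.4 from the example. It is not tied to any particular package; the vector name `conversions` and its element names are made up here.

```r
S <- 0.4  # Jaccard similarity from the worked example

conversions <- c(
  one_minus_S      = 1 - S,
  inverse_S        = 1 / S,
  sqrt_one_minus_S = sqrt(1 - S)
)
round(conversions, 4)
```

The same similarity becomes 0.6, 2.5 or roughly 0.7746 depending on the conversion, and 1/S is not bounded between 0 and 1, so values from different tools are not directly comparable.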

Scoring

The scoring is the same as above for Jaccard Similarity.

Tallying the Scores

Again, same as above, except this time the equation has a different numerator. This time, instead of using the members found in both samples, we use the members found in only one sample.

\[\textcolor{#037bcf}{D_{7} = {J_{10} + J_{01}\over(J_{10} + J_{01} + J_{11})}}\]

Which gives

\[\textcolor{#037bcf}{D_{7} = {2 + 1\over(2 + 1 + 2)} = 0.6}\]

And, of course, the interpretation is now the opposite of that for the similarity. A D7 of 0.6 states that 60% of the members are found in only one sample, while 40% are shared between the two.
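The distance can be checked the same way as the similarity. This base R sketch recomputes the counts from the abundance vectors and confirms that the two routes to D7 agree; the variable names follow the notation in the equations.

```r
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

J11 <- sum(A > 0 & B > 0)    # found in both samples
J10 <- sum(A > 0 & B == 0)   # found in A only
J01 <- sum(A == 0 & B > 0)   # found in B only

S7 <- J11 / (J10 + J01 + J11)
D7 <- (J10 + J01) / (J10 + J01 + J11)
D7

# D7 computed from the counts matches 1 - S7 (up to floating point error)
all.equal(D7, 1 - S7)
```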

Jaccard and the vegan R package

As we determined earlier, the Jaccard Distance for the above dataframe should be 0.6, but what happens if we put it into vegan?

Things work a little differently when running jaccard inside of vegdist compared to other methods. First, it calculates the Jaccard Dissimilarity (D7) rather than a similarity (S7). Second, if you give vegan an abundance matrix and run jaccard, it will instead calculate an extended Jaccard [3] value that is tied to Bray-Curtis [4].

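library(vegan)
# df.binary also carries the character J column added above, so keep only A and B
as.numeric(vegdist(t(df.binary[, c("A", "B")]), method = "jaccard"))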
## [1] 0.6
as.numeric(vegdist(t(df.abundances), method = "jaccard"))
## [1] 0.625

We can see exactly how this extended Jaccard value was made by instead calculating the Bray-Curtis dissimilarity (BC) and then running it through the equation 2 * BC / (1 + BC).

# again, use only the numeric A and B columns of df.binary
bc.binary = as.numeric(vegdist(t(df.binary[, c("A", "B")]), method = "bray"))
bc.binary
## [1] 0.4285714
(bc.binary * 2) / (1 + bc.binary) # same value as vegdist above
## [1] 0.6

bc.abundances = as.numeric(vegdist(t(df.abundances), method = "bray"))
bc.abundances
## [1] 0.4545455
(bc.abundances * 2) / (1 + bc.abundances) # same value as vegdist above
## [1] 0.625
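The same numbers can be reproduced without vegan. The sketch below assumes the usual definition of Bray-Curtis dissimilarity, the sum of absolute differences divided by the sum of all counts, and then applies the 2 * BC / (1 + BC) conversion. The function names `bray_curtis` and `extended_jaccard` are defined here for illustration and are not vegan functions.

```r
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Extended Jaccard derived from Bray-Curtis
extended_jaccard <- function(x, y) {
  bc <- bray_curtis(x, y)
  2 * bc / (1 + bc)
}

bc_abund <- bray_curtis(A, B)         # 10 / 22
ej_abund <- extended_jaccard(A, B)    # abundance data
ej_binary <- extended_jaccard(as.integer(A > 0), as.integer(B > 0))  # presence/absence data

c(bray = bc_abund, jaccard_abundance = ej_abund, jaccard_binary = ej_binary)
```

On presence/absence input the result agrees with the by-hand D7 of 0.6, while on abundances it gives 0.625, matching the vegdist output above.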

Likely, most microbial ecologists do not realize the Jaccard values they are given are an extended Jaccard, and some may not even realize they are working with Dissimilarities rather than similarities.

The vegan folks don’t seem to think this is an issue, though [5], and appear to actively resist adding informative warnings.

Correct Jaccard in Vegan

So, what are you supposed to do? The fix in the vegan documentation is to make sure you say binary=T.

# use only the numeric A and B columns of df.binary (it also holds the J column)
as.numeric(vegdist(t(df.binary[, c("A", "B")]), method = "jaccard", binary = T))
## [1] 0.6
as.numeric(vegdist(t(df.abundances), method = "jaccard", binary = T))
## [1] 0.6

Here, we appear to get the correct dissimilarity values when using either the binary or abundance version of our dataframes.
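One way to see why converting to presence/absence recovers the by-hand value is to write the extended Jaccard in its min/max form, 1 - sum(min) / sum(max), which is algebraically equivalent to 2 * BC / (1 + BC). On 0/1 data, the sum of minima counts the shared members (J11) and the sum of maxima counts the members found in at least one sample (J10 + J01 + J11). The sketch below is a by-hand illustration, not vegan's source code; `quantitative_jaccard` and `to_binary` are hypothetical helper names.

```r
A <- c(M1 = 3, M2 = 0, M3 = 1, M4 = 7, M5 = 0, M6 = 1)
B <- c(M1 = 0, M2 = 0, M3 = 3, M4 = 5, M5 = 2, M6 = 0)

# Extended (quantitative) Jaccard dissimilarity in min/max form
quantitative_jaccard <- function(x, y) {
  1 - sum(pmin(x, y)) / sum(pmax(x, y))
}

# Convert counts to presence (1) / absence (0)
to_binary <- function(x) as.integer(x > 0)

d_abund  <- quantitative_jaccard(A, B)                         # 1 - 6/16
d_binary <- quantitative_jaccard(to_binary(A), to_binary(B))   # 1 - 2/5

c(abundance = d_abund, binary = d_binary)
```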

Example

Let’s use the Mock Communities that we previously defined. Briefly, this dataset has Samples 1-3 with the same underlying population but sampled to different depths, and Samples 4-6 that have the same depth but a change in evenness from perfectly even (Sample 4) to fairly uneven (Sample 6).

We will calculate Jaccard using vegdist() and binary=T. Remember, this function returns the distance, not the similarity.

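# needs tibble, tidyr, dplyr, forcats and ggplot2 (all attached by the tidyverse)
library(tidyverse)
mock_community %>%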
  column_to_rownames("ASV") %>% 
  as.matrix() %>%
  t() %>%
  vegan::vegdist(method = "jaccard", binary = T) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("A") %>%
  pivot_longer(cols = c(everything(), -A), names_to = "B", values_to = "Jaccard") %>%
  mutate(B = fct_relevel(B, rev(unique(as.character(.$B))))) %>% 
  ggplot(aes(x = A, y = B, fill = Jaccard)) +
    geom_tile(color = "gray30") +
    scale_fill_viridis_c() +
    labs(fill = "Jaccard Distance", x = "1st Sample", y = "2nd Sample", title = "Jaccard Distance (binary = T)") +
    theme(legend.position = "right")

The results are surprising at first, and show why it is best to be cautious when using these sorts of comparisons, and why it is good to have an intuitive understanding of what they are actually saying. Let’s look at each group of communities individually.

Samples 1-3 (Differing Sampling Depth)

As a reminder, these three Samples had very different sample sizes.

Size of Samples 1-3
Sample     Sample size
Sample_1   10000
Sample_2   1000
Sample_3   100

Even though these 3 samples come from the same true population, there appear to be greater Jaccard Distances between them than there are within Samples 4-6. This is largely because the different depths led to very different numbers of zero counts for the species in these three samples.

Number of species with and without a count
Sample     Non-zeroes   Zeroes
Sample_1   19           1
Sample_2   13           7
Sample_3   10           10

So, it is clear that Sample_1 looks very distant from Sample_3 simply because we undersampled (or undersequenced) Sample_3. Had we had a larger sample we likely would have found some of those species that were in the true population, but at too low of an abundance to be found with shallow random sampling.
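The depth effect can be reproduced with a small simulation. The sketch below does not use the mock community from this post; it assumes a hypothetical population of 20 species with geometrically declining abundances, draws one random sample at each of three depths, and compares presence/absence. The names `p`, `depths`, `counts` and `jaccard_dist` are all made up for this example.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical true population: 20 species, each rarer than the last
n_species <- 20
p <- 0.6^(seq_len(n_species) - 1)
p <- p / sum(p)

# One random sample per depth, all drawn from the same population
depths <- c(deep = 10000, medium = 1000, shallow = 100)
counts <- sapply(depths, function(n) rmultinom(1, size = n, prob = p)[, 1])

# Jaccard distance on presence/absence
jaccard_dist <- function(x, y) {
  1 - sum(x > 0 & y > 0) / sum(x > 0 | y > 0)
}

# Number of species detected at each depth
colSums(counts > 0)

# Distance between the deepest and shallowest samples
d_deep_shallow <- jaccard_dist(counts[, "deep"], counts[, "shallow"])
d_deep_shallow
```

Any distance above 0 here comes purely from sampling, since all three samples share one underlying population.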

Samples 4-6 (Differing Diversity)

These samples all had the same sample size of 10,000, but they differed in the true population distribution of the 20 species from fairly even (Sample 4) to very uneven (Sample 6). Here, even though the communities were actually quite different, the Jaccard Distance between all three samples is 0, indicating that they are 100% similar. Of course, we know that Jaccard is only looking at presence/absence, so all we can conclude is that there are no species found in one sample that are missing from another. We cannot take this conclusion any further to say much else about the composition of the samples or the distributions of their community members; we’ll need other metrics for that.
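A small, made-up example shows the same blind spot. The two vectors below are hypothetical (they are not Samples 4-6): both contain all 20 species and 10,000 counts, but one is perfectly even and the other is dominated by a single species. Bray-Curtis, assumed here in its usual form of summed absolute differences over summed counts, is included only as a contrast.

```r
even   <- rep(500, 20)         # 20 species, 500 counts each
uneven <- c(9981, rep(1, 19))  # one dominant species, 19 singletons

# Both samples have the same depth
c(sum(even), sum(uneven))

# Jaccard distance on presence/absence
jaccard_dist <- function(x, y) {
  1 - sum(x > 0 & y > 0) / sum(x > 0 | y > 0)
}

# Bray-Curtis dissimilarity on the counts
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

d_jaccard <- jaccard_dist(even, uneven)
d_bray    <- bray_curtis(even, uneven)
c(jaccard = d_jaccard, bray_curtis = d_bray)
```

Jaccard reports a distance of 0 because every species is present in both vectors, while the abundance-based measure (0.9481) picks up the difference in evenness.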

Summary

In a real-world situation, the true population is unknown, so we cannot check our results for consistency with our expectations. In both sets of communities, a reliance on Jaccard alone would give us an incorrect view of our samples. Samples 1-3 looked quite different although they all had the same true population, and Samples 4-6 looked very similar even though their population structures are quite different.

Further Reading

  • Bray-Curtis
  • Dice Coefficient