The Jaccard Index is one of the most common tools in Information Retrieval, used to measure the similarity between two sets, and is most often defined at the level of individual bits:
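J(A, B) = |A ∩ B| / |A ∪ B|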
In code, one would rarely deal with individual boolean values due to their obvious space inefficiency, and would generally operate on octets of bits packed into 8-bit unsigned integers, like this:
#include <bit>     // brings `std::popcount`
#include <cstddef> // brings `std::size_t`
#include <cstdint> // brings `std::uint8_t`

float jaccard(std::uint8_t const* a_vector, std::uint8_t const* b_vector, std::size_t count_octets) {
    std::size_t intersection = 0, union_ = 0;
    for (std::size_t i = 0; i < count_octets; ++i) {
        std::uint8_t a_octet = a_vector[i], b_octet = b_vector[i];
        intersection += std::popcount(a_octet & b_octet);
        union_ += std::popcount(a_octet | b_octet);
    }
    return (float)intersection / (float)union_;
}
That's, however, horribly inefficient! Given how often bit-level representations are now used in large-scale vector search deployments with USearch, this repository provides benchmarks and custom kernels exploring the performance impact of the following optimizations:
- Using lookup tables to speed up population counts.
- Using Harley-Seal and Odd-Majority CSAs for population counts.
- Loop unrolling and inlining.
- 🔜 Precomputed Jaccard JUTs.
- 🔜 Floating-point operations instead of integer ones.
The most interesting part of the repository is the `kernels.py` file, with all kinds of weird kernels worth reading through.
Native optimizations are implemented with the Numba and cppyy JIT compilers for Python and C++, and are packed into uv-compatible scripts.
To benchmark and test the kernels on your hardware, run the following command:
$ uv run --script kernels.py --count 10000 --ndims "1024,1536"
$ uv run --script kernels.py --help # for more options
For other benchmarks, real data can be used. Luckily, modern embedding models are often trained in a quantization-aware manner, and precomputed Wikipedia embeddings are available on the Hugging Face portal.
Most of those vectors are 1024- to 1536-dimensional, with datasets ranging from around 10M vectors up to 300M for the multilingual collections.
The following script will pull the English Wikipedia embeddings, save them under `cohere/en*`, and quantize them to 1-bit vectors:
$ uv run --script download.py --dataset cohere-en --quantize
$ uv run --script download.py --help # for more options
Afterwards, you can use `indexing.py` to check the search accuracy over binary data vs. the original `float32` data.
That script is easy to modify to force USearch to use a `CompiledMetric` wrapping one of the kernels from `kernels.py`:
$ uv run --script indexing.py --data "cohere/" --limit 2e6 --dtype float16
$ uv run --script indexing.py --help # for more options
On a 16-core Intel Sapphire Rapids CPU, the baseline kernels may show the following performance:
| Representation | Insert Ops/s | Search Ops/s | Recall@1 |
| --- | --- | --- | --- |
| Half-precision | 10'744 | 19'219 | 99.37 % |
| Single-bit | 41'980 | 70'497 | 98.07 % |
Knowing the length of the embeddings is very handy for optimizations.
If the embeddings are only 1024 bits long, we only need 2 `ZMM` CPU registers on x86 machines to store the entire vector.
We don't need any `for`-loops; the entire operation can be unrolled and inlined.
#include <stdint.h>    // brings `uint8_t`, `uint64_t`
#include <immintrin.h> // brings the AVX-512 intrinsics

uint64_t hamming_distance_1024d(uint8_t const* a_vector, uint8_t const* b_vector) {
    __m512i a_start = _mm512_loadu_si512((__m512i const*)(a_vector));
    __m512i a_end = _mm512_loadu_si512((__m512i const*)(a_vector + 64));
    __m512i b_start = _mm512_loadu_si512((__m512i const*)(b_vector));
    __m512i b_end = _mm512_loadu_si512((__m512i const*)(b_vector + 64));
    __m512i differences_start = _mm512_xor_epi64(a_start, b_start);
    __m512i differences_end = _mm512_xor_epi64(a_end, b_end);
    __m512i population_start = _mm512_popcnt_epi64(differences_start);
    __m512i population_end = _mm512_popcnt_epi64(differences_end);
    __m512i population = _mm512_add_epi64(population_start, population_end);
    return _mm512_reduce_add_epi64(population);
}
As shown in `less_slow.cpp`, decomposing `for`-loops (which are equivalent to `if`-statements and jumps) into unrolled kernels is a universally great idea.
But the problem with the kernel above is that `_mm512_popcnt_epi64` is an expensive instruction, and most Intel CPUs can only execute it on a single CPU port.
There are several ways to implement population counts on SIMD-capable x86 CPUs, mostly relying on the following instructions:
- VPOPCNTQ (ZMM, ZMM):
- On Ice Lake: 3 cycles latency and executes only on port 5.
- On Zen4: 2 cycles and executes on both port 0 and 1.
- VPSHUFB (ZMM, ZMM, ZMM):
- On Skylake-X: 1 cycle latency and executes only on port 5.
- On Ice Lake: 1 cycle latency and executes only on port 5.
- On Zen4: 2 cycles and executes on both port 1 and 2.
- VPSADBW (ZMM, ZMM, ZMM):
- On Ice Lake: 3 cycles latency and executes only on port 5.
- On Zen4: 3 cycles and executes on both port 0 and 1.
- VPDPBUSDS (ZMM, ZMM, ZMM):
- On Ice Lake: 5 cycles latency and executes only on port 0.
- On Zen4: 4 cycles and executes on both port 0 and 1.
Interestingly, the EVEX variants of the `VPSHUFB` and `VPDPBUSDS` instructions take different ports when dealing with `YMM` inputs on Ice Lake:
- VPSHUFB_EVEX (YMM, YMM, YMM):
- On Skylake-X: 1 cycle latency and executes only on port 5.
- On Ice Lake: 1 cycle latency and executes on port 1 and 5.
- On Zen4: 2 cycles and executes on both port 1 and 2.
- VPDPBUSDS_EVEX (YMM, YMM, YMM):
- On Ice Lake: 5 cycles latency and executes on both port 0 and 1.
- On Zen4: 4 cycles and executes on both port 0 and 1.
So, when implementing the Jaccard distance, the most important kernel for binary similarity indices, using `VPOPCNTQ` means 4 such instruction invocations that will all stall on the same port 5:
float jaccard_distance_1024d(uint8_t const* a_vector, uint8_t const* b_vector) {
    __m512i a_start = _mm512_loadu_si512((__m512i const*)(a_vector));
    __m512i a_end = _mm512_loadu_si512((__m512i const*)(a_vector + 64));
    __m512i b_start = _mm512_loadu_si512((__m512i const*)(b_vector));
    __m512i b_end = _mm512_loadu_si512((__m512i const*)(b_vector + 64));
    __m512i intersection_start = _mm512_and_epi64(a_start, b_start);
    __m512i intersection_end = _mm512_and_epi64(a_end, b_end);
    __m512i union_start = _mm512_or_epi64(a_start, b_start);
    __m512i union_end = _mm512_or_epi64(a_end, b_end);
    __m512i population_intersection_start = _mm512_popcnt_epi64(intersection_start);
    __m512i population_intersection_end = _mm512_popcnt_epi64(intersection_end);
    __m512i population_union_start = _mm512_popcnt_epi64(union_start);
    __m512i population_union_end = _mm512_popcnt_epi64(union_end);
    __m512i population_intersection = _mm512_add_epi64(population_intersection_start, population_intersection_end);
    __m512i population_union = _mm512_add_epi64(population_union_start, population_union_end);
    return 1.f - _mm512_reduce_add_epi64(population_intersection) * 1.f / _mm512_reduce_add_epi64(population_union);
}
That's a known bottleneck, and it can be improved.
A common trick is to implement population counts using lookup tables.
The idea may seem costly compared to `_mm512_popcnt_epi64`, but its merit depends on the number of population counts that need to be performed.
Here's what a minimal kernel for 256-dimensional Jaccard distance may look like:
float jaccard_distance_256d(uint8_t const *a_vector, uint8_t const *b_vector) {
    __m256i a = _mm256_loadu_epi8((__m256i const*)(a_vector));
    __m256i b = _mm256_loadu_epi8((__m256i const*)(b_vector));
    __m256i intersection = _mm256_and_epi64(a, b);
    __m256i union_ = _mm256_or_epi64(a, b);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lookup = _mm256_set_epi8(
        4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0,
        4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
    __m256i intersection_low = _mm256_and_si256(intersection, low_mask);
    __m256i intersection_high = _mm256_and_si256(_mm256_srli_epi16(intersection, 4), low_mask);
    __m256i union_low = _mm256_and_si256(union_, low_mask);
    __m256i union_high = _mm256_and_si256(_mm256_srli_epi16(union_, 4), low_mask);
    __m256i intersection_popcount = _mm256_add_epi8(
        _mm256_shuffle_epi8(lookup, intersection_low),
        _mm256_shuffle_epi8(lookup, intersection_high));
    __m256i union_popcount = _mm256_add_epi8(
        _mm256_shuffle_epi8(lookup, union_low),
        _mm256_shuffle_epi8(lookup, union_high));
    __m256i intersection_counts = _mm256_sad_epu8(intersection_popcount, _mm256_setzero_si256());
    __m256i union_counts = _mm256_sad_epu8(union_popcount, _mm256_setzero_si256());
    return 1.f - _mm256_reduce_add_epi64(intersection_counts) * 1.f / _mm256_reduce_add_epi64(union_counts);
}
The `_mm256_reduce_add_epi64` is a product of our imagination, but you can guess what it does:
uint64_t _mm256_reduce_add_epi64(__m256i x) {
    __m128i lo128 = _mm256_castsi256_si128(x);
    __m128i hi128 = _mm256_extracti128_si256(x, 1);
    __m128i sum128 = _mm_add_epi64(lo128, hi128);
    __m128i hi64 = _mm_unpackhi_epi64(sum128, sum128);
    __m128i total = _mm_add_epi64(sum128, hi64);
    return uint64_t(_mm_cvtsi128_si64(total));
}
Harley-Seal and Odd-Majority CSA
Let's take a look at a few 192-dimensional bit-vectors for simplicity. Each will be represented with 3x 64-bit unsigned integers.
std::uint64_t a[3] = {0x0000000000000000, 0x0000000000000000, 0x0000000000000000};
std::uint64_t b[3] = {0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF};
std::uint64_t c[3] = {0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0x0000000000000000};
A naive Jaccard distance implementation in C++20, using the STL's `std::popcount`, would look like this:
#include <bit>

float jaccard_distance_192d(std::uint64_t const* a, std::uint64_t const* b) {
    int intersection = std::popcount(a[0] & b[0]) + std::popcount(a[1] & b[1]) + std::popcount(a[2] & b[2]);
    int union_ = std::popcount(a[0] | b[0]) + std::popcount(a[1] | b[1]) + std::popcount(a[2] | b[2]);
    return 1.f - intersection * 1.f / union_;
}
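As a quick sanity check on the example vectors above: `b` and `c` share 64 set bits out of 192 in their union, while `a` and `b` share none. A hypothetical test, reusing `a`, `b`, `c`, and `jaccard_distance_192d` as defined above, might look like this:

#include <cassert>
#include <cmath>

void test_jaccard_192d() {
    // b vs c: 64-bit intersection, 192-bit union -> distance = 1 - 64/192 ≈ 0.667
    assert(std::fabs(jaccard_distance_192d(b, c) - (1.f - 64.f / 192.f)) < 1e-6f);
    // a vs b: empty intersection, full union -> maximal distance of 1
    assert(std::fabs(jaccard_distance_192d(a, b) - 1.f) < 1e-6f);
}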
That's 6x `std::popcount` calls, and we can easily reduce them to 4x by using the following rule for both the `intersection` and the `union_`:
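std::popcount(x) + std::popcount(y) + std::popcount(z) ==
    std::popcount(x ^ y ^ z) + 2 * std::popcount((x & y) | ((x ^ y) & z))

The first term counts the bits set an odd number of times, and the second counts the bitwise majority of the three words, i.e. the carry of a full adder.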
That rule is an example of Carry-Save Adders (CSAs) and can be used for longer sequences as well.
So:

- 7x `std::popcount` can be reduced to 3x.
- 15x `std::popcount` can be reduced to 4x.
- 31x `std::popcount` can be reduced to 5x.
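Applied to the 192-dimensional example above, a sketch of the reduced kernel, a hypothetical `jaccard_distance_192d_csa` using just 4 population counts, could look like this:

#include <bit>     // brings `std::popcount`
#include <cstdint> // brings `std::uint64_t`

float jaccard_distance_192d_csa(std::uint64_t const* a, std::uint64_t const* b) {
    // One carry-save adder step folds 3 intersection words into 2 popcounts
    std::uint64_t i0 = a[0] & b[0], i1 = a[1] & b[1], i2 = a[2] & b[2];
    std::uint64_t i_odd = i0 ^ i1 ^ i2;                   // bits set an odd number of times
    std::uint64_t i_major = (i0 & i1) | ((i0 ^ i1) & i2); // bits set at least twice
    int intersection = std::popcount(i_odd) + 2 * std::popcount(i_major);
    // ... and the same for the union
    std::uint64_t u0 = a[0] | b[0], u1 = a[1] | b[1], u2 = a[2] | b[2];
    std::uint64_t u_odd = u0 ^ u1 ^ u2;
    std::uint64_t u_major = (u0 & u1) | ((u0 ^ u1) & u2);
    int union_ = std::popcount(u_odd) + 2 * std::popcount(u_major);
    return 1.f - intersection * 1.f / union_;
}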
When dealing with 1024-dimensional bit-vectors on 64-bit machines, we can view them as 16x 64-bit words, leveraging the "Odd-Majority" trick to fold 15x `std::popcount` calls into 4x, and then spending 1x more call on the tail, thus shrinking from 16x to 5x calls at the cost of several `XOR` and `AND` operations.
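One way to see where the 4x comes from: a carry-save adder tree folds the 15 words into four bit-sliced counters, here called `ones`, `twos`, `fours`, and `eights` (the names are illustrative), after which the total is just a weighted sum of 4 population counts:

total = std::popcount(ones) + 2 * std::popcount(twos) + 4 * std::popcount(fours) + 8 * std::popcount(eights)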
Sadly, even on Intel Sapphire Rapids, none of the CSA schemes result in gains compared to `VPOPCNTQ`:
Profiling `jaccard_b1536_vpopcntq` in USearch over 1,000 vectors
- BOP/S: 326.08 G
- Recall@1: 100.00%
Profiling `jaccard_b1536_vpopcntq_3csa` in USearch over 1,000 vectors
- BOP/S: 310.28 G
- Recall@1: 100.00%