The Jaccard Index is one of the most common tools in Information Retrieval, used to measure the similarity between two sets, and is most often defined at the level of individual bits:
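J(A, B) = |A ∩ B| / |A ∪ B|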
In code, one would rarely deal with individual boolean values due to their obvious space inefficiency, and would generally operate on octets of bits packed into 8-bit unsigned integers, like this:
#include <bit>     // brings `std::popcount`
#include <cstddef> // brings `std::size_t`
#include <cstdint> // brings `std::uint8_t`

float jaccard(std::uint8_t const* a_vector, std::uint8_t const* b_vector, std::size_t count_octets) {
    std::size_t intersection = 0, union_ = 0;
    for (std::size_t i = 0; i < count_octets; ++i) {
        std::uint8_t a_octet = a_vector[i], b_octet = b_vector[i];
        intersection += std::popcount(a_octet & b_octet);
        union_ += std::popcount(a_octet | b_octet);
    }
    return (float)intersection / (float)union_;
}
That's, however, horribly inefficient! Given how often bit-level representations are now used in large-scale vector search deployments with USearch, this repository provides benchmarks and custom kernels exploring the performance impact of the following optimizations:
- Using lookup tables to speed up population counts.
- Using Harley-Seal and Odd-Majority CSAs for population counts.
- Loop unrolling and inlining.
- 🔜 Precomputed Jaccard JUTs.
- 🔜 Floating-point operations instead of integer ones.
The most interesting part of the repository is the `kernels.py` file, with all kinds of weird kernels worth reading through.
Native optimizations are implemented with the Numba and cppyy JIT compilers for Python and C++, and are packed into uv-compatible scripts.
To benchmark and test the kernels on your hardware, run the following command:
$ uv run --script kernels.py --count 10000 --ndims "1024,1536"
$ uv run --script kernels.py --help # for more options
For other benchmarks, real data can be used. Luckily, modern embedding models are often trained in a quantization-aware manner, and precomputed Wikipedia embeddings are available on the Hugging Face portal.
Most of those vectors are 1024- to 1536-dimensional, with datasets ranging from around 10M vectors up to 300M for the multilingual collections.
The following script will pull the English Wikipedia embeddings, save them under `cohere/en*`, and quantize them to 1-bit vectors:
$ uv run --script download.py --dataset cohere-en --quantize
$ uv run --script download.py --help # for more options
Afterwards, you can use `indexing.py` to check the search accuracy over binary data vs. the original `float32` data.
That script is easy to modify to force USearch to use a `CompiledMetric` wrapping one of the kernels from `kernels.py`:
$ uv run --script indexing.py --data "cohere/" --limit 2e6 --dtype float16
$ uv run --script indexing.py --help # for more options
On a 16-core Intel Sapphire Rapids CPU, the baseline kernels may show the following performance:
| Representation | Insert Ops/s | Search Ops/s | Recall@1 |
| --- | --- | --- | --- |
| Half-precision | 10'744 | 19'219 | 99.37 % |
| Single-bit | 41'980 | 70'497 | 98.07 % |
Knowing the length of the embeddings is very handy for optimizations.
If the embeddings are only 1024 bits long, we only need 2 `ZMM` CPU registers on x86 machines to store the entire vector.
We don't need any `for`-loops; the entire operation can be unrolled and inlined.
#include <stdint.h>    // brings `uint8_t`, `uint64_t`
#include <immintrin.h> // brings the AVX-512 intrinsics

uint64_t hamming_distance_1024d(uint8_t const* a_vector, uint8_t const* b_vector) {
    __m512i a_start = _mm512_loadu_si512((__m512i const*)(a_vector));
    __m512i a_end = _mm512_loadu_si512((__m512i const*)(a_vector + 64));
    __m512i b_start = _mm512_loadu_si512((__m512i const*)(b_vector));
    __m512i b_end = _mm512_loadu_si512((__m512i const*)(b_vector + 64));
    __m512i differences_start = _mm512_xor_epi64(a_start, b_start);
    __m512i differences_end = _mm512_xor_epi64(a_end, b_end);
    __m512i population_start = _mm512_popcnt_epi64(differences_start);
    __m512i population_end = _mm512_popcnt_epi64(differences_end);
    __m512i population = _mm512_add_epi64(population_start, population_end);
    return _mm512_reduce_add_epi64(population);
}
As shown in `less_slow.cpp`, decomposing `for`-loops (which are equivalent to `if`-statements and jumps) into unrolled kernels is a universally great idea.
But the problem with the kernel above is that `_mm512_popcnt_epi64` is an expensive instruction, and most Intel CPUs can only execute it on a single CPU port.
There are several ways to implement population counts on SIMD-capable x86 CPUs, mostly relying on the following instructions:
- VPOPCNTQ (ZMM, ZMM):
- On Ice Lake: 3 cycles latency and executes only on port 5.
- On Zen4: 2 cycles and executes on both port 0 and 1.
- VPSHUFB (ZMM, ZMM, ZMM):
- On Skylake-X: 1 cycle latency and executes only on port 5.
- On Ice Lake: 1 cycle latency and executes only on port 5.
- On Zen4: 2 cycles and executes on both port 1 and 2.
- VPSADBW (ZMM, ZMM, ZMM):
- On Ice Lake: 3 cycles latency and executes only on port 5.
- On Zen4: 3 cycles and executes on both port 0 and 1.
- VPDPBUSDS (ZMM, ZMM, ZMM):
- On Ice Lake: 5 cycles latency and executes only on port 0.
- On Zen4: 4 cycles and executes on both port 0 and 1.
Interestingly, the EVEX variants of the `VPSHUFB` and `VPDPBUSDS` instructions take different ports when dealing with `YMM` inputs on Ice Lake:
- VPSHUFB_EVEX (YMM, YMM, YMM):
- On Skylake-X: 1 cycle latency and executes only on port 5.
- On Ice Lake: 1 cycle latency and executes on port 1 and 5.
- On Zen4: 2 cycles and executes on both port 1 and 2.
- VPDPBUSDS_EVEX (YMM, YMM, YMM):
- On Ice Lake: 5 cycles latency and executes on both port 0 and 1.
- On Zen4: 4 cycles and executes on both port 0 and 1.
So, when implementing the Jaccard distance, the most important kernel for binary similarity indices, using `VPOPCNTQ` means 4 such instruction invocations that will all stall on the same port 5:
float jaccard_distance_1024d(uint8_t const* a_vector, uint8_t const* b_vector) {
    __m512i a_start = _mm512_loadu_si512((__m512i const*)(a_vector));
    __m512i a_end = _mm512_loadu_si512((__m512i const*)(a_vector + 64));
    __m512i b_start = _mm512_loadu_si512((__m512i const*)(b_vector));
    __m512i b_end = _mm512_loadu_si512((__m512i const*)(b_vector + 64));
    __m512i intersection_start = _mm512_and_epi64(a_start, b_start);
    __m512i intersection_end = _mm512_and_epi64(a_end, b_end);
    __m512i union_start = _mm512_or_epi64(a_start, b_start);
    __m512i union_end = _mm512_or_epi64(a_end, b_end);
    __m512i population_intersection_start = _mm512_popcnt_epi64(intersection_start);
    __m512i population_intersection_end = _mm512_popcnt_epi64(intersection_end);
    __m512i population_union_start = _mm512_popcnt_epi64(union_start);
    __m512i population_union_end = _mm512_popcnt_epi64(union_end);
    __m512i population_intersection = _mm512_add_epi64(population_intersection_start, population_intersection_end);
    __m512i population_union = _mm512_add_epi64(population_union_start, population_union_end);
    return 1.f - _mm512_reduce_add_epi64(population_intersection) * 1.f / _mm512_reduce_add_epi64(population_union);
}
That's a known bottleneck, and it can be improved.
A common trick is to implement population counts using lookup tables.
The idea may seem costly compared to `_mm512_popcnt_epi64`, but its merit depends on the number of population counts that need to be performed.
Here's what a minimal kernel for 256-dimensional Jaccard distance may look like:
float jaccard_distance_256d(uint8_t const *a_vector, uint8_t const *b_vector) {
    __m256i a = _mm256_loadu_epi8((__m256i const*)(a_vector));
    __m256i b = _mm256_loadu_epi8((__m256i const*)(b_vector));
    __m256i intersection = _mm256_and_epi64(a, b);
    __m256i union_ = _mm256_or_epi64(a, b);
    __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i lookup = _mm256_set_epi8(
        4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0,
        4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
    __m256i intersection_low = _mm256_and_si256(intersection, low_mask);
    __m256i intersection_high = _mm256_and_si256(_mm256_srli_epi16(intersection, 4), low_mask);
    __m256i union_low = _mm256_and_si256(union_, low_mask);
    __m256i union_high = _mm256_and_si256(_mm256_srli_epi16(union_, 4), low_mask);
    __m256i intersection_popcount = _mm256_add_epi8(
        _mm256_shuffle_epi8(lookup, intersection_low),
        _mm256_shuffle_epi8(lookup, intersection_high));
    __m256i union_popcount = _mm256_add_epi8(
        _mm256_shuffle_epi8(lookup, union_low),
        _mm256_shuffle_epi8(lookup, union_high));
    __m256i intersection_counts = _mm256_sad_epu8(intersection_popcount, _mm256_setzero_si256());
    __m256i union_counts = _mm256_sad_epu8(union_popcount, _mm256_setzero_si256());
    return 1.f - _mm256_reduce_add_epi64(intersection_counts) * 1.f / _mm256_reduce_add_epi64(union_counts);
}
The `_mm256_reduce_add_epi64` is a product of our imagination, but you can guess what it does:
uint64_t _mm256_reduce_add_epi64(__m256i x) {
    __m128i lo128 = _mm256_castsi256_si128(x);
    __m128i hi128 = _mm256_extracti128_si256(x, 1);
    __m128i sum128 = _mm_add_epi64(lo128, hi128);
    __m128i hi64 = _mm_unpackhi_epi64(sum128, sum128);
    __m128i total = _mm_add_epi64(sum128, hi64);
    return uint64_t(_mm_cvtsi128_si64(total));
}
Harley-Seal and Odd-Majority CSA
Let's take a look at a few 192-dimensional bit-vectors for simplicity. Each will be represented with 3x 64-bit unsigned integers.
std::uint64_t a[3] = {0x0000000000000000, 0x0000000000000000, 0x0000000000000000};
std::uint64_t b[3] = {0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF};
std::uint64_t c[3] = {0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0x0000000000000000};
A naive Jaccard distance implementation in C++20, using the STL's `std::popcount`, would look like this:
#include <bit>

float jaccard_distance_192d(std::uint64_t const* a, std::uint64_t const* b) {
    int intersection = std::popcount(a[0] & b[0]) + std::popcount(a[1] & b[1]) + std::popcount(a[2] & b[2]);
    int union_ = std::popcount(a[0] | b[0]) + std::popcount(a[1] | b[1]) + std::popcount(a[2] | b[2]);
    return 1.f - intersection * 1.f / union_;
}
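As a quick sanity check on the example vectors above: `b` and `c` share 64 set bits out of 192 in their union, while `a` and `b` share none. A hypothetical test, reusing `a`, `b`, `c`, and `jaccard_distance_192d` as defined above, might look like this:

#include <cassert>
#include <cmath>

void test_jaccard_192d() {
    // b vs c: 64-bit intersection, 192-bit union -> distance = 1 - 64/192 ≈ 0.667
    assert(std::fabs(jaccard_distance_192d(b, c) - (1.f - 64.f / 192.f)) < 1e-6f);
    // a vs b: empty intersection, full union -> maximal distance of 1
    assert(std::fabs(jaccard_distance_192d(a, b) - 1.f) < 1e-6f);
}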
That's 6x `std::popcount` calls, and we can easily reduce them to 4x by using the following rule for both the `intersection` and the `union_`:
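std::popcount(x) + std::popcount(y) + std::popcount(z) ==
    std::popcount(x ^ y ^ z) + 2 * std::popcount((x & y) | ((x ^ y) & z))

The first term counts the bits set an odd number of times, and the second counts the bitwise majority of the three words, i.e. the carry of a full adder.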
That rule is an example of Carry-Save Adders (CSAs) and can be used for longer sequences as well.
So:

- 7x `std::popcount` can be reduced to 3x.
- 15x `std::popcount` can be reduced to 4x.
- 31x `std::popcount` can be reduced to 5x.
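Applied to the 192-dimensional example above, a sketch of the reduced kernel, a hypothetical `jaccard_distance_192d_csa` using just 4 population counts, could look like this:

#include <bit>     // brings `std::popcount`
#include <cstdint> // brings `std::uint64_t`

float jaccard_distance_192d_csa(std::uint64_t const* a, std::uint64_t const* b) {
    // One carry-save adder step folds 3 intersection words into 2 popcounts
    std::uint64_t i0 = a[0] & b[0], i1 = a[1] & b[1], i2 = a[2] & b[2];
    std::uint64_t i_odd = i0 ^ i1 ^ i2;                   // bits set an odd number of times
    std::uint64_t i_major = (i0 & i1) | ((i0 ^ i1) & i2); // bits set at least twice
    int intersection = std::popcount(i_odd) + 2 * std::popcount(i_major);
    // ... and the same for the union
    std::uint64_t u0 = a[0] | b[0], u1 = a[1] | b[1], u2 = a[2] | b[2];
    std::uint64_t u_odd = u0 ^ u1 ^ u2;
    std::uint64_t u_major = (u0 & u1) | ((u0 ^ u1) & u2);
    int union_ = std::popcount(u_odd) + 2 * std::popcount(u_major);
    return 1.f - intersection * 1.f / union_;
}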
When dealing with 1024-dimensional bit-vectors on 64-bit machines, we can view them as 16x 64-bit words, leveraging the "Odd-Majority" trick to fold 15x `std::popcount` calls into 4x, and then spending 1x more call on the tail, thus shrinking from 16x to 5x calls at the cost of several `XOR` and `AND` operations.
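One way to see where the 4x comes from: a carry-save adder tree folds the 15 words into four bit-sliced counters, here called `ones`, `twos`, `fours`, and `eights` (the names are illustrative), after which the total is just a weighted sum of 4 population counts:

total = std::popcount(ones) + 2 * std::popcount(twos) + 4 * std::popcount(fours) + 8 * std::popcount(eights)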
Sadly, even on Intel Sapphire Rapids, none of the CSA schemes result in gains compared to `VPOPCNTQ`:
Profiling `jaccard_b1536_vpopcntq` in USearch over 1,000 vectors
- BOP/S: 326.08 G
- Recall@1: 100.00%
Profiling `jaccard_b1536_vpopcntq_3csa` in USearch over 1,000 vectors
- BOP/S: 310.28 G
- Recall@1: 100.00%