Description
This is a clever way to move decompression and compression into a separate process, and a very helpful little tool. The same pattern would work well for a couple of additional compression tools that my group uses. I'm going to take a look at adding support for these formats, which should be pretty straightforward.
BGZF, or "blocked gzip", is a format that's used pretty widely in bioinformatics. It's basically a lot of small gzip streams concatenated together, with some extra info in the headers and an index in a separate file saying where to seek. It's decompressible by normal gzip, so we actually see BGZF files as .gz more often than .bgz. It'd be really great to be able to compress BGZF files with xopen as well. The blocked gzip reference implementation is distributed with htslib as a binary called bgzip, and is available both from conda and from most Linux distros' native packages (tabix on Ubuntu, for example).
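
To illustrate why a blocked file still looks like an ordinary .gz, here is a minimal sketch using only Python's standard library. It is not real BGZF: it leaves out the extra header field, the end-of-file marker block, and the separate index. The helper name `compress_in_blocks` and the block size are made up for this example. It only shows that independently compressed gzip members, concatenated, decompress with a normal gzip reader.

```python
import gzip


def compress_in_blocks(data: bytes, block_size: int = 65280) -> bytes:
    """Compress `data` as independent gzip members and concatenate them.

    Hypothetical helper for illustration only. Real BGZF also stores the
    block length in a gzip extra field and ends with an empty marker block.
    """
    members = []
    for start in range(0, len(data), block_size):
        # Each chunk becomes a complete gzip stream with its own header.
        members.append(gzip.compress(data[start:start + block_size]))
    return b"".join(members)


payload = b"ACGT" * 50_000
blocked = compress_in_blocks(payload)

# A standard gzip reader walks through every member in turn.
restored = gzip.decompress(blocked)
```

Because each member is self-contained, a reader with an index can seek straight to one block and decompress only that block.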
Also, it'd be great to see this support zstd as well, which is just an excellent general-purpose compression tool that I expect to grow rapidly in usage over the next few years.
Edit: To be clear, both of these tools are already usable from Python: there's a bgzip implementation here, and zstd has excellent Python bindings available. But getting the compression into another process, like xopen does, makes for much better performance.
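
For reference, the general pattern I have in mind looks roughly like the sketch below. This is not xopen's actual code, and `pipe_writer` is a hypothetical name. It assumes the external tool reads raw bytes on stdin and writes compressed bytes to stdout, which I'd expect to hold for something like `["bgzip", "-c"]` or `["zstd", "-c"]` if those binaries are installed.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def pipe_writer(path, command):
    """Yield a binary file object whose bytes are compressed by `command`.

    Hypothetical helper: `command` is an argument list for a program that
    reads raw data on stdin and writes compressed data to stdout.
    """
    with open(path, "wb") as outfile:
        process = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=outfile
        )
        try:
            yield process.stdin
        finally:
            # Closing stdin signals end of input so the tool can finish.
            process.stdin.close()
            returncode = process.wait()
    if returncode != 0:
        raise OSError(f"{command[0]} exited with status {returncode}")
```

Usage would be something like `with pipe_writer("reads.fastq.gz", ["bgzip", "-c"]) as f: f.write(data)`, so the Python process only pays for writing to a pipe while the compression runs on another core.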