Kafka: Expand default_msg_processor into a miniature decoding unit #306

Draft: wants to merge 1 commit into main

Conversation

@amotl (Contributor) commented on Jul 13, 2025

Pitch

Loading data from Kafka into a database destination works well, but there are no options to properly decode and break out the Kafka event value, in order to relay only that payload into the target database, without any metadata.

Observation

For example, Kafka Connect provides configuration options for similar use cases. A fragment:

"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",

Solution

This patch slightly expands the existing default_msg_processor implementation, accepting a few more options which resolve our problem.

Details

  • Accept a set of decoding options via KafkaDecodingOptions
  • Provide a set of output formatting options via KafkaEvent
  • Tie both elements together using KafkaEventProcessor

The machinery is effectively the same as before, but provides a few more options: type decoding for the Kafka event's key/value slots (key_type and value_type), a selection mechanism to limit the output to specific fields only (include), a small projection mechanism to optionally drill down into a specific field (select), and an option to select the output format (format).
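
To make the moving parts concrete, here is a minimal sketch of how these pieces might fit together. The class names (KafkaDecodingOptions, KafkaEventProcessor) come from this patch, but all fields, defaults, and method signatures below are illustrative assumptions, not the actual implementation:

import json
from dataclasses import dataclass, field

@dataclass
class KafkaDecodingOptions:
    # Illustrative fields mirroring the URI options described above.
    key_type: str = "raw"                         # decoding for the event key
    value_type: str = "raw"                       # decoding for the event value
    include: list = field(default_factory=list)   # limit output to these fields
    select: str = ""                              # drill down into a single field
    format: str = "standard_v1"                   # output formatting variant

class KafkaEventProcessor:
    # Ties the decoding options to the output formatting (sketch only).
    def __init__(self, options: KafkaDecodingOptions):
        self.options = options

    def process(self, key, value):
        record = {
            "key": json.loads(key) if self.options.key_type == "json" else key,
            "value": json.loads(value) if self.options.value_type == "json" else value,
        }
        if self.options.include:
            record = {k: v for k, v in record.items() if k in self.options.include}
        if self.options.select:
            record = record[self.options.select]
        return record

# With value_type=json and select=value, only the decoded payload remains:
processor = KafkaEventProcessor(KafkaDecodingOptions(value_type="json", select="value"))
print(processor.process(None, '{"sensor_id": 1, "reading": 42.42}'))
# => {'sensor_id': 1, 'reading': 42.42}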

In combination, those decoding options allow users to relay JSON-encoded Kafka event values directly into a destination table, without any metadata wrappings. Currently, the output formatter provides three different variants out of the box: standard_v1, standard_v2, and flexible.¹ More variants can be added in the future, as other users or use cases may have different requirements in the same area.
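
Purely to illustrate the difference between the variants, not the exact schema: the row shapes below are hypothetical, only the _kafka_msg_id vs. _kafka__msg_id rename is documented (see footnote 1), and the flexible variant is assumed here to honor include/select.

# standard_v1: single-underscore metadata prefix (hypothetical row shape)
{"_kafka_msg_id": "...", "value": {"sensor_id": 1, "reading": 42.42}}

# standard_v2: double underscore separates the source prefix (see #289)
{"_kafka__msg_id": "...", "value": {"sensor_id": 1, "reading": 42.42}}

# flexible, with select=value: the payload itself becomes the row
{"sensor_id": 1, "reading": 42.42}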

Most importantly, the decoding unit is very compact, so relevant tasks don't need a corresponding transformation unit down the pipeline, keeping the whole ensemble lean, in the very spirit of ingestr.

Preview

uv pip install --upgrade 'ingestr @ git+https://github.com/crate-workbench/ingestr.git@kafka-decoder'

Example

docker run --rm --name=kafka \
  --publish=9092:9092 docker.io/apache/kafka:latest
echo '{"sensor_id":1,"ts":"2025-06-01 10:00","reading":42.42}' | \
  kcat -P -b localhost -t demo
echo '{"sensor_id":2,"ts":"2025-06-01 11:00","reading":451.00}' | \
  kcat -P -b localhost -t demo
ingestr ingest --yes \
  --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test&value_type=json&select=value" \
  --source-table "demo" \
  --dest-uri "duckdb:///kafka.duckdb" \
  --dest-table "demo.kafka"
duckdb kafka.duckdb 'SELECT * FROM demo.kafka WHERE sensor_id>1;'
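
Assuming the JSON payload maps straight onto the table columns, the final query should return only the second event, along the lines of:

sensor_id | ts               | reading
----------+------------------+---------
        2 | 2025-06-01 11:00 |   451.0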

Backlog

  • Add software tests for non-standard decoding and output formatting options
  • Add docs and improve inline comments

Footnotes

  1. The standard_v2 output format is intended to resolve #289 (Naming things: Rename _kafka_msg_id to _kafka__msg_id).
