Skip to content

How to Stream OTLP to Parquet

import PostLogRecord from ’../../../components/docs/PostLogRecord.astro’; import StopCleanly from ’../../../components/docs/StopCleanly.astro’;

Use the duckdb-otlp Docker image in parquet mode to stream OTLP/HTTP exports into partitioned Parquet files.

Use this when you want ordinary DuckDB-readable files instead of a lakehouse catalog. The Parquet dataset is the only durable store — the daemon keeps no local copy of the data. Each seal writes the buffered rows to the configured Parquet root:

<export-root>/<table>/year=YYYY/month=MM/day=DD/*.parquet

PARQUET_EXPORT_PATH can be a local path such as /data/otlp-parquet or a DuckDB-writable URI such as s3://bucket/prefix.

Live ingestion uses OTLP/HTTP on port 4318. WASM builds do not include the ingest server.

Create parquet.env:

DUCKDB_MODE=parquet
DUCKDB_OTLP_TOKEN=dev-token-123456
DUCKDB_SCHEMA=otlp
PARQUET_EXPORT_PATH=/data/otlp-parquet
DUCKDB_QUACK_ENABLED=1
DUCKDB_QUACK_ADDR=0.0.0.0:9494
DUCKDB_QUACK_TOKEN=dev-quack-token-123456

If you omit PARQUET_EXPORT_PATH, the daemon writes to /data/parquet.

For S3, set PARQUET_EXPORT_PATH to an s3:// URI and provide AWS credentials through DuckDB’s credential chain:

DUCKDB_MODE=parquet
DUCKDB_OTLP_TOKEN=dev-token-123456
DUCKDB_SCHEMA=otlp
PARQUET_EXPORT_PATH=s3://duckdb-otlp-telemetry/otlp
AWS_REGION=us-west-2
AWS_PROFILE=cli-dev
DUCKDB_QUACK_ENABLED=1
DUCKDB_QUACK_ADDR=0.0.0.0:9494
DUCKDB_QUACK_TOKEN=dev-quack-token-123456

You can also use S3_BUCKET and S3_PREFIX instead of PARQUET_EXPORT_PATH; those resolve to s3://$S3_BUCKET/$S3_PREFIX.

For local Parquet output:

Terminal window
mkdir -p data
docker run --rm --name duckdb-otlp \
--env-file parquet.env \
-p 4318:4318 \
-p 9494:9494 \
-v "$(pwd)/data:/data" \
ghcr.io/smithclay/duckdb-otlp:latest

For S3 output, mount your AWS config read-only:

Terminal window
docker run --rm --name duckdb-otlp \
--env-file parquet.env \
-p 4318:4318 \
-p 9494:9494 \
-v "$(pwd)/data:/data" \
-v "$HOME/.aws:/root/.aws:ro" \
ghcr.io/smithclay/duckdb-otlp:latest

Each seal writes only to Parquet. For convenient inspection over Quack, the daemon lazily creates a read-only view over the exported files for each signal — otlp.otlp_logs, otlp.otlp_traces, otlp.otlp_metrics_gauge, otlp.otlp_metrics_sum, otlp.otlp_metrics_histogram, otlp.otlp_metrics_exp_histogram. These views read the Parquet dataset directly (so they reflect exactly what is durable, including any at-least-once duplicates) and appear after the first seal for that signal. They store no data. Logs and metrics partition by time_unix_nano; traces partition by start_time_unix_nano.

Flush through Quack from a local DuckDB process:

Terminal window
duckdb -unsigned -c "
INSTALL quack;
LOAD quack;
FROM quack_query(
'quack:localhost:9494',
'SELECT * FROM otlp_flush(''otlp:0.0.0.0:4318'')',
token = 'dev-quack-token-123456'
);
"

The example log record uses 2024-01-01, so local output lands under this path:

Terminal window
find data/otlp-parquet/otlp_logs/year=2024/month=01/day=01 -name '*.parquet'

For S3 output:

Terminal window
aws s3 ls \
"s3://duckdb-otlp-telemetry/otlp/otlp_logs/year=2024/month=01/day=01/" \
--profile cli-dev \
--region us-west-2 \
--recursive

To query the exported rows through the inspection view inside the running daemon (this reads the Parquet files, not a local copy):

Terminal window
duckdb -unsigned -c "
INSTALL quack;
LOAD quack;
FROM quack_query(
'quack:localhost:9494',
\$\$
SELECT time_unix_nano, service_name, severity_text, body
FROM otlp.otlp_logs
WHERE service_name = 'parquet-demo'
ORDER BY time_unix_nano DESC
LIMIT 5
\$\$,
token = 'dev-quack-token-123456'
);
"

To query local Parquet files directly:

SELECT service_name, severity_text, body
FROM read_parquet(
'data/otlp-parquet/otlp_logs/**/*.parquet',
hive_partitioning = true
)
WHERE year = '2024' AND month = '01' AND day = '01';

For S3 Parquet, create an S3 secret first:

INSTALL httpfs;
LOAD httpfs;
CREATE OR REPLACE SECRET plain_s3_secret (
TYPE s3,
PROVIDER credential_chain,
CHAIN config,
PROFILE 'cli-dev',
REGION 'us-west-2'
);
SELECT service_name, severity_text, body
FROM read_parquet(
's3://duckdb-otlp-telemetry/otlp/otlp_logs/**/*.parquet',
hive_partitioning = true
)
WHERE year = '2024' AND month = '01' AND day = '01';

For local output:

Terminal window
rm -rf data

For S3 output:

Terminal window
aws s3 rm "s3://duckdb-otlp-telemetry/otlp/" \
--profile cli-dev \
--region us-west-2 \
--recursive
  • If S3 startup cannot find credentials, confirm AWS_PROFILE in parquet.env matches a profile in the mounted $HOME/.aws directory.
  • If S3 writes fail, confirm the profile can write objects under PARQUET_EXPORT_PATH.
  • If no files appear after a 202 response, run otlp_flush before listing the output path.
  • Plain Parquet object writes are not catalog transactions (see the at-least-once caution above). A signal that already exported is never re-written, but if a single seal’s COPY fails part-way through and the server retries, that signal’s rows can be duplicated under the same partition. Deduplicate downstream, or use a catalog mode for exactly-once.