Self-hosting LLM observability: what we actually measured

2026-06-13 · self-hosting, benchmark, observability

What we found

Instrumentation overhead is a non-issue: every SDK we tested added under 0.05% to a typical LLM call (though there's a ~7x spread between them). The real difference is operational weight. Arize Phoenix came up as a single container (1.35 GB image, ~400 MB idle RAM, OTLP ingest verified end-to-end); Langfuse is a full platform that booted cleanly but runs as a 6-service stack costing ~2.1 GB idle RAM. Match the footprint to the need. (tested @ 2026-06, WSL2 + Docker)

Most "best self-hosted observability" lists never actually run the tools. We did. Two questions matter when you self-host: how much does the SDK slow my app down, and how much does the backend cost me to operate. Here is what we measured.

1. Instrumentation overhead (client-side)

We timed 50,000 spans per SDK with in-memory exporters (no network), against an uninstrumented baseline. Per-span overhead, and what it is as a fraction of a typical 500 ms LLM call:

SDK	overhead / span	% of a 500 ms call
OpenTelemetry (raw)	~34 µs	0.007%
Traceloop / OpenLLMetry	~37 µs	0.007%
Langfuse SDK	~243 µs	0.049%
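
For readers who want to reproduce the shape of this measurement, here is a minimal sketch of the timing method: run the same small workload with and without a span wrapper, then divide the difference by the iteration count. It is not our harness. `fake_span`, `work` and the other names are hypothetical stand-ins; swap `fake_span` for the span context manager of the SDK you are testing, configured with an in-memory exporter.

```python
import time
from contextlib import contextmanager


@contextmanager
def fake_span(name):
    # Hypothetical stand-in for an SDK span; replace with the real SDK call.
    record = {"name": name, "start_ns": time.perf_counter_ns()}
    yield record
    record["end_ns"] = time.perf_counter_ns()


def work():
    # Tiny fixed workload standing in for application code.
    return sum(range(10))


def instrumented():
    # Same workload, wrapped in one span.
    with fake_span("llm.call"):
        work()


def per_call_ns(fn, iterations):
    # Average wall-clock cost of one call to fn, in nanoseconds.
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def span_overhead_us(iterations=50_000):
    # Per-span overhead = instrumented cost minus uninstrumented baseline.
    baseline = per_call_ns(work, iterations)
    traced = per_call_ns(instrumented, iterations)
    return (traced - baseline) / 1_000
```

Calling `span_overhead_us()` returns microseconds per span. Results from a stand-in like this say nothing about any real SDK; the point is only the baseline-subtraction method.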

Takeaway: stop worrying about instrumentation overhead for LLM workloads. Even at the slowest SDK's ~243 µs per span, a 500 ms model call is about 2,000x larger. The ~7x spread between SDKs (a richer observation model costs more per span) only matters on very high-span, non-LLM hot paths.
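
The percentage column, the 2,000x figure and the ~7x spread all follow from the per-span numbers in the table. A small sketch of the arithmetic, using those measured overheads and the 500 ms call length assumed throughout (the function names are ours, for illustration):

```python
CALL_MS = 500.0  # the "typical LLM call" length assumed in the article

# Per-span overheads from the table above, in microseconds.
OVERHEAD_US = {
    "OpenTelemetry (raw)": 34.0,
    "Traceloop / OpenLLMetry": 37.0,
    "Langfuse SDK": 243.0,
}


def percent_of_call(overhead_us, call_ms=CALL_MS):
    # Overhead as a percentage of one model call.
    return overhead_us / (call_ms * 1_000) * 100


def dominance_factor(overhead_us, call_ms=CALL_MS):
    # How many times larger the model call is than the span overhead.
    return (call_ms * 1_000) / overhead_us


spread = max(OVERHEAD_US.values()) / min(OVERHEAD_US.values())

for name, us in OVERHEAD_US.items():
    print(f"{name}: {percent_of_call(us):.3f}% of call, "
          f"call is {dominance_factor(us):,.0f}x larger")
print(f"spread between SDKs: {spread:.1f}x")
```

If your calls are shorter than 500 ms, or you emit many spans per call, change `CALL_MS` or multiply the overhead by the span count before drawing the same conclusion.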

2. Operational weight (self-host footprint)

We booted each backend from its official Docker setup and measured the footprint: how many containers it runs, the on-disk image size, idle RAM once healthy, time to a ready endpoint, and whether it ingests OpenTelemetry (OTLP) out of the box. (tested @ 2026-06, WSL2 + Docker)

Tool	Containers	Image size	Idle RAM	Time to ready	OTLP ingest
Arize Phoenix	1	1.35 GB	~400 MB	19 s	yes (verified end-to-end)
Langfuse	6	1.16 GB	~2.1 GB (web container alone ~1.1 GB)	~10 s (images cached)	n/a (own SDK, not OTLP-native)
SigNoz	multi-service	—	—	not measured	not measured in our env*
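
Idle RAM for a multi-container stack is the sum across its containers. One way to collect it, sketched under assumptions, is to parse the output of `docker stats --no-stream`. This is an illustration, not necessarily the exact script behind the table. The helper names are hypothetical, and Docker reports binary units (MiB, GiB), so totals will differ slightly from decimal MB/GB.

```python
import re
import subprocess

# Conversion factors from Docker's binary units to MiB.
_TO_MIB = {
    "B": 1 / (1024 * 1024),
    "KiB": 1 / 1024,
    "MiB": 1.0,
    "GiB": 1024.0,
    "TiB": 1024.0 * 1024,
}


def mem_usage_mib(mem_usage):
    # Takes a MemUsage field such as "412MiB / 15.5GiB" and returns
    # the used part (left of the slash) in MiB.
    used = mem_usage.split("/")[0].strip()
    match = re.fullmatch(r"([\d.]+)\s*(B|KiB|MiB|GiB|TiB)", used)
    if match is None:
        raise ValueError(f"unrecognised memory value: {used!r}")
    value, unit = match.groups()
    return float(value) * _TO_MIB[unit]


def total_idle_mib(stats_lines):
    # Each line is "<container name>\t<MemUsage>".
    total = 0.0
    for line in stats_lines:
        _name, mem = line.split("\t", 1)
        total += mem_usage_mib(mem)
    return total


def read_docker_stats():
    # One snapshot of all running containers; requires the docker CLI.
    out = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]
```

`total_idle_mib(read_docker_stats())` sums every running container, so stop unrelated containers first or filter the lines by name. Take the snapshot only after the stack reports healthy and has sat idle for a while.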

Arize Phoenix is the lightweight end: one drop-in container, ~400 MB idle RAM, ready in under 20 seconds, and it accepted standard OTLP traces immediately. If you want self-hosted tracing running in five minutes, this is the shape to look for.
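
"Time to ready" here means the time from starting the stack until an endpoint answers. A minimal sketch of that kind of poll loop follows, using only the standard library. The function names, the polling interval and the choice of "any 2xx response counts as ready" are our assumptions for illustration; a real check should hit whatever health or UI endpoint your backend documents.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    # True if the URL answers with a 2xx status; False on any error.
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def time_to_ready(probe, deadline_s=120.0, interval_s=0.5):
    # Calls probe() until it returns True. Returns elapsed seconds,
    # or None if the deadline passes first.
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

Start the timer immediately after launching the containers, for example `time_to_ready(lambda: http_ok(url))`, where `url` is the address your backend exposes on the host. Note that this includes image pull time on a cold machine, which is why cached and uncached runs are not comparable.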

Langfuse booted cleanly in this test: with images cached, the stack came up in about 10 seconds. But it is a full platform, not a one-container drop-in. It runs a 6-service stack (web, worker, Postgres, ClickHouse, cache, object store) costing ~2.1 GB idle RAM, with the web container alone at ~1.1 GB. It also ingests via its own SDK rather than OTLP natively. One host-port note worth knowing up front: its bundled MinIO object store defaults to :9090, which collides with Prometheus on a lot of dev boxes, so remap it before you start. Langfuse is far more capable for prompt management, evals and retention; just budget for the operational weight.
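
A quick way to catch the :9090 collision before starting the stack is to check whether something is already listening on that port. The sketch below uses only the standard library; the function names and the list of candidate ports are hypothetical, and how you then remap the port depends on your compose file.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout_s=0.5):
    # True if a TCP connection to host:port succeeds, meaning
    # something is already listening there.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        return sock.connect_ex((host, port)) == 0


def first_free_port(candidates, host="127.0.0.1"):
    # Returns the first candidate with no listener, or None.
    for port in candidates:
        if not port_in_use(port, host):
            return port
    return None
```

For example, `port_in_use(9090)` tells you whether the default would clash on this machine, and `first_free_port([9090, 9190, 9290])` suggests an alternative from a list you choose. The check only sees listeners on the given host address, so treat a negative result as a hint rather than a guarantee.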

SigNoz we could not measure in our environment: its first-boot init container could not download a GitHub release asset on our test network, so the stack never reached a healthy state. This is a network issue on our side, not a SigNoz defect, so we report it as not measured in our env rather than publish a number we did not observe.

* SigNoz: the init container could not download a GitHub release asset on our test network (environmental, not a SigNoz defect).

How to read this

Match the footprint to the job. Need lightweight, OpenTelemetry-native tracing you can stand up fast? A single-container tool wins. Need prompt management, evals, datasets and long retention for a team? A full platform earns its operational weight. Don't pay multi-service ops cost for a single-container need, or vice versa.

Caveats, stated plainly: the overhead test is client-side span cost with in-memory export, not end-to-end backend latency; the self-host numbers are from one test environment (WSL2, Docker, tested @ 2026-06) and are footprint snapshots, not a load test. Langfuse times benefit from cached images; cold pulls take longer. SigNoz is reported as not-measured-in-our-env rather than estimated. We are extending the matrix to more tools and will keep every row labelled "tested" and dated. See our methodology.

Frequently asked questions

Does LLM observability instrumentation slow down my app?

Negligibly. In our test every SDK added under 0.05% to a typical 500 ms LLM call; the model call is thousands of times larger than the per-span cost. Instrumentation overhead should not drive your tool choice for LLM workloads.

Which self-hosted LLM observability tool is lightest to run?

In our test Arize Phoenix was the lightweight end: a single container with a 1.35 GB image and ~400 MB idle RAM, accepting OTLP traces immediately. Full platforms like Langfuse are far more capable but ship multi-service stacks that are a larger operational commitment.

Is Langfuse hard to self-host?

It booted cleanly in our 2026-06 test (about 10 seconds with images cached), but it is a full platform, not a one-container drop-in: a 6-service stack (web, worker, Postgres, ClickHouse, cache, object store) costing ~2.1 GB idle RAM, with the web container alone around 1.1 GB. It uses its own SDK rather than OTLP natively, and its bundled MinIO defaults to host port :9090 (which collides with Prometheus on many dev boxes). It is plenty capable for prompt management, evals and retention; just budget for the operational weight.