The neutral index of AI agent observability tooling

Every AI agent observability, evals, guardrails and cost tool, compared by a neutral third party. 116 tools are tracked across tracing, evals, guardrails, prompt management, cost and debugging, with licensing, self-hosting and pricing-model facts checked against primary sources. Built by an engineer who runs agent fleets in production, not by a vendor marketing team.

Maintained by Panshi · updated 2026-06-18

Popular comparisons

Langfuse vs LangSmith
Langfuse vs Helicone
Langfuse vs Arize Phoenix
Braintrust vs LangSmith
Opik (Comet) vs Langfuse
Promptfoo vs Confident AI (DeepEval)
Ragas vs Promptfoo
AgentOps vs Langfuse
Datadog LLM Observability vs LangSmith
Helicone vs Portkey
Guardrails AI vs NVIDIA NeMo Guardrails
Lakera Guard vs LLM Guard (Protect AI)
garak vs PyRIT
TensorZero vs Langfuse
Arize Phoenix vs SigNoz
Traceloop (OpenLLMetry) vs OpenLIT
Helicone vs LiteLLM
Datadog LLM Observability vs Langfuse
W&B Weave vs Langfuse
Opik (Comet) vs Arize Phoenix
Galileo vs Braintrust
Confident AI (DeepEval) vs Ragas
Pydantic Logfire vs Langfuse
Portkey vs LiteLLM
Braintrust vs Langfuse
Maxim AI vs Langfuse
LangSmith vs Helicone
Datadog LLM Observability vs New Relic AI Monitoring
Braintrust vs Confident AI (DeepEval)
Galileo vs Arize Phoenix
Langfuse vs SigNoz
Arize Phoenix vs OpenLIT
Maxim AI vs Braintrust
Helicone vs OpenLIT
Promptfoo vs OpenAI Evals
Langfuse vs MLflow (Tracing & GenAI)

Best-of lists

Best self-hostable LLM observability tools
Best OpenTelemetry-native LLM observability tools
Best open-source LLM evals & testing tools
Best free LLM observability tools
Best LLM guardrails & safety tools
Best LLM cost tracking & FinOps tools

All tools

Agenta

Open-source prompt playground, registry and evaluation platform covering the prompt lifecycle from experimentation to deployment.

Prompt Management · open source · self-hostable · freemium · ★ 4.2k · Mature

AgentOps

Session-replay style observability for AI agents with time-travel debugging, cost tracking and first-class integrations into agent frameworks like CrewAI and AutoGen.

Agent Debugging & Replay · open source · freemium · OTel-native · ★ 5.6k · Established

Amazon Bedrock AgentCore Observability

AWS-managed trace-level observability for agents via CloudWatch generative AI observability and OTEL.

Observability & Tracing · paid · OTel-native

Amberflo

AI monetization platform metering token usage across providers with native credit-based billing.

Cost & FinOps · freemium

Arize AX

Enterprise AI observability and evaluation platform extending Arize's ML-monitoring heritage to LLM and agent workloads at scale.

Observability & Tracing · self-hostable · freemium · OTel-native

Arize Phoenix

Open-source, OpenTelemetry-based tracing and evaluation library that runs locally or self-hosted, serving as the OSS on-ramp to Arize's enterprise platform.

Observability & Tracing · open source · self-hostable · free · OTel-native · ★ 10.1k · Mature

Arthur

Model monitoring and firewall (Arthur Shield) for enterprise AI, focused on risk, bias and policy enforcement.

Guardrails & Safety · open source · self-hostable · freemium · ★ 429 · Established

Athina AI

IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.

Evals & Testing · self-hostable · freemium · ★ 300 · Early

Atla

Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.

Evals & Testing · freemium

Autoblocks

Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.

Evals & Testing · self-hostable · freemium

Azure AI Foundry Observability

Built-in evaluation, tracing and monitoring for models and agents inside Azure AI Foundry.

Observability & Tracing · paid · OTel-native

Bespoken

Automated testing and monitoring for IVR, voice assistants and conversational AI systems.

Evals & Testing · self-hostable · freemium

Bluejay

End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.

Evals & Testing · enterprise

Braintrust

Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.

Evals & Testing · self-hostable · freemium

CalypsoAI

Inference-perimeter security platform with scanners and red-team agents guarding enterprise model traffic.

Guardrails & Safety · self-hostable · enterprise

Cekura

Automated simulation testing and production monitoring for voice and chat AI agents.

Evals & Testing · self-hostable · freemium

Cisco AI Defense (Robust Intelligence)

acquired

AI model validation and runtime guardrails productized from Robust Intelligence inside Cisco's security stack.

Guardrails & Safety · enterprise

CloudZero

AI unit-economics platform mapping LLM and GPU spend to cost per feature, per customer and per deployment.

Cost & FinOps · enterprise

Composo

Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.

Evals & Testing · self-hostable · enterprise

Confident AI (DeepEval)

Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).

Evals & Testing · open source · self-hostable · freemium · ★ 16.1k · Mature

Coralogix AI (Aporia)

acquired

Aporia's drift detection and AI guardrails folded into the Coralogix observability platform as its AI research arm.

Guardrails & Safety · paid

Coval

Simulation-first testing and replay for voice/chat agents, borrowing evaluation methodology from autonomous-vehicle testing.

Agent Debugging & Replay · paid · OTel-native

Datadog LLM Observability

LLM and agent tracing inside the Datadog APM suite, attractive to teams already standardized on Datadog for infrastructure monitoring.

Observability & Tracing · paid · OTel-native

Deepchecks

Continuous validation suite from the ML-testing world extended to LLM apps, scoring outputs across versions from dev to production.

Evals & Testing · open source · self-hostable · freemium · ★ 4k · Established

Dynatrace AI Observability

GenAI and LLM monitoring within the Dynatrace APM platform, covering tokens, cost and service health for enterprises already on Dynatrace.

Observability & Tracing · paid · OTel-native

Evidently AI

Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.

Evals & Testing · open source · self-hostable · freemium · ★ 7.6k · Mature

Fiddler AI

Enterprise AI observability vendor from the ML-monitoring era, now offering LLM scoring, guardrails and bias/fairness auditing.

Observability & Tracing · self-hostable · enterprise · OTel-native

Finout

FinOps platform whose MegaBill model folds LLM/API spend into the same cost-allocation views as cloud infrastructure.

Cost & FinOps · enterprise

Freeplay

Prompt management, testing and human-review workflows aimed at cross-functional product teams shipping LLM features.

Prompt Management · paid

Future AGI

Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.

Evals & Testing · freemium

Galileo

Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.

Evals & Testing · self-hostable · freemium

garak

Open-source LLM vulnerability scanner that runs pre-built probes for jailbreaks, leakage and injection.

Guardrails & Safety · open source · self-hostable · free · ★ 8.1k · Mature

General Analysis

AI red teaming and safety testing platform producing adversarial test suites for LLM applications.

Guardrails & Safety · self-hostable · enterprise

Gentrace

Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.

Evals & Testing · self-hostable · freemium

Giskard

Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.

Evals & Testing · open source · self-hostable · freemium · ★ 5.4k · Mature

Google Stax

Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.

Evals & Testing · free

Grafana Cloud AI Observability

OTel-based GenAI observability solution on Grafana Cloud, built on open-source instrumentation rather than a proprietary SDK.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 74.4k · Mature

Guardrails AI

Open-source output-validation framework where composable validators enforce schemas, policies and safety constraints on LLM I/O.

Guardrails & Safety · open source · self-hostable · freemium · ★ 7k · Mature

Haize Labs

Automated red-teaming ('haizing') that stress-tests LLM systems to find jailbreaks and failure modes before deployment.

Guardrails & Safety · paid · ★ 343 · Growing

Hamming AI

Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.

Evals & Testing · paid

Helicone

⚠ sunset/maintenance

Proxy/gateway-based LLM logging with one-line setup and unified cost and latency visibility across providers; now under Mintlify ownership.

Observability & Tracing · open source · self-hostable · freemium · ★ 5.8k · Mature

HoneyHive

Evaluation and observability platform for production agents with prompt versioning, A/B tests and OTel-based tracing.

Observability & Tracing · self-hostable · freemium · OTel-native

Humanloop

⚠ sunset/maintenance

Former prompt management and evaluation platform; shut down September 2025 with official migration paths to W&B, PromptLayer and Agenta.

Prompt Management · enterprise

Inspect AI

Government-built open-source framework for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.

Evals & Testing · open source · self-hostable · free · ★ 2.2k · Mature

Invariant Labs

acquired

Agent trace analysis and security scanning (incl. MCP tool-poisoning research), with an Explorer UI for debugging agent runs.

Agent Debugging & Replay · open source · self-hostable · freemium · ★ 427 · Established

Judgment Labs (judgeval)

Open-source agent behavior monitoring and evaluation library feeding agent post-training (RL/SFT).

Evals & Testing · open source · self-hostable · freemium · ★ 1k · Mature

Kashikoi

Simulates multi-turn conversation flows to benchmark AI agents before deployment.

Evals & Testing · enterprise

Lago

Open-source usage-based metering and billing engine used for AI token and credit pricing.

Cost & FinOps · open source · self-hostable · freemium · ★ 9.8k · Mature

Lakera Guard

acquired

Low-latency API guarding against prompt injection, data leakage and toxic content, backed by the Gandalf attack dataset.

Guardrails & Safety · self-hostable · freemium

Laminar

Open-source observability for long-running AI agents that captures LLM calls, tool use and browser actions for step-level debugging and replay.

Agent Debugging & Replay · open source · self-hostable · freemium · OTel-native · ★ 3k · Mature

Langfuse

Open-source (MIT) LLM engineering platform combining tracing, prompt management, evals and datasets, widely used as the default self-hosted observability stack.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 29k · Mature

LangSmith

Closed-source tracing, evals and monitoring platform from the LangChain team, with its deepest integration in LangChain/LangGraph but usable via OTel from any stack.

Observability & Tracing · self-hostable · freemium · OTel-native

Langtail

Spreadsheet-like prompt testing and deployment studio with assertions and security guardrails for smaller teams.

Prompt Management · freemium

Langtrace

OpenTelemetry-native open-source tracing and metrics for LLM apps and agent frameworks, with a managed cloud option.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 1.2k · Growing

LangWatch

Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.

Evals & Testing · open source · self-hostable · freemium · ★ 3.3k · Mature

Latitude

Open-source prompt engineering platform with versioning, evals and agent-running infrastructure (PromptL).

Prompt Management · open source · self-hostable · freemium · ★ 4.1k · Mature

LiteLLM

Open-source LLM proxy/SDK that normalizes 100+ providers behind the OpenAI format with per-key budgets, spend tracking and rate limits.

Cost & FinOps · open source · self-hostable · freemium · ★ 50.2k · Mature

LLM Guard (Protect AI)

acquired

Open-source input/output scanner toolkit (35+ scanners) for PII, injection and toxicity checks on LLM traffic.

Guardrails & Safety · open source · self-hostable · free · ★ 3.1k · Established

LLM Stats

Independent AI evaluations lab publishing model benchmarks and comparison data.

Evals & Testing · free

Lunary

Lightweight open-source LLM observability with tracing, analytics, prompt templates and PII masking, formerly known as LLMonitor.

Observability & Tracing · open source · self-hostable · freemium · OTel-native

Maxim AI

End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.

Evals & Testing · self-hostable · freemium

Metronome

acquired

Enterprise usage metering and billing platform powering token-based pricing for major AI companies.

Cost & FinOps · enterprise

Mindgard

Automated AI red teaming platform testing LLMs, agents and multimodal models against MITRE ATLAS / OWASP-aligned attacks.

Guardrails & Safety · enterprise

MLflow (Tracing & GenAI)

The MLOps standard's GenAI extension: trace logging, LLM evaluation and prompt registry inside open-source MLflow 3.

Observability & Tracing · open source · self-hostable · free · OTel-native · ★ 26.5k · Mature

New Relic AI Monitoring

AI/LLM monitoring layer in the New Relic APM platform tracking model latency, token cost and errors alongside conventional app telemetry.

Observability & Tracing · freemium · OTel-native

NVIDIA NeMo Guardrails

Programmable conversational guardrails toolkit using the Colang DSL, covering input, dialog, retrieval, execution and output rails.

Guardrails & Safety · open source · self-hostable · free · ★ 6.4k · Mature

Okareo

Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.

Evals & Testing · freemium

OpenAI Evals

OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.

Evals & Testing · open source · self-hostable · free · ★ 18.7k · Mature

Openlayer

AI evaluation and observability platform spanning development tests and production monitoring.

Evals & Testing · self-hostable · freemium · ★ 16 · Established

OpenLIT

OpenTelemetry-native open-source platform covering LLM tracing, GPU monitoring, guardrails and a prompt vault with one-line auto-instrumentation of 50+ providers.

Observability & Tracing · open source · self-hostable · free · OTel-native · ★ 2.5k · Mature

OpenMeter

Open-source usage metering for AI/API products, commonly used to meter tokens for billing and internal chargeback.

Cost & FinOps · open source · self-hostable · freemium · ★ 2k · Mature

Opik (Comet)

Open-source LLM evaluation and tracing platform from Comet, combining trace logging, eval metrics and CI-friendly test suites.

Observability & Tracing · open source · self-hostable · freemium · ★ 19.6k · Mature

Orb

Usage-based billing engine handling high-volume metering for AI and token-priced products.

Cost & FinOps · enterprise

Orq.ai

Generative AI collaboration platform bundling prompt management, deployments, evals and observability for SaaS teams.

Prompt Management · self-hostable · freemium · OTel-native · Observability · Tracing

Patronus AI

Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.

Evals & Testing · freemium

Pay-i

GenAI-specific FinOps tracking cost per request, per feature and per customer to give product teams real unit economics.

Cost & FinOps · paid

Petri

Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.

Evals & Testing · open source · self-hostable · free · ★ 1.2k · Mature

Phospho

Open-source text analytics on LLM app messages, clustering and scoring conversations to surface what users actually do.

Observability & Tracing · open source · self-hostable · freemium · ★ 440

Pillar Security

AI security platform covering discovery, red teaming and runtime protection across the AI lifecycle.

Guardrails & Safety · self-hostable · enterprise

Portkey

acquired

AI gateway routing 1,600+ models with built-in logging, cost tracking, caching and guardrails; observability comes as a side effect of the proxy layer.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 12.1k · Mature

PostHog LLM Analytics

LLM cost, latency and trace analytics bolted onto PostHog's product-analytics platform, letting teams join AI telemetry with user behavior data.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 35k · Mature

Prompt Security

acquired

Enterprise GenAI security platform monitoring employee and application LLM usage for injection, leakage and shadow AI.

Guardrails & Safety · self-hostable · enterprise

Promptfoo

Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.

Evals & Testing · open source · self-hostable · freemium · ★ 22.2k · Mature

PromptHub

Git-style prompt version control with branch/commit/merge semantics, runtime REST retrieval and CI/CD quality gates.

Prompt Management · freemium

PromptLayer

Prompt CMS with visual versioning, release labels and A/B testing, positioned so non-engineers can edit and deploy prompts independently.

Prompt Management · freemium

Pydantic Logfire

OpenTelemetry-based observability service from the Pydantic team with first-class PydanticAI and Python ecosystem integration.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 4.3k · Mature

PyRIT

Python Risk Identification Toolkit automating single- and multi-turn adversarial probing of GenAI systems.

Guardrails & Safety · open source · self-hostable · free · ★ 63 · Established

Quotient AI

Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.

Evals & Testing · self-hostable · freemium

RagaAI (Catalyst)

Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.

Evals & Testing · open source · self-hostable · freemium · ★ 16.2k · Established

Ragas

The de-facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.

Evals & Testing · open source · self-hostable · free · ★ 14.4k · Established

Repello AI

AI red teaming platform whose ARTEMIS engine automates adversarial testing of LLM apps and agents.

Guardrails & Safety · enterprise

Respan (formerly Keywords AI)

Unified gateway-plus-observability control plane for tracing and evaluating agent behavior, rebranded from Keywords AI.

Observability & Tracing · freemium · OTel-native

Roark

Production observability for voice agents that captures real calls and converts failures into test cases.

Observability & Tracing · enterprise

Scorecard

Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.

Evals & Testing · freemium · OTel-native · Simulations

Sentry AI Agent Monitoring

Agent-call tracing and error monitoring inside Sentry, giving app developers LLM visibility in the tool they already use for crash reporting.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 44.1k · Mature

SigNoz

Open-source OpenTelemetry APM that handles LLM observability via standard OTel instrumentation rather than an LLM-specific SDK.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 27.3k · Mature

SPLX (SplxAI)

acquired

Automated AI security testing and red teaming for AI assistants and agents from build to runtime.

Guardrails & Safety · self-hostable · enterprise

Straiker

AI-native security platform with red teaming and runtime guardrails for agentic applications.

Guardrails & Safety · enterprise

TensorZero

Open-source LLMOps stack unifying gateway, observability, evaluations, optimization and experimentation.

Observability & Tracing · open source · self-hostable · free · OTel-native · ★ 11.5k · Mature

TestZeus (Hercules)

Open-source agentic end-to-end testing framework covering web, API and voice agent testing.

Evals & Testing · open source · self-hostable · freemium · ★ 1k · Mature

The LLM Data Company (doteval)

AI-assisted workspace (doteval) for writing and managing LLM evaluations.

Evals & Testing · enterprise

Tokencost

Lightweight open-source library maintaining an up-to-date price table for estimating prompt/completion costs across 400+ models.

Cost & FinOps · open source · self-hostable · free · ★ 2k · Established

Traceloop (OpenLLMetry)

Vendor-neutral OpenTelemetry instrumentation for LLM apps (OpenLLMetry) that ships traces to any OTel backend, plus a hosted monitoring platform.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 7.2k · Mature

Tropir

Autonomous LLM-Ops engineer that traces, debugs and optimizes LLM pipelines.

Agent Debugging & Replay · enterprise

Trubrics

User analytics and feedback tracking for LLM applications to surface real usage patterns.

Observability & Tracing · open source · enterprise · ★ 156 · Early

TrueFoundry

AI gateway and deployment platform with built-in request logging, cost attribution and rate limiting for enterprise Kubernetes environments.

Observability & Tracing · self-hostable · freemium · OTel-native

TruLens

acquired

Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.

Evals & Testing · open source · self-hostable · free · ★ 3.4k · Mature

Vals AI

Industry-specific LLM benchmarks and enterprise evaluation for legal, tax and finance tasks.

Evals & Testing · enterprise

Vantage

Cloud cost platform with native token-level ingest for OpenAI/Anthropic and an MCP server for querying AI spend from coding assistants.

Cost & FinOps · freemium

Vellum

Low-code platform for prompts, workflows, evaluations and deployments with environment management for product teams.

Prompt Management · freemium

Vertex AI Gen AI Evaluation Service

Managed evaluation service on Vertex AI for scoring models and agents with autoraters and custom metrics.

Evals & Testing · paid

Vijil

Agent trust platform combining automated evaluation, red teaming and runtime defenses for AI agents.

Guardrails & Safety · self-hostable · freemium

W&B Weave

acquired

LLM tracing and evaluation toolkit from Weights & Biases, integrated with the broader W&B experiment-tracking ecosystem.

Observability & Tracing · open source · self-hostable · freemium · OTel-native · ★ 1.1k · Mature
