🔭 End-to-End OpenTelemetry Observability

Case study — one trace ID, from the request to the audit record

Every guardrail and tool call instrumented as a span, traces and logs joined on a shared ID into Tempo and Loki, and four Grafana dashboards on top — built so that when a regulator asks what the system was asked, what fired, and what came back, the answer is a query, not a forensics project.

3 signals unified: traces, metrics, logs
1 trace ID stitches the whole request
5 ordered guardrail spans per request
4 Grafana operator dashboards

In a regulated system, observability is the control

When an AI system drafts, scans, or monitors inside a bank or a healthcare payer, the hard question is never “did it work” — it is “can you prove what happened.” Model risk management frameworks expect explainability, auditability, and documented oversight. A pipeline that logs to one place, emits metrics to another, and keeps traces in a third leaves you reconstructing events after the fact, under pressure, from three systems that do not agree.

OpenTelemetry closes that gap by treating every request as a structured event. The request opens a parent span; each guardrail and each tool call runs as a child span beneath it; W3C trace context propagates through the whole pipeline so the spans stitch into one timeline; and the trace ID is injected into every log line so logs and traces join up instead of living in separate silos. That correlated record is the audit trail. The engineering work lies in making it complete, consistent, and cheap enough to leave on in production.

The approach is vendor-neutral by design. Instrumentation is written once against the OpenTelemetry SDK and exported over OTLP, so the backend — Tempo and Loki here, a commercial APM elsewhere — is a configuration choice, not a rewrite.

A single trace, span by span

Below is the shape of one request through the ordered guardrail pipeline. The parent span bounds the request; each guardrail and the provider inference call hang beneath it as child spans. The trace ID at the top is the same key written into every log line for this request.

Illustrative trace. Span order encodes the five ordered guardrails around the inference; widths are representative, not measured production latency.

What each span records

root

request

Opens the parent span and mints the trace ID that every child span and every log line inherits.

01

input.sanitize

Normalizes and screens the inbound payload; records what was stripped so the cleanup itself is auditable.

02

guardrail.injection

Prompt-injection detection; the span attributes capture whether a pattern matched and the action taken.

03

policy.enforce

Per-app authorization and policy checks; the decision and its inputs land on the span as the control evidence.

tool

provider.inference

The pluggable LLM provider call, instrumented as a child span so model latency and outcome sit inside the request timeline.

04

output.filter

PII redaction and output screening; each redaction emits a metric, so the count is its own security signal.

05

rate.limit

Final throttle check; once it completes, the request span closes and records the result the trace will be queried on later.
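The span layout above can be sketched with a small in-memory recorder. This is a stdlib-only stand-in rather than the OpenTelemetry SDK: `Span`, `Recorder`, `handle_request`, and the `outcome` attribute are hypothetical names chosen for illustration, while the span names are taken from the list above.

```python
import secrets
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)


class Recorder:
    """Hypothetical in-memory recorder standing in for a real tracer."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            name=name,
            # A root span mints the trace ID; children inherit it.
            trace_id=parent.trace_id if parent else secrets.token_hex(16),
            span_id=secrets.token_hex(8),
            parent_id=parent.span_id if parent else None,
        )
        self.spans.append(span)
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()


# Stage order mirrors the list above; provider.inference sits between 03 and 04.
STAGES = [
    "input.sanitize",
    "guardrail.injection",
    "policy.enforce",
    "provider.inference",
    "output.filter",
    "rate.limit",
]


def handle_request(recorder, prompt):
    with recorder.span("request") as root:
        root.attributes["prompt.length"] = len(prompt)
        for stage in STAGES:
            with recorder.span(stage) as child:
                child.attributes["outcome"] = "pass"  # placeholder decision
    return root.trace_id


recorder = Recorder()
trace_id = handle_request(recorder, "summarize this claim")
```

After the call, `recorder.spans` holds one root and six children that all share `trace_id`, which is the property the rest of the case study relies on.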

Pipeline and exporters

Instrumentation emits through the OpenTelemetry SDK and ships over OTLP; the backend is a Grafana stack, swappable without touching the instrumentation.

OTel SDK → Collector: spans, metrics, and logs over OTLP on :4317
Tempo: traces land here; trace ID is the audit key
Loki: logs with trace ID injected on every line
Grafana: four dashboards on :3000
W3C trace context: propagated across guardrails and tool calls
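One way to keep the backend a configuration choice is to resolve the Collector address from the environment instead of hard-coding it. The sketch below assumes the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable; the helper name `otlp_endpoint` and the local fallback on port 4317 are illustrative assumptions, not the project's actual configuration code.

```python
import os
from urllib.parse import urlparse

# Assumed fallback: a Collector listening locally on the OTLP gRPC port.
DEFAULT_ENDPOINT = "http://localhost:4317"


def otlp_endpoint(environ=None):
    """Resolve the OTLP endpoint from the environment, with a local fallback."""
    environ = os.environ if environ is None else environ
    raw = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT)
    parsed = urlparse(raw)
    # Reject values without an http(s) scheme or a host, such as "collector:4317".
    if parsed.scheme not in ("http", "https") or parsed.hostname is None:
        raise ValueError(f"not a valid OTLP endpoint: {raw!r}")
    return raw
```

Pointing the same instrumentation at a different backend then means changing one environment variable on the deployment, with no change to the span code.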

OpenTelemetry · OTLP · Tempo · Loki · Grafana · Context propagation · On-demand sampling

What the instrumentation buys you

Design decisions chosen so the telemetry stays useful under audit and affordable under load.

Parent / child spans

The request is the parent span; every guardrail and the provider call is a child beneath it. Nesting makes the order of operations and where time went legible at a glance.
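To illustrate how nesting makes "where time went" readable, the sketch below computes each child span's share of the parent's duration. `SpanRecord` and `breakdown` are hypothetical names, and the timings are made up for the example, not measured latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def breakdown(spans):
    """Return (name, duration_ms, share_of_parent) per child, in start order."""
    root = next(s for s in spans if s.parent_id is None)
    children = sorted(
        (s for s in spans if s.parent_id == root.span_id),
        key=lambda s: s.start_ms,
    )
    return [(c.name, c.duration_ms, c.duration_ms / root.duration_ms) for c in children]


# Made-up timings for illustration only.
spans = [
    SpanRecord("request", "a", None, 0.0, 100.0),
    SpanRecord("policy.enforce", "c", "a", 10.0, 20.0),
    SpanRecord("input.sanitize", "b", "a", 0.0, 10.0),
    SpanRecord("provider.inference", "d", "a", 20.0, 95.0),
]

for name, duration, share in breakdown(spans):
    print(f"{name:<20} {duration:6.1f} ms  {share:5.1%}")
```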

Trace ID in every log

The active trace ID is injected into each log line, so traces and logs join on a shared key instead of being correlated by guesswork across timestamps.
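A minimal sketch of the injection step, using only Python's standard `logging` and `contextvars` modules. The `current_trace_id` variable, the `TraceIdFilter` class, and the log format are assumptions for illustration; a real deployment would read the ID from the active span rather than set it by hand.

```python
import contextvars
import io
import logging

# Holds the active trace ID for the current request; "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stands in for the log shipper's input
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("trace_id=%(trace_id)s %(levelname)s %(message)s")
)

logger = logging.getLogger("guardrails.demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
try:
    logger.info("policy.enforce decision=allow")
finally:
    current_trace_id.reset(token)

logger.info("outside any request")
```

Because the ID is attached by a filter on the handler, call sites log normally and cannot forget the field.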

Context propagation

W3C trace context is carried across each hop, so spans from different stages assemble into one coherent timeline rather than scattering.
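The W3C `traceparent` header that carries this context has the shape `version-traceid-parentid-flags`. The sketch below parses and re-emits it with the standard library only; it accepts version `00` only, and the names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid per the specification.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for the next hop: same trace ID, fresh span ID."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The trace ID survives every hop unchanged while the span ID changes, which is what lets spans from different stages assemble into one timeline.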

The trace is the audit trail

What was asked, which guardrails fired, and what was returned all live on one trace. The regulator’s question has a query behind it, not a forensics project.
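As a rough illustration of what "a query" means here, the sketch below joins span and log records on the trace ID and assembles the three answers. The record shapes and the attribute names (`request.summary`, `response.status`, `action`) are assumptions made for the example; in the stack described here the equivalent lookup would run against Tempo and Loki rather than in-memory lists.

```python
def audit_record(trace_id, spans, logs):
    """Assemble what was asked, which guardrails fired, and what came back."""
    trace_spans = [s for s in spans if s["trace_id"] == trace_id]
    root = next(s for s in trace_spans if s["parent_id"] is None)
    # A guardrail "fired" if it recorded any action other than a plain pass.
    fired = [
        s["name"]
        for s in trace_spans
        if s["attributes"].get("action") not in (None, "pass")
    ]
    return {
        "trace_id": trace_id,
        "asked": root["attributes"].get("request.summary"),
        "guardrails_fired": fired,
        "returned": root["attributes"].get("response.status"),
        "log_lines": [line["message"] for line in logs if line["trace_id"] == trace_id],
    }


# Made-up records for illustration only.
spans = [
    {"trace_id": "t1", "parent_id": None, "name": "request",
     "attributes": {"request.summary": "summarize claim", "response.status": "blocked"}},
    {"trace_id": "t1", "parent_id": "r1", "name": "input.sanitize",
     "attributes": {"action": "pass"}},
    {"trace_id": "t1", "parent_id": "r1", "name": "guardrail.injection",
     "attributes": {"action": "block"}},
    {"trace_id": "t2", "parent_id": None, "name": "request",
     "attributes": {"request.summary": "other", "response.status": "ok"}},
]
logs = [
    {"trace_id": "t1", "message": "injection pattern matched"},
    {"trace_id": "t2", "message": "request served"},
]

record = audit_record("t1", spans, logs)
```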

Sampling that stays on

An on-demand sampler keeps telemetry detailed where it matters without paying the full export and storage cost on every request, so observability runs in production, not only in a drill.
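The sampler's rules are not spelled out here, so the following is only one possible reading: a deterministic ratio decision keyed on the trace ID, plus an on-demand override that always keeps a trace. The function name `should_sample` and the `force` flag are hypothetical.

```python
def should_sample(trace_id: str, ratio: float, force: bool = False) -> bool:
    """Keep a trace if forced on demand, else by a stable ratio on the trace ID."""
    if force:
        return True
    # Use the low 64 bits of the trace ID as a stable pseudo-random value,
    # so the same trace ID always yields the same decision.
    bucket = int(trace_id[-16:], 16)
    return bucket < ratio * (1 << 64)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
routine = should_sample(trace_id, ratio=0.1)
flagged = should_sample(trace_id, ratio=0.1, force=True)
```

In this reading, `force` might be set when a guardrail blocks a request or an operator flags a tenant, so the traces most likely to be audited are always retained.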

Operator dashboards

Four Grafana boards turn raw spans and metrics into the views an on-call engineer actually opens: latency, throughput, redaction hits, and per-tool error rates.
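Two of those views reduce to simple aggregations over span data. The sketch below shows the arithmetic behind a per-tool error rate and a redaction count; the field names `tool`, `status`, and `redactions` are assumptions, and in practice these numbers would come from metrics queried by Grafana rather than from Python lists.

```python
from collections import Counter


def per_tool_error_rate(spans):
    """Return {tool: errors / calls} for every tool that appears in the spans."""
    calls, errors = Counter(), Counter()
    for span in spans:
        tool = span.get("tool")
        if tool is None:
            continue  # not a tool-call span
        calls[tool] += 1
        if span.get("status") == "error":
            errors[tool] += 1
    return {tool: errors[tool] / calls[tool] for tool in calls}


def redaction_hits(spans):
    """Total number of redactions recorded across all spans."""
    return sum(span.get("redactions", 0) for span in spans)


# Made-up records for illustration only.
spans = [
    {"tool": "scan", "status": "ok"},
    {"tool": "scan", "status": "error"},
    {"tool": "scan", "status": "ok"},
    {"tool": "scan", "status": "ok"},
    {"tool": "draft", "status": "ok"},
    {"tool": "draft", "status": "ok"},
    {"name": "output.filter", "redactions": 3},
    {"name": "output.filter", "redactions": 1},
]

rates = per_tool_error_rate(spans)
hits = redaction_hits(spans)
```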

Observability beyond a single stack

The pipeline here exports to a Grafana stack, but the instrumentation discipline is portable. Three years of hands-on work span open-source and commercial backends — the choice of vendor follows the data-residency and tooling constraints of the environment, not the other way round.

Platform | Role | Notes
OpenTelemetry · Grafana · Tempo · Loki · Jaeger | Vendor-neutral traces, metrics, logs | OTLP export; trace–log correlation
Datadog | APM, metrics, log management | Dashboard and monitor design
New Relic | Full-stack APM and alerting | In production use since 2017
Dynatrace | Distributed tracing | Automated dependency mapping
Grafana Cloud | Observability backend | Behind the GitOps-on-EKS platform & its 352-test suite

Certifications & training

Formal credentials behind the observability and platform work, plus the hands-on training that keeps the instrumentation current.

Observability

OpenTelemetry — Certified

Instrumentation, OTLP pipeline, trace–log correlation · Nov 2025

Datadog — Core Skills

Metrics, traces, and log management · Jan 2026

Datadog — Monitoring

Monitor and dashboard design · Jan 2026

New Relic — Certified

Full-stack APM and alerting · 2017

Platform & reliability

KEDA

Event-driven autoscaling · Linux Foundation · 2025–2026

DevOps / SRE

Reliability engineering practice · Linux Foundation · 2025–2026

GitOps

Declarative delivery and reconciliation · Linux Foundation · 2025–2026

Hands-on training

OpenTelemetry instrumentation lab

SDK spans / metrics / logs, OTLP collector, Tempo / Loki / Grafana wiring.

EKS GenAI workshop

GPU inference serving with instrumented model-serving observability.

Claude Code bootcamp (Packt)

Agentic engineering workflows · 100% final score.

Related published work

Where this observability layer was built and where the same patterns run.

AppSec MCP Server

8 tools, 5 ordered guardrails, 3 pluggable LLM providers — the reference implementation this telemetry was instrumented into.

GitOps on AWS EKS

Multi-cluster GitOps platform with Grafana Cloud observability and a 352-test validation suite.

GenAI & AppSec Advisory

How these controls apply to a regulated GenAI deployment your CISO can sign off on.

Want this kind of observability for your team?

OpenTelemetry instrumentation, trace–log correlation, and Grafana dashboards — scoped to your compliance framework.
