Your CEO just asked why the AI gave that customer a hallucinated response yesterday. You stare at your expensive AI observability dashboard, scroll through dozens of metrics, and realize with growing frustration:
You still don’t have a clear answer.
- Why did this customer get a hallucinated response yesterday?
- Why is this feature slower than it used to be?
- Why does this cost so much?
For B2B SaaS companies and startups, where every customer interaction matters and churn is costly, these unanswered questions create real business risk. After leading ML teams at Google and LinkedIn, I’ve seen this pattern repeat countless times. Here’s the truth: AI observability is just observability—and the solution isn’t another specialized tool. You need systematic evaluation that connects observability data to actionable insights.
You need to confidently answer: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
The Real Problem: Flying Blind with Data
What makes AI components different isn’t that they need special tools, but that they have specific contexts worth capturing for systematic evaluation. Most teams are either capturing too little data or drowning in irrelevant metrics that don’t connect to business outcomes.
The issue isn’t lack of data—it’s lack of systematic evaluation that transforms observability into actionable insights.
The Solution: OpenTelemetry and Wide Structured Logs
Instead of buying another tool, here’s what you should do:
1. Embrace OpenTelemetry (OTEL)
OTEL has become the standard for observability, and for good reason. While you could build custom instrumentation, OTEL provides several critical advantages:
- Vendor Flexibility: OTEL’s vendor-neutral approach means you can switch observability providers without rewriting instrumentation code. This is especially valuable as the AI observability market matures.
- Unified Context: OTEL automatically correlates traces, metrics, and logs across your entire stack. This means you can trace a request from your API gateway through your RAG system and LLM calls, all with consistent context.
- Rich Ecosystem: OTEL’s widespread adoption means you get access to pre-built instrumentation for common components (databases, message queues, HTTP clients). This saves significant engineering time.
- Future-Proofing: As new observability features and best practices emerge, they’ll likely be implemented in OTEL first. Your investment in OTEL instrumentation continues to pay dividends.
Most major observability providers already support OTEL.
Prefer whatever observability tooling you already use for the rest of your system: it lets you analyze data across your AI and non-AI components and get a holistic view of your system. Remember: AI observability is just observability.
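To make this concrete, here is a minimal sketch of what OTEL instrumentation around a RAG + LLM call can look like using the OTEL Python API. The span names, attribute keys, and the retrieve_documents / call_llm helpers are illustrative placeholders for your own code, not an official schema:

from opentelemetry import trace

# Assumes the OTEL SDK and an exporter are configured elsewhere
# (opentelemetry-sdk plus your provider's OTLP endpoint).
tracer = trace.get_tracer("rag-service")

def retrieve_documents(question: str) -> list[dict]:
    # Placeholder for your real retriever.
    return [{"id": "doc_123", "score": 0.89}]

def call_llm(question: str, documents: list[dict]) -> tuple[str, dict]:
    # Placeholder for your real LLM client.
    return "Our return policy allows...", {"input_tokens": 245, "output_tokens": 128}

def answer_question(question: str, user_id: str) -> str:
    # One parent span per request correlates retrieval and generation.
    with tracer.start_as_current_span("customer_support_bot.request") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.feature", "customer_support_bot")

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            documents = retrieve_documents(question)
            retrieve_span.set_attribute("rag.documents_retrieved", len(documents))

        with tracer.start_as_current_span("llm.completion") as llm_span:
            completion, usage = call_llm(question, documents)
            llm_span.set_attribute("llm.model", "gpt-4")
            llm_span.set_attribute("llm.input_tokens", usage["input_tokens"])
            llm_span.set_attribute("llm.output_tokens", usage["output_tokens"])

        return completion

Because the retrieval span and the LLM span share one trace, a slow or hallucinated answer can be traced back to the exact documents and parameters involved.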
2. Use Wide Structured Logs (Observability 2.0)
The key to effective observability is using wide, structured log events. Each log entry should be a complete record of what happened, containing all relevant context in one place. This makes debugging and analysis much simpler than jumping between different tools and data sources. For a more complete guide, consult Charity Majors’s Observability 2.0.
Here’s an example of such a log from an AI component:
{
"timestamp": "2024-02-17T10:15:30Z",
"service": "rag-service",
"span_id": "abc123",
"trace_id": "xyz789",
"level": "INFO",
"event_type": "llm_completion",
// Request Context
"user_id": "user_123",
"session_id": "session_456",
"request_id": "req_789",
"client_version": "2.1.0",
"feature": "customer_support_bot",
// Prompt Context
"prompt_template": "Given the context:\n{context}\n\nAnswer the question: {question}",
"prompt_variables": {
"question": "What is the company's return policy?",
"context": "Policy document sections 3.1-3.4"
},
"system_message": "You are a helpful customer service agent...",
// RAG Context
"retrieved_documents": [
{
"id": "doc_123",
"title": "Return Policy 2024",
"score": 0.89,
"chunk": "Returns must be initiated within 30 days..."
}
],
// Model Parameters
"model": "gpt-4",
"temperature": 0.7,
"top_p": 1.0,
"max_tokens": 500,
// Performance Metrics
"input_tokens": 245,
"output_tokens": 128,
"cached_tokens": 0,
"latency_ms": 750,
"rate_limited": false,
// Response Data
"completion": "Our return policy allows...",
"finish_reason": "stop",
"token_logprobs": [...],
// User Feedback
"user_rating": 5,
"feedback_text": "Very helpful and accurate",
"task_completed": true,
"required_clarification": false
}
With this rich context captured in your logs, systematic evaluation becomes possible:
- Track reliability metrics (error rates, latency) and connect them to user churn
- Monitor costs (token usage, cache effectiveness) and tie them to business ROI
- Measure quality (user satisfaction, hallucination rates) with statistical confidence
- Debug specific issues with full context and prevent recurrence
This is the foundation of evaluation-driven development—turning observability data into systematic improvement.
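Here is a small sketch of the kind of question wide events let you answer directly, assuming the events above land in a queryable store and are loaded as a list of dicts shaped like the example log. The thresholds and the function itself are illustrative, not a prescribed metric set:

def summarize_feature(events: list[dict], feature: str) -> dict:
    # Roll wide log events up into the reliability / cost / quality view a dashboard would show.
    rows = [e for e in events if e.get("feature") == feature]
    if not rows:
        return {}
    total_tokens = sum(e["input_tokens"] + e["output_tokens"] for e in rows)
    slow = [e for e in rows if e["latency_ms"] > 2000]  # illustrative latency SLO
    unhappy = [e for e in rows if e.get("user_rating") is not None and e["user_rating"] <= 2]
    return {
        "requests": len(rows),
        "p_slow": len(slow) / len(rows),
        "p_unhappy": len(unhappy) / len(rows),
        "avg_tokens_per_request": total_tokens / len(rows),
    }

The same events answer the CEO's question ("show me this user's request and the documents it retrieved") and the finance question ("what does this feature cost per request"), without a separate tool for each.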
What You Actually Need to Log
Here’s a comprehensive list of what you should track for every AI request, and why each matters:
Request Context
- User ID & Session ID: Crucial for understanding user patterns and debugging specific issues
- Request ID: The thread that ties everything together
- Timestamp: When did this happen? Essential for debugging and pattern analysis
- Client Info: API client version, environment details - helps track down client-specific issues
- Business Context: What feature/product/workflow was this request part of?
Prompt Engineering Context
- Prompt Template: The base template used - critical for versioning and debugging
- Prompt Variables: The actual values injected into your template
- System Messages: Any system-level instructions provided
- Previous Messages: For chat-based interactions, the conversation history
- RAG Document References: Which documents were retrieved and their relevance scores
Model Parameters
- Model Provider: Which provider did you use (OpenAI, Anthropic, etc.)?
- Model ID & Version: Which model did you use?
- Temperature: Higher values mean more random outputs
- Top P / Top K / Top A / Min P: Your sampling strategy affects output quality
- Max Tokens: Helps track truncation issues
- Stop Sequences: Custom stop conditions affect completion
- Presence / Frequency / Repetition Penalty: Request parameters that control how strongly the model avoids repeating itself
- Seed: Not supported by every provider, but useful for reproducibility when it is
- Logit Bias: (if used) Biases sampling to make specific tokens more or less likely
- Top Logprobs: (if used) Helps you understand the model's confidence by logging the next most likely tokens at each position (see the sketch after this list)
- Tools: The tools available to the model
- Response Format: The expected format (schema) of the response
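If you do log token logprobs, a simple way to use them is to flag low-confidence completions for review. A minimal sketch, assuming you have each generated token's logprob from your provider's response; the 0.6 threshold is purely illustrative:

import math

def completion_confidence(token_logprobs: list[float]) -> float:
    # Geometric mean probability of the generated tokens (1.0 = fully confident).
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Flag completions below an (illustrative) threshold for human review or re-generation.
if completion_confidence([-0.05, -0.21, -1.30]) < 0.6:
    print("low-confidence completion: route to review queue")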
Performance Metrics
- Input Tokens: Cost tracking and input size monitoring
- Output Tokens: Cost tracking and response size monitoring
- Cached Tokens: Are you effectively using your cache?
- Latency: End-to-end and component-wise timing
- Time to First Token: Useful for understanding user-perceived latency when streaming responses (see the sketch after this list)
- Time to First Record: Useful for understanding user-perceived latency when you are generating a list of records
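Time to first token is easy to capture if your client streams tokens. A minimal sketch; the wrapped iterator stands in for whatever streaming interface your LLM client exposes:

import time
from typing import Iterable, Iterator

def stream_with_timing(stream: Iterable[str]) -> Iterator[str]:
    # Wrap a token stream and record time-to-first-token and total latency.
    start = time.monotonic()
    first_token_at = None
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()
        yield chunk
    end = time.monotonic()
    # In production, attach these to the wide log event / span instead of printing.
    print({
        "time_to_first_token_ms": round((first_token_at - start) * 1000) if first_token_at is not None else None,
        "latency_ms": round((end - start) * 1000),
    })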
Response Data
- Completion Text: The actual response
- Finish Reason: Why did the model stop? (length, stop sequence, error?)
- Token Logprobs: Useful for understanding model confidence
- Error Details: Type, message, and stack trace if applicable
User Feedback & Quality Metrics
- Explicit Feedback: Thumbs up/down or ratings from users
- Implicit Feedback: Did they use the response or discard it?
- Task Success: Did the user achieve their goal?
- Follow-up Queries: Did they need to ask for clarification?
- Quality Flags: Internal quality checks (hallucination detection, toxicity, etc.)
Logging all of these ensures you can compute reliability, quality, and cost metrics for your AI components, and gives you the information you need to understand and improve them over time.
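One practical way to keep this list honest is to give the wide event an explicit schema in code, so missing fields show up during development rather than during an incident. A minimal sketch using a dataclass; the field set mirrors the categories above and is a starting point, not a complete spec:

from dataclasses import dataclass, asdict
import json

@dataclass
class LLMEvent:
    # Request context
    request_id: str
    user_id: str
    feature: str
    # Prompt context
    prompt_template: str
    prompt_variables: dict
    # Model parameters
    model: str
    temperature: float
    # Performance metrics
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: int = 0
    # Response data
    completion: str = ""
    finish_reason: str = ""
    # User feedback (often attached later, keyed by request_id)
    user_rating: int | None = None
    task_completed: bool | None = None

    def emit(self) -> None:
        # One wide, structured line per request; ship it wherever your logs already go.
        print(json.dumps(asdict(self)))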
Conclusion: From Observability to Evaluation-Driven Confidence
Next time your CEO asks about that hallucinated response, you’ll confidently answer: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
You’ll be able to pull up the exact request, see what documents were retrieved, check the model parameters, and understand if other users faced similar issues. All without spending thousands on another tool.
Stop buying AI observability tools. Start building systematic evaluation frameworks that transform observability data into actionable business insights.
Remember: AI components are just another part of your system, but they require systematic evaluation to move from “flying blind” to confident control. The best tool is often the one you already have—you just need to use it systematically.
Ready to Stop Guessing and Start Knowing?
If you’re ready to build systematic evaluation that transforms your AI observability from expensive dashboards to actionable insights, I can help your team implement this framework in just one week.
My intensive workshop walks your engineering team through building a robust evaluation system using your own data, in your own codebase. You’ll move from drowning in metrics to confidently answering: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
Schedule a free consultation to discuss your team’s specific challenges →