AI Observability is Just Observability

You’ve spent thousands on an AI observability platform. You’ve set up dozens of dashboards. And somehow, you still struggle to answer the questions your CEO asks:

  • Why did this customer get a hallucinated response yesterday?
  • Why is this feature slower than it used to be?
  • Why does this cost so much?

After leading ML teams at Google and LinkedIn, and now serving as VP of Engineering, I’ve seen this pattern repeat countless times. Here’s the truth: AI observability is just observability. You don’t need another specialized tool; you need to instrument your AI components properly with the tools you already have.

The Real Problem: Context is King

What makes AI components different isn’t that they need special tools, but that they have specific contexts worth capturing. Most teams are either capturing too little data or drowning in irrelevant metrics. Let’s fix that.

The Solution: OpenTelemetry and Wide Structured Logs

Instead of buying another tool, here’s what you should do:

1. Embrace OpenTelemetry (OTEL)

OTEL has become the standard for observability, and for good reason. While you could build custom instrumentation, OTEL provides several critical advantages:

  • Vendor Flexibility: OTEL’s vendor-neutral approach means you can switch observability providers without rewriting instrumentation code. This is especially valuable as the AI observability market matures.
  • Unified Context: OTEL automatically correlates traces, metrics, and logs across your entire stack. This means you can trace a request from your API gateway through your RAG system and LLM calls, all with consistent context.
  • Rich Ecosystem: OTEL’s widespread adoption means you get access to pre-built instrumentation for common components (databases, message queues, HTTP clients). This saves significant engineering time.
  • Future-Proofing: As new observability features and best practices emerge, they’ll likely be implemented in OTEL first. Your investment in OTEL instrumentation continues to pay dividends.
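
Getting started takes very little code. Here’s a minimal sketch in Python using the opentelemetry-api and opentelemetry-sdk packages; the gen_ai.* attribute names follow OTEL’s (still incubating) generative AI semantic conventions, and call_llm is a stand-in for your provider’s client:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire the SDK to an exporter once at startup. Swap ConsoleSpanExporter
# for your vendor's OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-service")

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("llm_completion") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4")
        span.set_attribute("gen_ai.request.temperature", 0.7)
        response = call_llm(question)  # placeholder for your provider call
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response.text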

Most major observability providers support OTEL. Prefer whatever tooling you already use for the rest of your system: that lets you analyze data across your AI and non-AI components and gives you a holistic view. Remember: AI observability is just observability.

2. Use Wide Structured Logs (Observability 2.0)

The key to effective observability is using wide, structured log events. Each log entry should be a complete record of what happened, containing all relevant context in one place. This makes debugging and analysis much simpler than jumping between different tools and data sources. For a more complete guide, consult Charity Majors’s Observability 2.0.

Here’s an example of such a log from an AI component:

{
  "timestamp": "2024-02-17T10:15:30Z",
  "service": "rag-service",
  "span_id": "abc123",
  "trace_id": "xyz789",
  "level": "INFO",
  "event_type": "llm_completion",
  
  // Request Context
  "user_id": "user_123",
  "session_id": "session_456",
  "request_id": "req_789",
  "client_version": "2.1.0",
  "feature": "customer_support_bot",
  
  // Prompt Context
  "prompt_template": "Given the context:\n{context}\n\nAnswer the question: {question}",
  "prompt_variables": {
    "question": "What is the company's return policy?",
    "context": "Policy document sections 3.1-3.4"
  },
  "system_message": "You are a helpful customer service agent...",
  
  // RAG Context
  "retrieved_documents": [
    {
      "id": "doc_123",
      "title": "Return Policy 2024",
      "score": 0.89,
      "chunk": "Returns must be initiated within 30 days..."
    }
  ],
  
  // Model Parameters
  "model": "gpt-4",
  "temperature": 0.7,
  "top_p": 1.0,
  "max_tokens": 500,
  
  // Performance Metrics
  "input_tokens": 245,
  "output_tokens": 128,
  "cached_tokens": 0,
  "latency_ms": 750,
  "rate_limited": false,
  
  // Response Data
  "completion": "Our return policy allows...",
  "finish_reason": "stop",
  "token_logprobs": [...],
  
  // User Feedback
  "user_rating": 5,
  "feedback_text": "Very helpful and accurate",
  "task_completed": true,
  "required_clarification": false
}

With this rich context captured in your logs, you can:

  • Track reliability metrics (error rates, latency)
  • Monitor costs (token usage, cache effectiveness)
  • Measure quality (user satisfaction, hallucination rates)
  • Debug specific issues with full context
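
In practice, the cleanest way to produce events like this is the “canonical log line” pattern: build up a single dict as the request flows through your code, then emit it exactly once at the end, on success or failure. Here’s a minimal sketch using only Python’s standard library; retrieve_documents and call_llm are placeholders for your own functions:

import json
import logging
import time
import uuid

logger = logging.getLogger("rag-service")

def handle_request(user_id: str, question: str) -> str:
    # Build one wide event per request and emit it exactly once.
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": "rag-service",
        "event_type": "llm_completion",
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "feature": "customer_support_bot",
    }
    start = time.perf_counter()
    try:
        docs = retrieve_documents(question)  # placeholder for your retriever
        event["retrieved_documents"] = [{"id": d.id, "score": d.score} for d in docs]
        response = call_llm(question, docs)  # placeholder for your LLM call
        event["completion"] = response.text
        event["finish_reason"] = response.finish_reason
        event["input_tokens"] = response.input_tokens
        event["output_tokens"] = response.output_tokens
        return response.text
    except Exception as exc:
        event["error_type"] = type(exc).__name__
        event["error_message"] = str(exc)
        raise
    finally:
        event["latency_ms"] = round((time.perf_counter() - start) * 1000)
        logger.info(json.dumps(event))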

What You Actually Need to Log

Here’s a comprehensive list of what you should track for every AI request, and why each matters:

Request Context

  • User ID & Session ID: Crucial for understanding user patterns and debugging specific issues
  • Request ID: The thread that ties everything together
  • Timestamp: When did this happen? Essential for debugging and pattern analysis
  • Client Info: API client version, environment details - helps track down client-specific issues
  • Business Context: What feature/product/workflow was this request part of?
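
One way to guarantee every event carries these fields is to stash them in a context variable at the start of each request and merge them into everything you emit. A sketch (the field names mirror the example above; begin_request and log_event are illustrative helpers):

import json
import logging
from contextvars import ContextVar

logger = logging.getLogger("rag-service")

# Holds the ambient request context for the current task or thread.
request_context: ContextVar[dict] = ContextVar("request_context", default={})

def begin_request(user_id: str, session_id: str, request_id: str,
                  client_version: str, feature: str) -> None:
    request_context.set({
        "user_id": user_id,
        "session_id": session_id,
        "request_id": request_id,
        "client_version": client_version,
        "feature": feature,
    })

def log_event(event: dict) -> None:
    # Merge the ambient request context into every wide event.
    logger.info(json.dumps({**request_context.get(), **event}))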

Prompt Engineering Context

  • Prompt Template: The base template used - critical for versioning and debugging
  • Prompt Variables: The actual values injected into your template
  • System Messages: Any system-level instructions provided
  • Previous Messages: For chat-based interactions, the conversation history
  • RAG Document References: Which documents were retrieved and their relevance scores
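
Logging the template and its variables separately, rather than only the rendered prompt, lets you group events by template version and inspect the inputs independently. An illustrative sketch:

PROMPT_TEMPLATE = "Given the context:\n{context}\n\nAnswer the question: {question}"

def build_prompt(event: dict, **variables) -> str:
    # Record template and variables as separate fields so you can
    # group by template and diff the injected values.
    event["prompt_template"] = PROMPT_TEMPLATE
    event["prompt_variables"] = variables
    return PROMPT_TEMPLATE.format(**variables)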

Model Parameters

  • Model Provider: Which provider did you use (OpenAI, Anthropic, etc.)?
  • Model ID & Version: Which model did you use?
  • Temperature: Higher values mean more random outputs
  • Top P / Top K / Top A / Min P: Your sampling strategy affects output quality
  • Max Tokens: Helps track truncation issues
  • Stop Sequences: Custom stop conditions affect completion
  • Presence / Frequency / Repetition Penalty: Request parameters that discourage repetition
  • Seed: often not supported, but useful for reproducibility when it is
  • Logit Bias: (if used) biases sampling to make specific tokens more or less likely
  • Top Logprobs: (if used) helps you understand the model’s confidence by logging the next most likely tokens at each position
  • Tools: The tools available to the model
  • Response Format: the expected format (schema) of the response
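
The easiest way to keep these fields in sync with reality is to log the exact parameters you send rather than maintaining a parallel list. A sketch assuming an OpenAI-style chat completions client; logged_completion is an illustrative helper:

def logged_completion(client, event: dict, **params):
    # Record the exact request parameters (model, temperature, top_p,
    # max_tokens, stop, seed, tools, response_format, ...) so the log
    # can never drift out of sync with the actual request.
    event["model_parameters"] = {k: v for k, v in params.items() if k != "messages"}
    response = client.chat.completions.create(**params)
    event["finish_reason"] = response.choices[0].finish_reason
    return response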

Performance Metrics

  • Input Tokens: Cost tracking and input size monitoring
  • Output Tokens: Cost tracking and response size monitoring
  • Cached Tokens: Are you effectively using your cache?
  • Latency: End-to-end and component-wise timing
  • Time to First Token: Useful for understanding user-perceived latency when streaming (see the sketch after this list)
  • Time to First Record: Useful for understanding user-perceived latency when you are generating a list of records
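
Time to first token is straightforward to capture if you wrap the streaming loop. A sketch assuming an OpenAI-style streaming client; adapt the chunk handling to your provider:

import time

def stream_with_timings(client, event: dict, **params):
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    for chunk in client.chat.completions.create(stream=True, **params):
        if not chunk.choices:
            continue  # e.g., a usage-only chunk
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_at is None:
            first_token_at = time.perf_counter()
        chunks.append(delta)
    end = time.perf_counter()
    if first_token_at is not None:
        event["time_to_first_token_ms"] = round((first_token_at - start) * 1000)
    event["latency_ms"] = round((end - start) * 1000)
    return "".join(chunks)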

Response Data

  • Completion Text: The actual response
  • Finish Reason: Why did the model stop? (length, stop sequence, error?)
  • Token Logprobs: Useful for understanding model confidence
  • Error Details: Type, message, and stack trace if applicable

User Feedback & Quality Metrics

  • Explicit Feedback: Thumbs up/down or ratings from users
  • Implicit Feedback: Did they use the response or discard it?
  • Task Success: Did the user achieve their goal?
  • Follow-up Queries: Did they need to ask for clarification?
  • Quality Flags: Internal quality checks (hallucination detection, toxicity, etc.)
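
Feedback usually arrives long after the completion event, so log it as a separate event that shares the same request_id and join the two at query time. A minimal sketch:

import json
import logging

logger = logging.getLogger("rag-service")

def log_feedback(request_id: str, rating: int, feedback_text: str,
                 task_completed: bool) -> None:
    # Emitted separately from the completion event; join on request_id
    # in your analytics tool.
    logger.info(json.dumps({
        "event_type": "user_feedback",
        "request_id": request_id,
        "user_rating": rating,
        "feedback_text": feedback_text,
        "task_completed": task_completed,
    }))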

Logging all of these ensures you can compute reliability, quality, and cost metrics for your AI components, and gives you the information you need to understand and improve them over time.

Conclusion

Next time your CEO asks about that hallucinated response, you’ll be able to pull up the exact request, see what documents were retrieved, check the model parameters, and understand if other users faced similar issues. All without spending thousands on another tool.

Stop buying AI observability tools. Start capturing the right context in your existing observability stack. Your budget (and your future self) will thank you.

Remember: AI components are just another part of your system. Treat them that way, but make sure you capture their unique context. The best tool is often the one you already have - you just need to use it properly.

Need help improving your AI Observability? Book a free consult.

Or take the AI Maturity Model Assessment to see where you stand and get a personalized plan to improve your AI Maturity.