Your CEO just asked why the AI gave that customer a hallucinated response yesterday. You stare at your expensive AI observability dashboard, scroll through dozens of metrics, and realize with growing frustration:
You still don’t have a clear answer.
- Why did this customer get a hallucinated response yesterday?
- Why is this feature slower than it used to be?
- Why does this cost so much?
For B2B SaaS companies and startups, where every customer interaction matters and churn is costly, these unanswered questions create real business risk. After leading ML teams at Google and LinkedIn, I’ve seen this pattern repeat countless times. Here’s the truth: AI observability is just observability—and the solution isn’t another specialized tool. You need systematic evaluation that connects observability data to actionable insights.
You need to confidently answer: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
The Real Problem: Flying Blind with Data
What makes AI components different isn’t that they need special tools, but that they have specific contexts worth capturing for systematic evaluation. Most teams are either capturing too little data or drowning in irrelevant metrics that don’t connect to business outcomes.
The issue isn’t lack of data—it’s lack of systematic evaluation that transforms observability into actionable insights.
The Solution: OpenTelemetry and Wide Structured Logs
Instead of buying another tool, here’s what you should do:
1. Embrace OpenTelemetry (OTEL)
OTEL has become the standard for observability, and for good reason. While you could build custom instrumentation, OTEL provides several critical advantages:
- Vendor Flexibility: OTEL’s vendor-neutral approach means you can switch observability providers without rewriting instrumentation code. This is especially valuable as the AI observability market matures.
- Unified Context: OTEL automatically correlates traces, metrics, and logs across your entire stack. This means you can trace a request from your API gateway through your RAG system and LLM calls, all with consistent context.
- Rich Ecosystem: OTEL’s widespread adoption means you get access to pre-built instrumentation for common components (databases, message queues, HTTP clients). This saves significant engineering time.
- Future-Proofing: As new observability features and best practices emerge, they’ll likely be implemented in OTEL first. Your investment in OTEL instrumentation continues to pay dividends.
Most major observability providers already support OTEL.
Prefer whatever observability tooling you already use for the rest of your system: it lets you analyze data across your AI and non-AI components and get a holistic view of your system. Remember: AI observability is just observability.
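To make this concrete, here is a minimal sketch of what OTEL instrumentation around a RAG + LLM call can look like using the OTEL Python API. The span names, attribute keys, and the retrieve_documents / call_llm helpers are illustrative placeholders for your own code, not an official schema:

from opentelemetry import trace

# Assumes the OTEL SDK and an exporter are configured elsewhere
# (opentelemetry-sdk plus your provider's OTLP endpoint).
tracer = trace.get_tracer("rag-service")

def retrieve_documents(question: str) -> list[dict]:
    # Placeholder for your real retriever.
    return [{"id": "doc_123", "score": 0.89}]

def call_llm(question: str, documents: list[dict]) -> tuple[str, dict]:
    # Placeholder for your real LLM client.
    return "Our return policy allows...", {"input_tokens": 245, "output_tokens": 128}

def answer_question(question: str, user_id: str) -> str:
    # One parent span per request correlates retrieval and generation.
    with tracer.start_as_current_span("customer_support_bot.request") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.feature", "customer_support_bot")

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            documents = retrieve_documents(question)
            retrieve_span.set_attribute("rag.documents_retrieved", len(documents))

        with tracer.start_as_current_span("llm.completion") as llm_span:
            completion, usage = call_llm(question, documents)
            llm_span.set_attribute("llm.model", "gpt-4")
            llm_span.set_attribute("llm.input_tokens", usage["input_tokens"])
            llm_span.set_attribute("llm.output_tokens", usage["output_tokens"])

        return completion

Because the retrieval span and the LLM span share one trace, a slow or hallucinated answer can be traced back to the exact documents and parameters involved.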
2. Use Wide Structured Logs (Observability 2.0)
The key to effective observability is using wide, structured log events. Each log entry should be a complete record of what happened, containing all relevant context in one place. This makes debugging and analysis much simpler than jumping between different tools and data sources. For a more complete guide, consult Charity Majors’s Observability 2.0.
Here’s an example of such a log from an AI component:
{
"timestamp": "2024-02-17T10:15:30Z",
"service": "rag-service",
"span_id": "abc123",
"trace_id": "xyz789",
"level": "INFO",
"event_type": "llm_completion",
// Request Context
"user_id": "user_123",
"session_id": "session_456",
"request_id": "req_789",
"client_version": "2.1.0",
"feature": "customer_support_bot",
// Prompt Context
"prompt_template": "Given the context:\n{context}\n\nAnswer the question: {question}",
"prompt_variables": {
"question": "What is the company's return policy?",
"context": "Policy document sections 3.1-3.4"
},
"system_message": "You are a helpful customer service agent...",
// RAG Context
"retrieved_documents": [
{
"id": "doc_123",
"title": "Return Policy 2024",
"score": 0.89,
"chunk": "Returns must be initiated within 30 days..."
}
],
// Model Parameters
"model": "gpt-4",
"temperature": 0.7,
"top_p": 1.0,
"max_tokens": 500,
// Performance Metrics
"input_tokens": 245,
"output_tokens": 128,
"cached_tokens": 0,
"latency_ms": 750,
"rate_limited": false,
// Response Data
"completion": "Our return policy allows...",
"finish_reason": "stop",
"token_logprobs": [...],
// User Feedback
"user_rating": 5,
"feedback_text": "Very helpful and accurate",
"task_completed": true,
"required_clarification": false
}
With this rich context captured in your logs, systematic evaluation becomes possible:
- Track reliability metrics (error rates, latency) and connect them to user churn
- Monitor costs (token usage, cache effectiveness) and tie them to business ROI
- Measure quality (user satisfaction, hallucination rates) with statistical confidence
- Debug specific issues with full context and prevent recurrence
This is the foundation of evaluation-driven development—turning observability data into systematic improvement.
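Here is a small sketch of the kind of question wide events let you answer directly, assuming the events above land in a queryable store and are loaded as a list of dicts shaped like the example log. The thresholds and the function itself are illustrative, not a prescribed metric set:

def summarize_feature(events: list[dict], feature: str) -> dict:
    # Roll wide log events up into the reliability / cost / quality view a dashboard would show.
    rows = [e for e in events if e.get("feature") == feature]
    if not rows:
        return {}
    total_tokens = sum(e["input_tokens"] + e["output_tokens"] for e in rows)
    slow = [e for e in rows if e["latency_ms"] > 2000]  # illustrative latency SLO
    unhappy = [e for e in rows if e.get("user_rating") is not None and e["user_rating"] <= 2]
    return {
        "requests": len(rows),
        "p_slow": len(slow) / len(rows),
        "p_unhappy": len(unhappy) / len(rows),
        "avg_tokens_per_request": total_tokens / len(rows),
    }

The same events answer the CEO's question ("show me this user's request and the documents it retrieved") and the finance question ("what does this feature cost per request"), without a separate tool for each.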
What You Actually Need to Log
Here’s a comprehensive list of what you should track for every AI request, and why each matters:
Request Context
- User ID & Session ID: Crucial for understanding user patterns and debugging specific issues
- Request ID: The thread that ties everything together
- Timestamp: When did this happen? Essential for debugging and pattern analysis
- Client Info: API client version, environment details - helps track down client-specific issues
- Business Context: What feature/product/workflow was this request part of?
Prompt Engineering Context
- Prompt Template: The base template used - critical for versioning and debugging
- Prompt Variables: The actual values injected into your template
- System Messages: Any system-level instructions provided
- Previous Messages: For chat-based interactions, the conversation history
- RAG Document References: Which documents were retrieved and their relevance scores
Model Parameters
- Model Provider: Which provider did you use (OpenAI, Anthropic, etc.)?
- Model ID & Version: Which model did you use?
- Temperature: Higher values mean more random outputs
- Top P / Top K / Top A / Min P: Your sampling strategy affects output quality
- Max Tokens: Helps track truncation issues
- Stop Sequences: Custom stop conditions affect completion
- Presence / Frequency / Repetition Penalty: Request parameters that control how strongly the model avoids repeating itself
- Seed: Not supported by every provider, but useful for reproducibility when it is
- Logit Bias: (if used) Biases sampling to make specific tokens more or less likely
- Top Logprobs: (if used) Helps you understand the model's confidence by logging the next most likely tokens at each position (see the sketch after this list)
- Tools: The tools available to the model
- Response Format: The expected format (schema) of the response
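If you do log token logprobs, a simple way to use them is to flag low-confidence completions for review. A minimal sketch, assuming you have each generated token's logprob from your provider's response; the 0.6 threshold is purely illustrative:

import math

def completion_confidence(token_logprobs: list[float]) -> float:
    # Geometric mean probability of the generated tokens (1.0 = fully confident).
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Flag completions below an (illustrative) threshold for human review or re-generation.
if completion_confidence([-0.05, -0.21, -1.30]) < 0.6:
    print("low-confidence completion: route to review queue")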
Performance Metrics
- Input Tokens: Cost tracking and input size monitoring
- Output Tokens: Cost tracking and response size monitoring
- Cached Tokens: Are you effectively using your cache?
- Latency: End-to-end and component-wise timing
- Time to First Token: Useful for understanding user-perceived latency when streaming responses (see the sketch after this list)
- Time to First Record: Useful for understanding user-perceived latency when you are generating a list of records
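Time to first token is easy to capture if your client streams tokens. A minimal sketch; the wrapped iterator stands in for whatever streaming interface your LLM client exposes:

import time
from typing import Iterable, Iterator

def stream_with_timing(stream: Iterable[str]) -> Iterator[str]:
    # Wrap a token stream and record time-to-first-token and total latency.
    start = time.monotonic()
    first_token_at = None
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()
        yield chunk
    end = time.monotonic()
    # In production, attach these to the wide log event / span instead of printing.
    print({
        "time_to_first_token_ms": round((first_token_at - start) * 1000) if first_token_at is not None else None,
        "latency_ms": round((end - start) * 1000),
    })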
Response Data
- Completion Text: The actual response
- Finish Reason: Why did the model stop? (length, stop sequence, error?)
- Token Logprobs: Useful for understanding model confidence
- Error Details: Type, message, and stack trace if applicable
User Feedback & Quality Metrics
- Explicit Feedback: Thumbs up/down or ratings from users
- Implicit Feedback: Did they use the response or discard it?
- Task Success: Did the user achieve their goal?
- Follow-up Queries: Did they need to ask for clarification?
- Quality Flags: Internal quality checks (hallucination detection, toxicity, etc.)
Logging all of these ensures you can compute reliability, quality, and cost metrics for your AI components, and gives you the information you need to understand and improve them over time.
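One practical way to keep this list honest is to give the wide event an explicit schema in code, so missing fields show up during development rather than during an incident. A minimal sketch using a dataclass; the field set mirrors the categories above and is a starting point, not a complete spec:

from dataclasses import dataclass, asdict
import json

@dataclass
class LLMEvent:
    # Request context
    request_id: str
    user_id: str
    feature: str
    # Prompt context
    prompt_template: str
    prompt_variables: dict
    # Model parameters
    model: str
    temperature: float
    # Performance metrics
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: int = 0
    # Response data
    completion: str = ""
    finish_reason: str = ""
    # User feedback (often attached later, keyed by request_id)
    user_rating: int | None = None
    task_completed: bool | None = None

    def emit(self) -> None:
        # One wide, structured line per request; ship it wherever your logs already go.
        print(json.dumps(asdict(self)))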
Conclusion: From Observability to Evaluation-Driven Confidence
Next time your CEO asks about that hallucinated response, you’ll confidently answer: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
You’ll be able to pull up the exact request, see what documents were retrieved, check the model parameters, and understand if other users faced similar issues. All without spending thousands on another tool.
Stop buying AI observability tools. Start building systematic evaluation frameworks that transform observability data into actionable business insights.
Remember: AI components are just another part of your system, but they require systematic evaluation to move from “flying blind” to confident control. The best tool is often the one you already have—you just need to use it systematically.
Ready to Stop Guessing and Start Knowing?
If you’re ready to build systematic evaluation that transforms your AI observability from expensive dashboards to actionable insights, I can help your team implement this framework in just one week.
My intensive workshop walks your engineering team through building a robust evaluation system using your own data, in your own codebase. You’ll move from drowning in metrics to confidently answering: “That is a known failure mode. And here’s the dashboard showing exactly how, why, and what we’re improving next.”
Schedule a free consultation to discuss your team’s specific challenges →