Introduction: Becoming an AI Detective
Welcome back, future AI observability experts! In our previous chapters, we laid the groundwork for understanding AI systems by exploring structured logging, distributed tracing, and key metrics. We learned how to collect data that paints a picture of our AI’s health and performance.
Now, it’s time to put on our detective hats. Collecting data is crucial, but the real magic happens when we use that data to diagnose and fix problems. This chapter is all about debugging AI systems in production. Unlike traditional software, AI systems introduce unique challenges: non-determinism, the “black box” nature of models, and extreme sensitivity to input data and prompts. We’ll dive into how to systematically identify and resolve issues stemming from prompt engineering, model failures, and data quality.
By the end of this chapter, you’ll have a solid understanding of how to leverage your observability setup to pinpoint the root causes of AI system misbehavior, ensuring your AI applications run smoothly and reliably. Get ready to transform raw data into actionable insights!
The AI Debugging Landscape: Beyond Traditional Software
Debugging traditional software often involves stepping through code, inspecting variables, and understanding deterministic logic. Debugging AI, especially modern large language models (LLMs) or complex machine learning models, is a different beast entirely.
Unique Challenges of AI Debugging
- Non-Determinism: Given the same input, an LLM might produce slightly different outputs due to temperature settings, sampling strategies, or even different model versions. This makes reproducing bugs tricky.
- Black Box Nature: Understanding why a model made a specific prediction or generated a particular response can be incredibly difficult. We often only see the input and output.
- Prompt Sensitivity: Even a minor change in a prompt’s wording, punctuation, or structure can drastically alter an LLM’s behavior.
- Data Dependency: Model performance is inherently tied to the quality and distribution of its training and inference data. Issues can stem from data drift, quality degradation, or biases.
- Cost Implications: Debugging in production can be expensive, especially with token-based LLM APIs. Efficient debugging minimizes wasteful API calls.
Our goal isn’t just to find a line of buggy code, but to understand the reasoning (or lack thereof) behind an AI’s output. This requires a holistic view, correlating logs, traces, and metrics.
Root Cause Analysis (RCA) in AI Systems
When an AI system misbehaves, we need a structured approach to find the root cause. This typically involves:
- Observation: Identifying the symptom (e.g., “chatbot gives irrelevant answers,” “recommendation engine suggests bad items”).
- Data Collection: Gathering all relevant observability data (logs, traces, metrics) for the affected interactions.
- Hypothesis Generation: Based on the data, forming educated guesses about the potential cause (e.g., “Is the prompt too ambiguous? Is the model hallucinating? Is the input data malformed?”).
- Experimentation/Investigation: Testing hypotheses, often by modifying prompts, checking data distributions, or trying different model parameters.
- Resolution: Implementing a fix and verifying it.
Let’s explore common areas where AI issues arise and how to debug them.
Debugging Prompt Engineering Issues
Prompt engineering is the art and science of crafting inputs to guide an AI model to produce desired outputs. When an LLM goes off the rails, the prompt is often the first place to look.
What are Prompt Issues?
- Irrelevant Responses: The model doesn’t address the user’s query.
- Hallucinations: The model generates factually incorrect or nonsensical information.
- Safety Violations: The model produces harmful, biased, or inappropriate content.
- Jailbreaks: Users bypass safety mechanisms to elicit undesired behavior.
- Poor Quality/Style: The response is grammatically incorrect, poorly structured, or doesn’t match the desired tone.
- Token Limit Exceeded: The prompt or response is too long for the model’s context window.
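The last issue above, exceeding the token limit, is one you can guard against before ever calling the model. Here is a minimal pre-flight sketch; the 4-characters-per-token heuristic and the context-window values are rough assumptions, so use your provider's real tokenizer (e.g., tiktoken for OpenAI models) for exact counts:

```python
# Rough pre-flight check that a prompt plus the requested completion fits
# inside a model's context window. The chars/4 heuristic only approximates
# real tokenizer output -- use it for early warnings, not for billing.

ASSUMED_CONTEXT_WINDOWS = {"gpt-3.5-turbo": 16385}  # illustrative value only

def estimate_tokens(text: str) -> int:
    # Very rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_context(prompt: str, model: str, max_completion_tokens: int) -> bool:
    limit = ASSUMED_CONTEXT_WINDOWS.get(model, 4096)  # conservative default
    return estimate_tokens(prompt) + max_completion_tokens <= limit

print(fits_context("What is the capital of France?", "gpt-3.5-turbo", 150))
```

Running a check like this before the API call turns a cryptic provider error into an actionable log line.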
How Observability Helps with Prompt Debugging
By meticulously tracking prompt-response pairs and associated metadata (as discussed in previous chapters), you gain an invaluable “history book” of your AI’s interactions.
- Prompt Tracking: Store the exact input prompt sent to the model, including any system prompts, user messages, and context.
- Response Tracking: Store the raw output from the model.
- Metadata: Crucial context like user_id, session_id, model_version, prompt_template_version, temperature, top_p, and even internal chain-of-thought steps if using frameworks like LangChain or LlamaIndex.
When a user reports a bad response, you can instantly pull up the exact prompt that led to it. This allows you to:
- Reproduce the issue: Run the exact prompt again, perhaps in a development environment or a dedicated prompt playground.
- Identify patterns: Are similar prompts consistently causing issues? Is a specific prompt_template_version problematic?
- Iterate and compare: Experiment with prompt variations and compare their outputs. This is where prompt versioning becomes critical.
Best Practice: Prompt Versioning
Treat your prompts like code. Version control them! When you modify a prompt template, assign it a new version number (e.g., v1.0, v1.1, v2.0). Store this prompt_template_version as an attribute in your traces and logs. This allows you to:
- Compare model behavior across different prompt versions.
- Roll back to a previous prompt if a new one introduces regressions.
- Attribute performance changes to specific prompt updates.
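One lightweight way to put this into practice is a small in-code registry that maps a version string to its template, so the version travels with every rendered prompt. This is an illustrative sketch, not a standard API; the registry layout and function names are assumptions:

```python
# A minimal versioned prompt registry: every rendered prompt comes back
# with the version string you should attach to traces and logs.

PROMPT_TEMPLATES = {
    "v1.0": "You are a helpful assistant. Answer the question: {question}",
    "v1.1": "You are a concise, friendly assistant. Answer briefly: {question}",
}

ACTIVE_VERSION = "v1.1"

def render_prompt(question: str, version: str = ACTIVE_VERSION) -> tuple[str, str]:
    # Raising KeyError on an unknown version is deliberate: better to fail
    # loudly than silently serve an untracked prompt.
    template = PROMPT_TEMPLATES[version]
    return template.format(question=question), version

prompt, version = render_prompt("What is data drift?")
print(version)  # this is the value to store as prompt_template_version
```

In a real system the registry would live in version control or a config store, but the key idea is the same: a prompt never leaves your code without its version attached.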
Debugging Model Failures and Performance Degradation
Sometimes, the prompt is perfect, but the model itself isn’t performing as expected. This could manifest as:
- Low Accuracy/Relevance: The model’s core task (e.g., classification, summarization) isn’t meeting performance targets.
- Bias: The model exhibits unfair or discriminatory behavior.
- Unexpected Outputs: The model generates outputs that are structurally correct but semantically wrong or inconsistent.
- Slow Inference: The model is taking too long to respond, impacting user experience or costing more.
Leveraging Metrics for Model Debugging
- Model Performance Metrics: Monitor task-specific metrics (e.g., accuracy, F1-score, BLEU score, ROUGE score for summarization) over time. A sudden drop signals a problem.
- Latency Metrics: Track end-to-end response time, time-to-first-token, and tokens-per-second. Spikes indicate performance bottlenecks.
- Resource Utilization: For self-hosted models, monitor CPU, GPU, memory, and network usage. High utilization could point to scaling issues or inefficient model serving.
- Error Rates: Track API errors, model inference errors, or any exceptions raised during model execution.
Leveraging Traces for Model Debugging
Distributed traces are invaluable for understanding the internal workings (or at least the high-level steps) of a model call.
- Pinpointing Slow Components: If your AI application involves multiple steps (e.g., data retrieval, prompt construction, LLM call, post-processing), traces show you exactly which step is consuming the most time.
- Internal Model Steps (if instrumented): For custom models, you might instrument internal layers or functions as nested spans to see where computation time is spent. For LLMs, you can track token generation phases.
- Contextual Attributes: Traces can carry attributes like model_id, model_version, model_provider, and even specific model parameters (e.g., temperature, top_k), which are crucial for debugging.
Leveraging Logs for Model Debugging
Logs provide granular details and error messages that might not be captured in metrics or traces.
- Error Logs: Crucial for identifying exceptions during model loading, inference, or post-processing.
- Prediction Logs: For specific, high-stakes predictions, you might log the model’s raw output, confidence scores, or even feature importance values to understand its decision-making.
- Input/Output Validation Logs: Log instances where inputs are malformed or outputs don’t meet expected schema.
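An output-validation log can start as a simple structured check after the model call. A minimal sketch, where the schema and field names are illustrative assumptions:

```python
import json
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("llm.validation")

def validate_model_output(output: dict, required_keys: set[str]) -> bool:
    """Log a structured warning when the model's JSON output misses required keys."""
    missing = required_keys - output.keys()
    if missing:
        logger.warning(json.dumps({
            "event": "output_schema_violation",
            "missing_keys": sorted(missing),
        }))
        return False
    return True

ok = validate_model_output({"answer": "Paris"}, {"answer", "confidence"})
print(ok)  # → False: "confidence" is missing
```

Because the warning is emitted as JSON, your log backend can aggregate schema violations per model_version or prompt_template_version instead of leaving them buried in free text.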
Debugging Data-Related Problems
AI models are only as good as the data they process. Issues with input data can lead to catastrophic model failures, even if the model and prompt are technically sound.
Data Drift
Concept: Data drift occurs when the statistical properties of the input data to a model change over time, leading to a degradation in model performance. For example, if your sentiment analysis model was trained on formal text but now processes informal social media slang, it might “drift.”
Impact: Reduced accuracy, increased error rates, biased predictions.
Debugging:
- Monitor Input Data Distributions: Track the distribution of key features in your model’s input. Tools like WhyLabs or Arize specialize in this, but you can also build custom checks.
- Monitor Output Data Distributions: Changes in the distribution of model outputs (e.g., a sentiment model suddenly predicting mostly negative sentiment) can indicate drift.
- Compare with Training Data: Regularly compare production input data distributions against your training data distributions.
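A simple way to quantify drift on a numeric feature is the Population Stability Index (PSI), which compares binned distributions of reference (training) and production data. The following is a minimal pure-Python sketch; the bin count and the commonly cited ~0.2 alert threshold are conventions, not hard rules:

```python
import math

def psi(reference: list[float], production: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a numeric feature."""
    lo = min(min(reference), min(production))
    hi = max(max(reference), max(production))
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins so the log term stays defined
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, prod_p = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))

# Identical distributions give PSI near 0; shifted production data makes it grow.
print(psi([1, 2, 3, 4, 5] * 20, [1, 2, 3, 4, 5] * 20))
```

Computing PSI per feature on a schedule, and alerting when it climbs, is a cheap first line of defense before adopting a dedicated drift-monitoring tool.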
Data Quality Issues
Concept: Problems like missing values, incorrect data types, outliers, or corrupted data records.
Impact: Model errors, unexpected behavior, biased predictions.
Debugging:
- Input Validation Logs: Log instances where input data fails validation checks before being fed to the model.
- Data Pre-processing Tracing: If your data goes through a complex pre-processing pipeline, instrument each step with traces to identify where data might be transformed incorrectly or dropped.
- Feature-level Metrics: Track metrics for individual features, such as number of nulls, unique values, or value ranges.
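These feature-level checks don't require a dedicated tool to get started. Here is a minimal sketch that profiles one batch of records; the field names are illustrative:

```python
def profile_feature(records: list[dict], feature: str) -> dict:
    """Per-feature quality metrics for a batch: null count, distinct values, range."""
    values = [r.get(feature) for r in records]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

batch = [{"age": 34}, {"age": None}, {"age": 290}]  # 290 is an obvious outlier
print(profile_feature(batch, "age"))
```

Emitting these numbers as metrics per batch lets you alert on a spike in null_count or an out-of-range max long before the model's accuracy visibly degrades.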
Correlating Observability Data for AI Debugging
The true power of observability comes from correlating logs, traces, and metrics. Each provides a different lens, and together they give you a complete picture.
Imagine a scenario: A user reports that your LLM-powered customer service bot gave a “very unhelpful and aggressive” response.
Start with the Trace:
- Find the trace associated with the user’s interaction (using user_id or session_id as a search filter).
- The trace shows the entire request flow: user input received, prompt constructed, LLM API call made, response received, post-processing.
- You immediately see the exact llm.prompt.input and llm.response.output attributes captured in the main span.
- You notice the llm.model.name and llm.model.version used for that specific interaction.
- You also see the llm.latency_ms was unusually high, perhaps indicating a slow API call.
Dive into Logs:
- Using the trace_id from the trace, filter your centralized logs.
- You might find log entries indicating:
  - “Safety filter triggered: aggressiveness_score > 0.8” (if you have internal safety checks).
  - “Prompt template v1.2 used.”
  - An error from the LLM provider API about a malformed request, or a specific finish_reason other than stop.
Check Metrics:
- Look at the overall model_error_rate metric around the time of the incident. Was there a spike?
- Check latency_p99 for the specific model version. Was it consistently high, or an isolated incident?
- Review token_usage_cost for that user/session. Was an unusually high number of tokens consumed for a simple query?
Combining these insights:
- The trace shows the exact prompt and response, confirming the “aggressive” output.
- The logs indicate a safety filter was triggered, but perhaps not strongly enough, or the threshold is too high. They also confirm the prompt template version.
- The metrics show high latency, which might be a symptom rather than the root cause, but it points to potential performance issues.
This correlation allows you to form hypotheses:
- “Prompt template v1.2 might be too aggressive, especially with certain user inputs.”
- “The safety filter threshold needs adjustment.”
- “The model itself might be prone to aggressive responses under certain conditions.”
This systematic approach helps you move from symptoms to root causes efficiently.
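When your logs are structured JSON lines that carry the trace_id, pulling every record for one incident takes only a few lines of code. Here is a sketch against a hypothetical log format; real observability backends expose this as a search filter, but the mechanics are the same:

```python
import json

def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return every structured log record belonging to one trace."""
    records = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured or corrupted lines
        if record.get("trace_id") == trace_id:
            records.append(record)
    return records

raw = [
    '{"trace_id": "abc123", "msg": "Prompt template v1.2 used."}',
    '{"trace_id": "xyz789", "msg": "unrelated request"}',
    '{"trace_id": "abc123", "msg": "Safety filter triggered: aggressiveness_score > 0.8"}',
]
for rec in logs_for_trace(raw, "abc123"):
    print(rec["msg"])
```

This is exactly why structured logging pays off: the same trace_id that stitches spans together also stitches every log line of the incident into one coherent story.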
Step-by-Step Implementation: Instrumenting an LLM for Debugging with OpenTelemetry
Let’s get practical! We’ll extend our OpenTelemetry instrumentation to capture rich debugging information for an LLM call. We’ll use the OpenAI API as an example, but the principles apply to any LLM.
Prerequisites:
- Python 3.8+
- OpenTelemetry Python SDK installed and configured (refer to previous chapters for basic setup).
- openai Python library installed (pip install openai).
- An OpenAI API key (set as an environment variable OPENAI_API_KEY).
First, ensure your OpenTelemetry environment is set up. If you’re following from previous chapters, you likely have this. If not, here’s a quick refresher for a console exporter:
# observability_setup.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

def setup_observability(service_name: str):
    # Resource for identifying your service
    resource = Resource.create({"service.name": service_name, "service.version": "1.0.0"})

    # Setup Tracing
    provider = TracerProvider(resource=resource)
    processor = SimpleSpanProcessor(ConsoleSpanExporter())
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    # Setup Metrics (optional for this chapter, but good practice)
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    meter_provider = MeterProvider(metric_readers=[reader], resource=resource)
    # You would normally set this globally: metrics.set_meter_provider(meter_provider)
    # For this example, we'll focus on traces.

    print(f"Observability setup complete for service: {service_name}")

# In a real application, you'd call this once at startup
# setup_observability("llm-debugging-service")
This observability_setup.py provides a basic OpenTelemetry setup. For production, you’d use an OTLP exporter to send data to a collector or a specific backend.
Now, let’s create our LLM application file, llm_app.py, and instrument it.
Step 1: Basic LLM Call with OpenTelemetry Span
We’ll start by making a simple LLM call and wrapping it in a span.
# llm_app.py
import os
import time

from openai import OpenAI
from opentelemetry import trace

from observability_setup import setup_observability  # Import our setup function

# 1. Setup OpenTelemetry for our service
setup_observability("llm-debug-agent")

# 2. Get a tracer instance
tracer = trace.get_tracer(__name__)

# 3. Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def get_llm_response_basic(prompt: str) -> str:
    # 4. Create a span for the LLM interaction
    with tracer.start_as_current_span("llm_call.openai") as span:
        print(f"Calling LLM with prompt: '{prompt}'")
        try:
            start_time = time.time()
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",  # Substitute your preferred chat model
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.7,
                max_tokens=150
            )
            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000

            model_output = response.choices[0].message.content
            print(f"LLM Response: '{model_output}'")

            # 5. Add basic attributes to the span
            span.set_attribute("llm.model.name", "gpt-3.5-turbo")
            span.set_attribute("llm.prompt.input", prompt)
            span.set_attribute("llm.response.output", model_output)
            span.set_attribute("llm.latency_ms", latency_ms)

            return model_output
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            print(f"Error during LLM call: {e}")
            return f"Error: {e}"

if __name__ == "__main__":
    print("\n--- Basic LLM Call ---")
    get_llm_response_basic("What is the capital of France?")
    time.sleep(1)  # Give exporter time to send spans
Explanation:
- We import setup_observability and call it to initialize our tracer.
- tracer = trace.get_tracer(__name__) gets a tracer instance.
- with tracer.start_as_current_span("llm_call.openai") as span: creates a new span, making it the current active span. All subsequent operations within this with block will be part of this span.
- Inside the try block, we make the OpenAI API call and calculate latency.
- span.set_attribute(...) adds key-value pairs to our span, providing contextual debugging information. We’re capturing the model name, input prompt, output response, and latency.
- Error handling records exceptions and sets the span status to ERROR.
Run this file (python llm_app.py). You’ll see the LLM output and then, in your console, the OpenTelemetry span details, including the attributes we added.
Step 2: Adding More Granular Debugging Attributes
Now, let’s enhance our instrumentation to capture even more details crucial for debugging prompt and model issues. This includes token usage, prompt template version, and simulated user/session IDs.
# llm_app.py (continued from above)
# ... (imports and setup_observability, tracer, client remain the same) ...

def get_llm_response_debug(prompt: str, user_id: str, session_id: str, prompt_template_version: str) -> str:
    # 1. Create a span for the LLM interaction
    with tracer.start_as_current_span("llm_call.openai.debug") as span:
        print(f"\nCalling LLM for user '{user_id}' with prompt: '{prompt}' (Template: {prompt_template_version})")

        # 2. Set contextual attributes from function parameters
        span.set_attribute("user.id", user_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("prompt.template.version", prompt_template_version)

        try:
            start_time = time.time()
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.7,
                max_tokens=150
            )
            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000

            model_output = response.choices[0].message.content
            print(f"LLM Response: '{model_output}'")

            # 3. Add comprehensive LLM-specific attributes
            span.set_attribute("llm.model.name", "gpt-3.5-turbo")
            span.set_attribute("llm.prompt.input", prompt)
            span.set_attribute("llm.response.output", model_output)
            span.set_attribute("llm.latency_ms", latency_ms)
            span.set_attribute("llm.temperature", 0.7)  # Capture model parameters
            span.set_attribute("llm.max_tokens", 150)

            # 4. Capture token usage from the API response
            if response.usage:
                span.set_attribute("llm.token.count.input", response.usage.prompt_tokens)
                span.set_attribute("llm.token.count.output", response.usage.completion_tokens)
                span.set_attribute("llm.token.count.total", response.usage.total_tokens)

                # Calculate cost (example rates only -- provider pricing changes,
                # so look up the current rates for your model):
                # Input: $0.0005 / 1K tokens, Output: $0.0015 / 1K tokens
                input_cost = (response.usage.prompt_tokens / 1000) * 0.0005
                output_cost = (response.usage.completion_tokens / 1000) * 0.0015
                total_cost = input_cost + output_cost
                span.set_attribute("llm.cost_usd", total_cost)

            return model_output
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            print(f"Error during LLM call: {e}")
            return f"Error: {e}"

if __name__ == "__main__":
    # ... (previous basic call) ...
    print("\n--- Debugging LLM Call ---")
    get_llm_response_debug(
        prompt="Explain quantum entanglement in simple terms for a 5-year-old.",
        user_id="user_123",
        session_id="sess_abc",
        prompt_template_version="v1.0"
    )
    get_llm_response_debug(
        prompt="Tell me a very scary ghost story.",
        user_id="user_456",
        session_id="sess_def",
        prompt_template_version="v1.1_scary_story"
    )
    time.sleep(1)  # Give exporter time to send spans
Explanation of new attributes:
- user.id and session.id: Crucial for filtering traces by specific users or interactions when debugging a reported issue.
- prompt.template.version: Enables you to track which version of your prompt templates was used, vital for A/B testing prompts and rolling back.
- llm.temperature, llm.max_tokens: Capturing model generation parameters helps you understand if specific settings are contributing to issues.
- llm.token.count.input, llm.token.count.output, llm.token.count.total: Essential for cost monitoring and understanding the verbosity of prompts/responses.
- llm.cost_usd: A calculated attribute to directly track the cost of each API call, enabling granular cost analysis and debugging cost spikes.
Run python llm_app.py again. You’ll now see much richer span details, allowing you to pinpoint issues more effectively. If a user reports a “bad” response, you can search your observability platform for their user.id and session.id, examine the llm.prompt.input and llm.response.output, check the prompt.template.version, and see if any specific model parameters or high costs are correlated.
Mini-Challenge: Adding a Custom Safety Score to Traces
Imagine your AI assistant needs to ensure responses are not overly aggressive or negative. You’ve implemented a simple (simulated) safety score.
Challenge:
Modify the get_llm_response_debug function to calculate a simulated safety_score for the model_output and add it as a new span attribute named llm.response.safety_score.
Hint:
- Define a simple function, calculate_safety_score(text: str) -> float, that returns a random float between 0.0 and 1.0 (or more complex logic if you wish).
- Call this function with model_output after you’ve received the response.
- Use span.set_attribute("llm.response.safety_score", score) to add it to the span.
What to observe/learn: You’ll see how easy it is to extend your observability data with custom, AI-specific metrics that are directly relevant to your application’s domain, making debugging more powerful. This allows you to track and potentially alert on responses that exceed a certain safety threshold.
Common Pitfalls & Troubleshooting
Even with a great observability setup, debugging AI can have its traps.
- Over-logging vs. Under-logging:
- Pitfall: Logging everything can lead to massive data volumes, high storage costs, and alert fatigue. Logging too little leaves you blind.
- Troubleshooting: Define a clear logging strategy. Use structured logging for critical events and errors. For traces, focus on capturing key attributes that help with reproduction and root cause analysis. Consider sampling strategies for high-volume, low-criticality events.
- Lack of Contextual Metadata:
- Pitfall: If your traces and logs lack user_id, session_id, model_version, or prompt_template_version, you’ll struggle to connect a reported issue back to a specific interaction.
- Troubleshooting: Make it a strict requirement to include these essential attributes in all AI-related spans and log records. Pass them down through your application context.
- Ignoring Prompt Versioning:
- Pitfall: Changing prompts frequently without tracking versions makes it impossible to know which prompt led to a particular model behavior or to compare performance over time.
- Troubleshooting: Implement a versioning system for all your prompt templates. Store the active prompt_template_version in your observability data. Consider A/B testing prompt versions and linking results to these versions.
- Siloed Debugging Data:
- Pitfall: Having logs in one system, traces in another, and metrics in a third makes correlation extremely difficult and time-consuming.
- Troubleshooting: Use open standards like OpenTelemetry to ensure all your observability data shares a common context (like trace_id and span_id). Consolidate this data into a unified observability platform (e.g., SigNoz, Datadog, Grafana with Loki/Prometheus) for easy searching and correlation.
Summary: Your AI Debugging Superpowers
Congratulations! You’ve just gained some serious AI debugging superpowers. In this chapter, we’ve explored:
- The unique challenges of debugging AI systems, from non-determinism to prompt sensitivity.
- How to systematically approach Root Cause Analysis for AI.
- Specific strategies for debugging prompt engineering issues by tracking prompt-response pairs and versioning.
- Leveraging metrics, traces, and logs to diagnose model failures and performance degradation.
- Identifying and addressing data-related problems like data drift and quality issues.
- The critical importance of correlating all observability data to gain a holistic view and pinpoint root causes.
- A practical, step-by-step example of instrumenting an LLM call with OpenTelemetry to capture rich debugging information.
Remember, effective AI debugging isn’t just about fixing bugs; it’s about understanding your AI system’s behavior, improving its reliability, and building user trust. By embracing comprehensive observability and a systematic debugging approach, you’re well on your way to mastering MLOps.
In the next chapter, we’ll build on this foundation by diving into Advanced Monitoring and Alerting Strategies, ensuring you’re not just reacting to problems but proactively identifying and preventing them before they impact users.
References
- OpenTelemetry Python Documentation: https://opentelemetry.io/docs/languages/python/
- OpenAI API Documentation: https://platform.openai.com/docs/api-reference
- SigNoz - LLM Tracing with OpenTelemetry: https://signoz.io/docs/traces/llm-observability/
- Microsoft Azure - Production AI Practices: https://github.com/microsoft/AZD-for-beginners/blob/main/docs/chapter-08-production/production-ai-practices.md