Introduction

Welcome to the grand finale of our AI Observability journey! In previous chapters, we’ve explored the theoretical foundations of logging, tracing, and metrics for AI systems, understanding what they are and why they’re crucial. Now, it’s time to roll up our sleeves and bring these concepts to life with a hands-on project.

This chapter will guide you through building a complete, end-to-end observability pipeline for a simple Large Language Model (LLM) application. We’ll instrument our Python-based LLM service using OpenTelemetry for distributed tracing, custom metrics, and structured logging. Then, we’ll deploy an observability backend (SigNoz, an OpenTelemetry-native platform built on ClickHouse) using Docker to collect, store, and visualize all our precious AI operational data. Get ready to see your AI system’s inner workings like never before!

By the end of this chapter, you’ll have a working example of how to:

  • Set up an observability stack using Docker.
  • Instrument a Python LLM application with OpenTelemetry.
  • Track prompts, responses, latency, and token usage.
  • Visualize traces, logs, and metrics in a unified dashboard.
  • Gain practical experience in MLOps observability.

Let’s turn theory into tangible, observable reality!

Project Overview: Observing an LLM Chatbot

Our mission is to build a minimal LLM chatbot and make it fully observable. Imagine this chatbot as a core component of a larger AI application. We’ll focus on tracking the key interactions: the user’s prompt, the LLM’s response, the time it took, and the resources consumed (like tokens).

The Observability Stack

To achieve this, we’ll leverage a powerful open-source stack:

  1. OpenTelemetry (OTel): The gold standard for vendor-neutral instrumentation. We’ll use its Python SDK to generate traces, metrics, and logs from our LLM application.
  2. SigNoz: An OpenTelemetry-native observability platform. It provides a unified UI for traces, metrics, and logs, making it an excellent choice for a hands-on project. Under the hood, SigNoz uses ClickHouse to store traces, metrics, and logs, supports Prometheus-style (PromQL) metric queries, and offers Grafana-style dashboards of its own.
  3. Python 3.12+: Our primary language for the LLM application.
  4. Ollama (Optional, but Recommended): A fantastic tool for running open-source LLMs locally. This allows us to run the project entirely on our machine without needing external API keys, though you can easily swap it for OpenAI or another cloud LLM.
  5. Docker & Docker Compose: To orchestrate and run our observability backend services effortlessly.

Architectural Diagram

Let’s visualize how these components will fit together:

    flowchart TD
        subgraph LLM_Application["LLM Application"]
            A[User Request] --> B{Process Prompt}
            B --> C[Call LLM]
            C --> D{Generate Response}
            D --> E[Return to User]
            B -.->|Generates Spans, Metrics, Logs| F(OpenTelemetry SDK)
            C -.->|Generates Spans, Metrics, Logs| F
            D -.->|Generates Spans, Metrics, Logs| F
        end
        subgraph Observability_Backend["Observability Backend"]
            F --> G[OpenTelemetry Collector]
            G --> H[SigNoz Query Service]
            G --> I[Prometheus]
            H --> J[ClickHouse]
            I --> K[Grafana]
            H --> K
        end
        K --> L[Developer / MLOps Engineer]
        L --> F

Explanation of the Flow:

  1. LLM Application: Our Python application handles user requests, processes prompts, makes calls to the LLM (Ollama or OpenAI), and generates responses.
  2. OpenTelemetry SDK: As the application runs, the OTel SDK intercepts key operations (prompt processing, LLM calls) and generates observability data (traces, spans, metrics, logs).
  3. OpenTelemetry Collector: The OTel SDK exports this data over OTLP to the OTel Collector, which acts as a central receiving and routing pipeline. The collector forwards traces and logs to SigNoz’s query service (backed by ClickHouse) and metrics to Prometheus.
  4. Prometheus: Scrapes metrics from the OTel Collector and stores them.
  5. SigNoz / ClickHouse: Stores the traces and logs collected by the OTel Collector.
  6. Grafana: Connects to Prometheus for metric visualization and can be integrated with SigNoz for a unified dashboard experience. SigNoz itself also provides robust dashboards.
  7. Developer/MLOps Engineer: Uses the SigNoz UI to visualize, query, and analyze the observability data, and feeds those insights back into improving the application and its instrumentation.

Step-by-Step Implementation

Let’s get our hands dirty!

Step 1: Set Up Your Environment

First, ensure you have Python 3.12+ and Docker Desktop installed on your system.

  1. Create Project Directory: Let’s start by creating a dedicated directory for our project.

    mkdir ai-observability-project
    cd ai-observability-project
    
  2. Set Up Python Virtual Environment: It’s always a good practice to use a virtual environment to manage project dependencies.

    python3.12 -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
    

    You should see (venv) prefixing your terminal prompt, indicating the virtual environment is active.

  3. Install Observability Backend (SigNoz): We’ll use SigNoz as our observability platform. It’s easy to set up with Docker Compose.

    First, ensure Docker Desktop is running. Then, download the SigNoz docker-compose.yaml file.

    # Create a directory for SigNoz config
    mkdir signoz-setup
    cd signoz-setup
    
    # Download the SigNoz install script and run it
    # As of 2026-03-20, this is the recommended way from SigNoz docs
    curl -o install.sh https://raw.githubusercontent.com/SigNoz/signoz/main/deploy/install.sh && chmod +x install.sh
    sudo ./install.sh --mode docker
    
    cd .. # Go back to the main project directory
    

    Explanation:

    • We create a signoz-setup directory to keep things organized.
    • The curl command downloads SigNoz’s official installation script.
    • chmod +x install.sh makes the script executable.
    • sudo ./install.sh --mode docker runs the script, which will download the necessary docker-compose.yaml and start all SigNoz services (ClickHouse, Query Service, OpenTelemetry Collector, Prometheus, Grafana, etc.). This might take a few minutes depending on your internet speed.

    Once it’s done, SigNoz should be accessible at http://localhost:3301. Open it in your browser and complete the initial setup (create an account).

    Important Note: The SigNoz Docker Compose setup includes its own OpenTelemetry Collector, which is where our application will send its observability data. It also includes Prometheus and Grafana, integrated within the SigNoz UI.
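Before moving on, it can save debugging time to confirm the relevant ports are actually listening. Here is a small stdlib probe; the ports checked (4318 for OTLP/HTTP ingest, 3301 for the SigNoz UI) are the defaults used throughout this chapter, so adjust them if your setup differs:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 4318: the collector's OTLP/HTTP endpoint; 3301: the SigNoz UI
for port in (4318, 3301):
    status = "open" if port_open("localhost", port) else "closed"
    print(f"localhost:{port} is {status}")
```

If either port reports closed, revisit the Docker Compose setup before instrumenting the app.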

Step 2: Create a Basic LLM Application

Let’s write a simple Python script that interacts with an LLM. We’ll use ollama for this example, which is great for local testing. If you prefer, you can swap this with openai or any other LLM client.

  1. Install LLM Client: If using Ollama, first download and install Ollama for your OS from https://ollama.com/download. Then, pull a model (e.g., llama2):

    ollama pull llama2
    

    Then install the Python client:

    pip install ollama
    

    If using OpenAI:

    pip install openai
    

    And make sure you have your OPENAI_API_KEY set as an environment variable.

  2. Create app.py: Create a file named app.py in your main ai-observability-project directory.

    # app.py
    import ollama # or import openai for OpenAI API
    
    def get_llm_response(prompt: str, model_name: str = "llama2"):
        """
        Interacts with the LLM to get a response.
        """
        print(f"Sending prompt to LLM ({model_name}): {prompt}")
        try:
            # For Ollama:
            response = ollama.chat(model=model_name, messages=[{'role': 'user', 'content': prompt}])
            llm_output = response['message']['content']
            # For OpenAI (uncomment and modify if using OpenAI):
            # client = openai.OpenAI()
            # response = client.chat.completions.create(
            #     model=model_name,
            #     messages=[{"role": "user", "content": prompt}]
            # )
            # llm_output = response.choices[0].message.content
    
            print(f"LLM Response: {llm_output[:100]}...") # Print first 100 chars
            return llm_output
        except Exception as e:
            print(f"Error calling LLM: {e}")
            return f"Error: {e}"
    
    if __name__ == "__main__":
        print("Welcome to the simple LLM chatbot! Type 'exit' to quit.")
        while True:
            user_prompt = input("You: ")
            if user_prompt.lower() == 'exit':
                break
            response = get_llm_response(user_prompt)
            print(f"Bot: {response}")
    

    Explanation:

    • This is a basic Python script that takes user input and sends it to an LLM.
    • get_llm_response encapsulates the LLM interaction logic.
    • The if __name__ == "__main__": block provides a simple command-line interface for interaction.
    • Crucially: This version has no observability yet. We’re building up!

Step 3: OpenTelemetry Instrumentation - Tracing

Now, let’s add distributed tracing to our LLM application. This will allow us to see the flow of a request, including the LLM call, as a series of connected spans.

  1. Install OpenTelemetry SDK:

    pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http opentelemetry-instrumentation-logging
    

    Explanation:

    • opentelemetry-sdk: The core OpenTelemetry SDK.
    • opentelemetry-exporter-otlp-proto-http: This is the exporter we’ll use to send our OTel data to the SigNoz OpenTelemetry Collector, which listens on HTTP by default.
    • opentelemetry-instrumentation-logging: To automatically link Python’s standard logging to our traces.
  2. Modify app.py for Tracing: We’ll add the necessary OpenTelemetry initialization and wrap our LLM interaction in spans.

    # app.py (updated for tracing)
    import ollama # or import openai for OpenAI API
    import os
    import logging
    
    # --- OpenTelemetry Imports ---
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.logging import LoggingInstrumentor
    
    # --- Configure Logging ---
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)
    
    # --- OpenTelemetry Setup (Global) ---
    def setup_opentelemetry():
        # Define a resource for our service
        resource = Resource.create({
            "service.name": "llm-chatbot-service",
            "service.version": "1.0.0",
            "environment": "production"
        })
    
        # Set up a TracerProvider
        provider = TracerProvider(resource=resource)
    
        # Configure OTLP HTTP exporter
        # SigNoz's OTel Collector typically listens on 4318 for HTTP
        otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    
        # Add a BatchSpanProcessor to send spans in batches
        provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    
        # Set the global tracer provider
        trace.set_tracer_provider(provider)
    
        # Instrument Python's logging to automatically include trace/span IDs
        LoggingInstrumentor().instrument(set_logging_format=True)
        logger.info("OpenTelemetry TracerProvider initialized.")
    
    # Get a tracer for our application
    tracer = trace.get_tracer(__name__)
    
    # --- Original LLM Function (with tracing added) ---
    def get_llm_response(prompt: str, model_name: str = "llama2"):
        """
        Interacts with the LLM to get a response, now with tracing.
        """
        # Create a new span for the entire LLM interaction
        with tracer.start_as_current_span("llm.interaction") as span:
            span.set_attribute("user.prompt", prompt)
            span.set_attribute("llm.model_name", model_name)
            span.set_attribute("session.id", "user-123") # Example: Track a session ID
    
            logger.info(f"Sending prompt to LLM ({model_name}): {prompt}")
            try:
                # Create a child span for the actual LLM API call
                with tracer.start_as_current_span("llm.api_call") as api_span:
                    api_span.set_attribute("llm.provider", "ollama") # Or "openai"
                    # For Ollama:
                    response = ollama.chat(model=model_name, messages=[{'role': 'user', 'content': prompt}])
                    llm_output = response['message']['content']
                    # For OpenAI (uncomment and modify if using OpenAI):
                    # client = openai.OpenAI()
                    # response = client.chat.completions.create(
                    #     model=model_name,
                    #     messages=[{"role": "user", "content": prompt}]
                    # )
                    # llm_output = response.choices[0].message.content
    
                    api_span.set_attribute("llm.response.length", len(llm_output))
                    api_span.set_attribute("llm.response.truncated", llm_output[:200]) # Log part of response
                    # Add an event to the span for significant actions
                    api_span.add_event("llm_response_received")
    
                span.set_attribute("llm.full_response", llm_output)
                logger.info(f"LLM Response: {llm_output[:100]}...")
                return llm_output
            except Exception as e:
                logger.error(f"Error calling LLM: {e}", exc_info=True)
                span.record_exception(e)  # attach full exception details to the span
                span.set_status(trace.Status(trace.StatusCode.ERROR, description=str(e)))
                return f"Error: {e}"
    
    if __name__ == "__main__":
        setup_opentelemetry() # Initialize OTel before main loop
        logger.info("Welcome to the simple LLM chatbot! Type 'exit' to quit.")
        while True:
            user_prompt = input("You: ")
            if user_prompt.lower() == 'exit':
                break
            response = get_llm_response(user_prompt)
            print(f"Bot: {response}")
    

    Explanation of Changes:

    • setup_opentelemetry() function:
      • Initializes TracerProvider with a Resource that identifies our service (llm-chatbot-service). This is crucial for filtering and organizing data in your observability backend.
      • Configures OTLPSpanExporter to send traces to http://localhost:4318/v1/traces. This is the default HTTP endpoint for the OpenTelemetry Collector within SigNoz.
      • Uses BatchSpanProcessor for efficient, asynchronous span export.
      • trace.set_tracer_provider(provider) makes this provider globally available.
      • LoggingInstrumentor().instrument() automatically enriches our Python logging records with trace and span IDs, allowing for seamless correlation between logs and traces.
    • tracer = trace.get_tracer(__name__): Obtains a tracer instance.
    • with tracer.start_as_current_span(...): This is the core of tracing.
      • llm.interaction is the parent span, representing the entire process of handling a user’s prompt and getting an LLM response.
      • llm.api_call is a child span, specifically tracking the time spent on the actual LLM API call. This helps pinpoint latency issues.
      • .set_attribute(): Adds key-value pairs to the span, providing rich context like the user.prompt, llm.model_name, and llm.full_response.
      • .add_event(): Marks a specific point in time within a span, like llm_response_received.
      • Error handling now sets the span status to ERROR and logs the exception with exc_info=True, which ensures the full stack trace is captured in our logs and linked to the span.
    • setup_opentelemetry() call: Added to if __name__ == "__main__": to ensure OTel is initialized before any LLM interaction.
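To build intuition for the parent/child relationship between llm.interaction and llm.api_call, here is a toy stand-in written with only the standard library (this is not the real OTel API; just an illustration of how nested context managers produce nested, timed spans):

```python
# Toy nested-span sketch (illustration only, not the OpenTelemetry SDK).
import time
from contextlib import contextmanager

_open_spans = []   # stack of currently open spans, innermost last
finished = []      # completed spans, in the order they finish

@contextmanager
def toy_span(name):
    # A child span's parent is whatever span is open when it starts.
    record = {"name": name, "parent": _open_spans[-1]["name"] if _open_spans else None}
    _open_spans.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _open_spans.pop()
        finished.append(record)

with toy_span("llm.interaction"):
    with toy_span("llm.api_call"):
        time.sleep(0.01)   # stand-in for the LLM round-trip

for s in finished:
    print(s["name"], "parent:", s["parent"], f"{s['duration_s']:.3f}s")
```

The real SDK works the same way at this level: the span that is "current" when a new span starts becomes its parent, and the parent's duration always covers its children.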

Step 4: OpenTelemetry Instrumentation - Metrics

Next, let’s add custom metrics to track key performance indicators for our LLM application.

  1. Metrics Exporter: No new packages are needed; the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages we installed in Step 3 already include the metric components.

  2. Modify app.py for Metrics: We’ll add a MeterProvider and create a Counter for total requests and a Histogram for latency.

    # app.py (updated for metrics)
    import ollama
    import os
    import logging
    import time # For measuring latency
    
    # --- OpenTelemetry Imports ---
    from opentelemetry import trace, metrics
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.logging import LoggingInstrumentor
    
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
    
    # --- Configure Logging ---
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)
    
    # --- OpenTelemetry Setup (Global) ---
    def setup_opentelemetry():
        resource = Resource.create({
            "service.name": "llm-chatbot-service",
            "service.version": "1.0.0",
            "environment": "production"
        })
    
        # --- Tracing Setup ---
        provider = TracerProvider(resource=resource)
        otlp_trace_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
        provider.add_span_processor(BatchSpanProcessor(otlp_trace_exporter))
        trace.set_tracer_provider(provider)
        LoggingInstrumentor().instrument(set_logging_format=True)
        logger.info("OpenTelemetry TracerProvider initialized.")
    
        # --- Metrics Setup ---
        # Configure OTLP HTTP exporter for metrics
        otlp_metric_exporter = OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
        # Use a PeriodicExportingMetricReader to send metrics regularly
        metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=5000) # Every 5 seconds
    
        meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
        metrics.set_meter_provider(meter_provider)
        logger.info("OpenTelemetry MeterProvider initialized.")
    
    # Get a tracer and meter for our application
    tracer = trace.get_tracer(__name__)
    meter = metrics.get_meter(__name__)
    
    # --- Define Metrics Instruments ---
    llm_request_counter = meter.create_counter(
        name="llm_requests_total",
        description="Total number of LLM requests",
        unit="1"
    )
    
    llm_latency_histogram = meter.create_histogram(
        name="llm_request_latency_seconds",
        description="Latency of LLM requests in seconds",
        unit="s"
    )
    
    llm_input_tokens_counter = meter.create_counter(
        name="llm_input_tokens_total",
        description="Total number of input tokens sent to LLM",
        unit="tokens"
    )
    
    llm_output_tokens_counter = meter.create_counter(
        name="llm_output_tokens_total",
        description="Total number of output tokens received from LLM",
        unit="tokens"
    )
    
    # --- Original LLM Function (with tracing and metrics added) ---
    def get_llm_response(prompt: str, model_name: str = "llama2"):
        """
        Interacts with the LLM to get a response, now with tracing and metrics.
        """
        start_time = time.time() # Start timing for latency
    
        # Define common attributes for metrics and traces
        common_attributes = {
            "llm.model_name": model_name,
            "session.id": "user-123" # Example: Track a session ID
        }
    
        # Increment total request counter
        llm_request_counter.add(1, common_attributes)
    
        with tracer.start_as_current_span("llm.interaction", attributes=common_attributes) as span:
            span.set_attribute("user.prompt", prompt)
    
            logger.info(f"Sending prompt to LLM ({model_name}): {prompt}")
            llm_output = ""
            try:
                with tracer.start_as_current_span("llm.api_call", attributes=common_attributes) as api_span:
                    api_span.set_attribute("llm.provider", "ollama") # Or "openai"
    
                    # For Ollama:
                    # num_predict caps the generated tokens to keep local runs fast
                    response = ollama.chat(model=model_name, messages=[{'role': 'user', 'content': prompt}], options={'num_predict': 100})
                    llm_output = response['message']['content']
                    input_tokens = len(prompt.split()) # Simple token count approximation
                    output_tokens = len(llm_output.split()) # Simple token count approximation
                    # Ollama response might have 'prompt_eval_count', 'eval_count' for more accurate counts
    
                    # For OpenAI (uncomment and modify if using OpenAI):
                    # client = openai.OpenAI()
                    # response = client.chat.completions.create(
                    #     model=model_name,
                    #     messages=[{"role": "user", "content": prompt}]
                    # )
                    # llm_output = response.choices[0].message.content
                    # input_tokens = response.usage.prompt_tokens
                    # output_tokens = response.usage.completion_tokens
    
                    api_span.set_attribute("llm.response.length", len(llm_output))
                    api_span.set_attribute("llm.response.truncated", llm_output[:200])
                    api_span.add_event("llm_response_received")
                    api_span.set_attribute("llm.input_tokens", input_tokens)
                    api_span.set_attribute("llm.output_tokens", output_tokens)
    
                span.set_attribute("llm.full_response", llm_output)
                logger.info(f"LLM Response: {llm_output[:100]}...")
    
                # Record token metrics after successful call
                llm_input_tokens_counter.add(input_tokens, common_attributes)
                llm_output_tokens_counter.add(output_tokens, common_attributes)
    
                return llm_output
            except Exception as e:
                logger.error(f"Error calling LLM: {e}", exc_info=True)
                span.record_exception(e)  # attach full exception details to the span
                span.set_status(trace.Status(trace.StatusCode.ERROR, description=str(e)))
                return f"Error: {e}"
            finally:
                # Record latency in finally block to ensure it's always captured
                latency = time.time() - start_time
                llm_latency_histogram.record(latency, common_attributes)
                span.set_attribute("llm.request_latency_seconds", latency) # Also add to span for quick view
    
    if __name__ == "__main__":
        setup_opentelemetry()
        logger.info("Welcome to the simple LLM chatbot! Type 'exit' to quit.")
        try:
            while True:
                user_prompt = input("You: ")
                if user_prompt.lower() == 'exit':
                    break
                response = get_llm_response(user_prompt)
                print(f"Bot: {response}")
        except KeyboardInterrupt:
            logger.info("Chatbot shutting down.")
        finally:
            # Ensure all pending spans/metrics are exported on shutdown
            trace.get_tracer_provider().shutdown()
            metrics.get_meter_provider().shutdown()
    

    Explanation of New Changes:

    • setup_opentelemetry() for Metrics:
      • OTLPMetricExporter: Similar to the span exporter, but for metrics, also pointing to http://localhost:4318/v1/metrics.
      • PeriodicExportingMetricReader: Configures metrics to be exported at a regular interval (e.g., every 5 seconds).
      • MeterProvider: Manages the creation and export of metrics.
      • metrics.set_meter_provider(meter_provider): Sets the global meter provider.
    • Metric Instruments:
      • llm_request_counter: A Counter that increments for each LLM request. Ideal for tracking total counts.
      • llm_latency_histogram: A Histogram that records the distribution of latency values. Useful for understanding performance bottlenecks and calculating percentiles.
      • llm_input_tokens_counter and llm_output_tokens_counter: Counters to track token usage, essential for cost monitoring.
    • Recording Metrics:
      • llm_request_counter.add(1, common_attributes): Increments the counter with associated attributes (model name, session ID).
      • llm_latency_histogram.record(latency, common_attributes): Records the calculated latency. We use time.time() to measure the duration.
      • llm_input_tokens_counter.add(...) and llm_output_tokens_counter.add(...): Records the estimated token counts. Note: For accurate token counts, you’d typically rely on the LLM provider’s response (e.g., response.usage.prompt_tokens for OpenAI). Our simple len(prompt.split()) is an approximation.
      • Token counters are recorded only after a successful call, while the latency histogram is recorded in the finally block so that failed requests are timed too; the request counter fires for every attempt.
    • Shutdown Hooks: Added trace.get_tracer_provider().shutdown() and metrics.get_meter_provider().shutdown() in a finally block to ensure all buffered data is sent before the application exits. This is critical for reliable data export.
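One reason to use a Histogram for latency, rather than averaging it yourself, is that histograms preserve the distribution, which the backend uses to compute percentiles. A quick stdlib sketch with invented latency samples shows why that matters:

```python
# Averages hide tail latency; percentiles expose it (sample data is invented).
from statistics import mean, quantiles

latencies = [0.8, 0.9, 1.0, 1.1, 0.95, 1.05, 0.85, 1.2, 6.5, 7.0]  # seconds

avg = mean(latencies)                                      # pulled up by two slow calls
p95 = quantiles(latencies, n=20, method='inclusive')[18]   # 95th percentile

print(f"mean={avg:.2f}s  p95={p95:.2f}s")
```

Here the p95 is more than three times the mean: most requests are around one second, but the slowest five percent take several seconds, and only a percentile view makes that visible.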

Step 5: Running the System & Verification

Now that our application is fully instrumented, let’s run it and see the observability data flow into SigNoz.

  1. Ensure SigNoz is Running: From your ai-observability-project/signoz-setup directory, run docker-compose ps and confirm all services are Up. If any are not, run:

    docker-compose up -d
    

    Then, access the SigNoz UI at http://localhost:3301.

  2. Run Your Instrumented LLM Application: From your main ai-observability-project directory, with your virtual environment active:

    python app.py
    

    Interact with the chatbot a few times:

    • You: Hello, how are you?
    • You: Tell me a short story about a brave knight.
    • You: What is the capital of France?
    • You: Explain AI observability.
    • You: exit
  3. Verify Data in SigNoz:

    • Traces:

      • Go to the SigNoz UI (http://localhost:3301).
      • Navigate to the “Traces” section (usually on the left sidebar).
      • You should see traces from your llm-chatbot-service. Each interaction with the LLM will be a new trace.
      • Click on a trace to see its details. You’ll observe the llm.interaction parent span and its llm.api_call child span.
      • Examine the attributes attached to each span (e.g., user.prompt, llm.model_name, llm.response.length, llm.request_latency_seconds).
      • You’ll also see the llm_response_received event.
    • Logs:

      • Navigate to the “Logs” section in SigNoz.
      • You’ll see your logger.info and logger.error messages.
      • Notice how each log entry automatically includes trace_id and span_id. This is the magic of OpenTelemetry’s logging instrumentation, allowing you to jump directly from a log message to its corresponding trace!
    • Metrics:

      • Navigate to the “Metrics” section or “Dashboards” in SigNoz.
      • You might need to create a new dashboard or panel if SigNoz doesn’t automatically detect your custom metrics immediately.
      • Search for metrics like llm_requests_total, llm_request_latency_seconds, llm_input_tokens_total, llm_output_tokens_total.
      • You can create graphs to visualize:
        • Total requests over time.
        • Average/P95 latency of LLM calls.
        • Total input/output tokens, which is crucial for cost monitoring.

Step 6: Basic Cost Monitoring (Calculated)

While direct cost integration often involves cloud provider APIs, we can use our token metrics to calculate estimated costs.

Let’s assume a simplified pricing model for our llama2 model (e.g., $0.0001 per input token, $0.0002 per output token).

We can retrieve these metrics from SigNoz/Prometheus and perform calculations. For a real-time dashboard, you’d configure a Grafana panel or a SigNoz dashboard query.

Example Grafana/SigNoz Metrics Query (PromQL):

To get the total cost, you’d typically query the llm_input_tokens_total and llm_output_tokens_total counters.

  • Total Input Token Cost: sum(rate(llm_input_tokens_total[5m])) * 0.0001 gives the current spend rate in dollars per second (averaged over the last 5 minutes), while llm_input_tokens_total * 0.0001 gives the total accumulated cost.

  • Total Output Token Cost: sum(rate(llm_output_tokens_total[5m])) * 0.0002 for the spend rate, or llm_output_tokens_total * 0.0002 for the total accumulated cost.

You can combine these into a single panel to show (llm_input_tokens_total * 0.0001) + (llm_output_tokens_total * 0.0002) for total estimated cost. This demonstrates how raw metrics become actionable business insights.
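To make the arithmetic concrete, here is the same calculation in plain Python (the per-token prices are the chapter's illustrative rates, not real pricing for any model):

```python
# Estimated-cost calculation mirroring the PromQL queries above.
INPUT_PRICE_PER_TOKEN = 0.0001    # illustrative rate, not real pricing
OUTPUT_PRICE_PER_TOKEN = 0.0002   # illustrative rate, not real pricing

def estimated_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated spend in dollars for the given token totals."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# e.g. a session that consumed 10,000 input and 4,000 output tokens:
print(f"${estimated_cost(10_000, 4_000):.2f}")  # $1.80
```

In production you would pull the token totals from your metrics backend rather than hard-coding them, but the formula is exactly this.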

Mini-Challenge: Enhance Your Observability!

It’s your turn to add more value to our observability setup.

Challenge: Add a new custom attribute to the llm.interaction span and a new metric that tracks the response quality score.

Imagine you have a simple function that “evaluates” the LLM response (e.g., checks for keywords, length, or just returns a random score for demonstration).

  1. Implement an evaluate_response function: This function should take the prompt and response as input and return a numerical quality_score (e.g., an integer from 1 to 5). For simplicity, you can make it return a random number.
  2. Add a new UpDownCounter metric: Name it llm_response_quality_score with a suitable description and unit. An UpDownCounter is good here if scores can fluctuate up and down.
  3. Update get_llm_response:
    • Call your evaluate_response function after receiving the LLM output.
    • Add the quality_score as an attribute to the llm.interaction span (e.g., span.set_attribute("llm.response.quality_score", quality_score)).
    • Record the quality_score using your new UpDownCounter metric.

Hint:

  • You’ll need import random for a random score.
  • Remember to pass common_attributes when recording your new metric.
  • The UpDownCounter’s add() method can take positive or negative values to adjust the current count. For a simple score, just add(quality_score).
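If you want a starting point for step 1, here is one possible shape for evaluate_response. The scoring heuristic is entirely invented for illustration; swap in random.randint(1, 5) if you prefer the random version from the hint:

```python
def evaluate_response(prompt: str, response: str) -> int:
    """Toy quality score from 1 to 5 (heuristic invented for this exercise)."""
    if response.startswith("Error:"):
        return 1                          # failed calls get the floor score
    score = 3                             # neutral baseline
    if len(response) > 50:
        score += 1                        # reward substantive answers
    if any(word in response.lower() for word in prompt.lower().split()):
        score += 1                        # reward overlap with the prompt
    return min(score, 5)

print(evaluate_response("capital of france", "The capital of France is Paris."))  # 4
```

Any deterministic heuristic like this has the added benefit that your quality metric becomes reproducible, which makes dashboard changes easier to interpret.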

What to Observe/Learn: After implementing and running your updated app.py, interact with the chatbot a few more times.

  • In SigNoz Traces: You should see your new llm.response.quality_score attribute on each llm.interaction span.
  • In SigNoz Metrics/Dashboards: You should be able to find and visualize your llm_response_quality_score metric. This allows you to track the perceived quality of your LLM’s responses over time.

This challenge reinforces the idea of extending observability to cover domain-specific, business-critical aspects of your AI system.

Common Pitfalls & Troubleshooting

Even with clear steps, things can sometimes go sideways. Here are common issues and how to approach them:

  1. OpenTelemetry Exporter Connection Errors:

    • Symptom: Your app.py logs Connection refused or Failed to export spans/metrics messages.
    • Cause: The OpenTelemetry Collector (part of SigNoz) isn’t running or isn’t accessible at http://localhost:4318.
    • Fix:
      • Ensure Docker Desktop is running.
      • Navigate to your ai-observability-project/signoz-setup directory.
      • Run docker-compose ps to check if all services are Up. If not, run docker-compose up -d.
      • Verify the endpoint in app.py (http://localhost:4318/v1/traces and /v1/metrics) is correct.
  2. No Traces/Logs/Metrics in SigNoz UI:

    • Symptom: Your app.py runs without errors, but nothing appears in the SigNoz UI.
    • Cause: Data might not be reaching the collector, or the collector isn’t forwarding it correctly, or there’s a filtering issue in the UI.
    • Fix:
      • Check app.py logs: Are there any opentelemetry warnings or errors?
      • Verify service.name: Ensure the service.name attribute (llm-chatbot-service in our case) in app.py matches what you’re searching for in the SigNoz UI.
      • Time Range: In SigNoz, ensure the time range filter is set appropriately (e.g., “Last 5 minutes” or “Last 1 hour”).
      • Collector Logs: Check the logs of the otel-collector service within your SigNoz Docker Compose setup: docker-compose logs otel-collector. Look for errors or indications that it’s receiving data.
      • Shutdown: Ensure trace.get_tracer_provider().shutdown() and metrics.get_meter_provider().shutdown() are called when your app exits. Without them, buffered data might not be sent.
  3. Metrics Not Showing Up in Grafana/SigNoz Dashboards:

    • Symptom: Traces and logs are there, but your custom metrics are missing.
    • Cause: Prometheus may not be scraping the OTel Collector, the metric names may be wrong, or the export interval is too long.
    • Fix:
      • Metric Reader Interval: In app.py, ensure export_interval_millis for PeriodicExportingMetricReader is set to a reasonable value (e.g., 5000ms for 5 seconds).
      • Metric Names: Double-check your metric names (e.g., llm_requests_total) against what you’re trying to query in SigNoz/Grafana. PromQL queries are case-sensitive.
      • Prometheus Target: Within the SigNoz setup, Prometheus is usually configured to scrape the OTel Collector. You can typically access the raw Prometheus UI (often http://localhost:9090 within the SigNoz setup, though it might be proxied) to verify if your otel-collector target is UP and if your metrics are visible in the “Graph” tab.
  4. Inaccurate Token Counts:

    • Symptom: The llm_input_tokens_total and llm_output_tokens_total metrics seem off.
    • Cause: Our simple len(prompt.split()) is an approximation. Tokenization is complex and model-specific.
    • Fix: For production, use the token counts provided by the LLM API response (e.g., response.usage.prompt_tokens for OpenAI) or a proper tokenizer library (e.g., tiktoken for OpenAI models, or transformers tokenizers for Hugging Face models) to get accurate counts.
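To see why the word-split heuristic in pitfall 4 drifts, you can compare it against another common rule of thumb (roughly 4 characters per token for English text). Both are sketches and both are inaccurate; only the provider's reported usage or a real tokenizer is authoritative. The helper names and sample prompt below are illustrative, not from app.py.

```python
def approx_tokens_by_words(text: str) -> int:
    """The crude approximation used in app.py: one token per whitespace-delimited word."""
    return len(text.split())


def approx_tokens_by_chars(text: str) -> int:
    """Another rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


prompt = "Observability turns opaque LLM behavior into measurable signals."
print(approx_tokens_by_words(prompt), approx_tokens_by_chars(prompt))

# For accurate counts, prefer the provider's response, e.g. for OpenAI:
#   response.usage.prompt_tokens
# or a real tokenizer:
#   import tiktoken
#   len(tiktoken.get_encoding("cl100k_base").encode(prompt))
```

The two heuristics can disagree by a factor of two on the same text, which is exactly why dashboards built on them should be treated as trends rather than billing-grade numbers.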

Remember, observability is an iterative process. Start simple, verify, and then add more complexity as needed.
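One recurring fix above, the missing shutdown from pitfall 2, can be wired up with `atexit` so buffered telemetry is flushed however the app exits. This is a stdlib-only sketch: the real SDK calls from app.py appear as comments, and the `flush_telemetry` name is an assumption.

```python
import atexit


def flush_telemetry() -> None:
    """Flush buffered spans and metrics before the process exits."""
    # In app.py these would be the real OpenTelemetry SDK calls:
    #   trace.get_tracer_provider().shutdown()
    #   metrics.get_meter_provider().shutdown()
    print("telemetry flushed")


# Register once at startup; runs on normal interpreter exit,
# so short-lived scripts don't silently drop their last batch of data.
atexit.register(flush_telemetry)
```

Registering the flush at startup is more robust than calling shutdown at the bottom of `main()`, because it also covers early returns and unhandled exceptions.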

Summary

Congratulations! You’ve successfully built an end-to-end AI observability solution. This hands-on project has allowed you to:

  • Set up a complete observability stack using Docker Compose for SigNoz, which includes an OpenTelemetry Collector, Prometheus, and Grafana.
  • Instrument a Python LLM application with OpenTelemetry for distributed tracing, capturing the full lifecycle of an LLM interaction.
  • Define and record custom metrics like total requests, latency, and token usage, crucial for performance and cost monitoring.
  • Integrate structured logging with OpenTelemetry, ensuring logs are automatically correlated with traces for easier debugging.
  • Visualize all three pillars of observability (traces, metrics, logs) in a unified UI, demonstrating how they provide a holistic view of your AI system’s health.
  • Tackle a mini-challenge to extend observability with domain-specific metrics, highlighting the adaptability of OpenTelemetry.

This practical experience is invaluable for any MLOps practitioner or AI/ML engineer. You’ve moved beyond theoretical understanding to actively building and analyzing an observable AI system.

What’s Next?

This project is a solid foundation. Here are some ideas for where you can take it next:

  • Advanced Alerting: Configure alerts in SigNoz/Grafana for high latency, low quality scores, or unexpected cost spikes.
  • Automated Evaluation: Integrate more sophisticated LLM evaluation metrics (e.g., RAGAS, LlamaIndex evaluators) and track them as metrics or attributes.
  • Data Drift Monitoring: Track distributions of prompt lengths, response lengths, or specific keywords to detect changes in user behavior or model outputs.
  • A/B Testing Observability: Extend your attributes to include A/B test variant IDs to compare performance and behavior between different model versions or prompt strategies.
  • Security & Compliance: Implement stricter logging policies for sensitive data and ensure your observability data storage complies with regulations.
  • Real-world Deployment: Explore deploying this setup to a cloud environment (AWS, Azure, GCP) using managed services or Kubernetes.

Keep exploring, keep building, and remember: an observable AI system is a reliable and maintainable AI system!
