Introduction

Welcome back, future AI observability experts! In our previous chapters, we laid the groundwork for understanding AI system health through comprehensive logging, distributed tracing, and critical metrics. We learned how to see what our AI systems are doing and how well they’re performing.

Now, it’s time to tackle another crucial, and often overlooked, aspect of running AI in production: cost. The rise of powerful Large Language Models (LLMs) and sophisticated AI APIs has brought incredible capabilities, but also a new challenge: managing unpredictable, usage-based expenses. A single runaway prompt or an inefficient model interaction can quickly inflate your cloud bill, turning innovation into a financial headache.

In this chapter, we’ll dive deep into the world of AI cost monitoring. You’ll learn the primary drivers of AI expenses, especially focusing on token usage for LLMs and API call costs. We’ll then get hands-on, integrating OpenTelemetry into a Python application to meticulously track these costs, allowing you to optimize performance, control budgets, and avoid those dreaded “bill shock” moments. Get ready to gain full visibility into your AI spending!

The Anatomy of AI Costs

Understanding where your AI expenses come from is the first step to controlling them. Unlike traditional software, where costs are often predictable server hours or fixed licenses, AI systems, especially those leveraging external APIs, introduce dynamic, usage-based pricing models.

Understanding AI Cost Drivers

Let’s break down the common culprits behind your AI bill:

Token-Based Pricing (LLMs)

For Large Language Models (LLMs), tokens are the fundamental unit of billing. But what exactly is a token? Think of tokens as chunks of text. They aren’t always whole words; they can be parts of words, punctuation, or even spaces. For example, the word “fantastic” might be one token, while “fantastically” could be “fantastic” + “ally”, making two tokens.

Major LLM providers like OpenAI, Anthropic, or Google typically charge per 1,000 tokens. Crucially, they often differentiate between:

  • Prompt Tokens: The tokens sent to the model as input.
  • Completion/Output Tokens: The tokens generated by the model as output.

Often, output tokens are more expensive than input tokens because generating new content is computationally more intensive. The specific pricing also varies significantly between different models (e.g., gpt-3.5-turbo vs. gpt-4o) and even within different versions of the same model.

Why is this important? Because the length and complexity of your prompts, and the desired length of the model’s responses, directly impact your token count and, consequently, your bill.
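To make the per-1,000-token arithmetic concrete, here is a minimal sketch. The rates below are hypothetical placeholders, not any provider's actual prices; always check the official pricing page for the model you use.

```python
# Hypothetical rates in USD per 1,000 tokens -- check your provider's pricing page
PROMPT_RATE_PER_1K = 0.0005
COMPLETION_RATE_PER_1K = 0.0015

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single LLM call from its token counts."""
    return (prompt_tokens / 1000) * PROMPT_RATE_PER_1K + \
           (completion_tokens / 1000) * COMPLETION_RATE_PER_1K

# A 500-token prompt with a 300-token completion:
# 500/1000 * 0.0005 + 300/1000 * 0.0015 is roughly $0.0007
cost = estimate_cost(500, 300)
print(f"Estimated cost: ${cost:.6f}")
```

Notice that the completion side dominates even though it has fewer tokens, because output tokens carry a higher rate, which is exactly why long generations deserve scrutiny.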

API Call Costs

Beyond tokens, some AI services might charge a flat rate per API call, or combine this with token-based pricing. For instance, a specialized image recognition API might charge per image processed, regardless of the complexity, or a vector database might charge per query.

Compute, Storage, and Data Transfer Costs

While our focus is on token and API costs, it’s essential to remember the underlying infrastructure expenses:

  • Compute Costs: If you’re hosting your own models, this includes the cost of GPUs, CPUs, and memory. These are typically billed by the hour or second.
  • Storage Costs: Storing model weights, training data, inference logs, and observability data incurs storage fees.
  • Data Transfer Costs: Moving data between different cloud regions, availability zones, or even in and out of your cloud provider can add up, especially with large models or frequent data exchanges.

Why Monitor AI Costs?

Monitoring AI costs isn’t just about preventing bill shock; it’s about strategic resource management and optimization:

  1. Budget Control: Stay within financial limits and allocate resources effectively.
  2. Performance Optimization: Identify prompts or workflows that are excessively expensive. Can you achieve the same result with fewer tokens?
  3. Cost Attribution: Understand which users, applications, or features are driving the highest costs. This is crucial for chargebacks or for prioritizing optimization efforts.
  4. Forecasting & Planning: Predict future expenses based on usage trends, helping with long-term budgeting.
  5. Anomaly Detection: Quickly spot unusual spikes in token usage or API calls that could indicate an issue (e.g., a buggy loop, a denial-of-service attack, or an inefficient prompt).

Key Metrics for Cost Monitoring

To effectively monitor AI costs, we need to track specific metrics:

  • Total Prompt Tokens: The sum of all input tokens.
  • Total Completion Tokens: The sum of all output tokens.
  • Total Tokens: The grand total of both.
  • Cost Per Interaction/Query: The actual or estimated dollar cost for each API call.
  • API Call Count: The number of times your application interacted with an AI service.
  • Cost Per User/Application Feature: Granular cost breakdown to understand where money is truly being spent.
  • Latency: While not a direct cost, high latency can mean longer compute times for self-hosted models or increased resource utilization in general.

How Observability Helps with Cost

The observability pillars we’ve discussed are perfectly suited to tackle cost monitoring:

  • Logs: Each interaction with an AI model can generate a log entry detailing the prompt, response, and critically, the token usage reported by the API.
  • Metrics: We can aggregate token counts, API call counts, and estimated costs over time to create dashboards and trigger alerts.
  • Traces: Distributed traces allow us to see the full lifecycle of a request, including multiple LLM calls within a complex agent. We can attach token usage and cost information directly to the relevant spans, pinpointing exactly which step in a multi-stage process is most expensive.
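As a taste of the logs pillar, here is a minimal sketch that emits one structured record per LLM call. The field names are illustrative, not a standard schema; the OpenTelemetry-based approach later in this chapter supersedes this for production use.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm.usage")

def log_llm_usage(model: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Emit one structured (JSON) log line per LLM interaction."""
    record = {
        "event": "llm_call",
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    line = json.dumps(record)
    logger.info(line)
    return line

log_llm_usage("gpt-3.5-turbo", 17, 26)
```

Because each line is machine-parseable JSON, a log pipeline can later aggregate token counts per model without any extra instrumentation.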

Step-by-Step Implementation: Tracking Token Usage with OpenTelemetry

Let’s get practical! We’ll instrument a simple Python application that uses the OpenAI API to track token usage as both trace attributes and custom metrics using OpenTelemetry.

Our Goal:

  1. Set up OpenTelemetry for tracing and metrics.
  2. Make an OpenAI API call.
  3. Capture the token usage reported by the OpenAI API.
  4. Record this token usage as attributes on a trace span.
  5. Emit custom metrics for total prompt and completion tokens, allowing for aggregation over time.
  6. (Mini-Challenge) Calculate and track estimated cost.

Prerequisites

Make sure you have Python (version 3.9+) installed. We’ll need the following libraries:

  • openai: For interacting with OpenAI’s LLMs.
  • opentelemetry-api: The OpenTelemetry API.
  • opentelemetry-sdk: The OpenTelemetry SDK.
  • opentelemetry-exporter-otlp: To export traces and metrics in OTLP format (a common standard for observability backends like SigNoz, Jaeger, Prometheus).

Let’s install them:

pip install openai~=1.12.0 opentelemetry-api~=1.22.0 opentelemetry-sdk~=1.22.0 opentelemetry-exporter-otlp~=1.22.0

Note on Versions (as of 2026-03-20):

  • openai: The ~=1.12.0 specifier pins us to the 1.12.x series (a “compatible release” constraint: >=1.12.0, <1.13.0), so only patch updates are pulled in. OpenAI frequently updates its library, so always refer to their official documentation for the most current version and API details.
  • opentelemetry-python: We’re targeting ~=1.22.0 for the core components. OpenTelemetry is a rapidly evolving standard, and 1.22.0 represents a stable and feature-rich version as of early 2026. You can always check the OpenTelemetry Python documentation for the very latest.

You’ll also need an OpenAI API key. Set it as an environment variable:

export OPENAI_API_KEY="your_openai_api_key_here"

Step 1: Initialize OpenTelemetry SDK

First, let’s set up our OpenTelemetry SDK for both tracing and metrics. This involves configuring a TracerProvider and a MeterProvider. We’ll use the OTLP exporter, which sends data to a configured endpoint (e.g., a local OpenTelemetry Collector or an observability platform).

Create a file named llm_cost_tracker.py:

# llm_cost_tracker.py
import os
import openai

from opentelemetry import trace
from opentelemetry import metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# --- 1. Basic OpenTelemetry Setup ---

# Define a resource for our service
# This helps identify our application in the observability backend
resource = Resource.create({
    "service.name": "llm-cost-tracker-app",
    "service.version": "1.0.0",
    "environment": "production",
})

# Configure TracerProvider for Tracing
# A TracerProvider manages Tracers, which create Spans
trace_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(trace_provider)

# Use OTLPSpanExporter to send spans to an OTLP endpoint (e.g., a collector)
# Default endpoint is http://localhost:4317 for gRPC
span_exporter = OTLPSpanExporter()

# BatchSpanProcessor asynchronously sends spans in batches
trace_provider.add_span_processor(BatchSpanProcessor(span_exporter))

# Get a tracer for our module
tracer = trace.get_tracer(__name__)

# Configure MeterProvider for Metrics
# A MeterProvider manages Meters, which create instruments (like Counters)
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter() # Default endpoint is http://localhost:4317 for gRPC
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Get a meter for our module
meter = metrics.get_meter(__name__)

# Define custom metrics
# Counters are good for accumulating values (like total tokens)
prompt_tokens_counter = meter.create_counter(
    "llm.prompt_tokens.total",
    description="Total number of prompt tokens used by LLMs",
    unit="tokens"
)
completion_tokens_counter = meter.create_counter(
    "llm.completion_tokens.total",
    description="Total number of completion tokens generated by LLMs",
    unit="tokens"
)

# Verify the API key is set before constructing the client
# (the OpenAI client raises its own error if no key is found)
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set.")

# Initialize OpenAI client
client = openai.OpenAI(api_key=api_key)

print("OpenTelemetry and OpenAI client initialized.")

# --- Placeholder for LLM call function ---
def call_llm(prompt: str, model: str = "gpt-3.5-turbo"):
    print(f"\nCalling LLM with prompt: '{prompt[:50]}...'")
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return response
    except openai.APIError as e:
        print(f"OpenAI API Error: {e}")
        return None

# --- Main execution block ---
if __name__ == "__main__":
    # This block will be filled in during the next steps
    pass

Explanation:

  • Resource: Identifies our service. This is crucial for filtering and grouping data in your observability backend.
  • TracerProvider: The entry point for tracing. We set it globally with trace.set_tracer_provider.
  • OTLPSpanExporter: Exports trace spans using the OpenTelemetry Protocol (OTLP) over gRPC. By default, it tries to connect to localhost:4317, a common port for an OpenTelemetry Collector.
  • BatchSpanProcessor: Buffers spans and sends them in batches for efficiency.
  • trace.get_tracer(__name__): Gets a Tracer instance, which is used to create Span objects.
  • MeterProvider: The entry point for metrics. We set it globally with metrics.set_meter_provider.
  • OTLPMetricExporter: Exports metrics using OTLP over gRPC.
  • PeriodicExportingMetricReader: Periodically pulls metrics from instruments and exports them.
  • metrics.get_meter(__name__): Gets a Meter instance, used to create metric instruments.
  • create_counter(...): Defines two Counter instruments, llm.prompt_tokens.total and llm.completion_tokens.total, which will accumulate token counts.

To run this, you’d typically have an OpenTelemetry Collector running locally or point to a remote endpoint. For local testing, you can run a collector with a simple configuration that prints to console:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]

Then run the collector: otelcol --config otel-collector-config.yaml

Step 2: Making an LLM Call and Capturing Usage

Now, let’s make an actual LLM call and see how we can get the token usage information. The OpenAI Python library, for its chat completions, returns a usage object within the response that contains prompt_tokens, completion_tokens, and total_tokens.

Let’s modify the call_llm function and the main block in llm_cost_tracker.py:

# ... (previous code for OTel setup and OpenAI client) ...

def call_llm(prompt: str, model: str = "gpt-3.5-turbo"):
    print(f"\nCalling LLM with model '{model}' and prompt: '{prompt[:50]}...'")
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        print(f"LLM Response: {response.choices[0].message.content[:100]}...")
        # CRITICAL: Access token usage from the response
        if response.usage:
            print(f"  Token Usage - Prompt: {response.usage.prompt_tokens}, "
                  f"Completion: {response.usage.completion_tokens}, "
                  f"Total: {response.usage.total_tokens}")
        else:
            print("  No token usage information available in response.")
        return response
    except openai.APIError as e:
        print(f"OpenAI API Error: {e}")
        return None

# --- Main execution block ---
if __name__ == "__main__":
    print("Starting LLM interaction...")
    sample_prompt = "Explain the concept of quantum entanglement in a single sentence."
    llm_response = call_llm(sample_prompt)

    if llm_response:
        print("\nLLM interaction complete. Check console for token usage details.")

    # Ensure metrics are exported before exiting
    meter_provider.shutdown()
    trace_provider.shutdown()
    print("OpenTelemetry providers shut down.")

Run this script (python llm_cost_tracker.py). You should see output similar to this, including the token usage:

OpenTelemetry and OpenAI client initialized.
Starting LLM interaction...

Calling LLM with model 'gpt-3.5-turbo' and prompt: 'Explain the concept of quantum entanglement in a s...'
LLM Response: Quantum entanglement is a phenomenon where two or more particles become linked in such a way that the sta...
  Token Usage - Prompt: 17, Completion: 26, Total: 43

LLM interaction complete. Check console for token usage details.
OpenTelemetry providers shut down.

Great! We’re successfully getting the token usage. Now let’s integrate this into OpenTelemetry.

Step 3: Integrating Token Metrics into Traces

Traces are perfect for associating token usage with a specific interaction. We’ll wrap our call_llm function with a trace span and add the token details as span attributes. This allows you to see the cost implications directly within your distributed trace waterfall.

Modify llm_cost_tracker.py to include the with tracer.start_as_current_span(...) block:

# ... (previous code for OTel setup, client, and counter definitions) ...

# Verify the API key is set before constructing the client
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set.")

# Initialize OpenAI client
client = openai.OpenAI(api_key=api_key)

print("OpenTelemetry and OpenAI client initialized.")

def call_llm_and_track(prompt: str, model: str = "gpt-3.5-turbo"):
    # Create a span for this LLM interaction
    with tracer.start_as_current_span("llm.chat.completion") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.request.prompt", prompt) # Be mindful of logging sensitive data

        print(f"\nCalling LLM with model '{model}' and prompt: '{prompt[:50]}...'")
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            print(f"LLM Response: {response.choices[0].message.content[:100]}...")

            if response.usage:
                prompt_tokens = response.usage.prompt_tokens
                completion_tokens = response.usage.completion_tokens
                total_tokens = response.usage.total_tokens

                print(f"  Token Usage - Prompt: {prompt_tokens}, "
                      f"Completion: {completion_tokens}, "
                      f"Total: {total_tokens}")

                # Add token usage as span attributes
                span.set_attribute("llm.token.prompt", prompt_tokens)
                span.set_attribute("llm.token.completion", completion_tokens)
                span.set_attribute("llm.token.total", total_tokens)

                # Increment our custom metrics
                prompt_tokens_counter.add(prompt_tokens, {"llm.model": model})
                completion_tokens_counter.add(completion_tokens, {"llm.model": model})

            else:
                print("  No token usage information available in response.")
                span.set_attribute("llm.token.available", False)

            span.set_attribute("llm.response.content", response.choices[0].message.content)  # Again, be mindful of sensitive data
            span.set_attribute("llm.success", True)
            return response

        except openai.APIError as e:
            print(f"OpenAI API Error: {e}")
            span.set_attribute("llm.success", False)
            span.set_attribute("error.message", str(e))
            span.set_status(trace.Status(trace.StatusCode.ERROR, description=str(e)))
            return None

# --- Main execution block ---
if __name__ == "__main__":
    print("Starting LLM interaction with OpenTelemetry tracking...")
    sample_prompt_1 = "Explain the concept of quantum entanglement in a single sentence."
    call_llm_and_track(sample_prompt_1)

    sample_prompt_2 = "Write a short, encouraging poem about learning new things."
    call_llm_and_track(sample_prompt_2, model="gpt-4o") # Use a different model to show model-specific metrics

    print("\nLLM interactions complete. Spans and metrics should be exported.")

    # Ensure metrics are exported before exiting
    # The PeriodicExportingMetricReader exports on a schedule, but manual shutdown
    # ensures any remaining buffered data is sent.
    meter_provider.shutdown()
    trace_provider.shutdown()
    print("OpenTelemetry providers shut down.")

Explanation of Changes:

  • with tracer.start_as_current_span("llm.chat.completion") as span:: This creates a new span named “llm.chat.completion” and makes it the current span. All operations within this with block will be part of this span.
  • span.set_attribute(...): We add context to our span.
    • llm.model: The model used (e.g., gpt-3.5-turbo, gpt-4o).
    • llm.request.prompt: The user’s prompt (be cautious with sensitive data here).
    • llm.token.prompt, llm.token.completion, llm.token.total: The token usage extracted from the OpenAI response. These are critical for cost analysis in traces.
    • llm.success and error.message attributes, plus the span status set via span.set_status(...): standard ways to mark a span as succeeded or failed.
  • prompt_tokens_counter.add(...): We increment our previously defined OpenTelemetry Counter instruments.
    • The {"llm.model": model} dictionary adds labels (or “attributes” in OTel metrics terminology) to the metrics. This is incredibly powerful! It allows you to break down your total token usage by specific LLM model, giving you insights like “How many prompt tokens did gpt-4o consume today?”

Now, run this script (python llm_cost_tracker.py) alongside your OpenTelemetry Collector. You should see detailed trace and metric data streaming into your collector’s logs. In a real observability platform (like SigNoz), you would visualize these spans with their attributes and see your llm.prompt_tokens.total and llm.completion_tokens.total metrics accumulating over time, broken down by model.

(Optional/Advanced) Using LangChain/LlamaIndex Callbacks

For those using higher-level LLM orchestration frameworks like LangChain or LlamaIndex, good news! These frameworks often provide built-in integrations or callback mechanisms to simplify observability.

For example, LangChain’s callback system surfaces token usage per call (e.g., via get_openai_callback), and community OpenTelemetry instrumentations for LangChain (such as the one from the OpenLLMetry project) can automatically instrument LLM calls and chains, capturing details like prompt, response, and token usage, and exporting them as spans. This significantly reduces the boilerplate code you need to write manually. (Note that LangChain’s own LangChainTracer targets LangSmith, not OpenTelemetry.)

While we won’t provide a full code example here to keep the focus on core OpenTelemetry principles, be aware that leveraging these framework-specific integrations is a best practice when working with them. They abstract away much of the manual span creation and attribute setting, making your life easier.

Mini-Challenge: Calculate Estimated Cost

It’s your turn to apply what you’ve learned!

Challenge: Modify the call_llm_and_track function in llm_cost_tracker.py to:

  1. Define a hypothetical pricing model:
    • For gpt-3.5-turbo: Assume $0.0015 per 1,000 prompt tokens and $0.002 per 1,000 completion tokens.
    • For gpt-4o: Assume $0.005 per 1,000 prompt tokens and $0.015 per 1,000 completion tokens.
  2. Calculate the estimated_cost for each LLM interaction based on the token usage and the model’s pricing.
  3. Add llm.cost.estimated as a span attribute (a float value) to the existing span.
  4. Create a new OpenTelemetry Counter instrument called llm.cost.total_estimated and increment it with the estimated_cost after each interaction, including the llm.model label.

Hint:

  • You’ll need to define a dictionary or a small helper function to store your pricing rates per model.
  • Remember to divide prompt_tokens and completion_tokens by 1000 before multiplying by the per-1000 token price.
  • The Counter needs to be created once at the top, similar to prompt_tokens_counter.

What to Observe/Learn:

  • How to derive custom business metrics (like cost) from raw technical data (token counts).
  • The power of attaching derived metrics to both traces (for per-request context) and aggregated metrics (for overall trends).
  • The importance of understanding and modeling pricing structures.
Stuck? Here’s a hint:

First, define a dictionary `LLM_PRICING_RATES` at the top of your script, mapping model names to another dictionary containing `prompt_cost_per_1k_tokens` and `completion_cost_per_1k_tokens`. Then, inside `call_llm_and_track`, retrieve these rates for the current `model`, perform the calculation, and add the attribute/increment the counter. Don't forget to define the `llm_total_estimated_cost_counter` at the top alongside the other counters.

Common Pitfalls & Troubleshooting

Monitoring AI costs comes with its own set of challenges. Being aware of these common pitfalls can save you time and money.

  1. Ignoring Token Context and Pricing Variations:

    • Pitfall: Assuming all tokens are priced equally, or that pricing is static. Different models (e.g., GPT-3.5 vs. GPT-4o), different providers, and even different API endpoints can have vastly different token costs. Prompt tokens are often cheaper than completion tokens.
    • Troubleshooting: Always refer to the official pricing pages of your AI providers. Implement a dynamic pricing lookup in your cost calculation logic, or at least ensure your estimated costs are based on the correct, up-to-date rates for each model you use.
  2. Lack of Granularity in Cost Attribution:

    • Pitfall: Only tracking total application cost. If your AI system serves multiple users, features, or internal teams, a single “total cost” metric doesn’t tell you who or what is driving the expenses.
    • Troubleshooting: Use labels/attributes on your metrics and traces (as we did with llm.model). Extend this to include user_id, feature_name, team_id, or session_id. This allows you to slice and dice your cost data, attribute costs to specific entities, and identify areas for targeted optimization.
  3. Overlooking Hidden Costs:

    • Pitfall: Focusing solely on token/API costs and forgetting about the underlying infrastructure costs for self-hosted models, data storage for logs/models, or network egress fees.
    • Troubleshooting: Integrate cloud provider billing data (e.g., AWS Cost Explorer, Azure Cost Management) into your overall financial monitoring strategy. For self-hosted models, ensure you track GPU/CPU utilization metrics and correlate them with your AI workloads.
  4. Alert Fatigue from Cost Spikes:

    • Pitfall: Setting up overly sensitive alerts for cost increases without establishing clear baselines or understanding normal fluctuations. This leads to constant false alarms.
    • Troubleshooting: Establish historical baselines for your token usage and estimated costs. Use anomaly detection techniques or define intelligent thresholds that account for expected peaks (e.g., during business hours, marketing campaigns). Configure alerts to trigger only for significant, sustained deviations from the norm.
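To illustrate the attribution idea from pitfall 2, here is a plain-Python sketch of slicing labeled cost records by user or feature. The labels (user_id, feature) and the numbers are illustrative; in a real system these would be attributes on your OpenTelemetry metrics, and the grouping would happen in your observability backend.

```python
from collections import defaultdict

# Each record mimics a labeled metric data point: (attributes, estimated cost in USD)
records = [
    ({"user_id": "u1", "feature": "chat"},    0.004),
    ({"user_id": "u2", "feature": "summary"}, 0.012),
    ({"user_id": "u1", "feature": "summary"}, 0.009),
]

def cost_by(label: str) -> dict:
    """Aggregate estimated cost, grouped by one attribute."""
    totals = defaultdict(float)
    for attrs, cost in records:
        totals[attrs[label]] += cost
    return dict(totals)

print(cost_by("user_id"))  # total cost per user
print(cost_by("feature"))  # total cost per feature
```

The same underlying data answers both “who is spending?” and “which feature is spending?”, which is exactly why attaching these labels at instrumentation time is so valuable.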

Summary

Phew! You’ve navigated the complex world of AI costs and emerged with powerful tools and knowledge. Here’s a quick recap of what we’ve covered:

  • AI costs are often usage-based, with tokens being the primary billing unit for LLMs, and API calls contributing significantly to expenses.
  • Understanding the difference between prompt tokens and completion tokens and their respective pricing is crucial for accurate cost estimation.
  • Proactive cost monitoring is essential not just for budgeting, but for optimizing performance, attributing expenses, and forecasting future needs.
  • OpenTelemetry provides a robust, vendor-neutral way to instrument your AI applications. We learned how to:
    • Set up both tracing and metrics providers.
    • Capture token usage directly from LLM API responses.
    • Attach token details as attributes to trace spans, providing per-request cost context.
    • Emit custom metrics (like total prompt/completion tokens) with labels (e.g., llm.model) for powerful aggregation and analysis.
  • We discussed common pitfalls like ignoring pricing variations, lacking granular cost attribution, overlooking hidden infrastructure costs, and managing alert fatigue.

By diligently tracking token usage and API expenses, you’re not just preventing unexpected bills; you’re gaining a deeper understanding of your AI system’s efficiency and impact, enabling smarter resource allocation and more sustainable AI operations.

What’s Next?

Now that we can effectively monitor the performance and cost of our AI systems, it’s time to tackle the inevitable: when things go wrong. In our next chapter, we’ll dive into Debugging AI Systems, focusing on strategies and tools for root cause analysis, especially for those tricky, non-deterministic AI behaviors.
