Introduction: Seeing Inside Your Software
Welcome back, aspiring problem-solver! In the previous chapters, we laid the groundwork for a systematic approach to tackling engineering challenges. We learned how to break down complex problems, form hypotheses, and think critically about system behavior. But how do you know what your system is doing when it’s running in production? How do you gather the evidence needed to validate those hypotheses?
This is where observability comes in. Observability is the ability to infer the internal state of a system by examining its external outputs. It’s like having X-ray vision for your software, allowing you to understand why things are happening, not just that they are happening. Without good observability, even the most brilliant problem-solving mind is flying blind.
In this chapter, we’ll dive deep into the three pillars of modern observability: logs, metrics, and traces. You’ll learn what each is, why they’re crucial for diagnosing real-world issues, and how to instrument your applications to emit this vital information. We’ll use practical examples in Go, incorporating industry-standard tools like OpenTelemetry and Prometheus, to give you hands-on experience in building observable systems. Get ready to gain the superpower of seeing inside your software!
Core Concepts: The Three Pillars
Before we start writing code, let’s understand the fundamental components that make up an observable system. Think of logs, metrics, and traces as different lenses through which you view your application’s behavior. Each provides a unique perspective, and together, they paint a complete picture.
Observability vs. Monitoring: What’s the Difference?
You might hear “observability” and “monitoring” used interchangeably, but there’s a subtle yet important distinction.
- Monitoring tells you if your system is working. It’s about collecting predefined data points (like CPU usage, memory, request rates) and alerting you when they cross certain thresholds. You monitor for known unknowns.
- Observability allows you to ask any question about your system’s behavior, even questions you didn’t anticipate when you built it. It’s about understanding why something is happening. You use observability to explore unknown unknowns.
To put it simply: Monitoring is about dashboards and alerts. Observability is about debugging and root cause analysis in complex, distributed systems.
Pillar 1: Logs – The Event Stream
Imagine your application is telling a story of everything it does. Every user request, every database query, every error, every internal decision – it all gets written down. This narrative is your logs.
What are Logs? Logs are timestamped records of discrete events that occur within your application or system. They are typically lines of text, though modern systems increasingly favor structured formats.
Why are Logs Important? Logs are invaluable for:
- Debugging specific issues: When a user reports an error, logs can show the exact sequence of events leading up to it.
- Auditing and security: Logs can record who did what and when, helping to track suspicious activity.
- Understanding application flow: By following log messages, you can trace the path of a request through different parts of your code.
Structured vs. Unstructured Logs
- Unstructured logs: These are human-readable text strings, often generated by simple `print` statements. While easy to read for a human, they are difficult for machines to parse and query efficiently.

```
2026-03-06 10:30:05 INFO User 'alice' logged in from 192.168.1.100
2026-03-06 10:30:06 ERROR Failed to fetch data for user 'bob': database connection lost
```

- Structured logs: These logs are formatted, usually as JSON, making them machine-readable and easy to query in log management systems.

```json
{"timestamp": "2026-03-06T10:30:05Z", "level": "info", "message": "User logged in", "user": "alice", "ip_address": "192.168.1.100"}
{"timestamp": "2026-03-06T10:30:06Z", "level": "error", "message": "Failed to fetch data", "user": "bob", "error": "database connection lost"}
```

Notice how each piece of information (timestamp, level, message, user, IP, error) is a distinct key-value pair. This allows you to filter logs by `user`, `level`, or `error` type with precision.
Best Practices for Logging:
- Log Context: Include relevant information like request IDs, user IDs, component names, and environment details. This helps connect log messages across different services and requests.
- Appropriate Levels: Use log levels (DEBUG, INFO, WARN, ERROR, FATAL) correctly to filter noise.
- Avoid Sensitive Data: Never log passwords, API keys, or other sensitive personal identifiable information (PII).
- Structured Logging: Always prefer structured logs for production systems.
Pillar 2: Metrics – The Aggregated View
If logs are individual stories, metrics are the aggregated statistics. They are numerical measurements captured over time, representing the health and performance of your system.
What are Metrics? Metrics are numerical values that represent a specific aspect of your system at a given point in time. They are typically time-series data, meaning they are collected periodically and stored with a timestamp.
Why are Metrics Important? Metrics are excellent for:
- Monitoring system health: Track CPU, memory, disk I/O, network traffic.
- Performance trending: Observe how latency, throughput, or error rates change over time.
- Alerting: Trigger alerts when key performance indicators (KPIs) deviate from expected norms.
- Capacity planning: Understand resource utilization to plan for future growth.
Common Types of Metrics:
- Counters: A cumulative metric that only ever goes up. Useful for counting total requests, errors, or completed tasks.
  - Example: `http_requests_total`
- Gauges: A metric that represents a single numerical value that can go up or down. Useful for current values like CPU utilization, memory usage, or queue size.
  - Example: `current_queue_size`
- Histograms: Sample observations (e.g., request durations) and count them in configurable buckets. This allows you to calculate quantiles (e.g., 90th percentile latency), which are crucial for understanding user experience.
  - Example: `http_request_duration_seconds`
- Summaries: Similar to histograms but calculate configurable quantiles on the client side. More resource-intensive for high-cardinality data.
The Four Golden Signals: A popular framework for choosing what to measure, especially for user-facing services:
- Latency: The time it takes to serve a request.
- Traffic: How much demand is being placed on your system (e.g., requests per second).
- Errors: The rate of requests that fail.
- Saturation: How “full” your service is (e.g., CPU, memory, disk, network utilization).
By focusing on these four signals, you can get a comprehensive view of your system’s health.
Pillar 3: Traces – The Request’s Journey
In a distributed system, a single user action might involve dozens of services communicating with each other. A log message from one service doesn’t tell you what happened in another. This is where traces come in.
What are Traces? A trace represents the end-to-end journey of a single request or operation as it flows through multiple services in a distributed system. A trace is composed of multiple spans.
- Span: A span represents a single operation within a trace (e.g., an HTTP request, a database query, a function call). Each span has a name, a start time, an end time, attributes (key-value metadata), and a parent-child relationship with other spans.
Why are Traces Important? Traces are critical for:
- Root cause analysis in distributed systems: Pinpoint which service or component caused a latency spike or error.
- Performance optimization: Identify bottlenecks across service boundaries.
- Understanding complex interactions: Visualize the flow of data and control through microservices.
- Service dependency mapping: Discover how services interact in real-time.
How Traces Work: Context Propagation The magic of distributed tracing lies in context propagation. When a request moves from one service to another, a unique trace ID and parent span ID are injected into the request headers. The receiving service then extracts this information and uses it to create new child spans, linking them back to the original trace. This forms a directed acyclic graph (DAG) of operations, showing the entire request path.
OpenTelemetry: The Standard for Observability Data Historically, collecting logs, metrics, and traces often involved vendor-specific agents and SDKs. This led to vendor lock-in and complexity. This is why OpenTelemetry (OTel) was created.
OpenTelemetry is a vendor-neutral, open-source set of APIs, SDKs, and tools designed to standardize the generation and collection of telemetry data (logs, metrics, and traces). It’s supported by the Cloud Native Computing Foundation (CNCF) and is rapidly becoming the industry standard.
Using OpenTelemetry means your instrumentation code is portable. You can switch your backend (where you send your observability data) without changing your application code. As of 2026-03-06, OpenTelemetry has stable SDKs for many popular languages, including Go, Java, Python, JavaScript, and more, with active development continuing across all signal types. For the latest status and documentation, always refer to the official OpenTelemetry website.
The Relationship Between Logs, Metrics, and Traces
While distinct, these three pillars are most powerful when used together.
- Metrics tell you that a problem is occurring (e.g., “latency is high”).
- Traces help you pinpoint where the problem is happening (e.g., “it’s the database call in Service B”).
- Logs provide the granular details why it’s happening (e.g., “Service B’s log shows a ‘database connection lost’ error just before the slow query”).
Together, they provide a holistic view for effective problem-solving.
Picture the relationship as a pipeline: your application emits raw observability data (logs, metrics, and traces), which is often collected and processed by an OpenTelemetry Collector before being sent to specialized backend systems for storage, querying, and visualization. Finally, engineers use dashboards and tools to interpret this data and solve problems.
Step-by-Step Implementation: Instrumenting a Go Application
Let’s get our hands dirty! We’ll build a simple Go HTTP server and incrementally add logging, metrics, and tracing.
Prerequisites:
- Go installed (version 1.21 or newer, as of 2026-03-06).
- A text editor and a terminal.
Step 1: Set Up Your Project
First, create a new Go module and a main.go file.
- Create a directory:

```bash
mkdir my-observability-app
cd my-observability-app
```

- Initialize the Go module:

```bash
go mod init my-observability-app
```

- Create `main.go`:

```go
// main.go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	http.Handle("/hello", helloHandler)

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

- Run it:

```bash
go run main.go
```

Open http://localhost:8080/hello in your browser. You should see "Hello, world! You requested: /hello".
Step 2: Add Basic Logging
We’ll start by adding simple log package calls.
Modify `main.go`: Let's add a log message inside our `helloHandler` to track when a request is processed.

```go
// main.go (additions highlighted)
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Add a log message here
		log.Printf("INFO: Request received for path: %s from %s", r.URL.Path, r.RemoteAddr)
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	http.Handle("/hello", helloHandler)

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `log.Printf` is a simple way to print formatted strings (the standard `log` package writes to stderr by default).
- We're including the request path and the remote address for context.

Run and observe: Restart the server (`Ctrl+C`, then `go run main.go`). Access http://localhost:8080/hello a few times. You'll see output like:

```
2026/03/06 10:30:00 Server starting on :8080
2026/03/06 10:30:05 INFO: Request received for path: /hello from 127.0.0.1:54321
2026/03/06 10:30:07 INFO: Request received for path: /hello from 127.0.0.1:54323
```

This is basic, unstructured logging.
Step 3: Add Metrics with Prometheus
Now, let’s add some metrics to track request counts and durations. We’ll use the Prometheus Go client library.
Install the Prometheus client library:

```bash
go get github.com/prometheus/client_golang/prometheus@latest
go get github.com/prometheus/client_golang/prometheus/promhttp@latest
```

Modify `main.go`: We'll define global counters and histograms, register them, and then create a middleware to record metrics for each request.

```go
// main.go (additions highlighted)
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"          // New import
	"github.com/prometheus/client_golang/prometheus/promhttp" // New import
)

// Global Prometheus metrics
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"path", "method", "status"}, // Labels for breakdown
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds.",
			Buckets: prometheus.DefBuckets, // Default buckets are good for starters
		},
		[]string{"path", "method", "status"},
	)
)

func init() {
	// Register Prometheus metrics when the package is initialized
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// loggingResponseWriter is a wrapper to capture the HTTP status code,
// which the standard http.ResponseWriter does not expose to middleware.
type loggingResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (lrw *loggingResponseWriter) WriteHeader(code int) {
	lrw.statusCode = code
	lrw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("INFO: Request received for path: %s from %s", r.URL.Path, r.RemoteAddr)
		// Simulate some work that takes time
		time.Sleep(50 * time.Millisecond)
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	// Wrap the handler with observability middleware
	obsHandler := func(path string, handler http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()

			// Create a ResponseWriter wrapper to capture the status code
			lrw := &loggingResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

			// Call the actual handler
			handler.ServeHTTP(lrw, r)

			// After the handler executes, capture metrics
			status := lrw.statusCode
			duration := time.Since(start).Seconds()

			// Define labels for our metrics
			labels := prometheus.Labels{"path": path, "method": r.Method, "status": fmt.Sprintf("%d", status)}
			httpRequestsTotal.With(labels).Inc()               // Increment the counter
			httpRequestDuration.With(labels).Observe(duration) // Observe the duration
		})
	}

	// Register our handler with the observability middleware
	http.Handle("/hello", obsHandler("/hello", helloHandler))

	// Expose the Prometheus metrics endpoint
	http.Handle("/metrics", promhttp.Handler())

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `httpRequestsTotal` (CounterVec): A counter that increments for each HTTP request. `CounterVec` allows us to add labels (`path`, `method`, `status`) to break down the counts.
- `httpRequestDuration` (HistogramVec): A histogram that records the duration of requests. `HistogramVec` also uses labels. `prometheus.DefBuckets` provides a reasonable default set of time ranges.
- `init()` function: This Go special function runs before `main()`. We use it to `prometheus.MustRegister` our metrics, making them available to the Prometheus scraper.
- `loggingResponseWriter`: This custom `http.ResponseWriter` wrapper is a common pattern in Go HTTP servers to capture the actual HTTP status code returned by the handler, which is essential for our `status` label.
- `obsHandler` (middleware): This function acts as an HTTP middleware. It wraps our `helloHandler` to:
  - Record the start time.
  - Call the original handler.
  - Capture the status code and duration after the handler completes.
  - Increment the `httpRequestsTotal` counter and `Observe` the `httpRequestDuration` with appropriate labels.
- `/metrics` endpoint: We register `promhttp.Handler()` to expose our collected metrics in a format that Prometheus can scrape.

Run and observe: Restart the server. Access http://localhost:8080/hello a few times. Now, open http://localhost:8080/metrics in your browser. You'll see a page full of metrics in the Prometheus exposition format! Look for `http_requests_total` and `http_request_duration_seconds`. You should see them incrementing with each request to `/hello`.
Step 4: Add Tracing with OpenTelemetry
Finally, let’s add distributed tracing using OpenTelemetry. We’ll instrument our helloHandler and add a custom child span.
Install the OpenTelemetry Go SDK:

```bash
go get go.opentelemetry.io/otel@latest
go get go.opentelemetry.io/otel/attribute@latest
go get go.opentelemetry.io/otel/exporters/stdout/stdouttrace@latest
go get go.opentelemetry.io/otel/propagation@latest
go get go.opentelemetry.io/otel/sdk/resource@latest
go get go.opentelemetry.io/otel/sdk/trace@latest
go get go.opentelemetry.io/otel/semconv/v1.24.0@latest  # Use a specific stable semantic conventions version
go get go.opentelemetry.io/otel/trace@latest
```

Note: `semconv/v1.24.0` is a placeholder for a recent, stable version of the semantic conventions. Always check `go.opentelemetry.io/otel/semconv` for the latest stable version.

Modify `main.go`: We'll add an `initTracer` function, set up OpenTelemetry in `main`, and then instrument our handler.

```go
// main.go (additions highlighted)
package main

import (
	"context" // New import for OpenTelemetry context
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"                              // New import
	"go.opentelemetry.io/otel/attribute"                    // New import
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace" // New import (for console output)
	"go.opentelemetry.io/otel/propagation"                  // New import (context propagation)
	"go.opentelemetry.io/otel/sdk/resource"                 // New import
	"go.opentelemetry.io/otel/sdk/trace"                    // New import
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"      // New import: semantic conventions
	oteltrace "go.opentelemetry.io/otel/trace"              // New import
)

// Global Prometheus metrics (unchanged)
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"path", "method", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"path", "method", "status"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// initTracer initializes an OpenTelemetry tracer provider.
// This sets up where our trace data will be sent.
func initTracer() *trace.TracerProvider {
	// Create a stdout exporter to see the traces directly in the console.
	// In a real application, you'd use an OTLP exporter to send to a collector/backend.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatalf("failed to create stdout exporter: %v", err)
	}

	// For demonstration, use an always-on sampler. In production, use a
	// parent-based sampler to control tracing overhead.
	tp := trace.NewTracerProvider(
		trace.WithSampler(trace.AlwaysSample()), // Sample all traces for demonstration
		trace.WithBatcher(exporter),             // Export traces in batches
		trace.WithResource(resource.NewWithAttributes( // Define service attributes
			semconv.SchemaURL, // Use OpenTelemetry's standard schema
			semconv.ServiceName("my-observability-app"),
			semconv.ServiceVersion("1.0.0"),
			attribute.String("environment", "development"),
		)),
	)

	// Set the global TracerProvider, so all subsequent calls to otel.Tracer() use this provider.
	otel.SetTracerProvider(tp)

	// Set up a text map propagator so trace context travels in HTTP headers
	// (W3C Trace Context plus Baggage).
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return tp
}

// loggingResponseWriter (unchanged)
type loggingResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (lrw *loggingResponseWriter) WriteHeader(code int) {
	lrw.statusCode = code
	lrw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Initialize the OpenTelemetry tracer provider
	tp := initTracer()
	// Ensure the tracer provider is shut down when the application exits
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer provider: %v", err)
		}
	}()

	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Get the current span from the request context.
		// If tracing is properly set up, this is the span started by the middleware.
		ctx := r.Context()
		span := oteltrace.SpanFromContext(ctx)
		span.SetAttributes(attribute.String("http.request.id", "some-unique-id")) // Example custom attribute

		log.Printf(`{"level": "info", "message": "Request received", "path": "%s", "method": "%s", "trace_id": "%s", "span_id": "%s"}`,
			r.URL.Path, r.Method, span.SpanContext().TraceID().String(), span.SpanContext().SpanID().String())

		// Simulate some work
		time.Sleep(50 * time.Millisecond)

		// Create a child span for internal logic.
		// This helps break down the request into smaller, observable operations.
		_, childSpan := otel.Tracer("my-app").Start(ctx, "internal-logic-step")
		defer childSpan.End() // Ensure the child span is always ended

		time.Sleep(20 * time.Millisecond)
		childSpan.AddEvent("logic_processed", oteltrace.WithAttributes(attribute.Int("processed_items", 10)))

		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	// Wrap the handler with observability middleware (now also starts tracing)
	obsHandler := func(path string, handler http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()

			// Start a new OpenTelemetry span for the incoming request.
			// This is the parent span for this request's processing.
			ctx, span := otel.Tracer("my-app").Start(r.Context(), path,
				oteltrace.WithSpanKind(oteltrace.SpanKindServer),
				oteltrace.WithAttributes(
					attribute.String("http.method", r.Method),
					attribute.String("http.target", r.URL.Path),
					attribute.String("net.host.name", r.Host),
				),
			)
			defer span.End() // Ensure the span is always ended

			// Update the request's context with the new span, so downstream operations
			// can create child spans or access the current trace ID.
			r = r.WithContext(ctx)

			lrw := &loggingResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}
			handler.ServeHTTP(lrw, r)

			status := lrw.statusCode
			duration := time.Since(start).Seconds()

			labels := prometheus.Labels{"path": path, "method": r.Method, "status": fmt.Sprintf("%d", status)}
			httpRequestsTotal.With(labels).Inc()
			httpRequestDuration.With(labels).Observe(duration)
		})
	}

	http.Handle("/hello", obsHandler("/hello", helloHandler))
	http.Handle("/metrics", promhttp.Handler())

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `initTracer()`: This function is crucial.
  - It creates a `stdouttrace` exporter. This simple exporter prints trace data directly to your console, making it easy to see what's happening without needing a full tracing backend. In a real system, you'd use an OTLP exporter to send data to an OpenTelemetry Collector or a tracing backend like Jaeger or Grafana Tempo.
  - It sets up a `trace.NewTracerProvider` with an `AlwaysSample()` sampler (for demonstration; in production you'd sample a percentage of requests to manage overhead) and defines service-level attributes like `ServiceName` and `ServiceVersion`.
  - `otel.SetTracerProvider(tp)` makes this provider the global default.
  - `otel.SetTextMapPropagator(...)` configures how trace context (trace ID, span ID) is injected into and extracted from HTTP headers. This is vital for distributed tracing.
- `main()` modifications:
  - We call `initTracer()` and use `defer tp.Shutdown()` to ensure all pending trace data is exported before the application exits.
- `obsHandler` (tracing part):
  - `otel.Tracer("my-app").Start(r.Context(), path, ...)`: This is where a new span is started for each incoming HTTP request. `r.Context()` is important here; if an incoming request already has trace context (e.g., from another service), `Start` will automatically create a child span. Otherwise, it starts a new trace.
  - `defer span.End()`: It's critical to call `End()` on a span when the operation it represents is complete, so its duration can be calculated.
  - `r = r.WithContext(ctx)`: We update the request's context with the newly created span. This allows our `helloHandler` (and any other functions it calls) to access the current trace and create child spans.
- `helloHandler` (tracing part):
  - `ctx := r.Context()` and `span := oteltrace.SpanFromContext(ctx)`: We retrieve the current span from the request context.
  - `span.SetAttributes(...)`: We can add custom attributes to the span, providing more context to our trace.
  - `log.Printf(..., span.SpanContext().TraceID().String(), span.SpanContext().SpanID().String())`: We're now including the trace ID and span ID in our log message! This is a super important practice for connecting logs to traces. If you see an error in a log, you can immediately jump to the corresponding trace to see the full request journey.
  - `_, childSpan := otel.Tracer("my-app").Start(ctx, "internal-logic-step")`: We demonstrate creating a child span within the `helloHandler`. This allows you to break down a single operation into more granular steps in your trace, helping pinpoint exactly which part of your code is slow.
  - `childSpan.AddEvent(...)`: Events can be added to spans to mark significant points within an operation.

Run and observe: Restart the server. Access http://localhost:8080/hello a few times. Now, look at your console output. In addition to the log messages, you'll see detailed JSON output from the `stdouttrace` exporter, representing your traces and spans!

```
# ... (previous logs and metrics) ...
2026/03/06 10:30:05 {"level": "info", "message": "Request received", "path": "/hello", "method": "GET", "trace_id": "...", "span_id": "..."}
{
	"Name": "/hello",
	"Kind": "SPAN_KIND_SERVER",
	"StartTime": "2026-03-06T10:30:05.123456Z",
	"EndTime": "2026-03-06T10:30:05.200000Z",
	"TraceID": "...",
	"SpanID": "...",
	"ParentSpanID": "...",
	"Attributes": [
		{"Key": "http.method", "Value": {"Type": "STRING", "Value": "GET"}},
		{"Key": "http.target", "Value": {"Type": "STRING", "Value": "/hello"}}
	],
	"Events": [],
	"Status": {"Code": "STATUS_CODE_UNSET"},
	"InstrumentationLibrary": {"Name": "my-app"}
}
{
	"Name": "internal-logic-step",
	"Kind": "SPAN_KIND_INTERNAL",
	"StartTime": "2026-03-06T10:30:05.170000Z",
	"EndTime": "2026-03-06T10:30:05.190000Z",
	"TraceID": "...",
	"SpanID": "...",
	"ParentSpanID": "...",  // This will match the SpanID of the "/hello" span
	"Attributes": [],
	"Events": [
		{"Name": "logic_processed", "Timestamp": "2026-03-06T10:30:05.180000Z", "Attributes": [{"Key": "processed_items", "Value": {"Type": "INT64", "Value": 10}}]}
	],
	"Status": {"Code": "STATUS_CODE_UNSET"},
	"InstrumentationLibrary": {"Name": "my-app"}
}
```

You'll see two spans for each request: one for `/hello` and one nested `internal-logic-step`. Notice how the `ParentSpanID` of the `internal-logic-step` matches the `SpanID` of the `/hello` span. This is the magic of tracing! Also, observe that your log messages now include the `trace_id` and `span_id`, creating a direct link between your detailed events and the overall request journey.
Mini-Challenge: Enhance Observability
You’ve built a basic observable service! Now, let’s make it a bit more robust.
Challenge:
Modify the `helloHandler` to:

- Introduce a simulated error condition (e.g., randomly return an HTTP 500 status code for 10% of requests).
- When an error occurs:
  - Log an `ERROR`-level message with details about the error.
  - Set the status of the current OpenTelemetry span to `Error` and add an event describing the error.
  - Ensure the Prometheus metrics (`http_requests_total` and `http_request_duration_seconds`) correctly record the `500` status.
Hint:
- For the random error, use `rand.Intn(100)` and check if it's less than 10. (As of Go 1.20, the global `math/rand` source is seeded automatically, so `rand.Seed` is deprecated and no longer needed.)
- To set a span's status to error, use `span.SetStatus(codes.Error, "error message")`, where `codes` is the `go.opentelemetry.io/otel/codes` package.
- Remember to call `WriteHeader(http.StatusInternalServerError)` to send the correct status code.
What to Observe/Learn:
- How errors are reflected across logs, metrics, and traces.
- The importance of consistent error reporting for troubleshooting.
- How to connect log messages with specific error spans.
Common Pitfalls & Troubleshooting
Even with observability tools, it’s easy to make mistakes that hinder problem-solving.
- Too Much or Too Little Logging:
- Too much: Logs become noisy, expensive to store, and hard to sift through.
- Too little: Critical information is missing when you need to debug.
- Troubleshooting: Use log levels effectively. Start with `INFO` in production, enable `DEBUG` only when actively troubleshooting. Leverage structured logging to make filtering easier.
- High Cardinality Metrics:
- Adding too many unique labels to metrics (e.g., a `user_id` label) can explode the number of time series, making your metric backend slow, expensive, or even crash. This is known as "high cardinality."
  - Troubleshooting: Be judicious with labels. Use `path`, `method`, `status` (low cardinality) but avoid `user_id` or `session_id` (high cardinality). If you need to search by user, use logs or traces.
- Broken Trace Context Propagation:
- If trace IDs aren’t correctly passed between services (e.g., missing HTTP headers), your traces will be “broken” – showing only parts of a request’s journey.
- Troubleshooting: Ensure your HTTP client and server libraries are configured to inject and extract trace context (OpenTelemetry propagators handle this automatically if configured correctly). Double-check custom middleware or network proxies that might strip headers.
- Incomplete Instrumentation:
- Only instrumenting the entry points of your application but not critical internal functions or database calls means you’ll have gaps in your observability.
- Troubleshooting: Adopt a “depth-first” approach. Instrument your critical business logic and external calls (database, external APIs) first. Use child spans to break down long operations.
Summary
Phew! You’ve just taken a massive leap in your problem-solving journey. Understanding observability is not just about tools; it’s a mindset that empowers you to truly understand and debug complex systems.
Here are the key takeaways from this chapter:
- Observability is the ability to understand your system’s internal state from its external outputs, critical for diagnosing unknown unknowns.
- The three pillars are Logs (discrete events), Metrics (aggregated numerical data), and Traces (end-to-end request journeys).
- Logs provide granular detail for specific events, especially valuable when structured.
- Metrics offer an aggregated, time-series view of system health and performance, best used with the Four Golden Signals (Latency, Traffic, Errors, Saturation).
- Traces visualize the flow of a single request through multiple services, using spans and context propagation to connect operations.
- OpenTelemetry is the vendor-neutral standard for collecting all three types of telemetry data, providing portability and consistency.
- Combining logs, metrics, and traces provides a holistic view for efficient root cause analysis.
- Effective instrumentation requires careful consideration of what to log, what to measure, and how to propagate trace context.
You now have the foundational knowledge and practical experience to start building observable applications. In the next chapter, we’ll put these tools to the test as we dive into real-world incident analysis and postmortems, exploring how engineers use observability data to diagnose and resolve major outages.
References
- OpenTelemetry Official Documentation
- Prometheus Official Documentation
- Google Cloud - Monitoring vs. Observability
- The Four Golden Signals by Google SRE
- Go OpenTelemetry Getting Started
- Prometheus Go Client Library