Introduction: Seeing Inside Your Software
Welcome back, aspiring problem-solver! In the previous chapters, we laid the groundwork for a systematic approach to tackling engineering challenges. We learned how to break down complex problems, form hypotheses, and think critically about system behavior. But how do you know what your system is doing when it’s running in production? How do you gather the evidence needed to validate those hypotheses?
This is where observability comes in. Observability is the ability to infer the internal state of a system by examining its external outputs. It’s like having X-ray vision for your software, allowing you to understand why things are happening, not just that they are happening. Without good observability, even the most brilliant problem-solving mind is flying blind.
In this chapter, we’ll dive deep into the three pillars of modern observability: logs, metrics, and traces. You’ll learn what each is, why they’re crucial for diagnosing real-world issues, and how to instrument your applications to emit this vital information. We’ll use practical examples in Go, incorporating industry-standard tools like OpenTelemetry and Prometheus, to give you hands-on experience in building observable systems. Get ready to gain the superpower of seeing inside your software!
Core Concepts: The Three Pillars
Before we start writing code, let’s understand the fundamental components that make up an observable system. Think of logs, metrics, and traces as different lenses through which you view your application’s behavior. Each provides a unique perspective, and together, they paint a complete picture.
Observability vs. Monitoring: What’s the Difference?
You might hear “observability” and “monitoring” used interchangeably, but there’s a subtle yet important distinction.
- Monitoring tells you if your system is working. It’s about collecting predefined data points (like CPU usage, memory, request rates) and alerting you when they cross certain thresholds. You monitor for known unknowns.
- Observability allows you to ask any question about your system’s behavior, even questions you didn’t anticipate when you built it. It’s about understanding why something is happening. You use observability to explore unknown unknowns.
To put it simply: Monitoring is about dashboards and alerts. Observability is about debugging and root cause analysis in complex, distributed systems.
Pillar 1: Logs – The Event Stream
Imagine your application is telling a story of everything it does. Every user request, every database query, every error, every internal decision – it all gets written down. This narrative is your logs.
What are Logs? Logs are timestamped records of discrete events that occur within your application or system. They are typically lines of text, though modern systems increasingly favor structured formats.
Why are Logs Important? Logs are invaluable for:
- Debugging specific issues: When a user reports an error, logs can show the exact sequence of events leading up to it.
- Auditing and security: Logs can record who did what and when, helping to track suspicious activity.
- Understanding application flow: By following log messages, you can trace the path of a request through different parts of your code.
Structured vs. Unstructured Logs
- Unstructured logs: These are human-readable text strings, often generated by simple `print` statements. While easy to read for a human, they are difficult for machines to parse and query efficiently.

```
2026-03-06 10:30:05 INFO User 'alice' logged in from 192.168.1.100
2026-03-06 10:30:06 ERROR Failed to fetch data for user 'bob': database connection lost
```

- Structured logs: These logs are formatted, usually as JSON, making them machine-readable and easy to query in log management systems.

```json
{"timestamp": "2026-03-06T10:30:05Z", "level": "info", "message": "User logged in", "user": "alice", "ip_address": "192.168.1.100"}
{"timestamp": "2026-03-06T10:30:06Z", "level": "error", "message": "Failed to fetch data", "user": "bob", "error": "database connection lost"}
```

Notice how each piece of information (timestamp, level, message, user, IP, error) is a distinct key-value pair. This allows you to filter logs by `user`, `level`, or `error` type with precision.
Best Practices for Logging:
- Log Context: Include relevant information like request IDs, user IDs, component names, and environment details. This helps connect log messages across different services and requests.
- Appropriate Levels: Use log levels (DEBUG, INFO, WARN, ERROR, FATAL) correctly to filter noise.
- Avoid Sensitive Data: Never log passwords, API keys, or other sensitive personal identifiable information (PII).
- Structured Logging: Always prefer structured logs for production systems.
Pillar 2: Metrics – The Aggregated View
If logs are individual stories, metrics are the aggregated statistics. They are numerical measurements captured over time, representing the health and performance of your system.
What are Metrics? Metrics are numerical values that represent a specific aspect of your system at a given point in time. They are typically time-series data, meaning they are collected periodically and stored with a timestamp.
Why are Metrics Important? Metrics are excellent for:
- Monitoring system health: Track CPU, memory, disk I/O, network traffic.
- Performance trending: Observe how latency, throughput, or error rates change over time.
- Alerting: Trigger alerts when key performance indicators (KPIs) deviate from expected norms.
- Capacity planning: Understand resource utilization to plan for future growth.
Common Types of Metrics:
- Counters: A cumulative metric that only ever goes up. Useful for counting total requests, errors, or completed tasks.
  - Example: `http_requests_total`
- Gauges: A metric that represents a single numerical value that can go up or down. Useful for current values like CPU utilization, memory usage, or queue size.
  - Example: `current_queue_size`
- Histograms: Sample observations (e.g., request durations) and count them in configurable buckets. This allows you to calculate quantiles (e.g., 90th percentile latency), which are crucial for understanding user experience.
  - Example: `http_request_duration_seconds`
- Summaries: Similar to histograms but calculate configurable quantiles on the client side. More resource-intensive for high-cardinality data.
The Four Golden Signals: A popular framework for choosing what to measure, especially for user-facing services:
- Latency: The time it takes to serve a request.
- Traffic: How much demand is being placed on your system (e.g., requests per second).
- Errors: The rate of requests that fail.
- Saturation: How “full” your service is (e.g., CPU, memory, disk, network utilization).
By focusing on these four signals, you can get a comprehensive view of your system’s health.
Pillar 3: Traces – The Request’s Journey
In a distributed system, a single user action might involve dozens of services communicating with each other. A log message from one service doesn’t tell you what happened in another. This is where traces come in.
What are Traces? A trace represents the end-to-end journey of a single request or operation as it flows through multiple services in a distributed system. A trace is composed of multiple spans.
- Span: A span represents a single operation within a trace (e.g., an HTTP request, a database query, a function call). Each span has a name, a start time, an end time, attributes (key-value metadata), and a parent-child relationship with other spans.
Why are Traces Important? Traces are critical for:
- Root cause analysis in distributed systems: Pinpoint which service or component caused a latency spike or error.
- Performance optimization: Identify bottlenecks across service boundaries.
- Understanding complex interactions: Visualize the flow of data and control through microservices.
- Service dependency mapping: Discover how services interact in real-time.
How Traces Work: Context Propagation The magic of distributed tracing lies in context propagation. When a request moves from one service to another, a unique trace ID and parent span ID are injected into the request headers. The receiving service then extracts this information and uses it to create new child spans, linking them back to the original trace. This forms a directed acyclic graph (DAG) of operations, showing the entire request path.
OpenTelemetry: The Standard for Observability Data Historically, collecting logs, metrics, and traces often involved vendor-specific agents and SDKs. This led to vendor lock-in and complexity. This is why OpenTelemetry (OTel) was created.
OpenTelemetry is a vendor-neutral, open-source set of APIs, SDKs, and tools designed to standardize the generation and collection of telemetry data (logs, metrics, and traces). It’s supported by the Cloud Native Computing Foundation (CNCF) and is rapidly becoming the industry standard.
Using OpenTelemetry means your instrumentation code is portable. You can switch your backend (where you send your observability data) without changing your application code. As of 2026-03-06, OpenTelemetry has stable SDKs for many popular languages, including Go, Java, Python, JavaScript, and more, with active development continuing across all signal types. For the latest status and documentation, always refer to the official OpenTelemetry website.
The Relationship Between Logs, Metrics, and Traces
While distinct, these three pillars are most powerful when used together.
- Metrics tell you that a problem is occurring (e.g., “latency is high”).
- Traces help you pinpoint where the problem is happening (e.g., “it’s the database call in Service B”).
- Logs provide the granular details why it’s happening (e.g., “Service B’s log shows a ‘database connection lost’ error just before the slow query”).
Together, they provide a holistic view for effective problem-solving.
Picture the relationship as a pipeline: your application emits raw observability data (logs, metrics, and traces), which is often collected and processed by an OpenTelemetry Collector before being sent to specialized backend systems for storage, querying, and visualization. Finally, engineers use dashboards and tools to interpret this data and solve problems.
Step-by-Step Implementation: Instrumenting a Go Application
Let’s get our hands dirty! We’ll build a simple Go HTTP server and incrementally add logging, metrics, and tracing.
Prerequisites:
- Go installed (version 1.21 or newer, as of 2026-03-06).
- A text editor and a terminal.
Step 1: Set Up Your Project
First, create a new Go module and a main.go file.
- Create a directory:

```bash
mkdir my-observability-app
cd my-observability-app
```

- Initialize the Go module:

```bash
go mod init my-observability-app
```

- Create `main.go`:

```go
// main.go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	http.Handle("/hello", helloHandler)

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

- Run it:

```bash
go run main.go
```

Open http://localhost:8080/hello in your browser. You should see "Hello, world! You requested: /hello".
Step 2: Add Basic Logging
We’ll start by adding simple log package calls.
Modify `main.go`: Let's add a log message inside our `helloHandler` to track when a request is processed.

```go
// main.go (additions highlighted)
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Add a log message here
		log.Printf("INFO: Request received for path: %s from %s", r.URL.Path, r.RemoteAddr)
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	http.Handle("/hello", helloHandler)

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `log.Printf` is a simple way to print formatted strings (the standard `log` package writes to stderr by default).
- We're including the request path and the remote address for context.

Run and observe: Restart the server (`Ctrl+C`, then `go run main.go`). Access http://localhost:8080/hello a few times. You'll see output like:

```
2026/03/06 10:30:00 Server starting on :8080
2026/03/06 10:30:05 INFO: Request received for path: /hello from 127.0.0.1:54321
2026/03/06 10:30:07 INFO: Request received for path: /hello from 127.0.0.1:54323
```

This is basic, unstructured logging.
Step 3: Add Metrics with Prometheus
Now, let’s add some metrics to track request counts and durations. We’ll use the Prometheus Go client library.
Install the Prometheus client library:

```bash
go get github.com/prometheus/client_golang/prometheus@latest
go get github.com/prometheus/client_golang/prometheus/promhttp@latest
```

Modify `main.go`: We'll define global counters and histograms, register them, and then create a middleware to record metrics for each request.

```go
// main.go (additions highlighted)
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"          // New import
	"github.com/prometheus/client_golang/prometheus/promhttp" // New import
)

// Global Prometheus metrics
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"path", "method", "status"}, // Labels for breakdown
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds.",
			Buckets: prometheus.DefBuckets, // Default buckets are good for starters
		},
		[]string{"path", "method", "status"},
	)
)

func init() {
	// Register Prometheus metrics when the package is initialized
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// loggingResponseWriter is a wrapper to capture the HTTP status code,
// which the standard http.ResponseWriter does not expose to middleware.
type loggingResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (lrw *loggingResponseWriter) WriteHeader(code int) {
	lrw.statusCode = code
	lrw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("INFO: Request received for path: %s from %s", r.URL.Path, r.RemoteAddr)
		// Simulate some work that takes time
		time.Sleep(50 * time.Millisecond)
		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	// Wrap the handler with observability middleware
	obsHandler := func(path string, handler http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()

			// Create a ResponseWriter wrapper to capture the status code
			lrw := &loggingResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

			// Call the actual handler
			handler.ServeHTTP(lrw, r)

			// After the handler executes, capture metrics
			status := lrw.statusCode
			duration := time.Since(start).Seconds()

			// Define labels for our metrics
			labels := prometheus.Labels{"path": path, "method": r.Method, "status": fmt.Sprintf("%d", status)}
			httpRequestsTotal.With(labels).Inc()               // Increment the counter
			httpRequestDuration.With(labels).Observe(duration) // Observe the duration
		})
	}

	// Register our handler with the observability middleware
	http.Handle("/hello", obsHandler("/hello", helloHandler))

	// Expose the Prometheus metrics endpoint
	http.Handle("/metrics", promhttp.Handler())

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `httpRequestsTotal` (CounterVec): A counter that increments for each HTTP request. `CounterVec` allows us to add labels (`path`, `method`, `status`) to break down the counts.
- `httpRequestDuration` (HistogramVec): A histogram that records the duration of requests. `HistogramVec` also uses labels. `prometheus.DefBuckets` provides a reasonable default set of time ranges.
- `init()` function: This Go special function runs before `main()`. We use it to `prometheus.MustRegister` our metrics, making them available to the Prometheus scraper.
- `loggingResponseWriter`: This custom `http.ResponseWriter` wrapper is a common pattern in Go HTTP servers to capture the actual HTTP status code returned by the handler, which is essential for our `status` label.
- `obsHandler` (middleware): This function acts as an HTTP middleware. It wraps our `helloHandler` to:
  - Record the start time.
  - Call the original handler.
  - Capture the status code and duration after the handler completes.
  - Increment the `httpRequestsTotal` counter and `Observe` the `httpRequestDuration` with appropriate labels.
- `/metrics` endpoint: We register `promhttp.Handler()` to expose our collected metrics in a format that Prometheus can scrape.

Run and observe: Restart the server. Access http://localhost:8080/hello a few times. Now, open http://localhost:8080/metrics in your browser. You'll see a page full of metrics in the Prometheus exposition format! Look for `http_requests_total` and `http_request_duration_seconds`. You should see them incrementing with each request to `/hello`.
Step 4: Add Tracing with OpenTelemetry
Finally, let’s add distributed tracing using OpenTelemetry. We’ll instrument our helloHandler and add a custom child span.
Install the OpenTelemetry Go SDK:

```bash
go get go.opentelemetry.io/otel@latest
go get go.opentelemetry.io/otel/attribute@latest
go get go.opentelemetry.io/otel/exporters/stdout/stdouttrace@latest
go get go.opentelemetry.io/otel/propagation@latest
go get go.opentelemetry.io/otel/sdk/resource@latest
go get go.opentelemetry.io/otel/sdk/trace@latest
go get go.opentelemetry.io/otel/semconv/v1.24.0@latest  # Use a specific stable semantic conventions version
go get go.opentelemetry.io/otel/trace@latest
```

Note: `semconv/v1.24.0` is a placeholder for a recent, stable version of the semantic conventions. Always check `go.opentelemetry.io/otel/semconv` for the latest stable version.

Modify `main.go`: We'll add an `initTracer` function, set up OpenTelemetry in `main`, and then instrument our handler.

```go
// main.go (additions highlighted)
package main

import (
	"context" // New import for OpenTelemetry context
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"                              // New import
	"go.opentelemetry.io/otel/attribute"                    // New import
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace" // New import (for console output)
	"go.opentelemetry.io/otel/propagation"                  // New import (context propagation)
	"go.opentelemetry.io/otel/sdk/resource"                 // New import
	"go.opentelemetry.io/otel/sdk/trace"                    // New import
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"      // New import: semantic conventions
	oteltrace "go.opentelemetry.io/otel/trace"              // New import
)

// Global Prometheus metrics (unchanged)
var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests.",
		},
		[]string{"path", "method", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"path", "method", "status"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// initTracer initializes an OpenTelemetry tracer provider.
// This sets up where our trace data will be sent.
func initTracer() *trace.TracerProvider {
	// Create a stdout exporter to see the traces directly in the console.
	// In a real application, you'd use an OTLP exporter to send to a collector/backend.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatalf("failed to create stdout exporter: %v", err)
	}

	// For demonstration, use an always-on sampler. In production, use a
	// parent-based sampler to control tracing overhead.
	tp := trace.NewTracerProvider(
		trace.WithSampler(trace.AlwaysSample()), // Sample all traces for demonstration
		trace.WithBatcher(exporter),             // Export traces in batches
		trace.WithResource(resource.NewWithAttributes( // Define service attributes
			semconv.SchemaURL, // Use OpenTelemetry's standard schema
			semconv.ServiceName("my-observability-app"),
			semconv.ServiceVersion("1.0.0"),
			attribute.String("environment", "development"),
		)),
	)

	// Set the global TracerProvider, so all subsequent calls to otel.Tracer() use this provider.
	otel.SetTracerProvider(tp)

	// Set up a text map propagator so trace context travels in HTTP headers
	// (W3C Trace Context plus Baggage).
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return tp
}

// loggingResponseWriter (unchanged)
type loggingResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (lrw *loggingResponseWriter) WriteHeader(code int) {
	lrw.statusCode = code
	lrw.ResponseWriter.WriteHeader(code)
}

func main() {
	// Initialize the OpenTelemetry tracer provider
	tp := initTracer()
	// Ensure the tracer provider is shut down when the application exits
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer provider: %v", err)
		}
	}()

	// Define a simple handler
	helloHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Get the current span from the request context.
		// If tracing is properly set up, this is the span started by the middleware.
		ctx := r.Context()
		span := oteltrace.SpanFromContext(ctx)
		span.SetAttributes(attribute.String("http.request.id", "some-unique-id")) // Example custom attribute

		log.Printf(`{"level": "info", "message": "Request received", "path": "%s", "method": "%s", "trace_id": "%s", "span_id": "%s"}`,
			r.URL.Path, r.Method, span.SpanContext().TraceID().String(), span.SpanContext().SpanID().String())

		// Simulate some work
		time.Sleep(50 * time.Millisecond)

		// Create a child span for internal logic.
		// This helps break down the request into smaller, observable operations.
		_, childSpan := otel.Tracer("my-app").Start(ctx, "internal-logic-step")
		defer childSpan.End() // Ensure the child span is always ended

		time.Sleep(20 * time.Millisecond)
		childSpan.AddEvent("logic_processed", oteltrace.WithAttributes(attribute.Int("processed_items", 10)))

		fmt.Fprintf(w, "Hello, world! You requested: %s\n", r.URL.Path)
	})

	// Wrap the handler with observability middleware (now also starts tracing)
	obsHandler := func(path string, handler http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()

			// Start a new OpenTelemetry span for the incoming request.
			// This is the parent span for this request's processing.
			ctx, span := otel.Tracer("my-app").Start(r.Context(), path,
				oteltrace.WithSpanKind(oteltrace.SpanKindServer),
				oteltrace.WithAttributes(
					attribute.String("http.method", r.Method),
					attribute.String("http.target", r.URL.Path),
					attribute.String("net.host.name", r.Host),
				),
			)
			defer span.End() // Ensure the span is always ended

			// Update the request's context with the new span, so downstream operations
			// can create child spans or access the current trace ID.
			r = r.WithContext(ctx)

			lrw := &loggingResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}
			handler.ServeHTTP(lrw, r)

			status := lrw.statusCode
			duration := time.Since(start).Seconds()

			labels := prometheus.Labels{"path": path, "method": r.Method, "status": fmt.Sprintf("%d", status)}
			httpRequestsTotal.With(labels).Inc()
			httpRequestDuration.With(labels).Observe(duration)
		})
	}

	http.Handle("/hello", obsHandler("/hello", helloHandler))
	http.Handle("/metrics", promhttp.Handler())

	log.Println("Server starting on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed to start: %v", err)
	}
}
```

Explanation:

- `initTracer()`: This function is crucial.
  - It creates a `stdouttrace` exporter. This simple exporter prints trace data directly to your console, making it easy to see what's happening without needing a full tracing backend. In a real system, you'd use an OTLP exporter to send data to an OpenTelemetry Collector or a tracing backend like Jaeger or Grafana Tempo.
  - It sets up a `trace.NewTracerProvider` with an `AlwaysSample()` sampler (for demonstration; in production you'd sample a percentage of requests to manage overhead) and defines service-level attributes like `ServiceName` and `ServiceVersion`.
  - `otel.SetTracerProvider(tp)` makes this provider the global default.
  - `otel.SetTextMapPropagator(...)` configures how trace context (trace ID, span ID) is injected into and extracted from HTTP headers. This is vital for distributed tracing.
- `main()` modifications:
  - We call `initTracer()` and use `defer tp.Shutdown()` to ensure all pending trace data is exported before the application exits.
- `obsHandler` (tracing part):
  - `otel.Tracer("my-app").Start(r.Context(), path, ...)`: This is where a new span is started for each incoming HTTP request. `r.Context()` is important here; if an incoming request already has trace context (e.g., from another service), `Start` will automatically create a child span. Otherwise, it starts a new trace.
  - `defer span.End()`: It's critical to call `End()` on a span when the operation it represents is complete, so its duration can be calculated.
  - `r = r.WithContext(ctx)`: We update the request's context with the newly created span. This allows our `helloHandler` (and any other functions it calls) to access the current trace and create child spans.
- `helloHandler` (tracing part):
  - `ctx := r.Context()` and `span := oteltrace.SpanFromContext(ctx)`: We retrieve the current span from the request context.
  - `span.SetAttributes(...)`: We can add custom attributes to the span, providing more context to our trace.
  - `log.Printf(..., span.SpanContext().TraceID().String(), span.SpanContext().SpanID().String())`: We're now including the trace ID and span ID in our log message! This is a super important practice for connecting logs to traces. If you see an error in a log, you can immediately jump to the corresponding trace to see the full request journey.
  - `_, childSpan := otel.Tracer("my-app").Start(ctx, "internal-logic-step")`: We demonstrate creating a child span within the `helloHandler`. This allows you to break down a single operation into more granular steps in your trace, helping pinpoint exactly which part of your code is slow.
  - `childSpan.AddEvent(...)`: Events can be added to spans to mark significant points within an operation.

Run and observe: Restart the server. Access http://localhost:8080/hello a few times. Now, look at your console output. In addition to the log messages, you'll see detailed JSON output from the `stdouttrace` exporter, representing your traces and spans!

```
# ... (previous logs and metrics) ...
2026/03/06 10:30:05 {"level": "info", "message": "Request received", "path": "/hello", "method": "GET", "trace_id": "...", "span_id": "..."}
{
	"Name": "/hello",
	"Kind": "SPAN_KIND_SERVER",
	"StartTime": "2026-03-06T10:30:05.123456Z",
	"EndTime": "2026-03-06T10:30:05.200000Z",
	"TraceID": "...",
	"SpanID": "...",
	"ParentSpanID": "...",
	"Attributes": [
		{"Key": "http.method", "Value": {"Type": "STRING", "Value": "GET"}},
		{"Key": "http.target", "Value": {"Type": "STRING", "Value": "/hello"}}
	],
	"Events": [],
	"Status": {"Code": "STATUS_CODE_UNSET"},
	"InstrumentationLibrary": {"Name": "my-app"}
}
{
	"Name": "internal-logic-step",
	"Kind": "SPAN_KIND_INTERNAL",
	"StartTime": "2026-03-06T10:30:05.170000Z",
	"EndTime": "2026-03-06T10:30:05.190000Z",
	"TraceID": "...",
	"SpanID": "...",
	"ParentSpanID": "...",  // This will match the SpanID of the "/hello" span
	"Attributes": [],
	"Events": [
		{"Name": "logic_processed", "Timestamp": "2026-03-06T10:30:05.180000Z", "Attributes": [{"Key": "processed_items", "Value": {"Type": "INT64", "Value": 10}}]}
	],
	"Status": {"Code": "STATUS_CODE_UNSET"},
	"InstrumentationLibrary": {"Name": "my-app"}
}
```

You'll see two spans for each request: one for `/hello` and one nested `internal-logic-step`. Notice how the `ParentSpanID` of the `internal-logic-step` matches the `SpanID` of the `/hello` span. This is the magic of tracing! Also, observe that your log messages now include the `trace_id` and `span_id`, creating a direct link between your detailed events and the overall request journey.
Mini-Challenge: Enhance Observability
You’ve built a basic observable service! Now, let’s make it a bit more robust.
Challenge:
Modify the `helloHandler` to:

- Introduce a simulated error condition (e.g., randomly return an HTTP 500 status code for 10% of requests).
- When an error occurs:
  - Log an `ERROR`-level message with details about the error.
  - Set the status of the current OpenTelemetry span to `Error` and add an event describing the error.
  - Ensure the Prometheus metrics (`http_requests_total` and `http_request_duration_seconds`) correctly record the `500` status.
Hint:
- For the random error, use `rand.Intn(100)` and check if it's less than 10. (As of Go 1.20, the global `math/rand` source is seeded automatically, so `rand.Seed` is deprecated and no longer needed.)
- To set a span's status to error, use `span.SetStatus(codes.Error, "error message")`, where `codes` is the `go.opentelemetry.io/otel/codes` package.
- Remember to call `WriteHeader(http.StatusInternalServerError)` to send the correct status code.
What to Observe/Learn:
- How errors are reflected across logs, metrics, and traces.
- The importance of consistent error reporting for troubleshooting.
- How to connect log messages with specific error spans.
Common Pitfalls & Troubleshooting
Even with observability tools, it’s easy to make mistakes that hinder problem-solving.
- Too Much or Too Little Logging:
- Too much: Logs become noisy, expensive to store, and hard to sift through.
- Too little: Critical information is missing when you need to debug.
- Troubleshooting: Use log levels effectively. Start with `INFO` in production, enable `DEBUG` only when actively troubleshooting. Leverage structured logging to make filtering easier.
- High Cardinality Metrics:
- Adding too many unique labels to metrics (e.g., a `user_id` label) can explode the number of time series, making your metric backend slow, expensive, or even crash. This is known as "high cardinality."
  - Troubleshooting: Be judicious with labels. Use `path`, `method`, `status` (low cardinality) but avoid `user_id` or `session_id` (high cardinality). If you need to search by user, use logs or traces.
- Broken Trace Context Propagation:
- If trace IDs aren’t correctly passed between services (e.g., missing HTTP headers), your traces will be “broken” – showing only parts of a request’s journey.
- Troubleshooting: Ensure your HTTP client and server libraries are configured to inject and extract trace context (OpenTelemetry propagators handle this automatically if configured correctly). Double-check custom middleware or network proxies that might strip headers.
- Incomplete Instrumentation:
- Only instrumenting the entry points of your application but not critical internal functions or database calls means you’ll have gaps in your observability.
- Troubleshooting: Adopt a “depth-first” approach. Instrument your critical business logic and external calls (database, external APIs) first. Use child spans to break down long operations.
Summary
Phew! You’ve just taken a massive leap in your problem-solving journey. Understanding observability is not just about tools; it’s a mindset that empowers you to truly understand and debug complex systems.
Here are the key takeaways from this chapter:
- Observability is the ability to understand your system’s internal state from its external outputs, critical for diagnosing unknown unknowns.
- The three pillars are Logs (discrete events), Metrics (aggregated numerical data), and Traces (end-to-end request journeys).
- Logs provide granular detail for specific events, especially valuable when structured.
- Metrics offer an aggregated, time-series view of system health and performance, best used with the Four Golden Signals (Latency, Traffic, Errors, Saturation).
- Traces visualize the flow of a single request through multiple services, using spans and context propagation to connect operations.
- OpenTelemetry is the vendor-neutral standard for collecting all three types of telemetry data, providing portability and consistency.
- Combining logs, metrics, and traces provides a holistic view for efficient root cause analysis.
- Effective instrumentation requires careful consideration of what to log, what to measure, and how to propagate trace context.
You now have the foundational knowledge and practical experience to start building observable applications. In the next chapter, we’ll put these tools to the test as we dive into real-world incident analysis and postmortems, exploring how engineers use observability data to diagnose and resolve major outages.
References
- OpenTelemetry Official Documentation
- Prometheus Official Documentation
- Google Cloud - Monitoring vs. Observability
- The Four Golden Signals by Google SRE
- Go OpenTelemetry Getting Started
- Prometheus Go Client Library