Welcome to the final chapter of our comprehensive Node.js project guide! Throughout this series, we’ve built a robust, secure, and scalable Fastify application, containerized it with Docker, and deployed it to AWS ECS. In this pivotal chapter, we shift our focus to observability, a critical aspect of any production-grade application. Observability isn’t just about collecting data; it’s about understanding the internal state of your system from external outputs, enabling you to debug, optimize, and ensure reliability.
In this chapter, we will enhance our application’s logging capabilities using the high-performance pino logger, implement essential health check endpoints for robust service management, and integrate these outputs with AWS CloudWatch for centralized monitoring and alerting. We’ll explore why structured logging is superior for production environments and how health checks are vital for automated deployments and load balancing. By the end of this chapter, your application will not only be deployed but also equipped with the tools necessary for proactive issue detection and efficient operational management, marking the culmination of our journey from scratch to deployment.
Planning & Design
For a production application, knowing what’s happening inside is paramount. Our observability strategy will involve:
- Structured Logging: Utilizing pino for efficient, machine-readable logs that can be easily parsed and analyzed by monitoring tools.
- Health Checks: Implementing dedicated endpoints to report the application’s operational status, crucial for load balancers and container orchestration platforms like AWS ECS.
- Basic Metrics (via logs/health): While a full-fledged metrics system like Prometheus/Grafana is beyond this chapter’s scope, we’ll ensure our logs and health checks provide enough data points for basic monitoring in CloudWatch.
Observability Architecture
The following diagram illustrates how our Fastify application will integrate logging and health checks with AWS services for enhanced observability.
File Structure Modifications
We’ll primarily modify existing files and introduce some configuration:
src/
├── config/
│ └── index.ts # Add logging configuration
├── plugins/
│ └── logger.ts # New: Pino logger plugin for Fastify
├── routes/
│ └── health.routes.ts # New: Health check route
├── server.ts # Integrate logger and health route
└── ... (existing files)
Step-by-Step Implementation
1. Enhanced Logging with Pino
Pino is a highly performant Node.js logger designed for production. It outputs JSON logs, which are ideal for consumption by log aggregation services like AWS CloudWatch Logs.
a) Setup/Configuration
First, let’s install pino and pino-pretty (for development-friendly log output). Pino ships with its own TypeScript type definitions, so no separate @types package is needed.
npm install pino pino-pretty
Next, let’s create a Fastify plugin for Pino to ensure it’s properly integrated and configured across our application.
File: src/plugins/logger.ts
import fp from 'fastify-plugin';
import { FastifyInstance, FastifyPluginOptions } from 'fastify';
import pino, { LoggerOptions } from 'pino';
// Define the shape of our logger configuration
interface LoggerConfig {
  level: string;
  prettyPrint: boolean;
}

declare module 'fastify' {
  interface FastifyInstance {
    // `config` is decorated in server.ts; declaring it here gives us type safety.
    config: { LOGGER: LoggerConfig } & Record<string, unknown>;
  }
}

export default fp(async (fastify: FastifyInstance, opts: FastifyPluginOptions) => {
  const loggerConfig: LoggerConfig = fastify.config.LOGGER;
  const pinoOptions: LoggerOptions = {
    level: loggerConfig.level || 'info',
    // In production, we want JSON logs for CloudWatch.
    // In development, pino-pretty makes logs readable.
    transport: loggerConfig.prettyPrint
      ? {
          target: 'pino-pretty',
          options: {
            colorize: true,
            translateTime: 'SYS:HH:MM:ss Z',
            ignore: 'pid,hostname',
          },
        }
      : undefined, // undefined means default JSON output
  };
  const logger = pino(pinoOptions);

  // Note: Fastify already defines a `log` decorator, so calling
  // fastify.decorate('log', logger) would throw FST_ERR_DEC_ALREADY_PRESENT.
  // Assign the property directly to replace the (disabled) default logger.
  fastify.log = logger;

  fastify.addHook('onRequest', (request, reply, done) => {
    request.log = logger.child({ reqId: request.id, method: request.method, url: request.url });
    done();
  });

  fastify.addHook('onResponse', (request, reply, done) => {
    request.log.info({
      statusCode: reply.statusCode,
      responseTime: reply.getResponseTime(),
    }, 'Request completed');
    done();
  });

  fastify.log.info(`Logger initialized with level: ${loggerConfig.level}, prettyPrint: ${loggerConfig.prettyPrint}`);
});
Explanation:
- We use fastify-plugin to make our logger available globally across our Fastify instance.
- The pinoOptions are set dynamically from fastify.config.LOGGER. In development, pino-pretty provides human-readable output; in production, pino defaults to JSON for easy parsing by CloudWatch.
- The pino logger instance is exposed as fastify.log, replacing the disabled default instance logger.
- The onRequest and onResponse hooks automatically log request details (with a unique reqId) and response times, providing valuable context for each transaction.
- We augment the FastifyInstance interface so TypeScript knows about our custom decorations.
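To build intuition for what logger.child() is doing in the onRequest hook, here is a dependency-free sketch of the idea: a child logger merges fixed bindings (like reqId) into every entry it emits. The names below are illustrative, not part of pino — pino implements this far more efficiently.

```typescript
// Conceptual model of a child logger: fixed bindings merged into every entry.
type Bindings = Record<string, unknown>;

function makeChild(base: Bindings, extra: Bindings): (msg: string) => Bindings {
  const bound = { ...base, ...extra }; // bindings are fixed at child creation
  return (msg: string) => ({ ...bound, msg });
}

const requestLog = makeChild({ app: 'demo' }, { reqId: 'r1' });
console.log(requestLog('hello'));
```

This is why every line logged via request.log automatically carries the request ID without you passing it explicitly.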
Now, let’s update our configuration to include logger settings.
File: src/config/index.ts
// ... (previous imports and types)
interface AppConfig {
  port: number;
  host: string;
  // ... existing configs
  LOGGER: {
    level: string;
    prettyPrint: boolean;
  };
}

const config: AppConfig = {
  port: parseInt(process.env.PORT || '3000', 10),
  host: process.env.HOST || '0.0.0.0',
  // ... existing configs
  LOGGER: {
    level: process.env.LOG_LEVEL || 'info',
    prettyPrint: process.env.NODE_ENV !== 'production', // Only pretty-print in non-production environments
  },
};

export default config;
Explanation:
- We’ve added a LOGGER object to our AppConfig interface and config object.
- LOG_LEVEL can be set via environment variables (e.g., debug, info, warn, error).
- prettyPrint defaults to true unless NODE_ENV is production, ensuring JSON logs in production.
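The environment-to-config mapping above boils down to a small pure function. This stand-alone sketch (function and type names are hypothetical, chosen for illustration) mirrors the logic and is easy to unit-test without touching process.env:

```typescript
// Derive logger settings from environment-style input (illustrative helper).
interface LoggerSettings {
  level: string;
  prettyPrint: boolean;
}

function resolveLoggerSettings(env: Record<string, string | undefined>): LoggerSettings {
  return {
    level: env.LOG_LEVEL || 'info',
    // Pretty-print everywhere except production, where JSON is preferred.
    prettyPrint: env.NODE_ENV !== 'production',
  };
}

console.log(resolveLoggerSettings({ NODE_ENV: 'production', LOG_LEVEL: 'warn' }));
```

Passing a plain object instead of reading process.env directly makes the behavior trivially testable for each environment combination.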
b) Core Implementation
Now, register the logger plugin in your server.ts and use it.
File: src/server.ts
import Fastify from 'fastify';
import { randomUUID } from 'node:crypto'; // For collision-safe request IDs
import config from './config';
import { connectDB } from './utils/database';
// ... other plugin imports
import authRoutes from './routes/auth.routes';
import userRoutes from './routes/user.routes';
import loggerPlugin from './plugins/logger'; // Import our logger plugin
import healthRoutes from './routes/health.routes'; // Import health routes

const fastify = Fastify({
  logger: false, // Disable Fastify's default logger as we're using Pino
  genReqId: () => randomUUID(), // Unique request ID
});

// Decorate fastify with config before other plugins might need it
fastify.decorate('config', config);

// Register Pino logger plugin first
fastify.register(loggerPlugin);

// ... existing plugin registrations (e.g., helmet, cors, jwt, etc.)
// Example:
// fastify.register(helmet);
// fastify.register(cors);
// fastify.register(jwtPlugin);
// fastify.register(authMiddleware);

// Register routes
fastify.register(authRoutes, { prefix: '/api/auth' });
fastify.register(userRoutes, { prefix: '/api/users' });
fastify.register(healthRoutes, { prefix: '/api' }); // Register health routes

// Centralized error handling (from previous chapter)
fastify.setErrorHandler((error, request, reply) => {
  // reply.statusCode may still be 200 at this point; prefer the error's own status code.
  const statusCode = error.statusCode ?? 500;
  request.log.error({ error: error.message, stack: error.stack, statusCode }, 'Unhandled error caught by error handler');
  reply.status(statusCode).send({
    statusCode,
    error: error.name,
    message: error.message || 'Internal Server Error',
  });
});

const start = async () => {
  try {
    await connectDB(fastify.log); // Pass logger to DB connection
    await fastify.listen({ port: config.port, host: config.host });
    fastify.log.info(`Server listening on ${config.host}:${config.port}`);
  } catch (err) {
    fastify.log.error(err, 'Server failed to start');
    process.exit(1);
  }
};

process.on('unhandledRejection', (reason, promise) => {
  fastify.log.error({ reason, promise }, 'Unhandled Rejection at: Promise');
});

process.on('uncaughtException', (error) => {
  fastify.log.error({ error }, 'Uncaught Exception thrown');
  process.exit(1); // Exit the process after an uncaught exception
});

start();
Explanation:
- We disable Fastify’s built-in logger (logger: false) to avoid duplicate logs.
- Our loggerPlugin is registered early.
- The connectDB function (from previous chapters) now receives the fastify.log instance, allowing it to log database connection status.
- Error handling and process exit handlers now use fastify.log.error for consistent logging.
Now, you can use request.log or fastify.log in any part of your application (routes, services, plugins).
Example usage in a route (e.g., src/routes/auth.routes.ts):
// ... existing imports
import { FastifyInstance } from 'fastify';
import { registerUser, loginUser } from '../services/auth.service';
import { registerSchema, loginSchema } from '../schemas/auth.schema'; // Assuming validation schemas
export default async function authRoutes(fastify: FastifyInstance) {
  fastify.post('/register', { schema: registerSchema }, async (request, reply) => {
    request.log.info('Received registration request'); // Using request.log
    try {
      const user = await registerUser(request.body as any);
      request.log.info({ userId: user.id }, 'User registered successfully');
      reply.status(201).send({ message: 'User registered successfully', userId: user.id });
    } catch (error: any) {
      request.log.error({ error: error.message }, 'Registration failed');
      reply.status(400).send({ message: error.message });
    }
  });

  fastify.post('/login', { schema: loginSchema }, async (request, reply) => {
    request.log.debug('Attempting login for user'); // Using request.log
    try {
      const { token, user } = await loginUser(request.body as any);
      request.log.info({ userId: user.id }, 'User logged in successfully');
      reply.send({ token });
    } catch (error: any) {
      request.log.warn({ error: error.message }, 'Login failed'); // Using warn for failed login attempts
      reply.status(401).send({ message: error.message });
    }
  });
}
c) Testing This Component
- Start your application locally: npm run dev
- Observe console output: You should see pretty-printed logs with timestamps, log levels, and messages.
- Make some requests: Use curl or Postman to hit your API endpoints (e.g., /api/auth/register, /api/auth/login).
- Verify logs: Confirm that info, debug, and error logs appear as expected, including the request ID and response time logs.
If you set NODE_ENV=production locally, you would see JSON output:
NODE_ENV=production npm run dev
You’d then see output like:
{"level":30,"time":1673116800000,"pid":12345,"hostname":"my-host","msg":"Server listening on 0.0.0.0:3000"}
{"level":30,"time":1673116801000,"pid":12345,"hostname":"my-host","reqId":"abcdefg","method":"POST","url":"/api/auth/register","msg":"Received registration request"}
d) Production Considerations
- Log Level: In production, set LOG_LEVEL to info or warn to avoid excessive debug logs impacting performance and storage costs.
- Structured Logs: The JSON output from pino is automatically consumed by AWS CloudWatch Logs when deployed to ECS with appropriate IAM roles. CloudWatch can then parse, filter, and alert on these logs.
- Sensitive Data: NEVER log sensitive information such as passwords, API keys, or personally identifiable information (PII). Implement log redaction if necessary; pino supports redaction natively, but preventing sensitive data from reaching the logger in the first place is best.
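pino's built-in redact option is the right tool in practice; purely as an illustration of the idea, here is a dependency-free sketch that masks a hypothetical set of sensitive keys before an object would reach a logger:

```typescript
// Illustrative stand-alone redaction sketch (pino's `redact` option does this for real).
const SENSITIVE_KEYS = new Set(['password', 'token', 'apiKey']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      // Mask sensitive keys; recurse into nested objects otherwise.
      out[k] = SENSITIVE_KEYS.has(k) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value;
}

console.log(redact({ user: 'ada', password: 'hunter2', meta: { token: 'abc' } }));
```

The key set here is an assumption — audit your own payloads to decide which fields must never appear in logs.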
2. Health Checks
Health checks are simple endpoints that indicate the operational status of your application. Load balancers use them to determine if an instance is healthy enough to receive traffic, and container orchestrators use them for liveness and readiness probes.
a) Setup/Configuration
Create a new file for health check routes.
File: src/routes/health.routes.ts
import { FastifyInstance } from 'fastify';
import { getDbConnectionStatus } from '../utils/database'; // Assume this function exists
export default async function healthRoutes(fastify: FastifyInstance) {
  fastify.get('/health', async (request, reply) => {
    request.log.debug('Health check requested');
    try {
      const dbStatus = await getDbConnectionStatus(); // Check database connection
      if (dbStatus.connected) {
        reply.status(200).send({
          status: 'UP',
          timestamp: new Date().toISOString(),
          database: dbStatus.message,
        });
      } else {
        request.log.error({ dbError: dbStatus.error }, 'Health check failed: Database not connected');
        reply.status(503).send({
          status: 'DOWN',
          timestamp: new Date().toISOString(),
          database: dbStatus.message,
          error: dbStatus.error,
        });
      }
    } catch (error: any) {
      request.log.error({ error: error.message }, 'Health check encountered an error');
      reply.status(500).send({
        status: 'DOWN',
        timestamp: new Date().toISOString(),
        error: error.message,
      });
    }
  });

  // Optional: a simpler liveness check for quick response
  fastify.get('/liveness', async (request, reply) => {
    request.log.debug('Liveness check requested');
    reply.status(200).send({
      status: 'UP',
      timestamp: new Date().toISOString(),
      message: 'Application is running',
    });
  });
}
Explanation:
- /health: This is a more comprehensive readiness probe. It checks whether the application is running and whether critical dependencies (like the database) are accessible. If the DB is down, it returns 503 Service Unavailable.
- /liveness: A simpler endpoint that just checks whether the HTTP server is responding. This is useful for Kubernetes liveness probes or quick checks, indicating that the process is alive.
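The readiness logic generalizes naturally as you add dependencies (cache, message queue, downstream APIs). A minimal, dependency-free sketch of aggregating individual checks into a single UP/DOWN verdict — type and function names are illustrative:

```typescript
// Aggregate per-dependency check results into one readiness verdict.
type CheckResult = { name: string; ok: boolean };

function overallStatus(checks: CheckResult[]): { status: 'UP' | 'DOWN'; failing: string[] } {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  // Any failing dependency makes the whole service not-ready.
  return { status: failing.length === 0 ? 'UP' : 'DOWN', failing };
}

console.log(overallStatus([{ name: 'db', ok: true }, { name: 'cache', ok: false }]));
```

Reporting which dependency failed (the failing array) makes a 503 far easier to debug than a bare DOWN.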
b) Core Implementation
We need to add getDbConnectionStatus to our database utility.
File: src/utils/database.ts (modification)
import { FastifyBaseLogger } from 'fastify';
import mongoose from 'mongoose';
import config from '../config';

let loggerInstance: FastifyBaseLogger; // Store logger instance

export const connectDB = async (logger: FastifyBaseLogger) => {
  loggerInstance = logger; // Assign logger
  try {
    await mongoose.connect(config.DATABASE_URL);
    logger.info('MongoDB connected successfully');
  } catch (error) {
    logger.error({ error }, 'MongoDB connection error');
    // Re-throw so the caller can decide whether to exit on startup failure;
    // at runtime, the health check will report a broken connection.
    throw error;
  }
};

export const getDbConnectionStatus = async () => {
  if (mongoose.connection.readyState === 1) { // 1 = connected
    return { connected: true, message: 'Database is connected' };
  } else if (mongoose.connection.readyState === 2) { // 2 = connecting
    return { connected: false, message: 'Database is connecting', error: 'Connecting to database' };
  } else { // 0 = disconnected, 3 = disconnecting
    // Attempt to reconnect if disconnected, or report the failure
    try {
      if (loggerInstance) {
        loggerInstance.warn('Database not connected, attempting to reconnect for health check.');
      }
      await mongoose.connect(config.DATABASE_URL);
      return { connected: true, message: 'Database reconnected' };
    } catch (error: any) {
      if (loggerInstance) {
        loggerInstance.error({ error: error.message }, 'Failed to reconnect database during health check');
      }
      return { connected: false, message: 'Database is disconnected', error: error.message };
    }
  }
};
Explanation:
- The getDbConnectionStatus function checks mongoose.connection.readyState (0 = disconnected, 1 = connected, 2 = connecting, 3 = disconnecting).
- If the connection is down, it attempts a reconnect. A momentary disconnect shouldn’t immediately mark the service as unhealthy if it can self-recover — just make sure the reconnect attempt uses a short timeout so the health check itself stays fast.
- We’ve added loggerInstance to make the logger available within getDbConnectionStatus.
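For reference, the readyState integers map to connection states as follows. This small helper is hypothetical (not part of mongoose) and simply makes the mapping explicit:

```typescript
// mongoose.connection.readyState values, made explicit.
const READY_STATE_LABELS: Record<number, string> = {
  0: 'disconnected',
  1: 'connected',
  2: 'connecting',
  3: 'disconnecting',
};

function describeReadyState(state: number): string {
  return READY_STATE_LABELS[state] ?? 'unknown';
}

console.log(describeReadyState(1));
```

Including the label (rather than the raw integer) in health-check responses and logs saves a lookup when debugging.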
Remember to register healthRoutes in src/server.ts as shown in the previous server.ts update.
c) Testing This Component
- Start your application locally:
npm run dev - Test the health endpoint:Expected output (if DB is connected):
curl http://localhost:3000/api/health{"status":"UP","timestamp":"2026-01-08T12:00:00.000Z","database":"Database is connected"} - Test the liveness endpoint:Expected output:
curl http://localhost:3000/api/liveness{"status":"UP","timestamp":"2026-01-08T12:00:00.000Z","message":"Application is running"} - Simulate a database outage: Stop your local MongoDB instance.
- Retest
/api/health:Expected output (status code 503):curl http://localhost:3000/api/healthAnd you should see corresponding errors in your application logs.{"status":"DOWN","timestamp":"2026-01-08T12:00:00.000Z","database":"Database is disconnected","error":"...connection error details..."}
d) Production Considerations
- Load Balancers (AWS ALB): Configure your Application Load Balancer target group health checks to point to /api/health with 200 as the success code; a 503 response marks the target unhealthy. This ensures unhealthy instances are removed from traffic rotation.
- ECS Service: ECS task definitions support container-level health checks. Configure a health check command (e.g., curl -f http://localhost:3000/api/health — the -f flag makes curl exit non-zero on a 503) so ECS replaces unhealthy tasks.
- Response Time: Health check endpoints should be very fast. Avoid complex logic that could introduce latency.
- Authentication: Health check endpoints are typically unauthenticated, but ensure they don’t expose sensitive information.
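One way to honor the response-time advice above is to cap each dependency check with a timeout, so a hung dependency can't stall the whole health endpoint. A generic sketch (the helper name and fallback strategy are assumptions, not an existing API):

```typescript
// Race a dependency check against a timeout; resolve with a fallback if it loses.
function withTimeout<T>(p: Promise<T>, ms: number, fallback: T): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);
}

// Usage idea: a slow DB ping resolves as "not ok" after 500 ms instead of hanging.
withTimeout(Promise.resolve(true), 500, false).then((ok) => console.log('db ok:', ok));
```

Choose a conservative fallback (treat a timeout as a failed check) so a slow dependency surfaces as a 503 rather than a hung request.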
3. Basic Application Metrics via Logs and CloudWatch
While dedicated metrics systems like Prometheus offer rich capabilities, for many applications, a combination of structured logs and CloudWatch’s ability to extract metrics from logs can suffice.
a) Core Implementation (Conceptual)
Our structured logs already provide data points like responseTime, statusCode, reqId, etc. CloudWatch can use these.
No direct code changes are needed here, but it’s important to understand how existing logs become metrics.
b) Production Considerations (AWS CloudWatch)
When your application is deployed to AWS ECS, and your tasks have the correct IAM role to publish logs to CloudWatch, the following happens:
- Log Streams: Each task instance will send its pino JSON logs to a dedicated log stream within a specified CloudWatch Log Group (e.g., /ecs/my-fastify-app).
- Metric Filters: In CloudWatch, you can create “Metric Filters” on your log group.
  - Example 1: Error Count: Create a filter that matches { $.level = 50 } (Pino’s error level). This filter can then increment a custom metric like MyFastifyApp/Errors.
  - Example 2: Request Latency: Create a filter that matches { $.responseTime = * } and extracts the responseTime value. You can then publish a metric MyFastifyApp/RequestLatency and aggregate by average, sum, min, or max.
  - Example 3: HTTP 5xx Count: Filter for { $.statusCode >= 500 } to track server errors.
- Alarms: Once you have metrics, you can create CloudWatch Alarms. For instance, an alarm could trigger if MyFastifyApp/Errors exceeds 10 in a 5-minute period, sending a notification via SNS to your team.
- Dashboards: Create CloudWatch Dashboards to visualize these metrics alongside other AWS service metrics (e.g., CPU utilization of ECS tasks, database connections).
This approach leverages your existing robust logging setup to derive valuable monitoring insights without adding another complex metrics agent to your application.
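The metric-filter examples rely on pino's numeric levels (trace = 10 through fatal = 60). This small sketch shows how a raw JSON log line maps to an "error" classification — essentially what a CloudWatch metric filter on level does; the helper itself is illustrative, not something you would deploy:

```typescript
// Pino's standard numeric level mapping.
const PINO_LEVELS: Record<number, string> = {
  10: 'trace', 20: 'debug', 30: 'info', 40: 'warn', 50: 'error', 60: 'fatal',
};

// Classify a single JSON log line the way a `{ $.level = 50 }`-style filter would.
function isErrorLine(jsonLine: string): boolean {
  try {
    const entry = JSON.parse(jsonLine);
    return typeof entry.level === 'number' && entry.level >= 50;
  } catch {
    return false; // non-JSON lines never match
  }
}

console.log(isErrorLine('{"level":50,"msg":"boom"}'));
```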
Production Considerations (Chapter Summary)
- Centralized Logging: Ensure all application logs are sent to a centralized logging system (AWS CloudWatch Logs). This is crucial for debugging, auditing, and compliance in a distributed environment.
- Alerting: Set up CloudWatch Alarms on critical metrics (error rates, high latency, health check failures, CPU/memory usage) to get notified proactively.
- Log Retention: Configure appropriate log retention policies in CloudWatch to manage costs and compliance.
- Security: Regularly review logs for suspicious activity. Ensure logs themselves are secured (e.g., access control for CloudWatch Logs).
- Performance: High-volume logging can impact performance. Use appropriate log levels for different environments.
- Cost Management: Be mindful of the costs associated with log ingestion and storage in CloudWatch. Optimize log levels and retention.
Code Review Checkpoint
At this point, your application has significantly improved observability features:
- src/plugins/logger.ts: Implements a Fastify plugin for pino, providing structured, high-performance logging.
- src/config/index.ts: Updated to include LOGGER configuration, allowing dynamic log level and pretty-print settings based on the environment.
- src/server.ts: Integrates the loggerPlugin and healthRoutes, and ensures all critical application events (startup, errors, unhandled rejections) are logged using the new pino instance.
- src/routes/health.routes.ts: Introduces /api/health and /api/liveness endpoints for robust application status checks, including database connectivity.
- src/utils/database.ts: Adds getDbConnectionStatus to provide real-time database health, and integrates pino for database-related logs.
These changes are fundamental for operating your application reliably in a production environment.
Common Issues & Solutions
Issue: Logs are not appearing in CloudWatch after deployment.
- Debugging:
  - Check ECS Task Execution Role: Ensure your ECS task definition has an IAM role attached with permissions for logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.
  - Verify Log Driver: In your ECS task definition, confirm the logConfiguration for your container specifies the awslogs driver and the correct awslogs-group, awslogs-region, and awslogs-stream-prefix.
  - Check Application Logs Locally: Does pino produce logs locally? Are there any errors on application startup related to logging?
  - Check CloudWatch Log Group: Does the specified log group exist in CloudWatch?
- Solution: Update your ECS task execution role and task definition’s log configuration.
Issue: Health checks are failing even when the application seems to be running.
- Debugging:
  - Access the endpoint manually: Can you curl the /api/health endpoint from within the ECS container or from your local machine (if the port is exposed)?
  - Check application logs: What does your application’s pino log say when the health check is hit? Is it reporting a database connection issue or another dependency failure?
  - Database Connectivity: If the health check checks the database, is the database actually accessible from the ECS task? Check security groups and VPC settings.
- Solution: Address the underlying dependency issue (e.g., database connectivity, external service unavailability), or refine the health check logic to be more resilient or to provide clearer error messages.
Issue: Excessive debug logs in production, incurring high CloudWatch costs.
- Debugging: Check the LOG_LEVEL environment variable in your ECS task definition.
- Solution: Ensure LOG_LEVEL is set to info, warn, or error in your production environment configuration. This will filter out debug logs, reducing log volume.
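Level filtering itself is just a numeric comparison over pino's level values. This sketch (a simplification of what the logger does internally) mirrors how a configured LOG_LEVEL of info suppresses debug output:

```typescript
// Numeric severities behind LOG_LEVEL filtering (pino's standard values).
const LEVEL_VALUES: Record<string, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60,
};

// An entry is emitted only if its severity meets the configured threshold.
function shouldLog(configured: string, entry: string): boolean {
  return (LEVEL_VALUES[entry] ?? 30) >= (LEVEL_VALUES[configured] ?? 30);
}

console.log(shouldLog('info', 'debug'), shouldLog('info', 'error'));
```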
Testing & Verification
To thoroughly test and verify the observability features:
Local Testing:
- Run npm run dev.
- Confirm pino-pretty logs appear in your console.
- Hit various API endpoints (/api/auth/register, /api/users, etc.) and observe corresponding info, debug, and error logs.
- Test the /api/health and /api/liveness endpoints.
- Intentionally break your database connection and re-test /api/health to confirm it reports DOWN and logs errors.
Deployment to AWS ECS:
- Rebuild and Push Docker Image:
  docker build -t <your-repo-uri>/fastify-app:latest .
  docker push <your-repo-uri>/fastify-app:latest
- Update ECS Service: Update your ECS service to use the new Docker image.
- Verify Health Checks:
  - In the AWS Console, navigate to your ECS service. Check the “Events” and “Tasks” tabs. You should see tasks starting successfully, with a health status of “Healthy.”
  - Go to your Application Load Balancer target group and verify the registered targets are “Healthy.”
- Verify CloudWatch Logs:
  - Navigate to CloudWatch -> Log Groups. Find your application’s log group (e.g., /ecs/my-fastify-app).
  - Open the log group and select a log stream from one of your running tasks.
  - You should see your structured JSON logs flowing in real time. Make some requests to your deployed application and observe new logs appearing.
  - Look for info messages from your application startup, and request/response logs.
  - (Optional) Simulate an error (e.g., try to register with an existing email) and verify an error log appears.
- Verify CloudWatch Metrics & Alarms (if configured):
  - If you set up metric filters and alarms, verify that metrics are being generated and that alarms are in an OK state (unless an error condition is met).
Summary & Next Steps
Congratulations! You have successfully completed the journey of building a production-ready Node.js Fastify application from scratch to deployment. In this final chapter, we equipped our application with essential observability features:
- Structured Logging: Implemented pino for efficient, machine-readable logs.
- Health Checks: Created robust /api/health and /api/liveness endpoints for service management and load balancing.
- AWS Integration: Discussed how these features integrate seamlessly with AWS CloudWatch Logs for centralized monitoring and alerting.
This final layer of observability is crucial for maintaining the health, performance, and reliability of your application in a real-world production environment. You now have a solid foundation for any future Node.js projects, encompassing best practices from development to deployment and operations.
What’s Next? (Continuous Improvement)
While this guide concludes here, the journey of a production application is continuous. Consider these areas for further exploration and enhancement:
- Advanced Monitoring: Integrate dedicated monitoring solutions like Prometheus & Grafana for more granular metrics, custom dashboards, and advanced alerting.
- Distributed Tracing: Implement distributed tracing (e.g., OpenTelemetry, AWS X-Ray) to visualize request flows across multiple services, invaluable for microservices architectures.
- Cost Optimization: Regularly review AWS resource usage and look for opportunities to optimize costs (e.g., right-sizing ECS tasks, optimizing database instances, managing log retention).
- Disaster Recovery: Develop and test a disaster recovery plan, including regular backups, multi-AZ deployments, and failover strategies.
- Performance Tuning: Continuously profile your application to identify bottlenecks and optimize code for better performance and scalability.
- Security Audits: Conduct regular security audits, penetration testing, and vulnerability scanning. Stay updated on Node.js and dependency security advisories.
- Automation: Automate more aspects of your operations, such as infrastructure provisioning (Infrastructure as Code with Terraform/CloudFormation) and advanced CI/CD pipelines.
Thank you for following along this comprehensive guide. May your Node.js applications be fast, reliable, and observable!