Introduction
In the world of backend engineering, especially with high-concurrency platforms like Node.js, building resilient and maintainable applications requires more than just writing functional code. It demands a sophisticated understanding of how to handle errors gracefully, log effectively for diagnostics, and implement comprehensive observability to monitor and troubleshoot systems in production. This chapter delves into these critical aspects, providing a holistic preparation guide for Node.js developers at all career stages.
Mastering error handling ensures your application remains stable and user-friendly even when unexpected issues arise. Robust logging transforms raw data into actionable insights, crucial for debugging and understanding system behavior. Observability, encompassing metrics, traces, and logs, provides the deep visibility needed to diagnose performance bottlenecks, identify root causes of failures, and proactively maintain the health of your services. For interviews, demonstrating proficiency in these areas signals a candidate’s maturity in building production-ready systems, their ability to anticipate problems, and their systematic approach to problem-solving. This chapter covers fundamental concepts for interns and juniors, scaling up to advanced strategies for senior, staff, and lead engineers who design and maintain complex, distributed Node.js architectures.
Core Interview Questions
Intern/Junior Level
Q1: Explain the difference between synchronous and asynchronous error handling in Node.js.
A: In Node.js, synchronous errors (like syntax errors, type errors, or errors thrown from synchronous code) can be caught using a standard try...catch block. The execution flow is predictable within that block.
Asynchronous errors, however, are trickier because they occur outside the immediate execution context of the code that initiated them. These typically arise from I/O operations, network requests, or timeouts. Standard try...catch blocks do not catch asynchronous errors directly. Instead, they must be handled using:
- Callbacks: The first argument of a callback function is conventionally reserved for an error object (Node.js error-first callbacks).
- Promises: Using .catch() handlers or async/await with try...catch blocks.
- Event Emitters: Listening for 'error' events.
Key Points:
- try...catch works for synchronous code.
- Asynchronous errors require specific patterns: error-first callbacks, Promise .catch(), or async/await’s try...catch.
- Understanding the event loop is crucial for grasping why try...catch doesn’t work directly for async operations.
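These patterns can be sketched in a few lines (a minimal illustration; readConfig is a made-up example function):

```javascript
// try...catch catches errors thrown synchronously within its block.
try {
  JSON.parse('{not valid json'); // throws a SyntaxError synchronously
} catch (err) {
  console.log('Caught sync error:', err.name);
}

// An error thrown inside a setTimeout callback would NOT be caught by a
// surrounding try...catch: the callback runs on a later event-loop tick,
// after the try block has already exited.

// Promises route errors to .catch(), and async/await restores try...catch:
async function readConfig() {
  try {
    await Promise.reject(new Error('async failure'));
  } catch (err) {
    return `Caught async error: ${err.message}`;
  }
}

readConfig().then(console.log);
```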
Common Mistakes:
- Trying to wrap an entire asynchronous operation in a try...catch block and expecting it to catch errors from the async part (e.g., inside a setTimeout callback).
- Forgetting to handle errors in Promise chains, leading to unhandled rejections.
Follow-up:
- Can you provide an example of a synchronous error and how you’d catch it?
- How would you handle an error from a Node.js fs.readFile operation using callbacks?
Q2: What are process.on('uncaughtException') and process.on('unhandledRejection') and when should you use them?
A:
- process.on('uncaughtException'): This event is emitted when a thrown error propagates all the way to the event loop without being caught anywhere in the application. When this happens, the Node.js process is in an undefined state, and it’s generally considered bad practice to continue running the application after an uncaughtException. The best practice is to log the error, perform any necessary cleanup (like closing database connections), and then gracefully shut down the process (e.g., process.exit(1)).
- process.on('unhandledRejection'): This event is emitted when a Promise is rejected and there’s no .catch() handler (or try...catch in an async function) to handle it. Similar to uncaughtException, an unhandledRejection indicates a programmer error and leaves the process in an uncertain state; note that since Node.js 15, an unhandled rejection crashes the process by default. Best practice is to log the error and consider a graceful shutdown.
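A minimal sketch of these last-resort handlers (the cleanup steps are placeholders):

```javascript
// Last-resort safety net: log with full detail, clean up, then exit.
// Never resume normal operation after these events fire.
process.on('uncaughtException', (err) => {
  console.error('Uncaught exception, shutting down:', err.stack);
  // Placeholder: close DB connections, flush log transports, etc.
  process.exit(1);
});

process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection, shutting down:', reason);
  process.exit(1);
});
```

A process manager (PM2, Kubernetes) is then responsible for restarting the process in a clean state.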
Key Points:
- These are fallback mechanisms, not primary error handling strategies.
- They indicate severe programmer errors or unexpected states.
- The primary action upon these events should be logging and gracefully exiting the process to prevent unpredictable behavior.
Common Mistakes:
- Using these handlers to keep the application running after a critical error, which can lead to memory leaks, inconsistent state, or further failures.
- Not understanding that they are meant for last resort error handling, not for handling business logic errors.
Follow-up:
- Why is it generally recommended to exit the process after an uncaughtException?
- How can unhandledRejection be prevented proactively?
Q3: How do you log information in a Node.js application, and what are the different levels of logging?
A: For basic logging, console.log(), console.warn(), console.error() can be used, but these are generally not suitable for production. In production-grade Node.js applications, dedicated logging libraries like Winston (v3.x) or Pino (v8.x) are preferred. These libraries allow for structured logging, different transport options (console, file, remote services), and logging levels.
Standard logging levels (loosely modeled on the syslog severities of RFC 5424) include:
- FATAL: Critical error leading to application termination.
- ERROR: Serious error requiring immediate attention but possibly not terminating the app.
- WARN: Potential issue or unexpected behavior, but the application can continue.
- INFO: General operational information, such as server start, user login.
- DEBUG: Detailed information primarily for debugging purposes, usually disabled in production.
- TRACE: Even finer-grained detail than DEBUG, often for following execution flow.
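The level hierarchy can be illustrated with a toy structured logger built only on console.log (purely for illustration — use Pino or Winston in real projects):

```javascript
// Toy structured logger: numeric level thresholds and JSON output.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

function createLogger(minLevel = 'info') {
  const threshold = LEVELS[minLevel];
  const log = (level, msg, context = {}) => {
    if (LEVELS[level] < threshold) return null; // suppress below threshold
    const entry = { level, time: Date.now(), msg, ...context };
    console.log(JSON.stringify(entry)); // one JSON line per event
    return entry;
  };
  // Expose one method per level: logger.info(...), logger.error(...), etc.
  return Object.fromEntries(
    Object.keys(LEVELS).map((lvl) => [lvl, (msg, ctx) => log(lvl, msg, ctx)])
  );
}

const logger = createLogger('info');
logger.debug('cache miss');                    // suppressed: below threshold
logger.info('server started', { port: 3000 }); // emitted as one JSON line
logger.error('db unreachable', { retryInMs: 500 });
```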
Key Points:
- Use structured logging (JSON format) for easier parsing and analysis by log aggregation systems.
- Choose appropriate logging levels based on the severity and purpose of the message.
- In production, avoid excessive DEBUG or TRACE logging to minimize overhead unless actively troubleshooting.
Common Mistakes:
- Using console.log exclusively in production, which lacks structure, levels, and proper output routing.
- Logging sensitive information (passwords, PII) without sanitization.
- Inconsistent logging formats across different parts of the application.
Follow-up:
- Why is structured logging preferred over unstructured logging in production?
- What’s a practical scenario where you would use a WARN-level log?
Mid-Level Professional
Q4: Describe how you would implement a global error handling middleware in an Express.js application (Node.js v20.x or later).
A: In Express.js, error handling middleware is a special type of middleware function that takes four arguments: (err, req, res, next). Express automatically routes errors to these middleware functions when an error is thrown in a synchronous route handler or passed to next(err) in an asynchronous handler. (Express 5 also forwards rejected promises from async handlers automatically; in Express 4 you must call next(err) yourself.)
A global error handler is typically placed at the very end of the middleware chain, after all other routes and middleware.
// app.js
const express = require('express');
const app = express();
const logger = require('./utils/logger'); // Assume a structured logger

// ... other middleware and routes ...

// Example route that might throw an error
app.get('/error-test', (req, res, next) => {
  try {
    // Simulate a synchronous error
    throw new Error('Something went wrong synchronously!');
  } catch (error) {
    next(error); // Pass error to the error handling middleware
  }
});

app.get('/async-error-test', async (req, res, next) => {
  try {
    // Simulate an asynchronous operation that fails
    await someFailingAsyncOperation();
    res.send('This will not be sent');
  } catch (error) {
    next(error); // Catch async errors and pass to the error handling middleware
  }
});

// Global error handling middleware (must be last)
app.use((err, req, res, next) => {
  logger.error('An unhandled error occurred:', {
    message: err.message,
    stack: err.stack,
    method: req.method,
    url: req.originalUrl,
    ip: req.ip
  });

  // Determine status code (e.g., 500 for generic server errors, 400 for bad requests)
  const statusCode = err.statusCode || 500;

  // Send a generic error response to the client
  res.status(statusCode).json({
    status: 'error',
    message: statusCode === 500 ? 'Internal Server Error' : err.message,
    // In production, avoid sending detailed error messages or stack traces to clients
    ...(process.env.NODE_ENV === 'development' && { stack: err.stack })
  });
});

// Assume someFailingAsyncOperation is defined elsewhere
async function someFailingAsyncOperation() {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      reject(new Error('Failed during async operation!'));
    }, 100);
  });
}

// ... start server ...
Key Points:
- Error middleware takes four arguments: (err, req, res, next).
- It must be defined after all other routes and middleware.
- Errors passed to next(err) or thrown in synchronous routes will be caught here.
- Ensure sensitive error details (stack traces) are not exposed to clients in production.
Common Mistakes:
- Placing the error middleware before routes, so it never gets invoked.
- Not passing errors using next(err) in asynchronous routes, leading to unhandled rejections if not caught by a Promise handler.
- Sending raw stack traces in production environments.
Follow-up:
- How would you differentiate between different types of errors (e.g., validation errors vs. database errors) within this global handler and send appropriate HTTP status codes?
- What happens if an error occurs within the error handling middleware itself?
Q5: What is the concept of “observability” in a Node.js application, and how does it differ from traditional monitoring?
A: Observability is the ability to infer the internal state of a system by examining its external outputs (logs, metrics, traces). It’s about being able to ask arbitrary questions about your system and get answers from the data it emits, without having to ship new code.
Key components of observability:
- Logs: Discrete, timestamped events describing what happened at a specific point in time. Crucial for understanding specific incidents.
- Metrics: Aggregated, numerical measurements of system behavior over time (e.g., CPU utilization, request latency, error rates). Good for trends, alerts, and dashboards.
- Traces: Represent the end-to-end journey of a request through a distributed system, showing how different services interact, their dependencies, and latency contributions. Essential for debugging microservices.
How it differs from traditional monitoring:
- Monitoring tells you if a system is working (pre-defined known-unknowns). It’s about watching known health indicators.
- Observability tells you why it’s not working (exploring unknown-unknowns). It provides the tooling and data to debug unforeseen issues.
- Monitoring often relies on predefined dashboards and alerts. Observability enables ad-hoc querying and deeper exploration of system behavior.
Key Points:
- Observability provides a holistic view, enabling debugging of complex, distributed systems.
- It’s built on the “three pillars”: logs, metrics, and traces.
- The goal is to understand the why behind system behavior, not just the what.
Common Mistakes:
- Confusing logging with full observability. Logs are a component, but not the whole picture.
- Only monitoring basic infrastructure metrics without application-level insights.
- Not instrumenting applications adequately to emit the necessary data (contextual logs, custom metrics, trace spans).
Follow-up:
- How would you instrument a Node.js microservice to emit traces using OpenTelemetry (v1.x for Node.js)?
- What kind of custom metrics would be valuable to track for a typical Node.js API, and why?
Q6: Explain “operational errors” versus “programmer errors” in Node.js, and how your handling strategy differs for each.
A: This distinction, popularized by Joyent’s Node.js error-handling guidance, is crucial for robust error handling:
Operational Errors: These are errors that occur during the normal operation of the program due to anticipated external factors. The application is generally still in a healthy state and can recover or respond gracefully. Examples:
- Invalid user input (e.g., ValidationError)
- Network connectivity issues (e.g., database unreachable)
- File not found (e.g., ENOENT from the fs module)
- API rate limiting (e.g., 429 Too Many Requests)
- Authentication failures (e.g., 401 Unauthorized)
- Timeout errors
Handling Strategy: Catch these errors, log them, and send an appropriate, user-friendly response (e.g., HTTP 4xx or 5xx status codes with a descriptive message). The process does not necessarily need to terminate.
Programmer Errors: These are bugs in the code itself, indicating a defect in the application’s logic or design. The application is likely in an inconsistent or unstable state, and continuing execution is dangerous. Examples:
- Referencing an undefined variable.
- Calling a function with the wrong arguments, causing a TypeError.
- Unhandled exceptions or unhandled promise rejections that are not explicitly caught.
- Memory leaks.
Handling Strategy: These errors should ideally never reach production. If they do, they signal a critical flaw. The best practice is to log the error with full stack trace, notify developers, and perform a graceful shutdown of the process (e.g., process.exit(1)) after attempting minimal cleanup. This prevents the application from entering an unpredictable state and causing further damage. Process managers like PM2 or Kubernetes will then restart the application.
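One common way to make this distinction programmatic is an isOperational flag on a custom error base class (a sketch; class and helper names are illustrative):

```javascript
// Hypothetical base class marking errors as operational.
class AppError extends Error {
  constructor(message, { statusCode = 500, isOperational = true } = {}) {
    super(message);
    this.name = this.constructor.name;
    this.statusCode = statusCode;
    this.isOperational = isOperational; // false => programmer error
    Error.captureStackTrace(this, this.constructor);
  }
}

class ValidationError extends AppError {
  constructor(message) { super(message, { statusCode: 400 }); }
}

// A central handler can then decide: respond gracefully, or crash-and-restart.
function handleError(err) {
  if (err instanceof AppError && err.isOperational) {
    return { action: 'respond', status: err.statusCode };
  }
  // Anything else (TypeError, ReferenceError, ...) is treated as a bug.
  return { action: 'shutdown', status: 500 };
}
```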
Key Points:
- Operational errors are predictable and recoverable; programmer errors are unexpected bugs.
- Handle operational errors gracefully and keep the app running.
- Handle programmer errors by logging and shutting down to avoid cascading failures.
- A custom error class hierarchy can help distinguish these types programmatically.
Common Mistakes:
- Treating all errors as programmer errors and shutting down unnecessarily.
- Treating all errors as operational errors and allowing the application to continue in an unstable state after a programmer error.
- Not logging enough context to differentiate between the two.
Follow-up:
- How would you implement custom error classes in Node.js to better categorize errors?
- Given a database connection error, would you classify it as operational or programmer error, and how would you handle it?
Senior/Staff/Lead Level
Q7: Design a robust error handling strategy for a Node.js microservices architecture.
A: Designing error handling for microservices involves more than just individual service try...catch blocks. It requires a distributed strategy:
- Standardized Error Structure: Define a consistent JSON error response format (e.g., code, message, details) across all services. This allows client applications and other services to easily parse and react to errors.
- Centralized Logging and Aggregation: All services must log errors (with full context, correlation IDs/trace IDs) to a centralized log management system (e.g., ELK Stack, Splunk, Datadog Logs). This enables searching, filtering, and analysis across services.
- Distributed Tracing (OpenTelemetry): Implement distributed tracing using a standard like OpenTelemetry (v1.x for Node.js) across all services. This allows tracing a single request’s journey through multiple services, identifying which service failed and the exact call path. Each log entry and error should include the trace ID.
- Idempotent Operations & Retries: Design services to be idempotent where possible. Implement retry mechanisms (with exponential backoff and jitter) for transient operational errors (e.g., network issues, temporary service unavailability). Use libraries like axios-retry.
- Circuit Breakers: For calls to external or downstream services, implement circuit breakers (e.g., using the opossum library). This pattern prevents cascading failures by stopping requests to failing services, giving them time to recover, and failing fast instead of timing out.
- Dead Letter Queues (DLQs): For asynchronous communication (e.g., message queues like Kafka/RabbitMQ), use DLQs for messages that repeatedly fail processing. This prevents poison pills from blocking queues and allows for later inspection and reprocessing.
- Alerting & Monitoring: Set up alerts based on error rates, latency spikes, or specific error codes detected in logs/metrics. Integrate with tools like Prometheus/Grafana or an APM solution (New Relic, Datadog).
- Graceful Shutdowns: Ensure services handle SIGTERM/SIGINT signals to gracefully shut down, releasing resources (DB connections, open files) and flushing logs before exiting, especially when uncaughtException or unhandledRejection occurs.
- Custom Error Classes: Implement custom error classes (e.g., ApiError, DatabaseError, ValidationError) to semantically categorize operational errors. This helps in conditional handling and reporting.
- Error Boundaries/Fallbacks: For UI-facing services (e.g., a Node.js BFF), consider implementing fallbacks or partial error responses where appropriate, to provide a degraded but still functional experience.
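The retry-with-backoff-and-jitter idea above can be sketched without a library (an illustrative helper; withRetry and its parameter names are made up for this example):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a function with capped exponential backoff and "full jitter".
// Only transient, operational errors should be retried in practice.
async function withRetry(fn, { retries = 3, baseMs = 100, maxMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // Cap the exponential delay, then randomize it so many instances
      // retrying at once don't hammer the dependency in lockstep.
      const cap = Math.min(maxMs, baseMs * 2 ** attempt);
      await sleep(Math.random() * cap);
    }
  }
  throw lastError;
}

// Usage: a flaky call that succeeds on the third attempt.
let calls = 0;
withRetry(async () => {
  calls++;
  if (calls < 3) throw new Error('ECONNRESET');
  return 'ok';
}).then((result) => console.log(result, 'after', calls, 'attempts'));
```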
Key Points:
- Consistency in error reporting and handling across all services is paramount.
- Leverage distributed tracing for visibility into cross-service failures.
- Employ resiliency patterns like retries and circuit breakers.
- Centralized logging and proactive alerting are non-negotiable.
Common Mistakes:
- Inconsistent error response formats across services.
- Lack of correlation IDs for tracing requests through multiple services.
- Not implementing circuit breakers, leading to cascading failures.
- Ignoring unhandled promise rejections or uncaught exceptions, leading to unstable processes.
Follow-up:
- How would you ensure all logs emitted by a service include a unique correlation ID for a given request?
- What metrics would you monitor specifically related to error handling in a microservice?
Q8: How would you debug a memory leak in a Node.js application running in production?
A: Debugging memory leaks in production Node.js applications requires a systematic approach using specific tools:
- Monitor Baseline Metrics: First, ensure you have memory usage (RSS, Heap Used/Total) metrics being collected and graphed (e.g., via Prometheus/Grafana, Datadog). Look for a continuously increasing trend without corresponding release.
- Identify Potential Leak Source:
- Process Restarts: If a process manager (PM2, Kubernetes) is frequently restarting a service due to OOM (Out Of Memory) errors, it’s a strong indicator.
- High Traffic Patterns: Correlate memory spikes or leaks with specific API endpoints or background jobs that handle large data, long-lived connections, or complex object manipulations.
- Recent Deployments: If the leak started after a recent deploy, review changes for new features, dependencies, or large data structures.
- Utilize Node.js Debugging Tools (on a non-critical instance or during a maintenance window):
- node --inspect (or ndb): Connect a debugger (Chrome DevTools, VS Code) to a running instance (preferably a dedicated diagnostic instance or scaled-up replica).
- Heap Snapshots: Take multiple heap snapshots at different times as the application runs and memory grows (e.g., 5 minutes apart).
  - In Chrome DevTools, go to the Memory tab, select Heap snapshot, and record.
  - Compare consecutive snapshots to identify “retained objects” that are growing in number or size. Look for objects that shouldn’t persist across requests (e.g., global caches, event listener closures, unclosed database connections, large arrays/objects being added but never cleared).
- Heap Profiling (Allocation Timeline): Record an allocation profile to see where memory is being allocated over time. This helps pinpoint specific functions or code paths.
- CPU Profiling: While primarily for CPU, sometimes high CPU usage can be related to garbage collection struggling with a memory leak.
- Code Analysis: Once suspicious objects or code paths are identified from snapshots:
- Inspect source code for global caches that are not properly bounded.
- Check for unclosed database connections or file handles.
- Look for accidentally retained references in closures, event emitters that aren’t removed, or timers that aren’t cleared.
- Review third-party libraries; sometimes they can introduce leaks.
- Replicate in Development: Once a potential cause is narrowed down, try to replicate the leak in a local development environment with controlled load or specific test cases. This makes iterative debugging much faster.
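Two of the retained-reference patterns listed above look like this in code (illustrative names; the fixes are shown inline):

```javascript
// 1. Unbounded module-level cache: entries are added per request, never evicted.
const cache = new Map();

function getUserLeaky(id) {
  if (!cache.has(id)) cache.set(id, { id, loadedAt: Date.now() });
  return cache.get(id); // cache grows forever under unique ids
}

// Fix: bound the cache. Simple FIFO eviction shown here; a real service
// would use an LRU (e.g., lru-cache) with TTLs.
const MAX_ENTRIES = 1000;

function getUserBounded(id) {
  if (!cache.has(id)) {
    if (cache.size >= MAX_ENTRIES) {
      cache.delete(cache.keys().next().value); // evict oldest insertion
    }
    cache.set(id, { id, loadedAt: Date.now() });
  }
  return cache.get(id);
}

// 2. Listener leak: adding a listener per request without ever removing it.
//    emitter.on('data', handler)        // leaky if registered per request
//    emitter.once('data', handler)      // or removeListener(...) — the fix
```

In a heap snapshot comparison, pattern 1 shows up as an ever-growing Map retained from module scope; pattern 2 shows up as growing arrays of closures retained by the EventEmitter.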
Key Points:
- Start with monitoring to confirm a leak exists.
- Heap snapshots and comparison are the primary tools for identifying retained objects.
- Focus on comparing snapshots over time to find growth.
- Be cautious when performing these operations on production systems; use diagnostic instances if possible.
Common Mistakes:
- Not having proper memory metrics in place to detect leaks early.
- Guessing the source of the leak instead of using profiling tools.
- Trying to debug a leak without a clear reproduction path or controlled environment.
- Forgetting to disconnect debuggers or leave profiling tools running, which can impact performance.
Follow-up:
- What are common types of objects that contribute to memory leaks in Node.js applications?
- How can Node.js garbage collection (GC) behavior influence the perception of a memory leak?
Q9: Discuss the role of structured logging and how it integrates with log aggregation systems (e.g., ELK Stack, Splunk) in a modern Node.js backend.
A: Structured logging involves outputting logs as machine-readable data, typically in JSON format, instead of free-form text. Each log entry is an object with key-value pairs, where keys represent specific attributes (timestamp, level, message, request ID, user ID, service name, file, line number, custom context).
Role and Benefits:
- Machine Readability: Enables efficient parsing, indexing, and querying by log aggregation systems.
- Contextual Information: Easily include vital context (e.g., traceId, userId, requestId, transactionId) with every log message, making it simple to filter and trace specific operations.
- Standardization: Enforces a consistent log format across different services and microservices, simplifying analysis.
- Queryability: Allows powerful queries like “show all ERROR logs for service X where traceId is ABC and userId is 123 in the last hour.”
- Alerting & Monitoring: Easier to set up alerts based on specific structured fields (e.g., alert if level: "ERROR" and message: "Database connection failed").
- Reduced Ambiguity: Eliminates the need for complex regex parsing of unstructured logs.
Integration with Log Aggregation Systems (e.g., ELK Stack - Elasticsearch, Logstash, Kibana; Splunk; Datadog):
- Log Collection: Node.js applications, using libraries like Winston (v3.x) or Pino (v8.x), output structured JSON logs to stdout/stderr or to files.
- Log Shipper (e.g., Filebeat, Fluentd, Fluent Bit): Agents installed on the server or container collect these structured logs.
- Log Parser/Processor (e.g., Logstash, Datadog Agent): These components receive the logs, perform any necessary transformations (though minimal for structured JSON), enrich them (add host metadata, service tags), and route them.
- Log Storage/Indexing (e.g., Elasticsearch, Splunk Indexers): Logs are stored and indexed, making them highly searchable.
- Log Analysis/Visualization (e.g., Kibana, Splunk Dashboards, Datadog Log Explorer): Tools provide user interfaces to query, filter, visualize, and create dashboards and alerts from the aggregated log data. The structured nature allows for dynamic field discovery and analysis.
Example (Pino structured log):
{"level":30,"time":1710000000000,"pid":12345,"hostname":"my-server","reqId":"abc-123","msg":"User logged in","userId":"user-456"}
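One answer to the sensitive-data concern is to redact known fields before serialization (a minimal sketch; the field list is illustrative — Pino provides this natively via its redact option):

```javascript
// Recursively replace known sensitive fields before a log entry is serialized.
const SENSITIVE_KEYS = new Set(['password', 'token', 'ssn', 'authorization']);

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) =>
        SENSITIVE_KEYS.has(k.toLowerCase()) ? [k, '[REDACTED]'] : [k, redact(v)]
      )
    );
  }
  return value;
}

console.log(JSON.stringify(redact({
  msg: 'login attempt',
  user: { name: 'alice', password: 'hunter2' }
})));
```

Redacting in-process ensures secrets never reach the shipper or aggregation system at all, which is safer than filtering downstream.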
Key Points:
- Structured logs are machine-readable, typically JSON, for efficient processing.
- They carry rich context, crucial for debugging distributed systems.
- Log aggregators ingest, index, and provide powerful querying/visualization capabilities over structured logs.
- Libraries like Pino and Winston are essential for implementing structured logging in Node.js.
Common Mistakes:
- Still using plain text logs and relying on complex regex for parsing.
- Not including enough context (e.g., requestId) in structured logs, making them less useful for tracing.
- Logging too much detail (e.g., entire request bodies) by default in production, increasing storage costs and potentially exposing sensitive data.
Follow-up:
- How would you handle logging sensitive information (e.g., passwords) in a structured log to prevent it from reaching the log aggregation system?
- Beyond errors, what are other key events you would ensure are logged with appropriate context for a typical Node.js API?
Q10: How would you implement rate limiting and a circuit breaker pattern in a Node.js microservice consuming an external API?
A:
Rate Limiting (for outgoing requests to external API): Rate limiting client requests to an external API prevents your service from overwhelming the external system and getting blocked. This is often done per instance or globally if coordinated.
Implementation (per instance):
- Token Bucket/Leaky Bucket: Use libraries like rate-limiter-flexible (v2.x) or implement a simple counter.
- A common approach is to create a wrapper around your HTTP client (e.g., axios).
// rateLimiter.js
const { RateLimiterMemory } = require('rate-limiter-flexible');

// Configure rate limiter: 10 requests per second
const externalApiRateLimiter = new RateLimiterMemory({
  points: 10, // 10 points
  duration: 1, // per 1 second
  blockDuration: 0 // block for 0 seconds if exceeded (just throw error)
});

async function callExternalApi(url, options) {
  try {
    await externalApiRateLimiter.consume('external-api-key', 1); // Consume 1 point

    // Proceed with the actual API call
    const response = await fetch(url, options); // Using fetch API (Node.js 18+)
    if (!response.ok) {
      throw new Error(`External API responded with status ${response.status}`);
    }
    return response.json();
  } catch (error) {
    if (!(error instanceof Error)) { // rate-limiter-flexible rejects with a RateLimiterRes object, not an Error
      console.warn('Rate limit exceeded for external API call.');
      throw new Error('Too many requests to external API, please try again later.');
    }
    throw error; // Re-throw other errors
  }
}

module.exports = { callExternalApi };
Circuit Breaker: A circuit breaker prevents your service from continuously trying to call a failing external API, allowing the external service time to recover and protecting your own service from accumulating pending requests and potential timeouts.
Implementation (using opossum v7.x):
// circuitBreaker.js
const circuitBreaker = require('opossum');

// Define the function that calls the external API
async function actualExternalApiCall(url, options) {
  const response = await fetch(url, options);
  if (!response.ok) {
    // Treat 5xx responses as failures
    if (response.status >= 500) {
      throw new Error(`External API responded with server error ${response.status}`);
    }
    // Other client errors might not break the circuit, depending on strategy
    throw new Error(`External API responded with client error ${response.status}`);
  }
  return response.json();
}

const breakerOptions = {
  timeout: 3000, // If our function takes longer than 3 seconds, trigger a failure
  errorThresholdPercentage: 50, // If 50% of requests fail...
  resetTimeout: 10000 // ...then open the circuit for 10 seconds
};

// Create a circuit breaker
const breaker = circuitBreaker(actualExternalApiCall, breakerOptions);

// Define a fallback function when the circuit is open
breaker.fallback(async (url, options) => {
  console.warn('Circuit breaker is open! Falling back for external API call.');
  // Return a cached response, a default value, or throw a more specific error
  return { message: 'External service temporarily unavailable.' };
});

// Listen for circuit breaker events for logging and monitoring
breaker.on('open', () => console.error('Circuit Breaker OPEN for external API!'));
breaker.on('halfOpen', () => console.warn('Circuit Breaker HALF_OPEN for external API.'));
breaker.on('close', () => console.info('Circuit Breaker CLOSED for external API.'));
breaker.on('success', (result, latency) => console.debug('External API call success.', { latency }));
breaker.on('failure', (error, latency) => console.error('External API call failed.', { error: error.message, latency }));
breaker.on('timeout', () => console.error('External API call timed out.'));

module.exports = { callExternalApiWithBreaker: breaker.fire.bind(breaker) }; // bind so fire keeps its breaker context
Integration: To combine both patterns, have the circuit breaker wrap the rate-limited call itself, so every request that passes through the breaker is also rate limited:
// externalApiClient.js
const circuitBreaker = require('opossum');
const { callExternalApi } = require('./rateLimiter');

// Reuse the same breaker options as above
const breakerOptions = {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 10000
};

// The breaker now executes the rate-limited call directly
const combinedBreaker = circuitBreaker(callExternalApi, breakerOptions);
combinedBreaker.fallback(() => ({ message: 'External service temporarily unavailable.' }));

async function processDataFromExternalSource() {
  try {
    const data = await combinedBreaker.fire('https://api.example.com/data', { method: 'GET' });
    console.log('Received data:', data);
  } catch (error) {
    console.error('Failed to process data from external source:', error.message);
    // Handle the error (e.g., respond with a 503, retry later, log)
  }
}
Note that rate-limit rejections count as failures toward the breaker’s error threshold; if that is undesirable, exclude them with opossum’s errorFilter option or apply the rate limiter outside the breaker.
Key Points:
- Rate Limiting: Protects the external API from your service. Limits outgoing requests.
- Circuit Breaker: Protects your service from a failing external API. Prevents constant retries.
- Use libraries like rate-limiter-flexible for rate limiting and opossum for circuit breaking.
- Proper configuration of thresholds and timeouts is crucial for both.
- Provide fallbacks for circuit breakers to gracefully degrade service.
Common Mistakes:
- Not implementing any rate limiting, leading to IP blocking by external APIs.
- Not using circuit breakers, causing your service to hang or fail repeatedly when a dependency is down.
- Setting timeout too low or errorThresholdPercentage too low, causing the circuit to open prematurely.
- Not having a meaningful fallback strategy when the circuit opens.
Follow-up:
- How would you handle global rate limiting across multiple instances of your Node.js microservice?
- What metrics would you want to monitor for your circuit breaker’s state?
Debugging Exercises / Production Incidents
Q11: Scenario: Diagnosing a sudden spike in API latency for a Node.js service.
Scenario Setup: Your team monitors a critical Node.js API (running on Node.js v20.10.0, Express.js v4.18.2) deployed in Kubernetes. Suddenly, P99 latency jumps from 100ms to 2000ms, and error rates (500s) also slightly increase, but not proportionally to the latency spike. The service’s CPU and memory usage remain within normal bounds, but pod restarts are slightly up.
Questions an interviewer might ask:
- Initial Reaction & First Steps: What’s your immediate reaction to this alert, and what are the very first three things you would check?
- Troubleshooting Process: Outline a systematic troubleshooting process to diagnose the root cause. What tools and data would you leverage?
- Potential Causes: Based on the symptoms, what are the most likely categories of issues you’d investigate?
- Mitigation: What steps could you take to mitigate the impact while you’re still diagnosing?
- Prevention: Once the root cause is identified, what measures would you put in place to prevent recurrence?
Expected Flow of Conversation & Answer:
1. Initial Reaction & First Steps:
- A: My immediate reaction is to verify the alert’s validity and understand its scope.
- Check dashboards: Confirm the latency spike across all instances and check other related metrics (throughput, error rates, dependency health, GC activity) on Grafana/Datadog. Is it affecting all endpoints or specific ones?
- Check recent deployments: Was there a recent deployment (code, config, infrastructure)? A rollback might be a quick fix if so.
- Check upstream/downstream dependencies: Look at dashboards for databases, caching layers (Redis/Memcached), message queues (Kafka), or other microservices this API depends on. A slow dependency is a common cause for increased latency.
2. Troubleshooting Process:
- A: I’d follow a systematic “Observe, Orient, Decide, Act” (OODA) loop:
- Observe (Data Collection):
- Distributed Tracing (e.g., OpenTelemetry/Jaeger/Zipkin): Find recent slow requests and examine their traces. This is the most powerful tool for multi-service latency issues, pinpointing which span (internal function call, database query, external API call) is consuming the most time.
- Application Logs (centralized ELK/Splunk/Datadog): Search for errors (specifically `ERROR`/`FATAL` levels), warnings, or unusual patterns in the timeframe of the spike. Correlate with trace IDs if available. Look for slow query logs from databases.
- Custom Metrics: Check application-specific metrics like database query latencies, external API call latencies, event loop utilization, garbage collection pause times.
- System Metrics (Kubernetes, host level): Confirm CPU, memory, network I/O, disk I/O are truly stable. Check resource limits/requests and node health in Kubernetes.
- Orient (Hypothesis Generation): Based on the data, form hypotheses:
- Slow Dependency: Most likely if traces show a long wait time for a database query or external API call.
- Resource Contention: Less likely given CPU/memory are stable, but could be specific network I/O or disk I/O not captured.
- Event Loop Blocking: A synchronous, CPU-intensive operation or complex regex blocking the event loop could cause latency without high overall CPU, but would show high Event Loop Utilization (ELU).
- Garbage Collection Pauses: Frequent or long GC pauses can increase latency. Look at GC metrics.
- Database Connection Pool Exhaustion: This could cause requests to queue up, increasing latency without increasing CPU.
- Inefficient Code Path: A specific code path hit under certain conditions might be inefficient.
- Decide (Action Plan): Prioritize hypotheses and decide on the next diagnostic step.
- Act (Execute & Re-evaluate): Execute the step, gather more data, and repeat the loop. For example, if traces point to a slow database query, then investigate the database’s query plans, indexing, or connection pool.
3. Potential Causes:
- Slow Downstream Dependency: (Most probable given the symptoms) The database, a caching service, or another internal/external microservice is experiencing high latency, causing our Node.js API to wait longer. Node.js waits are non-blocking, so CPU might not spike.
- Database Connection Pool Exhaustion: Requests are queuing up waiting for a free database connection.
- Inefficient Database Queries: A new or changed query causing a full table scan or missing index.
- External API Rate Limiting/Throttling: The API might be getting throttled by a third-party service, leading to retries and increased latency.
- “Hot” Code Path: A particular endpoint or function is being hit with specific, difficult-to-process data that causes a less efficient path to be taken, perhaps involving a complex synchronous calculation or data structure operation.
- Garbage Collection Pressure: Though overall memory stable, certain operations might be creating a lot of short-lived objects, increasing GC frequency/duration and causing minor pauses.
- Network Latency: Issues with network connectivity within the Kubernetes cluster or to external services.
- Misconfigured Caching: Caching layers might not be working, leading to more direct (and slower) database/API calls.
4. Mitigation:
- Rollback: If a recent deployment is suspected, roll back to the previous stable version.
- Scale Up: Temporarily increase the number of Node.js pod replicas to distribute the load, though this might not fix a downstream dependency issue and could exacerbate it if the dependency is the bottleneck.
- Graceful Degradation: Temporarily disable less critical features or use cached/stale data if appropriate, to reduce load on the problematic component.
- Circuit Breakers/Timeouts: If not already in place, or if misconfigured, ensure external calls have appropriate timeouts and circuit breakers to fail fast instead of hanging.
5. Prevention:
- Robust Monitoring & Alerting: Granular metrics (per endpoint latency, dependency latency, event loop utilization, GC metrics) and threshold-based alerts.
- Comprehensive Distributed Tracing: Implement OpenTelemetry across all services for deep visibility.
- Load Testing & Performance Benchmarking: Regularly test services under expected and peak load to identify bottlenecks pre-production.
- Code Reviews & Performance Audits: Focus on N+1 queries, large synchronous operations, and inefficient data structures.
- Automated Dependency Health Checks: Proactive checks for external services and databases.
- Connection Pool Sizing: Properly tune database and other connection pools based on load.
- Schema & Index Review: Regularly review database query plans and ensure proper indexing.
- Circuit Breakers & Retries: Implement and tune these resilience patterns for all external dependencies.
Red flags to avoid:
- Jumping to conclusions without data.
- Blaming the database immediately without checking traces.
- Suggesting a reboot without understanding the cause.
- Not considering the impact of a fix/mitigation.
Q12: How would you identify and address a request processing bottleneck caused by a synchronous, CPU-intensive operation in a Node.js backend (v20.x)?
Scenario Setup: Your Node.js API experiences high latency and increased CPU utilization (approaching 100% on a single core) when a specific endpoint is hit. Other endpoints are mostly unaffected, but the entire application feels sluggish. You suspect a blocking operation.
Questions an interviewer might ask:
- Symptoms of Blocking: What are the key symptoms that point towards a synchronous, CPU-intensive bottleneck in Node.js?
- Identification Tools: What tools and techniques would you use to pinpoint the exact problematic code section?
- Resolution Strategies: Once identified, what are the primary strategies for addressing such a bottleneck in Node.js?
- Long-Term Prevention: How do you prevent such issues from being introduced in the future?
Expected Flow of Conversation & Answer:
1. Symptoms of Blocking:
- A: The key symptoms indicating a synchronous, CPU-intensive bottleneck are:
- High Event Loop Utilization (ELU): This is the most direct indicator. Monitoring tools should show the event loop being “blocked” or “starved” (e.g., ELU approaching 100%).
- High CPU on a single core: Node.js is single-threaded for JavaScript execution, so a blocking operation will max out one core, even if the overall system CPU isn’t 100% across all cores.
- Increased Latency: Requests queue up as the single event loop thread is busy, leading to delayed responses.
- Reduced Throughput: The server can’t accept and process new requests efficiently while the event loop is occupied.
- Healthy I/O, but Slow Processing: Database, network, and disk I/O metrics might appear fine, but the application logic layer is slow.
- Unresponsive Application: The entire application becomes unresponsive or “janky” for all pending requests because the single thread is tied up.
2. Identification Tools:
- A:
- CPU Profiling:
  - `node --inspect` (Chrome DevTools Profiler): Connect to the running Node.js process, go to the Performance tab, and record a profile while hitting the problematic endpoint. The flame graph will clearly show which functions are consuming the most CPU time, often indicating the blocking synchronous operation.
  - `clinic.js` (v13.x): Specifically, `clinic doctor` is excellent for visualizing CPU and event loop metrics, `clinic flame` generates flame graphs, and `clinic bubbleprof` identifies I/O bottlenecks. These are great for local testing or profiling on a dedicated staging environment.
- Event Loop Monitoring: Libraries like `loopbench` (for manual observation) or APM tools (Datadog, New Relic) that collect Event Loop Utilization/Delay metrics are crucial for confirming the event loop is indeed blocked.
- Logging: Look for custom logs that might indicate the start and end of specific complex operations, to help narrow down the time window.
3. Resolution Strategies:
- A: Once identified, the primary strategies are:
- Offload to Worker Threads (Node.js v10+, stable since v12, current v20.x): For CPU-intensive tasks (e.g., complex calculations, heavy data transformations, image processing, cryptographic operations), move them to Node.js Worker Threads. This executes the CPU-bound work on a separate thread, keeping the main event loop free to handle incoming requests. Use the `worker_threads` module.
- Break Down into Smaller Asynchronous Chunks: If the operation can be broken down, use `setImmediate` to yield control back to the event loop periodically. (Note that `process.nextTick` does not help here: its callbacks run before I/O events, so chunking with it still starves the loop.) This makes the work non-blocking, though it might increase overall execution time. Less common for truly CPU-intensive work, but useful for large synchronous loops.
- Optimize the Algorithm/Code: Review the logic. Can the algorithm be more efficient (e.g., using a better data structure, reducing iterations)? This is often the first step before offloading.
- Caching: If the result of the CPU-intensive operation is often the same for the same input, cache the results (e.g., in Redis or an in-memory cache with expiration).
- External Services: For extremely heavy tasks, offload them to dedicated services or tools outside Node.js (e.g., a Python service for data science, a specialized image processing library).
- Clustering (Node.js `cluster` module): While not solving the blocking nature, running multiple Node.js processes (each on a different CPU core) can distribute the load. If one process gets blocked, others can still serve requests. This is a horizontal scaling approach.
4. Long-Term Prevention:
- A:
- Performance Budgeting: Establish performance targets for critical operations and include performance testing in CI/CD.
- Proactive Profiling: Regularly profile CPU-intensive sections of the application, especially after introducing new complex logic or data processing.
- Code Review Emphasis: Train developers to recognize synchronous, CPU-bound patterns and recommend `worker_threads` or algorithmic optimizations.
- Mandatory Event Loop Monitoring: Integrate ELU metrics into dashboards and set up alerts for high utilization.
- Automated Performance Testing: Incorporate tools like `clinic.js` into development or CI to catch these issues early.
Red flags to avoid:
- Suggesting “just buy more CPU” without addressing the root cause.
- Misunderstanding the single-threaded nature of Node.js event loop.
- Confusing I/O blocking (which Node.js handles well asynchronously) with CPU blocking.
- Not mentioning `worker_threads` as the modern solution for CPU-bound tasks in Node.js.
MCQ Section
1. Which of the following is the primary mechanism for handling asynchronous errors in Node.js Promises?
A. try...catch block around the entire Promise chain.
B. process.on('uncaughtException').
C. .catch() method at the end of the Promise chain or try...catch with async/await.
D. Using console.error().
**Correct Answer:** C
**Explanation:**
* A: `try...catch` only catches a Promise rejection when the Promise is `await`ed inside the `try` block; wrapping a `.then()` chain in a synchronous `try...catch` will not catch its rejections, because they occur after the block has already exited.
* B: `uncaughtException` fires for exceptions that propagate to the top of the event loop without being caught; it does not fire for unhandled Promise rejections, which have their own `unhandledRejection` event.
* C: The `.catch()` method (or a `try...catch` block in an `async` function) is the correct and idiomatic way to handle Promise rejections.
* D: `console.error()` is for logging, not error handling itself.
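A quick sketch of the distinction behind the correct answer (`fetchUser` here is a hypothetical async operation that can reject):

```javascript
// A stand-in async operation that can reject.
async function fetchUser(id) {
  if (id < 0) throw new Error('invalid id');
  return { id, name: 'Ada' };
}

// Style 1: .catch() at the end of the Promise chain.
fetchUser(-1)
  .then((user) => console.log(user.name))
  .catch((err) => console.error('handled via .catch():', err.message));

// Style 2: try...catch with async/await — the await makes the rejection catchable.
async function main() {
  try {
    await fetchUser(-1);
  } catch (err) {
    console.error('handled via try/catch:', err.message);
  }
}
main();
```

Both styles are idiomatic; what matters is that every Promise chain terminates in one of them, or the rejection becomes an `unhandledRejection`.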
2. In a production Node.js application, which logging level would you typically use for general operational messages (e.g., “Server started”, “User logged in”) that do not indicate an error or warning?
A. DEBUG
B. TRACE
C. INFO
D. ERROR
**Correct Answer:** C
**Explanation:**
* A (DEBUG) and B (TRACE) are for very detailed, fine-grained messages primarily used during development or deep debugging, and are often too verbose for production.
* C (INFO) is the standard level for general, high-level operational messages that provide context about the application's normal flow.
* D (ERROR) is reserved for serious issues that prevent normal operation.
3. What is the main advantage of using structured logging (e.g., JSON logs) over unstructured plain text logs in a microservices architecture?
A. Structured logs are smaller in file size.
B. They are easier for humans to read directly.
C. They enable efficient parsing, querying, and analysis by log aggregation systems.
D. Structured logs automatically prevent sensitive data from being logged.
**Correct Answer:** C
**Explanation:**
* A: Structured logs are often larger due to metadata.
* B: Plain text logs are generally easier for direct human reading, while structured logs (especially JSON) require tools for optimal readability.
* C: The primary benefit of structured logs is their machine-readability, allowing log aggregators to easily index fields, perform complex queries, and generate analytics.
* D: Structured logging does not inherently prevent sensitive data logging; developers must explicitly sanitize or redact sensitive information.
4. Which Node.js module is specifically designed to run CPU-intensive tasks on separate threads, preventing them from blocking the main event loop?
A. cluster
B. child_process
C. worker_threads
D. os
**Correct Answer:** C
**Explanation:**
* A (`cluster`): Creates multiple Node.js *processes*, primarily for horizontal scaling across CPU cores, but each process still has a single event loop. It doesn't offload CPU-intensive tasks *within* a single process's event loop.
* B (`child_process`): Spawns external processes, which can be used to run blocking tasks, but `worker_threads` is the more idiomatic Node.js way for in-process, thread-based concurrency for CPU-bound tasks.
* C (`worker_threads`): This module allows developers to create truly parallel JavaScript execution environments (threads) within a single Node.js process, specifically for CPU-intensive operations.
* D (`os`): Provides operating system-related utility methods.
5. A process.on('uncaughtException') handler is generally used for which of the following?
A. Handling all expected application-level errors.
B. Catching errors from Promise rejections without a .catch() handler.
C. A last-resort mechanism for unhandled synchronous errors, followed by graceful process shutdown.
D. Implementing custom error logging to a file.
**Correct Answer:** C
**Explanation:**
* A: It's for *unhandled* exceptions, not expected errors. Expected errors should be handled locally where they occur.
* B: `unhandledRejection` is for Promise rejections.
* C: This is the correct use case. An `uncaughtException` signifies a severe programmer error or an unexpected state that makes the process unreliable. Continuing execution is dangerous, so graceful shutdown (after logging) is recommended.
* D: While you might log within it, its primary purpose is not just custom logging, but rather handling process-level integrity.
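The last-resort pattern from option C can be sketched as follows; the `shutdown` callback is a placeholder for your real teardown (stop accepting traffic, flush logs, close the HTTP server, then exit non-zero):

```javascript
// Last-resort handlers: log with high fidelity, then shut down — never resume
// normal operation after an uncaught exception.
function registerCrashHandlers({ shutdown }) {
  process.on('uncaughtException', (err) => {
    console.error(JSON.stringify({ level: 'fatal', msg: 'uncaughtException', stack: err.stack }));
    shutdown(1); // caller's teardown should end with process.exit(1)
  });

  process.on('unhandledRejection', (reason) => {
    // Since Node 15, unhandled rejections crash the process by default anyway;
    // treating them like uncaught exceptions keeps the behavior explicit.
    console.error(JSON.stringify({ level: 'fatal', msg: 'unhandledRejection', reason: String(reason) }));
    shutdown(1);
  });
}
```

A typical `shutdown` closes the HTTP server with a hard deadline, e.g. `server.close(() => process.exit(code))` plus a `setTimeout(() => process.exit(code), 5000).unref()` escape hatch (the `server` variable is hypothetical here).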
Mock Interview Scenario
Scenario: Intermittent Database Deadlocks
Scenario Setup:
Your Node.js backend service (Node.js v21.x, Express.js v4.18.2, PostgreSQL database, using pg module, deployed on Kubernetes with 5 replicas) is experiencing intermittent “database deadlock” errors. These errors are logged, but they don’t crash the service, only occasionally result in a 500 error for the end-user. The deadlocks seem to happen more frequently during peak load, especially on an endpoint that updates user profiles and their associated settings in separate database tables within a single transaction.
Interviewer: “We’ve observed these intermittent database deadlock errors. Can you walk me through your thought process for diagnosing and resolving this issue? What specific data would you look for, and what changes would you propose?”
Expected Flow of Conversation:
You: “Okay, intermittent deadlocks, especially under peak load and tied to a transaction involving multiple tables, points to a classic concurrency issue at the database level. My approach would be:
1. Verification and Initial Data Gathering:
* Confirm Frequency & Impact: I’d check our centralized logging system (e.g., Datadog Logs/ELK) to quantify the frequency of these deadlock errors and their impact on specific API endpoints and user experience (e.g., percentage of 500 errors).
* Database Metrics: I’d look at PostgreSQL metrics:
  * Active Connections: Are we hitting connection pool limits?
  * Query Latency: Are other queries also slowing down?
  * Locks: Are there any other types of locks or long-running transactions happening concurrently?
  * Deadlock Log: PostgreSQL logs deadlocks explicitly. I’d analyze these logs for details like the tables involved, the queries that formed the deadlock, and the transaction IDs. This is critical.
* Application Tracing: If we have distributed tracing (OpenTelemetry), I’d examine traces of failed requests to see the exact sequence of database operations within the failing transaction and other concurrent transactions.
* Application Logs: Look for any other warnings or errors around the time of the deadlocks that might provide context.
2. Hypotheses & Deeper Investigation:
* Transaction Isolation Level: PostgreSQL defaults to READ COMMITTED, which is usually a sensible choice. I’d verify the transaction isn’t inadvertently running at a stricter level (e.g., REPEATABLE READ or SERIALIZABLE), which increases lock contention, and check whether the transaction’s read-then-update pattern under READ COMMITTED allows interleavings that contribute to the deadlocks.
* Order of Operations: The most common cause of deadlocks is inconsistent ordering of lock acquisition. If transaction A locks table Users then Settings, while transaction B locks Settings then Users, a deadlock can occur. I’d review the code for the user profile/settings update endpoint and identify the exact sequence of UPDATE/INSERT/DELETE statements on Users and Settings tables.
* Missing Indexes: If a table scan is being performed within a transaction that’s also trying to update that table, it can lead to more aggressive locking and increased contention.
* Long-Running Transactions: Are these transactions unnecessarily long? Holding locks for extended periods increases the chance of deadlocks.
* High Concurrency: The fact that the deadlocks occur during peak load confirms that contention is a contributing factor.
3. Proposed Solutions (ordered by priority/impact):
* **a. Standardize Lock Order (High Priority):** This is usually the fix for logical deadlocks. I would ensure that *all* transactions involving `Users` and `Settings` tables acquire locks (implicitly, through `UPDATE` statements) in a consistent, predetermined order. For example, always update `Users` first, then `Settings`.
* *Implementation:* Refactor the transaction logic in the Node.js service to guarantee this order.
* **b. Optimize Queries & Add Indexes:** Review the `UPDATE` and `SELECT` queries within the transaction for both `Users` and `Settings` tables. Ensure appropriate indexes are in place on columns used in `WHERE` clauses, especially those that define the `JOIN` or update targets. This reduces the time locks are held.
* *Implementation:* Use `EXPLAIN ANALYZE` on the problematic queries in PostgreSQL to find inefficiencies and suggest new indexes.
* **c. Shorten Transactions:** Can the transaction be broken down or made more granular? If certain parts don't *strictly* need to be atomic with others, they can be moved outside the transaction.
* *Implementation:* Analyze the business logic to see if atomicity can be reduced.
* **d. Implement Retry Logic with Exponential Backoff:** Since deadlocks are typically detected and terminated by the database, the Node.js application will receive an error. Implementing a retry mechanism (e.g., 2-3 retries with exponential backoff and jitter) around the transaction execution in the Node.js service can often resolve intermittent deadlocks gracefully, as the second attempt usually succeeds.
* *Implementation:* Use a library or custom logic to wrap the database transaction with retries.
* **e. Consider `SELECT FOR UPDATE`/`FOR SHARE` (Carefully):** For highly contended scenarios, explicitly locking rows with `SELECT ... FOR UPDATE` can prevent deadlocks by acquiring locks *before* modifying, but this must be used very cautiously as it can also lead to more blocking and reduced concurrency if not applied precisely.
* **f. Review Transaction Isolation Level (Advanced):** While changing isolation levels can impact concurrency, in very specific scenarios, adjusting it might be considered, but it's a complex decision with broader implications. I'd only consider this after exhausting other options.
4. Long-Term Prevention:
* Automated Transaction Testing: Include integration tests that simulate concurrent updates on the affected tables to catch deadlock scenarios in CI/CD.
* Database-Level Monitoring: Enhance monitoring for lock contention, long-running transactions, and deadlock frequency.
* Code Review Focus: Emphasize transaction management and lock ordering during code reviews for new features involving multiple table updates.
* Clear Transaction Boundaries: Ensure transaction logic is encapsulated and clearly defined in the Node.js code.
Interviewer (Potential follow-up): “Good. What if the deadlock errors are extremely rare, happening perhaps once a day? How would that change your approach?”
You: “If deadlocks are extremely rare, say once a day, the impact on overall user experience might be low, but they still indicate an underlying issue. My initial data gathering would focus on historical logs and database metrics from the exact time of the incident to catch the anomaly. Automated retry logic with exponential backoff would become a higher priority mitigation. It would address the user impact without requiring a full code change for an extremely rare event. However, I’d still log these retries and the original deadlock error with high fidelity. Long-term, I’d still push for a root cause analysis of the lock order or query optimization, but with a lower urgency than if it were a frequent occurrence. The cost-benefit of a complex code refactor for a very rare event needs to be weighed against the reliability provided by a simple retry.”
Red flags to avoid:
- “Just restart the DB”: Avoid suggesting brute-force solutions without diagnosis.
- “Node.js causes deadlocks”: Deadlocks are a database concurrency issue, not a Node.js specific problem.
- Ignoring the database logs: The database often provides the most direct clues for deadlocks.
- Not discussing lock order: This is almost always the core fix for logical deadlocks.
- Not proposing retries: An effective way to handle transient database errors like deadlocks gracefully.
Practical Tips
- Embrace Asynchronous Error Handling: Internalize the use of `.catch()` for Promises and `try...catch` with `async/await`. For older callback-based APIs, always use the error-first callback pattern.
- Master Error Types: Understand the distinction between operational and programmer errors and design your application’s error flow accordingly. Create custom error classes to categorize your operational errors meaningfully.
- Choose a Production-Grade Logger: Move beyond `console.log` for production. Libraries like Pino (v8.x) or Winston (v3.x) offer structured logging, different transport options, and robust error serialization.
- Implement Structured Logging from Day One: Design your logs to be machine-readable (JSON) with rich context (e.g., `requestId`, `userId`, `traceId`, `serviceName`). This is paramount for debugging in distributed systems.
- Prioritize Observability: Don’t just log errors. Implement metrics (Prometheus, OpenTelemetry Metrics) for application health and performance, and distributed tracing (OpenTelemetry Tracing) for end-to-end request visibility across services.
- Learn `node --inspect` and Profiling Tools: Become proficient with Node.js built-in debugging tools (Chrome DevTools) for inspecting runtime behavior, heap snapshots (for memory leaks), and CPU profiles (for blocking operations). Tools like `clinic.js` (Doctor, Flame, Bubbleprof) are excellent for performance analysis.
- Practice Defensive Programming: Validate all inputs, gracefully handle expected failures (e.g., external API outages), and design for partial system failures.
- Understand Node.js Process Management: Know how `process.on('uncaughtException')` and `process.on('unhandledRejection')` work, and the importance of graceful shutdowns (`process.exit(1)`) in response to critical unhandled errors.
- Study Resilience Patterns: Learn about Circuit Breakers, Retries with Exponential Backoff, and Bulkheads. Understand when and how to apply them using libraries like `opossum` or `axios-retry`.
- Analyze Real-World Incidents: Read post-mortems from major companies. This builds intuition for common failure modes and effective diagnostic strategies.
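The structured-logging tips above reduce to one idea: one JSON object per line, with consistent bound context. Pino and Winston add levels, transports, and redaction on top; this stdlib sketch (field names follow the conventions mentioned above) only illustrates the shape, including pino-style child loggers:

```javascript
// Minimal structured logger: one JSON object per line, with bound context.
// Use Pino or Winston in production — this only illustrates the shape.
function createLogger(baseContext = {}, write = (line) => process.stdout.write(line + '\n')) {
  const log = (level, msg, fields = {}) =>
    write(JSON.stringify({
      level,
      time: new Date().toISOString(),
      msg,
      ...baseContext,
      ...fields,
    }));
  return {
    info: (msg, fields) => log('info', msg, fields),
    error: (msg, fields) => log('error', msg, fields),
    // child() binds extra context (e.g., a per-request requestId), like pino's child loggers.
    child: (extra) => createLogger({ ...baseContext, ...extra }, write),
  };
}
```

In a request handler you would create a child per request — `const reqLog = logger.child({ requestId: req.id })` — so every line it emits is automatically correlatable in your log aggregator.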
Summary
This chapter has provided a comprehensive overview of error handling, logging, and observability in Node.js backend engineering, progressing from fundamental concepts to advanced strategies required for complex, distributed systems. We’ve explored the nuances of synchronous vs. asynchronous error handling, the critical role of structured logging with context, and the power of observability pillars – logs, metrics, and traces – in diagnosing and preventing production issues. Through practical questions, an MCQ section, and mock interview scenarios involving latency spikes and database deadlocks, you’ve gained insight into how to identify, troubleshoot, and mitigate common problems.
By mastering these topics, you demonstrate not only technical proficiency in Node.js but also a mature understanding of building resilient, maintainable, and observable backend services. Your ability to reason about system behavior under load, debug production incidents, and design for failure will distinguish you in any technical interview.
Next Steps in Preparation:
- Experiment with a Node.js application, integrating Winston or Pino for structured logging.
- Set up a simple OpenTelemetry example to send traces between two Node.js services.
- Practice taking heap snapshots and CPU profiles using `node --inspect` on a sample Node.js application.
- Implement a simple circuit breaker and retry mechanism in a dummy microservice calling an unreliable external API.
References
- Node.js Official Documentation - Errors: https://nodejs.org/api/errors.html
- Express.js Error Handling Guide: https://expressjs.com/en/guide/error-handling.html
- OpenTelemetry Node.js Documentation (v1.x for Node.js): https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/
- Winston Logger (v3.x) - GitHub: https://github.com/winstonjs/winston
- Pino Logger (v8.x) - GitHub: https://github.com/pinojs/pino
- Joyent - Production Node.js Error Handling: https://www.joyent.com/node-js/production/design/errors
- opossum (Circuit Breaker) - GitHub: https://github.com/nodeshift/opossum
This interview preparation guide is AI-assisted and reviewed. It references official documentation and recognized interview preparation resources.