Introduction
As network engineering increasingly adopts NetDevOps principles, automation pipelines become central to network operations. While automation brings immense benefits in speed, consistency, and scalability, it also introduces new challenges in troubleshooting and demands a refined set of best practices to maintain reliability and security. This chapter is dedicated to equipping you with the knowledge and strategies to effectively diagnose and resolve issues within your NetDevOps workflows, as well as to implement robust best practices that ensure the long-term success and stability of your automated network infrastructure.
We will delve into common pitfalls encountered with Ansible, Python, Infrastructure as Code (IaC) tools, and multi-vendor network interactions. You will learn systematic approaches to debugging automation scripts, validating network state, and identifying root causes across the entire NetDevOps toolchain. Furthermore, we will establish critical best practices ranging from secure credential management and robust code reviews to continuous monitoring and performance optimization.
After completing this chapter, you will be able to:
- Systematically troubleshoot common issues in NetDevOps automation, including Ansible playbook failures, Python script errors, and IaC deployment problems.
- Identify and resolve multi-vendor interoperability challenges when using NETCONF, RESTCONF, and gRPC with YANG data models.
- Implement security best practices throughout your NetDevOps pipeline and network configurations.
- Optimize the performance of your automation scripts and infrastructure.
- Establish a resilient and maintainable NetDevOps environment through adherence to industry best practices.
Technical Concepts
Effective troubleshooting in NetDevOps hinges on a deep understanding of the underlying technical concepts that govern automation and network interaction.
1. Idempotency and State Management
Idempotency is a cornerstone of robust network automation. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application. In NetDevOps, this means an Ansible playbook or Python script should bring the network to a desired state, regardless of its current state, without causing unintended side effects on subsequent runs. Troubleshooting often involves identifying non-idempotent operations that cause unexpected changes or failures on repeated executions.
Detailed Technical Explanation: Idempotency is achieved by ensuring that configuration changes are applied conditionally. For instance, instead of always adding a VLAN, an automation script should first check if the VLAN exists and only create it if it doesn’t. If the VLAN already exists with the correct parameters, the script should report no changes. Tools like Ansible are inherently designed with idempotency in mind, where modules typically perform a “check mode” equivalent before applying changes. However, custom scripts or poorly designed playbooks can break this principle.
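The check-before-change pattern described above can be sketched in a few lines of Python. This is a minimal illustration: the device_vlans dictionary stands in for state that a real script would read from the device (for example via NAPALM or ncclient), and the function name is hypothetical.

```python
def ensure_vlan(device_vlans, vlan_id, name):
    """Idempotently ensure a VLAN exists with the desired name.

    Returns True if a change was made, False if the device was already
    in the desired state -- so a second run reports no change.
    """
    current = device_vlans.get(vlan_id)
    if current == name:
        return False  # Desired state already present: do nothing
    device_vlans[vlan_id] = name  # Create the VLAN or correct its name
    return True


# In practice device_vlans would be read from the device; here it is in-memory.
vlans = {10: "USERS"}
print(ensure_vlan(vlans, 20, "SERVERS"))  # First run: change made -> True
print(ensure_vlan(vlans, 20, "SERVERS"))  # Second run: no change -> False
```

Running the same operation twice yields a change only on the first run, which is exactly the behavior Ansible modules report via their "changed" status.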
State management refers to how automation tracks and understands the current configuration and operational status of network devices. This is crucial for pre-checks, post-checks, and ensuring that automation only acts when necessary. Desired state configuration (DSC) via IaC relies heavily on effective state management.
Network Diagrams (Simplified IaC Workflow):
@startuml
skinparam handwritten true
skinparam style strict
node "Version Control System (VCS)" as VCS {
rectangle "IaC Repository" as Repo
}
cloud "CI/CD Pipeline" as CICD {
rectangle "Linter/Validator" as Linter
rectangle "Pre-Check Automation" as PreCheck
rectangle "Deployment Automation" as Deploy
rectangle "Post-Check/Test Automation" as PostCheck
}
cloud "Network Infrastructure" as Network {
rectangle "Cisco Devices" as Cisco
rectangle "Juniper Devices" as Juniper
rectangle "Arista Devices" as Arista
}
Repo --> Linter : Pushes changes (IaC)
Linter --> PreCheck : Validated IaC
PreCheck --> Network : Gathers Current State
Network --> PreCheck : Returns Current State
PreCheck --> Deploy : Desired vs Current State OK
Deploy --> Network : Applies Configuration
Network --> Deploy : Configuration Status
Deploy --> PostCheck : Deployment Result
PostCheck --> Network : Verifies New State
Network --> PostCheck : Returns New State
PostCheck --> Deploy : Verification Result
@enduml
Figure 13.1: Simplified NetDevOps CI/CD Workflow with State Management
RFC/Standard References: While no single RFC defines “idempotency” directly for network configuration, the principles are embedded in the design of configuration management protocols.
- RFC 6241 (NETCONF Protocol): NETCONF operations like <edit-config> can be made idempotent through careful use of the "operation" attribute (e.g., create, merge, replace).
- RFC 8040 (RESTCONF Protocol): RESTCONF utilizes standard HTTP methods (PUT, POST, DELETE, PATCH), where PUT is inherently idempotent (it replaces the resource).
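The inherent idempotency of a RESTCONF PUT can be illustrated with a short Python sketch using only the standard library. The host address, interface name, and reliance on the standard ietf-interfaces YANG model are illustrative assumptions; credentials and TLS handling are omitted.

```python
import json
import urllib.request


def build_interface_put(host, if_name, description):
    """Build an idempotent RESTCONF PUT request for one interface.

    PUT replaces the target resource (RFC 8040), so sending the same
    request repeatedly leaves the device in the same state. The host
    and the ietf-interfaces model support are assumptions here.
    """
    url = f"https://{host}/restconf/data/ietf-interfaces:interfaces/interface={if_name}"
    payload = {
        "ietf-interfaces:interface": {
            "name": if_name,
            "type": "iana-if-type:ethernetCsmacd",
            "description": description,
            "enabled": True,
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/yang-data+json",
            "Accept": "application/yang-data+json",
        },
    )


req = build_interface_put("198.51.100.1", "GigabitEthernet1", "Uplink to core")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)  (with auth/TLS configured)
```

Because the payload describes the complete desired resource rather than a delta, replaying this request never accumulates extra configuration.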
2. API Interaction (NETCONF, RESTCONF, gRPC, YANG)
NetDevOps heavily relies on programmatic interfaces for configuration and operational data. Issues can arise from incorrect API calls, malformed data models, or protocol mismatches.
Detailed Technical Explanation:
- NETCONF (Network Configuration Protocol): An XML-based protocol for managing network devices. It uses Remote Procedure Calls (RPCs) and defines explicit operations like get, get-config, edit-config, and commit. Troubleshooting often involves inspecting the XML RPC requests and responses.
- RFC 6241: NETCONF Protocol
- RFC 6242: Using the NETCONF Protocol over SSH
- RESTCONF (RESTful Configuration Protocol): Provides a REST-like interface over HTTP(S) for accessing YANG-modeled data. It maps HTTP methods (GET, PUT, POST, DELETE) to configuration operations. Errors typically manifest as HTTP status codes (e.g., 400 Bad Request, 404 Not Found, 409 Conflict).
- RFC 8040: RESTCONF Protocol
- gRPC (gRPC Remote Procedure Call): A modern, high-performance, open-source universal RPC framework that can use Protocol Buffers for message serialization. It’s increasingly used with gNMI (gRPC Network Management Interface) for streaming telemetry and configuration. Troubleshooting involves examining gRPC status codes and payload structures.
- gNMI Specification: OpenConfig gNMI (not an RFC but a de facto standard).
- YANG (Yet Another Next Generation): A data modeling language used to model configuration and state data, notifications, and RPCs for network devices. YANG models provide a structured, vendor-agnostic way to define network features. Validation errors are common, indicating that the automation is trying to apply data that doesn’t conform to the device’s YANG model.
- RFC 7950: The YANG 1.1 Data Modeling Language
- RFC 7951: JSON Encoding of Data Modeled with YANG
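To make the NETCONF operation attribute concrete, the following Python sketch builds an <edit-config> payload with the standard library. The <vlans> schema is invented for illustration only; a real payload must follow the device's YANG model, and a library such as ncclient would actually open the session and send the RPC.

```python
import xml.etree.ElementTree as ET

# NETCONF base namespace from RFC 6241; the operation attribute lives here.
NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"


def build_vlan_config(vlan_id, operation="merge"):
    """Build an <edit-config> <config> payload as a string.

    The nc:operation attribute (create/merge/replace/delete) controls how
    the server applies this subtree. The <vlans>/<vlan> elements below are
    an invented, illustrative schema -- not a real vendor YANG model.
    """
    config = ET.Element("config", {"xmlns": NC_NS})
    vlans = ET.SubElement(config, "vlans")
    vlan = ET.SubElement(vlans, "vlan", {"xmlns:nc": NC_NS, "nc:operation": operation})
    ET.SubElement(vlan, "vlan-id").text = str(vlan_id)
    return ET.tostring(config, encoding="unicode")


payload = build_vlan_config(100, operation="create")
print(payload)
# With ncclient, this string would be passed as:
#   manager.edit_config(target="running", config=payload)
```

With operation="create" the server rejects the edit if VLAN 100 already exists, while operation="merge" makes the same payload idempotent, which is often the behavior automation wants.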
Packet Diagram (Simplified NETCONF <edit-config> RPC):
packetdiag {
colwidth = 32
colheight = 16
0-7: SSH Header
8-15: SSH Payload (NETCONF)
16-31: Message ID Header
32-95: <rpc> tag
96-127: <edit-config> tag
128-191: <target><running/></target>
192-255: <config> tag
256-447: YANG-modeled XML Configuration Payload
448-479: </config> tag
480-511: </edit-config> tag
512-543: </rpc> tag
544-575: EOM (End of Message)
}
Figure 13.2: Simplified NETCONF <edit-config> Packet Structure
3. Control Plane vs. Data Plane
In automation, it’s critical to understand the distinction between the control plane (routing protocols, management plane) and the data plane (packet forwarding). Automation often focuses on the control plane (e.g., configuring interfaces, routing protocols, firewall rules), which then influences the data plane. Troubleshooting requires knowing whether an issue is an automation failure in configuring the control plane, or if the control plane configured correctly but the data plane is still not behaving as expected due to other factors (e.g., hardware issue, incorrect ASIC programming).
Detailed Technical Explanation:
- Control Plane: Manages the network’s operational state. This includes routing tables, MAC address tables, STP topology, ARP tables, and security policies. Automation typically interacts with the control plane to change configuration.
- Data Plane: Responsible for forwarding user traffic based on the control plane’s decisions. A data plane issue might manifest as packet loss, high latency, or incorrect forwarding, even if the control plane configuration appears correct.
Automation errors might cause a discrepancy where the desired control plane state is not fully pushed or activated, or worse, where conflicting configurations are pushed. Validating both control plane state (e.g., show ip route) and data plane behavior (e.g., ping, traceroute, traffic flow verification) is essential.
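A post-change validation step can separate the two planes: first compare the expected routing table against what the device actually learned, then exercise the data plane with reachability tests. The helper below is a hypothetical sketch; in practice actual_routes would come from NAPALM's get_route_to() or parsed "show ip route" output, and a data-plane check would follow with ping/traceroute.

```python
def diff_routes(expected_routes, actual_routes):
    """Compare expected control-plane routes (prefix -> next hop) against
    the routes actually present on the device.

    Returns (missing, wrong_next_hop). Here both arguments are plain dicts
    so the comparison logic is testable without a device; real state would
    be gathered via NAPALM or CLI parsing.
    """
    missing = {p: nh for p, nh in expected_routes.items() if p not in actual_routes}
    wrong_next_hop = {
        p: (nh, actual_routes[p])
        for p, nh in expected_routes.items()
        if p in actual_routes and actual_routes[p] != nh
    }
    return missing, wrong_next_hop


expected = {"10.1.0.0/24": "192.0.2.1", "10.2.0.0/24": "192.0.2.1"}
actual = {"10.1.0.0/24": "192.0.2.1", "10.2.0.0/24": "192.0.2.9"}
missing, wrong = diff_routes(expected, actual)
print(missing)  # {}
print(wrong)    # {'10.2.0.0/24': ('192.0.2.1', '192.0.2.9')}
```

An empty result from both checks tells you the control plane converged as intended; any remaining packet loss then points at the data plane (hardware, ASIC programming) rather than the automation.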
Network Diagram (Control Plane vs. Data Plane Interaction):
digraph "Control Plane vs Data Plane" {
rankdir=LR;
node [shape=box, style=filled, fillcolor=lightgray];
subgraph cluster_automation {
label = "NetDevOps Automation";
color = blue;
Automation_Tool [label="Ansible/Python/IaC"];
}
subgraph cluster_network {
label = "Network Device";
color = green;
Control_Plane [label="Control Plane\n(Routing, ACLs, Mgmt)"];
Data_Plane [label="Data Plane\n(Packet Forwarding)"];
}
Automation_Tool -> Control_Plane [label="Config/State Request" color=red];
Control_Plane -> Automation_Tool [label="Config/State Response" color=red];
Control_Plane -> Data_Plane [label="Updates Forwarding Tables" color=darkgreen];
Data_Plane -> Control_Plane [label="Operational Feedback" color=darkgreen, style=dotted];
User_Traffic [shape=cylinder, label="User Traffic"];
User_Traffic -> Data_Plane [label="Flows Through" color=orange];
{rank=same; Automation_Tool; Control_Plane; Data_Plane}
}
Figure 13.3: NetDevOps Automation Interacting with Network Device Planes
4. State Machines and Workflows (CI/CD Pipelines)
Complex NetDevOps automation often involves multi-stage CI/CD pipelines. Each stage acts as a state in a larger workflow. Failures at any stage can halt the entire process. Understanding the expected state transitions and dependencies between stages is crucial for troubleshooting.
Detailed Technical Explanation: A typical NetDevOps pipeline might involve stages like:
- Code Commit: Changes pushed to VCS.
- Linting/Syntax Check: Validating code for syntax and style.
- Unit/Integration Tests: Testing automation scripts against mock devices or a lab environment.
- Pre-Deployment Checks: Gathering current network state, validating prerequisites.
- Deployment: Applying configuration changes.
- Post-Deployment Checks/Tests: Verifying the applied configuration and operational state.
- Rollback (if needed): Reverting to a known good state.
Each stage has success and failure conditions. A failure in one stage often prevents subsequent stages from running. Troubleshooting involves identifying exactly which stage failed, examining its logs, and understanding the preconditions it expected.
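The halt-on-first-failure behavior of such a pipeline can be modeled in a few lines of Python. This is a conceptual sketch, not a real CI/CD runner; stage names and the callable-based interface are illustrative.

```python
def run_pipeline(stages):
    """Execute pipeline stages in order, halting at the first failure.

    Each stage is a (name, callable-returning-bool) pair. Returns the list
    of completed stage names and the name of the failed stage (or None),
    so troubleshooting can start from the exact stage that failed.
    """
    completed = []
    for name, stage in stages:
        if stage():
            completed.append(name)
        else:
            return completed, name  # Subsequent stages never run
    return completed, None


stages = [
    ("lint", lambda: True),
    ("unit-test", lambda: True),
    ("pre-check", lambda: False),  # Simulated pre-deployment check failure
    ("deploy", lambda: True),      # Never reached
]
completed, failed = run_pipeline(stages)
print(completed)  # ['lint', 'unit-test']
print(failed)     # prints: pre-check
```

Real pipeline engines (Jenkins, GitLab CI) implement the same contract with exit codes: a non-zero exit in one stage stops the run, and the stage's log is where troubleshooting begins.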
Workflow Diagram (NetDevOps CI/CD Pipeline):
digraph NetDevOps_Pipeline {
rankdir=TB;
node [shape=box, style=filled, fillcolor=lightblue];
edge [color=gray];
subgraph cluster_start {
label = "Start";
color = transparent;
Node_Start [label="Code Commit"];
}
Node_Lint [label="Linting/Syntax Check"];
Node_Unit_Test [label="Unit/Integration Tests"];
Node_Pre_Check [label="Pre-Deployment Checks"];
Node_Deploy [label="Deployment (IaC/Ansible)"];
Node_Post_Check [label="Post-Deployment Verification"];
Node_Monitoring [label="Continuous Monitoring"];
Node_Alert [label="Alerting"];
Node_Rollback [label="Rollback/Remediation" fillcolor=lightcoral];
Node_Start -> Node_Lint;
Node_Lint -> Node_Unit_Test [label="Pass"];
Node_Lint -> Node_Rollback [label="Fail" color=red];
Node_Unit_Test -> Node_Pre_Check [label="Pass"];
Node_Unit_Test -> Node_Rollback [label="Fail" color=red];
Node_Pre_Check -> Node_Deploy [label="Pass"];
Node_Pre_Check -> Node_Rollback [label="Fail" color=red];
Node_Deploy -> Node_Post_Check [label="Success"];
Node_Deploy -> Node_Rollback [label="Fail" color=red];
Node_Post_Check -> Node_Monitoring [label="Verified"];
Node_Post_Check -> Node_Rollback [label="Verification Fail" color=red];
Node_Monitoring -> Node_Alert [label="Detects Anomaly" color=orange];
Node_Alert -> Node_Rollback [label="Trigger Remediation" color=red];
Node_Rollback -> Node_Start [label="Restart Process (Manual/Auto)" style=dotted];
{rank=min; Node_Start}
{rank=max; Node_Monitoring; Node_Alert}
}
Figure 13.4: NetDevOps CI/CD Pipeline Workflow
Configuration Examples (Multi-vendor)
Establishing a robust NetDevOps environment requires consistent and secure device configurations, especially regarding API access and management. Here are multi-vendor examples for enabling NETCONF/RESTCONF and setting up basic AAA for automation accounts.
1. Enabling NETCONF/RESTCONF and AAA for Automation
It’s crucial to secure API access and ensure that automation tools have the necessary permissions. This typically involves enabling the API protocols and configuring local or remote AAA for the automation user.
Cisco IOS XE
! Enable NETCONF/RESTCONF via YANG-based management plane
restconf
netconf-yang
! Create a local user for automation with privilege 15
username automation_user privilege 15 secret automation_password!
! Configure AAA for console and VTY lines
aaa new-model
aaa authentication login default local
aaa authorization exec default local
! Apply AAA to VTY lines and enable SSH for secure access
line vty 0 4
transport input ssh
logging synchronous
login authentication default
authorization exec default
!
! Important: Ensure SSH is configured and host key generated
crypto key generate rsa modulus 2048
ip domain name example.com
ip ssh version 2
!
Verification Commands (Cisco):
show running-config | section ^restconf|^netconf-yang|^username automation_user|^aaa|^line vty
show platform software yang-management process
Expected Output (Cisco - partial):
! Output for `show running-config | section ^restconf|^netconf-yang|^username automation_user|^aaa|^line vty`
restconf
netconf-yang
username automation_user privilege 15 secret 9 $9$0G1uQ5R9$7yX0M2uF2V7pL5rQ8nJ1Y0k5U9w8X4z2M1o5T2j5
aaa new-model
aaa authentication login default local
aaa authorization exec default local
line vty 0 4
transport input ssh
logging synchronous
login authentication default
authorization exec default
!
! Output for `show platform software yang-management process`
PID PPID TID STATUS CPU BINDING PRI NAME
23456 1234 7890 S 0.1% -- -- nesd
... (other YANG processes should be running)
Juniper JunOS
# Enable NETCONF over SSH (default for JunOS)
set system services netconf ssh
# Create a local user for automation with super-user permissions
set system login user automation_user uid 2000 class super-user
set system login user automation_user authentication plain-text-password
set system login user automation_user authentication password "automation_password!"
# Configure SSH for secure access (if not already done)
set system services ssh protocol-version v2
set system services ssh root-login deny
# Optionally limit concurrent SSH sessions (example)
# set system services ssh connection-limit 10
commit and-quit
Security Warning: Using plain-text-password is for demonstration. In production, use set system login user automation_user authentication encrypted-password "$9$..." after setting the password securely or use SSH keys.
Verification Commands (Juniper):
show configuration system services | display set
show configuration system login user automation_user
show system connections | match 830
Expected Output (Juniper - partial):
# Output for `show configuration system services | display set`
set system services netconf ssh
set system services ssh protocol-version v2
set system services ssh root-login deny
#
# Output for `show configuration system login user automation_user`
automation_user {
uid 2000;
class super-user;
authentication {
encrypted-password "$9$..."; ## SECRET-DATA
}
}
#
# Output for `show system connections | match 830`
tcp 0 0 0.0.0.0:830 0.0.0.0:* LISTEN
Arista EOS
! Enable the eAPI (RESTCONF-like API)
management api http-https
no shutdown
protocol https
! Consider limiting access with an ACL, e.g.:
! ip access-group API_ACL in
!
! Create a local user for automation with privilege 15
username automation_user privilege 15 secret automation_password!
! Configure AAA for console and VTY (similar to Cisco IOS)
aaa authentication login default local
aaa authorization exec default local
!
! Arista typically uses `enable secret` for privilege 15 password
enable secret 5 $5$uW/V$1t1234567890abcdefghijklmnopqrstuvwxyzabcdefg
!
Security Warning: Arista’s eAPI is a robust RESTful interface, but ensure HTTPS is always used in production and consider IP access lists for further security hardening.
Verification Commands (Arista):
show running-config | section ^management api http-https|^username automation_user|^aaa
show management api http-https
Expected Output (Arista - partial):
! Output for `show running-config | section ^management api http-https|^username automation_user|^aaa`
management api http-https
no shutdown
protocol https
username automation_user privilege 15 secret 5 $5$uW/V$1t1234567890abcdefghijklmnopqrstuvwxyzabcdefg
aaa authentication login default local
aaa authorization exec default local
!
! Output for `show management api http-https`
Management API HTTP-HTTPS:
Enabled: Yes
HTTPS port: 443
Global state: Enabled
...
Network Diagrams
Visualizing your NetDevOps environment and processes is key for effective understanding and troubleshooting.
1. NetDevOps Control Plane (nwdiag)
This diagram illustrates the core components of a NetDevOps control plane, including the automation tools, version control, and CI/CD server, interacting with network segments.
nwdiag {
network automation_network {
address = "10.0.0.0/24"
automation_server [address = "10.0.0.10", description = "Ansible/Python/Terraform"];
vcs_server [address = "10.0.0.11", description = "Gitlab/GitHub"];
ci_cd_server [address = "10.0.0.12", description = "Jenkins/Gitlab-CI"];
}
network management_network {
address = "192.168.1.0/24"
automation_server; // Connects to both
cisco_router [address = "192.168.1.1"];
juniper_switch [address = "192.168.1.2"];
arista_leaf [address = "192.168.1.3"];
}
// Connections via shared network blocks
// Implicit connections:
// automation_server <-> cisco_router, juniper_switch, arista_leaf
// vcs_server <-> automation_server, ci_cd_server
// ci_cd_server <-> automation_server
}
Figure 13.5: NetDevOps Control Plane Topology
2. Automation Flow for Configuration Deployment (Graphviz)
This diagram shows a typical sequence of operations for deploying configuration changes using NetDevOps tools.
digraph G {
rankdir=LR;
node [shape=box, style=filled, fillcolor=lightblue];
edge [color=gray, fontsize=10];
// Nodes
Config_Repo [label="Configuration Repo\n(IaC - YAML/Jinja)"];
Ansible_Playbook [label="Ansible Playbook\n(Python Scripts)"];
NETCONF_RPC [label="NETCONF/RESTCONF/gRPC API"];
Network_Device [label="Network Device"];
State_DB [label="Network State DB\n(Nautobot/NetBox)"];
// Edges
Config_Repo -> Ansible_Playbook [label="Reads Desired State"];
Ansible_Playbook -> NETCONF_RPC [label="Sends Config RPC"];
NETCONF_RPC -> Network_Device [label="Applies Configuration"];
Network_Device -> NETCONF_RPC [label="Returns Status/Telemetry"];
NETCONF_RPC -> Ansible_Playbook [label="Parses API Response"];
Ansible_Playbook -> State_DB [label="Updates Current State"];
State_DB -> Ansible_Playbook [label="Provides Current State"];
}
Figure 13.6: Automation Flow for Configuration Deployment
3. Multi-Vendor Automation Architecture (PlantUML)
A higher-level view of how different vendors are managed within a unified NetDevOps architecture.
@startuml
skinparam style strict
skinparam backgroundColor white
cloud "Cloud/SaaS" as CLOUD {
node "CI/CD Platform" as CICD {
component "Pipeline Runner" as Runner
}
}
node "Automation Server" as AUTOMATION_SERVER {
component "Ansible Control Node" as Ansible
component "Python Environment" as Python
component "IaC Tool (e.g., Terraform)" as Terraform
database "Secrets Manager" as Secrets
database "Inventory/Source of Truth" as SOT
}
package "Network Devices" as DEVICES {
node "Cisco IOS-XE" as Cisco
node "Juniper JunOS" as Juniper
node "Arista EOS" as Arista
}
CICD --> Runner
Runner --> Ansible : Trigger Playbooks
Runner --> Python : Execute Scripts
Runner --> Terraform : Apply IaC
Ansible --> Secrets : Retrieve Credentials
Python --> Secrets : Retrieve Credentials
Terraform --> Secrets : Retrieve Credentials
Ansible --> SOT : Get Inventory/Data
Python --> SOT : Get Inventory/Data
Terraform --> SOT : Get Inventory/Data
Ansible <--> Cisco : NETCONF/SSH
Ansible <--> Juniper : NETCONF/SSH
Ansible <--> Arista : eAPI/RESTCONF
Python <--> Cisco : NETCONF/RESTCONF/SSH (Netmiko/NAPALM)
Python <--> Juniper : NETCONF/SSH (NAPALM/ncclient)
Python <--> Arista : eAPI/RESTCONF (requests/pyeapi)
Terraform <--> Cisco : DNA Center Provider
Terraform <--> Juniper : Junos Provider
Terraform <--> Arista : Arista EOS Provider
SOT <-- DEVICES : Discovered State (Optional)
@enduml
Figure 13.7: Multi-Vendor NetDevOps Automation Architecture
Automation Examples
These examples demonstrate common automation tasks, focusing on best practices for error handling and idempotency.
1. Python Script: Verify NTP Configuration (Multi-Vendor)
This Python script uses napalm to verify NTP server configuration across Cisco and Juniper devices. It includes error handling and multi-vendor abstraction.
import json

from napalm import get_network_driver
from napalm.base.exceptions import ConnectionException

# Configuration for devices - use a secure method for credentials in production
devices = [
    {
        "hostname": "cisco-rtr-01",
        "device_type": "ios",  # Or 'iosxe', 'nxos'
        "username": "automation_user",
        "password": "automation_password!",
        "optional_args": {"port": 22},  # The ios driver connects over SSH (Netmiko)
    },
    {
        "hostname": "juniper-swo-01",
        "device_type": "junos",
        "username": "automation_user",
        "password": "automation_password!",
        "optional_args": {"port": 830},  # The junos driver connects over NETCONF
    },
    # Add Arista or other devices as needed; adjust device_type and optional_args
]

expected_ntp_servers = ["10.0.0.250", "10.0.0.251"]


def verify_ntp_config(device_info):
    driver = get_network_driver(device_info["device_type"])
    device = None
    try:
        device = driver(
            hostname=device_info["hostname"],
            username=device_info["username"],
            password=device_info["password"],
            optional_args=device_info.get("optional_args", {}),
        )
        print(f"Connecting to {device_info['hostname']}...")
        device.open()

        # get_ntp_servers() returns a dict keyed by server address
        ntp_servers = device.get_ntp_servers()
        print(f"NTP servers on {device_info['hostname']}: {json.dumps(ntp_servers, indent=2)}")

        configured_servers = list(ntp_servers.keys())
        missing_servers = [s for s in expected_ntp_servers if s not in configured_servers]
        extra_servers = [s for s in configured_servers if s not in expected_ntp_servers]

        if not missing_servers and not extra_servers:
            print(f"SUCCESS: NTP configuration on {device_info['hostname']} matches expected.")
            return True

        if missing_servers:
            print(f"WARNING: Missing expected NTP servers on {device_info['hostname']}: {missing_servers}")
        if extra_servers:
            print(f"WARNING: Unexpected NTP servers found on {device_info['hostname']}: {extra_servers}")
        return False

    except ConnectionException as e:
        print(f"ERROR: Could not connect to {device_info['hostname']}: {e}")
        return False
    except Exception as e:
        print(f"ERROR: An unexpected error occurred with {device_info['hostname']}: {e}")
        return False
    finally:
        if device:
            print(f"Closing connection to {device_info['hostname']}.")
            device.close()


if __name__ == "__main__":
    all_ok = True
    for dev in devices:
        if not verify_ntp_config(dev):
            all_ok = False
    if all_ok:
        print("\nAll devices passed NTP configuration verification.")
    else:
        print("\nSome devices failed NTP configuration verification.")
2. Ansible Playbook: Standardize Banner Configuration
This Ansible playbook enforces a standardized login banner across Cisco IOS/IOS-XE and Juniper JunOS devices. It utilizes Jinja2 templating for multi-vendor compatibility and Ansible’s idempotency.
---
- name: Standardize Network Device Banners
  hosts: network_devices
  gather_facts: false
  connection: network_cli  # Use network_cli for general devices; can be replaced with httpapi for Arista eAPI
  vars:
    login_banner_text: |
      *************************************************************
      * UNAUTHORIZED ACCESS TO THIS DEVICE IS STRICTLY PROHIBITED *
      * All activities are logged and monitored.                  *
      *************************************************************
  tasks:
    - name: Ensure correct login banner on Cisco devices
      when: ansible_network_os in ['ios', 'iosxe', 'nxos']
      cisco.ios.ios_banner:
        banner: login
        text: "{{ login_banner_text }}"
        state: present
      register: cisco_banner_result
      ignore_errors: true  # Continue playbook even if one device fails
      notify: Check Cisco Banner

    - name: Ensure correct login banner on Juniper devices
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_banner:  # Handles multi-line text more robustly than raw set commands
        banner: login
        text: "{{ login_banner_text }}"
        state: present
      register: juniper_banner_result
      ignore_errors: true
      notify: Check Juniper Banner

    - name: Ensure correct login banner on Arista EOS devices (via eAPI)
      when: ansible_network_os == 'eos'
      ansible.builtin.include_tasks: arista_banner_task.yml  # Separate task file for clarity
      vars:
        arista_banner_text: "{{ login_banner_text }}"

  handlers:
    - name: Gather Cisco banner
      cisco.ios.ios_command:
        commands: show banner login
      register: cisco_check_banner
      listen: Check Cisco Banner

    - name: Display Cisco banner
      ansible.builtin.debug:
        msg: "Cisco banner after change: {{ cisco_check_banner.stdout[0] }}"
      listen: Check Cisco Banner

    - name: Gather Juniper banner
      junipernetworks.junos.junos_command:
        commands: show configuration system login message
      register: juniper_check_banner
      listen: Check Juniper Banner

    - name: Display Juniper banner
      ansible.builtin.debug:
        msg: "Juniper banner after change: {{ juniper_check_banner.stdout[0] }}"
      listen: Check Juniper Banner

# arista_banner_task.yml (separate file for the Arista-specific task)
# ---
# - name: Configure Arista EOS banner
#   ansible.builtin.uri:
#     url: "https://{{ inventory_hostname }}:443/command-api"
#     method: POST
#     headers:
#       Content-Type: "application/json"
#     body_format: json
#     body:
#       jsonrpc: "2.0"
#       method: "runCmds"
#       params:
#         format: "json"
#         timestamps: false
#         cmds:
#           - "enable"
#           - cmd: "banner login"
#             input: "{{ arista_banner_text }}"
#       id: "1"
#     validate_certs: false  # WARNING: DO NOT USE IN PRODUCTION without proper cert validation
#     user: "{{ ansible_user }}"
#     password: "{{ ansible_password }}"
#     force_basic_auth: true
#   register: arista_banner_config
#   changed_when: true  # eAPI does not report idempotent change status; refine as needed
#   tags: arista_banner
3. Terraform Example: Managing Cloud Network Resources (Conceptual)
This conceptual Terraform configuration provisions a virtual network and a virtual router within a public cloud, demonstrating IaC for network infrastructure.
# This is a conceptual example for a generic cloud provider.
# Real-world Terraform configurations for cloud providers (AWS, Azure, GCP)
# would use provider-specific resources.
# provider "aws" {
# region = "us-east-1"
# }
# Resource: Virtual Network
resource "cloud_network" "production_vpc" {
name = "prod-vpc"
cidr_block = "10.0.0.0/16"
region = "us-east-1"
tags = {
Environment = "Production"
ManagedBy = "Terraform"
}
}
# Resource: Subnet within the Virtual Network
resource "cloud_subnet" "app_subnet" {
name = "app-subnet"
network_id = cloud_network.production_vpc.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
tags = {
Application = "Web"
}
}
# Resource: Virtual Router/Gateway attached to the Virtual Network
resource "cloud_router" "edge_router" {
name = "prod-edge-router"
network_id = cloud_network.production_vpc.id
gateway_type = "internet" # Or "vpn", "direct_connect"
tags = {
Role = "Edge"
}
}
# Output the VPC ID and Subnet ID
output "vpc_id" {
value = cloud_network.production_vpc.id
description = "The ID of the production Virtual Private Cloud."
}
output "app_subnet_id" {
value = cloud_subnet.app_subnet.id
description = "The ID of the application subnet."
}
Security Considerations
Integrating security throughout the NetDevOps lifecycle is paramount. Automation, while powerful, can amplify security risks if not properly managed.
1. Attack Vectors and Mitigation Strategies
| Attack Vector | Description | Mitigation Strategy |
|---|---|---|
| Compromised Credentials/Secrets | Automation tools often store sensitive credentials (API keys, passwords, SSH private keys). If compromised, an attacker gains full control. | Use dedicated Secrets Management solutions (HashiCorp Vault, CyberArk, Ansible Vault for static, environmental variables for dynamic). Implement Least Privilege for automation accounts. Rotate credentials regularly. |
| Insecure Automation Code | Vulnerabilities (e.g., command injection, hardcoded secrets, insecure API calls, lack of input validation) in playbooks or scripts. | Mandatory Code Review (peer review), Static Application Security Testing (SAST) tools for Python, security linters for Ansible. Avoid hardcoding sensitive data. Enforce input validation. |
| Unauthorized Access to IaC Repository | Compromised Git repository allows attackers to inject malicious network configurations or automation logic. | Strict Role-Based Access Control (RBAC) for VCS. Implement Multi-Factor Authentication (MFA). Protect repository with strong branch protection rules and signed commits. |
| Vulnerable Automation Infrastructure | The CI/CD server, automation host, or network devices themselves can be vulnerable. | Keep all automation tooling, operating systems, and network device firmware/software patched and up-to-date. Isolate automation infrastructure with strict firewall rules. |
| Supply Chain Attacks | Using untrusted third-party modules or collections in automation. | Use Curated/Certified Collections (e.g., Red Hat Certified Ansible Collections, NAPALM). Pin dependencies to specific versions. Scan downloaded dependencies for vulnerabilities. |
| Logging and Auditing Deficiencies | Lack of comprehensive logs prevents detection of malicious or anomalous activity. | Implement Centralized Logging for all automation events, API calls, and device changes. Enable Auditing on network devices and automation tools. |
2. Security Best Practices
- Principle of Least Privilege (PoLP): Automation accounts should have only the minimum necessary permissions to perform their tasks. Avoid using admin or root accounts.
- Secrets Management: Never hardcode credentials in automation scripts or IaC repositories. Use dedicated secrets management solutions.
- Secure Communication: Always use encrypted protocols (SSH, HTTPS) for management plane access. Ensure TLS/SSL certificates are valid and verified.
- Input Validation: Validate all input passed to automation scripts or configuration templates to prevent injection attacks or invalid configurations.
- Code Review and Testing: Implement mandatory peer code reviews and comprehensive testing (including security tests) for all automation code before deployment.
- Immutable Infrastructure Principles: Where possible, treat automation artifacts (e.g., Docker images for CI/CD runners) as immutable. Any change requires rebuilding and re-testing.
- Network Segmentation: Isolate automation infrastructure (CI/CD servers, automation nodes) into dedicated, highly restricted network segments.
- Version Control and Audit Trails: Use a VCS for all IaC and automation code. This provides a full audit trail of who changed what and when.
- Regular Security Audits: Periodically audit your NetDevOps pipeline, automation scripts, and network device configurations for security weaknesses.
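The Input Validation item above can be made concrete with a minimal sketch: reject any value that is not strictly what the template expects before it is rendered into device CLI. The function names and the VLAN example are illustrative, not part of a specific library.

```python
import ipaddress

def validate_ntp_server(value: str) -> str:
    """Accept only a plain IPv4/IPv6 address, so a malicious value like
    '10.0.0.1; reload' can never reach a CLI template."""
    try:
        return str(ipaddress.ip_address(value))
    except ValueError:
        raise ValueError(f"invalid NTP server address: {value!r}")

def validate_vlan_id(value: int) -> int:
    """Allow only the valid 802.1Q VLAN ID range (1-4094)."""
    vlan = int(value)
    if not 1 <= vlan <= 4094:
        raise ValueError(f"VLAN ID out of range: {value}")
    return vlan
```

Calling these validators at the boundary (where user or SOT data enters the pipeline) keeps every downstream template and playbook free of injection concerns.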
3. Security Configuration Example (Cisco IOS XE - AAA for Automation)
! Secure SSH access
ip ssh version 2
ip ssh authentication-retries 3
ip ssh timeout 60
! Configure AAA using TACACS+ (preferred for centralized management)
aaa new-model
aaa authentication login default group tacacs+ local
aaa authorization exec default group tacacs+ local
aaa authorization commands 15 default group tacacs+ local
aaa accounting exec default start-stop group tacacs+
aaa accounting commands 15 default start-stop group tacacs+
! TACACS+ server definition (replace with your server IP)
tacacs server TACACS_SERVER_1
address ipv4 10.0.0.100
key 7 082B4F5E0A1A0F5C
!
aaa group server tacacs+ TACACS_GROUP
server name TACACS_SERVER_1
!
! Assign automation_user to specific VTY lines or use remote AAA for all
line vty 0 4
transport input ssh
login authentication default
authorization exec default
!
! Critical: disable insecure management protocols
no ip http server
no ip http secure-server
Security Warning: Never use a plain-text key for the TACACS+ server key, and remember that type 7 encryption is trivially reversible. Prefer stronger key encryption where supported, and keep the key itself in a secrets manager rather than in version-controlled configuration.
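One way to honor that warning in automation code is to resolve the key at runtime from the environment (populated by a secrets manager agent or CI/CD masked variables) rather than from the repository. This is an illustrative sketch; the variable name `TACACS_KEY` is a hypothetical example.

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret from the environment at runtime.

    The environment is expected to be populated by a secrets manager
    (e.g., a Vault agent or CI/CD masked variables), so the secret
    never appears in the repository or in rendered templates on disk.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name!r} is not set; refusing to continue")
    return value

# Usage (hypothetical variable name):
# tacacs_key = get_secret("TACACS_KEY")
```

Failing hard when the secret is absent is deliberate: a pipeline that silently falls back to a default credential is worse than one that stops.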
Verification & Troubleshooting
Troubleshooting in a NetDevOps environment requires a methodical approach, combining traditional network diagnostic skills with an understanding of automation tool output and IaC principles.
1. The NetDevOps Troubleshooting Flow
digraph NetDevOps_Troubleshoot {
rankdir=TB;
node [shape=box, style=filled, fillcolor=lightblue, width=2.5];
edge [color=gray, fontsize=9];
Start [label="Problem Detected\n(Monitoring/Alert/Manual)"];
ReviewLogs [label="1. Review Automation/CI/CD Logs"];
IdentifyFailedStage [label="2. Identify Failed Stage/Task"];
InspectInputs [label="3. Inspect Inputs/Variables\n(IaC, Playbook vars, Jinja)"];
CheckConnectivity [label="4. Check Device Connectivity\n(SSH, API endpoint)"];
ValidateSyntax [label="5. Validate Code/YANG Syntax\n(Linter, `netconf-console --validate`)"];
ManualVerifyConfig [label="6. Manually Verify Device Config/State"];
ManualVerifyOper [label="7. Manually Verify Operational State\n(Data Plane)"];
IsolateIssue [label="8. Isolate Root Cause\n(Automation vs. Device vs. Environment)"];
ImplementFix [label="9. Implement Fix"];
TestAndDeploy [label="10. Test and Redeploy"];
End [label="Resolution"];
Start -> ReviewLogs;
ReviewLogs -> IdentifyFailedStage;
IdentifyFailedStage -> InspectInputs;
InspectInputs -> CheckConnectivity;
CheckConnectivity -> ValidateSyntax;
ValidateSyntax -> ManualVerifyConfig;
ManualVerifyConfig -> ManualVerifyOper;
ManualVerifyOper -> IsolateIssue;
IsolateIssue -> ImplementFix;
ImplementFix -> TestAndDeploy;
TestAndDeploy -> End;
// Feedback loops
ImplementFix -> ReviewLogs [label="Rerun & Re-verify"];
ManualVerifyOper -> ImplementFix [label="Found Error"];
ManualVerifyConfig -> ImplementFix [label="Found Error"];
ValidateSyntax -> ImplementFix [label="Found Error"];
CheckConnectivity -> ImplementFix [label="Found Error"];
InspectInputs -> ImplementFix [label="Found Error"];
}
Figure 13.8: NetDevOps Troubleshooting Flowchart
2. Common Issues and Resolution Steps
| Category | Common Issue | Debug Commands / Indicators | Resolution Steps |
| --- | --- | --- | --- |
Performance Optimization
Optimizing the performance of your NetDevOps pipeline and automation scripts is key for ensuring rapid deployments, timely verification, and resource efficiency.
1. Tuning Parameters and Capacity Planning
Ansible:
- `forks`: Adjust this parameter in `ansible.cfg` or via `--forks` to control parallel connections. Too many forks can overload the control node or target devices; too few can slow down large deployments.
- `fact_caching`: Use fact caching (e.g., `jsonfile`, `redis`) to avoid repeatedly gathering facts, especially in large inventories.
- `pipelining`: Enable pipelining to reduce the number of SSH operations required to execute modules.
- Strategy plugins: Experiment with different strategy plugins (e.g., `linear`, `free`, `mitogen`) for better performance in specific scenarios. `mitogen` is known for significant speedups.
- `ControlPersist`: Configure `ControlPersist` in your SSH client configuration (e.g., `~/.ssh/config`) to reuse SSH connections, reducing overhead.
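The Ansible tuning knobs above live in `ansible.cfg`. A minimal example follows; the values shown are illustrative starting points to be tuned for your inventory size and control node capacity, not universal recommendations.

```ini
[defaults]
forks = 20
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```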
Python:
- Connection Pooling: Reuse connections to network devices (e.g., maintain a pool of `netmiko` or `napalm` objects) rather than establishing a new connection for every operation.
- Asynchronous Operations: Use asynchronous libraries (e.g., `asyncio` with `asyncssh` or `httpx`) for concurrent operations, especially when dealing with many devices or slow APIs. `Nornir` is an excellent framework for concurrent network automation in Python.
- Efficient Data Structures/Algorithms: Optimize Python code for performance-critical sections, using appropriate data structures and efficient algorithms.
- Reduce API Calls: Minimize redundant API calls to network devices. Cache frequently accessed data where appropriate.
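The asynchronous-operations pattern above can be sketched with `asyncio` and a semaphore to bound the number of concurrent device sessions. The `query_device` coroutine here only simulates I/O latency with `asyncio.sleep`; in practice it would wrap an `asyncssh` or `httpx` call.

```python
import asyncio

async def query_device(host: str, sem: asyncio.Semaphore) -> str:
    """Placeholder for a real async device call (e.g., asyncssh/httpx);
    the semaphore caps how many sessions run at once."""
    async with sem:
        await asyncio.sleep(0.01)  # simulate network round-trip
        return f"{host}: ok"

async def run_all(hosts: list[str], max_concurrent: int = 10) -> list[str]:
    """Query all hosts concurrently, never more than max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(query_device(h, sem) for h in hosts))

results = asyncio.run(run_all([f"device{i}" for i in range(50)]))
```

Bounding concurrency matters: unbounded `gather` against hundreds of devices can exhaust control-node file descriptors or trip rate limits on device management planes.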
IaC (e.g., Terraform):
- State Backend Optimization: Use remote, performant state backends (e.g., S3, Azure Blob Storage, HashiCorp Consul) with appropriate locking.
- Modularization: Break down large configurations into smaller, manageable modules to reduce the blast radius and speed up plan/apply operations.
- Parallelism: Terraform typically runs operations in parallel by default; ensure your cloud provider limits aren’t causing throttling.
Capacity Planning for Automation Infrastructure:
- Monitor CPU, memory, disk I/O, and network utilization on your Ansible control node, Python automation servers, and CI/CD runners.
- Scale resources (CPU, RAM, network bandwidth) based on the size of your inventory, the complexity of your playbooks/scripts, and the frequency of deployments.
- Consider dedicated hardware or VMs for critical automation components.
2. Performance Metrics and Monitoring
- Automation Execution Time: Track the time taken for playbooks, scripts, and pipeline stages. Look for trends and spikes.
- API Response Times: Monitor the latency of API calls to network devices. High latency can indicate device overload or network issues.
- Network Device Resource Utilization: Track CPU, memory, and process utilization on network devices during automation runs. High utilization can lead to slower responses or even device instability.
- CI/CD Pipeline Duration: Monitor the overall execution time of your CI/CD pipelines. Identify bottlenecks in specific stages.
- Metrics Collection: Utilize tools like Prometheus and Grafana to collect and visualize these metrics over time. Integrate metrics into your CI/CD pipelines to automatically fail builds that exceed performance thresholds.
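The idea of failing builds that exceed performance thresholds can be prototyped with a small decorator that enforces a time budget on any automation stage. This is a generic sketch, not an API of any particular CI/CD tool.

```python
import time
from functools import wraps

def timed(threshold_s: float):
    """Raise if the wrapped stage exceeds its time budget,
    mirroring a CI/CD performance gate."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            if elapsed > threshold_s:
                raise RuntimeError(
                    f"{fn.__name__} took {elapsed:.2f}s, budget {threshold_s}s"
                )
            return result
        return wrapper
    return decorator
```

In a real pipeline the elapsed time would also be exported to Prometheus (or similar) so that trends are visible before the hard threshold is hit.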
3. Monitoring Recommendations
- Centralized Logging: Aggregate logs from all automation tools, CI/CD platforms, and network devices into a central logging system (e.g., ELK Stack, Splunk, Graylog). This allows for quick correlation of events during troubleshooting.
- Alerting: Configure alerts for:
- Failed automation jobs.
- High API response times.
- Unusual device resource utilization during or after automation.
- Configuration drift detected by monitoring tools.
- Distributed Tracing: For complex microservices-based automation, consider distributed tracing (e.g., Jaeger, Zipkin) to visualize the flow of requests and identify performance bottlenecks across multiple services.
Hands-On Lab: Troubleshooting a Failed NTP Deployment
This lab simulates a common NetDevOps scenario: a failed configuration deployment, requiring you to identify the root cause using automation tools and device verification.
Lab Topology
nwdiag {
network automation_lab_net {
address = "10.0.0.0/24"
automation_host [address = "10.0.0.10", description = "Ansible/Python"];
}
network mgmt_net {
address = "192.168.10.0/24"
automation_host;
cisco_rtr [address = "192.168.10.1"];
juniper_sw [address = "192.168.10.2"];
}
}
Figure 13.9: Lab Topology for NTP Troubleshooting
Objectives
- Attempt an automated NTP server deployment to `cisco_rtr` and `juniper_sw`.
- Observe the automation failure.
- Utilize Ansible’s debug output and manual verification to identify the root cause.
- Correct the issue in the playbook/inventory.
- Successfully redeploy the NTP configuration.
Step-by-Step Configuration
Prerequisites:
- An Ansible control node (the `automation_host`) with Python, `ansible` (core plus the `cisco.ios` and `junipernetworks.junos` collections), and `napalm` installed.
- Two network devices: one Cisco IOS-XE router (`cisco_rtr`) and one Juniper JunOS switch (`juniper_sw`), accessible via SSH from `automation_host`.
- Automation user `automation_user` with password `automation_password!` configured on both devices with privilege 15/superuser access.
- NETCONF over SSH enabled on both devices (refer to previous configuration examples).
1. Initial Setup on automation_host:
`inventory.ini`:

```ini
[network_devices]
cisco_rtr ansible_host=192.168.10.1 ansible_network_os=iosxe ansible_user=automation_user ansible_password=automation_password! ansible_connection=network_cli
juniper_sw ansible_host=192.168.10.2 ansible_network_os=junos ansible_user=automation_user ansible_password=automation_password! ansible_connection=network_cli
```

`ntp_deploy.yaml` (intentionally buggy):

```yaml
---
- name: Deploy NTP Servers
  hosts: network_devices
  gather_facts: false
  vars:
    ntp_servers:
      - 10.0.0.250
      - 10.0.0.251
  tasks:
    - name: Configure NTP for Cisco IOS-XE
      when: ansible_network_os == 'iosxe'
      cisco.ios.ios_config:
        lines:
          - "ntp server {{ item }} prefer"  # Bug: 'prefer' applied to every server
        parents: []
        diff_against: running
        match: none
      loop: "{{ ntp_servers }}"
      register: cisco_ntp_result

    - name: Configure NTP for Juniper JunOS
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_config:
        lines:
          - "set system ntp server {{ item }} authentication-key 10"  # Bug: key 10 is never defined
        comment: "Configure NTP servers"
      loop: "{{ ntp_servers }}"
      register: juniper_ntp_result
```
2. Attempt Deployment and Observe Failure:
- Execute the playbook: `ansible-playbook -i inventory.ini ntp_deploy.yaml -vvv`
- Expected Output: You will see failures for both Cisco and Juniper.
  - Cisco will likely complain about invalid syntax near `prefer` when multiple servers are passed, or similar parsing issues.
  - Juniper will complain about `authentication-key 10` being used without a defined key, or other syntax errors.
3. Identify Root Cause (Troubleshooting Steps):
- Review Automation Logs: The `-vvv` flag for Ansible provides verbose output. Look for specific error messages returned by the `ios_config` and `junos_config` modules. These usually contain the device's exact CLI error or API error.
  - For Cisco, you might see something like `% Invalid input detected at '^' marker.` or similar.
  - For Juniper, it might be a `syntax error` related to `authentication-key`.
- Inspect Inputs/Variables: Verify that the `ntp_servers` variable is correctly defined. (In this case it is; the problem is in how it is used.)
- Check Connectivity: Use `ansible -m ping -i inventory.ini all` to confirm SSH connectivity. (This should pass.)
- Validate Syntax (Mental/Manual):
  - For Cisco: Can you manually configure `ntp server 10.0.0.250 prefer` and then `ntp server 10.0.0.251 prefer`? No; `prefer` is typically set on a single primary server. The general command is `ntp server <IP>`.
  - For Juniper: Can you manually configure `set system ntp server 10.0.0.250 authentication-key 10`? This requires the key to be defined first. Without it, the command is invalid.
- Manually Verify Device Config/State: SSH into `cisco_rtr` and `juniper_sw`.
  - `cisco_rtr`: `show running-config | section ntp`
  - `juniper_sw`: `show configuration system ntp | display set`
  - (You'll see no changes, confirming the automation failed to apply.)
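The manual config check above can also be scripted: a small parser over the `show running-config | section ntp` output tells you at a glance which servers actually landed and which one carries `prefer`. The sample output below is an assumed illustration of what a successful deployment would return.

```python
import re

# Assumed sample of `show running-config | section ntp` after a good deploy
SAMPLE = """\
ntp server 10.0.0.250 prefer
ntp server 10.0.0.251
"""

def parse_ntp_servers(running_config: str) -> dict[str, bool]:
    """Map each configured NTP server to whether it is marked 'prefer'."""
    servers: dict[str, bool] = {}
    for m in re.finditer(r"^ntp server (\S+)( prefer)?$", running_config, re.M):
        servers[m.group(1)] = bool(m.group(2))
    return servers
```

A check like this is the seed of an automated post-deployment validation step: compare the parsed result against the playbook's `ntp_servers` variable and fail the pipeline on any mismatch.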
Root Cause Analysis: The playbook contains incorrect configuration syntax for both Cisco and Juniper that does not align with device capabilities. The `prefer` keyword cannot be applied to every NTP server in the manner attempted on Cisco, and `authentication-key` requires prior key definition on Juniper.
4. Correct the Issues:
- Modify `ntp_deploy.yaml` to use correct syntax:

```yaml
---
- name: Deploy NTP Servers
  hosts: network_devices
  gather_facts: false
  vars:
    ntp_servers:
      - 10.0.0.250
      - 10.0.0.251  # Secondary server; 'prefer' is set only on the first
  tasks:
    - name: Configure NTP for Cisco IOS-XE
      when: ansible_network_os == 'iosxe'
      cisco.ios.ios_config:
        lines:
          - "ntp server {{ ntp_servers[0] }} prefer"  # Only one prefer
          - "ntp server {{ ntp_servers[1] }}"         # Second server without prefer
        diff_against: running
        match: none
      register: cisco_ntp_result

    - name: Configure NTP for Juniper JunOS
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_config:
        lines:
          - "set system ntp server {{ ntp_servers[0] }}"  # Removed problematic authentication-key
          - "set system ntp server {{ ntp_servers[1] }}"
        comment: "Configure NTP servers"
      register: juniper_ntp_result
```
5. Successfully Redeploy:
- Execute the corrected playbook: `ansible-playbook -i inventory.ini ntp_deploy.yaml -vvv`
- Expected Output: The playbook should now run successfully, reporting `changed` for the first run and `ok` on subsequent runs (idempotency).
Verification Steps:
- On `cisco_rtr`:
  - `show ntp associations`
  - `show running-config | section ntp`
  - Expected: Both NTP servers configured, one with `prefer`.
- On `juniper_sw`:
  - `show ntp status`
  - `show configuration system ntp | display set`
  - Expected: Both NTP servers configured.
Challenge Exercises:
- Modify the playbook to dynamically determine the `prefer` server for Cisco based on a variable.
- Add a `napalm` `get_ntp_peers` check (similar to the Python script earlier) to the Ansible playbook after configuration to verify the operational state of NTP.
- Implement a simple rollback mechanism (e.g., using `rollback 1` for Juniper or `archive` for Cisco) in an Ansible handler, triggered if the post-deployment check fails.
Best Practices Checklist
Adhering to these best practices will significantly improve the reliability, security, and maintainability of your NetDevOps initiatives.
[x] Configuration Best Practices
- Infrastructure as Code (IaC): Treat network configurations as code, storing them in a Version Control System (VCS) like Git.
- Idempotency: Design all automation to be idempotent. Running a script multiple times should yield the same result without unintended side effects.
- Desired State Configuration (DSC): Focus on defining the desired state rather than a sequence of commands. Let tools like Ansible, Terraform, or Nornir manage the transition.
- Modularity and Reusability: Break down playbooks, scripts, and IaC into smaller, reusable components (e.g., Ansible roles, Python modules, Terraform modules).
- Single Source of Truth (SOT): Implement a SOT (e.g., NetBox, Nautobot) for all network inventory, IP addressing, and device parameters. Avoid hardcoding.
- Templating: Use templating engines (Jinja2) for dynamic configuration generation, keeping configurations DRY (Don’t Repeat Yourself).
- Dry Runs/Check Mode: Always perform dry runs or use `check_mode` (Ansible) before applying changes to production networks.
- Small, Atomic Changes: Apply changes in small, logical, and atomic units. This minimizes blast radius and simplifies troubleshooting.
- Multi-Vendor Abstraction: Leverage tools and libraries that abstract away vendor-specific CLI/API differences (e.g., NAPALM, Ansible network modules, OpenConfig YANG models).
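The Idempotency item in this checklist can be illustrated with a minimal sketch: an "ensure lines present" merge that reports whether anything actually changed. Running it a second time against its own output is a no-op, which is exactly the property idempotent automation must have. The function name is illustrative.

```python
def ensure_lines(config: list[str], desired: list[str]) -> tuple[list[str], bool]:
    """Idempotent merge: append only the desired lines that are missing,
    and report whether the config actually changed."""
    changed = False
    result = list(config)
    for line in desired:
        if line not in result:
            result.append(line)
            changed = True
    return result, changed
```

This mirrors how `ios_config`-style modules report `changed`: the second run finds nothing to do, so monitoring can treat any unexpected `changed=True` on a re-run as configuration drift.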
[x] Security Hardening
- Secrets Management: Store all credentials, API keys, and sensitive data in a dedicated secrets manager (Ansible Vault, HashiCorp Vault). Never hardcode them.
- Least Privilege: Grant automation accounts only the minimum necessary permissions on network devices and automation platforms.
- Secure Communications: Always use encrypted protocols (SSH, HTTPS/TLS) for device interaction. Validate certificates where applicable.
- Access Control: Implement strict Role-Based Access Control (RBAC) for your VCS, CI/CD platform, and automation tools.
- Code Security Scanning (SAST): Integrate static analysis tools into your CI/CD pipeline to scan automation code for vulnerabilities.
- Audit Logging: Ensure comprehensive logging and auditing are enabled on network devices and automation tools to track all changes.
- Network Segmentation: Isolate automation infrastructure within a secure network segment.
[x] Monitoring Setup
- Continuous Monitoring: Implement continuous monitoring of network device state, configuration, and performance.
- Centralized Logging: Aggregate all logs from automation, CI/CD, and network devices into a central platform.
- Alerting: Configure alerts for configuration drift, automation failures, performance degradations, and security events.
- Telemetry: Leverage streaming telemetry (gNMI, model-driven telemetry) for real-time insights into network state.
[x] Documentation
- Clear Readme Files: Provide comprehensive `README.md` files for each repository, explaining its purpose, how to use it, dependencies, and expected outcomes.
- Code Comments: Comment your automation code adequately, explaining complex logic or non-obvious design decisions.
- Runbooks: Create runbooks for common operational tasks, including troubleshooting guides for known issues.
- Network Diagrams as Code: Maintain network diagrams using tools like PlantUML, nwdiag, Graphviz, or D2 within your VCS, alongside your IaC.
[x] Change Management
- CI/CD Pipeline Integration: Integrate automation fully into a CI/CD pipeline for automated testing, validation, and deployment.
- Approval Workflows: Implement human approval steps in the pipeline for critical deployments or changes to sensitive network segments.
- Automated Testing: Develop robust unit, integration, and end-to-end tests for all automation.
- Rollback Strategy: Plan for rollback. Ensure you have a clear, tested strategy to revert to a known good state if a deployment fails or causes issues.
- Post-Mortem Analysis: Conduct post-mortems for all significant incidents or failed deployments to learn and improve processes.
Reference Links
- NETCONF Protocol: RFC 6241, RFC 6242 (SSH)
- RESTCONF Protocol: RFC 8040
- YANG Data Modeling Language: RFC 7950, RFC 7951 (JSON Encoding)
- gNMI Specification: OpenConfig gNMI Repository
- Cisco DevNet: Cisco Network Automation Resources
- Juniper Automation: Juniper Automation Documentation
- Ansible Network Automation: Ansible Documentation
- NAPALM: NAPALM Documentation
- Nornir: Nornir Documentation
- Python for Network Engineers: Network to Code Resources
- Blockdiag Suite (nwdiag, packetdiag): Official Documentation
- Graphviz: DOT Language Documentation
- PlantUML: PlantUML Official Site
- D2: D2 Official Site
What’s Next
This chapter has provided you with a robust framework for troubleshooting complex NetDevOps environments and established essential best practices for building secure, reliable, and high-performing automation solutions. You’ve learned to approach problems systematically, leverage tool-specific debugging, and enforce proactive security and operational hygiene.
In the next chapter, we will shift our focus to Advanced NetDevOps Integrations and the Future of Network Automation. We will explore topics such as integrating with IT Service Management (ITSM) systems, advanced CI/CD patterns like progressive rollouts and canary deployments, serverless functions for network operations, and emerging technologies that will shape the future of NetDevOps, including AI/ML for intent-based networking and self-healing networks. Get ready to explore the cutting edge of network automation!