Introduction
In the rapidly evolving landscape of modern networks, simply knowing that a device is “up” is no longer sufficient. Network engineers leveraging NetDevOps principles require deep, real-time insights into network state, performance, and behavior to proactively identify issues, optimize resources, and ensure application experience. This is where network monitoring, observability, and telemetry become paramount.
This chapter delves into the critical role of modern monitoring and observability in a NetDevOps ecosystem. We’ll explore the shift from traditional pull-based monitoring (like SNMP and Syslog) to advanced push-based streaming telemetry using protocols such as NETCONF, RESTCONF, gRPC, and gNMI, alongside standardized data models like YANG and OpenConfig. You’ll learn how to implement and automate these solutions across multi-vendor networks using Ansible and Python, integrating them into comprehensive observability platforms.
After completing this chapter, you will be able to:
- Differentiate between traditional monitoring, modern telemetry, and network observability.
- Understand the architecture and benefits of streaming telemetry using NETCONF/YANG, RESTCONF, gRPC, and gNMI.
- Configure and verify streaming telemetry subscriptions on Cisco, Juniper, and Arista devices.
- Utilize Ansible and Python to automate the configuration and collection of telemetry data.
- Design a basic network observability architecture incorporating collectors, time-series databases, and visualization tools.
- Identify and mitigate security risks associated with advanced monitoring solutions.
- Apply best practices for performance optimization and troubleshooting in telemetry-driven environments.
Technical Concepts
The journey from traditional monitoring to full network observability involves a fundamental shift in how network state information is collected, processed, and analyzed.
Traditional Monitoring vs. Modern Observability
Traditional Monitoring often relies on a “pull” model, where a monitoring system periodically queries network devices for specific metrics. Key technologies include:
- SNMP (Simple Network Management Protocol): A widely used application-layer protocol for managing and monitoring network devices. It uses agents on devices to collect data and a manager to query them. While ubiquitous, it can be chatty, less granular, and often lacks real-time capabilities. (Refer to RFC 3411-3418 for SNMPv3 standards).
- Syslog: A standard for message logging, allowing network devices to send event notifications (e.g., link up/down, error messages) to a central server. Excellent for event correlation but doesn’t provide granular metric data. (Refer to RFC 5424 for Syslog Protocol).
- NetFlow/IPFIX (IP Flow Information Export): Provides data on IP traffic flows, enabling analysis of traffic patterns, bandwidth usage, and security incidents. It is flow-based rather than packet-based, offering aggregates. IPFIX is the IETF standard derived from NetFlow (RFC 7011, which obsoletes the original RFC 5101).
Modern Telemetry and Observability adopt a “push” model, where network devices actively stream highly granular, structured data to collectors in near real-time. This shift is driven by:
- Structured Data: Using data models like YANG for consistent, machine-readable data.
- High Granularity: Sub-second data collection, crucial for dynamic network behavior.
- Real-time Insights: Enables faster detection and response to anomalies.
- Reduced Polling Overhead: Devices push data when changes occur or at set intervals.
Observability goes beyond mere monitoring. While monitoring tells you if a system is working, observability helps you understand why it’s not working, or why its performance has changed. It involves collecting diverse data types (metrics, logs, traces) to build a comprehensive understanding of system behavior from external outputs.
Network Observability Architecture
A typical network observability architecture consists of several key components:
- Network Devices: The source of telemetry data.
- Telemetry Agents: Software running on devices (or built-in) responsible for collecting raw data and formatting it according to a data model.
- Telemetry Collectors: Software systems that receive, parse, and often buffer the high volume of streaming data from multiple devices. Examples include Telegraf, OpenNMS, and custom Python scripts.
- Time-Series Database (TSDB): Optimized for storing time-stamped data, allowing efficient querying and analysis of metrics over time. Examples include Prometheus, InfluxDB, VictoriaMetrics.
- Data Processing & Analytics: Tools that can enrich, filter, aggregate, and analyze the collected data.
- Visualization & Alerting: Dashboards (e.g., Grafana) to visualize trends and anomalies, and alerting mechanisms to notify engineers of critical events.
Let’s visualize this architecture:
@startuml
skinparam handwritten true
skinparam style strict
cloud "Internet/WAN" as WAN
package "Network Infrastructure" {
node "Core Router 1 (Cisco IOS XE)" as CR1
node "Aggregation Switch 1 (Juniper JunOS)" as AS1
node "Leaf Switch 1 (Arista EOS)" as LS1
}
package "Observability Platform" {
cloud "Telemetry Collectors" as Collectors {
component "gRPC Collector (e.g., Telegraf)" as GRPCC
component "NETCONF/RESTCONF Listener (e.g., Python app)" as NETCONFC
component "SNMP Manager" as SNMPM
component "Syslog Server" as SYSLOGS
}
database "Time-Series Database (TSDB)" as TSDB {
folder "Prometheus"
folder "InfluxDB"
}
node "Visualization & Alerting" as Viz {
artifact "Grafana Dashboards"
artifact "Alert Manager"
}
}
CR1 -[hidden] AS1
AS1 -[hidden] LS1
CR1 -up-> GRPCC : gRPC Streaming Telemetry (YANG)
AS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
LS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
CR1 -up-> NETCONFC : NETCONF/RESTCONF (YANG)
AS1 -up-> NETCONFC : NETCONF (YANG)
LS1 -up-> NETCONFC : RESTCONF (YANG/eAPI)
CR1 -up-> SNMPM : SNMP Traps/Polls
AS1 -up-> SNMPM : SNMP Traps/Polls
LS1 -up-> SNMPM : SNMP Traps/Polls
CR1 -up-> SYSLOGS : Syslog Events
AS1 -up-> SYSLOGS : Syslog Events
LS1 -up-> SYSLOGS : Syslog Events
GRPCC --> TSDB : Store Metrics
NETCONFC --> TSDB : Store Metrics
SNMPM --> TSDB : Store Metrics
SYSLOGS --> TSDB : Store Logs/Metrics
TSDB --> Viz : Query Data
Viz .down.> "Network Operations Center (NOC)" as NOC : Alerts/Dashboards
@enduml
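To make the collector-to-TSDB handoff in this architecture concrete, here is a minimal, illustrative sketch of formatting one metric sample as InfluxDB line protocol. The measurement, tag, and field names are invented for the example, not part of any vendor schema.

```python
# Minimal sketch: render a telemetry sample as InfluxDB line protocol
# (measurement,tag=value field=value timestamp).

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one time-series point in InfluxDB line protocol."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

point = to_line_protocol(
    measurement="interface_counters",
    tags={"device": "CR1", "interface": "GigabitEthernet1"},
    fields={"in_octets": 1234567, "out_octets": 7654321},
    timestamp_ns=1700000000000000000,
)
print(point)
# interface_counters,device=CR1,interface=GigabitEthernet1 in_octets=1234567i,out_octets=7654321i 1700000000000000000
```

A real collector would batch such points and POST them to the TSDB's write endpoint rather than printing them.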
Streaming Telemetry Protocols
Streaming telemetry relies on modern, standardized protocols for efficient, structured data transfer.
NETCONF/YANG
- NETCONF (Network Configuration Protocol): An XML-based protocol designed for configuring and managing network devices. While its primary role is configuration, it can also be used to retrieve operational state data. It operates over secure transport mechanisms like SSH or TLS.
- RFC 6241: Network Configuration Protocol (NETCONF)
- YANG (Yet Another Next Generation): A data modeling language used to define the structure and content of configuration and state data for network devices. YANG models provide a formal, machine-readable schema for both configuration and operational data, enabling multi-vendor interoperability.
- RFC 7950: The YANG 1.1 Data Modeling Language
NETCONF can be used for “pulling” operational state data from devices, similar to SNMP, but with the advantage of structured YANG data.
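As a sketch of this pull model, the following uses the ncclient library to issue a NETCONF get with a subtree filter. The device address and credentials are placeholders; the ietf-interfaces namespace comes from the standard model (RFC 8343), though support varies by platform.

```python
# Sketch: pulling YANG-modeled operational state over NETCONF with ncclient.
# Host, credentials, and the filtered container are placeholders.

def subtree_filter(container, namespace):
    """Build an ncclient subtree filter for one top-level container."""
    return ("subtree", f'<{container} xmlns="{namespace}"/>')

def fetch_interface_state(host, username, password, port=830):
    # Imported here so subtree_filter() stays dependency-free.
    from ncclient import manager  # pip install ncclient
    flt = subtree_filter(
        "interfaces-state", "urn:ietf:params:xml:ns:yang:ietf-interfaces"
    )
    with manager.connect(
        host=host, port=port, username=username, password=password,
        hostkey_verify=False,  # lab only; verify host keys in production
    ) as m:
        return m.get(filter=flt).data_xml  # structured, YANG-modeled XML

if __name__ == "__main__":
    print(fetch_interface_state("192.168.10.1", "admin", "password"))
```

The same session object can issue get-config and edit-config RPCs, which is why NETCONF doubles as both a configuration and a state-retrieval channel.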
RESTCONF/YANG
- RESTCONF: A REST-like protocol that uses HTTP(S) to provide a programmatic interface for interacting with network devices. It exposes the YANG data model as a resource tree, allowing clients to perform CRUD (Create, Read, Update, Delete) operations.
- RFC 8040: RESTCONF Protocol
RESTCONF offers a more web-friendly approach to access YANG-modeled data, which can be useful for integration with web applications and scripting.
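As a sketch of that web-friendly access, the following reads YANG-modeled data over RESTCONF with a plain HTTP GET. The device address, credentials, and the ietf-interfaces path are placeholders; actual support depends on the YANG modules your platform advertises.

```python
# Sketch: reading YANG-modeled data over RESTCONF (RFC 8040).

RESTCONF_HEADERS = {"Accept": "application/yang-data+json"}

def restconf_url(host, path):
    """Build a RESTCONF data-resource URL under the /restconf/data root."""
    return f"https://{host}/restconf/data/{path}"

def get_interfaces(host, username, password):
    # Imported here so restconf_url() stays dependency-free.
    import requests  # pip install requests
    resp = requests.get(
        restconf_url(host, "ietf-interfaces:interfaces"),
        headers=RESTCONF_HEADERS,
        auth=(username, password),
        verify=False,  # lab only; use a trusted CA bundle in production
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(get_interfaces("192.168.10.1", "admin", "password"))
```

Because the resource tree mirrors the YANG model, the same URL pattern works for any supported module once you know its module name and path.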
gRPC and gNMI
- gRPC (Google Remote Procedure Call): A high-performance, open-source RPC framework that can run in any environment. It uses Protocol Buffers (Protobuf) as its Interface Definition Language (IDL) for defining service methods and message structures. gRPC is efficient due to its binary message format and use of HTTP/2.
- gNMI (gRPC Network Management Interface): A specification developed within the OpenConfig working group (led by Google) that defines a gRPC-based service for network management, including streaming telemetry. It allows clients to subscribe to specific data paths (defined by YANG/OpenConfig models) and receive updates.
gRPC and gNMI are the preferred methods for high-volume, low-latency streaming telemetry due to their efficiency.
Protocol Flow: gRPC Streaming Telemetry
digraph gRPC_Telemetry {
rankdir=LR;
node [shape=box];
Client [label="Telemetry Collector (gNMI Client)"];
Device [label="Network Device (gNMI Server)"];
subgraph cluster_0 {
label="Subscription Request (Client to Device)";
style=filled;
color=lightgrey;
Client -> Device [label="Establish gRPC Channel\n(TLS Encrypted)"];
Device -> Client [label="Channel Acknowledged"];
Client -> Device [label="gNMI::SubscribeRequest\n(Path, Mode: periodic/on-change)"];
}
subgraph cluster_1 {
label="Data Stream (Device to Client)";
style=filled;
color=lightblue;
Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
Device -> Client [label="... Continuous Stream ..."];
}
}
Conceptual Packet Structure: gRPC Telemetry (Simplified)
A gRPC packet, particularly over HTTP/2, is complex. Here’s a simplified view focusing on the payload within the context of a gNMI SubscribeResponse carrying an OpenConfig interface counter update.
packetdiag {
colwidth = 32
node_height = 72
// Offsets are in bits; each row is 32 bits (colwidth = 32)
// Ethernet Header (14 bytes)
0-47: Dest MAC (6 bytes)
48-95: Source MAC (6 bytes)
96-111: EtherType (0x0800 for IPv4)
// IP Header (IPv4, 20 bytes)
112-119: Version (4), IHL (5)
120-127: DSCP, ECN
128-143: Total Length
144-159: Identification
160-175: Flags, Fragment Offset
176-183: TTL
184-191: Protocol (6 for TCP)
192-207: Header Checksum
208-239: Source IP Address
240-271: Destination IP Address
// TCP Header (20 bytes)
272-287: Source Port (e.g., device ephemeral)
288-303: Destination Port (e.g., 50051 for gRPC)
304-335: Sequence Number
336-367: Acknowledgment Number
368-383: Data Offset, Reserved, Flags (SYN, ACK, PSH, etc.)
384-399: Window Size
400-415: Checksum
416-431: Urgent Pointer
// HTTP/2 Frame Header (9 bytes; gRPC multiplexes on streams)
432-455: Length (3 bytes)
456-463: Type (e.g., DATA)
464-471: Flags
472-503: Reserved (1 bit), Stream Identifier (31 bits)
// gRPC Message Header (5 bytes)
504-511: Compressed Flag
512-543: Message Length
// gNMI SubscribeResponse (Protobuf encoded, variable length)
544-575: timestamp (uint64), prefix (Path)
576-607: update list: Path (e.g., /interfaces/interface[name=GigabitEthernet1]/state/counters) + TypedValue pairs
608-639: ... remaining Protobuf fields ...
}
Data Models (OpenConfig and Vendor-Native YANG)
YANG Data Models are crucial for streaming telemetry. They define the structure, syntax, and semantics of data.
- Vendor-Native YANG Models: Provided by device vendors (e.g., Cisco, Juniper, Arista) and offer granular access to device-specific features and operational data. Examples: Cisco-IOS-XE-interfaces-oper.yang, juniper-smi.yang. You can explore these with Cisco DevNet's YANG Suite (developer.cisco.com/yangsuite).
- OpenConfig: An industry-wide initiative to define a common set of vendor-neutral YANG data models for network configuration and operational state. Its goal is to provide a unified approach to managing multi-vendor networks. Using OpenConfig models simplifies automation and monitoring across diverse hardware. (Learn more at openconfig.net).
The use of YANG models, especially OpenConfig, is a cornerstone of effective multi-vendor NetDevOps.
Collector Architectures
Collectors are vital for handling the ingestion of telemetry data. They typically perform several functions:
- Ingestion: Receive data via gRPC, NETCONF, SNMP, etc.
- Parsing: Decode Protobuf/JSON/XML payloads into usable metrics.
- Tagging/Labeling: Add metadata (e.g., device hostname, interface name) to metrics for easier querying.
- Buffering: Temporarily store data before writing to a TSDB.
- Forwarding: Send processed metrics to a TSDB.
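The parsing and tagging steps above can be sketched as a small transform. The input shape below mirrors a JSON-decoded gNMI update but is simplified for illustration; the path and key names are invented for the example.

```python
# Sketch of a collector's parse-and-tag step: flatten one decoded gNMI-style
# update into (metric_name, tags, value) tuples.

def flatten_update(device, prefix, updates):
    """Yield (metric, tags, value) for each leaf in a telemetry update."""
    for upd in updates:
        full_path = f"{prefix}/{upd['path']}"
        tags = {"device": device}
        clean_parts = []
        # Pull list keys like [name=Gi1] out of the path and into tags.
        for part in full_path.strip("/").split("/"):
            if "[" in part:
                elem, _, key = part.partition("[")
                k, _, v = key.rstrip("]").partition("=")
                tags[k] = v
                clean_parts.append(elem)
            else:
                clean_parts.append(part)
        metric = "_".join(clean_parts).replace("-", "_")
        yield metric, tags, upd["value"]

metrics = list(flatten_update(
    device="CR1",
    prefix="/interfaces/interface[name=Gi1]/state/counters",
    updates=[{"path": "in-octets", "value": 1000}],
))
print(metrics)
```

Moving list keys into tags is what makes the resulting series queryable per interface in the TSDB rather than buried in a long metric name.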
Popular open-source collector solutions include:
- Telegraf: A plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors to various output plugins (including Prometheus, InfluxDB). Excellent for gRPC telemetry.
- Prometheus Exporters: Prometheus itself scrapes metrics over HTTP; while node_exporter targets host metrics, exporters such as snmp_exporter expose network-device data to Prometheus.
- Custom Python Applications: For highly specific use cases, a Python script can act as a gNMI client to subscribe, parse, and store data.
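As a concrete example of the collector layer, the following Telegraf configuration sketch uses Telegraf's gnmi input plugin to subscribe and its influxdb_v2 output plugin to store metrics. Addresses, credentials, the organization/bucket names, and the token variable are placeholders.

```toml
# Illustrative Telegraf configuration: subscribe via gNMI, write to InfluxDB v2.
[[inputs.gnmi]]
  addresses = ["192.168.10.1:50051"]
  username = "admin"
  password = "password"
  encoding = "proto"
  redial = "10s"

  [[inputs.gnmi.subscription]]
    name = "intf_counters"
    origin = "openconfig-interfaces"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

[[outputs.influxdb_v2]]
  urls = ["http://192.168.10.11:8086"]
  token = "$INFLUX_TOKEN"
  organization = "netops"
  bucket = "telemetry"
```

Because Telegraf is plugin-driven, adding SNMP or Syslog ingestion to the same agent is a matter of enabling additional input plugins.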
Configuration Examples (Multi-vendor)
Here, we’ll demonstrate configuring streaming telemetry (gRPC/gNMI and NETCONF/RESTCONF) on Cisco, Juniper, and Arista devices.
Cisco IOS XE/XR (gRPC Streaming Telemetry)
This example configures a periodic gRPC subscription to stream interface statistics. Model-driven telemetry CLI differs between IOS XE and IOS XR releases; treat the structure below (destination group, sensor group, subscription) as representative and verify the exact syntax for your platform.
! Configure NETCONF/RESTCONF for management access (often a prerequisite for gNMI)
! Enable NETCONF SSH transport
netconf-yang
! Enable RESTCONF (served over HTTPS with local authentication)
restconf
ip http secure-server
ip http authentication local
! Configure gNMI/gRPC telemetry
! Define the telemetry destination (collector)
telemetry ietf
destination-group TELEMETRY_COLLECTOR
address 192.168.10.10 port 50051
protocol grpc tls-enable
encoding encode-kvgpb
profile TELEMETRY_PROFILE ! Optional TLS profile when using client certificates
! A TLS profile typically references PKI trustpoints so the device can
! validate the collector's certificate (and present its own for mutual TLS).
! See the Security Configuration Example at the end of this chapter.
! Define a sensor group for the data we want to stream (e.g., interface operational state)
sensor-group INTERFACE_OPER_STATE
! Use an OpenConfig path for multi-vendor consistency
! For Cisco, verify specific YANG paths are supported
! Example: openconfig-interfaces:interfaces/interface/state
! Example: Cisco-IOS-XE-interfaces-oper:interfaces/interface/state
! Path examples:
! /interfaces/interface/state/counters
! /interfaces/interface[name='GigabitEthernet1']/state/counters
path openconfig-interfaces:interfaces/interface/state/counters
! Define a subscription that links the sensor group to the destination
subscription PERIODIC_INTF_COUNTERS
sensor-group INTERFACE_OPER_STATE sample-interval 10000 ! 10-second interval
destination-group TELEMETRY_COLLECTOR
stream cisco-push
update-policy periodic
Verification Commands:
show telemetry ietf subscription PERIODIC_INTF_COUNTERS
show telemetry ietf destination-group TELEMETRY_COLLECTOR
show telemetry ietf sensor-group INTERFACE_OPER_STATE
show telemetry ietf connection all
Expected Output (Snippet):
Router# show telemetry ietf subscription PERIODIC_INTF_COUNTERS
Subscription ID: 100
Type: Dynamic
State: Enabled
Source Address: 0.0.0.0
Source VRF: <default>
Stream: cisco-push
Update policy: periodic
Update interval: 10000 ms
Sensor Groups:
Sensor Group: INTERFACE_OPER_STATE (ID: 100)
Path: openconfig-interfaces:interfaces/interface/state/counters
Destination Groups: TELEMETRY_COLLECTOR
Address: 192.168.10.10:50051
Transport: grpc
Encoding: encode-kvgpb
Profile: TELEMETRY_PROFILE
TLS: Enabled
Router# show telemetry ietf connection all
Telemetry connection 0:
Peer Address: 192.168.10.10
Peer Port: 50051
Local Address: 10.0.0.1
Local Port: 54321
State: Connected
Profile Name: TELEMETRY_PROFILE
Subscriptions: 100
Juniper JunOS (gRPC Streaming Telemetry)
This example configures gRPC streaming telemetry for interface statistics using OpenConfig models.
# Enable gRPC and specify its listening port
# (clear-text is for lab use only; prefer gRPC over SSL with certificates in production)
set services extension-service request-response grpc clear-text port 50051
# Configure a streaming telemetry sensor (data provider)
set services analytics sensor SENSOR_INTF_STATS
set services analytics sensor SENSOR_INTF_STATS description "Interface Stats"
# Specify the resource paths to collect: a Junos-native path and an
# OpenConfig path (supported paths vary by platform and release)
set services analytics sensor SENSOR_INTF_STATS resource "/junos/system/linecard/interface/"
set services analytics sensor SENSOR_INTF_STATS resource "/interfaces/interface/state/"
# Configure a streaming telemetry export profile (collector destination and frequency)
set services analytics export-profile EXPORT_INTF_STATS
set services analytics export-profile EXPORT_INTF_STATS reporting-period 10 # 10 seconds
set services analytics export-profile EXPORT_INTF_STATS format gpb
set services analytics export-profile EXPORT_INTF_STATS transport grpc
set services analytics export-profile EXPORT_INTF_STATS target-address 192.168.10.10
set services analytics export-profile EXPORT_INTF_STATS target-port 50051
# Apply the sensor and export profile to a rule
set services analytics rule RULE_INTF_STATS
set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS
set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS
# Commit the configuration
commit
Verification Commands:
show services analytics status
show services analytics client
show services analytics export-profile EXPORT_INTF_STATS
Expected Output (Snippet):
user@juniper> show services analytics status
Extension service status:
Current status: Running
Enabled on: FPC0
GRPC enabled: Yes, port 50051
user@juniper> show services analytics client
...
Client information:
Name: EXPORT_INTF_STATS
Address: 192.168.10.10, Port: 50051
Protocol: grpc
Sensor name: SENSOR_INTF_STATS
Reporting period: 10s
State: Connected
...
Arista EOS (gRPC Streaming Telemetry)
Arista EOS exposes OpenConfig paths natively over gRPC. The dial-out style configuration below is representative; depending on the EOS release, streaming may instead be enabled through the TerminAttr agent or the management api gnmi CLI, so verify the syntax for your platform.
! Enable eAPI (for API-driven management and generally good practice)
management api http-commands
no shutdown
protocol https
vrf default
! Configure gRPC telemetry
! Define the telemetry receiver (collector)
telemetry
destination 192.168.10.10:50051
protocol gRPC
encoding GPB
tls profile TELEMETRY_TLS_PROFILE ! Optional TLS profile
source-interface Management1
! Define a sensor group (path to collect)
sensor-group INTERFACE_COUNTERS
path /Sysdb/interface/counters
path /interfaces/interface/state/statistics ! OpenConfig path for counters
! Define a subscription to push data from the sensor group to the destination
stream INTERFACE_STREAM
sensor-group INTERFACE_COUNTERS
destination 192.168.10.10:50051
interval 10000 ! 10 seconds
Verification Commands:
show telemetry
show telemetry destination 192.168.10.10:50051
show telemetry stream INTERFACE_STREAM
Expected Output (Snippet):
Arista# show telemetry
Telemetry Receiver State:
Receiver: 192.168.10.10:50051
Protocol: gRPC
Encoding: GPB
State: Active
Source-interface: Management1
Telemetry Streams:
Stream: INTERFACE_STREAM
Sensor Group: INTERFACE_COUNTERS
Destination: 192.168.10.10:50051
Interval: 10000 ms
Last Push: 00:00:02 ago
Push Count: 1234
Status: OK
Network Diagrams
Diagrams are essential for visualizing complex network concepts.
Network Topology: Telemetry Lab Setup (nwdiag)
nwdiag {
network core_network {
address = "10.0.0.0/24"
description = "Core Network Segment"
CR1 [address = "10.0.0.1"];
AS1 [address = "10.0.0.2"];
}
network mgmt_network {
address = "192.168.10.0/24"
description = "Management & Telemetry Network"
CR1 [address = "192.168.10.1"];
AS1 [address = "192.168.10.2"];
LS1 [address = "192.168.10.3"];
COLLECTOR [address = "192.168.10.10", description = "Telemetry Collector"];
TSDB [address = "192.168.10.11", description = "Time-Series DB"];
GRAFANA [address = "192.168.10.12", description = "Grafana / Visualization"];
}
// Connections implicitly defined by shared networks
CR1 -- AS1; // Represents logical connection in core_network
CR1 -- COLLECTOR; // Represents logical connection in mgmt_network
AS1 -- COLLECTOR;
LS1 -- COLLECTOR;
COLLECTOR -- TSDB;
TSDB -- GRAFANA;
}
Data Flow for Observability Platform (plantuml)
@startuml
scale 1.5
cloud "Network Devices" as NetDevs {
component "Cisco IOS XE" as C_DEV
component "Juniper JunOS" as J_DEV
component "Arista EOS" as A_DEV
}
rectangle "Telemetry Collection Layer" {
component "gNMI Collector\n(e.g., Telegraf)" as GNMI_COLLECTOR
component "SNMP Poller\n(e.g., Prometheus)" as SNMP_POLLER
component "Syslog Aggregator\n(e.g., Logstash)" as SYSLOG_AGG
}
database "Data Storage" {
component "Time-Series DB\n(e.g., Prometheus DB, InfluxDB)" as TSDB
component "Log Storage\n(e.g., Elasticsearch)" as LOG_STORE
}
rectangle "Analysis & Visualization" {
component "Metrics Dashboards\n(e.g., Grafana)" as GRAFANA
component "Alerting Engine\n(e.g., Alertmanager)" as ALERTS
component "Log Analysis\n(e.g., Kibana)" as KIBANA
}
C_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
J_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
A_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
C_DEV -down-> SNMP_POLLER : SNMPv3
J_DEV -down-> SNMP_POLLER : SNMPv3
A_DEV -down-> SNMP_POLLER : SNMPv3
C_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
J_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
A_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
GNMI_COLLECTOR -down-> TSDB : Write Metrics
SNMP_POLLER -down-> TSDB : Write Metrics
SYSLOG_AGG -down-> LOG_STORE : Write Logs
TSDB -up-> GRAFANA : Query Metrics
LOG_STORE -up-> KIBANA : Query Logs
GRAFANA -right-> ALERTS : Trigger Alerts
ALERTS -up-> "NetOps Team" : Notifications
@enduml
Automation Examples
Automating the setup of telemetry and the consumption of data is central to NetDevOps.
Python: gNMI Client for Streaming Telemetry
This Python script demonstrates how to subscribe to gNMI telemetry data from a network device using the grpcio library together with Python bindings generated from the official gNMI protobuf definitions (gnmi.proto in the openconfig/gnmi repository, compiled with grpcio-tools into gnmi_pb2 and gnmi_pb2_grpc). Treat it as a sketch: the device details, credentials, and subscribed path are placeholders to adapt for your environment.
# Prerequisites (module names assume you compiled gnmi.proto from the
# openconfig/gnmi repository with grpcio-tools):
#   pip install grpcio grpcio-tools
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. gnmi.proto
import time

import grpc
from google.protobuf import json_format

import gnmi_pb2
import gnmi_pb2_grpc

# Device details (placeholders)
DEVICE_IP = "192.168.10.1"   # IP of your Cisco/Juniper/Arista device
DEVICE_PORT = 50051          # gNMI port, often 50051 or 57400
USERNAME = "admin"
PASSWORD = "password"

# Path to subscribe to (OpenConfig interface counters).
# Adjust based on your device's supported paths and configured sensor groups:
#   Cisco IOS XE:  openconfig-interfaces:interfaces/interface/state/counters
#   Juniper JunOS: /interfaces/interface/state/
#   Arista EOS:    /interfaces/interface/state/statistics
GNMI_PATH = "/interfaces/interface/state/statistics"


def build_path(path_str):
    """Convert a slash-separated path (with optional [key=value]) into a gNMI Path."""
    path = gnmi_pb2.Path()
    for part in path_str.strip("/").split("/"):
        elem = path.elem.add()
        if "[" in part and "]" in part:
            # Handle list keys, e.g. interface[name=GigabitEthernet1]
            elem.name, _, keys = part.partition("[")
            k, _, v = keys.rstrip("]").partition("=")
            elem.key[k] = v
        else:
            elem.name = part
    return path


def stream_telemetry():
    # An insecure channel is used for brevity; production deployments should
    # use grpc.secure_channel() with grpc.ssl_channel_credentials() built from
    # the collector's trusted CA (and client certificates for mutual TLS).
    channel = grpc.insecure_channel(f"{DEVICE_IP}:{DEVICE_PORT}")
    stub = gnmi_pb2_grpc.gNMIStub(channel)

    request = gnmi_pb2.SubscribeRequest()
    sub_list = request.subscribe
    sub_list.mode = gnmi_pb2.SubscriptionList.STREAM
    sub_list.encoding = gnmi_pb2.JSON_IETF  # or PROTO for protobuf payloads

    subscription = sub_list.subscription.add()
    subscription.path.CopyFrom(build_path(GNMI_PATH))
    subscription.mode = gnmi_pb2.SAMPLE           # periodic sampling
    subscription.sample_interval = 10_000_000_000  # 10 seconds, in nanoseconds

    # gNMI implementations commonly authenticate via username/password metadata
    metadata = [("username", USERNAME), ("password", PASSWORD)]

    print(f"Subscribing to {GNMI_PATH} on {DEVICE_IP}:{DEVICE_PORT}...")
    try:
        # Subscribe is a bidirectional stream; we send one request and iterate
        # over the server's responses.
        for response in stub.Subscribe(iter([request]), metadata=metadata):
            if response.HasField("update"):
                notif = response.update
                print("\n--- Telemetry Update ---")
                print(f"Timestamp: {time.ctime(notif.timestamp / 1e9)}")
                if notif.HasField("prefix"):
                    print(f"Prefix: {json_format.MessageToJson(notif.prefix)}")
                for update in notif.update:
                    print(f"  Path:  {json_format.MessageToJson(update.path)}")
                    print(f"  Value: {json_format.MessageToJson(update.val)}")
            elif response.sync_response:
                print("--- Synchronization complete ---")
    except grpc.RpcError as e:
        print(f"gRPC Error: {e.details()}")
    except KeyboardInterrupt:
        print("Subscription stopped by user.")
    finally:
        channel.close()
        print("gRPC channel closed.")


if __name__ == "__main__":
    stream_telemetry()
Ansible Playbook: Configure Streaming Telemetry
This playbook configures gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS devices. It assumes the ansible.netcommon, cisco.ios, junipernetworks.junos, and arista.eos collections are installed and inventory is set up.
---
- name: Configure Multi-Vendor Streaming Telemetry
  hosts: network_devices
  gather_facts: false
  connection: ansible.netcommon.network_cli
  vars:
    telemetry_collector_ip: "192.168.10.10"
    telemetry_collector_port: 50051
    telemetry_sample_interval_ms: 10000  # 10 seconds
  tasks:
    - name: Ensure NETCONF/RESTCONF is enabled (Cisco IOS XE)
      when: ansible_network_os == 'ios'
      cisco.ios.ios_config:
        lines:
          - netconf-yang
          - restconf
          - ip http secure-server
          - ip http authentication local
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Cisco IOS XE)
      when: ansible_network_os == 'ios'
      cisco.ios.ios_config:
        lines:
          - "telemetry ietf"
          - " destination-group TELEMETRY_COLLECTOR"
          - "  address {{ telemetry_collector_ip }} port {{ telemetry_collector_port }}"
          - "  protocol grpc tls-enable"  # use TLS in production
          - "  encoding encode-kvgpb"
          - " sensor-group INTERFACE_OPER_STATE"
          - "  path openconfig-interfaces:interfaces/interface/state/counters"
          - " subscription PERIODIC_INTF_COUNTERS"
          - "  sensor-group INTERFACE_OPER_STATE sample-interval {{ telemetry_sample_interval_ms }}"
          - "  destination-group TELEMETRY_COLLECTOR"
          - "  stream cisco-push"
          - "  update-policy periodic"
        save_when: modified

    - name: Configure gRPC Streaming Telemetry (Juniper JunOS)
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_config:
        lines:
          - "set services extension-service request-response grpc clear-text port {{ telemetry_collector_port }}"
          - "set services analytics sensor SENSOR_INTF_STATS resource \"/interfaces/interface/state/\""
          - "set services analytics export-profile EXPORT_INTF_STATS reporting-period {{ (telemetry_sample_interval_ms / 1000) | int }}"
          - "set services analytics export-profile EXPORT_INTF_STATS format gpb"
          - "set services analytics export-profile EXPORT_INTF_STATS transport grpc"
          - "set services analytics export-profile EXPORT_INTF_STATS target-address {{ telemetry_collector_ip }}"
          - "set services analytics export-profile EXPORT_INTF_STATS target-port {{ telemetry_collector_port }}"
          - "set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS"
          - "set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS"

    - name: Configure gRPC Streaming Telemetry (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_config:
        lines:
          - "telemetry"
          - " destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "  protocol gRPC"
          - "  encoding GPB"
          - "  source-interface Management1"  # adjust as needed
          - " sensor-group INTERFACE_COUNTERS"
          - "  path /interfaces/interface/state/statistics"
          - " stream INTERFACE_STREAM"
          - "  sensor-group INTERFACE_COUNTERS"
          - "  destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
          - "  interval {{ telemetry_sample_interval_ms }}"
        save_when: modified

    - name: Verify Telemetry Configuration (Cisco IOS XE)
      when: ansible_network_os == 'ios'
      cisco.ios.ios_command:
        commands:
          - "show telemetry ietf subscription PERIODIC_INTF_COUNTERS"
          - "show telemetry ietf connection all"
      register: cisco_telemetry_output
      ignore_errors: true  # continue even if the command fails

    - ansible.builtin.debug:
        msg: "{{ cisco_telemetry_output.stdout }}"
      when: cisco_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Juniper JunOS)
      when: ansible_network_os == 'junos'
      junipernetworks.junos.junos_command:
        commands:
          - "show services analytics status"
          - "show services analytics client"
      register: juniper_telemetry_output
      ignore_errors: true

    - ansible.builtin.debug:
        msg: "{{ juniper_telemetry_output.stdout }}"
      when: juniper_telemetry_output.stdout is defined

    - name: Verify Telemetry Configuration (Arista EOS)
      when: ansible_network_os == 'eos'
      arista.eos.eos_command:
        commands:
          - "show telemetry stream INTERFACE_STREAM"
          - "show telemetry destination {{ telemetry_collector_ip }}:{{ telemetry_collector_port }}"
      register: arista_telemetry_output
      ignore_errors: true

    - ansible.builtin.debug:
        msg: "{{ arista_telemetry_output.stdout }}"
      when: arista_telemetry_output.stdout is defined
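To run the playbook, an inventory along these lines maps each device to its network OS so the per-vendor conditionals fire. Hostnames, addresses, and the playbook filename are illustrative.

```yaml
# inventory.yml (illustrative)
network_devices:
  children:
    cisco:
      hosts:
        cr1: {ansible_host: 192.168.10.1}
      vars:
        ansible_network_os: ios
    juniper:
      hosts:
        as1: {ansible_host: 192.168.10.2}
      vars:
        ansible_network_os: junos
    arista:
      hosts:
        ls1: {ansible_host: 192.168.10.3}
      vars:
        ansible_network_os: eos
  vars:
    ansible_connection: ansible.netcommon.network_cli
    ansible_user: admin
```

With that in place, a hypothetical invocation would be: ansible-playbook -i inventory.yml configure_telemetry.yml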
Security Considerations
Network telemetry streams vast amounts of operational data, making their security paramount. Compromised telemetry can lead to:
- Data Exposure: Sensitive network topology, performance, and traffic data falling into the wrong hands.
- System Manipulation: If telemetry agents or protocols have configuration capabilities, a breach could allow unauthorized configuration changes.
- Denial of Service (DoS): An attacker could overwhelm telemetry collectors with fabricated data or exhaust device resources by triggering excessive data streaming.
Attack Vectors and Mitigation Strategies
| Attack Vector | Mitigation Strategies |
|---|---|
| Unauthorized Data Access | Authentication: Use strong authentication (client certificates for gRPC/gNMI, AAA with NETCONF/RESTCONF, SNMPv3 with authPriv). Authorization: Implement granular access control (RBAC) to telemetry paths and data streams. Encryption: Always use TLS/SSL for gRPC, NETCONF, RESTCONF, and secure Syslog. |
| Tampering with Telemetry Data | Integrity: TLS/SSL provides data integrity checks. Use digital signatures where possible. Secure Sources: Ensure the network devices sending telemetry are themselves secured and not compromised. |
| DoS on Collector/Device | Rate Limiting: Implement rate limits on telemetry streams at the device if supported. Collector Scaling: Design collectors for horizontal scalability and redundancy. Network Segmentation: Isolate telemetry traffic on dedicated management networks/VLANs. Input Validation: Collectors should validate incoming data to prevent parsing malformed packets. |
| Compromised Monitoring Infrastructure | Hardening: Securely configure operating systems, databases, and applications in the observability stack. Least Privilege: Run collector services with minimal necessary permissions. Vulnerability Management: Regularly patch and scan all components. |
| Replay Attacks | Timestamps & Nonces: Protocols like gRPC incorporate timestamps and request/response matching to prevent replay. TLS Session Keys: Use fresh session keys for each connection. |
Security Best Practices
- Encrypt All Telemetry: Always use TLS/SSL for streaming telemetry (gRPC, NETCONF over SSH/TLS, RESTCONF over HTTPS). Never transmit sensitive data in plain text.
- Strong Authentication and Authorization: Implement multi-factor authentication for management interfaces. Use client certificates for gRPC authentication. Employ AAA for programmatic access to devices.
- Dedicated Management Network: Isolate telemetry traffic on a separate management network or VPN to reduce the attack surface.
- Principle of Least Privilege: Configure telemetry subscriptions to send only the data absolutely necessary. Limit access to monitoring tools and dashboards.
- Regular Auditing and Logging: Audit telemetry configurations and logs for suspicious activity. Ensure collectors log their own activity.
- Software Supply Chain Security: Use trusted sources for libraries and tools (e.g., Python gNMI client libraries, Telegraf plugins).
- Secure API Keys/Credentials: Store API keys and credentials for automation (Ansible, Python) securely using vaults or secret management systems.
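To illustrate the credential-handling practice above, the sketch below loads device credentials from environment variables (as a vault agent or CI secret store would inject them) instead of hardcoding them in scripts. The variable names are hypothetical.

```python
import os

def load_device_credentials() -> dict:
    """Pull device credentials from the environment instead of hardcoding them.

    NETDEVOPS_TELEM_USER/NETDEVOPS_TELEM_PASS are illustrative names; in
    practice point these at values injected by your secret manager.
    """
    username = os.environ.get("NETDEVOPS_TELEM_USER")
    password = os.environ.get("NETDEVOPS_TELEM_PASS")
    if not username or not password:
        raise RuntimeError("Telemetry credentials are not set in the environment")
    return {"username": username, "password": password}

# Demo only: seed the environment so the function has something to read
os.environ.setdefault("NETDEVOPS_TELEM_USER", "svc-telemetry")
os.environ.setdefault("NETDEVOPS_TELEM_PASS", "example-secret")
creds = load_device_credentials()
print(creds["username"])
```

The same pattern applies to Ansible via `ansible-vault` or environment lookups, keeping secrets out of playbooks and version control.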
Security Configuration Example (Cisco IOS XE - TLS for gRPC)
To enable TLS for gRPC streaming, you typically need to set up a Public Key Infrastructure (PKI) and trustpoints on the device and ensure your collector also presents a trusted certificate.
! Create a crypto key pair for the device
crypto key generate rsa label MY_TELEMETRY_KEY modulus 2048
! Define a trustpoint for the Certificate Authority (CA) that signed your collector's certificate
crypto pki trustpoint TELEMETRY_CA_TP
enrollment terminal
revocation-check none
usage ipsec ikev2 dot1x aaa web-auth tls
! You would paste the CA certificate here after "enrollment terminal"
! Example:
! certificate chain
! -----BEGIN CERTIFICATE-----
! ... CA Certificate Data ...
! -----END CERTIFICATE-----
! quit
! Define a trustpoint for the device's own identity certificate (signed by your internal CA)
crypto pki trustpoint TELEMETRY_DEVICE_IDENTITY_TP
enrollment terminal
revocation-check none
usage ipsec ikev2 dot1x aaa web-auth tls
rsakeypair MY_TELEMETRY_KEY
! You would paste the device's certificate here
! Example:
! certificate chain
! -----BEGIN CERTIFICATE-----
! ... Device Certificate Data ...
! -----END CERTIFICATE-----
! quit
! Link these to the telemetry profile
telemetry ietf
destination-group TELEMETRY_COLLECTOR
address 192.168.10.10 port 50051
protocol grpc tls-enable
encoding encode-kvgpb
profile TELEMETRY_TLS_PROFILE
! Now define the TLS profile referenced above
profile TELEMETRY_TLS_PROFILE
! Define which certificate the device presents and which CA to trust for the client
device-identity TELEMETRY_DEVICE_IDENTITY_TP
peer-trustpoint TELEMETRY_CA_TP
Security Warning: Implementing PKI and TLS requires careful planning and certificate management. Incorrect configurations can lead to connection failures. Always test thoroughly in a lab environment first.
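Before pointing a device at the collector, you can sanity-check the collector's TLS endpoint from any host using only Python's standard library. This is an illustrative sketch; the collector address and CA bundle path are placeholders you would replace with your own.

```python
import socket
import ssl

# Hypothetical values: replace with your collector address and the CA
# certificate that signed its server certificate.
COLLECTOR_HOST = "192.168.10.10"
COLLECTOR_PORT = 50051
CA_CERT = "/etc/telemetry/ca.pem"

def check_collector_tls(host: str, port: int, ca_file: str) -> str:
    """Open a TLS connection to the collector and return the negotiated version.

    Raises ssl.SSLCertVerificationError if the server certificate does not
    chain to the supplied CA -- the same class of failure a device trustpoint
    misconfiguration would produce.
    """
    context = ssl.create_default_context(cafile=ca_file)
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()
```

If this check fails from a workstation, the device-side gRPC session will fail for the same reason, which narrows troubleshooting to the PKI setup rather than the telemetry configuration.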
Verification & Troubleshooting
Effective verification and troubleshooting are crucial for maintaining a healthy telemetry pipeline.
Verification Commands
Beyond the vendor-specific show telemetry commands, here are general verification steps:
# Verify basic network connectivity to the collector
ping 192.168.10.10
# Verify TCP port connectivity to the collector's gRPC port
# On Linux:
nc -zv 192.168.10.10 50051
# Expected output: Connection to 192.168.10.10 50051 port [tcp/*] succeeded!
# From the collector, check if the gNMI client is running and connected
# (Specific command depends on the collector, e.g., 'systemctl status telegraf')
Expected Output
A healthy telemetry pipeline should show:
- Device-side: Subscriptions “Active” or “Connected,” counters for pushes increasing.
- Collector-side: Logs indicating successful connection, parsing, and forwarding of data.
- TSDB-side: Metrics appearing correctly tagged and indexed.
- Grafana/Visualization: Dashboards populated with real-time data.
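The `ping`/`nc` reachability checks above can also be scripted so they run against many collectors at once. The sketch below is a stdlib-only equivalent of `nc -zv`.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Equivalent of 'nc -zv host port': attempt a TCP connect, report success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the local gRPC collector port (result depends on environment)
print(tcp_port_open("127.0.0.1", 50051))
```

Run periodically from the collector host (or a monitoring node), this gives an early warning when a telemetry destination becomes unreachable, independent of the device-side subscription state.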
Common Issues Table
| Issue | Possible Causes | Resolution Steps |
Technical Concepts
The journey from traditional monitoring to full network observability involves a fundamental shift in how network state information is collected, processed, and analyzed.
Traditional Monitoring vs. Modern Observability
Traditional Monitoring often relies on a “pull” model, where a monitoring system periodically queries network devices for specific metrics. Key technologies include:
- SNMP (Simple Network Management Protocol): A widely used application-layer protocol for managing and monitoring network devices. It uses agents on devices to collect data and a manager to query them. While ubiquitous, it can be chatty, less granular, and often lacks real-time capabilities. (Refer to RFC 3411-3418 for SNMPv3 standards).
- Syslog: A standard for message logging, allowing network devices to send event notifications (e.g., link up/down, error messages) to a central server. Excellent for event correlation but doesn’t provide granular metric data. (Refer to RFC 5424 for Syslog Protocol).
- NetFlow/IPFIX (IP Flow Information Export): Provides data on IP traffic flows, enabling analysis of traffic patterns, bandwidth usage, and security incidents. It’s flow-based, not packet-based, offering aggregates. IPFIX (RFC 7011, which obsoletes RFC 5101) is the IETF standard derived from NetFlow v9.
Modern Telemetry and Observability adopt a “push” model, where network devices actively stream highly granular, structured data to collectors in near real-time. This shift is driven by:
- Structured Data: Using data models like YANG for consistent, machine-readable data.
- High Granularity: Sub-second data collection, crucial for dynamic network behavior.
- Real-time Insights: Enables faster detection and response to anomalies.
- Reduced Polling Overhead: Devices push data when changes occur or at set intervals.
Observability goes beyond mere monitoring. While monitoring tells you if a system is working, observability helps you understand why it’s not working, or why its performance has changed. It involves collecting diverse data types (metrics, logs, traces) to build a comprehensive understanding of system behavior from external outputs.
Network Observability Architecture
A typical network observability architecture consists of several key components:
- Network Devices: The source of telemetry data.
- Telemetry Agents: Software running on devices (or built-in) responsible for collecting raw data and formatting it according to a data model.
- Telemetry Collectors: Software systems that receive, parse, and often buffer the high volume of streaming data from multiple devices. Examples include Telegraf, OpenNMS, Custom Python scripts.
- Time-Series Database (TSDB): Optimized for storing time-stamped data, allowing efficient querying and analysis of metrics over time. Examples include Prometheus, InfluxDB, VictoriaMetrics.
- Data Processing & Analytics: Tools that can enrich, filter, aggregate, and analyze the collected data.
- Visualization & Alerting: Dashboards (e.g., Grafana) to visualize trends and anomalies, and alerting mechanisms to notify engineers of critical events.
Let’s visualize this architecture:
@startuml
skinparam handwritten true
skinparam style strict
cloud "Internet/WAN" as WAN
package "Network Infrastructure" {
node "Core Router 1 (Cisco IOS XE)" as CR1
node "Aggregation Switch 1 (Juniper JunOS)" as AS1
node "Leaf Switch 1 (Arista EOS)" as LS1
}
package "Observability Platform" {
cloud "Telemetry Collectors" as Collectors {
component "gRPC Collector (e.g., Telegraf)" as GRPCC
component "NETCONF/RESTCONF Listener (e.g., Python app)" as NETCONFC
component "SNMP Manager" as SNMPM
component "Syslog Server" as SYSLOGS
}
database "Time-Series Database (TSDB)" as TSDB {
folder "Prometheus"
folder "InfluxDB"
}
node "Visualization & Alerting" as Viz {
artifact "Grafana Dashboards"
artifact "Alert Manager"
}
}
CR1 -[hidden] AS1
AS1 -[hidden] LS1
CR1 -up-> GRPCC : gRPC Streaming Telemetry (YANG)
AS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
LS1 -up-> GRPCC : gRPC Streaming Telemetry (OpenConfig)
CR1 -up-> NETCONFC : NETCONF/RESTCONF (YANG)
AS1 -up-> NETCONFC : NETCONF (YANG)
LS1 -up-> NETCONFC : RESTCONF (YANG/eAPI)
CR1 --> SNMPM : SNMP Traps/Polls
AS1 --> SNMPM : SNMP Traps/Polls
LS1 --> SNMPM : SNMP Traps/Polls
CR1 --> SYSLOGS : Syslog Events
AS1 --> SYSLOGS : Syslog Events
LS1 --> SYSLOGS : Syslog Events
GRPCC --> TSDB : Store Metrics
NETCONFC --> TSDB : Store Metrics
SNMPM --> TSDB : Store Metrics
SYSLOGS --> TSDB : Store Logs/Metrics
TSDB --> Viz : Query Data
Viz .down.> "Network Operations Center (NOC)" as NOC : Alerts/Dashboards
@enduml
Streaming Telemetry Protocols
Streaming telemetry relies on modern, standardized protocols for efficient, structured data transfer.
NETCONF/YANG
- NETCONF (Network Configuration Protocol): An XML-based protocol designed for configuring and managing network devices. While its primary role is configuration, it can also be used to retrieve operational state data. It operates over secure transport mechanisms like SSH or TLS.
- RFC 6241: Network Configuration Protocol (NETCONF)
- YANG (Yet Another Next Generation): A data modeling language used to define the structure and content of configuration and state data for network devices. YANG models provide a formal, machine-readable schema for both configuration and operational data, enabling multi-vendor interoperability.
- RFC 7950: The YANG 1.1 Data Modeling Language
NETCONF can be used for “pulling” operational state data from devices, similar to SNMP, but with the advantage of structured YANG data.
RESTCONF/YANG
- RESTCONF: A REST-like protocol that uses HTTP(S) to provide a programmatic interface for interacting with network devices. It exposes the YANG data model as a resource tree, allowing clients to perform CRUD (Create, Read, Update, Delete) operations.
- RFC 8040: RESTCONF Protocol
RESTCONF offers a more web-friendly approach to access YANG-modeled data, which can be useful for integration with web applications and scripting.
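A minimal illustration using the requests library follows; the host and credentials are placeholders, and `ietf-interfaces:interfaces` is a commonly supported RFC 8040 resource path. `verify=False` is for lab self-signed certificates only.

```python
import requests  # pip install requests

def get_interfaces_restconf(host: str, username: str, password: str) -> dict:
    """Read YANG-modeled interface data via RESTCONF (RFC 8040)."""
    url = f"https://{host}/restconf/data/ietf-interfaces:interfaces"
    headers = {"Accept": "application/yang-data+json"}  # RESTCONF media type
    resp = requests.get(
        url,
        headers=headers,
        auth=(username, password),
        verify=False,   # lab only; supply a CA bundle in production
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Usage (against a live device):
# data = get_interfaces_restconf("192.168.10.1", "admin", "password")
```

The same URL structure works for any YANG path the device exposes, which makes RESTCONF convenient for quick state queries from scripts and CI pipelines.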
gRPC and gNMI
- gRPC (Google Remote Procedure Call): A high-performance, open-source RPC framework that can run in any environment. It uses Protocol Buffers (Protobuf) as its Interface Definition Language (IDL) for defining service methods and message structures. gRPC is efficient due to its binary message format and use of HTTP/2.
- gNMI (gRPC Network Management Interface): A specification developed within the OpenConfig initiative (led by Google, with broad vendor support) that defines a gRPC-based service for network management, including streaming telemetry. It allows clients to subscribe to specific data paths (defined by YANG/OpenConfig) and receive updates.
gRPC and gNMI are the preferred methods for high-volume, low-latency streaming telemetry due to their efficiency.
Protocol Flow: gRPC Streaming Telemetry
digraph gRPC_Telemetry {
rankdir=LR;
node [shape=box];
Client [label="Telemetry Collector (gNMI Client)"];
Device [label="Network Device (gNMI Server)"];
subgraph cluster_0 {
label="Subscription Request (Client to Device)";
style=filled;
color=lightgrey;
Client -> Device [label="Establish gRPC Channel\n(TLS Encrypted)"];
Device -> Client [label="Channel Acknowledged"];
Client -> Device [label="gNMI::SubscribeRequest\n(Path, Mode: periodic/on-change)"];
}
subgraph cluster_1 {
label="Data Stream (Device to Client)";
style=filled;
color=lightblue;
Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
Device -> Client [label="gNMI::SubscribeResponse\n(Telemetry Update - Protobuf/JSON Payload)"];
Device -> Client [label="... Continuous Stream ..."];
}
}
Conceptual Packet Structure: gRPC Telemetry (Simplified)
A gRPC packet, particularly over HTTP/2, is complex. Here’s a simplified view focusing on the payload within the context of a gNMI SubscribeResponse carrying an OpenConfig interface counter update.
packetdiag {
colwidth = 32
node_height = 72
// Ethernet Header
0-15: Dest MAC (6 bytes)
16-31: Source MAC (6 bytes)
32-35: EtherType (0x0800 for IPv4)
// IP Header (IPv4)
36-39: Version (4), IHL (5), DSCP, ECN
40-47: Total Length
48-51: Identification, Flags, Fragment Offset
52-55: TTL, Protocol (6 for TCP), Header Checksum
56-63: Source IP Address
64-71: Destination IP Address
// TCP Header
72-79: Source Port (e.g., device ephemeral)
80-87: Destination Port (e.g., 50051 for gRPC)
88-103: Sequence Number
104-119: Acknowledgment Number
120-123: Data Offset, Reserved, Flags (SYN, ACK, PSH, URG, etc.)
124-127: Window Size
128-131: Checksum
132-135: Urgent Pointer
// HTTP/2 Frame Header (simplified, as gRPC multiplexes on streams)
136-139: Length
140-141: Type (e.g., DATA), Flags
142-143: Stream Identifier, Reserved
// gRPC Header (simplified)
144-147: Compressed Flag, Message Length
// gNMI SubscribeResponse (Protobuf Encoded)
148-163: <<gNMI SubscribeResponse Message>>
164-179: timestamp (uint64)
180-195: prefix (string/Path)
196-211: update (list of Update messages)
212-227: Path (e.g., /interfaces/interface[name=GigabitEthernet1]/state/counters)
228-243: Val (TypedValue: counter value)
244-259: ... other updates ...
260-275: <<End gNMI Message>>
}
Data Models (OpenConfig and Vendor-Native YANG)
YANG Data Models are crucial for streaming telemetry. They define the structure, syntax, and semantics of data.
- Vendor-Native YANG Models: Provided by device vendors (e.g., Cisco, Juniper, Arista) and offer granular access to device-specific features and operational data. Examples: Cisco-IOS-XE-interfaces-oper.yang, juniper-smi.yang. You can explore these on Cisco DevNet’s YANG Suite (developer.cisco.com/yangsuite).
- OpenConfig: An industry-wide initiative to define a common set of vendor-neutral YANG data models for network configuration and operational state. Its goal is to provide a unified approach to managing multi-vendor networks. Using OpenConfig models simplifies automation and monitoring across diverse hardware. (Learn more at openconfig.net.)
The use of YANG models, especially OpenConfig, is a cornerstone of effective multi-vendor NetDevOps.
Collector Architectures
Collectors are vital for handling the ingestion of telemetry data. They typically perform several functions:
- Ingestion: Receive data via gRPC, NETCONF, SNMP, etc.
- Parsing: Decode Protobuf/JSON/XML payloads into usable metrics.
- Tagging/Labeling: Add metadata (e.g., device hostname, interface name) to metrics for easier querying.
- Buffering: Temporarily store data before writing to a TSDB.
- Forwarding: Send processed metrics to a TSDB.
Popular open-source collector solutions include:
- Telegraf: A plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors to various output plugins (including Prometheus, InfluxDB). Excellent for gRPC telemetry.
- Prometheus Exporters: Prometheus scrapes metrics over HTTP via exporters; while the Node Exporter targets host metrics, exporters such as the SNMP Exporter expose network-device data in Prometheus format.
- Custom Python Applications: For highly specific use cases, a Python script can act as a gNMI client to subscribe, parse, and store data.
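To make the collector stages above concrete, here is a toy, in-memory sketch of the parse → tag → buffer → forward pipeline. It is illustrative only (the "TSDB" is a plain list), not a production collector.

```python
import json
import time
from collections import deque

class MiniCollector:
    """Toy collector mirroring the stages above: parse, tag, buffer, forward."""

    def __init__(self, flush_size: int = 3):
        self.buffer = deque()
        self.flush_size = flush_size
        self.flushed = []          # stands in for a TSDB write

    def ingest(self, device: str, raw: str) -> None:
        metric = json.loads(raw)                 # parsing (decode payload)
        metric["device"] = device                # tagging (add metadata)
        metric.setdefault("ts", time.time())     # timestamp if the device sent none
        self.buffer.append(metric)               # buffering
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        self.flushed.extend(self.buffer)         # forwarding to the "TSDB"
        self.buffer.clear()

collector = MiniCollector()
for i in range(4):
    collector.ingest("CR1", json.dumps({"path": "intf/Gi1/in-octets", "value": 100 + i}))
print(len(collector.flushed), len(collector.buffer))  # → 3 1
```

Real collectors such as Telegraf implement the same stages with persistent output plugins, batching, and retry logic, but the data flow is structurally identical.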
Configuration Examples (Multi-vendor)
Here, we’ll demonstrate configuring streaming telemetry (gRPC/gNMI and NETCONF/RESTCONF) on Cisco, Juniper, and Arista devices.
Cisco IOS XE/XR (gRPC Streaming Telemetry)
This example configures a periodic gRPC subscription to stream interface statistics.
! Configure NETCONF/RESTCONF for management access (often a prerequisite for gNMI)
! Enable NETCONF SSH transport
netconf-yang
ssh
! Enable RESTCONF HTTPS transport
restconf
transport https
! Use a local authentication method
authorization local
! Configure gNMI/gRPC telemetry
! Define the telemetry destination (collector)
telemetry ietf
destination-group TELEMETRY_COLLECTOR
address 192.168.10.10 port 50051
protocol grpc tls-enable
encoding encode-kvgpb
profile TELEMETRY_PROFILE ! Optional TLS profile if using client certs
! The TLS profile itself (trustpoints, device identity, and the CA used to
! validate the collector) is defined under crypto pki; see the security
! configuration example in the Security Considerations section
! Define a sensor group for the data we want to stream (e.g., interface operational state)
sensor-group INTERFACE_OPER_STATE
! Use an OpenConfig path for multi-vendor consistency
! For Cisco, verify specific YANG paths are supported
! Example: openconfig-interfaces:interfaces/interface/state
! Example: Cisco-IOS-XE-interfaces-oper:interfaces/interface/state
! Path examples:
! /interfaces/interface/state/counters
! /interfaces/interface[name='GigabitEthernet1']/state/counters
path openconfig-interfaces:interfaces/interface/state/counters
! Define a subscription that links the sensor group to the destination
subscription PERIODIC_INTF_COUNTERS
sensor-group INTERFACE_OPER_STATE sample-interval 10000 ! 10-second interval
destination-group TELEMETRY_COLLECTOR
stream cisco-push
update-policy periodic
! --- Verification Commands ---
Verification Commands:
show telemetry ietf subscription PERIODIC_INTF_COUNTERS
show telemetry ietf destination-group TELEMETRY_COLLECTOR
show telemetry ietf sensor-group INTERFACE_OPER_STATE
show telemetry ietf connection all
Expected Output (Snippet):
Router# show telemetry ietf subscription PERIODIC_INTF_COUNTERS
Subscription ID: 100
Type: Dynamic
State: Enabled
Source Address: 0.0.0.0
Source VRF: <default>
Stream: cisco-push
Update policy: periodic
Update interval: 10000 ms
Sensor Groups:
Sensor Group: INTERFACE_OPER_STATE (ID: 100)
Path: openconfig-interfaces:interfaces/interface/state/counters
Destination Groups: TELEMETRY_COLLECTOR
Address: 192.168.10.10:50051
Transport: grpc
Encoding: encode-kvgpb
Profile: TELEMETRY_PROFILE
TLS: Enabled
Router# show telemetry ietf connection all
Telemetry connection 0:
Peer Address: 192.168.10.10
Peer Port: 50051
Local Address: 10.0.0.1
Local Port: 54321
State: Connected
Profile Name: TELEMETRY_PROFILE
Subscriptions: 100
Juniper JunOS (gRPC Streaming Telemetry)
This example configures gRPC streaming telemetry for interface statistics using OpenConfig models.
# Enable gRPC and specify its listening port
set services extension-service request-response grpc clear-text port 50051
# Configure a streaming telemetry sensor (data provider)
set services analytics sensor SENSOR_INTF_STATS
set services analytics sensor SENSOR_INTF_STATS description "Interface Stats"
# Specify the OpenConfig path to collect data
# Juniper uses 'open-config:' prefix for OpenConfig paths
set services analytics sensor SENSOR_INTF_STATS resource "/junos/system/linecard/interface/"
set services analytics sensor SENSOR_INTF_STATS resource "/interfaces/interface/state/" # OpenConfig path
# Configure a streaming telemetry export profile (collector destination and frequency)
set services analytics export-profile EXPORT_INTF_STATS
set services analytics export-profile EXPORT_INTF_STATS reporting-period 10 # 10 seconds
set services analytics export-profile EXPORT_INTF_STATS format gpb
set services analytics export-profile EXPORT_INTF_STATS transport grpc
set services analytics export-profile EXPORT_INTF_STATS target-address 192.168.10.10
set services analytics export-profile EXPORT_INTF_STATS target-port 50051
# Apply the sensor and export profile to a rule
set services analytics rule RULE_INTF_STATS
set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS
set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS
# Commit the configuration
commit
Verification Commands:
show services analytics status
show services analytics client
show services analytics export-profile EXPORT_INTF_STATS
Expected Output (Snippet):
user@juniper> show services analytics status
Extension service status:
Current status: Running
Enabled on: FPC0
GRPC enabled: Yes, port 50051
user@juniper> show services analytics client
...
Client information:
Name: EXPORT_INTF_STATS
Address: 192.168.10.10, Port: 50051
Protocol: grpc
Sensor name: SENSOR_INTF_STATS
Reporting period: 10s
State: Connected
...
Arista EOS (gRPC Streaming Telemetry)
Arista EOS uses OpenConfig by default for gRPC telemetry.
! Enable eAPI (for RESTCONF-like interaction and generally good practice)
management api http-commands
no shutdown
protocol https
vrf default
! Configure gRPC telemetry
! Define the telemetry receiver (collector)
telemetry
destination 192.168.10.10:50051
protocol gRPC
encoding GPB
tls profile TELEMETRY_TLS_PROFILE ! Optional TLS profile
source-interface Management1
! Define a sensor group (path to collect)
sensor-group INTERFACE_COUNTERS
path /Sysdb/interface/counters
path /interfaces/interface/state/statistics ! OpenConfig path for counters
! Define a subscription to push data from the sensor group to the destination
stream INTERFACE_STREAM
sensor-group INTERFACE_COUNTERS
destination 192.168.10.10:50051
interval 10000 ! 10 seconds
! --- Verification Commands ---
Verification Commands:
show telemetry
show telemetry destination 192.168.10.10:50051
show telemetry stream INTERFACE_STREAM
Expected Output (Snippet):
Arista# show telemetry
Telemetry Receiver State:
Receiver: 192.168.10.10:50051
Protocol: gRPC
Encoding: GPB
State: Active
Source-interface: Management1
Telemetry Streams:
Stream: INTERFACE_STREAM
Sensor Group: INTERFACE_COUNTERS
Destination: 192.168.10.10:50051
Interval: 10000 ms
Last Push: 00:00:02 ago
Push Count: 1234
Status: OK
Network Diagrams
Diagrams are essential for visualizing complex network concepts.
Network Topology: Telemetry Lab Setup (nwdiag)
nwdiag {
network core_network {
address = "10.0.0.0/24"
description = "Core Network Segment"
CR1 [address = "10.0.0.1"];
AS1 [address = "10.0.0.2"];
}
network mgmt_network {
address = "192.168.10.0/24"
description = "Management & Telemetry Network"
CR1 [address = "192.168.10.1"];
AS1 [address = "192.168.10.2"];
LS1 [address = "192.168.10.3"];
COLLECTOR [address = "192.168.10.10", description = "Telemetry Collector"];
TSDB [address = "192.168.10.11", description = "Time-Series DB"];
GRAFANA [address = "192.168.10.12", description = "Grafana / Visualization"];
}
// Connections implicitly defined by shared networks
CR1 -- AS1; // Represents logical connection in core_network
CR1 -- COLLECTOR; // Represents logical connection in mgmt_network
AS1 -- COLLECTOR;
LS1 -- COLLECTOR;
COLLECTOR -- TSDB;
TSDB -- GRAFANA;
}
Data Flow for Observability Platform (plantuml)
@startuml
scale 1.5
cloud "Network Devices" as NetDevs {
component "Cisco IOS XE" as C_DEV
component "Juniper JunOS" as J_DEV
component "Arista EOS" as A_DEV
}
rectangle "Telemetry Collection Layer" {
component "gNMI Collector\n(e.g., Telegraf)" as GNMI_COLLECTOR
component "SNMP Poller\n(e.g., Prometheus)" as SNMP_POLLER
component "Syslog Aggregator\n(e.g., Logstash)" as SYSLOG_AGG
}
database "Data Storage" {
component "Time-Series DB\n(e.g., Prometheus DB, InfluxDB)" as TSDB
component "Log Storage\n(e.g., Elasticsearch)" as LOG_STORE
}
rectangle "Analysis & Visualization" {
component "Metrics Dashboards\n(e.g., Grafana)" as GRAFANA
component "Alerting Engine\n(e.g., Alertmanager)" as ALERTS
component "Log Analysis\n(e.g., Kibana)" as KIBANA
}
C_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
J_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
A_DEV -right-> GNMI_COLLECTOR : gRPC/gNMI (YANG, OpenConfig)
C_DEV -down-> SNMP_POLLER : SNMPv3
J_DEV -down-> SNMP_POLLER : SNMPv3
A_DEV -down-> SNMP_POLLER : SNMPv3
C_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
J_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
A_DEV -down-> SYSLOG_AGG : Syslog (RFC 5424)
GNMI_COLLECTOR -down-> TSDB : Write Metrics
SNMP_POLLER -down-> TSDB : Write Metrics
SYSLOG_AGG -down-> LOG_STORE : Write Logs
TSDB -up-> GRAFANA : Query Metrics
LOG_STORE -up-> KIBANA : Query Logs
GRAFANA -right-> ALERTS : Trigger Alerts
ALERTS -up-> "NetOps Team" : Notifications
@enduml
Automation Examples
Automating the setup of telemetry and the consumption of data is central to NetDevOps.
Python: gNMI Client for Streaming Telemetry
This Python script demonstrates how to subscribe to gNMI telemetry data from a network device using the grpc library together with Python bindings (gnmi_pb2, gnmi_pb2_grpc) generated from the public gNMI protobuf definition.
# pip install grpcio grpcio-tools
# gnmi_pb2.py / gnmi_pb2_grpc.py are generated from the public OpenConfig
# gnmi.proto, e.g.:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. gnmi.proto
import time
import grpc
from google.protobuf import json_format
import gnmi_pb2
import gnmi_pb2_grpc
# Device details
DEVICE_IP = "192.168.10.1"  # IP of your Cisco/Juniper/Arista device
DEVICE_PORT = 50051         # gNMI port, typically 50051
USERNAME = "admin"
PASSWORD = "password"
# Path to subscribe to (OpenConfig interface counters)
# Adjust based on your device's supported paths and configured sensor groups
# For Cisco IOS XE: "/openconfig-interfaces:interfaces/interface/state/counters"
# For Juniper JunOS: "/interfaces/interface/state/"
# For Arista EOS: "/interfaces/interface/state/statistics"
GNMI_PATH = "/interfaces/interface/state/statistics"
def build_path(path_str):
    """Convert an XPath-like string into a gNMI Path message.
    Handles key-value pairs in the path, e.g. interface[name=GigabitEthernet1].
    """
    path = gnmi_pb2.Path()
    for part in path_str.strip("/").split("/"):
        name, _, key_part = part.partition("[")
        elem = path.elem.add()
        elem.name = name
        if key_part:
            key, _, value = key_part.rstrip("]").partition("=")
            elem.key[key] = value
    return path
def stream_telemetry():
    # For simplicity, this example uses insecure_channel; production should use
    # grpc.secure_channel with grpc.ssl_channel_credentials (the collector's CA
    # certificate as root_certificates, plus a client cert/key for mutual TLS).
    channel = grpc.insecure_channel(f"{DEVICE_IP}:{DEVICE_PORT}")
    stub = gnmi_pb2_grpc.gNMIStub(channel)
    subscribe_request = gnmi_pb2.SubscribeRequest()
    subscription_list = subscribe_request.subscribe
    subscription_list.mode = gnmi_pb2.SubscriptionList.STREAM
    subscription_list.encoding = gnmi_pb2.JSON_IETF  # Or PROTO for protobuf
    # Create a subscription for the chosen path
    subscription = subscription_list.subscription.add()
    subscription.path.CopyFrom(build_path(GNMI_PATH))
    subscription.mode = gnmi_pb2.SAMPLE              # periodic sampling
    subscription.sample_interval = 10_000_000_000    # 10 seconds in nanoseconds
    # Many gNMI targets accept credentials as gRPC metadata
    metadata = [("username", USERNAME), ("password", PASSWORD)]
    print(f"Subscribing to {GNMI_PATH} on {DEVICE_IP}:{DEVICE_PORT}...")
    try:
        # Subscribe is a bidirectional stream: send one request, iterate updates
        for response in stub.Subscribe(iter([subscribe_request]), metadata=metadata):
            if response.HasField("update"):
                notification = response.update
                timestamp_ns = notification.timestamp
                timestamp_s = timestamp_ns / 1_000_000_000
                print("\n--- Telemetry Update ---")
                print(f"Timestamp: {time.ctime(timestamp_s)} ({timestamp_ns} ns)")
                print(f"Prefix: {json_format.MessageToJson(notification.prefix)}")
                for update in notification.update:
                    print(f"  Path: {json_format.MessageToJson(update.path)}")
                    print(f"  Value: {json_format.MessageToJson(update.val)}")
            elif response.HasField("sync_response"):
                print("--- Synchronization complete ---")
    except grpc.RpcError as e:
        print(f"gRPC Error: {e.details()}")
    except KeyboardInterrupt:
        print("Subscription stopped by user.")
    finally:
        channel.close()
        print("gRPC channel closed.")
if __name__ == "__main__":
    stream_telemetry()
Ansible Playbook: Configure Streaming Telemetry
This playbook configures gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS devices. It assumes the ansible.netcommon collection (providing the network_cli connection plugin) and the vendor collections (cisco.ios, junipernetworks.junos, arista.eos) are installed and that inventory is set up.
---
- name: Configure Multi-Vendor Streaming Telemetry
hosts: network_devices
gather_facts: false
connection: network_cli
vars:
telemetry_collector_ip: "192.168.10.10"
telemetry_collector_port: 50051
telemetry_sample_interval_ms: 10000 # 10 seconds
tasks:
- name: Ensure NETCONF/RESTCONF is enabled (Cisco IOS XE)
when: ansible_network_os == 'ios'
cisco.ios.ios_config:
lines:
- netconf-yang
- restconf
- "restconf transport https"
- "restconf authorization local"
save_when: modified
- name: Configure gRPC Streaming Telemetry (Cisco IOS XE)
when: ansible_network_os == 'ios'
cisco.ios.ios_config:
lines:
- "telemetry ietf"
- " destination-group TELEMETRY_COLLECTOR"
- " address {{ telemetry_collector_ip }} port {{ telemetry_collector_port }}"
- " protocol grpc tls-enable" # Use 'tls-enable' for production, or 'no tls' for testing
- " encoding encode-kvgpb"
- " sensor-group INTERFACE_OPER_STATE"
- " path openconfig-interfaces:interfaces/interface/state/counters"
- " subscription PERIODIC_INTF_COUNTERS"
- " sensor-group INTERFACE_OPER_STATE sample-interval {{ telemetry_sample_interval_ms }}"
- " destination-group TELEMETRY_COLLECTOR"
- " stream cisco-push"
- " update-policy periodic"
save_when: modified
- name: Configure gRPC Streaming Telemetry (Juniper JunOS)
when: ansible_network_os == 'junos'
junipernetworks.junos.junos_config:
lines:
- "set services extension-service request-response grpc clear-text port {{ telemetry_collector_port }}"
- "set services analytics sensor SENSOR_INTF_STATS resource \"/interfaces/interface/state/\""
- "set services analytics export-profile EXPORT_INTF_STATS reporting-period {{ (telemetry_sample_interval_ms / 1000) | int }}"
- "set services analytics export-profile EXPORT_INTF_STATS format gpb"
- "set services analytics export-profile EXPORT_INTF_STATS transport grpc"
- "set services analytics export-profile EXPORT_INTF_STATS target-address {{ telemetry_collector_ip }}"
- "set services analytics export-profile EXPORT_INTF_STATS target-port {{ telemetry_collector_port }}"
- "set services analytics rule RULE_INTF_STATS sensor-name SENSOR_INTF_STATS"
- "set services analytics rule RULE_INTF_STATS export-profile EXPORT_INTF_STATS"
commit_empty_command: true # Allows committing an empty set if no changes
- name: Configure gRPC Streaming Telemetry (Arista EOS)
when: ansible_network_os == 'eos'
arista.eos.eos_config:
lines:
- "telemetry"
- " destination :"
- " protocol gRPC"
- " encoding GPB"
- " source-interface Management1" # Adjust as needed
- " sensor-group INTERFACE_COUNTERS"
- " path /interfaces/interface/state/statistics"
- " stream INTERFACE_STREAM"
- " sensor-group INTERFACE_COUNTERS"
- " destination :"
- " interval "
save_when: modified
- name: Verify Telemetry Configuration (Cisco IOS XE)
when: ansible_network_os == 'ios' or ansible_network_os == 'iosxr'
cisco.ios.ios_command:
commands:
- "show telemetry ietf subscription PERIODIC_INTF_COUNTERS"
- "show telemetry ietf connection all"
register: cisco_telemetry_output
ignore_errors: true # Continue even if command fails
- ansible.builtin.debug:
msg: ""
when: cisco_telemetry_output.stdout is defined
- name: Verify Telemetry Configuration (Juniper JunOS)
when: ansible_network_os == 'junos'
juniper.junos.junos_command:
commands:
- "show services analytics status"
- "show services analytics client"
register: juniper_telemetry_output
ignore_errors: true
- ansible.builtin.debug:
msg: ""
when: juniper_telemetry_output.stdout is defined
- name: Verify Telemetry Configuration (Arista EOS)
when: ansible_network_os == 'eos'
arista.eos.eos_command:
commands:
- "show telemetry stream INTERFACE_STREAM"
- "show telemetry destination :"
register: arista_telemetry_output
ignore_errors: true
- ansible.builtin.debug:
msg: ""
when: arista_telemetry_output.stdout is defined
Security Considerations
Network telemetry streams vast amounts of operational data, making their security paramount. Compromised telemetry can lead to:
- Data Exposure: Sensitive network topology, performance, and traffic data falling into the wrong hands.
- System Manipulation: If telemetry agents or protocols have configuration capabilities, a breach could allow unauthorized configuration changes.
- Denial of Service (DoS): An attacker could overwhelm telemetry collectors with fabricated data or exhaust device resources by triggering excessive data streaming.
Attack Vectors and Mitigation Strategies
| Attack Vector | Mitigation Strategies |
|---|---|
| Unauthorized Data Access | Authentication: Use strong authentication (client certificates for gRPC/gNMI, AAA with NETCONF/RESTCONF, SNMPv3 with authPriv). Authorization: Implement granular access control (RBAC) to telemetry paths and data streams. Encryption: Always use TLS/SSL for gRPC, NETCONF, RESTCONF, and secure Syslog. |
| Tampering with Telemetry Data | Integrity: TLS/SSL provides data integrity checks. Use digital signatures where possible. Secure Sources: Ensure the network devices sending telemetry are themselves secured and not compromised. |
| DoS on Collector/Device | Rate Limiting: Implement rate limits on telemetry streams at the device if supported. Collector Scaling: Design collectors for horizontal scalability and redundancy. Network Segmentation: Isolate telemetry traffic on dedicated management networks/VLANs. Input Validation: Collectors should validate incoming data to prevent parsing malformed packets. |
| Compromised Monitoring Infrastructure | Hardening: Securely configure operating systems, databases, and applications in the observability stack. Least Privilege: Run collector services with minimal necessary permissions. Vulnerability Management: Regularly patch and scan all components. |
| Replay Attacks | Timestamps & Nonces: Protocols like gRPC incorporate timestamps and request/response matching to prevent replay. TLS Session Keys: Use fresh session keys for each connection. |
Security Best Practices
- Encrypt All Telemetry: Always use TLS/SSL for streaming telemetry (gRPC, NETCONF over SSH/TLS, RESTCONF over HTTPS). Never transmit sensitive data in plain text.
- Strong Authentication and Authorization: Implement multi-factor authentication for management interfaces. Use client certificates for gRPC authentication. Employ AAA for programmatic access to devices.
- Dedicated Management Network: Isolate telemetry traffic on a separate management network or VPN to reduce the attack surface.
- Principle of Least Privilege: Configure telemetry subscriptions to send only the data absolutely necessary. Limit access to monitoring tools and dashboards.
- Regular Auditing and Logging: Audit telemetry configurations and logs for suspicious activity. Ensure collectors log their own activity.
- Software Supply Chain Security: Use trusted sources for libraries and tools (e.g., the Python `gnmic` library, Telegraf plugins).
- Secure API Keys/Credentials: Store API keys and credentials for automation (Ansible, Python) securely using vaults or secret management systems.
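To make the encryption and authentication guidance concrete, here is a minimal sketch using only Python's standard `ssl` module to build the kind of verified, mutually authenticated client context a telemetry collector or gNMI client would use. The file paths are placeholders; a real gRPC client would pass the same CA, certificate, and key material to its own credential API instead.

```python
import ssl

def make_mtls_context(ca_path=None, cert_path=None, key_path=None):
    """Build a TLS context that verifies the server's certificate and
    (optionally) presents a client certificate for mutual TLS."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocol versions
    ctx.check_hostname = True                     # verify the peer's identity
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_path:
        ctx.load_verify_locations(cafile=ca_path)         # trust only our CA
    if cert_path and key_path:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)  # client cert for mTLS
    return ctx

# Verify the context enforces certificate checking by default.
print(make_mtls_context().verify_mode == ssl.CERT_REQUIRED)  # → True
```

The same "verify, then authenticate" shape applies whether the transport is gRPC, RESTCONF over HTTPS, or NETCONF over TLS.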
Security Configuration Example (Cisco IOS XE - TLS for gRPC)
To enable TLS for gRPC streaming, you typically need to set up a Public Key Infrastructure (PKI) and trustpoints on the device and ensure your collector also presents a trusted certificate.
! Create a crypto key pair for the device
crypto key generate rsa label MY_TELEMETRY_KEY modulus 2048
! Define a trustpoint for the Certificate Authority (CA) that signed your collector's certificate
crypto pki trustpoint TELEMETRY_CA_TP
enrollment terminal
revocation-check none
usage ipsec ikev2 dot1x aaa web-auth tls
! You would paste the CA certificate here after "enrollment terminal"
! Example:
! certificate chain
! -----BEGIN CERTIFICATE-----
! ... CA Certificate Data ...
! -----END CERTIFICATE-----
! quit
! Define a trustpoint for the device's own identity certificate (signed by your internal CA)
crypto pki trustpoint TELEMETRY_DEVICE_IDENTITY_TP
enrollment terminal
revocation-check none
usage ipsec ikev2 dot1x aaa web-auth tls
rsakeypair MY_TELEMETRY_KEY
! You would paste the device's certificate here
! Example:
! certificate chain
! -----BEGIN CERTIFICATE-----
! ... Device Certificate Data ...
! -----END CERTIFICATE-----
! quit
! Link these to the telemetry profile
telemetry ietf
destination-group TELEMETRY_COLLECTOR
address 192.168.10.10 port 50051
protocol grpc tls-enable
encoding encode-kvgpb
profile TELEMETRY_TLS_PROFILE
profile TELEMETRY_TLS_PROFILE
! Define which certificate the device presents and which CA to trust for the client
device-identity TELEMETRY_DEVICE_IDENTITY_TP
peer-trustpoint TELEMETRY_CA_TP
Security Warning: Implementing PKI and TLS requires careful planning and certificate management. Incorrect configurations can lead to connection failures. Always test thoroughly in a lab environment first.
Verification & Troubleshooting
Effective verification and troubleshooting are crucial for maintaining a healthy telemetry pipeline.
Verification Commands
Beyond the vendor-specific show telemetry commands, here are general verification steps:
# Verify basic network connectivity to the collector
ping 192.168.10.10
# Verify TCP port connectivity to the collector's gRPC port
# On Linux:
nc -zv 192.168.10.10 50051
# Expected output: Connection to 192.168.10.10 50051 port [tcp/*] succeeded!
# From the collector, check if the gNMI client is running and connected
# (Specific command depends on the collector, e.g., 'systemctl status telegraf')
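With many devices and collectors in play, the same reachability check is worth scripting. Below is a hedged sketch using only Python's standard `socket` module; the loopback listener merely stands in for a collector so the example is self-contained, and in practice you would call `port_reachable("192.168.10.10", 50051)`.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demo: a local listener stands in for the gRPC collector.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # the OS picks a free port
listener.listen(1)
_, demo_port = listener.getsockname()

print(port_reachable("127.0.0.1", demo_port))   # → True (listener is up)
listener.close()
print(port_reachable("127.0.0.1", demo_port))   # → False (nothing listening now)
```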
Expected Output
A healthy telemetry pipeline should show:
- Device-side: Subscriptions “Active” or “Connected,” counters for pushes increasing.
- Collector-side: Logs indicating successful connection, parsing, and forwarding of data.
- TSDB-side: Metrics appearing correctly tagged and indexed.
- Grafana/Visualization: Dashboards populated with real-time data.
Common Issues Table
| Issue | Possible Causes | Resolution Steps |
|---|---|---|
| Connection Failed | Network path between device and collector is down. Firewall blocking. Incorrect IP/Port. | Check physical connections, firewalls (e.g., ufw status, firewalld --list-all on Linux, ACLs on router), and IP addresses/ports. |
| No Data Received by Collector | Incorrect sensor group path or subscription configuration. Firewall blocking outgoing traffic from device. Device resource contention. | Verify exact YANG path on device. Check device logs (show log on Juniper, show logging on Cisco) for telemetry errors. Ensure device firewalls (if any) permit outbound gRPC traffic. Increase sample-interval temporarily. |
| Data Received, but not Parsed/Stored | Collector configuration error (wrong encoding, incorrect path mapping). TSDB is down or misconfigured. | Check collector logs for parsing errors. Verify collector’s output plugin configuration for the TSDB. Check TSDB status and logs. Ensure YANG path is correctly mapped to Prometheus metric names or InfluxDB fields. |
| High CPU/Memory on Network Device | Too many subscriptions, too low sample interval, streaming too much data. | Increase sample-interval. Filter paths to only collect essential data. Use on-change subscriptions for highly dynamic but infrequent data. Optimize YANG paths. |
| TLS/SSL Handshake Failure | Mismatched certificates, incorrect trustpoints, expired certificates, incorrect ciphers. | Verify CA, device identity, and peer trustpoint configurations. Check certificate validity dates. Ensure cipher suites are compatible. Use debug crypto pki (Cisco) or show security pki (Juniper) for certificate status. |
| Inconsistent Data (e.g., missing metrics) | Network congestion, packet loss, collector overload, device buffer overflow. | Check network health between device and collector. Monitor collector resource usage (CPU, memory, disk I/O). Increase device telemetry buffer size (if configurable). |
Debug Commands
- Cisco IOS XE/XR:
  - `debug telemetry all`: Comprehensive debugging for telemetry.
  - `debug grpc all`: Debug gRPC-specific operations.
  - `show platform software telemetry state`: Shows internal telemetry process state.
- Juniper JunOS:
  - `monitor services analytics`: Real-time monitoring of telemetry events.
  - `show log messages | grep telemetry`: Filter system logs for telemetry-related messages.
- Arista EOS:
  - `debug telemetry agent`: Debug the telemetry agent process.
  - `show agent logs | grep Telemetry`: Filter agent logs for telemetry.
Root Cause Analysis (RCA)
When troubleshooting, follow a systematic approach:
- Bottom-Up: Verify physical connectivity, then IP reachability, then port reachability (`ping`, `traceroute`, `nc`).
- Configuration Check: Double-check device telemetry configuration (paths, destinations, intervals, security).
- Process Status: Ensure telemetry processes are running on the device and collector processes are active.
- Log Analysis: Scrutinize device and collector logs for errors or warnings related to telemetry.
- Data Flow Validation: Use small, targeted Python scripts (like the example provided) or `gnmic` CLI tools to test subscriptions directly against a single device.
- Security Posture: Confirm firewalls, ACLs, and TLS configurations are correctly implemented and not blocking legitimate traffic.
Performance Optimization
Optimizing telemetry performance is crucial to avoid overwhelming network devices or the monitoring infrastructure.
- Sample Interval Tuning:
  - `Periodic` subscriptions: Only use low sample intervals (e.g., <5 seconds) for highly critical, volatile metrics (e.g., interface errors, CPU utilization). For most operational data (e.g., routing table size), higher intervals (30-60 seconds) are sufficient.
  - `On-change` subscriptions: Prefer `on-change` mode for data that changes infrequently but is critical to capture immediately (e.g., interface status changes, peer state). This reduces unnecessary data pushes.
- Efficient Data Encoding: Utilize Protocol Buffers (GPB) for gRPC telemetry. GPB is a binary format that is more compact and efficient than JSON or XML for high-volume data.
- Selective Data Collection (YANG Paths): Subscribe only to the specific YANG paths and leaves required. Avoid subscribing to entire modules or large branches if you only need a few metrics. Use filters where supported.
- Collector Scaling: Deploy collectors in a horizontally scalable architecture (e.g., multiple instances behind a load balancer). Ensure collectors have sufficient CPU, memory, and disk I/O to handle peak telemetry ingress.
- Time-Series Database Optimization:
- Choose a TSDB optimized for your data volume and query patterns (e.g., Prometheus for pull-based, InfluxDB for push-based, VictoriaMetrics for high scale).
- Implement data retention policies to automatically delete old data.
- Consider downsampling or aggregation of older data for long-term trends.
- Network Path Optimization: Ensure the network path between devices and collectors has sufficient bandwidth and low latency, especially for high-frequency telemetry.
- Device Resource Monitoring: Continuously monitor the CPU and memory utilization of network devices to ensure telemetry processing isn’t causing resource exhaustion. Adjust subscriptions if devices are overloaded.
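A back-of-envelope estimate makes the sample-interval guidance concrete: interval choice dominates raw data volume. The device count, path count, and per-update payload size below are illustrative assumptions, not measurements from any platform.

```python
def daily_volume_mb(devices: int, paths: int, bytes_per_update: int, interval_s: float) -> float:
    """Estimate raw telemetry volume per day in MB, before any TSDB compression."""
    updates_per_day = 86_400 / interval_s          # seconds per day / sample interval
    return devices * paths * bytes_per_update * updates_per_day / 1e6

# Assumed figures: 50 devices, 20 counter paths each, ~200 bytes per GPB update.
print(round(daily_volume_mb(50, 20, 200, 10)))   # → 1728  (MB/day at 10-second sampling)
print(round(daily_volume_mb(50, 20, 200, 60)))   # → 288   (MB/day at 60-second sampling)
```

Relaxing the interval from 10 s to 60 s cuts volume sixfold, which is why periodic subscriptions should be reserved for genuinely volatile metrics.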
Hands-On Lab
This lab will guide you through setting up gRPC streaming telemetry on a multi-vendor environment and collecting data.
Lab Topology
```
nwdiag {
  network mgmt_vlan {
    address = "192.168.100.0/24"
    description = "Dedicated Management/Telemetry VLAN"
    R1 [address = "192.168.100.1", description = "Cisco IOS XE Router"];
    S1 [address = "192.168.100.2", description = "Juniper JunOS Switch"];
    S2 [address = "192.168.100.3", description = "Arista EOS Switch"];
    COLLECTOR_SERVER [address = "192.168.100.10", description = "Linux VM (Telegraf/Grafana)"];
  }
  // Logical connections
  R1 -- COLLECTOR_SERVER;
  S1 -- COLLECTOR_SERVER;
  S2 -- COLLECTOR_SERVER;
}
```
Objectives
- Configure basic gRPC streaming telemetry on Cisco IOS XE, Juniper JunOS, and Arista EOS.
- Install and configure Telegraf as a gNMI collector on a Linux VM.
- Install and configure Prometheus as a time-series database.
- Install and configure Grafana for data visualization.
- Observe telemetry data streaming into Grafana dashboards.
Step-by-Step Configuration (Conceptual - requires lab environment setup)
Prerequisites:
- Three network devices (Cisco IOS XE, Juniper JunOS, Arista EOS) with management interfaces configured and reachable at 192.168.100.1, .2, .3 respectively.
- A Linux VM (e.g., Ubuntu Server) reachable at 192.168.100.10.
- Basic network connectivity verified between all devices and the VM.
Step 1: Configure Network Devices for gRPC Telemetry
- Cisco IOS XE (R1): Use the configuration from the “Cisco IOS XE/XR” section above, replacing `192.168.10.10` with `192.168.100.10`. For simplicity, start with `protocol grpc no tls` if you don’t have PKI set up.
- Juniper JunOS (S1): Use the configuration from the “Juniper JunOS” section, replacing `192.168.10.10` with `192.168.100.10`. Use `clear-text` for simplicity.
- Arista EOS (S2): Use the configuration from the “Arista EOS” section, replacing `192.168.10.10` with `192.168.100.10`. For simplicity, omit the `tls profile` line.
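For reference, a hypothetical Ansible inventory matching this topology might look like the following. Group names, the username, and connection plugins are assumptions to adapt to your environment; note that `juniper.junos.junos_config` requires a NETCONF connection, while the Cisco and Arista modules use `network_cli`.

```yaml
# inventory.yml — example lab inventory (placeholders throughout)
network_devices:
  children:
    cisco:
      hosts:
        R1:
          ansible_host: 192.168.100.1
      vars:
        ansible_network_os: ios
        ansible_connection: ansible.netcommon.network_cli
    juniper:
      hosts:
        S1:
          ansible_host: 192.168.100.2
      vars:
        ansible_network_os: junos
        ansible_connection: ansible.netcommon.netconf  # required by junos_config
    arista:
      hosts:
        S2:
          ansible_host: 192.168.100.3
      vars:
        ansible_network_os: eos
        ansible_connection: ansible.netcommon.network_cli
  vars:
    ansible_user: admin
    # Keep real credentials in Ansible Vault, never in plain-text inventory.
```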
Step 2: Install and Configure Telegraf on COLLECTOR_SERVER (192.168.100.10)
# Update package list and install Telegraf
sudo apt update
sudo apt install telegraf
# Generate a sample Telegraf configuration
telegraf --sample-config --input-filter gnmi --output-filter prometheus_client > telegraf.conf
# Edit telegraf.conf (sudo vim telegraf.conf)
# Configure the gNMI input plugin:
```toml
[[inputs.gnmi]]
  ## gNMI target devices (the collector dials in to each device)
  addresses = [
    "192.168.100.1:50051",  # Cisco
    "192.168.100.2:50051",  # Juniper
    "192.168.100.3:50051"   # Arista
  ]
  username = "your_device_username"
  password = "your_device_password"
  encoding = "proto"

  ## TLS settings
  # insecure_skip_verify = true              # for labs without proper TLS certs (NOT for production)
  # tls_ca = "/etc/telegraf/certs/ca.pem"    # for production with TLS
  # tls_cert = "/etc/telegraf/certs/client.pem"
  # tls_key = "/etc/telegraf/certs/client-key.pem"

  ## Subscriptions must match the sensor paths configured on the devices.
  ## Add one [[inputs.gnmi.subscription]] block per path; vendor-specific
  ## paths (e.g. "/junos/system/linecard/interface/") go in their own blocks.
  [[inputs.gnmi.subscription]]
    name = "gnmi_telemetry"                  # metric name prefix in Prometheus
    origin = "openconfig-interfaces"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

[[outputs.prometheus_client]]
  ## Telegraf exposes metrics on this port for Prometheus to scrape
  listen = ":9273"
  metric_version = 2

## The gnmi input tags each metric with a "source" tag (the device address),
## so no extra processor is needed to distinguish devices.
```
Save telegraf.conf to /etc/telegraf/telegraf.d/gnmi.conf or similar.
# Start Telegraf
sudo systemctl enable telegraf
sudo systemctl start telegraf
sudo systemctl status telegraf
Step 3: Install and Configure Prometheus on COLLECTOR_SERVER
# Download and extract Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.x.x/prometheus-2.x.x.linux-amd64.tar.gz
tar -xvf prometheus-2.x.x.linux-amd64.tar.gz
sudo mv prometheus-2.x.x.linux-amd64 /usr/local/prometheus
# Create prometheus.yml
sudo vim /usr/local/prometheus/prometheus.yml
```yaml
global:
  scrape_interval: 10s  # how frequently Prometheus scrapes targets

scrape_configs:
  - job_name: 'telegraf'
    static_configs:
      - targets: ['localhost:9273']  # Telegraf's Prometheus client endpoint

  - job_name: 'network_devices_ping'  # example for basic reachability monitoring
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.168.100.1
          - 192.168.100.2
          - 192.168.100.3
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115  # replace with your blackbox exporter host:port
```
- (Optional) Install and configure `blackbox_exporter` if using the ping job.
- Start Prometheus. You’ll likely want to set it up as a systemd service for production.
/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
# Access Prometheus UI at http://192.168.100.10:9090
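As noted above, running Prometheus under systemd is preferable for anything beyond a quick lab test. A minimal example unit is sketched below; the binary and config paths match the `/usr/local/prometheus` layout used in this lab, while the dedicated `prometheus` user and the `/var/lib/prometheus` data directory are assumptions you should create and adjust for your system.

```ini
# /etc/systemd/system/prometheus.service — minimal example unit
[Unit]
Description=Prometheus time-series database
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now prometheus` once the unit file is in place.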
Step 4: Install and Configure Grafana on COLLECTOR_SERVER
# Install Grafana
sudo apt-get install -y apt-transport-https software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
# Access Grafana UI at http://192.168.100.10:3000 (default admin/admin)
- Add Prometheus as a Data Source in Grafana:
  - Navigate to `Configuration -> Data Sources`.
  - Click `Add data source`, select `Prometheus`.
  - Set URL to `http://localhost:9090`.
  - Save & Test.
- Create a Dashboard:
  - Create a new dashboard.
  - Add a panel.
  - Select your Prometheus data source.
  - Enter a query, e.g., `gnmi_telemetry_interface_statistics_in_pkts_total{host="192.168.100.1"}` (metric names may vary based on Telegraf’s processing).
  - Watch the data stream in real-time.
Verification Steps
- Device-side: Run `show telemetry` commands on each device to confirm subscriptions are active and connected.
- Telegraf: Check Telegraf logs (`sudo journalctl -u telegraf -f`) for successful gNMI connections and metric collection.
- Prometheus: Access `http://192.168.100.10:9090/targets` to ensure Telegraf’s Prometheus client is being scraped successfully. Use the Prometheus graph explorer to query for metrics like `gnmi_telemetry_...` to confirm data is present.
- Grafana: Create and view dashboards with queries for interface counters, CPU utilization, etc., from your network devices.
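The scrape pipeline can also be sanity-checked programmatically. The sketch below parses Prometheus text exposition format (what Telegraf serves on `:9273/metrics`) and lists metric names by prefix. The sample text and metric names are illustrative only, since actual names depend on your Telegraf subscription configuration.

```python
import re

def metric_names(exposition_text: str, prefix: str = "gnmi") -> set:
    """Collect unique metric names with a given prefix from
    Prometheus text exposition format."""
    names = set()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = re.match(r"([A-Za-z_:][A-Za-z0-9_:]*)", line)
        if m and m.group(1).startswith(prefix):
            names.add(m.group(1))
    return names

# Illustrative exposition text standing in for a live /metrics response.
sample = """# HELP gnmi_telemetry_in_octets ...
gnmi_telemetry_in_octets{source="192.168.100.1"} 1.234e+06
gnmi_telemetry_out_octets{source="192.168.100.1"} 9.87e+05
up 1
"""
print(sorted(metric_names(sample)))  # → ['gnmi_telemetry_in_octets', 'gnmi_telemetry_out_octets']
```

Against the live collector, you would fetch the text with, e.g., `urllib.request.urlopen("http://192.168.100.10:9273/metrics").read().decode()` and pass it to `metric_names()`.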
Challenge Exercises
- Modify a device configuration to stream a different set of metrics (e.g., CPU utilization or routing table size). Update Telegraf and Grafana to visualize this new data.
- Change the `sample-interval` on one device and observe the effect on the Grafana dashboard’s granularity.
- Implement basic TLS for gRPC (if you have a simple CA/certificate setup) and update device and Telegraf configurations.
- Add SNMP monitoring for `sysUpTime` from the devices into Prometheus using Telegraf’s SNMP input plugin or Prometheus’s SNMP exporter.
Best Practices Checklist
- Standardized Data Models: Prioritize OpenConfig YANG models for multi-vendor consistency, falling back to vendor-native YANG when necessary.
- Secure by Design: Implement TLS/SSL for all telemetry streams. Use strong authentication (client certificates) and granular authorization.
- Dedicated Management Network: Isolate telemetry traffic on a separate network segment.
- Scalable Collector Architecture: Design collectors for horizontal scaling and redundancy to handle increasing data volumes.
- Appropriate Granularity: Tune `sample-interval` and leverage `on-change` subscriptions judiciously to avoid overwhelming devices or collectors.
- Efficient Encoding: Use binary encoding (GPB) for gRPC telemetry.
- Automated Deployment: Use Ansible or Python to automate the configuration of telemetry subscriptions across all network devices.
- Version Control: Store all telemetry configurations (device, collector, dashboard) in a version control system (Git) as Infrastructure as Code.
- Comprehensive Monitoring: Monitor the health and performance of the telemetry pipeline itself (devices’ CPU/memory, collector resources, TSDB health).
- Actionable Alerting: Configure alerts on significant deviations or anomalies detected from telemetry data.
- Data Retention Policy: Define and implement data retention policies for your TSDB to manage storage costs and query performance.
- Documentation: Maintain clear documentation of telemetry paths, data models, collector configurations, and dashboard structures.
- Regular Audits: Periodically audit telemetry configurations and access controls for security and compliance.
Reference Links
- NETCONF RFC: RFC 6241 - Network Configuration Protocol (NETCONF)
- YANG RFC: RFC 7950 - The YANG 1.1 Data Modeling Language
- RESTCONF RFC: RFC 8040 - RESTCONF Protocol
- OpenConfig: openconfig.net
- Cisco DevNet YANG Suite: developer.cisco.com/yangsuite
- gRPC: grpc.io
- gNMI Specification: github.com/openconfig/gnmi
- Prometheus: prometheus.io
- Grafana: grafana.com
- Telegraf: influxdata.com/time-series-platform/telegraf/
- Ansible Network Automation: docs.ansible.com/ansible/latest/network/index.html
- Python `grpcio` library: pypi.org/project/grpcio/
- Python `gnmic` library (unofficial but useful): pypi.org/project/gnmic/
What’s Next
This chapter has equipped you with the foundational knowledge and practical skills to implement modern network monitoring and observability solutions using NetDevOps principles. You’ve seen how streaming telemetry, combined with standardized data models and automation, transforms reactive troubleshooting into proactive network management.
In the next chapter, we will delve into “Advanced Network Analytics and AI/ML for NetDevOps.” Building upon the rich telemetry data you’re now collecting, we will explore techniques for extracting deeper insights, predicting outages, detecting anomalies using machine learning, and integrating these advanced analytics into your continuous improvement pipeline. Get ready to turn data into predictive intelligence!