Network telemetry is the collection and analysis of network data. Understanding what's happening on your network is essential for troubleshooting, optimization, and security.
Types of Network Data
1. Flow Data "What traffic is flowing where?"
Source IP: 10.0.1.50
Dest IP: 10.1.0.20
Source Port: 54321
Dest Port: 443
Protocol: TCP
Bytes: 50,000
Packets: 250
Start Time: 14:32:15
Duration: 30 seconds
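As a rough illustration, the flow record above can be modelled as a small data structure with a couple of derived statistics. The class and field names here are hypothetical and do not come from any particular collector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow, mirroring the example fields above."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    byte_count: int
    packet_count: int
    duration_s: float

    def avg_bits_per_second(self) -> float:
        # Average over the whole flow; short bursts are not visible here.
        if self.duration_s <= 0:
            return 0.0
        return self.byte_count * 8 / self.duration_s

    def avg_packet_size(self) -> float:
        if self.packet_count == 0:
            return 0.0
        return self.byte_count / self.packet_count


flow = FlowRecord("10.0.1.50", "10.1.0.20", 54321, 443, "TCP",
                  byte_count=50_000, packet_count=250, duration_s=30)
print(f"{flow.avg_bits_per_second():.0f} bit/s, "
      f"{flow.avg_packet_size():.0f} bytes/packet")
```

For the example record this works out to roughly 13 kbit/s and 200 bytes per packet, which is the kind of summary a flow record can give without any packet contents.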
2. Packet Data "What's in each packet?"
Ethernet Frame:
├─ Source MAC: aa:bb:cc:dd:ee:01
├─ Dest MAC: aa:bb:cc:dd:ee:02
├─ Protocol: IPv4
├─ IP Header...
├─ TCP Header...
└─ Payload: ...
3. Counters "How many total packets/bytes?"
Interface stats:
├─ Total packets in: 1,000,000
├─ Total packets out: 950,000
├─ Dropped packets: 50,000
├─ Errors: 100
├─ Collisions: 0
└─ Timestamp: 14:32:15
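Counters like these are cumulative, so a single reading says little on its own; rates come from the difference between two timestamped readings. A minimal sketch follows, assuming dropped packets are included in the inbound total (this varies by platform) and using a made-up second reading taken 60 seconds later.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    timestamp_s: float   # when the counters were read
    packets_in: int      # cumulative inbound packets
    dropped: int         # cumulative dropped packets


def rates(prev: CounterSample, curr: CounterSample) -> dict:
    """Turn two cumulative readings into per-interval rates."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta_in = curr.packets_in - prev.packets_in
    delta_drop = curr.dropped - prev.dropped
    return {
        "packets_in_per_s": delta_in / elapsed,
        "drop_pct": 100 * delta_drop / delta_in if delta_in else 0.0,
    }


first = CounterSample(timestamp_s=0, packets_in=1_000_000, dropped=50_000)
second = CounterSample(timestamp_s=60, packets_in=1_060_000, dropped=50_600)  # hypothetical
print(rates(first, second))
```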
4. Logs "What happened?"
14:32:15 - BGP session down: 203.0.113.1
14:32:20 - New route learned: 10.0.0.0/8 via 203.0.113.2
14:32:25 - DDoS detected: 1M packets/sec from 203.0.113.202
Flow-Based Telemetry
NetFlow (Cisco) Industry standard for flow data:
NetFlow v5 record:
├─ Source IP
├─ Destination IP
├─ Source Port
├─ Destination Port
├─ IP Protocol (TCP/UDP/ICMP)
├─ Input interface
├─ Output interface
├─ ToS, next hop, AS numbers, prefix masks
├─ Packets
├─ Bytes
├─ Start/End timestamps
└─ TCP flags
Every 'flow' exported as one record
Collector gathers records for analysis
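To make the record layout concrete, here is a minimal sketch of how a collector might decode a NetFlow v5 export datagram using only the standard library. The function name and the returned dictionary keys are illustrative choices; the sample datagram is built in memory rather than received from a router.

```python
import ipaddress
import struct

# NetFlow v5 wire layout (network byte order): 24-byte header, 48-byte records.
HEADER = struct.Struct("!HHIIIIBBH")
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def parse_netflow_v5(datagram: bytes) -> list:
    """Decode one export datagram into a list of flow dictionaries."""
    version, count = HEADER.unpack_from(datagram, 0)[:2]
    if version != 5:
        raise ValueError(f"expected NetFlow v5, got version {version}")
    flows = []
    for i in range(count):
        (src, dst, _next_hop, in_if, out_if, packets, octets,
         first_ms, last_ms, src_port, dst_port, _pad1, tcp_flags,
         protocol, _tos, _src_as, _dst_as, _src_mask, _dst_mask,
         _pad2) = RECORD.unpack_from(datagram, HEADER.size + i * RECORD.size)
        flows.append({
            "src": str(ipaddress.IPv4Address(src)),
            "dst": str(ipaddress.IPv4Address(dst)),
            "src_port": src_port,
            "dst_port": dst_port,
            "protocol": protocol,          # IANA number, e.g. 6 = TCP
            "input_if": in_if,
            "output_if": out_if,
            "packets": packets,
            "bytes": octets,
            "tcp_flags": tcp_flags,
            # First/Last are router uptime in milliseconds.
            "duration_s": (last_ms - first_ms) / 1000,
        })
    return flows


# Synthetic datagram carrying one record, for demonstration only.
sample = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0) + RECORD.pack(
    ipaddress.IPv4Address("10.0.1.50").packed,
    ipaddress.IPv4Address("10.1.0.20").packed,
    bytes(4), 1, 2, 250, 50_000, 0, 30_000,
    54321, 443, 0, 0x1B, 6, 0,
    0, 0, 0, 0, 0)
print(parse_netflow_v5(sample))
```

A real collector would read datagrams from a UDP socket and also track the header's sequence number to detect lost exports; both are omitted here.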
sFlow (sample-based) Statistical sampling:
Sample 1 in 10,000 packets
Extrapolate statistics:
- 10,000 bytes measured
- Actual traffic likely ~100MB
- Lower overhead than NetFlow
- Less accuracy
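The extrapolation above is a simple scale-up by the sampling rate. The sketch below shows that arithmetic, plus a commonly cited sFlow rule of thumb for the worst-case relative error at 95% confidence, which depends on the number of samples collected. The function names and the sample count of 100 are assumptions made for illustration.

```python
import math


def extrapolate_bytes(sampled_bytes: int, sampling_rate: int) -> int:
    """Scale sampled bytes up by the 1-in-N sampling rate."""
    return sampled_bytes * sampling_rate


def max_error_pct(sample_count: int) -> float:
    """Rule-of-thumb error bound: 196 * sqrt(1 / samples), in percent."""
    return 196 * math.sqrt(1 / sample_count)


estimate = extrapolate_bytes(sampled_bytes=10_000, sampling_rate=10_000)
print(f"estimated traffic: {estimate / 1_000_000:.0f} MB")
print(f"error bound with 100 samples: {max_error_pct(100):.1f}%")  # hypothetical count
```

The error bound shrinks with the square root of the sample count, which is why sampled data is more trustworthy for large flows than for small ones.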
IPFIX (IP Flow Information Export) IETF standard derived from NetFlow v9:
Extensible:
- Can add custom fields
- Flexible templates
- Internet standard (RFC 7011)
Flow Collection Architecture
┌───┐  NetFlow
│ A ├─────────────┐
└───┘             │
                  ↓
┌───┐      ┌─────────────┐     ┌───────────────┐
│ B ├─────►│  Collector  │────►│  Analysis &   │
└───┘      │ (Port 2055) │     │  Dashboard    │
           └─────────────┘     │ (Grafana, ELK)│
┌───┐  NetFlow v9 ↑            └───────────────┘
│ C ├─────────────┘
└───┘
Flow Collectors:
- Cisco Prime, Cisco Tetration
- Kentik Detect
- SolarWinds NetFlow Traffic Analyzer
- Open Source: ntopng, flow-tools
Packet Analysis (tcpdump, Wireshark)
tcpdump — Command-line packet capture:
# Capture all traffic on eth0
tcpdump -i eth0
# Capture and save to file
tcpdump -i eth0 -w traffic.pcap
# Filter: only TCP traffic
tcpdump -i eth0 tcp
# Filter: only DNS traffic (UDP port 53)
tcpdump -i eth0 udp port 53
# Rotate to a new capture file after roughly 10 MB
tcpdump -i eth0 -C 10 -w traffic
Wireshark — GUI packet analyzer:
Visualize pcap files:
├─ Protocol layers
├─ Packet-by-packet breakdown
├─ Flow view
└─ Statistics
Useful for:
- Debugging application issues
- Understanding protocol behavior
- Troubleshooting packet loss
- Analyzing network attacks
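For cases where a capture needs to be inspected programmatically rather than in the Wireshark GUI, the classic pcap format written by `tcpdump -w` is simple enough to read directly. The following is a minimal sketch that handles only microsecond-resolution classic pcap (not pcapng); the function name is hypothetical and the sample bytes are synthetic.

```python
import struct


def iter_packets(data: bytes):
    """Yield (timestamp, captured_bytes, original_length) from pcap bytes."""
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"          # file written on a little-endian host
    elif magic == 0xD4C3B2A1:
        endian = ">"          # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = 24               # skip the 24-byte global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        offset += 16
        yield ts_sec + ts_usec / 1_000_000, data[offset:offset + incl_len], orig_len
        offset += incl_len


# Synthetic capture: global header plus one truncated 60-byte frame.
capture = (
    struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    + struct.pack("<IIII", 1_700_000_000, 500_000, 4, 60)
    + b"\xaa\xbb\xcc\xdd"
)
for ts, payload, orig_len in iter_packets(capture):
    print(ts, payload.hex(), orig_len)
```

For anything beyond quick inspection, a dedicated parsing library or Wireshark's own dissectors will be far more robust than this sketch.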
Metrics and KPIs
Key Metrics:
| Metric | Meaning | Example |
|---|---|---|
| Throughput | Data transferred per second (bits/sec) | 100 Mbps |
| Packet Loss | % of packets lost | 0.1% |
| Latency | Delay time | 50ms |
| Jitter | Variation in latency | ±5ms |
| Availability | Uptime % | 99.9% |
| Utilization | % of capacity used | 65% |
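Several of the metrics in this table can be derived from a series of probe round-trip times. The sketch below is one possible way to compute them, assuming `None` marks a probe that got no reply and defining jitter as the mean absolute difference between consecutive replies (other definitions exist). The sample values are invented.

```python
from statistics import mean
from typing import List, Optional


def summarize_probes(rtts_ms: List[Optional[float]]) -> dict:
    """Compute latency, jitter and loss from probe results."""
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    latency = mean(replies) if replies else None
    # Jitter: average change between consecutive successful probes.
    diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_pct": loss_pct}


def utilization_pct(throughput_bps: float, capacity_bps: float) -> float:
    return 100 * throughput_bps / capacity_bps


print(summarize_probes([50, 52, 48, None, 50]))   # hypothetical samples
print(utilization_pct(65_000_000, 100_000_000))
```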
Performance KPIs:
Response Time:
Goal: <100ms
Measurement: HTTP server response
Trend: Increasing → investigate slowness
Packet Loss:
Goal: <0.1%
Measurement: SNMP counters
Trend: Spikes → check congestion
Connection Success:
Goal: 99.99%
Measurement: TCP SYN → SYN-ACK success rate
Trend: Dropping → check availability
SNMP (Simple Network Management Protocol)
Purpose: Collect network device statistics
SNMP Versions:
| Version | Security | Use |
|---|---|---|
| v1 | Community string (plain text) | Legacy, don't use |
| v2c | Community string (plain text) | Simple monitoring on trusted networks |
| v3 | Authentication and encryption | Production recommended |
Common SNMP OIDs (Object Identifiers):
1.3.6.1.2.1.1.3 — System uptime
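1.3.6.1.2.1.2.2.1.2 — Interface description (ifDescr)
1.3.6.1.2.1.2.2.1.5 — Interface speed in bits/sec (ifSpeed)
1.3.6.1.2.1.2.2.1.10 — Octets in (ifInOctets)
1.3.6.1.2.1.2.2.1.16 — Octets out (ifOutOctets)
1.3.6.1.2.1.2.2.1.13 — Inbound discarded packets (ifInDiscards)
1.3.6.1.2.1.2.2.1.19 — Outbound discarded packets (ifOutDiscards)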
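The octet counters are 32-bit values that wrap around to zero, so link utilization has to be computed from the difference between two polls with the wrap taken into account. A minimal sketch, assuming the values have already been fetched from the inbound octet counter and the interface speed OID, and that at most one wrap happened between polls:

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter readings, tolerating one wrap."""
    return (curr - prev) % modulus


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_speed_bps: int) -> float:
    """Percentage of link capacity used during the polling interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100 * bits / (interval_s * if_speed_bps)


# Hypothetical polls 60 s apart on a 100 Mbps interface; the counter
# wrapped past 2**32 between the two readings.
print(utilization_pct(prev_octets=4_000_000_000, curr_octets=192_532_704,
                      interval_s=60, if_speed_bps=100_000_000))
```

On fast links a 32-bit counter can wrap more than once between polls, so shorter polling intervals or 64-bit counters are worth considering where the device supports them.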
SNMP Walk (Collect Data):
# Walk the device's MIB tree (defaults to the mib-2 subtree when no OID is given)
snmpwalk -v 2c -c public 192.168.1.1
# Get specific value
snmpget -v 2c -c public 192.168.1.1 \
1.3.6.1.2.1.1.3.0
# Returns the system uptime (sysUpTime)
Application Performance Monitoring (APM)
Full-Stack Telemetry:
User Experience
↑
┌────────────────┐
│ Frontend │ (JavaScript errors, page load time)
├────────────────┤
│ Network │ (DNS time, TCP connection time)
├────────────────┤
│ Application │ (HTTP response time, database queries)
├────────────────┤
│ Infrastructure │ (CPU, memory, disk, network)
└────────────────┘
Tools: Datadog, New Relic, Dynatrace, Elastic
Monitoring Tools Comparison
| Tool | Type | For |
|---|---|---|
| Prometheus | Metrics | Infrastructure, applications |
| Grafana | Visualization | Dashboard, alerting |
| ELK Stack | Logs/Metrics | Centralized logging |
| Datadog | APM | Full-stack monitoring |
| Wireshark | Packet | Detailed analysis |
| ntopng | Flow | Network behavior |
| tcpdump | Packet | Quick capture |
Real-Time Network Monitoring
Types of Monitoring:
Push-Based (Agent-based):
┌────────────────┐
│ Agent on host │ (constantly sends data)
│ Sends metrics │
└────────┬───────┘
│ Push
↓
┌──────────┐
│Collector │
└──────────┘
Pros: Near real-time, detailed
Cons: Overhead on every host, harder to scale
Pull-Based (Scrape-based):
┌──────────────┐
│ Monitoring │ Periodically requests metrics
│ System │ (asks for data)
└──────┬───────┘
│ Pull
↓
┌──────────────┐
│ Host metrics │
│ endpoint │
└──────────────┘
Pros: Easier to scale, host controls what it exposes
Cons: Can miss spikes that occur between scrapes
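The "missed spike" drawback of pull-based monitoring is easy to demonstrate with a small simulation. This sketch assumes the endpoint exposes an instantaneous gauge and that the scraper reads it every 15 seconds; all numbers are invented.

```python
def scrape(per_second_values: list, interval_s: int) -> list:
    """Pull model: read the instantaneous value once every interval_s seconds."""
    return per_second_values[::interval_s]


# One minute of traffic in Mbps with a 5-second burst starting at t=22 s.
traffic = [100] * 60
traffic[22:27] = [500] * 5

scraped = scrape(traffic, interval_s=15)   # reads at t = 0, 15, 30, 45
print("true peak:   ", max(traffic))
print("scraped peak:", max(scraped))
```

The burst falls entirely between two scrapes, so the scraped series never sees it. Exposing cumulative counters instead of gauges is one common way to soften this, since the burst then still shows up in the next delta.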
Network Telemetry Use Cases
Use Case 1: Anomaly Detection
Normal traffic: 100 Mbps
Baseline: 99 Mbps ±5%
Anomaly: Spike to 500 Mbps
Alert: "DDoS or traffic spike detected"
→ Investigate immediately
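The baseline check in this use case can be sketched as a simple tolerance test. The function name and the fixed ±5% band are assumptions for illustration; production systems typically learn baselines per time of day and use statistical thresholds.

```python
def is_anomalous(value_mbps: float, baseline_mbps: float,
                 tolerance_pct: float = 5.0) -> bool:
    """True when the value falls outside baseline ± tolerance."""
    allowed_deviation = baseline_mbps * tolerance_pct / 100
    return abs(value_mbps - baseline_mbps) > allowed_deviation


for observed in (100, 500):
    status = "ALERT" if is_anomalous(observed, baseline_mbps=99) else "ok"
    print(f"{observed} Mbps: {status}")
```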
Use Case 2: Capacity Planning
Current: 65% utilized, trending up 2 percentage points per week
Projection: Will hit 80% in about 7-8 weeks ((80 - 65) / 2 = 7.5)
Action: Provision more capacity
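A linear projection like this is a one-line calculation. The sketch below assumes "2% per week" means two percentage points per week and that growth stays linear, which real traffic rarely does for long.

```python
def weeks_until(threshold_pct: float, current_pct: float,
                growth_pct_points_per_week: float) -> float:
    """Weeks until utilization reaches the threshold under linear growth."""
    if growth_pct_points_per_week <= 0:
        raise ValueError("utilization is not growing")
    return max(0.0, (threshold_pct - current_pct) / growth_pct_points_per_week)


print(weeks_until(threshold_pct=80, current_pct=65,
                  growth_pct_points_per_week=2))
```

Under these assumptions the link reaches 80% after 7.5 weeks, which sets the lead time available for provisioning.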
Use Case 3: Troubleshooting
"Users report slow service"
Telemetry shows:
- High latency to database (300ms vs 50ms normally)
- Increased packet loss on database link
- Likely root cause: Database server overwhelmed
Fix: Scale database or investigate queries
Use Case 4: Security
NetFlow shows:
- Sudden outbound traffic to unknown IP
- Large data transfer (1GB/min)
- Destination: 203.0.113.202 (malicious IP)
Response: Block outgoing traffic to that IP
Quarantine affected server
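The detection in this use case amounts to aggregating outbound bytes per source and destination and flagging large transfers to destinations that are not already known. A minimal sketch, where the internal network, the known-destination list, and the 500 MB per minute threshold are all hypothetical values:

```python
import ipaddress
from collections import defaultdict


def suspicious_transfers(flows, internal_net: str, known_dests: set,
                         max_bytes_per_min: int) -> dict:
    """flows: (src_ip, dst_ip, byte_count) tuples observed in one minute."""
    net = ipaddress.ip_network(internal_net)
    totals = defaultdict(int)
    for src, dst, byte_count in flows:
        outbound = (ipaddress.ip_address(src) in net
                    and ipaddress.ip_address(dst) not in net)
        if outbound and dst not in known_dests:
            totals[(src, dst)] += byte_count
    return {pair: total for pair, total in totals.items()
            if total > max_bytes_per_min}


one_minute_of_flows = [
    ("10.0.1.50", "203.0.113.202", 600_000_000),
    ("10.0.1.50", "203.0.113.202", 400_000_000),
    ("10.0.1.51", "198.51.100.10", 900_000_000),   # known destination
    ("10.0.1.52", "10.1.0.20", 2_000_000_000),     # internal traffic
]
print(suspicious_transfers(one_minute_of_flows, "10.0.0.0/8",
                           known_dests={"198.51.100.10"},
                           max_bytes_per_min=500_000_000))
```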
Best Practices for Network Telemetry
✓ Collect baseline metrics before problems
✓ Set realistic alerting thresholds
✓ Monitor end-user experience, not just infrastructure
✓ Track trends over time (capacity planning)
✓ Archive data for post-incident analysis
✓ Use correlation (if latency high AND CPU high → bottleneck)
✓ Combine flow data with logs for context
✓ Securely store sensitive network data
✓ Test alerting (make sure notifications work)
✓ Regularly review and adjust metrics
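The correlation practice above (latency high AND CPU high → bottleneck) can be expressed as a small rule. This is only a sketch; the function name, the threshold values, and the wording of each diagnosis are illustrative assumptions.

```python
def diagnose(latency_ms: float, cpu_pct: float,
             latency_limit_ms: float = 100.0,
             cpu_limit_pct: float = 90.0) -> str:
    """Combine two signals instead of alerting on each in isolation."""
    high_latency = latency_ms > latency_limit_ms
    high_cpu = cpu_pct > cpu_limit_pct
    if high_latency and high_cpu:
        return "bottleneck: host is likely CPU-bound"
    if high_latency:
        return "latency high, CPU normal: check the network path or a dependency"
    if high_cpu:
        return "CPU high, latency normal: watch remaining headroom"
    return "ok"


print(diagnose(latency_ms=300, cpu_pct=97))
print(diagnose(latency_ms=300, cpu_pct=40))
```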
Key Concepts
- NetFlow/IPFIX = Standard flow export
- Flow data = Source, destination, bytes, packets, duration
- Packet capture = Detailed but resource-intensive
- SNMP = Device statistics collection
- Metrics = Quantitative measurements
- APM = Application performance monitoring across the full stack, including end-user experience
- Baseline = Normal behavior to detect anomalies
- Correlation = Relate multiple signals for insight
- Telemetry enables visibility into network behavior
- Visibility is prerequisite for optimization and security