Network performance determines user experience and application efficiency. Understanding performance characteristics and how to optimize them is critical for DevOps engineers.
Key Performance Metrics
Latency (Delay) Time for data to travel from source to destination:
Round Trip Time (RTT): 50ms
One Way Latency: 25ms
Measured by: ping, traceroute, synthetic monitoring
Bandwidth (Capacity) Maximum data rate a link can carry:
Gigabit Ethernet: 1 Gbps = 125 MB/s
10 Gbps = 1.25 GB/s
Lowest-capacity link in the path = the bottleneck
Throughput (Actual Rate) Actual data rate achieved:
Theoretical max: 1 Gbps link
Actual throughput: 950 Mbps (95%)
Overhead: TCP headers, IP headers, retransmissions, etc.
Packet Loss Percentage of packets that don't arrive:
Sent: 1000 packets
Received: 998 packets
Lost: 2 packets = 0.2% loss
Causes: Congestion, line errors, buffer overflow
Jitter Variance in latency:
Normal: 50ms ± 5ms (jitter = 5ms)
Bad: 50ms ± 50ms (jitter = 50ms)
Affects: VoIP quality, streaming smoothness
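A rough jitter estimate is available from ping itself: the mdev field in its summary line is the mean deviation of the RTT samples.
ping -c 100 server.example.com | tail -1
# e.g. rtt min/avg/max/mdev = 25.1/25.9/48.3/2.4 ms → mdev ≈ 2.4 ms of jitter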
Latency Sources
Propagation Delay Speed of light through medium:
Speed: ~200,000 km/sec in fiber
Distance: 100 km = 100,000 m
Delay: 100,000 m / 200,000,000 m/s = 0.0005 s = 0.5 ms
Minimum latency based on geography
Cannot improve below this
Processing Delay Time to examine and forward packets:
Router: Read header, lookup route: ~1ms
Switch: Learn MAC, forward: ~0.1ms
Firewall: Stateful inspection: ~5ms
Sum of all hops
Queuing Delay Wait in buffer if link busy:
Link utilization: 90%
Packets queued: High
Queuing delay: +20ms
Congestion causes
Can spike dramatically
Serialization Delay Time to transmit packet bits:
1500-byte packet on 1 Gbps link:
1500 bytes = 12000 bits
12000 bits / 1000000000 bps = 12 microseconds
High bandwidth = lower serialization delay
Total Latency = Propagation + Processing + Queuing + Serialization
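Worked example with the figures above: 0.5 ms propagation + 1 ms processing + 20 ms queuing + 0.012 ms serialization ≈ 21.5 ms one-way.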
Bandwidth Utilization
Link Capacity vs Actual Use
┌─────────────────────────────────────┐
│ 1 Gbps Ethernet link available │
├─────────────────────────────────────┤
│ Used: 600 Mbps (60%) │
│ Available: 400 Mbps (40%) │
├─────────────────────────────────────┤
│ Check: How much can add before │
│ congestion? Answer: ~400 Mbps more │
└─────────────────────────────────────┘
Rule of thumb: Keep under 70% for headroom
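One way to check current utilization on Linux is to sample the kernel's byte counters in sysfs; a minimal sketch, assuming interface eth0 on a 1 Gbps link:
# RX bytes now, and again one second later
RX1=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 1
RX2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
# Delta bytes * 8 = bits/sec; compare against the 1000 Mbps capacity
echo "RX: $(( (RX2 - RX1) * 8 / 1000000 )) Mbps of 1000 Mbps"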
Measuring Network Performance
Ping Measure RTT:
ping 8.8.8.8
# Output:
# PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
# 64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=25.3 ms
# 64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=25.2 ms
# 64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=27.1 ms
# --- 8.8.8.8 ping statistics ---
# rtt min/avg/max/mdev = 25.2/25.9/27.1/0.8 ms
Traceroute Show path and latency at each hop:
traceroute google.com
# Output shows:
# Hop 1: 192.168.1.1 1.2 ms
# Hop 2: 203.0.113.1 5.3 ms
# Hop 3: 203.0.113.100 15.2 ms
# Hop 4: 8.8.8.1 25.3 ms
MTR (My Traceroute) Combines ping and traceroute, continuous monitoring:
mtr -c 100 google.com
Shows:
- Packet loss % at each hop
- Latency statistics (min/avg/max)
- Continuously updated
iperf/iperf3 Measure TCP/UDP throughput:
# Server
iperf3 -s
# Client
iperf3 -c server.example.com -t 30
# Output:
# [ ID] Interval Transfer Bitrate
# [ 5] 0.00-30.00 3.62 GBytes 1.04 Gbps
netperf Network performance benchmarking:
netperf -H server.example.com -t TCP_RR
# Request/Response latency test
Performance Optimization Techniques
1. Link Aggregation (Bonding)
Multiple links → Single logical link:
┌─ Link 1 (1 Gbps) ─┐
├─ Link 2 (1 Gbps) ─┤ → Bonded: 2 Gbps
├─ Link 3 (1 Gbps) ─┤
└─ Link 4 (1 Gbps) ─┘
Benefits:
- Higher throughput (sum of links)
- Failover if one link fails
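On Linux a bond can be sketched with iproute2; this assumes member interfaces eth0/eth1 and a switch configured for LACP (802.3ad):
# Create the bond and enslave both links (links must be down to enslave)
sudo ip link add bond0 type bond mode 802.3ad
sudo ip link set eth0 down && sudo ip link set eth0 master bond0
sudo ip link set eth1 down && sudo ip link set eth1 master bond0
sudo ip link set bond0 up
Note that a single TCP flow still hashes onto one member link; the 2 Gbps aggregate applies across flows.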
2. Compression
Reduce data volume:
Uncompressed HTTP: 1 MB
Compressed (gzip): 200 KB (80% reduction)
Benefits:
- Less bandwidth needed
- Faster transfer
- Less congestion
Tradeoff: CPU time for compression/decompression
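The saving is easy to measure with curl by comparing transfer sizes with and without gzip (example.com is a placeholder URL):
# Uncompressed transfer size
curl -s -o /dev/null -w '%{size_download} bytes\n' https://example.com/page.html
# Ask for gzip; size_download now reflects the compressed transfer
curl -s -H 'Accept-Encoding: gzip' -o /dev/null -w '%{size_download} bytes\n' https://example.com/page.html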
3. Protocol Optimization
TCP Tuning:
# Increase TCP window size (more in-flight data)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
# Larger SYN backlog (more pending handshakes during connection bursts)
sysctl -w net.ipv4.tcp_max_syn_backlog=5120
# Reuse sockets in TIME_WAIT state for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1
UDP for Real-Time:
TCP: Reliable but retransmits (adds latency)
UDP: Unreliable but fast (good for gaming, VoIP)
Choice depends on tolerance for loss vs latency
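iperf3 can exercise the UDP path directly and reports jitter and datagram loss, which maps onto the real-time criteria above:
# 10-second UDP test at 100 Mbps against the iperf3 server from earlier
iperf3 -c server.example.com -u -b 100M -t 10
# Summary shows Jitter (ms) and Lost/Total Datagrams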
4. QoS (Quality of Service)
Prioritize traffic:
Traffic Classes:
├─ Voice: Highest priority (must be ≤150ms)
├─ Video: High priority (must be ≤300ms)
├─ Web: Medium priority
└─ Best Effort: Low priority
When congested:
- Drop Best Effort traffic first
- Keep Voice traffic flowing
Result: Critical apps get consistent experience
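QoS is usually enforced on network gear, but a minimal host-side sketch with Linux tc (assuming egress interface eth0) gives the idea:
# Three-band priority qdisc on the egress interface
sudo tc qdisc add dev eth0 root handle 1: prio
# Steer DSCP EF traffic (voice, ToS byte 0xb8) into the top-priority band
sudo tc filter add dev eth0 parent 1: protocol ip u32 match ip tos 0xb8 0xff flowid 1:1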
5. Caching and CDN
Direct (no cache):
Client → Origin Server (30ms latency)
With CDN/Cache:
Client → Edge Server (5ms latency, cached copy)
Benefits:
- Lower latency
- Less origin server load
- Reduced bandwidth
Examples: Cloudflare, Akamai, AWS CloudFront
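Whether an edge served a request can usually be read from response headers (cdn.example.com is a placeholder; exact header names vary by provider):
curl -sI https://cdn.example.com/asset.js | grep -iE 'cache-control|age|x-cache'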
6. Connection Pooling
Reuse connections:
Without pooling:
Request 1: TCP handshake (30ms) + request (10ms) = 40ms
Request 2: TCP handshake (30ms) + request (10ms) = 40ms
Total: 80ms
With pooling:
Request 1: TCP handshake (30ms) + request (10ms) = 40ms
Request 2: Reuse same connection (10ms) = 10ms
Total: 50ms
Biggest improvement for many small requests
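curl shows the effect when given two URLs on the same host in one invocation: it reuses the first connection, so the second transfer reports ~0s connect time (example.com is a placeholder):
curl -s -o /dev/null -o /dev/null -w 'connect: %{time_connect}s total: %{time_total}s\n' https://example.com/ https://example.com/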
7. MTU (Maximum Transmission Unit) Tuning
Standard: 1500 bytes (Ethernet payload; the full frame is 1518 bytes with header and FCS)
Jumbo Frames: 9000 bytes
Larger MTU:
├─ Fewer packets for same data
├─ Lower per-packet overhead
├─ Higher throughput
└─ Lower CPU usage
Tradeoff: Not supported everywhere
Requirement: All network devices support same MTU
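A path can be pre-checked with a don't-fragment ping (Linux ping; 8972-byte payload + 28 bytes of ICMP/IP headers = 9000):
ping -M do -s 8972 server.example.com
# Errors or 100% loss mean some device in the path is below MTU 9000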
Set MTU:
# View current
ip link show eth0
# Change (temporary)
sudo ip link set eth0 mtu 9000
# Persistent (varies by distro)
# In netplan or network config
Network Bottleneck Identification
Step 1: Measure
ping server.example.com → RTT = 100ms (high?)
iperf3 -c server.example.com → 100 Mbps (low?)
traceroute → Where is latency? Which hop?
Step 2: Analyze
High latency:
✓ Is it propagation? (geography, can't improve)
✓ Is it processing? (router CPU high?)
✓ Is it queuing? (link utilization high?)
✓ Is it congestion? (packet loss detected?)
Step 3: Locate Bottleneck
Throughput test shows 100 Mbps on 1 Gbps link:
Run: ip -s link show eth0 (interface statistics, errors, collisions)
Look for:
- High TX errors? Driver issue
- High collisions? Half-duplex link
- Interface down? Connection problem
- Duplex mismatch? Speed negotiation issue
Run: netstat -i (per-interface packet counters)
Look for: errors (RX-ERR/TX-ERR) and drops (RX-DRP/TX-DRP)
Step 4: Fix
Common fixes:
├─ Clear congestion (add capacity, reroute traffic)
├─ Fix duplex mismatch (force full-duplex; see the ethtool sketch below)
├─ Update drivers (newer = often better)
├─ Physically move server (reduce latency)
├─ Add link aggregation (increase capacity)
├─ Optimize routes (fewer hops)
└─ Enable QoS (prioritize critical traffic)
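For the duplex-mismatch fix above, ethtool reports the negotiated mode and can pin one (assumes eth0; forced settings must match the switch port):
ethtool eth0 | grep -E 'Speed|Duplex'
# Classic fix for a 100 Mbps mismatch: pin both ends to full duplex
sudo ethtool -s eth0 speed 100 duplex full autoneg off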
Performance by Application Type
OLTP (OnLine Transaction Processing)
- Sensitive to: Latency
- Goal: ≤50ms response
- Focus: Minimize RTT
- Example: Online banking
Batch Processing
- Sensitive to: Throughput
- Goal: Complete in time window
- Focus: Maximize total data moved
- Example: Nightly reports
Streaming
- Sensitive to: Jitter, latency
- Goal: Consistent bitrate, ≤300ms latency
- Focus: QoS, bandwidth reservation
- Example: Video services
VoIP
- Sensitive to: Latency, packet loss, jitter
- Goal: ≤150ms, ≤1% loss, ≤10ms jitter
- Focus: QoS, dedicated bandwidth
- Example: Video conferencing
Performance Monitoring
Continuous Monitoring:
# Watch link utilization
watch -n 1 'ifstat -i eth0'
# Monitor connection count
watch 'netstat -tan | grep ESTABLISHED | wc -l'
# Track packet loss
ping -c 100 server.example.com | grep loss
# Bandwidth monitoring
nethogs (shows per-process bandwidth)
iotop (shows per-process disk I/O)
Best Practices
✓ Establish performance baseline first
✓ Monitor continuously (don't wait for problems)
✓ Test with realistic load (lab vs production differ)
✓ Consider latency AND throughput (not just speed)
✓ Document what "good performance" means
✓ Test failover scenarios
✓ Use layered caching (multiple levels)
✓ Compress what makes sense (text yes, video no)
✓ Set realistic QoS policies
✓ Account for full round-trip (app → network → app)
Key Concepts
- Latency = Delay (milliseconds)
- Bandwidth = Capacity (megabits/second)
- Throughput = Actual rate (megabits/second, usually less than bandwidth)
- Packet loss = % of packets not arriving
- Jitter = Variance in latency
- Queuing delay = Wait due to congestion
- Propagation delay = Speed of light through medium (can't improve)
- Bottleneck = slowest link in path
- QoS = Prioritize traffic by importance
- Trade-off = balancing latency (UDP) against reliability (TCP)