Network problems impact everything. You need systematic debugging techniques to quickly identify root causes and resolve issues. This is where being methodical saves hours of frustration.
Troubleshooting Methodology
The OSI Model Approach:
Start at the bottom (Physical) and work up to the top (Application):
Layer 7: Application (is the service responding?)
↑ Check: telnet, curl, netcat
Layer 6: Presentation (correct format?)
Layer 5: Session (connection established?)
Layer 4: Transport (TCP/UDP working?)
↑ Check: netstat, ss, lsof
Layer 3: Network (IP routing working?)
↑ Check: ping, traceroute, ip route
Layer 2: Data Link (ARP, MAC addresses?)
↑ Check: arp, ip link
Layer 1: Physical (cables, up/down?)
↑ Check: ethtool, link status
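The bottom-up order above can be sketched as a small shell helper that runs one check per layer and reports the first one that fails. The function name `first_failing_layer`, the `label=command` argument format, and the interface, addresses, and hostnames in the example call are all assumptions made for illustration.

```shell
#!/bin/sh
# Run checks in the order given (lowest layer first) and print the label
# of the first check that fails. Each argument has the form "label=command".
# Hypothetical helper: the name and argument format are illustrative only.
first_failing_layer() {
    for entry in "$@"; do
        label=${entry%%=*}   # text before the first "="
        cmd=${entry#*=}      # text after the first "="
        if ! sh -c "$cmd" >/dev/null 2>&1; then
            echo "$label"
            return 1
        fi
    done
    echo "none"
}

# Example call (eth0, 192.168.1.1 and example.com are placeholders):
# first_failing_layer \
#     "L1-2 link=ip link show eth0 | grep -q 'state UP'" \
#     "L3 gateway=ping -c 1 -W 1 192.168.1.1" \
#     "L4 port=nc -z -w 2 example.com 80" \
#     "L7 http=curl -sf --max-time 5 -o /dev/null http://example.com"
```

Stopping at the first failure keeps attention on the lowest broken layer, since everything above it will usually fail as a consequence.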
Systematic Troubleshooting Framework
1. Define the Problem
- ✓ Exactly what doesn't work:
- HTTP requests timing out
- DNS not resolving
- High packet loss
- Slow response times
- ✗ Don't: "Network is broken"
2. Gather Information
- ✓ When did it start?
- ✓ What changed?
- ✓ Who is affected (one user, everyone, one service)?
- ✓ What does working look like?
3. Form Hypothesis
Given the problem + info:
- My guess is: DNS server is restarting
- or: Load balancer misconfigured
- or: Firewall rule changed
4. Test Hypothesis
- Is DNS working? → dig google.com
- Is LB routing? → Check backend pool status
- Is the firewall blocking traffic? → Check rules
5. Resolve
If hypothesis confirmed:
- Fix root cause
- Document what changed
- Implement monitoring
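When testing the DNS hypothesis from step 4, note that `dig` normally exits with status 0 whenever it receives a reply, including an NXDOMAIN reply, so the exit status alone does not confirm that a name resolves. A minimal sketch of a stricter test follows; the helper name `has_address` is an assumption, and the patterns only cover plain IPv4 and IPv6 notation.

```shell
#!/bin/sh
# Succeed only if stdin (for example `dig +short` output) contains at
# least one IPv4 or IPv6 address. CNAME lines and error text do not count.
# Hypothetical helper name; the patterns cover plain address notation only.
has_address() {
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$|^[0-9A-Fa-f]*:[0-9A-Fa-f:]*$'
}

# Example: test the "DNS is broken" hypothesis for one name.
# if dig +short google.com | has_address; then
#     echo "DNS resolves: hypothesis ruled out"
# else
#     echo "no address returned: hypothesis still standing"
# fi
```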
Essential Network Tools
ping — Test connectivity:
# Simple ping
ping google.com
# Count packets
ping -c 5 google.com
# Single probe with a 1-second reply timeout (-W)
ping -W 1 -c 1 google.com
traceroute — Show route to destination:
# See all hops
traceroute google.com
# Numeric output (-n skips reverse DNS) with a 1-second probe timeout (-w)
traceroute -n -w 1 google.com
# ICMP traceroute
traceroute -I google.com
# UDP traceroute to a fixed destination port (-U, default 53)
traceroute -U google.com
dig — DNS lookup:
# Simple lookup
dig google.com
# Query specific nameserver
dig @8.8.8.8 google.com
# Request all record types (many servers now refuse or minimize ANY answers)
dig google.com ANY
# Short format (+short)
dig +short google.com
# Verbose (+trace shows delegation path)
dig +trace google.com
netstat — Connection statistics:
# All connections
netstat -a
# Listening ports
netstat -tln | head -20
# t=tcp, l=listening, n=numeric IPs
# Per-protocol statistics
netstat -s
# Process owning the listening socket (root is needed to see other users' processes)
sudo netstat -tlnp | grep :8080
ss — Socket statistics (modern netstat):
# All listening sockets
ss -tln
# Established connections
ss -tn state established
# With process info
ss -tlnp
# Summary
ss -s
curl/wget — HTTP requests:
# Simple GET
curl http://example.com
# Show headers only
curl -I http://example.com
# Follow redirects
curl -L http://example.com
# Timeout
curl --connect-timeout 5 --max-time 10 http://example.com
# Verbose (show handshake, headers)
curl -v http://example.com
# Test specific hostname
curl -H "Host: example.com" http://203.0.113.1
nc (netcat) — Raw TCP/UDP:
# Test if port is open
nc -zv example.com 80
# Listen on port
nc -l 8080
# Send data
echo "hello" | nc example.com 9000
# UDP test
nc -u example.com 53
tcpdump — Packet capture:
# Capture all traffic
sudo tcpdump -i eth0
# Capture on port
sudo tcpdump -i eth0 port 80
# Capture to file
sudo tcpdump -i eth0 -w capture.pcap
# Read file
tcpdump -r capture.pcap
# Show MAC addresses
sudo tcpdump -i eth0 -e
# ASCII and hex
sudo tcpdump -i eth0 -X
Network Troubleshooting Scenarios
Scenario 1: "I can't connect to service"
Step 1: Is the service running?
ss -tlnp | grep 8080
✗ No → Start the service, then retest
✓ Yes → Continue
Step 2: Can I reach it on localhost?
curl localhost:8080
✗ No → Service crashed, check logs
✓ Yes → Continue
Step 3: Can I reach it from another machine?
curl 192.168.1.100:8080
✗ No → Firewall? Bind address? Routing?
ss -tln → Check if listening on all IPs (0.0.0.0), not only 127.0.0.1
ufw status → Check if port 8080 is allowed
✓ Yes → Works by IP, so suspect DNS
Step 4: Check DNS
dig service.example.com → Returns IP?
✗ No → DNS misconfigured
✓ Yes → Correct IP?
Scenario 2: "High latency to database"
Step 1: Confirm latency
ping db.example.com
↓ Shows high response time? Yes
Step 2: Check route
traceroute db.example.com
↓ Which hop is slow?
Step 3: Check network stats
netstat -s | grep -i retrans
↓ Retransmission counters climbing?
Step 4: Check interface
ethtool eth0
↓ Speed/duplex mismatched?
Step 5: Check application
time nc -zv db.example.com 5432
↓ TCP connect fast but queries slow? Then the app or database is slow, not the network
Scenario 3: "DNS not resolving"
Step 1: Check configured DNS
cat /etc/resolv.conf
↓ Shows nameservers?
Step 2: Test with public DNS
dig @8.8.8.8 google.com
✗ Fails → Network problem
✓ Works → Local DNS server problem
Step 3: Query local DNS directly
dig @192.168.1.1 example.com
✗ Fails → DNS server misconfigured
✓ Works → Resolver config wrong
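Steps 1 and 3 can be combined by reading the configured nameservers and querying each one directly. This is a minimal sketch: the helper name `list_nameservers` is an assumption, and `example.com` stands in for whatever name is failing.

```shell
#!/bin/sh
# Print the nameserver addresses listed in a resolv.conf-style file.
# Defaults to /etc/resolv.conf; pass another path to inspect a copy.
# Hypothetical helper name, for illustration only.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example: query every configured resolver directly
# (example.com is a placeholder name).
# for ns in $(list_nameservers); do
#     answer=$(dig +short +time=2 +tries=1 "@$ns" example.com | head -n 1)
#     echo "$ns: ${answer:-no answer}"
# done
```

On systems that use systemd-resolved, the file often lists only the local stub address 127.0.0.53, in which case the upstream servers have to be looked up separately.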
Step 4: Check local DNS logs
sudo tail -f /var/log/named/default.log
↓ See query errors?
Network Performance Testing
Check Response Time:
# Show DNS + TCP + TLS + request time (values are cumulative seconds)
# Repeated -w options override each other, so use one format string
curl -s -o /dev/null \
  -w " DNS: %{time_namelookup}\n TCP: %{time_connect}\n TLS: %{time_appconnect}\n First Response: %{time_starttransfer}\n Total: %{time_total}\n" \
  https://example.com
Bandwidth Test:
# Download speed test
curl -o /dev/null -w "%{speed_download}\n" http://example.com/large-file
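curl reports `speed_download` in bytes per second, while link speeds are usually quoted in megabits per second. A small conversion sketch follows; the helper name `bytes_per_sec_to_mbit` is an assumption, and it uses decimal megabits (1,000,000 bits).

```shell
#!/bin/sh
# Convert a bytes-per-second figure (as printed by curl's %{speed_download})
# to megabits per second, rounded to two decimals.
# Hypothetical helper name; uses decimal megabits (10^6 bits).
bytes_per_sec_to_mbit() {
    awk -v bps="$1" 'BEGIN { printf "%.2f\n", bps * 8 / 1000000 }'
}

# Example (the URL is a placeholder):
# speed=$(curl -s -o /dev/null -w '%{speed_download}' http://example.com/large-file)
# echo "$(bytes_per_sec_to_mbit "$speed") Mbit/s"
```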
# Using iperf3 (run the server on one host, the client on another)
iperf3 -s                    # on the server
iperf3 -c server-ip -t 10    # on the client, 10-second test
Packet Loss Detection:
# Ping with loss % shown
ping -c 100 example.com | grep '% packet loss'
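For scripted checks, the loss figure can be pulled out of the summary line and compared against a threshold. This is a sketch under the assumption that ping prints a summary containing "N% packet loss", as the common Linux and BSD implementations do; the helper names and the 1% threshold are illustrative.

```shell
#!/bin/sh
# Extract the packet-loss percentage from ping's summary line on stdin.
# Hypothetical helpers; they assume a "... N% packet loss ..." summary.
loss_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Succeed if the loss read from stdin is strictly above the threshold ($1).
loss_above() {
    pct=$(loss_percent)
    [ -n "$pct" ] && awk -v p="$pct" -v t="$1" 'BEGIN { exit !(p + 0 > t + 0) }'
}

# Example (example.com and the 1% threshold are placeholders):
# if ping -c 100 example.com | loss_above 1; then
#     echo "packet loss above 1%"
# fi
```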
# Continuous ping
ping -i 0.5 example.com # Send every 0.5 seconds
Firewall Troubleshooting
"Connection refused"
# Check if port is listening
ss -tln | grep :8080
✗ Not there → Service not running
# Check firewall rules
sudo ufw status | grep 8080
✗ Not allowed → sudo ufw allow 8080
# Check iptables (-n keeps ports numeric; without it 8080 can appear as a service name)
sudo iptables -L -n | grep 8080
✗ Blocked → Add allow rule
# Check if traffic reaches server
sudo tcpdump -i eth0 port 8080
✗ Packets not arriving → Blocked upstream
"Connection times out"
# Very likely firewall dropping packets
# (not saying "connection refused", just hanging)
# Increase timeout to confirm
curl --connect-timeout 60 http://server:port
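curl's exit status helps separate a refused connection from a silent drop. Exit codes 6, 7, and 28 are documented by curl; the helper name `explain_curl_exit` and the wording of each explanation are this sketch's own interpretation, and code 7 covers several connect failures, not only refusals.

```shell
#!/bin/sh
# Map a curl exit status to a likely failure class.
# Codes 6, 7 and 28 are documented curl exit codes; the explanations
# are an interpretation, not curl's own messages. Hypothetical helper.
explain_curl_exit() {
    case "$1" in
        0)  echo "connected" ;;
        6)  echo "could not resolve host: check DNS" ;;
        7)  echo "failed to connect: refused or unreachable" ;;
        28) echo "timed out: packets are probably being dropped" ;;
        *)  echo "other curl error ($1)" ;;
    esac
}

# Example (server and 8080 are placeholders):
# curl -s -o /dev/null --connect-timeout 5 http://server:8080
# explain_curl_exit $?
```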
# Check if ICMP is blocked
ping -c 1 server
✗ No response but can SSH? ICMP blocked
# Manually try the connection
nc -v -w 10 server port # Wait to see whether it connects, is refused, or hangs
Routing Issues
"Can't reach subnet"
# Check routing table
ip route show
↓ Is destination subnet listed?
# Add route
sudo ip route add 10.0.0.0/8 via 192.168.1.1
# Make permanent (netplan): add the route under the interface's entry
# in /etc/netplan/01-netcfg.yaml, for example:
#   routes:
#     - to: 10.0.0.0/8
#       via: 192.168.1.1
sudo netplan apply
"Asymmetric routing"
Outbound: A → Router1 → B (fast)
Inbound: B → Router2 → A (slow)
Diagnose with:
traceroute from A to B (shows outbound path)
traceroute from B to A (shows inbound path)
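The two traceroute runs can be saved and compared with a couple of small helpers. This is a sketch that assumes the output of `traceroute -n` (one header line, then one line per hop); the helper names and file names are made up for illustration. Routers often answer from a different interface in each direction, so hop addresses may differ even on a symmetric path. A different hop count, or a network that shows up in only one direction, is a stronger hint.

```shell
#!/bin/sh
# Print one hop address per line from `traceroute -n` output on stdin.
# The first line is the header; unanswered hops appear as "*".
# Hypothetical helpers, for illustration only.
hop_list() {
    awk 'NR > 1 { print $2 }'
}

# Count the hops in `traceroute -n` output on stdin.
hop_count() {
    hop_list | wc -l | tr -d ' '
}

# Example: save both directions, then compare them.
#   on A: traceroute -n B > a_to_b.txt
#   on B: traceroute -n A > b_to_a.txt
# echo "outbound hops: $(hop_count < a_to_b.txt)"
# echo "inbound hops:  $(hop_count < b_to_a.txt)"
```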
Connection Pool Problems
"Too many open connections"
# Check connected sockets
ss -s | grep TCP
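A per-state breakdown often says more than the total. The sketch below counts sockets by the first column of `ss -tan` output; the helper name `count_by_state` is an assumption. A large number of CLOSE-WAIT sockets usually points at an application that is not closing its connections.

```shell
#!/bin/sh
# Count TCP sockets per state from `ss -tan` output on stdin.
# Output: "<count> <state>" lines, largest count first.
# Hypothetical helper name, for illustration only.
count_by_state() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Example:
# ss -tan | count_by_state
```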
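# Find which processes hold the most connections
sudo ss -tnp | grep -o 'pid=[0-9]*' | sort | uniq -c | sort -rn | head
# Increase file descriptor limit (current shell only)
ulimit -n 10000
# Make permanent (applies at next login)
echo "* soft nofile 10000" | sudo tee -a /etc/security/limits.conf
DNS Issues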
"Wrong IP returned"
# Check local cache statistics (systemd-resolved)
resolvectl statistics
# Clear cache (systemd)
sudo systemctl restart systemd-resolved
# Query authoritative nameserver directly
dig @ns1.example.com example.com
↓ Correct answer? If yes:
→ Your local resolver cached old value
→ TTL probably hasn't expired yet
# Trace the delegation chain
dig +trace example.com
↓ Follow the delegation to see which nameserver returns the changed IP
Monitoring for Issues
Watch TCP connections:
# Refresh TCP socket counts every second
watch -n 1 'ss -s | grep TCP'
Monitor packet loss:
# Continuous monitoring
mtr google.com
# Shows loss per hop, which a single traceroute run does not
Check interface errors:
ethtool eth0
ip -s link show eth0
↓ Look for:
RX errors
RX dropped
TX errors
TX dropped
Key Concepts
- Use OSI model — Troubleshoot from bottom up
- Define problem clearly — Not "network is slow"
- Ping tests reachability — IP routing working?
- Traceroute shows path — Where does it fail?
- tcpdump shows actual packets — Low-level visibility
- DNS resolution is often the issue — Check first
- Firewall is second most common — Check rules
- Always check: service running, port listening, firewall allowing
- Use curl/nc for application layer — Is app responding?
- Document everything — Changes, errors, fixes
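The "port listening" check from the list above has one common trap: a service bound only to 127.0.0.1 answers on localhost but not from other machines. A minimal sketch that classifies the bind address from `ss -tln` output follows; the helper name `listen_scope` and its four result words are assumptions made for illustration.

```shell
#!/bin/sh
# Read `ss -tln` output on stdin and report how a port is bound:
#   all       listening on every address (0.0.0.0, [::] or *)
#   specific  listening on one non-loopback address
#   loopback  listening on 127.x.x.x or [::1] only
#   none      no listener found for that port
# Hypothetical helper; column 4 is assumed to be "Local Address:Port".
listen_scope() {
    awk -v port="$1" '
        NR > 1 && $4 ~ (":" port "$") {
            addr = $4
            sub(":" port "$", "", addr)
            if (addr == "0.0.0.0" || addr == "[::]" || addr == "*") {
                scope = "all"
            } else if (addr ~ /^127\./ || addr == "[::1]") {
                if (scope == "") scope = "loopback"
            } else if (scope != "all") {
                scope = "specific"
            }
        }
        END { print (scope == "" ? "none" : scope) }
    '
}

# Example (8080 is a placeholder port):
# ss -tln | listen_scope 8080
```

A result of `loopback` explains the case where `curl localhost:8080` works but the same request from another machine fails, before any firewall rule is involved.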