Score:1

traceroute: sometimes routers don't respond and user sees timeouts

ico

I'm the admin of a small network and I'm investigating a problem my users complain about. The root of their complaints is traceroute: sometimes routers along the path simply don't respond to traceroute probes, and users see timeouts (those *s in place of the RTT).

The network consists of a few Linux routers connected by Ethernet and wireless links. The routers are 99% idle, and link utilization is about 20 Mbit/s at 2000 packets/s. Wireless is rock solid. Ping RTT to all routers along the path is about 10 ms, with some variation of course. A flood ping to any of those hosts runs for minutes without any packet loss (and I mean 0 packets lost). Downloading some huge files over the network averages 10.2 MB/s.

A correct traceroute looks like this:

# traceroute -nI 10.0.0.2
traceroute to 10.0.0.2 (10.0.0.2), 30 hops max, 60 byte packets
 1  192.168.0.1  3.919 ms  3.866 ms  4.117 ms
 2  10.41.13.1  4.149 ms  6.714 ms  6.707 ms
 3  10.41.1.11  8.475 ms  8.468 ms  8.705 ms
 4  10.0.0.2  8.697 ms  9.428 ms  9.707 ms

The problematic traceroutes look like this:

# traceroute -nI 10.0.0.2
traceroute to 10.0.0.2 (10.0.0.2), 30 hops max, 60 byte packets
 1  192.168.0.1  3.190 ms  3.140 ms  3.128 ms
 2  10.41.13.1  3.119 ms  3.113 ms *
 3  10.41.1.11  3.697 ms *  3.683 ms
 4  10.0.0.2  4.531 ms  4.524 ms  5.171 ms
# traceroute -nI 10.0.0.2
traceroute to 10.0.0.2 (10.0.0.2), 30 hops max, 60 byte packets
 1  192.168.0.1  3.471 ms  3.405 ms  3.388 ms
 2  10.41.13.1  3.372 ms  3.359 ms  3.350 ms
 3  10.41.1.11  5.039 ms * *
 4  10.0.0.2  5.105 ms  5.484 ms  5.473 ms

I investigated a bit with tcpdump and found out that traceroute works like this:

  1. It first sends a batch of ICMP echo requests with TTLs of 1, 2, 3, 4, 5 and 6. Each TTL is sent 3 times. That is 18 packets :)
  2. It waits some time for all replies (Time Exceeded).
  3. When all replies have returned, it shows the results.
  4. ...or it waits for the timeout and shows the results with missing replies marked with asterisks.
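
The four steps above can be modelled in a few lines of Python. This is only a sketch of the behaviour seen in tcpdump, not traceroute's actual source; `plan_probes` and `format_hop` are names made up for illustration, and the batch size of 6 TTLs is simply what the capture showed.

```python
def plan_probes(first_ttl=1, last_ttl=6, probes_per_ttl=3):
    """Return the (ttl, probe_index) pairs sent in the initial batch."""
    return [
        (ttl, index)
        for ttl in range(first_ttl, last_ttl + 1)
        for index in range(probes_per_ttl)
    ]


def format_hop(ttl, address, rtts_ms):
    """Render one output row; an RTT of None means the probe timed out."""
    cells = "".join(
        " *" if rtt is None else f"  {rtt:.3f} ms" for rtt in rtts_ms
    )
    return f"{ttl:2d}  {address}{cells}"


print(len(plan_probes()))  # 6 TTLs x 3 probes = 18
print(format_hop(3, "10.41.1.11", [5.039, None, None]))
```

The second line printed reproduces hop 3 of the last problematic trace: one reply arrived and two probes hit the timeout.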

The cause of the timeouts is that the routers receive all 3 of their respective requests but sometimes don't respond: they don't send an ICMP Time Exceeded for every probe.

I suspect there are some settings that control this behavior on the routers, namely icmp_ratelimit, icmp_ratemask, icmp_msgs_per_sec and icmp_msgs_burst. All are described, to some extent, in the kernel.org docs. And this is the point where I failed: I couldn't come up with any values of those variables that make traceroute work every time.

I tried setting these on all routers:

  • icmp_ratelimit set to 0 (don't limit anything)
  • icmp_msgs_per_sec set to 10000 (should be high enough)
  • icmp_msgs_burst set to 5000 (high enough)
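
To rule out a setting that was silently reverted, the live values can be read back on each router. A minimal sketch, assuming the usual /proc/sys/net/ipv4 location for these IPv4 sysctls; `read_icmp_sysctls` is a hypothetical helper name, not part of any tool.

```python
from pathlib import Path

ICMP_SYSCTLS = (
    "icmp_ratelimit",
    "icmp_ratemask",
    "icmp_msgs_per_sec",
    "icmp_msgs_burst",
)


def read_icmp_sysctls(base="/proc/sys/net/ipv4"):
    """Return {name: int value}, or None where the file is missing or unreadable."""
    values = {}
    for name in ICMP_SYSCTLS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_icmp_sysctls().items():
        print(f"net.ipv4.{name} = {value}")
```

Running it on every router along the path and comparing the output is a quick way to confirm that all of them really carry the intended values.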

It didn't help: I see the same behavior, random timeouts. I didn't touch icmp_ratemask, because I don't fully understand how to exclude Time Exceeded messages from limiting.
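
For a rough sense of why those values should be more than enough, here is a generic token bucket. It is only my assumption that the icmp_msgs_per_sec / icmp_msgs_burst pair behaves roughly like this (a refill rate plus a burst cap); it is a simplified model, not the kernel's code, and `TokenBucket` and `count_allowed` are illustrative names.

```python
class TokenBucket:
    """Simplified rate limiter: refills at rate_per_sec, holds at most burst tokens."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        """Return True if a reply may be sent at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(rate_per_sec, burst, arrival_times):
    """Count how many probes arriving at the given times would get a reply."""
    bucket = TokenBucket(rate_per_sec, burst)
    return sum(bucket.allow(t) for t in arrival_times)


# Worst case: all 18 probes of one batch hit the same router at the same instant.
print(count_allowed(10000, 5000, [0.0] * 18))
```

Under this model none of the 18 probes would be dropped with the values above, which fits the observation that raising them changed nothing.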

So finally, questions:

  1. If you are familiar with this type of traceroute problem, how did you solve it?
  2. If you are familiar with the kernel settings mentioned above, what are "good enough" values?
  3. What is the correct way to modify icmp_ratemask so that Time Exceeded messages are not limited and traceroute works without glitches?
  4. And extra: are there any security risks in changing these (or any related) settings? I don't want to be DoS'ed, nor to become a source of a DDoS attack on anyone.
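
Regarding question 3: according to the kernel's ip-sysctl documentation, icmp_ratemask is a bitmask in which bit N stands for ICMP type N, and the default value 6168 covers Destination Unreachable (3), Source Quench (4), Time Exceeded (11) and Parameter Problem (12). A minimal sketch of the arithmetic, with helper names that are purely illustrative:

```python
ICMP_TIME_EXCEEDED = 11   # ICMP type number of Time Exceeded
DEFAULT_RATEMASK = 6168   # documented default: types 3, 4, 11 and 12


def limited_types(mask):
    """List the ICMP type numbers whose bits are set in the mask."""
    return [icmp_type for icmp_type in range(32) if mask & (1 << icmp_type)]


def clear_type(mask, icmp_type):
    """Return the mask with the bit for icmp_type cleared."""
    return mask & ~(1 << icmp_type)


new_mask = clear_type(DEFAULT_RATEMASK, ICMP_TIME_EXCEEDED)
print(limited_types(DEFAULT_RATEMASK))  # [3, 4, 11, 12]
print(new_mask)                         # 4120
print(limited_types(new_mask))          # [3, 4, 12]
```

So 4120 would be the default mask with Time Exceeded excluded. The same documentation describes icmp_ratelimit = 0 as disabling this limiting altogether, so with the settings listed earlier a mask change would not be expected to make a difference.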
In Linux traceroute, it is possible to use UDP probes instead of ICMP (use the `-U` option). It can help you decide whether this has something to do with ICMP settings.
John Hanley
Routers are not required to respond to ICMP. Seeing an * means nothing and should be ignored.
ico
UDP/TCP/ICMP traceroutes: I was not clear about this, sorry. It doesn't matter which protocol I use for traceroute. Timeouts are seen when tracerouting from Windows machines (default ICMP) or from Linux machines (default UDP, optional ICMP). I personally try to use the ICMP version, because TCP/UDP have other problems with firewalls, while ICMP is usually allowed.
ico
John Hanley: That's true. But try to explain it to the User :) Anyway, I see these timeouts on my network more often than on the internet. And since I am root on those routers, I would like to "force" them to respond to traceroutes.
John Hanley
You cannot force routers to respond to your "hello, how are you doing today?" messages. The router might be too busy to bother with your messages. Responding is optional. I never use ping to verify network connectivity or debug network problems. ICMP is an ancient protocol that has little value today. ICMP is one of the first items I disable on my systems.
Score:0

As part of control plane policies, ICMP probes are mostly ignored by the hops along a path. If you want more thorough historical data, in terms of metrics and trends, I would suggest a dedicated on-prem smokeping instance.

ico
Well, I use icinga to monitor the network. There is also a collectd daemon on each Linux machine, and the collected data can be nicely seen with grafana. I also created a daemon to ping "interesting hosts" on the network and plot graphs (like cacti or smokeping). So I know the state of the network. But my users don't: they see asterisks in traceroutes, and for them that means a buggy network... :(