This is a technical article chronicling one of the most interesting bug hunts I’ve had the pleasure of chasing down.
At AdGear Technologies Inc. where I work, ssh is king. We use it for management, monitoring, deployments, log file harvesting, even real-time event streaming. It’s solid, reliable, has all the predictability of a native unix tool, and just works.
Until one day, random cron emails started flowing about it not working.
The machines in our London data center were randomly failing to send their log files to our data machines in our Montreal data center. This job is initiated periodically from cron, and the failure manifested itself as:
- cron emails stating that the ssh transfer was unsuccessful
- sometimes the ssh attempt hung
- sometimes it exited with a timeout error
- nagios warnings further down the line, from in-house sanity checks detecting the missing data in Montreal
We logged into the London machines, manually ran the push command, and it worked. We brushed it off as a temporary network partition.
But the failures kept popping up randomly. Once a day, a couple of times a day, then one Friday morning, several times an hour. It was clear something was getting worse. We kept up with manually pushing the files while we figured out what the problem was.
There were 17 hops between London and Montreal. We built a profile of latency and packet loss for them, and found that a couple were losing 1-3% of packets. We filed a ticket with our London DC ops to route away from them.
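One way to build such a profile (not necessarily the exact tooling we used; the host name and probe count are illustrative) is mtr, which folds traceroute and per-hop loss/latency statistics into a single report:

```
# 100 probes per hop, reported once at the end.
mtr --report --report-cycles 100 mtl-data-machine
```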
While London DC ops were verifying the packet loss, we started seeing random timeouts from London to our SECOND data center in Montreal, whose route did not share the hops where we had observed the packet loss. We concluded that packet loss was not the main problem around the same time London DC ops replied saying they were unable to replicate the packet loss or timeouts, and that everything looked healthy on their end.
While manually keeping up with failed cron uploads, we noticed an interesting pattern. A file transfer either succeeded at a high speed, or didn’t succeed at all and hung/timed out. There were no instances of a file uploading slowly and finishing successfully.
Removing the large volume of data from the equation, we were able to recreate the scenario via simple vanilla ssh. On a London machine an “ssh mtl-machine” would either work immediately, or hang and never establish a connection. Eyebrows started going up.
Where the wild packets are
We triple-checked the ssh server configs and health in Montreal (the sshd_config knobs behind a couple of these checks are sketched after this list):
- DNS servers were responding fast
- DNS reverse lookup was not enabled
- Maximum client connections was high enough
- We were not under attack
- Bandwidth usage was nowhere near saturation
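For reference, the sshd_config directives behind a couple of those checks would look something like this (illustrative values, not our production settings):

```
# /etc/ssh/sshd_config (illustrative values)
UseDNS no          # don't do reverse DNS lookups on connecting clients
MaxStartups 100    # allow plenty of concurrent unauthenticated connections
LoginGraceTime 30  # drop half-open authentication attempts quickly
```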
Besides, even if something was off, we were observing the hangs talking to 2 completely distinct data centers in Montreal. Furthermore, our other data centers (non-London) were talking happily to Montreal. Something about London was off.
We fired up tcpdump and started looking at the packets, both in summary and in captured pcaps loaded into wireshark. We saw telltale signs of packet loss and retransmission, but it was minimal and not particularly worrisome.
We then captured full connections from cases where ssh established successfully, and full connections from cases where the ssh connection hung.
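The captures themselves were along these lines (interface, host, and file names are illustrative):

```
# Capture full packets (-s 0), not just headers, for the ssh traffic to Montreal,
# and write them to a pcap we can load into wireshark later.
tcpdump -i eth0 -s 0 -w ssh-to-mtl.pcap 'host mtl-machine and port 22'
```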
Here’s what we logically saw when a connection from London to Montreal hung:
- Normal TCP handshake
- A bunch of ssh-specific back-and-forth, with normal TCP ack packets where they should be
- A particular packet sent from London and received in Montreal
- The same packet re-sent (and re-sent, several times) from London and received in Montreal
- Montreal’s just not responding to it!
It didn't make sense that Montreal was not responding (hence London re-transmitting). The connection was stalled at this point, with the layer 4 protocol at a stalemate. More infuriatingly, if you killed the ssh attempt in London and re-launched it immediately, odds were it would work. When it did, tcpdump showed Montreal receiving the equivalent packet and responding to it, and things moved on.
We enabled verbose debugging (-vvv) on the ssh client in London, and the hang consistently occurred right after the client logged that it was expecting the server's SSH2_MSG_KEX_DH_GEX_GROUP reply.
Googling "ssh hang SSH2_MSG_KEX_DH_GEX_GROUP" turns up many results - from bad WiFi, to Windows TCP bugs, to buggy routers discarding TCP fragments. One suggested fix for LANs was to figure out the path's MSS and set that as the MTU on both ends.
I kept decrementing the MTU on a London server down from 1500 - it didn't help until I hit the magic value 576. At that point, I could no longer replicate the hanging behavior. With an ssh loop script running, I could cause timeouts on demand by bringing the MTU back up to 1500, or make them disappear by setting it back to 576.
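The experiment boiled down to something like this sketch (interface name and target host are illustrative, key-based ssh access is assumed, and clamping the MTU this way affects all traffic on the interface):

```
# Reproduce the hang: full-size MTU, then hammer ssh in a loop.
ip link set dev eth0 mtu 1500
while true; do
    timeout 30 ssh -o BatchMode=yes mtl-machine true \
        && echo "ok" || echo "hung/failed"
    sleep 1
done

# Make the hangs disappear: clamp the MTU to the magic value.
ip link set dev eth0 mtu 576
```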
Unfortunately these are public ad servers and globally setting the MTU to 576 wouldn't cut it, but the above did suggest that packet fragmentation or reassembly was broken somewhere.
Going back to check the received packets with tcpdump, there was no evidence of fragmentation. The received packet size matched exactly the packet size sent. If something did fragment the packet at byte 576+, something else reassembled it successfully.
Twinkle twinkle little mis-shapen star
Digging in some more, I was now looking at full packet dumps (tcpdump -s 0 -X) instead of just the headers. Comparing that magic packet between instances of ssh success and ssh hang showed very little difference aside from TCP/IP header variations. It was, however, clearly the first packet in the TCP connection with enough data to cross the 576-byte mark - all previous packets were much smaller.
Comparing the same packet, during a hanging instance, as it left London, and as captured in Montreal, something caught my eye. Something very subtle, and I brushed it off as fatigue (it was late Friday at this point), but sure enough after a few refreshes and comparisons, I wasn’t imagining things.
Here’s the packet as it left London (minus the first few bytes identifying the IP addresses):
And here’s the same packet as it arrived in Montreal:
Did something there catch your eye? If not, I don't blame you. Feel free to copy each into a text editor and rapidly switch back and forth to see some characters dance. Here's what it looks like when they're placed in vimdiff:
Well well well. It’s not packet loss, it’s packet corruption! Very subtle, very predictable packet corruption.
Some interesting notes:
- The lower part of the packet (<576 bytes) is unaffected
- The affected portion is predictably corrupted on the 15th byte of every 16
- The corruption is predictable. All instances of “h” become “x”, all instances of “c” become “s”
Some readers might have already checked an ASCII chart and reached the conclusion: there's a single bit statically stuck at "1" somewhere. Setting bit 4 of a byte (value 0x10) to 1 reliably corrupts each letter on the left into the one on the right.
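The arithmetic is easy to verify: "h" is 0x68 and "x" is 0x78; "c" is 0x63 and "s" is 0x73. ORing in 0x10 - bit 4, counting from zero - maps one to the other:

```
# 'h' (0x68) with bit 4 forced on becomes 'x' (0x78); 'c' (0x63) becomes 's' (0x73).
printf '%x %x\n' $(( 0x68 | 0x10 )) $(( 0x63 | 0x10 ))
# prints: 78 73
```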
The obvious culprits within our control (NIC cards, receiving machines) were not suspects, given the pattern of failure observed (several London machines -> several Montreal data centers and machines). It had to be something upstream, close to London.
Going back to validate, things started to make sense. I also noticed a little hint in tcpdump's verbose mode (tcp cksum bad) that had been missed before. A Montreal machine receiving this packet discarded it at the kernel level after realizing it was corrupt, never passing it to the userland ssh daemon. London then re-transmitted it, the retransmission went through the same corruption, and got the same silent treatment. From ssh and sshd's perspective, the connection was at a stalemate. From tcpdump's perspective, there was no loss, and the Montreal machines simply appeared to be ignoring data.
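For the curious, that hint comes from tcpdump's verbose mode, which verifies and prints checksums as it captures (interface and host below are illustrative). One caveat: on the sending machine, NIC checksum offload makes outgoing packets look like they have bad checksums even when nothing is wrong, so the receiving side is the trustworthy place to look.

```
# -vv prints per-packet checksum validity alongside the usual summary.
tcpdump -i eth0 -vv 'host london-machine and port 22'
```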
We sent these findings to our London DC ops, and within a few minutes they changed outbound routes dramatically. The first router hop, and most hops afterwards, were different. The hanging problem disappeared.
Late Friday night fixes are nice because you can relax and not carry problems and support staff into the weekend :)
Happy that we were no longer suffering from the problem and that our systems had caught up with the backlog, I decided I'd try my hand at actually finding the device causing the corruption.
Having the London routes updated to not go through the old path meant that I couldn’t reproduce the problem easily. I asked around until I found a friend with a FreeBSD box in Montreal I could use, which was still accessed through the old routes from London.
Next, I wanted to make sure the corruption was reproducible even without ssh involvement. This was trivially proven with a few netcat pipes: a listener on the Montreal FreeBSD box with tcpdump watching what actually arrived on the wire, and a sender in London pushing a stream of null bytes at it over TCP.
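A minimal sketch of that kind of pipe, not the exact commands (host, port, and interface names are made up). Since the receiving kernel silently drops packets that fail the TCP checksum before userland ever sees them, the corruption has to be caught with tcpdump on the receiving side rather than in the listener's output:

```
# On the FreeBSD box in Montreal: capture the test port in full, and keep a
# listener up so the TCP connection establishes.
tcpdump -i em0 -s 0 -X tcp port 4000 &
nc -l 4000 > /dev/null

# In London: push a megabyte of null bytes at it over TCP.
head -c 1000000 /dev/zero | nc mtl-freebsd-box 4000
```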
Again, accounting for the randomness factor and setting things up in a retry loop, I got a few packets that removed any doubt about the previous conclusions: even though London was sending nothing but nulls, the capture in Montreal showed the same predictable corruption.
With the bug replicated, I needed a way to isolate which of the 17 hops along that path caused the corruption. There was simply no way to call up the provider of each hop and ask them to check their systems.
I decided that pinging each router, incrementally, might be the way to go. I crafted special ICMP packets that were large enough to go over the 576-byte safety margin, and were filled entirely with NULLs, then pinged the Montreal machine with them from London.
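Something along these lines, using ping's size and pattern-padding options (count, size, and host are illustrative):

```
# Large ICMP echo requests padded entirely with nulls, well past the 576-byte mark.
ping -c 20 -s 1400 -p 00 mtl-machine
```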
They came back perfectly normal. There was no corruption.
I tried all variations of speed, padding, size - to no avail. I simply could not observe corruption in the returned ICMP ping packets.
I replaced the netcat pipes with UDP instead of TCP. Again there was no corruption.
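That swap was just a flag change on both ends of the same illustrative pipe:

```
# Montreal listener and London sender, same as before but over UDP.
nc -u -l 4000 > /dev/null
head -c 1000000 /dev/zero | nc -u mtl-freebsd-box 4000
```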
The corruption needed TCP to be reproducible - and TCP needs two cooperating endpoints. I checked whether any of the routers had an open TCP port I could talk to directly, to no avail.
It seemed there was no easy way for an external party to pinpoint the bad apple. Or was there?
Mirror mirror on the wall
To detect whether corruption occurred or not, we need one of these scenarios:
- Control over the TCP peer we're talking to, so we can inspect the packet at the destination - not just in userland, where the packet would never be delivered if the TCP checksum failed, but with root and tcpdump to inspect it as it arrives
- A TCP peer that acts as an echo server, mirroring back the data it received, so we can inspect it at the sending node and detect the corruption there
It suddenly occurred to me that the second scenario was, in a way, available to us. Not per se, but consider this: in our very first taste of the problem, we observed ssh clients hanging when talking to ssh servers over the corrupting hop. That hang is a good passive signal we can use instead of an active "echo" signal.
… and there are lots of open ssh servers out there on the internet to help us out.
We don’t need actual accounts on these servers - we just need to kickstart the ssh connection and see if the cipher exchange phase succeeds or hangs (with a reasonable number of retries to account for corruption randomness).
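A probe along these lines would do (the IP is a placeholder; the timeout and retry budget are arbitrary). We never authenticate; we only care whether the key exchange completes before a generous timeout:

```
# Exit status 124 from timeout(1) means the attempt hung mid-negotiation -
# a candidate "bad" path. A fast failure (auth rejected) means key exchange
# completed, which is all we care about.
timeout 20 ssh -o BatchMode=yes -o StrictHostKeyChecking=no \
    -o ConnectTimeout=5 probe@203.0.113.7 true >/dev/null 2>&1
if [ $? -eq 124 ]; then echo "suspect: hung"; else echo "negotiated"; fi
```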
So this plan was hatched:
- Use the wonderful nmap tool, specifically its "random IP" mode, to build a list of geographically distributed open ssh servers (see the sketch after this list)
- Test each server and classify it as one of:
  - Unresponsive/unpredictable/firewalled -> ignore it
  - Negotiates successfully within N retries -> mark it "good"
  - Hangs at the telltale key-exchange phase even after N retries -> mark it "bad"
- For both "good" and "bad" servers, record the traceroute to them
The idea was this: all servers marked "bad" would share a few hops in their traceroutes. We could then take that set of suspect hops and subtract from it any hop that also appeared in the traceroutes of the "good" servers. Hopefully what remained would be only one or two hops.
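Mechanically, the whole thing can be sketched as a handful of shell commands (nothing here is the exact tooling used; counts, filenames, and parsing are illustrative):

```
# Gather a pool of random hosts with port 22 open, and record the route to each.
mkdir -p routes bad good
nmap -iR 2000 -p 22 --open -oG - | awk '/22\/open/ {print $2}' > candidates.txt
while read -r ip; do traceroute -n "$ip" > "routes/$ip.txt"; done < candidates.txt

# After probing each candidate with the ssh test above, copy its route file into
# bad/ or good/. Then list the hop IPs common to every "bad" route:
total=$(ls bad/*.txt | wc -l)
for f in bad/*.txt; do
    awk 'NR > 1 && $2 ~ /^[0-9]/ {print $2}' "$f" | sort -u
done | sort | uniq -c | awk -v n="$total" '$1 == n {print $2}'
```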
After spending an hour manually doing the above exercise, I stopped to inspect the data. I had classified 16 servers as “BAD” and 25 servers as “GOOD”.
The first exercise was to find the hops that appeared in all the traceroutes of the "BAD" servers. As I cleaned and trimmed the list, I realized I wouldn't even need the "GOOD" list to remove false positives. Within the "BAD" lists alone, only one hop remained that was common to all of them.
For what it’s worth, it was 2 providers away: London -> N hops upstream1 -> Y hops upstream2
It was the first of the Y hops in upstream2 - right at the edge between upstream1 and upstream2 - corrupting random TCP packets, causing many retries and, depending on the protocol's logical back-and-forth, hangs or reduced transmission rates. You may have been a telephony provider suffering dropped calls, or a retailer losing a few customers or sales - the possibilities really are endless.
I followed up with our London DC ops, giving them the single hop's IP address. Hopefully, with their direct relationship with upstream1, they could escalate from there and get it fixed.
/filed under crazy devops war stories
Update: Through upstream1, I got confirmation that the hop I pointed out (the first in upstream2) had an internal "management module failure" that affected BGP and routing between two internal networks. It's still down (they've routed around it) until they receive a replacement for the faulty module.