The little ssh that (sometimes) couldn't

Preface

This is a technical article chronicling one of the most interesting bug hunts I’ve had the pleasure of chasing down.

At AdGear Technologies Inc. where I work, ssh is king. We use it for management, monitoring, deployments, log file harvesting, even some event streaming. It’s solid, reliable, has all the predictability of a native unix tool, and just works.

Until one day, random cron emails started flowing about it not working.

The timeout

The machines in our London data center were randomly failing to send their event log files to our data machines in our Montreal data center. This job is initiated periodically from cron, and the failure manifested itself as:

cron emails stating that the ssh was unsuccessful
- Sometimes hangs
- Sometimes exits with a timeout error
monitoring warnings down the line for in-house sanity checks detecting the missing data in Montreal

We logged into the London machines, manually ran the push command, and it worked successfully. We brushed it off as temporary network partitions.

The timeouts

But the failures kept popping up randomly. Once a day, a couple of times a day, then one Friday morning, several times an hour. It was clear something’s getting worse. We kept up with manually pushing the files until we figure out what the problem was.

There were 17 hops between London and Montreal. We built a profile of latency and packet loss for them, and found that a couple were losing 1-3% of packets. We filed a ticket with our London DC ops to route away from them.

While London DC ops were verifying the packet loss, we started seeing random timeouts from London to our SECOND data center in Montreal, and hops to that data center did not share the same routes we observed the packet loss at. We concluded packet loss is not the main problem around the same time London DC ops replied saying they’re not able to replicate the packet loss or timeouts and that everything looked healthy on their end.

The revelation

While manually keeping up with failed cron uploads, we noticed an interesting pattern. A file transfer either succeeded at a high speed, or didn’t succeed at all and hung/timed out. There were no instances of a file uploading slowly and finishing successfully.

Removing the large volume of data from the equation, we were able to recreate the scenario via simple vanilla ssh. On a London machine an “ssh mtl-machine” would either work immediately, or hang and never establish a connection. Eyebrows started going up.

Where the wild packets are

We triple-checked the ssh server configs and health in Montreal:

The servers appeared healthy by all measures
SSHd DNS reverse lookup was not enabled
SSHd Maximum client connections was high enough
We were not under attack
Bandwidth usage was nowhere near saturation

Besides, even if something was off, we were observing the hangs talking to 2 completely distinct data centers in Montreal. Furthermore, our other data centers (non-London) were talking happily to Montreal. Something about London was off.

We fired up tcpdump and started looking at the packets, both in summary and in captured pcaps loaded into wireshark. We saw telltale signs of packet loss and retransmission, but it was minimal and not particularly worrisome.

We then captured full connections from cases where ssh established successfully, and full connections from cases where the ssh connection hung.

Here’s what we logically saw when a connection from London to Montreal hung:

Normal TCP handshake
Bunch of ssh-specific back and forth, with normal TCP ACK packets where they should be
A particular packet sent from London and received in Montreal
The same packet re-sent (and re-sent, several times) from London and received in Montreal
Montreal’s just not responding to it!

It didn’t make sense why Montreal was not responding (hence London re-transmitting it). The connection was stalled at this point, as the layer 4 protocol was at a stalemate. More infuriatingly, if you kill the ssh attempt in London and re-launched it immediately, odds are it worked successfully. When it did, tcpdump showed Montreal receiving the packet but responding to it, and things moved on.

We enabled verbose debugging (-vvv) on the ssh client in London, and the hang occurred after it logged:

debug2: kex_parse_kexinit: first_kex_follows 0 
debug2: kex_parse_kexinit: reserved 0 
debug2: mac_setup: found hmac-md5
debug1: kex: server->client aes128-ctr hmac-md5 none
debug2: mac_setup: found hmac-md5
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP

Googling “ssh hang SSH2_MSG_KEX_DH_GEX_GROUP” has many results - from bad WiFi, to windows TCP bugs, to buggy routers discarding TCP fragments. One solution for LANs was to figure out the path’s MSS and set that as the MTU on both ends.

I kept decrementing the MTU on a London server down from 1500 - it didn’t help until I hit the magic value 576. At that point, I was no longer able to get the ssh hanging behavior replicated. I had an ssh loop script running, and it was on-demand that I could cause timeouts by bringing the MTU back up to 1500, or make them disappear by setting it to 576.

Unfortunately these are public web servers and globally setting the MTU to 576 won’t cut it, but the above did suggest that perhaps packet fragmentation or reassembly is broken somewhere.

Going back to check the received packets with tcpdump, there was no evidence of fragmentation. The received packet size matched exactly the packet size sent. If something did fragment the packet at byte 576+, something else reassembled it successfully.

Twinkle twinkle little mis-shapen star

Digging in some more, I was now looking at full packet dumps (tcpdump -s 0 -X) instead of just the headers. Comparing that magic packet in instances of ssh success vs ssh hang showed very little difference aside from TCP/IP header variations. It was however clear that this is the first packet in the TCP connection that had enough data to bypass the 576-byte mark - all previous packets were much smaller.

Comparing the same packet, during a hanging instance, as it left London, and as captured in Montreal, something caught my eye. Something very subtle, and I brushed it off as fatigue (it was late Friday at this point), but sure enough after a few refreshes and comparisons, I wasn’t imagining things.

Here’s the packet as it left London (minus the first few bytes identifying the IP addresses):

0x0040:  0b7c aecc 1774 b770 ad92 0000 00b7 6563  .|...t.p......ec
0x0050:  6468 2d73 6861 322d 6e69 7374 7032 3536  dh-sha2-nistp256
0x0060:  2c65 6364 682d 7368 6132 2d6e 6973 7470  ,ecdh-sha2-nistp
0x0070:  3338 342c 6563 6468 2d73 6861 322d 6e69  384,ecdh-sha2-ni
0x0080:  7374 7035 3231 2c64 6966 6669 652d 6865  stp521,diffie-he
0x0090:  6c6c 6d61 6e2d 6772 6f75 702d 6578 6368  llman-group-exch
0x00a0:  616e 6765 2d73 6861 3235 362c 6469 6666  ange-sha256,diff
0x00b0:  6965 2d68 656c 6c6d 616e 2d67 726f 7570  ie-hellman-group
0x00c0:  2d65 7863 6861 6e67 652d 7368 6131 2c64  -exchange-sha1,d
0x00d0:  6966 6669 652d 6865 6c6c 6d61 6e2d 6772  iffie-hellman-gr
0x00e0:  6f75 7031 342d 7368 6131 2c64 6966 6669  oup14-sha1,diffi
0x00f0:  652d 6865 6c6c 6d61 6e2d 6772 6f75 7031  e-hellman-group1
0x0100:  2d73 6861 3100 0000 2373 7368 2d72 7361  -sha1...#ssh-rsa
0x0110:  2c73 7368 2d64 7373 2c65 6364 7361 2d73  ,ssh-dss,ecdsa-s
0x0120:  6861 322d 6e69 7374 7032 3536 0000 009d  ha2-nistp256....
0x0130:  6165 7331 3238 2d63 7472 2c61 6573 3139  aes128-ctr,aes19
0x0140:  322d 6374 722c 6165 7332 3536 2d63 7472  2-ctr,aes256-ctr
0x0150:  2c61 7263 666f 7572 3235 362c 6172 6366  ,arcfour256,arcf
0x0160:  6f75 7231 3238 2c61 6573 3132 382d 6362  our128,aes128-cb
0x0170:  632c 3364 6573 2d63 6263 2c62 6c6f 7766  c,3des-cbc,blowf
0x0180:  6973 682d 6362 632c 6361 7374 3132 382d  ish-cbc,cast128-
0x0190:  6362 632c 6165 7331 3932 2d63 6263 2c61  cbc,aes192-cbc,a
0x01a0:  6573 3235 362d 6362 632c 6172 6366 6f75  es256-cbc,arcfou
0x01b0:  722c 7269 6a6e 6461 656c 2d63 6263 406c  r,rijndael-cbc@l
0x01c0:  7973 6174 6f72 2e6c 6975 2e73 6500 0000  ysator.liu.se...
0x01d0:  9d61 6573 3132 382d 6374 722c 6165 7331  .aes128-ctr,aes1
0x01e0:  3932 2d63 7472 2c61 6573 3235 362d 6374  92-ctr,aes256-ct
0x01f0:  722c 6172 6366 6f75 7232 3536 2c61 7263  r,arcfour256,arc
0x0200:  666f 7572 3132 382c 6165 7331 3238 2d63  four128,aes128-c
0x0210:  6263 2c33 6465 732d 6362 632c 626c 6f77  bc,3des-cbc,blow
0x0220:  6669 7368 2d63 6263 2c63 6173 7431 3238  fish-cbc,cast128
0x0230:  2d63 6263 2c61 6573 3139 322d 6362 632c  -cbc,aes192-cbc,
0x0240:  6165 7332 3536 2d63 6263 2c61 7263 666f  aes256-cbc,arcfo
0x0250:  7572 2c72 696a 6e64 6165 6c2d 6362 6340  ur,rijndael-cbc@
0x0260:  6c79 7361 746f 722e 6c69 752e 7365 0000  lysator.liu.se..
0x0270:  00a7 686d 6163 2d6d 6435 2c68 6d61 632d  ..hmac-md5,hmac-
0x0280:  7368 6131 2c75 6d61 632d 3634 406f 7065  sha1,umac-64@ope
0x0290:  6e73 7368 2e63 6f6d 2c68 6d61 632d 7368  nssh.com,hmac-sh
0x02a0:  6132 2d32 3536 2c68 6d61 632d 7368 6132  a2-256,hmac-sha2
0x02b0:  2d32 3536 2d39 362c 686d 6163 2d73 6861  -256-96,hmac-sha
0x02c0:  322d 3531 322c 686d 6163 2d73 6861 322d  2-512,hmac-sha2-
0x02d0:  3531 322d 3936 2c68 6d61 632d 7269 7065  512-96,hmac-ripe
0x02e0:  6d64 3136 302c 686d 6163 2d72 6970 656d  md160,hmac-ripem
0x02f0:  6431 3630 406f 7065 6e73 7368 2e63 6f6d  [email protected]
0x0300:  2c68 6d61 632d 7368 6131 2d39 362c 686d  ,hmac-sha1-96,hm
0x0310:  6163 2d6d 6435 2d39 3600 0000 a768 6d61  ac-md5-96....hma
0x0320:  632d 6d64 352c 686d 6163 2d73 6861 312c  c-md5,hmac-sha1,
0x0330:  756d 6163 2d36 3440 6f70 656e 7373 682e  umac-64@openssh.
0x0340:  636f 6d2c 686d 6163 2d73 6861 322d 3235  com,hmac-sha2-25
0x0350:  362c 686d 6163 2d73 6861 322d 3235 362d  6,hmac-sha2-256-
0x0360:  3936 2c68 6d61 632d 7368 6132 2d35 3132  96,hmac-sha2-512
0x0370:  2c68 6d61 632d 7368 6132 2d35 3132 2d39  ,hmac-sha2-512-9
0x0380:  362c 686d 6163 2d72 6970 656d 6431 3630  6,hmac-ripemd160
0x0390:  2c68 6d61 632d 7269 7065 6d64 3136 3040  ,hmac-ripemd160@
0x03a0:  6f70 656e 7373 682e 636f 6d2c 686d 6163  openssh.com,hmac
0x03b0:  2d73 6861 312d 3936 2c68 6d61 632d 6d64  -sha1-96,hmac-md
0x03c0:  352d 3936 0000 0015 6e6f 6e65 2c7a 6c69  5-96....none,zli
0x03d0:  6240 6f70 656e 7373 682e 636f 6d00 0000  [email protected]...
0x03e0:  156e 6f6e 652c 7a6c 6962 406f 7065 6e73  .none,zlib@opens
0x03f0:  7368 2e63 6f6d 0000 0000 0000 0000 0000  sh.com..........
0x0400:  0000 0000 0000 0000 0000 0000            ............

And here’s the same packet as it arrived in Montreal:

0x0040:  0b7c aecc 1774 b770 ad92 0000 00b7 6563  .|...t.p......ec
0x0050:  6468 2d73 6861 322d 6e69 7374 7032 3536  dh-sha2-nistp256
0x0060:  2c65 6364 682d 7368 6132 2d6e 6973 7470  ,ecdh-sha2-nistp
0x0070:  3338 342c 6563 6468 2d73 6861 322d 6e69  384,ecdh-sha2-ni
0x0080:  7374 7035 3231 2c64 6966 6669 652d 6865  stp521,diffie-he
0x0090:  6c6c 6d61 6e2d 6772 6f75 702d 6578 6368  llman-group-exch
0x00a0:  616e 6765 2d73 6861 3235 362c 6469 6666  ange-sha256,diff
0x00b0:  6965 2d68 656c 6c6d 616e 2d67 726f 7570  ie-hellman-group
0x00c0:  2d65 7863 6861 6e67 652d 7368 6131 2c64  -exchange-sha1,d
0x00d0:  6966 6669 652d 6865 6c6c 6d61 6e2d 6772  iffie-hellman-gr
0x00e0:  6f75 7031 342d 7368 6131 2c64 6966 6669  oup14-sha1,diffi
0x00f0:  652d 6865 6c6c 6d61 6e2d 6772 6f75 7031  e-hellman-group1
0x0100:  2d73 6861 3100 0000 2373 7368 2d72 7361  -sha1...#ssh-rsa
0x0110:  2c73 7368 2d64 7373 2c65 6364 7361 2d73  ,ssh-dss,ecdsa-s
0x0120:  6861 322d 6e69 7374 7032 3536 0000 009d  ha2-nistp256....
0x0130:  6165 7331 3238 2d63 7472 2c61 6573 3139  aes128-ctr,aes19
0x0140:  322d 6374 722c 6165 7332 3536 2d63 7472  2-ctr,aes256-ctr
0x0150:  2c61 7263 666f 7572 3235 362c 6172 6366  ,arcfour256,arcf
0x0160:  6f75 7231 3238 2c61 6573 3132 382d 6362  our128,aes128-cb
0x0170:  632c 3364 6573 2d63 6263 2c62 6c6f 7766  c,3des-cbc,blowf
0x0180:  6973 682d 6362 632c 6361 7374 3132 382d  ish-cbc,cast128-
0x0190:  6362 632c 6165 7331 3932 2d63 6263 2c61  cbc,aes192-cbc,a
0x01a0:  6573 3235 362d 6362 632c 6172 6366 6f75  es256-cbc,arcfou
0x01b0:  722c 7269 6a6e 6461 656c 2d63 6263 406c  r,rijndael-cbc@l
0x01c0:  7973 6174 6f72 2e6c 6975 2e73 6500 0000  ysator.liu.se...
0x01d0:  9d61 6573 3132 382d 6374 722c 6165 7331  .aes128-ctr,aes1
0x01e0:  3932 2d63 7472 2c61 6573 3235 362d 6374  92-ctr,aes256-ct
0x01f0:  722c 6172 6366 6f75 7232 3536 2c61 7263  r,arcfour256,arc
0x0200:  666f 7572 3132 382c 6165 7331 3238 2d63  four128,aes128-c
0x0210:  6263 2c33 6465 732d 6362 632c 626c 6f77  bc,3des-cbc,blow
0x0220:  6669 7368 2d63 6263 2c63 6173 7431 3238  fish-cbc,cast128
0x0230:  2d63 6263 2c61 6573 3139 322d 6362 632c  -cbc,aes192-cbc,
0x0240:  6165 7332 3536 2d63 6263 2c61 7263 666f  aes256-cbc,arcfo
0x0250:  7572 2c72 696a 6e64 6165 6c2d 6362 7340  ur,rijndael-cbs@
0x0260:  6c79 7361 746f 722e 6c69 752e 7365 1000  lysator.liu.se..
0x0270:  00a7 686d 6163 2d6d 6435 2c68 6d61 732d  ..hmac-md5,hmas-
0x0280:  7368 6131 2c75 6d61 632d 3634 406f 7065  sha1,umac-64@ope
0x0290:  6e73 7368 2e63 6f6d 2c68 6d61 632d 7368  nssh.com,hmac-sh
0x02a0:  6132 2d32 3536 2c68 6d61 632d 7368 7132  a2-256,hmac-shq2
0x02b0:  2d32 3536 2d39 362c 686d 6163 2d73 7861  -256-96,hmac-sxa
0x02c0:  322d 3531 322c 686d 6163 2d73 6861 322d  2-512,hmac-sha2-
0x02d0:  3531 322d 3936 2c68 6d61 632d 7269 7065  512-96,hmac-ripe
0x02e0:  6d64 3136 302c 686d 6163 2d72 6970 756d  md160,hmac-ripum
0x02f0:  6431 3630 406f 7065 6e73 7368 2e63 7f6d  [email protected]
0x0300:  2c68 6d61 632d 7368 6131 2d39 362c 786d  ,hmac-sha1-96,xm
0x0310:  6163 2d6d 6435 2d39 3600 0000 a768 7d61  ac-md5-96....h}a
0x0320:  632d 6d64 352c 686d 6163 2d73 6861 312c  c-md5,hmac-sha1,
0x0330:  756d 6163 2d36 3440 6f70 656e 7373 782e  umac-64@openssx.
0x0340:  636f 6d2c 686d 6163 2d73 6861 322d 3235  com,hmac-sha2-25
0x0350:  362c 686d 6163 2d73 6861 322d 3235 362d  6,hmac-sha2-256-
0x0360:  3936 2c68 6d61 632d 7368 6132 2d35 3132  96,hmac-sha2-512
0x0370:  2c68 6d61 632d 7368 6132 2d35 3132 3d39  ,hmac-sha2-512=9
0x0380:  362c 686d 6163 2d72 6970 656d 6431 3630  6,hmac-ripemd160
0x0390:  2c68 6d61 632d 7269 7065 6d64 3136 3040  ,hmac-ripemd160@
0x03a0:  6f70 656e 7373 682e 636f 6d2c 686d 7163  openssh.com,hmqc
0x03b0:  2d73 6861 312d 3936 2c68 6d61 632d 7d64  -sha1-96,hmac-}d
0x03c0:  352d 3936 0000 0015 6e6f 6e65 2c7a 7c69  5-96....none,z|i
0x03d0:  6240 6f70 656e 7373 682e 636f 6d00 0000  [email protected]...
0x03e0:  156e 6f6e 652c 7a6c 6962 406f 7065 6e73  .none,zlib@opens
0x03f0:  7368 2e63 6f6d 0000 0000 0000 0000 0000  sh.com..........
0x0400:  0000 0000 0000 0000 0000 0000            ............

Did something there catch your eye ? If not, I don’t blame you. Feel free to copy each into a text editor and rapidly switch back-and-forth to see some characters dance. Here’s what it looks like when they’re placed in vimdiff:

Vim diff packet

Well well well. It’s not packet loss, it’s packet corruption! Very subtle, very predictable packet corruption.

Some interesting notes:

The lower part of the packet (<576 bytes) is unaffected
The affected portion is predictably corrupted on the 15th byte of every 16
The corruption is predictable. All instances of “h” become “x”, all instances of “c” become “s”

Some readers might have already checked ASCII charts and reached the conclusion: There’s a single bit statically stuck at “1” somewhere. Flipping the 4th bit in a byte to 1 would reliably corrupt the above letters on the left side to the value on the right side.

The obvious culprits within our control (NIC cards, receiving machines) are not suspect due to the pattern of failure observed (several London machines -> Several Montreal data centers and machines). It’s got to be something upstream and close to London.

Going back to validate, things started to make sense. I also noticed a little hint in tcpdump verbose mode (tcp cksum bad) which was missed before. A Montreal machine receiving this packet discarded it at the kernel level after realizing it’s corrupt, never passing it to the userland ssh daemon. London then re-transmitted it, going through the same corruption, getting the same silent treatment. From ssh and sshd’s perspective, the connection was at a stalemate. From tcpdump’s perspective, there was no loss, and Montreal machines appeared to be just ignoring data.

We sent these findings to our London DC ops, and within a few minutes they changed outbound routes dramatically. The first router hop, and most hops afterwards, were different. The hanging problem disappeared.

Late Friday night fixes are nice because you can relax and not carry problems and support staff into the weekend :)

Where’s Waldo

Happy that we were no longer suffering from this problem and that our systems are caught up with the backlog, I decided I’d try my hand at actually finding the device causing the corruption.

Having the London routes updated to not go through the old path meant that I couldn’t reproduce the problem easily. I asked around until I found a friend with a FreeBSD box in Montreal I could use, which was still accessed through the old routes from London.

Next, I wanted to make sure that the corruption is predictable even without ssh involvement. This was trivially proven with a few pipes.

In Montreal:

nc -l -p 4000 > /dev/null

Then in London:

cat /dev/zero | nc mtl 4000

Again, accounting for the randomness factor and settings things up in a retry loop, I got a few packets which remove any doubt about the previous conclusions. Here’s part of one - remember that we’re sending just a stream of nulls(zeroes):

0x0210  .....
0x0220  0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0230  0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0240  0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0250  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0260  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0270  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0280  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0290  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02a0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02b0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02c0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02d0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02e0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x02f0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0300  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0310  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0320  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0330  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0340  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0350  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0360  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0370  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0380  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x0390  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x03a0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x03b0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x03c0  0000 0000 0000 0000 0000 0000 0000 1000 ................
0x03d0  0000 0000 0000 0000 0000 0000 0000 0000 ................
0x03e0  .....

With the bug replicated, I needed to find a way to isolate which of the 17 hops along that path cause the corruption. There was simply no way to call up the provider of each cluster to ask them to check their systems.

I decided pinging each router, incrementally, might be the way to go. I crafted special ICMP packets that are large enough to go over the 576 safety margin, and filled entirely with NULLs. Then pinged the Montreal machine with them from London.

They came back perfectly normal. There was no corruption.

I tried all variations of speed, padding, size - to no avail. I simply could not observe corruption in the returned ICMP ping packets.

I replaced the netcat pipes with UDP instead of TCP. Again there was no corruption.

The corruption needed TCP to be reproducible - and TCP needs 2 cooperating endpoints. I tried in vain to see if all 17 router hops had an open TCP port I can talk to directly, to no avail.

It seemed there was no easy way an external party can pinpoint the bad apple. Or was there ?

Mirror mirror on the wall

To detect whether corruption occurred or not, we need one of these scenarios:

Control over the TCP peer we’re talking to inspect the packet at the destination
- Not just in userland, where the packet would not get delivered if the TCP checksum failed, but root + tcpdump to inspect it as it arrives
A TCP peer that acts as an echo server to mirror back the data it received, so we get to inspect it at the sending node and detect corruption there

It suddenly occurred to me that the second data point is available to us. Not per-se, but consider this: In our very first taste of the problem, we observed ssh clients hanging when talking to ssh servers over the corrupting hop. This is a good passive signal that we can use instead of the active “echo” signal.

… and there are lots of open ssh servers out there on the internet to help us out.

We don’t need actual accounts on these servers - we just need to kickstart the ssh connection and see if the cipher exchange phase succeeds or hangs (with a reasonable number of retries to account for corruption randomness).

So this plan was hatched:

Use the wonderful nmap tool - specifically - its “random IP” mode - to make a list of geographically distributed open ssh servers
Test each server to determine whether it is:
- Unresponsive/unpredictable/firewalled -> Ignore it
- Negotiates successfully after being retried N times -> mark as “good”
- Negotiates with hangs at the telltale phase after being retried N times -> mark as “bad”
For both “good” and “bad” servers, remember the traceroute to them

The idea was this: All servers marked as “bad” will share a few hops in their traceroute. We can then take that set of suspect hops, and subtract from it any that appear in the traceroutes of the “good” servers. Hopefully what’s left is only one or two.

After spending an hour manually doing the above exercise, I stopped to inspect the data. I had classified 16 servers as “BAD” and 25 servers as “GOOD”.

The first exercise was to find the list of hops that appear in all the traceroutes of the “BAD” servers. As I cleaned and trimmed the list, I realized I won’t even need to get to the “GOOD” list to remove false positives. Within the “BAD” lists alone, there remained only 1 that was common to all of them.

For what it’s worth, it was 2 providers away: London -> N hops upstream1 -> Y hops upstream2

It was the first in Y hops of upstream2 - right at the edge between upstream1 and upstream2, corrupting random TCP packets, causing many retries, and, depending on the protocol’s logical back-and-forth, hangs, or reduced transmission rates. You may have been a telephony provider who sufferred dropped calls, a retailer who lost a few customers or sales, the possibilities really are endless.

I followed up with our London DC ops with the single hop’s IP address. Hopefully with their direct relationship with upstream1 they can escalate through there and get it fixed.

/filed under crazy devops war stories

Update

Through upstream1, I got confirmation that the hop I pointed out (first in upstream2) had an internal “management module failure” which affected BGP and routing between two internal networks. It’s still down (they’ve routed around it) until they receive a replacement for the faulty module.

Thanks for the kind words and great comments here on Disqus, Reddit (/r/linux & /r/sysadmin) and hacker news

October 22, 2012