Discussion:
Detecting stale CLOSE-WAITs
SRSeedBurners
2021-07-05 16:22:07 UTC
We had an incident where something happened on the circuit between a vendor and our application. During normal operations they persistently maintain ~200 TCP/IP connections to our receiver application. An scf status detail on the stack will show 200 ESTAB connections to the application port. After the incident, the same command showed a bunch of CLOSE-WAIT connections that stayed resident for the whole day, until we were able to take a maintenance window at midnight. The vendor re-routed traffic onto a backup line and stopped their client, at which point I thought I saw ALL the connections (established and close-wait) disappear. I was not expecting the close-waits to go away; I thought I would have to bounce our side to get rid of them, but that was not the case.

Questions
- What does CLOSE-WAIT indicate? (From the manual: "CLOSE-WAIT: waiting for a terminate connection request from the local user.")
- How do we detect and clean these up? (Task thrown on me by management!)
Randall
2021-07-05 19:26:43 UTC
CLOSE_WAIT generally means that the remote end of a TCP connection has terminated its side of the link: the local end (the server) has received a FIN but has not yet closed the connection. If you call shutdown() without calling close(), this can happen as well. The TCP/IP stack cleans this up, generally when a keepalive timer expires or when all required closes and terminations happen. If you do not perform a read() or recv() on the socket, the application never sees the end-of-stream, and the socket will remain open until the stack detects that the remote is gone - gone could mean a close() and shutdown() or unplugging the cable or disconnecting the WiFi. This part is only partially in your application's control. You can try reducing the keepalive interval, which may increase network traffic. You can try calling recv() more often or decreasing the poll/select interval, but that can also increase load.
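
To make the application side of this concrete, here is a minimal Python sketch (not NonStop-specific, and not your receiver's actual code). It assumes a plain blocking TCP server on loopback; the names serve_one and demo are made up for illustration. The point is only that a recv() returning zero bytes means the peer has sent its FIN, and that the local close() is what moves the socket out of CLOSE_WAIT.

```python
import socket


def serve_one(conn: socket.socket) -> int:
    """Read until the peer closes, then close our side.

    Returns the number of bytes received. (Hypothetical helper.)
    """
    total = 0
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # Zero bytes: the peer sent FIN. Until we close(),
                # the local socket sits in CLOSE_WAIT.
                break
            total += len(data)
    finally:
        # The local close() is what ends CLOSE_WAIT.
        conn.close()
    return total


def demo() -> int:
    """Loopback demonstration: the client sends 5 bytes and closes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    conn.settimeout(5)  # avoid hanging forever if something goes wrong
    client.sendall(b"hello")
    client.close()  # the remote end terminates first
    received = serve_one(conn)
    listener.close()
    return received
```

If the application never reaches that recv() (for example because it is blocked elsewhere or only polls the socket rarely), the close() never happens and the connection lingers in CLOSE_WAIT.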

Detecting this is generally done the way you have done it, by looking in SCF. Dealing with them is generally outside of operational control, and there are tradeoffs in the application to make this as seamless as possible. This happens a lot with web servers.
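
For the "stale" part of detection, one possible approach is to compare two status snapshots taken some time apart and flag connections that are in CLOSE-WAIT in both. The sketch below is only an illustration under assumptions: it assumes you have already extracted (remote address, remote port, state) tuples from the status output by whatever means you have, and it does not parse real SCF output. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed shape of one parsed connection: (remote address, remote port, state).
Conn = Tuple[str, int, str]


def close_waits(snapshot: Iterable[Conn]) -> Set[Tuple[str, int]]:
    """Return the remote endpoints that are in CLOSE-WAIT in one snapshot."""
    return {
        (addr, port)
        for addr, port, state in snapshot
        # Accept both spellings: CLOSE-WAIT and CLOSE_WAIT.
        if state.upper().replace("_", "-") == "CLOSE-WAIT"
    }


def stale_close_waits(earlier: Iterable[Conn], later: Iterable[Conn]) -> Set[Tuple[str, int]]:
    """Endpoints in CLOSE-WAIT in both snapshots, i.e. candidates for 'stale'."""
    return close_waits(earlier) & close_waits(later)


earlier = [
    ("10.0.0.5", 40001, "CLOSE-WAIT"),
    ("10.0.0.5", 40002, "ESTAB"),
    ("10.0.0.5", 40003, "CLOSE_WAIT"),
]
later = [
    ("10.0.0.5", 40001, "CLOSE-WAIT"),
    ("10.0.0.5", 40002, "CLOSE-WAIT"),
]
stale = stale_close_waits(earlier, later)
```

Here only the connection from port 40001 is flagged, since it is the only one in CLOSE-WAIT both times. A remote port can be reused by a new connection between snapshots, so treat the result as a list of candidates to look at rather than proof.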

Good luck.
