Wednesday, May 15, 2013

If you have to reboot your servers often, it's probably port exhaustion

This issue can affect any networked system of note, but it seems particularly insidious on Windows servers. Two servers I recently set up with Windows Server 2012 have become unresponsive after varying amounts of uptime. One handles email and limited DNS service; it can die within 4-5 days of uptime, though it used to run for weeks as a 2008 R2 VM. The other is a brand-new file server with 50-100 possible users; it lasts maybe two weeks.

Earlier today, I found an article from a Microsoft blogger regarding this behavior. She used Process Explorer to diagnose instances caused by specific software, but offered no immediate fixes. Another Microsoft post suggested using netstat to track how many TCP/IP ports are open at a time. Both articles referred to how Windows limits the number of ephemeral ports on a system. Basically, when a client makes a request to a service on a port (say 143 for email), its own end of the connection uses a different, temporarily assigned port number (say 50000), and the server's replies go back to that port. A server plays the client role too, whenever it opens outbound connections of its own. The "ephemeral port range" on modern systems is ports 49152 to 65535, and it is implemented separately for TCP, UDP, IPv4, and IPv6 (technically meaning you could have roughly 65,000 ports in use across all four combinations). As more connections are opened and left waiting, the available pool of these randomly assigned ports shrinks, eventually to the point that you can't even RDP in to reboot the affected server.
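
As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic. It assumes the default range of 49152 to 65535 mentioned above and simply treats each protocol/address-family combination as its own pool; the variable names are mine, not anything Windows exposes.

```python
# Default ephemeral (dynamic) port range described above.
FIRST_PORT = 49152
LAST_PORT = 65535

# Each protocol / IP version combination gets its own pool of ports.
POOLS = ["tcp/ipv4", "udp/ipv4", "tcp/ipv6", "udp/ipv6"]

ports_per_pool = LAST_PORT - FIRST_PORT + 1
total_ports = ports_per_pool * len(POOLS)

print(f"ports per pool: {ports_per_pool}")
print(f"ports across all {len(POOLS)} pools: {total_ports}")
```

So a single pool (say, TCP over IPv4) only has about 16,000 ports to hand out, which is the number that matters for a busy service.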

I saw some other mentions online of expanding the port range down to as low as 1024 (ports 0-1023 are reserved for common, standardized services). This might help with a large number of connections, but even that might not be enough for applications with sloppy use of networking (or iffy networks). A SQL buff tackled this issue; he recommended a few changes, some of which I have adopted to resolve my servers' issues...

* You can use netstat -n (or even redirect it to a text file) to list your current port usage, as well as the state of each connection. If you see a lot of "wait" states (such as TIME_WAIT), the next step will apply.
* Save the following as a registry (.reg) file to import. It creates values that shorten how long a closed connection lingers in TIME_WAIT to 30 seconds (from a default of ~4 minutes), and sets a 15-minute idle time before keep-alive checks on established connections (default is 2 hours).

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"KeepAliveTime"=dword:000dbba0
"TCPTimedWaitDelay"=dword:0000001e
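
The dword values are hexadecimal, so they are easy to misread. A small Python sketch for decoding them, assuming (as the descriptions above imply) that KeepAliveTime is in milliseconds and TCPTimedWaitDelay is in seconds; dword_to_int is just a helper name I made up:

```python
def dword_to_int(text):
    """Convert a .reg style 'dword:xxxxxxxx' string to an integer."""
    prefix = "dword:"
    if not text.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {text!r}")
    return int(text[len(prefix):], 16)

# Values copied from the registry file above.
keepalive_ms = dword_to_int("dword:000dbba0")   # milliseconds (assumed unit)
timed_wait_s = dword_to_int("dword:0000001e")   # seconds (assumed unit)

print(f"KeepAliveTime     = {keepalive_ms} ms ({keepalive_ms / 60000:.0f} min)")
print(f"TCPTimedWaitDelay = {timed_wait_s} s")
```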


You should reboot when you are able, both to put these settings into effect and to clear out existing connections. Afterwards, you can check periodically with netstat -n to see whether the number of "waits" decreases.
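
Counting the "waits" by eye gets old fast. Below is a rough Python sketch that tallies TCP connection states from netstat -n output. The SAMPLE text is made-up data in the usual Windows column layout (Proto, Local Address, Foreign Address, State), and count_tcp_states is a hypothetical helper, so adjust the parsing if your output looks different.

```python
from collections import Counter

# Made-up sample in the layout that netstat -n prints on Windows.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:49170         10.0.0.9:445           ESTABLISHED
  TCP    10.0.0.5:49171         10.0.0.20:143          TIME_WAIT
  TCP    10.0.0.5:49172         10.0.0.20:143          TIME_WAIT
  TCP    10.0.0.5:49173         10.0.0.21:25           CLOSE_WAIT
"""

def count_tcp_states(netstat_output):
    """Return a Counter of TCP states found in netstat -n style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows have exactly four columns and start with the protocol.
        if len(fields) == 4 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states

counts = count_tcp_states(SAMPLE)
for state, n in counts.most_common():
    print(f"{state:12} {n}")
```

On the server itself, you could feed it live data instead of the sample with subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout, and run it before and after the registry change to compare.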

Added: it might still help to boost the port range on these servers. Command sequence for expanding the range to roughly half of all port numbers (32768 to 65534)...

netsh int ipv4 set dynamicport tcp start=32768 num=32767
netsh int ipv4 set dynamicport udp start=32768 num=32767
netsh int ipv6 set dynamicport tcp start=32768 num=32767
netsh int ipv6 set dynamicport udp start=32768 num=32767
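
To double-check which ports a given start/num pair actually covers, here is a small Python sketch. dynamic_range is a hypothetical helper of mine; the only rule it enforces is that the range cannot run past port 65535.

```python
MAX_PORT = 65535

def dynamic_range(start, num):
    """Return (first, last) port covered by a start/num pair."""
    last = start + num - 1
    if last > MAX_PORT:
        raise ValueError(f"range would end at {last}, past port {MAX_PORT}")
    return start, last

# Same values as the netsh commands above.
first, last = dynamic_range(32768, 32767)
print(f"dynamic ports: {first}-{last} ({last - first + 1} ports)")
```

With the values above, the range stops one short of 65535. After running the commands, netsh int ipv4 show dynamicport tcp (and the udp/ipv6 variants) should report the new start port and count.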
