Monday, October 18, 2010

When to gripe to your Virtual Private Server ISP...

As some of you might know, apart from running Flight Sim Labs, Ltd., I have also been wearing the co-administrator hat for FSDeveloper for a couple of years now...

Not so long ago, the FSDeveloper admins (Arno, Nick, Jon and myself) decided to move to a new server ISP facility, as demand had grown substantially over the years and the older server was showing its age. For reasons of familiarity, we chose a Windows-based solution running on a Virtual Private Server (so costs could remain very low, as we have no real income to support this volunteer-based effort).

This has worked exceptionally well since the switch, allowing us to provide better service to our "customers" (Flight Simulator developers), along with some nice new facilities.

However, for the past couple of months, our users have been reporting that the server was not too stable - sometimes they'd be able to log on, other times they'd get connection timeouts or "reset by host" errors. While this was happening, our Domain Name Server also had some issues, so we attributed the problems to faulty name-to-IP resolution.

Well, today the problem returned - and returned to stay. Nobody could connect to the web site or the forums, regardless of their location, browser, or ISP. At the same time, however, the administrators could still log on to the server via Remote Desktop, so it didn't look like a network connectivity issue...

Digging into our server showed that while IIS 6.0 itself was running perfectly well, no user connections were being accepted at all - instead, a bunch of "Connections_Refused" errors would appear every minute or so in the HTTPERR logs.
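For anyone who wants to check their own HTTPERR logs for the same symptom, here is a minimal Python sketch that counts refused-connection entries per minute. The sample lines are made up for illustration (they are not from our server), and the sketch simply assumes each entry starts with a date and a time field and carries the reason text somewhere on the line:

```python
import re
from collections import Counter

# Matches the reason text with or without the plural "s".
REFUSED = re.compile(r"Connections?_Refused")


def refused_per_minute(lines):
    """Count refused-connection log entries, grouped by minute."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.split()
        if len(fields) < 2 or not REFUSED.search(line):
            continue
        # fields[0] is the date, fields[1] the time (HH:MM:SS).
        minute = f"{fields[0]} {fields[1][:5]}"
        counts[minute] += 1
    return counts


# Hypothetical sample entries, for illustration only.
SAMPLE = [
    "#Fields: date time c-ip c-port s-ip s-port cs-version cs-method cs-uri sc-status s-siteid s-reason s-queuename",
    "2010-10-18 14:02:05 - - - - - - - - - Connections_Refused -",
    "2010-10-18 14:02:41 - - - - - - - - - Connections_Refused -",
    "2010-10-18 14:03:12 10.0.0.5 1234 10.0.0.1 80 HTTP/1.1 GET / - 1 Timer_ConnectionIdle DefaultAppPool",
    "2010-10-18 14:03:40 - - - - - - - - - Connections_Refused -",
]

if __name__ == "__main__":
    for minute, count in sorted(refused_per_minute(SAMPLE).items()):
        print(minute, count)
```

A steady stream of these entries while the web service itself looks healthy is the pattern we were seeing.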

Some aggressive googling later, we identified the culprit:

Our ISP has selected Virtuozzo for their Virtual Private Server hosting solutions, and our server is one of many VEs (Virtual Environments) running on the same 32-bit physical machine.

While total free RAM on the physical machine is not that important here, the same cannot be said for memory that cannot be "paged out" to disk (such as memory used by critical processes and drivers that need to stay in place all the time).

The problem was isolated to IIS 6.0 and a shortage of non-paged pool memory. IIS will refuse any new connections if it detects that non-paged pool usage has grown enough to leave less than 20 MB available on the physical machine (hence the many "Connections_Refused" errors in the HTTPERR logs).
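To make the arithmetic concrete, here is a rough Python sketch of the threshold check. All the numbers except the 20 MB cutoff are assumptions for illustration: the 256 MB pool limit and the 30 MB per-VE usage are made-up figures, not measurements from our host.

```python
DEFAULT_THRESHOLD_MB = 20  # cutoff described above


def free_nonpaged_mb(pool_limit_mb, per_ve_usage_mb):
    """Non-paged pool left on the host after every VE takes its share."""
    return pool_limit_mb - sum(per_ve_usage_mb)


def accepts_new_connections(free_mb, threshold_mb=DEFAULT_THRESHOLD_MB):
    """New connections are refused once free pool drops below the cutoff."""
    return free_mb >= threshold_mb


if __name__ == "__main__":
    pool_limit = 256  # assumed pool ceiling in MB, illustrative only
    for ve_count in (7, 8):
        usage = [30] * ve_count  # assumed 30 MB of non-paged pool per VE
        free = free_nonpaged_mb(pool_limit, usage)
        print(ve_count, "VEs:", free, "MB free, accepting:",
              accepts_new_connections(free))
```

With these made-up figures, a single extra VE on the machine is enough to push every web site on the host below the cutoff, even though none of them did anything wrong on its own.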

We verified this was the problem by applying a workaround: adding an "EnableAggressiveMemoryUsage" registry entry, which tells IIS not to refuse connections until the free non-paged pool falls below 8 MB. This temporarily fixed the problem.
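For reference, here is a hedged Python sketch of setting that value with the standard winreg module. The key path below is an assumption on my part (it is where Microsoft documents the HTTP service parameters, but please check it against their documentation before relying on it), and the function name is just illustrative. It needs administrator rights, only works on Windows, and the HTTP service typically has to be restarted before the change takes effect.

```python
# Assumed location of the HTTP service parameters; verify before use.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\HTTP\Parameters"
VALUE_NAME = "EnableAggressiveMemoryUsage"


def enable_aggressive_memory_usage():
    """Set the registry value to 1 (DWORD). Windows only, needs admin."""
    import winreg  # imported here so the module still loads elsewhere

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 1)


if __name__ == "__main__":
    enable_aggressive_memory_usage()
    print("Value set; restart the HTTP service for it to take effect.")
```

Keep in mind this only lowers the cutoff; it does nothing about the underlying shortage.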

While we're elated that we can now serve our loyal FSDeveloper followers once again, we are a bit frustrated that the REAL solution will have to come from the ISP:

a) limit the number of VEs on the same machine so non-paged pool usage is reduced, or
b) move our virtual server instance to a different machine with fewer VEs.

Solution c) is also a possibility: moving to a dedicated server. However, that would cost about five times as much as the existing solution, so it cannot happen at this time...

Let's see what our friendly ISP has to say about this problem. I'll keep everyone posted when I hear back!
