Version 10 (modified by Dominic König, 8 years ago)

Installation of a Sahana Eden Cluster

Scalable 3-tier architecture:

Database server (which could be Amazon RDS, and maybe a warm standby)
Middleware servers (Web2Py & hence Eden)
Web servers (e.g. Apache or Cherokee)

NB: This is currently NOT a supported architecture. We know of no one who has successfully deployed it, although it should be possible in theory.

We do have experience (IBM/DRK) of running the DB server (PostgreSQL) separately, which gives good performance, although when a query took too long, mod_wsgi was unable to close that process.

Middleware

Some folders will need to be on a shared filesystem (e.g. NFS):

/uploads (or they could be synced across e.g. using rsync)
/errors (if these are to be viewed using the Web UI)
/static/img/markers (unless markers don't need to be customised via Web UI)
/sessions (A better option is to use memcache for this - see 000_settings.py)
/databases (if using dynamic tables, such as gis_layer_shapefile, otherwise these can simply be synced together, e.g. using rsync. Or we can now store these in the DB using dal.py's DatabaseStoredFile)

Installation

Installation can be automated using these Fabric scripts:

https://github.com/lifeeth/spawn-eden

Issues

Problem

The server suddenly becomes unavailable for about 15 minutes, then returns to normal.

Observations:

No signs of CPU/RAM saturation at the back-end (no signs of any back-end activity at all during the blackout)
Instant recovery: server is immediately back to normal after the blackout (no secondary delays)
All front-end workers stuck at the same time, and also recovering at exactly the same time
None of the logs showing any irregularities - no errors, no strange requests/responses

Critically:

Blackout length is constant (900sec), and doesn't seem to depend on request complexity
Terminating the uWSGI worker that got stuck first ("kill") resolves the problem, and lets all other workers recover instantly
Typically occurs after a period of lower traffic, at the moment when traffic increases again
At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request

Reason

There is a router between front- and back-end - which isn't the ideal configuration, of course, but normal in many clouds.

This router tracks connections, and drops any connection that has been idle for a certain period of time. That, too, is a normal feature in many clouds.

If connections need to stay open for longer even when no communication happens (i.e. standby connections), this is normally solved by keepalive messages which the TCP partners exchange from time to time, so that the connection doesn't look dead to the router.

The default interval for keepalives in Debian is 7200sec, i.e. 2 hours. But here, the router has a much shorter timeout than that - so the connection is dropped before the first keepalive is ever sent.
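The keepalive mechanism described above can be illustrated at the socket level. This is only a minimal sketch using Python's standard socket module, not how Eden or psycopg2 configure their connections; the function name and the default values (60/10/5) are assumptions chosen for illustration, and the TCP_KEEP* options are platform-specific, so they are applied only where available.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalives for one socket (illustrative values only)."""
    # Ask the kernel to send keepalive probes on this connection
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide defaults, where the platform
    # exposes them (these constants exist on Linux)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    # A non-zero value means keepalives are enabled on this socket
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

The idle value is the one that matters here: it has to be shorter than the router's idle timeout, otherwise the first probe is sent too late.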

Unfortunately, neither the front-end nor the back-end would notice this connection dropping, so they assume the connection to be still okay.

So a worker which has been inactive for some time assumes it can still send its query to the back-end via the open connection. But since the connection doesn't actually exist anymore, the back-end never receives the transmission and therefore sends no ACK.

The worker retries again and again until the tcp_retries2 limit is reached (15 retransmissions by default, which with the kernel's exponential back-off adds up to roughly 900 seconds), and the front-end finally realises that the connection is dead. It then immediately opens a new connection, and the request succeeds instantly.
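As a rough cross-check of the blackout length, the following sketch adds up the waiting periods for the default tcp_retries2 of 15. It assumes the usual Linux behaviour of doubling the retransmission timeout from a 200 ms minimum up to a 120 s cap; in practice the starting timeout depends on the measured round-trip time, so the result is a hypothetical lower bound, not an exact prediction. The function and constant names are made up for this example.

```python
TCP_RTO_MIN_S = 0.2    # assumed minimum retransmission timeout (Linux)
TCP_RTO_MAX_S = 120.0  # assumed maximum retransmission timeout (Linux)


def hypothetical_timeout(retries=15, rto_min=TCP_RTO_MIN_S, rto_max=TCP_RTO_MAX_S):
    """Sum the waits after the original send and after each retransmission."""
    total = 0.0
    rto = rto_min
    # One wait after the first transmission plus one after each retry
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2  # exponential back-off, capped by rto_max above
    return total


if __name__ == "__main__":
    print(round(hypothetical_timeout(), 1))  # prints 924.6
```

That is about 15 minutes, which is consistent with the constant blackout length observed above.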

Now, even if we use multiple interpreters, there is still only one global libpq instance (that's the C library psycopg2 uses to communicate with PostgreSQL). And while that libpq instance tries to poll a dead connection, it is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing worker drags all the others down with it, and they all hang even though only one or two workers actually have the problem. If you kill the sleepy worker, the others recover instantly (that's how I found the problem).

Of course, in a way, one would think that uWSGI's round-robin would prevent a worker from being idle for so long (4000+ seconds) - an average request rate of 2 req/sec should actually employ all 4 workers regularly.

But that isn't the case - most of the time, the first worker succeeds in much less than half a second, so it becomes available again before the next request comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%) across the workers, and in low-traffic times, the last worker or even the last two may indeed be idle for hours.

Solution

The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf, i.e. shorter than the router timeout - then the router won't drop the connections even if the worker never does anything, and libpq will never get stuck.
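As a sketch of how such values might be chosen, the helper below derives keepalive settings from the router's idle timeout and prints them as postgresql.conf lines. The three setting names are real PostgreSQL parameters (values in seconds, or a probe count for tcp_keepalives_count), but the router timeout of 600 seconds, the safety factor, and the interval and count defaults are assumptions for illustration only; the actual router timeout has to be looked up for the cloud in question.

```python
def keepalive_settings(router_idle_timeout_s, safety_factor=0.5,
                       interval_s=30, count=3):
    """Pick an idle time safely below the router's idle timeout."""
    if router_idle_timeout_s <= 1:
        raise ValueError("router timeout must be more than 1 second")
    # Send the first keepalive well before the router would drop the connection
    idle_s = max(1, int(router_idle_timeout_s * safety_factor))
    return {
        "tcp_keepalives_idle": idle_s,
        "tcp_keepalives_interval": interval_s,
        "tcp_keepalives_count": count,
    }


def render_conf(settings):
    """Format the settings as postgresql.conf lines."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


if __name__ == "__main__":
    # 600 is a hypothetical router timeout, not a value from this deployment
    print(render_conf(keepalive_settings(600)))
```

PostgreSQL needs to reload its configuration before the new values apply, and connections that are already open may keep their old settings until they are re-established.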