Version 9 (modified by Fran Boon, 8 years ago)
shorter tcp_keepalives_idle in postgresql.conf

Installation of a Sahana Eden Cluster

Scalable 3-tier architecture:

Database server (which could be Amazon RDS, and maybe a warm standby)
Middleware servers (Web2Py & hence Eden)
Web servers (e.g. Apache or Cherokee)

NB This is NOT a supported architecture currently. We have no knowledge of anyone successfully deploying like this, although it should be theoretically possible.

We do have experience (IBM/DRK) of the DB server (PostgreSQL) running separately, which gives good performance, although when a query took too long, mod_WSGI wasn't able to close that process.

Middleware

Some folders will need to be on a shared filesystem (e.g. NFS):

/uploads (or they could be synced across e.g. using rsync)
/errors (if these are to be viewed using Web UI)
/static/img/markers (unless markers don't need to be customised via Web UI)
/sessions (A better option is to use memcache for this - see 000_settings.py)
/databases (if using dynamic tables, such as gis_layer_shapefile, otherwise these can simply be synced together, e.g. using rsync. Or we can now store these in the DB using dal.py's DatabaseStoredFile)

Installation

Installation can be automated using these Fabric scripts:

https://github.com/lifeeth/spawn-eden

Issues

Problem

Over a couple of days users found the Village server suddenly unavailable for about 15 minutes, before returning to normal.

At first I thought it might be complex requests saturating the back-end, but it wasn't. The logs looked strange...

All four workers would suddenly get stuck - with trivial requests - all at exactly the same time, as if the back-end had gone away. And then, after an interval of 900 seconds, they would all continue as normal (no errors), again, all at exactly the same time.

That interval was exactly the same every time it happened - regardless of the time of day, regardless of the type of requests, and regardless of server load. It happened equally with low traffic in the middle of the night as with peak traffic around mid-day.

None of the logs showed any irregularities, none whatsoever. No errors, no strange responses - just delayed for no obvious reason.

So, four things were puzzling me here:
1) the delay was always exactly the same length (900sec) - like a preset timeout
2) it never hit any software-side timeout; requests were processed normally, with proper responses and no errors
3) the delay was independent of current server load and request complexity
4) all four workers were hit at exactly the same time

---

So, to unmask the problem, I've reduced all software-side timeouts (connection, harakiri, idle) to values well below those 900sec - in the hope that if it hits any of those timeouts, it would produce an error message telling me where to look.

But it just...doesn't. None of the timeouts is ever hit.

Not even harakiri - because in fact, the worker returns with a result in less than a second. But /then/ it's stuck, and I can't see why.

The 900sec timeout is most likely tcp_retries2, which is a low-level network setting that tells Linux how many times to retry sending a packet over a TCP socket if no ACK is received (15 by default). The wait between retries backs off exponentially up to a cap, which adds up to roughly 15 minutes, i.e. about 900sec.
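As a rough check on that figure: Linux does not wait a fixed interval between retransmissions, it doubles the retransmission timeout (RTO) after each attempt up to a cap. The sketch below adds up that backoff. It assumes the kernel's usual bounds of 200 ms (minimum RTO) and 120 s (maximum RTO) and an initial RTO at the minimum, which is plausible on a low-latency link; the function name is purely illustrative.

```python
def retransmit_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Approximate seconds until Linux gives up on an unacknowledged packet.

    Assumes the first RTO equals rto_min and doubles after every attempt,
    capped at rto_max. Real connections derive the RTO from the measured
    round-trip time, so this is a lower bound rather than an exact figure.
    """
    # One initial transmission plus `retries` retransmissions, each followed
    # by a wait of the then-current RTO.
    return sum(min(rto_min * 2 ** attempt, rto_max)
               for attempt in range(retries + 1))


total = retransmit_timeout()
print(f"tcp_retries2=15 -> about {total:.1f} s ({total / 60:.1f} min)")
```

Under these assumptions the total comes to a little over 15 minutes, which is in the same range as the stall seen in the logs.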

But TCP retries would require a connection to be established in the first place, and equally, the worker receiving and processing the request /and/ producing a response requires a proper incoming request - so the only place where it could occur is when sending the response, i.e. HTTP WAIT.

Reason

There is a router between front- and back-end - which isn't the ideal configuration, of course, but normal in many clouds.

This router tracks connections, and drops any connections that have been idle for a certain period of time. That, too, is a normal feature in many clouds.

If connections need to remain open for a longer period of time even if no communication happens (i.e. standby connections), then this is normally solved by keepalive messages which are sent between the TCP partners from time to time - so that the connection doesn't look dead to the router.

The default interval for keepalives in Debian is 7200sec, i.e. 2 hours. But here, the router has a much shorter timeout than that - so the connection is dropped before the first keepalive is ever sent.
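To make the mechanism concrete, here is a minimal sketch of how keepalive timing can be set per socket on Linux using Python's standard socket module. The values (60/10/5) are placeholders chosen for illustration, not taken from the deployment described here; the idle value would have to be shorter than the router's timeout.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for one socket with explicit timings (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe (system default: 7200).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between probes that go unanswered.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is declared dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
sock.close()
```

With settings like these, an idle connection carries a small probe every minute, so a connection-tracking router keeps seeing traffic on it.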

Unfortunately, neither the front-end nor the back-end would notice this connection dropping, so they assume the connection to be still okay.

So, a worker which has been inactive for some time would assume it can still send its query to the back-end via the open connection - but since the connection doesn't actually exist anymore, the back-end sends no ACK (because it doesn't even receive the transmission).

The worker would try again and again until the tcp_retries2 limit is reached (after roughly 900 seconds), and the front-end finally grasps that the connection is dead. Then it immediately opens a new connection, and the request succeeds instantly.

Now, even if we use multiple interpreters, there is still only one global libpq instance (that's the C library psycopg2 uses to communicate with PostgreSQL). And while that libpq instance tries to poll a dead connection, it is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing worker drags all others with it, and so they all hang even though only one or two workers actually have the problem. If you kill the sleepy worker, then the others will recover instantly (that's how I found the problem).

Of course, in a way, one would think that uWSGI's round-robin would prevent a worker from being idle for so long (4000+ seconds) - an average request rate of 2 req/sec should actually employ all 4 workers regularly.

But that isn't the case - most of the time, the first worker succeeds in much less than half a second, so it becomes available again before the next request comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%) across the workers, and in low-traffic times, the last worker or even the last two may indeed be idle for hours.
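The skew can be reproduced with a toy model. The sketch below is not uWSGI's actual scheduling; it simply assumes that each request goes to the lowest-numbered worker that is idle, with random arrivals at 2 req/sec and a fixed service time of 0.3 seconds (an assumed value for "much less than half a second").

```python
import random


def simulate_dispatch(n_requests=10000, n_workers=4, rate=2.0,
                      service=0.3, seed=0):
    """Count requests per worker when the lowest-numbered idle worker wins."""
    rng = random.Random(seed)
    free_at = [0.0] * n_workers   # time at which each worker becomes idle
    counts = [0] * n_workers
    now = 0.0
    for _ in range(n_requests):
        now += rng.expovariate(rate)   # time of the next arrival
        idle = [w for w in range(n_workers) if free_at[w] <= now]
        if idle:
            worker = idle[0]           # first idle worker takes the request
        else:
            # Everyone is busy: queue on the worker that frees up soonest.
            worker = min(range(n_workers), key=free_at.__getitem__)
        start = max(now, free_at[worker])
        free_at[worker] = start + service
        counts[worker] += 1
    return counts


counts = simulate_dispatch()
for w, c in enumerate(counts):
    print(f"worker {w + 1}: {100 * c / sum(counts):.0f}% of requests")
```

Even in this simplified model the first worker takes the majority of requests and the last one only a small fraction, so long idle periods on the last workers are plausible when traffic is low.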

Solution

The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf, i.e. shorter than the router timeout - then the router won't drop the connections even if the worker never does anything, and libpq will never get stuck.
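As an illustration, the sketch below renders the relevant postgresql.conf lines for a given router timeout. The helper name, the 600-second router timeout, the 50% safety factor and the interval/count values are all assumptions for the example; the router's real idle timeout has to be looked up or measured for the actual deployment.

```python
def keepalive_settings(router_idle_timeout, safety_factor=0.5):
    """Render postgresql.conf lines that keep idle connections alive.

    router_idle_timeout: seconds after which the router drops an idle
    connection (hypothetical input, must be determined per deployment).
    """
    # Send the first keepalive well before the router would give up.
    idle = max(1, int(router_idle_timeout * safety_factor))
    return "\n".join([
        f"tcp_keepalives_idle = {idle}  # seconds idle before the first probe",
        "tcp_keepalives_interval = 10  # seconds between unanswered probes",
        "tcp_keepalives_count = 5  # unanswered probes before giving up",
    ])


print(keepalive_settings(router_idle_timeout=600))
```

PostgreSQL applies these settings to the server side of each client connection, so the workers themselves do not need to change.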