Changes between Version 8 and Version 9 of InstallationGuidelines/Cluster

Timestamp:: 11/30/17 10:47:50
Author:: Fran Boon
Comment:: shorter tcp_keepalives_idle in postgresql.conf

InstallationGuidelines/Cluster

-              v8
+              v9
 Installation can be automated using these Fabric scripts:
 * https://github.com/lifeeth/spawn-eden
+== Issues ==
+=== Problem ===
+Over a couple of days users found the Village server suddenly unavailable for
+about 15 minutes, before returning to normal.
+At first I thought it might be complex requests saturating the back-end, but it
+wasn't. The logs looked strange...
+All four workers would suddenly get stuck - with trivial requests - all at
+exactly the same time, as if the back-end had gone away. And then, after an
+interval of 900 seconds, they would all continue as normal (no errors), again,
+all at exactly the same time.
+That interval was exactly the same every time it happened - regardless of the
+time of day, regardless of the type of requests, and regardless of server
+load. It happened equally with low traffic in the middle of the night as with
+peak traffic around mid-day.
+None of the logs showed any irregularities, none whatsoever. No errors, no
+strange responses - just delayed for no obvious reason.
+So, four things were puzzling me here:
+1) always exactly the same length of delay (900sec) - like a preset timeout
+2) it never hit any software-side timeout; requests were processed normally,
+with proper responses and no errors
+3) the delay was independent of current server load and request complexity
+4) all four workers were hit at exactly the same time
+---
+So, to unmask the problem, I reduced all software-side timeouts
+(connection, harakiri, idle) to values well below 900sec - in the hope
+that if a request hit any of those timeouts, it would produce an error message
+telling me where to look.
+But it just...doesn't. None of the timeouts is ever hit.
+Not even harakiri - because in fact, the worker returns with a result in less
+than a second. But /then/ it's stuck, and I can't see why.
+The 900sec timeout is most likely tcp_retries2, which is a low-level network
+setting that tells Linux to retry sending a packet over a TCP socket 15 times
+if no ACK is received within 60sec. That's 15x60=900sec.
+But TCP retries would require a connection to be established in the first
+place, and equally, the worker receiving and processing the request /and/
+producing a response requires a proper incoming request - so the only place
+where it could occur is when sending the response, i.e. HTTP WAIT.
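As a rough cross-check of the 15x60 estimate above: Linux does not wait a fixed 60sec between retransmissions, it doubles the retransmission timeout (RTO) after each attempt up to a cap. The sketch below adds up those waits, assuming the kernel's usual bounds of 200ms minimum and 120sec maximum RTO and the default tcp_retries2 of 15. These values are assumptions, not measurements from this server, and the function name is made up for illustration.

```python
# Assumed Linux defaults; not read from the affected server.
TCP_RTO_MIN = 0.2    # seconds, smallest retransmission timeout
TCP_RTO_MAX = 120.0  # seconds, largest retransmission timeout


def retransmission_budget(retries=15, rto_min=TCP_RTO_MIN, rto_max=TCP_RTO_MAX):
    """Sum the waits for the first send plus `retries` retransmissions,
    doubling the RTO after each attempt until it reaches rto_max."""
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


print(round(retransmission_budget(), 1))
```

Under these assumptions the total comes to roughly 925sec, which is in the same range as the observed 900sec and so still consistent with tcp_retries2 being the timeout that is hit.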
+=== Reason ===
+There is a router between front- and back-end - which isn't the ideal
+configuration, of course, but normal in many clouds.
+This router tracks connections, and drops any connections that have been idle
+for a certain period of time. That, too, is a normal feature in many clouds.
+If connections must remain open for a longer period of time even if no
+communication happens (i.e. standby connections), then this is normally solved
+by keepalive messages which are sent between the TCP partners from time to
+time - so that the connection doesn't look dead to the router.
+The default interval for keepalives in Debian is 7200sec, i.e. 2 hours. But
+here, the router has a much shorter timeout than that - so the connection is
+dropped before the first keepalive is ever sent.
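To see what the kernel is currently using, the keepalive defaults can be read from /proc. This is a minimal sketch: the three file names are the standard Linux sysctl entries, while the function names and the 3600sec router timeout in the usage line are hypothetical (the actual router timeout is not known here).

```python
from pathlib import Path

# Standard Linux sysctl files for TCP keepalive (values are in seconds,
# except "probes", which is a count).
KEEPALIVE_FILES = {
    "idle": "tcp_keepalive_time",
    "interval": "tcp_keepalive_intvl",
    "probes": "tcp_keepalive_probes",
}


def read_keepalive_settings(root="/proc/sys/net/ipv4"):
    """Return the kernel keepalive settings, or None where a file is missing."""
    settings = {}
    for name, filename in KEEPALIVE_FILES.items():
        path = Path(root) / filename
        settings[name] = int(path.read_text().strip()) if path.exists() else None
    return settings


def first_keepalive_too_late(idle_seconds, router_idle_timeout):
    """True if the router would drop the connection before the first keepalive."""
    return idle_seconds >= router_idle_timeout


# 7200 is the Debian default mentioned above; 3600 is a made-up router timeout.
print(first_keepalive_too_late(7200, 3600))
```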
+Unfortunately, neither the front-end nor the back-end would notice this
+connection dropping, so they assume the connection to be still okay.
+So a worker which has been inactive for some time would assume it can still
+send its query to the back-end via the open connection - but since the
+connection doesn't actually exist anymore, the back-end sends no ACK (because
+it doesn't even receive the transmission).
+The worker would try again and again, until tcp_retries2 kicks in (15x60=900
+seconds), and the front-end finally grasps that the connection is dead. Then,
+it immediately opens a new connection, and then the request succeeds
+instantly.
+Now, even if we use multiple interpreters, there is still only one global
+libpq instance (that's the C library psycopg2 uses to communicate with
+PostgreSQL). And while that libpq instance tries to poll a dead connection, it
+is basically blocked for any concurrent requests from other interpreters.
+Thus, the one failing worker drags all the others down with it, and so they all
+hang even though only one or two workers actually have the problem. If you
+kill the sleepy worker, then the others will recover instantly (that's how I
+found the problem).
+Of course, in a way, one would think that uWSGI's round-robin would prevent a
+worker from being idle for so long (4000+ seconds) - an average request rate
+of 2 req/sec should actually employ all 4 workers regularly.
+But that isn't the case - most of the time, the first worker succeeds in much
+less than half a second, so it becomes available again before the next request
+comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%)
+across the workers, and in low-traffic times, the last worker or even the last
+two may indeed be idle for hours.
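The uneven distribution described above can be illustrated with a toy model. It assumes that the lowest-numbered idle worker always takes the next request, which is a simplification for illustration and not a description of uWSGI's actual scheduling. The function name, arrival times and service time are made up.

```python
def distribute(arrivals, service_time, workers=4):
    """Count requests per worker when the lowest-numbered idle worker wins.

    If every worker is busy, the request waits for whichever worker
    becomes free first.
    """
    busy_until = [0.0] * workers
    counts = [0] * workers
    for t in sorted(arrivals):
        idle = [w for w in range(workers) if busy_until[w] <= t]
        if idle:
            w, start = idle[0], t
        else:
            w = min(range(workers), key=lambda i: busy_until[i])
            start = busy_until[w]
        busy_until[w] = start + service_time
        counts[w] += 1
    return counts


# Six requests, each taking 0.3sec; only the short bursts reach workers 2 and 3.
arrivals = [0.0, 0.1, 0.2, 1.0, 1.1, 2.0]
print(distribute(arrivals, service_time=0.3))  # [3, 2, 1, 0]
```

In this model the last worker only sees traffic when all the others are busy at the same moment, which matches the observation that it can sit idle for hours in quiet periods.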
+=== Solution ===
+The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf,
+i.e. shorter than the router timeout - then the router won't drop the
+connections even if the worker never does anything, and libpq will never get
+stuck.
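A minimal sketch of how the value could be chosen: tcp_keepalives_idle is the real postgresql.conf setting named above, but the helper name, the 600sec router timeout in the usage line and the "half the router timeout" safety margin are assumptions for illustration. The related tcp_keepalives_interval and tcp_keepalives_count settings are left at their defaults here.

```python
def keepalive_idle_line(router_idle_timeout):
    """Return a postgresql.conf line with an idle time below the router timeout.

    Uses half the router timeout as an arbitrary safety margin.
    """
    if router_idle_timeout < 2:
        raise ValueError("router timeout too short to fit a keepalive below it")
    idle = router_idle_timeout // 2
    return f"tcp_keepalives_idle = {idle}"


# 600 is a hypothetical router idle timeout in seconds.
print(keepalive_idle_line(600))  # tcp_keepalives_idle = 300
```

PostgreSQL needs to reload its configuration for the change to apply, and it only affects connections opened afterwards.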
 == See Also ==
 * http://web2py.com/books/default/chapter/29/13#HAProxy-a-high-availability-load-balancer