Changes between Version 10 and Version 11 of InstallationGuidelines/Cluster


Timestamp: 11/30/17 11:47:57
Author: Dominic König
  • InstallationGuidelines/Cluster

 - At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request
=== Reason ===
There is a router between front- and back-end - which isn't the ideal configuration, of course, but normal in many clouds.

This router tracks connections, and drops any connections that have been idle for a certain period of time. That, too, is a normal feature in many clouds.

If connections are to remain open for a longer period of time even when no communication happens (i.e. standby connections), this is normally solved by keepalive messages sent between the TCP partners from time to time - so that the connection doesn't look dead to the router.

The default interval for keepalives in Debian is 7200 seconds, i.e. 2 hours. But here, the router has a much shorter timeout than that - so the connection is dropped before the first keepalive is ever sent.

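On a Debian (or any Linux) host, the kernel defaults behind this can be inspected directly; the paths below assume Linux's /proc interface:

```shell
# Kernel TCP keepalive defaults (Linux). tcp_keepalive_time is the idle
# interval before the first keepalive probe is sent - 7200s on Debian.
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between subsequent probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # failed probes before the connection is declared dead
```

With the default of 7200, a connection idle behind a router that times out after, say, a few minutes is long gone before the kernel ever probes it.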
Unfortunately, neither the front-end nor the back-end notices this connection dropping, so both assume the connection is still okay.

So, a worker which has been inactive for some time assumes it can still send its query to the back-end via the open connection - but since the connection doesn't actually exist anymore, the back-end sends no ACK (because it doesn't even receive the transmission).

The worker tries again and again, until tcp_retries2 kicks in (15x60 = 900 seconds), and the front-end finally grasps that the connection is dead. Then it immediately opens a new connection, and the request succeeds instantly.

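The retry limit is likewise a kernel tunable; on Linux it can be read (and, if needed, lowered) via /proc:

```shell
# Number of retransmissions of an unacknowledged segment before the
# kernel gives up on the connection - 15 by default on Linux, which
# with backoff works out to roughly the 900-second stall seen above.
cat /proc/sys/net/ipv4/tcp_retries2
```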
Now, even if we use multiple interpreters, there is still only one global libpq instance (that's the C library psycopg2 uses to communicate with PostgreSQL). And while that libpq instance tries to poll a dead connection, it is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing uWSGI worker process drags all the others with it, and so they all hang even though only one or two workers actually have the problem. If you kill the sleepy worker, the others recover instantly (that's how I found the problem).

Of course, one would think that uWSGI's round-robin would prevent a worker from being idle for so long (4000+ seconds) - an average request rate of 2 req/sec should employ all 4 workers regularly.

But that isn't the case - most of the time, the first worker succeeds in much less than half a second, so it becomes available again before the next request comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%) across the workers, and in low-traffic times, the last worker or even the last two may indeed be idle for hours.

=== Solution ===
The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf,