Changes between Version 10 and Version 11 of InstallationGuidelines/Cluster
Timestamp: 11/30/17 11:47:57
- At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request

=== Reason ===
There is a router between front- and back-end - which isn't the ideal configuration, of course, but normal in many clouds.

This router tracks connections and drops any connection that has been idle for a certain period of time. That, too, is a normal feature in many clouds.

If connections are to remain open for a longer period of time even when no communication happens (i.e. standby connections), this is normally solved by keepalive messages which the TCP partners send to each other from time to time, so that the connection doesn't look dead to the router.

The default interval for keepalives in Debian is 7200 seconds, i.e. 2 hours. Here, however, the router has a much shorter timeout than that, so the connection is dropped before the first keepalive is ever sent.

Unfortunately, neither the front-end nor the back-end notices that the connection has been dropped, so both assume it is still okay.

So a worker which has been inactive for some time assumes it can still send its query to the back-end via the open connection - but since the connection doesn't actually exist anymore, the back-end sends no ACK (it doesn't even receive the transmission).

The worker retries again and again until tcp_retries2 kicks in (15 x 60 = 900 seconds), and the front-end finally grasps that the connection is dead. It then immediately opens a new connection, and the request succeeds instantly.
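The keepalive timers mentioned above are ordinary kernel settings that an application can also override per socket. As an illustration (not part of the original page), here is a minimal Python sketch of how an endpoint could request much more frequent probes than the Debian default of 7200 seconds; the host name and numbers are purely hypothetical:

{{{#!python
import socket

# Illustrative only: enable keepalive probes on an otherwise idle connection
# and override the Linux defaults (/proc/sys/net/ipv4/tcp_keepalive_*) per socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # first probe after 60 s of idle time
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # then one probe every 10 s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # give up after 5 unanswered probes
s.connect(("backend.example.org", 5432))                     # hypothetical back-end address
}}}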
Now, even if we use multiple interpreters, there is still only one global libpq instance (that's the C library psycopg2 uses to communicate with PostgreSQL). And while that libpq instance tries to poll a dead connection, it is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing uWSGI worker process drags all the others with it, and they all hang even though only one or two workers actually have the problem. If you kill the sleepy worker, the others recover instantly (that's how I found the problem).

Of course, one would think that uWSGI's round-robin would prevent a worker from being idle for so long (4000+ seconds) - an average request rate of 2 req/sec should employ all 4 workers regularly.

But that isn't the case - most of the time, the first worker finishes in much less than half a second, so it becomes available again before the next request comes in. That way, the load distribution across the workers is rather uneven (60%-25%-10%-5%), and in low-traffic times the last worker, or even the last two, may indeed be idle for hours.

=== Solution ===
The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf, so that the back-end sends keepalive probes before the router's idle timeout strikes and the connection never looks dead to the router.
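A minimal sketch of what that could look like - the values are examples and not from the original page; the only requirement is that the idle time stays well below the router's connection-tracking timeout. tcp_keepalives_interval and tcp_keepalives_count are related settings shown for completeness:

{{{
# postgresql.conf - example values only
tcp_keepalives_idle = 60        # seconds of inactivity before the first keepalive probe
tcp_keepalives_interval = 10    # seconds between unanswered probes
tcp_keepalives_count = 5        # unanswered probes before the connection is considered dead
}}}

If the keepalives should instead originate from the front-end, libpq accepts the analogous connection parameters (keepalives, keepalives_idle, keepalives_interval, keepalives_count), which psycopg2 passes straight through its connect() call.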