Changes between Version 10 and Version 11 of InstallationGuidelines/Cluster


Timestamp: 11/30/17 11:47:57
Author: Dominic König
  • InstallationGuidelines/Cluster

 - At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request
=== Reason ===
There is a router between front- and back-end - which isn't the ideal configuration, of course, but normal in many clouds.

This router tracks connections, and drops any connections that have been idle for a certain period of time. That, too, is a normal feature in many clouds.

If connections are to remain open for a longer period of time even when no communication happens (i.e. standby connections), this is normally solved by keepalive messages sent between the TCP partners from time to time - so that the connection doesn't look dead to the router.

The default interval for keepalives in Debian is 7200 seconds, i.e. 2 hours. But here, the router has a much shorter timeout than that - so the connection is dropped before the first keepalive is ever sent.

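On a Debian (or any Linux) host, the kernel defaults behind this can be inspected directly; the paths below assume Linux's /proc interface:

```shell
# Kernel TCP keepalive defaults (Linux). tcp_keepalive_time is the idle
# interval before the first keepalive probe is sent - 7200s on Debian.
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between subsequent probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # failed probes before the connection is declared dead
```

With the default of 7200, a connection idle behind a router that times out after, say, a few minutes is long gone before the kernel ever probes it.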
Unfortunately, neither the front-end nor the back-end notices this connection dropping, so both assume the connection is still okay.

So, a worker which has been inactive for some time assumes it can still send its query to the back-end via the open connection - but since the connection doesn't actually exist anymore, the back-end sends no ACK (because it doesn't even receive the transmission).

The worker tries again and again, until tcp_retries2 kicks in (15x60 = 900 seconds), and the front-end finally grasps that the connection is dead. Then it immediately opens a new connection, and the request succeeds instantly.

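The retry limit is likewise a kernel tunable; on Linux it can be read (and, if needed, lowered) via /proc:

```shell
# Number of retransmissions of an unacknowledged segment before the
# kernel gives up on the connection - 15 by default on Linux, which
# with backoff works out to roughly the 900-second stall seen above.
cat /proc/sys/net/ipv4/tcp_retries2
```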
Now, even if we use multiple interpreters, there is still only one global libpq instance (that's the C library psycopg2 uses to communicate with PostgreSQL). And while that libpq instance tries to poll a dead connection, it is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing uWSGI worker process drags all the others with it, and so they all hang even though only one or two workers actually have the problem. If you kill the sleepy worker, the others recover instantly (that's how I found the problem).

Of course, one would think that uWSGI's round-robin would prevent a worker from being idle for so long (4000+ seconds) - an average request rate of 2 req/sec should employ all 4 workers regularly.

But that isn't the case - most of the time, the first worker succeeds in much less than half a second, so it becomes available again before the next request comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%) across the workers, and in low-traffic times, the last worker or even the last two may indeed be idle for hours.

=== Solution ===
The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf,