Changes between Version 8 and Version 9 of InstallationGuidelines/Cluster

11/30/17 10:47:50
Fran Boon

shorter tcp_keepalives_idle in postgresql.conf


  • InstallationGuidelines/Cluster

    v8 v9  
    2222Installation can be automated using these Fabric scripts:
     25== Issues ==
     26=== Problem ===
     27Over a couple of days users found the Village server suddenly unavailable for
     28about 15 minutes, before returning to normal.
     30At first I thought it might be complex requests saturating the back-end, but it
     31wasn't. The logs looked strange...
     33All four workers would suddenly get stuck - with trivial requests - all at
     34exactly the same time, as if the back-end had gone away. And then, after an
     35interval of 900 seconds, they would all continue as normal (no errors), again,
     36all at exactly the same time.
     38That interval was exactly the same every time it happened - regardless of the
     39time of day, regardless of the type of requests, and regardless of server
     40load. It happened equally with low traffic in the middle of the night as with
     41peak traffic around mid-day.
     43None of the logs showed any irregularities, none whatsoever. No errors, no
     44strange responses - just delayed for no obvious reason.
     46So, four things were puzzling me here:
     471) always exactly the same length of delay (900sec) - like a preset timeout
     482) it never hit any software-side timeout, requests were processed normally,
     49with proper responses and no errors
     503) the delay is independent of current server load and request complexity
     514) all four workers hit at exactly the same time
     55So, to unmask the problem, I've reduced all software-side timeouts
     56(connection, harakiri, idle) to values well below those 900sec - in the hope
     57that if it hits any of those timeouts, it would produce an error message
     58telling me where to look.
     60But it just...doesn't. None of the timeouts is ever hit.
     62Not even harakiri - because in fact, the worker returns with a result in less
     63than a second. But /then/ it's stuck, and I can't see why.
     65The 900sec timeout is most likely tcp_retries2, which is a low-level network
     66setting that tells Linux how many times to retransmit a packet over a TCP socket
     67if no ACK is received: 15 times by default, which adds up to roughly 900sec.
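As a rough check on that figure: Linux does not retry at a fixed interval, it doubles the retransmission timeout after each attempt until it reaches a cap. The sketch below assumes the kernel's usual bounds (200ms minimum, 120sec maximum) and the default tcp_retries2 of 15; the function name and its parameters are only for illustration.

```python
def retransmission_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Total seconds spent waiting before TCP gives up on a connection.

    Assumes an initial retransmission timeout of rto_min that doubles
    after every attempt and is capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # One wait after the original transmission plus one after each retry.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


print(round(retransmission_timeout(), 1))  # 924.6
```

Under these assumptions the total comes out at a little over 15 minutes, which fits the stall seen in the logs.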
     69But TCP retries would require a connection to be established in the first
     70place, and equally, the worker receiving and processing the request /and/
     71producing a response requires a proper incoming request - so the only place
     72where it could occur is when sending the response, i.e. HTTP WAIT.
     74=== Reason ===
     75There is a router between front- and back-end - which isn't the ideal
     76configuration, of course, but normal in many clouds.
     78This router tracks connections, and drops any connections that have been idle
     79for a certain period of time. That, too, is a normal feature in many clouds.
     81If connections shall remain open for a longer period of time even if no
     82communication happens (i.e. standby connections), then this is normally solved
     83by keepalive messages which are sent between the TCP partners from time to
     84time - so that the connection doesn't look dead to the router.
     86The default interval for keepalives in Debian is 7200sec, i.e. 2 hours. But
     87here, the router has a much shorter timeout than that - so the connection is
     88dropped before the first keepalive is ever sent.
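For illustration, this is what keepalive looks like at the socket level. It is a minimal Python sketch, assuming Linux (the TCP_KEEP* options are platform-specific); the helper name and the 60/10/5 values are made up for the example and would have to be chosen so that the idle time stays below the router's timeout.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for one socket (Linux option names).

    idle:     seconds of silence before the first keepalive probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


# Usage: the options can be set before the socket is connected.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_keepalive(s)
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
```

Without such per-socket settings the system-wide default applies, which is the 7200sec mentioned above.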
     90Unfortunately, neither the front-end nor the back-end would notice this
     91connection dropping, so they assume the connection to be still okay.
     93So a worker which has been inactive for some time would assume it can still
     94send its query to the back-end via the open connection - but since the
     95connection doesn't actually exist anymore, the back-end sends no ACK (because
     96it doesn't even receive the transmission).
     98The worker would try again and again, until tcp_retries2 is exhausted (roughly 900
     99seconds), and the front-end finally grasps that the connection is dead. Then,
     100it immediately opens a new connection, and then the request succeeds.
     103Now, even if we use multiple interpreters, there is still only one global
     104libpq instance (that's the C library psycopg2 uses to communicate with
     105PostgreSQL). And while that libpq instance tries to poll a dead connection, it
     106is basically blocked for any concurrent requests from other interpreters.
     108Thus, the one failing worker drags all others with it, and so they all hang
     109even though only one or two workers actually have the problem. If you
     110kill the sleepy worker, then the others will recover instantly (that's how I
     111found the problem).
     113Of course, in a way, one would think that uWSGI's round-robin would prevent a
     114worker from being idle for so long (4000+ seconds) - an average request rate
     115of 2 req/sec should actually employ all 4 workers regularly.
     117But that isn't the case - most of the time, the first worker succeeds in much
     118less than half a second, so it becomes available again before the next request
     119comes in. That way, the load distribution is rather uneven (60%-25%-10%-5%)
     120across the workers, and in low-traffic times, the last worker or even the last
     121two may indeed be idle for hours.
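The uneven distribution can be reproduced with a simple queueing model. This is only a sketch under assumptions that are not measured facts about this server: requests arrive at random (Poisson), each request always goes to the lowest-numbered idle worker, and the mean service time is a hypothetical 0.3sec. Under those assumptions the share of traffic reaching each worker follows from the Erlang B formula.

```python
def erlang_b(servers, load):
    """Probability that all of the first `servers` workers are busy.

    load is the offered traffic in Erlangs (request rate x service time).
    Uses the standard recursion, starting from B(0) = 1.
    """
    b = 1.0
    for k in range(1, servers + 1):
        b = load * b / (k + load * b)
    return b


def worker_shares(workers, rate, service_time):
    """Fraction of all requests handled by each worker, first worker first."""
    load = rate * service_time
    # Worker k gets what overflows past workers 0..k-1 but not past 0..k.
    return [erlang_b(k, load) - erlang_b(k + 1, load) for k in range(workers)]


# 4 workers, 2 req/sec, assumed 0.3sec per request
for number, share in enumerate(worker_shares(4, 2.0, 0.3), start=1):
    print(f"worker {number}: {share:.1%}")
```

With these numbers the first worker takes about 62% of the requests and each further worker far less than the one before it, the same shape as the distribution observed above. At lower night-time request rates the tail shrinks further, so the last workers can easily sit idle for longer than the router's timeout.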
     123=== Solution ===
     124The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf,
     125i.e. shorter than the router timeout - then the router won't drop the
     126connections even if the worker never does anything, and libpq will never get
    24129== See Also ==