
== Issues ==
=== Problem ===
Over a couple of days, users found the Village server suddenly unavailable for
about 15 minutes before returning to normal.

At first I thought it might be complex requests saturating the back-end, but it
wasn't. The logs looked strange...

All four workers would suddenly get stuck - with trivial requests - all at
exactly the same time, as if the back-end had gone away. And then, after an
interval of 900 seconds, they would all continue as normal (no errors), again
all at exactly the same time.

That interval was exactly the same every time it happened - regardless of the
time of day, regardless of the type of requests, and regardless of server
load. It happened just as much with low traffic in the middle of the night as
with peak traffic around midday.

None of the logs showed any irregularities, none whatsoever. No errors, no
strange responses - just delays for no obvious reason.

So, four things were puzzling me here:
1) the delay always had exactly the same length (900 sec) - like a preset timeout
2) it never hit any software-side timeout; requests were processed normally,
   with proper responses and no errors
3) the delay was independent of current server load and request complexity
4) all four workers were hit at exactly the same time

---

So, to unmask the problem, I reduced all software-side timeouts (connection,
harakiri, idle) to values well below those 900 sec - in the hope that if any
of those timeouts were hit, it would produce an error message telling me where
to look.
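
Roughly, the change amounted to something like the following in the uWSGI ini
file ("connection" corresponds to socket-timeout here; the option names are
uWSGI's, but the values are just illustrative, not the production ones):

 [uwsgi]
 # all of these are well below the mysterious 900 sec, so any
 # software-side timeout would fire first and log an error
 harakiri = 120
 socket-timeout = 60
 idle = 300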

But it just...doesn't. None of the timeouts is ever hit.

Not even harakiri - because in fact, the worker returns with a result in less
than a second. But /then/ it's stuck, and I can't see why.

The 900 sec delay is most likely tcp_retries2, a low-level network setting
that tells Linux how many times to retransmit an unacknowledged TCP segment
before giving up on the connection. With the default of 15 retransmissions and
the exponential backoff between them, that works out to roughly 900 seconds.
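
The retry limit itself can be checked via sysctl; on a stock Debian system it
looks like this (illustrative output, not captured on the affected host):

 $ sysctl net.ipv4.tcp_retries2
 net.ipv4.tcp_retries2 = 15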

But TCP retries would require a connection to be established in the first
place - and equally, for the worker to receive and process the request /and/
produce a response, a proper incoming request must have arrived - so the only
place where it could occur is when sending the response, i.e. HTTP WAIT.

=== Reason ===
There is a router between the front- and back-end - which isn't the ideal
configuration, of course, but is normal in many clouds.

This router tracks connections and drops any connection that has been idle for
a certain period of time. That, too, is a normal feature in many clouds.

If connections are supposed to remain open for a longer period of time even
when no communication happens (i.e. standby connections), this is normally
solved by keepalive messages which the TCP partners send to each other from
time to time - so that the connection doesn't look dead to the router.

The default interval for TCP keepalives on Debian is 7200 sec, i.e. 2 hours.
But here, the router has a much shorter timeout than that - so the connection
is dropped before the first keepalive is ever sent.
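
The relevant kernel defaults can be inspected the same way; on a stock Debian
box they typically read (illustrative output):

 $ sysctl net.ipv4.tcp_keepalive_time
 net.ipv4.tcp_keepalive_time = 7200
 $ sysctl net.ipv4.tcp_keepalive_intvl
 net.ipv4.tcp_keepalive_intvl = 75
 $ sysctl net.ipv4.tcp_keepalive_probes
 net.ipv4.tcp_keepalive_probes = 9

tcp_keepalive_time is the idle period before the first probe is sent - it is
this 7200 sec that would need to be shorter than the router's idle timeout.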

Unfortunately, neither the front-end nor the back-end notices the connection
being dropped, so both assume the connection is still okay.

So a worker which has been inactive for some time assumes it can still send
its query to the back-end via the open connection - but since the connection
doesn't actually exist anymore, the back-end never sends an ACK (because it
doesn't even receive the transmission).

The worker retransmits again and again, until tcp_retries2 kicks in after
roughly 900 seconds, and the front-end finally grasps that the connection is
dead. It then immediately opens a new connection, and the request succeeds
instantly.

Now, even if we use multiple interpreters, there is still only one global
libpq instance (that's the C library psycopg2 uses to communicate with
PostgreSQL). And while that libpq instance tries to poll a dead connection, it
is basically blocked for any concurrent requests from other interpreters.

Thus, the one failing worker drags all the others down with it, and they all
hang even though only one or two workers actually have the problem. If you
kill the sleepy worker, the others recover instantly (that's how I found the
problem).

Of course, one would think that uWSGI's round-robin scheduling would prevent a
worker from being idle for so long (4000+ seconds) - an average request rate
of 2 req/sec should keep all 4 workers regularly busy.

But that isn't the case - most of the time, the first worker finishes in well
under half a second, so it becomes available again before the next request
comes in. That way, the load distribution across the workers is rather uneven
(60%-25%-10%-5%), and in low-traffic periods the last worker - or even the
last two - may indeed be idle for hours.

=== Solution ===
The solution is to set a much shorter tcp_keepalives_idle in postgresql.conf,
i.e. shorter than the router's idle timeout - then the router won't drop the
connections even if a worker never does anything, and libpq will never get
stuck.
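
For example, assuming the router drops connections after roughly 10 minutes of
idleness (the actual cutoff isn't documented here), something along these
lines in postgresql.conf keeps the idle connections visible to the router (the
values are illustrative):

 # send keepalive probes long before the router's idle timeout strikes
 tcp_keepalives_idle = 300      # seconds of idleness before the first probe
 tcp_keepalives_interval = 30   # seconds between unanswered probes
 tcp_keepalives_count = 3       # unanswered probes before the connection is dead

The same knobs also exist on the client side as libpq connection parameters
(keepalives_idle, keepalives_interval, keepalives_count), in case the probes
should rather be sent from the worker's end instead of from PostgreSQL.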