That interval was exactly the same every time it happened - regardless of the
time of day, the type of requests, or the server load. It happened just as
readily with low traffic in the middle of the night as with peak traffic
around midday.
- No signs of CPU/RAM saturation at the back-end (no signs of any back-end activity at all during the blackout)
- Instant recovery: the server is immediately back to normal after the blackout (no secondary delays)
- All front-end workers stuck at the same time, and all recovering at exactly the same time
- None of the logs show any irregularities - no errors, no strange requests/responses
So, four things were puzzling me here:
1) always exactly the same length of delay (900sec) - like a preset timeout
2) no software-side timeout is ever hit; requests are processed normally,
with proper responses and no errors
3) the delay is independent of current server load and request complexity
4) all four workers are hit at exactly the same time

---

So, to unmask the problem, I've reduced all software-side timeouts
(connection, harakiri, idle) to values well below those 900sec - in the hope
that hitting any of them would produce an error message telling me where to
look.

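For illustration, a sketch of that idea - the option names are real uWSGI
settings, but the path and the numbers here are placeholders, not my actual
configuration:

```
# cap every software-side timeout far below the 900sec mark, so that any
# timeout that actually fires should leave a clear error in the logs
# (path and values are placeholders)
uwsgi --ini /etc/uwsgi/app.ini \
      --harakiri 120 \
      --socket-timeout 60 \
      --idle 600
```
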
But it just...doesn't. None of the timeouts is ever hit.

Not even harakiri - because in fact, the worker returns with a result in less
than a second. But /then/ it's stuck, and I can't see why.

The 900sec timeout is most likely tcp_retries2, a low-level network setting
that tells Linux to retry sending a packet over a TCP socket 15 times if no
ACK is received within roughly 60sec. That's 15x60=900sec.

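The current limit is easy to check, and - purely as a diagnostic experiment -
it can be lowered so that a dead connection fails much faster (note this
affects every TCP connection on the host, so it's a test, not a fix):

```
# current retransmission limit (the Linux default is 15)
sysctl net.ipv4.tcp_retries2

# temporary experiment: give up after far fewer retries
sysctl -w net.ipv4.tcp_retries2=6
```
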
But TCP retries would require a connection to have been established in the
first place - and equally, a worker that received and processed the request
/and/ produced a response must have seen a proper incoming request. So the
only place where the retries could occur is while sending the response,
i.e. HTTP WAIT.

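If that theory holds, then during a blackout the stuck worker should be
sitting on a socket with unacknowledged data and a running retransmission
timer. Assuming the blackout can be caught live (and that the uWSGI workers
hold the client-facing sockets), something like this should show it:

```
# TCP sockets with numeric addresses (-n), timer info (-o) and the owning
# process (-p); a victim socket should show data stuck in Send-Q and a
# retransmission timer such as timer:(on,...)
ss -tnop | grep -i uwsgi
```
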
A few more observations that seem relevant:
- Blackout length is constant (900sec), and doesn't seem to depend on request complexity
- Terminating the uWSGI worker that got stuck first ("kill") resolves the problem, and lets all other workers recover instantly (see the sketch after this list)
- It typically occurs after a period of lower traffic, at the moment traffic increases again
- At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request
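
For completeness, the manual recovery is nothing more exotic than this (the
PID is a placeholder; any way of spotting the stuck worker will do):

```
# list the uWSGI worker processes and their PIDs
pgrep -a uwsgi

# terminate the worker that got stuck first; the uWSGI master (if enabled)
# respawns it, and the remaining workers recover immediately
kill 12345        # placeholder PID of the stuck worker
```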