Changes between Version 9 and Version 10 of InstallationGuidelines/Cluster

11/30/17 11:46:09 (5 years ago)
Dominic König



  • InstallationGuidelines/Cluster

    v9 v10  
    25 25 == Issues ==
    26 26 === Problem ===
    27 Over a couple of days users found the Village server suddenly unavailable for
    28 about 15 minutes, before returning to normal.
    30 At first I thought it might be complex requests saturating the back-end, but it
    31 wasn't. The logs looked strange...
     28 - Server becomes suddenly unavailable for about 15 minutes, before returning to normal
    33 All four workers would suddenly get stuck - with trivial requests - all at
    34 exactly the same time, as if the back-end had gone away. And then, after an
    35 interval of 900 seconds, they would all continue as normal (no errors), again,
    36 all at exactly the same time.
    38 That interval was exactly the same every time it happened - regardless of the
    39 time of day, regardless of the type of requests, and regardless of server
    40 load. It happened equally with low traffic in the middle of the night as with
    41 peak traffic around mid-day.
     32 - No signs of CPU/RAM saturation at the back-end (no signs of any back-end activity at all during the blackout)
     33 - Instant recovery: server is immediately back to normal after the blackout (no secondary delays)
     34 - All front-end workers stuck at the same time, and also recovering at exactly the same time
     35 - None of the logs showing any irregularities - no errors, no strange requests/responses
    43 None of the logs showed any irregularities, none whatsoever. No errors, no
    44 strange responses - just delayed for no obvious reason.
    46 So, four things were puzzling me here:
    47 1) always exactly the same length of delay (900sec) - like a preset timeout
    48 2) it never hit any software-side timeout, requests were processed normally,
    49 with proper responses and no errors
    50 3) the delay was independent of current server load and request complexity
    51 4) all four workers hit at exactly the same time
    53 ---
    55 So, to unmask the problem, I've reduced all software-side timeouts
    56 (connection, harakiri, idle) to values well below those 900sec - in the hope
    57 that if it hits any of those timeouts, it would produce an error message
    58 telling me where to look.
    60 But it just...doesn't. None of the timeouts is ever hit.
    62 Not even harakiri - because in fact, the worker returns with a result in less
    63 than a second. But /then/ it's stuck, and I can't see why.
    65 The 900sec timeout is most likely tcp_retries2, which is a low-level network
    66 setting that tells Linux to retransmit an unacknowledged packet over a tcp
    67 socket up to 15 times, with growing intervals that add up to roughly 900sec.
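
A rough check of the 900sec figure can be sketched in Python. This is only an estimate under stated assumptions: that Linux doubles the retransmission timeout after each unanswered attempt rather than waiting a fixed interval, that the timeout starts at 200 ms and is capped at 120 s (the usual kernel defaults), and that tcp_retries2 is left at 15. The function name `total_retransmission_time` is made up for this illustration.

```python
def total_retransmission_time(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long a socket keeps retrying before giving up.

    Assumes exponential back-off: the wait starts at rto_min seconds,
    doubles after every unanswered attempt, and is capped at rto_max.
    The original transmission plus `retries` retransmissions give
    retries + 1 waiting intervals in total.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)  # wait for an ACK, capped at rto_max
        rto *= 2                    # back off before the next attempt
    return total


print(f"{total_retransmission_time():.1f}")  # prints 924.6
```

Under these assumptions the estimate comes out at about 925 seconds, which is close to the observed 900sec blackout and would explain why the interval looked like a preset timeout.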
    69 But TCP retries would require a connection to be established in the first
    70 place, and equally, the worker receiving and processing the request /and/
    71 producing a response requires a proper incoming request - so the only place
    72 where it could occur is when sending the response, i.e. HTTP WAIT.
     39 - Blackout length is constant (900sec), and doesn't seem to depend on request complexity
     40 - Terminating the uWSGI worker that got stuck first ("kill") resolves the problem, and lets all other workers recover instantly
     41 - Typically occurs after a period of lower traffic, at the moment when traffic increases again
     42 - At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request
    74 43 === Reason ===
    75 44 There is a router between front- and back-end - which isn't the ideal