Changes between Version 9 and Version 10 of InstallationGuidelines/Cluster

Timestamp:: 11/30/17 11:46:09 (8 years ago)
Author:: Dominic König
Comment:: --

InstallationGuidelines/Cluster

-              v9
+              v10
 == Issues ==
 === Problem ===
-Over a couple of days users found the Village server suddenly unavailable for
-about 15 minutes, before returning to normal.
+At first I thought it might be complex requests saturating the back-end, but it
+wasn't. The logs looked strange...
+ - Server becomes suddenly unavailable for about 15 minutes, before returning to normal
+All four workers would suddenly get stuck - with trivial requests - all at
+exactly the same time, as if the back-end had gone away. And then, after an
+interval of 900 seconds, they would all continue as normal (no errors), again,
+all at exactly the same time.
+Observations:
+That interval was exactly the same every time it happened - regardless of the
+time of day, regardless of the type of requests, and regardless of server
+load. It happened equally with low traffic in the middle of the night as with
+peak traffic around mid-day.
+ - No signs of CPU/RAM saturation at the back-end (no signs of any back-end activity at all during the blackout)
+ - Instant recovery: server is immediately back to normal after the blackout (no secondary delays)
+ - All front-end workers stuck at the same time, and also recovering at exactly the same time
+ - None of the logs showing any irregularities - no errors, no strange requests/responses
+None of the logs showed any irregularities, none whatsoever. No errors, no
+strange responses - just delayed for no obvious reason.
+Critically:
+So, four things were puzzling me here:
+1) always exactly the same length of delay (900sec) - like a preset timeout
+2) it never hit any software-side timeout; requests were processed normally,
+with proper responses and no errors
+3) the delay was independent of current server load and request complexity
+4) all four workers were hit at exactly the same time
+---
+So, to unmask the problem, I've reduced all software-side timeouts
+(connection, harakiri, idle) to values well below those 900sec - in the hope
+that if it hits any of those timeouts, it would produce an error message
+telling me where to look.
+But it just...doesn't. None of the timeouts is ever hit.
+Not even harakiri - because in fact, the worker returns with a result in less
+than a second. But /then/ it's stuck, and I can't see why.
+The 900sec timeout is most likely tcp_retries2, which is a low-level network
+setting that tells Linux how many times to retransmit an unacknowledged packet
+on an established TCP connection before giving up. The default is 15 retries
+with exponential back-off, which adds up to roughly 900sec (about 15 minutes).
+But TCP retries would require a connection to be established in the first
+place, and equally, the worker receiving and processing the request /and/
+producing a response requires a proper incoming request - so the only place
+where it could occur is when sending the response, i.e. HTTP WAIT.
+ - Blackout length is constant (900sec), and doesn't seem to depend on request complexity
+ - Terminating the uWSGI worker that got stuck first ("kill") resolves the problem, and lets all other workers recover instantly
+ - Typically occurs after a period of lower traffic, at the moment when traffic increases again
+ - At least one of the workers had been inactive for a significant period of time (e.g. 4000+ seconds) before the hanging request
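
As a rough cross-check of the tcp_retries2 figure: Linux does not wait a fixed interval between retransmissions, but doubles the retransmission timeout (RTO) after each attempt until it reaches a cap. The following is a minimal sketch of that arithmetic, assuming the usual kernel constants of 200 ms for the minimum RTO and 120 s for the maximum RTO, and the default of 15 retries. The function name and parameters are illustrative only and not part of any kernel or uWSGI API.

```python
def hypothetical_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long Linux keeps retransmitting before dropping a connection.

    Assumptions (not taken from the server in question):
      - retries: value of net.ipv4.tcp_retries2 (default 15)
      - rto_min: minimum retransmission timeout in seconds (200 ms)
      - rto_max: maximum retransmission timeout in seconds (120 s)
    The RTO doubles after every unanswered transmission, capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # One wait after the original transmission, plus one after each retry.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    print(round(hypothetical_timeout(), 1))
```

With these defaults the sum comes to about 924.6 seconds, which is in the same range as the 15-minute blackout. This is a lower bound for an idealised connection: the effective value depends on the round-trip time the kernel has measured for that connection.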
 === Reason ===
 There is a router between front- and back-end - which isn't the ideal