[squeak-dev] Server timeouts and 504 return codes
Das.Linux at gmx.de
Sun Jan 27 12:22:48 UTC 2019
> On 27.01.2019, at 02:53, tim Rowledge <tim at rowledge.org> wrote:
> A couple of weeks ago I had a problem loading something via SqueakMap that resulted in a 504 error. Chris M quite rightly pointed out that responding to a timeout with an immediate retry might not be the best thing (referencing some code I published to try to handle this problem); looking at the error more closely I finally noticed that a 504 is a *gateway* timeout rather than anything that seems likely to be a problem at the SM or MC repository server. Indeed the error came back much quicker than the 45 seconds timeout that we seem to have set for our http connections.
> I'm a long way from being an expert in the area of connecting to servers via gateways and what their timeouts might be etc., so excuse stupid-question syndrome - I know this isn't Quora where stupid-question is the order of the day.
> Am I right in thinking that a 504 error means that some *intermediate* server timed out according to some setting in its internal config?
> Am I right in imagining that we can't normally affect that timeout?
Well, we can.
What happens here:
- Requests to all our websites, including all HTTP services such as the Map, arrive together at squeak.org, aka alan.box.squeak.org.
  That is an nginx server, and it is also the server that eventually spits out the 504.
- alan then sees that we want a connection to the Map and makes an HTTP request to ted.box.squeak.org (=> alan is a _reverse proxy_),
  and once ted responds, alan passes that response back to us.
- If ted fails to respond within 60s, alan gives a 504.
It's as simple as that. This ensures that we never wait too long (i.e. >60s) on ted.
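The decision alan makes can be modelled in a few lines. What follows is a simplified Python sketch, not nginx's actual logic: the function name `status_seen_by_client` and its parameters are made up for illustration, and the single `upstream_seconds` number stands in for the wait that nginx really measures between two successive reads from the upstream.

```python
from http import HTTPStatus

# 60 seconds, the limit described above for alan (assumed constant name).
PROXY_READ_TIMEOUT_S = 60.0


def status_seen_by_client(upstream_seconds, upstream_status=200,
                          read_timeout=PROXY_READ_TIMEOUT_S):
    """Return the status the client receives from the reverse proxy.

    upstream_seconds: how long the application server (ted) takes to answer.
    upstream_status:  the status ted would eventually send.
    """
    if upstream_seconds > read_timeout:
        # The proxy gives up waiting and answers on ted's behalf.
        return HTTPStatus.GATEWAY_TIMEOUT
    # Otherwise the proxy simply relays ted's answer.
    return HTTPStatus(upstream_status)


print(status_seen_by_client(2.5).value)   # 200
print(status_seen_by_client(75.0).value)  # 504
```

Note that in this model the 504 comes from the proxy alone; ted never gets to send any status at all.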
Elephant in the room: why not talk to ted directly? The nginx on alan is configured with as much hardening as I thought best, and it can actually handle a multitude of requests much better than our Squeak-based "application servers". This distinction between reverse proxy and application server is, by the way, quite standard and enables some things. For example:
We can tune a lot of things on alan with regard to how it should handle requests. The simplest being:
- we can tune the timeout: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
that's where the 60s come from, and we could simply crank it up.
- HOWEVER: this could mean we eventually run into other timeouts instead, for example on the application server or even at the TCP level.
- so increasing this just like that _may_ help or _may_ make the Map useless altogether, so please be careful y'all :)
- we can cache: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache
- we could make alan not even ask ted when we know the answer already.
- Attention: we need a lot of information on what is stable and what not to do this.
- (it's tempting to try, though)
- (we probably want that for squeaksource/source.squeak for the MCZ requests. but we lose the download statistics then…)
- Note: a lot of time is probably spent by ted generating HTTP and by alan parsing HTTP. Using FastCGI, for example, reduces that, and it is supported by both nginx (https://nginx.org/en/docs/http/ngx_http_fastcgi_module.html) and GemStone, but I don't know whether we already have a FastCGI implementation in Squeak.
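The caching trade-off mentioned in the list above (alan answering without asking ted, at the price of the download statistics) can be illustrated with a toy model. This is a minimal Python sketch under stated assumptions, not how nginx's proxy cache works internally: `CachingProxy`, `ted_fetch`, `download_counts` and the path `/Example-abc.1.mcz` are all hypothetical names chosen for the example.

```python
class CachingProxy:
    """Answers repeated requests from memory instead of asking upstream."""

    def __init__(self, fetch_upstream):
        self._fetch_upstream = fetch_upstream
        self._cache = {}

    def get(self, path):
        if path not in self._cache:
            # Cache miss: only now is the application server contacted.
            self._cache[path] = self._fetch_upstream(path)
        return self._cache[path]


download_counts = {}


def ted_fetch(path):
    """Stand-in for the application server; counts every request it sees."""
    download_counts[path] = download_counts.get(path, 0) + 1
    return f"contents of {path}"


alan = CachingProxy(ted_fetch)
for _ in range(3):
    body = alan.get("/Example-abc.1.mcz")

print(body)             # contents of /Example-abc.1.mcz
print(download_counts)  # {'/Example-abc.1.mcz': 1}
```

Three client requests reach the application server only once, which is exactly why the statistics would undercount. The sketch also never invalidates an entry, so it is only safe for content known to be stable.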
> If I have any reasonable grasp on this then we should probably detect the 504 (in part by explicitly using a WebClient and its error handling rather than the slightly wonky httpSocket facade we have currently) and retry the connection? Any other error or a timeout at *our* end would still be best handled as an error.
All 500-ish codes essentially say "the server is to blame", and the client can do nothing about that.
I don't think that 504 is meaningfully better handled than 503 or 502 in the WebClient. I think it's OK to pass that through.
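As an illustration of "pass it through", here is a minimal Python sketch (not WebClient code, and not Squeak) that treats every 5xx code the same way instead of special-casing 504. The names `ServerSideError` and `check_status` are hypothetical; only the standard library's `http.client.responses` table is real.

```python
from http.client import responses


class ServerSideError(Exception):
    """Raised for any 5xx status; the client cannot fix these itself."""

    def __init__(self, status_code):
        phrase = responses.get(status_code, "Unknown Status")
        super().__init__(f"{status_code} {phrase}")
        self.status_code = status_code


def check_status(status_code):
    """Return non-5xx codes unchanged; raise for 500-599 without special cases."""
    if 500 <= status_code <= 599:
        raise ServerSideError(status_code)
    return status_code


for code in (200, 502, 503, 504):
    try:
        check_status(code)
        print(code, "ok")
    except ServerSideError as error:
        print("server-side error:", error)
```

The caller still sees which code it was via `status_code`, so a retry policy could be layered on top later without changing this check.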
> Except of course a 418 which has well defined error handling...
At least not 451…
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> You forgot to do your backup 16 days ago. Tomorrow you'll need that version.