[squeak-dev] Server timeouts and 504 return codes

Sun Jan 27 17:50:50 UTC 2019

On Sun, 27 Jan 2019, Tobias Pape wrote:

> Hi
>
>> On 27.01.2019, at 02:53, tim Rowledge <tim at rowledge.org> wrote:
>> 
>> A couple of weeks ago I had a problem loading something via SqueakMap that resulted in a 504 error. Chris M quite rightly pointed out that responding to a timeout with an immediate retry might not be the best thing (referencing some code I published to try to handle this problem); looking at the error more closely I finally noticed that a 504 is a *gateway* timeout rather than anything that seems likely to be a problem at the SM or MC repository server. Indeed the error came back much quicker than the 45 seconds timeout that we seem to have set for our http connections.
>> 
>> I'm a long way from being an expert in the area of connecting to servers via gateways and what their timeouts might be etc. so excuse stupid-question syndrome - I know this isn't Quora where stupid-question is the order of the day.
>> Am I right in thinking that a 504 error means that some *intermediate* server timed out according to some setting in its internal config ?
>> Am I right in imagining that we can't normally affect that timeout?
>> 
>
> Well, we can.
>
> What happens here:
>
> - All our websites, including all HTTP services, such as the Map, arrive together at squeak.org, aka alan.box.squeak.org
>  That is an nginx server, and also the server that eventually spits out the 504.
> - alan then sees we want a connection to the Map, and does an HTTP request to ted.box.squeak.org (=> alan is a _reverse proxy_)
>  and upon response gets us that back.
>
> - if ted fails to respond in 60s, alan gives a 504.
>
> Simple as that. This limits the possibility that we wait too long (ie >60s) on ted.
>
> Elephant in the room: why not directly ted? the nginx on alan is configured as hardened as I thought best, and actually can handle a multitude of requests much better than our squeak-based "application servers". This distinction between reverse proxy and application server is btw quite standard and enables some things. For example:
>
> We can tune a lot of things on alan with regards to how it should handle things. The simplest being: 
>
> - we can tune the timeout: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
>  that's where the 60s come from, and we could simply crank it up.
>  - HOWEVER: this could mean we eventually run into other timeouts, for example on the server or even in TCP or so.
>  - so increasing this just like that _may_ help or _may_ make the Map useless altogether, so please be careful y'all :)

Tim reported shorter than 45s timeouts, so it is very likely an issue with 
the SqueakMap image.

>
> but also:
> - we can cache: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache
>  - we could make alan not even ask ted when we know the answer already.
>  - Attention: we need a lot of information on what is stable and what not to do this.
>  - (it's tempting to try, tho)
>  - (we probably want that for squeaksource/source.squeak for the MCZ requests. but we lose the download statistics then…)

If squeaksource/mc used ETags, then the squeaksource image could simply 
return 304 and let nginx serve the cached mczs while keeping the
statistics updated.
That would also let us save bandwidth by not downloading files already 
sitting in the client's package cache.
We could also use nginx to serve files instead of the image, but then the 
image would have to know that it's sitting behind nginx.
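
A rough sketch of the ETag idea, in Python only for illustration (this is not
how squeaksource is implemented; etag_for, respond and the stats dictionary
are made-up names). It assumes the cache in front revalidates each request
with If-None-Match, so the application still sees and counts every download
but only sends the file body when the cached copy is stale:

```python
import hashlib


def etag_for(data: bytes) -> str:
    """Derive a strong ETag from the file contents (quoted, as HTTP requires)."""
    return '"' + hashlib.sha1(data).hexdigest() + '"'


def respond(request_headers: dict, data: bytes, stats: dict, name: str):
    """Return (status, headers, body) for a GET of one file.

    Simplification: If-None-Match is compared as a single tag; a real
    server would also handle tag lists, weak tags and "*".
    """
    # Count every request, whether or not the body is sent again.
    stats[name] = stats.get(name, 0) + 1
    tag = etag_for(data)
    if request_headers.get("If-None-Match") == tag:
        # The cached copy is still valid: no body, just confirm the tag.
        return 304, {"ETag": tag}, b""
    return 200, {"ETag": tag}, data
```

The same exchange would work between a client and the server directly, which
is where the bandwidth saving for files already in the package cache would
come from.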

> - Note: a lot of time is probably spent by ted generating HTTP and by alan parsing HTTP. Using Fcgi, for example, reduces that, and is supported by both nginx (https://nginx.org/en/docs/http/ngx_http_fastcgi_module.html) and GemStone, but I don't know whether we already have one in squeak.

I'm 99% sure http overhead is negligible.

Levente

>
>> If I have any reasonable grasp on this then we should probably detect the 504 (in part by explicitly using a WebClient and its error handling rather than the slightly wonky httpSocket facade we have currently) and retry the connection? Any other error or a timeout at *our* end would still be best handled as an error.
>
> All 500-ish codes essentially say "the server is to blame" and the client can do nothing about that.
> I don't think that 504 is meaningfully better handled than 503 or 502 in the WebClient. I think it's ok to pass that through.
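
If a client did want to retry on these codes anyway, waiting between attempts
is gentler on the server than the immediate retry Chris M objected to. A
minimal sketch in Python, purely for illustration: get_with_backoff and its
parameters are made-up names, fetch stands for whatever performs the request,
and treating 502, 503 and 504 alike follows the remark above that they are
not meaningfully different for the client:

```python
import time

# Gateway/server-side codes that may be worth a delayed retry.
RETRYABLE = {502, 503, 504}


def get_with_backoff(fetch, attempts=3, first_delay_s=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is any zero-argument callable returning (status, body).
    The delay doubles after every failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = first_delay_s
    for attempt in range(attempts):
        status, body = fetch()
        last_attempt = attempt == attempts - 1
        if status not in RETRYABLE or last_attempt:
            # Success, a non-retryable error, or nothing left to try:
            # hand the result back and let the caller decide.
            return status, body
        sleep(delay)
        delay *= 2
```

Any other status, and any timeout raised at our own end inside fetch, is
passed straight through to the caller as before.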
>
>
>> 
>> Except of course a 418 which has well defined error handling... 
>> 
>
> At least not 451…
>
> Best regards
> 	-Tobias
>
>> tim
>> --
>> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
>> You forgot to do your backup 16 days ago.  Tomorrow you'll need that version.
>> 
>> 
>>