[squeak-dev] Server timeouts and 504 return codes

Sun Jan 27 20:48:13 UTC 2019

On Sun, 27 Jan 2019, Chris Muller wrote:

> Hi guys,
>
>> >> A couple of weeks ago I had a problem loading something via SqueakMap that resulted in a 504 error. Chris M quite rightly pointed out that responding to a timeout with an immediate retry might not be the best thing (referencing some code I published to try to handle this problem); looking at the error more closely I finally noticed that a 504 is a *gateway* timeout rather than anything that seems likely to be a problem at the SM or MC repository server. Indeed the error came back much quicker than the 45 seconds timeout that we seem to have set for our http connections.
>> >>
>> >> I'm a long way from being an expert in the area of connecting to servers via gateways and what their timeouts might be etc., so excuse stupid-question syndrome - I know this isn't Quora where stupid-question is the order of the day.
>> >> Am I right in thinking that a 504 error means that some *intermediate* server timed out according to some setting in its internal config ?
>> >> Am I right in imagining that we can't normally affect that timeout?
>> >>
>> >
>> > Well, we can.
>> >
>> > What happens here:
>> >
>> > - All our websites, including all HTTP services, such as the Map, arrive together at squeak.org, aka alan.box.squeak.org
>> >  That is an nginx server, and also the server that eventually spits out the 504.
>> > - alan then sees we want a connection to the Map, and does an HTTP request to ted.box.squeak.org (=> alan is a _reverse proxy_)
>> >  and upon response gets us that back.
>
> Thanks for the great explanation!  I want to learn more about
> admin'ing, so it's great to have this in-context example of a
> reverse-proxy, thanks for setting that up!
>
>> > - if ted fails to respond in 60s, alan gives a 504.
>
> 60s seems like an ideally balanced timeout setting -- the longest any
> possible request should be expected to wait ... and yet clients can
> still shorten to 45s or 30 if they want a shorter timeout.
>
>> > Simple as that. This limits the possibility that we wait too long (ie >60s) on ted.
>> >
>> > Elephant in the room: why not go directly to ted? The nginx on alan is configured to be as hardened as I thought best, and it can actually handle a multitude of requests much better than our Squeak-based "application servers". This distinction between reverse proxy and application server is btw quite standard and enables some things. For example:
>> >
>> > We can tune a lot of things on alan with regards to how it should handle things. The simplest being:
>> >
>> > - we can tune the timeout: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
>> >  that's where the 60s come from, and we could simply crank it up.
>> >  - HOWEVER: this could mean we eventually run into other timeouts, for example on the server or even in TCP or so.
>> >  - so increasing this just like that _may_ help or _may_ make the Map useless altogether, so please be careful y'all :)
>>
>> Tim reported shorter than 45s timeouts, so it is very likely an issue with
>> the SqueakMap image.
>
> Yes, the SqueakMap server image is one part of the dynamic, but I
> think another is a bug in the trunk image.  I think the reason Tim is
> not seeing 45 seconds before error is because the timeout setting of
> the high-up client is not being passed all the way down to the
> lowest-level layers -- e.g., from HTTPSocket --> WebClient -->
> SocketStream --> Socket.  By the time it gets down to Socket which
> does the actual work, it's operating on its own 30 second timeout.

I would expect subsecond response times. 30 seconds is just unacceptably 
long.

>
> It is a fixed amount of time, I *think* still between 30 and 45
> seconds, that it takes the SqueakMap server to save its model after an
> update (e.g., adding a Release, etc.).  It's so long because the
> server is running on a very old 3.x image, interpreter VM.  It's
> running a HttpView2 app which doesn't even compile in modern Squeak.
> That's why it hasn't been brought forward yet, but I am working on a
> new API service to replace it with the eventual goal of SqueakMap
> being an "App Store" experience, and it will not suffer timeouts.
>
>> > but also:
>> > - we can cache: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache
>> >  - we could make alan not even ask ted when we know the answer already.
>> >  - Attention: we need a lot of information on what is stable and what not to do this.
>> >  - (it's tempting to try, tho)
>> >  - (we probably want that for squeaksource/source.squeak for the MCZ requests. but we lose the download statistics then…)
>>
>> If squeaksource/mc used ETags, then the squeaksource image could simply
>> return 304 and let nginx serve the cached mczs while keeping the
>> statistics updated.
>
> Tim's email was about SqueakMap, not SqueakSource.  SqueakSource

That part of the thread changed direction. It happens sometimes.

> serves the mcz's straight off the hard-drive platter.  We don't need
> to trade away download statistics to save a few ms on a mcz request.

Download statistics would stay the same despite being flawed (e.g. 
you'll download everything multiple times even if those files are sitting 
in your package cache).
You would save seconds, not milliseconds by not downloading files again.
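
A rough sketch of the ETag idea, in Python rather than Squeak and not taken from
any existing squeaksource or nginx code. The class names, the file name and the
choice of a content hash as the ETag are all assumptions made for illustration.
The origin is asked every time, so it can keep counting, but it only sends the
body when the cached copy is out of date.

```python
import hashlib


class Origin:
    """Hypothetical stand-in for the repository image."""

    def __init__(self, files):
        self.files = files      # file name -> bytes
        self.downloads = {}     # file name -> number of requests seen
        self.bytes_sent = 0     # body bytes actually transferred

    def etag(self, name):
        # Assumption: the ETag is a hash of the file content.
        return '"' + hashlib.sha256(self.files[name]).hexdigest() + '"'

    def get(self, name, if_none_match=None):
        # Every request is counted, including the ones answered with 304.
        self.downloads[name] = self.downloads.get(name, 0) + 1
        tag = self.etag(name)
        if if_none_match == tag:
            return 304, tag, b""
        body = self.files[name]
        self.bytes_sent += len(body)
        return 200, tag, body


class CachingProxy:
    """Hypothetical stand-in for the cache in front of the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}         # file name -> (etag, body)

    def get(self, name):
        cached = self.cache.get(name)
        known_tag = cached[0] if cached else None
        status, tag, body = self.origin.get(name, known_tag)
        if status == 304:
            return cached[1]    # origin confirmed the cached copy is current
        self.cache[name] = (tag, body)
        return body
```

Because the tag in this sketch is derived from the content, two different files
that happen to share a name would get different tags.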

>
>> That would also let us save bandwidth by not downloading files already
>> sitting in the client's package cache.
>
> How so?  Isn't the package-cache checked before hitting the server at
> all?  It certainly should be.

No, it's not. Currently that's not possible, because different files can 
have the same name. And currently we have no way to tell them apart.

Levente

>
> Best,
>  Chris
>
>
>> We could also use nginx to serve files instead of the image, but then the
>> image would have to know that it's sitting behind nginx.
>>
>> > - Note: a lot of time is probably spent by ted generating HTTP and by alan parsing HTTP. Using FastCGI, for example, reduces that, and is supported by both nginx (https://nginx.org/en/docs/http/ngx_http_fastcgi_module.html) and GemStone, but I don't know whether we already have one in Squeak.
>>
>> I'm 99% sure http overhead is negligible.
>>
>> Levente
>>
>> >
>> >> If I have any reasonable grasp on this then we should probably detect the 504 (in part by explicitly using a WebClient and its error handling rather than the slightly wonky HTTPSocket facade we have currently) and retry the connection? Any other error or a timeout at *our* end would still be best handled as an error.
>> >
>> > All 500-ish codes essentially say "the server is to blame" and the client can do nothing about that.
>> > I don't think that 504 is meaningfully better handled than 503 or 502 in the WebClient. I think it's ok to pass that through.
>> >
>> >
>> >>
>> >> Except of course a 418 which has well defined error handling...
>> >>
>> >
>> > At least not 451…
>> >
>> > Best regards
>> >       -Tobias
>> >
>> >> tim
>> >> --
>> >> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
>> >> You forgot to do your backup 16 days ago.  Tomorrow you'll need that version.
>> >>
>> >>
>> >>
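
For reference, the "detect the 504 and retry, but not immediately" idea from the
thread could look roughly like the following. This is a Python sketch, not
WebClient code. The function name, the `fetch` callable, the attempt count and
the delays are assumptions chosen for illustration. Only a 504 is retried, with
a growing pause between attempts. Every other status, including 502 and 503, is
passed straight back to the caller.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns something other than 504.

    fetch is a hypothetical callable returning (status, body).
    sleep is injectable so the pause can be replaced in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch()
        last_attempt = attempt == max_attempts - 1
        if status != 504 or last_attempt:
            # Success, a different error, or retries exhausted: hand it back.
            return status, body
        # Wait longer after each gateway timeout instead of retrying at once.
        sleep(base_delay * 2 ** attempt)
```

With the defaults, a request that keeps getting 504 is tried three times with
pauses of 1 and 2 seconds, and the last 504 is then returned to the caller as
an ordinary error.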