[squeak-dev] Server timeouts and 504 return codes

Tobias Pape Das.Linux at gmx.de
Mon Jan 28 07:38:45 UTC 2019


> On 27.01.2019, at 21:48, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> 
> On Sun, 27 Jan 2019, Chris Muller wrote:
> 
>> Hi guys,
>> 
>>>>> A couple of weeks ago I had a problem loading something via SqueakMap that resulted in a 504 error. Chris M quite rightly pointed out that responding to a timeout with an immediate retry might not be the best thing (referencing some code I published to try to handle this problem); looking at the error more closely I finally noticed that a 504 is a *gateway* timeout rather than anything that seems likely to be a problem at the SM or MC repository server. Indeed the error came back much quicker than the 45 seconds timeout that we seem to have set for our http connections.
>>>>> 
>>>>> I'm a long way from being an expert in the area of connecting to servers via gateways and what their timeouts might be etc. so excuse stupid-question syndrome - I know this isn't Quora where stupid-question is the order of the day.
>>>>> Am I right in thinking that a 504 error means that some *intermediate* server timed out according to some setting in its internal config?
>>>>> Am I right in imagining that we can't normally affect that timeout?
>>>>> 
>>>> 
>>>> Well, we can.
>>>> 
>>>> What happens here:
>>>> 
>>>> - All our websites, including all HTTP services, such as the Map, arrive together at squeak.org, aka alan.box.squeak.org
>>>> That is an nginx server. And also the server that eventually spits out the 504.
>>>> - alan then sees we want a connection to the Map, and does an HTTP request to ted.box.squeak.org (=> alan is a _reverse proxy_)
>>>> and upon response gets us that back.
>> 
>> Thanks for the great explanation!  I want to learn more about
>> admin'ing, so it's great to have this in-context example of a
>> reverse-proxy, thanks for setting that up!
>> 
>>>> - if ted fails to respond in 60s, alan gives a 504.
>> 
>> 60s seems like an ideally balanced timeout setting -- the longest any
>> possible request should be expected to wait ... and yet clients can
>> still shorten to 45s or 30 if they want a shorter timeout.
>> 
>>>> Simple as that. This limits the possibility that we wait too long (ie >60s) on ted.
>>>> 
>>>> Elephant in the room: why not directly ted? the nginx on alan is configured as hardened as I thought best, and actually can handle a multitude of requests much better than our squeak-based "application servers". This distinction between reverse proxy and application server is btw quite standard and enables some things. For example:
>>>> 
>>>> We can tune a lot of things on alan with regards to how it should handle things. The simplest being:
>>>> 
>>>> - we can tune the timeout: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
>>>> that's where the 60s come from, and we could simply crank it up.
>>>> - HOWEVER: this could mean we eventually run into other timeouts, for example on the server or even in TCP or so.
>>>> - so increasing this just like that _may_ help or _may_ make the Map useless altogether, so please be careful y'all :)
>>> 
>>> Tim reported shorter than 45s timeouts, so it is very likely an issue with
>>> the SqueakMap image.
>> 
>> Yes, the SqueakMap server image is one part of the dynamic, but I
>> think another is a bug in the trunk image.  I think the reason Tim is
>> not seeing 45 seconds before error is because the timeout setting of
>> the high-up client is not being passed all the way down to the
>> lowest-level layers -- e.g., from HTTPSocket --> WebClient -->
>> SocketStream --> Socket.  By the time it gets down to Socket which
>> does the actual work, it's operating on its own 30 second timeout.
> 
> I would expect subsecond response times. 30 seconds is just unacceptably long.
> 
>> 
>> It is a fixed amount of time, I *think* still between 30 and 45
>> seconds, that it takes the SqueakMap server to save its model after an
>> update (e.g., adding a Release, etc.).  It's so long because the
>> server is running on a very old 3.x image, interpreter VM.  It's
>> running a HttpView2 app which doesn't even compile in modern Squeak.
>> That's why it hasn't been brought forward yet, but I am working on a
>> new API service to replace it with the eventual goal of SqueakMap
>> being an "App Store" experience, and it will not suffer timeouts.
>> 
>>>> but also:
>>>> - we can cache: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache
>>>> - we could make alan not even ask ted when we know the answer already.
>>>> - Attention: we need a lot of information on what is stable and what not to do this.
>>>> - (it's tempting to try, tho)
>>>> - (we probably want that for squeaksource/source.squeak for the MCZ requests. but we lose the download statistics then…)
>>> 
>>> If squeaksource/mc used ETags, then the squeaksource image could simply
>>> return 304 and let nginx serve the cached mczs while keeping the
>>> statistics updated.
>> 
>> Tim's email was about SqueakMap, not SqueakSource.  SqueakSource
> 
> That part of the thread changed direction. It happens sometimes.
> 
>> serves the mcz's straight off the hard-drive platter.  We don't need
>> to trade away download statistics to save a few ms on a mcz request.
> 
> Download statistics would stay the same despite being flawed (e.g. you'll download everything multiple times even if those files are sitting in your package cache).
> You would save seconds, not milliseconds by not downloading files again.

I think we trivially could make that happen by using X-Sendfile (Apache) or X-Accel-Redirect (nginx).
(https://www.nginx.com/resources/wiki/start/topics/examples/x-accel/)

The image gets the request but instead of searching for and serving the file, it answers with such a header and the reverse-proxy takes care of the rest.
Problem here: the reverse-proxy must have access to the files, which it currently does not.
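
To make that a bit more concrete, here is a rough sketch of what the application server's side could look like. It is in Python rather than Squeak just for brevity, and everything except the X-Accel-Redirect header name is made up for illustration: the /protected-mcz/ prefix (which would have to be an nginx location marked "internal" and mapped onto the mcz directory), the accel_response helper, and the count_download callback.

```python
from urllib.parse import quote

# Hypothetical internal nginx location (an assumption, not our real config).
# nginx would need a matching location marked "internal" that maps onto the
# directory holding the mcz files.
INTERNAL_PREFIX = "/protected-mcz/"


def accel_response(filename, count_download):
    """Return (status, headers) the application server would answer with.

    The application server still sees every request, so it can keep its
    download statistics, but it sends no file body. The reverse proxy reads
    the X-Accel-Redirect header and delivers the file itself.
    """
    # Refuse anything that looks like a path rather than a plain file name.
    if "/" in filename or filename.startswith("."):
        return 404, {}
    count_download(filename)
    headers = {
        "X-Accel-Redirect": INTERNAL_PREFIX + quote(filename),
        "Content-Type": "application/octet-stream",
    }
    return 200, headers
```

The point of the design is that the bookkeeping stays in the image while the byte-shuffling moves to nginx; it only works once nginx can actually read the files.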
> 
>> 
>>> That would also let us save bandwidth by not downloading files already
>>> sitting in the client's package cache.
>> 
>> How so?  Isn't the package-cache checked before hitting the server at
>> all?  It certainly should be.
> 
> No, it's not. Currently that's not possible, because different files can have the same name. And currently we have no way to tell them apart.
> 
> Levente
> 
>> 
>> Best,
>> Chris
>> 
>> 
>>> We could also use nginx to serve files instead of the image, but then the
>>> image would have to know that it's sitting behind nginx.
>>> 
>>>> - Note: a lot of time is probably spent by ted generating HTTP and by alan parsing HTTP. Using FastCGI, for example, reduces that, and is supported by both nginx (https://nginx.org/en/docs/http/ngx_http_fastcgi_module.html) and GemStone, but I don't know whether we already have one in Squeak.
>>> 
>>> I'm 99% sure http overhead is negligible.
>>> 
>>> Levente
>>> 
>>>> 
>>>>> If I have any reasonable grasp on this then we should probably detect the 504 (in part by explicitly using a WebClient and its error handling rather than the slightly wonky httpSocket facade we have currently) and retry the connection? Any other error or a timeout at *our* end would still be best handled as an error.
>>>> 
>>>> All 500-ish codes essentially say "the server is to blame" and the client can do nothing about that.
>>>> I don't think that 504 is meaningfully better handled than 503 or 502 in the WebClient. I think it's ok to pass that through.
>>>> 
>>>> 
>>>>> 
>>>>> Except of course a 418 which has well defined error handling...
>>>>> 
>>>> 
>>>> At least not 451…
>>>> 
>>>> Best regards
>>>>      -Tobias
>>>> 
>>>>> tim
>>>>> --
>>>>> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
>>>>> You forgot to do your backup 16 days ago.  Tomorrow you'll need that version.
>>>>> 
>>>>> 
>>>>> 
