[squeak-dev] Server timeouts and 504 return codes

Sun Jan 27 20:10:10 UTC 2019

Hi guys,

> >> A couple of weeks ago I had a problem loading something via SqueakMap that resulted in a 504 error. Chris M quite rightly pointed out that responding to a timeout with an immediate retry might not be the best thing (referencing some code I published to try to handle this problem); looking at the error more closely I finally noticed that a 504 is a *gateway* timeout rather than anything that seems likely to be a problem at the SM or MC repository server. Indeed the error came back much quicker than the 45 seconds timeout that we seem to have set for our http connections.
> >>
> >> I'm a long way from being an expert in the area of connecting to servers via gateways and what their timeouts might be etc. so excuse stupid-question syndrome - I know this isn't Quora where stupid-question is the order of the day.
> >> Am I right in thinking that a 504 error means that some *intermediate* server timed out according to some setting in its internal config ?
> >> Am I right in imagining that we can't normally affect that timeout?
> >>
> >
> > Well, we can.
> >
> > What happens here:
> >
> > - All our websites, including all HTTP services, such as the Map, arrive together at squeak.org, aka alan.box.squeak.org
> >  That is an nginx server, and also the server that eventually spits out the 504.
> > - alan then sees we want a connection to the Map, and does a HTTP request to ted.box.squeak.org (=> alan is a _reverse proxy_)
> >  and upon response gets us that back.

Thanks for the great explanation!  I want to learn more about
admin'ing, so it's great to have this in-context example of a
reverse-proxy, thanks for setting that up!

> > - if ted fails to respond in 60s, alan gives a 504.

60s seems like an ideally balanced timeout setting -- the longest any
possible request should be expected to wait ... and yet clients can
still shorten to 45s or 30 if they want a shorter timeout.
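
To make the interplay of the two timeouts concrete, here is a minimal sketch of which outcome a client would see. It assumes a simplified model (one client deadline, one proxy deadline, no network latency); the function and parameter names are made up for illustration and are not part of any Squeak or nginx API.

```python
def observed_outcome(backend_seconds, client_timeout=45, proxy_timeout=60):
    """Return what the client sees for a request the backend needs
    backend_seconds to answer (simplified model, hypothetical names)."""
    # Whichever deadline is shorter fires first.
    first_deadline = min(client_timeout, proxy_timeout)
    if backend_seconds <= first_deadline:
        return "response"
    if client_timeout < proxy_timeout:
        # The client gives up before the proxy does.
        return "client timeout"
    # The proxy gives up first and answers with a gateway timeout.
    return "504 from proxy"
```

With the defaults above, a 50 second backend delay ends in a client timeout, while a client willing to wait 90 seconds would get the 504 from the proxy instead.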

> > Simple as that. This limits the possibility that we wait too long (ie >60s) on ted.
> >
> > Elephant in the room: why not directly ted? the nginx on alan is configured as hardened as I thought best, and actually can handle a multitude of requests much better than our squeak-based "application servers". This distinction between reverse proxy and application server is btw quite standard and enables some things. For example:
> >
> > We can tune a lot of things on alan with regards to how it should handle things. The simplest being:
> >
> > - we can tune the timeout: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
> >  that's where the 60s come from, and we could simply crank it up.
> >  - HOWEVER: this could mean we eventually run into other timeouts, for example on the server or even in TCP or so.
> >  - so increasing this just like that _may_ help or _may_ make the Map useless altogether, so please be careful y'all :)
>
> Tim reported shorter than 45s timeouts, so it is very likely an issue with
> the SqueakMap image.

Yes, the SqueakMap server image is one part of the dynamic, but I
think another is a bug in the trunk image.  I think the reason Tim is
not seeing 45 seconds before error is because the timeout setting of
the high-up client is not being passed all the way down to the
lowest-level layers -- e.g., from HTTPSocket --> WebClient -->
SocketStream --> Socket.  By the time it gets down to Socket which
does the actual work, it's operating on its own 30 second timeout.
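
A rough sketch of the kind of bug I suspect, written in Python rather than Smalltalk. The class names only mirror the Squeak layers mentioned above; the `propagate` flag and `effective_timeout` are hypothetical and exist purely to show the difference between dropping and passing on the timeout.

```python
DEFAULT_SOCKET_TIMEOUT = 30  # seconds; the low-level default mentioned above


class Socket:
    def __init__(self, timeout=DEFAULT_SOCKET_TIMEOUT):
        self.timeout = timeout


class SocketStream:
    def __init__(self, timeout=None, propagate=True):
        # With propagate=False the caller's timeout is silently dropped
        # (the suspected bug) and the socket keeps its own default.
        if propagate and timeout is not None:
            self.socket = Socket(timeout=timeout)
        else:
            self.socket = Socket()


class WebClient:
    def __init__(self, timeout=45, propagate=True):
        self.timeout = timeout
        self.stream = SocketStream(timeout=timeout, propagate=propagate)

    def effective_timeout(self):
        # The timeout that the lowest layer actually enforces.
        return self.stream.socket.timeout
```

In this model `WebClient(timeout=45, propagate=False).effective_timeout()` yields 30 even though the caller asked for 45, which would match what Tim observed.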

It is a fixed amount of time, I *think* still between 30 and 45
seconds, that it takes the SqueakMap server to save its model after an
update (e.g., adding a Release, etc.).  It's so long because the
server is running on a very old 3.x image, interpreter VM.  It's
running a HttpView2 app which doesn't even compile in modern Squeak.
That's why it hasn't been brought forward yet, but I am working on a
new API service to replace it with the eventual goal of SqueakMap
being an "App Store" experience, and it will not suffer timeouts.

> > but also:
> > - we can cache: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache
> >  - we could make alan not even ask ted when we know the answer already.
> >  - Attention: we need a lot of information on what is stable and what not to do this.
> >  - (it's tempting to try, tho)
> >  - (we probably want that for squeaksource/source.squeak for the MCZ requests. but we lose the download statistics then…)
>
> If squeaksource/mc used ETags, then the squeaksource image could simply
> return 304 and let nginx serve the cached mczs while keeping the
> statistics updated.

Tim's email was about SqueakMap, not SqueakSource.  SqueakSource
serves the mcz's straight off the hard-drive platter.  We don't need
to trade away download statistics to save a few ms on a mcz request.

> That would also let us save bandwidth by not downloading files already
> sitting in the client's package cache.

How so?  Isn't the package-cache checked before hitting the server at
all?  It certainly should be.
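
For reference, the ETag exchange Levente describes could look roughly like the sketch below. This is only an illustration under assumptions: the hash-based ETag, the `respond` function and the `download_count` dictionary are invented here and do not describe how SqueakSource works today.

```python
import hashlib

download_count = {}  # hypothetical per-file statistics


def make_etag(body: bytes) -> str:
    # A strong ETag derived from the file contents (one common choice).
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(name: str, body: bytes, if_none_match=None):
    """Answer a GET for an mcz, honouring If-None-Match."""
    # The request still reaches the application, so it can be counted.
    download_count[name] = download_count.get(name, 0) + 1
    etag = make_etag(body)
    if if_none_match == etag:
        # The requester's copy is current: no body needs to be sent.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

The first request gets a 200 with the full body and an ETag; a repeat request carrying that ETag in If-None-Match gets a 304 with an empty body, and both are counted.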

Best,
  Chris

> We could also use nginx to serve files instead of the image, but then the
> image would have to know that it's sitting behind nginx.
>
> > - Note: a lot of time is probably spent by ted generating HTTP and by alan parsing HTTP. Using Fcgi, for example, reduces that, and is supported by both nginx (https://nginx.org/en/docs/http/ngx_http_fastcgi_module.html) and GemStone, but I don't know whether we already have one in squeak.
>
> I'm 99% sure http overhead is negligible.
>
> Levente
>
> >
> >> If I have any reasonable grasp on this then we should probably detect the 504 (in part by explicitly using a WebClient and its error handling rather than the slightly wonky httpSocket facade we have currently) and retry the connection ? Any other error or a timeout at *our* end would still be best handled as an error.
> >
> > All 500-ish codes essentially say "the server is to blame" and the client can do nothing about that.
> > I don't think that 504 is meaningfully better handled than 503 or 502 in the WebClient. I think it's ok to pass that through.
> >
> >
> >>
> >> Except of course a 418 which has well defined error handling...
> >>
> >
> > At least not 451…
> >
> > Best regards
> >       -Tobias
> >
> >> tim
> >> --
> >> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> >> You forgot to do your backup 16 days ago.  Tomorrow you'll need that version.
> >>
> >>
> >>