[squeak-dev] Server timeouts and 504 return codes

Chris Muller ma.chris.m at gmail.com
Tue Jan 29 00:02:05 UTC 2019


Hi again,

> >>>>>>> Yes, the SqueakMap server image is one part of the dynamic, but I
> >>>>>>> think another is a bug in the trunk image.  I think the reason Tim is
> >>>>>>> not seeing 45 seconds before error is because the timeout setting of
> >>>>>>> the high-up client is not being passed all the way down to the
> >>>>>>> lowest-level layers -- e.g., from HTTPSocket --> WebClient -->
> >>>>>>> SocketStream --> Socket.  By the time it gets down to Socket which
> >>>>>>> does the actual work, it's operating on its own 30 second timeout.
> >>>>>>
> >>>>>> I would expect subsecond response times. 30 seconds is just unacceptably
> >>>>>> long.
> >>>>>
> >>>>> Well, it depends on if, for example, you're in the middle of
> >>>>> Antarctica with a slow internet connection in an office with a fast
> >>>>> connection.  A 30 second timeout is just the maximum amount of time
> >>>>> the client will wait for the entire process before presenting a
> >>>>> debugger, that's all it can do.
> >>>>
> >>>> We can be sure that Tim should get subsecond response times instead of
> >>>> timeouts after 30 seconds.
> >>>
> >>> Right, but timeout settings are a necessary tool sometimes, my point
> >>> was that we should fix client code in trunk to make timeouts work
> >>> properly.
> >>>
> >>> Incidentally, 99% of SqueakMap requests ARE subsecond -- just go to
> >>> map.squeak.org and click around and see.  For the remaining 1% that
> >>> aren't, the issue is known and we're working on a new server to fix
> >>> that.
> >>
> >> Great! That was my point: the image needs to be fixed.
> >
> > But, you're referring to the server image as "the image needs to be
> > fixed", which I've already conceded, whereas I'm referring to the
> > client image -- our trunk image -- as also needing the suspected
> > bug(s) with WebClient (et al) fixed.
>
> I don't think anything related to the server timeouts needs to be fixed
> there. Sure, more user friendly error messages could be useful, but in
> relation to Tim's problem at the network level there's nothing wrong in
> the client.

I'm not sure if you're saying that timeout setting currently works
correctly in trunk, or that it doesn't _need_ to work correctly.
Hopefully the former...
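To make the suspected bug concrete: if any intermediate layer drops the caller's timeout on the way down, the bottom layer silently falls back to its own default. Here is a minimal Python sketch of that failure mode (the function names merely stand in for the HTTPSocket -> WebClient -> SocketStream -> Socket chain; this is not the actual Squeak code):

```python
DEFAULT_SOCKET_TIMEOUT = 30  # the low-level layer's own default, in seconds

def socket_connect(timeout=None):
    # Bottom layer: when no timeout is passed down, it uses its own default.
    return timeout if timeout is not None else DEFAULT_SOCKET_TIMEOUT

def socket_stream(timeout=None):
    # Middle layer: correctly forwards whatever it was given.
    return socket_connect(timeout=timeout)

def web_client_buggy(timeout=45):
    # Suspected bug: the caller's timeout is dropped at this layer...
    return socket_stream()

def web_client_fixed(timeout=45):
    # ...whereas the fix is to thread it all the way down.
    return socket_stream(timeout=timeout)

print(web_client_buggy(timeout=45))  # 30 -- the caller's 45s setting is ignored
print(web_client_fixed(timeout=45))  # 45
```

The effective timeout is whatever the lowest layer ends up with, which is why a high-level 45-second setting can still produce a 30-second error.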

> >>>>>>> It is a fixed amount of time, I *think* still between 30 and 45
> >>>>>>> seconds, that it takes the SqueakMap server to save its model after an
> >>>
> >>> and so if in the meantime it can simply be made to wait 45s instead of
> >>> 30s, then current SqueakMap's worst case will only be that occasional
> >>> delay, instead of the annoying debugger we currently get.
> >>
> >> I don't see why that would make a difference: the user would get a
> >> debugger anyway, but only 15 seconds later.
> >
> > No!  :)  As I said:
> >
> >>>>>>> It is a fixed amount of time, I *think* still between 30 and 45
> >>>>>>> seconds, that it takes the SqueakMap server to save its model
> >
> > So they would get a response < 15s later, not a debugger.
>
> Provided the image is able to answer before 45 seconds, which is very
> likely not the case here.

My assertions are based on my experience and observations working on
the SMSqueakMap server image, and being the admin of that server image
since 2011...

> > The server needs the same amount of time to save every time whenever
> > it happens -- it's very predictable -- and right now to avoid a
> > debugger the Squeak trunk image simply needs to be fixed to honor the 45s
> > timeout instead of ignoring it and always defaulting to 30.

... the serialized size of the model in 2011 was 715K, today it's
777K.  Do you have any reason to think the save times _shouldn't_ be
consistent?

I suggest you download the SqueakMap server image and check it out for
yourself.  I actually can't even remember which server it's on, but I
could send you a copy of it.

> Uh. Are you saying it's because the image is being saved? I never liked
> that idea of image-based persistence, and this is a good reason why.

No.  The SqueakMap server image saves the SMSqueakMap object to a file
using ReferenceStream.  See the files in your Squeak directory
/sm/map.[nnnn].gz.  The server created those files, and you downloaded
a copy of them when the SqueakMap "Update" button was clicked.

> >>> Are alan and andreas co-located?
> >>
> >> They are cloud servers in the same data center.
> >>
> >>>
> >>>> The file doesn't have to be read from the disk either.
> >>>
> >>> I assume you mean "read from disk" on alan?  What about after it's
> >>> cached so many mcz's in RAM that it's paging out to the swap file?  To me,
> >>> wasting precious RAM (of any server) to cache old MCZ file contents
> >>> that no one will ever download (because they become old very quickly)
> >>> feels wasteful.  Dragster cars are wasteful too, but yes, they are
> >>> "faster"... on a dragstrip.  :)  I guess there'd have to be some kind
> >>> of application-specific smart management of the cache...
> >>
> >> Nginx's proxy_cache can handle that all automatically. Also, we don't need
> >> a large cache. A small, memory-only cache would do it.
> >
> > How "small" could it be and still contain all the MCZ's you want to
> > use to update an "old" image?
>
> In my definition of old, it is at most back to the last release.
> In today's terms of small: less than 1 GB. But in this case I presume less
> than 100 MB would be enough.

I never doubted that nginx is faster, only whether it would provide
any noticeable difference for most use cases...

> >>> Levente, what about the trunk directory listing, can it cache that?
> >>
> >> Sure.

... except if we did cache this.  I think it would take a meaningful
amount of load off the server and infrastructure.
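For reference, the kind of small, memory-bounded cache Levente describes might look roughly like this in nginx (the paths, zone name, backend port, and sizes here are illustrative assumptions, not the actual squeak.org configuration):

```nginx
# Illustrative sketch only: a small proxy cache in front of the Squeak
# server image, sized along the "less than 100 MB would be enough" line.
proxy_cache_path /var/cache/nginx/squeak levels=1:2
                 keys_zone=squeak_cache:10m max_size=100m inactive=60m;

server {
    listen 80;
    server_name source.squeak.org;

    location / {
        proxy_pass http://127.0.0.1:8888;   # the Squeak server image
        proxy_cache squeak_cache;
        proxy_cache_valid 200 10m;          # cache successful responses briefly
        proxy_cache_use_stale error timeout updating;  # serve stale if image stalls
    }
}
```

The `proxy_cache_use_stale` line is the part that would keep clients from seeing 5xx responses while the image is busy (for example, during a model save).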

> >>> By "the image" I assume you mean the SqueakSource server image.  But
> >>> opening the file takes very little time.  Original web-sites were
> >>> .html files, remember how fast those were?  Plus, filesystems "cache"
> >>> file contents into their own internal caches anyway...
> >>
> >> Each file uses one external semaphore, each socket uses three. If you use
> >> a default image, there can be no more than 256 external semaphores which
> >> is ridiculous for a server,
> >
> > So, that is that (256 / 4 = 64) concurrent requests for a MCZ before
> > it is full?   Probably enough for our small community, but you also
> > said that's just a default we can increase?  Something I'd like to
> > know for Magma too; where can I find this setting?
>
> As Tobias wrote, you'll get bitten by this quickly. It's not enough for
> any server facing the internet where non-friendly actors are common.

Why haven't we been bitten, then?  Are we being shielded by alan?

> >> and it'll just grind to a halt when some load
> >> arrives. Every time the external semaphore table is full, a GC is
> >> triggered to try clear it up via the finalization process.
> >> Reading a file into memory is slow, writing it to a socket is slow.
> >> (Compared to nginx which uses sendfile to let the kernel handle that).
> >> And Squeak can only use a single process to handle everything.
> >
> > To me, it comes back to UX.  If we ever get enough load for that to be
> > an issue, it might be worth looking into.
>
> It's basic stuff. Without this a single web crawler can render your image
> unusable.

Sounds like something that's easy to increase, although wouldn't that
just make the server vulnerable to a larger version of the same
attack?   So I've got to think some of the strategy must depend on
_how_ the server responds to invalid requests, too...

> >>> Yes, it still has to return back through alan but I assume alan does
> >>> not wait for a "full download" received from andreas before it's
> >>> already piping back to the Squeak client.  If true, then it seems
> >>> like it only amounts to saving one hop, which would hardly be
> >>> noticeable over what we have now.
> >>
> >> The goal of caching is not about saving a hop, but to avoid handling files
> >> in Squeak.
> >>
> >>>
> >>>> Nginx does that thing magnitudes faster than
> >>>> Squeak.
> >>>
> >>> The UX would not be magnitudes faster though, right?
> >>
> >> Directly, by letting nginx serve the file, no, but the server image would
> >> be less likely to get stalled (return 5xx responses).
> >
> > SqueakMap and SqueakSource.com are old still with plans for upgrading,
> > but are you still getting 5xx's on source.squeak.org?
>
> I never said I had problems. Tim had them with SqueakMap. As I mentioned
> before, the discussion changed direction.

It's been an enlightening discussion in any case.  :)

> >> But the caching scheme I described in this thread would make the UX a lot
> >> quicker too, because data would not have to be transferred when you
> >> already have it.
> >
> > I assume you mean "data would not have to be transferred" from andreas
> > to alan... from within the same data center..!   :)
>
> I understand your confusion. There are at least 3 suggestions described in
> this thread to remedy the situation. All with different effects.

Okay, maybe you meant that for client requesting from alan, alan would
check a timestamp on the header of the request sent from client, and
quickly send back a code saying, "you already got it."

But my point was that client should check the local package-cache first anyway.
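One standard way to implement the "you already got it" exchange is an HTTP conditional request: the client sends a validator header (If-None-Match or If-Modified-Since), and the server answers 304 Not Modified instead of resending the body. A toy Python sketch of the server-side check (the function name and dict-based headers are simplifications, not any existing Squeak or nginx API):

```python
def serve(request_headers, etag, body):
    """Answer (status, body) for a request against a resource with `etag`.

    If the client's If-None-Match validator matches the current etag,
    answer 304 with an empty body; otherwise send the full 200 response.
    """
    if request_headers.get("If-None-Match") == etag:
        return 304, b""          # client already has this version
    return 200, body

# A client that already holds version "abc-xyz-123" gets a cheap 304:
status, body = serve({"If-None-Match": '"abc-xyz-123"'},
                     '"abc-xyz-123"', b"...mcz bytes...")
```

For immutable artifacts like mcz files, the etag could simply be the version's UUID, which makes the check exact rather than timestamp-based.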

> >>>>>>>> That would also let us save bandwidth by not downloading files already
> >>>>>>>> sitting in the client's package cache.
> >>>>>>>
> >>>>>>> How so?  Isn't the package-cache checked before hitting the server at
> >>>>>>> all?  It certainly should be.
> >>>>>>
> >>>>>> No, it's not. Currently that's not possible, because different files can
> >>>>>> have the same name. And currently we have no way to tell them apart.
> >>>>>
> >>>>> No.  No two MCZ's may have the same name, certainly not within the
> >>>>> same repository, because MCRepository cannot support that.  So maybe
> >>>>
> >>>> Not at the same time, but it's possible, and it just happened recently
> >>>> with Chronology-ul.21.
> >>>> It is perfectly possible that a client has a version in its package cache
> >>>> with the same name as a different version on the server.
> >>>
> >>> But we don't want to restrict what's possible in our software design
> >>> because of that.  That situation is already a headache anyway.  Same
> >>> name theoretically can come only from the same person (if we ensure
> >>> unique initials) and so this is avoidable / fixable by resaving one of
> >>> them as a different name...
> >>
> >> It wasn't me who created the duplicate. If your suggestion had been in
> >> place, some images out there, including mine, would have been broken by
> >> the update process.
> >
> > I don't think so, since I said it would open up the .mcz in
> > package-cache and verify the UUID.
>
> What is the UUID of an mcd?

mcd's are the same as mcz's except with fewer MCDefinitions inside.  I
assume you mean mcm here, which was not part of any of this discussion
so far.  Still, I don't see any issues.  Dup names are simply not
supported, period.

> > I guess I don't know what you mean -- I see only one Chronology-ul.21
> > in the ancestry currently anyway..
>
> Never said it was in the ancestry. In the Trunk there is:
>
> Name: Chronology-Core-ul.21
> Author: dtl
> Time: 4 January 2019, 1:17:39.848442 pm
> UUID: 5d9b02fa-8e37-4678-adda-f302163732a1
>
> In the Treated Inbox there is:
>
> Name: Chronology-Core-ul.21
> Author: ul
> Time: 26 December 2018, 1:48:40.220196 am
> UUID: 2e6f6ce2-d0ec-41a0-b27c-88c642e5afc9

Okay.  Are you the author of both?  This is something you yourself
need to guard against doing, but as it's in Treated anyway, I don't
really see any pertinent impact in the case of Chronology-Core-ul.21.

> > I'm sure you would agree it's better for client images to check their
> > local package-cache first before hitting nginx.
>
> Sure, but that can only be possible if the server sends more information
> about the package the client should download (e.g. the UUID or
> some hash). Without that the client would assume that it has the right
> version when it doesn't and failure is unavoidable. (as I described above
> in relation to Chronology-Core-ul.21).

I think I would need a more detailed and/or concrete example of what
you mean, because I'm not understanding the validity of your assertion
that it isn't possible without hitting the server.  What Use Case are
you talking about?  I was talking about the UC of "Diffing two
Versions".  The UUID's are all in the mcz's / mcd's.

So, for example, if I wanted to diff a selected Chronology-Core-ul.21
with its ancestor the process would be:

  - client identifies ancestor of a selected Chronology-Core-ul.21.
Let's just say it's Chronology-Core-ul.20, with id 'abc-xyz-123'.
  - client looks in package-cache, finds a Chronology-Core-ul.20.mcz
file, opens it up and checks the UUID.
  - if it's 'abc-xyz-123', then it uses it.
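A rough sketch of that package-cache check, assuming the .mcz is a zip archive whose 'version' member records the version's id (the member name is real Monticello layout, but the parsing below is a deliberate simplification, not the real Monticello reader):

```python
import re
import zipfile

def cached_version_matches(mcz_path, wanted_uuid):
    """Open an .mcz from the package-cache and compare its recorded UUID.

    An .mcz is a zip archive; its 'version' member holds the version
    info, including an id '<uuid>' field.  The regex below is a
    simplified stand-in for proper parsing.
    """
    try:
        with zipfile.ZipFile(mcz_path) as zf:
            version_info = zf.read("version").decode("utf-8")
    except (OSError, KeyError, zipfile.BadZipFile):
        return False  # missing or unreadable: fall back to the server copy
    match = re.search(r"\bid\s+'([^']+)'", version_info)
    return bool(match) and match.group(1) == wanted_uuid
```

In this sketch anything unreadable simply falls back to downloading from the server, so a corrupt or duplicate-named cache entry degrades to the current behavior rather than breaking the update process.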

The server was never hit at all, so it doesn't need to "send more
information"...  As I said, duplicate names are something to be
avoided in the first place, we shouldn't restrict the potential of the
tools because of the possibility of a duplicate-named Version.

Best,
  Chris
