[Seaside] Re: Scaling Seaside apps (was: About SToR)
knight at acm.org
Sat Aug 5 14:09:42 UTC 2006
OK, I've joined this list now. I'll respond to a couple of comments in various messages.
I note that as far as Postgresql, my experience is with the driver that Bruce Badger did in VW.
In response to my proxied post about Glorp doing everything internally in terms of cursors, Richard Huxton dev at archonet.com wrote:
Hmm - I just don't believe that. OK, for certain queries you'll have to
fetch everything (if you want a sorted list, you can't get the first
item until you're sure it's the smallest value). However, if you ask for
a cursor then you get a cursor. It even plans queries differently on the
grounds that with a cursor you'll want the first row asap.
I think that was a lack of clear terms on my part. When I said uses cursors internally, what I meant is that when Glorp gets results back, it expects to get them as a stream, and to get the results one by one using the driver level API. If you asked for the results in an array, well, it will get them all from the stream and build the array. If you asked for them in a GlorpCursoredStream it will essentially expose that streaming API to you. As I understand it, it is common for databases to calculate results lazily, so if you fetch those results one by one, the database will do only enough to send an initial set of results. Explicitly telling it you want a cursor will give the database a better hint as to what you want, but usually if you ask for a billion rows at once it will get the idea that for its own internal purposes it might want to defer calculation. So I don't really know if for a large query the bottleneck is postgresql doing all the calculations up front, or that the postgresql protocol does not have a mechanism for passing only part of the results back before the remainder has been calculated, or if it's just the driver at the Smalltalk level ignoring such protocol and getting everything.
Glorp doesn't do anything in terms of actually declaring named cursors and issuing fetch statements.
In a different part of the thread, Jeremy Shute shutej at crazilocks.com wrote:
But, looking at the process browser, the messages in the GLORP-operating
thread seem to relate to the assembly of queries. The cursor would
eliminate the READ queries, but the WRITE queries are still necessary,
requiring assembly one-at-a-time, unless GLORP is smart enough to use a
global prepared statement underneath for insertions. I see no reason why it
shouldn't be -- insertions are very homogenous operations, unlike WHERE
clauses. Perhaps a cursor WOULD solve the problem?
Glorp does group insertions together. This can give some very large performance improvements, e.g. if it can use Oracle array binding over a high-latency connection. However, for Postgresql there isn't any directly analogous mechanism. The mechanisms I saw available were a) just concatenating the inserts together in a single large statement with semicolons separating them. This helps with network traffic, but doesn't really tell the database it's doing many similar inserts. b) The bulk load mechanism. This seemed very much aimed at loading data from files, and looked like it might be possible to use more directly, but quite awkward. I haven't done this.
So right now on Postgresql (and others) Glorp will just concatenate the inserts together (up to a limit). One thing to watch out for from a performance perspective is to turn binding off for Postgresql. At least in the VW driver, binding was implemented by taking the queries, splitting them up into strings, and then concatenating the string representation of the bound values into the query. If you have a query with a very long string (e.g. because it's a large number of inserts being done all at once) then this became a huge performance bottleneck, and doesn't even seem to offer any of the benefits of binding in terms of not needing to convert back and forth between string representations.
Glorp will also keep a cache of prepared statements and re-use them if the hooks are in place, but I wasn't aware of how to do that for Postgresql. I suspect it's less important in such a situation, in that you'd have to have binding done better in order to effectively reuse prepared statements. I also have a feeling, although I haven't measured, that Postgresql is more tolerant of dynamic SQL for performance than, say, Oracle, which really really wants you to use prepared statements.
and in other messages mentioned both bypassing caches and someone else mentioned a "report query" which doesn't create objects or put things into cache, but just returns raw data.
I haven't gotten around to writing a NoCacheCachePolicy, but it would be quite an easy implementation. But be aware that then you lose object identity, which means relationships can get confused, and if you're writing, Glorp may not properly know whether to insert or update (it bases that on knowing whether or not it read the object). If you only want raw data, then you can do that using a retrieve:. e.g.
aQuery retrieve: [:each | each primaryKey].
aQuery retrieve: [:each | each parentNode primaryKey timestamp].
which would then give you back an array of two simple types, a primary key and the timestamp of the parent, and put nothing into cache.
Alan Knight [|], Cincom Smalltalk Development
knight at acm.org
aknight at cincom.com
"The Static Typing Philosophy: Make it fast. Make it right. Make it run." - Niall Ross
More information about the Seaside