relational for what? [was: Design Principles Behind Smalltalk, Revisited]

Howard Stearns hstearns at wisc.edu
Tue Jan 2 20:36:22 UTC 2007


Yes, I'm quite serious. I'm asking what kinds of problems RDBMS are 
uniquely best at solving (or at least no worse). I'm not asking whether 
they CAN be used for this problem or that.  I'm asking this from an 
engineering/mathematics perspective, not a business ("we've always done 
things this way" or "we like this vendor") perspective.

I'm new to the Enterprise Software world, having been mostly in either 
industrial or "hard problem" software. But the 3-tier application 
architecture we use for financial processing at our 26 state campuses 
(University of Wisconsin) appears to me to be typical: large numbers of 
individual browsers (not communicating with each other) interact through 
a Web server farm with the Application Servers. The overall application is 
too large as implemented to allow the load to be accommodated, so it is 
divided by functional area into a farm of individual applications that 
do not talk directly to each other. This partitioning isn't very 
successful, because the users tend to do the same functional activities 
at the same times of day, so most of the applications sit idle while a 
few are at their limit. I assumed that a single database was used so 
that the RDBMS could ensure data consistency between all these different 
applications.  But it turns out that the Oracle database can't handle 
that, so instead, each functional area gets its own database.  Most of 
the work done by the system (and most of the work of programmers like 
me) is to COPY data from one table to another at night when the system 
is otherwise quiet. [There is this Byzantine dance in which data is 
copied from one ledger to the next, with various checks against yet 
another set of ledgers. The whole thing is kept in sync by offsetting 
entries ("double entry") that are reconciled once a month or once a year 
when the system is shut down. Amazing.] The whole thing is kludged so 
that nothing ends up handling more than a few gigs of records at a time. 
[Naively, it seems like the obvious solution for this (mathematically) 
is a hashing operation to keep the data evenly distributed over 
in-memory systems on a LAN, plus an in-memory cache of recently used 
chunks. But let's assume I'm missing something. The task here is to 
figure out what I'm not seeing.]
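
To make the bracketed idea concrete, here is a minimal sketch in Python, 
not a description of any real system. It assumes string keys, and the 
names Node, PartitionedStore, node_for and the "account-N" keys are all 
hypothetical. A stable hash picks the node that owns each record, and a 
small least-recently-used cache sits in front of the (simulated) network 
fetch.

```python
import hashlib
from collections import OrderedDict


class Node:
    """Stands in for one in-memory machine on the LAN (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class PartitionedStore:
    """Spreads records over nodes by hashing the key; caches recent reads."""

    def __init__(self, nodes, cache_size=2):
        self.nodes = nodes
        self.cache = OrderedDict()  # least recently used entry comes first
        self.cache_size = cache_size

    def node_for(self, key):
        # A stable hash, so every client maps a given key to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key, value):
        self.node_for(key).store[key] = value
        self.cache.pop(key, None)  # drop any stale cached copy

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        value = self.node_for(key).store[key]  # a network hop in real life
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return value


nodes = [Node(f"node{i}") for i in range(4)]
ledger = PartitionedStore(nodes)
for i in range(1000):
    ledger.put(f"account-{i}", i * 10)
sizes = [len(n.store) for n in nodes]  # records held by each node
```

Plain modulo hashing reshuffles most keys whenever the number of nodes 
changes; consistent hashing is the usual refinement, left out here to 
keep the sketch short.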

Maybe this isn't typical, but it is the architecture that Oracle and its 
PeopleSoft division push on us in their extensive training classes. 
And it appears to be the architecture discussed in the higher education 
IT conferences and Web sites in the U.S.

My experience with non-Enterprise Web/application software is also 
limited, but installations I've encountered since -- when did Phil and 
Alex's Excellent Web site come out? -- appear to also use partitioning 
to keep the working sets down to a few gigs.

My friends at Ab Initio won't tell me what they do or how they do it, 
but no one's claiming they use an RDBMS as Codd described it.

Anyway, either the data AS USED fits into memory or doesn't. If it does, 
then what benefit is the relational math providing? If it doesn't, then 
we have to ask whether the math techniques that were developed to 
provide efficient random access over disks 20 years ago are still valid. 
Is this still the fastest way? (Answer is no.) Is there some 
circumstance in which it is the fastest? Or the safest? Or in which it 
allows us to do something that we could not do otherwise?

Having tools to allow a cult of specialists to break your own computing 
model (the relational calculus) is not a feature, but a signal that 
something is wrong.

I tried briefly to combine JJ's answer with Peter's to find an 
appropriate niche. (Again, I'm trying to look at the math, not fit and 
finish, availability of experienced programmers, color of brochure...) 
For example, there could be a class of problems for which the data set is 
a few tens of gigs and needs to be operated on as a whole, and for which 
queries are fairly arbitrary and exploratory, not production-oriented. 
Etc. But I haven't been able to come up with one that doesn't have 
better characteristics as a distributed system.  Maybe if we define the 
problem as "and you only have one commodity box to do it on." That's 
fair. Maybe that's it?  (Then we need to find an "enterprise" with only 
one box...)

J J wrote:
>> From: Howard Stearns <hstearns at wisc.edu>
>> Reply-To: The general-purpose Squeak developers 
>> list<squeak-dev at lists.squeakfoundation.org>
>> To: The general-purpose Squeak developers 
>> list<squeak-dev at lists.squeakfoundation.org>
>> Subject: relational for what? [was: Design Principles Behind 
>> Smalltalk, Revisited]
>> Date: Tue, 02 Jan 2007 08:18:24 -0600
>>
>> J J wrote:
>>>> ... I simply believe in the right tool for the right job,
>>> and you can't beat an RDB in its domain. ...
>>
>> That's something I've never really understood: what is the domain in 
>> which Relational Databases excel?
> 
> Handling large amounts of enterprise data.  If you have never worked in 
> a large company, you probably won't appreciate this.  But in a large 
> company you have a *lot* of data, and different applications want to see 
> different parts of it.  In an RDBMS this is no problem, you normalize 
> the data and take one of a few strategies to supply it to the different 
> consumers (e.g. views, stored procedures, etc.).
> 
>> - Data too large to fit in memory? Well, most uses today may have been 
>> too large to fit in memory 20 years ago, but aren't today. And even 
>> for really big data sets today, networks are much faster than disk 
>> drives, so a distributed database (e.g., a DHT) will be faster.   
>> Sanity check: Do you think Google uses an RDB for storing indexes and 
>> a cache of the WWW?
> 
> Are you serious with this (data too large to fit into memory)?  And if 
> you use a good RDBMS then you don't have to worry about disk speed or 
> distribution.  The DBA's can watch how the database is being used and 
> tune this (i.e. partition the data and move it to another CPU, etc., etc.).
> 
> Oh, but you found one example where someone with a lot of data didn't 
> use a RDB.  I guess we can throw the whole technology sector in the 
> trash.  Sanity check:  google is trying to keep a current snapshot of 
> all websites and run it on commodity hardware.  You could do exactly the 
> same thing with a lot less CPU's using a highly tuned, distributed 
> RDBMS.  They chose to hand tune code instead of an RDBMS.
> 
>> - Transactional processing with rollback, three-phase commit, etc? 
>> Again, these don't appear to actually be used by the application 
>> servers that get connected to the databases today. And if they were, 
>> would this be a property of relational databases per se?
> 
> What data point are you using?  Sure little blogs and things like that 
> probably don't use it, and that probably is the majority of database 
> users.  But how much wealth (i.e. money and jobs) are being generated by 
> those compared to larger companies.
> 
> All the applications I write at work absolutely require such 
> functionality and I have no intention of writing it myself.
> 
>> Finally, in world with great distributed computing power, is 
>> centralized transaction processing really a superior model?
> 
> Some people seem to think so:
> http://lambda-the-ultimate.org/node/463
> 
> And there is more than that.  I believe in that paper (don't have time to 
> verify) they mention that hardware manufacturers are also starting to 
> take this approach as well because fine grain locking is so bad.
> 
>> - Set processing? I'm not sure what you mean by set data, JJ. I've 
>> seen set theory taught in a procedural style, a functional style, and 
>> in an object oriented style, but outside of ERP system training 
>> classes, I've never seen it taught in a relational style. I'm not even 
>> sure what that means. (Tables with other than one key, ...) That's not 
>> a proof that relational is worse, but it does suggest to me that the 
>> premise is worth questioning.
> 
> I thought this was the common way of expressing the data operations one 
> does in an RDBMS.  To give an example of the power; not too long ago I 
> had to write a report about the state of various systems on the network 
> in relation to the applications that run on them.  My first approach was 
> simply read the data into objects and extract the data via coding.  But 
> after the requirements for the reports changed a couple of times I got 
> sick of hand writing joins, unions, etc. etc. and just downloaded a 
> database.  It took about 5 minutes to set up the scheme and import all 
> the data.  After that I could quickly generate any report the requesters 
> could dream up.  Since SQL is effectively a DSL over relational data, my 
> code changed from many statements to 1 per report.
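
As an illustration of the "one statement per report" point quoted above, 
here is a minimal sketch using Python's standard sqlite3 module. The 
tables, columns and rows are invented for the example and are not the 
schema from the report described; the report() helper is likewise 
hypothetical.

```python
import sqlite3

# An in-memory database with a made-up two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE systems (id INTEGER PRIMARY KEY, host TEXT, status TEXT);
    CREATE TABLE applications (name TEXT,
                               system_id INTEGER REFERENCES systems(id));
""")
conn.executemany(
    "INSERT INTO systems VALUES (?, ?, ?)",
    [(1, "alpha", "up"), (2, "beta", "down"), (3, "gamma", "up")],
)
conn.executemany(
    "INSERT INTO applications VALUES (?, ?)",
    [("payroll", 1), ("ledger", 2), ("reports", 2), ("mail", 3)],
)


def report(sql):
    """Run one SQL statement and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Report 1: which applications run on hosts that are down?
apps_on_down_hosts = report("""
    SELECT a.name, s.host
    FROM applications a JOIN systems s ON s.id = a.system_id
    WHERE s.status = 'down'
    ORDER BY a.name
""")

# Report 2: how many applications does each host carry?
apps_per_host = report("""
    SELECT s.host, COUNT(a.name)
    FROM systems s LEFT JOIN applications a ON a.system_id = s.id
    GROUP BY s.host
    ORDER BY s.host
""")
```

When the requirements change, only the SQL string changes; the join and 
grouping logic does not have to be rewritten by hand.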
> 
>> - Working with other applications that are designed to use RDB's? 
>> Maybe, but that's a tautology, no?
> 
> Again, one has to work in a large company to appreciate the nature of 
> enterprise application development.
> 
>> I'm under the impression (could be wrong) that RDBMS were created to 
>> solve a particular problem that may or may not have been true at the 
>> time, but which is no longer the situation today. And what are called 
>> RDBMS no longer actually conform to the original problem/solution 
>> space anyway.
> 
> I don't know what the first RDBMS was created for, but what they are 
> today and have been for the span of my career is certainly not a 
> solution to a problem no one has.
> 
> The fact is, there are two basic kinds of databases: Relational and 
> Hierarchical (LDAP, OODB).  Each is good at dealing with certain kinds 
> of data and bad at others.
> 

-- 
Howard Stearns
University of Wisconsin - Madison
Division of Information Technology
mailto:hstearns at wisc.edu
jabber:hstearns at wiscchat.wisc.edu
voice:+1-608-262-3724
