Yes, I'm quite serious. I'm asking what kinds of problems RDBMS are uniquely best at solving (or at least no worse). I'm not asking whether they CAN be used for this problem or that. I'm asking this from an engineering/mathematics perspective, not a business ("we've always done things this way" or "we like this vendor") perspective.
I'm new to the Enterprise Software world, having been mostly in either industrial or "hard problem" software. But the 3-tier application architecture we use for financial processing at our 26 state campuses (University of Wisconsin) appears to me to be typical: large numbers of individual browsers (not communicating with each other) interact through a Web server farm with the Application Servers. The overall application as implemented is too large to handle the load as a single unit, so it is divided by functional area into a farm of individual applications that do not talk directly to each other. This partitioning isn't very successful, because the users tend to do the same functional activities at the same times of day, so most of the applications sit idle while a few are at their limit.

I assumed that a single database was used so that the RDBMS could ensure data consistency between all these different applications. But it turns out that the Oracle database can't handle that, so instead each functional area gets its own database. Most of the work done by the system (and most of the work of programmers like me) is to COPY data from one table to another at night when the system is otherwise quiet. [There is this Byzantine dance in which data is copied from one ledger to the next, with various checks against yet another set of ledgers. The whole thing is kept in sync by offsetting entries ("double entry") that are reconciled once a month or once a year when the system is shut down. Amazing.]

The whole thing is kludged so that nothing ends up handling more than a few gigs of records at a time. [Naively, it seems like the obvious solution for this (mathematically) is a hashing operation to keep the data evenly distributed over in-memory systems on a LAN, plus an in-memory cache of recently used chunks. But let's assume I'm missing something. The task here is to figure out what I'm not seeing.]
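To make the bracketed aside concrete, the "hash to distribute over in-memory systems" idea can be sketched in a few lines of Python. The node names and record keys here are invented for illustration, and a production system would use consistent hashing so that adding a node only remaps about 1/N of the keys; this is just the core mapping.

```python
import hashlib

# Toy sketch of the partitioning idea: hash each record key to pick one of
# N in-memory nodes on the LAN. Node names are invented for illustration.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key: str) -> str:
    """Map a record key to the node that stores it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every client computes the same mapping independently -- no central
# lookup service, so reads and writes go straight to the owning node.
ledger_entries = ["acct:1001", "acct:1002", "acct:2001", "acct:3005"]
placement = {k: node_for(k) for k in ledger_entries}
```

Because the mapping is deterministic, no coordinator is needed at lookup time, which is the property that makes this attractive for evenly spreading a working set across boxes.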
Maybe this isn't typical, but it is the architecture that Oracle and its PeopleSoft division pushes on us in their extensive training classes. And it appears to be the architecture discussed in the higher education IT conferences and Web sites in the U.S.
My experience with non-Enterprise Web/application software is also limited, but installations I've encountered since -- when did Phil and Alex's Excellent Web site come out? -- appear to also use partitioning to keep the working sets down to a few gigs.
My friends at Ab Initio won't tell me what they do or how they do it, but no one's claiming they use a RDBMS as Codd described it.
Anyway, either the data AS USED fits into memory or it doesn't. If it does, then what benefit is the relational math providing? If it doesn't, then we have to ask whether the math techniques that were developed to provide efficient random access over disks 20 years ago are still valid. Is this still the fastest way? (Answer is no.) Is there some circumstance in which it is the fastest? Or the safest? Or that allows us to do something we could not do otherwise?
Having tools that allow a cult of specialists to break your own computing model (the relational calculus) is not a feature, but a signal that something is wrong.
I tried briefly to combine JJ's answer with Peter's to find an appropriate niche. (Again, I'm trying to look at the math, not fit and finish, availability of experienced programmers, color of brochure...) For example, there could be a class of problems for which the data set is a few tens of gigs and needs to be operated on as a whole. And that queries are fairly arbitrary and exploratory, not production-oriented. Etc. But I haven't been able to come up with one that doesn't have better characteristics as a distributed system. Maybe if we define the problem as "and you only have one commodity box to do it on." That's fair. Maybe that's it? (Then we need to find an "enterprise" with only one box...)
From: Howard Stearns <hstearns@wisc.edu>
Reply-To: The general-purpose Squeak developers list <squeak-dev@lists.squeakfoundation.org>
To: The general-purpose Squeak developers list <squeak-dev@lists.squeakfoundation.org>
Subject: relational for what? [was: Design Principles Behind Smalltalk, Revisited]
Date: Tue, 02 Jan 2007 08:18:24 -0600
J J wrote:
... I simply believe in the right tool for the right job,
and you can't beat an RDB in its domain. ...
That's something I've never really understood: what is the domain in which Relational Databases excel?
Handling large amounts of enterprise data. If you have never worked in a large company, you probably won't appreciate this. But in a large company you have a *lot* of data, and different applications want to see different parts of it. In an RDBMS this is no problem: you normalize the data and take one of a few strategies to supply it to the different consumers (e.g. views, stored procedures, etc.).
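The "different consumers see different slices" point can be made concrete with a tiny sketch. This uses Python's sqlite3 standing in for the enterprise RDBMS, and the employees/directory schema is invented purely for illustration:

```python
import sqlite3

# Sketch: normalize the data once, then give each consuming application
# its own view of it. Schema and rows are invented for the illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ann', 'payroll', 60000),
        (2, 'Bob', 'registrar', 52000);

    -- The directory application sees names and departments, never salaries.
    CREATE VIEW directory AS SELECT name, dept FROM employees;
""")
rows = db.execute("SELECT * FROM directory ORDER BY name").fetchall()
```

The payroll application queries `employees` directly; the directory application only ever sees the view, so the sensitive column is simply absent from its world.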
- Data too large to fit in memory? Well, most working sets that were too large to fit in memory 20 years ago aren't today. And even for really big data sets today, networks are much faster than disk drives, so a distributed database (e.g., a DHT) will be faster. Sanity check: Do you think Google uses an RDB for storing indexes and a cache of the WWW?
Are you serious with this (data too large to fit into memory)? And if you use a good RDBMS then you don't have to worry about disk speed or distribution. The DBAs can watch how the database is being used and tune this (i.e. partition the data and move it to another CPU, etc., etc.).
Oh, but you found one example where someone with a lot of data didn't use an RDB. I guess we can throw the whole technology sector in the trash. Sanity check: Google is trying to keep a current snapshot of all websites and run it on commodity hardware. You could do exactly the same thing with far fewer CPUs using a highly tuned, distributed RDBMS. They chose to hand-tune code instead of an RDBMS.
- Transactional processing with rollback, three-phase commit, etc?
Again, these don't appear to actually be used by the application servers that get connected to the databases today. And if they were, would this be a property of relational databases per se?
What data point are you using? Sure, little blogs and things like that probably don't use it, and they probably are the majority of database users. But how much wealth (i.e. money and jobs) is being generated by those compared to larger companies?
All the applications I write at work absolutely require such functionality and I have no intention of writing it myself.
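For readers who haven't leaned on this functionality: the all-or-nothing behavior being discussed can be sketched with any SQL engine. Here is a minimal Python/sqlite3 illustration; the accounts table, balances, and the overdraft rule are all invented for the example, not taken from any system mentioned in the thread.

```python
import sqlite3

# Minimal sketch of transactional all-or-nothing updates with rollback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 100), ('savings', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; roll back both writes on any failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # Enforce an invariant: no overdrafts.
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the partial debit was rolled back automatically

transfer(conn, "checking", "savings", 60)   # succeeds
transfer(conn, "checking", "savings", 100)  # fails; both balances untouched
```

The point is that the application never sees the half-finished state: the failed second transfer leaves the books exactly as the first one left them, which is the property you'd otherwise have to build by hand.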
Finally, in a world with great distributed computing power, is centralized transaction processing really a superior model?
Some people seem to think so: http://lambda-the-ultimate.org/node/463
And there is more than that. I believe that paper (don't have time to verify) mentions that hardware manufacturers are starting to take this approach as well, because fine-grained locking is so bad.
- Set processing? I'm not sure what you mean by set data, JJ. I've seen set theory taught in a procedural style, a functional style, and in an object-oriented style, but outside of ERP system training classes, I've never seen it taught in a relational style. I'm not even sure what that means. (Tables with other than one key, ...) That's not a proof that relational is worse, but it does suggest to me that the premise is worth questioning.
I thought this was the common way of expressing the data operations one does in an RDBMS. To give an example of the power: not too long ago I had to write a report about the state of various systems on the network in relation to the applications that run on them. My first approach was simply to read the data into objects and extract the data via coding. But after the requirements for the reports changed a couple of times, I got sick of hand-writing joins, unions, etc. and just downloaded a database. It took about 5 minutes to set up the schema and import all the data. After that I could quickly generate any report the requesters could dream up. Since SQL is effectively a DSL over relational data, my code changed from many statements to one per report.
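JJ's anecdote can be reproduced in miniature. The hosts/apps schema, names, and the report question below are invented stand-ins for the network inventory he describes, again using Python's sqlite3 as the "downloaded a database" step:

```python
import sqlite3

# Miniature version of the report scenario: load the raw facts into a
# relational schema once, then each new report is a single query instead
# of hand-written join code. Tables and rows are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hosts (name TEXT PRIMARY KEY, os TEXT);
    CREATE TABLE apps  (name TEXT, host TEXT REFERENCES hosts(name));
    INSERT INTO hosts VALUES ('web1', 'linux'), ('web2', 'linux'), ('db1', 'solaris');
    INSERT INTO apps  VALUES ('portal', 'web1'), ('portal', 'web2'), ('ledger', 'db1');
""")

# "Which applications run on which OS?" -- one statement per report.
report = db.execute("""
    SELECT apps.name, hosts.os
    FROM apps JOIN hosts ON apps.host = hosts.name
    ORDER BY apps.name, hosts.os
""").fetchall()
```

When the requesters change their minds, only the SELECT changes; the hand-coded object traversal that the query replaces would have to be rewritten each time.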
- Working with other applications that are designed to use RDB's?
Maybe, but that's a tautology, no?
Again, one has to work in a large company to appreciate the nature of enterprise application development.
I'm under the impression (could be wrong) that RDBMS were created to solve a particular problem that may or may not have been true at the time, but which is no longer the situation today. And what are called RDBMS no longer actually conform to the original problem/solution space anyway.
I don't know what the first RDBMS was created for, but what they are today and have been for the span of my career is certainly not a solution to a problem no one has.
The fact is, there are two basic kinds of databases: Relational and Hierarchical (LDAP, OODB). Each is good at dealing with certain kinds of data and bad at others.