Yes, I'm quite serious. I'm asking what kinds of problems RDBMS are uniquely best at solving (or at least no worse). I'm not asking whether they CAN be used for this problem or that. I'm asking this from an engineering/mathematics perspective, not a business ("we've always done things this way" or "we like this vendor") perspective.
I'm new to the Enterprise Software world, having been mostly in either industrial or "hard problem" software. But the 3-tier application architecture we use for financial processing at our 26 state campuses (University of Wisconsin) appears to me to be typical: large numbers of individual browsers (not communicating with each other) interact through a Web server farm with the Application Servers. The overall application as implemented is too large to handle the load as a single unit, so it is divided by functional area into a farm of individual applications that do not talk directly to each other. This partitioning isn't very successful, because the users tend to do the same functional activities at the same times of day, so most of the applications sit idle while a few are at their limit.

I assumed that a single database was used so that the RDBMS could ensure data consistency between all these different applications. But it turns out that the Oracle database can't handle that, so instead each functional area gets its own database. Most of the work done by the system (and most of the work of programmers like me) is to COPY data from one table to another at night when the system is otherwise quiet. [There is this Byzantine dance in which data is copied from one ledger to the next, with various checks against yet another set of ledgers. The whole thing is kept in sync by offsetting entries ("double entry") that are reconciled once a month or once a year when the system is shut down. Amazing.]

The whole thing is kludged so that nothing ends up handling more than a few gigs of records at a time. [Naively, it seems like the obvious solution for this (mathematically) is a hashing operation to keep the data evenly distributed over in-memory systems on a LAN, plus an in-memory cache of recently used chunks. But let's assume I'm missing something. The task here is to figure out what I'm not seeing.]
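To make the bracketed aside concrete, the "hash to distribute over in-memory systems" idea can be sketched in a few lines of Python. The node names and record keys here are invented for illustration, and a production system would use consistent hashing so that adding a node only remaps about 1/N of the keys; this is just the core mapping.

```python
import hashlib

# Toy sketch of the partitioning idea: hash each record key to pick one of
# N in-memory nodes on the LAN. Node names are invented for illustration.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key: str) -> str:
    """Map a record key to the node that stores it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every client computes the same mapping independently -- no central
# lookup service, so reads and writes go straight to the owning node.
ledger_entries = ["acct:1001", "acct:1002", "acct:2001", "acct:3005"]
placement = {k: node_for(k) for k in ledger_entries}
```

Because the mapping is deterministic, no coordinator is needed at lookup time, which is the property that makes this attractive for evenly spreading a working set across boxes.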
Maybe this isn't typical, but it is the architecture that Oracle and its PeopleSoft division pushes on us in their extensive training classes. And it appears to be the architecture discussed in the higher education IT conferences and Web sites in the U.S.
My experience with non-Enterprise Web/application software is also limited, but installations I've encountered since -- when did Phil and Alex's Excellent Web site come out? -- appear to also use partitioning to keep the working sets down to a few gigs.
My friends at Ab Initio won't tell me what they do or how they do it, but no one's claiming they use a RDBMS as Codd described it.
Anyway, either the data AS USED fits into memory or it doesn't. If it does, then what benefit is the relational math providing? If it doesn't, then we have to ask whether the math techniques that were developed to provide efficient random access over disks 20 years ago are still valid. Is this still the fastest way? (Answer is no.) Is there some circumstance in which it is the fastest? Or the safest? Or that allows us to do something we could not do otherwise?
Having tools that allow a cult of specialists to break your own computing model (the relational calculus) is not a feature, but a signal that something is wrong.
I tried briefly to combine JJ's answer with Peter's to find an appropriate niche. (Again, I'm trying to look at the math, not fit and finish, availability of experienced programmers, color of brochure...) For example, there could be a class of problems for which the data set is a few tens of gigs and needs to be operated on as a whole. And that queries are fairly arbitrary and exploratory, not production-oriented. Etc. But I haven't been able to come up with one that doesn't have better characteristics as a distributed system. Maybe if we define the problem as "and you only have one commodity box to do it on." That's fair. Maybe that's it? (Then we need to find an "enterprise" with only one box...)
From: Howard Stearns <hstearns@wisc.edu>
Reply-To: The general-purpose Squeak developers list <squeak-dev@lists.squeakfoundation.org>
To: The general-purpose Squeak developers list <squeak-dev@lists.squeakfoundation.org>
Subject: relational for what? [was: Design Principles Behind Smalltalk, Revisited]
Date: Tue, 02 Jan 2007 08:18:24 -0600
J J wrote:
... I simply believe in the right tool for the right job,
and you can't beat an RDB in its domain. ...
That's something I've never really understood: what is the domain in which Relational Databases excel?
Handling large amounts of enterprise data. If you have never worked in a large company, you probably won't appreciate this. But in a large company you have a *lot* of data, and different applications want to see different parts of it. In an RDBMS this is no problem: you normalize the data and take one of a few strategies to supply it to the different consumers (e.g. views, stored procedures, etc.).
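The "different consumers see different slices" point can be made concrete with a tiny sketch. This uses Python's sqlite3 standing in for the enterprise RDBMS, and the employees/directory schema is invented purely for illustration:

```python
import sqlite3

# Sketch: normalize the data once, then give each consuming application
# its own view of it. Schema and rows are invented for the illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ann', 'payroll', 60000),
        (2, 'Bob', 'registrar', 52000);

    -- The directory application sees names and departments, never salaries.
    CREATE VIEW directory AS SELECT name, dept FROM employees;
""")
rows = db.execute("SELECT * FROM directory ORDER BY name").fetchall()
```

The payroll application queries `employees` directly; the directory application only ever sees the view, so the sensitive column is simply absent from its world.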
- Data too large to fit in memory? Well, most working sets that were too large to fit in memory 20 years ago aren't today. And even for really big data sets today, networks are much faster than disk drives, so a distributed database (e.g., a DHT) will be faster. Sanity check: Do you think Google uses an RDB for storing indexes and a cache of the WWW?
Are you serious with this (data too large to fit into memory)? And if you use a good RDBMS then you don't have to worry about disk speed or distribution. The DBAs can watch how the database is being used and tune this (i.e. partition the data and move it to another CPU, etc., etc.).
Oh, but you found one example where someone with a lot of data didn't use an RDB. I guess we can throw the whole technology sector in the trash. Sanity check: Google is trying to keep a current snapshot of all websites and run it on commodity hardware. You could do exactly the same thing with far fewer CPUs using a highly tuned, distributed RDBMS. They chose to hand-tune code instead of an RDBMS.
- Transactional processing with rollback, three-phase commit, etc?
Again, these don't appear to actually be used by the application servers that get connected to the databases today. And if they were, would this be a property of relational databases per se?
What data point are you using? Sure, little blogs and things like that probably don't use it, and they probably are the majority of database users. But how much wealth (i.e. money and jobs) is being generated by those compared to larger companies?
All the applications I write at work absolutely require such functionality and I have no intention of writing it myself.
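For readers who haven't leaned on this functionality: the all-or-nothing behavior being discussed can be sketched with any SQL engine. Here is a minimal Python/sqlite3 illustration; the accounts table, balances, and the overdraft rule are all invented for the example, not taken from any system mentioned in the thread.

```python
import sqlite3

# Minimal sketch of transactional all-or-nothing updates with rollback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('checking', 100), ('savings', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; roll back both writes on any failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # Enforce an invariant: no overdrafts.
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the partial debit was rolled back automatically

transfer(conn, "checking", "savings", 60)   # succeeds
transfer(conn, "checking", "savings", 100)  # fails; both balances untouched
```

The point is that the application never sees the half-finished state: the failed second transfer leaves the books exactly as the first one left them, which is the property you'd otherwise have to build by hand.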
Finally, in a world with great distributed computing power, is centralized transaction processing really a superior model?
Some people seem to think so: http://lambda-the-ultimate.org/node/463
And there is more than that. I believe that paper (don't have time to verify) mentions that hardware manufacturers are starting to take this approach as well, because fine-grained locking is so bad.
- Set processing? I'm not sure what you mean by set data, JJ. I've seen set theory taught in a procedural style, a functional style, and in an object-oriented style, but outside of ERP system training classes, I've never seen it taught in a relational style. I'm not even sure what that means. (Tables with other than one key, ...) That's not a proof that relational is worse, but it does suggest to me that the premise is worth questioning.
I thought this was the common way of expressing the data operations one does in an RDBMS. To give an example of the power: not too long ago I had to write a report about the state of various systems on the network in relation to the applications that run on them. My first approach was simply to read the data into objects and extract the data via coding. But after the requirements for the reports changed a couple of times, I got sick of hand-writing joins, unions, etc. and just downloaded a database. It took about 5 minutes to set up the schema and import all the data. After that I could quickly generate any report the requesters could dream up. Since SQL is effectively a DSL over relational data, my code changed from many statements to one per report.
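JJ's anecdote can be reproduced in miniature. The hosts/apps schema, names, and the report question below are invented stand-ins for the network inventory he describes, again using Python's sqlite3 as the "downloaded a database" step:

```python
import sqlite3

# Miniature version of the report scenario: load the raw facts into a
# relational schema once, then each new report is a single query instead
# of hand-written join code. Tables and rows are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hosts (name TEXT PRIMARY KEY, os TEXT);
    CREATE TABLE apps  (name TEXT, host TEXT REFERENCES hosts(name));
    INSERT INTO hosts VALUES ('web1', 'linux'), ('web2', 'linux'), ('db1', 'solaris');
    INSERT INTO apps  VALUES ('portal', 'web1'), ('portal', 'web2'), ('ledger', 'db1');
""")

# "Which applications run on which OS?" -- one statement per report.
report = db.execute("""
    SELECT apps.name, hosts.os
    FROM apps JOIN hosts ON apps.host = hosts.name
    ORDER BY apps.name, hosts.os
""").fetchall()
```

When the requesters change their minds, only the SELECT changes; the hand-coded object traversal that the query replaces would have to be rewritten each time.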
- Working with other applications that are designed to use RDB's?
Maybe, but that's a tautology, no?
Again, one has to work in a large company to appreciate the nature of enterprise application development.
I'm under the impression (could be wrong) that RDBMS were created to solve a particular problem that may or may not have been true at the time, but which is no longer the situation today. And what are called RDBMS no longer actually conform to the original problem/solution space anyway.
I don't know what the first RDBMS was created for, but what they are today and have been for the span of my career is certainly not a solution to a problem no one has.
The fact is, there are two basic kinds of databases: Relational and Hierarchical (LDAP, OODB). Each is good at dealing with certain kinds of data and bad at others.