[Newbies] Pre-Getting started info: Unicode, utf8, large memory need
Charles Hixson
charleshixsn at earthlink.net
Thu Apr 29 18:26:41 UTC 2010
On 04/28/2010 09:31 PM, Herbert König wrote:
> Hi Charles,
>
> seems you are on top of things. So just a few remarks. My experience
> is from Squeak 3.8 so you should check if what I say holds true for
> current Squeak.
>
> Check out the UTF8 speed. I combine tab delimited files from disparate
> sources into more complex objects and write out new files. First thing
> was to change to non UTF8 for speed reasons. Seems you can't do this.
>
I'm not worried about speed for this first part, and for the follow-up
I'm more worried about computational speed than utf8 reading speed. If
I can't depend on virtual memory and automatic roll-in/out (nobody seems
to offer that!) then it means LOTS of database interaction. Which is
where I get worried about Magma...as apparently it holds a partial
reference to everything in RAM.
> CH> I looked at Magma, and couldn't figure out whether it would be useful or
> CH> not. I've no idea just how fast it is, how capacious it is, or how much
>
> Chris Muller is on Squeak dev and I'm sure he will be able to tell you
> if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses
> Magma in a commercial project (last time I looked).
>
> CH> ahead of time. And I want locally separate files, so I guess I'd
> CH> probably use sqlite or Firebird. With Sqlite I might need to have
> CH> multiple databases to handle the final system, so it would probably be
> CH> best to partition things early. (Either that or build some sort of
> CH> hierarchical storage system that rolled things from database to database
> CH> depending on how recently it was accessed.)
>
> SqueakDbx (or openDbx in other languages) might be of interest. I use
> mysql from Squeak in a commercial setting, no problems.
>
That is of interest, but MySQL is in the same boat as PostgreSQL in
having a system-level database rather than separate database files. This
makes many of the uses that I intend problematical...and difficult at
best. Both Firebird and Sqlite, however, allow specified db files.
Sqlite is more common, so that's probably what I'll choose, even though
Firebird has a reputation for being more efficient. (However I think
both are supported by openDbx, so probably also by SqueakDbx.)
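To make the "specified db files" idea concrete, here is a minimal sketch in Python's standard sqlite3 module (not Squeak, and not SqueakDbx, whose API I haven't checked). It stores a list of lists of strings in one SQLite file and rolls selected lists into a second file via ATTACH. The table name `items`, the schema alias `cold`, and the helper names are all made up for illustration.

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS items (
    list_id INTEGER NOT NULL,   -- which list the string belongs to
    pos     INTEGER NOT NULL,   -- position of the string inside that list
    text    TEXT    NOT NULL,
    PRIMARY KEY (list_id, pos)
)"""

def open_partition(path):
    """Open (or create) one database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def save_lists(conn, lists):
    """Store a list of lists of strings. Empty inner lists leave no rows."""
    rows = [(i, j, s) for i, inner in enumerate(lists) for j, s in enumerate(inner)]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?, ?)", rows)

def load_lists(conn):
    """Return {list_id: [strings in position order]}."""
    out = {}
    query = "SELECT list_id, pos, text FROM items ORDER BY list_id, pos"
    for list_id, _pos, text in conn.execute(query):
        out.setdefault(list_id, []).append(text)
    return out

def move_lists(conn, cold_path, list_ids):
    """Move the given lists from this file into another database file."""
    open_partition(cold_path).close()  # make sure the target table exists
    conn.execute("ATTACH DATABASE ? AS cold", (cold_path,))
    try:
        with conn:  # copy and delete happen in one transaction
            for lid in list_ids:
                conn.execute(
                    "INSERT OR REPLACE INTO cold.items "
                    "SELECT list_id, pos, text FROM main.items WHERE list_id = ?",
                    (lid,),
                )
                conn.execute("DELETE FROM main.items WHERE list_id = ?", (lid,))
    finally:
        conn.execute("DETACH DATABASE cold")
```

Which lists count as "not recently accessed" is left out on purpose; that policy would have to be tracked separately.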
> CH> I'm guessing that FileStream would handle file BOM markers gracefully.
> CH> (Most of my files are utf8 with BOM markers at the head.) This isn't
>
> Just try it to be sure..
>
Yeah, that will be a part of the first test.
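For comparison while testing, the expected result can be produced outside Squeak. This is a small Python sketch, assuming the files really are UTF-8 with an optional leading BOM; it says nothing about what Squeak's FileStream does, which still has to be tried. The function names are invented for the example.

```python
import codecs

def has_utf8_bom(path):
    """True if the file starts with the three-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

def read_utf8(path):
    """Read the whole file as text, dropping a leading BOM if there is one.

    The "utf-8-sig" codec accepts files both with and without a BOM.
    newline="" keeps line endings exactly as they are in the file.
    """
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()
```

If the text read in Squeak begins with the character U+FEFF where this version does not, the BOM was passed through rather than stripped.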
> CH> (I wouldn't need any fancy mapper. If I weren't dealing with LOTS of
> CH> variable length arrays of variable length strings, I could just fit the
> CH> data into a simple C struct without any pointers whatsoever. So all I
> CH> need is to be able to save a list of lists of chars, plus a few integers
> CH> that would all fit comfortably into 32 bits. [Many of them would fit
> CH> into 8 bits.])
>
> CouchDB has caught my attention for inhomogeneous data, scalability,
> replication. But then I consider javascript a nice functional language
> and I like JSON (available in Squeak). At least look at map reduce
> algorithm for being able to utilize multi-core or multiple boxes.
> Whatever language you choose.
>
Multiple boxes isn't particularly interesting, but I'm expecting the
number of cores/box to ramp up quickly over the next decade...and that
*is* interesting.
> CH> later, and D doesn't have much in the way of concurrency handling. I'm
> CH> not sure that Hydra counts...though it sounds like I need to look into
> CH> it. The question would be how do programs running on separate virtual
> CH> machines communicate with each other.
>
> Two different issues, Hydra addresses one single machine and does not
> support current Squeak (recent discussion on Squeak dev). The other
> issue is communicating via network. This is where you'll end up.
>
I don't expect to end up "communicating via network", except, perhaps,
via localhost. But I do expect to end up running several processes,
probably on different cores. This causes many, but not all, of the same
problems. (Current support is less important, as this is something a
bit off in the future. But it needs to be planned for now, before I
start writing the code.) Guess I'll see if I can find that "Squeak dev"
discussion. Perhaps Dbus is the correct answer...I've only skimmed over
its specs, but it looks plausible. (Getting info back from separate
processes seems a major problem with most of the approaches. It may
well turn out that TCP over UnixSockets is the best approach
available..though I *would* like something better.)
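As a rough picture of the "getting info back from separate processes" part, here is a minimal Python sketch (Unix only, not Squeak) of a parent handing one request to a child process and reading the reply over a Unix-domain socket pair. Strictly speaking that is a byte stream over a Unix socket rather than TCP, so the sketch adds its own length prefix to mark message boundaries. The framing format and all function names are assumptions made for the example.

```python
import os
import socket
import struct

def send_msg(sock, payload):
    """Send one message: 4-byte big-endian length, then the bytes."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may deliver them in pieces."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket early")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock):
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def run_in_child(work, request):
    """Fork a child, send it `request`, and return work(request) from it."""
    parent_sock, child_sock = socket.socketpair()  # AF_UNIX stream pair on Unix
    pid = os.fork()
    if pid == 0:
        # Child process: answer exactly one request, then exit immediately.
        parent_sock.close()
        try:
            send_msg(child_sock, work(recv_msg(child_sock)))
        finally:
            child_sock.close()
            os._exit(0)
    # Parent process: send the request and wait for the reply.
    child_sock.close()
    try:
        send_msg(parent_sock, request)
        return recv_msg(parent_sock)
    finally:
        parent_sock.close()
        os.waitpid(pid, 0)  # reap the child
```

For example, `run_in_child(lambda b: b.upper(), b"hello")` does the uppercasing in the child and hands the result back to the parent. Separate long-running VMs would need a named socket (or localhost TCP) instead of a forked pair, but the framing problem is the same.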