[Newbies] Pre-Getting started info: Unicode, utf8, large memory need

Charles Hixson charleshixsn at earthlink.net
Thu Apr 29 18:26:41 UTC 2010


On 04/28/2010 09:31 PM, Herbert König wrote:
> Hi Charles,
>
> seems you are on top of things. So just a few remarks. My experience
> is from Squeak 3.8 so you should check if what I say holds true for
> current Squeak.
>
> Check out the UTF8 speed. I combine tab delimited files from disparate
> sources into more complex objects and write out new files. First thing
> was to change to non UTF8 for speed reasons. Seems you can't do this.
>    
I'm not worried about speed for this first part, and for the follow-up 
I'm more worried about computational speed than UTF-8 reading speed.  If 
I can't depend on virtual memory and automatic roll-in/out (nobody seems 
to offer that!) then it means LOTS of database interaction.  That is 
where I get worried about Magma...as it apparently holds a partial 
reference to everything in RAM.
> CH>  I looked at Magma, and couldn't figure out whether it would be useful or
> CH>  not.  I've no idea just how fast it is, how capacious it is, or how much
>
> Chris Muller is on Squeak dev and I'm sure he will be able to tell you
> if you would hit the limits of Magma. Gjallar (www.Gjallar.se) uses
> Magma in a commercial project (last time I looked).
>
> CH>  ahead of time.  And I want locally separate files, so I guess I'd
> CH>  probably use sqlite or Firebird.  With Sqlite I might need to have
> CH>  multiple databases to handle the final system, so it would probably be
> CH>  best to partition things early.  (Either that or build some sort of
> CH>  hierarchical storage system that rolled things from database to database
> CH>  depending on how recently it was accessed.)
>
> SqueakDbx or (openDbx in other languages) might be of interest. I use
> mysql from Squeak in a commercial setting, no problems.
>    
That is of interest, but MySQL is in the same boat as PostgreSQL in 
having a system-level database rather than separate database files.  This 
makes many of the uses that I intend problematic...difficult at 
best.  Both Firebird and SQLite, however, allow specified db files.  
SQLite is more common, so that's probably what I'll choose, even though 
Firebird has a reputation for being more efficient.  (However, I think 
both are supported by openDbx, so probably also by SqueakDbx.)
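
A minimal sketch of the "separate database files" idea, in Python rather than Squeak, using the standard sqlite3 module.  The file names (hot.db, cold.db), the words table and the open_partitioned helper are all made up for illustration; the only SQLite feature relied on is ATTACH DATABASE, which lets one connection query several database files at once.

```python
import os
import sqlite3
import tempfile


def open_partitioned(main_path, extra_paths):
    """Open one SQLite file and attach the others as part1, part2, ...

    Hypothetical helper: the alias names are an assumption of this sketch.
    """
    conn = sqlite3.connect(main_path)
    for i, path in enumerate(extra_paths, start=1):
        # ATTACH makes another database file visible on the same connection.
        conn.execute(f"ATTACH DATABASE ? AS part{i}", (path,))
    return conn


def demo():
    workdir = tempfile.mkdtemp()
    hot = os.path.join(workdir, "hot.db")    # recently used data
    cold = os.path.join(workdir, "cold.db")  # rarely used data
    conn = open_partitioned(hot, [cold])
    conn.execute("CREATE TABLE main.words (id INTEGER PRIMARY KEY, text TEXT)")
    conn.execute("CREATE TABLE part1.words (id INTEGER PRIMARY KEY, text TEXT)")
    conn.execute("INSERT INTO main.words (text) VALUES ('recent')")
    conn.execute("INSERT INTO part1.words (text) VALUES ('archived')")
    conn.commit()
    # One query can span both files.
    rows = conn.execute(
        "SELECT text FROM main.words UNION ALL SELECT text FROM part1.words"
    ).fetchall()
    conn.close()
    return sorted(r[0] for r in rows), os.path.exists(hot), os.path.exists(cold)
```

Rolling rows from one file to another by access time would sit on top of this; that part is not sketched here.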
> CH>  I'm guessing that FileStream would handle file BOM markers gracefully.
> CH>  (Most of my files are utf8 with BOM markers at the head.)  This isn't
>
> Just try it to be sure..
>    
Yeah, that will be a part of the first test.
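
For comparison, here is roughly what such a first test could look like in Python (not Squeak; FileStream may well behave differently).  Python's "utf-8-sig" codec drops a leading BOM if one is present, while the plain "utf-8" codec leaves it in the text as U+FEFF.  The file name and the sample line are invented for the sketch.

```python
import codecs
import os
import tempfile


def read_utf8_text(path):
    """Read a UTF-8 file, dropping a leading BOM if there is one."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def demo():
    # Write a small tab-delimited file that starts with a UTF-8 BOM.
    path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "König\tHerbert\n".encode("utf-8"))
    # A plain UTF-8 read keeps the BOM as the first character.
    with open(path, "r", encoding="utf-8") as f:
        naive = f.read()
    return naive, read_utf8_text(path)
```

The same kind of check (read the file back and look at the first character) should carry over to whatever stream class ends up being used.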
> CH>  (I wouldn't need any fancy mapper.  If I weren't dealing with LOTS of
> CH>  variable length arrays of variable length strings, I could just fit the
> CH>  data into a simple C struct without any pointers whatsoever.  So all I
> CH>  need is to be able to save a list of lists of chars, plus a few integers
> CH>  that would all fit comfortably into 32 bits.  [Many of them would fit
> CH>  into 8 bits.])
>
> CouchDB has caught my attention for inhomogeneous data, scalability,
> replication. But then I consider javascript a nice functional language
> and I like JSON (available in Squeak). At least look at map reduce
> algorithm for being able to utilize multi-core or multiple boxes.
> Whatever language you choose.
>    
Multiple boxes isn't particularly interesting, but I'm expecting the 
number of cores/box to ramp up quickly over the next decade...and that 
*is* interesting.
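
A rough sketch of the map/reduce shape on a single multi-core box, in Python with the standard concurrent.futures module.  The task (counting tab-separated fields) and all of the function names are assumptions chosen for illustration, not anything from the actual project.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_fields(chunk):
    """Map step: count the tab-separated fields in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.rstrip("\n").split("\t"))
    return counts


def merge_counts(partials):
    """Reduce step: combine the per-chunk counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def split_chunks(lines, n):
    """Split lines into at most n roughly equal chunks."""
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_count(lines, workers=4):
    """Run the map step in separate processes, then reduce in the parent."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge_counts(pool.map(count_fields, split_chunks(lines, workers)))


if __name__ == "__main__":
    sample = ["a\tb\n", "b\tc\n", "a\ta\n", "c\td\n"]
    print(parallel_count(sample, workers=2))
```

The map step shares nothing between chunks, so the worker count can follow the core count without changing the code.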
> CH>  later, and D doesn't have much in the way of concurrency handling.  I'm
> CH>  not sure that Hydra counts...though it sounds like I need to look into
> CH>  it.  The question would be how do programs running on separate virtual
> CH>  machines communicate with each other.
>
> Two different issues, Hydra addresses one single machine and does not
> support current Squeak (recent discussion on Squeak dev). The other
> issue is communicating via network. This is where you'll end up.
>    
I don't expect to end up "communicating via network", except, perhaps, 
via localhost.  But I do expect to end up running several processes, 
probably on different cores.  This causes many, but not all, of the same 
problems.  (Current support is less important, as this is something a 
bit off in the future.  But it needs to be planned for now, before I 
start writing the code.)  Guess I'll see if I can find that "Squeak dev" 
discussion.  Perhaps D-Bus is the correct answer...I've only skimmed 
its spec, but it looks plausible.  (Getting info back from separate 
processes seems a major problem with most of the approaches.  It may 
well turn out that stream sockets, either TCP on localhost or Unix 
domain sockets, are the best approach available...though I *would* 
like something better.)
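
A small sketch of getting a reply back from a separate process over a Unix domain stream socket, in Python and assuming a POSIX system (AF_UNIX).  The line-based request/reply protocol, the socket file name and the ask_worker helper are invented for the example; a real design would need its own framing and error handling.

```python
import os
import socket
import subprocess
import sys
import tempfile

# Source of the worker process: connect, read one line, reply in upper case.
CHILD = r"""
import socket, sys
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(sys.argv[1])
request = s.makefile("r", encoding="utf-8").readline().strip()
s.sendall((request.upper() + "\n").encode("utf-8"))
s.close()
"""


def ask_worker(message):
    """Start a worker process, send it one line, and return its one-line reply."""
    path = os.path.join(tempfile.mkdtemp(), "worker.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    server.settimeout(10)  # do not wait forever if the worker fails to start
    worker = subprocess.Popen([sys.executable, "-c", CHILD, path])
    try:
        conn, _ = server.accept()
        with conn:
            conn.sendall((message + "\n").encode("utf-8"))
            reply = conn.makefile("r", encoding="utf-8").readline().strip()
    finally:
        worker.wait()
        server.close()
        os.unlink(path)
    return reply
```

Swapping AF_UNIX and the socket path for AF_INET and ("127.0.0.1", port) gives the TCP-on-localhost variant with the same read/write code.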

