[Newbies] Pre-Getting started info: Unicode, utf8, large memory need

Charles Hixson charleshixsn at earthlink.net
Wed Apr 28 19:02:11 UTC 2010


On 04/27/2010 09:10 PM, Herbert König wrote:
> Hi Charles,
>
> CH>  1)  How does one read&write utf8 files?
>
> the FileStream class has used UTF-8 files by default for some time
> now. So if you want non-UTF-8 files (e.g. for speed reasons) you have
> to take extra measures.
>
> CH>  2)  Can strings be indexed by chars, even if they are unicode rather
> CH>  than ascii?
>
> yes. See class WideString.
>
> CH>  3)  What happens if you have more data than will fit into RAM?
>
> use a database :-))
>
> CH>  (For 3 "use a database" is an acceptable answer, but I'm hoping for
> CH>  something involving automatic paging.)
>
> There are object databases like Magma which make this less painful,
> and O/R mappers. Commercial products such as GemStone handle
> bigger-than-RAM images; I thought they had a free version but can't
> find it on their website.
>
> CH>  An additional, but much less urgent, question is "How does one use
> CH>  Squeak on multiple cores of a multi-core processor?"
>
> There is an experimental "Hydra VM" which can run multiple images,
> each in its own native thread. Squeak itself runs in a single OS
> thread and uses green threads internally.
>
> You might tell us, what you want to achieve. Personally I'd say start
> small :-)
>    

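(A quick Workspace check of the WideString answer above, as I understand 
it; this is just a sketch, and I'm assuming the string literal compiles 
to a WideString because it contains Cyrillic characters:

    | s |
    s := 'Tolstoy -- Война и мир'.
    s class.      "WideString, since the literal has characters outside Latin-1"
    s at: 12.     "answers the single Character В, not a byte"

If that holds, character indexing works the way I need.)
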
Well, I am starting small, but the database isn't all that small.  I'm 
planning, as a first step, to build a bibliographic database of 
"interesting books" from Project Gutenberg.  The texts often leave out 
things that I want to include in my bibliography, like "When was this 
first published?"  (Sometimes it isn't known.)  I also want to include 
things like a story index and an author index for publications (e.g. 
magazines) that have multiple stories by multiple authors.  Some of 
this I've already done by hand, but unfortunately I used two different 
formats, and the info also needs to be relocated to the end of the 
file.  (I'm planning a table just prior to the "</body>" tag.)

The next step is to generate catalogs from this bibliographic 
information.  Then I want to package them together with the files into 
something that will fit onto a DVD by the middle of November.  (That 
should be practical.)

The next step is to build indexes of names and where they appear.  Etc. 
(I don't have the details planned out.  Automated information retrieval 
is the goal, but not just free-form retrieval, and I don't know exactly 
what I'll need to do. It's likely to require pre-computing a lot of 
partial answers, though.)

I looked at Magma, and couldn't figure out whether it would be useful or 
not.  I've no idea just how fast it is, how capacious it is, or how much 
RAM it consumes, and I don't even know what I should measure.  It's the 
kind of thing that could look like it was working fine until one 
suddenly passed some critical usage level, after which it would just 
barely work at all, and I can't guess how one could determine that usage 
level ahead of time.  And I want locally separate files, so I guess I'd 
probably use SQLite or Firebird.  With SQLite I might need to have 
multiple databases to handle the final system, so it would probably be 
best to partition things early.  (Either that or build some sort of 
hierarchical storage system that rolled things from database to database 
depending on how recently it was accessed.)

I'm guessing that FileStream would handle file BOM markers gracefully.  
(Most of my files are UTF-8 with a BOM at the head.)  This isn't 
totally standard, as many UTF-8 files don't have any marker to show 
that they aren't ASCII (or extended ASCII), but it's ONE of the 
standard approaches.
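
To make the question concrete, here's roughly what I'd expect to write; 
a minimal sketch only, assuming a recent image where FileStream decodes 
UTF-8 by default (as Herbert says above) and that a BOM, if not 
consumed, shows up as a leading U+FEFF character.  The file name is 
made up:

    | stream contents |
    stream := FileStream readOnlyFileNamed: 'pg12345.html'.
    [contents := stream upToEnd] ensure: [stream close].
    "Strip a leading BOM (U+FEFF) if one survived decoding."
    (contents notEmpty and: [contents first = (Character value: 16rFEFF)])
        ifTrue: [contents := contents copyFrom: 2 to: contents size].

If the image already swallows the BOM, the last two lines just do nothing.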

(I wouldn't need any fancy mapper.  If I weren't dealing with LOTS of 
variable-length arrays of variable-length strings, I could just fit the 
data into a simple C struct without any pointers whatsoever.  So all I 
need is to be able to save a list of lists of chars, plus a few 
integers that would all fit comfortably into 32 bits.  [Many of them 
would fit into 8 bits.])
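
In Smalltalk terms I'm picturing nothing fancier than this; the class 
name, fields, and file name are placeholders, and I'm assuming Squeak's 
ReferenceStream still works the way its class comment describes for 
dumping an object graph to a file and reading it back:

    Object subclass: #BiblioEntry
        instanceVariableNames: 'title authors stories yearFirstPublished'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Gutenberg-Bibliography'.

    "Save the whole collection, then read it back."
    | entries out in |
    entries := OrderedCollection new.    "filled in from the parsed files"
    out := ReferenceStream fileNamed: 'bibliography.obj'.
    [out nextPut: entries] ensure: [out close].
    in := ReferenceStream fileNamed: 'bibliography.obj'.
    [entries := in next] ensure: [in close].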

So far I'm still choosing the language.  I've got one routine 
implemented in D, Python, Ruby, and Java so far.  Those could all be 
made to work.  I'm currently working on a Vala implementation, and I'm 
considering a Smalltalk one.  If D had the libraries for later use, it 
would be the clear winner so far.  Unfortunately, I'm also thinking 
about the later stages, and D doesn't have much in the way of 
concurrency handling.  I'm not sure that Hydra counts... though it 
sounds like I need to look into it.  The question would be how programs 
running on separate virtual machines communicate with each other.  
(N.B.:  Ruby and Python also have this problem.  Vala appears to have 
solved it.)

I also considered Go, but it appears to be too beta at the moment.  The 
design of the language poses unique requirements on the documentation 
that they don't seem to be addressing.  (It could be because the 
language is still at an early stage of development.)

Long-term goal (1-4 decades):  A librarian program that can dig the 
answers to "reasonable" questions out of the books that it handles, and 
can also recommend books in answer to slightly less reasonable questions.


