[Newbies] Pre-Getting started info: Unicode, utf8, large memory need
Charles Hixson
charleshixsn at earthlink.net
Wed Apr 28 19:02:11 UTC 2010
On 04/27/2010 09:10 PM, Herbert König wrote:
> Hi Charles,
>
> CH> 1) How does one read&write utf8 files?
>
> the FileStream class has used UTF-8 files by default for some time. So if
> you want non-UTF-8 files (e.g. for speed reasons) you have to take
> extra measures.
>
> CH> 2) Can strings be indexed by chars, even if they are unicode rather
> CH> than ascii?
>
> yes. See class WideString.
>
> CH> 3) What happens if you have more data than will fit into RAM?
>
> use a database :-))
>
> CH> (For 3 "use a database" is an acceptable answer, but I'm hoping for
> CH> something involving automatic paging.)
>
> There are object databases like Magma which make this less painful.
> And OR mappers. Commercial products handle bigger than RAM images
> (GemStone) of which I thought they would have a free version but can't
> find it on their website.
>
> CH> An additional, but much less urgent, question is "How does one use
> CH> Squeak on multiple cores of a multi-core processor?"
>
> There is an experiment "Hydra VM" which can run multiple images each
> in their native thread. Squeak is a single OS thread and uses green
> threads inside.
>
> You might tell us, what you want to achieve. Personally I'd say start
> small :-)
>
Well, I am starting small, but the database isn't all that small. I'm
planning, as a first step, to build a bibliographic database of
"interesting books" from Project Gutenberg. They often
leave out things like "When was this first published?" (Sometimes it
isn't known.) that I want to include in my bibliography, and I also
want to include things like Story index and Author index for
publications (e.g. magazines) that have multiple stories with multiple
authors. Some of this I've already done by hand, but unfortunately I
used two different formats, and also the info needs to be relocated to
the end of the file. (I'm planning a table just prior to the "</body>"
tag.)
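
A minimal sketch of that relocation step, in Python (one of the
candidate languages mentioned below). The function names, the "biblio"
class, and the field names are hypothetical; the only thing taken from
the plan above is that the table goes just before the closing body tag.

```python
import html
import re

# Matches the closing body tag regardless of case or inner whitespace.
BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)

def build_biblio_table(fields):
    """Render a dict of bibliographic fields as a small HTML table."""
    rows = "\n".join(
        f"  <tr><th>{html.escape(key)}</th><td>{html.escape(value)}</td></tr>"
        for key, value in fields.items()
    )
    return f'<table class="biblio">\n{rows}\n</table>\n'

def insert_before_body_close(document, table_html):
    """Insert table_html just before the last closing body tag."""
    matches = list(BODY_CLOSE.finditer(document))
    if not matches:
        raise ValueError("no </body> tag found")
    pos = matches[-1].start()
    return document[:pos] + table_html + document[pos:]
```

Using the last match rather than the first guards against a book whose
text happens to quote the tag somewhere earlier in the file.
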
The next step is to generate catalogs from this bibliographic
information. Then I want to package them together with the files onto
something that will fit onto a DVD by the middle of November. (That
should be practical.)

The step after that is to build indexes of names and where they appear, etc.
(I don't have the details planned out. Automated information retrieval
is the goal, but not just free-form retrieval, and I don't know exactly
what I'll need to do. It's likely to require pre-computing a lot of
partial answers, though.)
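
As a rough illustration of "pre-computing partial answers", here is a
sketch of a name index in Python. It assumes the texts fit in memory as
a dict of file name to text and that the list of names is already
known; build_name_index and both parameters are made-up names, and a
real version would need smarter matching than exact substrings.

```python
from collections import defaultdict

def build_name_index(documents, names):
    """Map each name to a sorted list of (doc_id, character offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                index[name].append((doc_id, start))
                # Resume one character later so later occurrences are found.
                start = text.find(name, start + 1)
    return {name: sorted(hits) for name, hits in index.items()}
```

Once built, answering "where does this name appear?" is a dictionary
lookup instead of a scan over every file.
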
I looked at Magma, and couldn't figure out whether it would be useful or
not. I've no idea just how fast it is, how capacious it is, or how much
RAM it consumes, and I don't even know what I should measure. It's the
kind of thing that could look like it was working fine until one
suddenly passed some critical usage level, and then it would just barely
work at all, and I can't guess how one could determine that usage level
ahead of time. And I want locally separate files, so I guess I'd
probably use SQLite or Firebird. With SQLite I might need to have
multiple databases to handle the final system, so it would probably be
best to partition things early. (Either that or build some sort of
hierarchical storage system that rolled things from database to database
depending on how recently they were accessed.)
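
One way the multiple-database idea could look with SQLite is its ATTACH
DATABASE statement, which lets a single connection query several
database files. The sketch below uses Python's standard sqlite3 module;
open_partitioned, the alias "idx", and the table and column names are
all hypothetical, and in-memory databases stand in for real files.

```python
import sqlite3

def open_partitioned(main_path, attached):
    """Open main_path and ATTACH each alias -> path pair to the connection."""
    conn = sqlite3.connect(main_path)
    for alias, path in attached.items():
        # The alias is spliced into the SQL text, so accept only identifiers.
        if not alias.isidentifier():
            raise ValueError(f"bad alias: {alias!r}")
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
    return conn

# Two separate databases, joined through one connection.
conn = open_partitioned(":memory:", {"idx": ":memory:"})
conn.execute("CREATE TABLE main.books (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE idx.names (book_id INTEGER, name TEXT)")
conn.execute("INSERT INTO main.books VALUES (1, 'Example Title')")
conn.execute("INSERT INTO idx.names VALUES (1, 'Example Name')")
rows = conn.execute(
    "SELECT b.title, n.name FROM main.books b "
    "JOIN idx.names n ON n.book_id = b.id"
).fetchall()
```

SQLite caps the number of attached databases per connection (10 in a
default build), so this suits a handful of partitions rather than one
file per book.
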
I'm guessing that FileStream would handle a BOM at the head of a file
gracefully. (Most of my files are UTF-8 with a BOM at the head.) This isn't
totally standard, as many UTF-8 files don't have any marker to show that
they aren't ASCII (or extended ASCII), but it's ONE of the standard
approaches.
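
For comparison, this is how the BOM question looks in Python. It says
nothing about what Squeak's FileStream actually does. The "utf-8-sig"
codec drops a leading BOM when one is present and reads plain UTF-8
otherwise, so both kinds of file decode to the same text. The helper
names are made up for the example.

```python
import codecs

def read_utf8(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def has_utf8_bom(path):
    """Report whether the file starts with the three-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```
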
(I wouldn't need any fancy mapper. If I weren't dealing with LOTS of
variable length arrays of variable length strings, I could just fit the
data into a simple C struct without any pointers whatsoever. So all I
need is to be able to save a list of lists of chars, plus a few integers
that would all fit comfortably into 32 bits. [Many of them would fit
into 8 bits.])
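
A sketch of what such a pointer-free record could look like on disk,
using length prefixes in place of pointers. This is Python's struct
module, not a claim about how any of the candidate languages would do
it; the layout, the function names, and the choice of one 32-bit and
one 8-bit integer are assumptions made for the example.

```python
import struct

HEADER = "<iBI"  # int32 year, uint8 flags, uint32 number of lists

def pack_record(year, flags, fields):
    """Pack two small integers and a list of lists of strings into bytes.

    After the header, each list is a uint32 string count followed by its
    strings, each stored as a uint32 byte length plus UTF-8 bytes.
    """
    out = [struct.pack(HEADER, year, flags, len(fields))]
    for strings in fields:
        out.append(struct.pack("<I", len(strings)))
        for s in strings:
            data = s.encode("utf-8")
            out.append(struct.pack("<I", len(data)))
            out.append(data)
    return b"".join(out)

def unpack_record(buf):
    """Inverse of pack_record: return (year, flags, fields)."""
    year, flags, n_lists = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    fields = []
    for _ in range(n_lists):
        (n_strings,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        strings = []
        for _ in range(n_strings):
            (size,) = struct.unpack_from("<I", buf, offset)
            offset += 4
            strings.append(buf[offset:offset + size].decode("utf-8"))
            offset += size
        fields.append(strings)
    return year, flags, fields
```

The lengths count bytes rather than characters, which keeps the format
correct for non-ASCII text.
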
So far I'm still choosing the language. I've got one routine
implemented in D, Python, Ruby, and Java so far. Those could all be
made to work. I'm currently working on a Vala implementation, and I'm
considering a Smalltalk one. If D had the libraries for later use, it
would be the clear winner so far. Unfortunately, I'm also considering
later needs, and D doesn't have much in the way of concurrency handling. I'm
not sure that Hydra counts...though it sounds like I need to look into
it. The question would be how programs running on separate virtual
machines communicate with each other. (N.B.: Ruby and Python also have
this problem. Vala appears to have solved it.)

I also considered "go", but it appears to be too beta at the moment. The
design of the language places unique requirements on the documentation
that they don't seem to be addressing. (It could be because the
language is still in an early stage of development.)

Long term goal (1-4 decades): A librarian program that can dig the
answers to "reasonable" questions out of the books that it handles. And
can also recommend books in answer to slightly less reasonable questions.