[squeak-dev] FileStreams Limit

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Sat Feb 19 13:14:00 UTC 2022


Hi Jörg,


On Sat, Feb 19, 2022 at 11:20, Jörg Belger <unique75 at web.de> wrote:

> Hi Chris,
>
> My current, very simple implementation is to have a
>
> - database that is stored in a directory of the same name
> - the database has multiple signals, each stored in a subdirectory of the
> same name
> - a signal consists of multiple fragments, each stored in a file with an
> ISO date in its name; the ISO date keeps the files sortable in the explorer
> - a fragment contains multiple rows, each stored as a time/value pair
> - currently I use a binary format, where the time consumes 48 bits and the
> value is a 32-bit float (see the packing sketch below)
>
> - all the objects are also held in memory for faster access
> - that is why I created the fragment objects, which can easily drop their
> rows collection from memory
> - the idea is to later call one method that drops all fragments older than
> a given date, so that I keep only the last 2 years of data in memory
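
For readers curious what such a 10-byte row might look like, here is a minimal
workspace sketch (the big-endian layout, the example timestamp and the use of
#asIEEE32BitWord are my assumptions, not necessarily Jörg's actual encoding):

    | time value row readTime readValue |
    time := 16r017F12345678.                  "an example 48-bit timestamp; the epoch is up to you"
    value := 42.5.
    row := ByteArray new: 10.
    1 to: 6 do: [:i |                         "bytes 1-6: the 48-bit time, big-endian"
        row at: i put: ((time bitShift: (i - 6) * 8) bitAnd: 16rFF)].
    row unsignedLongAt: 7 put: value asIEEE32BitWord bigEndian: true.   "bytes 7-10: IEEE 754 32-bit float"

    readTime := 0.
    1 to: 6 do: [:i | readTime := (readTime bitShift: 8) + (row at: i)].
    readValue := Float fromIEEE32Bit: (row unsignedLongAt: 7 bigEndian: true).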
>
> In an earlier version I used CSV fragment files, which are more
> human-readable, or rather more Jörg-readable. But a CSV row consumes at
> least 26 bytes with the nanosecond part, instead of the 10 binary bytes.
> With the old API it didn't matter, because I didn't have that much data and
> the human-readable format was better for me. But with the new realtime API
> I have approx. 10 MB per fragment, which is approx. 2.5 GB per year for a
> single signal. I have 30 assets where I store 4 different signals. That
> means I would have 300 GB of CSV data for one year. With the binary format
> this shrinks to 115 GB. There are 529 more assets where I store 5 signals,
> but this data comes in only once per minute and does not really matter.
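
Just to make the arithmetic explicit, a quick workspace check (assuming ~26
bytes per CSV row versus 10 binary bytes, as above):

    30 * 4 * 2.5.       "30 assets x 4 signals x ~2.5 GB/year  =>  ~300 GB of CSV per year"
    300 * 10 / 26.0.    "the same rows at 10 bytes instead of ~26  =>  ~115 GB in binary"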
>
> Of course there is more optimization potential if I stored the 4 values as
> one object that reuses the time part, but for now I decided to save the 4
> values as separate signals, which is more flexible, because I can later add
> other signals from artificial-intelligence output or other calculations,
> and I can mix different signals together in a graph.
>
> Currently I am thinking about collecting the data only in Smalltalk memory
> and writing it out just once a minute; then I can re-open the file, write
> the bunch of unwritten rows and close the file again. If the machine
> crashes in the meantime, I lose only 1 minute of data. Either I need a new
> background process that looks for unwritten data every minute, or I extend
> my current data-provider architecture so that the providers do that job in
> their idle action, when they have nothing else to do.
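
A minimal sketch of such a once-a-minute flusher (the #signalsDo: and
#flushUnwrittenRows selectors are placeholders for whatever your model
actually provides):

    flusher := [
        [true] whileTrue: [
            (Delay forSeconds: 60) wait.
            database signalsDo: [:signal |
                signal flushUnwrittenRows]]]    "re-open the fragment file, append the pending rows, close"
        forkAt: Processor userBackgroundPriority.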
>
>
If you collect large chunks of data in Smalltalk, I strongly recommend using
subclasses of RawBitsArray (like Float32Array, for example).
The alternative of using Arrays of Floats creates a lot more Smalltalk
objects and puts a lot of pressure on the garbage collector and on our
generation scavenger, which is not really optimal in a context where newly
created objects are long-lived.
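
For example, compare:

    samples := Float32Array new: 1000000.    "a single object holding raw 32-bit floats"
    samples at: 1 put: 42.5.
    samples at: 1.                            "=> 42.5, converted back to a Float on access"

as opposed to a plain Array, where each Float element is (in general) a
separate heap object for the garbage collector to deal with.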

> Is there a possibility in Magma that I can change multiple objects over
> time, but defer the commit action? Everything I understand so far is that I
> need to encapsulate my change operations in a commit block, where each
> object change is then tracked. This looks like Glorp to me. In VisualWorks
> I have implemented a single-user database system based on the immutable
> flags. It looks to me as if Squeak currently does not have this feature of
> immutable objects; I could not find a method like #isImmutable as in
> VisualWorks. With that mechanism you can track object changes and later you
> simply need to send a #commit to your session.
>

For immutability, please see #beReadOnlyObject.
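
For example:

    p := 3 @ 4.
    p beReadOnlyObject.
    p isReadOnlyObject.        "=> true; any attempt to mutate p is now refused"
    p beWritableObject.        "clear the flag again"

If I remember correctly, the refused write signals a ModificationForbidden,
which a change tracker in the spirit of your VisualWorks one could handle:
remember the touched object, make it writable again, retry the write, and
commit all remembered objects later.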


> But I think it should be possible in Magma to have something like this:
>
> session trackChanges: [session root at: 1 put: #change].
> session trackChanges: [session root at: 2 put: #change].
> session commit
>
> The advantage of my files is of course that I can simply remove older
> fragment files from the signal directory, zip them and put them somewhere
> else as a backup, and thereby clean up the database a bit to make it
> smaller at runtime. But I will have a look at what you described as
> „browsing the Magma database“ :-)
>
> Ah, and the other advantage of my files is that I can use them directly in
> my Python scripts to read them in. If I use a Magma database, I need an
> exporter.
>
> Jörg
>
>
> On 19.02.2022 at 06:00, Chris Muller <asqueaker at gmail.com> wrote:
>
> Hi Jörg,
>
>> My problem is simply that I need to leave the streams open because
>> reopening for every write is too slow.
>>
>
> I'm all too familiar with this challenge!  For Magma's file storage,
> depending on the app's design and client behavior, there is no theoretical
> upper limit on the number of files utilized.  As you can imagine, it didn't
> take long for a medium-sized domain to run into the upper limit of
> simultaneously open files and affect the server (this was back in 2005).  I
> realized that, to have a resilient server, Magma's design would be
> *required* to be able to operate within all manner of resource limits.
>
> How does it solve this particular one?  It defaults to a maximum of only
> 180 files open at a given time (low enough for almost any environment, but
> large enough to perform well), which can be adjusted up or down by the app
> administrator according to their knowledge of their VM and OS environment.
> Internally, Magma adds opened FileStreams to a fixed-size LRU cache.  Upon
> any access, a FileStream is "renewed" back to the top, while, as more
> streams are opened beyond the set capacity, the least-used ones are closed
> just before being pushed off the bottom.
>
> It's a strategy that has worked remarkably well over the years.
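
Not Magma's actual code, of course, but the idea fits in a few lines of
workspace code (the helper block and the file name below are just for
illustration):

    capacity := 180.
    openStreams := Dictionary new.           "file name -> open FileStream"
    recency := OrderedCollection new.        "least recently used first"

    streamFor := [:fileName | | stream |
        stream := openStreams
            at: fileName
            ifAbsent: [
                recency size >= capacity ifTrue: [
                    | victim |
                    victim := recency removeFirst.              "push the bottom one off..."
                    (openStreams removeKey: victim) close].     "...and close it"
                openStreams at: fileName put: (FileStream oldFileNamed: fileName)].
        recency remove: fileName ifAbsent: [].
        recency addLast: fileName.                              "renew back to the top"
        stream].

    (streamFor value: 'signal-2022-02-19.bin') next: 10.        "then use it as usual"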
>
>> I have realtime data coming through a socket with nanosecond precision,
>> and the file handling must be very fast. Currently I have 120
>> nanosecond-resolution realtime streams and 2645 minute-based streams.
>>
>
> Magma is fast enough for well-designed applications that can tolerate
> sub-second response times, but not for sub-nanosecond requirements.  To do
> the kind of real-time application you mentioned in Squeak, I think you
> would have to just dump the data to a file that is consumed separately, or
> make some sort of implementation dedicated to that use case.  Darn, I hate
> to have to say that, sorry.
>
>
>> As I now use a binary format instead of the previous CSV format, I cannot
>> read the plain data files anyway, so maybe I will give Magma a try. It
>> does not matter whether I can’t read binary files or can’t read Magma
>> files with a text editor :-)
>>
>
> The entire database can be browsed by simply opening the #root in an
> Explorer window and navigating down.  Even if the total model is terabytes
> in size, apart from a few milliseconds' pause when opening big branches, it
> is a completely transparent experience, just like exploring a local object.
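
Concretely, with an open session that would be something like:

    session root explore.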
>
>  - Chris
>
> PS -- Incidentally, the number of simultaneously open FileStreams is not
> the only constrained resource.  Depending on the host platform, there may
> be limits on maximum file sizes, too.  The same FileStream subclass for
> Magma solves this as well, with a default maximum size of 1.8 GB per
> physical file; the .2., .3., etc. files are created and accessed
> transparently, as if it were one big file...
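
The offset arithmetic behind such transparent splitting could look roughly
like this (just a sketch; 1.8 GB is the limit mentioned above, everything
else is made up):

    maxFileSize := 1800000000.
    logicalOffset := 5000000000.                              "a position in the 'one big file'"
    physicalFileIndex := logicalOffset // maxFileSize + 1.    "=> 3, i.e. the '.3.' file"
    offsetWithinFile := logicalOffset \\ maxFileSize.         "=> 1400000000"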
>
>
>
>
>