[squeak-dev] FileStreams Limit

Jörg Belger unique75 at web.de
Sat Feb 19 13:27:48 UTC 2022


I cannot find #beReadOnlyObject with the MethodFinder in 5.3, nor as a method in class Object.
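
For reference, a quick way to check whether the selector exists anywhere in an image (both expressions are plain Squeak, nothing Magma-specific):

	Object includesSelector: #beReadOnlyObject.
	SystemNavigation default browseAllImplementorsOf: #beReadOnlyObject.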

> On 19.02.2022 at 14:14, Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com> wrote:
> 
> Hi Jörg,
> 
> 
> On Sat, 19 Feb 2022 at 11:20, Jörg Belger <unique75 at web.de <mailto:unique75 at web.de>> wrote:
> Hi Chris,
> 
> My current very simple implementation is to have a
> 
> 	- database that is stored in a directory with the same name
> 	- the database has multiple signals, each stored in a subdirectory with the same name
> 	- a signal consists of multiple fragments, each stored in a file with an ISO date in its name; the ISO date makes the files sortable in the explorer
> 	- a fragment contains multiple rows, stored as time/value pairs
> 	- currently I use a binary format, where the time consumes 48 bits and the value is a 32-bit float (see the sketch after this list)
> 
> 	- all the objects are also held in memory for faster access
> 	- that is why I created the fragment objects, which can easily drop their rows collection from memory
> 	- the idea is to later call one method that drops fragments older than a given date, so that only the last 2 years of data stay in memory
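> 
> For illustration, here is a minimal sketch of how one such 10-byte row could be written and read over a binary stream (the selector names and the choice of nanoseconds since the start of the fragment's day for the 48-bit part are just assumptions, not my actual code):
> 
> 	writeTime: nanos value: aFloat on: aBinaryStream
> 		"6 bytes for the time, then the 4 IEEE-754 bytes of the value"
> 		aBinaryStream nextNumber: 6 put: nanos.
> 		aBinaryStream nextNumber: 4 put: aFloat asIEEE32BitWord
> 
> 	readRowFrom: aBinaryStream
> 		| nanos word |
> 		nanos := aBinaryStream nextNumber: 6.
> 		word := aBinaryStream nextNumber: 4.
> 		^ nanos -> (Float fromIEEE32Bit: word)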
> 
> In an earlier version I used CSV fragment files, which are more human-readable, or rather more Jörg-readable. But a CSV row consumes at least 26 bytes with the nanosecond part, instead of the 10 binary bytes. With the old API it didn’t matter, because I did not have that much data and the human-readable format was better for me. But with the new realtime API I have approx. 10 MB per fragment, which is approx. 2.5 GB per year for only one signal. I have 30 assets for which I store 4 different signals, so that would be 300 GB of CSV data for one year; with the binary format this shrinks to 115 GB. There are 529 more assets for which I store 5 signals, but that data comes only once per minute and does not really matter.
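> 
> Double-checking that estimate in a workspace (26 bytes per CSV row versus 10 binary bytes, with the figures above):
> 
> 	csvPerYear := 2.5 * 30 * 4.             "GB: 4 signals for 30 assets, i.e. 300 GB of CSV per year"
> 	binaryPerYear := csvPerYear * 10 / 26.  "roughly 115 GB in the binary format"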
> 
> Of course there is more optimization potential if I store the 4 values as one object that reuses the time part, but for now I decided to save the 4 values as separate signals, which is more flexible, because I can later add other signals from artificial-intelligence output or other calculations, and I can mix different signals together in a graph.
> 
> Currently I am thinking about collecting the data only in Smalltalk memory and writing it out only once a minute; then I can re-open the file, write a batch of unwritten rows, and close the file again. If the machine crashes in the meantime, I lose only one minute of data. Either I need a new background process that checks every minute for unwritten data, or I extend my current data-provider architecture so that the providers do that job in their idle action, when they have nothing else to do.
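> 
> A minimal sketch of the background-process variant, just to make it concrete (the one-minute delay and the #signalsDo: / #flushUnwrittenRows selectors are invented here, not existing code):
> 
> 	flusher := [[true] whileTrue: [
> 		(Delay forSeconds: 60) wait.
> 		database signalsDo: [:signal | signal flushUnwrittenRows]]] newProcess.
> 	flusher priority: Processor userBackgroundPriority.
> 	flusher resume.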
> 
> 
> If you collect large chunks of data in Smalltalk, I strongly recommend using subclasses of RawBitsArray (like Float32Array, for example).
> The alternative of using Arrays of Floats creates many more Smalltalk objects and puts a lot of pressure on the garbage collector and our generation scavenger, which is not optimal in such a context, where the newly created objects are long-lived.
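> 
> For illustration (the comments are indicative only, nothing here is specific to your code):
> 
> 	rows := Float32Array new: 1000000.     "one flat object, 4 bytes of raw data per element"
> 	rows at: 1 put: 3.14159.               "stored as an IEEE-754 single; boxed only on access"
> 	boxed := Array new: 1000000.           "by contrast, each slot here would hold a separately allocated Float"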
> 
> Is there a possibility in Magma to change multiple objects over time, but defer the commit action? As far as I understand it so far, I need to encapsulate my change operations in a commit block, where each object change is then tracked. This looks like Glorp to me. In VisualWorks I have implemented a single-user database system based on the immutability flags. It looks to me as if Squeak currently does not have this feature of immutable objects; I could not find a method like #isImmutable as in VisualWorks. With that mechanism you can track object changes and later simply send #commit to your session.
> 
> For immutability, please see #beReadOnlyObject.
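> 
> In images that already have the write barrier, the pattern looks roughly like this (selectors as in recent Squeak trunk / Pharo; I am not sure they are all present in 5.3):
> 
> 	| p |
> 	p := 3 @ 4.
> 	p beReadOnlyObject.
> 	p isReadOnlyObject.                       "true"
> 	[p setX: 5 setY: 6]
> 		on: ModificationForbidden
> 		do: [:ex | Transcript show: 'write intercepted'; cr].
> 	p beWritableObject.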
> 
> 
> But I think it should be possible in Magma to have something like this:
> 
> 	session trackChanges: [session root at: 1 put: #change].
> 	session trackChanges: [session root at: 2 put: #change].
> 	session commit
> 
> The advantage of my files is of course that I can simply remove older fragment files from the signal directory, zip them, put them somewhere else as a backup, and thus clean up the database a bit to make it smaller at runtime. But I will have a look at what you described as „browsing the Magma database“ :-)
> 
> Ah, and the other advantage of my files is that I can read them directly in my Python scripts. If I use a Magma database, I need an exporter.
> 
> Jörg
> 
> 
>> On 19.02.2022 at 06:00, Chris Muller <asqueaker at gmail.com <mailto:asqueaker at gmail.com>> wrote:
>> 
>> Hi Jörg,
>> 
>> My problem is simply that I need to leave the streams open because reopening for every write is too slow.
>> 
>> I'm all too familiar with this challenge!  For Magma's file storage, depending on the app's design and client behavior, there is no theoretical upper limit on the number of files utilized.  As you can imagine, it didn't take long for a medium-sized domain to run into the upper limit of simultaneously open files and affect the server (this was back in 2005).  I realized that, to have a resilient server, Magma's design would be *required* to be able to operate within all manner of resource limits.
>> 
>> How does it solve this particular one?  It defaults to a maximum of only 180 files open at a given time (low enough for almost any environment, but large enough to perform well), which can be adjusted up or down by the app administrator according to their knowledge of their VM and OS environment.  Internally, Magma adds opened FileStreams to a fixed-size LRU cache.  Upon any access, a FileStream is "renewed" back to the top; as more streams are opened beyond the set capacity, the least-recently-used are closed just before being pushed off the bottom.
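>> 
>> The same idea in a few lines, just to show the shape of it (the names below are made up for this sketch, not Magma's actual code; assume openStreams is a Dictionary and lruOrder an OrderedCollection of file names, most recently used last):
>> 
>> 	openStreamNamed: aFileName
>> 		"Answer an open stream, closing the least-recently-used one if the limit is reached."
>> 		| stream |
>> 		stream := openStreams at: aFileName ifAbsent: [nil].
>> 		stream
>> 			ifNil: [
>> 				openStreams size >= maxOpenFiles ifTrue: [
>> 					(openStreams removeKey: lruOrder removeFirst) close].
>> 				stream := FileStream oldFileNamed: aFileName.
>> 				openStreams at: aFileName put: stream]
>> 			ifNotNil: [lruOrder remove: aFileName].
>> 		lruOrder addLast: aFileName.
>> 		^ stream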
>> 
>> It's a strategy that has worked remarkably well over the years.
>> 
>> I have realtime data coming through a socket with nanosecond precision, and the file handling must be very fast. Currently I have 120 nanosecond realtime streams and 2645 minute-based streams.
>> 
>> Magma is fast enough for well-designed applications that can tolerate sub-second response times, but not sub-nanosecond requirements.  To do the kind of real-time application you mentioned in Squeak, I think you would have to just dump it to a file that is consumed separately, or make some sort of implementation dedicated to that use case.  Darn, I hate to have to say that, sorry.
>>  
>> As I now use a binary format instead of the previous CSV format, I cannot read the plain data files anyway, so maybe I will give Magma a try. It does not matter whether I can’t read binary files or can’t read Magma files with a text editor :-)
>> 
>> The entire database can be browsed by simply opening the #root in an Explorer window and navigating down.  Even if the total model is terabytes in size, apart from a pause of a few milliseconds when opening big branches, it is a completely transparent experience, just like exploring a local object.
>> 
>>  - Chris
>> 
>> PS -- Incidentally, the number of simultaneously open FileStreams is not the only constrained resource.  Depending on the host platform, there may be limitations on maximum file sizes, too.  The same FileStream subclass for Magma solves this as well: a default maximum size of 1.8GB per physical file, with .2., .3., etc. segments created and accessed transparently, as if they were one big file...
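>> 
>> Just to sketch the idea of mapping a logical position onto those numbered physical files (the selector name is made up; only the 1.8 GB figure comes from the description above):
>> 
>> 	segmentAndOffsetFor: logicalPosition
>> 		"Answer which physical segment a logical position falls into, and the offset inside it."
>> 		| segmentSize |
>> 		segmentSize := 1800 * 1000 * 1000.   "1.8 GB per physical file"
>> 		^ (logicalPosition // segmentSize + 1) -> (logicalPosition \\ segmentSize)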
