[vwnc] [squeak-dev] Re: [ANN] StOMP - Yet another multi-dialect object serializer

Paul Baumann paul.baumann at theice.com
Mon Jun 20 21:21:37 UTC 2011


Mariano,

My responses below are tagged by <plb> and </plb>...

Paul Baumann


From: Mariano Martinez Peck [mailto:marianopeck at gmail.com]

On Mon, Jun 20, 2011 at 5:48 PM, Paul Baumann <paul.baumann at theice.com> wrote:
If you are going to compare object serializing tools then State Replication Protocol (SRP) should be added to that list.

Well, this thread was about StOMP, but I will answer anyway about Fuel. We did take a look at SRP. In fact, I sent you an email asking a lot of questions, and you kindly answered all of them in detail.

SRP has not been promoted much, but after many years it is still a good cross-dialect and cross-platform binary serialization tool. It was originally ported to about seven Smalltalk dialects. Every aspect of SRP is context-configurable.

That's one of the reasons that can make it a little bit slower than others.

<plb> Yeah, but that can be customized by configuration too. A little bit slower out of the box for the sake of portability is usually a good trade-off so long as you can tune it to get the performance you need. If someone skips the tuning part then they'll leave with a bad impression. If SRP is compared with an immature solution, one may find that the other framework's performance advantage disappears once it evolves years later to do what SRP is already doing. </plb>

SRP encoding is unique, simple, fast, and unlimited. The user base for SRP is not well known, but I hear from several people that use it for production applications and I have personal experience with one deployment.

The default configuration for SRP is to use a portable mapping layer and to encode metastate into the data stream. Even with these costs, SRP is comparable in performance to serialization tools that do not do this. The (optional) portable mapping layer is used to represent common Smalltalk objects in a way that can be loaded into any Smalltalk dialect. Metastates describe the structure of the object state so that data load is data driven rather than code dependent. SRP can actually load state for which a class is not defined or has significantly changed. Metastates can be stored in metastate tables that can be reused and referenced to reduce data size and improve performance. When you use metastate tables, SRP stores more compactly than any other binary serialization tool is capable of. Whoever compares performance of SRP with other binary serialization tools should keep in mind that they will have to disable SRP features like these to have a fair comparison.

How can I disable that portable mapping layer (exactly, in code)? Can I disable that but at the same time support class shape changes?

<plb>
SrpConfiguration is the source of all the customizations and interactions.  Use SrpNonMappingConfiguration to avoid the portable mapping layer. The bigger cost in both performance and space efficiency is that SRP adds metastate information to the data if the metastates are not in a table. You need to use a metastate table. You need to define how loading code will resolve a metastate table that is being referenced. This is similar in purpose to an XML DTD file. You'd likely subclass SrpConfiguration and then refine methods like #saveMetastateTableNamed:containing:, #resolveMetastateTableNamed:, or #resolveMetastateTableReference:.
Metastates describe the data encoding. You must be able to resolve the metastate of the object to be read. The metastate is data in a predictable (yet extendable) format that describes how data in a less predictable format is encoded. SRP uses metastates so that changes in class shape or behavior will not affect the ability to read data. If the class doesn't exist at all for loaded data then SRP is able to load an instance of SrpState that represents the structure and accessor behavior of the original object.
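
As a rough illustration of data-driven loading (Python, not SRP code), the sketch below reads a record by looking up its metastate in a table and falls back to a generic state object when the class is not defined on the loading side. The names METASTATE_TABLE, KNOWN_CLASSES, GenericState, save and load are all hypothetical, and the record layout is an assumption made for the example, not SRP's wire format.

```python
from dataclasses import dataclass

# Hypothetical metastate table: id -> (class name, field names in stored order).
METASTATE_TABLE = {
    1: ("Point", ("x", "y")),
    2: ("Account", ("owner", "balance")),
}


@dataclass
class Point:
    x: int
    y: int


# Classes the loading side happens to define; "Account" is deliberately missing.
KNOWN_CLASSES = {"Point": Point}


class GenericState:
    """Stand-in for an undefined class (loosely modelled on the SrpState idea)."""

    def __init__(self, class_name, fields):
        self.class_name = class_name
        self.__dict__.update(fields)


def save(obj, metastate_id):
    """Write the metastate id followed by the field values in metastate order."""
    _, field_names = METASTATE_TABLE[metastate_id]
    return [metastate_id] + [getattr(obj, name) for name in field_names]


def load(record):
    """Rebuild an object from the record, guided only by the metastate."""
    class_name, field_names = METASTATE_TABLE[record[0]]
    fields = dict(zip(field_names, record[1:]))
    cls = KNOWN_CLASSES.get(class_name)
    if cls is None:
        # The class is unknown here, so keep the state in a generic holder.
        return GenericState(class_name, fields)
    return cls(**fields)
```

Loading `[2, "alice", 100]` with this sketch yields a GenericState whose `owner` and `balance` are still accessible even though no Account class exists.
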
A portable mapping is different entirely. It says, for example, that "a dictionary is a collection of association instances" rather than whatever the native dictionary implementation is. When saving a Dictionary instance, it instead saves a PmrDictionary (of association instances) that is then able to load-map back to the native Dictionary implementation. Other serialization frameworks tend to put these portability rules in code. You could put them in code with SRP too (and even directly write objects to the stream yourself within SRP data), but SRP defaults to using class-based portability mapping. Your configuration declares which mappings you want to use. Encode with no portability mappings at all and you'll be able to read the data in the format it originated from. Don't like the class-based mapping rules? Then tell your configuration to #beforeSavingAnyNamed:doWithContext: or #afterLoadingAnyNamed:doWithContext:. Still don't think that is fast enough for you? Then override #writerClass and #readerClass to use your own marshaler subclass that, for example, implements #saveDictionary: to write your collection of associations. SRP gives more options for portable mapping. You can still map by marshaler if you prefer that approach.
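
The save-map/load-map idea can be sketched in a few lines of Python (again, not SRP code). PortableDictionary, save_map and load_map are hypothetical names that only mimic the role described for PmrDictionary: the native dictionary is replaced on save by a plain list of associations and mapped back to the native type on load.

```python
class PortableDictionary:
    """Hypothetical portable form: an ordered list of (key, value) associations."""

    def __init__(self, associations):
        self.associations = list(associations)


def save_map(obj):
    """Replace a native dict by its portable form; pass anything else through."""
    if isinstance(obj, dict):
        return PortableDictionary(obj.items())
    return obj


def load_map(obj):
    """Map the portable form back to the native dict of the loading side."""
    if isinstance(obj, PortableDictionary):
        return dict(obj.associations)
    return obj
```

The point of the design is that only the portable form is ever written, so the reader never needs to know how the writer's dictionary was implemented.
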
</plb>

SRP is maintained with a single code base that is designed to work for all Smalltalk dialects. SRP does this by directing less-portable behavior through a "portal" that is configured to accommodate the dialect the code is being used with.

I find it funny when I see some binary encodings that are still code-bound. If the data does not somehow indicate the data encoding and layout in some standard way then you can render encoded streams unreadable from something as simple as a class schema change. They do that to save the cost of a data type code. SRP would never make a mistake like that, and the cost that SRP experiences for this data type code is typically only one byte.

We do store the type as well in one byte. But in our case, objects are grouped together in clusters. So it is even one byte per cluster only.

<plb>
SRP can store in clusters too. It is a common layout for serialization tools (depth first storage in silo collections followed by a relationship graph of pointers). I'd experimented with an SRP configuration that used that layout. It didn't provide any advantage at all. The cost of the pointer values outweighs the savings on class identifier. If you prefer that layout then SRP can accommodate it though.
</plb>


SRP encoding is fundamentally a sequence of unsigned integers of infinite size. This is the most compact representation possible. An object type header is commonly only one byte and yet is still flexible enough to be unlimited and extended any way imaginable. SRP encoding supported four byte character strings before they were invented and stores them as compactly as possible. SRP allows direct and data width encodings for things like floats and embedded data. Even direct encoding of some values doesn't break the readability of the object graph. SRP also has features for object annotation, for example if you want to remember the oop of an object or its dependents. The encoding is what is most special and portable about SRP. Financial markets now exchange data using encoding standards (Fast FIX) for some data types that had been pioneered by SRP, but none that I'm aware of are as consistent and pure as SRP.

SRP is a solid base of code that is intended to be tailored and configured to your needs. It is fast, but the main goal of SRP was portability. SRP provides a good configuration out of the box that you can easily tune and configure to meet your needs. The most recent tuning SRP has received was for the GS/S dialect to use GS/S specific optimizations. That GS/S specific code can be found here:

http://techsupport.gemstone.com/entries/181657-srp-3-1-010-0

SRP can serialize objects like a ComplexBlock, but does not attempt to do so in a dialect-portable way. It is simply that I had not defined a portable representation of a complex block in the portability layer. A common way to do that would be to determine the source of the block (for all dialects) and compile that code on load.

Yes, but that may not work. Because closures point to another context, which can be a CompiledMethod for example. And a closure can have references to variables defined outside the closure....

<plb>
SRP placeholders, actionItems, proxies, and substitutions can all be part of a solution that would make that work. I don't see a need to do it anyway. To me it is an example of something that many people think they need to serialize but that rarely needs to be. An exception is the sort block of a sorted collection. What most people end up doing in that case is to have standardized substitutes. If serializing the block for [:a :b | a < b ] then instead serialize an object that will load as the native compiled form of that block.
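
A minimal Python sketch of the standardized-substitute approach (not SRP code): instead of serializing the block itself, the saver writes a portable name and the loader resolves that name to the native comparison. STANDARD_BLOCKS, save_sort_block and load_sort_block are hypothetical names invented for this example.

```python
import operator

# Hypothetical registry of standard sort blocks, keyed by a portable name.
STANDARD_BLOCKS = {
    "ascending": operator.lt,   # plays the role of [:a :b | a < b]
    "descending": operator.gt,  # plays the role of [:a :b | a > b]
}


def save_sort_block(block):
    """Return the portable name of a standard block instead of the block itself."""
    for name, standard in STANDARD_BLOCKS.items():
        if block is standard:
            return name
    raise ValueError("not a standard sort block; no portable substitute defined")


def load_sort_block(name):
    """Resolve a portable name to the native comparison on the loading side."""
    return STANDARD_BLOCKS[name]
```

Anything that is not a registered standard block is rejected rather than traversed, which also keeps the saved graph from pulling in the block's outer context.
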

You can support features like porting complex blocks with external references if you want to, but SRP doesn't do that in a cross-dialect form. At some point you need to limit the depth of your traversal by use of proxies or else you'll end up saving far more junk than you anticipated or than is reasonable.
</plb>

It gets tricky if you attempt to support more than simple blocks or if you want to translate bytecodes (which I'd also prototyped). If you really think you need to serialize blocks then SRP is flexible enough to let you define how you want it done.


excellent.

Some Smalltalk dialects (like VA in particular) do not have an efficient two-way become. You'll find that most serialization tools expect there to be an efficient two-way become to substitute one object for another on load. SRP however has a unique way to fix-up references that is efficient for all dialects. SRP has a wide variety of object substitution hooks for both saving and loading that preserve graph relationship integrity without screwing up original objects. SRP also has support for proxy objects that can be managed by application code.

Where (classes/methods/tests) can I take a look at how you manage those proxies? It sounds interesting. The same for the object substitution hook.


<plb>
One of several ways is to tell the class to save *referenced* instances as a proxy so that it goes through #saveProxy: and #loadProxy methods with ways to customize. Direct saves of those kinds of objects are not proxies. You define the proxy representation. The context of both saving and loading is provided to you by SRP for the proxy.

SRP placeholders are temporary objects that are part of a graph being loaded. The placeholders are removed incrementally as the load of each object and any exchanges are completed. Placeholders are normally entirely gone by the time a load completes, but there are times when a few may be kept longer for post-load actions. You can for example declare a post-load action for an object which then gets wrapped with an action item that you control within the context of the full graph load.
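
To show the general shape of placeholder-based fix-up without a two-way become, here is a small Python sketch (not SRP code, and not a description of SRP's internals). Each placeholder remembers the slots that refer to a not-yet-loaded object, and those slots are patched as soon as the real object arrives. Placeholder, load_graph and the record layout (lists of integer indices) are assumptions made for the example.

```python
class Placeholder:
    """Temporary stand-in for an object that has not been loaded yet."""

    def __init__(self):
        self.sites = []  # (container, slot) pairs waiting for the real object


def load_graph(records):
    """records[i] lists the indices that object i refers to.

    Returns one list per record, with every index replaced by the loaded object.
    """
    loaded = {}   # index -> real object
    pending = {}  # index -> Placeholder for a forward reference
    for index, refs in enumerate(records):
        obj = [None] * len(refs)
        loaded[index] = obj
        # Patch every slot that was waiting for this object, then drop the placeholder.
        placeholder = pending.pop(index, None)
        if placeholder is not None:
            for container, slot in placeholder.sites:
                container[slot] = obj
        for slot, target in enumerate(refs):
            if target in loaded:
                obj[slot] = loaded[target]
            else:
                waiting = pending.setdefault(target, Placeholder())
                waiting.sites.append((obj, slot))
                obj[slot] = waiting
    if pending:
        raise ValueError("dangling references: %r" % sorted(pending))
    return [loaded[i] for i in range(len(records))]
```

Because only the recorded slots are rewritten, no object ever has to change identity, and in this sketch all placeholders are gone by the time the load completes.
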
</plb>


The main thing wrong with SRP is that it is not the framework that "you" created. SRP was the first binary serialization tool to focus on Smalltalk dialect portability. I'd argue that it is still the only one that truly accomplished that in a meaningful way. I created SRP by combining proven techniques from the best tools of the time and adding features for portability. SRP was superior to even the dialect-specific frameworks at the time. SRP is not something that I intend to maintain and promote. I released it open source some ten years ago in the hope that others would do that. A lot of effort and sacrifice was put into SRP "for the benefit of others". SRP taught me a painful lesson about human nature and the perception of value. Programmers (myself included) love to solve problems more than learn about existing solutions. Everyone wants to solve problems like this their own way and thinks they have a good reason that they must do it their way. "Yet another" was an excellent subject line.

I will speak just for Fuel. I don't think this is really a problem. What you mention is so well known that it even has a name: trade-off. If you find a way to be really fast in serialization and materialization and be portable at the same time, then I am all ears. For me it is perfect to have different kinds of serializers. Do you want something portable that you can even edit with a text editor? Then use SIXX. Do you want a portable solution with more or less good performance? Then use StOMP, SRP, etc. Do you want something really fast (mostly at materialization time) which is not focused on portability? Then use Fuel. Is that bad? Now in Pharo people are doing the Opal compiler, which is 3 times slower than the old one. Why are we not against that? Again, trade-off. The old compiler is really difficult to understand and maintain. We want something more OO, easy to maintain, to understand and to experiment with.

<plb>
SRP uses the same data/nesting sequence as XML. It wouldn't be difficult to create an SRP load marshaler that efficiently loads/translates XML from SRP encoded data. That way the data is both compact and human readable/editable.
</plb>


Now, I don't know the reasons, but Colin ported SRP to Squeak and then he finally implemented his own S&M serializer. Masashi has now implemented StOMP, but he also took a look at SRP. In fact, check the commits in http://www.squeaksource.com/SRP. He fixed it, and I asked him a couple of questions to make it work. Since this week (a couple of days ago), SRP tests are green in Pharo. So... these guys took a look at SRP, as well as us.

In our case, we even created benchmarks (check package FuelBenchmarksSRP in Fuel repo) to compare Fuel against the rest. I can share the results with you if you want, but tell me first how to disable the mapping layer that makes it slower.

<plb>
Keep in mind that SRP was written a long time ago. It won't be everything for everybody, but is a good general base for customization to meet a set of needs. Nobody can look at code they wrote ten years ago and not see a way it could be improved. I'd have certainly done float marshaling differently (as you could customize yourself). I can no longer say it is the fastest option out there because I haven't compared SRP with the performance of anything that came after it. It would be interesting to see how it compares now, but I'd take any measurements with a grain of salt because frameworks do have different features and goals.  That said, SRP isn't bad considering the age and neglect. It is certainly a good starting point for anyone else that wants to do better.

Where nearly all serialization tools have trouble performance-wise is the hash size limitation of most IdentityDictionary implementations. Serialization of a graph can easily touch thousands and thousands of objects. When VW IdentityDictionary performance degrades after 16K objects (32K for VA) then that causes a problem for any serialization tool that relies on the IdentityDictionary to save objects. This is the first thing I focus on when I tune SRP (#newHitList). I've implemented faster identity dictionaries that grow better, but they are not built into SRP. There is also some dialect-specific tuning that can be done in this area. The tuned version of SRP for GS/S, for example, makes use of a special hidden map that GS/S keeps for all objects and that GS/S itself uses for its own serialization.
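
For readers unfamiliar with the role of the identity map, the following Python sketch (not SRP code) numbers each distinct object by identity during a traversal, so that equal but distinct objects get separate entries and shared objects are recorded once. assign_indices and hit_list are hypothetical names, and Python's dict keyed by id() stands in for an IdentityDictionary; it says nothing about the hashing limits discussed above.

```python
def assign_indices(root):
    """Walk nested lists/tuples depth-first and number each object by identity."""
    hit_list = {}  # id(obj) -> (index, obj); obj is kept so its id stays valid
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in hit_list:
            continue  # already seen: a later reference would be written as its index
        hit_list[id(obj)] = (len(hit_list), obj)
        if isinstance(obj, (list, tuple)):
            # Reverse so that children are visited in their original order.
            stack.extend(reversed(obj))
    return hit_list
```

Every object touched by the save passes through this lookup, which is why its growth behavior dominates the cost of serializing large graphs.
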

SRP makes heavy use of #saveUnsigned: and #loadUnsigned. In some dialects using math functions on small integers is faster (and more portable) than bit manipulation. SRP would benefit from primitives to do the work of #saveUnsigned: and #loadUnsigned.
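
The following Python sketch shows one way to write and read unsigned integers of unlimited size using only arithmetic (divmod, addition, multiplication) rather than shifts and masks. It is not SRP code: the message does not give SRP's byte layout, so the format here (7 value bits per byte, least significant group first, values of 128 and above meaning "more bytes follow") is an assumption chosen for illustration, and save_unsigned and load_unsigned merely borrow their names from the selectors mentioned above.

```python
import io


def save_unsigned(value, stream):
    """Write a non-negative integer of any size, 7 bits per byte, low group first."""
    if value < 0:
        raise ValueError("unsigned values only")
    while True:
        value, low = divmod(value, 128)
        if value:
            stream.write(bytes([low + 128]))  # 128..255: more bytes follow
        else:
            stream.write(bytes([low]))        # 0..127: last byte
            return


def load_unsigned(stream):
    """Read back an integer written by save_unsigned."""
    result = 0
    scale = 1
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside an unsigned integer")
        byte = chunk[0]
        if byte < 128:
            return result + byte * scale
        result += (byte - 128) * scale
        scale *= 128


buffer = io.BytesIO()
save_unsigned(300, buffer)
buffer.seek(0)
print(load_unsigned(buffer))
```

Small values cost a single byte in this layout, while larger values simply take more bytes with no upper limit.
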
</plb>
