[squeak-dev] Re: 4.1 - hashed collections still a problem

Andres Valloud avalloud at smalltalk.comcastbiz.net
Wed Mar 31 01:59:48 UTC 2010


Ah, yes, 4096 identityHash values are not too many for that kind of 
use.  GemStone runs into similar problems, and IIRC they use hash 
buckets keyed by identityHash value (VisualWorks, 32 bits, 16383 
possible hash values, 16383 hash buckets).  However, Martin noted that 
if the identityHash values are enlarged to 20 bits, then the hash 
buckets take too much space and the approach becomes counterproductive.  
I don't have all the details of what happened after that, and in any 
case it would be best for Martin to answer questions about alternative 
approaches in this particular case.

Regarding the identityHash for 32 bit Squeak, has the field been 
enlarged for 64 bits?  If so, one should consider whether the trauma 
(and introduced image / VM incompatibility) for 32 bit Squeak is worth 
the potential benefits when compared to other benefits that may be 
obtained with a similar amount of effort.  Also, I don't know what does 
the object header structure look like in Squeak, and it may or may not 
be a good idea to enlarge it merely to hold more identityHash values 
because larger headers will result in larger images and potentially 
slower execution.  These unintended consequences should be accounted for.

On 3/30/10 18:44 , Levente Uzonyi wrote:
> On Tue, 30 Mar 2010, Andres Valloud wrote:
>
>    
>> IME, relational database mappings typically restrict the classes of mapped
>> objects to some hierarchy of classes.  The top of the hierarchy adds an
>> instance variable such as id, which typically holds values from the primary
>> key of the table in question.  Please excuse my ignorance, but is this how
>> Magma works?  If not, how does it do things?
>>      
> No, Magma is an OODB, not an ORM and you can store any object in the
> database (including instances of Object).
>
>
> Levente
>
>    
>> On 3/30/10 17:55 , Levente Uzonyi wrote:
>>      
>>> On Tue, 30 Mar 2010, Andres Valloud wrote:
>>>
>>>
>>>        
>>>> What is the purpose of adding an id to so many objects?  Is this a real
>>>> application problem?  Can you be more specific as to the context?
>>>>
>>>>          
>>> AFAIK Magma uses it to assign unique ids to objects.
>>>
>>>
>>> Levente
>>>
>>>
>>>        
>>>> On 3/30/10 9:21 , Levente Uzonyi wrote:
>>>>
>>>>          
>>>>> On Tue, 30 Mar 2010, Andres Valloud wrote:
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> I mentioned implementing identityHash as hash as an example given.  I
>>>>>> also
>>>>>> think I mentioned the rule x == y =>    x identityHash = y identityHash,
>>>>>> so
>>>>>> I
>>>>>> hope it's clear that one shouldn't just blindly move ahead in these
>>>>>> matters.
>>>>>> The point is that nothing prevents anybody from adding an instance
>>>>>> variable
>>>>>> called identityHash to their objects, storing arbitrary (but well
>>>>>> chosen!)
>>>>>> small integers in said instance variable, and then having the
>>>>>> identityHash
>>>>>> message just answer said values.  If you do this for the cases in which
>>>>>> you
>>>>>> have significantly more than 4096 objects, then you only pay the price
>>>>>> to
>>>>>> hold better identityHash values for the objects that need them (as
>>>>>> opposed
>>>>>> to
>>>>>> every single object in the image when the header is expanded instead).
>>>>>> Or
>>>>>> perhaps you don't really need identity for the cases being discussed in
>>>>>> practice, and just using hash as identityHash is fine.  It's hard to
>>>>>> tell
>>>>>> without concrete examples.  One way or the other, I do not think the
>>>>>> size
>>>>>> of
>>>>>> the identityHash field *must* result in poor hashed collection
>>>>>> performance.
>>>>>> Such an implication does not follow.
>>>>>>
>>>>>>
>>>>>>              
>>>>> The original problem is: pick 1000000 or more objects and associate an id
>>>>> to them.
>>>>> - Using #hash and #== doesn't work help, because the objects state may
>>>>> change.
>>>>> - Adding a slot to every object can't be done, because the class of the
>>>>> objects can be anything. Adding a slot to Object would break lots of
>>>>> code.
>>>>> - Using #largeHash (12 bits from the object + 12 bits from the object's
>>>>> class) doesn't help, because there may be only a few different classes.
>>>>>
>>>>>
>>>>> Levente
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> On 3/30/10 1:48 , Andreas Raab wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> On 3/30/2010 1:09 AM, Andres Valloud wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> Right, so provide a better identityHash implementation in the image
>>>>>>>> (e.g.: implement hash and then have identityHash call hash instead),
>>>>>>>> and
>>>>>>>> problem solved...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>> Except that #hash is not constant over the lifetime of most objects but
>>>>>>> #identityHash is. So if you have a property associated with an object
>>>>>>> in
>>>>>>> a IDDict and the #hash depends on a value of a variable it may change
>>>>>>> over the lifetime of the object and your key gets invalid.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>        - Andreas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> On 3/26/10 3:37 , Levente Uzonyi wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> On Thu, 25 Mar 2010, Andres Valloud wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> If lookups find the sought object in mostly one attempt, the
>>>>>>>>>> primitive is
>>>>>>>>>> overkill... most of the time, the real issue is the quality of the
>>>>>>>>>> hash
>>>>>>>>>> function.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>> That's true, but this is not the case with the 4096 hash values.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Levente
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> On 3/25/10 1:27 , Levente Uzonyi wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>>>> On Thu, 25 Mar 2010, Igor Stasenko wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>>>> i think that #pointsTo: is a cheat :), which you can use in Sets
>>>>>>>>>>>> but
>>>>>>>>>>>> not dictionaries, because
>>>>>>>>>>>> it contains associations. Also, it works only for identity-based
>>>>>>>>>>>> collections.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                          
>>>>>>>>>>> Dictionaries don't have to use associations (for example
>>>>>>>>>>> MethodDictionary
>>>>>>>>>>> doesn't use them), that's why #pointsTo: works (MethodDictionary
>>>>>>>>>>> also
>>>>>>>>>>> uses it).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>>>>> I wonder how LargeIdentityDictionary compares to your
>>>>>>>>>>>>> dictionaries'.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                            
>>>>>>>>>>>> me too.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                          
>>>>>>>>>>> If you give me a pointer to the source code, I can run the
>>>>>>>>>>> benchmarks.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Levente
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>>                      
>>>>>>>>>                    
>>>>>>>>
>>>>>>>>                  
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>
>>>>>>              
>>>>> .
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>> .
>>>
>>>
>>>        
>>
>>      
> .
>
>    



More information about the Squeak-dev mailing list