[squeak-dev] [Pharo-dev] Why do we have SmallDictionary?

Sat Jun 30 23:02:55 UTC 2018

On Sat, 30 Jun 2018, Chris Muller wrote:

>>> This discussion has limited its context to comparing *in-memory*
>>> Dictionary implementations, where open-addressing has an advantage.
>>> Using Magma reveals another context where chained Dictionary's can have
>>> some advantages:  When they're large, persistent, and shared in a multi-user
>>> system -- e.g., when grow rehashing is not an option.
>>
>>
>> B-trees were designed for that use case.
>> If growing is not an option, then your dictionary's performance will degrade
>> as its size increases, won't it?
>
> Yes, but only logarithmically.  But I was not making a
> performance-based comparison, only a possibility-based one.

Why is it logarithmic? If I have a hash table with an array of size 100,
and I insert 100000 elements into this table, then the best case scenario 
is that all the lists are of length 1000. So, the operation cost is not 
logarithmic but linear with a small constant: 0.01.
>
>>> Not only do chained Dictionary's make that scenario possible, they improve
>>> the concurrency -- different users can add/change/remove from the Dictionary
>>> as long as they occur at different keys.  With regular
>>> Dictionary's, concurrency guarantees a commit conflict.
>>
>> If you created a new, larger dictionary on grow and lazily copied the
>> entries from the smaller to the larger upon access, then I think you could
>> use open addressing hash tables.
>
> That's an interesting idea, but what if it grows beyond "the larger"
> again while still many entries in the smaller?  Add a third which is
> doubled in size again?  That sounds like it starts to intrude into the
> efficiency of open-addressing (both computation and space), but the
> idea still sparks some interesting design thoughts in my head..

Yes, add as many hash tables as necessary to the dictionary.
Let's say you have an empty dictionary and you want to insert n elements 
one by one. Let's say each hash table can hold twice as many elements as 
the previous one upon grow, and the initial hash table can hold 1 
element.
Once all elements are stored, the largest hash table will have at most n 
slots, the second largest will have n/2, the third largest will have n/4 
and so on down to 1. So, at most log2(n) hash tables will be necessary. 
Since each hash table operation costs O(1), the overall operation cost 
will be O(log(n)), and that's the worst possible case.
With some housekeeping you should be able to get rid of all the smaller 
hash tables. E.g. upon lookup, move the object from a smaller hash table
to the largest. Once all the elements are removed from a hash table, it 
can be removed from the dictionary.

Levente

>
>> Also, I don't see the concurrency issue with open addressing, because as
>> long as different users modify different chains, the case of concurrency is
>> the same as what you described, isn't it?
>
> Magma has a class called "MagmaPreallocatedDictionary" which does
> exactly this.  It's like a regular Dictionary, except for its internal
> 'array', it utilizes a MagmaArray, which is just like an Array except
> large and persistent and can grow in place (so able to maintain its
> identity).
>
> However, this alone was not enough to implement your proposal because
> of the 'tally'.  Clients would each be trying to propose their
> specific value for that variable = conflict.  Thankfully, Magma has
> yet another special class called MagmaCounter which fits that role
> beautifully.  Designed for multi-user systems, a MagmaCounter provides
> a concurrent counter.  Instances start at a #value of 0, after which
> any number of sessions may simultaneously increment or decrement (even
> by more than 1), but no one can actually hard set its value.
>
> Best,
>  Chris
>