[squeak-dev] SharedQueue does scale (was: Re: [Pharo-project] SharedQueue doesn't scale)

Igor Stasenko siguctua at gmail.com
Mon Oct 18 09:57:54 UTC 2010


On 18 October 2010 07:00, Levente Uzonyi <leves at elte.hu> wrote:
> On Sun, 17 Oct 2010, Levente Uzonyi wrote:
>
> snip
>
>> SharedQueue's code for "growing" (#makeRoomAtEnd) is crap IMHO. Because of
>> that it takes O(n) time to add or remove an element to or from the queue.
>> SharedQueue2 takes a much better approach, because it doesn't try to
>> reimplement a dynamic array, but uses an OrderedCollection instead.
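>>
>> For comparison, a queue that delegates to OrderedCollection needs no
>> custom growing logic at all. A rough sketch (with made-up names, not
>> the actual SharedQueue2 source):
>>
>> nextPut: anObject
>>         "Append under a lock; the OrderedCollection grows in amortized
>>         O(1), so there is no O(n) #makeRoomAtEnd-style copying."
>>         accessProtect critical: [ items addLast: anObject ].
>>         readSynch signal.
>>         ^ anObject
>>
>> next
>>         "Block until an element is available, then take the oldest."
>>         readSynch wait.
>>         ^ accessProtect critical: [ items removeFirst ]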
>
> I uploaded a new version of that method to the Trunk. I don't think it's
> really useful, because I'm pretty sure we will get rid of both SharedQueue
> and SharedQueue2 in the near future.
> Anyway, I did some benchmarks using Cog, and SharedQueue became surprisingly
> good, though it's still not even close to Igor's new FIFOQueue or
> AtomicSharedQueue.
>
> Benchmark #1: Make a large queue
>
> { SharedQueue. SharedQueue2. FIFOQueue. AtomicSharedQueue } collect: [ :queueClass |
>         | queue |
>         queue := queueClass new.
>         queueClass -> (
>                 (1 to: 5) collect: [ :run |
>                         Smalltalk garbageCollect.
>                         {
>                                 [ 1 to: 1000000 do: [ :each | queue nextPut: each ] ] timeToRun.
>                                 [ 1 to: 1000000 do: [ :each | queue next ] ] timeToRun } ]) ].
>
> SharedQueue->#(#(1028 1615) #(945 1557) #(976 2322) #(492 459) #(489 476)).
> SharedQueue2->#(#(1976 2284) #(1318 8298) #(982 692) #(1005 675) #(1002 665)).
> FIFOQueue->#(#(180 67) #(184 67) #(182 69) #(181 66) #(177 67)).
> AtomicSharedQueue->#(#(208 66) #(207 67) #(209 66) #(213 68) #(209 65)).
>
> This benchmark is similar to what Igor used, but it doesn't create a new
> queue between runs. It simply adds 1,000,000 elements and then removes
> them, which is a pretty unrealistic scenario for a shared queue. The
> effect of GC is pretty high in this benchmark; that's probably what
> causes the occasional high spikes.
>
> Benchmark #2: Sequential throughput test
>
> { SharedQueue. SharedQueue2. FIFOQueue. AtomicSharedQueue } collect: [ :queueClass |
>         | queue |
>         queue := queueClass new.
>         queueClass -> (
>                 (1 to: 5) collect: [ :run |
>                         | results |
>                         Smalltalk garbageCollect.
>                         results := #(0 0).
>                         1 to: 1000 do: [ :round |
>                                 results := results + {
>                                         [ 1 to: 1000 do: [ :each | queue nextPut: each ] ] timeToRun.
>                                         [ 1 to: 1000 do: [ :each | queue next ] ] timeToRun } ].
>                         results ]) ].
>
> SharedQueue->#(#(464 452) #(472 436) #(466 437) #(463 449) #(462 452)).
> SharedQueue2->#(#(949 692) #(980 663) #(984 670) #(992 670) #(958 677)).
> FIFOQueue->#(#(125 67) #(263 62) #(250 76) #(262 63) #(247 81)).
> AtomicSharedQueue->#(#(154 70) #(264 77) #(273 62) #(275 63) #(265 71)).
>
> This is similar to benchmark #1, but instead of adding and removing
> 1,000,000 elements at once, the work is chunked into 1,000 equal parts.
> It's more realistic than benchmark #1. It's interesting that both
> FIFOQueue and AtomicSharedQueue performed better in the previous
> benchmark, unlike the other two queues, which are better here.
>
> Benchmark #3: Concurrent throughput test
>
> { SharedQueue. SharedQueue2. FIFOQueue. AtomicSharedQueue } collect: [ :queueClass |
>         | queue semaphore |
>         queue := queueClass new.
>         semaphore := Semaphore new.
>         queueClass -> (
>                 (1 to: 5) collect: [ :run |
>                         | producers consumers |
>                         Smalltalk garbageCollect.
>                         producers := (1 to: 100) collect: [ :each |
>                                 [ 1 to: 10000 do: [ :index | queue nextPut: each ] ] newProcess ].
>                         consumers := (1 to: 100) collect: [ :each |
>                                 [
>                                         1 to: 10000 do: [ :index | queue next ].
>                                         semaphore signal ] newProcess ].
>                         [
>                                 consumers do: [ :each | each priority: 39; resume ].
>                                 producers do: [ :each | each priority: 39; resume ].
>                                 100 timesRepeat: [ semaphore wait ] ] timeToRun ]) ].
>
> SharedQueue->#(3143 2977 3034 2949 3021).
> SharedQueue2->#(4280 4384 4179 4160 4181).
> FIFOQueue->#(245 311 252 254 255).
> AtomicSharedQueue->#(277 274 277 280 274).
>
> This benchmark is the real concurrent stress test: 100 processes add
> 10,000 elements each to the queue while another 100 processes read from
> it. It clearly shows that Igor's queues are an order of magnitude faster.
> Also, for Igor's queues, 200 concurrent processes cause much less
> slowdown relative to the sequential tests.
>
> So, even though SharedQueue is now faster than SharedQueue2, both will have
> to go IMHO. :)
>

Thanks, Levente, for the feedback.
Please feel free to shape my classes into a more complete form (such as
proper naming), to make them ready for inclusion in both the Squeak and
Pharo cores.
I propose the following names:
AtomicQueue (base class) -> AtomicCollection
FIFOQueue -> AtomicQueue
LIFOQueue -> AtomicStack
If you, or anyone else, have better suggestions, speak now :)

In any case, I am open to discussing further details and possible
caveats of the new classes with anyone interested in using them.

P.S. As a side note, I can now explain (to myself in the first place)
why I intuitively chose to use atomic swap instead of CAS:
it fits better with Smalltalk's language semantics.

A swap operation in Smalltalk is implemented as two assignments:
x := y. y := z.
An assignment is a basic operation which has nothing to do with the
late-bound nature of the language.
Unless we are going to introduce meta-object protocol(s) that could turn
a simple assignment into message sends under the hood, it will remain a
basic, early-bound operation.
And even if we did, it is highly unlikely that we would throw away the
old, simple assignment, which identifies its source & target at
compile time.
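
For example, inside a queue's #nextPut: the swap idiom might look like
this (an illustrative sketch with made-up names; not the actual
FIFOQueue code):

| prevTail |
"Fetch-and-store: remember the old tail, then install the new one.
Both statements are plain assignments with no message send between
them, and since the VM checks for interrupts only at sends and
backward jumps, no other process can be scheduled in between."
prevTail := tail.
tail := newItem.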

In contrast, a CAS operation, written in Smalltalk, looks like:

(a == b) ifTrue: [ a := c ]

so it consists of two message sends (#== and #ifTrue:). From a strict,
pure language perspective this uses late-bound semantics (message
sends), and, as with any message send, the result and behavior cannot
be predicted at compile time. It is therefore wrong to assume that such
a statement could be an atomic operation.

Unless, of course, we introduce a new language syntax which will
denote a CAS operation explicitly.
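
If we did, it would have to bottom out in a VM primitive, since the
atomicity cannot be expressed with ordinary sends. A purely hypothetical
sketch (the method, primitive, and plugin names are all invented):

compareValueWith: expected andSetTo: newValue
        "Hypothetically: atomically compare the 'value' slot with
        expected, store newValue on a match, and answer whether the
        swap happened. The fallback code below has the right semantics
        but is NOT atomic - which is exactly the point."
        <primitive: 'primitiveCompareAndSwap' module: 'AtomicPlugin'>
        value == expected ifFalse: [ ^ false ].
        value := newValue.
        ^ true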

>
> Levente
>
>
> snip
>


-- 
Best regards,
Igor Stasenko AKA sig.


