OK, following up on this. Here is what I can add to the discussion:

On Linux, the latest VMs yield the following results (I added a space every three digits for readability):

"Pharo Cog"

1 tinyBenchmarks

'887 348 353 bytecodes/sec; 141 150 557 sends/sec'

"Pharo Stack"

1 tinyBenchmarks

'445 217 391 bytecodes/sec; 24 395 999 sends/sec'

While on Mac:

"Pharo Cog"

1 tinyBenchmarks

'895 104 895 bytecodes/sec; 138 102 772 sends/sec'

"Pharo Stack"

1 tinyBenchmarks

'3 319 502 bytecodes/sec; 217 939 sends/sec'

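So I'd say it's a problem in the CMake configuration or just in the compilation on Mac :). The Cog numbers are nearly identical on both platforms, but the Stack VM on Mac is more than 100 times slower than the Stack VM on Linux, both in bytecodes/sec and in sends/sec. I didn't test on Windows, though.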
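To put numbers on the gap, here is a minimal Python sketch that parses the four result strings above and prints the Cog/Stack ratio per platform. Python is only a convenient choice for a quick calculation, and the helper name `parse_result` is made up for illustration; it is not part of any VM tooling.

```python
import re

# Result strings copied from the runs above (spaces are digit separators).
RESULTS = {
    ("linux", "cog"): "887 348 353 bytecodes/sec; 141 150 557 sends/sec",
    ("linux", "stack"): "445 217 391 bytecodes/sec; 24 395 999 sends/sec",
    ("mac", "cog"): "895 104 895 bytecodes/sec; 138 102 772 sends/sec",
    ("mac", "stack"): "3 319 502 bytecodes/sec; 217 939 sends/sec",
}

def parse_result(text):
    """Return (bytecodes_per_sec, sends_per_sec) as integers."""
    fields = []
    for part in text.split(";"):
        digits = re.sub(r"\D", "", part)  # drop spaces and the unit label
        fields.append(int(digits))
    return tuple(fields)

for platform in ("linux", "mac"):
    cog = parse_result(RESULTS[(platform, "cog")])
    stack = parse_result(RESULTS[(platform, "stack")])
    print(platform,
          "bytecodes: Cog/Stack = %.1f" % (cog[0] / stack[0]),
          "sends: Cog/Stack = %.1f" % (cog[1] / stack[1]))
```

On the figures above, Cog is roughly 2 times (bytecodes) and 6 times (sends) faster than Stack on Linux, against roughly 270 times and 630 times on Mac.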

Another thing I noticed is that, since I updated Xcode, my VM builds on Mac no longer use GNU gcc but the LLVM one. I tried to go back to GNU gcc but couldn't make it work so far, heh.

On Thu, Feb 21, 2013 at 5:09 AM, Igor Stasenko <siguctua@gmail.com> wrote:

On 20 February 2013 18:29, Eliot Miranda <eliot.miranda@gmail.com> wrote:
>
> On Tue, Feb 19, 2013 at 11:10 PM, Camillo Bruni <camillobruni@gmail.com> wrote:
>>
>>
>> On 2013-02-20, at 01:25, Eliot Miranda <eliot.miranda@gmail.com> wrote:
>>
>>>
>>> On Tue, Feb 19, 2013 at 2:16 PM, Camillo Bruni <camillobruni@gmail.com> wrote:
>>>>
>>>>>> The most annoying piece is Time Machine and its disk access, I
>>>>>> sometimes forget to suspend it, but it was off during the
>>>>>> tinyBenchmark.
>>>>>
>>>>> One simple approach is to run the benchmark three times and to discard
>>>>> the best and the worst results.
>>>>
>>>> that is as good as taking the first one... if you want decent results
>>>> measure >30 times and do the only scientifically correct thing: avg + std deviation?
>>>
>>> If the benchmark takes very little time to run and you're trying to
>>> avoid background effects then your approach won't necessarily work
>>> either.
>>
>> true, but the deviation will most probably give you exactly that feedback.
>> if you increase the runs but the quality of the result doesn't improve
>> you know that you're dealing with some systematic error source.
>>
>> This approach is simply more scientific and less home-brewed.
>
> Of course, no argument here. But what's being discussed is using
> tinyBenchmarks as a quick smoke test. A proper CI system can be set
> up for reliable results, but IMO for a quick smoke test doing
> three runs manually is fine. IME, what tends to happen is that the
> first run is slow (caches heating up etc) and the second two runs are
> extremely close.

But not in the case where you have an order (or orders) of magnitude
speed degradation. This is too significant to be
considered measurement error or deviation.
There must be something wrong with the VM (cache always fails?).
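
A minimal Python sketch of that distinction between run-to-run noise and a real regression: it summarises repeated measurements as mean and sample standard deviation, then checks how far another measurement lies from the mean. The function names, the sample figures and the threshold of 3 standard deviations are all hypothetical, chosen only for illustration; they are not measurements from any VM.

```python
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation) of repeated measurements."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_outside_noise(value, samples, n_sigma=3.0):
    """True if value differs from the sample mean by more than n_sigma deviations."""
    mean, sd = summarize(samples)
    return abs(value - mean) > n_sigma * sd

# Hypothetical bytecodes/sec figures from repeated runs (made up for illustration).
baseline = [440e6, 447e6, 445e6, 443e6, 446e6]
mean, sd = summarize(baseline)
print("mean = %.0f, stdev = %.0f" % (mean, sd))

# A value inside the usual spread, then one two orders of magnitude lower.
print(is_outside_noise(444e6, baseline))
print(is_outside_noise(3.3e6, baseline))
```

With a spread of a few percent between runs, a hundredfold drop is flagged no matter how the threshold is tuned, which is why more runs would not change the conclusion here.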

> --
> best,
> Eliot

--
Best regards,
Igor Stasenko.