[Vm-dev] [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Fri Nov 29 14:53:20 UTC 2019

Hi Clément,

(For anyone else reading this: the thread had become quite long, so I've
chopped out quite a bit, assuming the context is already familiar.)

On Fri, 29 Nov 2019 at 11:53, Clément Béra <bera.clement at gmail.com> wrote:
>
>>
> Thanks Juraj. Are you both Feenk people?
> Are you mainly working on the VM Alistair? Or just having fun?

Yes, we're both feenk people. :-)

I'm not focusing on the VM as such, but when we have issues with the
VM I'm one of the people that tend to look at it.

> Not having FreeType and LibGit would be nice indeed. The difference between simulation performance 14Mb-54Mb is not really an issue for me, the bug happened on > 100Mb heap and simulation is still fairly fast.
> The problem is more to find a reliable way to crash soon after start-up, in some cases I start the simulator, go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.
>
> In most cases we reproduce bugs using the Squeak REPL image. See:
> https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtrunkreaderimage.sh
> I suggest you try using the simulator on the Squeak REPL; it's convenient, as you can run a few things and see what is going on. The REPL supports the chunk format (put a ! after each do-it).
> You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs that you can't catch from the Squeak one.

I've actually done both of these in the past: used the Squeak REPL and
built a Pharo version.

> Err. Maybe I forgot to write down a few steps here and commented a few other methods... I fixed it and then wrote the mail, I don't remember it all.
> I think indeed there was something accessing source or change files and I commented something in there.
> I'll try to check the change file later on.
> I don't have access to my laptop right now I'm at work so I cannot check.

I saw the list of changes you made, thanks.  I avoided the LibGit and
Zodiac issues by using the minimal image.

> Yes, running the script in the simulator generated me around 30 images (Save57.image to Save90.image). I frequently use saving from the simulator (usually Squeak image though). Should work.
> Then running the script again to 400 iterations from the VM I generated filled my local SSD :-).
> I don't remember which API I used though to save, maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed, I don't remember how.

That's the difference alright: #saveImageInFileNamed: checks that the
parent directory exists first, which uses FileAttributesPlugin, while
#snapshot:andQuit: doesn't do those checks.

> Yeah that's the main problem when debugging GC in general. Pharo is less deterministic than Squeak for some reason (things are happening in the background doing FFI calls).

I think this has more to do with the fact that Pharo has
#processPreemptionYields true, while Squeak has it false.  It means
that every IO and timer event can effectively change the active
process (if there are several at the same priority), so the order in
which processes complete is much less deterministic.
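
To make that concrete, here is a toy model in Python (not VM code; the
function name and the tick-based timing are made up for illustration).
It assumes a single run queue at one priority, where an "interrupt" is
a higher-priority process briefly preempting the active one.  With the
flag on, the preempted process goes to the back of its queue; with it
off, it resumes at the front.

```python
from collections import deque


def completion_order(work, interrupt_ticks, preemption_yields):
    """Return the order in which same-priority processes finish.

    work: dict of process name -> units of work (>= 1), in run-queue order.
    interrupt_ticks: set of ticks at which a higher-priority process
        preempts the active process (hypothetical timing model).
    preemption_yields: if True, a preempted process is moved to the back
        of the run queue; if False, it keeps its place at the front.
    """
    queue = deque([name, units] for name, units in work.items())
    finished = []
    tick = 0
    while queue:
        active = queue[0]
        active[1] -= 1                     # active process does one unit of work
        if active[1] == 0:
            finished.append(queue.popleft()[0])
        elif tick in interrupt_ticks and preemption_yields:
            queue.rotate(-1)               # preempted process goes to the back
        tick += 1
    return finished


work = {"A": 3, "B": 3, "C": 3}
print(completion_order(work, {1}, preemption_yields=False))
print(completion_order(work, {1}, preemption_yields=True))
print(completion_order(work, {1, 3}, preemption_yields=True))
```

With the flag off, the result is the same whenever the interrupts
arrive; with it on, shifting the interrupt ticks changes the completion
order, which is the non-determinism described above.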

> In both environment user events is a problem.
> That's why lemming debugging is very handy. And that's why OpenSmalltalk-VM development tools are far superior to other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though; in OpenSmalltalk-VM I guess the circular buffer of JIT simulation has a better time-spent-on-tools/productivity ratio and is enough for now.

Yep, I'll be reading your papers.

> And this is a crash. Performance pitfall issues are even harder to track down IMO.
>
>>
>> Also, while trying to reproduce your debug steps above, the image I
>> have already has memory leaks, so it isn't hitting the "self assert:
>> (self validRelocationPlanInPass: finalPass) = 0" assertion.
>
>
> You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify it is not already corrupted.
> If you're talking about starting simulation from the saved images from the script, I did not take the latest which crashed because it was already corrupted, I used 57 while 58 was saved and 59 only changes were saved. You can see at start-up if swizzling and the initial GC find leaks.

This image didn't show any problems with validImage, but you're right,
I'll need to go one image back.

Thanks!
Alistair