[Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske maxleske at gmail.com
Wed May 22 17:22:41 UTC 2013


Thanks for the suggestions, Dave. I wanted to get some more information before replying; it took me longer than I expected, though.

I was able to pin the problem down to a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in instance variables. I could serialize them both separately but not together. I could also serialize one of them together with a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. For example, this worked:

arrayToSerialize := {
	coll1 first: 180. "coll1 has 200 elements"
	coll2 }

As did this:

arrayToSerialize := {
	coll1 last: 20.
	coll2 }

While this didn't:

arrayToSerialize := {
	coll1 first: 180. "coll1 has 200 elements"
	coll1 last: 20.
	coll2 }

But then again, this did:

arrayToSerialize := { coll2 }

As did this:

arrayToSerialize := { coll1 }

Excluding elements that referenced each other across coll1 and coll2 didn't make a difference either.
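
For reference, each of the experiments above boiled down to something like this (a minimal sketch, not the exact code from my image; 'test.fuel' is just a placeholder file name, and I'm using Fuel's convenience entry point FLSerializer class>>serialize:toFileNamed:):

	arrayToSerialize := {
		coll1 first: 180.
		coll2 }.
	FLSerializer serialize: arrayToSerialize toFileNamed: 'test.fuel'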


I have no clue at all what the problem might be here. My only (far-fetched) guess is that maybe one of the objects was somehow corrupt (although that should have triggered exceptions, or at least made serialization impossible in all cases, especially when not forking).

Then, during testing, I messed up my snapshot and had to load a new one, and tadaaa: the problem's gone… Just weird… I just hope this never happens again. If it does, you'll hear from me :)



What I wrote below now seems obsolete. I'm leaving it in for reference.

Cheers and thanks a lot for your help!
Max


On 16.05.2013, at 14:33, David T. Lewis <lewis at mail.msen.com> wrote:

> First to check to see if I understand the problem:
> 
> - You are forking an image using OSProcess, and the failures are occurring in
> the child VM/image that you created using one of the #forkSqueak methods.

Yes, precisely. I'm forking headless by the way.
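
Roughly, the setup looks like this (a sketch only, not the exact code; the fork selector is the one I remember from OSProcess and may differ between versions, and the block body stands in for the real serialization):

	UnixProcess forkHeadlessSqueakAndDoThenQuit: [
		FLSerializer serialize: arrayToSerialize toFileNamed: 'test.fuel' ]

The parent carries on immediately; the child serializes and then quits.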

> 
> - You are serializing some objects in the child image so that they can be
> written to a serialized output file. You are doing this in the background
> child VM/image so that your main image can continue doing whatever it was
> doing without being impacted by the serializing process.

Yes.

> 
> - If you do this 10 times, it will fail about 4 times out of the 10 on average.

Yes, very roughly.

> 
> From the stack traces, it looks like your VM is hanging up in a normal
> garbage collection sweep that is happening as a side effect of allocating
> a new object. The first thing that comes to mind is that your VM is hanging
> up while trying to allocate more memory than the OS can provide to it, and
> it appears to be stuck waiting for memory to be provided by the OS. I'm
> suggesting this because I've seen similar symptoms when I intentionally try
> to allocate very large object memories, so perhaps there may be something
> in your runtime environment (OS settings) that limits available memory.
> This is a guess only, but it might give you some ideas of where to look.

I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.

> 
> This might be a VM problem, or it might be a symptom of some other problem
> that just puts the VM and OS into a condition where they are unexpectedly
> trying to use a very large amount of memory.

That's what I hope, but I'm pretty clueless.

> 
> If possible it would be good to try to rule out the OSProcess forkImage as
> a possible contributing factor. If you were to do the serialization 10 times
> in the main image, rather than in the background child image, would it
> still fail about 4 times out of 10? This could be a factor in memory use.

Serializing without forking always succeeds.

> Although forkSqueak is very memory efficient, if the child and parent
> images do a lot of different things (such as try to serialize a bunch of
> stuff in the child image), then the OS will be forced to map in enough
> real memory to satisfy both of the images.

As pointed out above, there wasn't a single swap operation. htop shows swap usage steady at roughly 20%, and there's always memory available.

> 
> Another thing to check, if possible, is to see if the OS is doing a
> lot of disk swapping at the time of the failure. If you see all of the
> system memory in use, along with a lot of disk activity, then you are
> very likely looking at a case of the VM trying to allocate more memory
> than the system can reasonably provide.

(Again, see above)
I/O is also pretty low. Nothing special there.

> 
> FWIW, here is what your VM is doing at the time of the "hang up". This
> involves scanning object memory through memEnd, which might be a very
> large value for one reason or another:
> 
> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd 
>    "update pointers in root objects"
>    | oop |
>    <inline: false>
>    1 to: rootTableCount do: [:i | 
>            oop := rootTable at: i.
>            ((self oop: oop isLessThan: memStart)
>                    or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
>                ifTrue: ["Note: must not remap the fields of any object twice!"
>                    "remap this oop only if not in the memory range 
>                    covered below"
>                    self remapFieldsAndClassOf: oop]]

I was able to look at two forked processes, one that didn't terminate (call it 1) and one that did (call it 2). What I saw was that 1 was doing GCs like crazy right from the start (like the stacks I posted), while 2 showed a lot of other calls too (e.g. interpret()) and, in my opinion, spent much less time in GC.

I tried two additional things:
1. After forking, I kill the weak finalization process. This seems to have improved the situation but isn't a fix.
2. I modified the GC options as follows:

	"Limit incremental GC activity to run every 100k allocations" 
	SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)" 

	"Limit tenuring threshold to only tenure w/ > 25k survivors" 
	SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)" 

	"Tell the VM that we'd rather grow than spinning in tight GC loops" 
	SmalltalkImage current gcBiasToGrow: true. "default: false" 

	"Tell the VM to do a fullGC for good measure if the above growth exceeded x MB" 
	SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0" 

This did not have any effect (as far as I can tell) on the problem.
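
One way to double-check that these settings actually stick in the forked child is to read the parameters back after applying them (a sketch; vmParameterAt: without put: simply answers the current value):

	SmalltalkImage current vmParameterAt: 5. "expect 100000"
	SmalltalkImage current vmParameterAt: 6. "expect 25000"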

> 			
> HTH,
> 
> Dave
> 
> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
>> 
>> Hi
>> 
>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with 
>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>> 
>> The object being processed when the lockup occurs is always of class TimeStamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>> 
>> I'm working on Debian, 32-bit, and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.
>> 
>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
>> 
>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
>> 
>> Cheers,
>> Max
>> 
>> Begin forwarded message:
>> 
>>> From: Mariano Martinez Peck <marianopeck at gmail.com>
>>> Subject: Re: [Pharo-fuel] Possible collections problem
>>> Date: 15. Mai 2013 16:53:10 MESZ
>>> To: The Fuel Project <pharo-fuel at lists.gforge.inria.fr>
>>> Reply-To: The Fuel Project <pharo-fuel at lists.gforge.inria.fr>
>>> 
>>> I cannot see anything in particular. Too much GC stuff.
>>> So I wouldn't spend more time trying to debug. I would try with the non-large collections. Then I would try with the latest Cog and the latest StackVM.
>>> 
>>> 
>>> 
>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <maxleske at gmail.com> wrote:
>>> I've had several forks hanging around just now. Apart from one, all of these were locked. I attached gdb and generated the C stack for all of them. I'm not sure there's anything really interesting in there, although clearly a lot of time is spent in GC and in object creation. That doesn't have to mean anything though.
>>> 
>>> I haven't yet tried your suggestion Mariano.
>>> 
>>> Cheers,
>>> Max
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>> 
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>> 
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>> 
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>> 
>>> 
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>> 
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x0806fed2 in clone ()
>>> #6  0x08070095 in primitiveClone ()
>>> #7  0x0806de15 in interpret ()
>>> #8  0x08073dfe in main ()
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>> 
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069270 in primitiveNewWithArg ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0xb76f0f68 in select () from /lib/libc.so.6
>>> 
>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
>>> #1  0x08070880 in aioPoll ()
>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
>>> #5  0x0806de15 in interpret ()
>>> #6  0x08073dfe in main ()
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>> 
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>> 
>>> 
>>> 
>>> [Thread debugging using libthread_db enabled]
>>> 0x08064e7e in markAndTrace ()
>>> 
>>> #0  0x08064e7e in markAndTrace ()
>>> #1  0x0806593a in markPhase ()
>>> #2  0x08065f60 in incrementalGC ()
>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
>>> #4  0x0806fed2 in clone ()
>>> #5  0x08070095 in primitiveClone ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>> 
>>> 
>>> 
>>> 
>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <marianopeck at gmail.com> wrote:
>>> 
>>>> Ok. So, the first thing you should try is to replace the uses of LargeIdentityDictionary with IdentityDictionary, and LargeIdentitySet with IdentitySet.
>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating the VM, since they use a particular primitive. 
>>>> Let us know!
>>>> 
>>>> 
>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <maxleske at gmail.com> wrote:
>>>> 
>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <marianopeck at gmail.com> wrote:
>>>> 
>>>>> Hi Max. Question, are you able to reproduce the problem?
>>>> 
>>>> Yes, but not "on purpose". The situation usually happens once or twice a day, and when it does, the log entries are consistent. That's why I want to use gdb the next time it happens.
>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <maxleske at gmail.com> wrote:
>>>>> Hi guys
>>>>> 
>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
>>>>> - it always fails in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevant but it's consistent)
>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
>>>>> 
>>>>> I've done extensive logging to a file, but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put:, sometimes somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
>>>>> 
>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
>>>>> 
>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, while at other times the snapshot runs through in no time.
>>>>> 
>>>>> So here's my theory: there might be a VM problem with HashedCollections.
>>>>> Now, the VM is a rather old one and I haven't had the chance to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
>>>>> 
>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
>>>>> 
>>>>> Cheers,
>>>>> Max


