Hi Eliot,
On 22/07/10 22:44, Eliot Miranda wrote:
just looks like the OS/run-time is not letting the program set a handler for SIGUSR2 and/or not allowing it to be caught. This is a deal breaker. Why it's happening I don't know, but currently Cog's heartbeat on linux depends on being able to catch SIGUSR2.
From: http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html
H.4: With LinuxThreads, I can no longer use the signals SIGUSR1 and SIGUSR2 in my programs! Why?
The short answer is: because the Linux kernel you're using does not support realtime signals.
I'd forgotten all that! I thought that stuff was ancient history. So we need two things, one is a pair of alternative signals, the other is a reliable #define that we can use to distinguish l'ancien regime from the modern day.
Something smells fishy about signals specifically reserved for user apps then being re-reserved for something else, and since the page I linked to begins with the warning "This FAQ has not been updated for a while and may not be 100% up to date", I am trying to clarify the situation. I'm still at it, but here is what I have dug up so far:
- "LinuxThreads" generally refers to threading on pre-2.6 kernels
- "NPTL", Native POSIX Threads Library for Linux, replaces "LinuxThreads" on 2.5+ kernels (publicly 2.6+)
- "NGPT", IBM's Next Generation POSIX Threads, for 2.4 kernels and earlier, works/worked in conjunction with LinuxThreads
- RedHat back-ported NPTL to pre-2.6 kernels and made the threading model selectable between NPTL/LinuxThreads on a per-process basis
To determine the threading library that a system uses (example shown for my system: Ubuntu 9.10, 2.6.31-22-generic #60-Ubuntu SMP Thu May 27 00:22:23 UTC 2010 i686 GNU/Linux):
> getconf GNU_LIBPTHREAD_VERSION
NPTL 2.10.1
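The same query can be made from inside a program, which might be one route to a runtime check for the threading library. The following is only a sketch, assuming glibc, where confstr() accepts _CS_GNU_LIBPTHREAD_VERSION; the helper names threading_library_version() and uses_nptl() are made up for illustration and are not part of the VM sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copy the threading library name and version
 * (e.g. "NPTL 2.10.1") into buf. Returns 0 on success, or -1 if the C
 * library does not report it or the buffer is too small. */
int threading_library_version(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    /* confstr() returns the size needed including the terminating NUL,
     * or 0 if the name is not supported. */
    size_t needed = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    if (needed == 0 || needed > len) {
        buf[0] = '\0';
        return -1;
    }
    return 0;
#else
    return -1; /* not glibc, or a glibc too old to define the name */
#endif
}

/* Hypothetical helper: 1 if the reported library is NPTL, 0 otherwise
 * (including when nothing is reported at all). */
int uses_nptl(void)
{
    char buf[64];
    if (threading_library_version(buf, sizeof buf) != 0)
        return 0;
    return strncmp(buf, "NPTL", 4) == 0;
}
```

This only answers the question at run time. A compile-time #define that separates the two regimes would still have to come from somewhere else, such as a configure-time test.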
So that fishy smell might come from a RedHat-specific red herring ;-) Most likely, on modern distros with a 2.6+ kernel:
A) NPTL *is* being used
B) SIGUSR1/2 are *not* reserved
C) SIGUSR1/2, if they are indeed the source of any problems, may be getting used elsewhere (in a plugin perhaps)
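Point B can be checked on a given machine with a small stand-alone test: install a handler for the signal, raise it, and see whether the handler ran. This is a rough sketch under the assumption of a single-threaded test program; can_catch_signal() and note_signal() are illustrative names only, and it says nothing about what a plugin might do to the handler later.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t signal_seen = 0;

/* Handler: only records that it ran. */
static void note_signal(int signo)
{
    (void)signo;
    signal_seen = 1;
}

/* Hypothetical check: returns 1 if a handler for signo could be installed
 * and was invoked, 0 if the signal is blocked or was not delivered, and
 * -1 if sigaction() refused to install the handler. */
int can_catch_signal(int signo)
{
    struct sigaction sa, old;
    sigset_t current;

    /* A blocked signal would stay pending rather than reach the handler. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0 &&
        sigismember(&current, signo) == 1)
        return 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) != 0)
        return -1;

    signal_seen = 0;
    raise(signo);                 /* handler runs before raise() returns */
    int seen = (signal_seen != 0);

    sigaction(signo, &old, NULL); /* restore the previous disposition */
    return seen;
}
```

Calling can_catch_signal(SIGUSR2) early in a test program and again after loading plugins would help tell "reserved by the threading library" apart from "handler replaced by something else".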
More on this below, but first I would throw into the mix what, in my limited experience, has sometimes been the source of odd problems: the handling of EINTR/EAGAIN errors and, AFAICT, the increased likelihood of these errors occurring the busier a process and/or the system in general is. I have seen code written pre-2.6 that worked well even post-2.6 until system load increased and/or it was used in a multi-threaded program. Some ioctl() calls then fail, but since the immediate code does not handle EINTR/EAGAIN, the result is some obscure error at a higher level.
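To make the EINTR/EAGAIN point concrete, here is a minimal sketch of the kind of wrapper that such code tends to be missing. It assumes Linux (FIONREAD on a pipe is used only as a convenient demonstration); ioctl_retry() and ioctl_retry_selfcheck() are hypothetical names, and the retry policy for EAGAIN (a bounded number of attempts with a 1 ms pause) is an arbitrary choice rather than a recommendation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: retry ioctl() when it is interrupted by a signal
 * (EINTR), and retry a bounded number of times on EAGAIN. Any other error
 * is returned to the caller with errno left as ioctl() set it. */
int ioctl_retry(int fd, unsigned long request, void *arg, int max_eagain_retries)
{
    assert(max_eagain_retries >= 0);
    int eagain_left = max_eagain_retries;

    for (;;) {
        int rc = ioctl(fd, request, arg);
        if (rc != -1)
            return rc;
        if (errno == EINTR)
            continue;                            /* interrupted: try again */
        if (errno == EAGAIN && eagain_left-- > 0) {
            struct timespec pause = { 0, 1000000 }; /* 1 ms */
            nanosleep(&pause, NULL);
            continue;
        }
        return -1;                               /* genuine failure */
    }
}

/* Usage example: write 5 bytes into a pipe and ask how many are readable.
 * Returns the byte count reported by FIONREAD, or -1 on failure. */
int ioctl_retry_selfcheck(void)
{
    int fds[2];
    int pending = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "hello", 5) != 5 ||
        ioctl_retry(fds[0], FIONREAD, &pending, 3) != 0)
        pending = -1;
    close(fds[0]);
    close(fds[1]);
    return pending;
}
```

With a periodic heartbeat signal in the process, calls that can return EINTR will see it far more often than before, so unwrapped call sites are the first place worth looking.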
Like Paul and Rob I got the VM compiled last weekend but it would either crash immediately or after maybe 15/20s. Then I crashed when I got to the stage of trying to debug a multi-threaded application :-)
I'm admittedly largely ignorant of how Cog changes the VM, and I apologise if my post re SIGUSR1/2 does prove to be a red herring, but I'm still wondering about the need for multi-threading for non-Teleplace use. From your latest email it seems as if multi-threading has been introduced to support high priority for Teleplace "media processing". If this is functionality that will remain private to Teleplace and there is no clear benefit to others from a core VM high-priority thread, then, given the difficulties debugging, could public Cog be single-threaded?
-D