Hi Eliot,
On 19/04/2013 17:23, Eliot Miranda wrote:
Hi Alex,
On Fri, Apr 19, 2013 at 6:16 AM, Alex Bradbury <asb@asbradbury.org> wrote:
On 19 April 2013 08:38, Casey Ransberger <casey.obrien.r@gmail.com> wrote:
>
> I had a brief chat with Con Kolivas, who did BFS (which implements
> kernel stuff that will make Cog happier under Linux on machines with
> sub-supercomputing quantities of CPUs) tonight.
>
> It sounds like there are actually two reasons it hasn't made it into
> the mainline kernel:
>
> a) he doesn't have time to support it, and
> b) the other kernel folks don't want it.
>
> Oh well. Since right now I'm focused on Raspbian, I sent a message
> explaining what it was, why I want it, etc. on their web board. If I
> do get it in, support would have to fall to me. Yikes, right? ;)
Yes, for political reasons it seems unlikely anything like BFS would
get into the upstream kernel. If someone can do work to actually show
noticeable performance gains then that would make us (the Raspberry Pi
Foundation) interested in exploring further. Real workloads that
perform much better with an alternative scheduler would be much more
interesting than microbenchmarks.
This isn't about workload or performance. It is
about basic functionality. The CFS scheduler does
not support multiple thread priorities for user
processes (actually, for the non-real-time
scheduling policy, and the real-time scheduling
policy is available only to superuser processes).
This isn't entirely true. Out-of-the-box unprivileged
processes can't change the scheduling policy, but in
kernels after 2.6.12 it is possible to configure your
system to allow this without resorting to setuid root.
Quoting from the man page for sched_setscheduler -
http://linux.die.net/man/2/sched_setscheduler
- (the privilege restrictions are the same as for
pthread_attr_setschedpolicy), "If an unprivileged process
has a nonzero RLIMIT_RTPRIO soft limit, then it can change
its scheduling policy and priority, subject to the
restriction that the priority cannot be set to a value
higher than the maximum of its current priority and its
RLIMIT_RTPRIO soft limit."
Using the pam_limits.so module, one can set the
RLIMIT_RTPRIO soft limit higher than zero, which then
allows the use of the SCHED_FIFO and SCHED_RR policies
with priorities up to the soft limit.
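To see what limit a process actually inherited, one can query it directly. The following is only a minimal sketch, assuming Linux with glibc; the function names rtprio_limits and report_rtprio are made up for illustration and are not part of any library. It relies only on getrlimit() with RLIMIT_RTPRIO.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Read the soft and hard RLIMIT_RTPRIO values of the calling process.
 * Returns 0 on success, -1 if getrlimit() fails (errno is then set). */
int rtprio_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit lim;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_RTPRIO, &lim) != 0)
        return -1;
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    return 0;
}

/* Hypothetical helper: print both limits to 'out'.
 * Returns 1 if the soft limit is nonzero (an unprivileged process may
 * then request SCHED_FIFO/SCHED_RR up to that priority), 0 if it is
 * zero, and -1 on error. */
int report_rtprio(FILE *out)
{
    rlim_t soft, hard;

    if (rtprio_limits(&soft, &hard) != 0)
        return -1;
    fprintf(out, "RLIMIT_RTPRIO soft=%llu hard=%llu\n",
            (unsigned long long)soft, (unsigned long long)hard);
    return soft > 0 ? 1 : 0;
}
```

Calling report_rtprio(stdout) from main() before and after changing the PAM limits should show whether the new limit reached the session.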
One way to achieve this is to add the following lines to
the file /etc/security/limits.conf.
*   hard   rtprio   1
*   soft   rtprio   1
or you can add a squeakvm.conf file to
/etc/security/limits.d with those same lines, e.g.:
# /etc/security/limits.d/squeakvm.conf
*   hard   rtprio   1
*   soft   rtprio   1
This grants the capability to all unprivileged users, but you
will need to log out and log in again for it to take effect,
as PAM limits are applied at user login.
The only problem with this approach is that it might
conflict with other global settings for rtprio. An
alternative is to grant the privilege to a group (e.g.
squeakvm) and then add users to that group, so that they
can select the SCHED_FIFO or SCHED_RR policies and change
the priorities of threads:
# /etc/security/limits.d/squeakvm.conf
@squeakvm   hard   rtprio   1
@squeakvm   soft   rtprio   1
This will grant the ability only to users in the squeakvm
group. The 1 in the examples above is the maximum real-time
priority those users may request. Higher values could be
used, but a value of at least 1 is necessary to enable the
capability.
Of course, the group needs to exist for this to take
effect:
sudo groupadd squeakvm
There's a handy test program on the pthread_setschedparam
man page -
http://linux.die.net/man/3/pthread_setschedparam
- that can be used for experimentation. I've attached the
source. I tried this out on an up-to-date Ubuntu Server
12.04 LTS VM running on a MacBook Pro under VMware Fusion.
YMMV.
pthreads_sched_test is a bit of a verbose name, so I named
the test program "schedtest" when I compiled it. Here are
the results of my tests.
First, compile the program:
gcc pthreads_sched_test.c -o schedtest -lpthread
The first set of tests was performed without making any
changes to the PAM limits.
Running schedtest without arguments gives the following:
./schedtest
Scheduler settings of main thread
   policy=SCHED_OTHER, priority=0
Scheduler settings in 'attr'
   policy=SCHED_OTHER, priority=0
   inheritsched is INHERIT
Scheduler attributes of new thread
   policy=SCHED_OTHER, priority=0
Trying to change the policy and priority of the new thread
the program creates gives the following
./schedtest -ar1 -i e
Scheduler settings of main thread
   policy=SCHED_OTHER, priority=0
Scheduler settings in 'attr'
   policy=SCHED_RR, priority=1
   inheritsched is EXPLICIT
pthread_create: Operation not permitted
Trying to change the priority of the main thread gives
./schedtest -mr1
pthread_setschedparam: Operation not permitted
As Eliot described, the default configuration prevents
unprivileged user processes from changing the priority or
scheduling policy.
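For readers without the attached source, the essence of the "-ar1 -i e" case can be sketched in a few lines. This is not the man page program, just a rough illustration assuming Linux and glibc; try_rr_thread and noop are hypothetical names. It asks pthread_create() for a SCHED_RR thread with explicit (non-inherited) scheduling attributes and returns the error code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Thread body that does nothing; only the creation attempt matters. */
static void *noop(void *arg)
{
    return arg;
}

/* Hypothetical helper: try to create a thread under SCHED_RR at the
 * given priority, using PTHREAD_EXPLICIT_SCHED so the attributes are
 * honoured instead of being inherited from the creating thread.
 * Returns 0 on success or an errno-style code; EPERM is what an
 * unprivileged process with a zero RLIMIT_RTPRIO soft limit gets. */
int try_rr_thread(int priority)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = priority };
    pthread_t tid;
    int rc;

    assert(priority >= sched_get_priority_min(SCHED_RR) &&
           priority <= sched_get_priority_max(SCHED_RR));

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(&attr, SCHED_RR);
    if (rc == 0)
        rc = pthread_attr_setschedparam(&attr, &param);
    if (rc == 0) {
        rc = pthread_create(&tid, &attr, noop, NULL);
        if (rc == 0)
            rc = pthread_join(tid, NULL);
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

With the default configuration, try_rr_thread(1) would be expected to return EPERM, matching the "Operation not permitted" output above; once a nonzero rtprio limit is in place it should return 0.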
After adding the /etc/security/limits.d/squeakvm.conf file
described above, adding my user to the squeakvm group and
logging out and back in again, the tests are somewhat more
successful. Note that these are the only additional
privileges given to the squeakvm group.
./schedtest -ar1 -i e
Scheduler settings of main thread
   policy=SCHED_OTHER, priority=0
Scheduler settings in 'attr'
   policy=SCHED_RR, priority=1
   inheritsched is EXPLICIT
Scheduler attributes of new thread
   policy=SCHED_RR, priority=1
./schedtest -mr1 -ao0 -i e
Scheduler settings of main thread
   policy=SCHED_RR, priority=1
Scheduler settings in 'attr'
   policy=SCHED_OTHER, priority=0
   inheritsched is EXPLICIT
Scheduler attributes of new thread
   policy=SCHED_OTHER, priority=0
Does this give sufficient flexibility without having to
patch the kernel's scheduler (whatever its name)?
Cheers,
Steve
AFAIA it is the only main-stream pthreads
scheduler that doesn't. AFAIA BFS (what a name?!)
does support multiple thread priorities for user
processes.
Within the Squeak Cog VM (and in a number of
other VMs, Smalltalk and Java VMs amongst them)
there's a heartbeat which is used to cause the VM to
periodically break out of normal processing and poll
for events. A heartbeat is both much more
efficient, and more regular than e.g. decrementing a
counter as part of normal processing (e.g. frame
build on entering non-leaf methods). Ideally the
heartbeat is implemented as a thread spinning,
blocking in e.g. nanosleep and then forcing the
breakout before entering nanosleep again. But this
requires that the heartbeat thread runs at a higher
priority than the main VM thread(s). On Linux
under the CFS this isn't possible. The fallback is
to use an interval timer (setitimer with
ITIMER_REAL) and a signal handler (for SIGALRM).
This is a poor substitute:
- system calls are interrupted, which can play
havoc with external code
- when debugging the heartbeat signal must be
disabled because otherwise one is constantly
stepping into the signal handler
- certain Linux kernels have bugs with signal
delivery and threads which can cause the loss of a
thread's context, ending up with two threads having
the same context, hence the setitimer approach works
only with a strictly single-threaded VM (this is a
bug I found and worked around late last year in Red
Hat Enterprise Linux WS release 4 (Nahant Update 4)
vintage kernels, which alas I have customers using)
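The thread-based heartbeat described above can be sketched roughly as follows. This is not the Cog VM's code; all names (heartbeat_start, heartbeat_stop, heartbeat_check_and_clear, the poll flag) are invented for illustration, and it assumes Linux, glibc and C11 atomics. Passing rt_priority > 0 requests SCHED_FIFO, which fails with EPERM unless the process is allowed real-time priorities.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical VM state: the heartbeat sets poll_requested and the
 * interpreter clears it when it polls for events. */
static atomic_int poll_requested;
static atomic_int heartbeat_running;
static atomic_long heartbeat_ticks;
static struct timespec heartbeat_interval;

static void *heartbeat_loop(void *arg)
{
    (void)arg;
    while (atomic_load(&heartbeat_running)) {
        nanosleep(&heartbeat_interval, NULL);  /* block for one beat */
        atomic_store(&poll_requested, 1);      /* force the breakout */
        atomic_fetch_add(&heartbeat_ticks, 1);
    }
    return NULL;
}

/* Start the heartbeat. interval_ns must be below one second.
 * rt_priority == 0 keeps the default policy; rt_priority > 0 asks for
 * SCHED_FIFO at that priority. Returns 0 or an errno-style code. */
int heartbeat_start(pthread_t *tid, long interval_ns, int rt_priority)
{
    pthread_attr_t attr;
    int rc;

    assert(interval_ns > 0 && interval_ns < 1000000000L);
    heartbeat_interval.tv_sec = 0;
    heartbeat_interval.tv_nsec = interval_ns;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    if (rt_priority > 0) {
        struct sched_param param = { .sched_priority = rt_priority };
        rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        if (rc == 0)
            rc = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        if (rc == 0)
            rc = pthread_attr_setschedparam(&attr, &param);
    }
    if (rc == 0) {
        atomic_store(&heartbeat_running, 1);
        rc = pthread_create(tid, &attr, heartbeat_loop, NULL);
        if (rc != 0)
            atomic_store(&heartbeat_running, 0);
    }
    pthread_attr_destroy(&attr);
    return rc;
}

/* Ask the heartbeat thread to exit and wait for it. */
int heartbeat_stop(pthread_t tid)
{
    atomic_store(&heartbeat_running, 0);
    return pthread_join(tid, NULL);
}

/* Called by the interpreter at safe points: returns 1 if a poll is
 * due (and clears the request), 0 otherwise. */
int heartbeat_check_and_clear(void)
{
    return atomic_exchange(&poll_requested, 0);
}
```

A VM would call heartbeat_check_and_clear() at its safe points and poll for events when it returns 1. Without the raised priority the heartbeat thread competes with the VM thread on equal terms, which is the problem discussed here.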
Either of these solutions would seem
straightforward from the outside:
- make SCHED_RR and/or SCHED_FIFO available to
unprivileged user processes.
- implement multiple priorities for SCHED_OTHER
Expecting to be able to install a VM as a setuid
program is not realistic.
I think you'll find that this kind of
architectural issue is present in a number of
multi-media applications, not just dynamic language
virtual machines. The restriction to a single
thread priority is, frankly, pathetic. If you see
Raspbian and Pi as a platform for multi-media apps
then I would urge you to bring any influence you
have to bear on getting the Linux kernel community
to provide multiple thread priorities. The lack
thereof is a significant limitation.
best regards,
Eliot Miranda
Of course the next step after that wouldn't be dumping the upstream
scheduler and switching to BFS, but it would certainly justify taking
a closer look.
I'm not entirely sure why you want to fork BFS - as far as I can see
Con Kolivas is keeping the BFS and his larger -ck patchset up to date
with upstream releases.
In conclusion (from a Raspberry Pi perspective): please do play with
BFS on the pi, do something useful with it (if it solves the recently
discussed issues with heartbeat+cogvm then swell), then let's think
about where to go from there.
Regards,
Alex
--
best,
Eliot
--
You can follow me on twitter at http://twitter.com/smalltalkhacker