[Vm-dev] BFS and CFS and Cogs, Oh My

Eliot Miranda eliot.miranda at gmail.com
Mon Apr 22 18:02:15 UTC 2013


Thanks, Steve, that's great news!   I'll try and look at this really soon.

On Sun, Apr 21, 2013 at 5:10 AM, Steve Rees <
squeak-vm-dev at vimes.worldonline.co.uk> wrote:

>
>  Hi Eliot,
>
> On 19/04/2013 17:23, Eliot Miranda wrote:
>
> Hi Alex,
>
> On Fri, Apr 19, 2013 at 6:16 AM, Alex Bradbury <asb at asbradbury.org> wrote:
>
>>
>> On 19 April 2013 08:38, Casey Ransberger <casey.obrien.r at gmail.com>
>> wrote:
>> >
>> > I had a brief chat with Con Kolivas, who did BFS (which implements
>> kernel stuff that will make Cog happier under Linux on machines with
>> sub-supercomputing quantities of CPUs) tonight.
>> >
>> > It sounds like there are actually two reasons it hasn't made it into
>> the mainline kernel:
>> >
>> > a) he doesn't have time to support it, and
>> > b) the other kernel folks don't want it.
>> >
>> > Oh well. Since right now I'm focused on Raspbian, I sent a message
>> explaining what it was, why I want it, etc on their web board. If I do get
>> it in, support would have to fall to me. Yikes, right? ;)
>>
>> Yes, for political reasons it seems unlikely anything like BFS would
>> get in to the upstream kernel. If someone can do work to actually show
>> noticeable performance gains then that would make us (the Raspberry Pi
>> Foundation) interested in exploring further. Real workloads that
>> perform much better with an alternative scheduler would be much more
>> interesting than microbenchmarks.
>
>
>  This isn't about workload or performance.  It is about basic
> functionality.  The CFS scheduler does not support multiple thread
> priorities for user processes (actually, for the non-real-time scheduling
> policy, and the real-time scheduling policy is available only to superuser
> processes).
>
> This isn't entirely true. Out-of-the-box unprivileged processes can't
> change the scheduling policy, but in kernels after 2.6.12 it is possible to
> configure your system to allow this without resorting to setuid root.
>
> Quoting from the man page for sched_setscheduler -
> http://linux.die.net/man/2/sched_setscheduler - (the privilege
> restrictions are the same as for pthread_attr_setschedpolicy), "If an
> unprivileged process has a nonzero RLIMIT_RTPRIO soft limit, then it can
> change its scheduling policy and priority, subject to the restriction that
> the priority cannot be set to a value higher than the maximum of its
> current priority and its RLIMIT_RTPRIO soft limit."
>
> Using the pam_limits.so module, one can set the RLIMIT_RTPRIO soft limit
> higher than zero, which then allows the use of the SCHED_FIFO and SCHED_RR
> policies with priorities up to the soft limit.
>
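The rule quoted from the man page can be checked from inside a program. The following is a minimal sketch, assuming Linux with glibc; the helper names rtprio_soft_limit and rt_request_allowed are hypothetical and do not come from any VM source. It only compares against the soft limit, so it ignores the "current priority" clause of the rule and any privileges a root process would have.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>   /* getrlimit, RLIMIT_RTPRIO (Linux-specific) */

/* Return the RLIMIT_RTPRIO soft limit of the calling process.
 * 0 means an unprivileged process may not use SCHED_FIFO or SCHED_RR.
 * Returns -1 if getrlimit itself fails. */
long rtprio_soft_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_RTPRIO, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }
    if (lim.rlim_cur == RLIM_INFINITY)      /* no ceiling configured */
        return sched_get_priority_max(SCHED_RR);
    return (long)lim.rlim_cur;
}

/* Return 1 if a request for the given real-time priority fits under the
 * soft limit, 0 otherwise. Simplification: privileged processes and the
 * caller's current priority are not taken into account. */
int rt_request_allowed(int priority)
{
    long limit = rtprio_soft_limit();

    return limit >= 0 && (long)priority <= limit;
}
```

With the default configuration described in this thread one would expect rtprio_soft_limit() to report 0, and a value of 1 after the limits.conf change takes effect at the next login.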
> One way to achieve this is to add the following lines to the file
> /etc/security/limits.conf.
>
> *    hard    rtprio    1
> *    soft    rtprio    1
>
> or you can add a squeakvm.conf file to /etc/security/limits.d with those
> same lines, eg.
>
> # /etc/security/limits.d/squeakvm.conf
> *    hard    rtprio    1
> *    soft    rtprio    1
>
> This grants this capability to unprivileged users, but you will need to
> log out and log in again for it to take effect, as PAM limits are applied at
> user login.
>
> The only problem with this approach is that there's a possibility it might
> conflict with other global settings for the rtprio. Another alternative is
> to grant the privilege to a group (eg. squeakvm) and then add users to that
> group to allow the ability to change the SCHED_FIFO or SCHED_RR policies
> and to change the priorities of threads:
>
> # /etc/security/limits.d/squeakvm.conf
> @squeakvm    hard    rtprio    1
> @squeakvm    soft    rtprio    1
>
> This will grant the ability only to users in the squeakvm group. The 1 in
> the examples above is the maximum real-time priority allowed. Higher limits
> could be used, but a limit of at least 1 is necessary to enable the
> capability.
>
> Of course the group needs to exist for this to take effect.
>
> sudo groupadd squeakvm
>
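On the choice of 1 as the rtprio value in the limits entries above: the valid priority range for each policy can be queried at run time. This is a small sketch, assuming a POSIX system; the helper name rt_priority_range is hypothetical. On Linux the real-time policies start at priority 1, while SCHED_OTHER only accepts priority 0, which matches the "priority=0" lines in the test output further down.

```c
#include <sched.h>

/* Store the lowest and highest static priority accepted for a scheduling
 * policy (SCHED_OTHER, SCHED_FIFO or SCHED_RR) in *lo and *hi.
 * Returns 0 on success, -1 if the policy is not recognised. */
int rt_priority_range(int policy, int *lo, int *hi)
{
    int min = sched_get_priority_min(policy);
    int max = sched_get_priority_max(policy);

    if (min == -1 || max == -1)
        return -1;
    *lo = min;
    *hi = max;
    return 0;
}
```

Calling rt_priority_range(SCHED_RR, &lo, &hi) gives the range from which an rtprio limit would be chosen; a limit equal to the lowest value is the smallest one that enables the real-time policies.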
> There's a handy test program on the pthread_setschedparam man page -
> http://linux.die.net/man/3/pthread_setschedparam - that can be used for
> experimentation. I've attached the source. I tried this out on an
> up-to-date Ubuntu Server 12.04 LTS VM running on a MacBook Pro under VMware
> Fusion. YMMV.
>
> pthreads_sched_test is a bit of a verbose name, so I named the test
> program "schedtest" when I compiled it. Here are the results of my tests.
>
> First, compile the program
>
> gcc pthreads_sched_test.c -o schedtest -lpthread
>
> The first set of tests were performed without making any changes to the
> PAM limits.
>
> Running schedtest without arguments gives the following
>
> ./schedtest
>
>  Scheduler settings of main thread
>     policy=SCHED_OTHER, priority=0
>
> Scheduler settings in 'attr'
>     policy=SCHED_OTHER, priority=0
>     inheritsched is INHERIT
>
> Scheduler attributes of new thread
>     policy=SCHED_OTHER, priority=0
>
>  Trying to change the policy and priority of the new thread the program
> creates gives the following
>
> ./schedtest -ar1 -i e
>
>  Scheduler settings of main thread
>     policy=SCHED_OTHER, priority=0
>
> Scheduler settings in 'attr'
>     policy=SCHED_RR, priority=1
>     inheritsched is EXPLICIT
>
> pthread_create: Operation not permitted
>
>  Trying to change the priority of the main thread gives
>
> ./schedtest -mr1
>
>  pthread_setschedparam: Operation not permitted
>
>  As Eliot described, the default configuration prevents unprivileged user
> processes from changing the priority or scheduling policy.
>
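The failing "-mr1" case above can be reproduced in a few lines. This is a sketch only, assuming Linux; it uses sched_setscheduler (the call whose man page is quoted earlier, with the same privilege rules) rather than the pthread_setschedparam call made by the attached test program, and the function name try_round_robin is hypothetical.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Ask the kernel to run the caller under SCHED_RR at the given priority.
 * Returns 0 on success, otherwise the errno value reported by the kernel
 * (EPERM when the RLIMIT_RTPRIO soft limit does not permit the request). */
int try_round_robin(int priority)
{
    struct sched_param param;

    memset(&param, 0, sizeof param);
    param.sched_priority = priority;

    /* A pid of 0 means "the caller". */
    if (sched_setscheduler(0, SCHED_RR, &param) == -1) {
        int err = errno;
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(err));
        return err;
    }
    return 0;
}
```

A program could call try_round_robin(1) at startup and treat EPERM as the signal to fall back to a non-real-time strategy instead of exiting.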
> After adding the /etc/security/limits.d/squeakvm.conf file described above,
> adding my user to the squeakvm group and logging out and back in again, the
> tests are somewhat more successful. Note that these are the only additional
> privileges given to the squeakvm group.
>
> schedtest -ar1 -i e
>
>  Scheduler settings of main thread
>     policy=SCHED_OTHER, priority=0
>
> Scheduler settings in 'attr'
>     policy=SCHED_RR, priority=1
>     inheritsched is EXPLICIT
>
> Scheduler attributes of new thread
>     policy=SCHED_RR, priority=1
>
>
> schedtest -mr1 -ao0 -i e
>
>  Scheduler settings of main thread
>     policy=SCHED_RR, priority=1
>
> Scheduler settings in 'attr'
>     policy=SCHED_OTHER, priority=0
>     inheritsched is EXPLICIT
>
> Scheduler attributes of new thread
>     policy=SCHED_OTHER, priority=0
>
>  Does this give sufficient flexibility without having to patch the
> kernel's scheduler (whatever its name)?
>
> Cheers,
> Steve
>
>  AFAIA it is the only main-stream pthreads scheduler that doesn't.  AFAIA
> BFS (what a name?!) does support multiple thread priorities for user
> processes.
>
>  Within the Squeak Cog VM (and in a number of other VMs, Smalltalk and
> Java VMs amongst them) there's a heartbeat which is used to cause the VM to
> periodically break out of normal processing and poll for events.  A
> heartbeat is both much more efficient, and more regular than e.g.
> decrementing a counter as part of normal processing (e.g. frame build on
> entering non-leaf methods).  Ideally the heartbeat is implemented as a
> thread spinning, blocking in e.g. nanosleep and then forcing the breakout
> before entering nanosleep again.  But this requires that the heartbeat
> thread runs at a higher priority than the main VM thread(s).  On Linux
> under the CFS this isn't possible.  The fallback is to use an interval
> timer (setitimer with ITIMER_REAL) and a signal handler (for SIGALRM).
> This is a poor substitute:
> - system calls are interrupted, which can play havoc with external code
> - when debugging the heartbeat signal must be disabled because otherwise
> one is constantly stepping into the signal handler
> - certain Linux kernels have bugs with signal delivery and threads which
> can cause the loss of a thread's context, ending up with two threads having
> the same context, hence the setitimer approach works only with a strictly
> single-threaded VM (this is a bug I found and worked around late last year
> in Red Hat Enterprise Linux WS release 4 (Nahant Update 4) vintage kernels,
> which alas I have customers using)
>
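The heartbeat thread described above can be sketched as follows. This is an illustration under stated assumptions, not the Cog VM's code: the names poll_requested, start_heartbeat, take_poll_request and stop_heartbeat are hypothetical, the flag stands in for whatever mechanism the VM uses to force a breakout, and priority 1 under SCHED_RR is chosen only because it is the value used in the limits examples in this thread. If the kernel refuses the real-time attributes, the sketch falls back to a thread with default scheduling.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static atomic_int poll_requested;        /* set by the heartbeat thread */
static atomic_int heartbeat_running;     /* cleared to stop the thread */
static struct timespec heartbeat_period; /* sleep time between beats */

static void *heartbeat(void *unused)
{
    (void)unused;
    while (atomic_load(&heartbeat_running)) {
        struct timespec remaining = heartbeat_period;
        /* Block in nanosleep; resume with the remaining time on EINTR. */
        while (nanosleep(&remaining, &remaining) == -1 && errno == EINTR)
            ;
        atomic_store(&poll_requested, 1);   /* force the breakout */
    }
    return NULL;
}

/* Start the heartbeat with the given period in nanoseconds (< 1 second).
 * Tries SCHED_RR priority 1 first; *elevated reports whether that worked.
 * Returns 0 on success or the pthread_create error number. */
int start_heartbeat(pthread_t *thread, long period_ns, int *elevated)
{
    pthread_attr_t attr;
    struct sched_param param;
    int rc;

    heartbeat_period.tv_sec = 0;
    heartbeat_period.tv_nsec = period_ns;
    atomic_store(&heartbeat_running, 1);

    memset(&param, 0, sizeof param);
    param.sched_priority = 1;
    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &param);

    rc = pthread_create(thread, &attr, heartbeat, NULL);
    pthread_attr_destroy(&attr);
    *elevated = (rc == 0);
    if (rc != 0)   /* typically EPERM: retry with default scheduling */
        rc = pthread_create(thread, NULL, heartbeat, NULL);
    return rc;
}

/* What the interpreter loop would call: returns 1 once per beat. */
int take_poll_request(void)
{
    return atomic_exchange(&poll_requested, 0);
}

void stop_heartbeat(pthread_t thread)
{
    atomic_store(&heartbeat_running, 0);
    pthread_join(thread, NULL);
}
```

Built with gcc -pthread, a main loop would call start_heartbeat once, test take_poll_request() at its breakout points, and poll for events whenever it returns 1. When the fallback path is taken the heartbeat competes with the main thread at equal priority, which is the situation this message is complaining about.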
>  Either of these solutions would seem straight-forward from the outside:
> - make SCHED_RR and/or SCHED_FIFO available to user processes.
> - implement multiple priorities for SCHED_OTHER
>  Expecting to be able to install a VM as a setuid program is not realistic.
>
>  I think you'll find that this kind of architectural issue is present in
> a number of multi-media applications, not just dynamic language virtual
> machines.  The restriction to a single thread priority is, frankly,
> pathetic.  If you see Raspbian and Pi as a platform for multi-media apps
> then I would urge you to bring any influence you have to bear on getting
> the Linux kernel community to provide multiple thread priorities.  The
> lack thereof is a significant limitation.
>
> best regards,
> Eliot Miranda
>
>> Of course the next step after that
>> wouldn't be dumping the upstream scheduler and switching to BFS, but
>> it would certainly justify taking a closer look.
>>
>> I'm not entirely sure why you want to fork BFS - as far as I can see
>> Con Kolivas is keeping the BFS and his larger -ck patchset up to date
>> with upstream releases.
>>
>> In conclusion (from a Raspberry Pi perspective): please do play with
>> BFS on the pi, do something useful with it (if it solves the recently
>> discussed issues with heartbeat+cogvm then swell), then let's think
>> about where to go from there.
>>
>> Regards,
>>
>> Alex
>>
>
>
>
>  --
> best,
> Eliot
>
>
>
> --
> You can follow me on twitter at http://twitter.com/smalltalkhacker
>
>
>


-- 
best,
Eliot