[Box-Admins] Proposed week-long shutdown of Jenkins

Frank Shearar frank.shearar at gmail.com
Thu Feb 20 22:31:19 UTC 2014


On 20 February 2014 20:51, David T. Lewis <lewis at mail.msen.com> wrote:
> Ken, thanks for the explanation. I recognize all of those issues.
>
> I do think it is more appropriate to use the Jenkins UI to turn off the
> problematic jobs until the issues can be addressed, as opposed to shutting
> down the whole Jenkins system.

Not really. Other than the InterpreterVM and CogVM jobs (I'm working
from memory here), the other jobs all use rake, shell out to run
Squeak, and do fancy things to avoid hung builds. I'm absolutely 100%
sure that the orphaned processes are my fault: clearly I don't
understand the intricacies of shells, subshells, and process
ownership. The problem is the disconnect between the people who know
very well what the squeak-ci code does (me) and the people who
understand the unix process model, ownership in particular (not me).
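
For what it's worth, my current understanding of the failure: the
wrapper shell exits (or Jenkins kills it) without taking its children
down with it, and those children become the orphans. Here's a minimal
sketch of the usual fix, assuming a wrapper roughly like ours (the
squeak invocation and file names are illustrative, not our actual
scripts): run the child in its own process group, and kill the whole
group when the wrapper goes away.

    #!/bin/sh
    set -m                # job control: background jobs get their
                          # own process group

    squeak trunk.image run-tests.st &    # hypothetical build command
    PGID=$!                              # its pid is also its pgid

    # Whether we exit normally or Jenkins sends us SIGTERM, take the
    # whole group down so nothing survives the build.
    trap 'kill -TERM -- -$PGID 2>/dev/null' EXIT TERM INT
    wait $PGID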

> Yes, some of our Jenkins jobs are wasting a lot of space. Yes, that is a
> fixable problem. No I don't think that a bigger disk drive will fix it ;-)

I thought I'd take a look at ExternalPackages-Xtreams, one of the
bigger jobs at 511M.
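
The numbers below come from nothing fancier than du against the job's
workspace (the path here is from memory and may not be exactly
right):

    cd ~jenkins/jobs/ExternalPackages-Xtreams/workspace
    du -sh .git target target/package-cache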

By far the biggest part of the disk usage - 221M - is the repository
itself. This is because we store big fat blobs of binary data
(images) in the repository, and every time we upgrade one, git keeps
the old blob in its history, which is simply wasteful. Some serious
git guru-ness might reduce this: I think there are tricks to remove
large binaries, and their history, from a repository, but I'd have to
look them up.
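
The standard trick, as far as I know, is git filter-branch plus an
aggressive gc. I haven't tried this on squeak-ci, and the "*.image"
pattern is just my guess at what we'd want to strip; it also rewrites
history, so everyone would have to re-clone:

    # Rewrite every commit, dropping any *.image blob from the index.
    git filter-branch --index-filter \
        'git rm --cached --ignore-unmatch "*.image"' \
        --prune-empty --tag-name-filter cat -- --all

    # Throw away the old, pre-rewrite objects.
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive

(The BFG Repo-Cleaner does the same job faster, if we'd rather not
hand-roll it.)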

The target/ directory takes up no less than 195M. It has three VMs
(like most of these jobs): each Cog VM directory takes 14M, while the
Interpreter VM takes up 38M. (That 38M includes the source: because
every job can run on any agent, and that agent could have any version
of glibc, we _build_ the Interpreter VM from source and memoise the
artifact.) target/package-cache/ takes up another 37M, presumably
because jobs update the trunk image from the base CI image, save
that, and then load the package under test (Xtreams in this case).
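
The memoisation itself is simple in principle. A sketch, with made-up
script and path names, keyed on the agent's glibc version:

    # Key the cached VM on this agent's glibc, so each agent builds
    # the Interpreter VM at most once. ldd reports the glibc version
    # on glibc systems.
    KEY=$(ldd --version | head -n1 | tr ' /()' '----')
    CACHE="target/vm-cache/$KEY"

    if [ ! -x "$CACHE/bin/squeak" ]; then    # hypothetical layout
        mkdir -p "$CACHE"
        ./build-interpreter-vm.sh "$CACHE"   # hypothetical script
    fi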

I've started making jobs depend on the binary artifacts of other
jobs, which should eliminate most of the package-cache usage. So
SqueakTrunk will produce a TrunkImage.image, which ReleaseSqueakTrunk
will turn into a Squeak4.5.image, and which ExternalPackages-Xtreams
will turn into a JUnit test result.
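
Concretely, a downstream job can either use the Copy Artifact plugin
or just fetch the upstream job's archived artifact over HTTP. A
sketch of the latter (the artifact path is how I'd expect it to look,
not a quote from the real config):

    # Fetch the image SqueakTrunk archived on its last good build,
    # instead of rebuilding it via package-cache.
    wget -q -O target/TrunkImage.image \
        "http://build.squeak.org/job/SqueakTrunk/lastSuccessfulBuild/artifact/target/TrunkImage.image"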

Saving 38M per job means saving 38M * 33 jobs ~= 1.2G on disk.

frank

> Dave
>
>> Let me just list the issues I'm aware of, not that these can all be
>> fixed in the same way or require any significant overall downtime.
>>
>> 1. Jenkins broke some of our build processes with a release months ago.
>>   Since that time we have been pinned to a specific release and have not
>> updated.  Initially the plan was to be agile and keep up to date with
>> Jenkins releases, but no one has found the time to figure out why the
>> builds broke or at least the proper way to address the problem.  I know
>> Frank tried but he has only so much time and other fish to fry.  I
>> approached Chris C, as he was the original instigator for Jenkins, to
>> see if he had the interest to help Frank out.
>>
>> 2. The issue I have harped on in the past: filling up the filesystem
>> on box3.  I'm convinced that Jenkins jobs are wasting space somewhere,
>> or that there are some jobs that could simply be deleted.  I'm just
>> speculating, but there are a number of jobs that have not succeeded in
>> months.  By the way, growth has been generally slow of late, but we
>> are at 97%: no immediate fear, but 'vigilance!'.  If build.squeak.org
>> ultimately is as big as it is because it has to be, then we probably
>> need to approach the SFC and see if there is budget to upgrade the
>> disk space on box3.  That's not my first choice, however.
>>
>> 3. The issue that Chris has referred to: we still fairly regularly
>> get jobs stuck that have to be killed manually.
>>
>> Ken
>>
>> On 02/20/2014 11:11 AM, David T. Lewis wrote:
>>> What problem are we trying to solve here?
>>>
>>> If there are Jenkins jobs that cause problems, and if those problems
>>> cannot be addressed right away, then the appropriate thing to do is
>>> disable them using the normal Jenkins console. If an explanation is
>>> needed, just update the job description to say what is going on.
>>>
>>> A little bit of updating of the Jenkins job descriptions would do no
>>> harm in any case. Sort of like a class comment: "I am a Jenkins job
>>> that tests the FreebleBaz package. If I stop working, please contact
>>> bilbo at baggins.org".
>>>
>>> :)
>>>
>>> Dave
>>>
>>>> Ken and I have been thinking of shutting down Jenkins (OK, it was my
>>>> idea) for a week after 4.5 is released. The aim is to address hanging
>>>> issues.
>>>>
>>>> A week is a long time from a technical point of view, but it allows
>>>> people using it to take a break. Mainly we're thinking of Frank here.
>>>> We're thinking of upgrades, disk usage, and necessary and unnecessary
>>>> builds (if there are any). Basically, stopping that world for a week.
>>>>
>>>> What do you think, Frank? If you are opposed, then we'll chuck this
>>>> idea.
>>>>
>>>> Chris
>>>
>>>
>>>
>>>
>>
>
>

