Perseus Software Configuration Changes in 2017

The links below served as guides for what was done. It should be assumed that the indicated steps were executed as root.

Server stability troubles in 2017: after a power outage on 2/4/17, Perseus started acting flaky again, as it had in the past.
It might become unresponsive for no obvious reason, and the only way to revive it was to power-cycle it via the on/off switch.
Each of these hard shutdowns seemed to leave it even more vulnerable. Stability improved after a few clean shutdowns and restarts.

On 2/27/17, users' terminals began reporting SMART failures associated with /dev/sdc, suggesting possible file system problems.
However, tests like "smartctl -H" and "smartctl --test=short" (run twice), followed by "smartctl --log=selftest", revealed no issues.
Repeated runs of "smartctl -a /dev/sdc" showed just one error counter creeping up: Multi_Zone_Error_Rate, at ~1 count/hour.
A check of /var/log/messages revealed 71 SMART timeouts had occurred on 2/26/17, as well as 10 timeouts involving sdc2 (/home).
Several volumes (/, /home, /user0, /user3) showed errors with "fsck -n -f". Disk repairs were attempted through the following procedure:
How to check root partition with fsck?
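The SMART checks described above can be sketched as follows. The smart_attr helper is an illustrative addition for watching a single attribute; it was not part of the original procedure:

```shell
# Run the health check and a short self-test, then read the results
# (these require root and a real disk, so they are shown commented):
#   smartctl -H /dev/sdc
#   smartctl --test=short /dev/sdc
#   smartctl --log=selftest /dev/sdc

# Hypothetical helper: pull one attribute's raw value out of
# "smartctl -A" output (attribute name in $1, report text on stdin).
smart_attr() {
    awk -v name="$1" '$2 == name { print $NF }'
}

# Intended use, e.g. to watch Multi_Zone_Error_Rate creep upward:
#   smartctl -A /dev/sdc | smart_attr Multi_Zone_Error_Rate
```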

This took care of /user0 and /user3, but errors still showed for / and /home. The latter was fixed by "umount /home; fsck /home; mount -a".
However, / can't be unmounted, and it seems the usual mechanisms for forcing a full fsck don't always work with systemd in CentOS 7.
Therefore, /etc/default/grub was edited to include GRUB_CMDLINE_LINUX="fsck.mode=force fsck.repair=yes ...", as implied here:
How to automatically force fsck disks after crash in `systemd`?
Linux Force fsck on the Next Reboot or Boot Sequence [bottom of post]

When /etc/default/grub is changed, it is necessary to update grub.cfg using "grub2-mkconfig -o /boot/grub2/grub.cfg" prior to rebooting:
Setting Up grub2 on CentOS 7
Searching for [and customizing the] grub configuration file in CentOS 7
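Taken together, the grub steps above look roughly like this. The sed expression is just one illustrative way to make the edit (hand-editing /etc/default/grub works as well), and the sketch operates on a scratch copy so it can be exercised safely; on Perseus the target was the real /etc/default/grub:

```shell
# Demo target file (illustrative name); on the real system this
# would be /etc/default/grub, edited as root.
GRUB_DEFAULT_FILE=${GRUB_DEFAULT_FILE:-grub.demo}
echo 'GRUB_CMDLINE_LINUX="rhgb quiet"' > "$GRUB_DEFAULT_FILE"

# Prepend the fsck flags to the kernel command line; the "&" in the
# replacement re-inserts the matched prefix.
sed -i 's/^GRUB_CMDLINE_LINUX="/&fsck.mode=force fsck.repair=yes /' "$GRUB_DEFAULT_FILE"

# After editing the real file, regenerate grub.cfg before rebooting:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```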

Upon reboot, this supposedly triggers an fsck of all volumes, including root (/). However, this did not fix the file system errors either.
The next setting to try was GRUB_CMDLINE_LINUX="fsck.repair=yes ...", alone, but making this edit and rebooting again did not help.
The reason for trying this was the statement in the systemd-fsck@.service documentation that "systemd-fsck understands one kernel command line parameter".

One further effort on 2/28/17: booting into Fedora 15 proved to be useless because its e2fsck was too old to repair the /centos volume.
Back in CentOS 7, / was remounted as read-only in order to run fsck on it, but that fix didn't stick; in the end there were still problems.
How to fix (fsck) a root file system that you have to boot into on Linux

On 3/1/17, "yum install ipmitool" brought in ipmitool and OpenIPMI, to see if they would give additional clues about the recent problems.
How to get the IPMI address of a server?
Logs showed that all 8 processors asserted an overheating condition at around the same time that the system got into an unresponsive state.
Furthermore, this occurred at just about the same time a 16-core job joined a 48-core job, so the system was coming under heavy load.
Two subsequent "stuck" conditions did not show this coincidence; nevertheless, the decision was made to send the machine to Red Barn for reconditioning.
In the meantime, users were advised to limit their runs to 32-48 cores. This did seem to reduce the frequency of incidents.
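The IPMI event log consulted above can be inspected with commands along these lines. The sel_overheat helper is an illustrative addition for filtering temperature events out of the log:

```shell
# Hypothetical filter: keep temperature/overheat assertions from an
# IPMI System Event Log listing (log text arrives on stdin).
sel_overheat() {
    grep -Ei 'temperature|overheat'
}

# Intended use (requires root and a working BMC):
#   ipmitool sel list | sel_overheat   # past events, e.g. overheating
#   ipmitool sensor                    # current readings, incl. CPU temps
```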

Repairs started on 3/20/17. Red Barn's scope of work included cleaning the fans and re-applying thermal paste to the processors.
However, these steps did not fix the problem: when the processors were heavily loaded, the machine would still become unreachable.
The logical next step was to update the BIOS, but unfortunately, this operation utterly failed, and the machine would no longer pass POST(!).
Eventually—after trying several sets of both new and old ROM chips from Supermicro—the machine both booted and passed stress tests.
Perseus returned to its home on 4/11/17.

On 4/27/17, the OS was updated to CentOS 7.3. This was NOT done because Perseus had continued its pattern of becoming unresponsive.
Rather, X2Go connections had become unreliable. The decision was made to do a full system refresh, including a new version of X2Go.
This time, "yum update" cranked smoothly through 1000+ packages, and after 1.5 hours, the result was a clean reboot into the new kernel.

X2Go details: The initially installed x2goserver was 4.0.1.19. According to the Git repository, version 4.0.1.20 was released in Nov. 2016.
A RHEL 7 package, x2goserver-4.0.1.20-1.el7.x86_64.rpm, became available around that time, but "yum list" showed that Perseus did not have it.
Furthermore, new X2Go clients for various OSes (version 4.1.0) were released in early 2017. The time was clearly ripe for an update.

Things were good for a while, but between 5/30/17 and 6/2/17, there were 5 reboots due to unknown causes, as revealed by "last reboot".
Moreover, "last -x" indicated that many user sessions had ended with "crash". All of this seemed to point to an ongoing hardware problem.
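The commands used for this check are sketched below. The crash_count helper is an illustrative addition that tallies wtmp records marked "crash":

```shell
# Hypothetical helper: count records marked "crash" in "last -x"
# output (text arrives on stdin).
crash_count() {
    grep -c 'crash'
}

# Intended use:
#   last reboot             # reboot history from /var/log/wtmp
#   last -x | crash_count   # sessions/run-level changes ending in a crash
```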

Overheating could lead to such crashes. The IPMI Event Log showed no evidence for that, but its temperature monitoring is not very detailed.
It was easy to try another tool, via "yum install lm_sensors", then "sensors-detect" [accepting all defaults], then "sensors", as outlined here:
How to Get CPU Temperature Information on Linux (CentOS 6.4 /RHEL)
Unfortunately, "sensors" just provides single-point-in-time readings, so a cron job would be needed to take readings every minute or so.
Red Barn was consulted about the situation, and they did not feel that overheating was necessarily to blame for the crashes.
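A one-line cron.d entry along these lines would provide the once-a-minute readings mentioned above. The file name and log path are illustrative, and the sketch writes to a local demo file so it can be exercised without touching /etc:

```shell
# Demo target (illustrative); on the real system this file would be
# installed as /etc/cron.d/sensors-log, as root.
CRON_FILE=${CRON_FILE:-sensors-log.demo}

# cron.d entries include a user field (here, root). This one appends
# a timestamped "sensors" reading to a log file every minute.
cat > "$CRON_FILE" <<'EOF'
# m h dom mon dow user  command
* * * * * root (date; /usr/bin/sensors) >> /var/log/sensors.log 2>&1
EOF
```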

Perseus once again got stuck in an unresponsive state on 6/6/17, for the first time since its reconditioning back in March-April.
Power cycling was the only option, followed by "fsck -n -f" on all partitions. Both / and /home showed errors (while still mounted).
Repairs were made according to the approach used before, including this for the root partition: Force fsck on system boot.

There is one lingering issue from a known bug in CentOS 7.3, specifically, in libfabric-1.3.0. It causes a 15s delay in the start of any MPI run:
Bug 1408316 – openmpi hfi_wait_for_device causes 15s delay
This issue can reportedly be corrected by manually installing a CentOS package based on the current Fedora version of the library.
However, Perseus users said they were content to wait for the official update that is due to arrive with CentOS 7.4 (in late July 2017?).



Last updated on 6/6/17 by Steve Lantz (steve.lantz ~at~ cornell.edu)