Thermal Load Test
From athena
We were only able to run the RC unit test today, due to technical difficulties with the data mining software.
We turned off the water to the RC units twice to simulate a pump failure on the cooler, arguably the worst non-leak failure a cooling system can have. Within 5 minutes on each test, the RC units stopped functioning because of the thresholds set for the "exit air." On our first shutdown, we ran 10-12 minutes with the node fans running at double speed, an indication of inlet air at 75 degrees or higher. We returned water to the cooling loops and recovered cooling before shutdown.
We discovered that the ISX manager failed to send alerts for the critical inlet temperatures. We also discovered that the RC units would not turn back on automatically after water was restored. In fact, one of the RC units failed to start until we set some of its thresholds higher.
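Since the ISX manager did not alert us, one option would be a small check that runs independently of it. Below is a minimal sketch of the comparison step only, assuming the inlet readings are already available as a name-to-temperature mapping. The function name, sensor names, and values are made up for illustration; how the readings are collected and how the alert is delivered are not shown.

```python
def find_critical_inlets(readings, critical_temp):
    """Return (sensor, temp) pairs at or above critical_temp, hottest first.

    readings: dict mapping a sensor name to its inlet temperature
              (hypothetical names; same units as critical_temp).
    """
    hot = [(name, temp) for name, temp in readings.items()
           if temp >= critical_temp]
    # Sort so the worst reading is reported first.
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Illustrative values only, not measurements from the test.
example = {"rack1": 28.0, "rack3": 35.0, "rack5": 36.5}
for sensor, temp in find_critical_inlets(example, critical_temp=35):
    print(f"ALERT: {sensor} inlet at {temp} degrees")
```

A script like this could be run on a schedule and hooked into whatever notification path we trust most, so a failure in the manager's alerting does not leave us blind.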
After this exercise, we decided that the 35-degree setpoint for inlet air to the CPUs was too high and reset it to 30. We started the test again, and at 5 minutes, the RC units stopped. At roughly 15 minutes, Bill's scripts automatically shut down most of the PDUs except rack 5, pdu 3, which stayed running. Bill's scripts shut down all of the polyserve nodes, but a user's NFS mount on the node (wdetmold) hung the shutdown of the head node.
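The hung head node suggests giving each shutdown step a time limit, so that one stuck step is reported instead of silently stalling. This is a rough sketch of that idea, not a description of Bill's scripts. The function name and the stand-in commands are hypothetical, and a real script would issue the actual shutdown commands.

```python
import subprocess
import sys


def run_steps(steps, timeout_s):
    """Run each (name, argv) step in order and record how it ended.

    A step that exceeds timeout_s is marked "timed out" and the loop
    moves on to the next step instead of waiting indefinitely.
    """
    results = []
    for name, argv in steps:
        try:
            proc = subprocess.run(argv, timeout=timeout_s)
            status = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising.
            status = "timed out"
        results.append((name, status))
    return results


# Stand-in commands for illustration: one that exits at once, one that hangs.
EXAMPLE_STEPS = [
    ("quick step", [sys.executable, "-c", "pass"]),
    ("hung step", [sys.executable, "-c", "import time; time.sleep(30)"]),
]
```

Calling `run_steps(EXAMPLE_STEPS, timeout_s=3)` reports the second step as timed out rather than waiting the full 30 seconds. Note that a process blocked in the kernel on a hung NFS mount may not respond to being killed, so a timeout like this is only a partial mitigation.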
Eventually, the UPS was shut down, and at that point the EPO tripped and dropped the power to rack 5, pdu 3.
Doug will have more comments, and I'm sure I've missed something.