On Tuesday, 20.06.2023, various IT services and the TUW homepage were unavailable for a considerable time as a result of a malfunction within the Datacore storage cluster.
The cause was identified as the installation of Windows updates, which ran automatically even though automatic updates had been deactivated. The updates led to a shutdown at 05:02 and, as a consequence, to a disk error at 05:03. The error was displayed on TU.it's monitoring systems at 05:06, and a TU.it staff member began working on it at 06:30. From 07:30, a team of 15 employees was involved in eliminating the fault. At 07:40, it was decided to activate an emergency page for the TUW website. The final root cause is currently being clarified with the manufacturers.
To ensure data integrity, the service that provides the virtual machines (VMs) with their disks did not start automatically on the storage clusters. As a result, the IT services running on these VMs failed.
After a successful manual, controlled restart of all services on the Datacore clusters, the disks were made available to the VMs again, without data loss. The TU.it virtualization team provided significant support to the service managers of the affected services during the recovery.
Because central TU information channels were down (e.g. TUchat, TUwiki) or impaired (e.g. mail) due to this disruption, communication with customers was initially very difficult. However, improvised workarounds were found.
At 21:00, all services were fully available again.
We would like to take this opportunity to apologize to everyone whose work was disrupted by these service outages.