Lessons Learnt – ATO Outage

By now you have probably heard of this week's massive outage at the Australian Taxation Office (ATO). If you haven't, where have you been? It's OK, you can read about it here.

Long story short: massive data corruption on the HPE 3PAR SAN infrastructure. The corruption was then replicated to the DR 3PAR SAN, taking the whole SAN solution down and, with it, the front-facing websites and other core services.
Issues with the backup solution (NetBackup) at the time resulted in failed backups of the SAN. How long this had been going on, no one knows.

Impact to the business: three days' worth of outages. Ouch.

At the time of writing, HPE are restoring the data. How recent the backups are, and whether all the tapes are consistent and verified, is anyone's guess.

FUN FACT – 1 PB of data is about 200 tapes, which works out to roughly 5 TB per tape.


The lesson here is that backups aren't a set-and-forget component of any system, including SQL Server. Regularly test your backups and restores using your preferred solution, whether that be CommVault, NetBackup or even SQL Server itself.

I have a reminder on my calendar to test the backup and restore process once a month. I back up and restore a standalone SQL Server database, an Always On SQL Server database, and a database restored to a DR SQL Server.
Basically, you want to test every combination of SQL Server you can, because you never know which type of SQL Server will break on you.
NB – Make sure to include transaction logs in your restore process.
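To make that monthly test repeatable, here is a minimal sketch in Python that builds the T-SQL for one restore test: full backup first, then each transaction log in order, then recovery and a consistency check. The function names, database name and file paths are all made up for illustration, and this is not necessarily how your backup tool does it. It also assumes native SQL Server backup files on disk and a scratch server, since REPLACE overwrites an existing database of the same name. Restoring under a different name would normally also need WITH MOVE clauses, which I have left out.

```python
from typing import List


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_path(path: str) -> str:
    """Render a file path as a T-SQL Unicode string literal."""
    return "N'" + path.replace("'", "''") + "'"


def build_restore_test(database: str, full_backup: str, log_backups: List[str]) -> List[str]:
    """Return the T-SQL statements for one restore test, in execution order."""
    db = quote_name(database)
    statements = [
        # Restore the full backup but leave the database ready for log restores.
        f"RESTORE DATABASE {db} FROM DISK = {quote_path(full_backup)} WITH NORECOVERY, REPLACE;"
    ]
    # Apply each transaction log backup in the order it was taken.
    for log in log_backups:
        statements.append(f"RESTORE LOG {db} FROM DISK = {quote_path(log)} WITH NORECOVERY;")
    # Bring the database online, then check it for corruption.
    statements.append(f"RESTORE DATABASE {db} WITH RECOVERY;")
    statements.append(f"DBCC CHECKDB ({db}) WITH NO_INFOMSGS;")
    return statements


if __name__ == "__main__":
    # Hypothetical database name and backup paths, for illustration only.
    script = build_restore_test(
        "RestoreTest_Sales",
        r"D:\Backups\Sales_full.bak",
        [r"D:\Backups\Sales_log1.trn", r"D:\Backups\Sales_log2.trn"],
    )
    print("\n".join(script))
```

Paste the output into a query window on the test server, or feed it to whatever scheduler you already use. The point is that the log restores are part of the test every time, not an afterthought.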


Your job depends largely on minimising potential data loss within SQL Server. Failure to do so could mean you're on the lookout for a new job. You have been warned.