Users of Microsoft’s Azure system lost database records as part of a mass outage on Tuesday. A combination of DNS problems and automated scripts was to blame, according to reports.
Microsoft deleted several Transparent Data Encryption (TDE) databases in Azure that held live customer information. TDE databases dynamically encrypt the information they store, decrypting it when customers access it. Keeping the data encrypted at rest stops an intruder with access to the database from reading the information.
While there are different approaches to encrypting these tables, many Azure users store their own encryption keys in Microsoft’s Key Vault encryption key management system, in a process called Bring Your Own Key (BYOK).
The deletions were automated, triggered by a script that drops TDE database tables when their corresponding keys can no longer be accessed in the Key Vault, explained Microsoft in a letter reportedly sent to customers.
The company quickly restored the tables from a five-minute snapshot backup, but that meant any transactions that customers had processed within five minutes of the table drop would have to be dealt with manually. In this case, customers would have to raise a support ticket and ask for the database copy to be renamed to the original.
Why were the systems accessing the TDE tables unable to access the Key Vault? The answer stems from a far bigger issue for Microsoft and its Azure customers this week. An outage struck the cloud service worldwide on Tuesday, causing a range of problems. These included intermittent access to Office 365 in which users had only half a chance of logging on. Broader Azure cloud resources were also down.
This problem was, in turn, down to a DNS outage, according to Microsoft’s Azure status page:
Preliminary root cause: Engineers identified a DNS issue with an external DNS provider.
Mitigation: DNS services were failed over to an alternative DNS provider which mitigated the issue.
Reports suggested that this DNS outage originated at CenturyLink, which provides DNS services to Microsoft. CenturyLink said in a statement that it had suffered a software defect.
This shows what can go wrong when cloud-based systems are interconnected and automated enough to allow cascading failures. A software defect at a DNS provider indirectly led to the deletion of live customer information thanks to a lack of human intervention.
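To make the cascade concrete, here is a minimal Python sketch of the kind of decision logic involved. It is not Microsoft's script, whose internals have not been published. Every name in it (KeyStatus, VaultUnreachable, check_key, naive_action, guarded_action) is hypothetical, and it assumes that the cleanup automation treated "the key cannot be accessed" the same way whether the key was really gone or the Key Vault simply could not be reached.

```python
from enum import Enum


class KeyStatus(Enum):
    AVAILABLE = "available"      # the vault answered and the key exists
    REVOKED = "revoked"          # the vault answered and the key is gone
    UNREACHABLE = "unreachable"  # the vault could not be contacted at all


class VaultUnreachable(Exception):
    """Hypothetical error for a failed vault lookup, e.g. a DNS failure."""


def check_key(lookup, key_id):
    """Classify a key using `lookup`, a hypothetical callable that returns
    True or False, or raises VaultUnreachable when the vault is unreachable."""
    try:
        return KeyStatus.AVAILABLE if lookup(key_id) else KeyStatus.REVOKED
    except VaultUnreachable:
        return KeyStatus.UNREACHABLE


def naive_action(status):
    # Assumed behaviour: anything other than a readable key means "drop".
    return "keep" if status is KeyStatus.AVAILABLE else "drop"


def guarded_action(status):
    # Only a confirmed revocation leads to a drop; an unreachable vault
    # is parked for a human to review instead.
    if status is KeyStatus.AVAILABLE:
        return "keep"
    if status is KeyStatus.REVOKED:
        return "drop"
    return "hold_for_review"


def dns_down(key_id):
    # Simulates the outage: the lookup never reaches the vault.
    raise VaultUnreachable("name resolution failed")


status = check_key(dns_down, "customer-key-1")
print(naive_action(status))    # drop
print(guarded_action(status))  # hold_for_review
```

Under these assumptions, the naive rule turns a transient DNS failure into a destructive action, while the guarded rule keeps a person in the loop whenever the automation cannot tell an outage from a genuine key removal.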
CenturyLink seems to be experiencing serial DNS problems lately. The company, which completed its $34bn acquisition of large network operator Level 3 in late 2017, also suffered a DNS outage in December that reportedly affected emergency services, sparking an FCC investigation.
Azure users can at least take comfort in the fact that Microsoft is offering multiple months of free Azure service for affected parties.