Renovating computer systems securely and progressively with APRON

This research paper was accepted by the 2023 USENIX Annual Technical Conference (ATC), which is dedicated to advancing the field of systems research.

Whether they’re personal computers or cloud instances, it’s crucial to ensure that the computer systems people use every day are reliable and secure. The validity of these systems is critical because if storage devices containing important executables and data become invalid, the entire system is affected. Numerous events can jeopardize the validity of computer systems or the data stored in them: malicious attacks such as ransomware can compromise a system, hardware or software errors can corrupt it, and a lack of regular maintenance, such as patch installations, can cause it to become outdated. While the ideal scenario would be to create a flawless computer system that prevents such invalid states from occurring, achieving this perfection may prove challenging in practice.

Cyber-resilient system and recovery

A cyber-resilient system is a practical approach for addressing invalid system states. Such a system identifies suspicious state corruption or preservation by analyzing various internal and external signals. If it confirms any corruption, it recovers the system. In our previous work, which we presented at the 40th IEEE Symposium on Security and Privacy, we demonstrated the feasibility of unconditional system recovery using a very small hardware component. Unless it receives an authenticated deferral request, this component forcefully resets the entire system and makes it execute a tiny piece of trusted code for system boot and recovery.

However, existing recovery mechanisms, including our previous work, primarily focus on when to recover a system rather than how. Consequently, these mechanisms overlook the efficiency and security issues that can arise during system recovery. Typically, these mechanisms incorporate a dedicated recovery environment responsible for executing the recovery task. Upon system reset, if the system is found to be invalid, as illustrated in Figure 1, the recovery environment is invoked. In this scenario, the recovery environment fully restores the system using a reference image downloaded from a reliable source or a separate location where it was securely stored.

There are two diagrams in Figure 1. The first depicts a situation where boot code, which is executed on system power on or reset, recognizes some corrupt parts of an operating system. Then, the boot code executes a recovery environment to fully recover all corrupt parts and resets the system. The second diagram depicts a situation after full recovery and reset. The boot code now finds no problem in the operating system and executes it.
Figure 1: System boot with a normal recovery.

Unfortunately, performing a full system recovery leads to prolonged system downtime because the recovery environment is incapable of supporting any other regular task expected from a computer system. In other words, the system remains unavailable during the recovery process. Moreover, choosing to download the reference image only serves to extend overall downtime. Although using the stored image slightly relieves this issue, it introduces security concerns, as the stored image might be outdated. One can argue that a full recovery can be circumvented by inspecting each file or data block for validity and selectively recovering only the affected ones. However, this delta recovery approach is lengthier than a full recovery due to the additional calculations required for determining differences and the inefficient utilization of modern, throughput-oriented block storage devices.

Secure and progressive system renovation

In our paper “APRON: Authenticated and Progressive System Image Renovation,” which we are presenting at the 2023 USENIX Annual Technical Conference (USENIX ATC 2023), we introduce APRON, a novel mechanism for securely renovating a computer system with minimal downtime. APRON differs from conventional recovery mechanisms in a crucial way: it does not fully recover the system within the recovery environment. Instead, it selectively addresses a small set of system components, or data blocks containing them, that are necessary for booting and system recovery, including the operating system kernel and the APRON kernel module, as shown in Figure 2. Once these components are recovered, the system boots into a partially renovated state and can perform regular tasks, progressively recovering other invalid system components as needed.

Figure 2: System boot with APRON.

This design allows APRON to significantly decrease downtime during system recovery by up to 28 times, compared with a normal system recovery, when retrieving portions of the reference image from a remote storage server connected through a 1 Gbps link. In addition, APRON incorporates a background thread dedicated to renovating the remaining invalid system components that might be accessed in the future. This background thread operates with low priority to avoid disrupting important foreground tasks. Throughout both renovation activities, APRON incurs an average runtime overhead of only 9% across a range of real-world applications. Once the renovation process is complete, runtime overhead disappears. 

APRON’s differentiator lies in its unique approach: the APRON kernel module acts as an intermediary between application or kernel threads and the system storage device, allowing it to verify and recover each data block on demand, as shown in Figure 3. When a block is requested, APRON follows a straightforward process. If the requested block is valid, APRON promptly delivers it to the requester. If it is found to be invalid, APRON employs a reference image to fix the block before serving it to the requester.

Figure 3: System storage renovation with APRON.
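The on-demand path described above can be sketched in a few lines. This is an illustrative user-space model, not the APRON kernel module: the class name `RenovatingStore`, the 4096-byte block size, and the flat list of SHA-256 digests are all assumptions made for the example (APRON itself verifies blocks against a Merkle hash tree).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the paper


def digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class RenovatingStore:
    """Hypothetical stand-in for the layer between readers and storage."""

    def __init__(self, local_blocks, reference_blocks):
        self.local = list(local_blocks)          # possibly corrupt local storage
        self.reference = list(reference_blocks)  # stands in for the reference image
        # Flat list of expected digests; a simplification of a Merkle tree.
        self.expected = [digest(b) for b in self.reference]
        self.repaired = 0

    def read(self, index: int) -> bytes:
        block = self.local[index]
        if digest(block) == self.expected[index]:
            return block                   # valid: serve it right away
        fixed = self.reference[index]      # invalid: fetch the reference copy
        if digest(fixed) != self.expected[index]:
            raise IOError("reference block failed verification")
        self.local[index] = fixed          # repair local storage in place
        self.repaired += 1
        return fixed


# Four reference blocks; block 2 of the local copy is corrupted.
reference = [bytes([i]) * BLOCK_SIZE for i in range(4)]
local = list(reference)
local[2] = b"\xff" * BLOCK_SIZE

store = RenovatingStore(local, reference)
data = store.read(2)  # triggers a repair, then returns the fixed block
```

Because the repaired block is written back, a later read of the same block takes the fast path, which is consistent with the overhead disappearing once renovation completes.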

To efficiently and securely verify arbitrary data blocks, APRON uses a Merkle hash tree, which cryptographically summarizes every data block of the reference image. APRON further cryptographically authenticates the Merkle tree’s root hash value so that a malicious actor cannot tamper with it. To further improve performance, APRON treats zero blocks (data blocks filled with zeros) as a special case and performs deduplication to avoid repeatedly retrieving equivalent blocks. We discuss the technical details of this process in our paper.
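To make the Merkle tree idea concrete, here is a minimal sketch of building a tree over data blocks and checking a single block against the root hash. The function names, the use of SHA-256, and the choice to duplicate the last node on odd-sized levels are assumptions for illustration; the paper's actual tree layout and root authentication scheme are not reproduced here.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels (illustrative choice)
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Collect the sibling hashes from a leaf up to (not including) the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])      # sibling shares the same parent
        index //= 2
    return path


def verify_block(block, index, path, root):
    """Recompute the root from one block and its sibling path."""
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


blocks = [bytes([i]) * 4096 for i in range(5)]
levels = build_tree(blocks)
root = levels[-1][0]                       # the value that must be authenticated

path = proof_for(levels, 3)
ok = verify_block(blocks[3], 3, path, root)        # genuine block
bad = verify_block(b"\x00" * 4096, 3, path, root)  # tampered block
```

Verifying one block needs only a logarithmic number of hashes, so an arbitrary block can be checked without reading the rest of the image, provided the root itself is trusted.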

Looking forward—extending APRON to container engines and hypervisors

APRON’s simple and widely applicable core design can readily be applied to other use cases requiring efficient and secure image recovery or provisioning. We are currently exploring the possibility of implementing APRON within a container engine or hypervisor to realize an agentless APRON for container layers or virtual disk images. By extending APRON’s capabilities to these environments, we aim to provide an efficient and reliable image recovery and provisioning process without needing to modify container instances or add a guest operating system.

The post Renovating computer systems securely and progressively with APRON appeared first on Microsoft Research.
