The Patch Rollback Playbook: Recovering from Bad Updates

Allen Jones

Jan 12, 2026

On July 19, 2024, the enterprise world learned a $5.4 billion lesson. A single faulty sensor update from a leading EDR vendor crashed 8.5 million Windows devices globally. It grounded airlines, halted surgeries, and froze banking systems. The downtime costs for Fortune 500 companies averaged $14,000 per minute.

In the modern enterprise, the risk is no longer just unpatched vulnerabilities; it is software supply chain instability. When a bad update hits, whether it’s a faulty OS patch, a broken CrowdStrike agent, or a buggy Chrome release, your RTO (Recovery Time Objective) determines your survival. Most UEM tools are designed to push updates. Few are designed to pull them back.

This is the Patch Rollback Playbook. It is a technical guide for Enterprise Architects on how to engineer an update and patch rollback strategy using Hexnode, so that when the next bad update hits, you can reverse it in minutes, not days.

Phase 1: The “Kill Switch”

The moment a bad update is reported, your priority is containment. You must stop the update from reaching the 90% of your fleet that hasn’t checked in yet. Most admins try to delete the update policy. This is too slow. It relies on each device checking in before the update is removed, which might not happen for hours.

The Hexnode “Firebreak” Strategy: Defer, Don’t Delete

Instead of deleting the policy, Hexnode lets you defer the updates.

  • Windows: With Hexnode, you can immediately defer updates for up to 30 days. This forces the Windows Update Agent to drop any pending downloads.
  • macOS: You can delay updates for up to 90 days. This hides the update at the system level, preventing even a manual “Check for Updates” from finding the broken build.
  • iOS/Android: For Supervised iOS devices and Android Enterprise, you can block OTA updates immediately. This acts as a hard lock on the OS version, killing any background download tasks.

Why this is the “Kill Switch”

When devices next check in (even seconds later), the OS receives a command that says “No updates are available.” The bad download is instantly aborted at the OS level, creating a firebreak that protects your surviving nodes.

Phase 2: The Resilience Framework – Fix and Prevention

The “Kill Switch” is your front-line defense, but in a global deployment, speed is relative. Even a five-minute delay in pausing a policy can leave thousands of devices already compromised. When the bad code has already reached your early adopters, you are no longer in a containment phase. You are in a race for operational resilience. This requires a transition from a broad global pause to a high-precision prevention and remediation lifecycle.

In a crisis, you cannot afford a “one-size-fits-all” recovery. Your fleet is fragmented: some devices are bricked, some are lagging, and some remain healthy. Hexnode manages this chaos through its dedicated Automation engine.

Rather than manually hunting for faulty KBs, you can configure Hexnode to act as an automated traffic controller. With the Automate capability, you can build a logic-based flow to initiate an automated patch rollback by:

  • Defining the Scope: Choose the OS and the specific patch category (OS vs. App).
  • Filtering by Update Type: Choose the type of update to pinpoint the exact patch.
  • Executing the Action: Select the specific update version and set the automation rule to download, install, or uninstall.
  • Assigning Targets: Deploy the automation to specific Device Groups, User Groups, or even Organizational Units (OUs) synced from your directory.

Once these automation rules are live, Hexnode uses Dynamic Device Groups to orchestrate them. Dynamic groups enable:

  • Automated Remediation: Devices reporting the broken version (e.g., v7.11) are automatically funneled into a remediation group. This instantly triggers your patch rollback automation (or a custom script) to restore stability.
  • The Firebreak: Simultaneously, devices that haven’t been updated yet (version 7.10 or earlier) are isolated via a deferral command. This protects them from the faulty code while you prepare a verified, fixed version (v7.12) for a forced push.
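
The grouping logic behind these two bullets can be sketched in a few lines. This is a minimal illustration in Python, not Hexnode's implementation: the names `Action`, `parse_version`, and `triage` are hypothetical, and the version numbers are the example values from the text. It assumes only the exact bad build is rolled back, anything older is deferred, and anything newer is left alone.

```python
from enum import Enum


class Action(Enum):
    REMEDIATE = "remediate"  # running the known-bad build: roll it back
    DEFER = "defer"          # not yet updated: hold updates (the firebreak)
    NONE = "none"            # newer than the bad build: leave untouched


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v7.11' or '7.11' into (7, 11) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def triage(reported: str, bad: str = "7.11") -> Action:
    """Decide what to do with a device based on the version it reports."""
    version = parse_version(reported)
    bad_version = parse_version(bad)
    if version == bad_version:
        return Action.REMEDIATE
    if version < bad_version:
        return Action.DEFER
    return Action.NONE


# Hypothetical fleet inventory: device name -> reported version.
fleet = {"laptop-01": "v7.11", "laptop-02": "7.10", "laptop-03": "7.12"}
plan = {name: triage(version) for name, version in fleet.items()}
```

Comparing versions as integer tuples rather than strings matters here: as plain text, "7.9" would sort after "7.11" and an old device would escape the firebreak.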

Eliminating “Over-Remediation” Risk

The greatest fear in mass rollbacks is accidentally downgrading healthy machines. Because Hexnode uses real-time version telemetry, your sub-fleets remain clearly distinguished. You can see exactly which devices are “Healed” and which are still “At Risk,” ensuring your remedy only touches the machines that actually need it.

Staged “Test Group” Deployment

It is always good practice to test updates and patches on a small group before pushing them to the whole enterprise fleet. Hexnode’s Custom Device Groups allow you to build a tiered “Test Ring” strategy that acts as a buffer:

  • The Test Ring: Deliver updates only to a diverse 5% subset of your fleet (IT staff, lab machines) on the day of release.
  • The Production Ring: Configure the remaining 95% of your fleet with a mandatory 7-day deferral policy.

This 7-day gap acts as your safety valve. If the Test Ring reports crashes or performance dips, you simply pause the policy globally, before the update ever reaches the remaining 95% of your users.
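
The timing and the pause decision can be sketched as follows. This is a minimal Python illustration under stated assumptions: the ring names and deferral days come from the text, while `eligible_on`, `should_pause`, and the 2% crash-rate threshold are hypothetical choices, not Hexnode settings.

```python
from datetime import date, timedelta

# Deferral per ring, in days after release (values from the text).
RING_DEFERRAL_DAYS = {"test": 0, "production": 7}


def eligible_on(release: date, ring: str) -> date:
    """First day a ring may install an update released on `release`."""
    return release + timedelta(days=RING_DEFERRAL_DAYS[ring])


def should_pause(test_devices: int, crash_reports: int,
                 max_crash_rate: float = 0.02) -> bool:
    """True when the Test Ring crash rate exceeds the tolerated threshold.

    The 2% default is an illustrative assumption; pick your own tolerance.
    """
    if test_devices <= 0:
        raise ValueError("test ring must contain at least one device")
    return crash_reports / test_devices > max_crash_rate


release_day = date(2026, 1, 12)
production_day = eligible_on(release_day, "production")  # 2026-01-19
pause = should_pause(test_devices=500, crash_reports=25)  # 5% crash rate
```

With a 500-device Test Ring, 25 crash reports is a 5% rate, so the production push is paused well before its eligibility date.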

Phase 3: The Rollback

If the update has already breached your production environment, the goal shifts to a surgical rollback. Because every OS handles its undo history differently, Hexnode allows you to deploy platform-specific remediation scripts to pull back the code without a full system wipe.

A: Windows Rollback with Scripts

Windows is unique because it allows for granular, command-line uninstallation of specific Knowledge Base (KB) patches. This makes your Windows 10/11 fleet your most recoverable asset, if you have the right script.

The Windows Update Standalone Installer (wusa.exe) is a built-in utility that manages update packages. By leveraging specific flags, we can transform it from a standard installer into a silent, fleet-wide undo button. To perform the rollback:

  • Identify the Bad Patch: Find the KB number of the bad patch (e.g., KB5044033).
  • Create the Script: Write a PowerShell script that targets this specific KB.
  • Orchestrate via Hexnode: Push this script as a required action to the affected device group.

Here is a sample script template to help with the rollback:
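
What follows is a minimal sketch of the logic such a template needs, written in Python for illustration rather than as the PowerShell you would actually deploy. The function names are hypothetical. The `/uninstall`, `/kb:`, and `/quiet` switches are the ones this playbook relies on. `/norestart` is an added assumption so the script does not reboot users mid-task, and treating exit code 3010 (success, reboot required) as success is likewise an assumption to verify in your environment.

```python
import re
import subprocess

# Exit codes treated as success: 0 (done) and 3010 (done, reboot required).
SUCCESS_CODES = {0, 3010}


def build_uninstall_command(kb: str) -> list[str]:
    """Return the wusa.exe argument list that silently removes one KB."""
    match = re.fullmatch(r"(?:KB)?(\d+)", kb.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a KB identifier: {kb!r}")
    return ["wusa.exe", "/uninstall", f"/kb:{match.group(1)}",
            "/quiet", "/norestart"]


def uninstall_kb(kb: str) -> bool:
    """Run the uninstall on the local Windows device; True on success."""
    result = subprocess.run(build_uninstall_command(kb), check=False)
    return result.returncode in SUCCESS_CODES
```

Calling `build_uninstall_command("KB5044033")` yields the same command line a technician would type by hand. Validating the KB identifier first keeps a typo from being pushed to thousands of machines.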

While PowerShell scripts are vital for granular control in offline or “Command Prompt only” scenarios, Hexnode’s native Automation acts as a first line of defense. It allows you to orchestrate these uninstalls directly from the portal with full status reporting, eliminating the need to manually manage script exit codes and log files.

B: The macOS Snapshot Strategy

Apple handles updates differently than Microsoft. There is no wusa.exe equivalent for macOS. Once an incremental update (e.g., macOS 14.1 to 14.2) is applied, you cannot simply “uninstall” it. You are typically looking at a full “wipe and reinstall” unless you have planned ahead. However, there is a strategy you could leverage:

The APFS Snapshot (Pre-Emptive)

This approach suits high-value machines or developer environments where uptime is critical. Admins can use the copy-on-write design of the Apple File System (APFS) to create a pre-update safety net.

Before pushing an OS update, deploy a shell script via Hexnode to trigger a Local APFS Snapshot.

The Script: tmutil localsnapshot /

This command doesn’t copy data; instead, it freezes the current file system’s metadata in seconds. If the subsequent update begins overwriting or corrupting system files, the snapshot ensures the original data blocks remain pinned and protected on the disk. A technician can boot into “Recovery Mode” and revert to this snapshot. This is a manual process, but it avoids a full wipe.
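
Before pushing the update, it helps to confirm the snapshot actually exists. The sketch below is a minimal Python illustration with hypothetical function names. It runs the command from the text and assumes tmutil reports the new snapshot with a timestamp of the form YYYY-MM-DD-HHMMSS, which you should confirm on your macOS versions.

```python
import re
import subprocess
from typing import Optional

# Assumed timestamp format in tmutil output, e.g. 2026-01-12-093000.
SNAPSHOT_DATE = re.compile(r"\d{4}-\d{2}-\d{2}-\d{6}")


def parse_snapshot_date(output: str) -> Optional[str]:
    """Extract the snapshot timestamp from tmutil output, if present."""
    match = SNAPSHOT_DATE.search(output)
    return match.group(0) if match else None


def create_snapshot() -> Optional[str]:
    """Create a local APFS snapshot and return its timestamp (macOS only)."""
    result = subprocess.run(
        ["tmutil", "localsnapshot", "/"],
        capture_output=True, text=True, check=True,
    )
    return parse_snapshot_date(result.stdout)
```

A deployment wrapper could refuse to start the OS update whenever `create_snapshot()` returns `None`, so no machine is patched without its safety net.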

Phase 4: Recovering the “Bricked” Devices (When Agents Go Dark)

1. The Safe Mode

The nightmare scenario for any admin is an update so catastrophic, such as a persistent Blue Screen of Death (BSOD), that the OS fails before the Hexnode Agent can initialize. If the agent can’t check in, it can’t execute your rollback. In these moments, recovery requires a bridge between automated management and physical intervention.

On Windows, if devices are caught in a boot loop but manage to reach the network for even a few seconds, you can attempt to force a stable state. By pushing a script to trigger Safe Mode with Networking, you strip away the noise of third-party drivers.

The Command: bcdedit /set {current} safeboot network

Safe Mode allows the OS to reach a desktop environment. This creates a stable environment where the system is no longer crashing. Even if the agent doesn’t check in immediately due to the limited service set of Safe Mode, the device is now “reachable” for a technician.
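
One detail worth planning for: the safeboot setting persists across reboots until it is removed, so the recovery runbook needs the matching exit command as well. This minimal Python sketch (the function name is hypothetical) pairs the command above with `bcdedit /deletevalue {current} safeboot`, which clears the flag so the next boot is a normal one.

```python
import subprocess

# Boot into Safe Mode with Networking on the next restart (from the text).
SAFEBOOT_ON = ["bcdedit", "/set", "{current}", "safeboot", "network"]
# Clear the flag again so the device returns to a normal boot.
SAFEBOOT_OFF = ["bcdedit", "/deletevalue", "{current}", "safeboot"]


def safe_mode_command(enable: bool) -> list[str]:
    """Return the bcdedit argument list to enter or leave Safe Mode."""
    return list(SAFEBOOT_ON if enable else SAFEBOOT_OFF)


def set_safe_mode(enable: bool) -> bool:
    """Apply the change on the local device (needs an elevated prompt)."""
    result = subprocess.run(safe_mode_command(enable), check=False)
    return result.returncode == 0
```

Run the exit command as the last step of the fix, before the final reboot. Otherwise the device keeps returning to Safe Mode.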

2. Using Hexnode LAPS

When remote access is impossible but the endpoint can be physically reached, a field technician or the end user can fix it locally. However, giving a user the Domain Admin password is a catastrophic security risk. You need to grant local admin rights so a driver can be uninstalled manually, without compromising your global credentials.

The Hexnode Fix: By deploying Hexnode’s Windows LAPS (Local Admin Password Solution) policy before a crisis hits, you ensure every device has a unique, encrypted local admin account.

The Recovery: An admin can simply look up the unique, one-time password for that specific machine in the Hexnode portal and provide it to the person on the ground. Once the fix is applied and the device checks back in, Hexnode automatically rotates the password, ensuring the key used for the fix is immediately neutralized and your security posture remains intact.

Mastering the Patch Rollback

In an era of software supply chain instability, an MDM strategy is only as strong as its exit plan. True operational resilience isn’t measured by how fast you can push an update, but by how precisely you can pull it back. The difference between a minor internal ticket and a global headline lies in the infrastructure you build today with Hexnode UEM’s automation, custom scripts, and other capabilities. By engineering a rapid patch rollback, you transform your fleet from a passive recipient of third-party code into a proactive, self-healing environment.

Frequently Asked Questions (FAQ)

1: Can you roll back a Windows Update via MDM?

Yes. You can roll back a specific Windows Update using an MDM scripting engine. By deploying a PowerShell script that runs the wusa.exe /uninstall /kb:XXXXXX /quiet command, you can silently uninstall a specific patch across thousands of devices without user interaction.

2: How do I recover from a bad driver update (like CrowdStrike) if the device crashes?

If the device can still boot to a network-enabled state (Safe Mode with Networking), you can use MDM to push a script to delete the faulty driver file (e.g., a .sys file). If the device cannot reach that state, you must rely on LAPS (Local Admin Password Solution) to provide a unique local admin password to a technician for manual remediation.

3: What is the “Kill Switch” in patch management?

A “Kill Switch” is an emergency policy configuration that immediately pauses all software updates. In Hexnode, this is achieved by changing policies to defer updates. This creates a firebreak, preventing the faulty update from spreading to devices that have not yet checked in.
