
Resume Long Runs with PowerShell Checkpoints: Crash-Safe, Restartable Workflows

Long-running scripts fail at the worst possible time—network blips, API rate limits, restarts, or a stray Ctrl+C. You don't need to redo work after a crash or cancellation. With a tiny JSON checkpoint file, you can persist progress after each step, skip completed items on restart, and clear the checkpoint only when everything is done. The result: no rework, safer reruns, faster recovery, and predictable progress.

The Pattern: Minimal Checkpointing in PowerShell

The core idea is simple:

  • After each successful step, write a small state file (JSON) capturing your latest progress.
  • On startup, read the state and skip finished items.
  • When all work completes, delete the checkpoint to reset state.

Here is a minimal example that processes 20 items and resumes from where it left off:

$items = 1..20
$cp = Join-Path -Path (Get-Location) -ChildPath 'checkpoint.json'

$state = @{ Index = 0 }
if (Test-Path -Path $cp) {
  try {
    $state = Get-Content -Path $cp -Raw -ErrorAction Stop | ConvertFrom-Json -ErrorAction Stop
  } catch {
    Write-Warning ("Ignoring bad checkpoint: {0}" -f $_.Exception.Message)
  }
}

for ($i = [int]$state.Index + 1; $i -le $items.Count; $i++) {
  $n = $items[$i - 1]
  # Do work
  Start-Sleep -Milliseconds 80
  Write-Host ("Processed {0}" -f $n)

  # Persist progress
  $state.Index = $i
  [IO.File]::WriteAllText($cp, ($state | ConvertTo-Json), [Text.UTF8Encoding]::new($false))
}

if ($state.Index -ge $items.Count) {
  Remove-Item -Path $cp -Force -ErrorAction SilentlyContinue
  Write-Host 'All done.'
}

Why it works:

  • Resilience: Progress is saved after each item, so a crash only loses in-flight work for the current item.
  • Predictability: You always know how many items remain.
  • Repeatability: Reruns are safe—completed items are skipped deterministically.

Hardening the Pattern for Production

1) Use atomic writes

Write to a temporary file, then rename over the original. This prevents partial/corrupt files when the process stops mid-write.

function Write-JsonCheckpoint {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [object] $State
  )
  $json = $State | ConvertTo-Json -Depth 8 -Compress
  $tmp  = "${Path}.tmp"
  [IO.File]::WriteAllText($tmp, $json, [Text.UTF8Encoding]::new($false))
  Move-Item -Path $tmp -Destination $Path -Force
}

2) Guard against corrupt checkpoints

Checkpoints can be truncated by abrupt termination. If parsing fails, log and continue from a safe default, and optionally archive the bad file for diagnosis. (ConvertFrom-Json -AsHashtable, used here and in the later scenarios, requires PowerShell 6 or newer; Windows PowerShell 5.1 does not support it.)

$state = @{ Index = 0 }
if (Test-Path $cp) {
  try {
    $state = Get-Content $cp -Raw -ErrorAction Stop | ConvertFrom-Json -AsHashtable -ErrorAction Stop
  } catch {
    Write-Warning ("Bad checkpoint; starting clean: {0}" -f $_.Exception.Message)
    Rename-Item -Path $cp -NewName ("checkpoint.bad.{0:yyyyMMddHHmmss}.json" -f (Get-Date)) -ErrorAction SilentlyContinue
  }
}

3) Make your work idempotent

On rerun, your work for a given item should be safe to repeat or skip. Techniques:

  • Use upserts or PUT/merge semantics in APIs.
  • Write outputs to deterministic paths or names.
  • Include content hashes or version markers to detect if work is already done (see the sketch after this list).
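
As a concrete sketch of the hash-marker technique (the Copy-IfChanged name and its parameters are made up for this example, not part of any module), the function below compares SHA-256 hashes and only copies when the destination is missing or different, so repeating it for an already-finished item is a no-op:

function Copy-IfChanged {
  param(
    [Parameter(Mandatory)] [string] $Source,
    [Parameter(Mandatory)] [string] $Destination
  )
  # Skip the copy when the destination already holds identical content,
  # so reruns after a crash do no extra work.
  if (Test-Path -Path $Destination) {
    $srcHash = (Get-FileHash -Path $Source -Algorithm SHA256).Hash
    $dstHash = (Get-FileHash -Path $Destination -Algorithm SHA256).Hash
    if ($srcHash -eq $dstHash) { return }
  }
  $destDir = Split-Path -Path $Destination -Parent
  if (-not (Test-Path -Path $destDir)) {
    New-Item -ItemType Directory -Path $destDir -Force | Out-Null
  }
  Copy-Item -Path $Source -Destination $Destination -Force
}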

4) Concurrency and locks

If multiple processes might touch the same checkpoint, add a simple lock. For single-process use, atomic writes are usually enough. For multi-process, consider a lock file or a named mutex.

$mutex = New-Object System.Threading.Mutex($false, 'Global\MyCheckpointLock')
try {
  $got = $mutex.WaitOne([TimeSpan]::FromSeconds(10))
  if (-not $got) { throw 'Could not acquire checkpoint lock' }
  Write-JsonCheckpoint -Path $cp -State $state
} finally {
  if ($got) { $mutex.ReleaseMutex() }
  $mutex.Dispose()
}

5) Keep checkpoints small and useful

  • Store only what you need to resume: an index, a next token, or a set of completed IDs.
  • Optionally record a tiny amount of metadata (timestamps, durations, last error) for observability, as in the example below.
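
For example, a minimal sketch of a compact state object (the field names are illustrative, not required by anything): resume logic should depend only on LastKey, and the rest is cheap metadata for troubleshooting.

$state = @{
  LastKey   = 'customer-10421'          # enough to resume
  Processed = 12890                     # observability only
  LastRunAt = (Get-Date).ToString('o')  # ISO 8601 timestamp
  LastError = $null
}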

6) Save less often if needed

For very hot loops, persist every N items or every T seconds instead of on every iteration; you trade a bounded amount of rework after a crash (at most one save interval) for better throughput.

$saveEvery = 10
$nextSave  = (Get-Date).AddSeconds(1)
for ($i = $start; $i -le $count; $i++) {
  # ... work ...
  $state.Index = $i
  if ($i % $saveEvery -eq 0 -or (Get-Date) -ge $nextSave) {
    Write-JsonCheckpoint $cp $state
    $nextSave = (Get-Date).AddSeconds(1)
  }
}
Write-JsonCheckpoint $cp $state  # final save so the checkpoint matches the last item processed

Adapting the Pattern to Real Workloads

Scenario A: Filesystem migration with a set of arbitrary items

When your item set is not a simple numeric range, track a dictionary of completed item keys. This enables safe skipping even if the order changes.

$items = Get-ChildItem -File -Recurse 'C:\Data'
$cp = Join-Path $pwd 'checkpoint.json'
$state = @{ Done = @{}; Errors = @{} }
if (Test-Path $cp) { $state = Get-Content $cp -Raw | ConvertFrom-Json -AsHashtable }

foreach ($f in $items) {
  $key = $f.FullName
  if ($state.Done.ContainsKey($key)) { continue }
  try {
    # Do idempotent work (e.g., copy if not exists or if hash differs)
    $dest = $key -replace '^C:\\Data', 'D:\Archive'
    $destDir = Split-Path -Path $dest -Parent
    if (-not (Test-Path -Path $destDir)) { New-Item -ItemType Directory -Path $destDir -Force | Out-Null }
    Copy-Item -Path $key -Destination $dest -Force

    $state.Done[$key] = @{ at = (Get-Date) }
    Write-JsonCheckpoint $cp $state
  } catch {
    $state.Errors[$key] = $_.Exception.Message
    Write-Warning ("Failed: {0} => {1}" -f $key, $_.Exception.Message)
    Write-JsonCheckpoint $cp $state
  }
}
if ($state.Done.Count -ge $items.Count) { Remove-Item $cp -Force -ErrorAction SilentlyContinue }

Benefits:

  • Reordering or re-enumeration does not matter; completed files are skipped by key.
  • You can inspect failures later, fix the root cause, and rerun to finish only the missing ones.

Scenario B: API pagination with page index or resume token

For APIs that return a next page index or a resume token, persist that token so you never refetch pages you already processed.

$cp = Join-Path $pwd 'checkpoint.json'
$state = @{ Page = 1; Token = $null }
if (Test-Path $cp) { $state = Get-Content $cp -Raw | ConvertFrom-Json -AsHashtable }

while ($true) {
  $resp = Invoke-RestMethod -Method GET -Uri "https://api.example.com/items" -Body @{ page = $state.Page; token = $state.Token }

  foreach ($it in $resp.items) {
    # process item idempotently
  }

  $state.Page++
  $state.Token = $resp.nextToken
  Write-JsonCheckpoint $cp $state

  if (-not $resp.hasMore) { break }
}
Remove-Item $cp -Force -ErrorAction SilentlyContinue

Scenario C: Database batches

Batch by primary key ranges and persist the last committed key. Use transactions to ensure idempotent, all-or-nothing batches. On restart, begin at the next unprocessed key (a sketch follows the outline below).

  • SELECT with key > lastKey ORDER BY key LIMIT N
  • Process within a transaction, commit, then update checkpoint
  • Repeat until no rows remain
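
A hedged sketch of that loop, assuming a hypothetical Invoke-BatchQuery helper that selects rows with keys greater than the checkpointed value, processes them inside a single transaction, commits, and returns the rows it handled (substitute Invoke-Sqlcmd or your own data-access layer):

$cp = Join-Path $pwd 'checkpoint.json'
$state = @{ LastKey = 0 }
if (Test-Path $cp) { $state = Get-Content $cp -Raw | ConvertFrom-Json -AsHashtable }

$batchSize = 500
while ($true) {
  # Invoke-BatchQuery is hypothetical; -AfterKey and -Limit stand in for
  # "SELECT ... WHERE key > @lastKey ORDER BY key LIMIT N" plus the
  # transactional processing described above.
  $rows = Invoke-BatchQuery -AfterKey $state.LastKey -Limit $batchSize
  if (-not $rows -or $rows.Count -eq 0) { break }

  # Advance the checkpoint only after the batch has committed.
  $state.LastKey = ($rows | Select-Object -Last 1).Key
  Write-JsonCheckpoint $cp $state
}
Remove-Item $cp -Force -ErrorAction SilentlyContinue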

Operational Tips and Safety Checks

  • Test resumption: Intentionally kill the script mid-run and verify it restarts without redoing work.
  • Handle Ctrl+C: Because you persist after each item, you usually don’t need special handling; the next run continues from the last successful save. For extra safety, flush in a finally block, as sketched after this list.
  • Avoid secrets in checkpoints: If you must store sensitive data, encrypt with DPAPI or CMS. Prefer storing opaque IDs or tokens that are safe to reissue.
  • Log outcomes: Add counts of processed/remaining items and last error message to your state for quick troubleshooting.
  • Clean up: Always remove the checkpoint when work completes. If the script ends successfully, the next run should start fresh.
  • Monitor size: If you store a growing set of item keys, consider periodic compaction or sharding the checkpoint by partition (e.g., one per day or per batch).
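
For the finally-block flush mentioned in the Ctrl+C tip, a minimal sketch (reusing $state, $items, $cp, and Write-JsonCheckpoint from earlier) looks like this:

try {
  for ($i = [int]$state.Index + 1; $i -le $items.Count; $i++) {
    # ... idempotent work on $items[$i - 1] ...
    $state.Index = $i
  }
} finally {
  # Per about_Try_Catch_Finally, a finally block also runs when the script
  # is stopped with Ctrl+C, so the latest in-memory progress is persisted
  # before the run ends.
  Write-JsonCheckpoint -Path $cp -State $state
}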

Putting It All Together

Checkpoints give you a small, reliable brain for your long-running tasks. You persist minimal state after each step, skip finished items on restart, and clear the file when you’re done. The approach is simple, but the wins are big: no rework, safer reruns, faster recovery, and predictable progress—even in flaky environments.

Try it on your next file migration, API sync, log processing job, or database backfill. Make your work idempotent, save progress atomically, and you’ll spend less time babysitting scripts and more time shipping value.
