
MoppleIT Tech Blog


Efficient PowerShell Pipeline Functions with Begin/Process/End

PowerShell's advanced functions become dramatically faster and more predictable when you design them around the Begin/Process/End pattern. By precomputing once in begin, streaming items one at a time through process, and aggregating or finalizing output in end, you reduce memory use, improve throughput, and return clean, typed objects that are easy to consume in downstream tooling.

The Begin/Process/End pattern at a glance

Advanced functions let you implement the same streaming semantics as built-in cmdlets. The three blocks each have a clear responsibility:

  • begin: Initialize shared state, warm caches, compile regex, open connections, and set up any reusable resources. Runs exactly once per invocation.
  • process: Handle one pipeline input item at a time. Do the minimal necessary work, stream results, and avoid accumulating large collections in memory.
  • end: Finalize any computation that depends on all inputs (sorting, aggregation, deduplication), then emit the final objects.
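As a minimal sketch of how the three blocks fit together (the function and parameter names here are illustrative, not from a real module):

```powershell
function Invoke-Example {
  [CmdletBinding()]
  param(
    [Parameter(ValueFromPipeline)]
    [string]$InputItem
  )
  begin {
    # Runs exactly once, before any input: set up shared state
    $seen = 0
  }
  process {
    # Runs once per pipeline item: do minimal work and stream the result
    $seen++
    $InputItem.Trim()
  }
  end {
    # Runs once after all input: finalize and report
    Write-Verbose "Processed $seen item(s)"
  }
}

' a ',' b ' | Invoke-Example   # emits 'a' then 'b', one item at a time
```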

Benefits you get by adhering to this structure:

  • Lower memory: You don't hold all inputs at once. Count, transform, or filter as items arrive.
  • Speed: Precompute once in begin, then reuse that state for each item in process. Sorting and aggregation happen only once, in end.
  • Predictable output: Emit typed objects instead of formatted strings; your results round-trip well to JSON, CSV, databases, and dashboards.
  • Composability: Functions that stream scale naturally in long pipelines and across remote sessions.

Example: Get-TopExtensions

The function below counts file extensions across one or more directories, then returns the top N extensions as objects. It preps state in begin, processes each input path in process, and aggregates and sorts in end.

function Get-TopExtensions {
  [CmdletBinding()]
  param(
    [Parameter(ValueFromPipeline)]
    [string]$Path,
    [int]$Top = 5
  )
  begin {
    $counts = @{}
  }
  process {
    if (Test-Path -LiteralPath $Path) {
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction SilentlyContinue |
        ForEach-Object {
          $ext = [IO.Path]::GetExtension($_.Name).ToLower()
          if (-not $ext) { $ext = '(none)' }
          if ($counts.ContainsKey($ext)) { $counts[$ext]++ } else { $counts[$ext] = 1 }
        }
    }
  }
  end {
    $counts.GetEnumerator() |
      Sort-Object -Property Value -Descending |
      Select-Object -First $Top |
      ForEach-Object { [pscustomobject]@{ Extension = $_.Key; Count = $_.Value } }
  }
}

# Example
'C:\Logs','C:\Windows\Temp' | Get-TopExtensions -Top 3

Why this is efficient:

  • One dictionary holds counts; it grows with the number of distinct extensions, not the number of files.
  • Streaming Get-ChildItem results means you don't allocate a huge array of file objects; counts are updated as each file arrives.
  • Typed output via [pscustomobject] ensures predictable properties for downstream tools.

Hardening and tuning the pattern

Begin: precompute once, set shared state

Minimize per-item work by preparing reusable resources in begin:

  • Use a case-insensitive dictionary to avoid .ToLower() overhead and string allocation.
  • Compile regular expressions or build a HashSet of excludes.
  • Initialize timers or counters for diagnostics.
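A sketch of a begin block that prepares these resources once (the function, pattern, and exclude values are hypothetical):

```powershell
function Select-ErrorLine {
  [CmdletBinding()]
  param(
    [Parameter(ValueFromPipeline)]
    [string]$Line
  )
  begin {
    # Compiled once here, reused for every item in process
    $errorPattern = [regex]::new('ERROR|FATAL',
      [System.Text.RegularExpressions.RegexOptions]::Compiled)

    # Case-insensitive exclude set gives O(1) per-item lookups
    $excludes = [System.Collections.Generic.HashSet[string]]::new(
      [System.StringComparer]::OrdinalIgnoreCase)
    [void]$excludes.Add('debug')
  }
  process {
    if ($excludes.Contains($Line)) { return }   # short-circuit early
    if ($errorPattern.IsMatch($Line)) { $Line } # stream matches
  }
}
```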

Process: stream and keep memory low

Process should be side-effect free (unless you're explicitly updating shared state), idempotent, and fast. Validate inputs, short-circuit early, and avoid heavy per-item allocations.

End: aggregate, sort, emit typed results

Do the expensive operations that need the full data set in end: sorting, ranking, and final projection to clean objects.

Improved example with typing and guard rails

function Get-TopExtensions {
  [CmdletBinding(SupportsShouldProcess = $false)]
  [OutputType([pscustomobject])]
  param(
    [Parameter(ValueFromPipeline, ValueFromPipelineByPropertyName)]
    [Alias('FullName')]
    [string]$Path,

    [ValidateRange(1, [int]::MaxValue)]
    [int]$Top = 5
  )
  begin {
    # Case-insensitive comparer avoids per-item ToLower()
    $comparer = [System.StringComparer]::OrdinalIgnoreCase
    $counts = [System.Collections.Generic.Dictionary[string,int]]::new($comparer)
  }
  process {
    if (-not $Path) { return }
    try {
      if (-not (Test-Path -LiteralPath $Path)) { return }
      Get-ChildItem -LiteralPath $Path -File -Recurse -ErrorAction SilentlyContinue |
        ForEach-Object {
          $ext = [IO.Path]::GetExtension($_.Name)
          if ([string]::IsNullOrEmpty($ext)) { $ext = '(none)' }
          if ($counts.ContainsKey($ext)) { $counts[$ext]++ } else { $counts[$ext] = 1 }
        }
    } catch {
      # Non-terminating error to keep pipeline streaming
      Write-Error -ErrorRecord $_
    }
  }
  end {
    $counts.GetEnumerator() |
      Sort-Object -Property @{Expression='Value';Descending=$true}, @{Expression='Key';Descending=$false} |
      Select-Object -First $Top |
      ForEach-Object { [pscustomobject]@{ Extension = $_.Key; Count = $_.Value } }
  }
}

Notes on the changes:

  • ValueFromPipelineByPropertyName allows objects with a FullName or Path property to flow in naturally.
  • OrdinalIgnoreCase dictionary removes the need to normalize case on each item.
  • Error handling emits non-terminating errors for problematic paths while preserving streaming for the rest.
  • Stable sort adds a tie-breaker on Key for deterministic results.

Practical tips for pipeline-first design

  • Emit objects, not formatted text: Don't call Format-* inside functions. Return objects; let callers format at the end of the pipeline.
  • Be explicit about output shape: Use [OutputType()] and return consistent properties. [pscustomobject] is ideal for lightweight records.
  • Avoid accumulating arrays: Replace $globalList += $item with streaming logic or use a .NET collection and .Add() for O(1) appends.
  • Scope resource lifetimes: Open connections, files, or caches in begin, reuse in process, and dispose or flush in end.
  • Prefer non-terminating errors per item: Use Write-Error (and honor $PSCmdlet.ThrowTerminatingError() when needed) to keep the pipeline alive.
  • Respect common parameters: Let -Verbose, -ErrorAction, and -WhatIf/-Confirm (when relevant) work as expected.
  • Parallelism caution: If you compose with ForEach-Object -Parallel or runspaces, avoid shared mutable state; use concurrent collections or aggregate per-runspace and merge in end.
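To illustrate the "avoid accumulating arrays" tip above: += on an array allocates a new array and copies every element on each append, while a generic List offers amortized O(1) appends.

```powershell
# Slow: each += copies the whole array, so the loop is O(n^2) overall
$slow = @()
1..10000 | ForEach-Object { $slow += $_ }

# Fast: List[T].Add() appends in amortized constant time
$fast = [System.Collections.Generic.List[int]]::new()
1..10000 | ForEach-Object { $fast.Add($_) }
```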

Measuring the wins

Compare a naive, non-streaming approach that groups after materializing all files with the streaming approach:

$paths = 'C:\Logs','C:\Windows\Temp'

# Naive: materialize and group everything
$naive = Measure-Command {
  $paths |
    Get-ChildItem -File -Recurse -ErrorAction SilentlyContinue |
    Group-Object { [IO.Path]::GetExtension($_.Name) } |
    Sort-Object Count -Descending |
    Select-Object -First 3 |
    Out-Null
}

# Streaming: count on the fly
$streaming = Measure-Command {
  $paths | Get-TopExtensions -Top 3 | Out-Null
}

"Naive (ms):     $($naive.TotalMilliseconds)"
"Streaming (ms): $($streaming.TotalMilliseconds)"

On large trees, the streaming version typically uses far less memory and completes faster because it avoids building a giant intermediate array of file objects.

Real-world use cases

  • Log analysis: Stream log lines, count error codes in a dictionary, then emit the top offenders in end.
  • Inventory: Traverse servers or subscriptions, build a set of unique image IDs or SKUs, then output sorted summaries.
  • Compliance: Evaluate each resource against a rule set in process, and summarize failures in end.
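A sketch of the log-analysis case, assuming error codes look like 'E' followed by three digits (the function name and code format are assumptions for illustration):

```powershell
function Get-TopErrorCodes {
  [CmdletBinding()]
  param(
    [Parameter(ValueFromPipeline)]
    [string]$Line,
    [int]$Top = 5
  )
  begin {
    $pattern = [regex]::new('\bE\d{3}\b')   # hypothetical error-code shape
    $counts  = [System.Collections.Generic.Dictionary[string,int]]::new()
  }
  process {
    # Count each code as lines stream in; memory grows with distinct codes only
    $m = $pattern.Match($Line)
    if ($m.Success) {
      if ($counts.ContainsKey($m.Value)) { $counts[$m.Value]++ }
      else { $counts[$m.Value] = 1 }
    }
  }
  end {
    $counts.GetEnumerator() |
      Sort-Object Value -Descending |
      Select-Object -First $Top |
      ForEach-Object { [pscustomobject]@{ Code = $_.Key; Count = $_.Value } }
  }
}
```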

Common pitfalls to avoid

  • Mixing formatting with data: Keep Write-Host and Format-* out of your functions. Use Write-Verbose for diagnostics.
  • Per-item heavy initialization: Don't compile regex or open connections in process. Move that to begin.
  • Unbounded accumulation: If you must aggregate, prefer dictionary or counters over appending objects per item.
  • Inconsistent output: Always return the same object shape regardless of input. When something fails, emit error records, not partially populated objects with a different shape.

Takeaways

  • Use begin to set up shared state and warm caches.
  • Use process to handle each item efficiently and keep memory low.
  • Use end to aggregate, sort, and emit typed results.
  • Return objects, not formatted strings, for predictable, composable pipelines.
  • Measure and iterate; small structural changes unlock big performance wins.