
Efficient Log Parsing with Get-Content -ReadCount: Flat Memory, Steady Throughput

When logs get large, naive line-by-line parsing can thrash your CPU, inflate memory, and make throughput unpredictable. A better pattern in PowerShell is to stream logs in fixed-size chunks using Get-Content -ReadCount. By batching lines, you keep memory flat, processing costs consistent, and progress easy to report. In this post, you'll learn a production-ready approach: batching, per-batch aggregation, explicit encoding and error handling, and pragmatic progress reporting.

Why batch logs with -ReadCount?

-ReadCount changes how Get-Content emits data. Instead of one string per line, it emits an array of up to that many lines per batch (the final batch may be smaller), so downstream ForEach-Object receives bounded chunks you can aggregate efficiently. A quick check after the list below shows the shape difference.

  • Flat memory: You never load the entire file; you only hold a bounded chunk at a time.
  • Steady throughput: Work is evenly amortized across batches, minimizing GC churn and allocation spikes.
  • Explicit behavior: Setting -Encoding and -ErrorAction makes runs predictable across machines and PowerShell versions.
  • Progress you can trust: Emit periodic updates by batch count without flooding the console.
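
To see what changes, compare the pipeline objects with and without -ReadCount (the path is illustrative):

# Without -ReadCount: one [string] per pipeline object
Get-Content 'C:\logs\app.log' | Select-Object -First 2 | ForEach-Object { $_.GetType().Name }

# With -ReadCount: one [object[]] of up to 1000 lines per pipeline object
Get-Content 'C:\logs\app.log' -ReadCount 1000 | Select-Object -First 2 | ForEach-Object { $_.GetType().Name }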

Here's a minimal pattern to count lines, errors, and warnings while reporting progress at 10K-line intervals:

$path = 'C:\logs\app.log'
$read = 1000
$stats = [ordered]@{ Lines = 0; Errors = 0; Warn = 0 }

# Each pipeline object is an array of up to $read lines, not a single string
Get-Content -Path $path -ReadCount $read -Encoding UTF8 -ErrorAction Stop |
  ForEach-Object {
    $batch = $_
    $stats.Lines  += $batch.Count
    $stats.Errors += ($batch | Where-Object { $_ -match '\bERROR\b' }).Count
    $stats.Warn   += ($batch | Where-Object { $_ -match '\bWARN(ING)?\b' }).Count
    # Fires every 10 full batches (10 x 1000 lines)
    if (($stats.Lines % 10000) -eq 0) { Write-Host ("Processed {0} lines..." -f $stats.Lines) }
  }

[pscustomobject]$stats

This gives you lower memory use, faster runs, predictable throughput, and progress output you can trust.

Make it production-ready

Encoding and error handling

Different environments default to different encodings (Windows PowerShell vs. PowerShell Core). Make encoding explicit and stop on errors so your automation fails fast and loud.

param(
  [Parameter(Mandatory)]
  [string]$Path,
  [int]$ReadCount = 1000,
  # Note: 'Latin1' is recognized by PowerShell 7+; Windows PowerShell 5.1 supports a narrower set
  [ValidateSet('UTF8','UTF7','Unicode','UTF32','ASCII','BigEndianUnicode','Latin1')]
  [string]$Encoding = 'UTF8'
)

$stats = [ordered]@{ Lines = 0; Errors = 0; Warn = 0 }

try {
  Get-Content -Path $Path -ReadCount $ReadCount -Encoding $Encoding -ErrorAction Stop |
    ForEach-Object {
      $batch = $_
      $stats.Lines += $batch.Count

      foreach ($line in $batch) {
        if ($line -match '\bERROR\b')     { $stats.Errors++ }
        if ($line -match '\bWARN(ING)?\b') { $stats.Warn++ }
      }

      if (($stats.Lines % 10000) -eq 0) {
        Write-Progress -Activity "Parsing $Path" -Status "Processed $($stats.Lines) lines..."
      }
    }
}
catch {
  Write-Error "Failed to read ${Path}: $($_.Exception.Message)"
  exit 1
}

[pscustomobject]$stats | ConvertTo-Json -Compress

Key points:

  • -Encoding: Keep it explicit for predictable behavior across hosts and OSes.
  • -ErrorAction Stop: Ensures file access issues terminate the run (use try/catch to capture a friendly message).
  • Write-Progress vs Write-Host: Prefer Write-Progress in automation and terminals that support it. In CI logs, consider Write-Host or Write-Information with $InformationPreference tuned.
  • Structured output: Convert final stats to JSON if downstream tooling consumes machine-readable output.

Boost throughput with compiled regex

If you're scanning millions of lines, cut overhead by precompiling your regexes and avoiding repeated Where-Object allocations.

$rxError = [regex]::new('\bERROR\b', [Text.RegularExpressions.RegexOptions]::Compiled)
$rxWarn  = [regex]::new('\bWARN(ING)?\b', [Text.RegularExpressions.RegexOptions]::Compiled)

$stats = [ordered]@{ Lines = 0; Errors = 0; Warn = 0 }

Get-Content -Path $Path -ReadCount $ReadCount -Encoding UTF8 -ErrorAction Stop |
  ForEach-Object {
    $batch = $_
    foreach ($line in $batch) {
      if ($rxError.IsMatch($line)) { $stats.Errors++ }
      if ($rxWarn.IsMatch($line))  { $stats.Warn++ }
    }
    $stats.Lines += $batch.Count
  }

For simple substring detection (no word boundaries), -SimpleMatch with Select-String can be even faster, but regex gives you correctness for tokenized terms like ERROR vs. SOMEERRORCODE.
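
If word boundaries genuinely don't matter, here's a minimal sketch of the Select-String route (note it makes one pass per pattern, which is fine for one-off counts):

# -SimpleMatch treats the pattern as a literal substring and skips the regex engine
$errorHits = (Select-String -Path $Path -Pattern 'ERROR' -SimpleMatch).Count
$warnHits  = (Select-String -Path $Path -Pattern 'WARN'  -SimpleMatch).Count
"Errors: $errorHits, Warnings: $warnHits"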

Choose the right batch size (and measure)

There's no one-size-fits-all value. As a rule of thumb:

  • 256–1,000: safe on memory-constrained hosts and containers
  • 1,000–5,000: a good default for general-purpose CI agents
  • 10,000+: fewer pipeline hops; may help on fast disks/SSDs, but watch memory and latency

Benchmark on your hardware and logs:

$path  = 'C:\logs\app.log'
$sizes = 256, 1000, 5000, 10000
foreach ($n in $sizes) {
  $sw = [System.Diagnostics.Stopwatch]::StartNew()
  $lines = 0
  Get-Content -Path $path -ReadCount $n -Encoding UTF8 -ErrorAction Stop |
    ForEach-Object { $lines += $_.Count }
  $sw.Stop()
  '{0,6} lines/batch -> {1,8} ms, {2,10} lines/s' -f $n, $sw.ElapsedMilliseconds, [int]($lines / [math]::Max(0.001, $sw.Elapsed.TotalSeconds))
}

Pick the smallest batch that gives you stable, fast throughput. Aim for repeatability more than absolute peak numbers.

Patterns, edge cases, and deployment tips

Real-time logs (tailing)

For live monitoring, use -Tail and -Wait to follow appends. Typically you process line-by-line; batching matters less when lines arrive intermittently.

$path = 'C:\logs\app-live.log'
$stats = [ordered]@{ Lines = 0; Errors = 0; Warn = 0 }

Get-Content -Path $path -Tail 0 -Wait -Encoding UTF8 |
  ForEach-Object {
    $line = $_
    $stats.Lines++
    if ($line -match '\bERROR\b')     { $stats.Errors++ }
    if ($line -match '\bWARN(ING)?\b') { $stats.Warn++ }
    if (($stats.Lines % 1000) -eq 0) {
      Write-Progress -Activity "Following $path" -Status "Seen $($stats.Lines) lines..."
    }
  }

Tip: When you need strict batching for bursty writers, you can still buffer in your own loop (collect N lines then aggregate) while reading from -Wait.
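
A minimal sketch of that manual buffering, reusing $path and $stats from above (the 500-line flush threshold is an arbitrary choice for illustration):

$buffer  = [System.Collections.Generic.List[string]]::new()
$flushAt = 500

Get-Content -Path $path -Tail 0 -Wait -Encoding UTF8 |
  ForEach-Object {
    $buffer.Add($_)
    if ($buffer.Count -ge $flushAt) {
      # Aggregate the whole batch at once, as in the -ReadCount examples
      $stats.Lines  += $buffer.Count
      $stats.Errors += ($buffer | Where-Object { $_ -match '\bERROR\b' }).Count
      $buffer.Clear()
    }
    # Note: a final partial buffer waits until more lines arrive; flush on a timer if that matters
  }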

Very large files and rotation

  • Huge files: -ReadCount ensures bounded memory regardless of file size.
  • Rotation: If the file rotates (e.g., app.log → app.log.1), a long-running read of the old handle may not see new data. For batch jobs, prefer discrete runs (e.g., per file) or detect rotation and restart the read.
  • Locking: Most appenders support read-sharing; if not, handle access exceptions and retry with backoff (a sketch follows).
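
For the locking case, a hedged retry-with-backoff sketch (the attempt count and delays are illustrative):

$maxAttempts = 5
for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
  try {
    Get-Content -Path $Path -ReadCount $ReadCount -Encoding UTF8 -ErrorAction Stop |
      ForEach-Object { $stats.Lines += $_.Count }
    break  # success: leave the retry loop
  }
  catch {
    if ($attempt -eq $maxAttempts) { throw }
    $delay = [int][math]::Pow(2, $attempt)  # exponential backoff: 2, 4, 8, 16 seconds
    Write-Warning "Read failed (attempt $attempt): $($_.Exception.Message). Retrying in $delay s..."
    Start-Sleep -Seconds $delay
  }
}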

Make progress output CI-friendly

  • Don't pollute the pipeline: Keep progress on the host/information streams; emit final results on the output stream.
  • JSON output: Many CI/CD systems (GitHub Actions, Azure Pipelines) can parse JSON artifacts.

Putting both together:
$stats = [ordered]@{ Lines = 0; Errors = 0; Warn = 0 }
$progressEvery = 10000

Get-Content -Path $Path -ReadCount $ReadCount -Encoding UTF8 -ErrorAction Stop |
  ForEach-Object {
    $batch = $_
    foreach ($line in $batch) {
      if ($line -match '\bERROR\b')     { $stats.Errors++ }
      if ($line -match '\bWARN(ING)?\b') { $stats.Warn++ }
    }
    $stats.Lines += $batch.Count
    if (($stats.Lines % $progressEvery) -eq 0) {
      Write-Information ("Processed $($stats.Lines) lines...") -InformationAction Continue
    }
  }

# Structured output only at the end
[pscustomobject]$stats | ConvertTo-Json -Compress

In non-interactive runs, set $InformationPreference = 'Continue' (or capture the stream) to ensure progress messages are visible without mixing them with the main output.
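
For example (the script name is illustrative):

# Make the information stream visible for the rest of the session
$InformationPreference = 'Continue'

# Or redirect stream 6 (information) to a file when invoking a script
.\Parse-Logs.ps1 -Path 'C:\logs\app.log' 6> progress.log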

Containers and resource limits

In memory-constrained containers or sidecars that scrape logs, -ReadCount is a simple form of backpressure. Allow tuning via environment variables:

# The '??' null-coalescing operator requires PowerShell 7+
$read = [int]($env:READCOUNT ?? 1000)
$enc  = $env:LOG_ENCODING ?? 'UTF8'

Get-Content -Path $Path -ReadCount $read -Encoding $enc -ErrorAction Stop | ...

Keep images minimal, avoid unnecessary modules, and prefer streaming to avoid temporary files. Mount logs read-only to reduce risk.

Cross-platform and encoding nuances

  • Defaults differ: PowerShell 5.1 vs. 7+ have different default encodings; always set -Encoding.
  • Mixed newlines: Get-Content splits on both CRLF and LF, so your regexes don't need to handle newline characters explicitly.
  • Non-UTF8 logs: If your logs are Windows-1252/Latin1, specify that encoding to avoid mojibake and false negatives in pattern matching (a sketch follows).
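
On PowerShell 7+, -Encoding also accepts any [System.Text.Encoding] instance, so Windows-1252 logs can be read directly; a sketch, reusing $Path, $ReadCount, and $stats from earlier:

# Read legacy Windows-1252 logs without mojibake (PowerShell 7+)
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
Get-Content -Path $Path -ReadCount $ReadCount -Encoding $cp1252 -ErrorAction Stop |
  ForEach-Object { $stats.Lines += $_.Count }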

Extending the pattern

  • Multi-pattern scans: Track additional counters (HTTP status classes, latency buckets) within the per-batch loop, as in the example below.
  • Sampling: For extremely high-volume logs, sample 1 in N lines for metrics while still escalating on errors (see the sketch after the example).
  • Persistence: Emit a metrics object per file with metadata (hostname, app version, time window) for dashboards.
$stats = [ordered]@{
  Lines = 0; Errors = 0; Warn = 0; Info = 0; Requests2xx = 0; Requests5xx = 0
}

Get-Content -Path $Path -ReadCount $ReadCount -Encoding UTF8 -ErrorAction Stop |
  ForEach-Object {
    $batch = $_
    foreach ($line in $batch) {
      if ($line -match '\bERROR\b')     { $stats.Errors++ }
      elseif ($line -match '\bWARN(ING)?\b') { $stats.Warn++ }
      elseif ($line -match '\bINFO\b')  { $stats.Info++ }

      if ($line -match '"\s(?<status>[0-9]{3})\s') {
        $code = [int]$Matches.status
        if     ($code -ge 200 -and $code -lt 300) { $stats.Requests2xx++ }
        elseif ($code -ge 500 -and $code -lt 600) { $stats.Requests5xx++ }
      }
    }
    $stats.Lines += $batch.Count
  }

[pscustomobject]$stats
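
And the sampling idea, as a sketch for the per-batch loop (the 1-in-10 rate is illustrative):

$sampleEvery = 10
$seen = 0

foreach ($line in $batch) {
  # Errors always count; never sample away escalation signals
  if ($line -match '\bERROR\b') { $stats.Errors++; continue }

  # Everything else: inspect only 1 line in $sampleEvery
  $seen++
  if (($seen % $sampleEvery) -ne 0) { continue }
  if ($line -match '\bWARN(ING)?\b') { $stats.Warn++ }  # scale by $sampleEvery when reporting
}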

Wrap-up

By batching with Get-Content -ReadCount, you get predictable performance and memory usage while preserving streaming semantics. Aggregate per batch, emit periodic progress, and always make encoding and error behavior explicit. This pattern scales from laptops to CI agents and memory-capped containers without heroic optimization.

Strengthen your streaming patterns in PowerShell with deeper techniques and recipes: read the PowerShell Advanced Cookbook.
