
Normalize Text Files to UTF-8 in PowerShell: Detect BOMs, Convert Safely, Keep Diffs Clean

Encoding surprises can blow up builds, pollute diffs, and make cross-platform tooling flaky. A simple, enforceable standard solves this: store text files as UTF-8 without a BOM. In this guide, you'll learn how to detect BOMs reliably, convert only when needed, and keep logs quiet so you can plug the normalizer into Git hooks and CI with confidence.

Why standardize on UTF-8 (no BOM)?

UTF-8 without a BOM is the de facto standard across modern toolchains. Here's why you want it:

  • Predictable diffs and merges: a BOM is an invisible byte that shows up as noise in diffs.
  • Cross-platform sanity: Linux and macOS tools expect plain UTF-8; Windows tooling has caught up.
  • Less surprise in PowerShell: PowerShell 7+ defaults to UTF-8 (no BOM) for most cmdlets; older Windows PowerShell often writes UTF-16LE or UTF-8 with BOM by default.
  • Fewer build failures: compilers, packagers, and YAML/JSON parsers behave consistently.

Detect before you convert

The safest approach is: detect the source encoding, convert only when necessary, and log only when a change occurs. Detection has a few key parts:

1) Trust BOMs (when present)

  • UTF-8 BOM: EF BB BF
  • UTF-16LE: FF FE
  • UTF-16BE: FE FF
  • UTF-32LE: FF FE 00 00
  • UTF-32BE: 00 00 FE FF

If a BOM is present, you can confidently decode using the corresponding encoding. If it's UTF-8 with a BOM, you can safely strip it and rewrite as UTF-8 without a BOM.
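
For illustration, here's a minimal sketch that peeks at a file's first bytes and names the BOM, if any. The file path is a placeholder, and the pattern order matters: the UTF-32LE BOM begins with the UTF-16LE one, so it must be checked first.

# Minimal sketch: identify a BOM from the first four bytes ('example.txt' is a placeholder)
$head = [byte[]]::new(4)
$fs = [IO.File]::OpenRead('example.txt')
try { $read = $fs.Read($head, 0, 4) } finally { $fs.Dispose() }
$hex = ($head[0..([Math]::Max($read - 1, 0))] | ForEach-Object { $_.ToString('X2') }) -join ' '
switch -Regex ($hex) {
  '^FF FE 00 00' { 'UTF-32LE'; break }   # must precede UTF-16LE (same first two bytes)
  '^00 00 FE FF' { 'UTF-32BE'; break }
  '^EF BB BF'    { 'UTF-8 BOM'; break }
  '^FF FE'       { 'UTF-16LE'; break }
  '^FE FF'       { 'UTF-16BE'; break }
  default        { 'No BOM' }
}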

2) Validate UTF-8 when there's no BOM

Without a BOM, you must decide whether the byte stream is valid UTF-8. Use a strict UTF-8 decoder that throws on invalid byte sequences. If it throws, fall back to the system's ANSI code page to preserve the content before normalizing to UTF-8 without BOM. Note that [System.Text.Encoding]::Default is the ANSI code page only in .NET Framework (Windows PowerShell); in PowerShell 7+ it is UTF-8, so resolve the ANSI code page explicitly, as the hardened script below does.
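
As a quick sketch (the file path is a placeholder), the strict decoder looks like this:

# Sketch: strict UTF-8 validation ('example.txt' is a placeholder)
$strict = [Text.UTF8Encoding]::new($false, $true)   # second arg: throw on invalid bytes
$bytes  = [IO.File]::ReadAllBytes('example.txt')
try {
  $null = $strict.GetString($bytes)
  'Valid UTF-8'
} catch [System.Text.DecoderFallbackException] {
  'Not valid UTF-8; decode with the ANSI code page instead'
}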

3) Skip binary files

A quick heuristic is to scan a small byte window (e.g., first 4 KB) and skip files that contain NUL (0x00) bytes. You can also skip by extension if you know your repo.
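
A sketch of that heuristic (the path is a placeholder, and it assumes a non-empty file):

# Sketch: NUL-byte heuristic ('example.bin' is a placeholder)
$bytes    = [IO.File]::ReadAllBytes('example.bin')
$window   = $bytes[0..([Math]::Min(4095, $bytes.Length - 1))]
$isBinary = $window -contains 0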

A minimal "just fix it" script

This short script uses a BOM-aware StreamReader and normalizes to UTF-8 without BOM only when needed. It's great for small repos and one-off fixes.

param([Parameter(Mandatory)][string]$Path)

if (-not (Test-Path -Path $Path -PathType Leaf)) { throw ("Missing: {0}" -f $Path) }

# Detect encoding via StreamReader (BOM-aware). Pass a BOM-less UTF-8 fallback:
# with no BOM, CurrentEncoding stays this instance (empty preamble); when a BOM
# is found, StreamReader swaps in the matching BOM-emitting encoding.
$utf8NoBom = [Text.UTF8Encoding]::new($false)
$sr = [IO.StreamReader]::new($Path, $utf8NoBom, $true)
try {
  $text = $sr.ReadToEnd()
  $enc  = $sr.CurrentEncoding
} finally {
  $sr.Dispose()
}

# GetPreamble() (not the Preamble property) works on Windows PowerShell and 7+
$hasBom    = $enc.GetPreamble().Length -gt 0
$needsConv = ($enc.WebName -ne 'utf-8') -or $hasBom

if ($needsConv) {
  $bak = ("{0}.bak.{1}" -f $Path, (Get-Date -Format 'yyyyMMdd-HHmmss'))
  Copy-Item -Path $Path -Destination $bak -Force
  [IO.File]::WriteAllText($Path, $text, $utf8NoBom)
  Write-Host ("Converted -> {0} (from {1}, BOM={2})" -f $Path, $enc.WebName, $hasBom)
} else {
  Write-Host 'Already UTF-8 without BOM. Skipped.'
}
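
Usage looks like this (the script name is an assumption; save it under whatever name you like):

# Example invocation (script name is an assumption)
pwsh -NoProfile -File .\normalize-minimal.ps1 -Path .\README.md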

Note: This approach treats BOMs correctly, but like any BOM-only method, it can't disambiguate unknown legacy encodings that lack a BOM. The hardened script below adds strict UTF-8 validation and a safe fallback.

A production-hardened normalizer (strict UTF-8 validation, BOM-aware, quiet logs)

The script below:

  • Checks for BOMs first.
  • Validates non-BOM files with a strict UTF-8 decoder.
  • Falls back to the system code page if UTF-8 is invalid.
  • Skips binary-like files (NUL byte heuristic).
  • Backs up on change and logs only conversions.
  • Writes with a guaranteed UTF-8 without BOM encoder across PowerShell versions.

param(
  [Parameter(Mandatory)][string]$Path,
  [switch]$Backup,
  [switch]$Quiet
)

if (-not (Test-Path -LiteralPath $Path -PathType Leaf)) { throw "Missing file: $Path" }

# Read bytes once
$bytes = [IO.File]::ReadAllBytes($Path)
if ($bytes.Length -eq 0) { if (-not $Quiet) { Write-Host "Empty file. Skipped." }; return }

# Binary heuristic: NUL found in first 4KB
$sampleLen = [Math]::Min(4096, $bytes.Length)
$hasNull = $false
for ($i = 0; $i -lt $sampleLen; $i++) { if ($bytes[$i] -eq 0) { $hasNull = $true; break } }
if ($hasNull) { if (-not $Quiet) { Write-Host "Binary-like (NUL found). Skipped." }; return }

function Test-Prefix([byte[]]$a, [byte[]]$p) {
  if ($a.Length -lt $p.Length) { return $false }
  for ($j = 0; $j -lt $p.Length; $j++) { if ($a[$j] -ne $p[$j]) { return $false } }
  return $true
}

$utf8Bom = 0xEF,0xBB,0xBF
$utf16le = 0xFF,0xFE
$utf16be = 0xFE,0xFF
$utf32le = 0xFF,0xFE,0x00,0x00
$utf32be = 0x00,0x00,0xFE,0xFF

$enc = $null
$preambleLen = 0
if (Test-Prefix $bytes $utf32le) { $enc = [Text.UTF32Encoding]::new($false, $true); $preambleLen = 4 }
elseif (Test-Prefix $bytes $utf32be) { $enc = [Text.UTF32Encoding]::new($true, $true);  $preambleLen = 4 }
elseif (Test-Prefix $bytes $utf8Bom) { $enc = [Text.UTF8Encoding]::new($true, $false);  $preambleLen = 3 }
elseif (Test-Prefix $bytes $utf16le) { $enc = [Text.UnicodeEncoding]::new($false, $true); $preambleLen = 2 }
elseif (Test-Prefix $bytes $utf16be) { $enc = [Text.UnicodeEncoding]::new($true,  $true); $preambleLen = 2 }
else {
  # Validate strict UTF-8 (throws on invalid sequences); fall back to the ANSI code page if invalid
  $utf8Strict = [Text.UTF8Encoding]::new($false, $true)
  try {
    $null = $utf8Strict.GetString($bytes)  # will throw on invalid bytes
    $enc = $utf8Strict
  } catch [System.Text.DecoderFallbackException] {
    # [Text.Encoding]::Default is UTF-8 on PowerShell 7+ (.NET), not ANSI,
    # so resolve the system ANSI code page explicitly (works on 5.1 and 7+)
    $ansi = [Globalization.CultureInfo]::CurrentCulture.TextInfo.ANSICodePage
    $enc  = [Text.Encoding]::GetEncoding($ansi)
  }
}

$hasBom = $preambleLen -gt 0
$utf8NoBom = [Text.UTF8Encoding]::new($false, $false)
$needsConv = ($enc.WebName -ne 'utf-8') -or $hasBom
if (-not $needsConv) { if (-not $Quiet) { Write-Host "Already UTF-8 without BOM. Skipped." }; return }

if ($Backup) {
  $bak = "{0}.bak.{1}" -f $Path, (Get-Date -Format 'yyyyMMdd-HHmmss')
  [IO.File]::Copy($Path, $bak, $true)
}

# Decode with detected encoding (skip BOM bytes), then write as UTF-8 (no BOM)
$text = $enc.GetString($bytes, $preambleLen, $bytes.Length - $preambleLen)
[IO.File]::WriteAllText($Path, $text, $utf8NoBom)
if (-not $Quiet) { Write-Host ("Converted -> {0} (from {1}, BOM={2})" -f $Path, $enc.WebName, $hasBom) }

Run across a repo

# Normalize common text file types in the current repo
$ext = '*.ps1','*.psm1','*.psd1','*.ps1xml','*.cs','*.ts','*.js','*.json','*.yml','*.yaml','*.xml','*.html','*.md','*.txt'
Get-ChildItem -Recurse -File -Include $ext |
  ForEach-Object { pwsh -NoProfile -File .\tools\normalize-utf8.ps1 -Path $_.FullName -Backup -Quiet }
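
Note: spawning a fresh pwsh process per file is simple but slow on large repos; if that matters, wrap the script body in a function and call it in-process instead.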

Tip: Put the script under tools/, then wire it into Git and CI as shown below.

Integrate with Git and CI/CD

Pre-commit hook

Normalize staged text files before they land in your history. This keeps diffs clean and prevents encoding regressions.

#!/usr/bin/env pwsh
# .git/hooks/pre-commit (make executable on *nix; the shebang lets Git run it via pwsh)
$ErrorActionPreference = 'Stop'
$files = git diff --cached --name-only --diff-filter=ACM
foreach ($f in $files) {
  if (Test-Path $f) {
    # Only run on likely text files (adjust as needed)
    if ($f -match '\.(ps1|psm1|psd1|ps1xml|cs|ts|js|json|yml|yaml|xml|html|md|txt)$') {
      pwsh -NoProfile -File .\tools\normalize-utf8.ps1 -Path $f -Quiet
      git add -- $f
    }
  }
}

CI step

Fail the build if any file needs normalization. There's no separate dry-run mode: the step below runs the normalizer without -Quiet (the script logs only a single line per file) and fails if any output line reports a conversion.

# Example PowerShell CI step
$changed = 0
$ext = '*.ps1','*.psm1','*.psd1','*.cs','*.ts','*.js','*.json','*.yml','*.yaml','*.xml','*.html','*.md','*.txt'
Get-ChildItem -Recurse -File -Include $ext | ForEach-Object {
  $out = pwsh -NoProfile -File .\tools\normalize-utf8.ps1 -Path $_.FullName 2>&1  # no -Quiet: we need the 'Converted ->' lines
  if ($out -match 'Converted ->') { $changed++ }
}
if ($changed -gt 0) {
  Write-Error "Encoding normalization required on $changed file(s). Run the normalizer locally and commit the changes."
}
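
If the CI job runs on a clean checkout, a simpler variant (sketch) is to let Git detect the in-place rewrites after the normalizer has run:

# Alternative: any working-tree diff means a conversion happened
git diff --quiet
if ($LASTEXITCODE -ne 0) {
  Write-Error 'Encoding normalization required. Run the normalizer locally and commit the changes.'
}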

Practical tips and gotchas

  • Windows PowerShell vs. PowerShell 7+: In Windows PowerShell 5.1, Out-File and > default to UTF-16LE, Set-Content defaults to the ANSI code page, and -Encoding UTF8 writes a BOM. In PowerShell 7+, the default is UTF-8 without BOM. The explicit [Text.UTF8Encoding]::new($false) avoids ambiguity across versions.
  • Keep logs clean: Only print when a conversion occurs. Quiet logs are essential for Git hooks and CI to avoid noisy output.
  • Skip binaries: The NUL-byte heuristic is quick and effective; add an extension allowlist for extra safety.
  • Idempotency: Re-running the normalizer should produce zero changes and zero log noise.
  • EOL normalization: Encoding and line endings are separate concerns. Use .gitattributes for EOLs (e.g., * text=auto eol=lf) and the script above for encoding.
  • Backups: Enable -Backup initially. Once confident, turn it off in CI and hooks.
  • Large files: The example reads whole files into memory for simplicity. For very large files, use streamed decode/encode with FileStream and StreamReader/StreamWriter to keep memory usage bounded; see the sketch after this list.
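
A minimal sketch of that streamed variant, assuming $Path points at the file and $srcEnc holds the source encoding already detected by the logic above:

# Sketch: streamed re-encode for very large files ($Path and $srcEnc assumed set)
$utf8NoBom = [Text.UTF8Encoding]::new($false)
$tmp = "$Path.tmp"
$reader = [IO.StreamReader]::new($Path, $srcEnc, $true)    # BOM-aware; skips the preamble
$writer = [IO.StreamWriter]::new($tmp, $false, $utf8NoBom)
try {
  $buf = [char[]]::new(64KB)
  while (($n = $reader.Read($buf, 0, $buf.Length)) -gt 0) {
    $writer.Write($buf, 0, $n)                             # copy in bounded chunks
  }
} finally {
  $writer.Dispose()
  $reader.Dispose()
}
Move-Item -LiteralPath $tmp -Destination $Path -Force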

With a small PowerShell utility and a Git/CI integration, you'll eliminate encoding drift, reduce merge conflicts, and make your builds and diffs predictable. Normalize once, enforce everywhere, and enjoy quieter pipelines.
