Streaming Millions of CSV Rows in PowerShell with Import-Csv -ReadCount
When a CSV crosses the million-row mark, naive read-all-then-process patterns quickly exhaust memory and slow to a crawl. The fix is simple and powerful: stream the file in predictable chunks, parse with an invariant culture so numbers and dates are stable across locales, and append results as you go. In this post, you'll learn a production-ready pattern that keeps memory flat, improves throughput, and yields consistent output.
Stream in Chunks with -ReadCount
Batching is the difference between an OOM spike and a smooth, steady run. By setting -ReadCount, you receive arrays of records (chunks) from Import-Csv and process each batch before the next one arrives. That keeps the working set bounded while still moving quickly through the file.
The core pipeline
$path = './large.csv'
$out = './summary.csv'
$read = 1000
$fmt = [Globalization.CultureInfo]::InvariantCulture
if (Test-Path $out) { Remove-Item -Path $out -Force }
Import-Csv -Path $path -Delimiter ',' -Encoding UTF8 -ReadCount $read | ForEach-Object {
    # $_ is one batch; wrap the foreach statement in $() so its output can be piped
    $(foreach ($r in $_) {
        $amt  = if ([string]::IsNullOrWhiteSpace($r.Amount)) { 0 } else { [double]::Parse($r.Amount, $fmt) }
        $date = [datetime]::Parse($r.Date, $fmt)
        [pscustomobject]@{ Day = $date.ToString('yyyy-MM-dd'); Amount = [math]::Round($amt, 2); Account = $r.Account }
    }) | Export-Csv -Path $out -NoTypeInformation -Append -Encoding UTF8
}
Write-Host ("Saved -> {0}" -f (Resolve-Path -Path $out))
Why this scales:
- Bounded memory: Each batch is processed and released before the next one arrives.
- Continuous output: Export-Csv -Append writes as you go, so you don't hold an in-memory result set.
- Predictable CPU profile: Work per chunk is consistent, and the GC has less to do.
Choosing the right -ReadCount
- A value between 100 and 6,000 is a good starting window. Try 1000 or 5000 and benchmark (see the sketch below).
- Smaller batches reduce peak memory; larger batches reduce I/O flush frequency. Find the sweet spot for your disk and CPU.
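A simple way to benchmark is to time the whole run for a few candidate batch sizes. This is a minimal sketch, assuming the pipeline is saved as a script that accepts a -ReadCount parameter (the path ./process.ps1 is hypothetical; the end-to-end script later in this post fits this shape):
foreach ($n in 500, 1000, 2000, 5000) {
    # Measure-Command returns a TimeSpan for the wrapped script block
    $t = Measure-Command { & ./process.ps1 -ReadCount $n }
    Write-Host ("ReadCount {0,5}: {1:n1}s" -f $n, $t.TotalSeconds)
}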
Parse Numbers and Dates with Invariant Culture
CSV data often comes from multiple locales. Without an explicit culture, double and DateTime parsing can break on commas vs. dots or region-specific formats. Use [CultureInfo]::InvariantCulture and normalize early to keep your pipeline robust and testable.
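As a quick illustration of why this matters, the same string parses to very different values depending on the culture (de-DE is shown here purely as an example):
$de  = [Globalization.CultureInfo]::GetCultureInfo('de-DE')
$inv = [Globalization.CultureInfo]::InvariantCulture
[double]::Parse('1.234', $de)    # 1234  - the dot is a thousands separator in de-DE
[double]::Parse('1.234', $inv)   # 1.234 - the dot is the decimal separator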
Safe numeric parsing
Prefer TryParse for resilience. It avoids exceptions on bad rows and keeps your batch moving.
$fmt = [Globalization.CultureInfo]::InvariantCulture
function Parse-AmountOrDefault {
    param([string]$s, [double]$default = 0)
    if ([string]::IsNullOrWhiteSpace($s)) { return $default }
    [double]$val = 0
    if ([double]::TryParse($s, [Globalization.NumberStyles]::Float, $fmt, [ref]$val)) { return $val }
    return $default
}
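Usage looks like this (the 'N/A' input is just an illustrative bad value):
Parse-AmountOrDefault '19.95'            # 19.95
Parse-AmountOrDefault ''                 # 0  (empty -> default)
Parse-AmountOrDefault 'N/A' -default -1  # -1 (unparseable -> explicit default)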
Stable date handling
Apply consistent formats on output, even if the input is messy. If the input is guaranteed to be ISO 8601 (yyyy-MM-dd or yyyy-MM-ddTHH:mm:ssZ), parsing is cheap and reliable.
function Parse-DateStrict {
    param([string]$s)
    # Try the most common stable patterns first
    $styles = [Globalization.DateTimeStyles]::AssumeUniversal -bor [Globalization.DateTimeStyles]::AdjustToUniversal
    $d = [datetime]::MinValue
    $ok = [datetime]::TryParse($s, $fmt, $styles, [ref]$d)
    if (-not $ok) { throw "Invalid date: $s" }
    return $d
}
Tip: On output, format with $date.ToString('yyyy-MM-dd') for date-only values, or with the 'o' round-trip (ISO 8601) format for logging and re-ingestion.
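For example, with $fmt from above and a sample UTC timestamp (the value is illustrative):
$styles = [Globalization.DateTimeStyles]::AssumeUniversal -bor [Globalization.DateTimeStyles]::AdjustToUniversal
$date = [datetime]::Parse('2024-03-07T14:30:00Z', $fmt, $styles)
$date.ToString('yyyy-MM-dd')  # 2024-03-07
$date.ToString('o')           # 2024-03-07T14:30:00.0000000Z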
Write Efficiently and Operate at Scale
Appending per chunk is convenient and fast enough for most workloads, but a few tweaks can shave seconds off long runs and keep files clean.
Efficient output with a single header
- Delete any existing output file before starting so the header is written once.
- Use -NoTypeInformation to avoid the type line at the top.
- On PowerShell 7+, consider -UseQuotes AsNeeded with Export-Csv to reduce file size (see the sketch below).
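A version-guarded sketch of the -UseQuotes option; $records stands in for one processed batch and the sample values are made up:
$records = [pscustomobject]@{ Day = '2024-03-07'; Amount = 19.95; Account = 'A-100' }
if ($PSVersionTable.PSVersion.Major -ge 7) {
    # PowerShell 7+: quote fields only when they actually need quoting
    $records | Export-Csv -Path './summary.csv' -NoTypeInformation -Append -Encoding utf8 -UseQuotes AsNeeded
}
else {
    $records | Export-Csv -Path './summary.csv' -NoTypeInformation -Append -Encoding UTF8
}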
Improved end-to-end script
This version adds robust parsing, error handling, and lightweight progress without sacrificing streaming. It still keeps memory flat and throughput high.
param(
    [string]$Path = './large.csv',
    [string]$Out = './summary.csv',
    [int]$ReadCount = 2000
)
$fmt = [Globalization.CultureInfo]::InvariantCulture
function Parse-AmountOrDefault {
    param([string]$s, [double]$default = 0)
    if ([string]::IsNullOrWhiteSpace($s)) { return $default }
    [double]$val = 0
    if ([double]::TryParse($s, [Globalization.NumberStyles]::Float, $fmt, [ref]$val)) { return $val }
    return $default
}
function Parse-DateStrict {
    param([string]$s)
    $styles = [Globalization.DateTimeStyles]::AssumeUniversal -bor [Globalization.DateTimeStyles]::AdjustToUniversal
    $d = [datetime]::MinValue
    if ([datetime]::TryParse($s, $fmt, $styles, [ref]$d)) { return $d }
    throw "Invalid date: $s"
}
if (Test-Path $Out) { Remove-Item -Path $Out -Force }
$rows = 0
$sw = [System.Diagnostics.Stopwatch]::StartNew()
Import-Csv -Path $Path -Delimiter ',' -Encoding UTF8 -ReadCount $ReadCount |
    ForEach-Object {
        # $_ is the current batch (array of PSCustomObject)
        $batch = $_
        $rows += $batch.Count
        Write-Progress -Activity 'Processing CSV' -Status ("Rows: {0:n0}" -f $rows)
        $(foreach ($r in $batch) {
            try {
                $amt  = Parse-AmountOrDefault $r.Amount 0
                $date = Parse-DateStrict $r.Date
                [pscustomobject]@{
                    Day     = $date.ToString('yyyy-MM-dd')
                    Amount  = [math]::Round($amt, 2)
                    Account = $r.Account
                }
            }
            catch {
                # Option A: skip bad rows, or write to a side log (see the sketch after this script)
                # Write-Warning $_
            }
        }) | Export-Csv -Path $Out -NoTypeInformation -Append -Encoding UTF8
    }
$sw.Stop()
Write-Host ("Wrote {0:n0} rows in {1:n1}s -> {2}" -f $rows, $sw.Elapsed.TotalSeconds, (Resolve-Path -Path $Out))
Performance tweaks that often help
- Batch size: Increase -ReadCount until CPU stays busy and disk flushes aren't your bottleneck.
- Avoid string concatenation in loops: Build PSCustomObjects and let Export-Csv serialize (see the sketch after this list).
- Validate early: Parse and normalize values as soon as they enter the pipeline.
- Prefer TryParse: Exceptions across millions of rows compound quickly.
- Minimize per-row allocations: Reuse CultureInfo and format strings; avoid creating transient objects.
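To make the string-concatenation point concrete, here is a small before/after sketch with throwaway sample values:
$day = '2024-03-07'; $amount = 19.95
# Avoid: hand-built CSV lines mean manual quoting and no header management
# $line = '"{0}",{1}' -f $day, $amount; Add-Content -Path './example.csv' -Value $line
# Prefer: emit objects and let Export-Csv handle headers, quoting, and escaping
[pscustomobject]@{ Day = $day; Amount = $amount } | Export-Csv -Path './example.csv' -NoTypeInformation -Append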
Chunked aggregations without full in-memory group-by
Need summaries but can't afford a giant hashtable in RAM? Aggregate per chunk and fold the results into a compact running dictionary. You still keep memory bounded.
$totals = @{}
Import-Csv -Path $path -Delimiter ',' -Encoding UTF8 -ReadCount 5000 | ForEach-Object {
    foreach ($r in $_) {
        $d   = [datetime]::Parse($r.Date, $fmt).ToString('yyyy-MM-dd')
        $a   = [double]::Parse($r.Amount, $fmt)
        $key = '{0}|{1}' -f $d, $r.Account   # "${r.Account}" would not expand the property; use -f or $($r.Account)
        if (-not $totals.ContainsKey($key)) { $totals[$key] = 0.0 }
        $totals[$key] += $a
    }
}
# Stream the final totals out once
$totals.GetEnumerator() | ForEach-Object {
    $parts = $_.Key.Split('|', 2)
    [pscustomobject]@{ Day = $parts[0]; Account = $parts[1]; Amount = [math]::Round($_.Value, 2) }
} | Export-Csv -Path $out -NoTypeInformation -Encoding UTF8
If the number of unique groups is still huge, spill interim maps to disk by batch and merge in a second pass.
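A sketch of that second (merge) pass, assuming each batch's partial totals were exported to a folder of CSVs; the ./partials/*.csv layout with Day, Account, and Amount columns is an assumption, and $fmt and $out come from earlier:
$merged = @{}
Get-ChildItem -Path './partials/*.csv' | ForEach-Object {
    Import-Csv -Path $_.FullName | ForEach-Object {
        $key = '{0}|{1}' -f $_.Day, $_.Account
        # [double]$null is 0, so missing keys start from zero
        $merged[$key] = [double]$merged[$key] + [double]::Parse($_.Amount, $fmt)
    }
}
$merged.GetEnumerator() | ForEach-Object {
    $day, $acct = $_.Key.Split('|', 2)
    [pscustomobject]@{ Day = $day; Account = $acct; Amount = [math]::Round($_.Value, 2) }
} | Export-Csv -Path $out -NoTypeInformation -Encoding UTF8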
Operational tips and pitfalls
- Encoding: On PowerShell 7+, -Encoding utf8 writes UTF-8 without a BOM; Windows PowerShell 5.1 writes a BOM. Keep it consistent across systems that re-ingest the CSV.
- Delimiters and quotes: Always specify -Delimiter. If your input uses quotes heavily or has embedded delimiters, ensure the producer escapes them correctly; test with a small sample.
- Schema drift: Guard against missing columns (e.g., with $r.PSObject.Properties.Match('Amount')) and default gracefully (see the sketch after this list).
- Observability: Add periodic Write-Host or Write-Progress and capture metrics like rows/sec.
- Idempotency: Remove or rotate the output file at the start; append in deterministic order.
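A minimal guard for the schema-drift case, reusing Parse-AmountOrDefault from above; $batch stands in for one chunk:
foreach ($r in $batch) {
    # Fall back to '0' when the Amount column is missing from this file's header
    $amountRaw = if ($r.PSObject.Properties.Match('Amount').Count -gt 0) { $r.Amount } else { '0' }
    $amt = Parse-AmountOrDefault $amountRaw 0
    # ... continue building the output object as before ...
}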
What you get: lower memory usage, faster runs, safer parsing, and predictable output. These patterns also translate to logs, IoT telemetry, financial transaction streams, and any bulk import where you can't afford to keep the whole set in memory.
Process data at scale with practical patterns. Read the PowerShell Advanced Cookbook: https://www.amazon.com/PowerShell-Advanced-Cookbook-scripting-advanced-ebook/dp/B0D5CPP2CQ/