Pattern · TypeScript · Moderate
Embedding batch processing should use concurrency limits to avoid rate limits
p-limit@4.x
embeddings, batch-processing, concurrency, p-limit, rate-limits, semaphore
Problem
Processing thousands of documents for embedding with Promise.all fires all requests simultaneously, immediately hitting token-per-minute or request-per-minute rate limits and causing cascading 429 failures.
Solution
Process embedding batches with a concurrency limit, using a semaphore or the p-limit library. For OpenAI embeddings, limit to 5-10 concurrent batch requests and add exponential backoff on 429 responses (see the sketch below). For very large corpora, use the OpenAI Batch API, which runs outside the standard per-minute rate limits.
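A minimal sketch of the backoff step, assuming the same openai client and model used in the snippet under Code Snippets; the helper name embedWithBackoff and the retry/delay numbers are illustrative:

// Retry an embeddings call with jittered exponential backoff on 429 responses.
async function embedWithBackoff(batch: string[], maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await openai.embeddings.create({ model: 'text-embedding-3-small', input: batch });
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) throw err; // retry only rate-limit errors
      const delayMs = Math.min(1000 * 2 ** attempt, 30_000) + Math.random() * 250; // exponential backoff with jitter
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

Wrap it in the same limiter, e.g. limit(() => embedWithBackoff(batch)), so concurrency limiting and backoff compose; note that the official SDK also retries 429s a couple of times by default via its maxRetries option.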
Why
Rate limits enforce a maximum sustained throughput. Concurrent requests burst beyond this limit. A concurrency-limited queue sustains throughput close to the limit without exceeding it.
Gotchas
- p-limit is ESM-only from v4 onward — use dynamic import (see the sketch after this list) or pin to v3 for CommonJS projects
- Monitor both RPM (requests per minute) and TPM (tokens per minute) — either can be the binding constraint
- The Batch API has up to a 24-hour turnaround — only suitable for offline processing (see the submission sketch in Code Snippets)
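A small sketch of the CommonJS workaround from the first gotcha, loading the ESM-only package with a dynamic import; makeLimiter is an illustrative helper name:

// CommonJS workaround: import() is async, so wrap limiter creation in a helper.
async function makeLimiter(concurrency: number) {
  const { default: pLimit } = await import('p-limit');
  return pLimit(concurrency);
}

In TypeScript, make sure the compiler does not down-level import() to require() (e.g. use module node16 or nodenext), otherwise this still fails at runtime with ERR_REQUIRE_ESM.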
Code Snippets
Concurrent embedding with p-limit
import OpenAI from 'openai';
import pLimit from 'p-limit';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const limit = pLimit(5); // max 5 concurrent embedding requests

// batches: string[][], each inner array is one batch of document chunks
const embeddings = await Promise.all(
  batches.map(batch => limit(() => openai.embeddings.create({ model: 'text-embedding-3-small', input: batch })))
);
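Submitting embeddings through the Batch API (sketch)
A rough sketch of the offline path mentioned in the solution, assuming the official OpenAI Node SDK; the file name and custom_id scheme are illustrative, and batches is the same string[][] as above:

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// One JSONL line per embedding request; custom_id ties results back to your documents.
const jsonl = batches
  .map((batch, i) => JSON.stringify({
    custom_id: `batch-${i}`,
    method: 'POST',
    url: '/v1/embeddings',
    body: { model: 'text-embedding-3-small', input: batch },
  }))
  .join('\n');
fs.writeFileSync('embeddings-batch.jsonl', jsonl);

// Upload the JSONL file, then create the batch job; poll openai.batches.retrieve(job.id) until it completes.
const file = await openai.files.create({ file: fs.createReadStream('embeddings-batch.jsonl'), purpose: 'batch' });
const job = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/embeddings',
  completion_window: '24h',
});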
Context
Bulk document ingestion pipelines for RAG or semantic search