Pattern · TypeScript · Moderate
Embedding batch processing should use concurrency limits to avoid rate limits
p-limit@4.x
embeddings, batch-processing, concurrency, p-limit, rate-limits, semaphore
Problem
Processing thousands of documents for embedding with Promise.all fires all requests simultaneously, immediately hitting token-per-minute or request-per-minute rate limits and causing cascading 429 failures.
Solution
Process embedding batches with a concurrency limit, using a semaphore or the p-limit library. For OpenAI embeddings, limit to 5-10 concurrent batch requests and add exponential backoff on 429 responses (see the sketch below). For very large corpora, use the OpenAI Batch API, which runs outside the standard per-minute rate limits.
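A minimal sketch of the backoff step, assuming the same openai client and model used in the snippet under Code Snippets; the helper name embedWithBackoff and the retry/delay numbers are illustrative:

// Retry an embeddings call with jittered exponential backoff on 429 responses.
async function embedWithBackoff(batch: string[], maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await openai.embeddings.create({ model: 'text-embedding-3-small', input: batch });
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) throw err; // retry only rate-limit errors
      const delayMs = Math.min(1000 * 2 ** attempt, 30_000) + Math.random() * 250; // exponential backoff with jitter
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

Wrap it in the same limiter, e.g. limit(() => embedWithBackoff(batch)), so concurrency limiting and backoff compose; note that the official SDK also retries 429s a couple of times by default via its maxRetries option.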
Why
Rate limits enforce a maximum sustained throughput. Concurrent requests burst beyond this limit. A concurrency-limited queue sustains throughput close to the limit without exceeding it.
Gotchas
- p-limit is ESM-only from v4 onward — use dynamic import (see the sketch after this list) or pin to v3 for CommonJS projects
- Monitor both RPM (requests per minute) and TPM (tokens per minute) — either can be the binding constraint
- The Batch API has up to a 24-hour turnaround — only suitable for offline processing (see the submission sketch in Code Snippets)
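A small sketch of the CommonJS workaround from the first gotcha, loading the ESM-only package with a dynamic import; makeLimiter is an illustrative helper name:

// CommonJS workaround: import() is async, so wrap limiter creation in a helper.
async function makeLimiter(concurrency: number) {
  const { default: pLimit } = await import('p-limit');
  return pLimit(concurrency);
}

In TypeScript, make sure the compiler does not down-level import() to require() (e.g. use module node16 or nodenext), otherwise this still fails at runtime with ERR_REQUIRE_ESM.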
Code Snippets
Concurrent embedding with p-limit
import OpenAI from 'openai';
import pLimit from 'p-limit';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const limit = pLimit(5); // max 5 concurrent embedding requests

// batches: string[][], each inner array is one batch of document chunks
const embeddings = await Promise.all(
  batches.map(batch => limit(() => openai.embeddings.create({ model: 'text-embedding-3-small', input: batch })))
);
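Submitting embeddings through the Batch API (sketch)
A rough sketch of the offline path mentioned in the solution, assuming the official OpenAI Node SDK; the file name and custom_id scheme are illustrative, and batches is the same string[][] as above:

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// One JSONL line per embedding request; custom_id ties results back to your documents.
const jsonl = batches
  .map((batch, i) => JSON.stringify({
    custom_id: `batch-${i}`,
    method: 'POST',
    url: '/v1/embeddings',
    body: { model: 'text-embedding-3-small', input: batch },
  }))
  .join('\n');
fs.writeFileSync('embeddings-batch.jsonl', jsonl);

// Upload the JSONL file, then create the batch job; poll openai.batches.retrieve(job.id) until it completes.
const file = await openai.files.create({ file: fs.createReadStream('embeddings-batch.jsonl'), purpose: 'batch' });
const job = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/embeddings',
  completion_window: '24h',
});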
Context
Bulk document ingestion pipelines for RAG or semantic search