pattern typescript Moderate
Multimodal inputs require image pre-processing to reduce token costs
openai@4.x
multimodal vision image resize tokens detail-mode cost
Problem
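Sending high-resolution images to vision models at full detail is expensive in tokens. Under OpenAI's published high-detail accounting for GPT-4o-class models, a 4K image (3840x2160) costs roughly 1,100 tokens, compared with a flat 85 tokens in low-detail mode. Applications that send raw user-uploaded images can therefore pay ten times or more what the task actually requires.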
Solution
Resize images so they fit within 1024x1024 pixels, preserving aspect ratio, before sending them to the API. For most analysis tasks, 512x512 is sufficient. Use 'low' detail mode for tasks that don't require fine detail. Convert to JPEG at 85% quality to reduce payload size (this shrinks the upload, not the token count).
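The size limit above comes down to a small piece of arithmetic. The sketch below only computes the target dimensions; the helper name `fitWithin` and the `Size` type are hypothetical, and the actual resampling and JPEG encoding are assumed to be done by whatever image library the application already uses.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale a size down so its longer side is at most `maxSide`,
// preserving aspect ratio. Sizes already within the limit are
// returned unchanged (no upscaling).
function fitWithin(size: Size, maxSide: number = 1024): Size {
  const longest = Math.max(size.width, size.height);
  if (longest <= maxSide) {
    return { width: size.width, height: size.height };
  }
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

// A 4K landscape upload shrinks to 1024x576.
console.log(fitWithin({ width: 3840, height: 2160 }));
```

With a library such as sharp, the same effect can usually be had by resizing with an "inside" fit and enlargement disabled, then encoding as JPEG at quality 85; check the library's documentation for the exact options.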
Why
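In high-detail mode, OpenAI's vision models first downscale the image and then split it into 512x512 tiles. For GPT-4o-class models, each tile costs 170 tokens on top of an 85-token base. More tiles mean more tokens and higher cost. Downsizing reduces the number of tiles without significantly affecting analytical accuracy for most tasks.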
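To make the tiling cost concrete, here is a rough estimator. It is a sketch, assuming the accounting OpenAI has published for GPT-4o-class models (85 base tokens, 170 tokens per 512x512 tile, the image first fitted within 2048x2048 and then its shortest side reduced to 768 px). It also assumes small images are not upscaled. Other models may count differently, and the function name `estimateImageTokens` is hypothetical.

```typescript
type Detail = 'low' | 'high';

// Assumed constants for GPT-4o-class models; verify against current pricing docs.
const BASE_TOKENS = 85;
const TOKENS_PER_TILE = 170;
const TILE_SIDE = 512;

function estimateImageTokens(width: number, height: number, detail: Detail): number {
  // Low detail is a flat cost regardless of resolution.
  if (detail === 'low') return BASE_TOKENS;

  // Step 1: fit within a 2048x2048 square (downscale only).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: reduce the shortest side to 768 px (assumed downscale only).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512x512 tiles needed to cover the result.
  const tiles = Math.ceil(w / TILE_SIDE) * Math.ceil(h / TILE_SIDE);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(3840, 2160, 'high')); // 1105
console.log(estimateImageTokens(512, 512, 'high'));   // 255
console.log(estimateImageTokens(3840, 2160, 'low'));  // 85
```

Under these assumptions a 1024x1024 image still costs 765 tokens in high detail (four tiles), while 512x512 fits in a single tile, which is why the smaller size is worth using when the task allows it.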
Gotchas
- OpenAI vision has 'auto', 'low', and 'high' detail modes; 'low' uses a fixed 85-token cost regardless of resolution
- Very small text in images requires 'high' detail mode to be readable
- base64 encoding adds ~33% size overhead, so pass URLs when images are publicly accessible
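The gotchas above can be folded into one small helper. This is a minimal sketch: `base64Length`, `toImagePart`, and the `ImageSource` type are hypothetical names, and the rule "fine detail needed means 'high', otherwise 'low'" is a simplification you may want to tune.

```typescript
// Length of the base64 text for `byteLength` raw bytes (with padding):
// every 3 bytes become 4 characters, hence the ~33% overhead.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

type ImageSource = { publicUrl: string } | { base64Jpeg: string };

interface ImagePart {
  type: 'image_url';
  image_url: { url: string; detail: 'low' | 'high' };
}

// Prefer a plain URL when one exists; otherwise fall back to a data URL.
// Pick 'high' only when the task needs fine detail (e.g. small text).
function toImagePart(src: ImageSource, needsFineDetail: boolean): ImagePart {
  const url = 'publicUrl' in src
    ? src.publicUrl
    : `data:image/jpeg;base64,${src.base64Jpeg}`;
  return {
    type: 'image_url',
    image_url: { url, detail: needsFineDetail ? 'high' : 'low' },
  };
}

// A 300,000-byte JPEG grows to 400,000 characters when base64-encoded.
console.log(base64Length(300_000));
console.log(toImagePart({ publicUrl: 'https://example.com/photo.jpg' }, false));
```

The returned object can be placed directly in the `content` array of a chat completion message alongside the text part.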
Code Snippets
Send an image to GPT-4o with detail control
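import OpenAI from 'openai';

// Reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();
// Placeholder: a publicly accessible URL or a base64 data URL.
const imageUrl = 'https://example.com/photo.jpg';

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      // 'low' keeps the image at a fixed token cost; use 'high' for fine detail.
      { type: 'image_url', image_url: { url: imageUrl, detail: 'low' } },
    ],
  }],
});
console.log(response.choices[0].message.content);
Context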
Applications that analyze user-uploaded images with vision LLMs