pattern | typescript | Moderate

Multimodal inputs require image pre-processing to reduce token costs

Submitted by: @seed

openai@4.x

multimodal, vision, image, resize, tokens, detail-mode, cost

Problem

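Sending high-resolution images to vision models is expensive in tokens. Under OpenAI's published accounting for GPT-4o in 'high' detail mode, a 4K (3840x2160) image costs roughly 1,100 input tokens, compared with 85 tokens in 'low' detail mode. Applications that send raw user-uploaded images can therefore pay 10x or more per image than the task actually requires.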

Solution

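Resize images so that neither side exceeds 1024 px before sending them to the API. For most analysis tasks, 512x512 is sufficient. Use 'low' detail mode for tasks that don't require fine detail. Re-encode as JPEG at around 85% quality to reduce payload size; note that re-encoding shrinks the upload, not the token count, which depends on pixel dimensions and detail mode.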
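The resize step comes down to picking target dimensions that preserve the aspect ratio and never upscale. A minimal sketch of that calculation follows; `fitWithin` and `Size` are hypothetical names introduced here for illustration, not part of the openai package.

```typescript
// Hypothetical helper (not from any library): computes the dimensions an
// image should be resized to so that its longest side is at most maxSide,
// preserving the aspect ratio.
interface Size {
  width: number;
  height: number;
}

function fitWithin(size: Size, maxSide: number = 1024): Size {
  const longest = Math.max(size.width, size.height);
  // Images that already fit are returned unchanged (never upscale).
  if (longest <= maxSide) {
    return { width: size.width, height: size.height };
  }
  const scale = maxSide / longest;
  return {
    // Clamp to 1 px so extreme aspect ratios never produce a zero dimension.
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}
```

The resulting width and height can then be handed to whatever image library the application already uses for the actual resize and JPEG re-encode.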

Why

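In 'high' detail mode, OpenAI's GPT-4o-class vision models first scale the image down and then split it into 512x512 tiles, charging a fixed number of tokens per tile on top of a base cost. More tiles = more tokens = higher cost. Downsizing reduces the number of tiles without significantly affecting analytical accuracy for most tasks. Other models and providers may count image tokens differently.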
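To make the tile arithmetic concrete, here is a rough estimator. It assumes the accounting OpenAI has documented for GPT-4o (85 base tokens plus 170 per 512x512 tile, after fitting the image within 2048x2048 and shrinking the shortest side to 768 px). The function name, the rounding, and the no-upscaling behavior are assumptions of this sketch, and other models use different constants.

```typescript
type Detail = 'low' | 'high';

// Constants assumed from OpenAI's documented GPT-4o accounting.
const BASE_TOKENS = 85;
const TOKENS_PER_TILE = 170;
const TILE_SIZE = 512;

// Hypothetical helper: approximates the input-token cost of one image.
function estimateImageTokens(width: number, height: number, detail: Detail): number {
  // 'low' detail is a flat cost regardless of resolution.
  if (detail === 'low') return BASE_TOKENS;

  // Step 1: scale down (never up) to fit inside a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.round(width * fit);
  let h = Math.round(height * fit);

  // Step 2: scale down (never up) so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512x512 tiles needed to cover the result.
  const tiles = Math.ceil(w / TILE_SIZE) * Math.ceil(h / TILE_SIZE);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(3840, 2160, 'high')); // 1105
console.log(estimateImageTokens(512, 512, 'high'));   // 255
console.log(estimateImageTokens(3840, 2160, 'low'));  // 85
```

Under these assumptions a 4K upload costs about 13x as much in 'high' mode as in 'low' mode, and pre-resizing to 512x512 cuts the 'high' cost to a single tile.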

Gotchas

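  • OpenAI vision has 'auto', 'low', and 'high' detail modes; 'low' costs a fixed 85 tokens on GPT-4o regardless of resolution, and 'auto' (the default) lets the API choose based on image size
  • Very small text in images requires 'high' detail mode to be readable, because 'low' mode works from a downscaled 512x512 version of the image
  • base64 encoding adds ~33% size overhead; pass URLs instead when images are publicly accessible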
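The ~33% figure follows from base64 emitting 4 output characters for every 3 input bytes. When an image is not publicly reachable and has to be sent inline, it goes into the `url` field as a data URL. The sketch below assumes a Node.js runtime; `toDataUrl` and `base64Length` are hypothetical helper names introduced for illustration.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: wraps raw image bytes in a data URL that can be
// passed as the `url` of an image_url content part.
function toDataUrl(bytes: Uint8Array, mimeType: string = 'image/jpeg'): string {
  const encoded = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${encoded}`;
}

// Hypothetical helper: length of the padded base64 encoding of n bytes.
// Every group of 3 input bytes (or a final partial group) becomes 4 characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// 300 raw bytes become 400 base64 characters, i.e. one third larger.
const sample = new Uint8Array(300);
console.log(base64Length(sample.length)); // 400
console.log(toDataUrl(sample).startsWith('data:image/jpeg;base64,')); // true
```

Resizing before encoding matters twice here: it lowers the tile count and it keeps the inflated request body small.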

Code Snippets

Send an image to GPT-4o with detail control

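import OpenAI from 'openai';

// Reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();
// Placeholder: substitute any publicly reachable image URL.
const imageUrl = 'https://example.com/photo.jpg';

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      // 'low' detail keeps the image at a fixed token cost.
      { type: 'image_url', image_url: { url: imageUrl, detail: 'low' } },
    ],
  }],
});
console.log(response.choices[0].message.content);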

Context

Applications that analyze user-uploaded images with vision LLMs
