pattern, typescript, Major
Content filtering guardrails should run before and after LLM calls
openai@4.x
guardrails, content-filtering, moderation, safety, input-validation, output-validation
Problem
Relying on the LLM itself to refuse harmful requests is insufficient. Adversarial prompts can bypass model-level safety, and model outputs can contain harmful content even when the input was benign.
Solution
Apply a two-layer approach: run the user input through OpenAI's moderation endpoint or a specialized classifier before calling the main LLM, and run the model output through moderation before displaying it. For sensitive applications, add a secondary LLM call that evaluates the output against your safety policy.
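The two layers can be sketched as a small wrapper around the model call. This is a minimal sketch, not a definitive implementation: `guardedReply`, `Moderate`, `Generate`, and `GuardedResult` are hypothetical names introduced for illustration. In a real application, `moderate` would typically wrap `openai.moderations.create` and `generate` would wrap `openai.chat.completions.create`; they are injected here so the control flow is visible and can be exercised without network access.

```typescript
// Hypothetical function types: both are assumptions for this sketch.
type Moderate = (text: string) => Promise<{ flagged: boolean }>;
type Generate = (userMessage: string) => Promise<string>;

type GuardedResult =
  | { ok: true; text: string }
  | { ok: false; stage: 'input' | 'output'; error: string };

async function guardedReply(
  userMessage: string,
  moderate: Moderate,
  generate: Generate,
): Promise<GuardedResult> {
  // Layer 1: screen the user input before calling the main model.
  const inputCheck = await moderate(userMessage);
  if (inputCheck.flagged) {
    return { ok: false, stage: 'input', error: 'Your message was flagged by our content policy.' };
  }

  const text = await generate(userMessage);

  // Layer 2: screen the model output before it is displayed.
  const outputCheck = await moderate(text);
  if (outputCheck.flagged) {
    return { ok: false, stage: 'output', error: 'The response was withheld by our content policy.' };
  }

  return { ok: true, text };
}
```

Returning the `stage` that blocked the request makes it possible to log input and output rejections separately, which helps when reviewing false positives.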
Why
Defense in depth: model-level safety, input moderation, and output moderation provide overlapping coverage. No single layer is sufficient.
Gotchas
- The OpenAI moderation API is free but adds latency; run it concurrently with other operations where possible
- Custom classifiers trained on your domain outperform generic moderation for domain-specific harms
- Moderation false positives can frustrate legitimate users, so tune thresholds carefully
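One way to tune thresholds is to ignore the boolean `flagged` field and compare per-category scores against your own limits. The sketch below assumes a score map shaped like the `category_scores` object on an OpenAI moderation result (category name to a score between 0 and 1). The function name `violatedCategories` and every threshold value are hypothetical; real limits should come from labeled examples of your own traffic.

```typescript
// Category name -> score in [0, 1], mirroring `category_scores` on a moderation result.
type CategoryScores = Record<string, number>;

// Hypothetical per-category limits (assumed values, not recommendations).
const thresholds: CategoryScores = { violence: 0.8, harassment: 0.5 };
// Hypothetical fallback for categories without an explicit limit.
const defaultThreshold = 0.7;

function violatedCategories(
  scores: CategoryScores,
  limits: CategoryScores = thresholds,
  fallback: number = defaultThreshold,
): string[] {
  return Object.keys(scores).filter((category) => {
    const limit = category in limits ? limits[category] : fallback;
    return scores[category] >= limit;
  });
}

// Usage: block the message only if at least one category meets or exceeds its limit.
const hits = violatedCategories({ violence: 0.75, harassment: 0.6, 'self-harm': 0.1 });
const shouldBlock = hits.length > 0;
```

Raising a limit reduces false positives for that category at the cost of letting more borderline content through, so each category is worth tuning separately.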
Code Snippets
Moderation check before LLM call
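import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Runs inside an async request handler; userMessage and messages come from the request.
const mod = await openai.moderations.create({ input: userMessage });
if (mod.results[0].flagged) {
  return { error: 'Your message was flagged by our content policy.' };
}
const response = await openai.chat.completions.create({ model: 'gpt-4o', messages });
Context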
Consumer-facing AI applications with safety requirements