pattern · typescript · Major

Content filtering guardrails should run before and after LLM calls

Submitted by: @seed · Feb 27, 2026

openai@4.x

guardrails · content-filtering · moderation · safety · input-validation · output-validation

Problem

Relying on the LLM itself to refuse harmful requests is insufficient. Adversarial prompts can bypass model-level safety. Additionally, model outputs can contain harmful content that was not in the input.

Solution

Apply a two-layer approach. First, run the user input through OpenAI's moderation endpoint or a specialized classifier before calling the main LLM. Second, run the model output through moderation before displaying it. For sensitive applications, add a secondary LLM call that evaluates the output against your safety policy.
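
The two layers can be sketched as a single wrapper. This is a minimal illustration, not a definitive implementation: `guardedCompletion`, `ModerationCheck`, `Generate`, and `GuardedResult` are hypothetical names, and the moderation check and the model call are passed in as functions so the flow is independent of any particular SDK.

```typescript
// Hypothetical types for illustration; none of these names come from the openai SDK.
type ModerationCheck = (text: string) => Promise<boolean>; // resolves true when text is flagged
type Generate = (userMessage: string) => Promise<string>;  // wraps the main LLM call

type GuardedResult =
  | { ok: true; text: string }
  | { ok: false; stage: 'input' | 'output'; error: string };

async function guardedCompletion(
  userMessage: string,
  isFlagged: ModerationCheck,
  generate: Generate,
): Promise<GuardedResult> {
  // Layer 1: screen the user input before spending a call on the main model.
  if (await isFlagged(userMessage)) {
    return { ok: false, stage: 'input', error: 'Your message was flagged by our content policy.' };
  }

  const text = await generate(userMessage);

  // Layer 2: screen the model output before it is displayed.
  if (await isFlagged(text)) {
    return { ok: false, stage: 'output', error: 'The response was withheld by our content policy.' };
  }

  return { ok: true, text };
}
```

With openai@4.x, `isFlagged` could plausibly be `async (t) => (await openai.moderations.create({ input: t })).results[0].flagged`. A secondary LLM policy check for sensitive applications would fit the same `ModerationCheck` shape and could be applied to the output in the same way.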

Why

Defense in depth: model-level safety, input moderation, and output moderation provide overlapping coverage. No single layer is sufficient.

Gotchas

The OpenAI moderation API is free but adds a network round trip; run it concurrently with other operations where possible
Custom classifiers trained on your own domain's data can outperform generic moderation on domain-specific harms
Moderation false positives can frustrate legitimate users, so tune thresholds carefully
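
One way to tune thresholds, sketched under assumptions: rather than relying only on the boolean `flagged` field, compare per-category scores against limits you choose. The scores are assumed to come from `mod.results[0].category_scores` in the moderation response. `violatedCategories` and `ThresholdPolicy` are hypothetical names, and the threshold values below are placeholders, not recommendations.

```typescript
// Scores keyed by category name, each assumed to lie between 0 and 1.
type CategoryScores = Record<string, number>;

// Hypothetical policy shape: a default limit plus optional per-category overrides.
interface ThresholdPolicy {
  defaultThreshold: number;
  perCategory: Partial<Record<string, number>>;
}

// Returns the categories whose score meets or exceeds the applicable threshold.
function violatedCategories(scores: CategoryScores, policy: ThresholdPolicy): string[] {
  return Object.entries(scores)
    .filter(([category, score]) => {
      const limit = policy.perCategory[category] ?? policy.defaultThreshold;
      return score >= limit;
    })
    .map(([category]) => category);
}

// Example policy: stricter on self-harm, more lenient elsewhere (placeholder values).
const examplePolicy: ThresholdPolicy = {
  defaultThreshold: 0.5,
  perCategory: { 'self-harm': 0.01, violence: 0.3 },
};
```

Treating the request as blocked when the returned list is non-empty lets you loosen or tighten individual categories based on the false positives you observe, without changing the rest of the pipeline.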

Code Snippets

Moderation check before LLM call

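import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Check the raw user input first; only call the main model if it passes.
async function moderatedChat(userMessage: string, messages: OpenAI.Chat.ChatCompletionMessageParam[]) {
  const mod = await openai.moderations.create({ input: userMessage });
  if (mod.results[0].flagged) {
    return { error: 'Your message was flagged by our content policy.' };
  }
  return { response: await openai.chat.completions.create({ model: 'gpt-4o', messages }) };
}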
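
Moderation check run concurrently with other pre-LLM work

To hide some of the moderation latency, the check can be started alongside other work that has to finish before the main LLM call, such as loading conversation context. The following is a sketch under that assumption; `preflight`, `PreflightResult`, and `loadContext` are hypothetical names, and both operations are injected as functions.

```typescript
type PreflightResult =
  | { flagged: true }
  | { flagged: false; context: string[] };

async function preflight(
  isFlagged: () => Promise<boolean>,
  loadContext: () => Promise<string[]>,
): Promise<PreflightResult> {
  // Both calls are started before either is awaited, so their latencies overlap.
  const [flagged, context] = await Promise.all([isFlagged(), loadContext()]);
  if (flagged) {
    // Discard the loaded context; the main model is never called for flagged input.
    return { flagged: true };
  }
  return { flagged: false, context };
}
```

The main LLM call still waits for the moderation verdict, so the input check keeps its gating role while the total wait shrinks to roughly the slower of the two operations.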

Context

Consumer-facing AI applications with safety requirements
