How to Reduce Hallucinations in LLM Outputs

Large language models are impressive, but they have a well-known problem: they make things up. They invent citations, fabricate API endpoints, state incorrect facts with total confidence, and sometimes produce outputs that sound plausible but are entirely wrong. This is called hallucination, and it is one of the most common issues developers run into when building AI-powered applications.

The good news is that hallucinations are not random or unpredictable. They follow patterns, and most of them can be significantly reduced with the right techniques applied at the right layer. This article walks through six practical approaches, with concrete examples of each.

What Causes Hallucinations?

Before fixing the problem, it helps to understand where it comes from.

LLMs are trained to predict the next most likely token given a sequence of text. They do not retrieve facts from a database — they generate text that resembles factual text based on patterns learned during training. When the model encounters a question it does not have strong training signal for, it does not say “I don’t know”. It generates text that looks like an answer, because that is what the training process rewarded.

This means hallucinations tend to cluster around:

  • Knowledge gaps — things outside the model’s training data or cutoff
  • Specificity — precise details like dates, names, version numbers, or URLs
  • Low-confidence domains — niche topics with sparse training data
  • Long outputs — the more text the model generates, the more opportunities for drift

With that in mind, here are six techniques that address different parts of the problem.

1. Write More Specific Prompts

The single fastest way to reduce hallucinations is to remove ambiguity from your prompt. Vague instructions give the model too much creative latitude, and it fills the gaps.

Before (vague):

Tell me about React hooks.

After (specific):

Explain what React hooks are and describe three commonly used hooks: useState, useEffect, and useContext.
For each hook, include: what it does, its syntax, and a one-sentence use case.
If you are unsure about any detail, say "I'm not certain about this" rather than guessing.

The second prompt does three things the first does not: it scopes the topic, specifies the output structure, and explicitly gives the model permission to express uncertainty. That last part is important. Models are trained on text that is mostly confident and declarative, so they default to sounding certain. Telling the model it is acceptable — even expected — to say “I don’t know” surfaces genuine uncertainty rather than burying it under a confident-sounding wrong answer.

You can make this instruction a standing part of your system prompt:

If you don't know something or aren't confident in a detail, say so explicitly.
Never invent facts, statistics, URLs, or citations to fill a gap.

2. Use Few-Shot Examples to Set the Bar

Few-shot prompting lets you show the model what a good (and bad) response looks like. This is especially useful when you need the model to express calibrated uncertainty rather than fake confidence.

Example — without few-shot:

User: What is the exact population of Tokyo as of today?

Without examples, the model will typically produce a confident-sounding answer:

The population of Tokyo is approximately 13.96 million in the city proper,
or around 37.4 million in the greater metropolitan area.

This sounds authoritative, but it is a static figure from training data presented as if it were current. The model has no real-time data and no instruction to say so. Now add examples:

Example — with few-shot:

Answer the following questions. If you are not certain, say "I'm not sure" and explain why.

Q: What is the capital of France?
A: Paris.

Q: What is the exact population of Tokyo as of today?
A: I'm not sure of the exact current figure — population counts change constantly and I don't have real-time data. As of my training data, the greater Tokyo area had roughly 37–38 million people, but you should verify this with a current source.

Q: What is the capital of Australia?
A:

By showing the model what a good uncertain response looks like, you dramatically increase the chance it will handle real uncertain questions the same way. The pattern is learned from your examples, not just stated in the instruction.
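If you build this kind of prompt programmatically, a small helper keeps the Q/A formatting consistent across examples. The sketch below is illustrative: the instruction text and example pairs are the ones from this section, and `buildFewShotPrompt` is a hypothetical helper, not part of any SDK.

```javascript
// Sketch: assemble a few-shot prompt from an instruction, example Q/A pairs,
// and the real question. All names here are illustrative.
const INSTRUCTION =
  'Answer the following questions. If you are not certain, say "I\'m not sure" and explain why.';

const examples = [
  { q: 'What is the capital of France?', a: 'Paris.' },
  {
    q: 'What is the exact population of Tokyo as of today?',
    a: "I'm not sure of the exact current figure; population counts change constantly and I don't have real-time data.",
  },
];

function buildFewShotPrompt(instruction, shots, question) {
  // Each shot becomes a "Q: ...\nA: ..." pair; the final question gets an empty "A:".
  const shotText = shots.map(({ q, a }) => `Q: ${q}\nA: ${a}`).join('\n\n');
  return `${instruction}\n\n${shotText}\n\nQ: ${question}\nA:`;
}

const prompt = buildFewShotPrompt(INSTRUCTION, examples, 'What is the capital of Australia?');
console.log(prompt);
```

The resulting string can be sent as the user message content; keeping the examples in data rather than inline text makes it easy to add or swap shots later.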

3. Lower the Temperature for Factual Tasks

Temperature controls how much randomness the model introduces when choosing the next token. A temperature of 0 makes the model as deterministic as possible — it always picks the most likely next token. In practice, hardware differences and floating-point arithmetic mean outputs can still vary slightly across runs even at temperature 0, but it is far more consistent than higher values. A higher temperature (like 0.9 or 1.0) makes it more creative and varied, but also more prone to drifting from facts.

For creative writing, higher temperatures are great. For factual tasks — generating structured data, answering specific questions, writing code — lower temperatures reduce hallucinations by keeping the model on its most-likely-to-be-correct path.

JavaScript example using the Anthropic SDK:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// For factual / structured output tasks
const factualResponse = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  temperature: 0.1, // Low temperature for factual accuracy
  messages: [
    {
      role: 'user',
      content:
        'List the HTTP status codes for: OK, Not Found, Internal Server Error, and Unauthorized.',
    },
  ],
});

// For creative writing tasks
const creativeResponse = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  temperature: 0.9, // Higher temperature for creative variety
  messages: [
    {
      role: 'user',
      content:
        'Write a short story opening about a developer who discovers a bug in production.',
    },
  ],
});

A practical rule of thumb: use temperature: 0 or temperature: 0.1 for anything that has a definitively correct answer. Use higher values only when variety and creativity are more important than factual accuracy.

4. Ground Responses with RAG

Retrieval-Augmented Generation (RAG) is the most powerful technique for eliminating hallucinations that stem from knowledge gaps. Instead of relying on what the model learned during training, you retrieve relevant documents and include them directly in the prompt context. The model then answers based on that retrieved content rather than generating from memory.

This is especially valuable for:

  • Internal knowledge bases (company docs, product manuals)
  • Up-to-date information the model could not have been trained on
  • Domain-specific content with high precision requirements

Conceptual flow:

User question
  → Embed the question as a vector
  → Search a vector database for similar chunks
  → Insert retrieved chunks into the prompt
  → Model answers from the retrieved context
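The retrieval step can be sketched as a toy in-memory search ranked by cosine similarity. This is illustrative only: a real system uses an embedding model and a vector database, and the 3-dimensional vectors below are made up for the example.

```javascript
// Toy retrieval: rank pre-embedded chunks by cosine similarity to a query vector.
// Real systems get vectors from an embedding model and store them in a vector DB;
// these tiny hand-written vectors exist only to show the ranking logic.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const chunks = [
  { text: 'The Messages API requires model, max_tokens, and messages.', vector: [0.9, 0.1, 0.0] },
  { text: 'Temperature controls sampling randomness.', vector: [0.1, 0.9, 0.2] },
  { text: 'RAG inserts retrieved documents into the prompt.', vector: [0.2, 0.3, 0.9] },
];

function retrieveTopK(queryVector, k) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// A query vector close to the first chunk's vector retrieves that chunk first.
const top = retrieveTopK([0.85, 0.15, 0.05], 2);
console.log(top.map((c) => c.text));
```

The top-k chunks would then be concatenated into the prompt's context section shown below.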

One often-overlooked factor here is chunking strategy. If your documents are split into chunks that are too large, the retrieved context becomes noisy and the model may anchor on irrelevant parts. If chunks are too small, they lose surrounding context and produce incomplete answers. Both failure modes can introduce hallucinations. A reasonable starting point is 512–1024 tokens per chunk with some overlap (e.g. 10–20%) to preserve continuity across boundaries.
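The chunking heuristic can be sketched as a simple splitter with overlap. For brevity this version counts characters rather than tokens; a real pipeline would measure chunk size with the same tokenizer the embedding model uses.

```javascript
// Sketch: split text into fixed-size chunks with overlap between neighbors.
// Character-based for simplicity; real pipelines count tokens instead.
function chunkText(text, chunkSize = 1000, overlap = 150) {
  const pieces = [];
  const step = chunkSize - overlap; // each chunk starts `step` chars after the previous one
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end of the text
  }
  return pieces;
}

const doc = 'x'.repeat(2500);
const docChunks = chunkText(doc, 1000, 150);
console.log(docChunks.length); // 3 chunks: 1000 + 1000 + 800 characters
```

The overlap means the end of each chunk is repeated at the start of the next, so a sentence that straddles a boundary is fully contained in at least one chunk.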

A minimal RAG prompt structure:

You are a helpful assistant. Answer the user's question using only the context provided below.
If the answer is not in the context, say "I don't have enough information to answer that."

Context:
---
[Retrieved document chunks inserted here]
---

User question: {question}

The key instruction is “using only the context provided”. This directly constrains the model to the retrieved text and prevents it from supplementing gaps with hallucinated details. Combined with an explicit fallback (“say I don’t have enough information”), this dramatically reduces invented content.

Node.js example with a simple in-memory context:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// In a real RAG system, this context would be retrieved from a vector database
const retrievedContext = `
  The Anthropic Messages API accepts the following parameters:
  - model (required): The model ID to use, e.g. "claude-opus-4-6"
  - max_tokens (required): Maximum number of tokens to generate
  - messages (required): Array of message objects with role and content
  - temperature (optional): Sampling temperature between 0 and 1
  - system (optional): System prompt string
`;

async function answerWithContext(question) {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 512,
    system: `You are a helpful assistant. Answer questions using only the context provided.
If the answer is not in the context, say "I don't have enough information to answer that."`,
    messages: [
      {
        role: 'user',
        content: `Context:\n---\n${retrievedContext}\n---\n\nQuestion: ${question}`,
      },
    ],
  });

  return response.content[0].text;
}

const answer = await answerWithContext(
  'What parameters are required when calling the Messages API?',
);
console.log(answer);

This approach keeps the model honest by giving it authoritative source material and an explicit instruction to stay within it.

5. Use Structured Outputs to Constrain Generation

Hallucinations thrive in free-form text generation. When a model is generating unstructured prose, it has unlimited latitude to insert invented details. Constraining the output to a schema — like JSON — reduces that latitude considerably.

When you ask a model to fill in a JSON object, it cannot easily sneak in hallucinated text. Each field has to fit the schema, and the structural constraint keeps the model focused.

Instead of this:

Describe the user's account status and any recent activity.

Try this:

Return a JSON object with the following fields:
- account_status: one of "active", "suspended", or "pending"
- last_login_date: ISO 8601 date string, or null if unknown
- has_outstanding_issues: boolean
- confidence: one of "high", "medium", or "low"

If you are not certain about a value, set confidence to "low" and use null for uncertain fields.
Only return the JSON object, no other text.

Response with structure:

{
  "account_status": "active",
  "last_login_date": null,
  "has_outstanding_issues": false,
  "confidence": "low"
}

The model is now forced to represent its uncertainty explicitly (via the confidence field) and use null for unknown values rather than inventing plausible-sounding ones. This makes hallucinations both less likely and more detectable when they do occur.

Several APIs also support native structured output modes that enforce a schema at the token level, making it technically impossible for the model to return malformed JSON or go off-schema. With the Anthropic SDK, you can use tool use to achieve this — define a tool with a JSON schema and the model is forced to populate it exactly:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 512,
  tools: [
    {
      name: 'report_account_status',
      description:
        'Report the structured account status based on available information.',
      input_schema: {
        type: 'object',
        properties: {
          account_status: {
            type: 'string',
            enum: ['active', 'suspended', 'pending'],
            description: 'Current account status',
          },
          last_login_date: {
            type: ['string', 'null'],
            description:
              'ISO 8601 date string of last login, or null if unknown',
          },
          has_outstanding_issues: {
            type: 'boolean',
            description: 'Whether there are unresolved issues on the account',
          },
          confidence: {
            type: 'string',
            enum: ['high', 'medium', 'low'],
            description: 'Confidence level in the reported values',
          },
        },
        required: [
          'account_status',
          'last_login_date',
          'has_outstanding_issues',
          'confidence',
        ],
      },
    },
  ],
  tool_choice: { type: 'tool', name: 'report_account_status' },
  messages: [
    {
      role: 'user',
      content:
        'Check the account status for user ID 9821. Last activity data is unavailable.',
    },
  ],
});

// The model is forced to return a valid tool_use block — no free-form text
const toolUse = response.content.find((block) => block.type === 'tool_use');
const accountData = toolUse.input;
console.log(accountData);

This is strictly stronger than prompt-based structuring because the schema is enforced at the API level rather than relying on the model to follow instructions.

6. Validate Outputs Programmatically (or with a Second LLM Call)

No prompt technique eliminates hallucinations entirely. For high-stakes applications, you need a validation layer on top of your prompts.

Programmatic validation works well for structured outputs where you can check specific properties:

function validateLLMResponse(response) {
  const errors = [];

  // Check that required fields exist
  if (!response.account_status) {
    errors.push('Missing account_status');
  } else {
    // Check that enum values are valid (only when the field is present,
    // so a missing field doesn't produce two errors)
    const validStatuses = ['active', 'suspended', 'pending'];
    if (!validStatuses.includes(response.account_status)) {
      errors.push(`Invalid account_status: ${response.account_status}`);
    }
  }

  // Check date format if present (!= null also skips undefined)
  if (response.last_login_date != null) {
    const isValidDate = !isNaN(Date.parse(response.last_login_date));
    if (!isValidDate) {
      errors.push(`Invalid date format: ${response.last_login_date}`);
    }
  }

  return { valid: errors.length === 0, errors };
}

LLM-as-judge validation is useful when you cannot programmatically verify correctness — for example, when validating that a summary accurately reflects a source document:

import Anthropic from '@anthropic-ai/sdk';

async function validateWithLLM(originalDocument, summary) {
  const client = new Anthropic();

  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 256,
    temperature: 0,
    messages: [
      {
        role: 'user',
        content: `
You are a fact-checker. Compare the summary below against the original document.
Return a JSON object with:
- accurate: boolean (true if the summary contains no factual errors)
- issues: array of strings describing any inaccuracies found (empty array if none)

Original document:
${originalDocument}

Summary to check:
${summary}

Return only the JSON object.`,
      },
    ],
  });

  const raw = response.content[0].text.replace(/```json|```/g, '').trim();

  try {
    return JSON.parse(raw);
  } catch {
    // The validator itself can occasionally return malformed output.
    // Treat parse failures as inconclusive rather than crashing.
    console.error('Validation response could not be parsed:', raw);
    return {
      accurate: null,
      issues: ['Validation failed: could not parse checker response'],
    };
  }
}

This pattern — using one LLM call to generate content and a second to verify it — is sometimes called a generator-critic approach, drawn from the broader literature on LLM self-refinement and evaluation. It adds latency and cost, but for high-stakes content like legal summaries, medical information, or financial reports, the tradeoff is usually worth it.

Putting It Together: A Decision Guide

Not every application needs all six techniques. Here is a practical starting point:

| Situation | Start with |
| --- | --- |
| Factual Q&A with no live data | Specific prompts + low temperature + uncertainty instruction |
| Q&A over internal documents | RAG with a “context only” prompt |
| Structured data extraction | Structured outputs + programmatic validation |
| Summarization of provided text | Few-shot examples + LLM-as-judge validation |
| General-purpose chatbot | Specific system prompt + uncertainty instruction + low temperature |
| High-stakes content (legal, medical) | RAG + structured outputs + LLM-as-judge validation |

The most common mistake is treating hallucinations as a single problem with a single fix. In practice, a factual Q&A bot has a different root cause than a summarization tool, and they need different solutions. Diagnosing where the hallucination is coming from — knowledge gaps, underspecified prompts, free-form generation — points you directly to the right technique.

Start with the simplest fix (better prompts) before reaching for the more complex ones (RAG, validation pipelines). You will often be surprised how far a specific, well-structured prompt with an explicit uncertainty instruction takes you.