Overview

Nobody wants hallucinations or embarrassing responses in their LLM-based apps.

So you start adding various guidelines to your prompt:

  • “Do not mention competitors”
  • “Do not give financial advice”
  • “Answer only based on the following context: …”
  • “If you don’t know the answer, respond with ‘I don’t know’”

… and so on.

Why not prompt engineering?

Prompt engineering is great—but as you add more guidelines, your prompt gets longer and more complex, and the LLM’s ability to follow all instructions accurately rapidly degrades.

If you care about reliability, prompt engineering is not enough.

Aporia transforms in-prompt guidelines into strong, independent, real-time guardrails, allowing your prompt to stay lean, focused, and therefore more accurate.

But doesn’t RAG solve hallucinations?

RAG is a useful method to enrich LLMs with your own data. You still get hallucinations—on your own data. Here’s how it works:

  1. Retrieve the most relevant documents from a knowledge base that can answer the user’s question
  2. This retrieved knowledge is then added to the prompt—right next to your agent’s task, guidelines, and the user’s question

RAG is just (very) sophisticated prompt engineering that happens at runtime.

Typically, another in-prompt guideline such as “Answer the question based solely on the following context” is added. Hopefully, the LLM follows this instruction, but as explained before, this isn’t always the case, especially as the prompt gets bigger.

Additionally, knowledge retrieval is hard, and when it fails (e.g. the wrong documents are retrieved, or too many of them), it can cause hallucinations even if the LLM followed instructions perfectly.

While LLM providers like OpenAI keep improving their models and your team keeps optimizing the retrieval process, Aporia makes sure that the final, post-retrieval context can fully answer the user’s question, and that the LLM-generated answer is actually derived from it and factually consistent with it.

Therefore, Aporia is a critical piece in any enterprise RAG architecture that can help you mitigate hallucinations, no matter how retrieval is implemented.


Specialized RAG Chatbots

LLMs are trained on text scraped from public Internet websites, such as Reddit and Quora.

While this works great for general-purpose chatbots like ChatGPT, most enterprise use-cases revolve around more specific tasks—like a customer support chatbot for your company.

Let’s explore a few key differences between general-purpose and specialized use-cases of LLMs:

1. Sticking to a specific task

Specialized chatbots often need to adhere to a specific task, maintain a certain personality, and follow particular guidelines.

For example, if you’re building a customer support chatbot, here are a few guidelines you probably want to have:

  • Be friendly, helpful, and exhibit an assistant-like personality
  • Do not offer any kind of financial advice
  • Never engage in sexual or violent discourse

To provide these guidelines to an LLM, AI engineers often use system prompt instructions. Here’s an example system prompt:

You are a customer support chatbot for Acme.

You need to be friendly, helpful, and exhibit an assistant-like personality.
Do not provide financial advice.
Do not engage in sexual or violent discourse.

[...]
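
To make this concrete, here is a minimal sketch of passing such a system prompt to a chat model through the OpenAI Chat Completions API. The model name and the user message are placeholders, and this is just one way to wire it up:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are a customer support chatbot for Acme.

You need to be friendly, helpful, and exhibit an assistant-like personality.
Do not provide financial advice.
Do not engage in sexual or violent discourse.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Can you recommend some stocks to buy?"},
    ],
)

print(response.choices[0].message.content)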

2. Custom knowledge

While general-purpose chatbots like ChatGPT provide answers based on their training dataset that was scraped from the Internet, your specialized chatbot needs to be able to respond solely based on your company’s knowledge base.

For example, a customer support chatbot needs to respond based on your company’s support KB—ideally, without errors.

This is where retrieval-augmented generation (RAG) becomes useful, as it allows you to combine an LLM with external knowledge, making your specialized chatbot knowledgeable about your own data.

RAG usually works like this:

1. User asks a question: “Hey, how do I create a new ticket?”

2. Retrieve knowledge: The system searches its knowledge base to find relevant information that could potentially answer the question—this is often called context. In our example, the context might be a few articles from the company’s support KB.

3. Construct prompt: After the context is retrieved, we can construct a system prompt:

You are a customer support chatbot for Acme.

You need to be friendly, helpful, and exhibit an assistant-like personality.
Do not provide financial advice.
Do not engage in sexual or violent discourse.

[...]

Answer the following question: <QUESTION> 

Answer the question based *only* on the following context:
<RETRIEVED_KNOWLEDGE>

4. Generate answer: Finally, the prompt is passed to the LLM, which generates an answer. The answer is then displayed to the user.
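
Putting the four steps together, here is a minimal, illustrative sketch of such a pipeline in Python, using the OpenAI embeddings and chat APIs with a tiny in-memory knowledge base and cosine-similarity retrieval. The model names, KB articles, and helper functions are assumptions for illustration, not a reference implementation:

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 0: a tiny in-memory knowledge base (stand-in for your support KB)
KB_ARTICLES = [
    "To create a new ticket, click 'New Ticket' in the top-right corner of the dashboard.",
    "You can reset your password from the account settings page.",
]

def embed(text: str) -> np.ndarray:
    # Turn text into an embedding vector (model name is a placeholder)
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def retrieve(question: str, k: int = 1) -> list[str]:
    # Step 2: find the KB chunks closest to the question in embedding space
    q = embed(question)
    scored = []
    for article in KB_ARTICLES:
        a = embed(article)
        cosine = float(q @ a / (np.linalg.norm(q) * np.linalg.norm(a)))
        scored.append((cosine, article))
    scored.sort(reverse=True)
    return [article for _, article in scored[:k]]

# Step 1: the user asks a question
question = "Hey, how do I create a new ticket?"

# Step 3: construct the prompt from the task, guidelines, question, and retrieved context
context = "\n".join(retrieve(question))
system_prompt = (
    "You are a customer support chatbot for Acme.\n\n"
    f"Answer the following question: {question}\n\n"
    "Answer the question based *only* on the following context:\n"
    f"{context}"
)

# Step 4: generate the answer
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "system", "content": system_prompt}],
)
print(response.choices[0].message.content)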

As you can see, RAG takes the retrieved knowledge and puts it in a prompt - right next to the chatbot’s task, guidelines, and the user’s question.

From Guidelines to Guardrails

We used methods like system prompt instructions and RAG with the hope of making our chatbot adhere to a specific task, have a certain personality, follow our guidelines, and be knowledgeable about our custom data.

Problem: LLMs do not follow instructions perfectly

As you can see in the example above, the result of these two methods is a single prompt that contains the chatbot’s task, guidelines, and knowledge.

While LLMs are improving, they do not follow instructions perfectly.

This is especially true when the input prompt gets longer and more complex—e.g. when more guidelines are added, or more documents are retrieved from the knowledge base and used as context.

Less is more: performance rapidly degrades when LLMs must retrieve information from the middle of the prompt. (Source: Lost in the Middle)

To provide a concrete example, a very common instruction for RAG is “answer this question based only on the following context”. However, LLMs can still easily add random information from their training set that is NOT part of the context.

This means that the generated answer might contain data from Reddit instead of your knowledge base, which might be completely false.

While LLM providers like OpenAI keep improving their models to better follow instructions, the very fact that the context is just another part of the prompt, alongside the user input and guidelines, leaves plenty of room for mistakes.

Problem: Knowledge retrieval is hard

Even if the previous problem were 100% solved, knowledge retrieval is a genuinely hard problem in its own right, and it is unrelated to the LLM itself.

Who said the context you retrieved can actually answer the user’s question?

To understand this issue better, let’s explore how knowledge retrieval in RAG systems typically works. It all starts from your knowledge base: you turn chunks of text from a knowledge base into embedding vectors (numerical representations). When a user asks a question, it’s also converted into an embedding vector. The system then finds text chunks from the knowledge base that are closest to the question’s vector, often using measures like cosine similarity. These close text chunks are used as context to generate an answer.

But there’s a core problem with this approach: the hidden assumption that text chunks close to the question in embedding space contain the right answer. This isn’t always true. For example, the question “How old are you?” and the answer “27” might be far apart in embedding space, even though “27” is the correct answer.

Two text chunks being close in embedding space does not mean they match as question and answer.
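
To make the similarity scoring concrete, here is a small sketch of the cosine-similarity computation that vector retrieval typically relies on. The vectors are made-up toy numbers, not real embeddings, chosen to mirror the “How old are you?” / “27” mismatch above:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the vector norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions)
question = np.array([0.9, 0.1, 0.0])  # "How old are you?"
chunk_a = np.array([0.8, 0.2, 0.1])   # another age-related question: high similarity
chunk_b = np.array([0.1, 0.2, 0.9])   # "27": low similarity, even though it is the correct answer

print(cosine_similarity(question, chunk_a))  # ~0.98
print(cosine_similarity(question, chunk_b))  # ~0.13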

There are many ways to improve retrieval: changing the ‘k’ argument (how many documents to retrieve), fine-tuning your embeddings, or reranking the retrieved chunks with models like ColBERT.
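
As one illustration of reranking (using a cross-encoder here, rather than ColBERT specifically), a sketch with the sentence-transformers library might look like the following; the model name and documents are examples only:

from sentence_transformers import CrossEncoder

# A cross-encoder scores (question, document) pairs jointly, which is usually
# more accurate than comparing independent embedding vectors
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "Hey, how do I create a new ticket?"
candidates = [
    "You can reset your password from the account settings page.",
    "To create a new ticket, click 'New Ticket' in the top-right corner of the dashboard.",
]

scores = reranker.predict([(question, doc) for doc in candidates])

# Keep the highest-scoring candidates as the final context
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])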

The important thing about retrieval is that it must be very fast, since it has to search your entire knowledge base to find the most relevant documents.

But no matter how you implement retrieval, you end up with context that is passed to an LLM. Who said this context can accurately answer the user’s question, or that the LLM-generated answer is fully derived from it?

Solution: Aporia Guardrails

Aporia makes sure your specialized RAG chatbot follows your guidelines, and then takes that a step further.

Guardrails no longer have to be simple instructions in your prompt. Aporia provides a scalable way to build custom guardrails for your RAG chatbot. These guardrails run separately from your main LLM pipeline and can learn from examples, and Aporia uses a variety of techniques, from deterministic algorithms to fine-tuned small language models specialized for guardrails, to keep the added latency and cost to a minimum.

No matter how retrieval is implemented, you can use Aporia to make sure your final context can accurately answer the user’s question, and that the LLM-generated response is fully derived from it.
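
Conceptually, these checks run as a separate step outside your main LLM pipeline. The sketch below only illustrates that flow with hypothetical placeholder functions (context_answers_question, answer_derived_from_context); it is not Aporia’s SDK or API:

def context_answers_question(question: str, context: str) -> bool:
    # Placeholder: a real guardrail would verify the context can actually answer the question
    return True

def answer_derived_from_context(answer: str, context: str) -> bool:
    # Placeholder: a real guardrail would verify the answer is grounded in the context
    return True

def guarded_answer(question: str, context: str, generate_answer) -> str:
    # Pre-generation check: can the retrieved context answer the question at all?
    if not context_answers_question(question, context):
        return "I don't know."

    answer = generate_answer(question, context)  # your existing LLM call, unchanged

    # Post-generation check: is the answer fully derived from the retrieved context?
    if not answer_derived_from_context(answer, context):
        return "I don't know."

    return answer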

You can also use Aporia to safeguard against inappropriate responses, prompt injection attacks, and other issues.