Streaming support
Aporia Guardrails supports streaming completions requested from LLM providers, enforcing guardrails at both the prompt and response level. This is critical for real-time applications such as chatbots, where immediate responsiveness is essential to a reliable user experience.
Understanding Streaming
What is Streaming?
Typically, when a completion is requested from an LLM provider such as OpenAI, the entire content is generated and then returned to the user in a single response.
This can lead to significant delays, resulting in a poor user experience, especially with longer completions.
Streaming mitigates this issue by delivering the completion in parts, enabling the initial parts of the output to be displayed while the remaining content is still being generated.
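As an illustration, here is a minimal sketch of requesting a streamed completion with the OpenAI Python SDK; the model name is illustrative, and the client reads its API key from the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# With stream=True, the API returns the completion in chunks as it is
# generated, so the first tokens can be displayed immediately instead
# of waiting for the full response.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no content (e.g., role or finish markers)
        print(delta, end="", flush=True)
```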
Challenges in Streaming + Guardrails
While streaming improves response times, it complicates content moderation: a partial completion cannot be fully assessed for issues such as toxicity, prompt injection, or hallucinations, because the rest of the output has not yet been generated.
Aporia Guardrails is designed to address these challenges within a streaming context.
Aporia’s Streaming Support
Currently, Aporia supports streaming through the OpenAI proxy integration. Integration via the REST API is planned for a future release.
By default, Aporia processes each chunk of the partial completion as it arrives from OpenAI, running all policies simultaneously on every chunk together with its historical context, without significantly increasing latency or token usage.
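A minimal sketch of what this looks like from the client side, assuming the OpenAI Python SDK and a placeholder proxy URL (substitute the actual URL from your Aporia project; the host shown here is not the real endpoint):

```python
from openai import OpenAI

client = OpenAI(
    # Placeholder: replace with the proxy URL from your Aporia project.
    base_url="https://<your-aporia-proxy-host>/<project-id>",
)

# Chunks stream back as usual; guardrail policies run on each partial
# completion before it reaches the user.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```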
You can also set the X-RESPONSE-CHUNKED: false HTTP header to make Aporia wait until the entire response has been received, run guardrails on the complete output, and then simulate a streaming experience for the user.
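A sketch of setting this header via the SDK's default_headers option, under the same placeholder-URL assumption as above:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-aporia-proxy-host>/<project-id>",  # placeholder
    # With X-RESPONSE-CHUNKED: false, the full response is buffered,
    # guardrails run once on the complete output, and the result is
    # replayed to the client as a simulated stream.
    default_headers={"X-RESPONSE-CHUNKED": "false"},
)

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
```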