Description
Check Existing Issues
Related: #13007
Problem Description
The `/api/chat/completions` endpoint supports two primary modes of operation:
- Synchronous (`stream=False`): Typically invoked via direct HTTP requests, this mode processes the entire request and returns the complete response in a single HTTP transaction.
- Asynchronous (`stream=True`): Primarily used by the frontend UI via WebSocket, this mode is expected to return immediately with a `task_id`. This `task_id` allows the frontend to receive status updates, stream the response incrementally over the WebSocket connection, and, crucially, lets the user stop the generation early.
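For illustration, a minimal client-side sketch of the two modes is shown below (the base URL, payload fields, and the `task_id` field in the streaming response are assumptions made for this example, not a verified API contract):

```python
import requests  # any HTTP client works; used here only for illustration

BASE_URL = "http://localhost:3000"  # assumed Open WebUI address
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder credentials
payload = {"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}

# Synchronous mode: the whole request is processed and the complete
# response comes back in a single HTTP transaction.
sync_resp = requests.post(
    f"{BASE_URL}/api/chat/completions",
    json={**payload, "stream": False},
    headers=HEADERS,
)

# Asynchronous mode: per the description above, this call should return
# almost immediately with a task_id, while the tokens themselves are
# streamed to the frontend over the WebSocket connection.
async_resp = requests.post(
    f"{BASE_URL}/api/chat/completions",
    json={**payload, "stream": True},
    headers=HEADERS,
)
task_id = async_resp.json().get("task_id")  # assumed response field
```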
While the asynchronous (`stream=True`) mode works as expected for standard chat interactions (returning the `task_id` promptly), this behavior breaks when features requiring substantial pre-processing, such as Web Search or Tool Use, are enabled. Instead of returning immediately, the endpoint waits for the `process_chat_payload` phase (which includes potentially long-running operations like web searches or tool executions) to complete before returning the `task_id`.
This synchronous behavior during the payload-processing phase leads to two significant issues (both reported in discussions):
- Delayed Early Stopping: The frontend does not receive the `task_id` until web search/tool execution finishes, so users cannot stop the request during this initial, potentially lengthy (30-60s+) phase.
- Network Timeouts: The extended wait for the endpoint to respond increases the risk of network errors, such as gateway timeouts or client-side request timeouts, degrading the user experience.
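To make the timeout issue concrete, here is a sketch of what a client sees when the payload phase runs longer than a typical request timeout (URL, payload, and timeout value are placeholders):

```python
import requests

try:
    # With Web Search enabled, the endpoint may not answer for 30-60s+,
    # so an ordinary client or gateway timeout fires before the task_id
    # is ever returned.
    resp = requests.post(
        "http://localhost:3000/api/chat/completions",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Search the web for ..."}],
            "stream": True,
        },
        timeout=30,  # typical client/gateway timeout
    )
except requests.exceptions.Timeout:
    # The client gives up, yet the server keeps running the web search,
    # and the user never gets a task_id with which to stop it.
    print("timed out while waiting for the task_id")
```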
Cause Analysis
The chat completion process can be broadly divided into two phases:
- `process_chat_payload`: Handles request preprocessing, including web searches, tool calls, and injecting the results into the context passed to the language model.
- `process_chat_response`: Handles the actual generation of the AI response by the LLM and streams the results back via WebSocket.
Currently, `process_chat_response` is correctly handled asynchronously using `create_task`, as seen in `open-webui/backend/open_webui/utils/middleware.py`, lines 1209 to 1210 at b8fb4e5.
However, `process_chat_payload` is still executed synchronously within the request handler: users have to wait until `process_chat_payload` finishes before they receive the background `task_id`. Things get worse when the web search feature is enabled, since it may take 30-60s; during that period the user cannot stop the request early and faces the risk of a connection timeout.
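Put schematically, the current control flow in the handler looks roughly like the following (a simplified paraphrase of the behavior described above, not the exact upstream code; model and metadata resolution are omitted):

```python
# Simplified paraphrase of the current handler behavior (not the exact code).
async def chat_completion(request, form_data, user):
    model = ...     # model lookup omitted
    metadata = ...  # chat_id / session metadata omitted
    tasks = ...     # background task configuration omitted

    # Phase 1: web search, tool calls, context injection.
    # This is awaited inline, so the HTTP response (and therefore the
    # task_id) is blocked for however long these steps take.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )

    # Phase 2: only the LLM generation/streaming part ends up in a
    # background task (the create_task call in middleware.py referenced
    # above), so the task_id exists only after phase 1 has finished.
    response = await chat_completion_handler(request, form_data, user)
    return await process_chat_response(
        request, response, form_data, user, metadata, model, events, tasks
    )
```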
Desired Solution you'd like
For asynchronous API calls, refactor the `chat_completion` handler in `main.py` so that the entire processing pipeline (both payload processing and response generation) is asynchronous from the start. This can be achieved by wrapping all the time-consuming logic within a single background task created immediately upon receiving the request (`test_async_chat_completion`):
```python
async def all_time_consuming_jobs(request, form_data, user, metadata, model):
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    response = await chat_completion_handler(request, form_data, user)
    await process_chat_response(  # don't create_task inside `process_chat_response`
        request, response, form_data, user, metadata, model, events, tasks
    )


task_id, _ = create_task(
    all_time_consuming_jobs(request, form_data, user, metadata, model),
    id=metadata["chat_id"],
)
```
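Because all the time-consuming work now lives inside one task keyed by the chat id, early stopping reduces to plain task cancellation. The registry below is a minimal, self-contained sketch of that idea (illustrative only; Open WebUI's own task utilities may differ in names and details):

```python
import asyncio

# Illustrative in-memory task registry keyed by chat id.
TASKS: dict[str, asyncio.Task] = {}


def create_task(coro, id: str):
    """Schedule a background job and remember it so it can be stopped later."""
    task = asyncio.create_task(coro)
    TASKS[id] = task
    task.add_done_callback(lambda _t: TASKS.pop(id, None))
    return id, task


def stop_task(id: str) -> bool:
    """Cancel a running chat task, e.g. when the user presses 'stop'."""
    task = TASKS.get(id)
    if task is None:
        return False
    # Cancelling here also interrupts an in-flight web search or tool call,
    # not just the LLM generation phase.
    task.cancel()
    return True
```

With this shape, stopping the request during the web-search phase works the same way as stopping it during generation, which is exactly the early-stopping behavior the frontend needs.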
Further Considerations
This simple patch works technically; however, there is still a lot of work to be done:
- Identifying Synchronous/Asynchronous Requests in `main.py` (see the sketch after this list)
- Error Handling: correct and robust error handling across the two phases
- Early Stopping Behavior: the frontend logic for early stopping when the web search has not yet finished
- etc.
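For the first point, a possible shape for the branch inside the `chat_completion` handler is sketched below (hypothetical; it reuses the names from the patch above, assumes the `stream` flag is what distinguishes the two modes, and uses a placeholder response shape for the `task_id`):

```python
# Hypothetical fragment of the async chat_completion handler body.
if form_data.get("stream"):
    # Asynchronous: schedule all time-consuming work and return the
    # task_id immediately so the frontend can stream and early-stop.
    task_id, _ = create_task(
        all_time_consuming_jobs(request, form_data, user, metadata, model),
        id=metadata["chat_id"],
    )
    return {"status": True, "task_id": task_id}  # placeholder response shape
else:
    # Synchronous: keep the current inline behavior and return the
    # complete response in the same HTTP transaction.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    response = await chat_completion_handler(request, form_data, user)
    return await process_chat_response(
        request, response, form_data, user, metadata, model, events, tasks
    )
```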