Description
Check Existing Issues
Related: #13007
Problem Description
The `/api/chat/completions` endpoint supports two primary modes of operation:
- Synchronous (`stream=False`): Typically invoked via direct HTTP requests, this mode processes the entire request and returns the complete response in a single HTTP transaction.
- Asynchronous (`stream=True`): Primarily used by the frontend UI via WebSocket, this mode is expected to return immediately with a `task_id`. This `task_id` allows the frontend to receive status updates, stream the response incrementally over the WebSocket connection, and, crucially, lets the user stop the generation early.
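For illustration, a minimal client-side sketch of the two modes is shown below (the base URL, payload fields, and the `task_id` field in the streaming response are assumptions made for this example, not a verified API contract):

```python
import requests  # any HTTP client works; used here only for illustration

BASE_URL = "http://localhost:3000"  # assumed Open WebUI address
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder credentials
payload = {"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}

# Synchronous mode: the whole request is processed and the complete
# response comes back in a single HTTP transaction.
sync_resp = requests.post(
    f"{BASE_URL}/api/chat/completions",
    json={**payload, "stream": False},
    headers=HEADERS,
)

# Asynchronous mode: per the description above, this call should return
# almost immediately with a task_id, while the tokens themselves are
# streamed to the frontend over the WebSocket connection.
async_resp = requests.post(
    f"{BASE_URL}/api/chat/completions",
    json={**payload, "stream": True},
    headers=HEADERS,
)
task_id = async_resp.json().get("task_id")  # assumed response field
```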
While the asynchronous (`stream=True`) mode works as expected for standard chat interactions (returning the `task_id` promptly), this behavior breaks when features requiring substantial pre-processing, such as Web Search or Tool Use, are enabled. Instead of returning immediately, the endpoint waits for the `process_chat_payload` phase (which includes potentially long-running operations like web searches or tool executions) to complete before returning the `task_id`.
This synchronous behavior during the payload-processing phase leads to two significant issues (both reported in discussions):
- Delayed Early Stopping: The frontend does not receive the `task_id` until web search/tool execution finishes, so users cannot stop the request during this initial, potentially lengthy (30-60s+) phase.
- Network Timeouts: The extended wait for the endpoint to respond increases the risk of network errors, such as gateway timeouts or client-side request timeouts, degrading the user experience.
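To make the timeout issue concrete, here is a sketch of what a client sees when the payload phase runs longer than a typical request timeout (URL, payload, and timeout value are placeholders):

```python
import requests

try:
    # With Web Search enabled, the endpoint may not answer for 30-60s+,
    # so an ordinary client or gateway timeout fires before the task_id
    # is ever returned.
    resp = requests.post(
        "http://localhost:3000/api/chat/completions",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "Search the web for ..."}],
            "stream": True,
        },
        timeout=30,  # typical client/gateway timeout
    )
except requests.exceptions.Timeout:
    # The client gives up, yet the server keeps running the web search,
    # and the user never gets a task_id with which to stop it.
    print("timed out while waiting for the task_id")
```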
Cause Analysis
The chat completion process can be broadly divided into two phases:
- `process_chat_payload`: Handles request preprocessing, including web searches, tool calls, and injecting the results into the context passed to the language model.
- `process_chat_response`: Handles the actual generation of the AI response by the LLM and streams the results back via WebSocket.
Currently, `process_chat_response` is correctly handled asynchronously using `create_task`, as seen in `open-webui/backend/open_webui/utils/middleware.py`, lines 1209 to 1210 at b8fb4e5.
However, `process_chat_payload` is still executed synchronously within the request handler: users have to wait until `process_chat_payload` finishes before they receive the background `task_id`. Things get worse when the web search feature is enabled, since it may take 30-60s; during that period the user cannot stop the request early and faces the risk of a connection timeout.
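Put schematically, the current control flow in the handler looks roughly like the following (a simplified paraphrase of the behavior described above, not the exact upstream code; model and metadata resolution are omitted):

```python
# Simplified paraphrase of the current handler behavior (not the exact code).
async def chat_completion(request, form_data, user):
    model = ...     # model lookup omitted
    metadata = ...  # chat_id / session metadata omitted
    tasks = ...     # background task configuration omitted

    # Phase 1: web search, tool calls, context injection.
    # This is awaited inline, so the HTTP response (and therefore the
    # task_id) is blocked for however long these steps take.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )

    # Phase 2: only the LLM generation/streaming part ends up in a
    # background task (the create_task call in middleware.py referenced
    # above), so the task_id exists only after phase 1 has finished.
    response = await chat_completion_handler(request, form_data, user)
    return await process_chat_response(
        request, response, form_data, user, metadata, model, events, tasks
    )
```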
Desired Solution you'd like
For asynchronous API calls, refactor the `chat_completion` handler in `main.py` so that the entire processing pipeline (both payload processing and response generation) is asynchronous from the start. This can be achieved by wrapping all the time-consuming logic within a single background task created immediately upon receiving the request (`test_async_chat_completion`):
```python
async def all_time_consuming_jobs(request, form_data, user, metadata, model):
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    response = await chat_completion_handler(request, form_data, user)
    await process_chat_response(  # don't create_task inside `process_chat_response`
        request, response, form_data, user, metadata, model, events, tasks
    )


task_id, _ = create_task(
    all_time_consuming_jobs(request, form_data, user, metadata, model),
    id=metadata["chat_id"],
)
```
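Because all the time-consuming work now lives inside one task keyed by the chat id, early stopping reduces to plain task cancellation. The registry below is a minimal, self-contained sketch of that idea (illustrative only; Open WebUI's own task utilities may differ in names and details):

```python
import asyncio

# Illustrative in-memory task registry keyed by chat id.
TASKS: dict[str, asyncio.Task] = {}


def create_task(coro, id: str):
    """Schedule a background job and remember it so it can be stopped later."""
    task = asyncio.create_task(coro)
    TASKS[id] = task
    task.add_done_callback(lambda _t: TASKS.pop(id, None))
    return id, task


def stop_task(id: str) -> bool:
    """Cancel a running chat task, e.g. when the user presses 'stop'."""
    task = TASKS.get(id)
    if task is None:
        return False
    # Cancelling here also interrupts an in-flight web search or tool call,
    # not just the LLM generation phase.
    task.cancel()
    return True
```

With this shape, stopping the request during the web-search phase works the same way as stopping it during generation, which is exactly the early-stopping behavior the frontend needs.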
Further Considerations
This simple patch works technically; however, there is still a lot of work to be done:
- Identifying Synchronous/Asynchronous Requests in `main.py` (see the sketch after this list)
- Error Handling: correct and robust error handling across the two phases
- Early Stopping Behavior: the frontend logic for early stopping when the web search has not yet finished
- etc.
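For the first point, a possible shape for the branch inside the `chat_completion` handler is sketched below (hypothetical; it reuses the names from the patch above, assumes the `stream` flag is what distinguishes the two modes, and uses a placeholder response shape for the `task_id`):

```python
# Hypothetical fragment of the async chat_completion handler body.
if form_data.get("stream"):
    # Asynchronous: schedule all time-consuming work and return the
    # task_id immediately so the frontend can stream and early-stop.
    task_id, _ = create_task(
        all_time_consuming_jobs(request, form_data, user, metadata, model),
        id=metadata["chat_id"],
    )
    return {"status": True, "task_id": task_id}  # placeholder response shape
else:
    # Synchronous: keep the current inline behavior and return the
    # complete response in the same HTTP transaction.
    form_data, metadata, events = await process_chat_payload(
        request, form_data, user, metadata, model
    )
    response = await chat_completion_handler(request, form_data, user)
    return await process_chat_response(
        request, response, form_data, user, metadata, model, events, tasks
    )
```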