
Failing to resume from cache with locally hosted LLM? #564

Open

@cryptowooser

I'm having an issue where our curator data generation runs are failing to resume properly from cache. We are using the LiteLLM backend to contact a locally hosted LLM. This occurs on both v0.1.20 and v0.1.19, which we were using yesterday. Resumes have worked for us in the past, so this seems to be new behavior.

Our backend generation parameters are as follows:

backend_params = {
    "base_url": api_base,
    "generation_params": {"max_tokens": 32768, "min_p": 0.1},
    "max_concurrent_requests": 80,
    "max_requests_per_minute": 10000,
    "max_tokens_per_minute": 10_000_000,
    "request_timeout": 120,
}
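For context, here is a stripped-down sketch of how we construct the generator around those parameters. The class name, prompt/parse bodies, and field names are simplified placeholders; only the model name, backend, and backend_params reflect our actual run.

from bespokelabs import curator

class Reannotator(curator.LLM):
    """Placeholder for our real reannotation prompt/parse logic."""

    def prompt(self, input: dict) -> str:
        # "conversation" is a hypothetical field name, for illustration only
        return input["conversation"]

    def parse(self, input: dict, response: str) -> dict:
        return {**input, "reannotated": response}

reannotator = Reannotator(
    model_name="hosted_vllm/deepseek-ai/DeepSeek-V3",  # hosted_vllm/ prefix routes LiteLLM to the local vLLM server
    backend="litellm",
    backend_params=backend_params,  # the dict shown above
)
# dataset = reannotator(source_dataset)  # a second run of this should resume from ~/.cache/curator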

On the first restart, the number of concurrent requests is far below what it should be (we have it set to 80, but it begins at around 18). It then slowly counts down to 1 concurrent request and sits there far longer than a single request should take (10+ minutes). If restarted after that, it stays on "Preparing to generate responses..." with no movement.

This behavior appears to be new, as we've been able to resume from cache with no issues previously.

The example below is from DeepSeek-V3, but I have verified that it also occurs when contacting Tulu405B.

❯ python bespoke_shisa_v1_reannotator.py -m "deepseek-ai/DeepSeek-V3" --api-base http://ip-10-1-1-135:8000/v1
[03/01/25 06:48:05] INFO     Manually set max_concurrent_requests to 80                          base_online_request_processor.py:173
                    INFO     Manually set max_concurrent_requests to 80                          base_online_request_processor.py:173
[03/01/25 06:48:06] INFO     Getting rate limits for model: hosted_vllm/deepseek-ai/DeepSeek-V3  litellm_online_request_processor.py:243
                    WARNING  LiteLLM does not support cost estimation for model: This model isn't mapped yet.
                             model=deepseek-ai/DeepSeek-V3, custom_llm_provider=hosted_vllm. Add it here -
                             https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json.  litellm_online_request_processor.py:226
                    INFO     Test call headers: {'llm_provider-date': 'Sat, 01 Mar 2025 06:48:05 GMT', 'llm_provider-server': 'uvicorn',
                             'llm_provider-content-length': '402', 'llm_provider-content-type': 'application/json'}  litellm_online_request_processor.py:229
                    INFO     Running LiteLLMOnlineRequestProcessor completions with model: hosted_vllm/deepseek-ai/DeepSeek-V3  base_request_processor.py:132
[03/01/25 06:48:10] INFO     Using cached requests. If you want to regenerate the dataset, disable or delete the cache.  base_request_processor.py:213
[03/01/25 06:48:11] INFO     Manually set max_requests_per_minute to 10000                       base_online_request_processor.py:190
                    INFO     Manually set max_tokens_per_minute to 10000000                      base_online_request_processor.py:209
                    INFO     Resuming progress by reading existing file: /fsx/ubuntu/.cache/curator/1f96e0acfca6031c/responses_0.jsonl  base_request_processor.py:511
                    INFO     Found 132 successful requests and 0 previously failed requests and 0 parsing errors in /fsx/ubuntu/.cache/curator/1f96e0acfca6031c/responses_0.jsonl  base_request_processor.py:537
Preparing to generate 187943 responses using hosted_vllm/deepseek-ai/DeepSeek-V3 with combined input/output token limiting strategy

I have verified that there is nothing unusual in the debug log in the cache folder. I will happily provide any other logs you need.
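In case it helps with reproduction, this is the sort of quick sanity check I ran against the cache file referenced in the log above (plain Python, nothing curator-specific; the hash in the path is just our run's cache key):

import json
from pathlib import Path

cache_file = Path("/fsx/ubuntu/.cache/curator/1f96e0acfca6031c/responses_0.jsonl")

ok = bad = 0
with cache_file.open() as f:
    for line in f:
        try:
            json.loads(line)  # each line should be one cached response record
            ok += 1
        except json.JSONDecodeError:
            bad += 1

print(f"{ok} parseable cached responses, {bad} malformed lines")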
