Skip to content

Add retry to VictoriaLogs data source #3654

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

thebondo
Copy link

@thebondo thebondo commented May 30, 2025

The current VictoriaLogs data source will exit if the data source goes away once a tail based connection has been established. This causes the whole crowdsec service to exit as well.

This update fixes the Tail method on the VictoriaLogs client so that it tries to reconnect with a backoff if the connection is lost.

Resolves #3653

thebondo added 5 commits May 16, 2025 15:22
If the original HTTP get for the tail endpoint is successful, but
then the connection is lost, no retry is done. I updated the Tail
method to also retry in this case.
Upon review, the added code was pretty different in the approach
used to keep retrying compared to the approach for the QueryRange
method. I updated the method to create a new doTail that has the
same style as doQueryRange and updated Tail to use it. This has
the following effects:

- doTail will keep trying after losing a connection
- the retry interval will grow (with an upper limit) and shrink
  (with a lower limit) as connections are made and broken
- the time in the request is updated to avoid overlapping with
  previous data that was returned (missing in the first fix)
Bringing in changes from master.
Keeping up with changes from origin
The use of the ticker was unnecessary. I updated doTail to use a backoff
interval with time.After.
Copy link

@thebondo: There are no 'kind' label on this PR. You need a 'kind' label to generate the release automatically.

  • /kind feature
  • /kind enhancement
  • /kind refactoring
  • /kind fix
  • /kind chore
  • /kind dependencies
Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

Copy link

@thebondo: There are no area labels on this PR. You can add as many areas as you see fit.

  • /area agent
  • /area local-api
  • /area cscli
  • /area appsec
  • /area security
  • /area configuration
Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

@thebondo
Copy link
Author

/kind fix
/area agent

@thebondo thebondo marked this pull request as ready for review May 30, 2025 13:54
@thebondo thebondo marked this pull request as draft May 30, 2025 13:54
@zekker6
Copy link
Contributor

zekker6 commented Jul 11, 2025

Hello @thebondo, do you think you will have some time to continue work on this PR?
If not, I can take over and create a PR with your commit and get it ready for review (attributing you as an original author).

@thebondo
Copy link
Author

Hello @zekker6. Yes, I can still contribute on this. I have a question about the approach for the final fix.

As I understand it, the tail endpoint is expected to stay open and report log entries when received (after some possible delay, which helps keep entries sorted by time). However, the connection is sometimes lost. Ideally, the tail consumer would like to see all log entries exactly once. However, there does not seem to be a good way to start a new tail query that will both

  1. return all log entries received after that last query disconnected
  2. not return any log entries received before the last query disconnected

The solution I have so far is to keep track of the latest timestamp seen while the tail connection is open. If the connection is lost, then compute start_offset based on the time of the new query, the latest timestamp, and possible a small constant to account for potential delays in the request process.

This has the drawback that it can potentially provide a log entry more than once. From looking at the VictoriaLogs source, the start time for the tail interval appears to be calculated as time.Now() - offset - start_offset. That means that the consumer cannot specify the start time exactly because it involves the time the request is actually handled by VictoriaLogs (which is the reason for adding a small constant to start_offset).

Does this all sound correct to you? If so, I will update the code to use start_offset. I will also remove the use of the limit argument, which is not documented for the tail endpoint and does not appear to do anything either.

thebondo added 2 commits July 17, 2025 06:31
Keeping the fork up to date with the origin repository.
The tail query endpoint does not support start, but start_offset. I
updated the doTail method to use this parameter, and calculate the
required value each time the query is attempted based on the desired
start time for the results returned from the query.
@zekker6
Copy link
Contributor

zekker6 commented Jul 17, 2025

@thebondo Thank you for an update!

Does this all sound correct to you? If so, I will update the code to use start_offset. I will also remove the use of the limit argument, which is not documented for the tail endpoint and does not appear to do anything either.

Yes, this sounds correct to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CrowdSec exits if the VictoriaLogs data source goes down temporarily
2 participants