DAOS-17534 dtx: race between DTX aggregation and container close #16504

Draft: wants to merge 1 commit into master

Conversation

Nasf-Fan
Contributor

@Nasf-Fan Nasf-Fan commented Jun 12, 2025

dtx_aggregation_pool() may yield inside sched_req_put(), and the related container may be closed during that yield. If the DTX aggregation logic does not check for this race with close before adding the container back to the per-pool DTX aggregation list, a subsequent DTX batched commit or DTX aggregation pass may trigger the assertion "D_ASSERT(!dbca->dbca_deregister)".

In addition, the DTX aggregation logic needs to hold a reference on the dbca structure so that it is not freed while DTX aggregation is in progress.
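
The pattern described above can be illustrated with a minimal, self-contained sketch. It is not the DAOS code: `struct cont_args`, `cont_get()`, `cont_put()`, `agg_one()` and the two callbacks are hypothetical names invented for illustration, and a plain function call stands in for the yield in sched_req_put(). It only shows the two ideas from the description: take a reference before the yield point, and re-check the deregister flag before putting the entry back on the aggregation list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-container dbca structure (not the DAOS definition). */
struct cont_args {
	int  refs;        /* reference count; the entry counts as freed at 0 */
	bool deregister;  /* set by close; plays the role of dbca_deregister */
	bool on_agg_list; /* linked on the per-pool aggregation list? */
	bool freed;       /* recorded instead of calling free() so callers can inspect it */
};

static void cont_get(struct cont_args *c)
{
	c->refs++;
}

static void cont_put(struct cont_args *c)
{
	if (--c->refs == 0)
		c->freed = true;
}

/* Simulates a container close that runs while aggregation has yielded. */
static void close_during_yield(struct cont_args *c)
{
	c->deregister = true;
	c->on_agg_list = false;
	cont_put(c); /* drop the reference owned by the open container */
}

/* Simulates a yield during which nothing happens to the container. */
static void no_close(struct cont_args *c)
{
	(void)c;
}

/*
 * One aggregation pass over a single container. The yield_point callback
 * stands in for the place where the real code may yield.
 */
static void agg_one(struct cont_args *c, void (*yield_point)(struct cont_args *))
{
	cont_get(c);            /* keep the entry alive across the yield */
	c->on_agg_list = false; /* taken off the list while being processed */

	yield_point(c);         /* the container may be closed here */

	if (!c->deregister)     /* re-check before adding it back */
		c->on_agg_list = true;

	cont_put(c);            /* may be the last reference if close ran */
}
```

With `close_during_yield` as the callback, the entry is never re-added to the list and is released only when the aggregation pass drops its own reference; with `no_close`, the entry goes back on the list and keeps its original reference.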

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Ticket title is 'EMRG src/dtx/dtx_common.c:720 dtx_batched_commit() Assertion '!dbca->dbca_deregister' failed'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17534

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1 branch 3 times, most recently from cc90e4f to a912f53 Compare June 12, 2025 10:23

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1 branch from a912f53 to 49281b0 Compare June 12, 2025 10:26
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/5/execution/node/1449/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/5/execution/node/1468/log

2 participants