DAOS-17534 dtx: race between DTX aggregation and container close #16504

Open
wants to merge 1 commit into
base: master

Nasf-Fan (Contributor) commented Jun 12, 2025

dtx_aggregation_pool() may yield in sched_req_put(), and the related container can be closed during that yield. If the DTX aggregation logic does not check for this race with close before adding the container back to the per-pool DTX aggregation list, a subsequent DTX batched commit or DTX aggregation pass may trigger the assertion "D_ASSERT(!dbca->dbca_deregister)".

In addition, the DTX aggregation logic needs to hold a reference on the dbca structure so that it is not freed while DTX aggregation is in progress.
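
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of "hold a reference across the yield, then recheck the deregister flag before re-queueing". It is not the actual patch and does not use DAOS APIs: `struct agg_args`, `agg_get()`, `agg_put()`, `container_close()`, `aggregate_one()` and the `yield_fn` callback are all hypothetical stand-ins, and the `freed` flag exists only so the demo can observe when the last reference is dropped. It also assumes the entry is off the list while it is being processed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-container dbca structure. */
struct agg_args {
	int  refs;        /* reference count */
	bool deregister;  /* set once the container is being closed */
	bool on_list;     /* linked on the per-pool aggregation list */
	bool freed;       /* demo only: records that the last ref was dropped */
};

static void agg_get(struct agg_args *a)
{
	a->refs++;
}

static void agg_put(struct agg_args *a)
{
	assert(a->refs > 0);
	if (--a->refs == 0)
		a->freed = true; /* a real implementation would free here */
}

/* Models container close: mark, unlink, drop the registration reference. */
static void container_close(struct agg_args *a)
{
	a->deregister = true;
	a->on_list = false;
	agg_put(a);
}

/* Callback that models whatever runs while the aggregation logic yields. */
typedef void (*yield_fn)(struct agg_args *);

/*
 * Process one entry. Returns true if the entry was put back on the list,
 * false if a close raced with the yield and the entry was left off.
 */
static bool aggregate_one(struct agg_args *a, yield_fn yield_point)
{
	bool requeued = false;

	assert(!a->deregister);
	agg_get(a);              /* keep the structure alive across the yield */
	a->on_list = false;      /* assumed: taken off the list while processed */

	if (yield_point != NULL)
		yield_point(a);  /* models the yield in sched_req_put() */

	if (!a->deregister) {    /* recheck: did a close race with us? */
		a->on_list = true;
		requeued = true;
	}

	agg_put(a);              /* may drop the last reference */
	return requeued;
}
```

With an entry that starts at `refs = 1` and `on_list = true`, `aggregate_one(&a, NULL)` re-queues it, while `aggregate_one(&a, container_close)` returns false, leaves it off the list, and drops the final reference only after the recheck. Without the extra reference, the close during the yield would drop the count to zero while the entry is still in use; without the recheck, a deregistered entry would land back on the list and trip the assertion on the next pass.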

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Jun 12, 2025

Ticket title is 'EMRG src/dtx/dtx_common.c:720 dtx_batched_commit() Assertion '!dbca->dbca_deregister' failed'
Status is 'In Review'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17534

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1 branch 4 times, most recently from a912f53 to 49281b0, June 12, 2025 10:26
daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/5/execution/node/1449/log

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/5/execution/node/1468/log

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1 branch from 49281b0 to 3fc3bbf, June 16, 2025 02:41
github-actions bot added the priority label (Ticket has high priority, automatically managed) Jun 16, 2025
Nasf-Fan marked this pull request as ready for review June 16, 2025 02:41
Nasf-Fan requested review from a team as code owners June 16, 2025 02:41
Nasf-Fan requested review from liuxuezhao and NiuYawei June 16, 2025 02:41
daosbuild3 (Collaborator) commented

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/6/execution/node/1413/log

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16504/6/execution/node/1399/log

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16504/6/testReport/

dtx_aggregation_pool() may yield in sched_req_put(), and the related
container can be closed during that yield. If the DTX aggregation
logic does not check for this race with close before adding the
container back to the per-pool DTX aggregation list, a subsequent
DTX batched commit or DTX aggregation pass may trigger the assertion
"D_ASSERT(!dbca->dbca_deregister)".

In addition, the DTX aggregation logic needs to hold a reference on
the dbca structure so that it is not freed while DTX aggregation is
in progress.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1 branch from 3fc3bbf to 8227c86, June 17, 2025 01:47
Labels
priority Ticket has high priority (automatically managed)
2 participants