Skip to content

DAOS-17534 dtx: race between DTX aggregation and container close - b26 #16505

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: release/2.6
Choose a base branch
from

Conversation

Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented Jun 12, 2025

dtx_aggregation_pool() logic may yield because of sched_req_put(). Then someone may close related container during the yield. If DTX aggregation logic does not check the race with close before adding the container back to the DTX aggregation list (per pool), then it may trigger assertion of "D_ASSERT(!dbca->dbca_deregister)" during subsequent DTX batched commit or DTX aggregation process.

On the other hand, DTX aggregation logic needs to hold reference on the dbca structure to avoid being freed during DTX aggregation.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Copy link

github-actions bot commented Jun 12, 2025

Ticket title is 'EMRG src/dtx/dtx_common.c:720 dtx_batched_commit() Assertion '!dbca->dbca_deregister' failed'
Status is 'In Review'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17534

@Nasf-Fan Nasf-Fan marked this pull request as ready for review June 12, 2025 07:53
@Nasf-Fan Nasf-Fan requested review from a team as code owners June 12, 2025 07:53
@Nasf-Fan Nasf-Fan requested review from liuxuezhao and NiuYawei June 12, 2025 07:53
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1_b26 branch from bc36f62 to 12a862d Compare June 12, 2025 09:27
/* Someone is closing current container. */
D_ASSERT(d_list_empty(&dbca->dbca_pool_link));
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks it could race with dtx_free_dbca()? (both could call sched_req_put(dbca->dbca_agg_req)).

 if (dbca->dbca_agg_req != NULL) {
            if (!dbca->dbca_agg_done)
                    sched_req_wait(dbca->dbca_agg_req, true);
            /* Just to be safe... */
            if (dbca->dbca_agg_req != NULL) {
                    D_ASSERT(dbca->dbca_agg_done);
                    sched_req_put(dbca->dbca_agg_req);
                    dbca->dbca_agg_req = NULL;
                    dbca->dbca_agg_done = 0;
            }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am further working on the patch with reference.

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1_b26 branch 2 times, most recently from dc5536b to 7e16eb9 Compare June 12, 2025 10:29
@Nasf-Fan Nasf-Fan requested a review from NiuYawei June 12, 2025 10:29
@daosbuild3
Copy link
Collaborator

@@ -482,43 +482,49 @@ dtx_aggregation_pool(struct dss_module_info *dmi, struct dtx_batched_pool_args *
dbca_pool_link);
D_ASSERT(!dbca->dbca_deregister);

dtx_get_dbca(dbca);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how can this solve the issue I mentioned in prior comment? Looks dtx_free_dbca() only checks the dbca refcount at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how can this solve the issue I mentioned in prior comment? Looks dtx_free_dbca() only checks the dbca refcount at the end.

dtx_free_dbca() call sched_req_put() will when xxx_done is true, at that time, the sched_req_put() will not yield, so only one will call sched_req_put().


if (stat.dtx_cont_cmt_count >= dtx_agg_thd_cnt_up ||
((stat.dtx_cont_cmt_count > dtx_agg_thd_cnt_lo ||
stat.dtx_pool_cmt_count >= dtx_agg_thd_cnt_up) &&
(dtx_sec2age(stat.dtx_first_cmt_blob_time_lo) >= dtx_agg_thd_age_up))) {
D_ASSERT(!dbca->dbca_agg_done);
dtx_get_dbca(dbca);
dbca->dbca_agg_req = sched_create_ult(&attr, dtx_aggregate, dbca, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks there is an extra put in dtx_aggregate().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference is held at the beginning of the patch when into the while() section. If sched_create_ult succeed, its reference will be released via dtx_aggregate(), otherwise, the caller itself will explicitly release such reference via "goto next".

@Nasf-Fan Nasf-Fan requested a review from NiuYawei June 13, 2025 16:24
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/4/execution/node/1523/log

dtx_aggregation_pool() logic may yield because of sched_req_put().
Then someone may close related container during the yield. If DTX
aggregation logic does not check the race with close before adding
the container back to the DTX aggregation list (per pool), then it
may trigger assertion of "D_ASSERT(!dbca->dbca_deregister)" during
subsequent DTX batched commit or DTX aggregation process.

On the other hand, DTX aggregation logic needs to hold reference on
the dbca structure to avoid being freed during DTX aggregation.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17534_1_b26 branch from 7e16eb9 to c5793a3 Compare June 16, 2025 02:34
@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Jun 16, 2025
@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16505/5/testReport/

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/5/execution/node/1403/log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority Ticket has high priority (automatically managed)
Development

Successfully merging this pull request may close these issues.

3 participants