-
Notifications
You must be signed in to change notification settings - Fork 318
DAOS-17534 dtx: race between DTX aggregation and container close - b26 #16505
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release/2.6
Are you sure you want to change the base?
Conversation
Ticket title is 'EMRG src/dtx/dtx_common.c:720 dtx_batched_commit() Assertion '!dbca->dbca_deregister' failed' |
bc36f62
to
12a862d
Compare
/* Someone is closing current container. */ | ||
D_ASSERT(d_list_empty(&dbca->dbca_pool_link)); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks it could race with dtx_free_dbca()? (both could call sched_req_put(dbca->dbca_agg_req)).
if (dbca->dbca_agg_req != NULL) {
if (!dbca->dbca_agg_done)
sched_req_wait(dbca->dbca_agg_req, true);
/* Just to be safe... */
if (dbca->dbca_agg_req != NULL) {
D_ASSERT(dbca->dbca_agg_done);
sched_req_put(dbca->dbca_agg_req);
dbca->dbca_agg_req = NULL;
dbca->dbca_agg_done = 0;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am further working on the patch with reference.
dc5536b
to
7e16eb9
Compare
Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/4/execution/node/1342/log |
@@ -482,43 +482,49 @@ dtx_aggregation_pool(struct dss_module_info *dmi, struct dtx_batched_pool_args * | |||
dbca_pool_link); | |||
D_ASSERT(!dbca->dbca_deregister); | |||
|
|||
dtx_get_dbca(dbca); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering how can this solve the issue I mentioned in prior comment? Looks dtx_free_dbca() only checks the dbca refcount at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering how can this solve the issue I mentioned in prior comment? Looks dtx_free_dbca() only checks the dbca refcount at the end.
dtx_free_dbca() call sched_req_put() will when xxx_done is true, at that time, the sched_req_put() will not yield, so only one will call sched_req_put().
|
||
if (stat.dtx_cont_cmt_count >= dtx_agg_thd_cnt_up || | ||
((stat.dtx_cont_cmt_count > dtx_agg_thd_cnt_lo || | ||
stat.dtx_pool_cmt_count >= dtx_agg_thd_cnt_up) && | ||
(dtx_sec2age(stat.dtx_first_cmt_blob_time_lo) >= dtx_agg_thd_age_up))) { | ||
D_ASSERT(!dbca->dbca_agg_done); | ||
dtx_get_dbca(dbca); | ||
dbca->dbca_agg_req = sched_create_ult(&attr, dtx_aggregate, dbca, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks there is an extra put in dtx_aggregate().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference is held at the beginning of the patch when into the while() section. If sched_create_ult succeed, its reference will be released via dtx_aggregate(), otherwise, the caller itself will explicitly release such reference via "goto next".
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/4/execution/node/1523/log |
dtx_aggregation_pool() logic may yield because of sched_req_put(). Then someone may close related container during the yield. If DTX aggregation logic does not check the race with close before adding the container back to the DTX aggregation list (per pool), then it may trigger assertion of "D_ASSERT(!dbca->dbca_deregister)" during subsequent DTX batched commit or DTX aggregation process. On the other hand, DTX aggregation logic needs to hold reference on the dbca structure to avoid being freed during DTX aggregation. Signed-off-by: Fan Yong <fan.yong@hpe.com>
7e16eb9
to
c5793a3
Compare
Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/5/execution/node/1417/log |
Test stage Functional Hardware Medium completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16505/5/testReport/ |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16505/5/execution/node/1403/log |
dtx_aggregation_pool() logic may yield because of sched_req_put(). Then someone may close related container during the yield. If DTX aggregation logic does not check the race with close before adding the container back to the DTX aggregation list (per pool), then it may trigger assertion of "D_ASSERT(!dbca->dbca_deregister)" during subsequent DTX batched commit or DTX aggregation process.
On the other hand, DTX aggregation logic needs to hold reference on the dbca structure to avoid being freed during DTX aggregation.
Steps for the author:
After all prior steps are complete: