fix: persist member join status with conflict retry by weicao · Pull Request #10357 · apecloud/kubeblocks

weicao · 2026-06-11T07:26:34Z

Problem

Oracle 3-to-2 horizontal scaling hit a controller race in two reproduced runs. The component controller successfully executed memberJoin for the new pod, then the same reconcile lost the following InstanceSet annotation update to a resourceVersion conflict. A later scale-in read memberJoined=false from the stale replicas-status annotation and skipped memberLeave, leaving a stale Oracle DG broker member for the deleted pod.

Observed evidence from Oracle validation:

succeed to join member for pod ...-oracle-2
immediately followed by Operation cannot be fulfilled on instancesets... object has been modified
later scale-in logged joined replicas: [] while memberLeave was defined
no primary-side memberLeave kbagent action was executed

Fix

Add UpdateReplicaStatusWithRetry for InstanceSet replicas-status annotation writes.
After memberJoin succeeds, persist memberJoined=true through a fresh get/update retry loop instead of depending only on the later graph update.
Keep the in-memory proto InstanceSet status aligned for the current reconcile.
Add a regression test that injects one conflict and verifies memberJoined=true is still persisted.

Tests

make test-go-generate
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1

Validation boundary

This closes the controller persistence race for new memberJoin executions. It does not change memberLeave semantics for replicas that are explicitly still marked unjoined, because forcing leave for known-unjoined replicas would require a broader addon action idempotency contract. Oracle should validate this exact patch image in the same vcluster scenario before this PR is marked ready.

Fixes #10359

apecloud-bot · 2026-06-11T07:26:43Z

Auto Cherry-pick Instructions

Usage:
  - /nopick: Do not auto cherry-pick when the PR is merged.
  - /pick release-x.x [release-x.x]: Auto cherry-pick to the specified branch(es) when the PR is merged.

Example:
  - /nopick
  - /pick release-1.1

CLA Recheck Instructions

Usage:
  - /recheck-cla: Trigger a re-check of CLA status for this pull request.
Example:
  - /recheck-cla

codecov · 2026-06-11T07:35:31Z

Codecov Report

❌ Patch coverage is 28.57143% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.25%. Comparing base (1d86fc0) to head (84c8e12).

Files with missing lines	Patch %	Lines
...ps/component/transformer_component_workload_ops.go	0.00%	7 Missing ⚠️
pkg/controller/component/replicas.go	57.14%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10357      +/-   ##
==========================================
+ Coverage   53.15%   53.25%   +0.09%     
==========================================
  Files         533      533              
  Lines       63457    63470      +13     
==========================================
+ Hits        33733    33798      +65     
+ Misses      26277    26227      -50     
+ Partials     3447     3445       -2

Flag	Coverage Δ
unittests	`53.25% <28.57%> (+0.09%)`	⬆️


weicao · 2026-06-11T22:13:27Z

Runtime validation update for the release-1.0 backport path:

Scope:

PR #10357 (fix: persist member join status with conflict retry) head: 84c8e12
release-1.0 backport commit validated by Oracle team: 2081645
controller image: docker.io/library/kubeblocks:r10-memberjoin-retry-20816453-amd64
live imageID: sha256:680243a878442af10f193aea4cc1e02b60c670c8e66c50e374def0cf33be9397
controller pod startTime: 2026-06-11T20:56:56Z
controller restart count: 0
rollback verified to stock image apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks:1.0.3-beta.9

Result:

N=3 3->2 scale-in rounds passed
joined replicas was non-empty in all 3 rounds: ["ora-w-2800-oracle-2"]
succeed to call leave member action appeared in all 3 rounds
memberLeave skip count: 0
controller-window sweep observed 28 resourceVersion conflicts during the window
joined replicas: [] count in the sweep: 0
DGMGRL post-scale-in checks showed 2 members, no ORCLCDB_2 zombie member, and SUCCESS in all 3 rounds
cleanup/rollback completed; test namespace oracle-w10357 was deleted by the Oracle team

Evidence verified:

evidence tar sha256: 922462216d52690e73eb7b795a6ecbfc0b79a5f90b84558c9ab7ef37c91e9f4f
identity supplement tar sha256: 5959b80641358aa47b421d244d8efa4e6606a08925472bf6d4c30abca548ea9a
final MANIFEST.sha256 self sha256: 20160103442399c9726acbef7b50f89fbe718f570da6f436f4ed08933cd16c31
shasum -c MANIFEST.sha256 passed for the extracted evidence set

Boundary:

This is a focused release-1.0 backport runtime validation for the memberJoin status-conflict retry fix.
It is not an Oracle full acceptance result and not a release-ready claim.

leon-ape · 2026-06-12T02:05:05Z

 				joinErrors = append(joinErrors, fmt.Errorf("pod %s: %w", pod.Name, err))
 			} else {
+				key := types.NamespacedName{Namespace: r.protoITS.Namespace, Name: r.protoITS.Name}
+				if err := component.UpdateReplicaStatusWithRetry(r.transCtx.Context, r.cli, key, pod.Name, func(status *component.ReplicaStatus) error {


[P1] This writes the live InstanceSet from inside the transformer after memberJoin succeeds, which is the wrong layer for this failure. A conflict while persisting memberJoined is a normal reconcile failure: the controller should retry the whole reconcile and rely on the memberJoin action idempotency, not introduce a second write path outside the DAG/plan execution model. Keep replicas-status persistence in the normal DAG/update flow instead of special-casing this annotation write.
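
The alternative suggested in this comment (fail the reconcile on conflict and let the next pass redo the work) can be sketched as follows. This is an illustration under stated assumptions, not KubeBlocks code: `cluster`, `reconcile`, and `reconcileUntilDone` are hypothetical names, and the join is assumed to be idempotent, which is exactly the property the approach relies on.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a resourceVersion conflict on the status write.
var errConflict = errors.New("object has been modified")

// cluster is a hypothetical engine-side member list.
type cluster struct {
	members   map[string]bool
	joinCalls int
}

// join is idempotent: calling it again for a known member changes nothing.
func (c *cluster) join(pod string) {
	c.joinCalls++
	c.members[pod] = true
}

// reconcile is one pass: run the join action, then persist the status in the
// pass's single write. Any persist error fails the whole pass.
func reconcile(c *cluster, pod string, persist func() error) error {
	c.join(pod)
	return persist()
}

// reconcileUntilDone re-runs the whole pass after a failure, as a requeue would.
func reconcileUntilDone(c *cluster, pod string, persist func() error, maxPasses int) (int, error) {
	var err error
	for pass := 1; pass <= maxPasses; pass++ {
		if err = reconcile(c, pod, persist); err == nil {
			return pass, nil
		}
	}
	return maxPasses, err
}

func main() {
	c := &cluster{members: map[string]bool{}}
	failures, persisted := 1, false
	persist := func() error {
		if failures > 0 {
			failures--
			return errConflict // first write loses the race
		}
		persisted = true
		return nil
	}
	passes, err := reconcileUntilDone(c, "oracle-2", persist, 5)
	// The join ran twice, but the member list holds a single entry.
	fmt.Println(passes, err, c.joinCalls, len(c.members), persisted) // prints "2 <nil> 2 1 true"
}
```

The trade-off against an in-transformer retry is that the join action runs again on every failed pass, so this only works when the addon's memberJoin is safe to repeat.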

leon-ape · 2026-06-12T02:05:15Z

[P1] The production failure is caused by accepting scale-in while the previous scale-out member lifecycle is still in progress, not by the resourceVersion conflict itself. If any replica still has pending DataLoaded or MemberJoined state, scale-in can delete it and skip or race the engine memberLeave path regardless of this retry. Add an operation/state gate so scale-in waits for scale-out member lifecycle completion, or define an explicit safe cancellation path, before deleting those replicas.
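
The operation/state gate proposed here can be sketched as a pure check over replica status. The `replicaStatus` type, its field semantics, and `canScaleIn` are assumptions made for illustration; in particular, treating a nil pointer as "step not required" and false as "required but still pending" is a convention chosen for this sketch, not something confirmed by the PR.

```go
package main

import "fmt"

// replicaStatus is a hypothetical, simplified view of one replica's lifecycle.
// Assumption: nil means the step is not required, false means still pending.
type replicaStatus struct {
	Name         string
	DataLoaded   *bool
	MemberJoined *bool
}

// pending reports whether a lifecycle step is required but not yet complete.
func pending(step *bool) bool { return step != nil && !*step }

// pendingLifecycle returns the replicas whose scale-out lifecycle is unfinished.
func pendingLifecycle(replicas []replicaStatus) []string {
	var names []string
	for _, r := range replicas {
		if pending(r.DataLoaded) || pending(r.MemberJoined) {
			names = append(names, r.Name)
		}
	}
	return names
}

// canScaleIn allows scale-in only when no replica has a pending lifecycle step.
// It also returns the blocking replicas so the caller can log or requeue.
func canScaleIn(replicas []replicaStatus) (bool, []string) {
	blocked := pendingLifecycle(replicas)
	return len(blocked) == 0, blocked
}

func boolPtr(b bool) *bool { return &b }

func main() {
	replicas := []replicaStatus{
		{Name: "oracle-0", DataLoaded: boolPtr(true), MemberJoined: boolPtr(true)},
		{Name: "oracle-1"}, // no lifecycle steps required
		{Name: "oracle-2", DataLoaded: boolPtr(true), MemberJoined: boolPtr(false)},
	}
	ok, blocked := canScaleIn(replicas)
	fmt.Println(ok, blocked) // prints "false [oracle-2]"
}
```

A gate like this only delays scale-in. The explicit safe cancellation path mentioned in the comment would need separate handling for replicas that never finish joining.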

fix: persist member join status with conflict retry

84c8e12

github-actions Bot added the size/M Denotes a PR that changes 30-99 lines. label Jun 11, 2026

weicao marked this pull request as ready for review June 11, 2026 22:13

weicao requested review from a team and leon-ape as code owners June 11, 2026 22:13

leon-ape reviewed Jun 12, 2026

weicao wants to merge 1 commit into main from bugfix/memberjoin-status-conflict
