fix: persist member join status with conflict retry #10357

Open: weicao wants to merge 1 commit into main from bugfix/memberjoin-status-conflict

@weicao weicao commented Jun 11, 2026

Contributor

Problem

Oracle 3-to-2 horizontal scaling hit a controller race in two reproduced runs. The component controller successfully executed memberJoin for the new pod, but the InstanceSet annotation update that followed in the same reconcile was rejected with a resourceVersion conflict, so memberJoined=true was never persisted. A later scale-in read memberJoined=false from the stale replicas-status annotation and skipped memberLeave, leaving a stale Oracle DG broker member for the deleted pod.

Observed evidence from Oracle validation:

  • succeed to join member for pod ...-oracle-2
  • immediately followed by Operation cannot be fulfilled on instancesets... object has been modified
  • later scale-in logged joined replicas: [] while memberLeave was defined
  • no primary-side memberLeave kbagent action was executed

Fix

  • Add UpdateReplicaStatusWithRetry for InstanceSet replicas-status annotation writes.
  • After memberJoin succeeds, persist memberJoined=true through a fresh get/update retry loop instead of depending only on the later graph update.
  • Keep the in-memory proto InstanceSet status aligned for the current reconcile.
  • Add a regression test that injects one conflict and verifies memberJoined=true is still persisted.

Tests

  • make test-go-generate
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1

Validation boundary

This closes the controller persistence race for new memberJoin executions. It does not change memberLeave semantics for replicas that are explicitly still marked unjoined, because forcing leave for known-unjoined replicas would require a broader addon action idempotency contract. Oracle should validate this exact patch image in the same vcluster scenario before this PR is marked ready.

Fixes #10359

@apecloud-bot

Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Do not auto cherry-pick when the PR is merged.
  - /pick release-x.x [release-x.x]: Auto cherry-pick to the specified branches when the PR is merged.

Example:
  - /nopick
  - /pick release-1.1

CLA Recheck Instructions

Usage:
  - /recheck-cla: Trigger a re-check of CLA status for this pull request.
Example:
  - /recheck-cla

@github-actions github-actions Bot added the size/M Denotes a PR that changes 30-99 lines. label Jun 11, 2026

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 28.57143% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.25%. Comparing base (1d86fc0) to head (84c8e12).

Files with missing lines                                  Patch %   Lines
...ps/component/transformer_component_workload_ops.go     0.00%     7 Missing ⚠️
pkg/controller/component/replicas.go                      57.14%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10357      +/-   ##
==========================================
+ Coverage   53.15%   53.25%   +0.09%     
==========================================
  Files         533      533              
  Lines       63457    63470      +13     
==========================================
+ Hits        33733    33798      +65     
+ Misses      26277    26227      -50     
+ Partials     3447     3445       -2     
Flag        Coverage Δ
unittests   53.25% <28.57%> (+0.09%) ⬆️

weicao commented Jun 11, 2026

Contributor Author

Runtime validation update for the release-1.0 backport path:

Scope:

  • PR #10357 (fix: persist member join status with conflict retry) head: 84c8e12
  • release-1.0 backport commit validated by Oracle team: 2081645
  • controller image: docker.io/library/kubeblocks:r10-memberjoin-retry-20816453-amd64
  • live imageID: sha256:680243a878442af10f193aea4cc1e02b60c670c8e66c50e374def0cf33be9397
  • controller pod startTime: 2026-06-11T20:56:56Z
  • controller restart count: 0
  • rollback verified to stock image apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks:1.0.3-beta.9

Result:

  • all 3 rounds (N=3) of 3->2 scale-in passed
  • joined replicas was non-empty in all 3 rounds: ["ora-w-2800-oracle-2"]
  • succeed to call leave member action appeared in all 3 rounds
  • memberLeave skip count: 0
  • the controller-window sweep observed 28 resourceVersion conflicts
  • occurrences of joined replicas: [] in the sweep: 0
  • DGMGRL post-scale-in checks showed 2 members, no ORCLCDB_2 zombie member, and SUCCESS in all 3 rounds
  • cleanup/rollback completed; test namespace oracle-w10357 was deleted by the Oracle team

Evidence verified:

  • evidence tar sha256: 922462216d52690e73eb7b795a6ecbfc0b79a5f90b84558c9ab7ef37c91e9f4f
  • identity supplement tar sha256: 5959b80641358aa47b421d244d8efa4e6606a08925472bf6d4c30abca548ea9a
  • final MANIFEST.sha256 self sha256: 20160103442399c9726acbef7b50f89fbe718f570da6f436f4ed08933cd16c31
  • shasum -c MANIFEST.sha256 passed for the extracted evidence set

Boundary:

  • This is a focused release-1.0 backport runtime validation for the memberJoin status-conflict retry fix.
  • It is not an Oracle full acceptance result and not a release-ready claim.

@weicao weicao marked this pull request as ready for review June 11, 2026 22:13
@weicao weicao requested review from a team and leon-ape as code owners June 11, 2026 22:13
	joinErrors = append(joinErrors, fmt.Errorf("pod %s: %w", pod.Name, err))
} else {
	key := types.NamespacedName{Namespace: r.protoITS.Namespace, Name: r.protoITS.Name}
	if err := component.UpdateReplicaStatusWithRetry(r.transCtx.Context, r.cli, key, pod.Name, func(status *component.ReplicaStatus) error {

Contributor

[P1] This writes the live InstanceSet from inside the transformer after memberJoin succeeds, which is the wrong layer for this failure. A conflict while persisting memberJoined is a normal reconcile failure: the controller should retry the whole reconcile and rely on the memberJoin action idempotency, not introduce a second write path outside the DAG/plan execution model. Keep replicas-status persistence in the normal DAG/update flow instead of special-casing this annotation write.

@leon-ape

Contributor

[P1] The production failure is caused by accepting scale-in while the previous scale-out member lifecycle is still in progress, not by the resourceVersion conflict itself. If any replica still has pending DataLoaded or MemberJoined state, scale-in can delete it and skip or race the engine memberLeave path regardless of this retry. Add an operation/state gate so scale-in waits for scale-out member lifecycle completion, or define an explicit safe cancellation path, before deleting those replicas.

Development

Successfully merging this pull request may close these issues.

memberJoin success can lose joined status on conflict

3 participants