Skip to content

Fixed failing KleidiAI NHWC unit tests#29010

Open
martin-klacer-arm wants to merge 2 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_nhwc_failure
Open

Fixed failing KleidiAI NHWC unit tests#29010
martin-klacer-arm wants to merge 2 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_nhwc_failure

Conversation

@martin-klacer-arm

Copy link
Copy Markdown

Description

The current tests in the NhwcTransformerTests suite ConvFloat_UsesNhwcOnlyWithKleidi and FusedConvFloat_UsesNhwcOnlyWithKleidi assumed that whenever KleidiAI NHWC float-conv support is available, the test graph must be rewritten to com.microsoft.NhwcFusedConv.

That assumption is not valid when ONNX Runtime is built with --enable_arm_neon_nchwc. In that cofiguration, the Level 3 NCHWc transformer is registered before the NHWC transformer, so the NCHWc rewrite can be selected instead. The optimiser priority is intentional, so these tests shouldn't require NHWC to be chosen over NCHWc.

This change keeps the existing optimiser ordering and instead updates the assertions in the 2 tests. If the NHWC path is selected, the tests still validate the expected NhwcFusedConv graph shape and verify that the path is only used when KleidiAI NHWC support is available. If another valid layout optimisation is selected first, the tests no longer fail just because the NhwcFusedConv op isn't generated.

Motivation and Context

This change fixes the 2 mentioned unit tests which fail when ONNX Runtime is built and tested with --enable_arm_neon_nchwc.

 * Fixed tests ConvFloat_UsesNhwcOnlyWithKleidi,
   FusedConvFloat_UsesNhwcOnlyWithKleidi
 * Added check to only test NHWC shape parameters
   if NHWC path was taken in graph optimisation

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
@hariharans29

Copy link
Copy Markdown
Member

Verdict: Request changes — the fix is too permissive, and there is a one-line tighter alternative readily available

The diagnosis is correct: when ORT is built with --enable_arm_neon_nchwc, NchwcTransformer is registered at Level 3 and can win the layout-conversion race against the NHWC transformer, so the original EXPECT_EQ(NhwcFusedConv, 1) is genuinely wrong for that build config. But the proposed fix throws away too much coverage. The condition that drives the conflict is statically detectable in the test — let's gate on that rather than silently passing whenever no NHWC fused conv appears.


The coverage gap introduced by this patch

The original test made two assertions, depending on HasFloatNhwcNoTransposeSupport():

KleidiAI available KleidiAI not available
Original NhwcFusedConv == 1 (must use NHWC) NhwcFusedConv == 0 (must not use NHWC)
This PR NhwcFusedConv == 1 (if any was emitted) early return

In the no-NCHWc + KleidiAI-available build (the common ARM64 CI configuration), the original test catches a regression where the NHWC transformer fails to fire. The new test passes silently in that case — op_to_count["com.microsoft.NhwcFusedConv"] == 0 triggers an early return, and the bug goes undetected.

The test name (..._UsesNhwcOnlyWithKleidi) implies it verifies the NHWC path is actually selected when KleidiAI is available. After this patch, it verifies a strictly weaker contract: "if NHWC was selected, KleidiAI must have been available." That is the contrapositive of half of what the test used to check.


A better fix exists, one line away

The same condition that registers NchwcTransformer is checked at runtime in graph_transformer_utils.cc:466:

// Register the NCHWc layout transformer if supported by the platform.
if (MlasNchwcGetBlockSize() > 1) {
  transformers.emplace_back(std::make_unique<NchwcTransformer>());
}

mlas.h is already included in nhwc_transformer_test.cc:11, so MlasNchwcGetBlockSize() is directly callable from the test. Use it to branch the assertion on the same flag the production code uses:

auto check_nhwc_graph = [&](InferenceSessionWrapper& session) {
  auto op_to_count = CountOpsInGraph(session.GetGraph());
  const bool nchwc_can_intercept = MlasNchwcGetBlockSize() > 1;
  const bool expect_nhwc = HasFloatNhwcNoTransposeSupport({1, 8, 7, 7}, {16, 8, 3, 3}, {1, 1, 1, 1});

  if (nchwc_can_intercept) {
    // NCHWc transformer runs first and may rewrite the graph before the NHWC transformer.
    // Either NCHWc or NHWC is acceptable; we only verify "no NHWC without Kleidi support".
    if (op_to_count["com.microsoft.NhwcFusedConv"] > 0) {
      EXPECT_TRUE(expect_nhwc);
      EXPECT_EQ(op_to_count["com.microsoft.NhwcFusedConv"], 1);
      EXPECT_EQ(op_to_count["Transpose"], 2);
    }
    return;
  }

  // Original strict contract: NHWC is selected iff KleidiAI provides float NHWC support.
  EXPECT_EQ(op_to_count["com.microsoft.NhwcFusedConv"], expect_nhwc ? 1 : 0);
  EXPECT_EQ(op_to_count["Transpose"], expect_nhwc ? 2 : 0);
};

This preserves the full original contract on every build that doesn't have NCHWc enabled (the vast majority), and only relaxes it on the build that genuinely cannot guarantee the layout choice. Same one-file scope as the current patch.

Even better — if NchwcConv/NchwcMaxPool/etc. are the expected op types the NCHWc transformer emits, you can tighten the NCHWc branch further by asserting one of those appears, so the test still validates some layout transformation actually ran. (Worth checking what op names nchwc_transformer.cc produces and adding EXPECT_GT(nchwc_op_count, 0).)


Unrelated and unexplained: the tolerance bump

nhwc_transformer_test.cc:534 (the second test only):

-                    /*per_sample_tolerance*/ 1e-6,
+                    /*per_sample_tolerance*/ 2e-6,
                     /*relative_per_sample_tolerance*/ 1e-6);

The first test (ConvFloat_UsesNhwcOnlyWithKleidi) keeps 1e-6 — so this isn't a uniform Kleidi-related precision relaxation, it's specific to the FusedConv+Relu case. The PR description doesn't mention numerical tolerance at all.

Either:

  • The original 1e-6 was flaky for this particular case and the bump is fine (please say so in the description and ideally describe which path drifts), or
  • This was an unrelated drive-by relax that should be in a separate PR.

Without a justification it's a small but real loss of precision coverage that a reviewer can't tell whether to accept.


Smaller things

  • The early-return comment is good but worth naming the trigger: // NCHWc transformer (registered under --enable_arm_neon_nchwc) may rewrite this graph before NHWC runs, so an absent NhwcFusedConv is acceptable. Future readers shouldn't have to dig through git blame to learn why.

  • Both tests have identical check_nhwc_graph skeletons that differ only in the HasFloatNhwcNoTransposeSupport argument list. A small helper could share the logic, but that's a cleanup, not a request.

  • If you opt for the gated-strict-assertion approach above, the test name UsesNhwcOnlyWithKleidi still describes what it does in the dominant build. With the current patch, the name is misleading in the no-NCHWc + KleidiAI case (the test no longer ensures NHWC is actually used).


Bottom line

Right diagnosis, fix too broad. Please gate the existing strict assertion on MlasNchwcGetBlockSize() > 1 so the strict check survives in builds that don't have the NCHWc transformer registered, and either justify or drop the 1e-6 → 2e-6 tolerance change. Happy to re-approve once those two are addressed.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates NhwcTransformerTests to stop assuming that KleidiAI NHWC float-conv availability always implies the graph will be rewritten to com.microsoft.NhwcFusedConv, which can be false when NCHWc layout optimizations run earlier (e.g., with --enable_arm_neon_nchwc).

Changes:

  • Relaxed assertions in two NHWC transformer tests to only validate NhwcFusedConv-specific graph shape when that path is actually selected.
  • Increased the per-sample tolerance for the fused-conv test.

Comment on lines +450 to +456
// Only validate NhwcFusedConv graph shape if that path was selected
if (op_to_count["com.microsoft.NhwcFusedConv"] == 0) {
return;
}
EXPECT_TRUE(HasFloatNhwcNoTransposeSupport({1, 8, 7, 7}, {16, 8, 3, 3}, {1, 1, 1, 1}));
EXPECT_EQ(op_to_count["com.microsoft.NhwcFusedConv"], 1);
EXPECT_EQ(op_to_count["Transpose"], 2);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in the latest commit, added an assert to the early return to ensure that if KleidiAI is enabled and an NhwcFusedConv isn't generated, a different optimiser consumed the Conv node rather than this being caused by an NhwcTransformer failure.

Tried to make this not directly tied to NCHWc as that would mean whenever a new Transformer is added that would take precedence over NHWC here in some cases, we'd need to update the test. My commit attempts to generalise this to any Transformer without needing to know the specifics, but I'm open to discussion or any suggestions about this point.

Comment on lines +520 to +526
// Only validate NhwcFusedConv graph shape if that path was selected
if (op_to_count["com.microsoft.NhwcFusedConv"] == 0) {
return;
}
EXPECT_TRUE(HasFloatNhwcNoTransposeSupport({1, 8, 7, 7}, {16, 8, 3, 3}, {1, 1, 1, 1}, {1, 1}));
EXPECT_EQ(op_to_count["com.microsoft.NhwcFusedConv"], 1);
EXPECT_EQ(op_to_count["Transpose"], 2);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reply as the previous Copilot comment

Comment thread onnxruntime/test/optimizer/nhwc_transformer_test.cc
…merTests

 * Affected tests: ConvFloat_UsesNhwcOnlyWithKleidi,
   FusedConvFloat_UsesNhwcOnlyWithKleidi

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
@martin-klacer-arm

Copy link
Copy Markdown
Author

Thanks for the review @hariharans29, that makes sense. I agreed the previous early return was too permissive, so I updated the two affected NhwcTransformerTests to tighten the checks slightly as well as to make the intention behind the tests clearer in code.

The tests now separate the two cases explicitly: if NhwcFusedConv is produced, they still validate that KleidiAI NHWC support is available and that the expected transpose-boundary graph shape is present. If NhwcFusedConv is not produced but KleidiAI NHWC support is available, they now assert that the original Conv / com.microsoft.FusedConv node was consumed by another layout optimizer. That should preserve coverage for the regression you called out, while still allowing NCHWc or another earlier Level 3 layout optimizer to take precedence without making the tests depend specifically on MlasNchwcGetBlockSize() or the NCHWc case.

Since the test isn't validating NCHWc functionality, I'm attempting to find a solution that doesn't tie it into the test directly, but I'm open to discussion on this.

I also added a short comment for the FusedConvFloat_UsesNhwcOnlyWithKleidi tolerance bump. The slightly larger absolute tolerance is only to account for different optimizer paths changing accumulation order; the relative tolerance is unchanged.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +458 to +460
// When the Arm® KleidiAI™ NHWC Conv path is available but no NHWC ops were produced,
// ensure that some other optimiser run and consumed the Conv node instead
EXPECT_TRUE(!kleidi_supported || op_to_count["Conv"] == 0);
Comment on lines +538 to +540
// When the Arm® KleidiAI™ NHWC Conv path is available but no NHWC ops were produced,
// ensure that some other optimiser run and consumed the FusedConv node instead
EXPECT_TRUE(!kleidi_supported || op_to_count["com.microsoft.FusedConv"] == 0);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants