Add experimental Harbor integration for GRPO environment training by adithya-s-k · Pull Request #6018 · huggingface/trl

adithya-s-k · 2026-06-11T22:35:43Z

Adds an experimental integration for training on Harbor agentic task suites with GRPOTrainer via environment_factory. It lives at trl.experimental.harbor, is gated behind a new trl[harbor] extra, and is lazy-imported so non-users pay nothing.

It mirrors the structure of the OpenReward integration (#5752, #5729, #5696): a single HarborSpec maps one task suite to the three trainer slots:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_steps=50, max_tool_calling_iterations=25),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()

HarborEnv wraps a Harbor sandbox + verifier. TRL drives the rollout loop: it calls the env's tool methods during generation and reads env.reward after. The base agent (harness) is pluggable via agent= — a built-in name, a HarborEnv subclass, or an import/file path.
Built-in bash harness, plus jupyter and terminal_notes example harnesses under examples/scripts/harbor/harnesses/ (folder-per-harness, each with a README listing its tools).

External agents only (for now)

Harbor supports external agents (run outside the sandbox, drive the loop via exec) and installed agents (installed into the image, run headless inside the container, trajectory parsed after). Only the external pattern is supported, because RL needs the trainer to drive generation turn-by-turn and capture the policy's tokens/log-probs + env mask — which an opaque in-container agent can't expose. A HarborEnv is therefore an external agent: tool methods exec into the sandbox, but the loop and the model under training stay in TRL. (Documented in docs/source/harbor.md.)

Notable changes

pyproject.toml: new harbor extra (harbor>=0.13.0; python_version >= '3.12'); replaces the vLLM constraint vllm>=0.12.0,<=0.19.0 with vllm>=0.22.0. vLLM releases within the old cap pin transformers<5, which breaks environment_factory (needs transformers>=5.2). This change is separable from the rest of the PR if you'd prefer it split out.
E2B COPY workaround: E2B's from_dockerfile build honors RUN but silently drops files COPY'd from the build context, breaking task healthchecks that run those files (e.g. a data-pull hook). HarborEnv replicates the Dockerfile's COPY directives at runtime (upload as the sandbox user, mv into place as root).
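
The first half of the COPY workaround described above, finding which build-context files a Dockerfile copies, can be sketched as follows. This is a minimal illustration and not the code in this PR: `parse_copy_directives` is a hypothetical helper, it handles only the shell form of COPY (not the JSON-array form), and the upload and `mv` steps are left out because they depend on the sandbox API.

```python
import shlex


def parse_copy_directives(dockerfile_text: str) -> list[tuple[list[str], str]]:
    """Return (sources, destination) pairs for shell-form COPY lines.

    Hypothetical helper for illustration; COPY --from=<stage> is skipped
    because it copies from another build stage, not the build context.
    """
    copies = []
    # Join backslash line continuations into single logical lines.
    logical = dockerfile_text.replace("\\\n", " ")
    for raw in logical.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Only tokenize COPY lines, so odd quoting in RUN lines is never parsed.
        if line.split(None, 1)[0].upper() != "COPY":
            continue
        args = shlex.split(line)[1:]
        if any(arg.startswith("--from=") for arg in args):
            continue
        # Drop remaining flags such as --chown=... or --chmod=...
        args = [arg for arg in args if not arg.startswith("--")]
        if len(args) < 2:
            continue
        *sources, dest = args
        copies.append((sources, dest))
    return copies
```

Each returned pair could then be replayed at runtime: upload the sources as the sandbox user, then move them to the destination as root, as the description above outlines.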

Testing

tests/experimental/test_harbor.py — 12 tests (agent resolution, dataset/metadata columns, reward func), green; gated by require_harbor.
Verified end-to-end: gpt-4.1 over the bash harness loads task data and lands reward=1.0 on a data-agent task.

Note

Medium Risk
New experimental sandbox/RL path and a broader vLLM dependency range affect install compatibility; core training paths are unchanged unless Harbor or the new vLLM floor is used.

Overview
Adds experimental Harbor integration so GRPOTrainer can train on Harbor agentic task suites via environment_factory, following the same HarborSpec → train_dataset / environment_factory / reward_funcs pattern as OpenReward.

Core: trl.experimental.harbor introduces HarborSpec, HarborEnv (sandbox lifecycle, lazy verifier reward, dedicated async loop thread), and built-in HarborBashEnv. Harnesses are pluggable via agent= (name, import path, or HarborEnv subclass). Includes an E2B workaround that re-applies Dockerfile COPY files at runtime.

Packaging & deps: New optional extra trl[harbor] (harbor>=0.13.0, Python 3.12+). vLLM constraint changed from vllm>=0.12.0,<=0.19.0 to vllm>=0.22.0 so environment_factory can use transformers>=5.2.

Examples & docs: examples/scripts/harbor/data_agent.py GRPO script; example jupyter and terminal_notes harnesses; new docs/source/harbor.md and toctree/example table entries.

Tests: tests/experimental/test_harbor.py for agent resolution, dataset building, and outcome rewards; is_harbor_available / require_harbor helpers.

Reviewed by Cursor Bugbot for commit 1f6f01c.

Train on Harbor agentic task suites with GRPOTrainer via environment_factory. HarborSpec maps one task suite to the three trainer slots (train_dataset / environment_factory / reward_funcs), mirroring the OpenReward integration (huggingface#5752, huggingface#5729, huggingface#5696).

HarborEnv wraps a Harbor sandbox + verifier; the base agent (harness) is pluggable: built-in `bash`, plus `jupyter` and `terminal_notes` example harnesses (folder-per-harness, each with a README).

HarborEnv follows Harbor's *external agent* pattern (the policy drives the loop and tool methods exec into the sandbox); Harbor's *installed agents* are not supported, since RL needs the trainer to drive generation and capture the policy's tokens/log-probs, which an opaque in-container agent can't expose.

- trl/experimental/harbor: HarborEnv (+ HarborBashEnv) and HarborSpec
- examples/scripts/harbor: data_agent.py + harnesses/
- docs/source/harbor.md (+ toctree, example_overview entries)
- tests/experimental/test_harbor.py (require_harbor + is_harbor_available)
- pyproject: add `harbor` extra; relax vllm cap to >=0.22.0 (0.19 pins transformers<5, which breaks environment_factory; it needs transformers>=5.2)

E2B's from_dockerfile build honors RUN but silently drops COPY'd build-context files, so HarborEnv replicates the Dockerfile's COPY directives at runtime (upload as the sandbox user, mv into place as root), so that healthchecks that run those files (e.g. a data-pull hook) then work.

Verified end-to-end: gpt-4.1 over the bash harness lands reward=1.0 on a data-agent task.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5c076888f

chatgpt-codex-connector · 2026-06-11T22:38:39Z

+        verifier = VerifierFactory.create_verifier_from_config(
+            VerifierConfig(), task=self._task, trial_paths=self._paths, environment=self._env
+        )


Use the task's verifier config

For Harbor tasks that define a [verifier] block, this constructs a fresh default VerifierConfig() instead of using self._task.config.verifier. That silently drops task-provided verifier settings such as env credentials/model parameters, a custom import_path, user, or separate verifier environment, so those valid Harbor tasks will either run the wrong verifier or fail during reward computation.

chatgpt-codex-connector · 2026-06-11T22:38:39Z

+harbor = [
+    "harbor>=0.13.0; python_version >= '3.12'",  # harbor requires Python 3.12+ (pulls its sandbox backends)
+]


Include the cloud backend extras used by the example

The new example and docs recommend running with --env e2b, but installing trl[vllm,harbor] only pulls Harbor's base package; Harbor keeps providers such as e2b, daytona, and runloop behind optional extras. A user following the script metadata or quick-start will therefore hit an import error as soon as environment_type="e2b" is selected, even with E2B_API_KEY set.
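
One way to turn the late import error described above into an early, actionable message is to check for the backend before creating any sandbox. The sketch below is only an illustration: `missing_backend_hint` and the `BACKEND_MODULES` mapping are hypothetical, and the module names in the mapping are assumptions that would need to be checked against what each Harbor extra actually installs.

```python
import importlib.util

# Assumed mapping from backend name to the module its Harbor extra provides.
# These module names are guesses for illustration, not taken from Harbor.
BACKEND_MODULES = {"e2b": "e2b", "daytona": "daytona"}


def missing_backend_hint(environment_type, modules=BACKEND_MODULES):
    """Return an install hint if the backend's module is absent, else None."""
    module = modules.get(environment_type)
    if module is None:
        return None  # unknown or built-in backend: nothing to check
    if importlib.util.find_spec(module) is not None:
        return None  # module is importable
    return (
        f"Sandbox backend '{environment_type}' is not installed; "
        f"try: pip install 'harbor[{environment_type}]'"
    )
```

A script could call this right after argument parsing and raise an ImportError carrying the hint, so the failure appears before any training setup runs.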

…rainer

Review fixes (PR huggingface#6018):

- _verify: forward the task's [verifier].env to the verifier (override_env=), mirroring Harbor's trial runner, instead of dropping it with a bare VerifierConfig().
- HarborSpec dataset: task_index now reflects the selected suite position when `indices=` is used (was always 0..len-1).
- _resolve_agent: split the agent selector on the last ':' (rpartition) so a Windows drive path (D:\...\harness.py:Class) isn't misparsed as a module.
- docs/example: install the chosen sandbox backend's Harbor extra (e.g. `harbor[e2b]` for --env e2b); the trl[harbor] extra stays backend-free.

AsyncGRPO compatibility:

- HarborSpec.environment_factory returns a functools.partial (picklable) instead of a closure, so AsyncGRPOTrainer can ship it to its separate rollout-worker process.
- HarborEnv runs its asyncio loop on a daemon thread and submits via run_coroutine_threadsafe, so sync tool methods work both from a plain thread (GRPOTrainer) and from inside a running event loop (AsyncGRPOTrainer's worker), where loop.run_until_complete would otherwise raise.

Verified: 12 unit tests pass; gpt-4.1 over the bash harness still lands reward=1.0 end-to-end.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

cursor · 2026-06-11T23:26:03Z

+        b64 = base64.b64encode(code.encode()).decode()
+        result = self._loop.run_until_complete(
+            self._env.exec(f"python3 /opt/run_cell.py --code-b64 {shlex.quote(b64)}", timeout_sec=180)
+        )


Jupyter harness wrong async loop

High Severity

JupyterEnv._run_cell drives sandbox exec via self._loop.run_until_complete, but HarborEnv keeps that loop running forever on a background thread and routes all Harbor I/O through _run / run_coroutine_threadsafe. The first add_and_execute_code_cell call typically raises a running-loop error, so the documented jupyter harness cannot run cells during GRPO rollouts.
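
The pattern this comment refers to, a loop that runs forever on a background thread with work submitted through `run_coroutine_threadsafe`, can be sketched with the standard library alone. `LoopThread` and `fake_exec` are hypothetical names; the sketch does not reproduce HarborEnv's internals.

```python
import asyncio
import threading


class LoopThread:
    """Own an asyncio loop that runs forever on a daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        # Safe from any other thread: schedule on the background loop and
        # block the caller until the result is ready.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_exec(cmd: str) -> str:
    """Stand-in for a sandbox exec call (illustrative only)."""
    await asyncio.sleep(0)
    return f"ran: {cmd}"


if __name__ == "__main__":
    runner = LoopThread()
    print(runner.run(fake_exec("ls")))
    runner.close()
```

Calling `run_until_complete` on a loop that is already running raises `RuntimeError`, so a harness method that needs sandbox I/O would go through the equivalent of `run` here.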

cursor · 2026-06-11T23:26:03Z

+
+        await self._stop()  # tear down the previous task's sandbox
+        self._task = Task(task_dir=Path(task_dir))
+        self._paths = TrialPaths(trial_dir=Path(tempfile.mkdtemp(prefix="harbor_trl_")))


Trial temp directories leak

Medium Severity

Each HarborEnv.reset assigns a new TrialPaths directory from tempfile.mkdtemp without deleting the prior trial folder. GRPO reuses one env instance per rollout slot across many steps, so host temp directories accumulate for the whole run and can exhaust disk space during long Harbor training.
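
One possible shape for a fix is to have the object that creates the trial directory also delete the previous one. The sketch below uses a hypothetical `TrialDirOwner` class rather than HarborEnv itself. It assumes the previous trial's reward has already been read before the next reset, which matters because the PR describes the verifier reward as lazy.

```python
import shutil
import tempfile
from pathlib import Path


class TrialDirOwner:
    """Create one temp directory per trial and remove the previous one."""

    def __init__(self):
        self._trial_dir = None

    def reset(self) -> Path:
        # Delete the prior trial's directory before allocating a new one.
        self.cleanup()
        self._trial_dir = Path(tempfile.mkdtemp(prefix="harbor_trl_"))
        return self._trial_dir

    def cleanup(self) -> None:
        if self._trial_dir is not None:
            shutil.rmtree(self._trial_dir, ignore_errors=True)
            self._trial_dir = None
```

With this shape, at most one trial directory exists per env instance at any time, and a final `cleanup()` call at teardown removes the last one.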

adithya-s-k mentioned this pull request Jun 11, 2026

async grpo native weight sync with vllm>=0.22.0 #5892

Open

adithya-s-k wants to merge 2 commits into huggingface:main from adithya-s-k:experimental-harbor
