# OpenClaw QA Testing

Run, watch, debug, extend, or explain OpenClaw qa-lab and qa-channel scenarios, artifacts, and live lanes.
Use this skill for qa-lab / qa-channel work. Repo-local QA only.
## Read first

- `docs/concepts/qa-e2e-automation.md`
- `docs/help/testing.md`
- `docs/channels/qa-channel.md`
- `qa/README.md`
- `qa/scenarios/index.md`
- `extensions/qa-lab/src/suite.ts`
- `extensions/qa-lab/src/character-eval.ts`
## Model policy

- Live OpenAI lane: `openai/gpt-5.4`
- Fast mode: on
- Do not use: `openai/gpt-5.4-pro`, `openai/gpt-5.4-mini`
- Only change model policy if the user explicitly asks.
## Default workflow

1. Read the scenario pack and current suite implementation.
2. Decide lane:
   - mock/dev: `mock-openai`
   - real validation: `live-frontier`
3. For live OpenAI, use:
```bash
OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
pnpm openclaw qa suite \
--provider-mode live-frontier \
--model openai/gpt-5.4 \
--alt-model openai/gpt-5.4 \
--output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
```
4. Watch outputs:
   - summary: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json`
   - report: `.artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md`
5. If the user wants to watch the live UI, find the current openclaw-qa listen port and report `http://127.0.0.1:<port>` (see the sketch after this list).
6. If a scenario fails, fix the product or harness root cause, then rerun the full lane.
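A quick local check for steps 4 and 5; a hedged sketch, since inspecting the summary with `jq` and the process-name filter are assumptions, not part of the harness contract:

```bash
# Pretty-print the suite summary for a quick pass/fail look.
jq . .artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json

# Find the openclaw-qa listen port for the watch URL; adjust the grep
# pattern to however the lab server process is named on your machine.
lsof -iTCP -sTCP:LISTEN -P | grep -i openclaw-qa
```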
## OTEL smoke
For local QA-lab OpenTelemetry validation, use:
```bash
pnpm qa:otel:smoke
```
This starts a local OTLP/HTTP trace receiver, runs the `otel-trace-smoke`
scenario through qa-channel, decodes the emitted protobuf spans, and verifies
the exported trace names and privacy contract. It does not require Opik,
Langfuse, or external collector credentials.
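If the smoke fails and you want to replay just that scenario through the normal suite harness, a hedged sketch; it assumes the scenario id matches the name above and is exposed to `--scenario`:

```bash
# Mock lane replay of the OTEL scenario; scenario id is an assumption.
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario otel-trace-smoke \
  --output-dir .artifacts/qa-e2e/otel-smoke-<tag>
```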
## Matrix live profiles

`pnpm openclaw qa matrix` defaults to the full `all` profile. Use explicit
profiles for faster CI/release proof:
```bash
OPENCLAW_QA_MATRIX_NO_REPLY_WINDOW_MS=3000 \
pnpm openclaw qa matrix --profile fast --fail-fast
```
- `fast`: release-critical transport contract, excluding generated-image and
  deep E2EE recovery inventory.
- `transport`, `media`, `e2ee-smoke`, `e2ee-deep`, `e2ee-cli`: sharded full
  Matrix coverage.

The `QA-Lab - All Lanes` workflow uses the explicit `fast` Matrix profile on
scheduled runs. Manual dispatch keeps `matrix_profile=all` as the default and
always shards that full Matrix selection.
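To prove one shard of the full Matrix coverage on its own, the same flag shape applies; a sketch using one of the shard profiles listed above:

```bash
# Same no-reply window as the fast example; drop it to use the default.
OPENCLAW_QA_MATRIX_NO_REPLY_WINDOW_MS=3000 \
pnpm openclaw qa matrix --profile transport --fail-fast
```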
## QA credentials and 1Password

- Use `op` only inside `tmux` for QA secret lookup in this repo.
- Quick auth check inside tmux:
```bash
op account list
```
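Once authenticated, individual values can be pulled with `op read`; a hedged sketch, assuming the field label inside the `Telegram E2E` item matches the env var name:

```bash
# Hypothetical field name — check the actual 1Password item for the real label.
export OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN="$(op read 'op://OpenClaw/Telegram E2E/OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN')"
```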
- Direct Telegram npm live test secrets currently live in 1Password:
  - vault: `OpenClaw`
  - item: `Telegram E2E`
  - That item is the first place to look for:
    - `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`
    - `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`
    - `OPENCLAW_QA_PROVIDER_MODE`
    - `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC`
- Convex QA secrets currently live in 1Password:
  - vault: `OpenClaw`
  - items: `OPENCLAW_QA_CONVEX_SITE_URL`, `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER`, `OPENCLAW_QA_CONVEX_SECRET_CI`
- Additional related notes/login items seen during QA credential work:
  - vault: `Private`
  - items: `OPENCLAW QA`, `Convex`, `Telegram`
- If a required value is missing from those notes:
  - do not guess
  - ask the maintainer/operator for the current value or the current 1Password item name
- For Telegram direct runs, `OPENCLAW_QA_TELEGRAM_GROUP_ID` may be stored separately from `Telegram E2E`.
- For Convex runs, the leased Telegram credential should provide the Telegram group id and bot tokens together; do not require a separate `OPENCLAW_QA_TELEGRAM_GROUP_ID`.
- For Convex runs, prefer `OpenClaw/OPENCLAW_QA_CONVEX_SITE_URL`; if that is stale or unclear, ask for the active pool URL before running.
- Prefer direct Telegram envs for the npm Telegram Docker lane when available:
```bash
OPENCLAW_QA_TELEGRAM_GROUP_ID="..." \
OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN="..." \
OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN="..." \
OPENCLAW_QA_PROVIDER_MODE="mock-openai" \
OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC="openclaw@beta" \
pnpm test:docker:npm-telegram-live
```
- Prefer Convex mode when the goal is stable shared QA infra:
  - round-robin credential leasing
  - thinner wrapper for channel-specific setup
  - CLI/admin flows around the pooled credentials
- Live npm Telegram Docker lane note:
  - `scripts/e2e/npm-telegram-live-runner.ts` reads `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE`
  - do not assume `OPENCLAW_QA_PROVIDER_MODE` is consumed by that wrapper
  - if a 1Password note only gives `OPENCLAW_QA_PROVIDER_MODE`, map it explicitly to `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE` before running the Docker lane
- Verified live shape (see the sketch after this list):
  - Convex mode can pass the real Docker lane without direct Telegram env vars
  - the leased Telegram payload includes the group id coupled to the driver/SUT tokens
  - a real run of `pnpm test:docker:npm-telegram-live` passed with:
    - `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`
    - `OPENCLAW_QA_CREDENTIAL_ROLE=maintainer`
    - `OPENCLAW_QA_CONVEX_SITE_URL`
    - `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER`
    - `OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE=mock-openai`
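As a command line, that verified Convex-mode shape looks roughly like the following; the secret values come from the 1Password items listed above:

```bash
# Values are placeholders — pull the real ones from the OpenClaw vault items.
OPENCLAW_QA_CREDENTIAL_SOURCE=convex \
OPENCLAW_QA_CREDENTIAL_ROLE=maintainer \
OPENCLAW_QA_CONVEX_SITE_URL="..." \
OPENCLAW_QA_CONVEX_SECRET_MAINTAINER="..." \
OPENCLAW_NPM_TELEGRAM_PROVIDER_MODE="mock-openai" \
pnpm test:docker:npm-telegram-live
```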
## Character evals

Use `qa character-eval` for style/persona/vibe checks across multiple live models.
```bash
pnpm openclaw qa character-eval \
--model openai/gpt-5.4,thinking=xhigh \
--model openai/gpt-5.2,thinking=xhigh \
--model openai/gpt-5,thinking=xhigh \
--model anthropic/claude-opus-4-6,thinking=high \
--model anthropic/claude-sonnet-4-6,thinking=high \
--model zai/glm-5.1,thinking=high \
--model moonshot/kimi-k2.5,thinking=high \
--model google/gemini-3.1-pro-preview,thinking=high \
--judge-model openai/gpt-5.4,thinking=xhigh,fast \
--judge-model anthropic/claude-opus-4-6,thinking=high \
--concurrency 16 \
--judge-concurrency 16 \
--output-dir .artifacts/qa-e2e/character-eval-<tag>
```
- Runs local QA gateway child processes, not Docker.
- Preferred model spec syntax is `provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>]` for both `--model` and `--judge-model`.
- Do not add new examples with separate `--model-thinking`; keep that flag as legacy compatibility only.
- Defaults to candidate models `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`, `anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and `google/gemini-3.1-pro-preview` when no `--model` is passed.
- Candidate thinking defaults to `high`, with `xhigh` for OpenAI models that support it. Prefer inline `--model provider/model,thinking=<level>`; `--thinking <level>` and `--model-thinking <provider/model=level>` remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline `,fast`, `,no-fast`, or `,fast=false` for one model; use `--fast` only to force fast mode for every candidate (see the single-candidate sketch after this list).
- Judges default to `openai/gpt-5.4,thinking=xhigh,fast` and `anthropic/claude-opus-4-6,thinking=high`.
- Report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency default to 16. Use `--concurrency <n>` and `--judge-concurrency <n>` to override when local gateways or provider limits need a gentler lane.
- Scenario source should stay markdown-driven under `qa/scenarios/`.
- For isolated character/persona evals, write the persona into `SOUL.md` and blank `IDENTITY.md` in the scenario flow. Use `SOUL.md` + `IDENTITY.md` only when intentionally testing how the normal OpenClaw identity combines with the character.
- Keep prompts natural and task-shaped. The candidate model should receive character setup through `SOUL.md`, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.
## Codex CLI model lane

Use model refs shaped like `codex-cli/<codex-model>` whenever QA should exercise Codex as a model backend.
Examples:
```bash
pnpm openclaw qa suite \
--provider-mode live-frontier \
--model codex-cli/<codex-model> \
--alt-model codex-cli/<codex-model> \
--scenario <scenario-id> \
--output-dir .artifacts/qa-e2e/codex-<tag>
```
```bash
pnpm openclaw qa manual \
--model codex-cli/<codex-model> \
--message "Reply exactly: CODEX_OK"
```
- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves `CODEX_HOME` so Codex CLI auth/config works while keeping `HOME` and `OPENCLAW_HOME` sandboxed.
- Mock QA should scrub `CODEX_HOME`.
- If Codex returns fallback/auth text every turn, first check `CODEX_HOME`, `~/.profile`, and gateway child logs before changing scenario assertions.
- For model comparison, include `codex-cli/<codex-model>` as another candidate in `qa character-eval`; the report should label it as an opaque model name (see the sketch after this list).
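A hedged sketch of that comparison shape; `<codex-model>` stays user-supplied and is never hardcoded:

```bash
# Codex alongside one frontier candidate; judges use the defaults.
pnpm openclaw qa character-eval \
  --model codex-cli/<codex-model> \
  --model openai/gpt-5.4,thinking=xhigh \
  --output-dir .artifacts/qa-e2e/character-eval-codex-<tag>
```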
## Repo facts

- Seed scenarios live in `qa/`.
- Main live runner: `extensions/qa-lab/src/suite.ts`
- QA lab server: `extensions/qa-lab/src/lab-server.ts`
- Child gateway harness: `extensions/qa-lab/src/gateway-child.ts`
- Synthetic channel: `extensions/qa-channel/`
## What “done” looks like
- Full suite green for the requested lane.
- User gets:
- watch URL if applicable
- pass/fail counts
- artifact paths
- concise note on what was fixed
## Common failure patterns

- Live timeout too short:
  - widen live waits in `extensions/qa-lab/src/suite.ts`
- Discovery cannot find repo files:
  - point prompts at `repo/...` inside the seeded workspace
- Subagent proof too brittle:
  - prefer stable final reply evidence over transient child-session listing
- Harness “rebuild” delay:
  - a dirty tree can trigger a pre-run build; expect that before ports appear
## When adding scenarios

- Add or update scenario markdown under `qa/scenarios/`.
- Keep kickoff expectations in `qa/scenarios/index.md` aligned.
- Add executable coverage in `extensions/qa-lab/src/suite.ts`.
- Prefer end-to-end assertions over mock-only checks.
- Save outputs under `.artifacts/qa-e2e/`.
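When wiring a new scenario, a quick mock-lane pass before the full live rerun; a sketch reusing the suite flags shown earlier:

```bash
# Mock lane keeps iteration cheap; rerun the full live lane before calling it done.
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/new-scenario-<tag>
```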