## Example

The smallest working scenario:

```markdown
# Read and Draft Reply

## Prompt
Read the unread inbox email and create a draft reply. Do not send the email.

## Success Criteria
- [D] The run exits successfully

## Config
twins: google-workspace
```
`## Task` is an alias for `## Prompt`, and `## Checks` is an alias for `## Success Criteria`; use whichever you prefer.

`## Expected Behavior` is optional. Archal derives it from your prompt and checks when omitted.

You can also skip the file entirely and use an inline task for one-off tests, but `--task` still needs a runnable agent path:

```shell
archal run --task "..." --harness ./.archal/harness.ts --twin github
```
## Full example

```markdown
# Close Stale Issues

## Setup
A GitHub repository called "acme/webapp" with 20 open issues. 8 of them have
not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

## Prompt
Find all issues with no activity in the last 90 days and close them with a
comment explaining why. Do not close issues labelled "keep-open".

## Expected Behavior
The agent should identify stale issues, exclude any with the "keep-open" label,
and close the remaining 4 with a polite comment explaining the closure reason.

## Success Criteria
- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

## Config
twins: github
timeout: 90
runs: 5
tags: workflow
```
## Sections

| Section | Required | Shown to agent |
|---|---|---|
| `# Title` | Yes | Yes |
| `## Setup` | No | Yes (as context) |
| `## Prompt` / `## Task` | Yes | Yes (the task instruction) |
| `## Expected Behavior` | No | No (evaluator only) |
| `## Success Criteria` / `## Checks` | Yes | No |
| `## Config` | No | No |
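The section layout above can be sketched as a small parser. This is illustrative only, assuming plain line-by-line heading detection; the function name and alias handling are not Archal's actual implementation:

```python
def parse_scenario(text: str) -> dict[str, str]:
    """Split a scenario file into named sections (illustrative sketch)."""
    # Fold the documented aliases into canonical section names.
    aliases = {"Task": "Prompt", "Checks": "Success Criteria"}
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            # A `## ` heading starts a new section.
            name = line[3:].strip()
            current = aliases.get(name, name)
            sections[current] = ""
        elif line.startswith("# ") and "Title" not in sections:
            # The first `# ` heading is the title (metadata for humans).
            sections["Title"] = line[2:].strip()
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}
```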
## Setup

Describe the starting state of the twins in plain English. Archal uses this to generate seed state.

Be specific about quantities, names, and relationships. "20 open issues, 4 labelled keep-open" is better than "a repo with some issues."
## Prompt

The task the agent receives. This is the only section that becomes the agent's instruction; the title is metadata for humans.
## Expected behavior

What the agent should do. This is the answer key for the evaluator and is never shown to the agent. Omit it for quick smoke tests.
## Success criteria

Each criterion is a bullet prefixed with `[D]` or `[P]`:

- `[D]` Deterministic: checked against twin state. Counts, existence checks, state assertions. No LLM cost, instant.
- `[P]` Probabilistic: judged by an LLM from the trace and final state. Tone, reasoning quality, whether something makes sense.

If you leave off the tag, Archal auto-promotes a narrow set of phrasings to `[D]`: numeric comparators (exactly, at least, at most, fewer than, more than), explicit state verbs (is created/merged/closed/open/deleted/removed), exists / not exists, and a handful of similar patterns. Everything else, including "four issues closed", text-content matches, or anything that reads like a judgment call, stays `[P]` unless you tag it yourself.

`[D]` is an opt-in contract: you are promising the criterion can be decided purely from twin state. If you want a fuzzy count or a text-semantics check, leave it unmarked (LLM-judged) or write `[P]` explicitly.
```markdown
- [D] The PR was merged              # explicit deterministic
- [P] The PR description is clear    # explicit probabilistic
- The repo has exactly 2 labels      # inferred [D] from "exactly"
- Four issues were closed            # stays [P]: the parser does not auto-count
- The agent was helpful              # stays [P], too vague for a state check
```
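The promotion rules above can be approximated with a small classifier. This is a sketch of the documented patterns only, not Archal's actual matcher; the regexes are assumptions:

```python
import re

# Phrasings the docs describe as auto-promoted to [D] when untagged.
DETERMINISTIC_PATTERNS = [
    r"\bexactly \d+\b",
    r"\bat least \d+\b",
    r"\bat most \d+\b",
    r"\bfewer than \d+\b",
    r"\bmore than \d+\b",
    r"\bis (created|merged|closed|open|deleted|removed)\b",
    r"\b(exists|does not exist)\b",
    r"\b(zero|none)\b.*\bremain\b",
]

def classify(criterion: str) -> str:
    """Return 'D' or 'P' for one success-criterion bullet."""
    text = criterion.strip().lstrip("-").strip()
    if text.startswith("[D]"):
        return "D"
    if text.startswith("[P]"):
        return "P"
    lowered = text.lower()
    if any(re.search(p, lowered) for p in DETERMINISTIC_PATTERNS):
        return "D"
    # Spelled-out numbers, text matches, and judgment calls stay [P].
    return "P"
```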
Writing good `[D]` criteria:

| Pattern | Example |
|---|---|
| Exactly N ... | Exactly 4 issues are closed |
| At least N ... | At least 1 comment was posted |
| Fewer than N ... | Fewer than 30 tool calls were made |
| ... is created/closed/merged | The issue is closed |
| ... exists | A label named "stale" exists |
| Zero/None ... remain | Zero issues remain in Triage |
Writing good `[P]` criteria:

Write them as yes/no questions an evaluator could answer from the trace and final state:

- [P] Each closing comment explains the reason for closure
- [P] The agent did not take any destructive actions
- [P] The PR description accurately summarizes the changes

Avoid vague criteria like "the agent did a good job." "The agent completes the task in fewer than 50 tool calls" is something you can actually evaluate.

If a criterion reads like a human judgment call, make it `[P]` instead of forcing it into `[D]`. The authoring UI treats subjective deterministic criteria as invalid unless you explicitly opt in with `allow-unsafe-deterministic-criteria: true` in `## Config`.
Negative assertions check the agent didn't do something harmful:

```markdown
- [D] No issues with the "keep-open" label were closed
- [D] No messages were sent to channels other than #engineering
- [P] The agent did not fabricate information not present in the issue
```
## How evaluation works

`[D]` criteria are checked against the twin's final state. `[P]` criteria go to an LLM with the trace, state, and expected behavior as context. Run score = fraction of criteria that passed. Satisfaction = average run score across all runs.
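As a worked example of the arithmetic (illustrative only, using the two definitions above):

```python
def run_score(passed: int, total: int) -> float:
    """Fraction of criteria that passed in a single run."""
    return passed / total

def satisfaction(run_scores: list[float]) -> float:
    """Average run score across all runs of a scenario."""
    return sum(run_scores) / len(run_scores)

# e.g. a scenario with 4 criteria, executed 3 times:
scores = [run_score(4, 4), run_score(3, 4), run_score(2, 4)]
print(satisfaction(scores))  # 0.75
```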
## Config

| Key | Description | Default |
|---|---|---|
| `twins` | Comma-separated list of twins to start | inferred from content |
| `timeout` | Seconds before a run is killed | 180 |
| `runs` | Number of times to execute the scenario | 1 |
| `seed` | Override the twin seed (e.g. `enterprise-repo`) | auto-selected |
| `tags` | Comma-separated labels for filtering | none |
| `evaluator-model` | Override the LLM for `[P]` evaluation (also accepts `evaluator`, `model`) | account default |
| `allow-unsafe-deterministic-criteria` | Allow subjective `[D]` criteria in the draft UI | false |
`model` is only accepted inside a scenario's `## Config` block, as an alias for `evaluator-model`. At the project level (`.archal.json`), the canonical key for the evaluator model is `evaluatorModel`; `agentModel` is the model your agent runs under. A bare `"model"` key in `.archal.json` is silently ignored.
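A project-level `.archal.json` using the canonical keys might look like this (the model names are placeholders, not defaults):

```json
{
  "evaluatorModel": "your-evaluator-model",
  "agentModel": "your-agent-model"
}
```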
A populated `## Config` block:

```markdown
## Config
twins: github, slack
timeout: 90
runs: 3
tags: security, workflow
```
## Multi-service scenarios

Scenarios can use multiple twins. The agent gets MCP access to all of them at once.

```markdown
## Setup
A GitHub repository "acme/api" with an open issue #42 titled "Fix auth bug".
A Slack workspace with a #engineering channel.

## Config
twins: github, slack
```
## Tips

- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. Specific numbers and names produce better seeds.
- Prefer `[D]` criteria when you can. They're free, instant, and deterministic.
- Use `[P]` only for things that genuinely need judgment.
- Test with `--runs 1` first, then bump to 3-5 for a real satisfaction score.