Tests and Verification
A test is a small named world the runtime sets up, makes claims about, and tears down. Argon has a dedicated test item kind, plus @[theorem]-tagged rules that the compiler preserves as theorem markers for a future mechanical-verification pipeline.
This chapter covers both.
A first test
test "a freshly-signed residential lease is active" {
let alice: Tenant = {
id: "alice-001",
name: "Alice",
contact_email: "alice@example.com",
}
let bob: Landlord = {
id: "bob-001",
name: "Bob",
payout_account: "acct-bob",
}
let house: Property = {
id: "p-1",
address: "1200 University Ave",
}
let lease: ResidentialLease = {
id: "lease-1",
tenant: alice,
landlord: bob,
property: house,
start_date: #2026-01-01#,
end_date: #2027-01-01#,
monthly_rent: 9500,
bedrooms: 3,
}
assert ResidentialLease(lease)
assert ActiveLease(lease)
}
Run it:
$ ox test
Compiling lease-tutorial v0.1.0
Running 1 test
test scene "a freshly-signed residential lease is active" ... ok
test result: ok. 1 passed; 0 failed; 0 unproven; 0 assumed; 0 ignored
Three things to notice.
test "string" not test name
Tests are named with string literals — they are descriptive narratives, not identifiers. This is intentional: a test description is what shows up in ox test output, in CI logs, and in failure reports. Make it readable. “a freshly-signed residential lease is active” beats signed_lease_active.
let for setup
The body’s let name : Type = { fields } form constructs an instance directly. Field-block syntax lets you supply each field by name; the compiler validates against the declared type. The instance is bound to the local name; you can refer to it in assert statements.
This is a different let from the one inside compute Form-2 bodies: that let binds an intermediate value during evaluation; this let constructs a concept instance for the test’s local world.
assert <atom> for claims
Inside a test, assert introduces a single rule-atom claim — a type test, a derive-head call, a comparison, an aggregate inequality, or any other shape the rule-body grammar admits. The runtime evaluates the atom against the test’s world; if it does not hold, the test fails.
assert here is post-reasoning: the runtime saturates the model first (running all asserts and derives), then checks each test claim against the saturated knowledge base. So assert ActiveLease(lease) works because the ActiveLease derive rule has fired during reasoning.
You can also write assert not <atom>:
test "a lease with start >= end is rejected by the validity refinement" {
let bad: Lease = {
id: "lease-bad",
...
start_date: #2027-01-01#,
end_date: #2026-01-01#,
...
}
assert not bad: ValidLease
}
not negates the atom — the test passes when the atom fails. Useful for exclusion claims like “this bad input does not pass our refinement.”
Status note: instantiates-form assertions in tests. The runtime’s assertion evaluator currently supports concept-as-predicate atoms (
assert ResidentialLease(lease),assert ActiveLease(lease)) but does not yet evaluate theinstantiates-formassert <ind>: <Concept>(and its negation) — running such a test reportsatom shape not yet supported: Instantiates. The shape is parsed and elaborated and passesox check/ox build; it just isn’t dispatched byox testyet. Until that lands, preferassert <Concept>(<ind>)for membership claims; refinement-membership tests will roll in alongside the runtime extension.
Sequenced statements: mutate and cleanup
Tests aren’t only declarative. A test can also exercise a pub mutation against the test ABox, observe its effects, and run ordered teardown. The test body admits four statement kinds in source order: let, assert, mutate, and cleanup.
test "rent payment passes timeliness check" {
let p: RentPayment = {
paid_on: 2025-03-03,
amount: 9500,
period_label: "2025-03",
is_timely: true,
}
mutate record_rent_payment(p)
assert RentPayment(p)
assert tenant_balance(p.tenant) == 0
cleanup {
mutate retract_test_payment(p)
assert not RentPayment(p)
}
}
mutate <name>(<args>) invokes a pub mutation. The runner:
- Resolves the mutation by name through the same import surface as any other cross-package symbol.
- Evaluates each
require { }precondition against the current post-saturation ABox. On the first false, the runner records a structuredMutationPreconditionFailureand skips the mutation’sdo/retract/emitclauses; subsequent statements continue against the unchanged ABox so you see all assertion drift in one run. - Otherwise applies
retract,do(field updates and locally-boundletindividuals), andemit(events become individuals visible to subsequent saturation). - Re-saturates. The next
assertevaluates against the post-mutation state.
cleanup { stmts } is a single, trailing teardown block. Cleanup admits the same four statement kinds and runs regardless of whether main-body assertions failed. Failures inside cleanup are tagged with a cleanup: true flag on the runner output so you can distinguish “the operation failed” from “the teardown failed.”
Well-formedness rules:
- A test admits at most one
cleanupblock (OE0240 MultipleCleanupBlocksif more). cleanupmust be the last statement (OE0241 CleanupNotAtEndif anything follows it).- Cleanup blocks don’t nest (
OE0242 NestedCleanup).
What v1 catches
Before this surface existed, the only way to “test” a mutation was the scene-shape convention: let-bind the mutation’s emitted events as if it had run, then assert the post-state shape. That convention proved input shapes were constructible but never evaluated require, never exercised do, never verified that emit clauses fired.
mutate <name>(<args>) closes those gaps for the clauses v1 fully evaluates:
- Existing-precondition violations. A test that hands
record_rent_paymentanis_timely: falsepayment fails withMutationPreconditionFailurerather than silently passing. emitclauses (constructor case).emit Event { ... }evaluates the expression and mints a fresh event individual with the literal field values. Anassert Event(...)after themutatestatement is meaningful coverage of the emit clause.- Multi-mutation flows.
mutate A(...); mutate B(...); assert <post-state>proves the chained effect with re-saturation between steps so B’srequiresees A’s derived facts. - Cleanup ordering. A
cleanup { }block runs regardless of mid-test assertion failures, so a teardown mutation’s post-state assertion is exercised even when an earlier assertion drifted.
What v1 does NOT catch
The runner is honest about which clauses it does not yet evaluate against the test ABox — those clauses surface as MutationUnsupported test failures so the modeler sees the gap rather than a silent pass:
do { x.field = expr }field-updates. v1 emits aMutationUnsupportedfailure per field-update. A test likemutate update_balance(acct, 100); assert acct.balance == 100will see the runner flag the gap explicitly. v2 scope.retract { }clauses. v1 emits aMutationUnsupportedfailure per retract.mutate teardown(x); assert not Concept(x)reports the unsupported flag rather than silently passing or failing the assertion. v2 scope.do { let }value-expression bindings. The let’s type assertion lands and the binding is visible to subsequent statements as a typed individual, but the field assignments inside the let’s value expression do not yet bind. v2 scope.emit ppath-reference (no-op). When the elaborator lowersemit p(wherepis ado { let }-bound name) the resultingCoreEmitcarries empty args. v1 treats this as a no-op sincepis already in the ABox via the let. Constructor-form emits are the meaningful coverage path; path-reference assertions test the let-binding shape, not the emit clause itself.- Dropped preconditions. A
record_rent_paymentwhoserequire { p.is_timely }line was deleted from source still appears to pass — there’s no atom left to evaluate, so noMutationPreconditionFailureever fires. Detecting “this mutation should have a precondition but doesn’t” requires a separatemutate_fails <name>(<args>)form that proves a precondition fails. v1 catches violations of preconditions the modeler kept; v2 will catch dropped-precondition regressions.
Multi-mutate failure propagation
When mutation A’s require fails, A’s do / retract / emit are skipped and the ABox stays unchanged. The runner does not halt the statement loop — it continues to the next statement. A subsequent mutate B(...) runs with B’s require evaluating against the pre-A state; if B’s preconditions hold against that state, B applies normally.
This means a chained-mutation test where B depends on A’s effects will see B fail or behave unexpectedly when A is skipped — but the source ordering of failures (A first, then B’s drift) makes the dependency obvious. The alternative — halting after A’s failure — would hide downstream drift.
Ordering
Tests with only let and assert statements observe today’s behavior unchanged: lets materialize, the ABox saturates, assertions evaluate. Tests that introduce mutate get per-statement saturation — re-saturation runs between mutate statements so each mutation’s require sees the previous mutation’s derived facts. A test author who introduces a mutate and observes that an assertion that previously passed now fails should investigate whether the mutation legitimately changes the asserted state — the change is signal, not noise.
Failed assertions don’t halt the loop. The runner records every assertion’s outcome and continues, so you see all failures in one run.
Compute equality
A specifically-supported test-only form: assert that a compute call equals an expected value.
test "deposit return arithmetic" {
let r: DepositReturn = {
deposit_held_amount: 9500,
allowed_deduction_total: 1400,
}
assert deposit_after_deductions(r) == 8100
}
The compute is called against the test’s world; the result is compared with the expected value. This is how you validate that arithmetic-heavy computes do what you mean.
Multiple tests in a module
A test block is an item, like any other. Several can live in the same file:
use lease::*
use party::*
use rules::*
use types::*
test "active lease is queryable" { ... }
test "expired lease is queryable" { ... }
test "validity refinement catches bad dates" { ... }
test "validity refinement catches non-positive rent" { ... }
Convention: put tests in their own module — typically src/tests.ar for a small package. The book’s running tutorial follows this convention.
Frames and fixtures
When several tests share setup — the same Tenant, the same Property, the same baseline rules — pull that setup into a frame. A frame is a named, composable bundle of TBox + rules + ABox that any test can opt into.
frame canonical_party {
let alice: Tenant = {
id: "alice-001",
name: "Alice",
contact_email: "alice@example.com",
}
let bob: Landlord = {
id: "bob-001",
name: "Bob",
payout_account: "acct-bob",
}
let house: Property = {
id: "p-1",
address: "1200 University Ave",
}
}
A frame’s body admits any top-level item form — concept declarations, rules, asserts, lets. pub frame exports across packages; bare frame is package-local.
Tests compose frames with using:
test "a freshly-signed lease over the canonical party is active" using canonical_party {
let lease: ResidentialLease = {
id: "lease-1",
tenant: alice,
landlord: bob,
property: house,
start_date: #2026-01-01#,
end_date: #2027-01-01#,
monthly_rent: 9500,
bedrooms: 3,
}
assert ResidentialLease(lease)
assert ActiveLease(lease)
}
using pulls every binding, rule, and assertion the frame defines into the test’s world. Multiple frames compose: using canonical_party, baseline_rules brings both. The same name declared in two composed frames is a conflict (OE0215); a using reference that does not resolve to a known frame is OE0214; an inline fixture that redeclares a frame member is OE0216; circular frame includes are OE0218.
A worked multi-frame example:
frame canonical_party {
let alice: Tenant = { id: "alice-001", name: "Alice", contact_email: "alice@example.com" }
let house: Property = { id: "p-1", address: "1200 University Ave" }
}
frame baseline_rates {
let standard_rent: Money = 9500
let standard_deposit: Money = 19000
}
test "standard-rate lease matches baseline" using canonical_party, baseline_rates {
let lease: ResidentialLease = {
id: "lease-1",
tenant: alice,
property: house,
monthly_rent: standard_rent,
start_date: #2026-01-01#,
end_date: #2027-01-01#,
bedrooms: 3,
}
assert lease.monthly_rent == standard_rent
}
Variants: _catalog/test-frame-declaration/{minimal,multi-frame,frame-with-fixtures}/ and _catalog/test-using-clause/{multi-using,composition-with-expect}/.
Status note:
OE0216not yet emitted.OE0216is registered in the diagnostic catalog and committed by the rule above, but the emit-site at the inline-fixture-vs-frame-composition junction is not yet wired in the elaborator. Tests that exercise the collision (an inlinefixtureredeclaring a name ausing-pulled frame already contributes) currently compile clean. Sibling diagnosticsOE0214/OE0215/OE0218are wired. Tracked at sharpe-dev/orca-mvp#411.
Inside a test, fixture { … } declares a test-local mini-module — items declared in the fixture are visible in the surrounding test but never escape to the parent module:
test "deposit return arithmetic with a custom property" {
fixture {
let custom: Property = {
id: "custom-1",
address: "42 Galaxy Way",
}
}
let r: DepositReturn = {
deposit_held_amount: 9500,
allowed_deduction_total: 1400,
}
assert deposit_after_deductions(r) == 8100
}
Diagnostics that arise inside a fixture block are captured — they are visible to the test’s expect block (below) but never propagate to the parent module’s diagnostic stream. This lets a test deliberately exercise a malformed fixture and assert the right error fires, without the malformation breaking unrelated tests in the same package.
Diagnostic expectations
Tests can verify the compiler’s diagnostics, not just post-saturation fact-claims. The mechanism is the expect { … } block paired with //~ <label> source markers:
test "a sortal without a Kind ancestor is flagged" {
fixture {
pub kind Person { name: String }
//~ orphan_role
pub role Orphan { }
}
expect {
diagnostic OW0207 at orphan_role
}
}
//~ <label> anchors a labelled marker to the next source line — the convention follows Rust’s compiletest. Inside expect { … }, the bare form diagnostic <code> requires that exactly one diagnostic with that code fires somewhere in the test’s scope; at <marker> constrains the match to the labelled span; severity:<level> (one of error, warning, info) further constrains. Severity defaults to error when omitted.
expect {
diagnostic OE0211 at line_a
diagnostic OE0212 severity:warning
diagnostic OE0605
}
The test passes when every claim in the expect block matches. An unmatched claim, an unexpected diagnostic above the configured severity floor, or a diagnostic anchored at the wrong span is a test failure; the runner surfaces the diff between expected and actual.
expect { … } and assert <atom> compose. A single test can verify both that a fixture produces the right diagnostic and that the rest of the model still satisfies its post-saturation claims.
test "bad lease shape flags + good lease still classifies" {
fixture {
//~ bad_lease
let bad: Lease = { id: "x", start_date: #2027-01-01#, end_date: #2026-01-01# }
}
let good: Lease = {
id: "y",
start_date: #2026-01-01#,
end_date: #2027-01-01#,
monthly_rent: 9500,
}
expect {
diagnostic OE0606 at bad_lease // refinement violation on bad
}
assert ValidLease(good) // good still satisfies the refinement
}
Variants: _catalog/test-expect-block/{minimal-expect,multi-arm-expect,with-fixture,expect-with-query}/.
meta() reflection and :: classification
Two related forms surface the metatype of a concept inside a rule body, query, or assertion. meta(X) returns X’s metatype as a value:
pub derive is_kind(x) :- meta(x) == kind
Person :: Kind is sugar for the same query — read “Person is a Kind”. The two lower to the same internal atom; pick whichever reads more naturally in context.
pub derive equiv(x) :- x :: Kind
meta() and :: classify types, not instances. A bound variable or named concept fits. A free unbound variable does not — meta(?z) with no binding fires OE0212 MetaArgumentNotGround at elaboration. Negation composes: not meta(x) == role is well-formed.
Tests use these forms to assert metatype-shape directly:
test "Tenant is a role" {
assert Tenant :: Role
}
@[theorem] — mechanical verification
Some claims are not test-shaped. They are statements about the rule system itself: “this property holds for every possible lease, not just the three I happen to construct in a test.” For those, mark a derive rule with @[theorem]:
@[theorem]
pub derive transitive_subtype(x: ⊤, y: ⊤, z: ⊤) :-
x specializes y,
y specializes z
=> x specializes z
@[theorem] marks the rule as a theorem claim. At the current cut, the marker is preserved through compilation and surfaces in the kernel’s runtime so a future verification pipeline can target it; the compile-time mechanical-verification pass (the saturator + meta-property calculus attempting to prove that the rule’s body always implies its head) is not yet active. Today the value is signalling intent — the body of @[theorem]-tagged rules is the inventory the verification pass will close over when it lands.
@[theorem] is restricted to derive items. Placing it on a compute, mutation, query, or anything else yields OW0821 (“@[theorem] applied to a non-derive item”); declaring it twice on the same derive yields OW0822.
Two related attributes come from the same family:
#[unproven]— applied to atestblock. Marks a theorem-shaped claim that the test asserts as true without mechanical verification. The test runs as usual; the runner reports it under its own status category so the build’s body of unproven claims is visible at a glance.#[assumed]— applied to atestblock. The body’s claim is injected as a postulate within the test’s scope. Used when a real-world axiom (e.g., “every SSN is unique”) is needed for downstream reasoning but cannot be derived from the model. Reported under its own status category for the same reason.
#[unproven]
test "subtyping is transitive across the entire lattice" {
/* claim — not yet mechanically verified */
}
#[assumed]
test "every SSN is unique" {
assert forall ?p1, ?p2: Person where ?p1.ssn == ?p2.ssn => ?p1 == ?p2
}
@[…] AttrList and @<ident> / #<ident> Decorator forms are accepted as alternates to #[…] for both attributes. Pick the form that matches the package’s prevailing style.
What ox test runs
ox test runs every test block in the project. @[theorem] markers are surfaced at ox check / ox build time (today as preserved markers; the verification pass is forthcoming), not under ox test — so a (future) verified theorem will be a property of the build, not of an individual test invocation.
The runner reports five status categories: Pass (every claim held), Fail (a claim did not hold), Unproven (a claim marked #[unproven] ran without proof), Assumed (a claim marked #[assumed] was injected as a postulate), Ignored (the test was excluded from the run). The categories are surfaced as separate counts in the test result: line so a build’s body of unproven and assumed claims stays visible.
$ ox test
Compiling lease-tutorial v0.1.0
Running 5 tests
test result: ok. 3 passed; 0 failed; 1 unproven; 1 assumed; 0 ignored
Pass -p <package> to scope to a single package within a workspace. Pass --filter <pattern> to run only tests whose names match. Pass -v / --verbose to show every assertion, not just failures.
Edge cases worth knowing
let : Type = { fields }is the canonical creation form inside tests. It runs the same field-validation the type-checker runs on parameter passing; bad field types fail atox check, not atox test.- Tests run with full reasoning. Unlike the default
ox check, tests trigger the reasoner so that derives and asserts have fired before tests’assertclaims are evaluated. - Assertions in tests are not the same as
assertitems at module scope. A module-scopeassertdeclares an invariant over the package; a test’sassertis a single rule-atom check inside the test’s world.
Putting it in the running example
Add src/tests.ar. Bundle the shared party-and-property setup into a frame so each test’s body stays focused on what it actually exercises:
use lease::*
use party::*
use rules::*
use types::*
frame canonical_party {
let alice: Tenant = {
id: "alice-001",
name: "Alice",
contact_email: "alice@example.com",
}
let bob: Landlord = {
id: "bob-001",
name: "Bob",
payout_account: "acct-bob",
}
let house: Property = {
id: "p-1",
address: "1200 University Ave",
}
}
test "a freshly-signed residential lease is active" using canonical_party {
let lease: ResidentialLease = {
id: "lease-1",
tenant: alice,
landlord: bob,
property: house,
start_date: #2026-01-01#,
end_date: #2027-01-01#,
monthly_rent: 9500,
bedrooms: 3,
}
assert ResidentialLease(lease)
assert ActiveLease(lease)
assert lease: ValidLease
}
test "a lease with start >= end is rejected by the validity refinement" using canonical_party {
let bad: Lease = {
id: "lease-bad",
tenant: alice,
landlord: bob,
property: house,
start_date: #2027-01-01#,
end_date: #2026-01-01#,
monthly_rent: 9500,
}
assert not bad: ValidLease
}
Then:
$ ox test
Compiling lease-tutorial v0.1.0
Running 2 tests
test "a freshly-signed residential lease is active" ... ok
test "a lease with start >= end is rejected by the validity refinement" ... ok
test result: ok. 2 passed; 0 failed; 0 unproven; 0 assumed; 0 ignored
The model is now self-validating. Bad inputs do not survive the test pass.
Summary
Tests are first-class — test "string-name" { let setup; assert claims }. Frames bundle setup so tests share it via using. Fixtures scope test-local items and capture their diagnostics. expect { diagnostic … } blocks paired with //~ <label> markers verify the compiler’s output, not just post-saturation facts. meta() and :: surface metatype-shape directly. @[theorem], #[unproven], and #[assumed] provide a graded set of verification claims, from informally-tested through mechanically-verified through axiomatic, and the runner reports them as five distinct status categories. The lease tutorial now has tests covering the happy path and the failure modes that the refinement type catches. Next we render the model.