Be Digital · Case study

A Copilot-only coverage workflow

A team on GitHub Copilot Business — running Claude models, split across VS Code and IntelliJ — needed to lift a Java service past a 91% SonarQube coverage gate without touching production code, and without a runaway token bill. This is how I built a slim, Copilot-only workflow that drives the work autonomously, enforces "tests only" at the pipeline, and keeps the context window small on purpose.

Goal: Line coverage ≥ 91% (SonarQube gate)
Hard constraint: Add tests only — never edit src/main or pom.xml
Tooling: GitHub Copilot (Agent mode) · Claude model · Maven · JUnit 5
IDEs: VS Code & IntelliJ IDEA
Profile: Copilot-only — no Claude Code, no CLAUDE.md
Enforcement: CI diff gate + SonarQube quality gate

The problem

Coverage gates are easy to state and easy to game. The honest version of the task has three tensions pulling against each other. It has to be autonomous enough to be worth it — nobody wants to hand-write the fortieth boilerplate test. It has to be safe — a coverage push must never quietly "fix" a failing assertion by editing the code under test. And it has to be affordable — an agent that re-reads a giant coverage report on every step turns a cheap task into an expensive one.

The team had also standardized on one AI tool: GitHub Copilot Business, with Claude selected in the model picker. So the full framework I'd normally reach for — which pairs Claude Code (with its runtime write-blocking hooks) and Copilot — was overkill. Half of it would never run. The job was to strip it down to exactly what Copilot uses, while keeping the one feature that makes the loop worth running: autonomy.

The approach: a slim, Copilot-only profile

Instead of the full dual-tool config (a CLAUDE.md, a .claude/ directory with deny-rules and a pre-tool hook, plus the Copilot files), the slim profile ships only what Copilot actually reads — and leans on CI for the enforcement the hooks used to provide.

Seven files, no Claude Code. The agent reads the policy + prompts and writes tests; the CI diff gate is what actually keeps production code untouched.

Kept (Copilot reads it)

copilot-instructions.md — auto-loaded policy
Two prompt files — /cover and /increase-coverage
rank_coverage.py — token-saving ranker
.vscode/ settings + tasks; .sdkmanrc
CI templates: guardrail · sonar · base

Removed — Claude Code only

CLAUDE.md — never read by Copilot; the policy lives in copilot-instructions.md instead
.claude/settings.json deny-rules — a Claude Code mechanism; doesn't load under Copilot
.claude/hooks/guard-paths.sh — the write block; hooks only fire inside Claude Code
.claude/commands/ — Claude Code slash commands; Copilot uses prompt files
IntelliJ Claude-panel run config

One exception, and it's repo-specific. When this profile was built, .claude/skills/ belonged here too. As of April 2026, Copilot reads project-level skills committed to the repo — per GitHub's docs, .github/skills/, .claude/skills/, or .agents/skills/ — so a repo-checked-in .claude/skills/ is no longer invisible to Copilot. The distinction is repo-level vs. personal: project skills committed to the repository are cross-tool, while home-directory skills (~/.copilot/skills, ~/.claude/skills) stay tool-specific and aren't shared across agents. This workflow uses no skills either way — the autonomous loop runs from a prompt file, not a model-invoked skill — so nothing in the profile changed; only the rationale did.

Giving Copilot the autonomous loop

The feature worth preserving is the autonomous /increase-coverage loop. People assume that's a Claude Code exclusive — it isn't. VS Code Copilot Agent mode can take multiple steps on its own: edit files, run terminal commands, read the results, and try again. A prompt file with agent: agent frontmatter is the native equivalent of a slash command, so the same loop runs entirely inside Copilot.

--- .github/prompts/increase-coverage.prompt.md ---
agent: agent
description: Autonomously raise coverage toward 91% (tests only)
# body: seed report → rank worst 5 → for each class write tests,
# run mvn -B test, self-correct (≤3) → stop at 91% → summarize

The same four steps a person would take — made repeatable. The retry cap and the "tests only" rule keep the agent from wandering into production code or empty assertions.
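The control flow that prompt describes can be sketched in plain Python. This is not the prompt itself, and Copilot does not run Python to drive the loop; the callables (write_tests, tests_pass, coverage) are hypothetical stand-ins for agent actions, and "at most three attempts per class" is one reading of the ≤3 cap.

```python
from typing import Callable, Iterable

TARGET = 91.0      # coverage gate from the case study
MAX_ATTEMPTS = 3   # assumed reading of the "self-correct (<=3)" cap
WORST_N = 5        # classes tackled per run


def run_loop(
    ranked_classes: Iterable[str],
    write_tests: Callable[[str], None],   # hypothetical: agent writes/repairs tests
    tests_pass: Callable[[str], bool],    # hypothetical: scoped test run is green
    coverage: Callable[[], float],        # hypothetical: current line coverage in %
) -> list[str]:
    """Walk the worst-covered classes and return one summary line per class."""
    summary = []
    for cls in list(ranked_classes)[:WORST_N]:
        if coverage() >= TARGET:
            break  # gate reached: stop instead of writing more tests
        for attempt in range(1, MAX_ATTEMPTS + 1):
            write_tests(cls)
            if tests_pass(cls):
                summary.append(f"{cls}: green after {attempt} attempt(s)")
                break
        else:
            # Retry cap hit: move on rather than loop forever.
            summary.append(f"{cls}: gave up after {MAX_ATTEMPTS} attempts")
    return summary
```

The two exits are the point: the coverage check stops the run once the gate is met, and the retry cap stops the agent from grinding on a class it cannot get green.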

Safety without the hook

This is the honest trade-off of going Copilot-only. In the full framework, a .claude/ PreToolUse hook intercepts and denies writes to src/main and pom.xml. When it matches Bash as well as the edit tools, it also catches the back door of mutating files via sed -i or cat >, which a path-scoped deny-rule on Edit/Write alone would miss. Worth being precise: it's an agent-level guard (a script Claude Code runs in its own tool loop), not a kernel file permission (that's the separate /sandbox feature). And it only fires inside Claude Code; Copilot never sees it. So on this profile, "tests only" is held by two softer layers and one hard one: the policy in copilot-instructions.md and the rule restated in the prompt files are natural-language guidance, and the CI diff gate is the hard check.

A violation can slip past the natural-language layers, but it can't reach main — the pipeline diff gate is non-negotiable and tool-agnostic.

The lesson here: match enforcement to the tool that's actually running. A guardrail that only fires in a tool you've removed is theater. Moving the real check to CI made the policy independent of which AI, which IDE, and even whether an AI wrote the code at all.
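A diff gate of this kind can be very small. The sketch below is an illustration, not the guardrail template from this profile: it assumes the pipeline compares the branch against origin/main, and it assumes nested Maven modules (for example billing/src/main/ or billing/pom.xml) should be blocked as well. The function names are hypothetical.

```python
import subprocess


def forbidden_changes(changed_paths):
    """Return the changed paths that break the tests-only rule."""
    hits = []
    for raw in changed_paths:
        path = raw.strip().replace("\\", "/")
        if not path:
            continue
        in_src_main = "/src/main/" in "/" + path        # any module's src/main
        is_pom = path.rsplit("/", 1)[-1] == "pom.xml"   # any module's pom.xml
        if in_src_main or is_pom:
            hits.append(path)
    return hits


def changed_paths(base_ref="origin/main"):
    """List files changed since the merge base with base_ref (assumed branch)."""
    result = subprocess.run(
        # --no-renames reports a move as delete + add, so a file moved out
        # of src/main still shows up under its original path.
        ["git", "diff", "--name-only", "--no-renames", f"{base_ref}...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def main():
    """Return a process exit code: 1 on violation, 0 otherwise."""
    violations = forbidden_changes(changed_paths())
    if violations:
        print("Tests-only policy violated:")
        for path in violations:
            print(f"  {path}")
        return 1
    return 0
```

A CI step would end with sys.exit(main()) so a violation fails the job. Because the check reads only the git diff, it behaves the same whether the change came from Copilot, another agent, or a person.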

Keeping it cheap: token engineering

An autonomous loop is a token amplifier. Each step re-sends the whole conversation as input, and every tool result — every Maven log, every file read — gets appended and re-sent on the next step. Left alone, the cost curve bends the wrong way. Two moves kept it flat.

First, compress the inputs. The agent needs to know which classes are least covered; that lived in a jacoco.xml report ~6,000 tokens long (roughly 1,500+ lines). Pasting it every turn was the single biggest cost. A 100-line script parses it once and emits a compact ranked table — same decision quality, a fraction of the tokens.
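To make the idea concrete, here is a minimal sketch of such a ranker. It is not the rank_coverage.py shipped in the profile: the function names and the table layout are my own for illustration. It relies on the standard JaCoCo XML layout, where each class element carries its own LINE counter as a direct child.

```python
import xml.etree.ElementTree as ET


def rank_classes(root: ET.Element, top: int = 5):
    """Return the `top` classes with the lowest line coverage.

    Each row is (class_name, covered_lines, total_lines, percent).
    """
    rows = []
    for cls in root.iter("class"):
        # findall() sees direct children only, so these are class totals;
        # the per-method counters one level deeper are ignored.
        for counter in cls.findall("counter"):
            if counter.get("type") != "LINE":
                continue
            missed = int(counter.get("missed", "0"))
            covered = int(counter.get("covered", "0"))
            total = missed + covered
            if total:  # skip classes with no executable lines
                name = cls.get("name", "?").replace("/", ".")
                rows.append((name, covered, total, 100.0 * covered / total))
    # Lowest coverage first; ties broken by the larger number of missed lines.
    rows.sort(key=lambda r: (r[3], -(r[2] - r[1])))
    return rows[:top]


def format_table(rows) -> str:
    """Render the ranking as a compact pipe-separated table."""
    lines = ["class | covered/total | line %"]
    for name, covered, total, pct in rows:
        lines.append(f"{name} | {covered}/{total} | {pct:.1f}")
    return "\n".join(lines)
```

With the JaCoCo Maven plugin's default report location, usage would look like format_table(rank_classes(ET.parse("target/site/jacoco/jacoco.xml").getroot())). The agent then sees a handful of short rows instead of the full report.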

The cheapest token is the one you compute outside the model. The same trick applies to logs, datasets, and search results.

Second, keep tool output off the meter. Maven runs in quiet mode (-q), tests are scoped to the class in play rather than the whole suite, and each class gets a fresh context (a new conversation), so five rounds of build logs don't pile into one ballooning history.
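As a rough illustration of the quiet, scoped run, the helper below builds the command from standard Maven and Surefire options (-q, -B, -Dtest). The helper names are hypothetical, and the tail() trimmer is an extra assumption of mine rather than something the profile is stated to include.

```python
def scoped_test_command(test_class: str) -> list[str]:
    """Build a quiet, batch-mode Maven run limited to one test class."""
    return ["mvn", "-q", "-B", "test", f"-Dtest={test_class}"]


def tail(output: str, max_lines: int = 40) -> str:
    """Keep only the last `max_lines` lines of a build log (assumed extra)."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    return "\n".join(output.splitlines()[-max_lines:])
```

For example, scoped_test_command("OrderServiceTest") yields the argument list for mvn -q -B test -Dtest=OrderServiceTest, where OrderServiceTest is a made-up class name. Scoping the run keeps both the runtime and the log that lands in the conversation small.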

Illustrative, not a billed figure: if a full report is ~6,000 tokens and the ranked table is ~150, and the agent consults it across ~15 steps, that's roughly 90,000 vs 2,250 input tokens for the same decisions — before counting the history those big payloads would also inflate on every subsequent turn. Your numbers will differ; the direction won't.
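The same arithmetic, written out so the numbers can be swapped for your own. The second function is a deliberately simplified model of history growth that I am assuming for illustration: each step appends the payload to the conversation and the whole history is re-sent on every later step. Real sessions add other messages, caching, and truncation, so treat it as a direction, not a bill.

```python
def flat_cost(payload_tokens: int, steps: int) -> int:
    """Input tokens if the payload is simply sent once per step."""
    return payload_tokens * steps


def cumulative_cost(payload_tokens: int, steps: int) -> int:
    """Input tokens if every step appends the payload to the history
    and the full history is re-sent on each following step.

    Step k re-sends k copies, so the total is payload * (1 + 2 + ... + steps).
    """
    return payload_tokens * steps * (steps + 1) // 2


FULL_REPORT = 6_000   # tokens, figure from the text above
RANKED_TABLE = 150    # tokens, figure from the text above
STEPS = 15

print(flat_cost(FULL_REPORT, STEPS), flat_cost(RANKED_TABLE, STEPS))              # 90000 2250
print(cumulative_cost(FULL_REPORT, STEPS), cumulative_cost(RANKED_TABLE, STEPS))  # 720000 18000
```

The ratio between the two payloads stays at 40 to 1 in both models; what changes is the absolute gap, which widens with every step the payload lingers in history.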

Two IDEs, one OS caveat

The team was split across editors, so the profile had to work in both. In VS Code, the prompt files surface as real slash commands (/cover, /increase-coverage), and .vscode/settings.json turns on agent mode and auto-approves the safe commands the loop repeats, so it doesn't stop to confirm every mvn run.

In IntelliJ, the JetBrains plugin does now support .prompt.md files (added in Copilot 1.5.54), but with two caveats that made manual paste the pragmatic choice for this team: (a) prompt-file support is gated behind editor_preview_features, which Copilot Business orgs frequently leave disabled; (b) the plugin only reliably discovers prompt files created through its own settings UI — repo-committed .github/prompts/*.prompt.md files often aren't recognized. So on a standardized Business plan, pasting the prompt body into Copilot's agent chat remains the dependable path. copilot-instructions.md still auto-loads in both.

One toolchain lever covers both: a .sdkmanrc pins Java and Maven, so sdk env switches versions in one place, and the CI image is built from JAVA_VERSION/MAVEN_VERSION variables to match. The shell pieces (setup script, the guardrail check, SDKMAN) assume bash — on Windows that means Git Bash or WSL, which this team already used.

What I'd carry to the next repo

Ship only what the running tool reads. A config file an AI never loads isn't neutral — it's maintenance debt and a false sense of safety. The slim profile is seven files, and every one earns its place.
Put enforcement where it can't be bypassed. Natural-language rules are guidance; the CI diff gate is the contract. Make the hard check tool- and IDE-agnostic.
Autonomy and frugality aren't opposites. The same loop that accelerates the work will inflate the bill unless you compress inputs and keep tool output out of history.
"Good" coverage asserts behavior. The prompt forbids tests that execute lines without asserting anything — the number is only worth raising if the tests would catch a regression.
