Writing Test Cases - The Work Behind the Work
“Why is this taking so long?” That was a question I heard a lot during my internship, usually when I was deep in writing test cases, and honestly, sometimes I didn't have a good answer. It just takes time. You read the requirement, re-read it, think through the scenarios, write them out, and half the day is gone. Sometimes, with work piling up, we'd skip writing test cases altogether. Not because we didn't care, but because there wasn't enough time.
Under deadline pressure, most teams default to the happy path plus a handful of obvious negatives. Edge cases get missed. Coverage depends on who wrote the tests and how much time they had. When bugs slip through to production, the question is always: "Didn't we test this?"
That experience stuck with me. We had great tools for running tests, but nothing to help with the hardest part: figuring out what to test in the first place.
Why Test Design Techniques Matter
Test design techniques are systematic methods for deciding what to test. Instead of writing test cases based on gut feeling or whatever comes to mind first, these techniques give you a structured way to derive scenarios from the requirement itself. Each technique looks at the requirement from a different angle. One focuses on input boundaries, another on business rule combinations, another on how the system moves between states. The goal is to maximize the chance of finding bugs with a manageable number of test cases.
QA has well-established techniques for this. Boundary Value Analysis catches off-by-one errors by testing at the edges of input ranges. Decision Tables map out complex business rules with multiple conditions and their combinations. State Transition Testing models how the system moves between states and what happens on valid and invalid transitions. Equivalence Partitioning divides inputs into meaningful classes so you test one representative from each instead of every possible value.
These techniques aren't niche knowledge. They're part of standard QA training, and most experienced engineers use them regularly. The real issue is time. When you're juggling multiple sprints, sitting down to formally apply five different techniques to a single requirement feels like a luxury you can't afford.
What If AI Could Apply These Techniques For You?
The idea behind this project is simple: what if AI could apply multiple QA techniques to every requirement, so you don't have to do it manually?
Not a single generic prompt like "write test scenarios for the given requirement." Anyone who's tried that in ChatGPT knows the result: a flat list of obvious scenarios with no structured depth. The problem is the AI tries to do everything at once and does nothing particularly well.
Our approach was to separate concerns and give the AI one technique at a time. "Apply Boundary Value Analysis" makes it focus on range edges and limits. "Apply State Transition Testing" makes it map lifecycle flows. Each technique becomes a separate lens. The results are noticeably more thorough than a single catch-all prompt.
TestPilot AI ships with 12 QA technique playbooks and applies them individually, then merges results into one clean suite. The skill library is just a folder of markdown files, so you can add your own custom techniques by dropping a new file into skills/. The system picks it up automatically. No code changes needed.
Under the Hood: A Step-by-Step Breakdown
Here's every internal stage, from the moment you provide a requirement to the final test suite on screen.
Step 1: Drop In Your Requirement
This step handles getting your requirement into the system and preparing it for AI processing.
1.1 Input Parsing
Three ways to provide your requirement: type it, upload a file, or pull from Jira. Whatever the format, the server converts it into clean plain text.
| Input Method | What Happens Server-Side |
|---|---|
| Type / Paste | Text used directly, whitespace normalized |
| PDF upload | Binary parsed, layout artifacts removed |
| DOCX upload | Converted to raw text, formatting stripped |
| HTML upload | Scripts, styles, and markup stripped. Only body text is kept |
| Markdown upload | Used as-is |
| Jira import | REST API fetches story title, description, and acceptance criteria, then formats them into a structured block |
The Jira integration browses your projects, epics, and sprints server-side, so credentials never touch the browser. By the time input reaches the AI, it's clean text with no formatting artifacts or binary noise.
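The format-specific parsing relies on dedicated libraries server-side, but the final cleanup step is simple enough to sketch. Here is a minimal Python illustration of the kind of whitespace normalization applied to typed or pasted input (the function name and exact rules are illustrative, not the project's actual code):

```python
import re

def normalize_requirement(text: str) -> str:
    """Collapse whitespace runs and repeated blank lines from raw input."""
    # Collapse runs of spaces/tabs and trim each line
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    # Drop consecutive blank lines, keeping at most one
    out, prev_blank = [], False
    for ln in lines:
        blank = ln == ""
        if not (blank and prev_blank):
            out.append(ln)
        prev_blank = blank
    return "\n".join(out).strip()
```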
1.2 Building the Analysis Prompt
The system doesn't just forward your text to the AI. It builds a structured request with two things: your clean requirement text, and a catalog of all available QA techniques (each listed by ID, title, and tags). The AI is prompted to act as a "senior QA engineer and test architect."
Step 2: Let the AI Read Between the Lines (Understanding What to Test)
This step is where the AI reads your requirement, figures out what's testable, and recommends which techniques to apply.
2.1 Clarify Requirements (Optional)
Requirements are rarely perfect. Before running the full analysis, you can optionally click "Find missing info" to have the AI identify ambiguities and gaps. It returns assumptions it would have to make and questions about unclear details, the same kinds of things a senior QA engineer would ask before writing test cases.
For example, "users can upload a profile photo" might trigger: What file formats are accepted? Is there a max file size? What happens if the upload fails?
You answer as many or as few as you want. Those answers get appended to the requirement as clarifications during generation. If you skip this step, the AI proceeds with its own assumptions (listed in the final output for your review).
2.2 Extracting Testable Elements
The AI reads through the requirement and pulls out everything that can be tested, categorized by type:
| Element Type | What It Means | Example |
|---|---|---|
| Input | Data the user provides | Email field, password field, age field |
| Output | What the system produces | Confirmation message, error alert |
| State | A condition the system can be in | Logged in, pending verification, locked |
| Rule | Business logic that governs behavior | "Users under 13 cannot register" |
| Boundary | Limits or thresholds | Password must be 8-64 characters |
| Constraint | Restrictions on the system | Max 3 login attempts before lockout |
| Action | Something the user does | Click submit, upload avatar |
| Integration | External system involved | Email service, payment gateway |
These extracted elements become the foundation for technique selection. The AI uses them to decide which skills apply.
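In code, the extracted elements amount to a small typed structure. A Python sketch of what that might look like (the field names and shape are illustrative assumptions, not the project's actual schema):

```python
from typing import Literal, TypedDict

# The eight element types from the table above
ElementType = Literal[
    "input", "output", "state", "rule",
    "boundary", "constraint", "action", "integration",
]

class TestableElement(TypedDict):
    type: ElementType
    description: str

# Example: elements the AI might extract from a registration requirement
elements: list[TestableElement] = [
    {"type": "input", "description": "Email field"},
    {"type": "boundary", "description": "Password must be 8-64 characters"},
    {"type": "constraint", "description": "Max 3 login attempts before lockout"},
]
```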
2.3 Scoring Techniques by Confidence
For each QA technique in the catalog, the AI checks whether the extracted elements match what that technique targets. It assigns a confidence level, a rationale, and an estimated scenario count.
| Confidence | What It Means |
|---|---|
| High | Requirement clearly contains elements this technique targets |
| Medium | Elements are probably there but not spelled out explicitly |
| Low | Technique would be marginally useful for this requirement |
Example for a "user registration form" requirement:
| Technique | Confidence | Rationale | Est. Scenarios |
|---|---|---|---|
| Equivalence Partitioning | High | Multiple fields with clear valid/invalid domains | 8 |
| Boundary Value Analysis | High | Explicit constraints: password 8-64 chars, age 13-120 | 10 |
| Decision Tables | Medium | Conditional logic around verification + terms | 6 |
| State Transition | Medium | Flow has states (form → verify → active) | 5 |
| Error Guessing | High | Common failures: duplicate emails, injection, unicode | 7 |
| Non-Functional Baseline | Low | No explicit performance/security requirements stated | 3 |
| General Fallback | High | Always included as baseline | 6 |
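Before these recommendations reach the review screen, they need to be grouped by confidence level. A minimal Python sketch of that grouping step (field names are illustrative):

```python
from collections import defaultdict

def group_by_confidence(recommendations):
    """Bucket technique recommendations into high/medium/low for review."""
    groups = defaultdict(list)
    for rec in recommendations:
        groups[rec["confidence"]].append(rec)
    # Fixed display order: high first, low last
    return {level: groups.get(level, []) for level in ("high", "medium", "low")}

recs = [
    {"technique": "Boundary Value Analysis", "confidence": "high", "estimated": 10},
    {"technique": "Decision Tables", "confidence": "medium", "estimated": 6},
    {"technique": "Non-Functional Baseline", "confidence": "low", "estimated": 3},
    {"technique": "Error Guessing", "confidence": "high", "estimated": 7},
]
grouped = group_by_confidence(recs)
```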
2.4 Validating the AI's Response
LLM output can't be blindly trusted. The system runs several checks:
Schema validation: Every response is checked against a strict JSON schema using Ajv, which verifies field types, required fields, and allowed values (e.g., confidence must be exactly "high," "medium," or "low"). If the response isn't valid JSON, the system tries to extract a JSON object from the text.
Self-repair loop: If validation fails, the system sends the errors back to the AI and asks it to fix and resubmit at a lower temperature. Two attempts before giving up.
Native JSON mode: For OpenAI and Gemini, the schema is passed directly to the API, constraining output format at the generation level.
Hallucination filtering: Any technique referencing a skill ID that doesn't exist in the library gets silently dropped.
General Fallback injection: If the AI omits the baseline technique, the system adds it automatically.
This doesn't make things perfect. But it helps ensure what reaches you is structurally valid and comes from real techniques.
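The real system does this with Ajv in JavaScript, but the control flow is easy to sketch. Here is a hand-rolled Python illustration of the JSON-extraction fallback, the confidence check, and the hallucination filter (function and field names are assumptions for illustration):

```python
import json

ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_response(raw: str, known_skill_ids: set[str]):
    """Return (recommendations, errors), mirroring the validate/filter stages."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: try to extract the first {...} object from surrounding text
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None, ["response contains no JSON object"]
        data = json.loads(raw[start : end + 1])
    recs = data.get("recommendations", [])
    for rec in recs:
        if rec.get("confidence") not in ALLOWED_CONFIDENCE:
            errors.append(f"bad confidence: {rec.get('confidence')!r}")
    # Hallucination filter: silently drop techniques with unknown skill IDs
    recs = [r for r in recs if r.get("skillId") in known_skill_ids]
    return recs, errors
```

In the real pipeline, any collected errors would be sent back to the model for the self-repair attempt rather than surfaced to the user.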
2.5 You Review and Decide
The validated recommendations appear on screen, grouped by confidence (high, medium, low). You toggle techniques on or off. The AI suggests, and you have the final say on what gets generated.
Step 3: From Techniques to Test Cases
This step handles the actual test case generation, deduplication, and final output.
3.1 Creating Per-Skill Generation Tasks
For each technique you selected, the system creates a separate generation task. Each task bundles:
- A focused system prompt: "You are a senior QA engineer specializing in [this technique]"
- The full requirement text
- The technique's markdown playbook (the skill file from skills/)
- The extracted analysis context from Step 2.2
This is different from a single mega-prompt. Focused instructions produce more targeted results. "Apply Boundary Value Analysis" generates more precise boundary tests than asking the AI to handle six techniques at once.
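Assembling one of these focused tasks can be sketched in a few lines of Python (the field names and prompt wording here are illustrative, not the actual server code):

```python
def build_skill_task(skill: dict, requirement: str, analysis: dict) -> dict:
    """Bundle one focused generation request for a single technique."""
    system_prompt = (
        f"You are a senior QA engineer specializing in {skill['title']}. "
        "Apply only this technique to the requirement below."
    )
    user_prompt = "\n\n".join([
        "## Requirement\n" + requirement,
        "## Technique Playbook\n" + skill["body"],   # the skill's markdown file
        "## Extracted Elements\n" + str(analysis["elements"]),
    ])
    return {"skillId": skill["id"], "system": system_prompt, "user": user_prompt}
```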
3.2 Parallel Execution
Tasks go into a worker pool. Workers pull from a queue, and as one finishes, the next starts. Each call returns a mini test suite focused on that single technique.
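The pool pattern can be sketched with Python's standard library; `generate` stands in for the hypothetical model call, and a failed call yields an empty suite rather than an exception:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(tasks, generate, max_workers: int = 4):
    """Run one generation call per technique in a bounded worker pool."""
    def run_one(task):
        try:
            return generate(task)
        except Exception:
            # One bad response returns an empty suite instead of sinking the run
            return {"skillId": task["skillId"], "testCases": []}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves task order, so suites line up with their techniques
        return list(pool.map(run_one, tasks))
```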
3.3 Per-Suite Validation and Tagging
Each mini-suite goes through the same validate-and-repair cycle from Step 2.4. If both attempts fail for a skill, it returns an empty suite instead of crashing the whole run. One bad response doesn't lose the others.
After validation, the skill ID is appended to every test case's coverage tags. A case from Boundary Value Analysis gets "boundary-value-analysis" in its tags, making it easy to trace which technique produced which case.
3.4 Test Case Structure
Every generated test case follows a consistent format:
| Field | Description | Example |
|---|---|---|
| id | Sequential identifier | TC-001 |
| title | Starts with "Verify that..." | Verify that login fails with empty password |
| type | Category from fixed set | negative, boundary, functional, security, ... |
| priority | Severity level | P0 (Critical), P1 (High), P2 (Medium), P3 (Low) |
| preconditions | Setup conditions | User is on login page, account exists |
| steps | Actions to perform | 1. Leave password blank, 2. Click Login |
| expected | Expected outcome | Error: "Password is required" |
| coverageTags | Technique + domain tags | ["boundary-value-analysis", "authentication"] |
| requirementRefs | Requirement traceability | ["REQ-001"] |
Here's what an actual generated test case looks like:
```json
{
  "id": "TC-003",
  "title": "Verify that registration fails when password is shorter than 8 characters",
  "type": "boundary",
  "priority": "P0",
  "preconditions": [
    "User is on the registration page",
    "All other fields are filled with valid data"
  ],
  "steps": [
    "Enter a 7-character password in the password field",
    "Click the Register button"
  ],
  "expected": [
    "Registration is rejected",
    "Error message displayed: 'Password must be at least 8 characters'"
  ],
  "coverageTags": ["boundary-value-analysis", "registration", "password"],
  "requirementRefs": ["REQ-002"]
}
```
3.5 Combining All Suites
All mini-suites are pooled into one collection. Assumptions, risks, and missing-info questions from every skill are merged. Duplicate assumptions and risks are removed by exact lowercase match.
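The exact-match dedup is the simplest step in the pipeline. A minimal Python sketch (first occurrence wins, comparison is case-insensitive):

```python
def merge_unique(lists_of_items):
    """Merge assumption/risk lists from every skill,
    dropping duplicates by exact lowercase match."""
    seen, merged = set(), []
    for items in lists_of_items:
        for item in items:
            key = item.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)   # keep the original casing of the first copy
    return merged
```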
3.6 Weighted Jaccard Deduplication
When multiple techniques analyze the same requirement, overlap is inevitable. A BVA test and an Error Guessing test might both check what happens when a required field is empty. The system catches these by comparing every pair:
| Dimension | Weight | Calculation |
|---|---|---|
| Title | 40% | Tokenized, lowercased, Jaccard index (shared / total unique tokens) |
| Steps | 40% | Steps concatenated and tokenized, same Jaccard |
| Expected results | 20% | Expected concatenated and tokenized, same Jaccard |
Score = 0.4 × title + 0.4 × steps + 0.2 × expected. Pairs exceeding 60% are flagged as duplicates and the newer one is dropped. Title and steps get more weight because they define what the test does. Two tests can have different expected outcomes but test the same scenario.
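The scoring formula translates directly into code. A Python sketch of the weighted Jaccard comparison and the greedy "drop the newer duplicate" pass (the tokenization here is a simple lowercase split, an assumption about the actual implementation):

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity(tc1: dict, tc2: dict) -> float:
    """Weighted score: 40% title, 40% steps, 20% expected results."""
    return (
        0.4 * jaccard(tokens(tc1["title"]), tokens(tc2["title"]))
        + 0.4 * jaccard(tokens(" ".join(tc1["steps"])), tokens(" ".join(tc2["steps"])))
        + 0.2 * jaccard(tokens(" ".join(tc1["expected"])), tokens(" ".join(tc2["expected"])))
    )

def dedupe(cases: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep a case only if it isn't a near-duplicate of an earlier one."""
    kept: list[dict] = []
    for case in cases:
        if all(similarity(case, k) <= threshold for k in kept):
            kept.append(case)
    return kept
```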
3.7 Renumber and Finalize
Surviving cases get fresh sequential IDs (TC-001, TC-002, ...) and are capped at a configured limit. The final suite is ready to view in the UI, export as PDF/CSV, or push to AIO Tests.
3.8 What the Final Output Includes (and Why)
The output isn't just a list of test cases. The final suite contains four distinct sections, each serving a different purpose:
Test Cases are the core output. Each one is a structured, atomic scenario with steps, expected results, priority, type, and traceability tags. These are the actual test scenarios ready to execute or push to your test management tool.
Assumptions are things the AI had to assume because the requirement didn't explicitly state them. For example, if the requirement says "users can reset their password" but doesn't mention how, the AI might assume "password reset is done via email link." These are listed so you can review them and catch any wrong assumptions before they turn into misleading test cases. If the AI assumed something incorrectly, you know which test cases to adjust or remove.
Risks are potential problem areas the AI identified while analyzing the requirement. These aren't test cases themselves, but flags for things that could go wrong. For example: "No rate limiting mentioned for login attempts, which creates a brute force risk" or "Requirement doesn't specify behavior when payment gateway is unavailable." Risks help you and your team prioritize what to test more carefully and what to raise with the product team. They're especially useful for junior QA engineers who might not spot these concerns on their own.
Missing Info Questions are things the AI couldn't determine from the requirement that would affect test design. These are similar to what the optional clarify step (Step 2.1) surfaces, but these come from the generation phase itself. For example: "What is the maximum number of items allowed in the cart?" or "Should the form auto-save drafts?" If you answered clarification questions in Step 2.1, you'll see fewer of these. The ones that remain highlight gaps in the requirement that you might want to take back to stakeholders.
| Output Section | Why It Exists | What to Do With It |
|---|---|---|
| Test Cases | The actual test scenarios to execute | Review, refine, export, or push to AIO Tests |
| Assumptions | What the AI had to guess | Verify each one. Wrong assumptions mean wrong test cases |
| Risks | Potential problem areas spotted during analysis | Prioritize testing, raise with product team |
| Missing Info | Gaps that affect test coverage | Take back to stakeholders, or accept and move on |
Together, these four sections give you more than just test cases. They give you a picture of how confident you can be in the output and where the requirement itself might need work.
3.9 Optional: Visual Technique Diagrams
You can toggle diagram generation for specific techniques. The AI generates a Mermaid.js diagram alongside the test cases, rendered directly in the UI.
| Technique | Diagram Type | What It Shows |
|---|---|---|
| State Transition | State diagram | States, valid/invalid transitions |
| Decision Tables | Flowchart | Conditions branching to outcomes |
| Equivalence Partitioning | Flowchart | Input partitions: valid (green) vs invalid (red) |
| Boundary Value Analysis | Flowchart | Boundary points with pass/fail zones |
| Pairwise / Combinatorial | Flowchart | Parameter combinations as tree/matrix |
| Feature Decomposition | Mind map | Decomposed dimensions: actors, data, rules, states |
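For a sense of what these look like, here is the kind of Mermaid state diagram the tool might emit for the "max 3 login attempts before lockout" rule (the states and transitions are illustrative, not actual generated output):

```mermaid
stateDiagram-v2
    [*] --> LoggedOut
    LoggedOut --> LoggedIn: valid credentials
    LoggedOut --> LoggedOut: invalid credentials (attempts < 3)
    LoggedOut --> Locked: 3rd failed attempt
    Locked --> LoggedOut: lockout expires
    LoggedIn --> LoggedOut: logout
```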
The Skill Library: 12 Skills and Counting
The system ships with 12 playbooks, but this isn't a hard limit. Each skill is a markdown file in skills/ with frontmatter (ID, title, tags) and a body describing how to apply the technique. The server loads every .md file in that folder on startup. Want to add "API Contract Testing" or "WCAG 2.1 Checklist"? Drop in a markdown file and the system picks it up automatically.
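The loading logic is straightforward. A Python sketch of the startup scan, with a minimal `---`-delimited frontmatter parser (the real server is Node, and the exact frontmatter format is an assumption):

```python
from pathlib import Path

def load_skills(folder: str = "skills") -> dict[str, dict]:
    """Load every .md playbook: parse 'key: value' frontmatter,
    keep the rest of the file as the skill body."""
    skills = {}
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        meta, body = {}, text
        if text.startswith("---"):
            _, frontmatter, body = text.split("---", 2)
            for line in frontmatter.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        # Fall back to the filename as the ID if frontmatter omits one
        skills[meta.get("id", path.stem)] = {**meta, "body": body.strip()}
    return skills
```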
| # | Skill | What It Targets | Example Use Case |
|---|---|---|---|
| 1 | Equivalence Partitioning | Valid/invalid input classes | Email: valid format vs missing @, vs empty |
| 2 | Boundary Value Analysis | Edges of input ranges | Password: 7 (fail), 8 (pass), 64 (pass), 65 (fail) |
| 3 | Decision Tables | Multi-condition business rules | Member + coupon + $100 order = 20% off |
| 4 | State Transition | Lifecycle flows | Order: draft → submitted → approved → shipped |
| 5 | Pairwise / Combinatorial | Multi-parameter interactions | Browser × OS × language × payment method |
| 6 | Error Guessing & Heuristics | Common failure modes | SQL injection, special characters, empty arrays |
| 7 | Risk-Based Prioritization | High-impact scenarios first | Payment processing before cosmetic UI tests |
| 8 | Requirements Traceability | Full spec-to-test mapping | Every acceptance criterion traced to a test |
| 9 | Feature Decomposition | Atomic testable units | "Search" → query, filters, sort, pagination |
| 10 | Functional Core | Happy-path business logic | Login, add to cart, complete checkout |
| 11 | Non-Functional Baseline | Performance, security, usability | Page load < 3s, WCAG compliance, XSS checks |
| 12 | General Fallback | Catch-all baseline (always on) | Scenarios outside specific techniques |
The more skills you add, the more the analysis engine can recommend.
The Human in the Loop
The generated test suite is a starting point, not a finished product. This is where your judgment comes in.
TestPilot AI doesn't know your system's history, your team's risk tolerance, or which integration partner has a flaky API. Those calls are yours.
| What TestPilot AI Tries to Do | What You Still Need to Do |
|---|---|
| Apply available QA techniques | Decide which edge cases matter for your system |
| Generate structured test cases | Review and refine with domain knowledge |
| Remove overlapping scenarios | Add context-specific tests the AI can't know |
| Get you a first draft quickly | Make the final priority and risk calls |
Some cases will be spot-on. Others might be too generic or miss a nuance only you'd catch. The value is in not starting from a blank page. You review, edit, and then export as PDF, CSV, or push directly to AIO Tests.
The goal was never to replace QA engineers. It was to spend less time typing and more time thinking. That's what we built TestPilot AI to help with.