Writing Test Cases - The Work Behind the Work
“Why is this taking so long?” That was a question I heard a lot during my internship, usually when I was deep in writing test cases, and honestly, sometimes I didn't have a good answer. It just takes time. You read the requirement, re-read it, think through the scenarios, write them out, and half the day is gone. Sometimes, with work piling up, we'd skip writing test cases altogether. Not because we didn't care, but because there wasn't enough time.
Under deadline pressure, most teams default to the happy path plus a handful of obvious negatives. Edge cases get missed. Coverage depends on who wrote the tests and how much time they had. When bugs slip through to production, the question is always: "Didn't we test this?"
That experience stuck with me. We had great tools for running tests, but nothing to help with the hardest part: figuring out what to test in the first place.
Why Test Design Techniques Matter
Test design techniques are systematic methods for deciding what to test. Instead of writing test cases based on gut feeling or whatever comes to mind first, these techniques give you a structured way to derive scenarios from the requirement itself. Each technique looks at the requirement from a different angle. One focuses on input boundaries, another on business rule combinations, another on how the system moves between states. The goal is to maximize the chance of finding bugs with a manageable number of test cases.
QA has well-established techniques for this. Boundary Value Analysis catches off-by-one errors by testing at the edges of input ranges. Decision Tables map out complex business rules with multiple conditions and their combinations. State Transition Testing models how the system moves between states and what happens on valid and invalid transitions. Equivalence Partitioning divides inputs into meaningful classes so you test one representative from each instead of every possible value.
These techniques aren't niche knowledge. They're part of standard QA training, and most experienced engineers use them regularly. The real issue is time. When you're juggling multiple sprints, sitting down to formally apply five different techniques to a single requirement feels like a luxury you can't afford.
What If AI Could Apply These Techniques For You?
The idea behind this project is simple: what if AI could apply multiple QA techniques to every requirement, so you don't have to do it manually?
Not a single generic prompt like "write test scenarios for the given requirement." Anyone who's tried that in ChatGPT knows the result: a flat list of obvious scenarios with no structured depth. The problem is the AI tries to do everything at once and does nothing particularly well.
Our approach was to separate concerns and give the AI one technique at a time. "Apply Boundary Value Analysis" makes it focus on range edges and limits. "Apply State Transition Testing" makes it map lifecycle flows. Each technique becomes a separate lens. The results are noticeably more thorough than a single catch-all prompt.
TestPilot AI ships with 12 QA technique playbooks and applies them individually, then merges results into one clean suite. The skill library is just a folder of markdown files, so you can add your own custom techniques by dropping a new file into skills/. The system picks it up automatically. No code changes needed.
Under the Hood: A Step-by-Step Breakdown
Here's every internal stage, from the moment you provide a requirement to the final test suite on screen.
Step 1: Drop In Your Requirement
This step handles getting your requirement into the system and preparing it for AI processing.
1.1 Input Parsing
Three ways to provide your requirement: type it, upload a file, or pull from Jira. Whatever the format, the server converts it into clean plain text.
| Input Method | What Happens Server-Side |
|---|---|
| Type / Paste | Text used directly, whitespace normalized |
| PDF upload | Binary parsed, layout artifacts removed |
| DOCX upload | Converted to raw text, formatting stripped |
| HTML upload | Scripts, styles, and markup stripped. Only body text is kept |
| Markdown upload | Used as-is |
| Jira import | REST API fetches story title, description, and acceptance criteria, then formats them into a structured block |
The Jira integration browses your projects, epics, and sprints server-side, so credentials never touch the browser. By the time input reaches the AI, it's clean text with no formatting artifacts or binary noise.
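The format-specific parsing relies on dedicated libraries server-side, but the final cleanup step is simple enough to sketch. Here is a minimal Python illustration of the kind of whitespace normalization applied to typed or pasted input (the function name and exact rules are illustrative, not the project's actual code):

```python
import re

def normalize_requirement(text: str) -> str:
    """Collapse whitespace runs and repeated blank lines from raw input."""
    # Collapse runs of spaces/tabs and trim each line
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    # Drop consecutive blank lines, keeping at most one
    out, prev_blank = [], False
    for ln in lines:
        blank = ln == ""
        if not (blank and prev_blank):
            out.append(ln)
        prev_blank = blank
    return "\n".join(out).strip()
```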
1.2 Building the Analysis Prompt
The system doesn't just forward your text to the AI. It builds a structured request with two things: your clean requirement text, and a catalog of all available QA techniques (each listed by ID, title, and tags). The AI is prompted to act as a "senior QA engineer and test architect."
Step 2: Let the AI Read Between the Lines (Understanding What to Test)
This step is where the AI reads your requirement, figures out what's testable, and recommends which techniques to apply.
2.1 Clarify Requirements (Optional)
Requirements are rarely perfect. Before running the full analysis, you can optionally click "Find missing info" to have the AI identify ambiguities and gaps. It returns assumptions it would have to make and questions about unclear details, the same kinds of things a senior QA engineer would ask before writing test cases.
For example, "users can upload a profile photo" might trigger: What file formats are accepted? Is there a max file size? What happens if the upload fails?
You answer as many or as few as you want. Those answers get appended to the requirement as clarifications during generation. If you skip this step, the AI proceeds with its own assumptions (listed in the final output for your review).
2.2 Extracting Testable Elements
The AI reads through the requirement and pulls out everything that can be tested, categorized by type:
| Element Type | What It Means | Example |
|---|---|---|
| Input | Data the user provides | Email field, password field, age field |
| Output | What the system produces | Confirmation message, error alert |
| State | A condition the system can be in | Logged in, pending verification, locked |
| Rule | Business logic that governs behavior | "Users under 13 cannot register" |
| Boundary | Limits or thresholds | Password must be 8-64 characters |
| Constraint | Restrictions on the system | Max 3 login attempts before lockout |
| Action | Something the user does | Click submit, upload avatar |
| Integration | External system involved | Email service, payment gateway |
These extracted elements become the foundation for technique selection. The AI uses them to decide which skills apply.
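In code, the extracted elements amount to a small typed structure. A Python sketch of what that might look like (the field names and shape are illustrative assumptions, not the project's actual schema):

```python
from typing import Literal, TypedDict

# The eight element types from the table above
ElementType = Literal[
    "input", "output", "state", "rule",
    "boundary", "constraint", "action", "integration",
]

class TestableElement(TypedDict):
    type: ElementType
    description: str

# Example: elements the AI might extract from a registration requirement
elements: list[TestableElement] = [
    {"type": "input", "description": "Email field"},
    {"type": "boundary", "description": "Password must be 8-64 characters"},
    {"type": "constraint", "description": "Max 3 login attempts before lockout"},
]
```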
2.3 Scoring Techniques by Confidence
For each QA technique in the catalog, the AI checks whether the extracted elements match what that technique targets. It assigns a confidence level, a rationale, and an estimated scenario count.
| Confidence | What It Means |
|---|---|
| High | Requirement clearly contains elements this technique targets |
| Medium | Elements are probably there but not spelled out explicitly |
| Low | Technique would be marginally useful for this requirement |
Example for a "user registration form" requirement:
| Technique | Confidence | Rationale | Est. Scenarios |
|---|---|---|---|
| Equivalence Partitioning | High | Multiple fields with clear valid/invalid domains | 8 |
| Boundary Value Analysis | High | Explicit constraints: password 8-64 chars, age 13-120 | 10 |
| Decision Tables | Medium | Conditional logic around verification + terms | 6 |
| State Transition | Medium | Flow has states (form → verify → active) | 5 |
| Error Guessing | High | Common failures: duplicate emails, injection, unicode | 7 |
| Non-Functional Baseline | Low | No explicit performance/security requirements stated | 3 |
| General Fallback | High | Always included as baseline | 6 |
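Before these recommendations reach the review screen, they need to be grouped by confidence level. A minimal Python sketch of that grouping step (field names are illustrative):

```python
from collections import defaultdict

def group_by_confidence(recommendations):
    """Bucket technique recommendations into high/medium/low for review."""
    groups = defaultdict(list)
    for rec in recommendations:
        groups[rec["confidence"]].append(rec)
    # Fixed display order: high first, low last
    return {level: groups.get(level, []) for level in ("high", "medium", "low")}

recs = [
    {"technique": "Boundary Value Analysis", "confidence": "high", "estimated": 10},
    {"technique": "Decision Tables", "confidence": "medium", "estimated": 6},
    {"technique": "Non-Functional Baseline", "confidence": "low", "estimated": 3},
    {"technique": "Error Guessing", "confidence": "high", "estimated": 7},
]
grouped = group_by_confidence(recs)
```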
2.4 Validating the AI's Response
LLM output can't be blindly trusted. The system runs several checks:
Schema validation: Every response is checked against a strict JSON schema using Ajv, which verifies field types, required fields, and allowed values (e.g., confidence must be exactly "high," "medium," or "low"). If the response isn't valid JSON, the system tries to extract a JSON object from the text.
Self-repair loop: If validation fails, the system sends the errors back to the AI and asks it to fix and resubmit at a lower temperature. Two attempts before giving up.
Native JSON mode: For OpenAI and Gemini, the schema is passed directly to the API, constraining output format at the generation level.
Hallucination filtering: Any technique referencing a skill ID that doesn't exist in the library gets silently dropped.
General Fallback injection: If the AI omits the baseline technique, the system adds it automatically.
This doesn't make things perfect. But it helps ensure what reaches you is structurally valid and comes from real techniques.
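The real system does this with Ajv in JavaScript, but the control flow is easy to sketch. Here is a hand-rolled Python illustration of the JSON-extraction fallback, the confidence check, and the hallucination filter (function and field names are assumptions for illustration):

```python
import json

ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_response(raw: str, known_skill_ids: set[str]):
    """Return (recommendations, errors), mirroring the validate/filter stages."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: try to extract the first {...} object from surrounding text
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None, ["response contains no JSON object"]
        data = json.loads(raw[start : end + 1])
    recs = data.get("recommendations", [])
    for rec in recs:
        if rec.get("confidence") not in ALLOWED_CONFIDENCE:
            errors.append(f"bad confidence: {rec.get('confidence')!r}")
    # Hallucination filter: silently drop techniques with unknown skill IDs
    recs = [r for r in recs if r.get("skillId") in known_skill_ids]
    return recs, errors
```

In the real pipeline, any collected errors would be sent back to the model for the self-repair attempt rather than surfaced to the user.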
2.5 You Review and Decide
The validated recommendations appear on screen, grouped by confidence (high, medium, low). You toggle techniques on or off. The AI suggests, and you have the final say on what gets generated.
Step 3: From Techniques to Test Cases
This step handles the actual test case generation, deduplication, and final output.
3.1 Creating Per-Skill Generation Tasks
For each technique you selected, the system creates a separate generation task. Each task bundles:
- A focused system prompt: "You are a senior QA engineer specializing in [this technique]"
- The full requirement text
- The technique's markdown playbook (the skill file from skills/)
- The extracted analysis context from Step 2.2
This is different from a single mega-prompt. Focused instructions produce more targeted results. "Apply Boundary Value Analysis" generates more precise boundary tests than asking the AI to handle six techniques at once.
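Assembling one of these focused tasks can be sketched in a few lines of Python (the field names and prompt wording here are illustrative, not the actual server code):

```python
def build_skill_task(skill: dict, requirement: str, analysis: dict) -> dict:
    """Bundle one focused generation request for a single technique."""
    system_prompt = (
        f"You are a senior QA engineer specializing in {skill['title']}. "
        "Apply only this technique to the requirement below."
    )
    user_prompt = "\n\n".join([
        "## Requirement\n" + requirement,
        "## Technique Playbook\n" + skill["body"],   # the skill's markdown file
        "## Extracted Elements\n" + str(analysis["elements"]),
    ])
    return {"skillId": skill["id"], "system": system_prompt, "user": user_prompt}
```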
3.2 Parallel Execution
Tasks go into a worker pool. Workers pull from a queue, and as one finishes, the next starts. Each call returns a mini test suite focused on that single technique.
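The pool pattern can be sketched with Python's standard library; `generate` stands in for the hypothetical model call, and a failed call yields an empty suite rather than an exception:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(tasks, generate, max_workers: int = 4):
    """Run one generation call per technique in a bounded worker pool."""
    def run_one(task):
        try:
            return generate(task)
        except Exception:
            # One bad response returns an empty suite instead of sinking the run
            return {"skillId": task["skillId"], "testCases": []}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves task order, so suites line up with their techniques
        return list(pool.map(run_one, tasks))
```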
3.3 Per-Suite Validation and Tagging
Each mini-suite goes through the same validate-and-repair cycle from Step 2.4. If both attempts fail for a skill, it returns an empty suite instead of crashing the whole run. One bad response doesn't lose the others.
After validation, the skill ID is appended to every test case's coverage tags. A case from Boundary Value Analysis gets "boundary-value-analysis" in its tags, making it easy to trace which technique produced which case.
3.4 Test Case Structure
Every generated test case follows a consistent format:
| Field | Description | Example |
|---|---|---|
| id | Sequential identifier | TC-001 |
| title | Starts with "Verify that..." | Verify that login fails with empty password |
| type | Category from fixed set | negative, boundary, functional, security, ... |
| priority | Severity level | P0 (Critical), P1 (High), P2 (Medium), P3 (Low) |
| preconditions | Setup conditions | User is on login page, account exists |
| steps | Actions to perform | 1. Leave password blank, 2. Click Login |
| expected | Expected outcome | Error: "Password is required" |
| coverageTags | Technique + domain tags | ["boundary-value-analysis", "authentication"] |
| requirementRefs | Requirement traceability | ["REQ-001"] |
Here's what an actual generated test case looks like:
```json
{
  "id": "TC-003",
  "title": "Verify that registration fails when password is shorter than 8 characters",
  "type": "boundary",
  "priority": "P0",
  "preconditions": [
    "User is on the registration page",
    "All other fields are filled with valid data"
  ],
  "steps": [
    "Enter a 7-character password in the password field",
    "Click the Register button"
  ],
  "expected": [
    "Registration is rejected",
    "Error message displayed: 'Password must be at least 8 characters'"
  ],
  "coverageTags": ["boundary-value-analysis", "registration", "password"],
  "requirementRefs": ["REQ-002"]
}
```
3.5 Combining All Suites
All mini-suites are pooled into one collection. Assumptions, risks, and missing-info questions from every skill are merged. Duplicate assumptions and risks are removed by exact lowercase match.
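The exact-match dedup is the simplest step in the pipeline. A minimal Python sketch (first occurrence wins, comparison is case-insensitive):

```python
def merge_unique(lists_of_items):
    """Merge assumption/risk lists from every skill,
    dropping duplicates by exact lowercase match."""
    seen, merged = set(), []
    for items in lists_of_items:
        for item in items:
            key = item.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)   # keep the original casing of the first copy
    return merged
```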
3.6 Weighted Jaccard Deduplication
When multiple techniques analyze the same requirement, overlap is inevitable. A BVA test and an Error Guessing test might both check what happens when a required field is empty. The system catches these by comparing every pair:
| Dimension | Weight | Calculation |
|---|---|---|
| Title | 40% | Tokenized, lowercased, Jaccard index (shared / total unique tokens) |
| Steps | 40% | Steps concatenated and tokenized, same Jaccard |
| Expected results | 20% | Expected concatenated and tokenized, same Jaccard |
Score = 0.4 × title + 0.4 × steps + 0.2 × expected. Pairs exceeding 60% are flagged as duplicates and the newer one is dropped. Title and steps get more weight because they define what the test does. Two tests can have different expected outcomes but test the same scenario.
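The scoring formula translates directly into code. A Python sketch of the weighted Jaccard comparison and the greedy "drop the newer duplicate" pass (the tokenization here is a simple lowercase split, an assumption about the actual implementation):

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity(tc1: dict, tc2: dict) -> float:
    """Weighted score: 40% title, 40% steps, 20% expected results."""
    return (
        0.4 * jaccard(tokens(tc1["title"]), tokens(tc2["title"]))
        + 0.4 * jaccard(tokens(" ".join(tc1["steps"])), tokens(" ".join(tc2["steps"])))
        + 0.2 * jaccard(tokens(" ".join(tc1["expected"])), tokens(" ".join(tc2["expected"])))
    )

def dedupe(cases: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep a case only if it isn't a near-duplicate of an earlier one."""
    kept: list[dict] = []
    for case in cases:
        if all(similarity(case, k) <= threshold for k in kept):
            kept.append(case)
    return kept
```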
3.7 Renumber and Finalize
Surviving cases get fresh sequential IDs (TC-001, TC-002, ...) and are capped at a configured limit. The final suite is ready to view in the UI, export as PDF/CSV, or push to AIO Tests.
3.8 What the Final Output Includes (and Why)
The output isn't just a list of test cases. The final suite contains four distinct sections, each serving a different purpose:
Test Cases are the core output. Each one is a structured, atomic scenario with steps, expected results, priority, type, and traceability tags. These are the actual test scenarios ready to execute or push to your test management tool.
Assumptions are things the AI had to assume because the requirement didn't explicitly state them. For example, if the requirement says "users can reset their password" but doesn't mention how, the AI might assume "password reset is done via email link." These are listed so you can review them and catch any wrong assumptions before they turn into misleading test cases. If the AI assumed something incorrectly, you know which test cases to adjust or remove.
Risks are potential problem areas the AI identified while analyzing the requirement. These aren't test cases themselves, but flags for things that could go wrong. For example: "No rate limiting mentioned for login attempts, which creates a brute force risk" or "Requirement doesn't specify behavior when payment gateway is unavailable." Risks help you and your team prioritize what to test more carefully and what to raise with the product team. They're especially useful for junior QA engineers who might not spot these concerns on their own.
Missing Info Questions are things the AI couldn't determine from the requirement that would affect test design. These are similar to what the optional clarify step (Step 2.1) surfaces, but these come from the generation phase itself. For example: "What is the maximum number of items allowed in the cart?" or "Should the form auto-save drafts?" If you answered clarification questions in Step 2.1, you'll see fewer of these. The ones that remain highlight gaps in the requirement that you might want to take back to stakeholders.
| Output Section | Why It Exists | What to Do With It |
|---|---|---|
| Test Cases | The actual test scenarios to execute | Review, refine, export, or push to AIO Tests |
| Assumptions | What the AI had to guess | Verify each one. Wrong assumptions mean wrong test cases |
| Risks | Potential problem areas spotted during analysis | Prioritize testing, raise with product team |
| Missing Info | Gaps that affect test coverage | Take back to stakeholders, or accept and move on |
Together, these four sections give you more than just test cases. They give you a picture of how confident you can be in the output and where the requirement itself might need work.
3.9 Optional: Visual Technique Diagrams
You can toggle diagram generation for specific techniques. The AI generates a Mermaid.js diagram alongside the test cases, rendered directly in the UI.
| Technique | Diagram Type | What It Shows |
|---|---|---|
| State Transition | State diagram | States, valid/invalid transitions |
| Decision Tables | Flowchart | Conditions branching to outcomes |
| Equivalence Partitioning | Flowchart | Input partitions: valid (green) vs invalid (red) |
| Boundary Value Analysis | Flowchart | Boundary points with pass/fail zones |
| Pairwise / Combinatorial | Flowchart | Parameter combinations as tree/matrix |
| Feature Decomposition | Mind map | Decomposed dimensions: actors, data, rules, states |
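For a sense of what these look like, here is the kind of Mermaid state diagram the tool might emit for the "max 3 login attempts before lockout" rule (the states and transitions are illustrative, not actual generated output):

```mermaid
stateDiagram-v2
    [*] --> LoggedOut
    LoggedOut --> LoggedIn: valid credentials
    LoggedOut --> LoggedOut: invalid credentials (attempts < 3)
    LoggedOut --> Locked: 3rd failed attempt
    Locked --> LoggedOut: lockout expires
    LoggedIn --> LoggedOut: logout
```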
The Skill Library: 12 Skills and Counting
The system ships with 12 playbooks, but this isn't a hard limit. Each skill is a markdown file in skills/ with frontmatter (ID, title, tags) and a body describing how to apply the technique. The server loads every .md file in that folder on startup. Want to add "API Contract Testing" or "WCAG 2.1 Checklist"? Drop in a markdown file and the system picks it up automatically.
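The loading logic is straightforward. A Python sketch of the startup scan, with a minimal `---`-delimited frontmatter parser (the real server is Node, and the exact frontmatter format is an assumption):

```python
from pathlib import Path

def load_skills(folder: str = "skills") -> dict[str, dict]:
    """Load every .md playbook: parse 'key: value' frontmatter,
    keep the rest of the file as the skill body."""
    skills = {}
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        meta, body = {}, text
        if text.startswith("---"):
            _, frontmatter, body = text.split("---", 2)
            for line in frontmatter.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        # Fall back to the filename as the ID if frontmatter omits one
        skills[meta.get("id", path.stem)] = {**meta, "body": body.strip()}
    return skills
```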
| # | Skill | What It Targets | Example Use Case |
|---|---|---|---|
| 1 | Equivalence Partitioning | Valid/invalid input classes | Email: valid format vs missing @, vs empty |
| 2 | Boundary Value Analysis | Edges of input ranges | Password: 7 (fail), 8 (pass), 64 (pass), 65 (fail) |
| 3 | Decision Tables | Multi-condition business rules | Member + coupon + $100 order = 20% off |
| 4 | State Transition | Lifecycle flows | Order: draft → submitted → approved → shipped |
| 5 | Pairwise / Combinatorial | Multi-parameter interactions | Browser × OS × language × payment method |
| 6 | Error Guessing & Heuristics | Common failure modes | SQL injection, special characters, empty arrays |
| 7 | Risk-Based Prioritization | High-impact scenarios first | Payment processing before cosmetic UI tests |
| 8 | Requirements Traceability | Full spec-to-test mapping | Every acceptance criterion traced to a test |
| 9 | Feature Decomposition | Atomic testable units | "Search" → query, filters, sort, pagination |
| 10 | Functional Core | Happy-path business logic | Login, add to cart, complete checkout |
| 11 | Non-Functional Baseline | Performance, security, usability | Page load < 3s, WCAG compliance, XSS checks |
| 12 | General Fallback | Catch-all baseline (always on) | Scenarios outside specific techniques |
The more skills you add, the more the analysis engine can recommend.
The Human in the Loop
The generated test suite is a starting point, not a finished product. This is where your judgment comes in.
TestPilot AI doesn't know your system's history, your team's risk tolerance, or which integration partner has a flaky API. Those calls are yours.
| What TestPilot AI Tries to Do | What You Still Need to Do |
|---|---|
| Apply available QA techniques | Decide which edge cases matter for your system |
| Generate structured test cases | Review and refine with domain knowledge |
| Remove overlapping scenarios | Add context-specific tests the AI can't know |
| Get you a first draft quickly | Make the final priority and risk calls |
Some cases will be spot-on. Others might be too generic or miss a nuance only you'd catch. The value is in not starting from a blank page. You review, edit, and then export as PDF, CSV, or push directly to AIO Tests.
The goal was never to replace QA engineers. It was to spend less time typing and more time thinking. That's what we built TestPilot AI to help with.