Writing Test Cases - The Work Behind the Work

“Why is this taking so long?” That was a question I heard a lot during my internship, usually when I was deep in writing test cases, and honestly, sometimes I didn't have a good answer. It just takes time. You read the requirement, re-read it, think through the scenarios, write them out, and half the day is gone. Sometimes, with work piling up, we'd skip writing test cases altogether. Not because we didn't care, but because there wasn't enough time.

Under deadline pressure, most teams default to the happy path plus a handful of obvious negatives. Edge cases get missed. Coverage depends on who wrote the tests and how much time they had. When bugs slip through to production, the question is always: "Didn't we test this?"

That experience stuck with me. We had great tools for running tests, but nothing to help with the hardest part: figuring out what to test in the first place.

Why Test Design Techniques Matter

Test design techniques are systematic methods for deciding what to test. Instead of writing test cases based on gut feeling or whatever comes to mind first, these techniques give you a structured way to derive scenarios from the requirement itself. Each technique looks at the requirement from a different angle. One focuses on input boundaries, another on business rule combinations, another on how the system moves between states. The goal is to maximize the chance of finding bugs with a manageable number of test cases.

QA has well-established techniques for this. Boundary Value Analysis catches off-by-one errors by testing at the edges of input ranges. Decision Tables map out complex business rules with multiple conditions and their combinations. State Transition Testing models how the system moves between states and what happens on valid and invalid transitions. Equivalence Partitioning divides inputs into meaningful classes so you test one representative from each instead of every possible value.
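To make one of these concrete, here's a toy sketch (Python, not taken from any QA tool) of the two-value form of Boundary Value Analysis, which tests just below, at, and just above each edge of a numeric range:

```python
def boundary_values(lo: int, hi: int) -> list[int]:
    """Two-value BVA: test just outside and exactly at each edge of a range."""
    return [lo - 1, lo, hi, hi + 1]

# A password length constraint of 8-64 characters yields four test points:
print(boundary_values(8, 64))  # [7, 8, 64, 65]
```

The off-by-one errors this targets live exactly at those four points: 7 should fail, 8 and 64 should pass, 65 should fail.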

These techniques aren't niche knowledge. They're part of standard QA training, and most experienced engineers use them regularly. The real issue is time. When you're juggling multiple sprints, sitting down to formally apply five different techniques to a single requirement feels like a luxury you can't afford.

What If AI Could Apply These Techniques For You?

The idea behind this project is simple: what if AI could apply multiple QA techniques to every requirement, so you don't have to do it manually?

Not a single generic prompt like "write test scenarios for the given requirement." Anyone who's tried that in ChatGPT knows the result: a flat list of obvious scenarios with no structured depth. The problem is the AI tries to do everything at once and does nothing particularly well.

Our approach was to separate concerns and give the AI one technique at a time. "Apply Boundary Value Analysis" makes it focus on range edges and limits. "Apply State Transition Testing" makes it map lifecycle flows. Each technique becomes a separate lens. The results are noticeably more thorough than a single catch-all prompt.

TestPilot AI ships with 12 QA technique playbooks and applies them individually, then merges results into one clean suite. The skill library is just a folder of markdown files, so you can add your own custom techniques by dropping a new file into skills/. The system picks it up automatically. No code changes needed.

Under the Hood: A Step-by-Step Breakdown

Here's every internal stage, from the moment you provide a requirement to the final test suite on screen.


Step 1: Drop In Your Requirement

This step handles getting your requirement into the system and preparing it for AI processing.

1.1 Input Parsing

Three ways to provide your requirement: type it, upload a file, or pull from Jira. Whatever the format, the server converts it into clean plain text.

| Input Method | What Happens Server-Side |
| --- | --- |
| Type / Paste | Text used directly, whitespace normalized |
| PDF upload | Binary parsed, layout artifacts removed |
| DOCX upload | Converted to raw text, formatting stripped |
| HTML upload | Scripts, styles, and markup stripped; only body text is kept |
| Markdown upload | Used as-is |
| Jira import | REST API fetches story title, description, and acceptance criteria, then formats them into a structured block |

The Jira integration browses your projects, epics, and sprints server-side, so credentials never touch the browser. By the time input reaches the AI, it's clean text with no formatting artifacts or binary noise.
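As an illustration of the HTML case, here's a minimal sketch using Python's standard-library parser. The actual server-side conversion is a separate implementation and presumably handles more edge cases; this just shows the "strip scripts and styles, keep body text" idea:

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```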

1.2 Building the Analysis Prompt

The system doesn't just forward your text to the AI. It builds a structured request with two things: your clean requirement text, and a catalog of all available QA techniques (each listed by ID, title, and tags). The AI is prompted to act as a "senior QA engineer and test architect."
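A sketch of what such a structured request might look like. The function name and message layout here are hypothetical, not TestPilot's actual code:

```python
def build_analysis_prompt(requirement: str, catalog: list[dict]) -> list[dict]:
    """Bundle the clean requirement text with the technique catalog
    into chat messages for the analysis call."""
    catalog_lines = "\n".join(
        f"- {t['id']}: {t['title']} (tags: {', '.join(t['tags'])})"
        for t in catalog
    )
    system = "You are a senior QA engineer and test architect."
    user = (
        "Requirement:\n" + requirement + "\n\n"
        "Available QA techniques:\n" + catalog_lines + "\n\n"
        "Recommend which techniques apply, with confidence and rationale."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```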


Step 2: Let the AI Read Between the Lines (Understanding What to Test)

This step is where the AI reads your requirement, figures out what's testable, and recommends which techniques to apply.

2.1 Clarify Requirements (Optional)

Requirements are rarely perfect. Before running the full analysis, you can optionally click "Find missing info" to have the AI identify ambiguities and gaps. It returns assumptions it would have to make and questions about unclear details, the same kinds of things a senior QA engineer would ask before writing test cases.

For example, "users can upload a profile photo" might trigger: What file formats are accepted? Is there a max file size? What happens if the upload fails?

You answer as many or as few as you want. Those answers get appended to the requirement as clarifications during generation. If you skip this step, the AI proceeds with its own assumptions (listed in the final output for your review).

2.2 Extracting Testable Elements

The AI reads through the requirement and pulls out everything that can be tested, categorized by type:

| Element Type | What It Means | Example |
| --- | --- | --- |
| Input | Data the user provides | Email field, password field, age field |
| Output | What the system produces | Confirmation message, error alert |
| State | A condition the system can be in | Logged in, pending verification, locked |
| Rule | Business logic that governs behavior | "Users under 13 cannot register" |
| Boundary | Limits or thresholds | Password must be 8-64 characters |
| Constraint | Restrictions on the system | Max 3 login attempts before lockout |
| Action | Something the user does | Click submit, upload avatar |
| Integration | External system involved | Email service, payment gateway |

These extracted elements become the foundation for technique selection. The AI uses them to decide which skills apply.

2.3 Scoring Techniques by Confidence

For each QA technique in the catalog, the AI checks whether the extracted elements match what that technique targets. It assigns a confidence level, a rationale, and an estimated scenario count.

| Confidence | What It Means |
| --- | --- |
| High | Requirement clearly contains elements this technique targets |
| Medium | Elements are probably there but not spelled out explicitly |
| Low | Technique would be marginally useful for this requirement |

Example for a "user registration form" requirement:

| Technique | Confidence | Rationale | Est. Scenarios |
| --- | --- | --- | --- |
| Equivalence Partitioning | High | Multiple fields with clear valid/invalid domains | 8 |
| Boundary Value Analysis | High | Explicit constraints: password 8-64 chars, age 13-120 | 10 |
| Decision Tables | Medium | Conditional logic around verification + terms | 6 |
| State Transition | Medium | Flow has states (form → verify → active) | 5 |
| Error Guessing | High | Common failures: duplicate emails, injection, unicode | 7 |
| Non-Functional Baseline | Low | No explicit performance/security requirements stated | 3 |
| General Fallback | High | Always included as baseline | 6 |

2.4 Validating the AI's Response

LLM output can't be blindly trusted. The system runs several checks:

Schema validation: Every response is checked against a strict JSON schema using Ajv, which verifies field types, required fields, and allowed values (e.g., confidence must be exactly "high," "medium," or "low"). If the response isn't valid JSON, the system tries to extract a JSON object from the text.

Self-repair loop: If validation fails, the system sends the errors back to the AI and asks it to fix and resubmit at a lower temperature. Two attempts before giving up.

Native JSON mode: For OpenAI and Gemini, the schema is passed directly to the API, constraining output format at the generation level.

Hallucination filtering: Any technique referencing a skill ID that doesn't exist in the library gets silently dropped.

General Fallback injection: If the AI omits the baseline technique, the system adds it automatically.

This doesn't make things perfect. But it helps ensure what reaches you is structurally valid and comes from real techniques.

2.5 You Review and Decide

The validated recommendations appear on screen, grouped by confidence (high, medium, low). You toggle techniques on or off. The AI suggests, and you have the final say on what gets generated.


Step 3: From Techniques to Test Cases

This step handles the actual test case generation, deduplication, and final output.

3.1 Creating Per-Skill Generation Tasks

For each technique you selected, the system creates a separate generation task. Each task bundles:

  • A focused system prompt: "You are a senior QA engineer specializing in [this technique]"
  • The full requirement text
  • The technique's markdown playbook (the skill file from skills/)
  • The extracted analysis context from Step 2.2

This is different from a single mega-prompt. Focused instructions produce more targeted results. "Apply Boundary Value Analysis" generates more precise boundary tests than asking the AI to handle six techniques at once.

3.2 Parallel Execution

Tasks go into a worker pool. Workers pull from a queue, and as one finishes, the next starts. Each call returns a mini test suite focused on that single technique.
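A minimal sketch of such a pool using Python's standard library. The real system is presumably Node-based, so treat this as an illustration of the pattern, not the implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_skill_tasks(tasks, generate_suite, max_workers=3):
    """Run one generation task per selected technique in a small worker pool.

    Workers pull tasks as they free up; pool.map still returns the
    resulting mini-suites in task order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_suite, tasks))
```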

Each mini-suite goes through the same validate-and-repair cycle from Step 2.4. If both attempts fail for a skill, it returns an empty suite instead of crashing the whole run. One bad response doesn't lose the others.
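The validate-repair-fallback flow might look roughly like this, with `call_llm` and `validate` as stand-ins for the real model call and the Ajv schema check:

```python
def validate_and_repair(call_llm, validate, prompt, empty_suite, repairs=2):
    """Ask for a suite; if validation fails, send the errors back and retry
    at a lower temperature. After the allotted repair attempts, fall back to
    an empty suite so one bad response doesn't sink the whole run."""
    response = call_llm(prompt, temperature=0.7)
    for _ in range(repairs):
        errors = validate(response)
        if not errors:
            return response
        prompt += ("\n\nYour previous response failed validation:\n"
                   + "\n".join(errors) + "\nFix these issues and resubmit.")
        response = call_llm(prompt, temperature=0.2)  # repair attempt
    return response if not validate(response) else empty_suite
```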

3.3 Coverage Tagging

After validation, the skill ID is appended to every test case's coverage tags. A case from Boundary Value Analysis gets "boundary-value-analysis" in its tags, making it easy to trace which technique produced which case.

3.4 Test Case Structure

Every generated test case follows a consistent format:

| Field | Description | Example |
| --- | --- | --- |
| id | Sequential identifier | TC-001 |
| title | Starts with "Verify that..." | Verify that login fails with empty password |
| type | Category from fixed set | negative, boundary, functional, security, ... |
| priority | Severity level | P0 (Critical), P1 (High), P2 (Medium), P3 (Low) |
| preconditions | Setup conditions | User is on login page, account exists |
| steps | Actions to perform | 1. Leave password blank, 2. Click Login |
| expected | Expected outcome | Error: "Password is required" |
| coverageTags | Technique + domain tags | ["boundary-value-analysis", "authentication"] |
| requirementRefs | Requirement traceability | ["REQ-001"] |

Here's what an actual generated test case looks like:

```json
{
  "id": "TC-003",
  "title": "Verify that registration fails when password is shorter than 8 characters",
  "type": "boundary",
  "priority": "P0",
  "preconditions": [
    "User is on the registration page",
    "All other fields are filled with valid data"
  ],
  "steps": [
    "Enter a 7-character password in the password field",
    "Click the Register button"
  ],
  "expected": [
    "Registration is rejected",
    "Error message displayed: 'Password must be at least 8 characters'"
  ],
  "coverageTags": ["boundary-value-analysis", "registration", "password"],
  "requirementRefs": ["REQ-002"]
}
```

3.5 Combining All Suites

All mini-suites are pooled into one collection. Assumptions, risks, and missing-info questions from every skill are merged. Duplicate assumptions and risks are removed by exact lowercase match.
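Exact lowercase matching is straightforward. A sketch, assuming the first occurrence is the one kept:

```python
def dedupe_exact(items: list[str]) -> list[str]:
    """Remove duplicates by exact lowercase match, keeping first occurrence."""
    seen = set()
    out = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out
```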

3.6 Weighted Jaccard Deduplication

When multiple techniques analyze the same requirement, overlap is inevitable. A BVA test and an Error Guessing test might both check what happens when a required field is empty. The system catches these by comparing every pair:

| Dimension | Weight | Calculation |
| --- | --- | --- |
| Title | 40% | Tokenized, lowercased, Jaccard index (shared / total unique tokens) |
| Steps | 40% | Steps concatenated and tokenized, same Jaccard |
| Expected results | 20% | Expected concatenated and tokenized, same Jaccard |

Score = 0.4 × title + 0.4 × steps + 0.2 × expected. Pairs exceeding 60% are flagged as duplicates and the newer one is dropped. Title and steps get more weight because they define what the test does. Two tests can have different expected outcomes but test the same scenario.
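The scoring formula translates almost directly into code. A sketch, where tokenization is a simple whitespace split (the real tokenizer may differ):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index on lowercased, whitespace-split text."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def similarity(case_a: dict, case_b: dict) -> float:
    """Weighted Jaccard: title 40%, steps 40%, expected results 20%."""
    return (0.4 * jaccard(case_a["title"], case_b["title"])
            + 0.4 * jaccard(" ".join(case_a["steps"]), " ".join(case_b["steps"]))
            + 0.2 * jaccard(" ".join(case_a["expected"]), " ".join(case_b["expected"])))

def is_duplicate(a: dict, b: dict, threshold: float = 0.6) -> bool:
    """Flag a pair as duplicate when the weighted score exceeds 60%."""
    return similarity(a, b) > threshold
```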

3.7 Renumber and Finalize

Surviving cases get fresh sequential IDs (TC-001, TC-002, ...) and are capped at a configured limit. The final suite is ready to view in the UI, export as PDF/CSV, or push to AIO Tests.

3.8 What the Final Output Includes (and Why)

The output isn't just a list of test cases. The final suite contains four distinct sections, each serving a different purpose:

Test Cases are the core output. Each one is a structured, atomic scenario with steps, expected results, priority, type, and traceability tags. These are the actual test scenarios ready to execute or push to your test management tool.

Assumptions are things the AI had to assume because the requirement didn't explicitly state them. For example, if the requirement says "users can reset their password" but doesn't mention how, the AI might assume "password reset is done via email link." These are listed so you can review them and catch any wrong assumptions before they turn into misleading test cases. If the AI assumed something incorrectly, you know which test cases to adjust or remove.

Risks are potential problem areas the AI identified while analyzing the requirement. These aren't test cases themselves, but flags for things that could go wrong. For example: "No rate limiting mentioned for login attempts, which creates a brute force risk" or "Requirement doesn't specify behavior when payment gateway is unavailable." Risks help you and your team prioritize what to test more carefully and what to raise with the product team. They're especially useful for junior QA engineers who might not spot these concerns on their own.

Missing Info Questions are things the AI couldn't determine from the requirement that would affect test design. These are similar to what the optional clarify step (Step 2.1) surfaces, but these come from the generation phase itself. For example: "What is the maximum number of items allowed in the cart?" or "Should the form auto-save drafts?" If you answered clarification questions in Step 2.1, you'll see fewer of these. The ones that remain highlight gaps in the requirement that you might want to take back to stakeholders.

| Output Section | Why It Exists | What to Do With It |
| --- | --- | --- |
| Test Cases | The actual test scenarios to execute | Review, refine, export, or push to AIO Tests |
| Assumptions | What the AI had to guess | Verify each one; wrong assumptions mean wrong test cases |
| Risks | Potential problem areas spotted during analysis | Prioritize testing, raise with product team |
| Missing Info | Gaps that affect test coverage | Take back to stakeholders, or accept and move on |

Together, these four sections give you more than just test cases. They give you a picture of how confident you can be in the output and where the requirement itself might need work.

3.9 Optional: Visual Technique Diagrams

You can toggle diagram generation for specific techniques. The AI generates a Mermaid.js diagram alongside the test cases, rendered directly in the UI.

| Technique | Diagram Type | What It Shows |
| --- | --- | --- |
| State Transition | State diagram | States, valid/invalid transitions |
| Decision Tables | Flowchart | Conditions branching to outcomes |
| Equivalence Partitioning | Flowchart | Input partitions: valid (green) vs invalid (red) |
| Boundary Value Analysis | Flowchart | Boundary points with pass/fail zones |
| Pairwise / Combinatorial | Flowchart | Parameter combinations as tree/matrix |
| Feature Decomposition | Mind map | Decomposed dimensions: actors, data, rules, states |

The Skill Library: 12 Skills and Counting

The system ships with 12 playbooks, but this isn't a hard limit. Each skill is a markdown file in skills/ with frontmatter (ID, title, tags) and a body describing how to apply the technique. The server loads every .md file in that folder on startup. Want to add "API Contract Testing" or "WCAG 2.1 Checklist"? Drop in a markdown file and the system picks it up automatically.
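Loading such a folder takes very little code. A hypothetical sketch with a minimal frontmatter parser (the real loader may parse tags into lists and do more validation):

```python
from pathlib import Path

def load_skills(skills_dir: str = "skills") -> dict[str, dict]:
    """Load every .md playbook, parsing a minimal ---key: value--- frontmatter.

    Files without an 'id' in their frontmatter fall back to the filename stem.
    """
    skills = {}
    for path in Path(skills_dir).glob("*.md"):
        text = path.read_text(encoding="utf-8")
        meta, body = {}, text
        if text.startswith("---"):
            header, _, body = text[3:].partition("---")
            for line in header.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        skills[meta.get("id", path.stem)] = {"meta": meta, "body": body.strip()}
    return skills
```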

| # | Skill | What It Targets | Example Use Case |
| --- | --- | --- | --- |
| 1 | Equivalence Partitioning | Valid/invalid input classes | Email: valid format vs missing @ vs empty |
| 2 | Boundary Value Analysis | Edges of input ranges | Password: 7 (fail), 8 (pass), 64 (pass), 65 (fail) |
| 3 | Decision Tables | Multi-condition business rules | Member + coupon + $100 order = 20% off |
| 4 | State Transition | Lifecycle flows | Order: draft → submitted → approved → shipped |
| 5 | Pairwise / Combinatorial | Multi-parameter interactions | Browser × OS × language × payment method |
| 6 | Error Guessing & Heuristics | Common failure modes | SQL injection, special characters, empty arrays |
| 7 | Risk-Based Prioritization | High-impact scenarios first | Payment processing before cosmetic UI tests |
| 8 | Requirements Traceability | Full spec-to-test mapping | Every acceptance criterion traced to a test |
| 9 | Feature Decomposition | Atomic testable units | "Search" → query, filters, sort, pagination |
| 10 | Functional Core | Happy-path business logic | Login, add to cart, complete checkout |
| 11 | Non-Functional Baseline | Performance, security, usability | Page load < 3s, WCAG compliance, XSS checks |
| 12 | General Fallback | Catch-all baseline (always on) | Scenarios outside specific techniques |

The more skills you add, the more the analysis engine can recommend.

The Human in the Loop

The generated test suite is a starting point, not a finished product. This is where your judgment comes in.

TestPilot AI doesn't know your system's history, your team's risk tolerance, or which integration partner has a flaky API. Those calls are yours.

| What TestPilot AI Tries to Do | What You Still Need to Do |
| --- | --- |
| Apply available QA techniques | Decide which edge cases matter for your system |
| Generate structured test cases | Review and refine with domain knowledge |
| Remove overlapping scenarios | Add context-specific tests the AI can't know |
| Get you a first draft quickly | Make the final priority and risk calls |

Some cases will be spot-on. Others might be too generic or miss a nuance only you'd catch. The value is in not starting from a blank page. You review, edit, and then export as PDF, CSV, or push directly to AIO Tests.

The goal was never to replace QA engineers. It was to spend less time typing and more time thinking. That's what we built TestPilot AI to help with.