AI Prompt Evaluation Checklist: Diagnose Why Your Prompts Fail & Fix Them Fast



LucyBrain Switzerland ○ AI Daily


December 27, 2025

TL;DR: What You'll Learn

  • Random prompt iteration wastes time—systematic diagnosis identifies exact failure points in minutes

  • Five-component evaluation framework works across text, image, and video AI tools

  • Most prompt failures stem from 1-2 missing components, not overall prompt quality

  • Diagnostic checklist prevents rewriting entire prompts when only one element needs adjustment

  • Scoring system reveals which prompts need minor tweaks vs complete reconstruction

Most people iterate prompts randomly. They change multiple elements at once, hoping something improves. They rewrite everything when results disappoint.

This wastes time because most prompt failures aren't systemic—they're component-specific. A prompt might have perfect style direction but missing context. Excellent subject definition but vague constraints. Strong role framing but ambiguous task specification.

Random changes fix problems by accident after many attempts. Systematic diagnosis fixes problems deliberately on the first or second iteration.

This article provides a diagnostic framework for evaluating prompts, identifying specific failure points, and making targeted improvements that actually work.

Why Random Iteration Fails

Understanding why unfocused changes waste time clarifies the value of systematic diagnosis.

The random iteration pattern:

Attempt 1: "Write a marketing email"
Result: Too generic

Attempt 2: "Write a really good professional marketing email that gets results"
Result: Still generic, now verbose

Attempt 3: "You are a marketing expert. Write an amazing email about our product that makes people want to buy it with great subject line and call to action"
Result: Better but still misses the mark

Attempt 4: Random complete rewrite trying different approach
Result: Different problems, no closer to solution

Why this fails: Each iteration changes multiple components simultaneously. When something improves, you don't know which change caused the improvement. When things worsen, you don't know which change caused the regression.

The systematic approach:

Attempt 1: "Write a marketing email"
Result: Too generic

Diagnosis: Missing context (who's the audience? what's the goal?), missing constraints (length? tone? specific requirements?)

Attempt 2: "You are a B2B SaaS marketing manager. Write a follow-up email to demo attendees encouraging trial signup. Target: technical decision-makers who saw our demo but haven't signed up yet. 150 words max, professional but approachable tone, single clear CTA. Avoid: hype language, multiple CTAs, assuming they remember demo details."
Result: Achieves target on second attempt

Why this works: Diagnosis identified specific missing components (context, constraints). Changes targeted only those components. No wasted iteration on elements that already worked.

The Five-Component Diagnostic Framework

Every prompt evaluation examines these five elements.

Component 1: Role Definition

What to evaluate: Does the prompt specify what expertise, perspective, or style the AI should apply?

Diagnostic questions:

  • Is there explicit role framing? ("You are a [specific role]")

  • For text AI: Does the role imply relevant knowledge domain?

  • For image/video AI: Does the role reference specific aesthetic or style?

  • Is the role specific enough to activate appropriate patterns?

Failure indicators:

  • Output feels generic or lacks specialized knowledge

  • Tone doesn't match expected expertise level

  • Style misses the mark despite other elements being correct

Scoring:

  • ✅ Strong: Specific role with clear expertise or style reference

  • ⚠️ Weak: Vague role or generic description ("professional," "expert")

  • ❌ Missing: No role context at all

Examples:

Missing role (Text): "Write a business plan"
Strong role: "You are a venture capital analyst who reviews 200+ SaaS business plans annually"

Weak role (Image): "Professional product photo"
Strong role: "Product photography in Apple commercial style"

Missing role (Video): "Video of someone working"
Strong role: "Documentary style observational footage, Frederick Wiseman aesthetic"

Component 2: Task Specification

What to evaluate: Is the output format, structure, and scope explicitly defined?

Diagnostic questions:

  • Does the prompt specify exact output format?

  • Are structural requirements clear?

  • Is scope bounded (length, duration, number of elements)?

  • Would someone else reading this prompt generate the same structure?

Failure indicators:

  • Output has wrong format (got paragraph, needed bullet list)

  • Structure doesn't match needs (wanted 5 sections, got 3)

  • Length wildly off target

  • Multiple interpretations possible for what to create

Scoring:

  • ✅ Strong: Unambiguous format, structure, and scope

  • ⚠️ Weak: Format implied but not explicit

  • ❌ Missing: No indication of desired output structure

Examples:

Missing task (Text): "Help with my presentation"
Strong task: "Create a 10-slide presentation outline with: title slide, problem statement, 3 solution approaches with pros/cons each, implementation timeline, conclusion with next steps"

Weak task (Image): "Make an image for my blog"
Strong task: "16:9 blog header image, 1920x1080px, subject positioned left third leaving right two-thirds negative space for text overlay"

Missing task (Video): "Show product being used"
Strong task: "15-second product demo, three-shot sequence: (1) unboxing close-up 4sec, (2) hands demonstrating feature 6sec, (3) user reaction close-up 5sec"

Component 3: Context Loading

What to evaluate: Has necessary background information, constraints, and success criteria been provided?

Diagnostic questions:

  • Does the AI have information needed to make appropriate decisions?

  • Are success criteria defined?

  • Are relevant constraints specified?

  • Is audience or purpose clear?

  • Are examples or reference points provided when helpful?

Failure indicators:

  • Output technically correct but doesn't serve actual need

  • AI makes wrong assumptions about audience or purpose

  • Results miss implicit requirements

  • Output would work for different context but not yours

Scoring:

  • ✅ Strong: Comprehensive background, clear success criteria, relevant constraints

  • ⚠️ Weak: Some context but missing key information

  • ❌ Missing: No context about situation, audience, or goals

Examples:

Missing context (Text): "Improve this email" [attaches draft]
Strong context: "This email requests meeting with potential enterprise client. Previous version was too casual and buried meeting request in paragraph 3. Rewrite to: establish credibility in opening, state meeting purpose clearly, include specific time options, professional but approachable tone for CTO audience."

Weak context (Image): "Professional headshot"
Strong context: "LinkedIn profile photo for executive coach targeting corporate clients—needs to convey: approachable warmth, professional credibility, confident competence. Will be viewed primarily on mobile LinkedIn app."

Missing context (Video): "Product demo video"
Strong context: "Demo for landing page targeting technical decision-makers who understand the problem but are skeptical of solutions. Goal: show credibility through working product, not marketing promises. Viewer has 15 seconds attention max before scrolling."

Component 4: Style Direction

What to evaluate: Is the communication approach, aesthetic, or tone clearly specified?

Diagnostic questions:

  • For text: Is voice and tone explicit?

  • For image/video: Is aesthetic direction concrete?

  • Are style references specific and recognizable?

  • Are anti-patterns specified (what to avoid)?

  • Could two people interpret style direction the same way?

Failure indicators:

  • Output has wrong tone despite correct content

  • Aesthetic doesn't match intended feel

  • "Professional" or "good quality" used without definition

  • Style interpretation wildly different from intention

Scoring:

  • ✅ Strong: Concrete references or detailed characteristics, includes anti-patterns

  • ⚠️ Weak: Vague descriptors without examples

  • ❌ Missing: No style direction at all

Examples:

Missing style (Text): "Write a report"
Strong style: "Write in McKinsey Quarterly style: data-driven, third-person, executive audience, specific examples and frameworks, avoid: buzzwords, hype language, first-person, rhetorical questions"

Weak style (Image): "Modern minimalist style"
Strong style: "Kinfolk magazine aesthetic: natural light, muted earth tones, 70% negative space, single subject, lifestyle context subtle not staged, avoid: hard shadows, bright colors, busy compositions"

Missing style (Video): "Professional video"
Strong style: "Apple product launch aesthetic: slow deliberate camera movements, clean minimal compositions, warm color grade, smooth not jerky motion, intimate product focus, avoid: fast cuts, dramatic music, voice-over narration"

Component 5: Constraints Definition

What to evaluate: Are technical limits, format requirements, and prohibitions explicit?

Diagnostic questions:

  • Are technical specifications clear?

  • Are format requirements defined?

  • Are quality thresholds specified?

  • Are prohibitions or exclusions stated?

  • Could output be technically unusable despite good content?

Failure indicators:

  • Correct content but wrong format (too long, wrong aspect ratio, invalid characters)

  • Violates unstated requirements

  • Includes prohibited elements

  • Technically incompatible with intended use

Scoring:

  • ✅ Strong: Complete technical specs, clear requirements, exclusions stated

  • ⚠️ Weak: Some constraints but missing critical limits

  • ❌ Missing: No constraints specified

Examples:

Missing constraints (Text): "Write a social media caption"
Strong constraints: "Instagram caption: 125 characters max (platform truncates at 125), include 1 call-to-action, 3-5 hashtags, conversational Gen Z tone, maximum 2 emoji, avoid: multiple CTAs, sales language, corporate speak"

Weak constraints (Image): "Product photo"
Strong constraints: "Product photography: 1:1 aspect ratio 2000x2000px minimum for Instagram, pure white background RGB 255,255,255, product centered occupying 60-70% of frame, sharp focus throughout (no blur), shadow beneath product for depth, avoid: colored backgrounds, multiple products, text overlays"

Missing constraints (Video): "Product demo"
Strong constraints: "Product demo: 15 seconds duration maximum, 9:16 vertical aspect ratio for social media stories, subject must remain center-frame throughout (mobile cropping safe), 24fps, natural lighting only (no obvious studio setup), avoid: rapid cuts, zooms, music that requires licensing"
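Constraints this explicit can also be verified mechanically before publishing. Here is a minimal sketch in Python using the Instagram caption limits from the text example above; the helper name and the crude emoji heuristic are my own assumptions, not part of any platform API:

```python
import re

def check_caption(caption: str) -> list[str]:
    """Check a caption against the example constraints: 125 chars, 3-5 hashtags, max 2 emoji."""
    problems = []
    if len(caption) > 125:
        problems.append("over 125 characters (platform truncates)")
    hashtags = re.findall(r"#\w+", caption)
    if not 3 <= len(hashtags) <= 5:
        problems.append(f"need 3-5 hashtags, found {len(hashtags)}")
    # Crude heuristic: count characters in the main emoji code-point range.
    emoji = [ch for ch in caption if 0x1F300 <= ord(ch) <= 0x1FAFF]
    if len(emoji) > 2:
        problems.append("more than 2 emoji")
    return problems

print(check_caption("Ready in 5 minutes. Grab yours today #coffee #morning #deal"))  # []
```

A check like this turns the "Technically incompatible with intended use" failure indicator into something you catch before the output ships.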

The Diagnostic Process

Systematic evaluation follows this sequence.

Step 1: Generate with current prompt

Get output before diagnosis. Can't identify problems without seeing results.

Step 2: Evaluate output quality

Ask: Does this achieve my goal? If yes, prompt is complete. If no, proceed to diagnosis.

Step 3: Score each component

Work through all five components systematically:

  • ✅ Strong (3 points)

  • ⚠️ Weak (1 point)

  • ❌ Missing (0 points)

Total score out of 15 points.

Step 4: Interpret score

  • 12-15 points: Prompt is strong, failures likely due to AI capability limits or need for tool change, not prompt issues

  • 8-11 points: Moderate prompt, identify weak components and strengthen them

  • 4-7 points: Weak prompt, multiple components need work, prioritize missing elements first

  • 0-3 points: Failed prompt, essentially just an input with no guidance, needs complete reconstruction
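The scoring and interpretation steps above can be condensed into a small helper. This is an illustrative sketch in Python (the function and variable names are my own, not from any tool), using the article's 3/1/0 point values and score bands:

```python
# Each component is rated "strong" (3 points), "weak" (1), or "missing" (0).
POINTS = {"strong": 3, "weak": 1, "missing": 0}

def diagnose(scores: dict) -> tuple:
    """Total the five component ratings and map the result to a diagnosis band."""
    total = sum(POINTS[rating] for rating in scores.values())
    if total >= 12:
        band = "strong: suspect tool limits, not the prompt"
    elif total >= 8:
        band = "moderate: strengthen the weak components"
    elif total >= 4:
        band = "weak: fix missing components first"
    else:
        band = "failed: reconstruct with all five components"
    # Missing components are the highest-priority fixes (Step 5).
    fixes = [name for name, rating in scores.items() if rating == "missing"]
    return total, band, fixes

# The generic "Write a marketing email" prompt from the examples below:
total, band, fixes = diagnose({
    "role": "missing", "task": "weak", "context": "missing",
    "style": "missing", "constraints": "missing",
})
print(total, band)  # 1 failed: reconstruct with all five components
```

The point is not automation for its own sake; writing the bands down as code makes the thresholds explicit and keeps your scoring consistent across prompts.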

Step 5: Prioritize fixes

Fix missing components before weak components. Missing components cause larger quality gaps.

Step 6: Modify only failing components

Don't rewrite the entire prompt. Change only diagnosed problems. This isolates whether your fix worked.

Step 7: Test single change

Generate again and evaluate. Did the specific change improve the specific issue? If yes, continue to next weak component. If no, try different approach to same component.

Diagnostic Examples

Example 1: Generic Business Email

Original prompt: "Write a marketing email"

Output received: Generic promotional email with no specific value proposition or audience targeting

Diagnostic evaluation:

Role: ❌ Missing (0 points) - No role specified

Task: ⚠️ Weak (1 point) - Format implied (email) but no structure detail

Context: ❌ Missing (0 points) - No audience, goal, or situation

Style: ❌ Missing (0 points) - No tone or voice direction

Constraints: ❌ Missing (0 points) - No length, format requirements, or prohibitions

Total score: 1/15 (Failed prompt)

Diagnosis: Prompt is essentially just an input. Needs complete reconstruction with all five components.

Reconstructed prompt:

"You are a B2B SaaS marketing manager [ROLE] writing to technical decision-makers who attended your product demo last week but haven't signed up for trial yet [CONTEXT].

Create a follow-up email with this structure [TASK]:

  • Subject line (under 50 characters)

  • Opening: Reference their specific demo attendance

  • Body: One key benefit they'd gain from trial (focus on time savings)

  • CTA: Clear trial signup link

  • Close: Offer to answer questions

Requirements [CONSTRAINTS]:

  • 150 words maximum

  • Single clear CTA only

  • Professional but approachable tone [STYLE]

Avoid: Hype language, multiple CTAs, assuming they remember demo details, selling features they didn't express interest in"

Result: Specific, targeted email that addresses the actual need

Example 2: Vague Product Photo Request

Original prompt: "Professional product photo of coffee mug"

Output received: Random interpretation—overhead angle, busy background, wrong lighting

Diagnostic evaluation:

Role: ⚠️ Weak (1 point) - "Professional" too vague

Task: ⚠️ Weak (1 point) - Subject clear but composition not specified

Context: ❌ Missing (0 points) - No use case or platform

Style: ❌ Missing (0 points) - "Professional" is not a style reference

Constraints: ❌ Missing (0 points) - No aspect ratio, resolution, or composition requirements

Total score: 2/15 (Failed prompt)

Diagnosis: Multiple missing components. Start with role and constraints.

Improved prompt:

"Product photography in Williams Sonoma catalog style [ROLE]: sage green ceramic coffee mug on white marble surface [TASK].

Composition [TASK continued]: Rule of thirds, mug positioned left of frame showing handle and interior, 60% negative space right side for text overlay [CONTEXT: e-commerce use].

Lighting: Soft directional from upper left creating gentle shadow right side, white seamless background RGB 255,255,255 [STYLE].

Technical specs [CONSTRAINTS]: 4:5 aspect ratio for Instagram product post, 2000x2000px minimum, sharp focus throughout, avoid: busy backgrounds, multiple objects, harsh shadows, colored backgrounds"

Result: Specific product photo suitable for e-commerce use

Example 3: Unclear Video Request

Original prompt: "Video showing app features"

Output received: Random screen recording with no narrative flow or clear demonstration

Diagnostic evaluation:

Role: ❌ Missing (0 points) - No cinematic or stylistic direction

Task: ⚠️ Weak (1 point) - "Showing features" is a vague sequence

Context: ❌ Missing (0 points) - No audience or purpose

Style: ❌ Missing (0 points) - No aesthetic reference

Constraints: ❌ Missing (0 points) - No duration, format, or technical specs

Total score: 1/15 (Failed prompt)

Diagnosis: Needs complete reconstruction focusing on shot sequence and purpose.

Improved prompt:

"15-second app demo for landing page [CONTEXT] targeting skeptical enterprise buyers who need to see working product [CONTEXT continued], Apple product video aesthetic [ROLE].

Shot sequence [TASK]:

  • (0-5sec) Close-up: Hand opens app, clean interface visible

  • (5-10sec) Medium shot: Finger taps through key workflow demonstrating core feature

  • (10-15sec) Screen shows result, hand exits frame, lingers on successful outcome

Camera: Overhead angle, locked position throughout, soft natural lighting, warm color grade [STYLE].

Technical [CONSTRAINTS]: 16:9 for web embed, maintain app screen center-frame for mobile safety, smooth controlled motion (no jerky movements), 24fps, avoid: voice-over, music, text overlays during demo"

Result: Clear demonstration sequence suitable for landing page

Quick Diagnostic Shortcuts

For rapid evaluation without full component scoring.

The "Would Someone Else" Test

Read your prompt and ask: Would another person generate the same output type with this prompt?

  • If yes: Task specification is probably strong

  • If no: Task specification needs work

The "What's Missing" Test

Generate output and identify disconnects:

  • Wrong tone? → Check style direction

  • Wrong format? → Check task specification

  • Missed the point? → Check context loading

  • Generic? → Check role definition

  • Technically unusable? → Check constraints
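The symptom-to-component mapping above is mechanical enough to express as a lookup table. A hypothetical sketch (the symptom strings and component names simply mirror the list; nothing here comes from a real library):

```python
# Map an observed output symptom to the component to check first.
SYMPTOM_TO_COMPONENT = {
    "wrong tone": "style direction",
    "wrong format": "task specification",
    "missed the point": "context loading",
    "generic": "role definition",
    "technically unusable": "constraints definition",
}

def component_to_check(symptom: str) -> str:
    """Return the component to inspect first, or fall back to full scoring."""
    return SYMPTOM_TO_COMPONENT.get(
        symptom.strip().lower(), "run the full five-component evaluation"
    )

print(component_to_check("Generic"))       # role definition
print(component_to_check("weird output"))  # run the full five-component evaluation
```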

The "One Change" Test

Change only one component and regenerate:

  • Improvement? → That component was the problem

  • No change? → Try different component

  • Worse? → Revert change, try different approach to same component

The "Score Drop" Test

If a previously working prompt suddenly fails:

  • Platform update changed defaults? → Check constraints

  • Different use case? → Check context

  • New audience? → Check style direction

Most prompt degradation comes from context or constraint changes, rarely from role or task issues.

Common Diagnostic Patterns

Pattern 1: High Score But Poor Results (12-15 points)

What it means: Prompt structure is strong but either:

  • AI capability limits reached

  • Wrong tool for task

  • Unrealistic expectations

What to check:

  • Is this task within AI capability?

  • Would different tool (ChatGPT vs Claude, Midjourney vs DALL-E) work better?

  • Are expectations achievable?

Fix approach: Consider tool change or expectation adjustment rather than prompt modification.

Pattern 2: Medium Score with Weak Style (8-11 points)

What it means: Basic structure present but aesthetic or tone missing.

Common in: Image and video prompts where "professional" or "high quality" used without concrete reference.

Fix approach: Add specific style references (photographer names, film references, publication styles).

Pattern 3: Low Score Missing Context (4-7 points)

What it means: Task specified but situation unclear.

Common in: Business writing, technical documentation where audience and purpose missing.

Fix approach: Add context about audience, goals, constraints, and success criteria.

Pattern 4: Very Low Score, Essentially Input (0-3 points)

What it means: Prompt is just a request with no guidance.

Common in: First attempts before understanding prompt structure.

Fix approach: Complete reconstruction adding all five components.

For systematic improvement strategies beyond diagnosis, see AI Prompt Iteration & Optimization: How to Get First-Attempt Quality Every Time.

Building Your Diagnostic Habit

Make evaluation systematic through practice.

Create evaluation template:

Save this checklist for repeated use:

PROMPT EVALUATION
Prompt: [paste prompt here]
Role: [✅ 3 / ⚠️ 1 / ❌ 0]
Task: [✅ 3 / ⚠️ 1 / ❌ 0]
Context: [✅ 3 / ⚠️ 1 / ❌ 0]
Style: [✅ 3 / ⚠️ 1 / ❌ 0]
Constraints: [✅ 3 / ⚠️ 1 / ❌ 0]
Total: [ /15]
Output quality: [rate 1-10]

Review past successes:

When prompt works excellently, score it to understand which components made it work. Build intuition about what strong components look like.

Track failure patterns:

Notice which components you consistently score weak or missing. These are your personal blind spots to improve.

Compare before/after:

When systematic diagnosis fixes a problem, save both versions with notes about what changed and why it worked.

For cross-platform diagnostic techniques, see Cross-Platform AI Prompting 2026: Text, Image & Video Unified Framework.

Frequently Asked Questions

How long does systematic prompt diagnosis take?

2-3 minutes to score five components once familiar with framework. Much faster than random iteration which can waste 20-30 minutes changing elements without clear direction. Initial diagnosis investment pays off through targeted fixes that work on first attempt.

Can I skip components if I know they're not the problem?

Yes for rapid diagnosis on simple prompts, but full evaluation catches non-obvious issues. Sometimes weak style direction creates output that seems like task failure. Systematic check prevents misdiagnosis and wasted iteration on wrong component.

What if multiple components score weak or missing?

Prioritize: Fix missing components first (biggest impact), then weak components. Within missing components, prioritize: Context → Task → Role → Style → Constraints. Context failures cascade most broadly.

Does the scoring system work across text, image, and video AI?

Yes, five-component framework applies universally—only implementation changes. Text Role = expertise, Image Role = style reference, Video Role = cinematic approach. Core diagnostic logic remains constant across modalities.

How do I know if problem is prompt quality vs AI capability limits?

High-scoring prompts (12-15 points) that still fail likely hit capability limits. Try same prompt on different tool (ChatGPT vs Claude, Midjourney vs DALL-E). If all tools fail, task may exceed current AI capability or need breaking into smaller pieces.

Should I diagnose before first attempt or only after failure?

Quick evaluation before submitting prevents obvious gaps. Most people write prompts then realize context missing. 30-second pre-check asking "Do I have all five components?" saves iteration cycles. Full diagnostic scoring after failure identifies specific fixes needed.

What's the most commonly missing component?

Context, especially in text prompts. People specify what to create (task) but omit audience, purpose, constraints, success criteria. "Write a report" missing who reads it and why. "Create presentation" missing what decision it supports. Context grounds output in reality.

Can good prompts still produce bad outputs?

Yes. Even perfect prompts sometimes generate poor results due to AI randomness, training data gaps, or capability limits. One bad output doesn't mean prompt failed—regenerate 2-3 times. Consistent poor results indicate prompt issue diagnosable with checklist.


www.topfreeprompts.com

Access 80,000+ professionally engineered prompts, each evaluated with the five-component diagnostic framework. Every prompt includes component scoring showing exactly why it works, helping you learn diagnostic evaluation through examples.
