LucyBrain Switzerland ○ AI Daily
AI Prompt Evaluation Checklist: Diagnose Why Your Prompts Fail & Fix Them Fast
December 27, 2025
TL;DR: What You'll Learn
Random prompt iteration wastes time—systematic diagnosis identifies exact failure points in minutes
Five-component evaluation framework works across text, image, and video AI tools
Most prompt failures stem from 1-2 missing components, not overall prompt quality
Diagnostic checklist prevents rewriting entire prompts when only one element needs adjustment
Scoring system reveals which prompts need minor tweaks vs complete reconstruction
Most people iterate prompts randomly. They change multiple elements at once, hoping something improves. They rewrite everything when results disappoint.
This wastes time because most prompt failures aren't systemic—they're component-specific. A prompt might have perfect style direction but missing context. Excellent subject definition but vague constraints. Strong role framing but ambiguous task specification.
Random changes fix problems by accident after many attempts. Systematic diagnosis fixes problems deliberately on the first or second iteration.
This article provides a diagnostic framework for evaluating prompts, identifying specific failure points, and making targeted improvements that actually work.
Why Random Iteration Fails
Understanding why unfocused changes waste time clarifies the value of systematic diagnosis.
The random iteration pattern:
Attempt 1: "Write a marketing email" Result: Too generic
Attempt 2: "Write a really good professional marketing email that gets results" Result: Still generic, now verbose
Attempt 3: "You are a marketing expert. Write an amazing email about our product that makes people want to buy it with great subject line and call to action" Result: Better but still misses the mark
Attempt 4: Random complete rewrite trying different approach Result: Different problems, no closer to solution
Why this fails: Each iteration changes multiple components simultaneously. When something improves, you don't know which change caused the improvement. When things worsen, you don't know which change caused the regression.
The systematic approach:
Attempt 1: "Write a marketing email" Result: Too generic
Diagnosis: Missing context (who's the audience? what's the goal?), missing constraints (length? tone? specific requirements?)
Attempt 2: "You are a B2B SaaS marketing manager. Write a follow-up email to demo attendees encouraging trial signup. Target: technical decision-makers who saw our demo but haven't signed up yet. 150 words max, professional but approachable tone, single clear CTA. Avoid: hype language, multiple CTAs, assuming they remember demo details." Result: Achieves target on second attempt
Why this works: Diagnosis identified specific missing components (context, constraints). Changes targeted only those components. No wasted iteration on elements that already worked.
The Five-Component Diagnostic Framework
Every prompt evaluation examines these five elements.
Component 1: Role Definition
What to evaluate: Does the prompt specify what expertise, perspective, or style the AI should apply?
Diagnostic questions:
Is there explicit role framing? ("You are a [specific role]")
For text AI: Does the role imply relevant knowledge domain?
For image/video AI: Does the role reference specific aesthetic or style?
Is the role specific enough to activate appropriate patterns?
Failure indicators:
Output feels generic or lacks specialized knowledge
Tone doesn't match expected expertise level
Style misses the mark despite other elements being correct
Scoring:
✅ Strong: Specific role with clear expertise or style reference
⚠️ Weak: Vague role or generic description ("professional," "expert")
❌ Missing: No role context at all
Examples:
Missing role (Text): "Write a business plan" Strong role: "You are a venture capital analyst who reviews 200+ SaaS business plans annually"
Weak role (Image): "Professional product photo" Strong role: "Product photography in Apple commercial style"
Missing role (Video): "Video of someone working" Strong role: "Documentary style observational footage, Frederick Wiseman aesthetic"
Component 2: Task Specification
What to evaluate: Is the output format, structure, and scope explicitly defined?
Diagnostic questions:
Does the prompt specify exact output format?
Are structural requirements clear?
Is scope bounded (length, duration, number of elements)?
Would someone else reading this prompt generate the same structure?
Failure indicators:
Output has wrong format (got paragraph, needed bullet list)
Structure doesn't match needs (wanted 5 sections, got 3)
Length wildly off target
Multiple interpretations possible for what to create
Scoring:
✅ Strong: Unambiguous format, structure, and scope
⚠️ Weak: Format implied but not explicit
❌ Missing: No indication of desired output structure
Examples:
Missing task (Text): "Help with my presentation" Strong task: "Create a 10-slide presentation outline with: title slide, problem statement, 3 solution approaches with pros/cons each, implementation timeline, conclusion with next steps"
Weak task (Image): "Make an image for my blog" Strong task: "16:9 blog header image, 1920x1080px, subject positioned left third leaving right two-thirds negative space for text overlay"
Missing task (Video): "Show product being used" Strong task: "15-second product demo, three-shot sequence: (1) unboxing close-up 4sec, (2) hands demonstrating feature 6sec, (3) user reaction close-up 5sec"
Component 3: Context Loading
What to evaluate: Has necessary background information, constraints, and success criteria been provided?
Diagnostic questions:
Does the AI have information needed to make appropriate decisions?
Are success criteria defined?
Are relevant constraints specified?
Is audience or purpose clear?
Are examples or reference points provided when helpful?
Failure indicators:
Output technically correct but doesn't serve actual need
AI makes wrong assumptions about audience or purpose
Results miss implicit requirements
Output would work for different context but not yours
Scoring:
✅ Strong: Comprehensive background, clear success criteria, relevant constraints
⚠️ Weak: Some context but missing key information
❌ Missing: No context about situation, audience, or goals
Examples:
Missing context (Text): "Improve this email" [attaches draft] Strong context: "This email requests meeting with potential enterprise client. Previous version was too casual and buried meeting request in paragraph 3. Rewrite to: establish credibility in opening, state meeting purpose clearly, include specific time options, professional but approachable tone for CTO audience."
Weak context (Image): "Professional headshot" Strong context: "LinkedIn profile photo for executive coach targeting corporate clients—needs to convey: approachable warmth, professional credibility, confident competence. Will be viewed primarily on mobile LinkedIn app."
Missing context (Video): "Product demo video" Strong context: "Demo for landing page targeting technical decision-makers who understand the problem but skeptical of solutions. Goal: show credibility through working product, not marketing promises. Viewer has 15 seconds attention max before scrolling."
Component 4: Style Direction
What to evaluate: Is the communication approach, aesthetic, or tone clearly specified?
Diagnostic questions:
For text: Is voice and tone explicit?
For image/video: Is aesthetic direction concrete?
Are style references specific and recognizable?
Are anti-patterns specified (what to avoid)?
Could two people interpret style direction the same way?
Failure indicators:
Output has wrong tone despite correct content
Aesthetic doesn't match intended feel
"Professional" or "good quality" used without definition
Style interpretation wildly different from intention
Scoring:
✅ Strong: Concrete references or detailed characteristics, includes anti-patterns
⚠️ Weak: Vague descriptors without examples
❌ Missing: No style direction at all
Examples:
Missing style (Text): "Write a report" Strong style: "Write in McKinsey Quarterly style: data-driven, third-person, executive audience, specific examples and frameworks, avoid: buzzwords, hype language, first-person, rhetorical questions"
Weak style (Image): "Modern minimalist style" Strong style: "Kinfolk magazine aesthetic: natural light, muted earth tones, 70% negative space, single subject, lifestyle context subtle not staged, avoid: hard shadows, bright colors, busy compositions"
Missing style (Video): "Professional video" Strong style: "Apple product launch aesthetic: slow deliberate camera movements, clean minimal compositions, warm color grade, smooth not jerky motion, intimate product focus, avoid: fast cuts, dramatic music, voice-over narration"
Component 5: Constraints Definition
What to evaluate: Are technical limits, format requirements, and prohibitions explicit?
Diagnostic questions:
Are technical specifications clear?
Are format requirements defined?
Are quality thresholds specified?
Are prohibitions or exclusions stated?
Could output be technically unusable despite good content?
Failure indicators:
Correct content but wrong format (too long, wrong aspect ratio, invalid characters)
Violates unstated requirements
Includes prohibited elements
Technically incompatible with intended use
Scoring:
✅ Strong: Complete technical specs, clear requirements, exclusions stated
⚠️ Weak: Some constraints but missing critical limits
❌ Missing: No constraints specified
Examples:
Missing constraints (Text): "Write a social media caption" Strong constraints: "Instagram caption: 125 characters max (platform truncates at 125), include 1 call-to-action, 3-5 hashtags, conversational Gen Z tone, maximum 2 emoji, avoid: multiple CTAs, sales language, corporate speak"
Weak constraints (Image): "Product photo" Strong constraints: "Product photography: 1:1 aspect ratio 2000x2000px minimum for Instagram, pure white background RGB 255,255,255, product centered occupying 60-70% of frame, sharp focus throughout (no blur), shadow beneath product for depth, avoid: colored backgrounds, multiple products, text overlays"
Missing constraints (Video): "Product demo" Strong constraints: "Product demo: 15 seconds duration maximum, 9:16 vertical aspect ratio for social media stories, subject must remain center-frame throughout (mobile cropping safe), 24fps, natural lighting only (no obvious studio setup), avoid: rapid cuts, zooms, music that requires licensing"
The Diagnostic Process
Systematic evaluation follows this sequence.
Step 1: Generate with current prompt
Get output before diagnosis. Can't identify problems without seeing results.
Step 2: Evaluate output quality
Ask: Does this achieve my goal? If yes, prompt is complete. If no, proceed to diagnosis.
Step 3: Score each component
Work through all five components systematically:
✅ Strong (3 points)
⚠️ Weak (1 point)
❌ Missing (0 points)
Total score out of 15 points.
Step 4: Interpret score
12-15 points: Prompt is strong, failures likely due to AI capability limits or need for tool change, not prompt issues
8-11 points: Moderate prompt, identify weak components and strengthen them
4-7 points: Weak prompt, multiple components need work, prioritize missing elements first
0-3 points: Failed prompt, essentially just an input with no guidance, needs complete reconstruction
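The scoring and interpretation above are mechanical enough to sketch in code. This is a minimal illustration of Steps 3-4 (the function names and band labels are ours, following the article's point values):

```python
# Points per component rating, as defined in Step 3.
POINTS = {"strong": 3, "weak": 1, "missing": 0}

COMPONENTS = ["role", "task", "context", "style", "constraints"]

def score_prompt(ratings):
    """ratings: dict mapping each component to 'strong', 'weak', or 'missing'."""
    return sum(POINTS[ratings[c]] for c in COMPONENTS)

def interpret(score):
    """Map a 0-15 total to the article's four diagnostic bands (Step 4)."""
    if score >= 12:
        return "Strong prompt - failures likely AI capability limits, not prompt issues"
    if score >= 8:
        return "Moderate prompt - identify and strengthen weak components"
    if score >= 4:
        return "Weak prompt - prioritize missing components first"
    return "Failed prompt - needs complete reconstruction"

# "Write a marketing email" scores 1/15: only task is weak, the rest missing.
ratings = {"role": "missing", "task": "weak", "context": "missing",
           "style": "missing", "constraints": "missing"}
total = score_prompt(ratings)
print(total, "-", interpret(total))  # 1 - Failed prompt - needs complete reconstruction
```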
Step 5: Prioritize fixes
Fix missing components before weak components. Missing components cause larger quality gaps.
Step 6: Modify only failing components
Don't rewrite the entire prompt. Change only diagnosed problems. This isolates whether your fix worked.
Step 7: Test single change
Generate again and evaluate. Did the specific change improve the specific issue? If yes, continue to next weak component. If no, try different approach to same component.
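Steps 5-7 boil down to an ordering rule: fix missing before weak, and within each group fix context first since context failures cascade most broadly. A sketch of that prioritization (the priority order follows the FAQ below; the function name is illustrative):

```python
# Fix order within each group: context failures cascade most broadly.
PRIORITY = ["context", "task", "role", "style", "constraints"]

def prioritize_fixes(ratings):
    """Return components to fix: missing first, then weak, each in priority order."""
    missing = [c for c in PRIORITY if ratings[c] == "missing"]
    weak = [c for c in PRIORITY if ratings[c] == "weak"]
    return missing + weak

ratings = {"role": "weak", "task": "strong", "context": "missing",
           "style": "missing", "constraints": "weak"}
print(prioritize_fixes(ratings))  # ['context', 'style', 'role', 'constraints']
```

Then apply Step 6 literally: change only the first component on the list, regenerate, and re-score before touching the next one.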
Diagnostic Examples
Example 1: Generic Business Email
Original prompt: "Write a marketing email"
Output received: Generic promotional email with no specific value proposition or audience targeting
Diagnostic evaluation:
Role: ❌ Missing (0 points) - No role specified
Task: ⚠️ Weak (1 point) - Format implied (email) but no structure detail
Context: ❌ Missing (0 points) - No audience, goal, or situation
Style: ❌ Missing (0 points) - No tone or voice direction
Constraints: ❌ Missing (0 points) - No length, format requirements, or prohibitions
Total score: 1/15 (Failed prompt)
Diagnosis: Prompt is essentially just an input. Needs complete reconstruction with all five components.
Reconstructed prompt:
"You are a B2B SaaS marketing manager [ROLE] writing to technical decision-makers who attended your product demo last week but haven't signed up for trial yet [CONTEXT].
Create a follow-up email with this structure [TASK]:
Subject line (under 50 characters)
Opening: Reference their specific demo attendance
Body: One key benefit they'd gain from trial (focus on time savings)
CTA: Clear trial signup link
Close: Offer to answer questions
Requirements [CONSTRAINTS]:
150 words maximum
Single clear CTA only
Professional but approachable tone [STYLE]
Avoid: Hype language, multiple CTAs, assuming they remember demo details, selling features they didn't express interest in"
Result: Specific, targeted email that addresses the actual need
Example 2: Vague Product Photo Request
Original prompt: "Professional product photo of coffee mug"
Output received: Random interpretation—overhead angle, busy background, wrong lighting
Diagnostic evaluation:
Role: ⚠️ Weak (1 point) - "Professional" too vague
Task: ⚠️ Weak (1 point) - Subject clear but composition not specified
Context: ❌ Missing (0 points) - No use case or platform
Style: ❌ Missing (0 points) - "Professional" is not a style reference
Constraints: ❌ Missing (0 points) - No aspect ratio, resolution, composition requirements
Total score: 2/15 (Failed prompt)
Diagnosis: Multiple missing components. Start with role and constraints.
Improved prompt:
"Product photography in Williams Sonoma catalog style [ROLE]: sage green ceramic coffee mug on white marble surface [TASK].
Composition [TASK continued]: Rule of thirds, mug positioned left of frame showing handle and interior, 60% negative space right side for text overlay [CONTEXT: e-commerce use].
Lighting: Soft directional from upper left creating gentle shadow right side, white seamless background RGB 255,255,255 [STYLE].
Technical specs [CONSTRAINTS]: 4:5 aspect ratio for Instagram product post, 2000x2000px minimum, sharp focus throughout, avoid: busy backgrounds, multiple objects, harsh shadows, colored backgrounds"
Result: Specific product photo suitable for e-commerce use
Example 3: Unclear Video Request
Original prompt: "Video showing app features"
Output received: Random screen recording with no narrative flow or clear demonstration
Diagnostic evaluation:
Role: ❌ Missing (0 points) - No cinematic or stylistic direction
Task: ⚠️ Weak (1 point) - "Showing features" is a vague sequence
Context: ❌ Missing (0 points) - No audience or purpose
Style: ❌ Missing (0 points) - No aesthetic reference
Constraints: ❌ Missing (0 points) - No duration, format, technical specs
Total score: 1/15 (Failed prompt)
Diagnosis: Needs complete reconstruction focusing on shot sequence and purpose.
Improved prompt:
"15-second app demo for landing page [CONTEXT] targeting skeptical enterprise buyers who need to see working product [CONTEXT continued], Apple product video aesthetic [ROLE].
Shot sequence [TASK]:
(0-5sec) Close-up: Hand opens app, clean interface visible
(5-10sec) Medium shot: Finger taps through key workflow demonstrating core feature
(10-15sec) Screen shows result, hand exits frame, lingers on successful outcome
Camera: Overhead angle, locked position throughout, soft natural lighting, warm color grade [STYLE].
Technical [CONSTRAINTS]: 16:9 for web embed, maintain app screen center-frame for mobile safety, smooth controlled motion (no jerky movements), 24fps, avoid: voice-over, music, text overlays during demo"
Result: Clear demonstration sequence suitable for landing page
Quick Diagnostic Shortcuts
For rapid evaluation without full component scoring.
The "Would Someone Else" Test
Read your prompt and ask: Would another person generate the same output type with this prompt?
If yes: Task specification is probably strong
If no: Task specification needs work
The "What's Missing" Test
Generate output and identify disconnects:
Wrong tone? → Check style direction
Wrong format? → Check task specification
Missed the point? → Check context loading
Generic? → Check role definition
Technically unusable? → Check constraints
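The symptom-to-component mapping above is deterministic, so it can be encoded directly as a lookup table. A minimal sketch (the symptom labels and fallback message are ours):

```python
# "What's Missing" test: map the observed disconnect to the component to check.
SYMPTOM_TO_COMPONENT = {
    "wrong tone": "style direction",
    "wrong format": "task specification",
    "missed the point": "context loading",
    "generic": "role definition",
    "technically unusable": "constraints",
}

def whats_missing(symptom):
    """Return the component to check, or fall back to full scoring."""
    return SYMPTOM_TO_COMPONENT.get(symptom.lower(), "run full five-component scoring")

print(whats_missing("Generic"))  # role definition
```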
The "One Change" Test
Change only one component and regenerate:
Improvement? → That component was the problem
No change? → Try different component
Worse? → Revert change, try different approach to same component
The "Score Drop" Test
If a previously working prompt suddenly fails:
Platform update changed defaults? → Check constraints
Different use case? → Check context
New audience? → Check style direction
Most prompt degradation comes from context or constraint changes, rarely from role or task issues.
Common Diagnostic Patterns
Pattern 1: High Score But Poor Results (12-15 points)
What it means: Prompt structure is strong but either:
AI capability limits reached
Wrong tool for task
Unrealistic expectations
What to check:
Is this task within AI capability?
Would different tool (ChatGPT vs Claude, Midjourney vs DALL-E) work better?
Are expectations achievable?
Fix approach: Consider tool change or expectation adjustment rather than prompt modification.
Pattern 2: Medium Score with Weak Style (8-11 points)
What it means: Basic structure present but aesthetic or tone missing.
Common in: Image and video prompts where "professional" or "high quality" is used without a concrete reference.
Fix approach: Add specific style references (photographer names, film references, publication styles).
Pattern 3: Low Score Missing Context (4-7 points)
What it means: Task specified but situation unclear.
Common in: Business writing and technical documentation where audience and purpose are missing.
Fix approach: Add context about audience, goals, constraints, and success criteria.
Pattern 4: Very Low Score, Essentially Input (0-3 points)
What it means: Prompt is just a request with no guidance.
Common in: First attempts before understanding prompt structure.
Fix approach: Complete reconstruction adding all five components.
For systematic improvement strategies beyond diagnosis, see AI Prompt Iteration & Optimization: How to Get First-Attempt Quality Every Time.
Building Your Diagnostic Habit
Make evaluation systematic through practice.
Create evaluation template:
Save this checklist for repeated use:
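One possible template, assembled from the five components and scoring levels described above (the wording is illustrative, not a prescribed format):

```
PROMPT DIAGNOSTIC CHECKLIST   (score each: Strong = 3, Weak = 1, Missing = 0)

[ ] Role Definition     - expertise, perspective, or style specified?
[ ] Task Specification  - format, structure, and scope explicit?
[ ] Context Loading     - audience, purpose, and success criteria provided?
[ ] Style Direction     - concrete tone/aesthetic references and anti-patterns?
[ ] Constraints         - technical specs, limits, and prohibitions stated?

Total: __ / 15
12-15 strong | 8-11 moderate | 4-7 weak | 0-3 reconstruct
```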
Review past successes:
When prompt works excellently, score it to understand which components made it work. Build intuition about what strong components look like.
Track failure patterns:
Notice which components you consistently score weak or missing. These are your personal blind spots to improve.
Compare before/after:
When systematic diagnosis fixes a problem, save both versions with notes about what changed and why it worked.
For cross-platform diagnostic techniques, see Cross-Platform AI Prompting 2026: Text, Image & Video Unified Framework.
Frequently Asked Questions
How long does systematic prompt diagnosis take?
2-3 minutes to score the five components once you're familiar with the framework. That's much faster than random iteration, which can waste 20-30 minutes changing elements without clear direction. The initial diagnosis investment pays off through targeted fixes that work on the first attempt.
Can I skip components if I know they're not the problem?
Yes, for rapid diagnosis on simple prompts, but a full evaluation catches non-obvious issues. Sometimes weak style direction creates output that looks like a task failure. A systematic check prevents misdiagnosis and wasted iteration on the wrong component.
What if multiple components score weak or missing?
Prioritize: Fix missing components first (biggest impact), then weak components. Within missing components, prioritize: Context → Task → Role → Style → Constraints. Context failures cascade most broadly.
Does the scoring system work across text, image, and video AI?
Yes, five-component framework applies universally—only implementation changes. Text Role = expertise, Image Role = style reference, Video Role = cinematic approach. Core diagnostic logic remains constant across modalities.
How do I know if problem is prompt quality vs AI capability limits?
High-scoring prompts (12-15 points) that still fail likely hit capability limits. Try same prompt on different tool (ChatGPT vs Claude, Midjourney vs DALL-E). If all tools fail, task may exceed current AI capability or need breaking into smaller pieces.
Should I diagnose before first attempt or only after failure?
A quick evaluation before submitting prevents obvious gaps. Most people write prompts, then realize context is missing. A 30-second pre-check asking "Do I have all five components?" saves iteration cycles. Full diagnostic scoring after a failure identifies the specific fixes needed.
What's the most commonly missing component?
Context, especially in text prompts. People specify what to create (task) but omit audience, purpose, constraints, and success criteria. "Write a report" omits who reads it and why. "Create a presentation" omits what decision it supports. Context grounds output in reality.
Can good prompts still produce bad outputs?
Yes. Even perfect prompts sometimes generate poor results due to AI randomness, training data gaps, or capability limits. One bad output doesn't mean the prompt failed—regenerate 2-3 times. Consistently poor results indicate a prompt issue diagnosable with the checklist.
Related Reading
Foundation:
The Prompt Anatomy Framework: Why 90% of AI Prompts Fail Across ChatGPT, Midjourney & Sora - Five-component framework underlying diagnosis
Tool-Specific Guides:
Best AI Prompts for ChatGPT, Claude & Gemini in 2026: Templates, Examples & Scorecard - Text AI prompting
Midjourney & DALL-E Image Prompts 2026: From Concept to Perfect Visual Output - Image generation
Sora & VEO Video AI Prompts 2026: Cinematic Storytelling Made Simple - Video generation
Optimization:
AI Prompt Iteration & Optimization: How to Get First-Attempt Quality Every Time - Beyond diagnosis to systematic improvement
Advanced Techniques:
Cross-Platform AI Prompting 2026: Text, Image & Video Unified Framework - Diagnostic across modalities
Role & Context in AI Prompts: Unlocking Expert-Level Outputs in Text, Image & Video AI - Deep dive on most impactful components
Style & Tone for AI Prompts: How to Communicate Like a Human Across ChatGPT, Midjourney & Sora - Style component mastery
Pitfall Prevention:
Avoiding Common AI Prompt Mistakes: Over-Constraining, Ambiguity & Context Assumptions - Common diagnostic patterns
Templates:
AI Prompt Templates Library 2026: Ready-to-Use Prompts for ChatGPT, Claude, Midjourney & Sora - Pre-scored prompts
www.topfreeprompts.com
Access 80,000+ professionally engineered prompts, each evaluated with the five-component diagnostic framework. Every prompt includes component scoring showing exactly why it works, helping you learn diagnostic evaluation through examples.



