OpenAI + Claude: score prompts with LMUnit clarity
You get three different answers from three different LLMs, and now you’re stuck. Which one is actually clearer? Which one rambles? Which one just “sounds smart” but doesn’t help?
Marketing leads feel this when approvals drag and copy keeps getting rewritten. Product managers hit it during spec reviews. And agency folks get it from clients who say “make it tighter” with zero guidance. This LMUnit scoring automation gives you a repeatable way to judge responses for clarity and conciseness without debating opinions for an hour.
You’ll set up an n8n workflow that sends the same prompt to OpenAI, Claude, and Gemini, scores each reply with Contextual AI’s LMUnit (1–5), and returns a clean winner-style summary you can actually act on.
How This Automation Works
Here’s the complete workflow you’ll be setting up:
n8n Workflow Template: OpenAI + Claude: score prompts with LMUnit clarity
flowchart LR
subgraph sg0["When chat message received Flow"]
direction LR
n0@{ icon: "mdi:play-circle", form: "rounded", label: "When chat message received", pos: "b", h: 48 }
n1@{ icon: "mdi:cog", form: "rounded", label: "Run LMUnit", pos: "b", h: 48 }
n2@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess OpenAI Response", pos: "b", h: 48 }
n3@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess Gemini Response", pos: "b", h: 48 }
n4["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/merge.svg' width='40' height='40' /></div><br/>Combine responses"]
n5["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Add unit tests to responses"]
n6@{ icon: "mdi:swap-vertical", form: "rounded", label: "Iterate over each unit tests", pos: "b", h: 48 }
n7@{ icon: "mdi:cog", form: "rounded", label: "Wait for 3 sec", pos: "b", h: 48 }
n8@{ icon: "mdi:swap-vertical", form: "rounded", label: "Associate scores with Responses", pos: "b", h: 48 }
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Group Results Together"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Format Final Result"]
n11@{ icon: "mdi:robot", form: "rounded", label: "Final Response", pos: "b", h: 48 }
n12@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess Anthropic Response", pos: "b", h: 48 }
n13@{ icon: "mdi:robot", form: "rounded", label: "OpenAI GPT 4.1", pos: "b", h: 48 }
n14@{ icon: "mdi:robot", form: "rounded", label: "Gemini 2.5 Flash", pos: "b", h: 48 }
n15@{ icon: "mdi:robot", form: "rounded", label: "Claude 4.5 Sonnet", pos: "b", h: 48 }
n1 --> n8
n13 --> n2
n7 --> n1
n14 --> n3
n15 --> n12
n4 --> n5
n10 --> n11
n9 --> n6
n3 --> n4
n2 --> n4
n0 --> n13
n0 --> n14
n0 --> n15
n5 --> n6
n6 --> n10
n6 --> n7
n12 --> n4
n8 --> n9
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n0 trigger
class n11,n13,n14,n15 ai
class n5,n9,n10 code
Why This Matters: Consistent LLM Quality Without Guesswork
Manually comparing model outputs is deceptively expensive. You paste the same prompt into OpenAI, Claude, and Gemini, then you try to “judge” the results like a panel of one. It’s slow, it’s inconsistent, and honestly it’s easy to pick the answer that matches your mood instead of the one that’s clearest. The worst part is the rework: unclear responses turn into more follow-up prompts, more editing, and more time lost in review loops when teams can’t agree on what “good” looks like.
It adds up fast. Here’s where it usually breaks down in real teams.
- You run the same prompt across models, but the comparison lives in scattered tabs and gets lost by tomorrow.
- Human scoring is inconsistent, so you can’t track improvements as prompts evolve.
- Traditional metrics don’t capture clarity or redundancy, which means “better” becomes a subjective argument.
- Without a standard, prompt libraries rot because nobody trusts the results enough to reuse them.
What You’ll Build: A Multi-LLM Prompt Scoring Workflow
This workflow turns model comparison into something you can repeat on demand. A chat message kicks everything off: you submit one prompt, once. n8n sends that exact prompt to OpenAI, Claude (Anthropic), and Gemini so the test is fair. Each response is normalized into a consistent structure (so you’re not comparing apples to weirdly formatted oranges). Then Contextual AI’s LMUnit evaluates every reply against clear criteria, like “Is this easy to understand?” and “Is it concise without repeating itself?” Finally, n8n aggregates the scores and sends back a structured summary showing each model’s performance and the overall averages.
The workflow starts from an inbound chat trigger, so it feels instant when you use it day-to-day. From there, it generates three model replies, batches them through the LMUnit tests with a short pause to avoid flaky results, and returns a winner-style message you can share or log.
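To make that concrete, here is roughly the shape each reply is reduced to once it has been normalized and scored. The field names are this guide’s assumptions about the template, not guaranteed keys:

```javascript
// One normalized reply paired with one LMUnit unit test (illustrative shape only).
const scoredItem = {
  provider: "Claude",                                          // which model produced the reply
  response: "Here is a tighter version of your copy…",         // the model's answer, as plain text
  unit_test: "Is the response clear and easy to understand?",  // the criterion LMUnit evaluates
  score: 4.2,                                                  // LMUnit's 1–5 rating for this criterion
};
```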
What You’re Building
| What Gets Automated | What You’ll Achieve |
|---|---|
| Sending one prompt to OpenAI, Claude, and Gemini in parallel | A fair, like-for-like comparison from a single chat message |
| LMUnit scoring of each reply for clarity and conciseness (1–5) | Consistent, repeatable scores instead of subjective debate |
| Aggregating scores into a winner-style summary | A ranked result you can share, log, or act on right away |
Expected Results
Say you review 10 prompts a week for landing pages, support macros, or internal docs. Manually, you’ll usually spend about 10 minutes per prompt to run three models, copy replies into a doc, and argue about what’s “clear,” which adds up to roughly an hour and a half. With this workflow, you send the prompt once, wait maybe half a minute for responses and scoring, and get a ranked summary back. That’s well over an hour back each week, and the comparison is actually consistent.
Before You Start
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Contextual AI (LMUnit) for scoring clarity and conciseness.
- OpenAI to generate a model response.
- Anthropic (Claude) to generate a second model response.
- Google Gemini to generate a third model response.
- CONTEXTUALAI_API_KEY (get it from your Contextual AI account dashboard)
- OpenAI API key (get it from platform.openai.com/account/api-keys)
- Anthropic API key (get it from console.anthropic.com/settings/keys)
- Gemini API key (get it from ai.google.dev)
Skill level: Intermediate. You’ll connect a few credentials, paste keys, and adjust simple evaluation criteria.
Want someone to build this for you? Talk to an automation expert (free 15-minute consultation).
Step by Step
A chat message triggers the run. You submit a prompt through the workflow’s chat interface (and you can also connect Telegram if that’s how your team works day to day).
Three LLMs answer the same prompt. n8n calls OpenAI, Anthropic Claude, and Google Gemini in parallel, so you’re comparing responses generated from identical input.
Responses get cleaned up and prepared for scoring. The workflow normalizes each output into a consistent structure, merges the replies, then attaches the evaluation checks used by LMUnit.
LMUnit evaluates and a summary comes back. Items are processed in batches, there’s a short wait to keep API calls stable, scores are mapped back to each model, and you receive a formatted “who won and why” message in chat.
You can easily modify the evaluation criteria to include tone, completeness, or factual accuracy based on your needs. See the full implementation guide below for customization options.
Step-by-Step Implementation Guide
Step 1: Configure the Chat Trigger
Set up the inbound chat entry point that starts the workflow and routes the user message into parallel model evaluations.
- Add the Inbound Chat Trigger node as your trigger.
- Open Inbound Chat Trigger and confirm the Options include `responseMode: responseNodes` so the response is sent by the reply node.
- Leave credentials empty (none required for Inbound Chat Trigger).
Step 2: Connect LLM Providers in Parallel
Configure the three model builders that run simultaneously from the same chat input.
- Connect Inbound Chat Trigger to OpenAI Response Builder, Gemini Response Builder, and Anthropic Response Builder so all three branches run in parallel from the same chat input.
- In OpenAI Response Builder, set Model to `gpt-4.1` and set the first Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your openAiApi credentials in OpenAI Response Builder.
- In Gemini Response Builder, set Model to `models/gemini-2.5-flash` and set Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your googlePalmApi credentials in Gemini Response Builder.
- In Anthropic Response Builder, set Model to `claude-sonnet-4-5-20250929` and set Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your anthropicApi credentials in Anthropic Response Builder.
⚠️ Common Pitfall: Ensure all three model nodes are connected in parallel from Inbound Chat Trigger; otherwise Merge Model Replies will wait indefinitely for missing branches.
Step 3: Normalize and Merge Model Responses
Standardize each model’s response payload and merge all three into a single stream for evaluation.
- In Normalize OpenAI Output, set provider to `OpenAI` and response to `{{ $json.message.content }}`.
- In Normalize Gemini Output, set provider to `Gemini` and response to `{{ $json.content.parts[0].text }}`.
- In Normalize Anthropic Output, set provider to `Anthropic` and response to `{{ $json.content[0].text }}`.
- Connect all three normalize nodes into Merge Model Replies and set Number of Inputs to `3`. (A Code-node sketch of the same normalization follows this list.)
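For reference, the Anthropic branch effectively maps the raw API payload down to two fields. A Code-node equivalent of Normalize Anthropic Output might look like this; treat it as a sketch rather than the template’s exact implementation:

```javascript
// Claude's API returns the answer text in content[0].text. We reduce every item
// to the same two-field shape the other branches produce, so the merge and
// scoring steps can treat all three providers identically.
return $input.all().map((item) => ({
  json: {
    provider: "Anthropic",
    response: item.json.content[0].text,
  },
}));
```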
Step 4: Attach Tests, Score, and Batch
Generate evaluation checks for each response, batch them, and score each test with Contextual AI.
- In Attach Evaluation Checks, keep the provided JavaScript that expands each response into tests like `Is the response clear and easy to understand?` (a sketch of this node appears after the tip below).
- Send output to Batch Through Tests to process tests sequentially; keep the default batch options.
- From Batch Through Tests, connect one output to Compose Summary Message and the other to Pause Three Seconds to throttle scoring.
- In Pause Three Seconds, set Amount to `3` seconds, then connect to Execute LMUnit Scoring.
- In Execute LMUnit Scoring, set Resource to `LMUnit`, Query to `{{ $('Inbound Chat Trigger').first().json.chatInput }}`, Response to `{{ $json.response }}`, and Unit Test to `{{ $json.unit_test }}`.
- Credential Required: Connect your contextualAiApi credentials in Execute LMUnit Scoring.
- In Map Scores to Replies, map provider to `{{ $('Attach Evaluation Checks').item.json.provider }}`, response to `{{ $('Attach Evaluation Checks').item.json.response }}`, unit_test to `{{ $('Attach Evaluation Checks').item.json.unit_test }}`, and score to `{{ $json.score }}`.
Tip: The Pause Three Seconds node helps avoid rate limits during scoring bursts.
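For orientation, the JavaScript inside Attach Evaluation Checks does something along these lines. The template ships its own version; the exact test wording and field names here are assumptions:

```javascript
// Expand each merged model reply into one item per unit test, so LMUnit can
// score every (response, criterion) pair on its own.
const unitTests = [
  "Is the response clear and easy to understand?",
  "Is the response concise and free of unnecessary repetition?",
];

const out = [];
for (const item of $input.all()) {
  for (const unit_test of unitTests) {
    out.push({
      json: {
        provider: item.json.provider,  // carried through for the final grouping
        response: item.json.response,  // the model answer being evaluated
        unit_test,                     // the criterion LMUnit scores from 1 to 5
      },
    });
  }
}
return out;
```

In this sketch, adding a new criterion is just another string in the array; the rest of the workflow picks it up automatically.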
Step 5: Aggregate Scores and Deliver the Reply
Aggregate test results per provider, format a summary message, and reply to the chat.
- In Aggregate Evaluation Sets, keep the provided JavaScript that groups results by provider and unit_test.
- Connect Map Scores to Replies to Aggregate Evaluation Sets, then back to Batch Through Tests to continue the loop until all tests are processed.
- In Compose Summary Message, keep the JavaScript that builds the final message string in `message` (a combined sketch of the grouping and formatting logic follows this list).
- In Deliver Chat Reply, set Message to `{{ $json.message }}` so the user receives the evaluation report.
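Here is a compact sketch of what Aggregate Evaluation Sets and Compose Summary Message do together. It assumes each incoming item carries provider, unit_test, and score; the template’s own code may differ in the details:

```javascript
// Group scored items by provider, compute each provider's average, and build
// the chat-ready summary string, highest average first.
const byProvider = {};
for (const item of $input.all()) {
  const { provider, unit_test, score } = item.json;
  if (!byProvider[provider]) byProvider[provider] = [];
  byProvider[provider].push({ unit_test, score });
}

const ranked = Object.entries(byProvider)
  .map(([provider, tests]) => ({
    provider,
    tests,
    average: tests.reduce((sum, t) => sum + t.score, 0) / tests.length,
  }))
  .sort((a, b) => b.average - a.average);

let message = "Here are the evaluation results:\n";
for (const entry of ranked) {
  message += `\n${entry.provider}: average ${entry.average.toFixed(2)} / 5\n`;
  for (const t of entry.tests) {
    message += `  - ${t.unit_test}: ${t.score}\n`;
  }
}

return [{ json: { message } }];
```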
Step 6: Test and Activate Your Workflow
Validate end-to-end behavior in the editor, then activate the workflow for production use.
- Click Execute Workflow and send a test message to Inbound Chat Trigger.
- Confirm that OpenAI Response Builder, Gemini Response Builder, and Anthropic Response Builder all return outputs and that Merge Model Replies completes.
- Verify scoring runs through Execute LMUnit Scoring and that Deliver Chat Reply returns a message starting with `Here are the evaluation results:`.
- Toggle the workflow to Active to enable production execution.
Troubleshooting Tips
- Contextual AI credentials can expire or be mis-scoped. If scoring suddenly fails, check the Contextual AI credential in n8n first and confirm the CONTEXTUALAI_API_KEY is still valid.
- If you’re using Wait nodes or external model endpoints, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.
Quick Answers
How long does setup take?
About 30 minutes if you already have your API keys.
Do I need to know how to code?
No. You’ll connect accounts, paste API keys, and tweak the LMUnit criteria text.
Is this free to run?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in OpenAI/Anthropic/Gemini usage fees and Contextual AI plan limits.
Where should I host it?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I customize the workflow?
Yes, and you should. You can add criteria inside the “Execute LMUnit Scoring” step (for example: tone, completeness, or factual accuracy). You can also duplicate the “OpenAI Response Builder”, “Anthropic Response Builder”, or “Gemini Response Builder” steps to include more providers. If you want different reporting, adjust the “Compose Summary Message” logic to output JSON, a table, or a dashboard-friendly block.
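If you follow the Step 4 sketch above, adding criteria is just a matter of extending the unit-test array. The wording below is an example, not the template’s defaults:

```javascript
// Extending the evaluation criteria: each extra string becomes another LMUnit
// unit test applied to every model reply.
const unitTests = [
  "Is the response clear and easy to understand?",
  "Is the response concise and free of unnecessary repetition?",
  "Does the response match a professional, friendly tone?",     // tone
  "Does the response fully address every part of the prompt?",  // completeness
  "Are all factual claims in the response accurate?",           // factual accuracy
];
```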
What if LMUnit scoring fails?
Usually it’s a bad or expired CONTEXTUALAI_API_KEY. Regenerate the key in your Contextual AI account, then update the credential in n8n. If it still fails, check that your workspace has access to LMUnit and that you haven’t hit a plan limit. One more thing: if you run a lot of tests back-to-back, a short wait between batches helps avoid temporary errors.
How many prompts can it handle?
On n8n Cloud Starter, you can run a healthy number of workflows each month, and self-hosting removes execution limits (your server becomes the constraint). Practically, this one evaluates prompts in batches, so most teams can process dozens of prompts per hour as long as the model APIs and LMUnit aren’t rate-limiting. If you need high volume, increase the wait time slightly and consider running it on a larger VPS.
Is n8n better than Zapier or Make for this?
Often, yes. You’re coordinating three model calls, normalizing outputs, batching tests, and aggregating results, which is the kind of branching and data-shaping that gets clunky (and pricey) fast in Zapier. n8n also gives you a self-hosting path when volume grows, plus it’s easier to insert waits, retries, and “if this fails, do that” logic. Zapier or Make can still work if you only want a simple “send prompt, get one answer” flow, but that’s not what this workflow is doing. Talk to an automation expert if you want help picking the right stack.
Once this is running, prompt reviews stop being a vibes-based argument and start looking like a simple scoreboard. The workflow handles the repetitive judging so you can focus on improving what you ask, and what you ship.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.