OpenAI + Claude: score prompts with LMUnit clarity
You get three different answers from three different LLMs, and now you’re stuck. Which one is actually clearer? Which one rambles? Which one just “sounds smart” but doesn’t help?
Marketing leads feel this when approvals drag and copy keeps getting rewritten. Product managers hit it during spec reviews. And agency folks get it from clients who say “make it tighter” with zero guidance. This LMUnit scoring automation gives you a repeatable way to judge responses for clarity and conciseness without debating opinions for an hour.
You’ll set up an n8n workflow that sends the same prompt to OpenAI, Claude, and Gemini, scores each reply with Contextual AI’s LMUnit (1–5), and returns a clean winner-style summary you can actually act on.
How This Automation Works
Here’s the complete workflow you’ll be setting up:
n8n Workflow Template: OpenAI + Claude: score prompts with LMUnit clarity
flowchart LR
subgraph sg0["When chat message received Flow"]
direction LR
n0@{ icon: "mdi:play-circle", form: "rounded", label: "When chat message received", pos: "b", h: 48 }
n1@{ icon: "mdi:cog", form: "rounded", label: "Run LMUnit", pos: "b", h: 48 }
n2@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess OpenAI Response", pos: "b", h: 48 }
n3@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess Gemini Response", pos: "b", h: 48 }
n4["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/merge.svg' width='40' height='40' /></div><br/>Combine responses"]
n5["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Add unit tests to responses"]
n6@{ icon: "mdi:swap-vertical", form: "rounded", label: "Iterate over each unit tests", pos: "b", h: 48 }
n7@{ icon: "mdi:cog", form: "rounded", label: "Wait for 3 sec", pos: "b", h: 48 }
n8@{ icon: "mdi:swap-vertical", form: "rounded", label: "Associate scores with Responses", pos: "b", h: 48 }
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Group Results Together"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Format Final Result"]
n11@{ icon: "mdi:robot", form: "rounded", label: "Final Response", pos: "b", h: 48 }
n12@{ icon: "mdi:swap-vertical", form: "rounded", label: "Preprocess Anthropic Response", pos: "b", h: 48 }
n13@{ icon: "mdi:robot", form: "rounded", label: "OpenAI GPT 4.1", pos: "b", h: 48 }
n14@{ icon: "mdi:robot", form: "rounded", label: "Gemini 2.5 Flash", pos: "b", h: 48 }
n15@{ icon: "mdi:robot", form: "rounded", label: "Claude 4.5 Sonnet", pos: "b", h: 48 }
n1 --> n8
n13 --> n2
n7 --> n1
n14 --> n3
n15 --> n12
n4 --> n5
n10 --> n11
n9 --> n6
n3 --> n4
n2 --> n4
n0 --> n13
n0 --> n14
n0 --> n15
n5 --> n6
n6 --> n10
n6 --> n7
n12 --> n4
n8 --> n9
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n0 trigger
class n11,n13,n14,n15 ai
class n5,n9,n10 code
Why This Matters: Consistent LLM Quality Without Guesswork
Manually comparing model outputs is deceptively expensive. You paste the same prompt into OpenAI, Claude, and Gemini, then you try to “judge” the results like a panel of one. It’s slow, it’s inconsistent, and honestly it’s easy to pick the answer that matches your mood instead of the one that’s clearest. The worst part is the rework: unclear responses turn into more follow-up prompts, more editing, and more time lost in review loops when teams can’t agree on what “good” looks like.
It adds up fast. Here’s where it usually breaks down in real teams.
- You run the same prompt across models, but the comparison lives in scattered tabs and gets lost by tomorrow.
- Human scoring is inconsistent, so you can’t track improvements as prompts evolve.
- Traditional metrics don’t capture clarity or redundancy, which means “better” becomes a subjective argument.
- Without a standard, prompt libraries rot because nobody trusts the results enough to reuse them.
What You’ll Build: A Multi-LLM Prompt Scoring Workflow
This workflow turns model comparison into something you can repeat on demand. A chat message kicks everything off: you submit one prompt, once. n8n sends that exact prompt to OpenAI, Claude (Anthropic), and Gemini so the test is fair. Each response is normalized into a consistent structure (so you’re not comparing apples to weirdly formatted oranges). Then Contextual AI’s LMUnit evaluates every reply against clear criteria, like “Is this easy to understand?” and “Is it concise without repeating itself?” Finally, n8n aggregates the scores and sends back a structured summary showing each model’s performance and the overall averages.
The workflow starts from an inbound chat trigger, so it feels instant when you use it day-to-day. From there, it generates three model replies, batches them through the LMUnit tests with a short pause to avoid flaky results, and returns a winner-style message you can share or log.
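To make that concrete, here is roughly the shape each reply is reduced to once it has been normalized and scored. The field names are this guide’s assumptions about the template, not guaranteed keys:

```javascript
// One normalized reply paired with one LMUnit unit test (illustrative shape only).
const scoredItem = {
  provider: "Claude",                                          // which model produced the reply
  response: "Here is a tighter version of your copy…",         // the model's answer, as plain text
  unit_test: "Is the response clear and easy to understand?",  // the criterion LMUnit evaluates
  score: 4.2,                                                  // LMUnit's 1–5 rating for this criterion
};
```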
What You’re Building
| What Gets Automated | What You’ll Achieve |
|---|---|
| Sending one prompt to OpenAI, Claude, and Gemini in parallel | A fair, like-for-like comparison from a single chat message |
| LMUnit scoring of each reply for clarity and conciseness (1–5) | Consistent, repeatable scores instead of subjective debate |
| Aggregating scores into a winner-style summary | A ranked result you can share, log, or act on right away |
Expected Results
Say you review 10 prompts a week for landing pages, support macros, or internal docs. Manually, you’ll usually spend about 10 minutes per prompt to run three models, copy replies into a doc, and argue about what’s “clear,” which adds up to roughly an hour and a half. With this workflow, you send the prompt once, wait maybe half a minute for responses and scoring, and get a ranked summary back. That’s well over an hour back each week, and the comparison is actually consistent.
Before You Start
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Contextual AI (LMUnit) for scoring clarity and conciseness.
- OpenAI to generate a model response.
- Anthropic (Claude) to generate a second model response.
- Google Gemini to generate a third model response.
- CONTEXTUALAI_API_KEY (get it from your Contextual AI account dashboard)
- OpenAI API key (get it from platform.openai.com/account/api-keys)
- Anthropic API key (get it from console.anthropic.com/settings/keys)
- Gemini API key (get it from ai.google.dev)
Skill level: Intermediate. You’ll connect a few credentials, paste keys, and adjust simple evaluation criteria.
Want someone to build this for you? Talk to an automation expert (free 15-minute consultation).
Step by Step
A chat message triggers the run. You submit a prompt through the workflow’s chat interface (and you can also connect Telegram if that’s how your team works day to day).
Three LLMs answer the same prompt. n8n calls OpenAI, Anthropic Claude, and Google Gemini in parallel, so you’re comparing responses generated from identical input.
Responses get cleaned up and prepared for scoring. The workflow normalizes each output into a consistent structure, merges the replies, then attaches the evaluation checks used by LMUnit.
LMUnit evaluates and a summary comes back. Items are processed in batches, there’s a short wait to keep API calls stable, scores are mapped back to each model, and you receive a formatted “who won and why” message in chat.
You can easily modify the evaluation criteria to include tone, completeness, or factual accuracy based on your needs. See the full implementation guide below for customization options.
Step-by-Step Implementation Guide
Step 1: Configure the Chat Trigger
Set up the inbound chat entry point that starts the workflow and routes the user message into parallel model evaluations.
- Add the Inbound Chat Trigger node as your trigger.
- Open Inbound Chat Trigger and confirm the Options include `responseMode: responseNodes` so the response is sent by the reply node.
- Leave credentials empty (none required for Inbound Chat Trigger).
Step 2: Connect LLM Providers in Parallel
Configure the three model builders that run simultaneously from the same chat input.
- Connect Inbound Chat Trigger to OpenAI Response Builder, Gemini Response Builder, and Anthropic Response Builder so all three branches run in parallel from the same chat input.
- In OpenAI Response Builder, set Model to `gpt-4.1` and set the first Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your openAiApi credentials in OpenAI Response Builder.
- In Gemini Response Builder, set Model to `models/gemini-2.5-flash` and set Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your googlePalmApi credentials in Gemini Response Builder.
- In Anthropic Response Builder, set Model to `claude-sonnet-4-5-20250929` and set Messages → Content to `{{ $json.chatInput }}`.
- Credential Required: Connect your anthropicApi credentials in Anthropic Response Builder.
⚠️ Common Pitfall: Ensure all three model nodes are connected in parallel from Inbound Chat Trigger; otherwise Merge Model Replies will wait indefinitely for missing branches.
Step 3: Normalize and Merge Model Responses
Standardize each model’s response payload and merge all three into a single stream for evaluation.
- In Normalize OpenAI Output, set provider to `OpenAI` and response to `{{ $json.message.content }}`.
- In Normalize Gemini Output, set provider to `Gemini` and response to `{{ $json.content.parts[0].text }}`.
- In Normalize Anthropic Output, set provider to `Anthropic` and response to `{{ $json.content[0].text }}`.
- Connect all three normalize nodes into Merge Model Replies and set Number of Inputs to `3`. (A Code-node sketch of the same normalization follows this list.)
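For reference, the Anthropic branch effectively maps the raw API payload down to two fields. A Code-node equivalent of Normalize Anthropic Output might look like this; treat it as a sketch rather than the template’s exact implementation:

```javascript
// Claude's API returns the answer text in content[0].text. We reduce every item
// to the same two-field shape the other branches produce, so the merge and
// scoring steps can treat all three providers identically.
return $input.all().map((item) => ({
  json: {
    provider: "Anthropic",
    response: item.json.content[0].text,
  },
}));
```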
Step 4: Attach Tests, Score, and Batch
Generate evaluation checks for each response, batch them, and score each test with Contextual AI.
- In Attach Evaluation Checks, keep the provided JavaScript that expands each response into tests like `Is the response clear and easy to understand?` (a sketch of this node appears after the tip below).
- Send output to Batch Through Tests to process tests sequentially; keep the default batch options.
- From Batch Through Tests, connect one output to Compose Summary Message and the other to Pause Three Seconds to throttle scoring.
- In Pause Three Seconds, set Amount to `3` seconds, then connect to Execute LMUnit Scoring.
- In Execute LMUnit Scoring, set Resource to `LMUnit`, Query to `{{ $('Inbound Chat Trigger').first().json.chatInput }}`, Response to `{{ $json.response }}`, and Unit Test to `{{ $json.unit_test }}`.
- Credential Required: Connect your contextualAiApi credentials in Execute LMUnit Scoring.
- In Map Scores to Replies, map provider to `{{ $('Attach Evaluation Checks').item.json.provider }}`, response to `{{ $('Attach Evaluation Checks').item.json.response }}`, unit_test to `{{ $('Attach Evaluation Checks').item.json.unit_test }}`, and score to `{{ $json.score }}`.
Tip: The Pause Three Seconds node helps avoid rate limits during scoring bursts.
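For orientation, the JavaScript inside Attach Evaluation Checks does something along these lines. The template ships its own version; the exact test wording and field names here are assumptions:

```javascript
// Expand each merged model reply into one item per unit test, so LMUnit can
// score every (response, criterion) pair on its own.
const unitTests = [
  "Is the response clear and easy to understand?",
  "Is the response concise and free of unnecessary repetition?",
];

const out = [];
for (const item of $input.all()) {
  for (const unit_test of unitTests) {
    out.push({
      json: {
        provider: item.json.provider,  // carried through for the final grouping
        response: item.json.response,  // the model answer being evaluated
        unit_test,                     // the criterion LMUnit scores from 1 to 5
      },
    });
  }
}
return out;
```

In this sketch, adding a new criterion is just another string in the array; the rest of the workflow picks it up automatically.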
Step 5: Aggregate Scores and Deliver the Reply
Aggregate test results per provider, format a summary message, and reply to the chat.
- In Aggregate Evaluation Sets, keep the provided JavaScript that groups results by provider and unit_test.
- Connect Map Scores to Replies to Aggregate Evaluation Sets, then back to Batch Through Tests to continue the loop until all tests are processed.
- In Compose Summary Message, keep the JavaScript that builds the final message string in `message` (a combined sketch of the grouping and formatting logic follows this list).
- In Deliver Chat Reply, set Message to `{{ $json.message }}` so the user receives the evaluation report.
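Here is a compact sketch of what Aggregate Evaluation Sets and Compose Summary Message do together. It assumes each incoming item carries provider, unit_test, and score; the template’s own code may differ in the details:

```javascript
// Group scored items by provider, compute each provider's average, and build
// the chat-ready summary string, highest average first.
const byProvider = {};
for (const item of $input.all()) {
  const { provider, unit_test, score } = item.json;
  if (!byProvider[provider]) byProvider[provider] = [];
  byProvider[provider].push({ unit_test, score });
}

const ranked = Object.entries(byProvider)
  .map(([provider, tests]) => ({
    provider,
    tests,
    average: tests.reduce((sum, t) => sum + t.score, 0) / tests.length,
  }))
  .sort((a, b) => b.average - a.average);

let message = "Here are the evaluation results:\n";
for (const entry of ranked) {
  message += `\n${entry.provider}: average ${entry.average.toFixed(2)} / 5\n`;
  for (const t of entry.tests) {
    message += `  - ${t.unit_test}: ${t.score}\n`;
  }
}

return [{ json: { message } }];
```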
Step 6: Test and Activate Your Workflow
Validate end-to-end behavior in the editor, then activate the workflow for production use.
- Click Execute Workflow and send a test message to Inbound Chat Trigger.
- Confirm that OpenAI Response Builder, Gemini Response Builder, and Anthropic Response Builder all return outputs and that Merge Model Replies completes.
- Verify scoring runs through Execute LMUnit Scoring and that Deliver Chat Reply returns a message starting with `Here are the evaluation results:`.
- Toggle the workflow to Active to enable production execution.
Troubleshooting Tips
- Contextual AI credentials can expire or be mis-scoped. If scoring suddenly fails, check the Contextual AI credential in n8n first and confirm the CONTEXTUALAI_API_KEY is still valid.
- If you’re using Wait nodes or external model endpoints, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.
Quick Answers
How long does setup take?
About 30 minutes if you already have your API keys.
Do I need to know how to code?
No. You’ll connect accounts, paste API keys, and tweak the LMUnit criteria text.
Is this free to run?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in OpenAI/Anthropic/Gemini usage fees and Contextual AI plan limits.
Where should I host it?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I customize the workflow?
Yes, and you should. You can add criteria inside the “Execute LMUnit Scoring” step (for example: tone, completeness, or factual accuracy). You can also duplicate the “OpenAI Response Builder”, “Anthropic Response Builder”, or “Gemini Response Builder” steps to include more providers. If you want different reporting, adjust the “Compose Summary Message” logic to output JSON, a table, or a dashboard-friendly block.
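If you follow the Step 4 sketch above, adding criteria is just a matter of extending the unit-test array. The wording below is an example, not the template’s defaults:

```javascript
// Extending the evaluation criteria: each extra string becomes another LMUnit
// unit test applied to every model reply.
const unitTests = [
  "Is the response clear and easy to understand?",
  "Is the response concise and free of unnecessary repetition?",
  "Does the response match a professional, friendly tone?",     // tone
  "Does the response fully address every part of the prompt?",  // completeness
  "Are all factual claims in the response accurate?",           // factual accuracy
];
```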
What if LMUnit scoring fails?
Usually it’s a bad or expired CONTEXTUALAI_API_KEY. Regenerate the key in your Contextual AI account, then update the credential in n8n. If it still fails, check that your workspace has access to LMUnit and that you haven’t hit a plan limit. One more thing: if you run a lot of tests back-to-back, a short wait between batches helps avoid temporary errors.
How many prompts can it handle?
On n8n Cloud Starter, you can run a healthy number of workflows each month, and self-hosting removes execution limits (your server becomes the constraint). Practically, this one evaluates prompts in batches, so most teams can process dozens of prompts per hour as long as the model APIs and LMUnit aren’t rate-limiting. If you need high volume, increase the wait time slightly and consider running it on a larger VPS.
Is n8n better than Zapier or Make for this?
Often, yes. You’re coordinating three model calls, normalizing outputs, batching tests, and aggregating results, which is the kind of branching and data-shaping that gets clunky (and pricey) fast in Zapier. n8n also gives you a self-hosting path when volume grows, plus it’s easier to insert waits, retries, and “if this fails, do that” logic. Zapier or Make can still work if you only want a simple “send prompt, get one answer” flow, but that’s not what this workflow is doing. Talk to an automation expert if you want help picking the right stack.
Once this is running, prompt reviews stop being a vibes-based argument and start looking like a simple scoreboard. The workflow handles the repetitive judging so you can focus on improving what you ask, and what you ship.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.