7-dimension scorecard for any LLM output

Paste a prompt and an LLM output. Seven specialist evaluators run in parallel and return a calibrated scorecard with reasoning, red flags, and a suggested prompt fix. No accounts. Nothing stored.
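The fan-out described above can be sketched as a parallel map over the seven dimensions. This is a minimal illustration only: the dimension names and the evaluator stub are assumptions (the page names only groundedness), and a real run would call the Anthropic API inside each evaluator.

```python
import asyncio

# Hypothetical dimension names; the page only confirms there are seven
# and that one of them is groundedness.
DIMENSIONS = [
    "correctness", "groundedness", "completeness", "clarity",
    "safety", "instruction_following", "tone",
]

async def evaluate_dimension(dimension, prompt, output):
    """Stub for one specialist evaluator.

    A real implementation would send the prompt/output pair to an LLM
    with a dimension-specific rubric and parse its response.
    """
    await asyncio.sleep(0)  # stand-in for the network call
    return {"dimension": dimension, "score": None,
            "reasoning": "", "red_flags": []}

async def scorecard(prompt, output):
    """Run all seven evaluators concurrently and merge one scorecard."""
    results = await asyncio.gather(
        *(evaluate_dimension(d, prompt, output) for d in DIMENSIONS)
    )
    return {r["dimension"]: r for r in results}

card = asyncio.run(scorecard("user prompt", "model output"))
print(len(card))  # 7
```

Running the evaluators with `asyncio.gather` rather than sequentially is what keeps a full seven-dimension run in the few-second range.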

The system message used to generate the output, if any.

The user message that produced the output.

The output you want evaluated.

Source documents, transcripts, or RAG snippets. Providing these enables groundedness scoring.

Typical run: 4-8 s, ~$0.03. Limit: 5 runs per IP per day.

Do not paste sensitive data. Inputs are sent to Anthropic for evaluation and are not stored on this server.