7-dimension scorecard for any LLM output

Paste a prompt and an LLM output. Seven specialist evaluators run in parallel and return a calibrated scorecard with reasoning, red flags, and a suggested prompt fix. No accounts. Nothing stored.
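The fan-out described above can be sketched as a parallel map over the seven dimensions. This is a minimal illustration only: the dimension names and the evaluator stub are assumptions (the page names only groundedness), and a real run would call the Anthropic API inside each evaluator.

```python
import asyncio

# Hypothetical dimension names; the page only confirms there are seven
# and that one of them is groundedness.
DIMENSIONS = [
    "correctness", "groundedness", "completeness", "clarity",
    "safety", "instruction_following", "tone",
]

async def evaluate_dimension(dimension, prompt, output):
    """Stub for one specialist evaluator.

    A real implementation would send the prompt/output pair to an LLM
    with a dimension-specific rubric and parse its response.
    """
    await asyncio.sleep(0)  # stand-in for the network call
    return {"dimension": dimension, "score": None,
            "reasoning": "", "red_flags": []}

async def scorecard(prompt, output):
    """Run all seven evaluators concurrently and merge one scorecard."""
    results = await asyncio.gather(
        *(evaluate_dimension(d, prompt, output) for d in DIMENSIONS)
    )
    return {r["dimension"]: r for r in results}

card = asyncio.run(scorecard("user prompt", "model output"))
print(len(card))  # 7
```

Running the evaluators with `asyncio.gather` rather than sequentially is what keeps a full seven-dimension run in the few-second range.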

The system message used to generate the output, if any.

The user message that produced the output.

The output you want evaluated.

Source documents, transcripts, or RAG snippets. Providing these enables groundedness scoring.

Typical run: 4-8 s, ~$0.03. Limit: 5 runs per IP per day.

Do not paste sensitive data. Inputs are sent to Anthropic for evaluation and are not stored on this server.