Agent Quality Evaluation¶

This chapter covers the screen that measures response quality and safety in batches for built agents. The Agent Creation → Agent Quality Evaluation menu in the left sidebar is in scope.

For the single-input debugging flow inside the canvas, see Agent Operations · Execution and Debugging. This chapter focuses on batch evaluation — running many inputs at once and scoring them.

Accessing the Screen¶

Select Agent Creation → Agent Quality Evaluation in the left sidebar. The screen has two tabs at the top.

Tab	Covers
품질 평가 (Quality Evaluation)	Test execution history + start a new test (includes Harmbench safety evaluation)
품질 척도 정의 (Quality Criteria)	Manage criteria presets used to score responses

Common top-right actions: + New Test, Harmbench, Refresh.

Quality Evaluation Tab¶

Narrow the history with the top status filters (All / Running / Completed / Error) and the Search by agentflow box. When no tests exist, the body shows an empty state — "테스트 기록이 없습니다 — Select an agentflow and start a test" — with a Start New Test button.

Running a New Test¶

Clicking + New Test at the top right opens the Batch Test Creation modal. Upload a dataset file (questions + expected answers) to score many cases at once.

Item	Description
Dataset format	`.xlsx`, `.xls`, `.csv`, `.json` (Harmbench / OpenAI and other common formats)
Column structure	At minimum `question` / `expected_answer`. Extra columns (category, tags) are used in result analysis
Upload method	Drag-and-drop or click-to-upload
Quick start	Use the Example or Excel buttons at the top of the modal to download a sample dataset

After upload, choose the target agent and a criteria preset to run. Results return to the same history list and transition from Running → Completed with scores filled in.

Harmbench Safety Evaluation¶

The Harmbench button at the top right triggers an immediate safety evaluation using the public Harmbench dataset (Standard, 250 items) — no need to prepare your own dataset. It checks how the model reacts to harmful prompts against a standardized baseline.

Item	Description
Source	HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Dataset size	250 items (Standard split)
Columns	`behavior` (harmful scenario prompt) / `category` (taxonomy) / `behavior_id` (identifier)
Result	Refusal rate, bypass rate, sensitive-response rate, etc. show up as safety scores on the history card

Dataset contents

The Harmbench dataset contains sensitive prompts including chemical / biological / cyber risk scenarios. They are shown verbatim on screen — do not copy or redistribute beyond evaluation use.

Quality Criteria Tab¶

Manage criteria presets used to score responses. The solution ships official presets (e.g., Safety, Harmfulness, Ethics/Fairness, Expression Accuracy) and your own custom presets side by side as cards.

Creating a New Preset¶

Clicking + New Preset at the top right opens the preset authoring modal.

Item	Description
Preset name	A one-liner identifiable to others (e.g., "Internal FAQ response quality")
Description	A paragraph about when this criteria applies
Category	Area classification — Safety / Accuracy / Tone / etc.
Criteria	Detailed metrics and weights used in scoring. Multiple can be registered per area

Saved presets show up in the Criteria Preset dropdown of the Quality Evaluation → New Test modal.

Recommended Flow¶

Define criteria first — Create the criteria preset that fits your organization before running tests, so all subsequent tests are comparable on the same scale.
Start with a small dataset — Begin with ≤50 sample items to learn the criteria and result interpretation, then scale up.
Run Harmbench regularly — Model or prompt changes can cause safety regressions; run it before every deploy.
Share results with governance — For agents with "Org-wide" impact scope, use the evaluation score in AI Governance approval review.

Creating an Agent — Build the agent to be evaluated
Agent Operations · Execution and Debugging — Quick single-input run
AI Governance — Risk Assessment & Review — Use evaluation scores in approval flow

Contact¶

For Agent Quality Evaluation questions, please contact the Xgen Solution Administrator.