ML evaluation infrastructure

Structured human review for LLM output datasets.

Upload prompt/response pairs, run human review workflows with keyboard shortcuts, and generate evaluation reports. Built for ML teams doing model evaluation and red-teaming.

Get started

lumen — review session

1/5

row_041

row_042

row_043

row_044

row_045

Prompt

Explain gradient descent to a non-technical audience in 2–3 sentences.

Response

Gradient descent is like finding the lowest valley in hilly terrain while blindfolded. You take small steps downhill based on the slope beneath your feet. In ML, each step adjusts model weights to minimize prediction error.

rating

verdict

pass

fail

skip

j next

4 rate

p pass

⌘↵ save

saved ✓

keyboard-first review◆role-based access◆audit trail◆500ms polling◆draft auto-save◆PDF reports◆verdict distribution◆red-teaming◆model evaluation◆JSONL · CSV import

Three steps from upload to report.

Import datasets

Upload JSONL or CSV files of prompt/response pairs. Each file is validated row by row and processed asynchronously, with progress polled every 500 ms.
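As a rough illustration of row-by-row validation, here is a minimal Python sketch. It is not Lumen's actual importer: the field names `prompt` and `response`, the function name `validate_jsonl`, and the error format are all assumptions made for the example.

```python
import json

# Assumed field names; the real import schema may differ.
REQUIRED_FIELDS = ("prompt", "response")


def validate_jsonl(text):
    """Validate JSONL text row by row and return (valid_rows, errors).

    Each error is a (line_number, message) tuple, so one bad row
    does not abort the whole import.
    """
    valid_rows, errors = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((line_no, "row is not a JSON object"))
            continue
        # A field counts as missing if absent, not a string, or blank.
        missing = [
            name for name in REQUIRED_FIELDS
            if not isinstance(row.get(name), str) or not row[name].strip()
        ]
        if missing:
            errors.append((line_no, "missing or empty field(s): " + ", ".join(missing)))
            continue
        valid_rows.append(row)
    return valid_rows, errors
```

Collecting errors per line rather than raising on the first one lets an uploader see every problem row in a single pass.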

Run review workflows

Rate 1–5, set PASS/FAIL/SKIP verdicts, add Markdown evidence. Keyboard shortcuts let reviewers process 50+ rows per session without touching the mouse.

Generate reports

Export PDF reports with verdict breakdowns, rating histograms, flagged rows, and reviewer notes. Reports are generated in the background by an async job queue.

Built for reviewers who move fast.

Keyboard-first

Press j/k to navigate, 1–5 to rate, p/f/n to set a verdict, and Cmd+Enter to save. Review 50+ rows per session without touching the mouse.
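The shortcut scheme can be pictured as a small key-dispatch table. The Python sketch below is illustrative only: `ReviewSession` and `handle_key` are hypothetical names, the mapping of `n` to the SKIP verdict is an assumption (the page lists p/f/n alongside PASS/FAIL/SKIP without spelling it out), and the Cmd+Enter save step is left out.

```python
from dataclasses import dataclass, field

# Assumption: "n" selects SKIP; the page does not state this explicitly.
VERDICT_KEYS = {"p": "PASS", "f": "FAIL", "n": "SKIP"}
RATING_KEYS = ("1", "2", "3", "4", "5")


@dataclass
class ReviewSession:
    """Hypothetical in-memory state for one review session."""
    row_count: int
    index: int = 0
    ratings: dict = field(default_factory=dict)
    verdicts: dict = field(default_factory=dict)


def handle_key(session, key):
    """Apply one keypress to the session; return False if the key is unbound."""
    if key == "j":  # next row, clamped at the last row
        session.index = min(session.index + 1, session.row_count - 1)
    elif key == "k":  # previous row, clamped at the first row
        session.index = max(session.index - 1, 0)
    elif key in RATING_KEYS:
        session.ratings[session.index] = int(key)
    elif key in VERDICT_KEYS:
        session.verdicts[session.index] = VERDICT_KEYS[key]
    else:
        return False
    return True
```

For example, the key sequence `4`, `p`, `j` rates the current row 4, marks it PASS, and moves to the next row.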

Role-based access

Owner, Admin, Reviewer, Viewer roles per workspace. Reviewers submit reviews; Admins upload datasets and generate reports.
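One way to model these roles is a minimum-role check per action. The Python sketch below assumes the roles form a strict hierarchy (Owner above Admin above Reviewer above Viewer), which the page does not confirm; the action names and the `view_dataset` permission are also hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumption: higher roles inherit everything lower roles can do.
    VIEWER = 1
    REVIEWER = 2
    ADMIN = 3
    OWNER = 4


# Minimum role per action. Action names are illustrative, not a real API.
MIN_ROLE = {
    "view_dataset": Role.VIEWER,      # assumed; not stated on the page
    "submit_review": Role.REVIEWER,   # "Reviewers submit reviews"
    "upload_dataset": Role.ADMIN,     # "Admins upload datasets"
    "generate_report": Role.ADMIN,    # "Admins ... generate reports"
}


def can(role, action):
    """Return True if the role meets the minimum required for the action."""
    required = MIN_ROLE.get(action)
    # Unknown actions are denied by default.
    return required is not None and role >= required
```

Denying unknown actions by default keeps a newly added action from being open to everyone until it is explicitly listed.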

Background processing

Datasets are processed asynchronously after upload. Progress is polled every 500 ms, so no page refresh is needed.
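A fixed-interval polling loop of this kind might look like the following Python sketch. The `fetch_progress` callable, the `percent` key in its return value, and the timeout are assumptions for illustration; the product's own polling presumably runs in the browser.

```python
import time


def poll_until_done(fetch_progress, interval_s=0.5, timeout_s=300.0, sleep=time.sleep):
    """Call fetch_progress() every interval_s seconds until it reports 100%.

    fetch_progress is a hypothetical callable returning a dict with a
    "percent" key. The sleep function is injectable so the loop can be
    tested without actually waiting.
    """
    waited = 0.0
    while True:
        progress = fetch_progress()
        if progress["percent"] >= 100:
            return progress
        if waited >= timeout_s:
            raise TimeoutError("dataset processing did not finish in time")
        sleep(interval_s)
        waited += interval_s
```

The timeout guards against a job that never completes, so a stalled import surfaces as an error instead of an endless spinner.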

Draft auto-save

Review state saved to localStorage on every keystroke. Return mid-session and your in-progress draft is waiting.

Evaluation stats

Dashboard shows verdict distribution, rating histogram, and 14-day review activity per workspace.
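These dashboard figures are simple aggregates over review records. The Python sketch below shows one way to compute them; the record keys (`verdict`, `rating`, `reviewed_on`) and the function name `summarize` are assumed for the example and are not taken from the product.

```python
from collections import Counter
from datetime import date, timedelta


def summarize(reviews, today):
    """Compute verdict counts, a 1-5 rating histogram, and 14-day activity.

    Each review is assumed to be a dict with "verdict" (str),
    "rating" (int 1-5), and "reviewed_on" (datetime.date).
    """
    verdicts = Counter(r["verdict"] for r in reviews)
    ratings = Counter(r["rating"] for r in reviews)
    # Include every score from 1 to 5 so empty buckets still show as 0.
    histogram = {score: ratings.get(score, 0) for score in range(1, 6)}
    # The 14 calendar days ending today, oldest first.
    window = [today - timedelta(days=offset) for offset in range(13, -1, -1)]
    per_day = Counter(r["reviewed_on"] for r in reviews)
    activity = {day.isoformat(): per_day.get(day, 0) for day in window}
    return {"verdicts": dict(verdicts), "ratings": histogram, "activity": activity}
```

Filling in zero counts for empty rating buckets and quiet days keeps chart axes stable from one workspace to the next.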

Audit trail

Every mutation logged: actor, action, and diff. Append-only audit log per workspace, queryable from the database.
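To make the actor/action/diff shape concrete, here is a minimal in-memory Python sketch. It is only an illustration: the real log lives in the database, and the names `AuditEntry`, `AuditLog`, `diff_fields`, and the action string used below are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def diff_fields(before, after):
    """Return {field: (old, new)} for every field whose value changed."""
    keys = set(before) | set(after)
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


@dataclass(frozen=True)
class AuditEntry:
    """One audit record; frozen so its fields cannot be reassigned."""
    actor: str
    action: str
    diff: dict
    at: datetime


class AuditLog:
    """Append-only log: entries can be added and read, never edited."""

    def __init__(self):
        self._entries = []

    def record(self, actor, action, before, after):
        entry = AuditEntry(actor, action, diff_fields(before, after),
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def entries(self):
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(self._entries)
```

For instance, `log.record("alice", "review.update", {"verdict": "SKIP"}, {"verdict": "PASS"})` stores a diff of `{"verdict": ("SKIP", "PASS")}`.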

Start reviewing with your team.

Try the demo workspace and review your first dataset in under a minute.

Get started