ML evaluation infrastructure

Structured human review for LLM output datasets.

Upload prompt/response pairs, run human review workflows with keyboard shortcuts, and generate evaluation reports. Built for ML teams doing model evaluation and red-teaming.

Example lumen review session, row 1 of 5 (row_041 through row_045):
Prompt

Explain gradient descent to a non-technical audience in 2–3 sentences.

Response

Gradient descent is like finding the lowest valley in hilly terrain while blindfolded. You take small steps downhill based on the slope beneath your feet. In ML, each step adjusts model weights to minimize prediction error.

Rating: 1 · 2 · 3 · 4 · 5
Verdict: pass · fail · skip
Shortcuts: j next · 4 rate · p pass · ⌘↵ save
Status: saved ✓
keyboard-first review · role-based access · audit trail · 500ms polling · draft auto-save · PDF reports · verdict distribution · red-teaming · model evaluation · JSONL and CSV import

Three steps from upload to report.

01

Import datasets

Upload JSONL or CSV files of prompt/response pairs. Each file is validated row by row and processed asynchronously, with progress polled every 500 ms.
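
As a rough illustration of row-by-row validation, here is a minimal Python sketch for the JSONL case. The field names `prompt` and `response`, the function name `validate_jsonl`, and the error messages are assumptions made for this example, not the product's actual schema or API.

```python
import json

# Assumed field names; the real import schema may differ.
REQUIRED_FIELDS = ("prompt", "response")


def validate_jsonl(lines):
    """Validate rows one at a time and return (valid_rows, errors).

    Each error is a (line_number, message) pair, so a single bad row
    does not reject the whole upload.
    """
    valid, errors = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, but keep counting line numbers
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((number, "row is not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or blank.
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not isinstance(row.get(field), str) or not row[field].strip()
        ]
        if missing:
            errors.append((number, "missing or empty: " + ", ".join(missing)))
            continue
        valid.append(row)
    return valid, errors
```

Collecting errors per line, rather than stopping at the first one, lets an uploader fix every bad row in one pass.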

02

Run review workflows

Rate 1–5, set PASS/FAIL/SKIP verdicts, add Markdown evidence. Keyboard shortcuts let reviewers process 50+ rows per session without touching the mouse.

03

Generate reports

Export PDF reports with verdict breakdowns, rating histograms, flagged rows, and reviewer notes. Reports are generated in the background by an async job queue.

Built for reviewers who move fast.

Keyboard-first

j/k navigate, 1–5 rate, p/f/n verdict, Cmd+Enter save. Review 50+ rows per session without touching the mouse.
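
The shortcut scheme can be pictured as a small key-to-action dispatcher. The sketch below is only an illustration in Python: `ReviewState` and `handle_key` are hypothetical names, and mapping `n` to the skip verdict is an assumption, since only the keys themselves are listed here.

```python
from dataclasses import dataclass
from typing import Optional

# "p" and "f" follow pass/fail; "n" -> "skip" is an assumption.
VERDICT_KEYS = {"p": "pass", "f": "fail", "n": "skip"}
RATING_KEYS = {"1", "2", "3", "4", "5"}


@dataclass
class ReviewState:
    index: int = 0                 # position of the current row
    total: int = 5                 # number of rows in the session
    rating: Optional[int] = None
    verdict: Optional[str] = None


def handle_key(state: ReviewState, key: str) -> ReviewState:
    """Apply one keypress to the review state; unknown keys are ignored."""
    if key == "j":
        state.index = min(state.index + 1, state.total - 1)  # next row
    elif key == "k":
        state.index = max(state.index - 1, 0)                # previous row
    elif key in RATING_KEYS:
        state.rating = int(key)
    elif key in VERDICT_KEYS:
        state.verdict = VERDICT_KEYS[key]
    return state
```

Clamping the index at both ends keeps repeated `j` or `k` presses from running off the dataset.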

Role-based access

Owner, Admin, Reviewer, Viewer roles per workspace. Reviewers submit reviews; Admins upload datasets and generate reports.
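
One way to model these roles is a ranked permission check, sketched below in Python. The strict hierarchy (Viewer < Reviewer < Admin < Owner), the read-only `view_dataset` action, and the function name `can` are assumptions for illustration; the text above only states what Reviewers and Admins can do.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumed ordering: each role inherits the permissions below it.
    VIEWER = 1
    REVIEWER = 2
    ADMIN = 3
    OWNER = 4


# Minimum role required per action (action names are hypothetical).
MIN_ROLE = {
    "view_dataset": Role.VIEWER,
    "submit_review": Role.REVIEWER,
    "upload_dataset": Role.ADMIN,
    "generate_report": Role.ADMIN,
}


def can(role: Role, action: str) -> bool:
    """Return True if the role meets the minimum for the action.

    Unknown actions are denied by default.
    """
    required = MIN_ROLE.get(action)
    return required is not None and role >= required
```

Denying unknown actions by default means a newly added action stays locked until someone assigns it a minimum role.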

Background processing

Datasets process asynchronously after upload. Progress is polled every 500 ms, so no page refresh is needed.
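
A fixed-interval polling loop of this kind might look like the Python sketch below. The `fetch_progress` callback (assumed to return a percentage from 0 to 100), the 60-second timeout, and the injectable `sleep` parameter are all assumptions for the example; only the 500 ms interval comes from the text above.

```python
import time


def poll_until_done(fetch_progress, interval_s=0.5, timeout_s=60.0, sleep=time.sleep):
    """Call fetch_progress() every interval_s seconds until it reports 100.

    Raises TimeoutError if processing has not finished after roughly
    timeout_s seconds of waiting. `sleep` is injectable so the loop can
    be tested without real delays.
    """
    waited = 0.0
    while True:
        percent = fetch_progress()
        if percent >= 100:
            return percent
        if waited >= timeout_s:
            raise TimeoutError(f"still at {percent}% after {waited:.1f}s")
        sleep(interval_s)
        waited += interval_s
```

A timeout is worth having in any polling client so that a stalled job surfaces as an error instead of an endless spinner.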

Draft auto-save

Review state saved to localStorage on every keystroke. Return mid-session and your in-progress draft is waiting.

Evaluation stats

Dashboard shows verdict distribution, rating histogram, and 14-day review activity per workspace.
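
The two distributions are simple counts over submitted reviews. Here is a minimal Python sketch, assuming each review is a dict with a `verdict` string and an optional integer `rating` from 1 to 5; that record shape and the function name `summarize` are assumptions, not the product's data model.

```python
from collections import Counter


def summarize(reviews):
    """Return (verdict_counts, rating_histogram) for a list of reviews.

    The histogram always has keys 1..5 so empty buckets show up as 0.
    Reviews without a rating (for example, skipped rows) are left out
    of the histogram but still counted in the verdict distribution.
    """
    verdicts = Counter(r["verdict"] for r in reviews)
    ratings = Counter(r["rating"] for r in reviews if r.get("rating") is not None)
    histogram = {score: ratings.get(score, 0) for score in range(1, 6)}
    return dict(verdicts), histogram
```

For example, two passes rated 4, one fail rated 2, and one unrated skip give a verdict count of `{"pass": 2, "fail": 1, "skip": 1}` and a histogram with 1 in bucket 2 and 2 in bucket 4.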

Audit trail

Every mutation logged: actor, action, and diff. Append-only audit log per workspace, queryable from the database.
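
To make the "actor, action, and diff" shape concrete, here is a small in-memory Python sketch of an append-only log. It is an illustration only: the class names, the before/after diff format, and the in-memory list are assumptions, whereas the text above describes a log stored in the database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    actor: str
    action: str
    diff: dict      # field -> (old_value, new_value)
    at: datetime


def compute_diff(before: dict, after: dict) -> dict:
    """Return only the fields whose values changed, as (old, new) pairs."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


class AuditLog:
    """Append-only: entries can be recorded and read, never edited or removed."""

    def __init__(self):
        self._entries = []

    def record(self, actor: str, action: str, before: dict, after: dict) -> AuditEntry:
        entry = AuditEntry(actor, action, compute_diff(before, after),
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple:
        return tuple(self._entries)  # read-only snapshot
```

Storing only the changed fields keeps each entry small while still showing exactly what a mutation did.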

Start reviewing with your team.

Try the demo workspace and review your first dataset in under a minute.

Get started