For AI training

Your outsourced human-data company.
With domain experts.

Two ingredients, one engagement: the operating companies whose data the work is built on, and the domain experts who design the tasks, write the rubrics and gold standards, and run the scoring.

Start a confidential conversation · Browse the catalog

Initial conversations are exploratory and confidential. NDA available on request.

What This Is

A curated, hands-on engagement model. Iceberg acts as your outsourced human-data team. We bring in operating companies whose proprietary corpora are released under NDA, and we staff domain practitioners who design the tasks, write the rubrics and gold standards, and run the scoring and expert review on top.

Not a self-serve portal. Not a vendor marketplace. Engagements are scoped to a specific domain and a specific goal, then staffed and run end-to-end.

Two ingredients, one engagement

The operating companies

Proprietary workflow corpora released under NDA

Real documents from real businesses — leases, rent rolls, IC memos, investor reports, claim files, dispatch logs, underwriting packets, and the like. Native formats (PDF, Excel, Word). Not scraped, not synthetic. Anonymized in the public catalog; firm names disclosed under NDA.

See the catalog →

The domain experts

Task design, rubrics, gold standards, scoring, expert review

Practitioners who have actually run the workflow professionally. Specialty: long-chain reasoning and other non-deterministic work where rubric quality and reviewer judgment matter more than throughput. Deterministic tasks are also supported where the methodology calls for it.

Scope an engagement →

Most engagements bring both. We pair the operating companies (the data) with the domain experts (the work) under one engagement, end-to-end.

Specialty: long-chain reasoning

Short tasks with clean answers are well-served by existing human-data vendors. The hard work is the other kind: multi-step domain workflows where the “right answer” depends on methodology, precedent, and the order of operations — and where a plausible-looking output can be quietly wrong.

Not easily verifiable

Outputs require domain judgment to assess. A general reviewer can't tell whether the model used the correct methodology or just landed near the right number.

Multi-step dependencies

An error in step two cascades into steps three, four, and five. Scoring has to attribute the failure to its root cause, not just the final answer.
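One way to picture root-cause attribution is sketched below. This is a minimal illustration under stated assumptions, not a description of Iceberg's actual scoring pipeline: the `Step` structure, the `root_causes` helper, and the step names are all hypothetical. The idea is to treat a failed step as a root cause only when none of the steps it directly depends on also failed.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One scored step in a multi-step workflow (hypothetical structure)."""
    name: str
    passed: bool
    depends_on: list[str] = field(default_factory=list)


def root_causes(steps: list[Step]) -> list[str]:
    """Return failed steps whose direct upstream dependencies all passed."""
    failed = {s.name for s in steps if not s.passed}
    roots = []
    for s in steps:
        if s.passed:
            continue
        # A failure with a failed direct dependency is treated as downstream
        # damage, not as an independent error.
        if not any(dep in failed for dep in s.depends_on):
            roots.append(s.name)
    return roots


# Illustrative chain: the rent calculation is wrong, and everything
# downstream of it fails as a consequence.
chain = [
    Step("abstract_lease", passed=True),
    Step("compute_rent", passed=False, depends_on=["abstract_lease"]),
    Step("covenant_test", passed=False, depends_on=["compute_rent"]),
    Step("lender_consent", passed=False, depends_on=["covenant_test"]),
]
roots = root_causes(chain)
print(roots)
```

Checking only direct dependencies is a simplification. A real rubric may also need a reviewer to decide whether a downstream step contains an independent error on top of the inherited one.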

Rubric quality dominates

The rubric is the work. A weak rubric produces weak evaluation regardless of reviewer skill or throughput. We design and version rubrics like a product.

Gold standards take expertise

Gold answers need a practitioner who has done the workflow professionally — not a generalist applying a checklist.

The Expert Network

Reviewers and rubric authors are domain practitioners: people who have actually run the workflow as professionals. Depending on the domain, that means real estate underwriters, lease attorneys, claims adjusters, or dispatch operators.

Where the task calls for it, we prioritize reviewers who combine domain experience with prior lab or training-data evaluation work. That combination — domain depth plus eval discipline — is what makes long-chain rubric work tractable.

We staff narrow. Engagements get a small, hand-picked panel rather than a generic worker pool. Reviewer identity is disclosed to the engaging team on request.

What We Deliver

Task design

Workflow decomposed into the actual steps a practitioner would take. Inputs, intermediate artifacts, and final outputs specified. Versioned and revisable.

Rubrics

Scoring criteria authored by domain practitioners. Designed to attribute failure to its root cause across multi-step chains. Reviewed before any scoring run.

Gold standards

Expert-authored reference answers, with the reasoning trace where applicable. Built for the rubric, not retrofitted to it.

Deterministic scoring

Field-by-field comparison where the task supports it (extractions, structured calculations). Auditable, reproducible, and explainable to procurement.
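As a rough sketch of what field-by-field comparison can look like, the example below scores a predicted extraction against a gold record. The function name `score_fields`, the field names, the values, and the numeric tolerance are assumptions made for illustration and do not come from a published rubric.

```python
import math


def score_fields(gold: dict, predicted: dict, rel_tol: float = 1e-6) -> dict:
    """Compare each gold field to the prediction and return pass/fail per field."""
    results = {}
    for name, expected in gold.items():
        actual = predicted.get(name)  # a missing field compares as None
        both_numeric = isinstance(expected, (int, float)) and isinstance(
            actual, (int, float)
        )
        if both_numeric:
            # Numbers match within a relative tolerance (an assumed policy).
            results[name] = math.isclose(expected, actual, rel_tol=rel_tol)
        else:
            # Everything else must match exactly.
            results[name] = expected == actual
    return results


# Illustrative records only; not taken from any real lease.
gold = {"tenant": "Example Tenant LLC", "base_rent": 12500.0, "term_months": 60}
predicted = {"tenant": "Example Tenant LLC", "base_rent": 12500.0000001, "term_months": 48}

results = score_fields(gold, predicted)
accuracy = sum(results.values()) / len(results)
print(results, accuracy)
```

Because the output is a per-field record rather than a single score, every pass or fail can be traced back to a specific field, which is what makes this kind of scoring auditable and reproducible.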

Non-deterministic scoring

Rubric-driven expert review for tasks where there is no single right answer. Multiple reviewers, calibrated, with inter-rater agreement tracked.
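The text does not say which agreement statistic is tracked. As one common option, the sketch below computes Cohen's kappa for two reviewers who label the same items; the function name and the sample labels are illustrative assumptions. For panels of more than two reviewers, Fleiss' kappa or Krippendorff's alpha are typical alternatives.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two reviewers labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both reviewers must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each reviewer's own label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for four rubric items.
reviewer_1 = ["pass", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(kappa)
```

Kappa corrects raw agreement for the agreement two reviewers would reach by chance, so a low value on a rubric item is a signal that the item needs recalibration or clearer wording.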

Catalog access

NDA-gated access to proprietary corpora from operating companies. Document samples, full inventories, format breakdowns, and licensing terms negotiated directly with the data owner.

Worked Examples

We’ve published a set of multi-step commercial-real-estate workflows built with our methodology: task decomposition, rubric authorship, gold-standard construction, and model scoring across 7+ frontier models. These are concrete artifacts of the work, not case-study slides.

Early Lease Termination & Lender Consent

Multi-step office workflow · 18 scored fields · 7 models · Loan-covenant logic

View →

Lease Abstract → NER Calculation

Industrial workflow · Single-step and multi-step variants · 7 models

View →

Commercial real estate is the domain we’ve published in depth. The same methodology — task design, rubric, gold, scoring — is portable across logistics, insurance, and other document-heavy domains.

Suitable Engagements

Vertical capability programs

Training teams pushing model capability into specific business domains — real estate, logistics, insurance, healthcare operations — where general benchmarks have run out.

Long-chain reasoning evals

Multi-step workflows where the rubric is the hard problem. RL-relevant tasks where the reward signal has to be defensible.

Pre-training data partnerships

Procurement of proprietary, NDA-gated corpora from operating companies. We characterize, anonymize, and steward the release.

Domain expert review

Existing evals or training data that need a domain-practitioner read for correctness, methodology, or rubric calibration.

We are not a high-volume labeling shop. Engagements are scoped, staffed narrow, and run by people who could do the underlying work themselves.

Tell us what you’re trying to evaluate.

A confidential conversation. NDA on request. No slide deck required.