For operators & owners of proprietary corpora

The files you already have
are training-grade data.

A curated, bespoke process to surface meaningful value from your proprietary corpus. You decide what is contributed. Everything is anonymized and de-sensitized before it leaves your hands.

Initial conversations are exploratory and confidential. No commitment.

Two principles, before anything else

You control what is contributed

Nothing leaves your environment without your explicit sign-off. Every file, every folder, every release decision is yours. You can pause or end the engagement at any point. We work at your pace, not ours.

Everything is anonymized and de-sensitized

Identity, counterparties, individuals, sensitive financial detail: all redacted, masked, or withheld before any external party sees a single document. The anonymization protocol is scoped to your corpus and your comfort level, not a one-size-fits-all template.

What This Is

A curated, bespoke engagement: not a marketplace, not a SaaS upload portal, not a one-size-fits-all process. We work directly with a small number of operating companies to characterize their corpora, prepare them for AI training, and steward each release.

The work is hands-on by design. Volume is not the goal. The goal is to surface meaningful value from a corpus that already exists — and, where it works, to open an ancillary revenue stream that did not exist before.

Why Your Data Has Value

Domain corpora are the bottleneck

General-purpose models have plateaued on web text. Vertical capability now depends on real operational documents from real businesses.

Native formats are scarce

AI training teams specifically want PDFs, Excel workbooks, and Word documents in their original layouts, not as text dumps. Real formatting carries real signal.

Provenance matters more than volume

"Not scraped, not synthetic" has become a procurement requirement. Documents from operating companies clear that bar.

Your archive is differentiated

No web crawler can reach inside your drives. The corpora that move the needle for vertical AI live behind firewalls — and yours is one of them.

What You Get

Inventory & characterization

We catalog what you have — file counts, formats, document types, NAICS classification, sensitivity profile. Most participants learn the shape of their own corpus for the first time.
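As a rough illustration of the most mechanical part of that catalog, here is a minimal Python sketch that counts files and total bytes per format under a folder. It is an assumption for illustration only, not Iceberg's tooling: the function name `inventory` and the example path are hypothetical, and document types, NAICS classification, and sensitivity profile still require human review.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Count files and total bytes per file extension under root.

    Hypothetical helper for illustration; extensions are lower-cased so
    that "Report.PDF" and "report.pdf" fall into the same bucket.
    """
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {
        ext: {"files": counts[ext], "bytes": sizes[ext]}
        for ext in sorted(counts)
    }


if __name__ == "__main__":
    # Example path is a placeholder; point it at a folder you control.
    for ext, row in inventory(".").items():
        print(f"{ext:>10}  {row['files']:>8} files  {row['bytes']:>14} bytes")
```

The script only reads file names and sizes on your own machine. Nothing is opened, copied, or sent anywhere.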

Anonymized profile

Your corpus is described under a reference code (like operator-k4n9), with no identifying detail. Identity is disclosed only after a mutual NDA you have approved.

Selective introductions

Qualified AI training teams are introduced through Iceberg. We screen and route. You approve each party before it proceeds, and before any sample documents or pricing are shared.

Anonymization protocol

Identity, counterparties, individuals, sensitive financial detail — redacted, masked, or withheld. Scoped to your corpus and your comfort level, documented and reviewed before any release.

Pricing & structure guidance

One-time corpus, ongoing feed, or hybrid. We help you frame the offering, set the price, and structure the agreement directly with the lab.

You keep ownership

Iceberg never takes title to your data. Licenses are between you and the lab. You retain control of every release decision.

How It Works

  1. Exploratory call

    We walk through what you have, what kind of access AI training teams might want, and whether an engagement makes sense. No commitment, no slide deck required.

  2. Inventory & characterization

    We profile the corpus: counts, formats, document types, time period, NAICS, sensitivity surface. Output is a draft profile for your review.

  3. Anonymization protocol agreed

    We scope what gets redacted, masked, or withheld, together with you. Nothing is shared externally until you sign off on the protocol and the profile.

  4. Selective introductions

    Qualified labs are introduced under your terms. You approve who proceeds. Mutual NDAs are signed before identity, samples, or pricing are shared.

  5. Sample release & deal structure

    Scoped sample documents, pricing, and license terms are negotiated under NDA. Iceberg facilitates; you sign the agreement directly with the lab.

Example Profile

Here’s how a curated profile is presented to a qualified lab once your corpus has been characterized and your anonymization protocol is set. The participating company is anonymized; the corpus profile carries enough detail for a lab to make a serious inquiry under NDA. This particular example is a real estate operator; engagements in logistics (3PL), insurance, and other sectors are at various stages of the same process.

Real Estate Operator + Property Manager

Mid-Atlantic U.S. · Multifamily & mixed-use

Active going concern · ~70k files · ~70 GB · Leases, rent rolls, IC docs, K-1s, investor reports, deal pipeline.


Your Controls, in Detail

  • Your identity is anonymized in every external touchpoint and disclosed only after a mutual NDA you have approved.
  • You approve every party who proceeds past the initial introduction.
  • Anonymization and de-sensitization protocols are scoped with you, documented, and reviewed before any release. PII, counterparty identifiers, individual names, and sensitive financial detail are redacted, masked, or withheld.
  • You set the price. You sign the agreement. Iceberg facilitates; the deal is between you and the lab.
  • You decide whether to offer a one-time corpus, an ongoing feed, or both, and you can pause or end the engagement at any point.

Suitable For

Operating companies with deep document archives

Logistics and supply-chain operators, insurance carriers, healthcare systems, law firms, accounting firms, manufacturers, asset managers, real estate operators — any business whose document trail captures domain expertise.

Wound-down or archival entities

Companies whose operations have ended but whose document corpora retain training value. Archival data is often easier to release because operational sensitivities have passed.

Service providers with multi-client repositories

Third-party logistics providers (3PLs), third-party administrators, claim processors, outsourced operations firms, property managers — work product that spans many counterparties and accumulates real-world breadth.

Specialized firms with narrow vertical depth

Niche practices and boutique operators whose corpora carry expertise that’s rare in public data.

No Promises

We don’t guarantee outcomes. The market for proprietary corpora is forming in real time, and the value of any specific corpus depends on what you have, what AI training teams want, and what you’re comfortable releasing.

What we aim for is straightforward: meaningful value creation from a corpus that already exists, and — where it works — an ancillary revenue stream that did not exist before. Anything more would be marketing.

Explore what your corpus could be worth.

A confidential conversation. No commitment. No slide deck required.