Lease Abstract → NER Calculation
We tested 7 frontier models on a ~1M SF, single-tenant, NNN industrial lease across two benchmarks:
- Single-step: abstract the lease (extract commercial terms and the full rent schedule)
- Multi-step: abstract the lease and calculate the net effective rent (NER) in a single end-to-end workflow
Each benchmark was run with two prompt styles — detailed (full methodology spelled out) and less detailed (the model decides what matters) — across the same model cohort, 10 runs per model per variant.
This benchmark suite measures what changes when prompt guidance gets lighter and workflows get longer.
Directional results from live Iceberg runs. Not a formal audited study.
Can AI do the job?
Often yes with a detailed prompt. Much less reliably when the prompt gets lighter and the workflow gets longer.
| Task | Detailed Prompt | Less Detailed Prompt |
|---|---|---|
| Abstract a 1M SF NNN industrial lease (7 models · single-step) | 99.4% | 93.4% |
| Abstract the lease, then run the NER (7 models · multi-step · 50/50 step weighting) | 94.0% | 70.1% strict · 82.3% lenient |
Prompt quality is part of task performance.
The same model cohort looks very different when guidance gets lighter. Workflow horizon matters too — single-step tasks hold up better than chained workflows.
- Less guidance creates much more model separation. Detailed prompts compress the field; reducing guidance is what reveals real differences.
- Extraction and workflow tasks fail in different ways. Extraction breaks on column disambiguation in OCR-degraded rent tables; workflows break on calculation methodology.
- Some apparent workflow failures are valid NER convention divergence, not economic misunderstanding. Several models recover materially under lenient scoring that accepts multiple valid CRE conventions.
- The remaining failures are real extraction or computation errors. Models that don't recover under lenient scoring are computing wrong values, not just using a different convention.
Scope & rationale
Lease abstraction, NER calculation, and chained workflow execution on a real industrial lease — not a synthetic prompt.
Detailed prompts compress the field. Reducing guidance is what reveals real differences across the cohort.
Measures accuracy, completion, cost, and latency — and distinguishes extraction failure, methodology divergence, and computation error.
Single-Step: Lease Extraction
Detailed prompts compress the field. Less detailed prompts create the meaningful spread.
- The dominant extraction failure is rent table column confusion in OCR-degraded source data.
- Lower-salience term misses (e.g., antenna count) are a secondary failure mode.
**Detailed prompt**

| Model | Avg | Median | Range | Cost | Time | Completion | Runs |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 100.0% | 100.0% | 100–100% | $0.140 | 13.2s | 70% | best / worst |
| Claude Opus 4.6 | 100.0% | 100.0% | 100–100% | $0.340 | 24.1s | 100% | best / worst |
| Gemini 3.1 Pro | 100.0% | 100.0% | 100–100% | $0.189 | 91.2s | 100% | best / worst |
| GPT-5 | 100.0% | 100.0% | 100–100% | $0.111 | 64.5s | 90% | best / worst |
| Claude Sonnet 4.6 | 100.0% | 100.0% | 100–100% | $0.200 | 24.3s | 100% | best / worst |
| GPT-5 Mini | 100.0% | 100.0% | 100–100% | $0.020 | 55.3s | 100% | best / worst |
| Claude Haiku 4.5 | 99.1% | 99.1% | 98–100% | $0.070 | 12.2s | 100% | best / worst |
**Less detailed prompt**

| Model | Avg | Median | Range | Cost | Time | Runs |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 100.0% | 100.0% | 100–100% | $0.200 | 33.4s | best / worst |
| Claude Haiku 4.5 | 98.4% | 98.2% | 98–100% | $0.070 | 11.2s | best / worst |
| Gemini 3.1 Pro | 97.9% | 100.0% | 79–100% | $0.170 | 52.8s | best / worst |
| GPT-5 | 96.3% | 98.2% | 77–100% | $0.116 | 81.1s | best / worst |
| GPT-5.4 | 95.6% | 100.0% | 79–100% | $0.140 | 12.4s | best / worst |
| GPT-5 Mini | 88.2% | 88.6% | 77–100% | $0.023 | 63.0s | best / worst |
| Claude Opus 4.6 | 85.1% | 78.9% | 77–100% | $0.340 | 21.0s | best / worst |
All completed benchmark runs are live on Iceberg. Click “best” or “worst” to inspect the full model output and field-by-field scoring.
What changed when guidance got lighter
Detailed prompts pushed many models to the ceiling. Six of the seven models scored 100% on completed runs, though two of those models (GPT-5.4 and GPT-5) had intermittent provider failures that reduced their completion rates to 70% and 90%.
Less detailed prompts created real separation. Cohort average dropped from 99.4% to 93.4%, with a 15-point spread.
The biggest failures came from interpreting the rent schedule, not locating it: an interpretation problem, not a retrieval problem.
Price did not predict performance. The most expensive model finished last on the less detailed variant.
| Model | Detailed | Less Detailed | Delta |
|---|---|---|---|
| Claude Sonnet 4.6 | 100.0% | 100.0% | — |
| Claude Haiku 4.5 | 99.1% | 98.4% | -0.7 |
| Gemini 3.1 Pro | 100.0% | 97.9% | -2.1 |
| GPT-5 | 100.0% | 96.3% | -3.7 |
| GPT-5.4 | 100.0% | 95.6% | -4.4 |
| GPT-5 Mini | 100.0% | 88.2% | -11.8 |
| Claude Opus 4.6 | 100.0% | 85.1% | -14.9 |
Where Models Broke
Models confused adjacent columns in the OCR-degraded Addendum 1 rent table — especially the fixed amortization component ($1.60/SF) versus the escalating base rent ($4.75–$5.60/SF).
Some models missed explicit but buried terms like antenna rights when the prompt stopped enumerating them. 44% of less detailed runs returned null for this field.
Claude Sonnet 4.6 was the only model to correctly identify the rent schedule columns on all 10 less detailed runs. Claude Opus 4.6 — the most expensive model — failed this disambiguation more often than its smaller sibling.
Multi-Step: Lease Abstract → NER
The chained workflow benchmark is where prompt explicitness and methodology assumptions both start to matter. The model must abstract the lease and calculate the NER in a single end-to-end response. Step-weighted scoring: extraction 50%, NER calculation 50%.
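As a minimal sketch, the 50/50 step weighting combines per-step scores as a simple weighted average. The function name and example scores below are ours for illustration, not the benchmark's internals:

```python
def workflow_score(extraction_score: float, ner_score: float,
                   w_extract: float = 0.5, w_ner: float = 0.5) -> float:
    """Combine per-step scores (each on 0.0-1.0) into one workflow score."""
    return w_extract * extraction_score + w_ner * ner_score

# A perfect extraction with a failed NER step caps the run at 50%.
print(workflow_score(1.0, 0.0))   # → 0.5
```

One consequence of equal weighting: a model cannot score well on the workflow by excelling at extraction alone, which is why the less-detailed multi-step numbers drop so much further than the extraction-only numbers.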
A simpler single-step NER baseline (calculator-enabled, given the abstracted lease economics as input) largely converged once tool use was working — so the more meaningful benchmark became the chained workflow below.
- Detailed workflow runs are strong but not fully saturated.
- Less-detailed workflow runs create major separation across the cohort.
- Strict scoring understates some performance because valid CRE NER conventions diverge under lighter prompting.
Detailed prompt
Full extraction checklist + locked NER methodology. Six of the seven models score above 94%; only GPT-5.4 drops meaningfully, due to provider reliability issues.
Less detailed prompt
Same task, same schema, but the methodology is removed. Models must choose how to compute NER on their own.
This is where the workflow benchmark gets interesting — and where we surface a benchmark-design issue worth being explicit about.
Strict: the model's NPV per SF must match the benchmark's reference value (annual end-of-period discounting at 8%) within a tight tolerance. This measures adherence to the methodology a benchmark author specified.
Lenient: the model's NPV per SF is accepted if it matches any of four valid CRE conventions (annual end-of-period, monthly nominal end-of-period, monthly nominal begin-of-period / annuity due, monthly effective end-of-period). This measures whether the model produced an economically valid answer under any standard convention.
The NER PSF value itself stays strict in both views (all four conventions converge on the same NER to within $0.003).
Δ shows how much each model recovers when the benchmark accepts alternate valid CRE conventions. Large positive deltas indicate convention divergence under reduced prompting; near-zero deltas indicate genuine computational or extraction failures.
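To make the four accepted conventions concrete, here is a hedged Python sketch of the discounting math. The flat $5.00/SF annual net rent and 5-year term are illustrative stand-ins, not the benchmark lease's actual economics; only the convention mechanics are the point:

```python
def npv_annual_end(annual_cfs, rate):
    # Annual end-of-period: each year's cash flow discounted at year end.
    return sum(cf / (1 + rate) ** (t + 1) for t, cf in enumerate(annual_cfs))

def npv_monthly_nominal_end(monthly_cfs, rate):
    # Monthly nominal end-of-period: rate/12 per month, paid at month end.
    r = rate / 12
    return sum(cf / (1 + r) ** (m + 1) for m, cf in enumerate(monthly_cfs))

def npv_monthly_nominal_begin(monthly_cfs, rate):
    # Monthly nominal begin-of-period (annuity due): payments at month start.
    r = rate / 12
    return sum(cf / (1 + r) ** m for m, cf in enumerate(monthly_cfs))

def npv_monthly_effective_end(monthly_cfs, rate):
    # Monthly effective end-of-period: monthly rate implied by (1+rate)^(1/12).
    r = (1 + rate) ** (1 / 12) - 1
    return sum(cf / (1 + r) ** (m + 1) for m, cf in enumerate(monthly_cfs))

if __name__ == "__main__":
    rate = 0.08
    annual = [5.00] * 5            # illustrative: flat $5.00/SF/yr, 5 years
    monthly = [5.00 / 12] * 60
    print(f"annual end-of-period:   {npv_annual_end(annual, rate):.4f}")
    print(f"monthly nominal end:    {npv_monthly_nominal_end(monthly, rate):.4f}")
    print(f"monthly nominal begin:  {npv_monthly_nominal_begin(monthly, rate):.4f}")
    print(f"monthly effective end:  {npv_monthly_effective_end(monthly, rate):.4f}")
```

All four are standard CRE methods, but they produce visibly different NPV-per-SF values on the same cash flow, which is exactly why a strict single-convention tolerance penalizes models that chose a different (still defensible) convention.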
NER Convention Usage
Across each model's completed less-detailed runs, models did not converge on a single discounting convention. The four accepted conventions — annual end-of-period, monthly nominal end-of-period, monthly nominal begin-of-period (annuity due), and monthly effective end-of-period — are all standard CRE methods.
| Model | Annual | Mo. Nom. End | Mo. Nom. Begin | Mo. Eff. End | No Match |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 1 | 2 | 7 | — | — |
| GPT-5 | 8 | 1 | 1 | — | — |
| Claude Opus 4.6 | — | 2 | — | 8 | — |
| Claude Sonnet 4.6 | 5 | 4 | — | — | 1 |
| GPT-5.4 | — | 5 | — | 4 | 1 |
| GPT-5 Mini | 8 | — | — | — | 2 |
| Claude Haiku 4.5 | 2 | — | — | 1 | 7 |
The “No Match” column counts runs where the model's NPV did not match any of the four accepted conventions — these are genuine computational failures, not convention divergence.
Where Models Actually Failed
Not all wrong answers are wrong for the same reason. The benchmark distinguishes three failure modes — and the distinction matters for how you interpret each model's score.
- Extraction failure: wrong rent column, missed lease term, or hallucinated values from the document. Most common in OCR-degraded source material.
- Convention divergence: economically valid NER computed under a different CRE convention than the benchmark's annual default. Recoverable under lenient scoring.
- Computation error: wrong cash flow basis, wrong discount math, or no defensible convention match. Not recoverable under any view.
Methodology
- Document: ~1M SF single-tenant NNN industrial lease (Delaware), with full Addendum 1 rent schedule.
- Models: 7 frontier models across major providers (Anthropic, OpenAI, Google).
- Prompt variants: detailed (full methodology spelled out) and less detailed (model decides what matters).
- Tasks: single-step lease extraction and multi-step extraction → NER. (A supplementary single-step NER baseline was used internally to validate calculator behavior.)
- Scoring: step-weighted for workflow tasks: extraction 50%, NER calculation 50%. 10 runs per model per variant.
- Metrics: accuracy, completion rate, cost per run, latency, and failure mode taxonomy.
- Validation: domain-reviewed answer keys, deterministic field-by-field scoring, and post-hoc methodology analysis distinguishing convention divergence from real failure.
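Deterministic field-by-field scoring can be sketched roughly as below. The answer-key fields, values, and numeric tolerance here are hypothetical illustrations, not the benchmark's actual key or tolerances:

```python
# Hypothetical answer key; real keys are domain-reviewed per lease.
ANSWER_KEY = {"base_rent_psf_yr1": 4.75, "term_months": 120, "antenna_count": 2}
NUMERIC_TOL = 0.01  # illustrative absolute tolerance for float fields

def score_fields(extracted: dict) -> float:
    """Return the fraction of answer-key fields the model got right."""
    correct = 0
    for field, expected in ANSWER_KEY.items():
        got = extracted.get(field)
        if got is None:
            continue  # a null or missing field scores zero
        if isinstance(expected, float):
            correct += abs(got - expected) <= NUMERIC_TOL
        else:
            correct += got == expected
    return correct / len(ANSWER_KEY)

# Two of three key fields correct; antenna_count came back null.
print(score_fields({"base_rent_psf_yr1": 4.75, "term_months": 120,
                    "antenna_count": None}))
```

Because scoring is deterministic against a fixed key, the same model output always produces the same score, which is what makes run-to-run comparisons across the cohort meaningful.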
Run it yourself
This benchmark is live on Iceberg. You can run the same task, inspect the model output, and compare your result against the cohort. Every benchmark run has a public URL — the tables above link directly to the underlying runs.
Benchmark Roadmap
- Short-horizon extraction — single document, single work product (covered above).
- Multi-step extraction → computation — chained workflow execution (covered above).
- Next: longer-horizon underwriting and IC memo workflows, multi-document chaining, and cross-asset comparisons.
This is an early benchmark family, not a finished map of CRE work. The point is to make the progression from extraction to full workflow execution measurable, repeatable, and inspectable.