April 2026

Lease Abstract → NER Calculation

We tested 7 frontier models on a ~1M SF, single-tenant, NNN industrial lease across two benchmarks:

  • Single-step: abstract the lease (extract commercial terms and the full rent schedule)
  • Multi-step: abstract the lease and calculate the net effective rent (NER) in a single end-to-end workflow

Each benchmark was run with two prompt styles — detailed (full methodology spelled out) and less detailed (the model decides what matters) — across the same model cohort, 10 runs per model per variant.

This benchmark suite measures what changes when prompt guidance gets lighter and workflows get longer.

Directional results from live Iceberg runs. Not a formal audited study.

Can AI do the job?

With a detailed prompt, often yes. Much less reliably when the prompt gets lighter and the workflow gets longer.

Task | Detailed Prompt | Less Detailed Prompt
Abstract a 1M SF NNN industrial lease (7 models, single-step) | 99.4% | 93.4%
Abstract the lease, then run the NER (7 models, multi-step, 50/50 step weighting) | 94.0% | 70.1% strict / 82.3% lenient

Prompt quality is part of task performance.

The same model cohort looks very different when guidance gets lighter. Workflow horizon matters too — single-step tasks hold up better than chained workflows.

What this benchmark shows
  • Less guidance creates much more model separation. Detailed prompts compress the field; reducing guidance is what reveals real differences.
  • Extraction and workflow tasks fail in different ways. Extraction breaks on column disambiguation in the OCR-degraded rent table; workflows break on calculation methodology.
  • Some apparent workflow failures are valid NER convention divergence, not economic misunderstanding. Several models recover materially under lenient scoring that accepts multiple valid CRE conventions.
  • The remaining failures are real extraction or computation errors. Models that don't recover under lenient scoring are computing wrong values, not just using a different convention.

Scope & rationale

Real Task

Lease abstraction, NER calculation, and chained workflow execution on a real industrial lease — not a synthetic prompt.

Real Headroom

Detailed prompts compress the field. Reducing guidance is what reveals real differences across the cohort.

Real Diagnosis

Measures accuracy, completion, cost, and latency — and distinguishes extraction failure, methodology divergence, and computation error.

Single-Step: Lease Extraction

Detailed prompts compress the field. Less detailed prompts create the meaningful spread.

  • The dominant extraction failure is rent table column confusion in OCR-degraded source data.
  • Lower-salience term misses (e.g., antenna count) are a secondary failure mode.
Detailed Prompt — 10 runs per model, 57 scored fields
Model | Avg | Median | Range | Cost | Time | Completion | Runs
GPT-5.4 | 100.0% | 100.0% | 100–100% | $0.140 | 13.2s | 70% | best / worst
Claude Opus 4.6 | 100.0% | 100.0% | 100–100% | $0.340 | 24.1s | 100% | best / worst
Gemini 3.1 Pro | 100.0% | 100.0% | 100–100% | $0.189 | 91.2s | 100% | best / worst
GPT-5 | 100.0% | 100.0% | 100–100% | $0.111 | 64.5s | 90% | best / worst
Claude Sonnet 4.6 | 100.0% | 100.0% | 100–100% | $0.200 | 24.3s | 100% | best / worst
GPT-5 Mini | 100.0% | 100.0% | 100–100% | $0.020 | 55.3s | 100% | best / worst
Claude Haiku 4.5 | 99.1% | 99.1% | 98–100% | $0.070 | 12.2s | 100% | best / worst
Less Detailed Prompt — 10 runs per model, 57 scored fields
Model | Avg | Median | Range | Cost | Time | Runs
Claude Sonnet 4.6 | 100.0% | 100.0% | 100–100% | $0.200 | 33.4s | best / worst
Claude Haiku 4.5 | 98.4% | 98.2% | 98–100% | $0.070 | 11.2s | best / worst
Gemini 3.1 Pro | 97.9% | 100.0% | 79–100% | $0.170 | 52.8s | best / worst
GPT-5 | 96.3% | 98.2% | 77–100% | $0.116 | 81.1s | best / worst
GPT-5.4 | 95.6% | 100.0% | 79–100% | $0.140 | 12.4s | best / worst
GPT-5 Mini | 88.2% | 88.6% | 77–100% | $0.023 | 63.0s | best / worst
Claude Opus 4.6 | 85.1% | 78.9% | 77–100% | $0.340 | 21.0s | best / worst

All completed benchmark runs are live on Iceberg. Click “best” or “worst” to inspect the full model output and field-by-field scoring.

What changed when guidance got lighter

Detailed prompts pushed many models to the ceiling. Six of seven scored 100% on completed runs, though two of those models (GPT-5.4 and GPT-5) had intermittent provider failures that reduced their completion rates to 70% and 90%.

Less detailed prompts created real separation. Cohort average dropped from 99.4% to 93.4%, with a 15-point spread.

The biggest failures came from interpreting the rent schedule, not locating it: models retrieved the right table but misread its columns.

Price did not predict performance. The most expensive model finished last on the less detailed variant.

Accuracy Delta: Detailed → Less Detailed
Model | Detailed | Less Detailed | Delta
Claude Sonnet 4.6 | 100.0% | 100.0% | 0.0
Claude Haiku 4.5 | 99.1% | 98.4% | -0.7
Gemini 3.1 Pro | 100.0% | 97.9% | -2.1
GPT-5 | 100.0% | 96.3% | -3.7
GPT-5.4 | 100.0% | 95.6% | -4.4
GPT-5 Mini | 100.0% | 88.2% | -11.8
Claude Opus 4.6 | 100.0% | 85.1% | -14.9

Where Models Broke

Primary Failure Mode
Rent table column confusion

Models confused adjacent columns in the OCR-degraded Addendum 1 rent table — especially the fixed amortization component ($1.60/SF) versus the escalating base rent ($4.75–$5.60/SF).

Secondary Failure Mode
Lower-salience lease terms

Some models missed explicit but buried terms like antenna rights when the prompt stopped enumerating them. 44% of less detailed runs returned null for this field.

Addendum 1 — OCR-degraded rent table excerpt:

Year | BASE | AMORTIZED (fixed) component | TOTAL BASE
1 | $4.7503 | $1.6009 | $6.3512
2 | $4.8215 | $1.6009 | $6.4224
3 | $4.8939 | $1.6009 | $6.4948

Models that scored 100% on the detailed prompt regularly misidentified the $4.75 column when the semantic guidance was removed.
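Because the excerpt's values are internally consistent, the header confusion is mechanically checkable. Below is an illustrative sketch (not part of the benchmark harness) that labels the three columns purely from arithmetic: the total column equals the sum of the other two in every row, and the amortization column is constant across years.

```python
rows = [  # per-year values from the excerpt above: three unlabeled columns
    (4.7503, 1.6009, 6.3512),
    (4.8215, 1.6009, 6.4224),
    (4.8939, 1.6009, 6.4948),
]

def label_columns(rows, tol=0.001):
    cols = list(zip(*rows))
    # Total column: equal to the sum of the other two in every row.
    total = next(i for i in range(3)
                 if all(abs(2 * cols[i][r] - sum(rows[r])) < tol
                        for r in range(len(rows))))
    rest = [i for i in range(3) if i != total]
    # Fixed amortization component: constant across years.
    fixed = next(i for i in rest if max(cols[i]) - min(cols[i]) < tol)
    base = next(i for i in rest if i != fixed)  # the escalating base rent
    return {"base": base, "amortized": fixed, "total": total}

print(label_columns(rows))  # → {'base': 0, 'amortized': 1, 'total': 2}
```

This is roughly the disambiguation the detailed prompt's semantic guidance supplied; under the less detailed prompt, models had to work it out themselves.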

Claude Sonnet 4.6 was the only model to correctly identify the rent schedule columns on all 10 less detailed runs. Claude Opus 4.6 — the most expensive model — failed this disambiguation more often than its smaller sibling.

Multi-Step Workflow

Multi-Step: Lease Abstract → NER

The chained workflow benchmark is where prompt explicitness and methodology assumptions both start to matter. The model must abstract the lease and calculate the NER in a single end-to-end response. Step-weighted scoring: extraction 50%, NER calculation 50%.

A simpler single-step NER baseline (calculator-enabled, given the abstracted lease economics as input) largely converged once tool use was working — so the more meaningful benchmark became the chained workflow below.

  • Detailed workflow runs are strong but not fully saturated.
  • Less-detailed workflow runs create major separation across the cohort.
  • Strict scoring understates some performance because valid CRE NER conventions diverge under lighter prompting.

Detailed prompt

Full extraction checklist + locked NER methodology. Six of seven models score above 94%; only GPT-5.4 drops meaningfully due to provider reliability issues.

Detailed Workflow — Step-Weighted Scoring
Model | Strict | Errors | Best Run
Gemini 3.1 Pro | 100.0% | – | best
Claude Opus 4.6 | 98.7% | – | best
GPT-5 Mini | 98.2% | – | best
Claude Sonnet 4.6 | 97.9% | – | best
Claude Haiku 4.5 | 95.0% | – | best
GPT-5 | 94.2% | – | best
GPT-5.4 | 74.3% | 4 | best

Less detailed prompt

Same task, same schema, but the methodology is removed. Models must choose how to compute NER on their own.

This is where the workflow benchmark gets interesting — and where we surface a benchmark-design issue worth being explicit about.

Two scoring views

Strict: the model's NPV per SF must match the benchmark's reference value (annual end-of-period discounting at 8%) within a tight tolerance. This measures adherence to the methodology a benchmark author specified.

Lenient: the model's NPV per SF is accepted if it matches any of four valid CRE conventions (annual end-of-period, monthly nominal end-of-period, monthly nominal begin-of-period / annuity due, monthly effective end-of-period). This measures whether the model produced an economically valid answer under any standard convention.

The NER PSF value itself stays strict in both views (all four conventions converge on the same NER to within $0.003).
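To make the convention gap concrete, here is a minimal sketch of the four accepted discounting conventions applied to the three years of rent from the Addendum 1 excerpt. The function names and the level-rent definition of NER (the constant rent with the same NPV) are our illustrative assumptions, not the benchmark's scorer.

```python
RATE = 0.08  # annual discount rate used by the benchmark

def npv_annual_end(rents):
    """Annual end-of-period: each year's rent discounted once, at year-end."""
    return sum(r / (1 + RATE) ** (y + 1) for y, r in enumerate(rents))

def npv_monthly(rents, monthly_rate, begin=False):
    """Monthly cash flows (rent/12), discounted monthly.
    begin=True shifts payments to the start of each month (annuity due)."""
    offset = 0 if begin else 1
    return sum((r / 12) / (1 + monthly_rate) ** (y * 12 + m + offset)
               for y, r in enumerate(rents) for m in range(12))

def ner(npv_fn, rents, *args, **kwargs):
    """Level rent with the same NPV: NPV(rents) / NPV of $1/SF/yr."""
    return npv_fn(rents, *args, **kwargs) / npv_fn([1.0] * len(rents), *args, **kwargs)

rents = [6.3512, 6.4224, 6.4948]          # $/SF/yr, from the excerpt
nominal = RATE / 12                        # monthly nominal rate
effective = (1 + RATE) ** (1 / 12) - 1     # monthly effective rate

conventions = {
    "annual end":         (npv_annual_end, (), {}),
    "monthly nom. end":   (npv_monthly, (nominal,), {}),
    "monthly nom. begin": (npv_monthly, (nominal,), {"begin": True}),
    "monthly eff. end":   (npv_monthly, (effective,), {}),
}
for name, (fn, args, kwargs) in conventions.items():
    print(f"{name:18s} NPV={fn(rents, *args, **kwargs):8.4f}  "
          f"NER={ner(fn, rents, *args, **kwargs):6.4f}")
```

The NPVs spread by a few percent across conventions while the NERs agree to well under a cent, which is why the NER PSF can stay strict in both scoring views.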

Less Detailed Workflow — Strict vs Lenient
Model | Strict | Lenient | Δ | Errors | Best Run
Gemini 3.1 Pro | 77.5% | 100.0% | +22.5 | – | best
GPT-5 | 93.5% | 98.5% | +5.0 | – | best
Claude Opus 4.6 | 68.4% | 92.1% | +23.7 | – | best
Claude Sonnet 4.6 | 78.3% | 88.3% | +10.0 | – | best
GPT-5.4 | 60.7% | 83.2% | +22.5 | – | best
GPT-5 Mini | 73.9% | 73.9% | 0.0 | – | best
Claude Haiku 4.5 | 63.3% | 64.6% | +1.2 | – | best

Δ shows how much each model recovers when the benchmark accepts alternate valid CRE conventions. Large positive deltas indicate convention divergence under reduced prompting; near-zero deltas indicate genuine computational or extraction failures.

NER Convention Usage

Across completed less-detailed runs, models did not converge on a single discounting convention. The four accepted conventions — annual end-of-period, monthly nominal end-of-period, monthly nominal begin-of-period (annuity due), and monthly effective end-of-period — are all standard CRE methods.

NPV Convention Usage — Less Detailed Workflow (up to 10 runs per model)
Model | Annual | Mo. Nom. End | Mo. Nom. Begin | Mo. Eff. End | No Match
Gemini 3.1 Pro127
GPT-5811
Claude Opus 4.628
Claude Sonnet 4.6541
GPT-5.4541
GPT-5 Mini82
Claude Haiku 4.5217

The “No Match” column counts runs where the model's NPV did not match any of the four accepted conventions — these are genuine computational failures, not convention divergence.

Where Models Actually Failed

Not all wrong answers are wrong for the same reason. The benchmark distinguishes three failure modes — and the distinction matters for how you interpret each model's score.

Extraction Failure
Wrong field interpretation

Wrong rent column, missed lease term, or hallucinated values from the document. Most common in OCR-degraded source material.

Convention Divergence
Valid alternate methodology

Economically valid NER computed under a different CRE convention than the benchmark's annual default. Recoverable under lenient scoring.

Computation Failure
Wrong NPV/PMT application

Wrong cash flow basis, wrong discount math, or no defensible convention match. Not recoverable under any view.
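Assuming per-run booleans for extraction correctness and for strict/lenient NPV matches (the names are illustrative, not the benchmark's code), the taxonomy above reduces to a small decision rule:

```python
def classify_run(extraction_ok, strict_match, lenient_match):
    """Map a run's check results onto the three failure modes above."""
    if not extraction_ok:
        return "extraction failure"       # wrong field interpretation
    if strict_match:
        return "correct"
    if lenient_match:
        return "convention divergence"    # valid alternate CRE methodology
    return "computation failure"          # no defensible convention match

print(classify_run(True, False, True))    # convention divergence
print(classify_run(True, False, False))   # computation failure
```

Large lenient-scoring deltas correspond to many runs landing in the "convention divergence" branch; near-zero deltas mean failures fall in the other two.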

Methodology

Source document

~1M SF single-tenant NNN industrial lease (Delaware), with full Addendum 1 rent schedule.

Model cohort

7 frontier models across major providers (Anthropic, OpenAI, Google).

Prompt variants

Detailed (full methodology spelled out) and less detailed (model decides what matters).

Workflow design

Single-step lease extraction and multi-step extraction → NER. (A supplementary single-step NER baseline was used internally to validate calculator behavior.)

Scoring

Step-weighted for workflow tasks: extraction 50%, NER calculation 50%. 10 runs per model per variant.
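As a minimal sketch of the step weighting, assuming an exact-match field scorer and an all-or-nothing NER check (the field names and the $0.01/SF tolerance are illustrative assumptions, not the actual harness):

```python
def extraction_score(predicted, answer_key):
    """Fraction of scored fields that exactly match the answer key."""
    hits = sum(1 for k, v in answer_key.items() if predicted.get(k) == v)
    return hits / len(answer_key)

def ner_score(predicted_ner, reference_ner, tol=0.01):
    """All-or-nothing NER check within a tight $/SF tolerance."""
    return 1.0 if abs(predicted_ner - reference_ner) <= tol else 0.0

def workflow_score(predicted_fields, answer_key, predicted_ner, reference_ner):
    """50/50 step weighting: extraction half, NER calculation half."""
    return (0.5 * extraction_score(predicted_fields, answer_key)
            + 0.5 * ner_score(predicted_ner, reference_ner))

key = {"term_months": 120, "base_year_1": 4.7503}   # illustrative fields
pred = {"term_months": 120, "base_year_1": 4.8215}  # one field wrong
print(workflow_score(pred, key, 6.42, 6.419))       # → 0.75
```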

Metrics

Accuracy, completion rate, cost per run, latency, and failure mode taxonomy.

Validation

Domain-reviewed answer keys, deterministic field-by-field scoring, and post-hoc methodology analysis distinguishing convention divergence from real failure.

Run it yourself

This benchmark is live on Iceberg. You can run the same task, inspect the model output, and compare your result against the cohort. Every benchmark run has a public URL — the tables above link directly to the underlying runs.


Benchmark Roadmap

  • Short-horizon extraction — single document, single work product (covered above).
  • Multi-step extraction → computation — chained workflow execution (covered above).
  • Next: longer-horizon underwriting and IC memo workflows, multi-document chaining, and cross-asset comparisons.

This is an early benchmark family, not a finished map of CRE work. The point is to make the progression from extraction to full workflow execution measurable, repeatable, and inspectable.