Early Lease Termination & Lender Consent Analysis
Benchmarking frontier AI models on a multi-document, multi-step CRE asset management workflow.
We tested 7 frontier AI models on a real asset management problem: analyze a proposed early lease termination for a Class A office building, calculate remaining obligations and replacement costs, run net effective rent on both leases, and determine whether lender consent conditions are satisfied.
The benchmark used 5 reference documents, 5 analytical steps, and 18 scored outputs.
April 2026 · 7 models · 35 runs · Internal calibration
The Workflow
This benchmark tests a five-step analytical workflow that an asset manager would perform before seeking lender consent on an early lease termination. Each step depends on the one before it.
1. Calculate remaining lease obligations: remaining rent through expiration, plus unamortized TI, leasing commissions, and free rent.
2. Calculate replacement tenant costs: downtime rent, free rent, TI, landlord work, and leasing commissions for the replacement tenant.
3. Calculate net effective rent: NER for both leases, using the methodology prescribed in the loan agreement.
4. Apply lender consent conditions: test whether at least 2 of 3 loan covenant conditions are satisfied (a minimal sketch of this check follows below).
5. Make a recommendation: should the landlord proceed with the termination?
An error in the leasing commission calculation cascades into remaining obligations, replacement costs, and NER.
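Step 4 reduces to a simple threshold check once each condition has been evaluated upstream. A minimal sketch in Python, using hypothetical condition names (the actual conditions are defined in the loan agreement excerpt):

```python
# Hypothetical condition names; the real 2-of-3 test is defined in the
# loan agreement excerpt, and each value comes from the earlier steps.
conditions = {
    "replacement_ner_not_below_existing": True,
    "termination_payment_covers_shortfall": True,
    "no_event_of_default_outstanding": False,
}

# Lender consent is available if at least 2 of the 3 conditions are satisfied.
consent_available = sum(conditions.values()) >= 2
print(f"{sum(conditions.values())} of 3 conditions met -> consent: {consent_available}")
```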
What the Models Read
- Curated Lease Packet: rent schedule, concessions, TI allowance, broker clause, commencement and expiration mechanics
- Manhattan Commission Schedule: tiered commission rates with abatement treatment and broker split convention
- Termination Email Exchange: 4-email negotiation arriving at a $12M termination payment effective September 30, 2026
- Replacement Tenant LOI: rent, free rent, TI, landlord work, term, and broker representation
- Loan Agreement Excerpt: lease termination restriction with the 2-of-3 lender consent test and NER methodology
Total context: approximately 13 pages across 5 documents.
Accuracy on Completed Runs
Accuracy is measured only on runs where the model produced valid structured output. Models that failed to complete are reported separately below.
| Model | Avg accuracy | Range | Perfect runs / completed | Cost / run | Time / run |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 100.0% | 100–100% | 5 of 5 | $0.30 | 2.1 min |
| Claude Opus 4.6 | 100.0% | 100–100% | 2 of 2 | $0.71 | 3.4 min |
| GPT-5 Mini | 90.0% | 81–100% | 2 of 5 | $0.05 | 3.2 min |
| GPT-5.4 | 88.9% | 75–100% | 2 of 5 | $0.09 | 43 sec |
| GPT-5 | 86.1% | 81–94% | 1 of 3 | $0.18 | 4.5 min |
| Claude Haiku 4.5 | 37.0% | 19–42% | 0 of 3 | $0.10 | 1.1 min |
Gemini 3.1 Pro scored 100% on every run and completed every run. Claude Opus 4.6 also scored 100% on the runs it completed, but completed only 2 of 5 attempts. GPT-5 Mini and GPT-5.4 both achieved perfect scores on some runs, with variance driven primarily by leasing commission methodology.
Completion Rate
A model that scores 100% when it completes but only completes 40% of the time presents a different risk profile than a model that scores 90% but completes every time.
| Model | Attempted | Completed | Rate |
|---|---|---|---|
| Gemini 3.1 Pro | 5 | 5 | 100% |
| GPT-5.4 | 5 | 5 | 100% |
| GPT-5 Mini | 5 | 5 | 100% |
| GPT-5 | 5 | 3 | 60% |
| Claude Haiku 4.5 | 5 | 3 | 60% |
| Claude Opus 4.6 | 5 | 2 | 40% |
| Claude Sonnet 4.6 | 5 | 0 | 0% |
Many completion failures appear to be driven by tool-calling behavior between the model and our evaluation runner rather than by the analytical difficulty of the task itself. Claude Sonnet 4.6 did not complete a single run. Claude Opus 4.6 produced flawless analytical results on the runs it did complete, but finished only 2 of 5 attempts.
The Economics
A mid-level asset management professional at a CRE investment firm might earn $200–300K in total compensation. At a blended rate of roughly $120–150/hour, a first-pass analysis like this would take an estimated 1–2 hours.
| | Human Analyst | AI (cheapest) | AI (most accurate) | AI (fastest) |
|---|---|---|---|---|
| Time | 1–2 hours | 3.2 min | 2.1 min | 43 sec |
| Cost | $150–300 | $0.05 | $0.30 | $0.09 |
| Model | — | GPT-5 Mini | Gemini 3.1 Pro | GPT-5.4 |
| Accuracy | Assumed correct | 90% avg | 100% avg | 89% avg |
Running the same analysis across the four most reliable models (Gemini, GPT-5.4, GPT-5 Mini, GPT-5) costs under $0.65 and takes under 5 minutes. That kind of redundancy — cross-checking AI outputs against each other — was not economically practical before.
These numbers do not suggest AI replaces the analyst. The analyst still reviews, validates, and makes the final decision. But the economics of first-pass analysis and review are shifting.
What Models Got Right
Across completed runs from models that produced strong results (Gemini, GPT-5.4, GPT-5, GPT-5 Mini, and Opus when it completed), document extraction and the straightforward arithmetic were handled reliably, and when the leasing commission methodology was also correct, scores landed at or near 100%.
Where Models Broke
Leasing Commission Methodology
38% pass rate. The hardest single calculation in the benchmark. The Manhattan lease commission schedule requires amortizing rental abatements over the full lease term, deducting them from annual fixed rent, applying tiered rates to each lease year, handling a partial year 11, and applying a 150% broker split.
Common errors:
- Used the rent-paying term instead of the full lease term for abatement amortization
- Omitted the partial year 11 from the commission schedule
When models calculated the leasing commission (LC) correctly, they often scored at or near 100%. When they did not, scores typically fell materially.
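To make the failure mode concrete, here is a minimal sketch of the amortize-deduct-tier-split sequence. The tier rates, rents, and abatement figures are hypothetical; the benchmark's actual numbers come from the Manhattan commission schedule and the lease packet.

```python
# Hypothetical illustration of the LC methodology described above. Tier rates,
# rents, and abatement amounts are invented; only the sequence (amortize the
# abatement over the FULL lease term, net it out of annual fixed rent, apply
# tiered rates per lease year, prorate the partial final year, apply the
# broker split) mirrors the benchmark.

TIER_RATES = {1: 0.05, 2: 0.04, 3: 0.04, 4: 0.035, 5: 0.035}  # lease year -> rate (hypothetical)
LATER_YEAR_RATE = 0.025   # hypothetical rate for years beyond the listed tiers
BROKER_SPLIT = 1.50       # 150% when the tenant is represented by an outside broker

def leasing_commission(annual_fixed_rents, total_abatement, term_years):
    """annual_fixed_rents: fixed rent for each lease year, with the partial
    final year already prorated. total_abatement is amortized straight-line
    over the full lease term (not the rent-paying term, the error noted above)."""
    abatement_per_year = total_abatement / term_years
    commission = 0.0
    for year, rent in enumerate(annual_fixed_rents, start=1):
        year_fraction = min(1.0, term_years - (year - 1))   # handles the partial year 11
        net_rent = rent - abatement_per_year * year_fraction
        commission += net_rent * TIER_RATES.get(year, LATER_YEAR_RATE)
    return commission * BROKER_SPLIT

# Example: a 10.5-year term (partial year 11), flat $1.0M annual fixed rent,
# and $500K of free rent amortized over the full 10.5 years.
rents = [1_000_000] * 10 + [500_000]
print(f"LC: ${leasing_commission(rents, 500_000, 10.5):,.0f}")
```

Swapping the rent-paying term into the amortization denominator, or dropping the prorated year 11 from the loop, reproduces the two errors listed above.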
Output Reliability
Four of the seven models had completion rates below 100%, and two completed fewer than half their attempts. Many of these failures appear to be related to tool-calling orchestration rather than analytical capability. Claude Opus's intermediate tool-call outputs showed correct calculations — the model could do the work, but could not consistently deliver the final structured output.
Cascading Errors
When a model got leasing commissions wrong, the error cascaded into unamortized LC, total remaining obligations, replacement LC, and total replacement costs. The NER calculation, however, absorbed the LC error without failing: after annualization, the per-square-foot impact was small relative to the NER scoring tolerances.
What This Tells You
Frontier AI models can complete real, multi-step CRE asset management work.
This is not extraction or summarization. This is a five-step analytical workflow requiring document interpretation, financial calculation, legal covenant application, and a business recommendation. Multiple models completed it correctly end-to-end.
Price does not cleanly predict performance.
GPT-5 Mini ($0.05/run) outperformed GPT-5 ($0.18/run) on both average accuracy and completion reliability. Claude Opus ($0.71/run) — the most expensive model — scored perfectly when it completed, but completed only 40% of its runs.
Reliability matters as much as accuracy.
A model that scores 100% but completes only 40% of the time presents a meaningfully different risk than one that scores 90% and completes every time. Completion rate should be part of any vendor evaluation.
The differences between models emerge on hard analytical steps.
Simple extraction and arithmetic were solved by most models. The differentiator was leasing commission methodology — a multi-step reasoning chain requiring interpretation, calculation, and application of multiple rules. Generic demonstrations will not surface these differences.
Firms should benchmark AI vendors on their actual workflows.
If you are evaluating AI tools for CRE work, test them on the specific analytical tasks your team does — with real documents, real calculations, and deterministic scoring. The results may differ significantly from what vendor demos suggest.
Important Caveats
This is not a production benchmark. This was an internal calibration exercise to test challenge design, scoring methodology, and model reliability. The results are directional.
The prompt was fixed. All models received the same user prompt. In a real deployment, prompt quality is a significant variable.
Tolerances affect scores. Scoring uses binary pass/fail within defined tolerance bands. Some models produced commercially reasonable answers that fell just outside tolerance — particularly on leasing commissions, where multiple defensible methodologies exist.
Completion failures are partly infrastructure-related. The Anthropic model failures appear to be driven in part by tool-calling orchestration specific to our evaluation runner. Different infrastructure could produce different completion rates.
Sample size is limited. Five runs per model provides directional signal but is not sufficient for statistically robust conclusions.
One workflow, one prompt. This benchmark tests one specific workflow with one prompt. Model performance may not generalize to other CRE tasks.
Saturation risk on the current task. One model scored 100% on every run with the current scoring tolerances. Further testing is needed to determine whether the task provides sufficient differentiation across top-performing models.
Methodology
Scoring: 18 fields scored against a human-verified oracle. Binary pass/fail within field-specific tolerances.
Oracle verification: All oracle values independently calculated by a CRE professional using a spreadsheet model and cross-checked.
NER methodology: Prescribed in the loan agreement — NPV of monthly rent cash flows from commencement through expiration, minus undiscounted upfront leasing costs, converted to level annual $/SF at 8% per annum discounted monthly (one reading of this is sketched in code at the end of this section).
LC methodology: Commission schedule applied to net adjusted annual rent (after straight-line amortization of abatements over the full lease term), with 150% broker split.
Tool access: All models had access to a Python calculator tool.
Fixed inputs: Same system prompt, user prompt, and reference documents for all models and all runs.
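The NER and scoring descriptions above translate into a short calculation. One reasonable reading, sketched with assumed inputs rather than the benchmark's actual lease figures and tolerances:

```python
# Sketch of the prescribed NER methodology under stated assumptions; the lease
# cash flows, upfront costs, and tolerance below are illustrative only.

def net_effective_rent(monthly_rents, upfront_leasing_costs, rentable_sf, annual_rate=0.08):
    """NPV of monthly rent from commencement through expiration, minus
    undiscounted upfront leasing costs, levelled to an annual $/SF figure
    at 8% per annum discounted monthly."""
    r = annual_rate / 12
    term_months = len(monthly_rents)
    npv_rent = sum(cf / (1 + r) ** m for m, cf in enumerate(monthly_rents, start=1))
    net_value = npv_rent - upfront_leasing_costs           # upfront costs not discounted
    annuity_factor = (1 - (1 + r) ** -term_months) / r     # level-payment conversion
    return (net_value / annuity_factor) * 12 / rentable_sf

def passes(model_value, oracle_value, tolerance=0.01):
    """Binary pass/fail within a field-specific relative tolerance."""
    return abs(model_value - oracle_value) <= tolerance * abs(oracle_value)

# Illustrative 10-year lease: 100,000 SF, $50/SF face rent, 6 months free,
# $5M of upfront TI and commissions.
sf = 100_000
rents = [0.0] * 6 + [50 * sf / 12] * 114
ner = net_effective_rent(rents, 5_000_000, sf)
print(round(ner, 2), passes(ner, oracle_value=ner * 1.005))
```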
Run it yourself
Try the same workflow on Iceberg — pick a model, run it against the five-document packet, and see where it holds up and where it breaks.
Directional results from internal Iceberg calibration runs. Not a formal audited study. All scores reflect binary pass/fail within defined tolerances against a human-verified oracle.