Early Lease Termination & Lender Consent Analysis
Benchmarking frontier AI models on a multi-document, multi-step CRE asset management workflow.
We tested 7 frontier AI models on a real asset management problem: analyze a proposed early lease termination for a Class A office building, calculate remaining obligations and replacement costs, run net effective rent on both leases, and determine whether lender consent conditions are satisfied.
The benchmark used 5 reference documents, 5 analytical steps, and 18 scored outputs.
April 2026 · 7 models · 35 runs · Internal calibration
The Workflow
This benchmark tests a five-step analytical workflow that an asset manager would perform before seeking lender consent on an early lease termination. Each step depends on the one before it.
1. Calculate remaining lease obligations: remaining rent through expiration, plus unamortized TI, leasing commissions, and free rent.
2. Calculate replacement tenant costs: downtime rent, free rent, TI, landlord work, and leasing commissions for the replacement tenant.
3. Calculate net effective rent: NER for both leases, using the methodology prescribed in the loan agreement.
4. Apply lender consent conditions: test whether at least 2 of 3 loan covenant conditions are satisfied (a minimal sketch of this check follows below).
5. Make a recommendation: should the landlord proceed with the termination?
An error in the leasing commission calculation cascades into remaining obligations, replacement costs, and NER.
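Step 4 reduces to a simple threshold check once each condition has been evaluated upstream. A minimal sketch in Python, using hypothetical condition names (the actual conditions are defined in the loan agreement excerpt):

```python
# Hypothetical condition names; the real 2-of-3 test is defined in the
# loan agreement excerpt, and each value comes from the earlier steps.
conditions = {
    "replacement_ner_not_below_existing": True,
    "termination_payment_covers_shortfall": True,
    "no_event_of_default_outstanding": False,
}

# Lender consent is available if at least 2 of the 3 conditions are satisfied.
consent_available = sum(conditions.values()) >= 2
print(f"{sum(conditions.values())} of 3 conditions met -> consent: {consent_available}")
```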
What the Models Read
- Curated Lease Packet: rent schedule, concessions, TI allowance, broker clause, commencement and expiration mechanics
- Manhattan Commission Schedule: tiered commission rates with abatement treatment and broker split convention
- Termination Email Exchange: 4-email negotiation arriving at a $12M termination payment effective September 30, 2026
- Replacement Tenant LOI: rent, free rent, TI, landlord work, term, and broker representation
- Loan Agreement Excerpt: lease termination restriction with the 2-of-3 lender consent test and NER methodology
Total context: approximately 13 pages across 5 documents.
Accuracy on Completed Runs
Accuracy is measured only on runs where the model produced valid structured output. Models that failed to complete are reported separately below.
| Model | Avg accuracy | Range | Perfect runs / completed | Cost / run | Time / run |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 100.0% | 100–100% | 5 of 5 | $0.30 | 2.1 min |
| Claude Opus 4.6 | 100.0% | 100–100% | 2 of 2 | $0.71 | 3.4 min |
| GPT-5 Mini | 90.0% | 81–100% | 2 of 5 | $0.05 | 3.2 min |
| GPT-5.4 | 88.9% | 75–100% | 2 of 5 | $0.09 | 43 sec |
| GPT-5 | 86.1% | 81–94% | 1 of 3 | $0.18 | 4.5 min |
| Claude Haiku 4.5 | 37.0% | 19–42% | 0 of 3 | $0.10 | 1.1 min |
Gemini 3.1 Pro scored 100% on every run and completed every run. Claude Opus 4.6 also scored 100% on the runs it completed, but completed only 2 of 5 attempts. GPT-5 Mini and GPT-5.4 both achieved perfect scores on some runs, with variance driven primarily by leasing commission methodology.
Completion Rate
A model that scores 100% when it completes but only completes 40% of the time presents a different risk profile than a model that scores 90% but completes every time.
| Model | Attempted | Completed | Rate |
|---|---|---|---|
| Gemini 3.1 Pro | 5 | 5 | 100% |
| GPT-5.4 | 5 | 5 | 100% |
| GPT-5 Mini | 5 | 5 | 100% |
| GPT-5 | 5 | 3 | 60% |
| Claude Haiku 4.5 | 5 | 3 | 60% |
| Claude Opus 4.6 | 5 | 2 | 40% |
| Claude Sonnet 4.6 | 5 | 0 | 0% |
Many completion failures appear to be driven by tool-calling behavior between the model and our evaluation runner rather than by the analytical difficulty of the task itself. Claude Sonnet 4.6 did not complete a single run. Claude Opus 4.6 produced flawless analytical results on the runs it did complete, but finished only 2 of 5 attempts.
The Economics
A mid-level asset management professional at a CRE investment firm might earn $200–300K in total compensation. At a blended rate of roughly $120–150/hour, a first-pass analysis like this would take an estimated 1–2 hours.
| | Human Analyst | AI (cheapest) | AI (most accurate) | AI (fastest) |
|---|---|---|---|---|
| Time | 1–2 hours | 3.2 min | 2.1 min | 43 sec |
| Cost | $150–300 | $0.05 | $0.30 | $0.09 |
| Model | — | GPT-5 Mini | Gemini 3.1 Pro | GPT-5.4 |
| Accuracy | Assumed correct | 90% avg | 100% avg | 89% avg |
Running the same analysis across the four most reliable models (Gemini, GPT-5.4, GPT-5 Mini, GPT-5) costs under $0.65 and takes under 5 minutes. That kind of redundancy — cross-checking AI outputs against each other — was not economically practical before.
These numbers do not suggest AI replaces the analyst. The analyst still reviews, validates, and makes the final decision. But the economics of first-pass analysis and review are shifting.
What Models Got Right
Across completed runs from models that produced strong results (Gemini, GPT-5.4, GPT-5, GPT-5 Mini, and Opus when it completed), document extraction and the straightforward arithmetic were handled reliably, and when the leasing commission methodology was also correct, scores landed at or near 100%.
Where Models Broke
Leasing Commission Methodology
38% pass rate. The hardest single calculation in the benchmark. The Manhattan lease commission schedule requires amortizing rental abatements over the full lease term, deducting them from annual fixed rent, applying tiered rates to each lease year, handling a partial year 11, and applying a 150% broker split.
Common errors:
- Used the rent-paying term instead of the full lease term for abatement amortization
- Omitted the partial year 11 from the commission schedule
When models calculated the leasing commission (LC) correctly, they often scored at or near 100%. When they did not, scores typically fell materially.
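To make the failure mode concrete, here is a minimal sketch of the amortize-deduct-tier-split sequence. The tier rates, rents, and abatement figures are hypothetical; the benchmark's actual numbers come from the Manhattan commission schedule and the lease packet.

```python
# Hypothetical illustration of the LC methodology described above. Tier rates,
# rents, and abatement amounts are invented; only the sequence (amortize the
# abatement over the FULL lease term, net it out of annual fixed rent, apply
# tiered rates per lease year, prorate the partial final year, apply the
# broker split) mirrors the benchmark.

TIER_RATES = {1: 0.05, 2: 0.04, 3: 0.04, 4: 0.035, 5: 0.035}  # lease year -> rate (hypothetical)
LATER_YEAR_RATE = 0.025   # hypothetical rate for years beyond the listed tiers
BROKER_SPLIT = 1.50       # 150% when the tenant is represented by an outside broker

def leasing_commission(annual_fixed_rents, total_abatement, term_years):
    """annual_fixed_rents: fixed rent for each lease year, with the partial
    final year already prorated. total_abatement is amortized straight-line
    over the full lease term (not the rent-paying term, the error noted above)."""
    abatement_per_year = total_abatement / term_years
    commission = 0.0
    for year, rent in enumerate(annual_fixed_rents, start=1):
        year_fraction = min(1.0, term_years - (year - 1))   # handles the partial year 11
        net_rent = rent - abatement_per_year * year_fraction
        commission += net_rent * TIER_RATES.get(year, LATER_YEAR_RATE)
    return commission * BROKER_SPLIT

# Example: a 10.5-year term (partial year 11), flat $1.0M annual fixed rent,
# and $500K of free rent amortized over the full 10.5 years.
rents = [1_000_000] * 10 + [500_000]
print(f"LC: ${leasing_commission(rents, 500_000, 10.5):,.0f}")
```

Swapping the rent-paying term into the amortization denominator, or dropping the prorated year 11 from the loop, reproduces the two errors listed above.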
Output Reliability
Four of the seven models had completion rates below 100%, and two completed fewer than half their attempts. Many of these failures appear to be related to tool-calling orchestration rather than analytical capability. Claude Opus's intermediate tool-call outputs showed correct calculations — the model could do the work, but could not consistently deliver the final structured output.
Cascading Errors
When a model got leasing commissions wrong, the error cascaded into unamortized LC, total remaining obligations, replacement LC, and total replacement costs. The NER calculation, however, absorbed the LC error without failing: after annualization, the per-square-foot impact was small relative to the NER scoring tolerances.
What This Tells You
Frontier AI models can complete real, multi-step CRE asset management work.
This is not extraction or summarization. This is a five-step analytical workflow requiring document interpretation, financial calculation, legal covenant application, and a business recommendation. Multiple models completed it correctly end-to-end.
Price does not cleanly predict performance.
GPT-5 Mini ($0.05/run) outperformed GPT-5 ($0.18/run) on both average accuracy and completion reliability. Claude Opus ($0.71/run) — the most expensive model — scored perfectly when it completed, but completed only 40% of its runs.
Reliability matters as much as accuracy.
A model that scores 100% but completes only 40% of the time presents a meaningfully different risk than one that scores 90% and completes every time. Completion rate should be part of any vendor evaluation.
The differences between models emerge on hard analytical steps.
Simple extraction and arithmetic were solved by most models. The differentiator was leasing commission methodology — a multi-step reasoning chain requiring interpretation, calculation, and application of multiple rules. Generic demonstrations will not surface these differences.
Firms should benchmark AI vendors on their actual workflows.
If you are evaluating AI tools for CRE work, test them on the specific analytical tasks your team does — with real documents, real calculations, and deterministic scoring. The results may differ significantly from what vendor demos suggest.
Important Caveats
This is not a production benchmark. This was an internal calibration exercise to test challenge design, scoring methodology, and model reliability. The results are directional.
The prompt was fixed. All models received the same user prompt. In a real deployment, prompt quality is a significant variable.
Tolerances affect scores. Scoring uses binary pass/fail within defined tolerance bands. Some models produced commercially reasonable answers that fell just outside tolerance — particularly on leasing commissions, where multiple defensible methodologies exist.
Completion failures are partly infrastructure-related. The Anthropic model failures appear to be driven in part by tool-calling orchestration specific to our evaluation runner. Different infrastructure could produce different completion rates.
Sample size is limited. Five runs per model provides directional signal but is not sufficient for statistically robust conclusions.
One workflow, one prompt. This benchmark tests one specific workflow with one prompt. Model performance may not generalize to other CRE tasks.
Saturation risk on the current task. One model scored 100% on every run with the current scoring tolerances. Further testing is needed to determine whether the task provides sufficient differentiation across top-performing models.
Methodology
Scoring: 18 fields scored against a human-verified oracle. Binary pass/fail within field-specific tolerances.
Oracle verification: All oracle values independently calculated by a CRE professional using a spreadsheet model and cross-checked.
NER methodology: Prescribed in the loan agreement — NPV of monthly rent cash flows from commencement through expiration, minus undiscounted upfront leasing costs, converted to level annual $/SF at 8% per annum discounted monthly (one reading of this is sketched in code at the end of this section).
LC methodology: Commission schedule applied to net adjusted annual rent (after straight-line amortization of abatements over the full lease term), with 150% broker split.
Tool access: All models had access to a Python calculator tool.
Fixed inputs: Same system prompt, user prompt, and reference documents for all models and all runs.
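The NER and scoring descriptions above translate into a short calculation. One reasonable reading, sketched with assumed inputs rather than the benchmark's actual lease figures and tolerances:

```python
# Sketch of the prescribed NER methodology under stated assumptions; the lease
# cash flows, upfront costs, and tolerance below are illustrative only.

def net_effective_rent(monthly_rents, upfront_leasing_costs, rentable_sf, annual_rate=0.08):
    """NPV of monthly rent from commencement through expiration, minus
    undiscounted upfront leasing costs, levelled to an annual $/SF figure
    at 8% per annum discounted monthly."""
    r = annual_rate / 12
    term_months = len(monthly_rents)
    npv_rent = sum(cf / (1 + r) ** m for m, cf in enumerate(monthly_rents, start=1))
    net_value = npv_rent - upfront_leasing_costs           # upfront costs not discounted
    annuity_factor = (1 - (1 + r) ** -term_months) / r     # level-payment conversion
    return (net_value / annuity_factor) * 12 / rentable_sf

def passes(model_value, oracle_value, tolerance=0.01):
    """Binary pass/fail within a field-specific relative tolerance."""
    return abs(model_value - oracle_value) <= tolerance * abs(oracle_value)

# Illustrative 10-year lease: 100,000 SF, $50/SF face rent, 6 months free,
# $5M of upfront TI and commissions.
sf = 100_000
rents = [0.0] * 6 + [50 * sf / 12] * 114
ner = net_effective_rent(rents, 5_000_000, sf)
print(round(ner, 2), passes(ner, oracle_value=ner * 1.005))
```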
Run it yourself
Try the same workflow on Iceberg — pick a model, run it against the five-document packet, and see where it holds up and where it breaks.
Directional results from internal Iceberg calibration runs. Not a formal audited study. All scores reflect binary pass/fail within defined tolerances against a human-verified oracle.