Over the past six months, we've run thousands of test conversations through every major language model on the market: Claude, Gemini, Llama, Mistral, GPT-3.5, GPT-4, and a handful of open-source alternatives. We weren't looking for the cheapest option or the fastest one. We were looking for the one that could actually do the job.
The job, in our case, is nuanced: handle real estate leasing conversations with the accuracy and empathy of a great leasing agent. That's a higher bar than most people realize.
After extensive testing, we're going all-in on GPT-4 as the foundation for our Leasing Logic Engine. Here's why.
The Testing Framework
Before we dive into results, let me explain how we tested. We created a benchmark of 500 real tenant conversations pulled from our beta customers (anonymized, of course). These conversations covered:
- Routine inquiries — pricing, availability, amenities, pet policies
- Complex scenarios — co-signers, Section 8, lease breaks, roommate changes
- Compliance landmines — questions that could lead to Fair Housing violations if answered incorrectly
- Emotional situations — frustrated tenants, urgent move-in needs, complaint-laden messages
Each response was graded on four dimensions: accuracy, compliance, tone, and helpfulness. We used human evaluators (experienced property managers) plus automated checks for known compliance patterns.
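To make the grading concrete, here's a minimal sketch of how scores on those four dimensions might roll up into a per-model number. The dimension names come from our rubric; the equal weighting and the function names are illustrative assumptions, not our exact evaluation harness.

```python
from statistics import mean

# The four grading dimensions from our rubric. Equal weighting here is an
# illustrative assumption; a real harness might weight compliance higher.
DIMENSIONS = ("accuracy", "compliance", "tone", "helpfulness")

def grade_response(scores: dict[str, float]) -> float:
    """Average one response's 0-100 scores across the four dimensions."""
    return mean(scores[d] for d in DIMENSIONS)

def grade_model(responses: list[dict[str, float]]) -> float:
    """Overall model score: the mean of its per-response grades."""
    return mean(grade_response(r) for r in responses)
```

In practice the human-evaluator scores and the automated compliance checks would feed into those per-dimension numbers before aggregation.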
The Results
Here's the summary, though I'll spare you the full spreadsheet:
GPT-4 scored 94.2% overall, with particularly strong performance on compliance (97%) and complex scenarios (91%). No other model broke 88% overall.
Claude came close at 89.1%, with excellent tone scores but weaker performance on real-estate-specific nuance. It would sometimes give technically correct but contextually awkward answers—like explaining lease terms in a way that sounded more like a legal document than a conversation.
GPT-3.5 scored 76.4%, which might sound acceptable until you realize that a nearly 24% error rate on tenant conversations means multiple screwups per day for a busy property. The cost savings weren't worth the quality drop.
Open-source models clustered around 65-72%, with the best (Llama 2 70B fine-tuned) reaching 74%. Impressive for free, but not production-ready for our use case.
Why GPT-4 Wins for Real Estate
The numbers tell part of the story, but the qualitative differences were more revealing. Here's what set GPT-4 apart:
1. It Understands Context Switching
Real tenant conversations are messy. Someone asks about parking, then pet policies, then circles back to parking with a follow-up question. GPT-4 tracks these threads naturally. Other models would sometimes "forget" the earlier context or conflate different topics.
2. It Handles Ambiguity Gracefully
When a tenant asks "What's your pet policy?", they might mean: breed restrictions, weight limits, deposit amounts, monthly pet rent, or all of the above. GPT-4 recognizes this ambiguity and either asks clarifying questions or provides comprehensive information. Lesser models would often answer just one aspect and leave the tenant frustrated.
3. It Knows When to Stop
This is subtle but critical. GPT-4 recognizes when a question requires human judgment—like "Can you make an exception to the credit score requirement?"—and gracefully hands off rather than inventing an answer. Other models would sometimes fabricate policies or make unauthorized promises.
4. The Tone is Right
Leasing conversations require a specific register: professional but warm, informative but not overwhelming, helpful without being pushy. GPT-4 nails this. Claude was close but occasionally drifted toward an overly formal register. GPT-3.5 could sound robotic under pressure.
The Cost Question
Let's address the elephant in the room: GPT-4 is expensive. At current API pricing, it costs roughly 10-20x more per conversation than GPT-3.5. That's a real consideration.
Our answer comes down to unit economics. If GPT-4 converts just one additional lead per month that GPT-3.5 would have fumbled, the cost difference is irrelevant. A single signed lease is worth $12-24K in rent over a year. The API cost for thousands of conversations is a rounding error.
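The back-of-envelope math is worth making explicit. This sketch uses the figures from the paragraph above (a 10-20x cost multiple, a $12-24K lease); the per-conversation base cost is an illustrative assumption, since actual API pricing varies with conversation length.

```python
# Illustrative assumption: a GPT-3.5 conversation costs about a cent.
GPT35_COST_PER_CONVO = 0.01
# Midpoint of the 10-20x premium cited above.
GPT4_MULTIPLIER = 15

def monthly_api_premium(conversations: int) -> float:
    """Extra monthly API spend from choosing GPT-4 over GPT-3.5."""
    return conversations * GPT35_COST_PER_CONVO * (GPT4_MULTIPLIER - 1)
```

Under these assumptions, even 10,000 conversations a month adds on the order of $1,400 in API spend, which is small next to a single $12,000+ lease.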
We're also implementing smart routing: straightforward questions (like "What are your office hours?") get handled by lighter models, while complex inquiries route to GPT-4. This hybrid approach cuts costs by 40% without sacrificing quality where it matters.
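A minimal sketch of that routing idea: classify the inbound message and send simple topics to a cheaper model, everything else to GPT-4. The keyword heuristic and model names below are illustrative assumptions; our production router is a learned classifier, not a keyword list.

```python
# Topics we treat as "straightforward" for routing purposes (illustrative).
SIMPLE_TOPICS = {"office hours", "address", "parking", "availability"}

def pick_model(message: str) -> str:
    """Route simple questions to a lighter model, the rest to GPT-4."""
    text = message.lower()
    if any(topic in text for topic in SIMPLE_TOPICS):
        return "gpt-3.5-turbo"
    return "gpt-4"
```

The key design choice is to fail safe: anything the classifier isn't sure about falls through to the stronger model, so the cost savings never come at the expense of the complex conversations.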
What This Means for Customers
For WARE customers, this decision means:
- Higher accuracy on the conversations that actually matter—the complex ones where deals get made or lost
- Fewer embarrassing errors that require human cleanup and damage tenant relationships
- Better compliance out of the box, reducing legal exposure
- Faster improvements as OpenAI releases new versions—we'll be first in line
Are we betting the company on a single vendor? In some sense, yes. But we're also betting on the trajectory: GPT-4 is getting cheaper and better every quarter. The model that powers our engine in 2025 will likely be significantly more capable than what we have today, at a fraction of the cost.
Looking Ahead
This isn't religion. If Claude or Gemini or some new entrant definitively outperforms GPT-4 for real estate conversations, we'll switch. We're model-agnostic in principle, even if we're committed in practice.
But for now, GPT-4 is the best tool for the job. And that's what matters most to our customers: AI that actually works, every conversation, every time.
See GPT-4-powered leasing in action. Request a demo.