The Benchmarks Are Lying to You. Here's How to Actually Evaluate LLMs.
I spend a lot of time in the AI space: reading papers, building things, talking to engineers who are shipping. And there is a gap between what the demos show and what production systems look like that nobody is being fully honest about. So here is my take on where things actually are.

Everyone is calling everything an "agent" right now. A function that calls a tool? Agent. A chatbot with memory? Agent. A script with a loop? Agent. This dilution is not just semantic. It

