📋 Today's Highlights
Agent tool-use evaluations reveal critical gaps: frontier LLMs succeed on fewer than 60% of real-world multi-tool tasks, while GUI automation sets a new state of the art at 73.3% on AndroidWorld.