Kevin M. Boyce, John A. Nagl, and Kristan J. Wheaton
The US Army War College oral comprehensive examination serves as the institution's capstone assessment, measuring the strategic thinking of its senior officers. In early 2026, three faculty panels applied that standard to four leading commercial artificial intelligence (AI) systems: ChatGPT, Gemini, Claude, and Grok. Prompted without access to core curriculum materials, all four models passed. Unlike static benchmarks, the examination's impromptu dialogue format revealed meaningful performance differences that were invisible in general-purpose evaluations, with one model outperforming the others by a statistically significant margin. These findings challenge how the Department of War assesses commercial AI for strategic applications and point toward domain-specific, dialogue-based benchmarking as a more rigorous standard.
Keywords: military artificial intelligence, professional military education, AI benchmarking, oral comprehensive examination, strategic thinking