On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.
Perhaps more notably, Mistral didn’t just release an AI model; it also released a new development app called Mistral Vibe. It’s a command line interface (CLI) similar to Claude Code, OpenAI Codex, and Gemini CLI that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.
It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified, which presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories. The AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the tasks in the benchmark test relatively simple bug fixes that experienced engineers could complete in under an hour, it’s one of the few standardized ways to compare coding models.
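For a sense of scale, a percentage score on a 500-task benchmark can be converted into an approximate count of solved problems. The Python sketch below assumes the score is simply solved tasks divided by total tasks; the function name `resolved_tasks` is ours, chosen for illustration.

```python
def resolved_tasks(score_percent: float, total_tasks: int = 500) -> int:
    """Approximate number of solved tasks, ASSUMING score = solved / total."""
    return round(score_percent / 100 * total_tasks)


# Devstral 2's reported 72.2 percent on the 500-task benchmark
print(resolved_tasks(72.2))
```

Under that assumption, a 72.2 percent score corresponds to roughly 361 of the 500 problems.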
At the same time as the larger AI coding model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware like a laptop with no Internet connection required. Both models support a 256,000 token context window, allowing them to process moderately large codebases (although whether you consider that large or small is very relative, depending on overall project complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.
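To get a rough feel for what a 256,000 token window means for a given project, you can estimate a codebase's token count from its size on disk. The Python sketch below is only an approximation: the four-characters-per-token ratio is a common rule of thumb, not a property of Mistral's tokenizer, and the helper names (`estimate_tokens`, `estimate_repo_tokens`, `fits_in_context`) are hypothetical.

```python
import os

CONTEXT_WINDOW_TOKENS = 256_000  # window size reported for both Devstral 2 models
CHARS_PER_TOKEN = 4  # ASSUMPTION: rough rule of thumb, not Mistral's actual tokenizer


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count alone."""
    return len(text) // CHARS_PER_TOKEN


def estimate_repo_tokens(root: str, extensions=(".py",)) -> int:
    """Sum the estimated tokens of every matching source file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                # Ignore undecodable bytes so stray binary content does not abort the scan
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(token_count: int, window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """True if the estimated token count fits within the context window."""
    return token_count <= window
```

Calling `fits_in_context(estimate_repo_tokens("."))` from a project's root directory gives a quick yes-or-no estimate. Under the four-characters-per-token assumption, the window holds roughly one million characters of source text, and a real tokenizer could shift that figure noticeably in either direction.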