There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model at completing various useful enterprise tasks, from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks share one major shortcoming: they measure the AI's ability to complete specific problems and requests, not how factual the model is in its outputs: how well it generates objectively correct information tied to real-world data, especially when that information is contained in imagery or graphics.
For industries where accuracy is paramount, such as legal, finance, and medicine, the lack of a standardized way to measure factuality has been a critical blind spot.
That changes today: Google's FACTS team and its data science unit Kaggle have released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.
The accompanying research paper lays out a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).
While the headline news is Gemini 3 Pro's top-tier placement, the deeper story for developers is the industry-wide "factuality wall."
According to the initial results, no model, including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus, managed to crack a 70% accuracy score across the suite of problems. For technical leaders, this is a signal: the era of "trust but verify" is far from over.
Deconstructing the Benchmark
The FACTS suite moves beyond simple Q&A. It is composed of four distinct tests, each simulating a different real-world failure mode that developers encounter in production:
- Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?
- Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?
- Multimodal Benchmark (Vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?
- Grounding Benchmark v2 (Context): Can the model stick strictly to the provided source text?
Google has released 3,513 examples to the public, while Kaggle holds a private set to prevent developers from training on the test data, a common concern known as "contamination."
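For teams that want to run their own models against the public examples, the basic harness is straightforward to sketch. The `Example` fields and the `generate` and `judge` callables below are assumptions made for illustration only; the published dataset and leaderboard define their own schema and scoring.

```python
# Minimal sketch of an eval loop over the four FACTS-style categories.
# The Example schema, generate(), and judge() are illustrative assumptions,
# not the published format.
from collections import defaultdict
from dataclasses import dataclass

CATEGORIES = ("parametric", "search", "multimodal", "grounding")

@dataclass
class Example:
    category: str                   # one of CATEGORIES
    prompt: str                     # the question or instruction
    context: str | None = None      # source text for grounding tasks
    image_path: str | None = None   # image for multimodal tasks

def evaluate(examples, generate, judge):
    """generate(ex) -> model response text; judge(ex, response) -> True/False."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        response = generate(ex)
        total[ex.category] += 1
        correct[ex.category] += int(judge(ex, response))
    # Per-category accuracy, mirroring how sub-benchmark scores are reported.
    return {cat: correct[cat] / total[cat] for cat in total}
```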
The Leaderboard: A Game of Inches
The initial run of the benchmark places Gemini 3 Pro in the lead with a total FACTS Score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI's GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.
| Model | FACTS Score (Avg) | Search (RAG Capability) | Multimodal (Vision) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data sourced from the FACTS team release notes.
For Developers: The "Search" vs. "Parametric" Gap
For developers building RAG (retrieval-augmented generation) systems, the Search Benchmark is the most important metric.
The data shows a notable discrepancy between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). For example, Gemini 3 Pro scores a high 83.8% on Search tasks but only 76.4% on Parametric tasks.
This validates the current enterprise architecture standard: don't rely on a model's internal memory for critical facts.
If you are building an internal knowledge bot, the FACTS results suggest that hooking your model up to a search tool or vector database is not optional; it is the only way to push accuracy toward acceptable production levels.
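As a rough illustration of that architecture, here is a minimal retrieval-grounded answering sketch. Both `search_documents` and `call_model` are hypothetical placeholders for whatever search tool or vector store and model API your stack uses; the key point is that the prompt constrains the model to the retrieved context rather than its parametric memory.

```python
# Minimal sketch of a retrieval-grounded query path.
# search_documents() stands in for your vector store or search tool;
# call_model() stands in for your model provider's API. Both are
# hypothetical placeholders, not real library calls.

def search_documents(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k passages from your index for `query`."""
    raise NotImplementedError("wire this to your vector DB or search tool")

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its text reply."""
    raise NotImplementedError("wire this to your model API")

def grounded_answer(question: str) -> str:
    passages = search_documents(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```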
The Multimodal Warning
The most alarming data point for product managers is the performance on Multimodal tasks. The scores here are universally low. Even the category leader, Gemini 2.5 Pro, only hit 46.9% accuracy.
The benchmark tasks included reading charts, interpreting diagrams, and identifying objects in nature. With less than 50% accuracy across the board, this suggests that multimodal AI is not yet ready for unsupervised data extraction.
Bottom line: if your product roadmap involves having an AI automatically scrape data from invoices or interpret financial charts without human-in-the-loop review, you are likely introducing significant error rates into your pipeline.
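One way to act on that caution is a confidence gate: any field the vision model extracts below a threshold is routed to a human review queue instead of flowing straight into downstream systems. The sketch below assumes a hypothetical `extract_fields` helper that returns per-field values with confidence scores, and the threshold is illustrative rather than a recommendation from the benchmark.

```python
# Sketch of a human-in-the-loop gate for multimodal extraction.
# extract_fields() is a hypothetical callable that runs your vision model
# and returns {field_name: (value, confidence)} for a document image.

REVIEW_THRESHOLD = 0.9  # illustrative cutoff; tune against your own error data

def process_document(image_bytes: bytes, extract_fields) -> dict:
    extracted = extract_fields(image_bytes)
    accepted, needs_review = {}, {}
    for field, (value, confidence) in extracted.items():
        if confidence >= REVIEW_THRESHOLD:
            accepted[field] = value        # flows into the pipeline
        else:
            needs_review[field] = value    # queued for a human reviewer
    return {"accepted": accepted, "needs_review": needs_review}
```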
Why This Matters for Your Stack
The FACTS Benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technical leaders should look beyond the composite score and drill into the specific sub-benchmark that matches their use case (see the sketch after this list):
- Building a Customer Support Bot? Look at the Grounding score to make sure the bot sticks to your policy documents. (Gemini 2.5 Pro actually outscored Gemini 3 Pro here, 74.2 vs. 69.0.)
- Building a Research Assistant? Prioritize Search scores.
- Building an Image Analysis Tool? Proceed with extreme caution.
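To make that concrete, here is a small sketch that re-weights the sub-benchmark scores cited above for a hypothetical support-bot use case. Only the numbers come from this article's figures (Grounding scores are cited here only for the two Gemini models); the weights are invented for illustration.

```python
# Re-weight FACTS sub-benchmark scores for a specific use case.
# Scores come from the leaderboard and Grounding figures cited above;
# models without a cited Grounding score are marked "n/a".

SCORES = {
    "Gemini 3 Pro":    {"search": 83.8, "multimodal": 46.1, "grounding": 69.0},
    "Gemini 2.5 Pro":  {"search": 63.9, "multimodal": 46.9, "grounding": 74.2},
    "GPT-5":           {"search": 77.7, "multimodal": 44.1},
    "Grok 4":          {"search": 75.3, "multimodal": 25.7},
    "Claude 4.5 Opus": {"search": 73.2, "multimodal": 39.2},
}

# Illustrative weights for a support bot: mostly grounding, some search, no vision.
WEIGHTS = {"grounding": 0.6, "search": 0.4, "multimodal": 0.0}

def weighted_score(scores: dict, weights: dict) -> float | None:
    # Skip models missing a sub-score that the weighting actually uses.
    if any(w > 0 and k not in scores for k, w in weights.items()):
        return None
    return sum(scores.get(k, 0.0) * w for k, w in weights.items())

for model, scores in SCORES.items():
    result = weighted_score(scores, WEIGHTS)
    print(model, "n/a" if result is None else round(result, 1))
```

Swapping in a search-heavy profile for a research assistant, or a vision-heavy one for document extraction, changes which models even qualify for comparison.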
As the FACTS team noted in their release, "All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress." For now, the message to the industry is clear: the models are getting smarter, but they are not yet infallible. Design your systems with the assumption that, roughly one-third of the time, the raw model might simply be wrong.