The 2025 Model Tool-Calling Landscape: Which LLMs Can Actually Use Your APIs?
If you're building AI applications that need to call external tools—APIs, databases, memory systems, or any structured interface—you've probably discovered an uncomfortable truth: not all LLMs are created equal when it comes to tool use.
We recently evaluated 35 models for a client integration and discovered a fundamental architectural split that's reshaping how we think about multi-model applications. This post shares what we learned.
The "Agentic Separation" of 2025
The LLM landscape has quietly diverged into two distinct categories:
| Model Type | Optimization Target | Tool Behavior |
|---|---|---|
| Acting Models | Tool execution, function calling | Reliably outputs structured JSON to call your APIs |
| Thinking Models | Chain-of-thought reasoning | Reasons about tools instead of calling them |
This isn't a minor distinction. It's architectural, baked into how these models were trained, and it determines whether your tool-calling integration will work reliably—or fail silently.
The "Thinking Trap"
Models like DeepSeek R1 and Qwen's "Thinking" variants are post-trained with reinforcement learning optimized for generating reasoning tokens. When you give them a tool like `save_to_database`, something interesting happens:
```
<think>
The user mentioned they're a lawyer. This seems important.
I should consider saving this to the database.
Let me analyze whether this qualifies as worth storing...
What are the implications of saving versus not saving?
[exhausts token limit without ever calling the API]
</think>
```

The model treats the tool decision as a reasoning puzzle rather than an action to execute. Its RL weights actively resist the deterministic JSON output your API requires.
This isn't a prompting problem you can engineer around. The DeepSeek R1 Nature paper explicitly states the model "cannot make use of tools" and has "suboptimal structured output capabilities." This is by design.
What We Found: Model-by-Model Assessment
We evaluated models across the major families for tool-calling reliability. Here's how they stack up:
Tier 1: Reliable Tool Calling
These models have native function calling baked into their post-training. They treat tool definitions as a distinct syntactic mode, not text to complete.
| Model | Company | Evidence |
|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Best-in-class; 77% SWE-bench; first to exceed 60% on Terminal-Bench 2.0 |
| GPT-5.1 | OpenAI | Native function calling; production-proven at scale |
| Qwen 3 / Qwen 3 Next | Alibaba | ~69.6% SWE-bench; designed explicitly for agentic use |
| MiniMax M2 | MiniMax | τ²-Bench 77.2%; explicitly agent-optimized |
| Kimi K2 | Moonshot | Can chain 200-300 sequential tool calls autonomously |
| Mistral Large 3 | Mistral | Native tool calling; 256K context window |
| GLM 4.6 | Zhipu AI | Agent and tool integration as core architectural focus |
| GPT-OSS 120B / 20B | OpenAI | Open weights with strong tool use; Apache 2.0 |
| Llama 4 Maverick | Meta | Strong general benchmarks; better than Scout for tools |
| Llama 3.3 | Meta | Proven tool capabilities; 128K context |
Tier 2: Limited / Proceed with Caution
These models can call tools but have reliability issues you need to handle:
| Model | Issue | Workaround |
|---|---|---|
| Llama 4 Scout | Safety constraints often block "recording" actions | Limit to read-only operations |
| Ministral 3B / 8B | Forgets required JSON fields (parameter injection failures) | Set defaults at API layer |
| DeepSeek V3.1 / V3.2 | Tool calling at API level only, not native to weights | Implement parser retry logic |
| Mistral Small (older) | "Chatty caller" — adds conversational text before JSON | Strip preamble, extract JSON |
The "Chatty Caller" problem: Some models will output something like:
"Sure! I'll save that for you now. Here is the function call:
{"action": "save_memory", "content": "..."}"If your parser expects pure JSON, this crashes. Add a negative constraint to your system prompt: "Do not output conversational text before or after tool use. Output only the raw function call."
Tier 3: Tool Calling Not Supported
These models should not be used for tool-based workflows:
| Model | Reason |
|---|---|
| DeepSeek Reasoner (R1) | Nature paper explicitly states "cannot make use of tools" |
| DeepSeek V3.2 Thinking | Reasoning-only mode; outputs CoT, not JSON |
| DeepSeek V3.2 Speciale | Documentation confirms "does not support tool calling" |
| Qwen 3 Next Thinking | Thinking mode prioritizes reasoning tokens over tool execution |
| Qwen 3 Coder | Specialized for code generation; weak general instruction following |
Practical Implications
If you're building a single-model application:
Choose a Tier 1 model and use native tool calling. Claude Sonnet 4.5, GPT-5.1, or Qwen 3 will handle your API integrations reliably.
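For reference, this is roughly what native tool calling looks like through the OpenAI Python SDK; the tool schema and model name below are illustrative, so adjust both to your own deployment.

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "save_to_database",
        "description": "Persist a fact about the user.",
        "parameters": {
            "type": "object",
            "properties": {"content": {"type": "string"}},
            "required": ["content"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5.1",  # any Tier 1 model with native function calling
    messages=[{"role": "user", "content": "I'm a lawyer, remember that."}],
    tools=tools,
)

# A Tier 1 model returns structured tool calls, not prose about tools.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```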
If you're building a multi-model application:
You have two options:
Option A: Tiered Access
Restrict tool-dependent features to capable models. When users switch to a Thinking model, gracefully degrade: "Advanced features paused while using DeepSeek R1."
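A minimal sketch of the gating logic, assuming you maintain your own registry of tool-capable models (the identifiers below are hypothetical stand-ins for the Tier 1 entries above):

```python
# Hypothetical registry keyed on whatever model identifiers your router uses.
TOOL_CAPABLE = {"claude-sonnet-4.5", "gpt-5.1", "qwen3-next", "kimi-k2"}

def build_request(model: str, user_message: str, tools: list[dict]) -> dict:
    """Attach tool definitions only when the selected model can act on them."""
    request = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    if model in TOOL_CAPABLE:
        request["tools"] = tools
    return request

def degradation_notice(model: str) -> str | None:
    """UI copy shown when a Thinking model is selected."""
    if model not in TOOL_CAPABLE:
        return f"Advanced features paused while using {model}."
    return None
```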
Option B: Middleware Pattern
Move tool operations outside the model inference loop entirely. Retrieve context before the model sees the message; extract actions after the response. The model just sees enriched text—it never knows tools exist.
We've found Option B to be more robust for applications that need universal model coverage. The model-agnostic approach means new models work automatically without capability assessment.
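Here's the shape of that pipeline as a runnable sketch; `MemoryStore`, `call_model`, and `extract_facts` are placeholder stand-ins for your own retrieval, inference, and extraction layers, not any particular SDK.

```python
class MemoryStore:
    """Toy in-memory store; a real one would use embeddings and persistence."""
    def __init__(self) -> None:
        self._facts: list[str] = []

    def search(self, query: str) -> list[str]:
        words = query.lower().split()
        return [f for f in self._facts if any(w in f.lower() for w in words)]

    def write(self, fact: str) -> None:
        self._facts.append(fact)

def call_model(model: str, prompt: str) -> str:
    # Placeholder for your inference client; plain text in, plain text out.
    return f"[{model}] reply to: {prompt[:40]}..."

def extract_facts(user_message: str, reply: str) -> list[str]:
    # Stand-in for the extraction layer; see the dedicated-model sketch below.
    return [user_message] if "i'm a" in user_message.lower() else []

store = MemoryStore()

def handle_message(user_message: str, model: str) -> str:
    # 1. Retrieve context BEFORE inference; the model never sees a tool.
    context = "\n".join(store.search(user_message))
    prompt = f"Relevant context:\n{context}\n\nUser: {user_message}"

    # 2. Plain-text generation works on Thinking and Acting models alike.
    reply = call_model(model, prompt)

    # 3. Extract and persist actions AFTER the response, outside inference.
    for fact in extract_facts(user_message, reply):
        store.write(fact)
    return reply

print(handle_message("I'm a lawyer, remember that.", "deepseek-reasoner"))
```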
If you're building memory or state management:
This is where the distinction matters most. Memory systems need reliable writes. A model that thinks about saving instead of actually saving will silently corrupt your state.
Either:
- Restrict memory operations to Tier 1 models, or
- Use an extraction layer (a dedicated model like Claude Haiku) that runs after every response to identify and persist memorable information, as sketched below
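A sketch of that extraction pass, assuming the Anthropic Python SDK; the model id and prompt wording are illustrative, so substitute whichever small extractor model you actually run.

```python
import json
import anthropic

client = anthropic.Anthropic()

EXTRACTOR_PROMPT = (
    "Identify facts from this exchange that are worth remembering long-term. "
    "Respond with a JSON array of strings and nothing else.\n\n"
    "User: {user}\nAssistant: {assistant}"
)

def extract_memories(user_message: str, assistant_reply: str) -> list[str]:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # small, cheap, reliable structured output
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": EXTRACTOR_PROMPT.format(user=user_message, assistant=assistant_reply),
        }],
    )
    try:
        return json.loads(resp.content[0].text)
    except (json.JSONDecodeError, IndexError, AttributeError):
        return []  # fail closed: never corrupt memory state on a bad extraction
```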
The Benchmark Gap
One thing we noticed: there's no standardized "tool-calling benchmark" that's widely reported. Models publish scores on SWE-bench, HumanEval, MMLU—but tool reliability is often inferred from proxy metrics.
The closest indicators:
- τ²-Bench — Specifically tests agentic tool use and long chains
- Terminal-Bench — CLI/shell tool execution
- SWE-bench Verified — Real-world coding with implicit tool use (git, tests, file systems)
If you're evaluating models for tool-heavy applications, look for these benchmarks rather than general reasoning scores.
What's Next
The Agentic Separation will likely deepen. We're already seeing model families release explicit "Thinking" and "Acting" variants (DeepSeek R1 vs V3, Qwen Thinking vs Instruct). This is a feature, not a bug—different architectures for different use cases.
For builders, the implication is clear: know your model's mode before you build your integration. A Thinking model is exceptional for complex reasoning tasks. An Acting model is what you need for reliable API orchestration. Using the wrong one for your use case will cause subtle, hard-to-debug failures.
At amotivv, we build memory infrastructure for AI applications. This research came from a real integration project where we needed to support 35+ models with consistent memory quality. If you're facing similar challenges, we'd love to talk.