The 2025 Model Tool-Calling Landscape: Which LLMs Can Actually Use Your APIs?

If you're building AI applications that need to call external tools—APIs, databases, memory systems, or any structured interface—you've probably discovered an uncomfortable truth: not all LLMs are created equal when it comes to tool use.

We recently evaluated 35 models for a client integration and discovered a fundamental architectural split that's reshaping how we think about multi-model applications. This post shares what we learned.

The "Agentic Separation" of 2025

The LLM landscape has quietly diverged into two distinct categories:

| Model Type | Optimization Target | Tool Behavior |
|---|---|---|
| Acting Models | Tool execution, function calling | Reliably outputs structured JSON to call your APIs |
| Thinking Models | Chain-of-thought reasoning | Reasons about tools instead of calling them |

This isn't a minor distinction. It's architectural, baked into how these models were trained, and it determines whether your tool-calling integration will work reliably—or fail silently.

The "Thinking Trap"

Models like DeepSeek R1 and Qwen's "Thinking" variants are post-trained with reinforcement learning that rewards generating reasoning tokens. When you give them a tool like save_to_database, something interesting happens:

<think>
The user mentioned they're a lawyer. This seems important.
I should consider saving this to the database.
Let me analyze whether this qualifies as worth storing...
What are the implications of saving versus not saving?
[exhausts token limit without ever calling the API]
</think>

The model treats the tool decision as a reasoning puzzle rather than an action to execute. Its RL-tuned weights actively resist the deterministic JSON output your API requires.

This isn't a prompting problem you can engineer around. The DeepSeek R1 Nature paper explicitly states the model "cannot make use of tools" and has "suboptimal structured output capabilities." This is by design.

What We Found: Model-by-Model Assessment

We evaluated models across the major families for tool-calling reliability. Here's how they stack up:

Tier 1: Reliable Tool Calling

These models have native function calling baked into their post-training. They treat tool definitions as a distinct syntactic mode, not text to complete.

| Model | Company | Evidence |
|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Best-in-class; 77% on SWE-bench; first to pass 60% on Terminal-Bench 2.0 |
| GPT-5.1 | OpenAI | Native function calling; production-proven at scale |
| Qwen 3 / Qwen 3 Next | Alibaba | ~69.6% on SWE-bench; designed explicitly for agentic use |
| MiniMax M2 | MiniMax | 77.2% on τ²-Bench; explicitly agent-optimized |
| Kimi K2 | Moonshot | Can chain 200-300 sequential tool calls autonomously |
| Mistral Large 3 | Mistral | Native tool calling; 256K context window |
| GLM 4.6 | Zhipu AI | Agent and tool integration as a core architectural focus |
| GPT-OSS 120B / 20B | OpenAI | Open weights with strong tool use; Apache 2.0 license |
| Llama 4 Maverick | Meta | Strong general benchmarks; better than Scout for tools |
| Llama 3.3 | Meta | Proven tool capabilities; 128K context |
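
If you haven't used native function calling before, the shape is roughly the same across Tier 1 providers. Here's a minimal sketch using the OpenAI Python SDK's tools parameter; the save_to_database tool, its schema, and the model id are illustrative placeholders, and other Tier 1 providers expose equivalent interfaces:

```python
# Minimal sketch of native function calling via an OpenAI-compatible endpoint.
# The tool name, schema, and model id are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "save_to_database",
        "description": "Persist a fact about the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "key": {"type": "string"},
                "value": {"type": "string"},
            },
            "required": ["key", "value"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5.1",  # any Tier 1 model with native function calling
    messages=[{"role": "user", "content": "I'm a lawyer, by the way."}],
    tools=tools,
)

# A Tier 1 model hands back a structured tool call, not prose about the tool.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name)                   # "save_to_database"
    print(json.loads(call.function.arguments))  # {"key": ..., "value": ...}
```

The important part is the final loop: the tool call arrives as structured data, so there's no JSON to scrape out of free-form text.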

Tier 2: Limited / Proceed with Caution

These models can call tools but have reliability issues you need to handle:

| Model | Issue | Workaround |
|---|---|---|
| Llama 4 Scout | Safety constraints often block "recording" actions | Limit it to read-only operations |
| Ministral 3B / 8B | Forgets required JSON fields (parameter injection failures) | Set defaults at the API layer |
| DeepSeek V3.1 / V3.2 | Tool calling at the API level only, not native to the weights | Implement parser retry logic |
| Mistral Small (older) | "Chatty caller": adds conversational text before the JSON | Strip the preamble, extract the JSON |
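
For the Ministral-style failure mode, the fix really can live entirely in your API layer. A minimal sketch, with hypothetical field names and defaults:

```python
# Sketch: backfill required fields a small model forgot to include.
# Field names and default values here are hypothetical.
DEFAULTS = {"category": "general", "ttl_days": 30}
REQUIRED = {"content", "category", "ttl_days"}

def with_defaults(arguments: dict) -> dict:
    """Merge model-supplied arguments over safe defaults, then validate."""
    merged = {**DEFAULTS, **arguments}
    missing = REQUIRED - merged.keys()
    if missing:
        raise ValueError(f"Cannot default required fields: {missing}")
    return merged

# e.g. Ministral omits ttl_days, and the API layer fills it in
print(with_defaults({"content": "User is a lawyer", "category": "profile"}))
```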

The "Chatty Caller" problem: Some models will output something like:

"Sure! I'll save that for you now. Here is the function call:
{"action": "save_memory", "content": "..."}"

If your parser expects pure JSON, this crashes. Add a negative constraint to your system prompt: "Do not output conversational text before or after tool use. Output only the raw function call."
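
Alongside the prompt constraint, a defensive parser can still recover the call when the preamble slips through anyway. A rough sketch:

```python
# Sketch: recover a tool call from a "chatty" response that wraps the JSON in prose.
import json
import re

def extract_tool_call(raw: str) -> dict | None:
    """Pull the first JSON object out of a mixed prose-plus-JSON response."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw = ('Sure! I\'ll save that for you now. Here is the function call:\n'
       '{"action": "save_memory", "content": "..."}')
print(extract_tool_call(raw))  # {'action': 'save_memory', 'content': '...'}
```

Treat a None result as a retry signal rather than a crash.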

Tier 3: Tool Calling Not Supported

These models should not be used for tool-based workflows:

| Model | Reason |
|---|---|
| DeepSeek Reasoner (R1) | Nature paper explicitly states it "cannot make use of tools" |
| DeepSeek V3.2 Thinking | Reasoning-only mode; outputs chain-of-thought, not JSON |
| DeepSeek V3.2 Speciale | Documentation confirms it "does not support tool calling" |
| Qwen 3 Next Thinking | Thinking mode prioritizes reasoning tokens over tool execution |
| Qwen 3 Coder | Specialized for code generation; weak general instruction following |

Practical Implications

If you're building a single-model application:

Choose a Tier 1 model and use native tool calling. Claude Sonnet 4.5, GPT-5.1, or Qwen 3 will handle your API integrations reliably.

If you're building a multi-model application:

You have two options:

Option A: Tiered Access
Restrict tool-dependent features to capable models. When users switch to a Thinking model, gracefully degrade: "Advanced features paused while using DeepSeek R1."
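
A sketch of what that gate can look like; the tier set and the user-facing notice are illustrative, not a canonical registry:

```python
# Sketch: gate tool-dependent features by model capability.
# The tier set below is illustrative; maintain your own from your evaluations.
TOOL_CAPABLE = {"claude-sonnet-4.5", "gpt-5.1", "qwen3-next", "kimi-k2"}

def plan_request(model_id: str, wants_tools: bool) -> dict:
    """Decide whether tool features stay on for the user's selected model."""
    if wants_tools and model_id not in TOOL_CAPABLE:
        return {
            "tools_enabled": False,
            "notice": f"Advanced features paused while using {model_id}.",
        }
    return {"tools_enabled": wants_tools, "notice": None}

print(plan_request("deepseek-r1", wants_tools=True))
# {'tools_enabled': False, 'notice': 'Advanced features paused while using deepseek-r1.'}
```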

Option B: Middleware Pattern
Move tool operations outside the model inference loop entirely. Retrieve context before the model sees the message; extract actions after the response. The model just sees enriched text—it never knows tools exist.

We've found Option B to be more robust for applications that need universal model coverage. The model-agnostic approach means new models work automatically without capability assessment.
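
Here's the shape of that pattern, with trivial stand-in hooks where your memory store, extractor, and inference client would go:

```python
# Sketch of the middleware pattern: tool operations live outside the model's
# inference loop. The hook functions are stand-ins; in a real system they would
# hit your memory store, a rules engine, or a small extraction model.

def retrieve_context(user_message: str) -> str:
    return "User previously mentioned they are a lawyer."        # stand-in lookup

def call_model(model: str, prompt: str) -> str:
    return f"[{model} reply to]: {prompt[:60]}..."               # stand-in inference

def extract_actions(user_message: str, reply: str) -> list[dict]:
    return [{"action": "save_memory", "content": user_message}]  # stand-in extractor

def persist_actions(actions: list[dict]) -> None:
    print("persisting:", actions)                                # stand-in write

def handle_message(model: str, user_message: str) -> str:
    # 1. Retrieve relevant state BEFORE inference and inline it as plain text.
    prompt = f"Context:\n{retrieve_context(user_message)}\n\nUser: {user_message}"
    # 2. Any model, Thinking or Acting, simply completes enriched text.
    reply = call_model(model, prompt)
    # 3. Extract and persist actions AFTER the response, outside the model.
    persist_actions(extract_actions(user_message, reply))
    return reply

print(handle_message("deepseek-r1", "I'm a lawyer, by the way."))
```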

If you're building memory or state management:

This is where the distinction matters most. Memory systems need reliable writes. A model that thinks about saving instead of actually saving will silently corrupt your state.

Either:

  • Restrict memory operations to Tier 1 models, or
  • Use an extraction layer (a dedicated model like Claude Haiku) that runs after every response to identify and persist memorable information
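
A sketch of that second approach using the Anthropic Python SDK; the model id, the prompt wording, and what you do with the returned list are all placeholders for your own stack:

```python
# Sketch: a post-response extraction pass with a small, cheap model.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_memories(user_message: str, assistant_reply: str) -> list[dict]:
    """Ask a small model which facts are worth persisting. Returns [] if none."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder: any small extraction model
        max_tokens=300,
        system=(
            "Identify facts about the user worth storing long-term. "
            'Respond with a JSON list of objects like {"key": "...", "value": "..."}. '
            "Respond with [] if there are none."
        ),
        messages=[{
            "role": "user",
            "content": f"User said: {user_message}\nAssistant replied: {assistant_reply}",
        }],
    )
    try:
        return json.loads(response.content[0].text)
    except (json.JSONDecodeError, IndexError):
        return []  # treat unparseable output as "nothing to save"

# Each returned item would then be written to your memory store.
```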

The Benchmark Gap

One thing we noticed: there is no widely reported, standardized tool-calling benchmark. Vendors publish scores on SWE-bench, HumanEval, and MMLU, but tool reliability usually has to be inferred from proxy metrics.

The closest indicators:

  • τ²-Bench — Specifically tests agentic tool use and long chains
  • Terminal-Bench — CLI/shell tool execution
  • SWE-bench Verified — Real-world coding with implicit tool use (git, tests, file systems)

If you're evaluating models for tool-heavy applications, look for these benchmarks rather than general reasoning scores.

What's Next

The Agentic Separation will likely deepen. We're already seeing model families release explicit "Thinking" and "Acting" variants (DeepSeek R1 vs V3, Qwen Thinking vs Instruct). This is a feature, not a bug—different architectures for different use cases.

For builders, the implication is clear: know your model's mode before you build your integration. A Thinking model is exceptional for complex reasoning tasks. An Acting model is what you need for reliable API orchestration. Using the wrong one for your use case will cause subtle, hard-to-debug failures.


At amotivv, we build memory infrastructure for AI applications. This research came from a real integration project where we needed to support 35+ models with consistent memory quality. If you're facing similar challenges, we'd love to talk.
