LLM vs SLM for AI Agents: Notch’s Hybrid Playbook

NVIDIA’s position paper argues that small language models (SLMs) are not just “good enough” for AI agents - they are often better aligned to the job and far more economical.
(Source: NVIDIA’s “Small Language Models are the Future of Agentic AI”)

The shift toward hybrid architectures - SLMs handling the assembly line, LLMs reserved for the moments that actually require deeper reasoning - represents a maturation in how production AI systems are built.

SLMs are the future of AI agents - and here’s how to put that into practice.

Key Differences Between SLMs and LLMs

Large language models (LLMs) like GPT-4 and Claude range from 70B to 400B+ parameters.

  • Excel at sophisticated reasoning, nuanced language, and complex multi-step problems
  • Slower inference and higher latency
  • 10-50x more expensive per token than SLMs
  • Require specialized infrastructure or API access to frontier providers

Small language models (SLMs) in the 1B-8B parameter range include Phi-3, Gemma, Mistral 7B, and Llama 3 8B.

  • Handle well-defined tasks effectively: classification, extraction, routing, structured outputs
  • Faster responses, lower latency
  • Run on modest hardware, easier to sandbox
  • Practical to fine-tune on domain-specific data
  • Struggle with open-ended reasoning, ambiguous situations, and tasks requiring broad world knowledge

Why SLMs fit agentic architecture

Right tool for the job - Most agent tasks are small and repeatable: classification, extraction, routing. Smaller models handle them just fine.

Easy to run - Faster responses, simple to sandbox, and runs on everyday hardware. Latency compounds across the 20-40 model calls a single interaction can require.

Friendly on budget - Much cheaper per token than flagship LLMs, which cuts serving costs substantially.

Easier to control - Smaller models with narrower capabilities are easier to constrain. Guardrails and output validation work more reliably.
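The compounding effect is easy to see with back-of-envelope arithmetic. The numbers below are illustrative assumptions (not measured benchmarks), using the 30-call interaction described later in this post and a hypothetical 27/3 split between SLM and LLM steps:

```python
# Back-of-envelope: latency and cost compound across many calls per interaction.
# All figures are illustrative assumptions, not measured benchmarks.
calls_per_interaction = 30
slm_latency_s, llm_latency_s = 0.3, 2.0        # seconds per call (assumed)
slm_cost_usd, llm_cost_usd = 0.0001, 0.003     # dollars per call (assumed)

# All-LLM pipeline vs. hybrid (27 SLM steps + 3 LLM escalations).
all_llm_latency = calls_per_interaction * llm_latency_s
hybrid_latency = 27 * slm_latency_s + 3 * llm_latency_s

all_llm_cost = calls_per_interaction * llm_cost_usd
hybrid_cost = 27 * slm_cost_usd + 3 * llm_cost_usd
```

Under these assumptions the hybrid pipeline is roughly 4x faster end-to-end and nearly an order of magnitude cheaper per interaction; the exact ratio depends entirely on your own latency and pricing figures.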

Why Using Both LLMs and SLMs Makes Sense

Default to SLMs for the pipeline - route only the hard parts or brand-critical copy to a larger LLM. NVIDIA explicitly advocates heterogeneous agentic systems that invoke different models selectively, a pattern also visible in ChatGPT’s GPT-5 “Auto” mode, which routes requests between models.

A simple playbook

1. Inventory Tasks

List your agent's tasks and prompts. Tag each by complexity (simple/moderate/complex) and risk (low/medium/high). This audit reveals which steps actually need LLM capability and which default to an LLM simply because that's how the system was initially built.
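A minimal sketch of such an inventory, with task names and tags that are purely illustrative:

```python
# Sketch of a task inventory: tag each agent step by complexity and risk.
# Task names and tags are illustrative, not a real production configuration.
TASKS = [
    {"name": "classify_intent",   "complexity": "simple",   "risk": "low"},
    {"name": "extract_entities",  "complexity": "simple",   "risk": "low"},
    {"name": "route_to_tool",     "complexity": "moderate", "risk": "medium"},
    {"name": "draft_final_reply", "complexity": "complex",  "risk": "high"},
]

def needs_llm(task):
    """Only complex or high-risk steps default to the LLM path."""
    return task["complexity"] == "complex" or task["risk"] == "high"

slm_tasks = [t["name"] for t in TASKS if not needs_llm(t)]
llm_tasks = [t["name"] for t in TASKS if needs_llm(t)]
```

Even this crude two-axis tagging usually shows that most steps land in the SLM bucket.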

2. Right-Size Models

Trial SLMs in the 1B-8B range on each action. Fine-tune or prompt-tune where needed for domain-specific performance. Evaluate against your actual task distribution, not generic benchmarks.
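Evaluating against your own task distribution can be as simple as scoring a candidate model on labeled examples drawn from real traffic. In this sketch, `run_model` is a hypothetical stand-in for whatever inference client you use:

```python
# Sketch of evaluating a candidate SLM against your actual task distribution.
# `run_model` is a hypothetical stand-in for a real inference client.
def evaluate(run_model, labeled_examples):
    """Return accuracy of a model over (prompt, expected_label) pairs."""
    correct = 0
    for prompt, expected in labeled_examples:
        if run_model(prompt).strip() == expected:
            correct += 1
    return correct / len(labeled_examples)

# Usage with a trivial stub model standing in for a real SLM:
examples = [
    ("Classify: 'I want my money back'", "refund"),
    ("Classify: 'App crashes on login'", "tech_support"),
]
stub = lambda p: "refund" if "money" in p else "tech_support"
```

The point is the harness, not the stub: swap in each candidate SLM and compare accuracy per task before committing to a routing decision.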

3. Add a Router

Implement a lightweight policy layer that escalates to LLMs on clear triggers: low confidence scores, novel task patterns, safety or tone-sensitive replies, high-value customer interactions. Start conservative with escalation thresholds, then tighten as you build confidence in SLM performance.
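The router itself can start as a handful of explicit rules. This is a minimal sketch of the escalation triggers listed above; the thresholds, task names, and tier labels are assumptions to tune against your own traffic:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    confidence: float       # calibrated confidence of the SLM's output
    task_type: str
    tone_sensitive: bool
    customer_tier: str

# Illustrative values - start conservative, then tighten over time.
CONFIDENCE_FLOOR = 0.75
KNOWN_TASKS = {"classify_intent", "extract_entities", "route_to_tool"}

def should_escalate(r: StepResult) -> bool:
    """Escalate to the LLM on any clear trigger from the playbook."""
    if r.confidence < CONFIDENCE_FLOOR:    # low confidence score
        return True
    if r.task_type not in KNOWN_TASKS:     # novel task pattern
        return True
    if r.tone_sensitive:                   # safety or tone-sensitive reply
        return True
    if r.customer_tier == "enterprise":    # high-value customer interaction
        return True
    return False
```

Keeping the policy as plain, inspectable rules (rather than a learned router) makes it easy to audit why any given call was escalated.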

4. Guardrails and Evals

Build unit tests for prompts, regression suites for ongoing validation, and safety filters around the SLM path. Smaller models need tighter guardrails because they have less capacity to handle edge cases gracefully.
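A typical guardrail on the SLM path is strict output validation: anything malformed or out-of-vocabulary is rejected and falls back to the LLM. A minimal sketch, assuming the SLM is prompted to return JSON with an `intent` field (the label set is illustrative):

```python
import json

# Illustrative label vocabulary - replace with your own taxonomy.
ALLOWED_INTENTS = {"refund", "billing", "tech_support", "other"}

def validate_classification(raw: str):
    """Return the intent label, or None if the SLM output fails validation.

    A None result signals the caller to fall back to the LLM path,
    so a misbehaving SLM can never inject an unexpected label downstream.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    intent = parsed.get("intent")
    if intent not in ALLOWED_INTENTS:
        return None
    return intent
```

The same validate-or-fall-back pattern extends naturally to entity extraction and tool-call arguments.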

5. Measure and Iterate

Track task success rate by model, latency per step and end-to-end, fallback rate from SLM to LLM, and cost per resolved task. Use these metrics to continuously adjust routing thresholds and identify tasks where model assignment should shift.
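The metrics above need nothing fancier than per-model counters to start. A minimal sketch (the interface is an assumption, not an existing library):

```python
from collections import defaultdict

class RoutingMetrics:
    """Minimal per-model counters for success rate, latency, and fallbacks."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.successes = defaultdict(int)
        self.latency_ms = defaultdict(float)
        self.fallbacks = 0

    def record(self, model, success, latency_ms, fell_back=False):
        self.calls[model] += 1
        self.successes[model] += int(success)
        self.latency_ms[model] += latency_ms
        self.fallbacks += int(fell_back)

    def success_rate(self, model):
        return self.successes[model] / self.calls[model] if self.calls[model] else 0.0

    def fallback_rate(self):
        total = sum(self.calls.values())
        return self.fallbacks / total if total else 0.0
```

A rising fallback rate on a task is the clearest signal that its model assignment should shift back to the LLM; a consistently near-zero one suggests the escalation threshold can be tightened.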

How do we use SLMs and LLMs at Notch?

At Notch, we treat LLMs like a scarce, premium resource - and SLMs like the default workhorse. In a naive agent setup, resolving a single customer message can take 30-40 LLM calls: classify intent, retrieve policy documents, extract entities, decide which tools to use, run checks, draft, rewrite for tone, and verify. That adds up fast in both latency and cost, and it gets worse in complex domains like banking and insurance, where every interaction can involve multiple products, compliance rules, and edge-case exceptions. Instead, we break the job into small, repeatable tasks and right-size the model to each step.

SLMs handle the “assembly line” work - routing, extraction, simple decisions, and structured tool calls - while an LLM is reserved for the few moments that actually need deeper reasoning or brand-sensitive language, like ambiguous cases, multi-step disputes, or the final customer-facing answer when tone and nuance matter. The result is the same (or better), with a dramatically lower cost per resolved ticket and faster time-to-resolution.
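The assembly line described above can be sketched as a short pipeline in which every step runs on the SLM except the final customer-facing rewrite. The `slm` and `llm` callables are hypothetical stand-ins for real model clients, and the prompts are simplified for illustration:

```python
# Sketch of the hybrid "assembly line": SLM steps feed one final LLM step.
# `slm` and `llm` are hypothetical stand-ins for real model call functions.
def handle_message(message, slm, llm):
    intent = slm(f"Classify intent: {message}")
    entities = slm(f"Extract entities: {message}")
    draft = slm(f"Draft a reply for intent '{intent}' using {entities}")
    # Only the final, customer-facing answer goes to the LLM,
    # where brand tone and nuance actually matter.
    return llm(f"Rewrite for brand tone: {draft}")
```

In this toy shape, three of the four model calls stay on the cheap, fast path; a real pipeline would interleave the router and guardrails from the playbook between steps.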

What should be escalated to a larger LLM?

Complex multi-step reasoning, ambiguous tool choices, or creative/brand voice moments like the final client-facing reply.

Bottom line

Using heavy LLMs for every agent action is like buying a supercomputer for high-school math. Pick the right model size for the job, and you can slash costs while maintaining outcomes - then reserve premium LLMs for where they actually move the needle.

Replace the CS grind with autonomous precision

Book a Demo