Smaller and Leaner: How Compact AI Models and Tools Are Transforming On-Device and Edge Workflows
By Benjamin Chan | 18 Aug 2025 | IN-7911
Lean AI's Increasing Presence Across the Ecosystem
NEWS
OpenAI's launch of the open-weight gpt-oss-20b in early August 2025 marks a significant step toward edge Artificial Intelligence (AI). While not the first of its kind, the release signals that more AI developers are shifting toward on-device solutions for businesses and everyday users. The model's ability to run on edge devices with only 16 Gigabytes (GB) of memory makes it well suited to on-device AI, local inference, and quick iteration without expensive infrastructure.
gpt-oss-20b is just one of the newer releases in a growing wave of compact Small Language Models (SLMs), joining dozens of models with fewer than 20 billion parameters, such as Gemma 3 1B, DeepSeek-R1 distillations, Phi-4, and Qwen 2.5, that are explicitly optimized for devices with between 8 GB and 16 GB of memory. In addition, over the past year, more than 30 notable products, frameworks, chips, and apps have been launched to support on-device or resource-efficient AI workflows. While compact SLMs have gained popularity, other "Lean AI" initiatives focus on making Large Language Models (LLMs) more efficient. Notable developments include webAI's distributed-infrastructure model deployment and the collaboration between Cerebras Systems and Neural Magic on sparse training techniques for LLMs, both of which demonstrate the potential of edge AI deployment.
Why Lean AI Stacks Matter for Enterprises and Daily Workflows
IMPACT
Lean stacks in AI deployment have strong potential to shape the future of Small and Medium Enterprises (SMEs) and daily consumer workflows. ABI Research forecasts that on-device AI software revenue will reach US$75 billion in 2030, growing at a Compound Annual Growth Rate (CAGR) of 49% between 2024 and 2030. The continued focus of major players like OpenAI, Google, and Alibaba on developing and releasing leaner, more efficient models shows clear interest in the field, centered mainly on maximizing cost efficiency by reducing the parameters activated per token and minimizing the connectivity and hardware requirements needed to access AI models.
The move toward deployable "Lean AI" now rests on recent breakthroughs in full-featured models that run on laptops, phones, and even microcontrollers with significantly less memory, latency, and cost than before. For SMEs, these advancements make it feasible to deploy high-value applications, such as search, vision recognition, summarization, chat, and predictive maintenance, entirely locally and in real time. Leading innovations from the past year include:
- Mixture-of-Experts (MoE) deployment, as shown by gpt-oss-20b and by research such as the collaborative deployment framework for edge LLMs (CoEL) and On-the-Fly MoE Inference (FloE), demonstrates expert swapping across edge clusters or on a single memory-starved consumer Graphics Processing Unit (GPU). MoE activates only 5% to 10% of parameters per token, reducing memory and compute needs by 6X to 8X (a routing sketch follows this list).
- Quantization methods, such as webAI's Entropy-Weighted Quantization (EWQ) and Nota AI's SplitQuant, introduce model compression and layer-splitting techniques that let models with 20 billion parameters fit inside a phone's Neural Processing Unit (NPU) or an 8 GB GPU without speed penalties (a generic quantization sketch follows this list).
- Sparsification techniques, such as Neural Magic's Sparse Llama 3.1 8B and DeepSparse YOLO-11, deliver efficient LLM inference and object detection at GPU-level speeds on Central Processing Units (CPUs). Sparse weights let enterprises run day-to-day search, Retrieval-Augmented Generation (RAG), and vision pipelines on CPU racks instead of GPUs (a pruning sketch follows this list).
- The mainstream adoption of the Model Context Protocol (MCP) establishes an open standard for how LLMs interact with external tools and services, reducing integration overhead and allowing enterprises to mix local, edge, and cloud models without custom glue code.
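To make the expert-routing idea concrete, the following is a minimal Python sketch of top-k MoE routing. The expert count, dimensions, and top-4 routing are illustrative assumptions, not the configuration of gpt-oss-20b, CoEL, or FloE; the point is that only the selected experts' weights are touched per token, which is where the memory and compute savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative MoE layer: 32 experts, top-4 routing (toy figures only;
# not the gpt-oss-20b configuration).
NUM_EXPERTS = 32
TOP_K = 4
D_MODEL = 512
D_FF = 2048

# Each expert is a small two-layer feed-forward block.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token through its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]             # indices of the selected experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over the selected experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # ReLU feed-forward expert
    return out


token = rng.standard_normal(D_MODEL)
_ = moe_forward(token)

# Only TOP_K / NUM_EXPERTS of the expert parameters are touched per token.
print(f"Experts active per token: {TOP_K / NUM_EXPERTS:.1%}")
```

In this toy setup, 12.5% of the expert parameters are active per token; production MoE models push that fraction lower, which is where the 5% to 10% figure cited above comes from.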
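The storage arithmetic behind quantization can be shown with a generic post-training 8-bit scheme. This is a simple symmetric, per-tensor sketch, not EWQ or SplitQuant, but it illustrates how cutting bytes per weight shrinks a model's footprint on an NPU or a small GPU.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)  # one FP32 weight matrix

q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()

print(f"FP32 size: {w.nbytes / 2**20:.1f} MiB")
print(f"INT8 size: {q.nbytes / 2**20:.1f} MiB")   # 4X smaller than FP32
print(f"Mean absolute error: {error:.4f}")
```

Dropping from 32 bits to 8 bits per weight cuts weight storage 4X, and 4-bit schemes roughly halve it again, which is how a 20 billion parameter model can fit within the memory budgets discussed above.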
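The sparsification benefit can be sketched with unstructured magnitude pruning and compressed sparse storage. This toy example is not the DeepSparse runtime or the Sparse Llama recipe; it only shows how zeroed weights shrink the stored footprint and enable CPU-friendly sparse math.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
w = rng.standard_normal((2048, 2048)).astype(np.float32)

# Unstructured magnitude pruning: zero out the 75% smallest-magnitude weights.
threshold = np.quantile(np.abs(w), 0.75)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0).astype(np.float32)

w_sparse = csr_matrix(w_pruned)   # compressed storage keeps only the non-zeros
x = rng.standard_normal((2048, 16)).astype(np.float32)
y = w_sparse @ x                  # sparse matrix multiply runs on the CPU

dense_bytes = w.nbytes
sparse_bytes = (w_sparse.data.nbytes + w_sparse.indices.nbytes
                + w_sparse.indptr.nbytes)
print(f"Dense weights:  {dense_bytes / 2**20:.1f} MiB")
print(f"Sparse weights: {sparse_bytes / 2**20:.1f} MiB")
```

Production runtimes pair pruning with quantization and hardware-aware sparsity patterns to push well beyond this simple sketch, which is how sparse models reach GPU-class throughput on CPUs.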
Strategic Adoption Paths and Use Cases to Watch
RECOMMENDATIONS
Such innovations unlock several benefits for enterprises looking to integrate AI into their daily operations. Some of the high-impact lean AI use cases that enterprises should continue to track in the coming years include:
- Agentic private assistants through localization of customer data and the elimination of per-seat Application Programming Interface (API) fees
- Factory vision quality assessment with sub-100 Millisecond (ms) latency AI defect detection software on existing CPU and Personal Computer (PC) hardware
- Retail shelf analytics through on-premises image processing and continuous stock alerts without cloud connectivity
- Long-context contract review through in-house adoption of sparse LLMs
- Voice and multimodal chat on phones, enabling low-latency translation and offline replies
Given the strong market forecasts and the fast pace of open-source releases, the next few years represent a key opportunity for SMEs to capture value by piloting, refining, and integrating lean AI stacks before widespread adoption. Enterprises aiming to deploy AI with lean stacks should strongly consider on-device or edge deployments. These strategies should include:
- Aligning Strategic Roadmaps with the Organization's AI Implementation Goals: Organizations should be clear about why they intend to deploy AI and make realistic estimates of its Return on Investment (ROI), Capital Expenditure (CAPEX), and Operational Expenditure (OPEX).
- Hardware Availability Planning: Plan hardware availability that complements the intended lean AI stack by standardizing Random Access Memory (RAM) requirements around actual workloads, as CPU inference is now a viable alternative for many LLM and multimodal tasks (a sizing sketch follows this list). Some baseline targets are:
  - A CPU with more than 12 cores and 128 GB of RAM for models under 10 billion parameters
  - A GPU with 24 GB of memory for models with 20 billion to 40 billion parameters
  - Laptop NPUs with more than 45 Tera Operations Per Second (TOPS)
- Knowledge of the Target Software Stack: Build familiarity with the intended software stack for LLMs and the toolchain that connects organizational data to AI models via APIs. Model strategies should balance the needs and priorities of each use case, choosing general chat models or domain-specific SLMs that use lean AI techniques such as MoE and quantization. For lean AI stacks designed for local deployment, organizations should use AI model runners such as webAI and Ollama, which spread intermittent heavy inference bursts across local hardware (a minimal runner call sketch follows this list). This approach reduces the need for extensive code changes while supporting smooth scale-out across edge nodes.
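As a back-of-the-envelope check on the hardware targets listed above, the sketch below estimates weight memory as parameter count times bytes per weight, with an assumed 1.2X overhead for the KV cache and runtime buffers. These are rough planning figures only; real requirements vary with quantization scheme, context length, and runtime.

```python
def model_memory_gib(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough estimate: parameters x bytes per weight x an assumed overhead factor."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8) * overhead
    return bytes_total / 2**30


for params, bits in [(8, 4), (8, 8), (20, 4), (40, 4)]:
    print(f"{params}B parameters @ {bits}-bit ≈ {model_memory_gib(params, bits):.1f} GiB")
```

By this estimate, a 4-bit 20 billion parameter model needs roughly 11 GiB and a 40 billion parameter model roughly 22 GiB, consistent with the 24 GB GPU target, while sub-10 billion parameter models fit comfortably in CPU RAM.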
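To illustrate how little glue code a local model runner needs, the sketch below calls a locally running Ollama server over its HTTP API (default port 11434). The model name and prompt are illustrative assumptions; the model must already have been pulled locally, and webAI and other runners expose their own interfaces.

```python
import json
import urllib.request

# Illustrative request; "llama3.1:8b" is an example model name and assumes
# it has already been pulled onto this machine.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize this maintenance log: pump #3 vibration above threshold.",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With streaming disabled, the server returns a single JSON object.
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```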
Written by Benjamin Chan