Ditching the cloud for local AI — how I use two mini PCs to process millions of tokens a day and save money on costly API fees

For this kind of reading, thinking, analyzing, and re-presenting, local models work brilliantly. They have high throughput but are working in the background, meaning that the slower time to first token that many local LLM users complain about in comparison to big lab-hosted alternatives isn’t an issue for me. The model runs 24 hours a day, and if it takes two seconds or two minutes to process the prompts (between 7,000 and 18,000 tokens, depending on whether it’s a reporter or editor and how far through the discussion process it is), it doesn’t bother me. Tokens per second won’t impress those talking a big game about local LLMs on social media: the models handle the prompts at around 300 tok/s, while the output is a much slower 5-10 tok/s. Yet it works for me.

But for now, I’m still keeping my big lab subscriptions — though I’m using them differently. My GLM Coding plan, bought around Christmastime and which lasts for a year, is used alongside Codex through my OpenAI subscription to troubleshoot and tinker with the projects when issues arise. My coding knowledge stopped at some QuickBASIC and Delphi in my teenage years, so having the ability to call on them (and an OpenCode Go subscription I occasionally dip into) to fix problems is invaluable.

However, the proportion of my AI use has shifted significantly. Two-thirds or more of my total token use is now locally-hosted LLMs I run myself. And as local models continue to develop their abilities and the gap between them and the state of the art from big labs closes, I can envisage that it will increase. For instance, I recently vibe-coded a web interface for LM Studio that allows me to use it as a regular chatbot just this last week. And in just two months, the amount I’ve saved if I had run that project every day through API calls on GPT-5.4-mini, arguably a comparable model, is three-quarters of the cost of that first mini PC — around $1,500.

In hindsight, I wish I’d bought the 128GB version of my mini PC, which is why I decided around two weeks ago, before another memory-based price hike, to buy the bigger version. The reason was a simple one: the volume of queries I was putting through my 96GB box was starting to hit the limits, and I wanted to expand the project. I also wanted to test out locally hosted coding harnesses like Claude Code or Hermes using a local model.

The experience, trials, and tribulations from my first mini PC setup helped enormously with setting up the second PC. Token count has increased from 20-50 million tokens a day to more like 50-80 million tokens a day. I offloaded part of that massive ingest and analysis project onto the new hardware and put it onto more powerful 27B and 36B parameter models (through the Final-Bench-Darwin-36B-Opus model), freeing up space on my first mini PC and allowing me to test the idea of a locally-hosted Claude Code-style project with the spare space on my second mini PC.

That has been less successful — at least so far. Underpinning the coding harness with GLM-4.7-Flash works, but feels like too big a step back in model generations to be a useful tradeoff. Larger Qwen models have so far got stuck in their own thinking (or burned through a lot of the context window they’re assigned), but I’m considering swapping Claude Code out for a lighter-weight, less context-heavy harness and giving it a proper run.

The bet I’ve made is a simple one: subscription and API prices from frontier labs — with the odd outlier like DeepSeek excepted — are only going to go in one direction as the companies behind them realize they need to make a financial return for investors. Even if prices don’t go into the stratosphere, labs might make tradeoffs to cut down on usage — as we’ve already seen GitHub doing. And while the race to build capacity to meet demand for those major AI labs will continue to push up prices for hardware in the short term, I still think it’s a better bet to have control over your own models and how much you pay for them than to leave it in the hands of big companies.

So I’ll keep tinkering with my local stack, which has already gone from one mini PC to two interlinked ones — and already have my eyes on a PC with an Nvidia GPU to give me the token speed that’s currently missing. But for now, I think it’s worth keeping what I have for a while and seeing how I can eke out additional benefits before making the leap financially in expanding my whole system.

Chris Stokel-Walker is a Tom's Hardware contributor who focuses on the tech sector and its impact on our daily lives\u2014 online and offline.\u00a0He is the author of How AI Ate the World, published in 2024, as well as TikTok Boom, YouTubers, and The History of the Internet in Byte-Sized Chunks. ","collapsible":{"enabled":true,"maxHeight":250,"readMoreText":"Read more","readLessText":"Read less"}}), "https://slice.vanilla.futurecdn.net/13-4-24/js/authorBio.js"); } else { console.error('%c FTE ','background: #9306F9; color: #ffffff','no lazy slice hydration function available'); } Chris Stokel-Walker Freelance Contributor Chris Stokel-Walker is a Tom's Hardware contributor who focuses on the tech sector and its impact on our daily lives— online and offline. He is the author of How AI Ate the World, published in 2024, as well as TikTok Boom, YouTubers, and The History of the Internet in Byte-Sized Chunks.

Ditching the cloud for local AI — how I use two mini PCs to process millions of tokens a day and save money on costly API fees

Key considerations

Reference reading

More on this site

Leave a Comment Cancel reply

Key considerations

Reference reading

More on this site

Related posts:

Leave a Comment Cancel reply