Why you shouldn't buy an NVIDIA GPU or the DGX Spark for local LLM inference in 2025

By onTree Team
#LLM #Hardware #NVIDIA #GPU #Local Inference #AMD
Why you shouldn't buy an NVIDIA GPU or the DGX Spark for local LLM inference in 2025

When you’re shopping for hardware to run LLMs locally, the conversation typically starts with NVIDIA GPUs. They have the best driver support, the most mature ecosystem, and work with everything. But here’s the problem: consumer NVIDIA cards are hitting a hard ceiling that makes them unsuitable for modern local LLM inference.

This is the first in a three-part series on hardware options in 2025. We’ll cover why NVIDIA consumer cards fall short, why Apple’s ecosystem pricing is prohibitive, and ultimately, what you should actually buy.

The 32GB Wall and Custom Builds

Let’s start with the most established path: NVIDIA GPUs.

NVIDIA has the best driver support, the most mature ecosystem, and the widest software compatibility. If you’re running vLLM, Transformers, or any PyTorch-based inference stack, NVIDIA just works.

The problem? Most consumer and prosumer NVIDIA cards top out at 32GB of VRAM. On top of that, there’s a reason NVIDIA’s stock price is soaring: they demand a huge premium for their products:

  • RTX 5090: 32GB, 2,500+ Euro
  • RTX 3090: 24GB, 1,300+ Euro
  • (Professional cards like the RTX 6000 Pro Blackwell: 96GB, 8,000+ Euro)

With one of these in the shopping cart, you only have the graphics card; you still need the remaining components: CPU, RAM, SSDs, power supply, and a case to house it all. There is no standard around building this, everything is possible. While this may sound cool, it’s a maintenance nightmare. Am I seeing this odd behavior on this machine because I have an Intel CPU and an ASUS motherboard? You can never rule it out, and you usually don’t have a second system at hand for an easy comparison. Stability requires uniformity; there’s no way around it.

Quantization Explained

Before we talk about memory constraints, let’s understand how model size changes with quantization.

Modern LLMs store parameters as numbers. Quantization reduces the precision of these numbers to save memory. Here’s what that looks like for two common model sizes:

32B Model (like Qwen3 32B):

  • FP16 (16-bit): ~64GB
  • 8-bit quantization: ~32GB
  • 4-bit quantization: ~16-18GB
  • 3-bit quantization: ~12-14GB

70B Model (like Llama 3.1 70B):

  • FP16 (16-bit): ~140GB
  • 8-bit quantization: ~70GB
  • 4-bit quantization: ~35-38GB
  • 3-bit quantization: ~26-28GB

The rule of thumb: multiply the parameter count by the bits per parameter and divide by eight to get the approximate memory usage in bytes. A 32B model at 4-bit quantization needs roughly 32B × 4 ÷ 8 = 16GB, plus some overhead for model structure.
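To make the rule of thumb concrete, here is a minimal Python sketch of the same arithmetic (the function name is just for illustration, and real runtimes add some overhead on top of these numbers):

def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone: parameters * bits / 8 bytes."""
    return params_billion * bits_per_param / 8  # billions of bytes, roughly GB

for bits in (16, 8, 4, 3):
    print(f"32B model at {bits}-bit: ~{weight_memory_gb(32, bits):.0f} GB")
# 16-bit: ~64 GB, 8-bit: ~32 GB, 4-bit: ~16 GB, 3-bit: ~12 GB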

Lower bit widths (3-bit, 4-bit) save memory but reduce model accuracy slightly. For most local inference use cases, 4-bit quantization offers the best balance: you keep ~95% of model quality while cutting memory usage to ~25% of the FP16 original.

But here’s the catch: the model weights are only part of the story.

Why 32GB Isn’t Enough Anymore

Here’s the thing people miss: GPU memory doesn’t just hold the model weights. It also holds the KV cache, which stores the attention keys and values for your entire context window. Avid agentic coders know this as their context, and it’s always running out too fast!

Let’s do the math on a practical example:

Qwen3 32B (4-bit quantization):

  • Model weights: ~18GB
  • You have 14GB left for context on a 32GB card
  • The KV cache size per token depends on the model’s architecture. A typical 32B model consumes around 0.5 to 1.0 MB of VRAM per token in the context window
  • With 14GB remaining: 14,000 MB / 0.5 MB per token = ~28,000 tokens of context (or as low as ~14,000 tokens with an FP16 KV cache); see the sketch below
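Here is the same budget calculation as a small Python sketch you can adapt to your own card (the function and the 0.5-1.0 MB per token figures are the rough estimates from above, not measurements for a specific runtime):

def context_budget_tokens(vram_gb: float, weights_gb: float, kv_mb_per_token: float) -> int:
    """How many context tokens fit after the weights are loaded."""
    free_mb = (vram_gb - weights_gb) * 1000
    return int(free_mb / kv_mb_per_token)

# Qwen3 32B at 4-bit on a 32GB card:
print(context_budget_tokens(32, 18, 0.5))  # ~28,000 tokens with a quantized KV cache
print(context_budget_tokens(32, 18, 1.0))  # ~14,000 tokens with an FP16 KV cache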

That sounds like a lot—until you understand what actually consumes those tokens in real-world usage.

What Eats Your Context Window

Here’s what fills up those 28,000 tokens in practice:

System prompts and instructions: 500-2,000 tokens

  • Base system prompt defining the agent’s behavior
  • Task-specific instructions
  • Safety guidelines and constraints

MCP (Model Context Protocol) plugins and tools: 2,000-5,000+ tokens

  • Tool definitions for each MCP server
  • Function schemas and examples
  • Return value specifications
  • Multiple plugins stack up quickly

Conversation history: Variable, but grows fast

  • Your messages to the agent
  • Agent’s responses
  • Multi-turn back-and-forth
  • A 20-message conversation can easily hit 10,000+ tokens

Retrieved context: 10,000-50,000+ tokens

  • Document chunks pulled from vector databases
  • Code files for context
  • API documentation
  • Knowledge base articles

Working memory for long-running tasks: The killer use case

  • Agent exploring a codebase
  • Multi-step research tasks
  • Complex debugging sessions
  • Building features across multiple files

This last point is crucial: long context = agents can work independently for longer before needing a reset. If your context fills up after 20,000 tokens, your agent might need to restart after 10-15 tool calls. With significantly more tokens available (100K+ on systems with more memory), the agent can run through dozens of tool calls, maintain full context of what it’s tried, and make better decisions.
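To see how fast a 28,000-token budget disappears, here is a rough tally using mid-range values from the categories above (the numbers are the estimates from this section, not measurements):

budget = 28_000  # context that fits next to a 4-bit 32B model on a 32GB card
consumers = {
    "system prompt and instructions": 1_500,
    "MCP tool definitions": 4_000,
    "conversation history (20 messages)": 10_000,
    "retrieved context (docs, code)": 15_000,
}
used = sum(consumers.values())
print(f"used {used:,} of {budget:,} tokens, {budget - used:,} left")
# used 30,500 of 28,000 tokens, -2,500 left: the window is blown before the agent even starts working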

And if you want to run a larger model—say, a 70B quantized to 4-bit (~38GB)—you can’t even fit it on a 32GB card, let alone leave room for meaningful context.

The verdict: NVIDIA GPUs are excellent for established models and production workloads where you control the infrastructure. But consumer cards hit a hard ceiling when you want both large models AND extensive context.

What About NVIDIA’s DGX Spark?

NVIDIA’s Grace Blackwell DGX Spark has an interesting spec:

  • 128GB unified LPDDR5x memory
  • 20-core ARM processor
  • 1 petaflop AI performance
  • 10 GbE, ConnectX-7 Smart NIC
  • price point between 3,000 and 4,000 Euro
  • new NVFP4 format

On paper, this is a game-changer for local LLM inference. The most interesting thing is the new NVFP4 format, which allows for more aggressive quantization. Research has shown that models can be trained to fit into this quantization with almost the same performance. As this is a hardware-supported feature, it can’t be quickly copied by Apple or AMD.

But the machine is far from widely available: it was announced a long time ago, and while the first videos have now appeared, shops so far only show it as sold out.

Even if it becomes available, this is a subsidized offer from NVIDIA. They expect very high margins on their products, and at this price point the Spark cannot maintain typical NVIDIA margins. They want you to buy into the ecosystem locally and then take the most convenient path of running the same stack on NVIDIA hardware in the data center as well. This machine is a mechanism to lock you into their ecosystem.

Most of what NVIDIA does is closed source: CUDA is closed source, DGX OS 7, the operating system running on the DGX Spark, is closed source, and the list goes on. We’ve seen this play before, and we’re not eager to run into the trap again.

The Memory Wall Is Real

Here’s what’s become clear: local LLM inference is increasingly memory-bound, not compute-bound.

Modern MoE architectures like Qwen3-Next demonstrate this perfectly:

  • 80B total parameters, only 3B active per token
  • Inference speed is excellent—if you have the memory to hold it
  • Context window length matters as much as model size

The bottleneck isn’t “can my GPU compute fast enough?” It’s “can I fit the model AND enough context to do meaningful work?”

This is why 32GB cards are increasingly limiting. They work for smaller models or shorter contexts, but can’t handle the frontier of what’s possible with local inference.

This post is part of a three-part series. Next week: why you shouldn’t buy into the Apple ecosystem.



Custom Slash Commands, Part 3: The Installer

By onTree Team
#Claude Code #Automation #Custom Commands #Developer Tools #Installers
Custom Slash Commands, Part 3: The Installer

In part one, custom slash commands automated repetitive tasks but gave inconsistent results. In part two, we found the fix: separate the conversational prompt from a deterministic script.

Now, the finale. After weeks of hardening this approach, I’ve distilled it into three patterns that transform scripts into powerful self-fixing installers. Here’s what I learned building the TreeOS production setup.

The Three Patterns

1. The Self-Fixing Loop

The old way: run the script, watch it fail, open the file, find line 52, guess a fix, save, run again. High-friction context switching.

The new way: I let Claude execute the script, it fails on an edge case, and it comes back to me with what happened and multiple options for how to fix it. Claude has the full context: the command, the code, and the failed output. It updates the script immediately. The script hardens with each real-world failure. This tight feedback loop is the fastest way to build robust automation. TreeOS will be open source, so users can run the install script and contribute a pull request if they encounter an edge case. I can’t wait to see this in real life.

2. Soft Front Door + Hard Engine

Every installer consists of two side-by-side files:

  • Soft Front Door (.md): treeos-setup-production.md

  • Hard Engine (.sh): treeos-setup-production-noconfirm.sh

The treeos prefix separates my custom commands from others. The markdown contains the Claude Code prompt, the conversational layer. It explains what’s about to happen, checks prerequisites, and asks for confirmation. It’s flexible and human-friendly.

The shell script is the deterministic engine. It takes inputs and executes precise commands. No ambiguity, no improvisation, 100% repeatable.

This separation is crucial. Claude can safely modify the conversation in the front door without breaking the logic in the engine. The naming convention makes the relationship obvious.

3. The Graceful Handoff

Depending on the machine and trust level of the user, sometimes Claude Code has access to sudo, sometimes not. The pattern: check if sudo is available without a password prompt.


sudo -n true 2>/dev/null && echo "SUDO_AVAILABLE" || echo "SUDO_REQUIRED"

If sudo requires a password, the front door hands off cleanly:


⚠️ This script requires sudo privileges.

Claude Code cannot provide passwords for security reasons.

I've prepared everything. Run this one command:

cd ~/repositories/ontree/treeos

sudo ./.claude/commands/treeos-setup-production-noconfirm.sh

Paste the output back here, and I'll verify success.

Claude does 95% of the work, then asks me to handle the one step it can’t. Perfect collaboration.

The Real-World Result

These three patterns and a lot of iterations produced my TreeOS production installer. It’s now 600+ lines and handles:

  • OS detection (Linux/macOS) and architecture

  • Downloading the correct binary from GitHub releases

  • Creating system users with proper permissions

  • Optional AMD ROCm installation if a fitting GPU is detected

  • Service setup (systemd/launchd) and verification

When something breaks on a new platform, the self-fixing loop makes improvements trivial. I’ve hardened this across dozens of edge cases without dreading the work.

Why This Changes Everything

Traditional README files push the cognitive load onto the user: identify your platform, map generic instructions to your setup, debug when it breaks.

This flips the script. Instead of static documentation describing a process, we have executable automation that performs it.

But this isn’t just about installers. Apply these patterns to any complex developer task:

  • /setup-dev-environment clones repos, installs tools, and seeds databases

  • /run-migration backs up production, runs the migration, and rolls back on failure

  • /deploy-staging builds containers, pushes to registries, and updates Kubernetes

We’re moving from documentation that describes to automation that executes, with AI as the safety net and co-pilot. This is the future of developer experience: reducing friction by automating complex workflows around code.

With the explosion of AI tools, setup complexity is a real barrier. These patterns are one step towards changing that.


Running Qwen3-Next-80B Locally: October 2025 Case Study

By onTree Team
#Local AI #Machine Learning #Hardware #Apple MLX #NVIDIA #AMD #MoE Architecture
Running Qwen3-Next-80B Locally: October 2025 Case Study

Just last month, Alibaba’s Qwen team released something remarkable that has gone mostly unnoticed: Qwen3-Next-80B, an 80 billion parameter model that runs as fast as a 3 billion parameter one.

This isn’t just another upgrade. It’s a paradigm shift and glimpse into the future of local AI. It’s also a perfect showcase of the current local inferencing ecosystem.

The Model That Changes the Game

The magic is in its ultra-sparse Mixture of Experts (MoE) architecture. Think of it like a massive library with 80 billion books (total parameters), but your librarian only pulls the 3 billion most relevant ones for your query (active parameters). You get the knowledge of the entire library at the speed of a small, curated collection. Interestingly, this is also similar to how actual brains work, where the prefrontal cortex directs attention to the correct brain region.
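For intuition, here is a toy sketch of the core MoE idea, top-k routing, in plain Python with NumPy. This is not Qwen3-Next’s actual router (which pairs MoE with a hybrid attention design); it only illustrates why just a small slice of the parameters is touched per token:

import numpy as np

def route(token_vec: np.ndarray, gate: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Score every expert for one token and keep only the top_k highest-scoring ones."""
    scores = gate @ token_vec              # one score per expert
    return np.argsort(scores)[-top_k:]     # indices of the few experts that actually run

rng = np.random.default_rng(0)
num_experts, dim = 64, 16
gate = rng.standard_normal((num_experts, dim))
token = rng.standard_normal(dim)
print(route(token, gate))  # e.g. 2 out of 64 experts: only their weights are used for this token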

This architecture results in knowledgeable models with fast inference. According to real-world benchmarks, on a high-end Mac Studio M3 Ultra with 8-bit quantization, this translates to 50 tokens per second—fast enough for a real-time conversation with a world-class AI.

Sounds too good to be true? Unfortunately, there are a few catches, at least for now. Qwen3-Next uses a novel hybrid attention architecture combining Gated DeltaNet and Gated Attention—tricky new architectural features that most existing tools weren’t built for. The struggle to run it on affordable hardware reveals everything about the three main paths for local AI in 2025.


Path 1: The Apple Fast Lane

Qwen3-Next released on September 11. Full support in the popular LM Studio app? September 16, just five days later. On a Mac, it just works: you download the model from LM Studio’s catalog, and it runs. Why is that? Apple controls the entire stack. Their MLX framework is a specialized engine built to get the most out of their own silicon. When a new model like Qwen3-Next appears, developers can write support for it directly in MLX, bypassing the community bottlenecks that affect other platforms.

While Apple is sleepy about integrating AI on the product side, the MLX team is on top of its game.

The performance is great: this blog post claims 14 tokens/sec on a Mac mini M4 Pro with 64GB and a whopping 50 tokens/sec on a Mac Studio M3 Ultra. But this seamless experience comes at a premium. A Mac Studio equipped to run this model comfortably (128GB unified memory) starts around $7,200. You’re paying for a vertically integrated ecosystem where the hardware and software are perfectly in sync.

Path 2: The Professional Choice with NVIDIA

Support for Qwen3-Next? Day one. If you have the right hardware, running the model is as simple as a single command line or a few lines of Python. The professional AI world is built on NVIDIA and its mature software ecosystem. Frameworks like vLLM, Transformers, and SGLang are designed for production servers and work directly with a model’s native PyTorch code. There’s no need for conversion or waiting for updates. If the model’s creators release it, these tools can run it instantly.
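For illustration, a minimal vLLM invocation might look like the sketch below. The Hugging Face model ID and the tensor-parallel setting are assumptions for this example, and it only runs if you actually have enough GPU memory for the chosen variant:

from vllm import LLM, SamplingParams

# Model ID assumed for illustration; pick a quantized variant that fits your VRAM.
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain the trade-offs of MoE models in two sentences."], params)
print(outputs[0].outputs[0].text)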

The full, unquantized 80B model is massive and impractical for most users. Instead, the standard approach is quantization — compressing the model to use less memory with minimal quality loss.

According to deployment guides, common quantization formats include:

  • FP8: ~40GB of VRAM needed

  • INT8: ~42GB of VRAM needed

  • INT4/AWQ: ~22GB of VRAM needed

Even with aggressive quantization, you’re looking at 22-40GB+ of VRAM. A single NVIDIA A100 80GB costs $10,000-15,000, so it’s out of range for most. Consumer cards like the RTX 4090 (24GB) can’t fit even the most aggressive quantizations of an 80B model.

The Trade-Off: NVIDIA offers the most mature, powerful, and instantly compatible software. But for the newest, largest models, it’s realistically a cloud or enterprise solution, not a local one for most consumers. Unless you have access to datacenter GPUs, you’re limited to smaller models or cloud inference. The just-released DGX Spark could change that, but general availability is unclear.


Path 3: The Open Path with AMD

This is my path, and it’s the one most people are on.

The hardware is ready. My AMD Ryzen AI Max+ 395 is a beast, offering 128GB of unified LPDDR5X-7500 memory. For inference, fast memory is the limiting factor. Strix Halo matches Apple’s M4 Pro line and the DGX Spark, if that ever becomes available. And it’s a much more affordable system, for example in the Framework Desktop.

But it can’t run next-generation models like Qwen3-Next-80B, at least for now. Llama.cpp is a brilliant open source community project. It’s the workhorse of local AI on AMD, making models accessible to everyone.

However, its universality is also its bottleneck. To support a radically new model like Qwen3-Next, a volunteer developer has to painstakingly implement the new architecture inside llama.cpp. As of October 2025, the pull request for Qwen3-Next is still a work in progress, with developers debugging complex issues like partial RoPE implementation. Whatever that is. Llama.cpp also powers the popular Ollama, so it has the same problems right now. The hardware is more than capable, but we’re all waiting for the community software to catch up.

Are there alternatives? Yes, there are actually two right now.

AMD users can use vLLM like NVIDIA users do. But despite ROCm 6.4.1 and 7.0 supporting the Ryzen AI Max+ 395 chipset (gfx1151), vLLM compatibility remains problematic. Users encounter “invalid device function” errors during model initialization, and official support for gfx1151 is still an open feature request.

With SGLang it’s the same story. It’s a framework that comes from larger data-center hardware and is slow to adopt consumer hardware like the AMD AI Max+. There’s a PR open, but little activity.

What This All Means

This one model neatly illustrates the trade-offs in local AI today:

  1. Apple & NVIDIA: The “it just works” experience. It’s fast and polished, but you pay a premium for admission to the walled garden. You might go faster now, but beware of locking yourself in.

  2. AMD & Llama.cpp: The universal path. It brings AI to the most hardware at the best value, but for brand-new, architecturally complex models, there can be a delay as the open-source ecosystem catches up.

The good news is that the community path is rapidly improving:

In llama.cpp there’s a lot of activity. I find it hard to follow all the details, even as a hardware enthusiast. Ollama’s new inference engine with direct GGML access shows they’re building toward faster support for novel architectures, though the transition is ongoing.

Ultra-efficient MoE models like Qwen3-Next prove that frontier-level intelligence can run on our local machines. The infrastructure is racing to keep up, and this competition means a better future for everyone, no matter which path you’re on.


The Viscosity of Your Software Stack And Why it Matters for Working With Agents

By onTree Team
#Software Engineering #AI Agents #Code Quality #Vibe Engineering #Tribal Knowledge
The Viscosity of Your Software Stack And Why it Matters for Working With Agents

I love how new terms are being coined at the moment. Simon Willison’s post about getting from Vibe coding to Vibe engineering is a perfect example. Unfortunately, it’s missing one key property of codebases: the differing viscosities of your files, and often of individual lines.

The blog post is a superb summary of things to look out for when coding efficiently with agents. I found myself nodding at every bullet point. As an engineering leader who pitched a lot of these practices for implementation, I feel we can finally prove that good software engineering practices have a business impact, instead of being misunderstood as a goodie for the engineers…

One aspect was lacking, and in the spirit of coining terms, I’d like to name it the viscosity of software. Every codebase has fast, easy-to-change parts and almost impossible-to-change parts. If you propose changing the latter, you’ll spend the whole week convincing other engineers. These crucial files or lines have implications for the whole project. It’s often unclear which lines are written in stone and which are written in sand.

Another way to frame this is tribal knowledge: engineers accustomed to the codebase know these corners from their own experience or because of the stories around them. So far, I haven’t found a way to onboard my agents with this knowledge. Every agent comes in like a new developer, not knowing anything about the code. It’s amazing how fast they navigate the codebase. The AGENTS.md and code comments help a bit, but this is a main frustration point when relying more heavily on agents. They’re unaware of this tribal knowledge, and I don’t know how to teach it to them.

How do we teach a machine to be afraid of the right lines of code? I’d love to hear your thoughts.


Custom Slash Commands, Part 2: From Convenience to 100% Repeatability

By onTree Team
#Claude Code #Automation #Custom Commands #Developer Tools #Repeatability

Custom Slash Commands, Part 2: From Convenience to 100% Repeatability

A few weeks ago, I showed you custom slash commands for storing prompts you need repeatedly. But I ran into a problem: sometimes Claude followed my instructions perfectly, sometimes not. I found the fix.

My first attempts were useful, but inconsistent. Sometimes the agent followed the orders exactly, sometimes it improvised. Still, it improved the structure and findability of these saved prompts; before, my codebases were cluttered with small READMEs.

The Breakthrough: Scripts + Prompts = 100% Repeatability

Custom slash commands aren’t just for storing prompts in a markdown file. You can also put scripts in the same folder and instruct the coding agent to use that script. This was my breakthrough on repeatability.

Consider this cleanup script. Here’s what happens when I run it:

  • The prompt explains what the script will do and asks for my confirmation.
  • I type yes
  • Claude executes the script and monitors its output, warning me about unexpected behavior.

This saves a lot of time, every day. It’s the best UI to run scripts I need regularly. I can verify I selected the right script before execution because I often selected the wrong one in a hurry. And I get automatic monitoring that catches problems.

The Bigger Vision: Slash Commands as Installers

Here’s how I’ll handle the installer routine for TreeOS. Instead of asking people to read a README and follow five to seven steps, they’ll run a custom slash command. I’d love to see this pattern in many tools.

Example: A few days ago I found Mergiraf, a clever tool that makes git conflicts less scary. Hosted on Codeberg 🇪🇺! The installation guide is concise, but you need to map it to your platform. And then you still need to configure it as a git merge driver.

How cool would it be if they shipped a custom slash command that detects your system, recommends the best installation method, and walks you through configuration? And they could also include a script to remove the tool, if it doesn’t work for me. This would dramatically reduce the cognitive overhead of trying a new tool like Mergiraf.

With the explosion of tools we’re seeing right now, lengthy setup routines are a real barrier. Slash commands with embedded scripts could change that.


AllowedTools vs YOLO mode: Secure But Powerful Agentic Engineering

By onTree Team
#Claude Code #Security #YOLO Mode #Allowed Tools #Agentic Engineering
AllowedTools vs YOLO mode: Secure But Powerful Agentic Engineering

Recently, I’ve defaulted to using my coding agents in YOLO mode. I found a better way to balance security and ease of use.

Once you get the hang of agentic coding, it can feel like babysitting. Can I read this file? Can I search these directories? Everything has to be allowed individually by default. The easiest fix is to switch to YOLO mode: instead of starting claude in the terminal, start claude --dangerously-skip-permissions. This allows your agent to do everything: read all the files, delete all the files, commit to every repository on your hard disk, even connect to production servers and databases using your SSH keys. YOLO mode is the right name; real accidents have happened.

But YOLO mode has limitations too. I started to install Claude on my managed servers. It’s helpful for boring server administration tasks. Unfortunately, Claude doesn’t work in YOLO mode when you’re the root user, which is typical for cloud machines. I’m not sure if I agree with Anthropic’s limitation, since this can be less dangerous than running Claude on my private machine with all my private data in YOLO mode.

Fortunately, better options are emerging. One I like is allowed tools. This gives the agent fine-grained control over what it can do on its own and what it can’t. Together with the slash commands I wrote about last week, this is a powerful combination. Similar to the dotfiles many developers use for a familiar environment on new machines, I can imagine checking out a claude-tools repository with custom slash commands for recurring tasks, including allowedTools settings for uninterrupted execution.

Disclaimer: I haven’t built this yet. Hopefully, I’ll have a demo for you in the next weeks!


Custom Slash Commands: A Field Trip With Claude Code

By Stefan Munz
#Claude Code #Automation #Custom Commands #Developer Tools
Custom Slash Commands: A Field Trip With Claude Code

Today I’m taking you on a field trip on how I built my first two custom slash commands for Claude Code.

I apparently lived under a rock regarding custom slash commands, since they’ve been available for a long time. Boris Cherny mentioned them in a video I linked earlier, which made me aware of them.

The first custom command I wrote last week is named /make-ci-happy. It’s a simple prompt that tells the agent how to run all the CI tests locally. It also gives guardrails on what to fix and what to escalate back to me. Because this isn’t Claude’s standard behavior, it became a repetitive task I had to carefully repeat before every commit. It’s of course highly tailored to this repository, so it’s a custom slash command only available here. It’s an elegant system: slash commands can be available on your whole machine, or only per repository.

So this is nice and helps me a little bit every day. But I wanted to see how far I could take this. I’m getting nearer to the public release of my TreeOS open source software. It’s written in Go and compiles to a binary for macOS and Linux. It has a self-update mechanism built in. Most update mechanisms use a JSON file on a server, which the binary queries. It’s better to control this JSON yourself and not rely on the GitHub releases page. My repository’s source code isn’t ready, and I haven’t settled on a license. Still, I want to test the update mechanism first. This is possible via a second public repository and some GitHub Actions magic: it builds the release in one repository but pushes the artifacts to another. At the same time, the JSON needs to be updated, which lives in a different repository. For new stable releases I want to adapt the JSON and add a blog post. If this sounds complicated, it is. The perfect problem to automate via custom slash commands.

The best way to build custom slash commands is to ask Claude Code to build them. Ask it to read the documentation first, because the knowledge about slash commands isn’t in the model. I named this command /treeos-release to create a namespace for my custom commands and made it available on my whole system. The paths to the needed repositories are hardcoded inside the script.

You might think hardwiring it to one system isn’t proper engineering. Probably you’re right. Since I don’t see the need to run the script anywhere other than my computer, it’s fine for now. One of the main advantages of working with coding agents is that everything can be more fluid and brittle in the beginning. I can make it more stable later, if needed.

The result? Within a few minutes, I had a script. It didn’t work correctly. I reported this to Claude, who fixed it. On the second try, I shipped the first beta to the public repository. Yay!

TreeOS v0.1.0 Release

Upon closer inspection, it fell apart again. For further testing, I continued creating a stable release on top of the beta release. This failed, and it turned out the first invocation of the slash command hadn’t used the created script at all. Claude Code had done it all by itself! We modified it together and added the following warning:

Claude Code Guardrails Warning

In short, it needed a few more repetitions until I was happy with the script. I ended up splitting it into multiple scripts, because making Claude Code patiently wait for CI is hard. Overall, it’s an interesting work style, because Claude can finish the work of a half-baked script if needed. This allows iterative improvement of the script while continuing with the main work.

I highly recommend custom slash commands. It’s a well-thought-out system that integrates nicely into Claude Code. Creating and debugging the slash commands is easy. Start with your most repetitive tasks, and make sure every command runs a main script to increase result consistency.

You could argue that these scripts lock you into Claude Code versus other coding agents. While that is true, I don’t think it will be challenging for any other agent to copy my existing code/commands to their system, as long as the systems are similar.

Ultimately, Claude Code is a motivated but average developer. Like most average developers I’ve worked with, including myself, they usually need a few attempts to get it right.

Oh, and regarding the first binary of TreeOS visible above: It would probably work on your machine, but I haven’t created concise documentation for Mac or Linux, so I can’t recommend it. If you’re interested, reply to this email and I’ll add you to the early alpha testers. 👊


DHH is into Home Servers, too

By Stefan Munz
#Home Servers #Cloud Computing #DHH #Technology Trends
DHH is into Home Servers, too

Home servers are back and many cloud computing offerings are a complete rip-off: DHH discovered the same seismic changes this year, and he’s a genius marketer.

David Heinemeier Hansson, or DHH for short, must live in the same social media bubble as I do; our most important topics overlap this year: home servers are on the cusp of becoming a serious alternative to cloud offerings, and the cloud is turning into an expensive joke. Also, like me, he holds a serious grudge against Apple. #fuckapple

It used to be that home servers were kind of a joke. That’s because all home computers were a joke. Intel dominated the scene with no real competition. SSDs were tiny or nonexistent. This made computers and notebooks power-hungry, hot, and slooow. The M-series CPUs from Apple are not even 5 years old. And only in the last 5 years has AMD gotten their shit together and started shipping serious consumer CPUs.

So you could have a home server, but they were slow machines. Plus, your internet connection was also slow. Most people had asymmetric DSL connections with maybe okay-ish download speeds and totally underpowered upload speeds. Accessing your server from the outside was a pain in the ass. I remember running Plex on my home server 10 years ago and watching a video on my mobile phone, in bad resolution. I don’t remember which was the bigger bottleneck: the slow CPU transcoding or my slow upload speed.

Coming back to 2025, this has changed dramatically. Many homes have upgraded to fiber connections, providing fast upload and download speeds. Well-manufactured mini PCs are available cheaply. While Mac minis can be a valid option for fast compute, AMD has gotten serious about this niche with its Ryzen AI Max+ 395 flagship CPU with integrated graphics and 128GB of shared RAM. These machines are not a joke anymore. If your use cases require a lot of RAM, like local LLM inference, going back to the edge, a.k.a. your home, becomes an interesting alternative.

And then we haven’t started talking about sovereignty and independence from unpredictable vendors or countries…

I wholeheartedly recommend DHH’s full talk; it’s a very energetic fuck you to the “modern stack” in general and Apple in particular.


Welcome to the OnTree Blog

By Stefan Munz
#announcement #ontree

Welcome to the OnTree blog! We’re excited to share our journey, insights, and learnings with you.

What to Expect

In this blog, we’ll be covering:

  • Organizational Insights: Deep dives into how to build modern organizations in the age of AI
  • Product Updates: New features and improvements to the OnTree platform
  • Industry Trends: Our perspective on the evolving landscape of LLM inference, AI agents and agentic coding.

Stay Connected

We’ll be publishing new content regularly. Stay tuned for more insights and updates from us!

Feel free to reach out if you have topics you’d like us to cover!