I regularly use Copilot models, and they can manage this without too many issues (Claude 3.7 and Gemini output usable code with tests), but local models seem to not have the ability to do it quite yet.
I'm using Qwen3-30B-A3B locally and it's very impressive. Feels like the GPT-4 killer we've been waiting two years for. I'm getting 70 tok/s on an M3 Max, which pushes it into the "very usable" quadrant.
What was even more impressive is the 0.6B model, which makes the sub-1B class actually useful for non-trivial tasks.
Overall very impressed. I am evaluating how it can integrate with my current setup and will probably report somewhere about that.
Importantly, they note that using a draft model screws it up, and this was my experience. I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff - it's fast too, as you note.
The /think and /no_think commands are very convenient.
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
I went from the bottom up: started with 4B, then 8B, then 30B, and 30B was the only one that started to "use tools". The other models said they would use them but never did, or didn't notice all the tools. I think anything above 30B would actually be able to go full GPT on a task. 30B does it, but a bit... meh.
I have a MBP M1 Max 64GB and I get 40t/s with llama.cpp and unsloth q4_k_m on the 30B A3B model. I always use /nothink and Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 - these are the settings recommended for Qwen3 and they make a big difference. With the default settings from llama-server it will always run into an endless loop.
The quality of the output is decent, just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-coder 32B q4 is around 12-14t/s on this M1, which is too slow for me. I usually used the Qwen2.5-coder 17B at around 30t/s for simple tasks. Qwen3 30B is imho better and faster.
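If anyone wants to script against llama-server with those settings, a minimal sketch looks roughly like this - the port and model name are assumptions about a default setup, and llama-server mostly ignores the model field when it is serving a single model:

    from openai import OpenAI

    # Assumptions: llama-server running on its default port 8080 with the GGUF
    # already loaded; any api_key string works since no auth is configured.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",  # informational when the server hosts one model
        messages=[{"role": "user", "content": "Explain MQTT wildcards briefly. /no_think"}],
        # Recommended Qwen3 sampling settings from the comment above:
        temperature=0.7,
        top_p=0.8,
        # top_k and min_p are not standard OpenAI parameters, but llama-server
        # accepts them as extra body fields:
        extra_body={"top_k": 20, "min_p": 0.0},
    )
    print(resp.choices[0].message.content)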
I am not a Mac person, but I am debating buying one for the unified RAM now that the prices seem to be inching down. Is it painful to set up? The general responses I seem to get range from "It takes zero effort" to "It was a major hassle to set everything up."
Honestly, it is quite a hassle - it took me 2 hours. BUT: if you just take the whole article text, paste it into gemini-2.5-pro, and describe your circumstances, I think it will give you specific steps for your case, and it should be trivial from that moment on.
Cool, but running qwen3 and doing a ls tool call is not “vibe coding”, this reads more like a lazy ad for localforge
I doubt it can perform well with actual autonomous tasks like reading multiple files, navigating dirs and figuring out where to make edits. That’s at least what I would understand under “vibe coding”
Definitely try it, it can navigate files, search for stuff, and run bash commands, and while 30B is a bit cranky it gets the job done (much worse than what I would get when I plug in gpt-4.1, but it's still not bad). Kudos to Qwen.
As for localforge, it really is a vibe coding tool, just like claude or codex, but with the possibility to plug more than just one provider. What's wrong with that?
Very cool to see this and glad to discover localforge. Question about localforge, can I combine two agents to do something like: pass an image to a multimodal agent to provide html/css for it, and another to code the rest?
In the post I saw there’s gemma3 (multimodal) and qwen3 (not multimodal). Could they be used as above?
How does localforge know when to route a prompt to which agent?
Just wanted to give a shout out to MLX and MLX-LM - I’ve been using it to fine-tune Gemma 3 models locally and it’s a surprisingly well put together library and set of tools from the Apple devs.
Anyone know of a setup, perhaps with MCP, where I can get my local LLM to work in tandem on tasks, compress context, or otherwise act in concert with the cloud agent I'm using with Augment/Cursor/whatever? It seems silly that my shiny new M3 box just renders the UI while the cloud LLM alone refactors my codebase; I feel they could negotiate the tasks between themselves somehow.
There's a few Ollama-MCP bridge servers already (from a quick search, also interested myself):
ollama-mcp-bridge: A TypeScript implementation that "connects local LLMs (via Ollama) to Model Context Protocol (MCP) servers. This bridge allows open-source models to use the same tools and capabilities as Claude, enabling powerful local AI assistants"
simple-mcp-ollama-bridge: A more lightweight bridge connecting "Model Context Protocol (MCP) servers to OpenAI-compatible LLMs like Ollama"
rawveg/ollama-mcp: "An MCP server for Ollama that enables seamless integration between Ollama's local LLM models and MCP-compatible applications like Claude Desktop"
How you route would be an interesting challenge; presumably you could just tell it to use the MCP for certain tasks, thereby offloading them locally.
I've been toying with Visual Studio Code's MCP and agent support and have gotten it to offload things like reference searches and targeted web crawling (look up module X on git repo Y via a URL pattern that the MCP server fetches and parses).
I started by giving it a reference Python MCP server and asking it to modify the code to do that. Now I have 3-4 tools that give me reproducible results.
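To make the "offload certain tasks locally" idea concrete, here is a rough sketch of an MCP tool that forwards a prompt to a local Ollama model. Assumptions on my part: the official Python MCP SDK (its FastMCP helper), Ollama's default /api/generate endpoint, and a placeholder tool name and model tag:

    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("local-offload")  # hypothetical server name

    OLLAMA_URL = "http://localhost:11434/api/generate"

    @mcp.tool()
    def local_summarize(text: str) -> str:
        """Summarize text with a local model instead of the cloud agent."""
        resp = requests.post(OLLAMA_URL, json={
            "model": "qwen3:30b-a3b",  # placeholder; use whatever you have pulled
            "prompt": f"Summarize concisely:\n\n{text}",
            "stream": False,
        }, timeout=600)
        resp.raise_for_status()
        return resp.json()["response"]

    if __name__ == "__main__":
        mcp.run()  # speaks MCP over stdio by default

Point the cloud-side agent at this server and tell it which tasks to hand to the tool, and you get the "negotiate between themselves" split in a crude form.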
I thought I'd share this quick tutorial for getting an actual autonomous agent running on your local machine and doing some simple tasks. I'm still figuring out the right MLX settings or the proper model version for it, but the framework around this approach is solid, so I figured it was worth sharing!
Yes, you can just run it in a folder and ask it to look around; it can execute bash commands and do anything that Claude Code can do. It will read the whole codebase if it has to.
Running models locally is starting to get interesting now. Especially the 30B-A3B version seems like a promising direction, though it is still out of reach on 16 GB of VRAM (a tier of hardware that is quite accessible).
Hoping for new Nvidia RTX cards with 24/32 GB VRAM. Seems that we might get to GPT4-ish levels within a few years? Which is useful for a bunch of tasks.
I think we are just a tiny bit away from being able to really "code" with AI locally. Because even if it were only at Gemini 2.5 level, since it's free, you can let it self-prompt a bit more and eventually solve any problem. If I could run a 200B model, or if the 30B were as good, it would have been enough.
It's not uncommon on HN! We frequently have people chiming in as CEOs, insiders, and experts in various fields without much proof. Generally, it hasn't been a problem. Or at least I've not seen any examples of having wool pulled over our eyes in this fashion.
There should be more mention of the requirement to set the model name correctly. For this tutorial to be executed top-to-bottom, the model name must be "mlx-community/Qwen3-30B-A3B-8bit". Other model names will result in a 404 -- rightly so, as this is what determines which model mlx_lm.serve runs!
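In other words, whatever you pass as "model" is what mlx_lm.serve loads, so a quick sanity check can look like this (the host/port below assume the server's defaults; adjust if you launched it with --host/--port):

    import requests

    # Assumed default mlx_lm.serve address.
    URL = "http://localhost:8080/v1/chat/completions"

    payload = {
        # Must match exactly; otherwise the server has nothing to load.
        "model": "mlx-community/Qwen3-30B-A3B-8bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 100,
    }

    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()  # a typo in "model" shows up here as a 4xx error
    print(r.json()["choices"][0]["message"]["content"])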
Great, thank you. Side topic: does anyone know of a way to have a centralized proxy for all LLM services, online or local, that our services connect to, so we manage access to LLMs in only one place? It should also record calls to the LLMs. That would make the whole UX of switching LLMs weekly easier - we would only reconfigure the proxy. The only one I know that can do this is LiteLLM, but its recording of all LLM calls is a bit clunky to use properly.
Litellm is definitely your best bet. For recording - you can probably vibe code a proxy in front of it that mitms it and dumps the request into whatever format you need
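For a flavour of why LiteLLM fits here: the same call site can hit a hosted model or a local Ollama model just by changing the model string. This is only a sketch - the model names and the local api_base are assumptions about your setup:

    from litellm import completion

    messages = [{"role": "user", "content": "Write a haiku about unified memory."}]

    # Hosted provider (assumes OPENAI_API_KEY is set in the environment).
    cloud = completion(model="gpt-4o-mini", messages=messages)

    # Local model via Ollama (assumes `ollama serve` on the default port).
    local = completion(
        model="ollama/qwen3:30b-a3b",
        messages=messages,
        api_base="http://localhost:11434",
    )

    print(cloud.choices[0].message.content)
    print(local.choices[0].message.content)

The proxy server mode (litellm --config config.yaml, if I remember the invocation right) is what turns this into the single managed endpoint plus logging setup being asked about, though as noted the logging side takes some tuning.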
I'm a maintainer of Opik, an open source LLM eval/observability framework. If you use something like LiteLLM or OpenRouter to handle the proxying of requests, Opik basically provides an out-of-the-box recording layer via its integrations with both.
I’ve been looking for this for my team but haven’t found it. Providers like OpenAI and Anthropic offer admin tokens to manage team accounts, and you could hook into Ollama or another self-managed service for local AI.
Seems like a great way to roll out AI to a medium sized team where a very small team can coordinate access to the best available tools so the entire team doesn’t need to keep pace at the current break-neck speed.
Could you maybe make use of Simon Willsons [LLM lib/app](https://github.com/simonw/llm)? It has great LLM support (just pass in the model to use) and records everything by default.
The one feature missing from LLM core for this right now is serving models over an HTTP OpenAI-compatible local server. There's a plugin you can try for that here though: https://github.com/irthomasthomas/llm-model-gateway
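If the Python API is enough (i.e. you don't need the HTTP server), the core of the llm library is about this small. Note the automatic logging mentioned above is a feature of the llm CLI (browsable with "llm logs"); the model id here is just an example and needs the relevant key or plugin installed:

    import llm

    # Any installed model id works; hosted models need a key (`llm keys set openai`),
    # local models come via plugins such as llm-ollama or llm-mlx.
    model = llm.get_model("gpt-4o-mini")
    response = model.prompt("Suggest three names for a local LLM proxy")
    print(response.text())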
I've found the local models useful for non-coding tasks, however the 8B parameter models so far have proven lacking enough for coding tasks that I'm waiting another few months for whatever the Moore's law equivalent of LLM power is to catch up. Until then, I'm sticking with Sonnet 3.7.
No, that sounds right. 24GB isn't enough to feasibly run 27B parameters. The rule of thumb is approximately 1GB of RAM per billion parameters.
Someone in another comment on this post mentioned using one of the micro models (Qwen 0.6B I think?) and having decent results. Maybe you can try that and then progressively move upwards?
That rule of thumb only applies to 8-bit quants at low context. The default for ollama is 4-bit, which puts it at roughly 14GB.
The vast majority of people run between 4-6 bit depending on system capability. The extra accuracy above 6 tends to not be worth it relative to the performance hit.
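Back-of-the-envelope version of that rule of thumb (weights only - KV cache and runtime overhead come on top, and real GGUF quants like q4_k_m land a bit above their nominal bit width):

    def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
        """Rough size of the weights alone: params * bits / 8, in GB."""
        return params_billion * bits_per_weight / 8

    for bits in (4, 6, 8, 16):
        print(f"27B at {bits}-bit ~ {approx_weight_gb(27, bits):.1f} GB")

    # 4-bit ~ 13.5 GB -> fits in 24 GB with room left for context
    # 8-bit ~ 27.0 GB -> the "1 GB per billion parameters" figure; too big for 24 GB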
I understand why people use the Mac for their local LLM work. I can't bring myself to spend any money on Apple products. I need to find an alternative platform that runs under Linux, preferably one I can reach remotely from my work laptop. I would also want to find some way to modulate the power consumption so it turns off automatically when I'm idle.
If you don't mind going through the eldritchian horror that is building ROCm from source[0], Qwen_Qwen3-30B-A3B-Q6_K (a 6-bit quantization of the LLM mentioned in the article, which in practice shouldn't be much different) works decently fast on an RX 7900 XTX using koboldcpp and llama.cpp. And by "decently fast" I mean "it writes faster than I can read".
If you're on Debian AFAIK AMD is paying someone to experience the pain in your place, so that is an option if you're building something from scratch, but my openSUSE Tumbleweed installation predates the existence of llama.cpp by a few years and i'm not subjecting myself to the horror that is Python projects (mis)managed by AI developers[1] :-P.
EDIT: my mistake, ROCm isn't needed (or actually, supported) by koboldcpp, it uses Vulkan. ROCm is available via a fork. Still, with Vulkan it is fast too.
[0] ...and more than once, since it might break after some OS upgrade, like mine did
[1] OK, I did it once, because recently I wanted to try out some tool someone wrote that relied on some AI stuff and I was too stubborn to give up - I had to install Python from source in a Debian docker container because some dependency 2-3 layers deep didn't compile with a newer minor release of Python. It convinced me to thank, yet again, Georgi Gerganov for making AI-related tooling that lets people stick with C++
Yeah, as i wrote "if you're on Debian AFAIK AMD is paying someone to experience the pain in your place" :-).
I used to use Debian in the past, but when I was about to install my current OS I already had the openSUSE Tumbleweed installer on a USB stick, so I went with that. Ultimately I just needed "a Linux" and didn't care which. I do end up building more stuff from source than when I used Debian, but TBH the only time that annoyed me was with ROCm, because it is broken into 2983847283 pieces, many of them have their own flags for the same stuff, some claim to let you install them anywhere but in practice only work from the default in "/opt", and a bunch of them have their own special snowflake build process (including one that downloads some random stuff via a script during the build - IIRC a Gentoo packager filed a bug report about removing the need to download stuff, but I'm not sure if it has been addressed or not).
If I was doing a fresh OS install I'd probably go with Gentoo - it packages ROCm like Debian does, but AFAICT (I haven't tried it) it also provides tools for making bespoke patches to installed packages that survive updates, and I'd like to do some customizations on the stuff I install.
The new AMD chips in the Framework laptops would be a good candidate and I think you can get 96GB RAM in them. Also if the LLM software is idle (like llama.cpp or ollama) there is negligible extra power consumption.
It's very cool that useful models can be run on a single personal computer at all. For coding, though, your time is very valuable, and I'd never want to use anything less than the best. I'm happy to pay pennies to use a frontier model with a huge context window and great speed.
Agreed. It’s definitely been fun playing locally, learning, fine tuning, etc, but these models just don’t quite cut it for serious development tasks (yet, and assuming none of the above considerations apply). I haven’t found better than Gemini 2.5 for my work so far.
Qwen3 is great, but not for writing code. Even after the recent fixes, and with the recommended parameters, it often gets trapped in a loop.
Qwen2.5-32B, Cogito-32B and GLM-32B remain the best options for local coding agents, even though the recently released MiMo model is also quite good for its size.
Yes, Ollama has Qwen 3 and it works great on a Mac. It may be slightly slower than MLX since Ollama hasn't integrated that (Apple Silicon optimized) library yet, but Ollama models still use the Mac's GPU.
You can just use Ollama and have a bunch of models - some are good for planning, some for executing tasks... this sounds more complex than it should be, or maybe I am lazy and want everything neatly sorted.
I have models on an external drive (because Apple), and through the Ollama server they interact really well with Cline or Roo Code or even Bolt, though I found Bolt really doesn't work well.
I was just trying to make sure it is maximally performant, and did it with MLX because I am running on Mac hardware and wanted to be able to run the 30B in reasonable time so it can actually autonomously code something. Otherwise there are many ways to do it!
mlx is an alternative model format to GGUF. It executes natively on apple silicon using Apple's AI accelerator, rather than through GGUF as a compute shader(!). It's faster and uses fewer resources on Apple devices.
Yeah, you don't necessarily need it but it's optimized for Apple Silicon and in my experience feels like it gives slightly better performance than GGUFs. I really need to formally measure that so I'm not just running on vibes!
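For anyone who wants to try the MLX path without standing up a server, the mlx-lm Python API is roughly this small (a sketch; exact keyword arguments vary a bit between mlx-lm versions, and the model id is the 8-bit community conversion mentioned elsewhere in the thread):

    from mlx_lm import load, generate

    # Downloads the weights from Hugging Face on first use and runs them via MLX
    # on Apple Silicon.
    model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")

    # For chat-tuned models you would normally wrap the prompt with
    # tokenizer.apply_chat_template first; a raw prompt is enough for a smoke test.
    prompt = "Write a one-line docstring for a function that parses MQTT topics."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))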
Forgive me, I am just digesting the term "vibe coding," which doesn't seem like coding at all? It's just typing into your AI's text prompt, describing what you want it to do, and then making edits until the AI has a working prototype of what you seek. Is that a correct assumption?
So it's not coding ... it's talking to an LLM via voice or chat and having it code for you. Then you ask it to change/edit things, review the code a bit or just run an error check so the LLM fixes the errors, and you're done.
And so people who are vibe coding are getting paid multiple six-figure salaries ... that's not sustainable; anyone at any age and in any country can vibe code.
Looks like we are embracing the demise of our skill-sets, careers and livelihoods quickly!
I'm really impressed and also very interested to see models I can run on my MacBook Pro start to generate results close to large hosted "frontier" models, and do so with what I assume are far fewer parameters.
It's been a solid trend for the last two years: I've not upgraded my laptop in the time and the quality of results I'm getting from local models on that same machine has continued to rise.
My hunch is that there's still some remaining optimization fruit to be harvested but I expect we may be nearing a plateau. I may have to upgrade from 64GB of RAM this year.
"create a python decorator that uses a trie to do mqtt topic routing”
phi4-reasoning works, but I think the code is buggy
phi4-mini-reasoning freaks out
qwen3:30b starts looping and forgets about the decorator
mistral-small gets straight to the point and the code seems sane
https://mastodon.social/@rcarmo/114433075043021470
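For context, a hand-written version of what that prompt asks for fits in about forty lines. Something along these lines (my own sketch, so the details will differ from whatever each model produced; the zero-level "#" edge case from the MQTT spec is left out for brevity):

    class TopicRouter:
        """Route MQTT topics to handlers using a trie with + and # wildcards."""

        HANDLER = object()  # sentinel key marking "a handler is stored at this node"

        def __init__(self):
            self.root = {}  # nested dicts keyed by topic level

        def route(self, pattern):
            """Decorator: register a function for an MQTT topic pattern."""
            def decorator(func):
                node = self.root
                for level in pattern.split("/"):
                    node = node.setdefault(level, {})
                node[self.HANDLER] = func
                return func
            return decorator

        def dispatch(self, topic, *args, **kwargs):
            """Call the first handler whose pattern matches the concrete topic."""
            handler = self._match(self.root, topic.split("/"))
            if handler is None:
                raise KeyError(f"no route for {topic!r}")
            return handler(topic, *args, **kwargs)

        def _match(self, node, levels):
            if not levels:
                return node.get(self.HANDLER)
            head, rest = levels[0], levels[1:]
            for key in (head, "+"):      # exact level first, then single-level wildcard
                if key in node:
                    found = self._match(node[key], rest)
                    if found is not None:
                        return found
            if "#" in node:              # multi-level wildcard swallows the remainder
                return node["#"].get(self.HANDLER)
            return None


    router = TopicRouter()

    @router.route("sensors/+/temperature")
    def on_temperature(topic, payload):
        print(f"{topic}: {payload} degrees")

    router.dispatch("sensors/kitchen/temperature", 21.5)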
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
It's using 20GB of memory according to ollama.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
I'm running Q4 and it's taking 17.94 GB VRAM with 4k context window, 20GB with 32k tokens.
As a Python person I've found uv + MLX to be pretty painless on a Mac too.
The latter is super easy. Just download the model (thru the GUI) and go.
This part, yes. I assume that setting up a complete environment is a little more involved than the 4 commands the sibling comment refers to.
Thank you
I do think you should disclose that Localforge is your own project though.
https://github.com/comet-ml/opik
I have 64GB and I can only just fit a bunch of Firefox and VS Code windows at the same time as running a 27B model.
llama.cpp can be built using Debian-supplied libraries with ROCm backend enabled.
1. Sovereignty over data, your outputs can't be stolen or trained on
2. Just for fun / learning / experiment on
3. Avoid detection that you're using AI
4. No Internet connection, in the woods at your cabin or something
https://ollama.com/library/qwen3
Why isn't using localforge enough as it ties into models?
https://x.com/karpathy/status/1886192184808149383
I wonder how far this can go?