I regularly use Copilot models, and they can manage this without too many issues (Claude 3.7 and Gemini output usable code with tests), but local models seem to not have the ability to do it quite yet.
I'm using Qwen3-30B-A3B locally and it's very impressive. Feels like the GPT-4 killer we've been waiting two years for. I'm getting 70 tok/s on an M3 Max, which pushes it into the "very usable" quadrant.
What was even more impressive is the 0.6B model, which makes the sub-1B class actually useful for non-trivial tasks.
Overall very impressed. I am evaluating how it can integrate with my current setup and will probably report somewhere about that.
Importantly, they note that using a draft model screws it up, and this was my experience. I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff - it's fast too, as you note.
The /think and /no_think commands are very convenient.
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
I went from the bottom up: started with 4B, then 8B, then 30B, and 30B was the only one that started to "use tools". The other models said they would use them but never did, or didn't notice all the tools. I think anything above 30B would actually be able to go full GPT on a task. 30B does it, but a bit... meh.
I have a MBP M1 Max 64GB and I get 40t/s with llama.cpp and unsloth q4_k_m on the 30B A3B model. I always use /nothink and Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 - these are the settings recommended for Qwen3 and they make a big difference. With the default settings from llama-server it will always run into an endless loop.
The quality of the output is decent, just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-coder 32B q4 is around 12-14t/s on this M1, which is too slow for me. I usually used the Qwen2.5-coder 17B at around 30t/s for simple tasks. Qwen3 30B is imho better and faster.
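If anyone wants to script against llama-server with those settings, a minimal sketch looks roughly like this - the port and model name are assumptions about a default setup, and llama-server mostly ignores the model field when it is serving a single model:

    from openai import OpenAI

    # Assumptions: llama-server running on its default port 8080 with the GGUF
    # already loaded; any api_key string works since no auth is configured.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",  # informational when the server hosts one model
        messages=[{"role": "user", "content": "Explain MQTT wildcards briefly. /no_think"}],
        # Recommended Qwen3 sampling settings from the comment above:
        temperature=0.7,
        top_p=0.8,
        # top_k and min_p are not standard OpenAI parameters, but llama-server
        # accepts them as extra body fields:
        extra_body={"top_k": 20, "min_p": 0.0},
    )
    print(resp.choices[0].message.content)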
I am not a Mac person, but I am debating buying one for the unified RAM now that the prices seem to be inching down. Is it painful to set up? The general responses I seem to get range from "It takes zero effort" to "It was a major hassle to set everything up."
Honestly, it is quite a hassle - it took me 2 hours. BUT: if you just take the whole article text, paste it into gemini-2.5-pro, and describe your circumstances, I think it will give you specific steps for your case, and it should be trivial from that moment on.
Cool, but running qwen3 and doing a ls tool call is not “vibe coding”, this reads more like a lazy ad for localforge
I doubt it can perform well with actual autonomous tasks like reading multiple files, navigating dirs and figuring out where to make edits. That’s at least what I would understand under “vibe coding”
Definitely try it, it can navigate files, search for stuff, and run bash commands, and while 30B is a bit cranky it gets the job done (much worse than what I would get when I plug in gpt-4.1, but it's still not bad). Kudos to Qwen.
As for localforge, it really is a vibe coding tool, just like claude or codex, but with the possibility to plug more than just one provider. What's wrong with that?
Very cool to see this and glad to discover localforge. Question about localforge, can I combine two agents to do something like: pass an image to a multimodal agent to provide html/css for it, and another to code the rest?
In the post I saw there’s gemma3 (multimodal) and qwen3 (not multimodal). Could they be used as above?
How does localforge know when to route a prompt to which agent?
Just wanted to give a shout out to MLX and MLX-LM - I’ve been using it to fine-tune Gemma 3 models locally and it’s a surprisingly well put together library and set of tools from the Apple devs.
Anyone know of a setup, perhaps with MCP, where I can get my local LLM to work in tandem on tasks, compress context, or otherwise act in concert with the cloud agent I'm using with Augment/Cursor/whatever? It seems silly that my shiny new M3 box just renders the UI while the cloud LLM alone refactors my codebase; I feel they could negotiate the tasks between themselves somehow.
There's a few Ollama-MCP bridge servers already (from a quick search, also interested myself):
ollama-mcp-bridge: A TypeScript implementation that "connects local LLMs (via Ollama) to Model Context Protocol (MCP) servers. This bridge allows open-source models to use the same tools and capabilities as Claude, enabling powerful local AI assistants"
simple-mcp-ollama-bridge: A more lightweight bridge connecting "Model Context Protocol (MCP) servers to OpenAI-compatible LLMs like Ollama"
rawveg/ollama-mcp: "An MCP server for Ollama that enables seamless integration between Ollama's local LLM models and MCP-compatible applications like Claude Desktop"
How you route would be an interesting challenge; presumably you could just tell it to use the MCP for certain tasks, thereby offloading them locally.
I've been toying with Visual Studio Code's MCP and agent support and have gotten it to offload things like reference searches and targeted web crawling (look up module X on git repo Y via a URL pattern that the MCP server fetches and parses).
I started by giving it a reference Python MCP server and asking it to modify the code to do that. Now I have 3-4 tools that give me reproducible results.
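To make the "offload certain tasks locally" idea concrete, here is a rough sketch of an MCP tool that forwards a prompt to a local Ollama model. Assumptions on my part: the official Python MCP SDK (its FastMCP helper), Ollama's default /api/generate endpoint, and a placeholder tool name and model tag:

    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("local-offload")  # hypothetical server name

    OLLAMA_URL = "http://localhost:11434/api/generate"

    @mcp.tool()
    def local_summarize(text: str) -> str:
        """Summarize text with a local model instead of the cloud agent."""
        resp = requests.post(OLLAMA_URL, json={
            "model": "qwen3:30b-a3b",  # placeholder; use whatever you have pulled
            "prompt": f"Summarize concisely:\n\n{text}",
            "stream": False,
        }, timeout=600)
        resp.raise_for_status()
        return resp.json()["response"]

    if __name__ == "__main__":
        mcp.run()  # speaks MCP over stdio by default

Point the cloud-side agent at this server and tell it which tasks to hand to the tool, and you get the "negotiate between themselves" split in a crude form.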
I thought I'd share this quick tutorial for getting an actual autonomous agent running on your local machine and doing some simple tasks. I'm still figuring out the right MLX settings or the proper model version for it, but the framework around this approach is solid, so I figured it was worth sharing!
Yes, you can just run it in a folder and ask it to look around; it can execute bash commands and do anything that Claude Code can do. It will read the whole codebase if it has to.
Running models locally is starting to get interesting now. Especially the 30B-A3B version seems like a promising direction, though it is still out of reach on 16 GB of VRAM (a tier of hardware that is quite accessible).
Hoping for new Nvidia RTX cards with 24/32 GB VRAM. Seems that we might get to GPT4-ish levels within a few years? Which is useful for a bunch of tasks.
I think we are just a tiny bit away from being able to really "code" with AI locally. Because even if it were only at Gemini 2.5 level, since it's free, you can let it self-prompt a bit more and eventually solve any problem. If I could run a 200B model, or if the 30B were as good, it would have been enough.
It's not uncommon on HN! We frequently have people chiming in as CEOs, insiders, and experts in various fields without much proof. Generally, it hasn't been a problem. Or at least I've not seen any examples of having wool pulled over our eyes in this fashion.
There should be more mention of the requirement to set the model name correctly. For this tutorial to be executed top-to-bottom, the model name must be "mlx-community/Qwen3-30B-A3B-8bit". Other model names will result in a 404 -- rightly so, as this is what determines which model mlx_lm.serve runs!
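In other words, whatever you pass as "model" is what mlx_lm.serve loads, so a quick sanity check can look like this (the host/port below assume the server's defaults; adjust if you launched it with --host/--port):

    import requests

    # Assumed default mlx_lm.serve address.
    URL = "http://localhost:8080/v1/chat/completions"

    payload = {
        # Must match exactly; otherwise the server has nothing to load.
        "model": "mlx-community/Qwen3-30B-A3B-8bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 100,
    }

    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()  # a typo in "model" shows up here as a 4xx error
    print(r.json()["choices"][0]["message"]["content"])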
Great, thank you. Side topic: does anyone know of a way to have a centralized proxy for all LLM services, online or local, that our services connect to, so we manage access to LLMs in only one place? It should also record calls to the LLMs. That would make the whole UX of switching LLMs weekly easier - we would only reconfigure the proxy. The only one I know that can do this is LiteLLM, but its recording of all LLM calls is a bit clunky to use properly.
Litellm is definitely your best bet. For recording - you can probably vibe code a proxy in front of it that mitms it and dumps the request into whatever format you need
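For a flavour of why LiteLLM fits here: the same call site can hit a hosted model or a local Ollama model just by changing the model string. This is only a sketch - the model names and the local api_base are assumptions about your setup:

    from litellm import completion

    messages = [{"role": "user", "content": "Write a haiku about unified memory."}]

    # Hosted provider (assumes OPENAI_API_KEY is set in the environment).
    cloud = completion(model="gpt-4o-mini", messages=messages)

    # Local model via Ollama (assumes `ollama serve` on the default port).
    local = completion(
        model="ollama/qwen3:30b-a3b",
        messages=messages,
        api_base="http://localhost:11434",
    )

    print(cloud.choices[0].message.content)
    print(local.choices[0].message.content)

The proxy server mode (litellm --config config.yaml, if I remember the invocation right) is what turns this into the single managed endpoint plus logging setup being asked about, though as noted the logging side takes some tuning.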
I'm a maintainer of Opik, an open source LLM eval/observability framework. If you use something like LiteLLM or OpenRouter to handle the proxying of requests, Opik basically provides an out-of-the-box recording layer via its integrations with both.
I’ve been looking for this for my team but haven’t found it. Providers like OpenAI and Anthropic offer admin tokens to manage team accounts, and you could hook into Ollama or another self-managed service for local AI.
Seems like a great way to roll out AI to a medium sized team where a very small team can coordinate access to the best available tools so the entire team doesn’t need to keep pace at the current break-neck speed.
Could you maybe make use of Simon Willsons [LLM lib/app](https://github.com/simonw/llm)? It has great LLM support (just pass in the model to use) and records everything by default.
The one feature missing from LLM core for this right now is serving models over an HTTP OpenAI-compatible local server. There's a plugin you can try for that here though: https://github.com/irthomasthomas/llm-model-gateway
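If the Python API is enough (i.e. you don't need the HTTP server), the core of the llm library is about this small. Note the automatic logging mentioned above is a feature of the llm CLI (browsable with "llm logs"); the model id here is just an example and needs the relevant key or plugin installed:

    import llm

    # Any installed model id works; hosted models need a key (`llm keys set openai`),
    # local models come via plugins such as llm-ollama or llm-mlx.
    model = llm.get_model("gpt-4o-mini")
    response = model.prompt("Suggest three names for a local LLM proxy")
    print(response.text())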
I've found the local models useful for non-coding tasks, however the 8B parameter models so far have proven lacking enough for coding tasks that I'm waiting another few months for whatever the Moore's law equivalent of LLM power is to catch up. Until then, I'm sticking with Sonnet 3.7.
No, that sounds right. 24GB isn't enough to feasibly run 27B parameters. The rule of thumb is approximately 1GB of RAM per billion parameters.
Someone in another comment on this post mentioned using one of the micro models (Qwen 0.6B I think?) and having decent results. Maybe you can try that and then progressively move upwards?
That rule of thumb only applies to 8-bit quants at low context. The default for ollama is 4-bit, which puts it at roughly 14GB.
The vast majority of people run between 4-6 bit depending on system capability. The extra accuracy above 6 tends to not be worth it relative to the performance hit.
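Back-of-the-envelope version of that rule of thumb (weights only - KV cache and runtime overhead come on top, and real GGUF quants like q4_k_m land a bit above their nominal bit width):

    def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
        """Rough size of the weights alone: params * bits / 8, in GB."""
        return params_billion * bits_per_weight / 8

    for bits in (4, 6, 8, 16):
        print(f"27B at {bits}-bit ~ {approx_weight_gb(27, bits):.1f} GB")

    # 4-bit ~ 13.5 GB -> fits in 24 GB with room left for context
    # 8-bit ~ 27.0 GB -> the "1 GB per billion parameters" figure; too big for 24 GB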
I understand why people use the Mac for their local LLM work. I can't bring myself to spend any money on Apple products. I need to find an alternative platform that runs under Linux, preferably one I can reach remotely from my work laptop. I would also want to find some way to modulate the power consumption so it turns off automatically when I'm idle.
If you don't mind going through the eldritchian horror that is building ROCm from source[0], Qwen_Qwen3-30B-A3B-Q6_K (a 6-bit quantization of the LLM mentioned in the article, which in practice shouldn't be much different) works decently fast on an RX 7900 XTX using koboldcpp and llama.cpp. And by "decently fast" I mean "it writes faster than I can read".
If you're on Debian AFAIK AMD is paying someone to experience the pain in your place, so that is an option if you're building something from scratch, but my openSUSE Tumbleweed installation predates the existence of llama.cpp by a few years and i'm not subjecting myself to the horror that is Python projects (mis)managed by AI developers[1] :-P.
EDIT: my mistake, ROCm isn't needed (or actually, supported) by koboldcpp, it uses Vulkan. ROCm is available via a fork. Still, with Vulkan it is fast too.
[0] ...and more than once, since it might break after some OS upgrade, like mine did
[1] OK, I did it once, because recently I wanted to try out some tool someone wrote that relied on some AI stuff and I was too stubborn to give up - I had to install Python from source in a Debian docker container because some dependency 2-3 layers deep didn't compile with a newer minor release of Python. It convinced me to thank, yet again, Georgi Gerganov for making AI-related tooling that lets people stick with C++
Yeah, as i wrote "if you're on Debian AFAIK AMD is paying someone to experience the pain in your place" :-).
I used to use Debian in the past, but when I was about to install my current OS I already had the openSUSE Tumbleweed installer on a USB stick, so I went with that. Ultimately I just needed "a Linux" and didn't care which. I do end up building more stuff from source than when I used Debian, but TBH the only time that annoyed me was with ROCm, because it is broken into 2983847283 pieces, many of them have their own flags for the same stuff, some claim to let you install them anywhere but in practice only work from the default in "/opt", and a bunch of them have their own special snowflake build process (including one that downloads some random stuff via a script during the build - IIRC a Gentoo packager filed a bug report about removing the need to download stuff, but I'm not sure if it has been addressed or not).
If I was doing a fresh OS install I'd probably go with Gentoo - it packages ROCm like Debian does, but AFAICT (I haven't tried it) it also provides tools for making bespoke patches to installed packages that survive updates, and I'd like to do some customizations on the stuff I install.
The new AMD chips in the Framework laptops would be a good candidate and I think you can get 96GB RAM in them. Also if the LLM software is idle (like llama.cpp or ollama) there is negligible extra power consumption.
It's very cool that useful models can be run on a single personal computer at all. For coding, though, your time is very valuable, and I'd never want to use anything less than the best. I'm happy to pay pennies to use a frontier model with a huge context window and great speed.
Agreed. It’s definitely been fun playing locally, learning, fine tuning, etc, but these models just don’t quite cut it for serious development tasks (yet, and assuming none of the above considerations apply). I haven’t found better than Gemini 2.5 for my work so far.
Qwen3 is great, but not for writing code. Even after the recent fixes, and with the recommended parameters, it often gets trapped in a loop.
Qwen2.5-32B, Cogito-32B and GLM-32B remain the best options for local coding agents, even though the recently released MiMo model is also quite good for its size.
Yes, Ollama has Qwen 3 and it works great on a Mac. It may be slightly slower than MLX since Ollama hasn't integrated that (Apple Silicon optimized) library yet, but Ollama models still use the Mac's GPU.
You can just use Ollama and have a bunch of models - some are good for planning, some for executing tasks... this sounds more complex than it should be, or maybe I am lazy and want everything neatly sorted.
I have models on an external drive (because Apple), and through the Ollama server they interact really well with Cline or Roo Code or even Bolt, though I found Bolt really doesn't work well.
I was just trying to make sure it is maximally performant, and did it with MLX because I am running on Mac hardware and wanted to be able to run the 30B in reasonable time so it can actually autonomously code something. Otherwise there are many ways to do it!
mlx is an alternative model format to GGUF. It executes natively on apple silicon using Apple's AI accelerator, rather than through GGUF as a compute shader(!). It's faster and uses fewer resources on Apple devices.
Yeah, you don't necessarily need it but it's optimized for Apple Silicon and in my experience feels like it gives slightly better performance than GGUFs. I really need to formally measure that so I'm not just running on vibes!
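For anyone who wants to try the MLX path without standing up a server, the mlx-lm Python API is roughly this small (a sketch; exact keyword arguments vary a bit between mlx-lm versions, and the model id is the 8-bit community conversion mentioned elsewhere in the thread):

    from mlx_lm import load, generate

    # Downloads the weights from Hugging Face on first use and runs them via MLX
    # on Apple Silicon.
    model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")

    # For chat-tuned models you would normally wrap the prompt with
    # tokenizer.apply_chat_template first; a raw prompt is enough for a smoke test.
    prompt = "Write a one-line docstring for a function that parses MQTT topics."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))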
Forgive me, I am just digesting the term "vibe coding," which doesn't seem like coding at all? It's just typing into your AI's text prompt, describing what you want it to do, and then making edits until the AI has a working prototype of what you seek. Is that a correct assumption?
So it's not coding ... it's talking to an LLM via voice or chat and having it code for you. Then you ask it to change/edit things, review the code a bit or just run an error check so the LLM fixes the errors, and you're done.
And so people who are vibe coding are getting paid multiple six-figure salaries ... that's not sustainable; anyone at any age and in any country can vibe code.
Looks like we are embracing the demise of our skill-sets, careers and livelihoods quickly!
I'm really impressed and also very interested to see models I can run on my MacBook Pro start to generate results close to large hosted "frontier" models, and do so with what I assume are far fewer parameters.
It's been a solid trend for the last two years: I've not upgraded my laptop in the time and the quality of results I'm getting from local models on that same machine has continued to rise.
My hunch is that there's still some remaining optimization fruit to be harvested but I expect we may be nearing a plateau. I may have to upgrade from 64GB of RAM this year.
"create a python decorator that uses a trie to do mqtt topic routing”
phi4-reasoning works, but I think the code is buggy
phi4-mini-reasoning freaks out
qwen3:30b starts looping and forgets about the decorator
mistral-small gets straight to the point and the code seems sane
https://mastodon.social/@rcarmo/114433075043021470
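For context, a hand-written version of what that prompt asks for fits in about forty lines. Something along these lines (my own sketch, so the details will differ from whatever each model produced; the zero-level "#" edge case from the MQTT spec is left out for brevity):

    class TopicRouter:
        """Route MQTT topics to handlers using a trie with + and # wildcards."""

        HANDLER = object()  # sentinel key marking "a handler is stored at this node"

        def __init__(self):
            self.root = {}  # nested dicts keyed by topic level

        def route(self, pattern):
            """Decorator: register a function for an MQTT topic pattern."""
            def decorator(func):
                node = self.root
                for level in pattern.split("/"):
                    node = node.setdefault(level, {})
                node[self.HANDLER] = func
                return func
            return decorator

        def dispatch(self, topic, *args, **kwargs):
            """Call the first handler whose pattern matches the concrete topic."""
            handler = self._match(self.root, topic.split("/"))
            if handler is None:
                raise KeyError(f"no route for {topic!r}")
            return handler(topic, *args, **kwargs)

        def _match(self, node, levels):
            if not levels:
                return node.get(self.HANDLER)
            head, rest = levels[0], levels[1:]
            for key in (head, "+"):      # exact level first, then single-level wildcard
                if key in node:
                    found = self._match(node[key], rest)
                    if found is not None:
                        return found
            if "#" in node:              # multi-level wildcard swallows the remainder
                return node["#"].get(self.HANDLER)
            return None


    router = TopicRouter()

    @router.route("sensors/+/temperature")
    def on_temperature(topic, payload):
        print(f"{topic}: {payload} degrees")

    router.dispatch("sensors/kitchen/temperature", 21.5)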
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
It's using 20GB of memory according to ollama.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
I'm running Q4 and it's taking 17.94 GB VRAM with 4k context window, 20GB with 32k tokens.
As a Python person I've found uv + MLX to be pretty painless on a Mac too.
The latter is super easy. Just download the model (thru the GUI) and go.
This part, yes. I assume that setting up a complete environment is a little more involved than the 4 commands the sibling comment refers to.
Thank you
I do think you should disclose that Localforge is your own project though.
https://github.com/comet-ml/opik
I have 64GB and I can only just fit a bunch of Firefox and VS Code windows at the same time as running a 27B model.
llama.cpp can be built using Debian-supplied libraries with ROCm backend enabled.
1. Sovereignty over data, your outputs can't be stolen or trained on
2. Just for fun / learning / experiment on
3. Avoid detection that you're using AI
4. No Internet connection, in the woods at your cabin or something
https://ollama.com/library/qwen3
Why isn't using localforge enough as it ties into models?
https://x.com/karpathy/status/1886192184808149383
I wonder how far this can go?