For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow.
By the way, I'm wondering why unsloth (a goddamn python library) tries to run apt-get with sudo (and fails on my nixos). Like how tf we are supposed to use that?
Oh hey I'm assuming this is for conversion to GGUF after a finetune? If you need to quantize to GGUF Q4_K_M, we have to compile llama.cpp, hence apt-get and compiling llama.cpp within a Python shell.
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not on `FastLanguageModel`
Essentially I try to do `sudo apt-get` if it fails then `apt-get` and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`
It seems Unsloth is useful and popular, and you seem responsive and helpful. I'd be down to try to improve this and maybe package Unsloth for Nix as well, if you're up for reviewing and answering questions; seems fun.
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
- see if llama.cpp is on the PATH and already has the requisite features
- if not, check /etc/os-release to determine distro
- if unavailable, guess distro class based on the presence of high-level package managers (apt, dnf, yum, zypper, pacman) on the PATH
- bail, explain the problem to the user, give copy/paste-friendly instructions at the end of we managed to figure out where we're running
Is either sort of change potentially agreeable enough that you'd be happy to review it?
Maybe it's a personal preference, but I don't want external programs to ever touch my package manager, even with permission. Besides, this will fail loudly for systems that don't use `apt-get`.
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
I don't think this should be a personal preference, I think it should be a standard*.
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenace overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
I would like to be in (1) but I'm not a packaging person so I'll need to investigate more :(
(2) I might make the message on installing llama.cpp maybe more informative - ie instead of re-directing people to the docs on manual compilation ie https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-..., I might actually print out a longer message in the Python cell entirely
Hopefully the solution for now is a compromise if that works? It will show the command as well, so if not accepted, typing no will error out and tell the user on how to install the package
3. Agreed on bailing - I was also thinking if doing a Python input() with a 30 second waiting period for apt-get if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
It shouldn't install any packages itself. Just print out a message about the missing packages and your guess of the command to install them, then exit. That way users can run the command themselves if it's appropriate or add the packages to their container build or whatever. People set up machines in a lot of different ways, and automatically installing things is going to mess that up.
Dude, this is NEVER ok. What in the world??? A third party LIBRARY running sudo commands? That’s just insane.
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
IMO the correct thing to do to make these people happy, while being sane, is - do not build llama.cpp on their system. Instead, bundle a portable llama.cpp binary along with unsloth, so that when they install unsloth with `pip` (or `uv`) they get it.
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes unsloth wheel platform-dependent, if this is too much of a burden, then maybe you can just package llama.cpp binary and have it on PyPI, like how scipy guys maintain a https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depends on it (maybe in an optional group, I see you already have a lot due to cuda shit).
Oh yes I was working on providing binaries together with pip - currently we're relying on pyproject.toml, but once we utilize setup.py (I think), using binaries gets much simpler
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
I think you misunderstood rfoos suggestion slightly.
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary,too - so it'd be easy to update the binary to latest even without pushing a new version of your original package
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip Packages :
* Unsloth: current package, prefers using unsloth-llama, and uses path llama-cpp as fallback (with error msg as final fallback if neither exist, promoting install for unsloth-llama)
* Unsloth-llama: new package which only bundles the llama cpp binary
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
Yep agreed - I primarily thought it was a reasonable "hack", but it's pretty bad security wise, so apologies again.
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
Don't apologize, you are doing amazing work. I appreciate the effort you put.
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
I added it since many people who used Unsloth don't know how to compile llama.cpp, so the only way from Python's side is to either (1) Install it via apt-get within the Python shell (2) Error out then tell the user to install it first, then continue again
I chose (1) since it was mainly for ease of use for the user - but I agree it's not a good idea sorry!
Hey man, I was seeing your comments and you do seem to respond to each and everyone nicely regarding this sudo shenanigan.
I think that you have removed sudo so this is nice, my suggestion is pretty similar to that of pxc (basically determine different distros and use them as that)
I wonder if we will ever get a working universal package manager in linux, to me flatpak genuinely makes the most sense even sometimes for cli but flatpak isn't built for cli unlike snap which both support cli and gui but snap is proprietory.
Hey :) I love suggestions and keep them coming! :)
I agree on handling different distros - sadly I'm not familiar with others, so any help would be appreciated! For now I'm most familiar with apt-get, but would 100% want to expand out!
Just to let you know though that its really rare that flatpak is used for cli's. I think I mentioned it in my comment too or if not, my apologies but flatpak is used mostly in gui's.
I doubt its efficacy here, they might be more useful if you provide a whole jupyter / browser gui though but a lot o f us run it just in cli so I doubt flatpak.
I didn't mean to say that flatpak was the right tool for this job, I seriously don't know too much to comment and so I'd prefer if you could ask someone definitely experienced regarding it.
My reasoning for flatpak was chunking support (that I think is rare in appimage) and easier gpu integration (I think) compared to docker, though my reasoning might be flawed since flatpak isn't mostly used with cli.
Oh llamafile is very cool! I might add it as an option actually :) For generic exports (ie to vLLM, llamafile etc), normally finetunes end with model.save_pretrained_merged and that auto merges to 16bit safetensors which allows for further processing downstream - but I'll investigate llamafile more! (good timing since llamafile is cross platform!)
hey fellow crazy person! slight tangent: one thing that helps keep me grounded with "LLMs are doing much more than regurgitation" is watching them try to get things to work on nixos - and hitting every rake on the way to hell!
nixos is such a great way to expose code doing things it shouldn't be doing.
not sure if its just chat.deepseek.com but one strange thing I've noticed is that now it replies to like 90% of your prompts with "Of course.", even when it doesnt fit the prompt at all or you ask it something obviously impossible. maybe it's the backend injecting it to be more obedient? but you can tell it `don't begin the reply to this with "of" ending "course"` and it will listen. it's very strange
Some people on reddit (very reliable source I know) are saying it was trained on a lot of Gemini and I can see that. for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply
With all these things, it depends on your own eval suite. gpt-oss-120b works as well as o4-mini over my evals, which means I can run it via OpenRouter on Cerebras where it's SO DAMN FAST and like 1/5th the price of o4-mini.
My experience is that gpt-oss doesn't know much about obscure topics, so if you're using it for anything except puzzles or coding in popular languages, it won't do well as the bigger models.
It's knowledge seems to be lacking even compared to gpt3.
Something I was doing informally that seems very effective is asking for details about smaller cities and towns and lesser points of interest around the world. Bigger models tend to have a much better understanding and knowledge base for the more obscure places.
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools will stay the same and then we can really compare the models.
there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, "I don't know a quote about that specific topic, but you might mean this other thing." or something like that then cited a real quote in the same topic, after acknowledging that it wasn't able to find the one I had read in an old book.
i don't use it for coding, but for things that are more unique i feel is more precise.
I wonder if Conway's law is at all responsible for that, in the similarity it is based on; regional trained data which has concept biases which it sends back in response.
I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.
How can a benchmark be secret if you post it to an API to test a model on it?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.
What formats? I thought the very schema of json is what allows these LLMs to enforce structured outputs at the decoder level? I guess you can do it with any format, but why stray from json?
Sometimes it will randomly generate something like this in the body of the text:
```
<tool_call>executeshell
<arg_key>command</arg_key>
<arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value>
</tool_call>
```
or this:
```
<|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema. Generally, the training is doing 99% of the work, of course, it's just "strict" means "we'll check it's work to the point a GBNF grammar created from the schema will validate."
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
> In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema.
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
I maintain a cross-platform llama.cpp client - you're right to point out that generally we expect nuking logits can take care of it.
There is a substantial performance cost to nuking, the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). The cost is very high, default in API* is not artificially lower other logits, and only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t to what strict mode is in an API, and am sort of describing the "outer loop" of sampling
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
Those Qwen3 2507 models are the local creme-de-la-creme right now. If you've got any sort of GPU and ~32gb of RAM to play with, the A3B one is great for pair-programming tasks.
Do you happen to know if it can be run via an eGPU enclosure with f.ex. RTX 5090 inside, under Linux?
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
I’m running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You’re still dealing with an NVDIA device, of course. The connection type is not going to change that hassle.
Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
You would still need drivers and all the stuff difficult with nvidia in linux with a egpu. (Its not nessecarily terrible just suboptimal) Rather just add the second GPU in the Workstation, or just run the llm in your AMD GPU.
I've been running LLM models on my Radeon 7600 XT 16GB for past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from latest Radeon drivers) is the "AMD HIP SDK" (very straight forward installer). After unzipping (the zip from GitHub releases page must contain hip-radeon in the name) all I do is this:
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.
Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.
DeepSeek is bad for hallucinations in my experience. I wouldn't trust its output for anything serious without heavy grounding. It's great for fantastical fiction though. It also excels at giving characters "agency".
Sad to see the off peak discount go. I was able to crank tokens like crazy and not have it cost anything. That said the pricing is still very very good so I can't complain too much.
That's interesting. I am curious about the extent of the training data in these models.
I asked Kimi K2 for an account of growing up in my home town in Scotland, and it was ridiculously accurate. I then asked it to do the same for a similarly sized town in Kerala. ChatGPT suggested that while it was a good approximation, K2 got some of the specifics wrong.
FWIW I have the €20 Pro plan and exchange maybe 20 messages with Opus (with thinking) every day, including one weeks-long conversation. Plus a few dozen Sonnet tasks and occasionally light weight CC.
I'm not a programmer, though - engineering manager.
Sure I do, but not as part of any tools, just for one-off conversations where I know it's going to be the best out there. For tasks where reasoning helps little to none, it's often still number one.
just saw this on Chinese internet - deepseek officially mentioned that v3.1 is trained using UE8M0 FP8 as that is the FP8 to be supported by the next gen Chinese AI chip. so basically -
some Chinese next gen AI chips is coming, deepseek is working with them to get its flagship model trained using such domestic chips.
interesting time ahead! just imagine what it could do to NVIDIA share price when deepseek releases a SOTA new model trained without using NVIDIA chips.
Incredible how "keeping their people down" means leaps in personal wealth and happiness for huge swathes of the population and internal criticism is that it is a "poverty reduction machine" that is too focused.
If they had a government like Taiwan they would be significantly wealthier. Their government is a drag and should not steal credit from the actual people who made that wealth with their hard work and entrepreneurship.
They're not mutually exclusive. Lots of terrible and mismanaged governments rely on genius short-term economic exploitation, like Syria, Iran, India, Korea, etc.
What would be incredible is China sticking the landing to a third-sector economy. Plenty of countries have industrialized over the past century, only a handful became true service economies.
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not on `FastLanguageModel`
Essentially I try to do `sudo apt-get` if it fails then `apt-get` and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
Is either sort of change potentially agreeable enough that you'd be happy to review it?(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenace overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
(2) I might make the message on installing llama.cpp maybe more informative - ie instead of re-directing people to the docs on manual compilation ie https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-..., I might actually print out a longer message in the Python cell entirely
Yes we're working on Docker! https://hub.docker.com/r/unsloth/unsloth
1. So I added a `check_llama_cpp` which checks if llama.cpp does exist and it'll use the prebuilt one https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
2. Yes I like the idea of determining distro
3. Agreed on bailing - I was also thinking if doing a Python input() with a 30 second waiting period for apt-get if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
But I'm working on more cross platform docs as well!
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
I was thinking if I can do it during the pip install or via setup.py which will do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes unsloth wheel platform-dependent, if this is too much of a burden, then maybe you can just package llama.cpp binary and have it on PyPI, like how scipy guys maintain a https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depends on it (maybe in an optional group, I see you already have a lot due to cuda shit).
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary,too - so it'd be easy to update the binary to latest even without pushing a new version of your original package
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip Packages :
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
I chose (1) since it was mainly for ease of use for the user - but I agree it's not a good idea sorry!
:( I also added a section to manually compile llama.cpp here: https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-...
But I agree I should remove apt-gets - will do this asap! Thanks for the suggestions :)
I think that you have removed sudo so this is nice, my suggestion is pretty similar to that of pxc (basically determine different distros and use them as that)
I wonder if we will ever get a working universal package manager in linux, to me flatpak genuinely makes the most sense even sometimes for cli but flatpak isn't built for cli unlike snap which both support cli and gui but snap is proprietory.
I agree on handling different distros - sadly I'm not familiar with others, so any help would be appreciated! For now I'm most familiar with apt-get, but would 100% want to expand out!
Interesting will check flatpak out!
I doubt its efficacy here, they might be more useful if you provide a whole jupyter / browser gui though but a lot o f us run it just in cli so I doubt flatpak.
I didn't mean to say that flatpak was the right tool for this job, I seriously don't know too much to comment and so I'd prefer if you could ask someone definitely experienced regarding it.
My reasoning for flatpak was chunking support (that I think is rare in appimage) and easier gpu integration (I think) compared to docker, though my reasoning might be flawed since flatpak isn't mostly used with cli.
probably not, because LLMs are a little more competent than this
nixos is such a great way to expose code doing things it shouldn't be doing.
Some people on reddit (very reliable source I know) are saying it was trained on a lot of Gemini and I can see that. for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply
https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
Let's hope not, because gpt-oss-120B can be dramatically moronical. I am guessing the MoE contains some very dumb subnets.
Benchmarks can be a starting point, but you really have to see how the results work for you.
It's knowledge seems to be lacking even compared to gpt3.
No idea how you'd benchmark this though.
https://openrouter.ai/openai/gpt-oss-120b and https://openrouter.ai/deepseek/deepseek-chat-v3.1 for the same providers is probably better, although gpt-oss-120b has been around long enough to have more providers, and presumably for hosters to get comfortable with it / optimize hosting of it.
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
or this: ``` <|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|> ```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
There is a substantial performance cost to nuking, the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). The cost is very high, default in API* is not artificially lower other logits, and only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t to what strict mode is in an API, and am sort of describing the "outer loop" of sampling
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.
I asked Kimi K2 for an account of growing up in my home town in Scotland, and it was ridiculously accurate. I then asked it to do the same for a similarly sized town in Kerala. ChatGPT suggested that while it was a good approximation, K2 got some of the specifics wrong.
I'm not a programmer, though - engineering manager.
https://brokk.ai/power-ranking?version=openround-2025-08-20&...
$0.56 per million tokens in — and $1.68 per million tokens out.
some Chinese next gen AI chips is coming, deepseek is working with them to get its flagship model trained using such domestic chips.
interesting time ahead! just imagine what it could do to NVIDIA share price when deepseek releases a SOTA new model trained without using NVIDIA chips.
I'll have to see how things go with this model after a week, once the hype has died down.
wait until you find out that China also acting the same way toward the rest of the world (surprise pikachu face)
Dark propaganda opposed to what, light propaganda? The Chinese model being released is about keeping China down?
You seem very animated about this, but you would probably have more success if you tried to clarify this a bit more.
What would be incredible is China sticking the landing to a third-sector economy. Plenty of countries have industrialized over the past century, only a handful became true service economies.