This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice.
I've been running into it consistently: responses that just stop mid-sentence, not because of token limits or content filters, but because of what appears to be a bug in how the model signals completion. It's been documented on their GitHub and dev forums for months as a P2 issue.
The frustrating part is that when you compare a complete Gemini response to Claude or GPT-4, the quality is often quite good. But reliability matters more than peak performance. I'd rather work with a model that consistently delivers complete (if slightly less brilliant) responses than one that gives me half-thoughts I have to constantly prompt to continue.
It's a shame because Google clearly has the underlying tech. But until they fix these basic conversation flow issues, Gemini will keep feeling broken compared to the competition, regardless of how it performs on benchmarks.
Another issue: Gemini can’t do tool calling and (forced) json output at the same time
If you want to use application/json as the specified output in the request, you can’t use tools
So if you need both, you either hope it gives you correct json when using tools (which it often doesn't), or you have to do two requests: one for the tool calling, another for formatting.
At least, even if annoying, this issue is pretty straightforward to get around
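For what it's worth, here's a minimal sketch of that two-request workaround with the google-genai Python SDK. The weather function is a hypothetical stand-in (the SDK's automatic function calling accepts plain Python functions, if I remember the API right), and exact config field names may differ between SDK versions, so treat this as a shape rather than a recipe:

```python
# Sketch only: request 1 lets the model use tools, request 2 forces JSON with no tools attached.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

def get_weather(city: str) -> str:
    """Toy stand-in for a real weather lookup tool."""
    return f"Sunny, 21C in {city}"

# Request 1: tool calling allowed, no forced JSON output.
tool_turn = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Look up the weather in Berlin and summarize it.",
    config=types.GenerateContentConfig(tools=[get_weather]),
)

# Request 2: a plain formatting pass that forces application/json.
json_turn = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Return this as JSON with keys 'city' and 'summary': " + tool_turn.text,
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)
print(json_turn.text)
```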
I think what I am seeing from ChatGPT is highly variable performance. I think this must be something they are doing to manage compute limitations or costs. With Gemini, what I see is slightly different - more like a lower “peak capability” than ChatGPT’s.
I'm fairly sure there's some sort of dynamic load balancing at work. I read an anecdote from someone who had a test where they asked it to draw a little image (something like an ASCII cat, but probably not exactly that since it seems a bit basic), and if the result came back poor they didn't bother using it until a different time of day.
Of course it could all be placebo, but when you think about it intuitively, somewhere on the road to the hundreds of billions in datacenter capex, one would think there will be periods where compute and demand are out of sync. It's also perfectly understandable why now would be a time to be seeing that.
I've heard a lot that voice mode uses a faster (and worse) model than regular ChatGPT. So I think this makes sense. But I haven't seen this in any official documentation.
Small things like this or the fact that AI studio still has issues with simple scrolling confuse me. How does such a brilliant tool still lack such basic things?
If anyone from OpenAI is reading this, I have two complaints:
1. Using the "Projects" thing (Folder organization) makes my browser tab (on Firefox) become unusably slow after a while. I'm basically forced to use the default chats organization, even though I would like to organize my chats in folders.
2. After editing a message that you already sent, you get to select between the different branches of the chat (1/2, and so on), which is cool, but when ChatGPT fails to generate a response in this "branched conversation" context, it will continue failing forever. When your conversation is a single thread and a ChatGPT message fails with an error, retrying usually works and the chat continues normally.
On mobile (Android), opening the keyboard scrolls the chat to the bottom! I sometimes want to type while referring to something from the middle of the LLM's last answer.
Projects should have their own memory system. Perhaps something more interactive than the existing Memories, but projects need their own data (definitions, facts, draft documents) that is iterated on and referred to per project. Attached documents aren't it; the AI needs to be able to update the data over multiple chats.
I wonder if this is because a memory cap was reached at that output token. Perhaps they route conversations to different hardware depending on how long they expect it to be.
Yes, agree, it was totally broken when I tested the API two months ago. Lots of failed connections and very slow response times. Hoping the update fixes these issues.
I wonder if [good examples of] SVGs of pelicans on bikes are "being introduced" into training sets. Some of the engineers who work on this stuff are the kind to hang out here.
It's possible, but honestly I've never seen a decent vector illustration of a pelican on a bicycle myself so they'd have to work pretty hard to find one!
Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a number.
If they're going to include the month and year as part of the version number, they should at least use big endian dates like gemini-2.5-flash-preview-2025-09 instead of 09-2025.
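A quick illustration of why that matters (the model names below are made up for the example): with the year first, a plain string sort is also chronological; with the month first, it isn't.

```python
# Big-endian (year-month) suffixes: string sort order matches release order.
print(sorted(["gemini-2.5-flash-preview-2025-05", "gemini-2.5-flash-preview-2025-09"]))

# Month-first suffixes: "05-2025" sorts before "09-2024" even though it was released later.
print(sorted(["gemini-2.5-flash-preview-05-2025", "gemini-2.5-flash-preview-09-2024"]))
```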
Or, you know, just Gemini 2.6 Flash. I don't recall the 2.5 version having a date associated with it when it came out, though maybe they are using dates now. In marketing, at least, it's always known as Gemini 2.5 Flash/Pro.
It always had dates... They release multiple versions and update regularly. Not sure if this is the first 2.5 Flash update, but pretty sure Pro had a few updates as well...
This is also the case with OpenAI and their models. Pretty standard I guess.
They don't change the versioning, because I guess they don't consider it to be "a new model trained from scratch".
2.5 is not the version number, it's the generation of the underlying model architecture. Think of it like the trim level on a Mazda 3 hatchback. Mazda already has the Mazda 3 Sport in their lineup, then later they release the Mazda 3 Turbo which is much faster. When they release this new version of the vehicle it's not called the Mazda 4... that would be an entirely different vehicle based on a new platform and powertrain etc (if it existed). The new vehicle is just a new trim level / visual refresh of the existing Mazda 3.
That's why Google names it like this, but I agree it's dumb. Semver would be easier.
Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflows feel a lot worse in collaboration-style tools, vs a much snappier but slightly less intelligent model.
It's a delicate balance, because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.
I would be surprised if this dichotomy you're painting holds up to scrutiny.
My understanding is Gemini is not far behind on "intelligence", certainly not in a way that leaves obvious doubt over where they will be over the next iteration/model cycles, where I would expect them to at least continue closing the gap. I'd be curious if you have some benchmarks to share that suggest otherwise.
Meanwhile, afaik, something Google has done that other providers aren't doing as much - and perhaps this relates back to your point re "latency/TPS/cost dimensions" - is integrating their model into interesting products beyond chat, at a pace that seems surprising given how much criticism they had been taking for being "slow" to react to the LLM trend.
Besides the Google Workspace surface and Google search, which now seem obvious - there are other interesting places where Gemini will surface - https://jules.google/ for one, to say nothing of their experiments/betas in the creative space - https://labs.google/flow/about
I would have thought putting Gemini on a finance dashboard like this would be inviting all sorts of regulatory (and other) scrutiny... and wouldn't be in keeping with a "slow" incumbent. But given the current climate, it seems Google is plowing ahead just as much as anyone else - with a lot more resources and surface to bring to bear. Imagine Gemini integration on Youtube. At this point it just seems like counting down the days...
I do a lot of scientific and hard coding work. Gemini is a good bit below GPT-5 in those areas, though still quite good. It's also just a bad agent; it lacks autonomy and isn't RL'd to explore well. Gemini's superpower is being really smart while also having by far the best long context reasoning: use it like an oracle with bundles of your entire codebase (or a subtree if it's too big) to guide agents in implementation.
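One simple way to build such a bundle is just to concatenate the subtree into a single file with path headers and paste/attach that as context; the root path, suffixes, and size cap below are arbitrary placeholders:

```python
# Throwaway bundler: concatenate a subtree into one prompt file with path headers.
from pathlib import Path

def bundle(root: str, suffixes=(".py", ".md"), max_chars=2_000_000) -> str:
    parts, total = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if total + len(text) > max_chars:   # stay comfortably inside the context window
            break
        total += len(text)
        parts.append(f"\n===== {path} =====\n{text}")
    return "".join(parts)

Path("bundle.txt").write_text(bundle("src"), encoding="utf-8")  # "src" is a placeholder subtree
```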
Gemini 2.5-Pro was great when it released, but o3 and GPT-5 both eclipsed it for me—the tool use/search improvements open up so many use cases that Gemini fails at.
IMO the race for Latency/TPS/cost is entirely between grok and gemini flash. No model can touch them (especially for image to text related tasks), openai/anthropic seem entirely uninterested in competing for this.
grok-4-fast is a phenomenal agentic model, and gemini flash is great for deep research leaf nodes since it's so cheap, you can segment your context a lot more than you would for pro to ensure it surfaces anything that might be valuable.
> because these Gemini models sometimes feel downright lobotomized compared to claude or gpt-5.
I'm using Gemini (2.5-pro) less and less these days. I used to be really impressed with its deep research capabilities and ability to cite sources reliably.
The last few weeks, it's increasingly argumentative and incapable of recognizing hallucinations around sourcing. I'm tired of arguing with it on basics like RFCs and sources it fabricates, won't validate, and refuses to budge on.
Example prompt I was arguing with it on last night:
> within a github actions workflow, is it possible to get access to the entire secrets map, or enumerate keys in this object?
As recent supply-chain attacks have shown, exfiltrating all the secrets from a GitHub workflow is as simple as `${{ toJSON(secrets) }}`, or `echo ${{ toJSON(secrets) }} | base64` at worst. [1]
Give this prompt a shot! Gemini won't do anything except be obstinately ignorant. With me, it provided a test case workflow, and refused to believe the results. When challenged, expect it to cite unrelated community posts. Chatgpt had no problem with it.
While arguing may not be productive, I have had good results challenging Gemini on hallucinated sources in the past. eg, "You cited RFC 1918, which is a mistake. Can you try carefully to cite a better source here?" which would get it to re-evaluate, maybe by using another tool, admit the mistake, and allow the research to continue.
With this example, several attempts resulted in the same thing: Gemini expressing a strong belief that GitHub has a security capability which it really doesn't have.
If someone is able to get Gemini to give an accurate answer to this with a similar question, I'd be very curious to hear what it is.
Can't agree with that. Gemini doesn't lead just on price/performance - ironically it's the best "normie" model most of the time, despite its lack of popularity with them until very recently.
It's bad at agentic stuff, especially coding. Incomparably so compared to Claude and now GPT-5. But if it's just about asking it random stuff, and especially going on for very long in the same conversation - which non-tech users have a tendency to do - Gemini wins. It's still the best at long context, noticing things said long ago.
Earlier this week I was doing some debugging. For debugging especially I like to run sonnet/gpt5/2.5-pro in parallel with the same prompt/convo. Gemini was the only one that, 4 or so messages in, pointed out something very relevant in the middle of the logs in the very first message. GPT and Sonnet both failed to notice, leading them to give wrong sample code. I would've wasted more time if I hadn't used Gemini.
It's also still the best at a good number of low-resource languages. It doesn't glaze too much (Sonnet, ChatGPT) without being overly stubborn (raw GPT-5 API). It's by far the best at OCR and image recognition, which a lot of average users use quite a bit.
Google's ridiculously bad at marketing and AI UX, but they'll get there. They're already much more than just a "bang for the buck" player.
FWIW I use all 3 above mentioned on a daily basis for a wide variety of tasks, often side-by-side in parallel to compare performance.
My pet theory without any strong foundation is because OpenAI and Anthropic have trained their models really hard to fit the sycophantic mold of:
===============================
Got it — *compliment on the info you've shared*, *informal summary of task*. *Another compliment*, but *downside of question*.
----------
(relevant emoji) Bla bla bla
1. Aspect 1
2. Aspect 2
----------
*Actual answer*
-----------
(checkmark emoji) *Reassuring you about its answer because:*
* Summary point 1
* Summary point 2
* Summary point 3
Would you like me to *verb* a ready-made *noun* that will *something that's helpful to you 40% of the time*?
===============================
I suspect this has emerged organically from the user given RLHF via thumb voting in the apps. People LIKE being treated this way so the model converges in that direction.
Same as social media converging to rage bait. The user base LIKES it subconsciously. Nobody at the companies explicitly added that to content recommendation model training. I know, for the latter, as I was there.
Not the case with GPT-5 I’d say. Sonnet 4 feels a lot like this, but its coding and agency are still quite solid and overall IMO it's the best coder. Gemini 2.5 to me is most helpful as a research assistant. It’s quite good together with Google-search-based grounding.
Gemini does the sycophantic thing too, so I'm not sure that holds water. I keep having to remind it to stop with the praise whenever my previous instruction slips out of context window.
Oh god I _hate_ this. Does anyone have any custom instructions to shut this thing off. The only thing that worked for me is to ask the model to be terse. But that causes the main answer part to be terse too, which sucks sometimes.
Not really. Any prefix before the content you want is basically "thinking time". The text itself doesn't even have to reflect it, it happens internally. Even if you don't go for the thinking model explicitly, that task summary and other details can actually improve the quality, not reduce it.
Google also has a lot of very useful structured data from search that they’re surely going to figure out how to use at some point. Gemini is useless at finding hotels, but it says it’s using Google’s Hotel data, and I’m sure at some point it’ll get good at using it. Same with flights too. If a lot of LLM usage is going to be better search, then all the structured data Google have for search should surely be a useful advantage.
I recently started using Open WebUI, which lets you run your query on multiple models simultaneously. My anecdote: For non-coding tasks, Gemini 2.5 Pro beats Sonnet 4 handily. It's a lot more common to get wrong/hallucinated content from Sonnet 4 than Gemini.
Okay this is a nitpick but why wouldn't you increment a part of the version number to signify that there is an improvement? These releases are confusing.
Sure and that is why you can call it 2.5.<whatever>
They just don't want to be pinned down because the shifting sands are useful for the time when the LLM starts to get injected with ads or paid influence.
I wish they would actually explain it like that somewhere. Or publish the internal version numbers they must certainly be using to ensure a proper development process.
Anthropic kind of did the same thing [1] except it back-fired recently with the cries of "nerfing".
We buy these tokens, which are already very hard to buy in the limited tiers; they expire after only a year, and we don't even know how often the responses are changing in the background. Even a 1% improvement or reduction I would want disclosed.
Really scary foundation AI companies are building on IMO. Transparency and access is important.
I would assume that it will supersede the model that they currently have. So eventually 2.5 flash will be the new and improved 2.5 Flash rather than 2.6.
Same way that OpenAI updated their 4o models and the like, which didn't turn out so well when it started glazing everyone and they had to revert it (maybe that was just chat and not the API).
Even if it was just chat and/or API: I have used the API, and I know they at minimum record the retraining date and time, which they could just affix to Gemini 2.5 Flash and Flash-Lite. When I use the API I have to verify that an upgrade of the backend system didn't break anything, and pinning versions is, I assume, pretty common.
Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost).
Gemini 2.5 Flash-Lite improvements include better instruction following, reduced verbosity, stronger multimodal & translation capabilities. Gemini 2.5 Flash improvements include better agentic tool use and more token-efficient reasoning.
Model strings: gemini-2.5-flash-lite-preview-09-2025 and gemini-2.5-flash-preview-09-2025
Any idea what "output token efficiency" refers to?
Gemini Flash is billed by number of input/output tokens, which I assume is fixed for the same output, so I'm struggling to understand how it could result in lower cost. Unless of course they have changed tokenization in the new version?
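My read (and this is an assumption) is that "output token efficiency" means the model spends fewer output/thinking tokens to produce the same answer, so the bill shrinks even though the per-token price is unchanged. Back-of-the-envelope, with made-up token counts:

```python
# Made-up numbers; only the $/1M output price is the published list price for Gemini 2.5 Flash.
price_per_m_output = 2.50                      # $ per 1M output tokens
old_tokens = 2_000                             # output + thinking tokens per answer (hypothetical)
new_tokens = old_tokens * (1 - 0.24)           # "24% better output token efficiency"

print(old_tokens / 1e6 * price_per_m_output)   # ~$0.0050 per answer
print(new_tokens / 1e6 * price_per_m_output)   # ~$0.0038 per answer, same per-token price
```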
2.5 Flash is the first time I've felt AI has become truly useful to me. I was the #1 AI hater but now find myself going to the Gemini app instead of Google search. It's just better in every way, and no ads. The info it provides is almost always right, and it feels like I have the whole generalized and accurate knowledge of the internet at my fingertips in the app. It's more intimate, fewer distractions. Just me and the Gemini app alone talking about kale's ideal germination temperature, instead of a bunch of mommy bloggers, bots, and SEO spam.
Now how long can Google keep this going and cannibalizing how they make money is another question...
It's also excellent for subjective NLP-type analysis. For example, I use it for "scouting" chapters in my translation pipeline to compile coherent glossaries that I can feed into prompts for per-chapter translation.
This involves having it identify all potential keywords and distinct entities, determine their approximate gender (important for languages with ambiguous gender pronouns), and then perform a line-by-line analysis of each chapter. For each line, it identifies the speaking entity, determines whose POV the line represents, and identifies the subject entity. While I didn't need or expect perfection, Gemini Flash 2.5 was the only model I tested that could not only follow all these instructions, but follow them well. The cheap price was a bonus.
I was thoroughly impressed; it's now my go-to for any JSON-formatted analysis reports.
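For anyone curious, here's a minimal sketch of that kind of per-line pass with a forced JSON schema via the google-genai SDK - the field names, prompt wording, and input file are illustrative rather than the pipeline described above, and the exact schema plumbing may differ by SDK version:

```python
# Illustrative only: per-line "scouting" with structured JSON output.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "line": {"type": "integer"},
            "speaker": {"type": "string"},
            "pov": {"type": "string"},
            "subject": {"type": "string"},
        },
        "required": ["line", "speaker", "pov", "subject"],
    },
}

chapter = open("chapter_012.txt", encoding="utf-8").read()  # hypothetical input file

prompt = (
    "For each numbered line below, identify the speaking entity, whose POV the line "
    "represents, and the subject entity. Return JSON only.\n\n" + chapter
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=schema,   # plain-dict schema; exact plumbing may vary by SDK version
    ),
)
print(response.text)  # JSON array, one object per line
```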
Am I using a different Gemini from everyone else? We have Google Workspace at my job, so Gemini is baked in.
It is HORRENDOUS when compared to other models.
I hear a bunch of other people talking about how great Gemini is, but I've never seen it.
The responses are usually either incorrect, way too long (essays when I wanted summaries), or just... not... good. I will ask the exact same question to both Gemini and ChatGPT (free) and GPT will give a great answer while the Gemini answer is trash.
I've been finding it leaps and bounds above other models, but I'm only using it via AI Studio. I haven't tried any IDE integration or similar, so I can't speak to that. I do still have to tell it to stop with the effusive praise (I guess that also helps reduce the context window).
I use Gemini almost exclusively for coding and 2.5 Pro is extremely good at it. It has revised hundreds of lines of academic code for me at a time and the results run correctly with only minor revision.
I will also say whatever they use for the AI search summary is good enough for me like 50% of the time I google something, but those are generally the simpler 50% of queries.
Gemini 2.5 Flash has been the LLM I've used the most recently for a variety of domains, especially image inputs and structured outputs which beat both OpenAI and Anthropic in my opinion.
My one big problem with OpenRouter is that, as far as I can tell, they don't provide any indication of how many companies are using each model.
For all I know there are a couple of enormous whales on there who, should they decide to switch from one model to another, will instantly impact those overall ratings.
I'd love to have a bit more transparency about volume so I can tell if that's what is happening or not.
Right, that chart shows App usage based on the user-agent header but doesn't tell you if there is a single individual user of an app that skews the results.
API usage of Flash 2.0 is free, at least till you hit a very generous bound. It's not simply a trial period. You don't even need to register any payment details to get an API key. This might be a reason for its popularity. AFAIK only some Mistral offerings have a similar free tier?
Yeah, that's my use case. When you want to test some program / script that utilizes an llm in the middle and you just want to make sure everything non-llm related is working. It's free! just try again and again till it "compiles" and then switch to 2.5
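That pattern is easy to make explicit: keep the model name out of the code, smoke-test the surrounding plumbing on the free-tier model, then flip an env var when you're done. The model names and env var below are just examples:

```python
# Develop against the free-tier model, switch to the paid one once everything else works.
import os
from google import genai

MODEL = os.environ.get("LLM_MODEL", "gemini-2.0-flash")  # free-tier default while debugging
client = genai.Client()  # assumes GEMINI_API_KEY is set

def summarize(text: str) -> str:
    # The rest of the pipeline (parsing, retries, storage, ...) doesn't care which model this is.
    return client.models.generate_content(model=MODEL, contents="Summarize:\n" + text).text

if __name__ == "__main__":
    print(summarize("smoke test input"))
    # Later: LLM_MODEL=gemini-2.5-flash python pipeline.py
```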
Yep Kilo (and Cline/Roo more recently) push these free trial of the week models really hard, partially as incentive to register an account with their cloud offering. I began using Cline and Roo before "cloud" features were even a thing and still haven't bothered to register, but I do play with the free Kilo models when I see them since I'm already signed in (they got me with some kind of register and spend $5 to get $X model credits deal) and hey, it's free (I really don't care about my random personal projects being used for training).
If xAI in particular is in the mood to light cash on fire promoting their new model, you'll see it everywhere during the promo period, so not surprised that heavily boosts xAI stats. The mystery codename models of the week are a bit easier to miss.
It's pretty good and fast af. At backend stuff it's ~gpt5-mini in capabilities, writes OK code, and works well with agentic extensions like Roo/Kilo. My colleagues said it handles frontend creation so-so, but it's so fast that you can "roll" a couple of tries and choose the one you want.
Yeah, the speed and price are why I use it. I find that any LLM is garbage at writing code unless it gets constant high-entropy feedback (e.g. an MCP tool reporting lint errors, a test, etc.) and the quality of the final code depends a lot more on how well the LLM was guided than the quality of the model.
A bad model with good automated tooling and prompts will beat a good model without them, and if your goal is to build good tooling and prompts you need a tighter iteration loop.
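As a toy example of that kind of loop (the model call, commands, and file names are all placeholders; any provider and any checker slot in the same way):

```python
# Toy feedback loop: generate code, run the checks, feed failures back, repeat.
import subprocess
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set
MODEL = "gemini-2.5-flash"

def run_checks() -> str:
    """Return combined test output; empty string means everything passed."""
    result = subprocess.run(["python", "-m", "pytest", "-q"], capture_output=True, text=True)
    return "" if result.returncode == 0 else result.stdout + result.stderr

task = "Write fizzbuzz in solution.py so that the existing tests pass. Reply with code only."
feedback = ""
for attempt in range(5):  # keep the loop tight and bounded
    prompt = task if not feedback else f"{task}\n\nThe checks failed with:\n{feedback}\nFix the code."
    code = client.models.generate_content(model=MODEL, contents=prompt).text
    open("solution.py", "w", encoding="utf-8").write(code)  # naive: assumes the reply is bare code
    feedback = run_checks()
    if not feedback:
        break
```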
This is so far off my experience. Grok 4 Fast is straight trash; it doesn't even come close to decent code for what I tried. Meanwhile Sonnet is miles better - but even still, while Opus is I guess technically only slightly better, in practice it's so much better that I find it hard to use Sonnet at all.
I mean, I can kinda roll through a lot of iterations with this model without worrying about any AI limits.
Y'know, with all these latest models, the lines are kinda blurry actually. The definition of "good" is getting foggy.
So it might as well be free as the definition of money is clear as crystal.
I also used it for some time to test on something really really niche like building telegram bot in cloudflare workers and grok-4-fast was kinda decent on that for the most part actually. So that's nice.
The switch by Artificial Analysis from per-token-cost to per-benchmark-cost shows some effect!
It's nice that labs are now trying to optimize what I actually have to pay to get an answer - it always annoys me to have to pay for all the senseless rambling of the less-capable reasoning models.
I would really like to see the 270M model, but one which also knows phonetic alphabet pronunciation in sentences. Perhaps IPA?
I would like to try a small computer->human "upload" experiment; basic multilingual understanding without pronunciation knowledge would be very sad.
I intend to make a sort of computer reflexive game. I want to compare different upload strategies (with/without analog or classic error correcting codes, empirical spaced repetition constants, an ML predictor of which parameters I'm forgetting / losing resolution on).
I'm not even sure how to evaluate what a "better" LLM is, when I've tried running the exact same model (Qwen3) and prompt and gotten vastly different responses on Qwen Chat vs OpenRouter vs running the model locally.
There are several reasons responses from the same model might vary:
- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition
- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)
- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration
Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and no additions to the benchmark prompt except necessary formatting and stuff like end-of-turn tokens. They also usually are multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
Of course a benchmark still can't tell you everything - real-world performance can be very different.
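For example, through an OpenAI-compatible endpoint like OpenRouter you can at least pin down the sampling side of this (the model slug here is illustrative; quantization differences between hosts and GPU nondeterminism can still cause some drift):

```python
# Greedy decoding removes the sampling variance, but not host-level differences.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key

resp = client.chat.completions.create(
    model="qwen/qwen3-32b",   # illustrative slug; check the exact name on OpenRouter
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    temperature=0,            # always take the most likely next token
    max_tokens=64,
)
print(resp.choices[0].message.content)
```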
I can't speak to qwen, but something interesting with Deepseek is that the official API supports almost no parameters, while the vllm hosts on openrouter do. The experience you get with the rehosters is wildly different since you can use samplers.
It's weird that they just keep the version number. Why not release it as 2.6 or something else? Now it is confusing: do my existing workflows automatically use the updated version, and if yes, do I need to monitor them for unwanted changed behavior, etc.?
I’ve been tinkering with the last version for code gen. This update might finally put it on par with Claude for latency. Anyone tried benchmarking the new preview yet?
Yeah, why is it that working with AI makes people completely forget what version numbers mean?
gemini-2.5-flash-preview-09-2025 - what are they thinking?
I thought about joking that they had AI name it for them, but when I asked Gemini, it said that this name was confusing, redundant, and leads to unnecessarily high cognitive load.
Maybe Googlers should learn from their own models.
Threw a few short Python scripts at 2.5. Got stupid messages like "OMG Significant Flaw!!1 all of your functions have a non-obvious dependency on this global variable declared in main, nothing will work if you don't execute main first!!1" I mean sure, technically correct, the best kind of LLM correct.
It kept finding those fatal flaws and starting to explain them, only to slowly finish with "oh yes, this works as intended".
The most annoying thing about Gemini is that it can't stop suggesting youtube videos. Even when you ask it to stop doing that, multiple times in the same conversation, it will just keep doing it.
I think Gemini 3 Pro might be next month, but I am not sure.
can I get the sources of your rumour please? (Yes I know that I can search it but I would honestly prefer it if you could share it, thanks in advance!)
> Today, we are releasing updated versions of Gemini 2.5 Flash and 2.5 Flash-Lite, available on Google AI Studio and Vertex AI, aimed at continuing to deliver better quality while also improving the efficiency.
Typo in the first sentence? "... improving the efficiency." Gemini 2.5 Pro says this is perfectly good phrasing, whereas ChatGPT and Claude recognize that it's awkward or just incorrect. Hmm...
ChatGPT and Claude are mistaken if they think it is incorrect. The parallelism in verb tenses is between "continuing to deliver" and "improving the efficiency". It's a bit wordy, but definitely not wrong.
Usually you would say "improving the efficiency of x and y". In this case at the end of the sentence it should be "improving the models' efficiency" or just "improving efficiency". I don't think it's "wrong" and it's obviously clear what they mean, but I agree that the phrasing is a little awkward.
This is pedantic. It's perfectly fine usage in non-formal English speaking. What's more - who gives a shit? By your own standards, you're inserting a quote in the middle of your comment in an arguably similarly "awkward" way.
This is not my experience. In my experience Gemini 2.5 Pro is the best model in every use-case I tried. There are a few very hard (graduate level) logic or math problems that Claude 4.1 Opus edged-out over Gemini 2.5 Pro, but in general if you have no idea which model will perform best on a difficult question, imho Gemini 2.5 Pro is a safer bet especially since it's significantly cheaper. Gemini 2.5 Flash is really good but imho not nearly as good as Pro in (1) research math (2) creative/artistic writing (3) open ended programming debugging.
On the other hand, I do prefer using Claude 4 Sonnet on very open-ended agentic programming tasks because it seems to have a better integration with VSCode Copilot. Gemini 2.5 Pro bugs out much more often where Claude works fine almost every time.
Yeah, that's how I feel too. Flash is less verbose, and every LLM nowadays seems to be designed by some low-taste people who reward the model for falsely hedging (e.g. "The 2024 Corolla Cross usually has an X gallon gas tank") on stuff that isn't at all variable or questionable. This false hedging is way more of an issue than hallucinations in my experience, and the "smarter" 2.5 Pro is not any better at avoiding it than Flash.
Also 2.5 Pro is often incapable of searching and will hallucinate instead. I don't know why. It will claim it searched and then return some made up results instead. 2.5 Flash is much more consistently capable of searching
2.5 isn't the version number, it's the model generation. It would only be updated when the underlying model architecture, training, etc. are updated. This release is, as the name implies, the same model but likely with hardware optimizations, system prompt, and fine-tuning tweaks applied.
I am seeing a lot of demand for something like a semver for AI models.
Could there theoretically be something like a semver that can be autogenerated from that defined and regular version scheme you shared?
Like, honestly, my idea is that I could use something like OpenRouter and just change the semver, without having to worry about so many details of the scheme you shared, y'know?
A website / tool which could create a semver from this defined scheme and vice versa would be really cool actually :>
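As a toy version of that idea - the mapping rule here (generation -> major.minor, release date -> build metadata) is made up for illustration, not anything Google documents:

```python
# Toy mapping from the current naming scheme to something semver-ish.
import re

def to_semverish(model_id: str) -> str:
    # e.g. "gemini-2.5-flash-preview-09-2025" -> "flash 2.5.0-preview+2025.09"
    m = re.match(r"gemini-(\d+)\.(\d+)-([a-z-]+?)(?:-preview)?-(\d{2})-(\d{4})$", model_id)
    if not m:
        raise ValueError(f"unrecognized model id: {model_id}")
    major, minor, variant, month, year = m.groups()
    pre = "-preview" if "-preview-" in model_id else ""
    return f"{variant} {major}.{minor}.0{pre}+{year}.{month}"

print(to_semverish("gemini-2.5-flash-preview-09-2025"))
print(to_semverish("gemini-2.5-flash-lite-preview-09-2025"))
```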
Why do all of these model providers have such issues naming/versioning them? Why even use a version number (2.5) if you aren't going to change it when you update the model?
This industry desperately needs a Steve Jobs to bring some sanity to the marketing.
https://github.com/googleapis/js-genai/issues/707
https://discuss.ai.google.dev/t/gemini-2-5-pro-incomplete-re...
I’ve seen that behavior when LLMs of any make or model aren’t given enough time or allowed enough tokens.
It’s so annoying that you have this super capable model but you interact with it using an app that is complete ass
1. Using the "Projects" thing (Folder organization) makes my browser tab (on Firefox) become unusably slow after a while. I'm basically forced to use the default chats organization, even though I would like to organize my chats in folders.
2. After editing a message that you already sent,you get to select between the different branches of the chat (1/2, and so on), which is cool, but when ChatGPT fails to generate a response in this "branched conversation" context, it will continue failing forever. When your conversation is a single thread and a ChatGPT message fails with an error, re trying usually works and the chat continues normally.
On mobile (android) opening the keyboard scrolls the chat to the bottom! I sometimes want to type referring something from the middle of the LLMs last answer.
(Disclosure: I'm the founder of Synthetic.new, a company that runs open-source LLMs for monthly subscriptions.)
Pelicans: https://github.com/simonw/llm-gemini/issues/104#issuecomment...
For example, the latest Gemini 2.5 Flash is known as "google/gemini-2.5-flash-preview-09-2025" [1].
[1]: https://openrouter.ai/google/gemini-2.5-flash-preview-09-202...
semantic versioning works for most scenarios.
Another I noticed today: https://www.google.com/finance/beta
[1] https://github.com/orgs/community/discussions/174045 https://github.com/orgs/community/discussions/47165
People have said it destroys the intelligence mid convo
[1] https://status.claude.com/incidents/h26lykctfnsz
Which is a good thing in my book as the models now are way too verbose (and I suspect one of the reasons is the billing by tokens).
The first chart implies the gains are minimal for nonthinking models.
Disclaimer: I recently joined this team. But I like the product!
Am I missing something?
ChatGPT is better at:
A) Interpreting what I'm asking it for without me needing to provide additional explicit context.
B) Formatting answers in a way that is easily digestible.
Something that distinguishes between a completely new pre-training process/architecture, and standard RLHF cycles/optimizations.
From OpenRouter last week:
* xAI: Grok Code Fast 1: 1.15T
* Anthropic: Claude Sonnet 4: 586B
* Google: Gemini 2.5 Flash: 325B
* Sonoma Sky Alpha: 227B
* Google: Gemini 2.0 Flash: 187B
* DeepSeek: DeepSeek V3.1 (free): 180B
* xAI: Grok 4 Fast (free): 158B
* OpenAI: GPT-4.1 Mini: 157B
* DeepSeek: DeepSeek V3 0324: 142B
A "weekly active API Keys" faceted by models/app would be a useful data point to measure real-world popularity though.
It might not be OK for that kind of usecase, or might breach ToS.
But it's still great. Even my premium Perplexity account doesn't give me free API access.
People are lazy at pointing to the latest name.
Both apps have offered usage for free for a limited time:
https://blog.kilocode.ai/p/grok-code-fast-get-this-frontier-...
https://cline.bot/blog/grok-code-fast
Also cheap enough to not really matter.
I would rather use a model that is good than a model that is free, but different people have different priorities.
Though I imagine this should be a smaller effect than different quantization levels say.
[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
Gemini 2.5 Flash Preview: $0.30 input / $2.50 output (per 1M tokens)
Grok 4 Fast: $0.20 input / $0.50 output (per 1M tokens)
Here's a summary of this discussion with the new version: https://extraakt.com/extraakts/the-great-llm-versioning-deba...
To be honest, I hadn't heard that elsewhere, but I haven't been following it massively this week.
I AM LAUGHING SO HARD RIGHT NOWWWWW
LMAOOOO
I wish to upvote this twice lol
“deliver better quality while also improving the efficiency.”
Reads fine to me. An editor would likely drop “the”.
Flash is super fast, gets straight to the point.
Pro takes ages to even respond, then starts yapping endlessly, usually confuses itself in the process and ends up with a wrong answer.
Anthropic learned this lesson. Google, Deepseek, Kimi, OpenAI and others keep repeating it. This feels like Gemini_2.5_final_FINAL_FINAL_v2.
I actually even agree that the progress is plateauing, but your comment is a non-sequitur.