Autoresearch on an old research idea

(ykumar.me)

117 points | by ykumards 1 hour ago

11 comments

the_arun
1 hour ago
Try this if the main link is not responsive - https://archive.is/6xLiU
_pdp_
40 minutes ago
Take some working code. Ask an LLM to fix bugs. Measure performance and test coverage. Feed the results back into the LLM. Repeat.
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
- cyanydeez
  33 minutes ago
  Can we modify this approach to get LLMs that are good at specific programming languages or frameworks? That seems to be where local LLMs could really shine.
carlsborg
1 hour ago
> “The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”
Good lens.
The crux of the autoresearch repo is basically one file, program.md, which is a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record result. Favor simplicity”. The other files are an arbitrary ML model that is being trained.
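The loop that program.md is summarized as describing can be sketched in a few lines. This is a toy illustration, not the repo's code: `evaluate` stands in for a training-plus-eval run, `propose` stands in for the agent editing train.py, and both names, the parameters, and the objective are assumptions made up for the example.

```python
import random


def evaluate(params):
    """Toy stand-in for 'run the training, run evals'. Lower is better."""
    return (params["lr"] - 0.1) ** 2 + (params["width"] - 64) ** 2 / 1e4


def propose(params, rng):
    """Hypothetical stand-in for the agent's edit: rescale one setting."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] *= rng.uniform(0.5, 1.5)
    return candidate


def autoresearch_loop(start, steps=200, seed=0):
    """Propose a change, evaluate it, record the result, keep it only if better."""
    rng = random.Random(seed)
    best, best_score = dict(start), evaluate(start)
    log = [(0, best_score, True)]  # (step, score, kept)
    for step in range(1, steps + 1):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        kept = score < best_score
        if kept:
            best, best_score = candidate, score
        log.append((step, score, kept))
    return best, best_score, log


best, best_score, log = autoresearch_loop({"lr": 0.5, "width": 16.0})
print(best, best_score, sum(kept for _, _, kept in log))
```

A real agent would replace `propose` with reasoning over the log and the code itself, which is where it can differ from plain random search.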
datsci_est_2015
1 hour ago
I often use LLMs to explore prior art and maybe find some alternative ways of thinking of problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.
I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). The recommendations often include poorly maintained or niche libraries that have a lot written about them but that, I can only imagine, see very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can occupy those consultants and let us do our work in peace.
- andy12_
  48 minutes ago
  I think the main value lies in allowing the agent to try many things while you aren't working (when you are sleeping or doing other activities), so even if many tests are not useful, with many trials it can find something nice without any effort on your part.
  This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
- Eufrat
  1 hour ago
  I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember or things where even being flat out wrong is okay and you just do it yourself.
  For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md, etc.: I think this shows that the LLM cannot do what AI boosters are selling it as, and that it needs extreme amounts of guidance to reach a desired goal, if it even can. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling and I don’t think the actual use cases have a sustainable business model.
  People who spend the energy to tailor the LLMs to their specific workflows and get it to be successful, amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
  - foobarian
    51 minutes ago
    > I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember
    I found LLMs make a fabulous frontend for git :-D
- MattGaiser
  1 hour ago
  > agent try everything that the LLM chatbot had recommended ($$$)
  A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
jpcompartir
58 minutes ago
There are better techniques for hyper-parameter optimisation, right? I fear I have missed something important: why has Autoresearch blown up so much?
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can Autoresearch help improve large-scale datasets? Is it more compute-efficient than humans?
- nextos
  50 minutes ago
  AFAIK, it's a bit more than hyper-parameter tuning as it can also make non-parametric (structural) changes.
  Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
  - coppsilgold
    39 minutes ago
    Perhaps LLM-guided Superoptimization: <https://en.wikipedia.org/wiki/Superoptimization>
    I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
    I wonder if the next step in "autoX" is to have an LLM generate dozens of candidates on a cluster and then get an LLM to figure out how to "mate" the two best ones or something. Trying to do this with regular evolutionary/genetic algorithms has always been challenging because how do you represent the gene to phenotype mapping? Let an LLM sort it out working just with the phenotypes - Lamarckian inheritance.
  - gwerbin
    36 minutes ago
    It's an LLM-powered evolutionary algorithm.
    - ainch
      26 minutes ago
      I'd like to see a system like this take more inspiration from the ES literature, similar to AlphaEvolve. Let's see an archive of solutions, novelty scoring and some crossover rather than purely mutating the same file in a linear fashion.
- hun3
  53 minutes ago
  > There are better techniques for hyper-parameter optimisation, right?
  There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
lucasay
28 minutes ago
This feels less like automated research and more like structured trial and error with a decent feedback loop. Still useful, but I think the real bottleneck is how good your eval metric is. If that’s weak, the whole loop just optimizes for the wrong thing faster.
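A toy sketch of that failure mode, under made-up assumptions: `true_quality` is what we actually care about (it peaks at x = 1), `proxy_metric` is a weak eval that simply rewards larger x, and `hill_climb` is a hypothetical stand-in for the feedback loop. None of these names come from the article.

```python
def true_quality(x):
    """What we actually care about: best at x = 1, worse on either side."""
    return -(x - 1.0) ** 2


def proxy_metric(x):
    """A weak eval: agrees with true_quality below x = 1, then keeps rewarding growth."""
    return x


def hill_climb(metric, x, step=0.5, iters=10):
    """Greedy loop: move to a neighbour whenever the metric says it is better."""
    for _ in range(iters):
        for candidate in (x + step, x - step):
            if metric(candidate) > metric(x):
                x = candidate
                break
    return x


x_proxy = hill_climb(proxy_metric, 0.0)  # optimizes the weak eval
x_true = hill_climb(true_quality, 0.0)   # optimizes the real objective

print(x_proxy, true_quality(x_proxy))
print(x_true, true_quality(x_true))
```

The same loop lands on the optimum when it is scored on the real objective and walks straight past it when scored on the proxy, ending up worse than where it started.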
love2read
1 hour ago
So... It did work. It found bugs (that he didn't know about) and it did optimization (that he hadn't done).
dvt
1 hour ago
Ok, so looking at the commit log[1], I was mostly interested in seeing what the "moonshot ideas" implementations looked like, but basically everything is just hyperparameter tuning. Which is nice, but likely not worth the $$$ spent on the tokens. Am I missing something here?
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
- DoctorOetker
  48 minutes ago
  It would seem wise to modify the autoresearch instructions to first estimate the computational costs rigorously, then sort and compare the proposals for human review, and, for each attempt actually executed, feed back the computational costs with a LoRA adapter?
  I.e., perhaps minimal changes to autoresearch could steer it toward cost-effective research.
lamroger
1 hour ago
Awesome breakdown! It really feels like a hyper-hyper parameter search + bug fixer.
I started looking at Kaggle again and autoresearch seems to converge to many of the solution vibes there.
Wild ensembles, squeezing a bit of loss out. More engineering than research IMO
- sdenton4
  1 hour ago
  For raw hyperparameter search, though, I would expect a proper Bayesian framework to be much better. E.g., Vizier.
  - ainch
    24 minutes ago
    I think it depends whether you can leverage some knowledge. It's possible for a person/LLM to look at a loss curve and say "oh that's undertraining, let's bump the lr" - whereas a Bayesian method doesn't necessarily have deeper understanding, so it'll waste a lot of time exploring the search space on poor options.
    If you're resource unconstrained then BO should ofc do very well though.
BrokenCogs
1 hour ago
Does autoresearch work for projects that are not LLM-based? E.g., in Karpathy's example he is optimizing nanoGPT. What if I wanted to improve a U-Net for image segmentation?
- simonw
  1 hour ago
  Tobi from Shopify used a variant of autoresearch to optimize the Liquid template engine, and found a 53% speedup after ~120 experiments: https://github.com/Shopify/liquid/pull/2056
  I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/
  - Denzel
    43 minutes ago
    How much did this cost? Has there ever been an engineering focus on performance for liquid?
    It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.
    - simonw
      34 minutes ago
      He used Pi as the harness but didn't say which underlying model. My stab-in-the-air guess would be no more than a few hundred dollars in token spend (for 120 experiments run over a few days, assuming Claude Opus 4.6 was used without the benefits of the Claude Max plan).
      So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!
- sdenton4
  1 hour ago
  The gist of these things is you point them at an eval metric and say 'make it go better.' So you can point it at anything you can measure. The example in the blog post here is bounding boxes on woodcut images.
- bethekind
  53 minutes ago
  I used it to speed up a codecompass-like repo from 86 files per second to 2000. Still haven't used the repo in production, so maybe it secretly broke things, but the ability to say "optimize this benchmark and commit only if you pass these tests" is nice.