A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.
Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.
As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as they go. This will make the CI bottleneck even worse.
It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.
Mercury is crazy fast and in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.
I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs such as a missing synchronization or fence causing flakiness; and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M loc codebase in no more than five minutes so that you get quick feedback on your changes.
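For smaller shops without that internal tooling, a rough single-machine version of the same trick is easy to sketch (the test command and worker count below are hypothetical; the point is just to brute-force an estimate of the flake rate):

    # rerun a suspected-flaky test many times in parallel and count failures
    import subprocess
    from concurrent.futures import ProcessPoolExecutor, as_completed

    TEST_CMD = ["./run_one_test.sh"]   # hypothetical wrapper around the flaky test
    RUNS = 1000

    def run_once(_: int) -> bool:
        # returns True if the test passed on this attempt
        result = subprocess.run(TEST_CMD, capture_output=True)
        return result.returncode == 0

    if __name__ == "__main__":
        failures = 0
        with ProcessPoolExecutor(max_workers=32) as pool:
            futures = [pool.submit(run_once, i) for i in range(RUNS)]
            for f in as_completed(futures):
                if not f.result():
                    failures += 1
        print(f"{failures}/{RUNS} runs failed")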
> make builds fully hermetic (so no inter-run caching)
These are orthogonal. You want maximally deterministic CI steps so that you can make builds fully hermetic and cache every single thing.
I was also at Google for years. Places like that are not even close to representative. They can afford to just-throw-more-resources, they get bulk discounts on hardware and they pay top dollar for engineers.
In more common scenarios that represent 95% of the software industry, CI budgets are fixed, clusters are sized to be busy most of the time, and you cannot simply launch 10,000 copies of the same test on 10,000 machines. And even so, these CI clusters can easily burn through the equivalent of several SWE salaries.
> These are orthogonal. You want maximally deterministic CI steps so that you can make builds fully hermetic and cache every single thing.
Again, that's how companies like Google do it. In normal companies, build caching isn't always perfectly reliable, and if CI runs suffer flakes due to caching then eventually some engineer is gonna get mad and convince someone else to turn the caching off. Blaze goes to extreme lengths to ensure this doesn't happen, and Google spends extreme sums of money on helping it do that (e.g. porting third party libraries to use Blaze instead of their own build system).
In companies without money printing machines, they sacrifice caching to get determinism and everything ends up slow.
I’m at Google today and even with all the resources, I am absolutely most bottlenecked by the Presubmit TAP and human review latency. Making CLs in the editor takes me a few hours. Getting them in the system takes days and sometimes weeks.
Indeed. You'd think Google would test for how well people will cope with boredom, rather than their bait-and-switch interviews that make it seem like you'll be solving l33tcode every evening.
Yes and no, I'd estimate 1/3 to 1/2 of that is down to test suites being flaky and time-consuming to run. IIRC the shortest build I had was 52m for the Android Wear iOS app, and easily 3 hours for Android.
Most of my experience writing concurrent/parallel code in (mainly) Java has been rewriting half-baked stuff that would need a lot of testing with straightforward, reliable, and reasonably performant code that uses sound and easy-to-use primitives such as Executors (watch out for teardown though), database transactions, atomic database operations, etc. Drink the Kool-Aid and mess around with synchronized or actors or Streams or something and you're looking at a world of hurt.
I've written a limited number of systems that needed tests that probe for race conditions by doing something like having 3000 threads run a random workload for 40 seconds. I'm proud of that "SuperHammer" test on a certain level but boy did I hate having to run it with every build.
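For what it's worth, a scaled-down sketch of that kind of hammer test (the Account class, thread count, and duration are hypothetical stand-ins; the real thing was Java with far more threads and a longer run): many threads apply random operations for a fixed time, then we check an invariant that only survives if the code under test is race-free.

    import random
    import threading
    import time

    class Account:
        """Stand-in for the concurrent code under test."""
        def __init__(self) -> None:
            self._lock = threading.Lock()
            self.balance = 0

        def apply(self, delta: int) -> None:
            with self._lock:        # drop the lock and the assertion below should start failing
                self.balance += delta

    def test_hammer(threads: int = 100, seconds: float = 2.0) -> None:
        account = Account()
        per_thread_totals = []      # list.append is atomic under the GIL
        deadline = time.monotonic() + seconds

        def worker() -> None:
            total = 0
            while time.monotonic() < deadline:
                delta = random.randint(-100, 100)
                account.apply(delta)
                total += delta
            per_thread_totals.append(total)

        workers = [threading.Thread(target=worker) for _ in range(threads)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        assert account.balance == sum(per_thread_totals)

    if __name__ == "__main__":
        test_hammer()
        print("ok")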
Developer time is more expensive than machine time, but at most companies it isn't 10000x more expensive. Google is likely an exception because it pays extremely well and has access to very cheap machines.
Even then, there are other factors:
* You might need commercial licenses. It may be very cheap to run open source code 10000x, but guess how much 10000 Questa licenses cost.
* Moore's law is dead; Amdahl's law very much isn't. Not everything is embarrassingly parallel.
* Some people care about the environment. I worked at a company that spent 200 CPU hours on every single PR (even to fix typos; I failed to convince them they were insane for not using Bazel or similar). That's a not insignificant amount of CO2.
Yes, but the OP specifically is talking about CI for large numbers of pull requests, which should be very parallelizable (I can imagine exceptions, but only with anti-patterns, e.g. if your test pipeline makes some kind of requests to something that itself isn't scalable).
Actually, OP was talking about both the throughput of handling a large number of pull requests and the latency of testing a single pull request. The latter is not necessarily parallelizable.
That's solvable with modern cloud offerings - Provision spot instances for a few minutes and shut them down afterwards. Let the cloud provider deal with demand balancing.
I think the real issue is that developers waiting for PRs to go green are taking a coffee break between tasks, not sitting idly getting annoyed. If that's the case you're cutting into rest time and won't get much value out of optimizing this.
There are non-IP reasons to go outside the big clouds for CI. Most places I worked over the years had dedicated hardware for at least some CI jobs because otherwise it's too hard to get repeatable performance numbers. At some point you have an outage in production caused by a new build passing tests but having much lower performance, or performance is a feature of the software being sold, and so people decide they need to track perf with repeatable load tests.
Sorta. For CI/CD you can use spot instances and spin them down outside of business hours, so they can end up being cheaper than buying many really beefy machines and amortizing them over the standard depreciation schedule.
Azure for example has “confidential compute” that encrypts even the memory contents of the VM such that even their own engineers can’t access the contents.
As long as you don’t back up the disks and use HTTPS for pulls, I don’t see a realistic business risk.
If a cloud like Azure or AWS got caught stealing competitor code they’d be sued and immediately lose a huge chunk of their customers.
It makes zero business sense to do so.
PS: Microsoft employees have made public comments saying that they refuse to even look at some open source repository to avoid any risk of accidentally “contaminating” their own code with something that has an incompatible license.
The way Azure implements CC unfortunately lowers a lot of the confidentiality. It's not their fault exactly, more like a common side effect of trying to make CC easy to use. You can certainly use their CC to do secure builds but it would require an absolute expert in CC / RA to get it right. I've done design reviews of such proposals before and there's a lot of subtle details.
I don't know about Azure's implementation of confidential compute, but GCP's version essentially relies on AMD SEV-SNP. Historically there have been vulnerabilities that undermine the confidentiality guarantee.
Nobody's code is that secret, especially not from a vendor like Microsoft.
Unless all development is done with air-gapped machines, realistic development environments are simultaneously exposed to all of the following "leakage risks" because they're using third-party software, almost certainly including a wide range of software from Microsoft:
- Package managers, including compromised or malicious packages.
Microsoft owns both NuGet and NPM!
- IDEs and their plugins, the latter especially can be a security risk.
What developer doesn't use Microsoft VS Code these days?
- CLI and local build tools.
- SCM tools such as GitHub Enterprise (Microsoft again!)
- The CI/CD tooling including third-party tools.
- The operating system itself. Microsoft Windows is still a very popular platform, especially in enterprise environments.
- The OS management tools, anti-virus, monitoring, etc...
And on and on.
Unless you live in a total bubble world with USB sticks used to ferry your dependencies into your windowless facility underground, your code is "exposed" to third parties all of the time.
Worrying about possible vulnerabilities in encrypted VMs in a secure cloud facility is missing the real problem that your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them.
Yes, this happens. All the time. You just don't know because you made the perfect the enemy of the good.
> missing the real problem that your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them.
> Yes, this happens. All the time. You just don't know because you made the perfect the enemy of the good.
That only happens in cowboy coding startups.
In places where security matters (e.g. fintech jobs), they just lock down your PC (no admin rights), encrypt the storage and part of your VPN credentials will be on a part of your storage that you can't access.
> ...your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them...
I went from a waiter to startup owner and then acquirer, then working for Google. No formal education, no "real job" till Google, really. I'm not sure even when I was a waiter I had this...laissez-faire? naive?...sense of how corporate computing worked.
That aside, the whole argument stands on "well, other bad things can happen more easily!", which we agree is true, but also, it isn't an argument against it.
From a Chesterton's Fence view, one man's numbskull insistence on not using AWS that must only be due to pointy-haired-boss syndrome is another's valiant self-hosting that saved 7 figures. Hard to say from the bleachers, especially with OP making neither claim.
>Do companies not just double their CI workers after hearing people complain?
They do not.
I don't know if it's a matter of justifying management levels, but these discussions are often drawn out and belabored in my experience. By the time you get approval, or even worse, rejected, for asking for more compute (or whatever the ask is), you've spent way more money on the human resource time than you would ever spend on the requested resources.
This is exactly my experience with asking for more compute at work. We have to prepare loads of written justification, come up with alternatives or optimizations (which we already know won't work), etc. and in the end we choose the slow compute and reduced productivity over the bureaucracy.
And when we manage to make a proper request it ends up being rejected anyways as many other teams are asking for the same thing and "the company has limited resources". Duh.
I have never once been refused by a manager or director when I am explicitly asking for cost approval. The only kind of long and drawn out discussions are unproductive technical decision making. Example: the ask of "let's spend an extra $50,000 worth of compute on CI" is quickly approved but "let's locate the newly approved CI resource to a different data center so that we have CI in multiple DCs" solicits debates that can last weeks.
> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.
I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.
Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).
That kind of asymmetry tends, unless somebody has a strong overriding vision of where the value really comes from, to result in penny pinching on the wrong things.
It's more than that. You can measure salaries too, measurement isn't the issue.
The problem is that if you let people spend the company's money without any checks or balances, they'll just blow through unlimited amounts of it. That's why companies always have lots of procedures and policies around expense reporting. There's no upper limit to how much money developers will spend on cloud hardware given the chance, as the example above of casually running a test 10,000 times in parallel demonstrates nicely.
CI doesn't require you to fill out an expense report every time you run a PR thank goodness, but there still has to be a way to limit financial liability. Usually companies do start out by doubling cluster sizes a few times, but each time it buys a few months and then the complaints return. After a few rounds of this managers realize that demand is unlimited and start pushing back on always increasing the budget. Devs get annoyed and spend an afternoon on optimizations, suddenly times are good again.
The meme on HN is that developer time is always more expensive than machine time, but I've been on both sides of this and seen how the budgets work out. It's often not true, especially if you use clouds like Azure which are overloaded and expensive, or have plenty of junior devs, and/or teams outside the US where salaries are lower. There's often a lot of low hanging fruit in test times so it can make sense to optimize, even so, huge waste is still the order of the day.
Even Google cannot buy more old Intel Macs or Pixel 6s or Samsung S20s to increase their testing on those devices (as an example).
Maybe that affects fewer devs who don't need to test on actual hardware, but plenty of apps do. Pretty much anything that touches a GPU driver, for example, like a game.
No it is not. Senior management often has a barely disguised contempt for engineering and for spending money to do a better job. They listen much more to sales complaining.
You're confusing throughput and latency. Lengthy CI runs increase the latency of developer output, but they don't significantly reduce overall throughput, given a developer will typically be working on multiple things at once, and can just switch tasks while CI is running. The productivity cost of CI is not zero, but it's way, way less than the raw wallclock time spent per run.
Then also factor in that most developer tasks are not even bottlenecked by CI. They are bottlenecked primarily by code review, and secondarily by deployment.
Lengthy CI runs do reduce throughput, as working around high CI latencies pushes people towards juggling more PRs at once, meaning more merge conflicts to deal with, and increases the cost of a build failing transiently.
And context switching isn't free by any means.
Still, if LLM agents keep improving then the bottleneck of waiting on code review won't exist for the agents themselves, there'll just be a stream of always-green branches waiting for someone to review and merge them. CI costs will still matter though.
I’m currently at google (opinions not representative of my employer’s etc) and this is true for things that run in a data center but it’s a lot harder for things that need to be tested on physical hardware like parts of Android or CrOS.
Writing testing infrastructure so that you can just double workers and get a corresponding doubling in productivity is non-trivial. Certainly I've never seen anything like Google's testing infrastructure anywhere else I've worked.
Yeah, Google's infrastructure is unique because Blaze is tightly integrated with the remote execution workers and can shard testing work across many machines automatically. Most places can't do that, so once you have enough hardware that queue depth isn't too big, you can't make anything go faster by adding hardware; you can only try to scale vertically or optimize. And if you're using hosted CI SaaS it's not always easy to get bigger machines, or the bigger machines are superlinear in cost.
Many companies are strangely reluctant to spend money on hardware for developers. They might refuse to spend $1,000 on a better laptop to be used for the next three years by an employee, whose time costs them that much money in a single afternoon.
That's been a pet peeve of mine for so long. (Glad my current employer gets me the best 1.5ℓ machine from Dell every few years!)
On the other hand I've seen many overcapitalized pre-launch startups go for months with a $20,000+ AWS bill without thinking about it then suddenly panic about what they're spending; they'd find tens of XXXXL instances spun up doing nothing, S3 buckets full of hundreds of terabytes of temp files that never got cleared out, etc. With basic due diligence they could have gotten that down to $2k a month, somebody obsessive about cost control could have done even better.
IME it's less of a "throw more resources" problem and more of a "stop using resources in literally the worst way possible"
CI caching is, apparently, extremely difficult. Why spend a couple of hours learning about your CI caches when you can just download and build the same pinned static library a billion times? The server you're downloading from is (of course) someone else's problem and you don't care about wasting their resources either. The power you're burning by running CI for three hours instead of one is also someone else's problem. Compute time? Someone else's problem. Cloud costs? You bet it's someone else's problem.
Sure, some things you don't want to cache. I always do a 100% clean build when cutting a release or merging to master. But for intermediate commits on a feature branch? Literally no reason not to cache builds the exact same way you do on your local machine.
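A minimal sketch of that policy (the paths, branch names, lockfile, and CI_BRANCH variable are all hypothetical; real CI systems have their own cache primitives, this only illustrates the keying and the clean-on-release rule):

    import hashlib, os, pathlib, shutil

    CACHE_ROOT = pathlib.Path("/var/cache/ci-builds")   # hypothetical shared cache location
    BUILD_DIR = pathlib.Path("build")

    def cache_key() -> str:
        # key on the exact dependency pins so any lockfile change invalidates the cache
        lock = pathlib.Path("requirements.lock").read_bytes()   # hypothetical lockfile
        return hashlib.sha256(lock).hexdigest()[:16]

    def prepare_build_dir(branch: str) -> None:
        cached = CACHE_ROOT / cache_key()
        if branch in ("main", "master") or branch.startswith("release/"):
            shutil.rmtree(BUILD_DIR, ignore_errors=True)         # releases always build clean
        elif cached.exists():
            shutil.rmtree(BUILD_DIR, ignore_errors=True)
            shutil.copytree(cached, BUILD_DIR)                   # feature branches reuse intermediates

    def save_cache() -> None:
        cached = CACHE_ROOT / cache_key()
        if BUILD_DIR.exists() and not cached.exists():
            shutil.copytree(BUILD_DIR, cached)

    if __name__ == "__main__":
        prepare_build_dir(os.environ.get("CI_BRANCH", ""))
        # ... run the actual build here ...
        save_cache()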
Not really, in most small companies/departments, £100k a month is considered a painful cloud bill and adding more EC2 instances to provide cloud runners can add 10% to that easily.
My personal experience: We run over 1.1m test cases to verify every PR that I submit, and there are more test cases that don't get run on every commit and instead get run daily or on-demand.
At that scale getting quick turnaround is a difficult infrastructure problem, especially if you have individual tests that take multiple seconds or suites that take multiple minutes (we do, and it's hard to actually pull the execution time down on all of them).
I've never personally heard "we don't have the budget" or "we don't have enough machines" as answers for why our CI turnaround isn't 5 minutes, and it doesn't seem to me like the answer is just doubling the core count in every situation.
The scenario I work on daily (a custom multi-platform runtime with its own standard library) does by necessity mean that builds and testing are fairly complex though. I wouldn't be surprised if your assertion (just throw more resources at it) holds for more straightforward apps.
- Just spin up more test instances. If the AI is as good as people claim then it's still way cheaper than extra programmers.
- Write fast code. At $WORK we can test roughly a trillion things per CPU physical core year for our primary workload, and that's in a domain where 20 microsecond processing time is unheard of. Orders of magnitude speed improvements pay dividends quickly.
- LLMs don't care hugely about the language. Avoid things like rust where compile times are always a drag.
- That's something of a strange human problem you're describing. Once the PR is reviewed, can't you just hit "auto-merge" and go to the next task, only circling back if the code was broken? Why is that a significant amount of developer time?
- The thing you're observing is something every growing team witnesses. You can get 90% of the way to what you want by giving the build system a greenfield re-write. If you really have to run 100x more tests, it's worth a day or ten sanity checking docker caching or whatever it is your CI/CD is using. Even hermetic builds have inter-run caching in some form; it's just more work to specify how the caches should work. Put your best engineer on the problem. It's important.
- Be as specific as possible in describing test dependencies. The fastest tests are the ones which don't run.
- Separate out unit tests from other forms of tests. It's hard to write software operating with many orders of magnitude of discrepancies, and tests are no exception. Your life is easier if conceptually they have a separate budget (e.g., continuous fuzz testing or load testing or whatever). Unit tests can then easily be fast enough for a developer to run all the changed ones on precommit. Slower tests are run locally when you think they might apply. The net effect is that you don't have the sort of back-and-forth with your CI that actually causes lost developer productivity because the PR shouldn't have a bunch of bullshit that's green locally and failing remotely.
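A minimal sketch of that "only run the affected unit tests" precommit step (the tests/ layout, the test_<module>.py naming convention, and the origin/main base are assumptions; real tools trace imports rather than guessing by filename):

    import os, subprocess, sys

    def changed_files(base: str = "origin/main") -> list[str]:
        out = subprocess.run(["git", "diff", "--name-only", base],
                             capture_output=True, text=True, check=True)
        return [f for f in out.stdout.splitlines() if f.endswith(".py")]

    def test_for(path: str) -> str:
        if os.path.basename(path).startswith("test_"):
            return path                          # a test itself changed: rerun it
        return os.path.join("tests", "test_" + os.path.basename(path))

    if __name__ == "__main__":
        targets = sorted({test_for(f) for f in changed_files() if os.path.exists(test_for(f))})
        if targets:
            sys.exit(subprocess.run(["pytest", "-q", *targets]).returncode)
        print("no affected unit tests for this change")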
These are all good suggestions, albeit many are hard to implement in practice.
> That's something of a strange human problem you're describing.
Are we talking about agent-written changes now, or human? Normally reviewers expect tests to pass before they review something, otherwise the work might change significantly after they did the review in order to fix broken tests. Auto merges can fail due to changes that happened in the meantime; they aren't auto in many cases.
Once latency goes beyond a minute or two people get distracted and start switching tasks to something else, which slows everything down. And yes code review latency is a problem as well, but there are easier fixes for that.
1. As implementation phase gets faster, the bottleneck could actually switch to PM. In which case, changes will be more serial, so a lot fewer conflicts to worry about.
2. I think we could see a resurrection of specs like TLA+. Most engineers don't bother with them, but I imagine code agents could quickly create them, verify the code is consistent with them, and then require fewer full integration tests.
3. When background agents are cleaning up redundant code, they can also clean up redundant tests.
4. Unlike human engineering teams, I expect AIs to work more efficiently on monoliths than with distributed microservices. This could lead to better coverage on locally runnable tests, reducing flakes and CI load.
5. It's interesting that even as AI increases efficiency, that increased velocity and sheer amount of code it'll write and execute for new use cases will create its own problems that we'll have to solve. I think we'll continue to have new problems for human engineers to solve for quite some time.
CI should just run on each developer's machine. As in, each developer should have a local instance of the CI setup in a VM or a docker container. If tests pass, the result is reported to a central server.
For Python apps, I've gotten good CI speedups by moving over to the astral.sh toolchain, using uv for the package installation with caching. Once I move to their type-checker instead of mypy, that'll speed the CI up even more. The playwright test running will then probably be the slowest part, and that's only in apps with frontends.
(Also, Hi Mike, pretty sure I worked with you at Google Maps back in early 2000s, you were my favorite SRE so I trust your opinion on this!)
An LLM making a quick edit, <100 lines... Sure. Asking an LLM to rubber-duck your code, sure. But integrating an LLM into your CI is going to end up costing you 100s of hours of productivity on any large project. That, or you spend half the time you should be spending learning to write your own code on dialing in context sizes and prompt accuracy.
I really really don't understand the hubris around llm tooling, and don't see it catching on outside of personal projects and small web apps. These things don't handle complex systems well at all, you would have to put a gun in my mouth to let one of these things work on an important repo of mine without any supervision... And if I'm supervising the LLM I might as well do it myself, because I'm going to end up redoing 50% of its work anyways..
I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLMs are useful? Like how many people need to say that they find it makes them more productive before you'll shift your perspective?
...and my comment clearly isn't talking about that, but at the suggestion that it's useless to write code with an LLM because you'll end up rewriting 50% of it.
If everyone has an opinion different to mine, I don't instantly change my opinion, but I do try and investigate the source of the difference, to find out what I'm missing or what they are missing.
The polarisation between people that find LLMs useful or not is very similar to the polarisation between people that find automated testing useful or not, and I have a suspicion they have the same underlying cause.
You seem to think everyone shares your view, around me I see a lot of people acknowledging they are useful to a degree, but also clearly finding limits in a wide array of cases, including that they really struggle with logical code, architectural decisions, re-using the right code patterns, larger scale changes that aren’t copy paste, etc.
So far what I see is that if I provide lots of context and clear instructions to a mostly non-logical area of code, I can speed myself up about 20-40%, but only works in about 30-50% of the problems I solve day to day at a day job.
So basically - it’s about a rough 20% improvement in my productivity - because I spend most of my time of the difficult things it can’t do anyway.
Meanwhile these companies are raising billion dollar seed rounds and telling us that all programming will be done by AI by next year.
It's a tool, and it depends on what you need to do. If it fits someone's needs and makes them more productive, or even simply makes the activity more enjoyable, good.
Just because two people are both fixing something to a wall doesn't mean the same tool will do for both. Gum, pushpin, nail, screw, bolts?
The parent thread did mention they use LLMs successfully in small side projects.
They say it's only effective for personal projects, but there's literally evidence of LLMs being used for things he says they can't be used for. Actual physical evidence.
It’s self delusion. And also the pace of AI is so fast he may not be aware of how fast LLMs are integrating into our coding environments. Like 1 year ago what he said could be somewhat true but right now what he said is clearly not true at all.
I've used Claude with a large, mature codebase and it did fine. Not for every possible task, but for many.
Probably, Mercury isn't as good at coding as Claude is. But even if it's not, there's lots of small tasks that LLMs can do without needing senior engineer level skills. Adding test coverage, fixing low priority bugs, adding nice animations to the UI etc. Stuff that maybe isn't critical so if a PR turns up and it's DOA you just close it, but which otherwise works.
Note that many projects already use this approach with bots like Renovate. Such bots also consume a ton of CI time, but it's generally worth it.
IMHO LLMs are notoriously bad at test coverage. They usually hard code a value to have the test pass, since they lack the reasoning required to understand why the test exists or the concept of assertion, really
I don’t know, Claude is very good at writing that utterly useless kind of unit test where every dependency is mocked out and the test is just the inverted dual of the original code. 100% coverage, nothing tested.
Yeah and that's even worse because there's not an easy metric you can have the agent work towards and get feedback on.
I'm not that into "prompt engineering" but tests seem like a big opportunity for improvement. Maybe something like (but much more thorough):
1. "Create a document describing all real-world actions which could lead to the code being used. List all methods/code which gets called before it (in order) along with their exact parameters and return value. Enumerate all potential edge cases and errors that could occur and if it ends up influencing this task. After that, write a high-level overview of what need to occur in this implementation. Don't make it top down where you think about what functions/classes/abstractions which are created, just the raw steps that will need to occur"
2. Have it write the tests
3. Have it write the code
Maybe TDD ends up worse but I suspect the initial plan which is somewhat close to code makes that not the case
Writing the initial doc yourself would definitely be better, but I suspect just writing one really good one, then giving it as an example in each subsequent prompt captures a lot of the improvement
Don't want to put words in the parent commenter's mouth, but I think the key word is "unsupervised". Claude doesn't know what it doesn't know, and will keep going round the loop until the tests go green, or until the heat death of the universe.
Before cars, people spent little on petroleum products or motor oil or gasoline or mechanics. Now they do. That's how systems work. You wanna go faster? Well, you need better roads, traffic lights, on-ramps, etc. You're still going faster.
Use AI to solve the I/O bottlenecks, or build more features that earn more revenue that buys more CI boxes. Same as if you added 10 devs, which you effectively are with AI, so why wouldn't some of the dev support costs go up?
Are you not in a place where you can make an efficiency argument to get more CI or to optimize? What's a CI box cost?
Any modern MacBook can run those tests 100x faster than the crappy cloud runners most companies use. You can also configure runners that run locally and get the benefit of those speed gains. So all of this is really a business and technical problem that is solved for those who want to solve it. It can be solved very cheap, or it can be solved very expensive. Regardless, it's precisely those types of efficiency gains that motivate companies to finally do something about it.
And if not, then enjoy being paid waiting for CI to go green. Maybe it's a reminder to go take a break.
It will be worse when the process is super optimized and the expectation changes. So now instead of those 2 PRs that went to prod today because everyone knows CI takes forever, you'll be expected to push 8 because in our super optimized pipeline it only takes seconds. No excuses. Now the bottleneck is you.
If I am coding, I want to stay in the flow and get my PR green asap, so I can continue on the project.
If I am orchestrating agents, I might have 10 or 100 PRs in the oven. In that case I just look at the ones that finish CI.
It’s gonna be less, or at least different, kind of flow IMO. (Until you can just crank out design docs and whiteboard sessions and have the agents fully autonomously get their work green.)
> If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.
I am guesstimating (based on previous experience self-hosting the runner for MacOS builds) that the project I am working on could get like 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner. Maybe I am naive, and I am not the person that would be responsible for it - but having a few bare metal machines you can use in the off hours to run regression tests, for less than you are paying the existing CI runner just for builds, and that speed everything up massively, seems like a pure win for relatively low effort. Like sure, everyone already has stuff on their plate and would rather pay an external service to do it - but TBH once you have this kind of compute handy you will find uses for it anyway and just do things efficiently. And knowing how to deal with bare metal/utilize this kind of compute sounds like a generally useful skill - but I rarely encounter people enthusiastic about making this kind of move. It's usually - hey, let's move to this other service that has slightly cheaper instances and a proprietary caching layer so that we can get locked into their CI crap.
It's not like these services have 0 downtime/are bug free/don't require integration effort - I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.
Yep. For my own company I used a bare metal machine in Hetzner running Linux and a Windows VM along with a bunch of old MacBook Pros wired up in the home office for CI.
It works, and it's cheap. A full CI run still takes half an hour on the Linux machine (the product [1] is a kind of build system for shipping desktop apps cross platform, so there's lots of file IO and cryptography involved). The Macs are by far the fastest. The M1 Mac is embarrassingly fast. It can complete the same run in five minutes despite the Hetzner box having way more hardware. In fairness, it's running both a Linux and Windows build simultaneously.
I'm convinced the quickest way to improve CI times in most shops is to just build an in-office cluster of M4 Macs in an air conditioned room. They don't have to be HA. The hardware is more expensive but you don't rent per month, and CI is often bottlenecked on serial execution speed so the higher single threaded performance of Apple Silicon is worth it. Also, pay for a decent CI system like TeamCity. It helps reduce egregious waste from problems like not caching things or not re-using checkout directories. In several years of doing this I haven't had build caching related failures.
> 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner
This is absolutely the case. It's a combination of having dedicated CPU cores, dedicated memory bandwidth, and (perhaps most of all) dedicated local NVMe drives. We see a 2x speed up running _within VMs_ on bare metal.
> And knowing how to deal with bare metal/utilize this kind of compute sounds like a generally useful skill - but I rarely encounter people enthusiastic about making this kind of move
We started our current company for this reason [0]. A lot of people know this makes sense on some level, but not many people want to do it. So we say we'll do it for you, give you the engineering time needed to support it, and you'll still save money.
> I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.
It is decreasingly so from what I see. Enough people have been variously burned by public cloud providers to know they are not a panacea. But they just need a little assistance in making the jump.
At the last place I worked at, which was just a small startup with 5 developers, I calculated that a server workstation in the office would be both cheaper and more performant than renting a similar machine in the cloud.
Bare metal makes such a big difference for test and CI scenarios. It even has an integrated GPU to speed up webdev tests. Good luck finding an affordable machine in the cloud that has a proper GPU for this kind of use-case.
Is it a startup or a small business? In my book a startup expects to scale, and hosting bare metal HW in an office with 5 people means you have to figure everything out again when you get to 20/50/100 people - IMO not worth the effort, and hosting hardware has zero transferable skills to your product.
Running on managed bare metal servers is theoretically the same as running any other infra provider except you are on the hook for a bit more maintenance, you scale to 20 people you just rent a few more machines. I really do not see many downsides for the build server/test runner scenario.
The nice part about most CI workloads is that they can almost always be split up and executed in parallel. Make sure you're utilizing every core on every CI worker and your worker pools are appropriately sized for the workload. Use spot instances and add auto scaling where it makes sense. No one should be waiting more than a few minutes for a PR build. Exception being compile time which can vary significantly between languages. I have a couple projects that are stuck on ancient compilers because of CPU architecture and C variant, so those will always be a dog without effort to move to something better. Ymmv
As an example we recently had a Ruby application that had a test suite that was taking literally an hour per build, but turned out it was running entirely sequential by default, using only 1 core. I spent an afternoon migrating our CI runners to split the workload across all available cores and now it's 5 minutes per build. And that was just the low hanging fruit, it can be significantly improved further but there's obviously diminishing returns
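For illustration, a minimal sketch of that sharding idea (shown in Python rather than Ruby; the tests/**/*_test.py layout and pytest runner are assumptions, and a Ruby shop would reach for something like the parallel_tests gem instead):

    # deterministically shard test files across the available cores
    # and run one test process per shard
    import glob, os, subprocess, zlib
    from concurrent.futures import ProcessPoolExecutor

    def shard(files: list[str], n: int) -> list[list[str]]:
        buckets = [[] for _ in range(n)]
        for f in files:
            buckets[zlib.crc32(f.encode()) % n].append(f)   # stable assignment across runs
        return buckets

    def run_shard(files: list[str]) -> int:
        if not files:
            return 0
        return subprocess.run(["pytest", "-q", *files]).returncode

    if __name__ == "__main__":
        cores = os.cpu_count() or 4
        buckets = shard(sorted(glob.glob("tests/**/*_test.py", recursive=True)), cores)
        with ProcessPoolExecutor(max_workers=cores) as pool:
            codes = list(pool.map(run_shard, buckets))
        raise SystemExit(max(codes, default=0))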
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers
No, this is common. The devs just haven't grokked dependency inversion. And I think the rate of new devs entering the workforce will keep it that way forever.
Here's how to make it slow:
* Always refer to "the database". You're not just storing and retrieving objects from anywhere - you're always using the database.
* Work with statements, not expressions. Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance. This will force you to sequentialise the tests (simultaneous tests would otherwise race and cause flakiness) plus you get to write a bunch of setup and teardown and wipe state between tests. (Contrast with the expression version sketched after this list.)
* If you've done the above, you'll probably need to wait for state changes before running an assertion. Use a thread sleep, and if the test is ever flaky, bump up the sleep time and commit it if the test goes green again.
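For contrast, a minimal sketch of the expression-oriented version of the second point: when the balance really is just the sum of the transactions, the test needs no database, no setup/teardown, and no sleeps (types and amounts are hypothetical).

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Transaction:
        amount_cents: int

    def balance(transactions: list[Transaction]) -> int:
        # "the balance is the sum of the transactions" as a pure expression
        return sum(t.amount_cents for t in transactions)

    def test_balance():
        txs = [Transaction(1000), Transaction(-250), Transaction(40)]
        assert balance(txs) == 790
        assert balance([]) == 0

Persistence still needs its own (slower, fewer) tests, but they stop being the bottleneck for every logic change.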
Nah. Tests could be run in N processes, each with its own database configured to skip full fsync. That resolves most of the issues and makes testing much, much simpler.
> Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance
Er, doesn’t this boil down to saying “not testing database end state (trusting in transactionality) is faster than testing it”?
I mean sure, trivially true, but not a good idea. I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler. Bad code, to be sure, but common in many contexts in my experience.
> I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler.
Yes! That's my current codebase you're describing! If you interweave the database all throughout your accounting logic, you absolutely can bury those kinds of problems for people to find later. But remember, one test at a time so that you don't accidentally discover that your database transactions aren't protecting you nearly as well as you thought.
In fact, screw database transactions. Pay the cost of the object-relational impedance mismatch and unscalable joins, but make sure you avoid the benefits, by turning off ACID for performance reasons (probably done for you already) and making heavy use of LINQ so that values are loaded in and out of RAM willy-nilly and thereby escape their transaction scopes.
The C# designers really leaned into the 'statements' not 'expression' idea! There's no transaction context object returned from beginTrans which could be passed into subsequent operations (forming a nice expression) and thereby clear up any "am I in a transaction?" questions.
But yeah, right now it's socially acceptable to plumb the database crap right through the business logic. If we could somehow put CSS or i18n in the business logic, we'd need to put a browser into our test suite too!
Wow, your story gives me flashbacks to the 1990s when I worked in a mainframe environment. Compile jobs submitted by developers were among the lowest priorities. I could make a change to a program, submit a compile job, and wait literally half a day for it to complete. Then I could run my testing, which again might have to wait for hours. I generally had other stuff I could work on during those delays but not always.
Yet, now that I have added an LLM workflow to my coding, the value of my old and mostly useless workflows is 10x'd.
Git checkpoints, code linting and my naive suite of unit and integration tests are now crucial to my LLM not wasting too much time generating total garbage.
It’s because people don’t know how to write tests. All of the “don’t do N select queries in a for loop” comments made in PRs are completely ignored in tests.
Each test can output many db queries. And then you create multiple cases.
People don’t even know how to write code that just deals with N things at a time.
I am confident that tests run slowly because the code that is tested completely sucks and is not written for batch mode.
Ignoring batch mode, tests are most of the time written in a way where test cases are run sequentially. Yet attempts to run them concurrently result in flaky tests, because the way you write them and the way you design interfaces does not allow concurrent execution at all.
Another comment: code done by the best AI model still sucks. Anything simple, like a music player with a library of 10000 songs, is something it can't do. The first attempt will be horrible. No understanding of concurrent metadata parsing, lists showing 10000 songs at once in the UI being slow, etc.
So AI is just another excuse for people writing horrible code and horrible tests. If it’s so smart , try to speed up your CI with it.
I agree. I think there are potentially multiple solutions to this since there are multiple bottlenecks. The most obvious is probably network overhead when talking to a database. Another might be storage overhead if storage is being used.
Frankly another one is language. I suspect type-safe, compiled, functional languages are going to see some big advantages here over dynamic interpreted languages. I think this is the sweet spot that grants you a ton of performance over dynamic languages, gives you more confidence in the models changes, and requires less testing.
Faster turn-around, even when you're leaning heavily on AI, is a competitive advantage IMO.
It could go either way. Depends very much on what kind of errors LLMs make.
Type safe languages in theory should do well, because you get feedback on hallucinated APIs very fast. But if the LLM generally writes code that compiles, unless the compiler is very fast you might get out-run by an LLM just spitting out JavaScript at high speed, because it's faster to run the tests than wait for the compile.
The sweet spot is probably JIT compiled type safe languages. Java, Kotlin, TypeScript. The type systems can find enough bugs to be worth it, but you don't have to wait too long to get test results either.
In most companies the CI/Dev Tools team is a career dead end. There is no possibility to show a business impact, it's just a money pit that leadership can't/won't understand (and if they do start to understand it, then it becomes _their_ money pit, which is a career dead end for them) So no one who has their head on straight wants to spend time improving it.
And you can't even really say it's a short sighted attitude. It definitely is from a developer's perspective, and maybe it is for the company if dev time is what decides the success of the business overall.
I haven't worked in places using off-the-shelf/SaaS CI in more than a decade so I feel my experience has been quite the opposite from yours.
We always worked hard to make the CI/CD pipeline as fast as possible. I personally worked on those kinds of projects at 2 different employers as an SRE: a smaller 300-person shop where I was responsible for all their infra needs (CI/CD, live deployments, migrated later to k8s when it became somewhat stable, at least enough for the workloads we ran, but still in its beta days), then at a different employer some 5k+ strong, working on improving the CI/CD setup, which used Jenkins as a backend but with a completely different shim we developed on top for developer experience, while also working on a bespoke worker scheduler/runner.
I haven't experienced a CI/CD setup that takes longer than 10 minutes to run in many, many years, got quite surprised reading your comment and feeling spoiled I haven't felt this pain for more than a decade, didn't really expect it was still an issue.
I think the prevalence of teams having a "CI guy" who often is developing custom glue, is a sign that CI is still not really working as well as it should given the age of the tech.
I've done a lot of work on systems software over the years so there's often tests that are very I/O or computation heavy, lots of cryptography, or compilation, things like that. But probably there are places doing just ordinary CRUD web app development where there's Playwright tests or similar that are quite slow.
A lot of the problems are cultural. CI times are a commons, so it can end in tragedy. If everyone is responsible for CI times then nobody is. Eventually management gets sick of pouring money into it and devs learn to juggle stacks of PRs on top of each other. Sometimes you get a lot of pushback on attempts to optimize CI because some devs will really scream about any optimization that might potentially go wrong (e.g. depending on your build system cache), even if caching nothing causes an explosion in CI costs. Not their money, after all.
>There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.
Testing every change incrementally is a vestige of the code being done by humans (and thus of the current approach where AI helps and/or replaces one given human), in small increments at that, and of the failures being analyzed by individual humans who can keep in their head only limited number of things/dependencies at once.
Good God I hate CI. Just let me run the build automation myself dammit! If you're worried about reproducibility make it reproducible and hash the artifacts, make people include the hash in the PR comment if you want to enforce it.
The amount of time people waste futzing around in eg Groovy is INSANE and I'm honestly inclined to reject job offers from companies that have any serious CI code at this point.
It takes more work (serious CI code) to make CI run anywhere, such as your own computer. So you prefer companies that just use GHA? You can't get simpler than that.
I tried the playground and got a strange response. I asked for a regex pattern, and the model gave itself a little game-plan, then it wrote the pattern and started to write tests for it. But it never stopped writing tests. It continued to write tests of increasing size until I guess it reached a context limit and the answer was canceled. Also, for each test it wrote, it added a comment about if the test should pass or fail, but after about the 30th test, it started giving the wrong answer for those too, saying that a test should fail when actually it should pass if the pattern is correct. And after about the 120th test, the tests started to not even make sense anymore. They were just nonsense characters until the answer got cut off.
The pattern it made was also wrong, but I think the first issue is more interesting.
I had this happen to me on Claude Sonnet once. It started spitting out huge blocks of source code completely unrelated to my prompt, seemingly from its training data, and switching codebases once in a while... like, a few thousand lines of some C program, then switching to another JavaScript one, etc. it was insane!
FWIW, I remember regular models doing this not that long ago, sometimes getting stuck in something like an infinite loop where they keep producing output that is only a slight variation on previous output.
if you shrink the context window on most models you'll get this type of behaviour. If you go too small you end up with basically gibberish even on modern models like Gemini 2.5.
Mercury has a 32k context window according to the paper, which could be why it does that.
I think that's a prime example showing that token prediction simply isn't good enough for correctness. It never will be. LLMs are not designed to reason about code.
ICYMI, DeepMind also has a Gemini model that is diffusion-based[1]. I've tested it a bit and while (like with this model) the speed is indeed impressive, the quality of responses was much worse than other Gemini models in my testing.
Yes, arxiv requires that submissions must be scientific research. And not just anyone can publish on arxiv, you need endorsement by existing users.
That the scientific research is in pursuit of a commercial product, or that the paper submitted is of low quality, is not something they would filter out, however.
Using the free playground link, and it is in fact extremely fast. The "diffusion mode" toggle is also pretty neat as a visualization, although I'm not sure how accurate it is - it renders as line noise and then refines, while in reality presumably those are tokens from an imprecise vector in some state space that then become more precise until it's only a definite word, right?
Some text diffusion models use a continuous latent space, but they historically haven't done that well. Most of the ones we're seeing now are typically trained to predict actual token output that's fed forward into the next timestep. The diffusion property comes from their ability to modify previous timesteps to converge on the final output.
I am personally very excited for this development. Recently I AI-coded a simple game for a game jam and half the time was spent waiting for the AI agent to finish its work so I can test it. If instead of waiting 1-2 minutes for every prompt to be executed and implemented I could wait 10 seconds instead that would be literally game changing. I could test 5-10 different versions of the same idea in the time it took me to test one with the current tech.
Of course this model is not as advanced yet for this to be feasible, but so was Claude 3.0 just over a year ago. This will only get better over time I’m sure. Exciting times ahead of us.
The pricing is a little on the higher side. Working on a performance-sensitive application, I tried Mercury and Groq (Llama 3.1 8b, Llama 4 Scout) and the performance was neck-and-neck but the pricing was way better for Groq.
But I'll be following diffusion models closely, and I hope we get some good open source ones soon. Excited about their potential.
If your application is pricing sensitive, check out DeepInfra.com - they have a variety of models in the pennies-per-mil range. Not quite as fast as Mercury, Groq or Samba Nova though.
(I have no affiliation with this company aside from being a happy customer the last few years)
Diffusion is just the logically most optimal behavior for searching massively parallel spaces without informed priors. We need to think beyond language modeling, however, and start to view this in terms of drug discovery etc. A good diffusion model + the laws of chemistry could be god-tier. I think language modeling has the AI community in its grip right now and they aren't seeing the applications of the same techniques to real-world problems elsewhere.
Actually in most deep learning schemes for science adding in the "laws of nature" as constraints makes things much worse. For example, all the best weather prediction models utilize basically zero fluid dynamics. Even though a) global weather can be in principle predicted by using the Navier-Stokes equations and b) deep learning models can be used to approximately evaluate the Navier-Stokes equations, we now know that incorporating physics into these models is mostly a mistake.
The intuitive reason might be that unconstrained optimization is easier than constrained optimization, particularly in high dimensions, but no one really knows the real reason. It may be that we are not yet at the end of the "bigger is better" regime, and at the true frontier we must add the laws of natures to eke out the last remaining bits of performance possible.
I think the LLM dev community is underestimating these models. E.g. there is no LLM inference framework that supports them today.
Yes the diffusion foundation models have higher cross entropy. But diffusion LLMs can also be post trained and aligned, which cuts the gap.
IMO, investing in post training and data is easier than forcing GPU vendors to invest in DRAM to handle large batch sizes and forcing users to figure out how to batch their requests by 100-1000x. It is also purely in the hands of LLM providers.
Google has Gemini Diffusion in the works. I joined the beta. Roughly speaking it "feels" a lot like 2.5 Flash in the style of its interaction and accuracy. But the walls of text appear almost instantaneously; you don't notice any scrolling.
Damn, that is fast. But it is faster than I can read, so hopefully they can use that speed and turn it into better quality of the output. Because otherwise, I honestly don't see the advantage, in practical terms, over existing LLMs. It's like having a TV with a 200Hz refresh rate, where 100Hz is just fine.
You're missing another big advantage: cost. If you can do 1000 tok/s on a $2/hr H100 vs 60 tok/s on the same hardware, you can price it at roughly 1/16th of the price for the same margin.
You can also slow down the hardware (say, dropping the clock and then voltages) to save huge amounts of power, which should be interesting for embedded applications.
I've been looking at the code on their chat playground, https://chat.inceptionlabs.ai/, and they have a helper function `const convertOpenAIMessages = (convo) => { ... }`, which also contains `models: ['gpt-3.5-turbo']`. I also see in API response: `"openai": true`. Is it actually using OpenAI, or is it actually calling its dLLM? Does anyone know?
Also: you can turn on "Diffusion Effect" in the top-right corner, but this just seems to be an "animation gimmick" right?
I've been asking bespoke questions and the timing is >2 seconds, and slower than what I get for the same questions to ChatGPT (using gpt-4.1-mini). I am looking at their call stack and what I see: "verifyOpenAIConnection()", "generateOpenAIChatCompletion()", "getOpenAIModels()", etc. Maybe it's just so it's compatible with OpenAI API?
The output is very fast but many steps backwards in all of my personal benchmarks. Great tech but not usable in production when it is over 60% hallucinations.
That might just depend on how big it is/how much money was spent on training. The neural architecture can clearly work. Beyond that catching up may be just a matter of effort.
For something a little different than a coding task, I tried using it in my game: https://www.playintra.win/ (in settings you can select Mercury, the game uses OpenRouter)
At first it seemed pretty competent and of course very fast, but it seemed to really fall apart as the context got longer. The context in this case is a sequence of events and locations, and it needs to understand how those events are ordered and therefore what the current situation and environment are (though there's also lots of hints in the prompts to keep it focused on the present moment). It's challenging, but lots of smaller models can pull it off.
But also a first release and a new architecture. Maybe it just needs more time to bake (GPT 3.5 couldn't do these things either). Though I also imagine it might just perform _differently_ from other LLMs, not really on the same spectrum of performance, and requiring different prompting.
Love the UI in the playground; it reminds me of Qwen chat.
We have reached a point where the bottlenecks in genAI are not knowledge or accuracy; they are the context window and speed.
Luckily, Google (and Meta?) have pushed the limits of the context window to about 1 million tokens, which is incredible. But I feel like today's options are still stuck at around a ~128k-token window per chat, and after that it starts to forget.
Another issue is the time it takes for inference AND reasoning. dLLMs are an interesting approach to this. I know we have Groq's hardware as well.
I do wonder, can this be combined with Groq's hardware? Would the response be instant then?
How many tokens can each chat handle in the playground? I couldn't find so much info about it.
Which model is it using for inference?
Also, is the training the same for dLLMs as for the standard autoregressive LLMs? Or are the weights and models completely different?
I agree entirely with you. While Claude Code is amazing, it is also slow as hell and the context issue keeps coming up (usually at what feels like the worst possible time for me).
It honestly feels like dial-up with most LLMs (apart from this one!).
AFAIK, with traditional models context size is very memory intensive (though I know there are a lot of things trying to 'optimize' this). I believe memory usage grows with the square of context length, so even 10xing context length requires 100x the memory.
(Image) diffusion does not grow like that, it is much more linear. But I have no idea (yet!) about text diffusion models if someone wants to chip in :).
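To make the "square of context length" intuition concrete, here's a back-of-the-envelope sketch of the naive attention score matrix alone. It ignores KV caching and fused kernels (which change the real memory profile), so treat it purely as an illustration of the quadratic term:

```python
# Back-of-the-envelope: size of a single naive attention score matrix
# (n_tokens x n_tokens) in fp16, per head per layer. Illustration only;
# KV caching and fused kernels change the real memory picture.
def attn_matrix_gib(n_tokens: int, bytes_per_elem: int = 2) -> float:
    return n_tokens * n_tokens * bytes_per_elem / 2**30

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attn_matrix_gib(n):,.1f} GiB per head per layer")
# 10x the context -> ~100x the matrix size, which is the quadratic term
# the parent comment is referring to.
```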
I was expecting really crappy performance but just chatting to it, giving it some puzzles, it feels very smart and gets a lot of things right that a lot of other models don't.
> By submitting User Submissions through the Services, you hereby do and shall grant Inception a worldwide, non-exclusive, perpetual, royalty-free, fully paid, sublicensable and transferable license to use, edit, modify, truncate, aggregate, reproduce, distribute, prepare derivative works of, display, perform, and otherwise fully exploit the User Submissions in connection with this site, the Services and our (and our successors’ and assigns’) businesses, including without limitation for promoting and redistributing part or all of this site or the Services (and derivative works thereof) in any media formats and through any media channels (including, without limitation, third party websites and feeds), and including after your termination of your account or the Services. For clarity, Inception may use User Submissions to train artificial intelligence models. (However, we will not train models using submissions from users accessing our Services via OpenRouter.)
Reinforcement learning really helped Transformer-based LLMs evolve in terms of quality and reasoning, which we saw when DeepSeek launched. I am curious whether this is equivalent to an early GPT-4o that has not yet reaped the benefits of the add-on techniques that improved quality?
I've used mercury quite a bit in my commit message generator. I noticed it would always produce the exact same response if you ran it multiple times, and increasing temperature didn't affect it. To get some variability I added a $(uuidgen) to the prompt. Then I could run it again for a new response if I didn't like the first.
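A rough Python equivalent of that salt-the-prompt trick, for anyone who wants it (the prompt wording is just illustrative):

```python
# Rough sketch of the "salt the prompt" trick for deterministic models:
# append a random nonce so re-running yields a different completion.
import uuid

def build_prompt(diff_text: str) -> str:
    nonce = uuid.uuid4()  # ignored by the instructions, but changes the input
    return (
        "Write a concise git commit message for this diff.\n"
        f"(nonce: {nonce})\n\n{diff_text}"
    )

print(build_prompt("diff --git a/foo.py b/foo.py\n+print('hi')"))
```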
I was curious to know the statistics on the mentions of various programming languages on HN over the years, so I got me a copy of all HN comments from a BigTable public source. But now I need to interpret each comment and so what I need is a semantic grep. The easiest would be to prompt an LLM.
Comments are pretty short, but there are many millions of them. So getting high throughput at minimum cost is key.
I'm hoping that Inception might be able to churn through this quickly.
If you folks have other ideas or suggestions, what might also work, I'd love to hear them!
The idea is having a semgrep command line tool. If latencies are dropping dramatically, it might be feasible.
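One possible shape for that semantic grep, sketched with a placeholder `ask_llm` helper — swap in whatever client and model you end up using; the concurrency/batching is where the throughput would come from:

```python
# Hedged sketch of a "semantic grep": ask a fast, cheap LLM a yes/no question
# about each comment. ask_llm() is a placeholder for whatever client you use.
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def mentions_language(comment: str, language: str) -> bool:
    prompt = (
        f"Does the following comment discuss the {language} programming language? "
        f"Answer strictly yes or no.\n\n{comment}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def semantic_grep(comments: list[str], language: str, workers: int = 32) -> list[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda c: mentions_language(c, language), comments)
        return [c for c, hit in zip(comments, hits) if hit]
```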
Code output is verifiable in multiple ways. Combine that with this kind of speed (and far faster in future) and you can brute force your way to a killer app in a few minutes.
Yes, exactly. The demo of Gemini's Diffusion model [0] was really eye-opening to me in this regard. Since then, I've been convinced the future of lots of software engineering is basically UX and SQA: describe the desired states, have an LLM fill in the gaps based on its understanding of human intent, and unit test it to verify. Like most engineering fields, we'll have an empirical understanding of systems as opposed to the analytical understanding of code we have today. I'd argue most complex software is already only approximately understood even before LLMs. I doubt the quality of software will go up (in fact the opposite), but I think this work will scale much better and be much, much more boring.
I wonder if diffusion llms solve the hallucination problem more effectively. In the same way that image models learned to create less absurd images, dllms can perhaps learn to create sensical responses more predictably
The speed here is super impressive! I am curious - are there any qualitative ways in which modeling text using diffusion differs from that using autoregressive models? The kind of problems it works better on, creativity, and similar.
One works coarse-to-fine, the other works start-to-end. Which means different directionality biases, at least. Differences in speed, generalization, etc. are less clear and need to be proven in practice, as fundamentally they are closer than they seem. Diffusion models have some well-studied shortcuts that trade a bit of quality for speed, but nothing stops you from implementing the same for the other type.
I guess this makes specific language patterns cheaper and more artistic language patterns more expensive. This could be a good way to limit pirated and masqueraded materials submitted by students.
I'm kind of impressed by the speed of it. I told it to write an MQTT topic pattern matcher based on a trie and it spat out something reasonable on the first try. It had a few compilation issues though, but fair enough.
> I strongly believe that this will be a really important technique in the near future.
I share the same belief, but regardless of cost. What excites me is the ability to "go both ways", edit previous tokens after others have been generated, using other signals as "guided generation", and so on. Next token prediction works for "stories", but diffusion matches better with "coding flows" (i.e. going back and forth, add something, come back, import something, edit something, and so on).
It would also be very interesting to see how applying this at different "abstraction layers" would work. Say you have one layer working on ctags, one working on files, and one working on "functions". And they all "talk" to each other, passing context and "re-diffusing" their respective layers after each change. No idea where the data for this would come, maybe from IDEs?
I wonder if there's a way to do diffusion within some sort of schema-defined or type constrained space.
A lot of people these days are asking for structured output from LLMs so that a schema is followed. Even if you train on schema-following with a transformer, you're still just 'hoping' in the end that the generated json matches the schema.
I'm not a diffusion expert, but maybe there's a way to diffuse one value in the 'space' of numbers, and another value in the 'space' of all strings, as required by a schema:
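Not diffusion at all, but a sketch of the interface I'm imagining: each schema field gets filled by a generator constrained to that field's type, and the result is checked against the schema rather than hoped about. The `sample_number`/`sample_string` functions are hypothetical stand-ins for a type-constrained denoising step:

```python
# Sketch only: fill each schema field from a generator constrained to that
# field's type, then validate. sample_number / sample_string are hypothetical
# stand-ins for a type-constrained denoising step.
import random
import string
from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number", "minimum": 0},
        "name": {"type": "string", "maxLength": 12},
    },
    "required": ["price", "name"],
}

def sample_number(spec: dict) -> float:
    return round(random.uniform(spec.get("minimum", 0), 100), 2)

def sample_string(spec: dict) -> str:
    n = min(spec.get("maxLength", 8), 8)
    return "".join(random.choices(string.ascii_lowercase, k=n))

samplers = {"number": sample_number, "string": sample_string}

candidate = {
    key: samplers[spec["type"]](spec)
    for key, spec in schema["properties"].items()
}
validate(candidate, schema)  # passes by construction instead of by hope
print(candidate)
```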
I'm not sure how far this could lead. Could you diffuse more complex schemas that generalize to an arbitrary syntax tree? E.g. diffuse some code in a programming language that is guaranteed to be type-safe?
I, for one, am willing to trade accuracy for speed. I'd rather have 10 iterations of poor replies which forces me to ask the right question than 1 reply which takes 10 times as long and _maybe_ is good, since it tries to reason about my poor question.
Personally I like asking coding agents a question and getting an answer back immediately. Systems like Junie that go off and research a bunch of irrelevant things, then ask permission, then do a lot more irrelevant research, ask more permission, and then 15 minutes later give you a mountain of broken code are a waste of time if you ask me. (Even if you give permission in advance.)
Having token embeddings with diffusion models, for 16x16 transformer encoding. Image is tokenized before transformers compile it. If decomposed virtualization modulates according to a diffusion model.
In companies without money printing machines, they sacrifice caching to get determinism and everything ends up slow.
I’m at Google today and even with all the resources, I am absolutely most bottlenecked by the Presubmit TAP and human review latency. Making CLs in the editor takes me a few hours. Getting them in the system takes days and sometimes weeks.
Indeed. You'd think Google would test for how well people will cope with boredom, rather than their bait-and-switch interviews that make it seem like you'll be solving l33tcode every evening.
Presumably the "days and sometimes weeks" thing is entirely down to human review latency?
Yes and no; I'd estimate 1/3 to 1/2 of that is down to test suites being flaky and time-consuming to run. IIRC the shortest build I had was 52m for the Android Wear iOS app, easily 3 hours for Android.
Most of my experience writing concurrent/parallel code in (mainly) Java has been rewriting half-baked stuff that would need a lot of testing with straightforward reliable and reasonably performant code that uses sound and easy-to-use primitives such as Executors (watch out for teardown though), database transactions, atomic database operations, etc. Drink the Kool Aid and mess around with synchronized or actors or Streams or something and you're looking at a world of hurt.
I've written a limited number of systems that needed tests that probe for race conditions by doing something like having 3000 threads run a random workload for 40 seconds. I'm proud of that "SuperHammer" test on a certain level but boy did I hate having to run it with every build.
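A stripped-down sketch of that kind of hammer test in Python (whether it actually trips on any given run depends on the interpreter and timing, which is exactly why real versions need huge workloads and long runtimes):

```python
# A stripped-down "hammer" style race probe: many workers banging on shared
# state, then a single check at the end. The unsafe variant may or may not
# lose updates on a given run; real hammer tests run far bigger workloads.
import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self):
        self.value += 1          # read-modify-write, not atomic

    def safe_increment(self):
        with self.lock:          # sound primitive: take the lock
            self.value += 1

def hammer(increment, workers: int = 200, iterations: int = 10_000) -> int:
    counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(lambda: [increment(counter) for _ in range(iterations)])
    return counter.value

expected = 200 * 10_000
print("unsafe:", hammer(Counter.unsafe_increment), "expected:", expected)
print("safe:  ", hammer(Counter.safe_increment), "expected:", expected)
```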
Developer time is more expensive than machine time, but at most companies it isn't 10000x more expensive. Google is likely an exception because it pays extremely well and has access to very cheap machines.
Even then, there are other factors:
* You might need commercial licenses. It may be very cheap to run open source code 10000x, but guess how much 10000 Questa licenses cost.
* Moore's law is dead; Amdahl's law very much isn't. Not everything is embarrassingly parallel.
* Some people care about the environment. I worked at a company that spent 200 CPU hours on every single PR (even to fix typos; I failed to convince them they were insane for not using Bazel or similar). That's a not insignificant amount of CO2.
> Moore's law is dead; Amdahl's law
Yes, but the OP specifically is talking about CI for large numbers of pull requests, which should be very parallelizable (I can imagine exceptions, but only with anti-patterns, e.g. if your test pipeline makes some kind of requests to something that itself isn't scalable).
Actually, OP was talking about the throughput of running on a large number of pull requests and the latency of running on a single pull request. The latter is not necessarily parallelizable.
That's solvable with modern cloud offerings - Provision spot instances for a few minutes and shut them down afterwards. Let the cloud provider deal with demand balancing.
I think the real issue is that developers waiting for PRs to go green are taking a coffee break between tasks, not sitting idly getting annoyed. If that's the case you're cutting into rest time and won't get much value out of optimizing this.
Both companies I've worked in recently have been too paranoid about IP to use the cloud for CI.
Anyway I don't see how that solves any of the issues except maybe cost to some degree (but maybe not; cloud is expensive).
Were they running CI on their own physical servers under a desk or in a basement somewhere, or renting their own racks in a data center just for CI?
There are non-IP reasons to go outside the big clouds for CI. Most places I worked over the years had dedicated hardware for at least some CI jobs because otherwise it's too hard to get repeatable performance numbers. At some point you have an outage in production caused by a new build passing tests but having much lower performance, or performance is a feature of the software being sold, and so people decide they need to track perf with repeatable load tests.
Sorta. For CI/CD you can use spot instances and spin them down outside of business hours, so they can end up being cheaper than buying many really beefy machines and amortizing them over the standard depreciation schedule.
That’s paranoid to the point of lunacy.
Azure for example has “confidential compute” that encrypts even the memory contents of the VM such that even their own engineers can’t access the contents.
As long as you don’t back up the disks and use HTTPS for pulls, I don’t see a realistic business risk.
If a cloud like Azure or AWS got caught stealing competitor code they’d be sued and immediately lose a huge chunk of their customers.
It makes zero business sense to do so.
PS: Microsoft employees have made public comments saying that they refuse to even look at some open source repository to avoid any risk of accidentally “contaminating” their own code with something that has an incompatible license.
The way Azure implements CC unfortunately lowers a lot of the confidentiality. It's not their fault exactly, more like a common side effect of trying to make CC easy to use. You can certainly use their CC to do secure builds, but it would require an absolute expert in confidential computing and remote attestation to get it right. I've done design reviews of such proposals before and there are a lot of subtle details.
I don't know about Azure's implementation of confidential compute, but GCP's version essentially relies on AMD SEV-SNP. Historically there have been vulnerabilities that undermine the confidentiality guarantee.
Mandatory XKCD: https://xkcd.com/538/
Nobody's code is that secret, especially not from a vendor like Microsoft.
Unless all development is done with air-gapped machines, realistic development environments are simultaneously exposed to all of the following "leakage risks" because they're using third-party software, almost certainly including a wide range of software from Microsoft:
- Package managers, including compromised or malicious packages.
- IDEs and their plugins; the latter especially can be a security risk.
- CLI and local build tools.
- SCM tools such as GitHub Enterprise (Microsoft again!)
- The CI/CD tooling including third-party tools.
- The operating system itself. Microsoft Windows is still a very popular platform, especially in enterprise environments.
- The OS management tools, anti-virus, monitoring, etc...
And on and on.
Unless you live in a total bubble world with USB sticks used to ferry your dependencies into your windowless facility underground, your code is "exposed" to third parties all of the time.
Worrying about possible vulnerabilities in encrypted VMs in a secure cloud facility is missing the real problem that your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them.
Yes, this happens. All the time. You just don't know because you made the perfect the enemy of the good.
> missing the real problem that your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them.
> Yes, this happens. All the time. You just don't know because you made the perfect the enemy of the good.
That only happens in cowboy coding startups.
In places where security matters (e.g. fintech jobs), they just lock down your PC (no admin rights), encrypt the storage and part of your VPN credentials will be on a part of your storage that you can't access.
Between Github and Copilot, MS has a copy of all of your code.
> ...your developers are probably using their home gaming PC for work because it's 10x faster than the garbage you gave them...
I went from a waiter to startup owner and then acquirer, then working for Google. No formal education, no "real job" till Google, really. I'm not sure even when I was a waiter I had this...laissez-faire? naive?...sense of how corporate computing worked.
That aside, the whole argument stands on "well, other bad things can happen more easily!", which we agree is true, but also, it isn't an argument against it.
From a Chesterton's Fence view, one man's numbskull insistence on not using AWS that must only be due to pointy-haired-boss syndrome is another's valiant self-hosting that saved seven figures. Hard to say from the bleachers, especially with OP making neither claim.
>Do companies not just double their CI workers after hearing people complain?
They do not.
I don't know if it's a matter of justifying management levels, but these discussions are often drawn out and belabored in my experience. By the time you get approval, or even worse, rejected, for asking for more compute (or whatever the ask is), you've spent way more money on the human resource time than you would ever spend on the requested resources.
This is exactly my experience with asking for more compute at work. We have to prepare loads of written justification, come up with alternatives or optimizations (which we already know won't work), etc. and in the end we choose the slow compute and reduced productivity over the bureaucracy.
And when we manage to make a proper request it ends up being rejected anyways as many other teams are asking for the same thing and "the company has limited resources". Duh.
I have never once been refused by a manager or director when I am explicitly asking for cost approval. The only kind of long and drawn out discussions are unproductive technical decision making. Example: the ask of "let's spend an extra $50,000 worth of compute on CI" is quickly approved but "let's locate the newly approved CI resource to a different data center so that we have CI in multiple DCs" solicits debates that can last weeks.
> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.
I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.
Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).
That kind of asymmetry tends, unless somebody has a strong overriding vision of where the value really comes from, to result in penny pinching on the wrong things.
It's more than that. You can measure salaries too, measurement isn't the issue.
The problem is that if you let people spend the company's money without any checks or balances, they'll just blow through unlimited amounts of it. That's why companies always have lots of procedures and policies around expense reporting. There's no upper limit to how much money developers will spend on cloud hardware given the chance, as the example above of casually running a test 10,000 times in parallel demonstrates nicely.
CI doesn't require you to fill out an expense report every time you run a PR thank goodness, but there still has to be a way to limit financial liability. Usually companies do start out by doubling cluster sizes a few times, but each time it buys a few months and then the complaints return. After a few rounds of this managers realize that demand is unlimited and start pushing back on always increasing the budget. Devs get annoyed and spend an afternoon on optimizations, suddenly times are good again.
The meme on HN is that developer time is always more expensive than machine time, but I've been on both sides of this and seen how the budgets work out. It's often not true, especially if you use clouds like Azure which are overloaded and expensive, or have plenty of junior devs, and/or teams outside the US where salaries are lower. There's often a lot of low hanging fruit in test times so it can make sense to optimize, even so, huge waste is still the order of the day.
Even Google cannot buy more old Intel Macs or Pixel 6s or Samsung S20s to increase their testing on those devices (as an example).
Maybe that affects fewer devs who don't need to test on actual hardware, but plenty of apps do. Pretty much anything that touches a GPU driver, for example, like a game.
No it is not. Senior management often has a barely disguised contempt for engineering and for spending money to do a better job. They listen much more to complaints from sales.
That depends on the company.
My last company was unsure about paying $20/mo to get a Copilot license for all the engineers.
I've seen people not pay for Slack and just deal with disappearing messages and use Skype (back in the day) for group calls.
You're confusing throughput and latency. Lengthy CI runs increase the latency of developer output, but they don't significantly reduce overall throughput, given a developer will typically be working on multiple things at once, and can just switch tasks while CI is running. The productivity cost of CI is not zero, but it's way, way less than the raw wallclock time spent per run.
Then also factor in that most developer tasks are not even bottlenecked by CI. They are bottlenecked primarily by code review, and secondarily by deployment.
Lengthy CI runs do reduce throughput, as working around high CI latencies pushes people towards juggling more PRs at once, meaning more merge conflicts to deal with, and it increases the cost of a build failing transiently.
And context switching isn't free by any means.
Still, if LLM agents keep improving then the bottleneck of waiting on code review won't exist for the agents themselves, there'll just be a stream of always-green branches waiting for someone to review and merge them. CI costs will still matter though.
I’m currently at google (opinions not representative of my employer’s etc) and this is true for things that run in a data center but it’s a lot harder for things that need to be tested on physical hardware like parts of Android or CrOS.
Writing testing infrastructure so that you can just double workers and get a corresponding doubling in productivity is non-trivial. Certainly I've never seen anything like Google's testing infrastructure anywhere else I've worked.
Yeah, Google's infrastructure is unique because Blaze is tightly integrated with the remote execution workers and can shard testing work across many machines automatically. Most places can't do that, so once you have enough hardware that queue depth isn't too big, you can't make anything go faster by adding more of it; you can only scale vertically or optimize. But if you're using hosted CI SaaS it's often not easy to get bigger machines, or the bigger machines are superlinear in cost.
Many companies are strangely reluctant to spend money on hardware for developers. They might refuse to spend $1,000 on a better laptop to be used for the next three years by an employee, whose time costs them that much money in a single afternoon.
That's been a pet peeve of mine for so long. (Glad my current employer gets me the best 1.5ℓ machine from Dell every few years!)
On the other hand I've seen many overcapitalized pre-launch startups go for months with a $20,000+ AWS bill without thinking about it then suddenly panic about what they're spending; they'd find tens of XXXXL instances spun up doing nothing, S3 buckets full of hundreds of terabytes of temp files that never got cleared out, etc. With basic due diligence they could have gotten that down to $2k a month, somebody obsessive about cost control could have done even better.
I have faced this at each of the $50B in profit companies I have worked at.
IME it's less of a "throw more resources" problem and more of a "stop using resources in literally the worst way possible" one.
CI caching is, apparently, extremely difficult. Why spend a couple of hours learning about your CI caches when you can just download and build the same pinned static library a billion times? The server you're downloading from is (of course) someone else's problem and you don't care about wasting their resources either. The power you're burning by running CI for three hours instead of one is also someone else's problem. Compute time? Someone else's problem. Cloud costs? You bet it's someone else's problem.
Sure, some things you don't want to cache. I always do a 100% clean build when cutting a release or merging to master. But for intermediate commits on a feature branch? Literally no reason not to cache builds the exact same way you do on your local machine.
Not really, in most small companies/departments, £100k a month is considered a painful cloud bill and adding more EC2 instances to provide cloud runners can add 10% to that easily.
My personal experience: We run over 1.1m test cases to verify every PR that I submit, and there are more test cases that don't get run on every commit and instead get run daily or on-demand.
At that scale getting quick turnaround is a difficult infrastructure problem, especially if you have individual tests that take multiple seconds or suites that take multiple minutes (we do, and it's hard to actually pull the execution time down on all of them).
I've never personally heard "we don't have the budget" or "we don't have enough machines" as answers for why our CI turnaround isn't 5 minutes, and it doesn't seem to me like the answer is just doubling the core count in every situation.
The scenario I work on daily (a custom multi-platform runtime with its own standard library) does by necessity mean that builds and testing are fairly complex though. I wouldn't be surprised if your assertion (just throw more resources at it) holds for more straightforward apps.
- Just spin up more test instances. If the AI is as good as people claim then it's still way cheaper than extra programmers.
- Write fast code. At $WORK we can test roughly a trillion things per CPU physical core year for our primary workload, and that's in a domain where 20 microsecond processing time is unheard of. Orders of magnitude speed improvements pay dividends quickly.
- LLMs don't care hugely about the language. Avoid things like rust where compile times are always a drag.
- That's something of a strange human problem you're describing. Once the PR is reviewed, can't you just hit "auto-merge" and go to the next task, only circling back if the code was broken? Why is that a significant amount of developer time?
- The thing you're observing is something every growing team witnesses. You can get 90% of the way to what you want by giving the build system a greenfield re-write. If you really have to run 100x more tests, it's worth a day or ten sanity checking docker caching or whatever it is your CI/CD is using. Even hermetic builds have inter-run caching in some form; it's just more work to specify how the caches should work. Put your best engineer on the problem. It's important.
- Be as specific as possible in describing test dependencies. The fastest tests are the ones which don't run.
- Separate out unit tests from other forms of tests. It's hard to write software operating with many orders of magnitude of discrepancies, and tests are no exception. Your life is easier if conceptually they have a separate budget (e.g., continuous fuzz testing or load testing or whatever). Unit tests can then easily be fast enough for a developer to run all the changed ones on precommit. Slower tests are run locally when you think they might apply. The net effect is that you don't have the sort of back-and-forth with your CI that actually causes lost developer productivity because the PR shouldn't have a bunch of bullshit that's green locally and failing remotely.
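To make that last point concrete, a rough sketch of a "run only what changed" precommit step; pytest and the tests/test_<module>.py naming convention here are assumptions, stand-ins for whatever your repo actually uses:

```python
# Hedged sketch of "only run what changed" for precommit: map each changed
# source file to a conventionally named test file and run just those.
# The tests/test_<name>.py layout and pytest are assumptions about your repo.
import subprocess
import sys
from pathlib import Path

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.endswith(".py")]

def tests_for(path: str) -> Path | None:
    candidate = Path("tests") / f"test_{Path(path).name}"
    return candidate if candidate.exists() else None

targets = sorted({str(t) for f in changed_files() if (t := tests_for(f))})
if targets:
    sys.exit(subprocess.call(["pytest", "-q", *targets]))
print("no affected tests found; skipping")
```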
These are all good suggestions, albeit many are hard to implement in practice.
> That's something of a strange human problem you're describing.
Are we talking about agent-written changes now, or human? Normally reviewers expect tests to pass before they review something, otherwise the work might change significantly after the review in order to fix broken tests. Auto merges can fail due to changes that happened in the meantime; they aren't auto in many cases.
Once latency goes beyond a minute or two people get distracted and start switching tasks to something else, which slows everything down. And yes code review latency is a problem as well, but there are easier fixes for that.
There are a couple mitigating considerations
1. As implementation phase gets faster, the bottleneck could actually switch to PM. In which case, changes will be more serial, so a lot fewer conflicts to worry about.
2. I think we could see a resurrection of specs like TLA+. Most engineers don't bother with them, but I imagine code agents could quickly create them, verify the code is consistent with them, and then require fewer full integration tests.
3. When background agents are cleaning up redundant code, they can also clean up redundant tests.
4. Unlike human engineering teams, I expect AIs to work more efficiently on monoliths than with distributed microservices. This could lead to better coverage on locally runnable tests, reducing flakes and CI load.
5. It's interesting that even as AI increases efficiency, that increased velocity and sheer amount of code it'll write and execute for new use cases will create its own problems that we'll have to solve. I think we'll continue to have new problems for human engineers to solve for quite some time.
CI should just run on each developer's machine. As in, each developer should have a local instance of the CI setup in a VM or a docker container. If tests pass, the result is reported to a central server.
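A minimal sketch of that idea: run the suite inside a pinned container locally, then report pass/fail plus the commit hash to a central endpoint. The image name and report URL are placeholders, not a real service:

```python
# Minimal sketch of "CI on the developer's machine": run the test suite in a
# pinned container, then report pass/fail plus the commit hash to a central
# server. Image name and report URL are placeholders.
import json
import os
import subprocess
import urllib.request

def run_local_ci(image: str = "myproject-ci:pinned") -> bool:
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}:/src", "-w", "/src",
         image, "pytest", "-q"]
    )
    return result.returncode == 0

def report(passed: bool, endpoint: str = "https://ci.example.internal/results") -> None:
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    body = json.dumps({"commit": commit, "passed": passed}).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    report(run_local_ci())
```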
For Python apps, I've gotten good CI speedups by moving over to the astral.sh toolchain, using uv for the package installation with caching. Once I move to their type-checker instead of mypy, that'll speed the CI up even more. The playwright test running will then probably be the slowest part, and that's only in apps with frontends.
(Also, Hi Mike, pretty sure I worked with you at Google Maps back in early 2000s, you were my favorite SRE so I trust your opinion on this!)
LLM making a quick edit, <100 lines... Sure. Asking an LLM to rubber-duck your code, sure. But integrating an LLM into your CI is going to end up costing you 100s of hours productivity on any large project. That or spend half the time you should be spending learning to write your own code, dialing down context sizing and prompt accuracy.
I really really don't understand the hubris around llm tooling, and don't see it catching on outside of personal projects and small web apps. These things don't handle complex systems well at all, you would have to put a gun in my mouth to let one of these things work on an important repo of mine without any supervision... And if I'm supervising the LLM I might as well do it myself, because I'm going to end up redoing 50% of its work anyways..
I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLM's are useful? Like how many people need to say that they find it makes them more productive before you'll shift your perspective?
> I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLM's are useful?
The post you are responding to literally acknowledges that LLMs are useful in certain roles in coding in the first sentence.
> Like how many people need to say that they find it makes them more productive before you'll shift your perspective?
Argumentum ad populum is not a good way of establishing fact claims beyond the fact of a belief being popular.
...and my comment clearly isn't talking about that, but about the suggestion that it's useless to write code with an LLM because you'll end up rewriting 50% of it.
If everyone has an opinion different to mine, I dont instantly change my opinion, but I do try and investigate the source of the difference, to find out what I'm missing or what they are missing.
The polarisation between people that find LLMs useful or not is very similar to the polarisation between people that find automated testing useful or not, and I have a suspicion they have the same underlying cause.
You seem to think everyone shares your view; around me I see a lot of people acknowledging they are useful to a degree, but also clearly finding limits in a wide array of cases, including that they really struggle with logical code, architectural decisions, re-using the right code patterns, larger-scale changes that aren't copy-paste, etc.
So far what I see is that if I provide lots of context and clear instructions for a mostly non-logical area of code, I can speed myself up by about 20-40%, but that only works on about 30-50% of the problems I solve day to day at my day job.
So basically - it's about a rough 20% improvement in my productivity - because I spend most of my time on the difficult things it can't do anyway.
Meanwhile these companies are raising billion dollar seed rounds and telling us that all programming will be done by AI by next year.
> Meanwhile these companies are raising billion dollar seed rounds and telling us that all programming will be done by AI by next year.
Which is the same thing they said last year, and hasn't panned out. But surely this time it'll be right...
> at what point do you accept that maybe LLM's are useful?
LLMs are useful, just not for every task and price point.
That's a tool, and it depends what you need to do. If it fits someone need and make them more productive, or even simply enjoy more the activity, good.
Just because two people are fixing something to a wall doesn't mean the same tool will work for both. Gum, pushpin, nail, screw, bolts?
The parent thread did mention they use LLM successfully in small side project.
People say they are more productive using visual basic, but that will never shift my perspective on it.
Code is a liability. Code you didn't write is a ticking time bomb.
They say it's only effective for personal projects, but there's literally evidence of LLMs being used for what he says they can't be used for. Actual physical evidence.
It’s self delusion. And also the pace of AI is so fast he may not be aware of how fast LLMs are integrating into our coding environments. Like 1 year ago what he said could be somewhat true but right now what he said is clearly not true at all.
I've used Claude with a large, mature codebase and it did fine. Not for every possible task, but for many.
Probably, Mercury isn't as good at coding as Claude is. But even if it's not, there's lots of small tasks that LLMs can do without needing senior engineer level skills. Adding test coverage, fixing low priority bugs, adding nice animations to the UI etc. Stuff that maybe isn't critical so if a PR turns up and it's DOA you just close it, but which otherwise works.
Note that many projects already use this approach with bots like Renovate. Such bots also consume a ton of CI time, but it's generally worth it.
IMHO LLMs are notoriously bad at test coverage. They usually hard code a value to have the test pass, since they lack the reasoning required to understand why the test exists or the concept of assertion, really
I don’t know, Claude is very good at writing that utterly useless kind of unit test where every dependency is mocked out and the test is just the inverted dual of the original code. 100% coverage, nothing tested.
Yeah and that's even worse because there's not an easy metric you can have the agent work towards and get feedback on.
I'm not that into "prompt engineering" but tests seem like a big opportunity for improvement. Maybe something like (but much more thorough):
1. "Create a document describing all real-world actions which could lead to the code being used. List all methods/code which gets called before it (in order) along with their exact parameters and return value. Enumerate all potential edge cases and errors that could occur and if it ends up influencing this task. After that, write a high-level overview of what need to occur in this implementation. Don't make it top down where you think about what functions/classes/abstractions which are created, just the raw steps that will need to occur" 2. Have it write the tests 3. Have it write the code
Maybe TDD ends up worse but I suspect the initial plan which is somewhat close to code makes that not the case
Writing the initial doc yourself would definitely be better, but I suspect just writing one really good one, then giving it as an example in each subsequent prompt captures a lot of the improvement
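Sketched as a driver loop, the three-step flow above might look like this; `llm()` is a placeholder for whatever completion call you use, and the prompts are illustrative only:

```python
# Sketch of the plan -> tests -> code flow as a driver loop; llm() is a
# placeholder for your model client, and the prompts are illustrative only.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def plan_tests_code(task: str) -> dict[str, str]:
    plan = llm(
        "Describe the real-world call paths, inputs, and edge cases for this "
        "task, as raw steps rather than a class/function design:\n" + task
    )
    tests = llm("Write unit tests covering this plan:\n" + plan)
    code = llm("Write an implementation that passes these tests:\n" + tests)
    return {"plan": plan, "tests": tests, "code": code}
```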
I've not gone into it yet, but I think BDD would fit reasonably well with agents and generating tests that aren't entirely useless.
This is why unit tests are the least useful kind of test and regression tests are the most useful.
I think unit tests are best written /before/ the real code and thrown out after. Of course, that's extremely situational.
Don't want to put words in the parent commenter's mouth, but I think the key word is "unsupervised". Claude doesn't know what it doesn't know, and will keep going round the loop until the tests go green, or until the heat death of the universe.
Yes, but you can just impose timeouts to solve that. If it's unsupervised the only cost is computation.
Do the opposite - integrate your CI into your LLM.
Make it run tests after it changes your code and either confirm it didnt break anything or go back and try again.
He is simply observing that if PR numbers and launch rates increase dramatically CI cost will become untenable.
Before cars people spent little on petroleum products or motor oil or gasoline or mechanics. Now they do. That's how systems work. You wanna go faster well you need better roads, traffic lights, on ramps, etc. you're still going faster.
Use AI to solve the IP bottlenecks, or build more features that earn more revenue that buys more CI boxes. Same as if you added 10 devs, which you effectively are with AI, so why wouldn't some of the dev support costs go up?
Are you not in a place where you can make an efficiency argument to get more ci or optimize? What's a ci box cost?
Any modern MacBook can run those tests 100x faster than the crappy cloud runners most companies use. You can also configure runners that run locally and get the benefit of those speed gains. So all of this is really a business and technical problem that is solved for those who want to solve it. It can be solved very cheap, or it can be solved very expensive. Regardless, it's precisely those types of efficiency gains that motivate companies to finally do something about it.
And if not, then enjoy being paid waiting for CI to go green. Maybe it's a reminder to go take a break.
It will be worse when the process is super optimized and the expectation changes. So now instead of those 2 PRs that went to prod today because everyone knows CI takes forever, you'll be expected to push 8 because in our super optimized pipeline it only takes seconds. No excuses. Now the bottleneck is you.
This might end up being less of an issue.
If I am coding, I want to stay in the flow and get my PR green asap, so I can continue on the project.
If I am orchestrating agents, I might have 10 or 100 PRs in the oven. In that case I just look at the ones that finish CI.
It's gonna be less of a flow, or at least a different kind of flow, IMO. (Until you can just crank out design docs and whiteboard sessions and have the agents fully autonomously get their work green.)
> If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.
I am guesstimating (based on previous experience self-hosting the runner for macOS builds) that the project I am working on could get something like 2-5x pipeline performance at half the cost just by using self-hosted runners on bare-metal rented machines like Hetzner. Maybe I am naive, and I am not the person that would be responsible for it - but having a few bare-metal machines you can use in the off hours to run regression tests, for less than you are paying the existing CI runner just for builds, and that speed everything up massively, seems like a pure win for relatively low effort. Sure, everyone already has stuff on their plate and would rather pay an external service to do it - but TBH once you have this kind of compute handy you will find uses for it anyway and end up just doing things more efficiently. And knowing how to deal with bare metal and utilize this kind of compute sounds like a generally useful skill - but I rarely encounter people enthusiastic about making this kind of move. It's usually: hey, let's move to this other service that has slightly cheaper instances and a proprietary caching layer, so that we can get locked into their CI crap.
It's not like these services have zero downtime, are bug free, or require no integration effort - I just don't see why going bare metal is such a taboo topic, even for simple stuff like builds.
Yep. For my own company I used a bare metal machine in Hetzner running Linux and a Windows VM along with a bunch of old MacBook Pros wired up in the home office for CI.
It works, and it's cheap. A full CI run still takes half an hour on the Linux machine (the product [1] is a kind of build system for shipping desktop apps cross platform, so there's lots of file IO and cryptography involved). The Macs are by far the fastest. The M1 Mac is embarrassingly fast. It can complete the same run in five minutes despite the Hetzner box having way more hardware. In fairness, it's running both a Linux and Windows build simultaneously.
I'm convinced the quickest way to improve CI times in most shops is to just build an in-office cluster of M4 Macs in an air conditioned room. They don't have to be HA. The hardware is more expensive but you don't rent per month, and CI is often bottlenecked on serial execution speed so the higher single threaded performance of Apple Silicon is worth it. Also, pay for a decent CI system like TeamCity. It helps reduce egregious waste from problems like not caching things or not re-using checkout directories. In several years of doing this I haven't had build caching related failures.
[1] https://hydraulic.dev/
> 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner
This is absolutely the case. It's a combination of having dedicated CPU cores, dedicated memory bandwidth, and (perhaps most of all) dedicated local NVMe drives. We see a 2x speed-up running _within VMs_ on bare metal.
> And knowing how to deal with bare metal/utilize this kind of compute sounds generally useful skill - but I rarely encounter people enthusiastic about making this kind of move
We started our current company for this reason [0]. A lot of people know this makes sense on some level, but not many people want to do it. So we say we'll do it for you, give you the engineering time needed to support it, and you'll still save money.
> I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.
It is decreasingly so from what I see. Enough people have been variously burned by public cloud providers to know they are not a panacea. But they just need a little assistance in making the jump.
[0] - https://lithus.eu
At the last place I worked at, which was just a small startup with 5 developers, I calculated that a server workstation in the office would be both cheaper and more performant than renting a similar machine in the cloud.
Bare metal makes such a big difference for test and CI scenarios. It even has an integrated a GPU to speed up webdev tests. Good luck finding an affordable machine in the cloud that has a proper GPU for this kind of a use-case
Is it a startup or small business ? In my book a startup expects to scale and hosting bare metal HW in an office with 5 people means you have to figure everything out again when you get 20/50/100 people - IMO not worth the effort and hosting hardware has zero transferable skills to your product.
Running on managed bare metal servers is theoretically the same as running any other infra provider except you are on the hook for a bit more maintenance, you scale to 20 people you just rent a few more machines. I really do not see many downsides for the build server/test runner scenario.
The nice part about most CI workloads is that they can almost always be split up and executed in parallel. Make sure you're utilizing every core on every CI worker and your worker pools are appropriately sized for the workload. Use spot instances and add auto scaling where it makes sense. No one should be waiting more than a few minutes for a PR build. Exception being compile time which can vary significantly between languages. I have a couple projects that are stuck on ancient compilers because of CPU architecture and C variant, so those will always be a dog without effort to move to something better. Ymmv
As an example we recently had a Ruby application that had a test suite that was taking literally an hour per build, but turned out it was running entirely sequential by default, using only 1 core. I spent an afternoon migrating our CI runners to split the workload across all available cores and now it's 5 minutes per build. And that was just the low hanging fruit, it can be significantly improved further but there's obviously diminishing returns
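The fix in our case was Ruby-specific, but the general shape is just: discover the test files, chunk them across cores, and run the chunks concurrently. A sketch in Python, assuming the test files are independent and pytest is the runner (both assumptions about your setup):

```python
# General shape of "use every core": chunk the test files and run the chunks
# concurrently. Assumes independent test files; pytest is just an example runner.
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_chunk(files: list[str]) -> int:
    return subprocess.call(["pytest", "-q", *files])

test_files = sorted(str(p) for p in Path("tests").rglob("test_*.py"))
n = os.cpu_count() or 4
chunks = [test_files[i::n] for i in range(n)]   # round-robin split

with ThreadPoolExecutor(max_workers=n) as pool:
    codes = list(pool.map(run_chunk, [c for c in chunks if c]))

sys.exit(max(codes, default=0))
```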
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers
No, this is common. The devs just haven't grokked dependency inversion. And I think the rate of new devs entering the workforce will keep it that way forever.
Here's how to make it slow:
* Always refer to "the database". You're not just storing and retrieving objects from anywhere - you're always using the database.
* Work with statements, not expressions. Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance. This will force you to sequentialise the tests (simultaneous tests would otherwise race and cause flakiness) plus you get to write a bunch of setup and teardown and wipe state between tests.
* If you've done the above, you'll probably need to wait for state changes before running an assertion. Use a thread sleep, and if the test is ever flaky, bump up the sleep time and commit it if the test goes green again.
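And the contrast, in sketch form: keep "the balance is the sum of the transactions" as a pure expression, so the test needs no database, no setup/teardown ordering, and no sleeps (the names here are hypothetical):

```python
# The contrast in sketch form: the balance as a pure expression, so the test
# needs no database, no ordering between tests, and no sleeps.
from decimal import Decimal

def balance(transactions: list[Decimal]) -> Decimal:
    return sum(transactions, Decimal("0"))

def test_balance_is_sum_of_transactions():
    txns = [Decimal("10.00"), Decimal("-2.50"), Decimal("0.75")]
    assert balance(txns) == Decimal("8.25")

if __name__ == "__main__":
    test_balance_is_sum_of_transactions()
    print("ok")
```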
Nah. Tests could be run in N processes, each with its own database configured to skip full fsync. That resolves most of the issues and makes testing much, much simpler.
> Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance
Er, doesn’t this boil down to saying “not testing database end state (trusting in transactionality) is faster than testing it”?
I mean sure, trivially true, but not a good idea. I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler. Bad code, to be sure, but common in many contexts in my experience.
> I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler.
Yes! That's my current codebase you're describing! If you interweave the database all throughout your accounting logic, you absolutely can bury those kinds of problems for people to find later. But remember, run one test at a time so that you don't accidentally discover that your database transactions aren't protecting you nearly as well as you thought.
In fact, screw database transactions. Pay the cost of object-relation impedance mismatch and unscalable joins, but make sure you avoid the benefits, by turning off ACID for performance reasons (probably done for you already) and make heavy use of LINQ so that values are loaded in and out of RAM willy-nilly and thereby escape their transaction scopes.
The C# designers really leaned into the 'statements' not 'expression' idea! There's no transaction context object returned from beginTrans which could be passed into subsequent operations (forming a nice expression) and thereby clear up any "am I in a transaction?" questions.
But yeah, right now it's socially acceptable to plumb the database crap right through the business logic. If we could somehow put CSS or i18n in the business logic, we'd need to put a browser into our test suite too!
Wow, your story gives me flashbacks to the 1990s when I worked in a mainframe environment. Compile jobs submitted by developers were among the lowest priorities. I could make a change to a program, submit a compile job, and wait literally half a day for it to complete. Then I could run my testing, which again might have to wait for hours. I generally had other stuff I could work on during those delays but not always.
This is because coders didn't spend enough time making their tests efficient. Maybe LLM coding agents can help with that.
Call me a skeptic but I do not believe LLMs are significantly altering the time between commits so much that CI is the problem.
However, improving CI performance is valuable regardless.
Yet, now that I have added an LLM workflow to my coding, the value of my old and mostly useless workflows is 10x'd.
Git checkpoints, code linting and my naive suite of unit and integration tests are now crucial to my LLM not wasting too much time generating total garbage.
It’s because people don’t know how to write tests. All of the “don’t do N select queries in a for loop” comments made in PRs are completely ignored in tests.
Each test can output many db queries. And then you create multiple cases.
People don’t even know how to write code that just deals with N things at a time.
I am confident that tests run slowly because the code that is tested completely sucks and is not written for batch mode.
Ignoring batch mode, tests are most of the time written in a way where test cases run sequentially. Yet attempts to run them concurrently result in flaky tests, because the way you write them and the way you design interfaces does not allow concurrent execution at all.
Another comment, code done by the best AI model still sucks. Anything simple, like a music player with a library of 10000 songs is something it can’t do. First attempt will be horrible. No understanding of concurrent metadata parsing, lists showing 10000 songs at once in UI being slow etc.
So AI is just another excuse for people writing horrible code and horrible tests. If it’s so smart , try to speed up your CI with it.
> This will make the CI bottleneck even worse.
I agree. I think there are potentially multiple solutions to this since there are multiple bottlenecks. The most obvious is probably network overhead when talking to a database. Another might be storage overhead if storage is being used.
Frankly another one is language. I suspect type-safe, compiled, functional languages are going to see some big advantages here over dynamic interpreted languages. I think this is the sweet spot that grants you a ton of performance over dynamic languages, gives you more confidence in the models changes, and requires less testing.
Faster turn-around, even when you're leaning heavily on AI, is a competitive advantage IMO.
It could go either way. Depends very much on what kind of errors LLMs make.
Type safe languages in theory should do well, because you get feedback on hallucinated APIs very fast. But if the LLM generally writes code that compiles, unless the compiler is very fast you might get out-run by an LLM just spitting out JavaScript at high speed, because it's faster to run the tests than wait for the compile.
The sweet spot is probably JIT compiled type safe languages. Java, Kotlin, TypeScript. The type systems can find enough bugs to be worth it, but you don't have to wait too long to get test results either.
In most companies the CI/Dev Tools team is a career dead end. There is no way to show a business impact; it's just a money pit that leadership can't/won't understand (and if they do start to understand it, then it becomes _their_ money pit, which is a career dead end for them). So no one who has their head on straight wants to spend time improving it.
And you can't even really say it's a short sighted attitude. It definitely is from a developer's perspective, and maybe it is for the company if dev time is what decides the success of the business overall.
> it's just a money pit that leadership can't/won't understand
In my experience it's the opposite: they want more automated testing, but don't want to pay for the friction this causes on productivity.
I haven't worked in places using off-the-shelf/SaaS CI in more than a decade so I feel my experience has been quite the opposite from yours.
We always worked hard to make the CI/CD pipeline as fast as possible. I personally worked on those kinds of projects as an SRE at two different employers: first a smaller 300-person shop where I was responsible for all of their infra needs (CI/CD, live deployments, and later a migration to k8s when it became stable enough for the workloads we ran, though still in its beta days), then at a 5k+ strong company, improving a CI/CD setup that used Jenkins as a backend, where we developed a completely different shim on top for developer experience while also building a bespoke worker scheduler/runner.
I haven't experienced a CI/CD setup that takes longer than 10 minutes to run in many, many years. I was quite surprised reading your comment and feel spoiled not to have felt this pain for more than a decade; I didn't really expect it was still an issue.
I think the prevalence of teams having a "CI guy" who often is developing custom glue, is a sign that CI is still not really working as well as it should given the age of the tech.
I've done a lot of work on systems software over the years so there's often tests that are very I/O or computation heavy, lots of cryptography, or compilation, things like that. But probably there are places doing just ordinary CRUD web app development where there's Playwright tests or similar that are quite slow.
A lot of the problems are cultural. CI times are a commons, so it can end in tragedy. If everyone is responsible for CI times then nobody is. Eventually management gets sick of pouring money into it and devs learn to juggle stacks of PRs on top of each other. Sometimes you get a lot of pushback on attempts to optimize CI because some devs will really scream about any optimization that might potentially go wrong (e.g. depending on your build system cache), even if caching nothing causes an explosion in CI costs. Not their money, after all.
This sounds like a strawman.
GPUs can do 1 million trillion instructions per second.
Are you saying it’s impossible to write a test that finishes in less than one second on that machine?
Is that a fundamental limitation or an incredibly inefficient test?
A million trillion operations per second is literally an exaflop. That's one hell of a GPU you have.
Thanks, I missed a factor of 1000x; it should be a million billion (a petaflop, which is roughly the right order of magnitude for a modern GPU's low-precision throughput).
>There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.
Testing every change incrementally is a vestige of the code being written by humans (and thus of the current approach where AI helps and/or replaces one given human), in small increments at that, and of failures being analyzed by individual humans who can keep only a limited number of things/dependencies in their head at once.
then kill the CI/CD
these redundant processes are for human interoperability
[dead]
Good God I hate CI. Just let me run the build automation myself dammit! If you're worried about reproducibility make it reproducible and hash the artifacts, make people include the hash in the PR comment if you want to enforce it.
The amount of time people waste futzing around in eg Groovy is INSANE and I'm honestly inclined to reject job offers from companies that have any serious CI code at this point.
It takes more work (serious CI code) to make CI run anywhere, such as your own computer. So you prefer companies that just use GHA? You can't get simpler than that.
I tried the playground and got a strange response. I asked for a regex pattern, and the model gave itself a little game-plan, then it wrote the pattern and started to write tests for it. But it never stopped writing tests. It continued to write tests of increasing size until I guess it reached a context limit and the answer was canceled. Also, for each test it wrote, it added a comment about if the test should pass or fail, but after about the 30th test, it started giving the wrong answer for those too, saying that a test should fail when actually it should pass if the pattern is correct. And after about the 120th test, the tests started to not even make sense anymore. They were just nonsense characters until the answer got cut off.
The pattern it made was also wrong, but I think the first issue is more interesting.
I had this happen to me on Claude Sonnet once. It started spitting out huge blocks of source code completely unrelated to my prompt, seemingly from its training data, and switching codebases once in a while... like, a few thousand lines of some C program, then switching to another JavaScript one, etc. it was insane!
Sounds like solidgoldmagikarp[0]. There must've been something in your prompt that is over-represented throughout the training data.
[0] https://www.lesswrong.com/posts/jbi9kxhb4iCQyWG9Y/explaining...
FWIW, I remember regular models doing this not that long ago, sometimes getting stuck in something like an infinite loop where they keep producing output that is only a slight variation on previous output.
if you shrink the context window on most models you'll get this type of behaviour. If you go too small you end up with basically gibberish even on modern models like Gemini 2.5.
Mercury has a 32k context window according to the paper, which could be why it does that.
I think that's a prime example showing that token prediction simply isn't good enough for correctness. It never will be. LLMs are not designed to reason about code.
This is too funny to be true.
In their tech report, they say this is based on:
> "Our methods extend [28] through careful modifications to the data and computation to scale up learning."
[28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion" (SEDD) model (https://arxiv.org/abs/2310.16834).
I wrote the first (as far as I can tell) independent from-scratch reimplementation of SEDD:
https://github.com/mstarodub/dllm
My goal was making it as clean and readable as possible. I also implemented the more complex denoising strategy they described (but didn't implement).
It runs on a single GPU in a few hours on a toy dataset.
ICYMI, DeepMind also has a Gemini model that is diffusion-based[1]. I've tested it a bit and while (like with this model) the speed is indeed impressive, the quality of responses was much worse than other Gemini models in my testing.
[1] https://deepmind.google/models/gemini-diffusion/
Is the Gemini Diffusion demo free? I've been on the waitlist for it for a few weeks now.
Yes it is.
From my minor testing I agree that it's crazy fast and not that good at being correct
Are there any rules for what can be uploaded to arxiv?
This is a marketing page turned into a PDF. I guess who cares, but could someone upload, say, a Facebook Marketplace listing screenshotted into a PDF?
Yes, arXiv requires that submissions be scientific research. And not just anyone can publish on arXiv; you need an endorsement from existing users.
That the research is in pursuit of a commercial product, or that the submitted paper is of low quality, is not something they filter for, however.
There's a ton of performance upside in most GPU-adjacent code right now.
However, is this what arXiv is for? It seems more like marketing their links than research. Please correct me if I'm wrong/naive on this topic.
not wrong, per se, but it's far from the first time
Using the free playground link, and it is in fact extremely fast. The "diffusion mode" toggle is also pretty neat as a visualization, although I'm not sure how accurate it is - it renders as line noise and then refines, while in reality presumably those are tokens from an imprecise vector in some state space that then become more precise until it's only a definite word, right?
Some text diffusion models use a continuous latent space, but historically those haven't done that well. Most of the ones we're seeing now are trained to predict actual token output that's fed forward into the next timestep. The diffusion property comes from their ability to modify previous timesteps to converge on the final output.
I have an explanation about one of these recent architectures that seems similar to what Mercury is doing under the hood here: https://pierce.dev/notes/how-text-diffusion-works/
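As a toy sketch of that loop (not Mercury's actual algorithm; the "model" below is a random stand-in rather than a trained denoiser), masked-diffusion generation roughly works like this: start fully masked, let the model propose a token for every masked slot, commit the most confident proposals, and repeat.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
MASK = "[MASK]"

def fake_model(tokens):
    # Stand-in for a trained denoiser: propose (token, confidence) per masked slot.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def generate(length=8, steps=4):
    tokens = [MASK] * length
    for _ in range(steps):
        proposals = fake_model(tokens)
        if not proposals:
            break
        # Commit roughly half of the remaining masked slots, highest confidence first.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[: max(1, len(ranked) // 2)]:
            tokens[i] = proposals[i][0]
    # Fill any slots still masked after the fixed number of steps.
    for i, (tok, _) in fake_model(tokens).items():
        tokens[i] = tok
    return " ".join(tokens)

print(generate())
```

Real systems differ in the details (e.g. whether already-committed tokens can be re-masked later), but the coarse-to-fine refinement is the common thread.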
Oh neat, thanks! The OP is surprisingly light on details on how it actually works and is mostly benchmarks, so this is very appreciated :)
Link : https://chat.inceptionlabs.ai/
Still cannot pass the stRawbeRRy or the Sally's 1 sister tests unfortunately...
It's insane how fast that thing is!
I am personally very excited for this development. Recently I AI-coded a simple game for a game jam and half the time was spent waiting for the AI agent to finish its work so I can test it. If instead of waiting 1-2 minutes for every prompt to be executed and implemented I could wait 10 seconds instead that would be literally game changing. I could test 5-10 different versions of the same idea in the time it took me to test one with the current tech.
Of course this model is not as advanced yet for this to be feasible, but so was Claude 3.0 just over a year ago. This will only get better over time I’m sure. Exciting times ahead of us.
Pricing:
US$0.000001 per output token ($1/M tokens)
US$0.00000025 per input token ($0.25/M tokens)
https://platform.inceptionlabs.ai/docs#models
The pricing is a little on the higher side. Working on a performance-sensitive application, I tried Mercury and Groq (Llama 3.1 8b, Llama 4 Scout) and the performance was neck-and-neck but the pricing was way better for Groq.
But I'll be following diffusion models closely, and I hope we get some good open source ones soon. Excited about their potential.
Good to know. I didn't realize how good the pricing is on Groq!
If your application is pricing sensitive, check out DeepInfra.com - they have a variety of models in the pennies-per-mil range. Not quite as fast as Mercury, Groq or Samba Nova though.
(I have no affiliation with this company aside from being a happy customer the last few years)
You're getting the savings by shifting the pollution of the datacenter onto a largely black community and choking them out.
Diffusion is just the logically optimal behavior for searching massively parallel spaces without informed priors. We need to think beyond language modeling, however, and start to view this in terms of drug discovery etc. A good diffusion model + the laws of chemistry could be god-tier. I think language modeling has the AI community in its grip right now, and they aren't seeing the applications of the same techniques to real-world problems elsewhere.
Actually in most deep learning schemes for science adding in the "laws of nature" as constraints makes things much worse. For example, all the best weather prediction models utilize basically zero fluid dynamics. Even though a) global weather can be in principle predicted by using the Navier-Stokes equations and b) deep learning models can be used to approximately evaluate the Navier-Stokes equations, we now know that incorporating physics into these models is mostly a mistake.
The intuitive reason might be that unconstrained optimization is easier than constrained optimization, particularly in high dimensions, but no one really knows the real reason. It may be that we are not yet at the end of the "bigger is better" regime, and at the true frontier we must add the laws of nature to eke out the last remaining bits of performance possible.
Well diffusion models have long already made the jump to biology at least. Esm3 and alphafold 3 both are diffusion based.
I think the LLM dev community is underestimating these models. E.g. there is no LLM inference framework that supports them today.
Yes the diffusion foundation models have higher cross entropy. But diffusion LLMs can also be post trained and aligned, which cuts the gap.
IMO, investing in post training and data is easier than forcing GPU vendors to invest in DRAM to handle large batch sizes and forcing users to figure out how to batch their requests by 100-1000x. It is also purely in the hands of LLM providers.
You can absolutely tune causal LLMs. In fact the original idea with GPTs was that you had to tune them before they'd be useful for anything.
Yes I agree you can tune autoregressive LLMs
You can also tune diffusion LLMs
After doing so, the diffusion LLM will be able to generate more tokens/sec during inference
Google has Gemini Diffusion in the works. I joined the beta. Roughly speaking, it "feels" a lot like 2.5 Flash in the style of its interaction and accuracy. But the walls of text appear almost instantaneously; you don't notice any scrolling.
If anyone else is curious about the claim "Copilot Arena, where the model currently ranks second on quality"
This seems to be the link, mind blowing results if indeed is the case: https://lmarena.ai/leaderboard/copilot
Damn, that is fast. But it is faster than I can read, so hopefully they can use that speed and turn it into better quality of the output. Because otherwise, I honestly don't see the advantage, in practical terms, over existing LLMs. It's like having a TV with a 200Hz refresh rate, where 100Hz is just fine.
There are plenty of LLM use cases where the output isn’t meant to be read by a human at all. e.g:
parsing unstructured text into structured formats like JSON (sketch below)
translating between natural or programming languages
serving as a reasoning step in agentic systems
So even if it’s “too fast to read,” that speed can still be useful
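To make the first of those use cases concrete, here is a hedged sketch in the OpenAI-compatible API style such providers tend to expose; the base URL and model name are placeholders, not Inception's documented values.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

def extract_order(text: str) -> dict:
    # Ask for machine-readable output only; nobody ever reads the raw completion.
    resp = client.chat.completions.create(
        model="placeholder-model",
        messages=[
            {"role": "system",
             "content": 'Reply with JSON only: {"item": string, "quantity": integer}.'},
            {"role": "user", "content": text},
        ],
    )
    # json.loads will raise if the model strays from the format; this is a sketch,
    # not production error handling.
    return json.loads(resp.choices[0].message.content)

print(extract_order("please send me three boxes of blue widgets"))
```

At 1000+ tok/s the model stops being the slow part of a pipeline like this.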
You're missing another big advantage: cost. If you can do 1000 tok/s on a $2/hr H100 vs 60 tok/s on the same hardware, you can price it at roughly 1/17th of the price for the same margin.
You can also slow down the hardware (say, dropping the clock and then voltages) to save huge amounts of power, which should be interesting for embedded applications.
Sure, but I was talking about the chat interface, sorry if that was not clear.
This lets you do more (potentially a lot more) reasoning steps and tool calls before answering.
I've been looking at the code on their chat playground, https://chat.inceptionlabs.ai/, and they have a helper function `const convertOpenAIMessages = (convo) => { ... }`, which also contains `models: ['gpt-3.5-turbo']`. I also see in API response: `"openai": true`. Is it actually using OpenAI, or is it actually calling its dLLM? Does anyone know?
Also: you can turn on "Diffusion Effect" in the top-right corner, but this just seems to be an "animation gimmick" right?
The speed of the response is way too quick for OpenAI to be the backend; it's almost instant!
I've been asking bespoke questions and the timing is >2 seconds, and slower than what I get for the same questions to ChatGPT (using gpt-4.1-mini). I am looking at their call stack and what I see: "verifyOpenAIConnection()", "generateOpenAIChatCompletion()", "getOpenAIModels()", etc. Maybe it's just so it's compatible with OpenAI API?
Check the bottom, I think it's just some off the shelf chat UI that uses OpenAI compatible API behind the scenes.
Ah got it, it looks like it's a whole bunch of things so it can also interface with ollama, and other APIs.
The output is very fast but many steps backwards in all of my personal benchmarks. Great tech but not usable in production when it is over 60% hallucinations.
That might just depend on how big it is/how much money was spent on training. The neural architecture can clearly work. Beyond that catching up may be just a matter of effort.
For something a little different than a coding task, I tried using it in my game: https://www.playintra.win/ (in settings you can select Mercury, the game uses OpenRouter)
At first it seemed pretty competent and of course very fast, but it seemed to really fall apart as the context got longer. The context in this case is a sequence of events and locations, and it needs to understand how those events are ordered and therefore what the current situation and environment are (though there's also lots of hints in the prompts to keep it focused on the present moment). It's challenging, but lots of smaller models can pull it off.
But also a first release and a new architecture. Maybe it just needs more time to bake (GPT 3.5 couldn't do these things either). Though I also imagine it might just perform _differently_ from other LLMs, not really on the same spectrum of performance, and requiring different prompting.
Love the ui in the playground, it reminds me of Qwen chat.
We have reached a point where the bottlenecks in genAI are not knowledge or accuracy; they are the context window and speed.
Luckily, Google (and Meta?) has pushed the limits of the context window to about 1 million tokens, which is incredible. But I feel like today's options are still stuck at a ~128k-token window per chat, and after that it starts to forget.
Another issue is the time it takes for inference AND reasoning. dLLMs are an interesting approach to this. I know we have Groq's hardware as well.
I do wonder, can this be combined with Groq's hardware? Would the response be instant then?
How many tokens can each chat handle in the playground? I couldn't find so much info about it.
Which model is it using for inference?
Also, is the training the same for dLLMs as for standard autoregressive LLMs? Or are the weights and models completely different?
> We have reached a point where the bottlenecks in genAI are not knowledge or accuracy; they are the context window and speed.
You’re joking, right? I’m using o3 and it couldn’t do half of the coding tasks I tried.
I agree entirely with you. While Claude Code is amazing, it is also slow as hell and the context issue keeps coming up (usually at what feels like the worst possible time for me).
Most LLMs honestly feel like dialup (apart from this one!).
AFAIK, with traditional models long context is very expensive (though I know there are a lot of attempts to 'optimize' this). Attention scales with the square of context length, so 10x-ing the context costs roughly 100x in attention compute, even though the KV cache itself only grows linearly.
(Image) diffusion does not grow like that; it is much more linear. But I have no idea (yet!) about text diffusion models, if someone wants to chip in :).
I mean we don't really talk about the accuracy of generative models. It is more of a discriminative model thing.
But besides this, the current gen of models still, like, hallucinates more than many would like
It's a fork/implementation of Open WebUI, isn't it?
Is there a kind of nanoGPT for diffusion language models? I would love to understand them better.
This video has a live coding part which implements a masked diffusion generation process: https://www.youtube.com/watch?v=oot4O9wMohw
Is the parameter count published? I'm by no means an expert, but the failure modes remind me of Chinese 1B-class models.
Wow, this thing is really quite smart.
I was expecting really crappy performance but just chatting to it, giving it some puzzles, it feels very smart and gets a lot of things right that a lot of other models don't.
Sounds all cool and interesting, however:
> By submitting User Submissions through the Services, you hereby do and shall grant Inception a worldwide, non-exclusive, perpetual, royalty-free, fully paid, sublicensable and transferable license to use, edit, modify, truncate, aggregate, reproduce, distribute, prepare derivative works of, display, perform, and otherwise fully exploit the User Submissions in connection with this site, the Services and our (and our successors’ and assigns’) businesses, including without limitation for promoting and redistributing part or all of this site or the Services (and derivative works thereof) in any media formats and through any media channels (including, without limitation, third party websites and feeds), and including after your termination of your account or the Services. For clarity, Inception may use User Submissions to train artificial intelligence models. (However, we will not train models using submissions from users accessing our Services via OpenRouter.)
Company blog post: https://www.inceptionlabs.ai/introducing-mercury-our-general...
News coverage from February: https://techcrunch.com/2025/02/26/inception-emerges-from-ste...
Reinforcement learning really helped Transformer-based LLMs evolve in terms of quality and reasoning, as we saw when DeepSeek launched. I am curious whether this is the equivalent of an early GPT-4o that has not yet reaped the benefits of the add-on techniques that helped improve quality.
No open model/weights?
Not only do they not release models/weights, they don't even tell you the size of the models!
The linked whitepaper is pretty useless, and I am saying that as a big fan of the diffusion-transformers-for-not-just-images-or-videos approach.
Also, Gemini Diffusion ([1]) is way better at coding than Mercury's offering.
1. https://deepmind.google/models/gemini-diffusion/
Oddly fast, almost instantaneous.
Tried it on some coding questions and it hallucinated a lot, but the appearance (i.e. if you’re not a domain expert) of the output is impressive.
I've used Mercury quite a bit in my commit message generator. I noticed it would always produce the exact same response if you ran it multiple times, and increasing the temperature didn't affect it. To get some variability I added a $(uuidgen) to the prompt. Then I could run it again for a new response if I didn't like the first.
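Roughly, that trick looks like the sketch below (endpoint, model name, and prompt are placeholders, not the actual generator).

```python
import subprocess
import uuid
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

def commit_message() -> str:
    # The staged diff becomes the context; the random nonce makes otherwise
    # deterministic output vary between runs.
    diff = subprocess.run(["git", "diff", "--cached"],
                          capture_output=True, text=True, check=True).stdout
    prompt = (f"Write a one-line commit message for this diff.\n"
              f"(nonce: {uuid.uuid4()})\n\n{diff}")
    resp = client.chat.completions.create(
        model="placeholder-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(commit_message())
```

The nonce changes the prompt just enough to push a deterministic sampler onto a different path each run.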
Something like https://github.com/av/klmbr could also work
This is cool. I think faster models can unlock entirely new usage paradigms, like how faster search enables incremental search.
I was curious to know the statistics on the mentions of various programming languages on HN over the years, so I got me a copy of all HN comments from a BigTable public source. But now I need to interpret each comment and so what I need is a semantic grep. The easiest would be to prompt an LLM.
Comments are pretty short, but there are many millions of them. So getting high throughput at minimum cost is key.
I'm hoping that Inception might be able to churn through this quickly.
If you folks have other ideas or suggestions, what might also work, I'd love to hear them!
The idea is having a semantic-grep command line tool. If latencies are dropping dramatically, it might be feasible.
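For what it's worth, the core loop could be as simple as the following sketch (endpoint and model name are placeholders; at millions of comments you'd want request batching and the cheapest model that can answer yes/no reliably).

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

def semantic_grep(comments, question):
    # Stream comments through a fast model with a yes/no question and keep the hits.
    for c in comments:
        resp = client.chat.completions.create(
            model="placeholder-model",
            max_tokens=1,
            messages=[{"role": "user",
                       "content": f"{question}\nAnswer yes or no only.\n\n{c}"}],
        )
        if resp.choices[0].message.content.strip().lower().startswith("y"):
            yield c

hits = semantic_grep(["Rust's borrow checker saved me again", "I had soup for lunch"],
                     "Does this comment mention a programming language?")
print(list(hits))
```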
Code output is verifiable in multiple ways. Combine that with this kind of speed (and far faster in future) and you can brute force your way to a killer app in a few minutes.
Yes, exactly. The demo of Gemini's Diffusion model [0] was really eye-opening to me in this regard. Since then, I've been convinced the future of lots of software engineering is basically UX and SQA: describe the desired states, have an LLM fill in the gaps based on its understanding of human intent, and unit test it to verify. Like most engineering fields, we'll have an empirical understanding of systems as opposed to the analytical understanding of code we have today. I'd argue most complex software is already only approximately understood even before LLMs. I doubt the quality of software will go up (in fact the opposite), but I think this work will scale much better and be much, much more boring.
[0] https://simonwillison.net/2025/May/21/gemini-diffusion/
I wonder if diffusion llms solve the hallucination problem more effectively. In the same way that image models learned to create less absurd images, dllms can perhaps learn to create sensical responses more predictably
The speed here is super impressive! I am curious - are there any qualitative ways in which modeling text using diffusion differs from that using autoregressive models? The kind of problems it works better on, creativity, and similar.
One works in the coarse-to-fine direction, another works start-to-end. Which means different directionality biases, at least. Difference in speed, generalization, etc. is less clear and needs to be proven in practice, as fundamentally they are closer than it seems. Diffusion models have some well-studied shortcuts to trade speed for quality, but nothing stops you from implementing the same for the other type.
I once read that diffusion is essentially just autoregression in the frequency domain. Honestly, that comparison didn’t seem too far off.
I guess this makes specific language patterns cheaper and more artistic language patterns more expensive. This could be a good way to limit pirated and masqueraded materials submitted by students.
I'm kind of impressed by the speed of it. I told it to write an MQTT topic pattern matcher based on a trie and it spat out something reasonable on the first try. It had a few compilation issues, but fair enough.
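For readers unfamiliar with the task, this is roughly what such a matcher looks like (my own compact sketch, not the model's output), with '+' matching a single topic level and '#' matching everything at or below a level.

```python
class TopicTrie:
    def __init__(self):
        self.children = {}
        self.handlers = []

    def insert(self, pattern, handler):
        # Each '/'-separated level becomes one trie edge.
        node = self
        for level in pattern.split("/"):
            node = node.children.setdefault(level, TopicTrie())
        node.handlers.append(handler)

    def match(self, topic):
        return list(self._match(topic.split("/")))

    def _match(self, levels):
        if not levels:
            yield from self.handlers
            # A trailing '#' also matches the parent level itself.
            if "#" in self.children:
                yield from self.children["#"].handlers
            return
        head, rest = levels[0], levels[1:]
        if head in self.children:
            yield from self.children[head]._match(rest)
        if "+" in self.children:
            yield from self.children["+"]._match(rest)
        if "#" in self.children:
            yield from self.children["#"].handlers

trie = TopicTrie()
trie.insert("sensors/+/temperature", "temp_handler")
trie.insert("sensors/#", "all_sensors")
print(trie.match("sensors/kitchen/temperature"))  # ['temp_handler', 'all_sensors']
```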
We have used their LLM in our company and it's great! From accuracy to speed of response generation, this model seems very promising!
I strongly believe that this will be a really important technique in the near future. The cost savings it might create are mouth-watering.
> I strongly believe that this will be a really important technique in the near future.
I share the same belief, but regardless of cost. What excites me is the ability to "go both ways", edit previous tokens after others have been generated, using other signals as "guided generation", and so on. Next token prediction works for "stories", but diffusion matches better with "coding flows" (i.e. going back and forth, add something, come back, import something, edit something, and so on).
It would also be very interesting to see how applying this at different "abstraction layers" would work. Say you have one layer working on ctags, one working on files, and one working on "functions". And they all "talk" to each other, passing context and "re-diffusing" their respective layers after each change. No idea where the data for this would come from; maybe from IDEs?
I wonder if there's a way to do diffusion within some sort of schema-defined or type constrained space.
A lot of people these days are asking for structured output from LLMs so that a schema is followed. Even if you train on schema-following with a transformer, you're still just 'hoping' in the end that the generated json matches the schema.
I'm not a diffusion expert, but maybe there's a way to diffuse one value in the 'space' of numbers, and another value in the 'space' of all strings, as required by a schema:
{ "type": "object", "properties": { "amount": { "type": "number" }, "description": { "type": "string" } }, "required": ["amount", "description"] }
I'm not sure how far this could lead. Could you diffuse more complex schemas that generalize to an arbitrary syntax tree? E.g. diffuse some code in a programming language that is guaranteed to be type-safe?
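As a toy illustration of the "typed holes" idea (nothing to do with how Mercury or any real constrained decoder works; the value samplers below are random stand-ins for a model's proposals): keep the schema's structure fixed and only generate the value slots, each from its own type-constrained space, so the output is valid by construction.

```python
import json
import random

schema = {"amount": "number", "description": "string"}

def sample_value(kind):
    # Stand-ins for a model's per-slot proposals; a real system would re-score and
    # refine these over several denoising steps instead of sampling once.
    if kind == "number":
        return round(random.uniform(0, 100), 2)
    if kind == "string":
        return random.choice(["refund", "invoice #42", "coffee"])
    raise ValueError(kind)

def generate(schema):
    # The object's structure never enters the sampling loop, only the typed holes do.
    return {key: sample_value(kind) for key, kind in schema.items()}

doc = generate(schema)
print(json.dumps(doc))  # e.g. {"amount": 17.3, "description": "coffee"}
assert isinstance(doc["amount"], float) and isinstance(doc["description"], str)
```

Real structured-output systems do something similar with token-level grammar constraints; a diffusion variant could, in principle, refine all the holes in parallel over several denoising steps.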
I, for one, am willing to trade accuracy for speed. I'd rather have 10 iterations of poor replies which forces me to ask the right question than 1 reply which takes 10 times as long and _maybe_ is good, since it tries to reason about my poor question.
Personally I like asking coding agents a question and getting an answer back immediately. Systems like Junie that go off and research a bunch of irrelevant things, then ask permission, then do a lot more irrelevant research, ask more permission, and then 15 minutes later give you a mountain of broken code are a waste of time if you ask me. (Even if you give permission in advance.)
Can Mercury use tools? I haven't seen it described anywhere. How about streaming with tools?
For tools they say coming soon in their api docs here https://platform.inceptionlabs.ai/docs#models
Having token embeddings with diffusion models, for 16x16 transformer encoding. Image is tokenized before transformers compile it. If decomposed virtualization modulates according to a diffusion model.
Holy shit that is fast. Try the playground. You need to get that visceral experience to truly appreciate what the future looks like.
[dead]
[dead]
[dead]
[flagged]
[flagged]
[flagged]