I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the Linux kernel. https://clangbuiltlinux.github.io/
This LLM did it in (checks notes):
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
It may build, but does it boot (which was also a significant and distinct next milestone)? (Also, will it blend?) Looks like yes!
> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.
The next milestone is: is the generated code correct? The jury is still out on that one even for production compilers. And then there's the performance of the generated code.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
I would claim that LLMs desperately need proprietary code in their training before we see any big gains in quality.
There's some incredible source-available code out there. Statistically, I think there's a LOT more not-so-great source-available code out there, because the majority of the output of seasoned/high-skill developers is proprietary.
To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.
This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter, because when you use LLMs to work on proprietary code, the model has the code right there.
Yeah, but isn't the whole point of Claude Code to get people to provide preference/telemetry data to Anthropic (unless you opt out)? Same with other providers.
I'm guessing most of the gains we've seen recently come from post-training rather than pretraining.
Yes, but you have the problem that a good portion of that is going to be AI generated.
But, I naively assume most orgs would opt out. I know some orgs have a proxy in place that will prevent certain proprietary code from passing through!
This makes me curious if, in the allow case, Anthropic is recording generated output, to maybe down-weight it if it's seen in the training data (or something similar)?
I'd bet, on average, the quality of proprietary code is worse than open-source code. There have been decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.
Not to mention, a team member is (surprise!) fired or let go, and no knowledge transfer exists. Womp, womp. Codebase just gets worse as the organization or team flails.
This is cool and actually demonstrates real utility. Using AI to take something that already exists and create it for a different library / framework / platform is cool. I'm sure there's a lot of training data in there for just this case.
But I wonder how it would fare if given a language specification for a non-existent, non-trivial language and asked to build a compiler for that instead?
If you come up with a realistic language spec and wait maybe six months, by then it'll probably be approaching cheap enough that you could test the scenario yourself!
I see that as the point that all this is proving - most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each others’ homework more efficiently.
A small thing, but it won't compile the RISC-V version of hello.c if the source isn't installed on the machine it's running on.
It is standing on the shoulders of giants (all of the compilers of the past, built into its training data... and the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.
On a side quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. Would be hard to avoid?
> Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
They don't need 16-bit x86 support for the RISC-V or ARM ports, so yes, but it depends on what 'it' we're talking about here.
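For readers who haven't seen the 66/67 trick the quote mentions: in 16-bit real mode those prefixes override the default operand/address size one instruction at a time, so a code generator that only "thinks" in 32 bits can still emit legal real-mode code, at the cost of an extra prefix byte nearly everywhere. A purely illustrative sketch (the function name and buffer handling are mine, not the project's):
#include <stdint.h>
#include <stdio.h>
#include <string.h>
/* Illustrative only: emit "mov eax, imm32" while targeting 16-bit real
 * mode. The 0x66 operand-size prefix flips this single instruction from
 * the mode's default 16-bit operand size to 32 bits (0x67 does the same
 * for address size). Paying a prefix byte on most instructions, on top of
 * generally unoptimized codegen, is one way such output bloats well past
 * a 32k code-size limit. */
static size_t emit_mov_eax_imm32(uint8_t *buf, uint32_t imm) {
    size_t n = 0;
    buf[n++] = 0x66;                   /* operand-size override prefix */
    buf[n++] = 0xB8;                   /* mov eax, imm32 (opcode B8+rd) */
    memcpy(&buf[n], &imm, sizeof imm); /* little-endian immediate */
    return n + sizeof imm;
}
int main(void) {
    uint8_t buf[8];
    size_t len = emit_mov_eax_imm32(buf, 0x12345678u);
    for (size_t i = 0; i < len; i++)
        printf("%02x ", buf[i]);       /* prints: 66 b8 78 56 34 12 */
    putchar('\n');
    return 0;
}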
Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?
Yeah, didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.
I don't agree that all the claims are backed up by their own comments, which means that there's probably other places where it falls down.
It's... misrepresentation.
Like, Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.
Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?
Even if it's 99%, that's not the same story as the one they tried to write. And if they had written that tale instead, it would be more impressive, rather than "there are some holes; how many?"
Their C compiler is not reliant on having another C compiler around. Compiling the 16-bit real mode bootstrap for the Linux kernel on x86(-64) requires another C compiler; you certainly don't need another compiler to compile the kernel for another architecture, or to compile another piece of software not subject to the 32k constraint.
The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.
Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.
They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
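A minimal sketch of that idea (not tied to any real inference library): an int8 dot product accumulated in a fixed order is exact integer math, so it's bit-identical on every machine, whereas a parallel float reduction can change with how the additions are grouped:
#include <stdint.h>
#include <stdio.h>
/* Minimal sketch, not any particular framework: a quantized dot product
 * in integer arithmetic with a fixed left-to-right accumulation order.
 * Integer adds and multiplies are exact, so every conforming machine
 * produces the same bits; a floating-point reduction, by contrast, can
 * differ with the grouping of additions (SIMD vs. scalar, different GPUs). */
static int32_t q_dot(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
int main(void) {
    int8_t a[4] = { 1, -2, 3, -4 };
    int8_t b[4] = { 5, 6, -7, 8 };
    printf("%d\n", q_dot(a, b, 4));  /* always -60, on every machine */
    return 0;
}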
Yeah, in the same way how pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time given the seeds are the same!
But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.
Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.
EDIT (since HN is preventing me from responding):
> Some people care more about compiler speed than the correctness?
Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't care as much about correctness as they may claim, since it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such cases was left up to the interpretation of the implementer by ISO WG14.
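A classic example of the kind of indirect reliance I mean - signed overflow, which ISO C leaves undefined, so the optimizer is allowed to assume it never happens:
#include <limits.h>
#include <stdio.h>
/* Signed integer overflow is undefined behavior in ISO C, so an optimizer
 * may assume x + 1 never wraps and fold this check to "always false". */
static int wraps_on_increment(int x) {
    return x + 1 < x;            /* UB when x == INT_MAX */
}
int main(void) {
    /* At -O0 this often prints 1 (the wrap happens to occur);
     * at -O2 GCC and Clang typically print 0 (check folded away). */
    printf("%d\n", wraps_on_increment(INT_MAX));
    return 0;
}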
Some people care more about compiler speed than the correctness? I would love to meet these imaginary people that are fine with a compiler that is straight up broken. Emitting working code is the baseline, not some preference slider.
Let's pretend, for just a second, that the people who do, having been able to learn how to program, are not absolute fucking morons. Straight up broken is obviously not useful, so maybe the conclusions you've jumped to could use some reexamination.
A compiler introducing bugs into code it compiles is a nightmare that thankfully few have faced. The only thing worse would be a CPU bug, like the legendary Pentium bug. Imagine you compile something like Postgres only to have it crash in some unpredictable way. How long do you stare at the Postgres source before suspecting the compiler? What if this compiler were used to compile code in software running all over cloud stacks? Bugs in compilers are very bad news; they have to be correct.
They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and I still find it surprising.
> a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced
Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…
We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.
Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.
The fact that the optimizations aren't as good as those of the 40-year-old GCC project? Eh - I think people who focus on that are probably still in some serious denial.
It's amazing that it "works", but viability is another issue.
It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.
Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.
On top of that, Anthropic is losing money on it.
All of those things combined, viability remains a serious question.
That's a good point! Here, Claude Opus wrote a C compiler. Outrageously cool.
Earlier today, I couldn't get Opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect Rube Goldberg machine.
To be fair, it was a pretty horrible mess of useEffects. But just another data point.
Also, I was hoping Opus would finally be able to handle complex TypeScript generics, but alas...
You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.
If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.
Microsoft paid my company a lot of money to write code. And in the end you were able to count it, and the LOC is a perfectly fine metric which is still used today to measure complexity of a project.
If you actually work in software you know this.
I have no idea what point you're trying to make - but I've grown very tired of all the trolls attacking me. Good night.
Do you think this was guided by a low quality Anthropic developer?
You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute forced it, most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.
It's a shame that there's no way to block obnoxious people on HN.
I think I'll ask codex to write me a Chrome extension to censor you. :)
EDIT: Done!
(() => {
"use strict";
const BLOCKED_USER = "bopbopbop7";
// -------- helpers --------
/** True if the node (or its descendants) contain a user link matching blocked user */
function containsBlockedUserLink(root) {
if (!root || !(root instanceof Element)) return false;
const sel = `a[href^="user?id="], a[href*="user?id="]`;
const links = root.querySelectorAll(sel);
for (const a of links) {
const href = a.getAttribute("href") || "";
// HN uses relative href like "user?id=pg"
// Algolia uses full/relative links; check query portion.
if (href.includes(`user?id=${BLOCKED_USER}`)) return true;
if ((a.textContent || "").trim() === BLOCKED_USER && href.includes("user?id=")) return true;
}
return false;
}
/** Remove the closest HN listing block (table row pairs) for a story */
function removeHNStoryFromListing(anchorUserLink) {
// On HN listings, a story is typically 2 consecutive <tr>:
// 1) title row (class="athing"), 2) subtext row.
const userTr = anchorUserLink.closest("tr");
if (!userTr) return;
// If we are in the subtext row, the associated title row is usually previousElementSibling
// The title row has class "athing".
const maybeTitleTr = userTr.previousElementSibling;
const titleTr = maybeTitleTr && maybeTitleTr.classList && maybeTitleTr.classList.contains("athing")
? maybeTitleTr
: userTr.classList && userTr.classList.contains("athing")
? userTr
: null;
if (titleTr) {
const subtextTr = titleTr.nextElementSibling;
// Capture the possible spacer row *before* removing subtextTr,
// otherwise nextElementSibling is null once the node is detached.
const spacer = (subtextTr && subtextTr.nextElementSibling) || null;
titleTr.remove();
if (subtextTr && subtextTr.tagName === "TR") subtextTr.remove();
// Sometimes there is a spacing row after; remove if it's just spacing
if (spacer && spacer.tagName === "TR" && spacer.textContent.trim() === "") {
spacer.remove();
}
} else {
// Fallback: remove just the row containing the user link
userTr.remove();
}
}
/** Collapse/remove comments by the blocked user on item pages */
function collapseHNComment(anchorUserLink) {
// HN comments are in <tr class="athing comtr"> (or similar)
const comtr = anchorUserLink.closest("tr.athing.comtr, tr.comtr");
if (!comtr) return;
// Remove the comment row plus the following "spacer" row if present
const next = comtr.nextElementSibling;
comtr.remove();
if (next && next.tagName === "TR" && next.textContent.trim() === "") {
next.remove();
}
}
/** Algolia results: remove the result item container */
function removeAlgoliaResult(anchorUserLink) {
// Algolia HN search commonly uses .item-title-and-infos or .Story or .Comment containers.
// We'll just walk up to a reasonable container.
const container =
anchorUserLink.closest("li, .item, .Story, .Comment, .SearchResults__item, .search-result, article, .result") ||
anchorUserLink.parentElement;
if (container) container.remove();
}
function handleUserLink(a) {
const href = a.getAttribute("href") || "";
const isBlocked =
href.includes(`user?id=${BLOCKED_USER}`) || (a.textContent || "").trim() === BLOCKED_USER;
if (!isBlocked) return;
const host = location.hostname;
if (host === "news.ycombinator.com") {
// Distinguish listing pages vs item pages by presence of "item?id="
const isItemPage = location.pathname === "/item";
if (isItemPage) {
collapseHNComment(a);
// Also on item page: if OP is blocked, remove the top submission block.
// We can try removing the whole "athing" story block if the subtext shows blocked user.
const topStoryTr = document.querySelector("tr.athing");
if (topStoryTr && containsBlockedUserLink(topStoryTr.nextElementSibling)) {
// Grab the subtext row before detaching the title row.
const sub = topStoryTr.nextElementSibling;
topStoryTr.remove();
if (sub && sub.tagName === "TR") sub.remove();
}
} else {
removeHNStoryFromListing(a);
}
} else if (host === "hn.algolia.com") {
removeAlgoliaResult(a);
}
}
function scanAndRemove(root = document) {
const links = root.querySelectorAll(`a[href*="user?id=${BLOCKED_USER}"]`);
for (const a of links) handleUserLink(a);
// Extra safety: sometimes the link text is present but href format differs.
// Catch those too.
const maybeTextLinks = root.querySelectorAll(`a`);
for (const a of maybeTextLinks) {
if ((a.textContent || "").trim() === BLOCKED_USER && (a.getAttribute("href") || "").includes("user?id=")) {
handleUserLink(a);
}
}
}
// Initial sweep
scanAndRemove(document);
// Keep it working with infinite scroll / dynamic loads
const obs = new MutationObserver((mutations) => {
for (const m of mutations) {
for (const node of m.addedNodes) {
if (node instanceof Element) scanAndRemove(node);
}
}
});
obs.observe(document.documentElement, { childList: true, subtree: true });
})();
Yep. Building a working C compiler that compiles Linux is an impossible task for all but the top 1% of developers. And the ones that could do it have better things to do, plus they’d want a lot more than 20K for the trouble.
This has got to be my favorite one of them all that keeps coming up in too many comments… You know who else was losing money in the beginning? Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and if you remember "Facebook will never be able to monetize on mobile…" - they all lose money, until they don't.
How does 20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.
Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!
`asm goto` was the big one. The x86_64 maintainers broke the Clang builds very intentionally, just after we had gotten x86_64 building (with the necessary patches upstreamed), by requiring compiler support for that GNU C extension. This was right around the time of Meltdown+Spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care to.
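(For readers who haven't run into it: `asm goto` lets an inline-assembly block branch to C labels, which is what the kernel's static-branch machinery is built on. A stripped-down sketch of the shape, loosely modeled on the kernel's jump-label pattern - the real version also records a key pointer in its table, and uses the kernel's own section names rather than the made-up one below:)
#include <stdio.h>
/* Stripped-down sketch: the asm body emits a nop and records its address
 * plus the address of the C label in a side table, so a runtime patcher
 * could later turn the nop into a jump. Needs GCC or a Clang new enough
 * to support `asm goto`. */
static int static_branch_unlikely(void)
{
    asm goto("1: nop\n\t"
             ".pushsection .example_jump_table, \"aw\"\n\t"
             ".quad 1b, %l[l_yes]\n\t"
             ".popsection\n\t"
             : /* outputs (empty; only newer compilers allow them here) */
             : /* no inputs */
             : /* no clobbers */
             : l_yes);
    return 0;          /* fall-through: branch not taken */
l_yes:
    return 1;          /* reached only after runtime patching */
}
int main(void)
{
    printf("%d\n", static_branch_unlikely());  /* prints 0 unless patched */
    return 0;
}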
Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).
Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32-bit ARM assembly to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for the QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.
Evangelism and convincing upstream kernel developers why clang support was worth anyones while.
There are parts of the LLVM architecture that are long in the tooth (IMO), as is the language it's implemented in (IMO).
I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious about approaches to compiling C concurrently (and parsing C in parallel, or lazily) that haven't been explored in LLVM, and that I think might be safer to do in Rust. I don't know enough about grammars to know if it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.
LLVM is pretty well designed for test. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross-check my implementation (I would compare my interpretation of the token stream against clang's). Just having a standard module system for reusable pieces seems like perhaps a better way to compose a toolchain, but maybe folks with more experience with rustc have scars and disagree?
> I had hoped one day to re-implement parts of LLVM itself in Rust
Heh, earlier today I was just thinking how crazy a proposal it would actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with e-graphs).
This is the proper deep critique / skepticism (or sophisticated goal-post moving, if you prefer) here. Yes, obviously this isn't just reproducing C compiler code in the training set, since this is Rust, but it is much less clear how much of the generated Rust code can (or can not) be accurately seen as being translated from C code in the training set.
One thing LLMs are really good at is translation. I haven’t tried porting projects from one language to another, but it wouldn’t surprise me if they were particularly good at that too.
Also: a large number of folks seem to think Claude Code is losing a ton of money. I have no idea where the final numbers land; however, if the $20,000 figure is accurate, then based on some of the estimates I've seen, they could've hired 8 senior-level developers at a quarter million a year for the same amount of money spent internally.
Granted, marketing sucks up far too much money at any startup, and again, we don't know the actual numbers in play; however, this is something to keep in mind. (The very same marketing likely also wrote the blog post, FWIW.)
This doesn't add up. The $20k is in API costs. People talk about CC losing money because it's way more cost-efficient than the API - i.e., the same work with efficient use of CC might have cost ~$5k.
But regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents would be a no-brainer.
CC hits their APIs, and internally I'm sure Anthropic tracks those calls, which is what they seem to be referencing here. What exactly did Anthropic do in this test to get "inefficient use of CC" vs. your proposed "efficient use of CC"?
Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?
Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.
This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities." Yeah, that's fair, but still highly impressive IMO.
This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough; no need for such misleading statements.
The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.
I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.
If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they were actively referencing GCC I think that counts.
I'd argue that no one would really care given it's GCC.
But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.
Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.
Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code given the right training environment.
I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.
It's all but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."
The "without infringing any of the copyrights" contains "any".
We know for a fact that models are extremely good at storing information, with the highest compression rate ever achieved. The fact that a model typically decompresses that information in a lossy way doesn't mean it didn't use that information in the first place.
Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.
The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage. It may remember certain over-represented parts; otherwise, it has knowledge about a lot of things, but that knowledge, while spanning a huge number of topics, is similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.
The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
Their technique really stretched the definition of extracting text from the LLM.
They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.
You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.
To make some vague claims explicit here, for interested readers:
> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"
So, yes, it is not "literally verbatim" (~96% verbatim), and it indeed takes A LOT (hundreds or thousands of prompting attempts) to make this happen.
I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".
I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"
Sure, maybe it's tricky to coerce an LLM into spitting out a near-verbatim copy of prior data, but that's orthogonal to whether or not the data to create a near-verbatim copy exists in the model weights.
Especially since the recalls achieved in the paper are 96% (based on a block-based longest-common-substring approach), the effort of extraction is utterly irrelevant.
We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.
The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"
We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.
It is enough to have read even parts of a work for something to be considered a derivative.
I would also argue that language models that need gargantuan amounts of training material in order to work can, by definition, only output derivative works.
It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.
> It is enough to have read even parts of a work for something to be considered a derivative.
For IP rights, I'll buy that. Not as important when the question is capabilities.
> I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.
For similar reasons, I'm not going to argue against anyone saying that all machine learning today, doesn't count as "intelligent":
It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.
ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."
Besides, the fact that an LLM may recall parts of certain documents, like I can recall the incipits of certain novels, does not mean that when you ask the LLM to do other kinds of work (work that is not about recalling stuff) it will mix such things in verbatim. The LLM knows what it is doing in a variety of contexts and uses that knowledge to produce stuff. The fact that, for many people, it is bitter that LLMs can do things that replace humans does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has no explanation in terms of memorization of verbatim material. So it's not a matter of copyright. Certain folks are fighting the wrong battle.
During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.
Because it _has_ been enough: if you can recall things, your implementation ends up not being "clean room" and gets trashed by the lawyers who get involved.
I mean... It's in the name.
> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.
If it can recall... Then it is not a clean room implementation. Fin.
While I mostly agree with you, it's worth noting that modern LLMs are trained on 10-30T tokens, which is quite comparable to their size (especially given how compressible the data is).
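Rough back-of-the-envelope arithmetic, with assumed round numbers (20T tokens, ~4 bytes per token, ~1 TB of weights), just to make the ratio concrete:
#include <stdio.h>
int main(void) {
    /* Assumed illustrative figures, not measurements. */
    const double tokens          = 20e12; /* ~20T training tokens */
    const double bytes_per_token = 4.0;   /* rough average for text/code */
    const double model_bytes     = 1e12;  /* ~1 TB of weights */
    const double corpus_bytes = tokens * bytes_per_token;
    printf("corpus  ~%.0f TB\n", corpus_bytes / 1e12);   /* ~80 TB */
    printf("weights ~%.0f TB\n", model_bytes / 1e12);    /* ~1 TB  */
    printf("weights/corpus ~%.1f%%\n",
           100.0 * model_bytes / corpus_bytes);          /* roughly 1% */
    return 0;
}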
Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.
Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.
You couldn't reasonably claim you did a clean-room implementation of something you had read the source of, even though you, too, would not have a verbatim copy of the entire source code in your memory (barring very rare people with exceptional memories).
It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.
A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.
Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.
The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.
We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.
Model improvement is very much slowing down, if we actually use fair metrics. Most improvements in the last year or so come down to external improvements, like better tooling, or the highly sophisticated practice of throwing way more tokens at the same problem (reasoning and agents).
Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.
Yes, and Moore's law took decades to start to fail to be true. Three years of history isn't even close to enough to predict whether or not we'll see exponential improvement, or an unsurmountable plateau. We could hit it in 6 months or 10 years, who knows.
And at least with Moore's law, we had some understanding of the physical realities as transistors would get smaller and smaller, and reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could be go either way.
Except that with Moore's law, everyone knew decades ahead what the limits of Dennard scaling were (shrinking geometry through smaller optical feature sizes), and roughly when we would get to the limit.
Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.
Intel, at the time the unquestioned world leader in semiconductor fabrication, was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10 GHz by 2010!" was something they predicted publicly in earnest!
Personally, my usage has fallen off a cliff the past few months. I'm not a SWE.
SWEs may be seeing benefit. But in other areas? Doesn't seem to be the case. Consumers may use it as a preferred interface for search - but that's a different discussion.
I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.
Cool, I guess. Kind of a meaningless statement, yeah? Let's hit the bend, then we'll talk. Until then, repeating "It's an S-curve, guys, and what's more, we're near the bend! Trust me" ad infinitum is pointless. It's not some wise revelation lol.
Maybe the best thing to say is we can only really forecast about 3 months out accurately, and the rest is wild speculation :)
History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."
Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.
"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.
Suppose you, the human, are working on a clean-room implementation of a C compiler: how do you go about doing it? Will you need to know about a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?
Doesn’t matter how you gain general knowledge of compiler techniques as long as you don’t have specific knowledge of the implementation of the compiler you are reverse engineering.
If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.
The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.
The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.
Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.
But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.
This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.
Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material as much. The fact that the previous (smaller) versions could or couldn't do it does not really disprove that claim.
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
Take the C4 training dataset, for example. The uncompressed, uncleaned size of the dataset is ~6 TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.
I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.
A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.
I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.
If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.
Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.
The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.
If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?
Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.
The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.
Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
> Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.
https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.
I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.
I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.
I am an all-day, daily user (multiple Claude Max accounts). This mostly fits my mental model - not the model I had before, but the one I developed with daily use. My job revolves around two core things:
1. data analysis / visualization / …
2. “is this possible? can this even be done?”
For #1 I don't do much anymore; for #2 I mostly still do it all "by hand," and not for lack of serious trying. So "it can do #1 1000x better than me because those are generally solved problems it was trained on, while it can't effectively do #2" fits perfectly.
It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".
It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.
Hard to find fully specified problems like this in the wild.
I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.
I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?
> Write extremely high-quality tests
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.
> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
> Hard to find fully specified problems like this in the wild.
This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.
It's weird to see the expectation that the result should be perfect.
All said and done, that it's even possible is remarkable. Maybe these all go into training the next Opus or Sonnet, and we start getting models that can create efficient compilers from scratch. That would be something!
"It's like if a squirrel started playing chess and instead of "holy shit this squirrel can play chess!" most people responded with "But his elo rating sucks""
It's more like "We were promised, over and over again, that the squirrel would be autonomous and grandmaster-level. We spent insane amounts of money, labour, and opportunity costs of human progress on this. Now, here's a very expensive squirrel that still needs guidance from a human grandmaster, and most of its moves are just replications of existing games. Oh, it also can't move the pieces by itself, so it depends on the Piece Mover library."
But the Squirrel is only playing chess because someone stuffed the pieces with food and it has learned that the only way to release it is by moving them around in some weird patterns.
A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product is treated as grounds to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.
AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?
Because it leads to poor and nonconstructive discourse that doesn't educate anyone about the implications of the tech, which is expected on social media but has annoyingly leaked to Hacker News.
There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.
Maybe the general population will be willing to have a more constructive discussion about this tech once the trillion-dollar companies stop pillaging everything they see in front of them and cease acting like sociopaths whose only objectives seem to be concentrating power, generating dissidence and harvesting wealth.
Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room
That’s the opposite of clean-room. The whole point of clean-room design is that you have your software written by people who have not looked into the competing, existing implementation, to prevent any claim of plagiarism.
“Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”
No they don't. One team meticulously documents and specs out what the original code does, and then a completely independent team, who has never seen the original source code, implements it.
True, but the human isn't allowed to bring 1TB of compressed data pertaining to what they are "redesigning from scratch/memory" into the clean room.
In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.
You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!
What they don't do is read the product they're clean-rooming. That's kinda disqualifying. Impossible to know if the GCC source is in Opus 4.5's training set, but it would be kinda weird if it wasn't.
If that's what clean room means to you, I do know AI can definitely replace you.
As even ChatGPT is better than that.
(prompt: what does a clean room implementation mean?)
From ChatGPT without login BTW!
> A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.
> The core idea is separation.
> Here’s how it usually works:
> The basic setup
> Two teams (or two roles):
> Specification team (the “dirty room”)
> Looks at the original product, code, or behavior
> Documents what it does, not how it does it
> Produces specs, interfaces, test cases, and behavior descriptions
> Implementation team (the “clean room”)
> Never sees the original code
> Only reads the specs
> Writes a brand-new implementation from scratch
> Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.
If you try to reimplement something in a clean room, it's a step-by-step process, using your own accumulated knowledge as the basis. The knowledge you hold in your brain all too often includes code that may have copyrights on it, from the companies you worked at.
Is it any different for an LLM?
The fact that the LLM is trained on more data does not change that when you work for a company, leave it, and take that accumulated knowledge to a different company, you are by definition taking that knowledge (which may be copyrighted) and implementing it somewhere else. It's only an issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.
At what point is being trained on copyrighted data any different than a human trained on copyrighted data, who then reimplements it in a transformative way? The big difference is that the LLM can hold more data over more fields vs. a human, true... But if we look at specializations, this can come back to the same, no?
Clean-room design is extremely specific. Anyone who has so much as glanced at Windows source code[1] (or even ReactOS code![2]) is permanently banned from contributing to WINE.
This is 100% unambiguously not clean-room unless they can somehow prove it was never trained on any C compiler code (which they can't, because it most certainly was).
If you have worked on a related copyrighted work you can't work on a clean room implementation. You will be sued. There are lots of people who have tried and found out.
Sure, they weren't trillion-dollar AI companies that could bankroll the defense. But invoking "clean room" while using copyrighted material isn't even an argument; it's just nonsense deployed to prove something when no one asked.
My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.
I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.
This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.
Yes I think any codegen with a lot of tests and verification is more about “fitting” to the tests. Like fitting an ML model. It’s model training, not coding.
But in a lot of programming we discover correctness as we go, which is one reason humans don't completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.
Agreed, but the next step is having an AI agent actually run the business and get the business context it needs the way a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this team's approach, it doesn't seem far-fetched anymore.
As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.
How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...
It's very telling how all these examples are "look, we made it recreate a shittier version of a thing that already exists in the training set".
The fact it couldn't actually stick to the 16 bit ABI so it had to cheat and call out to GCC to get the system to boot says a lot.
Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.
"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.
And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
IMHO a new architecture doesn't really make it any more interesting: there's too many examples of adding new architectures in the existing codebases. Maybe if the new machine had some bizarre novel property, I suppose, but I can't come up with a good example.
If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.
This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...
Look at what those compilers are capable of compiling and to which targets, and compare it to what this compiler can do. Those are wonderful, and I have nothing but respect for them, but they aren't going to be compiling the Linux kernel.
A genuinely impressive effort, but alas, still missing some pretty critical features (const, floating point, bools, inline, anonymous structs in function args).
Ok you can say this about literally any compiler though. The authors of every compiler have intimate knowledge of other compilers, how is this different?
Being written in rust is meaningless IMHO. There is absolutely zero inherent value to something being written in rust. Sometimes it's the right tool for the job, sometimes it isn't.
It means that it's not directly copying existing C compiler code which is overwhelmingly not written in Rust. Even if your argument is that it is plagiarizing C code and doing a direct translation to Rust, that's a pretty interesting capability for it to have.
Translating things between languages is probably one of the least interesting capabilities of LLMs - it's the one thing that they're pretty much meant to do well by design.
Surely you agree that directly copying existing code into a different language is still plagiarism?
I completely agree that "rewrite this existing codebase into a new language" could be a very powerful tool. But the article is making much bolder claims. And the result was more limited in capability, so you can't even really claim they've achieved the rewrite skill yet.
Honestly, probably not a lot. Not that many C compilers are compatible with all of GCC's weird features, and the ones that are, I don't think are written in Rust. Hell, even clang couldn't compile the Linux kernel until ~10 years ago. This is a very impressive project.
> when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.
> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel
This is a remarkably creative solution! Nicely done.
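What's described there is essentially delta debugging over the set of translation units, with GCC as the known-good oracle. For anyone curious what that loop looks like, here is a minimal C sketch of the idea; the file list, the build_and_boot.sh script, and the single-culprit assumption are all hypothetical, not the author's actual harness.
```
/* Minimal sketch of the "GCC as a known-good oracle" idea described above.
 * Hypothetical: build_and_boot.sh would rebuild the kernel with the
 * per-file compiler assignment and boot the result under QEMU. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_FILES 4096

static const char *files[MAX_FILES];
static int use_ccc[MAX_FILES];   /* 1 = Claude's C compiler, 0 = GCC */
static int nfiles;

/* Hypothetical oracle: write the assignment, rebuild, boot.
 * Returns 1 if the kernel reaches userspace, 0 otherwise. */
static int build_and_boot(void)
{
    FILE *f = fopen("cc_assignment.txt", "w");
    if (!f)
        return 0;
    for (int i = 0; i < nfiles; i++)
        fprintf(f, "%s %s\n", files[i], use_ccc[i] ? "ccc" : "gcc");
    fclose(f);
    return system("./build_and_boot.sh") == 0;   /* hypothetical script */
}

/* Invariant: only files in [lo, hi) are built with ccc, and that build
 * fails to boot. Narrow the range down to a single culprit file. */
static int bisect(int lo, int hi)
{
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;

        for (int i = mid; i < hi; i++)   /* revert upper half to GCC */
            use_ccc[i] = 0;

        if (build_and_boot()) {
            /* Boots now, so a culprit is in the reverted upper half:
             * put it back on ccc and move the lower half to GCC. */
            for (int i = mid; i < hi; i++) use_ccc[i] = 1;
            for (int i = lo; i < mid; i++) use_ccc[i] = 0;
            lo = mid;
        } else {
            hi = mid;                    /* still broken: culprit below mid */
        }
    }
    return lo;
}

int main(void)
{
    /* Toy file list; a real harness would read the kernel's object list. */
    static const char *demo[] = { "kernel/fork.c", "mm/slub.c", "fs/namei.c", "lib/string.c" };
    nfiles = 4;
    for (int i = 0; i < nfiles; i++) { files[i] = demo[i]; use_ccc[i] = 1; }

    if (build_and_boot()) {
        puts("all-ccc build already boots");
        return 0;
    }
    printf("first culprit: %s\n", files[bisect(0, nfiles)]);
    return 0;
}
```
Each round either boots (so a culprit must be in the half that was just reverted to GCC) or still fails (so a culprit remains in the half left on Claude's compiler), halving the suspect set per build; separate agents can work disjoint ranges in parallel, which matches the article's description.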
This is very much a "vibe coding can build you the Great Pyramids but it can't build a cathedral" situation, as described earlier today: https://news.ycombinator.com/item?id=46898223
I know this is an impressive accomplishment and is meant to show us the future potential, but it achieves big results by throwing an insane amount of compute at the problem, brute forcing its way to functionality. $20,000 set on fire, at Claude's discounted Max pricing no less.
Linear results from exponential compute is not nothing, but this certainly feels like a dead-end approach. The frontier should be more complexity for less compute, not more complexity from an insane amount more compute.
Is there really value being presented here? Is this codebase a stable enough base to continue developing this compiler, or does it warrant a total rewrite? Honest question; it seems like the author mentioned it being at its limits. This mirrors my own experience with Opus, in that it isn't that great at defining abstractions, in one shot at least. Maybe with enough loops it could converge, but I haven't seen definite proof of that in the current generation with these ambitious clickbaity projects.
If it generates a booting kernel and passes the test suite at 99% it's probably good enough to use, yeah.
The point isn't to replace GCC per se, it's to demonstrate that reasonably working software of equivalent complexity is within reach for $20k to solve whatever problem it is you do have.
> that reasonably working software of equivalent complexity is within reach for $20k to solve
But if this can't come close to replacing GCC and can't be modified without introducing bugs, then it hasn't proven this yet. I learned some new hacks from the paper, and that's great and all, but from my experience of trying to harness even 4 Claude sessions in parallel on a complex task, it just goes off the rails in terms of coherence. I'll try the new techniques, but my intuition is that it's not really as good as you're selling it.
Thank you. That was a long article that started with a claim that was backed up by no proof, dismissing it as not the most interesting thing they were talking about when in fact it's the baseline of the whole discussion.
If I, a human, read the source code of $THING and then later implement my own version, that's not a "clean-room" re-implementation. The whole point of "clean-room" is that no single person has access to both the original code and the new code. (That way, you can legally prove that no copyright infringement took place.)
But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".
That's not the only way to protect yourself from accusations of copyright infringement. I remember reading that the GNU utils were designed to be as performant as possible in order to force themselves to structure the code differently from the unix originals.
This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)
It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics, now it's all this.
I'll still work on it, of course. It just won't be so surprising.
It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).
First 10 commits, "git log --all --pretty=format:%s --reverse | head":
```
Initial commit: empty repo structure
Lock: initial compiler scaffold task
Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
Lock: implement array subscript and lvalue assignments
Implement array subscript, lvalue assignments, and short-circuit evaluation
Add idea: type-aware codegen for correct sized operations
Lock: type-aware codegen for correct sized operations
Implement type-aware codegen for correct sized operations
Lock: implement global variable support
Implement global variable support across all three backends
```
The interesting thing here is what this code is worth (in money terms). I would say it's worth only the cost of recreation, apparently $20,000, and not very much more. Perhaps you can add a bit for the time taken to prompt it. Anyone who can afford that can use the same prompt to generate another C compiler, and another one, and another one.
GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.
In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.
The above is from the "sparks of AGI paper" on GPT-4, where they were floored that it could coherently reason through the 3 steps of inverting things (6 -> 9 -> 7 -> 4) while GPT 3.5 was still spitting out a nonsense argument of this form:
This is from March 2023 and it was genuinely very surprising at the time that these pattern-matching machines trained on next-token prediction could do this. Something like an LSTM can't do anything like this at all, btw; nowhere close.
To me it's very surprising that the C compiler works. It takes a ton of effort to build such a thing. I can imagine the flaws actually do get better over the next year as we push the goalposts out.
We live a wonderful time where I can spend hours and $20000 to build a C compiler which is slow and inefficient and anyway requires an existing great compiler to even work, and then neither I nor the agent has any idea on how to make it useful :D
Maybe I'm naive, but I find these re-engineering complex product posts underwhelming. C Compilers exist and realistically Claudes training corpus contains a ton of C Compiler code. The task is already perfectly defined. There exists a benchmark of well-adopted codebases that can be used to prove if this is a working solution. Half the difficulty in making something is proving it works and is complete.
IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.
I don't see this as just an exercise in making a new useful thing, but as benchmarking the SOTA model's ability to create a massive* project on its own, with some verifiable metrics of success. I believe they were able to build FFmpeg with this Rust compiler?
How much would it cost to pay someone to make a C compiler in rust? A lot more than $20k
* massive meaning "total context needed" >> model context window
And how long will it take before an open model recreates this. The "vibe" consensus before "thinking" models really took off was that open was ~6mo behind SotA. With the massive RL improvements, over the past 6 months I've thought the gap was actually increasing. This will be a nice little verifiable test going forward.
I agree. I don't understand why there are so many software engineers who are excited about this. I would only be excited if I were a founder in addition to being a software engineer.
> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
> If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
I don't know if you could. Let's say you get a check for $20k: how long will it take you to make an equivalently performing and compliant compiler? Are you going to put your life on pause until it's done, for $20k? Who's going to pay your bills when the $20k is gone after 3 months?
If we're just writing off the billions in up front investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.
Clicked on the first thing I happen to be interested in - SIMD stuff - and ended up at https://github.com/anthropics/claudes-c-compiler/blob/6f1b99..., which is a fast path incompatible with the _mm_free implementation; pretty trivial bug, not even actually SIMD or anything specialized at all.
A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.
However it was achieved, building such a complex project as a C compiler on a $20k budget in full autonomy is quite impressive.
IMHO some commenters focus so much on the cons (many of which, to be fair, the blog post itself acknowledges) that they forget to be genuinely impressed by the steps forward.
Next time can you build a Rust compiler in C? It doesn't even have to check things or have a borrow checker, as long as it reduces the compile times so it's like a fast debug iteration compiler.
You will experience very spooky behaviour if you do this, as the language is designed around those semantics. Nonetheless, mrustc exists: https://github.com/thepowersgang/mrustc
It will not be noticeably faster because most of the time isn't spent in the checks, it's spent in the codegen. The cranelift backend for rustc might help with this.
1) obvious green field project
2) well defined spec which will definitely be in the training data
3) an end result which lands you 90% of the way to the finish
Now comes the hard part, the last 10%. Still not impressed here. Since fixing issues at the end was impossible without introducing bugs, I have doubts about quality.
I'm glad they do call it out in the end. That's fair
We went from barely able to ask these things to write a function to writes a compiler that actually kind of works in under a year. But sure, keep moving the goal posts!
I'm not particularly impressed that it can turn C into an SSA IR or assembly etc. The optimizations, however sophisticated, are where anything impressive would be. Then again, we have lots of examples in the training set, I would expect; C compilers are probably the most popular of all compilers. What would be more impressive is for it to have made a compiler for a well-defined language that isn't very close to a popular language.
What I am impressed by is that the task it completed had many steps and the agent didn't get lost or caught in a loop in the many sessions and time it spent doing it.
I think we’re getting to a place where for anything with extensive verification available we’ll be “fitting” code to a task against tests like we fit an ML model to a loss function.
Now, this is fairly "easy" as there are a multitude of implementations and specs all over the Internet. How about trying to design a new language that is unquestionably better/safer/faster for low-level system programming than C/Rust/Zig? ML is great at aping existing stuff, but how about pushing it to invent something valuable instead?
How about we get the LLMs to collaborate and design a perfect programming language for LLM coding? It would be terse (fewer tokens), easy for pattern searches, etc., and very fast to build and iterate over.
I cannot decide if LLMs would be excellent at writing in pure binary (why waste all that context on superfluous variable names and function symbols) or be absolutely awful at writing pure binary (would get hopelessly lost without the huge diversification of tokens).
We would still need the language to be human-readable, but it could be very dense. They could build the ultimate std lib that goes directly to kernels, so a call like spawn is all the tokens it needs to start a coroutine, for example.
Well, you can use Jules and spend zero dollars on it. I also created a similar project: a C11 compiler in Rust using an AI agent plus one developer (https://github.com/bungcip/cendol). Not fully automated like Anthropic did it, but at least I can understand what it did.
I will say that one thing that's extremely interesting is that everyone laughed at and made fun of Steve Yegge when he released Gas Town, which centered exactly around this idea: having more than a dozen agents working on a project simultaneously, with some generalized agents implementing features while others are more specialized and tasked with second-order work, all run independently in a loop from an orchestrator until they've finished the project, working on worktrees and resolving merge conflicts as a coordination mechanism. But it's starting to look like he was right. He really was aiming for where the puck was headed. First we got Cursor with the fast-render browser, then we got Kimi K2.5 releasing with (from everything I can tell) genuinely innovative RL techniques for orchestrating agent swarms. And now we have this: Anthropic themselves doing a Gas Town-style agent-swarm model of development. It's beginning to look like he absolutely did know where the puck was headed before it got there.
Now, whether we should actually be building software in this fashion or even headed in this direction at all is a completely separate question. And I would tend strongly towards no. Not until at least we have very strong, yet easy to use concise and low effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc; and even then, we'll end up pair programming those formal specifications and batteries of tests with AI agents. Not writing them ourselves, since that's inefficient, nor turning them over to agent swarms, since they are very important. And if we turn them over to swarms, we'd end up with an infinite regress problem. And ultimately, that's just programming at a higher level at that point. So I would argue we should never predominantly develop in this way.
But still, there is prescience in Gas Town, apparently, and that's interesting.
Cool article, interesting to read about their challenges. I've tasked Claude with building an Ada83 compiler targeting LLVM IR - which has gotten pretty far.
I am not using teams though and there is quite a bit of knowledge needed to direct it (even with the test suite).
They should add this to the benchmark suite, and create a custom eval for how good the resulting compiler is, as well as how maintainable the source code.
This would be an expensive benchmark to run on a regular basis, though I guess for the big AI labs it's nothing. Code quality is hard to objectively measure, however.
Brute forcing a problem with a perfect test oracle and a really good heuristic (how many c compilers are in the training data) is not enough to justify the hype imo.
Yes this is cool. I actually have worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious unfulfilling work. Though we caught issues with both the specifications/test oracle when doing the work. Also many of the team members learned and are now SMEs for related systems.
Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
It wrote the compiler in Rust. As far as I know, there aren't any Rust based C compilers with the same capabilities. If you can find one that can compile the Linux kernel or get 99% on the GCC torture test suite, I would be quite surprised. I couldn't in a search.
Maybe read the article before being so dismissive.
Why does the language of the compiler matter? It's a solved problem, and since other implementations are already available, anyone can already transpile them to Rust.
Direct transpilation would create a ton of unsafe code (this repo doesn't have any) and fixing that would require a lot of manual fixes from the model. Even that would be a massive achievement, but it's not how this was created.
It means that if you already have, or are willing to build, a very robust test suite, and the task is a complicated but already-solved problem, you can get a sub-par implementation for a semi-reasonable amount of money.
I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.
This is almost like asking me to invent a path-finding algorithm when I've been taught Dijkstra's and A*.
It's a bit disappointing that people are still re-hashing the same "it's in the training data" line from 3 years ago. It's not like any LLM could regurgitate millions of LoC 1-for-1 from any training set... This is not how it works.
A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
In this case it's not reproducing training data verbatim but it probably is using algorithms and data structures that were learned from existing C compilers. On one hand it's good to reuse existing knowledge but such knowledge won't be available if you ask Claude to develop novel software.
I wouldn't say I need to invent much that is strictly novel, though I often iterate on what exists and delve into novel-ish territory. That being said I'm definitely in a minority where I have the luxury/opportunity to work outside the monotony of average programming.
The part I find concerning is that I wouldn't be in the place I am today without spending a fair amount of time in that monotony, really delving in to understand it and slowly pushing outside its boundary. If I were starting programming today, I can confidently say I would've given up.
They're very good at reiterating, that's true. The issue is that without the people outside of "most humans" there would be no code and no civilization. We'd still be sitting in trees. That is real intelligence.
"This AI can do 99.99%* of all human endeavours, but without that last 0.01% we'd still be in the trees", doesn't stop that 99.99% getting made redundant by the AI.
* vary as desired for your preference of argument, regarding how competent the AI actually is vs. how few people really show "true intelligence". Personally I think there's a big gap between them: paradigm-shifting inventiveness is necessarily rare, and AI can't fill in all the gaps under it yet. But I am very uncomfortable with how much AI can fill in for.
Here's a potentially more uncomfortable thought: if all the people through history with the potential for "true intelligence" had had a tool that did 99% of everything, do you think they would have had the motivation to learn enough of that 99% to gain insight into the not-yet-discovered?
This is a good rebuttal to the "it was in the training data" argument - if that's how this stuff works, why couldn't Opus 4.5 or any of the other previous models achieve the same thing?
They couldn't do it because they weren't fine-tuned for multi-agent workflows, which basically means they were constrained by their context window.
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
Because for all those projects, the effective solution is to just use the existing implementation and not launder code through an LLM. We would rather see a stab at fixing CVEs or implementing features in open source projects. Like the wifi situation in FreeBSD.
LLMs can regurgitate almost all of the Harry Potter books, among others [0]. Clearly, these models can actually regurgitate large amounts of their training data, and reconstructing any gaps would be a lot less impressive than implementing the project truly from scratch.
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.
The title should have said "Anthropic stole GCC and other open-source compiler code to create a subpar, non-functional compiler", without attribution or compensation. Open source was never meant for thieving megacorps like them.
> This was a clean-room implementation (Claude did not have internet access at any point during its development);
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
There's a terrible bug where once it compacts then it sometimes pulls in .o or binary files and immediately fills your entire context. Then it compacts again...10m and your token budget is gone for the 5 hour period. edit: hooks that prevent it from reading binary files can't prevent this.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal two weeks of one man's work.
Except the one man might explain such arbitrary and shitty code as this:
Oh god, the more I look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this, when any company that takes this seriously needs it maintained and cannot...
I'm trying to recall a quote. Some war where all defeats were censored in the news, possibly Paris was losing to someone. It was something along the lines of "I can't help but notice how our great victories keep getting closer to home".
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
Great. Did your compiler support three different architectures (four, if you include x86 in addition to x86-64) and compile and pass the test suite for all of this software?
> Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.
Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.
> I can already feel the contracts coming to fix LLM slop
First, the agents will attempt to fix issues on their own. Most easy problems will be fixed or worked around in this manner. The hard problems will require a deeper causal model of how things work. For these, the agents will give up. But by then the codebase has evolved to a point where no one understands what's going on, including the agents and their human handlers. Expect your phone to ring at that point, and prepare to ask for a ransom.
Claude requires many lifetimes' worth of data to "learn". Evolution aside, humans don't require much data to learn, and our learning happens in real time in response to our environment.
Train Claude without the programming dataset and give it a dozen of the best programming books, it'll have no chance of writing a compiler. Do the same for a human with an interest in learning to program and there's a good chance.
> I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.
> So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
Do you not see the difference between a toy language and a clean room implementation that can compile Linux, QEMU, Postgres, and sqlite? (No, it doesn't have the assembler and linker.)
No? That was a frontend for a toy language calling using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?
Can it create employment? How is this making life better.
I understand the achievement, but come on: wouldn't it be something to show if you had created employment for 10,000 people using your 20,000 USD!
Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.
Didn't you hear? We're heading towards a workless utopia where everything will be free (according to people who are actively working to eliminate things like food assistance for less fortunate mothers and children.)
I'm struggling to even parse the syntax of "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE", but assuming that you're talking about resource allocation, my answer is UBI or something similar to it. We only need to "reward" for action when the resources are scarce, but when resources are plentiful, there's no particular reason not to just give them out.
I know it's "easier to imagine an end to the world than an end to capitalism", but to quote another dreamer: "Imagine all the people sharing all the world".
Except resources won't be plentiful for a long while since AI is only impacting the service sector. You can't eat a service, you can't live in one. SAAS will get very cheap though...
Obviously a human in the loop is always needed and this technology that is specifically trained to excel at all cognitive tasks that humans are capable of will lead to infinite new jobs being created. /s
Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.
From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.
That's because you need to implement a bunch of gcc-specific behavior that linux relies on.
A 100% standards compliant c23 compiler can't compile linux.
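To make that concrete, here's a small illustrative snippet (not taken from the kernel) using a few GNU extensions of the kind Linux leans on everywhere; `gcc -std=gnu11` accepts it, while a strictly conforming ISO C compiler with no extensions would not.
```
/* Illustrative only: a few GNU C extensions of the kind the Linux kernel
 * relies on. Builds with gcc -std=gnu11; strict ISO C rejects most of it. */
#include <stdio.h>

/* typeof + statement expressions: how the kernel's min()/max()-style
 * macros avoid double evaluation. */
#define my_min(a, b) ({            \
        typeof(a) _a = (a);        \
        typeof(b) _b = (b);        \
        _a < _b ? _a : _b; })

/* __builtin_expect: the basis of the kernel's likely()/unlikely() hints. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Attributes controlling layout, used all over the kernel. */
struct on_wire {
    unsigned char type;
    unsigned int  len;
} __attribute__((packed));

int main(void)
{
    int x = my_min(3, 4);

    if (unlikely(x != 3))
        return 1;

#if defined(__x86_64__)
    /* GCC-style inline asm (x86-64 here, purely as an example). */
    asm volatile("nop");
#endif

    printf("sizeof(struct on_wire) = %zu\n", sizeof(struct on_wire));
    return 0;
}
```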
Ok, yes, that's true, though my understanding is that it's not that GCC is non-compliant, but rather that it includes extensions beyond the standard, which is allowed by the standard, which says (in section 4, Conformance):
> A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program
Anyway, this just makes Claude's achievement here more impressive, right?
A simple C89 compiler is a textbook task; a GCC-compatible compiler targeting multiple architectures that can pass 99% of the GCC torture test suite is absolutely not.
You could hire a reasonably skilled dev in India for a week for $1k, or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.
Still a really cool project!
Even worse in many cases because they are so over engineered nobody understands how they work.
Seen this way too often.
Exploits and HFT are the two examples I can think of. Both are usually closed source because of the financial incentives.
It is standing on the shoulders of giants (all of the compilers of the past baked into its training data, plus the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.
On a side quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. Would be hard to avoid?
Does it really boot...?
They don't need 16b x86 support for the RISCV or ARM ports, so yes, but depends on what 'it' we're talking about here.
Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?
I don't agree that all the claims are backed up by their own comments, which means that there's probably other places where it falls down.
It's... misrepresentation.
Like Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.
Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?
Even if it's 99%, that's not the same story as the one they tried to write. And if they had written that tale instead, it would be more impressive, rather than "there are some holes; how many?"
The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.
They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
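A tiny standalone illustration of the order-of-operations point (a toy, nothing to do with any particular inference stack): the same float values summed in two different orders generally disagree, while fixed-point integer accumulation gives a bit-identical result regardless of order.
```
/* Toy demo: float results depend on summation order; integer (fixed-point)
 * accumulation does not. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define N 1000000

static float vals[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        vals[i] = 1.0f / (float)(i + 1);   /* wide range of magnitudes */

    float fwd = 0.0f, rev = 0.0f;
    for (int i = 0; i < N; i++)      fwd += vals[i];   /* large terms first */
    for (int i = N - 1; i >= 0; i--) rev += vals[i];   /* small terms first */

    /* Same values converted once to Q16.16 fixed point: exact integer adds,
     * so the order of accumulation cannot change the result. */
    int64_t ffwd = 0, frev = 0;
    for (int i = 0; i < N; i++)      ffwd += (int64_t)(vals[i] * 65536.0f);
    for (int i = N - 1; i >= 0; i--) frev += (int64_t)(vals[i] * 65536.0f);

    printf("float : fwd=%.7f rev=%.7f equal=%d\n", fwd, rev, fwd == rev);
    printf("fixed : fwd=%lld rev=%lld equal=%d\n",
           (long long)ffwd, (long long)frev, ffwd == frev);
    return 0;
}
```
The float sums differ in the low digits; the fixed-point sums are identical, at the cost of whatever precision the Q16.16 format throws away.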
Yeah, in the same way how pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time given the seeds are the same!
But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.
EDIT (since HN is preventing me from responding):
> Some people care more about compiler speed than the correctness?
Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't really care as much about correctness as they may claim otherwise, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such case was left up to interpretation of the implementer by ISO WG14.
Anyway, please define: "correctness".
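A concrete (and purely illustrative, not from the thread or any real codebase) example of that kind of UB reliance: this compiles cleanly and prints the "expected" answer at -O0 on two's-complement machines, yet GCC and Clang may legally fold the check away at -O2, because signed overflow is undefined.
```
/* Illustrative only: code that "works" at -O0 but relies on undefined
 * behaviour (signed overflow), so an optimizer is free to break it. */
#include <limits.h>
#include <stdio.h>

/* A common but broken overflow check: if a + 1 overflows, the behaviour is
 * undefined, so the compiler may assume it cannot happen and fold this
 * comparison to "false". */
static int next_would_overflow(int a)
{
    return a + 1 < a;            /* UB when a == INT_MAX */
}

/* The well-defined version checks before doing the arithmetic. */
static int next_would_overflow_ok(int a)
{
    return a == INT_MAX;
}

int main(void)
{
    int a = INT_MAX;
    printf("UB check:      %d\n", next_would_overflow(a));     /* often 0 at -O2 */
    printf("defined check: %d\n", next_would_overflow_ok(a));  /* always 1 */
    return 0;
}
```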
They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and is something I still find surprising.
Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…
https://llvm.org/docs/MLGO.html
Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.
The fact that the optimizations aren't as good as the 40-year-old GCC project's? Eh, I think people who focus on that are probably still in some serious denial.
It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.
Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.
On top of that, Anthropic is losing money on it.
All of those things combined, viability remains a serious question.
It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e
Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.
To be fair, it was a pretty horrible mess of useEffects. But just another data point.
Also I was hoping opus would finally be able to handle complex typescript generics, but alas...
I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???
You should look it up. :)
If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.
And it's really expensive, despite your suspicions.
We figured out that LOC was a useless productivity metric in the 80s.
Microsoft paid my company a lot of money to write code. And in the end you were able to count it, and the LOC is a perfectly fine metric which is still used today to measure complexity of a project.
If you actually work in software you know this.
I have no idea what point you're trying to make - but I've grown very tired of all the trolls attacking me. Good night.
That level of quality should be sufficient.
Do you know any low quality programmers that write C compilers in rust THAT CAN BUILD LINUX?
No you don't. They do not exist.
You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute forced it, most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.
You take care now.
I think I'll ask codex to write me a Chrome extension to censor you. :)
EDIT: Done!
(() => { "use strict";
})();This has got to be my favorite one of them all that keeps coming up in too many comments… You know who also was losing money in the beginning?! every successful company that ever existed! some like Uber were losing billions for a decade. and when was the last time you rode in a taxi? (I still do, my kid never will). not sure how old you are and if you remember “facebook will never be able to monetize on mobile…” - they all lose money, until they do not
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all
```
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
^~~~~~~~~
fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
^~~~~~~~~
fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'
```
Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).
Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.
Evangelism and convincing upstream kernel developers why clang support was worth anyones while.
https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).
I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious about approaches to compiling C concurrently (and parsing C in parallel, or lazily) that haven't been explored in LLVM, and that I think might be safer to do in Rust. I don't know enough about grammars to know whether it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.
LLVM is pretty well designed for test. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross-check my implementation (I would compare my interpretation of the token stream against clang's). Having a standard module system makes composing a toolchain out of reusable pieces seem like perhaps a better approach, but maybe folks with more experience with rustc have scars to disagree?
Heh, earlier today I was just thinking about how crazy a proposal it would actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with e-graphs).
Granted, marketing sucks up far too much money for any startup, and again, we don't know the actual numbers in play, however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW).
but regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer
Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus's abilities." Yeah, that's fair, but still highly impressive IMO.
This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.
It's not a clean-room implementation because of this:
> The fix was to use GCC as an online known-good compiler oracle to compare against
I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.
I'd argue that no one would really care given it's GCC.
But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.
Check out the paper above on Absolute Zero. Language models don't just repeat code they've seen; they can learn to code given the right training environment.
It's anything but a clean-room design. A clean-room design is a very well-defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."
https://en.wikipedia.org/wiki/Clean-room_design
The "without infringing any of the copyrights" contains "any".
We know for a fact that models are extremely good at storing information, with the highest compression rate ever achieved. That they typically decompress that information in a lossy way does not mean the information wasn't used in the first place.
Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spotting out Harry Potter verbatim, there is information being stored.
It's not a clean-room design, plain and simple.
It is a research topic for heaven's sake:
https://arxiv.org/abs/2504.16046
A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.
https://github.com/PhilippRados/wrecc (unfinished)
https://github.com/ClementTsang/rustcc
https://codeberg.org/notgull/dozer (unfinished)
https://github.com/jyn514/saltwater
I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.
https://arxiv.org/pdf/2601.02671
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.
You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.
> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"
So, yes, it is not "literally verbatim" (~96% verbatim), and there is indeed A LOT (hundreds or thousands of prompting attempts) to make this happen.
I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".
I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"
The lesson here is that the Internet compresses pretty well.
A frontier model (e.g. the latest Gemini or GPT) is likely several-to-many times larger than 500GB. Even DeepSeek V3 was around 700GB.
But your overall point still stands, regardless.
The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"
It is enough to have read even parts of a work for something to be considered a derivative.
I would also argue that language models that need gargantuan amounts of training material in order to work can, by definition, only output derivative works.
It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.
For IP rights, I'll buy that. Not as important when the question is capabilities.
> I would also argue that language models that need gargantuan amounts of training material in order to work can, by definition, only output derivative works.
For similar reasons, I'm not going to argue against anyone saying that all machine learning today doesn't count as "intelligent":
It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.
ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.
https://arxiv.org/pdf/2601.02671
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."
if you want to quantify the "near" here.
Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.
Because it _has_ been enough: if you can recall things, your implementation ends up not being "clean room", and it gets trashed by the lawyers who get involved.
I mean... It's in the name.
> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.
If it can recall... Then it is not a clean room implementation. Fin.
Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.
It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.
A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.
Remember that all these companies can only exist because of massive (over)investments made in the hope of insane returns and AGI promises. Meanwhile all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.
Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.
And at least with Moore's law, we had some understanding of the physical realities as transistors got smaller and smaller, and could reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could go either way.
Not from me you haven't!
> "they've hit a wall, no more data, running out of data, plateau this, saturated that"
Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!
But you've probably been hearing that for 3 years too (though not from me).
> Models keep on getting better, at more broad tasks, and more useful by the month.
If you say so, I'll take your word for it.
Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.
Intel, at the time the unquestioned world leader in semiconductor fabrication, was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10GHz by 2010!" was something they predicted publicly, in earnest!
It, uhhh, didn't quite work out that way.
Idk, that sounds remarkably similar to these AI models to me.
I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.
SWEs may be seeing benefits. But in other areas? That doesn't seem to be the case. Consumers may use it as a preferred interface for search, but that's a different discussion.
So where is that?
History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.
Prove this statement wrong.
Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.
If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.
The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
But reimplementing that isn't impressive, because it's not a clean-room implementation if you trained on that data to make the model that regurgitates the effort.
Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.
Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.
I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
That seems implausible.
Why, exactly?
Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.
If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.
Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.
If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?
Except it's trained on all the source out there, so I assume on GCC and Clang. I wonder how similar the code is to either.
Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯
The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.
Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.
https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.
I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.
I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.
1. data analysis / visualization / …
2. “is this possible? can this even be done?”
For #1 I don't do much by hand anymore; for #2 I still mostly do it all "by hand", and not for lack of serious trying. So "it can do #1 1000x better than me because it's a generally solved set of problems it was trained on, while it can't effectively do #2" fits perfectly.
Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.
Hard to find fully specified problems like this in the wild.
I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.
I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?
> Write extremely high-quality tests
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.
> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.
All said and done, that its even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!
There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.
The promises made are ABSOLUTELY relevant to how promising or not these experiments are.
Maybe the general population will be willing to have more constructive discussions about this tech once the trillion-dollar companies stop pillaging everything they see in front of them and cease acting like sociopaths whose only objectives seem to be concentrating power, generating dissidence, and harvesting wealth.
“Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”
Otherwise it's not clean-room, it's plagiarism.
In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.
You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!
I have read nowhere near as much code (or anything) as what Claude has to read to get to where it is.
And I can write an optimizing compiler that isn't slower than GCC -O0
(prompt: what does a clean room implementation mean?)
From ChatGPT without login BTW!
> A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.
> The core idea is separation.
> Here’s how it usually works:
> The basic setup
> Two teams (or two roles):
> Specification team (the “dirty room”)
> Looks at the original product, code, or behavior
> Documents what it does, not how it does it
> Produces specs, interfaces, test cases, and behavior descriptions
> Implementation team (the “clean room”)
> Never sees the original code
> Only reads the specs
> Writes a brand-new implementation from scratch
> Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.
> Why people do this
> Reverse-engineering legally
> Avoid copyright infringement
> Reimplement proprietary systems
> Create open-source replacements
> Build compatible software (file formats, APIs, protocols)
I really am starting to think we have achieved AGI. > Average (G)Human Intelligence
LMAO
If you try to reimplement something in a clean room, it's a step-by-step process, using your own accumulated knowledge as the basis. That knowledge you hold in your brain all too often includes code that may have copyrights on it, from the companies you worked for.
Is it any different for a LLM?
The fact that the LLM is trained on more data does not change that when you work for a company, leave it, and take that accumulated knowledge to a different company, you are by definition taking that knowledge (which may be copyrighted) and implementing it somewhere else. It's only an issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.
At what point is being trained on copyrighted data any different than a human trained on copyrighted data, who then reimplements it in a transformative way? The big difference is that the LLM can hold more data across more fields than a human, true... But if we look at specializations, this comes back to the same thing, no?
This is 100% unambiguously not clean-room unless they can somehow prove it was never trained on any C compiler code (which they can't, because it most certainly was).
[1] https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
[2] https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Guide...
They didn't have trillion-dollar AI companies to bankroll the defense, sure. But bringing up clean-room design and the use of copyrighted material isn't even an argument; it's just nonsense trying to prove something when no one asked.
My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.
I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.
/me Laughs in "unspecified behavior."
Unspecified is whatever you want it to mean. I am also laughing, having never heard "unspecified" before.
This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.
But in a lot of programming we discover correctness as we go, which is one reason humans don't completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.
[0] https://andonlabs.com/evals/vending-bench-2
It's very telling how all these examples are "look, we made it recreate a shittier version of a thing that already exists in the training set".
Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.
And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.
https://github.com/jyn514/saltwater
https://github.com/ClementTsang/rustcc
https://github.com/maekawatoshiki/rucc
> https://github.com/jyn514/saltwater
This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...
> https://github.com/ClementTsang/rustcc
This will compile basically no real-world code. The only supported data type is "int".
> https://github.com/maekawatoshiki/rucc
This is just a frontend. It uses LLVM as the backend.
https://github.com/rustcoreutils/posixutils-rs/tree/main/cc
I completely agree that "rewrite this existing codebase into a new language" could be a very powerful tool. But the article is making much bolder claims. And the result was more limited in capability, so you can't even really claim they've achieved the rewrite skill yet.
If Claude had NOT been trained on compiler code, it would NOT have been able to build a compiler.
Definitely signals the end of software IP or at least in its present form.
> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel
This is a remarkably creative solution! Nicely done.
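The core trick is essentially delta debugging over the set of kernel source files. Here's a rough C sketch of the bisection loop; the builds_and_boots oracle is a hypothetical stand-in for "rebuild with this subset compiled by the new compiler, everything else by GCC, and boot-test the result", not the harness from the post.

    /* Rough sketch of the GCC-as-oracle idea: bisect over the kernel's
     * source files to isolate which one the new compiler miscompiles.
     * `oracle_fn` stands in for "rebuild the kernel with these files
     * compiled by the new compiler (everything else by GCC) and see if
     * it boots" -- a hypothetical hook, not the real test harness. */
    #include <stdio.h>
    #include <string.h>

    typedef int (*oracle_fn)(const char **suspect, size_t n);

    /* Assumes exactly one miscompiled file; returns its index. */
    static size_t bisect_failing_file(const char **files, size_t n,
                                      oracle_fn boots)
    {
        size_t lo = 0, hi = n;           /* invariant: culprit is in [lo, hi) */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            /* Compile only [lo, mid) with the new compiler this round. */
            if (boots(files + lo, mid - lo))
                lo = mid;                /* that half is fine; look right */
            else
                hi = mid;                /* still broken; culprit is here */
        }
        return lo;
    }

    /* Mock oracle for demonstration: pretend "fs.c" is the bad file. */
    static int mock_boots(const char **suspect, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (strcmp(suspect[i], "fs.c") == 0)
                return 0;                /* including it breaks the boot */
        return 1;
    }

    int main(void)
    {
        const char *files[] = {"init.c", "mm.c", "sched.c",
                               "fs.c", "net.c", "irq.c"};
        size_t bad = bisect_failing_file(files, 6, mock_boots);
        printf("culprit: %s\n", files[bad]);   /* prints "fs.c" */
        return 0;
    }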
I know this is an impressive accomplishment and is meant to show us the future potential, but it achieves big results by throwing an insane amount of compute at the problem, brute forcing its way to functionality. $20,000 set on fire, at Claude's discounted Max pricing no less.
Linear results from exponential compute is not nothing, but this certainly feels like a dead-end approach. The frontier should be more complexity for less compute, not more complexity from an insane amount more compute.
I would interpret this as being at API pricing. At subscription pricing, it's probably at most 5 or 6 Max subscriptions worth.
To be fair, that's two weeks of the employer cost of a FAANG engineer's labor. And no human hacks a working compiler in two weeks.
It's a lot of AI compute for a demo, sure. But $20k stunts are hardly unique. Clearly there's value being demonstrated here.
The point isn't to replace GCC per se, it's to demonstrate that reasonably working software of equivalent complexity is within reach for $20k to solve whatever problem it is you do have.
Not for general purpose use, only for demo.
> that reasonably working software of equivalent complexity is within reach for $20k to solve
But if this can't come close to replacing GCC and can't be modified without introducing bugs, then it hasn't proven this yet. I learned some new hacks from the paper, and that's great and all, but from my experience of trying to harness even 4 Claude sessions in parallel on a complex task, it just goes off the rails in terms of coherence. I'll try the new techniques, but my intuition is that it's not really as good as you're selling it.
But yeah, either way it just needs to know where to find the stdlib.
But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".
I'll still work on it, of course. It just won't be so surprising.
- All prompts used
- The structure of the agent team (which agents / which roles)
- Any other material that went into the process
This would be a good source for learning, even though I'm not ready to spend $20k just to replicate the experiment.
First 10 commits, "git log --all --pretty=format:%s --reverse | head",
GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.
In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.
If you had known in 2022 that a transformer could pull this off, even with all its flawed code, you would have been floored.
Keep in mind that just a few years ago, the state of the art in what these LLMs could do was questions of this nature:
Suppose g(x) = f⁻¹(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?
The above is from the "Sparks of AGI" paper on GPT-4, where they were floored that it could coherently reason through the 3 steps of inverting things (6 -> 9 -> 7 -> 4), while GPT-3.5 was still spitting out a nonsense argument of this form:
f(f(f(6))) = f(f(g(9))) = f(f(6)) = f(g(7)) = f(9).
This is from March 2023, and it was genuinely very surprising at the time that these pattern-matching machines trained on next-token prediction could do this. Something like an LSTM can't do anything like this at all, btw; nowhere close.
To me it's very surprising that the C compiler works. It takes a ton of effort to build such a thing. I can imagine the flaws actually do get better over the next year as we push the goalposts out.
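For reference, here's the inversion spelled out as a tiny C program (just restating the 6 -> 9 -> 7 -> 4 chain; each given pair g(a) = b implies f(b) = a):

    /* Restating the puzzle in code: since g = f⁻¹, each given pair
     * g(a) = b implies f(b) = a, so chase that lookup three times. */
    #include <stdio.h>

    int main(void)
    {
        /* g(0)=5, g(4)=7, g(3)=2, g(7)=9, g(9)=6 */
        int g_in[]  = {0, 4, 3, 7, 9};
        int g_out[] = {5, 7, 2, 9, 6};

        int x = 6;
        for (int step = 0; step < 3; step++) {
            for (int i = 0; i < 5; i++) {
                if (g_out[i] == x) {     /* f(x) is the a with g(a) = x */
                    x = g_in[i];
                    break;
                }
            }
        }
        printf("f(f(f(6))) = %d\n", x);  /* 6 -> 9 -> 7 -> 4 */
        return 0;
    }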
IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.
How much would it cost to pay someone to make a C compiler in rust? A lot more than $20k
* massive meaning "total context needed" >> model context window
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
I don't know if you could. Let's say you get a check for $20k: how long will it take you to make an equivalently performing and compliant compiler? Are you going to put your life on pause until it's done, for $20k? Who's going to pay your bills when the $20k is gone after 3 months?
A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.
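For anyone curious what that UB typically looks like, here's the classic pattern (illustrative only, x86 SSE2, not lifted from the repo): dereferencing a cast pointer, which silently assumes 16-byte alignment, versus the unaligned-load intrinsic.

    /* Illustrative only (x86 SSE2), not taken from the repo: the classic
     * SIMD pitfall is dereferencing a cast pointer, which assumes 16-byte
     * alignment, instead of using the unaligned-load intrinsic. */
    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Potential UB: *(const __m128i *)p requires suitable alignment.
     * It often "works" until a different target or optimization level
     * decides otherwise. */
    __m128i double_lanes_risky(const int32_t *p)
    {
        __m128i v = *(const __m128i *)p;
        return _mm_add_epi32(v, v);
    }

    /* Well-defined: _mm_loadu_si128 is specified to do an unaligned load. */
    __m128i double_lanes_safe(const int32_t *p)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)p);
        return _mm_add_epi32(v, v);
    }

    int main(void)
    {
        int32_t in[4] = {1, 2, 3, 4};
        int32_t out[4];
        _mm_storeu_si128((__m128i *)out, double_lanes_safe(in));
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 2 4 6 8 */
        return 0;
    }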
Imho some commenters focus so much on the cons (which are many, and honestly also acknowledged by the blog post) that they forget to be genuinely impressed by the steps forward.
It will not be noticeably faster because most of the time isn't spent in the checks, it's spent in the codegen. The cranelift backend for rustc might help with this.
1) an obvious greenfield project 2) a well-defined spec which is definitely in the training data 3) an end result which lands you 90% of the way to the finish
Now comes the hard part, the last 10%. Still not impressed here. Since fixing issues at the end was impossible without introducing bugs, I have doubts about the quality.
I'm glad they do call it out in the end. That's fair
What I am impressed by is that the task had many steps, and the agent didn't get lost or caught in a loop across the many sessions and all the time it spent doing it.
They posted this video, looks like they used `qemu-system-riscv64` to test.
Well there goes my weekend project plans
Now, whether we should actually be building software in this fashion or even headed in this direction at all is a completely separate question. And I would tend strongly towards no. Not until at least we have very strong, yet easy to use concise and low effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc; and even then, we'll end up pair programming those formal specifications and batteries of tests with AI agents. Not writing them ourselves, since that's inefficient, nor turning them over to agent swarms, since they are very important. And if we turn them over to swarms, we'd end up with an infinite regress problem. And ultimately, that's just programming at a higher level at that point. So I would argue we should never predominantly develop in this way.
But still, there is prescience in Gastown apparently, and that's interesting.
It's funny because, by (most) definitions, it is not an artifact:
> a usually simple object (such as a tool or ornament) showing human workmanship or modification as distinguished from a natural object
I am not using teams though and there is quite a bit of knowledge needed to direct it (even with the test suite).
I need to reunderwrite what my vision of the future looks like.
Yes, this is cool. I actually worked on a similar project with a slightly worse test oracle and would gladly never do that sort of work again. Just tedious, unfulfilling work. Though we caught issues with both the specification and the test oracle while doing the work. Also, many of the team members learned from it and are now SMEs for related systems.
Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
Maybe read the article before being so dismissive.
If you trained on a neutral representation like an AST or IR, then the source language shouldn't matter. *
* I'm not familiar with how Anthropic builds their models, but training this way should nullify PL differences.
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
This is not entirely ridiculous.
Look at this: https://github.com/7mind/jopa
This is almost like asking me to invent a pathfinding algorithm when I've been taught Dijkstra's and A*.
A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
The part I find concerning is that I wouldn't be in the place I am today without spending a fair amount of time in that monotony, really delving in to understand it and slowly pushing outside its boundary. If I were starting programming today, I can confidently say I would've given up.
"This AI can do 99.99%* of all human endeavours, but without that last 0.01% we'd still be in the trees", doesn't stop that 99.99% getting made redundant by the AI.
* vary as desired for your preference of argument, regarding how competent the AI actually is vs. how few people really show "true intelligence". Personally I think there's a big gap between them: paradigm-shifting inventiveness is necessarily rare, and AI can't fill in all the gaps under it yet. But I am very uncomfortable with how much AI can fill in for.
Then they start improvising and the same person counters with "what a bunch of slop, just making things up!"
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
They only have to keep reiterating this because people are still pretending the training data doesn't contain all the information that it does.
> It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.
Maybe not any old LLM, but Claude gets really close.
https://arxiv.org/pdf/2601.02671v1
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
[0] https://www.theregister.com/2026/01/09/boffins_probe_commerc...
This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.
No, I did not read the article...
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
That's because the "testing" was not done independently. So anything could have been made to be misleading. Hence:
> Written by Nicholas Carlini, a researcher on our Safeguards team.
Please fix.. :)
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal two weeks of one man's work.
Except the one man could at least explain arbitrary and shitty code like this:
https://github.com/anthropics/claudes-c-compiler/blob/main/s...
why x9? who knows?!
Oh god, the more I look at this code the happier I get. I can already feel the contracts coming in to fix LLM slop like this, once any company that takes this seriously needs it maintained and can't...
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
This has been true for all of (known) human history. I’m gonna go ahead and make another bold prediction: tech will keep getting better.
The issue with this blog post is it’s mostly marketing.
Maybe I'm underestimating the simplicity of the C language, but that doesn't sound very plausible to me.
> Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.
Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.
First, the agents will attempt to fix issues on their own. Most easy problems will be fixed or worked-around in this manner. The hard problems will require a deeper causal model of how things work. For these, the agents will give up. But, the code-base has evolved to a point where no-one understands whats going on including the agents and its human handlers. Expect your phone to ring at that point, and prepare to ask for a ransom.
Train Claude without the programming dataset and give it a dozen of the best programming books, it'll have no chance of writing a compiler. Do the same for a human with an interest in learning to program and there's a good chance.
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
That's for $20,000.
Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.
Call it as you wish, but I am certainly not talking about coding values.
I know it's "easier to imagine an end to the world than an end to capitalism", but to quote another dreamer: "Imagine all the people sharing all the world".
I guess if it had only created 1,000 lines, it would be easy to see where those lines came from.
Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.
From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.
> A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program
Anyway, this just makes Claude's achievement here more impressive, right?
building a working C compiler from scratch is literally in my "teach yourself C in 24 hours" book from 30 years ago
Might have been Compiler Design in C from 1990. Looks like that's available for free now: https://holub.com/compiler/
you'll forgive me if I don't ring them in the early hours of the morning...
remember C was specifically designed to be easy to compile
(hence anachronisms like forward declarations)
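A minimal illustration of how declare-before-use keeps compilation single-pass (plain standard C, nothing specific to this project):

    /* C's declare-before-use rule is what makes single-pass compilation
     * practical: by the time the compiler reaches the call in main(), it
     * already knows helper()'s signature from the forward declaration,
     * so it never has to look ahead in the file. */
    #include <stdio.h>

    int helper(int x);               /* forward declaration (prototype) */

    int main(void)
    {
        printf("%d\n", helper(20));  /* compiles against the prototype */
        return 0;
    }

    int helper(int x)                /* definition appears later */
    {
        return x + 22;
    }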