
The Open Source AI Lie: Weight-Washing, Broken Definitions, and Who Benefits


Serendeep Rudraraju

April 08, 2026 · 14 min read

Meta says Llama is open source. The Open Source Initiative, the organization that has maintained the definition of "open source" since 1998, says it isn't. Meta ignores them. A billion downloads later, the man who wrote the original Open Source Definition says the whole attempt to define open source AI has failed.

You've probably called Llama "open source" in an architecture doc at some point. I have. Most of us have. And we were wrong, in ways that have legal and regulatory consequences that aren't obvious until they bite you.

I should warn you up front: I started writing this to make a clean argument against weight-washing, and the other side's numbers kept getting in the way. A billion Llama downloads. Surgical copilots and maternal health chatbots in East Africa built on these models. Thirteen million HuggingFace users who never needed training data to build useful things. The case against caring about the definition is stronger than I wanted it to be.

TL;DR

No major AI model meets the Open Source AI Definition. Not Llama, not DeepSeek, not Mistral, not Qwen, not Gemma. Releasing weights without training data is the AI equivalent of distributing a compiled binary and calling it open source. The EU AI Act grants regulatory benefits to "open source" AI, which means getting the label right has financial consequences. Meanwhile, the people who wrote the original definition are fighting each other about whether their own compromise went too far.

A 28-Year-Old Definition Meets a Trillion-Dollar Industry

Quick history, because it matters.

In 1986, Richard Stallman published the Free Software Definition. Four freedoms: use, study, modify, share. All of them depend on one prerequisite: access to the source code. Without it, "study" and "modify" are empty promises.

In 1998, Christine Peterson coined the term "open source" at a meeting in Palo Alto. Bruce Perens adapted the Debian Free Software Guidelines into the Open Source Definition. He and Eric Raymond founded the OSI to steward it. The definition's core requirement: access to the "preferred form of the work for making modifications." Source code. Not binaries. Not bytecode. The human-readable thing.

For 26 years, nobody argued about what "source" meant.

Then we started shipping AI models, and the word stopped being obvious.

An AI model isn't one thing. It's several: architecture code, training code, training data, and model weights. The weights are the output of training. The result, not the recipe. When Meta releases Llama's weights, it's handing you the end product of a process you can't see, can't reproduce, and can't audit. The architecture is there. The inference code is there. But the training data, the thing that shaped what the model actually learned, is nowhere.

Bruce Schneier put it bluntly in November 2024:

"Since for a neural network, the training data is the source code—it's how the model gets programmed—the definition makes no sense."

— Bruce Schneier, "AI Industry Is Trying to Subvert the Definition of Open Source AI"

Here's how the analogy maps:

[Diagram: the compiler analogy — source code compiles to a binary the way training data trains into weights.]

The weights are the compiled artifact. The training data is the source. What "open source" requires you to release is the source; what most AI companies actually withhold is the training data.

That comparison sticks. Releasing weights without training data is like shipping a .exe and calling it open source. Sure, you can run it. You can even fine-tune it, the way you might hex-edit a binary and hope for the best. What you can't do is figure out how it was built, reproduce it, check whether the safety claims hold up, or fix the training process when something goes wrong.

The Honesty Audit

Enough abstraction. I went through the five most-downloaded "open" AI models and checked what they actually give you.

| | Llama 3 | DeepSeek R1 | Mistral 7B | Qwen 2.5 | Gemma 2 |
|---|---|---|---|---|---|
| Weights | Yes | Yes | Yes | Yes | Yes |
| Inference code | Yes | Yes | Yes | Yes | Yes |
| Training code | No | Partial | No | No | No |
| Training data | No | No | No | No | No |
| License | Custom (Meta) | MIT | Apache 2.0 | Apache 2.0 | Custom (Google) |
| OSI-approved license | No | Yes | Yes | Yes | No |
| Commercial restrictions | 700M MAU cap | None | None | None | Yes |
| Use restrictions | Acceptable use policy | Separate policy | None | None | Yes |
| Calls itself "open source" | Yes | Yes | Varies | Yes | No |
| Passes OSAID 1.0 | No | No | No | No | No |

Stare at that table for a second. The rows that actually determine openness — training code and training data — are almost uniformly No. Zero training data across the board. Not one passes OSAID 1.0.
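To make the audit mechanical, here's a small Python sketch. The model data is transcribed from the table above; the pass/fail criteria are a deliberate simplification of OSAID 1.0 compressed into a few booleans, not the real definition, which has more nuance around "data information":

```python
# Simplified OSAID-style check: a release must ship weights, training code,
# and training data under an OSI-approved license with no use restrictions.
# This is an illustrative compression of the definition, not the real test.

RELEASES = {
    "Llama 3":     {"weights": True, "training_code": False, "training_data": False,
                    "osi_license": False, "use_restrictions": True},
    # "Partial" training code counted as incomplete.
    "DeepSeek R1": {"weights": True, "training_code": False, "training_data": False,
                    "osi_license": True,  "use_restrictions": True},
    "Mistral 7B":  {"weights": True, "training_code": False, "training_data": False,
                    "osi_license": True,  "use_restrictions": False},
    "Qwen 2.5":    {"weights": True, "training_code": False, "training_data": False,
                    "osi_license": True,  "use_restrictions": False},
    "Gemma 2":     {"weights": True, "training_code": False, "training_data": False,
                    "osi_license": False, "use_restrictions": True},
    # The releases that do pass: full pipeline, full data, open license.
    "OLMo":        {"weights": True, "training_code": True,  "training_data": True,
                    "osi_license": True,  "use_restrictions": False},
    "Pythia":      {"weights": True, "training_code": True,  "training_data": True,
                    "osi_license": True,  "use_restrictions": False},
}

def passes_osaid(r: dict) -> bool:
    """True only if every component is open and nothing restricts use."""
    return (r["weights"] and r["training_code"] and r["training_data"]
            and r["osi_license"] and not r["use_restrictions"])

for name, release in RELEASES.items():
    print(f"{name:12} -> {'PASSES' if passes_osaid(release) else 'fails'}")
```

Even with the criteria loosened this far, the five commercial releases fail on the same two fields every time.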

The details are worth unpacking, though, because these companies aren't all doing the same thing.

Llama is the worst offender. Meta wrote its own license—not OSI-approved—that caps commercial use at 700 million monthly active users. Think about who that cap targets. It's not protecting indie developers. It's letting Meta harvest community contributions while making sure Google, Amazon, and Microsoft can't compete with Llama derivatives. There's an acceptable use policy restricting whole categories of applications. The Free Software Foundation classified Llama 3.1 as nonfree in January 2025. Google and Microsoft, when asked, agreed to stop calling their restricted models "open source." Meta refused.

DeepSeek R1 comes closest to honesty. MIT license, same one used by jQuery, Rails, and Node.js. No MAU caps, no use restrictions, nothing weird in the grant. But no training data, no full training pipeline. Sit with this for a moment: a Chinese company backed by a quantitative trading firm ships under a more permissive license than the American social media company that won't shut up about "open source AI" as a force for democracy.

Mistral earned enormous goodwill by releasing Mistral 7B under Apache 2.0 in September 2023. Then they pivoted. Larger, more capable models went behind proprietary licenses or API-only access. CEO Arthur Mensch reframed the strategy as "open science" rather than "open source." Credit where it's due: at least that's a more honest label than what Meta uses.

Qwen 2.5 (Alibaba) ships under Apache 2.0, no restrictions. Same playbook as DeepSeek. Whether that's genuine openness or market penetration dressed up nicely, I'll leave to you.

Gemma surprised me. Google calls it "open weights," not "open source." The license is custom and restrictive, which is annoying. But the labeling is honest. Google watched Meta catch heat and apparently decided that not lying about what they're releasing was worth more than the marketing bump.

The models that actually pass the definition? Pythia from EleutherAI. OLMo from AI2. T5 from Google Research. Amber from LLM360. Full code, full weights, full training data. You've almost certainly never shipped any of them to production.

The Institutional Crisis Nobody's Talking About

The OSI spent two years trying to fix this. Twenty-five organizations at the table: Microsoft, Google, Meta, Amazon, the usual suspects. On October 28, 2024, at the All Things Open conference, they published OSAID 1.0.

The compromise: you need code, weights, and "sufficiently detailed information about the data used to train the system, so that a skilled person can build a substantially equivalent system." Not the actual data. A description of the data.

Purists hated it. A description isn't a dataset. Pragmatists ignored it. The community was already building on weights and didn't care what any definition said. The OSI managed to publish something both sides could attack, which is impressive in its own way.

Then it got worse.

In March 2025, Bradley Kuhn of the Software Freedom Conservancy and Richard Fontana of Red Hat ran for the OSI board. Their platform: repeal OSAID 1.0. They made it through the election. Then, about an hour after voting closed, OSI emailed non-incumbent candidates with a Board Member Agreement they had 47 hours to sign. Buried in it: a clause requiring board members to "support publicly all Board decisions, especially those that do not have unanimous consent."

Kuhn and Fontana struck the gag clause and sent it back with alternative language allowing public dissent. OSI said the modifications were invalid. Disqualified both. Threw out every vote cast for them.

Before that, a Debian developer named Luke Faraone had been rejected as a candidate because he submitted his application at 9 PM Pacific time, but OSI retroactively declared the deadline was UTC, which made him late. A community petition demanding full vote counts pulled 88% support. OSI didn't release them.

Bruce Perens, the man who wrote the Open Source Definition in 1998, watched all of this play out and said what a lot of people were thinking:

"The problem before the Open Source AI Definition was openwashing, saying that something was open source when it was not. They hoped that an AI-specific definition would reduce openwashing. If you look at the OSI's own anniversary report, the problem now that the definition is a year old, is... openwashing."

— Bruce Perens, FOSS Force, September 2025

He's now working on something called the "Post-Open" framework, a licensing model that moves beyond open source entirely. The guy who co-founded the OSI has decided the concept he helped create can't stretch to cover AI. I don't know what clearer signal you need that this is broken.

The Counter-Argument You Can't Dismiss

This is the part where the argument I've been building runs into a wall.

Thirteen million HuggingFace users. Two million public models, nearly all built by fine-tuning or distilling weights that came with no training data attached. A billion Llama downloads. Qwen alone spawned 113,000 derivative models. According to Epoch AI, open-weight models lag closed-source state-of-the-art by about three months now, down from a much larger gap. On some benchmarks the difference shrank from 8% to 1.7% in a single year.

Nobody needed training data for any of that.

And the downstream impact is concrete:

| Domain | Project | What It Does |
|---|---|---|
| Healthcare | Mendel AI (Llama 3) | 36% improvement in clinical record extraction |
| Surgery | Activ Surgical (Llama 3) | Real-time AI surgical copilot |
| Medical QA | DeepSeek-R1-Distill | >92% accuracy on USMLE Step 1 |
| Agriculture | Digital Green (Llama) | Multilingual advisory for developing nations |
| Maternal health | Jacaranda PROMPTS (Llama) | AI clinical help desk across Kenya, Ghana, Eswatini |

Mendel didn't need Meta's training data to hit 36% improvement. Jacaranda didn't audit Llama's training pipeline before building an SMS-based maternal health system for three African countries. These are shipping products. People are healthier because of them. And they were built on weights that fail every open source purity test I've outlined above.

Yann LeCun, formerly Meta's chief AI scientist and now running AMI Labs, frames it as a matter of principle:

"In the future, our entire information diet is going to be mediated by [AI] systems. They will constitute basically the repository of all human knowledge. And you cannot have this kind of dependency on a proprietary, closed system."

— Yann LeCun, Yann LeCun On How An Open Source Approach Could Shape AI

The pragmatist's case goes further than vibes. Training data release is a legal minefield. The US Copyright Office ruled in May 2025 that AI training on copyrighted works is not categorically fair use. These datasets contain trillions of tokens scraped from millions of copyrighted sources. Nobody is getting redistribution rights for all of that. In healthcare, GDPR and HIPAA make the data unshareable by law. And even if someone handed you the complete training data and code for Llama 3, you'd need north of $100 million in compute to reproduce the training run. The data is meaningful in theory and useless in practice to basically everyone who would download it.

Then there's geography. Chinese models (DeepSeek under MIT, Qwen under Apache 2.0) now make up 41% of HuggingFace downloads, more than US-origin models. If stricter openness requirements make American companies look less open by comparison, the ecosystem just shifts further east. That's not an argument for or against anything, but it's a thing that's happening.

I keep turning this over. Open weights aren't open source, but they're enormously better than the closed alternative. Making the definition stricter might produce fewer open releases, not more. That argument is mostly right. But it's not entirely right.

Why It Still Matters

Open weights being valuable doesn't make calling them "open source" harmless. Those are different claims.

The regulatory loophole is already being exploited. The EU AI Act, Article 53, gives lighter compliance obligations to "open source" AI. That exemption was written by people who assumed the phrase meant something specific. If Meta can stick "open source" on Llama and pocket the regulatory relief, that's not a definitional quibble. It's money. The exemption has a hole in it, and companies are walking through.

You can't audit what you can't see. About 5% of AI researchers share code in their papers. Model cards on HuggingFace use 947 different section naming conventions, so there's no consistency in what gets documented. When a company claims their model was tested for bias, deduped for harmful content, filtered for quality, and then hands you only the weights, what you have is a claim without evidence. You can observe the model's outputs. You cannot investigate its inputs. If it exhibits bias, you can describe the symptoms. You can't diagnose the cause.
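You can see the documentation gap for yourself without any special tooling. The sketch below is illustrative: the sample model cards are hypothetical, and because real cards use hundreds of inconsistent section names, even a loose pattern like this is a heuristic rather than a reliable audit — which is exactly the problem:

```python
import re

# Headings that would count as training-data disclosure. Real model cards
# name these sections inconsistently (when they exist at all), so this
# pattern is a best-effort heuristic, not an authoritative check.
DATA_HEADING = re.compile(
    r"^#+\s*.*\b(training data|dataset|data sources)\b",
    re.IGNORECASE | re.MULTILINE,
)

def discloses_training_data(card_markdown: str) -> bool:
    """Heuristic: does the card contain any heading about training data?"""
    return DATA_HEADING.search(card_markdown) is not None

# Hypothetical model cards, for illustration only.
weights_only_card = """
# SomeModel-7B
## Intended Use
Chat and instruction following.
## Limitations
May hallucinate.
"""

fully_open_card = """
# SomeOpenModel-1B
## Training Data
Trained on the publicly released SomeCorpus dump (link in repo).
"""

print(discloses_training_data(weights_only_card))  # False
print(discloses_training_data(fully_open_card))    # True
```

A weights-only release can ship a polished card that says nothing about its inputs, and no amount of output probing recovers that missing section.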

Copyright law might not work here at all. The D.C. Circuit ruled in Thaler v. Perlmutter (March 2025, cert denied 2026) that AI cannot hold copyright. Follow the logic: if AI-generated code can't be copyrighted, then open source licenses, which are copyright licenses, might not attach to AI output. The entire legal mechanism that makes open source work might not apply. This isn't a hypothetical edge case. It's an unresolved question that affects everyone building on these models, and I haven't seen a convincing answer from anyone.

And the erosion compounds. "Open source" accumulated meaning over 28 years through a specific deal: you can see what you're running. Inspect it. Reproduce it. Improve it. Each time Meta puts that label on a model with a custom restrictive license and zero training data, the deal gets a little weaker. The words absorb more ambiguity. At some point "open source" just means "you can download it," which is what Meta wants, because then the label is free and the obligation is zero.

Where This Leaves You

I've been going back and forth on this for weeks, and I don't think there's a clean resolution.

If training data is the source code of AI, and I think Schneier's analogy holds, then nothing from Meta, DeepSeek, Mistral, Alibaba, or Google qualifies as open source. The four freedoms require that you can see and reproduce the thing you're using. Weights don't give you that.

But thirteen million people built useful things with weights alone. A maternal health system in Kenya doesn't care about definitional purity. The 1998 definition was written for a world where "source" meant text files you could read and compile. It doesn't map cleanly onto a trillion tokens scraped from the internet, tangled in copyright, privacy law, and trade secrets.

I land here: open weights are good. Calling them "open source" is bad. Both of those can be true at the same time.

Some things you can do with that:

Stop writing "open source" in your architecture docs when you mean Llama. Say "open weights." It's accurate, your compliance team won't get confused, and it doesn't corrode a phrase that still means something for actual software.

Read the license. I know, nobody does. But Llama's 700 million MAU cap has already bitten companies that assumed "open source" meant no strings. DeepSeek's MIT license actually has no strings. Those are different things and they matter when lawyers get involved.

If you need reproducibility, if you need to audit what a model learned or verify a safety claim or understand why it's producing biased output, use OLMo or Pythia. They're not as capable as Llama for most tasks. They're the only ones that earn the label.

Keep an eye on EU AI Act enforcement. The GPAI obligations kicked in August 2025. Regulators may end up caring about the definition more than the open source community does, and "we called it open source on our website" is going to be an awkward defense when it clearly isn't.

Open source meant something specific for 28 years. The AI industry would very much like you to forget what.


