Sunday, January 11, 2026

The Copy Machine in the Cloud




Why AI’s “memorization problem” threatens the foundations of the generative-AI industry

For years, the public has been told a comforting story about artificial intelligence. Large language models, the companies say, learn the way people do. They read enormous quantities of text, absorb patterns, and emerge with a generalized understanding of language—no different in spirit from a student educated in a library.

But that metaphor is collapsing.

A growing body of research now suggests that today’s most powerful AI systems do not merely abstract patterns from books, articles, and images. They retain them—sometimes in startlingly intact form. And the consequences of that discovery may reach far beyond academic debate, reshaping copyright law, AI economics, and the credibility of the industry’s core claims.

A discovery the industry didn’t want

In early January, researchers affiliated with Stanford and Yale released findings that cut directly against years of industry assurances. Testing four widely used models—OpenAI’s GPT series, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—they demonstrated that these systems could reproduce long, recognizable passages from copyrighted books when prompted in particular ways.

The most dramatic results came from Claude, which generated near-complete versions of Harry Potter and the Sorcerer’s Stone, The Great Gatsby, 1984, and Frankenstein, along with thousands of words from The Hunger Games and The Catcher in the Rye. Other models showed similar, though uneven, behavior across a test set of thirteen books.

This phenomenon—known in technical literature as memorization—has been discussed quietly among researchers for years. What is new is the scale, clarity, and undeniability of the evidence.

It also directly contradicts the public positions AI companies have taken before regulators. In 2023, OpenAI told the U.S. Copyright Office that its models “do not store copies of the information that they learn from.” Google made a parallel claim, stating that “there is no copy of the training data … present in the model itself.” Other major firms echoed the same language.

The Stanford-Yale results join a growing list of studies showing those statements to be, at best, incomplete.

Not learning—compressing

To understand why this matters, it helps to discard the learning metaphor entirely.

Inside AI research labs, engineers often describe large language models using a more precise term: lossy compression. The idea is borrowed from familiar technologies like MP3 audio files or JPEG images, which reduce file size by discarding some information while retaining enough structure to reconstruct a convincing approximation of the original.

Generative AI works in a similar way. Models ingest vast quantities of text or images and transform them into a dense mathematical structure. When prompted, they generate outputs that are statistically likely continuations of what they have seen before.
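
For readers who want the analogy made literal, here is a minimal sketch in Python, assuming the Pillow imaging library; the filename is a placeholder, not a file referenced in this piece.

```python
# A minimal sketch of lossy compression, assuming Pillow is installed
# and "photo.png" is any local image (placeholder filename).
import os
from PIL import Image

img = Image.open("photo.png").convert("RGB")
img.save("photo_q10.jpg", quality=10)  # JPEG at quality 10: most detail discarded

original = os.path.getsize("photo.png")
compressed = os.path.getsize("photo_q10.jpg")
print(f"{original} bytes -> {compressed} bytes")

# The JPEG is a fraction of the original size, yet still recognizably
# the same picture: structure survives even as information is thrown away.
# The argument in this piece is that model weights behave analogously.
```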

This framing has begun to appear outside the lab as well. In a recent German court case brought by GEMA, a music-licensing organization, a judge compared ChatGPT to compressed media formats after finding that it could reproduce close imitations of copyrighted song lyrics. The court rejected the notion that the system merely “understood” music in an abstract sense.

The analogy becomes especially vivid with image-generation models.

In 2022, Stability AI’s former CEO Emad Mostaque described Stable Diffusion as having compressed roughly 100,000 gigabytes of images into a model weighing about two gigabytes, a ratio of roughly 50,000 to one, small enough to run on consumer hardware. Researchers have since shown that the model can recreate near-identical versions of some training images when prompted with their original captions or metadata.

In one documented case, a promotional still from the television show Garfunkel and Oates was reproduced with telltale compression artifacts—blurring, distortion, and minor glitches—much like a low-quality JPEG. In another, Stable Diffusion generated an image closely resembling a graphite drawing by artist Karla Ortiz, now central to ongoing litigation against AI companies.

These outputs are not generic “conceptual” images. They preserve composition, pose, and structure in ways that strongly suggest stored visual information, not independent creative synthesis.

Language models behave the same way

Text models operate differently under the hood, but the principle is similar.

Books and articles are broken into tokens—fragments of words, punctuation, and spacing. A large language model records which tokens tend to follow others in specific contexts. The result is a massive probabilistic map of language sequences.

When an AI writes, it doesn’t consult an abstract notion of “English.” It traverses this map, choosing the most likely next token given what came before. In most cases, that produces novel combinations. But when the training data are dense and repetitive enough, the map contains entire passages—sometimes entire books—embedded almost intact.
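
A toy version makes the mechanism visible. The sketch below is a drastic simplification (a real model conditions on thousands of tokens with learned weights, not raw counts), but it shows how a probability map can hold a passage verbatim:

```python
from collections import Counter, defaultdict

CONTEXT = 3  # toy window size; real models condition on thousands of tokens

def build_map(tokens):
    """Count which token follows each CONTEXT-length window."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - CONTEXT):
        key = tuple(tokens[i:i + CONTEXT])
        table[key][tokens[i + CONTEXT]] += 1
    return table

def generate(table, prompt, n=20):
    """Greedy decoding: always emit the single most likely next token."""
    out = list(prompt)
    for _ in range(n):
        followers = table.get(tuple(out[-CONTEXT:]))
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# A short "corpus" in which one passage recurs, as popular books do
# in web-scale training data.
passage = "it was a bright cold day in april and the clocks were striking thirteen"
corpus = passage.split() * 3

table = build_map(corpus)
print(generate(table, "it was a".split()))
# When a passage recurs often enough, and nothing else competes with it,
# greedy decoding walks the map straight back through the original text.
```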

A 2025 study of Meta’s Llama 3.1 70B model demonstrated this vividly. By supplying only the opening tokens “Mr. and Mrs. D,” researchers triggered a cascade that reproduced nearly all of Harry Potter and the Sorcerer’s Stone, missing only a handful of sentences. The same technique extracted more than 10,000 verbatim words from Ta-Nehisi Coates’s The Case for Reparations, originally published in The Atlantic.
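
A simplified version of that probe can be written in a few lines, assuming the Hugging Face transformers library and a locally available checkpoint. The model name below is a placeholder, and the published study’s actual methodology was more involved than plain greedy decoding:

```python
# A sketch of the extraction setup described above, not the study's code.
# Running a 70B checkpoint requires substantial hardware; a smaller model
# shows the shape of the test just as well.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # placeholder; requires access
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Mr. and Mrs. D"  # just the opening tokens
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)  # greedy

continuation = tok.decode(out[0], skip_special_tokens=True)
print(continuation)
# Compare the continuation against the true text: long verbatim matches
# indicate memorization rather than paraphrase.
```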

Other works—including A Game of Thrones and Toni Morrison’s Beloved—showed similar vulnerabilities.

More recent research adds a subtler layer: paraphrased memorization. In these cases, models don’t copy sentences word-for-word but produce text that mirrors a specific passage’s structure, imagery, and cadence so closely that its origin is unmistakable. This behavior resembles what image models do when they remix visual elements from multiple stored works while preserving their distinctive style.
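
Detecting that kind of near-copy is itself a research problem. As a crude illustration only, here is a word-level similarity check using Python’s standard library; the threshold is an illustrative assumption, and real studies use stronger alignment- and embedding-based measures:

```python
from difflib import SequenceMatcher

def near_verbatim(generated: str, source: str, threshold: float = 0.8) -> bool:
    """Flag passages that share most of their word-level structure."""
    ratio = SequenceMatcher(None, generated.split(), source.split()).ratio()
    return ratio >= threshold

# Synthetic example: only the word order differs slightly.
src = "the clocks were striking thirteen in the bright cold april air"
gen = "the clocks were striking thirteen in the cold bright april air"
print(near_verbatim(gen, src))  # True
```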

How common is this?

Exact duplication may be relatively rare in everyday use—but not vanishingly so. One large-scale analysis found that 8 to 15 percent of AI-generated text appears elsewhere on the web in identical form. That rate far exceeds what would be acceptable in human writing, where such overlap would typically be labeled plagiarism.
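
A back-of-the-envelope version of such an analysis is easy to sketch: slide a fixed-size window over generated text and check each n-gram against an index of a reference corpus. The window size and file names below are illustrative assumptions, not the cited study’s methodology:

```python
from pathlib import Path

def ngrams(text, n=8):
    """All n-word windows in a text, as a set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def duplication_rate(generated, reference_index, n=8):
    """Fraction of the generated text's n-grams found in the reference."""
    grams = ngrams(generated, n)
    return len(grams & reference_index) / len(grams) if grams else 0.0

# Hypothetical files: a web-crawl snapshot and a batch of model outputs.
reference_index = ngrams(Path("web_snapshot.txt").read_text())
rate = duplication_rate(Path("model_output.txt").read_text(), reference_index)
print(f"{rate:.1%}")
```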

AI companies argue that these outcomes require “deceptive” or “abnormal” prompting. In its response to a lawsuit from The New York Times, OpenAI claimed that the newspaper violated its terms of service and used techniques no ordinary user would employ. The company characterized memorized outputs as rare bugs it intends to eliminate.

But researchers broadly disagree. In interviews, many have said that memorization is structural, not incidental—an inevitable result of training enormous models on massive, uncurated datasets.

The legal fault lines

If that is true, the legal consequences could be severe.

Copyright law creates at least two potential liabilities. First, if models can reproduce protected works, courts may require companies to implement safeguards preventing users from accessing memorized content. But existing filters are easily bypassed, as demonstrated by cases in which models refuse a request under one phrasing and comply under another.

Second—and more troubling for the industry—courts may decide that a trained model itself constitutes an unauthorized copy of copyrighted material. Stanford law professor Mark Lemley has noted that even if a model doesn’t store files in a conventional sense, it may function as “a set of instructions that allows us to create a copy on the fly.” That distinction may not be enough to avoid liability.

If judges conclude that models contain infringing material, remedies could include not just damages but destruction of the infringing copies—effectively forcing companies to retrain their systems using licensed data. Given the cost of training frontier models, such rulings could reshape the competitive landscape overnight.

The danger of the learning myth

Much of the industry’s legal strategy rests on analogies between AI and human learning. Judges have compared training models on books to teaching students to write. Executives speak of AI’s “right to learn,” as if reading were a natural act rather than a commercial ingestion of copyrighted works at industrial scale.

But the analogy fails under scrutiny.

Humans forget. AI systems do not—not in the same way. Humans cannot instantly reproduce entire novels verbatim. AI systems sometimes can. And humans experience the world through senses, judgment, and intention—none of which apply to statistical models predicting tokens.

As research into memorization advances, the gap between metaphor and mechanism is becoming harder to ignore.

An industry built on borrowed words

The irony is difficult to miss. Generative AI is marketed as revolutionary, creative, and forward-looking. Yet its power derives almost entirely from the accumulated labor of writers, artists, journalists, and musicians—much of it absorbed without permission.

Whether courts ultimately classify that absorption as fair use or infringement, one thing is increasingly clear: these systems do not merely learn from culture; they retain it. And in doing so, they expose a fault line at the heart of the AI economy—one that no amount of metaphor can paper over.

The copy machine in the cloud is finally visible. What society chooses to do about it may determine the future of artificial intelligence itself.




Monday, January 05, 2026

There Aren’t 100 Million Immigrants

So who, exactly, is the government preparing to deport?

On New Year’s Eve, the Department of Homeland Security posted an image of an empty, sun-washed beach—palm trees, a vintage car, no people—captioned: “America after 100 million deportations.” The accompanying text, reported by The Guardian, was even blunter: “The peace of a nation no longer besieged by the third world.” (The Guardian)

It was propaganda in the old sense of the word: not persuasion through argument, but a moodboard for power. The message did not ask for consent. It did not explain feasibility. It offered a still life of absence as a synonym for order.

And that’s where the arithmetic begins—because 100 million isn’t a policy number. It’s a demographic event.

The number that eats the category

Start with the simplest constraint: there are not 100 million undocumented immigrants in the United States. The most cited estimates are an order of magnitude smaller. Pew Research Center put the unauthorized immigrant population at about 14 million in 2023 (a record high), and DHS’s Office of Homeland Security Statistics estimated about 11 million as of January 1, 2022. (Pew Research Center)

So if “100 million deportations” is meant literally—100 million distinct human beings removed—then it cannot be achieved by “immigration enforcement” as Americans usually imagine it, because the category runs out of people long before the target is met.

What, then, fills the gap?

One answer is: time. Maybe “100 million deportations” is meant to be cumulative over many years, counting removals and returns and re-removals, the way bureaucracies sometimes inflate totals. But DHS didn’t post “100 million deportations over several decades including repeat removals.” It posted a country after it. A finished state.

And if the implied timeline is political—one administration, one movement, one “year of…”—then time won’t rescue the math.

The second answer is: expand the target population.

That’s the part American officials rarely say out loud, but their actions keep sketching the outline. Consider the scale of foreign-born residents. The Census Bureau estimated 46.2 million foreign-born people in 2022. (Census.gov) Even if you removed every foreign-born resident—citizens and non-citizens alike—you would still be more than 50 million short of the number DHS chose to romanticize.
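
The arithmetic is simple enough to write out. The figures below are the cited estimates; everything else is subtraction:

```python
# The cited estimates, next to the target. Figures are from Pew (2023)
# and the Census Bureau (2022), as cited above.
TARGET = 100_000_000        # the number in the DHS post
UNAUTHORIZED = 14_000_000   # Pew estimate, 2023
FOREIGN_BORN = 46_200_000   # Census Bureau estimate, 2022

print(f"Short, using the unauthorized population: {TARGET - UNAUTHORIZED:,}")
print(f"Short, removing every foreign-born resident: {TARGET - FOREIGN_BORN:,}")
# 86,000,000 and 53,800,000: the category runs out long before the target.
```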

Which forces the conclusion: a 100-million removal target is not an immigration policy. It is a citizenship policy.

It would require either:

  • stripping legal status from tens of millions of people who currently have it (including naturalized citizens), and/or

  • redefining Americanness in a way that captures large numbers of U.S.-born people, and/or

  • counting people as deportable based on something other than immigration status (association, dissent, ancestry, “undesirability,” etc.).

That is why “100 million” isn’t just extreme. It is structurally different from border enforcement. It implies an internal sorting.

When the agency in charge of “who stays” starts talking like a movement

Within days of the “100 million” post, DHS’s public messaging escalated again. A DHS post on X declared: “2026 will be the year of American Supremacy.” (X (formerly Twitter))

That phrase has no statutory meaning. It doesn’t need one. It functions the way slogans function when attached to a coercive apparatus: as a prioritization signal. DHS is not a think tank; it is the bureaucracy that touches detention, removal, databases, and referrals. When it adopts the language of supremacy, the operational question is not “Is this a law?” but “How will this shape discretion?”

And discretion—quiet, unreviewed, cumulative—is where large-scale redefinition happens.

The paperwork lever: denaturalization as throughput

If you want to make “100 million” even remotely plausible, you need a pipeline that converts protected people into removable people. One of the cleanest levers is denaturalization: turning citizenship into something you can lose, not only for the spectacular villain of civics-class hypotheticals, but as a routine administrative output.

In mid-December 2025, Reuters reported that internal guidance (first reported by The New York Times) directed USCIS field offices to supply 100–200 denaturalization case referrals per month for DOJ review in fiscal year 2026—an enormous jump from historical levels. (Reuters)

Quotas matter not because every referral succeeds, but because quotas change institutional behavior. Numbers turn judgment into throughput. People become inventory.

Once you have a machine that can convert “citizen” into “case,” you have the beginnings of a system that can scale past the undocumented population—because you are no longer limited by that population.

The identity lever: biometrics that don’t stop at the border

The other requirement for mass sorting is identification at scale—not just of individuals, but of networks: families, sponsors, “associations,” the relational map of a life.

In late 2025, DHS moved to widen biometric collection through both proposed and finalized actions:

  • DHS published a proposed rule on biometrics collection and use by USCIS that would expand modalities to include things like palm prints, voice prints, ocular imagery, and DNA, remove age limits, and broaden who can be required to provide biometrics (including certain U.S. citizens connected to immigration filings). (Federal Register)

  • Separately, DHS finalized a biometric entry-exit rule requiring facial comparison biometrics from non-U.S. citizens on entry and exit, reframing what had been “pilots” into a comprehensive system. (GovInfo)

Even read generously, these are not neutral “efficiency” upgrades. They are the scaffolding of a future in which identity is less a civic status than a permanent, queryable data object.

And the step after “collect more biometrics” is “query more datasets.”

The dragnet lever: data-sharing as deportation’s invisible engine

Immigration enforcement no longer depends on knocking on doors at random. It depends on finding people—quietly, cheaply, repeatedly—through systems built for other purposes.

Investigations have documented the ways ICE and DHS-linked agencies gain access to DMV and other state-level data. In late 2025, for example, multiple reports and a letter from congressional Democrats warned governors about ICE access to driver’s license and vehicle registration data via national law-enforcement networks. (Stateline)

And surveillance isn’t abstract here. The Financial Times reported a sharp rise in ICE surveillance contracting in 2025, based on procurement data. (FT Visual Journalism)

This is how mass enforcement becomes socially survivable: not by soldiers in the streets, but by “back office” systems that make removal feel like an administrative consequence of existing anywhere in modern life.

The cruelty rehearsal: when detention becomes a punchline

A mass project needs cultural permission. Not unanimous support—just enough numbness, enough distance, enough comedy.

In July 2025, The Atlantic described the rhetoric around a Florida immigrant-detention center in the Everglades. Laura Loomer, identified there as a Trump adviser, posted: “Alligator lives matter … alligators are guaranteed at least 65 million meals if we get started now.” (The Atlantic)

This is how a society is prepared: through jokes that train the audience to treat human beings as input-output.

The same pattern appears in individual cases that puncture the official “worst of the worst” narrative. In 2025, Sae Joon Park—a disabled Purple Heart veteran—was reported to have self-deported under threat of detention, drawing outrage from members of Congress and coverage in local and national media. (Mazie K. Hirono)

Whatever one believes about Park’s underlying immigration history, the point is the signal: membership and contribution do not guarantee insulation when the enforcement state is expanding its mandate.

The ideology nearby: hierarchy dressed up as “management”

Propaganda needs theory the way a building needs rebar: not because every worker reads the blueprint, but because the structure holds better when it’s been rationalized.

Curtis Yarvin—writing as “Mencius Moldbug”—has argued for replacing democratic equality with hierarchy and managerial rule. In his own writing, he described what he called a “humane alternative to genocide”: “virtualizing” “wards” in permanent confinement with immersive VR. (unqualified-reservations.org)

Multiple major outlets have documented Yarvin’s growing proximity to mainstream right politics and his influence on figures around power, including commentary about JD Vance engaging with or drawing from Yarvin-adjacent frameworks. (The Verge)

You don’t need to claim that any one official plans to enact Yarvin’s most dystopian passages to see the shared grammar: people sorted by “value,” democracy treated as a bug, confinement reframed as care.

And in the same broader ecosystem, Robert F. Kennedy Jr. has repeatedly promoted vaccine-linked autism narratives and described autism in catastrophic terms—language he has had to publicly apologize for, including a notorious “holocaust” comparison reported at the time by mainstream outlets. (CBS News)

The relevance isn’t partisan gossip. It’s the shared move: turning entire populations into evidence of contamination, damage, or threat—categories that make exclusion sound like hygiene.

So who fills the gap?

This is the question the “100 million” image refuses to answer, because answering it reveals the real project.

If you remove millions of workers, consumers, parents, renters, taxpayers—regardless of legal status—you don’t just change immigration statistics. You change schools, labor markets, housing demand, caregiving networks, military recruitment pools, and the basic functioning of whole regions.

And if the target is truly 100 million people, then the “gap” is not simply economic. It is civic: the gap between a nation of equal citizens and a nation of conditional residents—people who may live here but can be administratively reclassified out of belonging.

The propaganda beach is quiet because it is emptied. It offers “peace” as a reward for subtraction. But arithmetic has a way of stripping euphemism down to its skeleton. A number like 100 million doesn’t describe enforcement. It describes recomposition.

After confirming that DHS really did post the image and later used “American Supremacy” language, the most important step is to recognize what those verified facts already imply:

  • DHS did amplify “America after 100 million deportations.” (The Guardian)

  • The foreign-born population is far smaller than 100 million. (Census.gov)

  • The unauthorized population is far smaller than 100 million. (Pew Research Center)

  • USCIS was reportedly directed to dramatically scale denaturalization referrals. (Reuters)

  • DHS has proposed and finalized expansions in biometric collection and biometric infrastructure. (Federal Register)

Those pieces don’t prove a single master plan. But they do make the arithmetic unavoidable:

If the state wants “100 million,” it must first decide—explicitly or through discretion—who counts as American.
