What answer does ChatGPT, the program that generates text and answers questions from prompts, give when you ask it why journalism is so often used to train generative artificial intelligence?
ChatGPT will tell you that the news is factual, reflects language variation and cultural awareness, contains complex sentence structures, includes quotes that convey real-world conversations, excels at summarization and condensation, and can help a model improve information retrieval. In fact, the news is so valuable to this endeavor that it makes up half of the top 10 sites in one of Google’s datasets used to train some of the most popular large language models, according to a recent analysis. That includes content placed behind paywalls and intended to be available only to paying subscribers.
This is the issue at the heart of recent lawsuits filed by The New York Times, The Intercept, Raw Story, Getty, and AlterNet against OpenAI, Microsoft, Stability AI, and others for using vast troves of their articles and images to train ChatGPT and other generative AI products and services. OpenAI and the other companies building the new generation of AI tools did not ask permission to use these stories and are not compensating the news organizations for their work; instead, they argue that their actions are covered by the fair use doctrine. But as the fight between news organizations and Big Tech unfolds and the rules around AI take shape, we should look to policymakers to set those rules and ensure a level playing field so that journalism survives and thrives in the transition to AI. This is an imperative not only for the embattled news industry but for democracy more broadly.
Indeed, journalism is still reeling from the social media age, when Google and Facebook, now Meta, exploited their dominance and starved news outlets of digital advertising revenue. At a Senate hearing on AI and the future of journalism in January, lawmakers and experts were aligned in their criticism of Big Tech’s role in imperiling local news. “It is literally eating away at the lifeblood of our democracy,” warned Sen. Richard Blumenthal, a Democrat from Connecticut, who chaired the bipartisan hearing.
And now their generative AI tools are being built with “stolen goods,” as Roger Lynch, CEO of Condé Nast, testified. The hearing highlighted a rare show of bipartisan agreement that fair use should not apply to the way large language models and giant tech companies have hoovered up content without permission or compensation.
The battle over whether scraping content for use in large language models (LLMs), the machine learning systems that can understand and generate human-sounding language, constitutes fair use is just starting. But we can take lessons on what not to do from how the news media and policymakers handled the rise of the social media giants.
Journalism must not get locked into the current version of how search, content summarization, dissemination, and digital advertising work. The news industry can’t afford to give away content and must establish the value journalism provides to AI systems and other products. Yet access to human-created, high-quality content that offers a relatively accurate and timely portrayal of reality, like journalism, is an important input for machine learning models.
Journalism adds value in many ways, not just through the vast amounts of data it could provide to train LLMs. Training and updating models with higher-quality data can help minimize bias within the models and is key to making sure that AI search, summarization, and content generation aren’t informed by mis- or disinformation. (This is done through a process called grounding, which connects the AI’s outputs to specific data sources, and through retrieval-augmented generation, which allows the model to access related and, ideally, highly authoritative information outside its training data.) Without journalism, these models will be more susceptible to problems like “hallucinations” — essentially incorrect answers made up by AI — and to manipulated, outdated, or irrelevant information. (The Washington Post found, for example, that training sets already include several media outlets that rank low on NewsGuard’s independent scale for trustworthiness or that are backed by a foreign government, including Russia’s propaganda arm RT.)
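To make that parenthetical concrete, here is a minimal, hypothetical sketch of retrieval-augmented generation: fetch the documents most relevant to a question, then hand them to the model as context so its answer is grounded in those sources. The toy corpus, the bag-of-words similarity, and the generate() stub are illustrative assumptions, not any vendor’s actual pipeline; production systems use learned embeddings and a real model call.

```python
# A minimal sketch of retrieval-augmented generation (RAG).
# The corpus, query, and generate() stub below are hypothetical illustrations.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    # Real systems use learned dense vector embeddings instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    # Placeholder for a call to an actual language model.
    return f"[model output grounded in a prompt of {len(prompt)} characters]"

# Hypothetical mini-corpus of news snippets used for grounding.
corpus = [
    "City council votes to fund repairs to the Elm Street bridge.",
    "Local hospital reports a rise in flu cases this winter.",
    "School board approves new budget after public hearing.",
]

question = "What did the city council decide about the bridge?"
context = "\n".join(retrieve(question, corpus))
answer = generate(
    f"Answer using only the sources below.\n\nSources:\n{context}\n\nQuestion: {question}"
)
print(answer)
```

The design point the sketch illustrates is why timely, trustworthy journalism matters here: the model’s answer can only be as accurate as the documents the retriever hands it.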
Publishers and journalists should work collectively to understand how different types of journalism — like local, watchdog, and trade journalism — provide utility throughout the AI value chain, since this will be essential to developing business models for the next era of journalism.
But even more importantly, policymakers need to ensure that news publishers are on an even playing field with the tech platforms that have come to provide the necessary infrastructure for contemporary journalism. Bespoke, secretive deals with the largest or most influential news outlets are no replacement for public policy and will not rescue local news from the precarity created by corporations that skirt the law and enjoy dominant market power.
The current crisis facing journalism is partly a result of regulatory frameworks that privilege tech platforms over the press and give the former unfettered power to set the rules. U.S. policymakers allowed a handful of companies to develop a powerful, extractive business model based on surveillance advertising, one that relies on vast troves of data only they can access, along with AdTech infrastructure that they developed and control. This data has given them an advantage in AI development, which further reinforces their dominance.
Regulators also permitted these same firms to dominate search and social media, and thus access to audiences. The Federal Trade Commission did not prevent hundreds of mergers or acquisitions by Big Tech that killed off potential alternatives and cemented their status as critical infrastructure of the information ecosystem. From setting publishing protocols (such as Facebook’s Instant Articles or Google’s Accelerated Mobile Pages) to the tools and services journalists use to do their jobs, including email, web hosting, messaging, archiving, cloud services, and cybersecurity services (like Google’s Project Galileo), there is no escaping Big Tech. Artificial intelligence intensifies this dynamic.
Our laws and regulations have not caught up with this reality. Unlike the communication technologies of prior eras, today’s tech corporations enjoy unprecedented exemptions from libel law, non-discrimination requirements, and privacy constraints. Unlike news organizations, platforms are shielded from liability for what they allow to be published thanks to Section 230 of the Communications Decency Act, which generally protects them from legal accountability for third-party content. And unlike operators of other information communication technologies, such as telephone companies or broadcasters, they can decide whom to kick off or amplify on their services regardless of rationale, transparency, or fairness. This includes blocking news entirely, as Meta did in Canada and Google has done in California, rather than paying for the content. And as we’re seeing with AI, they claim rights of unparalleled scope to use copyright-protected material under the guise of fair use.
Given that these tech monopolies form the backbone of the modern media system and have leveraged journalism to help build the value of their massive enterprises, policymakers need to ensure that news organizations are appropriately compensated and that corporations play fair. Ideally, regulators would take steps to break up monopolies like Apple, Amazon, Google, and Meta and enact meaningful privacy legislation that would prohibit the mass surveillance the digital advertising industry relies on. Antitrust legislation that promotes competition in AdTech, app stores, and digital marketplaces — several bipartisan bills, like the American Innovation and Choice Online Act and the Advertising Middlemen Endangering Rigorous Internet Competition Accountability Act, are currently working their way through Congress — would be another welcome, if long overdue, step.
Artificial intelligence has both hastened the need for these policy prescriptions and created the need for new ones. Policymakers should clarify that taking journalism to develop AI without compensation is not fair use, pass legislation that enables publishers to negotiate with these giant tech corporations on equal footing, and impose data transparency requirements so that publishers have the information they need to make this determination. Fair compensation legislation like Australia’s News Media Bargaining Code, Canada’s Online News Act, and the proposed federal Journalism Competition and Preservation Act and its state-level equivalents would be a good first step. But such laws should be amended to cover scraping and crawling for AI and to include transparency mandates for large language models.
The FTC should continue to pursue antitrust cases against tech monopolies and impose meaningful penalties, like requiring companies to delete models and algorithms built on unlawfully obtained user data. The New York Times’s lawsuit against OpenAI seeks the destruction of LLMs trained on its work, and the FTC should consider similar remedies for companies that violate data privacy and antitrust laws.
Large publishers are forging ahead with voluntary agreements in the absence of legal and regulatory clarity. But this leaves out smaller and local publishers and could undermine efforts to develop durable business models rather than one-off licensing deals.
Such ad hoc approaches also risk worsening the compounding crises caused by the decline of local news and the scourge of disinformation. We are already seeing the proliferation of election-related disinformation in the U.S. and around the world, from AI robocalls impersonating President Joe Biden to deepfakes of Moldovan candidates making false claims about alignment with Russia.
Renegotiating the relationship between tech platforms and the news industry must be a fundamental part of efforts to support journalism and help news organizations adapt to the generative AI era. With an upcoming U.S. presidential election and more than 60 key elections taking place around the world over the next year, policymakers and the news industry are starting to confront the existential threat that the demise of journalism poses to democracy. Their efforts can’t come too soon: repeating the failures of the social media age in the era of AI would have devastating consequences.