The Times v. Microsoft/OpenAI: Unauthorized Reproduction of Times Works In GPT Model Training (10)

Written by legalpdf | Published 2024/01/02
Tech Story Tags: new-york-times | openai | microsoft | nytimes-v-openai | nytimes-v-microsoft | gpt-model-training | gpt-training | unauthorized-reproduction | hackernoon-es | hackernoon-hi | hackernoon-zh | hackernoon-fr | hackernoon-bn | hackernoon-ru | hackernoon-vi | hackernoon-pt | hackernoon-ja | hackernoon-de | hackernoon-ko | hackernoon-tr

The New York Times Company v. Microsoft Corporation court filing, dated December 27, 2023, is part of HackerNoon’s Legal PDF Series. You can jump to any part of this filing here. This is part 10 of 27.

IV. FACTUAL ALLEGATIONS

C. Defendants’ Unauthorized Use and Copying of Times Content

82. Microsoft and OpenAI created and distributed reproductions of The Times’s content in several independent ways in the course of training their LLMs and operating the products that incorporate them.

1. Unauthorized Reproduction of Times Works During GPT Model Training

83. Defendants’ GPT models are a family of LLMs, the first of which was introduced in 2018, followed by GPT-2 in 2019, GPT-3 in 2020, GPT-3.5 in 2022, and GPT-4 in 2023. The “chat”-style LLMs, GPT-3.5 and GPT-4, were developed in two stages. First, a transformer model was pre-trained on a very large amount of data. Second, the model was “fine-tuned” on a much smaller supervised dataset to help it solve specific tasks.
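
For readers unfamiliar with that two-stage process, a minimal sketch of what pre-training followed by fine-tuning looks like in code is shown below; the toy model, random placeholder data, and hyperparameters are illustrative assumptions, not OpenAI’s actual architecture or pipeline.

```python
# Illustrative sketch only: a toy next-token language model trained in the two
# stages the complaint describes, (1) self-supervised pre-training on a large
# body of text and (2) supervised fine-tuning on a much smaller dataset.
# The architecture, random data, and hyperparameters are placeholders and do
# not represent OpenAI's actual models or pipeline.
import torch
import torch.nn as nn

VOCAB, DIM, CTX = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.block = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq)
        seq_len = tokens.size(1)
        # Additive causal mask so each position attends only to earlier tokens.
        mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)
        return self.head(self.block(self.embed(tokens), src_mask=mask))

def next_token_loss(model, tokens):
    # Predict token t+1 from tokens 0..t: the self-supervised objective.
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stage 1: pre-training on batches drawn from a (here randomly generated) corpus.
for _ in range(20):
    batch = torch.randint(0, VOCAB, (8, CTX))
    loss = next_token_loss(model, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fine-tuning the same model on a much smaller, task-specific dataset.
finetune_batch = torch.randint(0, VOCAB, (2, CTX))
for _ in range(5):
    loss = next_token_loss(model, finetune_batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```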

84. The pre-training step involved collecting and storing text content to create training datasets and processing that content through the GPT models. While OpenAI did not release the trained versions of GPT-2 onward, “[d]ue to [OpenAI’s] concerns about malicious applications of the technology,” OpenAI has published general information about its pre-training process for the GPT models.[12]

85. GPT-2 includes 1.5 billion parameters, a 10X scale-up of GPT.[13] The training dataset for GPT-2 includes an internal corpus OpenAI built called “WebText,” comprising “the text contents of 45 million links posted by users of the ‘Reddit’ social network.”[14] The contents of the WebText dataset were created as a “new web scrape which emphasizes document quality.”[15] The WebText dataset contains a staggering amount of scraped content from The Times. For example, the NYTimes.com domain is one of the “top 15 domains by volume” in the WebText dataset,[16] and is listed as the 5th “top domain” with 333,160 entries.[17]
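
Tallies like the “top domains by volume” ranking referenced above can be produced by counting entries per domain in the link corpus; the sketch below uses made-up URLs rather than records from the actual WebText scrape.

```python
# Minimal sketch: counting how often each domain appears in a WebText-style
# corpus of scraped outbound links. The URLs are made-up placeholders.
from collections import Counter
from urllib.parse import urlparse

scraped_urls = [
    "https://www.nytimes.com/2019/01/01/technology/example-article.html",
    "https://www.nytimes.com/2019/02/02/science/another-example.html",
    "https://example.org/some-post",
]

domain_counts = Counter(urlparse(url).netloc for url in scraped_urls)

# "Top domains by volume," analogous to the ranking cited in the complaint.
for domain, count in domain_counts.most_common(15):
    print(domain, count)
```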

86. GPT-3 includes 175 billion parameters and was trained on the datasets listed in the table below.[18]

87. One of these datasets, WebText2, was created to prioritize high-value content. Like the original WebText, it is composed of popular outbound links from Reddit. As shown in the table above, the WebText2 corpus was weighted 22% in the training mix for GPT-3 despite constituting less than 4% of its total tokens. Times content (a total of 209,707 unique URLs) accounts for 1.23% of all sources listed in OpenWebText2, an open-source re-creation of the WebText2 dataset used in training GPT-3. OpenAI describes WebText2, like the original WebText, as a “high-quality” dataset that is “an expanded version of the WebText dataset … collected by scraping links over a longer period of time.”[19]
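
To make the up-weighting concrete, the sketch below samples training sources according to the GPT-3 mix weights reported in the paper cited above,[18] so WebText2 supplies roughly 22% of the draws despite holding under 4% of the tokens; the dataset labels and sampling loop are illustrative only.

```python
# Illustrative sketch of mix-weighted sampling: each batch's source dataset is
# drawn according to its training-mix weight, so smaller "high-quality" sets
# are sampled far more often per token. Weights follow Brown et al. [18].
import random

mix_weights = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,   # ~22% of draws from <4% of the total tokens
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

names = list(mix_weights)
weights = list(mix_weights.values())

def sample_source() -> str:
    """Pick which dataset the next training batch is drawn from."""
    return random.choices(names, weights=weights, k=1)[0]

draws = [sample_source() for _ in range(100_000)]
print({name: round(draws.count(name) / len(draws), 3) for name in names})
```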

88. The most highly weighted dataset in GPT-3, Common Crawl, is a “copy of the Internet” made available by an eponymous 501(c)(3) organization run by wealthy venture capital investors.[20] The domain www.nytimes.com is the most highly represented proprietary source (and the third overall, behind only Wikipedia and a database of U.S. patent documents) in a filtered English-language subset of a 2019 snapshot of Common Crawl, accounting for 100 million tokens (basic units of text).[21]
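
Per-domain token counts of the kind reported by Dodge et al.[21] come from grouping a crawl-derived corpus by source domain and summing token counts; the sketch below illustrates the idea with fabricated records and a crude whitespace tokenizer, neither drawn from the actual Common Crawl pipeline.

```python
# Minimal sketch: summing tokens per source domain in a crawl-derived corpus,
# the kind of tally behind "100 million tokens from www.nytimes.com."
# The records and the whitespace tokenizer are placeholders for illustration.
from collections import defaultdict
from urllib.parse import urlparse

records = [
    {"url": "https://www.nytimes.com/2019/05/05/us/example.html",
     "text": "Example article body text goes here ..."},
    {"url": "https://en.wikipedia.org/wiki/Example",
     "text": "Example encyclopedia entry text goes here ..."},
]

tokens_per_domain = defaultdict(int)
for record in records:
    domain = urlparse(record["url"]).netloc
    tokens_per_domain[domain] += len(record["text"].split())  # crude token count

for domain, n_tokens in sorted(tokens_per_domain.items(), key=lambda kv: -kv[1]):
    print(domain, n_tokens)
```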

89. The Common Crawl dataset includes at least 16 million unique records of content from The Times across News, Cooking, Wirecutter, and The Athletic, and more than 66 million total records of content from The Times.

90. Critically, OpenAI admits that “datasets we view as higher-quality are sampled more frequently” during training.[22] Accordingly, by OpenAI’s own admission, high-quality content, including content from The Times, was more important and valuable for training the GPT models than content taken from other, lower-quality sources.

91. While OpenAI has not released much information about GPT-4, experts suspect that GPT-4 includes 1.8 trillion parameters, which is over 10X larger than GPT-3, and was trained on approximately 13 trillion tokens.[23] The training set for GPT-3, GPT-3.5, and GPT-4 comprised 45 terabytes of data, the equivalent of a Microsoft Word document that is over 3.7 billion pages long.[24] Between the Common Crawl, WebText, and WebText2 datasets, the Defendants likely used millions of Times-owned works in full in order to train the GPT models.
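
As a rough check on that conversion, 45 terabytes spread across 3.7 billion pages works out to roughly 12 kilobytes of text per page; the per-page figure below is an assumption used only to show the arithmetic.

```python
# Back-of-the-envelope arithmetic for the 45 TB comparison above.
# bytes_per_page is an assumed figure for illustration, not from the filing.
dataset_bytes = 45 * 10**12        # 45 terabytes of training text
bytes_per_page = 12_000            # assume ~12 KB of text per document page
print(dataset_bytes / bytes_per_page)  # ~3.75e9, i.e. over 3.7 billion pages
```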

92. Defendants repeatedly copied this mass of Times copyrighted content, without any license or other compensation to The Times. As part of training the GPT models, Microsoft and OpenAI collaborated to develop a complex, bespoke supercomputing system to house and reproduce copies of the training dataset, including copies of The Times-owned content. Millions of Times Works were copied and ingested—multiple times—for the purpose of “training” Defendants’ GPT models.

93. Upon information and belief, Microsoft and OpenAI acted jointly in the large-scale copying of The Times’s material involved in generating the GPT models programmed to accurately mimic The Times’s content and writers. Microsoft and OpenAI collaborated in designing the GPT models, selecting the training datasets, and supervising the training process. As Mr. Nadella stated:

So, there are a lot of, I call it, product design choices one gets to make when you think about AI and AI safety. Then, let’s come at it the other way. You have to take real care of the pretrained data because models are trained on pretrained data. What’s the quality, the provenance of that pretrained data? That’s a place where we’ve done a lot of work.[25]

94. To the extent that Microsoft did not select the works used to train the GPT models, it acted in self-described “partnership” with OpenAI respecting that selection, knew or was willfully blind to the identity of the selected works by virtue of its knowledge of the nature and identity of the training corpuses and selection criteria employed by OpenAI, and/or had the right and ability to prevent OpenAI from using any particular work for training by virtue of its physical control of the supercomputer it developed for that purpose and its legal and financial influence over the OpenAI Defendants.

95. Upon information and belief, Microsoft and OpenAI continue to create unauthorized copies of Times Works in the form of synthetic search results returned by their Bing Chat and Browse with Bing products. Microsoft actively gathers copies of the Times Works used to generate such results in the process of crawling the web to create the index for its Bing search engine.

96. On information and belief, Microsoft and OpenAI are currently making, or will imminently commence making, additional copies of Times Works to train and/or fine-tune the next-generation GPT-5 LLM.

97. Defendants’ large-scale commercial exploitation of Times content is not licensed, nor have Defendants received permission from The Times to copy and use its works to build their GenAI tools.

Continue Reading Here.


[12] OpenAI, Better Language Models and Their Implications, OPENAI (Feb. 14, 2019), https://openai.com/research/better-language-models.

[13] Id.

[14] GPT-2 Model Card, GITHUB (Nov. 2019), https://github.com/openai/gpt-2/blob/master/model_card.md.

[15] RADFORD ET AL., LANGUAGE MODELS ARE UNSUPERVISED MULTITASK LEARNERS 3 (2018), https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.

[16] GPT-2 Model Card, supra note 14.

[17] GPT-2 / domains.txt, GITHUB, https://github.com/openai/gpt-2/blob/master/domains.txt (last visited Dec. 21, 2023).

[18] BROWN ET AL., LANGUAGE MODELS ARE FEW-SHOT LEARNERS 9 (2020), https://arxiv.org/pdf/2005.14165.pdf.

[19] Id. at 8.

[20] COMMON CRAWL, https://commoncrawl.org/ (last visited Dec. 21, 2023).

[21] DODGE ET AL., DOCUMENTING LARGE WEBTEXT CORPORA: A CASE STUDY ON THE COLOSSAL CLEAN CRAWLED CORPUS (2021), https://arxiv.org/abs/2104.08758.

[22] BROWN ET AL., supra note 18.

[23] Maximilian Schreiner, GPT-4 Architecture, Datasets, Costs and More Leaked, THE DECODER (July 11, 2023), https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/.

[24] Kindra Cooper, OpenAI GPT-3: Everything You Need to Know [Updated], SPRINGBOARD (Sept. 27, 2023), https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/.

[25] Nilay Patel, Microsoft Thinks AI Can Beat Google at Search — CEO Satya Nadella Explains Why, THE VERGE (Feb. 7, 2023), https://www.theverge.com/23589994/microsoft-ceo-satya-nadella-bing-chatgpt-googlesearch-ai.


About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, 1:23-cv-11195, retrieved on December 29, 2023, from nycto-assets.nytimes.com, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.


Written by legalpdf | Legal PDFs of important tech court cases are far too inaccessible for the average reader... until now.
Published by HackerNoon on 2024/01/02