Legal Issues Surrounding Copilot's Use of Training Data

DOE vs. GitHub (amended complaint) Court Filing (Redacted), June 8, 2023, is part of HackerNoon’s Legal PDF Series. This is part 17 of 38.

VII. FACTUAL ALLEGATIONS

E. Copilot Was Launched Despite Its Propensity for Producing Unlawful Outputs

86. GitHub and OpenAI have not provided much detail regarding what data Codex and Copilot were trained on. Plaintiffs know for certain, from GitHub and OpenAI’s statements, that both systems were trained on publicly available GitHub repositories, with Copilot having been trained on all available public GitHub repositories.

87. According to OpenAI, Codex was trained on “billions of lines of source code from publicly available sources, including code in public GitHub repositories.” Similarly, GitHub has described[13] Copilot’s training material as “billions of lines of public code.” GitHub researcher Eddie Aftandilian confirmed in a recent podcast[14] that Copilot is “train[ed] on public repos on GitHub.”

88. In a recent customer-support message, GitHub’s support department clarified certain facts about training Copilot. First, GitHub said that “training for Codex (the model used by Copilot) is done by OpenAI, not GitHub.” Second, in its support message, GitHub put forward a more detailed justification for its use of copyrighted code as training data:

Training machine learning models on publicly available data is considered fair use across the machine learning community . . . OpenAI’s training of Codex is done in accordance with global copyright laws which permit the use of publicly accessible materials for computational analysis and training of machine learning models, and do not require consent of the owner of such materials. Such laws are intended to benefit society by enabling machines to learn and understand using copyrighted works, much as humans have done throughout history, and to ensure public benefit, these rights cannot generally be restricted by owners who have chosen to make their materials publicly accessible.

The claim that training ML models on publicly available code is widely accepted as fair use is not true. And regardless of this concept’s level of acceptance in “the machine learning community,” under Federal law, it is illegal.

89. Former GitHub CEO Nat Friedman said in June 2021, when Copilot was released to a limited number of customers, that “training ML systems on public data is fair use.”[15] Friedman’s statement is pure speculation; no Court has considered the question of whether “training ML systems on public data is fair use.” The Fair Use affirmative defense is only applicable to Section 501 copyright infringement. It is not a defense to violations of the DMCA, breach of contract, nor any other claim alleged herein. It cannot be used to avoid liability here. At the same time Friedman asserted “the output [of Copilot] belongs to the operator.”

90. Other open-source stakeholders have made this point already. For example, in June 2021, Software Freedom Conservancy (“SFC”), a prominent open-source advocacy organization, asked Microsoft and GitHub to provide “legal references for GitHub’s public legal positions.” No references were provided by any of the Defendants.[16]

91. Beyond the examples above, Copilot regularly Outputs verbatim copies of Licensed Materials. For example, Copilot reproduced verbatim well-known code from the game Quake III, use of which is governed by one of the Suggested Licenses, GPL-2.[17]

92. Copilot also reproduced code that had been released under a license that allowed its use only for free games and required attribution by including a copy of the license. Copilot neither mentioned nor included the underlying license when providing a copy of this code as Output.[18]

93. Texas A&M computer-science professor Tim Davis has provided numerous examples of Copilot reproducing code belonging to him without its license or attribution.[19]

94. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times.

95. In June 2022, Copilot had 1,200,000 users. If only 1% of users have ever received Output based on Licensed Materials and only once each, Defendants have “only” breached Plaintiffs’ and the Class’s Licenses 12,000 times. However, each time Copilot outputs Licensed Materials without attribution, the copyright notice, or the License Terms, it violates the DMCA three times. Thus, even using this extreme underestimate, Copilot has “only” violated the DMCA 36,000 times.[20] Because Copilot constantly Outputs code as a user writes, and because nearly all of Copilot’s training data was Licensed Material, this number is most likely exponentially lower than the true number of breaches and DMCA violations.
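The arithmetic in paragraph 95 can be checked with a short sketch. The three input figures (1,200,000 users, 1% of users, three violations per Output) come from the paragraph itself; the function and variable names are illustrative only and do not appear in the filing. The sketch reproduces the filing's stated lower bound and is not an independent estimate of actual breaches or violations.

```python
# Worked check of the lower-bound arithmetic in paragraph 95.
# Inputs are the figures stated in the filing; names are hypothetical.

USERS_JUNE_2022 = 1_200_000   # Copilot users in June 2022, per the filing
AFFECTED_PERCENT = 1          # filing's assumption: 1% of users, once each
VIOLATIONS_PER_OUTPUT = 3     # missing attribution, copyright notice, license terms


def lower_bound_estimate(users, affected_percent, violations_per_output):
    """Return (license_breaches, dmca_violations) under the filing's assumptions."""
    # Integer arithmetic avoids floating-point rounding on the percentage.
    license_breaches = users * affected_percent // 100
    dmca_violations = license_breaches * violations_per_output
    return license_breaches, dmca_violations


breaches, violations = lower_bound_estimate(
    USERS_JUNE_2022, AFFECTED_PERCENT, VIOLATIONS_PER_OUTPUT
)
print(f"License breaches: {breaches:,}")   # 12,000
print(f"DMCA violations:  {violations:,}")  # 36,000
```

Changing any single input scales the result linearly, which is why the filing treats the 1%, once-per-user assumption as the main source of underestimation.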

96. Furthermore, the Suggested Licenses impose attribution obligations not only when Licensed Materials have been used verbatim, but also when Licensed Materials have been modified or adapted. Though Output from Copilot is often a verbatim copy, even more often it is a modification: for instance, a near-identical copy that contains only semantically insignificant variations of the original Licensed Materials, or a modified copy that recreates the same algorithm. Whenever Copilot outputs Licensed Materials in a manner that qualifies as a modification, the attribution requirements of the Suggested Licenses still apply. Copilot’s failure to provide the attributions for outputs that are modifications of Licensed Materials represents another enormous set of license breaches and DMCA violations.


About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, 4:22-cv-06823-JST, retrieved on August 26, 2023, from Storage Courtlistener, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.