Klosek, Teremi, and Wolfson "AI Essentials: From Tools to Strategies: A 2025 NISO Training Series, Session Four - AI Governance: Copyright, Licensing, and Compliance"

BaltimoreNISO 11 views 43 slides Oct 23, 2025

About This Presentation

This presentation was provided by Katherine Klosek of the Association of Research Libraries (ARL), Samantha Teremi of the University of California, Berkeley, and Stephen Wolfson of the University of Pennsylvania Libraries, for the third session of the NISO training series "AI Essentials: From ...


Slide Content

AI Essentials: From Tools to Strategy

Week 3: AI Governance: Copyright, Licensing, and Compliance
Understand how US copyright law and licensing terms govern the training of AI models on copyrighted content.
Identify key considerations when evaluating vendor contracts and publisher policies related to AI usage on licensed materials.
Discuss how institutional policies can shape responsible AI use in academic settings.

Welcome today’s presenters
Katherine Klosek, Director of Information Policy & Federal Relations, Association of Research Libraries
Samantha Teremi, Licensing Librarian, University of California, Berkeley
Stephen Wolfson, Asst General Counsel & Copyright Advisor, University of Pennsylvania

Reminder… This session offers practical guidance and foresight, not legal advice.

Quick poll
When it comes to training AI models on copyrighted materials, which statement best reflects your view?
  Training on copyrighted works should be freely allowed -- it’s part of innovation and fair use.
  Training is acceptable only with proper licensing or permission.
  Training should require transparency about data sources and opt-out mechanisms.
  Training should be prohibited unless all rights are cleared in advance.
  It depends -- we need clearer laws and guidance.

Copyright and AI: Background and recent developments
Stephen M. Wolfson, Assistant General Counsel/Copyright Advisor, University of Pennsylvania Libraries

Copyright basics for AI
Copyright is a limited-duration property right that allows authors to control their creative works in several ways.
Copyright grants authors a bundle of exclusive rights:
  Reproduction, creation of derivative works, distribution, public display, public performance
Copyright automatically attaches to original works of authorship as soon as they are created, if they satisfy three conditions:
  Independently created; creative; and fixed in a tangible medium of expression
  It is extremely easy to satisfy these three conditions
Copyright does not protect things like:
  Facts, ideas, methods of operation, concepts, principles
  Names, titles, short phrases
  Works of the federal government
  The text of cases and statutes

Copyright basics: Fair use
Fair use is an exception to the exclusive rights that copyright law gives to rightsholders. It allows others to use copyrighted works, without permission from the rightsholders, in ways that would otherwise be infringement.
Fair use is a balancing test.
To determine fair use, courts weigh four factors against each other, considering how the use fits within the purposes of copyright, and determine whether, on balance, a use is fair or not.
The factors are:
  1. The purpose and character of the use
  2. The nature of the copied work
  3. The amount copied
  4. The harm to the market for the original

Copyright and generative AI

GenAI Inputs: Is Training Copyright Infringement?
Large Language Models must be trained on huge amounts of data to work
Most training data today is unlicensed
LLMs ingest data, analyze it, adjust their weights and biases, and then purge those data
  LLMs do not store copies of training data; they learn patterns in data and can reproduce those patterns
Whether training constitutes infringement is a threshold question for the future of the technology because LLMs can’t exist without training on data
  It would be cost prohibitive for many developers to license all of the data they need

A line of cases supports the argument that training is fair use, at least under many circumstances

GenAI Inputs: Cases supporting fair use
Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015)
  Using copyrighted works as parts of data in a database was transformative and fair use
Kelly v. Arriba Soft, 336 F.3d 811 (9th Cir. 2003)
  Using image thumbnails in a search engine was transformative and fair use
Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009)
  Using and saving copyrighted works for plagiarism checking software was transformative and fair use

But a few recent cases may change our thinking on this

Thomson Reuters v. Ross (D. Del. 2025)

Thomson Reuters v. Ross (D. Del. Feb. 11, 2025)
This case involves the legal research database, Westlaw
  Westlaw is an extremely powerful -- and quite expensive -- database that allows researchers to find all kinds of legal information resources, including cases, statutes, regulations, administrative rules, treatises, guides, and scholarship
Ross built an AI legal research tool using content taken from Westlaw
Ross worked with LegalEase to produce “bulk memos” to train its AI model
  The bulk memos were based on Westlaw’s headnotes
  Headnotes are annotations on rules of law that West editors attach to cases
  Often annotations are almost exactly the same as the case text
TR sued, saying that these bulk memos were essentially just West headnotes

Westlaw headnotes

Westlaw headnotes

West key number system

Thomson Reuters v. Ross (D. Del. Feb. 11, 2025)
The court first held that the headnotes and the key number system were copyrighted works
  Simply selecting some case text was enough to justify copyright protection, even where the text of the annotation was taken verbatim from the case
Then the court found that Ross’s use was not fair
  The most important factors were 1 (purpose and character/commerciality) and 4 (market harm)
  Ross’s use of the headnotes was for the same purpose as they were created -- to enable users to find cases -- and for a commercial purpose; both cut against fair use
  Ross wanted to create a competing product based on TR’s works -- this cut against fair use
The court seemed very bothered that Ross used TR’s works to create a competing product

Thomson Reuters v. Ross Inc.: What’s next?
This will likely be the first decision from an appellate court on AI training
  Ross filed its brief for interlocutory appeal at the 3rd Circuit on fair use/training
  The 3rd Circuit is currently considering the case
Nota Bene (maybe)
  The district court noted that this is not a generative AI case, but I think it is similar enough that it could have a very important impact

Bartz v. Anthropic (N.D. Cal. June 23, 2025)

Bartz v. Anthropic (N.D. Cal. June 23, 2025)
Anthropic created a corpus of around 7 million items to train its LLM, Claude
This included:
  The Books3 library of around 200,000 unauthorized book copies and similar online libraries of digital book copies
  Digital copies of print books that Anthropic purchased -- both new and used -- after which it discarded the print copies
Anthropic stored this corpus to be used for other purposes (including, but not limited to, training)
A group of plaintiffs whose books are in Books3 sued Anthropic, claiming, among other things, that copying and storing their books as part of Claude’s training was copyright infringement

Bartz v. Anthropic (N.D. Cal. June 23, 2025)
Judge Alsup held that:
  training Claude was fair use, and
  copying/storing print books that Anthropic purchased was fair use, but
  building and saving a library of unauthorized copies was not fair use
He described training as “spectacularly” transformative
He doubted, but did not rule on, whether pirating copies that Anthropic could otherwise buy could ever be justified
He seems to reject the market dilution theory
  The idea that an AI can compete in the market with works in its training data by being able to create generally similar works, not necessarily substantially similar or identical works, and that this will dilute the market for the works in the training data
Market dilution theory seems wrong to me
  Copyright protects specific expressions, not against general competition
  This could significantly narrow the scope of fair use beyond the AI context

Bartz v. Anthropic: What’s happening now
On July 17, 2025, Judge Alsup certified a class of “All beneficial or legal copyright owners” infringed by Anthropic’s storing of unauthorized copies
  This included thousands of claimants including both authors and publishers
  Losing the class action could have cost Anthropic billions of dollars, possibly ending the company
Anthropic fought this initially but decided it wanted to settle
On Oct. 17, 2025, Judge Alsup granted preliminary approval of a $1.5B settlement in Bartz v. Anthropic
  Estimated ~$3,100/claim
https://www.anthropiccopyrightsettlement.com/

Kadrey v. Meta (N.D. Cal. June 25, 2025)

Kadrey v. Meta (N.D. Cal. June 25, 2025)
Meta trained its LLM, Llama, on the Books3 library and other shadow libraries
A group of authors whose books were in Books3, including comedian Sarah Silverman, sued
Meta moved for summary judgment that training Llama on these books was fair use

Kadrey v. Meta
Judge Chhabria was skeptical that training would be fair use in most cases
  But under these facts and these arguments, he felt bound to hold in favor of Meta
Judge Chhabria disagreed with Judge Alsup on the market harm factor and instead embraced market dilution
  “Courts can’t stick their heads in the sand to an obvious way that a new technology might severely harm the incentive to create, just because the issue has not come up before. Indeed, it seems likely that market dilution will often cause plaintiffs to decisively win the fourth factor--and thus win the fair use question overall--in cases like this.”
He described training as “highly transformative” but still doubted it should be fair use because of the market harm
Ultimately, he found in favor of Meta because the authors did not develop their argument enough

What’s next?
There are more than 50 other cases about copyright and genAI currently ongoing, so we will see more holdings in the future
The 3rd Circuit will probably be the first appellate court to rule on these issues, so it is the one to watch right now
These three cases represent three directions courts could take on training
  Will courts focus on whether and how the tool competes with the training data?
  Will courts split building a corpus from training?
  Will courts focus on market dilution?
I expect SCOTUS to weigh in eventually, maybe in multiple cases
So we are very far from the end of this story

Questions?

https://www.librarycopyrightalliance.org/

Copyright and Artificial Intelligence Part 3: Generative AI Training (pre-publication version) “Even where a model’s outputs are not substantially similar to any specific copyrighted work, they can dilute the market for works similar to those found in its training data, including by generating material stylistically similar to those works.” https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf

Breakout 1 What trends are you seeing in publisher policies for AI use in academic settings? How are these trends affecting teaching, learning, and research?

Licensing Considerations for AI

U.S. CONTRACTUAL OVERRIDE Even if a use is fair, or if the content is not protected by copyright at all, there may be a contract that restricts scraping, TDM, AI, and/or breaking DRM to do TDM or use AI

EU AI Act & Directive on Copyright
TDM Permitted: Can conduct TDM & retain copies of mined works for scientific research
No AI Training Opt-Outs: Copyright owners may not opt out of allowing works to be used for AI training for scientific research
No Contractual Override: License agreements cannot negate either of these rights
Appropriate security measures

Example AI Restriction
Customer and its Authorized Users may not:
  directly or indirectly develop, train, program, improve, and/or enrich any artificial intelligence tool (“AI Tool”) accessible to anyone other than Customer and its Authorized Users, whether developed internally or provided by a third party; or
  reproduce or redistribute the Content to any third-party AI Tool, except to the extent limited portions of the Content are used solely for research and academic purposes (including to train an algorithm) and where the third-party AI Tool
    (a) is used locally in a self-hosted environment or closed hosted environment solely for use by Customer or Authorized Users;
    (b) is not trained or fine-tuned using the Content or any part thereof; and
    (c) does not share the Content or any part thereof with a third party.

Learning From Licensed Content

Use vs. Training
Homegrown non-generative
Homegrown generative
Third-party non-generative
Third-party generative

Negotiation Strategies
Divide AI uses into subtypes
Frame clauses in the negative
Evaluate and address publisher concerns:
  Licensed content only available to Licensee & Authorized Users
  Disruption to functionality
  Reproduction/redistribution to third parties
  Competing or commercial products
  Reasonable information security standards

Negotiation Strategy: Exclude OA/Creative Commons-Licensed Works
Except as expressly stated in this Agreement or otherwise permitted in writing by [Licensor], or as permitted by any Creative Commons licenses or public domain dedications applied to the Subscribed Products, the Subscriber and its Authorized Users may not:...

Negotiation Strategy: Leverage Stakeholder Support https://ucnet.universityofcalifornia.edu/employee-news/president-drake-and-provost-newman-affirm-the-universitys-commitment-to-protect-author-researcher-and-reader-rights/

Breakout 2 How can your organization or institution ensure that students and faculty have lawful access to copyrighted works for AI training in research and academic settings? How can we foster open dialogue and collaboration between libraries and publishers during negotiations for licensed scholarly materials?

Wrap up and takeaways