These are slides from the copyright session of the Building Legal Literacies for Text Data Mining (Building LLTDM) Institute, hosted by the University of California, Berkeley Library's Office of Scholarly Communication Services.
1. Copyright & Fair Use for Digital Projects
Building Legal Literacies
for Text Data Mining
2. Copyright & Fair Use for Digital Projects
Day 2
Copyright
David Bamman, University of California, Berkeley
Brandon Butler, University of Virginia
Kyle K. Courtney, Harvard University
Brianna Schofield, Authors Alliance
4. Use Case
Author | Title | Text Snippet
Dickens | Bleak House | My Lady Dedlock has returned to her house in town for a few days previous to her departure for Paris
James | Portrait of a Lady | The two ladies had found Henrietta in Paris, and Isabel constantly saw her
Hurston | The Price of Power | “So, how is Paris in late February?” Molly asked.
Nabokov | Pale Fire | He answered he would be going to America some time next month and had business in Paris tomorrow.
Atwood | Lady Oracle | “My,” said one of the women, “I’ve always wanted to go to Paris. Is it as beautiful as they say?”
5. Use Case
● Data
○ Books published between 1800 and 2020, scanned and OCR’d
○ Unpublished manuscripts
● Algorithmic transformation: You want to perform named entity
recognition (NER) and toponym resolution on the entire dataset
○ Extract all mentions of place names from text
○ Ground those places in specific coordinates on a map
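The two steps above (extract place names, then ground them in coordinates) can be sketched in a few lines. This is only a toy illustration: it swaps a trained NER model for a plain dictionary lookup, and the `GAZETTEER` table, its coordinates, and the function names are assumptions made for the example, not part of the original slides.

```python
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude).
# A real project would use a trained NER model and a full gazetteer.
GAZETTEER = {
    "Paris": (48.8566, 2.3522),
    "America": (39.8283, -98.5795),  # rough centroid, illustrative only
}

def extract_places(text):
    """Return (name, start_offset) for every gazetteer name found in text."""
    pattern = r"\b(" + "|".join(re.escape(name) for name in GAZETTEER) + r")\b"
    return [(m.group(1), m.start()) for m in re.finditer(pattern, text)]

def resolve(places):
    """Attach coordinates to each extracted place name (toponym resolution)."""
    return [(name, GAZETTEER[name]) for name, _ in places]

snippet = ("He answered he would be going to America some time next month "
           "and had business in Paris tomorrow.")
mentions = extract_places(snippet)   # place names with character offsets
grounded = resolve(mentions)         # place names with map coordinates
```

Note that the output holds only names, offsets, and coordinates, not the running text of the book.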
6. Use Case
● You want to improve NER by annotating selections from the novels in
your dataset and training a better model
● To support reproducible research, you want to publish those
annotations along with the original texts
9. Copyright Act & Purpose
U.S. Constitution, Article I, § 8, clause 8: "To promote the progress of science and
useful arts, by securing for limited times to authors and inventors the exclusive
right to their respective writings and discoveries."
Copyright Act of 1790: “An Act for the Encouragement of Learning….”
Copyright Act of 1976: Copyright protects “original works of authorship fixed in
any tangible medium of expression.”
10. Original Work of Authorship
An original work must embody some “minimum
amount of creativity.”
● Feist Publications, Inc., v. Rural Telephone
Service Co., 499 U.S. 340 (1991)
● Copyright Act, §102(b)
11. What is Copyright, really?
A “bundle of rights”
Limited economic monopoly for authors
A system “to promote the progress of science and the
useful arts…”
13. The Exclusive Bundle of Rights (17 U.S.C. §106)
1. To reproduce the work in copies
2. To prepare derivative works
3. To distribute copies
4. To perform the copyrighted work publicly
5. To display the copyrighted work publicly
14. Subject Matter of Copyright
● Literary works
● Musical works
● Dramatic works
● Pantomimes and choreographic works
● Pictorial, graphic, and sculptural works
● Motion pictures and audiovisual works
● Sound recordings
● Architectural works
15. What You Can’t Copyright
● Slogans & Logos (Trademark)
● Processes, methods & systems (Patent Law)
● Proprietary formulas/recipes (Trade Secret
Law)
○ Coca-Cola formula
● Raw Data (un-copyrightable)
○ You can’t copyright a fact
○ White pages
16. Know Copyright
● Original/Creative & Fixed
● No Registration Required
● Grant of Rights to Owners (the “bundle”)
● Wide Range of Protected Works
● Long Term of Protection
● Exceptions & Limitations
○ Section 107: Fair Use
○ And others…
17. Today: Life of the Author + 70 years
1976: Life of the Author + 50 years
1909: 28 years + 28-year renewal
1790: 14 years + 14-year renewal
19. What is the Public Domain?
The commons of material that is not
protected by copyright—anyone is free
to use, copy, share, and remix public
domain works
● Works for which copyright has expired
● Works for which copyright owners failed to
comply with “formalities”
● Works that are not copyrightable
20. What the Public Domain is Not
The public domain has nothing to do with
what is readily available for public
consumption
● Just because something is on the internet,
for example, does not put it in the public
domain
21. Expired Copyright
When copyright for a work expires, the
work enters the public domain
● As of 2020, all works first published in the U.S.
in 1924 or earlier are in the public domain
in the U.S.
● Works created today by an individual author
will remain under copyright for the author’s
life plus 70 years
22. Failure to Comply With “Formalities”
Copyright law used to require
“formalities”, and if these were not met
the work entered the public domain
● Because of this, many works published
between 1925 and March 1989 may also be in
the public domain
23. Expired Copyright Resources
● Cornell University Library’s Copyright Term and the Public Domain in the United States: copyright.cornell.edu/publicdomain
● Samuelson Law, Technology & Public Policy Clinic’s Is it In the Public Domain?: z.umn.edu/pd-handbook
24. Things That Are Uncopyrightable
Certain things are simply not eligible to
be protected by copyright and are
automatically in the public domain, e.g.:
● Facts
● Lists of ingredients, rules of a board game
● Titles, phrases, and slogans
● Works created by the federal government
25. Public Domain and TDM
If a work or collection of works
is in the public domain, then
copyright issues do not apply
Caution: other legal issues may still apply,
and “low-friction” data can carry
social biases
27. Licenses and Copyright
A license authorizes the licensee to make
uses covered by copyright’s exclusive
rights
● Your institution may already have a license
that sets parameters for TDM
● You can ask the copyright owner for a license
to use the work
● READ THE LICENSE (or talk to someone who
has)
28. How Licenses Work (cont’d)
A license can affect copyright issues in
several ways:
● Permitted uses are safe from copyright
liability
● Not clearly permitted uses may need some
additional justification
● Expressly forbidden uses may be off-limits
even though copyright law would allow them
(e.g., via fair use)
29. Public Licenses
Creative Commons license elements: Attribution, Share-Alike, Non-Commercial, No Derivatives
Some works are available under public licenses that allow for specific
uses of copyrighted works without the need to seek additional
permission from the owner
30. Licenses/Permissions and TDM
If works are made available under a
public license or another license,
these works can be used in ways that
comply with the terms of the license.
Caution: other legal issues may still apply,
and “low-friction” data can carry
social biases
33. What is Fair Use?
A user’s right that allows individuals to
exercise one of the exclusive rights of
copyright
● Without obtaining the permission of the
copyright owner, and
● Without the payment of any license fee
34. Fair Use Factors
Four fair use factors
1. The purpose and character of the use
2. The nature of the copyrighted work
3. The amount and substantiality of the portion
used in relation to the copyrighted work as a
whole
4. The effect of the use on the potential market
for or value of the copyrighted work
35. Development of Transformative Fair Use
Does the use transform the material, by
using it for a different purpose?
Was the amount taken appropriate to the
new purpose?
36. Fair Use is Flexible
The four factors are not mechanically
applied or weighed equally
● Fair use is flexible
● Courts take into account all the facts and
circumstances of a specific case to decide if
an unlicensed use of copyrighted material is
fair
38. TDM: The Fairest of them All?
The Queen asks the Mirror. Illustration by Franz Jüttner from
Sneewittchen (1905), reprinted in Wikipedia.
A long line of cases favors fair use for
digital “non-consumptive” use
● Search engines and related tech have been
vindicated in court
● Harmony across judicial circuits
● A very close analogy to TDM research
39. Google Books: A Case Study
Google (with library partners!) made
digital copies of millions of books to
create its Google Book Search service
● Users could search for relevant words, terms,
or phrases across the corpus
● Search results showed “snippets” - portions
of pages with words or phrases in context
● The Authors Guild sued; Google won.
40. Google Books: A Case Study
The court analyzed the fairness of two
key “uses”
● Copying millions of complete books to create
a search index (behind the curtain)
● “Revealing” snippets of protected text as
part of search results displayed to the public,
as well as Ngram graphs showing recurrence
of words and phrases over time.
41. The purpose and character of the use
Google Books: Factor One
Google’s actions were found “highly transformative”:
● Copying the entire text of millions of books to create an index
● Creation of the ngrams tool
● Display of “snippets” of relevant text in search results
42. Google Books: Factor One
The purpose is “to make available
significant information about
those books”; it does not
supersede the objective of the
original work but “instead
add[s] something new, with a
further purpose or different
character.”
43. The nature of the copyrighted work
Google Books: Factor Two
The court gave this factor fairly cursory treatment, saying that,
considered in isolation, it did not influence the outcome one way
or the other
44. The amount and substantiality of the portion used
Google Books: Factor Three
● Copying the entire work was “literally necessary” for search
● The amount displayed in snippets was reasonable for the purpose and not a substitute for the original
45. The effect of the use on the potential market for or value of the work
Google Books: Factor Four (Snippets)
The snippet function did not give
searchers access to competing
substitutes and therefore did
not threaten rights holders with
any significant harm to the value
of their copyrights
46. The effect of the use on the potential market for or value of the work
Google Books: Factor Four (Index)
Creating the index did not make
any new copies available to the
public, and the use itself was
transformative, so it did not
require a derivative work license.
47. Google Books: Putting It Together
Fair use favors digital research to learn
and communicate new information about
in-copyright works
● Different function from original books
● Amount copied was reasonable (necessary!)
● Amount “revealed” was tailored,
non-substitutional
● No harm to the market for original works
48. iParadigms: Case Study #2
Defendants made a commercial use of
copyrighted works (student papers) to
create plagiarism detection software
● Use of works for plagiarism detection had an
entirely different purpose (to prevent
plagiarism) than the expressive purpose of
the original works
● Use of the entire term paper was necessary to
serve that purpose
49. Fair Use and TDM: Lessons from the Cases
● Copying to create a database for TDM
● Using derived data
● Publishing data set
50. Copying to Create a Database for TDM
Cases say:
1. Creating a database/corpus for TDM
analysis is a highly transformative
purpose
2. The appropriate amount for this work
is typically entire works (even millions of
works)
3. Creating such a database has no market
effect and is not a licensable “derivative
work”
51. Using Derived Data
Derived data (word frequency tables,
ngram graphs, etc.) do not infringe on the
rights of the copyright owner when they
comprise unprotectable facts and ideas
● Copyright only protects the expressive
content of protected works
● Infringement requires “substantial similarity”
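To make "derived data" concrete, here is a minimal sketch that turns a snippet into a word frequency table and bigram counts. The tokenizer and the function names are assumptions chosen for illustration; real TDM pipelines typically use more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)

def ngram_counts(tokens, n=2):
    """Count each run of n consecutive tokens (bigrams by default)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

text = "The two ladies had found Henrietta in Paris, and Isabel constantly saw her"
tokens = tokenize(text)
freqs = word_frequencies(tokens)
bigrams = ngram_counts(tokens, 2)
```

The resulting tables record counts of words and word pairs; they do not reproduce the sentence in readable order.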
52. Publishing Data Set
Further publishing of the data set
requires a separate fair use analysis
● Look at effects of release on traditional
market
● Amount released
● Security measures in place
54. Asking Permission
Myth:
I cannot rely on fair use if I ask
for permission and permission is
denied.
55. Asking Permission
Fact:
You don’t have to ask
permission, but if you do, you
can still rely on fair use where
applicable.
56. Using an Entire Work
Myth:
I cannot rely on fair use if I am
using the entire copyrighted
work.
57. Using an Entire Work
Fact:
The amount of the work copied is
just one of the factors, and it is
unlikely to be the most important
one if there is a transformative
purpose.
59. Using Unpublished Material
Fact:
The Copyright Act explicitly allows
fair use of unpublished materials.
The unpublished nature of a work is
still considered, but it must be
weighed with the other factors.
60. Using Highly Creative Works
Myth:
I cannot rely on fair use if I am
using a highly creative
copyrighted work.
61. Using Highly Creative Works
Fact:
Courts consider whether a work is
creative or factual, but this factor
tends to have very little weight in
the overall analysis, especially when
the purpose of the use is
transformative.
62. Making Commercial Uses
Myth:
I cannot rely on fair use if I am
making a commercial use of a
copyrighted work.
63. Making Commercial Uses
Fact:
While noncommercial uses may
weigh in favor of fair use,
commercial uses can be fair too.
65. Expected value
For a given action (a spin of the wheel,
e.g.),
● Multiply the value of each possible outcome
by its probability of occurring
● Add them up
● For the wheel, EV = (½ x 1)+(¼ x 2)+(¼ x 3)
= 1.75
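The wheel calculation above can be checked with a short sketch. The function name and the `(probability, value)` pair layout are assumptions for illustration; exact fractions are used so the arithmetic matches the slide term by term.

```python
from fractions import Fraction

def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs whose probabilities sum to 1."""
    total_probability = sum(p for p, _ in outcomes)
    if total_probability != 1:
        raise ValueError("probabilities must sum to 1")
    # Multiply each value by its probability, then add the products up.
    return sum(p * v for p, v in outcomes)

# The wheel from the slide: 1/2 chance of 1, 1/4 chance of 2, 1/4 chance of 3.
wheel = [(Fraction(1, 2), 1), (Fraction(1, 4), 2), (Fraction(1, 4), 3)]
ev = expected_value(wheel)
print(float(ev))  # 1.75
```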
66. Black Swans and File-Sharing Fines
Low-probability, high-magnitude risk
● Copyright’s statutory damages: up to $150k
per work infringed
● Jammie Thomas-Rasset: $222k for sharing
24 songs, 59¢ - 79¢ ea. retail
● Copyright lawsuits are very, very rare, but
could go very, very badly
Jammie Thomas-Rasset, from the Mille Lacs
Messenger
67. Copyright Risk Reducers for TDM Research
● No statutory damages for employees of
nonprofit educational institutions,
archives, and libraries who have a
“good faith belief” in fair use
(§504(c))
● State sovereign immunity and qualified
immunity
● Registration requirement
68. Copyright Risk Reducers for TDM Research
● Notice mechanisms and policies
● Reasonable attribution
● Plaintiffs face risks too!
○ Fee-shifting under Fogerty and
Kirtsaeng
○ Lawsuits are expensive
69. Remember mission risk!
Mission risk: the risk to your mission when
you fail to take valuable action
● Letting fear of downsides prevent action can
cause much more harm to your mission than
taking reasonable risks
● HathiTrust was a “reasonable acquisition of risk”:
there was some uncertainty, but the value of the
positive outcome was massive, and it continues to
pay dividends.
71. Use Case
Author | Title | Text Snippet
Dickens | Bleak House | My Lady Dedlock has returned to her house in town for a few days previous to her departure for Paris
James | Portrait of a Lady | The two ladies had found Henrietta in Paris, and Isabel constantly saw her
Hurston | The Price of Power | “So, how is Paris in late February?” Molly asked.
Nabokov | Pale Fire | He answered he would be going to America some time next month and had business in Paris tomorrow.
Atwood | Lady Oracle | “My,” said one of the women, “I’ve always wanted to go to Paris. Is it as beautiful as they say?”
72. Use Case
● Books published between 1800 and 2020, scanned and OCR’d
○ Published 1924 or earlier → likely in the public domain
○ Published 1925-2020 → more likely to be subject to
copyright (unless the owner failed to comply with formalities)
○ Corpus includes works of fiction, “highly creative works”
○ Unpublished manuscripts
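The year-based rules of thumb above can be turned into a first-pass triage of a corpus. This is a sketch under stated assumptions: it uses only a publication year, applies the cutoffs as of 2020 from these slides, and treats the bucket labels, the `triage` function, and the sample records as illustrative. It sorts titles into groups for human review and does not determine copyright status.

```python
from collections import Counter

def triage(pub_year):
    """First-pass copyright bucket from a U.S. publication year (cutoffs as of 2020)."""
    if pub_year <= 1924:
        return "likely public domain"
    if pub_year <= 1989:
        # 1925 through March 1989: status depends on whether formalities were met.
        # A bare year cannot separate early 1989 from late 1989, so all of 1989
        # is sent to manual review here.
        return "check formalities"
    return "likely in copyright"

# Illustrative records; years are commonly cited first-publication dates.
corpus = [
    ("Bleak House", 1853),
    ("The Portrait of a Lady", 1881),
    ("Pale Fire", 1962),
    ("Lady Oracle", 1976),
]

buckets = {title: triage(year) for title, year in corpus}
summary = Counter(buckets.values())  # how many titles fall in each bucket
```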
74. Use Case
● You want to improve NER by annotating selections from the novels in
your dataset and publish those annotations to enable reproducible
research.
● Is it necessary to publish a sample of the original text?
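One way to explore that question is stand-off annotation, where the published record holds only character offsets and labels, plus a checksum so that someone with their own copy of the text can re-apply them. The sketch below is an assumption-laden illustration (the record layout and the function names are hypothetical), not a method described in the slides.

```python
import hashlib

def make_standoff(text, spans):
    """Build a stand-off record: offsets and labels, plus a checksum of the source text."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "annotations": [
            {"start": start, "end": end, "label": label}
            for start, end, label in spans
        ],
    }

def apply_standoff(text, record):
    """Re-attach annotations to a locally held copy of the same text."""
    if hashlib.sha256(text.encode("utf-8")).hexdigest() != record["sha256"]:
        raise ValueError("text does not match the annotated edition")
    return [(text[a["start"]:a["end"]], a["label"]) for a in record["annotations"]]

text = ("My Lady Dedlock has returned to her house in town for a few days "
        "previous to her departure for Paris")
start = text.index("Paris")
record = make_standoff(text, [(start, start + len("Paris"), "PLACE")])
recovered = apply_standoff(text, record)
```

Whether offsets alone are enough for reproducibility depends on whether other researchers can obtain the same edition of the text.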
75. Use Case
1. The purpose and character of the use
○ Do the annotations add new meaning or expression to the original work?
2. The nature of the copyrighted work
○ These are mainly creative works
3. The amount and substantiality of the portion used in relation to the copyrighted work
○ Are the samples published the “heart” of the work?
4. The effect of the use on the potential market for or value of the copyrighted work
○ Do the samples disrupt the market for the original?