If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you 5 legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well positioned to help others design such projects.
5. The Basics of TDM
“Text mining is the use of automated tools, techniques or
technology to process large volumes of digital content
that is often not well structured - to identify and
select relevant information; to extract information from
the content, to identify relationships within / between /
across documents and incidents or events for
meta-analysis.”
- from Text & Data Mining - A Librarian Overview by Ann
Okerson (2013)
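To make the definition concrete, here is a minimal Python sketch of one step it names: identifying relationships across documents. It is only an illustration. The three-document corpus is invented, and the helper names `tokenize` and `shared_terms` are hypothetical rather than part of any particular TDM toolkit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def shared_terms(docs, min_docs=2):
    """Return terms found in at least `min_docs` documents, with document counts."""
    doc_freq = Counter()
    for text in docs.values():
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(tokenize(text)))
    return {term: n for term, n in doc_freq.items() if n >= min_docs}


# Hypothetical three-document corpus, invented for illustration.
corpus = {
    "doc1": "Poetry therapy uses poems in counseling.",
    "doc2": "Counseling research needs a clear agenda.",
    "doc3": "A research agenda for poetry therapy.",
}
result = shared_terms(corpus)
```

The output is a table of terms and document counts, that is, information about the documents rather than the documents themselves.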
10. Facts & Ideas
Nicholas Mazza,
Poetry therapy: Toward a
research agenda for the 1990s,
The Arts in Psychotherapy,
Volume 20, Issue 1, 1993, pp. 51-59.
11. Content / Data about the content
TDM researchers can use copyrighted content!
12. Fair Use
17 U.S.C. § 107
“The fair use of a
copyrighted work…for
purposes such as
criticism, comment,
news reporting,
teaching…, scholarship,
or research, is not an
infringement of
copyright.”
13. Four-Factor Balancing Test
1. Purpose & character of use
“Transformativeness” often
dominates
2. Nature of copyrighted work
Whether factual/scholarly work
3. Amount and substantiality
Size & importance of portion
4. Effect on potential market
Whether it supplants market
14. Authors Guild v. HathiTrust
755 F.3d 87 (2d Cir. 2014)
Textual analysis that digital
library enabled was
transformative under factor
one, and overall fair
Authors Guild v. Google
804 F.3d 202 (2d Cir. 2015)
Creation of full-text
searchable database with
“snippet view” and “ngram
viewer” [search strings]
were fair uses
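As a rough illustration of what an n-gram count is, the Python sketch below tallies two-word sequences per year in a tiny invented sample. It does not describe how Google's ngram viewer is implemented. The sample texts and the helper names `ngrams` and `ngram_counts` are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_counts(texts_by_year, n=2):
    """Count n-grams per year, keeping only the counts and not the source text."""
    counts = {}
    for year, text in texts_by_year.items():
        tokens = text.lower().split()
        counts[year] = Counter(" ".join(gram) for gram in ngrams(tokens, n))
    return counts


# Invented sample texts keyed by year, for illustration only.
sample = {
    1990: "fair use of text fair use of data",
    2015: "text mining and fair use",
}
counts = ngram_counts(sample)
```

Looking up `counts[1990]["fair use"]` returns how often that phrase occurs in the 1990 sample. A reader of the counts sees frequencies, not the underlying passages.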
15. A.V. v. iParadigms, 562 F.3d 630
(4th Cir. 2009)
Plagiarism detection
software that replicated
content to detect
similarities was fair use
17. Fox News v. TVEyes,
883 F.3d 169 (2d Cir. 2018)
Basic functionality and
archiving features were
fair use, but making
available 10-minute
clips was not
18. Takeaways
● Likely fair to digitize to
conduct text data mining
(with security precautions)
● May not be fair to republish
large portions of content
● May not be fair to circulate
the digitized texts/corpus
● Case-by-case
21. Archives Agreement
“I understand that
permission to publish, or
otherwise publicly use,
materials . . . must be
[granted by library]
I understand further that
the University makes no
representation that it is
the owner of the
copyright... and that
permission to publish must
also be obtained from the
owner of the copyright.”
22. Website Terms
“If you intend to
quote extensive
amounts of text, use
other original
content, or
reproduce images
from this site,
please contact us
for permission.”
23. California Digital Library’s Model Database Language
Authorized Users may use the Licensed Materials to
perform and engage in text and/or data mining
activities for academic research, scholarship, and
other educational purposes... and may utilize and
share the results of text and/or data mining in their
scholarly work and make the results available for use
by others, so long as the purpose is not to create a
product for use by third parties that would substitute
for the Licensed Materials.
24. CDL Model License: Preserving Fair Use
Notwithstanding the foregoing, nothing in this
agreement shall otherwise restrict uses of the
material that would be fair use pursuant to 17 U.S.C. §
107 et seq.
25. Takeaways
● Agreements may constrict uses that
would otherwise be fair
● Familiarize yourself with the
agreement(s), ask for help,
evaluate risk
● Alternatives:
○ Check to see if site has an API
○ Negotiate with content providers
/ ask permission
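If a site does offer an API, a request to it might be prepared along the lines of the Python sketch below. Everything specific here is hypothetical: the endpoint `https://api.example.org/v1/search`, the `q` and `page` parameters, and the contact address are placeholders, and a real provider's documentation and terms of use govern what is actually permitted. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters; substitute the provider's documented API.
API_BASE = "https://api.example.org/v1/search"


def build_request(query, page=1):
    """Build (but do not send) a GET request that identifies the research project."""
    url = f"{API_BASE}?{urlencode({'q': query, 'page': page})}"
    # A descriptive User-Agent lets the provider see who is asking and how to reach them.
    headers = {"User-Agent": "tdm-research-project (contact: researcher@example.edu)"}
    return Request(url, headers=headers)


req = build_request("fair use")
api_url = req.full_url
```

Sending the request, for example with `urllib.request.urlopen(req)`, is left out on purpose. Whether and how often to call an API depends on the provider's terms.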