How and why
study big cultural data



Lev Manovich
manovich@ucsd.edu
softwarestudies.com
New York Times (November 16, 2010):
“The next big idea in language, history and
the arts? Data.”


NEH/NSF Digging into Data competition
(2009): “How does the notion of scale
affect humanities and social science
research?
Now that scholars have access to huge
repositories of digitized data—far more
than they could read in a lifetime—what
does that mean for research?”
Why study
big cultural data?
1 study societies through their social media
traces - social computing (but do we study
society or only social media itself?)

2 more inclusive understanding of history
and present (using much larger samples)

3 detect large scale cultural patterns

4 the best way to follow global
professionally produced digital culture;
understand newly developed cultural fields
(“X” design)

5 map cultural variability and diversity
Data: 3,724 18th century volumes, using 10,000 most frequent words
(excluding proper nouns). Ted Underwood.
The Differentiation of Literary and Nonliterary Diction, 1700-1900.
Growth of a global culture space after 1990:
Cumulative number of new art biennales, 1895-2008.
modern (19th-20th centuries) social and
cultural theory: describe what is similar
(classes, structures, types) / statistics
(reduction)


computational humanities and social
science should focus on describing what
is different / variability / diversity


not “from data to knowledge” but from
(incomplete) knowledge to actual cultural
data
We are no longer interested in the
conformity of an individual to an ideal type;
we are now interested in the relation of an
individual to the other individuals with
which it interacts... Relations will be more
important than categories; functions, which
are variable, will be more important than
purposes; transitions will be more
important than boundaries; sequences will
be more important than hierarchies.

Louis Menand on Darwin, 2001.
Visualization: Thinking
without “large”
categories
“The ontological status of assemblages,
large and small, is always that of unique,
singular individuals.”
“Unlike taxonomic essentialism in which
genus, species and individuals are
separate ontological categories, the
ontology of assemblages is flat since it
contains nothing but differently scaled
individual singularities.”
Manuel DeLanda. A New Philosophy of
Society.
Bruno Latour:

The “whole” is now nothing more than a
provisional visualization which can be
modified and reversed at will, by moving
back to the individual components, and
then looking for yet other tools to
regroup the same elements into
alternative assemblages.
How to study
big cultural data?
how to explore massive visual collections
(exploratory media analysis)?

which data analysis and visualization
techniques are appropriate for non-
technical users? How to democratize data
analysis?
Our approach:

media visualization
(visualizing media
directly rather than only
using abstract infovis
language)
visualizing large non-visual data using abstraction
media visualization: showing visual data
directly




Every cover of Time magazine, 1923-2009 (4535 images).
X-axis = publication date. Y-axis = saturation mean.
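A minimal sketch of how one point in such a plot could be computed, assuming each cover is already available as a list of (r, g, b) pixels with 0-255 channels. The function names, canvas size, and the choice of HSV saturation are illustrative assumptions, not the software used for the original visualization.

```python
import colorsys


def mean_saturation(pixels):
    """Return the mean HSV saturation (0.0-1.0) of (r, g, b) pixels with 0-255 channels."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return total / count


def plot_position(year, saturation, width=1000, height=500,
                  first_year=1923, last_year=2009):
    """Map (publication year, mean saturation) to pixel coordinates on a canvas.

    The canvas size is a hypothetical default; the year range follows the slide.
    """
    x = (year - first_year) / (last_year - first_year) * (width - 1)
    y = (1.0 - saturation) * (height - 1)  # higher saturation is drawn nearer the top
    return round(x), round(y)
```

Each cover thumbnail would then be pasted at its computed position, so the plot is built from the images themselves rather than from dots.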
our media visualization software on a
287 megapixel display
(image: 1 million manga pages)
our software on new display wall
with thin bezels (data: 4535 Time
magazine covers)
Our methods:
1. media visualization using existing
metadata - show complete collection

2. media visualization using existing
metadata - use samples to better reveal
patterns

3. digital image processing + media
visualization (use simple image features
which have direct perceptual meaning -
and gradually introduce humanists to
image processing)
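The first two methods can be sketched as follows, assuming each item carries a metadata field to sort by. The record layout, function names, and tile sizes are hypothetical and chosen only for illustration.

```python
def montage_positions(n_items, columns, tile_w, tile_h):
    """Top-left pixel coordinates for n_items tiles laid out row by row."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return [((i % columns) * tile_w, (i // columns) * tile_h)
            for i in range(n_items)]


def systematic_sample(items, step):
    """Keep every step-th item, preserving the existing order."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return items[::step]


# Hypothetical records: a file name plus one metadata field (publication year).
records = [
    {"file": "c.jpg", "year": 1950},
    {"file": "a.jpg", "year": 1923},
    {"file": "b.jpg", "year": 1931},
]
ordered = sorted(records, key=lambda r: r["year"])          # method 1: complete collection, in metadata order
positions = montage_positions(len(ordered), columns=2, tile_w=100, tile_h=140)
sampled = systematic_sample(ordered, step=2)                # method 2: a regular sample of the same sequence
```

Sorting before laying out the grid is what lets a montage expose change over time; sampling the sorted sequence keeps that order while making each tile large enough to read.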
1. media visualization / existing metadata:
montage
2. media visualization / existing metadata /
sample
3. digital image processing + media
visualization




Image plots of selected paintings by six impressionist artists.
X-axis = mean saturation. Y-axis = median hue.
Megan O’Rourke, 2012.
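A rough sketch of the Y-axis feature, assuming a painting is given as a list of (r, g, b) pixels with 0-255 channels. The function name and the min_saturation parameter are assumptions for illustration. Hue is an angle, so a plain median treats values near 0 and near 1 (both reds) as far apart; that simplification is accepted here.

```python
import colorsys
import statistics


def median_hue(pixels, min_saturation=0.0):
    """Return the median HSV hue (0.0-1.0) of the pixels, or None if none qualify.

    Pixels whose saturation is not above min_saturation are skipped, because
    the hue of a gray pixel is undefined (colorsys reports it as 0.0).
    """
    hues = []
    for r, g, b in pixels:
        h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if s > min_saturation:
            hues.append(h)
    if not hues:
        return None
    return statistics.median(hues)
```

Both features on this slide have a direct perceptual meaning (how colorful, which color), which is what makes the resulting plot readable without a background in image processing.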
Advantages:

replacing discrete
categories
with continuous attributes
1. from timelines to curves

2. better represent analog
cultural attributes

3. understand cultural landscapes
(fuzzy / overlapping / hard
clusters?)

4. visualize cultural variability

5. discover new groupings
1. from timelines to curves
2. better represent analog attributes
3. our maps of cultural landscapes reveal
fuzzy/overlapping clusters - rather than discrete
categories with hard boundaries
4. visualize cultural variability
5. discover new groupings
Studying large cultural
data challenges our
existing theoretical
concepts and
assumptions

example: what is “style”?
one million manga pages
single short manga series (>1000 pages)
776 Vincent van Gogh paintings
Selected current
projects:
7000 year old stone arrowheads
(with UCSD anthropologist and CS postdoc
at University of Washington)

comparing Art Now & Graphic design Flickr
groups (340,000 images)
(with CS collaborator from Lawrence
Berkeley National Laboratory)

One million images (+ metadata) from
deviantArt (with an art historian / DH
collaborator from Netherlands Academy of
Arts and Sciences)
4.7 million newspaper pages from Library
of Congress (UCSD undergraduate
students)


virtual world / game analytics (NSF EAGER,
with UCSD Experimental Games Lab)


SEASR tools and workflows for working
with image and video data (with NCSA at
University of Illinois, Urbana-Champaign)
Conclusion:

Computational humanities
vs.
digital humanities
“The capacity to collect and analyze massive amounts of data has
transformed such fields as biology and physics. But the emergence of a
data-driven 'computational social science' has been much slower. Leading
journals in economics, sociology, and political science show little evidence
of this field. But computational social science is occurring in Internet
companies such as Google and Yahoo, and in government agencies such as
the U.S. National Security Agency.”

“Computational Social Science.” Science, vol. 323, no. 5915, 6 February 2009.




Digital humanities:
scholars mostly work with digitized historical cultural archives created by
libraries and universities with funding from NEH and other institutions.
Computational
humanities:
Analyzing massive amounts of cultural content and people's
conversations, opinions, and cultural activities online - personal and
professional web sites, general and specialized social media networks and
sites. This data offers us unprecedented opportunities to understand cultural
processes and their dynamics and develop new concepts and models which
can be also used to better understand the past.

Current players in computational humanities:
- Google, Facebook, YouTube, Bluefin Labs, Echo Nest, and many other
companies which analyze social media signals (blogs, Twitter, etc.) and the
content of media on social networks.
- Computer scientists who are working with this data.
manovich@ucsd.edu

www.softwarestudies.com
Appendix:

visualizing video
collections
use media visualization with a set of
keyframes

automatic selection of key frames (for
example, using free shot detection
software)
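The slide does not name the shot detection software. As a stand-in, here is a very simple cut detector, assuming each frame is a flat list of grayscale values (0-255) of equal length. The function names and the threshold value are hypothetical, and real shot detection tools use more robust measures.

```python
def mean_abs_difference(frame_a, frame_b):
    """Average absolute per-pixel difference between two grayscale frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold=30.0):
    """Return the index of the first frame of each detected shot.

    A new shot is assumed to start wherever consecutive frames differ by
    more than the threshold (an arbitrary illustrative value).
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        if mean_abs_difference(frames[i - 1], frames[i]) > threshold:
            keyframes.append(i)
    return keyframes
```

The selected keyframes can then be treated like any other image collection and shown as a montage or an image plot.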

Editor's Notes

  • #5 1 The exponential growth in the number of both non-professional and professional media producers over the last decade has created a fundamentally new cultural situation and a challenge to our normal ways of tracking and studying culture. Hundreds of millions of people are routinely creating and sharing cultural content - blogs, photos, videos, online comments and discussions, and so on. 2 The rapid growth of professional educational and cultural institutions in many newly globalized countries, along with the instant availability of cultural news over the web and the ubiquity of media and design software, has also dramatically increased the number of culture professionals who participate in global cultural production and discussions.
  • #12 In summary, the availability of large digitized collections of humanities data certainly creates the case for humanists to use computational tools. However, the rise of social media and globalization of professional cultures leave us no other choice. But how can we explore patterns and relations in sets of photographs, designs, or video, which may number in hundreds of thousands, millions, or billions? (FB: 7 billion photos uploaded per month.)
  • #16 We are situated inside Calit2 which is working on creating next generation cyberinfrastructure: grid computing, super high resolution displays, optical networks which support real-time uncompressed streaming of 4K cinema and 4K teleconferencing
  • #19 (can instead show my Time covers animation video) Images are unique for big data analytics. Other media types such as music and text take time to process. If we display many of them at once, it does not work, and we have to resort to information visualization. However, with images, a user can see many patterns instantly. At this point, show the true scale of the image in Photoshop.
  • #22 illustration: use of simple perceptually meaningful image features
  • #25 3 Instead of starting with labels (genres, styles, authors), as in supervised machine learning, map cultural landscapes (and their evolution) using content properties. We may or may not find clusters, and we may discover many new groupings we did not think of before.
  • #26 1. from timelines to curves: normally a book or an exhibition divides an artist's work into discrete periods; visualization allows us to study the gradual changes, and it may reveal that there are no discrete categories.
  • #27 2. better represent analog dimensions: film scholars describe motion in a shot using only half a dozen categories; image analysis + visualization allows us to map “amount of visual change” as a continuous value and discover patterns which were hidden by the use of categories.
  • #28 fuzzy overlapping clusters
  • #29 map space of variations
  • #30 discover new groupings