This webinar was prepared for and hosted by the SLA Social Science and Transportation divisions and the Upstate New York chapter. Presented on August 14, 2013.
1. 14 August 2013
Data 101:
A Gentle Introduction
Presented by
Kimberly Silk, MLS,
Data Librarian, Martin Prosperity Institute,
Rotman School of Management, University of Toronto
2. Our Agenda
• Defining data librarianship
• Basic terminology
• Common data sources
• Our challenge: data management, preservation,
discovery and access
• What are “big data”?
• What are data visualizations?
• Sources
• Q & A
3. Defining Data Librarianship
• Data librarianship is a relatively new area of practice,
emerging with the growth of digital media since the
1970s;
• Data librarians are professional library staff engaged in
managing research data as a resource, and supporting
researchers in these activities;
• We support our institutions and researchers in the
areas of data management, metadata management,
and teaching how to use data as a resource;
• Many of us work in the social sciences, but there is
growth in the natural sciences and humanities as well.
4. Basic Terminology
• Data – plural! Think: Squirrels!!
• Microdata – raw data, individual records consisting of rows of
numbers (Excel spreadsheet);
• Statistics – summarized tables and cross-tabulations that have
been formulated from the raw data;
• Aggregate data – statistical summaries organized in a data file
structure (Excel) that permits further analysis;
• PUMF – Public Use Microdata File – raw data that are available for
public use; some data may be filtered and geographies suppressed
to ensure personal privacy;
• Variables – a set of factors, traits or conditions that describes a
unit of analysis; for instance, sex, age, marital status, etc.
• Frequencies – the number of times an observation occurs in the
data;
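To make the terms above concrete, here is a minimal sketch in Python. The records are invented for illustration and do not come from any real survey; the helper names `frequencies` and `crosstab` are hypothetical. It treats a list of individual records as microdata, counts the frequencies of one variable, and builds a simple cross-tabulation of two variables.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (invented values).
# Each key (sex, age, marital_status) is a variable.
microdata = [
    {"sex": "F", "age": 34, "marital_status": "married"},
    {"sex": "M", "age": 29, "marital_status": "single"},
    {"sex": "F", "age": 51, "marital_status": "married"},
    {"sex": "M", "age": 42, "marital_status": "married"},
    {"sex": "F", "age": 23, "marital_status": "single"},
]


def frequencies(records, variable):
    """Count how many times each value of one variable occurs."""
    return Counter(record[variable] for record in records)


def crosstab(records, row_var, col_var):
    """Count records for each (row value, column value) pair."""
    return Counter((record[row_var], record[col_var]) for record in records)


# Frequencies of a single variable.
sex_freq = frequencies(microdata, "sex")

# A cross-tabulation: a statistical summary derived from the raw records.
table = crosstab(microdata, "sex", "marital_status")

for (sex, status), count in sorted(table.items()):
    print(sex, status, count)
```

The raw list is the microdata; the counts produced from it are statistics. Saving those counts in a file structure that permits further analysis would give aggregate data.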
5. Common Data Sources
• Government-collected surveys
– US Census (American FactFinder)
– Bureau of Labor Statistics, Bureau of Economic Analysis
– Statistics Canada
– International sources such as UK Data Archive, Swedish
National Data Service, Australian Data Archive, etc.
– OECD iLibrary
– World DataBank
– Pew Research Center
– Gallup
– Thomson
6. Other International Data Sources
• Some countries do not gather data, have not
been gathering data for very long, or else limit
or filter available data
• For instance, Russia, India, China and other
developing countries may not gather, preserve
or release their data;
• The BRICs (Brazil, Russia, India, China) will
struggle with this issue as their economies
grow.
7. Uncommon Data Sources
• Data can come from everywhere;
• Occasionally, the MPI acquires data from
unusual sources, such as:
– Rolling Stone magazine
– MySpace social media site for bands
– CrunchBase database of technology companies
8. Data Management, Preservation, Discovery & Access
• We’ve conquered print collections,
but data present a new challenge;
• As with all digital files, data assets
need metadata to describe them;
• As with images, a single data set can
mean many things to many people;
• How do we manage these data to
make sure they are discoverable,
accessible, and preserved?
• Traditionally, data files have been
stored on network drives, and shared
or restricted according to the groups
who need to use them;
• Network drives are difficult to search,
can be hard to share and restrict, and
don’t deal with metadata well;
• Web pages with links have been a
common way to distribute data sets;
• We needed new tools – a new kind of
catalogue that is designed for the
specialized needs of data.
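As a rough illustration of why descriptive metadata makes data files easier to find than a bare network drive, the sketch below keeps a small in-memory catalogue and searches it. Everything here is an assumption made for the example: the `DatasetRecord` fields, the `search` helper, the file paths and the sample entries are hypothetical, and this is not how any particular catalogue product works.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical descriptive metadata for one data file."""
    title: str
    description: str
    keywords: list = field(default_factory=list)
    file_path: str = ""  # where the file sits on the shared drive


def search(catalogue, term):
    """Return records whose title, description or keywords mention the term."""
    term = term.lower()
    return [
        record for record in catalogue
        if term in record.title.lower()
        or term in record.description.lower()
        or any(term in keyword.lower() for keyword in record.keywords)
    ]


# Invented sample entries, for illustration only.
catalogue = [
    DatasetRecord(
        title="Band listings",
        description="Listings of bands drawn from a social media site",
        keywords=["music", "bands"],
        file_path="music/bands.csv",
    ),
    DatasetRecord(
        title="Technology companies",
        description="Start-up records from a company database",
        keywords=["technology", "companies"],
        file_path="tech/companies.csv",
    ),
]

hits = search(catalogue, "music")
for record in hits:
    print(record.title, "->", record.file_path)
```

The file name `bands.csv` alone says nothing about music; the keyword in the metadata is what makes the file discoverable.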
9. Data Discovery Platforms
• Nesstar – developed in Norway by Norwegian
Social Science Data Services, used by Statistics
Canada, UK Data Archive, NORC at the
University of Chicago
• ODESI – proprietary system developed and
used by Scholars Portal
• Dataverse – open-source system developed by
the Institute for Quantitative Social Science
(IQSS) at Harvard, used by NBER and ICPSR
10. Dataverse
• We installed an instance
of Dataverse at the
University of Toronto, in
our “cloud”, and I manage
my data collections myself;
• As an open-source
solution, it’s cost-effective
and my colleagues at
Scholars Portal support it
for me and other Ontario
universities.
• The data are associated
with studies; several data
sets can be associated
with a single study;
• The world can see the
metadata for each data
collection, but access to
the data sets themselves
is restricted to those
who contact me to get
permission.
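The arrangement described above (several data sets grouped under one study, public metadata, restricted files) can be sketched as a small data structure. This is an illustrative model only, not the Dataverse API: the `Study` class, its fields, the file names and the e-mail addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    """Hypothetical model: one study, several data sets, restricted access."""
    title: str
    abstract: str
    data_files: list = field(default_factory=list)
    approved_users: set = field(default_factory=set)

    def public_metadata(self):
        """Metadata that anyone may see, without the data themselves."""
        return {
            "title": self.title,
            "abstract": self.abstract,
            "file_count": len(self.data_files),
        }

    def files_for(self, user):
        """Return the data files only if the user has been given permission."""
        if user not in self.approved_users:
            raise PermissionError(f"{user} has not been granted access")
        return list(self.data_files)


# Invented example values.
study = Study(
    title="Example study",
    abstract="A placeholder description of the study.",
    data_files=["wave1.csv", "wave2.csv"],
)

# The data librarian grants access after a researcher gets in touch.
study.approved_users.add("researcher@example.org")

print(study.public_metadata())
print(study.files_for("researcher@example.org"))
```

Keeping the permission check separate from the metadata view is what lets the catalogue stay open to the world while the files stay restricted.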
12. What are Big Data?
• Big Data are data that are too large for the
average database management tool (Access and
Excel, for instance).
• Examples come from meteorology, genomics and
physics. At MPI we wrestle with large GIS data
sets (maps and satellite data), and deal with data
at the terabyte (1 trillion bytes) level.
• Larger data sets deal with petabytes (1
quadrillion bytes) and exabytes (1 quintillion
bytes).
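The sizes quoted above follow the decimal convention (1 terabyte = 1 trillion bytes). A small sketch of the arithmetic, with hypothetical helper names `to_bytes` and `describe`:

```python
# Decimal unit sizes in bytes, matching the figures on the slide.
UNITS = {
    "terabyte": 10**12,  # 1 trillion bytes
    "petabyte": 10**15,  # 1 quadrillion bytes
    "exabyte": 10**18,   # 1 quintillion bytes
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


def describe(n_bytes):
    """Express a byte count in the largest unit that fits."""
    for name, size in sorted(UNITS.items(), key=lambda item: item[1], reverse=True):
        if n_bytes >= size:
            return f"{n_bytes / size:g} {name}(s)"
    return f"{n_bytes} bytes"


print(to_bytes(1, "terabyte"))
print(describe(to_bytes(2500, "terabyte")))
```

Each step up (terabyte to petabyte to exabyte) is a factor of one thousand.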
13. Data Visualizations
• The visual representation of data – literally,
a picture can say a thousand [numbers]
• Edward Tufte is a key pioneer:
http://www.edwardtufte.com/tufte/
• Fantastic examples at Flowing Data:
http://flowingdata.com/
• RSA Animate: http://www.thersa.org/
14. Sources
• International Association for Social Science
Information Services & Technology (IASSIST) -
http://www.iassistdata.org/
• OECD iLibrary - http://www.oecd-ilibrary.org/
• World Bank Data - http://data.worldbank.org/
• UK Data Archive - http://data-archive.ac.uk/
• Nesstar - http://www.nesstar.com/
• Dataverse - http://thedata.org/
15. 17 September 2012
Q & A
(and, Thank You!)
Kimberly Silk, MLS, Data Librarian,
Martin Prosperity Institute, University of Toronto
kimberly.silk@martinprosperity.org