1. Data 101: A Gentle Introduction
Presented by
Kimberly Silk, MLS,
Data Librarian, Martin Prosperity Institute,
Rotman School of Management, University of Toronto
23 October 2013
2. Our Agenda
• Defining data librarianship
• Basic terminology
• Data sources
• Big Data, Open Data
• Data analysis tools
• Our challenge: data management, preservation,
discovery and access
• What are data visualizations?
• Sources
• Q&A
3. Warnings:
• Data librarianship involves LOTS of acronyms
• There is NO MATH in data librarianship
– (well, almost none)
4. Defining Data Librarianship
• Data librarianship is a relatively new area of practice,
emerging with the growth of digital media since the
1970s;
• Data librarians are professional library staff engaged in
managing research data as a resource, and supporting
researchers in these activities;
• We support our institutions and researchers in the
areas of data management, metadata management,
and teaching how to use data as a resource;
• Many of us work in the social sciences, but there is
growth in the natural sciences and humanities as well.
5. Basic Terminology
• Data – plural!
Think: Squirrels!!
• Microdata – raw data, individual records consisting of rows of
numbers (Excel spreadsheet);
• Statistics – summarized tables and cross-tabulations that have
been formulated from the raw data;
• Aggregate data – statistical summaries organized in a data file
structure (Excel) that permits further analysis;
• PUMF – Public Use Microdata File – raw data that are available for
public use; some data may be filtered and geographies suppressed
to ensure personal privacy;
• Variables – a set of factors, traits or conditions that describes a
unit of analysis; for instance, sex, age, marital status, etc.
• Frequencies – the number of times an observation occurs in the
data;
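To make the terms above concrete, here is a minimal sketch in Python (standard library only). The records, variable names and values are invented for illustration and do not come from any real survey. It treats a list of records as microdata, counts frequencies for one variable, and builds a simple cross-tabulation of two variables.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (all values are made up).
microdata = [
    {"sex": "F", "age": 34, "marital_status": "married"},
    {"sex": "M", "age": 51, "marital_status": "single"},
    {"sex": "F", "age": 27, "marital_status": "single"},
    {"sex": "M", "age": 45, "marital_status": "married"},
    {"sex": "F", "age": 62, "marital_status": "married"},
]


def frequencies(records, variable):
    """Count how many times each value of one variable occurs."""
    return Counter(record[variable] for record in records)


def crosstab(records, row_var, col_var):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)


# Frequencies for a single variable.
sex_freq = frequencies(microdata, "sex")

# A cross-tabulation: a statistic summarized from the raw records.
sex_by_marital = crosstab(microdata, "sex", "marital_status")

print(dict(sex_freq))
print(dict(sex_by_marital))
```

The individual records are the microdata; the two counters are the kind of summaries that would be published as statistics or, if stored in a file for further analysis, as aggregate data.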
6. Common Data Sources
• Government-collected surveys
– Statistics Canada – public data and the Data Liberation Initiative
– US Census (American FactFinder)
– Bureau of Labor Statistics, Bureau of Economic Analysis
– Roper (Public Opinion Polls)
– ICPSR (Inter-University Consortium for Political and Social Research)
– International sources such as UK Data Archive, Swedish National Data
Service, Australian Data Archive, etc.
– OECD iLibrary
– World Bank Open Data
– Pew Research Center
– Gallup
– Thomson
7. Other International Data Sources
• Some countries do not gather data, have not
been gathering data for very long, or else limit
or filter available data
• For instance, developing countries may not
gather, preserve or release their data;
• The BRICs (Brazil, Russia, India, China) will
struggle with this issue as their economies
grow.
8. Uncommon Data Sources
• Data can come from everywhere;
• Occasionally, the MPI acquires data from
unusual sources, such as:
– Billboard magazine
– MySpace social media site for bands
– CrunchBase database of technology companies
9. Open Data
• Open data are data that are openly available, free of charge and
copyright, and available in non-proprietary formats
• Can be used and re-used
• Public money (taxes) funds data creation and collection, and
the public therefore owns the data
• Many governments are moving toward open data, but it takes time,
management, and caution
• Issues: privacy, transparency, maintenance
• Examples:
– Toronto
– Vancouver
– U.S.
10. Big Data
• Big Data are data that are too large for the
average database management tool (Access and
Excel, for instance).
• Examples come from meteorology, genomics and
physics. At MPI we wrestle with large GIS data
sets (maps and satellite data), and deal with data
at the terabyte (1 trillion bytes) level.
• Larger data sets deal with petabytes (1
quadrillion bytes) and exabytes (1 quintillion
bytes).
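As a rough illustration of the scales mentioned above, the following Python sketch converts a raw byte count into terabytes, petabytes or exabytes. It uses the decimal definitions given on this slide (1 TB = 1 trillion bytes); note that some operating systems report sizes in binary units instead, so figures may differ slightly. The function names are made up for this example.

```python
# Decimal (SI) unit sizes, matching the definitions on the slide.
UNITS = {
    "TB": 10**12,  # terabyte: 1 trillion bytes
    "PB": 10**15,  # petabyte: 1 quadrillion bytes
    "EB": 10**18,  # exabyte: 1 quintillion bytes
}


def to_unit(num_bytes, unit):
    """Express a byte count in the given unit (TB, PB or EB)."""
    return num_bytes / UNITS[unit]


def describe(num_bytes):
    """Format a byte count using the largest unit that fits."""
    for name in ("EB", "PB", "TB"):
        if num_bytes >= UNITS[name]:
            return f"{num_bytes / UNITS[name]:.1f} {name}"
    return f"{num_bytes} bytes"


print(describe(2_500_000_000_000))  # a hypothetical 2.5 trillion-byte GIS collection
print(describe(3 * 10**15))
```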
11. Data Discovery Platforms
• Nesstar – developed in Norway by Norwegian Social
Science Data Services, used by Statistics Canada, UK
Data Archive, NORC at the University of Chicago
• SDA – developed at Berkeley, used by University of Toronto,
ICPSR
• Equinox – used at Western
• ODESI – proprietary system developed and used by
Scholars Portal
• Dataverse – Open source system developed by the
Institute for Quantitative Social Science (IQSS) at
Harvard, used by NBER and ICPSR
12. Data Analysis Packages
• SPSS – great for beginners, easy to use. Point-and-click
interface; power users will want more.
• SAS – preferred by power users; steep
learning curve.
• Stata – easy to learn, powerful.
• R – open source, free, powerful, no GUI.
• Use what your colleagues are using.
13. Data Management, Preservation, Discovery & Access
• We’ve conquered print collections, but data present a new challenge;
• As with all digital files, metadata is necessary to describe data assets;
• Like images, a single data set can mean many things to many people;
• How do we manage these data to make sure they are discoverable,
accessible, and preserved?
• Traditionally, data files have been stored on network drives, and shared
or restricted according to the groups who need to use them;
• Network drives are difficult to search, can be hard to share and restrict,
and don’t deal with metadata well;
• Web pages with links have been a common way to distribute data sets;
• We needed new tools – a new kind of catalogue that is designed for the
specialized needs of data.
14. Dataverse
• We installed an instance of Dataverse at the University of Toronto, in
our “cloud”, and I manage my data collections myself;
• As an open source solution, it’s cost-effective, and my colleagues at
Scholars Portal support it for me and other Ontario universities;
• The data are associated with studies; several data sets can be
associated with a single study;
• The world can see the metadata for each data collection, but access to
the data sets themselves is restricted to those who contact me to get
permission.
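The arrangement described on this slide (one study, several data sets, public metadata, restricted files) can be sketched as a small data structure. This is only an illustrative model in Python, not Dataverse's actual schema or API; the class names, fields, file names and e-mail address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """One data file attached to a study (hypothetical model)."""
    filename: str
    restricted: bool = True


@dataclass
class Study:
    """A study with descriptive metadata and one or more data sets."""
    title: str
    description: str
    data_sets: list = field(default_factory=list)
    approved_users: set = field(default_factory=set)

    def public_metadata(self):
        """Metadata anyone can see, even without access to the files."""
        return {
            "title": self.title,
            "description": self.description,
            "files": [d.filename for d in self.data_sets],
        }

    def can_download(self, user, data_set):
        """Restricted files are available only to approved users."""
        return (not data_set.restricted) or (user in self.approved_users)


# Example with made-up values.
survey = DataSet("survey_2012.csv")
study = Study(
    title="Example regional survey",
    description="Illustrative study record; not a real collection.",
    data_sets=[survey, DataSet("codebook.pdf", restricted=False)],
)

print(study.public_metadata())
print(study.can_download("researcher@example.org", survey))  # not yet approved

study.approved_users.add("researcher@example.org")  # permission granted on request
print(study.can_download("researcher@example.org", survey))
```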
16. Data Visualizations
• The visual representation of data – literally,
a picture can say a thousand [numbers]
• Edward Tufte is a key pioneer:
http://www.edwardtufte.com/tufte/
• Fantastic examples at Flowing Data:
http://flowingdata.com/
• RSA Animate: http://www.thersa.org/
17. Sources
• International Association for Social Science Information Services &
Technology (IASSIST) - http://www.iassistdata.org/
• OECD iLibrary - http://www.oecd-ilibrary.org/
• World Bank Data - http://data.worldbank.org/
• UK Data Archive - http://data-archive.ac.uk/
• Nesstar - http://www.nesstar.com/
• Dataverse - http://thedata.org/
18. Q&A
(and, Thank You!)
Kimberly Silk, MLS, Data Librarian,
Martin Prosperity Institute, University of Toronto
kimberly.silk@martinprosperity.org