Data 101: A Gentle Introduction


An introduction to Data Librarianship.

Published in: Education, Technology

  1. Data 101: A Gentle Introduction
     Presented by Kimberly Silk, MLS, Data Librarian, Martin Prosperity Institute, Rotman School of Management, University of Toronto
     23 October 2013
  2. Our Agenda
     • Defining data librarianship
     • Basic terminology
     • Data sources
     • Big Data, Open Data
     • Data analysis tools
     • Our challenge: data management, preservation, discovery and access
     • What are data visualizations?
     • Sources
     • Q&A
  3. Warnings:
     • Data librarianship involves LOTS of acronyms
     • There is NO MATH in data librarianship
       – (well, almost none)
  4. Defining Data Librarianship
     • Data librarianship is a relatively new area of practice, emerging with the growth of digital media since the 1970s;
     • Data librarians are professional library staff engaged in managing research data as a resource, and supporting researchers in these activities;
     • We support our institutions and researchers in the areas of data management, metadata management, and teaching how to use data as a resource;
     • Many of us work in the social sciences, but there is growth in the natural sciences and humanities as well.
  5. Basic Terminology
     • Data – plural! Think: Squirrels!!
     • Microdata – raw data, individual records consisting of rows of numbers (Excel spreadsheet);
     • Statistics – summarized tables and cross-tabulations that have been formulated from the raw data;
     • Aggregate data – statistical summaries organized in a data file structure (Excel) that permits further analysis;
     • PUMF – Public Use Microdata File – raw data that is available for public use; some data may be filtered and geographies suppressed to ensure personal privacy;
     • Variables – a set of factors, traits or conditions that describes a unit of analysis; for instance, sex, age, marital status, etc.
     • Frequencies – the number of times an observation occurs in the data.
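To make the terminology above concrete, here is a minimal Python sketch, not taken from the original slides. The five records and the helper names `frequencies` and `crosstab` are invented for illustration; the sketch treats each record as one row of microdata and derives frequencies and a cross-tabulation from it.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (all values are invented).
# Each key (sex, age, marital_status) is a variable.
microdata = [
    {"sex": "F", "age": 34, "marital_status": "married"},
    {"sex": "M", "age": 29, "marital_status": "single"},
    {"sex": "F", "age": 41, "marital_status": "single"},
    {"sex": "M", "age": 52, "marital_status": "married"},
    {"sex": "F", "age": 23, "marital_status": "single"},
]


def frequencies(records, variable):
    """Count how many times each value of one variable occurs."""
    return Counter(record[variable] for record in records)


def crosstab(records, row_var, col_var):
    """Cross-tabulate two variables: a count for each (row, column) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)


sex_freq = frequencies(microdata, "sex")
sex_by_marital = crosstab(microdata, "sex", "marital_status")

print(sex_freq)
print(sex_by_marital)
```

The two printed tables are statistics in the sense used on the slide: summaries computed from the raw records rather than the records themselves.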
  6. Common Data Sources
     • Gov’t-collected surveys
       – Statistics Canada – public data and the Data Liberation Initiative
       – US Census (American FactFinder)
       – Bureau of Labor Statistics, Bureau of Economic Analysis
       – Roper (Public Opinion Polls)
       – ICPSR (Inter-University Consortium for Political and Social Research)
       – International sources such as UK Data Archive, Swedish National Data Service, Australian Data Archive, etc.
       – OECD iLibrary
       – World Bank Open Data
       – Pew Research Center
       – Gallup
       – Thomson
  7. Other International Data Sources
     • Some countries do not gather data, have not been gathering data for very long, or else limit or filter available data;
     • For instance, developing countries may not gather, preserve or release their data;
     • The BRICs (Brazil, Russia, India, China) will struggle with this issue as their economies grow.
  8. Uncommon Data Sources
     • Data can come from everywhere;
     • Occasionally, the MPI acquires data from unusual sources, such as:
       – Billboard magazine
       – MySpace social media site for bands
       – CrunchBase database of technology companies
  9. Open Data
     • Open data are data that are openly available, free of charge and copyright, and available in non-proprietary formats
     • Can be used and re-used
     • Public money (taxes) funds data creation and collection, and therefore the public owns the data
     • Many governments are moving toward open data, but it takes time, management, and caution
     • Issues: privacy, transparency, maintenance
     • Examples:
       – Toronto
       – Vancouver
       – U.S.
  10. Big Data
     • Big Data are data that are too large for the average database management tool (Access and Excel, for instance).
     • Examples come from meteorology, genomics and physics. At MPI we wrestle with large GIS data sets (maps and satellite data), and deal with data at the terabyte (1 trillion bytes) level.
     • Larger data sets deal with petabytes (1 quadrillion bytes) and exabytes (1 quintillion bytes).
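The unit sizes quoted above can be checked with a little arithmetic. This is a small Python sketch that assumes the decimal (SI) definitions given on the slide; the names `UNITS`, `to_bytes` and `describe` are made up for this example.

```python
# Decimal (SI) byte units, matching the figures on the slide.
UNITS = {
    "terabyte": 10**12,  # 1 trillion bytes
    "petabyte": 10**15,  # 1 quadrillion bytes
    "exabyte": 10**18,   # 1 quintillion bytes
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    return amount * UNITS[unit]


def describe(num_bytes):
    """Express a byte count in the largest listed unit that it reaches."""
    for name, size in sorted(UNITS.items(), key=lambda kv: kv[1], reverse=True):
        if num_bytes >= size:
            return f"{num_bytes / size:g} {name}(s)"
    return f"{num_bytes} bytes"


# Each step up the scale is a factor of 1,000.
terabytes_per_petabyte = to_bytes(1, "petabyte") // to_bytes(1, "terabyte")
print(terabytes_per_petabyte)
print(describe(to_bytes(3, "terabyte")))
```

Some tools report sizes in binary units (powers of 1,024) instead, so figures from different sources may not match exactly.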
  11. Data Discovery Platforms
     • Nesstar – developed in Norway by Norwegian Social Science Data Services, used by Statistics Canada, UK Data Archive, NORC at the University of Chicago
     • SDA – developed at Berkeley, used by University of Toronto, ICPSR
     • Equinox – used at Western
     • ODESI – proprietary system developed and used by Scholars Portal
     • Dataverse – open source system developed by the Institute for Quantitative Social Science (IQSS) at Harvard, used by NBER and ICPSR
  12. Data Analysis Packages
     • SPSS – great for beginners, easy to use. Point-and-click interface; power users will want more.
     • SAS – preferred by power users; steep learning curve.
     • Stata – easy to learn, powerful.
     • R – open source, free, powerful, no GUI.
     • Use what your colleagues are using.
  13. Data Management, Preservation, Discovery & Access
     • We’ve conquered print collections, but data present a new challenge;
     • As with all digital files, data assets need metadata to describe them;
     • As with images, a single data set can mean many things to many people;
     • How do we manage these data to make sure they are discoverable, accessible, and preserved?
     • Traditionally, data files have been stored on network drives, and shared or restricted according to the groups who need to use them;
     • Network drives are difficult to search, can be hard to share and restrict, and don’t deal with metadata well;
     • Web pages with links have been a common way to distribute data sets;
     • We needed new tools – a new kind of catalogue that is designed for the specialized needs of data.
  14. Dataverse
     • We installed an instance of Dataverse at the University of Toronto, in our “cloud”, and I manage my data collections myself;
     • As an open source solution, it’s cost-effective, and my colleagues at Scholars Portal support it for me and other Ontario universities;
     • The data are associated with studies; several data sets can be associated with a single study;
     • The world can see the metadata for each data collection, but access to the data sets themselves is restricted to those who contact me to get permission.
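The arrangement described above (one study, several data sets, public metadata, restricted files) can be sketched as a small data structure. This is a toy Python model for illustration only: it is not the Dataverse API, and the class names, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    filename: str


@dataclass
class Study:
    """A study groups several data sets under one metadata record."""
    title: str
    description: str
    data_sets: list = field(default_factory=list)
    approved_users: set = field(default_factory=set)

    def metadata(self):
        """Metadata is visible to anyone, with no permission check."""
        return {
            "title": self.title,
            "description": self.description,
            "files": len(self.data_sets),
        }

    def get_data_sets(self, user):
        """Return the data sets only to users who have been given permission."""
        if user not in self.approved_users:
            raise PermissionError(f"{user} must request access to '{self.title}'")
        return list(self.data_sets)


# Hypothetical example: one study with two data sets and one approved user.
study = Study(
    title="Example Survey 2013",
    description="Invented study used only to illustrate the model.",
    data_sets=[DataSet("survey_wave1.csv"), DataSet("survey_wave2.csv")],
)
study.approved_users.add("approved_researcher")

print(study.metadata())
print(study.get_data_sets("approved_researcher"))
```

Keeping the permission check inside `get_data_sets` means the metadata stays discoverable by everyone while the files stay behind a request step, which is the split the slide describes.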
  15. Data Visualizations
     • The visual representation of data – literally, a picture can say a thousand [numbers]
     • Edward Tufte is a key pioneer:
     • Fantastic examples at Flowing Data:
     • RSA Animate:
  16. Sources
     • International Association for Social Science Information Services & Technology (IASSIST)
     • OECD iLibrary
     • World Bank Data
     • UK Data Archive
     • Nesstar
     • Dataverse
  17. Q&A (and, Thank You!)
     Kimberly Silk, MLS, Data Librarian, Martin Prosperity Institute, University of Toronto