1. Discovering & Dealing with Data
Presented by
Kimberly Silk, MLS, Data Librarian,
Martin Prosperity Institute, University of Toronto
17 September 2012
2. Agenda
• The MPI information environment
• Common data sources & authority
• Data management, discovery and access
• What is Open Data? Big Data?
• Fun with data visualization
• Q&A
3. About the MPI
• The Martin Prosperity Institute is an economic
think-tank; we are part of the Rotman School
within the University of Toronto
• My client group consists of grad students, post-
docs, visiting faculty and researchers who use
social-science data to support their research
• To support their research process, I procure,
curate, preserve and make discoverable data sets.
• The MPI has its own data repository, which has
grown to 4 TB in size.
4. Data Sources
• Common & highly authoritative sources
– StatsCan via the Data Liberation Initiative
– Bureau of Labor Statistics, Bureau of Economic
Analysis, American FactFinder (Census)
– OECD eLibrary
– World Bank
– Int’l sources such as UK Data Archive, Swedish
National Data Service, etc.
– Pew Research Center
– Gallup
5. More data sources
• Less authoritative??
– Chinese Data Center
– Rolling Stone
– MySpace
– CrunchBase
6. Data Challenge: Discovery
• Lots of research data being collected and
added, but no method to manage it, catalogue
it, or make it findable
• Demands from various clients: faculty,
students, researchers, staff, administration
• The shared network drive was no longer
effective
7. Show & Share…
• We want the world to see our data catalogue
• But we don’t want the world to be able to
copy or change what’s in the catalogue, or the
catalogue itself
• We need to manage access to our data: who
are you? Where are you from? Why do you
want the data? What are you going to do with
it? Will you share your results?
8. Data Discovery Platforms
• I reviewed several platforms that would work in
an academic environment:
– Nesstar – developed in Norway by Norwegian Social
Science Data Services, used by StatsCan, UK Data
Archive, NORC at UChicago
– Islandora – Open source system developed at UPEI,
based on Fedora
– ODESI – proprietary system developed and used by
Scholars Portal
– Dataverse – Open source system developed by the
Institute for Quantitative Social Science at Harvard,
used by NBER, and many academic think tanks.
9. Dataverse
• Dataverse was a good choice since we could
install an instance at UToronto, in the UToronto
cloud, and I could manage it myself
• It was free, and my colleagues at Scholars Portal
were interested in installing it – I was the perfect
guinea pig
• Slowly, I am cataloguing my data collection; I
have set up a lending agreement, and it’s working
very well.
• Demo:
http://dataverse.scholarsportal.info/dvn/dv/mpi
10. Open Data
• Open data is the idea that certain data should be
freely available to everyone to use, reuse, and
redistribute without restriction.
• Governments around the world have begun to
“open up” some of their data: US, UK, New
Zealand, Norway, Russia, Australia, Morocco,
Netherlands, Chile, Spain, Uruguay, France, Brazil,
Estonia, Portugal, etc.
• State and municipal governments have
also created open data sites.
11. Open Data Opportunities…
• Governments open up their data to foster
better citizenship and improve transparency
• Open Data can spur grass-roots innovation:
citizens access open data to use in software
programs to solve problems, such as finding a
local daycare, knowing when the next bus will
come, reporting crime on the fly, or watching
congressional proceedings in real time.
12. … and Challenges
• Open Data takes commitment. Successful
implementations have a dedicated team of
people who decide what data to release
according to usefulness and demand
• The data must be anonymized, cleansed and
published in a non-proprietary format
• Organizations must be prepared to listen to
the citizens, be responsive, and trouble-shoot.
• Open data is a public service.
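
The "anonymized, cleansed and in a non-proprietary format" step can be sketched in a few lines of Python. This is only an illustration, not a description of how any government or the MPI prepares data: the column names, the sample records and the helper prepare_for_release are all made up, and dropping identifier columns is a crude stand-in for a real disclosure review.

```python
import csv
import io

# Hypothetical columns treated as direct identifiers. Removing them is
# only a first step; real anonymization needs a proper disclosure review.
DIRECT_IDENTIFIERS = {"name", "email"}

def prepare_for_release(rows):
    """Drop identifier columns, trim whitespace, and return CSV text."""
    cleaned = []
    for row in rows:
        cleaned.append({
            key: value.strip()
            for key, value in row.items()
            if key not in DIRECT_IDENTIFIERS
        })
    fieldnames = list(cleaned[0].keys()) if cleaned else []
    buffer = io.StringIO()
    # CSV is a plain-text, non-proprietary format.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(cleaned)
    return buffer.getvalue()

# Made-up sample records, for illustration only.
sample = [
    {"name": "A. Person", "email": "a@example.org", "ward": " 12 ", "service": "daycare"},
    {"name": "B. Person", "email": "b@example.org", "ward": "7", "service": " transit "},
]
release_csv = prepare_for_release(sample)
```

CSV is used here because any spreadsheet or statistics package can open it without special software.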
13. Big Data
• Big Data is a collection of data sets that is too
large for everyday desktop data tools to handle
(Access and Excel, for instance).
• Examples come from meteorology, genomics and
physics. At MPI we wrestle with large GIS data
sets (maps and satellite data), and deal with data
at the terabyte (1 trillion bytes) level.
• Larger data sets run to petabytes (1
quadrillion bytes) and exabytes (1 quintillion
bytes).
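
To make the scale of these units concrete, here is a small Python sketch using the decimal definitions quoted above. The names UNITS and to_bytes are illustrative only; the 4 TB figure is the MPI repository size mentioned earlier in the talk.

```python
# Decimal (SI) byte units, matching the figures quoted on the slide.
UNITS = {
    "terabyte": 10**12,  # 1 trillion bytes
    "petabyte": 10**15,  # 1 quadrillion bytes
    "exabyte": 10**18,   # 1 quintillion bytes
}

def to_bytes(amount, unit):
    """Convert an amount in the given unit to a byte count."""
    return amount * UNITS[unit]

# The MPI repository size mentioned earlier in the talk (4 TB).
mpi_repository_bytes = to_bytes(4, "terabyte")

# How many repositories of that size would fit into one petabyte.
repositories_per_petabyte = UNITS["petabyte"] // mpi_repository_bytes
```

Each step up (terabyte to petabyte to exabyte) is a factor of one thousand, so a single petabyte would hold 250 repositories the size of the MPI's.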
14. Data Visualizations
• The visual representation of data: literally,
a picture can say a thousand [numbers]
• Edward Tufte is a key pioneer:
http://www.edwardtufte.com/tufte/
• Fantastic examples at Flowing Data:
http://flowingdata.com/
• RSA Animate: http://www.thersa.org/
15. Q&A
(and, Thank You!)
Kimberly Silk, MLS, Data Librarian,
Martin Prosperity Institute, University of Toronto
kimberly.silk@martinprosperity.org
17 September 2012