ICPSR Data Exploration Tools
Part I of a workshop conducted by ICPSR. This deck describes data exploration tools.

Published in Education, Technology
  • This is part I of a 3-part workshop conducted at IASSIST on May 31, 2011.
  • In February 2011, over 59,614 datasets (over 540,000 files) were available for download. To give a sense of download volume: in FY 2010, a total of 612,500 datasets were downloaded or accessed. (Member downloads make up 283,472 of those, or 46%, but 60% of all studies.) Also in FY 2010, about 19,800 MyData accounts were active, meaning they downloaded or accessed something.
  • ICPSR supports students, faculty, researchers, and policymakers.
  • ICPSR size-of-holdings figures were run on 4/20/2011.
  • More results = more confusion, so faceted search helps the visitor narrow his/her search.
    The integrated search queries the full text of all the documentation files, so it indexes the questions, answers, and all the descriptive text at the variable and study level. Consequently, you get a lot more search results. To provide one example, a search on “condoms and Uganda” would previously have returned no results; now it returns 30 results, the first of which is the National Survey of Adolescents, 2005: Uganda. Because our study-level metadata records are broad and concise, they don’t list the individual birth control methods; rather, they simply list the subject term “birth control.” By exposing the full text of the questionnaire, we’ve taken a bit of the guesswork out of finding a query term that matches ICPSR’s preferred terms.
    In addition to the documentation search, our integrated search includes the citations.
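To make the difference between a metadata-only search and a full-text search concrete, here is a minimal Python sketch. It is a toy, not ICPSR's implementation (the production search runs on Solr/Lucene). The field names (`subject_terms`, `documentation`), the questionnaire wording, and the second study are invented for illustration; only the Uganda study title and the subject term “birth control” come from the note above.

```python
# Toy records: field names and documentation text are invented for illustration.
STUDIES = [
    {
        "title": "National Survey of Adolescents, 2005: Uganda",
        "subject_terms": ["birth control"],
        "documentation": "Q41. Have you ever used condoms? 1 Yes 2 No",
    },
    {
        "title": "Hypothetical Household Survey",
        "subject_terms": ["income"],
        "documentation": "Q1. What was your household income last year?",
    },
]


def contains_all(text, terms):
    """Return True if every query term occurs somewhere in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def search_metadata(terms):
    """Search only the concise study-level record (title and subject terms)."""
    return [
        s["title"]
        for s in STUDIES
        if contains_all(s["title"] + " " + " ".join(s["subject_terms"]), terms)
    ]


def search_full_text(terms):
    """Search the study-level record plus the full documentation text."""
    return [
        s["title"]
        for s in STUDIES
        if contains_all(
            " ".join([s["title"], *s["subject_terms"], s["documentation"]]), terms
        )
    ]


query = ["condoms", "Uganda"]
print(search_metadata(query))   # no match: "condoms" is not in the metadata record
print(search_full_text(query))  # matches via the questionnaire text
```

The metadata record only knows the broad subject term, so the specific word is found only once the documentation text is part of what gets searched.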
  • 1) “teen drug use” produces 855 results. That can be overwhelming if the first page of results doesn’t look promising.
    2) Filter Results displays our faceted search, which sorts by subject terms and other common metadata. The numbers in parentheses are the number of studies within that filter.
    3) The results are sorted by relevance by default; I’ve sorted this example by most downloaded. You can sort by other things, like most cited or release date.
    4) Where does this text appear? Clicking here will show you the matching text from the documentation, citations, and metadata records where the matches were found.
    FAQ on Boolean search: we no longer offer the Boolean option. Two-thirds of our users are graduate and undergraduate students who have grown up with Google and with sites (Amazon, for example) that use faceted search. Boolean has gone the way of DOS.
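The faceted filter described in this note can be sketched in a few lines of Python. This is only an illustration of the idea (count results per metadata value, then narrow by the chosen value); the result records and the field names `subject` and `time_period` are invented, and the real site computes its facets in Solr.

```python
from collections import Counter

# Invented search results; each carries the metadata a facet can group on.
RESULTS = [
    {"title": "Study A", "subject": "drug abuse", "time_period": "1990s"},
    {"title": "Study B", "subject": "drug abuse", "time_period": "2000s"},
    {"title": "Study C", "subject": "adolescents", "time_period": "2000s"},
]


def facet_counts(results, field):
    """Count how many results fall under each value of one metadata field."""
    return Counter(r[field] for r in results)


def apply_filter(results, field, value):
    """Narrow the result set to the studies carrying the chosen facet value."""
    return [r for r in results if r[field] == value]


# Display each facet value with its study count in parentheses.
for value, count in facet_counts(RESULTS, "subject").most_common():
    print(f"{value} ({count})")

narrowed = apply_filter(RESULTS, "subject", "drug abuse")
print([r["title"] for r in narrowed])
```

Each click on a filter is just another `apply_filter` call on the already narrowed list, which is why the counts shrink as the visitor drills down.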
  • View details takes you to the full description and citation. Here you’ll find all of the pertinent information on the study: PI names, funders, subject terms, methodology, sample composition, etc.
    Here you can browse the codebook.
    You can review and search the variables here (more on that later, of course).
    This indicates that ICPSR has found 3,654 citations related to this study, and you can find still more for the series (more on that later too).
    And finally, if you are interested in data that is available online (in addition to downloading it later, perhaps), here you’ll find whether that capability exists.
  • RSS: It enables you to grab relevant information from multiple Web sites, stripping it of site-specific graphics and navigation so you can focus on the core content. If you’re not using it now, sign up for a Google account and try adding some RSS feeds to Google Reader. You’ll be pleased with how much easier it makes it to keep up with professional and news sites.
    In terms of the ICPSR series page, you can sign up for an RSS notification so that you’ll know when new waves are released in a series. Similar links appear on the study home page, so that you can use RSS to keep track of an individual study, receiving notification if the study is updated, with details on the changes so you can decide whether or not you need to download a new copy of the dataset.
    Going back to the search results page, I wanted to point out that we have RSS options present there as well. Thus I could set up an RSS feed for a query on a particular subject or investigator, and receive notification if ICPSR released a new study that matched that query.
    One last thing to point out is the export options at the bottom of the right column. Librarians can use this utility to download MARCXML for any query, such as all studies released in the last six months. The comma-delimited export gives you all the search results in a format that can easily be viewed in Excel. This is primarily of use for researchers who are doing an extensive review of available data.
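For anyone who would rather script the feed than use a reader, a minimal sketch of reading an RSS 2.0 feed with Python's standard library follows. The feed text, titles, and example.org links are made up; a real feed would be fetched from the RSS link on the series, study, or search results page, and its exact fields may differ from this assumption.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 document standing in for a series or query feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical series updates</title>
    <item>
      <title>New wave released</title>
      <link>http://example.org/studies/1</link>
    </item>
    <item>
      <title>Study updated: codebook corrected</title>
      <link>http://example.org/studies/2</link>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text):
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


for entry in feed_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```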
  • One easy way to access it is from the main ICPSR Web site page, via the pull-down menu of the main tab, Find & Analyze Data.
  • Okay, that’s one way to access “it.” But just what is it? [Slide contents: click on each link to describe the three places.]
    In the past, one could access related literature via the main Bibliography search and browse page, or at the metadata page for any given study.
    Now the integrated search also brings up results found in the Bibliography.
    There is one caveat that I will discuss when we do a search: you’re only searching the elements of a citation, not the full text of the publication.
  • We’re back at the main page of the searchable citations database. While we’re here, let me point out a couple of things that will explain how and why we developed it. First, if you’re interested in the methodology behind the creation of the database, you can link to it from here.
  • It covers everything from the goals of the project to the details of the collection process.
  • So, what was the reason for developing it and, indeed, for continuing to put resources toward it? It’s covered in a nutshell on the main Bibliography search page.
  • Basically, we make the link from the data to the literature about it. This is a big deal, to have the literature grouped this way, with the dataset as the driver. It’s not a common type of link available for datasets.
    So, even though this is a very resource-intensive undertaking (no common citing practice, no easy way to make links, etc.), we still find it to be a valuable resource for the whole range of our user base.
    [Read first half of the slide]
    So this database facilitates the standard lit search that students, researchers, and others do in the course of their study or work.
    [Read second half of slide]
    So, one can use it to measure study impact, although fairly primitively compared to journal citation databases and engines.
    Later I’ll cover a tool built around teaching students about data using the Bibliography as the prime resource.
  • Starting on the main search page, you’ll see that the main access points from here are via:
    • Searching the elements of the citations (not full text)
    • Browsing by author
    • Browsing by journal title
    We’ll start with a search. You can either type a search term or, if you want to see the whole collection, simply leave the search box empty. Then click on the Search for Citations button.
  • Link to the FAQ if you want a larger list.
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/26701?archive=ICPSR&q=nsduh
    From study page to related lit and series related lit
    From series list
    http://www.icpsr.umich.edu/icpsrweb/ICPSR/series
  • This goes for social science subject instructors as well as information literacy instructors. It helps students focus and gives them valuable tools for searching in the larger article databases. Again, this is because it’s a resource that connects the literature to the data.
  • http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • Exploring Data Through Research Literature
    Designed to teach quantitative research methods to undergraduates in a different way. Integrates the ICPSR Bibliography of Data-related Literature into teaching students how to make their way from ideas to empirical work to literature and back. Suitable for both research methods and other substantive courses requiring empirical research.
    http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/submit.jsp
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/faqs/8707350211996719508


  • 1. ICPSR AT 50: Facilitating Research and Data Sharing
    Part I: Data Exploration
    IASSIST Vancouver, BC
    May 31, 2011
  • 2. Welcome to Vancouver! Our Agenda
    Data Exploration
    A Continuing Quest to Ease your Search
    Social Science Variables Database
    Bibliography of Data-related Literature
    Data Sharing
    2010 US Census Data
    Public Data Collections
    Data Management
    Data Management Plans
    Computing & Data Sharing in Secure Environments
    Managing Restricted Contracts
  • 3. Managing the Clock
    Intro and Data Exploration (9:30-10:30)
    Data Sharing (10:45-11:30)
    Data Management (11:45–12:30)
    Disclaimer: Times are approximate!
  • 4. What is ICPSR? - Then and Now -
    One of the world’s oldest and largest social science data archives, est. 1962
    Data distributed on punch cards, then reel-to-reel tape, now:
    Data available on demand
    Over 7,000 studies with over 65,000 data sets
    Membership organization among 21 universities, now:
    Currently about 700 members world-wide
    Federal funding of public collections
  • 5. What We Do – It’s About Data!
    Seek research data and pertinent documents from researchers (PIs, research agencies, government)
    Process and preserve the data and documents
    Disseminate data
    Provide education, training, & instructional resources
  • 6. Why People Use ICPSR
    Write articles, papers, or theses using real research data
    Conduct secondary research to support findings of current research or to generate new findings
    Use as intro material in grant proposals
    Preserve/disseminate primary research data
    Fulfill data management plan (grant) requirements
    Study or teach quantitative methods
  • 7. Data Exploration
  • 8. The Challenge – Hoards of Data & Metadata
    How does one make sense of:
    7,000 studies
    65,000 datasets
    550,000 files
    Millions of variables
    60,000 bibliographic citations
  • 9. Data Exploration - Integrated Search - Better Search for Better Results
    Data-related biblio
    SSVD the variables
    Docs, subjects, PIs, etc
    Search Results
  • 10. Integrating ICPSR’s Search: “Sponsored by SOLR/Lucene”
    In 2009, an improved search engine
    Later, construction of full-text search
    Faceted search to narrow large result sets
  • 11. Search Terms: teen drug use
  • 12. Reviewing the Study Home Page
  • 13. The Search Continues: Automatic Search Updates
    Receive automatic updates on the study or series
    And updates on your query
  • 14. Data Exploration: The Social Science Variables Database
    Data-related biblio
    SSVD the variables
    Docs, subjects, PIs, etc
    Search Results
  • 15. The Social Science Variables Database (SSVD)
    Sanda Ionescu,
    Documentation Specialist
  • 16. The Social Science Variables Database at ICPSR
    Enables ICPSR users to search variables across datasets
    Assists in:
    Data discovery
    Comparison / harmonization projects
    Data harvesting
    Data analysis
    Question mining for designing new research
  • 17. The Social Science Variables Database at ICPSR
    Tool for teaching
    Research Methods:
    • Concept operationalization
    • Effect of question wording, context, and answer categories on variable distributions
    Substantive classes:
    • Cultural / social changes reflected in different question wordings, or elicited answers (longitudinal or time series data)
  • 18. The Social Science Variables Database at ICPSR
    Officially launched Spring 2009.
    Pre-launch: two to three years’ preparation period
    Gather variable-level documentation; apply/refine selection criteria, quality checks
    Build database to host variable descriptions
    Initial upload: 3,500 files describing data from about 1,300 studies.
  • 19. The Social Science Variables Database at ICPSR
    Variables documented using the Data Documentation Initiative (DDI) specification
    DDI: a standard for documenting social science data, written in XML
    Easy to parse / process
    Allows fine-grained searches
    Flexible display in a variety of formats
    Highly shareable, promotes interoperability
    Ideal archival format (ASCII, not software dependent)
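As a rough illustration of why XML-based variable descriptions are easy to parse and search at a fine grain, the sketch below reads one variable description with Python's standard library. The fragment is a simplified assumption modeled on DDI Codebook element names (`var`, `labl`, `qstn`, `qstnLit`, `catgry`, `catValu`); namespaces and most attributes are omitted, and the variable content is invented rather than taken from an ICPSR file.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment using DDI Codebook-style element names.
SAMPLE_VAR = """
<var name="Q12">
  <labl>Ever used condoms</labl>
  <qstn><qstnLit>Have you ever used condoms?</qstnLit></qstn>
  <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
  <catgry><catValu>2</catValu><labl>No</labl></catgry>
</var>
"""


def describe_variable(xml_text):
    """Pull the name, label, question text, and categories out of one <var>."""
    var = ET.fromstring(xml_text.strip())
    return {
        "name": var.get("name"),
        "label": var.findtext("labl"),            # direct child only
        "question": var.findtext("qstn/qstnLit"),
        "categories": {
            c.findtext("catValu"): c.findtext("labl") for c in var.findall("catgry")
        },
    }


print(describe_variable(SAMPLE_VAR))
```

Because every piece of the description sits in its own element, each one can be indexed, searched, or displayed separately.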
  • 20. The Social Science Variables Database at ICPSR
    DDI variable descriptions
    • Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages
    • Include question text if available in the source documentation
  • 21. The Social Science Variables Database at ICPSR
    Relational database
    • Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)
    • Compatible with both DDI 2 and 3 (input and output)
    • Oracle Text searches used in Beta-testing phase
    • Slow retrieval
    • Limited to 500 results
  • 22. The Social Science Variables Database at ICPSR
    Search: autumn 2009 switched to Solr/Lucene:
    Easy indexing
    Faster searches, unlimited hits
    Facets/Filters imported from Study Descriptions (also DDI compatible)
    Time Period
    Storage: XML files are being indexed and searched directly – no longer uploaded in the database
  • 23. The Social Science Variables Database at ICPSR
    • Current content:
    • 2,602 studies (48 percent of ICPSR holdings with data and setups)
    • 6,493 datasets
    • Approx. 1.7 million variables
    • Continues to grow by including
    • All new releases, if suitable
    • Retrofits as made available by small-scale projects
  • 24. The Social Science Variables Database at ICPSR
    • DDI fields searched:
    • Variable name
    • Variable label
    • Question text sequence
    • Descriptive text
    • Category label
    • Variable notes – not indexed / searched, but they are displayed
  • 25. The Social Science Variables Database at ICPSR
    The Public Search Features:
    • Stemming
    • “Phrase searches”
    • Fielded searches (treated as a default Boolean “and”: Boolean operators “or” and “not” are ignored)
    • Variable label
    • Question text
    • Value labels
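A small sketch may help show what stemming and fielded searching mean in practice. Everything here is an assumption made for illustration: `naive_stem` is a crude suffix stripper standing in for the real stemmer used by the search engine, the two variable records are invented, and the key names simply mirror the three searchable fields listed on the slide.

```python
import re


def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Lowercase the text, split it into words, and stem each word."""
    return {naive_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())}


# Invented variable descriptions; the keys mirror the three searchable fields.
VARIABLES = [
    {"name": "Q12", "variable_label": "Ever used condoms",
     "question_text": "Have you ever used condoms?",
     "value_labels": "Yes No"},
    {"name": "V7", "variable_label": "Marital status",
     "question_text": "Are you currently married?",
     "value_labels": "Married Single"},
]


def fielded_search(term, field):
    """Return names of variables whose chosen field contains the stemmed term."""
    target = naive_stem(term.lower())
    return [v["name"] for v in VARIABLES if target in stems(v[field])]


print(fielded_search("condom", "variable_label"))   # singular query, plural label
print(fielded_search("married", "value_labels"))    # found among the value labels
print(fielded_search("married", "variable_label"))  # same term, different field
```

Stemming lets a singular query reach a plural label, while the field argument keeps a match in the value labels from showing up in a variable-label search.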
  • 26. The Social Science Variables Database at ICPSR
    Projected improvements/additional features:
    Enable selection of multiple filters
    Enable users to toggle on/off stemming
    Enable searching “within” results (adding new query to a result set)
    Show / hide response categories on result page
    Create interface for selecting results and exporting selection in a particular format
    From individual variable display, enable navigation to previous or next variable (to show context)
  • 27. The Social Science Variables Database at ICPSR
    Usage data (source: Google Analytics)
  • 28. Data Exploration: The Bibliography of Data-related Literature
    Data-related biblio
    SSVD the variables
    Docs, subjects, PIs, etc
    Search Results
  • 29. ICPSR Bibliography of Data-related Literature
    Elizabeth Moss
    Assistant Librarian, ICPSR
  • 30. ICPSR Bibliography of Data-related Literature
    What we will cover:
    • What it is and how to access it
    • How and why we developed it
    • Main features
    • How instructors find it useful
    • You are a good source
  • 31. What it is and how to access it
  • 32. What it is and how to access it
    It’s really a searchable database . . .
    containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR
    . . .that can generate study bibliographies
    associating each study with the literature about it
    . . . Now included in the integrated search
    on the ICPSR Web site
  • 33. How and why we developed it
    Brainchild of Richard Rockwell, former ICPSR director
    Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it
    ICPSR membership and federally-funded archives continue to support it
  • 34. How and why we developed it
    What’s in the collection?
    Resources using data in the ICPSR holdings as the primary data source
    Resources using ICPSR data in a comparison with the primary dataset investigated
    Resources "about" an ICPSR dataset or study series.
  • 35. How and why we developed it
  • 36. How and why we developed it
  • 37. How and why we developed it
  • 38. How and why we developed it
    Demonstrate impact of data for funding
  • 39. Main features
  • 40. Main features
    Search features:
    • Searches the full text of the elements of citations, e.g., title, author, journal
    • Boolean “and” is assumed, and phrase searching in quotation marks is supported:
    adolescents and “mental health”
    (this works)
    • No Boolean “or” or “not”:
    Havens or “Havens, Jennifer”
    (this doesn’t work; the “or” becomes “and”)
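The behavior on this slide can be mimicked with a short Python sketch. It assumes, as one plausible reading of the slide, that the operator words are simply dropped and every remaining term and quoted phrase must match; the citation strings are invented, and the real matching is done by the site's search engine, not by substring tests.

```python
import re

OPERATOR_WORDS = {"and", "or", "not"}  # assumption: dropped, never interpreted


def parse_query(query):
    """Split a query into single terms and quoted phrases, all lowercased."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    remainder = re.sub(r'"[^"]+"', " ", query).lower()
    terms = [w for w in remainder.split() if w not in OPERATOR_WORDS]
    return terms, phrases


def matches(citation, query):
    """Implicit AND: every term and every phrase must occur in the citation."""
    text = citation.lower()
    terms, phrases = parse_query(query)
    return all(t in text for t in terms) and all(p in text for p in phrases)


# Invented citation strings for illustration.
full_name = "Havens, Jennifer. A hypothetical article on adolescents and mental health."
initial_only = "Havens, J. Another hypothetical article."

print(matches(full_name, 'adolescents and "mental health"'))   # term AND phrase
print(matches(full_name, 'Havens or "Havens, Jennifer"'))      # both parts present
print(matches(initial_only, 'Havens or "Havens, Jennifer"'))   # a true OR would match
```

The last query shows why the “or” example fails to broaden the search: the record with only an initial is excluded because the phrase is still required.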
  • 41. Main features
    Linking from the search results:
    • To full text for journals
    • Directly via DOI
    • Using OpenURL via Google Scholar and WorldCat
    • To full text of reports and other resources via PDF or HTML links
    • To the detailed, fielded publication record
  • 42. Main features
    Internal and external linking from the detailed citation record:
    • To the related study(s)
    • To other citation records of publications by the same author
    • To other articles in the same journal
    (but outside the search)
    • To full text options
  • 43. Main features
    Exporting citations:
    • From search results: Up to 500 records in RIS format, exports directly to EndNote
    • From individual detailed record: Export the citation in RIS format
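For readers who have not seen RIS, the sketch below shows roughly what such an export looks like. It uses standard RIS tags (`TY`, `AU`, `TI`, `JO`, `PY`, `ER`) for a journal article; the citation itself, the dictionary keys, and the function names are hypothetical, and the actual export may include more fields than this.

```python
MAX_EXPORT = 500  # the cap stated on the slide for exports from search results


def to_ris(citation):
    """Render one journal-article citation as an RIS record."""
    lines = ["TY  - JOUR"]
    lines += [f"AU  - {author}" for author in citation["authors"]]
    lines.append(f"TI  - {citation['title']}")
    lines.append(f"JO  - {citation['journal']}")
    lines.append(f"PY  - {citation['year']}")
    lines.append("ER  - ")  # end-of-record tag
    return "\n".join(lines)


def export_results(citations, limit=MAX_EXPORT):
    """Export at most `limit` records, separated by blank lines."""
    return "\n\n".join(to_ris(c) for c in citations[:limit])


# Invented citation for illustration.
example = {
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "A hypothetical article analyzing archived survey data",
    "journal": "Journal of Examples",
    "year": 2010,
}
print(to_ris(example))
```

Reference managers such as EndNote read these tagged lines directly, which is why a plain-text format is enough for the export.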
  • 44. Main features
    Filtering and sorting features:
    • Filter search results by author, pub type, journal, pub. year
    • Coming soon: pub year range filter (similar to that in study search)
    • Sort search results by relevance, pub date (oldest or newest), title, recency
  • 45. Main features
    Browse from main Bibliography page:
    • By author name (no authority control)
    Juster, F. (2)
    Juster, F. Thomas (22)
    Juster, F.T. (1)
    • By journal title name (authority control)
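The three “Juster” entries show what the absence of authority control means: one person can appear under several spellings. As an illustration only, the sketch below groups the variants from the slide under a rough key of surname plus first initial. The key is a hypothetical heuristic, not something the Bibliography does, and it would wrongly merge different authors who share a surname and initial.

```python
from collections import defaultdict

# Name variants and counts as shown on the slide.
BROWSE_COUNTS = {
    "Juster, F.": 2,
    "Juster, F. Thomas": 22,
    "Juster, F.T.": 1,
}


def author_key(name):
    """Build a rough grouping key: lowercased surname plus first initial."""
    surname, _, given = name.partition(",")
    initial = given.strip()[:1]
    return f"{surname.strip().lower()}, {initial.lower()}"


def merge_variants(counts):
    """Sum citation counts across name variants that share the same key."""
    merged = defaultdict(int)
    for name, n in counts.items():
        merged[author_key(name)] += n
    return dict(merged)


print(merge_variants(BROWSE_COUNTS))
```

Without a step like this, someone browsing by author has to check each variant separately to see all of that author's citations.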
  • 46. Main features
    Link from individual study pages:
    • to the dynamically-generated study bibliography
    • to series collections, when applicable
    Link from series description pages:
    • to series bibliographies from the series page
  • 47. How instructors find it useful
    Senior seminar classes
    • Profs choose dataset and ask students to think of a research question
    • Bibliography allows students to see the wide variety of topics available for a single dataset
  • 48. How instructors find it useful
    Research proposal design
    • Good for finding studies that examine what a student wants to propose
    • Does the data they would want already exist?
    • If so, are there survey questions they could replicate?
    • Authors’ suggestions for future research
  • 49. How instructors find it useful
    Undergraduate introduction
    • Research papers: good starting point for finding literature on a particular topic
    • Finding data: starting with the Bibliography can be more intuitive
  • 50. How instructors find it useful
    From the ICPSR blog:
    “I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”
  • 51. How instructors find it useful
    From the ICPSR blog:
    “As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”
  • 52. How instructors find it useful
    The EDRL – an Online Module
  • 53. You are a good source
    Get credit for your work AND let us know about that of others:
    • Send a citation via the Web form
    • Or send them in an email to bibliography@icpsr.umich.edu
    • If you have a large library, we can take EndNote XML imports, or even RIS-format imports
  • 54. You are a good source
  • 55. You are a good source
    A final request:
    When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data
    Encourage others to do it, too
    Here’s how and why
  • 56. Let’s Take a Break: Return at 10:45