ICPSR Data Exploration Tools

Part I of a workshop conducted by ICPSR. This deck describes data exploration tools.

Published in: Education, Technology

  • This is part I of a 3-part workshop conducted at IASSIST on May 31, 2011.
  • In February 2011, over 59,614 datasets (over 540,000 files) were available for download. To give a sense of download volume: in FY 2010, 612,500 datasets were downloaded or accessed. (Member downloads make up 283,472 of those, or 46%, but 60% of all studies.) Also in FY 2010, about 19,800 MyData accounts were active, meaning they downloaded or accessed something.
  • ICPSR supports students, faculty, researchers, and policymakers.
  • ICPSR size of holdings run on 4/20/2011
  • More results = more confusion, so faceted search helps the visitor narrow his or her search. The integrated search queries the full text of all the documentation files, so it indexes the questions, answers, and all the descriptive text at the variable and study level. Consequently, you get a lot more search results. To provide one example, a search on “condoms and Uganda” would previously have returned no results; now it returns 30 results, the first of which is the National Survey of Adolescents, 2005: Uganda. Because our study-level metadata records are broad and concise, they don’t list the individual birth control methods; rather, they simply list the subject term “birth control.” By exposing the full text of the questionnaire, we’ve taken a bit of the guesswork out of finding a query term that matches ICPSR’s preferred terms. In addition to the documentation search, our integrated search includes the citations.
  • 1) “teen drug use” produces 855 results. That can be overwhelming if the first page of results doesn’t look promising. 2) Filter Results displays our faceted search that sorts by subject terms & other common metadata. The numbers in parentheses are the number of studies within that filter. 3) The results are sorted by relevance by default; I’ve sorted this example by most downloaded. You can sort by other things like most cited or release date. 4) Where does this text appear? Clicking here will show you the matching text from the documentation, citations, and metadata records where the matches were found. FAQ on Boolean search: we no longer offer the Boolean option. Two-thirds of our users are grads/undergrads who have grown up with Google and with sites (Amazon, for example) that use faceted search. Boolean has gone the way of DOS.
  • View details takes you to the full description & citation. Here you’ll find all of the pertinent information on the study: PI names, funders, subject terms, methodology, sample composition, etc. Here you can browse the codebook. You can review and search the variables here – more on that later, of course. This indicates that ICPSR has found 3,654 citations related to this study, and you can find still more for the series – more on that later too. And finally, if you are interested in data that is available online (in addition to downloading it later, perhaps), here you’ll find whether that capability exists.
  • RSS: It enables you to grab relevant information from multiple Web sites, stripping it of site-specific graphics and navigation so you can focus on the core content. If you’re not using it now, sign up for a Google account and try adding some RSS feeds to Google Reader. You’ll be pleased with how much easier it makes it to keep up with professional and news sites. In terms of the ICPSR series page, you can sign up for an RSS notification so that you’ll know when new waves are released in a series. Similar links appear on the study home page, so that you can use RSS to keep track of an individual study, receiving notification if the study is updated, with details on the changes so you can decide whether or not you need to download a new copy of the dataset. Going back to the search results page, I wanted to point out that we have RSS options present there as well. Thus I could set up an RSS feed for a query on a particular subject or investigator, and receive notification if ICPSR released a new study that matched that query. One last thing to point out: the export options at the bottom of the right column. Librarians can use this utility to download MARCXML for any query, such as all studies released in the last six months. The comma-delimited export gives you all the search results in a format that can easily be viewed in Excel. This is primarily of use for researchers who are doing an extensive review of available data.
  • One easy way to access it is from the main ICPSR Web site page, via the pull-down menu of the main tab, Find & Analyze Data.
  • Okay, that’s one way to access “it.” But just what is it? [Slide contents: click on each link to describe the three places.] In the past, one could access related literature via the main Bibliography search and browse page, or at the metadata page for any given study. Now the integrated search also brings up results found in the Bibliography. There is one caveat that I will discuss when we do a search: you’re only searching the elements of a citation, not the full text of the publication.
  • We’re back at the main page of the searchable citations database. While we’re here, let me point out a couple of things that will explain how and why we developed it. First, if you’re interested in the methodology behind the creation of the database, you can link to it from here.
  • It covers everything from the goals of the project to the details of the collection process.
  • So, what was the reason for developing it and, indeed, for continuing to put resources toward it? It’s covered in a nutshell on the main Bibliography search page.
  • Basically, we make the link from the data to the literature about it. This is a big deal, to have the literature grouped this way, with the dataset as the driver. It’s not a common type of link available for datasets. So, even though this is a very resource-intensive undertaking (no common citing practice, no easy way to make links, etc.), we still find it to be a valuable resource for the whole range of our user base. [Read first half of the slide] So this database facilitates the standard lit search that students, researchers, and others do in the course of their study or work. [Read second half of slide] So, one can use it to measure study impact, although fairly primitively compared to journal citation databases and engines. Later I’ll cover a tool built around teaching students about data using the Bibliography as the prime resource.
  • Starting on the main search page, you’ll see that the main access points from here are: searching the elements of the citations (not full text), browsing by author, and browsing by journal title. We’ll start with a search. You can either type a search term or, if you want to see the whole collection, simply leave the search box empty. Then click on the Search for Citations button.
  • Link to the FAQ if you want a larger list.
  • From the study page to related lit and series-related lit. From the series list.
  • This goes for social science subject instructors as well as information literacy instructors. Helps students focus and gives them valuable tools for searching in the larger article databases. Again, this is because it’s a resource that connects the literature to the data.
  • Exploring Data Through Research Literature: designed to teach quantitative research methods to undergraduates in a different way. Integrates the ICPSR bibliography of data-related literature into teaching students how to make their way from ideas to empirical work to literature and back. Suitable for both research methods and other substantive courses requiring empirical research.
  • Transcript

    • 1. ICPSR AT 50: Facilitating Research and Data Sharing
      Part I: Data Exploration
      IASSIST Vancouver, BC
      May 31, 2011
    • 2. Welcome to Vancouver! Our Agenda
      Data Exploration
      A Continuing Quest to Ease your Search
      Social Science Variables Database
      Bibliography of Data-related Literature
      Data Sharing
      2010 US Census Data
      Public Data Collections
      Data Management
      Data Management Plans
      Computing & Data Sharing in Secure Environments
      Managing Restricted Contracts
    • 3. Managing the Clock
      Intro and Data Exploration (9:30-10:30)
      Data Sharing (10:45-11:30)
      Data Management (11:45–12:30)
      Disclaimer: Times are approximate!
    • 4. What is ICPSR? - Then and Now -
      One of the world’s oldest and largest social science data archives, est. 1962
      Data distributed on punch cards, then reel-to-reel tape, now:
      Data available on demand
      Over 7,000 studies with over 65,000 data sets
      Membership organization among 21 universities, now:
      Currently about 700 members world-wide
      Federal funding of public collections
    • 5. What We Do – It’s About Data!
      Seek research data and pertinent documents from researchers (PIs, research agencies, government)
      Process and preserve the data and documents
      Disseminate data
      Provide education, training, & instructional resources
    • 6. Why People Use ICPSR
      Write articles, papers, or theses using real research data
      Conduct secondary research to support findings of current research or to generate new findings
      Use as intro material in grant proposals
      Preserve/disseminate primary research data
      Fulfill data management plan (grant) requirements
      Study or teach quantitative methods
    • 7. Data Exploration
    • 8. The Challenge – Hoards of Data & Metadata
      How does one make sense of:
      7,000 studies
      65,000 datasets
      550,000 files
      Millions of variables
      60,000 bibliographic citations
    • 9. Data Exploration - Integrated Search - Better Search for Better Results
      Data-related biblio
      SSVD the variables
      Docs, subjects, PIs, etc
      Search Results
    • 10. Integrating ICPSR’s Search: “Sponsored by SOLR/Lucene”
      In 2009, an improved search engine
      Later, construction of full-text search
      Faceted search to narrow large result sets
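The idea of narrowing a large result set with facets can be sketched in a few lines. This is a minimal illustration only, assuming a toy in-memory list of study records; the records, field names, and helper functions are hypothetical and do not reflect how ICPSR's Solr/Lucene search is actually built.

```python
from collections import Counter

# Hypothetical study records; titles and field values are illustrative only.
STUDIES = [
    {"title": "Study A", "subject": "drug use", "decade": "1990s"},
    {"title": "Study B", "subject": "drug use", "decade": "2000s"},
    {"title": "Study C", "subject": "birth control", "decade": "2000s"},
    {"title": "Study D", "subject": "drug use", "decade": "2000s"},
]


def facet_counts(results, field):
    """Count how many results fall under each value of one facet field."""
    return Counter(record[field] for record in results)


def apply_filter(results, field, value):
    """Keep only the results whose facet field equals the chosen value."""
    return [record for record in results if record[field] == value]


# The counts shown next to each filter (the numbers in parentheses).
subject_facets = facet_counts(STUDIES, "subject")

# Clicking a filter narrows the results; remaining facets are recounted.
narrowed = apply_filter(STUDIES, "subject", "drug use")
decade_facets = facet_counts(narrowed, "decade")
```

Each click on a filter repeats the same two steps: filter the current results, then recount the remaining facets.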
    • 11. Search Terms: teen drug use
    • 12. Reviewing the Study Home Page
    • 13. The Search Continues: Automatic Search Updates
      Receive automatic updates on the study or series
      And updates on your query
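To make the "automatic updates" idea concrete, here is a small sketch of how a feed reader might pick out new items from an RSS 2.0 feed. The sample feed, its titles, and its example.org links are invented for illustration; this is not ICPSR's feed, and `new_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample; not an actual ICPSR feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical saved-query feed</title>
    <item>
      <title>Example Study, Wave 1</title>
      <link>https://example.org/studies/1</link>
    </item>
    <item>
      <title>Example Study, Wave 2</title>
      <link>https://example.org/studies/2</link>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for feed items not seen before."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("./channel/item"):
        link = item.findtext("link")
        if link not in seen_links:
            items.append((item.findtext("title"), link))
    return items


# Links the reader has already shown to the user.
already_seen = {"https://example.org/studies/1"}
fresh = new_items(SAMPLE_FEED, already_seen)
```

A real reader would fetch the feed on a schedule and remember the links it has already shown.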
    • 14. Data Exploration: The Social Science Variables Database
      Data-related biblio
      SSVD the variables
      Docs, subjects, PIs, etc
      Search Results
    • 15. The Social Science Variables Database (SSVD)
      Sanda Ionescu,
      Documentation Specialist
    • 16. The Social Science Variables Database at ICPSR
      Enables ICPSR users to search variables across datasets
      Assists in:
      Data discovery
      Comparison / harmonization projects
      Data harvesting
      Data analysis
      Question mining for designing new research
    • 17. The Social Science Variables Database at ICPSR
      Tool for teaching
      Research Methods:
      • Concept operationalization
      • Effect of question wording, context, and answer categories on variable distributions
      Substantive classes:
      • Cultural / social changes reflected in different question wordings, or elicited answers (longitudinal or time series data)
    • The Social Science Variables Database at ICPSR
      Officially launched Spring 2009.
      Pre-launch: two to three years’ preparation period
      Gather variable-level documentation; apply/refine selection criteria, quality checks
      Build database to host variable descriptions
      Initial upload: 3,500 files describing data from about 1,300 studies.
    • 19. The Social Science Variables Database at ICPSR
      Variables documented using the Data Documentation Initiative (DDI) specification
      DDI: a standard for documenting social science data, written in XML
      Easy to parse / process
      Allows fine-grained searches
      Flexible display in a variety of formats
      Highly shareable, promotes interoperability
      Ideal archival format (ASCII, not software dependent)
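As a rough illustration of "easy to parse," the sketch below reads one variable description with Python's standard library. The fragment is a simplified, namespace-free approximation of a DDI 2 (Codebook) `var` element; the variable name, label, and question text are invented, and real ICPSR files will carry more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment modeled on DDI Codebook variable markup.
DDI_FRAGMENT = """
<var name="V101">
  <labl>Ever used marijuana</labl>
  <qstn><qstnLit>Have you ever used marijuana?</qstnLit></qstn>
  <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
  <catgry><catValu>2</catValu><labl>No</labl></catgry>
</var>
"""


def describe_variable(xml_text):
    """Pull the name, label, question text, and categories from one variable."""
    var = ET.fromstring(xml_text.strip())
    return {
        "name": var.get("name"),
        "label": var.findtext("labl"),
        "question": var.findtext("qstn/qstnLit"),
        "categories": {
            cat.findtext("catValu"): cat.findtext("labl")
            for cat in var.findall("catgry")
        },
    }


variable = describe_variable(DDI_FRAGMENT)
```

Because each piece of the description sits in its own element, a search can target the label, the question text, or the category labels separately.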
    • 20. The Social Science Variables Database at ICPSR
      DDI variable descriptions
      • Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages
      • Include question text if available in the source documentation
    • The Social Science Variables Database at ICPSR
      Relational database
      • Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)
      • Compatible with both DDI 2 and 3 (input and output)
      • Oracle Text searches used in Beta-testing phase
      • Slow retrieval
      • Limited to 500 results
    • The Social Science Variables Database at ICPSR
      Search: autumn 2009 switched to Solr/Lucene:
      Easy indexing
      Faster searches, unlimited hits
      Facets/Filters imported from Study Descriptions (also DDI compatible)
      Time Period
      Storage: XML files are being indexed and searched directly – no longer uploaded in the database
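For readers unfamiliar with Solr, a faceted query is an HTTP request with a handful of standard parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The sketch below only builds such a URL; the host, the core name `variables`, and the field names `subject` and `time_period` are assumptions for illustration, not ICPSR's actual configuration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, terms, facet_fields, rows=25):
    """Build a Solr select URL that asks for results plus facet counts."""
    params = [
        ("q", terms),        # the user's search terms
        ("rows", rows),      # page size
        ("wt", "json"),      # response format
        ("facet", "true"),   # turn faceting on
    ]
    # One facet.field parameter per facet to be counted.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr core and field names.
url = build_solr_query(
    "http://localhost:8983/solr/variables/select",
    "teen drug use",
    ["subject", "time_period"],
)

# Decode the query string again to inspect what would be sent.
query = parse_qs(urlsplit(url).query)
```

Solr returns the matching documents together with a count per facet value, which is what drives the filter lists on a results page.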
    • 26. The Social Science Variables Database at ICPSR
      • Current content:
      • 2,602 studies (48 percent of ICPSR holdings with data and setups)
      • 6,493 datasets
      • Approx. 1.7 million variables
      • Continues to grow by including
      • All new releases, if suitable
      • Retrofits as made available by small-scale projects
    • The Social Science Variables Database at ICPSR
      • DDI fields searched:
      • Variable name
      • Variable label
      • Question text sequence
      • Descriptive text
      • Category label
      • Variable notes – not indexed / searched, but they are displayed
    • The Social Science Variables Database at ICPSR
      The Public Search Features:
      • Stemming
      • “Phrase searches”
      • Fielded searches (treated as a default Boolean “and”; the Boolean operators “or” and “not” are ignored)
      • Variable label
      • Question text
      • Value labels
    • 44. The Social Science Variables Database at ICPSR
      Projected improvements/additional features:
      Enable selection of multiple filters
      Enable users to toggle on/off stemming
      Enable searching “within” results (adding new query to a result set)
      Show / hide response categories on result page
      Create interface for selecting results and exporting selection in a particular format
      From individual variable display, enable navigation to previous or next variable (to show context)
    • 45. The Social Science Variables Database at ICPSR
      Usage data (source: Google Analytics)
    • 46. Data Exploration: The Bibliography of Data-related Literature
      Data-related biblio
      SSVD the variables
      Docs, subjects, PIs, etc
      Search Results
    • 47. ICPSR Bibliography of Data-related Literature
      Elizabeth Moss
      Assistant Librarian, ICPSR
    • 48. ICPSR Bibliography of Data-related Literature
      What we will cover:
      • What it is and how to access it
      • How and why we developed it
      • Main features
      • How instructors find it useful
      • You are a good source
    • What it is and how to access it
    • 53. What it is and how to access it
      It’s really a searchable database . . .
      containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR
      . . .that can generate study bibliographies
      associating each study with the literature about it
      . . . Now included in the integrated search
      on the ICPSR Web site
    • 54. How and why we developed it
      Brainchild of Richard Rockwell, former ICPSR director
      Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it
      ICPSR membership and federally-funded archives continue to support it
    • 55. How and why we developed it
      What’s in the collection?
      Resources using data in the ICPSR holdings as the primary data source
      Resources using ICPSR data in a comparison with the primary dataset investigated
      Resources "about" an ICPSR dataset or study series.
    • 56. How and why we developed it
    • 57. How and why we developed it
    • 58. How and why we developed it
    • 59. How and why we developed it
      Demonstrate impact of data for funding
    • 60. Main features
    • 61. Main features
      Search features:
      • Searches the full text of the elements of citations, e.g., title, author, journal
      • Boolean “and” is assumed, and phrase searching in quotation marks:
      adolescents and “mental health”
      (this works)
      • No Boolean “or” “not”:
      Havens or “Havens, Jennifer”
      (this doesn’t work; becomes “and”)
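The behavior described on this slide can be mimicked with a short sketch: quoted phrases stay whole, every remaining term is required, and the words "and," "or," and "not" are simply dropped. This is an approximation under those assumptions, using crude substring matching; `parse_query` and `matches` are hypothetical helpers, not the Bibliography's real search code.

```python
import re

# A quoted phrase (straight or curly quotes) or a single bare word.
TOKEN = re.compile(r'["“]([^"”]+)["”]|(\S+)')
OPERATORS = {"and", "or", "not"}


def parse_query(query):
    """Return the list of required terms; operator words are ignored."""
    terms = []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            terms.append(phrase.lower())
        elif word.lower() not in OPERATORS:
            terms.append(word.lower())
    return terms


def matches(citation_text, query):
    """True only if every required term appears in the citation text."""
    text = citation_text.lower()
    return all(term in text for term in parse_query(query))


works = matches("Mental health services for adolescents",
                'adolescents and "mental health"')
# "or" does not broaden the search: both terms are still required.
broadened = matches("Havens, J. (2009). An example title",
                    'Havens or "Havens, Jennifer"')
```

Under these rules the second query behaves like an "and" of both terms, so a citation listing only "Havens, J." is not returned.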
    • 63. Main features
      Linking from the search results:
      • To full text for journals
      • Directly via DOI
      • Using OpenURL via Google Scholar and WorldCat
      • To full text of reports and other resources via PDF or HTML links
      • To the detailed, fielded publication record
    • Main features
      Internal and external linking from the detailed citation record:
      • To the related study(s)
      • To other citation records of publications by the same author
      • To other articles in the same journal
      (but outside the search)
      • To full text options
    • Main features
      Exporting citations:
      • From search results: Up to 500 records in RIS format, exports directly to EndNote
      • From individual detailed record: Export the citation in RIS format
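For those who have not looked inside an RIS export, each record is a series of two-letter tags (`TY`, `AU`, `TI`, `JO`, `PY`) closed by `ER`. The sketch below writes one such record; the citation itself is invented, and `to_ris` is a hypothetical helper covering only a few common tags.

```python
def to_ris(citation):
    """Serialize one citation dict as a minimal RIS record."""
    lines = ["TY  - " + citation["type"]]                  # reference type
    lines += ["AU  - " + name for name in citation["authors"]]
    lines.append("TI  - " + citation["title"])
    lines.append("JO  - " + citation["journal"])
    lines.append("PY  - " + str(citation["year"]))
    lines.append("ER  - ")                                 # end of record
    return "\n".join(lines) + "\n"


# Invented example citation, for illustration only.
example = {
    "type": "JOUR",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "An example analysis of archived survey data",
    "journal": "Journal of Examples",
    "year": 2010,
}

ris_text = to_ris(example)
```

Reference managers such as EndNote read these tags to fill in the corresponding fields on import.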
    • Main features
      Filtering and sorting features:
      • Filter search results by author, pub type, journal, pub. year
      • Coming soon: pub year range filter (similar to that in study search)
      • Sort search results by relevance, pub date (oldest or newest), title, recency
    • Main features
      Browse from main Bibliography page:
      • By author name (no authority control)
      Juster, F. (2)
      Juster, F. Thomas (22)
      Juster, F.T. (1)
      • By journal title name (authority control)
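The Juster example shows what missing authority control means in practice: one author appears under three spellings. A rough sketch of one way to group such variants is to key each name on surname plus first initial. This heuristic is an assumption for illustration, not something the Bibliography does, and it would wrongly merge different authors who share a surname and initial.

```python
from collections import defaultdict

# Browse-list entries and counts as shown on the slide.
BROWSE_COUNTS = {
    "Juster, F.": 2,
    "Juster, F. Thomas": 22,
    "Juster, F.T.": 1,
}


def author_key(name):
    """Reduce 'Surname, Given' to (surname, first initial), lowercased."""
    surname, _, given = name.partition(",")
    return (surname.strip().lower(), given.strip()[:1].lower())


def merge_variants(counts):
    """Sum the citation counts of name variants that share a key."""
    merged = defaultdict(int)
    for name, count in counts.items():
        merged[author_key(name)] += count
    return dict(merged)


merged = merge_variants(BROWSE_COUNTS)
```

Here the three Juster entries collapse into one key with 25 citations, which is why browsing by author may require checking each variant by hand.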
    • Main features
      Link from individual study pages:
      • to the dynamically-generated study bibliography
      • to series collections, when applicable
      Link from series description pages:
      • to series bibliographies from the series page
    • How instructors find it useful
      Senior seminar classes
      • Profs choose dataset and ask students to think of a research question
      • Bibliography allows students to see the wide variety of topics available for a single dataset
    • How instructors find it useful
      Research proposal design
      • Good for finding studies that examine what a student wants to propose
      • Does the data they would want already exist?
      • If so, are there survey questions they could replicate?
      • Authors’ suggestions for future research
    • How instructors find it useful
      Undergraduate introduction
      • Research papers: Good starting point for finding literature on a particular topic
      • Finding data: Starting with the Bibliography can be more intuitive
    • How instructors find it useful
      From the ICPSR blog:
      “I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”
    • 79. How instructors find it useful
      From the ICPSR blog:
      “As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”
    • 80. How instructors find it useful
      The EDRL – an Online Module
    • 81. You are a good source
      Get credit for your work AND let us know about that of others:
      • Send a citation via the Web form
      • Or send them in an email to
      • If you have a large library, we can take EndNote XML imports, or even RIS-format imports
    • You are a good source
    • 84. You are a good source
      A final request:
      When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data
      Encourage others to do it, too
      Here’s how and why
    • 85. Let’s Take a Break: Return at 10:45