ICPSR at 50: Facilitating Research and Data Sharing
Part I: Data Exploration
IASSIST, Vancouver, BC, May 31, 2011
Welcome to Vancouver! Our Agenda
  • Data Exploration
      • A Continuing Quest to Ease Your Search
      • Social Science Variables Database
      • Bibliography of Data-related Literature
  • Data Sharing
      • 2010 US Census Data
      • Public Data Collections
  • Data Management
      • Data Management Plans
      • Computing & Data Sharing in Secure Environments
      • Managing Restricted Contracts
Managing the Clock
  • Intro and Data Exploration (9:30-10:30)
  • Break
  • Data Sharing (10:45-11:30)
  • Break
  • Data Management (11:45-12:30)
  • Escape!
Disclaimer: Times are approximate!
What is ICPSR? Then and Now
  • One of the world’s oldest and largest social science data archives, est. 1962
  • Then: data distributed on punch cards, then reel-to-reel tape. Now: data available on demand
  • Over 7,000 studies with over 65,000 datasets
  • Then: a membership organization among 21 universities. Now: about 700 members worldwide
  • Federal funding of public collections
What We Do: It’s About Data!
  • Seek research data and pertinent documents from researchers (PIs, research agencies, government)
  • Process and preserve the data and documents
  • Disseminate data
  • Provide education, training, and instructional resources
Why People Use ICPSR
  • Write articles, papers, or theses using real research data
  • Conduct secondary research to support findings of current research or to generate new findings
  • Use as intro material in grant proposals
  • Preserve/disseminate primary research data
  • Fulfill data management plan (grant) requirements
  • Study or teach quantitative methods
Data Exploration
The Challenge: Hoards of Data & Metadata
How does one make sense of:
  • 7,000 studies
  • 65,000 datasets
  • 550,000 files
  • Millions of variables
  • 60,000 bibliographic citations
Data Exploration: Integrated Search
Better Search for Better Results
[Diagram labels: Data-related biblio; SSVD, the variables; Docs, subjects, PIs, etc.; Search Results]
Integrating ICPSR’s Search
“Sponsored by SOLR/Lucene”
  • In 2009, an improved search engine
  • Later, construction of full-text search
  • Faceted search to narrow large result sets
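To make "faceted search to narrow large result sets" concrete, here is a minimal sketch in plain Python. It is not how Solr/Lucene or the ICPSR site implements facets; the records, field names, and the helper functions `facet_counts` and `apply_filter` are all hypothetical and exist only to illustrate the idea of counting results per metadata value and then filtering on one of them.

```python
from collections import Counter

# Hypothetical study records; titles and field values are invented for illustration.
STUDIES = [
    {"title": "Study A", "subject": "drug use", "geography": "United States"},
    {"title": "Study B", "subject": "drug use", "geography": "Canada"},
    {"title": "Study C", "subject": "education", "geography": "United States"},
]


def facet_counts(records, field):
    """Count how many records carry each value of one metadata field."""
    return Counter(record[field] for record in records)


def apply_filter(records, field, value):
    """Keep only the records whose field matches the chosen facet value."""
    return [record for record in records if record[field] == value]


# Facet counts are what a results page shows in parentheses next to each filter.
counts = facet_counts(STUDIES, "subject")

# Choosing a facet value narrows the result set; facets can then be recomputed.
narrowed = apply_filter(STUDIES, "subject", "drug use")
narrowed_geography = facet_counts(narrowed, "geography")
```

A real search engine computes these counts inside the index rather than by scanning records, but the visitor-facing behavior is the same: each filter shows how many results it would leave.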
Search Terms: teen drug use
Reviewing the Study Home Page
The Search Continues: Automatic Search Updates
  • Receive automatic updates on the study or series
  • And updates on your query
Data Exploration: The Social Science Variables Database
[Diagram labels: Data-related biblio; SSVD, the variables; Docs, subjects, PIs, etc.; Search Results]
The Social Science Variables Database (SSVD)
Sanda Ionescu, Documentation Specialist
sandai@umich.edu
The Social Science Variables Database at ICPSR
  • Enables ICPSR users to search variables across datasets
  • Assists in:
      • Data discovery
      • Comparison / harmonization projects
      • Data harvesting
      • Data analysis
      • Question mining for designing new research
The Social Science Variables Database at ICPSR
  • Tool for teaching
      • Research methods:
          • Concept operationalization
          • Effect of question wording, context, and answer categories on variable distributions
      • Substantive classes:
          • Cultural / social changes reflected in different question wordings or elicited answers (longitudinal or time series data)
The Social Science Variables Database at ICPSR
  • Officially launched Spring 2009
  • Pre-launch: two to three years’ preparation period
      • Gather variable-level documentation; apply/refine selection criteria, quality checks
      • Build database to host variable descriptions
  • Initial upload: 3,500 files describing data from about 1,300 studies
The Social Science Variables Database at ICPSR
  • Variables documented using the Data Documentation Initiative (DDI) specification
  • DDI: a standard for documenting social science data, written in XML
      • Easy to parse / process
      • Allows fine-grained searches
      • Flexible display in a variety of formats
      • Highly shareable, promotes interoperability
      • Ideal archival format (ASCII, not software dependent)
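As an illustration of the "easy to parse / process" point, the sketch below reads a small variable description with Python's standard library. The XML is a simplified, namespace-free fragment modelled on DDI Codebook variable markup (`var`, `labl`, `qstn`/`qstnLit`, `catgry`/`catValu`); the variable name, label, and question text are invented, and real ICPSR DDI files carry namespaces and many more elements, so treat this as an assumption-laden example rather than a reader for production files.

```python
import xml.etree.ElementTree as ET

# Simplified fragment in the style of DDI Codebook; all content is invented.
DDI_FRAGMENT = """
<dataDscr>
  <var name="V101">
    <labl>Ever used marijuana</labl>
    <qstn><qstnLit>Have you ever used marijuana?</qstnLit></qstn>
    <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
    <catgry><catValu>2</catValu><labl>No</labl></catgry>
  </var>
</dataDscr>
"""


def read_variables(xml_text):
    """Return one dict per variable: name, label, question text, categories."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("var"):
        variables.append({
            "name": var.get("name"),
            # "labl" here matches only the variable's own label (a direct child).
            "label": var.findtext("labl", default=""),
            "question": var.findtext("qstn/qstnLit", default=""),
            # Map each category value to its label.
            "categories": {
                cat.findtext("catValu"): cat.findtext("labl")
                for cat in var.findall("catgry")
            },
        })
    return variables


variables = read_variables(DDI_FRAGMENT)
```

Because each piece of a variable sits in its own element, a search index can treat the label, the question text, and the category labels as separate fields, which is what makes fine-grained searches possible.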
The Social Science Variables Database at ICPSR
  • DDI variable descriptions
      • Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages
      • Include question text if available in the source documentation
The Social Science Variables Database at ICPSR
  • Relational database
      • Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)
      • Compatible with both DDI 2 and 3 (input and output)
      • Oracle Text searches used in beta-testing phase
          • Slow retrieval
          • Limited to 500 results
The Social Science Variables Database at ICPSR
  • Search: switched to Solr/Lucene in autumn 2009
      • Easy indexing
      • Faster searches, unlimited hits
      • Facets/filters imported from study descriptions (also DDI compatible): Series, Study, Time Period, Geography
  • Storage: XML files are now indexed and searched directly, no longer uploaded to the database
The Social Science Variables Database at ICPSR
  • Current content:
      • 2,602 studies (48 percent of ICPSR holdings with data and setups)
      • 6,493 datasets
      • Approx. 1.7 million variables
  • Continues to grow by including:
      • All new releases, if suitable
      • Retrofits as made available by small-scale projects
The Social Science Variables Database at ICPSR
  • DDI fields searched:
      • Variable name
      • Variable label
      • Question text sequence
      • Descriptive text
      • Category label
  • Variable notes: not indexed or searched, but they are displayed
The Social Science Variables Database at ICPSR
  • The public search features:
      • Stemming
      • “Phrase searches”
      • Fielded searches (treated as a default Boolean “and”; the Boolean operators “or” and “not” are ignored)
          • Variable label
          • Question text
          • Value labels
http://www.icpsr.umich.edu/icpsrweb/ICPSR/
The Social Science Variables Database at ICPSR
  • Projected improvements/additional features:
      • Enable selection of multiple filters
      • Enable users to toggle stemming on/off
      • Enable searching “within” results (adding a new query to a result set)
      • Show / hide response categories on result page
      • Create interface for selecting results and exporting selection in a particular format
      • From individual variable display, enable navigation to previous or next variable (to show context)
The Social Science Variables Database at ICPSR
  • Usage data (source: Google Analytics)
Data Exploration: The Bibliography of Data-related Literature
[Diagram labels: Data-related biblio; SSVD, the variables; Docs, subjects, PIs, etc.; Search Results]
ICPSR Bibliography of Data-related Literature
Elizabeth Moss, Assistant Librarian, ICPSR
eammoss@umich.edu
ICPSR Bibliography of Data-related Literature
What we will cover:
  • What it is and how to access it
  • How and why we developed it
  • Main features
  • How instructors find it useful
  • You are a good source
What it is and how to access it
What it is and how to access it
  • It’s really a searchable database . . .
  • containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR . . .
  • that can generate study bibliographies associating each study with the literature about it . . .
  • Now included in the integrated search on the ICPSR Web site
How and why we developed it
  • Brainchild of Richard Rockwell, former ICPSR director
  • Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it
  • ICPSR membership and federally-funded archives continue to support it
How and why we developed it
  • What’s in the collection?
      • Resources using data in the ICPSR holdings as the primary data source
      • Resources using ICPSR data in a comparison with the primary dataset investigated
      • Resources "about" an ICPSR dataset or study series
How and why we developed it

ICPSR Data Exploration Tools

How and why we developed it
http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/methodology.jsp
How and why we developed it
How and why we developed it
  • Demonstrate impact of data for funding
Main features
  • Search features:
      • Searches the full text of the elements of citations, e.g., title, author, journal
      • Boolean “and” is assumed, and phrase searching uses quotation marks:
          • adolescents and “mental health” (this works)
      • No Boolean “or” or “not”:
          • Havens or “Havens, Jennifer” (this doesn’t work; it becomes “and”)
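The search behavior described above (implicit "and", quoted phrases, no "or" or "not") can be sketched as follows. This is an illustrative approximation, not the Bibliography's actual search code: the functions `parse_query` and `matches` are hypothetical, matching is by simple case-insensitive substring, straight double quotes stand in for the slide's curly quotes, and the citation strings in the usage lines are placeholders.

```python
import re

# Words that look like Boolean operators; under an "and is assumed" model
# they add nothing, so this sketch simply drops them.
OPERATOR_WORDS = {"and", "or", "not"}


def parse_query(query):
    """Split a query into quoted phrases and single words, dropping operator words."""
    terms = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(phrase.lower())
        elif word.lower() not in OPERATOR_WORDS:
            terms.append(word.lower())
    return terms


def matches(query, citation_text):
    """True when every term or phrase occurs in the citation text (implicit AND)."""
    text = citation_text.lower()
    return all(term in text for term in parse_query(query))


# The working example from the slide: both the word and the phrase are required.
works = matches('adolescents and "mental health"',
                "Placeholder title on adolescents and mental health services")

# The "or" example: both terms are still required, so a citation that spells
# the author only as "Havens, J." is missed.
missed = matches('Havens or "Havens, Jennifer"', "Havens, J. Placeholder title")
found = matches('Havens or "Havens, Jennifer"', "Havens, Jennifer. Placeholder title")
```

The practical consequence is the one the slide warns about: adding an alternative spelling with "or" narrows the result set instead of widening it.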
Main features
  • Linking from the search results:
      • To full text for journals
          • Using OpenURL via Google Scholar and WorldCat
      • To full text of reports and other resources via PDF or HTML links
      • To the detailed, fielded publication record
Main features
  • Internal and external linking from the detailed citation record:
      • To the related study(s)
      • To other citation records of publications by the same author
      • To other articles in the same journal (but outside the search)
      • To full text options
Main features
  • Exporting citations:
      • From search results: up to 500 records in RIS format, exports directly to EndNote
      • From individual detailed record: export the citation in RIS format
Main features
  • Filtering and sorting features:
      • Filter search results by author, pub type, journal, pub. year
      • Coming soon: pub year range filter (similar to that in study search)
      • Sort search results by relevance, pub date (oldest or newest), title, recency
Main features
  • Browse from main Bibliography page:
      • By author name (no authority control), e.g., Juster, F. (2); Juster, F. Thomas (22); Juster, F.T. (1)
      • By journal title name (authority control)
Main features
  • Link from individual study pages:
      • to the dynamically-generated study bibliography
      • to series collections, when applicable
  • Link from series description pages:
      • to series bibliographies from the series page
How instructors find it useful
  • Senior seminar classes
      • Profs choose dataset and ask students to think of a research question
      • Bibliography allows students to see the wide variety of topics available for a single dataset
How instructors find it useful
  • Research proposal design
      • Good for finding studies that examine what a student wants to propose
      • Does the data they would want already exist?
      • If so, are there survey questions they could replicate?
      • Authors’ suggestions for future research
How instructors find it useful
  • Undergraduate introduction
      • Research papers: good starting point for finding literature on a particular topic
      • Finding data: starting with the Bibliography can be more intuitive
How instructors find it useful
From the ICPSR blog:
“I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”
How instructors find it useful
From the ICPSR blog:
“As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”
How instructors find it useful
The EDRL: an Online Module
You are a good source
  • Get credit for your work AND let us know about that of others:
      • Send a citation via the Web form
      • Or send them in an email to bibliography@icpsr.umich.edu
      • If you have a large library, we can take EndNote XML imports, or even RIS-format imports
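For readers unfamiliar with RIS, the format mentioned for both exports and imports, the sketch below writes one citation as an RIS record. It uses only a handful of common RIS tags (TY, AU, TI, JO, PY, ER) and a placeholder citation; the `to_ris` helper and the dictionary layout are assumptions made for illustration, and the exact tags ICPSR emits or accepts are not stated in the slides.

```python
def to_ris(record):
    """Render one citation dict as an RIS record (tag, two spaces, hyphen, space, value)."""
    lines = ["TY  - " + record["type"]]          # record type, e.g. JOUR for a journal article
    for author in record["authors"]:
        lines.append("AU  - " + author)           # one AU line per author
    lines.append("TI  - " + record["title"])
    lines.append("JO  - " + record["journal"])
    lines.append("PY  - " + str(record["year"]))
    lines.append("ER  - ")                        # end-of-record marker
    return "\n".join(lines)


# Placeholder citation; not a real publication.
example = {
    "type": "JOUR",
    "authors": ["Lastname, Firstname", "Coauthor, Second"],
    "title": "Placeholder article title",
    "journal": "Placeholder Journal",
    "year": 2011,
}

ris_text = to_ris(example)
```

Reference managers such as EndNote can read and write files made of records like this, which is why a batch of citations in this form is a convenient way to send a personal library.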
You are a good source
  • A final request:
      • When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data
      • Encourage others to do it, too
      • Here’s how and why
Let’s Take a Break
Return at 10:45

Editor's Notes

  • #2 This is part I of a 3-part workshop conducted at IASSIST on May 31, 2011.
  • #5 In February 2011, over 59,614 datasets (over 540,000 files) were available for download. To give a sense of the volume of downloads: in FY 2010, a total of 612,500 datasets were downloaded or accessed. (Member downloads make up 283,472 of those, or 46%, but 60% of all studies.) Also in FY 2010, about 19,800 MyData accounts were active, meaning they downloaded or accessed something.
  • #7 ICPSR supports students, faculty, researchers, and policymakers.
  • #9 ICPSR size-of-holdings figures were run on 4/20/2011.
  • #11 More results = more confusion, so faceted search helps the visitor narrow his/her search. The integrated search queries the full text of all the documentation files, so it indexes the questions, answers, and all the descriptive text at the variable and study level. Consequently, you get a lot more search results. To provide one example, a search on “condoms and Uganda” would have previously returned no results; now it returns 30 results, the first of which is the National Survey of Adolescents, 2005: Uganda. Because our study-level metadata records are broad and concise, they don’t list the individual birth control methods; rather, they simply list the subject term “birth control.” By exposing the full text of the questionnaire, we’ve taken a bit of the guesswork out of finding a query term that matches ICPSR’s preferred terms. In addition to the documentation search, our integrated search includes the citations.
  • #12 1) “teen drug use” produces 855 results. That can be overwhelming if the first page of results doesn’t look promising. 2) Filter Results displays our faceted search that sorts by subject terms & other common metadata. The numbers in parentheses are the number of studies within that filter. 3) The results are normally sorted by relevance; I’ve sorted this example by most downloaded. You can sort by other things like most cited or release date. 4) Where does this text appear? Clicking here will show you the matching text from the documentation, citations, and metadata records where the matches were found. FAQ on Boolean search: we no longer offer the Boolean option. Two-thirds of our users are grads/undergrads who have grown up with Google and with sites (Amazon, for example) that use faceted search. Boolean has gone the way of DOS.
  • #13 View details takes you to the full description & citation. Here you’ll find all of the pertinent information on the study: PI names, funders, subject terms, methodology, sample composition, etc. Here you can browse the codebook. You can review and search the variables here (more on that later, of course). This indicates that ICPSR has found 3,654 citations related to this study, and you can find still more for the series (more on that later too). And finally, if you are interested in data that is available online (in addition to downloading it later perhaps), here you’ll find whether that capability exists.
  • #14 RSS: It enables you to grab relevant information from multiple Web sites, stripping it of site-specific graphics and navigation so you can focus on the core content. If you’re not using it now, sign up for a Google account and try adding some RSS feeds to Google Reader. You’ll be pleased with how much easier it makes it to keep up with professional and news sites. In terms of the ICPSR series page, you can sign up for an RSS notification so that you’ll know when new waves are released in a series. Similar links appear on the study home page, so that you can use RSS to keep track of an individual study, receiving notification if the study is updated, with details on the changes so you can decide whether or not you need to download a new copy of the dataset. Going back to the search results page, I wanted to point out that we have RSS options present there as well. Thus I could set up an RSS feed for a query on a particular subject or investigator, and receive notification if ICPSR released a new study that matched that query. One last thing to point out is the set of export options at the bottom of the right column. Librarians can use this utility to download MARCXML for any query, such as all studies released in the last six months. The comma-delimited export gives you all the search results in a format that can easily be viewed in Excel. This is primarily of use for researchers who are doing an extensive review of available data.
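    For anyone who would rather process such a feed with a script than with a feed reader, here is a minimal sketch using Python's standard library. The sample feed is an invented RSS 2.0 document with a placeholder title and an example.org link; the actual structure and URLs of ICPSR's feeds are not given in these notes, so the `feed_items` helper is only an assumption about a typical RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed; the title and link are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example query feed</title>
    <item>
      <title>Placeholder study title</title>
      <link>http://example.org/studies/1</link>
      <pubDate>Tue, 31 May 2011 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text):
    """Return title, link, and publication date for each item in an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]


items = feed_items(SAMPLE_FEED)
```

    Comparing the returned items against a saved list from the previous run is one simple way to spot newly released or updated studies.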
  • #32 One easy way to access it is from the main ICPSR Web site page, via the pull-down menu of the main tab, Find & Analyze Data.
  • #33 Okay, that’s one way to access “it.” But just what is it? [Slide contents: click on each link to describe the three places.] In the past, one could access related literature via the main Bibliography search and browse page, or at the metadata page for any given study. Now the integrated search also brings up results found in the Bibliography. There is one caveat that I will discuss when we do a search: you’re only searching the elements of a citation, not the full text of the publication.
  • #36 We’re back at the main page of the searchable citations database. While we’re here, let me point out a couple of things that will explain how and why we developed it. First, if you’re interested in the methodology behind the creation of the database, you can link to it from here.
  • #37 It covers everything from the goals of the project to the details of the collection process.
  • #38 So, what was the reason for developing it and, indeed, for continuing to put resources toward it? It’s covered in a nutshell on the main Bibliography search page.
  • #39 Basically, we make the link from the data to the literature about it. This is a big deal, to have the literature grouped this way, with the dataset as the driver. It’s not a common type of link available for datasets. So, even though this is a very resource-intensive undertaking (no common citing practice, no easy way to make links, etc.), we still find it to be a valuable resource for the whole range of our user base. [Read first half of the slide] So this database facilitates the standard lit search that students, researchers, and others do in the course of their study or work. [Read second half of slide] So, one can use it to measure study impact, although fairly primitively compared to journal citation databases and engines. I’ll cover later a tool built around teaching students about data using the Bibliography as the prime resource.
  • #40 Starting on the main search page, you’ll see that the main access points from here are: searching the elements of the citations (not full text), browsing by author, and browsing by journal title. We’ll start with a search. You can either type a search term or, if you want to see the whole collection, simply leave the search box empty. Then click on the Search for Citations button.
  • #44 Link to the FAQ if you want a larger list.
  • #47 http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/26701?archive=ICPSR&q=nsduh From study page to related lit and series related lit. From series list: http://www.icpsr.umich.edu/icpsrweb/ICPSR/series
  • #51 This goes for social science subject instructors as well as information literacy instructors. Helps students focus and gives them valuable tools for searching in the larger article databases. Again, this is because it’s a resource that connects the literature to the data.
  • #52 http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • #53 Exploring Data Through Research Literature: designed to teach quantitative research methods to undergraduates in a different way. Integrates the ICPSR Bibliography of Data-related Literature into teaching students how to make their way from ideas to empirical work to literature and back. Suitable for both research methods and other substantive courses requiring empirical research. http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • #55 http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/submit.jsp
  • #56 http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/faqs/8707350211996719508