ICPSR Data Exploration Tools


Part I of a workshop conducted by ICPSR. This deck describes data exploration tools.

Published in: Education, Technology
  • This is part I of a 3-part workshop conducted at IASSIST on May 31, 2011.
  • In February 2011, over 59,614 datasets (over 540,000 files) were available for download. To give a sense of download volume: in FY 2010, 612,500 datasets were downloaded or accessed. (Member downloads make up 283,472 of those, or 46%, but 60% of all studies.) Also in FY 2010, about 19,800 MyData accounts were active, meaning they downloaded or accessed something.
  • ICPSR supports students, faculty, researchers, and policymakers.
  • ICPSR size of holdings run on 4/20/2011
  • More results can mean more confusion, so faceted search helps visitors narrow their search. The integrated search queries the full text of all the documentation files, so it indexes the questions, answers, and all the descriptive text at the variable and study level. Consequently, you get many more search results. For example, a search on “condoms and Uganda” would previously have returned no results; now it returns 30, the first of which is the National Survey of Adolescents, 2005: Uganda. Because our study-level metadata records are broad and concise, they don’t list the individual birth control methods; they simply list the subject term “birth control.” By exposing the full text of the questionnaire, we’ve taken some of the guesswork out of finding a query term that matches ICPSR’s preferred terms. In addition to the documentation search, our integrated search includes the citations.
  • 1) “teen drug use” produces 855 results. That can be overwhelming if the first page of results doesn’t look promising. 2) Filter Results displays our faceted search, which groups results by subject terms and other common metadata. The numbers in parentheses are the number of studies within that filter. 3) By default the results are sorted by relevance; I’ve sorted this example by most downloaded. You can also sort by other things, like most cited or release date. 4) Where does this text appear? Clicking here will show you the matching text from the documentation, citations, and metadata records where the matches were found. FAQ on Boolean search: we no longer offer the Boolean option. Two-thirds of our users are graduate and undergraduate students who have grown up with Google and with sites (Amazon, for example) that use faceted search. Boolean has gone the way of DOS.
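The filter-and-sort behavior described in this note can be sketched in a few lines of Python. This is only an illustration of faceting in general, not ICPSR's Solr configuration: the records, field names, and download counts below are invented.

```python
from collections import Counter

# Hypothetical study records; titles, fields, and numbers are made up.
STUDIES = [
    {"title": "Study A", "subject": "drug abuse", "geography": "United States", "downloads": 120},
    {"title": "Study B", "subject": "drug abuse", "geography": "Uganda", "downloads": 45},
    {"title": "Study C", "subject": "birth control", "geography": "Uganda", "downloads": 300},
]


def facet_counts(records, field):
    """Count how many records carry each value of `field`
    (the numbers shown in parentheses next to each filter)."""
    return Counter(record[field] for record in records)


def apply_filter(records, field, value):
    """Narrow the result set to records whose `field` equals `value`."""
    return [record for record in records if record[field] == value]


def sort_by_downloads(records):
    """Order results from most to least downloaded."""
    return sorted(records, key=lambda record: record["downloads"], reverse=True)


if __name__ == "__main__":
    print(facet_counts(STUDIES, "geography"))
    narrowed = apply_filter(STUDIES, "geography", "Uganda")
    print([record["title"] for record in sort_by_downloads(narrowed)])
```

Each filter click corresponds to one `apply_filter` call on the current result set, after which the facet counts are recomputed for what remains.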
  • View details takes you to the full description & citation. Here you’ll find all of the pertinent information on the study: PI names, funders, subject terms, methodology, sample composition, etc. Here you can browse the codebook. You can review and search the variables here; more on that later, of course. This indicates that ICPSR has found 3,654 citations related to this study, and you can find still more for the series; more on that later too. And finally, if you are interested in data that is available online (perhaps in addition to downloading it later), here you’ll find whether that capability exists.
  • RSS: It enables you to grab relevant information from multiple Web sites, stripping it of site-specific graphics and navigation so you can focus on the core content. If you’re not using it now, sign up for a Google account and try adding some RSS feeds to Google Reader. You’ll be pleased with how much easier it makes keeping up with professional and news sites. On the ICPSR series page, you can sign up for an RSS notification so that you’ll know when new waves are released in a series. Similar links appear on the study home page, so you can use RSS to keep track of an individual study, receiving notification if the study is updated, with details on the changes so you can decide whether you need to download a new copy of the dataset. Going back to the search results page, I wanted to point out that we have RSS options there as well. Thus I could set up an RSS feed for a query on a particular subject or investigator, and receive notification if ICPSR released a new study that matched that query. One last thing to point out is the export options at the bottom of the right column. Librarians can use this utility to download MARCXML for any query, such as all studies released in the last six months. The comma-delimited export gives you all the search results in a format that can easily be viewed in Excel. This is primarily of use for researchers who are doing an extensive review of available data.
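For readers who would rather process a feed in a script than in a feed reader, here is a minimal sketch using Python's standard library. The feed text is a made-up RSS 2.0 sample, not an actual ICPSR feed; a real script would first download the feed from the RSS link shown on the series, study, or search results page.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample standing in for a series or query feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example series updates</title>
    <item>
      <title>New wave released</title>
      <link>http://example.org/studies/1</link>
      <pubDate>Tue, 31 May 2011 09:30:00 GMT</pubDate>
    </item>
    <item>
      <title>Study updated: codebook revised</title>
      <link>http://example.org/studies/2</link>
      <pubDate>Mon, 30 May 2011 14:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return one dict per <item> with its title, link, and publication date."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["published"], "|", entry["title"], "|", entry["link"])
```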
  • One easy way to access it is from the main ICPSR Web site page, via the pull-down menu of the main tab, Find & Analyze Data.
  • Okay, that’s one way to access “it.” But just what is it? [Slide contents: click on each link to describe the three places.] In the past, one could access related literature via the main Bibliography search and browse page, or at the metadata page for any given study. Now the integrated search also brings up results found in the Bibliography. There is one caveat that I will discuss when we do a search: you’re only searching the elements of a citation, not the full text of the publication.
  • We’re back at the main page of the searchable citations database. While we’re here, let me point out a couple of things that will explain how and why we developed it. First, if you’re interested in the methodology behind the creation of the database, you can link to it from here.
  • It covers everything from the goals of the project to the details of the collection process.
  • So, what was the reason for developing it and, indeed, for continuing to put resources toward it? It’s covered in a nutshell on the main Bibliography search page.
  • Basically, we make the link from the data to the literature about it. This is a big deal, to have the literature grouped this way, with the dataset as the driver. It’s not a common type of link available for datasets. So, even though this is a very resource-intensive undertaking (no common citing practice, no easy way to make links, etc.), we still find it to be a valuable resource for the whole range of our user base. [Read first half of the slide] So this database facilitates the standard lit search that students, researchers, and others do in the course of their study or work. [Read second half of slide] So, one can use it to measure study impact, although fairly primitively compared to journal citation databases and engines. Later I’ll cover a tool built around teaching students about data using the Bibliography as the prime resource.
  • Starting on the main search page, you’ll see that the main access points from here are: searching the elements of the citations (not full text), browsing by author, and browsing by journal title. We’ll start with a search. You can either type a search term or, if you want to see the whole collection, simply leave the search box empty. Then click the Search for Citations button.
  • Link to the FAQ if you want a larger list.
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/26701?archive=ICPSR&q=nsduh
    From study page to related lit and series related lit
    From series list
    http://www.icpsr.umich.edu/icpsrweb/ICPSR/series
  • This goes for social science subject instructors as well as information literacy instructors. It helps students focus and gives them valuable tools for searching in the larger article databases. Again, this is because it’s a resource that connects the literature to the data.
  • http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • Exploring Data Through Research Literature: designed to teach quantitative research methods to undergraduates in a different way. It integrates the ICPSR bibliography of data-related literature into teaching students how to make their way from ideas to empirical work to literature and back. Suitable for both research methods courses and other substantive courses requiring empirical research. http://www.icpsr.umich.edu/icpsrweb/EDRL/index.jsp
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/submit.jsp
  • http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/faqs/8707350211996719508

    1. ICPSR at 50: Facilitating Research and Data Sharing
       Part I: Data Exploration
       IASSIST, Vancouver, BC
       May 31, 2011
    2. Welcome to Vancouver! Our Agenda
       • Data Exploration
         • A Continuing Quest to Ease Your Search
         • Social Science Variables Database
         • Bibliography of Data-related Literature
       • Data Sharing
         • 2010 US Census Data
         • Public Data Collections
       • Data Management
         • Data Management Plans
         • Computing & Data Sharing in Secure Environments
         • Managing Restricted Contracts
    3. Managing the Clock
       • Intro and Data Exploration (9:30–10:30)
       • Break
       • Data Sharing (10:45–11:30)
       • Break
       • Data Management (11:45–12:30)
       • Escape!
       Disclaimer: Times are approximate!
    4. What is ICPSR? Then and Now
       • One of the world’s oldest and largest social science data archives, est. 1962
       • Data distributed on punch cards, then reel-to-reel tape; now:
         • Data available on demand
         • Over 7,000 studies with over 65,000 datasets
       • Membership organization among 21 universities; now:
         • Currently about 700 members worldwide
         • Federal funding of public collections
    5. What We Do – It’s About Data!
       • Seek research data and pertinent documents from researchers (PIs, research agencies, government)
       • Process and preserve the data and documents
       • Disseminate data
       • Provide education, training, & instructional resources
    6. Why People Use ICPSR
       • Write articles, papers, or theses using real research data
       • Conduct secondary research to support findings of current research or to generate new findings
       • Use as intro material in grant proposals
       • Preserve/disseminate primary research data
       • Fulfill data management plan (grant) requirements
       • Study or teach quantitative methods
    7. Data Exploration
    8. The Challenge – Hoards of Data & Metadata
       How does one make sense of:
       • 7,000 studies
       • 65,000 datasets
       • 550,000 files
       • Millions of variables
       • 60,000 bibliographic citations
    9. Data Exploration: Integrated Search. Better Search for Better Results
       • Data-related biblio
       • SSVD the variables
       • Docs, subjects, PIs, etc.
       • Search Results
    10. Integrating ICPSR’s Search (“Sponsored by SOLR/Lucene”)
        • In 2009, an improved search engine
        • Later, construction of full-text search
        • Faceted search to narrow large result sets
    11. Search Terms: teen drug use
    12. Reviewing the Study Home Page
    13. The Search Continues: Automatic Search Updates
        • Receive automatic updates on the study or series
        • And updates on your query
    14. Data Exploration: The Social Science Variables Database
        • Data-related biblio
        • SSVD the variables
        • Docs, subjects, PIs, etc.
        • Search Results
    15. The Social Science Variables Database (SSVD)
        Sanda Ionescu, Documentation Specialist
        sandai@umich.edu
    16. The Social Science Variables Database at ICPSR
        Enables ICPSR users to search variables across datasets
        Assists in:
        • Data discovery
        • Comparison / harmonization projects
        • Data harvesting
        • Data analysis
        • Question mining for designing new research
    17. The Social Science Variables Database at ICPSR
        Tool for teaching
        Research Methods:
        • Concept operationalization
        • Effect of question wording, context, and answer categories on variable distributions
        Substantive classes:
        • Cultural / social changes reflected in different question wordings, or elicited answers (longitudinal or time series data)
    18. The Social Science Variables Database at ICPSR
        Officially launched Spring 2009.
        Pre-launch: two to three years’ preparation period
        • Gather variable-level documentation; apply/refine selection criteria, quality checks
        • Build database to host variable descriptions
        Initial upload: 3,500 files describing data from about 1,300 studies.
    19. The Social Science Variables Database at ICPSR
        Variables documented using the Data Documentation Initiative (DDI) specification
        DDI: a standard for documenting social science data, written in XML
        • Easy to parse / process
        • Allows fine-grained searches
        • Flexible display in a variety of formats
        • Highly shareable, promotes interoperability
        • Ideal archival format (ASCII, not software dependent)
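To illustrate the “easy to parse” point, here is a hedged sketch that reads one variable description with Python's standard library. The element names (`var`, `labl`, `qstn`, `qstnLit`, `catgry`, `catValu`) follow the DDI Codebook (DDI 2) vocabulary as I understand it; the variable itself is invented, and real DDI files declare XML namespaces and carry many more elements than this simplified fragment.

```python
import xml.etree.ElementTree as ET

# Simplified, invented DDI-Codebook-style fragment (no namespace declared).
SAMPLE_DDI = """<codeBook>
  <dataDscr>
    <var name="V101">
      <labl>Ever used marijuana</labl>
      <qstn><qstnLit>Have you ever used marijuana?</qstnLit></qstn>
      <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
      <catgry><catValu>2</catValu><labl>No</labl></catgry>
    </var>
  </dataDscr>
</codeBook>"""


def read_variables(ddi_xml):
    """Extract name, label, question text, and categories for each variable."""
    root = ET.fromstring(ddi_xml)
    variables = []
    for var in root.iter("var"):
        variables.append({
            "name": var.get("name"),
            # "labl" here matches only the direct child of <var>,
            # not the labels nested inside each <catgry>.
            "label": var.findtext("labl", default=""),
            "question": var.findtext("qstn/qstnLit", default=""),
            "categories": {
                cat.findtext("catValu", default=""): cat.findtext("labl", default="")
                for cat in var.findall("catgry")
            },
        })
    return variables


if __name__ == "__main__":
    for variable in read_variables(SAMPLE_DDI):
        print(variable)
```

Because each piece of a variable lives in its own element, a search index can treat the label, question text, and category labels as separate fields, which is what makes fine-grained searches possible.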
    20. The Social Science Variables Database at ICPSR
        DDI variable descriptions
        • Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages
        • Include question text if available in the source documentation
    21. The Social Science Variables Database at ICPSR
        Relational database
        • Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)
        • Compatible with both DDI 2 and 3 (input and output)
        • Oracle Text searches used in beta-testing phase
          • Slow retrieval
          • Limited to 500 results
    22. The Social Science Variables Database at ICPSR
        Search: autumn 2009 switched to Solr/Lucene:
        • Easy indexing
        • Faster searches, unlimited hits
        • Facets/Filters imported from Study Descriptions (also DDI compatible)
          • Series
          • Study
          • Time Period
          • Geography
        Storage: XML files are being indexed and searched directly – no longer uploaded in the database
    23. The Social Science Variables Database at ICPSR
        Current content:
        • 2,602 studies (48 percent of ICPSR holdings with data and setups)
        • 6,493 datasets
        • Approx. 1.7 million variables
        Continues to grow by including:
        • All new releases, if suitable
        • Retrofits as made available by small-scale projects
    24. The Social Science Variables Database at ICPSR
        DDI fields searched:
        • Variable name
        • Variable label
        • Question text sequence
        • Descriptive text
        • Category label
        • Variable notes – not indexed / searched, but they are displayed
    25. The Social Science Variables Database at ICPSR
        The Public Search Features:
        • Stemming
        • “Phrase searches”
        • Fielded searches (treated as a default Boolean “and”; Boolean operators “or” and “not” are ignored)
          • Variable label
          • Question text
          • Value labels
        http://www.icpsr.umich.edu/icpsrweb/ICPSR/
    26. The Social Science Variables Database at ICPSR
        Projected improvements/additional features:
        • Enable selection of multiple filters
        • Enable users to toggle on/off stemming
        • Enable searching “within” results (adding new query to a result set)
        • Show / hide response categories on result page
        • Create interface for selecting results and exporting selection in a particular format
        • From individual variable display, enable navigation to previous or next variable (to show context)
    27. The Social Science Variables Database at ICPSR
        Usage data (source: Google Analytics)
    28. Data Exploration: The Bibliography of Data-related Literature
        • Data-related biblio
        • SSVD the variables
        • Docs, subjects, PIs, etc.
        • Search Results
    29. ICPSR Bibliography of Data-related Literature
        Elizabeth Moss, Assistant Librarian, ICPSR
        eammoss@umich.edu
    30. ICPSR Bibliography of Data-related Literature
        What we will cover:
        • What it is and how to access it
        • How and why we developed it
        • Main features
        • How instructors find it useful
        • You are a good source
    31. What it is and how to access it
    32. What it is and how to access it
        It’s really a searchable database . . .
          containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR
        . . . that can generate study bibliographies
          associating each study with the literature about it
        . . . now included in the integrated search on the ICPSR Web site
    33. How and why we developed it
        • Brainchild of Richard Rockwell, former ICPSR director
        • Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it
        • ICPSR membership and federally-funded archives continue to support it
    34. How and why we developed it
        What’s in the collection?
        • Resources using data in the ICPSR holdings as the primary data source
        • Resources using ICPSR data in a comparison with the primary dataset investigated
        • Resources "about" an ICPSR dataset or study series
    35. How and why we developed it
    36. How and why we developed it
        http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/methodology.jsp
    37. How and why we developed it
    38. How and why we developed it
        Demonstrate impact of data for funding
    39. Main features
        http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/index.jsp
    40. Main features
        Search features:
        • Searches the full text of the elements of citations, e.g., title, author, journal
        • Boolean “and” is assumed, and phrase searching uses quotation marks:
          adolescents and “mental health” (this works)
        • No Boolean “or” or “not”:
          Havens or “Havens, Jennifer” (this doesn’t work; “or” becomes “and”)
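The search behavior on this slide can be mimicked with a small sketch. This is not the Bibliography's actual search code: it assumes the operator words “and”, “or”, and “not” are simply dropped (the slide only states that “or” becomes “and”), it uses plain substring matching rather than an index, and it expects straight quotation marks around phrases. The citations are invented.

```python
import re

# Assumption: operator words are discarded, so every remaining term is required.
OPERATORS = {"and", "or", "not"}


def parse_query(query):
    """Split a query into lowercase terms, keeping quoted phrases whole."""
    terms = []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(phrase.lower())
        elif word.lower() not in OPERATORS:
            terms.append(word.lower())
    return terms


def matches(citation_text, query):
    """True only if every term or phrase occurs somewhere in the citation text."""
    text = citation_text.lower()
    return all(term in text for term in parse_query(query))


if __name__ == "__main__":
    first = "Havens, Jennifer. Mental health of adolescents. Example Journal."
    second = "Havens, Robert. Housing costs. Example Journal."
    print(matches(first, 'adolescents and "mental health"'))
    print(matches(second, 'Havens or "Havens, Jennifer"'))
```

The second query shows why “or” does not broaden a search here: both the bare name and the quoted phrase are required, so a citation by a different Havens is not returned.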
    41. Main features
        Linking from the search results:
        • To full text for journals
          • Directly via DOI
          • Using OpenURL via Google Scholar and WorldCat
        • To full text of reports and other resources via PDF or HTML links
        • To the detailed, fielded publication record
    42. Main features
        Internal and external linking from the detailed citation record:
        • To the related study(s)
        • To other citation records of publications by the same author
        • To other articles in the same journal (but outside the search)
        • To full text options
    43. Main features
        Exporting citations:
        • From search results: up to 500 records in RIS format, exports directly to EndNote
        • From individual detailed record: export the citation in RIS format
    44. Main features
        Filtering and sorting features:
        • Filter search results by author, pub type, journal, pub. year
        • Coming soon: pub year range filter (similar to that in study search)
        • Sort search results by relevance, pub date (oldest or newest), title, recency
    45. Main features
        Browse from main Bibliography page:
        • By author name (no authority control)
          Juster, F. (2)
          Juster, F. Thomas (22)
          Juster, F.T. (1)
        • By journal title name (authority control)
    46. Main features
        Link from individual study pages:
        • to the dynamically-generated study bibliography
        • to series collections, when applicable
        Link from series description pages:
        • to series bibliographies from the series page
    47. How instructors find it useful
        Senior seminar classes
        • Profs choose dataset and ask students to think of a research question
        • Bibliography allows students to see the wide variety of topics available for a single dataset
    48. How instructors find it useful
        Research proposal design
        • Good for finding studies that examine what a student wants to propose
        • Does the data they would want already exist?
        • If so, are there survey questions they could replicate?
        • Authors’ suggestions for future research
    49. How instructors find it useful
        Undergraduate introduction
        • Research papers: good starting point for finding literature on a particular topic
        • Finding data: starting with the Bibliography can be more intuitive
    50. How instructors find it useful
        From the ICPSR blog:
        “I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”
    51. How instructors find it useful
        From the ICPSR blog:
        “As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”
    52. How instructors find it useful
        The EDRL – an Online Module
    53. You are a good source
        Get credit for your work AND let us know about that of others:
        • Send a citation via the Web form
        • Or send them in an email to bibliography@icpsr.umich.edu
        • If you have a large library, we can take EndNote XML imports, or even RIS-format imports
    54. You are a good source
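For anyone preparing citations to send in RIS format, the sketch below writes one record as tagged plain-text lines. The tags used (TY, AU, TI, JO, PY, ER) are standard RIS tags; the citation itself is invented, and the dictionary layout is just an assumption made for this example.

```python
def to_ris(citation):
    """Serialize one journal-article citation as an RIS record."""
    lines = ["TY  - JOUR"]  # record type: journal article
    for author in citation["authors"]:
        lines.append(f"AU  - {author}")  # one AU line per author
    lines.append(f"TI  - {citation['title']}")
    lines.append(f"JO  - {citation['journal']}")
    lines.append(f"PY  - {citation['year']}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "authors": ["Doe, Jane", "Roe, Richard"],
        "title": "An invented article analyzing archived survey data",
        "journal": "Example Journal of Social Research",
        "year": 2011,
    }
    print(to_ris(example), end="")
```

Several records can be concatenated into a single file, since each one is closed by its own `ER` line.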
    55. You are a good source
        A final request:
        • When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data
        • Encourage others to do it, too
        • Here’s how and why
    56. Let’s Take a Break
        Return at 10:45