Labou "Data Science and the Library at UC San Diego"

This presentation was provided by Stephanie Labou of the University of California San Diego during part two of the NISO two-part webinar "Building Data Science Skills: Strategic Support for the Work, Part Two," which was held on March 18, 2020.


DATA SCIENCE
AND THE LIBRARY
AT UC SAN DIEGO
STEPHANIE LABOU
DATA SCIENCE LIBRARIAN
MARCH 18, 2020
NISO WEBINAR

SOME CONTEXT
• University of California San Diego
• R1 university
• ~39,000 students
• Data Science Librarian = Data Librarian +
• I don’t have a library degree or any previous library experience
• But I do have lots of experience

TODAY’S TOPIC
“This roundtable discussion will focus on the ongoing need for information professionals to be well-versed in data science skills in order to successfully support the work of students, scholars and other professionals.”
“…additional tools or support are needed for information professionals as they extract, wrangle, analyze and present data?”

LET’S TALK SEMANTICS
• Data science = artificial intelligence (AI), deep learning, machine learning (ML), neural networks, high performance computing (HPC)
• Data science = data cleaning and manipulation, using code to automate data tasks, data at “big enough” scale

MY ROLE
• Questions about:
• 1) Looking for specific data about X
• What does data science – and other domains leveraging data science methodologies – need? Data!
• 2) Have data – now what?
• Makes up the vast majority of support

THE COMMON THREAD: COMPUTING
• Questions about using data in compute-heavy ways
• Reading in and formatting data in R/Python
• Working with non-traditional data formats
• API access, web scraping
• Access to additional resources for large (TB) datasets
• Data & GIS Lab
• Using other platforms related to coding, like GitHub, Jupyter
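
As an illustration of the "reading in and formatting data" and "non-traditional data formats" questions above, here is a minimal Python sketch that flattens nested JSON records into CSV rows. The sample records, field names, and the `flatten` and `to_csv` helpers are all hypothetical, invented for this example; they do not come from the presentation.

```python
import csv
import io
import json

# Hypothetical sample: nested JSON records, similar to what an API might return.
RAW = """
[
  {"id": 1, "title": " Kelp survey ", "creator": {"name": "A. Researcher"}, "years": [2018, 2019]},
  {"id": 2, "title": "Tide gauge data", "creator": {"name": "B. Analyst"}, "years": []}
]
"""

FIELDS = ["id", "title", "creator", "years"]


def flatten(record):
    """Turn one nested record into a flat dict suitable for a CSV row."""
    return {
        "id": record["id"],
        "title": record["title"].strip(),                 # trim stray whitespace
        "creator": record.get("creator", {}).get("name", ""),
        "years": ";".join(str(y) for y in record.get("years", [])),
    }


def to_csv(records):
    """Write the flattened records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(flatten(r) for r in records)
    return buffer.getvalue()


rows = json.loads(RAW)
print(to_csv(rows))
```

The same idea carries over to R or to a library such as pandas; the standard library is used here only to keep the sketch self-contained.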

WHAT SKILLS DO I NEED FOR THIS?
• Data life cycle 101 (find, manage, analyze, preserve, etc.)
• For data science support, need knowledge of at least one programming language
• Concepts transfer between languages
• My path: self-taught!
• Cons: this is the long and rocky path
• Pros: forced early on to develop excellent problem-solving skills

DO WE ALL NEED TO LEARN “DATA SCIENCE”?
• In my opinion: no (but it depends)
• What are the support needs?
• Knowing “enough” goes a long way
• A handful of functions for a subset of topics (mostly data cleaning and manipulation in an automated platform) goes a long way
• More important to know where to find help, think through how to approach a problem
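
To make "a handful of functions" concrete, the sketch below shows three small cleaning helpers of the kind that cover many routine requests. The function names and the list of accepted date formats are assumptions chosen for illustration, not a toolkit described in the talk.

```python
import re
from datetime import datetime


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def to_iso_date(text, formats=("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")):
    """Try a few assumed date formats; return YYYY-MM-DD, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def dedupe(values):
    """Drop case-insensitive duplicates while keeping the first spelling and the order."""
    seen = set()
    result = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result
```

For example, `to_iso_date("03/18/2020")` and `to_iso_date("2020-03-18")` both yield the same ISO string, and an unparseable value returns `None` so it can be flagged for manual review instead of being silently guessed.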

SO WHAT SHOULD WE DO?
• Skilling up existing employees
• Library Carpentry, etc.
• “Know just enough to be dangerous”
• Hiring non-library for new/adapted roles
• Aka, my experience
• In-the-field skillset is valuable; higher level of support
• Outsourcing – collaborating with other groups on campus
• IT, other computing groups

EXAMPLE PROJECTS
• Things we’ve done
• Python scripts to automate parts of metadata ingest into system
• OpenRefine for metadata cleaning
• What we’d like to do
• Automate scraping DataCite
• Perhaps APIs?
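
On the "perhaps APIs?" idea: DataCite offers a public REST API, so one possible approach is to query it directly instead of scraping pages. The sketch below is an assumption about how that might look, not the library's actual workflow. The helper names are hypothetical, and the response fields used (`data`, `attributes`, `doi`, `titles`, `publicationYear`) should be checked against the current DataCite documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public DataCite REST API endpoint for DOI records.
API_ROOT = "https://api.datacite.org/dois"


def build_query_url(query, page_size=25):
    """Build a search URL; 'query' and 'page[size]' are the parameters assumed here."""
    params = {"query": query, "page[size]": page_size}
    return f"{API_ROOT}?{urlencode(params)}"


def summarize(payload):
    """Pull DOI, first title, and year out of a response shaped like DataCite's JSON."""
    rows = []
    for item in payload.get("data", []):
        attrs = item.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        rows.append({
            "doi": attrs.get("doi"),
            "title": titles[0].get("title"),
            "year": attrs.get("publicationYear"),
        })
    return rows


def fetch(query, page_size=25):
    """Send the request (needs network access) and return summarized rows."""
    with urlopen(build_query_url(query, page_size), timeout=30) as response:
        return summarize(json.load(response))
```

Keeping URL construction and response parsing separate from the network call means both can be checked offline with a saved sample response.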

GUIDING PRINCIPLES
• Look for problems where data science methodologies could be the solution
• Could this manual process be automated? (coding)
• Could we better assess our metrics? (analytics)
• Could we better display this info for findability? (visualization)
• Not “fancy solution in search of a problem”
• Data science for the sake of data science is just more work for everyone

OTHER POPULAR TOPICS
• Collections as data
• Making existing collections more accessible for data science topics
• Text mining, natural language processing, etc.
• Data as collections
• Once again: What does data science need? Data!
• Data collections/guides as high value, high use

LESSONS LEARNED
• Adaptability/flexibility
• Software changes but best practices remain (and get better)
• This is a natural fit for the library!
• Building infrastructure today that will handle tomorrow’s needs
• Collaboration is key
• Within-library and campus partners

Contact me:
slabou@ucsd.edu
THANK YOU!
QUESTIONS?

