SlideShare a Scribd company logo
1 of 17
Data Discovery and Reuse:
AI Solutions & the Human
Factor
F E B R U A R Y 2 4 , 2 0 2 0
Huajin Wang, Ph.D.
Liaison Librarian, Biology and Computer Science
Program Director, Open Science & Data Collaborations
huajinw@cmu.edu
The AI pipeline
starts from data
“Garbage in, garbage
out”: What does it
mean?
3
AI is only as good as it’s data
• Poor data quality (accuracy, completeness, bias, consistency,
validity, uniqueness, ….)
 Bias or wrong prediction
• Poor metadata or documentation
 Data data (re)usability
• Not enough data
 Model performance; overfitting
4
Key to successful AI systems:
accessible high quality labeled data
Road segmentation for self-driving cars
Adapted from Evgeny Toropov. AIDR ’19.
AI
Road Road
High quality data are hard to find
5
Lab website / Server
High quality data are hard to find
Technical
solutions
• Search engines to
find existing data
• Automation in
evaluating data
quality and
integrating datasets
• Automation in data
curation
• Model transfer and
data augmentation
7
https://events.library.cmu.edu/aidr2019/
Supported by NSF public access initiative
Save the Date:
May 11, 2020
https://events.library.cmu.edu/aidr2020/
8 Evgeny Toropov. AIDR ’19.
Road segmentation for self-driving cars
Technical use case 1: Domain adaptation to
transfer model to new data
9
Technical use case 2:
Finding existing datasets
* https://www.bostonwebdesigners.net/news/structured-data-and-local-seo/
• Datasets are distributed and hard to find
• Structured data
• Web (1%*)
• Repositories
• Unstructured data
• Web (99%)
• Publications
• Images
• Overall discovery layer missing
10
• Search engine powered by AI
• Simple keyword search for datasets across the web
• Searches over embedded metadata
• Searches over metadata from data providers
• schema.org data standards (embedded in html)
• Dataset name, description, provider, temporal coverage, …
Technical use case 2:
Finding existing datasets (structured)
11
Unstructured data: need
metadata tagging and
data linking first
• Keyphrase extraction
from scholarly
documents
Cornelia Caragea and C. Lee Giles. AIDR ’19.
Technical use case 2:
Finding existing datasets (unstructured)
12 https://www.deviantart.com/stantti/art/MARVIN-35569632
Why aren’t the data
already labeled? …
• Responsible data collection and
documentation
• Data management best practices
• Metadata & data standards
Non-technical solutions (the human factor)
Data stewardship Open science and data sharing
• Share data, code, workflow
• Follow FAIR principles
• Build easy to use and robust
tools for data sharing and reuse
• Interdisciplinary collaborations
13
14
Work together to build a healthy data ecosystem
15
Open Science & Data Collaborations Program
@Carnegie Mellon
Support open practices across the entire research life cycle – from project planning, DMPs,
preregistration, through operational data management and documenting reproducible
workflows and methods, to publicly sharing research products including data and code in a way
that is discoverable and reusable (FAIR+).
16 https://events.library.cmu.edu/aidr2020/
S A V E T H E D A T E:
May 11, 2020
Thank you!
huajinw@cmu.edu
@HuajinBioLib
Photo: "Mobot” (MObile roBOTs) competition at CMU’s spring carnival.

More Related Content

What's hot

Practical applications for altmetrics in a changing metrics landscape
Practical applications for altmetrics in a changing metrics landscapePractical applications for altmetrics in a changing metrics landscape
Practical applications for altmetrics in a changing metrics landscapeDigital Science
 
NIH BD2K DataMed metadata model - Force11, 2016
NIH BD2K DataMed metadata model - Force11, 2016NIH BD2K DataMed metadata model - Force11, 2016
NIH BD2K DataMed metadata model - Force11, 2016Susanna-Assunta Sansone
 
Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsFaisal Razzak
 
John morrissey c3 dis fair working data.pptx
John morrissey c3 dis fair working data.pptxJohn morrissey c3 dis fair working data.pptx
John morrissey c3 dis fair working data.pptxARDC
 
Sue cook c3 dis dm-ps 1.pptx
Sue cook c3 dis dm-ps 1.pptxSue cook c3 dis dm-ps 1.pptx
Sue cook c3 dis dm-ps 1.pptxARDC
 
Natasha intro to rdm c3 dis may 2018.pptx
Natasha intro to rdm c3 dis may 2018.pptxNatasha intro to rdm c3 dis may 2018.pptx
Natasha intro to rdm c3 dis may 2018.pptxARDC
 
RDAP 15: “This is just for me”: Researchers on their data documentation pract...
RDAP 15: “This is just for me”: Researchers on their data documentation pract...RDAP 15: “This is just for me”: Researchers on their data documentation pract...
RDAP 15: “This is just for me”: Researchers on their data documentation pract...ASIS&T
 
RDAP13 Elizabeth Moss: The impact of data reuse
RDAP13 Elizabeth Moss: The impact of data reuseRDAP13 Elizabeth Moss: The impact of data reuse
RDAP13 Elizabeth Moss: The impact of data reuseASIS&T
 
Tools for improving data publication and use
Tools for improving data publication and useTools for improving data publication and use
Tools for improving data publication and usegodanSec
 
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...ASIS&T
 
NIH BD2K bioCADDIE DataMed: Data Discovery Index
NIH BD2K bioCADDIE DataMed: Data Discovery IndexNIH BD2K bioCADDIE DataMed: Data Discovery Index
NIH BD2K bioCADDIE DataMed: Data Discovery IndexSusanna-Assunta Sansone
 
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...ASIS&T
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management EcosystemJohn Kunze
 
Data Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersData Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersRebekah Cummings
 
Mapping a Privacy Framework to a Reference Model of Learning Analytics
Mapping a Privacy Framework to  a Reference Model of Learning AnalyticsMapping a Privacy Framework to  a Reference Model of Learning Analytics
Mapping a Privacy Framework to a Reference Model of Learning AnalyticsOpen Cyber University of Korea
 
Next generation data services at the Marriott Library
Next generation data services at the Marriott LibraryNext generation data services at the Marriott Library
Next generation data services at the Marriott LibraryRebekah Cummings
 

What's hot (20)

Practical applications for altmetrics in a changing metrics landscape
Practical applications for altmetrics in a changing metrics landscapePractical applications for altmetrics in a changing metrics landscape
Practical applications for altmetrics in a changing metrics landscape
 
NIH BD2K DataMed metadata model - Force11, 2016
NIH BD2K DataMed metadata model - Force11, 2016NIH BD2K DataMed metadata model - Force11, 2016
NIH BD2K DataMed metadata model - Force11, 2016
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviews
 
John morrissey c3 dis fair working data.pptx
John morrissey c3 dis fair working data.pptxJohn morrissey c3 dis fair working data.pptx
John morrissey c3 dis fair working data.pptx
 
Sue cook c3 dis dm-ps 1.pptx
Sue cook c3 dis dm-ps 1.pptxSue cook c3 dis dm-ps 1.pptx
Sue cook c3 dis dm-ps 1.pptx
 
Natasha intro to rdm c3 dis may 2018.pptx
Natasha intro to rdm c3 dis may 2018.pptxNatasha intro to rdm c3 dis may 2018.pptx
Natasha intro to rdm c3 dis may 2018.pptx
 
RDAP 15: “This is just for me”: Researchers on their data documentation pract...
RDAP 15: “This is just for me”: Researchers on their data documentation pract...RDAP 15: “This is just for me”: Researchers on their data documentation pract...
RDAP 15: “This is just for me”: Researchers on their data documentation pract...
 
NISO Training Thursday Crafting a Scientific Data Management Plan
NISO Training Thursday Crafting a Scientific Data Management PlanNISO Training Thursday Crafting a Scientific Data Management Plan
NISO Training Thursday Crafting a Scientific Data Management Plan
 
RDAP13 Elizabeth Moss: The impact of data reuse
RDAP13 Elizabeth Moss: The impact of data reuseRDAP13 Elizabeth Moss: The impact of data reuse
RDAP13 Elizabeth Moss: The impact of data reuse
 
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
NISO Working Group Connection Live! Research Data Metrics Landscape: An Updat...
 
Tools for improving data publication and use
Tools for improving data publication and useTools for improving data publication and use
Tools for improving data publication and use
 
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...
 
NIH BD2K bioCADDIE DataMed: Data Discovery Index
NIH BD2K bioCADDIE DataMed: Data Discovery IndexNIH BD2K bioCADDIE DataMed: Data Discovery Index
NIH BD2K bioCADDIE DataMed: Data Discovery Index
 
Burton - Security, Privacy and Trust
Burton - Security, Privacy and TrustBurton - Security, Privacy and Trust
Burton - Security, Privacy and Trust
 
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management Ecosystem
 
Data Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersData Management for Undergraduate Researchers
Data Management for Undergraduate Researchers
 
Mapping a Privacy Framework to a Reference Model of Learning Analytics
Mapping a Privacy Framework to  a Reference Model of Learning AnalyticsMapping a Privacy Framework to  a Reference Model of Learning Analytics
Mapping a Privacy Framework to a Reference Model of Learning Analytics
 
Next generation data services at the Marriott Library
Next generation data services at the Marriott LibraryNext generation data services at the Marriott Library
Next generation data services at the Marriott Library
 

Similar to NISO Plus: Data Discovery and Reuse: AI Solutions & the Human Factor

Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and StandardsARDC
 
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...ICPSR
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxelisarosa29
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementMarieke Guy
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...SEAD
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseVaticle
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsJason Hattrick-Simpers
 
NCME Big Data in Education
NCME Big Data  in EducationNCME Big Data  in Education
NCME Big Data in EducationPhilip Piety
 
Data, Data Everywhere: What's A Publisher to Do?
Data, Data Everywhere: What's  A Publisher to Do?Data, Data Everywhere: What's  A Publisher to Do?
Data, Data Everywhere: What's A Publisher to Do?Anita de Waard
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...Carole Goble
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelTrey Grainger
 
Research Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageResearch Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageSarah Anna Stewart
 
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithWorkshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithAfrican Open Science Platform
 
Data publishing at the UQ Library
Data publishing at the UQ LibraryData publishing at the UQ Library
Data publishing at the UQ LibraryARDC
 
Borgman orcid dryadsymposiumoxford20130523
Borgman orcid dryadsymposiumoxford20130523Borgman orcid dryadsymposiumoxford20130523
Borgman orcid dryadsymposiumoxford20130523ORCID, Inc
 
Why would a publisher care about open data?
Why would a publisher care about open data?Why would a publisher care about open data?
Why would a publisher care about open data?Anita de Waard
 

Similar to NISO Plus: Data Discovery and Reuse: AI Solutions & the Human Factor (20)

Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and Standards
 
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptx
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data Management
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in Materials
 
Lowenberg Making Data Count
Lowenberg Making Data CountLowenberg Making Data Count
Lowenberg Making Data Count
 
NCME Big Data in Education
NCME Big Data  in EducationNCME Big Data  in Education
NCME Big Data in Education
 
Strasser "Effective data management and its role in open research"
Strasser "Effective data management and its role in open research"Strasser "Effective data management and its role in open research"
Strasser "Effective data management and its role in open research"
 
Data, Data Everywhere: What's A Publisher to Do?
Data, Data Everywhere: What's  A Publisher to Do?Data, Data Everywhere: What's  A Publisher to Do?
Data, Data Everywhere: What's A Publisher to Do?
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 
Research Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural HeritageResearch Data Management in GLAM: Managing Data for Cultural Heritage
Research Data Management in GLAM: Managing Data for Cultural Heritage
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina SmithWorkshop 4: Open Science & Open Data for Librarians/Ina Smith
Workshop 4: Open Science & Open Data for Librarians/Ina Smith
 
Data publishing at the UQ Library
Data publishing at the UQ LibraryData publishing at the UQ Library
Data publishing at the UQ Library
 
Borgman orcid dryadsymposiumoxford20130523
Borgman orcid dryadsymposiumoxford20130523Borgman orcid dryadsymposiumoxford20130523
Borgman orcid dryadsymposiumoxford20130523
 
Why would a publisher care about open data?
Why would a publisher care about open data?Why would a publisher care about open data?
Why would a publisher care about open data?
 

More from National Information Standards Organization (NISO)

More from National Information Standards Organization (NISO) (20)

Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Bazargan "NISO Webinar, Sustainability in Publishing"
Bazargan "NISO Webinar, Sustainability in Publishing"Bazargan "NISO Webinar, Sustainability in Publishing"
Bazargan "NISO Webinar, Sustainability in Publishing"
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Compton "NISO Webinar, Sustainability in Publishing"
Compton "NISO Webinar, Sustainability in Publishing"Compton "NISO Webinar, Sustainability in Publishing"
Compton "NISO Webinar, Sustainability in Publishing"
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
Mattingly "Text and Data Mining: Building Data Driven Applications"
Mattingly "Text and Data Mining: Building Data Driven Applications"Mattingly "Text and Data Mining: Building Data Driven Applications"
Mattingly "Text and Data Mining: Building Data Driven Applications"
 
Mattingly "Text and Data Mining: Searching Vectors"
Mattingly "Text and Data Mining: Searching Vectors"Mattingly "Text and Data Mining: Searching Vectors"
Mattingly "Text and Data Mining: Searching Vectors"
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"
 
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
 
Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"
 
Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"
 
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
 
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
 
Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"
 
Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"
 
Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"
 

Recently uploaded

Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 

Recently uploaded (20)

Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 

NISO Plus: Data Discovery and Reuse: AI Solutions & the Human Factor

  • 1. Data Discovery and Reuse: AI Solutions & the Human Factor F E B R U A R Y 2 4 , 2 0 2 0 Huajin Wang, Ph.D. Liaison Librarian, Biology and Computer Science Program Director, Open Science & Data Collaborations huajinw@cmu.edu
  • 2. The AI pipeline starts from data “Garbage in, garbage out”: What does it mean?
  • 3. 3 AI is only as good as it’s data • Poor data quality (accuracy, completeness, bias, consistency, validity, uniqueness, ….)  Bias or wrong prediction • Poor metadata or documentation  Data data (re)usability • Not enough data  Model performance; overfitting
  • 4. 4 Key to successful AI systems: accessible high quality labeled data Road segmentation for self-driving cars Adapted from Evgeny Toropov. AIDR ’19. AI Road Road
  • 5. High quality data are hard to find 5 Lab website / Server
  • 6. High quality data are hard to find
  • 7. Technical solutions • Search engines to find existing data • Automation in evaluating data quality and integrating datasets • Automation in data curation • Model transfer and data augmentation 7 https://events.library.cmu.edu/aidr2019/ Supported by NSF public access initiative Save the Date: May 11, 2020 https://events.library.cmu.edu/aidr2020/
  • 8. 8 Evgeny Toropov. AIDR ’19. Road segmentation for self-driving cars Technical use case 1: Domain adaptation to transfer model to new data
  • 9. 9 Technical use case 2: Finding existing datasets * https://www.bostonwebdesigners.net/news/structured-data-and-local-seo/ • Datasets are distributed and hard to find • Structured data • Web (1%*) • Repositories • Unstructured data • Web (99%) • Publications • Images • Overall discovery layer missing
  • 10. 10 • Search engine powered by AI • Simple keyword search for datasets across the web • Searches over embedded metadata • Searches over metadata from data providers • schema.org data standards (embedded in html) • Dataset name, description, provider, temporal coverage, … Technical use case 2: Finding existing datasets (structured)
  • 11. 11 Unstructured data: need metadata tagging and data linking first • Keyphrase extraction from scholarly documents Cornelia Caragea and C. Lee Giles. AIDR ’19. Technical use case 2: Finding existing datasets (unstructured)
  • 13. • Responsible data collection and documentation • Data management best practices • Metadata & data standards Non-technical solutions (the human factor) Data stewardship Open science and data sharing • Share data, code, workflow • Follow FAIR principles • Build easy to use and robust tools for data sharing and reuse • Interdisciplinary collaborations 13
  • 14. 14 Work together to build a healthy data ecosystem
  • 15. 15 Open Science & Data Collaborations Program @Carnegie Mellon Support open practices across the entire research life cycle – from project planning, DMPs, preregistration, through operational data management and documenting reproducible workflows and methods, to publicly sharing research products including data and code in a way that is discoverable and reusable (FAIR+).
  • 16. 16 https://events.library.cmu.edu/aidr2020/ S A V E T H E D A T E: May 11, 2020
  • 17. Thank you! huajinw@cmu.edu @HuajinBioLib Photo: "Mobot” (MObile roBOTs) competition at CMU’s spring carnival.