Data Science, Data Curation, Human-Data Interaction
Bill Howe, Ph.D.
Associate Professor, Information School
Adjunct Associate Professor, Computer Science & Engineering
Associate Director and Senior Data Science Fellow, eScience Institute
7/26/2016 Bill Howe, UW 1
Dave Beck: Director of Research, Life Sciences; Ph.D., Medicinal Chemistry, Biomolecular Structure & Design
Jake VanderPlas: Director of Research, Physical Sciences; Ph.D., Astronomy
Valentina Staneva: Data Scientist; Ph.D., Applied Mathematics and Statistics
Ariel Rokem: Data Scientist; Ph.D., Neuroscience
Andrew Gartland: Research Scientist; Ph.D., Biostatistics
Bryna Hazelton: Research Scientist; Ph.D., Physics
Bernease Herman: Data Scientist; BS, Stats; was SE at Amazon
Vaughn Iverson: Research Scientist; Ph.D., Oceanography
Rob Fatland: Director of Cloud and Data Solutions; Senior Data Science Fellow; PhD Geophysics
Joe Hellerstein: Senior Data Science Fellow; IBM Research, Microsoft Research, Google (ret.)
Data Scientists
Research Scientists
Research Faculty Cyberinfrastructure
Brittany Fiore-Gartland: Ethnographer; Ph.D., Communication; Dir. Ethnography
http://escience.washington.edu
What is the rate-limiting step in data understanding?
[Figure: amount of data in the world, processing power (Moore’s Law), and human cognitive capacity, each plotted against time]
Idea adapted from “Less is More” by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
Goal: Understand and optimize how people
use and share quantitative information
“Human-Data Interaction”
The SQLShare Corpus:
A multi-year log of hand-written SQL queries
Queries 24275
Views 4535
Tables 3891
Users 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
lifetime = days between first and last access of table
Data “Grazing”: Short dataset lifetimes
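The lifetime definition above (days between first and last access of a table) can be sketched in a few lines. This is a minimal illustration, not the SQLShare analysis code: the log format, the table names, and the function name `table_lifetimes` are all assumptions made for the example.

```python
from datetime import date


def table_lifetimes(access_log):
    """Map each table to the days between its first and last access."""
    first, last = {}, {}
    for table, day in access_log:
        first[table] = min(first.get(table, day), day)
        last[table] = max(last.get(table, day), day)
    return {table: (last[table] - first[table]).days for table in first}


# Hypothetical log entries: (table name, access date).
log = [
    ("plankton_counts", date(2014, 3, 1)),
    ("plankton_counts", date(2014, 3, 4)),
    ("scratch_upload", date(2014, 3, 2)),
]
lifetimes = table_lifetimes(log)
```

A table touched on a single day gets a lifetime of 0, which is the kind of short-lived dataset the "grazing" observation refers to.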
MYRIA: POLYSTORE MGMT
Human-Data Interaction
Modern Big Data Ecosystems: many different platforms, complex analytics
[Figure: Myria architecture. Labels: R, A, G, K; Myria Algebra over Tables, KeyVal, Arrays, Graphs; RACO (Relational Algebra COmpiler) with rewrite rules; Logical Algebra, Array Algebra, Parallel Algebra; Spark API, CombBLAS API, Accumulo Graph API, GraphX; MyriaL]
Services: visualization, logging, discovery, history, browsing
Orchestration
https://github.com/uwescience/raco
ISMIR 2016
SeaFlow
[Figure: SeaFlow instrument schematic. Labels: laser, microscope objective, pinhole lens, nozzle, d1, d2, FSC (forward scatter), orange fluo, red fluo]
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
Ashes CAMHD
http://novae.ocean.washington.edu/story/Ashes_CAMHD_Live
Extract synchronized slices
Co-register
(camera jitter, bad time synch)
Separate fore- and back-ground
Classify critters in the foreground
Measure growth rate over time
“DEEP” CURATION
Human-Data Interaction
Microarray experiments
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
color = labels supplied as metadata
clusters = first two PCA dimensions on the gene expression data itself
Can we use the expression data directly to curate algorithmically?
The expression data and the text labels appear to disagree
Better Tissue Type Labels
Domain knowledge
(Ontology)
Expression data
Free-text Metadata
2 Deep Networks
text
expr
SVM
Deep Curation
Maxim Gretchkin, Hoifung Poon
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.
Free-text classifier
Expression classifier
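The co-learning idea above, two classifiers that each train on the other's results, can be illustrated with a generic co-training loop. This is only a sketch under strong assumptions and not the Deep Curation method itself: each sample is reduced to one hypothetical score per view (text and expression), both classifiers are simple nearest-centroid models rather than deep networks, and the tissue names and numbers are made up.

```python
from statistics import mean


def centroids(values, labels):
    """Mean feature value per class (a nearest-centroid 'model')."""
    by_class = {}
    for x, y in zip(values, labels):
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}


def predict(cents, x):
    """Return (label, margin); a larger margin means a more confident call."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    return best, abs(x - cents[ranked[1]]) - abs(x - cents[best])


def co_train(text_view, expr_view, seeds, rounds=10):
    """Grow a shared label pool: each view labels its most confident sample."""
    labels = dict(seeds)  # sample index -> label
    for _ in range(rounds):
        added = False
        for view in (text_view, expr_view):
            unlabeled = [i for i in range(len(view)) if i not in labels]
            if not unlabeled:
                break
            known = list(labels)
            cents = centroids([view[i] for i in known], [labels[i] for i in known])
            scored = [(predict(cents, view[i]), i) for i in unlabeled]
            (label, _margin), idx = max(scored, key=lambda s: s[0][1])
            labels[idx] = label  # the other view trains on this next
            added = True
        if not added:
            break
    return labels


# Hypothetical one-dimensional scores per sample for each view.
text_view = [0.1, 0.2, 0.15, 0.9, 0.8, 0.85]
expr_view = [0.2, 0.1, 0.3, 0.8, 0.9, 0.7]
result = co_train(text_view, expr_view, seeds={0: "liver", 3: "brain"})
```

Starting from two seed labels, each view alternately labels the sample it is most sure about, and that label becomes training data for the other view in the next step.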
VIZIOMETRICS:
COMPREHENDING VISUAL INFORMATION
IN THE SCIENTIFIC LITERATURE
Human-Data Interaction
Observations
• Figures in the literature are the currency of scientific ideas
• Almost entirely unexplored
• Our thought: Mine patterns in the visual literature
Step 1: Dismantling Composite Figures
Poshen Lee
ICPRAM 2015
Step 2: Classification
• Divide images into small patches
• Take a random sample
• Run k-means on samples (k = 200)
• For each figure in training set, generate a length-200 feature vector by similarity to clusters. Train a model.
• For each test image, create the vector and classify by the model
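The classification steps above can be sketched end to end on toy data. This is a minimal illustration under assumptions, not the Viziometrics code: images are tiny grayscale grids, k is 2 instead of 200, "similarity to clusters" is interpreted as the fraction of an image's patches nearest to each cluster center, and the trained model is a simple nearest-centroid classifier. All function names and the class labels are hypothetical.

```python
import random


def patches(image, size):
    """Split a 2-D grayscale image into flattened size x size patches."""
    out = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            out.append([image[r + dr][c + dc]
                        for dr in range(size) for dc in range(size)])
    return out


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the closest center."""
    return min(range(len(centers)), key=lambda j: dist2(point, centers[j]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; an empty cluster keeps its old center."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def featurize(image, centers, size):
    """Length-k vector: share of the image's patches nearest to each center."""
    ps = patches(image, size)
    counts = [0] * len(centers)
    for p in ps:
        counts[nearest(p, centers)] += 1
    return [c / len(ps) for c in counts]


def train(features, labels):
    """Nearest-centroid model: mean feature vector per class."""
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)]
            for y, fs in by_class.items()}


def classify(model, feature):
    return min(model, key=lambda y: dist2(feature, model[y]))


# Toy 4x4 images: "photo" is all dark, "diagram" is mostly white.
photo = [[0] * 4 for _ in range(4)]
diagram = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
train_images, train_labels = [photo, diagram, photo, diagram], ["photo", "diagram"] * 2

pool = [p for img in train_images for p in patches(img, 2)]
sample = random.Random(1).sample(pool, 12)      # take a random sample
centers = kmeans(sample, k=2)                   # k = 200 in the slides
model = train([featurize(img, centers, 2) for img in train_images], train_labels)

test_image = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
predicted = classify(model, featurize(test_image, centers, 2))
```

The test image has its dark patch in a different corner than the training diagrams, but its patch mix is the same, so it lands on the diagram side of the model.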
Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes)
Poshen Lee, Jevin West
[Figure panels: high impact papers, low impact papers]
Do high-impact papers have more diagrams? (Yes)
Do papers in top journals tend to involve more or less visual information? (More)
7/26/2016 Poshen Lee, UW 52
viziometrics.org
Burrows-Wheeler Alignment, Computation, DNA Sequencing. Citations: 7807 (+11 since 2016). Eigenfactor: 0.0000574719
DNA Methylation, Brain Cancer, Chromosomal Aberrations, Cancer Genome Atlas. Citations: 2094 (+7 since 2016). Eigenfactor: 0.0000279023
Memory-efficient Computation, DNA Sequencing. Citations: 7459 (+17 since 2016). Eigenfactor: 0.0000875579
Molecular biology, Genetics, Genomics, DNA. Citations: 3766 (+15 since 2016). Eigenfactor: 0.0000183255
INFORMATION EXTRACTION
FROM FIGURES
Information-critical figures
Metabolic pathway diagrams
Phylogenetic heat maps
Architecture diagrams
Sean Yang
Normalize
Corner Detection, Line Detection
Extract Tree Structure
VISUALIZATION
RECOMMENDATION
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
[Figure panels: bad scatter plot, good scatter plot]
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
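The two column statistics used by the learned rules above can be computed as follows. This is a sketch under assumptions: periodicity follows the slide's definition (1 / variance in gap length between successive values, taken in the order given), while the entropy estimate, with equal-width binning and a default of 10 bins, is an illustrative choice that the slides do not specify.

```python
import math
from collections import Counter
from statistics import pvariance


def periodicity(values):
    """1 / variance of the gaps between successive values (inf if gaps are equal)."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    var = pvariance(gaps)
    return math.inf if var == 0 else 1 / var


def entropy(values, bins=10):
    """Shannon entropy (bits) of the values after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant column carries no information
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


regular = periodicity([0, 1, 2, 3, 4])      # evenly spaced, e.g. daily samples
irregular = periodicity([0, 1, 5, 6, 20])   # uneven gaps
flat = entropy([5] * 10)                    # every x value identical
spread = entropy(list(range(10)))           # x values spread over all bins
```

An evenly spaced column gets a very high periodicity, which is the signal the rule maps to a timeseries plot; a column with near-zero entropy is the case the first rule flags as a bad scatter plot axis.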
Voyager
Kanit “Ham” Wongsuphasawat, Dominik Moritz, Jeff Heer, Jock Mackinlay, Anushka Anand
InfoVis 15
SCALABLE GRAPH
CLUSTERING
Seung-Hee Bae
Scalable Graph Clustering
Version 1: Parallelize Best-known Serial Algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)
Recap
• “Human-Data Interaction” is the bottleneck!
– SQLShare: Mining SQL logs to uncover user behavior
– Myria/RACO: Polystore Optimization
– Deep Curation: Zero-training labeling of scientific datasets
– Viziometrics: Mining the scientific literature
– Voyager: Visualization Recommendation
– GossipMap: Scalable Graph Clustering
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
https://github.com/vega/voyager (Voyager)
@billghowe
github: billhowe
http://homes.cs.washington.edu/~billhowe/
• OCCs: Big Data / Database researcher with broad impact and expertise in research data management,
• Democratizing Data Science
– Ourselves: Reduce overhead in attention-scarce regimes
– Other fields: Reduce overhead of interdisciplinary research
– The public: Reduce overhead of communicating with the public and policymakers
• SQLShare
– Why? What? Impact?
– Key: RDM, NSF-funded, hundreds of users
– Are these workloads any different than a typical database?
• HaLoop
– Why? What? Impact?
– Key: Papers, new subfield in big data
• Myria
– Why? What? Impact?
– Key: Funding
• Viziometrics
– Why? What? Impact?
• Data Curation through an Algorithmic Lens
– Why? What? Impact?
– Volume, variety, velocity. Volume: tasks that scale with the number of records: movement, validation. Variety: tasks that scale with the number of datasets: metadata attachment, cataloging, metadata verification. Velocity: tasks that scale with the time since release: data journalism, legal cases
– Example? Maxim’s work. Prevalence of missing and incorrect labels.
– Is this dataset what it says it is?
– Why? Reproducibility crisis
– Is this fully automatic? No. Training data, computational steering
• http://www.urbanlibraries.org/living-voters-guide--librarians-as-fact-checkers-innovation-722.php?page_id=167
• http://engage.cs.washington.edu/
• https://www.commerce.gov/datausability/
• Available
– Can you get it if you know where to look?
• Discoverable
– Can you get it if you don’t know where to look?
• Manipulable
– What can you do with it, besides download it? Can the structure be readily parsed and transformed?
• Interpretable
– Is the information internally consistent with respect to provenance, metadata, column names, etc.?
• Contextualizable
– Is the information externally consistent with respect to other related datasets? Can it be connected to other datasets through standards or conventions? Does it admit connections to other datasets?
Services emphasizing discovery, citation, and preservation
Query, Viz, and Analytics Services
Google Fusion Tables
[Figure labels: Predict, Download, Query, Join, Visualize; url, doi, tags, space and time, ontologies, standards]
Server software, locally installed
• ISMIR paper
• Allen Institute example:
• flexibility gap between high level and low level
– Domain-specific languages
http://casestudies.brain-map.org/ggb#section_explorea
http://blog.ibmjstart.net/2015/08/22/dynamic-dashboards-from-jupyter-notebooks/
A Typical Data Science Workflow
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”

More Related Content

What's hot

From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
Paul Groth
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
Paul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
Paul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
Paul Groth
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Big Data Spain
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
Paul Groth
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
Paul Groth
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
University of Washington
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of Data
Paul Groth
 
Tragedy of the (Data) Commons
Tragedy of the (Data) CommonsTragedy of the (Data) Commons
Tragedy of the (Data) Commons
James Hendler
 
Knowledge Graph Semantics/Interoperability
Knowledge Graph Semantics/InteroperabilityKnowledge Graph Semantics/Interoperability
Knowledge Graph Semantics/Interoperability
James Hendler
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
University of Washington
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Artificial Intelligence Institute at UofSC
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
Micah Altman
 
Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5
Jeremie Dauphin
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
Paul Groth
 
Broad Data
Broad DataBroad Data
Broad Data
James Hendler
 
The UVA School of Data Science
The UVA School of Data ScienceThe UVA School of Data Science
The UVA School of Data Science
Philip Bourne
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the Web
Stefan Dietze
 
AgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use CasesAgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use Cases
Rothamsted Research, UK
 

What's hot (20)

From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of Data
 
Tragedy of the (Data) Commons
Tragedy of the (Data) CommonsTragedy of the (Data) Commons
Tragedy of the (Data) Commons
 
Knowledge Graph Semantics/Interoperability
Knowledge Graph Semantics/InteroperabilityKnowledge Graph Semantics/Interoperability
Knowledge Graph Semantics/Interoperability
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
 
Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
Broad Data
Broad DataBroad Data
Broad Data
 
The UVA School of Data Science
The UVA School of Data ScienceThe UVA School of Data Science
The UVA School of Data Science
 
Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the Web
 
AgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use CasesAgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use Cases
 

Similar to Data Science, Data Curation, and Human-Data Interaction

Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Bryan Heidorn
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Datavbrant
 
Why study Data Sharing? (+ why share your data)
Why study Data Sharing?  (+ why share your data)Why study Data Sharing?  (+ why share your data)
Why study Data Sharing? (+ why share your data)
Heather Piwowar
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
Scott Edmunds
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Spark Summit
 
Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACwebuploader
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
Chris Rusbridge
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
Benjamin Good
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
c.titus.brown
 
CRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestCRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestLeadershipProgram
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Carole Goble
 
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Artificial Intelligence Institute at UofSC
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
ExternalEvents
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013
Anita de Waard
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
GigaScience, BGI Hong Kong
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
Carole Goble
 
Introduction to Bioinformatics.
 Introduction to Bioinformatics. Introduction to Bioinformatics.
Introduction to Bioinformatics.
Elena Sügis
 
Resume H
Resume HResume H
Resume HPeterLI
 
The Future of Research (Science and Technology)
The Future of Research (Science and Technology)The Future of Research (Science and Technology)
The Future of Research (Science and Technology)
Duncan Hull
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
Elena Sügis
 

Similar to Data Science, Data Curation, and Human-Data Interaction (20)

Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Data
 
Why study Data Sharing? (+ why share your data)
Why study Data Sharing?  (+ why share your data)Why study Data Sharing?  (+ why share your data)
Why study Data Sharing? (+ why share your data)
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SAC
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositories
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
CRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuestCRI - Teaching Through Research - John Jungck - BioQuest
CRI - Teaching Through Research - John Jungck - BioQuest
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
 
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
Introduction to Bioinformatics.
 Introduction to Bioinformatics. Introduction to Bioinformatics.
Introduction to Bioinformatics.
 
Resume H
Resume HResume H
Resume H
 
The Future of Research (Science and Technology)
The Future of Research (Science and Technology)The Future of Research (Science and Technology)
The Future of Research (Science and Technology)
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 

More from University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
University of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
University of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
University of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
University of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
University of Washington
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
University of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
University of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
University of Washington
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
University of Washington
 
Data science curricula at UW
Data Science, Data Curation, and Human-Data Interaction

  • 1. Data Science, Data Curation, Human-Data Interaction Bill Howe, Ph.D. Associate Professor, Information School Adjunct Associate Professor, Computer Science & Engineering Associate Director and Senior Data Science Fellow, eScience Institute 7/26/2016 Bill Howe, UW 1
  • 2. Dave Beck: Director of Research, Life Sciences; Ph.D., Medicinal Chemistry, Biomolecular Structure & Design. Jake VanderPlas: Director of Research, Physical Sciences; Ph.D., Astronomy. Valentina Staneva: Data Scientist; Ph.D., Applied Mathematics and Statistics. Ariel Rokem: Data Scientist; Ph.D., Neuroscience. Andrew Gartland: Research Scientist; Ph.D., Biostatistics. Bryna Hazelton: Research Scientist; Ph.D., Physics. Bernease Herman: Data Scientist; BS, Stats; was SE at Amazon. Vaughn Iverson: Research Scientist; Ph.D., Oceanography. Rob Fatland: Director of Cloud and Data Solutions, Senior Data Science Fellow; PhD, Geophysics. Joe Hellerstein: Senior Data Science Fellow; IBM Research, Microsoft Research, Google (ret.). Brittany Fiore-Gartland: Ethnographer, Dir. Ethnography; Ph.D., Communication. [Group labels: Data Scientists, Research Scientists, Research Faculty, Cyberinfrastructure] http://escience.washington.edu
  • 3. What is the rate-limiting step in data understanding? [Plot: amount of data in the world and processing power (Moore’s Law), each against time]
  • 4. What is the rate-limiting step in data understanding? [Plot: amount of data in the world, processing power (Moore’s Law), and human cognitive capacity, each against time] Idea adapted from “Less is More” by Bill Buxton (2001). Slide source: Cecilia Aragon, UW HCDE
  • 5. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  • 7. “Human-Data Interaction”. Goal: Understand and optimize how people use and share quantitative information
  • 8. The SQLShare Corpus: A multi-year log of hand-written SQL queries. Queries: 24275; Views: 4535; Tables: 3891; Users: 591. SIGMOD 2016, Shrainik Jain. https://uwescience.github.io/sqlshare
  • 9. Data “Grazing”: Short dataset lifetimes. lifetime = days between first and last access of table. SIGMOD 2016, Shrainik Jain. http://uwescience.github.io/sqlshare/
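The lifetime definition on this slide can be computed directly from an access log. The following is a minimal sketch, assuming the log is available as (table, access date) pairs. The function name `table_lifetimes` and the sample rows are hypothetical and are not taken from the SQLShare corpus.

```python
from datetime import date


def table_lifetimes(access_log):
    """Return {table: lifetime in days} from (table, access_date) pairs.

    Lifetime follows the slide's definition: days between the first and
    the last access of a table.
    """
    first, last = {}, {}
    for table, day in access_log:
        if table not in first or day < first[table]:
            first[table] = day
        if table not in last or day > last[table]:
            last[table] = day
    return {t: (last[t] - first[t]).days for t in first}


# Hypothetical log entries, invented for illustration only.
log = [
    ("samples_2014", date(2014, 3, 1)),
    ("samples_2014", date(2014, 3, 4)),
    ("scratch_upload", date(2014, 5, 10)),
    ("samples_2014", date(2014, 3, 2)),
]
lifetimes = table_lifetimes(log)
```

A table that is accessed on a single day gets a lifetime of 0, which is the "grazing" pattern the slide title refers to.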
  • 10. MYRIA: POLYSTORE MGMT. Human-Data Interaction
  • 11. Modern Big Data Ecosystems: many different platforms, complex analytics [Diagram labels: R, A, G, K]
  • 12. RACO: Relational Algebra COmpiler [Diagram labels: Myria Algebra; Tables; KeyVal; Arrays; Graphs]
  • 13. RACO: Relational Algebra COmpiler [Architecture diagram labels: MyriaL; Logical Algebra; Array Algebra; rewrite rules; Parallel Algebra; Spark API, Accumulo Graph API, CombBLAS API; Spark, Accumulo, CombBLAS, GraphX; Orchestration; Services: visualization, logging, discovery, history, browsing] https://github.com/uwescience/raco
  • 14. ISMIR 2016
  • 15. SeaFlow (Francois Ribalet, Jarred Swalwell, Ginger Armbrust) [Instrument diagram labels: Laser; Microscope Objective; Pinhole; Lens; Nozzle; d1; d2; FSC (forward scatter); Orange fluorescence; Red fluorescence]
  • 18. Extract synchronized slices; co-register (camera jitter, bad time synch); separate foreground and background; classify critters in the foreground; measure growth rate over time
  • 21. Curation is fast becoming the bottleneck to data sharing [Plot: microarray samples submitted to the Gene Expression Omnibus] Maxim Gretchkin, Hoifung Poon
  • 22. Can we use the expression data directly to curate algorithmically? The expression data and the text labels appear to disagree. [Plot: color = labels supplied as metadata; clusters = 1st two PCA dimensions on the gene expression data itself] Maxim Gretchkin, Hoifung Poon
  • 23. Better Tissue Type Labels [Diagram labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM] Maxim Gretchkin, Hoifung Poon
  • 24. Deep Curation: Distant supervision and co-learning between the text-based classifier and the expression-based classifier. Both models improve by training on each other’s results. [Diagram labels: Free-text classifier; Expression classifier] Maxim Gretchkin, Hoifung Poon
  • 25. VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION IN THE SCIENTIFIC LITERATURE. Human-Data Interaction
  • 28. Observations
      • Figures in the literature are the currency of scientific ideas
      • Almost entirely unexplored
      • Our thought: Mine patterns in the visual literature
  • 29. Step 1: Dismantling Composite Figures (Poshen Lee, ICPRAM 2015)
  • 30. Step 2: Classification
      • Divide images into small patches
      • Take a random sample
      • Run k-means on samples (k = 200)
      • For each figure in training set, generate a length-200 feature vector by similarity to clusters. Train a model.
      • For each test image, create the vector and classify by the model
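The classification steps on this slide resemble a bag-of-visual-words pipeline. The sketch below is one possible reading, not the authors' implementation. It assumes "similarity to clusters" means the fraction of a figure's patches that fall nearest each k-means center, it uses a 1-nearest-neighbor lookup in place of the unspecified model, and it shrinks k from 200 to 2 so the toy data stays readable. All names, the patch size, and the two synthetic image classes are invented for illustration.

```python
import random


def patches(image, size):
    """Split a 2-D list of pixel values into flattened size x size patches."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            out.append([image[r + i][c + j]
                        for i in range(size) for j in range(size)])
    return out


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with centers initialised from the data."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old center when a cluster is empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def featurize(image, centers, size):
    """Length-k vector: fraction of the image's patches nearest each center."""
    hist = [0.0] * len(centers)
    ps = patches(image, size)
    for p in ps:
        hist[nearest(p, centers)] += 1.0 / len(ps)
    return hist


K = 2      # the slide uses k = 200; 2 keeps this toy example small
PATCH = 2  # hypothetical patch size

# Two synthetic 4 x 4 "figures": uniform, and alternating dark/light rows.
flat = [[0.0] * 4 for _ in range(4)]
striped = [[float(r % 2)] * 4 for r in range(4)]
train = [(flat, "flat"), (striped, "striped"),
         (flat, "flat"), (striped, "striped")]

all_patches = [p for img, _ in train for p in patches(img, PATCH)]
sample = random.Random(1).sample(all_patches, 12)  # random sample of patches
centers = kmeans(sample, K)
model = [(featurize(img, centers, PATCH), label) for img, label in train]


def classify(image):
    """1-nearest-neighbor over the training feature vectors (a stand-in model)."""
    vec = featurize(image, centers, PATCH)
    return min(model, key=lambda m: sq_dist(vec, m[0]))[1]


predicted = [classify(striped), classify(flat)]
```

In a realistic setting the patches would come from real figure images, and any standard classifier could be trained on the length-k vectors in place of the nearest-neighbor lookup.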
  • 32. Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes) [Plot legend: high impact papers, low impact papers] Poshen Lee, Jevin West
  • 33. Do high-impact papers have more diagrams? (Yes) Poshen Lee, Jevin West
  • 34. Do papers in top journals tend to involve more or less visual information? (More) Poshen Lee, Jevin West
  • 37. 7/26/2016 Poshen Lee, UW 52 viziometrics.org
  • 38. viziometrics.org
      • Burrows-Wheeler Alignment Computation DNA Sequencing. Citations: 7807 (+11 since 2016). Eigenfactor: 0.0000574719
      • DNA Methylation Brain Cancer Chromosomal Aberrations Cancer Genome Atlas. Citations: 2094 (+7 since 2016). Eigenfactor: 0.0000279023
      • Memory-efficient Computation DNA Sequencing. Citations: 7459 (+17 since 2016). Eigenfactor: 0.0000875579
      • Molecular biology Genetics Genomics DNA. Citations: 3766 (+15 since 2016). Eigenfactor: 0.0000183255
  • 39. INFORMATION EXTRACTION FROM FIGURES. Information-critical figures: metabolic pathway diagrams, phylogenetic heat maps, architecture diagrams
  • 46. Example of a Learned Rule (1): low x-entropy => bad scatter plot [Panels: bad scatter plot, good scatter plot]
  • 47. Example of a Learned Rule (3): high x-periodicity => timeseries plot (periodicity = 1 / variance in gap length between successive values)
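The two features named in these learned rules can be sketched in a few lines. The periodicity function follows the formula on the slide. The entropy function is an assumption, since the slides do not say how x-entropy is computed: here it is the Shannon entropy (in bits) of the empirical distribution of distinct x values. The function names and the choice of population variance are also assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import pvariance


def x_entropy(xs):
    """Shannon entropy (bits) of the empirical distribution of x values.

    Assumed definition: the slides do not specify how entropy is computed.
    """
    counts = Counter(xs)
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def x_periodicity(xs):
    """1 / variance of the gaps between successive sorted values.

    Needs at least two values. Perfectly regular spacing has zero gap
    variance, which is reported here as infinite periodicity.
    """
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    var = pvariance(gaps)
    return math.inf if var == 0 else 1.0 / var


# Hypothetical columns, invented for illustration only.
regular = [0, 10, 20, 30, 40]  # evenly spaced, like timestamps
irregular = [0, 1, 5, 6, 20]   # uneven gaps
constant = [1, 1, 1, 1]        # a single repeated x value
spread = [1, 2, 3, 4]          # four distinct x values
```

Under these definitions, `regular` scores as maximally periodic and `constant` has zero entropy, which matches the direction of the two rules: evenly spaced x values suggest a time series, and an x column with little variety makes a poor scatter plot axis.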
  • 48. Voyager (InfoVis 15). Kanit “Ham” Wongsuphasawat, Dominik Moritz, Jeff Heer, Jock Mackinlay, Anushka Anand
  • 50. Scalable Graph Clustering (Seung-Hee Bae). Version 1: Parallelize best-known serial algorithm (ICDM 2013). Version 2: Free 30% improvement for any algorithm (TKDD 2014). Version 3: Distributed approx. algorithm, 1.5B edges (SC 2015)
  • 51. Recap
      • “Human-Data Interaction” is the bottleneck!
          – SQLShare: Mining SQL logs to uncover user behavior
          – Myria/RACO: Polystore Optimization
          – Deep Curation: Zero-training labeling of scientific datasets
          – Viziometrics: Mining the scientific literature
          – Voyager: Visualization Recommendation
          – GossipMap: Scalable Graph Clustering
  • 53.
      • OCCs: Big Data / Database researcher with broad impact and expertise in research data management,
      • Democratizing Data Science
          – Ourselves: Reduce overhead in attention-scarce regimes
          – Other fields: Reduce overhead of interdisciplinary research
          – The public: Reduce overhead of communicating with the public and policymakers
      • SQLShare
          – Why? What? Impact?
          – Key: RDM, NSF-funded, hundreds of users
          – Are these workloads any different than a typical database?
      • HaLoop
          – Why? What? Impact?
          – Key: Papers, new subfield in big data
      • Myria
          – Why? What? Impact?
          – Key: Funding
      • Viziometrics
          – Why? What? Impact?
      • Data Curation through an Algorithmic Lens
          – Why? What? Impact?
          – Volume, variety, velocity. Volume: tasks that scale with the number of records: movement, validation. Variety: tasks that scale with the number of datasets: metadata attachment, cataloging, metadata verification. Velocity: tasks that scale with the time since release. Data journalism, legal cases
          – Example? Maxim’s work. Prevalence of missing and incorrect labels.
          – Is this dataset what it says it is?
          – Why? Reproducibility crisis
          – Is this fully automatic? No. Training data, computational steering
  • 57.
      • Available – Can you get it if you know where to look?
      • Discoverable – Can you get it if you don’t know where to look?
      • Manipulable – What can you do with it, besides download it? Can the structure be readily parsed and transformed?
      • Interpretable – Is the information internally consistent with respect to provenance, metadata, column names, etc.?
      • Contextualizable – Is the information externally consistent with respect to other related datasets? Can it be connected to other datasets through standards or conventions? Does it admit connections to other datasets?
  • 59. Query, Viz, and Analytics Services Google Fusion Tables
  • 60. [Diagram labels: Predict; Download; Query; Join; Visualize; url; doi; tags; space and time; ontologies; standards]
  • 64.
      • Allen Institute example:
      • flexibility gap between high level and low level
          – Domain-specific languages
      http://casestudies.brain-map.org/ggb#section_explorea
      http://blog.ibmjstart.net/2015/08/22/dynamic-dashboards-from-jupyter-notebooks/
  • 65. What is the rate-limiting step in data understanding? [Plot: amount of data in the world and processing power (Moore’s Law), each against time]
  • 66. What is the rate-limiting step in data understanding? [Plot: amount of data in the world, processing power (Moore’s Law), and human cognitive capacity, each against time] Idea adapted from “Less is More” by Bill Buxton (2001). Slide source: Cecilia Aragon, UW HCDE
  • 67. A Typical Data Science Workflow
      1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging. “80% of the work” -- Aaron Kimball
      2) Running the model
      3) Interpreting the results: “The other 80% of the work”
  • 68. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”