Making Data Accessible
Mercè Crosas
Chief Data Science andTechnology Officer
Institute for Quantitative Social Science
Harvard University
Scholarly output doubles every 20 years
from 1750s to 2000
Scholarly output doubles every 20 years
from 1750s to 2000
1665
Source: Mabe,The Growth and Number of Journals, 2003
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700
3 journals
Source: Mabe,The Growth and Number of Journals, 2003
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700 1800
3 journals 10 journals
Source: Mabe,The Growth and Number of Journals, 2003
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700 1800 1900
3 journals 10 journals 400 journals
Source: Mabe,The Growth and Number of Journals, 2003
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700 1800 1900 2000
3 journals 10 journals 14, 000 journals

(peer-reviewed)
400 journals
Source: Mabe,The Growth and Number of Journals, 2003
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700 1800 1900 2000
3 journals 10 journals 14, 000 journals

(peer-reviewed)
400 journals
Source: Mabe,The Growth and Number of Journals, 2003
Increase in specialization: 

every 100-150 authors = 

1 new journal for 500-1000 readers
Philosophical
Transactions of the
Royal Society

Nullius in verba
Scholarly output doubles every 20 years
from 1750s to 2000
1665 1700 1800 1900 2000
3 journals 10 journals 14, 000 journals

(peer-reviewed)
400 journals
Source: Mabe,The Growth and Number of Journals, 2003
Now:

• 80,000 total journals

• 33,000 peer-reviewed
Increase in specialization: 

every 100-150 authors = 

1 new journal for 500-1000 readers
Philosophical
Transactions of the
Royal Society

Nullius in verba
Data are part of the article in the form of
visuals and tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
First line graphs and
bar charts:
Playfair (1786)
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
First line graphs and
bar charts:
Playfair (1786)
First pie chart:
Playfair (1801)
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
First line graphs and
bar charts:
Playfair (1786)
First pie chart:
Playfair (1801)
First scatterplots:
Hershel (1833),
Galton (1896)
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
First line graphs and
bar charts:
Playfair (1786)
First pie chart:
Playfair (1801)
First scatterplots:
Hershel (1833),
Galton (1896)
Data visuals evolve from illustrations to scientific arguments
50% with visuals, 

~7 each article;
integrated with tables
1665 1700 1800 1900 2000
Data are part of the article in the form of
visuals and tables
30% of articles
with visuals;

5% have tables
30% with visuals;
use of tables
increases
90% with figures and
tables; 

occupy 25% of space
First line graphs and
bar charts:
Playfair (1786)
First pie chart:
Playfair (1801)
First scatterplots:
Hershel (1833),
Galton (1896)
Data visuals evolve from illustrations to scientific arguments
50% with visuals, 

~7 each article;
integrated with tables
Source: Gross Harmon, Reidy, Communicating Science, 2002
1665 1700 1800 1900 2000
Science communication adapts to the
increase in cognitive complexity
Science communication adapts to the
increase in cognitive complexity
• Gross, Herman and Reidy (2002):
Science communication adapts to the
increase in cognitive complexity
• Gross, Herman and Reidy (2002):
• analyze the style, presentation, and argument of a
sample of articles from 17th to 20th century
Science communication adapts to the
increase in cognitive complexity
• Gross, Herman and Reidy (2002):
• analyze the style, presentation, and argument of a
sample of articles from 17th to 20th century
• and argue that during that time we developed devices for
more efficient communication to compensate the
increase of cognitive complexity
Science communication adapts to the
increase in cognitive complexity
• Gross, Herman and Reidy (2002):
• analyze the style, presentation, and argument of a
sample of articles from 17th to 20th century
• and argue that during that time we developed devices for
more efficient communication to compensate the
increase of cognitive complexity
• The increase in quantity and complexity of scholarly output is
accompanied with an increase in research data
Science communication adapts to the
increase in cognitive complexity
• Gross, Herman and Reidy (2002):
• analyze the style, presentation, and argument of a
sample of articles from 17th to 20th century
• and argue that during that time we developed devices for
more efficient communication to compensate the
increase of cognitive complexity
• The increase in quantity and complexity of scholarly output is
accompanied with an increase in research data
• I argue that in the last decade data publishing is born as a new
necessary device for more efficient communication of science
Emergence of (domain-specific) digital
data archives
Emergence of (domain-specific) digital
data archives
1957
Roper Center
public and
operational
Emergence of (domain-specific) digital
data archives
1957 1960
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
Emergence of (domain-specific) digital
data archives
1957 1960 1962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Emergence of (domain-specific) digital
data archives
1957 1960 19641962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
Emergence of (domain-specific) digital
data archives
1957 1960 19641962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
1965
Emergence of (domain-specific) digital
data archives
1957 1960 19641962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
1965 1966
National Space
Science Data Center
Emergence of (domain-specific) digital
data archives
1957 1960 19641962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
1965 19671966
National Space
Science Data Center
Emergence of (domain-specific) digital
data archives
1957 1960 1964 19701962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
1965 19671966
National Space
Science Data Center
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 19671966
National Space
Science Data Center
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 19671966
National Space
Science Data Center
1987
Astrophysics
Data System
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 19671966
National Space
Science Data Center
1987
Astrophysics
Data System
1990
EOSDIS
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 1967 19951966
National Space
Science Data Center
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 1967
social sciences
19951966
National Space
Science Data Center
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 1967
social sciences
19951966
National Space
Science Data Center
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
astronomy
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 1967
social sciences life sciences
19951966
National Space
Science Data Center
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
astronomy
Emergence of (domain-specific) digital
data archives
1957 1960 1964 1970 19821962
Roper Center
public and
operational
Zentral Archiv
für Empirische
Sozialforschung
(Germany)
ICPSR
Steinmetz Archive

(Netherlands)
ODUM

Data Archive
UK Data
Archive
Protein Data
Bank
European
Nucleotide
Archive
GenBank
1965 1967
social sciences life sciences
19951966
National Space
Science Data Center
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
astronomy
earth sciences
Data publishing in the hands of the
researcher
Data publishing in the hands of the
researcher
Dataverse
Dryad
2006
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
2006 2011
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
Zenodo
2006 2011 2013
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
Zenodo
2006 2011 2013
Dublin Core
metadata
Data
Description
Initiative
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
Zenodo
2006 2009 2011 2013
Dublin Core
metadata
Data
Description
Initiative
DataCite
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
Zenodo
2006 2009 2011 2013
Dublin Core
metadata
Data
Description
Initiative
DataCite Data Citation
Principles
Data publishing in the hands of the
researcher
Dataverse
Dryad
Figshare
Zenodo
2006 2009 2011 2013
Dublin Core
metadata
Data
Description
Initiative
DataCite Data Citation
Principles
# of (all types of ) data repositories from 2012 to 2016
source: r3data.org
1,500
“Research	
  data	
  publishing	
  is	
  the	
  release	
  of	
  research	
  
data,	
  associated	
  metadata,	
  accompanying	
  
documenta8on,	
  and	
  so9ware	
  code	
  (in	
  cases	
  where	
  
the	
  raw	
  data	
  have	
  been	
  processed	
  or	
  manipulated)	
  
for	
  re-­‐use	
  and	
  analysis	
  in	
  such	
  a	
  manner	
  that	
  they	
  
can	
  be	
  discovered	
  on	
  the	
  Web	
  and	
  referred	
  to	
  in	
  a	
  
unique	
  and	
  persistent	
  way.”
RDA Data Publishing Workflows Working Group; 10.5281/zenodo.34542
Best Practices for data publishing
Best Practices for data publishing
• Data Citation: to reference and locate with a persistent
identifier, and give credit to data authors
Best Practices for data publishing
• Data Citation: to reference and locate with a persistent
identifier, and give credit to data authors
• Metadata: to discover and reuse
Best Practices for data publishing
• Data Citation: to reference and locate with a persistent
identifier, and give credit to data authors
• Metadata: to discover and reuse
• Access control rules: to support publishing workflows and
data agreements and licenses, and protect privacy
Best Practices for data publishing
• Data Citation: to reference and locate with a persistent
identifier, and give credit to data authors
• Metadata: to discover and reuse
• Access control rules: to support publishing workflows and
data agreements and licenses, and protect privacy
• APIs and standards: to interoperate
Best Practices for data publishing
• Data Citation: to reference and locate with a persistent
identifier, and give credit to data authors
• Metadata: to discover and reuse
• Access control rules: to support publishing workflows and
data agreements and licenses, and protect privacy
• APIs and standards: to interoperate
The FAIR Guiding Principles scientific data management and stewardship, 2016; NIH Data Commons Principles
The Role of Dataverse
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
• Gives credit and control to researchers
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
• Gives credit and control to researchers
• Builds a community to:
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
• Gives credit and control to researchers
• Builds a community to:
• define new standards and best practices
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
• Gives credit and control to researchers
• Builds a community to:
• define new standards and best practices
• foster new research and collaboration in data sharing
The Role of Dataverse
• Dataverse is an open-source software platform for building
data repositories
• Gives credit and control to researchers
• Builds a community to:
• define new standards and best practices
• foster new research and collaboration in data sharing
• Has brought data publishing into the hands of researchers
The Dataverse Software Today
Installed in 20 sites world wide; 

Hosts dataverses from > 500 institutions
The Dataverse Community Today
35

GitHub
contributors
300

members in the
community list
Semi-monthly

community calls
Annual 

Community Meeting, 

with near 200
participants
The Harvard Dataverse Today
~12 new
datasets
published
per day
65,000 datasets
The Harvard Dataverse Today
The Harvard Dataverse Today
1.95 Million downloads
~ 1,500
downloads
per day
Dataverse Today and Tomorrow
TB-PB
# of datasets
datasetsize
GBs
< GB
Distribution of Research Data
Dataverse Today and Tomorrow
TB-PB
# of datasets
datasetsize
GBs
< GB
Support range of Dataverse and most general
purpose data publishing platforms
custom data

publishing solutions
Distribution of Research Data
Dataverse Today and Tomorrow
TB-PB
# of datasets
datasetsize
GBs
< GB
Support range of Dataverse and most general
purpose data publishing platforms
custom data

publishing solutions
Expansion
Distribution of Research Data
Addressing next challenges
Addressing next challenges
• Provenance: Seltzer, Crosas, King
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
• Sensitive data: Harvard Privacy Tools, led by Vadhan
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
• Sensitive data: Harvard Privacy Tools, led by Vadhan
• Large-scale datasets; integration with remote/cloud storage
and computing:
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
• Sensitive data: Harvard Privacy Tools, led by Vadhan
• Large-scale datasets; integration with remote/cloud storage
and computing:
• Large datasets in biomedicine, Sliz & Crosas
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
• Sensitive data: Harvard Privacy Tools, led by Vadhan
• Large-scale datasets; integration with remote/cloud storage
and computing:
• Large datasets in biomedicine, Sliz & Crosas
• Big Data data in social science, King & Crosas
Addressing next challenges
• Provenance: Seltzer, Crosas, King
• Connect journals to data: King & Crosas
• Sensitive data: Harvard Privacy Tools, led by Vadhan
• Large-scale datasets; integration with remote/cloud storage
and computing:
• Large datasets in biomedicine, Sliz & Crosas
• Big Data data in social science, King & Crosas
• Cloud Dataverse, with Krieger & Saowarattitada (BU)
Making Data Accessible use case:
Structural Biology Data Grid
with Piotr Sliz (Harvard Medical School), Pete Meyer, Bill McKinney, Stephanie Socias,
Jason Key, Kyle Chard , and IQSS Dataverse team
X-ray diffraction images are the primary data
for crystallographic structure determination
• X-ray diffraction images collected
from frozen protein crystal

• Typically obtained at synchrotron
x-ray sources
doi:10.15785/SBGRID/19	
  
• Primary source for building the
protein models published in
Protein Data Bank
Why share structural biology primary data?
1 dataset is 180 - 360 images

3.5 - 7 GB 

Primary Data
1 file of ‘structure factors:

0.0001 -0.005 GB

Intermediate Data
Atomic Coordinates

0.0001 -0.001 GB

Model
• Replicate published model results

• Reprocess primary data with new, improved analysis
Why Structural Biology Data Grid +
Dataverse?
Why Structural Biology Data Grid +
Dataverse?
• Support data publication, data access, preservation, and
live analysis for primary structural biology data
Why Structural Biology Data Grid +
Dataverse?
• Support data publication, data access, preservation, and
live analysis for primary structural biology data
• Apply best practices in data sharing: data citation,
metadata, standards and protocols
Why Structural Biology Data Grid +
Dataverse?
• Support data publication, data access, preservation, and
live analysis for primary structural biology data
• Apply best practices in data sharing: data citation,
metadata, standards and protocols
• Use an open-source platform with a growing community as
a sustainable infrastructure for the repository
Dataverse dataflow now
Deposit and get data
and metadata through
UI,API, and OAI-PMH
Network File System (NFS)
Dataverse expansion
GridFTP with
authentication
through
Replication Success Rate
85 % success
Reprocessing Pipeline
Data Access Alliance:
Local Access through a growing list of satellites
Meyer et al., Nature Communications, 2016
San Diego
Supercomputer
Center
Petrel Storage
Argonne Labs
Harvard Medical
School Orchestra
Cluster Institut Pasteur de
Montevideo
Data Access Alliance Flow
Data transfer
through GridFTP
OAI set per
remote site
Summary
Summary
• In the last decade, data publishing emerges as a field in science
communication
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
• The Structural Biology Data Grid project shows how to:
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
• The Structural Biology Data Grid project shows how to:
• make data accessible in a sustainable way
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
• The Structural Biology Data Grid project shows how to:
• make data accessible in a sustainable way
• follow best practices in data publishing
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
• The Structural Biology Data Grid project shows how to:
• make data accessible in a sustainable way
• follow best practices in data publishing
• support larger scale data
Summary
• In the last decade, data publishing emerges as a field in science
communication
• Dataverse plays a key role leading data publishing and promoting best
practices for making data accessible
• Dataverse is expanding to support data provenance, sensitive data and
large scale data, while connecting data with publications and analysis.
• The Structural Biology Data Grid project shows how to:
• make data accessible in a sustainable way
• follow best practices in data publishing
• support larger scale data
• integrate data with analysis
“The	
  21st	
  of	
  April,	
  1665,	
  about	
  eight	
  
in	
  the	
  morning,	
  I	
  bored	
  a	
  hole	
  in	
  the	
  
body	
  of	
  a	
  fair	
  and	
  large	
  Birch,	
  and	
  
put	
  in	
  a	
  Cork	
  with	
  a	
  Quill	
  in	
  the	
  
middle;	
  aOer	
  a	
  Moment	
  or	
  two	
  it	
  [a	
  
sap]	
  began	
  to	
  drop,	
  but	
  yet	
  very	
  
soOly:	
  Some	
  three	
  Hours	
  aOer	
  I	
  
returned¸	
  and	
  it	
  had	
  filled	
  a	
  Pint	
  
Glass,	
  and	
  then	
  it	
  dropped	
  exceeding	
  
fast,	
  viz.	
  every	
  Pulse	
  a	
  Drop:	
  This	
  
Liquor	
  is	
  not	
  unpleasant	
  to	
  the	
  Taste,	
  
and	
  not	
  thick	
  or	
  troubled;	
  yet	
  it	
  looks	
  
as	
  though	
  some	
  few	
  drops	
  of	
  Milk	
  
were	
  split	
  in	
  a	
  Bason	
  of	
  Fountain	
  
Water.”	
  
	
  	
  
	
  (Lister,	
  1697)	
  	
  
Schemaac	
  of	
  solar	
  and	
  lunar	
  
eclipses	
  in	
  a	
  1665	
  paper	
  by	
  Hevelius	
  
(Philosophical	
  Transacaons,	
  Vol	
  1)	
  
Source:	
  Gross	
  Harmon,	
  Reidy,	
  Communica8ng	
  Science,	
  2002	
  

Making Data Accessible

  • 1.
    Making Data Accessible MercèCrosas Chief Data Science andTechnology Officer Institute for Quantitative Social Science Harvard University
  • 2.
    Scholarly output doublesevery 20 years from 1750s to 2000
  • 3.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 Source: Mabe,The Growth and Number of Journals, 2003 Philosophical Transactions of the Royal Society Nullius in verba
  • 4.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 3 journals Source: Mabe,The Growth and Number of Journals, 2003 Philosophical Transactions of the Royal Society Nullius in verba
  • 5.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 1800 3 journals 10 journals Source: Mabe,The Growth and Number of Journals, 2003 Philosophical Transactions of the Royal Society Nullius in verba
  • 6.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 1800 1900 3 journals 10 journals 400 journals Source: Mabe,The Growth and Number of Journals, 2003 Philosophical Transactions of the Royal Society Nullius in verba
  • 7.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 1800 1900 2000 3 journals 10 journals 14, 000 journals (peer-reviewed) 400 journals Source: Mabe,The Growth and Number of Journals, 2003 Philosophical Transactions of the Royal Society Nullius in verba
  • 8.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 1800 1900 2000 3 journals 10 journals 14, 000 journals (peer-reviewed) 400 journals Source: Mabe,The Growth and Number of Journals, 2003 Increase in specialization: every 100-150 authors = 1 new journal for 500-1000 readers Philosophical Transactions of the Royal Society Nullius in verba
  • 9.
    Scholarly output doublesevery 20 years from 1750s to 2000 1665 1700 1800 1900 2000 3 journals 10 journals 14, 000 journals (peer-reviewed) 400 journals Source: Mabe,The Growth and Number of Journals, 2003 Now: • 80,000 total journals • 33,000 peer-reviewed Increase in specialization: every 100-150 authors = 1 new journal for 500-1000 readers Philosophical Transactions of the Royal Society Nullius in verba
  • 10.
    Data are partof the article in the form of visuals and tables 1665 1700 1800 1900 2000
  • 11.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 1665 1700 1800 1900 2000
  • 12.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 1665 1700 1800 1900 2000
  • 13.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 14.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 15.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space First line graphs and bar charts: Playfair (1786) 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 16.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space First line graphs and bar charts: Playfair (1786) First pie chart: Playfair (1801) 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 17.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space First line graphs and bar charts: Playfair (1786) First pie chart: Playfair (1801) First scatterplots: Hershel (1833), Galton (1896) 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 18.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space First line graphs and bar charts: Playfair (1786) First pie chart: Playfair (1801) First scatterplots: Hershel (1833), Galton (1896) Data visuals evolve from illustrations to scientific arguments 50% with visuals, ~7 each article; integrated with tables 1665 1700 1800 1900 2000
  • 19.
    Data are partof the article in the form of visuals and tables 30% of articles with visuals; 5% have tables 30% with visuals; use of tables increases 90% with figures and tables; occupy 25% of space First line graphs and bar charts: Playfair (1786) First pie chart: Playfair (1801) First scatterplots: Hershel (1833), Galton (1896) Data visuals evolve from illustrations to scientific arguments 50% with visuals, ~7 each article; integrated with tables Source: Gross Harmon, Reidy, Communicating Science, 2002 1665 1700 1800 1900 2000
  • 20.
    Science communication adaptsto the increase in cognitive complexity
  • 21.
    Science communication adaptsto the increase in cognitive complexity • Gross, Herman and Reidy (2002):
  • 22.
    Science communication adaptsto the increase in cognitive complexity • Gross, Herman and Reidy (2002): • analyze the style, presentation, and argument of a sample of articles from 17th to 20th century
  • 23.
    Science communication adaptsto the increase in cognitive complexity • Gross, Herman and Reidy (2002): • analyze the style, presentation, and argument of a sample of articles from 17th to 20th century • and argue that during that time we developed devices for more efficient communication to compensate the increase of cognitive complexity
  • 24.
    Science communication adaptsto the increase in cognitive complexity • Gross, Herman and Reidy (2002): • analyze the style, presentation, and argument of a sample of articles from 17th to 20th century • and argue that during that time we developed devices for more efficient communication to compensate the increase of cognitive complexity • The increase in quantity and complexity of scholarly output is accompanied with an increase in research data
  • 25.
    Science communication adaptsto the increase in cognitive complexity • Gross, Herman and Reidy (2002): • analyze the style, presentation, and argument of a sample of articles from 17th to 20th century • and argue that during that time we developed devices for more efficient communication to compensate the increase of cognitive complexity • The increase in quantity and complexity of scholarly output is accompanied with an increase in research data • I argue that in the last decade data publishing is born as a new necessary device for more efficient communication of science
  • 26.
    Emergence of (domain-specific)digital data archives
  • 27.
    Emergence of (domain-specific)digital data archives 1957 Roper Center public and operational
  • 28.
    Emergence of (domain-specific)digital data archives 1957 1960 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany)
  • 29.
    Emergence of (domain-specific)digital data archives 1957 1960 1962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR
  • 30.
    Emergence of (domain-specific)digital data archives 1957 1960 19641962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands)
  • 31.
    Emergence of (domain-specific)digital data archives 1957 1960 19641962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive 1965
  • 32.
    Emergence of (domain-specific)digital data archives 1957 1960 19641962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive 1965 1966 National Space Science Data Center
  • 33.
    Emergence of (domain-specific)digital data archives 1957 1960 19641962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive 1965 19671966 National Space Science Data Center
  • 34.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 19701962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank 1965 19671966 National Space Science Data Center
  • 35.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 19671966 National Space Science Data Center
  • 36.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 19671966 National Space Science Data Center 1987 Astrophysics Data System
  • 37.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 19671966 National Space Science Data Center 1987 Astrophysics Data System 1990 EOSDIS
  • 38.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 1967 19951966 National Space Science Data Center Pangaea 1987 Astrophysics Data System 1990 EOSDIS
  • 39.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 1967 social sciences 19951966 National Space Science Data Center Pangaea 1987 Astrophysics Data System 1990 EOSDIS
  • 40.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 1967 social sciences 19951966 National Space Science Data Center Pangaea 1987 Astrophysics Data System 1990 EOSDIS astronomy
  • 41.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 1967 social sciences life sciences 19951966 National Space Science Data Center Pangaea 1987 Astrophysics Data System 1990 EOSDIS astronomy
  • 42.
    Emergence of (domain-specific)digital data archives 1957 1960 1964 1970 19821962 Roper Center public and operational Zentral Archiv für Empirische Sozialforschung (Germany) ICPSR Steinmetz Archive (Netherlands) ODUM Data Archive UK Data Archive Protein Data Bank European Nucleotide Archive GenBank 1965 1967 social sciences life sciences 19951966 National Space Science Data Center Pangaea 1987 Astrophysics Data System 1990 EOSDIS astronomy earth sciences
  • 43.
    Data publishing inthe hands of the researcher
  • 44.
    Data publishing inthe hands of the researcher Dataverse Dryad 2006
  • 45.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare 2006 2011
  • 46.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare Zenodo 2006 2011 2013
  • 47.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare Zenodo 2006 2011 2013 Dublin Core metadata Data Description Initiative
  • 48.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare Zenodo 2006 2009 2011 2013 Dublin Core metadata Data Description Initiative DataCite
  • 49.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare Zenodo 2006 2009 2011 2013 Dublin Core metadata Data Description Initiative DataCite Data Citation Principles
  • 50.
    Data publishing inthe hands of the researcher Dataverse Dryad Figshare Zenodo 2006 2009 2011 2013 Dublin Core metadata Data Description Initiative DataCite Data Citation Principles # of (all types of ) data repositories from 2012 to 2016 source: r3data.org 1,500
  • 51.
    “Research  data  publishing  is  the  release  of  research   data,  associated  metadata,  accompanying   documenta8on,  and  so9ware  code  (in  cases  where   the  raw  data  have  been  processed  or  manipulated)   for  re-­‐use  and  analysis  in  such  a  manner  that  they   can  be  discovered  on  the  Web  and  referred  to  in  a   unique  and  persistent  way.” RDA Data Publishing Workflows Working Group; 10.5281/zenodo.34542
  • 52.
    Best Practices fordata publishing
  • 53.
    Best Practices fordata publishing • Data Citation: to reference and locate with a persistent identifier, and give credit to data authors
  • 54.
    Best Practices fordata publishing • Data Citation: to reference and locate with a persistent identifier, and give credit to data authors • Metadata: to discover and reuse
  • 55.
    Best Practices fordata publishing • Data Citation: to reference and locate with a persistent identifier, and give credit to data authors • Metadata: to discover and reuse • Access control rules: to support publishing workflows and data agreements and licenses, and protect privacy
  • 56.
    Best Practices fordata publishing • Data Citation: to reference and locate with a persistent identifier, and give credit to data authors • Metadata: to discover and reuse • Access control rules: to support publishing workflows and data agreements and licenses, and protect privacy • APIs and standards: to interoperate
  • 57.
    Best Practices fordata publishing • Data Citation: to reference and locate with a persistent identifier, and give credit to data authors • Metadata: to discover and reuse • Access control rules: to support publishing workflows and data agreements and licenses, and protect privacy • APIs and standards: to interoperate The FAIR Guiding Principles scientific data management and stewardship, 2016; NIH Data Commons Principles
  • 58.
    The Role ofDataverse
  • 59.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories
  • 60.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories • Gives credit and control to researchers
  • 61.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories • Gives credit and control to researchers • Builds a community to:
  • 62.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories • Gives credit and control to researchers • Builds a community to: • define new standards and best practices
  • 63.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories • Gives credit and control to researchers • Builds a community to: • define new standards and best practices • foster new research and collaboration in data sharing
  • 64.
    The Role ofDataverse • Dataverse is an open-source software platform for building data repositories • Gives credit and control to researchers • Builds a community to: • define new standards and best practices • foster new research and collaboration in data sharing • Has brought data publishing into the hands of researchers
  • 65.
    The Dataverse SoftwareToday Installed in 20 sites world wide; Hosts dataverses from > 500 institutions
  • 66.
    The Dataverse CommunityToday 35 GitHub contributors 300 members in the community list Semi-monthly community calls Annual Community Meeting, with near 200 participants
  • 67.
    The Harvard DataverseToday ~12 new datasets published per day 65,000 datasets
  • 68.
  • 69.
    The Harvard DataverseToday 1.95 Million downloads ~ 1,500 downloads per day
  • 70.
    Dataverse Today andTomorrow TB-PB # of datasets datasetsize GBs < GB Distribution of Research Data
  • 71.
    Dataverse Today andTomorrow TB-PB # of datasets datasetsize GBs < GB Support range of Dataverse and most general purpose data publishing platforms custom data publishing solutions Distribution of Research Data
  • 72.
    Dataverse Today andTomorrow TB-PB # of datasets datasetsize GBs < GB Support range of Dataverse and most general purpose data publishing platforms custom data publishing solutions Expansion Distribution of Research Data
  • 73.
  • 74.
    Addressing next challenges •Provenance: Seltzer, Crosas, King
  • 75.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas
  • 76.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas • Sensitive data: Harvard Privacy Tools, led by Vadhan
  • 77.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas • Sensitive data: Harvard Privacy Tools, led by Vadhan • Large-scale datasets; integration with remote/cloud storage and computing:
  • 78.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas • Sensitive data: Harvard Privacy Tools, led by Vadhan • Large-scale datasets; integration with remote/cloud storage and computing: • Large datasets in biomedicine, Sliz & Crosas
  • 79.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas • Sensitive data: Harvard Privacy Tools, led by Vadhan • Large-scale datasets; integration with remote/cloud storage and computing: • Large datasets in biomedicine, Sliz & Crosas • Big Data data in social science, King & Crosas
  • 80.
    Addressing next challenges •Provenance: Seltzer, Crosas, King • Connect journals to data: King & Crosas • Sensitive data: Harvard Privacy Tools, led by Vadhan • Large-scale datasets; integration with remote/cloud storage and computing: • Large datasets in biomedicine, Sliz & Crosas • Big Data data in social science, King & Crosas • Cloud Dataverse, with Krieger & Saowarattitada (BU)
  • 81.
    Making Data Accessibleuse case: Structural Biology Data Grid with Piotr Sliz (Harvard Medical School), Pete Meyer, Bill McKinney, Stephanie Socias, Jason Key, Kyle Chard , and IQSS Dataverse team
  • 82.
    X-ray diffraction imagesare the primary data for crystallographic structure determination • X-ray diffraction images collected from frozen protein crystal • Typically obtained at synchrotron x-ray sources doi:10.15785/SBGRID/19   • Primary source for building the protein models published in Protein Data Bank
  • 83.
    Why share structuralbiology primary data? 1 dataset is 180 - 360 images 3.5 - 7 GB Primary Data 1 file of ‘structure factors: 0.0001 -0.005 GB Intermediate Data Atomic Coordinates 0.0001 -0.001 GB Model • Replicate published model results • Reprocess primary data with new, improved analysis
  • 84.
    Why Structural BiologyData Grid + Dataverse?
  • 85.
    Why Structural BiologyData Grid + Dataverse? • Support data publication, data access, preservation, and live analysis for primary structural biology data
  • 86.
    Why Structural BiologyData Grid + Dataverse? • Support data publication, data access, preservation, and live analysis for primary structural biology data • Apply best practices in data sharing: data citation, metadata, standards and protocols
  • 87.
    Why Structural BiologyData Grid + Dataverse? • Support data publication, data access, preservation, and live analysis for primary structural biology data • Apply best practices in data sharing: data citation, metadata, standards and protocols • Use an open-source platform with a growing community as a sustainable infrastructure for the repository
  • 88.
    Dataverse dataflow now Depositand get data and metadata through UI,API, and OAI-PMH Network File System (NFS)
  • 89.
  • 90.
  • 91.
  • 93.
    Data Access Alliance: LocalAccess through a growing list of satellites Meyer et al., Nature Communications, 2016 San Diego Supercomputer Center Petrel Storage Argonne Labs Harvard Medical School Orchestra Cluster Institut Pasteur de Montevideo
  • 94.
    Data Access AllianceFlow Data transfer through GridFTP OAI set per remote site
  • 95.
  • 96.
    Summary • In thelast decade, data publishing emerges as a field in science communication
  • 97.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible
  • 98.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis.
  • 99.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis. • The Structural Biology Data Grid project shows how to:
  • 100.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis. • The Structural Biology Data Grid project shows how to: • make data accessible in a sustainable way
  • 101.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis. • The Structural Biology Data Grid project shows how to: • make data accessible in a sustainable way • follow best practices in data publishing
  • 102.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis. • The Structural Biology Data Grid project shows how to: • make data accessible in a sustainable way • follow best practices in data publishing • support larger scale data
  • 103.
    Summary • In thelast decade, data publishing emerges as a field in science communication • Dataverse plays a key role leading data publishing and promoting best practices for making data accessible • Dataverse is expanding to support data provenance, sensitive data and large scale data, while connecting data with publications and analysis. • The Structural Biology Data Grid project shows how to: • make data accessible in a sustainable way • follow best practices in data publishing • support larger scale data • integrate data with analysis
  • 104.
    “The  21st  of  April,  1665,  about  eight   in  the  morning,  I  bored  a  hole  in  the   body  of  a  fair  and  large  Birch,  and   put  in  a  Cork  with  a  Quill  in  the   middle;  aOer  a  Moment  or  two  it  [a   sap]  began  to  drop,  but  yet  very   soOly:  Some  three  Hours  aOer  I   returned¸  and  it  had  filled  a  Pint   Glass,  and  then  it  dropped  exceeding   fast,  viz.  every  Pulse  a  Drop:  This   Liquor  is  not  unpleasant  to  the  Taste,   and  not  thick  or  troubled;  yet  it  looks   as  though  some  few  drops  of  Milk   were  split  in  a  Bason  of  Fountain   Water.”        (Lister,  1697)     Schemaac  of  solar  and  lunar   eclipses  in  a  1665  paper  by  Hevelius   (Philosophical  Transacaons,  Vol  1)   Source:  Gross  Harmon,  Reidy,  Communica8ng  Science,  2002