CLOUD
DATAVERSE
Mercè Crosas1
Orran Krieger2
Piyanai Saowarattitada2
1Institute for Quantitative Social Science (IQSS), Harvard University
2Massachusetts Open Cloud (MOC), Boston University
DATA REPOSITORIES NEED CLOUDS
CLOUDS NEED DATA REPOSITORIES
This Talk
1. The Need
– The rise of big data-centric computation
– The rise of modern data repositories
2. Our Platforms
– Dataverse: A premier open-source data repository platform
– MOC: Top collaborative OpenStack cloud with Big Data compute
3. The Solution
– Cloud Dataverse: Bringing MOC and Dataverse together
AWS sees the value in data
“When data is made publicly available on
AWS, anyone can analyze any volume of
data without needing to download or
store it themselves.”
Data and Compute lead to
Discovery
A wide range of fields and industries
can benefit from access to data
But, AWS public datasets miss key
aspects needed in data repositories
• Incentives to share data
• Citation to each version of the data
• Metadata for Discoverability
• Tiered access to non-public data
• Commitment to data archival & preservation
The scientific community has been thinking
about data archives and repositories for some
time
1957
Roper
Center
public and
operational
1960
Zentral Archiv für
Empirische
Sozialforschung
(Germany)
1962
ICPSR
1964
Steinmetz
Archive
(Netherlands)
ODUM
Data Archive
1965 1970
Protein
Data Bank
1982
European
Nucleotide
Archive
GenBank
social sciences life sciences
UK Data
Archive
1967 1966
National Space
Science Data
Center
1995
Pangaea
1987
Astrophysics
Data System
1990
EOSDIS
astronomy
earth sciences
Number of data repositories grows
with growth of data sharing
Dryad Figshare
Zenodo
2006 2009 2011 2013
DataCite
Data
Citation
Principles
Number of data repositories (all types), 2012 to 2016
source: re3data.org
> 1,500 research data repositories
Today’s repositories incentivize data sharing by giving
credit to data authors through formal citation
Persistent citations to
datasets published in
data repositories
Bibliography
The Dataverse open-source platform
enables building any type of data
repository
Agriculture data
Repository in
Fudan, China
Data from 20 Universities
Public data repository
Science Consortium
Challenges
• Datasets have to be small
• Hard to copy 40 PB over the internet
• Not everyone has the right compute infrastructure
DATAVERSE NEEDED A CLOUD
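To put the "hard to copy 40 PB" bullet in numbers, here is a rough back-of-the-envelope sketch. The link speeds are illustrative assumptions (the slides do not name one), and the calculation assumes a fully sustained rate with no protocol overhead, so real transfers would take longer.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link at a fully sustained rate."""
    seconds = (size_bytes * 8) / link_bits_per_second
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes

# Hypothetical link speeds, chosen only for illustration.
for label, rate in [("1 Gb/s", 1e9), ("10 Gb/s", 1e10), ("100 Gb/s", 1e11)]:
    print(f"{label}: {transfer_days(40 * PETABYTE, rate):,.0f} days")
```

Even on a dedicated 10 Gb/s link, the copy takes about a year under these assumptions, which is the case for moving the compute to the data.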
The Massachusetts Open
Cloud – an Open Cloud
eXchange
Imagine shrinking
Pacific Research
Platform to the
size of a building
Consortium comparable to Pacific Research Platform
• Huge community covering every field of research
• Collaborations across the globe
• Massive data and computational requirements
• Massive student population covering every discipline
Widths are proportional
to enrollment
MGHPCC Data Center
15 MW, 90,000 square feet + can grow
Tens of thousands of HPC users, potentially many
more cloud users
The MOC partnership
Today’s model of Cloud
What we need:
an “Open Cloud eXchange (OCX)”
C3DDB
HPC
Big Data
Web
OpenStack is great, but… where is the data?
• We need to:
– share data between providers
– expose cloud metadata to researchers/companies
• Our scientific users need
– in-situ compute on public & community data sets
– control over with whom and how their data is shared
– a reduced barrier to exploiting rich tools to compute on the data
• Our commercial & public sector partners need to
– share data with researchers/startups
– reduce the risk/barrier of publishing data
– have a model to expose technology in an environment with rich data
The MOC needs a modern dataset repository
OpenStack needs a dataset repository project
Dataverse Before Cloud Dataverse
Diagram labels: Data depositor; Data users; Compute
Dataverse After Cloud Dataverse
Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji; UI
What’s missing in Cloud?
Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji; UI
What’s missing in Dataverse?
Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji
So what is Cloud Dataverse?
Diagram labels: Swift (Object Storage)
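Cloud Dataverse stores uploaded data sets in Swift object storage. As a minimal sketch of what such an upload looks like at the HTTP level, the snippet below builds (but does not send) an object PUT request in the style of the OpenStack Swift REST API. The storage URL, container name, object name, and token are all hypothetical placeholders, and this is not a description of how Dataverse itself implements the upload.

```python
import urllib.request
from urllib.parse import quote


def build_swift_put(storage_url: str, container: str, object_name: str,
                    data: bytes, token: str) -> urllib.request.Request:
    """Build a Swift-style object PUT: {storage_url}/{container}/{object}."""
    url = f"{storage_url.rstrip('/')}/{quote(container)}/{quote(object_name)}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "X-Auth-Token": token,  # token obtained from the cloud's identity service
            "Content-Type": "application/octet-stream",
        },
    )


# All values below are made up for illustration.
req = build_swift_put(
    "https://swift.example.org/v1/AUTH_demo",
    "dataverse-datasets",
    "geotweets/cold.csv",
    b"lat,lon,text\n",
    "hypothetical-token",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)` against a real endpoint with a valid token; the sketch stops short of that so it can run anywhere.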
DEMO: Billion Object Platform (BOP) GeoTweets
Summary: BOP GeoTweets COLD Demo
Diagram labels: Data depositor; Data users/analyst; Tweets; BOP; BOP GeoTweets COLD report; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji
Dataverse
Community
review
SUMMER 2016
FALL 2016 JANUARY 2017
DECEMBER 2016
POC
Barcelona
OpenStack
Summit
#vBrownBag
Full
Collaborative
Development
Begins
MAY 2017
Boston
OpenStack
Summit :
*Swift per
repository
*URI
*Demos
SUMMER 2017
Worldwide
Data
Federation
A Year in the life of Cloud Dataverse
MOC
Annual
Workshop
POC
Preview
OCTOBER 2016
World Wide Data Federation
DATA REPOSITORIES NEED CLOUDS
CLOUDS NEED DATA REPOSITORIES
With Cloud Dataverse, we combine the power
and scalability of OpenStack cloud with the need
to access data using a feature-rich repository
THANKS

Cloud Dataverse: A Data repository platform for an OpenStack Cloud

  • 1.
    CLOUD DATAVERSE Mercè Crosas1 Orran Krieger2 Piyanai Saowarattitada2 1Institute for Quantitative Social Science (IQSS), Harvard University 2Massachusetts Open Cloud (MOC), Boston University
  • 2.
    DATA REPOSITORIES NEED CLOUDS CLOUDS NEED DATA REPOSITORIES
  • 3.
    This Talk 1. The Need – The rise of big data-centric computation – The rise of modern data repositories 2. Our Platforms – Dataverse: A premier open-source data repository platform – MOC: Top collaborative OpenStack cloud with Big Data compute 3. The Solution – Cloud Dataverse: Bringing MOC and Dataverse together
  • 4.
    AWS sees the value in data “When data is made publicly available on AWS, anyone can analyze any volume of data without needing to download or store it themselves.”
  • 5.
    Data and Compute lead to Discovery
  • 6.
    A wide range of fields and industries can benefit from access to data
  • 7.
    But, AWS public datasets miss key aspects needed in data repositories • Incentives to share data • Citation to each version of the data • Metadata for Discoverability • Tiered access to non-public data • Commitment to data archival & preservation
  • 8.
    The scientific community has been thinking about data archives and repositories for some time 1957 Roper Center public and operational 1960 Zentral Archiv für Empirische Sozialforschung (Germany) 1962 ICPSR 1964 Steinmetz Archive (Netherlands) ODUM Data Archive 1965 1970 Protein Data Bank 1982 European Nucleotide Archive GenBank social sciences life sciences UK Data Archive 1967 1966 National Space Science Data Center 1995 Pangaea 1987 Astrophysics Data System 1990 EOSDIS astronomy earth sciences
  • 9.
    Number of data repositories grows with growth of data sharing Dryad Figshare Zenodo 2006 2009 2011 2013 DataCite Data Citation Principles Number of data repositories (all types), 2012 to 2016 source: re3data.org > 1,500 research data repositories
  • 10.
    Today’s repositories incentivize data sharing by giving credit to data authors through formal citation Persistent citations to datasets published in data repositories Bibliography
  • 11.
    The Dataverse open-source platform enables building any type of data repository Agriculture data Repository in Fudan, China Data from 20 Universities Public data repository Science Consortium
  • 12.
    Challenges • Datasets have to be small • Hard to copy 40 PB over the internet • Not everyone has the right compute infrastructure DATAVERSE NEEDED A CLOUD
  • 13.
    The Massachusetts Open Cloud – an Open Cloud eXchange
  • 14.
  • 15.
    Imagine shrinking Pacific Research Platform to the size of a building Consortium comparable to Pacific Research Platform • Huge community covering every field of research • Collaborations across the globe • Massive data and computational requirements • Massive student population covering every discipline Widths are proportional to enrollment
  • 16.
    MGHPCC Data Center 15 MW, 90,000 square feet + can grow Tens of thousands of HPC users, potentially many more cloud users
  • 17.
  • 18.
  • 19.
    What we need: an “Open Cloud eXchange (OCX)” C3DDB HPC Big Data Web
  • 20.
    OpenStack is great, but… where is the data? • We need to: – share data between providers – expose cloud metadata to researchers/companies • Our scientific users need – in-situ compute on public & community data sets – control over with whom and how their data is shared – a reduced barrier to exploiting rich tools to compute on the data • Our commercial & public sector partners need to – share data with researchers/startups – reduce the risk/barrier of publishing data – have a model to expose technology in an environment with rich data The MOC needs a modern dataset repository OpenStack needs a dataset repository project
  • 21.
    Dataverse Before Cloud Dataverse. Diagram labels: Data depositor; Data users; Compute
  • 22.
    Dataverse After Cloud Dataverse. Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji; UI
  • 23.
    What’s missing in Cloud? Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji; UI
  • 24.
    What’s missing in Dataverse? Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute); Sahara (Analytics); Horizon; Giji
  • 25.
    So what is Cloud Dataverse? Diagram labels: Swift (Object Storage)
  • 26.
    DEMO: Billion Object Platform (BOP) GeoTweets
  • 27.
  • 28.
    Dataverse Community review SUMMER 2016 FALL 2016 JANUARY 2017 DECEMBER 2016 POC Barcelona OpenStack Summit #vBrownBag Full Collaborative Development Begins MAY 2017 Boston OpenStack Summit : *Swift per repository *URI *Demos SUMMER 2017 Worldwide Data Federation A Year in the life of Cloud Dataverse MOC Annual Workshop POC Preview OCTOBER 2016
  • 29.
    World Wide Data Federation
  • 30.
    DATA REPOSITORIES NEED CLOUDS CLOUDS NEED DATA REPOSITORIES With Cloud Dataverse, we combine the power and scalability of OpenStack cloud with the need to access data using a feature-rich repository
  • 31.

Editor's Notes

  • #15 Scale down from full picture of PRP
  • #16 Scale down from full picture of PRP
  • #19 We couldn’t
  • #20 We couldn’t
  • #21 Public data sets used by many different groups; compute on massive datasets in-situ; host data sets from a wide variety of scientific disciplines; a way to reduce the risk/barrier of publishing data; a way for researchers and startups to discover datasets of interest and request permission.
  • #22 Traditional DV provides us with a rich repository that incentivizes the data depositor with persistent citation. Once the data sets are in Dataverse, we can run data analytics on them using any desired compute platform. Often, the compute platform is not co-located with the DV, forcing data users to copy those data sets over the internet.
  • #23 So what does a typical cloud bring to DV? The cloud provides an extensive set of already available services and resources: compute, (object) storage, a data processing framework, and a way for users to access the services, i.e., the UI.
  • #24 What is missing in OS is a way for data users to access the powerful cloud services/resources. Giji, a home-grown UI, fills that role. Giji is a simple UI that acts as a gateway to MOC services, and CDV is one such service.
  • #25 Once the OS helper Giji is in place, let’s look at the DV end. In DV, we enable uploaded data sets to be stored in Swift object storage. At that point, the uploaded data sets are ready for computation using traditional tools such as R. The last piece of the puzzle is a way to activate the services provided by the cloud platform. For this (CLICK) we add a compute button to DV. This button serves as the connection between DV and its cloud compute platform.
  • #26 So what changes have we made to DV? First, we enable users to upload data sets to Swift object storage. Second, we give users the capability to reach a cloud compute platform that is co-located with the DV repository. Both features are available right in the DV repository.
  • #27 In this demo, you will see a use case in which we use another Harvard IQSS creation, the Billion Object Platform (BOP), to initially sort the data set by time, space, and keyword, and then upload the sorted data to CDV. We will take you on the journey of this large sorted data set from BOP to CDV to Giji, then Sahara, and finally to the Spark cluster where the data analytics occur. The data set contains lat-long Twitter tweets from mobile devices, screened/sorted in BOP for the word COLD, from Feb 4th, 2014 to last Thursday, May 4th, when this demo was created. WAIT FOR SPARK JOB TO COMPLETE 29 – 13 STOP TALKING AT 14 This is the actual Spark processing and response time, which took about 13 seconds. Yes, it’s actually successful, as you will see…
  • #28 So what have we just seen? The sophisticated medical researcher, Lucy, goes to BOP (BOP is an instance in another of MOC’s OS clouds) and asks for a specific set of tweets, sorted for the word “COLD”, from Feb. 2014 to last Thursday. When Lucy pushes the DATAVERSE button on BOP, BOP, on behalf of Lucy, simply uploads her BOP GeoTweets COLD data set to Swift via the cloud-enabled CDV. Lucy now has the data set ready for compute via the Swift object store. Better, Lucy creates her compute cluster (Spark) and the job she’s interested in. Once she launches the cluster and the job, she can check the output of her simple word sort in Apache Spark.
  • #29 We are making this first
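
The demo in notes #27 and #28 screens geotagged tweets for the word COLD over a date range. The sketch below shows that kind of keyword-and-date filter in plain Python; it is a stand-in for illustration, not the BOP query or the Spark job from the demo. The `Tweet` record, the sample tweets, and the end-date year are assumptions.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tweet:
    """Hypothetical record shape; the real BOP schema is not given in the notes."""
    day: date
    lat: float
    lon: float
    text: str


def matches(tweet: Tweet, keyword: str, start: date, end: date) -> bool:
    """True if the tweet falls in [start, end] and contains keyword as a whole word."""
    in_range = start <= tweet.day <= end
    pattern = rf"\b{re.escape(keyword)}\b"
    return in_range and re.search(pattern, tweet.text, re.IGNORECASE) is not None


def daily_counts(tweets, keyword: str, start: date, end: date) -> dict:
    """Count matching tweets per day."""
    counts = {}
    for tweet in tweets:
        if matches(tweet, keyword, start, end):
            counts[tweet.day] = counts.get(tweet.day, 0) + 1
    return counts


# Made-up sample data for illustration only.
sample = [
    Tweet(date(2014, 2, 4), 42.36, -71.06, "So COLD in Boston today"),
    Tweet(date(2014, 2, 4), 40.71, -74.01, "Caught a cold, staying in"),
    Tweet(date(2014, 2, 5), 41.88, -87.63, "Warm coffee, scolding hot"),
    Tweet(date(2013, 12, 31), 42.36, -71.06, "cold new year's eve"),
]
result = daily_counts(sample, "cold", date(2014, 2, 4), date(2017, 5, 4))
print(result)
```

The whole-word match keeps "scolding" from counting as a hit, and the tweet dated before the start of the range is excluded.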