2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Trust threads: Provenance for Data Reuse in Long Tail Science (Beth Plale)
Invited colloquium talk, Apr 23, 2015, Dept. of Information and Library Science, School of Informatics and Computing, Indiana University. Abstract: The world contains a vast amount of digital information, which grows ever vaster ever more rapidly. This makes it possible to do many things on an unprecedented scale: spot social trends, prevent diseases, increase fresh water supplies, accelerate innovation, and so on. As science and technology innovation is essential to improved public health and welfare, the growing sources of data can unlock more secrets. But the rapid growth of data makes accountability and transparency of research increasingly difficult. Data that are not adequately described are not usable except within the research lab that produced them. Data that are intentionally or unintentionally inaccessible, or difficult to access and verify, are not available to contribute to new forms of research. In this talk I show that data can carry with it thin threads of information that connect it to both its past and its future, forming its lineage, particularly as it transitions into a shareable dataset residing in a public repository. In carrying this minimal provenance, the data becomes more trustworthy. This thread of trust is a critical element to the successful sharing, use, and reuse of big data in science and technology research in the future.
Trust Threads: Active Curation and Publishing in SEAD (Beth Plale)
Describes Trust Threads, a minimalist approach to provenance capture that enhances the trustworthiness of published data, implemented as part of SEAD's Active Curation and Publishing Services. Presented at the National Data Integrity Conference, Ft. Collins, Colorado, May 2015.
Presentation of Science 2.0 at the European Astronomical Society (osimod)
The document discusses Science 2.0 and the emerging open science ecosystem. It provides three examples of open science projects: Galaxy Zoo, which had volunteers classify galaxies; Synaptic Leap, which published all data and experiments online to identify a new drug; and a paper on debt and growth that was found to have errors after its data and methods were shared. It then outlines various aspects of open science like open data, citizen science, and mass collaboration.
This document discusses licensing research data for reuse. It begins with a scenario in which a user has downloaded a dataset but is unsure what they can do with the data because its licensing is unclear. It then explains that licensing is critical to enabling data reuse and citation. It provides information on AusGOAL, the Australian open access and licensing framework, and notes that it is recommended for data publishing by ANDS partners. It also includes links to licensing guides and FAQs. In summary, the document emphasizes the importance of data licensing for enabling reuse and outlines Australia's recommended licensing system.
The document discusses open data and data sharing, including defining open data, the benefits of open data, overcoming barriers to opening data such as concerns about scooping and sensitive data, best practices for making data open through formats, licensing and description, and the role of research databases and data citation in promoting open data.
Data, Data Everywhere: What's a Publisher to Do? (Anita de Waard)
The document discusses publishers' roles in data sharing and challenges in open science. It notes that while most scientists agree access to others' data would benefit research, fewer are willing to share their own data due to lack of training and incentives. Publishers are working to establish data sharing guidelines and integrate platforms to store, share, and analyze research data and tools. However, many questions remain around publishing data science given distributed and interconnected data, tools, and knowledge networks. Publishers will need to transition from pipelines to platforms and enable these new network effects.
Massive-Scale Analytics Applied to Real-World Problems (inside-BigData.com)
In this deck from PASC18, David Bader from Georgia Tech presents: Massive-Scale Analytics Applied to Real-World Problems.
"Emerging real-world graph problems include: detecting and preventing disease in human populations; revealing community structure in large social networks; and improving the resilience of the electric power grid. Unlike traditional applications in computational science and engineering, solving these social problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for research on scalable algorithms and development of frameworks for solving these real-world problems on high performance computers, and for improved models that capture the noise and bias inherent in the torrential data streams. In this talk, Bader will discuss the opportunities and challenges in massive data-intensive computing for applications in social sciences, physical sciences, and engineering."
Watch the video: https://wp.me/p3RLHQ-iPk
Learn more: https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This presentation was provided by Joe Zucca of the University of Pennsylvania, during Session Five of the NISO event "Assessment Practices and Metrics for the 21st Century," held on November 22, 2019.
A talk I gave at the MMDS workshop in June 2014 on the Myria system, as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
This document provides an overview of advanced research computing resources and services available to researchers at the University of York. It describes the research computing facilities including research0, the York Advanced Research Computing Cluster (YARCC), the regional N8 HPC facility, and the national ARCHER HPC service. It also covers storage, virtual machines, databases, software, support and training resources, research data management, and includes case studies of researchers using the facilities. The resources aim to support researchers by providing computing power for complex analysis and large datasets that is faster and more productive than standard desktop computers.
This document provides an overview of the research conducted by the NGSP Group at Swinburne University of Technology on cloud computing and workflow technologies. The group conducts research on data management in cloud computing, performance management in scientific workflows, security and privacy protection in the cloud, and their SwinDeW-C cloud workflow system. Specific topics studied include data storage, placement and replication strategies, temporal quality of service in workflows, and verifying temporal constraints in scientific workflows. The goal is to develop cost-effective and high performance techniques for complex software systems and services in cloud computing environments.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
Big Data as a Catalyst for Collaboration & Innovation (Philip Bourne)
Big data is disrupting biomedical research through digitization of data sources. The National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative to support this disruption. BD2K funds various programs including data sharing policies, data science training, and the development of shared infrastructure and standards. This infrastructure includes the "Commons" which would provide discoverable, accessible, interoperable and reusable research objects to catalyze collaboration using open APIs and computing platforms. SRP could interact with BD2K through initiatives like open science competitions, data standards development, and leadership in trans-NIH big data efforts.
The document provides an overview of the development of the NIH Data Commons. It discusses factors driving the need for a data commons, including large amounts of data being generated and increased support for data sharing. It outlines the goals of making data findable, accessible, interoperable and reusable. Several pilots are exploring the feasibility of the commons framework, including placing large datasets in the cloud and developing indexing methods. Considerations in fully realizing the commons are also discussed, such as standards, discoverability, policies and incentives.
This document discusses big data, including its characteristics of volume, velocity, and variety. It outlines challenges of big data such as privacy and security issues, analytical challenges, and the technical challenges of storing, transferring, and processing large datasets. Advantages such as understanding customers and optimizing processes are also presented. The conclusion emphasizes that addressing these challenges is key to realizing value from big data through talent, teams, and analytics-based decisions.
This document discusses Science 2.0 and the shift towards more open and collaborative ways of conducting science. It provides three examples of Science 2.0 projects: Galaxyzoo, which had over 150,000 volunteers classify galaxies; Synaptic Leap, which published all data and experiments online to collaborate on finding new drug treatments; and a study on government debt that was found to have coding errors after others accessed the original data. The document argues that Science 2.0 involves more than just open access, and includes data-intensive science, citizen science, open code, and open lab books/workflows. It discusses how different Science 2.0 practices are growing at different rates and the implications this shift has for scientific outputs, methods,
Australia's Environmental Predictive Capability (TERN Australia)
Federating world-leading research, data and technical capabilities to create Australia's National Environmental Prediction System (NEPS).
Community consultation presentation.
3-12 February 2020
Dr Michelle Barker (Facilitator)
(Presentation v5)
Facilitating good research data management practice as part of scholarly publ... (Varsha Khodiyar)
Presentation given to the SciDataCon #IDW2018 session: Democratising Data Publishing: A Global Perspective, on Tuesday 6th November 2018, Gaborone, Botswana
Presentation from the 2013 Bio-IT World conference. It describes the design and implementation of data and compute infrastructure for the New York Genome Center.
A look back at how the practice of data science has evolved over the years, modern trends, and where it might be headed in the future. Starting from before anyone had the title "data scientist" on their resume, to the dawn of the cloud and big data, and the new tools and companies trying to push the state of the art forward. Finally, some wild speculation on where data science might be headed.
Presentation given to Seattle Data Science Meetup on Friday July 24th 2015.
Real-World Data Challenges: Moving Towards Richer Data Ecosystems (Anita de Waard)
The document discusses trends in scientific data repositories and ecosystems. It notes that repositories are becoming more like virtual laboratories where scientists can conduct research. It also discusses how artificial intelligence and machine learning are being used to complement human discovery and analysis of large and complex datasets. The document raises several challenges around issues such as data ownership, rewards for data sharing and software development, and the roles of various stakeholders in research data management.
Data Repositories: Recommendation, Certification and Models for Cost Recovery (Anita de Waard)
Talk at the NITRD Workshop "Measuring the Impact of Digital Repositories", February 28 - March 1, 2017. https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
A 25-minute talk from a panel on big data curricula at JSM 2013.
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
Revolutionising the Journal through Big Data Computational Research (Amye Kenall)
BioMed Central is an open access publisher that publishes over 260 journals annually covering fields like genomics, computational biology, and public health. The document discusses BioMed Central's efforts to revolutionize journals through encouraging data reuse and reproducibility in computational research. This includes providing datasets used in articles, applying DOIs to additional files to improve searchability and citation, and exploring options like interactive tabular data and virtual machines to facilitate replicating analyses. Challenges discussed include balancing included versus external data sizes, dataset versioning, and encouraging author data sharing.
A brief overview of the development and current workflows for Research Data Management at Imperial College London, presented to colleagues at the University of Copenhagen and Roskilde University in Denmark.
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a... (SEAD)
This document discusses the Sustainable Environment Actionable Data (SEAD) project, which aims to lower the costs and increase the value of data curation through a data lifecycle approach. SEAD provides lightweight data services to support sustainability research, including secure project workspaces, active and social curation tools, and integrated lifecycle support for data from ingest to long-term preservation. By leveraging technologies like Web 2.0 and standards, SEAD simplifies and automates curation processes using metadata captured from data producers and users. This allows curation activities to begin earlier in the data lifecycle and be distributed across researchers and curators.
If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That's not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than of diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
Big Data HPC Convergence and a bunch of other things (Geoffrey Fox)
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
Birgit Plietzsch, "RDM within research computing support", SALCTG June 2013 (SALCTG)
An overview of Research Data Management: the research process from developing ideas to preservation of data; funder perspectives, the impact on the wider service, Data Asset Frameworks, preservation and access, and cost implications.
The document provides an overview of the data analytics process (lifecycle). It discusses the key phases in the lifecycle including discovery, data preparation, model planning, model building, communicating results, and operationalizing. In the discovery phase, stakeholders analyze business trends and domains to build hypotheses. In data preparation, data is explored, preprocessed, and conditioned to create an analytics sandbox. This involves extract, transform, load processes to prepare the data for analysis.
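As a toy illustration of the data preparation phase described above, here is a minimal Python sketch of an extract-transform-load step using pandas; the file names and column names are hypothetical, and a real analytics sandbox would of course involve far more conditioning.

# Toy illustration of the ETL step in the data preparation phase.
# The file names and column names below are hypothetical.
import pandas as pd

# Extract: pull raw records into the analytics sandbox.
raw = pd.read_csv("raw_events.csv")  # hypothetical input file

# Transform: condition the data for analysis.
clean = (
    raw.dropna(subset=["user_id", "amount"])  # drop incomplete rows
       .assign(amount=lambda df: df["amount"].clip(lower=0))  # no negative amounts
)
clean["event_date"] = pd.to_datetime(clean["event_date"])

# Load: persist the conditioned table for the model-planning phase.
clean.to_parquet("sandbox/events_clean.parquet")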
PAARL's 1st Marina G. Dayrit Lecture Series held at UP's Melchor Hall, 5F, Proctor & Gamble Audiovisual Hall, College of Engineering, on 3 March 2017, with Albert Anthony D. Gavino of Smart Communications Inc. as resource speaker on the topic "Using Big Data to Enhance Library Services"
Unlock Your Data for ML & AI using Data Virtualization (Denodo)
How Denodo complements a logical data lake in the cloud (a sketch of a client query follows the list):
- Denodo does not substitute for data warehouses, data lakes, ETLs...
- Denodo enables the use of all of them together, plus other data sources:
  - In a logical data warehouse
  - In a logical data lake
  - They are very similar; the only difference is in the main objective
- There are also use cases where Denodo can be used as a data source in an ETL flow
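To make the idea concrete, here is a minimal Python sketch of what a client query against such a virtualization layer might look like, assuming the layer exposes a standard ODBC endpoint; the DSN and the view and column names are hypothetical, and a real Denodo deployment would have its own connection details.

# Minimal sketch: querying a logical data lake through a data
# virtualization layer, assuming it exposes an ODBC endpoint.
# The DSN and the view/column names are hypothetical.
import pyodbc

# One connection to the virtualization layer instead of one per backend.
conn = pyodbc.connect("DSN=virtual_data_lake")  # hypothetical DSN
cursor = conn.cursor()

# A single logical query can join rows that physically live in a
# warehouse and in a data lake; the virtualization layer, not the
# client, resolves where each table actually resides.
cursor.execute("""
    SELECT c.region, SUM(s.amount) AS total_sales
    FROM customers_warehouse AS c
    JOIN sales_events_lake AS s ON s.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, total_sales in cursor.fetchall():
    print(region, total_sales)
conn.close()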
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... (Geoffrey Fox)
Keynote at the Sixth International Workshop on Cloud Data Management (CloudDB 2014), Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC-Apache integration is particularly important: file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, and workflow and monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... (Sarah Anna Stewart)
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
What is eScience, and where does it go from here? (Daniel S. Katz)
eScience has evolved from focusing on global scientific collaborations enabled by distributed computing infrastructure to emphasizing joint advances in digital infrastructure and how that infrastructure enables new research. This symbiotic relationship between research and infrastructure development could be called Research and Infrastructure Development Symbiosis (RaIDS). Going forward, RaIDS conferences should focus on improving communication between infrastructure developers and researchers to facilitate new collaborations, ensure research publications appropriately attribute enabling infrastructure advances, and standardize catalogs of available infrastructure and research challenges.
This presentation was provided by Karen Baker, University of Illinois - Urbana-Champaign, during a NISO Virtual Conference on the topic of data curation, held on Wednesday, August 31, 2016
This document provides an overview of big data and how to get started with it. It introduces key concepts like what big data is, the different technology choices available and how to make an impact with data science. Specific topics covered include Hadoop and NoSQL databases, challenges of big data, sample use cases like customer churn analysis and the Expedia case study. The presentation emphasizes that big data is an evolving field and recommends taking a scientific approach to data analysis to drive business insights and impact.
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
The document discusses the Materials Genome Initiative (MGI) and the High-Throughput Experimental Materials Collaboratory (HTE-MC). It describes NIST's role in supporting MGI through developing a materials innovation infrastructure. It outlines the vision for HTE-MC, which would integrate high-throughput synthesis and characterization tools across multiple institutions through a shared network and data management platform. This would provide broader access to experimental facilities and materials data to support accelerated materials discovery. A workshop was held in 2018 to discuss establishing the HTE-MC concept and defining its technical, operational and business models.
Similar to 2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Ridge National Laboratory) (20)
ODIN Final Event - Publishing and citing, and the role of persistent identifiers (datacite)
Sünje Dallmeier-Tiessen, CERN
Presentation delivered at the ODIN Final Event in Amsterdam (Netherlands) on Wednesday, September 24, 2014: ORCID and DataCite: Towards Holistic Open Research.
More info: www.odin-project.eu
ODIN Final Event - Submission to datacentres (datacite)
Sergio Ruiz, DataCite
ODIN Final Event - Supporting the research lifecycle: Discovery and Analysis (datacite)
Rachael Kotarski, The British Library
ODIN Final Event - The Care and Feeding of Scientific Data (datacite)
Mercè Crosas (@mercecrosas), Director of Data Science, IQSS, Harvard University
2013 DataCite Summer Meeting - Thomson Reuters Data citation index cooperatio... (datacite)
2013 DataCite Summer Meeting - Closing Keynote: Building Community Engagement... (datacite)
2013 DataCite Summer Meeting - Elsevier's program to support research data (H... (datacite)
2013 DataCite Summer Meeting - Out of Cite, Out of Mind: Report of the CODATA... (datacite)
2013 DataCite Summer Meeting - Update on Force 11 and the Amsterdam manifesto... (datacite)
This document summarizes the process undertaken by the Data Citation Synthesis Group to develop a consensus set of principles for data citation. The group was formed in response to multiple organizations developing similar sets of principles. It brought together 36 members from around 20 organizations to review 4 existing sets of data citation principles over 3 months of weekly meetings. They merged the principles into a single synthesis set of 8 high-level, simple principles for data citation. The principles address the importance of data citation; credit and attribution for data contributors; use of data citations as evidence; use of persistent and unique identifiers; access to data and metadata; ensuring identifier and metadata persistence beyond the data lifespan; accommodating versioning and granularity of data; and ensuring interoperability and flexibility.
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (... (datacite)
Michael Witt presented on the Purdue University Research Repository (PURR) at the DataCite summer meeting. PURR is a collaborative effort between Purdue University Libraries, Office of the Vice President for Research, and Information Technology. It provides researchers a space to store, share, and publish research data, with librarian support for data management plans and curation. PURR aims to encourage citation of datasets by assigning identifiers, displaying licenses, providing citation examples, and exposing structured citations. It is built on open source HUBzero software and has over 1,000 registered researchers sharing data across 200 projects.
2013 DataCite Summer Meeting - California Digital Library (Joan Starr - Calif... (datacite)
2013 DataCite Summer Meeting - Opening Keynote: A short history of the Higgs ... (datacite)
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
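To illustrate the "programming, not prompting" idea, here is a minimal DSPy-style sketch in Python. It assumes the dspy package with an OpenAI-compatible model; constructor names vary somewhat across DSPy releases (older versions use dspy.OpenAI and dspy.settings.configure), so treat this as illustrative rather than definitive.

# Minimal sketch of "programming, not prompting" with DSPy.
# Assumes the dspy package and an OpenAI-compatible model.
import dspy

# Configure the underlying language model once, globally.
lm = dspy.LM("openai/gpt-4o-mini")  # assumed model identifier
dspy.configure(lm=lm)

# Declare WHAT the module should do via a signature; DSPy handles
# the HOW (prompt text, demonstrations) and can optimize it later.
qa = dspy.ChainOfThought("question -> answer")

result = qa(question="Why are DOIs useful for supercomputing datasets?")
print(result.answer)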
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
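Tying several of the topics above together, here is a minimal Python sketch of the monitoring path: each anomaly score is published to Kafka and surfaced as a Prometheus metric. It assumes the kafka-python and prometheus_client packages; the broker address, topic name, and threshold are hypothetical and not taken from the presentation.

# Minimal sketch of the monitoring path described above: publish each
# anomaly score to Kafka and expose a counter for Prometheus to scrape.
# Assumes kafka-python and prometheus_client; the broker address,
# topic name, and threshold below are hypothetical.
import json
from kafka import KafkaProducer
from prometheus_client import Counter, start_http_server

ANOMALIES = Counter("anomalies_total", "Number of detected anomalies")

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def report(device_id: str, score: float, threshold: float = 0.9) -> None:
    """Send the score to Kafka; count it if it crosses the threshold."""
    producer.send("edge-anomaly-scores", {"device": device_id, "score": score})
    if score >= threshold:
        ANOMALIES.inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    report("edge-device-01", 0.97)
    producer.flush()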
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to production.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution.
The narrative then shifts to a captivating exploration of prominent desktop OSs: Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape.
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin... (Tatiana Kojar)
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf (flufftailshop)
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English translation of a presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
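As a rough illustration of the embedding-storage side of such a setup, here is a minimal pymilvus sketch using the lightweight file-backed Milvus Lite mode; the collection name, dimension, and vectors are hypothetical, and this is not Secludy's actual pipeline.

# Minimal sketch of storing and searching embeddings with Milvus,
# using pymilvus's file-backed mode (Milvus Lite). The collection
# name, dimension, and vectors below are hypothetical.
from pymilvus import MilvusClient

client = MilvusClient("demo_embeddings.db")  # local, file-backed instance
client.create_collection(collection_name="synthetic_docs", dimension=4)

# In a real pipeline these vectors would come from an embedding model.
client.insert(
    collection_name="synthetic_docs",
    data=[
        {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4]},
        {"id": 2, "vector": [0.4, 0.3, 0.2, 0.1]},
    ],
)

# Nearest-neighbor search over the stored embeddings.
hits = client.search(
    collection_name="synthetic_docs",
    data=[[0.1, 0.2, 0.3, 0.4]],  # query vector
    limit=1,
)
print(hits)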
Letter and Document Automation for Bonterra Impact Management (fka Social Sol... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Ridge National Laboratory)
1. DOIs and Supercomputing
DataCite Summer 2013 Meeting
Terry Jones, Sudharshan Vazhkudai, Doug Fuller
Oak Ridge National Laboratory
2. Why Supercomputers!?
Because Innovation Drives The Economy...
• Over the last 5 years, 38% of the international innovation "R&D 100" awards went to US National Labs
[Bar chart: R&D 100 awards to US National Labs by year, 2009-2013]
• This was done with YOUR tax money
• Ideas shape the course of history - John Maynard Keynes
• The central goal of economic policy should be to spur higher productivity through greater innovation - Joseph Schumpeter's Innovation Economics
3. Why Supercomputers!? (part 2)
...And in 2013, Supercomputers Drive Innovation
Computers have changed the way we conduct experiments. Given enough computer power, we can perform accurate experiments more quickly, more cheaply, and often with greater control.
4. DataCite Summer 2013 / Washington DC
The New Laboratory:
High-Performance Computing yields breakthroughs
$$ H = -\sum_{i=1}^{n} \frac{\hbar^2}{2 m_i} \nabla_i^2 \;-\; \sum_{i \neq j}^{n} \frac{e_i e_j}{r_{ij}} $$
5. DataCite Summer 2013 / Washington DC
Big Problems Require Big Solutions
Energy
Healthcare
Competitiveness
OLCF resources are available to academia and industry through open, peer-reviewed allocation mechanisms.
6. DataCite Summer 2013 / Washington DC
DOE Office of Science HPC User Facilities
• High Performance Production Computing for the Office of Science
  – Characterized by a large number of projects (over 400) and users (over 4800)
• Leadership Computing for Open Science
  – Characterized by a small number of projects (about 50) and users (about 800) with computationally intensive projects
• Linking it together – ESnet
• Investing in the future – R&E Prototypes
[Diagram: DOE Office of Science HPC user facilities connected by ESnet – Titan at ORNL (#2), Mira at ANL (#5), Hopper at LBNL (#24); rankings as of June 2013]
8. DataCite Summer 2013 / Washington DC
With Big Computations Comes Big Data
• DOE HPC User Facilities produce enormous volumes of data
• Each User Facility has tertiary (archival) storage, often HPSS
  – statistics for one such computer center pictured here
• In addition, each center provides secondary storage
  – for example: a 10PB Lustre parallel file system
9. DataCite Summer 2013 / Washington DC
Oak Ridge Leadership Computing Facility (OLCF) – A Leading DOE User Facility
• Part of a collaborative DOE Office of Science program at ORNL and ANL
• Mission: provide the computational and data resources required to solve the most challenging problems.
• Access to the most powerful computer in the world for open access computing (Titan)
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive 10x to 100x more resource than at other generally available centers.
• OLCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).
10. DataCite Summer 2013 / Washington DC
We have increased our system capability by 10,000 times since 2004
• Strong partnerships with supercomputer vendors.
• LCF users employ large portions of the machine for large fractions of time.
• Strong partnerships with our users to scale codes and algorithms.
11. DataCite Summer 2013 / Washington DC
OLCF Future (Based On Extrapolation)
[Roadmap: Jaguar, 2.3 PF leadership system for science (2009) → Titan (OLCF-3), 10–20 PF leadership system (2012) → OLCF-4, 100–250 PF (2016) → OLCF-5, 1 EF (2019)]
• Computer system performance increases through parallelism
  – Clock speed trend flat to slower over coming years
  – In the last 28 years, systems have scaled from 64 cores to ~300,000
  – Applications must utilize all inherent parallelism
• Our compute and data resources have grown 10,000x over the decade, are in high demand, and are effectively used.
12. DataCite Summer 2013 / Washington DC
The Data Deluge
2013 4PB disk & 34PB tape [Titan]
2017 64PB disk & 600PB tape [Coral]
2021 1EB disk & 10EB tape (?)
• Key Challenge: Make Sense of So Much Data
• We'll Need Better Tools
• If "many hands make light work," how can we enable more people to make sense of the data? (A quick check of the growth implied by the figures above follows.)
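A quick, purely illustrative check of what those storage projections imply. The numbers are the slide's own; the decimal unit convention (1 EB = 1000 PB) is an assumption:

```python
# Illustrative arithmetic only: compound growth implied by the slide's
# own disk and tape projections (decimal units assumed: 1 EB = 1000 PB).
PB, EB = 1, 1000

projections = {
    "disk": {2013: 4 * PB, 2017: 64 * PB, 2021: 1 * EB},
    "tape": {2013: 34 * PB, 2017: 600 * PB, 2021: 10 * EB},
}

for tier, sizes in projections.items():
    years = sorted(sizes)
    for a, b in zip(years, years[1:]):
        factor = sizes[b] / sizes[a]
        annual = factor ** (1 / (b - a))  # compound annual growth rate
        print(f"{tier} {a}->{b}: {factor:.0f}x total, ~{annual:.1f}x/year")
```

Every tier roughly doubles each year under these projections, which is the backdrop for the "better tools" challenge above.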
13. DataCite Summer 2013 / Washington DC
What Breakthroughs Are We Missing?
• HPC will remain important to scientific discovery
  – Important for climate, material science, energy security
• Today, the state of the art is (still!) bibliographic publications
• But the gains from bibliographic sharing are limited
  – Constraints on paper length
  – Limited focus of paper
  – Limited ability to convey with graphs, figures, tables
• Urgently needed: a quick way to "enable" data
14. DataCite Summer 2013 / Washington DC
New External Drivers for Supercomputing Centers
• The push is on to squeeze more results from High-Performance Computing
  – Scientists have difficulty replicating (or even understanding) others' results
  – Taxpayers want more openness
  – The Holdren memo
15. DataCite Summer 2013 / Washington DC
Our Response: Make Supercomputer-Produced Data As Widely Available As Possible
• DOIs provide the necessary mechanism & implementation
• Makes sense for OLCF (uniquely qualified for 100TB datasets)
• Will benefit from DataCite's integration with Thomson Reuters' Data Citation Index and other services.
• Already successful for sensor-driven research, as at NASA
• As research goes forward, the project Principal Investigator stores "appropriate data"
  – Presumably, if data can support a bibliographic result (graph, figure, table), the data is worth curating.
• After curation, the data is available to the entire scientific community
  – Helps OLCF with "research tracking"
  – Helps OLCF with "reporting to sponsors"
  – Helps OLCF resolve data disposition questions
  – All the traditional benefits to researchers
16. DataCite Summer 2013 / Washington DC
DOI Benefits
From the User's Perspective, DOIs can:
• Identify & cite key data products of interest and value, and annotate them.
• Safely share data with collaborators even before publishing the result in a scientific communication.
• Let future data analyses easily feed off of the data products, fostering a highly dynamic and collaborative environment.
• Preserve data products for the longer term, well beyond the expiration of their projects at the centers.
• Satisfy requirements from funding agencies on data management plans in terms of long-term preservation, sharing, and dissemination of research results.
From the Center's Perspective, DOIs can:
• Help with research tracking and identifying the major results coming out of a project allocation on the center's resources.
• Aid in reporting to sponsors.
• Since DOIs also capture some basic metadata along with the index, help the center answer questions on the disposition of the data, and search and discover it.
• Provide a tool to cull the data holdings, and tangible policies to offer users for long-term data preservation.
• Evolve to support "data-only" users through data science tools such as DOIs.
• Provide an opportunity for our center to distinguish itself from other centers (they have the best data tools).
From the Sponsor's Perspective, DOIs can:
• Enable more value for the dollar spent: in addition to software tools, research artifacts, and papers, there is now a new entity, the citable data product.
• Bring the added benefit of seeing data sharing flourish within the community, and more data analyses spawned from the data products.
• Give both the users and the centers that the sponsor funds rich tools for data management.
• Support better utilization of HPC center resources.
(A minimal sketch of the metadata such a DOI carries follows below.)
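As a concrete illustration of the "basic metadata" point above: the record behind a dataset DOI is small and structured. The field names below follow DataCite's mandatory metadata properties, but the values, the helper function, and the dataset itself are all hypothetical — a sketch, not OLCF's actual implementation:

```python
# A minimal, hypothetical DataCite-style metadata record for a dataset DOI.
# Field names mirror the DataCite schema's mandatory properties; every
# value below is invented for illustration.

dataset_record = {
    "identifier": {"identifierType": "DOI", "value": "10.xxxx/example-dataset"},
    "creators": [{"creatorName": "Example, Researcher"}],
    "titles": [{"title": "Simulation snapshot data (illustrative)"}],
    "publisher": "Oak Ridge Leadership Computing Facility",
    "publicationYear": "2013",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

def landing_page_url(record):
    """A DOI resolves to a landing page describing the data — the
    landing-page philosophy noted in the editor's notes below."""
    return "https://doi.org/" + record["identifier"]["value"]

print(landing_page_url(dataset_record))
```

This is what lets the center search, discover, and answer disposition questions without touching the (possibly 100TB) dataset itself.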
17. DataCite Summer 2013 / Washington DC
Workflow for DOI Creation
1. User creates data
2. User requests DOI
3. ORNL requests DOI
4. OSTI provides DOI
5. DOI stored at data portal
6. Request permanent data copy
7. Data migrated to archive
8. Archive success response
9. DOI success response
(This flow is sketched as code below.)
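Read as a protocol, steps 1–5 mint and register the DOI while steps 6–9 secure the permanent archival copy. A minimal sketch of that sequencing in Python; every class and method name here is invented for illustration and does not correspond to any real ORNL or OSTI interface:

```python
# Hypothetical sketch of the nine-step DOI-creation workflow above.
# All names are invented; the stubs only make the ordering explicit.
import itertools

class Osti:
    """Stub for the DOI-granting authority (steps 3-4)."""
    _counter = itertools.count(1)

    def provide_doi(self, metadata):
        return f"10.xxxx/olcf.{next(self._counter)}"  # placeholder DOI

class Archive:
    """Stub for tertiary (HPSS-style) archival storage (steps 7-8)."""
    def __init__(self):
        self.holdings = {}

    def migrate(self, doi, dataset):
        self.holdings[doi] = dataset  # step 7: data migrated to archive
        return True                   # step 8: archive success response

class DataPortal:
    """Stub for the center's data portal (steps 3, 5, 6, 9)."""
    def __init__(self, osti, archive):
        self.osti, self.archive, self.dois = osti, archive, {}

    def mint_doi(self, dataset, metadata):
        doi = self.osti.provide_doi(metadata)       # steps 3-4
        self.dois[doi] = metadata                   # step 5: DOI stored at portal
        if not self.archive.migrate(doi, dataset):  # steps 6-7
            raise RuntimeError("archival copy failed; DOI not confirmed")
        return doi                                  # step 9: success response

portal = DataPortal(Osti(), Archive())
# Step 1: the user creates data; step 2: the user requests a DOI for it.
doi = portal.mint_doi(dataset=b"...simulation output...",
                      metadata={"title": "hypothetical dataset"})
print("minted:", doi)
```

The design point the workflow encodes is that the DOI success response (step 9) comes back only after the archive has confirmed the permanent copy (step 8).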
18. DataCite Summer 2013 / Washington DC
Workflow for DOI Data Retrieval
1. User provides search criteria
2. Matches found via metadata
3. User identifies needed data
4. Request data subset
5. Data migrated for upload
6. User retrieves data
(This flow is sketched as code below.)
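The retrieval side reduces to a metadata search followed by staging a (possibly subset) copy out of the archive. Again a hypothetical sketch under the same caveats, with an in-memory dict standing in for the center's tape and disk tiers:

```python
# Hypothetical sketch of the six-step DOI data-retrieval workflow above.
# The in-memory "archive" stands in for real storage tiers; names invented.
archive = {
    "10.xxxx/olcf.1": {"metadata": {"field": "fusion", "year": 2013},
                       "data": b"...simulation output..."},
    "10.xxxx/olcf.2": {"metadata": {"field": "climate", "year": 2013},
                       "data": b"...model run..."},
}

def search(criteria):
    """Steps 1-2: match user-supplied criteria against stored metadata."""
    return [doi for doi, rec in archive.items()
            if all(rec["metadata"].get(k) == v for k, v in criteria.items())]

def retrieve(doi, subset=slice(None)):
    """Steps 4-6: a subset is requested, staged (here: sliced and copied)
    for upload, then handed back to the user."""
    return archive[doi]["data"][subset]

matches = search({"field": "climate"})            # steps 1-2
print("matches:", matches)                        # step 3: user picks one
print(retrieve(matches[0], subset=slice(0, 12)))  # steps 4-6
```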
19. DataCite Summer 2013 / Washington DC
Some Challenges Are Expected
• How will permanent data storage be funded?
  – Projects last 3 years.
• Researchers are affiliated with institutions that have their own data policies.
  – For example, the Princeton Plasma Physics Lab may have policies affecting how we can support its fusion projects.
• Some fields will require effort to make their data "portable" for a wide audience.
  – Astrophysics has a standard file format; Fusion does not.
• Developing good metadata is a human-intensive effort
  – Getting PIs to provide the metadata
  – Looking to OSTI & DataCite for some help with DOI Q&A
20. DataCite Summer 2013 / Washington DC
…More Challenges
• What about authenticated access to data? Or malicious users in general...
• What about the long-term QA aspects of maintaining data?
• What about the logistics of very large data?
  – Staging
  – Retrieving huge files (can't all be kept on disk)
[Graphic: "Where's The Data?"]
21. DataCite Summer 2013 / Washington DC
Current Project Status
• Provided a DOI recommendation for the Center
  – Pros and cons
  – Long-term implications
• Designed the workflow
• Created infrastructure to support the workflow
  – Frontend infrastructure for storing & DOI association
  – Backend infrastructure for search & retrieval
• Having conversations with a few selected HPC user communities:
1. Astrophysics
2. Groundwater Simulation
3. Climate
4. Turbulence
5. Fusion
22. DataCite Summer 2013 / Washington DC
Summary
• High Performance Computing & data are integral to scientific discovery
• Bibliographic publications cannot contain the wealth of insight available in the raw data
• ORNL is leading an effort to make HPC data available to all with DOIs
• In the future, "publish" to a scientist will probably refer to obtaining a DOI for a supercomputer dataset
23. DataCite Summer 2013 / Washington DC
Acknowledgements
ā¢ OLCF DOI Team
ā Sudharshan Vazhkudai
ā Doug Fuller
ā Terry Jones
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
ā¢ OSTI Support
ā Mark Martin
ā Jannean Elliott
ā¢ ORNL Support
ā Jack Wells
ā Giri Palanisamy
ā John Cobb
ā Stan White
26. DataCite Summer 2013 / Washington DC
How Does The OLCF Compare With Other Centers?
Four of six SC13 Gordon Bell finalists used Titan:
• High-Temperature Superconductivity: "Taking a Quantum Leap in Time to Solution for Simulations of High-Tc Superconductors" – Peter Staar, ETH Zurich (Titan)
• Biofluidic Systems: "19 Petaflops Simulation of Protein Suspensions in Crowding Conditions" – Massimo Bernaschi, ICNR-IAC Rome (Titan)
• Plasma Physics: "Radiative Signatures of the Relativistic Kelvin-Helmholtz Instability" – Michael Bussmann, HZDR Dresden (Titan)
• Cosmology: "HACC: Extreme Scaling and Performance Across Diverse Architectures" – Salman Habib, ANL (Sequoia, Mira, Titan)
27. DataCite Summer 2013 / Washington DC
The New Laboratory (continued):
High-Performance Computing is widely applicable
Editor's Notes
Scientific breakthroughs change our lives:
• Explained photosynthesis. Ever wonder how plants turn sunlight into energy? A National Lab scientist determined the path of carbon through photosynthesis, a scientific milestone that illuminated one of life's most important processes. Today, this work allows scientists to explore how to derive sustainable energy sources from the sun.
• Made refrigerators cool. Next-generation refrigerators will likely put the freeze on harmful chemical coolants in favor of an environmentally friendly alloy, thanks to National Lab scientists.
• Brought safe water to millions. Removing arsenic from drinking water is a global priority. A long-lasting particle engineered at a National Lab can now do exactly that, making contaminated water safe to drink. Another technology developed at a National Lab uses ultraviolet light to kill microbes that cause water-borne diseases such as dysentery. This process has reduced child mortality in the developing world.
• Put the digital in DVDs. The optical digital recording technology behind music, video, and data storage originated at a National Lab nearly 40 years ago.
• Tamed hydrogen with nanoparticles. To replace gasoline, hydrogen must be safely stored and easy to use, but this has proved elusive. National Lab researchers have now designed a new pliable material using nanoparticles that can rapidly absorb and release hydrogen without ill effects, a major step in making fuel-cell powered cars a commercial reality.
Exabyte comes after petabyte; then zettabyte, then yottabyte.
In May, an OMB memo and an Executive Order were released in support of the Holdren memo
Opens the door to other vast communities (as evidenced by the wide-ranging audience at this meeting)
Previously, users did not have a tool to identify what is important to them, which resulted in indiscriminately storing all intermediate snapshot data from scratch storage into archival storage. With DOIs, however, there is now a means to identify datasets of value, which may change this user behavior and result in manageable data sizes. This has ramifications for the provisioning of center storage resources.
Tie-in to DataCite attendees: one thing we liked about the DataCite philosophy that will help us is the landing-page approach (anyone can go to the landing page). Some data could be embargoed (but available to others later).