2013 DataCite Summer Meeting - California Digital Library (Joan Starr - Calif...) - datacite
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and... - Inside Analysis
The Briefing Room with Robin Bloor and Pervasive Software
Slides from the Live Webcast on May 1, 2012
The old methods of delivering data for analysts and other business users will simply not scale to meet new demands. Hadoop is rapidly emerging as a powerful and economic platform for storing and processing Big Data. And yet, the biggest obstacle to implementing Hadoop solutions is the scarcity of Hadoop programming skills.
Check out this episode of The Briefing Room to learn from veteran Analyst Robin Bloor, who will explain why modern information architectures must embrace the new, massively parallel world of computing as it relates to several enterprise roles: traditional business analysts, data scientists, and line-of-business workers. He'll be briefed by David Inbar and Jim Falgout of Pervasive Software, who will explain how Pervasive RushAnalyzer™ was designed to accommodate the new reality of Big Data.
For more information visit: http://www.insideanalysis.com
Watch us on YouTube: http://www.youtube.com/playlist?list=PL5EE76E2EEEC8CF9E
A sponsored supplement produced for Jisc on how researchers can cope with the data deluge of modern research techniques. Published by Times Higher Education on 25 November 2009
Everything Has Changed Except Us: Modernizing the Data Warehouse - mark madsen
Keynote, Munich, June 2016
The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.
Research Management Solutions from Microsoft are affordable platforms and tools that can help you create a community-based research platform, powered by current technology.
BI isn't big data and big data isn't BI (updated) - mark madsen
Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they are different. Armed with this knowledge, you will be better able to make design and architecture decisions. The session is sometimes conceptual, sometimes detailed technical explorations of data, processing and technology, but promises to be entertaining regardless of the level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
Assumptions about Data and Analysis: Briefing Room webcast slides - mark madsen
In many ways, moving data is like moving furniture: it's an unpleasant process widely deemed an occasional necessary evil. But as the data pipelines of old decay, a new reality is taking shape: the data-native architecture. Unlike traditional data processing for BI and analytics, this approach works on data right where it lives, thus eliminating the pain of forklifting, narrowing the margin of error, and expediting the time to business benefit. The new architecture embodies new assumptions, some of which we will talk about here.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature explain why this shift is truly tectonic. He'll be briefed by Steve Wooledge of Arcadia Data who will showcase his company's technology, which leverages a data-native architecture to fuel rapid-fire visualization and analysis of both big data and small.
With the development of advanced remote sensing and communication technology, new sources of data have emerged in many industries such as finance, marketing, transport, and utilities. These new types of datasets arrive continuously and at very high speed. Researchers in academia and industry have made many efforts to improve the value of big data and to put that value to significant use through data science. Having a sound process for data mining and machine learning, with clear guidelines, is always an asset for a data science project. It also helps focus the required time and resources early in the process on getting a clear idea of the business problem to be solved.
Hence, a framework is proposed to support the data science project lifecycle and bridge the gap between business needs and technical realities.
The main motivation for building this new framework is to address the challenges of big data analysis and reduce the complexity of big data related data science projects. Recent improvements in technology demand real-time data processing, analytics, and visualization to gain the competitive advantage of real-time decision making. After careful examination and analysis of the related literature, a variety of issues remain in big data processing and analysis. Therefore, this research presents a new big data analytics and processing framework for data acquisition, fusion, storage, management, processing, analysis, visualisation and modelling. Often the purpose of data analysis is not only to identify patterns but to build models, where possible by gaining an understanding of the underlying process. Without a proper coordination and structuring framework there is likely to be much overlap and duplication among project phases, which can cause confusion about the responsibilities of each project participant. A common mistake in big data projects is rushing into data collection and analysis without spending adequate time planning the amount of work involved, understanding business requirements, or even defining the business problem properly. Big data is available all around us in various formats, shapes and sizes. Understanding the relevance of each of these datasets to the business problem is key to succeeding with the project. Big data also has multiple layers of hidden complexity that are not visible through simple inspection. A poorly planned project can ruin the entire effort and its findings in any organization. If the project does not clearly identify the appropriate level of complexity and granularity, the chances are high that an erroneous result set will distort the expected analytical outputs.
This presentation was part of a workshop of IEDA (http://www.iedadata.org) at the AGU (American Geophysical Union) Fall Meeting 2013 in San Francisco that was intended as an introduction to the topic of data publication.
Final version of the general presentation that the RDA Secretary General presented about a dozen times at various conferences and workshops around Europe in the last two months.
Tools für das Management von Forschungsdaten [Tools for Research Data Management] - Heinz Pampel
Workshop "Wege in die Köpfe" of the DFG project "EWIG - Entwicklung von Workflowkomponenten für die Langzeitarchivierung von Forschungsdaten in den Geowissenschaften" (development of workflow components for the long-term archiving of research data in the geosciences) | Berlin, 3 July 2014
FAIR for the future: embracing all things data - ARDC
FAIR for the future: embracing all things data - Natasha Simons, Keith Russell and Liz Stokes, presented at Taylor & Francis Scholarly Summits in Sydney 11 Feb 2019 and Melbourne 14 Feb 2019.
This presentation was provided by Karen Baker, University of Illinois - Urbana-Champaign, during a NISO Virtual Conference on the topic of data curation, held on Wednesday, August 31, 2016
A discussion of the role of taxonomies and other controlled vocabularies in the managing of large amounts of data for researchers, focusing in particular on searchability and data visualization. Presented by Marjorie M.K. Hlava, president of Access Innovations, Inc., for the SLA Military Libraries Division 2013 Workshop, December 12, 2013.
PAARL's 1st Marina G. Dayrit Lecture Series held at UP's Melchor Hall, 5F, Proctor & Gamble Audiovisual Hall, College of Engineering, on 3 March 2017, with Albert Anthony D. Gavino of Smart Communications Inc. as resource speaker on the topic "Using Big Data to Enhance Library Services"
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and Breakthrough Technologies
1. Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and Breakthrough Technologies
IN13D-01
ERIC STEPHAN
December 11, 2017 1
Pacific Northwest National Laboratory
AGU Fall meeting 2017, New Orleans, LA
IN13D: Approaches for Curation to Data Discovery in the Era of Big Data Variety II
2. Addressing Data Challenges of Scientists on Small and Midscale Budgets
Do-it-yourself (DIY) home project videos have taken media by storm, helping you reroof a house or replace a water pump. DIY recommendations can even help you determine whether you can, in fact, do it yourself!
This talk targets innovative, smaller-sized science projects that produce quality science products, including data that can be shared with future consumer communities. Many best practices can be carried out in even the humblest situations. If you run a big data center, smaller projects want more effective ways to connect to your resources beyond "point and click".
3. Emergence of Scientific Collaborative Tools - Science inspired the Web and so much more!
Collaboratory - A center without walls, in which the nation's researchers can perform their research without regard to physical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, [and] accessing information in digital libraries [1]
[1] The national collaboratory. In Towards a national collaboratory. Unpublished report of a National Science Foundation invitational workshop, Rockefeller University, New York. 1988.
Examples: the DOE 2000 Project; the Environmental Molecular Sciences Laboratory (EMSL) User Facility; Sir Tim Berners-Lee's original "vague but exciting" submission to CERN on a distributed information system, 12 March 1989; the National Institutes of Health Human Genome Project (HGP), begun in 1989.
Engage with EMSL to advance your research. How can we work together?
§ Collaborate with our experts
§ Work within multi-disciplinary teams to accelerate science
§ Access world-class scientific user facilities and specialized instrumentation
§ Provide research and career opportunities for your students
www.emsl.pnnl.gov | www.universities.pnnl.gov
4. Examples of Off the Shelf and Standards Deluge: What Works for You?
5. Attaining Data Study Afterlife?
[Figure: example data lifecycle - signal, message, application, database, file store, archive, Deep Web, science publications, data - with visibility through a commercial search engine]
New advancements in science and engineering require careful attention to keeping scientific discovery literature and data artifacts in circulation.
"…Placed in storage, the data has as much productive value as your labor value when you sit on the sofa at night to watch TV."
"…If you want to increase the value of your data you have to increase its active circulation and utility!" Steven Adler, DWBP co-Chair
Without some help, science can remain largely invisible in the Deep Web.
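One low-cost way to pull a dataset out of the Deep Web is to publish a landing page that carries machine-readable Schema.org markup, which commercial search engines can index. The sketch below is a minimal, hypothetical Python example, not part of the talk's own tooling; the dataset name, DOI, and URLs are placeholders.

```python
import json

# Hypothetical dataset description using Schema.org "Dataset" terms.
# All names, identifiers, and URLs below are placeholders.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example soil moisture observations",
    "description": "Hourly soil moisture measurements from a small field campaign.",
    "identifier": "https://doi.org/10.xxxx/example",   # persistent identifier (placeholder)
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["soil moisture", "earth science"],
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/soil_moisture.csv",
    }],
}

# Embed the JSON-LD block in a simple HTML landing page so crawlers can find it.
landing_page = f"""<!DOCTYPE html>
<html>
  <head>
    <title>{dataset["name"]}</title>
    <script type="application/ld+json">{json.dumps(dataset, indent=2)}</script>
  </head>
  <body>
    <h1>{dataset["name"]}</h1>
    <p>{dataset["description"]}</p>
  </body>
</html>"""

with open("dataset.html", "w") as f:
    f.write(landing_page)
```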
6. Increasing Lifespan, Reuse and Visibility DIY
Choose from 35 DWBP best practices to match research functional needs
Scope best practices with reference model sketches
Assess off the shelf product capabilities and limitations with DWBP
Identify required additional plumbing to accomplish research
https://www.w3.org/TR/dwbp/
7. DWBP Data Challenges and Motivating Questions
Challenge areas: Metadata, Data License, Provenance, Data Quality, Versioning, Identification, Data Formats, Vocabularies, Access, Preservation, Feedback, Enrichment, Replication
Motivating questions:
• How do I provide metadata?
• How do I permit/restrict access?
• How can I convey transparency?
• How can I add trust?
• How can I track version history?
• How can I create and use persistent identifiers?
• What non-proprietary structures should I use?
• How do I make my data more easily understood?
• How can I make data retrieval easy, robust, and intuitive?
• What should I consider when archiving?
• How can data producers and users be better engaged?
• How can I add better value to data?
• How do I use data responsibly?
"The Web is not a glorified USB Stick", Phil Archer, W3C Data Activity Lead, https://www.w3.org/2017/Talks/0621-phila-oai/
http://w3c.github.io/dwbp/dwbp-implementation-report.html
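Several of these questions (metadata, license, versioning, persistent identifiers) can be answered with a small DCAT description of a dataset. The following sketch uses rdflib to emit Turtle; it is only one of many ways to apply the DWBP recommendations, and every URI and value is an invented placeholder.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

# Persistent URI for the dataset (placeholder value).
ds = URIRef("https://example.org/id/dataset/soil-moisture-2017")

g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Example soil moisture observations", lang="en")))
g.add((ds, DCTERMS.description, Literal("Hourly soil moisture measurements.", lang="en")))
g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((ds, DCTERMS.issued, Literal("2017-12-11", datatype=XSD.date)))
g.add((ds, DCTERMS.hasVersion, Literal("1.2.0")))          # version indicator
g.add((ds, DCAT.keyword, Literal("soil moisture")))

# One distribution in a non-proprietary format.
dist = URIRef("https://example.org/id/distribution/soil-moisture-2017-csv")
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.mediaType, Literal("text/csv")))
g.add((dist, DCAT.downloadURL, URIRef("https://example.org/data/soil_moisture.csv")))
g.add((ds, DCAT.distribution, dist))

print(g.serialize(format="turtle"))
```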
8. Best Practices Benefit Measures
• Comprehension: humans will have a better understanding of the data structure and meaning, the metadata, and the nature of the dataset.
• Processability: machines can automatically ingest and operate on data.
• Discoverability: finding new associations between and within data resources.
• Reuse: increase intrinsic value to wider data consumer communities.
• Trust: improving the confidence that consumers have in the dataset.
• Linkability: it will be possible to associate data resources.
• Access: humans and machines will be able to retrieve relevant data in familiar, common formats.
• Interoperability: cooperation among data publishers and consumers.
9. Using Technology-Agnostic Reference Models to Assess Best Practice Relevance
ISO Open Archival Information System (OAIS) ISO 14721:2003
The Context, Containers, Components and Classes (C4) model for software architecture
10. Example Context Data Producer Reference Models
• Provide metadata
• Provide structural metadata
• Use machine-readable standardized data formats
• Provide data in multiple formats
• Reuse vocabularies, preferably standardized ones
• Provide subsets for large datasets
• Provide bulk download
• Provide data provenance information
• Provide data quality information
• Provide a version indicator
• Provide version history
• Preserve identifiers
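As a small illustration of "provide data in multiple formats" and "provide subsets for large datasets", the sketch below writes the same invented records as CSV and JSON and cuts a site-based subset. The file names, fields, and values are hypothetical, not drawn from the use cases in this talk.

```python
import csv
import json

# Invented example records; in practice these come from your instrument or model output.
records = [
    {"site": "A1", "date": "2017-12-01", "soil_moisture": 0.23},
    {"site": "A1", "date": "2017-12-02", "soil_moisture": 0.25},
    {"site": "B2", "date": "2017-12-01", "soil_moisture": 0.19},
]

# Bulk download: full dataset as CSV (machine readable, non-proprietary).
with open("soil_moisture_full.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "soil_moisture"])
    writer.writeheader()
    writer.writerows(records)

# Same data in a second standardized format (JSON).
with open("soil_moisture_full.json", "w") as f:
    json.dump(records, f, indent=2)

# Subset for consumers who only need one site.
subset = [r for r in records if r["site"] == "A1"]
with open("soil_moisture_site_A1.json", "w") as f:
    json.dump(subset, f, indent=2)
```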
11. Use Case: Energy Exascale Earth System Model (E3SM) and Mass Spectrometry
Focus: recovering enough information to re-execute a given simulation.
Achieves this through IETF and W3C formats, W3C Provenance, and interoperable protocols; off-the-shelf components include Swagger, Jupyter Notebook, and NoSQL databases. The approach was repurposed to support reproducible mass spectrometry experiments.
Thomas M, J Laskin, B Raju, EG Stephan, TO Elsethagen, NYS Van, and SN Nguyen. 2016. "Enabling Re-executable Workflows with Near-real-time Visualization, Provenance Capture and Advanced Querying for Mass Spectrometry Data." In NYSDS 2016 - Data-Driven Discovery.
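A lightweight way to record "enough information to re-execute a given simulation" is to capture each run as W3C PROV statements. The sketch below uses rdflib with the PROV-O vocabulary; it is not the project's actual implementation, and the run identifiers, configuration file name, and agent are placeholders.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/runs/")        # placeholder namespace

g = Graph()
g.bind("prov", PROV)

run = EX["run-0042"]            # the simulation run (prov:Activity)
config = EX["run-0042-config"]  # input configuration (prov:Entity)
output = EX["run-0042-output"]  # produced dataset (prov:Entity)
user = EX["analyst-1"]          # responsible agent (placeholder)

g.add((run, RDF.type, PROV.Activity))
g.add((run, PROV.startedAtTime, Literal("2017-12-11T09:00:00", datatype=XSD.dateTime)))
g.add((run, PROV.wasAssociatedWith, user))
g.add((user, RDF.type, PROV.Agent))

g.add((config, RDF.type, PROV.Entity))
g.add((config, PROV.value, Literal("e3sm_case.cfg")))   # enough detail to find and rerun the case
g.add((run, PROV.used, config))

g.add((output, RDF.type, PROV.Entity))
g.add((output, PROV.wasGeneratedBy, run))

print(g.serialize(format="turtle"))
```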
12. Example Context Data Publisher Reference Model
• Provide metadata
• Provide descriptive metadata
• Provide structural metadata
• Provide data provenance information
• Use locale-neutral data representations
• Reuse vocabularies, preferably standardized ones
• Choose the right formalization level
• Gather feedback from data consumers
• Enrich data by generating new data
• Provide Complementary Presentations
• Interoperability
• Use persistent URIs as identifiers of datasets
• Use persistent URIs as identifiers within datasets
• Reuse vocabularies, preferably standardized ones
• Choose the right formalization level
• Make data available through an API
• Use Web Standards as the foundation of APIs
• Avoid Breaking Changes to Your API
• Provide Feedback to the Original Publisher
• Provide data provenance information
• Provide data quality information
• Provide a version indicator
• Provide version history
• Preserve identifiers
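"Make data available through an API" and "use Web standards as the foundation of APIs" can be as simple as an HTTP endpoint that resolves a dataset identifier to a standard JSON representation. The Flask sketch below is one hypothetical way to do this; the routes, identifiers, and catalog records are invented, and versioning the path is only one common tactic for avoiding breaking changes.

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Toy in-memory catalog keyed by dataset identifier (placeholder data).
CATALOG = {
    "soil-moisture-2017": {
        "id": "https://example.org/id/dataset/soil-moisture-2017",
        "title": "Example soil moisture observations",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "distributions": ["https://example.org/data/soil_moisture.csv"],
    }
}

# Versioned path so later changes need not break existing clients.
@app.route("/api/v1/datasets/<dataset_id>")
def get_dataset(dataset_id):
    record = CATALOG.get(dataset_id)
    if record is None:
        abort(404)              # standard HTTP status codes
    return jsonify(record)      # standard JSON over HTTP

if __name__ == "__main__":
    app.run(port=8080)
```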
13. Example: curating and re-publishing to support discovery
Based on a single soil moisture use case: 1.4 billion triples of curated measurement metadata (i.e., relationships, graph edges), including descriptions of 777,230 datasets, 2,767 data catalogs, 1,701 data centers, and 52 data networks.
Chappell AR, JR Weaver, S Purohit, WP Smith, KL Schuchardt, P West, B Lee, and P Fox. 2015. "Enhancing the Impact of Science Data: Toward Data Discovery and Reuse." In Proceedings of the 14th IEEE/ACIS International Conference on Computer and Information Science 2015.
Techniques: ontology alignment; query optimization with SPARQL and Schema.org; use of services such as geonames.org.
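Once measurement metadata is curated as triples, discovery becomes a query. The sketch below loads a hypothetical Turtle file and runs a simple SPARQL query over Schema.org terms using rdflib; the file name, vocabulary choice, and keyword are assumptions, and a catalog at the billion-triple scale described above would live in a dedicated triple store rather than in memory.

```python
from rdflib import Graph

g = Graph()
g.parse("curated_metadata.ttl", format="turtle")   # placeholder file of curated triples

# Find datasets whose description mentions soil moisture.
query = """
PREFIX schema: <http://schema.org/>
SELECT ?dataset ?name
WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?name ;
           schema:description ?desc .
  FILTER(CONTAINS(LCASE(STR(?desc)), "soil moisture"))
}
LIMIT 10
"""

for row in g.query(query):
    print(row.dataset, row.name)
```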
14. DWBP Implementation Report: Field Guide to Examples of Best Practices
Use the evaluation criteria in the report to assess your own technology stack and data resources.
http://w3c.github.io/dwbp/dwbp-implementation-report.html
15. Indirect Collaborations
[Figure: example data lifecycle connecting producers, publishers, analysts, and researchers]
There is real interest in your data from emerging fields!
Using common methods and approaches is extremely helpful for indirect collaborations.
Internationalizing your products can widen your impact.
The approach supports both open and closed (behind-firewall) collaborations.
16. What Type of Data Terrain Are We Providing for Future Science?
Active technical recommendation communities such as the W3C are here to serve you and are interested in your problems.
Evolving good practice as a guideline is less expensive than technology-solution context switching without good practices.
The success criteria described in the DWBP can help you measure benefit to your project.
Change is good; for legacy applications, though, good practice and new technology adoption may be more impactful at a gradual pace.
Questions? Eric.Stephan@pnnl.gov
"Thank you for giving us level terrain to build upon" - Sir Tim Berners-Lee (inventor of the Web), recalling a conversation he had with Vint Cerf (co-inventor of the Internet). Paraphrased from notes on TBL's remarks at the W3C Technical Plenary and Advisory Committee meeting, 2014.
17. The International Data on the Web Best Practices Recommendations Team!
Contributors:
• Annette Greiner (Lawrence Berkeley National Laboratory)
• Antoine Isaac
• Carlos Iglesias
• Carlos Laufer
• Christophe Guéret
• Deirdre Lee (Working Group co-Chair)
• Doug Schepers
• Eric G. Stephan (Pacific Northwest National Laboratory)
• Eric Kauz
• Ghislain A. Atemezing
• Hadley Beeman (Working Group co-Chair)
• Ig Ibert Bittencourt
• João Paulo Almeida
• Makx Dekkers
• Peter Winstanley
• Phil Archer (Data Activity Chair)
• Riccardo Albertoni
• Sumit Purohit (Pacific Northwest National Laboratory)
• Yasodara Córdova
DWBP Editors:
• Bernadette Farias Lóscio
• Caroline Burle
• Newton Calegari
Working Group Chairs
• Hadley Beeman
• Deirdre Lee
• Yasodara Córdova
• Steven Adler, Perspective & Community Outreach
W3C Data Activity Lead, W3C Team Contact: Phil Archer