The scientific method consists of generating and analyzing data to create knowledge. Indeed, every materials scientist uses data from syntheses, characterization, and models to explain and optimize materials behavior. Yet, despite the centrality of data to progress in materials, the world’s immense body of materials data remains unstandardized, unstructured, and trapped in myriad publications, isolated repositories, and private computers. This disaggregation (the mishmash) not only prevents materials scientists from standing on the shoulders of giants, but also limits our ability to use large-scale data analytics to dramatically accelerate materials modeling, discovery, and manufacture (à la Moneyball).
Citrine Informatics is a team of materials scientists dedicated to uniting all materials data on a single platform within a single data standard, and putting user-friendly, data-driven tools into the hands of all materials researchers. The company’s vision is to make the full materials R&D pipeline—from initial discovery to scale-up and commercialization—ten times faster than it is today. In this talk, we will review the present state of affairs in materials data, notable progress to date, opportunities for the future, and the challenges likely to arise along the way.
Bryce Meredig of Citrine Informatics presents the company's materials data platform, Citrination. For academic and government users, this infrastructure is a free and open means to meet data management plan requirements of many federal funding agencies.
Optique - to provide a semantic end-to-end connection between users and data sources; to enable users to rapidly formulate intuitive queries using familiar vocabularies and conceptualisations, and to return timely answers from large-scale and heterogeneous data sources.
Machine Learning and Cultural Heritage: What Is It Good Enough For? – John Stack
Funded through the AHRC’s Towards a National Collection Programme, the Science Museum Group (SMG) is collaborating with the V&A and School of Advanced Study, University of London, on a two-year project entitled “Heritage Connector: Transforming text into data to extract meaning and make connections”.
As with almost all data, museum collection catalogues are largely unstructured, variable in consistency and overwhelmingly composed of thin records. The form of these catalogues means that the potential for new forms of research, access and scholarly enquiry that range across multiple collections and related datasets remains dormant.
The Heritage Connector project is deploying a range of machine learning-based techniques to extract information from the SMG collection catalogue and link it to third-party sources – primarily Wikidata and the V&A’s collection – and will then create a set of prototypes that demonstrate and explore the affordances of the resulting dataset.
Rather than attempting to deploy machine learning to create a perfect linked data model, Heritage Connector asks what’s “good enough” to provide useful functionality to different audiences.
https://www.aeolian-network.net/events/workshop-1-employing-machine-learning-and-artificial-intelligence-in-cultural-institutions/
Research results in peer-reviewed publications are reproducible, right? If only it were so clear cut. With high-profile paper retractions and pushes for better data sharing by funders, publishers and the community, the spotlight is now focussing on the whole way research is conducted around the world.
This talk from the Software Sustainability Institute's Collaborations Workshop 2014 describes how cloud computing, with Microsoft Azure, is helping researchers realize the goals of scientific reproducibility.
Find out more at www.azure4research.com
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
How cloud computing can accelerate your research. Presentation given at Moscow State University on 19th May 2015.
Apply for Azure for Research Awards at http://research.microsoft.com/en-US/projects/azure/awards.aspx
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
FAIRy stories: tales from building the FAIR Research Commons – Carole Goble
Plenary Lecture Presented at INCF Neuroinformatics 2019 https://www.neuroinformatics2019.org
Title: FAIRy stories: tales from building the FAIR Research Commons
Findable, Accessible, Interoperable, Reusable. The “FAIR Principles” for research data, software, computational workflows, scripts, or any other kind of Research Object are a mantra; a method; a meme; a myth; a mystery. For the past 15 years I have been working on FAIR in a range of projects and initiatives in the Life Sciences as we try to build the FAIR Research Commons. Some are top-down, like the European Research Infrastructures ELIXIR, ISBE and IBISBA, and the NIH Data Commons. Some are bottom-up, supporting FAIR for investigator-led projects (FAIRDOM), biodiversity analytics (BioVel), and FAIR drug discovery (Open PHACTS, FAIRplus). Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. There are villains and heroes. Some have happy endings; all have morals.
A presentation delivered by Mohammed Barakat on the 2nd Jordanian Continuous Improvement Open Day in Amman. The presentation is about Data Science and was delivered on 3rd October 2015.
Powerful Information Discovery with Big Knowledge Graphs – The Offshore Leaks ... – Connected Data World
Borislav Popov's slides from his lightning talk at Connected Data London. Borislav, a Director of Business Development at Ontotext, presented Ontotext's approach to tackling the Panama Papers leak, using a technology that mixes semantic web and graph database approaches.
Application of Clustering in Data Science using Real-life Examples – Edureka!
Clustering data into subsets is an important task for many data science applications, and is considered one of the most important unsupervised learning techniques. Keeping this in mind, we have come up with a free webinar, ‘Application of Clustering in Data Science using Real-life Examples.’
A talk given at a workshop in Atlanta on "Building an Integrated MGI Accelerator Network": see http://acceleratornetwork.org/event/building-an-integrated-mgi-accelerator-network/.
The US Materials Genome Initiative seeks to develop an infrastructure that will accelerate advanced materials development and deployment. The term Materials Genome suggests a science that is fundamentally driven by the systematic capture of large quantities of elemental data. In practice, we know, things are more complex—in materials as in biology. Nevertheless, the ability to locate and reuse data is often essential to research progress. I discuss here three aspects of networking materials data: data publication and discovery; linking instruments, computations, and people to enable new research modalities based on near-real-time processing; and organizing data generation, transformation, and analysis software to facilitate understanding and reuse. I use these three problems to motivate a discussion of recent results in cloud computing, data publication management, high-performance computing, and related topics.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture – Xiaogang (Marshall) Ma
A presentation with a review of technical trends in data management, publication and citation, and methodologies on data interoperability, provenance of research and semantic escience.
Time to Science/Time to Results: Transforming Research in the Cloud – Amazon Web Services
This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how scientific researchers are using the cloud to solve complex science, engineering, and business problems that require high-bandwidth, low-latency networking and very high compute capabilities. You will hear how leveraging the cloud reduces the cost and time of conducting large-scale, worldwide collaborative research. Researchers can access computational power, data storage, supercomputing resources, and data sharing capabilities in a cost-efficient manner without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers in small schools or distant corners of the world gain access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, optimized C3 instances, and more to increase collaboration. This session will provide best practices and insight from the UC Berkeley AMP Lab on the services used to connect disparate sets of data and drive meaningful new insight and impact.
The eNanoMapper database for nanomaterial safety information: storage and query – Nina Jeliazkova
A number of challenges exist in engineered nanomaterials (ENM) data representation and integration, mainly due to data complexity and provenance. We have recently described the eNanoMapper database [doi:10.1109/BIBM.2014.699936] as part of the computational infrastructure for toxicological data management of ENM, developed within the EU FP7 eNanoMapper project. The ontology-supported data model is based on an exhaustive review of existing nano-related data models, databases, and nanomaterial-related entries in chemical and toxicogenomic databases. We demonstrate how this approach provides a common ground for integration of data represented in diverse formats (ISA-TAB, OECD HT, custom RDF, and a set of spreadsheet templates used by the EU NanoSafety Cluster projects) and enables a uniform approach to the import, storage and searching of ENM physicochemical measurements and biological assay results. A configurable parser enables import of the data stored in spreadsheet templates, accommodating different organizations of the data. The configuration metadata is defined in a separate file, mapping the spreadsheet into the internal data model. The demonstration data provided by eNanoMapper partners ((i) NanoWiki, (ii) a literature dataset on protein coronas, and (iii) the ModNanoTox project dataset consisting of 86 assays and 100 different endpoints) illustrates the capability of the associated REST API to support a variety of tests and endpoints recommended by the OECD Working Party on Manufactured Nanomaterials. The API is tightly integrated with chemical structure search, allowing a component's function to be highlighted as core, coating or functionalisation. The REST API enables graphical summaries of the data and integration in applications such as NanoQSAR modelling via programmatic interaction.
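The configurable-parser idea described above (a separate configuration file that maps spreadsheet columns onto fields of the internal data model) can be illustrated with a toy sketch. The column names, field names, and config shape below are invented for illustration and are not the actual eNanoMapper configuration format:

```python
import csv
import io

# Hypothetical mapping config (normally kept in a separate file):
# internal field name -> spreadsheet column header.
config = {
    "material": "Nanomaterial name",
    "endpoint": "Assay endpoint",
    "value": "Measured value",
}

# A stand-in for one spreadsheet template exported as CSV.
template = io.StringIO(
    "Nanomaterial name,Assay endpoint,Measured value\n"
    "TiO2 NP,cell viability,87.5\n"
)

# The parser itself stays generic: a different template only needs a
# different config, not new code.
records = [
    {field: row[column] for field, column in config.items()}
    for row in csv.DictReader(template)
]
```

Because the mapping lives in data rather than code, accommodating a differently organized template reduces to editing the config.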
Accelerating Discovery via Science Services – Ian Foster
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Accelerating Time to Science: Transforming Research in the Cloud – Jamie Kinney
Researchers working on projects ranging from work at individual labs to some of the world's largest scientific investigations are using AWS to accelerate the pace of scientific discovery and ask questions that were previously impossible to explore. This talk explains why scientists are using Amazon Web Services and showcases a range of real-world examples.
Materials Data Facility as Community Database to Share Nano-manufacturing Rec... – Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Galewsky from the National Center for Supercomputing Applications (NCSA).
Adjusting primitives for graph: SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank, are commonly implemented over the Compressed Sparse Row (CSR) format, an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
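The float-vs-bfloat16 comparison above is framed as a performance benchmark, but the storage type also affects accuracy. The effect can be emulated in plain Python by truncating values to bfloat16 precision (Python floats are doubles, and real bfloat16 hardware rounds rather than truncates, so this is a simplified sketch):

```python
import struct

def to_bfloat16(x):
    """Truncate a float to bfloat16 precision: keep only the top 16 bits
    of its float32 representation (simplified; hardware uses rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def sum_float(xs):
    # Ordinary accumulation in Python floats (IEEE double precision).
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_bf16(xs):
    # Accumulate with the running total stored at bfloat16 precision.
    total = 0.0
    for x in xs:
        total = to_bfloat16(total + to_bfloat16(x))
    return total
```

With only 8 bits of significand, the bfloat16 accumulator gets stuck once the total reaches 256: adding 1.0 to 256.0 truncates back to 256.0, so summing a thousand 1.0s returns 256.0 instead of 1000.0.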
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
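The decomposition step that Levelwise PageRank relies on, condensing the graph into strongly connected components and assigning each component a topological level, can be sketched as follows. This is an illustrative reimplementation (Tarjan's SCC algorithm plus a level pass), not the author's code:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm; returns SCCs in reverse topological order
    of the condensation (block) graph."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

def component_levels(graph, sccs):
    """Topological level of each component; components on the same level
    have no edges between them and can be processed concurrently."""
    comp = {v: i for i, c in enumerate(sccs) for v in c}
    lvl = [0] * len(sccs)
    for c in reversed(sccs):            # reversed Tarjan order = topological
        i = comp[c[0]]
        for v in c:
            for w in graph[v]:
                if comp[w] != i:
                    lvl[comp[w]] = max(lvl[comp[w]], lvl[i] + 1)
    return lvl

# Tiny example: the SCC {a, b} feeds into the SCC {c, d}.
g = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
sccs = tarjan_scc(g)
levels = component_levels(g, sccs)
```

Ranks for all components at level 0 can then be computed first, then level 1, and so on, which is what removes the per-iteration communication.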
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
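The first technique above, skipping computation on vertices that have already converged, can be sketched as a simplified pull-based PageRank (an illustrative sketch, not the STICD implementation; it assumes no dead ends):

```python
def pagerank_skipping(out_edges, d=0.85, tol=1e-10, skip_tol=1e-9, max_iter=100):
    """Pull-based PageRank that stops recomputing vertices whose rank has
    already converged (per-vertex delta < skip_tol). Assumes every vertex
    has at least one out-link (no dead ends)."""
    verts = list(out_edges)
    n = len(verts)
    in_edges = {v: [] for v in verts}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges[v].append(u)
    rank = {v: 1.0 / n for v in verts}
    converged = set()
    for _ in range(max_iter):
        total_delta = 0.0
        for v in verts:
            if v in converged:
                continue  # save iteration time on already-converged vertices
            r = (1 - d) / n + d * sum(
                rank[u] / len(out_edges[u]) for u in in_edges[v]
            )
            total_delta += abs(r - rank[v])
            if abs(r - rank[v]) < skip_tol:
                converged.add(v)
            rank[v] = r
        if total_delta < tol:
            break
    return rank

# Three-vertex cycle: all ranks converge to 1/3.
ranks = pagerank_skipping({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Note the approximation this introduces: once a vertex is frozen, later changes in its in-neighbours no longer reach it, which is why production implementations pair skipping with a careful tolerance choice.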
31. Materials Data Standard
JSON-based definition of arbitrary materials objects & processes
Able to accommodate a wide variety of materials data
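The kind of record such a JSON-based standard might hold can be sketched as follows; all field names and values here are invented for illustration and are not the actual Citrination schema:

```python
import json

# Hypothetical materials record: a compound, a measured property with
# units and conditions, and its processing history. Field names and the
# property value are illustrative only.
record = {
    "chemicalFormula": "Mg2Si",
    "properties": [
        {
            "name": "Seebeck coefficient",
            "scalars": [{"value": -120}],
            "units": "uV/K",
            "conditions": [
                {"name": "Temperature", "scalars": [{"value": 300}], "units": "K"}
            ],
        }
    ],
    "preparation": [{"name": "arc melting"}],
}

serialized = json.dumps(record, indent=2)
```

Because the structure is arbitrary nested JSON, the same shape accommodates a synthesis log, a characterization result, or a simulation output without schema changes.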
32. Thermoelectric Discovery
[Figure: composition map with legend “Model Input Data”, “Canonical Thermoelectrics”, “Citrine Discovery”; annotations mark the universe of known TE compounds and a distant, novel class of thermoelectrics. Compound positions determined by weighted composition (e.g., SiGe would be halfway between Si and Ge; Mg2Si is 1/3 of the way from Mg to Si.)]
MW Gaultois, AO Oliynyk, A Mar, TD Sparks, GJ Mulholland, & B Meredig, “A Recommendation Engine for Suggesting Unexpected Thermoelectric Chemistries: Initial Experimental Validation.”
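The weighted-composition placement used in the slide above (SiGe halfway between Si and Ge; Mg2Si one third of the way from Mg to Si) is simple arithmetic, sketched here with an illustrative helper function:

```python
def position_between(counts, a, b):
    """Fractional position of a compound on the a-b axis, weighted by
    composition: 0.0 at pure `a`, 1.0 at pure `b`.
    `counts` maps element symbol -> stoichiometric count."""
    total = counts[a] + counts[b]
    return counts[b] / total

# SiGe (1:1) sits halfway between Si and Ge.
sige = position_between({"Si": 1, "Ge": 1}, "Si", "Ge")

# Mg2Si is 2 parts Mg to 1 part Si, so 1/3 of the way from Mg to Si.
mg2si = position_between({"Mg": 2, "Si": 1}, "Mg", "Si")
```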
33. MW Gaultois, AO Oliynyk, A Mar, TD Sparks, GJ Mulholland, & B Meredig, “A Recommendation Engine for Suggesting Unexpected Thermoelectric Chemistries: Initial Experimental Validation.”
41. Stakeholders
universities
government labs (DOE labs, NIST...) in the US, EU, Japan, China...
funding agencies
journal publishers
scholarship search engines
professional societies
database providers
equipment makers
materials industry (Dow, DuPont, Alcoa, Corning…)
industries that rely on materials (aerospace, electronics, energy...)
and YOU.
43. Ways to Get Involved
email bryce@citrine.io to join mailing list
try citrination.com and give us feedback
contribute data
contribute models to platform (alpha)
grant proposals – drive our dev