- AiiDA is a framework that aims to automate and manage computational workflows and data in materials science. It provides tools for provenance tracking, reproducibility of results, and sharing of data.
- Key features include automation of calculations, robust storage of data and links between calculations in a database, and development of reusable scientific workflows to calculate material properties.
- The framework uses a plugin-based system to interface with different codes, data formats, computing resources, and more through a unified Python interface.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS (a minimal PySpark sketch follows this list).
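As a taste of part 3, here is a minimal PySpark sketch of the context/RDD workflow; the local master, app name, and data are illustrative, and it assumes pyspark is installed:

```python
# Minimal PySpark sketch: build a session, parallelize data, run a
# transformation (map) and an action (sum). Assumes `pip install pyspark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))   # distribute the data
total = rdd.map(lambda x: x * x).sum()   # map is lazy; sum forces execution
print(total)

spark.stop()
```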
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin..., by Advanced-Concepts-Team
Searching for information within large sets of unstructured, heterogeneous scientific data can be very challenging unless an inverted index has been created in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate the process of index construction. These solutions perform well when data are already distributed across the cluster nodes involved in the computation. On the other hand, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces the number of I/O operations to a bare minimum by using a stream of in-memory operations to extract and index heterogeneous data. We further improve performance by using GPUs and POSIX Threads programming for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
High Performance Machine Learning in R with H2O, by Sri Ambati
This document summarizes a presentation by Erin LeDell from H2O.ai about machine learning using the H2O software. H2O is an open-source machine learning platform that provides APIs for R, Python, Scala and other languages. It allows distributed machine learning on large datasets across clusters. The presentation covers H2O's architecture, algorithms like random forests and deep learning, and how to use H2O within R including loading data, training models, and running grid searches. It also discusses H2O on Spark via Sparkling Water and real-world use cases with customers.
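The talk is R-centric, but H2O exposes the same workflow in Python; a hedged sketch in that API (assumes `pip install h2o` plus a local Java runtime; the data is synthetic, not from the talk):

```python
# Minimal H2O sketch: start a local cluster, build a frame, train a model.
import h2o
from h2o.estimators import H2ORandomForestEstimator
import numpy as np
import pandas as pd

h2o.init()  # launches or attaches to a local H2O JVM

df = pd.DataFrame({"x1": np.random.rand(200), "x2": np.random.rand(200)})
df["y"] = (df.x1 + df.x2 > 1).astype(int)
frame = h2o.H2OFrame(df)
frame["y"] = frame["y"].asfactor()   # mark the target as categorical

model = H2ORandomForestEstimator(ntrees=50)
model.train(x=["x1", "x2"], y="y", training_frame=frame)
print(model.auc(train=True))
```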
This document discusses tools for social network analysis and visualization. It covers Netvizz, which extracts data from Facebook for research. It also covers Pajek and Gephi, two programs for analyzing and visualizing networks. Pajek is suitable for large networks with thousands of nodes, while Gephi is interactive and can handle networks of up to 100,000 nodes. Both support a variety of input and output formats and feature layout algorithms and metrics for analysis.
Virtual Knowledge Graphs for Federated Log Analysis, by Kabul Kurniawan
This document presents a method for executing federated graph pattern queries on dispersed and heterogeneous raw log data by dynamically constructing virtual knowledge graphs (VKGs). The approach extracts only relevant log messages on demand, integrates log events into a common graph, federates queries across endpoints, and links results to background knowledge. The architecture includes modules for log parsing, query processing, and a prototype implementation demonstrates the approach for security analytics use cases. An evaluation analyzes the performance of query execution time against factors like number of extracted log lines and queried hosts.
This document discusses data visualization tools in Python. It introduces Matplotlib as the first and still standard Python visualization tool. It also covers Seaborn which builds on Matplotlib, Bokeh for interactive visualizations, HoloViews as a higher-level wrapper for Bokeh, and Datashader for big data visualization. Additional tools discussed include Folium for maps, and yt for volumetric data visualization. The document concludes that Python is well-suited for data science and visualization with many options available.
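A small sketch of the first two tools named above, with Seaborn drawing onto a Matplotlib figure (synthetic data):

```python
# Matplotlib as the base layer, Seaborn layered on top for statistical style.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = np.random.normal(loc=0.0, scale=1.0, size=1000)

fig, ax = plt.subplots()
sns.histplot(data, kde=True, ax=ax)   # histogram plus kernel density estimate
ax.set_xlabel("value")
ax.set_title("Seaborn drawing onto a Matplotlib Axes")
plt.show()
```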
DSD-INT 2015 - Data management with open earth datalabs - Gerben de Boer, van..., by Deltares
The document discusses different methods for accessing environmental data from servers using open standards. It focuses on OPeNDAP, WMS and WCS protocols. Examples are provided on how to access data from these servers using Python, Matlab and QGIS. The last section promotes using the OpenEarth server stack to serve your own data using open standards.
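For the Python route, a common pattern is to open an OPeNDAP endpoint lazily with xarray; the URL and variable name below are placeholders, not endpoints from the slides:

```python
# Reading from an OPeNDAP server with xarray: the dataset is accessed
# lazily over HTTP, so only the requested slices are downloaded.
import xarray as xr

url = "https://example.org/thredds/dodsC/some/dataset"  # placeholder URL
ds = xr.open_dataset(url)   # requires netCDF4 or pydap to be installed

print(ds)                                   # inspect variables and coordinates
subset = ds["temperature"].isel(time=0)     # variable name is illustrative
```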
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
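In the Python API, the stacking workflow looks roughly like this sketch (synthetic data; the key detail is that base learners keep their cross-validation predictions):

```python
# Sketch of H2O stacking: cross-validated base learners + a Stacked Ensemble.
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2ORandomForestEstimator,
                            H2OStackedEnsembleEstimator)
import numpy as np
import pandas as pd

h2o.init()
df = pd.DataFrame(np.random.rand(300, 3), columns=["x1", "x2", "x3"])
df["y"] = (df.x1 + df.x2 > 1).astype(int)
train = h2o.H2OFrame(df)
train["y"] = train["y"].asfactor()
x, y = ["x1", "x2", "x3"], "y"

# Base learners must keep cross-validation predictions for stacking.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)
gbm = H2OGradientBoostingEstimator(ntrees=30, **common)
gbm.train(x=x, y=y, training_frame=train)
rf = H2ORandomForestEstimator(ntrees=30, **common)
rf.train(x=x, y=y, training_frame=train)

# The metalearner learns the optimal combination of the base fits.
stack = H2OStackedEnsembleEstimator(base_models=[gbm, rf])
stack.train(x=x, y=y, training_frame=train)
```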
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
Automating materials science workflows with pymatgen, FireWorks, and atomate, by Anubhav Jain
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
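A toy FireWorks sketch of the define-then-execute pattern (assumes a reachable MongoDB for the LaunchPad; the shell commands stand in for real simulation steps):

```python
# Define a two-step workflow and add it to the LaunchPad (MongoDB-backed).
from fireworks import Firework, Workflow, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import rapidfire

fw1 = Firework(ScriptTask.from_str('echo "relax structure"'), name="step1")
fw2 = Firework(ScriptTask.from_str('echo "compute bands"'), name="step2")
wf = Workflow([fw1, fw2], {fw1: [fw2]})   # fw2 depends on fw1

lp = LaunchPad()   # default: MongoDB on localhost
lp.add_wf(wf)
rapidfire(lp)      # pull jobs and run them until none remain
```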
Data Structures for Statistical Computing in Python, by Wes McKinney
The document discusses statistical data structures in Python. It summarizes that structured arrays are commonly used to store statistical data sets but have limitations. The R data frame is introduced as a flexible alternative that inspired the pandas library in Python. Pandas aims to create intuitive data structures for statistical analysis with labeled axes and automatic data alignment. Its core data structure, the DataFrame, functions similarly to R's data frame.
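The labeled-axes and automatic-alignment ideas in a tiny example:

```python
# pandas aligns on labels, not positions: arithmetic matches index values
# and fills non-overlapping labels with NaN.
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
print(s1 + s2)   # a: NaN, b: 12.0, c: 23.0, d: NaN

df = pd.DataFrame({"x": s1, "y": s2})   # columns align on the union index
print(df)
```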
A walk through the maze of understanding Data Visualization using several tools such as Python, R, Knime and Google Data Studio.
This workshop is hands-on and this set of presentations is designed to be an agenda to the workshop
This document provides an introduction to Apache Spark, including its history and key concepts. It discusses how Spark was developed in response to big data processing needs and how it builds upon earlier systems like Google's MapReduce. The document then covers Spark's core abstractions like RDDs and DataFrames/Datasets and common transformations and actions. It also provides an overview of Spark SQL and how to deploy Spark applications on a cluster.
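Building on the earlier RDD sketch, here is the DataFrame/Spark SQL side in miniature (column names and data are arbitrary):

```python
# DataFrames and Spark SQL: the same query expressed through both APIs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)], schema=["key", "value"])

df.groupBy("key").agg(F.sum("value").alias("total")).show()

df.createOrReplaceTempView("t")
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").show()

spark.stop()
```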
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data, by Anubhav Jain
The DuraMat Data Hub and Analytics Capability provides a centralized resource for sharing solar PV data. It collects performance, materials properties, meteorological, and other data through a central Data Hub. A data analytics thrust works with partners to provide software, visualization, and data mining capabilities. The goal is to enhance efficiency, reproducibility, and new analyses by combining multiple data sources in one location. Examples of ongoing projects using the hub include clear sky detection modeling to automatically classify sky conditions from irradiance data.
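As a flavor of the clear-sky ingredient, a small pvlib sketch that models clear-sky irradiance for a site; comparing measured irradiance against such a model is the basis of clear-sky detection (coordinates and timezone are illustrative):

```python
# Model clear-sky irradiance (GHI/DNI/DHI) for one day at a given site.
import pandas as pd
from pvlib.location import Location

site = Location(latitude=32.2, longitude=-110.9, tz="US/Arizona", altitude=700)
times = pd.date_range("2020-06-01", periods=24 * 12, freq="5min", tz=site.tz)

clearsky = site.get_clearsky(times)   # Ineichen model by default
print(clearsky.head())
```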
Software Tools, Methods and Applications of Machine Learning in Functional Ma..., by Anubhav Jain
The document discusses software tools for high-throughput materials design and machine learning developed by Anubhav Jain and collaborators. The tools include pymatgen for structure analysis, FireWorks for workflow management, and atomate for running calculations and collecting output into databases. The matminer package allows analyzing data from atomate with machine learning methods. These open-source tools have been used to run millions of calculations and power databases like the Materials Project.
Swift Parallel Scripting for High-Performance Workflow, by Daniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general, and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grain in-memory workflow on extreme-scale systems, where it can perform task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
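Swift is its own language, so no Swift code here; but its implicitly parallel dataflow style can be loosely approximated in plain Python with futures, as in this analogue (not Swift, and far less capable):

```python
# Loose Python analogue of dataflow parallelism: independent tasks run
# concurrently; the final step waits only on the results it consumes.
from concurrent.futures import ProcessPoolExecutor

def simulate(i):
    return i * i          # stand-in for running an ordinary program

def summarize(results):
    return sum(results)   # stand-in for a downstream analysis step

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(simulate, i) for i in range(16)]
        print(summarize(f.result() for f in futures))
```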
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss our experience with real-time analysis of high-volume event data streams.
Scipy 2011 Time Series Analysis in Python, by Wes McKinney
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools (a minimal example follows this list).
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
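To make item 2 concrete on simulated data (modern statsmodels exposes ARMA estimation through its ARIMA class; all numbers are illustrative):

```python
# Fit an ARMA(1,1) model to a simulated AR(1) series with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):                  # simulate y_t = 0.7*y_{t-1} + noise
    y[t] = 0.7 * y[t - 1] + rng.normal()

model = ARIMA(y, order=(1, 0, 1))        # (AR order, differencing, MA order)
result = model.fit()
print(result.summary())
print(result.forecast(steps=5))          # out-of-sample forecast
```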
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
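A hedged taste of both tools on toy problems: Numba compiling a scalar loop, Dask parallelizing a NumPy-style reduction.

```python
# Numba compiles the hot loop; Dask evaluates an array reduction in parallel.
import numpy as np
from numba import njit
import dask.array as da

@njit
def loop_sum(x):
    total = 0.0
    for v in x:           # compiled to machine code on first call
        total += v
    return total

print(loop_sum(np.random.rand(1_000_000)))

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())   # evaluated chunk by chunk, in parallel
```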
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer..., by MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ..., by Databricks
Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19, by Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. In the best case, a pipeline is left to run overnight or even over several days. In the worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
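A stripped-down sketch of that pipeline shape; the regex "extractor" below is a stand-in for the SciSpaCy models, which require separate downloads:

```python
# Parallel entity extraction over documents with dask.bag, landing in Parquet.
import re
import dask.bag as db

docs = [f"Patient {i} was given Remdesivir on day {i % 7}." for i in range(100)]

def extract(doc):
    # Stand-in extractor: capitalized tokens as pseudo-entities.
    return {"doc": doc, "entities": ", ".join(re.findall(r"[A-Z][a-z]+", doc))}

bag = db.from_sequence(docs, npartitions=4).map(extract)
df = bag.to_dataframe()
df.to_parquet("entities.parquet")   # requires pyarrow or fastparquet
```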
Marius Eriksen discusses Reflow, a new cloud-native workflow framework for bioinformatics. Reflow programs workflows directly using a functional programming language for simplicity and composability. It leverages lazy evaluation and caching to efficiently parallelize and distribute work across private clusters. Reflow aims to untie the hands of implementors compared to traditional workflow systems through its unified approach to programming, execution, and infrastructure.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
Dr. Francesco Bongiovanni has expertise in scalable distributed systems and algorithms, cloud computing, applied formal methods, and distributed optimizations. He has a B.Sc. in Computer Systems, an M.Sc. in Software Engineering of Distributed Systems, and a Ph.D. in Computer Science. He has worked at INRIA and the Verimag Laboratory. This presentation provides an overview of big data frameworks and tools including HDFS, Mesos, Spark, Spark Streaming, Spark SQL, GraphX, MLlib, Chapel, ZooKeeper, and SparkR that can be run on the eScience cluster for processing large datasets in a scalable, fault-tolerant manner. Examples demonstrate operations like averaging 1 billion elements.
This summary provides the key details from a local newspaper classified ad section:
1) The classified ad section includes listings for real estate rentals and sales, automotive sales, services such as tree trimming and heating/AC repair, help wanted ads, and community event notices.
2) One help wanted ad is for a customer service representative position at a local insurance agency. Another ad is for production workers at a factory in Ubly offering benefits.
3) Upcoming community events noted include a strawberry social fundraiser, Vacation Bible School, and a VFW bazaar in August with craft vendors and food.
The document discusses three sustainability-related topics: environmental restoration, geographic information systems, and recycling and energy efficiency.
Epic Research is an experienced Singapore Stock Exchange signals provider; contact us to learn about our best picks for SGX Live, SGX stock prices, and the Singapore stock market.
Individual adult therapy can help those experiencing upsetting or disproportionate emotions and behaviors by providing a safe, confidential environment to openly explore thoughts and perceptions. Speaking freely to a professional away from daily life can help uncover root causes of troublesome feelings and identify patterns contributing to issues. The goal is to gain a deeper understanding of oneself through interpretation, leading to greater self-awareness and acceptance.
CONTRATACIONES GRUPO ENTREPARENTESIS 2009, by guest4a899
A Chilean vocal group formed in 1998, with 10 years of experience in vocal music. They have won awards such as FONDART 2001 and have sung for President Michelle Bachelet. They have also shared the stage with renowned artists such as Alberto Plaza and Myriam Hernández, and have performed at Teletón events and for major companies in the country.
Weed Control Strategies in Organically Grown Carrots and Onions
For more information, please see the websites below:

Organic Edible Schoolyards & Gardening with Children
http://scribd.com/doc/239851214

Double Food Production from your School Garden with Organic Tech
http://scribd.com/doc/239851079

Free School Gardening Art Posters
http://scribd.com/doc/239851159

Companion Planting Increases Food Production from School Gardens
http://scribd.com/doc/239851159

Healthy Foods Dramatically Improves Student Academic Success
http://scribd.com/doc/239851348

City Chickens for your Organic School Garden
http://scribd.com/doc/239850440

Simple Square Foot Gardening for Schools - Teacher Guide
http://scribd.com/doc/239851110
Epic Research provides ultimate FOREX signals for its clients to produce amazingly accurate results. Our research team prepares I-FOREX signal live charts and track-sheets of past performance; by consulting these, traders can generate maximum profit from the marketplace. This report helps you achieve the desired success on the SGX Stock Exchange.
The REALM is an immersive and open-source environment for users to discover, create, collaborate, trade, access, and share resources to achieve personal and professional goals, combining elements of social media, rich content, and semantic data systems. For businesses, The REALM is a permission-based marketing environment that provides direct access to targeted customers and audiences through a revolutionary advertainment system that captivates individuals with precisely relevant information, rewards them for interaction, and builds authentic relationships leading to repeat sales and referrals.
Course: Implementation of Internal Control, by RC Consulting
This document presents information about a specialized technical course on implementing the internal control system in State entities in accordance with Directive N°013-2016-CG/GPROD of the Comptroller General of the Republic. The course will take place from December 14 to 16, 2016, and aims to train those responsible for implementing the internal control system and for measuring its maturity level before the established deadline. The course consists of three modules covering aspects such as the
Beyond Dots on a Map: Spatially Modeled Surfaces of DHS data, by MEASURE Evaluation
This presentation was shared by Clara R. Burgert-Brucker, Pete Gething, Andy Tatem, and Tom Bird, all with The DHS Program, at the June 2016 MEASURE Evaluation GIS Working Group Meeting.
To download the editable version of this document, go to www.slidebooks.com
Learn how to create a financial plan with training and templates in editable PowerPoint slides created by former Deloitte management consultants.
German Conference on Bioinformatics 2021
https://gcb2021.de/
FAIR Computational Workflows
Computational workflows capture precise descriptions of the steps and data dependencies needed to carry out computational data pipelines, analysis and simulations in many areas of Science, including the Life Sciences. The use of computational workflows to manage these multi-step computational processes has accelerated in the past few years driven by the need for scalable data processing, the exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality assured processing methods. The SARS-CoV-2 pandemic has significantly highlighted the value of workflows.
This increased interest in workflows has been matched by the number of workflow management systems available to scientists (Galaxy, Snakemake, Nextflow and 270+ more) and the number of workflow services like registries and monitors. There is also recognition that workflows are first class, publishable Research Objects just as data are. They deserve their own FAIR (Findable, Accessible, Interoperable, Reusable) principles and services that cater for their dual roles as explicit method description and software method execution [1]. To promote long-term usability and uptake by the scientific community, workflows (as well as the tools that integrate them) should become FAIR+R(eproducible), and citable so that author’s credit is attributed fairly and accurately.
The work on improving the FAIRness of workflows has already started and a whole ecosystem of tools, guidelines and best practices has been under development to reduce the time needed to adapt, reuse and extend existing scientific workflows. An example is the EOSC-Life Cluster of 13 European Biomedical Research Infrastructures which is developing a FAIR Workflow Collaboratory based on the ELIXIR Research Infrastructure for Life Science Data Tools ecosystem. While there are many tools for addressing different aspects of FAIR workflows, many challenges remain for describing, annotating, and exposing scientific workflows so that they can be found, understood and reused by other scientists.
This keynote will explore the FAIR principles for computational workflows in the Life Science using the EOSC-Life Workflow Collaboratory as an example.
[1] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, and Daniel Schober. FAIR Computational Workflows. Data Intelligence 2020, 2:1-2, 108-121. https://doi.org/10.1162/dint_a_00033
This document discusses FAIR computational workflows and why they are important. It defines computational workflows as multi-step processes for data analysis and simulation that link computational steps and handle data and processing dependencies. Workflows improve reproducibility, enable automation, and allow for increased sharing and reuse of research. The document outlines how applying FAIR principles to workflows makes them findable, accessible, interoperable, and reusable. This includes using standardized metadata, identifiers, licensing, and formats to describe workflows and ensure their components and data are also FAIR. Adopting FAIR workflows requires support from workflow systems, tools, communities and services.
Dr. REEJA S R gave a talk on high performance computing (HPC) and Python. She discussed what HPC is, when it is needed, and what it includes. She also covered the history of computer architectures for HPC, including vector computers, massively parallel processors, symmetric multiprocessors, and clusters. Additionally, she explained what Python is, why it is useful for HPC, and some of the libraries that can help with HPC tasks like NumPy, SciPy, and MPI4py. Finally, she discussed some challenges with Python for HPC and ways to improve performance, such as through the PyMPI, Pynamic, PyTrilinos, ODIN, and Seamless libraries
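A minimal mpi4py sketch of the SPMD pattern such talks typically cover (run with, e.g., `mpiexec -n 4 python script.py`):

```python
# Each rank computes a partial sum; MPI reduce combines them on rank 0.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.arange(rank, 10_000_000, size, dtype=np.float64).sum()
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print("global sum:", total)
```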
Software tools for high-throughput materials data generation and data mining, by Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
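A tiny matminer featurization sketch (the composition is arbitrary):

```python
# Turn a chemical composition into ML-ready descriptors with matminer.
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

featurizer = ElementProperty.from_preset("magpie")   # standard descriptor set
features = featurizer.featurize(Composition("Fe2O3"))
labels = featurizer.feature_labels()

print(len(features), "features, e.g.", labels[0], "=", features[0])
```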
1) The MaX Centre of Excellence aims to enable high-throughput materials design through automated simulations and tracking of provenance using the AiiDA platform.
2) AiiDA and the Materials Cloud platform allow over 10,000 simulations per day by automating workflows, tracking provenance to ensure reproducibility, and sharing data according to FAIR principles (a minimal AiiDA sketch follows this list).
3) Potential areas for collaboration with EOSC include integrating AiiDA Lab and the Materials Cloud Archive, developing standardized workflows as services, and providing authentication and authorization through B2ACCESS and EGI Check-In.
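A minimal AiiDA sketch of that automatic provenance tracking (assumes a configured AiiDA profile and database):

```python
# Every input, output, and call of a calcfunction is recorded as nodes and
# links in AiiDA's provenance graph.
from aiida import load_profile, orm
from aiida.engine import calcfunction

load_profile()   # connect to the default configured profile

@calcfunction
def add_and_double(x, y):
    return orm.Int(2 * (x.value + y.value))

result = add_and_double(orm.Int(3), orm.Int(4))
print(result.value)     # 14
print(result.creator)   # the calculation node that produced it
```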
Azure Machine Learning: Deep Learning with Python, R, Spark, and CNTK, by Herman Wu
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
2016-10-20 BioExcel: Advances in Scientific Workflow Environments, by Stian Soiland-Reyes
Carole Goble, Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718
Presented at 2016-10-20 BioExcel Workflow Training, BSC, Barcelona
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
NOTE: Although these slides are licensed as CC Attribution, it includes various logos which are covered by their own licenses and copyrights.
RAMSES: Robust Analytic Models for Science at Extreme Scales, by Ian Foster
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
The assets of the remote sensing digital world generate massive volumes of real-time data daily, in which insight information has potential significance if collected and aggregated effectively. We propose a real-time Big Data analytical architecture for remote sensing satellite applications that supports both online and offline data processing.
Apache Airavata is an open source science gateway software framework that allows users to compose, manage, execute, and monitor distributed computational workflows. It provides tools and services to register applications, schedule jobs on various resources, and manage workflows and generated data. Airavata is used across several domains to support scientific workflows and is largely derived from academic research funded by the NSF.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit..., by Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
Software tools to facilitate materials science research, by Anubhav Jain
The document discusses software tools to facilitate materials science research, noting that the author's group works to standardize and automate computational methods for high-throughput calculations and discovery of new functional materials. It advocates for developing automated workflows and analysis frameworks to reduce errors, improve efficiency, and enable non-experts to easily conduct complex simulations and analyses through intuitive online interfaces. The goal is to make advanced computational materials science accessible to a wider audience.
1) Scientists at the Advanced Photon Source use the Argonne Leadership Computing Facility for data reconstruction and analysis from experimental facilities in real-time or near real-time. This provides feedback during experiments.
2) Using the Swift parallel scripting language and ALCF supercomputers like Mira, scientists can process terabytes of data from experiments in minutes rather than hours or days. This enables errors to be detected and addressed during experiments.
3) Key applications discussed include near-field high-energy X-ray diffraction microscopy, X-ray nano/microtomography, and determining crystal structures from diffuse scattering images through simulation and optimization. The workflows developed provide significant time savings and improved experimental outcomes.
Parsl: Pervasive Parallel Programming in Python, by Daniel S. Katz
The document summarizes Parsl, a Python library for pervasive parallel programming. Parsl allows users to naturally express parallelism in Python programs and execute tasks concurrently across different computing platforms while respecting data dependencies. It supports various use cases from small machine learning workloads to extreme-scale simulations involving millions of tasks and thousands of nodes. Parsl provides simple, scalable, and flexible parallel programming while hiding complexity of parallel execution.
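The flavor of the API, in a sketch that runs on local threads (assumes `pip install parsl`):

```python
# Parsl apps return futures; independent apps run concurrently while
# Parsl tracks the data dependencies between them.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def square(x):
    return x * x

@python_app
def total(values):
    return sum(values)

futures = [square(i) for i in range(10)]            # run concurrently
print(total([f.result() for f in futures]).result())
```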
The document discusses grid computing and the development of computational grids. Key points:
- Grids allow for sharing of computing power and resources across geographic locations through networked supercomputers, databases, and instruments.
- Major organizations like NASA, DOE, and NSF are working to build computational grids for applications like scientific simulations and instrument control.
- Indiana University is involved in grid research through various departments and projects focused on resource sharing, portals, middleware, and more.
PEARC17: A real-time machine learning and visualization framework for scientif..., by Feng Li
High-performance computing resources are currently widely used in science and engineering areas. Typical post-hoc approaches use persistent storage to save data produced by simulations, so analysis tasks must read it back from storage into memory. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests that simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework that connects scientific simulations with multi-step machine learning processes and in-situ visualization tools, providing plugged-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulence flows.
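As a rough single-node analogue of simulation-time clustering for anomaly detection (scikit-learn's MiniBatchKMeans standing in for the paper's distributed method; the data is random):

```python
# Streaming clustering: update the model per timestep without storing history,
# then flag points far from every centroid as candidate anomalies.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=4, random_state=0)

for step in range(100):
    batch = np.random.rand(256, 3)   # stand-in for one timestep's flow features
    model.partial_fit(batch)         # incremental update

    dists = model.transform(batch).min(axis=1)   # distance to nearest centroid
    anomalies = batch[dists > dists.mean() + 3 * dists.std()]
```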
Atomate: a high-level interface to generate, execute, and analyze computation..., by Anubhav Jain
Atomate is a high-level interface that makes it easy to generate, execute, and analyze computational materials science workflows. It contains a library of simulation procedures for different packages like VASP. Each procedure translates instructions into workflows of jobs and tasks. Atomate encodes expertise to run simulations and allows customizing workflows. It integrates with FireWorks to execute workflows on supercomputers and store results in databases for further analysis. The goal is to automate simulations and scale to millions of calculations.
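In outline, the atomate usage pattern looks like this sketch (assumes a configured FireWorks LaunchPad and VASP setup; the POSCAR file name is illustrative):

```python
# Generate a preset band-structure workflow and queue it for execution.
from pymatgen.core import Structure
from atomate.vasp.workflows.presets.core import wf_bandstructure
from fireworks import LaunchPad

structure = Structure.from_file("POSCAR")   # illustrative input file
wf = wf_bandstructure(structure)            # optimization + static + bands

lp = LaunchPad.auto_load()   # connect to the configured workflow database
lp.add_wf(wf)                # the workflow is now ready to be launched
```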
The function-as-a-service (FaaS) model is well established in commercial cloud offerings but less so in research computing environments. The Globus Compute service enables remote computing using the FaaS model, but allows users to execute functions on any compute resource where they have access. We provide an overview of the Globus Compute service, and demonstrate how to install an endpoint and execute a function on a remote system.
This material was presented at the Research Computing and Data Management Workshop, hosted by Rensselaer Polytechnic Institute on February 27-28, 2024.
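The submission side of the SDK, in outline (the endpoint UUID is a placeholder; the function is arbitrary):

```python
# Submit a function to a remote Globus Compute endpoint and fetch the result.
from globus_compute_sdk import Executor

def double(x):
    return 2 * x

endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

with Executor(endpoint_id=endpoint_id) as ex:
    future = ex.submit(double, 21)
    print(future.result())   # 42, computed on the remote endpoint
```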
Supermicro designed and implemented a rack-level cluster solution for the San Diego Supercomputer Center (SDSC), optimized for their custom and experimental AI training and inferencing workloads and meeting their environmental and TCO requirements. The project team will discuss the journey of designing and deploying our Rack Plug and Play cluster, and Shawn Strande, Deputy Director, SDSC, will share his experience of partnering with the Supermicro team to solve his challenges in HPC and AI.
The team will also share the technology that powers the SDSC Voyager Supercomputer, the Habana Gaudi AI system with 3rd Gen Intel® Xeon® Scalable processors for Deep Learning Training, and Habana Goya for Inferencing.
Watch the webinar: https://www.brighttalk.com/webcast/17278/517013
Similar to Handling data and workflows in computational materials science: the AiiDA initiative (20)
The Research Data Alliance (RDA) is an international organization with over 11,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing and re-use of research data across technologies, disciplines, and borders. RDA has 36 working groups and 57 interest groups addressing challenges in domains like agriculture, health, materials science, and more. It has produced 50 technical specifications and standards to reduce barriers to data sharing.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries that works to reduce barriers to data sharing and exchange. RDA brings together researchers, scientists, and data professionals through Working Groups and Interest Groups to develop standards and best practices for data infrastructure and sharing. RDA has produced 50 outputs including technical specifications and has groups working on issues across multiple disciplines.
The Research Data Alliance (RDA) is an international organization focused on building the social and technical infrastructure to enable open sharing of data. It has over 10,000 individual members from 144 countries collaborating in Working and Interest Groups to develop recommendations and standards to reduce barriers to data sharing. Some of RDA's achievements include 47 flagship outputs, 100+ adoption cases, and 93 active groups addressing challenges such as metadata, repositories, legal issues, and more. The ultimate goal is to allow researchers and innovators to openly share data across technologies and disciplines to address societal challenges.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. It has 98 working groups and interest groups addressing challenges such as interoperability, data citation, metadata standards, and skills training. The RDA produces recommendations and outputs that are adopted by data repositories, domain organizations, and research communities to reduce barriers to data sharing and exchange.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. The RDA has produced 45 flagship recommendations and outputs and has over 100 cases of adoption across domains. It has 95 active working and interest groups focusing on issues like specific domains, data stewardship, and infrastructure.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. RDA has 91 working groups and interest groups focused on issues like different academic disciplines, legal and technical interoperability, and community needs. The organization has produced 37 flagship recommendations and outputs that have been adopted over 100 times to help reduce barriers to sharing data internationally.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 144 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. RDA has over 100 groups working on data interoperability issues and has produced 37 flagship outputs, including technical specifications, with over 100 adoption cases in various organizations and disciplines.
The Research Data Alliance (RDA) is an international organization with over 9,859 members from 144 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines and countries to address societal challenges. RDA has 85 groups working on data interoperability challenges through Working Groups and Interest Groups. It has produced 32 outputs including technical specifications and seen adoption in over 100 cases. RDA membership is open and free for individuals and provides benefits such as networking and skills development, while organizational membership provides additional benefits such as influencing RDA activities.
The Research Data Alliance (RDA) is an international organization with over 9,600 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. RDA has 85 working and interest groups collaborating to develop recommendations and standards to reduce barriers to data sharing. It has produced 32 flagship recommendations that have been adopted in over 75 cases by organizations worldwide. Membership is open and free for individuals and provides opportunities to work on global data interoperability challenges.
The Research Data Alliance (RDA) is an international organization with over 9,499 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. RDA has developed 32 flagship technical specifications and standards, and their recommendations have been adopted in 75 cases across multiple disciplines, organizations, and countries. RDA members collaborate in 85 working and interest groups focused on issues like interoperability, data stewardship, and community needs. The organization's vision is for researchers to openly share data to address societal challenges.
The Research Data Alliance (RDA) is an international organization with over 9,400 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. Its mission is to reduce barriers to data sharing across technologies, disciplines and countries. RDA has numerous working groups and interest groups addressing challenges such as metadata, citation, preservation, and more. Membership is open and free for individuals and provides opportunities for collaboration.
The Research Data Alliance (RDA) aims to build social and technical bridges that enable open sharing of data. It has over 9,000 members from 137 countries working in 83 groups to address challenges like interoperability, best practices, and more. RDA produces recommendations and specifications to help researchers openly share data across technologies and disciplines to solve societal challenges.
The Research Data Alliance (RDA) aims to facilitate data sharing across disciplines to address societal challenges. Individuals are encouraged to engage with RDA to contribute their expertise to discussions and recommendations, access an international network, receive updates on RDA's work, participate in meetings, and gain experience in all stages of the data lifecycle. RDA benefits from individual participation, as individuals bring ideas, problems, and solutions to create a valuable global community focused on reducing barriers to data sharing.
The document discusses the value of research infrastructure providers engaging with the Research Data Alliance (RDA). It outlines that RDA works to enable open sharing of research data globally across disciplines to address societal challenges. As research is global, infrastructure providers need globally compatible services, and RDA ensures this. The document provides reasons for providers to engage with RDA, such as access to an international network and opportunities to collaborate on data standards. It also describes ways providers can engage, such as joining RDA groups or attending meetings.
The Research Data Alliance (RDA) is an international organization with over 8,900 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. The RDA has developed 32 flagship recommendations and specifications to reduce barriers to data sharing, and has seen 75 cases of adoption across multiple disciplines and countries. It convenes various working and interest groups to develop solutions to challenges in areas like reference frameworks, data stewardship, and community needs.
The Research Data Alliance (RDA) aims to facilitate open sharing of data across technologies and disciplines to address societal challenges. There are two main components - the volunteer community that builds social and technical connections through Working Groups, and the business operations that support the community. Organizations performing research can engage with RDA in various ways like sponsorship, membership, or participation in Working Groups to help shape standards and address issues like data management, quality, and interoperability. RDA offers a global network and opportunities for collaboration on solutions to research data challenges.
The document discusses the value of libraries engaging with the Research Data Alliance (RDA). It outlines several benefits libraries can gain from involvement such as interacting with data professionals, developing strategic partnerships, and gaining expertise. Libraries are encouraged to become organizational members of RDA, have staff join working groups, adopt RDA recommendations, and send representatives to plenaries. RDA works to address challenges around research data reproducibility, preservation, best practices, and more through global collaboration. Libraries are positioned to augment RDA's network as bridges between data activities and open sharing.
The document discusses ways that research funders can engage with and benefit from the Research Data Alliance (RDA). RDA works to build infrastructure for open data sharing across disciplines. Funders that support RDA can get more value from the research they fund through improved data quality, reuse, and benefits to stakeholders. Funders can encourage adoption of RDA outputs, support RDA operations, participate in forums, and sponsor events, fellowships, and pilots implementing RDA recommendations. Engaging with RDA helps funders deliver more benefits from research and supports RDA's work of improving data sharing.
The Research Data Alliance (RDA) aims to build social and technical bridges to enable open sharing of data. It has over 8,800 members from 137 countries working in 87 groups to develop recommendations and standards to reduce barriers to data sharing. Some of RDA's outputs include recommendations on data citation, metadata standards, and repository interoperability.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey", our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
Handling data and workflows in computational materials science: the AiiDA initiative
1. Handling data and workflows in computational materials science: the AiiDA initiative
Andrea Ferretti
Firenze, 15 Nov 2016
2. COMPUTATIONAL MATERIALS' SCIENCE
- Highly accurate ab initio methods in electronic structure
- Large computational power required (now available)
- High-throughput screening possible
- Reduced need for experimental data
N. Marzari, Nature Materials, Apr 2016; PRL 105, 106601 (2010)
3. COMPUTATIONAL MATERIALS’ SCIENCE
G. Hautier et al, Nat Comm 4, 2292 (2013)
[Figure residue from G. Hautier et al.: Figure 2 plots effective mass versus band gap for the p-type TCO candidates (red dots; e.g. ZnO, SnO2, In2O3, B6O, K2Pb2O3, PbZrO3, PbHfO3, ...) against current p-type and n-type TCOs; the best TCOs lie in the lower right corner. Figure 3 shows vacancy/defect formation energies (eV). p-type dopability has already been reported experimentally or computationally for several of the candidates; B6O, for example, has been experimentally measured to show p-type conductivity.]
4. COMPUTATIONAL MATERIALS' SCIENCE
- Highly accurate ab initio methods in electronic structure
- Large computational power required (now available)
- High-throughput screening possible
- Reduced need for experimental data
- Data handling needed
N. Marzari, Nature Materials, Apr 2016; PRL 105, 106601 (2010)
5. SOME THOUGHTS ON DATA
• In computational science, data are naturally generated, so the workflows that create properties and data from a structure are key
• Curated data are needed (e.g. for verification or for machine learning)
• A model of data-on-demand can be implemented (high-throughput pushes the development of robust workflows that calculate properties automatically)
6. OBJECTIVES
• Automation: run thousands of calculations daily
• Provenance: all children and all parent data are recorded
• Reproducibility: go back to a simulation years later, and redo it with new parameters or codes
• Extensible/agnostic to models, codes and formats
• Workflows: dynamical, robust, complex "turnkey solutions" that calculate desired properties on demand
• Sharing: provide the distributed environment to disseminate workflows and data and to provide services
7. ADES MODEL FOR COMPUTATIONAL SCIENCE
G. Pizzi et al., Comp. Mat. Sci. 111, 218-230 (2016)
Low-level pillars and user-level pillars
9. The four ADES pillars:
- Automation ("a factory"): automation, remote management, high-throughput
- Data ("a library"): database, provenance, storage
- Environment ("a scholar"): research environment, scientific workflows, data analytics
- Sharing ("a community"): social, sharing, standards
http://www.aiida.net
(MIT BSD, jointly developed with Robert Bosch)
G. Pizzi et al., Comp. Mat. Sci. 111, 218 (2016)
10. ADES: Automation in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Remote management, coupling to data, high throughput
11. Automation in AiiDA: what is AiiDA?
1. The core of the code is the AiiDA API (Application Programming Interface), a set of Python classes that exposes the key objects to users: Calculations, Codes, and Data.
12. Automation in AiiDA
2. The AiiDA Object-Relational Mapper (ORM) maps AiiDA objects onto Python classes, so that objects can be created/modified/queried via an agnostic high-level interface. Any interaction with storage occurs transparently via Python calls, as sketched below.
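The following is a minimal sketch of the ORM idea, not the actual AiiDA API: the Node class, the attributes table, and the method names are all illustrative. The point is that setting and reading attributes looks like plain Python, while every access is translated into SQL behind the scenes.

```python
# Minimal sketch of the ORM idea (illustrative, not the actual AiiDA API):
# objects are plain Python instances, and every attribute access or query
# is translated transparently into storage operations.
import sqlite3

class Node:
    """A stored object; attributes live in an SQL table behind the scenes."""
    def __init__(self, db, node_id):
        self._db, self.id = db, node_id

    def set_attr(self, key, value):
        self._db.execute(
            "INSERT INTO attributes (node_id, key, value) VALUES (?, ?, ?)",
            (self.id, key, str(value)))

    def get_attr(self, key):
        row = self._db.execute(
            "SELECT value FROM attributes WHERE node_id=? AND key=?",
            (self.id, key)).fetchone()
        return row[0] if row else None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE attributes (node_id INT, key TEXT, value TEXT)")
structure = Node(db, node_id=1)
structure.set_attr("formula", "B6O")   # stored via SQL, used via Python
print(structure.get_attr("formula"))   # -> B6O
```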
13. Automation in AiiDA
3. A daemon manages calculation states (submission, retrieval, parsing, ...) without user intervention (using the Python celery and supervisor packages), through remote transports and Slurm/PBS Pro/SGE/Torque plugins.
14. Automation in AiiDA
4. User interaction occurs via the verdi command-line tool, the interactive shell, or Python scripts; a schematic launch script is sketched below.
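Purely as an illustration of what such a script does (the class names, fields, and submit function here are hypothetical, not AiiDA's API): the user describes the calculation in plain Python and hands it to the daemon's queue, which then drives it to completion.

```python
# Schematic of a user-side launch script (names are illustrative, not the
# real AiiDA classes). The point: everything is plain Python, and
# submission just hands the calculation over to the daemon.
from dataclasses import dataclass, field

@dataclass
class Calculation:
    code: str                                  # which executable, on which computer
    inputs: dict = field(default_factory=dict)
    state: str = "NEW"

def submit(calc, queue):
    """Hand the calculation to the daemon's queue and return immediately."""
    calc.state = "SUBMITTING"
    queue.append(calc)

daemon_queue = []   # stands in for the daemon's work queue
calc = Calculation(code="pw@my_cluster",
                   inputs={"structure": "B6O", "ecutwfc": 60.0})
submit(calc, daemon_queue)   # the daemon polls this queue in the background
print(calc.state)            # -> SUBMITTING
```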
15. Coupling automation with storage
• The AiiDA API acts as the unique interface to heterogeneous, remote HPC resources, which are abstracted away: all work can be done on the local resources, and the user does not need to connect explicitly to remote HPC machines.
• Coupling automation with storage ensures:
  - uniformity of the input data and of the usage of codes and computers (the same interface encompasses several supercomputers, different schedulers, connection protocols, ...)
  - full reproducibility and provenance, with automatic storage of all data and links
  - seamless sharing of calculations with other users
16. ADES: Data in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Storage, database, provenance
17. The Open Provenance Model
• Any calculation is a function, manipulating inputs to obtain outputs: out1, out2 = F(in1, in2)
• Each functional object is a node in a graph, connected together with directional, labeled links
• Output nodes in turn can be used as inputs of following calculations
[Diagram: data nodes in1 and in2 feed the calculation node F, which produces data nodes out1 and out2; a sketch of this bookkeeping follows.]
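A minimal sketch of this bookkeeping (the data structures and the record_call helper are illustrative, not AiiDA's): running a function through record_call stores one calculation node, links it to its input data nodes, and wraps each return value in a new data node that can feed later calculations.

```python
# Minimal sketch of the provenance idea: running out1, out2 = F(in1, in2)
# records data and calculation nodes plus labeled, directed links.
nodes, links = [], []   # links: (from_node, to_node, label)

def record_call(func, labeled_inputs):
    """Run func on data nodes, storing the calculation and its links."""
    calc = {"type": "calc", "name": func.__name__}
    nodes.append(calc)
    for label, data in labeled_inputs.items():
        links.append((data["name"], calc["name"], label))        # input links
    results = func(*(d["value"] for d in labeled_inputs.values()))
    outputs = []
    for i, value in enumerate(results):
        out = {"type": "data", "name": f"{func.__name__}_out{i+1}",
               "value": value}
        nodes.append(out)
        links.append((calc["name"], out["name"], f"out{i+1}"))   # output links
        outputs.append(out)
    return outputs   # output nodes can feed later calculations

def F(a, b):
    return a + b, a * b

in1 = {"type": "data", "name": "in1", "value": 2}
in2 = {"type": "data", "name": "in2", "value": 3}
nodes += [in1, in2]
out1, out2 = record_call(F, {"in1": in1, "in2": in2})
print(links)   # directed, labeled edges of the provenance DAG
```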
19. Saving the DAGs: Nodes and Links
Nodes and links form a graph structure:
• Each node: a row in an SQL table, plus a folder for files
• Links are also stored in an SQL table (job provenance)
Transitive closure (TC) table:
• Allows queries that traverse the graph
• Automatically updated using triggers
• Queries using the TC table in SQL are faster than with graph DB backends! (A sketch of the TC idea follows.)
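A minimal, illustrative sketch of a TC table, assuming SQLite; AiiDA maintains the table with PostgreSQL triggers, while here a Python function plays the trigger's role. With the TC table filled, "does node B descend from node A?" becomes a single indexed lookup instead of a graph traversal.

```python
# Sketch of a transitive-closure (TC) table (illustrative). On every new
# link, all newly reachable ancestor-descendant pairs are added, so
# graph-traversal queries become simple table lookups.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE link (parent INT, child INT)")
db.execute("CREATE TABLE tc (ancestor INT, descendant INT)")

def add_link(parent, child):
    db.execute("INSERT INTO link VALUES (?, ?)", (parent, child))
    # every ancestor of `parent` (plus parent itself) now reaches
    # every descendant of `child` (plus child itself)
    ancestors = [parent] + [r[0] for r in db.execute(
        "SELECT ancestor FROM tc WHERE descendant=?", (parent,))]
    descendants = [child] + [r[0] for r in db.execute(
        "SELECT descendant FROM tc WHERE ancestor=?", (child,))]
    for a in ancestors:
        for d in descendants:
            db.execute("INSERT INTO tc VALUES (?, ?)", (a, d))

add_link(1, 2)   # structure -> calculation
add_link(2, 3)   # calculation -> band structure
# "is node 3 derived from node 1?" is now a single lookup, no traversal:
print(db.execute(
    "SELECT 1 FROM tc WHERE ancestor=1 AND descendant=3").fetchone())
```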
20. Benchmark against Neo4j
• Graph databases exist (e.g. Neo4j)
• They are still young, while SQL is very mature
• Our benchmark (with PostgreSQL) vs. Neo4j, on the same realistic data: ~11K graphs, ~100K nodes, >1M attributes
[Plot: query time (s) versus number of results, for AiiDA (queries 1 and 2) and for Neo4j (query 1 and query 2).]
21. The AiiDA daemon
A daemon runs in the background, advancing each calculation through its states:
SUBMITTING → WITHSCHEDULER → RETRIEVING → PARSING → FINISHED
(A sketch of this state machine follows.)
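A minimal sketch of the daemon's state machine for one calculation (the actions are illustrative stand-ins): at every pass the daemon performs the action for the current state and advances, with no user intervention.

```python
# Sketch of the daemon's state machine for one calculation (illustrative):
# the daemon advances each calculation through the fixed sequence of
# states, performing the corresponding action at every step.
STATES = ["SUBMITTING", "WITHSCHEDULER", "RETRIEVING", "PARSING", "FINISHED"]

ACTIONS = {
    "SUBMITTING":    lambda job: print("uploading inputs, calling sbatch/qsub"),
    "WITHSCHEDULER": lambda job: print("polling the scheduler until the job ends"),
    "RETRIEVING":    lambda job: print("downloading output files"),
    "PARSING":       lambda job: print("parsing outputs into new DB nodes"),
}

def daemon_tick(job):
    """One pass of the daemon: act on the current state, then advance."""
    state = job["state"]
    if state == "FINISHED":
        return
    ACTIONS[state](job)
    job["state"] = STATES[STATES.index(state) + 1]

job = {"id": 42, "state": "SUBMITTING"}
while job["state"] != "FINISHED":
    daemon_tick(job)
print(job)   # -> {'id': 42, 'state': 'FINISHED'}
```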
22. ADES: Environment in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
High-level workspace, scientific workflows, data analytics
23. Environment in AiiDA: plugins
All functionality is provided through a plugin interface (a minimal sketch follows the list):
• Calculation: generation of input files for a given code (Quantum Espresso, Phonopy, GPAW, Yambo, NWChem, ...)
• Data: management of data objects for input/output (files & folders, parameter sets, remote data, structures, pseudos, ...)
• Parser: parsing of code output and generation of new DB nodes (Quantum Espresso, Phonopy, GPAW, Yambo, NWChem, ...)
• Transport: how to connect to a cluster (local connection, ssh, ...)
• Scheduler: how to interact with the scheduler (PBSPro, Torque, SGE, SLURM, ...)
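A minimal sketch of such a plugin interface, assuming abstract base classes and a name-to-class registry (class and registry names are illustrative, not AiiDA's): adding support for a new connection method or scheduler means writing one subclass and registering it, with no change to the core.

```python
# Sketch of the plugin idea (illustrative): each extension point is an
# abstract interface, and concrete plugins are looked up by name, so new
# codes, transports, or schedulers can be added without touching the core.
from abc import ABC, abstractmethod
import subprocess

class Transport(ABC):
    @abstractmethod
    def exec_command(self, cmd: str) -> str: ...

class LocalTransport(Transport):
    def exec_command(self, cmd: str) -> str:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

class Scheduler(ABC):
    @abstractmethod
    def submit_command(self, script: str) -> str: ...

class SlurmScheduler(Scheduler):
    def submit_command(self, script: str) -> str:
        return f"sbatch {script}"

# a registry maps plugin names to classes
TRANSPORTS = {"local": LocalTransport}
SCHEDULERS = {"slurm": SlurmScheduler}

transport = TRANSPORTS["local"]()
scheduler = SCHEDULERS["slurm"]()
print(scheduler.submit_command("job.sh"))   # -> sbatch job.sh
```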
24. Environment in AiiDA: Workflows
• Full Python scripting capabilities
• AiiDA manages calculation dependencies
• Workflows are modular: users can expand on the workflows of others
• A step can call nested subworkflows
• Develop turn-key solutions for the calculation of material properties: libraries of workflows
25. Workflow features
• Automatic provenance tracking, stored in the DB using simple Python functions: inputs, outputs, and function calls are stored by adding a simple decorator to existing functions (see the sketch after this list)
• Serial and parallel execution support: long-running tasks can be launched on separate threads, waiting for the result only when needed
• Control of provenance granularity: store the level of detail relevant to the workflow
• Seamless mixing of local and remote jobs
• Progress checkpointing: restart from an arbitrary step, retry on failure
• Easy debugging: execute workflows in an IDE and observe/change the state of variables as they run
• Background execution: daemon execution allows the machine to be shut down and the workflow to continue from the last point, essential for long remote jobs
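A minimal sketch of the decorator idea (illustrative; the in-memory PROVENANCE list stands in for the database): wrapping an existing function records its inputs, outputs, and the call itself, without touching the function body.

```python
# Sketch of provenance-by-decorator (illustrative): wrapping an existing
# function records its inputs, outputs, and the call itself.
import functools

PROVENANCE = []   # stands in for the database

def track(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE.append({"call": func.__name__,
                           "inputs": {"args": args, "kwargs": kwargs},
                           "outputs": result})
        return result
    return wrapper

@track
def relax_structure(structure, ecutwfc=60.0):
    # placeholder for a real calculation
    return {"energy": -123.4, "structure": structure + " (relaxed)"}

relax_structure("B6O")
print(PROVENANCE)   # inputs, outputs and calls, recorded automatically
```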
26. WORKFLOWS ENCODING CORE KNOWLEDGE
[Diagram 1: PHONON workflow, computing phonon dispersions (plus elastic and dielectric properties). From the input parameters, an energy calculation and a phonon initialization distribute the q-points to independent single-q phonon calculations. Each single-q "restart" sub-workflow loops on itself if it fails (changing parameters) and restarts after a clean stop (max CPU time reached); the results are collected into dynamical matrices and Fourier-interpolated to yield the phonon dispersion. The restart logic is sketched below.]
[Diagram 2: CHRONOS workflow, determining the electronic-magnetic-atomic structure. Starting from a structure and a set of tested and converged pseudos (SSSP), it tests the metallic character, generates structures with random magnetizations, runs magnetic and non-magnetic energy relaxations, selects the lowest-energy configuration, and performs a final energy relaxation plus bands to obtain the magnetic properties and the electronic bands.]
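A minimal sketch of that restart behaviour (function names and status values are illustrative): a step is retried with adjusted parameters when it fails, and resumed from its checkpoint after a clean stop.

```python
# Sketch of the restart logic in a phonon-like sub-workflow (illustrative):
# retry with changed parameters on failure, resume from a checkpoint
# after a clean stop (e.g. max CPU time reached).
def run_with_restarts(step, params, max_retries=3):
    checkpoint = None
    for attempt in range(max_retries):
        status, checkpoint = step(params, checkpoint)
        if status == "ok":
            return checkpoint
        if status == "clean_stop":
            continue   # restart from the checkpoint with the same parameters
        if status == "failed":
            params = dict(params, mixing=params["mixing"] * 0.5)  # change params
    raise RuntimeError("step did not converge")

def single_q_calculation(params, checkpoint):
    """Toy step: needs two passes, hitting the walltime once."""
    done = (checkpoint or 0) + 1
    if done < 2:
        return "clean_stop", done   # hit walltime, resume later
    return "ok", done

print(run_with_restarts(single_q_calculation, {"mixing": 0.7}))   # -> 2
```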
28. ADES: Sharing in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Social ecosystem, repository pipelines, standardization
29. Sharing in AiiDA
[Diagram: users, clusters and databases; each group keeps some data private and shares some data with the other groups and with public/shared repositories.]
• Sharing model in AiiDA: data can be pushed to the outside world or to other repositories
• Importer of previous calculations
• UUIDs are used to uniquely identify all data/calculation objects (see the sketch below)
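A minimal sketch of why UUIDs make sharing safe (illustrative; real AiiDA export/import is richer): identifiers minted with uuid4 do not clash across databases, and a re-import of already-known nodes is detected by UUID rather than duplicated.

```python
# Sketch of UUID-based sharing (illustrative): every node gets a
# universally unique identifier at creation, so data exported from one
# database can be imported elsewhere without identifier clashes.
import uuid

def new_node(**attrs):
    return {"uuid": str(uuid.uuid4()), **attrs}

def import_nodes(local_db, exported):
    for node in exported:
        if node["uuid"] in local_db:
            continue   # already known: skip, don't duplicate
        local_db[node["uuid"]] = node

shared = [new_node(kind="structure", formula="B6O")]
mydb = {}
import_nodes(mydb, shared)
import_nodes(mydb, shared)   # importing twice is harmless
print(len(mydb))             # -> 1
```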
31. CONCLUSIONS
• In computational science, data are naturally calculated, not harvested
• The ADES model (Automation - Data - Environment - Sharing)
• AiiDA v1.0 to be released by the end of 2016
• A DMP (data management plan) is part of, and distributed with, the AiiDA software
• AiiDA as a turn-key solution for data management