The document discusses storing high-energy physics data in the DAOS object storage system. It provides an overview of high-energy physics experiments and data, and the ROOT framework commonly used for data analysis. It then introduces RNTuple, a new data format being developed to replace the legacy TTree format used in ROOT. The document details how RNTuple maps data to DAOS objects and describes a C++ interface for interacting with DAOS. It notes that from the user perspective, working with RNTuple data in DAOS is similar to working with files on disk.
NASA's Earth Observing System (EOS) archive includes data collected over many years by many satellite instruments. These data are stored in the HDF format, which holds both data and metadata. The content of the metadata was examined for compliance with a set of conventions developed by the NASA science community at the beginning of the EOS Project (the HDF-EOS conventions). The initial results show that ~50% of the data files and 76% of the datasets have metadata that allows them to be used easily in standard tools. This talk was presented at the ESIP (esipfed.org) meeting in January 2014.
A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the POSIX file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago.edu/swift.
New data access paradigms support a variety of human and machine access paths with data servers (THREDDS, https://www.unidata.ucar.edu/software/thredds/current/tds/ and Hyrax, http://opendap.org) that support multiple services for a given dataset. We need metadata that can describe those services and unambiguously differentiate between access paths for humans and for machines. The ISO 19115 metadata standard includes service metadata and allows data and services for that data to be described in the same record. I propose that we use the service metadata for machine access and the more traditional distribution information for human access. This talk was presented at the ESIP (esipfed.org) meeting in January 2014.
A brief introduction to the Bayesian analysis program PyRate for paleobiology colleagues. Given at a lab meeting, so the format is casual and a good chunk of prior knowledge is assumed.
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab (CloudxLab)
Pig is an engine for executing data flows in parallel on Hadoop. It uses a language called Pig Latin to analyze large datasets. Pig provides relational operators like FOREACH, GROUP, and FILTER to process data in parallel. A hands-on example demonstrates loading dividend data, grouping it by stock symbol, calculating the average dividend for each symbol, and storing the results.
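The Pig Latin itself is not reproduced here; as a rough Python analogue of the same dataflow, a PySpark sketch might look like this (the input file and column names are assumptions):

```python
# Rough PySpark analogue of the Pig dataflow described above:
# load dividend records, group by stock symbol, average, store.
# The input path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dividend-averages").getOrCreate()

dividends = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("NYSE_dividends.csv"))          # assumed input file

avg_per_symbol = (dividends
                  .groupBy("symbol")                      # like GROUP ... BY symbol
                  .agg(F.avg("dividend").alias("avg")))   # like FOREACH ... GENERATE AVG(...)

avg_per_symbol.write.mode("overwrite").csv("average_dividend")   # like STORE ... INTO
```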
I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop (GoDataDriven)
When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.
When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).
In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
NoSQL Couchbase Lite & BigData HPCC Systems (Fujio Turner)
Mobile devices are becoming a major new source of data. Managing data on mobile devices has become easier with the NoSQL Couchbase Lite mobile database. Making sense of it, analyzing it, and scaling to exabytes has also become easier with the LexisNexis big data platform HPCC Systems.
Big Data - Load CSV File & Query the EZ way - HPCC Systems (Fujio Turner)
A "How To" to load CSV files into HPCC Systems and query them. You can use this method to migrate your RDBMS data ,MySQL / Oracle / SQL, into HPCC Systems.
The document discusses how storage models need to evolve as the underlying technologies change. Object stores like S3 provide scale and high availability but lack the semantics and performance of file systems. Non-volatile memory also challenges current models. The POSIX file system metaphor is ill-suited for object stores and NVM. SQL provides an alternative that abstracts away the underlying complexities, leaving just object-relational mapping and transaction isolation to address. The document examines renaming operations, asynchronous I/O, and persistent in-memory data structures as examples of areas where new models may be needed.
Yahoo's data ETL pipeline continuously processes tens of terabytes of data every day. Finding a storage methodology that can store and fetch this data efficiently has always been a challenge for the Yahoo data ETL pipeline. A recent study inside Yahoo showed a dramatic data size reduction from switching from the Sequence file format to the RCFile format, so we decided to convert our data to RCFile. The most challenging task is manually serializing the data objects. We rely on Jute, a Hadoop record compiler, to provide serialization code. However, Jute does not support the RCFile format, and the RCFile format does not support native Hadoop Writable objects, so writing serialization code becomes complicated and repetitive. Hence, we invented the JuteRC compiler, an extension to the Hadoop record compiler (Jute). It generates serialization/deserialization code for any user-defined primitive or composite data type, which MapReduce programmers can plug in directly to produce MapReduce output files in the RCFile storage format. With the help of the JuteRC compiler, our experiments on Yahoo audience data showed a 26-28% file size reduction and a 40% read/write performance improvement compared to Sequence files. We are currently in the process of open-sourcing JuteRC.
Using NLP to Explore Entity Relationships in COVID-19 Literature (Databricks)
In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques. We will also cover how to use rule-based entity extraction to bootstrap a named entity recognition model. The other important aspect of this project we will cover is how to infer relationships between entities, and combine them with explicit relationships found in the source data sets. Although this talk is focused on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text.
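As a minimal sketch of the rule-based side of such a pipeline, a curated term list can be turned into an entity matcher with spaCy's EntityRuler, and the matches can later be used to bootstrap NER training data; the labels and patterns below are illustrative assumptions, not the talk's actual rules:

```python
# Minimal sketch: rule-based entity extraction with spaCy's EntityRuler.
# Matches from a curated term list can be used to bootstrap training
# data for a statistical NER model. Terms and labels are assumptions.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "DISEASE", "pattern": "COVID-19"},
    {"label": "DRUG", "pattern": "remdesivir"},
])

doc = nlp("The trial evaluated remdesivir as a treatment for COVID-19.")
print([(ent.text, ent.label_) for ent in doc.ents])
```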
The MathWorks introduced MATLAB support for HDF5 in 2002 via three high-level functions: HDF5INFO, HDF5READ, and HDF5WRITE. These functions worked well for their purpose (providing simple interfaces to a complicated file format), but MATLAB users requested finer control over their HDF5 files and the HDF5 library. MATLAB 7.3 (R2006b) adds this precise level of support for version 1.6.5 of the HDF5 library via a close mapping of the HDF5 C API to MATLAB function calls.
This presentation will briefly introduce the earlier, high-level HDF5 interface (and its limitations) before showing in detail the low-level HDF5 functions. It will show how to interact with the HDF5 library and files using the thirteen classes of functions in MATLAB, which encapsulate groupings of functionality found in the HDF5 C API. But because MATLAB is itself a higher-level language than C, we will also present MATLAB's extensions and modifications of the HDF5 C API that make it more MATLAB-like, work with defined values, and perform ID and memory management.
Wrapping a library like HDF5 requires a great deal of effort and design, and we will briefly present a general-purpose mechanism for creating close mappings between library interfaces and an application like MATLAB. One of our goals in this presentation is to facilitate communication with The HDF Group about how The MathWorks builds our HDF5 interfaces in order to ease adoption of future versions of the HDF5 library in large, general-purpose applications.
The document discusses how to use the R programming language and Amazon's Elastic MapReduce service to quickly create a Hadoop cluster on Amazon Web Services in only 15 minutes. It demonstrates running a stochastic simulation to estimate pi by distributing 1,000 simulations across the Hadoop cluster and combining the results. The total cost of running the cluster for 15 minutes was only $0.15, showing how inexpensive it can be to leverage Hadoop's capabilities.
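The original demo is in R on Elastic MapReduce; the underlying arithmetic is simple enough to sketch in plain Python, with each simulated batch standing in for a map task and the final sum for the combine step (counts are illustrative):

```python
# Plain-Python sketch of the distributed pi estimate described above:
# many independent simulations (the "map" phase) are combined at the
# end (the "reduce" phase). Sample counts are illustrative.
import random

def simulate(darts: int) -> int:
    """One map task: count darts landing inside the unit quarter circle."""
    return sum(1 for _ in range(darts)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)

n_tasks, darts_per_task = 1000, 10_000
hits = sum(simulate(darts_per_task) for _ in range(n_tasks))   # combine step
pi_estimate = 4.0 * hits / (n_tasks * darts_per_task)
print(pi_estimate)
```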
The document discusses the Semantic Web and Linked Data. It provides an overview of RDF syntaxes, storage and querying technologies for the Semantic Web. It also discusses issues around scalability and reasoning over large amounts of semantic data. Examples are provided to illustrate SPARQL querying of RDF data, including graph patterns, conjunctions, optional patterns and value testing.
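To make the SPARQL features mentioned above concrete, here is a small Python/rdflib sketch; the tiny FOAF graph and the query are invented for illustration:

```python
# Sketch of a SPARQL query with a basic graph pattern, an OPTIONAL
# clause, and a FILTER, run over a tiny in-memory RDF graph with rdflib.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:alice foaf:name "Alice" ; foaf:age 42 ; foaf:mbox <mailto:alice@example.org> .
_:bob   foaf:name "Bob"   ; foaf:age 17 .
""", format="turtle")

results = g.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox WHERE {
    ?person foaf:name ?name ; foaf:age ?age .
    OPTIONAL { ?person foaf:mbox ?mbox }   # optional pattern: mbox may be absent
    FILTER (?age > 18)                     # value testing
}
""")
for name, mbox in results:
    print(name, mbox)
```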
HDF5 is a powerful and feature-rich creature, and getting the most out of it requires powerful tools. The MathWorks provides a "low-level" interface to the HDF5 library that closely corresponds to the C API and exposes much of its richness. This short tutorial will present ways to use the low-level MATLAB interface to build those tools and tackle such topics as subsetting, chunking, and compression.
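The tutorial targets MATLAB's low-level interface; for orientation, the same three topics look roughly like this in Python's h5py (dataset names and shapes are invented):

```python
# h5py analogue of the topics named above: chunked, compressed storage
# and partial (hyperslab) reads. Names and shapes are illustrative.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset(
        "temperature",
        shape=(10000, 720, 360),
        dtype="f4",
        chunks=(1, 720, 360),     # chunking: one time step per chunk
        compression="gzip",       # compression is applied per chunk
        compression_opts=4,
    )
    dset[0] = np.random.rand(720, 360).astype("f4")

with h5py.File("example.h5", "r") as f:
    # subsetting: read only a small hyperslab, not the whole dataset
    tile = f["temperature"][0, 100:110, 50:60]
    print(tile.shape)
```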
NCAR Command Language (NCL) is an interpreted language designed for scientific data analysis and visualization with high-quality graphics, especially for atmospheric science. NCL has supported NetCDF 3/4, GRIB 1/2, HDF-SDS, HDF-EOS, shapefiles, binary, and ASCII files for years. HDF-EOS5 support is now in the released version, and HDF5 support is in beta testing.
The NCL team is now developing support for writing HDF5 files and for reading HDF-EOS5 data with OPeNDAP.
The team will share their experience visualizing and analyzing HDF-EOS5 and HDF5 data.
Paul Tarjan (http://github.com/ptarjan) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
This document discusses a Python library for parsing Hadoop Record files.
The library includes a parser that can parse Hadoop's Data Definition Language into generic Python data types. It outputs the data structure, but the user must transform it into a class structure.
The parsing library is only part of what is needed: a DDL translator is still required to fully convert the data definition language into Python classes. Feedback is welcome to improve the library.
This document provides an overview of Spark, including its history, use cases, architecture, and ecosystem. Some key points:
- Spark is an open-source cluster computing framework that allows processing of large datasets in parallel across compute clusters. It was developed at UC Berkeley in 2009 and became a top-level Apache project in 2013.
- Spark can be used for tasks like log analysis, text processing, analytics, search, and fraud detection on large datasets distributed across clusters. It offers APIs in Scala, Java, Python and can integrate with Hadoop ecosystem.
- Spark uses Resilient Distributed Datasets (RDDs) as its basic abstraction, allowing data to be processed in parallel. Transformations on RDDs build new RDDs lazily, while actions trigger computation and return results.
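A minimal PySpark sketch of the RDD model summarized above (the input path and filter term are assumptions):

```python
# Minimal RDD sketch: transformations (filter, map) are lazy; the
# count() and take() actions trigger the distributed computation.
# The HDFS path and "ERROR" filter term are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

lines = sc.textFile("hdfs:///logs/app.log")            # RDD partitioned across the cluster
errors = lines.filter(lambda line: "ERROR" in line)    # transformation (lazy)
codes = errors.map(lambda line: line.split()[0])        # transformation (lazy)

print(errors.count())   # action: runs the job
print(codes.take(5))    # action: fetches a few results
sc.stop()
```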
PigSPARQL: A SPARQL Query Processing Baseline for Big Data (Alexander Schätzle)
In this paper we discuss PigSPARQL, a competitive yet easy to use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstraction makes our approach independent of the actual Hadoop version and thus ensures the compatibility to future changes of the Hadoop framework as they will be covered by the underlying Pig layer. We revisit PigSPARQL and demonstrate the performance improvement when simply switching the underlying version of Pig from 0.5.0 to 0.11.0 without any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce based SPARQL implementations which is also underpinned by its competitiveness with existing systems, e.g. HadoopRDF.
Uplift – Generating RDF datasets from non-RDF data with R2RML (Christophe Debruyne)
The document discusses generating RDF datasets from non-RDF data sources using R2RML (RDB to RDF Mapping Language). It provides background on RDF, RDF Schema, SPARQL, Linked Data, and describes how R2RML can be used to map relational databases to RDF according to the W3C recommendation. Direct mappings are discussed as well as content negotiation in transforming data to RDF.
Big Data - Load, Index & Query the EZ way - HPCC Systems (Fujio Turner)
Learn how to index your big data to get the speed that you want and need. With HPCC Systems, use fewer machines and do more work faster than Hadoop.
To install HPCC Systems in just 5 minutes, watch this YouTube video: http://www.youtube.com/watch?v=8SV43DCUqJg
The document discusses providing easy access to HDF data via NCL, IDL, and MATLAB. It presents examples and code snippets for reading HDF data from various NASA data centers like GES DISC, MODAPS, NSIDC, and LP-DAAC into the three software packages. Common issues when working with HDF files like HDF-EOS2 swaths with dimension maps and different ways metadata is stored are also addressed. The overall goal is to help lower the learning curve for users who want to analyze HDF data in their favorite analysis packages.
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean) (Gruter)
This document discusses setting up and using Tajo, an Apache Hadoop-based data warehousing system, on AWS. It provides instructions on using Tajo Cloud to easily configure a Tajo cluster on AWS. Examples show how to connect external data from S3, perform queries, and analyze customer cohort data to understand purchase patterns over time. Tajo allows direct access to data in S3 and dynamic scaling of worker nodes, and its connector enables remote querying from SQL clients, Excel, and R.
Hadoop is an open-source software framework written in Java for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity hardware. Hadoop features include a distributed file system called HDFS that stores data across compute nodes, and a programming model called MapReduce that processes data in parallel across the cluster.
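To make the MapReduce model concrete, here is a toy single-process Python imitation of its map, shuffle, and reduce phases (an illustration of the programming model only, not of Hadoop itself):

```python
# Toy illustration of the MapReduce programming model: map emits
# key/value pairs, the framework groups them by key, reduce aggregates.
from collections import defaultdict

def map_phase(document: str):
    for word in document.split():
        yield word.lower(), 1            # emit (key, value)

def reduce_phase(key, values):
    return key, sum(values)              # aggregate per key

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "shuffle": group intermediate values by key
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)   # e.g. {'the': 3, 'fox': 2, ...}
```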
This document summarizes the process for connecting HDF and ISO metadata standards. It involves:
1. Creating an ISO compliant XML file (SMAP.xml) containing metadata.
2. Transforming this ISO file into an equivalent NcML format file (ISO2NCML.xml) using an XSL stylesheet.
3. Transforming the NcML file into Python code (NCML2h5py.py) to instantiate the metadata structure in an HDF5 file (SMAP.h5) using another XSL stylesheet (a rough sketch of what such generated code might look like appears after this list).
4. Extracting the HDF5 metadata structure back into an XML file (SMAPHDF.xml) and transforming it back into ISO format (SM
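The generated code itself is not included in the summary; purely as a hypothetical illustration, code produced by step 3 might use h5py along these lines (group names, attribute names, and values are invented):

```python
# Hypothetical sketch of the kind of h5py code the NcML-to-Python step
# could generate: groups and string attributes mirroring ISO metadata
# elements inside the HDF5 file. Names and values are invented.
import h5py

with h5py.File("SMAP.h5", "w") as f:
    iso = f.create_group("ISO_Metadata")
    ident = iso.create_group("MD_DataIdentification")
    ident.attrs["title"] = "SMAP L3 Radiometer Soil Moisture"
    ident.attrs["abstract"] = "Example abstract text."
    contact = iso.create_group("CI_ResponsibleParty")
    contact.attrs["organisationName"] = "Example Data Center"
```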
Full version of http://www.slideshare.net/valexiev1/gvp-lodcidocshort. The same is available at http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version.
2014-09-09: Getty special session: short version
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Bridging Batch and Real-time Systems for Anomaly Detection (DataWorks Summit)
This document discusses using a stack of Hadoop, Spark, and Elasticsearch to perform anomaly detection on large datasets in both batch and real-time. Hadoop is used for large-scale data storage and preprocessing. Spark is used to perform in-depth analysis to identify common entities and build models. Elasticsearch allows searching the data in real-time and performing aggregations to identify uncommon entities. A live loop continuously adapts the models to react to streaming data and improve anomaly detection over time.
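As a rough sketch of the "aggregations to identify uncommon entities" idea, the following Python snippet runs an Elasticsearch terms aggregation ordered by ascending count; the index name, field name, and host are assumptions, not details from the talk:

```python
# Sketch: use an Elasticsearch terms aggregation, ordered by ascending
# count, to surface the rarest (potentially anomalous) entities.
# Index name, field name, and host are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="events",
    size=0,
    aggs={
        "rare_entities": {
            "terms": {
                "field": "entity.keyword",
                "size": 10,
                "order": {"_count": "asc"},   # least frequent first
            }
        }
    },
)
for bucket in response["aggregations"]["rare_entities"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```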
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While... (Databricks)
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version (http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html)
2014-09-09: Getty special session: short version (http://VladimirAlexiev.github.io/pres/20140905-CIDOC-GVP/GVP-LOD-CIDOC-short.pdf)
This document describes Pypet, a Python parameter exploration toolbox. Pypet allows for easy exploration of parameter spaces and storage of simulation results and parameters. It revolves around a trajectory container, which uses a tree data structure to manage parameters and results in a natural naming scheme. Pypet supports a variety of data formats and storage via HDF5. It provides tools for disentangling simulations from I/O, logging, version control integration, and parallelization. Pypet is open source, well tested, and documented.
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... (Databricks)
PyWren is a serverless framework that allows data scientists to easily scale Python code across AWS Lambda. It uses Lambda to parallelize work by mapping Python functions to a large dataset. The functions and data are serialized and uploaded to S3, which then triggers Lambda. Results are stored in S3. This allows data science problems that take minutes or hours to be solved to complete in seconds by parallelizing across thousands of Lambda instances. PyWren aims to abstract away the complexity of serverless infrastructure so data scientists can focus on their code instead of operations.
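A minimal sketch of that map pattern, based on PyWren's documented executor API (the workload is a toy, and the exact calls should be checked against the PyWren version in use):

```python
# Sketch of the PyWren pattern described above: serialize a Python
# function, fan it out across AWS Lambda invocations, and collect the
# results from S3. The workload is a stand-in for a real task.
import pywren

def my_simulation(parameter):
    # stand-in for a CPU-heavy task
    return sum(i * parameter for i in range(10_000))

executor = pywren.default_executor()
futures = executor.map(my_simulation, range(1000))   # roughly one Lambda per item
results = pywren.get_all_results(futures)
print(len(results), results[:3])
```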
Introduction to Data Mining with R and Data Import/Export in R (Yanchang Zhao)
This document introduces R and its use for data mining. It discusses R's functionality for statistical analysis and graphics. It also outlines various R packages for common data mining tasks like classification, clustering, association rule mining and text mining. Finally, it covers importing and exporting data to and from R, and provides online resources for learning more about using R for data analysis and data mining.
The document discusses the objectives and outcomes of the FAIRport Skunkworks team so far. The team is exploring existing technologies to build prototype FAIRport code components using existing standards. They aim to enable findable, accessible, interoperable, and reusable data across repositories. However, repositories use different metadata schemas and standards like DCAT in incomplete ways. The team proposes "FAIR Profiles" - a generic way to describe metadata fields and constraints for any repository using a standardized vocabulary and structure. This would enable rich queries across repositories. They define a FAIR Profile Schema to serve as a lightweight meta-meta-descriptor for describing diverse repository metadata schemas in a consistent way.
The document discusses using RDFS and OWL reasoning to integrate heterogeneous linked data by addressing issues like terminology and naming heterogeneity. It presents an approach using a subset of OWL 2 RL rules to reason over a billion triple corpus in a scalable way, handling the TBox separately from the ABox to avoid quadratic inferences. It also describes augmenting the reasoning with annotations to track trustworthiness and using this to filter inferences, detect inconsistencies and perform a light repair of the data. Consolidation is discussed as rewriting URIs to canonical identifiers based on owl:sameAs relations. Performance results show the different techniques taking between 1-20 hours to run over the corpus distributed across 9 machines.
With the rise of the cloud, data-intensive systems, and the Internet of Things, the use of distributed systems has become widespread.
The first big player was Hadoop, which provided a comprehensive solution to big data storage and computation problems. Its popularity empowered many organizations to adopt this technology. However, new challenges appeared, like the need to execute iterative, interactive, or in-memory algorithms without the disk-intensive burden of MapReduce. For that very reason Hadoop evolved, decoupling its resource manager from the main computation engine: YARN was born. As a result of its vast adoption, YARN has become the de facto distributed operating system for Big Data.
Since early releases, Apache Spark provided a way to be executed on YARN-powered clusters. In this talk we will take a look into that technology, and we will learn what it means having Spark running on this kind of infrastructure.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
The document discusses enabling live linked data by synchronizing semantic data stores with commutative replicated data types (CRDTs). CRDTs allow for massive optimistic replication while preserving convergence and intentions. The approach aims to complement the linked open data cloud by making linked data writable through a social network of data participants that follow each other's update streams. This would enable a "read/write" semantic web and transition linked data from version 1.0 to 2.0.
Making NumPy-style and Pandas-style code faster and able to run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
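Two tiny sketches of what "scaled Python" means here: Numba compiling a NumPy-style loop, and Dask evaluating an array computation lazily in parallel chunks (the sizes are arbitrary):

```python
# Numba: JIT-compile a NumPy-style numeric loop to run at native speed.
# Dask: split a NumPy-style array into chunks and compute in parallel.
import numpy as np
from numba import njit
import dask.array as da

@njit
def sum_of_squares(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

x = np.random.rand(1_000_000)
print(sum_of_squares(x))                 # compiled on first call

darr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(darr.mean(axis=0).sum().compute())  # built lazily, evaluated in parallel
```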
Modern data lakes are now built on cloud storage, helping organizations leverage the scale and economics of object storage while simplifying the overall data storage and analysis flow.
CSI conference PPT on Performance Analysis of Map/Reduce to compute the frequ... (shravanthium111)
This document summarizes a student presentation on analyzing the frequency of tweets using MapReduce. It discusses big data, Hadoop frameworks, HDFS, and how MapReduce works. It then describes the student's proposed approach of using Python to extract tweets from Twitter and implement MapReduce to count the frequency of dates in the tweets and output the results.
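The student's actual code is not reproduced in the summary; a typical Hadoop Streaming implementation of the described date-frequency count might look like the following mapper and reducer (the position of the date field in each tweet record is an assumption):

```python
#!/usr/bin/env python3
# mapper.py -- emit "date<TAB>1" for every tweet record on stdin.
# Assumes the tweet's creation date is the first tab-separated field.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(f"{fields[0]}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per date; Hadoop Streaming delivers the
# mapper output sorted by key, so counts for one date arrive together.
import sys

current_date, count = None, 0
for line in sys.stdin:
    date, value = line.rstrip("\n").split("\t")
    if date != current_date:
        if current_date is not None:
            print(f"{current_date}\t{count}")
        current_date, count = date, 0
    count += int(value)
if current_date is not None:
    print(f"{current_date}\t{count}")
```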
Spark is an open-source software framework for rapid calculations on in-memory datasets. It uses Resilient Distributed Datasets (RDDs) that can be recreated if lost and supports transformations and actions on RDDs. Spark is useful for batch, interactive, and real-time processing across various problem domains like SQL, streaming, and machine learning via MLlib.
Similar to DUG'20: 07 - Storing High-Energy Physics data in DAOS
HPE plans to deliver a DAOS storage solution targeting version 2.0 to enable initial customer deployments. The reference implementation will include HPE servers, Intel Ice Lake CPUs with DCPMM and NVMe SSDs, Mellanox or HPE Slingshot switches, and customized Cray management software. HPE will host potential customers in their CTO lab to run proofs of concept and collect feedback to inform full productization planned with Sapphire Rapids.
DUG'20: 11 - Platform Performance Evolution from bring-up to reaching link sa... (Andrey Kudryavtsev)
1) The document discusses the performance evolution of a reference storage platform over time as DAOS software improved from version 0.8 to 1.0.
2) Bandwidth and IOPS measurements increased significantly with each DAOS update as well as when using dual socket CPUs in DAOS 1.0.
3) Read latency times improved in DAOS 1.0, showing Optane-like write latencies and NAND-like read latencies from data destaged to QLC SSDs.
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures (Andrey Kudryavtsev)
RSC's BasIS storage orchestration platform addresses complications with deploying DAOS storage. It simplifies DAOS deployment by dynamically composing DAOS clusters from servers' NVMe and PMEM resources over a fabric. This composable disaggregated approach provides flexibility to use PMEM nodes for different roles like DAOS or databases. The orchestration significantly improves on DAOS by making it deployable on existing heterogeneous servers and suitable for cloud environments. Performance tests show NVMe-over-Fabric with the orchestrator achieves similar throughput to local NVMe drives.
The document provides an overview of updates to the DAOS middleware:
1. It discusses different options for accessing DAOS storage including directly through the DAOS API, through a POSIX interface using dfuse, or through an interception library for performance.
2. It describes two consistency modes for the distributed file system - a relaxed mode prioritizing performance and a more balanced mode for stricter consistency.
3. An MPI-IO driver has been added to allow applications to use the DAOS object store through MPI file interfaces without changes.
4. An HDF5 connector has been developed to enable HDF5 files on top of DAOS, now compatible with the main HDF5 release and supporting various
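To illustrate what the HDF5 connector means for applications, here is a hedged h5py sketch: assuming the DAOS VOL connector is installed and selected through HDF5's standard environment variables, existing HDF5 code can run against a DAOS container without source changes. The plugin path and file name below are placeholders, not documented defaults; consult the connector documentation for the exact setup.

```python
# Sketch: unchanged HDF5 (h5py) code running against DAOS through the
# HDF5 DAOS VOL connector. Environment-variable selection is shown for
# illustration only; paths and names are placeholder assumptions.
import os
import numpy as np

# Normally set in the job environment, before the HDF5 library loads.
os.environ.setdefault("HDF5_VOL_CONNECTOR", "daos")
os.environ.setdefault("HDF5_PLUGIN_PATH", "/usr/lib64/hdf5/plugin")  # assumed path

import h5py  # imported after the environment is prepared

with h5py.File("my_results.h5", "w") as f:   # stored as DAOS objects, not a POSIX file
    f.create_dataset("energies", data=np.random.rand(1000))
```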
1) The document discusses mapping seismic data stored in the SEG-Y format to the DAOS object storage system to improve storage and processing efficiency.
2) Currently, seismic processing copies data for each step, but DAOS snapshots and versioning can reduce copies by storing only updates.
3) The DAOS-SEIS mapping represents seismic data as a graph with trace headers and data mapped to DAOS objects to allow random access and filtering.
4) Early benchmarking shows the DAOS-SEIS API outperforms seismic processing libraries on large datasets, though further optimization is needed.
The document summarizes the experiences of CERN openlab in testing and benchmarking the Distributed Asynchronous Object Storage (DAOS) system. Key points include:
- CERN openlab collaborated with Intel to test and benchmark DAOS on their cluster hardware.
- They found initial performance gains when switching from socket to PSM2 configuration, but also limitations likely due to hardware restrictions.
- Feedback is provided to DAOS developers on areas like installation difficulties, lack of error and system information, and opportunities for improved monitoring integration.
- While commissioning DAOS presented challenges, CERN openlab gained valuable insights and sees potential for increased performance with more nodes and optimized configurations.
DUG'20: 05 - Very Early Experiences with a 0.5 PByte DAOS Testbed (Andrey Kudryavtsev)
Steffen Christgau, Supercomputing dept., Zuse Institute Berlin
Tobias Watermann, Supercomputing dept., Zuse Institute Berlin
Thomas Steinke, Supercomputing dept., Zuse Institute Berlin
DAOS User Group event, November 2020.
1. DAOS provides end-to-end data integrity through checksums calculated by the client library and stored persistently with the data to detect silent corruption.
2. It supports online server addition through a data migration service that handles data movement for various activities like rebuild, reintegration, and addition in a generic way.
3. DAOS uses erasure coding for efficient data recovery and space utilization, with Reed-Solomon coding and support for degraded reads, background rebuild, and space reclamation through data aggregation.
This document discusses compression capabilities added to DAOS (Distributed Asynchronous Object Storage). It provides an overview of the DAOS compression framework and algorithms including LZ4 and deflate. It evaluates the performance of these algorithms with and without using Intel QuickAssist Technology hardware acceleration. Testing shows QAT achieves the best compression throughput and ratio while LZ4 has the best decompression throughput. Future work is outlined to improve compression functionality in DAOS.
DUG'20: 02 - Accelerating Apache Spark with DAOS on Aurora (Andrey Kudryavtsev)
This document summarizes accelerating Apache Spark with DAOS (Distributed Asynchronous Object Storage) on Aurora. It describes using DAOS as a Hadoop filesystem for Spark input/output storage and as a shuffle data store. It shows how the DAOS Hadoop filesystem delivers similar throughput to DAOS DFS. It also introduces a DAOS object-based Spark shuffle manager that improves shuffle read throughput by up to 1.5x compared to using the local filesystem, especially for smaller shuffle blocks. Future plans include optimizing the DAOS Spark shuffle manager using async APIs and simplifying user configuration.
This document provides an agenda for a meeting on high performance computing. The agenda includes presentations on accelerating Apache Spark with DAOS, online data compression in DAOS with Intel QAT, DAOS features and updates, experiences deploying and using DAOS in different environments, and plans for DAOS from various companies. Resources for the DAOS open source distributed storage platform are also listed at the end.
DAOS (Distributed Application Object Storage) is a high-performance storage architecture and software stack that delivers scalable object storage capabilities. It uses Intel Optane memory and NVMe SSDs to provide high IOPS, bandwidth, and low latency storage. DAOS supports various data models and interfaces like POSIX, HDF5, Spark, and Python. It allows applications to access storage with library calls instead of system calls for high performance.
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I work on the Ruby programming language and on RubyGems and Bundler, the package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS), with examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
A Comprehensive Guide to DeFi Development Services in 2024 (Intelisync)
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Azure API Management to expose backend services securely
DUG'20: 07 - Storing High-Energy Physics data in DAOS
1. Storing High-Energy Physics data in DAOS
Javier López Gómez – CERN fellow
<javier.lopez.gomez@cern.ch>
DUG ’20, 19th November 2020
ROOT project,
EP-SFT (SoFTware Development for Experiments),
CERN
http://root.cern/
4. High-Energy Physics (HEP)
High-Energy Physics studies the laws governing our universe at the smallest scale: fundamental particles, forces and their carriers, mass, etc. The “Standard Model” describes these particles and their interactions.
CERN experiments observe particle interactions (typically by colliding particles at high energies).
HEP data = detector observations.
5. Large Hadron Collider (LHC)
Figure 1: Graphical representation of a CMS event.¹
LHC collides protons that move in opposite directions. Detectors are
similar to a 100 MP camera taking a picture every 25 ns.
10⁹ collisions/sec generating ∼10 TB/s.
Processing:
- Online: filtering step. Part of the detector read-out.
- Offline: distributed; disk storage at different LHC compute centers around the globe.
¹ http://opendata.cern.ch/visualise/events/cms
6. ROOT project
ROOT: open-source data analysis framework written in C++. Provides C++
interpretation, object serialization (I/O), statistics, graphics, and much
more.
PyROOT provides dynamic C++ ↔ Python bindings.
ROOT I/O: row-wise/column-wise storage of C++ objects.
7. TTree and RNTuple
HEP data analysis often requires access to only a subset of the properties of each event.
Row-wise storage is inefficient for this access pattern. TTree organizes the dataset in columns that can contain any type of C++ object.
More than 1 EB of HEP data is stored in TTree ROOT files.
TTree has been around for 25 years. RNTuple is the R&D project to replace TTree for the next 30 years.
In RNTuple, object stores are first-class storage backends.
[Illustration: columnar dataset]
x      y      z      mass
…      …      …      …
0.423  1.123  3.744  23.1413
…      …      …      …
10. RNTuple architecture
- Storage layer / byte ranges: POSIX files, object stores, …
- Primitives layer / simple types: “columns” containing elements of fundamental types (float, int, …), grouped into (compressed) pages and clusters.
- Logical layer / C++ objects: mapping of C++ types onto columns, e.g. std::vector<float> → an index column and a value column (see the sketch below).
- Event iteration: looping over events for reading/writing.
The storage layer provides access to the header (= schema), the pages, and the footer (= location of pages).
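As a rough illustration of the logical-to-primitives mapping (the names below are hypothetical and not RNTuple’s internal API), a std::vector<float> field can be split into an index column of per-event end offsets and a value column holding the concatenated floats:

#include <cstdint>
#include <vector>

// Hypothetical sketch of the column split for a std::vector<float> field.
struct SplitColumns {
  std::vector<std::uint64_t> index;  // one element per event: end offset into `values`
  std::vector<float> values;         // concatenated elements of all events
};

void AppendEntry(SplitColumns &cols, const std::vector<float> &entry) {
  cols.values.insert(cols.values.end(), entry.begin(), entry.end());
  cols.index.push_back(cols.values.size());  // marks where this event's data ends
}

Reading entry i back then amounts to taking the values in the half-open range [index[i-1], index[i]), with the lower bound being 0 for the first entry.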
11. File backend: on-disk format
[Diagram: on-disk layout with Anchor, Header, Pages grouped into Clusters, Footer, for the following example schema.]

struct Event {
  int fId;
  vector<Particle> fPtcls;
};

struct Particle {
  float fE;
  vector<int> fIds;
};
To put it simply…
Anchor: specifies the offset and size of the header and footer sections.
Header: schema information.²
Footer: location of pages and clusters.²
Pages: little-endian fundamental types (possibly packed, e.g. bit-fields), typically in the order of tens of KiB.²
² This element may or may not be compressed.
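Purely as an illustration of the anchor’s role (this is not the actual on-disk byte layout, and the field names are made up), one can picture it as a small record that points at the header and footer:

struct AnchorSketch {            // illustration only, not the real format
  std::uint64_t fSeekHeader;     // byte offset of the (possibly compressed) header
  std::uint32_t fNBytesHeader;   // size of the header on storage
  std::uint64_t fSeekFooter;     // byte offset of the (possibly compressed) footer
  std::uint32_t fNBytesFooter;   // size of the footer on storage
};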
13. libdaos C++ interface classes
To simplify resource management, we wrote C++ wrappers around part of the libdaos functionality.

auto pool = std::make_shared<RDaosPool>(
    "e6f8e503-e409-4b08-8eeb-7e4d77cce6bb", "1");
RDaosContainer cont(pool, "b4f6d9fc-e081-41d4-91ae-41adf800b537");

std::string s("foo bar baz");
cont.WriteObject(daos_obj_id_t{0xcafe4a11deadbeef, 0}, s.data(), s.size(),
                 /*dkey =*/ 0, /*akey =*/ 0);
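Reading follows the same pattern. As a sketch (assuming the wrapper exposes a ReadObject method symmetric to WriteObject; check the actual class for the exact signature), the string written above could be fetched back with:

std::string buf(11, '\0');                       // "foo bar baz" is 11 bytes
cont.ReadObject(daos_obj_id_t{0xcafe4a11deadbeef, 0}, buf.data(), buf.size(),
                /*dkey =*/ 0, /*akey =*/ 0);     // buf now holds "foo bar baz"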
14. DAOS backend: mapping things to objects
[Same diagram and Event/Particle example as in the file backend slide: Anchor, Header, Pages grouped into Clusters, Footer.]
Each RNTuple page is stored in a separate object. The object UUID is sequential, starting from 00000000-0000-0000-0000-000000000000.
Header, Footer, and Anchor are stored in three different objects with reserved UUIDs.
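A rough sketch of how page writes could map onto objects, reusing the wrapper classes from the previous slide. The reserved IDs and the sequential counter below are simplifications for illustration, not the exact scheme used by the DAOS backend:

// Illustration only: reserved object IDs for the metadata pieces
// (anchor, header, footer would be written to these).
const daos_obj_id_t kAnchorOid{0, 0};
const daos_obj_id_t kHeaderOid{1, 0};
const daos_obj_id_t kFooterOid{2, 0};

std::uint64_t nextPageOid = 3;  // pages get sequentially increasing object IDs

void WritePage(RDaosContainer &cont, const void *buf, std::size_t size) {
  // Each page goes into its own object.
  cont.WriteObject(daos_obj_id_t{nextPageOid++, 0}, buf, size,
                   /*dkey =*/ 0, /*akey =*/ 0);
}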
15. Usage: RNTuple/file vs. RNTuple/DAOS
From the user’s perspective…
auto model = RNTupleModel::Create();
auto ntuple = RNTupleReader::Open(std::move(model),
"DecayTree",
"./B2HHH~zstd.ntuple");
auto viewH1IsMuon = ntuple->GetView<int>("H1_isMuon");
auto viewH2IsMuon = ntuple->GetView<int>("H2_isMuon");
auto viewH3IsMuon = ntuple->GetView<int>("H3_isMuon");
16. Usage: RNTuple/file vs. RNTuple/DAOS
From the user’s perspective…
auto model = RNTupleModel::Create();
auto ntuple = RNTupleReader::Open(std::move(model),
"b4f6d9fc-e081-41d4-91ae-41adf800b537",
"daos://e6f8e503-e409-4b08-8eeb-7e4d77cce6bb/1");
auto viewH1IsMuon = ntuple->GetView<int>("H1_isMuon");
auto viewH2IsMuon = ntuple->GetView<int>("H2_isMuon");
auto viewH3IsMuon = ntuple->GetView<int>("H3_isMuon");
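Only the dataset location differs between the two snippets (a local file path vs. a daos:// URI); the analysis code that follows is identical. For example, a loop over the entries using these views might look like this (a sketch following the usual RNTuple view pattern):

for (auto i : ntuple->GetEntryRange()) {
  // Keep only events where none of the three hadron candidates is flagged as a muon.
  if (viewH1IsMuon(i) || viewH2IsMuon(i) || viewH3IsMuon(i))
    continue;
  // ... process the remaining event, e.g. fill histograms ...
}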
24. Storing High-Energy Physics data in DAOS
Javier López Gómez – CERN fellow
<javier.lopez.gomez@cern.ch>
DUG ’20, 19th November 2020
ROOT project,
EP-SFT (SoFTware Development for Experiments),
CERN
http://root.cern/