This document presents recent developments in histogram libraries for Python. It introduces Boost.Histogram, a fast, header-only C++14 library for multidimensional histograms, inspired in part by ROOT, which serves as the foundation of the boost-histogram Python package. The new Python bindings are designed to be fast, flexible, and easily distributable while closely resembling the C++ version, with support for many axis and storage types. The document then outlines plans to integrate boost-histogram into the wider ecosystem: hist, a Python analysis frontend that will add plotting (through libraries such as mplhep), statistics, serialization, and fitter integration; Aghast, a library for converting between histogram formats; and Unified Histogram Indexing for consistent indexing, slicing, and NumPy conversion. Together, boost-histogram, hist, and Aghast comprise the Scikit-HEP histogramming framework.
CHEP 2019: Recent developments in histogram libraries
1. Recent developments in histogram libraries
Hans Dembinski¹, Jim Pivarski², Henry Schreiner²
November 7, 2019
¹Max Planck Institute for Nuclear Physics, Heidelberg, Germany
²Princeton University, Princeton, USA
12. Boost.Histogram C++14: Performance
Legend for the benchmark plot on this slide (a fill-style sketch follows below):
• call is the one-at-a-time fill (operator())
• fill is the new (Boost 1.72) array fill
• S is static (fixed types)
• D is dynamic (unlimited storage)
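In the Python bindings the same contrast shows up as a per-value loop versus a single array fill; a minimal sketch (the array size and binning are arbitrary choices, not the talk's benchmark):

import numpy as np
import boost_histogram as bh

data = np.random.uniform(size=100_000)
hist = bh.Histogram(bh.axis.Regular(100, 0, 1))

# one-at-a-time: one fill call per value (slow from Python)
for x in data:
    hist.fill(x)

# array fill: one call over the whole array
hist.fill(data)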
13. boost-histogram: Python bindings
• Boost.Histogram developed with Python in mind
▶ Original prototype bindings based on Boost.Python
• New bindings: /scikit-hep/boost-histogram
▶ 0-dependency build (C++14 only)
▶ State-of-the-art pybind11
▶ Designed with the original author, Hans Dembinski
• Public beta 0.5.2 currently available, 0.6 coming soon
Design · Flexibility · Speed · Distribution
14. Design
• 500+ unit tests run on Azure on Linux, macOS, and Windows
• Stays close to Boost.Histogram with Pythonizations
C++17
#include <boost/histogram.hpp>
namespace bh = boost::histogram;

auto hist = bh::histogram(
    bh::axis::regular{2, 0, 1, "x"},
    bh::axis::regular{4, 0, 1, "y"});

hist(.2, .3); // Fill will also be
hist(.4, .5); // available in 1.72
hist(.3, .2);
Python
import boost_histogram as bh

hist = bh.Histogram(
    bh.axis.Regular(2, 0, 1, metadata="x"),
    bh.axis.Regular(4, 0, 1, metadata="y"))

hist.fill(
    [.2, .4, .3],
    [.3, .5, .2])
15. Design: Manipulations
Combine two histograms
hist1 + hist2
Scale a histogram
hist * 2.0
Sum a histogram's contents
hist.sum()
Access an axis
ax = hist.axes[0]
ax.edges # The edges array
ax.centers # Centers of bins
ax.widths # Width of each bin
hist.axes.centers # All centers
# Etc.
Fill a 2D histogram with values or arrays
hist.fill(a1, a2, weight=w, sample=s)
Convert contents to a NumPy array
hist.view()
Convert to a NumPy-style histogram tuple
hist.to_numpy()
Pickle supported (multiprocessing)
pickle.dumps(hist, -1)
Copy/deepcopy supported
hist2 = copy.deepcopy(hist)
NumPy functions provided
H, E = bh.numpy.histogram(...)
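These pieces compose naturally; a minimal runnable sketch of the operations above (the data and bin counts are illustrative):

import copy
import pickle
import numpy as np
import boost_histogram as bh

hist = bh.Histogram(bh.axis.Regular(10, 0, 1))
hist.fill(np.random.uniform(size=1000))

hist2 = hist * 2.0               # scale a histogram
combined = hist + hist2          # combine two histograms
total = combined.sum()           # sum of contents

ax = hist.axes[0]                # access an axis
print(ax.edges, ax.centers, ax.widths)

arr = hist.view()                # contents as a NumPy array
counts, edges = hist.to_numpy()  # NumPy-style tuple

data = pickle.dumps(hist, -1)    # pickle (multiprocessing)
clone = copy.deepcopy(hist)      # deepcopy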
16. Design: Unified Histogram Indexing (UHI)
The language here (bh.loc, etc.) is defined so that any library can provide it, hence "Unified".
Access
v = h[b] # Returns bin contents, indexed by bin number
v = h[bh.loc(b)] # Returns the bin containing the value
v = h[bh.underflow] # Underflow and overflow can be accessed with special tags
Setting
h[b] = v
h[bh.loc(b)] = v
h[bh.underflow] = v
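A minimal sketch of this access/set syntax on a concrete histogram (the bin numbers and values are illustrative):

import boost_histogram as bh

h = bh.Histogram(bh.axis.Regular(4, 0, 1))
h.fill([0.1, 0.4, 0.6, 0.9, 1.5])   # 1.5 lands in the overflow bin

v = h[1]               # contents of bin 1
v = h[bh.loc(0.75)]    # the bin containing the value 0.75
v = h[bh.overflow]     # flow bins via special tags

h[1] = 42              # setting works the same way
h[bh.loc(0.75)] = 7
h[bh.underflow] = 0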
17. Design: Unified Histogram Indexing (UHI) (2)
h == h[:] # Slice over everything
h2 = h[a:b] # Slice of histogram (includes flow bins)
h2 = h[:b] # Leaving out endpoints is okay
h2 = h[bh.loc(v):] # Slices can be in data coordinates, too
h2 = h[::bh.sum] # Remove an axis by summation
h2 = h[0:5:bh.sum] # Can add endpoints
h2 = h[::bh.rebin(2)] # Modification operations (rebin)
h2 = h[a:b:bh.rebin(2)] # Modifications can combine with slices
h2 = h[a:b, ...] # Ellipsis works just like normal NumPy
• Docs are here
• Description may move to a new repository
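A short runnable sketch of the slicing grammar above (axis size and slice points are illustrative):

import numpy as np
import boost_histogram as bh

h = bh.Histogram(bh.axis.Regular(10, 0, 1))
h.fill(np.random.uniform(size=1000))

h2 = h[2:8]              # bins 2 through 7; cut-off content moves to flow
h3 = h[bh.loc(0.5):]     # start at the bin containing 0.5
total = h[::bh.sum]      # sum over the axis, including flow
h4 = h[::bh.rebin(2)]    # merge adjacent pairs of bins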
19. Performance
• Factor of 2 faster than 1D regular binning in NumPy 1.17
▶ Currently no specialization, just a 1D regular fill
▶ Could be optimized further for common fills
• Factor of 6-10 faster than 2D regular binning in NumPy
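A rough way to reproduce the flavor of this comparison (not the talk's benchmark; sizes, binning, and results vary by machine):

import timeit
import numpy as np
import boost_histogram as bh

data = np.random.normal(size=1_000_000)

def numpy_fill():
    np.histogram(data, bins=100, range=(-5, 5))

def bh_fill():
    h = bh.Histogram(bh.axis.Regular(100, -5, 5))
    h.fill(data)

print("numpy          :", timeit.timeit(numpy_fill, number=10))
print("boost-histogram:", timeit.timeit(bh_fill, number=10))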
20. Distribution
If pip install boost-histogram fails, we have failed.
• Docker ManyLinux1 GCC 9.2: /scikit-hep/manylinuxgcc
• Used in /scikit-hep/iminuit, /scikit-hep/awkward1
• Described on iscinumpy.gitlab.io, also see /scikit-hep/azure-wheel-helpers
Wheels
• manylinux1 32/64 bit, Py 2.7, 3.5, 3.6, 3.7
• manylinux2010 64 bit, Py 2.7, 3.5, 3.6, 3.7, 3.8
• macOS 10.9+ 64 bit, Py 2.7, 3.6, 3.7, 3.8
• Windows 32/64 bit, Py 2.7, 3.6, 3.7
Source
• SDist on PyPI
• Build directly from GitHub
Conda-Forge
• All versions (including 3.8) except Python 2.7 on Windows
python3 -m pip install boost-histogram
conda install -c conda-forge boost-histogram
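Either route can be sanity-checked by importing the package and printing its version:

python3 -c "import boost_histogram as bh; print(bh.__version__)"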
21. Hist
hist is the ‘wrapper’ piece that does plotting and interacts with the rest of the ecosystem.
Plans
• Easy plotting adaptors (mpl-hep)
• Serialization formats via aghast (ROOT, HDF5)
• Implicit multithreading
• Statistical functions (like TEfficiency)
• Multihistogram plotting (HistBook)
• Interaction with fitters (zfit, GooFit, etc.)
• Bayesian Blocks algorithm from Scikit-HEP
• Command line histograms for stream of numbers
Call for contributions
• What do you need?
• Looking for a student interested in development
Join in the development! This should combine the best features of other packages.
22. aghast
aghast is a histogramming library that does not fill histograms and does not plot them.
/scikit-hep/aghast
• A memory format for histograms, like Apache Arrow
• Converts to and from other libraries (see the sketch below)
• Uses flatbuffers to hold histograms
• Indexing ideas inspired the UHI
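For example, following the pattern in the aghast README, a NumPy histogram can be converted into aghast's in-memory form and then into a ROOT histogram (requires ROOT to be installed; the names here are from the README example):

import numpy
import aghast

numpy_hist = numpy.histogram(numpy.random.normal(0, 1, 100000), bins=80, range=(-5, 5))
ghastly_hist = aghast.from_numpy(numpy_hist)     # aghast's memory representation
root_hist = aghast.to_root(ghastly_hist, "name")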
Available Binnings
• IntegerBinning
• RegularBinning
• HexagonalBinning
• EdgesBinning
• IrregularBinning
• CategoryBinning
• SparseRegularBinning
• FractionBinning
• PredicateBinning
• VariationBinning
23. Conclusions
• Boost.Histogram C++14 is a solid foundation
• boost-histogram for Python is a core histogramming library
• Hist will be developed next and is analysis friendly
• aghast is the glue to ROOT/everything else
Core histogramming libraries: boost-histogram, ROOT
Universal adaptor: aghast
Front ends (plotting, etc.): hist, mpl-hep, coffea, others
24. Questions?
Supported by
• IRIS-HEP, NSF OAC-1836650
• DIANA-HEP, NSF OAC-1558216, NSF OAC-1558233, NSF OAC-1558219