A three-part presentation on the Swift programming language:
• An introduction to Swift for Objective-C developers
• Changes in Swift 2
• What's coming in Swift 2.2 & 3.0
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Igalia
By Andy Wingo.
With the new compiler and virtual machine in Guile 2.2, Guile hackers need to update their mental performance models. This talk will give a bit of a state of the union of Guile performance, with an updated overview of the cost of various kinds of abstractions. Sometimes abstraction is free!
(c) 2016 FOSDEM VZW
CC BY 2.0 BE
https://archive.fosdem.org/2016/
A three-part presentation on the Swift programming language:
• An introduction to Swift for Objective-C developers
• Changes in Swift 2
• What's coming in Swift 2.2 & 3.0
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Igalia
By Andy Wingo.
With the new compiler and virtual machine in Guile 2.2, Guile hackers need to update their mental performance models. This talk will give a bit of a state of the union of Guile performance, with an updated overview of the cost of various kinds of abstractions. Sometimes abstraction is free!
(c) 2016 FOSDEM VZW
CC BY 2.0 BE
https://archive.fosdem.org/2016/
Despite being a slow interpreter, Python is a key component in high-performance computing (HPC). Python is easy to use. C++ is fast. Together they are a beautiful blend. A new tool, pybind11, makes this approach even more attractive to HPC code. It focuses on the niceties C++11 brings in. Beyond the syntactic sugar around the Python C API, it is interesting to see how pybind11 handles the vast difference between the two languages, and what matters to HPC.
This presentation helps discover some of the specific features of Java 8, (in particular, atomics and parallelism) and start using them effectively.
This presentation by Maksym Voronyi (Software Engineer, GlobalLogic) was delivered at Java.io 3.0 conference in Kharkiv on March 24, 2016, and GlobalLogic Mykolaiv Java Conference on June 11, 2016.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
Presentation from DICE Coder's Day (2010 November) by Johan Torp:
This talk is about making object-oriented code more cache-friendly and how we can incrementally move towards parallelizable data-oriented designs. Filled with production code examples from Frostbite’s pathfinding implementation.
Asynchronous single page applications without a line of HTML or Javascript, o...Robert Schadek
AngularJS, together with Node.js, is an extremely powerful combination for building single page applications. Unfortunately, its development requires writing HTML and Javascript, which is tedious and error prone. By using vibe.d, HTML is no longer necessary, and the developers can use the full power of a static-typed language for the development of the backend. Substituting Javascript with Typescript in addition to a little bit of CTFE D magic then removes the need for redundant data type declarations, and makes everything statically typed. At the end of the talk, the attendee will have witnessed the creation of a statically typed, asynchronous single page application that required little extra typing than its dynamically typed equivalent. Additionally, the attendees will be motivated to explore the presented combination of frameworks as a viable desktop application UI framework.
This presentation deals with RealmDB, which is a convenient replacement for SQLite & Core Data in mobile development.
Presentation by Anton Minashkin (Software Engineer, GlobalLogic, Lviv), delivered at Mobile TechTalk Lviv on April 28, 2015.
More details - http://globallogic.com.ua/mobile-techtalk-lviv-2015-report
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Despite being a slow interpreter, Python is a key component in high-performance computing (HPC). Python is easy to use. C++ is fast. Together they are a beautiful blend. A new tool, pybind11, makes this approach even more attractive to HPC code. It focuses on the niceties C++11 brings in. Beyond the syntactic sugar around the Python C API, it is interesting to see how pybind11 handles the vast difference between the two languages, and what matters to HPC.
This presentation helps discover some of the specific features of Java 8, (in particular, atomics and parallelism) and start using them effectively.
This presentation by Maksym Voronyi (Software Engineer, GlobalLogic) was delivered at Java.io 3.0 conference in Kharkiv on March 24, 2016, and GlobalLogic Mykolaiv Java Conference on June 11, 2016.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
Presentation from DICE Coder's Day (2010 November) by Johan Torp:
This talk is about making object-oriented code more cache-friendly and how we can incrementally move towards parallelizable data-oriented designs. Filled with production code examples from Frostbite’s pathfinding implementation.
Asynchronous single page applications without a line of HTML or Javascript, o...Robert Schadek
AngularJS, together with Node.js, is an extremely powerful combination for building single page applications. Unfortunately, its development requires writing HTML and Javascript, which is tedious and error prone. By using vibe.d, HTML is no longer necessary, and the developers can use the full power of a static-typed language for the development of the backend. Substituting Javascript with Typescript in addition to a little bit of CTFE D magic then removes the need for redundant data type declarations, and makes everything statically typed. At the end of the talk, the attendee will have witnessed the creation of a statically typed, asynchronous single page application that required little extra typing than its dynamically typed equivalent. Additionally, the attendees will be motivated to explore the presented combination of frameworks as a viable desktop application UI framework.
This presentation deals with RealmDB, which is a convenient replacement for SQLite & Core Data in mobile development.
Presentation by Anton Minashkin (Software Engineer, GlobalLogic, Lviv), delivered at Mobile TechTalk Lviv on April 28, 2015.
More details - http://globallogic.com.ua/mobile-techtalk-lviv-2015-report
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Our fall 12-Week Data Science bootcamp starts on Sept 21st,2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced Machine learning. In this meet-up, we will demonstrate how to understand and use Xgboost for Kaggle competition. Tong is in Canada and will do remote session with us through google hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist in Supstat Inc and also a master students of Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package of XGBoost, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite(if any): R /Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
The Journal of MC Square Scientific Research is published by MC Square Publication on the monthly basis. It aims to publish original research papers devoted to wide areas in various disciplines of science and engineering and their applications in industry. This journal is basically devoted to interdisciplinary research in Science, Engineering and Technology, which can improve the technology being used in industry. The real-life problems involve multi-disciplinary knowledge, and thus strong inter-disciplinary approach is the need of the research.
"Optimization of a .NET application- is it simple ! / ?", Yevhen TatarynovFwdays
Optimization of .NET application seems complex and tied full task, but don’t hurry up with conclusions. Let’s look on several cases from real projects.
For this we:
look under the hood of an application from a real project;
define the metric for optimization;
choose the necessary tools;
find bottlenecks /memory leaks and best practice to resolve them.
We'll improve the application step by step and we’ll what with simple analysis and simple best practice we can significantly reduce total resources usage.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
In C language, you may use functions without defining them. Pay attention that I speak about C language, not C++. Of course, this ability is very dangerous. Let us have a look at an interesting example of a 64-bit error related to it.
Options and trade offs for parallelism and concurrency in Modern C++Satalia
While threads have become a first class citizen in C++ since C++11, it is not always the case that they are the best abstraction to express parallelism where the objective is to speed up computations. OpenMP is a parallelism API for C/C++ and Fortran that has been around for a long time. Intel's Threading Building Blocks (TBB) is only a little bit more than 10 years old, but is very mature, and specifically for C++.
Mats will introduce OpenMP and TBB and their use in modern C++ and provide some best practices for them as well as try to predict what the C++ standard has in store for us when it comes to parallelism in the future.
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation is targeting seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
PVS-Studio for Linux Went on a Tour Around DisneyPVS-Studio
Recently we released a Linux version of PVS-Studio analyzer, which we had used before to check a number of open-source projects such as Chromium, GCC, LLVM (Clang), and others. Now this list includes several projects developed by Walt Disney Animation Studios for the community of virtual-reality developers. Let's see what bugs and defects the analyzer found in these projects.
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Daniel Lemire
Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Some of our prior work includes (1) Roaring indexes which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, Kylin, (2) EWAH indexes are part of Git (GitHub) and included in major Linux distributions.
We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.
The talk will be aimed at programmers who want to better understand the performance characteristics of current big-data systems as well as their evolution. The following specific topics will be addressed:
1. The various types of indexes and their performance characteristics and trade-offs: hashing, sorted arrays, bitsets and so forth.
2. Index and table compression techniques: binary packing, patched coding, dictionary coding, frame-of-reference.
We have finished studying the patterns of 64-bit errors and the last thing we will speak about, concerning these errors, is in what ways they may occur in programs.
Design of QSD Number System Addition using Delayed Addition TechniqueKumar Goud
Abstract: Quaternary number system is a base-4 numeral system. Using Quaternary Signed Digit (QSD) number system may also execute carry free addition, borrow free subtraction and multiplication. The QSD number system wants a different group of prime modulo based logic elements for each arithmetic operation. In this work we extend this QSD addition to Delayed addition in place of carry free addition. Carry free addition generates intermediate carry and intermediate sum, in this carry propagation is required to generate intermediate sum. To reduce carry propagation we evaluated delayed addition. This delayed addition reduces carry propagation and improves arithmetic calculations. We present both QSD and Floating –point single precision addition using delayed addition. The design work is carried by using Verilog HDL in ISE.
Keywords: QSD, DA, CFA and Floating-Point.
Design of QSD Number System Addition using Delayed Addition TechniqueKumar Goud
Abstract: Quaternary number system is a base-4 numeral system. Using Quaternary Signed Digit (QSD) number system may also execute carry free addition, borrow free subtraction and multiplication. The QSD number system wants a different group of prime modulo based logic elements for each arithmetic operation. In this work we extend this QSD addition to Delayed addition in place of carry free addition. Carry free addition generates intermediate carry and intermediate sum, in this carry propagation is required to generate intermediate sum. To reduce carry propagation we evaluated delayed addition. This delayed addition reduces carry propagation and improves arithmetic calculations. We present both QSD and Floating –point single precision addition using delayed addition. The design work is carried by using Verilog HDL in ISE.
Keywords: QSD, DA, CFA and Floating-Point.
Building High-Performance Language Implementations With Low EffortStefan Marr
This talk shows how languages can be implemented as self-optimizing interpreters, and how Truffle or RPython go about to just-in-time compile these interpreters to efficient native code.
Programming languages are never perfect, so people start building domain-specific languages to be able to solve their problems more easily. However, custom languages are often slow, or take enormous amounts of effort to be made fast by building custom compilers or virtual machines.
With the notion of self-optimizing interpreters, researchers proposed a way to implement languages easily and generate a JIT compiler from a simple interpreter. We explore the idea and experiment with it on top of RPython (of PyPy fame) with its meta-tracing JIT compiler, as well as Truffle, the JVM framework of Oracle Labs for self-optimizing interpreters.
In this talk, we show how a simple interpreter can reach the same order of magnitude of performance as the highly optimizing JVM for Java. We discuss the implementation on top of RPython as well as on top of Java with Truffle so that you can start right away, independent of whether you prefer the Python or JVM ecosystem.
While our own experiments focus on SOM, a little Smalltalk variant to keep things simple, other people have used this approach to improve peek performance of JRuby, or build languages such as JavaScript, R, and Python 3.
Le slide deck de l'Université que nous avons donnée avec Rémi Forax à Devoxx France 2019.
Comme promis, Java sort sa version majeure tous les 6 mois. Le train passe et amène son lot de nouveautés. Parmi elles, certaines sont sorties : une nouvelle syntaxe pour les clauses switch et l'instruction de byte code CONSTANT_DYNAMIC. D'autres sont en chantier, plus ou moins avancé : une nouvelle façon d'écrire des méthodes de façon condensée, un instanceof 'intelligent', des constantes évaluées au moment où elles sont utilisées. Les projets progressent. Loom, et son nouveau modèle de programmation concurrente que l'ont peut tester avec Jetty. Amber, qui introduit les data types et des nouvelles syntaxes. Valhalla, dont les value types donnent leurs premiers résultats. S'il est difficile de prévoir une date de sortie pour ces nouveautés, on sait en revanche qu'une fois prêtes elles sortiront en moins de 6 mois. De tout ceci nous parlerons donc au futur et en public, avec des démonstrations de code, des slides, du code, de la joie et de la bonne humeur !
Authors:
Jeff Plaisance, Indeed
Nathan Kurz, Verse Communications
Daniel Lemire, LICEF, Universite du Québec
Paper accepted to the International Symposium on Web Algorithms (iSWAG), 2015.
Blog post: http://engineering.indeed.com/blog/2015/03/vectorized-vbyte-decoding-high-performance-vector-instructions/
Abstract:
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
This is my second version of the quantum notes collected as part of my study.
This organizes content from various open source for study and reference only.
Accurate and efficient software microbenchmarksDaniel Lemire
Software is often improved incrementally. Each software optimization should be assessed with microbenchmarks. In a microbenchmark, we record performance measures such as elapsed time or instruction counts during specific tasks, often in idealized conditions. In principle, the process is easy: if the new code is faster, we adopt it. Unfortunately, there are many pitfalls, such as unrealistic statistical assumptions and poorly designed benchmarks. Abstractions like cloud computing add further challenges. We illustrate effective benchmarking practices with examples.
Presentation on Roaring bitmaps for the Go Montreal meetup (Go 10th anniversary).
Roaring bitmaps are a standard indexing data structure. They are
widely used in search and database engines. For example, Lucene, the
search engine powering Wikipedia relies on Roaring. The Go library
roaring implements Roaring bitmaps in Go. It is used in several
popular systems such as InfluxDB, Pilosa and Bleve. This library is
used in production in several systems, it is part of the Awesome Go
collection. After presenting the library, we will cover some advanced
Go topics such as the use of assembly language, unsafe mappings, and
so forth.
Our disks and networks can load gigabytes of data per second; we feel strongly that our software should follow suit. Thus we wrote what might be the fastest JSON parser in the world, simdjson. It can parse typical JSON files at speeds of over 2 GB/s on single commodity Intel core with full validation; it is several times faster than conventional parsers.
How did we go so fast? We started with the insight that we should make full use of the SIMD instructions available on commodity processors. These instructions are everywhere, from the ARM chip in your smartphone all to way to server processors. SIMD instructions work on wide registers (e.g., spanning 32 bytes): they are faster because they process more data using fewer instructions. To our knowledge, nobody had ever attempted to produce a full parser for something as complex as JSON by relying primarily on SIMD instructions. And many people were skeptical that a full parser could be done fruitfully with SIMD instructions. We had to develop interesting new strategies that are generally applicable. In the end, we learned several lessons. Maybe one of the most important lesson is the importance of a nearly obsessive focus on performance metrics. We constantly measure the impact of the choices we make.
Ingénierie de la performance au sein des mégadonnéesDaniel Lemire
Les index logiciels accélèrent les applications en intelligence d'affaire, en apprentissage machine et en science des données. Ils déterminent souvent la performance des applications portant sur les mégadonnées. Les index efficaces améliorent non seulement la latence et le débit, mais aussi la consommation d'énergie. Plusieurs index font une utilisation parcimonieuse de la mémoire vive afin que les données critiques demeurent près du processeur. Il est aussi souhaitable de travailler directement sur les données compressées afin d'éviter une étape de décodage supplémentaire.
(1) Nous nous intéressons aux index bitmap. Nous les trouvons dans une vaste gamme de systèmes :
Oracle, Hive, Spark, Druid, Kylin, Lucene, Elastic, Git... Ils sont une composante de systèmes, tels que Wikipedia ou GitHub, dont dépendent des millions d'utilisateurs à tous les jours. Nous
présenterons certains progrès récents ayant trait à l'optimisation des index bitmap, tels qu'ils sont utilisés au sein des systèmes actuels. Nous montrons par des exemples comment multiplier la
performance de ces index dans certains cas sur les processeurs bénéficiant d'instructions SIMD (instruction unique, données multiples) avancées.
(2) Nous ciblons aussi les listes d'entiers que l'on trouve au sein des arbres B+, dans les indexes inversés et les index bitmap compressés. Nous donnons un exemple récent de technique de compression (Stream VByte) d’entiers qui permet de décoder des milliards d’entiers compressés par seconde.
SIMD Compression and the Intersection of Sorted IntegersDaniel Lemire
Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.
Decoding billions of integers per second through vectorizationDaniel Lemire
In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres...Daniel Lemire
Nowadays, medical image compression is an essential process in eHealth systems. Compressing medical images in high quality is a vital demand to avoid misdiagnosing medical exams by radiologists. WAAVES is a promising medical images compression algorithm based on the discrete wavelet transform (DWT) that achieves a high compression performance compared to the state of the art. The main aims of this work are to enhance image quality when compressing using WAAVES and to provide a high-speed DWT architecture for image compression on embedded systems. Regarding the quality improvement, the logarithmic number systems (LNS) was explored to be used as an alternative to the linear arithmetic in DWT computations. A new LNS library was developed and validated to realize the logarithmic DWT. In addition, a new quantization method called (LNS-Q) based on logarithmic arithmetic was proposed. A novel compression scheme (LNS-WAAVES) based on integrating the Hybrid-DWT and the LNS-Q method with WAAVES was developed. Hybrid-DWT combines the advantages of both the logarithmic and the linear domains leading to enhancement of the image quality and the compression ratio. The results showed that LNS-WAAVES is able to achieve an improvement in the quality by a percentage of 8% and up to 34% compared to WAAVES depending on the compression configuration parameters and the image modalities.
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.
http://arxiv.org/pdf/1503.07387.pdf
Better bitmap performance with Roaring bitmaps
Bitmaps are used to implement fast set operations in software. They are frequently found in databases and search engines.
Without compression, bitmaps scale poorly, so they are often compressed. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). For example, Oracle relies on BBC bitmap compression while the version control system Git and Apache Hive rely on EWAH compression.
We can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has been adopted by several production platforms (e.g., Apache Lucene/Solr/Elastic, Apache Spark, eBay's Apache Kylin and Metamarkets' Druid).
Overall, our implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.
La vectorisation des algorithmes de compression Daniel Lemire
Depuis la mise en marché du Pentium 4, nos processeurs bénéficient d'instructions vectorielles. En tenant compte explicitement de ces instructions dans la conception de nos algorithmes, nous pouvons grandement accélérer les calculs. À titre d'exemple, considérons la compression des listes d'entiers telle qu'elle s'effectue au sein de la plupart des moteurs de recherche ou des bases de données. En cette matière, nous utilisons souvent encore des algorithmes développés dans les années 70. Nous expliquerons comment on peut faire beaucoup mieux
en ce qui a trait à la vitesse en exploitant les instructions
vectorielles.
Decoding billions of integers per second through vectorization Daniel Lemire
In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression rate within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.
Extracting, Transforming and Archiving Scientific DataDaniel Lemire
It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
Innovation without permission: from Codd to NoSQLDaniel Lemire
Practitioners often fail to apply textbook database design principles. We observe both a perversion of the relational model and a growth of less formal alternatives. Overall, there is an opposition between the analytic thought that prevailed when many data modeling techniques were initiated, and the pragmatism which now dominates among practitioners. There are at least two recent trends supporting this rejection of traditional models:
(1) the rise of the sophisticated user,
most notably in social media is challenge to the rationalist view, as it blurs the distinction between design and operation,
(2) in the new technological landscape where there are billions of interconnected computers worldwide, simple concepts like
consistency sometimes become prohibitively expensive. Overall, for a wide range of information systems, design and operation are becoming integrated in the spirit of pragmatism. Thus, we are left with design methodologies which embrace fast and continual iterations and and exploratory testing. These methodologies allow innovation without permission in that the right to design new features is no longer so closely guarded.
Fo
Recent research results in optimizing column-oriented indexes for faster data warehousing. This talks aims to answer the following question: when is sorting the table a sufficiently good optimization?
Column-oriented databases have become fashionable following the work of Stonebraker et al. In the data warehousing industry, the terms "column oriented" and "column store" have become necessary marketing buzzwords. One of the benefits of column-oriented indexes is good compression through run-length encoding (RLE). This type of compression is particularly benefitial since it simultaneously reduce the volume of data and the necessary computations. However, the efficiency of the compression depends on the order of the rows in the table and this is even more important with larger tables. Finding the best row ordering is NP hard. We compare some heuristics for this problem including variations on the lexicographical order, Gray codes, and Hilbert space-filling curves.
All About Bitmap Indexes... And Sorting ThemDaniel Lemire
A review of bitmap index from an academic perspective. Several theoretical results are presented. The talk also discuss technical issues regarding sorting the tables prior to indexing, as a way to improve the indexes.
Much of the talk is based on the following preprint:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes.
http://arxiv.org/abs/0901.3751
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAPDaniel Lemire
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliability very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Search and Society: Reimagining Information Access for Radical Futures
Engineering fast indexes (Deepdive)
1. ENGINEERING FAST INDEXES (DEEP DIVE)
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people
2. Roaring : Hybrid Model
A collection of containers...
array: sorted arrays ({1,20,144}) of packed 16‑bit integers
bitset: bitsets spanning 65536 bits or 1024 64‑bit words
run: sequences of runs ([0,10],[15,20])
2
3. Keeping track
E.g., a bitset with few 1s need to be converted back to array.
→ we need to keep track of the cardinality!
In Roaring, we do it automagically
3
5. In x64 assembly with BMI instructions:
shrx %[6], %[p], %[q] // q = p / 64
mov (%[w],%[q],8), %[ow] // ow = w [q]
bts %[p], %[ow] // ow |= ( 1<< (p % 64)) + flag
sbb $-1, %[cardinality] // update card based on flag
mov %[load], (%[w],%[q],8) // w[q] = ow
sbb is the extra work
5
8. Bitset vs. Bitset...
Intersection:
First compute the cardinality of the result.
If low, use an array for the result (slow), otherwise generate
a bitset (fast).
Union: Always generate a bitset (fast).
(Unless cardinality is high then maybe create a run!)
We generally keep track of the cardinality of the result.
8
9. Cardinality of the result
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
We have 1024 calls to Long.bitCount .
This counts the number of 1s in a 64‑bit word.
9
10. Population count in Java
// Hacker`s Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
Sounds expensive?
10
11. Population count in C
How do you think that the C compiler clang compiles this code?
#include <stdint.h>
int count(uint64_t x) {
int v = 0;
while(x != 0) {
x &= x - 1;
v++;
}
return v;
}
11
12. Compile with -O1 -march=native on a recent x64 machine:
popcnt rax, rdi
12
13. Why care for popcnt ?
popcnt : throughput of 1 instruction per cycle (recent Intel CPUs)
Really fast.
13
14. Population count in Java?
// Hacker`s Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
14
15. Population count in Java!
Also compiles to popcnt if hardware supports it
$ java -XX:+PrintFlagsFinal
| grep UsePopCountInstruction
bool UsePopCountInstruction = true
But only if you call it from Long.bitCount
15
17. Cardinality of the intersection
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
A bit over ≈ 2 cycles per pair of 64‑bit words.
load A, load B
bitwise AND
popcnt
17
18. Take away
Bitset vs. Bitset operations are fast
even if you need to track the cardinality.
even in Java
e.g., popcnt overhead might be negligible compared to other costs
like cache misses.
18
19. Array vs. Array intersection
Always output an array. Use galloping O(m log n) if the sizes
differs a lot.
int intersect(A, B) {
if (A.length * 25 < B.length) {
return galloping(A,B);
} else if (B.length * 25 < A.length) {
return galloping(B,A);
} else {
return boring_intersection(A,B);
}
}
19
20. Galloping intersection
You have two arrays a small and a large one...
while (true) {
if (largeSet[k1] < smallSet[k2]) {
find k1 by binary search such that
largeSet[k1] >= smallSet[k2]
}
if (smallSet[k2] < largeSet[k1]) {
++k2;
} else {
// got a match! (smallSet[k2] == largeSet[k1])
}
}
If the small set is tiny, runs in O(log(size of big set))
20
21. Array vs. Array union
Union: If sum of cardinalities is large, go for a bitset. Revert to an
array if we got it wrong.
union (A,B) {
total = A.length + B.length;
if (total > DEFAULT_MAX_SIZE) {// bitmap?
create empty bitmap C and add both A and B to it
if (C.cardinality <= DEFAULT_MAX_SIZE) {
convert C to array
} else if (C is full) {
convert C to run
} else {
C is fine as a bitmap
}
}
otherwise merge two arrays and output array
}
21
22. Array vs. Bitmap (Intersection)...
Intersection: Always an array.
Branchy (3 to 16 cycles per array value):
answer = new array
for value in array {
if value in bitset {
append value to answer
}
}
22
23. Branchless (3 cycles per array value):
answer = new array
pos = 0
for value in array {
answer[pos] = value
pos += bit_value(bitset, value)
}
23
24. Array vs. Bitmap (Union)...
Always a bitset. Very fast. Few cycles per value in array.
answer = clone the bitset
for value in array { // branchless
set bit in answer at index value
}
Without tracking the cardinality ≈ 1.65 cycles per value
Tracking the cardinality ≈ 2.2 cycles per value
24
25. Parallelization is not just multicore + distributed
In practice, all commodity processors support Single instruction,
multiple data (SIMD) instructions.
Raspberry Pi
Your phone
Your PC
Working with words x × larger has the potential of multiplying the
performance by x.
No lock needed.
Purely deterministic/testable.
25
26. SIMD is not too hard conceptually
Instead of working with x + y you do
(x , x , x , x ) + (y , y , y , y ).
Alas: it is messy in actual code.
1 2 3 4 1 2 3 4
26
27. With SIMD small words help!
With scalar code, working on 16‑bit integers is not 2 × faster than
32‑bit integers.
But with SIMD instructions, going from 64‑bit integers to 16‑bit
integers can mean 4 × gain.
Roaring uses arrays of 16‑bit integers.
27
28. Bitsets are vectorizable
Logical ORs, ANDs, ANDNOTs, XORs can be computed fast with
Single instruction, multiple data (SIMD) instructions.
Intel Cannonlake (late 2017), AVX‑512
Operate on 64 bytes with ONE instruction
→ Several 512‑bit ops/cycle
Java 9's Hotspot can use AVX 512
ARM v8‑A to get Scalable Vector Extension...
up to 2048 bits!!!
28
30. Vectorization matters!
for(size_t i = 0; i < len; i++) {
a[i] |= b[i];
}
using scalar : 1.5 cycles per byte
with AVX2 : 0.43 cycles per byte (3.5 × better)
With AVX‑512, the performance gap exceeds 5 ×
Can also vectorize OR, AND, ANDNOT, XOR + population count
(AVX2‑Harley‑Seal)
30
31. Vectorization beats popcnt
int count = 0;
for(size_t i = 0; i < len; i++) {
count += popcount(a[i]);
}
using fast scalar (popcnt): 1 cycle per input byte
using AVX2 Harley‑Seal: 0.5 cycles per input byte
even greater gain with AVX‑512
31
32. Sorted arrays
sorted arrays are vectorizable:
array union
array difference
array symmetric difference
array intersection
sorted arrays can be compressed with SIMD
32
33. Bitsets are vectorizable... sadly...
Java's hotspot is limited in what it can autovectorize:
1. Copying arrays
2. String.indexOf
3. ...
And it seems that Unsafe effectively disables autovectorization!
33
34. There is hope yet for Java
One big reason, today, for binding closely to hardware is to
process wider data flows in SIMD modes. (And IMO this is a
long‑term trend towards right‑sizing data channel widths, as
hardware grows wider in various ways.) AVX bindings are where
we are experimenting, today
(John Rose, Oracle)
34
35. Fun things you can do with SIMD: Masked VByte
Consider the ubiquitous VByte format:
Use 1 byte to store all integers in [0, 2 )
Use 2 bytes to store all integers in [2 , 2 )
...
Decoding can become a bottleneck. Google developed Varint‑GB.
What if you are stuck with the conventional format? (E.g., Lucene,
LEB128, Protocol Buffers...)
7
7 14
35
36. Masked VByte
Joint work with J. Plaisance (Indeed.com) and N. Kurz.
http://maskedvbyte.org/
36
37. Go try it out!
Fully vectorized Roaring implementation (C/C++):
https://github.com/RoaringBitmap/CRoaring
Wrappers in Python, Go, Rust...
37