Recent research results on optimizing column-oriented indexes for faster data warehousing. This talk aims to answer the following question: when is sorting the table a sufficiently good optimization?
Column-oriented databases have become fashionable following the work of Stonebraker et al. In the data warehousing industry, the terms "column oriented" and "column store" have become necessary marketing buzzwords. One of the benefits of column-oriented indexes is good compression through run-length encoding (RLE). This type of compression is particularly beneficial since it simultaneously reduces the volume of data and the necessary computations. However, the efficiency of the compression depends on the order of the rows in the table, and this is even more important with larger tables. Finding the best row ordering is NP-hard. We compare some heuristics for this problem, including variations on the lexicographical order, Gray codes, and Hilbert space-filling curves.
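To make the dependence on row order concrete, here is a minimal sketch that counts the runs RLE would have to store per column, before and after a plain lexicographical sort. The helper names (`run_count`, `total_runs`) and the toy table are illustrative assumptions, not taken from the talk.

```python
from itertools import groupby


def run_count(column):
    """Number of runs that run-length encoding would store for one column."""
    return sum(1 for _ in groupby(column))


def total_runs(rows):
    """Total number of runs across all columns of a table given as row tuples."""
    return sum(run_count(column) for column in zip(*rows))


# Hypothetical toy table: (country, gender)
rows = [
    ("CA", "F"), ("US", "M"), ("CA", "M"), ("US", "F"),
    ("CA", "F"), ("US", "M"), ("CA", "M"), ("US", "F"),
]

# Sorting rows lexicographically groups equal values in the leading columns.
print(total_runs(rows), total_runs(sorted(rows)))  # prints "13 6"
```

Lexicographical sorting is only one of the heuristics mentioned above; the same counting function could be used to compare other orderings.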
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable-length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck, especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.
http://arxiv.org/pdf/1503.07387.pdf
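The byte layout described in the abstract can be sketched with a conventional scalar encoder and decoder. This is not the Masked VByte decoder from the paper; it is the plain baseline, and it assumes the least significant 7-bit group is stored first, which the abstract does not state.

```python
def vbyte_encode(values):
    """Encode non-negative integers; the high bit is 1 on every byte but the last."""
    out = bytearray()
    for value in values:
        if value < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)  # 7 payload bits + continuation flag
            value >>= 7
        out.append(value)  # final byte: high bit is 0
    return bytes(out)


def vbyte_decode(data):
    """Scalar decoder: one data-dependent branch per byte."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:  # continuation flag set: more bytes follow
            shift += 7
        else:  # high bit is 0: the integer is complete
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("truncated input: last byte has its continuation flag set")
    return values


sample = [0, 127, 128, 300, 2**32 - 1]
assert vbyte_decode(vbyte_encode(sample)) == sample
```

The branch on the continuation flag inside the loop is the kind of unpredictable branch the abstract identifies as a bottleneck.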
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da... Cloudera, Inc.
Processing of large data requires new approaches to data mining: low, close to linear, complexity and stream processing. While in traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, to infer a model for predicting future/takeout observations, in stream processing the problem is often posed as extracting as much information as possible from the current data to convert it to an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas to speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian Learning on top of Bayesian Counters.
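As a rough illustration of mining on top of counters, the sketch below trains a Naive Bayes classifier purely by incrementing counts, so it can absorb a stream one observation at a time. It uses an in-memory `Counter` as a stand-in for HBase counters; the class name, the key layout `(label, feature index, value)`, and the add-one smoothing are assumptions for illustration and not the speaker's schema or implementation.

```python
import math
from collections import Counter


class CounterNaiveBayes:
    """Naive Bayes whose only state is a set of counters (hypothetical sketch)."""

    def __init__(self):
        self.class_counts = Counter()    # label -> number of observations
        self.feature_counts = Counter()  # (label, feature index, value) -> count
        self.seen_values = {}            # feature index -> set of observed values

    def observe(self, features, label):
        """Update the model with one streamed observation: increments only."""
        self.class_counts[label] += 1
        for index, value in enumerate(features):
            self.feature_counts[(label, index, value)] += 1
            self.seen_values.setdefault(index, set()).add(value)

    def predict(self, features):
        """Return the label with the highest smoothed log-posterior, or None."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for index, value in enumerate(features):
                distinct = len(self.seen_values.get(index, ())) or 1
                hits = self.feature_counts[(label, index, value)]
                score += math.log((hits + 1) / (count + distinct))  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = CounterNaiveBayes()
model.observe(("sunny", "hot"), "out")
model.observe(("sunny", "mild"), "out")
model.observe(("rainy", "mild"), "in")
model.observe(("rainy", "cool"), "in")
print(model.predict(("sunny", "hot")))  # prints "out"
```

Because training consists only of increments, the same updates could in principle be issued as atomic counter increments against a distributed store.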
C has become pretty old school, but the way C developers, now and throughout its history, have used and extended C is the foundation of many amazing dynamic languages. This talk describes some dynamic C data structures and also gives an overview of developing a language on top of these tools.
AWS July Webinar Series - Getting Started with Amazon DynamoDB Amazon Web Services
This webinar provides an overview of Amazon DynamoDB, a fast, flexible, and fully managed NoSQL database service for mobile, web, AdTech, IoT and gaming applications that need consistent, single-digit millisecond latency at any scale. The webinar will cover key topics around the general architecture of DynamoDB, data types, throughput provisioning, querying and indexing, and recent features.
The webinar includes a live demo of the basic operations used to read and write data to a DynamoDB table, and how the concept of provisioned IO affects the throughput of these operations.
Learning Objectives:
Enable users to understand how DynamoDB works so that they can evaluate and use DynamoDB as the data store for their application
Faster Practical Block Compression for Rank/Select Dictionaries Rakuten Group, Inc.
We present faster practical encoding and decoding procedures for block compression. Such encoding and decoding procedures are important to efficiently support rank/select queries on compressed bit vectors. This paper was presented at the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017) in Palermo, Italy.
An Efficient Language Model Using Double-Array Structures Jun-ya Norimatsu
Presentation slides from EMNLP 2013.
Paper : http://aclweb.org/anthology/D/D13/
Direct Link : http://aclweb.org/anthology/D/D13/D13-1023.pdf
Source Code : https://github.com/jnory/DALM
Christoph Koch is a professor of Computer Science at EPFL, specializing in data management. Until 2010, he was an Associate Professor in the Department of Computer Science at Cornell University. Before that, from 2005 to 2007, he was an Associate Professor of Computer Science at Saarland University. Earlier, he obtained his PhD in Artificial Intelligence from TU Vienna and CERN (2001), was a postdoctoral researcher at TU Vienna and the University of Edinburgh (2001-2003), and an assistant professor at TU Vienna (2003-2005). He has won Best Paper Awards at PODS 2002, ICALP 2005, and SIGMOD 2011, an Outrageous Ideas and Vision Paper Award at CIDR 2013, a Google Research Award (in 2009), and an ERC Grant (in 2011). He is a PI of the FET Flagship Human Brain Project and of NCCR MARVEL, a new Swiss national research center for materials research. He (co-)chaired the program committees of DBPL 2005, WebDB 2008, ICDE 2011, and VLDB 2013, and was PC vice-chair of ICDE 2008 and ICDE 2009. He has served on the editorial board of ACM Transactions on Internet Technology and as Editor-in-Chief of PVLDB.
A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.
Apache Cassandra, part 1 – principles, data model Andrey Lomakin
The aim of this presentation is to provide enough information for an enterprise architect to decide whether Cassandra should be the project's data store. The presentation describes each nuance of the Cassandra architecture and ways to design data and work with it.
Accurate and efficient software microbenchmarks Daniel Lemire
Software is often improved incrementally. Each software optimization should be assessed with microbenchmarks. In a microbenchmark, we record performance measures such as elapsed time or instruction counts during specific tasks, often in idealized conditions. In principle, the process is easy: if the new code is faster, we adopt it. Unfortunately, there are many pitfalls, such as unrealistic statistical assumptions and poorly designed benchmarks. Abstractions like cloud computing add further challenges. We illustrate effective benchmarking practices with examples.
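A minimal sketch of such a microbenchmark follows. It repeats the task several times and reports both the minimum and the median elapsed time rather than a single measurement or a bare average; that reporting choice is an assumption made here for illustration, not a recommendation quoted from the talk, and the function name `benchmark` is hypothetical.

```python
import time


def benchmark(task, repeat=20):
    """Run `task` `repeat` times; return (minimum, median) elapsed seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[0], timings[len(timings) // 2]


def old_version():
    return sum([i * i for i in range(10_000)])  # builds a temporary list


def new_version():
    return sum(i * i for i in range(10_000))  # streams values instead


for candidate in (old_version, new_version):
    best, median = benchmark(candidate)
    print(f"{candidate.__name__}: min {best * 1e6:.1f} us, median {median * 1e6:.1f} us")
```

A large gap between the minimum and the median is a hint that the measurements are noisy and that a simple "faster wins" comparison may not be trustworthy.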
Presentation on Roaring bitmaps for the Go Montreal meetup (Go 10th anniversary).
Roaring bitmaps are a standard indexing data structure. They are widely used in search and database engines. For example, Lucene, the search engine powering Wikipedia, relies on Roaring. The Go library roaring implements Roaring bitmaps in Go. It is used in production in several popular systems such as InfluxDB, Pilosa and Bleve, and it is part of the Awesome Go collection. After presenting the library, we will cover some advanced Go topics such as the use of assembly language, unsafe mappings, and so forth.
Our disks and networks can load gigabytes of data per second; we feel strongly that our software should follow suit. Thus we wrote what might be the fastest JSON parser in the world, simdjson. It can parse typical JSON files at speeds of over 2 GB/s on a single commodity Intel core with full validation; it is several times faster than conventional parsers.
How did we go so fast? We started with the insight that we should make full use of the SIMD instructions available on commodity processors. These instructions are everywhere, from the ARM chip in your smartphone all the way to server processors. SIMD instructions work on wide registers (e.g., spanning 32 bytes): they are faster because they process more data using fewer instructions. To our knowledge, nobody had ever attempted to produce a full parser for something as complex as JSON by relying primarily on SIMD instructions, and many people were skeptical that a full parser could be done fruitfully with SIMD instructions. We had to develop interesting new strategies that are generally applicable. In the end, we learned several lessons. Maybe one of the most important lessons is the importance of a nearly obsessive focus on performance metrics. We constantly measure the impact of the choices we make.
Next Generation Indexes For Big Data Engineering (ODSC East 2018) Daniel Lemire
Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Some of our prior work includes (1) Roaring indexes, which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, and Kylin, and (2) EWAH indexes, which are part of Git (GitHub) and included in major Linux distributions.
We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.
The talk will be aimed at programmers who want to better understand the performance characteristics of current big-data systems as well as their evolution. The following specific topics will be addressed:
1. The various types of indexes and their performance characteristics and trade-offs: hashing, sorted arrays, bitsets and so forth.
2. Index and table compression techniques: binary packing, patched coding, dictionary coding, frame-of-reference.
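Two of the techniques listed above, frame-of-reference and binary packing, can be sketched together: store the block minimum once, then pack the offsets using just enough bits for the largest one. The function names and the use of a single Python integer as the packed buffer are illustrative assumptions; production code would pack into machine words and process fixed-size blocks.

```python
def for_encode(values):
    """Frame-of-reference + binary packing for one non-empty block of integers."""
    if not values:
        raise ValueError("block must be non-empty")
    base = min(values)                     # the frame of reference
    offsets = [v - base for v in values]   # small non-negative residuals
    width = max(offsets).bit_length()      # bits needed per packed value
    packed = 0
    for i, offset in enumerate(offsets):
        packed |= offset << (i * width)    # binary packing at a fixed width
    return base, width, len(values), packed


def for_decode(base, width, count, packed):
    """Invert for_encode: unpack fixed-width offsets and add the base back."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


block = [1000, 1003, 1001, 1007]
base, width, count, packed = for_encode(block)
print(base, width)  # prints "1000 3": 3 bits per value instead of 10
assert for_decode(base, width, count, packed) == block
```

Patched coding extends this idea by storing the few values that do not fit the chosen width separately as exceptions.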
Ingénierie de la performance au sein des mégadonnées Daniel Lemire
Software indexes accelerate applications in business intelligence, machine learning, and data science. They often determine the performance of big-data applications. Efficient indexes improve not only latency and throughput, but also energy consumption. Many indexes use memory sparingly so that critical data stays close to the processor. It is also desirable to work directly on compressed data to avoid an additional decoding step.
(1) We focus on bitmap indexes. We find them in a wide range of systems: Oracle, Hive, Spark, Druid, Kylin, Lucene, Elastic, Git... They are a component of systems, such as Wikipedia or GitHub, that millions of users depend on every day. We present some recent progress in optimizing bitmap indexes as they are used in current systems. We show by example how to multiply the performance of these indexes, in some cases, on processors with advanced SIMD (single instruction, multiple data) instructions.
(2) We also target the lists of integers found in B+ trees, inverted indexes, and compressed bitmap indexes. We give a recent example of an integer compression technique (Stream VByte) that can decode billions of compressed integers per second.
SIMD Compression and the Intersection of Sorted Integers Daniel Lemire
Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.
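For readers unfamiliar with galloping, the sketch below shows a scalar galloping (exponential search) intersection of two sorted lists. It is not the SIMD Galloping algorithm from the talk, which additionally compares several integers per instruction; the function names are illustrative, and both inputs are assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Return the smallest index i >= lo with arr[i] >= target, or len(arr)."""
    step = 1
    hi = lo
    # Double the stride until we overshoot the target or run off the end.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer now lies in [lo, hi]; finish with a binary search.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect(small, large):
    """Intersect a short sorted list with a much longer sorted list."""
    result = []
    position = 0
    for value in small:
        position = gallop(large, value, position)
        if position == len(large):
            break  # every remaining value in `small` is too large
        if large[position] == value:
            result.append(value)
    return result


postings = list(range(0, 1000, 3))  # hypothetical long posting list
print(intersect([3, 50, 900], postings))  # prints "[3, 900]"
```

Galloping pays off when one list is much shorter than the other, since the cost grows with the length of the short list and only logarithmically with the gaps in the long one.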
Decoding billions of integers per second through vectorization Daniel Lemire
In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.
Logarithmic Discrete Wavelet Transform for High-Quality Medical Image Compres... Daniel Lemire
Nowadays, medical image compression is an essential process in eHealth systems. Compressing medical images at high quality is vital to avoid misdiagnosis by radiologists. WAAVES is a promising medical image compression algorithm based on the discrete wavelet transform (DWT) that achieves high compression performance compared to the state of the art. The main aims of this work are to enhance image quality when compressing with WAAVES and to provide a high-speed DWT architecture for image compression on embedded systems. Regarding the quality improvement, the logarithmic number system (LNS) was explored as an alternative to linear arithmetic in DWT computations. A new LNS library was developed and validated to realize the logarithmic DWT. In addition, a new quantization method (LNS-Q) based on logarithmic arithmetic was proposed. A novel compression scheme (LNS-WAAVES), based on integrating the Hybrid-DWT and the LNS-Q method with WAAVES, was developed. Hybrid-DWT combines the advantages of both the logarithmic and the linear domains, leading to enhancement of the image quality and the compression ratio. The results showed that LNS-WAAVES is able to improve quality by 8% and up to 34% compared to WAAVES, depending on the compression configuration parameters and the image modalities.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Better bitmap performance with Roaring bitmaps
Bitmaps are used to implement fast set operations in software. They are frequently found in databases and search engines.
Without compression, bitmaps scale poorly, so they are often compressed. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). For example, Oracle relies on BBC bitmap compression while the version control system Git and Apache Hive rely on EWAH compression.
We can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has been adopted by several production platforms (e.g., Apache Lucene/Solr/Elastic, Apache Spark, eBay's Apache Kylin and Metamarkets' Druid).
Overall, our implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.
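The hybrid layout described above can be sketched in a few lines. Each 32-bit value is split into its high 16 bits, which select a container, and its low 16 bits, which are stored either in a sorted array (sparse container) or in a 65,536-bit bitmap (dense container). The 4,096-element switch-over is the threshold used by Roaring; the class name `TinyRoaring`, the use of a dict for the top level, and the Python integer standing in for the bitmap are simplifying assumptions, not how the production libraries are written.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # an array container holds at most this many 16-bit values


class TinyRoaring:
    """Toy two-level set of 32-bit unsigned integers (illustrative only)."""

    def __init__(self):
        # high 16 bits -> sorted list (array container) or int (bitmap container)
        self.containers = {}

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            self.containers[high] = [low]
        elif isinstance(container, int):
            self.containers[high] = container | (1 << low)
        else:
            index = bisect_left(container, low)
            if index == len(container) or container[index] != low:
                container.insert(index, low)
                if len(container) > ARRAY_LIMIT:
                    # Too dense for an array: convert to an uncompressed bitmap.
                    bitmap = 0
                    for item in container:
                        bitmap |= 1 << item
                    self.containers[high] = bitmap

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, int):
            return (container >> low) & 1 == 1
        index = bisect_left(container, low)
        return index < len(container) and container[index] == low


bitmap = TinyRoaring()
for value in range(5000):
    bitmap.add(value)        # dense chunk: ends up as a bitmap container
bitmap.add(1 << 20)          # lone value in another chunk: stays in an array
print(4999 in bitmap, 5000 in bitmap, (1 << 20) in bitmap)  # prints "True False True"
```

The choice of container per chunk is what lets the structure stay compact for sparse data while keeping bitwise-speed operations on dense data.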
La vectorisation des algorithmes de compression Daniel Lemire
Since the introduction of the Pentium 4, our processors have had vector instructions. By explicitly taking these instructions into account when designing our algorithms, we can greatly speed up computations. As an example, consider the compression of integer lists as it is performed in most search engines and databases. In this area, we often still use algorithms developed in the 1970s. We explain how to do much better in terms of speed by exploiting vector instructions.
Extracting, Transforming and Archiving Scientific Data Daniel Lemire
It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
Innovation without permission: from Codd to NoSQL Daniel Lemire
Practitioners often fail to apply textbook database design principles. We observe both a perversion of the relational model and a growth of less formal alternatives. Overall, there is an opposition between the analytic thought that prevailed when many data modeling techniques were initiated, and the pragmatism which now dominates among practitioners. There are at least two recent trends supporting this rejection of traditional models: (1) the rise of the sophisticated user, most notably in social media, is a challenge to the rationalist view, as it blurs the distinction between design and operation; (2) in the new technological landscape where there are billions of interconnected computers worldwide, simple concepts like consistency sometimes become prohibitively expensive. Overall, for a wide range of information systems, design and operation are becoming integrated in the spirit of pragmatism. Thus, we are left with design methodologies which embrace fast and continual iterations and exploratory testing. These methodologies allow innovation without permission in that the right to design new features is no longer so closely guarded.
All About Bitmap Indexes... And Sorting Them Daniel Lemire
A review of bitmap indexes from an academic perspective. Several theoretical results are presented. The talk also discusses technical issues regarding sorting tables prior to indexing, as a way to improve the indexes.
Much of the talk is based on the following preprint:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes.
http://arxiv.org/abs/0901.3751
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP Daniel Lemire
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliably very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 Tobias Schneck
As AI technology pushes into IT, I have been wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and offer a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Faster Column-Oriented Indexes
1. Faster Column-Oriented Indexes
Daniel Lemire
http://www.professeurs.uqam.ca/pages/lemire.daniel.htm
blog: http://www.daniel-lemire.com/
Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).
February 10, 2010
Daniel Lemire Faster Column-Oriented Indexes
2. Some trends in business intelligence (BI)
Low-latency BI, Complex Event Processing [Hyde, 2010]
Commoditization, open source software: Pentaho, LucidDB (http://www.luciddb.org/)
Column-oriented databases ←
source: gooddata.com
4. Column Stores
Goes back to StatCan in the seventies [Turner et al., 1979]
Made fashionable again in Data Warehousing by Stonebraker [Stonebraker et al., 2005]
New: Oracle Exadata hybrid columnar compression
[Figure: a table with columns name, date, age, sex, salary]
5. Vectorization
Modern superscalar CPUs support vectorization (SSE)
const int N = 2048;
int a[N], b[N];
int i = 0;
for (; i < N; i++)
    a[i] += b[i];
This code is four times faster with -ftree-vectorize (GNU GCC)
Need long streams, same data type, and no branching.
Columns are good candidates!
6. Main column-oriented indexes
(1) Bitmap indexes [O’Neil, 1989]
(2) Projection indexes [O’Neil and Quass, 1997]
Both are compressible.
7. Bitmap indexes
Vectors of booleans
SELECT * FROM T WHERE x=a AND y=b;
Above, compute
{r | r is the row id of a row where x = a} ∩
{r | r is the row id of a row where y = b}
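The intersection above can be sketched in a few lines of Python. This is only an illustration: the helper names build_bitmap_index and row_ids, and the two toy columns, are assumptions that do not come from the talk, and an arbitrary-precision integer stands in for an uncompressed bitmap.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to an int whose bit r is set when row r holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index

def row_ids(bitmap):
    """Return the positions of the set bits, i.e. the matching row ids."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids

# Hypothetical two-column table T with five rows.
x = ["a", "b", "a", "a", "c"]
y = ["b", "b", "a", "b", "b"]
x_index = build_bitmap_index(x)
y_index = build_bitmap_index(y)

# SELECT * FROM T WHERE x=a AND y=b becomes a single bitwise AND.
matches = row_ids(x_index["a"] & y_index["b"])
```

Here matches holds the ids of the rows where both predicates are true; the rows themselves would then be fetched by id.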
8. Other applications of the bitmaps/bitsets
The Java language has had a bitmap class since the
beginning: java.util.BitSet. (Sun’s implementation is based
on 64-bit words.)
Search engines use bitmaps to filter queries, e.g. Apache
Lucene: org.apache.lucene.util.OpenBitSet.java.
9. Bitmaps and fast AND/OR operations
Computing the union of two sets of integers between 1 and 64
(e.g., row ids, trivial table)...
E.g., {1, 5, 8} ∪ {1, 3, 5}?
Can be done in one operation by a CPU:
BitwiseOR(10001001, 10101000)
Extend to sets from 1..N using ⌈N/64⌉ operations.
To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]:
a0, ..., a63 BitwiseOR b0, ..., b63;
a64, ..., a127 BitwiseOR b64, ..., b127;
a128, ..., a191 BitwiseOR b128, ..., b191;
...
It is a form of vectorization.
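As a rough sketch of the same idea in Python: the function names below are illustrative assumptions, and element v is stored in bit v-1 counted from the least significant end, whereas the slide writes the bits left to right.

```python
def to_bitmap(values):
    """Encode a set of positive integers as an int: element v sets bit v-1."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << (v - 1)
    return bitmap

def from_bitmap(bitmap):
    """Decode an int back into the set of positive integers it represents."""
    return {pos + 1 for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1}

def or_words(a_words, b_words):
    """Union of two equally long bitmaps stored as lists of 64-bit words, one OR per word."""
    return [a | b for a, b in zip(a_words, b_words)]

# The example from the slide: {1, 5, 8} union {1, 3, 5}.
union = from_bitmap(to_bitmap({1, 5, 8}) | to_bitmap({1, 3, 5}))
```

For sets that fit in one word, the union is a single OR; or_words shows the word-by-word extension to longer bitmaps.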
10. What are bitmap indexes for?
Myth: bitmap indexes are for low cardinality columns (e.g.,
SEX).
the Bitmap index is the conclusive choice for data
warehouse design for columns with high or low
cardinality [Zaker et al., 2008].
11. Projection indexes
Write out the (normalized) column values sequentially.
It is a projection of the table on a single column.
Best for low selectivity queries on few columns:
SELECT sum(number*price) FROM T;
[Figure: a table with columns name, date, city]
12. How to compress column indexes?
Must handle long streams of identical values efficiently ⇒
Run-length encoding? (RLE)
Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...
So just encode the run lengths, e.g.,
0001111100010111 → 3, 5, 3, 1, 1, 3
It is a bit more complicated (more another day)
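A minimal sketch of this plain run-length scheme, assuming the bitmap is given as a string of '0' and '1' characters; the names run_lengths and decode are illustrative, and real word-aligned formats are more involved, as the slide notes.

```python
from itertools import groupby

def run_lengths(bits):
    """Return the lengths of the maximal runs of identical symbols."""
    return [len(list(group)) for _, group in groupby(bits)]

def decode(runs, first_bit="0"):
    """Rebuild the bit string, assuming runs alternate starting with first_bit."""
    other = "1" if first_bit == "0" else "0"
    return "".join((first_bit if i % 2 == 0 else other) * n
                   for i, n in enumerate(runs))

encoded = run_lengths("0001111100010111")
decoded = decode(encoded)
```

Note that the run lengths alone do not say whether the bitmap starts with a 0 or a 1, which is why decode takes first_bit as a parameter.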
13. What about other compression types?
With RLE, we can often process the data in compressed form
Hence, with RLE, compression saves both storage and
CPU cycles!!!!
Not always true with other techniques such as Huffman,
LZ77, Arithmetic Coding, . . .
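To illustrate processing data in compressed form, here is a small sketch that counts the set bits and negates a bitmap using only its run lengths. It assumes the convention that runs alternate 0s, 1s, 0s, ... and that the first run (of 0s) may have length zero; this convention and the function names are assumptions made for the example.

```python
def count_ones(runs):
    """Number of set bits: the runs at odd positions are the runs of 1s."""
    return sum(runs[1::2])

def negate(runs):
    """Logical NOT without decompressing: shift the alternation by one run."""
    if runs and runs[0] == 0:
        return runs[1:]
    return [0] + runs

runs = [3, 5, 3, 1, 1, 3]          # 0001111100010111
ones = count_ones(runs)
zeros = count_ones(negate(runs))
```

Both operations touch one integer per run rather than one bit per row, so shorter run lists mean less work as well as less storage.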
14. How do we improve performance?
Smaller indexes are faster.
In data warehousing: data is often updated in batches.
So spend time at construction time optimizing the index.
15. Modelling the size of an index
Any formal result?
Tricky: There are many variations on RLE.
Use: number of runs of identical value in a column
AAABBBCCAA has 4 runs
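The run count used as a size model is easy to compute. In this sketch, count_runs and table_runs are illustrative names, and a table is assumed to be a list of equal-length tuples.

```python
from itertools import groupby

def count_runs(column):
    """Number of maximal runs of identical values in a sequence."""
    return sum(1 for _ in groupby(column))

def table_runs(rows):
    """Total number of runs over all columns of a table given as a list of rows."""
    return sum(count_runs(column) for column in zip(*rows))

example = count_runs("AAABBBCCAA")
```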
16. Improving compression by reordering the rows
RLE-compressed indexes are order-sensitive:
they compress sorted tables better.
But finding the best row ordering is
NP-hard [Lemire et al., 2010].
Actually an instance of the Traveling Salesman Problem
(TSP)
So we use heuristics:
lexicographic order
Gray codes
Hilbert, ...
17. How many ways to sort? (1)
Lexicographic row sorting is
fast, even for very large tables.
easy: sort is a Unix staple.
Substantial index-size reductions (often 2.5 times, benefits grow with table size)
Example of rows in lexicographic order:
a a
a b
a c
b a
b b
b c
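A toy illustration of the effect, using the run count as the size model: the six-row table below is made up for the example, and the gain on such a tiny table says nothing about the 2.5 times figure quoted above.

```python
from itertools import groupby

def table_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(sum(1 for _ in groupby(column)) for column in zip(*rows))

# Hypothetical shuffled table with two columns.
rows = [("b", "c"), ("a", "a"), ("b", "a"), ("a", "c"), ("b", "b"), ("a", "b")]

runs_before = table_runs(rows)
runs_after = table_runs(sorted(rows))   # Python sorts tuples lexicographically
```

Sorting groups the first column into two runs; the second column gains less, which is typical of recursive orders where later columns benefit less.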
18. How many ways to sort? (2)
Gray codes are lists of tuples with successive (Hamming) distance of 1 [Knuth, 2005].
Reflected Gray code order is sometimes slightly better than lexicographical...
Example of rows in reflected Gray code order:
a a
a b
a c
b c
b b
b a
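One way to enumerate tuples in reflected Gray code order is to reverse the direction of the last column every time the prefix changes. The sketch below assumes values are encoded as integers 0, 1, 2, ... (so a=0, b=1, c=2 in the example above) and lists every possible tuple; the function names are illustrative, and sorting an actual table would need a comparison key instead.

```python
def reflected_gray_order(cardinalities):
    """Return all tuples over the given column cardinalities in reflected Gray code order."""
    if not cardinalities:
        return [()]
    *prefix_cards, last = cardinalities
    order = []
    for i, prefix in enumerate(reflected_gray_order(prefix_cards)):
        # Ascend on even-numbered prefixes, descend on odd-numbered ones.
        values = range(last) if i % 2 == 0 else range(last - 1, -1, -1)
        order.extend(prefix + (v,) for v in values)
    return order

def hamming(u, v):
    """Number of positions where two tuples differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

order = reflected_gray_order([2, 3])
```

With cardinalities 2 and 3, order reproduces the six-row example, and successive tuples always differ in exactly one column.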
19. How many ways to sort? (3)
Reflected Gray code order is not the only Gray code.
Knuth also presents Modular Gray-code.
Example of rows in modular Gray code order:
a a
a b
a c
b c
b a
b b
20. How many ways to sort? (4)
Hilbert Index [Hamilton and Rau-Chaplin, 2007].
Also a Gray code (conditionally)
Gives very bad results for column-oriented indexes.
21. Recursive orders
Lexicographical, reflected Gray code and modular Gray
code belong to a larger class: recursive orders.
They sort on the first column, then the second and so on.
Not all Gray codes are recursive orders: Hilbert is not.
22. Best column order?
Column order is important for recursive orders.
We almost have this result [Lemire and Kaser, 2009]:
Proposition
For any recursive order, ordering the columns by increasing cardinality (small to LARGE) minimizes the expected number of runs (among all possible column orders).
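The effect of column order can be seen on a tiny example. The sketch below builds a complete table over a 2-value column and a 3-value column, sorts it lexicographically under both column orders, and counts the runs. It is a single deterministic instance, not a check of the statement about expected values, and all names are illustrative.

```python
from itertools import groupby, product

def table_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    return sum(sum(1 for _ in groupby(column)) for column in zip(*rows))

def runs_after_sort(rows, column_order):
    """Permute the columns, sort lexicographically, and count the runs."""
    reordered = sorted(tuple(row[c] for c in column_order) for row in rows)
    return table_runs(reordered)

# Column 0 has cardinality 2, column 1 has cardinality 3; all 6 combinations occur.
rows = list(product("ab", "xyz"))

small_first = runs_after_sort(rows, [0, 1])   # low-cardinality column first
large_first = runs_after_sort(rows, [1, 0])   # high-cardinality column first
```

Putting the low-cardinality column first gives 2 + 6 runs, against 3 + 6 runs for the other order.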
23. How do you know when the lexicographical order is good enough?
Even though row reordering is NP-hard, we find it hard to
improve over recursive orders.
Sometimes, fancier alternatives (to be discussed another day)
work better, but not always.
24. Thankfully, we can detect cases where recursive orders are good enough
We can bound the suboptimality of all recursive orders.
Proposition
Consider a table with n distinct rows and column cardinalities N_i for i = 1, ..., c. Recursive ordering is µ-optimal for the problem of minimizing the runs where
$$\mu = \frac{\min(N_1, n) + \min(N_1 N_2, n) + \cdots + \min(N_1 N_2 \cdots N_c, n)}{n}.$$
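The bound is cheap to evaluate once n and the column cardinalities are known. A minimal sketch, in which the function name mu_bound is an assumption and the cardinalities are taken in the column order used for sorting:

```python
def mu_bound(cardinalities, n):
    """Evaluate mu = sum_k min(N_1 * ... * N_k, n) / n for the given column order."""
    total = 0
    prefix_product = 1
    for cardinality in cardinalities:
        prefix_product *= cardinality
        total += min(prefix_product, n)
    return total / n

mu_small_first = mu_bound([2, 3], 6)        # (2 + 6) / 6
mu_large_first = mu_bound([3, 2], 6)        # (3 + 6) / 6
mu_sparse = mu_bound([10, 10, 10], 50)      # (10 + 50 + 50) / 50
```

A value of µ close to 1 indicates that a recursive order such as lexicographic sorting cannot be far from the best possible number of runs.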
25. Bounding the optimality of sorting: the computation
How do you compute µ very fast so you know lexicographical
sort is good enough?
Trick is to determine n, the number of distinct rows without
sorting the table.
Thankfully: n can be estimated quickly with probabilistic
methods [Aouiche and Lemire, 2007].
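As an illustration of estimating n in one pass without sorting, here is a sketch of a K-minimum-values estimator. This is a swapped-in textbook technique chosen for brevity, not one of the estimators compared in [Aouiche and Lemire, 2007]; the function names, the hash choice (BLAKE2b from hashlib), and the default k are all assumptions.

```python
import hashlib
import heapq

def row_hash(row):
    """Hash a row to a pseudo-random float in [0, 1)."""
    digest = hashlib.blake2b(repr(row).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def estimate_distinct(rows, k=256):
    """Estimate the number of distinct rows from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is the largest of the k smallest
    members = set()  # the same hashes, for duplicate detection
    for row in rows:
        h = row_hash(row)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))      # fewer than k distinct hashes: exact count
    return (k - 1) / (-heap[0])      # k-th smallest hash sets the density

# Hypothetical table: 30000 rows but only 10000 distinct ones.
rows = [(i % 10000, "x") for i in range(30000)]
estimate = estimate_distinct(rows)
```

Memory use is bounded by k regardless of the table size, and the estimate can then be plugged into the formula for µ.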
26. Bounding the optimality of sorting: actual numbers
Data set            columns   µ
Census-Income 4-D   4         2.63
DBGEN 4-D           4         1.02
Netflix             4         2.00
Census1881          7         5.09
27. Take away message
Column stores are good because of vectorization and
RLE/sorting
Sorting is sometimes nearly optimal, but not always.
We can sometimes tell when sorting is optimal.
28. Future direction?
Minimizing the number of runs is the wrong problem! We
want to maximize long runs!
Must study fancier row-reordering heuristics.
29. Questions?
?
30. References
Aouiche, K. and Lemire, D. (2007). A comparison of five probabilistic view-size estimation techniques in OLAP. In DOLAP’07, pages 17–24.
Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163.
Hyde, J. (2010). Data in flight. Commun. ACM, 53(1):48–52.
Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.
Lemire, D. and Kaser, O. (2009). Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346.
Lemire, D., Kaser, O., and Aouiche, K. (2010). Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 69(1):3–28.
O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49.
O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.
Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327.
Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus B-Tree index. IJCC, 2(2).