This document outlines Simon Jonassen's research on efficient query processing in distributed search engines. It discusses three main areas:
2) Partitioned query processing, including semi-pipelined and pipelined approaches with skipping to improve throughput and latency (a toy sketch of posting-list skipping follows this list).
2) Skipping and pruning techniques like efficient compression and linear programming to improve pruning for disjunctive queries.
3) Caching approaches including modeling static two-level caching and prefetching query results to improve search engine performance. The research is evaluated using large test collections and clusters of up to 9 nodes.
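As a flavour of the skipping mentioned in point 2, here is a toy sketch of intersecting two sorted posting lists with skip pointers, so runs of documents that cannot match are jumped over instead of being scanned one by one. The list contents and the square-root skip distance are illustrative, not taken from the thesis.

```python
def intersect_with_skips(a, b):
    """AND two sorted posting lists of doc ids; skip pointers (every sqrt(n)-th
    entry) let the scan jump over runs of documents that cannot match."""
    skip_a = max(1, int(len(a) ** 0.5))
    skip_b = max(1, int(len(b) ** 0.5))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip pointer if it does not overshoot b[j], else step by one.
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 8, 64, 100]))  # -> [2, 8, 64]
```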
Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the course on Advanced Distributed Systems held by Prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
Resilient Distributed DataSets - Apache SPARK, by Taposh Roy
RDDs (Resilient Distributed Datasets) provide a fault-tolerant abstraction for data reuse across jobs in distributed applications. They allow data to be persisted in memory and manipulated using transformations like map and filter. This enables efficient processing of iterative algorithms. RDDs achieve fault tolerance by logging the transformations used to build a dataset rather than the actual data, enabling recovery of lost partitions through recomputation.
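A minimal PySpark sketch of these ideas (assuming a local Spark installation; the data, paths and storage level are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "rdd-demo")

# Build an RDD and describe a pipeline of transformations (map, filter).
# Nothing is computed yet: Spark only records the lineage of operations.
lines = sc.parallelize(["error: disk full", "ok", "error: timeout", "ok"])
errors = lines.filter(lambda l: l.startswith("error")).map(lambda l: l.split(": ")[1])

# Persist in memory so iterative jobs can reuse the dataset without recomputation.
errors.persist(StorageLevel.MEMORY_ONLY)

print(errors.count())    # first action triggers evaluation
print(errors.collect())  # reuses the cached partitions

# If a cached partition is lost, Spark replays the logged transformations
# (the lineage) on the original data instead of restoring a data checkpoint.
sc.stop()
```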
Efficient top-k queries processing in column-family distributed databases, by Rui Vieira
The document discusses efficient top-k query processing on distributed column family databases. It begins by introducing top-k queries and their uses. It then discusses challenges with naive solutions and prior work using batch processing. The document proposes three algorithms - TPUT, Hybrid Threshold, and KLEE - to enable real-time top-k queries on distributed data in a memory, bandwidth, and computation efficient manner. It also discusses implementation considerations for Cassandra's data model and CQL.
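As a rough, single-process illustration of the flavour of these algorithms, here is a much-simplified sketch of TPUT's three phases. The node contents and parameters are invented, the real algorithm runs across remote nodes, and the Hybrid Threshold and KLEE variants refine the same bounds.

```python
from collections import defaultdict
import heapq

def tput_topk(nodes, k):
    """nodes: list of dicts {item: local_score}. Simplified single-process
    sketch of TPUT's three phases (assumes at least k distinct items)."""
    m = len(nodes)

    # Phase 1: each node reports its k best local scores; build lower bounds.
    partial = defaultdict(float)
    for scores in nodes:
        for item, s in heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]):
            partial[item] += s
    tau1 = heapq.nlargest(k, partial.values())[-1]  # k-th best lower bound

    # Phase 2: ask every node for all items scoring at least tau1 / m.
    threshold = tau1 / m
    for scores in nodes:
        for item, s in scores.items():
            if s >= threshold and item not in partial:
                partial[item] = 0.0  # candidate discovered in phase 2

    # Phase 3: fetch exact scores for all candidates and rank them.
    exact = {item: sum(n.get(item, 0.0) for n in nodes) for item in partial}
    return heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])

nodes = [{"a": 9, "b": 5, "c": 1}, {"a": 2, "c": 8, "d": 7}, {"b": 6, "d": 4, "e": 3}]
print(tput_topk(nodes, k=2))  # three items tie at 11 here; two of them are returned
```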
Resilient Distributed Datasets (RDDs) provide a fault-tolerant abstraction for in-memory cluster computing. RDDs allow data to be partitioned across nodes and kept in memory for efficient reuse across jobs, while retaining properties of MapReduce like fault tolerance. RDDs track the lineage of transformations to rebuild lost data and optimize data placement and partitioning to minimize network shuffling.
Write intensive workloads and LSM trees, by Tilak Patidar
This document discusses LSM trees as an alternative to B+ trees for write-intensive workloads like time series and sensor data. LSM trees optimize for writes by sequentially appending data to sorted string tables, using bloom filters to index them, and periodically merging files in the background. This avoids issues with B+ trees like index locking during updates and excessive disk seeks from fragmentation over time. The document outlines the basic framework of LSM trees including memtables, sorted string tables, bloom filters, and compaction. It also describes how reads, writes, deletes, and compaction work in an LSM tree to provide fast, write-optimized performance for large volumes of time-series and sensor data.
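A toy Python sketch of the write path described above: an in-memory memtable is flushed into sorted, immutable runs standing in for sorted string tables. A real LSM engine adds a write-ahead log, bloom filters per run, and background compaction; names and sizes here are invented.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: writes go to an in-memory memtable, which is flushed to an
    immutable sorted run ('SSTable') once it grows past a threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # recent writes, newest value wins
        self.sstables = []             # list of sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value     # cheap, in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())   # write one sorted, immutable run
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # check the memtable first
            return self.memtable[key]
        for run in self.sstables:             # then runs from newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key) # binary search within the sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.sstables):   # oldest first, newer overwrites
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)       # triggers two flushes along the way
db.put("k2", 99)             # newer value shadows the one in an older run
db.compact()
print(db.get("k2"))          # -> 99
```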
The design and implementation of modern column oriented databases, by Tilak Patidar
An attempt to break down the paper on the design of column-oriented databases into simpler terms.
https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
https://blog.acolyer.org/2018/09/26/the-design-and-implementation-of-modern-column-oriented-database-systems/
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format, by Vimal Das Kammath
Apache CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage petabytes of data running on extraordinarily low-cost hardware and to answer queries around 10 times faster than current open source solutions (column-oriented SQL-on-Hadoop data stores).
Dremel: Interactive Analysis of Web-Scale Datasets, by robertlz
The document describes Dremel, an interactive analysis system for web-scale datasets. Dremel uses a columnar data storage model and tree-based query serving architecture to enable interactive analysis of trillion record datasets distributed across thousands of nodes. It provides an SQL-like interface and can process queries orders of magnitude faster than traditional MapReduce systems by avoiding record assembly costs. Experiments show Dremel can analyze tens to hundreds of billions of records interactively on commodity hardware.
Parquet performance tuning: the missing guide, by Ryan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
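One way to experiment with these knobs is through pyarrow (my choice, not something the deck prescribes); the column names, sizes and settings below are illustrative, and defaults vary by Parquet/pyarrow version.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_type": ["click", "view", "click", "purchase"] * 250_000,
    "user_id": list(range(1_000_000)),
})

# Sort on the filtered column so min/max statistics per row group are selective,
# and keep row groups smallish so more of them can be skipped.
table = table.sort_by("event_type")
pq.write_table(
    table,
    "events.parquet",
    row_group_size=128 * 1024,          # rows per row group
    use_dictionary=["event_type"],      # dictionary-encode the low-cardinality column
    compression="snappy",
)

# Row-group statistics and dictionary pages let the reader skip data that
# cannot match the predicate.
result = pq.read_table("events.parquet", filters=[("event_type", "=", "purchase")])
print(result.num_rows)
```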
This deep dive attempts to "de-mystify" Spark by touching on some of the main design philosophies and diving into some of the more advanced features that make it such a flexible and powerful cluster computing framework. It will touch on some common pitfalls and attempt to build some best practices for building, configuring, and deploying Spark applications.
Dremel interactive analysis of web scale datasets, by Carl Lu
Dremel is an interactive query system that can analyze large web-scale datasets containing trillions of records in seconds. It uses a columnar data layout and multi-level query execution trees to distribute queries across thousands of servers. Dremel's nested data model and column-striped storage allow it to efficiently retrieve and analyze only the necessary columns from large datasets. Experimental results demonstrated Dremel's ability to process queries over datasets containing trillions of records and petabytes of data in seconds using a cluster of thousands of servers.
The document discusses Dremel, an interactive query system for analyzing large-scale datasets. Dremel uses a columnar data storage format and a multi-level query execution tree to enable fast querying. It evaluates Dremel's performance on interactive queries, showing it can count terms in a field within seconds using 3000 workers, while MapReduce takes hours. Dremel also scales linearly and handles stragglers well. Today, similar systems like Google BigQuery and Apache Drill use Dremel-like techniques for interactive analysis of web-scale data.
PostgreSQL 9.5 introduces many new capabilities that enhance its functionality as a relational database with NoSQL features. Key additions include full JSON support with CRUD operations and pretty printing, row-level security policies, UPSERT functionality, enhanced password management, improved replication after failovers, foreign table constraints and inheritance, expanded analytics with CUBE and ROLLUP, new versions of PostGIS with additional geospatial tools, and various administrative improvements. These new features further strengthen PostgreSQL's position as the most advanced open source database platform.
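As an illustration of one of those additions, UPSERT is exposed as the INSERT ... ON CONFLICT clause; a minimal sketch via psycopg2, where the connection string, table and column names are placeholders:

```python
import psycopg2

# Placeholder DSN; assumes a PostgreSQL 9.5+ server and an existing database.
conn = psycopg2.connect("dbname=demo user=demo")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS counters (name text PRIMARY KEY, hits int)")

# UPSERT (new in 9.5): insert the row, or atomically update it if the key exists.
cur.execute(
    """
    INSERT INTO counters (name, hits) VALUES (%s, %s)
    ON CONFLICT (name) DO UPDATE SET hits = counters.hits + EXCLUDED.hits
    """,
    ("homepage", 1),
)
conn.commit()
```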
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running across computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distributed programming really accessible to data scientists. Here you can find an introduction and basic concepts.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
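A small, self-contained sketch of those preprocessing steps (tokenization, normalization, stop-word removal), with a deliberately naive suffix-stripper standing in for Porter stemming or lemmatization; the stop-word list is illustrative.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}

def tokenize(text):
    """Lowercase (normalization) and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def naive_stem(token):
    """Crude suffix stripping; a real system would use Porter stemming or lemmatization."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

doc = "Indexing documents: the parser removes stop words and stems the remaining terms."
print(preprocess(doc))
# -> ['index', 'document', 'parser', 'remov', 'stop', 'word', 'stem', 'remain', 'term']
```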
Making Sense of Spark Performance (Kay Ousterhout, UC Berkeley), by Spark Summit
This document summarizes the key findings from a study analyzing the performance bottlenecks in Spark data analytics frameworks. The study used three different workloads run on Spark and found that: network optimizations provided at most a 2% reduction in job completion time; CPU was often the main bottleneck rather than disk or network I/O; optimizing disk performance reduced completion time by less than 19%; and many straggler causes could be identified and addressed to improve performance. The document discusses the methodology used to measure bottlenecks and blocked times, limitations of the study, and reasons why the results differed from assumptions in prior works.
Challenges of Implementing an Advanced SQL Engine on Hadoop, by DataWorks Summit
Big SQL 3.0 is IBM's SQL engine for Hadoop that addresses challenges of building a first class SQL engine on Hadoop. It uses a modern MPP shared-nothing architecture and is architected from the ground up for low latency and high throughput. Key challenges included data placement on Hadoop, reading and writing Hadoop file formats, query optimization with limited statistics, and resource management with a shared Hadoop cluster. The architecture utilizes existing SQL query rewrite and optimization capabilities while introducing new capabilities for statistics, constraints, and pushdown to Hadoop file formats and data sources.
Rethinking Online SPARQL Querying to Support Incremental Result Visualization, by Olaf Hartig
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talk is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP, by Sujit Pal
This document summarizes a presentation about annotating millions of documents at scale using dictionary-based annotation with Apache Spark, Apache Solr, and Apache OpenNLP. The key points discussed include:
- The problem of annotating millions of documents from science corpora and the need to do it efficiently without model training.
- The architecture of SoDA (Dictionary Based Named Entity Annotator), which uses Apache Solr, SolrTextTagger, and OpenNLP for annotation and can be run on Spark for scaling.
- Performance optimizations made including combining paragraphs, tuning Solr garbage collection, and increasing the Spark cluster size, which resulted in annotating over 20 documents per second.
- Further work proposed
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ..., by Databricks
This document discusses computationally intensive machine learning at large scales. It compares the algorithmic and statistical perspectives of computer scientists and statisticians when analyzing big data. It describes three science applications that use linear algebra techniques like PCA, NMF and CX decompositions on large datasets. Experiments are presented comparing the performance of these techniques implemented in Spark and MPI on different HPC platforms. The results show Spark can be 4-26x slower than optimized MPI codes. Next steps proposed include developing Alchemist to interface Spark and MPI more efficiently and exploring communication-avoiding machine learning algorithms.
MPTStore: A Fast, Scalable, and Stable Resource Index, by Chris Wilper
MPTStore is a resource index for Fedora that uses a relational database to store RDF triples in a fast, scalable, and stable way. It maps predicates to tables to improve performance of queries and updates. Tests showed MPTStore had significantly faster datastream modifications than the existing Kowari triplestore and comparable query speeds. Its use of a relational database provided benefits like transaction support and ease of administration compared to other triplestores.
For a long time, relational database management systems have been the only solution for persistent data storage. However, with the phenomenal growth of data, this conventional way of storing it has become problematic.
To manage the exponentially growing data traffic, the largest information technology companies such as Google, Amazon and Yahoo have developed alternative solutions that store data in what have come to be known as NoSQL databases.
Some of the NoSQL features are flexible schema, horizontal scaling and no ACID support. NoSQL databases store and replicate data in distributed systems, often across datacenters, to achieve scalability and reliability.
The CAP theorem states that any networked shared-data system (e.g. NoSQL) can have at most two of three desirable properties:
• consistency (C) - equivalent to having a single up-to-date copy of the data
• availability (A) of that data (for reads and writes)
• tolerance to network partitions (P)
Because of this inherent tradeoff, it is necessary to sacrifice one of these properties. The general belief is that designers cannot sacrifice P and therefore have a difficult choice between C and A.
In this seminar two NoSQL databases are presented: Amazon's Dynamo, which sacrifices consistency and thereby achieves very high availability, and Google's BigTable, which guarantees strong consistency while providing only best-effort availability.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
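A small PySpark sketch of an operation that triggers a shuffle, together with two commonly tuned shuffle-related settings; the values are illustrative, not recommendations from the deck.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("shuffle-demo")
    # Number of partitions produced by shuffles in Spark SQL (illustrative value).
    .config("spark.sql.shuffle.partitions", "64")
    # Compress the shuffle files written by the shuffle writers.
    .config("spark.shuffle.compress", "true")
    .getOrCreate()
)

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
df = df.withColumn("bucket", df.user_id % 10)

# groupBy repartitions rows by key: data is serialized, compressed, written to
# shuffle files, and fetched by reducers over the block transfer service.
counts = df.groupBy("bucket").count()
counts.show()
```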
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive, by Sachin Aggarwal
We will give a detailed introduction to Apache Spark and explain why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed DataSet). One of the key reasons why Apache Spark is so different is the introduction of RDDs. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs, and in the second half we will have a deep dive into RDDs.
Efficient Query Processing in Geographic Web Search Engines, by Yen-Yu Chen
Geographic web search engines allow users to constrain and order search results in an intuitive manner by focusing a query on a particular geographic region. Geographic search technology, also called local search, has recently received significant interest from major search engine companies. Academic research in this area has focused primarily on techniques for extracting geographic knowledge from the web. In this paper, we study the problem of efficient query processing in scalable geographic search engines. Query processing is a major bottleneck in standard web search engines, and the main reason for the thousands of machines used by the major engines. Geographic search engine query processing is different in that it requires a combination of text and spatial data processing techniques. We propose several algorithms for efficient query processing in geographic search engines, integrate them into an existing web search query processor, and evaluate them on large sets of real data and query traces.
This document proposes and evaluates cooperative scans, a framework for improving disk bandwidth utilization and reducing latency for concurrent scan queries in a database management system (DBMS). It analyzes existing scheduling policies like normal, attach, and elevator scans and finds they do not fully exploit data sharing opportunities. The cooperative scans framework introduces a new relevance policy that dynamically schedules queries and their data requests based on the current system situation. Initial results on row storage show significant performance improvements over traditional policies. The document also discusses extending cooperative scans to column storage and related implementation challenges.
Denis Wilson is a principal consultant who works with DWP Information Architects Inc. and serves on the board of the Independent Association of Microsoft Channel Partners. The document discusses new features in Office 2016 and Windows 10 that enable improved collaboration, mobility, and productivity. It highlights integration with Skype, Cortana, Outlook, Delve, Excel, and other apps. The document encourages partners to get ready for the Office 2016 launch through Drumbeat and cloud roadshows to renew or upsell Office 365 subscriptions and reach new customers with innovative user experiences.
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey), by Carlos Castillo (ChaTo)
This document discusses the challenges of distributed information retrieval. It covers four main modules: crawling, indexing, query processing, and caching. For each module, it discusses issues like partitioning tasks across servers, dealing with failures, communication between servers, and external factors. It provides more details on partitioning approaches for crawling and indexing, as well as the benefits and challenges of document partitioning during indexing.
Office 365 Business is a cloud-based subscription service that provides modern workplace tools and services including the latest version of Office applications. It allows users to access files from any device with offline and online editing capabilities. Key benefits include collaboration features, 1TB of OneDrive storage, automatic updates, and security and compliance services. Plans start at $5 per user per month.
This document summarizes and compares two deep learning frameworks: Microsoft Cognitive Toolkit (CNTK) and PyTorch. CNTK is an open-source toolkit created by Microsoft for deep learning. It is production-ready and optimized for performance and scalability. It uses symbolic computational graphs and supports distributed training techniques like 1-bit SGD. PyTorch is a Python-based framework developed by Facebook for rapid prototyping. It uses dynamic computational graphs and supports automatic differentiation via its autograd package. Both frameworks express neural networks as compositions of basic building blocks and allow defining and training custom models.
This document discusses MapReduce and its suitability for processing large datasets across distributed systems. It describes challenges like node failures, network bottlenecks and the motivation for a simple programming model that can handle massive computations and datasets across thousands of machines. MapReduce provides a programming model using map and reduce functions that hides complexities of parallelization, fault tolerance and load balancing. It has been widely adopted for applications involving log analysis, indexing large datasets, iterative graph processing and more.
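The canonical example of the model is word count; below is a minimal single-machine simulation of the map and reduce functions (the real framework distributes the input splits, shuffles intermediate pairs by key, and handles failures across thousands of machines):

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in the input split."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one key."""
    return word, sum(counts)

def run_job(documents):
    # Shuffle phase: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(run_job(["the cat sat", "the cat ran", "a dog sat"]))
# -> {'the': 2, 'cat': 2, 'sat': 2, 'ran': 1, 'a': 1, 'dog': 1}
```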
This document describes a student's project to design an energy efficient cache memory using Verilog HDL. It presents the background and motivation, including that write-through caches consume a lot of energy due to increased access at lower levels. The proposed work introduces a way-tagged cache architecture to reduce energy consumption. Simulation results show energy savings of the proposed cache compared to a conventional cache for various operations like reset, data load, read and write hits/misses. The conclusion is that the way-tagged cache adds new components to L1 cache but remains inbuilt with the processor, with area overhead as a drawback.
KSQL Performance Tuning for Fun and Profit (Nick Dearden, Confluent) Kafka S..., by confluent
Ever wondered just how many CPU cores of KSQL Server you need to provision to handle your planned stream processing workload ? Or how many GBits of aggregate network bandwidth, spread across some number of processing threads, you'll need to deal with combined peak throughput of multiple queries ? In this talk we'll first explore the basic drivers of KSQL throughput and hardware requirements, building up to more advanced query plan analysis and capacity-planning techniques, and review some real-world testing results along the way. Finally we will recap how and what to monitor to know you got it right!
This document provides an overview of efficient query processing infrastructures for web search engines. It discusses how search engines use distributed architectures across many servers to efficiently process queries at large scale. It also describes how search engines employ various techniques like index compression, skipping, dynamic pruning, and learning to rank to efficiently evaluate queries while maintaining effectiveness. The goal is to provide concise yet relevant search results to users as fast as possible despite the massive scale of web data.
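To make the pruning idea concrete, here is a toy upper-bound skipping sketch in the spirit of MaxScore/WAND rather than the exact algorithms used by any particular engine; the postings and scores are invented.

```python
import heapq
from collections import defaultdict

# Toy inverted index: term -> {doc_id: score contribution}, plus per-term upper bounds.
postings = {
    "distributed": {1: 3.2, 4: 2.9, 7: 2.1, 9: 0.4},
    "search":      {1: 1.8, 2: 1.7, 7: 1.5, 8: 0.3},
    "engines":     {2: 0.9, 4: 0.8, 9: 0.2},
}
max_score = {t: max(p.values()) for t, p in postings.items()}

def pruned_topk(query, k=2):
    # Candidate documents and the query terms each of them contains.
    candidates = defaultdict(list)
    for term in query:
        for doc in postings[term]:
            candidates[doc].append(term)

    heap = []       # min-heap holding the current top-k (score, doc) pairs
    skipped = 0
    for doc, terms in candidates.items():
        upper = sum(max_score[t] for t in terms)        # best this doc could score
        if len(heap) == k and upper <= heap[0][0]:
            skipped += 1                                # safe skip: cannot enter the top-k
            continue
        score = sum(postings[t][doc] for t in terms)    # full evaluation
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), skipped

print(pruned_topk(["distributed", "search", "engines"]))  # -> ([(5.0, 1), (3.7, 4)], 2)
```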
Performance doesn’t have the same definition for system administrators, developers and business teams. What is performance? High CPU usage, a web site that doesn’t scale, a low business transaction rate per second, slow response times, … This presentation is about maths, code performance, load testing, web performance, best practices, … Working on performance optimization is a very broad topic. It’s important to really understand the main concepts and to have a clean and strong methodology, because it can be a very time-consuming activity. Happy reading!
This document provides a summary of practical machine learning on big data platforms. It begins with an introduction and agenda, then provides a quick brief on the machine learning process. It discusses the current landscape of open source tools, including evolutionary drivers and examples. It covers case studies from Twitter and their experience. Finally, it discusses architectural forces like Moore's Law and Kryder's Law that are shaping the field. The document aims to present a unified approach for machine learning on big data platforms and discuss how industry leaders are implementing these techniques.
The document discusses the ISO-OSI 7-layer reference model and related IEEE standards. It covers the purpose and functions of each layer, including the physical, data link, network, transport, session, presentation and application layers. It also describes how data is formatted and encapsulated as it passes through each layer. Finally, it discusses the IEEE 802 standards group and some of the key standards they developed that apply to networking, particularly at the data link and physical layers.
The document discusses parallel algorithms and their analysis. It introduces a simple parallel algorithm for adding n numbers using log n steps. Parallel algorithms are analyzed based on their time complexity, processor complexity, and work complexity. For adding n numbers in parallel, the time complexity is O(log n), processor complexity is O(n), and work complexity is O(n log n). The document also discusses models of parallel computation like PRAM and designs of parallel architectures like meshes and hypercubes.
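A sequential Python simulation of that tree-based addition, where each round stands in for one parallel step performed by up to n/2 processors:

```python
def parallel_sum(values):
    """Tree reduction: each round halves the number of values, so a list of n
    numbers needs ceil(log2(n)) rounds of pairwise additions."""
    rounds = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # odd element carried to the next round
            pairs.append(values[-1])
        values = pairs
        rounds += 1
    return values[0], rounds

total, rounds = parallel_sum(list(range(1, 17)))
print(total, rounds)   # 136 in 4 rounds, versus 15 strictly sequential additions
```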
The document discusses parallel computing platforms and techniques for hiding memory latency. It covers the following key points:
1) Implicit parallelism in microprocessors has increased through pipelining and superscalar execution, but memory latency remains a bottleneck. Caches help reduce effective latency by exploiting data locality.
2) Multithreading and prefetching are approaches to hide memory latency by keeping the processor occupied while waiting for data, but they increase bandwidth demands and hardware costs.
3) Different applications utilize different types of parallelism, like data-level parallelism for throughput or task-level parallelism for aggregate performance. Understanding performance bottlenecks is important for parallelization.
The document discusses CERN's use of Oracle's In-Memory Column Store to perform real-time analysis of physics experiment data from the Large Hadron Collider. Benchmark tests showed significant performance improvements over traditional row-based storage, with analytic queries running 10-100x faster. The columnar format also improved data compression rates. Additionally, OLTP workloads saw no negative impacts. CERN plans to consider the technology for future projects given its ability to enable real-time analysis that was previously not possible.
In this talk we will review the factors that drive the capacity requirements: volume of queries, access patterns, indexing, working set size, among others. View the slides with video recording: www.mongodb.com/presentations/hardware-provisioning-mongodb
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with..., by Spark Summit
Real-time analytics over large datasets has become an increasingly widespread demand. Over the past several years the Hadoop ecosystem has been continuously evolving: even complex queries over large datasets can be answered interactively with distributed processing frameworks like Apache Spark, and new paradigms of efficient storage were introduced as well, such as Apache Parquet and ORC, which provide fast scans over columnar data formats, and Apache HBase, which offers fast ingest and millisecond-scale random access.
In this talk, we will outline Apache Carbondata, a new addition to the open source Hadoop ecosystem: an indexed columnar file format aimed at bridging the gap to fully enable real-time analytics. It has been deeply integrated with Spark SQL and enables dramatic acceleration of query processing by leveraging efficient encoding/compression and effective predicate push-down through Carbondata’s multi-level index technique.
In this video from SC17 in Denver, Dan Reed moderates a panel discussion on HPC Software for Energy Efficiency.
"We have already achieved major gains in energy-efficiency for both the datacenter and HPC equipment. For example, the PUE of the Swiss Supercomputer (CSCS) datacenter prior to 2012 was 1.8, but the current PUE is about 1.25; a factor of ~1.5 improvement. HPC system improvements have also been very strong, as evidenced by FLOPS/Watt performance on the Green500 List. While we have seen gains from data center and HPC system efficiency, there are also energy-efficiency gains to be had from software- application performance improvements, for example. This panel will explore what HPC software capabilities were most helpful over the past years in improving HPC system energy efficiency? It will then look forward; asking in what layers of the software stack should a priority be put on introducing energy-awareness; e.g., runtime, scheduling, applications? What is needed moving forward? Who is responsible for that forward momentum?"
Watch the video: https://wp.me/p3RLHQ-hHQ
Learn more: https://sc17.supercomputing.org/presentation/?id=pan103&sess=sess245
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Application Profiling at the HPCAC High Performance Center, by inside-BigData.com
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"To achieve good scalability performance on the HPC scientific applications typically involves good understanding of the workload though performing profile analysis, and comparing behaviors of using different hardware which pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of the tuning to improve on the application performance from tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses instruction pipelining and main memory. It begins by explaining how an instruction pipeline works, overlapping the fetch, decode, and execute phases of instruction processing. It notes some difficulties in pipelining including resource conflicts, data dependencies, and branch instructions. It then discusses pipeline control and performance, noting that pipelining provides faster processing by decomposing tasks into sequential sub-operations that can overlap. It concludes by answering questions about pipelining hazards and calculating pipeline metrics for example processors.
Enhancing Performance with Globus and the Science DMZ, by Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Recently, massively parallel processing relational database systems (MPPDBs) have gained much momentum in the big data analytics market. With the advent of hosted cloud computing, we envision that the offering of MPPDB-as-a-Service (MPPDBaaS) will become attractive for companies that have analytical tasks over only hundreds of gigabytes to a few tens of terabytes of data, because they can enjoy high-end parallel analytics at low cost. This paper presents Thrifty, a prototype implementation of MPPDB-as-a-Service. The major research issue is how to achieve a lower total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, with a performance SLA that guarantees the tenants can obtain their query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal with a tenant-driven design that includes (1) a cluster design that carefully arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement that assigns each tenant to several MPPDBs (for high-availability service through replication), and (3) a query routing algorithm that routes a tenant’s query to the proper MPPDB at run time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests a 2- to 32-node MPPDB to query against 200GB to 3.2TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.
Similar to Efficient Query Processing in Distributed Search Engines (20)
This talk is a quick introduction to counting sketches and HyperLogLog (HLL) in particular. HLL is a probabilistic data structure that can be used for counting the number of distinct elements (cardinality) in sub-linear space. With just 2 KB memory footprint it can approximate count for millions of distinct items with an error below 2%. This has a range of applications in batch, stream, and distributed processing, most importantly reducing the amount of data we have to store or transmit over the wire, but also several pitfalls, for example when it comes to computing an intersection between two sets. In this talk, I will explain the main idea and some of the applications, show code and benchmark examples from my previous work, and provide further references for those who want to learn more.
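A compact, illustrative HyperLogLog in pure Python: 2^11 = 2048 registers, which packed at a few bits each is in the ballpark of the 2 KB mentioned above. The bias constant applies for large register counts, and a production implementation would add small- and large-range corrections.

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog: m registers each keep the maximum number of leading
    zero bits (plus one) observed in the hashed values routed to them."""

    def __init__(self, p=11):                 # 2^11 = 2048 registers
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x & (self.m - 1)                # low p bits pick the register
        rest = x >> self.p                    # remaining bits feed the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias correction for large m
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        return int(est)

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(f"user-{i}")
print(hll.count())    # close to 1,000,000, typically within a few percent
```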
Large-Scale Real-Time Data Management for Engagement and Monetization, by Simon Lia-Jonassen
Invited talk at the Workshop on Large-Scale and Distributed Systems for Information Retrieval 2015.
Cxense helps companies understand their audience and build great online experiences. Cxense Insight and DMP let customers annotate, filter, segment and target their users based on the consumed content and performed actions in real time. With more than 5000 active websites, Insight alone tracks more than a billion unique users with more than 15 billion page views per month. To leverage the huge amounts of data in real time, we have built a large distributed system relying on techniques familiar from databases, information retrieval and data mining. In this talk, we outline our solutions and give some insight into the technology we use and the challenges we face. This introduction should be interesting to undergraduate and PhD students as well as experienced researchers and engineers.
Abstract: Cxense Insight helps companies to understand their audience and build great online experiences. Our interactive UI and APIs help customers to annotate, filter, segment and target their users based on the visited content and actions in real time. Today we already track more than half a billion unique user identities across more than 5000 websites, contributing to more than 10 billion analytics events on a monthly basis.
To leverage these amounts of data in real time, we built a large distributed system relying on concepts familiar from databases, information retrieval and data mining. The first part of this talk will therefore give an insight into the challenges, the architecture and the techniques we have used, while the second part will briefly demonstrate our UI and APIs in action. We hope that both parts will be interesting for undergraduate students taking IR/DB courses as well as PhD students, experienced researchers and staff.
Spark is a framework for efficient parallel data processing. It uses resilient distributed datasets (RDDs) that can be operated on in parallel, cached in memory, and recomputed when needed. The core of Spark provides functions for data sharing and basic operations like filtering, mapping, and reducing RDDs. Additional Spark modules provide capabilities for SQL, streaming, machine learning, and graph processing.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Monitoring and Managing Anomaly Detection on OpenShift.pdf, by Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts in order to save money. There are also some setups that can lead to unnecessary spending, for example when a person document is used instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices you can apply immediately
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Efficient Query Processing in Distributed Search Engines
1. Efficient Query Processing in
Distributed Search Engines
Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology
2. Outline
• Introduction
– Motivation
– Research questions
– Research process
• Part I – Partitioned query processing
• Part II – Skipping and pruning
• Part III – Caching
• Conclusions
4. Motivation
• Web search: rapid information growth and
tight performance constraints.
• Search is used in most Web applications.
5. Example: Google
• Estimated Web index size in 2012: 35-50 billion pages.
• More than a billion visitors in 2012.
• 36 data centers in 2008.
• Infrastructure cost in 4Q 2011: ~ 951 million USD.
techcrunch.com
6. Research questions
• What are the main challenges of existing methods for
efficient query processing in distributed search engines,
and how can these be resolved?
• How to improve the efficiency of:
Ø partitioned query processing
Ø query processing and pruning
Ø caching
7. Research process
• Iterative, exploratory research cycle:
study/observation → problem/hypothesis → solution proposal → quantitative evaluation → publication of results.
8. Outline
• Introduction
• Part I – Partitioned query processing
– Introduction
– Motivation
– Combined semi-pipelined query processing
– Efficient compressed skipping for disjunctive text-queries
– Pipelined query processing with skipping
– Intra-query concurrent pipelined processing
– Further work
• Part II – Skipping and pruning
• Part III – Caching
• Conclusions
10. Part I: Introduction.
Document-wise partitioning
• Each query is processed by all of the nodes in parallel.
• One of the nodes merges the results.
• Advantages:
– Simple, fast and scalable.
– Intra-query concurrency.
• Challenges:
– Each query is processed by
all of the nodes.
– High number of disk seeks.
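To make the document-wise partitioned flow concrete, here is a minimal Python sketch (a toy illustration with a made-up scoring function, not the thesis implementation): every node holds its own disjoint document subset and evaluates the full query locally, and a broker merges the per-node top-k lists.

```python
import heapq
from collections import defaultdict

class DocPartitionedNode:
    """Holds a disjoint subset of the documents and a local inverted index."""

    def __init__(self, docs):
        # docs: {doc_id: text}
        self.index = defaultdict(dict)              # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def top_k(self, query_terms, k):
        """Evaluate the whole query locally and return the local top-k."""
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf                # toy scoring: sum of term frequencies
        return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

def broker_search(nodes, query_terms, k):
    """Every node processes the query (in parallel in a real system); the broker merges."""
    partials = [node.top_k(query_terms, k) for node in nodes]
    merged = [item for partial in partials for item in partial]
    return heapq.nlargest(k, merged, key=lambda x: x[1])

if __name__ == "__main__":
    nodes = [
        DocPartitionedNode({1: "web search engines", 2: "distributed query processing"}),
        DocPartitionedNode({3: "query processing in search engines", 4: "index compression"}),
    ]
    print(broker_search(nodes, ["query", "processing"], k=3))
```

The sketch also shows the drawback listed above: every node touches every query, so per-query work and disk seeks grow with the number of nodes.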
11. Part I: Introduction.
Term-wise partitioning
• Each node fetches posting lists and sends them to one of
the nodes, which receives and processes all of them.
• Advantages:
– A query involves only a few nodes.
– Smaller number of disk seeks.
– Inter-query concurrency.
• Challenges:
– A single node does all processing.
– Other nodes act just like advanced
network disks.
– High network load.
– Load balancing.
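By contrast, a term-wise partitioned lookup only contacts the nodes that own the query terms, but all scoring lands on a single receiving node. A minimal, hedged Python sketch (toy data, hypothetical term-to-node assignment):

```python
import heapq
from collections import defaultdict

# Toy term-wise partitioned index: each node stores the complete posting
# lists for a subset of the terms (term -> {doc_id: term frequency}).
NODE_INDEXES = [
    {"query": {1: 2, 3: 1}, "index": {4: 1}},
    {"processing": {1: 1, 2: 3}, "search": {3: 2}},
]
TERM_TO_NODE = {term: n for n, idx in enumerate(NODE_INDEXES) for term in idx}

def process_query(query_terms, k):
    """The receiving node gathers the shipped posting lists and does all the scoring."""
    accumulators = defaultdict(float)
    for term in query_terms:
        node = TERM_TO_NODE.get(term)            # only this node is contacted
        if node is None:
            continue
        for doc_id, tf in NODE_INDEXES[node][term].items():
            accumulators[doc_id] += tf           # toy scoring: sum of term frequencies
    return heapq.nlargest(k, accumulators.items(), key=lambda x: x[1])

print(process_query(["query", "processing"], k=2))
```

This illustrates why only a few nodes are involved per query, while one node carries all the processing and the complete posting lists cross the network.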
12. Part I: Introduction.
Pipelined query processing [Moffat et al. 2007]
• A query bundle is routed from one node to the next.
• Each node contributes with its own postings.
The last node extracts the top results.
• The maximum accumulator set size
is constrained. [Lester et al. 2005]
• Advantages:
– The work is divided between the nodes.
– Reduced overhead on the last node.
– Reduced network load.
• Challenges:
– Long query latency.
– Load balancing.
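A rough Python sketch of the pipelined idea (toy data; the pruning rule here simply caps the accumulator set at a fixed size, which is only a crude stand-in for the space-limited pruning of Lester et al. 2005):

```python
import heapq
from collections import defaultdict

# Each node owns the posting lists for some of the query terms
# (term -> {doc_id: score contribution}); the values are illustrative.
NODE_INDEXES = [
    {"query": {1: 0.4, 2: 0.1, 5: 0.3}},
    {"processing": {1: 0.5, 3: 0.2, 5: 0.1}, "engines": {2: 0.2, 4: 0.6}},
]

ACC_LIMIT = 3   # target maximum accumulator set size

def process_on_node(index, bundle):
    """Add this node's postings to the bundle's accumulators, then prune."""
    accumulators, terms = bundle
    for term in terms:
        for doc_id, score in index.get(term, {}).items():
            accumulators[doc_id] += score
    if len(accumulators) > ACC_LIMIT:
        # crude pruning: keep only the ACC_LIMIT highest partial scores
        kept = heapq.nlargest(ACC_LIMIT, accumulators.items(), key=lambda x: x[1])
        accumulators = defaultdict(float, kept)
    return accumulators, terms

def pipelined_query(terms, k):
    bundle = (defaultdict(float), terms)
    for index in NODE_INDEXES:          # the query bundle is routed node to node
        bundle = process_on_node(index, bundle)
    accumulators, _ = bundle            # the last node extracts the top-k
    return heapq.nlargest(k, accumulators.items(), key=lambda x: x[1])

print(pipelined_query(["query", "processing", "engines"], k=2))
```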
13. Part I: Motivation.
Pipelined query processing (TP/PL)
• Several publications find TP/PL advantageous under
certain conditions.
– Expensive disk access, fast network, efficient pruning, high
multiprogramming level (query load), small main memory, etc.
• TP/PL has several advantages.
– Number of involved nodes, inter-query concurrency, etc.
• TP/PL may outperform DP in peak throughput.
– Requires high multiprogramming level and results in long query latency.
• Practical results are more pessimistic.
Ø How can we improve the method?
• The main challenges of TP/PL [Büttcher et al. 2010]:
– Scalability, load balancing, term/node-at-a-time processing
and lack of intra-query concurrency.
14. Part I: Combined semi-pipelined query processing.
Motivation
• Pipelined – higher throughput, but longer latency.
• Non-pipelined – shorter latency, but lower throughput.
• We wanted to combine the advantage of both methods:
short latency AND high throughput.
16. Part I: Combined semi-pipelined query processing.
Contributions and evaluation
ü Decision heuristic and alternative routing strategy.
ü Evaluated using a modified version of Terrier v2.2
(http://terrier.org):
Ø 426GB TREC GOV2 Corpus – 25 million documents.
Ø 20 000 queries from the TREC Terabyte Track Efficiency Topics
2005 (first 10 000 queries were used as a warm-up).
Ø 8 nodes with 2x2.0GHz Intel Quad-Core, 8GB RAM and a SATA
hard-disk on each node. Gigabit network.
18. Part I: Combined semi-pipelined query processing.
Post-note and motivation for other papers
• “Inefficient” compression methods:
– Gamma for doc ID gaps and Unary for frequencies.
• Posting lists have to be fetched and decompressed completely and remain in main memory until the sub-query has been fully processed.
• Can we do this better?
– Modern “super-fast” index compression.
• NewPFoR [Yan, Ding and Suel 2009].
– Skipping.
– Partial evaluation (pruning).
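For reference, a minimal sketch of the baseline coding of doc-ID gaps with Elias gamma (bit strings are used for readability): each value is decoded bit by bit, which is what makes such codes slow compared with block-oriented codecs like NewPFoR.

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: (len-1) zero bits, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes, one value at a time."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the leading zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def encode_doc_ids(doc_ids):
    """Store gaps between sorted doc IDs instead of the IDs themselves."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(g) for g in gaps)

def decode_doc_ids(bits):
    doc_ids, current = [], 0
    for gap in gamma_decode_all(bits):
        current += gap
        doc_ids.append(current)
    return doc_ids

doc_ids = [3, 7, 8, 20, 21]
bits = encode_doc_ids(doc_ids)
assert decode_doc_ids(bits) == doc_ids
print(bits, decode_doc_ids(bits))
```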
19. Part I: Efficient compressed skipping for disjunctive text-queries.
Contributions
ü A self-skipping inverted index designed specifically for
NewPFoR compression.
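A simplified sketch of the self-skipping idea (plain integer blocks stand in for the NewPFoR-compressed blocks of the real index): the last doc ID of each block acts as a skip entry, so skipping to a target doc ID can pass over whole blocks without decoding them.

```python
import bisect

BLOCK_SIZE = 4   # postings per block (real indexes use e.g. 128)

class SkippingPostingList:
    """Posting list split into fixed-size blocks with one skip entry per block."""

    def __init__(self, doc_ids):
        self.blocks = [doc_ids[i:i + BLOCK_SIZE]
                       for i in range(0, len(doc_ids), BLOCK_SIZE)]
        self.skip = [block[-1] for block in self.blocks]   # last doc ID per block
        self.block_no = 0

    def skip_to(self, target):
        """Return the first doc ID >= target, decoding only the block that is needed."""
        # jump over whole blocks using the skip entries
        self.block_no = bisect.bisect_left(self.skip, target, self.block_no)
        if self.block_no == len(self.blocks):
            return None                                     # list exhausted
        block = self.blocks[self.block_no]                  # "decode" this block only
        return block[bisect.bisect_left(block, target)]

plist = SkippingPostingList([2, 5, 9, 14, 17, 21, 30, 44, 57])
print(plist.skip_to(15))   # 17 - the first block is skipped entirely
print(plist.skip_to(45))   # 57
print(plist.skip_to(99))   # None
```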
20. Part I: Efficient compressed skipping for disjunctive text-queries.
Contributions
ü Query processing methods for disjunctive (OR) queries:
Ø A skipping version of the space-limited
pruning method by Lester et al. 2007.
Ø A description of Max-Score that combines
the advantages of the descriptions by Turtle
and Flood 1995 and Strohman, Turtle and
Croft 2005.
Ø We also explained how to apply skipping and
presented updated performance results.
[Figure: query latency for conjunctive (AND) and disjunctive (OR) processing.]
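A condensed document-at-a-time Max-Score sketch in Python (toy postings and scores, not the paper's implementation): terms are ordered by their score upper bounds; once the current top-k threshold exceeds the summed bounds of the lowest-scoring terms, those terms stop generating candidates and are only probed, from the largest bound downwards, for documents found via the remaining terms. In the real index this probing is done with skipping.

```python
import heapq

# Toy postings: term -> {doc_id: score}; a term's upper bound is its maximum score.
POSTINGS = {
    "search": {1: 0.1, 4: 0.5, 9: 0.2},
    "engine": {2: 0.6, 4: 0.4, 7: 0.3},
    "query":  {1: 0.7, 7: 0.2, 9: 0.6},
}

def max_score_or(query_terms, k):
    terms = sorted(query_terms, key=lambda t: max(POSTINGS[t].values()))
    ubs = [max(POSTINGS[t].values()) for t in terms]
    prefix = [0.0]
    for ub in ubs:
        prefix.append(prefix[-1] + ub)           # prefix[i] = sum of ubs[0..i-1]

    heap = []                                    # current top-k (min-heap of scores)
    threshold = 0.0

    def first_required():
        """Index of the first term that must still generate candidates."""
        i = 0
        while i < len(terms) - 1 and prefix[i + 1] <= threshold:
            i += 1
        return i

    seen = set()
    while True:
        req = first_required()
        candidates = {d for t in terms[req:] for d in POSTINGS[t] if d not in seen}
        if not candidates:
            break
        doc = min(candidates)                    # document-at-a-time order
        seen.add(doc)
        score = sum(POSTINGS[t].get(doc, 0.0) for t in terms[req:])
        # probe the non-required terms from the largest upper bound downwards
        for i in range(req - 1, -1, -1):
            if score + prefix[i + 1] <= threshold:
                break                            # cannot reach the top-k any more
            score += POSTINGS[terms[i]].get(doc, 0.0)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
        if len(heap) == k:
            threshold = heap[0][0]
    return sorted(heap, reverse=True)

print(max_score_or(["search", "engine", "query"], k=2))
```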
21. Part I: Efficient compressed skipping for disjunctive text-queries.
Evaluation
ü Evaluated with a real implementation:
Ø TREC GOV2.
• 15.4 million terms, 25.2 million docs and 4.7 billion postings.
• Index w/o skipping is 5.977GB. Skipping adds 87.1MB.
Ø TREC Terabyte Track Efficiency Topics 2005.
• 10 000 queries with 2+ query terms.
Ø Workstation: 2.66GHz Intel Quad-Core 2, 8GB RAM,
1TB 7200RPM SATA2. 16KB blocks by default.
22. Part I: Efficient compressed skipping for disjunctive text-queries.
Main results
[Figure: main results for the compared configurations (reported values: 348, 260, 93.6, 51.8, 44.4, 283, 90.7, 120, 47.8 and 42.2).]
23. Part I: Pipelined query processing with skipping.
Contributions
ü Pipelined query processing with skipping.
ü Pipelined Max-Score with DAAT sub-query processing.
ü Posting list assignment by
maximum scores.
24. Part I: Pipelined query processing with skipping.
Evaluation
• Evaluated with a real implementation:
Ø Same index design as in the last paper, but now extended to a
distributed system.
Ø TREC GOV2 and Terabyte Track Efficiency Topics 2005/2006:
• 5 000 queries – warm-up, 15 000 – evaluation.
Ø 9 node cluster – same as in the first paper, plus a broker.
26. Part I: Intra-query concurrent pipelined processing.
Motivation
• At low query loads there is available CPU/network
capacity.
• In this case, the performance
can be improved by intra-query
concurrency.
• How can we provide intra-query
concurrency with TP/PL?
27. Part I: Intra-query concurrent pipelined processing.
Contributions
ü Parallel sub-query execution on different nodes.
Ø Fragment-based processing:
Ø Each fragment covers a
document ID range.
Ø The number of fragments in
each query depends on the
estimated processing cost.
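A tiny, hypothetical illustration of the fragment idea (the constant and the cost model are made up): the doc-ID space is split into ranges, and the number of ranges grows with the estimated processing cost, so that expensive queries expose more intra-query parallelism.

```python
COST_PER_FRAGMENT = 1000   # hypothetical target cost per fragment

def make_fragments(max_doc_id, estimated_cost):
    """Split the doc-ID space [0, max_doc_id) into cost-dependent ranges."""
    n_fragments = max(1, round(estimated_cost / COST_PER_FRAGMENT))
    step = -(-max_doc_id // n_fragments)       # ceiling division
    return [(lo, min(lo + step, max_doc_id)) for lo in range(0, max_doc_id, step)]

# An expensive query gets more fragments, a cheap one fewer.
print(make_fragments(max_doc_id=25_000_000, estimated_cost=3500))  # 4 ranges
print(make_fragments(max_doc_id=25_000_000, estimated_cost=800))   # 1 range
```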
28. Part I: Intra-query concurrent pipelined processing.
Contributions
ü Concurrent sub-query execution on the same node.
Ø When the query load is low, different
fragments of the same query can be
processed concurrently.
Ø Concurrency depends on the number
of fragments in the query and the load
on the node.
29. Part I: Intra-query concurrent pipelined processing.
Evaluation and main results
ü Evaluated with TREC GOV2, TREC Terabyte Track Efficiency
Topics 2005 (2 000 warm-up and 8 000 test queries).
30. Part I.
Further work
• Evaluation of scalability:
– So far: 8 nodes and 25 million documents.
• Further extension of the methods:
– Dynamic load balancing.
– Caching.
• Open questions:
– Term- vs document-wise partitioning.
– Document- vs impact-ordered indexes.
31. Outline
• Introduction
• Part I – Partitioned query processing
• Part II – Skipping and pruning
– Contributions presented within Part I
– Improving pruning efficiency via linear programming
• Part III – Caching
• Conclusions
32. Part II.
Contributions presented within Part I
ü Efficient processing of disjunctive (OR) queries
(monolithic index).
ü Pipelined query processing with skipping
(term-wise partitioned index).
33. Part II: Improving dynamic index pruning via linear programming.
Motivation and contributions
• Pruning methods s.a. Max-Score and WAND rely
on tight score upper-bound estimates.
• The upper-bound for a combination of several
terms is smaller than the sum of the individual
upper-bounds.
• Terms and their combinations repeat.
ü LP approach to improving the performance of
Max-Score/WAND.
Ø Relies on LP estimation of score bounds for query
subsets.
Ø In particular, uses upper bounds of single terms and
the most frequent term-pairs.
[Example: score upper bounds for the terms "anna", "banana" and "nana" and their pairs (values 0.5, 0.6, 0.7 and 0.6, 1.0, 1.3).]
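A minimal sketch of the LP idea using scipy.optimize.linprog, assuming (as an illustration only) that the example values above are the single-term bounds (0.5, 0.6, 0.7) and the pairwise bounds (0.6, 1.0, 1.3): the bound for a term subset is the largest score sum consistent with both kinds of constraints, which can be noticeably tighter than simply adding the single-term bounds.

```python
from scipy.optimize import linprog

def lp_upper_bound(terms, single_ub, pair_ub):
    """Tightest sum-of-scores bound consistent with single-term and pair bounds."""
    n = len(terms)
    c = [-1.0] * n                            # maximize sum(s_t) <=> minimize -sum(s_t)
    a_ub, b_ub = [], []
    for (t1, t2), bound in pair_ub.items():
        row = [0.0] * n
        row[terms.index(t1)] = 1.0
        row[terms.index(t2)] = 1.0
        a_ub.append(row)                      # s_t1 + s_t2 <= pair bound
        b_ub.append(bound)
    bounds = [(0.0, single_ub[t]) for t in terms]   # 0 <= s_t <= single-term bound
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun

# Hypothetical assignment of the example values to single and pair bounds.
single_ub = {"anna": 0.5, "banana": 0.6, "nana": 0.7}
pair_ub = {("anna", "banana"): 0.6, ("anna", "nana"): 1.0, ("banana", "nana"): 1.3}
print(lp_upper_bound(["anna", "banana", "nana"], single_ub, pair_ub))
```

With these illustrative numbers the plain sum of single-term bounds is 1.8, while the LP bound is 1.3.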
34. Part II: Improving dynamic index pruning via linear programming.
Evaluation and extended results
ü Evaluated with a real implementation.
Ø 20 million Web pages, 100 000 queries from Yahoo, Max-Score.
ü The results indicate significant I/O, decompression and
computational savings for queries of 3 to 6 terms.
LP solver latency:
|q|   AND (ms)   OR (ms)
3     0.002      0.002
4     0.01       0.016
5     0.082      0.136
6     0.928      1.483
7     14.794     22.676
8     300.728    437.005

Latency improvements (k=10):
AND:
|q|   T (ms)   TI        TI**
3     10.759   1.59%     1.64%
4     10.589   1.30%     1.43%
5     9.085    0.09%     0.97%
6     6.256    -14.86%   -0.95%
OR:
|q|   T (ms)   TI        TI**
3     18.448   3.74%     3.78%
4     24.590   6.85%     6.93%
5     33.148   7.85%     8.24%
6     42.766   6.04%     9.08%

Query length distribution:
|q|   %
1     14.91%
2     36.80%
3     29.75%
4     12.35%
5     4.08%
6     1.30%
7     0.42%
8+    0.39%

** 10x faster LP solver.
35. Part II: Improving dynamic index pruning via linear programming.
Further work
• Larger document collection.
• Other types of queries.
• Reuse of previously computed scores.
• Application to pipelined query processing.
36. Outline
• Introduction
• Part I – Partitioned query processing
• Part II – Skipping and pruning
• Part III – Caching
– Modeling static caching in Web search engines
– Prefetching query results and its impact on search engines
• Conclusions
38. Part III: Modeling static caching in Web search engines.
Contributions, evaluation and further work
ü A modeling approach that makes it possible to find a nearly optimal split in a static two-level cache.
Ø Requires estimation of only six parameters.
ü Evaluated with a cache-simulator:
Ø 37.9 million documents from UK domain (2006)
Ø 80.7 million queries (April 2006 - April 2007)
• Further work:
– Larger query log and document collection.
– Improved model.
– Other types of cache.
39. Part III: Prefetching query results and its impact on search engines.
Background and related work
• Result caching:
– Reuse of previously computed results to answer future queries.
• Our data: About 60% of queries repeat.
– Limited vs infinite-capacity caches.
• Limited: Baeza-Yates et al. 2007, Gan and Suel 2009.
• Infinite: Cambazoglu et al. 2010.
– Cache staleness problem.
• Results become stale after a while.
• Time-to-live (TTL) invalidation.
– Cambazoglu et al. 2010, Alici et al. 2012.
• Result freshness.
– Cambazoglu et al. 2010.
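A minimal sketch of TTL-based result-cache invalidation (hypothetical names; the backend call stands in for the search cluster): entries older than the TTL are treated as misses and recomputed, which bounds staleness at the cost of extra backend traffic.

```python
import time

class TTLResultCache:
    """Result cache whose entries expire a fixed time after being stored."""

    def __init__(self, ttl_seconds, backend):
        self.ttl = ttl_seconds
        self.backend = backend            # function: query string -> results
        self.entries = {}                 # query -> (results, timestamp)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is not None and time.time() - hit[1] < self.ttl:
            return hit[0]                 # fresh hit
        results = self.backend(query)     # miss or stale entry: recompute
        self.entries[query] = (results, time.time())
        return results

# Hypothetical backend call standing in for the search cluster.
cache = TTLResultCache(ttl_seconds=3600, backend=lambda q: ["result for " + q])
print(cache.get("distributed search"))   # miss -> backend
print(cache.get("distributed search"))   # fresh hit
```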
40. Part III: Prefetching query results and its impact on search engines.
Motivation
[Figure: backend query traffic (queries/sec) by hour of the day for TTL = 1 hour, 4 hours, 16 hours and infinite, plotted against the peak query throughput sustainable by the backend; annotations mark the opportunity for prefetching, the risk of overflow and the potential for load reduction.]
41. Part III: Prefetching query results and its impact on search engines.
Contributions
ü Three-stage approach:
1. Selection – Which query results to prefetch?
2. Ordering – Which query results to prefetch first?
3. Scheduling – When to prefetch?
ü Two selection and ordering strategies:
– Offline – queries issued 24 hours before.
– Online – machine learned prediction of the next arrival.
[Diagram: the user, the search engine frontend with the result cache and the query prefetching module, and the backend search cluster.]
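A minimal sketch of the three stages with deliberately simple stand-in heuristics (the thesis uses, among others, a machine-learned prediction of the next arrival for the online strategy): select cached results that expire within the next window, order them by how likely they are to be requested again, and schedule refreshes only while the backend has idle capacity.

```python
def select(cache_entries, now, window):
    """Stage 1: pick queries whose cached results expire within the window."""
    return [q for q, e in cache_entries.items() if now <= e["expires"] <= now + window]

def order(candidates, past_frequency):
    """Stage 2: refresh the results most likely to be asked again first (toy heuristic)."""
    return sorted(candidates, key=lambda q: past_frequency.get(q, 0), reverse=True)

def schedule(ordered, backend_load, capacity):
    """Stage 3: issue prefetches only when the backend has idle capacity."""
    idle_slots = max(0, capacity - backend_load)
    return ordered[:idle_slots]

cache_entries = {
    "weather": {"expires": 105},
    "news":    {"expires": 130},
    "python":  {"expires": 500},
}
past_frequency = {"news": 40, "weather": 15, "python": 3}

candidates = select(cache_entries, now=100, window=50)        # ['weather', 'news']
to_prefetch = schedule(order(candidates, past_frequency),
                       backend_load=8, capacity=9)            # ['news']
print(to_prefetch)
```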
42. Part III: Prefetching query results and its impact on search engines.
Evaluation and main results
ü Evaluated with an event-driven search engine simulator and
Web queries sampled from Yahoo! logs.
Ø Millions of queries, billions of documents, thousands of nodes.
Ø Methods: no-ttl, no-prefetch, age-dg, online, offline.
ü The results show that prefetching can improve the hit rate,
query response time, result freshness and query degradation
rate relative to the state-of-the-art baseline.
ü The key aspect is the prediction methodology.
43. Average query response time (1020 nodes). [Figure: average query response time (ms) by hour of the day for no-prefetch, no-ttl, on-agedg (R=T/4) and oracle-agedg (R=T/4).]
44. Average degradation rate (850 nodes). [Figure: average degradation rate (×10^-4) by hour of the day for no-prefetch, no-ttl, on-agedg (R=T/4) and oracle-agedg (R=T/4).]
45. Part III: Prefetching query results and its impact on search engines.
Further work
• Improved prediction model.
• Speculative prefetching.
• Economic evaluation.
46. Outline
• Introduction
• Part I – Partitioned query processing
• Part II – Skipping and pruning
• Part III – Caching
• Conclusions
47. Conclusions
• We have contributed to resolving some of the most
important challenges in distributed query processing.
• Our results indicate significant improvements over
the state-of-the-art methods.
• Practical applicability and implications, however, need
further evaluation in real-life settings.
• There are several opportunities for future work.
48. Key references
• Alici et al.: “Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines”. In Proc. ECIR 2012, pp. 401-412.
• Baeza-Yates et al.: “The impact of caching on search engines”. In Proc. SIGIR 2007, pp. 183-190.
• Büttcher, Clarke and Cormack: “Information Retrieval – Implementing and Evaluating Search Engines”, 2010.
• Cambazoglu et al.: “A refreshing perspective of search engine caching”. In Proc. WWW 2010, pp. 181-190.
• Gan and Suel: “Improved techniques for result caching in web search engines”. In Proc. WWW 2009, pp. 431-440.
• Lester et al.: “Space-Limited Ranked Query Evaluation Using Adaptive Pruning”. In Proc. WISE 2005, pp. 470-477.
• Moffat et al.: “A pipelined architecture for distributed text query evaluation”. Inf. Retr. 10(3), 2007, pp. 205-231.
• Strohman, Turtle and Croft: “Optimization strategies for complex queries”. In Proc. SIGIR 2005, pp. 175-182.
• Turtle and Flood: “Query Evaluation: Strategies and Optimizations”. Inf. Process. Manage. 31(6), 1995, pp. 838-849.
• Yan, Ding and Suel: “Inverted index compression and query processing with optimized document ordering”. In Proc. WWW 2009, pp. 401-410.
49. Publications
1. Baeza-Yates and Jonassen: “Modeling Static Caching in Web Search Engines”. In Proc. ECIR 2012, pp. 436-446.
2. Jonassen and Bratsberg: “Improving the Performance of Pipelined Query Processing with Skipping”. In Proc. WISE 2012, pp. 1-15.
3. Jonassen and Bratsberg: “Intra-Query Concurrent Pipelined Processing for Distributed Full-Text Retrieval”. In Proc. ECIR 2012, pp. 413-425.
4. Jonassen and Bratsberg: “Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries”. In Proc. ECIR 2011, pp. 530-542.
5. Jonassen and Bratsberg: “A Combined Semi-Pipelined Query Processing Architecture for Distributed Full-Text Retrieval”. In Proc. WISE 2010, pp. 587-601.
6. Jonassen and Cambazoglu: “Improving Dynamic Index Pruning via Linear Programming”. In progress.
7. Jonassen, Cambazoglu and Silvestri: “Prefetching Query Results and its Impact on Search Engines”. In Proc. SIGIR 2012, pp. 631-640.
Secondary (not included):
1. Cambazoglu et al.: “A Term-Based Inverted Index Partitioning Model for Efficient Query Processing”. Under submission.
2. Rocha-Junior et al.: “Efficient Processing of Top-k Spatial Keyword Queries”. In Proc. SSTD 2011, pp. 205-222.
3. Jonassen: “Scalable Search Platform: Improving Pipelined Query Processing for Distributed Full-Text Retrieval”. In Proc. WWW (Comp. Vol.) 2012, pp. 145-150.
4. Jonassen and Bratsberg: “Impact of the Query Model and System Settings on Performance of Distributed Inverted Indexes”. In Proc. NIK 2009, pp. 143-151.