A lot of data is best represented as time series: operational data, financial data, and even in data warehouses the dominant dimension is often time. We present Chronix, a time series database based on Apache Solr and Spark that is able to handle trillions of time series data points and answer interactive queries. Chronix Spark is open source software and battle-proven at a German car manufacturer and an international telco.
We demonstrate several real-life use cases of Chronix. Afterwards we lift the curtain and dive deep into the Chronix architecture, especially how we use Solr to store time series data and how we have hooked Solr up with Spark. We provide benchmarks showing how Chronix outperforms other time series databases in both performance and storage efficiency.
Chronix is open source under the Apache License (http://chronix.io).
Efficient Data Storage for Analytics with Apache Parquet 2.0 (Cloudera, Inc.)
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
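As a rough illustration of why delta encoding helps, the sketch below (plain Java, not Parquet's actual implementation) stores a column of timestamps as a first value plus small differences; the small deltas bit-pack far more tightly than raw 64-bit values:

import java.util.Arrays;

public class DeltaEncodingSketch {
    // Encode: first value verbatim, then successive differences.
    static long[] encode(long[] values) {
        long[] out = new long[values.length];
        out[0] = values[0];
        for (int i = 1; i < values.length; i++) {
            out[i] = values[i] - values[i - 1];
        }
        return out;
    }

    // Decode: a running sum restores the original values exactly.
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        out[0] = deltas[0];
        for (int i = 1; i < deltas.length; i++) {
            out[i] = out[i - 1] + deltas[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000_000L, 1_000_010L, 1_000_020L, 1_000_035L};
        long[] deltas = encode(timestamps);          // [1000000, 10, 10, 15]
        System.out.println(Arrays.toString(deltas)); // small values pack into few bits
        System.out.println(Arrays.toString(decode(deltas)));
    }
}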
InfluxDB is an open source time series database designed to handle high write and query speeds for real-time metrics, events, and sensor data. It uses a schemaless data model and stores data as time-stamped points in measurements, which can be queried using a SQL-like language. InfluxDB excels at aggregating and analyzing time series data for use cases like monitoring, analytics, and alerting.
The document demonstrates how to recover data from PostgreSQL database files using the pg_filedump tool. It shows extracting table data and metadata like the table schema from the heap and system catalog files. Key points extracted include:
- pg_filedump can display formatted contents of PostgreSQL files including tables, indexes, and system catalogs
- Running pg_filedump on the table file extracted the table data including column types
- Further analysis of system catalog files using pg_filedump provided the table and column names and types to fully recover the table schema
ComputeFest 2012: Intro To R for Physical Sciences (alexstorer)
This document provides an introduction to the R programming language presented by Alex Storer at ComputeFest 2012. It discusses why R should be used over other languages like MATLAB and Python, provides examples of basic R syntax and functions, and walks through an example of loading climate data and creating plots to visualize rainfall anomalies over time. The goal is to provide attendees with a foundation of R basics while working through a real data analysis problem.
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 (Julien Le Dem)
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through techniques like delta encoding, dictionary encoding, run-length encoding and binary packing designed for CPU and cache optimizations. Benchmark results show Parquet provides much better compression and faster query performance than other formats like text, Avro and RCFile. The project is developed as an open source community with contributions from many organizations.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... (Spark Summit)
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
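As a minimal taste of the technique, the sketch below uses the RoaringBitmap Java library, one widely used compressed bitmap implementation; the row IDs are invented for illustration:

import org.roaringbitmap.RoaringBitmap;

public class BitmapIndexSketch {
    public static void main(String[] args) {
        // Each bitmap marks the row IDs matching one predicate.
        RoaringBitmap colorRed = RoaringBitmap.bitmapOf(1, 5, 7, 1_000_000);
        RoaringBitmap sizeLarge = RoaringBitmap.bitmapOf(5, 7, 42);

        // Boolean query evaluation becomes fast bitmap intersection/union.
        RoaringBitmap redAndLarge = RoaringBitmap.and(colorRed, sizeLarge);
        RoaringBitmap redOrLarge = RoaringBitmap.or(colorRed, sizeLarge);

        System.out.println(redAndLarge.getCardinality()); // 2 (rows 5 and 7)
        System.out.println(redOrLarge.contains(42));      // true
    }
}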
Sorting is one of those problems in computer science that have been around almost from the beginning. For example, the tabulating machine (IBM, 1890s Census) was the first early data processing unit able to sort data cards for people in the USA; the first census had taken around 7 years to finish, rendering all the stored data obsolete, hence the need for sorting. What is more, studying the different techniques of sorting allows for a more precise introduction of the concept of an algorithm. Some corrections were made to a bound for max-heapify. My deepest apologies for the mistakes!
Data correlation using PySpark and HDFS (John Conley)
This document discusses using PySpark and HDFS to correlate different types of data at scale. It describes some challenges with correlating out-of-order and high-volume data. It then summarizes three approaches tried: using Redis, RDD joins, and writing bindings to HDFS. The current recommended approach reads relevant binding buckets from HDFS to correlate records in small windows, supporting different temporal models. Configuration and custom logic can be plugged in at various points in the correlation process. While scalable, further improvements in latency and throughput are still needed.
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ... (Databricks)
Big companies typically integrate their data from various heterogeneous systems when building a data lake as single point for accessing data. To achieve this goal technical teams often deal with data defined by complex schemas and various data formats. Spark SQL Datasets are currently compatible with data formats such as XML, Avro and Parquet by providing primitive and complex data types such as structs and arrays.
Although the Dataset API offers a rich set of functions, general manipulation of arrays and deeply nested data structures is lacking. We will demonstrate this fact with examples of data that are currently very hard to process efficiently in Spark. We designed and developed an extension of the Dataset API that lets developers work with array and complex-type elements in a more straightforward and consistent way. The extension should help users dealing with complex and structured big data use Apache Spark as a truly generic processing framework.
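For a flavor of the problem space, here is a small sketch using only stock Spark (the built-in higher-order functions available since Spark 2.4, not the talk's extension API); the column name is hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ArrayOpsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("array-ops").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.sql("SELECT array(1, 2, 3, 4) AS values");

        // transform/filter over array elements without exploding the rows.
        df.selectExpr(
                "transform(values, x -> x * 2) AS doubled",
                "filter(values, x -> x % 2 = 0) AS evens")
          .show(false);

        spark.stop();
    }
}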
Photon Technical Deep Dive: How to Think Vectorized (Databricks)
Photon is a new vectorized execution engine powering Databricks, written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.
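To give an intuition for "thinking vectorized", here is a toy Java sketch of a filter kernel that processes a whole column batch in a tight loop and produces a selection vector; Photon's real kernels are C++ and far more sophisticated, so treat this purely as an illustration:

public class VectorizedFilterSketch {
    // Returns how many rows passed; surviving row indices land in `sel`.
    static int filterGreaterThan(long[] column, int rows, long threshold, int[] sel) {
        int out = 0;
        for (int i = 0; i < rows; i++) {
            sel[out] = i;
            // Branch-light update: always write, conditionally advance.
            out += (column[i] > threshold) ? 1 : 0;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] batch = {3, 9, 1, 12, 7};
        int[] sel = new int[batch.length];
        int n = filterGreaterThan(batch, batch.length, 5, sel);
        for (int i = 0; i < n; i++) {
            System.out.println("row " + sel[i] + " -> " + batch[sel[i]]);
        }
    }
}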
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge... (Altinity Ltd)
The document provides an overview of ClickHouse and techniques for optimizing performance. It discusses how the ClickHouse query log can help understand query execution and bottlenecks. Methods covered for improving performance include adding indexes, optimizing data layout through partitioning and ordering, using encodings to reduce data size, and materialized views. Storage optimizations like multi-disk volumes and tiered storage are also introduced.
Storm is an open source distributed real-time computation system for processing unbounded streams of data. It provides reliable processing of data streams, is fast and scalable (processing over a million tuples per second per node), and guarantees data will be processed. Storm allows building real-time analytics applications that perform tasks like search, personalization and monitoring by acting as a real-time processing layer integrated with systems like Kafka, Elasticsearch and Hadoop.
This document discusses scalable machine learning techniques. It summarizes Spark MLlib, which provides machine learning algorithms that can run on large datasets in a distributed manner using Apache Spark. It also discusses H2O, which provides fast machine learning algorithms that can integrate with Spark via Sparkling Water to allow transparent use of H2O models and algorithms with the Spark API. Examples of using K-means clustering and logistic regression are provided to illustrate MLlib and H2O.
How to Prepare Test Data for a Big Data Project: A Practical Example (SQALab)
The document describes an approach to sampling large datasets for testing Big Data projects. It discusses clustering similar records into groups to reduce the size of the sample. The records are divided into quartiles based on field values and each unique combination of quartiles forms a cluster. A representative sample is taken by selecting one record from each cluster. A tool is presented that automates this process using T-SQL on SQL Server to partition, cluster and filter the data into a reduced representative sample for testing.
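A minimal Java sketch of the clustering idea (the tool itself uses T-SQL; the records and field values here are invented): assign each record the quartile of each of its fields, use the quartile combination as a cluster key, and keep one representative per cluster.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class QuartileSamplingSketch {
    // Quartile (0-3) of v within the sorted copy of its column.
    static int quartile(double v, double[] sorted) {
        int pos = Arrays.binarySearch(sorted, v);
        if (pos < 0) pos = -pos - 1;
        return Math.min(3, pos * 4 / sorted.length);
    }

    public static void main(String[] args) {
        double[][] records = {{1, 900}, {2, 880}, {50, 10}, {55, 12}, {51, 11}};
        int cols = records[0].length;
        double[][] sortedCols = new double[cols][];
        for (int c = 0; c < cols; c++) {
            final int col = c;
            sortedCols[c] = Arrays.stream(records).mapToDouble(r -> r[col]).sorted().toArray();
        }
        // One representative record per unique quartile combination.
        Map<String, double[]> sample = new LinkedHashMap<>();
        for (double[] r : records) {
            StringBuilder key = new StringBuilder();
            for (int c = 0; c < cols; c++) key.append(quartile(r[c], sortedCols[c])).append('-');
            sample.putIfAbsent(key.toString(), r);
        }
        sample.forEach((k, r) -> System.out.println(k + " -> " + Arrays.toString(r)));
    }
}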
2014-11 ApacheConEU: Lizard - Clustering an RDF TripleStore (andyseaborne)
This document summarizes Andy Seaborne's talk on clustering an RDF triplestore. It discusses why clustering is important for resilience and performance scaling. It describes how the Apache Jena TDB triplestore is designed with custom indexes and node tables. It then explains how Lizard clusters TDB by sharding the indexes by subject and replicating shards, while distributing the node table across replicas. This allows modified SPARQL execution to perform joins in a distributed manner across the clustered storage.
SSN-TC workshop talk at ISWC 2015 on Emrooz (Markus Stocker)
Slides for the talk describing the paper on Emrooz, a scalable database for sensor observations with semantics according to the Semantic Sensor Network ontology.
This document presents two new approaches for reliable message processing in distributed streaming systems like Apache Storm:
1. A fingerprint-based approach that embeds a digest representing message context that is recursively passed down and updated.
2. A share-split approach that embeds a "share" with each message and splits the share at each component until the leaf where shares are reported.
It also discusses prototyping one approach by integrating it into Apache Storm and notes on the implementation.
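The paper's exact schemes are not reproduced here, but the flavor of such lightweight tracking can be illustrated with the well-known XOR "acker" trick from Apache Storm, where every emitted and acked message ID is XORed into a single digest that returns to zero exactly when the whole tuple tree has been acknowledged:

import java.util.Random;

public class XorTrackingSketch {
    private long digest = 0;

    void emitted(long msgId) { digest ^= msgId; } // child message created
    void acked(long msgId)   { digest ^= msgId; } // child message processed

    boolean fullyProcessed() { return digest == 0; }

    public static void main(String[] args) {
        XorTrackingSketch tracker = new XorTrackingSketch();
        Random rnd = new Random();
        long a = rnd.nextLong(), b = rnd.nextLong();
        tracker.emitted(a); tracker.emitted(b);       // tuple tree fans out
        tracker.acked(a);
        System.out.println(tracker.fullyProcessed()); // false, b outstanding
        tracker.acked(b);
        System.out.println(tracker.fullyProcessed()); // true
    }
}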
Big Data Analytics with Scala at SCALA.IO 2013 (Samir Bessalah)
This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, SummingBird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
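The canonical word count from such overviews, rendered here in Java (rather than the talk's Scala) for consistency with the other sketches on this page:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(
                    Arrays.asList("to be or not to be", "that is the question"));
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // map
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);                                 // reduce
            counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        }
    }
}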
This document discusses Parquet, an open-source columnar file format for Hadoop. It was created at Twitter to optimize their analytics infrastructure, which includes several large Hadoop clusters processing data from their 200M+ users. Parquet aims to improve on existing storage formats by organizing data column-wise for better compression and scanning capabilities. It uses a row group structure and supports efficient reads of individual columns. Initial results at Twitter found a 28% space savings over existing formats and scan performance improvements. The project is open source and aims to continue optimizing column storage and enabling new execution engines for Hadoop.
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... (Databricks)
This document discusses common anti-patterns when using Spark with Cassandra. It begins by introducing the authors and their experience. The main section describes several common issues like out of memory errors, RPC failures, and slow performance. It then discusses the most common performance pitfall of collecting and re-parallelizing data. Alternative approaches are provided. Other topics covered include predicate pushdowns, serialization, and understanding how Catalyst optimizes queries.
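A generic Spark sketch of that collect-and-re-parallelize pitfall (hypothetical enrichment logic, no Cassandra specifics) next to its distributed alternative:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CollectAntiPattern {
    static JavaRDD<String> enrich(JavaSparkContext sc, JavaRDD<String> rows) {
        // Anti-pattern: everything funnels through the driver JVM,
        // which becomes the bottleneck and may run out of memory.
        List<String> all = rows.collect();
        return sc.parallelize(all).map(r -> r + ",enriched");
    }

    static JavaRDD<String> enrichDistributed(JavaRDD<String> rows) {
        // Better: the same map runs on the executors, no driver round-trip.
        return rows.map(r -> r + ",enriched");
    }
}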
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
London Spark Meetup Project Tungsten Oct 12 2015 (Chris Fregly)
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten, which includes many CPU and memory optimizations.
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl... (Databricks)
Meltdown and Spectre are two security vulnerabilities disclosed in early 2018 that expose systems to cross-VM and cross-process attacks. They were the first of their kind and opened up a new class of exploits that allow one program to scan another program's memory. The kernel and VM patches released to address these vulnerabilities have been shown to degrade the performance of Apache Spark workloads in the cloud by 2-5%.
This talk will dive deep into the exploits and their patches in order to help explain the origin of this decline in performance.
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & Univers... (Lucidworks)
This document discusses optimizations for field faceting in Solr. It begins with an overview of faceting at web scale for the Danish Net Archive. Various techniques for optimizing faceting performance are then presented, including reusing counters, tracking updated counters, caching counters across shards, and using alternative counter structures like PackedInts and n-plane counters. The optimizations are shown to significantly speed up faceting for fields with high cardinality.
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy (Lucidworks)
Gregg Donovan presented on lessons learned from sharding Solr at Etsy over three versions:
1) Initially, Etsy avoided sharding to sidestep its problems, but the single-node approach did not scale.
2) The first sharding version used local sharding across multiple JVMs per host for better latency and manageability.
3) The current version uses distributed sharding across data centers for further latency gains, but this introduced challenges of partial failures, synchronization, and distributed queries.
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will demonstrate how to provision, configure, and manage a SolrCloud cluster in Amazon EC2, using a Fabric/boto based solution for automating SolrCloud operations. Attendees will come away with a solid understanding of how to operate a large-scale Solr cluster, as well as tools to help them do it. Tim will also demonstrate these tools live during his presentation. Covered technologies include: Apache Solr, Apache ZooKeeper, Linux, Python, Fabric, boto, Apache Kafka, Apache JMeter.
Solr Exchange: Introduction to SolrCloud (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks (Lucidworks)
Banana is a fork of Kibana that works with Apache Solr data. It uses Kibana's dashboard capabilities and ports key panels to work with Solr, providing additional capabilities like new D3.js panels. Banana aims to create rich and flexible UIs, enable rapid application development, and leverage Solr's power. To build a custom panel in Banana, you need an editor HTML file for settings, a module HTML file for display, and a module JS file containing panel logic.
Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.
Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr's free-text, geospatial, and other search capabilities with a prominent query language already known by most developers (and which many external systems can use to query Solr directly).
Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.
We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.
This document discusses scaling search with Apache SolrCloud. It provides an introduction to Solr and how scaling search was difficult in previous versions due to manually managing shards and replicas. SolrCloud makes scaling easier by utilizing ZooKeeper for centralized configuration and management across a cluster. Nodes can be added to a SolrCloud cluster and will automatically be configured and assigned as shards or replicas. This allows for effortless scaling, fault tolerance, and load balancing. The document promotes upcoming features in Solr 4 and demonstrates indexing and querying in a SolrCloud cluster.
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,... (Svetlin Nakov)
A few days ago I gave a talk about software architectures. My goal was to explain as simply as possible the main ideas behind the most popular software architectures, like the client-server model, the 3-tier and multi-tier layered models, the idea behind SOA architecture and cloud computing, and a few widely used architectural patterns like MVC (Model-View-Controller), MVP (Model-View-Presenter), PAC (Presentation-Abstraction-Control) and MVVM (Model-View-ViewModel). In my talk I explain that MVC, MVP and MVVM are not necessarily bound to any particular architectural model like client-server, 3-tier or SOA. MVC, MVP and MVVM are architectural principles applicable when we need to separate the presentation (UI), the data model and the presentation logic.
Additionally I gave an overview of the popular architectural principles IoC (Inversion of Control) and DI (Dependency Injection) and showed examples of how to build your own IoC container, in the spirit of the sketch below.
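Here is a minimal constructor-injection IoC container in Java; it sketches the general idea and is not the talk's own code:

import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

public class TinyContainer {
    private final Map<Class<?>, Class<?>> bindings = new HashMap<>();

    public <T> void bind(Class<T> iface, Class<? extends T> impl) {
        bindings.put(iface, impl);
    }

    @SuppressWarnings("unchecked")
    public <T> T resolve(Class<T> type) throws Exception {
        Class<?> impl = bindings.getOrDefault(type, type);
        Constructor<?> ctor = impl.getDeclaredConstructors()[0];
        Class<?>[] paramTypes = ctor.getParameterTypes();
        Object[] args = new Object[paramTypes.length];
        for (int i = 0; i < args.length; i++) {
            args[i] = resolve(paramTypes[i]); // recursive dependency resolution
        }
        return (T) ctor.newInstance(args);
    }

    // Usage (hypothetical types): container.bind(Repository.class, SqlRepository.class);
    // Service s = container.resolve(Service.class); // SqlRepository gets injected
}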
Using Business Architecture to enable customer experience and digital strategy (Craig Martin)
Digital disruption is shifting business model design from a focus on product profitability to a stronger focus on customer experience and lifetime value.
The presentation looks at environmental pressures caused by digital disruption and identifies how to use business architecture and business design to address these changes.
It covers business architecture for digital strategy, customer-driven value chains, re-writing of the 4Ps of the marketing mix, and the nine laws of disruption and how they affect business model design. Craig also investigates the changes afoot in strategic business planning and Enterprise Architecture, which are experiencing their own form of disruption. Will Enterprise Architecture as we know it become a commodity too?
This presentation was delivered as an OpenGroup webinar and is available for viewing from the www.enterprisearchitects.com web site.
This document summarizes load testing experiments conducted on Amazon RDS using an Oracle database. The tests aimed to evaluate RDS performance under different configurations and provide a basis for future load testing. Tests were run using m2.4xlarge and m1.xlarge instance types with varying provisioned IOPS. Key results showed that provisioned IOPS had a significant impact on throughput and latency. Higher IOPS configurations achieved thousands of transactions per second but also had periods of high latency. Lower IOPS configurations had more stable performance but lower throughput. The experiments provided insights into how different factors like instance type, IOPS provisioning, and read/write ratios influence RDS and database performance under load.
RedisConf17 - Doing More With Redis - Ofer Bengal and Yiftach Shoolman (Redis Labs)
This document summarizes RedisConf 2017, covering several topics:
1. Running Redis on Flash in a DBaaS model for improved performance and cost savings compared to other NoSQL databases.
2. Redis modules gaining momentum with over 50 created so far, and the importance of multi-threading for high performance. Useful modules highlighted include RediSearch, ReJSON, Redis-ML, and Redis-Graph.
3. Using Redis for IoT applications, with challenges around small edge devices and clusters, high throughput from thousands of devices, and varied functionality needs addressed through Redis modules.
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware (Lucidworks)
Spark can be used to improve the performance of importing and searching large datasets in Solr. Data can be imported from HDFS files into Solr in parallel using Spark, speeding up the import process. Spark can also be used to stream data from Solr into RDDs for further processing, such as aggregation, filtering, and joining with other data. Techniques like column-based denormalization and compressed storage of event data in Solr documents can reduce data volume and improve import and query speeds by orders of magnitude.
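A sketch of the parallel-indexing idea using plain SolrJ from Spark executors (the talk's own plumbing may differ; the ZooKeeper address, collection and field names below are hypothetical):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class SparkSolrIndexing {
    static void index(JavaRDD<String[]> rows) {
        // Each partition gets its own client, so indexing runs on all executors.
        rows.foreachPartition(partition -> {
            try (CloudSolrClient solr = new CloudSolrClient.Builder(
                    List.of("zk1:2181"), Optional.empty()).build()) {
                while (partition.hasNext()) {
                    String[] row = partition.next();
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", row[0]);
                    doc.addField("content_txt", row[1]);
                    solr.add("mycollection", doc);
                }
                solr.commit("mycollection");
            }
        });
    }
}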
Leveraging the Power of Solr with Spark (QAware GmbH)
Lucene Revolution 2016, Boston: Talk by Johannes Weigend (@JohannesWeigend, CTO at QAware).
Abstract: Solr is a distributed NoSQL database with impressive search capabilities. Spark is the new megastar in the distributed computing universe. In this code-intense session we show you how to combine both to solve real-time search and processing problems. We show you how to set up a Solr/Spark combination from scratch and develop your first jobs, which run distributed over shared Solr data. We also show you how to use this combination for your next-generation BI platform.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
This document provides an overview and interpretation of the Automatic Workload Repository (AWR) report in Oracle database. Some key points:
- AWR collects snapshots of database metrics and performance data every 60 minutes by default and retains them for 7 days. This data is used by tools like ADDM for self-management and diagnosing issues.
- The top timed waits in the AWR report usually indicate where to focus tuning efforts. Common waits include I/O waits, buffer busy waits, and enqueue waits.
- Other useful AWR metrics include parse/execute ratios, wait event distributions, and top activities to identify bottlenecks like parsing overhead, locking issues, or inefficient SQL.
Traditionally, database systems were optimized either for OLAP or for OLTP workloads. Mainstream DBMSes like Postgres, MySQL, ... are mostly used for OLTP, while Greenplum, Vertica, ClickHouse, SparkSQL, ... are oriented toward analytic queries. But right now many companies do not want to run two different data stores for OLAP and OLTP and need to perform analytic queries on the most recent data. I want to discuss which features should be added to Postgres to efficiently handle HTAP workloads.
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or Spark. Impala is written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
The document discusses Oracle Database result caching. It provides an overview of database caches including the result cache. It then describes a hand-made result cache implementation for a retailer case study and how it improved performance from 20 minutes to 4 minutes for a report. It also discusses using the Oracle Database result cache explicitly with hints and annotations, how to monitor and manage it using views and packages, limitations, and best practices.
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off (Timescale)
The earliest relational databases were monolithic on-premise systems that were powerful and full-featured. Fast forward to the Internet and NoSQL: BigTable, DynamoDB and Cassandra. These distributed systems were built to scale out for ballooning user bases and operations. As more and more companies vied to be the next Google, Amazon, or Facebook, they too "required" horizontal scalability.
But in a real way, NoSQL and even NewSQL have forgotten single node performance where scaling out isn't an option. And single node performance is important because it allows you to do more with much less. With a smaller footprint and simpler stack, overhead decreases and your application can still scale.
In this talk, we describe TimescaleDB's methods for single node performance. The nature of time-series workloads and how data is partitioned allows users to elastically scale up even on single machines, which provides operational ease and architectural simplicity, especially in cloud environments.
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext (Lucidworks)
This document discusses tuning Solr for log search and analysis. The author describes optimizing a baseline Solr configuration through techniques like time-based field collections, document value fields, commit settings and hardware choices. Significant performance gains were found, such as a 31x increase over baseline by using time-based collections with a 10 minute window. The author also recommends using specialized log processing tools like Apache Flume to parallelize and distribute indexing load for further throughput improvements.
ApacheCon 2020: Use Cases and Optimizations of IoTDB (ZhangZhengming)
This document summarizes a presentation about IoTDB, an open source time series database optimized for IoT data. It discusses IoTDB's architecture, use cases, optimizations, and common questions. Key points include that IoTDB uses a time-oriented storage engine and tree-structured schema to efficiently store and query IoT sensor data, and that optimizations like schema design, memory allocation, and handling out-of-order data can improve performance. Common issues addressed relate to version compatibility, system load, and error conditions.
This document summarizes an update on OpenTSDB, an open source time series database. It discusses OpenTSDB's ability to store trillions of data points at scale using HBase, Cassandra, or Bigtable as backends. Use cases mentioned include systems monitoring, sensor data, and financial data. The document outlines writing and querying functionality and describes the data model and table schema. It also discusses new features in OpenTSDB 2.2 and 2.3 like downsampling, expressions, and data stores. Community projects using OpenTSDB are highlighted and the future of OpenTSDB is discussed.
This document discusses using Hadoop to unify data management. It describes challenges with managing huge volumes of fast-moving machine data and outlines an overall architecture using Hadoop components like HDFS, HBase, Solr, Impala and OpenTSDB to store, search, analyze and build features from different types of data. Key aspects of the architecture include intelligent search, batch and real-time analytics, parsing, time series data and alerts.
Pinot: Realtime OLAP for 530 Million Users - SIGMOD 2018 (Seunghyun Lee)
Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
hbaseconasia2019: Phoenix Improvements and Practices on Cloud HBase at Alibaba (Michael Stack)
Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
The talk first lays out the underlying reasons why a result cache emerged as part of the DBMS machinery, and why some DBMSes have one while others do not.
It then reviews various options for caching the results both of SQL queries and of business logic stored in the database, compares the caching approaches (hand-coded caches versus standard built-in functionality), and gives recommendations on when each approach is optimal and when it can even be dangerous.
Each recommendation is illustrated with both positive and negative cases from production experience with real systems that use different kinds of caches.
The document contains various notes and information snippets on topics including:
- WiFi login details
- Maslow's hierarchy of needs
- LAMP stack vs 3-tier architecture
- Performance impacts of small delays in page load times
- Quicksort algorithm implementation in Haskell
- Comparisons of Reactive Extensions, F# Observable, and Nessos Streams libraries
- HybridDictionary data structure optimization in .NET
- B+ tree data structure praise
- Phillip Trelford's Twitter handle and blog
100500 Ways of Caching in Oracle Database, or How to Achieve Maximum Sp... (Ontico)
HighLoad++ 2017
Rio de Janeiro Hall, November 8, 14:00
Abstract:
http://www.highload.ru/2017/abstracts/2913.html
The talk first lays out the underlying reasons why a result cache emerged as part of the DBMS machinery, and why some DBMSes have one while others do not.
It then reviews various options for caching the results both of SQL queries and of business logic stored in the database, compares the caching approaches (hand-coded caches versus standard built-in functionality), and gives recommendations on when each approach is optimal and when it can even be dangerous.
...
Building a Large Scale SEO/SEM Application with Apache Solr (Rahul Jain)
Slides from my talk on "Building a Large Scale SEO/SEM Application with Apache Solr" in Lucene/Solr Revolution 2014 where I talk how we handle Indexing/Search of 40 billion records (documents)/month in Apache Solr with 4.6 TB compressed index data.
Abstract: We are working on building an SEO/SEM application where an end user searches for a "keyword" or a "domain" and gets all the insights about it, including search engine ranking, CPC/CPM, search volume, number of ads, competitor details, etc., in a couple of seconds. To provide this intelligence, we pull in huge amounts of web data from various sources; after intensive processing this amounts to 40 billion records/month in a MySQL database, with 4.6 TB of compressed index data in Apache Solr.
Due to the large volume, we faced several challenges in improving indexing performance, search latency and scaling the overall system. In this session, I talk about our various design approaches to import data faster from MySQL, tricks and techniques to improve indexing performance, distributed search, DocValues (a lifesaver), Redis, and the overall system architecture.
GraphSummit Paris - The art of the possible with Graph Technology (Neo4j)
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
UI5con 2024 - Keynote: Latest News about UI5 and its Ecosystem (Peter Muessig)
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
E-commerce Application Development Company.pdf (Hornet Dynamics)
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App (Google)
https://sumonreview.com/ai-fusion-buddy-review
AI Fusion Buddy Review: Key Features
✅Create Stunning AI App Suite Fully Powered By Google's Latest AI technology, Gemini
✅Use Gemini to build high-converting sales video scripts, ad copies, trending articles, blogs, etc. 100% unique!
✅Create Ultra-HD graphics with a single keyword or phrase that commands 10x eyeballs!
✅Fully automated AI articles bulk generation!
✅Auto-post or schedule stunning AI content across all your accounts at once—WordPress, Facebook, LinkedIn, Blogger, and more.
✅With one keyword or URL, generate complete websites, landing pages, and more…
✅Automatically create & sell AI content, graphics, websites, landing pages, & all that gets you paid non-stop 24*7.
✅Pre-built High-Converting 100+ website Templates and 2000+ graphic templates logos, banners, and thumbnail images in Trending Niches.
✅Say goodbye to wasting time logging into multiple Chat GPT & AI Apps once & for all!
✅Save over $5000 per year and kick out dependency on third parties completely!
✅Brand New App: Not available anywhere else!
✅ Beginner-friendly!
✅ZERO upfront cost or any extra expenses
✅Risk-Free: 30-Day Money-Back Guarantee!
✅Commercial License included!
See My Other Reviews Article:
(1) AI Genie Review: https://sumonreview.com/ai-genie-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
The most important new features of Oracle 23c for DBAs and developers. You can get a fuller picture from the video on my YouTube channel: https://youtu.be/XvL5WtaC20A
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
Need for Speed: Removing speed bumps from your Symfony projects ⚡️ (Łukasz Chruściel)
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies (Quickdice ERP)
Explore the seamless transition to e-invoicing with this comprehensive guide tailored for Saudi Arabian businesses. Navigate the process effortlessly with step-by-step instructions designed to streamline implementation and enhance efficiency.
What is Augmented Reality Image Tracking (pavan998932)
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
Artificial Intelligence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way, so it can easily be extended to your needs. This session showcases various tooling extensions that can greatly boost your development experience: work truly offline, transpile the code in your project to use even newer versions of EcmaScript (beyond 2022, which the UI5 tooling currently supports), consume any npm package of your choice in your project, use different kinds of proxies, and even stitch UI5 projects together during development to mimic your target environment.
2. Faceting optimizations for Solr
Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk
3. 3/55
Overview
Web scale at the State and University Library, Denmark
Field faceting 101
Optimizations
− Reuse
− Tracking
− Caching
− Alternative counters
4. 4/55
Web scale for a small web
Denmark
− Consolidation circa 10th century
− 5.6 million people
Danish Net Archive (http://netarkivet.dk)
− Established 2005
− 20 billion items / 590TB+ raw data
5. 5/55
Indexing 20 billion web items / 590TB into Solr
Solr index size is 1/9th of real data = 70TB
Each shard holds 200M documents / 900GB
− Shards built chronologically by a dedicated machine
− Projected 80 shards
− Current build time per shard: 4 days
− Total build time is 20 CPU-core years
− So far only 7.4 billion documents / 27TB in index
6. 6/55
Searching a 7.4 billion documents / 27TB Solr index
SolrCloud with 2 machines, each having
− 16 HT-cores, 256GB RAM, 25 * 930GB SSD
− 25 shards @ 900GB
− 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
− Disk cache 100GB or < 1% of index size
8. 8/55
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
for entry: priorityQueue
  result.add(resolveTerm(entry.ordinal), entry.count)
ord  term  counter
0    A     0
1    B     3
2    C     0
3    D     1006
4    E     1
5    F     1
6    G     0
7    H     0
8    I     3
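To make the counting loop above concrete, here is a minimal, self-contained Java sketch of ordinal-based facet counting with a top-X min-heap. The getOrdinals and resolveTerm functions are hypothetical stand-ins for Lucene's doc-values lookups, not Solr's actual API.

import java.util.*;
import java.util.function.IntFunction;

public class OrdinalFacetCounter {
    public static List<Map.Entry<String, Integer>> topTerms(
            Iterable<Integer> docIDs, int ordinalCount, int topX,
            IntFunction<int[]> getOrdinals, IntFunction<String> resolveTerm) {
        int[] counter = new int[ordinalCount];           // one slot per unique term
        for (int docID : docIDs)                         // counting phase
            for (int ordinal : getOrdinals.apply(docID))
                counter[ordinal]++;
        // Min-heap keeps only the topX (ordinal, count) pairs.
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt(e -> e[1]));
        for (int ord = 0; ord < counter.length; ord++) {
            pq.add(new int[]{ord, counter[ord]});
            if (pq.size() > topX) pq.poll();             // evict the smallest count
        }
        // Resolve ordinals to terms only for the surviving entries.
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (int[] e : pq)
            result.add(Map.entry(resolveTerm.apply(e[0]), e[1]));
        result.sort((a, b) -> b.getValue() - a.getValue());
        return result;
    }
}

The key property is that terms are resolved only after counting, so the expensive ordinal-to-String lookup happens at most topX times.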
9. 9/55
Test setup 1 (easy start)
Solr setup
− 16 HT-cores, 256GB RAM, SSD
− Single shard 250M documents / 900GB
URL field
− Single String value
− 200M unique terms
3 concurrent “users”
Random search terms
12. 12/55
Reuse the counter
counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>
13. 13/55
Reuse the counter
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters
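As a rough illustration of the pooling idea (a hypothetical sketch, not Solr's or sparse faceting's actual implementation), a pool can hand out pre-allocated counters and clear them on release, trading a fill for an allocation plus garbage collection:

import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CounterPool {
    private final BlockingQueue<int[]> free;
    private final int ordinals;

    public CounterPool(int ordinals, int maxPooled) {
        this.ordinals = ordinals;
        this.free = new ArrayBlockingQueue<>(maxPooled);
    }

    public int[] getCounter() {
        int[] counter = free.poll();          // reuse a pooled counter if available
        return counter != null ? counter : new int[ordinals];
    }

    public void release(int[] counter) {
        Arrays.fill(counter, 0);              // clean before reuse (could be a background thread)
        free.offer(counter);                  // silently dropped if the pool is full
    }
}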
26. 26/55
Distributed faceting
Phase 1) All shards perform faceting.
The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards
that did not return them in phase 1.
The Merger calculates the final counts for the top-X terms.
for term: fineCountRequest.getTerms()
  result.add(term,
    searcher.numDocs(query(field:term), base.getDocIDs()))
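A hypothetical sketch of the merge logic: phase 1 collects each shard's local top terms, the merger picks the provisional top-X, and phase 2 requests exact counts only for term/shard combinations that are missing. The FineCounter interface stands in for the per-term intersection query shown above.

import java.util.*;

public class FacetMerger {
    public interface FineCounter { long count(int shard, String term); }

    public Map<String, Long> merge(List<Map<String, Long>> shardTopTerms,
                                   int topX, FineCounter fineCounter) {
        // Phase 1: sum the counts each shard reported.
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shardTopTerms)
            shard.forEach((term, count) -> merged.merge(term, count, Long::sum));
        // Keep the provisional top-X terms.
        List<String> top = merged.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(topX).map(Map.Entry::getKey).toList();
        // Phase 2: fill in counts from shards that did not report the term.
        Map<String, Long> result = new LinkedHashMap<>();
        for (String term : top) {
            long count = 0;
            for (int shard = 0; shard < shardTopTerms.size(); shard++) {
                Long local = shardTopTerms.get(shard).get(term);
                count += local != null ? local : fineCounter.count(shard, term);
            }
            result.put(term, count);
        }
        return result;
    }
}

The fine-count calls are exactly where the slowdown discussed on the following slides comes from.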
27. 27/55
Test setup 2 (more shards, smaller field)
Solr setup
− 16 HT-cores, 256GB RAM, SSD
− 9 shards @ 250M documents / 900GB
domain field
− Single String value
− 1.1M unique terms per shard
1 concurrent “user”
Random search terms
29. 29/55
Fine counting can be slow
Phase 1: Standard faceting
Phase 2:
for term: fineCountRequest.getTerms()
  result.add(term,
    searcher.numDocs(query(field:term), base.getDocIDs()))
30. 30/55
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter.increment(ordinal)
for term: fineCountRequest.getTerms()
  result.add(term, counter.get(getOrdinal(term)))
Same counting as in phase 1, which yields:
ord  counter
0    0
1    3
2    0
3    1006
4    1
5    1
6    0
7    0
8    3
31. 31/55
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
  result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
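A hypothetical sketch of such a keyed cache: phase 1 stores the filled counter under a key derived from the query, and phase 2 looks it up instead of re-running per-term intersections. An access-ordered LinkedHashMap gives cheap LRU eviction, since phase 2 normally arrives right after phase 1.

import java.util.LinkedHashMap;
import java.util.Map;

public class KeyedCounterCache {
    private final Map<String, int[]> cache;

    public KeyedCounterCache(final int maxEntries) {
        this.cache = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxEntries;   // simple LRU eviction
            }
        };
    }

    public synchronized void put(String queryKey, int[] counter) { cache.put(queryKey, counter); }
    public synchronized int[] get(String queryKey) { return cache.get(queryKey); }
}

Phase 2 then reduces to counter[getOrdinal(term)] per requested term, at the cost of keeping a few large counters alive between the two phases.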
49. 49/55
I could go on about
Threaded counting
Heuristic faceting
Fine count skipping
Counter capping
Monotonically increasing tracker for n-plane-z
Regexp filtering
50. 50/55
What about huge result sets?
Rare for explorative term-based searches
Common for batch extractions
Threading works poorly as #shards > #CPUs
But how bad is it really?
52. 52/55
Heuristic faceting
Use sampling to guess top-X terms
− Re-use the existing tracked counters
− 1:1000 sampling seems usable for the links field, which has 5 billion references per shard
Fine-count the guessed terms
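A minimal sketch of the sampling step, under the assumption that counting every Nth document of the result set preserves the ranking of frequent terms well enough to guess the top-X. getOrdinals is again a hypothetical stand-in for the doc-values lookup.

import java.util.function.IntFunction;

public class SampledFacet {
    // sampleFactor = 1000 means roughly 1 in 1000 result docs is counted.
    public static int[] sampleCounts(int[] docIDs, int ordinalCount,
                                     int sampleFactor, IntFunction<int[]> getOrdinals) {
        int[] counter = new int[ordinalCount];
        for (int i = 0; i < docIDs.length; i += sampleFactor)
            for (int ordinal : getOrdinals.apply(docIDs[i]))
                counter[ordinal]++;
        return counter;   // usable for ranking guesses only, not for exact counts
    }
}

The guessed terms are then fine-counted exactly; asking for more terms than needed (e.g. top-100 when top-25 is wanted) raises the chance that the true top terms are among the guesses, as the editor's notes mention.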
55. 55/55
Never enough time, but talk to me about
Threaded counting
Monotonically increasing tracker for n-plane-z
Regexp filtering
Fine count skipping
Counter capping
56. 56/55
Extra info
The techniques presented can be tested with sparse faceting, available as a drop-in replacement WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be implemented, but the timeframe is unknown.
There are no current plans to incorporate the full feature set into the official Solr distribution. The suggested approach is to split it into multiple independent or semi-independent features, starting with those applicable to most people, such as the distributed faceting fine-count optimization.
In-depth descriptions and performance tests of the different features can be found at https://sbdevel.wordpress.com.
58. 58/55
6 billion docs / 20TB, 25 shards, single machine
facet on 6 fields (1*4000M, 2*20M, 3*smaller)
59. 59/55
7 billion docs / 23TB, 25 shards, single machine
facet on 5 fields (2*20M, 3*smaller)
Editor's Notes
“Solr at Scale for Time-Oriented Data, Rocana” covers just about everything here, just nicer.
Tika is the heavy part: 90% of indexing CPU power goes into Tika analysis.
Static & optimized shards
No replicas (but we do have backup)
Rarely more than 1 concurrent user
Standard JRE 1.7 garbage collector – no tuning.
Full GC means delay for the client.
Standard GC means higher CPU load.
Some info on the JSON Faceting API and counter reuse at http://yonik.com/facet-performance/
The pool is responsible for cleaning the counter
Counter cleaning is a background thread
NOTE: Was I wrong about JSON faceting reuse?
Note: It always takes at least 500ms in this test
This scenario represents the richest faceting feature set we are currently willing to run on our net search. Fortunately, more than 1 concurrent search is rare in the standard scenario. Our established upper acceptable response time is 2 seconds (median), with no defined worst-case limit.
Faceting on the links field requires 60GB of heap per concurrent call. While this might be technically feasible for our setup, it would leave very little memory available for disk cache.
Not the true minimum, as we round up to nearest power of 2 minus 1
Blue squares are overflow bits. Finding the index for the term in a higher plane is done by counting the number of overflow bits. Fortunately this can be done with a rank function (~3% memory overhead) in constant time.
The standard tracker is not used, as it would require more heap than the counter structure itself. Instead, a bitmap counter structure is used (1/64 overhead). Details about this counter structure are not part of this presentation.
n-plane-z uses a little less than 2x theoretical min
Multiple n-plane-z counters share overflow bits, so extra concurrent counters take up only slightly more than the theoretical minimum amount of heap.
Fine counting could be replaced by multiplying by 1/sampling_factor
We want top-25, but ask for top-100 to raise the chances of getting the right terms
Counts are guaranteed to be correct
Bonus slide 1
Graphs from production core library search (books, articles, etc.) logs. Logs are taken from the same weekday, over 4 weeks.
Blue, pink and green are response times with vanilla Solr. Orange is with sparse faceting.
Bonus slide: The effect of artificially reducing the amount of memory available for disk caching. Reducing this below 50GB has severe performance implications.
Moral: SSDs allow for a very low relative disk cache, but do not count on performance scaling linearly with disk cache size.
Bonus slide.
Performance of search with multiple concurrent users. Note that the large URL field is not part of faceting.
This slide demonstrates performance for a more “normal” search situation on a machine with a relatively small amount of disk cache.