Building Data Pipelines for Solr with Apache NiFi (Bryan Bende)
Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic. Some of NiFi's key features include a web-based user interface for monitoring and controlling data flows, guaranteed delivery, data provenance, and easy extensibility through custom processor development.
These features make NiFi a perfect candidate for building production quality data pipelines that interact with Apache Solr. This talk will demonstrate how to use a NiFi processor that delivers data to a Solr update handler, as well as a processor for extracting data from Solr on regular intervals for delivery to down-stream systems. In addition we will show how these processors can be combined with other built-in NiFi processors to solve a variety of use cases, including log aggregation, and indexing messages received from Kafka.
As Apache Solr becomes more powerful and easier to use, the accessibility of high quality data becomes key to unlocking the full potential of Solr’s search and analytic capabilities. Traditional approaches to acquiring data frequently involve a combination of homegrown tools and scripts, often requiring significant development efforts and becoming hard to change, hard to monitor, and hard to maintain. This talk will discuss how Apache NiFi addresses the above challenges and can be used to build production-grade data pipelines for Solr. We will start by giving an introduction to the core features of NiFi, such as visual command & control, dynamic prioritization, back-pressure, and provenance. We will then look at NiFi’s processors for integrating with Solr, covering topics such as ingesting and extracting data, interacting with secure Solr instances, and performance tuning. We will conclude by building a live dataflow from scratch, demonstrating how to prepare data and ingest to Solr.
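Back-pressure, one of the NiFi features mentioned above, can be illustrated with a bounded queue between two flow steps: once the configured object threshold is reached, a fast producer blocks instead of the queue growing without limit. This is a minimal stdlib sketch of the idea, not NiFi code; all names are illustrative.

```python
# Back-pressure sketch: a bounded queue between two flow steps makes a
# fast producer block once the object threshold is reached, the same
# idea NiFi applies per connection. Illustrative only.
import queue
import threading

q = queue.Queue(maxsize=3)   # back-pressure object threshold
consumed = []

def producer():
    for i in range(10):
        q.put(i)             # blocks while the queue already holds 3 items
    q.put(None)              # sentinel: end of flow

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

The producer never gets more than three items ahead of the consumer, which is exactly the behavior back-pressure thresholds give you on a NiFi connection.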
We discuss the current state of LLAP (Live Long and Process), the concurrent, sub-second execution engine for analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching, and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency: resources are shared, per-query overheads are reduced, and the system can preempt lower-priority tasks without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, and future work, including how LLAP fits into a unified, secure DataFrame access layer.
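The "caching with intelligent eviction" idea above can be sketched with a least-recently-used cache of columnar chunks. LLAP's actual policy and data structures are more sophisticated; this stdlib sketch only illustrates the eviction concept, and the chunk identifiers are invented for the example.

```python
# Sketch of a columnar-chunk cache with LRU eviction: when capacity is
# reached, the least-recently-used chunk is dropped. (LLAP's real
# eviction policy is smarter; this is only illustrative.)
from collections import OrderedDict

class ChunkCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()  # chunk_id -> column data

    def get(self, chunk_id):
        if chunk_id not in self.chunks:
            return None
        self.chunks.move_to_end(chunk_id)  # mark as recently used
        return self.chunks[chunk_id]

    def put(self, chunk_id, data):
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict least recently used

cache = ChunkCache(capacity=2)
cache.put("orc:part-0:col-a", [1, 2, 3])
cache.put("orc:part-0:col-b", [4, 5, 6])
cache.get("orc:part-0:col-a")          # touch col-a so it survives
cache.put("orc:part-1:col-a", [7, 8])  # evicts col-b, the LRU entry
```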
Hadoop & Cloud Storage: Object Store Integration in Production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, and GCS, which are designed for cost-efficiency, scalability, and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance, and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connector, Hive, Tez, and ORC. Our goal is to improve correctness, performance, security, and operations for users that choose to integrate Hadoop with cloud storage. We use S3 and the S3A connector as a case study.
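The eventual-consistency challenge described above, where a listing may not yet reflect a recent write, is commonly handled with bounded retries. This is an illustrative stdlib sketch against a mock store, not the S3A implementation; the class and key names are invented for the example.

```python
# Sketch of coping with eventually consistent object-store listings:
# retry a listing until a recently written key appears, up to a bound.
# The mock store "becomes consistent" after two list() calls; real
# connectors such as S3A layer similar logic over the object store.
import time

class MockEventuallyConsistentStore:
    def __init__(self):
        self.objects = set()
        self._pending = {}  # key -> listings remaining until visible

    def put(self, key):
        self._pending[key] = 2  # visible only after 2 list() calls

    def list(self):
        for key in list(self._pending):
            self._pending[key] -= 1
            if self._pending[key] <= 0:
                self.objects.add(key)
                del self._pending[key]
        return set(self.objects)

def list_until_visible(store, key, attempts=5, delay=0.0):
    """Retry listing until `key` shows up, or give up after `attempts`."""
    for _ in range(attempts):
        listing = store.list()
        if key in listing:
            return listing
        time.sleep(delay)
    raise RuntimeError("key never became visible: " + key)

store = MockEventuallyConsistentStore()
store.put("data/part-00000.orc")
listing = list_until_visible(store, "data/part-00000.orc")
```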
We will talk about two challenging real-world SQL-on-Hadoop use cases: #1 Highly Parallel Workload Over Massive Data, and #2 Sub-second SQL for Online Reporting. The challenge is to meet very strict performance requirements over hundreds of billions of records. We will introduce how we solved these challenges using Hive on Tez, Hive LLAP, and Phoenix, with real-life performance numbers!
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac... (DataWorks Summit)
The last 5 years have been marked by an explosion of Internet-connected devices. From cars to solar power, from TVs to juice makers, modern life is filled with interconnected smart devices.
But while those ubiquitous devices enhance the interaction with the technology that surrounds us, the lifecycle management of IoT firmware and poor security design choices still present a significant threat to our daily lives.
Despite the ascent of threats like the Mirai botnet, the amount of published research on how to programmatically detect new IoT devices in the wild has been somewhat limited.
In this presentation we introduce Data Engineering in the context of cyber security, discuss why it is important to move away from the view that security log pipelines are enrichment and indicator matching tools, and push the boundaries of “Simple Event Processing” to demonstrate how Apache NiFi and Apache MiNiFi’s feature rich dataflows can be used to dynamically identify new IoT botnet activities in the wild.
Speakers
Andre Fucs De Miranda, Independent Consultant, Fluenda
Andy LoPresto, Sr. Member of Technical Staff, Hortonworks
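As a toy illustration of the kind of flow-level heuristic such a dataflow might apply (this is not the presenters' detection logic, and the thresholds and addresses are invented), Mirai-style behavior, in which one source scans many hosts on telnet ports, can be flagged directly from flow records:

```python
# Toy flow-record heuristic in the spirit of the talk: flag a source IP
# that contacts many distinct hosts on telnet ports (23/2323), a pattern
# associated with Mirai-style scanning. Thresholds are illustrative.
from collections import defaultdict

TELNET_PORTS = {23, 2323}
SCAN_THRESHOLD = 3  # distinct destinations before we flag a source

def classify_flows(flows):
    """flows: iterable of (src_ip, dst_ip, dst_port) tuples."""
    telnet_targets = defaultdict(set)
    for src, dst, port in flows:
        if port in TELNET_PORTS:
            telnet_targets[src].add(dst)
    return sorted(src for src, targets in telnet_targets.items()
                  if len(targets) >= SCAN_THRESHOLD)

flows = [
    ("10.0.0.5", "192.0.2.1", 23),
    ("10.0.0.5", "192.0.2.2", 23),
    ("10.0.0.5", "192.0.2.3", 2323),
    ("10.0.0.9", "192.0.2.1", 443),
]
suspects = classify_flows(flows)
```

In a NiFi or MiNiFi flow, logic of this shape would sit behind record-processing steps that parse and route the incoming flow data.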
Best Practices and Lessons Learnt from Running Apache NiFi at Renault (DataWorks Summit)
No real-time insight without real-time data ingestion. No real-time data ingestion without NiFi! Apache NiFi is an integrated platform for data flow management at the enterprise level, enabling companies to securely acquire, process, and analyze disparate sources of information (sensors, logs, files, etc.) in real time. NiFi helps data engineers accelerate the development of data flows thanks to its UI and a large number of powerful off-the-shelf processors. However, with great power comes great responsibility. Behind NiFi's simplicity, best practices must be respected in order to scale data flows in production and avoid unpleasant surprises. In this joint presentation, Hortonworks and Renault, a French car manufacturer, will present lessons learnt from real-world projects using Apache NiFi. We will present NiFi design patterns for achieving high performance and reliability at scale, as well as the processes to put in place around the technology for data flow governance. We will also show how these best practices can be applied in practical use cases and scenarios.
Speakers
Kamelia Benchekroun, Data Lake Squad Lead, Renault Group
Abdelkrim Hadjidj, Solution Engineer, Hortonworks
A TPC Benchmark of Hive LLAP and a Comparison with Presto (Yu Liu)
This is a TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and durability.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference, Vancouver, on 2016/05/09.
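Much of the streaming SQL described above reduces to windowed aggregation: a query like `SELECT STREAM ... GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND)` assigns each event to exactly one fixed-size window. This stdlib sketch shows a tumbling-window count; it illustrates the concept only and is not Calcite's implementation.

```python
# Tumbling-window count, the core of a streaming GROUP BY such as
# GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND). Each event carries a
# timestamp and falls into exactly one fixed window.
from collections import defaultdict

def tumbling_counts(events, window_size):
    """events: iterable of (timestamp, payload); returns {window_start: count}."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]
counts = tumbling_counts(events, window_size=10)
```

Because windows are disjoint and close in timestamp order, a streaming engine can emit each window's result as soon as the watermark passes its end, which is how such queries deliver low-latency results.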
This talk will give an overview of two exciting releases: Apache HBase 2.0 and Phoenix 5.0. HBase provides a NoSQL column store on Hadoop for random, real-time read/write workloads. Phoenix provides SQL on top of HBase. HBase 2.0 contains a large number of features that were a long time in development, including rewritten region assignment, performance improvements (RPC, a rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heaping of the memstore and other buffers, shading of dependencies, and many other fixes and stability improvements. We will go into detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Phoenix 5.0 is the next big Phoenix release because of its integration with HBase 2.0 and its many performance improvements in support of secondary indexes. It has many important new features, such as encoded columns and Kafka and Hive integration, along with many other performance improvements. This session will also describe the use cases that HBase and Phoenix are a good architectural fit for.
Speaker: Alan Gates, Co-Founder, Hortonworks
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings to light a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns on the cloud while avoiding most of the operational overhead of administering a caching layer: the cache is automatically coherent, features intelligent eviction, and supports file formats from text to ORC. We also explore the possibilities of combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We overview the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration on the cloud with new, improved concurrent query support and production-ready tools and UI.
Speaker
Sergey Shelukin, Member of Technical Staff, Hortonworks
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much, much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to read the query plan to understand what Hive is doing.
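One classic trap of the kind this talk alludes to is scanning every partition when a predicate on the partition key could prune most of them. This stdlib sketch illustrates the pruning concept only; the partition names and data are invented, and real Hive performs pruning in the planner when the filter is written directly against the partition column.

```python
# Sketch of partition pruning: when a predicate references the partition
# column, only matching partitions are read. A common Hive trap is
# wrapping the partition key in a function, which defeats pruning and
# forces a full scan.
partitions = {
    "ds=2016-01-01": [("u1", 10), ("u2", 20)],
    "ds=2016-01-02": [("u3", 30)],
    "ds=2016-01-03": [("u4", 40), ("u5", 50)],
}

def scan(partitions, ds_filter=None):
    """Return (partitions_read, rows), reading only partitions that pass."""
    read, rows = [], []
    for part, data in partitions.items():
        ds = part.split("=", 1)[1]
        if ds_filter is None or ds_filter(ds):
            read.append(part)
            rows.extend(data)
    return read, rows

# Pruned scan: only one of the three partitions is touched.
read, rows = scan(partitions, ds_filter=lambda ds: ds == "2016-01-02")
```

Reading the query plan (e.g. via EXPLAIN) shows whether the engine actually pruned or is scanning the whole table.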
Keep Your Hadoop Cluster at Its Best! (Chris Nauroth)
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it can sometimes be overwhelming for its operators and users. Amateurs as well as seasoned operators of Hadoop are caught unawares by common pitfalls in deploying, tuning, and operating a Hadoop cluster. Having spent 5+ years working with hundreds of Hadoop users, running clusters with thousands of nodes, managing tens of petabytes of data, and running hundreds of thousands of tasks per day, we have seen how unintentional acts, suboptimal configurations, and common mistakes result in downtime, SLA violations, many hours of recovery operations, and in some cases even data loss! Most of these traumas could have been easily avoided by applying easy-to-follow best practices that protect data and optimize performance. In this talk we present real-life stories, common pitfalls, and, most importantly, strategies on how to correctly deploy and manage Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense, which can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as service degradation or an outage.
Real-time Inverted Search in the Cloud Using Lucene and Storm (lucenerevolution)
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
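The "inverted search" pattern described above flips the usual index: stored query subscriptions are evaluated against each incoming document. This stdlib sketch shows the idea with simple AND-of-terms queries; it is not the authors' Lucene/Storm code, and the subscription names are invented. Lucene's MemoryIndex does the real version by building a one-document in-memory index and running each stored query against it.

```python
# Minimal "inverted search" sketch: saved subscriptions are matched
# against each incoming document, instead of documents being indexed
# for ad-hoc queries. Queries here are sets of required terms.

def tokenize(text):
    return set(text.lower().split())

class SubscriptionMatcher:
    def __init__(self):
        self.subscriptions = {}  # name -> set of required terms

    def register(self, name, query_terms):
        """Register a saved query as required terms (AND semantics)."""
        self.subscriptions[name] = set(t.lower() for t in query_terms)

    def match(self, document_text):
        """Return the names of all subscriptions the document satisfies."""
        doc_terms = tokenize(document_text)
        return [name for name, terms in self.subscriptions.items()
                if terms <= doc_terms]

matcher = SubscriptionMatcher()
matcher.register("storm-alerts", ["storm", "topology"])
matcher.register("solr-alerts", ["solr"])
hits = matcher.match("Deploying a Storm topology for Solr notifications")
```

Distributing this across a cluster, as the talk describes, amounts to partitioning the subscription set over workers and fanning each incoming document out to all partitions.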
Building a Near Real-Time Search Engine & Analytics for Logs Using Solr (lucenerevolution)
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidating and indexing logs so they can be searched in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Log events are mostly small, around 200 bytes to a few KB, which makes them harder to handle: the smaller each log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk will cover:
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
"Hadoop Ecosystem and Low Latency Streaming Architecture" was presented by Vijay Mandava and Lan Jiang to Detroit Java User Group on 3/23/2015. It covers the basic introduction of Hadoop Ecosystem and then focus on the low latency streaming architecture, including frameworks such as Flume, Kafka and Storm.
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012TEST Huddle
EuroSTAR Software Testing Conference 2012 presentation on Innovations for Testing Parallel Software by Mike Bartley.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
The Science and Technology Facilities Council is a UK Research Council which funds research and provides large facilities to the UK Scientific Community. This includes running a Tier 1 site for the LHC computing project, the JASMIN Super Data Cluster and a number of other HPC and HTC facilities. The Scientific Computing Department at the Rutherford Appleton Laboratory has been developing a cloud for use across both sites of the Department and in the wider scientific community. This is an OpenNebula backed by Ceph block storage. I will give a brief background of the project, describe our set up, some use cases and the work we have done around OpenNebula (including a simplified web front-end and a number of hooks to provide us with traceability). I will also discuss how we are creating an elastic boundary between our HTC batch farm and cloud.
Author Biography
I am a Systems Administrator in the Scientific Computing Department of the UK’s Science and Technology Facilities Council. I work as part of the cloud team and I also work on a number of Grid services including our HTC batch farm for the LHC computing project.
Prior to my position here I worked in IT at a SMB focusing on Storage and Virtualisation, in particular Hyper-V and VMWare.
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
MongoDB presentation from Silicon Valley Code Camp 2015.
Walkthrough developing, deploying and operating a MongoDB application, avoiding the most common pitfalls.
Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware and cloud bills can just be as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive the capacity requirements: volume of queries, access patterns, indexing, working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc customers, and come up with their ideal cluster layout and hardware.
Devnexus 2018 - Let Your Data Flow with Apache NiFiBryan Bende
Introduction to Apache NiFi features such as interactive command and control, version control of process groups, record processing, provenance, and prioritzation, and building customer extensions.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
2. REAL-TIME INVERTED SEARCH IN THE
CLOUD USING LUCENE AND STORM
Joshua Conlin, Bryan Bende, James Owen
conlin_joshua@bah.com
bende_bryan@bah.com
owen_james@bah.com
3. Table of Contents
Problem Statement
Storm
Methodology
Results
4. Who are we?
Booz Allen Hamilton
– Large consulting firm supporting many industries
• Healthcare, Finance, Energy, Defense
– Strategic Innovation Group
• Focus on innovative solutions that can be applied across industries
• Major focus on data science, big data, & information retrieval
• Multiple clients utilizing Solr for implementing search capabilities
• Explore Data Science
• Self-paced data science training, launching TODAY!
• https://exploredatascience.com
5. Client Applications & Architecture
[Diagram: Ingest → SolrCloud → Web App]
Typical client applications allow users to:
• Query document index using Lucene syntax
• Filter and facet results
• Save queries for future use
6. Problem Statement
How do we instantly notify users of new documents that match their
saved queries?
Constraints:
• Process documents in real-time, notify as soon as possible
• Scale with the number of saved queries (starting with tens of thousands)
• Result set of notifications must match saved queries
• Must not impact performance of the web application
• Data arrives at varying speeds and varying sizes
7. Possible Solutions
1. Fork ingest to a second Solr instance, run stored queries periodically
– Pros: Easy to set up, works for a small amount of data & small # of queries
– Cons: Bound by time to execute all queries
2. Same secondary Solr instance, but distribute queries to multiple servers
– Pros: Reduces query processing time by dividing across several servers
– Cons: Now writing custom code to distribute queries, possible synchronization issues
ensuring each server executes queries against the same data
3. Give each server its own Solr instance and subset of queries
– Pros: Very scalable, only bound by number of servers
– Cons: Difficult to maintain, still writing custom code to distribute data and queries
8. Possible Solutions
Is there a way we can set up this system so that it’s:
• easy to maintain,
• easy to scale, and
• easy to synchronize?
9. Candidate Solution
• Integrate Solr and/or Lucene with a stream processing framework
• Process data in real-time, leverage proven framework for distributed stream
processing
[Diagram: Ingest feeds both SolrCloud and Storm; SolrCloud serves the Web App; Storm produces Notifications]
10. Storm - Overview
• Storm is an open source stream processing framework.
• It’s a scalable platform that lets you distribute processes across a cluster quickly
and easily.
• You can add more resources to your cluster and easily utilize those resources in
your processing.
11. Storm - Components
• Nimbus – the control node for the cluster, distributes topology through the cluster
• Supervisor – one on each machine in the cluster, controls the allocation of worker
assignments on its machine
• Worker – JVM process for running topology components
[Diagram: Nimbus distributing work to three Supervisors, each managing four Workers]
12. Storm – Core Concepts
• Topology – defines a running process, which includes all of the processes to be
run, the connections between those processes, and their configuration
• Stream – the flow of data through a topology; it is an unbounded collection of
tuples that is passed from process to process
• Storm has 2 types of processing units:
– Spout – the start of a stream; it can be thought of as the source of the data;
that data can be read in however the spout wants—from a database, from a
message queue, etc.
– Bolt – the primary processing unit for a topology; it accepts any number of
streams, does whatever processing you’ve set it to do, and outputs any
number of streams based on how you configure it
13. Storm – Core Concepts (continued)
• Stream Groupings – defines how topology processing units (spouts and bolts) are
connected to each other; some common groupings are:
– All Grouping – stream is sent to all bolts
– Shuffle Grouping – stream is evenly distributed across bolts
– Fields Grouping – sends tuples that match on the designated “field” to the
same bolt
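The groupings above boil down to routing functions from a tuple to one or more bolt instances. A stdlib-only sketch of shuffle vs. fields grouping (illustrative only; the class and method names are ours, and this is not Storm's internal implementation):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Groupings {
    // Shuffle grouping: pick a bolt instance uniformly at random,
    // so load spreads evenly across the parallel copies of a bolt.
    static int shuffle(int numBolts) {
        return ThreadLocalRandom.current().nextInt(numBolts);
    }

    // Fields grouping: hash the designated field so tuples with the
    // same field value always land on the same bolt instance.
    static int fields(String fieldValue, int numBolts) {
        return Math.floorMod(fieldValue.hashCode(), numBolts);
    }

    public static void main(String[] args) {
        int numBolts = 8;
        // Same field value always routes to the same bolt...
        System.out.println(fields("user-42", numBolts) == fields("user-42", numBolts)); // true
        // ...while shuffle only guarantees a target in range.
        int target = shuffle(numBolts);
        System.out.println(target >= 0 && target < numBolts); // true
    }
}
```

An All Grouping, by contrast, would return every bolt index rather than picking one.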
15. How to Utilize Storm
How can we use this framework to solve our problem?
Let Storm distribute the data and queries between processing nodes
…but we would still need to manage a Solr instance on each VM, and we would
also need to ensure synchronization between query processing bolts running on
the same VM.
16. How to Utilize Storm
What if instead of having a Solr installation on each machine we ran
Solr in memory inside each of the processing bolts?
• Use Storm spout to distribute new documents
• Use Storm bolt to execute queries against EmbeddedSolrServer with
RAMDirectory
– Incoming documents added to index
– Queries executed
– Documents removed from index
• Use Storm bolt to process query results
[Diagram: a Storm Bolt containing an EmbeddedSolrServer backed by a RAMDirectory]
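The add-documents / run-queries / clear cycle above can be sketched without Solr or Lucene at all. This toy stand-in (stdlib only, our own names, AND-only queries) shows the essential pattern: documents live in a transient in-memory inverted index just long enough for the saved queries to run against them.

```java
import java.util.*;

public class TransientIndex {
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> doc ids

    // Index a document: record which doc ids contain each term.
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // A "query" here is just a set of required terms (AND semantics):
    // intersect the posting lists of every term.
    Set<String> matches(List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = postings.getOrDefault(term.toLowerCase(), Set.of());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    // Documents are volatile: after the queries run, the index is emptied.
    void clear() { postings.clear(); }

    public static void main(String[] args) {
        TransientIndex idx = new TransientIndex();
        idx.add("doc1", "storm topology scales lucene queries");
        idx.add("doc2", "solr cloud web app");
        System.out.println(idx.matches(List.of("lucene", "storm"))); // [doc1]
        idx.clear();
        System.out.println(idx.matches(List.of("lucene"))); // []
    }
}
```

In the real bolt, EmbeddedSolrServer over a RAMDirectory plays the role of this map, giving the full Lucene query syntax instead of term intersection.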
17. Advantages
This has several advantages:
• It removes the need to maintain a Solr instance on each VM.
• It’s easier to scale and more flexible; it doesn’t matter which Supervisor the bolts
get sent to, all the processing is self-contained.
• It removes the need to synchronize processing between bolts.
• Documents are volatile: existing queries are run over new data as it arrives
18. Execution Topology
[Diagram: three Data Spouts and a Query Spout feeding five Executor Bolts, which feed a Notification Bolt]
Data Spout – receives incoming data files and sends them to every
Executor Bolt (All Grouping)
Query Spout – coordinates updates to queries
Executor Bolt – loads and executes queries
Notification Bolt – generates all notifications based on results
(Shuffle Grouping)
19. Executor Bolt
1. Queries are loaded into memory
2. Incoming documents are added to the
Lucene index
3. Documents are processed when one
of the following conditions are met:
a) The number of documents have
exceeded the max batch size
b) The time since the last execution
is longer than the max interval
time
4. Matching queries and document UIDs
are emitted
5. Remove all documents from index
[Diagram: the Executor Bolt holds a query list and a batch of documents; numbered callouts match the steps above]
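The flush decision in step 3 (process when the batch is full OR the interval has elapsed) can be sketched in plain Java. This is our own minimal version of the logic, not the talk's actual code; the names maxBatchSize and maxIntervalMs are assumptions.

```java
public class BatchTrigger {
    private final int maxBatchSize;
    private final long maxIntervalMs;
    private int pending = 0;
    private long lastFlushMs;

    BatchTrigger(int maxBatchSize, long maxIntervalMs, long nowMs) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMs = maxIntervalMs;
        this.lastFlushMs = nowMs;
    }

    // Called for every incoming document; returns true when the bolt
    // should execute all queries and then clear the in-memory index.
    boolean onDocument(long nowMs) {
        pending++;
        if (pending >= maxBatchSize || nowMs - lastFlushMs >= maxIntervalMs) {
            pending = 0;
            lastFlushMs = nowMs;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BatchTrigger t = new BatchTrigger(3, 60_000, 0);
        System.out.println(t.onDocument(10));     // false (1 doc, 10 ms elapsed)
        System.out.println(t.onDocument(20));     // false
        System.out.println(t.onDocument(30));     // true  (batch size of 3 reached)
        System.out.println(t.onDocument(70_000)); // true  (60s interval elapsed)
    }
}
```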
20. Solr In-Memory Processing Bolt Issues
• Attempted to run Solr with in-memory index inside Storm bolt
• Solr 4.5 requires:
– http-client 4.2.3
– http-core 4.2.2
• Storm 0.8.2 & 0.9.0 require:
– http-client 4.1.1
– http-core 4.1
• Could exclude libraries from the super jar and rely on storm/lib, but Solr
expects SystemDefaultHttpClient from 4.2.3
• Could build Storm with newer version of libraries, but not
guaranteed to work
21. Lucene In-Memory Processing Bolt
Advantages:
• Fast, Lightweight
• No Dependency Conflicts
• RAMDirectory backed
• Easy Solr to Lucene Document Conversion
• Solr Schema based
Bolt
Lucene Index
RAMDirectory
1. Initialization
– Parse Common Solr Schema
– Replace Solr Classes
2. Add Documents
– Convert SolrInputDocument to Lucene
Document
– Add to index
22. Lucene In-Memory Processing Bolt
Read/Parse/Update the Solr schema file using StAX
Create an IndexSchema from the new schema data

public void addDocument(SolrInputDocument doc) throws Exception {
    if (doc != null) {
        Document luceneDoc = solrDocumentConverter.convert(doc);
        indexWriter.addDocument(luceneDoc);
        indexWriter.commit();
    }
}

public Document convert(SolrInputDocument solrDocument) throws Exception {
    return DocumentBuilder.toDocument(solrDocument, indexSchema);
}
23. Prototype Solution
• Infrastructure:
– 8 node cluster on Amazon EC2
– Each VM has 2 cores and 8G of memory
• Data:
– 92,000 news article summaries
– Average file size: ~1k
• Queries:
– Generated 1 million sample queries
– Randomly selected terms from document set
– Stored in MariaDB (username, query string)
– Query Executor Bolt configured to use any subset of these queries
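Generating sample queries from randomly selected corpus terms can be sketched as follows. This is purely illustrative of the approach described above; the vocabulary, term counts, and method names are ours, not the actual generator used in the prototype.

```java
import java.util.*;

public class QueryGenerator {
    // Build `count` random conjunctive queries of 1-3 terms drawn from
    // the vocabulary; seeded so runs are repeatable.
    static List<String> generate(List<String> vocabulary, int count, long seed) {
        Random rnd = new Random(seed);
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int terms = 1 + rnd.nextInt(3);
            StringJoiner q = new StringJoiner(" AND ");
            for (int t = 0; t < terms; t++) {
                q.add("text:" + vocabulary.get(rnd.nextInt(vocabulary.size())));
            }
            queries.add(q.toString());
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("storm", "lucene", "solr", "cloud", "index", "search");
        // Prints 5 Lucene-syntax query strings, e.g. "text:storm AND text:index"
        generate(vocab, 5, 42).forEach(System.out::println);
    }
}
```

In the prototype these strings were stored in MariaDB alongside a username, and each Executor Bolt loaded its assigned subset.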
24. Prototype Solution – Monitoring Performance
• Metrics Provided by Storm UI
– Emitted: number of tuples emitted
– Transferred: number of tuples transferred (emitted * # follow-on bolts)
– Acked: number of tuples acknowledged
– Execute Latency: timestamp when execute function ends - timestamp when execute is
passed tuple
– Process Latency: timestamp when ack is called - timestamp when execute is passed tuple
– Capacity: % of the time in the last 10 minutes the bolt spent executing tuples
• Many metrics are sampled, so they don't always indicate problems
• Good measurement is comparing number of tuples transferred from spout, to number
of tuples acknowledged in bolt
– If transferred number is getting increasingly higher than number of acknowledged tuples, then
the topology is not keeping up with the rate of data
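The health check described above (transferred vs. acknowledged tuples) can be sketched as a simple comparison over successive Storm UI samples. A stdlib-only illustration with hypothetical numbers; the function name is ours:

```java
public class BacklogCheck {
    // Given matching samples of tuples transferred from the spout and
    // tuples acked by the bolts, report whether the gap between them
    // grew at every sample - the sign the topology is falling behind.
    static boolean fallingBehind(long[] transferred, long[] acked) {
        long prevGap = -1;
        for (int i = 0; i < transferred.length; i++) {
            long gap = transferred[i] - acked[i];
            if (prevGap >= 0 && gap <= prevGap) return false; // gap shrank or held steady
            prevGap = gap;
        }
        return true; // backlog grew at every sample
    }

    public static void main(String[] args) {
        // Hypothetical Storm UI samples taken over time.
        System.out.println(fallingBehind(
            new long[]{1_000, 2_000, 3_000},
            new long[]{  900, 1_500, 1_800})); // true: gap 100 -> 500 -> 1200
        System.out.println(fallingBehind(
            new long[]{1_000, 2_000, 3_000},
            new long[]{  990, 1_990, 2_990})); // false: gap steady at 10
    }
}
```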
25. Trial Runs – First Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout emitting as fast as possible
• Query execution at 1k docs or 60 seconds elapsed time
• Increased number of queries on each trial: 10k, 50k, 100k, 200k, 300k, 400k, 500k
Results:
• Articles emitted too fast for
bolts to keep up
• If data continued to stream
at this rate, topology would
back up and drop tuples
[Cluster diagram: Nodes 1-8, each with 4 worker slots running Query Bolts x4 and Result Bolts x4; Article Spout on Node 1]
26. Trial Runs – Second Attempt
• 8 workers, 1 Spout, 8 Query Executor Bolts, 8 Result Bolts
• Article spout now places articles on queue in background thread every 100ms
• Everything else the same…
Results:
• Topology performing much
better, keeping up with data
flow for query size of 10k,
50k, 100k, 200k
• Slows down around 300k
queries, approx 37.5k
queries/bolt
[Cluster diagram: same layout as the first attempt - Nodes 1-8 with Query Bolts x4 and Result Bolts x4 across the worker slots; Article Spout on Node 1]
27. Trial Runs – Third Attempt
• Each node has 4 worker slots, so let's scale up
• 16 workers, 1 spout, 16 Query Executor Bolts, 8 Result Bolts
• Everything else the same…
Results:
• 300k queries now keeping
up no problem
• 400k doing ok…
• 500k backing up a bit
[Cluster diagram: Nodes 1-8 with 16 Query Executor Bolts (2 per node) and 8 Result Bolts across the worker slots; Article Spout on Node 1]
28. Trial Runs – Fourth Attempt
• Next logical step, 32 workers, 1 spout, 32 Query Executor Bolts
• Didn’t result in anticipated performance gain, 500k still too much
• Hypothesizing that 2-core VMs might not be enough to get full performance from 4
worker slots
[Cluster diagram: Nodes 1-8 with 32 Query Executor Bolts (4 per node) across the worker slots; Article Spout on Node 1]
29. Trial Runs – Conclusions
• Most important factor affecting performance is relationship between data rate and
number of queries
• Ideal Storm configuration is dependent on hardware executing the topology
• Optimal configuration resulted in 250 queries per second per bolt, 4k queries per
second across topology
• High level of performance from relatively small cluster
30. Conclusions
• Low barrier to entry working with Storm
• Easy conversion of Solr indices to Lucene Indices
• Simple integration between Lucene and Storm; Solr more complicated
• Configuration is key, tune topology to your needs
• Overall strategy appears to scale well for our use case, limited only by hardware
31. Future Considerations
• Adjust the batch size on the query executor bolt
• Combine duplicate queries (between users) if your system has many duplicates
• Investigate additional optimizations during Solr to Lucene
• Run topology with more complex queries (fielded, filtered, etc.)
• Investigate handling of bolt failure
• If ratio of incoming data to queries was reversed, consider switching the groupings
between the spouts and executor bolts
33. Updates Since Solr Lucene Revolution 2013
• Storm has moved to top-level Apache project
– https://storm.incubator.apache.org/
– Released 0.9.1, 0.9.2, 0.9.3-rc1
– Newer releases resolve classpath issue with EmbeddedSolrServer
– Improved Netty transport, new topology visualization
Source: http://storm.incubator.apache.org/2014/06/25/storm092-released.html
34. How can we test our topology at various scales with minimal setup?
• Launch Storm clusters on Amazon Web Services
– storm-deploy - https://github.com/nathanmarz/storm-deploy
• Created before Storm moved to Apache, limited activity
• install-0.9.1 branch has updates to pull Storm from Apache repo
• lein deploy-storm --start --name mycluster --branch master --commit v0.9.2-incubating
• Always launches m1.small - https://github.com/nathanmarz/storm-deploy/issues/67
– storm-deploy-alternative- https://github.com/KasperMadsen/storm-deploy-alternative
• Java alternative to storm-deploy
• Latest Apache Storm releases not supported yet, works with 0.8.2 and 0.9.0
– wirbelsturm - https://github.com/miguno/wirbelsturm
• Based on Vagrant and Puppet
• http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/
• Steeper learning curve to get going
35. How can we test our topology at various scales with minimal setup?
• Make the topology independent of Storm cluster
– Previous spout required data to be on server where spout is running
• Better approach - poll an external source for data (Redis, Kafka, etc)
– Previous executor bolt loaded queries from a database
• Better approach - package a file of queries into topology jar
– Previous executor bolt expected a Solr config directory on the server
• Better approach – package config into topology jar, extract from classpath to
disk on start up
[Diagram: Redis Spout → Executor Bolt (queries file and SOLR_HOME packaged in the topology jar) → Result Bolt, all running in the Storm Cluster]
36. Luwak
• Presentation by Flax at Solr Lucene Revolution 2013 in Dublin
– Turning Search Upside Down: Using Lucene for Very Fast Stored Queries
– https://www.youtube.com/watch?v=rmRCsrJp2A8&list=UUKuRrzEQYP8pfCgCN8il4gQ
• Open sourced by Flax shortly after
– https://github.com/flaxsearch/luwak
• True inverted search solution
– Index queries
– Turn an incoming document into a query
– Determine which queries match that document
• Easy to integrate into existing Storm solution
• Clean API and documentation
Monitor monitor = new Monitor(
    new LuceneQueryParser("field"),
    new TermFilteredPresearcher());

MonitorQuery mq = new MonitorQuery("query1", "field:text");
monitor.update(mq);

InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document,
        new StandardTokenizer(Version.LUCENE_50))
    .build();

SimpleMatcher matches = monitor.match(doc, SimpleMatcher.FACTORY);
37. Performance Comparison
• How fast can we process all 92k
articles with varying query sizes?
• Performance comparison outside of
Storm, single-thread Java process
• Solr & Lucene solutions batch docs
– Allow 1,000 docs to be added to in-memory
index
– Execute all queries, clear, start over
• Luwak evaluates one document at a
time against indexed queries
38. Wrap-Up
• Conclusion
– Storm = scalable stream processing framework
– Luwak = high performance inverted search solution
– Luwak + Storm = scalable, high performance, inverted search solution!
• Contact Info
– bende_bryan@bah.com / Twitter @bbende
– conlin_joshua@bah.com / Twitter @jmconlin
• Thanks for having us!