This document provides information about integrating Apache Solr and Apache Spark. It discusses using Solr as a data source and sink for Spark applications, including indexing data from Spark jobs into Solr in real-time and exposing Solr query results as Spark RDDs. The document also summarizes the Spark Streaming and RDD APIs and provides code examples for indexing tweets from Spark Streaming into Solr and reading from Solr into a DataFrame.
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal... (thelabdude)
My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper, 2) Operations concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and is production ready.
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will demonstrate how to provision, configure, and manage a SolrCloud cluster in Amazon EC2, using a Fabric/boto based solution for automating SolrCloud operations. Attendees will come away with a solid understanding of how to operate a large-scale Solr cluster, as well as tools to help them do it. Tim will also demonstrate these tools live during his presentation. Covered technologies include: Apache Solr, Apache ZooKeeper, Linux, Python, Fabric, boto, Apache Kafka, Apache JMeter.
Scaling Through Partitioning and Shard Splitting in Solr 4 (thelabdude)
Over the past several months, Solr has reached a critical milestone of being able to elastically scale-out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we've scaled our largest Solr 4 index to nearly 900M documents and growing. As our index grows, so does our need to manage this growth.
In practice, it's common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you'll learn about new features in Solr to help manage large-scale clusters. Specifically, we'll cover data partitioning and shard splitting.
Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We'll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.
Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.
How to make a simple cheap high availability self-healing solr cluster (lucenerevolution)
Presented by Stephane Gamard, Chief Technology Officer, Searchbox
In this presentation we aim to show how to make a high-availability Solr cloud with 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure which is self-healing using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relation between collections, Solr cores, shards, and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with a particular emphasis on cluster state status/monitoring and Solr collection configuration. The core of our presentation will be demonstrated using a live cluster.
We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto-generate new nodes, attach them to the cluster, and assign them shards (selecting between missing shards or replication for HA). We will show that using a high replication factor it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management of the architecture. Future work discussions, which might be engaged using an open source effort, include monitoring activity of individual nodes so as to scale the cluster according to traffic and usage.
ApacheCon NA 2015 Spark / Solr Integration (thelabdude)
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing in Solr, such as sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.
Solr Exchange: Introduction to SolrCloud (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.
An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.
Solr cluster with SolrCloud at lucenerevolution (tutorial) (searchbox-com)
In this presentation we aim to show how to make a high-availability Solr cloud with 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure which is self-healing using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relation between collections, Solr cores, shards and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with a particular emphasis on cluster state status/monitoring and Solr collection configuration. The core of our presentation will be demonstrated using a live cluster. We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto-generate new nodes, attach them to the cluster, and assign them shards (selecting between missing shards or replication for HA). We will show that using a high replication factor it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management of the architecture. Future work discussions, which might be engaged using an open source effort, include monitoring activity of individual nodes so as to scale the cluster according to traffic and usage.
These slides were presented at the Great Indian Developer Summit 2014 at Bangalore. See http://www.developermarch.com/developersummit/session.html?insert=ShalinMangar2
"SolrCloud" is the name given to Apache Solr's feature set for fault tolerant, highly available, and massively scalable capabilities. SolrCloud has enabled organizations to scale, impressively, into the billions of documents with sub-second search!
Solr Compute Cloud - An Elastic SolrCloud Infrastructure (Nitin S)
Scaling search platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem. BloomReach has implemented Sc2, which is an elastic Solr infrastructure for Big Data applications, supporting heterogeneous workloads and hosted in the cloud. It dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides various high availability features such as differential real-time streaming, disaster recovery, context aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents in millisecond response times with a load ranging in the order of 200-300K QPS.
This presentation will describe an innovative implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how each of these components interacts to make the infrastructure truly elastic, real time, and robust while serving latency needs.
Webinar: Solr & Spark for Real Time Big Data Analytics (Lucidworks)
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
Cross Datacenter Replication, aka CDCR, has been a long-requested feature in Apache Solr. In this talk, we will discuss CDCR as released in Apache Solr 6.0 and beyond to understand its use cases, limitations, setup, and performance. We will also take a quick look at the future enhancements that can further simplify and scale this feature.
In the big data world, our data stores communicate over an asynchronous, unreliable network to provide a facade of consistency. However, to really understand the guarantees of these systems, we must understand the realities of networks and test our data stores against them.
Jepsen is a tool which simulates network partitions in data stores and helps us understand the guarantees of our systems and their failure modes. In this talk, I will help you understand why you should care about network partitions and how we can test datastores against partitions using Jepsen. I will explain what Jepsen is, how it works, and the kinds of tests it lets you create. We will try to understand the subtleties of distributed consensus, the CAP theorem and demonstrate how different data stores such as MongoDB, Cassandra, Elastic and Solr behave under network partitions. Finally, I will describe the results of the tests I wrote using Jepsen for Apache Solr and discuss the kinds of rare failures which were found by this excellent tool.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
Productionizing Spark and the Spark Job Server (Evan Chan)
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
These slides give an overview of the different parts of Apache Spark.
We analyze the Spark shell in both Scala and Python. Then we consider Spark SQL with an introduction to the DataFrame API. Finally we describe Spark Streaming and give some code examples.
Topics: spark-shell, pyspark, HDFS, how to copy a file to HDFS, Spark transformations, Spark actions, Spark SQL (Shark), Spark Streaming, stateless vs. stateful streaming transformations, sliding windows, examples
This presentation is focused on the architecture, scalability concerns, performance bottlenecks, operational characteristics and lessons learned while designing and implementing Yammer's distributed real-time search system. Yammer is an enterprise social network SaaS offering with over 100,000 networks (including 85% of the Fortune 100) and nearly 2 million users. The search system we developed scales well up to 1B messages and serves as a foundation for the knowledge-base analysis services Yammer is developing.
AWS re:Invent 2016: Searching Inside Video at Petabyte Scale Using Spot (WIN307) (Amazon Web Services)
Video is a "last-mile problem" for search technology. Unlike webpages, documents, and email, content in videos has traditionally been impossible to search. Recent advances in automated speech and text recognition, however, let businesses and universities search inside video assets as easily as inside textual content. In this session, you'll learn how Panopto is using AWS to solve the video-search problem at scale, while saving over 50% in operating costs by taking advantage of Spot instances. We discuss the cross-platform architecture that combines Windows and Linux to provide cost-effective video processing and search indexing. We also dive deep into scaling Spot elastically based on user demand, handling fallback situations when instances are revoked, and using the Spot bidding process to optimize cost structure. Finally, we discuss future plans to reduce operating costs even further through Spot fleets and grid processing.
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014 (Shalin Shekhar Mangar)
The traditional and typical search use case is the one large search collection distributed among many nodes and shared by all users. However, there is a class of applications which need a large number of small or medium collections which can be used, managed and scaled separately. This talk will cover our effort in helping a client set up a large scale SolrCloud setup with thousands of collections running on hundreds of nodes. I will describe the bottlenecks that we found in SolrCloud when running a large number of collections. I will also take you through the multiple features and optimizations that we contributed to Apache Solr to reduce or remove the choke points in the system. Finally, I will talk about the benchmarking process and the lessons learned from supporting such an installation in production.
This is an introductory tutorial to Apache Spark at the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Abstract –
Spark 2 is here; while Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Jump Start on Apache Spark 2.2 with Databricks (Anyscale)
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark and GraphX, with a few examples.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Author: Stefan Papp, Data Architect at "The unbelievable Machine Company". An overview of Big Data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
An engine to process big data in a faster (than MR), easier and extremely scalable way. An open source, parallel, in-memory processing, cluster computing framework. A solution for loading, processing and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command line interface.
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
Similar to NYC Lucene/Solr Meetup: Spark / Solr (20)
3. Solr & Spark
https://github.com/LucidWorks/spark-solr/
• Indexing from Spark
• Reading data from Solr
• Solr data as a Spark SQL DataFrame
• Interacting with Solr from the Spark shell
• Document Matching
• Reading Term vectors from Solr for MLlib
4. About Me …
• Solr user since 2010, committer since April 2014, work for Lucidworks, PMC member ~ May 2015
• Focus mainly on SolrCloud features … and bin/solr!
• Release manager for Lucene / Solr 5.1
• Co-author of Solr in Action
• Several years experience working with Hadoop, Pig, Hive, ZooKeeper; Spark about 9 months …
• Other contributions include Solr on YARN, Solr Scale Toolkit, and Solr/Storm integration project on github
5. About Solr
• Vibrant, thriving open source community
• Solr 5.2.1 just released!
Pluggable authentication and authorization
~2x indexing performance w/ replication
http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
Field cardinality estimation using HyperLogLog
Rule-based replica placement strategy
• Deploy to YARN cluster using Slider
6. Spark Overview
• Wealth of overview / getting started resources on the Web
Start here -> https://spark.apache.org/
Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce
Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo's previous record while using 10x less computing power)
• Unified platform for Big Data
Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
• Write code in Java, Scala, or Python … REPL interface too
• Runs on YARN (or Mesos), plays well with HDFS
8. Physical Architecture
• Spark Master (daemon)
Keeps track of live workers; Web UI on port 8080
Losing a master prevents new applications from being executed; can achieve HA using ZooKeeper and multiple master nodes
• My Spark App: spark-solr-1.0.jar (w/ shaded deps) with the SparkContext (driver)
RDD Graph; DAG Scheduler; Task Scheduler (restarts failed tasks); Block tracker; Shuffle tracker
• Spark Worker Node (1...N of these)
Spark Slave (daemon) plus Spark Executor (JVM process); the executor runs in a separate process than the slave daemon
Tasks: each task works on some partition of a data set to apply a transformation or action
Cache
• Tasks are assigned based on data-locality: when selecting which node to execute a task on, the master takes into account data locality
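Not on the original slide, but a minimal Java sketch of how "My Spark App" above attaches its driver to the pictured cluster; the standalone master URL and jar name are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Placeholder master URL; in standalone mode the Spark Master daemon serves its web UI on port 8080
SparkConf conf = new SparkConf()
    .setAppName("my-spark-app")
    .setMaster("spark://spark-master:7077")
    .setJars(new String[] { "spark-solr-1.0.jar" });  // app jar (w/ shaded deps) shipped to each executor

// Creating the SparkContext registers the driver with the master,
// which in turn launches executors on the worker nodes
JavaSparkContext jsc = new JavaSparkContext(conf);

// ... build and run RDDs / DStreams here ...

jsc.stop();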
10. RDD Illustrated: Word count
• val file = spark.textFile("hdfs://...") creates a file RDD from the text data in HDFS (e.g. "quick brown fox jumped …", "quick brownie recipe …", "quick drying glue …"), partitioned by HDFS blocks (Partition 1, Partition 2, Partition 3)
• file.flatMap(line => line.split(" ")) splits lines into words: quick, brown, fox, quick, quick, …
• map(word => (word, 1)) maps words into pairs with a count of 1: (quick,1), (brown,1), (fox,1), (quick,1), (quick,1), …
• reduceByKey(_ + _) sends all pairs with the same key to the same reducer and sums them: (quick,1), (quick,1), (quick,1) becomes (quick,3); this is a shuffle across machine boundaries
• Executors assigned based on data-locality if possible; narrow transformations occur in the same executor
• Spark keeps track of the transformations made to generate each RDD

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
               .map(word => (word, 1))
               .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
11. Understanding Resilient Distributed Datasets (RDD)
• Read-only partitioned collection of records with fault-tolerance
• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
• If a partition is lost, RDDs can be re-computed by re-playing the transformations
• User can choose to persist an RDD (for reusing during interactive data-mining)
• User can control partitioning scheme
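Not from the deck: a minimal Java sketch of the bullets above, showing a coarse-grained transformation tracked as lineage plus an explicit persist so the RDD can be reused across actions (the local master and sample data are placeholders).

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Read-only, partitioned collection of records (could also come from HDFS, Solr, etc.)
JavaRDD<String> lines = jsc.parallelize(Arrays.asList(
    "quick brown fox jumped", "quick brownie recipe", "quick drying glue"), 3);

// Coarse-grained transformation: Spark records the lineage (a filter over lines),
// so a lost partition can be re-computed by replaying the transformation
JavaRDD<String> quickLines = lines.filter(line -> line.startsWith("quick"));

// Persist because we plan to reuse this RDD across several actions (interactive data mining)
quickLines.cache();

long total = quickLines.count();                                         // first action materializes and caches
long withFox = quickLines.filter(line -> line.contains("fox")).count();  // reuses the cached partitions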
12. Spark & Solr Integration
• https://github.com/LucidWorks/spark-solr/
• Streaming applications
Real-time, streaming ETL jobs
Solr as sink for Spark job
Real-time document matching against stored queries
• Distributed computations (interactive data mining, machine learning)
Expose results from Solr query as Spark RDD (resilient distributed dataset)
Optionally process results from each shard in parallel
Read millions of rows efficiently using deep paging
SparkSQL DataFrame support (uses Solr schema API) and Term Vectors too!
13. Spark Streaming: Nuts & Bolts
• Transform a stream of records into small, deterministic batches
Discretized stream: sequence of RDDs
Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
Low-latency micro batches
Time to process a batch must be less than the batch interval time
• Two types of operators:
Transformations (group by, join, etc)
Output (send to some external sink, e.g. Solr)
• Impressive performance!
4GB/s (40M records/s) on 100 node cluster with less than 1 second latency
Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
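Not from the deck: a minimal Java sketch of the micro-batch model described above. A 1-second batch interval chops the stream into a sequence of RDDs, a transformation runs on every batch, and an output operator hands each batch to a sink (the local master and the socket source on port 9999 are placeholders).

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("micro-batches").setMaster("local[2]");

// Discretized stream: incoming records are grouped into 1-second batches, each batch is an RDD
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

// Transformation operator, applied to every micro-batch
JavaDStream<String> shouted = lines.map(String::toUpperCase);

// Output operator; in the Solr-as-sink case this is where each batch would be indexed
shouted.print();

jssc.start();
jssc.awaitTermination();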
14. Spark Streaming Example: Solr as Sink
(Data flow: Twitter -> Spark Streaming -> Solr. Slide legend: provided by Spark, custom Java / Scala code, provided by Lucidworks.)

./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar twitter-to-solr -zkHost localhost:2181 -collection social

JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

map(): various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection)

JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

In class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor:
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
15. Spark Streaming Example: Solr as Sink
// start receiving a stream of tweets ...
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
public SolrInputDocument call(Status status) {
SolrInputDocument doc =
SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
doc.setField("provider_s", "twitter");
doc.setField("author_s", status.getUser().getScreenName());
doc.setField("type_s", status.isRetweet() ? "echo" : "post");
return doc;
}
}
);
// when ready, send the docs into a SolrCloud cluster
SolrSupport.indexDStreamOfDocs(zkHost, collection, docs);
16. com.lucidworks.spark.SolrSupport
public static void indexDStreamOfDocs(final String zkHost, final String collection, final int batchSize,
JavaDStream<SolrInputDocument> docs)
{
docs.foreachRDD(
new Function<JavaRDD<SolrInputDocument>, Void>() {
public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
solrInputDocumentJavaRDD.foreachPartition(
new VoidFunction<Iterator<SolrInputDocument>>() {
public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
final SolrServer solrServer = getSolrServer(zkHost);
List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
while (solrInputDocumentIterator.hasNext()) {
batch.add(solrInputDocumentIterator.next());
if (batch.size() >= batchSize)
sendBatchToSolr(solrServer, collection, batch);
}
if (!batch.isEmpty())
sendBatchToSolr(solrServer, collection, batch);
}
}
);
return null;
}
}
);
}
17. com.lucidworks.spark.ShardPartitioner
• Custom partitioning scheme for RDD using Solr’s DocRouter
• Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient
final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
pairs.partitionBy(shardPartitioner).foreachPartition(
new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
ConcurrentUpdateSolrClient cuss = null; // ... Initialize ConcurrentUpdateSolrClient once per partition
while (tupleIter.hasNext()) {
SolrInputDocument doc = tupleIter.next()._2();
cuss.add(doc);
}
}
});
18. SolrRDD: Reading data from Solr into Spark
• Can execute any query and expose as an RDD
• SolrRDD produces JavaRDD<SolrDocument>
• Use deep-paging if needed (cursorMark)
• Stream docs from Solr (vs. building lists on the server-side)
• More parallelism using a range filter on a numeric field (_version_)
e.g. 10 shards x 10 splits per shard == 100 concurrent Spark tasks
19. SolrRDD: Reading data from Solr into Spark
• The Spark driver app issues a query (q=*:*) against a Solr collection with multiple shards (Shard 1, Shard 2)
• SolrRDD reads the collection metadata from ZooKeeper and creates one RDD partition per shard (Partition 1, Partition 2)
• Each partition queries its shard directly with q=*:*&rows=1000&distrib=false&cursorMark=*
• Results are streamed back from Solr as a JavaRDD<SolrDocument>
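Not from the deck, but a sketch of the cursorMark deep-paging loop the diagram above relies on, written against plain SolrJ; the ZooKeeper address and collection name are placeholders. SolrRDD runs a loop like this against each shard replica (with distrib=false) so the shards can be read in parallel.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

CloudSolrClient solr = new CloudSolrClient("localhost:2181");
solr.setDefaultCollection("tweets");

SolrQuery query = new SolrQuery("*:*");
query.setRows(1000);
query.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark requires a sort on the uniqueKey field

String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
  query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solr.query(query);
  // stream this page of SolrDocument results into the consuming Spark partition here
  String nextCursorMark = rsp.getNextCursorMark();
  done = cursorMark.equals(nextCursorMark);  // seeing the same cursor twice means everything has been read
  cursorMark = nextCursorMark;
}
solr.close();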
20. Solr as a Spark SQL Data Source
• DataFrame is a DSL for distributed data manipulation
• Data source provides a DataFrame
• Uniform way of working with data from multiple sources
• Hive, JDBC, Solr, Cassandra, etc.
• Seamless integration with other Spark technologies: SparkR, Python, MLlib
…
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection”, "tweets");
DataFrame df = sqlContext.read().format("solr").options(options).load();
count = df.filter(df.col("type_s").equalTo(“echo")).count();
21. Spark SQL
Query Solr, then expose results as a SQL table
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection”, "tweets");
DataFrame df = sqlContext.read().format("solr").options(options).load();
df.registerTempTable("tweets");
sqlContext.sql("SELECT count(*) FROM tweets WHERE type_s='echo'");
22. Query Solr from the Spark Shell
Interactive data mining with the full power of Solr queries
ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell
val solrDF = sqlContext.load("solr", Map(
"zkHost" -> "localhost:9983",
"collection" -> "gettingstarted"))
solrDF.registerTempTable("tweets")
sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show()
23. Reading Term Vectors from Solr
• Pull TF/IDF (or just TF) for each term in a field for each document in query
results from Solr
• Can be used to construct RDD<Vector> which can then be passed to MLlib:
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<Vector> vectors =
solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
vectors.cache();
KMeansModel clusters =
KMeans.train(vectors.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(vectors.rdd());
24. Document Matching using Stored Queries
• For each document, determine which of a large set of stored queries matches.
• Useful for alerts, alternative flow paths through a stream, etc.
• Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match
• Matching framework; you have to decide where to load the stored queries from and what to do when matches are found
• Scale it using Spark … if you need to scale to many queries, check out Luwak
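A rough Java sketch (not from the deck) of the per-micro-batch matching idea above. It assumes you already have a SolrClient for the embedded / in-memory core and a hypothetical map of stored queries keyed by alert name; how the queries are loaded and what happens on a match is left to you, per the bullets above.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

// docs: one micro-batch of documents; storedQueries: alert name -> Solr query string
List<String> matchBatch(SolrClient embeddedSolr,
                        List<SolrInputDocument> docs,
                        Map<String, String> storedQueries) throws Exception {
  // 1) index the micro-batch into the small, in-memory index
  embeddedSolr.add(docs);
  embeddedSolr.commit();

  // 2) run every stored query against the tiny index; any hit means the query matched this batch
  //    (requesting the id field would tell you which documents in the batch matched)
  List<String> matched = new ArrayList<>();
  for (Map.Entry<String, String> e : storedQueries.entrySet()) {
    if (embeddedSolr.query(new SolrQuery(e.getValue())).getResults().getNumFound() > 0) {
      matched.add(e.getKey());
    }
  }

  // 3) clear the index so the next micro-batch starts from scratch
  embeddedSolr.deleteByQuery("*:*");
  embeddedSolr.commit();
  return matched;
}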
25. Document Matching using Stored Queries
(Data flow: Twitter -> map() -> DocFilterContext backed by stored queries. Slide legend: provided by Spark, custom Java / Scala code, provided by Lucidworks.)

JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

JavaDStream<SolrInputDocument> enriched =
  SolrSupport.filterDocuments(docFilterContext, …);

• The DocFilterContext gets the stored queries and indexes the docs into an EmbeddedSolrServer initialized from configs stored in ZooKeeper
• The DocFilterContext is the key abstraction that allows you to plug in how to store the queries and what action to take when docs match
27. Wrap-up and Q & A
Need more use cases :-)
Feel free to reach out to me with questions:
tim.potter@lucidworks.com / @thelabdude
Editor's Notes
Solr 5 – overview: http://www.slideshare.net/lucidworks/webinar-inside-apache-solr-5
Who is using Solr in production?
Anyone currently evaluating Solr and other technologies for a search project?
Anyone using Spark?
Started out as a research project at UC Berkeley – platform for exploring new areas of research in distributed systems / Big Data
Shorter paper: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record)
http://www.datanami.com/2014/10/10/spark-smashes-mapreduce-big-data-benchmark/
Highly optimized shuffle code and new network transport sub-system
Key abstraction – Resilient Distributed Dataset
Other projects using / moving to Spark:
Mahout - https://www.mapr.com/blog/mahout-spark-what%E2%80%99s-new-recommenders#.VI5CBWTF9kA
Hive
Pig
Internals talk: https://www.youtube.com/watch?v=dmL0N3qfSc8
Spark has all the same basic concepts around optimizing the shuffle stage (custom partitioning, combiners, etc)
Recently overhauled the shuffle and network transport subsystem to use Netty and zero-copy techniques
Can have multiple master nodes deployed for HA (leader is elected using ZooKeeper)
Akka and Netty under the covers
Execution Model:
Create a DAG of RDDs
Create logical execution plan for the DAG
Schedule and execute individual tasks across the cluster
Spark organizes tasks into stages; boundaries between stages are when the data needs to be re-organized (such as doing a groupBy or reduce)
Stages are super operations that happen locally
A task is data + computation
Tasks get scheduled based on data locality
Great presentation by Spark founder: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
MapReduce suffers from having to write intermediate data to disk to be used by other jobs or iterations; no good way to share data across jobs / iterations
Data locality is still important
Spark chooses to share data across iterations / interactive queries – the hard part is fault-tolerance, which it achieves using an RDD
Less boilerplate code
One way to think about Spark is that it is a more intelligent optimizer that's very good at keeping data that is reused in memory
reliance on persistent storage to provide fault tolerance and its one-pass computation model
parallel programs look very much like sequential programs, which make them easier to develop and reason about
Different color boxes indicate partitions of the same RDD
Some text data in HDFS, partitioned by HDFS blocks
Spark assigns tasks to process the blocks based on data locality
Narrow transformations occur in the same executor (no shuffling across machines)
Spark RDD: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Parallel computations using a restricted set of high-level operators
Applied to *ALL* elements of a dataset at once
Log one operation that is applied to many elements
coarse-grained updates that apply the same operation to many data items
Lineage + partition == low overhead recovery
Achieve fault-tolerance by exposing coarse-grained transformations (steps are logged, which can be re-played if needed). If a partition is lost, RDDs contain enough information to re-compute the data
Parallel applications apply the same transformations to many data items
Persist – says to keep the RDD in-memory (probably because we’re going to be reusing it)
Lazy execution: Spark will generate a DAG of stages to compute the result of an action
The two technologies combined together provide near real-time processing, ad hoc queries, batch processing / deep analytics, machine learning, and horizontal scaling
Aims to be a framework to help reduce boilerplate and get you started quickly, but you still have to write some code!
Basically, split a stream into very small discretized batches (1 second is typical) and then all the other Spark RDD goodies apply
AMP Camp Tathagata Das
Probably on-par with Storm Trident (micro-batching)
A series of very small deterministic batch jobs
http://www.slideshare.net/pacoid/tiny-batches-in-the-wine-shiny-new-bits-in-spark-streaming
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Don’t have to have a separate stack for streaming apps e.g. instead of having Storm for streaming and Spark for interactive data mining, you just have Spark
Spark chops live stream up into small batches of N seconds (each batch being an RDD)
DStream is batch of records to be processed
DStream is processed in micro-batches (controlled when the job is configured)
map() step converts Twitter4J Status objects into SolrInputDocuments OR we could just send JSON directly to a Fusion pipeline and then do the mapping in the pipeline.
This slide is here to show some ugliness that our Solr framework hides from end-users
SolrSupport – removes need to worry about Spark boilerplate for sending a stream of docs to Solr
Need to fix SOLR-3382 to get better error reporting when streaming docs to Solr using CUSS
Basic process is to query Solr, expose Results as a JavaSchemaRDD, register as a temp table, perform queries
Use Solr’s SchemaAPI to get metadata about fields in the query
You can also get a Spark vector by doing: Vector vector = SolrTermVector.newInstance(String docId, HashingTF hashingTF, String rawText) // uses the Lucene StandardAnalyzer
Spark RDD: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Parallel computations using a restricted set of high-level operators
Achieve fault-tolerance by exposing coarse-grained transformations (steps are logged, which can be re-played if needed). If a partition is lost, RDDs contain enough information to re-compute the data
Parallel applications apply the same transformations to many data items
When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions