Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to handle even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering the basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing in Solr, with examples such as sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.
Webinar: Solr & Spark for Real Time Big Data Analytics (Lucidworks)
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal... (thelabdude)
My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper, 2) Operations concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and is production ready.
Faster Data Analytics with Apache Spark using Apache Solr (Chitturi Kiran)
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us faster in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be:
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
Presented at Lucene Revolution - http://sched.co/BAwV
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
LuceneRDD for (Geospatial) Search and Entity Linkage (zouzias)
In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provides its users a highly flexible approach to entity linkage. This flexibility is due to Lucene's powerful query language, which is able to combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
Lastly and time permitting, I will present ShapeLuceneRDD which enhances LuceneRDD with geospatial queries.
The next major release of Solr is right around the corner! Join Solr Committer Cassandra Targett and Lucidworks SVP of Engineering Trey Grainger for a first look into what’s included in the upcoming release.
Ingesting and Manipulating Data with JavaScript (Lucidworks)
Data in the wild isn’t always in the right format we need for search or even mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format to make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next generation engine for large scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how that abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will demonstrate how to provision, configure, and manage a SolrCloud cluster in Amazon EC2, using a Fabric/boto based solution for automating SolrCloud operations. Attendees will come away with a solid understanding of how to operate a large-scale Solr cluster, as well as tools to help them do it. Tim will also demonstrate these tools live during his presentation. Covered technologies include: Apache Solr, Apache ZooKeeper, Linux, Python, Fabric, boto, Apache Kafka, Apache JMeter.
Solr Exchange: Introduction to SolrCloud (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.
Leveraging the Power of Solr with Spark (QAware GmbH)
Lucene Revolution 2016, Boston: Talk by Johannes Weigend (@JohannesWeigend, CTO at QAware).
Abstract: Solr is a distributed NoSQL database with impressive search capabilities. Spark is the new megastar in the distributed computing universe. In this code-intense session we show you how to combine both to solve real-time search and processing problems. We will show you how to set up a Solr/Spark combination from scratch and develop first jobs that run distributed over shared Solr data. We will also show you how to use this combination for your next-generation BI platform.
Big data insights with Red Hat JBoss Data Virtualization (Kenneth Peeples)
You’re hearing a lot about big data these days. And big data and the technologies that store and process it, like Hadoop, aren’t just new data silos. You might be looking to integrate big data with existing enterprise information systems to gain better understanding of your business. You want to take informed action.
During this session, we’ll demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data. You’ll learn how Red Hat JBoss Data Virtualization:
Can help you integrate your existing and growing data infrastructure.
Integrates big data with your existing enterprise data infrastructure.
Lets non-technical users access big data result sets.
We’ll also provide typical use cases and examples and a demonstration of the integration of Hadoop sentiment analysis with sales data.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
This is an introductory tutorial to Apache Spark at the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk covers a basic introduction to Apache Spark and its various components, like MLlib, Shark, and GraphX, with a few examples.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of big data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its three main themes (easier, faster, and smarter) are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and partly self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Muktadiur Rahman
Team Lead,
M&H Informatics(BD) Ltd
2. Solr & Spark
• Spark Overview / High-level Architecture
• Indexing from Spark
• Reading data from Solr + term vectors & Spark SQL
• Document Matching
• Q&A
3. About Me …
• Solr user since 2010, committer since April 2014, work for Lucidworks
• Focus mainly on SolrCloud features … and bin/solr!
• Release manager for Lucene / Solr 5.1
• Co-author of Solr in Action
• Several years’ experience working with Hadoop, Pig, Hive, ZooKeeper, but only started using Spark about 6 months ago …
• Other contributions include Solr on YARN, Solr Scale Toolkit, and the Spark-Solr integration project on github
4. Spark Overview
• Wealth of overview / getting started resources on the Web
  Start here -> https://spark.apache.org/
  Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce: Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using 10x less computing power)
• Unified platform for Big Data: great for iterative algorithms (PageRank, K-Means, logistic regression) & interactive data mining
• Write code in Java, Scala, or Python … REPL interface too
• Runs on YARN (or Mesos), plays well with HDFS
6. Physical Architecture
• Spark Master (daemon): keeps track of live workers; Web UI on port 8080. Losing a master prevents new applications from being executed; can achieve HA using ZooKeeper and multiple master nodes.
• Spark Worker Node (1...N of these): runs the Spark Slave (daemon). The Spark Executor (JVM process) runs in a separate process than the slave daemon, loads the application jar (spark-solr-1.0.jar w/ shaded deps), and holds a cache; each task works on some partition of a data set to apply a transformation or action.
• My Spark App / SparkContext (driver): RDD Graph, DAG Scheduler, Task Scheduler, Block tracker, Shuffle tracker; restarts failed tasks.
• Tasks are assigned based on data locality: when selecting which node to execute a task on, the master takes into account data locality.
7. Spark vs. Hadoop’s Map/Reduce
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
8. RDD Illustrated: Word count
• val file = spark.textFile("hdfs://...") creates a file RDD from the text data in HDFS, partitioned by HDFS blocks (Partition 1, Partition 2, Partition 3, …).
• file.flatMap(line => line.split(" ")) splits lines into words, and map(word => (word, 1)) maps each word into a pair with a count of 1; e.g. "quick brown fox jumped …" becomes (quick,1), (brown,1), (fox,1).
• reduceByKey(_ + _) sends all pairs with the same key to the same reducer and sums them, so the (quick,1) pairs from "quick brown fox jumped …", "quick brownie recipe …", and "quick drying glue …" become (quick,3); this step shuffles data across machine boundaries.
• Executors are assigned based on data-locality if possible; narrow transformations occur in the same executor.
• Spark keeps track of the transformations made to generate each RDD.
9. Understanding Resilient Distributed Datasets (RDD)
• Read-only partitioned collection of records with smart(er) fault-tolerance
• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
• If a partition is lost, RDDs can be re-computed by re-playing the transformations
• User can choose to persist an RDD (for reusing during interactive data-mining)
• User can control partitioning scheme; default is based on the input source
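To make the lineage and persistence points concrete, here is a minimal, hypothetical Java sketch (sc, logLines, and the error-counting logic are illustrative assumptions, not from the deck; Java 8 lambdas are used for brevity and imports are omitted as on the other slides):
// Build an RDD from an external system (HDFS), then derive new RDDs through
// coarse-grained transformations; Spark logs the lineage rather than the data.
JavaRDD<String> logLines = sc.textFile("hdfs://...");
JavaRDD<String> errors = logLines.filter(line -> line.contains("ERROR"));
JavaPairRDD<String, Integer> errorsByHost = errors
  .mapToPair(line -> new Tuple2<>(line.split(" ")[0], 1))
  .reduceByKey((a, b) -> a + b);
// Persist because we plan to reuse this RDD interactively; if a partition is lost,
// Spark re-computes just that partition by replaying the logged transformations.
errorsByHost.persist(StorageLevel.MEMORY_ONLY());
System.out.println("hosts with errors: " + errorsByHost.count());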
10. Spark & Solr Integration
• https://github.com/LucidWorks/spark-solr/
• Streaming applications
Real-time, streaming ETL jobs
Solr as sink for Spark job
Real-time document matching against stored queries
• Distributed computations (interactive data mining, machine learning)
Expose results from Solr query as Spark RDD (resilient distributed dataset)
Optionally process results from each shard in parallel
Read millions of rows efficiently using deep paging
SparkSQL DataFrame support (uses Solr schema API) and Term Vectors too!
11. Spark Streaming: Nuts & Bolts
• Transform a stream of records into small, deterministic batches
Discretized stream: sequence of RDDs
Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
Low-latency micro batches
Time to process a batch must be less than the batch interval time
• Two types of operators:
Transformations (group by, join, etc)
Output (send to some external sink, e.g. Solr)
• Impressive performance!
4GB/s (40M records/s) on 100 node cluster with less than 1 second latency
Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
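As a tiny illustration of the micro-batch model (a hypothetical sketch, not from the deck; the socket source and 1-second interval are just example choices):
// Each 1-second batch interval produces one RDD of the records received in that window.
SparkConf conf = new SparkConf().setAppName("micro-batch-demo");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaDStream<Long> countsPerBatch = lines.count();  // transformation applied to every micro-batch
countsPerBatch.print();                            // output operator (stdout here, Solr in the next slides)
jssc.start();            // each batch must be processed within the batch interval
jssc.awaitTermination();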
12. Spark Streaming Example: Solr as Sink
Flow: Twitter → Spark Streaming → Solr. Submit the job with:
./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar twitter-to-solr -zkHost localhost:2181 -collection social
• JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters); (provided by Spark)
• A map() step in class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor (custom Java / Scala code) converts each twitter4j Status object into a SolrInputDocument, with room for various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection):
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});
• SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs); sends the documents to Solr (provided by Lucidworks)
Slide Legend: Provided by Spark / Custom Java-Scala code / Provided by Lucidworks
13. Spark Streaming Example: Solr as Sink
// start receiving a stream of tweets ...
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
public SolrInputDocument call(Status status) {
SolrInputDocument doc =
SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
doc.setField("provider_s", "twitter");
doc.setField("author_s", status.getUser().getScreenName());
doc.setField("type_s", status.isRetweet() ? "echo" : "post");
return doc;
}
}
);
// when ready, send the docs into a SolrCloud cluster
SolrSupport.indexDStreamOfDocs(zkHost, collection, docs);
14. com.lucidworks.spark.SolrSupport
public static void indexDStreamOfDocs(final String zkHost, final String collection, final int batchSize,
    JavaDStream<SolrInputDocument> docs)
{
  // For each micro-batch (RDD) in the stream ...
  docs.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
      public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
        // ... index each partition's documents from the executor where they live
        solrInputDocumentJavaRDD.foreachPartition(
          new VoidFunction<Iterator<SolrInputDocument>>() {
            public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
              final SolrServer solrServer = getSolrServer(zkHost);
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              while (solrInputDocumentIterator.hasNext()) {
                batch.add(solrInputDocumentIterator.next());
                if (batch.size() >= batchSize)
                  sendBatchToSolr(solrServer, collection, batch);
              }
              // Flush any remaining docs that didn't fill a whole batch
              if (!batch.isEmpty())
                sendBatchToSolr(solrServer, collection, batch);
            }
          }
        );
        return null;
      }
    }
  );
}
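The getSolrServer and sendBatchToSolr helpers are not shown on the slide; a rough sketch of what they could look like against the SolrJ 4.x API (an assumption for illustration, not the actual Lucidworks implementation):
// One CloudSolrServer per ZooKeeper connect string, shared within the executor JVM.
private static final Map<String, CloudSolrServer> solrServers = new ConcurrentHashMap<String, CloudSolrServer>();

public static synchronized SolrServer getSolrServer(String zkHost) {
  CloudSolrServer server = solrServers.get(zkHost);
  if (server == null) {
    server = new CloudSolrServer(zkHost);
    server.connect();
    solrServers.put(zkHost, server);
  }
  return server;
}

public static void sendBatchToSolr(SolrServer solrServer, String collection, List<SolrInputDocument> batch)
    throws Exception {
  UpdateRequest req = new UpdateRequest();
  req.setParam("collection", collection);  // route the update to the target collection
  req.add(batch);
  req.process(solrServer);
  batch.clear();                           // caller reuses the same list for the next batch
}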
15. Document Matching using Stored Queries
• For each document, determine which of a large set of stored queries matches.
• Useful for alerts, alternative flow paths through a stream, etc.
• Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match.
• Matching framework; you have to decide where to load the stored queries from and what to do when matches are found.
• Scale it using Spark … if you need to scale to many queries, check out Luwak.
16. Document Matching using Stored Queries
Flow: Twitter → map() → SolrSupport.filterDocuments() → matching documents.
• JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters); (provided by Spark)
• A map() step (custom Java / Scala code) converts each twitter4j Status object into a SolrInputDocument:
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});
• JavaDStream<SolrInputDocument> enriched = SolrSupport.filterDocuments(docFilterContext, …); (provided by Lucidworks) gets the stored queries, indexes each micro-batch of docs into an EmbeddedSolrServer initialized from configs stored in ZooKeeper, and determines which queries match.
• DocFilterContext is the key abstraction to allow you to plug in how to store the queries and what action to take when docs match.
Slide Legend: Provided by Spark / Custom Java-Scala code / Provided by Lucidworks
17. com.lucidworks.spark.ShardPartitioner
• Custom partitioning scheme for RDD using Solr’s DocRouter
• Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient
final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
pairs.partitionBy(shardPartitioner).foreachPartition(
  new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
    public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
      ConcurrentUpdateSolrClient cuss = null;
      while (tupleIter.hasNext()) {
        Tuple2<String, SolrInputDocument> tuple = tupleIter.next();
        // ... Initialize ConcurrentUpdateSolrClient once per partition
        cuss.add(tuple._2());
      }
    }
  });
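The pairs RDD above is assumed to be keyed by the document's unique key; a hypothetical way to build it from an existing JavaRDD<SolrInputDocument> (illustrative only, not from the slide):
JavaPairRDD<String, SolrInputDocument> pairs = docs.mapToPair(
  new PairFunction<SolrInputDocument, String, SolrInputDocument>() {
    public Tuple2<String, SolrInputDocument> call(SolrInputDocument doc) {
      // ShardPartitioner can then use the unique key with Solr's DocRouter to pick the target shard
      return new Tuple2<String, SolrInputDocument>((String) doc.getFieldValue("id"), doc);
    }
  });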
18. SolrRDD: Reading data from Solr into Spark
• Can execute any query and expose as an RDD
• SolrRDD produces JavaRDD<SolrDocument>
• Use deep-paging if needed (cursorMark)
• For reading full result sets where global sort order doesn’t matter, parallelize query execution by distributing requests across the Spark cluster
JavaRDD<SolrDocument> results = solrRDD.queryShards(jsc, solrQuery);
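Putting the calls from this slide together, a small end-to-end read might look like this (a sketch built only from the API shown in the deck; the query and field names are placeholders):
// Query SolrCloud and expose the matching documents to Spark as an RDD.
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRows(1000);                          // page size per request; cursorMark handles deep paging
solrQuery.setFields("id", "text_t", "type_s");
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<SolrDocument> results = solrRDD.queryShards(jsc, solrQuery);  // one request stream per shard
System.out.println("docs matching query: " + results.count());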
19. Reading Term Vectors from Solr
• Pull TF/IDF (or just TF) for each term in a field for each document in query results from Solr
• Can be used to construct RDD<Vector> which can then be passed to MLlib:
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<Vector> vectors =
solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
vectors.cache();
KMeansModel clusters =
KMeans.train(vectors.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(vectors.rdd());
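Once trained, the model can be applied back to the term vectors, e.g. to assign each document to a cluster (a hypothetical continuation of the slide's example, not from the deck):
// Assign each document's term vector to its nearest cluster centroid.
JavaRDD<Integer> clusterAssignments = clusters.predict(vectors);
System.out.println("sample assignments: " + clusterAssignments.take(10));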
20. Spark SQL
• Query Solr, then expose results as a SQL table
SolrQuery solrQuery = new SolrQuery(...);
solrQuery.setFields("text_t","type_s");
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<SolrDocument> solrJavaRDD = solrRDD.queryShards(jsc, solrQuery);
SQLContext sqlContext = new SQLContext(jsc);
DataFrame df =
solrRDD.applySchema(sqlContext, solrQuery, solrJavaRDD, zkHost, collection);
df.registerTempTable("tweets");
DataFrame results =
sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'");
List<Long> count = results.javaRDD().map(new Function<Row, Long>() {
public Long call(Row row) {
return row.getLong(0);
}
}).collect();
System.out.println("# of echos : "+count.get(0));
21. Wrap-up and Q & A
• Reference implementation of Solr and Spark on YARN
• Formal benchmarks for reads and writes to Solr
• Checkout SOLR-6816 – improving replication performance
• Add Spark support to Solr Scale Toolkit
• Integrate metrics to give visibility into performance
• More use cases …
Feel free to reach out to me with questions:
tim.potter@lucidworks.com / @thelabdude
Editor's Notes
Solr 5 – overview: http://www.slideshare.net/lucidworks/webinar-inside-apache-solr-5
Who is using Solr in production?
Anyone currently evaluating Solr and other technologies for a search project?
Anyone using Spark?
Started out as a research project at UC Berkeley – platform for exploring new areas of research in distributed systems / Big Data
Shorter paper: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record)
http://www.datanami.com/2014/10/10/spark-smashes-mapreduce-big-data-benchmark/
Highly optimized shuffle code and new network transport sub-system
Key abstraction – Resilient Distributed Dataset
Other projects using / moving to Spark:
Mahout - https://www.mapr.com/blog/mahout-spark-what%E2%80%99s-new-recommenders#.VI5CBWTF9kA
Hive
Pig
Internals talk: https://www.youtube.com/watch?v=dmL0N3qfSc8
Spark has all the same basic concepts around optimizing the shuffle stage (custom partitioning, combiners, etc)
Recently overhauled the shuffle and network transport subsystem to use Netty and zero-copy techniques
Can have multiple master nodes deployed for HA (leader is elected using ZooKeeper)
Akka and Netty under the covers
Execution Model:
Create a DAG of RDDs
Create logical execution plan for the DAG
Schedule and execute individual tasks across the cluster
Spark organizes tasks into stages; boundaries between stages are when the data needs to be re-organized (such as doing a groupBy or reduce)
Stages are super operations that happen locally
A task is data + computation
Tasks get scheduled based on data locality
Great presentation by Spark founder: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
MapReduce suffers from having to write intermediate data to disk to be used by other jobs or iterations; no good way to share data across jobs / iterations
Data locality is still important
Spark chooses to share data across iterations / interactive queries – the hard part is fault-tolerance, which it achieves using an RDD
Less boilerplate code
One way to think about Spark is that it is a more intelligent optimizer that is very good at keeping reused data in memory
reliance on persistent storage to provide fault tolerance and its one-pass computation model
parallel programs look very much like sequential programs, which make them easier to develop and reason about
Different color boxes indicate partitions of the same RDD
Some text data in HDFS, partitioned by HDFS blocks
Spark assigns tasks to process the blocks based on data locality
Narrow transformations occur in the same executor (no shuffling across machines)
Spark RDD: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Parallel computations using a restricted set of high-level operators
Applied to *ALL* elements of a dataset at once
Log one operation that is applied to many elements
coarse-grained updates that apply the same operation to many data items
Lineage + partition == low overhead recovery
Achieve fault-tolerance by exposing coarse-grained transformations (steps are logged, which can be re-played if needed). If a partition is lost, RDDs contain enough information to re-compute the data
Parallel applications apply the same transformations to many data items
Persist – says to keep the RDD in-memory (probably because we’re going to be reusing it)
Lazy execution: Spark will generate a DAG of stages to compute the result of an action
The two technologies combined together provide near real-time processing, ad hoc queries, batch processing / deep analytics, machine learning, and horizontal scaling
Aims to be a framework to help reduce boilerplate and get you started quickly, but you still have to write some code!
Basically, split a stream into very small discretized batches (1 second is typical) and then all the other Spark RDD goodies apply
AMP Camp Tathagata Das
Probably on-par with Storm Trident (micro-batching)
A series of very small deterministic batch jobs
http://www.slideshare.net/pacoid/tiny-batches-in-the-wine-shiny-new-bits-in-spark-streaming
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Don’t have to have a separate stack for streaming apps e.g. instead of having Storm for streaming and Spark for interactive data mining, you just have Spark
Spark chops live stream up into small batches of N seconds (each batch being an RDD)
DStream is batch of records to be processed
DStream is processed in micro-batches (controlled when the job is configured)
map() step converts Twitter4J Status objects into SolrInputDocuments OR we could just send JSON directly to a Fusion pipeline and then do the mapping in the pipeline.
This slide is here to show some ugliness that our Solr framework hides from end-users
SolrSupport – removes need to worry about Spark boilerplate for sending a stream of docs to Solr
Spark RDD: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Parallel computations using a restricted set of high-level operators
Achieve fault-tolerance by exposing coarse-grained transformations (steps are logged, which can be re-played if needed). If a partition is lost, RDDs contain enough information to re-compute the data
Parallel applications apply the same transformations to many data items
When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions
Need to fix SOLR-3382 to get better error reporting when streaming docs to Solr using CUSS
You can also get a Spark vector by doing: Vector vector = SolrTermVector.newInstance(String docId, HashingTF hashingTF, String rawText) // uses the Lucene StandardAnalyzer
Basic process is to query Solr, expose Results as a JavaSchemaRDD, register as a temp table, perform queries
Use Solr’s SchemaAPI to get metadata about fields in the query