Faster Data Analytics with Apache Spark using Apache Solr - Chitturi Kiran
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us faster in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize the SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be:
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
Presented at Lucene Revolution - http://sched.co/BAwV
Webinar: Solr & Spark for Real Time Big Data Analytics - Lucidworks
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
ApacheCon NA 2015 Spark / Solr Integration - thelabdude
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing in Solr, with techniques such as sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.
Ingesting and Manipulating Data with JavaScript - Lucidworks
Data in the wild isn’t always in the right format we need for search or even mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format to make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
The next major release of Solr is right around the corner! Join Solr Committer Cassandra Targett and Lucidworks SVP of Engineering Trey Grainger for a first look into what’s included in the upcoming release.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Got data? Let's make it searchable! This presentation will demonstrate getting documents into Solr quickly, will provide some tips in adjusting Solr's schema to match your needs better, and finally will discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
If you’re already a SQL user then working with Hadoop may be a little easier than you think, thanks to Apache Hive. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
This cheat sheet covers:
-- Query
-- Metadata
-- SQL Compatibility
-- Command Line
-- Hive Shell
Beyond SQL: Speeding up Spark with DataFrames - Databricks
In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next generation engine for large scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's function idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Jump Start with Apache Spark 2.0 on Databricks - Databricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Large scale, interactive ad-hoc queries over different datastores with Apache... - jaxLondonConference
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the Open Source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and able to process Petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
Leveraging Hadoop in your PostgreSQL Environment - Jim Mlodgenski
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.
Solr as a Spark SQL Datasource
2. Solr as a Spark SQL Datasource
Kiran Chitturi,
Lucidworks
3. Solr & Spark
• A few interesting things about Spark
• Overview of SparkSQL and DataFrames
• Solr as a SparkSQL DataSource in depth
• Use Lucene for text analysis in ML pipelines
• Example Use Case: Lucidworks Fusion and Spark
5. What’s interesting about Spark?
• Wealth of overview / getting started resources on the Web
➢ Start here -> https://spark.apache.org/
➢ Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce
➢ Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using 10x less computing power)
• Unified platform for Big Data
➢ Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
➢ Runs on YARN, Mesos, and plays well with HDFS
• Nice API for Java, Scala, Python, R, and sometimes SQL … REPL interface too
• >14,200 Issues in JIRA, 1000+ code contributors, 2.0 coming soon!
7. Physical Architecture (Standalone)
[Diagram: standalone Spark cluster with a master, a driver application, and 1...N worker nodes]
• Spark Master (daemon)
➢ Keeps track of live workers
➢ Web UI on port 8080
➢ Task Scheduler
➢ Restart failed tasks
• My Spark App: SparkContext (driver), packaged as my-spark-job.jar (w/ shaded deps)
➢ RDD Graph
➢ DAG Scheduler
➢ Block tracker
➢ Shuffle tracker
• Spark Worker Node (1...N of these)
➢ Spark Slave (daemon)
➢ Spark Executor (JVM process) running tasks, with a cache
➢ Executor runs in a separate process from the slave daemon
➢ Each task works on some partition of a data set to apply a transformation or action
• Losing a master prevents new applications from being executed
➢ Can achieve HA using ZooKeeper and multiple master nodes
• Tasks are assigned based on data locality
➢ When selecting which node to execute a task on, the master takes data locality into account
8. Spark SQL
• DataSource API for reading from and writing to external data sources
• DataFrame is an RDD[Row] + schema
• Secret sauce is logical plan optimizer
• SQL or relational operators on DF
• JDBC / ODBC
• UDFs!
• Machine Learning Pipelines
9. Solr as a Spark SQL Data Source
• Read/write data from/to Solr as DataFrame
• Use Solr Schema API to access field-level metadata
• Push predicates down into Solr query constructs, e.g. fq clause
• Deep-paging, shard partitioning, intra-shard splitting, streaming results
// Connect to Solr
val opts = Map("zkhost" -> "localhost:9983", "collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
// Perform SQL queries
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()
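The predicate push-down bullet above can be illustrated with a small, self-contained sketch. The `Filter` types and the `FilterPushdown` object below are hypothetical stand-ins, loosely modeled on the kind of filters a Spark SQL data source receives; they are not the spark-solr classes, and the real translation may differ. The sketch only shows the idea of turning a WHERE-style predicate into a Solr `fq` string.

```scala
// Hypothetical filter types for illustration only (not the Spark or spark-solr classes).
sealed trait Filter
final case class EqualTo(field: String, value: String) extends Filter
final case class GreaterThan(field: String, value: Long) extends Filter
final case class LessThanOrEqual(field: String, value: Long) extends Filter
final case class IsNotNull(field: String) extends Filter

object FilterPushdown {
  // Translate one filter into a Solr filter query (fq) string.
  def toFq(filter: Filter): String = filter match {
    case EqualTo(f, v)         => f + ":\"" + v + "\"" // exact match on a quoted term
    case GreaterThan(f, v)     => s"$f:{$v TO *]"        // '{' makes the lower bound exclusive
    case LessThanOrEqual(f, v) => s"$f:[* TO $v]"        // ']' makes the upper bound inclusive
    case IsNotNull(f)          => s"$f:[* TO *]"         // any value present
  }

  // Each filter becomes its own fq parameter, which Solr can cache independently.
  def toFqs(filters: Seq[Filter]): Seq[String] = filters.map(toFq)
}
```

For example, a predicate such as `tip_amount > 5` would become the filter query `tip_amount:{5 TO *]` under this sketch, so Solr rather than Spark discards the non-matching rows.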
12. Data-locality Hint
• SolrRDD extends RDD[SolrDocument] (written in Scala)
• Give hint to Spark task scheduler about where data lives
override def getPreferredLocations(split: Partition): Seq[String] = {
// return preferred hostname for a Solr partition
}
• Useful when Spark executor and Solr replicas live on same physical
host, as we do in Fusion
• Query to a shard has a “preferred” replica; can fall back to other
replicas if the preferred goes down (will be in 2.1)
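A rough sketch of the preferred-replica idea, in plain Scala without Spark on the classpath. `Replica` and `ReplicaSelection` are hypothetical names introduced here; the ordering rule (preferred host first, then any other live replica) is an assumption about how such a fallback could work, not the spark-solr implementation.

```scala
// Hypothetical model of a shard replica; not a spark-solr or SolrJ class.
final case class Replica(host: String, active: Boolean)

object ReplicaSelection {
  // Return the hosts to try for a shard, preferred host first.
  // Inactive replicas are skipped so a task can fall back to a live one.
  def hostsToTry(replicas: Seq[Replica], preferredHost: String): Seq[String] = {
    val live = replicas.filter(_.active).map(_.host)
    val (preferred, others) = live.partition(_ == preferredHost)
    preferred ++ others
  }
}
```

A list like this is the kind of value a `getPreferredLocations` override could return as its locality hint.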
13. Solr Streaming API for fast reads
• Contributed to spark-solr by Bloomberg team (we love PRs)
• Extremely fast “table scans” over large result sets in Solr
• Relies on a column-oriented data structure in Lucene: docValues
• DocValues help speed up faceting and sorting too!
• Coming soon! Push SQL predicates down into Solr’s Parallel SQL
engine available in Solr 6.x
14. Writing to Solr (aka indexing)
• Cloud-aware client sends updates to shard leaders in parallel
• Solr Schema API used to create fields on-the-fly using the DataFrame
schema
• Better parallelism than traditional approaches like Solr DIH
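import org.apache.spark.sql.SaveMode
// Read the source table over JDBC, partitioned on column "foo"
// Note: Spark's JDBC source also expects "lowerBound" and "upperBound" when "partitionColumn" is set
val dbOpts = Map(
"url" -> "jdbc:postgresql:mydb",
"dbtable" -> "schema.table",
"partitionColumn" -> "foo",
"numPartitions" -> "10")
val jdbcDF = sqlContext.read.format("jdbc").options(dbOpts).load
// Write the DataFrame to a Solr collection
val solrOpts = Map("zkhost" -> "localhost:9983", "collection" -> "mycoll")
jdbcDF.write.format("solr").options(solrOpts).mode(SaveMode.Overwrite).save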
15. Solr / Lucene Analyzers for Spark ML Pipelines
• Spark ML Pipeline provides nice API for defining stages to train / predict ML models
• Crazy idea ~ use battle-hardened Lucene for text analysis in Spark
• Pipelines support import/export (work in progress, more coming in Spark 2.0)
• Can try different text analysis techniques during cross-validation
[Diagram: DF → Spark ML Pipeline (Lucene Analyzer → HashingTF → Standard Scaler → Trained Model (SVM)) → save]
https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/
17. Fusion & Spark
• spark-solr 2.0.1 released, built into Fusion 2.4
• Users leave evidence of their needs & experience as they use your app
• Fusion closes the feedback loop to improve results based on user “signals”
• Train and serve ML Pipeline and mllib based Machine Learning models
• Run custom Scala “script” jobs in the background in Fusion
➢Complex aggregation jobs (see next slide)
➢Unsupervised learning (LDA topic modeling)
➢Re-train supervised models as new training data flows in
18. Scheduled Fusion job to compute stats for user sessions
val opts = Map("zkhost" -> "localhost:9983", "collection" -> "apachelogs")
var logEvents = sqlContext.read.format("solr").options(opts).load
logEvents.registerTempTable("logs")
sqlContext.udf.register("ts2ms", (d: java.sql.Timestamp) => d.getTime)
sqlContext.udf.register("asInt", (b: String) => b.toInt)
val sessions = sqlContext.sql("""
|SELECT *, sum(IF(diff_ms > 30000, 1, 0))
|OVER (PARTITION BY clientip ORDER BY ts) session_id
|FROM (SELECT *, ts2ms(ts) - lag(ts2ms(ts))
|OVER (PARTITION BY clientip ORDER BY ts) as diff_ms FROM logs) tmp
""".stripMargin)
sessions.registerTempTable("sessions")
var sessionsAgg = sqlContext.sql("""
|SELECT concat_ws('||', clientip,session_id) as id,
| first(clientip) as clientip,
| min(ts) as session_start,
| max(ts) as session_end,
| (ts2ms(max(ts)) - ts2ms(min(ts))) as session_len_ms_l,
| count(*) as total_requests_l
|FROM sessions
|GROUP BY clientip,session_id
""".stripMargin)
sessionsAgg.write.format("solr").options(Map("zkhost" -> "localhost:9983", "collection" -> "apachelogs_signals_aggr"))
.mode(org.apache.spark.sql.SaveMode.Overwrite).save
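To make the window logic in the first query easier to follow, here is a plain-Scala sketch of the same sessionization rule applied to an in-memory list: within each client, a new session starts whenever the gap to the previous request exceeds 30 seconds. `LogEvent`, `SessionEvent`, and `Sessionize` are hypothetical names used only for this illustration; the job itself runs the SQL above.

```scala
// Hypothetical in-memory stand-ins for rows of the "logs" table.
final case class LogEvent(clientip: String, tsMs: Long)
final case class SessionEvent(clientip: String, tsMs: Long, sessionId: Int)

object Sessionize {
  val GapMs = 30000L // same threshold as diff_ms > 30000 in the SQL

  def assign(events: Seq[LogEvent]): Seq[SessionEvent] =
    events.groupBy(_.clientip).toSeq.sortBy(_._1).flatMap { case (ip, evs) =>
      val sorted = evs.sortBy(_.tsMs)
      // Gap to the previous event; the first event has none (lag() is NULL in SQL), so use 0.
      val gaps = 0L +: sorted.zip(sorted.drop(1)).map { case (a, b) => b.tsMs - a.tsMs }
      // Running count of gaps above the threshold, like sum(IF(...)) OVER (PARTITION BY ... ORDER BY ts).
      val ids = gaps.scanLeft(0)((acc, g) => if (g > GapMs) acc + 1 else acc).tail
      sorted.zip(ids).map { case (e, id) => SessionEvent(ip, e.tsMs, id) }
    }
}
```

With requests from one client at 0 ms, 10,000 ms, and 50,000 ms, the first two share session 0 and the third starts session 1, because only the 40,000 ms gap exceeds the threshold.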
19. Getting started with spark-solr
• Import package via maven
./bin/spark-shell --packages "com.lucidworks.spark:spark-solr:2.0.1"
• Build from source
git clone https://github.com/LucidWorks/spark-solr
cd spark-solr
mvn clean package -DskipTests
./bin/spark-shell --jars 2.1.0-SNAPSHOT.jar
20. Example : Deep paging via shards
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips LIMIT 2").show()
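Deep paging relies on a cursor loop: each request returns a page plus a cursor for the next request, and reading stops once the cursor no longer changes. The sketch below simulates that loop against an in-memory list so it runs without Solr. `CursorPaging`, `Page`, and `fetch` are hypothetical; in particular, using the last returned id as the cursor value is a simplification of Solr's opaque cursor marks, and unique ids are assumed.

```scala
object CursorPaging {
  // One page of results plus the cursor to send with the next request.
  final case class Page(ids: Seq[Int], nextCursor: String)

  // Stand-in for a Solr request sorted by a unique key. The cursor is "*" for
  // the first page, otherwise the last id already returned.
  def fetch(index: Seq[Int], cursor: String, rows: Int): Page = {
    val sorted = index.sorted
    val remaining = if (cursor == "*") sorted else sorted.filter(_ > cursor.toInt)
    val page = remaining.take(rows)
    Page(page, page.lastOption.map(_.toString).getOrElse(cursor))
  }

  // Keep requesting pages until the returned cursor equals the one that was sent.
  def readAll(index: Seq[Int], rows: Int): Seq[Int] = {
    @annotation.tailrec
    def loop(cursor: String, acc: Seq[Int]): Seq[Int] = {
      val page = fetch(index, cursor, rows)
      if (page.nextCursor == cursor) acc ++ page.ids
      else loop(page.nextCursor, acc ++ page.ids)
    }
    loop("*", Vector.empty)
  }
}
```

Running one such loop per shard, in parallel Spark tasks, is the idea behind paging "via shards".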
21. Example : Deep paging with intra shard splitting
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips",
"splits" -> "true")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips").count()
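One way to picture intra-shard splitting is to carve a shard into several sub-queries that separate tasks can page through in parallel. The sketch below splits a numeric field's value range into contiguous range filters. This is an assumption made for illustration: `ShardSplits`, the choice of field, and the range-based strategy are hypothetical and may not match how spark-solr computes its splits.

```scala
object ShardSplits {
  // Divide the inclusive range [min, max] of a numeric split field into up to n
  // contiguous sub-ranges, each rendered as a Solr range filter query.
  def rangeFilters(field: String, min: Long, max: Long, n: Int): Seq[String] = {
    require(n > 0 && min <= max, "need n > 0 and min <= max")
    val size = max - min + 1
    val step = math.max(1L, (size + n - 1) / n) // ceiling division
    (min to max by step).map { lo =>
      val hi = math.min(lo + step - 1, max)
      s"$field:[$lo TO $hi]"
    }
  }
}
```

Each returned filter would then back one Spark partition, so a 6-shard collection split 20 ways per shard yields 120 tasks.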
22. Example : Streaming API (/export handler)
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()
23. Performance test
• NYC taxi data (30 months - 91.7M rows)
• Dataset loaded into an AWS RDS instance (Postgres)
• 3 EC2 nodes (r3.2xlarge instances)
• Solr and Spark instances co-located
• Collection ‘nyc-taxi’ created with 6 shards and a replication factor of 1
• Deployed using solr-scale-tk (https://github.com/LucidWorks/solr-scale-tk)
• Dataset link: https://github.com/toddwschneider/nyc-taxi-data
• More details: https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
24. Query performance
• Query: simple aggregation query to calculate averages
• Streaming expressions took 2.3 mins across 6 tasks
• Deep paging took 20 minutes across 120 tasks
25. Index performance
• 91.4M rows imported to Solr in 49 minutes
• Docs per second: 31K
• JDBC batch size: 5000
• Indexing batch size: 50000
• Partitions: 200
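The docs-per-second figure follows directly from the row count and elapsed time on this slide. A tiny sketch of the arithmetic (the `IndexThroughput` helper is hypothetical):

```scala
object IndexThroughput {
  // Throughput = rows / elapsed seconds (integer division, so the result is rounded down).
  def docsPerSecond(rows: Long, minutes: Long): Long = rows / (minutes * 60)
}
```

With 91.4M rows in 49 minutes (2,940 seconds) this gives roughly 31K docs per second, matching the number above.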
26. Wrap-up and Q & A
Download Fusion: http://lucidworks.com/fusion/download/
Feel free to reach out to me with questions:
kiran.chitturi@lucidworks.com / @chitturikiran
27. Spark Streaming: Nuts & Bolts
• Transform a stream of records into small, deterministic batches
✓ Discretized stream: sequence of RDDs
✓ Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
✓ Low-latency micro batches
✓ Time to process a batch must be less than the batch interval time
• Two types of operators:
✓ Transformations (group by, join, etc)
✓ Output (send to some external sink, e.g. Solr)
• Impressive performance!
✓ 4GB/s (40M records/s) on 100 node cluster with less than 1 second latency (note: not indexing rate)
✓ Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
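The "discretized stream" idea can be sketched without Spark: cut a stream of timestamped records into fixed-interval batches, then treat each batch as an ordinary collection. `Event` and `MicroBatches` are hypothetical names, and the record timestamp is used here as a stand-in for arrival time, which is what Spark Streaming batches on.

```scala
// Hypothetical record type for illustration.
final case class Event(tsMs: Long, value: String)

object MicroBatches {
  // Assign each record to the batch window containing its timestamp and return
  // the batches in time order, keyed by window start.
  def discretize(records: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    records
      .groupBy(r => (r.tsMs / intervalMs) * intervalMs)
      .toSeq
      .sortBy(_._1)
}
```

Each `(windowStart, records)` pair plays the role of one RDD in the sequence; processing a batch must finish before the next interval elapses or the stream falls behind.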
28. Spark Streaming Example: Solr as Sink
[Diagram: Twitter → map() → Solr, with various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection). Slide legend: provided by Spark; custom Java / Scala code; provided by Lucidworks]
./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
twitter-to-solr -zkHost localhost:2181 -collection social
class TwitterToSolrStreamProcessor
extends SparkApp.StreamProcessor
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
// Convert a twitter4j Status object into a SolrInputDocument
public SolrInputDocument call(Status status) {
SolrInputDocument doc = new SolrInputDocument();
…
return doc;
}});
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
29. Document Matching using Stored Queries
• For each document, determine which of a large set of stored queries
matches.
• Useful for alerts, alternative flow paths through a stream, etc
• Index a micro-batch into an embedded (in-memory) Solr instance and
then determine which queries match
• Matching framework; you have to decide where to load the stored
queries from and what to do when matches are found
• Scale it using Spark; if you need to scale to many queries, check out Luwak
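The matching idea in the bullets above can be sketched without Solr at all. The example below is a simplified stand-in, not the Lucidworks implementation: it swaps Solr queries for plain Java predicates and skips the embedded Solr index entirely. StoredQueryMatcher, register and matches are hypothetical names, and where the queries come from and what happens on a match are left to the caller, as the slide says.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Simplified stand-in for stored-query matching: predicates replace Solr queries.
class StoredQueryMatcher {
    // Stored queries keyed by id; LinkedHashMap keeps registration order.
    private final Map<String, Predicate<Map<String, String>>> storedQueries = new LinkedHashMap<>();

    void register(String queryId, Predicate<Map<String, String>> query) {
        storedQueries.put(queryId, query);
    }

    // Returns the ids of every stored query that matches the document.
    List<String> matches(Map<String, String> doc) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : storedQueries.entrySet()) {
            if (e.getValue().test(doc)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        StoredQueryMatcher matcher = new StoredQueryMatcher();
        matcher.register("mentions-solr",
                d -> d.getOrDefault("text", "").toLowerCase().contains("solr"));
        matcher.register("lang-en", d -> "en".equals(d.get("lang")));

        Map<String, String> tweet = new HashMap<>();
        tweet.put("text", "Indexing tweets into Solr");
        tweet.put("lang", "en");
        System.out.println(matcher.matches(tweet)); // prints [mentions-solr, lang-en]
    }
}
```

In a streaming job, a matcher like this would be applied to every document of a micro-batch, with the match list driving alerts or routing.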
30. Document Matching using Stored Queries
Diagram elements: Twitter, map(), DocFilterContext, Stored Queries, ZooKeeper
JavaReceiverInputDStream<Status> tweets =
TwitterUtils.createStream(jssc, null, filters);
JavaDStream<SolrInputDocument> docs = tweets.map(
new Function<Status,SolrInputDocument>() {
// Convert a twitter4j Status object into a SolrInputDocument
public SolrInputDocument call(Status status) {
SolrInputDocument doc = new SolrInputDocument();
…
return doc;
}});
JavaDStream<SolrInputDocument> enriched =
SolrSupport.filterDocuments(docFilterContext, …);
Diagram notes:
• Get queries
• Index docs into an EmbeddedSolrServer initialized from configs stored in ZooKeeper
• Key abstraction to allow you to plug in how to store the queries and what action to take when docs match
31. RDD Illustrated: Word count
Step 1: val file = spark.textFile("hdfs://...")
file RDD from HDFS:
quick brown fox jumped …
quick brownie recipe …
quick drying glue …
………
Step 2: file.flatMap(line => line.split(" "))
Split lines into words: quick, brown, fox, quick, quick, ……
Step 3: map(word => (word, 1))
Map words into pairs with a count of 1: (quick,1), (brown,1), (fox,1), (quick,1), (quick,1)
Step 4: reduceByKey(_ + _)
Shuffle across machine boundaries so that all pairs with the same key reach the same reducer, then sum: (quick,1), (quick,1), (quick,1) → (quick,3)
Partitions shown: Partition 1, Partition 2, Partition 3
• Executors are assigned based on data locality if possible; narrow transformations occur in the same executor
• Spark keeps track of the transformations made to generate each RDD
Full program:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
32. Understanding Resilient Distributed Datasets (RDD)
• Read-only partitioned collection of records with fault-tolerance
• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
• If a partition is lost, RDDs can be re-computed by re-playing the transformations
• User can choose to persist an RDD (for reusing during interactive data-mining)
• User can control partitioning scheme
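Lineage-based recovery can be shown with a toy model. The sketch below is plain Java, not Spark's RDD implementation: it records each map step and replays the recorded steps over the immutable source data when a materialized partition goes missing. LineageSketch and its methods are hypothetical names, and the model covers only element-wise map transformations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of an RDD-like dataset that recovers lost partitions from lineage.
class LineageSketch {
    // Immutable source partitions, standing in for an external system such as HDFS.
    private final List<List<Integer>> sourcePartitions;
    // Recorded coarse-grained transformations, in the order they were applied.
    private final List<Function<Integer, Integer>> lineage;
    // Materialized partitions; entries can be dropped to simulate a failure.
    private final Map<Integer, List<Integer>> cache = new HashMap<>();

    LineageSketch(List<List<Integer>> sourcePartitions) {
        this(sourcePartitions, new ArrayList<>());
    }

    private LineageSketch(List<List<Integer>> sourcePartitions,
                          List<Function<Integer, Integer>> lineage) {
        this.sourcePartitions = sourcePartitions;
        this.lineage = lineage;
    }

    // A transformation returns a new dataset whose lineage has one more step.
    LineageSketch map(Function<Integer, Integer> step) {
        List<Function<Integer, Integer>> extended = new ArrayList<>(lineage);
        extended.add(step);
        return new LineageSketch(sourcePartitions, extended);
    }

    // Returns the partition, recomputing it from lineage if it is not materialized.
    List<Integer> partition(int index) {
        return cache.computeIfAbsent(index, this::recompute);
    }

    // Simulates losing the materialized copy of one partition.
    void losePartition(int index) {
        cache.remove(index);
    }

    private List<Integer> recompute(int index) {
        List<Integer> records = sourcePartitions.get(index);
        for (Function<Integer, Integer> step : lineage) {
            records = records.stream().map(step).collect(Collectors.toList());
        }
        return records;
    }

    public static void main(String[] args) {
        LineageSketch data = new LineageSketch(
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4)))
                .map(x -> x * 10)
                .map(x -> x + 1);
        System.out.println(data.partition(1)); // prints [31, 41]
        data.losePartition(1);                 // drop the materialized copy
        System.out.println(data.partition(1)); // prints [31, 41] again, replayed from lineage
    }
}
```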
33. Physical Architecture
Spark Master (daemon)
• Keeps track of live workers
• Web UI on port 8080
• Task Scheduler
• Restart failed tasks
• Losing a master prevents new applications from being executed; can achieve HA using ZooKeeper and multiple master nodes
• When selecting which node to execute a task on, the master takes into account data locality
My Spark App: SparkContext (driver), packaged as my-spark-job.jar (w/ shaded deps)
• RDD Graph
• DAG Scheduler
• Block tracker
• Shuffle tracker
Spark Worker Node (1...N of these)
• Spark Slave (daemon)
• Spark Executor (JVM process): Tasks, Cache
• Executor runs in a separate process from the slave daemon
• Each task works on some partition of a data set to apply a transformation or action
• Tasks are assigned based on data locality