GC Tuning
lucenerevolution.org
October 13-16 · Austin, TX
Solr & Spark
•  Spark Overview
•  Resilient Distributed Dataset (RDD)
•  Solr as a SparkSQL DataSource
•  Interacting with Solr from the Spark shell
•  Indexing from Spark Streaming
•  Lucidworks Fusion and Spark
•  Other goodies (document matching, term
vectors, etc.)
About Me …
•  Solr user since 2010, committer since April 2014, PMC member ~ May 2015
•  Work on core Fusion team at Lucidworks
•  Co-author of Solr in Action w/ Trey Grainger
•  Several years of experience working with Hadoop, Pig, Hive, ZooKeeper … working with Spark for about 1 year
•  Other contributions include Solr on YARN, Solr Scale Toolkit, and Solr/Storm integration project on GitHub
Why Spark?
•  Wealth of overview / getting started resources on the Web
Ø  Start here -> https://spark.apache.org/
Ø  Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
•  Faster, more modernized alternative to MapReduce
Ø  Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using10x
less computing power)
•  Unified platform for Big Data
Ø  Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
•  Write code in Java, Scala, or Python … REPL interface too
•  Runs on YARN (or Mesos), plays well with HDFS
Spark Components
•  components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
•  engine: Spark Core (Execution Model, The Shuffle, Caching)
•  cluster mgmt: Hadoop YARN, Mesos, Standalone
•  HDFS, Tachyon (shared memory)
•  languages: Scala, Java, Python, R
Physical Architecture
Spark Master (daemon)
•  Keeps track of live workers
•  Web UI on port 8080
•  Task Scheduler
•  Restart failed tasks
•  Losing a master prevents new applications from being executed
•  Can achieve HA using ZooKeeper and multiple master nodes
•  Tasks are assigned based on data-locality: when selecting which node to execute a task on, the master takes into account data locality
My Spark App: SparkContext (driver), my-spark-job.jar (w/ shaded deps)
•  RDD Graph
•  DAG Scheduler
•  Block tracker
•  Shuffle tracker
Spark Worker Node (1...N of these)
•  Spark Slave (daemon)
•  Spark Executor (JVM process): Tasks, Cache
•  Executor runs in a separate process from the slave daemon
•  Each task works on some partition of a data set to apply a transformation or action
Spark vs. Hadoop’s Map/Reduce
Hadoop
•  Operators: Map-Reduce
•  File System: HDFS, S3
•  Fault-Tolerance: Replicated blocks
•  Ecosystem: Java API, Pig, Hive
Spark
•  Operators: filter, map, flatMap, join, groupByKey, reduce, sortByKey, count, distinct, top, cogroup, etc …
•  File System: HDFS, S3, Tachyon
•  Fault-Tolerance: Immutable RDD lineage
•  Ecosystem: Python / Scala / Java API, SparkSQL, GraphX, MLlib, SparkR
Solr as a Spark SQL Data Source
•  DataFrame is a DSL for distributed data manipulation
•  Data source provides a DataFrame
•  Uniform way of working with data from multiple sources
•  Hive, JDBC, Solr, Cassandra, etc.
•  Seamless integration with other Spark technologies: SparkR, Python,
MLlib …
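// Options that tell the Solr data source where to read from
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

// Load the collection as a DataFrame, then count the documents whose type_s is "echo"
DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();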
  
SolrRDD: Reading data from Solr into Spark
•  Spark Driver App: sqlContext.load("solr", opts) with Map("collection" -> "socialdata", "zkHost" -> "…")
•  solr.DefaultSource reads collection metadata (num shards, replica URLs, etc) from ZooKeeper
•  Parallelize query execution: Partition 1 and Partition 2 (spark tasks) read from Shard 1 and Shard 2 of the socialdata collection
•  Table Scan Query: q=*:*&rows=1000&distrib=false&cursorMark=*
•  Results streamed back from Solr
•  Result: DataFrame (schema + RDD<Row>)
SolrRDD: Reading data from Solr into Spark (splitting shards)
•  Spark Driver App: sqlContext.load("solr", opts) with Map("collection" -> "socialdata", "zkHost" -> "…")
•  solr.DefaultSource reads collection metadata (num shards, replica URLs, etc) from ZooKeeper
•  Parallelize query execution: Partitions 1 to 4 (spark tasks) read from Shard 1 and Shard 2 of the socialdata collection
•  Table Scan Query: q=*:*&rows=1000&distrib=false&cursorMark=* plus a range on a split field, one range per partition
•  Split ranges shown on the slide: &split_field:[aa TO ba}, &split_field:[ba TO bz], &split_field:[aa TO ac}, &split_field:[ac TO ae]
•  Result: DataFrame (schema + RDD<Row>)
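The per-partition table scan queries can be sketched in plain Java. This is only an illustration under stated assumptions: `SplitQuerySketch` and `buildSplitQueries` are hypothetical names (not part of spark-solr), the range is passed as an `fq` parameter (the slide only shows `&split_field:...`), and the bounds are the example values from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of spark-solr: builds one table scan query per split range.
final class SplitQuerySketch {

    // bounds = ["aa", "ba", "bz"] yields the ranges [aa TO ba} and [ba TO bz].
    static List<String> buildSplitQueries(String baseQuery, String splitField, List<String> bounds) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i + 1 < bounds.size(); i++) {
            boolean last = (i + 2 == bounds.size());
            // Exclusive upper bound on all but the last range, so no value is read twice.
            String close = last ? "]" : "}";
            // Assumption: the range is sent as a filter query (fq).
            queries.add(baseQuery + "&fq=" + splitField + ":[" + bounds.get(i)
                    + " TO " + bounds.get(i + 1) + close);
        }
        return queries;
    }

    public static void main(String[] args) {
        String base = "q=*:*&rows=1000&distrib=false&cursorMark=*";
        for (String q : buildSplitQueries(base, "split_field", Arrays.asList("aa", "ba", "bz"))) {
            System.out.println(q);
        }
    }
}
```

Each generated query could then be handed to its own Spark task, which is what lets more tasks than shards read in parallel.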
  
Query Solr from the Spark Shell
Interactive data mining with the full power of Solr queries
ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell

val solrDF = sqlContext.load("solr", Map(
  "zkHost" -> "localhost:2181",
  "collection" -> "rev"))

solrDF.printSchema

solrDF.registerTempTable("tweets")

sqlContext.sql("select userName, COUNT(userName) num_posts from tweets group by userName order by num_posts desc limit 25").show()

// Same kind of aggregation through the DataFrame API; desc comes from org.apache.spark.sql.functions
import org.apache.spark.sql.functions.desc
solrDF.groupBy("author_s").count().orderBy(desc("count")).limit(10).show()
  
Lucidworks Fusion
Fusion & Spark
•  Users leave evidence of their needs & experience as they use your app
•  Fusion closes the feedback loop to improve results based on user “signals”
•  Scale out distributed aggregation jobs across Spark cluster
Simple: top queries, top documents, …
Complex: click stream with decaying weights, co-occurrence matrix
•  Recommendations at scale
RDD Illustrated: Word count
•  val file = spark.textFile("hdfs://...")
   file RDD from HDFS, shown as Partition 1, Partition 2, Partition 3:
   quick brown fox jumped …
   quick brownie recipe …
   quick drying glue …
•  file.flatMap(line => line.split(" "))
   Split lines into words: quick, brown, fox, quick, quick, …
•  map(word => (word, 1))
   Map words into pairs with count of 1: (quick,1) (brown,1) (fox,1) (quick,1) (quick,1)
•  reduceByKey(_ + _)
   Send all keys to same reducer and sum: (quick,1) (quick,1) (quick,1) -> (quick,3)
   Shuffle across machine boundaries
•  Executors assigned based on data-locality if possible, narrow transformations occur in same executor
•  Spark keeps track of the transformations made to generate each RDD

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
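For readers without a cluster at hand, the same three steps can be tried on a single machine with plain Java streams. This is a rough analogue, not Spark code: `WordCountSketch` is a hypothetical name, there is no partitioning or shuffle, and the merge function of `Collectors.toMap` stands in for `reduceByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-JVM analogue of the word count pipeline; no partitions, no shuffle.
final class WordCountSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: split each line into words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map + reduceByKey: pair each word with 1 and sum the values per key
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "quick brown fox jumped",
                "quick brownie recipe",
                "quick drying glue");
        System.out.println(countWords(lines));
    }
}
```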
  
Understanding Resilient Distributed Datasets (RDD)
•  Read-only partitioned collection of records with fault-tolerance
•  Created from external system OR using a transformation of another RDD
•  RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
•  If a partition is lost, RDDs can be re-computed by re-playing the transformations
•  User can choose to persist an RDD (for reusing during interactive data-mining)
•  User can control partitioning scheme
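The lineage idea in the bullets above can be illustrated with a toy model. This is a sketch of the concept for one partition only, not how Spark implements RDDs; `LineageSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of lineage for a single partition; not Spark's implementation.
final class LineageSketch {
    private final List<String> source;  // stands in for the external system
    private final List<Function<String, String>> lineage = new ArrayList<>();
    private List<String> computed;      // materialized partition, may be lost

    LineageSketch(List<String> source) {
        this.source = source;
    }

    // Record a coarse-grained transformation instead of storing its output.
    LineageSketch map(Function<String, String> f) {
        lineage.add(f);
        computed = null;
        return this;
    }

    // Replay the recorded transformations when no computed copy is available.
    List<String> compute() {
        if (computed == null) {
            List<String> data = source;
            for (Function<String, String> f : lineage) {
                data = data.stream().map(f).collect(Collectors.toList());
            }
            computed = data;
        }
        return computed;
    }

    // Simulate losing the computed partition, e.g. when an executor dies.
    void lose() {
        computed = null;
    }

    public static void main(String[] args) {
        LineageSketch rdd = new LineageSketch(Arrays.asList("quick", "fox"))
                .map(String::toUpperCase)
                .map(s -> s + "!");
        System.out.println(rdd.compute());
        rdd.lose();
        System.out.println(rdd.compute()); // same result, rebuilt from the lineage
    }
}
```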
Spark Streaming: Nuts & Bolts
•  Transform a stream of records into small, deterministic batches
✓  Discretized stream: sequence of RDDs
✓  Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
✓  Low-latency micro batches
✓  Time to process a batch must be less than the batch interval
•  Two types of operators:
✓  Transformations (group by, join, etc)
✓  Output (send to some external sink, e.g. Solr)
•  Impressive performance!
✓  4GB/s (40M records/s) on 100 node cluster with less than 1 second latency
✓  Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
Spark Streaming Example: Solr as Sink
./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
    twitter-to-solr -zkHost localhost:2181 -collection social
•  Flow: Twitter -> map() -> Solr, implemented in class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor
•  Receive the tweets:
JavaReceiverInputDStream<Status> tweets =
    TwitterUtils.createStream(jssc, null, filters);
•  Various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection):
JavaDStream<SolrInputDocument> docs = tweets.map(
    new Function<Status,SolrInputDocument>() {
      // Convert a twitter4j Status object into a SolrInputDocument
      public SolrInputDocument call(Status status) {
        SolrInputDocument doc = new SolrInputDocument();
        …
        return doc;
      }});
•  Index into Solr:
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
Slide Legend: Provided by Spark, Custom Java / Scala code, Provided by Lucidworks
Spark Streaming Example: Solr as Sink
// start receiving a stream of tweets ...
JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc =
        SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
      doc.setField("provider_s", "twitter");
      doc.setField("author_s", status.getUser().getScreenName());
      doc.setField("type_s", status.isRetweet() ? "echo" : "post");
      return doc;
    }
  }
);

// when ready, send the docs into a SolrCloud cluster in batches of 100
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
  
com.lucidworks.spark.SolrSupport
public static void indexDStreamOfDocs(final String zkHost, final String collection, final int batchSize,
                                      JavaDStream<SolrInputDocument> docs)
{
  docs.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
      public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
        // Index each partition of the micro-batch from the task that processes it
        solrInputDocumentJavaRDD.foreachPartition(
          new VoidFunction<Iterator<SolrInputDocument>>() {
            public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
              final SolrServer solrServer = getSolrServer(zkHost);
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              while (solrInputDocumentIterator.hasNext()) {
                batch.add(solrInputDocumentIterator.next());
                // Send a batch as soon as it is full
                if (batch.size() >= batchSize)
                  sendBatchToSolr(solrServer, collection, batch);
              }
              // Send whatever is left over
              if (!batch.isEmpty())
                sendBatchToSolr(solrServer, collection, batch);
            }
          }
        );
        return null;
      }
    }
  );
}
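The buffer-and-flush pattern used inside foreachPartition can be tried without Spark or Solr. In this minimal sketch a `Consumer` stands in for `sendBatchToSolr`, and `BatchingSketch` and `sendInBatches` are hypothetical names. The sketch clears the buffer itself after each flush; whether `sendBatchToSolr` does so is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Stand-alone version of the batching loop; the sink replaces the Solr call.
final class BatchingSketch {

    // Drains the iterator and hands the sink every full batch plus the final partial one.
    // Returns the number of batches sent.
    static <T> int sendInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        List<T> batch = new ArrayList<>();
        int batchesSent = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() >= batchSize) {
                sink.accept(new ArrayList<>(batch)); // pass a copy so the sink may keep it
                batch.clear();                       // start the next batch empty
                batchesSent++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));
            batchesSent++;
        }
        return batchesSent;
    }

    public static void main(String[] args) {
        Iterator<Integer> docs = Arrays.asList(1, 2, 3, 4, 5).iterator();
        int sent = sendInBatches(docs, 2, batch -> System.out.println("sending " + batch));
        System.out.println(sent + " batches");
    }
}
```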
  
com.lucidworks.spark.ShardPartitioner
•  Custom partitioning scheme for RDD using Solr’s DocRouter
•  Stream docs directly to each shard leader using metadata from ZooKeeper,
document shard assignment, and ConcurrentUpdateSolrClient
final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
pairs.partitionBy(shardPartitioner).foreachPartition(
  new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
    public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
      ConcurrentUpdateSolrClient cuss = null;
      while (tupleIter.hasNext()) {
        // the pair is (document id, document)
        SolrInputDocument doc = tupleIter.next()._2();
        // ... Initialize ConcurrentUpdateSolrClient once per partition
        cuss.add(doc);
      }
    }
});
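The idea of routing documents by hash range can be sketched as follows. This is an assumption-heavy illustration: it uses `String.hashCode()` and equal-sized ranges as stand-ins, whereas the real ShardPartitioner relies on Solr's DocRouter and the shard ranges stored in ZooKeeper. `ShardPartitionerSketch` and `shardFor` are hypothetical names.

```java
// Illustration of hash-range routing; not Solr's DocRouter.
final class ShardPartitionerSketch {
    private final int numShards;
    private final long rangeSize;

    ShardPartitionerSketch(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
        // Split the 32-bit hash space into numShards contiguous ranges (rounded up).
        this.rangeSize = ((1L << 32) + numShards - 1) / numShards;
    }

    // Maps a document id to a shard index in [0, numShards).
    int shardFor(String docId) {
        // Stand-in hash: String.hashCode() read as an unsigned 32-bit value.
        long unsignedHash = docId.hashCode() & 0xFFFFFFFFL;
        return (int) (unsignedHash / rangeSize);
    }

    public static void main(String[] args) {
        ShardPartitionerSketch partitioner = new ShardPartitionerSketch(2);
        for (String id : new String[] {"tweet-1", "tweet-2", "tweet-3"}) {
            System.out.println(id + " -> shard " + partitioner.shardFor(id));
        }
    }
}
```

Because every document with the same id lands in the same partition, each Spark task can stream its documents to a single shard leader.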
  
Extra goodies: Reading Term Vectors from Solr
•  Pull TF/IDF (or just TF) for each term in a field for each document in query
results from Solr
•  Can be used to construct RDD<Vector> which can then be passed to MLlib:
SolrRDD solrRDD = new SolrRDD(zkHost, collection);

JavaRDD<Vector> vectors =
  solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
vectors.cache();

KMeansModel clusters =
  KMeans.train(vectors.rdd(), numClusters, numIterations);

// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(vectors.rdd());
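The Within Set Sum of Squared Errors is the sum, over all points, of the squared distance to the nearest cluster center. A minimal plain-Java sketch of that calculation, using dense `double[]` vectors and hypothetical names (`WssseSketch`, `wssse`), might look like this:

```java
// Computes the within-set sum of squared errors for dense vectors.
final class WssseSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // For each point, take the squared distance to its nearest center and add them up.
    static double wssse(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}};
        double[][] centers = {{0, 1}, {10, 10}};
        System.out.println(wssse(points, centers));
    }
}
```

A lower value means points sit closer to their assigned centers, which is why it is often compared across different values of numClusters.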
  
	
  
Document Matching using Stored Queries
•  For each document, determine which of a large set of stored queries
matches.
•  Useful for alerts, alternative flow paths through a stream, etc
•  Index a micro-batch into an embedded (in-memory) Solr instance and
then determine which queries match
•  Matching framework; you have to decide where to load the stored
queries from and what to do when matches are found
•  Scale it using Spark … if you need to scale to many queries, check out Luwak
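The matching step can be pictured with a small stand-in. In this sketch a document is a map of field names to values and each stored query is a Java `Predicate`; the real framework indexes the micro-batch into an embedded Solr instance and runs actual Solr queries. `DocMatcherSketch` and `matchingQueries` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Stand-in for stored-query matching; predicates replace Solr queries.
final class DocMatcherSketch {

    // Returns the names of all stored queries that match the document.
    static List<String> matchingQueries(Map<String, String> doc,
                                        Map<String, Predicate<Map<String, String>>> storedQueries) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> query : storedQueries.entrySet()) {
            if (query.getValue().test(doc)) {
                matches.add(query.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, String>>> queries = new LinkedHashMap<>();
        queries.put("retweets", d -> "echo".equals(d.get("type_s")));
        queries.put("from_twitter", d -> "twitter".equals(d.get("provider_s")));

        Map<String, String> doc = new HashMap<>();
        doc.put("type_s", "post");
        doc.put("provider_s", "twitter");

        // What to do with the matches (alert, reroute, tag) is left to the caller.
        System.out.println(matchingQueries(doc, queries));
    }
}
```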
Document Matching using Stored Queries
•  Flow: Twitter -> map() -> SolrSupport.filterDocuments(docFilterContext, …)
JavaReceiverInputDStream<Status> tweets =
    TwitterUtils.createStream(jssc, null, filters);
JavaDStream<SolrInputDocument> docs = tweets.map(
    new Function<Status,SolrInputDocument>() {
      // Convert a twitter4j Status object into a SolrInputDocument
      public SolrInputDocument call(Status status) {
        SolrInputDocument doc = new SolrInputDocument();
        …
        return doc;
      }});
JavaDStream<SolrInputDocument> enriched =
    SolrSupport.filterDocuments(docFilterContext, …);
•  DocFilterContext: key abstraction to allow you to plug in how to store the queries and what action to take when docs match
•  Get queries from the Stored Queries
•  Index docs into an EmbeddedSolrServer initialized from configs stored in ZooKeeper
Slide Legend: Provided by Spark, Custom Java / Scala code, Provided by Lucidworks
Getting Started
git clone https://github.com/LucidWorks/spark-solr/
cd spark-solr
mvn clean package -DskipTests
See: https://lucidworks.com/blog/solr-spark-sql-datasource/
Wrap-up and Q & A
Download Fusion: http://lucidworks.com/fusion/download/
Feel free to reach out to me with questions:
tim.potter@lucidworks.com / @thelabdude
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Lucidworks

More Related Content

What's hot

Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Lucidworks
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Lucidworks
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Lucidworks
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Lucidworks
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Shalin Shekhar Mangar
 
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologyLucidworks
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLucidworks
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!Murshed Ahmmad Khan
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksErik Hatcher
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 

What's hot (20)

Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6
 
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!Apache Solr! Enterprise Search Solutions at your Fingertips!
Apache Solr! Enterprise Search Solutions at your Fingertips!
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 

Similar to Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Lucidworks

NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 

Similar to Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Lucidworks (20)

NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Spark core
Spark coreSpark core
Spark core
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 

Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Lucidworks

  • 2. Solr & Spark
      • Spark Overview
      • Resilient Distributed Dataset (RDD)
      • Solr as a SparkSQL DataSource
      • Interacting with Solr from the Spark shell
      • Indexing from Spark Streaming
      • Lucidworks Fusion and Spark
      • Other goodies (document matching, term vectors, etc.)
  • 3. About Me …
      • Solr user since 2010, committer since April 2014, PMC member ~ May 2015
      • Work on core Fusion team at Lucidworks
      • Co-author of Solr in Action w/ Trey Grainger
      • Several years experience working with Hadoop, Pig, Hive, ZooKeeper … working with Spark for about 1 year
      • Other contributions include Solr on YARN, Solr Scale Toolkit, and Solr/Storm integration project on github
  • 4. Why Spark?
      • Wealth of overview / getting started resources on the Web
          - Start here -> https://spark.apache.org/
          - Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
      • Faster, more modernized alternative to MapReduce
          - Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using 10x less computing power)
      • Unified platform for Big Data
          - Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
      • Write code in Java, Scala, or Python … REPL interface too
      • Runs on YARN (or Mesos), plays well with HDFS
  • 5. Spark Components
      • components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
      • engine: Spark Core (Execution Model, The Shuffle, Caching)
      • cluster mgmt: Hadoop YARN, Mesos, Standalone
      • shared memory: Tachyon
      • HDFS
      • languages: Scala, Java, Python, R
  • 6. Physical Architecture
      • Spark Master (daemon)
          - Keeps track of live workers
          - Web UI on port 8080
          - Task Scheduler
          - Restart failed tasks
          - Losing a master prevents new applications from being executed
          - Can achieve HA using ZooKeeper and multiple master nodes
      • My Spark App: SparkContext (driver), packaged as my-spark-job.jar (w/ shaded deps)
          - RDD Graph
          - DAG Scheduler
          - Block tracker
          - Shuffle tracker
      • Spark Worker Node (1...N of these)
          - Spark Slave (daemon)
          - Spark Executor (JVM process) running Tasks, with a Cache
          - Executor runs in a separate process from the slave daemon
          - Each task works on some partition of a data set to apply a transformation or action
      • Tasks are assigned based on data locality: when selecting which node to execute a task on, the master takes data locality into account
  • 7. Spark vs. Hadoop’s Map/Reduce
      • Hadoop
          - Operators: Map-Reduce
          - File System: HDFS, S3
          - Fault-Tolerance: Replicated blocks
          - Ecosystem: Java API, Pig, Hive
      • Spark
          - Operators: filter, map, flatMap, join, groupByKey, reduce, sortByKey, count, distinct, top, cogroup, etc …
          - File System: HDFS, S3, Tachyon
          - Fault-Tolerance: Immutable RDD lineage
          - Ecosystem: Python / Scala / Java API, SparkSQL, GraphX, MLlib, SparkR
  • 8. Solr as a Spark SQL Data Source
      • DataFrame is a DSL for distributed data manipulation
      • Data source provides a DataFrame
      • Uniform way of working with data from multiple sources
      • Hive, JDBC, Solr, Cassandra, etc.
      • Seamless integration with other Spark technologies: SparkR, Python, MLlib …

        import java.util.HashMap;
        import java.util.Map;
        import org.apache.spark.sql.DataFrame;

        // Connection options for the Solr data source
        Map<String, String> options = new HashMap<String, String>();
        options.put("zkhost", zkHost);
        options.put("collection", "tweets");

        // Load the collection as a DataFrame and count the retweets ("echo")
        DataFrame df = sqlContext.read().format("solr").options(options).load();
        long count = df.filter(df.col("type_s").equalTo("echo")).count();
  • 9. SolrRDD: Reading data from Solr into Spark
      • Spark Driver App calls sqlContext.load("solr", opts) with Map("collection" -> "socialdata", "zkHost" -> "…")
      • solr.DefaultSource reads collection metadata (num shards, replica URLs, etc) from ZooKeeper
      • Parallelize query execution: Partition 1 (spark task) and Partition 2 (spark task) read from Shard 1 and Shard 2 of the socialdata collection
      • Table Scan Query: q=*:*&rows=1000&distrib=false&cursorMark=*
      • Results streamed back from Solr
      • DataFrame (schema + RDD<Row>)
  • 10. SolrRDD: Reading data from Solr into Spark
      • Spark Driver App calls sqlContext.load("solr", opts) with Map("collection" -> "socialdata", "zkHost" -> "…")
      • solr.DefaultSource reads collection metadata (num shards, replica URLs, etc) from ZooKeeper
      • Parallelize query execution: Partitions 1 to 4 (spark tasks) read from Shard 1 and Shard 2 of the socialdata collection
      • Table Scan Query: q=*:*&rows=1000&distrib=false&cursorMark=*
      • Each partition adds its own range on a split field:
          - &split_field:[aa TO ba}
          - &split_field:[ba TO bz]
          - &split_field:[aa TO ac}
          - &split_field:[ac TO ae]
      • DataFrame (schema + RDD<Row>)
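The fan-out above can be sketched in plain Java: one query string per (shard, split range) pair. This is only an illustration, not the spark-solr implementation. The class name, the shard URLs, and the use of an `fq=` filter parameter for the range are assumptions; URL encoding is omitted for readability. A real cursorMark request also needs a sort that includes the collection's uniqueKey field, which the slide leaves out.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitQuerySketch {

    /** Base table-scan parameters as shown on the slide. */
    static final String BASE = "q=*:*&rows=1000&distrib=false&cursorMark=*";

    /**
     * Builds one query string per (shard, split range) pair.
     * boundaries must be sorted; n boundaries yield n-1 ranges.
     * Every range is half-open "[lo TO hi}" except the last, which is closed,
     * so each value falls into exactly one range.
     */
    static List<String> partitionQueries(List<String> shardUrls,
                                         String splitField,
                                         List<String> boundaries) {
        List<String> queries = new ArrayList<>();
        for (String shardUrl : shardUrls) {
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                boolean last = (i + 2 == boundaries.size());
                String range = "[" + boundaries.get(i) + " TO "
                        + boundaries.get(i + 1) + (last ? "]" : "}");
                // fq= is an assumption; the slide only shows "&split_field:..."
                queries.add(shardUrl + "/select?" + BASE + "&fq=" + splitField + ":" + range);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical replica URLs; real ones come from ZooKeeper metadata.
        List<String> shards = List.of(
                "http://host1:8983/solr/socialdata_shard1_replica1",
                "http://host2:8983/solr/socialdata_shard2_replica1");
        for (String q : partitionQueries(shards, "split_field", List.of("aa", "ac", "ae"))) {
            System.out.println(q);
        }
    }
}
```

Each returned string would correspond to one Spark partition, so two shards with two ranges each give four tasks.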
  • 11. Query Solr from the Spark Shell
      Interactive data mining with the full power of Solr queries

        ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell

        // desc() comes from the SQL functions object
        import org.apache.spark.sql.functions._

        val solrDF = sqlContext.load("solr", Map(
          "zkHost" -> "localhost:2181",
          "collection" -> "rev"))

        solrDF.printSchema

        // Expose the DataFrame to SQL under the name "tweets"
        solrDF.registerTempTable("tweets")

        sqlContext.sql("select userName, COUNT(userName) num_posts from tweets group by userName order by num_posts desc limit 25").show()

        // Same kind of aggregation through the DataFrame API
        solrDF.groupBy("author_s").count().orderBy(desc("count")).limit(10).show()
  • 14. Fusion & Spark
      • Users leave evidence of their needs & experience as they use your app
      • Fusion closes the feedback loop to improve results based on user “signals”
      • Scale out distributed aggregation jobs across Spark cluster
          - Simple: top queries, top documents, …
          - Complex: click stream with decaying weights, co-occurrence matrix
      • Recommendations at scale
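To make "decaying weights" concrete, here is a minimal sketch in plain Java. The slide does not say which formula Fusion uses; exponential decay with a half-life is assumed here as one common choice, and the class and method names are hypothetical.

```java
final class DecaySketch {

    /** Weight of a single click: 1.0 when fresh, halved every halfLifeDays. */
    static double decayedWeight(double ageDays, double halfLifeDays) {
        return Math.pow(0.5, ageDays / halfLifeDays);
    }

    /** Aggregated score for one document: the sum of its decayed click weights. */
    static double decayedClickScore(double[] clickAgesDays, double halfLifeDays) {
        double score = 0.0;
        for (double age : clickAgesDays) {
            score += decayedWeight(age, halfLifeDays);
        }
        return score;
    }

    public static void main(String[] args) {
        // Three clicks aged 0, 30 and 60 days with a 30-day half-life:
        // 1.0 + 0.5 + 0.25
        System.out.println(decayedClickScore(new double[] {0, 30, 60}, 30)); // prints 1.75
    }
}
```

Under this assumption a recent click counts more than an old one, so documents that were popular long ago gradually lose their boost.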
  • 15. RDD Illustrated: Word count
      • val file = spark.textFile("hdfs://...")
          - file RDD from HDFS, split across Partition 1, Partition 2, Partition 3
          - Lines: "quick brown fox jumped …", "quick brownie recipe …", "quick drying glue …", …
      • file.flatMap(line => line.split(" "))
          - Split lines into words: quick, brown, fox, quick, quick, …
      • map(word => (word, 1))
          - Map words into pairs with count of 1: (quick,1) (brown,1) (fox,1) (quick,1) (quick,1)
      • reduceByKey(_ + _)
          - Send all keys to same reducer and sum: (quick,1) (quick,1) (quick,1) -> (quick,3)
          - Shuffle across machine boundaries
      • Executors assigned based on data-locality if possible, narrow transformations occur in same executor
      • Spark keeps track of the transformations made to generate each RDD

        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
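The same flatMap / map / reduceByKey shape can be tried locally without a cluster. The sketch below uses plain Java streams, not Spark, so there is no partitioning or shuffle; `groupingBy` with `counting` plays the role of the map-to-pairs and reduceByKey steps. The class name is hypothetical and the empty-word filter is an addition not on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {

    /** Counts word occurrences across all lines, keyed by word in sorted order. */
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // flatMap: split each line into words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // drop empty tokens produced by repeated spaces
                .filter(word -> !word.isEmpty())
                // group equal words and count them (map + reduceByKey in one step)
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "quick brown fox jumped",
                "quick brownie recipe",
                "quick drying glue");
        System.out.println(count(lines).get("quick")); // prints 3
    }
}
```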
  • 16. Understanding Resilient Distributed Datasets (RDD)
      • Read-only partitioned collection of records with fault-tolerance
      • Created from external system OR using a transformation of another RDD
      • RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
      • If a partition is lost, RDDs can be re-computed by re-playing the transformations
      • User can choose to persist an RDD (for reusing during interactive data-mining)
      • User can control partitioning scheme
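Lineage-based recovery can be illustrated with a toy model in plain Java. This is a simplified sketch, not Spark's RDD class: it only supports element-wise map over integers, the source partitions stand in for stable external storage, and all names are hypothetical. The point is that nothing but the source and the list of transformations needs to survive for a lost partition to be rebuilt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class LineageSketch {
    private final List<List<Integer>> source;                 // stands in for stable storage
    private final List<Function<Integer, Integer>> lineage;   // recorded transformations

    LineageSketch(List<List<Integer>> source) {
        this(source, new ArrayList<>());
    }

    private LineageSketch(List<List<Integer>> source, List<Function<Integer, Integer>> lineage) {
        this.source = source;
        this.lineage = lineage;
    }

    /** Read-only: returns a new dataset whose lineage has one more step. */
    LineageSketch map(Function<Integer, Integer> f) {
        List<Function<Integer, Integer>> extended = new ArrayList<>(lineage);
        extended.add(f);
        return new LineageSketch(source, extended);
    }

    /** Rebuilds a single partition by replaying the lineage over its source records. */
    List<Integer> computePartition(int index) {
        List<Integer> out = new ArrayList<>();
        for (Integer record : source.get(index)) {
            Integer value = record;
            for (Function<Integer, Integer> step : lineage) {
                value = step.apply(value);
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        LineageSketch rdd = new LineageSketch(List.of(List.of(1, 2), List.of(3, 4)))
                .map(x -> x + 1)
                .map(x -> x * 10);
        // Pretend partition 1 was lost: recompute just that partition.
        System.out.println(rdd.computePartition(1)); // prints [40, 50]
    }
}
```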
  • 17. Spark Streaming: Nuts & Bolts
      • Transform a stream of records into small, deterministic batches
          - Discretized stream: sequence of RDDs
          - Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
          - Low-latency micro batches
          - Time to process a batch must be less than the batch interval time
      • Two types of operators:
          - Transformations (group by, join, etc)
          - Output (send to some external sink, e.g. Solr)
      • Impressive performance!
          - 4GB/s (40M records/s) on 100 node cluster with less than 1 second latency
          - Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
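Discretization can be sketched in a few lines of plain Java: records are assigned to a batch by their arrival time, and the stream keeps up only while each batch finishes within one interval. This is an illustration of the idea, not Spark Streaming's receiver logic; the names are hypothetical and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {

    /** Groups record timestamps (ms) into batches keyed by batch start time. */
    static Map<Long, List<Long>> discretize(List<Long> timestampsMs, long batchIntervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            // Round down to the start of the interval the record arrived in.
            long batchStart = (ts / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    /** A batch must be processed in less time than it takes the next one to arrive. */
    static boolean keepsUp(long processingTimeMs, long batchIntervalMs) {
        return processingTimeMs < batchIntervalMs;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> batches = discretize(List.of(0L, 999L, 1000L, 2500L), 1000L);
        System.out.println(batches.keySet());   // prints [0, 1000, 2000]
        System.out.println(keepsUp(800, 1000)); // prints true
    }
}
```

If `keepsUp` is false for long, unprocessed batches queue up and latency grows without bound, which is why the batch interval is the main tuning knob.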
  • 18. Spark Streaming Example: Solr as Sink

        ./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
            twitter-to-solr -zkHost localhost:2181 -collection social

      Flow: Twitter -> map() -> Solr, in class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor

        JavaReceiverInputDStream<Status> tweets =
            TwitterUtils.createStream(jssc, null, filters);

        // Various transformations / enrichments on each tweet
        // (e.g. sentiment analysis, language detection)
        JavaDStream<SolrInputDocument> docs = tweets.map(
            new Function<Status,SolrInputDocument>() {
                // Convert a twitter4j Status object into a SolrInputDocument
                public SolrInputDocument call(Status status) {
                    SolrInputDocument doc = new SolrInputDocument();
                    …
                    return doc;
                }});

        SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);

      Slide Legend: Provided by Spark / Custom Java / Scala code / Provided by Lucidworks
  • 19. Spark Streaming Example: Solr as Sink

        // start receiving a stream of tweets ...
        JavaReceiverInputDStream<Status> tweets =
            TwitterUtils.createStream(jssc, null, filters);

        // map incoming tweets into SolrInputDocument objects for indexing in Solr
        JavaDStream<SolrInputDocument> docs = tweets.map(
            new Function<Status,SolrInputDocument>() {
                public SolrInputDocument call(Status status) {
                    SolrInputDocument doc =
                        SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
                    doc.setField("provider_s", "twitter");
                    doc.setField("author_s", status.getUser().getScreenName());
                    doc.setField("type_s", status.isRetweet() ? "echo" : "post");
                    return doc;
                }
            }
        );

        // when ready, send the docs into a SolrCloud cluster in batches of 100
        SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
  • 20. com.lucidworks.spark.SolrSupport

        public static void indexDStreamOfDocs(final String zkHost, final String collection, final int batchSize,
                                              JavaDStream<SolrInputDocument> docs) {
            // For every micro-batch RDD in the stream ...
            docs.foreachRDD(
                new Function<JavaRDD<SolrInputDocument>, Void>() {
                    public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
                        // ... index each partition independently on its executor
                        solrInputDocumentJavaRDD.foreachPartition(
                            new VoidFunction<Iterator<SolrInputDocument>>() {
                                public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
                                    final SolrServer solrServer = getSolrServer(zkHost);
                                    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                                    while (solrInputDocumentIterator.hasNext()) {
                                        batch.add(solrInputDocumentIterator.next());
                                        // Assumes sendBatchToSolr clears the batch after sending;
                                        // if it does not, batch.clear() is needed here
                                        if (batch.size() >= batchSize)
                                            sendBatchToSolr(solrServer, collection, batch);
                                    }
                                    // Send whatever is left over from the last partial batch
                                    if (!batch.isEmpty())
                                        sendBatchToSolr(solrServer, collection, batch);
                                }
                            }
                        );
                        return null;
                    }
                }
            );
        }
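The batching loop inside foreachPartition is a general pattern worth isolating. A minimal stand-alone sketch, assuming a generic sink in place of Solr and a positive batch size; the class, method, and parameter names are hypothetical. Unlike the slide code, this version clears the batch itself rather than relying on the send helper.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class BatchingSketch {

    /**
     * Drains the iterator, handing the sink full batches of batchSize items
     * plus one final partial batch if any items remain.
     * Returns the number of batches sent. batchSize must be positive.
     */
    static <T> int sendInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        int batchesSent = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() >= batchSize) {
                sink.accept(new ArrayList<>(batch)); // give the sink its own copy
                batch.clear();                       // start the next batch empty
                batchesSent++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));
            batchesSent++;
        }
        return batchesSent;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        int sent = sendInBatches(List.of(1, 2, 3, 4, 5).iterator(), 2, b -> sizes.add(b.size()));
        System.out.println(sent + " batches with sizes " + sizes); // prints 3 batches with sizes [2, 2, 1]
    }
}
```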
  • 21. com.lucidworks.spark.ShardPartitioner
      • Custom partitioning scheme for RDD using Solr’s DocRouter
      • Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient

        final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
        pairs.partitionBy(shardPartitioner).foreachPartition(
            new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
                public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
                    ConcurrentUpdateSolrClient cuss = null;
                    while (tupleIter.hasNext()) {
                        // The document is the value half of each (id, doc) pair
                        SolrInputDocument doc = tupleIter.next()._2();
                        // ... Initialize ConcurrentUpdateSolrClient once per partition
                        cuss.add(doc);
                    }
                }
            });
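The idea behind a shard-aware partitioner is that every document id maps deterministically to one shard, so each Spark partition holds documents for exactly one shard leader. The sketch below shows that idea in plain Java. It is not Solr's DocRouter: `String.hashCode` is only a stand-in for the real hashing into per-shard hash ranges, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardPartitionSketch {

    /** Maps a document id to a shard index in [0, numShards). Stand-in for DocRouter. */
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups document ids so that partition i holds only ids routed to shard i. */
    static List<List<String>> partitionByShard(List<String> docIds, int numShards) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : docIds) {
            partitions.get(shardFor(id, numShards)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts =
                partitionByShard(List.of("tweet-1", "tweet-2", "tweet-3", "tweet-4"), 2);
        for (int shard = 0; shard < parts.size(); shard++) {
            System.out.println("shard " + shard + " <- " + parts.get(shard));
        }
    }
}
```

With the documents grouped this way, each partition can open a single client to its shard leader and stream to it without Solr having to forward documents between nodes.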
  • 22. Extra goodies: Reading Term Vectors from Solr
      • Pull TF/IDF (or just TF) for each term in a field for each document in query results from Solr
      • Can be used to construct RDD<Vector> which can then be passed to MLlib:

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);

        JavaRDD<Vector> vectors =
            solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
        vectors.cache();

        KMeansModel clusters =
            KMeans.train(vectors.rdd(), numClusters, numIterations);

        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(vectors.rdd());
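For readers unfamiliar with the metric, Within Set Sum of Squared Errors is the sum, over all points, of the squared distance to the nearest cluster center. A small plain-Java sketch of that definition follows; it is a local illustration with hypothetical names, not MLlib's distributed implementation.

```java
final class WssseSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Sum over all points of the squared distance to the closest center. */
    static double wssse(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] point : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] center : centers) {
                best = Math.min(best, squaredDistance(point, center));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}};
        double[][] centers = {{0, 1}, {10, 0}};
        // Nearest-center squared distances are 1, 1 and 0.
        System.out.println(wssse(points, centers)); // prints 2.0
    }
}
```

A lower value means tighter clusters, so comparing WSSSE across different values of numClusters is a common way to pick the cluster count.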
  • 23. Document Matching using Stored Queries
      • For each document, determine which of a large set of stored queries matches
      • Useful for alerts, alternative flow paths through a stream, etc
      • Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match
      • Matching framework; you have to decide where to load the stored queries from and what to do when matches are found
      • Scale it using Spark … if you need to scale to many queries, check out Luwak
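The matching step inverts the usual search direction: the queries are stored and each incoming document is tested against them. A minimal sketch of that inversion in plain Java, assuming simple predicates over field maps in place of real Solr queries and an embedded Solr index; the class name and query ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class StoredQueryMatchSketch {

    /** Returns the ids of all stored queries that the document satisfies. */
    static List<String> matchingQueryIds(Map<String, String> doc,
                                         Map<String, Predicate<Map<String, String>>> storedQueries) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> entry : storedQueries.entrySet()) {
            if (entry.getValue().test(doc)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Predicates stand in for stored Solr queries such as type_s:echo
        Map<String, Predicate<Map<String, String>>> queries = new LinkedHashMap<>();
        queries.put("retweets", d -> "echo".equals(d.get("type_s")));
        queries.put("from-twitter", d -> "twitter".equals(d.get("provider_s")));

        Map<String, String> doc = Map.of("type_s", "post", "provider_s", "twitter");
        System.out.println(matchingQueryIds(doc, queries)); // prints [from-twitter]
    }
}
```

What happens with the returned ids (raise an alert, route the document elsewhere) is the part the framework leaves to the caller.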
  • 24. Document Matching using Stored Queries
      Flow: Twitter -> map() -> DocFilterContext (Stored Queries)

        JavaReceiverInputDStream<Status> tweets =
            TwitterUtils.createStream(jssc, null, filters);

        JavaDStream<SolrInputDocument> docs = tweets.map(
            new Function<Status,SolrInputDocument>() {
                // Convert a twitter4j Status object into a SolrInputDocument
                public SolrInputDocument call(Status status) {
                    SolrInputDocument doc = new SolrInputDocument();
                    …
                    return doc;
                }});

        JavaDStream<SolrInputDocument> enriched =
            SolrSupport.filterDocuments(docFilterContext, …);

      • DocFilterContext: key abstraction to allow you to plug in how to store the queries and what action to take when docs match
      • Get queries from the stored queries
      • Index docs into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper …

      Slide Legend: Provided by Spark / Custom Java / Scala code / Provided by Lucidworks
  • 25. Getting Started

        git clone https://github.com/LucidWorks/spark-solr/
        cd spark-solr
        mvn clean package -DskipTests

      See: https://lucidworks.com/blog/solr-spark-sql-datasource/
  • 26. Wrap-up and Q & A
      • Download Fusion: http://lucidworks.com/fusion/download/
      • Feel free to reach out to me with questions: tim.potter@lucidworks.com / @thelabdude