In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provide a highly flexible approach to its users for entity linkage. This flexibility is due to Lucene's powerful query language that is able to combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
Lastly and time permitting, I will present ShapeLuceneRDD which enhances LuceneRDD with geospatial queries.
6. Apache Lucene
* Lucene is a powerful Java search library that lets
you easily do search or Information Retrieval (IR)
* Used by LinkedIn, Twitter, and many more…
* Scalable and High-performance indexing
* Powerful, Accurate and Efficient Search algorithms
7. LuceneRDD
What is it?
Open-source project
https://github.com/zouzias/spark-lucenerdd
Spark:
Partitioning / distribution
of queries / data
Lucene:
Indexing / querying
data (single partition)
LuceneRDD:
Query / dispatch /
aggregation of data
8. Motivation
Why LuceneRDD exists?
* Natively support of full-text / spatial search in Spark
(without external Elastic/Solr cluster)
* Scalable Entity Linkage (approx. join) with Spark & Lucene
* Personal: Better understanding of Spark’s Internals (RDDs)
* Personal: Better understanding of Scala (implicits)
9. LuceneRDD: RDD with search
LuceneRDD
Full-text search & Entity Linkage
FacetedLuceneRDD
LuceneRDD + faceted search
ShapeLuceneRDD
LuceneRDD + Spatial search
* Main development: Spark 2.x (supports Spark >= 1.4)
* Lucene 5.5.3 (Lucene 6.2.2 JVM 8)
* Released on maven central & spark-packages (Scala 2.10 / 2.11)
or
10. LuceneRDD
Main idea
Index RDD partition data with
distributed Lucene index
(inverted index per partition)
Index
Index
Index
…
LuceneRDD
Executor #1
…
Spark
Master
Executor #2
Executor #n
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD
Index storage
disk or memory
11. Index
Index
Index
…
LuceneRDD
DataFrame to LuceneRDD*
Executor #1
…
Spark
Master
Executor #2
Executor #n
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD/DataFrame
Language analyzers, i.e.,
French, English etc.. are
configurable
Language analyzers,
i.e., French, English etc.. are
configurable
20. Join left & right dataset
according to a “similarity”
(NO ids available)
Left Dataset Right Dataset
Entity Linkage (approx. join)
?
Simplicity:
one-to-one linkage
21. Left Dataset Right Dataset
Entity Linkage (approx. join)
?
Naive approach
Compute pairwise scores does not scale: O(n2)
23. LuceneRDD Approximate Join
1) Index right dataset using LuceneRDD
2) Define “search similarity function” between each left entity &
right dataset
3) For every left entity, search & zip using topK results
Index
Index
Index
…
LuceneRDD
24. LuceneRDD Entity Linkage
1) Index right dataset using LuceneRDD
2) Define “search similarity function” between left & right dataset
3) For every left entity, search & zip by search similarly
Index
Index
Index
…
LuceneRDD
List(“nam
e:james")
List(“nam
e:joel")
List(“na
me:ram")
…
RDD[String]
Spark
Master
(3b)Collect&broadcast(RDD[String])
(3a)Maptosearchqueries
25. LuceneRDD: linkage codeQuery each partition and
zip query with topK results
Reduce topK monoids
per query
Map each left entity
to search query & broadcast to
each partition
27. LuceneRDD Entity Linkage Scalability (EMR 4.8.0 / Spark 1.6.2)LinkageTime(seconds)
0
300
600
900
1200
# of executors (= # of AWS core nodes)
2
3
5
8
10
15
20
1135.23
755.69
482.98
324.70 315.50
187.22 185.51
AWS: master: m3.xlarge, Core: r3.xlarge, Exec. mem: 9G
Joined Datasets
* H1B US Visa Applications (2.6 million records)
* Geonames US Cities (4.4 million records)
28. Demo
LuceneRDD Linkage API
(ACM vs DBLP research articles)
https://github.com/zouzias/spark-lucenerdd-examples
Köpcke, H.; Thor, A.; Rahm, E.
Evaluation of entity resolution approaches on real-world match problems
Proc. 36th Intl. Conference on Very Large Databases (VLDB), 2010
31. Future Work
* Contributors / use cases are welcome!
* More performance tests with approx. join
* Spatial-linkage performance tests using
open Street Map data
* Bypass the collect-broadcast Spark pattern?