SlideShare a Scribd company logo
1 of 33
Download to read offline
LuceneRDD 

Entity Linkage & (Spatial) Search
Anastasios Zouzias
ScalaIO @ Lyon 2016
* Data Scientist @ Swisscom, CH: Mobility Insights Team
* Researcher @ IBM Research, Zurich: Machine
Learning & Search Engineer
* PhD in Univ. of Toronto: ML & Randomized Algorithms
* Scala Developer (day job): 2-3 years
* Apache Solr / Elasticsearch developer: 3+ years
About myself
How many of you
have used Spark?
Apache Spark
RDD (Resilient Distributed Dataset)
Distributes data over multiple nodes
Executor #1
…
Spark

Master
Executor #2
Executor #n
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD
“Spark Driver”
How many of you
have used
Lucene/Solr/Elastic?
Apache Lucene
* Lucene is a powerful Java search library that lets
you easily do search or Information Retrieval (IR)
* Used by LinkedIn, Twitter, and many more…
* Scalable and High-performance indexing
* Powerful, Accurate and Efficient Search algorithms
LuceneRDD
What is it?
Open-source project

https://github.com/zouzias/spark-lucenerdd
Spark:
Partitioning / distribution
of queries / data
Lucene:
Indexing / querying
data (single partition)
LuceneRDD:
Query / dispatch /
aggregation of data
Motivation
Why LuceneRDD exists?
* Natively support of full-text / spatial search in Spark

(without external Elastic/Solr cluster)
* Scalable Entity Linkage (approx. join) with Spark & Lucene
* Personal: Better understanding of Spark’s Internals (RDDs)
* Personal: Better understanding of Scala (implicits)
LuceneRDD: RDD with search
LuceneRDD
Full-text search & Entity Linkage
FacetedLuceneRDD
LuceneRDD + faceted search
ShapeLuceneRDD
LuceneRDD + Spatial search
* Main development: Spark 2.x (supports Spark >= 1.4)
* Lucene 5.5.3 (Lucene 6.2.2 JVM 8)
* Released on maven central & spark-packages (Scala 2.10 / 2.11)
or
LuceneRDD
Main idea
Index RDD partition data with
distributed Lucene index
(inverted index per partition)
Index
Index
Index
…
LuceneRDD
Executor #1
…
Spark

Master
Executor #2
Executor #n
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD
Index storage

disk or memory
Index
Index
Index
…
LuceneRDD
DataFrame to LuceneRDD*
Executor #1
…
Spark

Master
Executor #2
Executor #n
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD/DataFrame
Language analyzers, i.e.,
French, English etc.. are
configurable
Language analyzers,
i.e., French, English etc.. are
configurable
Simplistic Example
Executor #1
…
Spark

Master
Executor #2
Executor #n
search:par*
Index
Index
Index
…
LuceneRDD
[Paris]
[Bern]
[Athens]
[Ottawa]
[Warsaw]
[Rome]
[Zagreb]
[Lisbon]
[Dublin]
…
RDD
Executor #1
…
Spark

Master
Executor #2
Executor #n
Broadcast search
query to executors
Index
Index
Index
…
LuceneRDD
search:par*
search:par*
search:par*
search:par*
Simplistic Example
Executor #1
…
Spark

Master
Executor #2
Executor #n
Lucene prefix Search
on each partition
Index
Index
Index
…
LuceneRDD
search:par*
List(Paris)
List()
List()
…
LuceneRDDResponse
topK results / partition
Simplistic Example
Aggregate Results
Executor #1
…
Spark

Master
Executor #2
Executor #n
List(Paris)
List()
List()
…
LuceneRDDResponse
response.take(k)
List(Paris)
List()
List()
How to aggregate?
TopK monoids in action
LuceneRDDResponse
TopKMonoid((0.8, Paris), (0.7, Result2), (0.6, Result3))
TopKMonoid((0.71, XXX), (0.35, YYY), (0.25, ZZZ))
TopKMonoid((0.12, XXX), (0.09, YYY), (0.05, ZZZ))
Executor #1
…
Spark

Master
Executor #2
Executor #n
List(Paris)
List()
List()
+
=
+
TopK Sorted
Scored Results
Twitter’s Algebird TopKMonoid
Aggregate Results
Executor #1
…
Spark

Master
Executor #2
Executor #n
List(Paris)
List()
List()
…
LuceneRDDResponse
LuceneRDDResponse.take(k)
List(Paris)
List()
List()
LuceneRDD Operations
ShapeLuceneRDD
LuceneRDD + Spatial search
FacetedLuceneRDD
LuceneRDD + faceted search
Flexible multi-fields query:
term/Fuzzy/Prefix/etc sub-queries
+ boolean queries
Entity Linkage
a.k.a. approximate join
Join left & right dataset
according to a “similarity”
(NO ids available)
Left Dataset Right Dataset
Entity Linkage (approx. join)
?
Simplicity:
one-to-one linkage
Left Dataset Right Dataset
Entity Linkage (approx. join)
?
Naive approach
Compute pairwise scores does not scale: O(n2)
Entity Linkage API
LuceneRDD Approximate Join
1) Index right dataset using LuceneRDD
2) Define “search similarity function” between each left entity &
right dataset
3) For every left entity, search & zip using topK results
Index
Index
Index
…
LuceneRDD
LuceneRDD Entity Linkage
1) Index right dataset using LuceneRDD
2) Define “search similarity function” between left & right dataset
3) For every left entity, search & zip by search similarly
Index
Index
Index
…
LuceneRDD
List(“nam
e:james")
List(“nam
e:joel")
List(“na
me:ram")
…
RDD[String]
Spark

Master
(3b)Collect&broadcast(RDD[String])
(3a)Maptosearchqueries
LuceneRDD: linkage codeQuery each partition and
zip query with topK results
Reduce topK monoids
per query
Map each left entity
to search query & broadcast to
each partition
Does LuceneRDD approx. join scale?
https://github.com/zouzias/spark-lucenerdd-aws
LuceneRDD Entity Linkage Scalability (EMR 4.8.0 / Spark 1.6.2)LinkageTime(seconds)
0
300
600
900
1200
# of executors (= # of AWS core nodes)
2
3
5
8
10
15
20
1135.23
755.69
482.98
324.70 315.50
187.22 185.51
AWS: master: m3.xlarge, Core: r3.xlarge, Exec. mem: 9G
Joined Datasets
* H1B US Visa Applications (2.6 million records)
* Geonames US Cities (4.4 million records)
Demo
LuceneRDD Linkage API
(ACM vs DBLP research articles)
https://github.com/zouzias/spark-lucenerdd-examples
Köpcke, H.; Thor, A.; Rahm, E.
Evaluation of entity resolution approaches on real-world match problems
Proc. 36th Intl. Conference on Very Large Databases (VLDB), 2010
Spatial/Entity Linkage API
Demo
ShapeLuceneRDD Linkage
(Country polygons vs Capital points)
https://github.com/zouzias/spark-lucenerdd-examples
Future Work
* Contributors / use cases are welcome!
* More performance tests with approx. join
* Spatial-linkage performance tests using 

open Street Map data
* Bypass the collect-broadcast Spark pattern?
Merci pour votre attention!
Questions?
Elastic/SolrCloud Connector
Executor
#1
…
Executor
#2
Executor
#n
Driver
Cluster
Node 1
/
Node 2
/
Node m
/
…
Limitations
- m is fixed & (usually) small compared to n
- additional network communication: execs & elastic/solr

More Related Content

What's hot

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Edureka!
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerDatabricks
 

What's hot (20)

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Spark
SparkSpark
Spark
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 

Viewers also liked

Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
ieeesb patras study abroad 2007 zouzias
ieeesb patras study abroad 2007 zouziasieeesb patras study abroad 2007 zouzias
ieeesb patras study abroad 2007 zouziasManolis Viennas
 
Geospatial search with SOLR
Geospatial search with SOLRGeospatial search with SOLR
Geospatial search with SOLRNicolas Leroy
 
Geospatial Search With Amazon CloudSearch
Geospatial Search With Amazon CloudSearch Geospatial Search With Amazon CloudSearch
Geospatial Search With Amazon CloudSearch Michael Bohlig
 
I See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial ApplicationsI See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial ApplicationsIBM Cloud Data Services
 
How to Tune Search Requests using Amazon CloudSearch
How to Tune Search Requests using Amazon CloudSearchHow to Tune Search Requests using Amazon CloudSearch
How to Tune Search Requests using Amazon CloudSearchAmazon Web Services
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...Graeme Cross
 
Bureaux professionnels : Achat ou Location ?
Bureaux professionnels : Achat ou Location ? Bureaux professionnels : Achat ou Location ?
Bureaux professionnels : Achat ou Location ? FIDAQUITAINE
 
Đặc sản làm quà tặng
Đặc sản làm quà tặngĐặc sản làm quà tặng
Đặc sản làm quà tặngTôi Phượt
 
Mobile's Impact on Moments of Truth
Mobile's Impact on Moments of TruthMobile's Impact on Moments of Truth
Mobile's Impact on Moments of TruthEsendex
 
Projet salle blanche - Telnet Innovation Labs
Projet salle blanche - Telnet Innovation LabsProjet salle blanche - Telnet Innovation Labs
Projet salle blanche - Telnet Innovation LabsEng. Abdelmonaem AYACHI
 
Facing communication challenges in collaborative development
Facing communication challenges in collaborative developmentFacing communication challenges in collaborative development
Facing communication challenges in collaborative developmentFabio Calefato
 

Viewers also liked (20)

Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
ieeesb patras study abroad 2007 zouzias
ieeesb patras study abroad 2007 zouziasieeesb patras study abroad 2007 zouzias
ieeesb patras study abroad 2007 zouzias
 
Geospatial search with SOLR
Geospatial search with SOLRGeospatial search with SOLR
Geospatial search with SOLR
 
Geospatial Search With Amazon CloudSearch
Geospatial Search With Amazon CloudSearch Geospatial Search With Amazon CloudSearch
Geospatial Search With Amazon CloudSearch
 
I See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial ApplicationsI See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial Applications
 
Geohash
GeohashGeohash
Geohash
 
How to Tune Search Requests using Amazon CloudSearch
How to Tune Search Requests using Amazon CloudSearchHow to Tune Search Requests using Amazon CloudSearch
How to Tune Search Requests using Amazon CloudSearch
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)
 
300
300300
300
 
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...
'The B of the Bang' - What the UK Government’s white paper on Brexit means fo...
 
Bureaux professionnels : Achat ou Location ?
Bureaux professionnels : Achat ou Location ? Bureaux professionnels : Achat ou Location ?
Bureaux professionnels : Achat ou Location ?
 
Đặc sản làm quà tặng
Đặc sản làm quà tặngĐặc sản làm quà tặng
Đặc sản làm quà tặng
 
Mobile's Impact on Moments of Truth
Mobile's Impact on Moments of TruthMobile's Impact on Moments of Truth
Mobile's Impact on Moments of Truth
 
Projet salle blanche - Telnet Innovation Labs
Projet salle blanche - Telnet Innovation LabsProjet salle blanche - Telnet Innovation Labs
Projet salle blanche - Telnet Innovation Labs
 
Passport logistika nefedov
Passport logistika nefedovPassport logistika nefedov
Passport logistika nefedov
 
Being Sherlock Holmes
Being Sherlock HolmesBeing Sherlock Holmes
Being Sherlock Holmes
 
Facing communication challenges in collaborative development
Facing communication challenges in collaborative developmentFacing communication challenges in collaborative development
Facing communication challenges in collaborative development
 
Novedades Essis 2017
Novedades Essis 2017Novedades Essis 2017
Novedades Essis 2017
 

Similar to LuceneRDD for (Geospatial) Search and Entity Linkage

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkEren Avşaroğulları
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsLucidworks
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Lucidworks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkWill Du
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 

Similar to LuceneRDD for (Geospatial) Search and Entity Linkage (20)

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark core
Spark coreSpark core
Spark core
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

LuceneRDD for (Geospatial) Search and Entity Linkage

  • 1. LuceneRDD 
 Entity Linkage & (Spatial) Search Anastasios Zouzias ScalaIO @ Lyon 2016
  • 2. * Data Scientist @ Swisscom, CH: Mobility Insights Team * Researcher @ IBM Research, Zurich: Machine Learning & Search Engineer * PhD in Univ. of Toronto: ML & Randomized Algorithms * Scala Developer (day job): 2-3 years * Apache Solr / Elasticsearch developer: 3+ years About myself
  • 3. How many of you have used Spark?
  • 4. Apache Spark RDD (Resilient Distributed Dataset) Distributes data over multiple nodes Executor #1 … Spark
 Master Executor #2 Executor #n [Paris] [Bern] [Athens] [Ottawa] [Warsaw] [Rome] [Zagreb] [Lisbon] [Dublin] … RDD “Spark Driver”
  • 5. How many of you have used Lucene/Solr/Elastic?
  • 6. Apache Lucene * Lucene is a powerful Java search library that lets you easily do search or Information Retrieval (IR) * Used by LinkedIn, Twitter, and many more… * Scalable and High-performance indexing * Powerful, Accurate and Efficient Search algorithms
  • 7. LuceneRDD What is it? Open-source project
 https://github.com/zouzias/spark-lucenerdd Spark: Partitioning / distribution of queries / data Lucene: Indexing / querying data (single partition) LuceneRDD: Query / dispatch / aggregation of data
  • 8. Motivation Why LuceneRDD exists? * Natively support of full-text / spatial search in Spark
 (without external Elastic/Solr cluster) * Scalable Entity Linkage (approx. join) with Spark & Lucene * Personal: Better understanding of Spark’s Internals (RDDs) * Personal: Better understanding of Scala (implicits)
  • 9. LuceneRDD: RDD with search LuceneRDD Full-text search & Entity Linkage FacetedLuceneRDD LuceneRDD + faceted search ShapeLuceneRDD LuceneRDD + Spatial search * Main development: Spark 2.x (supports Spark >= 1.4) * Lucene 5.5.3 (Lucene 6.2.2 JVM 8) * Released on maven central & spark-packages (Scala 2.10 / 2.11) or
  • 10. LuceneRDD Main idea Index RDD partition data with distributed Lucene index (inverted index per partition) Index Index Index … LuceneRDD Executor #1 … Spark
 Master Executor #2 Executor #n [Paris] [Bern] [Athens] [Ottawa] [Warsaw] [Rome] [Zagreb] [Lisbon] [Dublin] … RDD Index storage
 disk or memory
  • 11. Index Index Index … LuceneRDD DataFrame to LuceneRDD* Executor #1 … Spark
 Master Executor #2 Executor #n [Paris] [Bern] [Athens] [Ottawa] [Warsaw] [Rome] [Zagreb] [Lisbon] [Dublin] … RDD/DataFrame Language analyzers, i.e., French, English etc.. are configurable Language analyzers, i.e., French, English etc.. are configurable
  • 12. Simplistic Example Executor #1 … Spark
 Master Executor #2 Executor #n search:par* Index Index Index … LuceneRDD [Paris] [Bern] [Athens] [Ottawa] [Warsaw] [Rome] [Zagreb] [Lisbon] [Dublin] … RDD
  • 13. Executor #1 … Spark
 Master Executor #2 Executor #n Broadcast search query to executors Index Index Index … LuceneRDD search:par* search:par* search:par* search:par* Simplistic Example
  • 14. Executor #1 … Spark
 Master Executor #2 Executor #n Lucene prefix Search on each partition Index Index Index … LuceneRDD search:par* List(Paris) List() List() … LuceneRDDResponse topK results / partition Simplistic Example
  • 15. Aggregate Results Executor #1 … Spark
 Master Executor #2 Executor #n List(Paris) List() List() … LuceneRDDResponse response.take(k) List(Paris) List() List() How to aggregate? TopK monoids in action
  • 16. LuceneRDDResponse TopKMonoid((0.8, Paris), (0.7, Result2), (0.6, Result3)) TopKMonoid((0.71, XXX), (0.35, YYY), (0.25, ZZZ)) TopKMonoid((0.12, XXX), (0.09, YYY), (0.05, ZZZ)) Executor #1 … Spark
 Master Executor #2 Executor #n List(Paris) List() List() + = + TopK Sorted Scored Results Twitter’s Algebird TopKMonoid Aggregate Results
  • 17. Executor #1 … Spark
 Master Executor #2 Executor #n List(Paris) List() List() … LuceneRDDResponse LuceneRDDResponse.take(k) List(Paris) List() List()
  • 18. LuceneRDD Operations ShapeLuceneRDD LuceneRDD + Spatial search FacetedLuceneRDD LuceneRDD + faceted search Flexible multi-fields query: term/Fuzzy/Prefix/etc sub-queries + boolean queries
  • 20. Join left & right dataset according to a “similarity” (NO ids available) Left Dataset Right Dataset Entity Linkage (approx. join) ? Simplicity: one-to-one linkage
  • 21. Left Dataset Right Dataset Entity Linkage (approx. join) ? Naive approach Compute pairwise scores does not scale: O(n2)
  • 23. LuceneRDD Approximate Join 1) Index right dataset using LuceneRDD 2) Define “search similarity function” between each left entity & right dataset 3) For every left entity, search & zip using topK results Index Index Index … LuceneRDD
  • 24. LuceneRDD Entity Linkage 1) Index right dataset using LuceneRDD 2) Define “search similarity function” between left & right dataset 3) For every left entity, search & zip by search similarly Index Index Index … LuceneRDD List(“nam e:james") List(“nam e:joel") List(“na me:ram") … RDD[String] Spark
 Master (3b)Collect&broadcast(RDD[String]) (3a)Maptosearchqueries
  • 25. LuceneRDD: linkage codeQuery each partition and zip query with topK results Reduce topK monoids per query Map each left entity to search query & broadcast to each partition
  • 26. Does LuceneRDD approx. join scale? https://github.com/zouzias/spark-lucenerdd-aws
  • 27. LuceneRDD Entity Linkage Scalability (EMR 4.8.0 / Spark 1.6.2)LinkageTime(seconds) 0 300 600 900 1200 # of executors (= # of AWS core nodes) 2 3 5 8 10 15 20 1135.23 755.69 482.98 324.70 315.50 187.22 185.51 AWS: master: m3.xlarge, Core: r3.xlarge, Exec. mem: 9G Joined Datasets * H1B US Visa Applications (2.6 million records) * Geonames US Cities (4.4 million records)
  • 28. Demo LuceneRDD Linkage API (ACM vs DBLP research articles) https://github.com/zouzias/spark-lucenerdd-examples Köpcke, H.; Thor, A.; Rahm, E. Evaluation of entity resolution approaches on real-world match problems Proc. 36th Intl. Conference on Very Large Databases (VLDB), 2010
  • 30. Demo ShapeLuceneRDD Linkage (Country polygons vs Capital points) https://github.com/zouzias/spark-lucenerdd-examples
  • 31. Future Work * Contributors / use cases are welcome! * More performance tests with approx. join * Spatial-linkage performance tests using 
 open Street Map data * Bypass the collect-broadcast Spark pattern?
  • 32. Merci pour votre attention! Questions?
  • 33. Elastic/SolrCloud Connector Executor #1 … Executor #2 Executor #n Driver Cluster Node 1 / Node 2 / Node m / … Limitations - m is fixed & (usually) small compared to n - additional network communication: execs & elastic/solr