SlideShare a Scribd company logo
Using Apache Spark for generating
ElasticSearch indices offline
Andrej Babolčai
ESET Database systems engineer
Apache: Big Data Europe 2016
Who am I
• Software engineer in database systems team
• Responsible for collecting, moving and providing access to data
Context
Apache
Kafka
Apache
Hive/Impala
ElasticSearch
Agenda
• Approaches we tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
Agenda
• Approaches we tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
Indexing data to live cluster
• Failed because of
• Slowed search and near real-time (NRT) import
• Reduce ingestion speed - too slow
Spark job with Lucene library
• Approach
• Generate indices with Lucene and “import” them to ES
• Indexing with Lucene is fast, hundreds of GB/hour
• Failed because of
• ES types in Lucene
• ES translog and checksum
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
Goal
• Offload ES cluster and generate indices on Spark cluster
• We want indices to be “ready to use”
• When appropriate copy them to ES
Spark + local ES
• Based on https://github.com/MyPureCloud/elasticsearch-
lambda
• Similar approach to Cloudera Solr MapReduceIndexerTool
How do we generate indices offline
S
p
a
r
k
0 partition
1 partition
n partition
…
Start ES
n shards/
1 with data
Start ES
Start ES
…
…
HDFS repository
Index/0
Index/1
Index/n
Input
Data
1 partition 1. shard (data)
0. shard (empty)
n. shard (empty)
… SnapshotConstant shard routing
0. sh. snapshot(empty)
1. sh. snapshot
n. sh. snapshot(empty)
HDFS snapshot repository layout
Dest dir
indices
idxname-2015
idxname-2016
0
__r
__z
1
meta-idxname-
2016.dat
snap-idxname-
2016.dat
Creating local ES node
val nodeSettings: Settings = Settings.builder
.put("http.enabled", false)
.put("processors", 1)
.put("index.merge.scheduler.max_thread_count", 1)
…
.build
val node: Node =
nodeBuilder().settings(nodeSettings).local(true).node()
node.start
val client: Client = node.client
client.admin.indices. … .setSource(mapping).get
HTTP unnecessary, use transport interface
Only JVM local node discovery
Same json mapping as Index API (http)
RDD export like saveAsTextFile
rddToIndex
.repartition(config.numShards)
.saveToESSnapshot(
config,…
We use implicit conversions
package object spark {
implicit class
DBSysSparkRDDFunctions
[T <: Map[String,Object]] //Row
(val rdd: RDD[T]) extends AnyVal {
def saveToESSnapshot(config:String,…):Unit = {
…
rdd.sparkContext.runJob(rdd,esWriter.processPartition _)
Input RDD type bound
Indexing method
Infiltrate spark namespace
Useful ES commands
• Create HDFS snapshot repository:
curl -s -XPUT 'localhost:9200/_snapshot/<Repo name>' -d '{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "/user/<username>/<Snapshot repo hdfs path>",
"load_defaults": "false"
}
}’
Useful ES commands
• Start restore process:
curl -XPOST 'localhost:9200/_snapshot/<Repo name>/<snapshot name>/_restore’
• Monitor restore progress:
curl -s –XGET 'localhost:9200/_cat/recovery?v' |
awk '{print $1 " " $11}' |
fgrep -v " 0.0%" |
fgrep -v "100.0%"
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
ES cluster configuration
Property Value
Number of nodes 24
ES heap size 29GB
CPUs 8 (/proc/cpuinfo)
HDD 2x3.5TB / node
No. of indices 130
No. of shards ~3900
Data size 16 TB
No. of docs > 60 billion
Indexing speed (what we can handle…) ~1000 docs/s
Offline indexing environment
Property Value
Input size 135GB compr. parquet
Number of docs 470M
CPUs (indexing) 15 spark workers
Memory 4GB /worker
Output index layout 20 string fields, 15 part.
Job duration ~3.5h
Restore duration ~20m
Duration total ~4h
Indexing speed >30k docs/s
Future work
• Shard routing
• Indexing on local FS, use directly HDFS
• Speed up indexing
• Use for stream indexing
Summary
• Hard to directly compare RT with our offline approach
• What we wanted was to make historical data available for
users, without influencing production systems
https://github.com/andybab/OfflineESIndexGenerator
Thank you
Questions?
babolcai@eset.sk

More Related Content

What's hot

Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
Spark Summit
 
Shipping & Visualize Your Data With ELK
Shipping  & Visualize Your Data With ELKShipping  & Visualize Your Data With ELK
Shipping & Visualize Your Data With ELK
Adam Chen
 
Logging logs with Logstash - Devops MK 10-02-2016
Logging logs with Logstash - Devops MK 10-02-2016Logging logs with Logstash - Devops MK 10-02-2016
Logging logs with Logstash - Devops MK 10-02-2016
Steve Howe
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench Tools
SATOSHI TAGOMORI
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Spark Summit
 
Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控
Jui An Huang (黃瑞安)
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
Jaroslav Bachorik
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark Summit
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Databricks
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
Spark Summit
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
Vritika Godara
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
Rupy2012 ArangoDB Workshop Part2
Rupy2012 ArangoDB Workshop Part2Rupy2012 ArangoDB Workshop Part2
Rupy2012 ArangoDB Workshop Part2
ArangoDB Database
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
Spark Summit
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Training Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data WarehousesTraining Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data Warehouses
Continuent
 
Espresso advanced
Espresso advancedEspresso advanced
Espresso advanced
Val Huber
 

What's hot (20)

Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Shipping & Visualize Your Data With ELK
Shipping  & Visualize Your Data With ELKShipping  & Visualize Your Data With ELK
Shipping & Visualize Your Data With ELK
 
Logging logs with Logstash - Devops MK 10-02-2016
Logging logs with Logstash - Devops MK 10-02-2016Logging logs with Logstash - Devops MK 10-02-2016
Logging logs with Logstash - Devops MK 10-02-2016
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench Tools
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 
Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Rupy2012 ArangoDB Workshop Part2
Rupy2012 ArangoDB Workshop Part2Rupy2012 ArangoDB Workshop Part2
Rupy2012 ArangoDB Workshop Part2
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
Training Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data WarehousesTraining Slides: 351 - Tungsten Replicator for Data Warehouses
Training Slides: 351 - Tungsten Replicator for Data Warehouses
 
Espresso advanced
Espresso advancedEspresso advanced
Espresso advanced
 

Viewers also liked

October 2015 Old Irving Park Neighborhood Real Estate Market Update
October 2015 Old Irving Park Neighborhood Real Estate Market UpdateOctober 2015 Old Irving Park Neighborhood Real Estate Market Update
October 2015 Old Irving Park Neighborhood Real Estate Market Update
Amanda McMillan
 
November 2015 Near North Side Neighborhood Real Estate Update
November 2015 Near North Side Neighborhood Real Estate UpdateNovember 2015 Near North Side Neighborhood Real Estate Update
November 2015 Near North Side Neighborhood Real Estate Update
Amanda McMillan
 
October 2015 West Loop Neighborhood Real Estate Market Update
October 2015 West Loop Neighborhood Real Estate Market UpdateOctober 2015 West Loop Neighborhood Real Estate Market Update
October 2015 West Loop Neighborhood Real Estate Market Update
Amanda McMillan
 
Absalon Ungdom
Absalon UngdomAbsalon Ungdom
Absalon Ungdom
Boris Temkov Institute
 
October 2015 Old Town Neighborhood Real Estate Market Update
October 2015 Old Town Neighborhood Real Estate Market UpdateOctober 2015 Old Town Neighborhood Real Estate Market Update
October 2015 Old Town Neighborhood Real Estate Market Update
Amanda McMillan
 
November 2015 Noble Square Neighborhood Real Estate Update
November 2015 Noble Square Neighborhood Real Estate UpdateNovember 2015 Noble Square Neighborhood Real Estate Update
November 2015 Noble Square Neighborhood Real Estate Update
Amanda McMillan
 
Presentacion
PresentacionPresentacion
Aplicas funciones periodicas
Aplicas funciones periodicasAplicas funciones periodicas
Aplicas funciones periodicas
Gustavo Hernandez
 
Kti siti maysaroh
Kti siti maysarohKti siti maysaroh
Kti siti maysaroh
SITIMAYSAROH
 
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitusYrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
Eläketurvakeskus
 
Calidad de software
Calidad de softwareCalidad de software
Calidad de software
Miguel Coello
 
Coping Responses
Coping ResponsesCoping Responses
Coping Responses
rebecagm8
 
Calculo de predicados
Calculo de predicadosCalculo de predicados
Calculo de predicados
Diego Gonzalez
 
อัลบั้มรูป
อัลบั้มรูปอัลบั้มรูป
อัลบั้มรูปtaweekahar37
 
Tipos de-software II
Tipos de-software IITipos de-software II
Tipos de-software II
Armando Rodriguez L
 
Act 4.3 pruebas de software
Act 4.3 pruebas de softwareAct 4.3 pruebas de software
Act 4.3 pruebas de software
Rodrigo Santiago
 
Estrategias y técnicas de pruebas de software
Estrategias y técnicas de pruebas de softwareEstrategias y técnicas de pruebas de software
Estrategias y técnicas de pruebas de software
padrino98
 

Viewers also liked (20)

October 2015 Old Irving Park Neighborhood Real Estate Market Update
October 2015 Old Irving Park Neighborhood Real Estate Market UpdateOctober 2015 Old Irving Park Neighborhood Real Estate Market Update
October 2015 Old Irving Park Neighborhood Real Estate Market Update
 
Saxo Grammaticus
Saxo GrammaticusSaxo Grammaticus
Saxo Grammaticus
 
November 2015 Near North Side Neighborhood Real Estate Update
November 2015 Near North Side Neighborhood Real Estate UpdateNovember 2015 Near North Side Neighborhood Real Estate Update
November 2015 Near North Side Neighborhood Real Estate Update
 
October 2015 West Loop Neighborhood Real Estate Market Update
October 2015 West Loop Neighborhood Real Estate Market UpdateOctober 2015 West Loop Neighborhood Real Estate Market Update
October 2015 West Loop Neighborhood Real Estate Market Update
 
Gesta Danorum
Gesta DanorumGesta Danorum
Gesta Danorum
 
Absalon Ungdom
Absalon UngdomAbsalon Ungdom
Absalon Ungdom
 
October 2015 Old Town Neighborhood Real Estate Market Update
October 2015 Old Town Neighborhood Real Estate Market UpdateOctober 2015 Old Town Neighborhood Real Estate Market Update
October 2015 Old Town Neighborhood Real Estate Market Update
 
November 2015 Noble Square Neighborhood Real Estate Update
November 2015 Noble Square Neighborhood Real Estate UpdateNovember 2015 Noble Square Neighborhood Real Estate Update
November 2015 Noble Square Neighborhood Real Estate Update
 
Presentacion
PresentacionPresentacion
Presentacion
 
Fremtidensbibliotek
FremtidensbibliotekFremtidensbibliotek
Fremtidensbibliotek
 
Aplicas funciones periodicas
Aplicas funciones periodicasAplicas funciones periodicas
Aplicas funciones periodicas
 
Kti siti maysaroh
Kti siti maysarohKti siti maysaroh
Kti siti maysaroh
 
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitusYrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
Yrittäjien lakisääteinen eläketurva – työurat, työtulot ja rahoitus
 
Calidad de software
Calidad de softwareCalidad de software
Calidad de software
 
Coping Responses
Coping ResponsesCoping Responses
Coping Responses
 
Calculo de predicados
Calculo de predicadosCalculo de predicados
Calculo de predicados
 
อัลบั้มรูป
อัลบั้มรูปอัลบั้มรูป
อัลบั้มรูป
 
Tipos de-software II
Tipos de-software IITipos de-software II
Tipos de-software II
 
Act 4.3 pruebas de software
Act 4.3 pruebas de softwareAct 4.3 pruebas de software
Act 4.3 pruebas de software
 
Estrategias y técnicas de pruebas de software
Estrategias y técnicas de pruebas de softwareEstrategias y técnicas de pruebas de software
Estrategias y técnicas de pruebas de software
 

Similar to using-apache-spark-for-generating-elasticsearch-indices-offline

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
Gal Marder
 
Icinga 2009 at OSMC
Icinga 2009 at OSMCIcinga 2009 at OSMC
Icinga 2009 at OSMC
Icinga
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 

Similar to using-apache-spark-for-generating-elasticsearch-indices-offline (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Icinga 2009 at OSMC
Icinga 2009 at OSMCIcinga 2009 at OSMC
Icinga 2009 at OSMC
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Spark core
Spark coreSpark core
Spark core
 

using-apache-spark-for-generating-elasticsearch-indices-offline

  • 1. Using Apache Spark for generating ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe 2016
  • 2. Who am I • Software engineer in database systems team • Responsible for collecting, moving and providing access to data
  • 4. Agenda • Approaches we tried and why they failed • Solution used, Spark + ES • Benchmark, summary and possible improvements
  • 5. Agenda • Approaches we tried and why they failed • Solution used, Spark + ES • Benchmark, summary and possible improvements
  • 6. Indexing data to live cluster • Failed because of • Slowed search and near real-time (NRT) import • Reduce ingestion speed - too slow
  • 7. Spark job with Lucene library • Approach • Generate indices with Lucene and “import” them to ES • Indexing with Lucene is fast, hundreds of GB/hour • Failed because of • ES types in Lucene • ES translog and checksum
  • 8. Agenda • Approaches tried and why they failed • Solution used, Spark + ES • Benchmark, summary and possible improvements
  • 9. Goal • Offload ES cluster and generate indices on Spark cluster • We want indices to be “ready to use” • When appropriate copy them to ES
  • 10. Spark + local ES • Based on https://github.com/MyPureCloud/elasticsearch- lambda • Similar approach to Cloudera Solr MapReduceIndexerTool
  • 11. How do we generate indices offline S p a r k 0 partition 1 partition n partition … Start ES n shards/ 1 with data Start ES Start ES … … HDFS repository Index/0 Index/1 Index/n Input Data 1 partition 1. shard (data) 0. shard (empty) n. shard (empty) … SnapshotConstant shard routing 0. sh. snapshot(empty) 1. sh. snapshot n. sh. snapshot(empty)
  • 12. HDFS snapshot repository layout Dest dir indices idxname-2015 idxname-2016 0 __r __z 1 meta-idxname- 2016.dat snap-idxname- 2016.dat
  • 13. Creating local ES node val nodeSettings: Settings = Settings.builder .put("http.enabled", false) .put("processors", 1) .put("index.merge.scheduler.max_thread_count", 1) … .build val node: Node = nodeBuilder().settings(nodeSettings).local(true).node() node.start val client: Client = node.client client.admin.indices. … .setSource(mapping).get HTTP unnecessary, use transport interface Only JVM local node discovery Same json mapping as Index API (http)
  • 14. RDD export like saveAsTextFile rddToIndex .repartition(config.numShards) .saveToESSnapshot( config,…
  • 15. We use implicit conversions package object spark { implicit class DBSysSparkRDDFunctions [T <: Map[String,Object]] //Row (val rdd: RDD[T]) extends AnyVal { def saveToESSnapshot(config:String,…):Unit = { … rdd.sparkContext.runJob(rdd,esWriter.processPartition _) Input RDD type bound Indexing method Infiltrate spark namespace
  • 16. Useful ES commands • Create HDFS snapshot repository: curl -s -XPUT 'localhost:9200/_snapshot/<Repo name>' -d '{ "type": "hdfs", "settings": { "uri": "hdfs://namenode:8020/", "path": "/user/<username>/<Snapshot repo hdfs path>", "load_defaults": "false" } }’
  • 17. Useful ES commands • Start restore process: curl -XPOST 'localhost:9200/_snapshot/<Repo name>/<snapshot name>/_restore’ • Monitor restore progress: curl -s –XGET 'localhost:9200/_cat/recovery?v' | awk '{print $1 " " $11}' | fgrep -v " 0.0%" | fgrep -v "100.0%"
  • 18. Agenda • Approaches tried and why they failed • Solution used, Spark + ES • Benchmark, summary and possible improvements
  • 19. ES cluster configuration Property Value Number of nodes 24 ES heap size 29GB CPUs 8 (/proc/cpuinfo) HDD 2x3.5TB / node No. of indices 130 No. of shards ~3900 Data size 16 TB No. of docs > 60 billion Indexing speed (what we can handle…) ~1000 docs/s
  • 20. Offline indexing environment Property Value Input size 135GB compr. parquet Number of docs 470M CPUs (indexing) 15 spark workers Memory 4GB /worker Output index layout 20 string fields, 15 part. Job duration ~3.5h Restore duration ~20m Duration total ~4h Indexing speed >30k docs/s
  • 21. Future work • Shard routing • Indexing on local FS, use directly HDFS • Speed up indexing • Use for stream indexing
  • 22. Summary • Hard to directly compare RT with our offline approach • What we wanted was to make historical data available for users, without influencing production systems https://github.com/andybab/OfflineESIndexGenerator