APACHE SPARK & ELASTICSEARCH
Holden Karau
Reducing duplicated code and saving on network overhead
Who am I?
Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github: https://github.com/holdenk
● e-mail: holden@databricks.com
● @holdenkarau
*Which is why I might be sleepy today.
What is Spark & Elasticsearch?
Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/
Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine.
● http://www.elasticsearch.org/
Talk overview
Goal: understand how to work with ES & Hadoop
● Spark & Spark Streaming let us re-use indexing code
● It’s a bit ugly right now…
● Demo* with tweets & top hash tags per region
● We can customize the ES connector to write to the shard based on partition
● This is an early version of the talk (feedback welcome!)
Assumptions:
● Familiar(ish) with Elasticsearch, or at least Solr
● Can read Scala
*Demo gods willing.
Why should you care?
Small differences between off-line and on-line indexing code
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
Leads to
fireworks photo by Hailey Toft
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let’s start with the on-line pipeline
val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
val tweets = TwitterUtils.createStream(ssc, None)
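(Not on the slide: once the output action shown a few slides down is wired up, the streaming job is actually kicked off with the standard Spark Streaming calls below; ssc is the StreamingContext from above.)
// Start the streaming computation and block until it is stopped;
// the DStream transformations only begin running after start().
ssc.start()
ssc.awaitTermination()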
Let’s get ready to write the data into Elasticsearch
Photo by Cloned Milkmen
Let’s get ready to write the data into Elasticsearch
def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}
Add a schema
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string"},
      "hashTags" : {"type" : "string"},
      "location" : {"type" : "geo_point"}
    }
  }
}'
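(A quick sanity check, using the standard Elasticsearch mapping API, is to read the mapping back:)
curl -XGET 'http://localhost:9200/twitter/tweet/_mapping'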
Let’s format our tweets
def prepareTweets(tweet: twitter4j.Status) = {
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to Hadoop Writable types
  mapToOutput(fields)
}
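(mapToOutput itself isn’t shown on the slide; what follows is my reconstruction, not the talk’s exact code, of a helper that converts the Scala map into the Writable pair EsOutputFormat expects.)
import org.apache.hadoop.io.{MapWritable, NullWritable, Text}

// Hypothetical helper: wrap each field in Text and return a
// (key, value) pair suitable for saveAsHadoopDataset; the key is
// ignored by EsOutputFormat, so NullWritable works fine.
def mapToOutput(fields: Map[String, String]): (Object, MapWritable) = {
  val writable = new MapWritable()
  fields.foreach { case (k, v) => writable.put(new Text(k), new Text(v)) }
  (NullWritable.get(), writable)
}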
And save them…
tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The jobConf isn’t serializable so we create it here
  val jobConf = SharedESConfig.setupEsOnSparkContext(sc,
    esResource, Some(esNodes))
  // Convert our tweets to something that can be indexed
  val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
  tweetsAsMap.saveAsHadoopDataset(jobConf)
}
Now let’s query them!
{"filtered" : {
  "query" : {
    "match_all" : {}
  },
  "filter" : {
    "geo_distance" : {
      "distance" : "${dist}km",
      "location" : {
        "lat" : "${lat}",
        "lon" : "${lon}"
      }
    }
  }
}}
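(The ${dist}/${lat}/${lon} placeholders suggest the query is built with Scala string interpolation; a minimal sketch, assuming dist, lat, and lon are already in scope:)
// Build the geo-distance query string; s-interpolation fills in the
// search radius (in km) and the centre point.
val query = s"""{"filtered" : {
  "query" : { "match_all" : {} },
  "filter" : { "geo_distance" : {
    "distance" : "${dist}km",
    "location" : { "lat" : "$lat", "lon" : "$lon" }
  }}
}}"""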
Now let’s find the hash tags :)
jobConf.set("es.query", query)
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
  SharedIndex.mapWritableToInput(value) }
val hashTags = tweets.flatMap{t =>
  t.getOrElse("hashTags", "").split(" ")
}
println(hashTags.countByValue())
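(mapWritableToInput is another helper from the demo repo; again a rough reconstruction of mine, going from a MapWritable back to a plain Scala map:)
import org.apache.hadoop.io.MapWritable
import scala.collection.JavaConverters._

// Hypothetical helper: convert Hadoop Writables back to Strings so
// the rest of the pipeline can work with Map[String, String].
def mapWritableToInput(in: MapWritable): Map[String, String] = {
  in.entrySet().asScala.map { entry =>
    (entry.getKey.toString, entry.getValue.toString)
  }.toMap
}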
oh wait :(
Sad panda by Jose Antonio Tovar
Now let’s find some common words…
// Extract the top words
val words = tweets.flatMap{tweet =>
  tweet.flatMap{elem =>
    elem._2 match {
      case null => Nil
      case e => e.split(" ").toSeq
    }
  }
}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case (key, value) =>
  println(key + ":" + value)
}
println("------")
OK, fine, the “top” words
object WordCountOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}
val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val topWords = wc.takeOrdered(40)(WordCountOrdering)
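(For what it’s worth, Spark’s built-in top gets the same result without the hand-rolled Ordering object:)
// Equivalent one-liner: take the 40 pairs with the largest counts.
val topWords = wc.top(40)(Ordering.by(_._2))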
NYC words
I,144  a,83  the,76  to,75  you,66  my,56  and,52  me,47  that,39  of,38
in,38  on,37  so,36  like,35  is,32  this,30  was,29  it,28  with,27  for,25
I'm,24  i,23  but,22  just,22  at,21  be,21  are,20  don't,19  have,18  lol,17
out,17  your,16  love,16  up,16  all,16  her,15  when,14  not,13  it's,13
SF words
I,18  the,16  to,15  you,9  a,9  my,8  me,8  for,8  at,7  of,7
want,6  Just,6  in,6  is,6  this,5  got,5  she,5  when,5  was,4  so,4
what,4  &,4  he,4  your,4  as,4  they,4  it,4  @,4  get,4  and,4
are,4  say,3  w/,3  do,3  dont,3  going,3  fuck,3  i,3  know,3  or,3
Indexing Part 2 (electric boogaloo)
Writing directly to the node holding the correct shard saves us network overhead
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
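(Background, not from the slide: Elasticsearch routes a document to shard hash(routing key) % number_of_shards, where the routing key defaults to the document id and the exact hash function is version-dependent. A toy illustration of that arithmetic, with a stand-in djb2-style hash:)
// Toy sketch (not ES's exact hash): a writer that can compute the
// target shard locally can send each document straight to the node
// that hosts that shard instead of proxying through an arbitrary node.
def djb2(s: String): Int = s.foldLeft(5381)((h, c) => h * 33 + c)
def targetShard(docId: String, numberOfShards: Int): Int = {
  val h = djb2(docId) % numberOfShards
  if (h < 0) h + numberOfShards else h
}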
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
private int detectCurrentInstance(Configuration conf) {
  if (sparkInstance != null) {
    if (log.isDebugEnabled()) {
      log.debug(String.format("Using Spark partition info [%d] for [%s]",
        sparkInstance, uri));
    }
    return sparkInstance;
  }
  …
}
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) {
  EsRecordWriter writer = new EsRecordWriter(job, progress);
  // This is a special hack for Spark, which sets the name to
  // "part-[partition number]"; if our jobConf asks for it we use this
  // partition number as the shard number.
  if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
    writer.setSparkInstance(Integer.valueOf(name.substring(5)));
  }
  return writer;
}
So what does that give us?
Spark names each task’s output file part-[partition number]
If our RDD is partitioned the same way as the index’s shards, each task writes directly to its own shard (see the sketch below)
Likely the best place to use this is in re-indexing data
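(The slide leaves the lining-up implicit; one plausible sketch of mine, reusing the toy targetShard from above on a hypothetical docs: RDD[(String, Map[String, String])]:)
import org.apache.spark.HashPartitioner

// Key each document by its target shard, then use a HashPartitioner
// with one partition per shard. For Int keys in [0, n) HashPartitioner
// sends key k to partition k, so Spark's "part-k" file name matches
// the shard the document routes to.
val numberOfShards = 5  // assumption: must match the index settings
val aligned = docs
  .map { case (docId, doc) => (targetShard(docId, numberOfShards), (docId, doc)) }
  .partitionBy(new HashPartitioner(numberOfShards))
  .values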
Re-index all the things*
// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit… oops
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
“Useful” links
● Feedback: holden@databricks.com
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/