10. Along comes Apache Spark!
• Open-Source Big Data Framework from UC Berkeley
• In-Memory Analytics
• > 800 contributors
• Largest Cluster = 8,000+ nodes (Tencent)
11. Source: M. Zaharia, “New Directions for Spark in 2015”, Spark Summit East 2015, 18 March 2015.
21. Action#2 – Thai Monitoring Corpus
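import org.apache.spark.sql.functions.from_unixtime

// Load the corpus from JSON files.
val data = sqlContext.read.json("gs://tmc-data/*")
// "timestamp" is presumably in milliseconds; derive a yyyy-MM-dd date column.
val d = data.withColumn("date",
  from_unixtime(data("timestamp") / 1000, "yyyy-MM-dd"))

// <F...>...</F...> markup, and runs of ASCII characters U+0021 to U+007F.
val failre = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
val nonthaire = "[\u0021-\u007F]+".r

// Strip newlines, markers, and non-Thai text from one segmented record.
def prepareText(x: Any): String = {
  val text = x.asInstanceOf[String]
    .replace("\n", " ")
    .replace("<s>", " ")
    .replace("~", "")
    .replace("}", " ")
  nonthaire.replaceAllIn(failre.replaceAllIn(text, " "), " ")
}

val dTmp = d.select(d("date"), d("wordseg"))
// .rdd keeps this working on Spark 2.x, where DataFrame.map needs an Encoder.
val dataMap = dTmp.rdd.map(item => (item(0), prepareText(item(1))))
// One (date, word) pair per token.
val baseGramRDD = dataMap.flatMapValues(v => v.split("( )+"))
// countByValue returns a local Map on the driver, not an RDD.
val baseGramCountRDD = baseGramRDD.countByValue()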
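The cleaning and counting steps above can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not the deck's actual job: the object name `PrepareTextDemo`, the helper `countGrams`, and the sample records are made up for illustration, and the `filter(_.nonEmpty)` step is an addition that drops the empty tokens `split` can produce at the start of a string.

```scala
// Local, Spark-free sketch of the cleaning and counting steps.
// PrepareTextDemo, countGrams, and the sample records are hypothetical.
object PrepareTextDemo {
  // Same patterns as in the slide code.
  val failre = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
  val nonthaire = "[\u0021-\u007F]+".r

  // Replace newlines, markers, failure markup, and ASCII runs with spaces.
  def prepareText(raw: String): String = {
    val text = raw
      .replace("\n", " ")
      .replace("<s>", " ")
      .replace("~", "")
      .replace("}", " ")
    nonthaire.replaceAllIn(failre.replaceAllIn(text, " "), " ")
  }

  // Mirrors flatMapValues + countByValue on a plain Scala collection.
  def countGrams(records: Seq[(String, String)]): Map[(String, String), Long] =
    records
      .flatMap { case (date, text) =>
        // Assumption: empty tokens are dropped here; the slide code keeps them.
        prepareText(text).split("( )+").filter(_.nonEmpty).map(w => (date, w))
      }
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size.toLong) }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      ("2016-01-01", "ไทย<s>ไทย abc\nข่าว"),
      ("2016-01-01", "<Fail>x</Fail>ข่าว")
    )
    countGrams(sample).foreach(println)
  }
}
```

Keying the counts by the (date, word) pair matches what `countByValue` does on the pair RDD, so the result can be read as a per-day word frequency table.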
22. Source: P. Zecevic and M. Bonaci, Spark in Action, Manning Publications, Summer 2016 (est.)
23. Why I Like Spark
• RDD: Simple but Elegant Design
• Very consistent abstraction
• Extensible with minimal overhead
• A feature added to the base RDD benefits all other modules