2. DOWNLOAD/INSTALL SCALA, SPARK
• Download/install Scala from
• https://www.scala-lang.org
• Download/install Spark from
• http://spark.apache.org
• Set SCALA_HOME and SPARK_HOME to the install directories
3. TEST INSTALL
• run $SPARK_HOME/bin/spark-shell
• should bring you to the prompt
• scala>
• run a few commands, such as
• val lines=sc.parallelize(List("Hello world","hi"))
• lines.count()
• should print 2
4. LOAD TEXT FILE
scala> val kjv=sc.textFile("kjv.txt")
kjv: org.apache.spark.rdd.RDD[String] = kjv.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> kjv.count()
res0: Long = 31143
The Bible text is now loaded into the RDD kjv, one element per line of the file
5. SEARCH
• How many verses contain the word “God”?
scala> val god=kjv.filter(x=>x.contains("God"))
god: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:26
scala> god.count()
res2: Long = 3585
• How many verses contain the word “Love”/“love”?
scala> val love=kjv.filter(x=> x.contains("Love")||x.contains("love"))
love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:26
scala> love.count()
res3: Long = 546
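The filter above matches the substrings "Love" and "love" only, so it misses "LOVE" and also counts words such as "beloved". A minimal sketch of two stricter variants follows; the helper names and the sample lines are hypothetical, not taken from kjv.txt.

```scala
// Case-insensitive substring match: true for "Love", "love", "LOVE", and also "beloved".
def mentionsLove(verse: String): Boolean =
  verse.toLowerCase.contains("love")

// Whole-word match: \b marks a word boundary, (?i) makes the pattern case-insensitive.
val loveWord = "(?i)\\blove\\b".r
def hasWordLove(verse: String): Boolean =
  loveWord.findFirstIn(verse).isDefined

// Hypothetical sample lines, not taken from kjv.txt.
val sample = Seq("LOVE one another", "my beloved son", "God is love.", "no match here")

val substringHits = sample.count(mentionsLove) // first three lines match
val wholeWordHits = sample.count(hasWordLove)  // "beloved" is skipped
println(s"substring: $substringHits, whole word: $wholeWordHits")
```

In spark-shell the same predicates could presumably be passed to the RDD, e.g. kjv.filter(v => hasWordLove(v)).count(). The resulting count would differ from 546.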
6. SEARCH
How many verses contain God or Love?
scala> val god_or_love=god.union(love)
god_or_love: org.apache.spark.rdd.RDD[String] = UnionRDD[5] at union at <console>:30
scala> god_or_love.count()
res4: Long = 4131
Note: union keeps duplicates, so 4131 = 3585 + 546; a verse containing both words is counted twice
How many verses contain both God and Love?
scala> val god_and_love=god.intersection(love)
god_and_love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at intersection at <console>:30
scala> god_and_love.count()
res5: Long = 100
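Because union concatenates the two RDDs, an "or" count is safer when done with a single filter, which sees each verse once. If every line in kjv.txt is unique, the counts above suggest 3585 + 546 - 100 = 4031 verses with either word. A small sketch on hypothetical sample lines (the helper names are illustrative):

```scala
def hasGod(verse: String): Boolean  = verse.contains("God")
def hasLove(verse: String): Boolean = verse.contains("Love") || verse.contains("love")

// Hypothetical sample lines, not taken from kjv.txt.
val sample = Seq(
  "God is love",        // both words
  "God said",           // God only
  "love thy neighbour", // love only
  "Jesus wept"          // neither
)

val either = sample.count(v => hasGod(v) || hasLove(v)) // each line counted at most once
val both   = sample.count(v => hasGod(v) && hasLove(v))

// Adding the two filtered counts, as union does, counts "God is love" twice.
val concatenated = sample.count(hasGod) + sample.count(hasLove)
println(s"either=$either both=$both concatenated=$concatenated")
```

In spark-shell the equivalent would be kjv.filter(v => hasGod(v) || hasLove(v)).count(), or god.union(love).distinct().count() to drop the duplicates after the fact.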
7. STATISTICS
• How many words are used in the Bible?
scala> val words = kjv.flatMap(line=>line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:26
scala> words.count()
res6: Long = 820867
• How many unique words are used in the Bible?
scala> val distinctword=words.distinct()
distinctword: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at distinct at <console>:28
scala> distinctword.count()
res7: Long = 60023
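Splitting on a single space keeps punctuation and case, so "light:", "light." and "Light" count as different words, and double spaces produce empty strings. The distinct count above is therefore likely inflated. A sketch of a stricter tokenizer, assuming English text; the function name and the sample lines are hypothetical:

```scala
// Lowercase the line, split on runs of characters that are not letters or
// apostrophes, and drop the empty strings that split can produce.
def tokenize(line: String): Seq[String] =
  line.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

// Hypothetical sample lines (note the double space in the second one).
val sample = Seq("And God said, Let there be light:", "and there was  light.")

val naive  = sample.flatMap(_.split(" ")) // same split as on the slide
val tokens = sample.flatMap(tokenize)

println(naive.distinct.size)  // includes "", "light:", "light.", "And", "and"
println(tokens.distinct.size) // normalized words only
```

In spark-shell this could be applied as kjv.flatMap(tokenize).distinct().count().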
• How many times is each word used in the Bible?
scala> val count_by_word=words.countByValue() // assumed definition, missing from the original
scala> count_by_word.toList.sortBy(_._2).reverse.take(10).foreach(println)
(the,62050)
(and,38571)
(of,34393)
(to,13363)
(And,12739)
(that,12454)
(in,12167)
(shall,9759)
(he,9509)
(unto,8931)
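The top-10 list is a count-per-key followed by a descending sort. The sketch below shows that pattern on a local collection; the function name and the sample sentence are hypothetical.

```scala
// Count occurrences per word, then sort by count, highest first.
def topWords(words: Seq[String], n: Int): Seq[(String, Int)] =
  words
    .groupBy(identity)                       // word -> all its occurrences
    .map { case (w, occ) => (w, occ.size) }  // word -> count
    .toSeq
    .sortBy { case (w, c) => (-c, w) }       // count descending, ties alphabetical
    .take(n)

// Hypothetical sample text, not taken from kjv.txt.
val sample = "the word and the light and the life".split(" ").toSeq
topWords(sample, 2).foreach(println)
```

On the RDD itself, a version that keeps the counting distributed would be words.map(w => (w, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10). Only the ten result pairs are then brought back to the driver.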
8. STATISTICS
• How many verses are in Genesis?
scala> val ge=kjv.filter(line=>line.startsWith("Ge"))
ge: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:26
scala> ge.count()
res24: Long = 1534
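The startsWith("Ge") filter relies on every line beginning with a book code. The same idea can be extended to count verses for all books at once. The sketch below assumes a line format that the slides do not show (a book code, optionally led by one digit, followed by chapter:verse). The regex, the helper name, and the sample lines are all hypothetical and would need adjusting to the real kjv.txt.

```scala
// Assumed (hypothetical) line format: a book code, optionally starting with one
// digit, followed by chapter:verse, e.g. "Ge1:1 ..." or "1Sa1:1 ...".
val bookCode = "^\\d?[A-Za-z]+".r
def bookOf(line: String): String = bookCode.findFirstIn(line).getOrElse("?")

// Hypothetical sample lines in the assumed format.
val sample = Seq(
  "Ge1:1 In the beginning",
  "Ge1:2 And the earth",
  "Ex1:1 Now these are",
  "1Sa1:1 Now there was"
)

// Group lines by book code and count the lines in each group.
val versesPerBook: Map[String, Int] =
  sample.groupBy(bookOf).map { case (book, lines) => (book, lines.size) }

println(versesPerBook("Ge"))
```

In spark-shell the corresponding call would be kjv.map(l => (bookOf(l), 1)).reduceByKey(_ + _).collect().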