DATA ANALYSIS
WITH
SCALA/SPARK
YIGUANG HU
DOWNLOAD/INSTALL SCALA, SPARK
• DOWNLOAD/INSTALL SCALA FROM
• HTTPS://WWW.SCALA-LANG.ORG
• DOWNLOAD/INSTALL SPARK FROM
• HTTP://SPARK.APACHE.ORG
• SET SCALA_HOME AND SPARK_HOME BASED ON THE INSTALL LOCATIONS
TEST INSTALL
• run $SPARK_HOME/bin/spark-shell
• should bring you to the prompt
• scala>
• run a few commands, such as
• val lines=sc.parallelize(List("Hello world","hi"))
• lines.count()
• should print 2
LOAD TEXT FILE
scala> val kjv=sc.textFile("kjv.txt")
kjv: org.apache.spark.rdd.RDD[String] = kjv.txt MapPartitionsRDD[1] at
textFile at <console>:24
scala> kjv.count()
res0: Long = 31143
Now the Bible text is loaded into kjv
SEARCH
• How many verses contain the word “God”?
scala> val god=kjv.filter(x=>x.contains("God"))
god: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:26
scala> god.count()
res2: Long = 3585
• How many verses contain the word “Love”/“love”?
scala> val love=kjv.filter(x=> x.contains("Love")||x.contains("love"))
love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:26
scala> love.count()
res3: Long = 546
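The substring test above also matches longer words such as "beloved" or "loveth", and each capitalisation has to be listed by hand. A minimal sketch of a whole-word, case-insensitive check follows. `VerseSearch` and `mentions` are hypothetical names introduced here for illustration, not part of Spark or of the original slides, and counts obtained this way will differ from the ones shown above.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; not part of Spark or the original slides.
object VerseSearch {
  // True when `word` occurs in `line` as a whole word, ignoring case.
  def mentions(line: String, word: String): Boolean = {
    // \b marks a word boundary; Pattern.quote treats `word` as literal text.
    val pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE)
    pattern.matcher(line).find()
  }
}
```

In spark-shell it could be passed to filter, for example kjv.filter(line => VerseSearch.mentions(line, "love")).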
SEARCH
How many verses contain God or Love? (union keeps duplicates, so verses matching both filters are counted twice)
scala> val god_or_love=god.union(love)
god_or_love: org.apache.spark.rdd.RDD[String] = UnionRDD[5] at union at <console>:30
scala> god_or_love.count()
res4: Long = 4131
How many verses contain both God and Love?
scala> val god_and_love=god.intersection(love)
god_and_love: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at intersection at <console>:30
scala> god_and_love.count()
res5: Long = 100
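The union count above equals 3585 + 546, because an RDD union simply concatenates its inputs. The sketch below shows the same effect on plain Scala collections standing in for the RDDs; the object name and the verse strings are made up for illustration.

```scala
// Plain Scala collections stand in for the RDDs; the verse strings are made up.
object UnionDemo {
  val god  = Seq("verse 1", "verse 2", "verse 3")
  val love = Seq("verse 3", "verse 4")

  // Like god.union(love): "verse 3" appears twice in the result.
  val concatenated = god ++ love

  // Like god.union(love).distinct(): each verse appears once.
  val eitherWord = (god ++ love).distinct

  // Like god.intersection(love): only verses present in both.
  val bothWords = god.intersect(love)
}
```

Here the concatenated size (5) is the de-duplicated size (4) plus the overlap (1). On the RDDs above, god.union(love).distinct().count() would count each matching verse once.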
STATISTICS
• How many words are used in the Bible?
scala> val words = kjv.flatMap(line=>line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:26
scala> words.count()
res6: Long = 820867
• How many unique words are used in the Bible?
scala> val distinctword=words.distinct()
distinctword: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at distinct at <console>:28
scala> distinctword.count()
res7: Long = 60023
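Splitting on a single space keeps punctuation attached to words (so "God," and "God" are distinct), treats "And" and "and" as different words, and yields empty strings wherever spaces repeat. A rough normalising tokenizer is sketched below. It assumes that only ASCII letters matter, so "God's" becomes "god" and "s". `Tokenizer` and `tokenize` are hypothetical names, not something from the slides.

```scala
import java.util.Locale

// Hypothetical helper for illustration; assumes only ASCII letters are of interest.
object Tokenizer {
  // Lower-case the line, split on runs of non-letter characters,
  // and drop the empty string that split can leave at the start.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase(Locale.ROOT).split("[^a-z]+").filter(_.nonEmpty).toSeq
}
```

In spark-shell it could replace the split above, for example kjv.flatMap(line => Tokenizer.tokenize(line)); the word counts would then differ from those on the slide.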
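• How many times is each word used in the Bible?
scala> val count_by_word = words.countByValue() // definition not shown on the slide; countByValue() is one way to build it
scala> count_by_word.toList.sortBy(_._2).reverse.take(10).foreach (println)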
(the,62050)
(and,38571)
(of,34393)
(to,13363)
(And,12739)
(that,12454)
(in,12167)
(shall,9759)
(he,9509)
(unto,8931)
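The ranking step can be tried without Spark on an ordinary Scala collection. The sketch below counts each word and returns the most frequent ones, breaking ties alphabetically; `WordStats` and `topWords` are hypothetical names introduced for illustration.

```scala
// Hypothetical helper for illustration; works on a local collection, not an RDD.
object WordStats {
  // Count occurrences of each word and return the n most frequent, highest first.
  def topWords(words: Seq[String], n: Int): List[(String, Int)] =
    words
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toList
      .sortBy { case (word, count) => (-count, word) } // descending count, then alphabetical
      .take(n)
}
```

For example, WordStats.topWords(Seq("the", "and", "the", "of", "and", "the"), 2) gives List((the,3), (and,2)). Note that the list on the slide counts "and" and "And" separately because the words were not lower-cased first.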
STATISTICS
• How many verses are in Genesis?
scala> val ge=kjv.filter(line=>line.startsWith("Ge"))
ge: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at
<console>:26
scala> ge.count()
res24: Long = 1534
