Spark Conf Taiwan 2016
Apache Spark
RICH LEE
2016/9/21
Agenda
Spark overview
Spark core
RDD
Spark Application Develop
Spark Shell
Zeppelin
Application
Spark Overview
Apache Spark is a fast and general-purpose cluster computing system
Key Features:
Fast
Ease of Use
General-purpose
Scalable
Fault tolerant
Logistic regression in Hadoop and Spark
Spark Overview
Cluster Mode
Local
Standalone
Hadoop YARN
Apache Mesos
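For reference, the cluster mode is usually picked with the --master option of spark-shell / spark-submit (host names and ports below are placeholders):
./bin/spark-shell --master local[4]              # Local: 4 worker threads on one machine
./bin/spark-shell --master spark://master:7077   # Standalone cluster
./bin/spark-shell --master yarn                  # Hadoop YARN (needs HADOOP_CONF_DIR set)
./bin/spark-shell --master mesos://master:5050   # Apache Mesos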
Spark Overview
Spark High Level Architecture
Driver Program
Cluster Management
Worker Node
Executor
Task
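A minimal sketch of how these pieces meet in code (app name and master URL are placeholders): the driver program creates a SparkContext, which registers with the cluster manager; executors are launched on the worker nodes and run the tasks the driver schedules.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://master:7077")
val sc = new SparkContext(conf)   // driver-side entry point to the cluster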
Spark Overview
Install and startup
Download
http://spark.apache.org/downloads.html
Start Master and Worker
./sbin/start-all.sh
http://localhost:8080
Start History server
./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
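To try the freshly started cluster (assuming the standalone master's default port 7077):
./bin/spark-shell --master spark://localhost:7077
The history server only shows applications that write event logs, e.g. via conf/spark-defaults.conf:
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://localhost:9000/spark/directory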
RDD
Resilient Distributed Dataset
RDD represents a collection of partitioned data elements that can be operated on in
parallel. It is the primary data abstraction mechanism in Spark.
Partitioned
Fault Tolerant
Interface
In Memory
RDD
Create RDD
parallelize
val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)
textFile
val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")
HDFS - "hdfs://"
Amazon S3 - "s3n://"
Cassandra, HBase
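A quick sketch of both creation paths (the partition count and paths are just examples):
// parallelize takes an optional number of partitions; RDD creation is lazy.
val rdd = sc.parallelize(1 to 10000, 4)
rdd.getNumPartitions          // => 4
val lines = sc.textFile("/input/README.md")
lines.count()                 // action: the file is only read at this point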
RDD
RDD
Transformation
Creates a new RDD by performing a computation on the source RDD
map
val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length }
flatMap
val words = lines flatMap { l => l.split(" ")}
filter
val longLines = lines filter { l => l.length > 80}
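A sketch of chaining these transformations (the length threshold is arbitrary):
// Transformations are lazy and compose; no work happens until an action runs.
val longWords = sc.textFile("/input/README.md")
  .flatMap(_.split(" "))      // one element per word
  .filter(_.length > 8)       // keep only the longer words
  .map(_.toLowerCase)         // normalize case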
RDD
Action
Return a value to a driver program
first
val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first
max
numbersRdd.max
reduce
val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
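A few other commonly used actions on the same numbersRdd (expected results in comments):
numbersRdd.count()            // => 4, number of elements
numbersRdd.collect()          // => Array(10, 5, 3, 1), copies all data to the driver
numbersRdd.take(2)            // => Array(10, 5), first two elements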
RDD
Filter log example:
val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR")}
val warningLogs = logs filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
(Diagram: the log RDD is filtered into an error RDD and a warn RDD, and a count action runs on each.)
RDD
Caching
Stores an RDD in the memory or storage
When an application caches an RDD in memory, Spark stores it in the executor
memory on each worker node. Each executor stores in memory the RDD
partitions that it computes.
cache
persist
MEMORY_ONLY
DISK_ONLY
MEMORY_AND_DISK
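A minimal sketch of choosing a storage level explicitly (cache() is shorthand for persist with MEMORY_ONLY):
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("path/to/log-files")
logs.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed
// ... run several actions over logs ...
logs.unpersist()                             // release the cached data when done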
RDD
Cache example:
val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN")}
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR")}
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
Spark Application Develop
Spark-Shell
Zeppelin
Application (Java/Scala)
spark-submit
Spark Application Develop
WordCount
val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
wcData.collect().foreach(println)
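The same WordCount can be packaged as a standalone application and launched with spark-submit; class name, jar name, and master URL below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val wcData = sc.textFile("/input/README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    wcData.collect().foreach(println)
    sc.stop()
  }
}

./bin/spark-submit --class WordCount --master spark://localhost:7077 wordcount.jar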
Community Promotion
Taiwan Hadoop User Group
https://www.facebook.com/groups/hadoop.tw/
Taiwan Spark User Group
https://www.facebook.com/groups/spark.tw/
