Spark Conf Taiwan 2016
Apache Spark
RICH LEE
2016/9/21
Agenda
Spark overview
Spark core
RDD
Spark Application Develop
Spark Shell
Zeppelin
Application
Spark Overview
Apache Spark is a fast and general-purpose cluster computing system
Key Features:
Fast
Ease of Use
General-purpose
Scalable
Fault tolerant
Logistic regression in Hadoop and Spark
Spark Overview
Cluster Mode
Local
Standalone
Hadoop YARN
Apache Mesos
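For reference, the cluster mode is usually picked with the --master option of spark-shell / spark-submit (host names and ports below are placeholders):
./bin/spark-shell --master local[4]              # Local: 4 worker threads on one machine
./bin/spark-shell --master spark://master:7077   # Standalone cluster
./bin/spark-shell --master yarn                  # Hadoop YARN (needs HADOOP_CONF_DIR set)
./bin/spark-shell --master mesos://master:5050   # Apache Mesos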
Spark Overview
Spark High Level Architecture
Driver Program
Cluster Management
Worker Node
Executor
Task
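A minimal sketch of how these pieces meet in code (app name and master URL are placeholders): the driver program creates a SparkContext, which registers with the cluster manager; executors are launched on the worker nodes and run the tasks the driver schedules.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://master:7077")
val sc = new SparkContext(conf)   // driver-side entry point to the cluster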
Spark Overview
Install and startup
Download
http://spark.apache.org/downloads.html
Start Master and Worker
./sbin/start-all.sh
http://localhost:8080
Start History server
./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
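To try the freshly started cluster (assuming the standalone master's default port 7077):
./bin/spark-shell --master spark://localhost:7077
The history server only shows applications that write event logs, e.g. via conf/spark-defaults.conf:
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://localhost:9000/spark/directory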
RDD
Resilient Distributed Dataset
RDD represents a collection of partitioned data elements that can be operated on in
parallel. It is the primary data abstraction mechanism in Spark.
Partitioned
Fault Tolerant
Interface
In Memory
RDD
Create RDD
parallelize
val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)
textFile
val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")
HDFS - "hdfs://"
Amazon S3 - "s3n://"
Cassandra, HBase
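A quick sketch of both creation paths (the partition count and paths are just examples):
// parallelize takes an optional number of partitions; RDD creation is lazy.
val rdd = sc.parallelize(1 to 10000, 4)
rdd.getNumPartitions          // => 4
val lines = sc.textFile("/input/README.md")
lines.count()                 // action: the file is only read at this point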
RDD
RDD
Transformation
Creates a new RDD by performing a computation on the source RDD
map
val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length }
flatMap
val words = lines flatMap { l => l.split(" ")}
filter
val longLines = lines filter { l => l.length > 80}
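A sketch of chaining these transformations (the length threshold is arbitrary):
// Transformations are lazy and compose; no work happens until an action runs.
val longWords = sc.textFile("/input/README.md")
  .flatMap(_.split(" "))      // one element per word
  .filter(_.length > 8)       // keep only the longer words
  .map(_.toLowerCase)         // normalize case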
RDD
Action
Return a value to a driver program
first
val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first
max
numbersRdd.max
reduce
val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
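A few other commonly used actions on the same numbersRdd (expected results in comments):
numbersRdd.count()            // => 4, number of elements
numbersRdd.collect()          // => Array(10, 5, 3, 1), copies all data to the driver
numbersRdd.take(2)            // => Array(10, 5), first two elements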
RDD
Filter log example:
val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR")}
val warningLogs = logs filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
(Diagram: the log RDD is filtered into an error RDD and a warn RDD, and a count action runs on each.)
RDD
Caching
Stores an RDD in the memory or storage
When an application caches an RDD in memory, Spark stores it in the executor
memory on each worker node. Each executor stores in memory the RDD
partitions that it computes.
cache
persist
MEMORY_ONLY
DISK_ONLY
MEMORY_AND_DISK
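A minimal sketch of choosing a storage level explicitly (cache() is shorthand for persist with MEMORY_ONLY):
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("path/to/log-files")
logs.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed
// ... run several actions over logs ...
logs.unpersist()                             // release the cached data when done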
RDD
Cache example:
val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN")}
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR")}
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN")}
val errorCount = errorLogs.count
val warningCount = warningLogs.count
Spark Application Develop
Spark-Shell
Zeppelin
Application (Java/Scala)
spark-submit
Spark Application Develop
WordCount
val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
.map((_, 1))
.reduceByKey(_ + _)
wcData.collect().foreach(println)
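The same WordCount can be packaged as a standalone application and launched with spark-submit; class name, jar name, and master URL below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val wcData = sc.textFile("/input/README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    wcData.collect().foreach(println)
    sc.stop()
  }
}

./bin/spark-submit --class WordCount --master spark://localhost:7077 wordcount.jar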
Community Promotion
Taiwan Hadoop User Group
https://www.facebook.com/groups/hadoop.tw/
Taiwan Spark User Group
https://www.facebook.com/groups/spark.tw/
