Learning Apache Spark
Part 1
Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev
airis.DATA
airis.DATA is a next-generation system integrator that specializes in rapidly deployable machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at petabyte scale using the most advanced, best-in-class technologies and frameworks, including Spark, H2O, Mahout, and Flink.
Our data pipelining solutions can be deployed in batch, real-time, or near-real-time settings to fit your specific business use case.
Agenda
• Overview
• What is Map Reduce?
• Hands-On:
  • Installation
  • Spark Map Reduce
  • Build with IntelliJ/SBT
  • Deploy Local
Overview
Spark	is	a	fast	cluster	computing	
system	that	supports	Java,	Scala,	
Python	and	R	APIs.			It	allows	for	
multiple	workloads	using	the	same	
system	and	coding.			
One	stop	shopping	for	your	big	
data	processing	at	scale	needs.
It	works	well	with	existing	Hadoop	
clusters,	by	itself,	with	AWS	or	on	
it’s	own.
http://spark.apache.org/docs/latest/index.html
What is Map Reduce?

TRANSFORMATION
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

ACTION
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
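A minimal sketch of the two operations in spark-shell (sc is the SparkContext the shell provides; the data and variable names are illustrative):

// TRANSFORMATION: map builds a new RDD lazily; nothing executes yet
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(n => n * 2)
// ACTION: reduce triggers the computation and returns a single value to the driver
val total = doubled.reduce(_ + _)   // 30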
Problem Definition
We have Apache logs from our website. They follow a standard pattern and we want to parse them to gain some insights on usage.
114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] "GET /wp HTTP/1.1" 200 5279 "http://sparkdeveloper.com/" "Mozilla/5.0"
The fields, in the order they appear in the line:
• IP Address
• ClientID
• UserID
• Date Time Stamp
• Request String
• HTTP Status Code
• Bytes Sent
• HTTP Referer
• User Agent
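A case class and regular expression along these lines can capture those fields. This is only a sketch; the exact pattern and field names used in the course code may differ.

// One case class field per group we keep (the request string is split into method, resource, protocol)
case class LogRecord(ipAddress: String, clientId: String, userId: String,
                     dateTime: String, method: String, httpStatus: Int,
                     bytesSent: Long, referer: String, userAgent: String)

// Groups: 1 ip, 2 client, 3 user, 4 timestamp, 5 method, 6 resource,
//         7 protocol, 8 status, 9 bytes, 10 referer, 11 user agent
val logPattern =
  """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r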
Map Function

logFile.map(parseLogLine)

LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
  m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))

Our mapping function is parseLogLine, which takes a log string and uses regular expressions to split it into the fields of a case class.

val contentSizes = accessLogs.map(log => log.bytesSent)

Our second mapping function maps each record to just the bytes-sent field.
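Put together with the illustrative logPattern and LogRecord sketched earlier, parseLogLine might look roughly like this (the input path is a placeholder):

def parseLogLine(line: String): LogRecord = {
  // Match the line against the Apache log pattern and pick out the groups we keep
  val m = logPattern.findFirstMatchIn(line)
    .getOrElse(throw new RuntimeException(s"Cannot parse log line: $line"))
  LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
            m.group(5), m.group(8).toInt, m.group(9).toLong,
            m.group(10), m.group(11))
}

val logFile = sc.textFile("access.log")                  // RDD[String], one log line per record
val accessLogs = logFile.map(parseLogLine)               // RDD[LogRecord]
val contentSizes = accessLogs.map(log => log.bytesSent)  // RDD[Long]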
Reduce

contentSizes.reduce(_ + _)

We reduce by summing all the bytes in the dataset. The result is the total of all the content sizes.
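Because the reducing function must be commutative and associative, the same action works for other aggregates too (a sketch; count() is part of the RDD API):

val totalBytes = contentSizes.reduce(_ + _)                          // sum of all bytes sent
val maxBytes = contentSizes.reduce((a, b) => if (a > b) a else b)    // largest single response
val avgBytes = totalBytes / contentSizes.count()                     // rough average per request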
Spark 1.6.1 Stack
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Engine: Spark Core
• Cluster managers: Standalone, YARN, Mesos
Hands-On
• Spark Map Reduce
• Build with IntelliJ/SBT (see the build sketch below)
• Deploy Local
• Run History Server:
spark-1.6.1-bin-hadoop2.6/sbin/start-history-server.sh
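For the build step, an sbt definition along these lines is enough for a Spark 1.6.1 application; the project name is a placeholder:

name := "spark-log-analysis"

version := "1.0"

scalaVersion := "2.10.6"

// "provided" keeps Spark out of the packaged jar; spark-submit supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

After sbt package, the resulting jar can be deployed locally with spark-submit, for example: spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class LogAnalysis --master "local[2]" target/scala-2.10/spark-log-analysis_2.10-1.0.jar (the main class and jar names depend on your project).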
Installation
• Install JDK
• Install Scala 2.10
• Install SBT
• Install Maven (optional)
• Unzip Spark 1.6.1
Environment Variables (example values)

Unix / Linux / Mac:
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin

Windows:
set SCALA_HOME=C:\Progra~1\Scala
set PATH=%PATH%;%SCALA_HOME%\bin
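To check the setup, start the shell with spark-1.6.1-bin-hadoop2.6/bin/spark-shell and run a one-liner (the numbers are illustrative):

// sc is the SparkContext that spark-shell creates for you
sc.parallelize(1 to 100).filter(_ % 2 == 0).count()   // should return 50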
Spark Resources
• https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/info
• http://airisdata.com/scala-spark-resources-setup-learning/
• http://spark.apache.org/docs/latest/monitoring.html
• http://spark.apache.org/docs/latest/submitting-applications.html
Spark Cluster
http://spark.apache.org/docs/latest/cluster-overview.html

Glossary
The following table summarizes terms you'll see used to refer to cluster concepts:

Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
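To tie these terms together, here is a minimal sketch of a driver program; the object name, master URL, and input path are placeholders (in a real cluster deployment the master is usually set by spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

object LogAnalysis {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext, which asks the cluster manager for executors
    val conf = new SparkConf().setAppName("Log Analysis").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // An action such as count() triggers a job, which is split into stages and tasks run on executors
    val lineCount = sc.textFile("access.log").count()
    println(s"Lines: $lineCount")

    sc.stop()
  }
}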
