Apache Spark 101 
What is Spark all about?
Shahaf Azriely 
Sr. Field Engineer Southern EMEA 
Agenda
• What is Spark
• Spark Programming Model
  – RDDs, log mining, word count …
• Related Projects
  – Shark, Spark SQL, Spark Streaming, GraphX, MLlib and more …
• So what next
What is Spark? 
The Spark Challenge
• Data size is growing
• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
  - More complex, multi-stage applications (graph algorithms, machine learning)
  - More interactive ad-hoc queries
  - More real-time online processing
• All of these apps require fast data sharing across parallel jobs
Data Sharing in MapReduce 
[Diagram: each iteration (iter. 1, iter. 2, …) and each ad-hoc query (query 1–3, result 1–3) reads its input from HDFS and writes results back to HDFS between steps.]
Slow due to replication, serialization, and disk I/O
Data Sharing in Spark 
[Diagram: the input is read once; iterations and queries share intermediate data through distributed memory.]
10-100× faster than network and disk
Spark is
• A fast, MapReduce-like engine
  – In-memory storage for fast iterative computation
  – Designed for low-latency (~100 ms) jobs
• Compatible with Hadoop storage APIs
  – Reads/writes to any Hadoop-supported system, including Pivotal HD
• Designed to work with data in memory
• Programmatic or interactive
• Written in Scala, with bindings for Python, Java, and Scala
• Built to make life easy and productive for data scientists
Spark is one of the most actively developed open-source projects: with over 465 contributors in 2014, it is the most active project in the Apache Software Foundation and among Big Data open-source projects.
Short History
• Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009.
• 2010: open sourced.
• June 21, 2013: the project was donated to the Apache Software Foundation, and its founders created Databricks out of AMPLab.
• Feb 27, 2014: Spark became a top-level ASF project.
• In November 2014, the engineering team at Databricks used Spark to set a record in the Daytona GraySort benchmark, sorting 100 TB (1 trillion records) in 23 minutes (4.27 TB/min).
• http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark Programming Model
RDDs in Detail
Programming Model 
• Key idea: resilient distributed datasets (RDDs) 
- Resilient – if data in memory is lost, it can be recreated. 
- Distributed – stored in memory across the cluster. 
- Dataset – initial data can be created from a file or programmatically. 
• Parallel operations on RDDs 
- Reduce, collect, foreach, … 
• Interface 
- Clean language-integrated API in Scala, Python, Java 
- Can be used interactively 
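To make this concrete, here is a minimal sketch in the Scala shell, assuming the SparkContext sc that spark-shell provides; the HDFS path is hypothetical:

val nums = sc.parallelize(1 to 1000)           // dataset created programmatically
val lines = sc.textFile("hdfs://.../input")    // or from a file (hypothetical path)
val sum = nums.reduce(_ + _)                   // parallel reduce on the workers: 500500
val evens = nums.filter(_ % 2 == 0).collect()  // materialize results on the driver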
RDD Fault Tolerance 
RDDs maintain lineage information that can be used to reconstruct lost partitions:
val cachedMsgs = textFile(...).map(_.split('\t')(2))
                              .filter(_.contains("error"))
                              .cache()
Lineage: HdfsRDD (path: hdfs://…) → MappedRDD (func: split(…)) → FilteredRDD (func: contains(…)) → CachedRDD
Demo: Intro & Log Mining
1. Create a basic RDD in Scala
2. Log mining: load error messages from a log into memory, then interactively search for various patterns
(base RDD → transformed RDD → action)
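A sketch of what the log-mining demo looks like; the log path and the tab-separated field layout are assumptions for illustration:

val lines = sc.textFile("hdfs://.../app.log")        // base RDD
val errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD (lazy)
val messages = errors.map(_.split("\t")(2)).cache()  // keep messages in memory

messages.filter(_.contains("timeout")).count()       // action: runs the job
messages.filter(_.contains("connection")).count()    // reuses the cached data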
Transformations and Actions
• Transformations
  – map
  – filter
  – flatMap
  – sample
  – groupByKey
  – reduceByKey
  – union
  – join
  – sort
• Actions
  – count
  – collect
  – reduce
  – lookup
  – save
See http://spark.apache.org/docs/latest/programming-guide.html#basics
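A small sketch of the distinction, with made-up input: transformations are lazy and only describe a new RDD, while actions trigger computation and return a result.

val words = sc.parallelize(Seq("spark", "hadoop", "spark"))
val pairs = words.map(w => (w, 1))     // transformation: nothing runs yet
val counts = pairs.reduceByKey(_ + _)  // transformation: still lazy
counts.collect()                       // action: Array((spark,2), (hadoop,1))
counts.count()                         // action: 2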
More Demo: Word Count & Joins
3. Word count in the Scala and Python shells
4. Join two RDDs
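Hedged sketches of both demos in Scala; the file path and sample data are made up:

// Word count
val counts = sc.textFile("hdfs://.../input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)

// Joining two pair RDDs on their keys
val pageNames = sc.parallelize(Seq((1, "index"), (2, "about")))
val pageHits = sc.parallelize(Seq((1, 42), (2, 7), (1, 13)))
pageNames.join(pageHits).collect()
// Array((1,(index,42)), (1,(index,13)), (2,(about,7))) — order may vary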
Example of Related Projects
Related Projects
• Shark is dead; long live Spark SQL
• Spark Streaming
• GraphX
• MLbase
• Others
Shark is dead, but here is what it was
• Hive on Spark
  – HiveQL, UDFs, etc.
• Turned SQL into RDDs
  – Part of the lineage
• Based on Hive, but took advantage of Spark for
  – Fast scheduling
  – Queries as DAGs of jobs, not chained MapReduce
  – Fast broadcast variables
Spark SQL
• A library in Spark Core that treats RDDs as relations (SchemaRDD)
• RDDs are held in a columnar in-memory store
• Dynamic query optimization
• A lighter-weight version of Shark
  – No code from Hive
• Import/export in different storage formats
  – Parquet; can learn the schema from an existing Hive warehouse
Spark SQL Code 
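The original slide showed a screenshot; here is a minimal sketch against the Spark 1.x-era SchemaRDD API instead. The Person schema, input file, and query are assumptions:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// SQL over an RDD; the result is itself a SchemaRDD.
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.map(row => "Name: " + row(0)).collect().foreach(println)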
Spark Streaming 
• Framework for large-scale stream processing
• Scales to hundreds of nodes
• Can achieve second-scale latencies
• Integrates with Spark's batch and interactive processing
• Provides a simple, batch-like API for implementing complex algorithms
• Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Traditional streaming 
• Traditional streaming systems have an event-driven, record-at-a-time processing model
  – Each node has mutable state
  – For each record, it updates state and sends new records
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging
Discretized Stream Processing 
Run a streaming computation as a series of very small, deterministic batch jobs
[Diagram: a live data stream enters Spark Streaming, is chopped into batches of X seconds, and Spark returns the processed results.]
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
Discretized Stream Processing 
Run a streaming computation as a series of very small, deterministic batch jobs
[Diagram: the same pipeline — live data stream → Spark Streaming → batches of X seconds → Spark → processed results.]
• Batch sizes as low as ½ second, with latency of about 1 second
• Potential for combining batch processing and streaming processing in the same system
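A minimal Spark Streaming sketch of this model, assuming a socket text source on localhost:9999 and one-second batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // batches of 1 second

// Each batch of lines becomes an RDD, processed with RDD-style operations.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                     // emit results per batch

ssc.start()
ssc.awaitTermination()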
How Fast Can It Go? 
Can process 4 GB/s (42M records/s) of data on 100 nodes at sub-second latency 
Recovers from failures within 1 sec 
Streaming: how does it work?
MLlib 
• MLlib is a Spark subproject providing machine learning primitives.
• It ships with Spark as a standard component.
• Many different algorithms for classification, regression, collaborative filtering, and more:
  – Regression: generalized linear models (GLM)
  – Collaborative filtering: alternating least squares (ALS)
  – Clustering: k-means
  – Decomposition: singular value decomposition (SVD), principal component analysis (PCA)
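For instance, a hedged k-means sketch against the 1.x MLlib API; the input file of space-separated coordinates is hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                               // iterative algorithm: keep data in memory

val model = KMeans.train(points, 3, 20)  // k = 3 clusters, 20 iterations
val cost = model.computeCost(points)     // within-set sum of squared errors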
Why MLlib
• It is built on Apache Spark, a fast and general engine for large-scale data processing.
• Runs programs up to 100× faster than Hadoop MapReduce in memory, or 10× faster on disk.
• Write applications quickly in Java, Scala, or Python.
• You can use any Hadoop data source (e.g., HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
Spark SQL + MLlib 
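The original slide was a screenshot; here is a hedged sketch of the combination instead — use Spark SQL to select features, then hand them to MLlib. The table and column names are made up, and the table is assumed to be registered as in the Spark SQL example above:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rows = sqlContext.sql("SELECT label, f1, f2 FROM training_data")
val points = rows.map(r =>
  LabeledPoint(r.getDouble(0), Vectors.dense(r.getDouble(1), r.getDouble(2))))

val model = LogisticRegressionWithSGD.train(points, 100)  // 100 iterations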
GraphX 
• What are graphs? Inherently recursive data structures: properties of vertices depend on the properties of their neighbors, which in turn depend on the properties of their neighbors.
• GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.
• We can view the same data as both graphs and collections, and transform and join graphs with RDDs.
• Example: predicting things about people (e.g., political bias)
  – Look at posts, apply a classifier, try to predict the attribute
  – Look at the context of the social network to improve the prediction
GraphX Demo 
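The demo itself isn't in the deck; a minimal hedged GraphX sketch with made-up vertices and edges:

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)

graph.vertices.count()                       // view the graph as collections
val ranks = graph.pageRank(0.001).vertices   // iterative graph computation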
Others 
• Mesos
  – Enables multiple frameworks to share the same cluster resources
  – Twitter is the largest user: over 6,000 servers
• Tachyon
  – In-memory, fault-tolerant file system that exposes an HDFS interface
  – Can be used as the file system for Spark
• Catalyst
  – SQL query optimizer
So Why is Spark Important for Pivotal
So How Real is Spark? 
• Leveraging a modern MapReduce-like engine and techniques from databases, Spark supports both SQL and complex analytics efficiently.
• There are many indicators that Spark is heading for success:
  – Solid technology
  – Good buzz
  – A growing community: https://cwiki.apache.org/confluence/display/SPARK/Committers
Pivotal’s Positioning of Spark 
[Diagram: a spectrum from batch to real time — Map-Reduce (batch processing) → Spark (better/faster batch processing) → HAWQ (near-real time) → GemFire XD (real time).]
• PHD is a highly differentiated platform, and the only one that brings the benefits of closed-loop analytics to enable a business data lake
• With Spark we extend that differentiation by allowing up to 100× faster batch processing
• Spark 1.0.0 on PHD: https://support.pivotal.io/hc/en-us/articles/203271897-Spark-on-Pivotal-Hadoop-2-0-Quick-Start-Guide
• Databricks announcing Pivotal certification: https://databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the-full-apache-spark-stack.html
• We attend Spark meetups
• Join the SocialCast group!
Thank you 
Q&A 

Editor's Notes

• #6: Each iteration is, for example, a MapReduce job.
• #11: You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached variables are an optimization. Mention that cached variables are useful for some workloads not shown here. Mention it's all designed to be easy to distribute in a fault-tolerant fashion.