APACHE SPARK OVERVIEW 
tech talk @ ferret 
Andrii Gakhov
• Apache Spark™ is a fast and general engine for 
large-scale data processing. 
• Latest release: Spark 1.1.1 (Nov 26, 2014) 
• spark.apache.org 
• Originally developed in 2009 in UC Berkeley’s 
AMPLab, and open sourced in 2010. Now Spark is 
supported by Databricks.
APACHE SPARK 
[Stack diagram: Spark SQL, MLlib, GraphX, and Spark Streaming libraries on top of the Apache Spark core; cluster managers: standalone (with local storage), Mesos, YARN, EC2; storage: S3 and HDFS across the cluster nodes]
RDD 
• Spark's primary abstraction is the Resilient 
Distributed Dataset (RDD): an immutable, 
distributed dataset. 
textFile = sc.textFile("api.log") 
anotherFile = sc.textFile("hdfs://var/log/api.log") 
• Collections of objects that can be stored in memory 
or on disk across the cluster 
• Parallel functional transformations (map, filter, …) 
• Automatically rebuilt on failure
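• For illustration, an RDD can also be created from a local collection with sc.parallelize (a minimal sketch, assuming the SparkContext sc provided by the pyspark shell): 
# Create an RDD from a local Python collection 
numbers = sc.parallelize([1, 2, 3, 4, 5]) 
# Transformations are lazy: they only record the lineage 
squares = numbers.map(lambda x: x * x) 
evens = squares.filter(lambda x: x % 2 == 0) 
# An action triggers the distributed computation 
print evens.collect()  # [4, 16]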
RDD 
• RDDs have actions, which return values, and 
transformations, which return pointers to new RDDs. 
• Actions: 
• reduce collect count countByKey take saveAsTextFile 
takeSample … 
• Transformations: 
• map filter flatMap distinct sample join union intersection 
reduceByKey groupByKey sortByKey … 
errors = logFile.filter(lambda line: line.startswith("ERROR")) 
print errors.count()
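• As a further sketch of chaining several transformations before a single action, a minimal word count over the textFile RDD defined earlier: 
counts = (textFile 
    .flatMap(lambda line: line.split())    # transformation 
    .map(lambda word: (word, 1))           # transformation 
    .reduceByKey(lambda a, b: a + b))      # transformation 
print counts.take(5)                       # action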
PERSISTENCE 
• You can control the persistence of an RDD across operations 
(MEMORY_ONLY, MEMORY_AND_DISK, …) 
• When you persist an RDD in memory, each node stores 
any partitions of it that it computes in memory and 
reuses them in other actions on that dataset (or datasets 
derived from it) 
• This allows future actions to be much faster (often by 
more than 10x). 
errors.cache() 
endpoint_errors = errors.filter( 
lambda line: "/test/endpoint" in line) 
endpoint_errors.count()
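• cache() is shorthand for persist() with the default MEMORY_ONLY level; a sketch of requesting a different level explicitly (a storage level can only be assigned once per RDD, so this would replace the cache() call above): 
from pyspark import StorageLevel 
# Keep partitions in memory and spill to disk if they do not fit 
errors.persist(StorageLevel.MEMORY_AND_DISK) 
print errors.count()  # later actions reuse the persisted partitions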
[Diagram: Hadoop MapReduce writes intermediate results to HDFS between iterations, while Apache Spark keeps them in memory]
INTERACTIVE DEMO 
STRATA+HADOOP WORLD EXAMPLE 
http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
SPARK SQL 
TRANSFORM RDD WITH SQL
SCHEMA RDD 
• Spark SQL allows relational queries expressed in SQL, 
HiveQL, or Scala to be executed using Spark. 
• At the core of this component is a new type of RDD - 
SchemaRDD. 
• SchemaRDDs are composed of Row objects, along with a 
schema that describes the data types of each column in the row. 
• A SchemaRDD is similar to a table in a traditional relational 
database. 
• A SchemaRDD can be created from an existing RDD, a Parquet 
file, a JSON dataset, or by running HiveQL against data stored in 
Apache Hive.
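• A sketch of the JSON and Parquet paths with the Spark 1.1 Python API (people.json and people.parquet are hypothetical files): 
from pyspark.sql import SQLContext 
sqlCtx = SQLContext(sc) 
# SchemaRDD from a JSON dataset (schema is inferred from the records) 
people = sqlCtx.jsonFile("people.json") 
people.printSchema() 
# SchemaRDD from a Parquet file (schema is read from the file metadata) 
parquetPeople = sqlCtx.parquetFile("people.parquet") 
parquetPeople.registerAsTable("people")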
SCHEMA RDD 
• To work with Spark SQL you need an SQLContext 
(or HiveContext) 
from pyspark.sql import SQLContext, Row 
sqlCtx = SQLContext(sc) 
records = sc.textFile("customers.csv") 
customers = records.map(lambda line: line.split(",")) \
    .map(lambda r: Row(name=r[0], age=int(r[1]))) 
customersTable = sqlCtx.inferSchema(customers) 
customersTable.registerAsTable("customers")
SCHEMA RDD 
[Diagram: an RDD of User objects vs. a SchemaRDD of rows with Name, Age, and Phone columns]
• Transformations over an RDD are just functional 
transformations on partitioned collections of objects 
• Transformations over a SchemaRDD are 
declarative transformations on partitioned 
collections of tuples
SPARK SQL 
• A SchemaRDD can be used as a regular RDD at 
the same time. 
seniors = sqlCtx.sql(""" 
    SELECT * FROM customers WHERE age >= 70""") 
print seniors.count() 
print seniors.map(lambda r: "Name: " + r.name).take(10)
MLLIB 
Distributed Machine Learning
MACHINE LEARNING LIBRARY 
• MLlib uses the linear algebra package Breeze, 
which depends on netlib-java and jblas 
• MLlib in Python requires NumPy version 1.4+ 
• MLlib is under active development 
• Many API changes every release 
• Not all algorithms are fully functional
MACHINE LEARNING LIBRARY 
• Basic statistics 
• Classification and regression 
• linear models (SVMs, logistic regression, linear 
regression) 
• decision trees 
• naive Bayes 
• Collaborative filtering 
• alternating least squares (ALS) 
• Clustering 
• k-means
MACHINE LEARNING LIBRARY 
• Dimensionality reduction 
• singular value decomposition (SVD) 
• principal component analysis (PCA) 
• Feature extraction and transformation 
• Optimization 
• stochastic gradient descent 
• limited-memory BFGS (L-BFGS)
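• A minimal sketch of one algorithm from the lists above, k-means clustering on a toy dataset (hypothetical 2-D points): 
from numpy import array 
from pyspark.mllib.clustering import KMeans 
# Two obvious clusters around (0, 0) and (9, 9) 
points = sc.parallelize([ 
    array([0.0, 0.0]), array([0.1, 0.1]), 
    array([9.0, 9.0]), array([9.1, 9.1])]) 
model = KMeans.train(points, 2, maxIterations=10) 
print model.clusterCenters 
print model.predict(array([0.05, 0.05]))  # index of the nearest center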
MACHINE LEARNING LIBRARY 
• LinearRegression with stochastic gradient descent (SGD) 
example on Spark: 
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD 

def parsePoint(line): 
    values = [float(x) for x in line.replace(',', ' ').split(' ')] 
    return LabeledPoint(values[0], values[1:]) 

data = sc.textFile("data.txt")  # hypothetical input: one labelled point per line 
parsedData = data.map(parsePoint) 
model = LinearRegressionWithSGD.train(parsedData) 
valuesAndPreds = parsedData.map( 
    lambda p: (p.label, model.predict(p.features))) 
MSE = (valuesAndPreds.map(lambda (v, p): (v - p)**2) 
    .reduce(lambda x, y: x + y) / valuesAndPreds.count())
SPARK STREAMING 
Fault-tolerant stream processing
SPARK STREAMING 
• Spark Streaming enables scalable, high-throughput, 
fault-tolerant stream processing of live data streams 
• Spark Streaming provides a high-level abstraction 
called discretized stream or DStream, which 
represents a continuous stream of data 
• Internally, a DStream is represented as a sequence 
of RDDs.
SPARK STREAMING 
• Example of processing Twitter Stream with Spark 
Streaming: 
import org.apache.spark.streaming._ 
import org.apache.spark.streaming.twitter._ 
… 
val ssc = new StreamingContext(sc, Seconds(1)) 
val tweets = TwitterUtils.createStream(ssc, auth) 
val hashTags = tweets.flatMap(status=>getTags(status)) 
hashTags.saveAsHadoopFiles("hdfs://...")
SPARK STREAMING 
• Any operation applied on a DStream translates to 
operations on the underlying RDDs. 
[Diagram: a DStream as a sequence of RDDs, one per batch interval (RDD @ time1, RDD @ time2, …)]
SPARK STREAMING 
• Spark Streaming also provides windowed 
computations, which allow you to apply 
transformations over a sliding window of data
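• A minimal sketch of a windowed word count; it assumes the Python streaming API (pyspark.streaming), which appeared in Spark releases after the 1.1.1 mentioned above, and a hypothetical socket source: 
from pyspark.streaming import StreamingContext 
ssc = StreamingContext(sc, 1)  # 1-second batches 
lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source 
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)) 
# Count words over the last 30 seconds, recomputed every 10 seconds 
counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10) 
counts.pprint() 
ssc.start() 
ssc.awaitTermination()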
CONCLUSIONS
SPEED 
• Run programs up to 100x faster than Hadoop 
MapReduce in memory, or 10x faster on disk. 
[Chart: logistic regression running time in Hadoop vs. Spark] 
• Spark has won the Daytona GraySort contest for 
2014 (sortbenchmark.org) with 4.27 TB/min 
(in 2013 Hadoop was the winner with 1.42 TB/min)
EASE OF USE 
• Supports out of the box: 
• Java 
• Scala 
• Python 
• You can use it interactively from the Scala and 
Python shells
GENERALITY 
• SQL with Spark SQL 
• Machine learning with MLlib 
• Graph computation with GraphX 
• Stream processing with Spark Streaming
RUNS EVERYWHERE 
• Spark can run on 
• Hadoop (YARN) 
• Mesos 
• standalone 
• in the cloud 
• Spark can read from 
• S3 
• HDFS 
• HBase 
• Cassandra 
• any Hadoop data source.
Thank you. 
• Credits: 
• http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014 
• http://spark.apache.org 
• http://www.databricks.com 
• http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
