Apache Spark Overview

Let's learn Apache Spark
1. Overview

What is Spark?
Spark is an in-memory cluster computing framework developed by UC,Berkeley
and presently maintained by Apache software Foundation.
Cluster Computing means computations on large datasets can be done in
parallel, distributed fashion over a cluster.
Spark requires a cluster manager(spark standalone,Mesos,Yarn etc.) and
distributed storage system(Hadoop,Hive,Cassandra etc.)

Hadoop to Spark
Spark is build on top of Hadoop which uses MapReduce API(developed by
Google) for doing parallel,distributed computations.
Hadoop is slow as all the operations are on the disk and also coding is not
developer friendly.
As Apache Spark is an in-memory processing engine it is 10-100X faster than
hadoop.

Why Spark is getting popular
No doubt about the spark performance but there are other factors also which
contributes to its success.
Delight for developers as it provides language flexibility and supports
Scala,Java,Python,R
Simple yet rich API makes it easy to use
Runs everywhere

Spark API
This picture is taken from apache spark site and shows different components
of Spark API.
All APIs are available in scala, java and python.

Spark Core
Spark core is the foundation of Spark project.
This API is centered on a data structure called resilient distributed
dataset(RDD), a read only multi set of data items distributed over a cluster of
machines, that is maintained in fault tolerant way.
It provides distributed task dispatching, scheduling and basic IO functionalities.

Spark core continued...
Creating RDDs in scala. SC is the spark context which comes initialized in spark
shell and needs to be initialized in user created applications.
1)Existing collections
2) External Datasets

Spark Core continued...
RDD supports two types of operations.
1) Transformations(map,filter,union etc.)
2) Actions(reduce,collect,count etc.)

Spark SQL
This API let you query structured data inside spark programs using either SQL
or familiar DataFrame API.
It provide access to variety of data sources including Hive, Avro, JSON, JDBC
etc.

Spark SQL continued..
SQLContext is the starting point to be used for Spark SQL
Creation of Dataframes
1)Using datasource
2) Using sql

Spark SQL continued...
DataFrame Operations
1)Displaying full dataframe
2) displaying one column

Spark SQL continued...
DataFrame Operations
3) grouping by column
4) filtering data

Spark Streaming
Spark streaming is an extension of spark core API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams.
Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ
etc

Spark Streaming continued…
It provides a high-level abstraction called DStream or Discretized stream which
represent a continuous stream of data.
DStream is represented internally as a sequence of RDDs.
Creating Dstream

For any queries please email me.
email-id:= pythonpeer@gmail.com
Thanks for your Time.

Apache Spark Overview

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache Spark Overview

Similar to Apache Spark Overview (20)

Recently uploaded

Recently uploaded (20)

Apache Spark Overview