Spark - Lightning-Fast Cluster Computing by Example
Ramesh Mudunuri, Vectorum
Saturday, December 6, 2014
About me
• Big data enthusiast
• Member of the product development team at Vectorum.com, a startup using Spark technology
What to expect
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
This is not…
• A training class
• A workshop
• A product demo with commercial interest
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
What is Spark?
A general-purpose, large-scale, high-performance processing engine
http://spark.apache.org/
What is Spark?
Like MapReduce, but an in-memory processing engine, and it also runs fast
http://spark.apache.org/
What is Spark?
• Apache Spark™ is a fast and general engine for
large-scale data processing.
Spark History
• Started as a research project in 2009 at UC Berkeley's AMPLab; became an Apache open-source project in 2010
• Matei Zaharia: Spark development team member and Databricks co-founder
Why is Spark so special
• Speed: a fast, general-purpose, in-memory processing engine
• (Relatively) easy to develop and deploy complex analytical applications
• APIs for Java, Scala, and Python
• Well-integrated ecosystem tools
www.databricks.com
Why is Spark so special…..
• In-memory processing makes it well suited for iterative algorithm computations
• Can run in various setups
– Standalone (my favorite way to learn Spark)
– Cluster, EC2
– YARN, Mesos
• Read data from
– Local file system
– HDFS
– HBase, Cassandra and …
http://www.cloudera.com
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Apache Spark Core
• Foundation of the stack
• Scheduling
• Memory management
• Fault recovery, etc.
Spark SQL
• Execute Spark queries with SQL expressions
• Compatible with Hive*
• JDBC/ODBC connection capabilities
* Hive: distributed data-storage SQL software with custom UDF capabilities
Spark Streaming
• Component for processing live streams of data (see the sketch below)
• API to handle streaming data
• Example sources: log files, queued messages, sensor-emitted data
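A minimal sketch (not from the slides) of the 1.x-era streaming API: word counts over a hypothetical text source on localhost:9999, in 10-second micro-batches.

// Sketch: word counts over a live text stream (assumed source, e.g. fed by `nc -lk 9999`)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (pre-1.3 API)

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                          // print each batch's word counts

ssc.start()
ssc.awaitTermination()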
MLlib - Machine Learning
Libraries of machine learning algorithms (see the clustering sketch below)
e.g. classification, regression, clustering, collaborative filtering, dimensionality reduction
Very active Spark development community
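A minimal sketch (hypothetical input file) of one listed algorithm, k-means clustering, using the 1.x-era MLlib API.

// Sketch: k-means over space-separated points, one point per line (illustrative path)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                              // iterative algorithm, so caching pays off

val model = KMeans.train(points, 3, 20)                 // k = 3 clusters, at most 20 iterations
model.clusterCenters.foreach(println)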
GraphX
APIs for graph computation
• PageRank (see the sketch below)
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
Alpha level
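A minimal sketch (hypothetical edge-list file) of one listed computation, PageRank, with GraphX.

// Sketch: load a graph from an edge list ("srcId dstId" per line) and run PageRank
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")
val ranks = graph.pageRank(0.0001).vertices             // iterate until ranks converge
ranks.take(5).foreach(println)                          // (vertexId, rank) pairs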
Spark Engine Terminology
• Spark Context
– An object Spark uses to access the cluster
• Driver & Executor
– The driver runs the main program and executes parallel operations
– Executors run inside workers and execute the tasks
• Resilient Distributed Dataset (RDD)
– An immutable, fault-tolerant collection object
– RDD functions (similar to Hadoop map-reduce functions):
1. Transformations
2. Actions
Spark shell and Spark context
Driver & Executor
The driver runs the main program and executes parallel operations.
Executors run inside workers and execute the tasks (see the sketch below).
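A minimal driver-program sketch (a standalone app; in spark-shell the context already exists as sc): the driver builds a SparkContext and submits parallel operations that executors run as tasks.

// Sketch: driver side of a Spark application (illustrative app name and local master)
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)                   // object used to access the cluster

    val doubled = sc.parallelize(1 to 1000).map(_ * 2)  // tasks for this map run on executors
    println(doubled.count())                            // action: result returns to the driver

    sc.stop()
  }
}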
RDD-Resilient Distributed Dataset
• Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.
• Simple definition: an immutable, fault-tolerant collection object
• There are two ways to create an RDD in Spark:
– 1. Create an RDD from an external data source
– 2. Perform a transformation on one or more existing RDDs
– val lines = sc.textFile("/filepath/README.md")
– val errors = lines.filter(_.startsWith("ERROR"))
RDD
• There are two ways to create an RDD in Spark:
1. Create an RDD from an external data source
val lines = sc.textFile("/filepath/README.md")
2. Perform a transformation on one or more existing RDDs
val errors = lines.filter(_.startsWith("ERROR"))
Transformation - Action
• Transformation operations are lazy (they are not executed immediately)
• Transformations create new RDDs from existing RDDs
e.g. filter, map (see the sketch below)
• Action operations return final values to the driver program or write data to the file system
e.g. collect, saveAsTextFile
http://www.mapr.com
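A minimal sketch building on the README example above (paths are illustrative): the two transformations only build a lazy plan, and nothing runs until an action is called.

// Sketch: transformations are lazy, actions trigger execution
val lines  = sc.textFile("/filepath/README.md")         // transformation: nothing runs yet
val errors = lines.filter(_.startsWith("ERROR"))        // transformation: still lazy
println(errors.count())                                 // action: job runs, value returns to the driver
errors.saveAsTextFile("/filepath/errors-out")           // action: writes results to the file system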
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is Spark different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
How is Spark different from Hadoop MapReduce?

1. Speed
• Spark: up to 100x faster in memory and up to 10x faster on disk

2. Ease of use
• Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
• Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
• Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) to develop complex analytical applications
• Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
• Spark: Hadoop V1/V2 (YARN), and also Mesos and Amazon EC2

5. Data sources
• Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3
How is Spark different from Hadoop MapReduce? (continued)

6. Applications
• Spark: an 'Application' is the higher-level unit; it runs multiple jobs in sequence or in parallel; application processes, called executors, run on the cluster's workers
• Hadoop: a 'job' is the higher-level unit; it processes data with map and reduce and writes the results to storage

7. Executors
• Spark: an executor can run multiple tasks in a single process
• Hadoop: each map/reduce task runs in its own process

8. Shared variables (see the sketch below)
• Spark: broadcast variables are read-only (look-up) variables shipped only once to each worker; accumulators let workers add values while only the driver reads them, and they are fault tolerant
• Hadoop: counters, including additional (system) metric counters such as 'Map input records'

9. Persisting/caching RDDs
• Spark: cached RDDs can be reused across operations, which increases processing speed

10. Lazy evaluation
• Spark: transformation functions are bundled into an execution plan and executed only when an RDD action is invoked
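A minimal sketch (illustrative lookup table and data, not from the slides) of the shared variables and RDD caching mentioned above, using the 1.x-era API.

// Sketch: broadcast variable, accumulator, and a cached RDD reused by two actions
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))      // read-only look-up, shipped once per worker
val misses = sc.accumulator(0)                           // workers add to it, only the driver reads it

val keys = sc.parallelize(Seq("a", "b", "a", "x")).cache()   // cached so both actions below reuse it
val values = keys.map { k =>
  if (!lookup.value.contains(k)) misses += 1             // count keys missing from the broadcast table
  lookup.value.getOrElse(k, 0)
}

println(values.reduce(_ + _))                            // action 1: computes keys and caches it
println(keys.distinct().count())                         // action 2: reuses the cached keys RDD
println(s"Lookup misses: ${misses.value}")               // driver-side read of the accumulator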
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where Spark shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Where Spark shines well
• Well suited for any iterative computation
– Machine learning algorithms
– Iterative analytics
• Multi-data-source computations
– Multi-sourced sensor data
• Aggregated analytics
– Transforming and summarizing the data
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
• Download link: http://spark.apache.org/downloads.html
• Standalone: choose a package type, e.g. pre-built for Hadoop 1.x
• Source code is also available
– Build tools: Maven or sbt
– Distro versions: Hadoop, Cloudera, MapR
Current Spark version
Release cycle: every 3 months
How easy it is to install and start learning
Can be installed quickly on your laptop/PC
• Prerequisite checklist
– Java 1.7
– Scala 2.10.x
– SPARK/conf
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Spark Scala REPL
• cd $SPARK_HOME
• ./bin/spark-shell
• Web UI on port 4040
Spark Master & Worker in the background
• cd $SPARK_HOME
• ./sbin/start-all.sh (starts both the master and a worker)
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Use case with Spark SQL
• Spark Scala REPL
• Spark SQL
• Write some interesting code snippets in the REPL using Scala:
1. Read Meetup participant info and prepare a data file
2. Use Spark SQL to create aggregated data
3. Show a visualization of the Spark output data
Spark SQL Code : Create table and Run Queries
1. Create Spark context
// the Spark context is created as sc when the shell launches
2. Create SQL context
3. Create case class
4. Create RDD
5. Create schema
6. Register the RDD as a table in the schema
7. Run SELECT statements
8. Save SQL output
9. Visualization with D3
Code
// Spark context is available as sc in the shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                        // implicit conversions for case-class RDDs

case class Attendees(Name: String, Interest: String)

val meetup = sc.textFile("/Users/vectorum/Documents/Ramesh/Dec6/meetup.csv")
  .map(_.split(","))
  .map(a => Attendees(a(0), a(1)))

val hyd = sqlContext.createSchemaRDD(meetup)
hyd.registerTempTable("iiit")

val iiitRoster = sqlContext.sql("SELECT Name, Interest FROM iiit")
iiitRoster.count()
iiitRoster.map(a => "Name: " + a(0) + ", Interest: " + a(1)).collect().foreach(println)

val iiitAChart = sqlContext.sql("SELECT Interest, count(Interest) FROM iiit GROUP BY Interest ORDER BY Interest")
iiitAChart.map(a => a(0) + "," + a(1)).collect().foreach(println)
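Step 8 (saving the SQL output) is not shown on the slide; a minimal sketch with a hypothetical output directory:

// Sketch: persist the aggregated rows as text for the D3 visualization (illustrative path)
iiitAChart.map(a => a(0) + "," + a(1)).saveAsTextFile("/Users/vectorum/Documents/Ramesh/Dec6/interest-counts")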
Our Product - Technology Stack
• Visualization: HighCharts, D3
• Spark: SQL, Hive, MLlib
• Data: HDFS, MySQL, files
Spark Programming Model
1. Define a set of transformations on input datasets
2. Invoke actions that output the transformed dataset to persistent storage or local memory
3. Run local computations that operate on the results computed in a distributed fashion; these help decide what transformations and actions to undertake next (see the sketch below)
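A minimal sketch of the three steps (the log-file path and filter strings are illustrative):

// Sketch: transformations, an action, then a local computation that guides the next step
val events = sc.textFile("/data/events.log")             // step 1: transformations (lazy)
val errors = events.filter(_.contains("ERROR")).cache()
val sample = errors.take(10)                              // step 2: action returns data to the driver
val hasOom = sample.exists(_.contains("OutOfMemory"))     // step 3: local computation on the driver
val nextCount = if (hasOom) errors.filter(_.contains("OutOfMemory")).count() else errors.count()
println(nextCount)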
Example RDD Lineage
HDFS/File -> Prepare dataset (RDD-0) -> Cached RDD -> Filtered data sets 0 .. n -> Export data -> Visualization / Machine learning (see the sketch below)
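A minimal sketch of a lineage like the one above (paths and filter conditions are illustrative):

// Sketch: one cached RDD feeding several filtered datasets that are exported downstream
val raw      = sc.textFile("hdfs:///data/input")          // HDFS/File
val prepared = raw.map(_.trim).filter(_.nonEmpty)         // Prepare dataset (RDD-0)
val cached   = prepared.cache()                           // Cached RDD, reused below
val subset0  = cached.filter(_.startsWith("A"))           // Filtered data set 0
val subsetN  = cached.filter(_.startsWith("Z"))           // ... filtered data set n
subset0.saveAsTextFile("hdfs:///data/out-A")              // Export data for visualization / ML
subsetN.saveAsTextFile("hdfs:///data/out-Z")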
Demo - Visualization
• Bubble chart: data distribution
• Heat chart: correlation
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Where to find additional information
• http://spark.apache.org/
• http://spark-summit.org/2014#videos
• http://databricks.com/spark-training-resources
• Users mailing list: user@spark.apache.org
• Developers mailing list: dev@spark.apache.org
• My Twitter handle: https://twitter.com/rameshmudunuri
Final note
Thank you

Editor's Notes
• SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O; it handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.