Spark - Lightning-Fast Cluster Computing by Example
Ramesh Mudunuri, Vectorum
Saturday, December 6, 2014
About me
• Big data enthusiast
• Member of the product development team at Vectorum.com, a startup using Spark technology
What to expect
• Introduction to Spark
• Spark ecosystem
• How Spark is different from Hadoop MapReduce
• Where Spark shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
This is not…
• A training class
• A workshop
• A product demo with commercial interest
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
What is Spark?
A general-purpose, large-scale, high-performance processing engine
http://spark.apache.org/
What is Spark?
Like MapReduce, but an in-memory processing engine, and it also runs fast
http://spark.apache.org/
What is Spark?
• Apache Spark™ is a fast and general engine for
large-scale data processing.
Spark History
• Started as a research project in 2009 at UC Berkeley's AMPLab; became an Apache open-source project in 2010
• Matei Zaharia: Spark development team member and Databricks co-founder
Why is Spark so special
• Speed: a fast, general-purpose, in-memory processing engine
• (Relatively) easy to develop and deploy complex analytical applications
• APIs for Java, Scala, and Python
• Well-integrated ecosystem tools
www.databricks.com
Why is Spark so special…..
• In-memory processing makes it well suited for iterative algorithm computations
• Can run in various setups
– Standalone (my favorite way to learn Spark)
– Cluster, EC2
– YARN, Mesos
• Read data from
– Local file system
– HDFS
– HBase, Cassandra and …
http://www.cloudera.com
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Apache Spark Core
• Foundation of the stack
• Scheduling
• Memory management
• Fault recovery, etc.
Spark SQL
• Execute Spark queries with SQL expressions
• Compatible with Hive*
• JDBC/ODBC connection capabilities
* Hive: distributed data-storage SQL software with custom UDF capabilities
Spark Streaming
• Component for processing live streams of data (see the sketch below)
• API to handle streaming data
• Example sources: log files, queued messages, sensor-emitted data
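A minimal sketch (not from the slides) of the 1.x-era streaming API: word counts over a hypothetical text source on localhost:9999, in 10-second micro-batches.

// Sketch: word counts over a live text stream (assumed source, e.g. fed by `nc -lk 9999`)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (pre-1.3 API)

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                          // print each batch's word counts

ssc.start()
ssc.awaitTermination()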
MLlib - Machine Learning
Libraries of machine learning algorithms (see the clustering sketch below)
e.g. classification, regression, clustering, collaborative filtering, dimensionality reduction
Very active Spark development community
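A minimal sketch (hypothetical input file) of one listed algorithm, k-means clustering, using the 1.x-era MLlib API.

// Sketch: k-means over space-separated points, one point per line (illustrative path)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                              // iterative algorithm, so caching pays off

val model = KMeans.train(points, 3, 20)                 // k = 3 clusters, at most 20 iterations
model.clusterCenters.foreach(println)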
GraphX
APIs for graph computation
• PageRank (see the sketch below)
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
Alpha level
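A minimal sketch (hypothetical edge-list file) of one listed computation, PageRank, with GraphX.

// Sketch: load a graph from an edge list ("srcId dstId" per line) and run PageRank
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")
val ranks = graph.pageRank(0.0001).vertices             // iterate until ranks converge
ranks.take(5).foreach(println)                          // (vertexId, rank) pairs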
Spark Engine Terminology
• Spark Context
– An object Spark uses to access the cluster
• Driver & Executor
– The driver runs the main program and executes parallel operations
– Executors run inside workers and execute the tasks
• Resilient Distributed Dataset (RDD)
– An immutable, fault-tolerant collection object
– RDD functions (similar to Hadoop map-reduce functions):
1. Transformations
2. Actions
Spark shell and Spark context
Driver & Executor
The driver runs the main program and executes parallel operations.
Executors run inside workers and execute the tasks (see the sketch below).
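A minimal driver-program sketch (a standalone app; in spark-shell the context already exists as sc): the driver builds a SparkContext and submits parallel operations that executors run as tasks.

// Sketch: driver side of a Spark application (illustrative app name and local master)
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)                   // object used to access the cluster

    val doubled = sc.parallelize(1 to 1000).map(_ * 2)  // tasks for this map run on executors
    println(doubled.count())                            // action: result returns to the driver

    sc.stop()
  }
}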
RDD-Resilient Distributed Dataset
• Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.
• Simple definition: an immutable, fault-tolerant collection object
• There are two ways to create an RDD in Spark:
– 1. Create an RDD from an external data source
– 2. Perform a transformation on one or more existing RDDs
– val lines = sc.textFile("/filepath/README.md")
– val errors = lines.filter(_.startsWith("ERROR"))
RDD
• There are two ways to create an RDD in Spark:
1. Create an RDD from an external data source
val lines = sc.textFile("/filepath/README.md")
2. Perform a transformation on one or more existing RDDs
val errors = lines.filter(_.startsWith("ERROR"))
Transformation - Action
• Transformation operations are lazy (they are not executed immediately)
• Transformations create new RDDs from existing RDDs
e.g. filter, map (see the sketch below)
• Action operations return final values to the driver program or write data to the file system
e.g. collect, saveAsTextFile
http://www.mapr.com
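A minimal sketch building on the README example above (paths are illustrative): the two transformations only build a lazy plan, and nothing runs until an action is called.

// Sketch: transformations are lazy, actions trigger execution
val lines  = sc.textFile("/filepath/README.md")         // transformation: nothing runs yet
val errors = lines.filter(_.startsWith("ERROR"))        // transformation: still lazy
println(errors.count())                                 // action: job runs, value returns to the driver
errors.saveAsTextFile("/filepath/errors-out")           // action: writes results to the file system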
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is Spark different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
How is Spark different from Hadoop MapReduce?

1. Speed
• Spark: up to 100x faster in memory and up to 10x faster on disk

2. Ease of use
• Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
• Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
• Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) to develop complex analytical applications
• Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
• Spark: Hadoop V1/V2 (YARN), and also Mesos and Amazon EC2

5. Data sources
• Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3
How is Spark different from Hadoop MapReduce? (continued)

6. Applications
• Spark: an 'Application' is the higher-level unit; it runs multiple jobs in sequence or in parallel; application processes, called executors, run on the cluster's workers
• Hadoop: a 'job' is the higher-level unit; it processes data with map and reduce and writes the results to storage

7. Executors
• Spark: an executor can run multiple tasks in a single process
• Hadoop: each map/reduce task runs in its own process

8. Shared variables (see the sketch below)
• Spark: broadcast variables are read-only (look-up) variables shipped only once to each worker; accumulators let workers add values while only the driver reads them, and they are fault tolerant
• Hadoop: counters, including additional (system) metric counters such as 'Map input records'

9. Persisting/caching RDDs
• Spark: cached RDDs can be reused across operations, which increases processing speed

10. Lazy evaluation
• Spark: transformation functions are bundled into an execution plan and executed only when an RDD action is invoked
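A minimal sketch (illustrative lookup table and data, not from the slides) of the shared variables and RDD caching mentioned above, using the 1.x-era API.

// Sketch: broadcast variable, accumulator, and a cached RDD reused by two actions
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))      // read-only look-up, shipped once per worker
val misses = sc.accumulator(0)                           // workers add to it, only the driver reads it

val keys = sc.parallelize(Seq("a", "b", "a", "x")).cache()   // cached so both actions below reuse it
val values = keys.map { k =>
  if (!lookup.value.contains(k)) misses += 1             // count keys missing from the broadcast table
  lookup.value.getOrElse(k, 0)
}

println(values.reduce(_ + _))                            // action 1: computes keys and caches it
println(keys.distinct().count())                         // action 2: reuses the cached keys RDD
println(s"Lookup misses: ${misses.value}")               // driver-side read of the accumulator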
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where Spark shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Where Spark shines well
• Well suited for any iterative computation
– Machine learning algorithms
– Iterative analytics
• Multi-data-source computations
– Multi-sourced sensor data
• Aggregated analytics
– Transforming and summarizing the data
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
• Download link: http://spark.apache.org/downloads.html
• Standalone: choose a package type, e.g. pre-built for Hadoop 1.x
• Source code is also available
– Build tools: Maven or sbt
– Distro versions: Hadoop, Cloudera, MapR
Current Spark version
Release cycle: every 3 months
How easy it is to install and start learning
Can be installed quickly on your laptop/PC
• Prerequisite checklist
– Java 1.7
– Scala 2.10.x
– SPARK/conf
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Spark Scala REPL
• cd $SPARK_HOME
• ./bin/spark-shell
• Web UI on port 4040
Spark Master & Worker in the background
• cd $SPARK_HOME
• ./sbin/start-all.sh (starts both the master and a worker)
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Use case with Spark SQL
• Spark Scala REPL
• Spark SQL
• Write some interesting code snippets in the REPL using Scala:
1. Read Meetup participant info and prepare a data file
2. Use Spark SQL to create aggregated data
3. Show a visualization of the Spark output data
Spark SQL Code : Create table and Run Queries
1. Create Spark context
// the Spark context is created as sc when the shell launches
2. Create SQL context
3. Create case class
4. Create RDD
5. Create schema
6. Register the RDD as a table in the schema
7. Run SELECT statements
8. Save SQL output
9. Visualization with D3
Code
// Spark context is available as sc in the shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                        // implicit conversions for case-class RDDs

case class Attendees(Name: String, Interest: String)

val meetup = sc.textFile("/Users/vectorum/Documents/Ramesh/Dec6/meetup.csv")
  .map(_.split(","))
  .map(a => Attendees(a(0), a(1)))

val hyd = sqlContext.createSchemaRDD(meetup)
hyd.registerTempTable("iiit")

val iiitRoster = sqlContext.sql("SELECT Name, Interest FROM iiit")
iiitRoster.count()
iiitRoster.map(a => "Name: " + a(0) + ", Interest: " + a(1)).collect().foreach(println)

val iiitAChart = sqlContext.sql("SELECT Interest, count(Interest) FROM iiit GROUP BY Interest ORDER BY Interest")
iiitAChart.map(a => a(0) + "," + a(1)).collect().foreach(println)
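Step 8 (saving the SQL output) is not shown on the slide; a minimal sketch with a hypothetical output directory:

// Sketch: persist the aggregated rows as text for the D3 visualization (illustrative path)
iiitAChart.map(a => a(0) + "," + a(1)).saveAsTextFile("/Users/vectorum/Documents/Ramesh/Dec6/interest-counts")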
Our Product - Technology Stack
• Visualization: HighCharts, D3
• Spark: SQL, Hive, MLlib
• Data: HDFS, MySQL, files
Spark Programming Model
1. Define a set of transformations on input datasets
2. Invoke actions that output the transformed dataset to persistent storage or local memory
3. Run local computations that operate on the results computed in a distributed fashion; these help decide what transformations and actions to undertake next (see the sketch below)
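A minimal sketch of the three steps (the log-file path and filter strings are illustrative):

// Sketch: transformations, an action, then a local computation that guides the next step
val events = sc.textFile("/data/events.log")             // step 1: transformations (lazy)
val errors = events.filter(_.contains("ERROR")).cache()
val sample = errors.take(10)                              // step 2: action returns data to the driver
val hasOom = sample.exists(_.contains("OutOfMemory"))     // step 3: local computation on the driver
val nextCount = if (hasOom) errors.filter(_.contains("OutOfMemory")).count() else errors.count()
println(nextCount)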
Example RDD Lineage
HDFS/File -> Prepare dataset (RDD-0) -> Cached RDD -> Filtered data sets 0 .. n -> Export data -> Visualization / Machine learning (see the sketch below)
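A minimal sketch of a lineage like the one above (paths and filter conditions are illustrative):

// Sketch: one cached RDD feeding several filtered datasets that are exported downstream
val raw      = sc.textFile("hdfs:///data/input")          // HDFS/File
val prepared = raw.map(_.trim).filter(_.nonEmpty)         // Prepare dataset (RDD-0)
val cached   = prepared.cache()                           // Cached RDD, reused below
val subset0  = cached.filter(_.startsWith("A"))           // Filtered data set 0
val subsetN  = cached.filter(_.startsWith("Z"))           // ... filtered data set n
subset0.saveAsTextFile("hdfs:///data/out-A")              // Export data for visualization / ML
subsetN.saveAsTextFile("hdfs:///data/out-Z")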
Demo - Visualization
• Bubble chart: data distribution
• Heat chart: correlation
Spark - Lightning-Fast Cluster Computing by Example
• Introduction to Spark
• Spark ecosystem
• How is it different from Hadoop MapReduce
• Where it shines well
• How easy it is to install and start learning
• Small code demos
• Where to find additional information
Where to find additional information
• http://spark.apache.org/
• http://spark-summit.org/2014#videos
• http://databricks.com/spark-training-resources
• Users mailing list: user@spark.apache.org
• Developers mailing list: dev@spark.apache.org
• My Twitter handle: https://twitter.com/rameshmudunuri
Final note
Thank you

Editor's Notes
• SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O; it handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.