Spark and Hadoop Technology

© 2014 MapR Technologies
An Overview of Apache Spark
 IT Engineer

Agenda
• What is Spark?
• Evolution Of Spark
• Features Of Spark
• Components Of Spark
• Brief Introduction To Hadoop
• The Difference with Spark
• Examples and Resources

What is Spark?

Apache Spark
• Originally developed in 2009 at UC Berkeley's AMPLab
• Open sourced in 2010; now a top-level project at the Apache
Software Foundation
• Apache Spark is a cluster computing framework designed for
fast computation.

Evolution Of Spark
• Spark is designed to cover a wide range of workloads, such as batch
applications, iterative algorithms, interactive queries, and streaming.
• Handling these workloads in one system reduces the management burden of
maintaining separate tools.
• It was donated to the Apache Software Foundation in 2013 and became a
top-level Apache project in February 2014.

Features Of Spark
• Spark can run an application up to 100 times faster in memory,
and up to 10 times faster on disk, than Hadoop MapReduce.
• Spark provides built-in APIs in Java, Scala, and Python, and
comes with over 80 high-level operators for interactive
querying.
• It also supports SQL queries, streaming data, machine
learning (ML), and graph algorithms.

Components Of Spark
• Spark Core provides in-memory computing and can reference datasets in external storage
systems.
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
• MLlib is a distributed machine learning framework on top of Spark that takes advantage
of Spark's distributed, memory-based architecture.
[Diagram: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) layered on Spark Core (general execution engine)]

Brief Introduction To Hadoop

Introduction To Hadoop
• Apache Hadoop is an open source software framework that supports
data-intensive distributed applications.
• Apache Hadoop is scalable, flexible, fault-tolerant, and cost-effective.
• With Hadoop, the main concern is maintaining speed when processing large
datasets: both the waiting time between queries and the waiting time to run
a program.
• Hadoop is just one of the ways to deploy Spark: Spark can run on Hadoop
(HDFS for storage, YARN for cluster management), but it can also run standalone.

The Difference with Spark

Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
– 2-5× less code
• Fast to Run
– General execution graphs
– In-memory storage
– Up to 10× faster on disk,
100× in memory
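
The two "Fast to Run" points can be illustrated with a toy model. The sketch below is plain Python, not Spark: the hypothetical `LazyDataset` class only records transformations (a very simple, linear "execution graph") and computes nothing until `collect()` is called, and `cache()` keeps the computed result in memory so a second pass does not re-read the source. All names here are illustrative assumptions, not Spark's API or implementation.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset: records steps, computes on demand."""

    def __init__(self, source, steps=()):
        self._source = source      # callable that produces the raw records
        self._steps = steps        # recorded transformations, applied in order
        self._use_cache = False    # set by cache()
        self._cached = None        # in-memory result, filled on first collect()

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        # Serve from memory if a cached result already exists.
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        data = list(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return list(data)


# Hypothetical source that counts how often the "disk" is read.
reads = {"count": 0}

def load_numbers():
    reads["count"] += 1
    return range(10)

even_squares = (
    LazyDataset(load_numbers)
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .cache()
)
first = even_squares.collect()    # reads the source and computes
second = even_squares.collect()   # served from memory, no second read
```

Iterative algorithms reuse the same dataset many times, which is why keeping intermediate results in memory matters more for them than for a single-pass batch job.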

Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in past year (axis 0 to 140); labeled projects include Giraph, Storm, and Tez]

Examples and Resources

Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/

Examples
• Word Count
• Text Search

Q&A
Engage with us!
