Spark
Anyone who wants to know about Spark; no specific prerequisites are required.
This is not a tutorial for learning Spark.
The intention of this presentation is to introduce Spark and give an overview from a
general user's perspective. We are not going to cover concepts specific to
developer/programming or administrative aspects.
Audience & Intention
Sudhakara.st
Mail:sudhakara.st@gmail.com
https://in.linkedin.com/in/sudhakara-st-82820539
Agenda
Introduction to Spark
Spark
What leads to Spark trending
Spark components
Resilient Distributed Dataset (RDD)
Input to Spark
Benefits of Spark
Spark “Word count” example
Spark vs Hadoop
Conclusion
Credits
Content and images source
 http://spark.apache.org/
 https://databricks.com/
 Learning Spark - O'Reilly Media
By Holden Karau, Andy Konwinski,
Patrick Wendell, Matei Zaharia
Apache:
Spark™ is a fast and general engine for large-scale data
processing.
Databricks:
Spark™ is a powerful open source processing engine built
around speed, ease of use, and sophisticated analytics.
Spark is an open source distributed computing engine for data
processing and data analytics.
It was originally developed at UC Berkeley in 2009.
What leads to Spark trending?
Just-in-time data warehouse
Today enterprises have a variety of data: real-time, streaming,
batch, and analytics workloads. Spark is designed for that.
Big data is versatile. Spark's execution engine handles this
versatility, and its ever-growing set of libraries helps with it.
Spark brings data processing, analysis, and analytics together on
one platform.
 Spark significantly simplifies big data processing, hosting an
end-to-end platform from ingest to product.
What leads to Spark trending? continue…
 Spark supports a wide range of ecosystem tools & apps
Spark friendly !
Apache Spark is a general-purpose, distributed cluster-
computing, data-processing framework that, like
MapReduce in Apache Hadoop, offers powerful
abstractions for processing large datasets.
Apache Spark is designed to work seamlessly with Hadoop*,
Amazon S3, Cassandra, or as a standalone application.
Supported languages:
Rich set of high-level APIs that increase user productivity
Integration with new & existing systems
Spark friendly ! continue…
Spark Components
Spark Components continue…
The Spark core is complemented by a set of powerful,
higher-level libraries:
 SparkSQL
 Spark Streaming
 MLlib (for machine learning)
 GraphX
Spark itself is written in Scala and provides APIs in Scala, Java, and Python.
Spark Core
Spark Core is the base engine for large-scale parallel and
distributed data processing. It is responsible for:
 memory management and fault recovery
 scheduling, distributing, and monitoring jobs and tasks on a
cluster
 interacting with storage systems
SparkSQL
SparkSQL is a Spark component that supports querying data
either via SQL or via the Hive Query Language. It originated
as the Apache Hive port to run on top of Spark (in place of
MapReduce) and is now integrated with the Spark stack. In
addition to providing support for various data sources, it
makes it possible to weave SQL queries with code
transformations which results in a very powerful tool.
Below is an example of a Hive compatible query:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
Spark Streaming
Spark Streaming supports real-time processing of streaming
data. It is an extension of the core Spark API
that enables scalable, high-throughput, fault-tolerant stream
processing of live data streams.
Data can be ingested from Kafka, Flume, Twitter, ZeroMQ, Kinesis,
or TCP sockets.
Processed data can be pushed out to filesystems, databases,
and live dashboards. In fact, you can apply Spark’s machine
learning and graph processing algorithms on data streams.
Spark Streaming continue..
Spark Streaming receives live input data streams and divides
the data into batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.
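The following is a minimal, hedged Java sketch of this micro-batch model (not from the
slides); the host, port, and 5-second batch interval are illustrative assumptions. It
receives lines of text over a TCP socket and counts the records in each batch.
// Imports assumed: org.apache.spark.SparkConf, org.apache.spark.streaming.Durations,
// and the org.apache.spark.streaming.api.java.* classes used below.
SparkConf conf = new SparkConf().setAppName("StreamingSketch");
// Group the live input into 5-second micro-batches.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
// Ingest lines of text from a TCP socket (localhost:9999 is a placeholder).
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
// Count the records in each micro-batch and print the per-batch result.
JavaDStream<Long> counts = lines.count();
counts.print();
jssc.start();            // start receiving and processing
jssc.awaitTermination(); // block until the streaming job is stopped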
MLlib
MLlib is a machine learning library that provides various
algorithms designed to scale out on a cluster for
classification, regression, clustering, collaborative filtering,
and so on (check out Toptal’s article on machine learning for
more information on that topic).
These algorithms also work with streaming data, such as
linear regression using ordinary least squares or k-means
clustering (and more on the way). Apache Mahout (a
machine learning library for Hadoop) has already turned
away from MapReduce and joined forces on Spark MLlib.
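As a hedged illustration of what calling into MLlib looks like from Java (not taken from
the slides), the sketch below clusters a handful of 2-D points with k-means using the
RDD-based MLlib API. The sample points, k = 2, and the 20-iteration limit are
illustrative assumptions; sc is an existing JavaSparkContext, and imports of KMeans,
KMeansModel, Vector, Vectors (org.apache.spark.mllib) and java.util.Arrays are assumed.
// Build a tiny in-memory dataset of 2-D points (illustrative values).
JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));
// Train k-means with k = 2 clusters and at most 20 iterations.
KMeansModel model = KMeans.train(points.rdd(), 2, 20);
for (Vector center : model.clusterCenters()) {
    System.out.println("cluster center: " + center);
}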
Resilient Distributed Dataset (RDD)
Spark introduces the concept of an RDD, an immutable, fault-tolerant,
distributed collection of objects that can be operated on in parallel.
An RDD can contain any type of object and is created by loading an
external dataset or distributing a collection from the driver program.
RDDs support two types of operations (see the sketch after this list):
 Transformations: operations such as map, filter, join, and union that are
performed on an RDD and yield a new RDD containing the result. In other
words, they create a new dataset from an existing one.
 Actions: operations such as reduce, count, first, collect, and save that
trigger the computation and return a value to the driver program, or write
the result to a file, after running a computation on the dataset.
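Here is the sketch referred to above, a hedged Java example with illustrative in-memory
data, assuming an existing JavaSparkContext sc (imports for JavaRDD, Function, and
Arrays are assumed): filter is a lazy transformation that only defines a new RDD, while
count is an action that actually runs the computation.
// sc is an existing JavaSparkContext; the data values are illustrative.
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
// Transformation: lazily defines a new RDD; nothing is computed yet.
JavaRDD<Integer> evens = numbers.filter(new Function<Integer, Boolean>() {
    public Boolean call(Integer x) { return x % 2 == 0; }
});
// Action: triggers the computation and returns a value to the driver.
long howMany = evens.count(); // 3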
Resilient Distributed Dataset continue..
An RDD is a fault-tolerant collection of
elements/partitions that can be operated on in parallel
across the nodes.
Properties of an RDD:
 Immutability
 Cacheable – lineage – persist
 Lazy evaluation (evaluation is deferred until an action runs)
 Type inferred
Two ways to create RDDs (both are shown below): parallelizing an existing collection
in your driver program, or referencing a dataset in an
external storage system, such as a shared filesystem, HDFS,
HBase, S3, Cassandra, or any data source offering a Hadoop
InputFormat.
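Both creation paths, as a hedged sketch with placeholder data and path, again assuming
an existing JavaSparkContext sc:
// 1. Parallelize an existing collection from the driver program.
JavaRDD<String> fromCollection = sc.parallelize(Arrays.asList("spark", "hadoop", "rdd"));
// 2. Reference a dataset in external storage via a Hadoop-supported URI (placeholder path).
JavaRDD<String> fromStorage = sc.textFile("hdfs:///user/demo/data.txt");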
Input for Spark
 External stores
HDFS, HBase, Hive, S3, Cassandra, Ext3/Ext4, NTFS ..
Data formats (a reading sketch follows)
 CSV, tab-delimited, TXT, MD
 JSON
 SequenceFile
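A hedged sketch of how a few of these formats are typically read with the Java API; the
paths and the SequenceFile key/value types are illustrative assumptions, and sc is an
existing JavaSparkContext (imports for JavaRDD, JavaPairRDD, Text, and IntWritable are
assumed).
// Plain-text formats (CSV, tab-delimited, TXT, MD): one record per line.
JavaRDD<String> csvLines = sc.textFile("hdfs:///data/input.csv");
// JSON is commonly stored one object per line and parsed record by record.
JavaRDD<String> jsonLines = sc.textFile("hdfs:///data/records.json");
// SequenceFile with Hadoop Writable key/value types (Text/IntWritable assumed here).
JavaPairRDD<Text, IntWritable> seq =
    sc.sequenceFile("hdfs:///data/part-00000", Text.class, IntWritable.class);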
Input for Spark continue…
Spark File Based input
Spark’s file-based input methods, including textFile, support running
on directories, compressed files, and wildcards as well.
 E.g. you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for
controlling the number of partitions of the file.
 By default, Spark creates one partition for each HDFS block of the file,
but you can also ask for a higher number of partitions by passing a
larger value.
JavaRDD<String> distFile = sc.textFile("data.txt");
 SparkContext.wholeTextFiles, by contrast, reads a directory of many small
text files and returns each file as a (filename, content) pair, whereas
textFile returns one record per line in each file.
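A hedged sketch of the optional partition argument; the path and the value 8 are
illustrative, and sc is an existing JavaSparkContext.
// Default: roughly one partition per HDFS block of the file.
JavaRDD<String> byBlock = sc.textFile("hdfs:///data/big.txt");
// Ask for at least 8 partitions to get more parallelism than the block count alone gives.
JavaRDD<String> morePartitions = sc.textFile("hdfs:///data/big.txt", 8);
System.out.println(morePartitions.partitions().size()); // expected to be >= 8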
Spark File Based input continue…
Benefits of Spark
Fault recovery
In-memory processing
Scalable
Fast
Rich set of libraries
Optimized
Unified tool set
Easy programming – the Spark and Scala APIs are fairly high level
Spark “Word count”
Spark “Word count” continue…
The first thing a Spark program has to do is create a
SparkContext object.
 SparkContext represents a connection to a Spark cluster, and
can be used to create RDDs, accumulators, and broadcast
variables on that cluster.
 To create a SparkContext, you first need to create a
SparkConf object to configure your application.
// Create a Java Spark Context.
SparkConf conf = new SparkConf().setAppName("JavaWordCount");
// SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
Spark “Word count” continue…
Create an RDD from a file
RDDs can be created from Hadoop InputFormats (such as
HDFS files) or by transforming other RDDs. The following
code uses the SparkContext to define a base RDD from the
file inputFile
Parallelized collections are created by calling
JavaSparkContext’s parallelize method on an existing
Collection in your driver program
// Create an RDD from the input file.
String inputFile = args[0];
JavaRDD<String> input = sc.textFile(inputFile);
Spark “Word count” continue…
Transform input RDD with flatMap
To split the input text into separate words, we use the
flatMap(func) RDD transformation, which returns a new
RDD formed by passing each element of the source through
a function. The String split function is applied to each line of
text, returning an RDD of the words in the input RDD:
// map/split each line to multiple words
JavaRDD<String> words = input.flatMap( new
FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
} } );
Spark “Word count” continue…
Transform words RDD with map
We use the map(func) to transform the words RDD into an
RDD of (word, 1) key-value pairs:
JavaPairRDD<String, Integer> wordOnePairs = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String x) {
return new Tuple2(x, 1); } } );
Transform wordOnePairs RDD with reduceByKey
To count the number of times each word occurs, we combine
the values (1) in the wordOnePairs with the same key (word)
using reduceByKey(func),
 This transformation will return an RDD of (word, count) pairs
where the values for each word are aggregated using the given
reduce function func x+y:
// reduce add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = wordOnePairs.reduceByKey (
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer x, Integer y) {
return x + y; } } );
Spark “Word count” continue…
Output with RDD action saveAsTextFile
Finally, the RDD action saveAsTextFile(path) writes the
elements of the dataset as a text file (or set of text files) in
the outputFile directory
String outputFile = args[1];
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile);
Spark “Word count” continue…
Running Your Application
You use the bin/spark-submit script to launch your application.
This script takes care of setting up the classpath with Spark
and its dependencies. Here is the spark-submit format:
$./bin/spark-submit --class <main-class> --master <master-url>
<application-jar> [application-arguments]
$bin/spark-submit --class example.wordcount.JavaWordCount --master yarn sparkwordcount-1.0.jar
/user/user01/input/alice.txt /user/user01/output
//Here is the spark-submit command to run the scala SparkWordCount:
$bin/spark-submit --class SparkWordCount --master yarn sparkwordcount-1.0.jar /user/user01/input/alice.txt
/user/user01/output
Spark vs Hadoop
Hello.. Spark or Hadoop: which is the best Big Data
framework?
Hey… Spark has overtaken Hadoop as the most active open
source Big Data project!
The fact is they are not directly comparable products. Why?
They do not perform exactly the same tasks, and they are
not mutually exclusive, as they are able to work together.
They provide some of the most popular tools used to carry
out common Big Data-related tasks.
Spark vs Hadoop continue…
Spark vs Hadoop continue…
Spark's edge over Hadoop is speed (see the caching sketch after this list).
 Spark handles most of its operations and data “in memory”,
copying them from distributed physical storage into far faster
RAM.
 This reduces the time spent writing to and reading from disk at
each level/phase, which otherwise has to happen under
Hadoop’s MapReduce system.
 MapReduce writes all of the data back to the physical storage
medium after each operation.
Spark supports iterative, interactive, and batch data
processing; Hadoop is limited to batch processing!
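Here is the caching sketch referred to above, assuming an existing JavaSparkContext sc;
the path and the ERROR filter are illustrative. The first action materializes the RDD and
caches it in memory, so the second pass avoids re-reading from disk.
JavaRDD<String> lines = sc.textFile("hdfs:///logs/events.log"); // placeholder path
lines.cache(); // keep the partitions in executor memory after the first action
long total = lines.count(); // first pass reads from HDFS and fills the cache
long errors = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
}).count(); // second pass reads the cached partitions from RAM, not from disk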
Spark vs Hadoop continue…
Although Spark is reported to work up to 100 times faster
than Hadoop in certain circumstances, it does not
provide its own distributed storage system for organizing
files. Hadoop has one!
Spark’s advanced analytics applications can make use of
data stored in HDFS in the data processing layer.
Spark includes its own machine learning library, called
MLlib, whereas Hadoop systems must be interfaced with
another machine learning library, for example Apache
Mahout.
Spark vs Hadoop continue…
Apache Spark may only be the processing step in your ETL
(Extract, Transform, Load) chain. It doesn't provide the
stable, rich tool set that the Hadoop ecosystem contains.
You may still need HBase/Nutch/Solr for data acquisition.
Hadoop has a wide range of tools:
 Sqoop and Flume for moving data; Oozie for scheduling; and
HBase or Hive for storage.
 The point is that although Apache Spark is a
very powerful processing system, it should be considered
part of the wider Hadoop ecosystem.
To summarize: Hadoop and Spark are perfect together, and
Spark fits in Hadoop's data processing layer.
Together they can do better!
Spark is Heir to MapReduce
MapReduce is not the best framework for all computations!
To perform complex operations, many Map and Reduce
phases must be strung together; it is limited with respect to
complex and iterative operations.
Spark supports a variety of data sources. It is robust!
Spark supports iterative, interactive, and batch data
processing. It is fast!
It’s entirely possible to re-implement MapReduce-like
computations in Spark. It is easy!
When Spark is not needed!
Your Big Data simply consists of a huge amount of very
structured data (e.g. customer names and addresses), or you may
have no need for the advanced streaming analytics and
machine learning functionality provided by Spark.
Spark, although developing very quickly, is still in its infancy,
and the security and support infrastructure is not as
advanced.
Who uses Spark?
Spark is being adopted by major players like Amazon, eBay,
and Yahoo! According to the Spark FAQ, many organizations
run Spark on clusters with thousands of nodes.
Conclusion
Apache Spark is a cluster computing platform designed to be
fast; it extends the popular MapReduce model to
efficiently support more types of computations, including
interactive queries and stream processing. Spark integrates
closely with other Big Data tools, and this tight integration gives it the
ability to build applications that seamlessly combine different
processing models.
Spark fits a wide range of (almost all) use cases because of its
versatility, integration, and rich set of different libraries.
People fall in love with Spark:
 Enterprises – fits all, open source
 Managers – fewer resources, more productivity
 Developers – high-level language
 Data scientists – algorithms, simple API
References
 http://spark.apache.org/
 https://databricks.com/
 Learning Spark - O'Reilly Media
By Holden Karau, Andy Konwinski,
Patrick Wendell, Matei Zaharia
Thank You