Remember the last time you tried to write a MapReduce job (something less trivial than a word count, obviously)? It certainly did the work, but there were plenty of pain points on the way from an idea to its implementation in terms of map and reduce. Did you wonder how much simpler life would be if you could code as if you were doing collection operations, with the distributed nature of the data kept transparent*? Did you hope for more performant, lower-latency jobs? Well, it seems you are in luck.
In this talk, we will cover a different way to do MapReduce-style operations without being limited to just map and reduce: yes, we will be talking about Apache Spark. We will compare and contrast the Spark programming model with MapReduce. We will see where it shines, why to use it, and how to use it. We'll cover aspects like testability, maintainability and conciseness of the code, and features like iterative processing, optional in-memory caching and others. We will see how Spark, being just a cluster computing engine, abstracts the underlying distributed storage and cluster management, giving us a uniform interface to consume, process and query the data. We will explore the basic abstraction of the RDD, which provides many of the features that make Apache Spark a very good choice for big data applications. We will see this through some non-trivial code examples.
Session at the IndicThreads.com Conference held in Pune, India on 27-28 Feb 2015
http://www.indicthreads.com
http://pune15.indicthreads.com
2. 2
Some properties of “Big Data”
• Big data is inherently immutable, meaning it is not supposed to be
updated once generated.
• Writes are mostly coarse-grained operations.
• Commodity hardware makes more sense for storing and processing
such enormous data, hence the data is distributed across a cluster
of many such machines.
• The distributed nature makes programming complicated.
3. 3
Brush-up on Hadoop concepts
Distributed Storage => HDFS
Cluster Manager => YARN
Fault tolerance => achieved via replication
Job scheduling => Scheduler in YARN
Mapper
Reducer
Combiner
9. 9
MapReduce pain points
• Considerable latency
• Only map and reduce phases
• Non-trivial to test
• Results in complex workflows
• Not suitable for iterative processing
10. 10
Immutability and MapReduce model
• HDFS storage is immutable or append-only.
• The MapReduce model fails to exploit the immutable nature of
the data.
• The intermediate results are persisted, resulting in a huge amount of IO
and a serious performance hit.
11. 11
Wouldn’t it be very nice if we could have
• Low latency
• Programmer friendly programming model
• Unified ecosystem
• Fault tolerance and other typical distributed system properties
• Easily testable code
• Of course open source :)
12. 12
What is Apache Spark
• Cluster computing engine
• Abstracts the storage and cluster management
• Unified interfaces to data
• API in Scala, Python, Java, R*
13. 13
Where does it fit in the existing big data ecosystem
http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html
14. 14
Why should you care about Apache Spark
• Abstracts underlying storage
• Abstracts cluster management
• Easy programming model
• Very easy to test the code
• Highly performant
15. 15
• Petabyte sort record
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
16. 16
• Offers in-memory caching of data
• Specialized Applications
• GraphX for graph processing
• Spark Streaming
• MLlib for machine learning
• Spark SQL
• Data exploration via Spark-Shell
18. 18
Word Count example
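// `spark` is assumed to be an existing SparkContext (named `sc` in spark-shell)
val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))                            // pair every word with a count of 1
  .reduceByKey((a, b) => a + b)                      // sum the counts per word
counts.saveAsTextFile("destination path")            // action: runs the job and writes the result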
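Since testability is one of the selling points listed earlier, here is a minimal sketch of how the word count logic could be checked on a single machine. It assumes Spark 1.3 or later on the classpath (earlier releases also need `import org.apache.spark.SparkContext._` for `reduceByKey`). The names `WordCountLogic`, `WordCountCheck`, `countWords` and `demo` are made up for this illustration, and the `local[2]` master runs Spark in-process with two threads.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper: the word count logic, independent of where the lines come from.
object WordCountLogic {
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
}

// Hypothetical check: runs the logic on a tiny in-memory input in local mode.
object WordCountCheck {
  def demo(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("word-count-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(Seq("to be or", "not to be"))
      // collect() is an action: it brings the (small) result back to the driver.
      WordCountLogic.countWords(input).collect().toMap
    } finally {
      sc.stop() // always release the local context
    }
  }

  def main(args: Array[String]): Unit = {
    val result = demo()
    assert(result == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1))
    println(result)
  }
}
```

Keeping the transformation logic in a function that takes and returns RDDs is a design choice for this sketch: the same function can then be fed from `textFile` in production and from `parallelize` in a test.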
21. 21
RDD
• RDD stands for Resilient Distributed Dataset.
• The basic abstraction in Spark
22. 22
• The equivalent of a distributed collection.
• The interface makes the distributed nature of the underlying data
transparent.
• An RDD is immutable.
• Can be created by:
• parallelising a collection,
• transforming an existing RDD by applying a transformation
function,
• reading from a persistent data store like HDFS.
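The three ways of creating an RDD listed above can be sketched as follows. This is an illustration under assumptions, not code from the talk: it runs Spark in local mode, uses a local temporary file as a stand-in for HDFS, and the names `RddCreationSketch` and `demo` are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  // Returns the element counts of the three RDDs created below.
  def demo(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 1. Parallelise an in-memory collection.
      val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // 2. Transform an existing RDD; `numbers` itself is left unchanged.
      val evens = numbers.filter(n => n % 2 == 0)

      // 3. Read from a persistent store (assumption: a temp file stands in for HDFS).
      val path = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))
      val lines = sc.textFile(path.toUri.toString)

      (numbers.count(), evens.count(), lines.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo()) // prints (5,2,2)
}
```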
23. 23
RDD is lazily evaluated
An RDD has two types of operations
• Transformations
Build up a DAG of transformations to be applied to the RDD
Do not evaluate anything
• Actions
Evaluate the DAG of transformations
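The split between transformations and actions can be illustrated with a small sketch. It assumes a local-mode SparkContext; `LazyEvaluationSketch` and `demo` are hypothetical names used only here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  def demo(): Int = {
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10)

      // Transformations: each call only extends the DAG, no job runs yet.
      val evens = numbers.filter(n => n % 2 == 0)
      val squares = evens.map(n => n * n)

      // Action: Spark now evaluates the whole DAG and returns a value to the driver.
      squares.reduce((a, b) => a + b)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo()) // 4 + 16 + 36 + 64 + 100 = 220
}
```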
31. 31
Reduced IO
• No disk IO between phases since phases themselves are pipelined
• No network IO involved unless a shuffle is required
32. 32
No Mandatory Shuffle
• Programs are not bound to map and reduce phases
• No mandatory shuffle and sort
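One way to see that a shuffle is optional is to inspect the lineage of two small pipelines. The sketch below assumes local mode and relies on the text returned by `toDebugString`, which is meant for humans and may differ between Spark versions; checking it for the RDD class name `ShuffledRDD` is a shortcut for this illustration, and `ShuffleSketch` and `demo` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch {
  // Returns whether each pipeline's lineage mentions a ShuffledRDD.
  def demo(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("shuffle-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "hadoop", "spark"))

      // Map-only pipeline: filter and map work partition by partition.
      val upper = words.filter(w => w.startsWith("s")).map(w => w.toUpperCase)

      // reduceByKey has to bring equal keys together, which needs a shuffle.
      val counts = words.map(w => (w, 1)).reduceByKey((a, b) => a + b)

      (upper.toDebugString.contains("ShuffledRDD"),
        counts.toDebugString.contains("ShuffledRDD"))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Printing `toDebugString` for each RDD in spark-shell shows the same lineage in a readable form.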
33. 33
In memory caching of data
• Optional in-memory caching
• The DAG engine can apply certain optimisations since, when an action is
called, it knows all the transformations that have to be applied
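A minimal sketch of the optional caching, assuming local mode and enough memory to hold the cached partitions; `CachingSketch` and `demo` are hypothetical names. Calling `cache()` only marks the RDD for in-memory storage, and the data is materialised by the first action that computes it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def demo(): (Long, Int) = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn"))

      // Mark the derived RDD for in-memory caching; nothing is computed yet.
      val lengths = words.map(w => w.length).cache()

      val n = lengths.count()                       // first action: computes and caches
      val total = lengths.reduce((a, b) => a + b)   // second action: can reuse the cache
      (n, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo()) // prints (4,20)
}
```

Without `cache()`, the second action would recompute `lengths` from `words`, which is what makes caching attractive for iterative processing over the same data.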