Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science
University of Pittsburgh
ztliu@cs.pitt.edu
September 24, 2015
Big Data Analytics
Big Data Analytics requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop provides both.
Apache Hadoop
Too many meanings are associated with “Hadoop”, so let’s look at Apache Hadoop first.
Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets
on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
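In practice, files are usually moved in and out of HDFS with the standard "hdfs dfs" command-line tool. The sketch below shells out to it from Python; it assumes a configured Hadoop client on the PATH, and the file and directory names are hypothetical.

# Minimal sketch: copy a local file into HDFS and list the target directory.
# Assumes the "hdfs" CLI is installed and the cluster is reachable.
import subprocess

local_path = "logs.txt"     # hypothetical local file
hdfs_dir = "/user/demo"     # hypothetical HDFS directory

subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)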
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
A MapReduce program is composed of:
a Map procedure
a Reduce procedure
(a toy word count is sketched below)
Figure 1: Image from http://tessera.io/docs-datadr/
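To make the model concrete, here is a self-contained toy word count in plain Python that mimics the map, shuffle, and reduce phases. It is an illustration only; real Hadoop jobs implement Mapper and Reducer classes, typically in Java.

# Toy word count, mimicking MapReduce's map -> shuffle -> reduce flow.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

lines = ["big data needs big tools",
         "hadoop and spark process big data"]

# Shuffle: group intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # e.g. {'big': 3, 'data': 2, ...}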
Hadoop Ecosystem
The Hadoop ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases.
Column Data Model, such as HBase, Cassandra.
Document Data Model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:
sifting through and analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch mode processing (one-pass computation model)
difficult to program directly in MapReduce
performance bottlenecks
In short, MR doesn’t compose well for a large number of applications.
Therefore, people built specialized systems as workarounds, such as Spark.
Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the
Hadoop Distributed File System (HDFS). It is a framework for writing
fast, distributed programs.
Faster: an in-memory approach up to 10 times faster than MapReduce for certain applications, and better for iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python (see the sketch below).
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
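As a hedged illustration of the API, here is a minimal word count using Spark's RDD interface in Python; it assumes a local Spark installation, and the input file name is hypothetical.

# Minimal PySpark word count (run with spark-submit or in the pyspark shell).
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("logs.txt")                   # hypothetical input file
            .flatMap(lambda line: line.split())     # line -> words
            .map(lambda word: (word, 1))            # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))       # sum counts per word

print(counts.take(5))   # first five (word, count) pairs
sc.stop()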
Advantages of Spark
Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds.
Efficient iterative algorithms, since subsequent iterations share data through memory and can repeatedly access the same cached dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
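The sketch below illustrates that caching pattern in PySpark; the dataset and iteration count are made up for illustration.

# Iterative reuse of a cached RDD (illustrative numbers only).
from pyspark import SparkContext

sc = SparkContext("local", "IterativeDemo")

data = sc.parallelize(range(1, 1001)).cache()   # working set kept in memory

total = 0.0
for i in range(10):
    # Each pass rereads the cached dataset from memory, not from disk.
    total += data.map(lambda x: x * i).sum()

print(total)
sc.stop()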
Apache Spark
Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better deal - this is what it was designed for.¹
¹ https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So if you need to process really big data, Hadoop will be the cheaper option, since hard disk space comes at a much lower rate than memory space.²
² https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank You
Q & A