Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science
University of Pittsburgh
ztliu@cs.pitt.edu
September 24, 2015
Big Data Analytics
Big Data Analytics requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop provides both.
Apache Hadoop
Too many meanings are associated with “Hadoop”, so let’s look at Apache Hadoop first.
Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets
on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
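In practice, files are usually moved in and out of HDFS with the standard "hdfs dfs" command-line tool. The sketch below shells out to it from Python; it assumes a configured Hadoop client on the PATH, and the file and directory names are hypothetical.

# Minimal sketch: copy a local file into HDFS and list the target directory.
# Assumes the "hdfs" CLI is installed and the cluster is reachable.
import subprocess

local_path = "logs.txt"     # hypothetical local file
hdfs_dir = "/user/demo"     # hypothetical HDFS directory

subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)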
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
A MapReduce program is composed of:
a Map procedure
a Reduce procedure
(a toy word count is sketched below)
Figure 1: Image from http://tessera.io/docs-datadr/
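To make the model concrete, here is a self-contained toy word count in plain Python that mimics the map, shuffle, and reduce phases. It is an illustration only; real Hadoop jobs implement Mapper and Reducer classes, typically in Java.

# Toy word count, mimicking MapReduce's map -> shuffle -> reduce flow.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

lines = ["big data needs big tools",
         "hadoop and spark process big data"]

# Shuffle: group intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # e.g. {'big': 3, 'data': 2, ...}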
Hadoop Ecosystem
The Hadoop ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases.
Column Data Model, such as HBase, Cassandra.
Document Data Model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:
sifting through and analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch mode processing (one-pass computation model)
difficult to program directly in MapReduce
performance bottlenecks
In short, MR doesn’t compose well for a large number of applications.
Therefore, people built specialized systems as workarounds, such as Spark.
Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the
Hadoop Distributed File System (HDFS). It is a framework for writing
fast, distributed programs.
Faster: an in-memory approach up to 10 times faster than MapReduce for certain applications, and better for iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python (see the sketch below).
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
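As a hedged illustration of the API, here is a minimal word count using Spark's RDD interface in Python; it assumes a local Spark installation, and the input file name is hypothetical.

# Minimal PySpark word count (run with spark-submit or in the pyspark shell).
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("logs.txt")                   # hypothetical input file
            .flatMap(lambda line: line.split())     # line -> words
            .map(lambda word: (word, 1))            # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))       # sum counts per word

print(counts.take(5))   # first five (word, count) pairs
sc.stop()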
Advantages of Spark
Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds.
Efficient iterative algorithms, since subsequent iterations share data through memory and can repeatedly access the same cached dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
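The sketch below illustrates that caching pattern in PySpark; the dataset and iteration count are made up for illustration.

# Iterative reuse of a cached RDD (illustrative numbers only).
from pyspark import SparkContext

sc = SparkContext("local", "IterativeDemo")

data = sc.parallelize(range(1, 1001)).cache()   # working set kept in memory

total = 0.0
for i in range(10):
    # Each pass rereads the cached dataset from memory, not from disk.
    total += data.map(lambda x: x * i).sum()

print(total)
sc.stop()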
Apache Spark
Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better deal - this is what it was designed for.¹
¹ https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So if you need to process really big data, Hadoop will be the cheaper option, since hard disk space comes at a much lower rate than memory space.²
² https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank You
Q & A