Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science
University of Pittsburgh
ztliu@cs.pitt.edu
September 24, 2015
Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
Big Data Analytics
Big data analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop
Apache Hadoop
The name “Hadoop” carries many meanings. Let’s look at Apache
Hadoop first.
Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets
on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and
portable file system written in Java for the Hadoop framework.
It stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes)
across multiple machines.
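As a rough illustration of how one large file ends up spread across multiple machines, the sketch below splits a file into fixed-size blocks and assigns each block's replicas to machines. The 128 MB block size and replication factor of 3 match common HDFS defaults. The function name, the node names, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MB, a common HDFS default block size
REPLICATION = 3              # common HDFS default replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [nodes holding a replica]} for a file of file_size bytes.

    Placement is plain round-robin (an illustrative assumption, not the
    policy HDFS actually uses).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different machine.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]   # hypothetical machine names
    layout = place_blocks(1024**3, nodes)          # a 1 GB file
    for block, holders in layout.items():
        print(f"block {block}: {holders}")
```

Because every block lives on several machines, losing one machine does not lose the file, and different machines can read different blocks at the same time.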
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
A MapReduce program is composed of:
Map procedure
Reduce procedure
Figure 1: Image from http://tessera.io/docs-datadr/
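The Map and Reduce procedures can be sketched with a word count, a conventional teaching example chosen here as an assumption. This is a single-process Python imitation of the programming model, not the Hadoop MapReduce API; the function names are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the two procedures.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run Map over every line, group by word, then Reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data on a cluster"]))
```

On a real cluster, each machine would run the Map procedure on its own share of the input, and the Reduce procedure would run in parallel on separate groups of keys.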
Hadoop Ecosystem
The Hadoop ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases:
  Column Data Model, such as HBase, Cassandra.
  Document Data Model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex
batch applications as diverse as:
sifting through and analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch-mode processing (one-pass computation model)
Difficult to program directly in MapReduce
Performance bottlenecks
In short, MapReduce (MR) doesn’t compose well for a large number of applications.
Therefore, people built other systems, such as Spark, to work around these limitations.
Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the
Hadoop Distributed File System (HDFS). It is a framework for writing
fast, distributed programs.
Fast: an in-memory approach that is 10 times faster than MapReduce for
certain applications, and better suited to iterative algorithms in machine learning (ML).
Clean, concise APIs in Scala, Java and Python.
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
Advantages of Spark
Low-latency computations by caching the working dataset in memory
and then performing computations at memory speeds.
Efficient iterative algorithms: subsequent iterations share
data through memory, or repeatedly access the same dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
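The benefit of keeping the working dataset in memory can be sketched with a toy model that counts how often an iterative job goes back to disk. This does not use Spark's API. The class, the function names, and the simple averaging loop are all assumptions for illustration only.

```python
class DiskDataset:
    """Toy stand-in for a dataset stored on disk; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1          # each call models one full pass over disk
        return list(self._values)


def iterate_without_cache(dataset, iterations):
    """Re-read the input from disk on every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.load()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)   # move halfway toward the mean
    return estimate


def iterate_with_cache(dataset, iterations):
    """Load once, keep the working set in memory, and reuse it."""
    data = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate


if __name__ == "__main__":
    uncached = DiskDataset([1, 2, 3, 4])
    cached = DiskDataset([1, 2, 3, 4])
    iterate_without_cache(uncached, 10)
    iterate_with_cache(cached, 10)
    print("disk reads without cache:", uncached.disk_reads)
    print("disk reads with cache:", cached.disk_reads)
```

Both versions compute the same answer; the only difference is how many times the data is read from disk, which is where the iterative speedup comes from in this model.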
Apache Spark
Spark has the upper hand as long as we’re talking about iterative
computations that need to pass over the same data many times.
But when it comes to one-pass ETL-like jobs, for example, data
transformation or data integration, MapReduce is the better fit: this is
what it was designed for [1].
[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
For optimal performance, the memory in the Spark cluster should be at least
as large as the amount of data you need to process, because the data has to
fit into memory. So, if you need to process very large data sets,
Hadoop will be the cheaper option, since hard disk space comes
at a much lower price than memory [2].
[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank you
Q & A

Hadoop/Spark Non-Technical Basics

  • 1. Hadoop/Spark Non-Technical Basics Zitao Liu Department of Computer Science University of Pittsburgh ztliu@cs.pitt.edu September 24, 2015 Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
  • 2. Big Data Analytics Big data analytics always requires two components: a filesystem to store big data, and a computation framework to analyze big data.
  • 3. Big Data Analytics Big data analytics always requires two components: a filesystem to store big data, and a computation framework to analyze big data. Hadoop provides both.
  • 4. Apache Hadoop Too many meanings are associated with “Hadoop”. Let’s look at Apache Hadoop first. Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
  • 5. Apache Hadoop The base Apache Hadoop framework is composed of the following modules: Hadoop Common, Hadoop Distributed File System, Hadoop YARN, Hadoop MapReduce.
  • 6. Apache Hadoop The base Apache Hadoop framework is composed of the following modules: Hadoop Common, Hadoop Distributed File System (HDFS) for storage, Hadoop YARN, Hadoop MapReduce for processing.
  • 7. Hadoop Distributed File System (HDFS) The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
  • 8. Hadoop MapReduce MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map procedure and a Reduce procedure. Figure 1: Image from http://tessera.io/docs-datadr/
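
The Map and Reduce procedures on slide 8 can be illustrated with a word count. This is a minimal single-machine sketch in plain Python, not Hadoop's actual API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step that a real cluster performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (assumed stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real deployment the map calls would run in parallel on the machines holding each block of input, and only the grouped intermediate pairs would travel over the network to the reducers.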
  • 9. Hadoop Ecosystem The Hadoop ecosystem includes: distributed filesystems, such as HDFS; distributed programming, such as MapReduce, Pig, Spark; SQL-on-Hadoop, such as Hive, Drill, Presto; NoSQL databases, with a column data model, such as HBase, Cassandra, or a document data model, such as MongoDB; and others.
  • 10. MapReduce vs. Spark A quick history: Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
  • 11. Advantages of MapReduce MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as sifting through and analyzing system logs, running ETL, computing web indexes, and powering personal recommendation systems.
  • 12. Limitations of MapReduce Some limitations of MapReduce: batch-mode processing (a one-pass computation model); difficult to program directly in MapReduce; performance bottlenecks. In short, MR doesn’t compose well for a large number of applications. Therefore, people built specialized systems as workarounds, such as Spark. Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
  • 13. Apache Spark Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs. Faster (an in-memory approach): 10 times faster than MapReduce for certain applications, and better for iterative algorithms in ML. Clean, concise APIs in Scala, Java and Python. Interactive query analysis (from the Scala and Python shells). Real-time analysis (Spark Streaming).
  • 14. Advantages of Spark Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds. Efficient iterative algorithms, by having subsequent iterations share data through memory or repeatedly access the same dataset. Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
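
The caching idea on slide 14 can be sketched without Spark itself. The following plain-Python toy is an illustration under stated assumptions, not Spark's API: `DiskDataset` is a hypothetical class that simply counts how often the data is re-read, and the two functions run the same iterative computation with and without keeping the working dataset in memory.

```python
class DiskDataset:
    """Hypothetical dataset that counts how many times it is read from 'disk'."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0

    def load(self):
        # Each call stands in for one full pass over files on disk.
        self.disk_reads += 1
        return list(self._values)


def iterate_without_cache(dataset, iterations):
    """Reload the data on every pass, like a chain of independent batch jobs."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.load()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


def iterate_with_cache(dataset, iterations):
    """Load the data once and let every later iteration share it through memory."""
    data = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate


if __name__ == "__main__":
    uncached, cached = DiskDataset([1, 2, 3]), DiskDataset([1, 2, 3])
    iterate_without_cache(uncached, 5)
    iterate_with_cache(cached, 5)
    print(uncached.disk_reads, cached.disk_reads)
```

Both functions compute the same result; the difference is only in how many times the input is read, which is the cost that in-memory caching is meant to remove for iterative workloads.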
  • 15. Apache Spark Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass ETL-like jobs, for example data transformation or data integration, MapReduce is the better fit, since this is what it was designed for [1]. [1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
  • 16. Apache Spark Cost The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory space [2]. [2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
  • 17. Thank You Q & A