TACKLING BIG DATA WITH THE ELEPHANT IN THE ROOM
WHAT’S THE PROBLEM WITH BIG DATA?
Volume Variety Velocity
WHAT’S THE SOLUTION TO BIG DATA?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
– Grace Hopper
HADOOP’S SOLUTION
Hadoop Ecosystem: Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …
Hadoop Core Components: Hadoop Distributed File System (HDFS), MapReduce
WHAT IS HDFS?
HOW DOES HDFS WORK?
Large
Data
File
Block #1
Block #2
HOW DOES HDFS WORK?
Large
Data
File
Block #1
Block #2
Block #1
Block #1
Block #1
HOW DOES HDFS WORK?
Large
Data
File
Block #1
Block #2
Block #1
Block #1
Block #1
Block #2
Block #2
Block #2
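The block-and-replica picture above can be sketched in a few lines of Python. This is only a toy model, not how HDFS is implemented: the 16-byte block size, the five node names, and the round-robin placement are assumptions made for illustration (real HDFS blocks are commonly 128 MB by default in recent versions, and real placement is rack-aware). The three copies per block match what the slides draw.

```python
BLOCK_SIZE = 16      # bytes; toy value chosen so the sample splits into two blocks
REPLICATION = 3      # three copies of every block, as drawn on the slides
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


data = b"the cat sat on the mat"          # stands in for the "large data file"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks))

for block_id, block in enumerate(blocks):
    print(f"Block #{block_id + 1} ({len(block)} bytes) -> {placement[block_id]}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.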
WHAT IS MAP-REDUCE?
Core Ideas
–  Data Locality
–  Parallelism
–  Block Independence
Three Stages
1.  Map
2.  Swap & Sort
3.  Reduce
WORD COUNT MAP
the cat sat on the mat
the aardvark sat on the
…
Node 1
the mahout drove the
….
Node 2
the cat sat on the mat
the aardvark sat on the
…
the mahout drove the
…
Mapper
WORD COUNT MAP
the cat sat on the mat
the aardvark sat on the
…
Node 1
the mahout drove the
….
Node 2
Mapper
map()
map()
Mapper
WORD COUNT MAP
the cat sat on the mat
the aardvark sat on the
…
Node 1
the mahout drove the
….
Node 2
Mapper
map()
map()
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
mahout 1
drove 1
the 1
Mapper
WORD COUNT MAP
the cat sat on the mat
the aardvark sat on the
…
Node 1
the mahout drove the
….
Node 2
Mapper
map()
map()
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
mahout 1
drove 1
the 1
map()
the 1
aardvark 1
sat 1
on 1
the 1
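The map stage shown above can be sketched as a plain Python function. This is a minimal illustration rather than Hadoop's Mapper API: the name `map_line` and the two in-memory lists standing in for the blocks on Node 1 and Node 2 are assumptions, and only the text visible on the slides is used (the elided "…" parts are left out).

```python
def map_line(line):
    """Emit a (word, 1) pair for every word on one input line."""
    return [(word, 1) for word in line.split()]


# Each node runs map() only over the lines of its own local block.
node1_block = ["the cat sat on the mat", "the aardvark sat on the"]
node2_block = ["the mahout drove the"]

node1_pairs = [pair for line in node1_block for pair in map_line(line)]
node2_pairs = [pair for line in node2_block for pair in map_line(line)]

for word, one in node1_pairs + node2_pairs:
    print(word, one)
```

Note that the mapper does no counting at all. It never sees more than one line at a time, which is what lets each block be processed independently and in parallel.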
WORD COUNT SWAP & SORT
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
mahout 1
drove 1
the 1
the 1
aardvark 1
sat 1
on 1
the 1
WORD COUNT SWAP & SORT
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
mahout 1
drove 1
the 1
the 1
aardvark 1
sat 1
on 1
the 1
aardvark 1
cat 1
mat 1
on 1,1
sat 1,1
the 1,1,1,1
drove 1
mahout 1
the 1,1
WORD COUNT SWAP & SORT
the 1
cat 1
sat 1
on 1
the 1
mat 1
the 1
mahout 1
drove 1
the 1
the 1
aardvark 1
sat 1
on 1
the 1
aardvark 1
cat 1
mat 1
on 1,1
sat 1,1
the 1,1,1,1
drove 1
mahout 1
the 1,1
aardvark 1
cat 1
mat 1
mahout 1
sat 1,1
drove 1
on 1,1
the 1,1,1,1,1,1
Node 3
Node 4
Node 5
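The swap & sort stage (usually called "shuffle and sort" in Hadoop documentation) can be sketched as grouping all values by key and routing each key to exactly one reducer. This is a rough sketch under assumptions: the function names are hypothetical, and CRC32 modulo the number of reducers stands in for a hash partitioner, so which word lands on which reducer here will not necessarily match the slides.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # the slides show three reducer nodes


def partition(word, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key. CRC32 is an assumed stand-in for a hash partitioner."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers=NUM_REDUCERS):
    """Group values by key, send each key to one reducer, sort keys per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for word, count in pairs:
            buckets[partition(word, num_reducers)][word].append(count)
    return [dict(sorted(bucket.items())) for bucket in buckets]


# Map output from the two mapper nodes (only the text visible on the slides).
node1_pairs = [(w, 1) for w in "the cat sat on the mat the aardvark sat on the".split()]
node2_pairs = [(w, 1) for w in "the mahout drove the".split()]

reducer_inputs = shuffle_and_sort([node1_pairs, node2_pairs])
for reducer_id, bucket in enumerate(reducer_inputs):
    print(f"Reducer {reducer_id}: {bucket}")
```

The important property is that every occurrence of a given word, from every mapper, ends up in the same bucket. "sat" occurs twice in the Node 1 text, so it arrives at its reducer as 1,1, and "the" arrives as six 1s.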
WORD COUNT REDUCER
aardvark 1
cat 1
mat 1
mahout 1
sat 1,1
drove 1
on 1,1
the 1,1,1,1,1,1
Node 3
Node 4
Node 5
Reducer 0
Reducer 1
Reducer 2
aardvark 1
cat 1
mat 1
mahout 1
sat 2
drove 1
on 2
the 6
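The reduce stage then only has to sum the list of 1s it receives for each word. A minimal sketch, assuming a hypothetical `reduce_counts` function and a hand-written `grouped` dictionary built from the text visible on the slides (in a real job each reducer would see only its own share of the keys):

```python
def reduce_counts(word, counts):
    """Collapse the list of 1s collected for a word into a single total."""
    return word, sum(counts)


# Grouped input as produced by the swap & sort stage.
grouped = {
    "aardvark": [1],
    "cat": [1],
    "drove": [1],
    "mahout": [1],
    "mat": [1],
    "on": [1, 1],
    "sat": [1, 1],
    "the": [1, 1, 1, 1, 1, 1],
}

totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
for word in sorted(totals):
    print(word, totals[word])
```

Because each word is handled by exactly one reducer, the reducers never need to talk to each other, and their outputs can simply be concatenated.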
TAKE-AWAYS
Hadoop Ecosystem: Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …
Hadoop Core Components: Hadoop Distributed File System (HDFS), MapReduce
QUESTIONS?