COLLEGE OF COMPUTING AND INFORMATICS
MSc. in Information Technology
Storing Big Data
Nov, 2021
Assosa, Ethiopia
Contents
• Introduction
• Overview of Big Data
• Delivering business benefit from Big Data
• Big Data with traditional data
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts, business benefits, characteristics and
sources of big data.
–Explain Big Data in relation to traditional data
Introduction
• What makes big data valuable?
– It is insightful, actionable and predictive over time.
• What are the challenges of big data?
– Unimaginable size and growth, heterogeneous systems and data
– Traditional systems (RDBMS) do not scale well and are costly
• What are the solutions?
– Scale up (increase the configuration of a single system: storage, RAM, CPU)
– Scale out (use multiple commodity machines and distribute the load)
• Nodes may fail frequently (network or machine issues), and the number of nodes keeps changing
• During analysis, take the results from different machines and merge/aggregate
them (because the same file is divided across multiple machines for parallel processing)
– A solution to handle and process huge structured and unstructured data:
Hadoop
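The scale-out idea above (divide the data, process the pieces in parallel, then merge the partial results) can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: threads stand in for cluster nodes, and the function names `split_into_chunks`, `count_words` and `scale_out_word_count` are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_nodes):
    """Divide the input lines into one chunk per (simulated) node."""
    return [lines[i::num_nodes] for i in range(num_nodes)]


def count_words(chunk):
    """Work done independently on one node: count the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def scale_out_word_count(lines, num_nodes=3):
    """Process the chunks in parallel, then merge the partial results."""
    chunks = split_into_chunks(lines, num_nodes)
    # Threads are a stand-in for separate machines (assumption for this sketch).
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds the counts together
    return total


if __name__ == "__main__":
    data = ["big data is big", "hadoop stores big data", "data is valuable"]
    print(scale_out_word_count(data))
```

The merge step is needed because no single worker sees the whole input; each one only returns counts for its own chunk.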
• Some of the big data problems are:
– Exponentially growing storage, a wide variety of huge datasets, and processing them
• Hadoop as a Solution
– A framework that allows us to store and process large data sets in a parallel and
distributed fashion. It has three basic components: HDFS (storage), MapReduce
(processing) and YARN (resource management).
Introduction to Hadoop
• Apache open source software framework for reliable, scalable,
distributed computing on massive amounts of data
– Hides underlying system details and complexities from user
– Developed in Java
• Meant for heterogeneous commodity hardware
• Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one
big file system
• Has a large ecosystem with both open-source & proprietary Hadoop-related
projects
– HBase / ZooKeeper / Avro / etc.
• A large (and growing) ecosystem
– Hadoop has an ecosystem that has evolved from its four core
components (Hadoop Common, HDFS, YARN and MapReduce). It is continuously
growing to meet the needs of Big Data.
• What Hadoop is good for:
– Processing massive amounts of data through
parallelism
– A variety of data (structured, unstructured,
semi-structured)
– Running on inexpensive commodity hardware
• What Hadoop is not good for:
– Processing transactions (random access)
– Work that cannot be parallelized
– Low-latency data access
– Processing lots of small files
– Intensive calculations with little data
HDFS
• HDFS creates a level of abstraction over the resources, so that
we can see the whole of HDFS logically as a single unit for storing big data,
while the data is actually stored across multiple systems.
• Characteristics
–Scalable Storage for Large Files
–Replication
–Streaming Data Access
–File append
• Has two core components
– NameNode: the main node, which holds metadata about the stored data
• Which data block is stored on which DataNode, and where the
replicas of each data block are
• Persistently stores the filesystem metadata (the namespace and the
mapping of files to blocks) on disk as two files: fsimage and edits.
The locations of blocks on DataNodes are kept in memory and rebuilt
from DataNode block reports.
• fsimage contains a complete snapshot of the filesystem
metadata.
• The edits file stores the incremental updates to the metadata.
• Responsible for executing operations such as opening
and closing files; no data actually flows through the NameNode
– DataNode:
• Commodity hardware in the distributed environment on
which the actual data is stored.
• The data blocks held on the DataNodes are replicated;
by default the replication factor is 3.
• The placement of replicas on the DataNodes is determined
by a rack-aware placement policy.
• This placement policy improves the reliability and availability of
the blocks.
• For a replication factor of three, one replica is placed on a
node on the local rack, the second replica is placed on a
node on a remote rack and the third replica is
placed on a different node on the same remote rack.
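The placement rule for a replication factor of three can be sketched as a small Python function. This is a simplified illustration, not the HDFS implementation: the `topology` dictionary, the node and rack names, and the deterministic "first in sorted order" choice are all assumptions (a real cluster would also weigh load and free space, and would not pick nodes in a fixed order).

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes following the rack-aware rule described above.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    writer_node: the node the client writes from; assumed to be in the topology.
    Raises StopIteration if no suitable rack exists.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)

    # Replica 1: a node on the local rack (here, the writer's own node).
    first = writer_node

    # Replica 2: a node on a remote rack. The rack must hold at least two
    # nodes so that replica 3 can live on the same rack.
    remote_rack = next(
        rack for rack, nodes in sorted(topology.items())
        if rack != local_rack and len(nodes) >= 2
    )
    remote_nodes = sorted(topology[remote_rack])
    second = remote_nodes[0]

    # Replica 3: a different node on the same remote rack as replica 2.
    third = remote_nodes[1]

    return [first, second, third]


if __name__ == "__main__":
    cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5"]}
    print(place_replicas(cluster, "dn1"))
```

With this layout the block survives the loss of either rack, while only one copy has to cross racks during the write.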
HDFS- Replication
HDFS Architecture
Secondary NameNode
The edits file keeps growing over time as incremental updates are stored. The
responsibility of applying these updates to the fsimage file is delegated to the Secondary
NameNode, because the NameNode may not have enough resources available while it is
performing other operations.
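The merge of edits into fsimage (a checkpoint) can be pictured with a toy model. In this sketch the snapshot is a plain dictionary from path to block list and each edit is an `(operation, path, value)` tuple; both formats are assumptions made for illustration and do not reflect the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Return a new snapshot with every logged edit applied in order."""
    snapshot = dict(fsimage)  # copy so the old snapshot is left untouched
    for op, path, value in edits:
        if op == "create":
            snapshot[path] = value            # value: list of block ids
        elif op == "delete":
            snapshot.pop(path, None)
        elif op == "rename":
            snapshot[value] = snapshot.pop(path)  # value: the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return snapshot


def checkpoint(fsimage, edits):
    """Merge the edit log into the snapshot and start a fresh, empty log."""
    return apply_edits(fsimage, edits), []


if __name__ == "__main__":
    image = {"/a.txt": ["blk_1"]}
    log = [
        ("create", "/b.txt", ["blk_2"]),
        ("rename", "/a.txt", "/c.txt"),
        ("delete", "/b.txt", None),
    ]
    new_image, new_log = checkpoint(image, log)
    print(new_image, new_log)
```

After a checkpoint the log is short again, so a restart has fewer edits to replay.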
Multiple Namenodes / Namespaces
• To scale the name service horizontally, federation uses multiple
independent NameNodes / namespaces.
• The NameNodes are federated; the NameNodes are independent
and do not require coordination with each other.
• The DataNodes are used as common storage for blocks by all the
NameNodes.
• Each DataNode registers with all the NameNodes in the cluster.
• DataNodes send periodic heartbeats and block reports.
• They also handle commands from the NameNodes.
• As a rule of thumb for each NameNode in a federated setup, one million blocks
(~100 TB of data) require roughly 1 GB of RAM
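The rule of thumb above also shows why many small files are a problem: NameNode memory grows with the number of blocks, not with the number of bytes. The following sketch applies the quoted figure directly; the helper names and the 128 MB block size are assumptions, and the estimate ignores the extra per-file metadata a real NameNode keeps.

```python
import math

BLOCK_SIZE_BYTES = 128 * 1024**2   # assumed 128 MB block size
GB_PER_MILLION_BLOCKS = 1.0        # rule of thumb quoted above


def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE_BYTES):
    """Number of blocks one file occupies (the last block may be partial)."""
    return math.ceil(file_size_bytes / block_size)


def namenode_ram_gb(total_blocks):
    """Estimated NameNode RAM in GB under the rule of thumb."""
    return total_blocks / 1_000_000 * GB_PER_MILLION_BLOCKS


if __name__ == "__main__":
    total_bytes = 100 * 1024**4  # 100 TB of data in both cases
    for file_size in (1024**3, 1024**2):  # 1 GB files versus 1 MB files
        num_files = total_bytes // file_size
        total_blocks = num_files * blocks_needed(file_size)
        print(file_size, total_blocks, namenode_ram_gb(total_blocks))
```

Under these assumptions, 100 TB stored as 1 GB files needs about 0.8 GB of NameNode RAM, while the same data stored as 1 MB files needs about 105 GB.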
HDFS
• In earlier versions of Hadoop/HDFS, the default block size was
often quoted as 64 MB; the current default is 128 MB.
• Thus, an HDFS file is chopped up into 128 MB chunks, and if possible,
each chunk will reside on a different DataNode
• It should be noted that Linux itself has both a logical block size (typically
4 KB) and a physical or hardware block size (typically 512 bytes).
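How a file is chopped into blocks can be shown with a short calculation. This is a minimal sketch assuming a 128 MB block size; `split_into_blocks` is a hypothetical helper, not an HDFS API.

```python
BLOCK_SIZE = 128 * 1024**2  # assumed 128 MB block size


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block, for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only covers whatever bytes remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    MB = 1024**2
    for offset, length in split_into_blocks(300 * MB):
        print(offset // MB, length // MB)
```

A 300 MB file yields two full 128 MB blocks and a final 44 MB block; the last block is not padded to the full block size.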
HDFS Read Path
HDFS Write Path
Differences between RDBMS and Hadoop/HDFS

Storing and Processing Big Data with Hadoop
