STRICTLY PRIVATE AND CONFIDENTIAL
SQL
• Insert
• Update
• Joins, Join Types, Subquery
• View
• Sort
• Aggregation
• Group by
• Order by
• Having Clause
• Where Clause
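Several of the clauses listed above can be seen working together in one small query. The following is only a sketch using Python's built-in sqlite3 module; the `orders` table, its columns, the sample rows, and the function name `top_customers` are illustrative assumptions, not part of the original slides.

```python
import sqlite3


def top_customers(min_total):
    """Return (customer, total) rows whose summed order amount exceeds min_total."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT: load a few hypothetical rows.
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 120.0), ("bob", 40.0), ("alice", 80.0),
         ("carol", 300.0), ("bob", 10.0)],
    )
    # UPDATE: add 5 to each of bob's orders.
    conn.execute(
        "UPDATE orders SET amount = amount + 5 WHERE customer = ?", ("bob",)
    )
    # WHERE filters rows, GROUP BY aggregates per customer,
    # HAVING filters the groups, ORDER BY sorts the result.
    rows = conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM orders
        WHERE amount > 0
        GROUP BY customer
        HAVING SUM(amount) > ?
        ORDER BY total DESC
        """,
        (min_total,),
    ).fetchall()
    conn.close()
    return rows
```

Note that WHERE is applied to individual rows before grouping, while HAVING is applied to the aggregated groups afterwards.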
Linux
• Permission
• Folders and file creation
• recursive search
• grep
• awk
• head
• tail arguments
• sed
• crontab
Big Data Ecosystem Overview:
Hive -> SQL engine for Hadoop -> SELECT queries on read-only data
Pig -> Scripting language for Hadoop
Spark -> In-memory execution engine
Kafka -> Message queue & streaming platform
Sqoop -> Data import & export between relational (SQL) databases and Hadoop (HDFS, Hive, HBase)
Flume -> Data ingestion from files & streams (e.g. logs) into Hadoop
HBase -> NoSQL database built on top of Hadoop => column-oriented (wide-column) NoSQL database
The Four V’s of Big Data
Volume
Variety
Velocity
Veracity
Hadoop Architecture
Hadoop is a framework for storing and processing large volumes of data across a
cluster of nodes. The Hadoop architecture allows parallel processing of data
using several components:
• Hadoop HDFS to store data across slave machines (DataNodes)
• Hadoop YARN for resource management in the Hadoop cluster
• Hadoop MapReduce to process data in a distributed fashion
• ZooKeeper (a separate Apache project, often deployed alongside Hadoop) to ensure synchronization across a cluster
HDFS
HDFS divides large files into blocks. By default, each block holds up to 128 MB of data and is replicated three times. Replica placement
follows two rules:
1. Two replicas of the same block cannot be placed on the same DataNode
2. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack
For example, suppose blocks A, B, C, and D are each replicated three times and placed across different racks. If DataNode 7 crashes, two copies of
block C remain available: one on DataNode 4 of Rack 1 and one on DataNode 9 of Rack 3.
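The two placement rules and the DataNode 7 failure scenario can be sketched in a few lines of Python. This is a toy model, not HDFS's actual placement policy: the function names, the `(rack, datanode)` tuple layout, and the assumption that DataNode 7 sits on Rack 2 are all illustrative.

```python
def placement_is_valid(replicas, rack_aware=True):
    """Check one block's replica placement against the two rules.

    replicas: list of (rack, datanode) tuples for a single block.
    """
    nodes = [node for _rack, node in replicas]
    # Rule 1: no two replicas of the same block on the same DataNode.
    if len(set(nodes)) != len(nodes):
        return False
    # Rule 2: on a rack-aware cluster, replicas must not all share one rack.
    racks = {rack for rack, _node in replicas}
    if rack_aware and len(replicas) > 1 and len(racks) == 1:
        return False
    return True


def surviving_replicas(replicas, failed_node):
    """Return the replicas that remain after one DataNode fails."""
    return [r for r in replicas if r[1] != failed_node]


# Block C from the example; the rack of DataNode 7 is an assumption.
BLOCK_C = [("rack1", "dn4"), ("rack2", "dn7"), ("rack3", "dn9")]
```

With this placement, `surviving_replicas(BLOCK_C, "dn7")` leaves the copies on `dn4` and `dn9`, matching the scenario described above.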
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks
of 128 MB, which is the default size; the last block may be smaller, and the block size can be changed in the configuration (the dfs.blocksize property).
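The block arithmetic can be worked through with a short sketch. The function name `split_into_blocks` and the 500 MB file size in the test are illustrative assumptions; only the 128 MB default comes from the text above.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB  # default block size stated above


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the list of block sizes (in bytes) for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The final block only occupies as much space as the data left over.
        sizes.append(remainder)
    return sizes
```

For a 500 MB file this yields three full 128 MB blocks plus one 116 MB block. With the default replication factor of three, the cluster stores 1500 MB of raw data for that file.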
There are three components of the Hadoop Distributed File System:
1. NameNode (a.k.a. master node): Contains metadata in RAM and on disk
2. Secondary NameNode: Contains a checkpointed copy of the NameNode’s metadata on disk
3. DataNode (slave node): Contains the actual data in the form of blocks
NameNode
• The NameNode is the master server. It holds the file system metadata: which blocks make up each file, the size of each block, which DataNodes hold them, etc. It also
executes file system namespace operations, such as opening, closing, and renaming files and directories.
Secondary NameNode
• The secondary NameNode server is responsible for maintaining a copy of the metadata on disk. It does this by periodically merging the NameNode's edit log into
its file system image (checkpointing). It is not a hot standby and does not take over automatically, but its checkpoint can help rebuild a NameNode in case of failure.
• In a high-availability cluster, there are two NameNodes: active and standby. The standby NameNode performs the checkpointing, so a secondary NameNode is not
needed there.
DataNodes
• DataNodes store and maintain the blocks. While there is only one active NameNode, there can be many DataNodes, which serve block read and write requests and
create, delete, or replicate blocks when instructed by the NameNode. DataNodes send heartbeats to the NameNode every few seconds (3 seconds by default) and periodically
send block reports listing the blocks they hold; in this way, the NameNode keeps track of which DataNodes are alive and where each block is stored.
HDFS read and write operations run in parallel across blocks. To read or write a file in HDFS, a client must first contact the NameNode. The NameNode checks
the privileges of the client and, if they are sufficient, tells it which DataNodes to use; the client then reads or writes the data blocks directly on those DataNodes.
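The division of labour described above (the NameNode holds metadata and checks permissions, while DataNodes report which blocks they hold) can be illustrated with a toy in-memory model. The class `ToyNameNode`, its method names, and the simple per-file reader set are hypothetical; real HDFS uses RPC protocols and a richer permission model.

```python
class ToyNameNode:
    """Toy metadata server: knows which blocks form a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names
        self.readers = {}          # path -> set of users allowed to read

    def add_file(self, path, block_ids, readers):
        self.file_blocks[path] = list(block_ids)
        self.readers[path] = set(readers)

    def block_report(self, datanode, block_ids):
        """Replace everything known about one DataNode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def open_for_read(self, user, path):
        """Check privileges, then return [(block id, [DataNodes]), ...]."""
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        # The client would now fetch each block directly from a listed DataNode.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]
```

The NameNode in this sketch never touches file contents; it only answers "which blocks, on which DataNodes", which is why the actual data transfer can proceed in parallel.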
YARN (Yet Another Resource Negotiator)
• YARN is the middle layer between HDFS and MapReduce in the Hadoop architecture. It is the framework on which MapReduce runs.
• YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a
big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be
maximized. The job scheduler also keeps track of which jobs are important, which have higher priority, the dependencies between
jobs, and other information such as job timing. The Resource Manager manages all the resources that are
made available for running a Hadoop cluster.
Features of YARN

• Multi-Tenancy
• Scalability
• Cluster-Utilization
• Compatibility
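The idea of scheduling jobs by priority onto nodes with limited resources can be sketched as follows. This is a deliberately simplified toy, not YARN's actual scheduler: the function name `schedule`, the `(job, priority, memory_mb)` request shape, and the first-fit strategy are all assumptions made for illustration.

```python
def schedule(requests, node_memory_mb):
    """Assign jobs to nodes by priority using first-fit on free memory.

    requests: list of (job, priority, memory_mb); higher priority goes first.
    node_memory_mb: dict mapping node name -> free memory in MB.
    Returns (assignments, pending): a job -> node dict and a list of unplaced jobs.
    """
    free = dict(node_memory_mb)  # copy so the caller's dict is not modified
    assignments = {}
    pending = []
    # sorted() is stable, so jobs with equal priority keep their input order.
    for job, _priority, memory_mb in sorted(requests, key=lambda r: -r[1]):
        node = next((n for n in sorted(free) if free[n] >= memory_mb), None)
        if node is None:
            pending.append(job)  # no node has enough free memory right now
        else:
            free[node] -= memory_mb
            assignments[job] = node
    return assignments, pending
```

With three requests and two nodes, a low-priority job can end up pending even though the cluster as a whole has enough free memory, because no single node can hold it.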
MapReduce
MapReduce is a framework for distributed and parallel processing of large volumes of data. Jobs can be written in a number of programming languages, and each job has
two main phases: the Map phase and the Reduce phase.
Map Phase
• The map phase reads the input, which is stored in HDFS as blocks. Each record is read, processed, and emitted as key-value pairs. A map task runs the map function
on one input split, and a job runs one or many map tasks.
Reduce Phase
• The reduce phase receives the key-value pairs from the map phase. The values for each key are then aggregated into a smaller set and an output is produced.
Shuffling and sorting take place between the two phases, before the reduce function runs.
• The mapper function handles the input data and runs on every input split (each such run is a map task). There can be one or multiple map tasks based
on the size of the file and the configuration setup. Data is then sorted, shuffled, and moved to the reduce phase, where a reduce function aggregates the
data and provides the output.
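The map, shuffle/sort, and reduce steps can be mimicked on a single machine with the classic word-count example. This is a minimal sketch in plain Python, not Hadoop API code; the function names and the representation of an input split as a list of text lines are assumptions.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle/sort: group all values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key into one total."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(splits):
    """Run one map task per split, then shuffle, sort, and reduce."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_and_sort(pairs))
```

In a real cluster each split would be mapped on a different node in parallel; here the list comprehension simply runs the map tasks one after another.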
MapReduce Job Execution
• The input data is stored in HDFS and read using an input format.
• The file is split into multiple chunks (input splits) based on the size of the file and
the input format.
• The default split size equals the HDFS block size (128 MB) but can be customized.
• The record reader reads the data from the input splits and forwards
the records to the mapper.
• The mapper turns the records in every split into a list of data
elements (key-value pairs).
• The combiner (optional) works on the intermediate data created by the map
tasks and acts as a mini reducer to reduce the data sent over the network.
• The partitioner decides which reduce task receives each key; the number of
reduce tasks is set in the job configuration.
• The data is then shuffled and sorted by key
and sent to the reduce function.
• The reducer's output is written to HDFS using the output format
configured for the job.
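The roles of the combiner and the partitioner in the steps above can be sketched in Python. This is an illustration under stated assumptions, not Hadoop code: the function names are hypothetical, and `zlib.crc32` stands in for the key hash so that results are repeatable across runs (Hadoop's default hash partitioner follows the same "hash of the key modulo the number of reduce tasks" idea).

```python
import zlib
from collections import Counter


def combine(map_output):
    """Combiner: pre-aggregate one map task's (key, count) pairs locally."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick the reduce task for a key (hash modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Send every (key, value) pair to the bucket of its reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the partition depends only on the key, every pair with the same key reaches the same reduce task, which is what lets each reducer produce a complete total for its keys.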

Big Data Reverse Knowledge Transfer.pptx

