BIG DATA – HADOOP
Governance Team
6 Dec 16
• Big Data Fundamentals
• Hadoop and Components
• Q&A
Today’s Overview
Agenda – Big Data Fundamental
• What is Big Data ?
• Basic Characteristics
of Big Data
• Sources of Big Data
• V’s of Big Data
• Processing Of Data
– Traditional Approach
VS Big Data Approach
What is Big Data
What is Big Data (cont'd)
• Big Data is a collection of data sets too
large to be processed using traditional
approaches. It includes the following:
– Structured data: traditional data
– Semi-structured data: XML
– Unstructured data: images, PDFs, media, etc.
Various V’s- Big Data
Processing - Data
• Traditional Approach
• Big Data Approach
Hadoop Fundamental
• What is Hadoop ?
• Key Characteristics
• Components
• HDFS
• MapReduce
• Yarn
• Benefits of Hadoop
What is Hadoop
• Hadoop is an open-source software
framework for storing large amounts of data
and processing/querying those data on a
cluster with multiple nodes of commodity
hardware (i.e. low cost hardware).
Key Characteristics -Hadoop
• Reliable
• Flexible
• Scalable
• Economical
Components
• Common Libraries
• High-Volume Distributed Data Storage
System – HDFS
• High-Volume Distributed Data Processing
Framework – MapReduce
• Resource Management and Job Scheduling – YARN
– HDFS
• What is HDFS?
• Architecture
• Components
• Basic Features
What is HDFS ?
HDFS holds very large amounts of data and
provides easy access to it. To store such huge
volumes, files are spread across multiple machines.
The files are stored redundantly to
protect the system from data loss in
case of failure. HDFS also makes the data
available to applications for parallel processing.
Components- HDFS
• Master/slave architecture
• An HDFS cluster consists of a single NameNode, a
master server that manages the file system
namespace and regulates access to files by
clients.
• There are a number of DataNodes, usually one
per node in the cluster.
• The DataNodes manage the storage attached to the
nodes that they run on.
Components -HDFS
HDFS exposes a file system namespace and
allows user data to be stored in files.
A file is split into one or more blocks, and these
blocks are stored on DataNodes.
DataNodes serve read and write requests and
perform block creation, deletion, and
replication upon instruction from the NameNode.
Features
• Highly fault-tolerant
• High throughput
• Distributed storage suitable for large
amounts of data
• Streaming access to file system data
• Can be built out of commodity hardware
MapReduce
• What is MapReduce
• Tasks /Components
• Basic Features
• Demo
What is MapReduce
• It is a framework used mainly to process
large amounts of data in parallel on large
clusters of commodity hardware
• It is based on the divide-and-conquer principle and
provides built-in fault tolerance and
redundancy
• It is a batch-oriented parallel processing engine
for large volumes of data
MapReduce
– Map stage : The map or mapper’s job is to process
the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the
mapper function line by line. The mapper processes
the data and creates several small chunks of data.
– Reduce stage : This stage is the combination of the
Shuffle stage and the Reduce stage. The Reducer’s
job is to process the data that comes from the
mapper. After processing, it produces a new set of
output, which will be stored in the HDFS.
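The two stages above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example, and the input is a list of strings rather than a file in HDFS.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines):
    """Run the three phases one after another on an in-memory input."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input standing in for lines of a file in HDFS.
    print(word_count(["big data hadoop", "hadoop stores big data"]))
```

In a real cluster the map calls run on many nodes at once and the grouped data travels over the network; the sketch only shows the order of the steps.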
Stages of Each Task
• A Map task has the following stages
– Map
– Combine
– Partition
• A Reduce task has the following stages
– Shuffle and Sort
– Reduce
Demo
• Refer to the PDF attachment
• Mainly reads text and counts the number
of words
– YARN
• What is YARN?
• Architecture and
Components
YARN
• YARN (Yet Another Resource Negotiator): A
framework for job scheduling and cluster
resource management
– Hive
• What is Hive?
• Architecture of Hive
• Flow in Hive
• Data Types
• Sample Query
• Not Hive
• Demo
What is Hive
• It is a data warehouse infrastructure tool for
processing structured data on the Hadoop
platform
• It was originally developed by Facebook and later
moved under the Apache umbrella
• When large volumes of data are retrieved from
multiple sources and an RDBMS is
not a good fit, we move to
Hive.
What is Hive
• It is a query engine wrapper on top of Hadoop
for performing OLAP
• Provides HiveQL, which is similar to SQL
• Targeted at users/developers with an SQL
background
• It stores the schema in a database and processes the
data in HDFS
• Data is stored in HDFS/HBase, and every table
should reference a file on HDFS/HBase
Architecture - Hive
• Components
– User Interface: infrastructure tool for interaction
between the user and HDFS/HBase
– Metastore: stores metadata such as schemas
and tables
– SerDe: libraries used to serialize/deserialize
custom data formats; they read and write the rows from/to
the tables
– Query Processor -
Architecture -Hive
Data Type
• Integral Type
– TinyInt, SmallInt, Int, BigInt
• Float Type
– Double, Decimal
• String Type
– Char, Varchar
• Misc Type
– Boolean, Binary
• Date/Time Type
– Timestamp, Date
• Complex Type
– Struct, Map, Array
Sample Query
• Create Table
• Drop Table
• Alter Table
• Rename Table: changes the table name
• Load Data / Insert
• Create View
• Select
Operator and Built in Function
• Arithmetic Operator
• Relational Operator
• Logical Operator
• Aggregate and Built in Function
• Supports Index/Order/Join
Disadvantages of HIVE
• Not suited to real-time queries
• ACID transactions are supported only from version 0.14
onwards
• Poor performance: queries take longer
because Hive
generates and runs a MapReduce or Spark
program internally each time it processes
record sets
Disadvantages of HIVE
• It can process only large volumes of
structured data, not other categories of data
Hive Interface Option
• CLI
• HUE (Hadoop User Experience): www.gethue.com
• JDBC/ODBC - JAVA
QUESTIONS?
APPENDIX
CAP
• CAP Theorem
– Consistency
• A read from any node always returns the same, consistent data
– Availability
• Every read/write request is acknowledged with either success or failure
– Partition Tolerance
• The system can tolerate a communication outage that splits the cluster
into multiple silos/data sets
A distributed data system can provide only two of
the above properties at the same time
Distributed data storage is designed around this theorem
ACID
• ACID
– Atomicity
– Consistency
– Isolation
– Durability
BASE
• BASE
– Basic availability
– Soft state
– Eventual consistency
These properties are mainly used in
distributed databases for non-transactional data
SCV
• SCV
– Speed
– Consistency
– Volume
High-volume data processing is based on the
above principle
Data processing can satisfy at most two of
the above properties
Sharding
• Sharding
It is the process of horizontally partitioning a large
volume of data into smaller, more
manageable data sets
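As a rough illustration of horizontal partitioning, the sketch below splits rows into shards by key range. The boundary values, the field name "id" and the function names are assumptions made for this example; real systems may shard by hash or by other rules.

```python
import bisect


def shard_for(key, boundaries):
    """Return the shard index for a key.

    `boundaries` is a sorted list of split points: keys below the first
    split point go to shard 0, keys from the first up to the second go
    to shard 1, and so on.
    """
    return bisect.bisect_right(boundaries, key)


def shard_rows(rows, boundaries, key_field):
    """Distribute rows (dicts) over len(boundaries) + 1 shards."""
    shards = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        shards[shard_for(row[key_field], boundaries)].append(row)
    return shards


if __name__ == "__main__":
    # Hypothetical rows and split points, chosen only for illustration.
    rows = [{"id": 5}, {"id": 150}, {"id": 1000}, {"id": 100}]
    for index, shard in enumerate(shard_rows(rows, [100, 500], "id")):
        print(index, shard)
```

Every row lands in exactly one shard, so each shard holds a smaller, independent slice of the same table.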
Replication
• Replication
Stores multiple copies of the data set, known as
replicas
Provides high availability, scalability and
fault tolerance, since the data is stored on multiple nodes
Replication can be implemented in the following ways:
Master-slave
Peer-to-peer
HDFS
HDFS-
HDFS Commands
• https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
HDFS
• Blocks
– In HDFS, a file is split into smaller segments that
are used to store the data. Each segment is called a
block
– The default block size is 64 MB in Hadoop 1.x and
128 MB in Hadoop 2.x; it can be changed in the
HDFS configuration
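To make the effect of the block size concrete, here is a small sketch that works out how a file of a given size is cut into blocks. The function name and the 300 MB sample file are assumptions for illustration; the 64 MB and 128 MB figures are the block sizes mentioned on the slide.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the list of block sizes (in bytes) a file is cut into.

    All blocks are full-sized except possibly the last one, which holds
    only the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    # Hypothetical 300 MB file, compared for the two block sizes.
    for size in (64 * MB, 128 * MB):
        blocks = split_into_blocks(300 * MB, size)
        print(size // MB, "MB blocks:", [b // MB for b in blocks])
```

A larger block size means fewer blocks per file, and therefore less block metadata for the NameNode to track.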
Types of Input Format - MR
• TextInputFormat (default)
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
Reader and Writer
• RecordReader
– Reads records from the file line by line; each line
in the file is treated as a record
– Runs before the Mapper function
• RecordWriter
– Writes the content to a file as output
– Runs after the Reducer
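The reader side can be pictured with the sketch below, which turns raw bytes into one record per line. Pairing each line with its starting byte offset mirrors what the default text input format hands to the mapper; the function name read_records is made up for this example and is not a Hadoop API.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one record per input line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset where the line starts; the value is the
        # line itself without its trailing line terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


if __name__ == "__main__":
    # Hypothetical file content used only for illustration.
    for key, value in read_records(b"big data\nhadoop\n"):
        print(key, value)
```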
Reducer
• IdentityReducer: passes its input through to the
output unchanged and performs no reduction of its own
• Custom Reducer: user-defined reduce logic applied
after the shuffle and sort
Box Classes in MR
• Equivalent to wrapper classes in Java
• IntWritable
• FloatWritable
• LongWritable
• DoubleWritable
• Text
• Mainly used for (K,V) pairs in MR
Schema on Read/Write
• Hadoop –Schema on Read approach
• RDBMS – Schema on Write approach
Key Steps in Big Data Solution
• Ingesting Data
• Storing Data
• Processing Data
HDFS
Hadoop Tools
• 15+ frameworks & tools like Sqoop, Flume,
Kafka, Pig, Hive, Spark, Impala, etc to ingest data
into HDFS, store and process data within HDFS,
and to query data from HDFS for business
intelligence & analytics. Some tools like Pig &
Hive are abstraction layers on top of
MapReduce, whilst the other tools like Spark &
Impala are improved architecture/design from
MapReduce for much improved latencies to
support near real-time (i.e. NRT) & real-time
processing.
NRT
• Near Real time –
– Near real-time processing is when speed is
important, but processing time in minutes is
acceptable in lieu of seconds
Heartbeat - HDFS
• A heartbeat is a signal sent
between a DataNode and the NameNode, and
between a TaskTracker and the JobTracker
MapReduce – Partition
• All the values for a single key go from the mappers to the
same reducer, which also helps
distribute the map output evenly over
the reducers
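A partitioner can be sketched as a hash of the key taken modulo the number of reducers, which is the general idea behind hash partitioning. In the sketch below, zlib.crc32 is only a stand-in for the key's hash function, and the names partition and route are assumptions made for this example.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(map_output, num_reducers):
    """Group (key, value) pairs by the reducer that will receive them."""
    per_reducer = {index: [] for index in range(num_reducers)}
    for key, value in map_output:
        per_reducer[partition(key, num_reducers)].append((key, value))
    return per_reducer


if __name__ == "__main__":
    # Hypothetical map output; the same key always lands on one reducer.
    pairs = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
    print(route(pairs, 3))
```

Because the index depends only on the key, every occurrence of a key reaches the same reducer no matter which mapper produced it.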
HDFS VS NAS(Network Attached
Storage)
• HDFS data blocks are distributed across the local
drives of all machines in a cluster
• NAS data is stored on dedicated hardware
• In HDFS there is data redundancy because of
the replication protocol
• In NAS there is no such data
redundancy
Commodity Hardware
• Commodity hardware refers to inexpensive
systems that do not offer high availability or
high quality. Commodity hardware still includes
RAM, because specific services
need to run in memory
Port Number
• NameNode 50070
• Job Tracker 50030
• Task Tracker 50060
Combine-MapReduce
• A “Combiner” is a mini “reducer” that
performs the local “reduce” task. It receives
the input from the “mapper” on a particular
“node” and sends the output to the
“reducer”. “Combiners” help in enhancing
the efficiency of “MapReduce” by reducing
the quantum of data that is required to be
sent to the “reducers”.
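The saving a combiner brings can be shown with a word-count sketch: each node sums its own counts before anything is sent on. The node inputs and function names below are hypothetical, and the network transfer is only represented by counting pairs.

```python
from collections import Counter


def map_words(lines):
    """Mapper: emit (word, 1) for every word on this node."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: local reduce on one node, summing counts per word."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def reduce_counts(pairs):
    """Reducer: sum the (possibly pre-combined) counts from all nodes."""
    total = Counter()
    for key, value in pairs:
        total[key] += value
    return dict(total)


if __name__ == "__main__":
    # Hypothetical input held by two different nodes.
    node_a = map_words(["to be or not to be"])
    node_b = map_words(["to do"])
    sent_without = node_a + node_b
    sent_with = combine(node_a) + combine(node_b)
    print(len(sent_without), "pairs without a combiner")
    print(len(sent_with), "pairs with a combiner")
    print(reduce_counts(sent_with))
```

The final counts are the same either way; only the number of pairs handed to the reducer changes. This works here because addition can be applied in stages without changing the result.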
MapReduce Programs
• Driver: the class containing the main method, which is
invoked by the scheduler
• Mapper
• Reducer
JobTracker –Functionality
– Client applications submit MapReduce jobs to the JobTracker. The
JobTracker talks to the NameNode to determine the location of the data.
– The JobTracker locates TaskTracker nodes with available slots at or near the
data
– The JobTracker submits the work to the chosen TaskTracker nodes.
– The TaskTracker nodes are monitored. If they do not submit heartbeat
signals often enough, they are deemed to have failed and the work is
scheduled on a different TaskTracker.
– When the work is completed, the JobTracker updates its status.
– Client applications can poll the JobTracker for information.
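The monitoring step above can be sketched as a timeout check on the last heartbeat seen from each TaskTracker, followed by moving work off the ones that went quiet. The timeout value, tracker names and function names are assumptions for illustration and are not Hadoop defaults or APIs.

```python
def find_failed_trackers(last_heartbeat, now, timeout):
    """Return trackers whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        tracker
        for tracker, seen in last_heartbeat.items()
        if now - seen > timeout
    )


def reschedule(assignments, failed, healthy):
    """Move tasks from failed trackers onto healthy ones, round-robin."""
    if not healthy:
        raise ValueError("no healthy tracker available")
    updated = {}
    next_slot = 0
    for task, tracker in assignments.items():
        if tracker in failed:
            updated[task] = healthy[next_slot % len(healthy)]
            next_slot += 1
        else:
            updated[task] = tracker
    return updated


if __name__ == "__main__":
    # Hypothetical timestamps (seconds) and a made-up 30-second timeout.
    last_seen = {"tt1": 100.0, "tt2": 40.0, "tt3": 95.0}
    failed = find_failed_trackers(last_seen, now=110.0, timeout=30.0)
    healthy = [t for t in sorted(last_seen) if t not in failed]
    tasks = {"map-0": "tt1", "map-1": "tt2", "map-2": "tt2"}
    print(failed, reschedule(tasks, failed, healthy))
```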
DW –Data Warehouse
• A database designed specifically for analysis and
reporting
Hive Support File Format
• Text File (plain raw data)
• Sequence File (key-value pairs)
• RCFile (Record Columnar File, which
stores the columns of a table in a columnar
layout)
NameNode vs Metastore
• NameNode: stores the metadata
about the files in Hadoop
• Metastore: stores the metadata
about the tables/databases in Hive
Tez- Hive
• Executes complex directed acyclic graphs of
general data processing tasks
• It generally performs better than MapReduce
Bucketing -Hive
• Bucketing provides mechanism to query and
examine random samples of data.
• Bucketing offers capability to execute queries
on a sub-set of random data
Reference -Hive
• http://dl.farinsoft.com/files/94/Ultimate-Guide-Programming-Apache-Hive-ebook.pdf
WHAT NEXT ………….
TOOLS OF HADOOP
PIG/SQOOP/HBASE
/HIVE/SPARK…….

AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

Big data Hadoop

  • 1. BIG DATA – HADOOP Governance Team 6 Dec 16
  • 2. Today’s Overview • Big Data Fundamentals • Hadoop and Components • Q&A
  • 3. Agenda – Big Data Fundamentals • What is Big Data? • Basic Characteristics of Big Data • Sources of Big Data • V’s of Big Data • Processing of Data – Traditional Approach vs. Big Data Approach
  • 4. What is Big Data
  • 5. What is Big Data (cont’d) • Big Data is a collection of data sets too large to be processed using traditional approaches. It includes the following: – Structured Data: traditional data – Semi-Structured Data: XML – Unstructured Data: images, PDFs, media, etc.
  • 7. Processing - Data • Traditional Approach • Big Data Approach
  • 8. Hadoop Fundamentals • What is Hadoop? • Key Characteristics • Components • HDFS • MapReduce • YARN • Benefits of Hadoop
  • 9.
  • 10. What is Hadoop • Hadoop is an open-source software framework for storing large amounts of data and processing/querying that data on a cluster with multiple nodes of commodity hardware (i.e. low-cost hardware).
  • 11. Key Characteristics – Hadoop • Reliable • Flexible • Scalable • Economical
  • 12. Components • Common Libraries • High-Volume Distributed Data Storage System – HDFS • High-Volume Distributed Data Processing Framework – MapReduce • Resource Management and Job Scheduling – YARN
  • 13.
  • 14. – HDFS • What is HDFS? • Architecture • Components • Basic Features
  • 15. What is HDFS? HDFS holds very large amounts of data and provides easy access to it. To store such huge volumes, files are spread across multiple machines and stored redundantly to protect the system against data loss in case of failure. HDFS also makes the data available to applications for parallel processing.
  • 16.
  • 17. Components – HDFS • Master/slave architecture • An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. • There are a number of DataNodes, usually one per node in the cluster. • The DataNodes manage storage attached to the nodes that they run on.
  • 18. Components – HDFS • HDFS exposes a file system namespace and allows user data to be stored in files. • A file is split into one or more blocks, and these blocks are stored on DataNodes. • DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
  • 19. Features • Highly fault-tolerant • High throughput • Distributed storage suitable for large amounts of data • Streaming access to file system data • Can be built out of commodity hardware
  • 20. MapReduce • What is MapReduce • Tasks/Components • Basic Features • Demo
  • 21. What is MapReduce • It is a framework used mainly to process large amounts of data in parallel on large clusters of commodity hardware • It is based on the divide-and-conquer principle and provides built-in fault tolerance and redundancy • It is a batch-oriented parallel processing engine for large volumes of data
  • 22. MapReduce – Map stage: The mapper’s job is to process the input data. Generally the input data is a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data (intermediate key-value pairs). – Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
  • 23. Stages of Each Task • A Map task has the following stages – Map – Combine – Partition • A Reduce task has the following stages – Shuffle and Sort – Reduce
  • 24. Demo • Refer to the PDF attachment • The demo reads text and counts the number of words
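The map, shuffle, and reduce stages described on the previous slides can be sketched in a few lines of plain Python. This is not the code from the PDF attachment and does not use the Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and only illustrate how a word count might flow through the stages on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the mapper and reducer calls would run on different nodes and the shuffle would move data over the network; here everything runs in one process to keep the data flow visible.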
  • 25. – YARN • What is YARN? • Architecture and Components
  • 26. YARN • YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management
  • 27.
  • 28. – Hive • What is Hive? • Architecture of Hive • Flow in Hive • Data Types • Sample Query • Not Hive • Demo
  • 29. What is Hive • It is a data warehouse infrastructure tool for processing structured data on the Hadoop platform • It was originally developed by Facebook and later moved under the Apache umbrella • When large volumes of data are retrieved from multiple sources and an RDBMS is not a good fit, we move to Hive.
  • 30. What is Hive • It is a query engine wrapper on top of Hadoop for performing OLAP • Provides HiveQL, which is similar to SQL • Targeted at users/developers with an SQL background • It stores the schema in a database and processes the data in HDFS • Data is stored in HDFS/HBase, and every table should reference files on HDFS/HBase
  • 31. Architecture – Hive • Components – User Interface: infrastructure tool used for interaction between the user and HDFS/HBase – Meta Store: stores schemas, tables, etc.; mainly used to store metadata – SerDe: libraries used to serialize/deserialize custom data formats; reads and writes the rows from/to the tables – Query Processor -
  • 33. Data Types • Integral Types – TinyInt, SmallInt, Int, BigInt • Float Types – Double, Decimal • String Types – Char, Varchar • Misc Types – Boolean, Binary • Timestamp, Date, Decimal • Complex Types – Struct, Map, Array
  • 34. Sample Query • Create Table • Drop Table • Alter Table • Rename Table – renames the table • Load Data – Insert • Create View • Select
  • 35. Operators and Built-in Functions • Arithmetic Operators • Relational Operators • Logical Operators • Aggregate and Built-in Functions • Supports Index/Order/Join
  • 36. Disadvantages of Hive • Not for real-time queries • Supports ACID only from version 0.14 onwards • Poor performance – queries take longer because Hive generates and runs a MapReduce or Spark program internally each time it processes a record set
  • 37. Disadvantages of Hive • It can process only large volumes of structured data, not other categories
  • 38. Hive Interface Options • CLI • HUE (Hadoop User Experience) – www.gethue.com • JDBC/ODBC – Java
  • 41. CAP • CAP Theorem – Consistency • A read from any node always returns consistent data – Availability • Every read/write is always acknowledged with either success or failure – Partition Tolerance • The system can tolerate communication outages that split the cluster into multiple silos/data sets • A distributed data system can provide only two of the above properties • Distributed data storage is based on this theorem
  • 42. ACID • ACID – Atomicity – Consistency – Isolation – Durability
  • 43. BASE • BASE – Basically available – Soft state – Eventual consistency • These properties are mainly used in distributed databases for non-transactional data
  • 44. SCV • SCV – Speed – Consistency – Volume • High-volume data processing is based on the above principle • Data processing can satisfy at most two of the above properties
  • 45. Sharding • Sharding is the process of horizontally partitioning a large volume of data into smaller, more manageable data sets
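Horizontal partitioning as described on the Sharding slide can be illustrated with a range-based scheme, where each shard owns a contiguous range of keys. The slide does not say which scheme is meant, so range sharding is an assumption here, and the names `shard_index`, `shard_rows`, and the boundary values are hypothetical.

```python
import bisect


def shard_index(key, boundaries):
    """Return the shard that owns `key`.

    `boundaries` is a sorted list of split points: keys below the first
    boundary go to shard 0, keys from the first boundary up to (but not
    including) the second go to shard 1, and so on.
    """
    return bisect.bisect_right(boundaries, key)


def shard_rows(rows, boundaries):
    """Distribute (key, value) rows over len(boundaries) + 1 shards."""
    shards = [[] for _ in range(len(boundaries) + 1)]
    for key, value in rows:
        shards[shard_index(key, boundaries)].append((key, value))
    return shards


if __name__ == "__main__":
    rows = [(5, "a"), (1500, "b"), (2000, "c"), (999, "d")]
    for number, shard in enumerate(shard_rows(rows, [1000, 2000])):
        print(number, shard)
```

Every row lands in exactly one shard, so each shard holds a smaller data set with the same columns as the original.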
  • 46. Replication • Replication stores multiple copies of a data set, known as replicas • Provides high availability, scalability, and fault tolerance, since the data is stored on multiple nodes • Replication is implemented in the following ways: – Master-slave – Peer-to-peer
  • 47. HDFS
  • 48. HDFS-
  • 50. HDFS • Blocks – In HDFS, a file is split into small segments that are used to store the data; each segment is called a block – The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x (the advisable setting); the size can be changed in the HDFS configuration
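The block arithmetic on the slide above can be made concrete with a small calculation. This is only a sketch of the splitting rule, not HDFS code; the function name `split_into_blocks` and the 300 MB example file are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes in bytes of the blocks needed to store a file.

    Every block is full except possibly the last one, which only holds
    the remaining bytes.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    for size in (64 * MB, 128 * MB):
        blocks = split_into_blocks(300 * MB, size)
        print(size // MB, "MB blocks:", [b // MB for b in blocks])
```

With these inputs, a 300 MB file needs five blocks at 64 MB and three blocks at 128 MB, which is why larger blocks reduce the number of entries the NameNode has to track.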
  • 51. Types of Input Format – MR • TextInputFormat (default) • KeyValueTextInputFormat • SequenceFileInputFormat • SequenceFileAsTextInputFormat
  • 52. Reader and Writer • RecordReader – Reads records from the file line by line; each line in the file is treated as a record – Runs before the Mapper function • RecordWriter – Writes content to a file as output – Runs after the Reducer
  • 53. Reducer • IdentityReducer – passes its input key-value pairs straight to the output without performing any reduction • Custom Reducer – user-defined reduce logic applied to the shuffled and sorted input
  • 54. Box Classes in MR • Equivalent to the wrapper classes in Java • IntWritable • FloatWritable • LongWritable • DoubleWritable • Text • Mainly used for (K, V) pairs in MR
  • 55. Schema on Read/Write • Hadoop – Schema on Read approach • RDBMS – Schema on Write approach
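The schema-on-read idea can be sketched as follows: the raw text is stored untouched, and column names and types are applied only when the data is read. The sample data, the `read_with_schema` function, and the decision to turn values that do not fit the declared type into `None` are all assumptions made for this illustration, not Hadoop or Hive behavior taken from the slides.

```python
import csv
import io

# Raw data is stored as-is; nothing validates it at write time.
RAW = "1,alice,30\n2,bob,not-a-number\n"


def read_with_schema(raw_text, schema):
    """Apply a schema (list of (column name, converter)) while reading.

    A value that cannot be converted becomes None instead of failing the
    whole read (a choice made for this sketch).
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, convert), text in zip(schema, fields):
            try:
                row[name] = convert(text)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


if __name__ == "__main__":
    schema = [("id", int), ("name", str), ("age", int)]
    for row in read_with_schema(RAW, schema):
        print(row)
```

A schema-on-write system would instead reject the second row at load time, before it ever reached storage.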
  • 56. Key Steps in Big Data Solution • Ingesting Data • Storing Data • Processing Data
  • 57. HDFS
  • 58. Hadoop Tools • 15+ frameworks and tools such as Sqoop, Flume, Kafka, Pig, Hive, Spark, and Impala are used to ingest data into HDFS, to store and process data within HDFS, and to query data from HDFS for business intelligence and analytics. Some tools, such as Pig and Hive, are abstraction layers on top of MapReduce, whilst others, such as Spark and Impala, improve on the MapReduce architecture/design to achieve much lower latencies and support near-real-time (NRT) and real-time processing.
  • 59. NRT • Near real time – Near-real-time processing applies when speed is important but processing time in minutes is acceptable in lieu of seconds
  • 60. Heartbeat – HDFS • A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker
  • 61. MapReduce – Partition • All the values for a single key go from the mappers to the same reducer, which also helps distribute the map output evenly over the reducers
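The partitioning rule above, where every value for a key reaches the same reducer, is commonly achieved by hashing the key and taking the result modulo the number of reducers. The sketch below imitates that idea in plain Python; it uses `zlib.crc32` as a stand-in hash (an assumption, chosen because it is stable across runs), and the names `partition` and `route` are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # Mask to a non-negative 31-bit value before taking the modulo.
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def route(pairs, num_reducers):
    """Send each (key, value) pair to the reducer chosen by partition()."""
    buckets = {reducer: [] for reducer in range(num_reducers)}
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    pairs = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
    for reducer, bucket in route(pairs, 3).items():
        print(reducer, bucket)
```

Because the reducer index depends only on the key, the two `("big", 1)` pairs always end up in the same bucket regardless of which mapper produced them.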
  • 62. HDFS vs. NAS (Network Attached Storage) • In HDFS, data blocks are distributed across the local drives of all machines in a cluster • In NAS, data is stored on dedicated hardware • In HDFS, there is data redundancy because of the replication protocol • In NAS, there is no data redundancy
  • 63. Commodity Hardware • Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM, because there are specific services that need to be executed in RAM
  • 64. Port Number • NameNode 50070 • Job Tracker 50030 • Task Tracker 50060
  • 65. Combine – MapReduce • A combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners improve the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.
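The saving a combiner provides can be seen by counting how many pairs leave a node with and without local aggregation. The following is a minimal single-process sketch with hypothetical names (`map_words`, `combine`); it works because summing counts is associative and commutative, so partial sums per node give the same final result.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the values per key locally, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


if __name__ == "__main__":
    node_output = map_words("a b a a b")
    combined = combine(node_output)
    print("pairs without combiner:", len(node_output))
    print("pairs with combiner:", len(combined), combined)
```

For this input the mapper emits five pairs, while the combiner forwards only two, one per distinct word.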
  • 66. MapReduce Programs • Driver – the class with the main method, which is invoked by the scheduler • Mapper • Reducer
  • 67. JobTracker – Functionality – Client applications submit MapReduce jobs to the JobTracker. – The JobTracker talks to the NameNode to determine the location of the data. – The JobTracker locates TaskTracker nodes with available slots at or near the data. – The JobTracker submits the work to the chosen TaskTracker nodes. – The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. – When the work is completed, the JobTracker updates its status. – Client applications can poll the JobTracker for information.
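The heartbeat monitoring and rescheduling steps listed above can be sketched as two small functions. This is not JobTracker code: the names `find_failed` and `reschedule`, the timeout value, and the round-robin reassignment are assumptions made only to illustrate the logic.

```python
def find_failed(last_heartbeat, now, timeout):
    """Return trackers whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        tracker
        for tracker, seen_at in last_heartbeat.items()
        if now - seen_at > timeout
    )


def reschedule(assignments, failed, healthy):
    """Move tasks from failed trackers to healthy ones, round-robin."""
    if not healthy:
        raise RuntimeError("no healthy tracker available")
    updated = dict(assignments)
    next_slot = 0
    for task in sorted(assignments):
        if assignments[task] in failed:
            updated[task] = healthy[next_slot % len(healthy)]
            next_slot += 1
    return updated


if __name__ == "__main__":
    heartbeats = {"tt1": 100.0, "tt2": 70.0, "tt3": 98.0}
    failed = find_failed(heartbeats, now=105.0, timeout=30.0)
    tasks = {"map-0": "tt1", "map-1": "tt2", "map-2": "tt2"}
    print(failed, reschedule(tasks, failed, ["tt1", "tt3"]))
```

A real scheduler would also prefer trackers at or near the data, as the slide notes; this sketch ignores data locality to keep the failure handling in focus.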
  • 68. DW – Data Warehouse • A database dedicated to analysis and reporting
  • 69. Hive Supported File Formats • Text File (plain raw data) • Sequence File (key-value pairs) • RCFile (Record Columnar File, which stores the columns of a table in a columnar layout)
  • 70. NameNode vs. Metastore • NameNode – stores the metadata about the files in Hadoop • Metastore – stores the metadata about the tables/databases in Hive
  • 71. Tez – Hive • Executes complex directed acyclic graphs of general data processing tasks • It performs better than MapReduce
  • 72. Bucketing – Hive • Bucketing provides a mechanism to query and examine random samples of data. • Bucketing offers the capability to execute queries on a random subset of the data
  • 74. WHAT NEXT …………. TOOLS OF HADOOP PIG/SQOOP/HBASE /HIVE/SPARK…….