Database Management Systems
Unit – VI
Introduction to Big Data, HADOOP: HDFS, MapReduce
Prof. Deptii Chaudhari,
Assistant Professor
Department of Computer Engineering
Hope Foundation’s
International Institute of Information Technology, I²IT
What is Big Data?
•Big data is a collection of large datasets that cannot be
processed using traditional computing techniques.
•Big data is not merely data; it has become a complete
subject, which involves various tools, techniques
and frameworks.
•Big data includes the data produced by different devices
and applications.
Deptii Chaudhari, Dept. of Computer Engineering, Hope Foundation’s International Institute of Information Technology, I²IT P-14,Rajiv Gandhi Infotech Park
MIDC Phase 1, Hinjawadi, Pune – 411057 Tel - +91 20 22933441/2/3 | www.isquareit.edu.in | info@isquareit.edu.in
• Social Media Data : Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
• Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions that customers make on the shares of different companies.
• Power Grid Data : The power grid data holds information about the power consumed by a
particular node with respect to a base station.
• Search Engine Data : Search engines retrieve lots of data from different databases.
• Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The
data in it can be of three types:
• Structured data : Relational data.
• Semi Structured data : XML data.
• Unstructured data : Word, PDF, Text, Media Logs.
Benefits of Big Data
• Big data is critical to businesses and everyday life, and it is emerging as one
of the most important technologies in the modern world.
• Using the information kept in social networks such as Facebook,
marketing agencies learn about the response to their
campaigns, promotions, and other advertising media.
• Using information from social media, such as the preferences and
product perceptions of their consumers, product companies and
retail organizations plan their production.
• Using data on the previous medical history of
patients, hospitals provide better and quicker service.
Big Data Technologies
• Big data technologies are important in providing more accurate
analysis, which may lead to more concrete decision-making
resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
• To harness the power of big data, you would require an
infrastructure that can manage and process huge volumes of
structured and unstructured data in real time and can protect data
privacy and security.
• There are various technologies in the market from different vendors
including Amazon, IBM, Microsoft, etc., to handle big data.
Big Data Challenges
•Capturing data
•Curation (Organizing, maintaining)
•Storage
•Searching
•Sharing
•Transfer
•Analysis
•Presentation
Traditional Approach
• In the traditional approach, an enterprise has a computer to store and process data. The
data is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and
sophisticated software can be written to interact with the database,
process the required data and present it to the users for analysis purposes.
• This approach works well where the volume of data is small enough to be
accommodated by standard database servers, or is within the limit of the
processor that is processing the data.
• But when it comes to dealing with huge amounts of data, it is really a
tedious task to process such data through a traditional database server.
Google’s Solution
• Google solved this problem using an algorithm called MapReduce. This
algorithm divides the task into small parts and assigns those parts to many
computers connected over the network, and collects the results to form the
final result dataset.
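The divide-and-combine idea can be sketched in plain Python (an illustrative simulation of the pattern, not Google's actual implementation; the example sums numbers instead of processing real datasets):

```python
def split(data, parts):
    # Divide the task into small parts of roughly equal size.
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(part):
    # Each "computer" works on its own part independently.
    return sum(part)

def collect(results):
    # Collect the partial results to form the final result.
    return sum(results)

data = list(range(1, 101))                 # the full input
parts = split(data, 4)                     # assign parts to 4 workers
partials = [process(p) for p in parts]     # done in parallel in a real cluster
total = collect(partials)
print(total)  # 5050
```

In a real deployment the `process` calls run simultaneously on different machines connected over the network; only the small partial results travel back to be combined.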
Hadoop
• Doug Cutting, Mike Cafarella and team took the solution provided
by Google and started an Open Source Project called HADOOP in
2005 and Doug named it after his son's toy elephant.
• Hadoop runs applications using the MapReduce algorithm, where
the data is processed in parallel on different CPU nodes.
• In short, the Hadoop framework enables the development of
applications that run on clusters of computers and
perform complete statistical analysis of huge amounts of
data.
What is Hadoop?
• Hadoop is an open source framework for writing and running distributed
applications that process large amounts of data.
• Distributed computing is a wide and varied field, but the key distinctions of
Hadoop are that it is
• Accessible—Hadoop runs on large clusters of commodity machines or on
cloud computing services
• Robust—Because it is intended to run on commodity hardware, Hadoop is
architected with the assumption of frequent hardware malfunctions. It can
gracefully handle most such failures.
• Scalable—Hadoop scales linearly to handle larger data by adding more
nodes to the cluster.
• Simple—Hadoop allows users to quickly write efficient parallel code.
•Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment.
•It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
A Hadoop cluster has many parallel machines that store and process
large data sets. Client computers send jobs into this computer cloud
and obtain results.
•A Hadoop cluster is a set of commodity machines
networked together in one location.
•Data storage and processing all occur within this “cloud”
of machines.
•Different users can submit computing “jobs” to Hadoop
from individual clients, which can be their own desktop
machines in remote locations from the Hadoop cluster.
Comparing SQL databases and Hadoop
• SCALE-OUT INSTEAD OF SCALE-UP:
• Scaling commercial relational databases is expensive; their design is
better suited to scaling up. Hadoop is designed as a scale-out
architecture operating on a cluster of commodity PC machines.
Adding more resources means adding more machines to the
Hadoop cluster.
• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES:
• Hadoop uses key/value pairs as its basic data unit, which is flexible
enough to work with the less-structured data types. In Hadoop,
data can originate in any form, but it eventually transforms into
(key/value) pairs for the processing functions to work on.
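This transformation into key/value pairs can be illustrated with a small sketch (the log format and field names here are made up for the example):

```python
# Less-structured input: raw log lines rather than relational rows.
log_lines = [
    "2021-03-01 login alice",
    "2021-03-01 login bob",
    "2021-03-02 logout alice",
]

def to_key_value(line):
    # Data can originate in any form; here each line is transformed
    # into an (event, user) pair for the processing functions to work on.
    date, event, user = line.split()
    return (event, user)

pairs = [to_key_value(line) for line in log_lines]
print(pairs)  # [('login', 'alice'), ('login', 'bob'), ('logout', 'alice')]
```

Unlike a relational table, nothing about the input had to match a fixed schema; the mapping function decides at processing time what the keys and values are.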
• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE
QUERIES (SQL):
• SQL is fundamentally a high-level declarative language. You query data by stating
the result you want and let the database engine figure out how to derive it.
• Under MapReduce you specify the actual steps in processing the data, which is
more analogous to an execution plan for a SQL engine.
• Under SQL you have query statements; under MapReduce you have scripts and
code.
• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS:
• Hadoop is designed for offline processing and analysis of large-scale data. It
doesn’t work for random reading and writing of a few records, which is the type
of load for online transaction processing.
Components of Hadoop
• Hadoop framework includes following four modules:
• Hadoop Common: These are Java libraries and utilities required by
other Hadoop modules. These libraries provide filesystem and OS-level
abstractions and contain the necessary Java files and scripts required
to start Hadoop.
• Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop MapReduce: This is a YARN-based system for parallel processing
of large data sets.
MapReduce
• Hadoop MapReduce is a software framework for easily writing applications which process
large amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
• The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
• The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
• The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
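The two tasks can be sketched in Python with the classic word-count example (a simulation of the data flow, not the actual Hadoop Java API):

```python
def map_task(text):
    # Map: break the input down into (key, value) tuples,
    # here one ('word', 1) tuple per word.
    return [(word, 1) for word in text.split()]

def reduce_task(tuples):
    # Reduce: combine the map output into a smaller set of tuples,
    # here one total count per distinct word.
    combined = {}
    for word, count in tuples:
        combined[word] = combined.get(word, 0) + count
    return combined

tuples = map_task("to be or not to be")
result = reduce_task(tuples)
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Hadoop the framework also sorts and groups the map output by key before it reaches the reducers, so each reducer sees all values for a given key together.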
• Typically both the input and the output are stored in a file-system.
The framework takes care of scheduling tasks, monitoring them
and re-executes the failed tasks.
• The MapReduce framework consists of a single
master JobTracker and one slave TaskTracker per cluster-node.
• The master is responsible for resource management, tracking
resource consumption/availability and scheduling the jobs
component tasks on the slaves, monitoring them and re-executing
the failed tasks.
• The slave TaskTrackers execute the tasks as directed by the master
and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service
which means if JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System
• The most common file system used by Hadoop is the Hadoop
Distributed File System (HDFS).
• The Hadoop Distributed File System (HDFS) is based on the Google
File System (GFS) and provides a distributed file system that is
designed to run on large clusters (thousands of computers) of
small computer machines in a reliable, fault-tolerant manner.
• HDFS uses a master/slave architecture where master consists of a
single NameNode that manages the file system metadata and one
or more slave DataNodes that store the actual data.
•A file in an HDFS namespace is split into several blocks and
those blocks are stored in a set of DataNodes.
•The NameNode determines the mapping of blocks to the
DataNodes.
•The DataNodes take care of read and write operations within
the file system. They also handle block creation,
deletion and replication based on instructions given by the
NameNode.
•HDFS provides a shell like any other file system and a list of
commands are available to interact with the file system.
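The split-and-map idea can be sketched as follows (the block size, node names, and placement rule are illustrative only; real HDFS uses much larger blocks, e.g. 128 MB, and a rack-aware placement policy):

```python
BLOCK_SIZE = 4  # bytes, for illustration; real HDFS blocks are far larger

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file in the HDFS namespace is split into several blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, datanodes, replicas=2):
    # The NameNode determines the mapping of blocks to DataNodes,
    # replicating each block on several nodes for fault tolerance.
    mapping = {}
    for i, _ in enumerate(blocks):
        mapping[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replicas)]
    return mapping

blocks = split_into_blocks(b"hello hdfs world")
mapping = assign_blocks(blocks, ["dn1", "dn2", "dn3"])
print(len(blocks), mapping[0])  # 4 ['dn1', 'dn2']
```

The key point is that only the metadata (which block lives on which DataNodes) sits on the NameNode; the block contents themselves are read from and written to the DataNodes directly.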
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes the
data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and
high availability (FTHA); rather, the Hadoop library itself has been
designed to detect and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and
Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java-based.
Limitations of Hadoop
• Hadoop can perform only batch processing, and data will be
accessed only in a sequential manner. That means one has to
search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set,
which should also be processed sequentially.
• Hadoop Random Access Databases
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and
MongoDB are some of the databases that store huge amounts of
data and access the data in a random manner.
HBase
• HBase is a distributed column-oriented database built on top of the
Hadoop file system.
• It is an open-source project and is horizontally scalable.
• HBase is a data model similar to Google’s Bigtable, designed to
provide quick random access to huge amounts of structured data.
• It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
HBase and HDFS
• HDFS is a distributed file system suitable for storing large files;
HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups; HBase
provides fast lookups for larger tables.
• HDFS provides high-latency batch processing; HBase provides
low-latency access to single rows from billions of records
(random access).
• HDFS provides only sequential access to data; HBase internally
uses hash tables, provides random access, and stores the data in
indexed HDFS files for faster lookups.
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are
sorted by row. The table schema defines only column families,
which are the key value pairs.
• A table can have multiple column families, and each column family can
have any number of columns. Subsequent column values are
stored contiguously on the disk.
• In short, in an HBase:
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
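The hierarchy above can be modeled with nested dictionaries (an illustrative sketch; the row key, column family names, and values are made up for the example):

```python
# Table -> rows -> column families -> columns (key/value pairs).
table = {
    "row1": {                      # a row, identified by its row key
        "personal": {              # a column family
            "name": "alice",       # a column: a key/value pair
            "city": "pune",
        },
        "professional": {          # a second column family in the same row
            "role": "engineer",
        },
    },
}

# A value is addressed by (row key, column family, column qualifier).
value = table["row1"]["personal"]["name"]
print(value)  # alice
```

Note that only the column families ("personal", "professional") would be fixed by the table schema; the columns inside each family can differ from row to row.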
HBase and RDBMS
• HBase is schema-less; it does not have the concept of a fixed-column
schema and defines only column families. An RDBMS is governed by
its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS
is thin, built for small tables, and hard to scale.
• HBase has no transactions. An RDBMS is transactional.
• HBase has de-normalized data. An RDBMS has normalized data.
• HBase is good for semi-structured as well as structured data. An
RDBMS is good for structured data.
Features of HBase
•HBase is linearly scalable.
•It has automatic failure support.
•It provides consistent reads and writes.
•It integrates with Hadoop, both as a source and a
destination.
•It has an easy Java API for clients.
•It provides data replication across clusters.
Applications of HBase
•It is used whenever there is a need for write-heavy
applications.
•HBase is used whenever we need to provide fast
random access to available data.
•Companies such as Facebook, Twitter, Yahoo, and
Adobe use HBase internally.
HBase Architecture
• In HBase, tables are split into regions and are served by the region
servers. Regions are vertically divided by column families into
“Stores”. Stores are saved as files in HDFS.
Components of HBase
• HBase has three major components: the client library, a master
server, and region servers. Region servers can be added or removed
as per requirement.
• Master Server
• Assigns regions to the region servers, taking the help of Apache
ZooKeeper for this task.
• Handles load balancing of the regions across region servers: it
unloads the busy servers and shifts the regions to less occupied
servers.
• Maintains the state of the cluster by negotiating the load
balancing.
• Is responsible for schema changes and other metadata operations
such as the creation of tables and column families.
• Region Server
• Regions are nothing but tables that are split up and spread across the region servers.
• The region servers have regions that:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the region by following the region size thresholds.
ZooKeeper
• ZooKeeper is an open-source project that provides services such as
maintaining configuration information, naming, providing
distributed synchronization, etc.
• ZooKeeper has ephemeral nodes representing the different region
servers. Master servers use these nodes to discover available
servers.
• In addition to availability, the nodes are also used to track server
failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care
of ZooKeeper.
Cloudera
• Cloudera offers enterprises one place to store, process, and
analyze all their data, empowering them to extend the value of
existing investments while enabling fundamental new ways to
derive value from their data.
• Founded in 2008, Cloudera was the first, and is currently, the
leading provider and supporter of Apache Hadoop for the
enterprise.
• Cloudera also offers software for business critical data challenges
including storage, access, management, analysis, security, and
search.
•Cloudera Inc. is an American-based software company
that provides Apache Hadoop-based software, support
and services, and training to business customers.
•Cloudera's open-source Apache Hadoop distribution, CDH
(Cloudera Distribution Including Apache Hadoop), targets
enterprise-class deployments of that technology.
Reference
• Hadoop in Action by Chuck Lam, Manning Publications
THANK YOU
For further details, please contact
Deptii Chaudhari
deptiic@isquareit.edu.in
Department of Computer Engineering
Hope Foundation’s
International Institute of Information Technology, I²IT
P-14,Rajiv Gandhi Infotech Park
MIDC Phase 1, Hinjawadi, Pune – 411057
Tel - +91 20 22933441/2/3
www.isquareit.edu.in | info@isquareit.edu.in

Big data and hadoop training opens up new opportunities
 
IRJET- A Scenario on Big Data
IRJET- A Scenario on Big DataIRJET- A Scenario on Big Data
IRJET- A Scenario on Big Data
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Survey
 
Big Data Hadoop
Big Data HadoopBig Data Hadoop
Big Data Hadoop
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
 
Data science lecture2_doaa_mohey
Data science lecture2_doaa_mohey Data science lecture2_doaa_mohey
Data science lecture2_doaa_mohey
 
IRJET- Youtube Data Sensitivity and Analysis using Hadoop Framework
IRJET-  	  Youtube Data Sensitivity and Analysis using Hadoop FrameworkIRJET-  	  Youtube Data Sensitivity and Analysis using Hadoop Framework
IRJET- Youtube Data Sensitivity and Analysis using Hadoop Framework
 
IRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using HadoopIRJET- Sentiment Analysis on Twitter Posts using Hadoop
IRJET- Sentiment Analysis on Twitter Posts using Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Aishwarya
AishwaryaAishwarya
Aishwarya
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
PI TOP: HARDWARE-ENABLED SUPERCOMPUTER
PI TOP: HARDWARE-ENABLED SUPERCOMPUTERPI TOP: HARDWARE-ENABLED SUPERCOMPUTER
PI TOP: HARDWARE-ENABLED SUPERCOMPUTER
 
Real time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing applicationReal time big data analytical architecture for remote sensing application
Real time big data analytical architecture for remote sensing application
 

More from International Institute of Information Technology (I²IT)

More from International Institute of Information Technology (I²IT) (20)

Minimization of DFA
Minimization of DFAMinimization of DFA
Minimization of DFA
 
Understanding Natural Language Processing
Understanding Natural Language ProcessingUnderstanding Natural Language Processing
Understanding Natural Language Processing
 
What Is Smart Computing?
What Is Smart Computing?What Is Smart Computing?
What Is Smart Computing?
 
Professional Ethics & Etiquette: What Are They & How Do I Get Them?
Professional Ethics & Etiquette: What Are They & How Do I Get Them?Professional Ethics & Etiquette: What Are They & How Do I Get Them?
Professional Ethics & Etiquette: What Are They & How Do I Get Them?
 
Writing Skills: Importance of Writing Skills
Writing Skills: Importance of Writing SkillsWriting Skills: Importance of Writing Skills
Writing Skills: Importance of Writing Skills
 
Professional Communication | Introducing Oneself
Professional Communication | Introducing Oneself Professional Communication | Introducing Oneself
Professional Communication | Introducing Oneself
 
Servlet: A Server-side Technology
Servlet: A Server-side TechnologyServlet: A Server-side Technology
Servlet: A Server-side Technology
 
What Is Jenkins? Features and How It Works
What Is Jenkins? Features and How It WorksWhat Is Jenkins? Features and How It Works
What Is Jenkins? Features and How It Works
 
Hypothesis-Testing
Hypothesis-TestingHypothesis-Testing
Hypothesis-Testing
 
Data Science, Big Data, Data Analytics
Data Science, Big Data, Data AnalyticsData Science, Big Data, Data Analytics
Data Science, Big Data, Data Analytics
 
Sentiment Analysis in Machine Learning
Sentiment Analysis in  Machine LearningSentiment Analysis in  Machine Learning
Sentiment Analysis in Machine Learning
 
Java as Object Oriented Programming Language
Java as Object Oriented Programming LanguageJava as Object Oriented Programming Language
Java as Object Oriented Programming Language
 
Data Visualization - How to connect Microsoft Forms to Power BI
Data Visualization - How to connect Microsoft Forms to Power BIData Visualization - How to connect Microsoft Forms to Power BI
Data Visualization - How to connect Microsoft Forms to Power BI
 
AVL Tree Explained
AVL Tree ExplainedAVL Tree Explained
AVL Tree Explained
 
Yoga To Fight & Win Against COVID-19
Yoga To Fight & Win Against COVID-19Yoga To Fight & Win Against COVID-19
Yoga To Fight & Win Against COVID-19
 
LR(0) PARSER
LR(0) PARSERLR(0) PARSER
LR(0) PARSER
 
Programming with LEX & YACC
Programming with LEX & YACCProgramming with LEX & YACC
Programming with LEX & YACC
 
Land Pollution - Causes, Effects & Solution
Land Pollution - Causes, Effects & SolutionLand Pollution - Causes, Effects & Solution
Land Pollution - Causes, Effects & Solution
 
Supervised Learning in Cybersecurity
Supervised Learning in CybersecuritySupervised Learning in Cybersecurity
Supervised Learning in Cybersecurity
 
Sampling Theorem and Band Limited Signals
Sampling Theorem and Band Limited SignalsSampling Theorem and Band Limited Signals
Sampling Theorem and Band Limited Signals
 

Recently uploaded

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 

Recently uploaded (20)

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 

Introduction to Big Data, HADOOP: HDFS, MapReduce

Database Management Systems, Unit VI: Introduction to Big Data, HADOOP: HDFS, MapReduce
Prof. Deptii Chaudhari, Assistant Professor
Department of Computer Engineering
Hope Foundation’s International Institute of Information Technology, I²IT
What is Big Data?
• Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
• Big data is not merely data; it has become a complete subject, involving various tools, techniques and frameworks.
• Big data includes the data produced by different devices and applications.

Deptii Chaudhari, Dept. of Computer Engineering, Hope Foundation’s International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, MIDC Phase 1, Hinjawadi, Pune – 411057. Tel: +91 20 22933441/2/3 | www.isquareit.edu.in | info@isquareit.edu.in
• Social Media Data: Social media sites such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
• Power Grid Data: Power grid data holds information about the energy consumed by a particular node with respect to a base station.
• Search Engine Data: Search engines retrieve lots of data from different databases.
• Thus, Big Data includes huge volume, high velocity, and an extensible variety of data. This data is of three types:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
Benefits of Big Data
• Big data is critical to our lives and is emerging as one of the most important technologies in the modern world.
• Using the information kept on social networks such as Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.
• Using information from social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations plan their production.
• Using data on patients’ previous medical histories, hospitals provide better and quicker service.
Big Data Technologies
• Big data technologies are important in providing more accurate analysis, which can lead to more concrete decision-making, resulting in greater operational efficiency, cost reduction, and reduced risk for the business.
• To harness the power of big data, you require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time, and that can protect data privacy and security.
• There are various technologies on the market from different vendors, including Amazon, IBM, and Microsoft, to handle big data.
Big Data Challenges
• Capturing data
• Curation (organizing, maintaining)
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Traditional Approach
• In this approach, an enterprise has a computer to store and process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to users for analysis.
• This approach works well where the volume of data can be accommodated by standard database servers, or up to the limit of the processor that is processing the data.
• But when it comes to dealing with huge amounts of data, it is a tedious task to process such data through a traditional database server.
Google’s Solution
• Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
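The divide-assign-collect idea can be sketched in plain Python. This is a conceptual simulation of word counting, not Hadoop code; the names `split_input`, `process_part`, and `combine` are illustrative, and the "computers" are just sequential function calls:

```python
from collections import Counter

def split_input(text, n_parts):
    """Divide the input into small parts (one per 'computer')."""
    lines = text.splitlines()
    size = max(1, len(lines) // n_parts)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def process_part(lines):
    """Each node independently counts the words in its own part."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def combine(partials):
    """Collect the partial results to form the final result dataset."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

text = "big data\nbig hadoop\ndata data"
parts = split_input(text, 3)
result = combine(process_part(p) for p in parts)
print(result["data"])  # prints 3
```

The key design point is that `process_part` needs no data outside its own part, so in a real cluster the parts can run on different machines at the same time.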
Hadoop
• Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an open source project called HADOOP in 2005; Doug named it after his son’s toy elephant.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes.
• In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
What is Hadoop?
• Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
• Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:
• Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services.
• Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
• Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple: Hadoop allows users to quickly write efficient parallel code.
• Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
• It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.
• A Hadoop cluster is a set of commodity machines networked together in one location.
• Data storage and processing all occur within this “cloud” of machines.
• Different users can submit computing “jobs” to Hadoop from individual clients, which can be their own desktop machines in locations remote from the Hadoop cluster.
Comparing SQL Databases and Hadoop
• SCALE-OUT INSTEAD OF SCALE-UP: Scaling commercial relational databases is expensive; their design is more friendly to scaling up. Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster.
• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES: Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with less-structured data types. In Hadoop, data can originate in any form, but it is eventually transformed into key/value pairs for the processing functions to work on.
• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL): SQL is fundamentally a high-level declarative language: you query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code.
• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS: Hadoop is designed for offline processing and analysis of large-scale data. It does not work for random reading and writing of a few records, which is the type of load for online transaction processing.
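The declarative-versus-procedural contrast can be made concrete with a small, self-contained Python example. The `sales` table, city names, and amounts are invented for illustration; the same aggregation is written once as a SQL `GROUP BY` and once as explicit map, shuffle, and reduce steps:

```python
import sqlite3

rows = [("Pune", 10), ("Mumbai", 5), ("Pune", 7)]

# Declarative SQL: state the result you want; the engine derives it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city"))

# Procedural MapReduce style: spell out each processing step yourself.
mapped = [(city, amount) for city, amount in rows]      # map: emit key/value pairs
shuffled = {}
for k, v in mapped:                                     # shuffle: group values by key
    shuffled.setdefault(k, []).append(v)
mr_result = {k: sum(vs) for k, vs in shuffled.items()}  # reduce: sum each group

# Both paths compute the same totals: {"Pune": 17, "Mumbai": 5}
assert sql_result == mr_result
```

The SQL statement says nothing about grouping mechanics; the MapReduce version is essentially the execution plan written out by hand.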
Components of Hadoop
The Hadoop framework includes the following four modules:
• Hadoop Common: Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
MapReduce
• Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The term MapReduce refers to the following two tasks that Hadoop programs perform:
• The Map Task: the first task, which takes input data and converts it into a set of data where individual elements are broken down into tuples (key/value pairs).
• The Reduce Task: takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
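A minimal sketch of the two tasks for a word-count job, in illustrative Python rather than the Hadoop Java API; the framework's sort/shuffle phase between map and reduce is simulated here with a sort and `groupby`:

```python
from itertools import groupby
from operator import itemgetter

def map_task(record):
    """Break an input record into (key, value) tuples."""
    for word in record.split():
        yield (word, 1)

def reduce_task(key, values):
    """Combine the tuples for one key into a smaller set."""
    return (key, sum(values))

records = ["big data big", "hadoop data"]
pairs = [kv for r in records for kv in map_task(r)]
pairs.sort(key=itemgetter(0))                  # simulates the sort/shuffle phase
output = [reduce_task(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
print(output)  # [('big', 2), ('data', 2), ('hadoop', 1)]
```

Note that the map output (`pairs`) is larger than the final output: the reduce task's job is exactly this combining of many tuples per key into one.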
• Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
• The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
• The master is responsible for resource management, tracking resource consumption and availability, scheduling the jobs’ component tasks on the slaves, monitoring them, and re-executing failed tasks.
• The slave TaskTrackers execute the tasks as directed by the master and periodically report task status back to the master.
• The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
• 21. Hadoop Distributed File System
• The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).
• HDFS is based on the Google File System (GFS) and provides a distributed file system designed to run on large clusters (thousands of computers) of small commodity machines in a reliable, fault-tolerant manner.
• HDFS uses a master/slave architecture, where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
• 22.
• A file in an HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes.
• The NameNode determines the mapping of blocks to the DataNodes.
• The DataNodes handle read and write operations within the file system. They also handle block creation, deletion, and replication based on instructions from the NameNode.
• HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system.
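The block-splitting and block-to-DataNode mapping described above can be sketched in a few lines. This is a conceptual simulation, not real HDFS code; the block size, replication factor, and DataNode names are illustrative (real HDFS defaults to 128 MB blocks and a replication factor of 3):

```python
# Conceptual sketch (not real HDFS code): split a file's bytes into
# fixed-size blocks and assign each block to DataNodes with replication,
# mimicking how the NameNode maps blocks to DataNodes.

BLOCK_SIZE = 16          # illustrative; real HDFS defaults to 128 MB
REPLICATION = 2          # illustrative; HDFS default is 3
DATANODES = ["dn1", "dn2", "dn3"]   # hypothetical node names

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks (last block may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Round-robin placement: block i is replicated on `replication` nodes."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"HDFS splits files into blocks and replicates them."
blocks = split_into_blocks(data)
mapping = place_blocks(blocks, DATANODES)
for idx, nodes in mapping.items():
    print(f"block {idx} ({len(blocks[idx])} bytes) -> {nodes}")
```

The real NameNode uses rack-aware placement policies rather than simple round-robin, but the principle is the same: the NameNode holds only the block-to-node mapping, while the DataNodes hold the bytes.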
• 23. Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java based.
• 24. Limitations of Hadoop
• Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means one has to scan the entire dataset even for the simplest of jobs.
• A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially.
• Hadoop Random Access Databases
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
• 25. HBase
• HBase is a distributed, column-oriented database built on top of the Hadoop file system.
• It is an open-source project and is horizontally scalable.
• HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
• It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
• 26. HBase and HDFS
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for large tables.
• HDFS provides high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access to data; HBase internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.
• 27. Storage Mechanism in HBase
• HBase is a column-oriented database, and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs.
• A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk.
• In short, in HBase:
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.
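The table → row → column family → column hierarchy above maps naturally onto nested maps. The sketch below models it in plain Python (not the HBase client API); the row keys, family names, and values are hypothetical:

```python
# Conceptual sketch (plain Python, not the HBase API): the HBase data
# hierarchy as nested maps: table -> row key -> column family ->
# column qualifier -> value.

table = {
    "row1": {
        "personal": {"name": "Alice", "city": "Pune"},   # column family "personal"
        "professional": {"role": "Engineer"},            # column family "professional"
    },
    "row2": {
        "personal": {"name": "Bob"},
        "professional": {"role": "Analyst"},
    },
}

def get(table, row, family, qualifier):
    """Random access to a single cell, analogous to an HBase Get."""
    return table[row][family][qualifier]

print(get(table, "row1", "personal", "city"))  # prints "Pune"
```

Note that "row2" has no "city" column: because the schema fixes only the column families, each row can carry a different set of columns within a family, which is what makes HBase suitable for sparse, semi-structured data.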
• 28. HBase and RDBMS
• HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
• There are no transactions in HBase. An RDBMS is transactional.
• HBase has de-normalized data. An RDBMS has normalized data.
• HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
• 29. Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
• 30.
• 31. Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to the available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
• 32. HBase Architecture
• In HBase, tables are split into regions and are served by the region servers. Regions are vertically divided by column families into "Stores". Stores are saved as files in HDFS.
• 33. Components of HBase
• HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.
• Master Server
• Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less-occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations, such as the creation of tables and column families.
• 34. Region Server
• Regions are nothing but tables that are split up and spread across the region servers.
• The region servers have regions that:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the region by following the region-size thresholds.
• 35. ZooKeeper
• ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
• 36. Cloudera
• Cloudera offers enterprises one place to store, process, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamentally new ways to derive value from their data.
• Founded in 2008, Cloudera was the first, and is currently the leading, provider and supporter of Apache Hadoop for the enterprise.
• Cloudera also offers software for business-critical data challenges, including storage, access, management, analysis, security, and search.
• 37.
• Cloudera Inc. is an American software company that provides Apache Hadoop-based software, support, services, and training to business customers.
• Cloudera's open-source Apache Hadoop distribution, CDH (Cloudera's Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology.
• 38. Reference
• Chuck Lam, Hadoop in Action, Manning Publications
• 39. THANK YOU
For further details, please contact:
Deptii Chaudhari
deptiic@isquareit.edu.in
Department of Computer Engineering
Hope Foundation's International Institute of Information Technology, I²IT