Big Data Analytics
•Big Data Introduction
•Hadoop Introduction
• Different types of Components in Hadoop HDFS
• MapReduce
•PIG
•Hive
Contents
Big Data Introduction
Have you heard of the “Flood of Data”?
Where is the source?
What is next ?
• The New York Stock Exchange generates about one
terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos,
taking up one petabyte of storage.
•Ancestry.com, the genealogy site, stores around 2.5
petabytes of data.
•The Internet Archive stores around 2 petabytes of
data, and is growing at a rate of 20 terabytes per
month.
• 6,000 pictures uploaded to Flickr.
• 1,500 blog posts.
• 600 new YouTube videos.
• The Large Hadron Collider near Geneva, Switzerland,
will produce about 15 petabytes of data per year.
• Google processes 20 petabytes in a day.
•The Wayback Machine has 3 petabytes + 100 TB per month.
•eBay has 6.5 petabytes of user data + 50 TB per day.
• Google receives 20 lakh (2 million) search queries.
• Facebook receives 34,722 likes in a day.
• Apple receives 47,000 app downloads.
• 37,000 minutes of calls on Skype.
• 98,000 tweets posted.
Now in 2017
No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract value
and hidden knowledge from it…
Big Data Definition
"Big Data are high-volume, high-velocity, and/or high-
variety information assets that require new forms of
processing to enable enhanced decision making, insight
discovery and process optimization” (Gartner 2012)
• Complicated (intelligent) analysis of data may make a
small data “appear” to be “big”
• Any data that exceeds our current capability of
processing can be regarded as “big”
Acquisition
Aggregation
Analysis
Application
Lifecycle of Data: 4 “A”s
Why Big Data ?
Why Big Data ?
Users
Application
Systems
Sensors
Large and growing files
(Big data files)
Why Big Data ?
•Growth of Big Data is driven by:
–Increase of storage capacities
–Increase of processing power
–Availability of data (different data types)
–Every day we create 2.5 quintillion bytes of
data; 90% of the data in the world today has
been created in the last two years alone
Three Characteristics of Big Data: 3Vs
Volume
•Data
quantity
Velocity
•Data
Speed
Variety
•Data
Types
Volume
A zettabyte is 10^21 bytes, or equivalently one thousand
exabytes, one million petabytes, or one billion terabytes.
(one petabyte = 1024 terabytes)
• From 0.8 zettabytes to 35 ZB by 2020.
• Data volume is increasing exponentially
Volume
•A typical PC might have had 10 gigabytes of storage in
2000.
•Today, Facebook ingests 500 terabytes of new data every
day.
•A Boeing 737 will generate 240 terabytes of flight data
during a single flight across the US.
• Smartphones, and the data they create and consume, together
with sensors embedded into everyday objects, will soon result in
billions of new, constantly updated data feeds containing
environmental, location, and other information, including
video.
Velocity
• Clickstreams and ad impressions capture user
behavior at millions of events per second
• high-frequency stock trading algorithms reflect
market changes within microseconds
• machine to machine processes exchange data
between billions of devices
• infrastructure and sensors generate massive log data
in real-time
• on-line gaming systems support millions of concurrent
users, each producing multiple inputs per second.
Velocity
•Data is being generated fast and needs to be
processed fast
•Online Data Analytics
•Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location, your
purchase history and what you like, send promotions right
now for the store next to you
Healthcare monitoring: sensors monitoring your
activities and body; any abnormal measurements require
immediate reaction
Variety
•Various formats, types, and structures
•Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
•Static data vs. streaming data
•A single application can be generating/collecting many
types of data
Variety
To extract knowledge, all these types of data need
to be linked together.
Big Data:3Vs
4Vs
Big Data Analytics
•Examining large amounts of data
•Appropriate information
•Identification of hidden patterns, unknown
correlations
•Competitive advantage
•Better business decisions: strategic and
operational
•Effective marketing, customer satisfaction,
increased revenue
Big Data Analytics
•Big data is more real-time in nature than traditional DW
applications
•Traditional DW architectures (e.g. Exadata, Teradata) are
not well-suited for big data apps
•Shared nothing, massively parallel processing, scale out
architectures are well-suited for big data apps
Homeland
Security
Smarter
Healthcare
Multi-channel
sales
Telecom
Manufacturing
Traffic Control
Trading
Analytics
Search
Quality
Application Of Big Data analytics
Hadoop
Hadoop Introduction
•Hadoop is an analysis framework for Big Data.
•It is an open-source framework for creating distributed
applications that process huge amounts of data.
•Huge means:
 25,000 machines
 More than 10 clusters
 3 PB of data
 700+ users
 10,000+ jobs / week
Other Definition
•Hadoop is an open-source, distributed, batch-processing
and fault-tolerant system which is capable of storing
huge amounts of data (TB, PB, even ZB) along with
processing the same amount of data.
• Easy-to-use parallel programming model.
• Consists of two main core components
 HDFS
 MapReduce
History
•Hadoop was created by Doug Cutting and Mike
Cafarella.
• Doug Cutting was working at Yahoo.
• He named the project after his son’s toy elephant.
• It was originally developed to support distribution for
the Nutch search engine project.
Brief History
•2004—implemented by Doug Cutting and Mike
Cafarella.
• December 2005—Nutch ported to the new framework.
Hadoop runs reliably on 20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially
started to support the standalone development of
MapReduce and HDFS.
• February 2006—Adoption of Hadoop by Yahoo! Grid
team.
• April 2006—Sort benchmark (10 GB/node) run on 188
nodes in 47.9 hours.
Brief History
• May 2006—Yahoo! set up a Hadoop research cluster—
300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42
hours (better hardware than April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in
1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2
hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters—2 clusters of 1000
nodes.
Brief History
• April 2008—Won the 1 terabyte sort benchmark in 209
seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day
on to research clusters.
• March 2009—17 clusters with a total of 24,000 nodes.
• April 2009—Won the minute sort by sorting 500 GB in
59 seconds (on 1,400 nodes) and the 100 terabyte sort in
173 minutes (on 3,400 nodes).
Data in Nature
Structured data :
Relational data.
Semi Structured data :
XML data.
Unstructured data :
Word, PDF, Text,
Media Logs.
How Hadoop is Different?
Traditional Approach
Here data will be stored in an RDBMS like Oracle Database, MS SQL
Server or DB2, and sophisticated software can be written to
interact with the database, process the required data and present it
to the users for analysis purposes.
LIMITATIONS: This approach works well where the volume of data is
small enough to be accommodated by standard database servers, or
up to the limit of the processor that is processing the data.
How Hadoop is Different?
Google’s Solution
Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts and assigns those
parts to many computers connected over the network, and collects
the results to form the final result dataset.
Hadoop
How Hadoop is Different?
Hadoop runs applications using the MapReduce algorithm, where
the data is processed in parallel on different CPU nodes. In short,
the Hadoop framework makes it possible to develop applications
that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
Advantages of Hadoop
Fast: In HDFS the data is distributed over the cluster and mapped,
which helps in faster retrieval. Even the tools to process the data
are often on the same servers, thus reducing the processing time.
It is able to process terabytes of data in minutes and petabytes in
hours.
Scalable: Hadoop cluster can be extended by just adding nodes in
the cluster.
Cost Effective: Hadoop is open source and uses commodity
hardware to store data, so it is really cost-effective compared to a
traditional relational database management system.
Resilient to failure: HDFS can replicate data over the network, so if
one node is down or some other network failure happens, then
Hadoop takes the other copy of the data and uses it. Normally, data
are replicated thrice, but the replication factor is configurable (a
small API sketch follows).
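As a minimal sketch, assuming the standard Hadoop FileSystem Java API and a hypothetical file path, this is how a client could change the replication factor of one file programmatically (the cluster-wide default is normally set via the dfs.replication property in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                   // connects to the default file system (HDFS)
    Path file = new Path("/user/demo/input.txt");           // hypothetical HDFS file
    boolean changed = fs.setReplication(file, (short) 2);   // ask for 2 replicas instead of the default 3
    System.out.println("Replication changed: " + changed);
    fs.close();
  }
}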
Advantages of Hadoop
Flexible: Hadoop enables businesses to easily access new data
sources and tap into different types of data (both structured and
unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights
from data sources such as social media, email conversations or
clickstream data, log processing, recommendation systems, data
warehousing, market campaign analysis and fraud detection.
Components of Hadoop
 Hadoop Common- Pre-defined set of utilities and libraries that
can be used by other modules within the Hadoop ecosystem.
 Hadoop Distributed File System (HDFS) - The default big data
storage layer for Apache Hadoop is HDFS.
It is the “Secret Sauce” of Apache Hadoop components as users
can dump huge datasets into HDFS and the data will sit there nicely
until the user wants to leverage it for analysis.
 MapReduce- Distributed Data Processing Framework-
MapReduce is a Java-based system, based on a processing model created by
Google, in which the actual data from the HDFS store gets processed efficiently.
 MapReduce breaks down a big data processing job into smaller
tasks.
MapReduce is responsible for analysing large datasets in
parallel before reducing them to find the results (a minimal example is sketched below).
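To illustrate the map/reduce split just described, here is the classic word-count job written against the Hadoop MapReduce Java API; the input and output paths come from the command line and are assumed to be HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}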
Components of Hadoop
 YARN- forms an integral part of Hadoop 2.0.
YARN is a great enabler for dynamic resource utilization on the Hadoop
framework, as users can run various Hadoop applications without
having to bother about increasing workloads.
 Data Access Components of Hadoop Ecosystem-
 Pig- is a convenient tool developed by Yahoo for analysing huge
data sets efficiently and easily.
It provides a high-level data flow language that is optimized,
extensible and easy to use.
The most outstanding feature of Pig programs is that their
structure is open to considerable parallelization, making it easy to
handle large data sets.
Components of Hadoop
Hive- developed by Facebook, is a data warehouse built on top of
Hadoop that provides a simple, SQL-like language known as HiveQL
for querying, data summarization and analysis.
Hive makes querying faster through indexing.
 Data Integration Components of Hadoop Ecosystem-
 Sqoop- Sqoop component is used for importing data from
external sources into related Hadoop components like HDFS, HBase
or Hive.
It can also be used for exporting data from Hadoop to other external
structured data stores.
Sqoop parallelizes data transfer, mitigates excessive loads, allows
data imports and efficient data analysis, and copies data quickly.
 Flume- is used to gather and aggregate large amounts of data.
Apache Flume is used for collecting data from its origin and sending
it back to the resting location (HDFS).
Components of Hadoop
 Data Storage Component of Hadoop Ecosystem-
 HBase – HBase is a column-oriented database that uses HDFS for
underlying storage of data.
HBase supports random reads and also batch computations using
MapReduce.
 Monitoring and Management Components of Hadoop
Ecosystem-
 Oozie-is a workflow scheduler where the workflows are expressed
as Directed Acyclic Graphs.
Oozie runs in a Java servlet container (Tomcat) and makes use of a
database to store all the running workflow instances, their states and
variables along with the workflow definitions, in order to manage Hadoop
jobs (MapReduce, Sqoop, Pig and Hive).
Components of Hadoop
Zookeeper- Zookeeper is the king of coordination and provides
simple, fast, reliable and ordered operational services for a Hadoop
cluster.
Zookeeper is responsible for synchronization service, distributed
configuration service and for providing a naming registry for
distributed systems.
 AMBARI- provides a RESTful API and, mainly, a web user interface for managing Hadoop clusters.
 MAHOUT- for machine learning.
Advantages of Hadoop
Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes the
data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA), rather Hadoop library itself has been
designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java-based.
Computational view of Bigdata
Formatting, Cleaning
Data Storage
Data Understanding
Data Access
Data Integration
Data Analysis
Data Visualization
Topics related to Bigdata
(the same pipeline: Formatting & Cleaning, Data Storage, Data Understanding,
Data Access, Data Integration, Data Analysis, Data Visualization)
Computer Vision
Natural Language Processing
Speech Recognition
Signal Processing
Databases / Information Retrieval
Data Warehousing
Data Mining
Machine Learning
Human-Computer Interaction
Information Theory
Many Applications!
HDFS
Hadoop comes with a distributed file system called HDFS.
In HDFS data is distributed over several machines and replicated to
ensure durability against failure and high availability to parallel
applications.
HDFS Concepts
Blocks
 Name Node
 Data Node
Block: A block is the minimum amount of data that HDFS can read or
write.
HDFS blocks are 128 MB by default and this is configurable.
Files in HDFS are broken into block-sized chunks, which are stored
as independent units (see the sketch below).
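A minimal sketch, assuming the standard FileSystem Java API and a hypothetical file, showing how a client can ask the NameNode which blocks make up a file and which DataNodes hold them; the per-client block size can be overridden through the dfs.blocksize property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB, the HDFS default
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/big-file.dat");      // hypothetical HDFS file
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // Each block reports the DataNodes (hosts) that hold one of its replicas.
      System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}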
HDFS Concepts
Name Node: HDFS works in master-worker pattern where the name
node acts as master.
Name Node is the controller and manager of HDFS, as it knows the
status and the metadata of all the files in HDFS; the metadata
information being file permissions, names and the location of each
block.
The metadata is small, so it is stored in the memory of the name
node, allowing faster access to data.
Moreover, the HDFS cluster is accessed by multiple clients
concurrently, so all this information is handled by a single machine.
 The file system operations like opening, closing, renaming, etc.
are executed by it (a small API sketch follows).
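A small sketch of those file system operations as seen from a client, assuming the standard Hadoop FileSystem Java API and hypothetical paths; each call here is a metadata operation served by the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeOperations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/demo/reports");        // hypothetical directory
    fs.mkdirs(dir);                                    // create a directory (metadata only)
    Path oldName = new Path(dir, "draft.txt");
    Path newName = new Path(dir, "final.txt");
    fs.rename(oldName, newName);                       // rename is a pure NameNode operation
    boolean exists = fs.exists(newName);               // existence check also hits only the NameNode
    System.out.println(newName + " exists: " + exists);
    fs.close();
  }
}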
HDFS Concepts
Data Node: They store and retrieve blocks when they are told to by the
client or the name node.
They report back to the name node periodically, with the list of blocks
that they are storing.
The data node, being commodity hardware, also does the work of
block creation, deletion and replication as instructed by the name node.
Hadoop Modes of Installation
Standalone / local mode-After downloading Hadoop on your
system, by default, it is configured in standalone mode and can be
run as a single Java process.
 Pseudo distributed mode [Labwork]-It is a distributed simulation
on a single machine. Each Hadoop daemon, such as HDFS, YARN, MapReduce, etc.,
will run as a separate Java process. This mode is useful for
development.
 Fully Distributed mode [Realtime]-This mode is fully distributed
with a minimum of two or more machines as a cluster.
Standalone / local mode-
 Single machine
 No daemons are running
 Everything runs in a single JVM
 Standard OS storage
 Good for development and testing with small data, but will not catch
all errors.
Pseudo distributed mode
 Single machine but cluster is simulated
 Daemons in separate JVM
 Separate JVMs in single node
 Good for development and Debugging
Full Distributed Mode
 Run Hadoop on cluster of machines
 Daemons run
 Production environment
 Good for staging and production
Hadoop Architecture
5 Daemons
NameNode
 DataNode
 JobTracker
 TaskTracker
 Secondary NameNode
Functionality of Daemons
Namenode
NameNode is the centerpiece of HDFS.
NameNode is also known as the Master
NameNode only stores the metadata of HDFS – the directory tree
of all files in the file system, and tracks the files across the cluster.
NameNode does not store the actual data or the dataset. The
data itself is actually stored in the DataNodes.
NameNode knows the list of the blocks and its location for any
given file in HDFS. With this information NameNode knows how to
construct the file from blocks.
NameNode is so critical to HDFS that when the NameNode is
down, the HDFS/Hadoop cluster is inaccessible and considered down.
NameNode is a single point of failure in a Hadoop cluster.
NameNode is usually configured with a lot of memory (RAM),
because the block locations are held in main memory.
Functionality of Daemons
DataNode
DataNode is responsible for storing the actual data in HDFS.
DataNode is also known as the Slave
NameNode and DataNode are in constant communication.
When a DataNode starts up it announces itself to the NameNode
along with the list of blocks it is responsible for.
When a DataNode is down, it does not affect the availability of
data or the cluster. NameNode will arrange for replication for the
blocks managed by the DataNode that is not available.
DataNode is usually configured with a lot of hard disk space.
Because the actual data is stored in the DataNode.
Functionality of Daemons
Job Tracker –
JobTracker process runs on a separate node and not usually on a
DataNode.
JobTracker is an essential Daemon for MapReduce execution in
MRv1. It is replaced by ResourceManager/ApplicationMaster in
MRv2. (MRv1=uses the JobTracker to create and assign tasks to
data nodes, which can become a resource bottleneck when the
cluster scales out far enough (usually around 4,000 nodes).)
JobTracker receives the requests for MapReduce execution from
the client.
JobTracker talks to the NameNode to determine the location of
the data.
JobTracker finds the best TaskTracker nodes to execute tasks
based on the data locality (proximity of the data) and the available
slots to execute a task on a given node.
Functionality of Daemons
Job Tracker –
JobTracker monitors the individual TaskTrackers and submits
the overall status of the job back to the client.
JobTracker process is critical to the Hadoop cluster in terms of
MapReduce execution.
When the JobTracker is down, HDFS will still be functional but the
MapReduce execution can not be started and the existing
MapReduce jobs will be halted.
 The JobTracker locates TaskTracker nodes with available slots at
or near the data.
The JobTracker submits the work to the chosen TaskTracker
nodes.
The TaskTracker nodes are monitored. If they do not submit
heartbeat signals often enough, they are deemed to have failed
and the work is scheduled on a different TaskTracker.
Heart Beat Mechanism
•To process the data, Job Tracker assigns certain tasks to the Task
Tracker.
•Let us think that, while the processing is going on one DataNode in
the cluster is down.
•Now, the NameNode should know that a certain DataNode is down,
otherwise it cannot continue processing by using replicas.
•To make the NameNode aware of the status (active / inactive) of
DataNodes, each DataNode sends a "Heart Beat Signal" periodically
(every 3 seconds by default); if no heartbeat arrives for about 10
minutes, the DataNode is marked dead.
•This mechanism is called the HEART BEAT MECHANISM (the relevant
configuration keys are sketched below).
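A minimal sketch of the configuration keys involved, assuming the standard HDFS properties dfs.heartbeat.interval (seconds) and dfs.namenode.heartbeat.recheck-interval (milliseconds); in a real cluster these are set in hdfs-site.xml rather than in application code.

import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);                    // DataNode -> NameNode heartbeat, default 3 s
    long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000); // NameNode liveness recheck, default 5 min
    // A DataNode is declared dead only after roughly
    // 2 * recheck-interval + 10 * heartbeat-interval (about 10.5 minutes with the defaults).
    long deadAfterMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
    System.out.println("Heartbeat every " + heartbeatSec + " s; node marked dead after ~"
        + (deadAfterMs / 1000) + " s");
  }
}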
Functionality of Daemons
Job Tracker –
 Responsible for scheduling and rescheduling the tasks in the
form of MapReduce jobs.
 The JobTracker will reside on top of the NameNode.
Functionality of Daemons
TaskTracker –
TaskTracker runs on DataNode. Mostly on all DataNodes.
TaskTracker is replaced by Node Manager in MRv2.
Mapper and Reducer tasks are executed on DataNodes
administered by TaskTrackers.
TaskTrackers will be assigned Mapper and Reducer tasks to
execute by JobTracker.
TaskTracker will be in constant communication with the
JobTracker signalling the progress of the task in execution.
TaskTracker failure is not considered fatal. When a TaskTracker
becomes unresponsive, JobTracker will assign the task executed by
the TaskTracker to another node.
Functionality of Daemons
TaskTracker –
Responsible for instantiating and maintaining individual map and
reduce tasks.
Known as the s/w (software) daemon of the Hadoop architecture.
The tasks executed by the TaskTracker are called MR jobs.
Functionality of Daemons
Secondary NameNode:
Secondary NameNode in hadoop is a specially dedicated node in
HDFS cluster whose main function is to take checkpoints of the file
system metadata present on namenode. It is not a backup
namenode.
It just checkpoints namenode’s file system namespace.
The Secondary NameNode is a helper to the primary NameNode
but not a replacement for the primary NameNode.
As the NameNode is the single point of failure in HDFS, if the
NameNode fails the entire HDFS file system is lost. So in order to
overcome this, Hadoop implemented the Secondary NameNode, whose
main function is to store a copy of the FsImage file and the edits log file.
FsImage is a snapshot of the HDFS file system metadata at a
certain point of time, and the EditLog is a transaction log which contains
records for every change that occurs to the file system.
Functionality of Daemons
Secondary NameNode:
It can be replaced by the checkpoint node.
It runs as a separate physical node.
When the NameNode goes down, the Secondary NameNode comes into the picture.
It never takes over all the functionalities of the NameNode.
It only reads the FsImage (namespace image) and the edit log.
What is Checkpoint Mechanism ?
It is the mechanism used in a Hadoop cluster to make sure all the
metadata of the Hadoop file system gets updated in the
two persistent files:
 fsnamespace image (FsImage)
 edit log
The checkpoint mechanism can be configured to run hourly, daily or
weekly (a configuration sketch follows).
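A small sketch of how the checkpoint frequency could be expressed, assuming the standard HDFS keys dfs.namenode.checkpoint.period (seconds) and dfs.namenode.checkpoint.txns; as with the heartbeat settings, these normally live in hdfs-site.xml rather than in application code.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.namenode.checkpoint.period", 60L * 60);   // checkpoint at least every hour (value in seconds)
    conf.setLong("dfs.namenode.checkpoint.txns", 1000000L);     // ...or sooner, after this many edit-log transactions
    System.out.println("period=" + conf.getLong("dfs.namenode.checkpoint.period", 3600) + " s, "
        + "txns=" + conf.getLong("dfs.namenode.checkpoint.txns", 1000000));
  }
}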
What is Speculative execution of
Hadoop?
If a node fails to respond within 10 minutes, the NameNode will
mark that node as either dead, running slow or functionally slow.
Immediately the JobTracker will assign the same task to some
other idle node within the Hadoop cluster. This is known as
speculative execution in Hadoop.
It ensures the end user does not feel any delay.
Network Topology
• Hadoop cluster architecture consists of a two-level network
topology.
• Typically there are 30 to 40 servers per rack, with a 1 Gb switch
for the rack.
•An uplink to a core switch or router (which is normally 1 Gb or
better).
Two-level network topology
Rack awareness
• To get maximum performance out of Hadoop, it is important to
configure Hadoop so that it knows the topology of your network.
• If your cluster runs on a single rack, then there is nothing more to
do, since this is the default.
• For multirack clusters, you need to map nodes to racks.
• By doing this, Hadoop will prefer within-rack transfers (where
there is more bandwidth available) to off-rack transfers when
placing MapReduce tasks on nodes.
• Network locations such as nodes and racks are represented in a
tree, which reflects the network “distance” between locations.
The namenode uses the network location when determining
where to place block replicas; the MapReduce scheduler uses
network location to determine where the closest replica is as input
to a map task.
Rack awareness
• In a network, the rack topology can be described by two network
locations, e.g. /switch1/rack1 and /switch1/rack2.
•For the figure we can present the locations as /rack1 and /rack2.
• The Hadoop configuration must specify a map between node
addresses and network locations. The map is described by a Java
interface, DNSToSwitchMapping.
• The topology.node.switch.mapping.impl configuration property
defines an implementation of the DNSToSwitchMapping interface
that the namenode and the jobtracker use to resolve worker node
network locations.
• The names parameter is a list of IP addresses, and the return
value is a list of corresponding network location strings (a minimal
mapping is sketched below).
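A minimal, hedged sketch of such a mapping: a Java class implementing the DNSToSwitchMapping interface that puts every node whose address starts with a hypothetical prefix in /rack1 and everything else in /rack2 (recent Hadoop versions also declare the reloadCachedMappings methods shown here). The class name and rule are illustrative only; whichever implementation is used is the one named by the topology.node.switch.mapping.impl property mentioned above.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<>(names.size());
    for (String name : names) {
      // Hypothetical rule: addresses in 10.1.x.x sit in rack 1, everything else in rack 2.
      racks.add(name.startsWith("10.1.") ? "/rack1" : "/rack2");
    }
    return racks;
  }
  @Override
  public void reloadCachedMappings() { /* nothing cached in this sketch */ }
  @Override
  public void reloadCachedMappings(List<String> names) { /* nothing cached in this sketch */ }
}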
Parallel Copying with distcp
•So far, the HDFS access we have seen has been single-threaded.
• For parallel processing of files you would have to write a program
yourself.
• Hadoop comes with a useful program called distcp for copying
large amounts of data to and from Hadoop filesystems in parallel.
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This will copy the /foo directory (and its contents) from the first
cluster to the /bar directory on the second cluster, so the second
cluster ends up with the directory structure /bar/foo.
If /bar doesn’t exist, it will be created first. You can specify multiple
source paths, and all will be copied to the destination.
Parallel Copying with distcp
•There are more options to control the behavior of distcp, including
ones to preserve file attributes, ignore failures, and limit the
number of files or total data copied.
•distcp is implemented as a MapReduce job where the work of
copying is done by the maps that run in parallel across the cluster.
• There are no reducers. Each file is copied by a single map, and
distcp tries to give each map approximately the same amount of
data, by bucketing files into roughly equal allocations.
• If you try to use distcp between two HDFS clusters that are
running different versions, the copy will fail if you use the hdfs
protocol, since the RPC systems are incompatible.
•To overcome this you can use the read-only HTTP-based HFTP
filesystem to read from the source.
THANK YOU

More Related Content

What's hot

NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
Ramakant Soni
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
GauravBiswas9
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
Navjot Kaur
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
sunera pathan
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 
NOSQL vs SQL
NOSQL vs SQLNOSQL vs SQL
NOSQL vs SQL
Mohammed Fazuluddin
 
Data cubes
Data cubesData cubes
Data cubes
Mohammed
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
Rahul Jain
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
Viet-Trung TRAN
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
neeraj rathore
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Bhavesh Padharia
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
ateeq ateeq
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
Heman Hosainpana
 
Mongo DB Presentation
Mongo DB PresentationMongo DB Presentation
Mongo DB Presentation
Jaya Naresh Kovela
 
Spatial data mining
Spatial data miningSpatial data mining
Spatial data mining
MITS Gwalior
 
Web data mining
Web data miningWeb data mining

What's hot (20)

NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
NOSQL vs SQL
NOSQL vs SQLNOSQL vs SQL
NOSQL vs SQL
 
Data cubes
Data cubesData cubes
Data cubes
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Mongo DB Presentation
Mongo DB PresentationMongo DB Presentation
Mongo DB Presentation
 
Spatial data mining
Spatial data miningSpatial data mining
Spatial data mining
 
Web data mining
Web data miningWeb data mining
Web data mining
 

Similar to Hadoop HDFS.ppt

Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
Dr.K.Sreenivas Rao
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
Jayant Mukherjee
 
Big Data
Big DataBig Data
Big Data
Priyanka Tuteja
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
MaulikLakhani
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh
 
Big Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and moreBig Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and more
Softweb Solutions
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Sri Kanth
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
Tomy Rhymond
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
Kunal Khanna
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 
Big Data
Big DataBig Data
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
Nagarjuna D.N
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
Bob Hardaway
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
NidhiAhuja30
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
Ankur Tripathi
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
DignitasDigital1
 

Similar to Hadoop HDFS.ppt (20)

Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big data
Big dataBig data
Big data
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Big Data
Big DataBig Data
Big Data
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and moreBig Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and more
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big Data
Big DataBig Data
Big Data
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 

Recently uploaded

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 

Recently uploaded (20)

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 

Hadoop HDFS.ppt

  • 2. •Big Data Introduction •Hadoop Introduction • Different types of Components in Hadoop HDFS • MapReduce •PIG •Hive Contents
  • 3. Big Data Introduction Have you hard “ Flood of Data” ? Where is the source? What is next ?
  • 4. • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. •Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. •The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. • 6,000 pictures uploaded to flickr. • 1.500 blogs. • 600 new youTube video.
  • 5. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year. • Google process 20Petabytes in a day. •Wayback machine has 3petabytes + 100TB per month •eBay has 6.5 petabytes of user data + 50TB per day. • Google receives 20 lakhs search quires. • Facebook receives 34,722 likes in a day. • Apple receives 47,000 App downloads. • 37,000 minutes of call on Skype. • 98,000 posts on tweets.
  • 7. No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… Big Data Definition "Big Data are high-volume, high-velocity, and/or high- variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (Gartner 2012)
  • 8. • Complicated (intelligent) analysis of data may make a small data “appear” to be “big” • Any data that exceeds our current capability of processing can be regarded as “big”
  • 11. Why Big Data ? Users Application Systems Sensors Large and growing files (Big data files)
  • 12. Why Big Data ? •Growth of Big Data is needed –Increase of storage capacities –Increase of processing power –Availability of data(different data types) –Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone
  • 13. Three Characteristics of Big Data V3s Volume •Data quantity Velocity •Data Speed Variety •Data Types
  • 14. Volume A zettabyte is 1021 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. (one Petabyte = 1024 Terabytes) • From 0.8 zettabytes to 35zb within 2020. • Data volume is increasing exponentially
  • 15. Volume •A typical PC might have had 10 gigabytes of storage in 2000. •Today, Facebook ingests 500 terabytes of new data every day. •Boeing 737 will generate 240 terabytes of flight data during a single flight across the US. • The smart phones, the data they create and consume; sensors embedded into everyday objects will soon result in billions of new, constantly-updated data feeds containing environmental, location, and other information, including video.
  • 16. Velocity • Clickstreams and ad impressions capture user behavior at millions of events per second • high-frequency stock trading algorithms reflect market changes within microseconds • machine to machine processes exchange data between billions of devices • infrastructure and sensors generate massive log data in real-time • on-line gaming systems support millions of concurrent users, each producing multiple inputs per second.
  • 17. Velocity •Data is begin generated fast and need to be processed fast •Online Data Analytics •Late decisions missing opportunities Examples E-Promotions: Based on your current location, your purchase history, what you like send promotions right now for store next to you Healthcare monitoring: sensors monitoring your activities and body any abnormal measurements require immediate reaction
  • 18. Variety •Various formats, types, and structures •Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc… •Static data vs. streaming data •A single application can be generating/collecting many types of data
  • 19. Variety To extract knowledge all these types of data need to linked together
  • 21. 4Vs
  • 22. Big Data Analytics •Examining large amount of data •Appropriate information •Identification of hidden patterns, unknown correlations •Competitive advantage •Better business decisions: strategic and operational •Effective marketing, customer satisfaction, increased revenue
  • 23. Big Data Analytics •Big data is more real-time in nature than traditional DW applications •Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps •Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
  • 25.
  • 27. Hadoop Introduction •Hadoop is analysis for Bigdata. •It is an open-source framework for creating distributed applications that process huge amount of data. •Huge means:  25,000 machines  More than 10 clusters  3 PB of data  700+ users  10,000+ jobs / week
  • 28. Other Definition •Hadoop is an open-source distributed, batch-processing and fault-tolerance system which is capable of sorting huge amount of data (TB, PB, ZettaByte) along with processing on the same amount of data. • Easy to use parallel programming model. • Consists two main core components  HDFS  MapReduce
  • 29. History •Hadoop was created by Doug Cutting and Mike Cafarella. • Doug Cutting was working at Yahoo. • He named this project as per his son’s toy elephant. • It was originally developed to support distribution for the Nutch search engine project.
  • 30. Brief History •2004—implemented by Doug Cutting and Mike Cafarella. • December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. • January 2006—Doug Cutting joins Yahoo!. • February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS. • February 2006—Adoption of Hadoop by Yahoo! Grid team. • April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
  • 31. Brief History • May 2006—Yahoo! set up a Hadoop research cluster— 300 nodes. • May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark). • October 2006—Research cluster reaches 600 nodes. • December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours. • January 2007—Research cluster reaches 900 nodes. • April 2007—Research clusters—2 clusters of 1000 nodes.
  • 32. Brief History • April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes. • October 2008—Loading 10 terabytes of data per day on to research clusters. • March 2009—17 clusters with a total of 24,000 nodes. • April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
  • 33. Data in Nature Structured data : Relational data. Semi Structured data : XML data. Unstructured data : Word, PDF, Text, Media Logs.
  • 34. How Hadoop is Different? Traditional Approach Here data will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2 and sophisticated soft-wares can be written to interact with the database, process the required data and present it to the users for analysis purpose. LIMITATIONS: This approach works well where we have less volume of data that can be accommodated by standard database servers, or up to the limit of the processor which is processing the data.
  • 35. How Hadoop is Different? Google’s Solution Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
  • 36. Hadoop How Hadoop is Different? Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, Hadoop framework is capable enough to develop applications capable of running on clusters of computers and they could perform complete statistical analysis for a huge amounts of data.
  • 37. Advantages of Hadoop Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. It is able to process terabytes of data in minutes and Peta bytes in hours. Scalable: Hadoop cluster can be extended by just adding nodes in the cluster. Cost Effective: Hadoop is open source and uses commodity hardware to store data so it really cost effective as compared to traditional relational database management system. Resilient to failure: HDFS has the property with which it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of data and use it. Normally, data are replicated thrice but the replication factor is configurable.
  • 38. Advantages of Hadoop Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations or clickstream data, log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.
  • 39. Components of Hadoop  Hadoop Common- Pre-defined set of utilities and libraries that can be used by other modules within the Hadoop ecosystem.  Hadoop Distributed File System (HDFS) - The default big data storage layer for Apache Hadoop is HDFS. It is the “Secret Sauce” of Apache Hadoop components as users can dump huge datasets into HDFS and the data will sit there nicely until the user wants to leverage it for analysis.  MapReduce- Distributed Data Processing Framework- MapReduce is a Java-based system created by Google where the actual data from the HDFS store gets processed efficiently.  MapReduce breaks down a big data processing job into smaller tasks. MapReduce is responsible for the analysing large datasets in parallel before reducing it to find the results.
  • 40. Components of Hadoop  YARN- forms an integral part of Hadoop 2.0. YARN is great enabler for dynamic resource utilization on Hadoop framework as users can run various Hadoop applications without having to bother about increasing workloads.  Data Access Components of Hadoop Ecosystem-  Pig- is a convenient tools developed by Yahoo for analysing huge data sets efficiently and easily. It provides a high level data flow language that is optimized, extensible and easy to use. The most outstanding feature of Pig programs is that their structure is open to considerable parallelization making it easy for handling large data sets.
  • 41. Components of Hadoop Hive- developed by Facebook is a data warehouse built on top of Hadoop and provides a simple language known as HiveQL similar to SQL for querying, data summarization and analysis. Hive makes querying faster through indexing.  Data Integration Components of Hadoop Ecosystem-  Sqoop- Sqoop component is used for importing data from external sources into related Hadoop components like HDFS, HBase or Hive. It can also be used for exporting data from Hadoop, other external structured data stores. Sqoop parallelized data transfer, mitigates excessive loads, allows data imports, efficient data analysis and copies data quickly.  Flume- is used to gather and aggregate large amounts of data. Apache Flume is used for collecting data from its origin and sending it back to the resting location (HDFS).
  • 42. Components of Hadoop  Data Storage Component of Hadoop Ecosystem-  HBase – HBase is a column-oriented database that uses HDFS for underlying storage of data. HBase supports random reads and also batch computations using MapReduce.  Monitoring and Management Components of Hadoop Ecosystem-  Oozie-is a workflow scheduler where the workflows are expressed as Directed Acyclic Graphs. Oozie runs in a Java servlet container Tomcat and makes use of a database to store all the running workflow instances, their states ad variables along with the workflow definitions to manage Hadoop jobs (MapReduce, Sqoop, Pig and Hive).
  • 43. Components of Hadoop Zookeeper- Zookeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a Hadoop cluster. Zookeeper is responsible for synchronization service, distributed configuration service and for providing a naming registry for distributed systems.  AMBARI- for RESTful API, mainly web user interface.  MAHOUT- for machine learning.
  • 44. Advantages of Hadoop Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatic distributes the data and work across the machines and in turn, utilizes the underlying parallelism of the CPU cores. Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer. Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption. Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.
  • 45. Formatting, Cleaning Storage Data Data Understanding Data Access Data Integration Data Analysis Data Visualization Computational view of Bigdata
  • 46. Formatting, Cleaning Storage Data Data Understanding Data Access Data Integration Data Analysis Data Visualization Computer Vision Natural Language Processing Speech Recognition Signal Processing Databases Information Retrieval Data Warehousing Data Mining Machine Learning Human-Computer Interaction Information Theory Many Applications! Topics related to Bigdata
  • 47. HDFS Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several machines and replicated to ensure their durability to failure and high availability to parallel application. HDFS Concepts Blocks  Name Node  Data Node Block: A Block is the minimum amount of data that it can read or write. HDFS blocks are 128 MB by default and this is configurable. Files n HDFS are broken into block-sized chunks, which are stored as independent units.
  • 48. HDFS Concepts Name Node: HDFS works in master-worker pattern where the name node acts as master. Name Node is controller and manager of HDFS as it knows the status and the metadata of all the files in HDFS; the metadata information being file permission, names and location of each block. The metadata are small, so it is stored in the memory of name node,allowing faster access to data. Moreover the HDFS cluster is accessed by multiple clients concurrently,so all this information is handled by a single machine.  The file system operations like opening, closing, renaming etc. are executed by it.
  • 49. HDFS Concepts  Data Node: DataNodes store and retrieve blocks when they are told to by clients or the NameNode. They report back to the NameNode periodically with the list of blocks they are storing. The DataNode, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the NameNode.
  • 50. Hadoop Modes of Installation  Standalone / local mode- After downloading Hadoop on your system, it is configured in standalone mode by default and can be run as a single Java process.  Pseudo-distributed mode [Labwork]- A distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate Java process. This mode is useful for development.  Fully distributed mode [Realtime]- Fully distributed, with a cluster of at least two machines.
  • 51. Standalone / local mode-  Single machine  No daemons are running  Everything runs in a single JVM  Standard OS storage  Good for development and testing with small data, but will not catch all errors.  Pseudo-distributed mode-  Single machine, but a cluster is simulated  Daemons run in separate JVMs  Separate JVMs on a single node  Good for development and debugging
  • 52. Fully Distributed Mode  Hadoop runs on a cluster of machines  Daemons run on separate nodes  Production environment  Good for staging and production
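One quick way to see which mode an installation is in is to inspect the default filesystem URI from Java. This is a rough sketch; the only assumptions are the property names fs.defaultFS (fs.default.name on older releases) and that core-site.xml is on the classpath.

import org.apache.hadoop.conf.Configuration;

public class ModeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();  // reads core-site.xml from the classpath
        // file:/// in standalone mode, hdfs://localhost:... in pseudo-distributed mode,
        // hdfs://<namenode-host>:... on a fully distributed cluster.
        System.out.println(conf.get("fs.defaultFS", conf.get("fs.default.name")));
    }
}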
  • 53. Hadoop Architecture  5 Daemons  NameNode  DataNode  JobTracker  TaskTracker  Secondary NameNode
  • 54.
  • 55. Functionality of Daemons  NameNode  NameNode is the centerpiece of HDFS and is also known as the master. The NameNode stores only the metadata of HDFS (the directory tree of all files in the file system) and tracks the files across the cluster; it does not store the actual data or the dataset. The data itself is stored in the DataNodes. The NameNode knows the list of blocks and their locations for any given file in HDFS, and with this information it knows how to reconstruct the file from its blocks. The NameNode is so critical to HDFS that when it is down, the HDFS/Hadoop cluster is inaccessible and considered down; it is a single point of failure in a Hadoop cluster. The NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
  • 56. Functionality of Daemons  DataNode  DataNode is responsible for storing the actual data in HDFS and is also known as the slave. The NameNode and DataNodes are in constant communication. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for. When a DataNode goes down, it does not affect the availability of data or of the cluster: the NameNode arranges replication of the blocks managed by the DataNode that is no longer available. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNodes.
  • 57. Functionality of Daemons  Job Tracker – The JobTracker process runs on a separate node, not usually on a DataNode. The JobTracker is an essential daemon for MapReduce execution in MRv1; it is replaced by the ResourceManager/ApplicationMaster in MRv2. (MRv1 uses the JobTracker to create and assign tasks to DataNodes, which can become a resource bottleneck when the cluster scales out far enough, usually around 4,000 nodes.) The JobTracker receives requests for MapReduce execution from the client, talks to the NameNode to determine the location of the data, and finds the best TaskTracker nodes to execute tasks based on data locality (proximity of the data) and the slots available to execute a task on a given node.
  • 58. Functionality of Daemons  Job Tracker – The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client. The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution: when the JobTracker is down, HDFS is still functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.  The JobTracker locates TaskTracker nodes with available slots at or near the data and submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and their work is scheduled on a different TaskTracker.
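To make the client-to-JobTracker interaction concrete, the sketch below submits a job through the old (MRv1) org.apache.hadoop.mapred API. The input and output paths are assumptions, and no mapper or reducer is set, so the identity classes are used and the job simply copies its input.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitToJobTracker {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitToJobTracker.class);
        conf.setJobName("identity-copy");

        // A real job would also call conf.setMapperClass(...) and conf.setReducerClass(...).
        FileInputFormat.setInputPaths(conf, new Path("/data/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        // runJob() hands the job to the JobTracker, which asks the NameNode for block
        // locations and schedules tasks on TaskTrackers close to those blocks.
        JobClient.runJob(conf);
    }
}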
  • 59. Heart Beat Mechanism  • To process the data, the JobTracker assigns tasks to the TaskTrackers. • Suppose that, while processing is going on, one DataNode in the cluster goes down. • The NameNode must know that this DataNode is down; otherwise it cannot continue processing by using the replicas. • To make the NameNode aware of the status (active/inactive) of DataNodes, each DataNode sends a "Heart Beat Signal" at a regular interval (every 3 seconds by default); if no heartbeat arrives for about 10 minutes, the DataNode is marked as dead. • This mechanism is called the HEART BEAT MECHANISM.
  • 60. Functionality of Daemons  Job Tracker –  Responsible for scheduling and rescheduling tasks in the form of MapReduce jobs.  The JobTracker often resides alongside the NameNode on small clusters, or on its own dedicated node on larger ones.
  • 61. Functionality of Daemons  TaskTracker – TaskTracker runs on DataNodes, usually on every DataNode. The TaskTracker is replaced by the NodeManager in MRv2. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers. TaskTrackers are assigned Mapper and Reducer tasks by the JobTracker and stay in constant communication with it, signalling the progress of the task in execution. TaskTracker failure is not considered fatal: when a TaskTracker becomes unresponsive, the JobTracker assigns the task executed by that TaskTracker to another node.
  • 62. Functionality of Daemons  TaskTracker –  Responsible for instantiating and maintaining individual map and reduce tasks.  Known as the worker (software) daemon of the Hadoop architecture.  The map and reduce tasks it executes together make up MapReduce (MR) jobs.
  • 63. Functionality of Daemons  Secondary NameNode: The Secondary NameNode in Hadoop is a specially dedicated node in the HDFS cluster whose main function is to take checkpoints of the file system metadata present on the NameNode. It is not a backup NameNode; it only checkpoints the NameNode's file system namespace. The Secondary NameNode is a helper to the primary NameNode, not a replacement for it. As the NameNode is the single point of failure in HDFS, if the NameNode fails the entire HDFS file system is lost; to mitigate this, Hadoop implemented the Secondary NameNode, whose main function is to store a copy of the FsImage file and the edits log file. The FsImage is a snapshot of the HDFS file system metadata at a certain point in time, and the EditLog is a transaction log containing a record of every change that occurs to the file system.
  • 64. Functionality of Daemons  Secondary NameNode: In later Hadoop releases its role can also be performed by a Checkpoint Node. It runs on a separate physical node. When the NameNode goes down, the checkpoint it holds helps rebuild the file system state, but it never takes over the full functionality of the NameNode: it only reads and merges the fsimage (namespace image) and the edit log.
  • 65. What is the Checkpoint Mechanism?  It is the mechanism used in a Hadoop cluster to make sure all the metadata information of the Hadoop file system gets updated in the two persistent files:  the fsimage (file system namespace image)  the edit log  The checkpoint mechanism can be configured to run, for example, hourly, daily or weekly.
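The checkpoint interval is driven by configuration. The sketch below just prints the relevant settings; the property names fs.checkpoint.period and fs.checkpoint.size are the Hadoop 1.x names (later releases renamed them to dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns), and the default values shown are assumptions to verify against your distribution.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often the Secondary NameNode triggers a checkpoint (seconds).
        System.out.println("checkpoint period (s): "
                + conf.getLong("fs.checkpoint.period", 3600));
        // A checkpoint is also forced once the edit log grows past this size (bytes).
        System.out.println("checkpoint size (bytes): "
                + conf.getLong("fs.checkpoint.size", 67108864));
    }
}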
  • 66. What is Speculative Execution in Hadoop?  If a task is running much more slowly than the other tasks of a job (for example because its node is slow or partly failing), the JobTracker launches a duplicate copy of the same task on another idle node in the cluster; whichever copy finishes first is used and the other is killed. This is known as speculative execution. It prevents a single slow node from delaying the whole job, so the end user does not feel any delay.
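Speculative execution can be switched on or off per job. A minimal sketch using the MRv1 JobConf API follows; whether to disable it (for example for tasks with external side effects) is a per-job decision.

import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExecutionToggle {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Speculative execution is normally on; it is often turned off for reducers
        // that write directly to an external system.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(false);

        System.out.println("map speculative: " + conf.getMapSpeculativeExecution());
        System.out.println("reduce speculative: " + conf.getReduceSpeculativeExecution());
    }
}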
  • 67.
  • 68. Network Topology  • A Hadoop cluster architecture consists of a two-level network topology.  • Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack  • and an uplink to a core switch or router (normally 1 Gb or better).
  • 70. Rack awareness  • To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network.  • If your cluster runs on a single rack, then there is nothing more to do, since this is the default.  • For multi-rack clusters, you need to map nodes to racks.  • By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes.  • Network locations such as nodes and racks are represented in a tree, which reflects the network “distance” between locations. The NameNode uses the network location when determining where to place block replicas; the MapReduce scheduler uses network locations to determine the closest replica for a map task's input.
  • 71. Rack awareness  • With two racks behind one switch, the rack topology can be described by the network locations /switch1/rack1 and /switch1/rack2, or more simply as /rack1 and /rack2.  • The Hadoop configuration must specify a map between node addresses and network locations. The map is described by a Java interface, DNSToSwitchMapping.  • The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.  • The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings.
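A minimal, purely illustrative implementation of the resolve() contract is sketched below; the 10.1. address prefix and the rack names are assumptions, and a real deployment would typically read a lookup table or call a topology script instead. Later Hadoop releases add reloadCachedMappings() methods to the interface; they are included here as no-ops so the sketch also compiles there.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

public class PrefixRackMapping implements DNSToSwitchMapping {

    // Maps each node address to a network location string, e.g. /rack1.
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            racks.add(name.startsWith("10.1.") ? "/rack1" : "/rack2");
        }
        return racks;
    }

    // This sketch keeps no cache, so there is nothing to reload.
    public void reloadCachedMappings() {
    }

    public void reloadCachedMappings(List<String> names) {
    }
}

Such a class would then be named in the topology.node.switch.mapping.impl property so that the namenode and jobtracker pick it up.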
  • 72. Parallel Copying with distcp  • So far, the HDFS access patterns we have seen have been single-threaded.  • To process files in parallel, you would otherwise have to write a program yourself.  • Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.  % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar  This will copy the /foo directory (and its contents) from the first cluster to the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo. If /bar doesn’t exist, it will be created first. You can specify multiple source paths, and all will be copied to the destination.
  • 73. Parallel Copying with distcp •There are more options to control the behavior of distcp, including ones to preserve file attributes, ignore failures, and limit the number of files or total data copied. •distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. • There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data, by bucketing files into roughly equal allocations. • If you try to use distcp between two HDFS clusters that are running different versions, the copy will fail if you use the hdfs protocol, since the RPC systems are incompatible. •To overcome this you can use the read-only HTTP-based HFTP filesystem to read from the source.
