What is Data?
• What is data?
• And why should we care about it?
What is Big Data?
• Big data is a collection of data sets so large and
complex that it becomes difficult to process using
traditional data processing applications.
• Web logs
• Social data: Facebook, LinkedIn, Twitter
• Call Detail Records
• Large-Scale e-commerce
• Medical Records
• Video archives
• Atmospheric Science
• Media & Advertising.
What is Big Data?
• Ancestry.com stores around 2.5 petabytes of Data.
• The New York Stock Exchange generates about one
terabyte of new trade data per day.
• The Internet Archive stores around 2 petabytes of
data, and is growing at a rate of 20 terabytes per
month.
How to Process Big Data?
• Need to process large datasets (>100TB)
• Only reading 100TB of data can be overwhelming
• Takes ~11 days to read on a standard computer
• Takes a day across a 10 Gbit link (very high-end storage)
• On a single node (@50 MB/s): ~23 days
• On a 1000-node cluster: ~33 min
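The timings above follow from simple arithmetic. A short sketch (the 100 TB dataset size and 50 MB/s per-node read rate come from the bullets above; the even-split assumption is ours):

```python
# Rough read-time estimate for scanning a 100 TB dataset.
DATASET_BYTES = 100 * 10**12   # 100 TB
NODE_RATE = 50 * 10**6         # 50 MB/s per node

def read_time_seconds(nodes: int) -> float:
    """Time to scan the dataset if it is split evenly across `nodes` nodes."""
    return DATASET_BYTES / (NODE_RATE * nodes)

print(f"1 node:     {read_time_seconds(1) / 86400:.0f} days")      # ~23 days
print(f"1000 nodes: {read_time_seconds(1000) / 60:.0f} minutes")   # ~33 minutes
```

Dividing the work across 1000 machines turns a three-week scan into half an hour, which is the core argument for a cluster.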
Not So Easy…
• The challenges are in search, sharing, transfer, analysis, and visualization.
• Moving data from a storage cluster to a computation cluster is not feasible.
• In large clusters failure is expected: computers fail every day.
• It is very expensive to build reliability into each application.
• Massively parallel software runs on tens, hundreds, or even thousands of servers.
• A programmer has to worry about errors, data motion, communication, and synchronization.
What Are We Looking For?
• A common infrastructure and standard set of tools to
handle this complexity.
• An efficient, reliable, fault-tolerant, and usable framework.
What is Hadoop?
• It's a framework that allows distributed processing of
large data sets across clusters of computers.
• It is designed to scale up from single servers to
thousands of machines.
• It is also designed to run on commodity hardware.
Scalable: stores and processes petabytes; hardware can be
added without needing to change data formats or how
applications are written.
Economical: runs on clusters of thousands of commodity machines.
Efficient: runs tasks where the data is located.
Flexible: Hadoop is schema-less, and can absorb any
type of data, structured or not, from any number of
sources.
Fault tolerant: when you lose a node, the system
redirects work to another location of the data and
continues processing without missing a beat.
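The fault-tolerance point can be sketched as a toy scheduler (a hypothetical model, not Hadoop code; the block names, node names, and replication factor of 3 are our assumptions): each block lives on several nodes, so when one node fails, work is simply redirected to another node holding a replica.

```python
# Toy model: each block is replicated on 3 nodes (HDFS-style replication).
replicas = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}

def assign(block, failed):
    """Pick a live node that holds a replica of the block."""
    for node in replicas[block]:
        if node not in failed:
            return node
    raise RuntimeError("all replicas lost")

print(assign("blk_1", failed={"node1"}))  # node2 takes over node1's work
```

The application never sees the failure: the scheduler just picks the next replica holder.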
Hadoop Is Useful For…
• Batch Data Processing.
• Log Processing.
• Document Analysis & Indexing.
• Text Mining.
• Crawl Data Processing.
• Highly parallel data intensive distributed applications.
Use The Right Tool For The Right Job
When to use?
• Write once read many times.
• Structured or Not (Agility)
• Batch Processing
When not to use?
• Interactive Reporting (<1sec)
• Multistep Transactions
• Lots of Inserts/Updates/Deletes
Hadoop Nodes-Name Node
• Manages the filesystem namespace and metadata
• Replicates missing blocks
• No file data goes through the NameNode
• NameNode mainly consists of:
fsimage: Contains a checkpoint copy of the metadata on disk
edit logs: Records all write operations, synchronizes with
metadata in RAM after each write
In case of a power failure on the NameNode, the metadata can be
recovered using fsimage + edit logs
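The fsimage + edit-log recovery idea can be illustrated with a toy model (an in-memory sketch with made-up paths and block names, not real HDFS code): the checkpoint holds the namespace at some point in time, and replaying the logged writes brings it back up to date.

```python
# Toy model of NameNode recovery: fsimage is a checkpoint of the
# filesystem metadata; the edit log records every write since then.
fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}
edit_log = [
    ("create", "/data/c.txt", ["blk_3"]),
    ("delete", "/data/a.txt", None),
]

def recover(checkpoint, edits):
    """Rebuild the in-RAM namespace by replaying edits on the checkpoint."""
    namespace = dict(checkpoint)
    for op, path, blocks in edits:
        if op == "create":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

print(recover(fsimage, edit_log))
# {'/data/b.txt': ['blk_2'], '/data/c.txt': ['blk_3']}
```

This is also why the edit log must be flushed to disk on every write: anything not logged is lost on failure.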
Hadoop Nodes-CheckPoint Node
• Periodically creates checkpoints of the NameNode filesystem metadata
• The Checkpoint node should run on a different machine
than the NameNode
• Should have same storage requirements as NameNode
• There can be many Checkpoint nodes per cluster
Hadoop Nodes-BackUp Node
• The difference with the Checkpoint node is that it keeps an up-
to-date copy of the metadata in RAM
• Same RAM requirements as NameNode
• Can only have one Backup node per cluster
Hadoop Nodes-Data Node
• Can be many per Hadoop cluster
• Manages blocks of data and serves them to clients
• Periodically reports to the NameNode the list of
blocks it stores
• Use inexpensive commodity hardware for this role
Hadoop Nodes-Job Tracker
One per Hadoop cluster (multiple NameNodes can be configured in Hadoop 2.2 or later versions)
• Receives job requests submitted by clients
• Schedules and monitors MapReduce jobs on TaskTrackers
Hadoop Nodes-Task Tracker
• Can be many per Hadoop cluster
• Executes MapReduce operations
• Reads blocks from DataNodes
What is MapReduce?
• Operates on <key, value> pairs.
• Two major functions: Map() and Reduce().
• The framework handles input formats and splits.
• The framework decides the number of tasks.
• Provides status about jobs to users.
• Monitors task progress.
Input and Output
The MapReduce framework operates on <key, value> pairs.
It views the input to the job as a set of <key, value> pairs and
produces a set of <key, value> pairs as the output of the job.
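A minimal word-count sketch (our own example, not from the slides; the function and variable names are illustrative) shows this <key, value> flow: Map() emits a (word, 1) pair per word, the framework groups pairs by key, and Reduce() sums each group.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map(): emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce(): sum all the counts collected for one word."""
    return word, sum(counts)

def run_job(lines):
    """Simulate a MapReduce job on a list of input lines."""
    groups = defaultdict(list)          # shuffle: group values by key
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["big data is big", "data is data"]))
# {'big': 2, 'data': 3, 'is': 2}
```

On a real cluster, the map and reduce calls run in parallel on TaskTrackers, and the shuffle moves intermediate pairs between nodes; the programmer only writes the two functions.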
What is Hive?
• It's a data warehouse system for Hadoop,
providing data summarization, query, and analysis.
What is Pig?
• It's a high-level platform for creating MapReduce
programs used with Hadoop.
• Developed by Yahoo.
What is HBase?
• Used when you need random, real-time read/write access to
your Big Data.
• Also used for storing historical data.
What is Hue?
• It's a web application for interacting with Apache Hadoop.
• It supports a file browser, a job tracker interface, Hive, Pig,
and more.
What is Sqoop?
• It's a command-line interface application for transferring
data between relational databases and Hadoop.
• Microsoft uses a Sqoop-based connector to help transfer
data from Microsoft SQL Server databases to Hadoop.
What is Flume?
• It's used for efficiently collecting, aggregating, and
moving large amounts of distributed data or log data.