This document discusses big data and how Hadoop addresses the problems of storing and processing extremely large datasets. It introduces Hadoop and its main components: HDFS for distributed storage and MapReduce for distributed processing. Hadoop lets applications run on large clusters of commodity hardware, handling failures automatically and scaling easily. The document gives examples of how MapReduce and Hive are used and describes a Twitter sentiment analysis application.
2. Contents
What is Big Data?
Limitations of existing solutions
How Hadoop solves the problem
Introduction to Hadoop
Hadoop Ecosystem
Hadoop main Components
MapReduce execution
File Read and Write
Sentiment Analysis
4. Big Data
Extremely large datasets (data in TBs and PBs):
Facebook has the world's largest Hadoop cluster, with 400 TB of data in 2011 (currently 22 PB) and around 20 TB of new data generated per day,
NYSE generates about 1 TB of data per day,
The Internet Archive stores around 2 PB of data and is growing at a very fast rate,
The Wayback Machine is an example of an Internet Archive store: a digital archive of the World Wide Web and other information on the internet, whose intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down,
7. Limitations of existing solutions
Slow to process (a rough worked example follows this list),
Transfer rates and seek times of typical storage devices:
IDE drive – 75 MB/s, 10 ms
SATA drive – 300 MB/s, 8.5 ms
SSD – 800 MB/s, 2 ms
Scaling is expensive,
Unreliable machines: risk of data loss
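As a rough worked example of "slow to process" (figures assumed only for illustration): scanning 1 TB sequentially from a single disk at 75 MB/s takes about 10^6 MB / 75 MB/s ≈ 13,300 s, i.e. roughly 3.7 hours, before any computation starts. Spreading the same 1 TB across 100 disks read in parallel cuts this to a little over two minutes, which is exactly the kind of parallel I/O Hadoop is designed to exploit.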
10. Introduction to Hadoop
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on clusters of computers.
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be automatically handled in software by the framework.
11. In December 2004, Google Labs published a paper on
the MapReduce algorithm, which allows very large scale computations to be
trivially parallelized across large clusters of servers.
Doug Cutting, an employee at Yahoo, recognized the importance of this paper
and applied its ideas to the extremely large search problems he was working on.
In 2005, he created the open-source Hadoop framework that allows
applications based on the MapReduce paradigm to be run on large clusters of
commodity hardware.
13. Hadoop main components
Two main components:
HDFS – Hadoop Distributed File System (storage):
Data is distributed across nodes (DataNodes),
The NameNode tracks block locations,
Self-healing, high-bandwidth clustered storage (see the HDFS API sketch after this list),
MapReduce (processing):
Splits a job into tasks across processors,
The JobTracker manages the TaskTrackers
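As an illustration of the storage side, here is a minimal sketch of writing and then reading a file through the HDFS Java API; the path and file contents are made up for the example, and the client is assumed to pick up the cluster configuration from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);          // handle to HDFS; the NameNode knows where blocks live
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: the client streams data and HDFS splits it into blocks stored on DataNodes
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations and data is streamed from the DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}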
15. Modes of working
Three modes:
Standalone mode (default): Hadoop does not use HDFS; files are kept on the local filesystem, which is helpful for debugging,
Pseudo-distributed mode (single-node cluster): the configuration files are set up to run everything on a single node, with R = 1 (a configuration sketch follows this list),
Fully distributed mode: Hadoop at full scale, on clusters of up to thousands of nodes; used when working on large data
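To sketch what "configuring the files" means for pseudo-distributed mode, the snippet below sets the two key properties programmatically; the same keys normally live in core-site.xml and hdfs-site.xml, and the localhost address and port are assumptions for a single-node setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PseudoModeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // single NameNode on this machine (assumed port)
        conf.set("dfs.replication", "1");                  // R = 1: only one DataNode to replicate to
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Filesystem in use: " + fs.getUri());
    }
}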
16. Replication and Block Size
The default replication factor is 3 and the default block size is 64 MB (128 MB is recommended),
Both can be changed through the configuration files (see the sketch below)
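A minimal sketch of changing these settings from Java; the property names mirror what would go in hdfs-site.xml (check hdfs-default.xml for your Hadoop version), and the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationAndBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                    // cluster-wide default replication factor
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB block size, as recommended

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file after it has been written (hypothetical existing file)
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
    }
}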
21. Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.
Originally developed by Facebook.
HiveQL – an SQL-like query language,
Hive queries are compiled into MapReduce jobs behind the scenes, so they run slower than an equivalent hand-written MapReduce program (see the sketch below),
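For illustration, here is a minimal sketch of running a HiveQL query from Java over the HiveServer2 JDBC driver (the hive-jdbc artifact is assumed to be on the classpath); the host, port, credentials, and the tweets table with its word column are assumptions made up for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // driver class from the hive-jdbc artifact
        String url = "jdbc:hive2://localhost:10000/default"; // HiveServer2 endpoint (assumed host/port)
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // HiveQL looks like SQL; behind the scenes Hive compiles it into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM tweets GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}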
27. Big Data – The road ahead of us
Huge repositories of structured and unstructured data across various digital platforms and social media,
Analysing it goes beyond traditional database methods,
Big data promises growth and long-term sustainability,
Threats: data integrity, security breaches
Editor's Notes
1) They have been archiving cached pages of websites onto their large cluster of Linux nodes. They revisit sites every few weeks or months and archive a new version if the content has changed,
1) Seek time - time a program or device takes to locate a particular piece of data
Hadoop's design principles were:
The system should manage and heal itself in case of failures,
Automatically and transparently route around failures,
Scale capacity proportionally as resources are added,
Lower latency,
Keep the core simple,
Store and process large amounts of data,
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
ZooKeeper is a centralized service for maintaining configuration information and naming, and for coordinating distributed services.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Apache HBase is used when we need random, realtime read/write access to Big Data.
The NameNode is the master; it is the metadata store of HDFS, i.e. it keeps track of all the files, their blocks, and the DataNodes holding each block.
It also maintains transaction logs of operations such as file creation and deletion.
There is also a secondary node for the NameNode, known as the SNN (Secondary NameNode): it connects to the NameNode at regular intervals and fetches the edit logs and the fsimage.
The edit logs contain the details of file additions, deletions, and so on.
The fsimage contains the inode details such as modification time, access time, and access permissions.
If the NameNode fails, the SNN already holds the edit logs and fsimage, so when the cluster is restarted the NameNode's fsimage is brought up to date automatically and there is no overhead of copying the edit logs at restart time, which saves time.
This is a Hadoop cluster. Each cluster contains racks, each rack contains DataNodes, and each file is split into blocks that are stored across those DataNodes.
The cluster also contains master nodes, i.e. the JobTracker and the NameNode.
R = 1 because only one JobTracker and NameNode is used.
Files 1 and 3 have r = 2.
Files 2, 4 and 5 have r = 3.
Executed in two phases – mapping and reducing,
The two phases are implemented by user-defined functions called the mapper and the reducer,
The map phase takes the user's input and feeds it into the mapper class,
The reduce phase processes the output generated by the mapper class,
Put simply, mapping is to filter and reducing is to aggregate (see the WordCount sketch below),
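To make the two phases concrete, below is a sketch of the classic WordCount job in the Hadoop Java API: the mapper "filters" each input line into (word, 1) pairs and the reducer "aggregates" the counts per word; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each input line into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}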