Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses the limitations of traditional RDBMSs for big data by scaling out to large clusters of commodity servers, tolerating hardware failures, and distributing processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing, surrounded by an ecosystem of additional tools such as Pig, Hive, and HBase. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Big Data with Hadoop and HDInsight. This is an introduction to the technology: if you are new to Big Data, or have only just heard of it, this presentation helps you learn a little more about it.
Having trouble distinguishing Big Data, Hadoop, and NoSQL, or seeing the connections among them? This slide deck from the Savvycom team can definitely help.
Enjoy reading!
Big data today is a challenge to be managed, not a barrier to growing a business. Data storage is relatively inexpensive, and with ever more transactions generated by social media, machines, and sensors, data has grown piece by piece into petabytes.
This slide deck explains the challenges of Big Data (Volume, Velocity, and Variety) and offers solutions for managing them.
Many tools can help solve these problems, but the main focus of this deck is Apache Hadoop.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
This presentation simplifies the concepts of Big Data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Hadoop has shown itself to be a great tool for resolving problems with the data aspects (Velocity, Variety, and Volume) that cause trouble for relational database storage. In this presentation you'll learn what data problems occur nowadays and how Hadoop can solve them. You'll also learn about the basic Hadoop components and the principles that make Hadoop such a great tool.
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
As users gain more experience with Hadoop, they are building on their early success and expanding the size and scope of Hadoop projects. Syncsort’s third annual Hadoop Market Adoption Survey reflects the fact that Hadoop is no longer considered a technology for the future as it was when we first started conducting this research.
Get an in-depth look at the survey results and five trends to watch for in 2017. You’ll also learn:
• The best uses for Hadoop in 2017 - real-world examples of how enterprises are realizing the value of Big Data
• Solutions to help you address the challenges enterprises still face in employing Hadoop
• What the future of Hadoop means for your business
This presentation introduces the concepts of Big Data in layman's language. The author does not claim originality: the presentation was compiled from various sources, and no copyright is claimed.
Big data is rising exponentially in today's information age. This presentation clarifies the concept and the hype revolving around it.
This slide deck was shared by Mr. Minh Tran, KMS's Software Architect, at the "Java - Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.
http://www.learntek.org/product/big-data-and-hadoop/
http://www.learntek.org
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and management courses. We are dedicated to designing, developing, and implementing training programs for students, corporate employees, and business professionals.
We provide Hadoop training in Hyderabad and Bangalore, including corporate training, by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
Hadoop Administrator online training course (by Knowledgebee Trainings) covering Hadoop cluster planning and deployment, monitoring, performance tuning, security using Kerberos, HDFS high availability using the Quorum Journal Manager (QJM), Oozie, and HCatalog/Hive administration.
Contact: knowledgebee@beenovo.com
Hadoop is a booming and innovative data analytics technology that can effectively handle Big Data problems and achieve data security. It is an open-source, trending technology covering data collection, data processing, and data analytics using HDFS (the Hadoop Distributed File System) and the MapReduce paradigm.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graphs: SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is compact and fast to traverse.
Multiply with different modes (map)
1. Performance of sequential vs. OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs. bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs. OpenMP-based vector element sum.
2. Performance of memcpy-based vs. in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computation and thus can also cut iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
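As a concrete illustration of the first technique above, here is a minimal Java sketch (illustrative, not taken from the notes) of power-iteration PageRank over a CSR graph that skips vertices whose ranks have already converged; it assumes no dangling nodes, and all names are made up for this example.

// Minimal sketch of PageRank over a CSR graph with convergence skipping.
// Assumes no dangling nodes (every vertex has outDegree > 0).
public final class PageRankSkip {
  public static double[] rank(int[] offsets, int[] inEdges, int[] outDegree,
                              double damping, double tolerance, int maxIters) {
    int n = offsets.length - 1;               // number of vertices
    double[] rank = new double[n], next = new double[n];
    boolean[] converged = new boolean[n];
    java.util.Arrays.fill(rank, 1.0 / n);
    for (int iter = 0; iter < maxIters; iter++) {
      boolean allDone = true;
      for (int v = 0; v < n; v++) {
        if (converged[v]) { next[v] = rank[v]; continue; }  // skip settled vertices
        double sum = 0;
        for (int i = offsets[v]; i < offsets[v + 1]; i++) {
          int u = inEdges[i];                 // in-neighbour of v
          sum += rank[u] / outDegree[u];
        }
        next[v] = (1 - damping) / n + damping * sum;
        if (Math.abs(next[v] - rank[v]) < tolerance) converged[v] = true;
        else allDone = false;
      }
      double[] t = rank; rank = next; next = t; // swap rank buffers
      if (allDone) break;
    }
    return rank;
  }
}

Note that the skip is a heuristic: a vertex marked converged is never revisited even if its in-neighbours later move, which is exactly the accuracy-versus-iteration-time trade-off the note alludes to.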
2. OUTLINE
• Data Generation Sources
• Per minute data evaluation
• What is Big Data?
• Limitations of RDBMS
• What is Hadoop?
• History of Hadoop
• Hadoop Core Components
• Hadoop Architecture
• Hadoop Ecosystem
3. OUTLINE
• Hadoop v1 vs. Hadoop v2
• Hadoop Distributions
• Who uses Hadoop?
• Overview of Data Lake
6. '32 BILLION DEVICES PLUGGED IN & GENERATING DATA BY 2020'
❑ The EMC Digital Universe Study launched its seventh edition. According to the study, by 2020 the amount of data in our digital universe is expected to grow from 4.4 trillion GB to 44 trillion GB. (11th April 2014)
❑ According to computer giant IBM, "2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. That's big by anyone's standards. About 75% of data is unstructured, coming from sources such as text, voice and video." (Mr. Miles)
8. WHAT IS BIG DATA?
Gartner: "Big Data is high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
9. BIG DATA: Volume, Velocity and Variety
Volume
• Refers to the vast amounts of data generated every second.
Velocity
• Refers to the speed at which new data is generated and the speed at which data moves around.
Variety
• Refers to the different types of data generated from different sources.
10. BIG DATA HAS ALSO BEEN DEFINED BY THE FIVE V's
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
11. BIG DATA: Veracity
• Veracity refers to the biases, noise, and abnormality in data.
• Is the data being stored and mined meaningful to the problem being analyzed?
12. BIG DATA: Value
• Data has intrinsic value, but it must be discovered.
• A range of quantitative and investigative techniques exists to derive value from Big Data.
• Technological breakthroughs make much more accurate and precise decisions possible.
• Exploring the value in Big Data requires experimentation and exploration, whether creating new products or looking for ways to gain competitive advantage.
14. TURNING BIG DATA INTO VALUE
[Diagram: data sources, characterized by Volume, Velocity, Variety, and Veracity, are analysed to turn Big Data into Value.]
• Data sources: ERP, CRM, inventory, finance, conversations, voice, social media, browser logs, photos, videos, logs, sensors, etc.
• Analysing Big Data: predictive analysis, text analytics, sentiment analysis, image processing, voice analytics, movement analytics, etc.
16. LIMITATIONS OF RDBMS TO SUPPORT "BIG DATA"
• Designed and structured to accommodate structured data.
• As data sizes have increased tremendously, RDBMSs find it challenging to handle such huge data volumes.
• Lack high velocity, being designed for steady data retention rather than rapid growth.
• Not designed for distributed computing.
• Many issues when scaling up for massive datasets.
• Require expensive, specialized hardware.
• Even if an RDBMS is used to handle and store "Big Data", it turns out to be very expensive.
18. WHAT IS HADOOP?
• Hadoop is an open-source software framework.
• Allows the distributed storage and processing of large data sets across clusters of commodity hardware.
• Uses simple programming models for processing.
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
• Stores files in the form of blocks.
19. WHAT IS HADOOP? (cont.)
• Hadoop is an open-source implementation of Google's MapReduce and GFS (Google File System).
• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.
20. HISTORY OF HADOOP
• 2003 - Doug Cutting and Mike Cafarella's Nutch project sets out to handle billions of searches and index millions of web pages.
• Oct 2003 - Google publishes its GFS (Google File System) paper.
• Dec 2004 - Google publishes its MapReduce paper.
• 2005 - Nutch adopts GFS-style storage and MapReduce for its operations.
• 2006 - Yahoo!, with Doug Cutting and team, creates Hadoop based on GFS and MapReduce.
• 2007 - Yahoo! starts using Hadoop on a 1000-node cluster.
• Jan 2008 - Hadoop becomes a top-level Apache project.
• Jul 2008 - A 4000-node cluster is successfully tested with Hadoop.
• 2009 - Hadoop successfully sorts a petabyte of data in less than 17 hours.
21. HADOOP CORE COMPONENTS
HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop.
HDFS
• HDFS is a distributed file system that provides high-throughput access to data.
• It provides a limited interface for managing the file system, which allows it to scale and deliver high throughput.
• HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
22. HADOOP CORE COMPONENTS
MapReduce
• MapReduce is a framework for performing distributed data processing using the MapReduce programming paradigm.
• Each job has a user-defined map phase and a user-defined reduce phase in which the output of the map phase is aggregated.
• HDFS is the storage system for both the input and the output of MapReduce jobs.
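To make the two phases concrete, here is the canonical WordCount example from the Hadoop MapReduce tutorial (Hadoop 2.x API), lightly commented; the input and output paths are HDFS paths supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reduce phase: aggregate the map output by summing the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}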
23. HDFS OVERVIEW
• Based on Google's GFS (Google File System).
• Provides redundant storage of massive amounts of data using commodity hardware.
• Data is distributed across all nodes at load time.
• Provides for efficient MapReduce processing; operates on top of an existing filesystem.
• Files are stored as 'blocks', each replicated across several DataNodes.
• The NameNode stores metadata and manages access.
• No data caching, due to the large size of the datasets.
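As a small illustration of how a client touches HDFS, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the path and file contents are illustrative, and the configuration is assumed to point at a running HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // the configured HDFS
    Path path = new Path("/tmp/hello.txt");     // illustrative path

    // Write: HDFS splits the file into blocks and replicates each block
    // across DataNodes; the client only ever sees a stream.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello, hdfs");
    }
    // Read it back.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}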
25. HADOOP ARCHITECTURE
NameNode
• Stores all metadata: filenames, the locations of each block on the DataNodes, file attributes, etc.
• Keeps metadata in RAM for fast lookup.
• Filesystem metadata size is therefore limited by the amount of RAM available on the NameNode.
DataNode
• Stores file contents as blocks.
• Different blocks of the same file are stored on different DataNodes.
• Periodically sends a report of all existing blocks to the NameNode.
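A short sketch of the metadata flow just described: the client asks the NameNode (through the FileSystem API) where a file's blocks live, while the block data itself stays on the DataNodes. The command-line argument is an illustrative HDFS file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // Metadata query answered by the NameNode: one entry per block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // Each block reports the DataNodes holding a replica.
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}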
29. HADOOP ECOSYSTEM (CONT.)
❑ Pig: a high-level data-flow language and execution framework for parallel computation.
❑ Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
❑ Sqoop: a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
❑ HBase: a scalable, distributed database that supports structured data storage for large tables.
32. HADOOP DISTRIBUTIONS
➢ Let's say we download Apache Hadoop and MapReduce from http://hadoop.apache.org/
➢ At first it works great, but then we decide to start using HBase.
➢ No problem: just download HBase from http://hadoop.apache.org/ and point it at your existing HDFS installation.
➢ But we find that HBase only works with a previous version of HDFS, so we downgrade HDFS, and everything still works great.
➢ Later on we decide to add Pig.
➢ Unfortunately, this version of Pig doesn't work with our version of HDFS; it wants us to upgrade.
➢ But if we upgrade, we will break HBase.
33. HADOOP DISTRIBUTIONS
Hadoop distributions aim to resolve version incompatibilities.
Distribution vendors will:
• Integration-test a set of Hadoop products.
• Package Hadoop products in various installation formats (Linux packages, tarballs, etc.).
• Possibly provide additional scripts for executing Hadoop.
• Sometimes backport features and bug fixes made by Apache.
Typically, vendors employ Hadoop committers, so the bugs they fix make it into the Apache repository.
34. DISTRIBUTION VENDORS
• Cloudera Distribution for Hadoop (CDH)
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Greenplum
• IBM BigInsights
35. CLOUDERA DISTRIBUTION FOR HADOOP (CDH)
• Cloudera has taken the lead in providing a Hadoop distribution.
• Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles.
• The most popular distribution: http://cloudera.com/hadoop, 100% open source.
• Cloudera employs a large percentage of core Hadoop committers.
• CDH is provided in various formats: Linux packages, virtual machine images, and tarballs.
• Integrates the majority of popular Hadoop products: HDFS, MapReduce, HBase, Hive, Oozie, Pig, Sqoop, ZooKeeper, Flume, etc.
36. SUPPORTED OPERATING SYSTEMS
• Each distribution supports its own list of operating systems.
• Commonly supported:
Red Hat Enterprise Linux
CentOS
Oracle Linux
Ubuntu
SUSE Linux Enterprise Server
38. OVERVIEW OF DATA LAKE
A Data Lake is a large storage repository and processing engine. Data Lakes provide "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs."