The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
How It Works…
LIVE classes
Class recordings
Module-wise quizzes, coding assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the Learning Management System
Complimentary Java Classes
Course Topics
Week 1
– Understanding Big Data
– Introduction to HDFS
Week 2
– Playing around with Cluster
– Data loading Techniques
Week 3
– MapReduce Basics, Types and Formats
– Use Cases for MapReduce
Week 4
– Analytics using Pig
– Understanding Pig Latin
Week 5
– Analytics using Hive
– Understanding HiveQL
Week 6
– NoSQL Databases
– Understanding HBase
Week 7
– Data Loading Techniques in HBase
– Zookeeper
Week 8
– Real world Datasets and Analysis
– Hadoop Project Environment
What Is Big Data?
Lots of data (terabytes or petabytes)
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade data per day, which is analyzed to determine trends for optimal trades.
Facebook Example
Facebook users spend 10.5 billion minutes
(almost 20,000 years) online on the social network.
An average of 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
Twitter has over 500 million registered users.
The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
79% of US Twitter users are more likely to recommend brands they follow.
67% of US Twitter users are more likely to buy from brands they follow.
57% of all companies that use social media for business use
Twitter.
Data Volume Is Growing Exponentially
Estimated global data volume:
2011: 1.8 ZB
2015: 7.9 ZB
The world's information doubles every two years.
Over the next 10 years:
The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of “files” enterprise data centers handle will grow by 75x.
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study.
Common Big Data Customer Scenarios
Industry/Vertical: Scenarios
Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis
Web & E-Tailing: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click-Fraud Detection
Retail: Point-of-Sale Transaction Analysis, Customer Churn Analysis, Sentiment Analysis
Common Big Data Customer Scenarios (Contd.)
Telecommunications: Customer Churn Prevention, Network Performance Optimization, Call Detail Record (CDR) Analysis, Analyzing the Network to Predict Failure
Government: Fraud Detection and Cyber Security
General (Cross-Vertical): ETL and Processing Engine
Hidden Treasure
Insight into data can provide a business advantage.
Some key early indicators can mean fortunes for a business.
More data enables more precise analysis.
What Big Companies Have To Say…
“Analyzing Big Data sets will become a key basis for competition.”
“Leaders in every sector will have to grapple with the implications of Big Data.”
– McKinsey
“Big Data analytics is rapidly emerging as the preferred solution to disruptive business and technology trends.”
“Enterprises should not delay implementation of Big Data analytics.”
– Gartner
“Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
“Prioritize Big Data projects that might benefit from Hadoop.”
– Forrester Research
What Is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It offers open-source data management with scale-out storage and distributed processing.
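To make the "simple programming model" concrete, here is a minimal sketch of the classic word-count job (the standard introductory MapReduce example, not part of the course material itself); it assumes the org.apache.hadoop.mapreduce API, and the class name and input/output paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional: combiner reduces shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}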
Hadoop History
Doug Cutting & Mike Cafarella start working on Nutch
Google publishes the GFS & MapReduce papers
Doug Cutting adds DFS & MapReduce support to Nutch
Yahoo! hires Cutting; Hadoop spins out of Nutch
NY Times converts 4 TB of image archives over 100 EC2 instances
Fastest sort of a TB: 3.5 minutes over 910 nodes
Cloudera founded
Facebook launches Hive: SQL support for Hadoop
Fastest sort of a TB: 62 seconds over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes
Doug Cutting joins Cloudera
Hadoop Summit 2009: 750 attendees
Main Components Of HDFS
NameNode:
master of the system
maintains and manages the blocks which are present on the DataNodes
DataNodes:
slaves which are deployed on each machine and provide the actual storage
responsible for serving read and write requests from the clients
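As a small illustration of this split (a sketch, not part of the course material), the snippet below asks the NameNode, through the FileSystem client API, where the blocks of a file live; the bytes themselves would then be read from the listed DataNodes. The file path is hypothetical and the client is assumed to be configured against an HDFS cluster:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when the default filesystem is HDFS
    Path file = new Path(args[0]);              // e.g. a file already loaded into HDFS (hypothetical)

    // The NameNode answers this metadata query: block offsets, lengths and replica hosts.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      // Each block is stored on several DataNodes (one per replica).
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}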
Secondary NameNode
Not a hot standby for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (a single point of failure) hands its metadata to the Secondary NameNode every hour for safekeeping.]
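The "every hour" above comes from a configurable checkpoint interval. As a tiny illustrative sketch (assuming Hadoop 1.x / CDH3-era property names, where the setting is fs.checkpoint.period with a default of 3600 seconds; later releases renamed it dfs.namenode.checkpoint.period), a client can read the effective value like this:

import org.apache.hadoop.conf.Configuration;

public class ShowCheckpointPeriod {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // loads core-site.xml / hdfs-site.xml from the classpath
    // Assumption: Hadoop 1.x property name; defaults to 3600 seconds (one hour).
    long periodSeconds = conf.getLong("fs.checkpoint.period", 3600);
    System.out.println("Secondary NameNode checkpoints every " + periodSeconds + " seconds");
  }
}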
NameNode Metadata
Meta-data in Memory
The entire metadata is in main memory
No demand paging of FS meta-data
Types of Metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. access time, replication factor
A Transaction Log
Records file creations, file deletions, etc.
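As an illustrative sketch (not from the course material), the snippet below surfaces some of that per-file metadata through the client API: length, block size, replication factor and access time. The directory argument is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileMetadata {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // List the metadata the NameNode keeps for each file in the given directory.
    for (FileStatus st : fs.listStatus(new Path(args[0]))) {
      System.out.println(st.getPath()
          + "  len=" + st.getLen()
          + "  blockSize=" + st.getBlockSize()
          + "  replication=" + st.getReplication()
          + "  accessTime=" + st.getAccessTime());
    }
    fs.close();
  }
}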
Big Data – It’s about Scale And Structure
(RDBMS / EDW / MPP vs. NoSQL / Hadoop)
Data Types: Structured vs. multi- and unstructured
Processing: Limited, no data processing vs. processing coupled with data
Governance: Standards and structured vs. loosely structured
Schema: Required on write vs. required on read
Speed: Reads are fast vs. writes are fast
Cost: Software license vs. support only
Resources: Known entity vs. growing, complex, wide
Best-Fit Use: Interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing
Assignments
Attempt the following assignments using the documents present in the LMS:
Hadoop Installation - Cloudera CDH3
Execute Linux Basic Commands
Execute HDFS Hands On
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2.
Robust: Since Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it can gracefully handle such failures, and computation does not stop because of a few failed devices or systems.
Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster.
Simple: It is easy to write efficient parallel programs with Hadoop.
Data is transferred from the DataNode to the MapTask process as follows (DBlk is the file data block; CBlk is the file checksum block):
1. File data is transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data is first fetched into the DataNode JVM buffer and then pushed to the client (details not shown). Both file data and checksum data are bundled into an HDFS packet (typically 64 KB) in the format: {packet header | checksum bytes | data bytes}.
2. Data received from the socket is buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer copies: first, data is copied from kernel buffers into a temporary direct buffer in JDK code; second, data is copied from the temporary direct buffer into the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4 KB. In our production environment, this parameter is customized to 128 KB.
3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4), or 512 B), and the file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512 B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data is copied to the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer copy.
4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer into a 64 KB direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64 KB direct buffer. This step involves two buffer copies.
5. LineReader maintains an internal buffer to absorb data from downstream. From this buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer copies.
Write path: the client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
Read path: the client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
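A condensed, illustrative sketch of those client-side calls is shown below; the file path and the 128 KB io.file.buffer.size override are assumptions mirroring the notes above, not values mandated by the course:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 128 * 1024);    // default is 4 KB; a larger buffer means fewer syscalls

    FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem for an HDFS default filesystem
    Path path = new Path("/user/training/example.txt"); // hypothetical path

    // Write path: create() triggers an RPC to the namenode to add the namespace entry
    // (no blocks yet); the data then streams out to a pipeline of datanodes.
    FSDataOutputStream out = fs.create(path);
    out.writeUTF("hello hdfs");
    out.close();

    // Read path: open() triggers an RPC to the namenode for the locations of the first blocks;
    // the client then reads the bytes directly from the datanodes holding each block.
    FSDataInputStream in = fs.open(path);
    System.out.println(in.readUTF());
    in.close();

    fs.close();
  }
}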
Pig vs. SQL:
Pig is procedural and SQL is declarative.
While fields within a SQL record must be atomic (contain one single value), fields within a Pig tuple can be multi-valued, e.g. a collection of other Pig tuples, or a map whose key is an atomic value and whose value can be anything.
Unlike a SQL query, where the input data must be physically loaded into DB tables, Pig extracts the data from its original data sources directly during execution.
Pig is lazily executed: it uses a backtracking mechanism from its "store" statement to determine which statements need to be executed.